# Artist Prediction Model 🎶

This model attempts to predict a user's preferred artist based on their top 50 artists from Spotify. The information is limited, but hopefully something useful might come of it.

### Eric Boam's Analysis
Eric collected data on music from three sources: Man, Machine and Media. Media, meaning friends on social media, produced the highest percentage of top-ten albums (highest quality), while Machine produced a lower percentage but many more data points (highest quantity).

The Release Radar playlist is released daily for each user. 
These are the general trends that Eric found:
- Songs added at midnight
- Keep song on list for max 4 weeks
- Favour familiar artists
- Use remixes when music runs thin

[More on this](https://medium.com/@ericboam/i-decoded-the-spotify-recommendation-algorithm-heres-what-i-found-4b0f3654035b)

### BaRT
The AI system in charge of organising your Spotify homepage is called Bandits for Recommendations as Treatments (BaRT). It balances exploitation and exploration: the former recommends what the user already likes, and the latter shows them something new.
Important metrics:
1. 30-Sec rule: ignore songs that get skipped
2. Audio analysis: identify audio patterns to find similar songs
3. User age, location and gender: find what others like

[More on this](https://analyticsindiamag.com/how-spotifys-algorithm-manages-to-find-your-inner-groove/)
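
The explore/exploit balance can be illustrated with an epsilon-greedy bandit. This is a generic textbook strategy used here for illustration only, not Spotify's actual BaRT implementation. The function name `choose_artist`, the `epsilon` value and the reward numbers are all hypothetical.

```python
import random

def choose_artist(avg_reward, epsilon=0.1, rng=random):
    """Pick an artist: explore with probability epsilon, otherwise exploit.

    avg_reward maps artist name -> average observed reward, e.g. the share
    of plays that were not skipped. The numbers below are made up.
    """
    if rng.random() < epsilon:
        # Explore: pick any artist uniformly at random
        return rng.choice(sorted(avg_reward))
    # Exploit: pick the artist with the best reward so far
    return max(avg_reward, key=avg_reward.get)

rewards = {"Drake": 0.9, "Arctic Monkeys": 0.6, "Bill Withers": 0.4}
print(choose_artist(rewards, epsilon=0.0))  # epsilon=0 means always exploit
```

A larger `epsilon` shows the user more unfamiliar artists at the cost of fewer safe picks.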

#### 30-Sec Rule:

Listening past the 30-second mark generally indicates that someone likes a song, and listening to the end should add to this. Listening to a song repeatedly over a period further indicates appeal, and it guards against the "One Hit Wonder Issue", where trending songs could skew recommendations.
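
A minimal sketch of how the 30-second rule could be applied to play logs. The log fields (`track`, `ms_played`, `duration_ms`) and the 0/1/2 scoring scale are assumptions made for illustration, not a documented Spotify format.

```python
SKIP_THRESHOLD_MS = 30_000  # the 30-second rule

def score_play(ms_played, duration_ms):
    """Return a rough preference signal for a single play.

    2 = listened to the end, 0 = skipped before 30 seconds,
    1 = anything in between. The scale itself is an arbitrary choice.
    """
    if ms_played >= duration_ms:
        return 2
    if ms_played < SKIP_THRESHOLD_MS:
        return 0
    return 1

# Hypothetical play log entries
plays = [
    {"track": "A", "ms_played": 12_000, "duration_ms": 200_000},
    {"track": "B", "ms_played": 45_000, "duration_ms": 200_000},
    {"track": "C", "ms_played": 200_000, "duration_ms": 200_000},
]
scores = {p["track"]: score_play(p["ms_played"], p["duration_ms"]) for p in plays}
print(scores)
```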


#### Artist/Song Follow-Up

When a user listens to a song, do they check out the artist afterwards? Are the next songs they play from that artist or album?

These are called "high-quality actions" because they give a very good indication that the user likes the music.


### Recommendation Models
1. **Collaborative Filtering**

    In layperson's terms, this finds other users who like the same things as you and recommends what they listen to that you don't.


    
[More on this](https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe)
    

2. **Acoustic Analysis**
    
    Spotify provides high-level audio features that describe how a song sounds, such as "danceability" and "tempo".

    Sample JSON Response:
    ``` json
    {
      "duration_ms" : 255349,
      "key" : 5,
      "mode" : 0,
      "time_signature" : 4,
      "acousticness" : 0.514,
      "danceability" : 0.735,
      "energy" : 0.578,
      "instrumentalness" : 0.0902,
      "liveness" : 0.159,
      "loudness" : -11.840,
      "speechiness" : 0.0461,
      "valence" : 0.624,
      "tempo" : 98.002,
      "id" : "06AKEBrKUckW0KREUWRnvT",
      "uri" : "spotify:track:06AKEBrKUckW0KREUWRnvT",
      "track_href" : "https://api.spotify.com/v1/tracks/06AKEBrKUckW0KREUWRnvT",
      "analysis_url" : "https://api.spotify.com/v1/audio-analysis/06AKEBrKUckW0KREUWRnvT",
      "type" : "audio_features"
    }
    ```
    The above fields are a lot easier to work with compared to raw audio data which is difficult to analyse and expensive to process.
    
    
    
3. **Natural Language Processing**

    Spotify scrapes the web to find similar artists. *Hard to do, so excluded.*
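
The collaborative filtering idea from item 1 can be sketched in a few lines. This is a toy user-based version that uses Jaccard similarity over sets of liked artists. The similarity measure, the function names and the example users are assumptions for illustration, not Spotify's method.

```python
def jaccard(a, b):
    """Overlap between two sets of liked artists (0 = none, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, others, top_n=5):
    """Score artists the target user lacks by how similar their fans are."""
    scores = {}
    for liked in others.values():
        sim = jaccard(target, liked)
        if sim == 0:
            continue  # users with nothing in common carry no signal
        for artist in liked - target:
            scores[artist] = scores.get(artist, 0.0) + sim
    # Highest score first; ties broken alphabetically for stable output
    ranked = sorted(scores, key=lambda a: (-scores[a], a))
    return ranked[:top_n]

me = {"Drake", "Travis Scott", "The Weeknd"}
others = {  # hypothetical users
    "user_a": {"Drake", "Travis Scott", "21 Savage"},
    "user_b": {"Arctic Monkeys", "Bill Withers"},
    "user_c": {"Drake", "The Weeknd", "Travis Scott", "A$AP Rocky"},
}
print(recommend(me, others))
```

Here `user_c` shares the most artists with `me`, so their extra artist ranks first.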


### Their Data Pipeline
The image below is an estimate of the architecture used in Spotify to supply recommendations.

![Spotify Data Pipeline](../res/spotify_pipeline.png)

The "Batch Audio Models" and "Batch NLP Models" are a little out of my depth and require a lot of processing, so audio and text analysis of songs will be left out. The main focus will be:
- Play Logs
- Track metadata
- Batch CF Models (Collaborative Filtering)

&nbsp;
## Model Columns

***Given a user's top 50 artists, recommend 5 other artists***

1. Get info for each artist
2. Search for artist with similar information on MusicBrainz/Cache
3. Find artists not in Top 50

The predicted column (y) in this case is the name of the recommended artist. The columns used to predict y are the remaining columns (X).
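
The three steps above could look roughly like the sketch below. It only compares country, gender and age (genres are left out to keep it short). The 5-year age window, the scoring and the function names are hypothetical choices, and the rows are taken from the cached tables later in this notebook.

```python
def similarity(a, b):
    """Count matching attributes between two artists."""
    score = int(a["country"] == b["country"])
    score += int(a["gender"] == b["gender"])
    score += int(abs(a["age"] - b["age"]) <= 5)  # arbitrary 5-year window
    return score

def recommend_artists(top_artists, cache, n=5):
    """Rank cached artists that are not already in the user's top list."""
    top_names = {a["artist"] for a in top_artists}
    scores = {}
    for candidate in cache:
        if candidate["artist"] in top_names:
            continue  # step 3: skip artists the user already has
        scores[candidate["artist"]] = sum(similarity(candidate, t) for t in top_artists)
    return sorted(scores, key=lambda name: (-scores[name], name))[:n]

top = [
    {"artist": "Drake", "gender": "male", "age": 33, "country": "CA"},
    {"artist": "Travis Scott", "gender": "male", "age": 28, "country": "US"},
]
cache = top + [
    {"artist": "21 Savage", "gender": "male", "age": 27, "country": "US"},
    {"artist": "Arctic Monkeys", "gender": "mixed", "age": 18, "country": "GB"},
    {"artist": "Bill Withers", "gender": "male", "age": 81, "country": "US"},
]
print(recommend_artists(top, cache, n=2))
```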

**X Columns: Nationality, Age, Genres, Gender, Race (not done)**

- The Model would have to search through the cached JSON file for artists with similar attributes.
- A LabelEncoder would have to be implemented to deal with the non-numeric values
    - Nationality: 0-194
    - Age: 0-90
    - Genre: map most popular to 0-x
    - Gender: True/False
    - Race: map most popular to 0-x (not done)
    - Artist: map most popular to 0-10000 
- More data needs to be gathered for favourite artists
- More than the current top 500 artists would need to be collected
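
A minimal sketch of the "map most popular to 0-x" encoding, written in plain Python. Note that scikit-learn's `LabelEncoder` assigns codes in sorted label order rather than by popularity, so a frequency-based mapping like this one would need to be built separately. The function name and the sample values are hypothetical.

```python
from collections import Counter

def encode_by_popularity(values):
    """Map each distinct value to an integer, most frequent value -> 0."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically so ties are stable
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: code for code, value in enumerate(ordered)}

countries = ["CA", "US", "US", "CA", "US", "GB"]
mapping = encode_by_popularity(countries)
encoded = [mapping[c] for c in countries]
print(mapping)
print(encoded)
```

The same function could be reused for the genre and artist columns, since each only needs a stable value-to-integer mapping.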

In [24]:
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
from client_secret import *

## Spotify Connection & DataFrame

In [14]:
def spotify_connect(user_scope, redirect_uri, artist_limit, time_range):
    """Connect to the Spotify API and return the user's top artists and display name.

    Each time range overwrites the previous response, so only the response
    for the last entry in `time_range` is returned.
    """
    # Create security token (username, client_id and client_secret come from client_secret.py)
    security_token = util.prompt_for_user_token(username, user_scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
    if not security_token:
        raise RuntimeError(f"Could not get a Spotify token for {username}")

    spotify_client = spotipy.Spotify(auth=security_token)
    spotify_client.trace = False

    # The display name does not depend on the time range, so fetch it once
    user = spotify_client.current_user()
    name = user.get('display_name')

    # Loop through time ranges
    results = None
    for r in time_range:
        results = spotify_client.current_user_top_artists(time_range=r, limit=artist_limit)
    return results, name

scope = "user-top-read"
redirect_uri = "http://localhost:8080"
results, name = spotify_connect(scope, redirect_uri, 100, ['short_term', 'medium_term', 'long_term'])

In [22]:
def make_df(response):
    """Take the results of a Spotify API call and return a cleaned DataFrame"""
    items = pd.DataFrame(response['items'])
    # Drop unnecessary columns
    items = items.drop(['external_urls', 'href', 'id', 'images', 'uri'], axis=1)
    # Each followers entry is a dict; keep only its 'total' count.
    # Assigning the whole column avoids pandas' SettingWithCopyWarning.
    items['followers'] = items['followers'].apply(lambda f: f['total'])

    return items.sort_values(by='popularity', ascending=False)

artists = make_df(results)
artists.head()


|    | followers | genres | name | popularity | type |
|---|---|---|---|---|---|
| 6 | 47541476 | [canadian hip hop, canadian pop, hip hop, pop ... | Drake | 100 | artist |
| 37 | 10654922 | [rap] | Travis Scott | 98 | artist |
| 3 | 10491353 | [chicago rap, melodic rap] | Juice WRLD | 98 | artist |
| 1 | 21629577 | [canadian contemporary r&b, canadian pop, pop] | The Weeknd | 97 | artist |
| 15 | 28713845 | [electropop, pop] | Billie Eilish | 94 | artist |


### Load My Top 50 Cached

In [23]:
mytop_50 = pd.read_json('../data/mytop50_artists.json')
mytop_50.head()

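|    | artist | gender | age | type | country | city_1 | district_1 | city_2 | district_2 | city_3 | district_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Drake | male | 33 | person | CA |  |  | Toronto |  |  |  |
| 1 | Travis Scott | male | 28 | person | US |  |  | Houston |  |  |  |
| 2 | Juice WRLD | male | 21 | person | US |  |  | Chicago |  | Oak Lawn |  |
| 3 | The Weeknd | male | 30 | person | CA |  |  | Scarborough |  |  |  |
| 4 | Billie Eilish | female | 18 | person | US |  |  | Los Angeles |  |  |  |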


|    | artist | gender | age | type | country | city_1 | district_1 | city_2 | district_2 | city_3 | district_3 | followers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 21 Savage | male | 27 | person | US |  |  |  | Plaistow |  |  | 7109 |
| 30 | A Tribe Called Quest | mixed | 35 | group | US |  |  | St. Albans |  |  |  | 1713079 |
| 11 | A$AP Rocky | male | 31 | person | US |  |  |  | Harlem |  |  | 36023 |
| 12 | Arctic Monkeys | mixed | 18 | group | GB |  |  |  | High Green |  |  | 292546 |
| 25 | Bill Withers | male | 81 | person | US |  |  | Slab Fork |  | Los Angeles |  | 9348885 |
