# Artist Prediction Model 🎶

This model attempts to predict a user's preferred artist based on their top 50 artists from Spotify. The information is limited, but hopefully something useful might come of it.

### Eric Boam's Analysis
Eric collected data on music from three sources: Man, Machine and Media. Media, meaning friends on social media, produced the highest percentage of top-ten albums (highest quality), while Machine produced a lower percentage but many more data points (highest quantity).

The Release Radar playlist is released daily for each user. 
These are the general trends that Eric found:
- Songs added at midnight
- Keep song on list for max 4 weeks
- Favour familiar artists
- Use remixes when music runs thin

[More on this](https://medium.com/@ericboam/i-decoded-the-spotify-recommendation-algorithm-heres-what-i-found-4b0f3654035b)

### BaRT
The AI system in charge of organising your Spotify homepage is called Bandits for Recommendations as Treatments (BaRT). It balances exploitation and exploration: the former recommends what the user already likes, and the latter shows them something new.
Important metrics:
1. 30-second rule: ignore songs that get skipped within the first 30 seconds
2. Audio analysis: identify audio patterns to find more songs like it
3. User age, location and gender: find what similar users like

[More on this](https://analyticsindiamag.com/how-spotifys-algorithm-manages-to-find-your-inner-groove/)
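
The exploit/explore balance can be illustrated with an epsilon-greedy choice. This is only a minimal sketch of the general idea, not BaRT itself: the source does not describe BaRT's actual algorithm, and the function name, reward values and `epsilon` parameter below are all assumptions made for illustration.

```python
import random

def choose_artist(avg_reward, epsilon, rng):
    """Pick an artist: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Explore: pick any artist, regardless of past performance
        return rng.choice(sorted(avg_reward))
    # Exploit: pick the artist with the best average reward so far
    return max(avg_reward, key=avg_reward.get)

# Hypothetical average rewards (e.g. share of plays that passed 30 seconds)
avg_reward = {"Drake": 0.9, "Travis Scott": 0.7, "Billie Eilish": 0.4}
rng = random.Random(0)

always_exploit = choose_artist(avg_reward, epsilon=0.0, rng=rng)
always_explore = choose_artist(avg_reward, epsilon=1.0, rng=rng)
```

With `epsilon=0.0` the function always returns the best-known artist; raising `epsilon` trades some of that certainty for the chance to surface something new.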

#### 30-Sec Rule:

Listening past the 30-second mark generally indicates that someone likes a song, and listening to the end should strengthen that signal. Repeatedly listening to a song over a period of time further indicates appeal, and it guards against the "One Hit Wonder Issue", where trending songs skew recommendations.
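
A rough sketch of how this rule might be applied to a play log. The log format (`track`, `ms_played`, `date`) and the `min_days` threshold are hypothetical and chosen only for illustration.

```python
MIN_PLAY_MS = 30_000  # 30 seconds, the threshold described above

def counted_plays(play_log):
    """Keep only plays that lasted at least 30 seconds."""
    return [p for p in play_log if p["ms_played"] >= MIN_PLAY_MS]

def repeat_listens(play_log, min_days=2):
    """Tracks with counted plays on at least `min_days` distinct days."""
    days = {}
    for play in counted_plays(play_log):
        days.setdefault(play["track"], set()).add(play["date"])
    return {track for track, dates in days.items() if len(dates) >= min_days}

# Hypothetical play log
play_log = [
    {"track": "A", "ms_played": 200_000, "date": "2020-05-01"},
    {"track": "A", "ms_played": 180_000, "date": "2020-05-08"},
    {"track": "B", "ms_played": 12_000, "date": "2020-05-01"},
    {"track": "C", "ms_played": 45_000, "date": "2020-05-02"},
]
```

Here track `B` is ignored as a skip, `C` counts once, and only `A` shows the sustained listening that separates a favourite from a passing trend.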


#### Artist/Song Follow-Up

When a user listens to a song, do they check out the artist afterwards? Are the next songs they play from the same artist or album?

These are called "high-quality actions" because they give a very good indication that the user likes the music.
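
One way such follow-ups could be detected is by looking at consecutive plays. The sketch below assumes a chronologically ordered list of plays with hypothetical `artist` and `track` fields; it only covers the "next song from the same artist" case.

```python
from collections import Counter

def follow_ups_by_artist(plays):
    """Count plays that are immediately followed by a different track by the same artist."""
    counts = Counter()
    for current, following in zip(plays, plays[1:]):
        same_artist = current["artist"] == following["artist"]
        different_track = current["track"] != following["track"]
        if same_artist and different_track:
            counts[following["artist"]] += 1
    return counts

# Hypothetical listening session, oldest play first
plays = [
    {"artist": "Drake", "track": "track_1"},
    {"artist": "Drake", "track": "track_2"},
    {"artist": "The Weeknd", "track": "track_3"},
    {"artist": "Drake", "track": "track_1"},
]
follow_ups = follow_ups_by_artist(plays)
```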


### Recommendation Models
1. **Collaborative Filtering**

    Layperson:
        This finds other users who like the same music as you and recommends what they listen to that you don't.


    
[More on this](https://medium.com/s/story/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe)
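
The layperson description can be sketched as a small neighbour-based recommender. This is a simplified illustration, not Spotify's implementation: the Jaccard similarity, the function names and the example users are all assumptions.

```python
def jaccard(a, b):
    """Overlap between two sets of liked artists, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, others, top_n=5):
    """Score artists the target lacks by the similarity of the users who like them."""
    scores = {}
    for liked in others.values():
        similarity = jaccard(target, liked)
        for artist in liked - target:
            scores[artist] = scores.get(artist, 0.0) + similarity
    # Highest score first; ties broken alphabetically
    ranked = sorted(scores, key=lambda artist: (-scores[artist], artist))
    return ranked[:top_n]

# Hypothetical users and their liked artists
me = {"Drake", "Travis Scott", "The Weeknd"}
others = {
    "user_a": {"Drake", "Travis Scott", "Juice WRLD"},
    "user_b": {"Billie Eilish", "The Weeknd"},
    "user_c": {"Billie Eilish"},
}
recommendations = recommend(me, others)
```

`user_a` shares two artists with `me`, so their remaining artist ranks above the one liked by the less similar `user_b`.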
    

2. **Acoustic Analysis**
    
    Spotify gives the high-level audio characteristics that describe how the song sounds. Some include "danceability" and "tempo".

    Sample JSON Response:
    ``` json
    {
      "duration_ms" : 255349,
      "key" : 5,
      "mode" : 0,
      "time_signature" : 4,
      "acousticness" : 0.514,
      "danceability" : 0.735,
      "energy" : 0.578,
      "instrumentalness" : 0.0902,
      "liveness" : 0.159,
      "loudness" : -11.840,
      "speechiness" : 0.0461,
      "valence" : 0.624,
      "tempo" : 98.002,
      "id" : "06AKEBrKUckW0KREUWRnvT",
      "uri" : "spotify:track:06AKEBrKUckW0KREUWRnvT",
      "track_href" : "https://api.spotify.com/v1/tracks/06AKEBrKUckW0KREUWRnvT",
      "analysis_url" : "https://api.spotify.com/v1/audio-analysis/06AKEBrKUckW0KREUWRnvT",
      "type" : "audio_features"
    }
    ```
    The above fields are a lot easier to work with than raw audio data, which is difficult to analyse and expensive to process.
    
    
    
3. **Natural Language Processing**

    Spotify scrapes the web to find similar artists. *Hard to do, so this is excluded.*


### Their Data Pipeline
The image below is an estimate of the architecture used in Spotify to supply recommendations.

![Spotify Data Pipeline](../res/spotify_pipeline.png)

The "Batch Audio Models" and "Batch NLP Models" are a little out of my depth and require a lot of processing, so audio/text analysis of songs will be left out. The main focus will be:
- Play Logs
- Track metadata
- Batch CF Models (Collaborative Filtering)

&nbsp;
## Model Columns

***Given a user's top 50 artists, recommend 5 other artists***

1. Get info for each artist
2. Search for artist with similar information on MusicBrainz/Cache
3. Find artists not in Top 50
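
The three steps above could look roughly like the sketch below. It assumes each artist is a dict with `artist`, `country` and `genres` keys and that the cache is a plain list of such dicts; the scoring rule (counting shared countries and genres) and the placeholder artists are illustrative assumptions, not the final model.

```python
from collections import Counter

def recommend_similar(top_artists, catalogue, n=5):
    """Rank catalogue artists by attributes shared with the user's top artists."""
    top_names = {a["artist"] for a in top_artists}
    country_counts = Counter(a["country"] for a in top_artists)
    genre_counts = Counter(g for a in top_artists for g in a["genres"])

    def score(candidate):
        # More points for countries and genres that are common in the top list
        genre_score = sum(genre_counts[g] for g in candidate["genres"])
        return country_counts[candidate["country"]] + genre_score

    # Step 3: only consider artists that are not already in the top list
    candidates = [a for a in catalogue if a["artist"] not in top_names]
    candidates.sort(key=lambda a: (-score(a), a["artist"]))
    return [a["artist"] for a in candidates[:n]]

# Hypothetical data; "Artist X/Y/Z" are placeholders, not real cache entries
top = [
    {"artist": "Drake", "country": "CA", "genres": ["hip hop", "rap"]},
    {"artist": "The Weeknd", "country": "CA", "genres": ["r&b"]},
    {"artist": "Travis Scott", "country": "US", "genres": ["hip hop", "rap"]},
]
catalogue = top + [
    {"artist": "Artist X", "country": "CA", "genres": ["rap"]},
    {"artist": "Artist Y", "country": "GB", "genres": ["rock"]},
    {"artist": "Artist Z", "country": "US", "genres": ["hip hop"]},
]
suggested = recommend_similar(top, catalogue, n=2)
```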

The predicted column (y) in this case is the name of the recommended artist. The columns used to predict y are the remaining columns (X).

**X Columns: Nationality, Age, Genres, Gender, Race (not done)**

- The Model would have to search through the cached JSON file for artists with similar attributes.
- A LabelEncoder would have to be implemented to deal with the non-numeric values
    - Nationality: 0-194
    - Age: 0-90
    - Genre: map most popular to 0-x
    - Gender: True/False
    - Race: map most popular to 0-x (not done)
    - Artist: map most popular to 0-10000 
    - Type: True/False
- More data needs to be gathered for favourite artists
- More than the current top 500 artists would need to be collected
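
Note that scikit-learn's `LabelEncoder` assigns codes in sorted order of the labels, not by popularity. For the "map most popular to 0-x" columns, a frequency-ranked encoding might be closer to what is described. The following is a minimal sketch using only the standard library; the function name and sample genres are assumptions.

```python
from collections import Counter

def frequency_encode(values):
    """Map each distinct value to an integer, with 0 for the most frequent."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: code for code, value in enumerate(ordered)}

# Hypothetical genre column
genres = ["rap", "pop", "rap", "r&b", "pop", "rap"]
mapping = frequency_encode(genres)
encoded = [mapping[g] for g in genres]
```

The same mapping has to be reused when encoding new artists, otherwise identical genres would receive different codes.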

```python
import pandas as pd
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
# from client_secret import *
```

## Load My Top 50

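|   | artist | gender | age | type | country | city_1 | district_1 | city_2 | district_2 | city_3 | district_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Drake | male | 33 | person | CA | | | Toronto | | | |
| 1 | Travis Scott | male | 28 | person | US | | | Houston | | | |
| 2 | Juice WRLD | male | 21 | person | US | | | Chicago | | Oak Lawn | |
| 3 | The Weeknd | male | 30 | person | CA | | | Scarborough | | | |
| 4 | Billie Eilish | female | 18 | person | US | | | Los Angeles | | | |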


### Split Training & Validation

```python
from sklearn.model_selection import train_test_split

# Load my top 50
mytop_50 = pd.read_json('../data/mytop_50.json', dtype={
    'artist': 'object',
    'gender': 'object',
    'age': 'int64',
    'type': 'object',
    'country': 'object',
    'city_1': 'object',
    'district_1': 'object',
    'city_2': 'object',
    'district_2': 'object',
    'city_3': 'object',
    'district_3': 'object'
})

# Predicted value
y = mytop_50.index


# Find rows where the country is 'n/a'
missing_rows = mytop_50.loc[mytop_50.country == 'n/a']
# Keep the remaining rows (subtracting two string columns would not work)
rows = mytop_50.drop(missing_rows.index)
rows

# Drop columns
# mytop_50.drop(['city_1', 'district_1', 'city_2', 'district_2', 'city_3', 'district_3', 'gender', 'type'], axis=1, inplace=True)


# X_train, X_valid, y_train, y_valid = train_test_split(mytop_50, y,
#                                                       train_size=0.8, test_size=0.2,
#                                                       random_state=0)
# X_train.head()
```

When this cell was run, it raised:

```
ValueError: Expected object or value
```

pandas typically raises this error when it cannot parse the input as JSON, for example when the file is empty or malformed (or, in older pandas versions, when the path does not exist).

### Define Mean Absolute Error (MAE) Function
This finds the average absolute difference between the predicted values and the actual values. A lower score means better predictions.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```
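
To make the metric concrete, MAE is the mean of |actual - predicted| over all rows. The hand-rolled version below is only meant to show the arithmetic on made-up numbers; the function name is hypothetical and `sklearn.metrics.mean_absolute_error` remains the one used in `score_dataset`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = sum(abs(actual - predicted) for actual, predicted in zip(y_true, y_pred))
    return total / len(y_true)

# Made-up values: the absolute errors are 1, 0, 2 and 1
y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 8]
mae = mean_absolute_error_manual(y_true, y_pred)
```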

### Label Encode Countries
Label encode the countries so the model can make sense of them.

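```python
countries = pd.read_csv('countries.csv')
countries
```

|   | AD | 42.546245 | 1.601554 | Andorra |
|---|---|---|---|---|
| 0 | AE | 23.424076 | 53.847818 | United Arab Emirates |
| 1 | AF | 33.939110 | 67.709953 | Afghanistan |
| 2 | AG | 17.060816 | -61.796428 | Antigua and Barbuda |
| 3 | AI | 18.220554 | -63.068615 | Anguilla |
| 4 | AL | 41.153332 | 20.168331 | Albania |
| ... | ... | ... | ... | ... |
| 239 | YE | 15.552727 | 48.516388 | Yemen |
| 240 | YT | -12.827500 | 45.166244 | Mayotte |
| 241 | ZA | -30.559482 | 22.937506 | South Africa |
| 242 | ZM | -13.133897 | 27.849332 | Zambia |

The column headers here are the first record (Andorra), which suggests `countries.csv` has no header row. Passing `header=None` to `pd.read_csv` would keep that record as data.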
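
As a rough sketch of the encoding step, the country codes can be mapped to integers in file order. The example below uses only the standard library and a three-row sample copied from the output above. It assumes the file has no header row and that the columns are code, latitude, longitude and name (an inference from the values, not something the source states); `build_country_codes` is a hypothetical helper.

```python
import csv
import io

# A few rows in the same layout as the output above
SAMPLE = """AD,42.546245,1.601554,Andorra
AE,23.424076,53.847818,United Arab Emirates
AF,33.939110,67.709953,Afghanistan
"""

def build_country_codes(csv_file):
    """Map each two-letter country code to an integer, in file order."""
    reader = csv.reader(csv_file)
    # No header row is skipped, so the first record (AD) gets code 0
    return {row[0]: index for index, row in enumerate(reader) if row}

country_to_int = build_country_codes(io.StringIO(SAMPLE))
encoded_countries = [country_to_int[code] for code in ["AF", "AD"]]
```

With pandas, the equivalent load would be `pd.read_csv('countries.csv', header=None, names=[...])`, with column names of your choosing.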
