# CS109A Project Group 21
# Baseline Model

*****

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from collections import Counter

### Importing the relevant datasets

First, we import two subsets of 100 playlists each: one to be used as the training set and the other as the test set.

These playlists are randomly chosen from the full one million playlist dataset that has been matched with the Spotify API data, thus including data on track-specific features such as loudness, speechiness and tempo.

In [None]:
train_df = pd.read_csv("data/subset100playlists.csv")
test_df = pd.read_csv("data/subset100playlists_test.csv")

In [None]:
train_df.head()

### Creating the train and test set

We then create the train and test sets by keeping select numeric features together with the track URI, which serves as a unique ID for each song.

For the test set specifically, we split the songs into two subsets, stratified by playlist ID: calibration (20%) and withheld (80%). The calibration songs are used as inputs to the model to predict songs, and the predicted songs are then compared against the withheld songs to score the model.

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_cleaned= train_df.select_dtypes(include=numerics)

var_drop = ["index","pid","pos", "count", "Unnamed: 0"]
df_cleaned = df_cleaned.drop(var_drop, axis =1)

train = pd.concat([df_cleaned, train_df['track_uri']],axis=1)

In [None]:
test_df_cleaned = test_df.select_dtypes(include=numerics)

test_var_drop = ["index","pos", "count", "Unnamed: 0"]
test_df_cleaned = test_df_cleaned.drop(test_var_drop, axis =1)

test_df_cleaned = pd.concat([test_df_cleaned, test_df['track_uri']],axis=1)
calibration, withheld = train_test_split(test_df_cleaned, test_size=0.8, random_state=209, stratify = test_df_cleaned['pid'])
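
To make the effect of `stratify` concrete, here is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not part of the project data). It suggests that stratifying on `pid` keeps the 20/80 proportion within each playlist, rather than only across the pooled test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the test set: two playlists with ten tracks each.
toy = pd.DataFrame({
    "pid": [1] * 10 + [2] * 10,
    "track_uri": [f"uri_{i}" for i in range(20)],
})

# Same call shape as in the notebook: 20% calibration, 80% withheld.
toy_calibration, toy_withheld = train_test_split(
    toy, test_size=0.8, random_state=209, stratify=toy["pid"]
)

# Tracks per playlist on each side of the split.
calibration_counts = toy_calibration["pid"].value_counts().to_dict()
withheld_counts = toy_withheld["pid"].value_counts().to_dict()
print(calibration_counts, withheld_counts)
```

Without `stratify`, a small playlist could end up with no calibration songs at all, in which case the model would have nothing to query with for that playlist.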

In [None]:
train.describe()

### Dropping duplicates
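In the baseline model, given that we are only looking at predicting songs based on their similarity to other tracks, we do not need repeated copies of the same song in the training set. We remove them below. Note that `keep=False` drops every occurrence of a song that appears more than once, so only songs that occur exactly once in the training playlists remain; `keep='first'` would instead retain one copy of each song.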

In [None]:
# keep=False removes ALL rows of any track_uri that occurs more than once
train.drop_duplicates(subset="track_uri", keep=False, inplace=True)
train.head()
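
As a quick illustration of how the `keep` argument changes the result, the sketch below runs `drop_duplicates` on a made-up four-row frame (the `toy` data is hypothetical).

```python
import pandas as pd

# Toy frame in which song "a" appears twice.
toy = pd.DataFrame({
    "track_uri": ["a", "b", "a", "c"],
    "tempo": [120.0, 90.0, 120.0, 100.0],
})

# keep=False: every row of a repeated track_uri is removed.
only_unique = toy.drop_duplicates(subset="track_uri", keep=False)

# keep="first": one row per track_uri survives.
one_per_song = toy.drop_duplicates(subset="track_uri", keep="first")

print(list(only_unique["track_uri"]))   # ['b', 'c']
print(list(one_per_song["track_uri"]))  # ['a', 'b', 'c']
```

With `keep=False`, songs that appear in several training playlists are excluded from the model entirely, which is worth keeping in mind when reading the training set size.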

In [None]:
train = train.set_index('track_uri') 

In [None]:
train.shape

### Fitting the training set to a Nearest Neighbors model

This leaves us with 3098 songs in our training set, which we then fit to an unsupervised Nearest Neighbors model with n_neighbors = 20 and cosine distance as the distance measure.

Here, we have set the neighbor search algorithm to `auto`, so that scikit-learn chooses among BallTree, KDTree, and brute-force search based on the training data and the metric. Because scikit-learn's tree-based algorithms do not support the cosine metric, `auto` resolves to brute-force search in this setup.

In [None]:
# Fit an unsupervised nearest-neighbors model on the song feature matrix
model_knn = NearestNeighbors(metric='cosine', algorithm='auto', n_neighbors=20, n_jobs=-1)
model_knn.fit(train)
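
To show what a brute-force cosine search computes, here is a small NumPy sketch. The helper `cosine_distances_to` and the toy vectors are assumptions for illustration only; it is not the scikit-learn implementation.

```python
import numpy as np

def cosine_distances_to(query, X):
    """Cosine distance (1 - cosine similarity) from `query` to each row of X."""
    X = np.asarray(X, dtype=float)
    query = np.asarray(query, dtype=float)
    sims = X @ query / (np.linalg.norm(X, axis=1) * np.linalg.norm(query))
    return 1.0 - sims

# Three toy "songs" with two features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [5.0, 0.1]

dists = cosine_distances_to(query, X)
nearest = np.argsort(dists)[:2]  # row positions of the two closest songs
print(nearest)
```

Cosine distance depends only on the direction of a feature vector, not its length, so rescaling the query (for example to `[50.0, 1.0]`) leaves the distances unchanged.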

### Creating a function to make recommendations

We then create a function that takes as inputs 1) the Nearest Neighbors model fitted on the training data, 2) a playlist ID (to identify the songs in the test set), 3) the number of neighbors to retrieve for each calibration song and 4) the number of recommended songs to return.

This function will return a list of recommended songs.

In [None]:
def make_recommendation(model_knn, playlist_id, n_neighbors, n_recommendations):
    # Relies on the global `calibration` and `train` DataFrames defined above
    calibration_songs = calibration[calibration['pid'] == playlist_id]
    calibration_songs_clean = calibration_songs.drop(columns=['pid', 'track_uri'])

    # Count how often each training song appears among the neighbors
    song_freq = Counter()

    for _, song in calibration_songs_clean.iterrows():
        _, indices = model_knn.kneighbors(song.values.reshape(1, -1), n_neighbors=n_neighbors)
        song_freq.update(indices[0])

    # Row positions of the n most frequently returned training songs
    top_songs = [i for i, _ in song_freq.most_common(n_recommendations)]

    # Map row positions back to track URIs
    rec_songs = train.iloc[top_songs].index

    return rec_songs
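
The voting step inside the function can be sketched in isolation. The neighbor lists below are made up for illustration; in the notebook they come from `model_knn.kneighbors`.

```python
from collections import Counter

# Hypothetical neighbor indices returned for three calibration songs.
neighbor_lists = [
    [4, 7, 9],
    [7, 2, 4],
    [7, 5, 1],
]

# Each appearance of a training song among the neighbors counts as one vote.
votes = Counter()
for neighbors in neighbor_lists:
    votes.update(neighbors)

# Keep the two most-voted training songs.
top = [idx for idx, _ in votes.most_common(2)]
print(top)  # [7, 4]
```

Songs that are close to many calibration songs rise to the top, while distance values themselves are ignored; weighting votes by distance would be one possible variation.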

### Measuring success

To start, we will use the R-precision score to measure how successful our song predictions are. Here, we define a function `r_precision`, which computes the number of retrieved relevant tracks divided by the number of known relevant tracks (i.e., the number of withheld tracks). It takes as inputs a list of predicted songs and compares it against the actual songs from the given playlist ("withheld songs").

In [None]:
#Function to calculate r_precision (https://recsys-challenge.spotify.com/rules)
#R-precision is the number of retrieved relevant tracks divided by the number of known relevant tracks (i.e., the number of withheld tracks) 

def r_precision(preds, known):
    pred_set = set(preds)
    for uri in known:
        if uri in pred_set:
            # Look up the track name in test_df, since the known songs come from the test set
            song = test_df.loc[test_df['track_uri'] == uri, 'track_name'].iloc[0]
            print(f'{song} appeared in both our predicted playlist and the known list of songs.')
    score = np.sum(known.isin(preds)) / known.shape[0]
    return score
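
As a worked example of the metric, the sketch below computes R-precision on plain Python lists. The name `r_precision_simple` and the toy URIs are hypothetical; unlike the notebook function, it treats the known songs as a set of unique URIs.

```python
def r_precision_simple(preds, known):
    """Fraction of the known (withheld) tracks that appear among the predictions."""
    known = set(known)
    if not known:
        return 0.0
    return len(known & set(preds)) / len(known)

preds = ["a", "b", "c", "d"]        # predicted track URIs
known = ["b", "d", "x", "y", "z"]   # withheld track URIs

score = r_precision_simple(preds, known)
print(score)  # 0.4
```

Two of the five withheld tracks were retrieved, so the score is 2 / 5 = 0.4.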

### Running the baseline model on test data

We then run the following code to loop through all the unique playlists in the test set to predict songs based on a random subset of 20% of songs from the given test playlist (`calibration`). The predicted songs are then fed into the R-precision score function together with the remaining 80% of songs from the given test playlist (`withheld`).

For testing purposes, we print out the actual track names of the songs being predicted and the withheld songs, so that we can get a qualitative understanding of the songs.

In [None]:
n_neighbors, n_recommendations = 50, 100
r_precision_scores = []

#Loop through all unique playlists in test set to identify predicted songs
for index, pid in enumerate(withheld['pid'].drop_duplicates()): 
    print (index, pid)
    pred_songs = make_recommendation(model_knn, pid, n_neighbors, n_recommendations)
    validation_set = withheld[withheld.pid == pid].track_uri
    # Predicted songs come from the training set, so look up their names in train_df
    print("Predicted songs\n", train_df[train_df['track_uri'].isin(pred_songs)]['track_name'])
    print("Known songs\n", test_df[test_df['track_uri'].isin(validation_set)]['track_name'])
    
    rps = r_precision(list(pred_songs), validation_set)
    r_precision_scores.append(rps)
    
    print(f'Playlist {pid}: The R precision score is {rps}')

### Measuring success

To measure success for the baseline model, we compute the mean of the R-precision scores across all the playlists in the test set. This returns a value of 0.005, which means that on average we recover only about 0.5% of each test playlist's withheld songs. This is low, but as a baseline it gives us a sense of the challenges that lie ahead in our predictions.

In [None]:
avg_rp = np.mean(r_precision_scores)

print('Avg. R-Precision Score: ', avg_rp)

### Challenges and next steps

One significant challenge we have identified is that many of the withheld songs in the test set do not appear in our training set. The model can only recommend songs it was fitted on, so it cannot return a withheld song that is absent from the training set.

To validate this hypothesis, we ran a quick check on the intersection of songs between both sets. As expected, only 1552 of the 11074 withheld songs (about 14%) appear in the training set.

Going forward, this will likely mean expanding the training set to increase the overlap in songs between the two sets.

In [None]:
def intersection(lst1, lst2):
    # Convert to a set so each membership test is O(1) rather than a list scan
    lookup = set(lst2)
    return [value for value in lst1 if value in lookup]
  
print (len(intersection(list(withheld.track_uri), list(train.index))))
print (len(list(withheld.track_uri)))
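
Because the model can only recommend songs from the training set, the share of a playlist's withheld songs that appear in the training set caps the R-precision that playlist can reach. The sketch below computes that cap on toy data; `r_precision_ceiling` and the URIs are hypothetical and not part of the notebook.

```python
def r_precision_ceiling(withheld_uris, train_uris):
    """Upper bound on R-precision: share of withheld tracks present in training."""
    if not withheld_uris:
        return 0.0
    train_set = set(train_uris)
    hits = sum(1 for uri in withheld_uris if uri in train_set)
    return hits / len(withheld_uris)

withheld_uris = ["a", "b", "c", "d"]  # withheld tracks of one toy playlist
train_uris = ["b", "x", "y"]          # tracks available to the model

ceiling = r_precision_ceiling(withheld_uris, train_uris)
print(ceiling)  # 0.25
```

Applied per playlist, a check like this would separate the loss caused by missing songs from the loss caused by the recommendations themselves.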