##  Spotify Song Recommendation System

We implement content-based filtering for Spotify song recommendation, following a [Medium article](https://towardsdatascience.com/part-iii-building-a-song-recommendation-system-with-spotify-cf76b52705e7) from a series on building a Spotify song recommendation system.

In [2]:
import csv
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from operator import index
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

In [3]:
# playlist_data = pd.read_csv("data/spotify.csv")

spotify_playlists = pd.read_csv('data/spotify_playlists.csv', encoding_errors='ignore', index_col=0, header=0)
spotify_playlists['playlist'].value_counts()

New Music Friday      100
New Pop Picks         100
just hits             100
Hip Hop Controller     99
RapCaviar              51
Today's Top Hits       50
Hot Hits USA           50
Name: playlist, dtype: int64

First, we check for duplicate songs across the playlists. The following code uses the `drop_duplicates` method in **pandas** to drop duplicate songs, leaving a dataframe in which every song appears exactly once.

In [None]:
# Duplicates of songs across playlists
playlistDF = spotify_playlists.copy(deep = True)
playlistDF[['artist','name','playlist']].head(3)

In [None]:
# Drop song duplicates
def drop_duplicates(df):
    df['artists_song']=df.apply(lambda row: row['artist']+' - '+row['name'],axis=1)
    return df.drop_duplicates('artists_song')

songDF = drop_duplicates(playlistDF)
print(len(pd.unique(songDF.artists_song)) == len(songDF))

For the audio features, we can categorize each attribute into four general categories as follows.

- **Mood**: Danceability, Energy, Tempo, Valence
- **Properties**: Instrumentalness, Loudness, Speechiness
- **Context**: Acousticness, Liveness
- **Metadata**: key, mode, time_signature


In [None]:
songDF = songDF[[
    'name', 'track_id', 'release_date', 'popularity', # Track Metadata
    'artist', 'artist_id', 'artist_pop', 'artist_genres', # Artist Info
    'danceability', 'energy', 'valence', 'tempo', # Audio Features - Mood
    'instrumentalness', 'loudness', 'speechiness', # Audio Features - Properties
    'acousticness', 'liveness', # Audio Features - Context
    'key', 'mode', 'time_signature' # Audio Features - Metadata
]]

### Feature Generation


Feature engineering is an integral part of recommender systems. Our feature generation pipeline consists of the following steps.

#### 1. Sentiment Analysis

The following code performs a simple sentiment analysis using the subjectivity and polarity scores from the TextBlob package. Subjectivity, on a scale from 0 to 1, measures how much of the text is personal opinion (1) rather than factual information (0). Polarity, on a scale from -1 to 1, measures how negative (-1) or positive (1) the text is.

In [None]:
# Get subjectivity & polarity using textblob
def getSubjectivity(text):
    return TextBlob(text).sentiment.subjectivity
def getPolarity(text):
    return TextBlob(text).sentiment.polarity

# Categorize polarity & subjectivity score
def getAnalysis(score, task="polarity"):
    if task == "subjectivity":
        if score < 1/3:
            return "low"
        elif score > 2/3:
            return "high"
        else:
            return "medium"
    else:
        if score < 0:
            return 'Negative'
        elif score == 0:
            return 'Neutral'
        else:
            return 'Positive'

# Perform sentiment analysis on text
def sentiment_analysis(df, text_col):
    df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
    df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
    return df

In [None]:
sentimentDF = sentiment_analysis(songDF, "name")
sentimentDF[['name', 'artist', 'subjectivity', 'polarity']].head(3)

#### 2. One-Hot Encoding

We now use one-hot encoding to include the sentiment of a song as input. One-hot encoding converts a categorical variable into a numeric form that a model can work with: each category becomes its own column, and each row holds True (1) in the column of its category and False (0) everywhere else.


![](https://iq.opengenus.org/content/images/2022/01/TW5m0aJ.png)
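To make the idea concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper `one_hot` and the sample labels are hypothetical and only illustrate the transformation; the notebook itself relies on `pd.get_dummies`.

```python
def one_hot(values, prefix):
    """Return one dict per input value, with a 0/1 entry for every category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    return [
        {f"{prefix}|{cat}": int(value == cat) for cat in categories}
        for value in values
    ]

# Hypothetical subjectivity labels for three songs
rows = one_hot(["low", "high", "low"], prefix="subject")
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column that matches its original label.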

In [None]:
# Create One Hot Encoded features of a specific column
def ohe_prep(df, column, new_name):
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    
    return tf_df # One-hot encoded features 

In [None]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe_prep(sentimentDF, 'subjectivity','subject')
subject_ohe.iloc[0]

#### 3. TF-IDF



Spotify's genres are imbalanced, with some far more prevalent than others. Therefore, we weight each genre by its importance so that common genres are not overemphasized and rare ones are not underestimated.

The Term Frequency-Inverse Document Frequency (TF-IDF) quantifies words in a set of documents, showing the importance of a word in the corpus: $ \text{Term Frequency}\times\text{Inverse Document Frequency}$.


The term frequency (TF) is the number of times a term appears in a document divided by the document's total word count, and the inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents that contain the term.
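As a worked illustration, the sketch below computes the textbook form of TF-IDF on three made-up genre lists. The function `tf_idf` and the sample data are hypothetical. Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / document frequency)."""
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical genre lists for three artists
genres = [["pop", "dance"], ["pop", "rap"], ["pop", "rap", "trap"]]
scores = tf_idf(genres)
for doc_scores in scores:
    print(doc_scores)
```

Because "pop" occurs in every list, its IDF is ln(3/3) = 0 and it contributes nothing, while the rarer "dance" and "trap" receive the largest weights.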


In [None]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(songDF['artist_genres'].apply(lambda x: " ".join(x)))

# Genres dataframe
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]

#### 4. Normalization


We need to normalize the popularity variables and audio features to the range 0 to 1. We use the `MinMaxScaler` class from scikit-learn, which rescales each column so that its minimum maps to 0 and its maximum maps to 1.
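The rescaling follows $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. Below is a minimal sketch of that formula in plain Python; `min_max_scale` and the sample popularity values are hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale (assumption of this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical artist popularity scores
artist_pop = [40, 55, 70, 100]
scaled = min_max_scale(artist_pop)
print(scaled)
```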


In [None]:
# artist_pop distribution descriptive stats
print(songDF['artist_pop'].describe())

Next, we scale the artist popularity column. The scaler learns the minimum and maximum of each column and rescales the values accordingly; the audio features of a song are scaled the same way when we build the full feature set.


In [None]:
# Normalization
pop = songDF[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

#### Feature Generation

Finally, we define a function that generates all of the features above and concatenates them into a single data frame, the final feature set used to generate recommendations. Each feature group is multiplied by a weight that controls its influence on the similarity score.


In [None]:
def create_feature_set(df, float_cols):
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['artist_genres'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "name")

    # One-hot encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization - scale popularity columns
    pop = df[["artist_pop","popularity"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio feature columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    final.insert(loc=0, column='track_id', value=df['track_id'].values) # Add track ID
    
    return final # Final set of features 

In [None]:
# Save data and generate features
float_cols = songDF.dtypes[songDF.dtypes == 'float64'].index.values
complete_feature_set = create_feature_set(songDF, float_cols=float_cols)

# songDF.to_csv("../data/allsong_data.csv", index = False)
#complete_feature_set.to_csv("../data/complete_feature.csv", index = False)
complete_feature_set.head(3)

### Content-based Filtering Recommendation


The next step is to perform content-based filtering based on the song features. To do so, we combine all songs in a playlist into one summary vector. Then, we compute the similarity between this playlist vector and every song in the database that is not already in the playlist. Finally, we rank those songs by similarity and recommend the most similar ones.
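The procedure can be sketched on toy data as follows. The three-dimensional feature vectors, the song names, and the helper `cosine` are hypothetical; the notebook uses scikit-learn's `cosine_similarity` on the full feature set instead.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 3-feature vectors (e.g. energy, valence, acousticness)
playlist_songs = [[0.9, 0.8, 0.1], [0.7, 0.6, 0.3]]
candidates = {"song_a": [0.9, 0.6, 0.2], "song_b": [0.1, 0.2, 0.9]}

# Summarize the playlist by summing its song vectors feature by feature
playlist_vector = [sum(col) for col in zip(*playlist_songs)]

# Rank the non-playlist songs by similarity to the playlist vector
ranked = sorted(
    candidates,
    key=lambda name: cosine(candidates[name], playlist_vector),
    reverse=True,
)
print(ranked)
```

Cosine similarity depends only on the direction of the vectors, so summing the playlist's songs gives the same ranking as averaging them would.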



#### Choose Playlist


In this part, we test the recommender on a single playlist from the dataset, selected by its name (`but my feet in bottega` in the cell below).


In [None]:
testDF = playlistDF[playlistDF['playlist'] == "but my feet in bottega"]

#### Extract features

The next step is to generate the features. We first use the `track_id` to separate the songs that are in the playlist from those that are not. Then, we simply add the features of all songs in the playlist together to form a summary vector.




In [None]:
# Summarize playlist into a single vector
def generate_playlist_feature(feat_set, playlist_df):
    
    # Find song features in the playlist
    feat_set_playlist = feat_set[feat_set['track_id'].isin(playlist_df['track_id'].values)]    
    
    # Find all non-playlist song features
    feat_set_nonplaylist = feat_set[~feat_set['track_id'].isin(playlist_df['track_id'].values)]
    feat_set_playlist_final = feat_set_playlist.drop(columns = "track_id")
    
    # Single vector feature summarizing playlist
    return feat_set_playlist_final.sum(axis = 0), feat_set_nonplaylist

> In other words, this vector describes the whole playlist as if it is one song.


In [None]:
# Generate the features
feat_set_pl, feat_set_nonpl = generate_playlist_feature(complete_feature_set, testDF)
# Non-playlist features
feat_set_nonpl.head()

In [None]:
# Summarized playlist features
complete_feature_set
feat_set_pl


#### Find similarity

In our code, we use the `cosine_similarity()` function from scikit-learn to measure the similarity between each song and the summarized playlist vector.


In [None]:
# Generate recommendations based on the songs in a specific playlist
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    df (pandas dataframe): all songs, including a track_id column
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
    '''
    
    # Copy so that adding the 'sim' column does not modify a slice of df
    non_playlist_df = df[df['track_id'].isin(nonplaylist_features['track_id'].values)].copy()
    # Find cosine similarity between the playlist and every non-playlist song
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('track_id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim',ascending = False).head(40)
    
    # Top 40 recommendations for that playlist
    return non_playlist_df_top_40

--------------------------------------------------

In [None]:
def create_df_playlist(api_results, sp = None, append_audio = True):
    
    # DataFrame with track_name, track_id, artist, album, duration, popularity
    df = create_df_saved_songs(api_results["tracks"])
    
    # Whether to append audio features
    if append_audio:
        assert sp is not None, "sp needs to be specified for appending audio features"
        df = append_audio_features(df, sp)
        
    return df

In [None]:

    
def append_audio_features(df,spotify_auth, return_feat_df = False):
    """ 
    Fetches the audio features for all songs in a DataFrame and
    appends these as columns to the DataFrame.
    Requires spotipy to be set up with an auth token.
    Parameters
    ----------
    df : DataFrame containing at least track_name and track_id for Spotify songs
    spotify_auth: authenticated spotipy client (result of authenticate())
    return_feat_df: whether to also return a DataFrame with just the audio features
    
    Returns
    -------
    df: DataFrame containing all original rows and audio features for each song
    df_features(optional): DataFrame containing just the audio features
    """
    audio_features = spotify_auth.audio_features(df["track_id"][:])
    assert len(audio_features) == len(df["track_id"][:])
    # Use every key except the last seven as feature columns
    feature_cols = list(audio_features[0].keys())[:-7]
    features_list = []
    for features in audio_features:
        try:
            song_features = [features[col] for col in feature_cols]
        except TypeError:
            # No features returned for this track: add an empty row so rows stay aligned with df
            song_features = [None] * len(feature_cols)
        features_list.append(song_features)
    df_features = pd.DataFrame(features_list,columns = feature_cols)
    df = pd.concat([df,df_features],axis = 1)
    if not return_feat_df:
        return df
    else:
        return df,df_features

In [None]:
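# Get playlist data from API
# Assumes client-credentials auth; reads SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET from the environment
sp = spotipy.Spotify(auth_manager = SpotifyClientCredentials())
playlist_uri = "INSERT YOUR SPOTIFY PLAYLIST URI"
playlist = sp.playlist(playlist_uri)
playlist_df = create_df_playlist(playlist, sp = sp)
# Get seed tracks for recommendations
seed_tracks = playlist_df["track_id"].tolist()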

# Create recommendation df from multiple requests, using 5 seed tracks per request
recomm_dfs = []
for i in range(5,len(seed_tracks)+1,5):
    recomms = sp.recommendations(seed_tracks = seed_tracks[i-5:i],limit = 25)
    recomms_df = append_audio_features(create_df_recommendations(recomms),sp)
    recomm_dfs.append(recomms_df)
recomms_df = pd.concat(recomm_dfs)
recomms_df.reset_index(drop = True, inplace = True)