# Content-based Filtering Spotify Song Recommendation System

**Disclaimer: this notebook is based on the notebook that can be found [here](https://github.com/enjuichang/PracticalDataScience-ENCA/tree/main).**

This notebook describes a content-based filtering approach to Spotify song recommendation.


## Structure

- Package Setup
- Preprocessing
- Feature Generation
- Content-based Filtering Recommendation

## Setup

**Downloaded Package**
- TextBlob

**Imported Packages**

- Pandas
- Scikit-learn
- re
- Spotipy

## Credits

This notebook builds on top of Madhav Thaker's [spotify-recommendation-system tutorial](https://github.com/madhavthaker/spotify-recommendation-system).





### Package Setup
#### Download Dependencies

In [31]:
!pip install textblob






#### Import Dependencies

In [32]:
# import library
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
import re

#### Data Import
The data here is not raw data but the output of [extract_features.py](https://github.com/enjuichang/PracticalDataScience-ENCA/blob/main/notebooks/Extract%20Features%20Script.ipynb). The script uses the Spotify API library [spotipy](https://spotipy.readthedocs.io/en/2.22.1/?highlight=sp%20audio_features#spotipy.client.Spotify.audio_features) to extract [audio features](https://developer.spotify.com/documentation/web-api/reference/get-audio-features) such as loudness, danceability, and acousticness. Spotify created these features by applying a trained Convolutional Neural Network to the [spectrogram](https://en.wikipedia.org/wiki/Spectrogram) of a song, which is a 2D image representation of the song. We therefore don't need to compute them ourselves and can simply retrieve them through Spotify's API. In this case, I am using a subset of the data provided [here](https://github.com/enjuichang/PracticalDataScience-ENCA/blob/main/data/processed_data.csv).

In [33]:
# import processed data
playlist_df = pd.read_csv("data/sorted_processed_data_train.csv")
print(playlist_df.columns)
playlist_df.drop(columns=["Unnamed: 0", 'Unnamed: 0.1'], inplace = True)  # remove unnecessary columns
playlist_df.head()

Index(['Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'pos', 'artist_name',
       'track_uri', 'artist_uri', 'track_name', 'album_uri', 'duration_ms_x',
       'album_name', 'name', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'type', 'id', 'uri', 'track_href', 'analysis_url',
       'duration_ms_y', 'time_signature', 'artist_pop', 'genres', 'track_pop'],
      dtype='object')


Unnamed: 0,Unnamed: 0.2,pos,artist_name,track_uri,artist_uri,track_name,album_uri,duration_ms_x,album_name,name,...,type,id,uri,track_href,analysis_url,duration_ms_y,time_signature,artist_pop,genres,track_pop
0,10062,9,Andy Grammer,4nMlau89VAjmV7agkl7OY3,spotify:artist:2oX42qP5ineK3hrhBECLmj,Fresh Eyes,spotify:album:5YpK59N59zCX7Hkc9aBiGv,198001,Fresh Eyes,CHiLl,...,audio_features,4nMlau89VAjmV7agkl7OY3,spotify:track:4nMlau89VAjmV7agkl7OY3,https://api.spotify.com/v1/tracks/4nMlau89VAjm...,https://api.spotify.com/v1/audio-analysis/4nMl...,198001,4,74,dance_pop modern_rock neo_mellow pop pop_rap p...,0
1,11756,8,gnash,7vRriwrloYVaoAe3a9wJHe,spotify:artist:3iri9nBFs9e4wN7PLIetAw,"i hate u, i love u (feat. olivia o'brien)",spotify:album:3L0H4RjVXpEkwfDgi3XOdf,251033,us,CHiLl,...,audio_features,7vRriwrloYVaoAe3a9wJHe,spotify:track:7vRriwrloYVaoAe3a9wJHe,https://api.spotify.com/v1/tracks/7vRriwrloYVa...,https://api.spotify.com/v1/audio-analysis/7vRr...,251034,4,70,electropop pop pop_rap,81
2,16776,20,Khalid,152lZdxL1OR0ZMW6KquMif,spotify:artist:6LuN9FCkKOj5PcnpouEgny,Location,spotify:album:6kf46HbnYCZzP6rjvQHYzg,219080,American Teen,CHiLl,...,audio_features,152lZdxL1OR0ZMW6KquMif,spotify:track:152lZdxL1OR0ZMW6KquMif,https://api.spotify.com/v1/tracks/152lZdxL1OR0...,https://api.spotify.com/v1/audio-analysis/152l...,219080,4,89,pop pop_r&b,81
3,2510,19,Chance The Rapper,3eze1OsZ1rqeXkKStNfTmi,spotify:artist:1anyVhU62p31KFi8MEzkbf,Juke Jam (feat. Justin Bieber & Towkio),spotify:album:71QyofYesSsRMwFOTafnhB,219683,Coloring Book,CHiLl,...,audio_features,3eze1OsZ1rqeXkKStNfTmi,spotify:track:3eze1OsZ1rqeXkKStNfTmi,https://api.spotify.com/v1/tracks/3eze1OsZ1rqe...,https://api.spotify.com/v1/audio-analysis/3eze...,219683,1,80,chicago_rap conscious_hip_hop hip_hop pop_rap rap,68
4,10054,2,Maroon 5,2QbFClFyhMMtiurUjuQlAe,spotify:artist:04gDigrS5kc9YWfZHwBETP,Don't Wanna Know (feat. Kendrick Lamar),spotify:album:1Jmq5HEJeA9kNi2SgQul4U,214265,Red Pill Blues,CHiLl,...,audio_features,2QbFClFyhMMtiurUjuQlAe,spotify:track:2QbFClFyhMMtiurUjuQlAe,https://api.spotify.com/v1/tracks/2QbFClFyhMMt...,https://api.spotify.com/v1/audio-analysis/2QbF...,214265,4,88,pop pop_rock,0


Each column adds information to a track:

- Unnamed: 0: the row index over all tracks in the database. The column has this name because its header was empty in the .csv file, so `pandas` assigned a default name when reading it.
- pos: the index of the track in the playlist it belongs to
- artist_name: the name of the artist of the track
- track_uri: unique identifier of the track ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- artist_uri: unique identifier of the artist ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- track_name: name of the track
- album_uri: unique identifier of the album ([more](https://developer.spotify.com/documentation/web-api/concepts/spotify-uris-ids#:~:text=the%20following%20parameters%3A-,Spotify%20URI,-The%20resource%20identifier))
- duration_ms: the duration of the track in milliseconds
- album_name: the name of the album the track was published with
- name: the name of the playlist the track belongs to

### Preprocessing

The following cells perform further preprocessing to tailor the imported data to content-based filtering.

Here is the general pipeline:
1. Useful data Selection
2. List concatenation

#### Useful Data Selection

Because the same song can appear in several playlists, the data contains duplicate songs. Therefore, I combined the song and the artist and used the `drop_duplicates()` function in `pandas` to remove duplicate songs when building the base dataframe with all unique songs.

In [34]:
# Show that there are duplicates of songs across playlists
playlist_df[['artist_name','track_name','name']]

Unnamed: 0,artist_name,track_name,name
0,Andy Grammer,Fresh Eyes,CHiLl
1,gnash,"i hate u, i love u (feat. olivia o'brien)",CHiLl
2,Khalid,Location,CHiLl
3,Chance The Rapper,Juke Jam (feat. Justin Bieber & Towkio),CHiLl
4,Maroon 5,Don't Wanna Know (feat. Kendrick Lamar),CHiLl
...,...,...,...
64569,Rae Sremmurd,Black Beatles,vibin'
64570,Big Sean,Dance (A$$) Remix,vibin'
64571,Kendrick Lamar,m.A.A.d city,vibin'
64572,Travis Porter,Ayy Ladies,vibin'


Now, I drop the duplicates with `pandas` by combining the artist name and track name. Using both fields avoids dropping songs that share a title but come from different artists.

In [35]:
# Drop song duplicates
def drop_duplicates(df):
    '''
    Drop duplicate songs
    '''
    df['artists_song'] = df.apply(lambda row: row['artist_name']+row['track_name'],axis = 1)
    return df.drop_duplicates('artists_song')

song_df = drop_duplicates(playlist_df)
print("Are all songs unique: ", len(pd.unique(song_df.artists_song)) == len(song_df))

Are all songs unique:  True
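
One caveat with a plain string-concatenation key is that two different (artist, track) pairs can, in principle, produce the same string. The sketch below uses made-up rows to illustrate this and shows that a tuple key keeps them apart; in `pandas`, `drop_duplicates(subset=['artist_name', 'track_name'])` deduplicates on the column pair directly. The sample rows and the helper `unique_by` are hypothetical and not part of the notebook's pipeline.

```python
# Hypothetical rows, chosen so that artist + track concatenates to the same string
rows = [
    {"artist_name": "AB", "track_name": "C"},
    {"artist_name": "A", "track_name": "BC"},
    {"artist_name": "AB", "track_name": "C"},  # a true duplicate of the first row
]

def unique_by(rows, key):
    """Keep the first row for each distinct key, preserving the input order."""
    seen = set()
    kept = []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

# Concatenated key: "AB" + "C" and "A" + "BC" both give "ABC"
by_concat = unique_by(rows, lambda r: r["artist_name"] + r["track_name"])
# Tuple key: ("AB", "C") and ("A", "BC") stay distinct
by_tuple = unique_by(rows, lambda r: (r["artist_name"], r["track_name"]))

print(len(by_concat), len(by_tuple))
```

With real artist and track names such collisions are presumably rare, so this is a robustness note rather than a correction of the result above.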


Finally, I select the features I would use later on. The following is a short list of them in categories:
1. Metadata
    - id
    - genres
    - artist_pop (popularity of artist: read more about it [here](https://developer.spotify.com/documentation/web-api/reference/get-an-artist#:~:text=of%20the%20artist.-,popularity,-integer))
    - track_pop (popularity of track: read more about it [here](https://developer.spotify.com/documentation/web-api/reference/get-track#:~:text=of%20the%20track.-,popularity,-integer))
2. Audio (read more about the audio features [here](https://developer.spotify.com/documentation/web-api/reference/get-audio-features))
    - **Mood**: Danceability, Valence, Energy, Tempo
    - **Properties**: Loudness, Speechiness, Instrumentalness
    - **Context**: Liveness, Acousticness
    - **Metadata**: key, mode
3. Text
    - track_name

In [36]:
# select useful columns
def select_cols(df):
       '''
       Select useful columns
       '''
       return df[['artist_name','id','track_name','danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', "artist_pop", "genres", "track_pop"]]
song_df = select_cols(song_df)
song_df.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop
0,Andy Grammer,4nMlau89VAjmV7agkl7OY3,Fresh Eyes,0.822,0.544,0,-6.797,1,0.0332,0.562,2.4e-05,0.144,0.861,122.047,74,dance_pop modern_rock neo_mellow pop pop_rap p...,0
1,gnash,7vRriwrloYVaoAe3a9wJHe,"i hate u, i love u (feat. olivia o'brien)",0.492,0.275,6,-13.4,0,0.3,0.687,0.0,0.101,0.18,92.6,70,electropop pop pop_rap,81
2,Khalid,152lZdxL1OR0ZMW6KquMif,Location,0.736,0.449,1,-11.462,0,0.425,0.33,0.000162,0.0898,0.326,80.126,89,pop pop_r&b,81
3,Chance The Rapper,3eze1OsZ1rqeXkKStNfTmi,Juke Jam (feat. Justin Bieber & Towkio),0.505,0.397,5,-9.349,1,0.324,0.716,0.0,0.0853,0.558,95.063,80,chicago_rap conscious_hip_hop hip_hop pop_rap rap,68
4,Maroon 5,2QbFClFyhMMtiurUjuQlAe,Don't Wanna Know (feat. Kendrick Lamar),0.775,0.617,7,-6.166,1,0.0701,0.341,0.0,0.0985,0.485,100.048,88,pop pop_rock,0


#### List Concatenation

After selecting the useful data, we need to convert the `genres` column back into a list, because the CSV import stores it as a single space-separated string. This is done with the `split()` function:

In [37]:
def genre_preprocess(df):
    '''
    Preprocess the genre data
    '''
    df['genres_list'] = df['genres'].apply(lambda x: x.split(" "))
    return df
song_df = genre_preprocess(song_df)
song_df['genres_list'].head()

0    [dance_pop, modern_rock, neo_mellow, pop, pop_...
1                           [electropop, pop, pop_rap]
2                                       [pop, pop_r&b]
3    [chicago_rap, conscious_hip_hop, hip_hop, pop_...
4                                      [pop, pop_rock]
Name: genres_list, dtype: object

Lastly, I created a pipeline for preprocessing any new playlist as below:

In [38]:
def playlist_preprocess(df):
    '''
    Preprocess imported playlist
    '''
    df = drop_duplicates(df)
    df = select_cols(df)
    df = genre_preprocess(df)

    return df

### Feature Generation
Now that the data is usable, we can engineer the features for the recommendation system. In this project, the following steps are combined into a feature generation pipeline:

1. Sentiment Analysis
2. One-hot Encoding
3. TF-IDF
4. Normalization

#### Sentiment Analysis

On our data, we will perform a simple sentiment analysis using the subjectivity and polarity scores from the `TextBlob` package.
- **Subjectivity** [0, 1]: The amount of personal opinion, as opposed to factual information, contained in the text.
- **Polarity** [-1, 1]: The degree of negative or positive sentiment, accounting for negation.

We will then use one-hot encoding to include the sentiment of the song titles as one of the inputs.

In [39]:
def getSubjectivity(text):
  '''
  Getting the Subjectivity using TextBlob
  '''
  return TextBlob(text).sentiment.subjectivity

def getPolarity(text):
  '''
  Getting the Polarity using TextBlob
  '''
  return TextBlob(text).sentiment.polarity

def getAnalysis(score, task="polarity"):
  '''
  Categorizing the Polarity & Subjectivity score
  '''
  if task == "subjectivity":
    if score < 1/3:
      return "low"
    elif score > 1/3:
      return "high"
    else:
      return "medium"
  else:
    if score < 0:
      return 'Negative'
    elif score == 0:
      return 'Neutral'
    else:
      return 'Positive'

def sentiment_analysis(df, text_col):
  '''
  Perform sentiment analysis on text
  ---
  Input:
  df (pandas dataframe): Dataframe of interest
  text_col (str): column of interest
  '''
  df['subjectivity'] = df[text_col].apply(getSubjectivity).apply(lambda x: getAnalysis(x,"subjectivity"))
  df['polarity'] = df[text_col].apply(getPolarity).apply(getAnalysis)
  return df
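
Note that `getAnalysis` above uses a single cut-off at 1/3 for subjectivity, so `"medium"` is only returned for a score of exactly 1/3. If three equal-width buckets are wanted instead, one possible variant uses cut-offs at 1/3 and 2/3. This is an assumption about the intent, not what the cell above does, and the function name `bucket_subjectivity` is hypothetical.

```python
def bucket_subjectivity(score):
    """Map a subjectivity score in [0, 1] to one of three equal-width buckets."""
    if score < 1 / 3:
        return "low"
    elif score > 2 / 3:
        return "high"
    return "medium"

# A few illustrative scores, one per bucket
print([bucket_subjectivity(s) for s in (0.0, 0.5, 0.9)])
```

Switching to this variant would change the `subjectivity` labels, and therefore the one-hot features derived from them, so the outputs shown in this notebook correspond to the original `getAnalysis`.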

In [40]:
# Show result
sentiment = sentiment_analysis(song_df, "track_name")
sentiment.head()

Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity
0,Andy Grammer,4nMlau89VAjmV7agkl7OY3,Fresh Eyes,0.822,0.544,0,-6.797,1,0.0332,0.562,2.4e-05,0.144,0.861,122.047,74,dance_pop modern_rock neo_mellow pop pop_rap p...,0,"[dance_pop, modern_rock, neo_mellow, pop, pop_...",high,Positive
1,gnash,7vRriwrloYVaoAe3a9wJHe,"i hate u, i love u (feat. olivia o'brien)",0.492,0.275,6,-13.4,0,0.3,0.687,0.0,0.101,0.18,92.6,70,electropop pop pop_rap,81,"[electropop, pop, pop_rap]",high,Negative
2,Khalid,152lZdxL1OR0ZMW6KquMif,Location,0.736,0.449,1,-11.462,0,0.425,0.33,0.000162,0.0898,0.326,80.126,89,pop pop_r&b,81,"[pop, pop_r&b]",low,Neutral
3,Chance The Rapper,3eze1OsZ1rqeXkKStNfTmi,Juke Jam (feat. Justin Bieber & Towkio),0.505,0.397,5,-9.349,1,0.324,0.716,0.0,0.0853,0.558,95.063,80,chicago_rap conscious_hip_hop hip_hop pop_rap rap,68,"[chicago_rap, conscious_hip_hop, hip_hop, pop_...",low,Neutral
4,Maroon 5,2QbFClFyhMMtiurUjuQlAe,Don't Wanna Know (feat. Kendrick Lamar),0.775,0.617,7,-6.166,1,0.0701,0.341,0.0,0.0985,0.485,100.048,88,pop pop_rock,0,"[pop, pop_rock]",low,Neutral


#### One-hot encoding

One-hot encoding is a method to transform categorical variables into a machine-readable format. This is done by converting each category into its own column, so that each row is marked True or False for every category.


![ohe_img](https://cdn-images-1.medium.com/max/1600/0*KVGWy9c3eo2RiAe3.png) 
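
To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding that mimics the `prefix|category` column naming used in this notebook. The notebook itself relies on `pd.get_dummies`; the helper `one_hot` and the toy labels are hypothetical and only meant as an illustration.

```python
def one_hot(values, prefix):
    """Return one dict per value, with a True/False entry for every category."""
    categories = sorted(set(values))
    return [
        {f"{prefix}|{cat}": (value == cat) for cat in categories}
        for value in values
    ]

# Toy labels in the same style as the subjectivity column
encoded = one_hot(["high", "low", "high", "medium"], prefix="subject")
print(encoded[0])
```

Each row has exactly one True entry, which is what makes the encoding "one-hot".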

In [41]:
def ohe_prep(df, column, new_name): 
    ''' 
    Create One Hot Encoded features of a specific column
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    column (str): Column to be processed
    new_name (str): new column name to be used
        
    Output: 
    tf_df: One-hot encoded features 
    '''
    
    tf_df = pd.get_dummies(df[column])
    feature_names = tf_df.columns
    tf_df.columns = [new_name + "|" + str(i) for i in feature_names]
    tf_df.reset_index(drop = True, inplace = True)    
    return tf_df

In [42]:
# One-hot encoding for the subjectivity 
subject_ohe = ohe_prep(sentiment, 'subjectivity','subject')
subject_ohe.iloc[0]

subject|high       True
subject|low       False
subject|medium    False
Name: 0, dtype: bool

#### TF-IDF
TF-IDF, short for Term Frequency-Inverse Document Frequency, is a tool to quantify words in a set of documents. The goal of TF-IDF is to show the importance of a word both within a document and across the corpus. The general formula for calculating TF-IDF is:
$$ \text{Term Frequency}\times\text{Inverse Document Frequency}$$
- **Term Frequency (TF)**: The number of times a term appears in a document divided by the total word count of that document.
- **Inverse Document Frequency (IDF)**: The logarithm of the total number of documents $N$ divided by the document frequency, i.e. $\log(N/\text{DF})$. The document frequency (DF) is the number of documents in which the term appears.

The motivation is to find words that are important not only within a single document but also relative to the entire corpus. The logarithm dampens the effect of a large $N$, which would otherwise make the IDF very large compared to the TF. TF captures the importance of a word within a document, while IDF captures how rare, and therefore how informative, the word is across documents.

In this project, the documents are analogous to songs and the terms to genres. We therefore weight each genre by how prominent it is within a song and how prevalent it is across songs. This is preferable to plain one-hot encoding, which has no weights to reflect how important and widespread each genre is and therefore treats every genre as equally informative.

![tfidf_img](https://miro.medium.com/max/1400/1*V9ac4hLVyms79jl65Ym_Bw.jpeg)
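
As a worked example of the textbook formula, the sketch below computes TF-IDF weights by hand for three made-up "songs" whose terms are genres. It is only an illustration: `TfidfVectorizer` with default settings uses a smoothed IDF, $\ln\frac{1+N}{1+\text{DF}}+1$, and L2-normalizes each row, so its numbers differ from the ones computed here. The toy genre lists and the helper `tf_idf` are hypothetical.

```python
import math

# Three toy "documents" (songs), each a list of genre terms
docs = [
    ["pop", "dance_pop"],
    ["pop", "rock"],
    ["pop", "rock", "indie_rock"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tf_idf(doc):
    """Textbook TF-IDF: (count / document length) * log(N / DF)."""
    weights = {}
    for term in set(doc):
        tf = doc.count(term) / len(doc)
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = tf * idf
    return weights

weights = [tf_idf(doc) for doc in docs]
print(weights[0])
```

Here `pop` appears in every song, so its IDF is $\log(3/3)=0$ and it carries no weight, while the rarer `dance_pop` gets the largest weight in the first song.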

In [43]:
# TF-IDF implementation
tfidf = TfidfVectorizer()
tfidf_matrix =  tfidf.fit_transform(song_df['genres_list'].apply(lambda x: " ".join(x)))
genre_df = pd.DataFrame(tfidf_matrix.toarray())
genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
genre_df.drop(columns='genre|unknown')  # note: the result is not assigned, so genre_df itself is unchanged
genre_df.reset_index(drop = True, inplace=True)
genre_df.iloc[0]

genre|21st_century_classical    0.0
genre|432hz                     0.0
genre|_hip_hop                  0.0
genre|_roll                     0.0
genre|a_cappella                0.0
                               ... 
genre|york_indie                0.0
genre|zambian_hip_hop           0.0
genre|zhongguo_feng             0.0
genre|zolo                      0.0
genre|zouk                      0.0
Name: 0, Length: 2134, dtype: float64

#### Normalization
Lastly, we need to normalize some variables. As shown below, the popularity variables are not scaled to the range 0 to 1, which would be problematic in the cosine similarity function later on. The audio features are not normalized either.

To solve this problem, we use the `MinMaxScaler()` class from `scikit-learn`, which rescales each column linearly so that its minimum maps to 0 and its maximum to 1.

In [44]:
# artist_pop distribution descriptive stats
print(song_df['artist_pop'].describe())

count    33430.000000
mean        61.859348
std         19.099182
min          0.000000
25%         51.000000
50%         65.000000
75%         75.000000
max        100.000000
Name: artist_pop, dtype: float64


In [45]:
# Normalization
pop = song_df[["artist_pop"]].reset_index(drop = True)
scaler = MinMaxScaler()
pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns)
pop_scaled.head()

Unnamed: 0,artist_pop
0,0.74
1,0.7
2,0.89
3,0.8
4,0.88
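
Min-max scaling maps each value $x$ to $(x - \min)/(\max - \min)$. Since `artist_pop` ranges from 0 to 100 in the statistics above, this amounts to dividing by 100, which is consistent with 74 becoming 0.74 in the output. A minimal pure-Python sketch follows; the helper `min_max_scale` is hypothetical, and the toy values add the observed minimum and maximum to three values from the table.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0.0 here to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Observed min (0) and max (100) plus three artist_pop values from the table
artist_pop = [0, 74, 70, 89, 100]
scaled = min_max_scale(artist_pop)
print(scaled)
```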


#### Feature Generation
Finally, we generate all features mentioned above using the following cell and concatenate all variables into a new dataframe.

In [46]:
def create_feature_set(df, float_cols):
    '''
    Process spotify df to create a final set of features that will be used to generate recommendations
    ---
    Input: 
    df (pandas dataframe): Spotify Dataframe
    float_cols (list(str)): List of float columns that will be scaled
            
    Output: 
    final (pandas dataframe): Final set of features 
    '''
    
    # Tfidf genre lists
    tfidf = TfidfVectorizer()
    tfidf_matrix =  tfidf.fit_transform(df['genres_list'].apply(lambda x: " ".join(x)))
    genre_df = pd.DataFrame(tfidf_matrix.toarray())
    genre_df.columns = ['genre' + "|" + i for i in tfidf.get_feature_names_out()]
    genre_df.drop(columns='genre|unknown') # intended to drop the unknown genre; the result is not assigned, so genre_df is unchanged
    genre_df.reset_index(drop = True, inplace=True)
    
    # Sentiment analysis
    df = sentiment_analysis(df, "track_name")

    # One-hot Encoding
    subject_ohe = ohe_prep(df, 'subjectivity','subject') * 0.3
    polar_ohe = ohe_prep(df, 'polarity','polar') * 0.5
    key_ohe = ohe_prep(df, 'key','key') * 0.5
    mode_ohe = ohe_prep(df, 'mode','mode') * 0.5

    # Normalization
    # Scale popularity columns
    pop = df[["artist_pop","track_pop"]].reset_index(drop = True)
    scaler = MinMaxScaler()
    pop_scaled = pd.DataFrame(scaler.fit_transform(pop), columns = pop.columns) * 0.2 

    # Scale audio columns
    floats = df[float_cols].reset_index(drop = True)
    scaler = MinMaxScaler()
    floats_scaled = pd.DataFrame(scaler.fit_transform(floats), columns = floats.columns) * 0.2

    # Concatenate all features
    final = pd.concat([genre_df, floats_scaled, pop_scaled, subject_ohe, polar_ohe, key_ohe, mode_ohe], axis = 1)
    
    # Add song id
    final['id']=df['id'].values
    
    return final

In [47]:
# Save the data and generate the features
float_cols = song_df.dtypes[song_df.dtypes == 'float64'].index.values
song_df.to_csv("data/allsong_data_train.csv", index = False) 

# Generate features
complete_feature_set = create_feature_set(song_df, float_cols=float_cols)  # one hot encoded, normalized, ...
complete_feature_set.to_csv("data/complete_feature_train.csv", index = False)
complete_feature_set.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,4nMlau89VAjmV7agkl7OY3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,7vRriwrloYVaoAe3a9wJHe
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,152lZdxL1OR0ZMW6KquMif
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,3eze1OsZ1rqeXkKStNfTmi
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,2QbFClFyhMMtiurUjuQlAe


### Content-based Filtering Recommendation
The next step is to perform content-based filtering based on the song features we have. To do so, we combine all songs in a playlist into one summary vector. Then, we compute the similarity between this playlist vector and every song in the database that is not in the playlist. Finally, we use the similarity scores to retrieve the most relevant songs outside the playlist and recommend them.

There are three steps in this section:
1. **Choose playlist**: In this part, we retrieve a playlist
2. **Extract features**: In this part, we retrieve playlist-of-interest features and non-playlist-of-interest features.
3. **Find similarity**: In this part, we compare the summarized playlist features with all other songs.

#### Choose Playlist
In this part, we test the data with *Mom's playlist* in the dataset.


In [48]:
### This is the test data
# playlistDF_test = pd.read_csv("../data/test_playlist.csv")
# playlistDF_test = playlist_preprocess(playlistDF_test)
# playlistDF_test.head()

# Test playlist:  Mom's playlist
playlistDF_test = playlist_df[playlist_df['name'] == "Mom's playlist"]
playlistDF_test.head()
playlistDF_test.to_csv("data/test_playlist.csv")

#### Extract features
The next step is to generate all the features. We first use the `id` to separate songs that are in the playlist from those that are not. Then, we simply add the features of all songs in the playlist together into a summary vector, as in the figure below, which is a modified version of the work by [Madhav Thaker](https://github.com/madhavthaker/spotify-recommendation-system/blob/main/spotify-recommendation-engine.ipynb).

![pipeline_img](flow.png)
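
The split-and-sum step can be sketched without `pandas` as follows. The feature table, song ids, and playlist membership are made up for illustration; the actual implementation in the next cell works on the full feature dataframe.

```python
# Hypothetical feature table: song id -> feature vector
features = {
    "a": [0.2, 0.0, 0.5],
    "b": [0.1, 0.3, 0.0],
    "c": [0.0, 0.4, 0.5],
}
playlist_ids = {"a", "b"}

# Split the songs by playlist membership using their ids
in_playlist = [vec for song_id, vec in features.items() if song_id in playlist_ids]
candidates = {song_id: vec for song_id, vec in features.items() if song_id not in playlist_ids}

# Element-wise sum of the playlist songs gives the summary vector
playlist_vector = [sum(column) for column in zip(*in_playlist)]
print(playlist_vector, list(candidates))
```

Because cosine similarity only depends on the direction of a vector, summing or averaging the playlist songs leads to the same ranking of candidates.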


In [49]:
def generate_playlist_feature(complete_feature_set, playlist_df):
    '''
    Summarize a user's playlist into a single vector
    ---
    Input: 
    complete_feature_set (pandas dataframe): Dataframe which includes all of the features for the spotify songs
    playlist_df (pandas dataframe): playlist dataframe
        
    Output: 
    complete_feature_set_playlist_final (pandas series): single vector feature that summarizes the playlist
    complete_feature_set_nonplaylist (pandas dataframe): features of all songs that are not in the playlist
    '''
    
    # Find song features in the playlist
    complete_feature_set_playlist = complete_feature_set[complete_feature_set['id'].isin(playlist_df['id'].values)]
    # Find all non-playlist song features
    complete_feature_set_nonplaylist = complete_feature_set[~complete_feature_set['id'].isin(playlist_df['id'].values)]
    complete_feature_set_playlist_final = complete_feature_set_playlist.drop(columns = "id")
    return complete_feature_set_playlist_final.sum(axis = 0), complete_feature_set_nonplaylist

In [50]:
# Generate the features
complete_feature_set_playlist_vector, complete_feature_set_nonplaylist = generate_playlist_feature(complete_feature_set, playlistDF_test)

In [51]:
# Non-playlist features
complete_feature_set_nonplaylist.head()

Unnamed: 0,genre|21st_century_classical,genre|432hz,genre|_hip_hop,genre|_roll,genre|a_cappella,genre|abstract_beats,genre|abstract_hip_hop,genre|accordion,genre|acid_jazz,genre|acid_rock,...,key|5,key|6,key|7,key|8,key|9,key|10,key|11,mode|0,mode|1,id
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,4nMlau89VAjmV7agkl7OY3
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,7vRriwrloYVaoAe3a9wJHe
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,152lZdxL1OR0ZMW6KquMif
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,3eze1OsZ1rqeXkKStNfTmi
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5,2QbFClFyhMMtiurUjuQlAe


In [52]:
# Summarized playlist features
complete_feature_set_playlist_vector

genre|21st_century_classical     0.0
genre|432hz                      0.0
genre|_hip_hop                   0.0
genre|_roll                      0.0
genre|a_cappella                 0.0
                                ... 
key|9                            3.5
key|10                           2.5
key|11                           3.0
mode|0                           9.0
mode|1                          30.0
Length: 2165, dtype: float64

#### Find similarity
The last piece of the puzzle is to find the similarities between the summarized playlist vector and all other songs. There are many similarity measures, but one of the most common is **cosine similarity**.

Cosine similarity measures how similar two vectors are by the cosine of the angle between them. If we imagine our song vectors as only two-dimensional, the visual representation would look similar to the figure below.

The mathematical formula can be expressed as:
$$\text{Cosine Sim}(A,B)=\frac{A\cdot B}{||A||\times||B||}=\frac{\sum_{i=1}^n A_i\times B_i}{\sqrt{\sum_{i=1}^n A_i^2}\times \sqrt{\sum_{i=1}^n B_i^2}}$$

In our code, we use the `cosine_similarity()` function from `scikit-learn` to measure the similarity between each song and the summarized playlist vector.

One big advantage of this approach is that the time complexity of the whole algorithm is that of a matrix multiplication, since we compute the cosine similarity between each row vector (song) and the column vector of summarized playlist features.

![cossim_img](https://images.deepai.org/glossary-terms/cosine-similarity-1007790.jpg)
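
The formula can be checked by hand on two-dimensional toy vectors. The sketch below is a pure-Python illustration, not the `scikit-learn` implementation; the helper `cosine_sim`, the vectors, and the candidate names are hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Zero vector: the angle is undefined, so return 0.0 by convention here
        return 0.0
    return dot / (norm_a * norm_b)

playlist_vector = [3.0, 4.0]
candidates = {
    "same_direction": [6.0, 8.0],  # a scaled copy of the playlist vector
    "orthogonal": [-4.0, 3.0],     # at a right angle to the playlist vector
    "close": [3.0, 3.0],
}

# Rank the candidates from most to least similar to the playlist vector
ranked = sorted(
    candidates,
    key=lambda name: cosine_sim(playlist_vector, candidates[name]),
    reverse=True,
)
print(ranked)
```

A scaled copy of the playlist vector gets a similarity of 1 and an orthogonal vector gets 0, which shows that only the direction of the vectors matters, not their length.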

In [53]:
def generate_playlist_recos(df, features, nonplaylist_features):
    '''
    Generate recommendations based on the songs in a specific playlist.
    ---
    Input: 
    df (pandas dataframe): spotify dataframe
    features (pandas series): summarized playlist feature (single vector)
    nonplaylist_features (pandas dataframe): feature set of songs that are not in the selected playlist
        
    Output: 
    non_playlist_df_top_40: Top 40 recommendations for that playlist
    '''
    
    # All songs that are not in the selected playlist; copy so that adding a column does not write to a slice of df
    non_playlist_df = df[df['id'].isin(nonplaylist_features['id'].values)].copy()
    # Find cosine similarity between the playlist and the complete song set
    non_playlist_df['sim'] = cosine_similarity(nonplaylist_features.drop('id', axis = 1).values, features.values.reshape(1, -1))[:,0]
    non_playlist_df_top_40 = non_playlist_df.sort_values('sim', ascending = False).head(40)  # sort values according to cosine similarity to playlist
    
    return non_playlist_df_top_40

In [54]:
# Generate the top 10 recommendations
recommend = generate_playlist_recos(song_df, complete_feature_set_playlist_vector, complete_feature_set_nonplaylist)
recommend.head(10)



Unnamed: 0,artist_name,id,track_name,danceability,energy,key,loudness,mode,speechiness,acousticness,...,liveness,valence,tempo,artist_pop,genres,track_pop,genres_list,subjectivity,polarity,sim
4579,American Authors,64ybTt8CKxPdeXBNnu08Op,Believer,0.583,0.968,1,-2.909,1,0.0368,0.00141,...,0.13,0.91,119.999,70,indie_poptimism modern_alternative_rock modern...,55,"[indie_poptimism, modern_alternative_rock, mod...",low,Neutral,0.784023
8250,American Authors,1obisQNOcikRvTdStbW3pG,Go Big Or Go Home,0.665,0.875,1,-4.272,1,0.0426,0.00939,...,0.0897,0.66,122.008,70,indie_poptimism modern_alternative_rock modern...,63,"[indie_poptimism, modern_alternative_rock, mod...",low,Neutral,0.781738
55023,The 1975,51cd3bzVmLAjlnsSZn4ecW,She's American,0.647,0.857,1,-3.94,1,0.0547,0.167,...,0.0763,0.55,115.976,78,modern_alternative_rock modern_rock pop rock,55,"[modern_alternative_rock, modern_rock, pop, rock]",low,Neutral,0.768862
4574,Neon Trees,0K1KOCeJBj3lpDYxEX9qP2,Sleeping With A Friend,0.582,0.882,2,-4.256,1,0.0355,0.00189,...,0.32,0.507,107.034,71,modern_alternative_rock modern_rock pop pop_ro...,59,"[modern_alternative_rock, modern_rock, pop, po...",low,Neutral,0.763553
1057,American Authors,4gHD93RNqEhEh2NkYzl3x6,Luck,0.554,0.806,0,-3.463,1,0.046,0.00177,...,0.165,0.646,144.923,70,indie_poptimism modern_alternative_rock modern...,54,"[indie_poptimism, modern_alternative_rock, mod...",low,Neutral,0.763182
54362,WALK THE MOON,71wT7aMCFPYfzutF66OLac,Aquaman,0.63,0.772,1,-6.986,1,0.0297,0.51,...,0.0881,0.721,99.964,72,dance_pop dance_rock indie_poptimism modern_al...,46,"[dance_pop, dance_rock, indie_poptimism, moder...",low,Neutral,0.756718
8390,Neon Trees,1fBl642IhJOE5U319Gy2Go,Animal,0.482,0.833,5,-5.611,1,0.0449,0.000346,...,0.365,0.74,148.039,71,modern_alternative_rock modern_rock pop pop_ro...,74,"[modern_alternative_rock, modern_rock, pop, po...",low,Neutral,0.754982
49941,The 1975,1v07ywlVYd02pOCnXRBDNA,Menswear,0.708,0.539,1,-10.281,1,0.0681,0.541,...,0.0856,0.159,97.015,78,modern_alternative_rock modern_rock pop rock,51,"[modern_alternative_rock, modern_rock, pop, rock]",low,Neutral,0.75466
21117,The 1975,5hc71nKsUgtwQ3z52KEKQk,Somebody Else,0.61,0.788,0,-5.724,1,0.0585,0.195,...,0.153,0.472,101.045,78,modern_alternative_rock modern_rock pop rock,75,"[modern_alternative_rock, modern_rock, pop, rock]",low,Neutral,0.749041
7680,Andy Grammer,4Jz4bjXeiF2SXVj9P4YfY5,Keep Your Head Up,0.674,0.778,0,-5.367,1,0.0376,0.0469,...,0.145,0.816,90.01,74,dance_pop modern_rock neo_mellow pop pop_rap p...,74,"[dance_pop, modern_rock, neo_mellow, pop, pop_...",low,Neutral,0.747519


In [55]:
playlistDF_test[["artist_name","track_name"]][:20]

Unnamed: 0,artist_name,track_name
27274,Audien,Something Better
27275,John Legend,All of Me
27276,Aloe Blacc,Wake Me Up - Acoustic
27277,Astoria Kings,Come Alive
27278,Imagine Dragons,Nothing Left To Say / Rocks - Medley
27279,American Authors,Best Day Of My Life
27280,You Me At Six,Stay With Me - Acoustic Version
27281,Aloe Blacc,I Need a Dollar
27282,Anthem Lights,Best of 2012: Payphone / Call Me Maybe / Wide ...
27283,Pharrell Williams,"Happy - From ""Despicable Me 2"""
