# Harry Neal - Spotify MoodGrid

*Spotify MoodGrid Playlist Subsetter is a tool to combine multiple playlists, give them a happy score and an energy score, and then output a new personal playlist centred around your chosen Happy/Energy mood.*

## Notebook 1 of 4 - Data Acquisition & Cleaning

### Data Dictionary

Column Name | Data Type | Description
------------|-----------|------------
track_id | string | Song unique ID
track_name | string | Song Name
artists | string | Song Artist or artists
popularity | integer | Song Popularity (0-100) where higher number equates to higher popularity
danceability | float | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy | float | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key | integer | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness | float | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode | integer | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness | float | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness | float | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness | float | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness | float | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence | float | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo | float | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms | integer | Duration of song in milliseconds
time_signature | integer | Number of beats per measure
explicit | boolean | Indicates whether the track contains explicit content
playlist_mood | string | Mood of playlist, one of 'happy', 'sad', 'energetic', 'chill'
query | string | Query used to search for playlist
playlist_name | string | Name of playlist



### Project Introduction

Spotify playlists have become a staple in the lives of music lovers worldwide. They provide a personalised and curated listening experience that can reflect our unique tastes, preferences, and moods. 

A third of Spotify listening time happens on Spotify curated playlists, with another third happening on user-generated playlists.
People love making playlists because it allows them to express themselves, showcase their favorite songs and artists, and share their musical tastes with others. Similarly, listening to playlists offers a convenient and effortless way to discover new music and enjoy familiar favorites.  

Playlisting can be a quick way of changing or enhancing a mood with music, but Spotify currently offers no way to reduce one or more playlists to a smaller subset based on mood. That's where the Spotify MoodGrid Playlist Subsetter comes in.

By using multiple playlists and analyzing their happy/energy scores, the app allows users to create a new playlist centered on their desired mood, providing a more tailored and fulfilling listening experience. It's a tool for anyone who wants to enjoy their favorite songs in a way that matches their current mood or to discover new music that will enhance their listening experience.


In [2]:
import numpy as np
import pandas as pd
import re
from collections import Counter

import joblib
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

In [3]:
query_df = joblib.load("./data/pickles/query_df.pkl")

## Data Extract Transform Load Clean

A Python script was written to download the data using Spotify's API.

**./data_download_scripts/Harry_Neal_Spotify_MoodGrid_automated_playlist_download.py**

Instructions on obtaining a Spotify API access token can be found here:

https://developer.spotify.com/documentation/web-api

Public user playlists were searched using key search terms under one of four umbrella moods - Happy, Sad, Energetic & Chilled.

Features were extracted from several different API endpoints:

- Get playlist names from search term
- Get playlist tracks
- Get artist genres of each artist in the playlist
- Get audio features of each track in the playlist

The combination of these API calls resulted in the Spotify API rate limit being hit regularly. To avoid regularly slowing down the script, an exception handler was added to fill the genre with an empty string whenever a 429 (Too Many Requests) error was returned. This resulted in many empty values for genre.
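
As an alternative to writing an empty genre on a 429, the request could be retried after waiting. The sketch below is not the download script's actual logic: `call_with_retry` and the `fetch` callable (assumed to return a `(status_code, headers, payload)` tuple) are hypothetical names, and it assumes the `Retry-After` header holds a number of seconds.

```python
import time


def call_with_retry(fetch, max_retries=3, default_wait=1.0, sleep=time.sleep):
    """Call fetch() and retry while it reports HTTP 429.

    fetch is a hypothetical callable returning (status_code, headers, payload).
    Returns (status_code, payload); payload is None if every attempt hit 429.
    """
    status = None
    for attempt in range(max_retries + 1):
        status, headers, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # Assumption: Retry-After is given in seconds
            wait = float(headers.get("Retry-After", default_wait))
            sleep(wait)
    return status, None
```

The trade-off is run time: waiting keeps genre populated during the main download, whereas the approach described above finishes faster and leaves genre to be back-filled later.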

Values for track ID for each mood were stored in a dictionary in order to quickly skip over duplicate songs, rather than waste time writing and removing duplicates later.
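
The duplicate-skipping idea can be sketched as follows. This is a minimal illustration rather than the code from the download script; `is_new_track` and `seen_by_mood` are hypothetical names, and a set per mood is assumed as the container.

```python
from collections import defaultdict


def is_new_track(track_id, mood, seen_by_mood):
    """Return True the first time a track ID is seen for a mood, else False."""
    if track_id in seen_by_mood[mood]:
        return False
    seen_by_mood[mood].add(track_id)
    return True


# One set of track IDs per mood, created on first use
seen_by_mood = defaultdict(set)

for track_id, mood in [("t1", "sad"), ("t1", "sad"), ("t1", "happy")]:
    if is_new_track(track_id, mood, seen_by_mood):
        print(f"keep {track_id} for {mood}")
```

Because membership tests on a set take roughly constant time, the check stays cheap even after many thousands of tracks, and the same track can still appear under a different mood.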

Track information such as 'Artist', 'Track Name' and audio features were downloaded from separate API calls.

Inspiration for the data download script was taken from the following:
https://github.com/plamere/playlistminer/blob/master/scripts/crawl.py

The key words searched and their associated moods can be found below.

In [4]:
query_df.index

MultiIndex([(     'adrenaline', 'energetic'),
            (     'aggressive', 'energetic'),
            (       'angriest',       'sad'),
            (          'angry',       'sad'),
            (        'anguish',       'sad'),
            (      'anguished',       'sad'),
            (          'beach',     'chill'),
            (          'bliss',     'happy'),
            (       'blissful',     'happy'),
            (           'calm',     'chill'),
            (       'carefree',     'happy'),
            (       'cheerful',     'happy'),
            (          'chill',     'chill'),
            (        'chilled',     'chill'),
            (       'chilling',     'chill'),
            (      'contented',     'happy'),
            (         'crying',       'sad'),
            (    'death metal',       'sad'),
            (        'depress',       'sad'),
            (        'despair',       'sad'),
            (           'easy',     'chill'),
            (        'ecstasy',   

## Data Cleaning

#### Deal with NaN values

In [5]:
def print_NANs(dataframe):
    ''' print sum of NaNs in a dataframe if not zero'''
    nulls = dataframe.isna().sum()
    nulls_vals = list(nulls)
    nulls_keys = list(nulls.index)

    for i in range(len(nulls_keys)):
        if nulls_vals[i] > 0:
            print(f"{nulls_keys[i]} has {nulls_vals[i]} NaNs")

    if nulls.sum() == 0:
        print("There are no NaN values")

def print_value_counts(dataframe, column):
    ''' print the number of songs for each unique value in a column'''
    vc = dataframe[column].value_counts()
    vc_index = list(vc.index)
    vc_counts = list(vc)
    for i in range(len(vc_counts)):
        print(f"Dataset contains {vc_counts[i]} {vc_index[i]} songs")

Load in the raw data

In [6]:
# read in csv file from Spotify API script
df_raw = pd.read_csv("./data/CSVs/1881pl_output.csv")

In [7]:
print(f"Dataset contains {df_raw.shape[0]} tracks & {len(df_raw['playlist_name'].unique())} playlists")

Dataset contains 160514 tracks & 1691 playlists


In [8]:
print_NANs(df_raw)

track_name has 180 NaNs
artists has 179 NaNs
artist_genre has 103058 NaNs
playlist_name has 693 NaNs


#### Check for duplicates

We are expecting duplicates across different playlist_moods, and for now we want to keep these. However, we shouldn't have duplicates within the same mood, as we built a duplicate check into the data download script. Let's check anyway by dropping 'playlist_name' and 'query', since we want to know whether any songs have doubled up across an entire mood, regardless of which query or playlist they came from.

In [9]:
df_dupcheck = df_raw.drop(columns=['playlist_name', 'query'])

In [10]:
df_dupcheck.duplicated().sum()

49

We have 49 duplicates that have slipped through the net.

Let's take a quick look at these

In [11]:
df_raw['playlist_name'][
   df_dupcheck.duplicated(keep=False)
].value_counts()

This Is Angus & Julia Stone    98
Name: playlist_name, dtype: int64

In [12]:
df_raw['query'][
   df_dupcheck.duplicated(keep=False)
].value_counts()

anguish      49
anguished    49
Name: query, dtype: int64

The duplicates all come from the same playlist 'This Is Angus & Julia Stone' and the query 'anguish/anguished'.  We can safely remove one of the duplicates of these.

In [13]:
df = df_raw[
    ~df_dupcheck.duplicated()
]
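
The same check can also be written without a helper dataframe by passing the columns to compare through the `subset` argument of `DataFrame.duplicated`. The toy data and the names `toy`, `compare_cols`, `dup_mask` and `deduped` below are made up for illustration.

```python
import pandas as pd

# Toy data: rows 0 and 1 are the same track in the same mood,
# found through two different queries
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "a"],
    "playlist_mood": ["sad", "sad", "sad", "happy"],
    "query": ["anguish", "anguished", "sad", "happy"],
    "playlist_name": ["P1", "P1", "P2", "P3"],
})

# Compare on every column except the two we want to ignore
compare_cols = [c for c in toy.columns if c not in ("playlist_name", "query")]

# True for every repeat after the first occurrence
dup_mask = toy.duplicated(subset=compare_cols)
deduped = toy[~dup_mask]
print(deduped)
```

The track that also appears under 'happy' is kept, which matches the intention of allowing duplicates across moods.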

Sanity check duplicate removal

In [14]:
df.drop(columns=['playlist_name', 'query']).duplicated().sum()

0

In [15]:
print(f"Dataset contains {df.shape[0]} tracks & {len(df['playlist_name'].unique())} playlists")

Dataset contains 160465 tracks & 1691 playlists


Check NaNs again

In [16]:
print_NANs(df)

track_name has 180 NaNs
artists has 179 NaNs
artist_genre has 103058 NaNs
playlist_name has 693 NaNs


The high number of NaNs present in artist_genre is immediately obvious.  This is a result of this column being filled with an empty string every time the Spotify rate limit was hit.

The artist genres were fetched separately in a different script run later.  These will now be merged with our original dataset to further populate genre, as the genre of the song may be a good predictor of the mood of a song.

Load the separately grabbed genres and add them to a dataframe

In [17]:
genre_data = joblib.load("./data/pickles/1881pl_genre_update.pkl")

genre_df = pd.DataFrame(genre_data)

In [18]:
genre_df.head(3)

Unnamed: 0,artist,genre
0,"Mr Mantega, Chill Select","[focus beats, lo-fi jazzhop]"
1,"DLJ, ØDYSSEE, Bastien Brison",[lo-fi beats]
2,Nowun,[lo-fi beats]


In [19]:
df.head(3)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,playlist_mood,query,playlist_name
0,6zSpb8dQRaw0M1dK8PBwQz,Cold Heart - PNAU Remix,"Elton John, Dua Lipa, PNAU","['glam rock', 'mellow gold', 'piano rock']",85,0.796,0.798,1,-6.312,1,...,4.2e-05,0.0952,0.942,116.032,202735,4,False,happy,happy,Happy Beats
1,39JofJHEtg8I4fSyo7Imft,B.O.T.A. (Baddest Of Them All) - Edit,"Eliza Rose, Interplanetary Criminal",['house'],81,0.736,0.906,0,-7.589,1,...,0.585,0.106,0.698,137.001,226627,4,False,happy,happy,Happy Beats
2,1bgKMxPQU7JIZEhNsM1vFs,Words (feat. Zara Larsson),"Alesso, Zara Larsson","['dance pop', 'edm', 'electro house', 'pop', '...",81,0.739,0.586,10,-5.079,0,...,0.000252,0.308,0.444,124.026,142677,4,False,happy,happy,Happy Beats


Left join df with genre_df on the artists/artist column

In [20]:
merged_df = df.merge(genre_df, how='left', left_on='artists', right_on='artist')

Use the newly fetched `genre` where it exists, falling back to the original `artist_genre` otherwise

In [21]:
merged_df['artist_genre'] = merged_df['genre'].fillna(merged_df['artist_genre'])

Drop the `artist` and `genre` columns that came from genre_df, and write to a new df

In [22]:
df = merged_df.drop(columns=['artist', 'genre'])
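
A left join only preserves the row count if each artist appears at most once in the lookup table. The sketch below shows one way to guard against accidental row multiplication, using toy data and hypothetical names (`tracks`, `genres`, `genres_unique`); whether the real genre pickle contains repeated artists is not known from this notebook.

```python
import pandas as pd

tracks = pd.DataFrame({
    "artists": ["A", "B", "C"],
    "artist_genre": ["['pop']", None, None],
})
genres = pd.DataFrame({
    "artist": ["B", "B", "A"],
    "genre": ["['rock']", "['rock']", "['dance pop']"],
})

# Keep one row per artist so the join cannot duplicate tracks
genres_unique = genres.drop_duplicates(subset="artist")

# validate raises MergeError if the right-hand keys are not unique
merged = tracks.merge(
    genres_unique,
    how="left",
    left_on="artists",
    right_on="artist",
    validate="many_to_one",
)

# New genre takes precedence, original value is the fallback
merged["artist_genre"] = merged["genre"].fillna(merged["artist_genre"])
result = merged.drop(columns=["artist", "genre"])
print(result)
```

Comparing `len(result)` with `len(tracks)` after the merge is a cheap additional sanity check.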

In [23]:
print_NANs(df)

track_name has 180 NaNs
artists has 179 NaNs
artist_genre has 15 NaNs
playlist_name has 693 NaNs


By including the genres from the separate genre API script we have gone from >100k NaNs to 15 NaNs for `artist_genre`.

Pickle the full raw dataset for use later

In [24]:
joblib.dump(df, "./data/pickles/df_full_raw.pkl")

['./data/pickles/df_full_raw.pkl']

### Perform train/test split

To prevent data leakage, where knowledge of the test set influences decisions made on the training set, we should split our data into a train set and a test set as soon as it has been loaded.

This means our test set is kept as it would be found 'in the wild', and later we can apply the same pre-processing that is applied to the train set.

**Split X and y**

In [25]:
df.columns

Index(['track_id', 'track_name', 'artists', 'artist_genre', 'popularity',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_ms', 'time_signature', 'explicit', 'playlist_mood', 'query',
       'playlist_name'],
      dtype='object')

Our target variable is `playlist_mood`, so let's split on this column, where X is all other features

In [26]:
y = df['playlist_mood']
X = df.drop(columns='playlist_mood')

Split train and test in an 80/20 split, stratifying on 'y' to ensure we keep the same proportions of playlist_mood between the input dataset and the split datasets.

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
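
To see what stratifying buys us, the class proportions can be compared before and after the split. This is a small sketch on made-up data (the `toy` frame and its class sizes are assumptions chosen so the proportions divide evenly), not a result from the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with deliberately imbalanced moods
toy = pd.DataFrame({
    "feature": range(100),
    "playlist_mood": ["happy"] * 40 + ["sad"] * 30 + ["chill"] * 20 + ["energetic"] * 10,
})
y_toy = toy["playlist_mood"]
X_toy = toy.drop(columns="playlist_mood")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of each mood in the full, train and test sets, side by side
proportions = pd.DataFrame({
    "full": y_toy.value_counts(normalize=True),
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(proportions)

# Largest difference between train and full proportions
max_gap = (proportions["train"] - proportions["full"]).abs().max()
```

Running the same comparison on `y`, `y_train` and `y_test` would confirm that the real split kept the mood balance.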

Output the raw test and train datasets

In [28]:
joblib.dump(X_test, "./data/pickles/X_test_raw.pkl")
joblib.dump(y_test, "./data/pickles/y_test_raw.pkl")
joblib.dump(X_train, "./data/pickles/X_train_raw.pkl")
joblib.dump(y_train, "./data/pickles/y_train_raw.pkl")


['./data/pickles/y_train_raw.pkl']

Recombine X_train and y_train for cleaning and pre-processing

In [29]:
# perform a left join and reset the index
df = (X_train.merge(y_train, how='left', left_index=True, right_index=True)).reset_index(drop=True)

# perform a left join and reset the index
df_test = (X_test.merge(y_test, how='left', left_index=True, right_index=True)).reset_index(drop=True)

Output the raw test and train dataframes for later modelling

In [30]:
joblib.dump(df_test, "./data/pickles/df_test_raw.pkl")
joblib.dump(df, "./data/pickles/df_train_raw.pkl")

['./data/pickles/df_train_raw.pkl']

Check NaN values of the dataset

In [31]:
print_NANs(df)

track_name has 147 NaNs
artists has 146 NaNs
artist_genre has 12 NaNs
playlist_name has 547 NaNs


Investigate NaN track and artist

In [32]:
df[
    df['track_name'].isna()
].sample(5, random_state=77)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
15334,67UzeF97qWMpXfwZYJftTO,,,"[australian psych, neo-psychedelic]",0,0.485,0.925,1,-5.856,0,...,0.0103,0.0482,0.574,107.916,297857,3,False,high octane,High Octane songs,energetic
4445,4PZQGu0VYDGkj9xqcIpFeO,,,"[australian psych, neo-psychedelic]",0,0.469,0.573,2,-6.263,1,...,4.3e-05,0.117,0.244,141.72,270881,4,False,blissful,Blissful Bollywood🥰 #2,happy
29219,7q7kUnhGYdocY7Bn4fC34g,,,"[australian psych, neo-psychedelic]",0,0.439,0.911,6,-4.096,1,...,0.553,0.288,0.782,181.499,164455,4,True,heavy sad,heavy stone make sad brain voice quiet 💪😤,sad
14892,4RLGpC5pvisS0py1wJ8OF4,,Cubenssi,[chill breakcore],11,0.0809,0.519,3,-13.918,0,...,0.936,0.139,0.378,76.219,106668,3,False,depress,DEPRESSIVE BREAKCORE>>>,sad
107456,6JHm7Da8ZX0KA5xZhXiaB2,,,"[australian psych, neo-psychedelic]",0,0.543,0.723,1,-5.326,0,...,0.0,0.0738,0.292,145.779,336037,4,False,blissful,Blissful Bollywood🥰 #2,happy


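NaN track names and NaN artists mostly line up, and most of these rows have a popularity of zero. Note that their genre is not null: most of them share the same genre list, so it is unlikely to be reliable. We don't necessarily need track name and artist name, and we have enough data not to worry about dropping ~150 rows out of well over 100,000.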
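
One possible explanation for unrelated tracks with a missing artist sharing a single genre list is the earlier left join: pandas `merge` matches null keys to each other, unlike a SQL join. Whether that is what happened here is an assumption; the sketch below only demonstrates the behaviour on toy data, with hypothetical names `left`, `right` and `right_clean`.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({
    "track_id": ["t1", "t2"],
    "artists": ["A", np.nan],
})
right = pd.DataFrame({
    "artist": ["A", np.nan],
    "genre": ["pop", "psych"],
})

# The NaN artist on the left picks up the genre of the NaN artist on the right
merged = left.merge(right, how="left", left_on="artists", right_on="artist")
print(merged)

# Guard: drop null keys from the lookup table before joining
right_clean = right.dropna(subset=["artist"])
merged_clean = left.merge(right_clean, how="left", left_on="artists", right_on="artist")
print(merged_clean)
```

Either way, dropping the rows with a missing artist removes the affected tracks from the training data.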

Drop the NaN artists.

In [33]:
df = df[
    df['artists'].notna()
]

In [34]:
print_NANs(df)

track_name has 1 NaNs
artist_genre has 12 NaNs
playlist_name has 546 NaNs


We have 1 NaN left in track name, so all but one of the NaN track names were in rows that also had a NaN artist. Let's drop the remaining one.

In [35]:
df = df[
    df['track_name'].notna()
]
print_NANs(df)

artist_genre has 12 NaNs
playlist_name has 546 NaNs


Check some examples of the NaN values in playlist_name

In [36]:
df[
    df['playlist_name'].isna()
].sample(5, random_state=11)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
120897,0eH2eHURaXUP15D8gQlfjx,LoveGame,Lady Gaga,"[art pop, dance pop, pop]",73,0.894,0.678,6,-5.611,0,...,2e-06,0.317,0.844,105.024,216333,4,False,beach,,chill
99814,6lTG1xmx6XMtQYOSLjZG5V,Spectre,A.C. XZA,[],2,0.781,0.49,11,-8.982,0,...,0.749,0.112,0.137,132.935,138998,4,True,angriest,,sad
60786,0NjW4SKY3gbfl2orl1p8hr,IFHY (feat. Pharrell),"Tyler, The Creator, Pharrell Williams","[hip hop, rap]",72,0.358,0.715,7,-6.181,1,...,0.0,0.581,0.275,85.478,319253,4,True,beach,,chill
35860,1EryAkZ0VHstC6haIxVBiE,Sextape,Deftones,"[alternative metal, nu metal, rap metal, rock,...",75,0.367,0.634,5,-6.475,1,...,0.0759,0.116,0.0964,89.981,241533,4,False,crying,,sad
103100,7hbb8vRcx3VICc1Yy6lnp8,NO TVS ALLOWED,"Mibhova, Phouelisi",[],26,0.659,0.494,7,-8.378,1,...,0.0,0.255,0.392,75.027,104000,4,False,crying,,sad


There's nothing about these songs to suggest we should drop them

In [37]:
pl_nans = df[
    df['playlist_name'].isna()
]

In [38]:
pl_nans['query'].value_counts()

upbeat      156
screamo     110
angriest     93
beach        89
pumped       39
crying       24
angry        18
sad          17
Name: query, dtype: int64

Grouping the tracks with null playlist names by query, we see that there are examples for 8 different queries. It seems likely that these originate from one playlist for each query, for which the playlist name was not written.

In [39]:
pl_nans[
    pl_nans['query'] == "angry"].head()

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
1817,4BYejINgfZF0qKDMEH2cim,Picture To Burn,Taylor Swift,[pop],60,0.658,0.877,7,-2.098,1,...,0.0,0.0962,0.821,105.586,173067,4,False,angry,,sad
6540,1Bx0zEdVjkFlV27iKaePug,Man Down,Rihanna,"[barbadian pop, pop, urban contemporary]",70,0.47,0.904,0,-4.024,0,...,0.0,0.0491,0.557,155.788,267000,4,True,angry,,sad
27316,4fixebDZAVToLbUCuEloa2,Womanizer,Britney Spears,"[dance pop, pop]",76,0.724,0.695,11,-5.226,1,...,0.0,0.0889,0.235,139.0,224400,4,False,angry,,sad
29753,0RP1kqoSPkVXsKiQNhMKzV,mad woman,Taylor Swift,[pop],64,0.593,0.7,3,-9.016,1,...,7e-06,0.116,0.451,141.898,237267,4,True,angry,,sad
46343,7aonAWn0J0AJ47ZU9WHCXC,INFERNO,"Sub Urban, Bella Poarch",[modern indie pop],66,0.82,0.611,9,-5.02,0,...,2.5e-05,0.0684,0.637,127.883,133134,4,False,angry,,sad


We don't necessarily need playlist_name for any analysis, only for quality checking that the tracks are in the correct mood.  The examples we have seen appear to match the query in terms of mood by ear, so let's give the NaN values a name that matches the query. 

In [40]:
# assign the result back rather than relying on inplace=True on a column selection
df['playlist_name'] = df['playlist_name'].fillna(df['query'])

In [41]:
print_NANs(df)

artist_genre has 12 NaNs


In [42]:
df['artist_genre'].value_counts()

[]                                                                          16402
[]                                                                           5557
[pop]                                                                        1550
['lo-fi beats']                                                              1065
[drift phonk]                                                                 761
                                                                            ...  
[irish folk, irish neo-traditional]                                             1
[alternative metal, alternative rock, neo mellow, pop rock, post-grunge]        1
[dubstep, gaming dubstep, psybass]                                              1
['australian dance']                                                            1
['indonesian jazz', 'indonesian pop', 'indonesian singer-songwriter']           1
Name: artist_genre, Length: 13403, dtype: int64

The only remaining NaNs are from artist_genre, where the Spotify API returned some sort of HTTP error (e.g. 400). Checking the value counts, we see that the most common value for genre is the string of an empty list, '[]', which is populated when a genre cannot be found.

Therefore we should fill the NaN values with the same empty list string, as they are essentially missing genre too.

In [43]:
df[
    df['artist_genre'].isna()
].head(3)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
1430,1J0oCWx3OcL0SsFB45ejlL,Nba Youngboy Whap Whap (Remix),tnv_30,,0,0.8,0.803,6,-8.619,0,...,9.3e-05,0.0993,0.606,107.172,208538,4,True,aggressive,nbayoungboy (Aggressive) 💚,energetic
13531,73TWeCvTA60pPnvEjqol86,"Орлеанская дева, действие I: No. 6, Гимн ""Царь...","Irina Arkhipova, Андрей Соколов, Виктор Селива...",,9,0.249,0.335,8,-13.966,1,...,0.00559,0.122,0.0402,110.075,392440,3,False,unhappy,POV: You're an unhappy muse on Mount Parnassus,sad
57220,1CCDIow4pPYyJYzhtTUYz2,RAW SHIT (feat. Migos),"DaBaby, Migos",,54,0.874,0.714,5,-4.946,1,...,0.0,0.356,0.85,130.046,216678,4,True,pumped,Pumped up songs,energetic


In [44]:
# fill only artist_genre, the one column that still contains NaNs
df['artist_genre'] = df['artist_genre'].fillna("[]")
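
The value counts above list '[]' twice and show genre names both with and without quotes, which suggests the column holds more than one representation of a genre list. If a single representation is needed later, something like the sketch below could be applied. `normalise_genres` is a hypothetical helper, and the formats it handles are assumptions based only on the printed output.

```python
import ast


def normalise_genres(value):
    """Return genres as a tuple of strings, whatever form they arrive in."""
    # Missing values (None or float NaN) become an empty tuple
    if value is None or (isinstance(value, float) and value != value):
        return ()
    # Already a sequence of genre names
    if isinstance(value, (list, tuple)):
        return tuple(str(g) for g in value)
    text = str(value).strip()
    try:
        # Handles strings such as "['dance pop', 'pop']" and "[]"
        parsed = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        parsed = None
    if isinstance(parsed, (list, tuple)):
        return tuple(str(g) for g in parsed)
    # Fallback for unquoted forms such as "[pop, rock]"
    inner = text.strip("[]")
    return tuple(g.strip() for g in inner.split(",") if g.strip())


print(normalise_genres("['dance pop', 'pop']"))
print(normalise_genres("[pop, rock]"))
```

Tuples are used rather than lists because they are hashable, so calls such as `value_counts()` treat identical genre sets as the same value. It could be applied with `df['artist_genre'].apply(normalise_genres)`.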

Check NaNs again

In [45]:
print_NANs(df)

There are no NaN values


We have successfully removed/filled all NaN values.

### Checking Playlist appropriate to Mood

Now let's dig deeper into the playlist names and whether they are appropriate for the chosen query or mood.

First, to get an overall sense of the playlist names and what has been added with which query, let's create a dictionary of queries and playlists

In [46]:
# a list of all queries
querylist = list(df['query'].unique())

# add a list of unique playlist names to each query key
pl_dict = {}
for query in querylist:
    pl_dict[query] = list(df[
        df['query']== query]['playlist_name'].unique())

# print the query and playlist list for an overview
for query in pl_dict.keys():
    print(query, pl_dict[query])

sad ['Sad 90s', 'Lonely Sad Mix', 'Depressing Rap 😭', 'crying myself to sleep', 'SAD', 'Sad viral tik tok Songs', 'Overthinking🥀🖤', 'Best of Sadar Bahar', 'Sad Rock 🤘', 'Sad hours: Punjabi', 'sad songs for sad breakups', 'Sad Corridos♥️', 'sad hour', 'Sad songs to cry your heart out to 😭💔', '#SadCuhHours 🥺', 'sad girl starter pack', 'sad songs 2023 😢 crying and depressing music', 'sad songs everyone knows', 'Sad Indie', 'Bhojpuri Sad Song 😭😣', 'sad', 'sad rap vibes 2023', '💔😭Sad Songs For Crying At 3am😭💔', 'Sad Songs 🥺', 'sad sierreño', 'sad tik tok songs 2023 / 2022', 'sad girl country', 'sad country songs to cry to.', 'NF saddest songs ;(', 'Sad Soul', 'Sad Covers', 'Moody Sad Mix', 'sad spanish Songs to cry in the corner of your room bebe', 'Sad Songs', 'Sad 80s', 'Slow sad songs to fall asleep to', 'Sad Crying Mix', 'sad lofi', 'Sad Classical', 'Sad 00s', 'Sad Boi Hours', 'Sad Love Song Mix', 'sad song club', "taylor swift but you're sad.", 'Sadboy', 'sad instrumentals for sad nigh
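
The query-to-playlists dictionary can also be built in one step with `groupby`. This sketch uses toy data and is only an equivalent formulation of the loop, not a change to the notebook's logic.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["sad", "sad", "happy", "sad"],
    "playlist_name": ["sad 90s", "sad indie", "happy beats", "sad 90s"],
})

# Unique playlist names per query; sort=False keeps queries in order of first appearance
pl_dict_toy = (
    toy.groupby("query", sort=False)["playlist_name"]
    .unique()
    .apply(list)
    .to_dict()
)

for query, playlists in pl_dict_toy.items():
    print(query, playlists)
```

This avoids filtering the full dataframe once per query, which matters more as the number of queries grows.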

One risk of searching for moods based on queries is that we blindly add playlists that do not fit the mood. This can happen in three ways: a word differs in meaning or language; the search returns a word that is close in Levenshtein distance but different in meaning; or the query appears in an artist's name, when the mood of that artist's music may be completely different from the desired result.

First let's deal with the last case - where the query is in the artist's name.

In [47]:
# lower case all playlist names for ease of cleaning
df['playlist_name'] = df['playlist_name'].apply(str.lower)

# define custom function to check row by row whether the query appears in both the artist name and the playlist name
def artist_contains_query(row):
    return (row['query'] in row['artists'].lower()) and (row['query'] in row['playlist_name'])

# apply the custom function
q_artist = df[df.apply(artist_contains_query, axis=1)]

In [48]:
print(f"There are {q_artist.shape[0]} tracks where the query is in the artist name")

There are 4963 tracks where the query is in the artist name


Checking examples of these, we can see our function is doing what we want and selecting artists whose name contains the query.

In [49]:
q_artist.sample(5, random_state=5)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
60476,19q4YtvaSA2K78v4sApTlP,Itihaas,Hardeep Grewal,[punjabi pop],15,0.737,0.688,0,-4.112,0,...,0.0,0.232,0.477,84.961,203294,4,False,hard,this is hardeep grewal,energetic
23213,5qHYXcVvc9xsFB2uH7GpMN,Kokomo,The Beach Boys,"[adult standards, baroque pop, classic rock, p...",72,0.682,0.635,0,-10.05,1,...,0.0,0.137,0.927,115.584,217693,4,False,beach,this is the beach boys,chill
59133,2Zyl9jW8HXHDAeIvga5JVK,A Lover Spurned,"Soft Cell, Marc Almond","['new romantic', 'new wave', 'new wave pop', '...",18,0.683,0.526,4,-13.705,0,...,0.000288,0.0599,0.791,120.439,339893,4,False,soft,this is soft cell,chill
66395,6ofQC96SOJs38OeveivG3X,Still Dreaming,Calm,[nigerian pop],8,0.742,0.181,9,-20.528,0,...,0.914,0.101,0.642,90.063,223431,4,False,calm,this is calm,chill
126109,3ZPF2C5503DJShSlTi2Bp5,The Girl With The Patent Leather Face,Soft Cell,"[new romantic, new wave, new wave pop, synthpop]",13,0.394,0.51,2,-6.776,1,...,5e-06,0.0716,0.431,172.497,297131,4,False,soft,this is soft cell,chill


Drop all rows where query is in artist and playlist name

In [50]:
df = df[
    ~(df.apply(artist_contains_query, axis=1))
]
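
Row-wise `df.apply(..., axis=1)` is run twice for this filter and can be slow on a dataframe of this size. A possible alternative is to build the boolean mask once with a list comprehension over the three columns. The sketch below uses toy rows and a hypothetical helper name, `artist_query_mask`, and assumes playlist names are already lower-cased as in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "query": ["beach", "soft", "sad"],
    "artists": ["The Beach Boys", "Soft Cell, Marc Almond", "Taylor Swift"],
    "playlist_name": ["this is the beach boys", "this is soft cell", "sad taylor swift"],
})


def artist_query_mask(frame):
    """True where the query appears in both the artist and the playlist name."""
    flags = [
        (q in a.lower()) and (q in p)
        for q, a, p in zip(frame["query"], frame["artists"], frame["playlist_name"])
    ]
    # Reuse the original index so the mask aligns with the dataframe
    return pd.Series(flags, index=frame.index)


mask = artist_query_mask(toy)
flagged = toy[mask]   # rows to inspect
kept = toy[~mask]     # rows to keep
print(kept)
```

Computing the mask once and reusing it for both the inspection and the drop also guarantees the two steps act on exactly the same rows.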

There are many instances of the search query bringing up an artist whose name is close in Levenshtein distance, e.g. the query 'dynamic' brings up the artist 'Dynasty'. We want to remove these, as they are generally unrelated to the mood. However, we want to keep examples where the query is present in the playlist name, e.g. 'Sad Taylor Swift'.
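
For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The function below is a standard dynamic-programming sketch of that definition; it makes no claim about how Spotify's search actually ranks fuzzy matches.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("dynamic", "dynasty"))
print(levenshtein("lively", "lovely"))
```

Small distances like these illustrate how a search for one word can surface playlists built around a similar-looking but unrelated name.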

In [51]:
# define custom function to check row by row whether the artist name is in the playlist name while the query is not

def playlist_contains_artist_not_query(row):
    return (row['artists'].lower() in row['playlist_name']) and (row['query'] not in row['playlist_name'])

# apply custom function to each row of the DataFrame
pl_artist_not_query = df[df.apply(playlist_contains_artist_not_query, axis=1)]

print(f"There are {pl_artist_not_query.shape[0]} tracks where the artist is in the Playlist name, but not in the query")


There are 3368 tracks where the artist is in the Playlist name, but not in the query


In [52]:
pl_artist_not_query.sample(5, random_state=33)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
14120,0RIZcb4vPEcTdHRm7yDO6H,I Will Fly,Angaza Singers,"['sda choir', 'swahili gospel']",2,0.779,0.517,10,-6.657,1,...,0.0,0.313,0.402,142.152,265418,4,False,anguished,this is angaza singers,sad
51125,0Ob1qDTsJtdDjLWnjHbOG0,Intricate - Original Mix,Energun,"[dark trap, scream rap]",0,0.724,0.723,1,-15.407,1,...,0.789,0.111,0.329,128.014,438399,3,False,energetic,this is energun,energetic
56399,3rRvxAsZv2UkCfhb4lPC9o,Hasta el Día de Hoy,Dinamicos Jrs,"[classic sierreno, corrido, corridos tumbados,...",38,0.816,0.739,8,-6.968,1,...,1.4e-05,0.0947,0.818,97.481,232100,4,False,dynamic,this is dinamicos jrs,energetic
85561,3kVZLGlzRZNE8GfNbN9gE1,Numb3rs,Xtatic,['kenyan alternative'],0,0.534,0.918,2,-6.502,1,...,0.873,0.368,0.264,145.989,401096,4,False,ecstatic,this is xtatic,happy
101712,58oJMzQsO51iocCocOhVDk,I Don't Want To Be A Freak (But I Can't Help M...,Dynasty,[norwegian pop],0,0.725,0.896,5,-5.678,0,...,0.0011,0.0696,0.939,118.525,433880,4,False,dynamic,this is dynasty,energetic


Sanity check shows that our function is flagging the correct rows

Drop these tracks

In [53]:
#drop all rows where artist is in playlist name but query is not
df = df[
    ~(df.apply(playlist_contains_artist_not_query, axis=1))
]

A lot of the playlists that are unrelated to the mood come from the "This is..." artist playlists. Let's create a custom function to flag these so that they can be dropped.

In [54]:
def this_is_identifier(row):
    return "this is" in row['playlist_name']

pl_this_is = df[
    (df.apply(this_is_identifier, axis=1))
]

print(f"There are {pl_this_is.shape[0]} tracks with 'This is...' in the playlist name")

There are 1194 tracks with 'This is...' in the playlist name


In [55]:
pl_this_is.sample(5, random_state = 1)

Unnamed: 0,track_id,track_name,artists,artist_genre,popularity,danceability,energy,key,loudness,mode,...,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,explicit,query,playlist_name,playlist_mood
5521,2PZiLzbvaKNppzaBGvhPw3,Daily Duppy,"Ard Adz, GRM Daily","[grime, uk alternative hip hop, uk hip hop]",31,0.608,0.554,9,-8.544,1,...,0.0,0.27,0.651,139.214,178338,4,True,hard,this is ard adz,energetic
122554,2YMA7qrL7SyNafjrb8tcF5,Smile Back at Me,"Mellow Thing, Erick Yung, Moonrock Mont, Eddy ...",[],0,0.706,0.619,8,-11.578,0,...,0.0,0.0797,0.191,144.044,164212,4,True,melancholy,this is mellow thing,sad
28707,00shcLNgwxFoIWDCCiQB6i,"Holberg Suite, Op. 40: 4. Air (Andante religioso)","Edvard Grieg, Academy of St. Martin in the Fie...","[classical, late romantic era, norwegian class...",23,0.168,0.0337,7,-24.257,0,...,0.861,0.147,0.0339,67.446,406667,4,False,grief,this is grieg,sad
108357,4oVHVOiduNppLB19NpdeCS,La Chacha del Trompedario,"Adriano, Paulina",[],9,0.924,0.702,3,-4.781,1,...,0.752,0.0541,0.964,129.143,152200,4,False,adrenaline,this is adriano,energetic
88856,4PRKn6xgM6EAAPbUTNS2FF,Vinho Novo,"Agnus Dei, Juliene","['brazilian ccm', 'musicas espiritas']",2,0.709,0.431,11,-8.883,1,...,0.0,0.0526,0.793,138.434,218040,4,False,anguished,this is agnus dei,sad


Function passes the sanity check

Drop all tracks where playlist name is 'this is...'

In [56]:
df = df[
    ~(df.apply(this_is_identifier, axis=1))
]
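
Since this check involves only one column, it can also be expressed with the vectorised string accessor instead of a row-wise function. The sketch below uses toy playlist names; `regex=False` makes the pattern a plain substring match.

```python
import pandas as pd

toy = pd.DataFrame({
    "playlist_name": ["this is grieg", "sad 90s", "this is ard adz", "chill vibes"],
})

# True where the lower-cased playlist name contains the literal text "this is"
this_is_mask = toy["playlist_name"].str.contains("this is", regex=False)

kept = toy[~this_is_mask]
print(kept)
```

Note that a substring match also flags names where "this is" appears mid-title; `str.startswith("this is")` would be stricter if only the artist playlists are meant to be caught.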

Re-print the queries and associated playlists 

In [57]:
# a list of all queries
querylist = list(df['query'].unique())

# add a list of unique playlist names to each query key
pl_dict = {}
for query in querylist:
    pl_dict[query] = list(df[
        df['query']== query]['playlist_name'].unique())

# print the query and playlist list for an overview
for query in pl_dict.keys():
    print(query, pl_dict[query])

sad ['sad 90s', 'lonely sad mix', 'depressing rap 😭', 'crying myself to sleep', 'sad', 'sad viral tik tok songs', 'overthinking🥀🖤', 'best of sadar bahar', 'sad rock 🤘', 'sad hours: punjabi', 'sad songs for sad breakups', 'sad corridos♥️', 'sad hour', 'sad songs to cry your heart out to 😭💔', '#sadcuhhours 🥺', 'sad girl starter pack', 'sad songs 2023 😢 crying and depressing music', 'sad songs everyone knows', 'sad indie', 'bhojpuri sad song 😭😣', 'sad rap vibes 2023', 'sad songs 🥺', 'sad sierreño', '💔😭sad songs for crying at 3am😭💔', 'sad tik tok songs 2023 / 2022', 'sad girl country', 'sad country songs to cry to.', 'nf saddest songs ;(', 'sad soul', 'sad covers', 'sad spanish songs to cry in the corner of your room bebe', 'sad songs', 'sad 80s', 'slow sad songs to fall asleep to', 'sad crying mix', 'sad lofi', 'sad classical', 'sad 00s', 'sad boi hours', 'sad love song mix', 'sad song club', "taylor swift but you're sad.", 'sadboy', 'sad instrumentals for sad nights', 'sad ?', 'sad songs
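
The query-to-playlists dictionary built in the loop above could also be produced in one step with `groupby`. This is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset); `sort=False` is used so that the queries keep their order of first appearance, as they do with `df['query'].unique()`.

```python
import pandas as pd

# Toy frame with the two columns the overview needs
toy = pd.DataFrame({
    "query": ["sad", "sad", "calm", "sad", "calm"],
    "playlist_name": ["sad 90s", "sad indie", "calm piano", "sad 90s", "calm piano"],
})

# One list of unique playlist names per query
pl_dict = (
    toy.groupby("query", sort=False)["playlist_name"]
    .unique()       # array of unique names per query
    .apply(list)    # convert each array to a plain list
    .to_dict()
)

# print the query and playlist list for an overview
for query, playlists in pl_dict.items():
    print(query, playlists)
```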

Checking through the list, with some deeper investigation on Spotify, we see a few playlists that don't fit the mood of their query but are too specific to remove with a general rule. Let's remove these 'anomalies' one by one.

In [58]:
rows_before = df.shape[0]

# remove 'lovely' instead of 'lively' playlists
df = df[
    ~((df['playlist_name'].str.contains('lovely')) & (df['query']=='lively'))
]

rows_after = df.shape[0]
print(f"Lovely not lively: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove 'heavy rain' playlists
df = df[
    ~((df['playlist_name'].str.contains('heavy rain')))
]

rows_after = df.shape[0]
print(f"Heavy rain: Removed {rows_before - rows_after}")
rows_before = rows_after


# remove anti-depression playlists where query is 'depress'
df = df[
    ~(((df['playlist_name'].str.contains('anti-d')) | (df['playlist_name'].str.contains('anti d'))) & (df['query']=='depress'))
]

rows_after = df.shape[0]
print(f"Anti-depression in depress: Removed {rows_before - rows_after}")
rows_before = rows_after


# remove playlists that refer to 'less' of the query, e.g. 'less heavy'
# (this matches any playlist name containing 'less')
df = df[
    ~(df['playlist_name'].str.contains('less'))
]

rows_after = df.shape[0]
print(f"Less heavy: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove 'sadar bahar' returned with 'sad'
df = df[
    ~((df['playlist_name'].str.contains('sadar bahar')) & (df['query']=='sad'))
]

rows_after = df.shape[0]
print(f"Sadar bahar: Removed {rows_before - rows_after}")
rows_before = rows_after


# remove playlists that are 'positive affirmations' from 'positive' query
df = df[
    ~((df['playlist_name'].str.contains('positive affirmation')) & (df['query']=='positive'))
]

rows_after = df.shape[0]
print(f"Positive affirmations: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove playlists with 'negative' query as these generally don't fit the mood, e.g. negative space or negative rizz
df = df[
    ~(df['query']=='negative')
]

rows_after = df.shape[0]
print(f"Negative query: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove playlists with 'Selena Gomez - Calm Down' from 'calm' query
df = df[
    ~((df['playlist_name'].str.contains('selena gomez')) & (df['query']=='calm'))
]

rows_after = df.shape[0]
print(f"Selena Gomez calm down: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove 'softball' playlists from 'soft' query
df = df[
    ~((df['playlist_name'].str.contains('softball')) & (df['query']=='soft'))
]

rows_after = df.shape[0]
print(f"Softball in soft: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove 'Blackpink' playlists from 'happiest' query (they have a sad song called 'Happiest Girl')
df = df[
    ~((df['playlist_name'].str.contains('happiest girl')) & (df['query']=='happiest'))
]

rows_after = df.shape[0]
print(f"Blackpink in happiest: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove the 'dynamic' query, which is generally unsuitable,
# e.g. 'dynamic duo' playlists named after the creators, or 'dynamic sleep' and 'dynamic yoga'
df = df[
    ~(df['query']=='dynamic')
]

rows_after = df.shape[0]
print(f"Dynamic: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove playlist 'euphoric destruction', which is a large 389 song playlist
# that contains heavy metal, some of which is the opposite of happy
df = df[
    ~((df['playlist_name'].str.contains('destruction')) & (df['query']=='euphoric'))
]

rows_after = df.shape[0]
print(f"Euphoric destruction: Removed {rows_before - rows_after}")
rows_before = rows_after

# remove 'sunny' query, as it has only returned songs from a movie or related to the location 'sunny beach'

df = df[
    ~(df['query']=='sunny')
]

rows_after = df.shape[0]
print(f"Sunny: Removed {rows_before - rows_after}")
rows_before = rows_after




Lovely not lively: Removed 1414
Heavy rain: Removed 172
Anti-depression in depress: Removed 409
Less heavy: Removed 366
Sadar bahar: Removed 207
Positive affirmations: Removed 261
Negative query: Removed 3728
Selena Gomez calm down: Removed 51
Softball in soft: Removed 122
Blackpink in happiest: Removed 17
Dynamic: Removed 4437
Euphoric destruction: Removed 228
Sunny: Removed 115
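
The cell above repeats the same filter-then-report pattern for every rule. One way to cut the repetition is a small helper, sketched below. `drop_playlists` and its parameters are hypothetical names introduced here, not part of the notebook, and the toy data is made up. The sketch also shows a word-boundary pattern (`\bless\b`) as one option for the 'less' rule: a plain substring match on `less` would also catch names such as 'sleepless'. Whether those names should be kept is an assumption, and the counts printed above come from the substring version.

```python
import pandas as pd

def drop_playlists(frame, label, pattern=None, query=None):
    """Drop rows matching a playlist-name pattern and/or a query; report the count.

    Hypothetical helper: `pattern` is treated as a regular expression on
    'playlist_name', `query` as an exact match on the 'query' column.
    """
    if pattern is None and query is None:
        raise ValueError("Provide a pattern, a query, or both")

    # Start with every row selected, then narrow with each condition given
    mask = pd.Series(True, index=frame.index)
    if pattern is not None:
        mask &= frame["playlist_name"].str.contains(pattern, regex=True, na=False)
    if query is not None:
        mask &= frame["query"] == query

    kept = frame[~mask]
    print(f"{label}: Removed {len(frame) - len(kept)}")
    return kept

# Toy data for illustration only
toy = pd.DataFrame({
    "playlist_name": ["lovely day", "lively bops", "less heavy",
                      "sleepless nights", "negative space"],
    "query": ["lively", "lively", "heavy", "sleepy", "negative"],
})

toy = drop_playlists(toy, "Lovely not lively", pattern="lovely", query="lively")
toy = drop_playlists(toy, "Less heavy", pattern=r"\bless\b")   # whole word only
toy = drop_playlists(toy, "Negative query", query="negative")
```

Returning a new frame rather than modifying in place keeps each rule easy to test on its own.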


Check for duplicates between opposite sides of the moodgrid, as these are a good indicator of a bad playlist download. For example, if a song appears under both a sad and a happy mood, then the playlists it came from may need to be investigated further.

Start by splitting the dataframes into Happy+Sad and Energetic+Chilled


In [59]:
# happy/sad axis
df_HS = df[df['playlist_mood'].isin(['happy', 'sad'])]

# energetic/chilled axis
df_EC = df[df['playlist_mood'].isin(['energetic', 'chill'])]

In [60]:
print(f"There are {df_HS['track_id'].duplicated().sum()} duplicates between Happy and Sad")

There are 2610 duplicates between Happy and Sad


In [61]:
print(f"There are {df_EC['track_id'].duplicated().sum()} duplicates between Energetic and Chilled")

There are 1078 duplicates between Energetic and Chilled
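
Note that `duplicated()` on `track_id` counts every repeat within the subset, including a track that appears twice under the same mood. Whether that can happen here depends on de-duplication earlier in the notebook, which is not shown in this section. As a hedged cross-check, the sketch below counts only the tracks that appear under more than one mood; the toy frame and the names `toy_HS`, `moods_per_track` and `conflicting_ids` are illustrative.

```python
import pandas as pd

# Toy happy/sad subset: 'a' is in both moods, 'b' is repeated within 'sad'
toy_HS = pd.DataFrame({
    "track_id": ["a", "a", "b", "b", "c"],
    "playlist_mood": ["happy", "sad", "sad", "sad", "happy"],
})

# Number of distinct moods each track appears under
moods_per_track = toy_HS.groupby("track_id")["playlist_mood"].nunique()

# Tracks that genuinely sit on both sides of the axis
conflicting_ids = moods_per_track[moods_per_track > 1].index

print(f"Repeated track_ids: {toy_HS['track_id'].duplicated().sum()}")
print(f"Tracks in both moods: {len(conflicting_ids)}")
```

If the two numbers agree on the real data, every repeated track is a true cross-mood conflict.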


Save all of the happy/sad duplicates to a new dataframe, and do the same with energetic/chilled

In [62]:
# keep=False flags every copy of a repeated track_id, not just the later ones
df_HS_duplicates = df_HS[
    df_HS['track_id'].duplicated(keep=False)
]

df_EC_duplicates = df_EC[
    df_EC['track_id'].duplicated(keep=False)
]

In [63]:
df_HS_duplicates[['playlist_name', 'track_name', 'artists']].sample(5, random_state=1)

index | playlist_name | track_name | artists
------|---------------|------------|--------
89192 | canzoni depresse✨🧸💔 | Io e Lei | Vago, Cifra149
44759 | crying in the car by yourself | Mr. Perfectly Fine (Taylor’s Version) (From Th... | Taylor Swift
18027 | sad rap vibes 2023 | Real Shit (with benny blanco) | Juice WRLD, benny blanco
98063 | happy drive | We Are Young (feat. Janelle Monáe) | fun., Janelle Monáe
30376 | carefree | not my job anymore | Thomas Day


In [64]:
df_EC_duplicates[['playlist_name', 'track_name', 'artists']].sample(5, random_state=33)

index | playlist_name | track_name | artists
------|---------------|------------|--------
112997 | divine feminine energy 🧿 | So It Goes | Mac Miller
95435 | mellow rap 😌 | Nasty Girl (feat. Diddy, Nelly, Jagged Edge & ... | The Notorious B.I.G., Avery Storm, Diddy, Jagg...
64637 | energy booster: hip-hop | Drip Too Hard (Lil Baby & Gunna) | Lil Baby, Gunna
64415 | high octane | L.A. Woman | The Doors
9820 | chilled restaurant vibes | Stay | Zedd, Alessia Cara


Investigating the duplicates, we can see they are generally songs whose meaning and mood are subjective and could differ from listener to listener. Later we could investigate whether particular features of these songs tell us something about happiness and sadness, but for now we should drop songs with ambiguous moods to give our model the best chance of identifying the important features.

Save the index of the duplicates and drop based on the index

In [65]:
dup_index_HS = df_HS_duplicates.index
dup_index_EC = df_EC_duplicates.index


Drop both versions of the duplicates, i.e. happy AND sad, as we want to exclude ambiguous songs

In [66]:
rows_before = df.shape[0]
df.drop(dup_index_HS, inplace=True)
rows_after = df.shape[0]
print(f"Dropped {rows_before - rows_after} happy/sad duplicates")

rows_before = rows_after
df.drop(dup_index_EC, inplace=True)
rows_after = df.shape[0]
print(f"Dropped {rows_before - rows_after} energetic/chilled duplicates")

Dropped 5220 happy/sad duplicates
Dropped 2156 energetic/chilled duplicates
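
The dropped counts are exactly double the duplicate counts reported earlier (5220 = 2 × 2610 and 2156 = 2 × 1078), which is consistent with each flagged track appearing exactly twice in its subset. The toy example below is a sketch of why: the default `duplicated()` flags only the later copies, while `keep=False` flags every copy, so dropping by that index removes both versions. The frame and variable names are illustrative.

```python
import pandas as pd

# Toy frame: track 'a' appears once as happy and once as sad
toy = pd.DataFrame(
    {"track_id": ["a", "b", "a", "c"],
     "playlist_mood": ["happy", "happy", "sad", "sad"]},
    index=[10, 11, 12, 13],
)

repeats_only = toy["track_id"].duplicated()            # default keep='first'
every_copy = toy["track_id"].duplicated(keep=False)    # flags both copies of 'a'

print(f"Flagged by default: {repeats_only.sum()}")
print(f"Flagged with keep=False: {every_copy.sum()}")

# Drop by index label, as the notebook does
cleaned = toy.drop(toy[every_copy].index)
print(cleaned)
```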


In [67]:
df['playlist_mood'].value_counts()

happy        26982
sad          26424
chill        23306
energetic    23085
Name: playlist_mood, dtype: int64

In [68]:
# a list of all queries
querylist = list(df['query'].unique())

# add a list of unique playlist names to each query key
pl_dict = {}
for query in querylist:
    pl_dict[query] = list(df[
        df['query']== query]['playlist_name'].unique())

# print the query and playlist list for an overview
for query in pl_dict.keys():
    print(query, pl_dict[query])

sad ['sad 90s', 'lonely sad mix', 'depressing rap 😭', 'crying myself to sleep', 'sad', 'sad viral tik tok songs', 'overthinking🥀🖤', 'sad rock 🤘', 'sad hours: punjabi', 'sad songs for sad breakups', 'sad corridos♥️', 'sad hour', 'sad songs to cry your heart out to 😭💔', '#sadcuhhours 🥺', 'sad girl starter pack', 'sad songs 2023 😢 crying and depressing music', 'sad songs everyone knows', 'sad indie', 'bhojpuri sad song 😭😣', 'sad rap vibes 2023', 'sad sierreño', '💔😭sad songs for crying at 3am😭💔', 'sad tik tok songs 2023 / 2022', 'sad songs 🥺', 'sad girl country', 'sad country songs to cry to.', 'nf saddest songs ;(', 'sad soul', 'sad covers', 'sad spanish songs to cry in the corner of your room bebe', 'sad songs', 'sad 80s', 'slow sad songs to fall asleep to', 'sad lofi', 'sad classical', 'sad 00s', 'sad boi hours', 'sad song club', "taylor swift but you're sad.", 'sad crying mix', 'sadboy', 'sad instrumentals for sad nights', 'sad ?', 'sad songs for the boys 🍻💊', 'sad love song mix', 'moo

Output the cleaned training data for pre-processing and EDA

In [69]:
joblib.dump(df, "./data/pickles/cleaned_train_df.pkl")

['./data/pickles/cleaned_train_df.pkl']