# **BEYOND THE MOOD 🎧 - SPOTIFY EDA, MOOD CLUSTERING & RECOMMENDATION SYSTEM**

---

***Have you ever wondered whether your music taste and listening habits can tell you something about yourself that you didn't know?*** That question is what led me to this project.
I started it in December 2018, but at the time I was working and studying at the same time, so committing to it seriously would have taken too much energy. Time passed, and I ended up diving back into it in December 2020 (yes, two years later 😅).

As I mention on my GitHub, I'm deeply interested in music and psychology, and particularly in the combination of the two. So I started looking at the studies Spotify publishes on its <a href="https://research.atspotify.com">research website</a> and found a lot of interesting things to try. Unfortunately, I couldn't reproduce them because they rely on online survey data that only Spotify has access to.

So I decided to take the simplest route:


1.   Create mood features by clustering the music
2.   Use those features in my homemade recommendation system




*Sources:*


*   <a href="https://developer.spotify.com/documentation/web-api/reference/"> Spotify Web API </a>
*   <a href="https://medium.com/@FinchMF/praise-questions-and-critique-spotify-api-38e984a4174b"> Praise, Questions and Critique: Spotify API </a>
*   <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3138530/"> The Structure of Musical Preferences: A Five-Factor Model </a>
*   <a href="https://benanne.github.io/2014/08/05/spotify-cnns.html"> Recommending music on Spotify with deep learning </a>






In [None]:
!pip install spotipy

# API wrapper
import spotipy


In [None]:
# runtime environment
import os
import sys
import json
import requests


# regular expression, math operation and data manipulation
import re
import math
import copy
import itertools
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing

# graphs
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from plotly.offline import iplot, init_notebook_mode

# others
from pprint import pprint
from fnmatch import fnmatch
from pathlib import Path
from datetime import date, datetime, timedelta
from spotipy.oauth2 import SpotifyClientCredentials

In [None]:
# mount google drive path
from google.colab import drive, data_table
drive.mount("/content/MyDrive/")


# Path where we save all the data
path_to_data = '/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/'

Mounted at /content/MyDrive/


# **1. DATA ACQUISITION**

## **1.1. Streaming data**

Our data comes from two main sources. The first is Spotify Customer Support: we had to ask them to send us our entire streaming history, which is not directly available from the Spotify account page because the data comes in different levels of detail under the GDPR. The second is the Spotify API, which we use to collect the features we need for analysis and modelling, such as audio features and music genres.

### **1.1.1. Import local history files**

In [None]:
def import_local_json(stream_history_path: str, key_string: str) -> pd.DataFrame:
    """ Crawl through a local folder, load every JSON file whose name matches key_string and append them into one DataFrame.

    Args:
        stream_history_path (str): Local folder's path.
        key_string (str): Glob pattern (e.g. 'endsong*.json') that the file names must match.

    Returns:
        df_raw (DataFrame): Concatenated DataFrame from all the loaded files.

    """
    tmp_lst = []

    # loop through each file in the directory
    for file in os.listdir(stream_history_path):

        # import the json file to a dataframe if its filename matches the pattern
        if fnmatch(file, key_string):
            # os.path.join works whether or not the folder path ends with a slash
            data = pd.read_json(os.path.join(stream_history_path, file), orient = 'records', encoding='utf-8')
            tmp_lst.append(data)

    # concat all dataframes
    df_raw = pd.concat(tmp_lst, ignore_index = True)

    return df_raw
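
Note that `fnmatch` does glob matching on the whole file name rather than a plain "contains" test. The small sketch below illustrates this with hypothetical file names (they are not the real export's names, only examples):

```python
from fnmatch import fnmatch

# Hypothetical file names, for illustration only
files = [
    "endsong_0.json",
    "endsong_12.json",
    "endvideo.json",
    "my_endsong_0.json",
    "endsong_0.csv",
]

pattern = "endsong*.json"

# The pattern has to match the full name: it must start with "endsong"
# and end with ".json", so "my_endsong_0.json" is rejected
matched = [name for name in files if fnmatch(name, pattern)]
print(matched)
```

To match the key string anywhere in the name, the pattern would need a leading wildcard, for example `'*endsong*.json'`.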

In [None]:
df_raw = import_local_json(path_to_data, 'endsong*.json') \
                .drop(
                    ['username',
                    'platform',
                    'ip_addr_decrypted',
                    'user_agent_decrypted',
                    'city',
                    'region',
                    'metro_code',
                    'longitude',
                    'latitude'
                    ]
                , axis=1
                )

print('Raw data dimension: ', df_raw.shape)
print()
print('Raw data columns', df_raw.columns)

data_table.DataTable(df_raw, include_index=False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105203 entries, 0 to 105202
Data columns (total 15 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   ts                                 105203 non-null  object 
 1   ms_played                          105203 non-null  int64  
 2   conn_country                       105203 non-null  object 
 3   master_metadata_track_name         99958 non-null   object 
 4   master_metadata_album_artist_name  99958 non-null   object 
 5   master_metadata_album_album_name   99958 non-null   object 
 6   episode_name                       655 non-null     object 
 7   episode_show_name                  655 non-null     object 
 8   reason_start                       105203 non-null  object 
 9   reason_end                         105203 non-null  object 
 10  shuffle                            105203 non-null  bool   
 11  skipped                            4019

### **1.1.2. Building the podcast dataframe**

In [None]:
df_podcasts = df_raw[(df_raw.episode_name.notnull()) & (df_raw.ms_played >= 120000)] \
                        .drop(['master_metadata_track_name'
                             , 'master_metadata_album_artist_name'
                             , 'master_metadata_album_album_name']
                        , axis=1)

print('df_podcasts dimension: ',df_podcasts.shape)

data_table.DataTable(df_podcasts, include_index = False, num_rows_per_page = 5, max_columns = 24)

### **1.1.3. Building the song dataframe**






In [None]:
df_songs = df_raw[(df_raw.master_metadata_track_name.notnull()) \
                  & (df_raw.episode_name.isnull())] \
                  .drop(['episode_name', 'episode_show_name'], axis = 1) \
                  .rename(columns={'master_metadata_track_name': 'track_name',
                                   'master_metadata_album_artist_name': 'artist_name',
                                   'master_metadata_album_album_name': 'album_name'
                                   }
                          )

print('df_songs dimension: ', df_songs.shape)

data_table.DataTable(df_songs, include_index = False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

## **1.2. Features from Spotify Web API**
In this section we retrieve some information per track, such as the track ID, audio features and genres.

In [None]:
# Connect to the Spotify API

CLIENT_ID = '********************************'
CLIENT_SECRET = '********************************'

auth_manager = SpotifyClientCredentials(client_id = CLIENT_ID, client_secret = CLIENT_SECRET)
sp = spotipy.Spotify(auth_manager=auth_manager)

### **1.2.1. Extract id, uri and explicit content per track**

To get the audio features we need the track ID of each song. We therefore build a lookup table of unique artist-track pairs and loop over it with the Spotipy method ```sp.search()``` to get each track ID.
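
As a minimal sketch of that lookup, the snippet below builds the query string and reads the first hit from a search response. The response is a hand-written stand-in that only mimics the `tracks` → `items` nesting used later in this notebook, and the helper names `build_track_query` and `first_track_hit` are hypothetical.

```python
def build_track_query(artist, track):
    """Build the 'artist:... track:...' query string passed to the search call."""
    return f"artist:{artist} track:{track}"


def first_track_hit(search_response):
    """Return id, uri and explicit flag of the first result, or None if there is none."""
    items = search_response.get("tracks", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    return {"track_id": top["id"], "track_uri": top["uri"], "explicit": top["explicit"]}


# Hand-written stand-in for a search response (not real API output)
fake_response = {
    "tracks": {
        "items": [
            {
                "id": "7jxxHhhN6qRHLsInwrXGzn",
                "uri": "spotify:track:7jxxHhhN6qRHLsInwrXGzn",
                "explicit": False,
            }
        ]
    }
}

print(build_track_query("Elle Varner", "Kinda Love"))
print(first_track_hit(fake_response))
print(first_track_hit({"tracks": {"items": []}}))
```

Returning `None` for an empty result keeps the "not found" case explicit, so unmatched tracks can be logged and checked by hand.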

In [None]:
# Deduplicate on "artist_name" and "track_name" to get one row per track (the first album seen is kept)
df_track_nodup = df_songs[['artist_name', 'track_name', 'album_name']].drop_duplicates(subset=['artist_name', 'track_name'], keep='first').copy()


# create cleaned columns to use for API query
df_track_nodup['artist_name_clean'] = df_track_nodup['artist_name'].replace({"-":" ", "'": ""}, regex=True)
# the parentheses are escaped and optional so that both " - Album Version" and " - (Album Version)" are removed
df_track_nodup['track_name_clean'] = df_track_nodup['track_name'].replace({"'":" ", " - Single Version": "", r" - \(?Album Version\)?": ""}, regex=True)
# drop parentheticals starting with "f", e.g. "(feat. ...)"; regex=True must be passed explicitly in pandas 2.0 and later
df_track_nodup['track_name_clean'] = df_track_nodup['track_name_clean'].str.replace(r"\(f.*\)", "", regex=True)
df_track_nodup['album_name_clean'] = df_track_nodup['album_name'].replace({"'":" "}, regex=True)

print("Number of unique songs:", len(df_track_nodup))

data_table.DataTable(df_track_nodup, include_index = False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
def extract_track_id(df):
    """ Extract each track's id, uri and metadata with a Spotify Web API search query on artist and track.
        Then append the artist's name, track's name and album's name with the retrieved fields.

        Args:
                df (DataFrame): Six columns, in this order: artist_name, track_name, album_name
                                and their three "_clean" counterparts.

        Returns
                track_info_list (list): One row per track found
                                        [artist, track, album, id, uri, explicit, duration_ms, release_date]

    """
    track_info_list = []
    for artist, track, album, artist_clean, track_clean, album_clean in df.itertuples(index=False):
        track_info = sp.search(q="artist:" + artist_clean + " track:" + track_clean, limit=50, type='track', market="FR")
        try:
            # keep the first search result
            top_hit = track_info['tracks']['items'][0]
        except IndexError:
            # no result: print the query so it can be checked by hand
            print("artist: " + artist_clean + " | track: " + track_clean + " | album: " + album_clean)
            continue
        track_info_list.append([artist, track, album,
                                top_hit['id'], top_hit['uri'], top_hit['explicit'],
                                top_hit['duration_ms'], top_hit['album']['release_date']])

    return track_info_list

In [None]:
%%time
track_info_list = extract_track_id(df_track_nodup)

len(track_info_list)

In [None]:
# Pass track_info_list into a dataframe

df_song_info = pd.DataFrame(track_info_list, columns = ['artist_name',
                                                        'track_name',
                                                        'album_name',
                                                        'track_id',
                                                        'track_uri',
                                                        'explicit',
                                                        'duration_ms',
                                                        'release_date'])


print('df_song_info dimension:', df_song_info.shape)

data_table.DataTable(df_song_info, include_index = False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
## Filter on ms_played > 30000
#df_song_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_song_info_11042021.csv', index=False)
df_song_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_song_info_27042021.csv', index=False)

In [None]:
## Import of df_song_info
df_song_info = pd.read_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_song_info_27042021.csv', index_col=False, low_memory=False)
print('df_song_info dimension:', df_song_info.shape)

df_song_info dimension: (18270, 8)


### **1.2.2. Extract id, uri, genres and popularity per artist**
Since Spotify assigns music genres per artist and not per track, we need another API query to get those features.


In [None]:
# Deduplicate artist_name to get unique values and avoid redundant API calls

df_artist_nodup = df_songs[['artist_name']].drop_duplicates(subset=['artist_name'], keep='first').copy()
df_artist_nodup['artist_name_clean'] = df_artist_nodup['artist_name'].replace({"-M-":"Matthieu Chedid", "'":""}, regex=True)
len(df_artist_nodup)

6415

In [None]:
def extract_artist_id(df):
    """ Extract id, uri, genres and popularity per artist with a Spotify Web API search query.

        Returns
                artist_info_list (list): One row per artist found [name, id, uri, genres, popularity]

    """
    artist_info_list = []
    # NOTE: the query uses the raw artist_name column
    for name in df['artist_name']:
        try:
            items = sp.search(q="artist:" + name, type='artist')['artists']['items']
            top_hit = items[0]
            genres = top_hit['genres']
            if not genres:
                # the first result has no genre: fall back on the genres of the last result
                genres = items[-1]['genres']
            artist_info_list.append([name, top_hit['id'], top_hit['uri'], genres, top_hit['popularity']])
        except IndexError:
            # no result for this artist: skip it
            pass
    return artist_info_list

In [None]:
%%time
artist_info_list = extract_artist_id(df_artist_nodup)
len(artist_info_list)
'''
Wall time: 10min 32s
'''

CPU times: user 23.2 s, sys: 1.54 s, total: 24.7 s
Wall time: 10min 32s


In [None]:
# Pass artist_info_list into a dataframe
df_artist_info = pd.DataFrame(artist_info_list, columns = ['artist_name',
                                                           'artist_id',
                                                           'artist_uri',
                                                           'artist_genres',
                                                           'artist_popularity'])


print('df_artist_info dimension:', df_artist_info.shape)

df_artist_info dimension: (6337, 5)


In [None]:
## EXPORT
# df_artist_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info.csv', index=False)
# df_artist_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info_11042021.csv', index=False)
# df_artist_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info_27042021.csv', index=False)

In [None]:
## IMPORT
df_artist_info = pd.read_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info_27042021.csv', index_col=False, low_memory=False)
print('df_artist_info dimension:', df_artist_info.shape)

df_artist_info dimension: (6337, 5)


We want to turn each music genre listed in the column ```artist_genres``` into its own boolean column, instead of keeping a single column that contains every genre.
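
As a compact illustration of the idea, here is a sketch on a toy table. It uses `explode` and `pd.get_dummies`, which is an alternative to the helper functions defined in the cells below, not the method used in this notebook; the index values `id_a`, `id_b`, `id_c` are made up.

```python
import pandas as pd

# Toy data: one list of genres per artist (ids are placeholders)
artists = pd.DataFrame(
    {"artist_genres": [["neo soul", "r&b"], ["malagasy folk", "malagasy pop"], []]},
    index=pd.Index(["id_a", "id_b", "id_c"], name="artist_id"),
)

# One row per (artist, genre) pair; an empty list becomes a single NaN row
exploded = artists["artist_genres"].explode()

# One indicator column per genre, then collapse back to one row per artist
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(bool)

print(indicators)
```

An artist with an empty genre list ends up with `False` in every column, which matches what a membership test per genre would give.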

In [None]:
import ast

df_artist_info = df_artist_info.drop_duplicates(subset="artist_id").set_index("artist_id")
print("df_artist_info shape:", df_artist_info.shape)

# Parse the string column artist_genres back into Python lists (literal_eval only evaluates literals, unlike eval)
df_artist_info['artist_genres'] = df_artist_info['artist_genres'].apply(ast.literal_eval)
df_artist_info.head(2)

df_artist_info shape: (6210, 4)


artist_id,artist_name,artist_uri,artist_genres,artist_popularity
7zmk5lkmCMVvfvwF3H8FWC,Elle Varner,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,"[hip pop, neo soul, pop r&b, r&b, urban contem...",47
5KL2vJiSdNo1rrkhurd6St,Jaojoby,spotify:artist:5KL2vJiSdNo1rrkhurd6St,"[malagasy folk, malagasy pop]",25


In [None]:
def to_1D(series):
    """Transforms the list column from 2D to 1D
    """
    return pd.Series([x for _list in series for x in _list])


unique_items = to_1D(df_artist_info["artist_genres"]).value_counts()

In [None]:
def boolean_df(series, unique_items):
    """ Turn each unique genre into a column and flag, with a boolean,
        whether each artist's genre list contains it

    Args:
        series (pandas Series): Series of lists (one list of genres per artist).
        unique_items (iterable of str): Unique genres to turn into columns.

    Returns:
        DataFrame: One boolean column per genre, indexed like the input series.

    """
    # Create empty dict
    bool_dict = {}

    # Loop through all the genres
    for item in unique_items:

        # Flag the rows whose genre list contains the current genre
        bool_dict[item] = series.apply(lambda x: item in x)

    # Return the results as a dataframe
    return pd.DataFrame(bool_dict)

In [None]:
# Turn each unique value of 'artist_genres' into a boolean column; the 'artist_id' index is kept for the merge in a later step
df_artist_genre_bool = boolean_df(df_artist_info["artist_genres"], unique_items.keys())
df_artist_genre_bool.shape
df_artist_genre_bool.head(2)

artist_id,dance pop,pop,hip hop,rap,pop rap,rock,soul,background piano,funk,french hip hop,r&b,post-teen pop,trap,urban contemporary,adult standards,jazz,southern hip hop,gangster rap,lo-fi beats,french pop,chillhop,pop rock,tropical house,cool jazz,vocal jazz,pop urbaine,francoton,edm,electropop,modern rock,pop dance,afropop,focus,quiet storm,motown,latin,hip pop,jazz funk,hardcore hip hop,neo soul,...,bajki,electronica peruana,electro-industrial,arab trap,shoegaze chileno,nuevo folklore mexicano,japanese soundtrack,scam rap,chinese idol pop,kabyle,australian ambient,vintage chanson,praise,byzantine,deep pop edm,riot grrrl,chicago punk,york indie,french techno,cape verdean folk,irish country,atlanta indie,jewish pop,psychedelic doom,anglican liturgy,uppsala indie,hardcore,symphonic black metal,comedy rap,barrelhouse piano,indie catala,japanese electropop,south african modern jazz,cocuk sarkilari,cha-cha-cha,milan indie,dutch cabaret,greek downtempo,muzica crestina,cumbia andina mexicana
7zmk5lkmCMVvfvwF3H8FWC,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,True,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
5KL2vJiSdNo1rrkhurd6St,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Spotify uses a complex algorithm to categorize artists by music genre, and counted 1387 sub-genres in total in 2016 ([*article*](https://www.thestar.com/entertainment/2016/01/14/meet-the-man-classifying-every-genre-of-music-on-spotify-all-1387-of-them.html)).


As we can see, ```df_artist_genre_bool``` has 1949 columns corresponding to music genres, many of which overlap. So we build a dictionary to aggregate the music sub-genres into a more concise set of styles.


>**NB: Building the dictionary was one of the longest tasks in this project because I had to check every single music genre and classify it myself in order to be as accurate as possible.**
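
To make the aggregation concrete, here is a minimal sketch of how such a dictionary can be applied. The three-entry `genre_map` below is a made-up excerpt for illustration (the real dictionary lives in a JSON file and is much larger), and `to_parent_genre` is a hypothetical helper name.

```python
import re

# Made-up excerpt: parent genre -> patterns identifying its sub-genres
genre_map = {
    "hip_hop": ["hip hop", "trap"],
    "jazz": ["jazz"],
    "pop": ["pop"],
}


def to_parent_genre(sub_genre, mapping):
    """Return the first parent genre with a matching pattern, else the sub-genre itself."""
    for parent, patterns in mapping.items():
        if any(re.search(pattern, sub_genre) for pattern in patterns):
            return parent
    return sub_genre


sub_genres = ["french hip hop", "cool jazz", "southern hip hop", "dance pop", "byzantine"]
parents = [to_parent_genre(g, genre_map) for g in sub_genres]
print(parents)
```

Because the first matching key wins in this sketch, the order of the dictionary matters for sub-genres that could belong to several parents.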

In [None]:
# Import the JSON file which contains the dictionary

with open(path_to_data + 'my_spotify_music_genres.json') as json_file:
    new_col_names = json.load(json_file)

In [None]:
# If a column name matches a pattern listed for a key of the dictionary, we rename it as that key

for col in df_artist_genre_bool.columns:
    for k, val_list in new_col_names.items():
        for v in val_list:
            if re.search(v, col):
                df_artist_genre_bool.rename(columns={col:k},inplace=True)

# Unique column names after renaming, in order of first appearance
genres = list(dict.fromkeys(df_artist_genre_bool.columns))

In [None]:
# Drop useless music genres

genres_to_drop = ["abstract", "accordeon", "accordion", "australian psych", "commons", "complextro", "exotica", "supergroup",
                  "women's music", "tzadik", "laboratorio", "bardcore", "dreamo", "cosmic american", "fake", "svensk progg",
                  "ballet class", "banjo", "byzantine", "canzone napoletana", "dark wave", "fluxwork", "strut", "shimmer psych",
                  "c86", "ethereal wave", "double drumming", "deathgrass", "ebm", "franco-flemish school",
                  "slowcore", "groove room", "queercore", "small room", "broken beat", "deconstructed club", "redneck",
                  "future garage", "mashup", "melbourne bounce", "melbourne bounce international", "palm desert scene",
                  "uk garage", "shimmer psych", "kleine hoerspiel"]

# Drop the listed genres that are still present as columns; names that no longer exist are ignored
df_artist_genre_bool = df_artist_genre_bool.drop(columns=genres_to_drop, errors="ignore")

df_artist_genre_bool.shape

(6210, 1909)

In [None]:
# Merge columns that share the same name by combining their values

def sjoin(x):
    return '/'.join(x[x.notnull()].astype(str))

# %time only measures the statement written on the same line
%time df_artist_genre_bool = df_artist_genre_bool.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 8.34 µs
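
For comparison, the same "merge columns that share a name" step can be sketched on a toy table by transposing, grouping on the column names and taking a logical OR. This is an alternative shown under my own assumptions (toy data, made-up ids), not the approach used in the cell above.

```python
import pandas as pd

# Toy boolean table with a duplicated column name ("pop" appears twice)
flags = pd.DataFrame(
    [[True, False, False], [False, False, True], [False, False, False]],
    columns=["pop", "pop", "rock"],
    index=["id_a", "id_b", "id_c"],
)

# Transpose so the duplicated names become index labels, OR the rows that
# share a label, transpose back and convert the booleans to 0/1
merged = flags.T.groupby(level=0).any().T.astype(int)

print(merged)
```

This yields 0/1 flags directly, so no intermediate string such as `'True/False'` has to be parsed afterwards.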


In [None]:
# Convert the joined 'True'/'False' strings into 0/1 flags

for col in df_artist_genre_bool:
    df_artist_genre_bool[col] = df_artist_genre_bool[col].apply(lambda x: 1 if 'True' in x else 0)

df_artist_genre_bool.reset_index(inplace=True)
df_artist_info.reset_index(inplace=True)

print("df_artist_genre_bool shape:", df_artist_genre_bool.shape)
df_artist_genre_bool.head(5)

df_artist_genre_bool shape: (6210, 54)


Unnamed: 0,artist_id,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,focus,folk,funk,gospel,hip_hop,indie,instrumental,japanese_pop,jazz,karaoke,kpop,latin,lo_fi,lullaby,medieval,metal,movie,opera,oratory,pop,rap,reggae,reggaeton,relaxative,religious,rnb,rock,salsa,show,slam_poetry,soul,surf_music,tropical,world
0,7zmk5lkmCMVvfvwF3H8FWC,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0
1,5KL2vJiSdNo1rrkhurd6St,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0L8ExT028jH3ddEcZwqJJ5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,6gK1Uct5FEdaUWRWpU4Cl2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,2Kb4gv8jSstDI7ygRhenuC,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [None]:
# merge df_artist_info and df_artist_genre_bool
df_artist_info = df_artist_info.merge(df_artist_genre_bool, how='left',on='artist_id')
df_artist_info.drop(columns='artist_genres', inplace=True)

print("df_artist_info shape:", df_artist_info.shape)
df_artist_info.head()

df_artist_info shape: (6210, 57)


Unnamed: 0,artist_id,artist_name,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,focus,folk,funk,gospel,hip_hop,indie,instrumental,japanese_pop,jazz,karaoke,kpop,latin,lo_fi,lullaby,medieval,metal,movie,opera,oratory,pop,rap,reggae,reggaeton,relaxative,religious,rnb,rock,salsa,show,slam_poetry,soul,surf_music,tropical,world
0,7zmk5lkmCMVvfvwF3H8FWC,Elle Varner,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0
1,5KL2vJiSdNo1rrkhurd6St,Jaojoby,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0L8ExT028jH3ddEcZwqJJ5,Red Hot Chili Peppers,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,86,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,6gK1Uct5FEdaUWRWpU4Cl2,Petit Biscuit,spotify:artist:6gK1Uct5FEdaUWRWpU4Cl2,73,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
4,2Kb4gv8jSstDI7ygRhenuC,"Spa, Relaxation and Dreams",spotify:artist:2Kb4gv8jSstDI7ygRhenuC,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [None]:
#df_artist_info.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info_complete_27042021.csv', index=False)

In [None]:
df_artist_info = pd.read_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_artist_info_complete_27042021.csv', index_col=False, low_memory=False)
df_artist_info.head(2)

### **1.2.3. Audio features extraction**

In [None]:
print("Table df_song_info before deduplication:", len(df_song_info))
print("Table df_song_info after deduplication:", len(df_song_info[['track_id']].drop_duplicates(subset=['track_id'])))
data_table.DataTable(df_song_info, include_index = False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
def extract_audio_features():
    """ Query the Spotify Web API for the audio features of each unique track_id in df_song_info.

        Returns
                track_audio_features (list): One row of audio features per track found

    """
    feature_names = ['id', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'key', 'liveness',
                     'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence']
    track_audio_features = []
    # one API call per unique track id
    for tmp_id in df_song_info['track_id'].drop_duplicates():
        audio_ft = sp.audio_features(tracks=[tmp_id])
        try:
            track_audio_features.append([audio_ft[0].get(name) for name in feature_names])
        except (IndexError, AttributeError):
            # no audio features returned for this track: skip it
            continue
    return track_audio_features
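
Calling the API once per track is what makes this step slow. As far as I know, the audio-features endpoint accepts several track IDs per request (up to 100), so the calls could be batched. The sketch below shows the batching logic only: `FakeClient` is a stand-in I wrote so the example runs offline, and `chunked` and `fetch_audio_features` are hypothetical helper names, not part of Spotipy.

```python
def chunked(ids, size=100):
    """Yield successive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_audio_features(client, track_ids, batch_size=100):
    """Collect audio features batch by batch, skipping tracks without features."""
    rows = []
    for batch in chunked(track_ids, batch_size):
        for features in client.audio_features(tracks=batch):
            if features is not None:
                rows.append(features)
    return rows


class FakeClient:
    """Offline stand-in for the API client (assumption, for illustration only)."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        # Return None for ids flagged as missing, a small dict otherwise
        return [None if t.startswith("missing") else {"id": t, "energy": 0.5} for t in tracks]


client = FakeClient()
ids = [f"t{i}" for i in range(250)] + ["missing_1"]
rows = fetch_audio_features(client, ids)
print(len(rows), "rows fetched in", client.calls, "calls")
```

With the real client, `client` would be the `sp` object created earlier; the batch size limit should be checked against the current API documentation before relying on it.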

In [None]:
%%time
track_audio_features = extract_audio_features()
len(track_audio_features)

'''
Wall time: 29min
'''

CPU times: user 1min 15s, sys: 6.18 s, total: 1min 21s
Wall time: 29min


In [None]:
df_audio_features = pd.DataFrame(track_audio_features,
                                 columns = ['track_id',
                                            'acousticness',
                                            'danceability',
                                            'energy',
                                            'instrumentalness',
                                            'key',
                                            'liveness',
                                            'loudness',
                                            'mode',
                                            'speechiness',
                                            'tempo',
                                            'time_signature',
                                            'valence'])


print('df_audio_features dimension:', df_audio_features.shape)

data_table.DataTable(df_audio_features, include_index = False, num_rows_per_page = 5, max_columns = 24)

Output hidden; open in https://colab.research.google.com to view.

In [None]:
# export df_audio_features
#df_audio_features.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_audio_features_11042021.csv', index=False)
#df_audio_features.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_audio_features_27042021.csv', index=False)

In [None]:
df_audio_features = pd.read_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_audio_features_27042021.csv', index_col=False, low_memory=False)

## **1.3. Merging DataFrames**

In [None]:
# Join between df_songs and df_song_info to have the original dataframe with the songs features

df_join_songs_infos = pd.merge(df_songs,
                         df_song_info,
                         how = "left",
                         on = ['artist_name', 'track_name', 'album_name'])


print('df_join_songs_infos shape:', df_join_songs_infos.shape)
data_table.DataTable(df_join_songs_infos, include_index = False, num_rows_per_page = 10, max_columns = 40)

Output hidden; open in https://colab.research.google.com to view.
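
Since some plays have no match in the API results, it can help to audit a left join before building on it. Here is a small sketch on toy data: `indicator=True` adds a `_merge` column that flags unmatched rows, and `validate` checks that the right-hand keys are unique. The "Unknown Artist" row is made up for the example.

```python
import pandas as pd

# Toy streaming history (the last row is made up and has no match)
plays = pd.DataFrame({
    "artist_name": ["Elle Varner", "Jaojoby", "Unknown Artist"],
    "track_name": ["Kinda Love", "E tiako", "Untitled"],
    "ms_played": [14725, 35632, 1000],
})

# Toy track information, one row per (artist, track)
info = pd.DataFrame({
    "artist_name": ["Elle Varner", "Jaojoby"],
    "track_name": ["Kinda Love", "E tiako"],
    "track_id": ["7jxxHhhN6qRHLsInwrXGzn", "6VKZQ4mivre7UQho29VZCp"],
})

joined = plays.merge(
    info,
    how="left",
    on=["artist_name", "track_name"],
    indicator=True,          # adds the "_merge" column
    validate="many_to_one",  # raises if a key is duplicated in `info`
)

# Plays that found no track information
unmatched = joined[joined["_merge"] == "left_only"]
print(unmatched[["artist_name", "track_name"]])
```

A left join keeps every play, so the row count of `joined` should equal that of `plays` as long as the right-hand keys are unique.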

In [None]:
# Join df_join_songs_info and df_artist_info

df_join_songs_artist_info = pd.merge(df_join_songs_infos, df_artist_info,
                  how="left",
                  on='artist_name')

print('df_join_songs_artist_info dimension:', df_join_songs_artist_info.shape)
df_join_songs_artist_info.head()

df_join_songs_artist_info dimension: (99958, 74)


Unnamed: 0,ts,ms_played,conn_country,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_id,track_uri,explicit,duration_ms,release_date,artist_id,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,focus,folk,funk,gospel,hip_hop,indie,instrumental,japanese_pop,jazz,karaoke,kpop,latin,lo_fi,lullaby,medieval,metal,movie,opera,oratory,pop,rap,reggae,reggaeton,relaxative,religious,rnb,rock,salsa,show,slam_poetry,soul,surf_music,tropical,world
0,2019-06-01T09:22:34Z,14725,FR,Kinda Love,Elle Varner,Kinda Love,trackdone,endplay,False,,False,1559380934739,False,7jxxHhhN6qRHLsInwrXGzn,spotify:track:7jxxHhhN6qRHLsInwrXGzn,False,194502.0,2019-07-12,7zmk5lkmCMVvfvwF3H8FWC,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,2020-04-03T15:23:03Z,35632,FR,E tiako,Jaojoby,E tiako (Madagascar),trackdone,unexpected-exit-while-paused,False,,False,1585900675181,False,6VKZQ4mivre7UQho29VZCp,spotify:track:6VKZQ4mivre7UQho29VZCp,False,329440.0,1998-05-05,5KL2vJiSdNo1rrkhurd6St,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2020-05-24T21:24:55Z,78380,FR,Goodbye Angels,Red Hot Chili Peppers,The Getaway,clickrow,logout,True,,False,1590351419209,False,2XTkpF9T2PKvcLgamGJGx1,spotify:track:2XTkpF9T2PKvcLgamGJGx1,False,268733.0,2016-06-17,0L8ExT028jH3ddEcZwqJJ5,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2020-01-14T22:13:44Z,2150,FR,We Were Young,Petit Biscuit,We Were Young,clickrow,endplay,False,,False,1579040021737,False,1USj0dJqfBxnOiwiOuB7pU,spotify:track:1USj0dJqfBxnOiwiOuB7pU,False,214200.0,2019-05-28,6gK1Uct5FEdaUWRWpU4Cl2,spotify:artist:6gK1Uct5FEdaUWRWpU4Cl2,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,2019-03-23T16:05:08Z,1220,FR,A Light Rain in the Forest,"Spa, Relaxation and Dreams","Rain Drop Medley of Roof, Thunder, Forest, Car...",clickrow,endplay,False,,False,1553357104299,False,6sZoMDORjyGJqZjlitEdy7,spotify:track:6sZoMDORjyGJqZjlitEdy7,False,48299.0,2015-06-06,2Kb4gv8jSstDI7ygRhenuC,spotify:artist:2Kb4gv8jSstDI7ygRhenuC,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Join df_join_songs_artist_info and df_audio_features

df = pd.merge(df_join_songs_artist_info, df_audio_features,
              how="left",
              on='track_id')

df = df.dropna(subset=['track_id', 'artist_id'])

print('df shape:', df.shape)
df.head()

df shape: (92740, 86)


Unnamed: 0,ts,ms_played,conn_country,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_id,track_uri,explicit,duration_ms,release_date,artist_id,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,...,instrumental,japanese_pop,jazz,karaoke,kpop,latin,lo_fi,lullaby,medieval,metal,movie,opera,oratory,pop,rap,reggae,reggaeton,relaxative,religious,rnb,rock,salsa,show,slam_poetry,soul,surf_music,tropical,world,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,2019-06-01T09:22:34Z,14725,FR,Kinda Love,Elle Varner,Kinda Love,trackdone,endplay,False,,False,1559380934739,False,7jxxHhhN6qRHLsInwrXGzn,spotify:track:7jxxHhhN6qRHLsInwrXGzn,False,194502.0,2019-07-12,7zmk5lkmCMVvfvwF3H8FWC,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.197,0.63,0.641,4.3e-05,2.0,0.107,-7.743,1.0,0.0829,95.912,4.0,0.364
1,2020-04-03T15:23:03Z,35632,FR,E tiako,Jaojoby,E tiako (Madagascar),trackdone,unexpected-exit-while-paused,False,,False,1585900675181,False,6VKZQ4mivre7UQho29VZCp,spotify:track:6VKZQ4mivre7UQho29VZCp,False,329440.0,1998-05-05,5KL2vJiSdNo1rrkhurd6St,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0701,0.787,0.66,6e-05,2.0,0.148,-9.856,1.0,0.0443,112.011,4.0,0.802
2,2020-05-24T21:24:55Z,78380,FR,Goodbye Angels,Red Hot Chili Peppers,The Getaway,clickrow,logout,True,,False,1590351419209,False,2XTkpF9T2PKvcLgamGJGx1,spotify:track:2XTkpF9T2PKvcLgamGJGx1,False,268733.0,2016-06-17,0L8ExT028jH3ddEcZwqJJ5,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104,0.365,0.804,6.6e-05,9.0,0.15,-5.922,0.0,0.105,171.597,4.0,0.577
3,2020-01-14T22:13:44Z,2150,FR,We Were Young,Petit Biscuit,We Were Young,clickrow,endplay,False,,False,1579040021737,False,1USj0dJqfBxnOiwiOuB7pU,spotify:track:1USj0dJqfBxnOiwiOuB7pU,False,214200.0,2019-05-28,6gK1Uct5FEdaUWRWpU4Cl2,spotify:artist:6gK1Uct5FEdaUWRWpU4Cl2,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.259,0.741,0.54,0.0,5.0,0.114,-6.915,1.0,0.156,100.048,4.0,0.333
4,2019-03-23T16:05:08Z,1220,FR,A Light Rain in the Forest,"Spa, Relaxation and Dreams","Rain Drop Medley of Roof, Thunder, Forest, Car...",clickrow,endplay,False,,False,1553357104299,False,6sZoMDORjyGJqZjlitEdy7,spotify:track:6sZoMDORjyGJqZjlitEdy7,False,48299.0,2015-06-06,2Kb4gv8jSstDI7ygRhenuC,spotify:artist:2Kb4gv8jSstDI7ygRhenuC,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.2e-05,0.275,0.448,0.87,1.0,0.323,-21.215,1.0,0.0987,69.994,3.0,0.132


In [None]:
#df.to_pickle('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_spotify.pkl')
#df.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_spotify_11_04_2021.csv', index=False)
df.to_csv('/content/MyDrive/My Drive/Colab Notebooks/my_spotify_data/df_spotify_27_04_2021.csv', index=False)

#**2. DATA PREPROCESSING**

In [None]:
df = pd.read_csv(path_to_data + 'df_spotify_27_04_2021.csv', index_col=False)
print('df shape:', df.shape)
df.head(3)

df shape: (92740, 86)


Unnamed: 0,ts,ms_played,conn_country,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,offline,offline_timestamp,incognito_mode,track_id,track_uri,explicit,duration_ms,release_date,artist_id,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,...,instrumental,japanese_pop,jazz,karaoke,kpop,latin,lo_fi,lullaby,medieval,metal,movie,opera,oratory,pop,rap,reggae,reggaeton,relaxative,religious,rnb,rock,salsa,show,slam_poetry,soul,surf_music,tropical,world,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
0,2019-06-01T09:22:34Z,14725,FR,Kinda Love,Elle Varner,Kinda Love,trackdone,endplay,False,,False,1559380934739,False,7jxxHhhN6qRHLsInwrXGzn,spotify:track:7jxxHhhN6qRHLsInwrXGzn,False,194502.0,2019-07-12,7zmk5lkmCMVvfvwF3H8FWC,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.197,0.63,0.641,4.3e-05,2.0,0.107,-7.743,1.0,0.0829,95.912,4.0,0.364
1,2020-04-03T15:23:03Z,35632,FR,E tiako,Jaojoby,E tiako (Madagascar),trackdone,unexpected-exit-while-paused,False,,False,1585900675181,False,6VKZQ4mivre7UQho29VZCp,spotify:track:6VKZQ4mivre7UQho29VZCp,False,329440.0,1998-05-05,5KL2vJiSdNo1rrkhurd6St,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0701,0.787,0.66,6e-05,2.0,0.148,-9.856,1.0,0.0443,112.011,4.0,0.802
2,2020-05-24T21:24:55Z,78380,FR,Goodbye Angels,Red Hot Chili Peppers,The Getaway,clickrow,logout,True,,False,1590351419209,False,2XTkpF9T2PKvcLgamGJGx1,spotify:track:2XTkpF9T2PKvcLgamGJGx1,False,268733.0,2016-06-17,0L8ExT028jH3ddEcZwqJJ5,spotify:artist:0L8ExT028jH3ddEcZwqJJ5,86.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.104,0.365,0.804,6.6e-05,9.0,0.15,-5.922,0.0,0.105,171.597,4.0,0.577


##**2.1. Cleaning missing values**

In [None]:
# Count missing values per column
print("Missing values per column: \n-------------------------")
for col in df:
    print("{}:".format(col) , df[col].isnull().sum())

Missing values per column: 
-------------------------
ts: 0
ms_played: 0
conn_country: 0
track_name: 0
artist_name: 0
album_name: 0
reason_start: 0
reason_end: 0
shuffle: 0
skipped: 89327
offline: 0
offline_timestamp: 0
incognito_mode: 0
track_id: 0
track_uri: 0
explicit: 0
duration_ms: 0
release_date: 0
artist_id: 0
artist_uri: 0
artist_popularity: 0
a_cappella: 0
adult_standards: 0
afro: 0
arab_pop: 0
baroque: 0
beatboxing: 0
blues: 0
bossa_nova: 0
cabaret: 0
chanson_francaise: 0
children_music: 0
choir: 0
classical: 0
comic: 0
country: 0
dance: 0
disco: 0
edm: 0
experimental: 0
focus: 0
folk: 0
funk: 0
gospel: 0
hip_hop: 0
indie: 0
instrumental: 0
japanese_pop: 0
jazz: 0
karaoke: 0
kpop: 0
latin: 0
lo_fi: 0
lullaby: 0
medieval: 0
metal: 0
movie: 0
opera: 0
oratory: 0
pop: 0
rap: 0
reggae: 0
reggaeton: 0
relaxative: 0
religious: 0
rnb: 0
rock: 0
salsa: 0
show: 0
slam_poetry: 0
soul: 0
surf_music: 0
tropical: 0
world: 0
acousticness: 14
danceability: 14
energy: 14
instrumentalness: 14
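
With more than 80 columns the full listing is long. A shorter view, sketched here on a toy frame (the column values are made up), keeps only the columns that actually contain missing values.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up).
toy = pd.DataFrame({
    "track_id": ["a", "b", "c"],
    "skipped": [np.nan, 1.0, np.nan],
    "energy": [0.5, np.nan, 0.7],
})

# isna().sum() counts missing values for every column at once.
missing = toy.isna().sum()

# Keep only the columns with at least one missing value, largest first.
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```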

In [None]:
# Drop rows where audio feature values are missing
df = df.dropna(subset = ["acousticness", "danceability", "energy", "instrumentalness",
                         "key", "liveness", "loudness", "mode", "speechiness", "tempo",
                         "time_signature", "valence"])


# Replace missing values in the skipped column with 0
df["skipped"] = df["skipped"].fillna(0).astype(int)

In [None]:
# Drop all non-musical tracks such as nature sounds, ocean noises, commentary, interviews, etc.
useless_tracks = ["Rain", "Ocean Sounds", "Ocean Noises","Ambient ", "Forest",
                  "Birds", "Relaxing ", "Nature", "Nature Sounds", "White Noise",
                  "Wave", "Whale", "Background Noise", "Binaural", "River Sounds",
                  "California State Beach", "3D Audio Textures of the River",
                  "High Tide", "These April Shores", "Birds By The River", "Deep Sleep",
                  "Umbrella Weather", "3D Audio Textures", "Oceanic ", "Rain Sounds",
                  "Thunderous Relaxation", "Calm Rain", " Natural Samples", "Venice, Italy",
                  "Dripping and Thunder", "Ocean Breez", "Cloud Collection",
                  "Fiji -Tropical Islands-", "Ocean Walk", "The Beautiful Sounds Of The Ocean",
                  "Coast of Carmel", "Peaceful Wildlife", "Thunderstorm", "Seawaves",
                  "Contemplation by the Sea", "Beach Sounds", "Brain Tingle",
                  "Natural Ambience for Mindfulness", "Hawaii -The Big Island-",
                  "Sound Effects", "Disneynature", "Commentary", "Interview",
                  "Soothing Evening waves Of Nerja", "Dolphin Sounds", "Animal and Bird Songs"]


def drop_rows_contains(dataFrame, column, drop_list):
    df_cleaned = dataFrame[~dataFrame[column].str.contains('|'.join(drop_list))]
    return df_cleaned

In [None]:
df = drop_rows_contains(df, 'album_name', useless_tracks)
df = df[df['album_name'] != "Sleep"]
print("df rows number:", len(df))

df rows number: 90797
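
One thing to keep in mind with this kind of filter: `str.contains` treats the joined pattern as a regular expression by default, so a keyword containing characters such as `(`, `.` or `+` would be read as regex syntax. The sketch below shows a variant that escapes each keyword first; the album names and the `drop_list` are hypothetical examples, not entries from my data.

```python
import re
import pandas as pd

# Hypothetical album names and keywords, for illustration only.
toy = pd.DataFrame({
    "album_name": ["Rain Sounds Vol. 1", "The Getaway", "Live (Deluxe)"]
})
drop_list = ["Rain", "(Deluxe)"]  # "(Deluxe)" contains regex metacharacters

# re.escape makes every keyword match literally, then "|" joins them as alternatives.
pattern = "|".join(re.escape(keyword) for keyword in drop_list)

# na=False treats a missing album name as "no match" instead of propagating NaN.
kept = toy[~toy["album_name"].str.contains(pattern, na=False)]
print(kept["album_name"].tolist())
```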


##**2.2. Formatting columns and Encoding features**

Here is how the Spotify documentation describes these audio features:


1.   **acousticness** (*float*): a confidence measure from ```0.0``` to ```1.0``` of whether the track is acoustic. ```1.0``` represents high confidence the track is acoustic.

2.   **danceability** (*float*):  describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of ```0.0``` is least danceable and ```1.0``` is most danceable.

3.   **energy** (*float*): is a measure from ```0.0``` to ```1.0``` and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.

4.   **instrumentalness** (*float*): predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to ```1.0```, the greater likelihood the track contains no vocal content. Values above ```0.5``` are intended to represent instrumental tracks, but confidence is higher as the value approaches ```1.0```.

5.   **key** (*integer*): the key the track is in. Integers map to pitches using standard Pitch Class Notation. E.g. ```0 = C```, ```1 = C♯/D♭```, ```2 = D```, and so on. 

<p align="center">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Integer_notation.png/525px-Integer_notation.png">
</p>

6.   **liveness** (*float*): detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above ```0.8``` provides strong likelihood that the track is live.

7.   **loudness** (*float*): the overall loudness of a track in decibels (**dB**). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between ```-60``` and ```0``` dB.

8.   **mode** (*integer*): indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by ```1``` and minor is ```0```.

9.   **speechiness** (*float*): detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to ```1.0``` the attribute value. Values above ```0.66``` describe tracks that are probably made entirely of spoken words. Values between ```0.33``` and ```0.66``` describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below ```0.33``` most likely represent music and other non-speech-like tracks

10.  **tempo** (*float*): the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

<p align="center">
<img src="https://i.pinimg.com/originals/bc/65/a7/bc65a7c363b1487f9131a62ff7d5f2b6.png">
</p>

11.  **time_signature** (*integer*): an estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

12.  **valence** (*float*): a measure from ```0.0``` to ```1.0``` describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
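
To make the `key`, `mode` and `speechiness` descriptions above more concrete, here is a small sketch that turns the raw numbers into readable labels. The helper names `describe_key` and `speech_category` are hypothetical (they are not used elsewhere in this notebook); the mappings simply follow the definitions listed above.

```python
# Pitch classes in standard integer notation (0 = C ... 11 = B).
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key: int, mode: int) -> str:
    """Return a label such as 'D major' (hypothetical helper)."""
    if not 0 <= key <= 11:
        return "unknown"  # anything outside the 12 pitch classes
    return f"{PITCH_CLASSES[key]} {'major' if mode == 1 else 'minor'}"


def speech_category(speechiness: float) -> str:
    """Bucket speechiness with the 0.33 / 0.66 thresholds described above."""
    if speechiness > 0.66:
        return "speech"
    if speechiness >= 0.33:
        return "mixed"
    return "music"


# "Kinda Love" has key=2, mode=1 and speechiness=0.0829 in the table above.
print(describe_key(2, 1), "|", speech_category(0.0829))
```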



In [None]:
# transform specific columns to integer
col_to_boolean = ["shuffle","offline", "incognito_mode", "explicit",
                  "artist_popularity", "a_cappella", "adult_standards", "afro", 
                  "arab_pop", "baroque", "beatboxing","blues", "bossa_nova",
                  "cabaret", "chanson_francaise", "children_music", "choir",
                  "classical", "comic", "country", "dance", "disco", "edm",
                  "experimental", "focus", "folk", "funk", "gospel", "hip_hop",
                  "indie", "instrumental", "japanese_pop", "jazz", "karaoke",
                  "kpop", "latin", "lo_fi", "lullaby", "medieval", "metal", 
                  "movie", "opera", "oratory", "pop", "rnb", "rap", "reggae",
                  "reggaeton", "relaxative", "religious", "rock", "salsa", "show",
                  "slam_poetry", "soul", "surf_music", "tropical", "world",
                  "key", "mode", "time_signature"]

for col in col_to_boolean:
    df[col] = df[col].astype(np.int64)


# Transform ts column's format to datetime and extract temporal features
df['ts'] = pd.to_datetime(df['ts']).dt.tz_localize(None)
df['day'] = pd.DatetimeIndex(df['ts']).day
df['hour'] = pd.DatetimeIndex(df['ts']).hour
df['year'] = pd.DatetimeIndex(df['ts']).year
df['times'] = pd.DatetimeIndex(df['ts']).time
df['month'] = pd.DatetimeIndex(df['ts']).month
df['month_name'] = pd.DatetimeIndex(df['ts']).month_name()
df['release_year'] = pd.DatetimeIndex(df['release_date']).year


# Convert milliseconds to seconds
df['s_played'] = (df['ms_played'] / 1000).astype(int)
df['duration_sec'] = (df['duration_ms'] / 1000).astype(int)


# Drop the columns that are no longer needed
df.drop(columns=['conn_country', 'offline_timestamp','release_date', 'ms_played', 'duration_ms'], inplace=True)
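
The same temporal features can also be written with the `.dt` accessor, which some readers find easier to follow than wrapping the column in `pd.DatetimeIndex`. This is just a sketch of that alternative on two timestamps copied from the table above, not the code this notebook runs.

```python
import pandas as pd

# Two rows copied from the preview above.
toy = pd.DataFrame({
    "ts": ["2019-06-01T09:22:34Z", "2020-04-03T15:23:03Z"],
    "ms_played": [14725, 35632],
})

# Parse the ISO timestamps, then drop the timezone information.
toy["ts"] = pd.to_datetime(toy["ts"]).dt.tz_localize(None)

# The .dt accessor exposes the datetime components directly on the Series.
toy["hour"] = toy["ts"].dt.hour
toy["month_name"] = toy["ts"].dt.month_name()

# Integer division gives whole seconds (same result as truncating for non-negative values).
toy["s_played"] = toy["ms_played"] // 1000
print(toy)
```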

In [None]:
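# Create binary indicator features for each tempo marking.
# The ranges overlap, so a track can have more than one flag set.
# Note: a bare `%time` only times the empty statement on its own line;
# use `%%time` as the first line of the cell to time the whole cell.
%time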
df['tempo_larghissimo'] = [1 if x <= 20 else 0 for x in df['tempo']]
df['tempo_grave'] = [1 if x >= 20 and x <= 40 else 0 for x in df['tempo']]
df['tempo_lento'] = [1 if x >= 40 and x <= 60 else 0 for x in df['tempo']]
df['tempo_largo'] = [1 if x >= 60 and x <= 66 else 0 for x in df['tempo']]
df['tempo_adagio'] = [1 if x >= 66 and x <= 76 else 0 for x in df['tempo']]
df['tempo_andante'] = [1 if x >= 76 and x <= 108 else 0 for x in df['tempo']]
df['tempo_moderato'] = [1 if x >= 108 and x <= 120 else 0 for x in df['tempo']]
df['tempo_allegretto'] = [1 if x >= 112 and x <= 124 else 0 for x in df['tempo']]
df['tempo_allegro'] = [1 if x >= 120 and x <= 168 else 0 for x in df['tempo']]
# Note: exact equality, so only a tempo of exactly 140 BPM is flagged here
df['tempo_vivace'] = [1 if x == 140 else 0 for x in df['tempo']]
df['tempo_presto'] = [1 if x >= 168 and x <= 200 else 0 for x in df['tempo']]
df['tempo_prestissimo'] = [1 if x > 200 else 0 for x in df['tempo']]

df.head(2)

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs


Unnamed: 0,ts,track_name,artist_name,album_name,reason_start,reason_end,shuffle,skipped,offline,incognito_mode,track_id,track_uri,explicit,artist_id,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,focus,folk,funk,gospel,hip_hop,...,salsa,show,slam_poetry,soul,surf_music,tropical,world,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,day,hour,year,times,month,month_name,release_year,s_played,duration_sec,tempo_larghissimo,tempo_grave,tempo_lento,tempo_largo,tempo_adagio,tempo_andante,tempo_moderato,tempo_allegretto,tempo_allegro,tempo_vivace,tempo_presto,tempo_prestissimo
0,2019-06-01 09:22:34,Kinda Love,Elle Varner,Kinda Love,trackdone,endplay,0,0,0,0,7jxxHhhN6qRHLsInwrXGzn,spotify:track:7jxxHhhN6qRHLsInwrXGzn,0,7zmk5lkmCMVvfvwF3H8FWC,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,0.197,0.63,0.641,4.3e-05,2,0.107,-7.743,1,0.0829,95.912,4,0.364,1,9,2019,09:22:34,6,June,2019,14,194,0,0,0,0,0,1,0,0,0,0,0,0
1,2020-04-03 15:23:03,E tiako,Jaojoby,E tiako (Madagascar),trackdone,unexpected-exit-while-paused,0,0,0,0,6VKZQ4mivre7UQho29VZCp,spotify:track:6VKZQ4mivre7UQho29VZCp,0,5KL2vJiSdNo1rrkhurd6St,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0.0701,0.787,0.66,6e-05,2,0.148,-9.856,1,0.0443,112.011,4,0.802,3,15,2020,15:23:03,4,April,1998,35,329,0,0,0,0,0,0,1,1,0,0,0,0
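
The tempo flags can also be built without list comprehensions. The sketch below is a possible vectorized variant using `Series.between`, which is inclusive on both ends by default like the `>=` / `<=` tests above. The dictionary `tempo_ranges` is a hypothetical name and only holds a subset of the ranges, copied from the cell above.

```python
import pandas as pd

# Three tempo values taken from the previews above.
tempo = pd.Series([95.912, 112.011, 171.597], name="tempo")

# (low, high) bounds copied from the cell above; only a subset for brevity.
tempo_ranges = {
    "tempo_andante": (76, 108),
    "tempo_moderato": (108, 120),
    "tempo_allegretto": (112, 124),
    "tempo_presto": (168, 200),
}

# between() is inclusive on both bounds by default.
flags = pd.DataFrame({
    name: tempo.between(low, high).astype(int)
    for name, (low, high) in tempo_ranges.items()
})
print(flags)
```

Because moderato and allegretto overlap, the second track (112.011 BPM) gets both flags, exactly as in the output above.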


In [None]:
# One hot encoding
df = pd.get_dummies(df, columns = ["reason_start", "reason_end"], prefix = ["reason_start", "reason_end"])
df.head(2)

Unnamed: 0,ts,track_name,artist_name,album_name,shuffle,skipped,offline,incognito_mode,track_id,track_uri,explicit,artist_id,artist_uri,artist_popularity,a_cappella,adult_standards,afro,arab_pop,baroque,beatboxing,blues,bossa_nova,cabaret,chanson_francaise,children_music,choir,classical,comic,country,dance,disco,edm,experimental,focus,folk,funk,gospel,hip_hop,indie,instrumental,...,day,hour,year,times,month,month_name,release_year,s_played,duration_sec,tempo_larghissimo,tempo_grave,tempo_lento,tempo_largo,tempo_adagio,tempo_andante,tempo_moderato,tempo_allegretto,tempo_allegro,tempo_vivace,tempo_presto,tempo_prestissimo,reason_start_appload,reason_start_backbtn,reason_start_clickrow,reason_start_fwdbtn,reason_start_playbtn,reason_start_remote,reason_start_trackdone,reason_start_trackerror,reason_start_unknown,reason_end_backbtn,reason_end_endplay,reason_end_fwdbtn,reason_end_logout,reason_end_remote,reason_end_trackdone,reason_end_trackerror,reason_end_unexpected-exit,reason_end_unexpected-exit-while-paused,reason_end_unknown
0,2019-06-01 09:22:34,Kinda Love,Elle Varner,Kinda Love,0,0,0,0,7jxxHhhN6qRHLsInwrXGzn,spotify:track:7jxxHhhN6qRHLsInwrXGzn,0,7zmk5lkmCMVvfvwF3H8FWC,spotify:artist:7zmk5lkmCMVvfvwF3H8FWC,47,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,1,9,2019,09:22:34,6,June,2019,14,194,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0
1,2020-04-03 15:23:03,E tiako,Jaojoby,E tiako (Madagascar),0,0,0,0,6VKZQ4mivre7UQho29VZCp,spotify:track:6VKZQ4mivre7UQho29VZCp,0,5KL2vJiSdNo1rrkhurd6St,spotify:artist:5KL2vJiSdNo1rrkhurd6St,25,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,3,15,2020,15:23:03,4,April,1998,35,329,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0


##**2.4. Cleaning Outliers**

In [None]:
# descriptive statistics
df[['artist_popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode', 'speechiness', 'tempo', 'time_signature', 'valence']].describe()

Unnamed: 0,artist_popularity,acousticness,danceability,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence
count,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0,90797.0
mean,61.119332,0.394431,0.608008,0.551186,0.189622,5.215051,0.18792,-9.539341,0.573521,0.107599,117.54876,3.915085,0.502637
std,20.962512,0.34248,0.173094,0.240775,0.33712,3.570045,0.170549,5.460166,0.494568,0.109771,29.783116,0.420064,0.246533
min,0.0,3e-06,0.0,0.000122,0.0,0.0,0.0,-46.583,0.0,0.0,0.0,0.0,0.0
25%,49.0,0.0711,0.497,0.384,0.0,2.0,0.0921,-11.621,0.0,0.0384,94.03,4.0,0.316
50%,64.0,0.294,0.617,0.585,9.1e-05,5.0,0.117,-8.081,1.0,0.0563,113.96,4.0,0.498
75%,77.0,0.724,0.737,0.735,0.163,8.0,0.215,-5.81,1.0,0.131,135.818,4.0,0.694
max,100.0,0.996,0.98,1.0,0.992,11.0,0.993,1.485,1.0,0.967,220.036,5.0,0.986


As the descriptive statistics show, all audio features take values between ```0``` and ```1``` except ```artist_popularity```, ```key```, ```loudness```, ```tempo``` and ```time_signature```. We will scale them later, just before modelling, because some of the EDA needs these columns in their original units.
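
To give an idea of what that scaling step could look like, here is a minimal min-max sketch in plain pandas. It is only an assumption for illustration: this section does not say which scaler is used before modelling, and the toy values are simply the min, median and max reported in the table above.

```python
import pandas as pd

# Min, median and max of loudness and tempo, copied from describe() above.
toy = pd.DataFrame({
    "loudness": [-46.583, -8.081, 1.485],
    "tempo": [0.0, 113.96, 220.036],
})

# Min-max scaling: (x - min) / (max - min) maps every column onto [0, 1].
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After this transformation each column spans 0 to 1, so features measured in dB or BPM no longer dominate distance-based methods such as clustering.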

In [None]:
# Multiple histograms for distribution

## Features with continuous data
continuous_features = df[['artist_popularity', 'acousticness', 'danceability',
                          'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
                          'speechiness', 'tempo', 'time_signature', 'valence']]


fig = make_subplots(rows=6, cols=2)

for idx, col_name in enumerate(continuous_features):
    fig.add_trace(
        go.Histogram(x = continuous_features[col_name],
                     name = col_name,
                     histnorm = 'probability'),
        row = idx % 6 + 1,      # rows 1 to 6
        col = idx // 6 + 1      # first six features in column 1, the rest in column 2
    )

fig.update_layout(
    hovermode = 'x',
    autosize = False,
    width = 1100,
    height = 1000,
    margin = dict(
        l = 20,
        r = 20,
        b = 60,
        t = 100,
        pad = 5
    ),
    title = {
        'text': "<b>Audio Features Histogram</b>",
        'y': 0.95,
        'x': 0.17,
        'xanchor': 'center',
        'yanchor': 'top'}
)
fig.update_traces(opacity = 0.50)
fig.show()

Output hidden; open in https://colab.research.google.com to view.

In [None]:
# audio features on the same scales

trace0 = go.Box(
    y=df['acousticness'],
    name='<b>acousticness</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)
trace1 = go.Box(
    y=df['danceability'],
    name='<b>danceability</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)
trace2 = go.Box(
    y=df['energy'],
    name='<b>energy</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)
trace3 = go.Box(
    y=df['instrumentalness'],
    name='<b>instrumentalness</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)
trace4 = go.Box(
    y=df['liveness'],
    name='<b>liveness</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)
trace5 = go.Box(
    y=df['speechiness'],
    name='<b>speechiness</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)

fig = make_subplots(rows=3, cols=1)

fig.append_trace(trace0, row = 1, col = 1)
fig.append_trace(trace1, row = 1, col = 1)
fig.append_trace(trace2, row = 2, col = 1)
fig.append_trace(trace3, row = 2, col = 1)
fig.append_trace(trace4, row = 3, col = 1)
fig.append_trace(trace5, row = 3, col = 1)

fig.update_layout(
    width=1000,
    height=1000,
    title='<b>Box plot of Audio Features</b>'
    )

fig.show()



## Audio features on other scales

trace6 = go.Box(
    y=df['valence'],
    name='<b>valence</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)

# decibel (dB)
trace7 = go.Box(
    y=df['loudness'],
    name='<b>loudness</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)

trace8 = go.Box(
    y=df['tempo'],
    name='<b>tempo</b>',
    text=df['track_name'] + ' - ' + df['artist_name']
)

fig2 = make_subplots(rows=1, cols=3)

fig2.append_trace(trace6, row = 1, col = 1)
fig2.append_trace(trace7, row = 1, col = 2)
fig2.append_trace(trace8, row = 1, col = 3)

fig2.update_layout(
    width=1000,
    height=400
    )

fig2.show()

Output hidden; open in https://colab.research.google.com to view.

In [None]:
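def found_all_outliers(v):
    """Return the index labels and the values of v that fall outside the fences."""

    # Fences are built on the 8th and 92nd percentiles (wider than the usual quartiles)
    Q1 = np.quantile(v, 0.08)
    Q3 = np.quantile(v, 0.92)
    EIQ = Q3 - Q1              # inter-quantile range
    LI = Q1 - (EIQ * 1.5)      # lower fence
    LS = Q3 + (EIQ * 1.5)      # upper fence, measured from Q3 (not Q1)
    i = list(v.index[(v < LI) | (v > LS)])
    val = list(v.loc[i])

    return i, val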

In [None]:
# Remove every row whose loudness value was flagged as an outlier
outliers = found_all_outliers(df["loudness"])
df = df[~df["loudness"].isin(outliers[1])]
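
As a worked example of the fence rule, the sketch below applies it to a handful of made-up loudness values. The function `quantile_fences` is a hypothetical helper; it defaults to the classic quartiles (0.25 and 0.75), whereas the cell above uses the 0.08 and 0.92 quantiles, so the two are not interchangeable.

```python
import numpy as np
import pandas as pd


def quantile_fences(values, low_q=0.25, high_q=0.75, k=1.5):
    """Return (lower, upper) fences at k times the inter-quantile spread."""
    q_low, q_high = np.quantile(values, [low_q, high_q])
    spread = q_high - q_low
    return q_low - k * spread, q_high + k * spread


# Made-up loudness values (dB) with one obviously extreme track.
loudness = pd.Series([-60.0, -12.0, -10.0, -9.0, -8.0, -7.0, -6.0, -5.0])

lower, upper = quantile_fences(loudness)
outliers = loudness[(loudness < lower) | (loudness > upper)]
print(lower, upper)
print(outliers.tolist())
```

Here the quartiles are -10.5 and -6.75, the spread is 3.75, and the fences land at -16.125 and -1.125, so only the -60 dB value is flagged.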

In [None]:
df = df[~df['tempo'].isin([0, 30.862])]


We can see here that at least one track has both tempo (BPM) and time_signature equal to 0; these are nature sounds such as rain, forest, etc.

In [None]:
df_to_scale = df[['artist_popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 'key', 'liveness', 'loudness', 'speechiness', 'tempo', 'valence']]
#df.loc[:, "artist_popularity": "valence"]

In [None]:
def get_outlier_counts(df, threshold):
    df = df.copy()

    # Get the z-score for specified threshold
    threshold_z_score = stats.norm.ppf(threshold)

    # Get the z-scores for each value in df, and compare them to the threshold
    z_score_df = pd.DataFrame(np.abs(stats.zscore(df)), columns = df.columns)

    return (z_score_df > threshold_z_score).sum(axis=0)

In [None]:
get_outlier_counts(df_to_scale, 0.99999999)

artist_popularity      0
acousticness           0
danceability           0
energy                 0
instrumentalness       0
key                    0
liveness               0
loudness               0
speechiness          121
tempo                  0
valence                0
dtype: int64
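
To see what such a threshold means in terms of z-scores, the sketch below converts the probability into a cut-off and applies it to made-up values. It swaps in `statistics.NormalDist` from the standard library in place of `scipy.stats.norm.ppf` so that it stays self-contained; both compute the inverse of the normal cumulative distribution function.

```python
from statistics import NormalDist

import numpy as np

threshold = 0.99999999

# Inverse normal CDF: the z-score below which a fraction `threshold` of values fall.
z_cut = NormalDist().inv_cdf(threshold)

# Made-up speechiness-like values: 99 typical tracks and one extreme one.
values = np.array([0.05] * 99 + [0.95])

# Absolute z-scores with the population standard deviation (ddof=0).
z_scores = np.abs((values - values.mean()) / values.std())

n_flagged = int((z_scores > z_cut).sum())
print(round(z_cut, 2), n_flagged)
```

Since the comparison is made on absolute z-scores, both tails are flagged, so the share of values expected beyond the cut-off under a normal distribution is `2 * (1 - threshold)`.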

#**3. EXPLORATORY DATA ANALYSIS**

We will work exclusively on the musical data.
We start by checking the type of each column.

In [None]:
# Number of streams per year

fig = px.bar(df.groupby(['year'])
                .size()
                .reset_index()
                .rename(columns = {0: "count"}),
             x = 'year',
             y = 'count',
             title = '<b>Number of streams per year</b>',
             color = 'count',
             color_continuous_scale = 'aggrnyl',
             width = 650,
             height = 500
)

fig.show()

In [None]:
# Number of streams per hour per year

fig = px.line(df.groupby(['hour', 'year'])
                .size()
                .reset_index()
                .rename(columns = {0: "count"}),
              x = "hour",
              y = "count",
              color = 'year',
              labels = {'hour': 'Hour',
                        'count': 'Count'
              }
)

annotation01 = {
    'x': 8, 'y': 1378,
    'showarrow': True, 'arrowhead': 3,
    'text': "On my way to work",
    'font': {'size': 10,
             'color': 'black'
    }
}

annotation02 = {
    'x': 21, 'y': 2235,
    'showarrow': True, 'arrowhead': 3,
    'text': "The end of my spotify daily use",
    'font': {'size': 10,
             'color': 'black'
    }
}

fig.update_layout({
    'annotations': [annotation01, annotation02]},
    title = '<b>Number of streams per hour per year</b>',
    font = dict(
        size = 12,
        color = "navy"
        )
)

fig.show()

We can see here that my listening habits have changed drastically over the five years as I used Spotify more and more.



*   Until 2018 I didn't have a Spotify Premium account, so I listened to music from MP3 files on my smartphone when I was outside, and on other streaming platforms such as YouTube or on other media when I was indoors. However, there are some patterns in my daily music consumption: a first peak between 7:00 AM and 8:00 AM, because that is mostly when I am on my way to college or work.
*   Between 2015 and 2017 I was essentially a student, so I didn't have enough time to listen to music during the day except during breaks. It is evident on the graph that between 8:00 AM and 7:00 PM I was mainly in classes.
*   In 2020, despite the lockdown, I tried to keep my daily routine, which is why the 2020 line is quite similar to the 2018 one.

In [None]:
# Correlation Matrix

cr = continuous_features.corr(method='pearson')

fig = go.Figure(go.Heatmap(
        x=cr.columns,
        y=cr.columns,
        z=cr.values.tolist(),
        colorscale='aggrnyl'
    )
)

fig.update_layout(
    title="<b>Correlation Matrix of the Audio Features</b>",
    font=dict(
        size=12
    )
)

fig.show()
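
A heatmap is handy for a first look, but a ranked list of pairs is often easier to read. The sketch below shows one way to extract the strongest correlations from a correlation matrix; the toy values are made up and only three features are included.

```python
import numpy as np
import pandas as pd

# Made-up values for three audio features.
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-20.0, -12.0, -8.0, -4.0],
    "acousticness": [0.9, 0.7, 0.3, 0.1],
})

corr = toy.corr(method="pearson")

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Flatten to a (feature, feature) -> correlation Series, strongest first.
pairs = (corr.where(mask)
             .stack()
             .dropna()
             .sort_values(key=np.abs, ascending=False))
print(pairs)
```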

In [None]:
music_genres_sum = pd.melt(df.loc[:, "a_cappella": "world"],
               value_vars = df.loc[:,"a_cappella":"world"].columns,
               var_name = 'Genre',
               value_name = 'Sum')

music_genres_sum = pd.DataFrame(music_genres_sum.groupby(['Genre']).sum())
music_genres_sum.reset_index(inplace=True)
music_genres_sum.head(5)

Unnamed: 0,Genre,Sum
0,a_cappella,25
1,adult_standards,3189
2,afro,3837
3,arab_pop,79
4,baroque,619
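
Since the genre columns are already 0/1 flags, the same table can be obtained by summing the columns directly, without `melt` and `groupby`. This is only a sketch of that shortcut on a toy frame with three made-up genre columns.

```python
import pandas as pd

# Toy frame with three genre flags (values are made up).
toy = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "hip_hop": [1, 0, 1],
    "pop": [0, 1, 1],
    "rap": [1, 0, 0],
})

# Column-wise sum of the flags, reshaped into a Genre / Sum table.
genre_sum = (toy.loc[:, "hip_hop":"rap"]
                .sum()
                .rename_axis("Genre")
                .reset_index(name="Sum"))
print(genre_sum)
```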


In [None]:
fig = go.Figure(data = [go.Pie(labels = music_genres_sum['Genre'],
                               values = music_genres_sum['Sum'],
                               hole = .5)]
                )
fig.update_traces(textposition = 'inside', textinfo = 'percent+label')
fig.update_layout(
    title_text = "<b>Music genres proportion</b>",
    margin = dict(t = 35, b = 35, l = 5, r = 5)
)
fig.show()

As the chart above shows, Pop is the single genre I listen to the most. However, I split Hip Hop and Rap into two separate genres because I wanted to label as Rap the Hip Hop songs that are mainly lyrical. Counting the two together, the real top 5 genres I listen to the most are:


1.   Hip Hop - Rap 🔞
2.   Pop Music 🎙️
3.   Jazz 🎷
4.   Rock ⚡
5.   Soul 🎼
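
The regrouping behind this ranking can be sketched as follows. This is only an illustration on made-up counts: the column labels `hip_hop` and `rap`, the merged label `hip_hop_rap` and the numbers are all assumptions, not values from the dataset.

```python
import pandas as pd

# Toy genre counts (made-up numbers, for illustration only)
genres = pd.DataFrame({
    "Genre": ["pop", "hip_hop", "rap", "jazz", "rock", "soul"],
    "Sum": [100, 70, 60, 50, 40, 30],
})

# Map both labels to one combined label (hypothetical names)
merged_label = {"hip_hop": "hip_hop_rap", "rap": "hip_hop_rap"}
genres["Genre"] = genres["Genre"].replace(merged_label)

# Re-aggregate the counts and keep the five largest genres
top5 = (
    genres.groupby("Genre", as_index=False)["Sum"].sum()
          .sort_values("Sum", ascending=False)
          .head(5)
          .reset_index(drop=True)
)
print(top5)
```

With these toy numbers, Pop leads while the two genres are counted separately, but the merged Hip Hop - Rap row moves to first place once they are summed.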



In [None]:
fig = px.bar(
    music_genres_sum, y = 'Sum', x = 'Genre', text = 'Sum',
    color_discrete_sequence = ['lightgreen']*len(music_genres_sum['Genre'])
)


fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
#fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')

fig.show()

In [None]:
# Scatter plot graph
fig = go.Figure(data=go.Scatter(
    x = df['valence'],
    y = df['energy'],
    mode='markers',
    marker=dict(
        size=3,
        color=df['key'],    #set color equal to a variable
        colorscale='aggrnyl',   # one of plotly colorscales,
        colorbar=dict(title='<b>Key</b>'),
        showscale=True
        ),
    text=df['track_name'] + ' - ' + df['artist_name']
    )
)

# vertical line
fig.add_shape(type="line",
    x0=0.5, y0=0, x1=0.5, y1=1,
    line=dict(color="black",width=1)
)

# horizontal line
fig.add_shape(type="line",
    x0=0, y0=0.5, x1=1, y1=0.5,
    line=dict(color="black",width=1)
)

# First annotation
fig.add_annotation(text="<b>ANGRY</b>",
                   xref="paper", yref="paper",
                   x=0.1, y=0.9, showarrow=False,
                   opacity=0.4,
                   font=dict(
                       family="Gotham Medium, monospace",
                       size=30,
                       color="Navy"
                       )
                   )

# Second annotation
fig.add_annotation(text="<b>SAD</b>",
                   xref="paper", yref="paper",
                   x=0.1, y=0.14, showarrow=False,
                   opacity=0.4,
                   font=dict(
                       family="Gotham Medium, monospace",
                       size=30,
                       color="Navy"
                        )
                   )

# Third annotation
fig.add_annotation(text="<b>HAPPY</b>",
                   xref="paper", yref="paper",
                   x=0.92, y=0.9, showarrow=False,
                   opacity=0.4,
                   font=dict(
                       family="Gotham Medium, monospace",
                       size=30,
                       color="Navy"
                       )
                   )

# Fourth annotation
fig.add_annotation(text="<b>CALM</b>",
                   xref="paper", yref="paper",
                   x=0.92, y=0.14, showarrow=False,
                   opacity=0.4,
                   font=dict(
                       family="Gotham Medium, monospace",
                       size=30,
                       color="Navy"
                       )
                   )

fig.update_layout(
    title="<b>MOOD REPARTITION</b>",
    xaxis_title="Valence",
    yaxis_title="Energy",
    paper_bgcolor='rgba(255,255,255,1)',
    plot_bgcolor='rgba(255,255,255,1)'
)

fig.show()

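
The four quadrants drawn on the plot can also be turned into an explicit label per track. The sketch below assumes the same 0.5 thresholds as the two lines on the figure; the function name `mood_quadrant` and the choice to put points lying exactly on a line on the "high" side are my own assumptions.

```python
import numpy as np
import pandas as pd

def mood_quadrant(valence, energy, threshold=0.5):
    """Label each track by its quadrant in the valence/energy plane."""
    high_valence = np.asarray(valence, dtype=float) >= threshold
    high_energy = np.asarray(energy, dtype=float) >= threshold
    return np.where(
        high_valence,
        np.where(high_energy, "HAPPY", "CALM"),   # right half of the plot
        np.where(high_energy, "ANGRY", "SAD"),    # left half of the plot
    )

# Toy tracks, one per quadrant (made-up values)
tracks = pd.DataFrame({
    "valence": [0.1, 0.2, 0.9, 0.8],
    "energy":  [0.9, 0.1, 0.8, 0.2],
})
tracks["mood"] = mood_quadrant(tracks["valence"], tracks["energy"])
print(tracks)
```

A hard split like this is only a rough baseline: tracks close to the lines are labelled as confidently as tracks in a corner of the plot.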

In [None]:
#df.to_csv(path_to_data + "df_spotify_cleaned.csv", index=False)

# **4. MUSIC MOOD PREDICTION with FUZZY C-MEANS CLUSTERING**

In [None]:
!pip install fuzzy-c-means

Collecting fuzzy-c-means
  Downloading https://files.pythonhosted.org/packages/cc/34/64498f52ddfb0a22a22f2cfcc0b293c6864f6fcc664a53b4cce9302b59fc/fuzzy_c_means-1.2.4-py3-none-any.whl
Installing collected packages: fuzzy-c-means
Successfully installed fuzzy-c-means-1.2.4


In [None]:
import copy
import random

import tensorflow as tf
from fcmeans import FCM
from scipy.spatial import distance

In [None]:
df = pd.read_csv(path_to_data + "df_spotify_cleaned.csv", index_col=False)
print('df shape:', df.shape) 
#df.head(3)

df shape: (90771, 119)


In [None]:
# Features Scaling
scale = df[['artist_popularity', 'key', 'loudness', 'tempo', 'time_signature']].values
min_max_scaler = preprocessing.MinMaxScaler()
df[['artist_popularity', 'key', 'loudness', 'tempo', 'time_signature']] = min_max_scaler.fit_transform(scale)

#Standscaler = preprocessing.StandardScaler()
#df[['artist_popularity', 'key', 'loudness', 'tempo', 'time_signature']] = Standscaler.fit_transform(scale)

# Deduplicate on artist_name and track_name
df_track_nodup = df.drop_duplicates(subset=['artist_name', 'track_name'], keep='last')

# filter on 2020
df_year = df_track_nodup[(df_track_nodup['year'] == 2020) & (df_track_nodup['s_played'] > 30)]
#df_features = df_track_nodup.loc[:, "acousticness": "valence"]

# Filter on columns
#df_features = df_year[["danceability", "speechiness", "loudness", "acousticness", "liveness"]]
#X = df_year[["danceability", "speechiness", "loudness", "acousticness", "liveness"]]
X = df_year[["energy", "valence"]]
X = X.head()  # keep only the first 5 rows while testing the algorithm

# Convert to array
#X = X.to_numpy()

**Fuzzy C-Means**

In [None]:
# number of cluster
nb_cluster = 2

# fuzziness parameter > 1
m = 2

# length of dataset
n = len(X)

# dimension of dataset
d = X.shape[1]

### **Step 1: Randomly initialize the Fuzzy pseudo-partition matrix U**

In [None]:
def init_fuzzy_pseudo_partition(data: pd.DataFrame, nb_cluster: int):
    
    # Randomly initialise the fuzzy partition
    U = np.random.dirichlet(np.ones(nb_cluster), size=len(data))
    
    return U

#weight_arr = init_fuzzy_pseudo_partition(X, nb_cluster=2)
##U = init_fuzzy_pseudo_partition(X, nb_cluster=2)


##print("\nThe data with cluster number: \n", U)

The matrix above gives, for each data point, its membership weight in each of the 2 clusters; each row sums to 1.
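
To make the shape of such a matrix concrete, here is a small standalone sketch. The seed and the sizes are arbitrary choices for illustration; it draws one Dirichlet sample per data point, in the same way as the initialization function.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
n_points, n_clusters = 5, 2      # arbitrary toy sizes

# One Dirichlet draw per data point: a row of non-negative weights
U = rng.dirichlet(np.ones(n_clusters), size=n_points)

print(U.shape)         # (5, 2): one row per point, one column per cluster
print(U.sum(axis=1))   # every row sums to 1, up to floating-point error
```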

### **Step 2: Calculate cluster center (centroid) for each cluster using the Fuzzy pseudo-partition**



In [None]:
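def compute_centroids(weight_arr: np.ndarray):
    """Compute each centroid as the membership-weighted mean of the data points."""

    C = []
    for j in range(nb_cluster):

        # Denominator: sum of the memberships of cluster j raised to the power m
        weights_m = np.power(weight_arr[:, j], m)
        weight_sum = np.sum(weights_m)
        Cj = []

        # Numerator: weighted sum of the data points, one dimension k at a time
        # (k must differ from j, otherwise the cluster index is overwritten)
        for k in range(d):
            numerator = np.sum(np.multiply(weights_m, X.iloc[:, k]))
            C_val = numerator / weight_sum
            Cj.append(C_val)
        C.append(Cj)

    return C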


##C = compute_centroids(U)
##print("\nThe data with cluster number: \n", C)
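
For reference, the centroid of cluster $j$ in fuzzy c-means is the weighted mean $c_j = \sum_i u_{ij}^m x_i / \sum_i u_{ij}^m$. Below is a minimal vectorized sketch of that formula; the function name `centroids_vectorized` and the toy data are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

def centroids_vectorized(points, memberships, m=2.0):
    """Weighted mean of the points for each cluster, with weights u_ij ** m."""
    points = np.asarray(points, dtype=float)               # shape (n, d)
    weights = np.asarray(memberships, dtype=float) ** m    # shape (n, c)
    numerator = weights.T @ points                         # shape (c, d)
    denominator = weights.sum(axis=0)[:, None]             # shape (c, 1)
    return numerator / denominator

# Toy check: with crisp (0/1) memberships, each centroid is the plain mean
# of the points assigned to that cluster.
points = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0]])
memberships = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
centers = centroids_vectorized(points, memberships)
print(centers)
```

The first centroid is the mean of the first two points and the second one coincides with the third point, which is a quick way to sanity-check any centroid implementation.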

### **Step 3: Update partition matrix**

In [None]:
# Test cell

def update_fuzzy_pseudo_partition(weight_arr: np.ndarray, C: list):
    """Update the fuzzy pseudo-partition (membership) matrix in place and return it."""

    # Standard fuzzy c-means exponent applied to the inverse Euclidean distance
    power = 2 / (m - 1)

    denom = np.zeros(n)
    for j in range(nb_cluster):

        # Compute Euclidean distance between every data point and centroid j
        euclid_dist = np.sqrt(np.sum((X.iloc[:,:].values - C[j])**2, axis=1))

        # Accumulate the denominator (normalizing term over all clusters)
        denom = denom + np.power(1/euclid_dist, power)

    for j in range(nb_cluster):

        # Compute Euclidean distance
        euclid_dist = np.sqrt(np.sum((X.iloc[:,:].values - C[j])**2, axis=1))

        # Update the partition matrix
        weight_arr[:, j] = np.divide(np.power(1/euclid_dist, power), denom)

    return weight_arr

##U_last = update_fuzzy_pseudo_partition(U, C)
##print("\nThe data with cluster number: \n", U_last)
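
In its textbook form, the membership update is $u_{ij} = 1 / \sum_k (d_{ij} / d_{ik})^{2/(m-1)}$, where $d_{ij}$ is the Euclidean distance between point $i$ and centroid $j$. The sketch below is one possible vectorized version of it; the function name `memberships_vectorized`, the small floor on the distances and the toy data are assumptions of mine.

```python
import numpy as np

def memberships_vectorized(points, centroids, m=2.0):
    """Membership of every point in every cluster; each row sums to 1."""
    points = np.asarray(points, dtype=float)          # shape (n, d)
    centroids = np.asarray(centroids, dtype=float)    # shape (c, d)

    # Pairwise Euclidean distances, shape (n, c)
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

    # Floor the distances so a point sitting on a centroid does not divide by zero
    dist = np.fmax(dist, np.finfo(float).eps)

    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Toy data: three points on a line, centroids at x = 0 and x = 4
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
U_toy = memberships_vectorized(points, centroids)
print(U_toy)
```

With `m = 2`, the point at x = 1 is three times closer to the first centroid than to the second, so its memberships are 0.9 and 0.1.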

In [None]:
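def check_for_convergence(U, U_last, epsilon: float):
    """Return True when no membership changed by more than epsilon."""

    for row in range(0, len(U)):
        for col in range(0, len(U[0])):
            # Use the absolute change so that decreases are detected too
            if abs(U_last[row][col] - U[row][col]) > epsilon:
                return False
    return True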


#check_for_convergence(U, U_last, 0.0001)
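
The same stopping rule can be written without explicit loops: the algorithm stops when the largest absolute change in any membership is at most `epsilon`. This is a minimal sketch; the name `has_converged` and the toy matrices are assumptions for illustration.

```python
import numpy as np

def has_converged(U_new, U_old, epsilon=1e-4):
    """True when no membership changed by more than epsilon."""
    change = np.abs(np.asarray(U_new, dtype=float) - np.asarray(U_old, dtype=float))
    return bool(change.max() <= epsilon)

# Toy partitions: the first row moved by 0.1, the second did not move
U_before = np.array([[0.5, 0.5], [0.2, 0.8]])
U_after = np.array([[0.4, 0.6], [0.2, 0.8]])

strict = has_converged(U_after, U_before, epsilon=0.01)   # change is too large
loose = has_converged(U_after, U_before, epsilon=0.2)     # change is tolerated
print(strict, loose)
```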

In [None]:
def fuzzy_c_means(data: pd.DataFrame, nb_cluster: int, epsilon: float):

    U = init_fuzzy_pseudo_partition(data, nb_cluster)
    maxit = 0

    while True:
        maxit += 1
        # Keep a copy of the previous partition: the update below modifies U in place
        U_old = copy.deepcopy(U)
        C = compute_centroids(U)
        U = update_fuzzy_pseudo_partition(U, C)
        # Compare the new partition with the previous one, not with itself
        if check_for_convergence(U_old, U, epsilon):
            return U
        elif maxit == 5:
            print("Too long")
            return U

fuzzy_c_means(X, 2, 0.01)

array([[0.27949252, 0.72050748],
       [0.2975191 , 0.7024809 ],
       [0.7731391 , 0.2268609 ],
       [0.4463244 , 0.5536756 ],
       [0.55528599, 0.44471401]])
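
Each row of this matrix gives one track's degree of membership in the two clusters. If a single label per track is needed, one simple option (not the only one) is to keep the cluster with the largest membership. The sketch below reuses the values printed above; the variable names are assumptions.

```python
import numpy as np

# Membership matrix copied from the output above (rows: tracks, columns: clusters)
U_result = np.array([
    [0.27949252, 0.72050748],
    [0.2975191,  0.7024809],
    [0.7731391,  0.2268609],
    [0.4463244,  0.5536756],
    [0.55528599, 0.44471401],
])

hard_labels = U_result.argmax(axis=1)   # index of the dominant cluster per track
strength = U_result.max(axis=1)         # how dominant that cluster is

print(hard_labels)   # [1 1 0 1 0]
print(strength)
```

The last two tracks have memberships close to 0.5, so their hard labels are much less certain than those of the first three. Keeping that information is the main reason to use a fuzzy method.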

### **Step 4: Check for convergence**

# **5. RECOMMENDATION SYSTEM**