# Spotify Music Recommendation System with Mood Detection Using The Million Song Dataset

Author: James Meredith

Instructor: David Elliot

Institution: Flatiron School

Active Project Dates: Jun 267th, 2023 - July 14th, 2023

## Overview

Recommendation systems use machine learning to solve the long-tail problem with digital content. There is so much content on the internet that it would take a person a long time to sort through it to find something they like. Recommendation systems help with that by suggesting content to people. [Netflix estimates](https://towardsdatascience.com/deep-dive-into-netflixs-recommender-system-341806ae3b48) it gets around 80% of total watch time thanks to its recommendation system.

The data is fairly clean already, so most of the preparation involved manipulating the datasets into formats that recommendation systems can use. Some recommendation systems accept the format provided, but ALS by hand and cosine similarity required some modifications.

Our modeling approach was:

- Start with a content-based system using cosine similarity, a commonly used similarity measure for recommendation systems
- Move to collaborative filtering using Surprise, a popular library for implementing predictive recommendation systems
- Introduce ALS by hand to cement our understanding of the approach
- Implement ALS in Spark using PySpark separately, since Spark is commonly used to implement ALS at scale

We evaluated these models using RMSE as the measurement. Netflix's famous competition offered $1M to a winner that could improve the RMSE of its content recommendation algorithm by 10%. It was finally won by BellKor’s Pragmatic Chaos in 2009 with an RMSE of 0.8567, and we decided that score was our goal. All models were trained on the training data, cross-validated, and then evaluated on the test data.

Note: ratings were on a scale of 1-5, so an RMSE of 1 means a typical prediction was off by roughly one whole rating point.
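To make the metric concrete, here is a minimal sketch of RMSE in plain Python. The helper name `rmse_by_hand` and the toy ratings are assumptions made for illustration; they are not taken from the project's data. The second example shows why "average error" is only an approximation: RMSE squares the errors, so one large miss weighs as much as several small ones.

```python
import math

def rmse_by_hand(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings on the 1-5 scale (made up for illustration)
true_ratings = [5, 3, 4, 1]

# Every prediction is off by exactly one point
evenly_off = rmse_by_hand(true_ratings, [4, 4, 3, 2])

# Three perfect predictions and one that is off by two points
one_big_miss = rmse_by_hand(true_ratings, [5, 3, 4, 3])

print(evenly_off, one_big_miss)
```

Both cases give the same RMSE even though the mean absolute error of the second is only half a point.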

Our best performing model was SVD, optimized with a grid search in Surprise. Our recommendation, however, is to start with ALS via Spark because of the efficiency of ALS and Spark in handling large datasets.

## Business Problem

## The Data

The dataset used for this project is the Million Song Dataset (MSD). It contains metadata for one million songs, including artist, title, year, and genre, along with audio features such as tempo, loudness, and key that were extracted with the Echo Nest analyzer. Lyrics are available through the companion musiXmatch dataset and are stored as a bag of words. The song files are stored in HDF5 format, and the full dataset is 280 GB uncompressed. It is available for download at http://millionsongdataset.com/ and on AWS S3 at s3://millionsongdataset/.

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. The Million Song Dataset started as a collaborative project between The Echo Nest and LabROSA. It was supported in part by the NSF. 

Its purposes are:

- To encourage research on algorithms that scale to commercial sizes
- To provide a reference dataset for evaluating research
- As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's)
- To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. 

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 
The Million Song Dataset. In Proceedings of the 12th International Society
for Music Information Retrieval Conference (ISMIR 2011), 2011.

### Exploring The Dataset

In [149]:
# Imports Required for Accessing the Data
import os
import sys
import time
import glob
import datetime
import sqlite3
import numpy as np
import pandas as pd
import tables as tb
import h5py
import hdf5_getters as GETTERS


# Standard imports
from random import gauss, uniform as uni, seed
import math
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Surprise imports for collaborative filtering
from surprise import Dataset, Reader
from surprise.prediction_algorithms import *
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise.accuracy import rmse, mae

# ALS
from sklearn.linear_model import LinearRegression

The structure of the dataset is quite complex, with many nested folders of files, so the first step is to explore that structure. Here we use the os module to walk through it.

In [40]:
# Establishes path to the Million Song Dataset subset
msd_subset_path='./MillionSongSubset'
msd_subset_data_path=os.path.join(msd_subset_path,'data')
msd_subset_addf_path=os.path.join(msd_subset_path,'AdditionalFiles')

# Establishes path to the Million Song Dataset code
msd_code_path='./MSongsDB'
assert os.path.isdir(msd_code_path), 'wrong path'  # sanity check
# we add some paths to python so we can import MSD code
# Ubuntu: you can change the environment variable PYTHONPATH
# in your .bashrc file so you do not have to type these lines
sys.path.append( os.path.join(msd_code_path,'PythonSrc') )

# Creates a function to display the folder structure of the dataset
def display_folder_structure(path, indent='', max_depth=float('inf'), depth=0):
    if depth > max_depth:
        return

    files = os.listdir(path)
    for file in files:
        current_path = os.path.join(path, file)
        if os.path.isdir(current_path):
            print(f"{indent}|- {file}/")
            display_folder_structure(current_path, indent + '  ', max_depth, depth + 1)
        else:
            print(f"{indent}|- {file}")


# Uses the function to display the folder structure of the dataset to the specified depth
max_display_depth = 0  # Specify the maximum depth to display
display_folder_structure(msd_subset_path, max_depth=max_display_depth)

|- A/
|- B/


So the structure of the dataset is as follows:

- The dataset is stored in the folder `MillionSongSubset`
- The individual song files are stored in HDF5 format
- The files sit in a nested structure of single-letter folders (for example `A/A/A`)
- The subset contains 10,000 song files in total, one per track
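The nested layout can be sketched in code. The helper below is hypothetical and assumes that the three folder levels are the third, fourth, and fifth characters of the track ID, which matches the example file `A/A/A/TRAAAAW128F429D538.h5` opened later in this notebook. It is not part of the MSD code.

```python
import os

def msd_track_path(base_dir, track_id):
    """Build the expected .h5 path for a track ID.

    Assumption: the folder levels are characters 3-5 of the track ID,
    e.g. 'TRAAAAW128F429D538' lives under A/A/A.
    """
    if len(track_id) < 5:
        raise ValueError("track_id is too short to derive a folder path")
    return os.path.join(base_dir, track_id[2], track_id[3], track_id[4],
                        track_id + '.h5')

example_path = msd_track_path('MillionSongSubset', 'TRAAAAW128F429D538')
print(example_path)
```

A helper like this makes it possible to open a single track directly instead of walking the whole folder tree.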

The next step is to explore the contents of the dataset. Here we use the h5py module to explore the contents of the dataset.


In [41]:
# Let's open one file and explore its structure and contents
# Raw string so the backslashes are not treated as escape sequences
filename = r'.\MillionSongSubset\A\A\A\TRAAAAW128F429D538.h5'
h5file = tb.open_file(filename, mode='r')
h5file 

File(filename=.\MillionSongSubset\A\A\A\TRAAAAW128F429D538.h5, title='H5 Song File', mode='r', root_uep='/', filters=Filters(complevel=1, complib='zlib', shuffle=True, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) 'H5 Song File'
/analysis (Group) 'Echo Nest analysis of the song'
/analysis/bars_confidence (EArray(83,)shuffle, zlib(1)) 'array of confidence of bars'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)
/analysis/bars_start (EArray(83,)shuffle, zlib(1)) 'array of start times of bars'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)
/analysis/beats_confidence (EArray(344,)shuffle, zlib(1)) 'array of confidence of sections'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)
/analysis/beats_start (EArray(344,)shuffle, zlib(1

In [42]:
h5file.root

/ (RootGroup) 'H5 Song File'
  children := ['analysis' (Group), 'metadata' (Group), 'musicbrainz' (Group)]

In [43]:
h5file.root.analysis


/analysis (Group) 'Echo Nest analysis of the song'
  children := ['bars_confidence' (EArray), 'bars_start' (EArray), 'beats_confidence' (EArray), 'beats_start' (EArray), 'sections_confidence' (EArray), 'sections_start' (EArray), 'segments_confidence' (EArray), 'segments_loudness_max' (EArray), 'segments_loudness_max_time' (EArray), 'segments_loudness_start' (EArray), 'segments_pitches' (EArray), 'segments_start' (EArray), 'segments_timbre' (EArray), 'songs' (Table), 'tatums_confidence' (EArray), 'tatums_start' (EArray)]

In [44]:
h5file.root.metadata

/metadata (Group) 'metadata about the song'
  children := ['artist_terms' (EArray), 'artist_terms_freq' (EArray), 'artist_terms_weight' (EArray), 'similar_artists' (EArray), 'songs' (Table)]

In [45]:
h5file.root.musicbrainz

/musicbrainz (Group) 'data about the song coming from MusicBrainz'
  children := ['artist_mbtags' (EArray), 'artist_mbtags_count' (EArray), 'songs' (Table)]

So the structure of each .h5 file is as follows:

- The file contains 3 groups: analysis, metadata, and musicbrainz
- The analysis group contains:
    bars_confidence, bars_start, beats_confidence, beats_start, sections_confidence, sections_start, segments_confidence, segments_loudness_max, segments_loudness_max_time, segments_loudness_start, segments_pitches, segments_start, segments_timbre, songs, tatums_confidence, and tatums_start
- The metadata group contains:
    artist_terms, artist_terms_freq, artist_terms_weight, similar_artists, and songs
- The musicbrainz group contains:
    artist_mbtags, artist_mbtags_count, and songs


## Data Exploration

## Data Preparation & Cleaning

Because the structure of the dataset is catered towards big data applications, it is not ideal for machine learning applications. The first step is to convert the dataset into a more machine learning friendly format. Here we use the pandas module to convert the dataset into a pandas dataframe.

In [46]:
df = pd.DataFrame(h5file)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,961,962,963,964,965,966,967,968,969,970
0,/analysis (Group) 'Echo Nest analysis of the s...,/metadata (Group) 'metadata about the song',/musicbrainz (Group) 'data about the song comi...,,,,,,,,...,,,,,,,,,,
1,"/analysis/bars_confidence (EArray(83,)shuffle,...","/analysis/bars_start (EArray(83,)shuffle, zlib...","/analysis/beats_confidence (EArray(344,)shuffl...","/analysis/beats_start (EArray(344,)shuffle, zl...","/analysis/sections_confidence (EArray(10,)shuf...","/analysis/sections_start (EArray(10,)shuffle, ...","/analysis/segments_confidence (EArray(971,)shu...","/analysis/segments_loudness_max (EArray(971,)s...",/analysis/segments_loudness_max_time (EArray(9...,"/analysis/segments_loudness_start (EArray(971,...",...,,,,,,,,,,
2,"/metadata/artist_terms (EArray(37,)shuffle, zl...","/metadata/artist_terms_freq (EArray(37,)shuffl...","/metadata/artist_terms_weight (EArray(37,)shuf...","/metadata/similar_artists (EArray(100,)shuffle...","[/metadata/songs.row (Row), pointing to row #0]",,,,,,...,,,,,,,,,,
3,"/musicbrainz/artist_mbtags (EArray(0,)shuffle,...","/musicbrainz/artist_mbtags_count (EArray(0,)sh...","[/musicbrainz/songs.row (Row), pointing to row...",,,,,,,,...,,,,,,,,,,
4,0.643,0.746,0.722,0.095,0.091,0.362,0.465,0.204,0.129,0.618,...,,,,,,,,,,
5,0.58521,2.94247,5.14371,7.74554,10.36149,12.98399,15.59835,18.21002,20.81724,23.41491,...,,,,,,,,,,
6,0.834,0.851,0.65,0.635,0.532,0.753,0.622,0.657,0.704,0.745,...,,,,,,,,,,
7,0.58521,1.19196,1.78893,2.37813,2.94247,3.50622,4.05077,4.56902,5.14371,5.76504,...,,,,,,,,,,
8,1.0,1.0,0.218,0.133,0.384,0.326,0.373,0.129,0.588,0.62,...,,,,,,,,,,
9,0.0,7.74554,36.44331,43.61667,75.17954,90.1827,135.77195,164.23964,189.03133,198.20273,...,,,,,,,,,,


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Columns: 971 entries, 0 to 970
dtypes: object(971)
memory usage: 212.5+ KB


Attempting to import an individual data file wholesale into a pandas dataframe results in a dataframe with 971 columns and 28 rows - for just one song! Most of those cells are empty, and the values are stored as generic objects rather than as named features. Stacking the same layout for all 10,000 songs would give roughly 280,000 rows of mostly empty object columns - for just the Million Song Subset! This is not ideal for loading into memory on a local machine. Instead we will only load the columns we need into the dataframe moving forward. We'll start by defining a function that can iterate through the dataset and load the columns we need into a dataframe, and then prepare the dataframe for machine learning applications.

In [95]:
# we define this very useful function to iterate through all the files in the dataset
def apply_to_all_files(basedir,func=lambda x: x,ext='.h5'):
    """
    From a base directory, go through all subdirectories,
    find all files with the given extension, apply the
    given function 'func' to all of them.
    If no 'func' is passed, we do nothing except counting.
    INPUT
       basedir  - base directory of the dataset
       func     - function to apply to all filenames
       ext      - extension, .h5 by default
    RETURN
       number of files
    """
    cnt = 0
    # iterate over all files in all subdirectories
    for root, dirs, files in os.walk(basedir):
        files = glob.glob(os.path.join(root,'*'+ext))
        # count files
        cnt += len(files)
        # apply function to all files
        for f in files :
            func(f)       
    return cnt

In [96]:
# we can now easily count the number of files in the dataset
print('number of song files:',apply_to_all_files(msd_subset_path))

number of song files: 10000


In [139]:
df = pd.DataFrame()

In [141]:
# Creates a function to open a file and return the data
def get_data(filename):
    # Opens the file
    get_data_file = GETTERS.open_h5_file_read(filename)
    # Adds the song ID to the dataframe (rows are indexed by filename for now)
    df.loc[filename, 'song_id'] = GETTERS.get_song_id(get_data_file).decode('utf-8')
    # Append the artist ID to the dataframe
    df.loc[filename, 'artist_id'] = GETTERS.get_artist_id(get_data_file).decode('utf-8')
    # Append the release to the dataframe
    df.loc[filename, 'release'] = GETTERS.get_release(get_data_file).decode('utf-8')
    # Appends the song year to the dataframe
    df.loc[filename, 'year'] = GETTERS.get_year(get_data_file)
    # Appends the song mode to the dataframe
    df.loc[filename, 'mode'] = GETTERS.get_mode(get_data_file)
    # Appends the song time signature to the dataframe
    df.loc[filename, 'time_signature'] = GETTERS.get_time_signature(get_data_file)
    # Appends the song tempo to the dataframe
    df.loc[filename, 'tempo'] = GETTERS.get_tempo(get_data_file)
    # Appends the song loudness to the dataframe
    df.loc[filename, 'loudness'] = GETTERS.get_loudness(get_data_file)
    # # Appends the song energy to the dataframe - omitted due to lack of data
    # df.loc[filename, 'energy'] = GETTERS.get_energy(get_data_file)
    # # Appends the song danceability to the dataframe - omitted due to lack of data
    # df.loc[filename, 'danceability'] = GETTERS.get_danceability(get_data_file) 
    # Appends the popularity to the dataframe
    # df.loc[filename, 'song_hotttnesss'] = GETTERS.get_song_hotttnesss(get_data_file)
    #closes the file
    get_data_file.close()

In [142]:
# Applies the function to all files in the dataset
apply_to_all_files(msd_subset_path, func=get_data)

10000
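Assigning one cell at a time with `df.loc` inside `get_data` works, but it grows the dataframe row by row. A possible alternative, sketched here under assumptions, is to collect one dictionary per file and build the dataframe once at the end. The names `build_song_table` and `fake_read_song` are hypothetical; in practice the reader would wrap the `hdf5_getters` calls used above.

```python
import pandas as pd

def build_song_table(filenames, read_song):
    """Build one dataframe from many files.

    read_song is any callable that takes a filename and returns a dict
    mapping column names to values for that song.
    """
    rows = []
    for filename in filenames:
        row = dict(read_song(filename))  # copy so the reader's dict is untouched
        row['filename'] = filename
        rows.append(row)
    # Constructing the dataframe once avoids repeated row-by-row growth
    return pd.DataFrame(rows).set_index('filename')

# Hypothetical stand-in for a reader built on hdf5_getters
def fake_read_song(filename):
    return {'song_id': 'ID_' + filename, 'tempo': 120.0}

songs = build_song_table(['a.h5', 'b.h5'], fake_read_song)
print(songs)
```

Passing the reader in as an argument also keeps the function free of global state, so it can be tried on a handful of files before running over all 10,000.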

In [143]:
df.head(20)

Unnamed: 0,song_id,artist_id,release,year,mode,time_signature,tempo,loudness
./MillionSongSubset\A\A\A\TRAAAAW128F429D538.h5,SOMZWCG12A8C13C480,ARD7TVE1187B99BFB1,Fear Itself,0.0,0.0,4.0,92.198,-11.197
./MillionSongSubset\A\A\A\TRAAABD128F429CF47.h5,SOCIWDW12A8C13D406,ARMJAGH1187FB546F3,Dimensions,1969.0,0.0,4.0,121.274,-9.843
./MillionSongSubset\A\A\A\TRAAADZ128F9348C2E.h5,SOXVLOJ12AB0189215,ARKRRTF1187B9984DA,Las Numero 1 De La Sonora Santanera,0.0,1.0,1.0,100.07,-9.689
./MillionSongSubset\A\A\A\TRAAAEF128F4273421.h5,SONHOTT12A8C13493C,AR7G5I41187FB4CE6C,Friend Or Foe,1982.0,1.0,4.0,119.293,-9.013
./MillionSongSubset\A\A\A\TRAAAFD128F92F423A.h5,SOFSOCN12A8C143F5D,ARXR32B1187FB57099,Muertos Vivos,2007.0,1.0,4.0,129.738,-4.501
./MillionSongSubset\A\A\A\TRAAAMO128F1481E7F.h5,SOYMRWW12A6D4FAB14,ARKFYS91187B98E58F,Ordinary Day,0.0,1.0,3.0,147.782,-9.323
./MillionSongSubset\A\A\A\TRAAAMQ128F1460CD3.h5,SOMJBYD12A6D4F8557,ARD0S291187B9B7BF5,Da Ghetto Psychic,0.0,1.0,1.0,111.787,-17.302
./MillionSongSubset\A\A\A\TRAAAPK128E0786D96.h5,SOHKNRJ12A6701D1F8,AR10USD1187B99F3F1,Gin & Phonic,0.0,0.0,3.0,101.43,-11.642
./MillionSongSubset\A\A\A\TRAAARJ128F9320760.h5,SOIAZJW12AB01853F1,AR8ZCNI1187B9A069B,Pink World,1984.0,1.0,4.0,86.643,-13.496
./MillionSongSubset\A\A\A\TRAAAVG12903CFA543.h5,SOUDSGM12AC9618304,ARNTLGG11E2835DDB9,Superinstrumental,0.0,0.0,4.0,114.041,-6.697


In [144]:
# Sets the index to the song ID
df_test = df.set_index('song_id')
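Several rows in the preview above show a `year` of 0.0, which the dataset uses when the release year is unknown. A minimal sketch of one way to handle this is to convert those zeros to missing values before using `year` as a feature. The small frame below copies four rows from the preview; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Four rows copied from the preview above
songs_preview = pd.DataFrame({
    'song_id': ['SOMZWCG12A8C13C480', 'SOCIWDW12A8C13D406',
                'SOXVLOJ12AB0189215', 'SONHOTT12A8C13493C'],
    'year': [0.0, 1969.0, 0.0, 1982.0],
})

# Treat a year of 0 as "unknown" rather than as a real value
songs_preview['year'] = songs_preview['year'].replace(0, np.nan)

# Share of songs without a usable year
missing_share = songs_preview['year'].isna().mean()
print(missing_share)
```

Leaving the zeros in place would distort any scaling or distance calculation that includes `year`, which may be one reason to keep it out of the feature list.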

## Modeling

### Content-Based Model using Cosine Similarity

In [None]:
features = ['mode', 'time_signature', 'tempo', 'loudness']

In [None]:
scaler = StandardScaler()
df_test_scaled = scaler.fit_transform(df_test[features])

In [None]:
cosine_sim = cosine_similarity(df_test_scaled)

In [None]:
similarity_matrix = pd.DataFrame(cosine_sim, index=df['song_id'], columns=df['song_id'])


In [None]:
similarity_matrix.head(20)

song_id,SOMZWCG12A8C13C480,SOCIWDW12A8C13D406,SOXVLOJ12AB0189215,SONHOTT12A8C13493C,SOFSOCN12A8C143F5D,SOYMRWW12A6D4FAB14,SOMJBYD12A6D4F8557,SOHKNRJ12A6701D1F8,SOIAZJW12AB01853F1,SOUDSGM12AC9618304,...,SOILDRV12A8C13EB77,SOBUUYV12A58A7DA27,SOJARSR12AB0184939,SOUWMIW12AB0184748,SOVMTAW12A8C13B071,SOLXXPY12A67ADABA0,SOAYONI12A6D4F85C8,SOJZLAJ12AB017E8A2,SORZSCJ12A8C132446,SOFAOMI12A6D4FA2D8
SOMZWCG12A8C13C480,1.0,0.872983,-0.290332,-0.579917,-0.499656,-0.930551,-0.283379,0.884208,0.0372,0.821943,...,-0.724665,-0.571291,-0.055299,-0.540575,-0.199168,-0.771839,0.027415,-0.498854,-0.716368,0.442023
SOCIWDW12A8C13D406,0.872983,1.0,-0.478978,-0.68018,-0.364192,-0.690332,-0.475992,0.801997,-0.419751,0.932622,...,-0.454138,-0.528582,-0.398142,-0.739686,-0.463533,-0.552218,-0.450367,-0.721653,-0.651972,0.801117
SOXVLOJ12AB0189215,-0.290332,-0.478978,1.0,-0.078905,-0.069991,0.378492,0.818307,0.071113,0.108605,-0.374759,...,-0.329159,-0.261648,0.14803,-0.122068,0.87647,-0.224467,0.548174,-0.111707,-0.274357,-0.468848
SONHOTT12A8C13493C,-0.579917,-0.68018,-0.078905,1.0,0.776698,0.317649,-0.278453,-0.843271,0.46246,-0.483023,...,0.300176,0.288915,0.75916,0.798669,-0.22622,0.787134,0.163236,0.75404,0.547388,-0.441644
SOFSOCN12A8C143F5D,-0.499656,-0.364192,-0.069991,0.776698,1.0,0.453094,-0.505654,-0.659999,-0.134141,-0.066293,...,0.174861,-0.139052,0.508496,0.264401,-0.462168,0.85036,-0.341246,0.189346,0.170502,0.123738
SOYMRWW12A6D4FAB14,-0.930551,-0.690332,0.378492,0.317649,0.453094,1.0,0.313475,-0.693216,-0.365149,-0.633734,...,0.648583,0.372265,-0.193234,0.195743,0.211384,0.673239,-0.234345,0.14923,0.491801,-0.182034
SOMJBYD12A6D4F8557,-0.283379,-0.475992,0.818307,-0.278453,-0.505654,0.313475,1.0,0.086941,0.224367,-0.586883,...,-0.025779,0.196957,-0.162484,0.040551,0.979638,-0.385668,0.635601,0.082142,0.045125,-0.639723
SOHKNRJ12A6701D1F8,0.884208,0.801997,0.071113,-0.843271,-0.659999,-0.693216,0.086941,1.0,-0.17185,0.724135,...,-0.73643,-0.628504,-0.313852,-0.774481,0.142322,-0.914717,0.053715,-0.72694,-0.831952,0.419982
SOIAZJW12AB01853F1,0.0372,-0.419751,0.108605,0.46246,-0.134141,-0.365149,0.224367,-0.17185,1.0,-0.426372,...,-0.183231,0.268099,0.717117,0.725941,0.325875,-0.135234,0.865736,0.763184,0.2653,-0.810587
SOUDSGM12AC9618304,0.821943,0.932622,-0.374759,-0.483023,-0.066293,-0.633734,-0.586883,0.724135,-0.426372,1.0,...,-0.604054,-0.755379,-0.151847,-0.753037,-0.524549,-0.406026,-0.463999,-0.754896,-0.781941,0.840593
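A similarity matrix like the one above can be turned into recommendations by sorting one song's row. The sketch below is illustrative only: `top_similar` is a hypothetical helper, it assumes song IDs are unique, and the toy matrix copies the top-left 4x4 corner of the output above.

```python
import pandas as pd

def top_similar(similarity, song_id, n=5):
    """Return the n songs most similar to song_id, best first.

    Assumes `similarity` is a square dataframe with unique song IDs
    as both index and columns.
    """
    scores = similarity.loc[song_id].drop(song_id)  # exclude the song itself
    return scores.sort_values(ascending=False).head(n)

ids = ['SOMZWCG12A8C13C480', 'SOCIWDW12A8C13D406',
       'SOXVLOJ12AB0189215', 'SONHOTT12A8C13493C']
toy_similarity = pd.DataFrame(
    [[1.0, 0.872983, -0.290332, -0.579917],
     [0.872983, 1.0, -0.478978, -0.68018],
     [-0.290332, -0.478978, 1.0, -0.078905],
     [-0.579917, -0.68018, -0.078905, 1.0]],
    index=ids, columns=ids)

recommendations = top_similar(toy_similarity, 'SOMZWCG12A8C13C480', n=2)
print(recommendations)
```

Because the features were standardized before computing cosine similarity, a score near 1 means two songs deviate from the average song in the same direction, not merely that their raw values are close.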


### Collaborative Filtering

### Alternating Least-Squares (ALS)
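As a rough illustration of the idea behind ALS, the sketch below factorizes a small made-up ratings matrix with NumPy. Everything in it (the function name `als_factorize`, the toy ratings, the number of factors, and the regularization strength) is an assumption made for illustration, not the notebook's actual implementation. ALS holds the item factors fixed and solves a regularized least-squares problem for each user, then holds the user factors fixed and does the same for each item, and repeats.

```python
import numpy as np

def als_factorize(ratings, observed, n_factors=2, reg=0.1, n_iters=20, seed=0):
    """Factorize `ratings` into user and item factors using ALS.

    ratings  - 2D array of ratings (values at unobserved cells are ignored)
    observed - boolean array of the same shape, True where a rating exists
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    user_f = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_f = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)

    for _ in range(n_iters):
        # Fix item factors, solve a ridge regression for each user
        for u in range(n_users):
            seen = observed[u]
            if seen.any():
                items = item_f[seen]
                user_f[u] = np.linalg.solve(items.T @ items + ridge,
                                            items.T @ ratings[u, seen])
        # Fix user factors, solve a ridge regression for each item
        for i in range(n_items):
            seen = observed[:, i]
            if seen.any():
                users = user_f[seen]
                item_f[i] = np.linalg.solve(users.T @ users + ridge,
                                            users.T @ ratings[seen, i])
    return user_f, item_f

# Toy 1-5 ratings; 0 marks a missing rating (made up for illustration)
toy_ratings = np.array([[5, 4, 0, 1],
                        [4, 5, 1, 0],
                        [1, 0, 5, 4],
                        [0, 1, 4, 5]], dtype=float)
toy_observed = toy_ratings > 0

user_factors, item_factors = als_factorize(toy_ratings, toy_observed)
predictions = user_factors @ item_factors.T
errors = (predictions - toy_ratings)[toy_observed]
train_rmse = float(np.sqrt(np.mean(errors ** 2)))
print(train_rmse)
```

Each inner solve is a small linear system that is independent of the others, which is why the method parallelizes well on engines such as Spark.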

## Conclusion

- Content-based systems are simple to implement, but their error is hard to measure (no RMSE)
- Collaborative filtering has issues with cold start (items with no ratings)
- Collaborative filtering using Surprise is incredibly effective and produces great models
- ALS is incredibly efficient and simple

There are many options to create recommendation systems. In this notebook we explored content-based systems using cosine similarity and various collaborative filtering models via Surprise, and dove deep into the concepts behind ALS to explain how it works.

We were able to beat BellKor’s Pragmatic Chaos's RMSE of 0.8567 with an RMSE of 0.8559, but this was likely because of our smaller dataset.

The best approach to recommendation systems is likely a hybrid approach which we will not explore in this notebook.