# Data Preparation and Cleaning

The purpose of this notebook is to output a single dataframe with which to build a regression model.

## Imports

In [1]:
import pandas as pd

## Spotify Audio Features Dataset

This dataset is available on Kaggle via this [link](https://www.kaggle.com/datasets/naoh1092/spotify-genre-audio-features). It contains genre, audio features, release date, and some other metadata for each track.

In [2]:
df = pd.read_csv("DATA/spotify-audio-features.csv")

In [3]:
df.head()

|  | genre | Title | Album_cover_link | Artist | duration | explicit | id | popularity | release_date | release_date_precision | ... | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | time_signature |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | pop | Girls Like Girls | https://i.scdn.co/image/03dc1a009e2dec953f53aa... | Hayley Kiyoko | 229655 | 0 | 3dNjUFt6EFU4Gq6Q5vfJqf | 66 | 2015-02-03 | day | ... | 6 | -5.235 | 1 | 0.0248 | 0.0109 | 1e-05 | 0.0994 | 0.604 | 142.027 | 4 |
| 1 | edm | Getaway - Koven Remix | https://i.scdn.co/image/0cd0333511252003372c21... | Tritonal | 203720 | 0 | 10EpXLXKHmNSVKvX7A5hg8 | 57 | 2016-07-08 | day | ... | 6 | -2.875 | 0 | 0.223 | 0.00415 | 0.0 | 0.158 | 0.222 | 172.187 | 4 |
| 2 | edm | Role Models | https://i.scdn.co/image/0eb66c9b9d8f4d2231f3a7... | Walker & Royce | 259694 | 1 | 5a8I5duAd2jiGrmRHUZ185 | 46 | 2017-10-20 | day | ... | 7 | -8.391 | 1 | 0.0337 | 0.0282 | 0.0132 | 0.221 | 0.143 | 125.992 | 4 |
| 3 | edm | I Feel The Love | https://i.scdn.co/image/1ba3fabdc667d91dc7405f... | Tritonal | 213120 | 0 | 6UiDiFJUGEDzkGpZBL8IYq | 55 | 2016-09-09 | day | ... | 0 | -5.792 | 1 | 0.033 | 0.0201 | 0.0 | 0.165 | 0.435 | 125.034 | 4 |
| 4 | latin | No Tengo Suerte En El Amor | https://i.scdn.co/image/1ef175621e439a34f679bf... | Yoskar Sarante | 239866 | 0 | 6Yeiqec7xM5qssOH1FWV1g | 54 | 2002-06-25 | day | ... | 10 | -5.923 | 0 | 0.0387 | 0.0908 | 4.6e-05 | 0.0669 | 0.896 | 132.107 | 4 |


In [4]:
df.columns

Index(['genre', 'Title', 'Album_cover_link', 'Artist', 'duration', 'explicit',
       'id', 'popularity', 'release_date', 'release_date_precision',
       'total_tracks', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature'],
      dtype='object')

## Choosing features to use in the model

The Spotify API provides a unique dataset of "audio features" for every song on their platform in addition to some basic metadata. Audio features are parameters calculated using proprietary Spotify algorithms that represent different perceptible qualities of the music in numerical form.

### Features to keep

The plan is to keep the features listed below and drop those that are unlikely to contribute to the model or would change its output only marginally.

* **Duration:** This is the length of the song in milliseconds.
* **Tempo:** This is the estimated speed of the song in beats per minute. 
* **Danceability:** Measure of how danceable a song is from a scale of 0 to 1.
* **Energy:** A perceptual measure of intensity and activity from a scale of 0 to 1.
* **Loudness:** The overall loudness of a song in decibels (dB). Spotify documents typical values between -60 and 0 dB; in this dataset they range from about -20.6 to 0.9.
* **Speechiness:** Detects the presence of spoken words in a song on a scale of 0 to 1. A lower number (0-0.33) tends to indicate less speech, a moderate number (0.33-0.66) tends to indicate a mix of music and spoken word like rap, and a higher number (0.66-1) tends to indicate something like a poetry album.
* **Acousticness:** A confidence measure of whether the song is acoustic, on a scale of 0 to 1.
* **Instrumentalness:** A measure on a scale of 0 to 1 of whether a track has no vocals (excluding adlibs like ooh-ahh's). Values above 0.5 tend to indicate an instrumental track.
* **Liveness:** Detects the presence of an audience in a recording on a scale of 0 to 1. A value of 0.8 or above indicates a strong chance that the song is a live recording.
* **Valence:** A measure from 0 to 1 indicating the "musical positiveness" of a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
* **Genre:** A category that identifies some pieces of music as belonging to a shared tradition or set of conventions.
* **Explicit:** Indicates whether a track has curse words or language or art that is sexual, violent, or offensive in nature.
* **Release Date:** The date on which the track was released to the public.
* **Popularity:** The popularity of the track. It is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are. This is the feature that we are going to try to predict.
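
The documented scales above can be checked against the data before modelling. The following is only a minimal sketch: `sample` is a hypothetical stand-in for the notebook's `df`, and the choice of which columns to check is an assumption.

```python
import pandas as pd

# Hypothetical sample values; in the notebook this would be `df` itself.
sample = pd.DataFrame({
    "danceability": [0.611, 0.488, 0.705],
    "energy": [0.701, 0.914, 0.531],
    "valence": [0.604, 0.222, 0.143],
    "loudness": [-5.235, -2.875, -8.391],
})

# Features described as lying on a 0-to-1 scale.
unit_scale_features = ["danceability", "energy", "valence"]

# One boolean per column: True when every value falls inside [0, 1].
in_range = sample[unit_scale_features].apply(lambda col: col.between(0, 1).all())
print(in_range)

# Loudness is measured in dB and is not bounded to [0, 1], so inspect its range instead.
print(sample["loudness"].min(), sample["loudness"].max())
```

A column that fails the check would be worth inspecting before it goes into the model.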


### Features to drop

This dataset contains some features that are irrelevant to the model, such as album cover links and the total number of tracks in an album. Others, such as key, mode, and time signature, require more consideration. For now, I will drop them because I expect them to have little effect on the model.

* **Key:** The key the track is in. I like songs in almost every key, so I doubt it will be predictive and won't include it initially.
* **Mode:** Mode indicates the modality (major or minor) of a track.
* **Time Signature:** An estimated time signature. The time signature is a notational convention to specify how many beats are in each bar.   

For more detailed information on these parameters refer to the [Spotify API Documentation](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-audio-features).
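
The expectation that these columns matter little could be checked rather than assumed. The sketch below is one possible check, not part of the original analysis: it computes the Pearson correlation of `key` and `mode` with `popularity` on five rows copied from the `df.head()` output above. Five rows prove nothing, and `key` is a pitch class rather than a true quantity, so a linear correlation is only a crude signal; a real check would run on the full `df` before the columns are dropped.

```python
import pandas as pd

# Five rows copied from the df.head() output; the real check would use all of `df`.
sample = pd.DataFrame({
    "key": [6, 6, 7, 0, 10],
    "mode": [1, 0, 1, 1, 0],
    "popularity": [66, 57, 46, 55, 54],
})

# Pearson correlation of each candidate column with the target column.
corr = sample[["key", "mode"]].corrwith(sample["popularity"])
print(corr.round(3))
```

Values close to zero on the full dataset would support dropping the columns; larger values would suggest revisiting the decision.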

In [5]:
# Keep only the model features and the target column (popularity).
df = df[[
    "duration", "tempo", "danceability", "energy", "loudness",
    "speechiness", "acousticness", "instrumentalness", "liveness",
    "valence", "genre", "explicit", "release_date", "popularity",
]]

In [6]:
df.head()

|  | duration | tempo | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | genre | explicit | release_date | popularity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 229655 | 142.027 | 0.611 | 0.701 | -5.235 | 0.0248 | 0.0109 | 1e-05 | 0.0994 | 0.604 | pop | 0 | 2015-02-03 | 66 |
| 1 | 203720 | 172.187 | 0.488 | 0.914 | -2.875 | 0.223 | 0.00415 | 0.0 | 0.158 | 0.222 | edm | 0 | 2016-07-08 | 57 |
| 2 | 259694 | 125.992 | 0.705 | 0.531 | -8.391 | 0.0337 | 0.0282 | 0.0132 | 0.221 | 0.143 | edm | 1 | 2017-10-20 | 46 |
| 3 | 213120 | 125.034 | 0.555 | 0.813 | -5.792 | 0.033 | 0.0201 | 0.0 | 0.165 | 0.435 | edm | 0 | 2016-09-09 | 55 |
| 4 | 239866 | 132.107 | 0.904 | 0.686 | -5.923 | 0.0387 | 0.0908 | 4.6e-05 | 0.0669 | 0.896 | latin | 0 | 2002-06-25 | 54 |


## Dealing with Categorical Data

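Our regression model can't work with categorical data directly, so we need to convert the "genre" column into dummy variables, otherwise known as "one-hot" encoding. Because `drop_first=True` is used, the first genre (`edm`) gets no column of its own and serves as the baseline: rows with zeros in every genre column are `edm` tracks.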
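
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical four-row frame (`toy` is not part of the dataset). The `dtype=int` argument is an illustrative choice so that the output prints as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data with three genres.
toy = pd.DataFrame({"genre": ["pop", "edm", "latin", "edm"]})

# drop_first=True removes the first category in sorted order ("edm" here),
# which then acts as the baseline: its rows are zero in every dummy column.
dummies = pd.get_dummies(toy, columns=["genre"], drop_first=True, dtype=int)
print(dummies)
```

Only `genre_latin` and `genre_pop` columns are produced; the two `edm` rows contain zeros in both.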

In [7]:
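# Separate the categorical column from the numeric ones.
categorical_features = df[["genre"]]
df = df.drop("genre", axis=1)
# Encode genre as dummy columns; drop_first=True drops the first genre as the baseline.
categorical_features = pd.get_dummies(categorical_features, drop_first=True)
# Attach the dummy columns back to the numeric features.
df = pd.concat([df, categorical_features], axis=1)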

In [8]:
df.head()

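|  | duration | tempo | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | explicit | release_date | popularity | genre_hiphop | genre_latin | genre_pop | genre_r&b | genre_rap | genre_rock |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 229655 | 142.027 | 0.611 | 0.701 | -5.235 | 0.0248 | 0.0109 | 1e-05 | 0.0994 | 0.604 | 0 | 2015-02-03 | 66 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 203720 | 172.187 | 0.488 | 0.914 | -2.875 | 0.223 | 0.00415 | 0.0 | 0.158 | 0.222 | 0 | 2016-07-08 | 57 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 259694 | 125.992 | 0.705 | 0.531 | -8.391 | 0.0337 | 0.0282 | 0.0132 | 0.221 | 0.143 | 1 | 2017-10-20 | 46 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 213120 | 125.034 | 0.555 | 0.813 | -5.792 | 0.033 | 0.0201 | 0.0 | 0.165 | 0.435 | 0 | 2016-09-09 | 55 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 239866 | 132.107 | 0.904 | 0.686 | -5.923 | 0.0387 | 0.0908 | 4.6e-05 | 0.0669 | 0.896 | 0 | 2002-06-25 | 54 | 0 | 1 | 0 | 0 | 0 | 0 |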


## Dealing with Date Objects

The "release_date" column holds dates stored as strings. For the model to use them, we convert these values to integers by keeping only the year of each date.
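
The dataset also carries a `release_date_precision` column, so some dates may be recorded only to the month or year. Whether this dataset actually contains such rows is an assumption; if it does, parsing them as full dates can be fragile. Since every precision starts with a four-digit year, one alternative sketch is to slice the year out of the string directly. The `dates` series below is hypothetical.

```python
import pandas as pd

# Hypothetical release dates with day, year, and month precision.
dates = pd.Series(["2015-02-03", "1999", "2002-06"], name="release_date")

# The first four characters are always the year, whatever the precision.
years = dates.str[:4].astype(int)
print(years.tolist())  # [2015, 1999, 2002]
```

For dates that all have day precision, this gives the same result as `pd.to_datetime(...).dt.year`.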

In [9]:
df["release_date"] = pd.to_datetime(df["release_date"]).dt.year

In [10]:
df.head()

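|  | duration | tempo | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | explicit | release_date | popularity | genre_hiphop | genre_latin | genre_pop | genre_r&b | genre_rap | genre_rock |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 229655 | 142.027 | 0.611 | 0.701 | -5.235 | 0.0248 | 0.0109 | 1e-05 | 0.0994 | 0.604 | 0 | 2015 | 66 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 203720 | 172.187 | 0.488 | 0.914 | -2.875 | 0.223 | 0.00415 | 0.0 | 0.158 | 0.222 | 0 | 2016 | 57 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 259694 | 125.992 | 0.705 | 0.531 | -8.391 | 0.0337 | 0.0282 | 0.0132 | 0.221 | 0.143 | 1 | 2017 | 46 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 213120 | 125.034 | 0.555 | 0.813 | -5.792 | 0.033 | 0.0201 | 0.0 | 0.165 | 0.435 | 0 | 2016 | 55 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 239866 | 132.107 | 0.904 | 0.686 | -5.923 | 0.0387 | 0.0908 | 4.6e-05 | 0.0669 | 0.896 | 0 | 2002 | 54 | 0 | 1 | 0 | 0 | 0 | 0 |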


## Now, we are ready to move on.

After this data preparation, our dataset is finally ready for the regression model.
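
"Ready" here means that every column is numeric and no values are missing. A hedged sketch of how that could be asserted is shown below; `final` is a hypothetical two-row stand-in for the prepared `df`, and the two conditions are assumptions about what the regression model requires.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the prepared dataframe.
final = pd.DataFrame({
    "duration": [229655, 203720],
    "release_date": [2015, 2016],
    "popularity": [66, 57],
    "genre_pop": [1, 0],
})

# Count missing values across the whole frame.
missing_total = int(final.isna().sum().sum())

# Collect any columns that are not numeric.
non_numeric = [col for col in final.columns if not is_numeric_dtype(final[col])]

print(missing_total, non_numeric)
```

Both results should be empty (zero missing values, no non-numeric columns) before the frame is handed to a model.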

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6917 entries, 0 to 6916
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   duration          6917 non-null   int64  
 1   tempo             6917 non-null   float64
 2   danceability      6917 non-null   float64
 3   energy            6917 non-null   float64
 4   loudness          6917 non-null   float64
 5   speechiness       6917 non-null   float64
 6   acousticness      6917 non-null   float64
 7   instrumentalness  6917 non-null   float64
 8   liveness          6917 non-null   float64
 9   valence           6917 non-null   float64
 10  explicit          6917 non-null   int64  
 11  release_date      6917 non-null   int64  
 12  popularity        6917 non-null   int64  
 13  genre_hiphop      6917 non-null   uint8  
 14  genre_latin       6917 non-null   uint8  
 15  genre_pop         6917 non-null   uint8  
 16  genre_r&b         6917 non-null   uint8  
 17  genre_rap         6917 non-null   uint8  
 18  genre_rock        6917 non-null   uint8  


In [12]:
df.describe()

|  | duration | tempo | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | explicit | release_date | popularity | genre_hiphop | genre_latin | genre_pop | genre_r&b | genre_rap | genre_rock |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 | 6917.0 |
| mean | 230325.976579 | 120.034788 | 0.663048 | 0.680214 | -6.725942 | 0.111231 | 0.178805 | 0.034783 | 0.184376 | 0.538109 | 0.317768 | 2007.742808 | 60.491543 | 0.140379 | 0.149776 | 0.167847 | 0.105682 | 0.133873 | 0.164233 |
| std | 54336.723207 | 28.015297 | 0.151653 | 0.173824 | 2.783829 | 0.096413 | 0.214174 | 0.138639 | 0.147764 | 0.233274 | 0.465642 | 13.233453 | 12.944708 | 0.347405 | 0.356877 | 0.373758 | 0.307452 | 0.340541 | 0.370514 |
| min | 120133.0 | 52.145 | 0.142 | 0.0848 | -20.567 | 0.0224 | 2e-06 | 0.0 | 0.0136 | 0.026 | 0.0 | 1959.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 193466.0 | 96.571 | 0.562 | 0.562 | -8.166 | 0.0406 | 0.0204 | 0.0 | 0.0916 | 0.356 | 0.0 | 2000.0 | 53.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 223733.0 | 120.01 | 0.679 | 0.696 | -6.258 | 0.0645 | 0.088 | 2e-06 | 0.125 | 0.546 | 0.0 | 2012.0 | 62.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 259853.0 | 136.029 | 0.775 | 0.818 | -4.787 | 0.157 | 0.261 | 0.00047 | 0.236 | 0.728 | 1.0 | 2019.0 | 69.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 591693.0 | 214.025 | 0.983 | 0.998 | 0.878 | 0.399 | 0.983 | 0.973 | 0.979 | 0.985 | 1.0 | 2021.0 | 100.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [13]:
df.to_csv('DATA/spotify-audio-features-final.csv',index=False)