# Cleaning the dataset from the Spotify API containing hit songs (1970-2020)

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('../data/modern_albums.csv')

In [4]:
df.head()

Unnamed: 0,id,name,artist,popularity,explicit,energy,tempo,positiveness,danceability,acousticness,instrumentalness,loudness,mode,speechiness,duration_ms,key,album_name,year
0,2YpeDb67231RjR0MgVLzsG,Old Town Road - Remix,Lil Nas X,79,False,0.619,136.041,0.639,0.878,0.0533,0.0,-5.56,1,0.102,157067,6,7 EP,2019
1,6fTt0CH2t0mdeB2N9XFG5r,Panini,Lil Nas X,62,False,0.594,153.848,0.475,0.703,0.342,0.0,-6.146,0,0.0752,114893,5,7 EP,2019
2,1ABQT5SxlUTNapSbSzblGx,F9mily (You & Me),Lil Nas X,48,False,0.534,170.054,0.408,0.556,0.019,0.00063,-7.75,1,0.0332,162720,0,7 EP,2019
3,3qIV7Rnj3ZxLs2JcLPUbFV,Kick It,Lil Nas X,53,True,0.484,151.878,0.523,0.739,0.138,0.0,-9.646,1,0.226,141987,9,7 EP,2019
4,4ak7xjvBeBOcJGWFDX9w5n,Rodeo,Lil Nas X,67,True,0.679,140.081,0.657,0.706,0.139,7e-05,-5.614,1,0.0324,158707,9,7 EP,2019


<br>

## Cleaning Summary (TLDR)

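- Removed column "instrumentalness" due to its high percentage of zero values
- Removed column "speechiness" because almost all of its values sit in the lowest range
- Removed rows with a danceability of 0 (rows that came back without usable data)
- Filtered out songs with a popularity score of 5 or lower
- Renamed column "duration_ms" to "duration_minutes" and converted the values from milliseconds to minutes, rounded to one decimal place
- Mapped "key" column values from integers (0-11) to their respective musical keys (0 to C, 1 to Csharp/Dflat, etc)
- Updated "mode" column to represent 'Major' or 'Minor' instead of binary values (1 or 0)
- Dropped the "id" column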

<br>

## 1. Removing columns that contain too many nulls or zeros

### Instrumentalness

Just from looking at the head of the dataframe, I noticed that the instrumentalness column does not look right. Let's see how many 0 values it holds.

In [5]:
instrumentalness_zeros = df[df['instrumentalness'] == 0]

In [6]:
instrumentalness_zeros.shape[0]

434

In [7]:
round(instrumentalness_zeros.shape[0] / df.shape[0] * 100, 2)

55.15

About 55% of the values in this column are zero (434 of 787 rows), so I want to drop it completely.

In [8]:
df.drop('instrumentalness', axis=1, inplace=True)
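
As a cross-check, the share of zeros can be computed for several columns at once instead of one column at a time. This is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "instrumentalness": [0.0, 0.0, 0.00063, 0.0, 0.00007],
    "danceability": [0.878, 0.703, 0.556, 0.739, 0.706],
})

# (toy == 0) is a boolean frame; its column means are the fraction of zeros.
zero_share = (toy == 0).mean() * 100

print(zero_share.round(1))
```

Computing the share from the frame itself avoids typing row counts by hand, which is easy to get wrong when a notebook is reused for another file.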

### Speechiness

In [9]:
low_speechiness = df[df['speechiness'] < 0.3]
low_speechiness.shape[0]

688

The API documentation states that values below 0.33 "represent music and other non-speech-like tracks". I doubt it is accurate that almost my entire dataset (688 of 787 tracks) sits below even 0.3, so let's drop this attribute as well.

In [10]:
df.drop('speechiness', axis=1, inplace=True)
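
To put a number on "almost only", the share of tracks below the documented 0.33 mark can be measured directly. A small sketch, assuming a hypothetical `speechiness` Series; the first five values come from the head of the dataframe above and the last one is made up:

```python
import pandas as pd

# First five values are from df.head(); 0.41 is an invented example value.
speechiness = pd.Series([0.102, 0.0752, 0.0332, 0.226, 0.0324, 0.41])

THRESHOLD = 0.33  # boundary quoted from the API documentation

# Mean of a boolean mask is the fraction of True values.
share_below = (speechiness < THRESHOLD).mean() * 100

print(f"{share_below:.1f}% of tracks are below {THRESHOLD}")
```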

In [11]:
df.describe()

Unnamed: 0,popularity,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_ms,key,year
count,787.0,787.0,787.0,787.0,787.0,787.0,787.0,787.0,787.0,787.0,787.0
mean,56.224905,0.593872,118.116426,0.464227,0.660586,0.281243,-6.958445,0.579416,197018.19695,5.21601,2019.0
std,16.933006,0.179837,31.851629,0.220911,0.161856,0.275642,3.306684,0.493967,53428.723969,3.587382,0.0
min,0.0,0.0561,0.0,0.0,0.0,0.000199,-25.521,0.0,10760.0,0.0,2019.0
25%,46.0,0.489,94.0655,0.2995,0.571,0.05435,-8.168,0.0,175028.0,2.0,2019.0
50%,57.0,0.607,114.986,0.456,0.682,0.176,-6.373,1.0,194810.0,5.0,2019.0
75%,68.0,0.7155,140.0065,0.62,0.7755,0.449,-4.9865,1.0,219513.5,8.0,2019.0
max,100.0,0.985,207.476,0.963,0.968,0.979,-1.205,1.0,667707.0,11.0,2019.0


This should be it for the columns I want to remove

<br>

## 2. Removing rows with zeroes and very low values

When fetching the data, my error handling logged messages about some rows that did not contain any data. I will pick a column such as danceability and check whether it contains any zeros.

In [12]:
danceability_zeros = df[df['danceability'] == 0]

In [13]:
danceability_zeros

Unnamed: 0,id,name,artist,popularity,explicit,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_ms,key,album_name,year
8,0rQtoQXQfwpDW0c7Fw1NeM,!!!!!!!,Billie Eilish,20,False,0.278,0.0,0.0,0.0,0.768,-21.63,1,13578,1,"WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?",2019
97,7A5uLkZbEOzHiAlhD2Hr2L,"First Stop, Arizona - Dialogue",Cast,7,False,0.23,0.0,0.0,0.0,0.878,-23.413,1,10760,6,A Star Is Born Soundtrack,2019
103,1UR9zquKVw87PBAl5b9PDH,How Do You Hear It? - Dialogue,Cast,4,False,0.13,0.0,0.0,0.0,0.606,-21.347,1,14507,9,A Star Is Born Soundtrack,2019
110,2YEslbHiO4TyqNv2BQ2EWJ,SNL - Dialogue,Cast,4,False,0.0604,0.0,0.0,0.0,0.408,-20.782,1,13147,6,A Star Is Born Soundtrack,2019
417,5e4LIAQI0bClLazNf2gZV0,EXACTLY WHAT YOU RUN FROM YOU END UP CHASING,"Tyler, The Creator",9,False,0.478,0.0,0.0,0.0,0.927,-16.569,0,14627,11,IGOR,2019


Removing these five rows should not be a problem: they are very short intros and dialogue snippets, so dropping them won't really affect the analysis later on.

In [14]:
df = df[df['danceability'] != 0]

I also noticed some songs with very low popularity. Since these playlists are hit lists, I want to remove these outliers from the dataset. I think a popularity score of 5 is a good threshold to set, so everything at or below it gets dropped.

In [17]:
low_pop = df[df['popularity'] <= 5]

In [18]:
low_pop

Unnamed: 0,id,name,artist,popularity,explicit,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_ms,key,album_name,year
108,1OlU1rn0QOG40zQ699g3hV,Vows - Dialogue,Cast,5,False,0.264,92.748,0.169,0.789,0.711,-23.195,0,17933,0,A Star Is Born Soundtrack,2019
732,5PbAxxkcKnbkdozI7ZvFgZ,Tick Tock (with Clean Bandit ft. 24kGoldn),Clean Bandit,0,False,0.704,101.022,0.95,0.778,0.385,-3.903,1,178200,0,High Expectations,2019
733,1rzHuxpjVx8FHy93ep8Aba,West Ten (with AJ Tracey),AJ Tracey,0,True,0.823,129.986,0.867,0.843,0.318,-4.904,0,213693,10,High Expectations,2019
751,71xTDzNV4aufQe6InpbjZf,My Lover (with Not3s),Not3s,0,False,0.569,106.044,0.621,0.823,0.188,-7.112,0,192627,6,High Expectations,2019


In [19]:
df = df[df['popularity'] > 5]
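
Both row filters in this section could also be expressed as a single boolean mask with a named threshold. This is only a sketch on a toy frame; `MIN_POPULARITY`, `toy` and `filtered` are hypothetical names, and the rows are a small made-up sample loosely based on the tables above:

```python
import pandas as pd

MIN_POPULARITY = 5  # rows at or below this score are dropped

toy = pd.DataFrame({
    "name": ["Rodeo", "SNL - Dialogue", "Vows - Dialogue", "Tick Tock"],
    "popularity": [67, 4, 5, 0],
    "danceability": [0.706, 0.0, 0.789, 0.778],
})

# Keep rows that have audio features AND clear the popularity threshold.
keep = (toy["danceability"] != 0) & (toy["popularity"] > MIN_POPULARITY)

# .copy() gives an independent frame instead of a view-like slice.
filtered = toy[keep].copy()

print(filtered["name"].tolist())
```

Taking a `.copy()` is optional, but it tends to avoid pandas' `SettingWithCopyWarning` when new column values are assigned to the filtered frame later on.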

Let's have a look at the summary statistics again to see if they look better.

In [20]:
df.describe()

Unnamed: 0,popularity,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_ms,key,year
count,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0
mean,56.812339,0.596197,118.930369,0.466246,0.664072,0.277826,-6.855323,0.579692,198437.338046,5.213368,2019.0
std,16.106571,0.177559,30.565716,0.217531,0.153508,0.274092,3.071913,0.493926,51248.508605,3.582199,0.0
min,8.0,0.0561,56.94,0.0315,0.131,0.000199,-25.521,0.0,19133.0,0.0,2019.0
25%,47.0,0.492,94.5205,0.30625,0.572,0.0538,-8.09575,0.0,175387.25,2.0,2019.0
50%,57.0,0.6095,114.996,0.457,0.682,0.173,-6.3635,1.0,194991.0,5.0,2019.0
75%,69.0,0.716,140.0295,0.62,0.77475,0.442,-4.98575,1.0,220283.75,8.0,2019.0
max,100.0,0.985,207.476,0.963,0.968,0.979,-1.205,1.0,667707.0,11.0,2019.0


<br>

## 3. Converting the duration of songs (milliseconds to minutes)

Since milliseconds are awkward to read, I'd like to convert this value to minutes, a unit that everyone understands.

In [21]:
df = df.rename(columns={'duration_ms': 'duration_minutes'})

df['duration_minutes'] = df['duration_minutes'] / 60000


Let's also round these minutes to one decimal place

In [22]:
df['duration_minutes'] = df['duration_minutes'].round(1)

In [23]:
df['duration_minutes'].head(3)

0    2.6
1    1.9
2    2.7
Name: duration_minutes, dtype: float64
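
Note that decimal minutes are not the same as a minutes:seconds reading: 2.6 minutes is 2 min 36 s, not 2 min 60 s. The sketch below reproduces the conversion on the first three durations from the head of the dataframe and, as an optional extra that is not part of the cleaning itself, derives an `m:ss` label (the `mm_ss` name is my own):

```python
import pandas as pd

# First three duration_ms values from df.head().
duration_ms = pd.Series([157067, 114893, 162720])

# Decimal minutes, rounded to the nearest tenth (not rounded up).
minutes = (duration_ms / 60_000).round(1)

# Optional: a human-readable m:ss label built from whole seconds.
seconds = duration_ms // 1000
mm_ss = seconds.apply(lambda s: f"{int(s) // 60}:{int(s) % 60:02d}")

print(minutes.tolist())  # [2.6, 1.9, 2.7]
print(mm_ss.tolist())    # ['2:37', '1:54', '2:42']
```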

<br>

## 4. Changing keys from integers to strings

Next up, I would like to change the names of the keys. Spotify names these using pitch class notation.

<img src="https://davidkulma.com/wp-content/uploads/2016/08/Integer-Circle.001.png" width="300"/>

In [24]:
key_mapping = {
    0: "C",
    1: "Csharp/Dflat",
    2: "D",
    3: "Dsharp/Eflat",
    4: "E",
    5: "F",
    6: "Fsharp/Gflat",
    7: "G",
    8: "Gsharp/Aflat",
    9: "A",
    10: "Asharp/Bflat",
    11: "B"
}

In [25]:
df['key'] = df['key'].map(key_mapping)

In [26]:
df['key'].head()

0    Fsharp/Gflat
1               F
2               C
3               A
4               A
Name: key, dtype: object
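
One thing worth checking after a `.map()` with a dictionary: any value that is missing from the dictionary silently becomes `NaN`. A small sketch of that check, assuming a hypothetical `keys` Series where the last value, -1, is not in the mapping (I believe Spotify uses -1 when no key was detected, but treat that as an assumption):

```python
import pandas as pd

PITCH_CLASSES = [
    "C", "Csharp/Dflat", "D", "Dsharp/Eflat", "E", "F",
    "Fsharp/Gflat", "G", "Gsharp/Aflat", "A", "Asharp/Bflat", "B",
]
# Same mapping as above, built from the list position (0 -> "C", ...).
key_mapping = dict(enumerate(PITCH_CLASSES))

# First four values are from df.head(); -1 is an invented unmapped value.
keys = pd.Series([6, 5, 0, 9, -1])

mapped = keys.map(key_mapping)

# Original values that had no entry in the mapping.
unmapped = keys[mapped.isna()].unique()

print(mapped.tolist())
print("unmapped values:", unmapped.tolist())
```

In this dataset `df.info()` reports no nulls in `key`, so every value was covered by the mapping.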

That looks better and much easier to follow

In [27]:
df

Unnamed: 0,id,name,artist,popularity,explicit,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_minutes,key,album_name,year
0,2YpeDb67231RjR0MgVLzsG,Old Town Road - Remix,Lil Nas X,79,False,0.619,136.041,0.639,0.878,0.0533,-5.560,1,2.6,Fsharp/Gflat,7 EP,2019
1,6fTt0CH2t0mdeB2N9XFG5r,Panini,Lil Nas X,62,False,0.594,153.848,0.475,0.703,0.3420,-6.146,0,1.9,F,7 EP,2019
2,1ABQT5SxlUTNapSbSzblGx,F9mily (You & Me),Lil Nas X,48,False,0.534,170.054,0.408,0.556,0.0190,-7.750,1,2.7,C,7 EP,2019
3,3qIV7Rnj3ZxLs2JcLPUbFV,Kick It,Lil Nas X,53,True,0.484,151.878,0.523,0.739,0.1380,-9.646,1,2.4,A,7 EP,2019
4,4ak7xjvBeBOcJGWFDX9w5n,Rodeo,Lil Nas X,67,True,0.679,140.081,0.657,0.706,0.1390,-5.614,1,2.6,A,7 EP,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
782,0OAyZn9pGuuL2FJih1PM58,Billie Jean,A Boogie Wit da Hoodie,52,True,0.536,139.961,0.375,0.830,0.0919,-6.324,0,2.3,E,Hoodie SZN,2019
783,5h66wqaGklsSa8eSxTwcJT,4 Min Convo (Favorite Song),A Boogie Wit da Hoodie,59,True,0.650,91.053,0.658,0.705,0.3590,-5.654,0,4.3,G,Hoodie SZN,2019
784,7MuhbSNErlmQAGRzhnnIjT,Odee,A Boogie Wit da Hoodie,58,True,0.480,86.049,0.293,0.854,0.2080,-7.388,0,2.7,A,Hoodie SZN,2019
785,76vviJBqmRdMMb9u9npgfj,Pull Up (feat. NAV),A Boogie Wit da Hoodie,50,True,0.370,149.950,0.557,0.828,0.4870,-7.446,0,3.2,G,Hoodie SZN,2019


<br>

## 5. Changing Mode from integers to strings

Currently in the mode column, 1 represents Major and 0 represents Minor

In [28]:
mode_mapping = {
    0: "Minor",
    1: "Major"
}

In [29]:
df['mode'] = df['mode'].map(mode_mapping)

In [30]:
df['mode'].head()

0    Major
1    Minor
2    Major
3    Major
4    Major
Name: mode, dtype: object
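
A quick sanity check for this kind of relabelling: the share of 'Major' after mapping should equal the mean of the original 0/1 column. A minimal sketch using the five `mode` values from the head of the dataframe (the variable names are my own):

```python
import pandas as pd

mode_mapping = {0: "Minor", 1: "Major"}

# The five mode values from df.head() before mapping.
mode = pd.Series([1, 0, 1, 1, 1])
labels = mode.map(mode_mapping)

# Fraction of rows labelled 'Major' after the mapping.
major_share = labels.value_counts(normalize=True)["Major"]

# For a 0/1 column, the mean is the fraction of ones.
assert abs(major_share - mode.mean()) < 1e-9
print(major_share)
```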

That's about it. One last thing I want to do is remove the id column as well. I kept it there in case I ever needed to fetch additional data on the songs, but it was not of much help.

<br>

## 6. Dropping the id column

In [31]:
df.drop('id', axis=1, inplace=True)
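
A notebook cell like this fails with a `KeyError` if it is run a second time, because the column is already gone. One way around that, sketched here on a toy frame (`toy` is a made-up example), is to reassign the result and pass `errors="ignore"`:

```python
import pandas as pd

toy = pd.DataFrame({"id": ["a1", "b2"], "name": ["Panini", "Rodeo"]})

# Reassigning instead of inplace=True; errors="ignore" skips missing labels.
toy = toy.drop(columns=["id"], errors="ignore")

# Running the same line again is now a harmless no-op.
toy = toy.drop(columns=["id"], errors="ignore")

print(toy.columns.tolist())
```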

<br>

## 7. Having a look at the clean dataframe

In [32]:
df

Unnamed: 0,name,artist,popularity,explicit,energy,tempo,positiveness,danceability,acousticness,loudness,mode,duration_minutes,key,album_name,year
0,Old Town Road - Remix,Lil Nas X,79,False,0.619,136.041,0.639,0.878,0.0533,-5.560,Major,2.6,Fsharp/Gflat,7 EP,2019
1,Panini,Lil Nas X,62,False,0.594,153.848,0.475,0.703,0.3420,-6.146,Minor,1.9,F,7 EP,2019
2,F9mily (You & Me),Lil Nas X,48,False,0.534,170.054,0.408,0.556,0.0190,-7.750,Major,2.7,C,7 EP,2019
3,Kick It,Lil Nas X,53,True,0.484,151.878,0.523,0.739,0.1380,-9.646,Major,2.4,A,7 EP,2019
4,Rodeo,Lil Nas X,67,True,0.679,140.081,0.657,0.706,0.1390,-5.614,Major,2.6,A,7 EP,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
782,Billie Jean,A Boogie Wit da Hoodie,52,True,0.536,139.961,0.375,0.830,0.0919,-6.324,Minor,2.3,E,Hoodie SZN,2019
783,4 Min Convo (Favorite Song),A Boogie Wit da Hoodie,59,True,0.650,91.053,0.658,0.705,0.3590,-5.654,Minor,4.3,G,Hoodie SZN,2019
784,Odee,A Boogie Wit da Hoodie,58,True,0.480,86.049,0.293,0.854,0.2080,-7.388,Minor,2.7,A,Hoodie SZN,2019
785,Pull Up (feat. NAV),A Boogie Wit da Hoodie,50,True,0.370,149.950,0.557,0.828,0.4870,-7.446,Minor,3.2,G,Hoodie SZN,2019


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 778 entries, 0 to 786
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              778 non-null    object 
 1   artist            778 non-null    object 
 2   popularity        778 non-null    int64  
 3   explicit          778 non-null    bool   
 4   energy            778 non-null    float64
 5   tempo             778 non-null    float64
 6   positiveness      778 non-null    float64
 7   danceability      778 non-null    float64
 8   acousticness      778 non-null    float64
 9   loudness          778 non-null    float64
 10  mode              778 non-null    object 
 11  duration_minutes  778 non-null    float64
 12  key               778 non-null    object 
 13  album_name        778 non-null    object 
 14  year              778 non-null    int64  
dtypes: bool(1), float64(7), int64(2), object(5)
memory usage: 91.9+ KB


In [34]:
df.describe()

Unnamed: 0,popularity,energy,tempo,positiveness,danceability,acousticness,loudness,duration_minutes,year
count,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0,778.0
mean,56.812339,0.596197,118.930369,0.466246,0.664072,0.277826,-6.855323,3.306812,2019.0
std,16.106571,0.177559,30.565716,0.217531,0.153508,0.274092,3.071913,0.85362,0.0
min,8.0,0.0561,56.94,0.0315,0.131,0.000199,-25.521,0.3,2019.0
25%,47.0,0.492,94.5205,0.30625,0.572,0.0538,-8.09575,2.9,2019.0
50%,57.0,0.6095,114.996,0.457,0.682,0.173,-6.3635,3.25,2019.0
75%,69.0,0.716,140.0295,0.62,0.77475,0.442,-4.98575,3.7,2019.0
max,100.0,0.985,207.476,0.963,0.968,0.979,-1.205,11.1,2019.0


## Exporting to CSV

In [35]:
df.to_csv('../data/modern_albums_cleaned.csv', index=False)
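
Passing `index=False` matters here: if the index is written to the file, reading it back without `index_col` produces an extra `Unnamed: 0` column. A small round-trip sketch using an in-memory buffer instead of a real file (the toy frame is made up for illustration):

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "name": ["Panini", "Rodeo"],
    "mode": ["Minor", "Major"],
    "duration_minutes": [1.9, 2.6],
})

# Write without the index, then read the CSV text back in.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# For comparison: writing WITH the index adds an unnamed first column.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(restored.columns.tolist())
print(with_index.columns.tolist())
```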