# Spotify Top 50 Track Analysis

#### About the data: 
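The top 50 most streamed tracks on Spotify in 2020. Each track is described by identifying fields (artist, album, track name, track ID, genre) and audio features such as energy, danceability, loudness and tempo.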

#### Objective: 
To analyze Spotify's top hits and address the product manager's request to quantify what makes a hit song.

In [3]:
import pandas as pd
import numpy as np

In [71]:
spotify = pd.read_csv('spotifytoptracks.csv', index_col= 0)
spotify.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


#### Data cleaning

In [41]:
# Use the IQR rule to check whether the numeric columns contain outliers
numeric_df = spotify.select_dtypes(include='number')  # numeric columns only
Q1 = numeric_df.quantile(0.25)
Q3 = numeric_df.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Find rows with outliers
outliers = numeric_df[(numeric_df < lower_bound) | (numeric_df > upper_bound)].dropna(how='all')
print("Outliers based on IQR:")
print(outliers)

Outliers based on IQR:
    energy  danceability  key  loudness  acousticness  speechiness  \
0      NaN           NaN  NaN       NaN           NaN          NaN   
1      NaN           NaN  NaN       NaN         0.688          NaN   
2      NaN           NaN  NaN       NaN           NaN          NaN   
3      NaN           NaN  NaN       NaN           NaN          NaN   
7      NaN           NaN  NaN       NaN         0.731          NaN   
9      NaN           NaN  NaN       NaN         0.751          NaN   
10     NaN           NaN  NaN       NaN           NaN          NaN   
12     NaN           NaN  NaN       NaN           NaN          NaN   
16     NaN         0.459  NaN       NaN           NaN          NaN   
18     NaN           NaN  NaN       NaN         0.837          NaN   
19     NaN           NaN  NaN       NaN           NaN        0.487   
24     NaN           NaN  NaN   -14.454         0.902          NaN   
26     NaN           NaN  NaN       NaN           NaN        0.375 

The dataset contains multiple outliers. I am not going to take further action to mitigate them, as this study does not analyse the numerical features extensively. Where an average is needed later in this task, we will use the median, because the data is skewed and the mean is highly sensitive to outliers.
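
To illustrate why the median is the safer choice here, the following minimal sketch applies the same IQR rule to a small, hypothetical set of loudness values (not the real dataset) and compares how far the mean and the median move when the flagged outlier is removed.

```python
import pandas as pd

# Hypothetical loudness values in dB; the last one is an extreme value.
loudness = pd.Series([-5.9, -6.4, -6.7, -5.5, -4.5, -14.5])

# Same IQR fences as in the cell above.
q1, q3 = loudness.quantile(0.25), loudness.quantile(0.75)
iqr = q3 - q1
is_outlier = (loudness < q1 - 1.5 * iqr) | (loudness > q3 + 1.5 * iqr)

# How much each statistic shifts once the flagged values are dropped.
mean_shift = abs(loudness.mean() - loudness[~is_outlier].mean())
median_shift = abs(loudness.median() - loudness[~is_outlier].median())

print("Outliers flagged:", int(is_outlier.sum()))
print(f"Mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

In this toy example the mean moves several times further than the median, which is the behaviour the paragraph above relies on.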

In [42]:
# Flag rows that contain at least one null value (True = the row has a null)

spotify['null_values'] = spotify.isnull().any(axis=1)

In [43]:
# Drop the helper column that was used to check for null values

spotify.drop(columns = 'null_values', inplace = True)

In [59]:
# Check for duplicate rows (True marks a repeat of an earlier row)

spotify.duplicated(keep = 'first')

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
dtype: bool

In [72]:
# Convert duration_ms to an hh:mm:ss string (the slice drops the ':ffffff' microseconds suffix)

spotify['duration_ms'] = pd.to_datetime(spotify['duration_ms'],unit='ms').dt.strftime('%H:%M:%S:%f').str[:-7]


In [73]:
spotify['duration_ms'] = pd.to_timedelta(spotify['duration_ms'], errors='coerce')
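
As an alternative sketch, the two conversion steps above could be collapsed into a single call that goes straight from milliseconds to `Timedelta`. This assumes that truncating to whole seconds is the intended behaviour. The sample values are the `duration_ms` entries of the first three rows shown by `spotify.head()`.

```python
import pandas as pd

# duration_ms values from the first three rows of the dataset preview.
duration_ms = pd.Series([200040, 209755, 196653])

# Convert milliseconds directly to Timedelta, then drop the sub-second part.
duration = pd.to_timedelta(duration_ms, unit="ms").dt.floor("s")

print(duration)
```

This avoids the round trip through a formatted string and keeps the column usable with `idxmax` and `idxmin` later on.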

Further exploration shows that the data is clean, with no null values or duplicates. There are outliers in the dataset, which I chose to leave in place for this study: the numeric features are not analysed extensively, and using the median instead of the mean for average comparisons should be sufficient for our needs.
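
A compact way to back up the "no nulls, no duplicates" conclusion is to reduce both checks to a single number each. The sketch below uses a small, hypothetical frame that deliberately contains one missing value and one duplicated row; on the real `spotify` frame both counts would be expected to be zero.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value and one duplicated row.
df = pd.DataFrame({
    "track_name": ["A", "B", "B", "C"],
    "energy": [0.73, 0.59, 0.59, np.nan],
})

n_missing = int(df.isnull().sum().sum())   # total number of missing cells
n_duplicates = int(df.duplicated().sum())  # rows that repeat an earlier row

print(f"Missing values: {n_missing}, duplicate rows: {n_duplicates}")
```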

#### Content Analysis: 

1 & 2: How many observations and features does this dataset have?

In [25]:
row, col = spotify.shape

print(f"There are {row} observations and {col} features.")

There are 50 observations and 17 features.


3 & 4: Which of the features are numerical and which are categorical?

In [26]:
spotify.dtypes

artist               object
album                object
track_name           object
track_id             object
energy              float64
danceability        float64
key                   int64
loudness            float64
acousticness        float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
genre                object
duration             object
dtype: object

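There are 5 categorical features with data type object (artist, album, track_name, track_id and genre) and 11 numerical features. The output also lists a `duration` column of type object; it is not among the columns loaded from the CSV and is not counted here.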
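
Rather than reading the split off the `dtypes` listing by eye, the two groups can be collected programmatically. This is a minimal sketch on a hypothetical four-column frame; the same two calls should apply to `spotify` unchanged.

```python
import pandas as pd

# Hypothetical frame with two text columns and two numeric columns.
df = pd.DataFrame({
    "artist": ["The Weeknd", "Tones And I"],
    "genre": ["R&B/Soul", "Alternative/Indie"],
    "energy": [0.73, 0.593],
    "key": [1, 6],
})

# Numeric columns (ints and floats) versus everything else.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(f"{len(numerical)} numerical: {numerical}")
print(f"{len(categorical)} categorical: {categorical}")
```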

5: Are there any artists that have more than 1 popular track? If yes, which and how many?


In [251]:
artist_track_count = spotify.groupby('artist').track_name.count()

top_artists = artist_track_count[artist_track_count>1]
print("Top artists are:")
top_artists

Top artists are:


artist
Billie Eilish    3
Dua Lipa         3
Harry Styles     2
Justin Bieber    2
Lewis Capaldi    2
Post Malone      2
Travis Scott     3
Name: track_name, dtype: int64

Seven artists have more than one song in Spotify's 2020 top-tracks list.

6: Who was the most popular artist?


In [252]:
most_popular_artist = artist_track_count[artist_track_count == artist_track_count.max()]
print(f"The most popular artists are:\n{most_popular_artist}")

The most popular artists are:
artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Name: track_name, dtype: int64
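
Because three artists tie for first place, it matters that the lookup keeps every artist at the maximum rather than just the first one. The sketch below shows the same idea with `value_counts` on a short, hypothetical list of artist names.

```python
import pandas as pd

# Hypothetical artist column in which two artists tie for the most tracks.
artists = pd.Series(
    ["Dua Lipa", "Dua Lipa", "Travis Scott", "Travis Scott", "Post Malone"],
    name="artist",
)

counts = artists.value_counts()
top = counts[counts == counts.max()]  # keeps all tied artists

print(top)
```

Using `counts.head(1)` or `counts.idxmax()` instead would report only one of the tied artists.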


7: How many artists in total have their songs in the top 50?

In [253]:
print(f"There are {spotify.artist.nunique()} artists who have their songs in the top 50.")

There are 40 artists who have their songs in the top 50.


8: Are there any albums that have more than 1 popular track? If yes, which and how many?

In [254]:
track_count = spotify.groupby('album').track_name.count()

print(f"The following albums have more than 1 popular track: ")
track_count[track_count>1]

The following albums have more than 1 popular track: 


album
Changes                 2
Fine Line               2
Future Nostalgia        3
Hollywood's Bleeding    2
Name: track_name, dtype: int64

9: How many albums in total have their songs in the top 50?

In [255]:
print(f"{spotify.album.nunique()} albums have their songs in the top 50.")

45 albums have their songs in the top 50.


10: Which tracks have a danceability score above 0.7?

In [16]:
spotify[['track_name', 'danceability']][spotify.danceability>0.7]

Unnamed: 0,track_name,danceability
1,Dance Monkey,0.825
2,The Box,0.896
3,Roses - Imanbek Remix,0.785
4,Don't Start Now,0.793
5,ROCKSTAR (feat. Roddy Ricch),0.746
7,death bed (coffee for your head),0.726
8,Falling,0.784
10,Tusa,0.803
13,Blueberry Faygo,0.774
14,Intentions (feat. Quavo),0.806


Out of 50 songs, 32 have a danceability score greater than 0.7. So do the majority of the top 10 songs: 7 of them appear in the table above.
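
Counts like the one above can be computed from the boolean mask itself instead of being read off the displayed table. This is a minimal sketch using the danceability values of the first five tracks from the dataset preview; the variable names are illustrative.

```python
import pandas as pd

# Danceability of the first five tracks, as shown by spotify.head().
df = pd.DataFrame({
    "track_name": ["Blinding Lights", "Dance Monkey", "The Box",
                   "Roses - Imanbek Remix", "Don't Start Now"],
    "danceability": [0.514, 0.825, 0.896, 0.785, 0.793],
})

mask = df["danceability"] > 0.7
n_above = int(mask.sum())   # True counts as 1, so the sum is the row count
share = mask.mean()         # fraction of tracks above the threshold
selected = df.loc[mask, ["track_name", "danceability"]]

print(f"{n_above} of {len(df)} tracks ({share:.0%}) are above 0.7")
```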

11: Which tracks have a danceability score below 0.4?

In [257]:
spotify[['track_name', 'danceability']][spotify.danceability<0.4]

Unnamed: 0,track_name,danceability
44,lovely (with Khalid),0.351


12: Which tracks have their loudness above -5?

In [18]:
spotify[['track_name', 'loudness']][spotify.loudness>-5]

Unnamed: 0,track_name,loudness
4,Don't Start Now,-4.521
6,Watermelon Sugar,-4.209
10,Tusa,-3.28
12,Circles,-3.497
16,Before You Go,-4.858
17,Say So,-4.577
21,Adore You,-3.675
23,Mood (feat. iann dior),-3.558
31,Break My Heart,-3.434
32,Dynamite,-4.41


19 of the tracks have loudness greater than -5.

13: Which tracks have their loudness below -8?

In [259]:
spotify[['track_name', 'loudness']][spotify.loudness<-8]

Unnamed: 0,track_name,loudness
7,death bed (coffee for your head),-8.765
8,Falling,-8.756
15,Toosie Slide,-8.82
20,Savage Love (Laxed - Siren Beat),-8.52
24,everything i wanted,-14.454
26,bad guy,-10.965
36,HIGHEST IN THE ROOM,-8.764
44,lovely (with Khalid),-10.109
47,If the World Was Ending - feat. Julia Michaels,-10.086


14: Which track is the longest?

In [80]:
longest = spotify.loc[spotify.duration_ms.idxmax()]  # idxmax returns an index label, so use .loc

longest_duration = str(longest.duration_ms).split(" ")[-1]  # Extracts only the time portion

print(f"The longest track is {longest.track_name} with duration {longest_duration} (hh:mm:ss).")

The longest track is SICKO MODE with duration 00:05:12 (hh:mm:ss).


15: Which track is the shortest?

In [82]:
shortest = spotify.loc[spotify.duration_ms.idxmin()]  # idxmin returns an index label, so use .loc

shortest_duration = str(shortest.duration_ms).split(" ")[-1]  # Extracts only the time portion

print(f"The shortest track is {shortest.track_name} with duration {shortest_duration} (hh:mm:ss).")

The shortest track is Mood (feat. iann dior) with duration 00:02:20 (hh:mm:ss).
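
One detail worth spelling out for the two lookups above: `idxmax` and `idxmin` return index labels, not row positions. With this dataset's 0 to 49 index the two happen to coincide, but the sketch below uses a hypothetical frame with a non-default index to show where they diverge.

```python
import pandas as pd

# Hypothetical frame whose index labels do not match row positions.
df = pd.DataFrame(
    {"track_name": ["A", "B", "C"], "duration_ms": [200040, 312820, 140526]},
    index=[10, 20, 30],
)

label = df["duration_ms"].idxmax()  # index label of the longest track
longest = df.loc[label]             # label-based lookup matches idxmax

print(label, longest["track_name"])
# df.iloc[label] would raise IndexError here, because position 20 does not exist.
```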


16: Which genre is the most popular?

In [262]:
print(f"The most popular genre is:\n{spotify.genre.value_counts().head(1)}")

The most popular genre is:
Pop    14
Name: genre, dtype: int64


17: Which genres have just one song in the top 50?

In [263]:
num_of_songs = spotify.groupby('genre').track_name.count()
num_of_songs[num_of_songs == 1]


genre
Alternative/reggaeton/experimental    1
Chamber pop                           1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Hip-Hop/Trap                          1
Nu-disco                              1
Pop rap                               1
Pop/Soft Rock                         1
R&B/Hip-Hop alternative               1
Name: track_name, dtype: int64

18: How many genres in total are represented in the top 50?

In [264]:
print(f"The total number of genres represented in the top 50 is {spotify.genre.nunique()}.")

The total number of genres represented in the top 50 is 16.


19: Which features are strongly positively correlated?

In [265]:
numeric_feat = spotify.select_dtypes(include='number')
correlation_matrix = numeric_feat.corr()

threshold = 0.7

strong_positive_corr = correlation_matrix[(correlation_matrix > threshold) & (correlation_matrix < 1.0)]

print("Strongly positively correlated features:")
strong_positive_corr.dropna(how='all', axis=0).dropna(how='all', axis=1)

Strongly positively correlated features:


Unnamed: 0,energy,loudness
energy,,0.79164
loudness,0.79164,


20: Which features are strongly negatively correlated?

In [271]:
numeric_feat = spotify.select_dtypes(include='number')
correlation_matrix = numeric_feat.corr()

threshold = -0.6  # note: a looser cutoff than the 0.7 used for positive correlations

strong_negative_corr = correlation_matrix[(correlation_matrix < threshold) & (correlation_matrix > -1.0)]

print("Strongly negatively correlated features: ")
strong_negative_corr.dropna(how='all', axis=0).dropna(how='all', axis=1)


Strongly negatively correlated features: 


Unnamed: 0,energy,acousticness
energy,,-0.682479
acousticness,-0.682479,


21: Which features are not correlated?

In [110]:
numeric_feat = spotify.select_dtypes(include='number')
correlation_matrix = numeric_feat.corr()

low_threshold = -0.1

up_threshold = 0.1

no_corr = correlation_matrix[(correlation_matrix > low_threshold) & (correlation_matrix < up_threshold)]

print("Features that do not correlate are:")
no_corr.stack().drop_duplicates(keep = 'first', inplace = False)

Features that do not correlate are:


energy            key                 0.062428
                  speechiness         0.074267
                  liveness            0.069487
                  tempo               0.075191
danceability      instrumentalness   -0.017706
                  liveness           -0.006648
key               loudness           -0.009178
                  speechiness        -0.094965
                  instrumentalness    0.020802
                  tempo               0.080475
loudness          speechiness        -0.021693
                  liveness           -0.069939
speechiness       instrumentalness    0.028948
                  valence             0.053867
instrumentalness  liveness           -0.087034
                  tempo               0.018853
liveness          valence            -0.033366
                  tempo               0.025457
valence           tempo               0.045089
dtype: float64
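
The three correlation questions above all filter a symmetric matrix, so each pair appears twice unless duplicates are removed afterwards. A possible alternative, sketched here on hypothetical values rather than the real data, is to keep only the upper triangle of the matrix so that every pair is listed exactly once.

```python
import numpy as np
import pandas as pd

# Hypothetical values: loudness tracks energy closely, key does not.
df = pd.DataFrame({
    "energy": [1, 2, 3, 4, 5],
    "loudness": [2, 4, 6, 8, 11],
    "key": [1, -1, 0, -1, 1],
})

corr = df.corr()

# Keep only entries strictly above the diagonal (k=1 excludes the diagonal itself).
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()  # one row per unordered pair of features

strong = pairs[pairs > 0.7]      # strongly positively correlated pairs
weak = pairs[pairs.abs() < 0.1]  # pairs with little or no linear correlation

print(strong)
print(weak)
```

Unlike `drop_duplicates` on the correlation values, this does not depend on two different pairs never sharing the same coefficient.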

22: How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [84]:
genre_dance_score = spotify.groupby('genre').danceability.median()
select_genre = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

print("The danceability score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:")

genre_dance_score[genre_dance_score.index.isin(select_genre)]


The danceability score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:


genre
Alternative/Indie    0.663
Dance/Electronic     0.785
Hip-Hop/Rap          0.774
Pop                  0.690
Name: danceability, dtype: float64

23: How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [85]:
genre_loudness_score = spotify.groupby('genre').loudness.median()

print("The loudness score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:")
genre_loudness_score[genre_loudness_score.index.isin(select_genre)]

The loudness score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:


genre
Alternative/Indie   -5.2685
Dance/Electronic    -5.4570
Hip-Hop/Rap         -7.6480
Pop                 -6.6445
Name: loudness, dtype: float64

24: How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [86]:
genre_acoustic_score = spotify.groupby('genre').acousticness.median()

print("The acousticness score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:")
genre_acoustic_score[genre_acoustic_score.index.isin(select_genre)]


The acousticness score comparison between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres:


genre
Alternative/Indie    0.6460
Dance/Electronic     0.0686
Hip-Hop/Rap          0.1450
Pop                  0.2590
Name: acousticness, dtype: float64
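
The three genre comparisons above could also be produced as a single table, which makes it easier to read the features side by side. The following is a minimal sketch on hypothetical values and only two of the four genres; `features` and `summary` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical tracks; the Nu-disco row is there to show the genre filter working.
df = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Nu-disco"],
    "danceability": [0.60, 0.70, 0.80, 0.90, 0.793],
    "loudness": [-6.0, -7.0, -8.0, -7.0, -4.521],
    "acousticness": [0.20, 0.30, 0.10, 0.20, 0.0123],
})

select_genre = ["Pop", "Hip-Hop/Rap"]
features = ["danceability", "loudness", "acousticness"]

# Filter to the genres of interest, then take the median of each feature per genre.
summary = (
    df[df["genre"].isin(select_genre)]
    .groupby("genre")[features]
    .median()
)

print(summary)
```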