# Introduction

This notebook analyzes the Top 50 Spotify Tracks 2020 dataset. It walks through data cleaning, exploratory data analysis, correlation calculations, and data comparisons, and closes with findings and possible future improvements.

In [3]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Load the data and clean column names

In [4]:
df = pd.read_csv(
    '/Users/murtaza.aziz/Desktop/Turing College Tasks/Datasets/spotifytoptracks.csv', index_col=0)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


# Data Cleaning

### Checking if any rows are NULL

In [None]:
# isnull() flags missing cells; summing twice gives the total number of missing values in the dataset

df.isnull().sum().sum()

np.int64(0)

### Checking if there are any duplicates

In [None]:
# duplicated() flags rows that repeat an earlier row across all columns; sum() counts them

df.duplicated().sum()

np.int64(0)

### Identifying Outliers

The output below shows, for each numeric feature, the percentage of values flagged as outliers by the 1.5 * IQR rule. These outliers were not removed from the dataset before performing EDA.

In [None]:
numeric_cols = df.select_dtypes(include=np.number).columns

Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df[(df[numeric_cols] < lower_bound) |
                  (df[numeric_cols] > upper_bound)]


# outliers_iqr holds NaN wherever a value is within bounds,
# so the non-NaN share of each column is its outlier percentage
outlier_percentages = {}
for col in numeric_cols:
    outlier_percentages[col] = outliers_iqr[col].notna().sum() / len(df) * 100
    print(f"Percentage of outliers in {col}: {outlier_percentages[col]:.2f}%")

Percentage of outliers in energy: 0.00%
Percentage of outliers in danceability: 6.00%
Percentage of outliers in key: 0.00%
Percentage of outliers in loudness: 2.00%
Percentage of outliers in acousticness: 14.00%
Percentage of outliers in speechiness: 12.00%
Percentage of outliers in instrumentalness: 24.00%
Percentage of outliers in liveness: 6.00%
Percentage of outliers in valence: 0.00%
Percentage of outliers in tempo: 0.00%
Percentage of outliers in duration_ms: 4.00%
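
The per-column percentages above can also be collected into a single table, which is one way to make the outlier output easier to read. The following is a minimal sketch on a small made-up frame: `demo` and its values are illustrative and not taken from the dataset, and the helper name `iqr_outlier_summary` is hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outlier_summary(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return IQR bounds, outlier counts and outlier percentages per numeric column."""
    numeric = frame.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Boolean frame: True where a value falls outside [lower, upper]
    is_outlier = numeric.lt(lower, axis=1) | numeric.gt(upper, axis=1)
    counts = is_outlier.sum()
    return pd.DataFrame({
        "lower_bound": lower,
        "upper_bound": upper,
        "outlier_count": counts,
        "outlier_pct": counts / len(numeric) * 100,
    })


# Illustrative data only (not the Spotify dataset)
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
summary = iqr_outlier_summary(demo)
print(summary)
```

Returning a dataframe instead of printing inside a loop keeps the bounds next to the counts, so the result can be sorted or filtered like any other table.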


# Exploratory data analysis

How many observations and features are there in this dataset?

In [120]:
print(f"Observations: {len(df)}")
print(f"Features: {len(df.columns)}")

Observations: 50
Features: 16


Which of the features are categorical?

In [None]:
cat_features = df.select_dtypes(
    include=['object', 'category']).columns.insert(5, 'key').tolist()
print(f"Categorical Features: {cat_features}")

Categorical Features: ['artist', 'album', 'track_name', 'track_id', 'genre', 'key']


Which of the features are numeric?

In [None]:
num_features = df.select_dtypes(
    include=['number']).columns.drop('key').tolist()

print(f"Numeric Features: {num_features}")

Numeric Features: ['energy', 'danceability', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']
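
`key` is stored as an integer but encodes a pitch class, which is why it is moved between the two lists by hand above. An alternative, sketched here on a made-up frame (`demo` is illustrative, not the dataset), is to cast the column to the `category` dtype once so that `select_dtypes` separates the two groups without manual `insert` and `drop` calls.

```python
import pandas as pd

# Illustrative data only; column names mirror the dataset
demo = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "key": [1, 6, 10],
    "energy": [0.73, 0.59, 0.58],
})

# Treat the integer-coded musical key as a categorical label
demo["key"] = demo["key"].astype("category")

# Everything that is not numeric counts as categorical here
cat_cols = demo.select_dtypes(exclude="number").columns.tolist()
num_cols = demo.select_dtypes(include="number").columns.tolist()
print(cat_cols)
print(num_cols)
```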


Are there any artists that have more than 1 popular track? If yes, which and how many?

In [123]:
track_counts = df['artist'].value_counts()
artists_with_multiple_tracks = track_counts[track_counts > 1]
artist_count = artists_with_multiple_tracks.count()
print(artists_with_multiple_tracks)
print(f"There are {artist_count} artists with more than 1 popular track.")

artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Justin Bieber    2
Harry Styles     2
Lewis Capaldi    2
Post Malone      2
Name: count, dtype: int64
There are 7 artists with more than 1 popular track.


Who was the most popular artist?

In [124]:
max_tracks = track_counts.max()
most_popular_artists = track_counts[track_counts == max_tracks]
print(
    f"The most popular artists, with {max_tracks} tracks each, are:\n {most_popular_artists}")

The most popular artists, with 3 tracks each, are:
 artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Name: count, dtype: int64


How many artists have their songs in the top 50 in total?

In [125]:
unique_artist = df['artist'].nunique()
print(f"Number of artists in the top 50: {unique_artist}")

Number of artists in the top 50: 40


Are there any albums that have more than 1 popular track? If yes, which and how many?

In [126]:
album_counts = df['album'].value_counts()

print("Albums with > 1 track:\n", album_counts[album_counts > 1])

Albums with > 1 track:
 album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64


How many albums in total have their songs in the top 50?

In [127]:
print("Unique albums in the top 50: ", df['album'].nunique())

Unique albums in the top 50:  45


Which tracks have a danceability score above 0.7?

In [128]:
high_danceability = df[df['danceability'] > 0.7]
high_danceability_sorted = high_danceability.sort_values(
    'danceability', ascending=False)
high_danceability_sorted[['artist', 'track_name', 'danceability']]

Unnamed: 0,artist,track_name,danceability
27,Cardi B,WAP (feat. Megan Thee Stallion),0.935
2,Roddy Ricch,The Box,0.896
39,Regard,Ride It,0.88
28,Surfaces,Sunday Best,0.878
33,BENEE,Supalonely (feat. Gus Dapperton),0.862
40,Travis Scott,goosebumps,0.841
49,Travis Scott,SICKO MODE,0.834
15,Drake,Toosie Slide,0.83
1,Tones And I,Dance Monkey,0.825
29,Eminem,Godzilla (feat. Juice WRLD),0.808


Which tracks have a danceability score below 0.4?

In [129]:
low_danceability = df[df['danceability'] < 0.4]
low_danceability_sorted = low_danceability.sort_values(
    'danceability', ascending=False)
low_danceability_sorted[['artist', 'track_name', 'danceability']]

Unnamed: 0,artist,track_name,danceability
44,Billie Eilish,lovely (with Khalid),0.351


Which tracks have their loudness above -5?

In [130]:
high_loudness = df[df['loudness'] > -5]
high_loudness_sorted = high_loudness.sort_values('loudness', ascending=False)
high_loudness_sorted[['artist', 'track_name', 'loudness']]

Unnamed: 0,artist,track_name,loudness
10,KAROL G,Tusa,-3.28
40,Travis Scott,goosebumps,-3.37
31,Dua Lipa,Break My Heart,-3.434
38,Maluma,Hawái,-3.454
12,Post Malone,Circles,-3.497
23,24kGoldn,Mood (feat. iann dior),-3.558
21,Harry Styles,Adore You,-3.675
49,Travis Scott,SICKO MODE,-3.714
48,Dua Lipa,Physical,-3.756
35,Lady Gaga,Rain On Me (with Ariana Grande),-3.764


Which tracks have their loudness below -8?

In [131]:
low_loudness = df[df['loudness'] < -8]
low_loudness_sorted = low_loudness.sort_values('loudness', ascending=False)
low_loudness_sorted[['artist', 'track_name', 'loudness']]

Unnamed: 0,artist,track_name,loudness
20,Jawsh 685,Savage Love (Laxed - Siren Beat),-8.52
8,Trevor Daniel,Falling,-8.756
36,Travis Scott,HIGHEST IN THE ROOM,-8.764
7,Powfu,death bed (coffee for your head),-8.765
15,Drake,Toosie Slide,-8.82
47,JP Saxe,If the World Was Ending - feat. Julia Michaels,-10.086
44,Billie Eilish,lovely (with Khalid),-10.109
26,Billie Eilish,bad guy,-10.965
24,Billie Eilish,everything i wanted,-14.454


Which track is the longest?

In [132]:
longest_track = df.loc[df['duration_ms'].idxmax()]
print(
    f"Longest track is {longest_track['track_name']} by {longest_track['artist']}, with a duration of {longest_track['duration_ms']} milliseconds.")

Longest track is SICKO MODE by Travis Scott, with a duration of 312820 milliseconds.


Which track is the shortest?

In [133]:
shortest_track = df.loc[df['duration_ms'].idxmin(
)][['track_name', 'artist', 'duration_ms']]
print(
    f"Shortest track is {shortest_track['track_name']} by {shortest_track['artist']}, with a duration of {shortest_track['duration_ms']} milliseconds.")

Shortest track is Mood (feat. iann dior) by 24kGoldn, with a duration of 140526 milliseconds.


Which genre is the most popular?

In [134]:
genre_counts = df['genre'].value_counts()
most_pop_genre = genre_counts.idxmax()
most_pop_count = genre_counts.max()
print(f"Most popular genre is {most_pop_genre} with {most_pop_count} tracks")

Most popular genre is Pop with 14 tracks


Which genres have just one song in the top 50?

In [135]:
genres_with_one_song = genre_counts[genre_counts == 1]
print(f"There are {genres_with_one_song.count()} genres that have just one song in the top 50:\n{genres_with_one_song}")

There are 10 genres that have just one song in the top 50:
genre
Nu-disco                              1
R&B/Hip-Hop alternative               1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: count, dtype: int64


How many genres in total are represented in the top 50?

In [136]:
unique_genre = genre_counts.count()
print(f"Unique genres: {unique_genre}")

Unique genres: 16


# Correlation calculations
The following cell computes the pairwise Pearson correlation between the numeric features.
The output is a dataframe of those Pearson correlation coefficients.

In [43]:
correlation = df[num_features].corr()
correlation

Unnamed: 0,energy,danceability,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
energy,1.0,0.152552,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
danceability,0.152552,1.0,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
loudness,0.79164,0.167147,1.0,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
acousticness,-0.682479,-0.359135,-0.498695,1.0,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988
speechiness,0.074267,0.226148,-0.021693,-0.135392,1.0,0.028948,-0.142957,0.053867,0.215504,0.366976
instrumentalness,-0.385515,-0.017706,-0.553735,0.352184,0.028948,1.0,-0.087034,-0.203283,0.018853,0.184709
liveness,0.069487,-0.006648,-0.069939,-0.128384,-0.142957,-0.087034,1.0,-0.033366,0.025457,-0.090188
valence,0.393453,0.479953,0.406772,-0.243192,0.053867,-0.203283,-0.033366,1.0,0.045089,-0.039794
tempo,0.075191,0.168956,0.102097,-0.241119,0.215504,0.018853,0.025457,0.045089,1.0,0.130328
duration_ms,0.081971,-0.033763,0.06413,-0.010988,0.366976,0.184709,-0.090188,-0.039794,0.130328,1.0
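
For reference, the Pearson coefficient that `DataFrame.corr()` computes by default is the covariance of two columns divided by the product of their standard deviations. The sketch below checks that definition against pandas on a few made-up values (`x` and `y` are illustrative, not columns from the dataset).

```python
import numpy as np
import pandas as pd

# Illustrative values only
x = pd.Series([0.2, 0.4, 0.5, 0.7, 0.9])
y = pd.Series([-9.0, -7.5, -6.0, -5.5, -4.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from the respective means
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # method="pearson" is the default
print(r_manual, r_pandas)
```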


In the following cell, the correlation matrix is reshaped into long format, the columns are renamed, self-pairs and mirrored duplicate pairs are removed, and the result is sorted by correlation.
The output is a dataframe listing each pair of audio features with its Pearson correlation coefficient.

In [44]:
# Reshape the correlation matrix into long format (one row per feature pair)
stacked_correlation = correlation.stack().reset_index()

# Rename the columns
stacked_correlation.columns = [
    'audio_feature_1', 'audio_feature_2', 'correlation']

# Create a mask for mirrored duplicate pairs (A-B vs. B-A) and self-pairs (A-A)
mask_corr = (stacked_correlation[['audio_feature_1', 'audio_feature_2']].apply(
    frozenset, axis=1).duplicated()) | (stacked_correlation['audio_feature_1'] == stacked_correlation['audio_feature_2'])

# Apply mask to original correlation dataframe
stacked_correlation = stacked_correlation[~mask_corr]

# Sort the filtered correlation dataframe
sorted_correlation = stacked_correlation.sort_values(
    by='correlation', ascending=False)
sorted_correlation

Unnamed: 0,audio_feature_1,audio_feature_2,correlation
2,energy,loudness,0.79164
17,danceability,valence,0.479953
27,loudness,valence,0.406772
7,energy,valence,0.393453
49,speechiness,duration_ms,0.366976
35,acousticness,instrumentalness,0.352184
14,danceability,speechiness,0.226148
48,speechiness,tempo,0.215504
59,instrumentalness,duration_ms,0.184709
18,danceability,tempo,0.168956


Which features are strongly positively correlated?

In [45]:
positive_corr = sorted_correlation[(sorted_correlation['correlation'] > 0.7)]
positive_corr

Unnamed: 0,audio_feature_1,audio_feature_2,correlation
2,energy,loudness,0.79164


Which features are strongly negatively correlated?

In [46]:
negative_corr = sorted_correlation[(sorted_correlation['correlation'] < -0.6)]
negative_corr

Unnamed: 0,audio_feature_1,audio_feature_2,correlation
3,energy,acousticness,-0.682479


Which features are not correlated?

In [47]:
not_corr = sorted_correlation[(sorted_correlation['correlation'] < 0.2) & (
    sorted_correlation['correlation'] > -0.2)]
not_corr

Unnamed: 0,audio_feature_1,audio_feature_2,correlation
59,instrumentalness,duration_ms,0.184709
18,danceability,tempo,0.168956
12,danceability,loudness,0.167147
1,energy,danceability,0.152552
89,tempo,duration_ms,0.130328
28,loudness,tempo,0.102097
9,energy,duration_ms,0.081971
8,energy,tempo,0.075191
4,energy,speechiness,0.074267
6,energy,liveness,0.069487


# Data Comparisons
How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [52]:
genre_danceability = df.groupby('genre')['danceability'].describe()
genre_of_interest = ['Pop', 'Hip-Hop/Rap',
                     'Dance/Electronic', 'Alternative/Indie']
genre_comparison = genre_danceability[genre_danceability.index.isin(
    genre_of_interest)]
print(
    f"For the selected genres, the danceability score statistics are as follows:\n {genre_comparison}")

For the selected genres, the danceability score statistics are as follows:
                    count      mean       std    min      25%    50%      75%  \
genre                                                                          
Alternative/Indie    4.0  0.661750  0.211107  0.459  0.49050  0.663  0.83425   
Dance/Electronic     5.0  0.755000  0.094744  0.647  0.67400  0.785  0.78900   
Hip-Hop/Rap         13.0  0.765538  0.085470  0.598  0.72600  0.774  0.83000   
Pop                 14.0  0.677571  0.109853  0.464  0.61575  0.690  0.76275   

                     max  
genre                     
Alternative/Indie  0.862  
Dance/Electronic   0.880  
Hip-Hop/Rap        0.896  
Pop                0.806  


How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [53]:
genre_loudness = df.groupby('genre')['loudness'].describe()
genre_of_interest = ['Pop', 'Hip-Hop/Rap',
                     'Dance/Electronic', 'Alternative/Indie']
genre_comparison = genre_loudness[genre_loudness.index.isin(genre_of_interest)]
print(
    f"For the selected genres, the loudness score statistics are as follows:\n {genre_comparison}")

For the selected genres, the loudness score statistics are as follows:
                    count      mean       std     min      25%     50%  \
genre                                                                   
Alternative/Indie    4.0 -5.421000  0.774502  -6.401 -5.85950 -5.2685   
Dance/Electronic     5.0 -5.338000  1.479047  -7.567 -5.65200 -5.4570   
Hip-Hop/Rap         13.0 -6.917846  1.891808  -8.820 -8.52000 -7.6480   
Pop                 14.0 -6.460357  3.014281 -14.454 -7.17875 -6.6445   

                       75%    max  
genre                              
Alternative/Indie -4.83000 -4.746  
Dance/Electronic  -4.25800 -3.756  
Hip-Hop/Rap       -5.61600 -3.370  
Pop               -3.87525 -3.280  


How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [54]:
genre_acoustics = df.groupby('genre')['acousticness'].describe()
genre_of_interest = ['Pop', 'Hip-Hop/Rap',
                     'Dance/Electronic', 'Alternative/Indie']
genre_comparison = genre_acoustics[genre_acoustics.index.isin(
    genre_of_interest)]
print(
    f"For the selected genres, the acousticness score statistics are as follows:\n {genre_comparison}")

For the selected genres, the acousticness score statistics are as follows:
                    count      mean       std      min      25%     50%  \
genre                                                                    
Alternative/Indie    4.0  0.583500  0.204086  0.29100  0.52575  0.6460   
Dance/Electronic     5.0  0.099440  0.095828  0.01370  0.01490  0.0686   
Hip-Hop/Rap         13.0  0.188741  0.186396  0.00513  0.06700  0.1450   
Pop                 14.0  0.323843  0.318142  0.02100  0.05990  0.2590   

                       75%    max  
genre                              
Alternative/Indie  0.70375  0.751  
Dance/Electronic   0.17700  0.223  
Hip-Hop/Rap        0.23400  0.731  
Pop                0.34800  0.902  
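
The three comparison cells repeat the same filter-and-group pattern. One possible consolidation, sketched on made-up rows, filters the genres first and aggregates several features in one call. `demo` is illustrative rather than the dataset, and the helper name `compare_genres` is hypothetical.

```python
import pandas as pd


def compare_genres(frame, genres, features, stats=("count", "mean", "median", "std")):
    """Summary statistics of the given features for the selected genres."""
    subset = frame[frame["genre"].isin(genres)]
    return subset.groupby("genre")[list(features)].agg(list(stats))


# Illustrative data only (not the Spotify dataset)
demo = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Rock"],
    "danceability": [0.6, 0.8, 0.7, 0.9, 0.5],
    "loudness": [-6.0, -4.0, -7.0, -5.0, -3.0],
})
result = compare_genres(demo, ["Pop", "Hip-Hop/Rap"], ["danceability", "loudness"])
print(result)
```

The result has two-level columns (feature, statistic), so a single value can be read with, for example, `result.loc["Pop", ("danceability", "mean")]`.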


# Findings and Insights
- The most popular artists are Billie Eilish, Dua Lipa, and Travis Scott, with 3 tracks each.
- Artist diversity is high: 40 unique artists appear in the Top 50.
- Both traditional and hybrid genres are represented, with 16 unique genres in total.
- Pop dominates with 14 tracks, or 28% of the Top 50.
- Loudness and energy are strongly positively correlated (Pearson correlation of about 0.79).




# Future Improvements

- Create visualization for outliers of each audio feature
- Print outlier identification in a more readable format
- Import and compare data from years beyond 2020 to see genre trends in the Top 50