# Exploratory Data Analysis on Top 50 Spotify Tracks

To start, I am going to import pandas and numpy, and import the data from the CSV.

In [331]:
import pandas as pd
import numpy as np
toptracks_df = pd.read_csv('spotifytoptracks.csv', index_col=0)

The columns are sensibly named and there is no missing data. The artist and track names also appear distinct, so I believe I can safely assume there are no duplicates.

In [332]:
toptracks_df.columns

Index(['artist', 'album', 'track_name', 'track_id', 'energy', 'danceability',
       'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'genre'],
      dtype='object')

In [333]:
toptracks_df.isna().sum()

artist              0
album               0
track_name          0
track_id            0
energy              0
danceability        0
key                 0
loudness            0
acousticness        0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
genre               0
dtype: int64

In [334]:
np.sort(toptracks_df.artist.unique())

array(['24kGoldn', 'Ariana Grande', 'Arizona Zervas', 'BENEE', 'BTS',
       'Bad Bunny', 'Billie Eilish', 'Black Eyed Peas', 'Cardi B',
       'DaBaby', 'Doja Cat', 'Drake', 'Dua Lipa', 'Eminem', 'Future',
       'Harry Styles', 'JP Saxe', 'Jawsh 685', 'Juice WRLD',
       'Justin Bieber', 'KAROL G', 'Lady Gaga', 'Lewis Capaldi',
       'Lil Mosey', 'Maluma', 'Maroon 5', 'Post Malone', 'Powfu',
       'Regard', 'Roddy Ricch', 'SAINt JHN', 'Shawn Mendes', 'Surf Mesa',
       'Surfaces', 'THE SCOTTS', 'The Weeknd', 'Tones And I', 'Topic',
       'Travis Scott', 'Trevor Daniel'], dtype=object)

In [335]:
np.sort(toptracks_df.track_name.unique())

array(['Adore You', 'Before You Go', 'Blinding Lights', 'Blueberry Faygo',
       'Break My Heart', 'Breaking Me', 'Circles', 'Dance Monkey',
       "Don't Start Now", 'Dynamite', 'Falling',
       'Godzilla (feat. Juice WRLD)', 'HIGHEST IN THE ROOM', 'Hawái',
       'If the World Was Ending - feat. Julia Michaels',
       'Intentions (feat. Quavo)', 'Life Is Good (feat. Drake)',
       'Lucid Dreams', 'Memories', 'Mood (feat. iann dior)', 'Physical',
       'RITMO (Bad Boys For Life)', 'ROCKSTAR (feat. Roddy Ricch)',
       'ROXANNE', 'Rain On Me (with Ariana Grande)', 'Ride It',
       'Roses - Imanbek Remix', 'SICKO MODE', 'Safaera',
       'Savage Love (Laxed - Siren Beat)', 'Say So', 'Señorita',
       'Someone You Loved', 'Stuck with U (with Justin Bieber)',
       'Sunday Best', 'Sunflower - Spider-Man: Into the Spider-Verse',
       'Supalonely (feat. Gus Dapperton)', 'THE SCOTTS', 'The Box',
       'Toosie Slide', 'Tusa', 'WAP (feat. Megan Thee Stallion)',
       'Watermelon S
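
The no-duplicates assumption can also be checked directly rather than by eye. A minimal sketch, assuming `track_id` uniquely identifies a track; the small frame here is made-up stand-in data, not the real CSV:

```python
import pandas as pd

# Made-up stand-in rows; in the notebook this would be toptracks_df.
sample_df = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist A"],
    "track_name": ["Song 1", "Song 2", "Song 1"],
    "track_id": ["id1", "id2", "id1"],
})

# Rows whose track_id has already appeared earlier in the frame.
dup_ids = sample_df.duplicated(subset="track_id")

# Rows that repeat an (artist, track_name) pair, in case the IDs differ.
dup_names = sample_df.duplicated(subset=["artist", "track_name"])

n_dup_ids = int(dup_ids.sum())
n_dup_names = int(dup_names.sum())
print("Duplicate track IDs:", n_dup_ids)
print("Duplicate artist/track pairs:", n_dup_names)
```

Both counts coming back as 0 on `toptracks_df` would confirm the assumption.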

However, several songs have additional artists listed in the track names. These are identified by either 'feat.' or 'with'. I am going to extract these into their own column.

In [336]:
import re

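def second_artist(track_name):
    # Capture the name that follows 'feat.' or 'with', up to an optional
    # closing parenthesis at the end of the title.
    matches = re.findall(r'^.*(?:feat\.|with) (\w+(?: ?\w*)*)\)?$', track_name)
    # re.findall returns the captured names; an empty list means no second artist.
    return matches[0] if matches else None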

toptracks_df['second_artist'] = toptracks_df.track_name.apply(second_artist)
toptracks_df[toptracks_df.second_artist.notna()][['artist','second_artist']]


| | artist | second_artist |
|---|---|---|
| 5 | DaBaby | Roddy Ricch |
| 14 | Justin Bieber | Quavo |
| 19 | Future | Drake |
| 23 | 24kGoldn | iann dior |
| 27 | Cardi B | Megan Thee Stallion |
| 29 | Eminem | Juice WRLD |
| 33 | BENEE | Gus Dapperton |
| 34 | Surf Mesa | Emilee |
| 35 | Lady Gaga | Ariana Grande |
| 44 | Billie Eilish | Khalid |
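
The same extraction could be done without a Python-level function by using pandas' vectorised `Series.str.extract`. This is a sketch of an alternative, not the method used above; the pattern is my own variant and takes everything after 'feat.' or 'with' up to an optional closing parenthesis at the end of the title:

```python
import pandas as pd

# A few track names taken from the listing above.
names = pd.Series([
    "ROCKSTAR (feat. Roddy Ricch)",
    "Stuck with U (with Justin Bieber)",
    "If the World Was Ending - feat. Julia Michaels",
    "Dance Monkey",
])

# Capture whatever follows 'feat.' or 'with', never crossing a parenthesis,
# and require the match to run to the end of the title.
pattern = r"(?:feat\.|with) ([^()]+?)\)?$"
second = names.str.extract(pattern, expand=False)
print(second.tolist())
```

Because the capture group is `[^()]+?` rather than `\w`-based, it should also accept names containing punctuation. Titles without a match come back as missing values.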


At least one genre (Electro-pop) has additional whitespace, so I will strip across all string (object) columns to ensure better data quality.

In [337]:
toptracks_df = toptracks_df.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

There are 50 observations in this data set, each with 17 features (one of which is extracted). Of these 17 features, 10 are numeric and 7 are categorical.

In [338]:
print("Number of features:", len(toptracks_df.dtypes))
print(toptracks_df.dtypes,"\n")
print(toptracks_df.dtypes.value_counts())

Number of features: 17
artist               object
album                object
track_name           object
track_id             object
energy              float64
danceability        float64
key                   int64
loudness            float64
acousticness        float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
genre                object
second_artist        object
dtype: object 

float64    9
object     6
int64      2
Name: count, dtype: int64


All float64 features are numeric, one of the int64 features (duration_ms) is also numeric, and the rest are categorical.

#### Numeric features
- energy
- danceability
- loudness
- acousticness
- speechiness
- instrumentalness
- liveness
- valence
- tempo
- duration_ms

#### Categorical features
- key
- genre
- artist
- album
- track_name
- track_id

#### Extracted data (categorical)
- second_artist
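
The numeric/categorical split above could also be derived from the dtypes instead of being listed by hand. A minimal sketch on a made-up stand-in frame, assuming `key` is the only integer column that should count as categorical:

```python
import pandas as pd

# Made-up stand-in with the same mix of dtypes as the real frame.
sample_df = pd.DataFrame({
    "artist": ["Artist A", "Artist B"],
    "energy": [0.5, 0.7],
    "key": [1, 5],
    "duration_ms": [200000, 180000],
})

# 'key' is stored as an integer but encodes a pitch class, so treat it as categorical.
integer_coded_categoricals = ["key"]

numeric_cols = (
    sample_df.select_dtypes(include="number")
    .columns.difference(integer_coded_categoricals)
    .tolist()
)
categorical_cols = sample_df.columns.difference(numeric_cols).tolist()
print(numeric_cols)
print(categorical_cols)
```

`Index.difference` returns the names sorted alphabetically, so the order differs from the column order in the frame.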

## Are there any artists that have more than 1 popular track? If yes, which and how many?
Yes, 7 artists have more than 1 popular track. The highest number of tracks is 3, which 3 artists have: Billie Eilish, Dua Lipa, Travis Scott. 

In [339]:
no_tracks_per_artist = toptracks_df.groupby('artist').track_name.count().sort_values(ascending=False)
mult_track_artists = no_tracks_per_artist[no_tracks_per_artist > 1]
print("Number of artists with more than 1 top 50 track:",len(mult_track_artists))
print(mult_track_artists)


Number of artists with more than 1 top 50 track: 7
artist
Billie Eilish    3
Dua Lipa         3
Travis Scott     3
Lewis Capaldi    2
Harry Styles     2
Post Malone      2
Justin Bieber    2
Name: track_name, dtype: int64


If we also count appearances as the second artist on a track, 11 artists have more than 1 top 50 track. The maximum number of tracks (3) doesn't change, but Justin Bieber joins the group with 3 tracks, bringing it to 4 artists.

In [340]:
second_artist_count = toptracks_df.groupby('second_artist').track_name.count().sort_values(ascending=False)
second_artist_incl = no_tracks_per_artist.add(second_artist_count, fill_value = 0).sort_values(ascending=False)
print("Number of artists with more than 1 top 50 track (when including tracks as second artist):",len(second_artist_incl[second_artist_incl > 1]))
print(second_artist_incl[second_artist_incl > 1])

Number of artists with more than 1 top 50 track (when including tracks as second artist): 11
Billie Eilish    3.0
Dua Lipa         3.0
Justin Bieber    3.0
Travis Scott     3.0
Juice WRLD       2.0
Drake            2.0
Ariana Grande    2.0
Harry Styles     2.0
Lewis Capaldi    2.0
Roddy Ricch      2.0
Post Malone      2.0
Name: track_name, dtype: float64
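
The counts above come out as floats after the aligned addition. One way to keep them as integers is to stack both artist columns and count once. A sketch on made-up stand-in rows:

```python
import pandas as pd

# Made-up stand-in rows; None marks a track with no second artist.
sample_df = pd.DataFrame({
    "artist": ["Artist A", "Artist B", "Artist A", "Artist C"],
    "second_artist": [None, "Artist A", None, "Artist B"],
})

# Stack both columns into one Series; value_counts() skips missing values by default.
appearances = pd.concat(
    [sample_df["artist"], sample_df["second_artist"]]
).value_counts()
print(appearances[appearances > 1])
```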


## Who was the most popular artist?

In order to assess this more accurately, I believe we would require play counts. Knowing which songs were played the most would make it easier to determine which artist(s) were more popular.

Without this data, the count of top 50 tracks per artist above is the best available proxy. I believe it's reasonable to say that being the main artist ranks above being the second artist, so we can consider the 3 artists with 3 tracks as main artist to be the most popular:
- Billie Eilish
- Dua Lipa
- Travis Scott

## How many artists in total have their songs in the top 50?

There are 40 artists listed as the main artist in the top 50.

In [341]:
len(toptracks_df.artist.unique())

40

If we include second artists, this increases to 47 artists.

In [342]:
len(np.union1d(toptracks_df.artist.unique(), toptracks_df.second_artist.dropna().unique()))

47

## Are there any albums that have more than 1 popular track? If yes, which and how many?

4 albums have more than 1 track in the top 50. The most tracks for a single album is 3, achieved only by Future Nostalgia by Dua Lipa.

In [343]:
album_counts = toptracks_df.album.value_counts().sort_values(ascending=False)
print("Number of albums with more than 1 track:", len(album_counts[album_counts > 1]))
print(album_counts[album_counts > 1])

Number of albums with more than 1 track: 4
album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64


In [344]:
toptracks_df[toptracks_df['album'] == 'Future Nostalgia'].artist.unique()[0]

'Dua Lipa'

## How many albums in total have their songs in the top 50?

45 albums have songs in the top 50. 

In [345]:
len(album_counts)

45

## Which tracks have a danceability score above 0.7?

32 tracks have a danceability score above 0.7, with WAP (feat. Megan Thee Stallion) being the most danceable track.


In [346]:
danceable_tracks = toptracks_df[toptracks_df['danceability'] > 0.7][['track_name','danceability']].sort_values(by='danceability',ascending=False).set_index('track_name')
print("Number of danceable tracks:", len(danceable_tracks))
print(danceable_tracks)

Number of danceable tracks: 32
                                               danceability
track_name                                                 
WAP (feat. Megan Thee Stallion)                       0.935
The Box                                               0.896
Ride It                                               0.880
Sunday Best                                           0.878
Supalonely (feat. Gus Dapperton)                      0.862
goosebumps                                            0.841
SICKO MODE                                            0.834
Toosie Slide                                          0.830
Dance Monkey                                          0.825
Godzilla (feat. Juice WRLD)                           0.808
Intentions (feat. Quavo)                              0.806
Tusa                                                  0.803
Life Is Good (feat. Drake)                            0.795
Don't Start Now                                       0.793
Breaking 

## Which tracks have a danceability score below 0.4?

Only 1 track, lovely (with Khalid), has a danceability score below 0.4.

In [347]:
undanceable_tracks = toptracks_df[toptracks_df['danceability'] < 0.4][['track_name','danceability']].sort_values(by='danceability',ascending=False).set_index('track_name')
print("Number of un-danceable tracks:", len(undanceable_tracks))
print(undanceable_tracks)

Number of un-danceable tracks: 1
                      danceability
track_name                        
lovely (with Khalid)         0.351
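
The filter, sort and re-index steps used for these threshold questions repeat with only the column and condition changing, so they could be wrapped in a small helper. `tracks_where` is a hypothetical name of my own, shown here on made-up stand-in rows:

```python
import pandas as pd

def tracks_where(df, column, mask, ascending=False):
    """Return rows of `column` where `mask` is True, sorted and indexed by track_name."""
    return (
        df.loc[mask, ["track_name", column]]
        .sort_values(by=column, ascending=ascending)
        .set_index("track_name")
    )

# Made-up stand-in rows, not values from the real data.
sample_df = pd.DataFrame({
    "track_name": ["Song 1", "Song 2", "Song 3"],
    "danceability": [0.9, 0.3, 0.75],
})

danceable = tracks_where(sample_df, "danceability", sample_df["danceability"] > 0.7)
print(danceable)
```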


## Average danceability

In [348]:
print(f"Mean danceability: {toptracks_df.danceability.mean():.3f}")
print(f"Median danceability: {toptracks_df.danceability.median():.3f}")

Mean danceability: 0.717
Median danceability: 0.746


## Which tracks have their loudness above -5?

There are 19 loud tracks, with the loudest being Tusa. 

In [349]:
loud_tracks = toptracks_df[toptracks_df['loudness'] > -5][['track_name','loudness']].sort_values(by='loudness',ascending=False).set_index('track_name')
print("Number of loud tracks:", len(loud_tracks))
print(loud_tracks)

Number of loud tracks: 19
                                               loudness
track_name                                             
Tusa                                             -3.280
goosebumps                                       -3.370
Break My Heart                                   -3.434
Hawái                                            -3.454
Circles                                          -3.497
Mood (feat. iann dior)                           -3.558
Adore You                                        -3.675
SICKO MODE                                       -3.714
Physical                                         -3.756
Rain On Me (with Ariana Grande)                  -3.764
Safaera                                          -4.074
Watermelon Sugar                                 -4.209
Ride It                                          -4.258
Sunflower - Spider-Man: Into the Spider-Verse    -4.368
Dynamite                                         -4.410
Don't Start Now       

## Which tracks have their loudness below -8?

There are 9 quiet songs, of which the quietest is everything i wanted. 

In [350]:
quiet_tracks = toptracks_df[toptracks_df['loudness'] < -8][['track_name','loudness']].sort_values(by='loudness').set_index('track_name')
print("Number of quiet tracks:", len(quiet_tracks))
print(quiet_tracks)

Number of quiet tracks: 9
                                                loudness
track_name                                              
everything i wanted                              -14.454
bad guy                                          -10.965
lovely (with Khalid)                             -10.109
If the World Was Ending - feat. Julia Michaels   -10.086
Toosie Slide                                      -8.820
death bed (coffee for your head)                  -8.765
HIGHEST IN THE ROOM                               -8.764
Falling                                           -8.756
Savage Love (Laxed - Siren Beat)                  -8.520


## Average loudness


In [351]:
print(f"Mean loudness: {toptracks_df.loudness.mean():.3f}")
print(f"Median loudness: {toptracks_df.loudness.median():.3f}")

Mean loudness: -6.226
Median loudness: -5.992


## Which track is the longest?

The track SICKO MODE by Travis Scott is the longest track, at 5 minutes 12 seconds. 

In [352]:
toptracks_df.loc[toptracks_df.duration_ms.idxmax(),['artist','track_name','duration_ms']]

artist         Travis Scott
track_name       SICKO MODE
duration_ms          312820
Name: 49, dtype: object

In [353]:
mins_longest_track, secs_longest_track = divmod(toptracks_df.loc[toptracks_df.duration_ms.idxmax()].duration_ms / 1000 / 60, 1)
secs_longest_track = secs_longest_track * 60
print(f"{int(mins_longest_track)} minutes {int(secs_longest_track)} seconds")


5 minutes 12 seconds
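
The millisecond-to-minutes conversion could also be done with integer arithmetic, which avoids scaling a fractional minute back up to seconds. A small sketch; `format_duration` is a hypothetical helper name:

```python
def format_duration(duration_ms):
    """Format milliseconds as 'M minutes S seconds', dropping the fractional second."""
    total_seconds = int(duration_ms) // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes} minutes {seconds} seconds"

# 312820 is the duration_ms of SICKO MODE from the output above.
print(format_duration(312820))
```

The same helper would work for the shortest track and for the mean and median lengths, since `int()` also accepts the float that `mean()` returns.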


## Which track is the shortest?

The track Mood (feat. iann dior) by 24kGoldn is the shortest track, at 2 minutes 20 seconds.

In [354]:
toptracks_df.loc[toptracks_df.duration_ms.idxmin(),['artist','track_name','duration_ms']]

artist                       24kGoldn
track_name     Mood (feat. iann dior)
duration_ms                    140526
Name: 23, dtype: object

In [355]:
mins_shortest_track, secs_shortest_track = divmod(toptracks_df.loc[toptracks_df.duration_ms.idxmin()].duration_ms / 1000 / 60, 1)
secs_shortest_track = secs_shortest_track * 60
print(f"{int(mins_shortest_track)} minutes {int(secs_shortest_track)} seconds")

2 minutes 20 seconds


## Average length

In [356]:
mins_mean_track, secs_mean_track = divmod(toptracks_df.duration_ms.mean() / 1000 / 60, 1)
secs_mean_track = secs_mean_track * 60

mins_median_track, secs_median_track = divmod(toptracks_df.duration_ms.median() / 1000 / 60, 1)
secs_median_track = secs_median_track * 60

print(f"Mean length: {int(mins_mean_track)} minutes {int(secs_mean_track)} seconds")
print(f"Median length: {int(mins_median_track)} minutes {int(secs_median_track)} seconds")

Mean length: 3 minutes 19 seconds
Median length: 3 minutes 17 seconds


## Which genre is the most popular?

Based on the genres provided, Pop is the most popular, with 14 tracks.

In [357]:
genre_counts = toptracks_df.genre.value_counts()
print(genre_counts)

genre
Pop                                   14
Hip-Hop/Rap                           13
Dance/Electronic                       5
Alternative/Indie                      4
R&B/Soul                               2
Electro-pop                            2
R&B/Hip-Hop alternative                1
Nu-disco                               1
Pop/Soft Rock                          1
Pop rap                                1
Hip-Hop/Trap                           1
Dance-pop/Disco                        1
Disco-pop                              1
Dreampop/Hip-Hop/R&B                   1
Alternative/reggaeton/experimental     1
Chamber pop                            1
Name: count, dtype: int64


The listed genres contain both sub-categories and categories. It's currently difficult to tell which genres are actually part of the same parent genre. We would be able to get a more comprehensive view of which genres are the most popular if we had the parent genres (e.g. Disco-pop and Nu-disco are both part of the Disco genre, which in turn is part of the Electronic genre). There are also tracks that are part of multiple genres (e.g. Pop/Soft Rock), so it would also be beneficial to split these out into individual genres.

Most tracks fall under some form of Pop or Hip-hop.  
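
Splitting multi-genre labels into individual genres, as suggested above, could look roughly like this. It is a sketch that assumes '/' is the only separator; the labels are a few taken from the counts above:

```python
import pandas as pd

# A few genre labels from the counts above, one per hypothetical track.
genres = pd.Series(["Pop", "Pop/Soft Rock", "Hip-Hop/Rap", "Hip-Hop/Trap"])

# Split on '/', give each part its own row, and tidy whitespace before counting.
genre_parts = genres.str.split("/").explode().str.strip()
part_counts = genre_parts.value_counts()
print(part_counts)
```

Labels such as Pop rap or Disco-pop are not separated by '/', so they would still need a manual mapping to a parent genre.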

## Which genres have just one song on the top 50?

10 genres have only one track in the top 50. However, we can see from the list below that most of these genres appear to fit into the same parent genre or share at least 1 genre with at least one other track. Arguably, the only track that actually has a unique genre is Safaera by Bad Bunny, with the genre Alternative/reggaeton/experimental.

In [358]:
print("Number of genres with only 1 track:", len(genre_counts[genre_counts == 1]))
print(genre_counts[genre_counts == 1])

Number of genres with only 1 track: 10
genre
R&B/Hip-Hop alternative               1
Nu-disco                              1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: count, dtype: int64


In [359]:
toptracks_df[toptracks_df['genre'] == 'Alternative/reggaeton/experimental'][['artist','track_name']]

| | artist | track_name |
|---|---|---|
| 43 | Bad Bunny | Safaera |


## How many genres in total are represented in the top 50?

There are 16 sub-genres represented in the top 50. This count could be improved by getting the parent genres and splitting out tracks with multiple genres.

In [360]:
len(genre_counts)

16

## Which features are strongly positively correlated?

## Which features are strongly negatively correlated?

## Which features are not correlated?
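
One possible approach to the three correlation questions above is a pairwise Pearson matrix, reduced to one row per feature pair. This is a sketch on toy columns with known relationships, not the real data; the 0.5 cut-off for "strong" is an assumption:

```python
import numpy as np
import pandas as pd

# Toy columns with known relationships; not values from the real data.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # rises with a
    "c": [5, 4, 3, 2, 1],    # falls as a rises
    "d": [1, -1, 0, -1, 1],  # no linear relationship with a
})

corr = toy.corr()  # pairwise Pearson correlation

# Keep each pair once by masking the diagonal and the lower triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values()

threshold = 0.5  # assumed cut-off for "strong"
print("Strong positive:\n", pairs[pairs > threshold])
print("Strong negative:\n", pairs[pairs < -threshold])
print("Weak or none:\n", pairs[pairs.abs() <= threshold])
```

On the notebook's frame, the matrix would come from the numeric feature columns only, for example via `toptracks_df.corr(numeric_only=True)` in recent pandas versions, with `key` excluded since it is categorical.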

## How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

## How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

## How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?
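
The three genre comparisons above share one shape: filter to the four genres, group, and summarise each score. A sketch on made-up stand-in rows; the scores are not taken from the real data:

```python
import pandas as pd

main_genres = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]

# Made-up stand-in rows; the scores are not taken from the real data.
sample_df = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Nu-disco"],
    "danceability": [0.6, 0.8, 0.7, 0.9, 0.5],
    "loudness": [-6.0, -4.0, -7.0, -5.0, -3.0],
    "acousticness": [0.2, 0.4, 0.1, 0.3, 0.0],
})

features = ["danceability", "loudness", "acousticness"]

# Keep only the four genres of interest, then summarise each feature per genre.
summary = (
    sample_df[sample_df["genre"].isin(main_genres)]
    .groupby("genre")[features]
    .agg(["mean", "median"])
)
print(summary)
```

With only 5 Dance/Electronic and 4 Alternative/Indie tracks in the top 50, any differences between those groups should be read cautiously.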

## Improvements for future analysis