# Top 50 Spotify Tracks 2020 data analysis
<img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Spotify_logo_with_text.svg" alt="Spotify Logo" width="300">

This project involves performing a comprehensive analysis of a Top 50 Spotify Tracks dataset to gain insights about top tracks, artists, albums, genres, and their features such as danceability, loudness, acousticness, and more.
The analysis also includes data cleaning, exploratory data analysis (EDA), and feature correlation analysis.
Based on the results, the project aims to quantify what makes a hit song.

In [3]:
import pandas as pd

In [5]:
df = pd.read_csv('spotifytoptracks.csv', index_col=0)
df.head(3)

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap


*Note:* index_col=0 makes pandas use the first column of the file as the row index instead of adding a new default index. This is useful if the first column already has unique IDs or labels.
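
The effect of `index_col=0` can be seen on a small in-memory CSV. This is a minimal sketch: the two-row sample and the names `csv_text`, `default_df`, and `indexed_df` are illustrative assumptions, not part of the project file.

```python
import io

import pandas as pd

# Hypothetical two-row sample whose first (unnamed) column holds the row labels
csv_text = (
    ",artist,track_name\n"
    "0,The Weeknd,Blinding Lights\n"
    "1,Tones And I,Dance Monkey\n"
)

# Without index_col, the label column is read as an ordinary data column
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

Without the argument, the label column would show up as an extra feature and inflate the column count.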

### 🧹 Dataset cleaning 

*3 main steps for data cleaning:*
- Handle missing values.
- Remove duplicate samples and features.
- Treat the outliers.

In [10]:
missing_values = df.isnull().sum()
print(missing_values)

artist              0
album               0
track_name          0
track_id            0
energy              0
danceability        0
key                 0
loudness            0
acousticness        0
speechiness         0
instrumentalness    0
liveness            0
valence             0
tempo               0
duration_ms         0
genre               0
dtype: int64


*Note:* A count of 0 for every column shows that the dataset has no null values: every cell in every row and column contains data.

In [13]:
duplicates_all = df.duplicated()
print(f"Number of duplicate rows: {duplicates_all.sum()}")

Number of duplicate rows: 0


*Note:* There are no duplicate rows in the dataset.
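
`duplicated()` with no arguments only flags rows that match in every column. As an optional extra check, the sketch below looks for repeats in an identifier column only. The toy frame is hypothetical; using `track_id` as the key is an assumption about how a repeated track would show up.

```python
import pandas as pd

# Hypothetical frame: the same track appears twice with a slightly different energy value
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "a1"],
    "track_name": ["Song A", "Song B", "Song A"],
    "energy": [0.70, 0.55, 0.71],
})

# Full-row comparison misses the repeat because the energy values differ
full_row_duplicates = int(toy.duplicated().sum())

# Comparing on the identifier alone catches it
id_duplicates = int(toy.duplicated(subset=["track_id"]).sum())

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Duplicates by track_id: {id_duplicates}")
```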

In [16]:
def count_outliers(column):
    '''Q1 - First quartile, Q3 - Third quartile
    IQR - Interquartile range
    lower_bound - Lower threshold for outliers
    upper_bound - Upper threshold for outliers'''
    Q1 = column.quantile(0.25)  
    Q3 = column.quantile(0.75)  
    IQR = Q3 - Q1  
    lower_bound = Q1 - 1.5 * IQR  
    upper_bound = Q3 + 1.5 * IQR  
    outliers = column[(column < lower_bound) | (column > upper_bound)]
    return len(outliers)

columns_to_check = [
    'energy', 'danceability', 'key', 'loudness', 'acousticness',
    'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'
]

outliers_count = {col: count_outliers(df[col]) for col in columns_to_check}

for col, count in outliers_count.items():
    print(f"{col}: {count} outliers")

energy: 0 outliers
danceability: 3 outliers
key: 0 outliers
loudness: 1 outliers
acousticness: 7 outliers
speechiness: 6 outliers
instrumentalness: 12 outliers
liveness: 3 outliers
valence: 0 outliers
tempo: 0 outliers
duration_ms: 2 outliers


*Note:* An outlier is a data point that deviates significantly from the rest of the data. Here a value counts as an outlier if it lies more than 1.5 × IQR below the first quartile or above the third quartile. The dataset has only 50 rows, so removing outliers could noticeably change its structure and compromise the accuracy of later analyses and insights. All rows are therefore kept for the analysis.
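
The IQR rule can be checked by hand on a tiny sample. The sketch below uses only the standard library; `method='inclusive'` is chosen because it interpolates linearly between data points, as the default of pandas `quantile` does. The sample values and the helper names `iqr_bounds` and `find_outliers` are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values):
    """Return the (lower, upper) outlier thresholds of the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the IQR thresholds."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: Q1 = 3, Q3 = 7, IQR = 4, so the bounds are -3 and 13
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_bounds(sample)
outliers = find_outliers(sample)

print(f"Bounds: [{lower}, {upper}], outliers: {outliers}")
```

Returning the outlying values rather than only their count makes it possible to inspect them before deciding whether to keep them.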

### 📊 Exploratory data analysis 


**How many observations are there in this dataset?**

In [21]:
num_observations = len(df)
print(f"Number of observations: {num_observations}")

Number of observations: 50


*Note:* The number of observations is the number of rows in the dataset. The dataset has 50 rows.

**How many features does this dataset have?**

In [25]:
num_features = len(df.columns)
print(f"Number of features: {num_features}")

Number of features: 16


*Note:* The number of features is the number of columns in the dataset. The dataset has 16 columns.

**Which of the features are categorical?**

In [29]:
categorical_col = df.select_dtypes(include=['object']).columns.tolist()
no_cat_col = len(categorical_col)

print(f"Number of categorical features: {no_cat_col}")
print(f"List of categorical names: {categorical_col}")

Number of categorical features: 5
List of categorical names: ['artist', 'album', 'track_name', 'track_id', 'genre']


**Which of the features are numeric?**

In [32]:
num_col = df.select_dtypes(include=['number']).columns.tolist()
no_num_col = len(num_col)

print(f"Number of numeric features: {no_num_col}") 
print(f"List of numeric names: {num_col}") 

Number of numeric features: 11
List of numeric names: ['energy', 'danceability', 'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']


**Are there any artists that have more than 1 popular track? If yes, which and how many?**

In [35]:
popular_artist = df.artist.value_counts()
pop_track = popular_artist[popular_artist > 1]
print(f"Artists who have more than 1 popular song:")
print(pop_track)

Artists who have more than 1 popular song:
artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Harry Styles     2
Lewis Capaldi    2
Justin Bieber    2
Post Malone      2
Name: count, dtype: int64


*Note:* The result table lists all artists with more than 1 popular track: 3 artists have 3 popular tracks each and 4 artists have 2 each.

**Who was the most popular artist?**

In [39]:
popular_artist = df.artist.value_counts()
pop_track = popular_artist.max()
most_popular_artists = popular_artist[popular_artist == pop_track]
print(f"The most popular artists:")
print(most_popular_artists)

The most popular artists:
artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Name: count, dtype: int64


*Note:* Popularity here is measured by the number of an artist's tracks in the list. Three artists are tied for the most tracks, with 3 each.
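
Because several artists can share the top count, taking only the first entry of a frequency table would hide the tie. A minimal standard-library sketch of tie-aware selection, on a hypothetical list of artist names:

```python
from collections import Counter

# Hypothetical list of artists, one entry per track
artists = ["Dua Lipa", "Drake", "Dua Lipa", "Travis Scott", "Travis Scott", "BTS"]

counts = Counter(artists)
top_count = max(counts.values())

# Keep every artist that reaches the maximum, not just the first one found
top_artists = sorted(name for name, n in counts.items() if n == top_count)

print(f"{top_artists} with {top_count} tracks each")
```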

**How many artists in total have their songs in the top 50?**

In [43]:
unique_artists = df.artist.unique()
no_unique = len(unique_artists)
print(f'Total number of unique artists: {no_unique}')

Total number of unique artists: 40


**Are there any albums that have more than 1 popular track? If yes, which and how many?**

In [46]:
popular_album = df.album.value_counts()
pop_track = popular_album[popular_album > 1]
print(f"Albums which have more than 1 popular song:")
print(pop_track)

Albums which have more than 1 popular song:
album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64


*Note:* The result table lists all albums with more than 1 popular track: 1 album has 3 popular tracks and 3 albums have 2 each.

**How many albums in total have their songs in the top 50?**

In [50]:
unique_albums = df.album.unique()
no_unique = len(unique_albums)
print(f'Total number of unique albums: {no_unique}')

Total number of unique albums: 45


**Which tracks have a danceability score above 0.7?**

In [53]:
high_danceability = df[df['danceability'] > 0.7]
high_danceability = high_danceability[['artist', 'track_name', 'danceability']]
print(f"In dataset are {len(high_danceability)} tracks with a danceability score above 0.7")
high_danceability

In dataset are 32 tracks with a danceability score above 0.7


Unnamed: 0,artist,track_name,danceability
1,Tones And I,Dance Monkey,0.825
2,Roddy Ricch,The Box,0.896
3,SAINt JHN,Roses - Imanbek Remix,0.785
4,Dua Lipa,Don't Start Now,0.793
5,DaBaby,ROCKSTAR (feat. Roddy Ricch),0.746
7,Powfu,death bed (coffee for your head),0.726
8,Trevor Daniel,Falling,0.784
10,KAROL G,Tusa,0.803
13,Lil Mosey,Blueberry Faygo,0.774
14,Justin Bieber,Intentions (feat. Quavo),0.806


*Note:* Danceability measures how suitable a track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity. Higher values indicate tracks that are easier to dance to. 64% of the tracks in this top list (32 of 50) have a danceability above 0.7, which suggests that listeners tend to prefer songs that are easy to dance to.

**Which tracks have a danceability score below 0.4?**

In [57]:
low_danceability = df[df['danceability'] < 0.4]
low_danceability = low_danceability[['artist', 'track_name', 'danceability']]
print(f"In dataset are {len(low_danceability)} tracks with a danceability score below 0.4")
low_danceability

In dataset are 1 tracks with a danceability score below 0.4


Unnamed: 0,artist,track_name,danceability
44,Billie Eilish,lovely (with Khalid),0.351


**Which tracks have their loudness above -5?**

In [60]:
loud_tracks = df[df['loudness'] > -5]
loud_tracks = loud_tracks[['artist', 'track_name', 'loudness']]
print(f"In dataset are {len(loud_tracks)} tracks with a loudness above -5")
loud_tracks

In dataset are 19 tracks with a loudness above -5


Unnamed: 0,artist,track_name,loudness
4,Dua Lipa,Don't Start Now,-4.521
6,Harry Styles,Watermelon Sugar,-4.209
10,KAROL G,Tusa,-3.28
12,Post Malone,Circles,-3.497
16,Lewis Capaldi,Before You Go,-4.858
17,Doja Cat,Say So,-4.577
21,Harry Styles,Adore You,-3.675
23,24kGoldn,Mood (feat. iann dior),-3.558
31,Dua Lipa,Break My Heart,-3.434
32,BTS,Dynamite,-4.41


*Note:* Loudness is the average volume of a track in decibels (dB). "Above -5" means strictly greater than -5, so it includes values such as -4, -3, and 0, but not -5 itself or anything smaller (e.g., -6). The closer a value is to 0, the louder the track. 38% of the tracks in the top list (19 of 50) have a loudness above -5 dB.

**Which tracks have their loudness below -8?**

In [602]:
quieter_track = df[df['loudness'] < -8]
quieter_track = quieter_track[['artist', 'track_name', 'loudness']]
print(f"In dataset are {len(quieter_track)} tracks with a loudness below -8")
quieter_track

In dataset are 9 tracks with a loudness below -8


Unnamed: 0,artist,track_name,loudness
7,Powfu,death bed (coffee for your head),-8.765
8,Trevor Daniel,Falling,-8.756
15,Drake,Toosie Slide,-8.82
20,Jawsh 685,Savage Love (Laxed - Siren Beat),-8.52
24,Billie Eilish,everything i wanted,-14.454
26,Billie Eilish,bad guy,-10.965
36,Travis Scott,HIGHEST IN THE ROOM,-8.764
44,Billie Eilish,lovely (with Khalid),-10.109
47,JP Saxe,If the World Was Ending - feat. Julia Michaels,-10.086


*Note:* Billie Eilish is one of the three most popular artists identified above, and all three of her songs are among the quieter tracks. This suggests that her music resonates with listeners through a softer, more subtle sound rather than through high volume or intensity.

**Which track is the longest?**

In [683]:
long_track = df[df['duration_ms'] == df['duration_ms'].max()].copy()
long_track['duration_minutes'] = (long_track['duration_ms'] // 60000).astype(int)
long_track['duration_seconds'] = ((long_track['duration_ms'] % 60000) // 1000).astype(int)

long_track['formatted_duration'] = (
    long_track['duration_minutes'].astype(str) + " min " +
    long_track['duration_seconds'].astype(str) + " sec"
)
long_track = long_track[['artist', 'track_name', 'duration_ms', 'formatted_duration']]
long_track

Unnamed: 0,artist,track_name,duration_ms,formatted_duration
49,Travis Scott,SICKO MODE,312820,5 min 12 sec


**Which track is the shortest?**

In [687]:
short_track = df[df['duration_ms'] == df['duration_ms'].min()].copy()
short_track['duration_minutes'] = (short_track['duration_ms'] // 60000).astype(int)
short_track['duration_seconds'] = ((short_track['duration_ms'] % 60000) // 1000).astype(int)

short_track['formatted_duration'] = (
    short_track['duration_minutes'].astype(str) + " min " +
    short_track['duration_seconds'].astype(str) + " sec"
)
short_track = short_track[['artist', 'track_name', 'duration_ms', 'formatted_duration']]
short_track

Unnamed: 0,artist,track_name,duration_ms,formatted_duration
23,24kGoldn,Mood (feat. iann dior),140526,2 min 20 sec


*Note:* The shortest and longest tracks in the top list differ by nearly 3 minutes, which shows that track length among hits varies widely. The difference may reflect factors such as genre, mood, or the context in which the music is listened to.
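
The minute and second split used in the two cells above can also be done with `divmod` on the raw millisecond value. The sketch below applies it to the two durations reported above and to their difference; the helper name `format_duration` is an assumption.

```python
def format_duration(duration_ms):
    """Format a duration in milliseconds as 'M min S sec' (fractions of a second dropped)."""
    minutes, remainder_ms = divmod(duration_ms, 60_000)
    seconds = remainder_ms // 1000
    return f"{minutes} min {seconds} sec"


longest_ms = 312820   # SICKO MODE
shortest_ms = 140526  # Mood (feat. iann dior)

print(format_duration(longest_ms))                # 5 min 12 sec
print(format_duration(shortest_ms))               # 2 min 20 sec
print(format_duration(longest_ms - shortest_ms))  # 2 min 52 sec
```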

**Which genre is the most popular?**

In [377]:
popular_genre = df.genre.value_counts()
print(f"The most popular genre: {popular_genre.index[0]}")


The most popular genre: Pop


**Which genres have just one song in the top 50?**

In [380]:
unpopular_genre = df.genre.value_counts()
least_pop_genre = unpopular_genre[unpopular_genre == 1]
print("Genres with just one song in the top 50:")
print(least_pop_genre)

Genres with just one song in the top 50:
genre
R&B/Hip-Hop alternative               1
Nu-disco                              1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: count, dtype: int64


**How many genres in total are represented in the top 50?**

In [383]:
tot_genres = df.genre.value_counts()
print(f"There are {len(tot_genres)} genres in total")

There are 16 genres in total


**Which features are strongly positively correlated?**

In [638]:
numeric_df = df.select_dtypes(include=['number'])
correlation_matrix = numeric_df.corr()
correlation_long = correlation_matrix.stack().reset_index()
correlation_long.columns = ['Feature 1', 'Feature 2', 'Correlation']
pos_corr = correlation_long[(correlation_long['Correlation'] > 0.7) & (correlation_long['Correlation'] < 1)]
print(pos_corr)


   Feature 1 Feature 2  Correlation
3     energy  loudness      0.79164
33  loudness    energy      0.79164


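*Note:* The `corr()` method computes the pairwise correlation (Pearson by default) between all numeric columns. Each coefficient lies between -1 and 1, and every pair appears twice in the output because the correlation matrix is symmetric.
Values close to +1 indicate a strong positive correlation; a threshold of 0.7 is used here. The only such pair is energy and loudness (0.79), meaning that tracks with higher energy tend to be louder.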
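
Each correlated pair is listed twice because the matrix is symmetric. One way to list every pair once is to keep only the entries above the diagonal before stacking. This is a sketch on hypothetical toy values; the helper name `strong_pairs` and the 0.7 default are assumptions mirroring the cell above.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame, threshold=0.7):
    """List each feature pair once when its absolute correlation exceeds the threshold."""
    corr = frame.corr()
    # True strictly above the diagonal: drops self-correlations and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().reset_index()
    pairs.columns = ["Feature 1", "Feature 2", "Correlation"]
    return pairs[pairs["Correlation"].abs() > threshold].reset_index(drop=True)


# Hypothetical values: energy and loudness rise together, key does not follow either
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-10.0, -8.0, -6.0, -3.0],
    "key": [3, 1, 1, 3],
})

result = strong_pairs(toy)
print(result)
```

Using the absolute value covers strong positive and strong negative pairs in a single pass.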

**Which features are strongly negatively correlated?**

In [643]:
numeric_df = df.select_dtypes(include=['number'])
correlation_matrix = numeric_df.corr()
correlation_long = correlation_matrix.stack().reset_index()
correlation_long.columns = ['Feature 1', 'Feature 2', 'Correlation']
neg_corr = correlation_long[(correlation_long['Correlation'] < -0.7) & (correlation_long['Correlation'] > -1)]
print(neg_corr)

Empty DataFrame
Columns: [Feature 1, Feature 2, Correlation]
Index: []


*Note:* Values close to -1 indicate a strong negative correlation. The empty table means that no pair of features in the dataset has a correlation below -0.7.

**Which features are not correlated?**

In [645]:
numeric_df = df.select_dtypes(include=['number'])
correlation_matrix = numeric_df.corr()
correlation_long = correlation_matrix.stack().reset_index()
correlation_long.columns = ['Feature 1', 'Feature 2', 'Correlation']
no_corr = correlation_long[(correlation_long['Correlation'] > -0.1) & (correlation_long['Correlation'] < 0.1)]
print(no_corr)

            Feature 1         Feature 2  Correlation
2              energy               key     0.062428
5              energy       speechiness     0.074267
7              energy          liveness     0.069487
9              energy             tempo     0.075191
10             energy       duration_ms     0.081971
17       danceability  instrumentalness    -0.017706
18       danceability          liveness    -0.006648
21       danceability       duration_ms    -0.033763
22                key            energy     0.062428
25                key          loudness    -0.009178
27                key       speechiness    -0.094965
28                key  instrumentalness     0.020802
31                key             tempo     0.080475
32                key       duration_ms    -0.003345
35           loudness               key    -0.009178
38           loudness       speechiness    -0.021693
40           loudness          liveness    -0.069939
43           loudness       duration_ms     0.

*Note:* Values between -0.1 and 0.1 (close to 0) indicate weak or no linear correlation.

**How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?**

In [69]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
filtered_df = df[df['genre'].isin(selected_genres)]
genre_danceability = filtered_df.groupby('genre')['danceability'].median()
sorted_genre_danceability = genre_danceability.sort_values(ascending=False)
print(sorted_genre_danceability)

genre
Dance/Electronic     0.785
Hip-Hop/Rap          0.774
Pop                  0.690
Alternative/Indie    0.663
Name: danceability, dtype: float64


*Note:* Because danceability has outliers (3 out of 50, or 6%), the median is used. The median is the middle value of the sorted data, so outliers affect it less than they affect the mean. Not surprisingly, Dance/Electronic has the highest median danceability of the four genres compared.
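
A single low value is enough to show the difference between the two statistics. The scores below are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical danceability scores with one unusually low value
scores = [0.70, 0.72, 0.75, 0.78, 0.20]

mean_score = mean(scores)      # pulled down by the 0.20 value
median_score = median(scores)  # middle of the sorted values, unaffected by it

print(f"Mean: {mean_score:.2f}, median: {median_score:.2f}")
```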

**How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?**

In [71]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
filtered_df = df[df['genre'].isin(selected_genres)]
genre_loudness = filtered_df.groupby('genre')['loudness'].median()
sorted_genre_loudness = genre_loudness.sort_values(ascending=False)
print(sorted_genre_loudness)

genre
Alternative/Indie   -5.2685
Dance/Electronic    -5.4570
Pop                 -6.6445
Hip-Hop/Rap         -7.6480
Name: loudness, dtype: float64


*Note:* Like danceability, loudness has outliers (1 out of 50, or 2%), so the median is used. Alternative/Indie and Dance/Electronic have the highest median loudness, at around -5 dB, which points to a louder, more impactful sound in these genres.

**How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?**

In [73]:
selected_genres = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']
filtered_df = df[df['genre'].isin(selected_genres)]
genre_acousticness = filtered_df.groupby('genre')['acousticness'].median()
sorted_genre_acousticness = genre_acousticness.sort_values(ascending=False)
print(sorted_genre_acousticness)

genre
Alternative/Indie    0.6460
Pop                  0.2590
Hip-Hop/Rap          0.1450
Dance/Electronic     0.0686
Name: acousticness, dtype: float64


*Note:* Like danceability and loudness, acousticness has outliers (7 out of 50, or 14%), so the median is used. Alternative/Indie has the highest median acousticness at 0.65, suggesting more acoustic and less electronic-based sounds in this genre. In contrast, Dance/Electronic has the lowest median at 0.07, indicating a heavier reliance on electronic elements.
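
The three per-genre comparisons above repeat the same filter and group-by. A possible consolidation, sketched on hypothetical toy data (the values and the `summary` name are not from the dataset), computes several medians at once and reports how many tracks each median rests on:

```python
import pandas as pd

# Hypothetical tracks; 'Rock' is excluded by the genre filter below
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Rock"],
    "danceability": [0.60, 0.70, 0.80, 0.75, 0.85, 0.50],
    "loudness": [-6.0, -5.0, -7.0, -8.0, -6.0, -4.0],
})
selected_genres = ["Pop", "Hip-Hop/Rap"]

summary = (
    toy[toy["genre"].isin(selected_genres)]
    .groupby("genre")
    .agg(
        tracks=("danceability", "count"),           # group size behind each median
        danceability_median=("danceability", "median"),
        loudness_median=("loudness", "median"),
    )
)
print(summary)
```

With only 50 rows in total, the group sizes help judge how much weight a per-genre median can carry.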

### 📝 Conclusion

**Key Quantified Characteristics of Hit Songs:**
- Danceability > 0.7 (64% of top tracks)
- Loudness > -5 dB (38% of top tracks)
- Energy and loudness are strongly correlated (0.79): energetic tracks tend to be louder
- Track duration ranges from 2 min 20 sec to 5 min 12 sec

**Genre Preferences:**
- Dance/Electronic: Highest median danceability and second-highest median loudness of the four genres compared
- Alternative/Indie: Highest median loudness and highest acousticness, for a more organic sound

These factors combined suggest that a hit song is typically one that is energetic, easy to dance to, and has a volume that catches attention, all while fitting within genre-specific preferences.

### 💡 Suggestions for analysis improvement

* Import top-list data from other years and compare the results to see how listeners' choices change over time.
* Add a demographic category to enable geographical comparison.
* Remove outliers from the dataset and compare how the results change.
* Add visualizations to the analysis to improve readability.