# Project Aim

In this analysis I explore Spotify’s Top 50 tracks of 2020 with three goals:

Audio profile summary – to show the typical values and spread of key numeric features (danceability, loudness, etc.), so we know what the 2020 chart “sounds like.”

Hit concentration – to identify artists and albums with multiple entries in the Top 50 to see if a few names dominate the chart or if success is more evenly spread.

Genre comparison – to compare Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie on danceability, loudness, and acousticness and see how their sound profiles differ.

I also cover the essentials of data cleaning (missing values, duplicates, and outliers) and examine feature correlations to see which audio attributes tend to move together. Together, these goals give me a clear view of what made a hit in 2020 and where there is room for new sounds or artists to stand out.

# 1. Downloading "Spotify Top 50 Tracks of 2020" dataset from Kaggle:

In [145]:
import kagglehub

path = kagglehub.dataset_download("atillacolak/top-50-spotify-tracks-2020")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Darius\.cache\kagglehub\datasets\atillacolak\top-50-spotify-tracks-2020\versions\2


# 2. Loading data (CSV file) using Pandas:

In [146]:
import pandas as pd

spotify = pd.read_csv("spotifytoptracks.csv", index_col=0)

spotify.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


# 3. Data cleaning.

## Handling missing values.

Using the .isna() method, I create a dataframe of Boolean values: True for NaN or None, and False for everything else.

Then, with the .any() method, I check whether each column contains any missing value (default axis=0).

In [147]:
missing_data = spotify.isna().any()
missing_data


artist              False
album               False
track_name          False
track_id            False
energy              False
danceability        False
key                 False
loudness            False
acousticness        False
speechiness         False
instrumentalness    False
liveness            False
valence             False
tempo               False
duration_ms         False
genre               False
dtype: bool

.any() returned a boolean series indexed by column names, with all False values, meaning there are no missing values.

If there were missing values, I would check how many there are per column and per row with .isna().sum().

If only a few values were missing here and there, I would use .fillna() to fill the gaps with either the mean or the median.

And if a larger share were missing, I would drop the affected rows or columns, depending on how the missing data is distributed.
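The strategy described above can be sketched on a small toy frame. This is only an illustration: the values below are hypothetical and not taken from the Spotify dataset, and the choice of the median as fill value is one option among several.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values (not from the Spotify dataset).
toy = pd.DataFrame({
    "energy": [0.73, np.nan, 0.59, 0.72],
    "tempo": [171.0, 98.1, np.nan, 122.0],
    "genre": ["Pop", "Pop", None, "Hip-Hop/Rap"],
})

# Count missing values per column and per row.
missing_per_column = toy.isna().sum()
missing_per_row = toy.isna().sum(axis=1)

# Fill numeric gaps with each column's median.
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Drop rows that still lack a categorical value.
cleaned = filled.dropna(subset=["genre"])

print(missing_per_column)
print(cleaned)
```

Filling works well for a handful of numeric gaps; a missing category such as genre cannot be averaged, which is why those rows are dropped here.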

## Checking for duplicates.

I use .duplicated(), which returns a boolean series with True for every row that duplicates an earlier one, and False otherwise.

Then .any() checks whether any row is duplicated.

Then I do the same for column names, using spotify.columns.duplicated().

Finally, I check for columns with identical content. Since .duplicated() works on rows only, I transpose the dataframe with spotify.T first.

In [148]:
dup_rows = spotify.duplicated().any()
dup_col_names = spotify.columns.duplicated().any()
dup_col_content = spotify.T.duplicated().any()

print(dup_rows, dup_col_names, dup_col_content)


False False False


If there were duplicates, I would use .drop_duplicates() to remove them.
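A minimal sketch of what that removal could look like, using made-up toy frames rather than the Spotify data:

```python
import pandas as pd

# Toy frame with one fully duplicated row (hypothetical values).
toy = pd.DataFrame({
    "artist": ["A", "B", "A", "C"],
    "track_name": ["x", "y", "x", "z"],
    "tempo": [120.0, 98.0, 120.0, 98.0],
})

# Drop duplicated rows; by default the first occurrence is kept.
deduped_rows = toy.drop_duplicates()

# Duplicated column names: keep only the first column for each name.
wide = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
unique_cols = wide.loc[:, ~wide.columns.duplicated()]

print(deduped_rows)
print(unique_cols)
```

Here row 2 repeats row 0, so it is removed, while rows 1 and 3 stay because they only share a tempo value.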

## Treating the outliers.

First I choose the columns that make sense to check for outliers, so I leave out the categorical columns: "artist", "album", "track_name", "track_id" and "genre".

Then I use .describe() to get a quick summary and a feel for whether there could be any outliers, based on the min/max and percentiles.

In [149]:
outliers = spotify.iloc[:, 4:15]
outliers.describe().round(4)


Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
count,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0,50.0
mean,0.6093,0.7167,5.72,-6.2259,0.2562,0.1242,0.016,0.1966,0.5557,119.6905,199955.36
std,0.1543,0.125,3.709,2.3497,0.2653,0.1168,0.0943,0.1766,0.2164,25.4148,33996.1225
min,0.225,0.351,0.0,-14.454,0.0015,0.029,0.0,0.0574,0.0605,75.801,140526.0
25%,0.494,0.6725,2.0,-7.5525,0.0528,0.0483,0.0,0.094,0.434,99.5572,175845.5
50%,0.597,0.746,6.5,-5.9915,0.1885,0.07,0.0,0.111,0.56,116.969,197853.5
75%,0.7298,0.7945,8.75,-4.2855,0.2987,0.1555,0.0,0.2712,0.7262,132.317,215064.0
max,0.855,0.935,11.0,-3.28,0.934,0.487,0.657,0.792,0.925,180.067,312820.0


Then I count the upper and lower outliers per column using the IQR rule.

To get the IQR, I compute the 25th and 75th percentiles with .quantile().

Then I define the lower and upper fences at 1.5 × IQR below q1 and above q3.

With a boolean mask I flag the outliers: values that fall outside the fences.

Then I sum the mask per column to see how many outliers each feature has.

In [150]:
q1 = outliers.quantile(0.25)
q3 = outliers.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers_check = (outliers < lower_fence) | (outliers > upper_fence)

outliers_count = outliers_check.sum()

print(outliers_count)

energy               0
danceability         3
key                  0
loudness             1
acousticness         7
speechiness          6
instrumentalness    12
liveness             3
valence              0
tempo                0
duration_ms          2
dtype: int64


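Outliers can be treated similarly to missing values and duplicates.

We either substitute them with a more representative value, such as the median, or drop the affected rows/columns if the outlier count is high enough to distort the interpretation of the data.

Note that instrumentalness tops the list with 12 flagged values mainly because both of its quartiles round to 0.0 in the summary above, so the IQR is close to zero and nearly any non-zero value falls outside the fences.

I decided not to treat the outliers in this case and to continue with the full dataset: for the tasks ahead (summaries, correlations, genre comparison), leaving them in will not meaningfully change the results.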
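For reference, here is a small sketch of how the substitution option could look if the outliers were treated. The series is a hypothetical toy example, and replacing flagged values with the median is just one of the treatments mentioned above, not something applied in this analysis.

```python
import pandas as pd

# Hypothetical toy values; the last one is clearly far from the rest.
values = pd.Series([0.05, 0.07, 0.06, 0.08, 0.07, 0.49], name="speechiness")

# IQR fences, computed the same way as in the cell above.
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences and replace them with the median.
is_outlier = (values < lower_fence) | (values > upper_fence)
treated = values.mask(is_outlier, values.median())

print(is_outlier.sum())
print(treated)
```

Only the last value falls outside the fences, so it is the only one replaced; the other five values stay untouched.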

# 4. Performing EDA.

## 4.1. How many observations are there in this dataset?

Observations = rows. Using .shape attribute.

In [151]:
spotify.shape[0]

50

#### Takeaway: 

##### 50 observations. We are looking at a very focused sample of the top 50 hits of 2020, which means every row is already a success, so any pattern we find points to what resonated most with listeners that year.

## 4.2 How many features does this dataset have?

Features = columns.

In [152]:
spotify.shape[1]

16

#### Takeaway:

##### 16 features. So each track has 16 different data points to work on.

## 4.3 Which of the features are categorical and numeric?

First I check each column's dtype and for more precise evaluation I check the first 5 rows of the dataset.

In [153]:
print(spotify.dtypes)
spotify.head()

artist               object
album                object
track_name           object
track_id             object
energy              float64
danceability        float64
key                   int64
loudness            float64
acousticness        float64
speechiness         float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_ms           int64
genre                object
dtype: object


Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


#### Takeaway:

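##### The features "artist", "album", "track_name", "track_id" and "genre" are categorical, and the rest are numeric. Note that "key" is stored as an integer but encodes the pitch class of the track, so it is numeric by dtype only.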
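The same split can also be produced programmatically instead of reading it off the dtype listing. The sketch below assumes a small toy frame built from the first two rows shown above; .select_dtypes() is a standard pandas method, but using it here is my suggestion rather than part of the original workflow.

```python
import pandas as pd

# Two rows and four columns copied from the head() output above.
toy = pd.DataFrame({
    "artist": ["The Weeknd", "Tones And I"],
    "energy": [0.73, 0.593],
    "key": [1, 6],
    "genre": ["R&B/Soul", "Alternative/Indie"],
})

# Numeric columns are int/float; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

This split goes by dtype only, so a column that stores categories as integers would still land in the numeric list.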

## 4.4. Are there any artists that have more than 1 popular track? If yes, which and how many?

First I select the artist column and apply the .value_counts() method to count how many times each artist appears in the list.

Then, with a boolean mask, I keep only those that appear more than once.

In [182]:
artists = spotify.iloc[:, 0].value_counts()
top = artists[artists > 1]
total_top = top.sum()
print(top)
print(total_top)

artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Harry Styles     2
Lewis Capaldi    2
Justin Bieber    2
Post Malone      2
Name: count, dtype: int64
17


#### Takeaway:

##### Seven artists had more than one track in the top 50, so 17 of the 50 tracks (34%) belong to just seven names. Success in the top 50 is concentrated, so it might be wise to invest in these proven artists.

## 4.5. Who was the most popular artist?

Same approach, but with different boolean mask.

In [155]:
most_popular = artists[artists == artists.max()]
print(most_popular)


artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Name: count, dtype: int64


#### Takeaway:

##### Three artists share the top spot: Dua Lipa, Billie Eilish and Travis Scott. Each has 3 tracks, or 6% of the top 50 list. These are the best candidates to build promotion campaigns around.

## 4.6. How many artists in total have their songs in the top 50?

I apply the .unique() method to keep only the distinct artist names.

In [156]:
total = spotify.iloc[:, 0].unique()
print(len(total))

40


#### Takeaway:

##### 40 different artists across 50 tracks. Even though success is concentrated (34% of the tracks belong to 7 names), most artists (33 of 40) have only one track. This suggests there is room for fresh talent to get in.
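The concentration figures from the last three questions can be computed in one go. The sketch below uses a hypothetical toy list of artist labels, and .nunique() is used as a shortcut for len(.unique()).

```python
import pandas as pd

# Hypothetical artist column with 8 tracks (not the real chart).
artists = pd.Series(["A", "A", "B", "C", "C", "C", "D", "E"], name="artist")

counts = artists.value_counts()
repeat_artists = counts[counts > 1]          # artists with more than one track
n_distinct = artists.nunique()               # number of different artists
share_from_repeats = repeat_artists.sum() / len(artists)

print(f"Distinct artists: {n_distinct}")
print(f"Tracks from repeat artists: {share_from_repeats:.1%}")
```

On the real data, the same three lines would reproduce the 40 distinct artists and the 34% share reported above.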

## 4.7 Are there any albums that have more than 1 popular track? If yes, which and how many?

I apply the .value_counts() method to check how many times each album appears in the list.

Then I apply a boolean mask to keep only the albums that appear more than once.

In [183]:
albums = spotify.iloc[:, 1].value_counts()
top_albums = albums[albums > 1]
print(top_albums)

album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64


#### Takeaway:

##### 4 albums account for 9 of the 50 tracks (18%). That is not extraordinary enough to draw a conclusion from, but some campaigns could be built around these albums (and, as stated before, their artists).

## 4.8. How many albums in total have their songs in the top 50?

Same as with artists, I apply .unique() to keep only the distinct album names.

In [158]:
total_albums = spotify.iloc[:, 1].unique()
print(len(total_albums))

45


#### Takeaway

##### 45 different albums in the list. For Spotify, it makes more sense to pay attention to single hits than to albums.

## 4.9. Which tracks have a danceability score above 0.7?

I apply boolean mask on danceability column to filter all the rows, where score exceeds 0.7. 

I select track_name and danceability columns for visual purpose.

In [191]:
danceability_score = spotify.loc[spotify["danceability"] > 0.7, ["track_name", "danceability"]]
print(danceability_score)

print(f"Count: {danceability_score.shape[0]}")

                                       track_name  danceability
1                                    Dance Monkey         0.825
2                                         The Box         0.896
3                           Roses - Imanbek Remix         0.785
4                                 Don't Start Now         0.793
5                    ROCKSTAR (feat. Roddy Ricch)         0.746
7                death bed (coffee for your head)         0.726
8                                         Falling         0.784
10                                           Tusa         0.803
13                                Blueberry Faygo         0.774
14                       Intentions (feat. Quavo)         0.806
15                                   Toosie Slide         0.830
17                                         Say So         0.787
18                                       Memories         0.764
19                     Life Is Good (feat. Drake)         0.795
20               Savage Love (Laxed - Si

#### Takeaway:

##### 32 of 50 songs (64%) have a danceability score above 0.7, so danceable songs are really popular. As a business insight, it makes sense to place songs with a high danceability score in party, workout, and dance playlists.

## 4.10 Which tracks have a danceability score below 0.4?

I do the same here, but with different boolean mask to filter out scores below 0.4

In [None]:
danceability_score = spotify.loc[spotify["danceability"] < 0.4, ["track_name", "danceability"]]
print(danceability_score)

              track_name  danceability
44  lovely (with Khalid)         0.351


#### Takeaway

##### Only one song (Billie Eilish - lovely) is below 0.4, which is 2% of the total. This confirms that if you want to make the next hit, your chances are better when you avoid tracks with very low danceability, such as slow ballads.

## 4.11. Which tracks have their loudness above -5?

In [192]:
loudness_score  = spotify.loc[spotify["loudness"] > -5, ["track_name", "loudness"]]
print(loudness_score)

print(f"Count: {loudness_score.shape[0]}")

                                       track_name  loudness
4                                 Don't Start Now    -4.521
6                                Watermelon Sugar    -4.209
10                                           Tusa    -3.280
12                                        Circles    -3.497
16                                  Before You Go    -4.858
17                                         Say So    -4.577
21                                      Adore You    -3.675
23                         Mood (feat. iann dior)    -3.558
31                                 Break My Heart    -3.434
32                                       Dynamite    -4.410
33               Supalonely (feat. Gus Dapperton)    -4.746
35                Rain On Me (with Ariana Grande)    -3.764
37  Sunflower - Spider-Man: Into the Spider-Verse    -4.368
38                                          Hawái    -3.454
39                                        Ride It    -4.258
40                                     g

#### Takeaway

##### 19 of 50 (38%) are louder than -5 dB. Most songs are still a bit quieter than that (the median is about -6 dB), so high loudness is not strictly required for a song's success.

## 4.12. Which tracks have their loudness below -8?

In [193]:
loudness_score  = spotify.loc[spotify["loudness"] < -8, ["track_name", "loudness"]]
print(loudness_score)

print(f"Count: {loudness_score.shape[0]}")

                                        track_name  loudness
7                 death bed (coffee for your head)    -8.765
8                                          Falling    -8.756
15                                    Toosie Slide    -8.820
20                Savage Love (Laxed - Siren Beat)    -8.520
24                             everything i wanted   -14.454
26                                         bad guy   -10.965
36                             HIGHEST IN THE ROOM    -8.764
44                            lovely (with Khalid)   -10.109
47  If the World Was Ending - feat. Julia Michaels   -10.086
Count: 9


#### Takeaway

##### 9 of 50 (18%) sit below -8 dB. Yes, quiet songs can become hits, but it's more the exception than the rule: most songs are noticeably louder.

## 4.13. Which track is the longest?

First I apply .idxmax() method, which returns the index of the longest track. 

Then I use .loc to filter out just the track name and its duration.

In [198]:
duration = spotify.loc[spotify["duration_ms"].idxmax(), ["track_name", "duration_ms"]]
print(duration)

track_name     SICKO MODE
duration_ms        312820
Name: 49, dtype: object


#### Takeaway

##### SICKO MODE is the longest at 312,820 ms (about 5 min 13 s). Even in a fast-paced society, such long songs can still become hits.

## 4.14 Which track is the shortest?

Same as before, but with .idxmin() method.

In [164]:
duration = spotify.loc[spotify["duration_ms"].idxmin(), ["track_name", "duration_ms"]]
print(duration)

track_name     Mood (feat. iann dior)
duration_ms                    140526
Name: 23, dtype: object


#### Takeaway

##### Mood (feat. iann dior) is the shortest at 140,526 ms (about 2 min 21 s). Very short tracks can become hits too.
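Since duration_ms is in milliseconds, a small helper makes the two extremes easier to read. The function name format_duration is my own and purely illustrative.

```python
def format_duration(duration_ms: int) -> str:
    """Convert milliseconds to a 'minutes:seconds' string, rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"


print(format_duration(312820))  # longest track, SICKO MODE -> 5:13
print(format_duration(140526))  # shortest track, Mood (feat. iann dior) -> 2:21
```

So the hits in this list range from roughly two and a half minutes short of, to just over, the five-minute mark.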

## 4.15. Which genre is the most popular?

First I apply .value_counts() to check how many times each genre appears in the list.

Then I find the highest count with .max().

In [165]:
genre = spotify.loc[:, "genre"].value_counts()
top_genre = genre[genre == genre.max()]
print(top_genre)


genre
Pop    14
Name: count, dtype: int64


#### Takeaway

##### 14 of 50 songs (28%) are Pop. Pop is the most reliable style for reaching a bigger audience, so it should definitely stay in Spotify's focus when creating playlists and planning campaigns.
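The percentage can also be obtained directly from pandas instead of by hand. A minimal sketch on a hypothetical toy genre column, using the normalize=True option of .value_counts():

```python
import pandas as pd

# Hypothetical genre column with 10 tracks (not the real chart).
genres = pd.Series(
    ["Pop"] * 4 + ["Hip-Hop/Rap"] * 3 + ["Dance/Electronic"] * 2 + ["Nu-disco"],
    name="genre",
)

# normalize=True returns shares instead of raw counts.
shares = genres.value_counts(normalize=True)
top_genre = shares.idxmax()

print(shares)
print(f"Top genre: {top_genre} ({shares[top_genre]:.0%})")
```

Applied to the real genre column, the Pop entry would come out as 0.28.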

## 4.16. Which genres have just one song on the top 50?

Same approach as before, but with different boolean mask.

In [200]:
one_song_genre = genre[genre == 1]
print(one_song_genre)

print(f"Count: {one_song_genre.sum()}")


genre
R&B/Hip-Hop alternative               1
Nu-disco                              1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: count, dtype: int64
Count: 10


#### Takeaway

##### 10 genres have only 1 song in the top 50. Some niche styles can break into the top list, but they are better served by targeted marketing and campaigns than by mass promotion, where they could get lost among mainstream genres like Pop.

## 4.17. How many genres in total are represented in the top 50?

I use unique() to extract only distinct genres.

In [167]:
unique_genres = spotify.loc[:, "genre"].unique()
print(len(unique_genres))

16


#### Takeaway

##### 16 different genres in total. From before, we know that 10 genres have only one song each in the top 50, so the remaining 6 genres account for 40 of the 50 tracks (80%).

## 4.18. Which features are strongly positively correlated, negatively correlated, and not correlated?

First I call the .corr() method on the numeric columns only.

This returns a square matrix whose index and columns are both the feature names.

Each value shows how strongly two features correlate with each other.

Values near +1 mean a strong positive relationship, near -1 a strong negative relationship, and near 0 little to no linear relationship.

In [168]:
spotify.corr(numeric_only=True)

Unnamed: 0,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms
energy,1.0,0.152552,0.062428,0.79164,-0.682479,0.074267,-0.385515,0.069487,0.393453,0.075191,0.081971
danceability,0.152552,1.0,0.285036,0.167147,-0.359135,0.226148,-0.017706,-0.006648,0.479953,0.168956,-0.033763
key,0.062428,0.285036,1.0,-0.009178,-0.113394,-0.094965,0.020802,0.278672,0.120007,0.080475,-0.003345
loudness,0.79164,0.167147,-0.009178,1.0,-0.498695,-0.021693,-0.553735,-0.069939,0.406772,0.102097,0.06413
acousticness,-0.682479,-0.359135,-0.113394,-0.498695,1.0,-0.135392,0.352184,-0.128384,-0.243192,-0.241119,-0.010988
speechiness,0.074267,0.226148,-0.094965,-0.021693,-0.135392,1.0,0.028948,-0.142957,0.053867,0.215504,0.366976
instrumentalness,-0.385515,-0.017706,0.020802,-0.553735,0.352184,0.028948,1.0,-0.087034,-0.203283,0.018853,0.184709
liveness,0.069487,-0.006648,0.278672,-0.069939,-0.128384,-0.142957,-0.087034,1.0,-0.033366,0.025457,-0.090188
valence,0.393453,0.479953,0.120007,0.406772,-0.243192,0.053867,-0.203283,-0.033366,1.0,0.045089,-0.039794
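tempo,0.075191,0.168956,0.080475,0.102097,-0.241119,0.215504,0.018853,0.025457,0.045089,1.0,0.130328
duration_ms,0.081971,-0.033763,-0.003345,0.06413,-0.010988,0.366976,0.184709,-0.090188,-0.039794,0.130328,1.0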


Then I transform the 2D matrix into a dataframe of feature pairs and their correlation values.

This lets me sort and filter the correlation pairs at scale, so I don't have to compare each feature pair manually.

The .stack() method turns the matrix into a 1D Series with MultiIndex labels (row feature, column feature).

Then I give names to those index levels with .rename_axis().

Finally, I use reset_index(name="r") ("r" for the Pearson correlation coefficient) to turn the MultiIndex levels into actual columns and place the correlation values into a new column "r".

In [169]:
corr = spotify.corr(numeric_only=True)
pairs = corr.stack().rename_axis(["feature_1", "feature_2"]).reset_index(name="r")
print(pairs)

       feature_1         feature_2         r
0         energy            energy  1.000000
1         energy      danceability  0.152552
2         energy               key  0.062428
3         energy          loudness  0.791640
4         energy      acousticness -0.682479
..           ...               ...       ...
116  duration_ms  instrumentalness  0.184709
117  duration_ms          liveness -0.090188
118  duration_ms           valence -0.039794
119  duration_ms             tempo  0.130328
120  duration_ms       duration_ms  1.000000

[121 rows x 3 columns]


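Since the initial 2D matrix is square and symmetric, I drop the self-pairs (A-A) and keep only one triangle of the matrix, so that each pair appears once (A-B, but not B-A). Comparing the two feature names as strings with < keeps exactly one ordering of each pair.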

In [170]:
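# Drop self-pairs (A-A), where both feature names are identical.
pairs = pairs[pairs['feature_1'] != pairs['feature_2']]
# Keep one ordering per pair: the alphabetically smaller name goes first.
pairs = pairs[pairs['feature_1'] < pairs['feature_2']]
print(pairs)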

            feature_1         feature_2         r
2              energy               key  0.062428
3              energy          loudness  0.791640
5              energy       speechiness  0.074267
6              energy  instrumentalness -0.385515
7              energy          liveness  0.069487
8              energy           valence  0.393453
9              energy             tempo  0.075191
11       danceability            energy  0.152552
13       danceability               key  0.285036
14       danceability          loudness  0.167147
16       danceability       speechiness  0.226148
17       danceability  instrumentalness -0.017706
18       danceability          liveness -0.006648
19       danceability           valence  0.479953
20       danceability             tempo  0.168956
21       danceability       duration_ms -0.033763
25                key          loudness -0.009178
27                key       speechiness -0.094965
29                key          liveness  0.278672


After data is prepared, I use boolean masks to find the correlation.

Rule of thumb: an absolute value above 0.7 is a strong positive or negative correlation, and an absolute value below 0.3 is weak or none. For the negative side I relax the cut-off to -0.6, because no pair falls below -0.7.

In [205]:
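# Strong positive: r above 0.7.
strong_positive = pairs[pairs["r"] > 0.7]
# Strong negative: cut-off relaxed from -0.7 to -0.6.
strong_negative = pairs[pairs["r"] < -0.6]
# Weak or no linear correlation: |r| below 0.3.
no_corr = pairs[(pairs["r"] > -0.3) & (pairs["r"] < 0.3)]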

##### Strong positive:

In [172]:
print(strong_positive)

  feature_1 feature_2        r
3    energy  loudness  0.79164


#### Takeaway

##### Energy and loudness (r = 0.79). Louder songs also tend to be higher in energy, so it makes sense for tracks that are meant to sound energetic to be mixed louder.

##### Strong negative:

In [206]:
print(strong_negative)

       feature_1 feature_2         r
44  acousticness    energy -0.682479


#### Takeaway

##### There are no strong negative correlations below -0.7. If I relax the threshold to, say, -0.6, there is one result: acousticness and energy (r = -0.68). The more acoustic a song is (acoustic instruments rather than electronic production), the lower it tends to score on energy. This is useful when creating playlists: for party and workout mixes it's better to look for tracks with lower acousticness, and for chill playlists, tracks with higher acousticness.

##### No or weak correlation:

In [207]:
print(no_corr)

            feature_1         feature_2         r
2              energy               key  0.062428
5              energy       speechiness  0.074267
7              energy          liveness  0.069487
9              energy             tempo  0.075191
11       danceability            energy  0.152552
13       danceability               key  0.285036
14       danceability          loudness  0.167147
16       danceability       speechiness  0.226148
17       danceability  instrumentalness -0.017706
18       danceability          liveness -0.006648
20       danceability             tempo  0.168956
21       danceability       duration_ms -0.033763
25                key          loudness -0.009178
27                key       speechiness -0.094965
29                key          liveness  0.278672
30                key           valence  0.120007
31                key             tempo  0.080475
38           loudness       speechiness -0.021693
42           loudness             tempo  0.102097


#### Takeaway

##### Many feature pairs show little or no linear correlation. This suggests that many of these audio attributes vary largely independently of one another, which gives producers more ways to create unique songs.
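The rule of thumb can also be applied in one pass by ranking pairs on the absolute value of r. In the sketch below the four r values are copied from the correlation matrix above, while the helper name strength and the "moderate" label for the band between 0.3 and 0.7 are my own assumptions.

```python
import pandas as pd

# Four pairs with r values copied from the correlation matrix above.
pairs = pd.DataFrame({
    "feature_1": ["energy", "acousticness", "danceability", "energy"],
    "feature_2": ["loudness", "energy", "valence", "key"],
    "r": [0.791640, -0.682479, 0.479953, 0.062428],
})


def strength(r: float) -> str:
    """Label a correlation by its absolute size (thresholds 0.7 and 0.3)."""
    size = abs(r)
    if size >= 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak or none"


pairs["strength"] = pairs["r"].apply(strength)

# Rank by |r| so strong negative pairs are not buried at the bottom.
ranked = pairs.sort_values("r", key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Under the strict 0.7 cut-off, acousticness and energy land in the middle band, which matches the observation that the pair only appears once the threshold is relaxed to -0.6.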

# 5. How does the danceability score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

Using .isin(), I filter the data down to the rows for the four genres.

Then I group by genre and call .describe() on the danceability column to get summary statistics per genre.

In [175]:
genres = ["Pop", "Hip-Hop/Rap", "Dance/Electronic", "Alternative/Indie"]
subset = spotify[spotify["genre"].isin(genres)]
danceability_comp = subset.groupby("genre")["danceability"].describe()
danceability_comp

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Alternative/Indie,4.0,0.66175,0.211107,0.459,0.4905,0.663,0.83425,0.862
Dance/Electronic,5.0,0.755,0.094744,0.647,0.674,0.785,0.789,0.88
Hip-Hop/Rap,13.0,0.765538,0.08547,0.598,0.726,0.774,0.83,0.896
Pop,14.0,0.677571,0.109853,0.464,0.61575,0.69,0.76275,0.806


#### Takeaway

Hip-Hop/Rap (13 tracks) is at the top with an average danceability of 0.77.

Dance/Electronic has only 5 data points (tracks). Its mean is also near the top: 0.76.

We can conclude that Hip-Hop/Rap and Dance/Electronic are really good genres to fill party, workout and dance playlists with.

Pop (14 tracks) averages around 0.68, ranging from about 0.46 to 0.81. Pop songs are hit-or-miss on the danceability scale.

Alternative/Indie has only 4 data points. It has the lowest average at 0.66 and the widest spread, from about 0.46 all the way up to 0.86. With so few tracks, this makes it the least predictable genre of the four.

# 6. How does the loudness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [176]:
loudness_comp = subset.groupby("genre")["loudness"].describe()
loudness_comp

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Alternative/Indie,4.0,-5.421,0.774502,-6.401,-5.8595,-5.2685,-4.83,-4.746
Dance/Electronic,5.0,-5.338,1.479047,-7.567,-5.652,-5.457,-4.258,-3.756
Hip-Hop/Rap,13.0,-6.917846,1.891808,-8.82,-8.52,-7.648,-5.616,-3.37
Pop,14.0,-6.460357,3.014281,-14.454,-7.17875,-6.6445,-3.87525,-3.28


#### Takeaway

Dance/Electronic and Alternative/Indie are the loudest on average, with mean levels around -5.34 dB and -5.42 dB.

These can go straight into high-energy playlists.

Pop tracks are a bit quieter on average at around -6.46 dB, but they show the biggest variation: from a very quiet outlier at -14.45 dB up to -3.28 dB.

So some Pop tracks are soft, while others match the Dance/Electronic and Alternative/Indie tracks.

Hip-Hop/Rap is the quietest genre on the list, with an average of about -6.92 dB, though it still has a few louder tracks (up to -3.37 dB).

When including these in a mixed playlist, it's probably wise to have some transition between Hip-Hop/Rap and a louder genre, like Dance/Electronic.

# 7. How does the acousticness score compare between Pop, Hip-Hop/Rap, Dance/Electronic, and Alternative/Indie genres?

In [177]:
acousticness_comp = subset.groupby("genre")["acousticness"].describe()
acousticness_comp

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Alternative/Indie,4.0,0.5835,0.204086,0.291,0.52575,0.646,0.70375,0.751
Dance/Electronic,5.0,0.09944,0.095828,0.0137,0.0149,0.0686,0.177,0.223
Hip-Hop/Rap,13.0,0.188741,0.186396,0.00513,0.067,0.145,0.234,0.731
Pop,14.0,0.323843,0.318142,0.021,0.0599,0.259,0.348,0.902


#### Takeaway

Alternative/Indie is the most acoustic on average, with a mean of 0.58.

These tracks are a good fit for chill and acoustic playlists.

Pop averages around 0.32 on acousticness.

It is the most variable genre: some tracks are very acoustic, with scores up to 0.90, while others are almost entirely produced (0.02).

Hip-Hop/Rap averages around 0.19 and has a wide range: from nearly zero acoustic content on some tracks to about 0.73 on one outlier.

Most rap tracks on the list are heavily produced, but there is at least one with a strong acoustic element.

Dance/Electronic is the least acoustic, with a mean of 0.10 and a range of only about 0.01 to 0.22, which is ideal when you want a digital flavor.
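The three genre comparisons can also be condensed into one table. The sketch below is an assumption about how that might look, not part of the original notebook: it uses a hypothetical toy frame and .agg() with a short list of statistics instead of .describe().

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the spotify dataframe.
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Nu-disco"],
    "danceability": [0.60, 0.80, 0.90, 0.70, 0.79],
    "loudness": [-6.0, -4.0, -8.0, -6.0, -4.5],
    "acousticness": [0.30, 0.10, 0.05, 0.15, 0.01],
})

genres = ["Pop", "Hip-Hop/Rap"]
features = ["danceability", "loudness", "acousticness"]

# Filter to the genres of interest, then summarise all features at once.
subset = toy[toy["genre"].isin(genres)]
summary = subset.groupby("genre")[features].agg(["mean", "median", "std"]).round(3)

print(summary)
```

The result has one row per genre and a two-level column index (feature, statistic), which makes side-by-side comparison easier than three separate tables.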

# Conclusion

Bottom line: 2020's Top 50 hits share a clear trend.

Most are very danceable (64% score above 0.70), many are mixed loud (38% sit above -5 dB), and energy rises with loudness (r = 0.79).

Seven artists account for about a third of the list (17 of 50 tracks) and four albums for 9 tracks, yet 40 different artists and 16 genres still appear, showing room for both big names and fresh sounds.

Hip-Hop/Rap and Dance/Electronic lead on danceability, Alternative/Indie produces the most acoustic tracks, and Pop sits in the middle on every measure, but with wide variety.

Niche genres break through only one track at a time, so they need targeted rather than mass promotion.

So, the takeaway for Spotify is clear: keep the best-known artists in the spotlight, fill party and workout playlists with the most danceable, loud tracks, and rely on genre-specific playlists to bring quieter or more acoustic songs to the right listeners.
