# 📋 Table of contents

- [🎓 Context](#🎓-Context)
- [🛠️ Project Setup](#🛠️-Project-Setup)
- [💾 Loading Data](#💾-Loading-Data)
- [💡 First Insights](#💡-First-Insights)
- [❓ Missing Values](#❓-Missing-Values)
- [⚙️ Dtype Correction](#⚙️-Dtype-Correction)
- [🏆 Most Streamed Songs in 2023](#🏆-Most-Streamed-Songs-in-2023)
- [🥉 Top 1000 Analysis](#🥉-Top-1000-Analysis)
    - [🔍 Top 1000 - Descriptive Statistics](#🔍-Top-1000---Descriptive-Statistics)
    - [📊 Top 1000 - Numerical Values Analysis](#📊-Top-1000---Numerical-Values-Analysis)
    - [📊 Top 1000 - Categorical Values Columns](#📊-Top-1000---Categorical-Values-Columns)
- [🥈 Top 500 Analysis](#🥈-Top-500-Analysis)
    - [🔍 Top 500 - Descriptive Statistics](#🔍-Top-500---Descriptive-Statistics)
    - [📊 Top 500 - Numerical Values Analysis](#📊-Top-500---Numerical-Values-Analysis)
    - [📊 Top 500 - Categorical Values Columns](#📊-Top-500---Categorical-Values-Columns)
- [🥇 Top 100 Analysis](#🥇-Top-100-Analysis)
    - [🔍 Top 100 - Descriptive Statistics](#🔍-Top-100---Descriptive-Statistics)
    - [📊 Top 100 - Numerical Values Analysis](#📊-Top-100---Numerical-Values-Analysis)
    - [📊 Top 100 - Categorical Values Columns](#📊-Top-100---Categorical-Values-Columns)
- [🏆 What Does It Take To Top the Spotify Charts ?](#🏆-What-Does-It-Take-To-Top-the-Spotify-Charts-?)
- [📈 Music Evolution Over the Years](#📈-Music-Evolution-Over-the-Years)
- [🎓 Conclusion](#🎓-Conclusion)
- [📝 Note of the Author](#📝-Note-of-the-Author)

# **🎓 Context**

This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song's attributes, popularity, and presence on various music platforms. The dataset includes information such as **track name**, **artist(s) name**, **release date**, **Spotify playlists and charts**, **streaming statistics**, **Apple Music presence**, **Deezer presence**, **Shazam charts**, and **various audio features**.

# **🛠️ Project Setup**

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# **💾 Loading Data**

In [None]:
data_df = pd.read_csv('../input/top-spotify-songs-2023/spotify-2023.csv', encoding='ISO-8859-1')

This code block displays the first few rows of the **data_df** dataframe using the *head* method. By default, head displays the first 5 rows of the dataframe, but this can be changed by passing a different number as an argument. This is a useful step when working with a new dataset, as it lets us quickly inspect the data and get a sense of what it looks like. In this case, we can see the values of the various features for the first few songs in the dataset.

In [None]:
data_df.head()

This code block prints out the shape of the **data_df** dataframe using the *shape* attribute. The shape attribute returns a tuple with two values: the **number of rows** and the **number of columns** in the dataframe.

The output shows that our dataset is made of 953 rows and 24 columns, which means that we have 953 of the most-streamed songs on Spotify in 2023.

In [None]:
print('data_df : ', data_df.shape)

The first print statement in this code block prints the column names of the **data_df** dataframe using the *columns.values* attribute. The columns attribute returns an *Index* object containing the column labels of the dataframe, while the values attribute returns a numpy array containing the actual labels. This step is useful to get an overview of the features available in the dataset.

The second print statement uses the info method to print a summary of the **data_df** dataframe, including information on the number of non-null values and data types of each column. This is a useful step to take when working with a new dataset, as it allows us to quickly identify any missing values or data types that need to be converted. In this case, we can see that some of the columns have missing values, and that some of the data types may need to be converted to a more appropriate format.

In [None]:
print(data_df.columns.values)
print('='*93)
data_df.info()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

## **💡 First Insights**

We now know that our dataset is made of **953 rows** and **24 columns**, which means that we have **953 of the most-streamed songs on Spotify in 2023**. Those **24 columns** contain various useful information, such as :
 
 * **track_name** : *Name of the song*
 * **artist(s)_name** : *Name of the artist(s) of the song*
 * **artist_count** : *Number of artists contributing to the song*
 * **released_year** : *Year when the song was released*
 * **released_month** : *Month when the song was released*
 * **released_day** : *Day of the month when the song was released*
 * **in_spotify_playlists** : *Number of Spotify playlists the song is included in*
 * **in_spotify_charts** : *Presence and rank of the song on Spotify charts*
 * **streams** : *Total number of streams on Spotify*
 * **in_apple_playlists** : *Number of Apple Music playlists the song is included in*
 * **in_apple_charts** : *Presence and rank of the song on Apple Music charts*
 * **in_deezer_playlists** : *Number of Deezer playlists the song is included in*
 * **in_deezer_charts** : *Presence and rank of the song on Deezer charts*
 * **in_shazam_charts** : *Presence and rank of the song on Shazam charts*
 * **bpm** : *Beats per minute, a measure of song tempo*
 * **key** : *Key of the song*
 * **mode** : *Mode of the song (major or minor)*
 * **danceability_%** : *Percentage indicating how suitable the song is for dancing*
 * **valence_%** : *Positivity of the song's musical content*
 * **energy_%** : *Perceived energy level of the song*
 * **acousticness_%** : *Amount of acoustic sound in the song*
 * **instrumentalness_%** : *Amount of instrumental content in the song*
 * **liveness_%** : *Presence of live performance elements*
 * **speechiness_%** : *Amount of spoken words in the song*

Our main goal in this notebook is to try to understand why one song is more successful than another on Spotify in 2023.

# **❓ Missing Values**

This code block creates a bar plot of the missing values in the **data_df** dataframe, using the *isnull* and *sum* methods to count the number of missing values in each column. The resulting series is then filtered with a boolean mask to keep only the columns with missing values, and sorted in ascending order by the number of missing values using the *sort_values* method. Finally, the series is plotted using its pandas *plot.bar* method, which relies on matplotlib.

In [None]:
missing = data_df.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
ax = missing.plot.bar(color='#1d8954')
ax.bar_label(ax.containers[0])
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : The resulting plot shows the number of missing values for each column in the dataset, ordered from least to most missing values. This is a useful step to take when working with a new dataset, as it allows us to quickly identify which features have missing values and how severe the problem is. In this case, we can see that only 2 columns have missing values : in_shazam_charts and key. This information can be used to guide subsequent data cleaning and imputation steps.
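
To judge how severe the problem is, it can also help to express the missing counts as a share of all rows. The following is a minimal sketch on a made-up toy dataframe (`toy_df` and its values are hypothetical, not the real dataset); the same steps should apply to **data_df**.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_df (values are made up).
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c", "d"],
    "key": ["C#", np.nan, "F", np.nan],
    "in_shazam_charts": ["1", "2", np.nan, "4"],
})

# Count missing values per column and keep only the affected columns.
missing_count = toy_df.isnull().sum()
missing_count = missing_count[missing_count > 0].sort_values()

# Express the same information as a percentage of all rows.
missing_pct = (missing_count / len(toy_df) * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "missing_%": missing_pct})
print(summary)
```

A percentage makes it easier to decide between dropping rows, imputing values or leaving the column as it is.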

# ⚙️ Dtype Correction

We can observe that some of our columns have the wrong Dtype. **streams** should be an int64 because it represents the number of streams of the song, and **in_deezer_playlists** and **in_shazam_charts** should also be numeric because they represent, respectively, the number of Deezer playlists containing the song and its rank on Shazam.

Let's change their Dtype to get more useful data. One of the songs has a malformed stream count, with a value of *BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3*, which is why the column has the wrong Dtype in the first place. According to its Spotify page ([link](https://open.spotify.com/intl-fr/track/5Ts1DYOuouQLgzTaisxWYh)), this song has a stream count of 209,536,449. Let's correct our data with this information, and finally get a correct **streams** column.

In [None]:
data_df['track_name'].loc[data_df['streams'] == 'BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3']

In [None]:
data_df.loc[data_df['track_name'] == 'Love Grows (Where My Rosemary Goes)', 'streams'] = 209536449
data_df['track_name'].loc[data_df['streams'] == 'BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3']

In [None]:
data_df['streams'] = data_df['streams'].astype('int64')
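
When the malformed value is not known in advance, one way to find it is to let pandas attempt the conversion and look at what fails. This is only a sketch on a hypothetical toy series (`toy_streams` and its values are made up); `pd.to_numeric` with `errors="coerce"` is a standard pandas call.

```python
import pandas as pd

# Hypothetical toy column: numeric strings plus one malformed entry.
toy_streams = pd.Series(["1000", "BPM110KeyA", "2500"], name="streams")

# errors="coerce" turns anything unparsable into NaN instead of raising.
parsed = pd.to_numeric(toy_streams, errors="coerce")

# Entries that were present originally but failed to parse are the malformed ones.
bad_rows = toy_streams[parsed.isna() & toy_streams.notna()]
print(bad_rows)
```

The index of `bad_rows` points at the rows to inspect and correct by hand, as done above for the **streams** column.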

Now that our streams are fixed, let's focus on **in_deezer_playlists** and **in_shazam_charts**. The problem here is that large numbers are written with a thousands separator, which makes pandas read these columns as object. Let's fix that by removing the separator.

In [None]:
# Remove the thousands separator, then convert to integers
data_df['in_deezer_playlists'] = data_df['in_deezer_playlists'].str.replace(',', '', regex=False).astype('int64')

# in_shazam_charts has missing values, so it is converted to float64 (NaN is preserved)
data_df['in_shazam_charts'] = pd.to_numeric(data_df['in_shazam_charts'].str.replace(',', '', regex=False))

In [None]:
data_df.info()

# **🏆 Most Streamed Songs in 2023**

In [None]:
# Top 10 songs with most streams on Spotify
top_spotify_streams = data_df[['track_name', 'artist(s)_name', 'streams']].sort_values(by='streams', ascending=False).head(10)

# Plot
plt.figure(figsize=(15, 5))
ax = sns.barplot(x=top_spotify_streams['streams'], y=top_spotify_streams['track_name'], palette='Greens')
plt.xlabel('Streams (in billions)')
plt.ylabel('Track Name')
plt.title('Top 10 Songs with Most Streams on Spotify')
plt.xticks(rotation=45)
ax.bar_label(ax.containers[0])

plt.show()
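
The x-axis above is labelled in billions while the plotted values are raw stream counts. One possible way to make the two agree is to rescale the column before plotting. This is a minimal sketch on a hypothetical toy dataframe (`toy_top`, `streams_billions` and the values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frame standing in for top_spotify_streams (values are made up).
toy_top = pd.DataFrame({
    "track_name": ["song_a", "song_b"],
    "streams": [3_700_000_000, 2_800_000_000],
})

# Rescale raw counts to billions so that the axis label and the values agree.
toy_top["streams_billions"] = (toy_top["streams"] / 1e9).round(2)
print(toy_top)
```

Plotting `streams_billions` instead of `streams` would then give bar labels such as 3.7 rather than long raw counts.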

# **🥉 Top 1000 Analysis**

## **🔍 Top 1000 - Descriptive Statistics**

This code block generates a summary of the numerical features in the **data_df** dataframe using the *describe* method. The resulting summary contains statistical information about the numerical columns in the dataset, including the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values.

This summary can be used to quickly identify any outliers, skewness, or other issues with the numerical columns in the dataset, and to guide subsequent data cleaning and feature engineering steps.

In [None]:
data_df.describe()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : Here are some observations from the provided output :

**1. Released Year, Month, and Day** :
   * The dataset contains songs released between 1930 and 2023, with a mean release year of 2018.
   * The mean release month is June, and the mean release day is around the 14th of the month.

**2. Popularity Measures** :
   * The mean number of streams for a song is approximately 513,817,800, with a standard deviation of 566,645,100.
   * The mean number of times a song appears in Spotify playlists is approximately 5200, with a standard deviation of 7897. This suggests that there is a significant variation in popularity among songs.
   * The average appearance of songs in Spotify charts is around 12, with a standard deviation of 19.57.
   * Similarly, the average appearance in Apple playlists is approximately 68, and in Apple charts, it is around 52.
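   * The mean appearances in Deezer playlists and Deezer charts are approximately 385 and 2.67, respectively. A Shazam charts mean of 385 would simply mirror the Deezer playlists figure, which happens when **in_shazam_charts** is built from **in_deezer_playlists**; the Shazam mean should be read from the output only after that column is converted from its own values.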

**3. Song Characteristics** :
   * The average beats per minute (BPM) of songs is 122.54, with a standard deviation of 28.05.
   * The average danceability percentage is 66.97%, with a standard deviation of 14.63.
   * The average valence (positivity) percentage is 51.43%, with a standard deviation of 23.48.
   * The average energy level is 64.28%, with a standard deviation of 16.55.
   * The average acousticness percentage is 27.06%, with a standard deviation of 25.99.
   * The average instrumentalness percentage is 1.58%, with a standard deviation of 8.41.
   * The average liveness percentage is 18.21%, with a standard deviation of 13.71.
   * The average speechiness percentage is 10.13%, with a standard deviation of 9.91.

**4. Artist Count** :
   * On average, each song has 1.56 artists associated with it.

In [None]:
data_df.describe(include=['O'])

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : Here are some observations from the provided output :

**1. Track Name and Artists** :
   * There are 953 tracks in the dataset, and some track names appear more than once: the most frequent one, "Daylight", appears twice.
   * Among the 953 tracks, there are 645 unique values of **artist(s)_name**. Taylor Swift is the top artist, appearing 34 times in the dataset.

**2. Musical Key and Mode** :
   * The musical keys have 11 unique values, with C# being the most frequent key in the dataset.
   * The mode has 2 unique values, with Major being the most common mode.

## **📊 Top 1000 - Numerical Values Analysis**

This code selects only the numerical columns in the **data_df** DataFrame and creates a new DataFrame called **df_num** that contains only these columns. It will be easier to analyze our numerical values by using a DataFrame containing only this kind of value. 

In [None]:
df_num = data_df.select_dtypes(include = ['float64', 'int64']).copy()
df_num.head()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : We can see that our dataset contains 20 numerical columns : artist_count, released_year, released_month, released_day, in_spotify_playlists, in_spotify_charts, streams, in_apple_playlists, in_apple_charts, in_deezer_playlists, in_deezer_charts, in_shazam_charts, bpm, danceability_%, valence_%, energy_%, acousticness_%, instrumentalness_%, liveness_% and speechiness_%.
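
Note that *select_dtypes* only keeps columns whose dtype matches the requested ones, so a column still stored as text, or stored in another numeric width, is silently left out. The sketch below uses a hypothetical toy dataframe (`toy_df` and its values are made up) to compare the explicit list used above with the broader `include="number"` selector.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_df = pd.DataFrame({
    "track_name": ["a", "b"],
    "artist_count": pd.Series([1, 2], dtype="int64"),
    "bpm": pd.Series([120, 95], dtype="int32"),   # a different integer width
    "valence_%": pd.Series([55.0, 72.5], dtype="float64"),
    "in_deezer_playlists": ["1,234", "56"],       # still text, not numeric
})

# Explicit dtypes: only exact int64 / float64 columns are kept.
strict = toy_df.select_dtypes(include=["float64", "int64"]).columns.tolist()

# "number" keeps every numeric dtype, whatever its width.
broad = toy_df.select_dtypes(include="number").columns.tolist()

print(strict)
print(broad)
```

In both cases the text column is excluded, which is why the Dtype correction has to come before this step.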

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

for i, col in enumerate(df_num.columns):
        sns.histplot(data=df_num, x=col, kde=True, color='#1d8954', ax=ax[i//4,i%4])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : With all these distributions, we can draw general conclusions about the top 1000 songs on Spotify by looking at their skewness. released_year, danceability_% and energy_% are left-skewed, which means that few songs in the top 1000 are old, have low danceability or have low energy. Conversely, acousticness_%, instrumentalness_%, liveness_% and speechiness_% are right-skewed, so few songs score high on these features.

**⚠️ Be careful, this doesn't mean that high energy or low acousticness means that the song will perform better. To know this, we need to do a bivariate analysis. ⚠️**
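
Skewness can also be checked numerically rather than only by eye. This is a minimal sketch using the pandas *skew* method on a hypothetical toy dataframe (`toy_num` and its values are made up): a negative value indicates a long tail on the left, a positive value a long tail on the right.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_num = pd.DataFrame({
    # Most values are recent, one is old: long tail on the left.
    "released_year": [2023, 2023, 2022, 2022, 2021, 2020, 1975],
    # Most values are low, one is high: long tail on the right.
    "instrumentalness_%": [0, 0, 0, 1, 2, 5, 60],
})

# Sample skewness of each column.
skewness = toy_num.skew()
print(skewness)
```

Applied to **df_num**, sorting this series would give a quick ranking of the most asymmetric features.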

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

num_cols = df_num.loc[:, df_num.columns != 'streams'].columns

for i, col in enumerate(num_cols):
        sns.regplot(data=df_num, x=col, y='streams', color='#1d8954', ax=ax[i//4,i%4])

ax[4, 3].set_axis_off()

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : It seems that the more playlists a song appears in, the more it is streamed by Spotify users. Songs with many artists tend to have fewer streams, and neither the release day nor the release month seems to influence how much a song is streamed.

In [None]:
# compute the correlation matrix
cor_numVar = df_num.corr(method='pearson')

# sort on decreasing correlations with streams
cor_sorted = cor_numVar['streams'].sort_values(ascending=False)

# select only high correlations
CorHigh = cor_sorted[abs(cor_sorted) > 0.1].index
cor_numVar = cor_numVar.loc[CorHigh, CorHigh]

# plot the correlation matrix
plt.figure(figsize=(8,8))
corrplot = sns.heatmap(cor_numVar, annot=True, cmap='Greens', square=True, linewidths=0.5, linecolor='white')
corrplot.set_xticklabels(corrplot.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Correlations Matrix for the Top 1000')

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The correlation matrix confirms it: being in a lot of playlists is strongly correlated with the number of streams of the song ! This is a correlation, not proof that playlists cause streams. Sadly, it doesn't give any information about the song itself...
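
Pearson correlation measures linear association and can be pulled around by a few songs with extreme stream counts. As a complementary check, a rank-based (Spearman) correlation can be obtained by ranking the values first and then applying Pearson. The sketch below uses a hypothetical toy dataframe (`toy_num` and its values are made up).

```python
import pandas as pd

# Hypothetical toy frame with one extreme stream count (values are made up).
toy_num = pd.DataFrame({
    "streams": [10, 20, 40, 80, 5000],
    "in_spotify_playlists": [1, 2, 3, 4, 5],
    "bpm": [120, 90, 140, 100, 110],
})

# Linear correlation with streams.
pearson = toy_num.corr(method="pearson")["streams"]

# Spearman correlation computed explicitly: Pearson correlation of the ranks.
spearman = toy_num.rank().corr(method="pearson")["streams"]

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

Here the two columns increase together perfectly in rank, so the Spearman value is 1 even though the Pearson value is lower because of the outlier.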

## **📊 Top 1000 - Categorical Values Columns**

This code block selects only the categorical columns from the dataset using the Pandas *select_dtypes* method. The argument *include=['O']* specifies that only object columns (**i.e.** columns containing strings) should be selected. The resulting dataframe **df_cat** only contains the categorical columns of the original dataset, and **cat_cols** is an Index containing the names of all the categorical columns.

In [None]:
df_cat = data_df.select_dtypes(include = ['O']).copy()
cat_cols = df_cat.columns
df_cat.head()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : We can see that our dataset contains 4 categorical columns : track_name, artist(s)_name, key and mode.

In [None]:
f,ax = plt.subplots(1,2,figsize=(15, 5))

for i, col in enumerate(['key', 'mode']):
    sns.countplot(data=df_cat, x=col, palette='Greens', ax=ax[i])
    ax[i].set_title(col + ' Distribution in the Top 1000')
    ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')
    ax[i].bar_label(ax[i].containers[0])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : As we have seen earlier in the descriptive analysis, C# is the most common key among the top 1000 songs, and most of them are in a Major mode.

In [None]:
top_artists = data_df['artist(s)_name'].value_counts().head(10)

plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_artists.values, y=top_artists.index, palette='Greens')
plt.xlabel('Number of Songs')
plt.ylabel('Artist(s) Name')
plt.title('Top 10 Artists with Most Songs in the top 1000')
ax.bar_label(ax.containers[0])
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
   
**Analysis** : Taylor Swift is the queen of this year with 34 songs in the top 1000, followed by The Weeknd with 22 songs and Bad Bunny and SZA, both having 19 songs in the top 1000.

In [None]:
f = pd.melt(data_df, id_vars=['streams'], value_vars=['key', 'mode'])
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
for i, col in enumerate(['key', 'mode']):
        sns.boxplot(data=f[f['variable']==col], x='value', y='streams', palette='Greens', ax=ax[i])
        ax[i].set_xlabel(col)
        ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')

plt.show()
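
The boxplots can be backed by a small numeric summary per category. This is a minimal sketch on a hypothetical toy dataframe (`toy_df` and its values are made up), using a standard pandas *groupby* followed by *agg*.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_df = pd.DataFrame({
    "mode": ["Major", "Major", "Minor", "Minor", "Major"],
    "streams": [100, 300, 200, 600, 500],
})

# Number of songs, median and mean streams for each mode.
stats = toy_df.groupby("mode")["streams"].agg(["count", "median", "mean"])
print(stats)
```

The median is the line drawn inside each box, so this table gives the same reading as the plot in numbers. Replacing `"mode"` with `"key"` would give the summary for the other plot.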

# **🥈 Top 500 Analysis**

In [None]:
data_df_500 = data_df.sort_values(by='streams', ascending=False).head(500)
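
The same kind of subset can also be taken with the pandas *nlargest* method, which avoids repeating the sort for every subset size. This is a sketch on a hypothetical toy dataframe; the helper name `top_n` is an assumption made for illustration and is not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_df = pd.DataFrame({
    "track_name": ["a", "b", "c", "d", "e"],
    "streams": [50, 900, 300, 700, 100],
})

def top_n(df, n, by="streams"):
    """Return the n rows with the largest values in column `by`, largest first."""
    return df.nlargest(n, by)

top_3 = top_n(toy_df, 3)

# Same result as sorting in descending order and taking the head (no ties here).
same = top_3.equals(toy_df.sort_values(by="streams", ascending=False).head(3))
print(top_3)
print(same)
```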

## **🔍 Top 500 - Descriptive Statistics**

This code block generates a summary of the numerical features in the **data_df_500** dataframe using the *describe* method. The resulting summary contains statistical information about the numerical columns in the dataset, including the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values.

This summary can be used to quickly identify any outliers, skewness, or other issues with the numerical columns in the dataset, and to guide subsequent data cleaning and feature engineering steps.

In [None]:
data_df_500.describe()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Both the top 1000 and the top 500 have similar mean values for musical attributes such as BPM, danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness. However, the top 500 generally has a slightly lower standard deviation for these attributes, indicating less variability. The mean number of appearances in Spotify playlists is higher in the top 500 (8785) than in the top 1000 (5200). The mean artist count is 1.56 in the top 1000 and 1.44 in the top 500, which suggests that, on average, songs in the top 500 involve slightly fewer artists. Overall, the top 1000 shows more variability in certain attributes, while the top 500 has higher popularity measures and less variation in musical characteristics.

In [None]:
data_df_500.describe(include=['O'])

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The most common mode and key among the top 500 songs on Spotify are the same as in the top 1000.

## **📊 Top 500 - Numerical Values Analysis**

In [None]:
df_num = data_df_500.select_dtypes(include = ['float64', 'int64']).copy()
df_num.head()

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

for i, col in enumerate(df_num.columns):
        sns.histplot(data=df_num, x=col, kde=True, color='#1d8954', ax=ax[i//4,i%4])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : With all these distributions, we can draw the same general conclusions about the top 500 as about the top 1000 songs on Spotify.

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

num_cols = df_num.loc[:, df_num.columns != 'streams'].columns

for i, col in enumerate(num_cols):
        sns.regplot(data=df_num, x=col, y='streams', color='#1d8954', ax=ax[i//4,i%4])

ax[4, 3].set_axis_off()

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : Presence in playlists is, logically, still the variable most strongly associated with the number of streams.

In [None]:
# compute the correlation matrix
cor_numVar = df_num.corr(method='pearson')

# sort on decreasing correlations with streams
cor_sorted = cor_numVar['streams'].sort_values(ascending=False)

# select only high correlations
CorHigh = cor_sorted[abs(cor_sorted) > 0.1].index
cor_numVar = cor_numVar.loc[CorHigh, CorHigh]

# plot the correlation matrix
plt.figure(figsize=(8,8))
corrplot = sns.heatmap(cor_numVar, annot=True, cmap='Greens', square=True, linewidths=0.5, linecolor='white')
corrplot.set_xticklabels(corrplot.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Correlations Matrix for the Top 500')

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : The correlation matrix still confirms that being in a lot of playlists is strongly correlated with the number of streams of the song, but we can also notice that speechiness_% has disappeared from the variables most correlated with the stream count.

## **📊 Top 500 - Categorical Values Columns**

In [None]:
df_cat = data_df_500.select_dtypes(include = ['O']).copy()
cat_cols = df_cat.columns
df_cat.head()

In [None]:
f,ax = plt.subplots(1,2,figsize=(15, 5))

for i, col in enumerate(['key', 'mode']):
    sns.countplot(data=df_cat, x=col, palette='Greens', ax=ax[i])
    ax[i].set_title(col + ' Distribution in the Top 500')
    ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')
    ax[i].bar_label(ax[i].containers[0])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : As in the top 1000, C# is the most common key in the top 500 and most songs are in a Major mode. The difference is that this is more noticeable than before, with 59.4% of songs in a Major mode and 13.4% in the key of C#.
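
Shares like these can be computed directly with *value_counts(normalize=True)*. The following is a minimal sketch on a hypothetical toy dataframe (`toy_cat` and its values are made up, so the percentages below are not those of the dataset).

```python
import pandas as pd

# Hypothetical toy frame (values are made up).
toy_cat = pd.DataFrame({
    "key": ["C#", "C#", "G", "F", "G", "C#", "A", "F", "C#", "B"],
    "mode": ["Major"] * 6 + ["Minor"] * 4,
})

# Share of each category in percent, most frequent first.
mode_share = toy_cat["mode"].value_counts(normalize=True).mul(100).round(1)
key_share = toy_cat["key"].value_counts(normalize=True).mul(100).round(1)

print(mode_share)
print(key_share)
```

Note that *value_counts* ignores missing values by default, so the shares are relative to the songs whose key is known.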

In [None]:
top_artists = data_df_500['artist(s)_name'].value_counts().head(10)

plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_artists.values, y=top_artists.index, palette='Greens')
plt.xlabel('Number of Songs')
plt.ylabel('Artist(s) Name')
plt.title('Top 10 Artists with Most Songs in the top 500')
ax.bar_label(ax.containers[0])
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Taylor Swift is still the queen of this top with 18 songs in the top 500, closely followed by Bad Bunny in second place with 17 songs, ahead of The Weeknd and Harry Styles, who both have 11 songs.

In [None]:
f = pd.melt(data_df_500, id_vars=['streams'], value_vars=['key', 'mode'])
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
for i, col in enumerate(['key', 'mode']):
        sns.boxplot(data=f[f['variable']==col], x='value', y='streams', palette='Greens', ax=ax[i])
        ax[i].set_xlabel(col)
        ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')

plt.show()

# **🥇 Top 100 Analysis**

In [None]:
data_df_100 = data_df.sort_values(by='streams', ascending=False).head(100)

## **🔍 Top 100 - Descriptive Statistics**

This code block generates a summary of the numerical features in the **data_df_100** dataframe using the *describe* method. The resulting summary contains statistical information about the numerical columns in the dataset, including the count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values.

This summary can be used to quickly identify any outliers, skewness, or other issues with the numerical columns in the dataset, and to guide subsequent data cleaning and feature engineering steps.

In [None]:
data_df_100.describe()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The average BPM, danceability, valence, energy, acousticness, instrumentalness, liveness, and speechiness values are generally similar to the other datasets. However, the standard deviation for several of these attributes, including streams, BPM, and liveness, is higher in this dataset, indicating more variability. The mean number of appearances in Spotify playlists is 20552, which is higher than the previous two datasets. The mean release year for this dataset is 2012.78, which is the earliest mean release year among the three datasets. The mean artist count in this dataset is 1.31, which is the lowest among all three datasets. This latest dataset showcases a higher level of popularity measures and higher appearances in various charts, along with a slightly earlier period of release. The variability in certain musical attributes is also higher in this dataset compared to the previous ones.

In [None]:
data_df_100.describe(include=['O'])

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">
    
**Analysis** : The most common mode and key among the top 100 songs on Spotify are the same as in the top 1000 and top 500. But it seems that Taylor Swift lost her crown to Ed Sheeran.

## **📊 Top 100 - Numerical Values Analysis**

In [None]:
df_num = data_df_100.select_dtypes(include = ['float64', 'int64']).copy()
df_num.head()

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

for i, col in enumerate(df_num.columns):
        sns.histplot(data=df_num, x=col, kde=True, color='#1d8954', ax=ax[i//4,i%4])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Still the same distributions.

In [None]:
f,ax = plt.subplots(5,4,figsize=(25, 20))

num_cols = df_num.loc[:, df_num.columns != 'streams'].columns

for i, col in enumerate(num_cols):
        sns.regplot(data=df_num, x=col, y='streams', color='#1d8954', ax=ax[i//4,i%4])

ax[4, 3].set_axis_off()

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Presence in playlists is still the variable most strongly associated with the number of streams.

In [None]:
# compute the correlation matrix
cor_numVar = df_num.corr(method='pearson')

# sort on decreasing correlations with streams
cor_sorted = cor_numVar['streams'].sort_values(ascending=False)

# select only high correlations
CorHigh = cor_sorted[abs(cor_sorted) > 0.1].index
cor_numVar = cor_numVar.loc[CorHigh, CorHigh]

# plot the correlation matrix
plt.figure(figsize=(8,8))
corrplot = sns.heatmap(cor_numVar, annot=True, cmap='Greens', square=True, linewidths=0.5, linecolor='white')
corrplot.set_xticklabels(corrplot.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Correlations Matrix for the Top 100')
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The correlation matrix still confirms that being in a lot of playlists is strongly correlated with the number of streams of the song. What also appears here, and in the top 500 correlation matrix too, is that the release day of the song shows some correlation with its streams.

## **📊 Top 100 - Categorical Values Columns**

In [None]:
df_cat = data_df_100.select_dtypes(include = ['O']).copy()
cat_cols = df_cat.columns
df_cat.head()

In [None]:
f,ax = plt.subplots(1,2,figsize=(15, 5))

for i, col in enumerate(['key', 'mode']):
    sns.countplot(data=df_cat, x=col, palette='Greens', ax=ax[i])
    ax[i].set_title(col + ' Distribution in the Top 100')
    ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')
    ax[i].bar_label(ax[i].containers[0])

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : As in the top 500 and top 1000, C# is the most common key in the top 100 and most songs are in a Major mode. The difference is that this is even more noticeable than before, with 64% of songs in a Major mode and 17% in the key of C#.

In [None]:
top_artists = data_df_100['artist(s)_name'].value_counts().head(10)

plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_artists.values, y=top_artists.index, palette='Greens')
plt.xlabel('Number of Songs')
plt.ylabel('Artist(s) Name')
plt.title('Top 10 Artists with Most Songs in the top 100')
ax.bar_label(ax.containers[0])
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : A new king has appeared as Ed Sheeran takes the throne with 6 songs in the top 100 ! 

In [None]:
# Reshape to long format so 'key' and 'mode' can be plotted with the same code
melted = pd.melt(data_df_100, id_vars=['streams'], value_vars=['key', 'mode'])
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
for i, col in enumerate(['key', 'mode']):
    sns.boxplot(data=melted[melted['variable']==col], x='value', y='streams', palette='Greens', ax=ax[i])
    ax[i].set_xlabel(col)
    ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45, horizontalalignment='right')

plt.show()

# **🏆 What Does It Take To Top the Spotify Charts ?**

We will now compare 6 subsets of the Spotify dataset to highlight the differences between songs of the **top 10**, **top 50**, **top 100**, **top 250**, **top 500** and **top 1000**. To highlight them, I will compare **average**, **max**, **min** and **std** values of numeric columns, and see the evolution of the **proportion** of categorical values.
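
Since the same four statistics are computed for every numeric column, the comparison can be written once as a small helper. This is only a sketch under assumptions: `summarize_top_n` is a hypothetical function name, and the toy DataFrame stands in for `data_df`.

```python
import pandas as pd

def summarize_top_n(df, column, sizes, sort_by='streams'):
    """Return min/max/mean/std of `column` for the top-n rows ranked by `sort_by`."""
    ranked = df.sort_values(by=sort_by, ascending=False)
    rows = {}
    for n in sizes:
        # head(n) on the ranked frame gives the n most streamed songs
        rows[f'Top {n}'] = ranked.head(n)[column].agg(['min', 'max', 'mean', 'std'])
    # One row per subset, one column per statistic
    return pd.DataFrame(rows).T

# Toy data for illustration only
toy_df = pd.DataFrame({
    'streams': [900, 800, 700, 600, 500, 400],
    'bpm': [100, 120, 90, 140, 110, 170],
})
summary = summarize_top_n(toy_df, 'bpm', sizes=[2, 4, 6])
print(summary)
```

On the real data, a call such as `summarize_top_n(data_df, 'bpm', sizes=[10, 50, 100, 250, 500, 1000])` would return the same numbers that are plotted in the cells of this section.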

In [None]:
data_df_10 = data_df.sort_values(by='streams', ascending=False).head(10)
data_df_50 = data_df.sort_values(by='streams', ascending=False).head(50)
data_df_250 = data_df.sort_values(by='streams', ascending=False).head(250)

data = [data_df_10, data_df_50, data_df_100, data_df_250, data_df_500, data_df]
labels = ['Top 10', 'Top 50', 'Top 100', 'Top 250', 'Top 500', 'Top 1000']

In [None]:
sns.set_style("white")

key_counts = []
key_percentages = []

for df in data:
    # Count matches directly so a subset without any C# song gives 0 instead of a KeyError
    key_count = (df['key'] == 'C#').sum()
    key_counts.append(key_count)
    key_percentage = (key_count / len(df)) * 100
    key_percentages.append(key_percentage)

fig, ax1 = plt.subplots(figsize=(12, 6))

ax1.bar(labels, key_counts, color='#1d8954', alpha=0.7, label='Count of C# key')
ax1.set_xlabel('Subset of Songs')
ax1.set_ylabel('Count of C# key')
ax1.set_title('Distribution and Percentage of C# key in Different Subsets of Songs')
ax1.set_ylim(0, max(key_counts) + 2)

ax2 = ax1.twinx()
ax2.plot(labels, key_percentages, marker='o', color='lightcoral', linestyle='-', markersize=8, label='% of C# key')
ax2.set_ylabel('% of C# key')
ax2.set_ylim(0, max(key_percentages) + 10)

for i, percentage in enumerate(key_percentages):
    ax2.text(i, percentage, f'{percentage:.2f}%', ha='center', va='bottom', fontsize=10)

for i, count in enumerate(key_counts):
    ax1.text(i, count, str(count), ha='center', va='bottom', fontsize=10)

ax1.legend(loc='upper left', framealpha=0.7)
ax2.legend(loc='upper right', framealpha=0.7)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : As we saw earlier, C# seems to be the key most associated with a high number of streams. Here, we can clearly see how its share increases as we climb up the Spotify top. Songs in C# tend, in general, to do better on Spotify.

In [None]:
columns_of_interest = ['Major', 'Minor']

counts_1 = {col: [] for col in columns_of_interest}
counts_0 = {col: [] for col in columns_of_interest}

for df in data:
    for col in columns_of_interest:
        # Compare directly so a mode missing from a small subset counts as 0 instead of raising a KeyError
        count_1 = (df['mode'] == col).sum()
        count_0 = len(df) - count_1
        counts_1[col].append(count_1)
        counts_0[col].append(count_0)

for col in columns_of_interest:
    plt.figure(figsize=(12, 6))
    
    plt.bar(labels, counts_1[col], color='#1d8954', alpha=0.7, label=f'Count of {col} (1.0)')
    plt.xlabel('Subset of Songs')
    plt.ylabel(f'Count of {col}')
    plt.title(f'Distribution of {col} in Different Subsets of Songs')
    plt.ylim(0, max(counts_1[col]) + 2)

    for i, count in enumerate(counts_1[col]):
        plt.text(i, count, str(count), ha='center', va='bottom', fontsize=10)
    
    plt.twinx()
    plt.plot(labels, [(count / len(df)) * 100 for count, df in zip(counts_1[col], data)], marker='o', color='lightcoral', linestyle='-', markersize=8, label=f'% of {col}')
    plt.ylabel(f'% of {col}')
    plt.ylim(0, 110)

    for i, count in enumerate(counts_1[col]):
        plt.text(i, (count / len(data[i])) * 100, f'{(count / len(data[i])) * 100:.2f}%', ha='center', va='bottom', fontsize=10)
    
    plt.legend(loc='upper left', framealpha=0.7)
    plt.tight_layout()

plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Another thing we noticed is that the Major mode is more represented than the Minor mode. With 70% of the top 10 songs in Major mode, and more than half of the whole top in Major mode too, this mode seems to have an advantage in terms of song popularity.
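
The key and mode shares discussed above follow the same pattern, so they can be computed with one helper. This is a minimal sketch: `category_share` is a hypothetical name and the toy data is invented. It returns 0 when a value is absent from a subset, which matters for very small subsets like the top 10.

```python
import pandas as pd

def category_share(subsets, column, value):
    """Percentage of rows in each subset where `column` equals `value`."""
    return [float(subset[column].eq(value).mean() * 100) for subset in subsets]

# Toy data for illustration only
toy_df = pd.DataFrame({
    'streams': [600, 500, 400, 300, 200, 100],
    'key': ['C#', 'G', 'C#', 'D', 'F', 'G'],
})
ranked = toy_df.sort_values('streams', ascending=False)
subsets = [ranked.head(2), ranked.head(4), ranked]

shares = category_share(subsets, 'key', 'C#')
print([round(share, 2) for share in shares])
```

With the notebook's variables, `category_share(data, 'mode', 'Major')` would give the Major percentages for the six subsets in one call.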

In [None]:
artists_min = []
artists_max = []
artists_avg = []
artists_std = []

for df in data:
    artists_min.append(df['artist_count'].min())
    artists_max.append(df['artist_count'].max())
    artists_avg.append(df['artist_count'].mean())
    artists_std.append(df['artist_count'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, artists_avg, marker='o', linestyle='-', color='skyblue', label='Average Artists')
plt.plot(labels, artists_max, marker='o', linestyle='-', color='lightcoral', label='Max Artists')
plt.plot(labels, artists_min, marker='o', linestyle='-', color='lightgreen', label='Min Artists')
plt.plot(labels, artists_std, marker='o', linestyle='-', color='gold', label='Artists Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Artists')
plt.title('Distribution of Artists per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(artists_avg, artists_max, artists_min, artists_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : We haven't highlighted the artist count much until now, but here is an interesting thing: the highest average number of artists is found in the top 10, with 1.6 artists per song. It suggests that a song with a featured artist will be, on average, more popular.
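
One way to check this idea more directly is to split the songs into solo tracks and collaborations and compare their streams. The sketch below uses invented toy values; only the `artist_count` and `streams` column names come from the notebook, and the `song_type` column is added here for illustration.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'artist_count': [1, 2, 1, 3, 1, 2],
    'streams': [100, 300, 200, 500, 150, 400],
})

# Label each song, then compare stream statistics between the two groups
toy_df['song_type'] = toy_df['artist_count'].gt(1).map({True: 'collaboration', False: 'solo'})
collab_stats = toy_df.groupby('song_type')['streams'].agg(['count', 'mean', 'median'])
print(collab_stats)
```

Looking at the median next to the mean helps here, because a handful of huge hits can pull the mean up on their own.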

In [None]:
bpm_min = []
bpm_max = []
bpm_avg = []
bpm_std = []

for df in data:
    bpm_min.append(df['bpm'].min())
    bpm_max.append(df['bpm'].max())
    bpm_avg.append(df['bpm'].mean())
    bpm_std.append(df['bpm'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, bpm_avg, marker='o', linestyle='-', color='skyblue', label='Average BPM')
plt.plot(labels, bpm_max, marker='o', linestyle='-', color='lightcoral', label='Max BPM')
plt.plot(labels, bpm_min, marker='o', linestyle='-', color='lightgreen', label='Min BPM')
plt.plot(labels, bpm_std, marker='o', linestyle='-', color='gold', label='BPM Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('BPM')
plt.title('Distribution of BPM per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(bpm_avg, bpm_max, bpm_min, bpm_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.legend(loc='upper right', framealpha=0.7)
plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The best BPM range overall is 90-120 BPM, as shown here and in all the distributions above. This range sits at or above a typical resting heart rate, but a tempo that is too fast tends to go with fewer streams overall.
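
To see how many songs actually fall in that range, we can measure the share of songs between two BPM bounds. This is a small sketch with made-up values; `share_in_range` is a hypothetical helper, not something defined earlier in the notebook.

```python
import pandas as pd

def share_in_range(series, low, high):
    """Percentage of values that fall within [low, high], bounds included."""
    return float(series.between(low, high).mean() * 100)

# Toy BPM values for illustration only
toy_bpm = pd.Series([85, 95, 100, 118, 120, 140, 170, 92])
share = share_in_range(toy_bpm, 90, 120)
print(f'{share:.1f}% of songs fall in the 90-120 BPM range')
```

Applied to each subset (for example `share_in_range(data_df_10['bpm'], 90, 120)`), it would show whether the 90-120 range becomes more common as we climb up the top.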

In [None]:
danceability_min = []
danceability_max = []
danceability_avg = []
danceability_std = []

for df in data:
    danceability_min.append(df['danceability_%'].min())
    danceability_max.append(df['danceability_%'].max())
    danceability_avg.append(df['danceability_%'].mean())
    danceability_std.append(df['danceability_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, danceability_avg, marker='o', linestyle='-', color='skyblue', label='Average Danceability')
plt.plot(labels, danceability_max, marker='o', linestyle='-', color='lightcoral', label='Max Danceability')
plt.plot(labels, danceability_min, marker='o', linestyle='-', color='lightgreen', label='Min Danceability')
plt.plot(labels, danceability_std, marker='o', linestyle='-', color='gold', label='Danceability Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Danceability')
plt.title('Distribution of Danceability per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(danceability_avg, danceability_max, danceability_min, danceability_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : On average, these songs have a danceability of 66.97%. But we can see that the top 10 songs are the most danceable of the top, averaging almost 70% danceability, with a minimum of 50%.

In [None]:
valence_min = []
valence_max = []
valence_avg = []
valence_std = []

for df in data:
    valence_min.append(df['valence_%'].min())
    valence_max.append(df['valence_%'].max())
    valence_avg.append(df['valence_%'].mean())
    valence_std.append(df['valence_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, valence_avg, marker='o', linestyle='-', color='skyblue', label='Average Valence')
plt.plot(labels, valence_max, marker='o', linestyle='-', color='lightcoral', label='Max Valence')
plt.plot(labels, valence_min, marker='o', linestyle='-', color='lightgreen', label='Min Valence')
plt.plot(labels, valence_std, marker='o', linestyle='-', color='gold', label='Valence Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Valence')
plt.title('Distribution of Valence per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(valence_avg, valence_max, valence_min, valence_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Higher valence means more streams ! The most popular songs are also the most positive ones.

In [None]:
energy_min = []
energy_max = []
energy_avg = []
energy_std = []

for df_da in data:
    energy_min.append(df_da['energy_%'].min())
    energy_max.append(df_da['energy_%'].max())
    energy_avg.append(df_da['energy_%'].mean())
    energy_std.append(df_da['energy_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, energy_avg, marker='o', linestyle='-', color='skyblue', label='Average Energy')
plt.plot(labels, energy_max, marker='o', linestyle='-', color='lightcoral', label='Max Energy')
plt.plot(labels, energy_min, marker='o', linestyle='-', color='lightgreen', label='Min Energy')
plt.plot(labels, energy_std, marker='o', linestyle='-', color='gold', label='Energy Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Energy')
plt.title('Distribution of Energy per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(energy_avg, energy_max, energy_min, energy_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : The top 10 is not made up of the most energetic songs on Spotify, but we can observe that it still has the highest minimum value of all the subsets. We can deduce that, while a highly energetic song is not a synonym of success, it is best to have a minimum of energy in the song.

In [None]:
acousticness_min = []
acousticness_max = []
acousticness_avg = []
acousticness_std = []

for df in data:
    acousticness_min.append(df['acousticness_%'].min())
    acousticness_max.append(df['acousticness_%'].max())
    acousticness_avg.append(df['acousticness_%'].mean())
    acousticness_std.append(df['acousticness_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, acousticness_avg, marker='o', linestyle='-', color='skyblue', label='Average Acousticness')
plt.plot(labels, acousticness_max, marker='o', linestyle='-', color='lightcoral', label='Max Acousticness')
plt.plot(labels, acousticness_min, marker='o', linestyle='-', color='lightgreen', label='Min Acousticness')
plt.plot(labels, acousticness_std, marker='o', linestyle='-', color='gold', label='Acousticness Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Acousticness')
plt.title('Distribution of Acousticness per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(acousticness_avg, acousticness_max, acousticness_min, acousticness_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : It seems that, on average, a song in the top 10 is ~5% more acoustic than songs in the other subsets.

In [None]:
instrumentalness_min = []
instrumentalness_max = []
instrumentalness_avg = []
instrumentalness_std = []

for df in data:
    instrumentalness_min.append(df['instrumentalness_%'].min())
    instrumentalness_max.append(df['instrumentalness_%'].max())
    instrumentalness_avg.append(df['instrumentalness_%'].mean())
    instrumentalness_std.append(df['instrumentalness_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, instrumentalness_avg, marker='o', linestyle='-', color='skyblue', label='Average Instrumentalness')
plt.plot(labels, instrumentalness_max, marker='o', linestyle='-', color='lightcoral', label='Max Instrumentalness')
plt.plot(labels, instrumentalness_min, marker='o', linestyle='-', color='lightgreen', label='Min Instrumentalness')
plt.plot(labels, instrumentalness_std, marker='o', linestyle='-', color='gold', label='Instrumentalness Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Instrumentalness')
plt.title('Distribution of Instrumentalness per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(instrumentalness_avg, instrumentalness_max, instrumentalness_min, instrumentalness_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : It seems that it is best to have no instrumentalness at all in the song to maximize the streams on Spotify.

In [None]:
liveness_min = []
liveness_max = []
liveness_avg = []
liveness_std = []

for df in data:
    liveness_min.append(df['liveness_%'].min())
    liveness_max.append(df['liveness_%'].max())
    liveness_avg.append(df['liveness_%'].mean())
    liveness_std.append(df['liveness_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, liveness_avg, marker='o', linestyle='-', color='skyblue', label='Average Liveness')
plt.plot(labels, liveness_max, marker='o', linestyle='-', color='lightcoral', label='Max Liveness')
plt.plot(labels, liveness_min, marker='o', linestyle='-', color='lightgreen', label='Min Liveness')
plt.plot(labels, liveness_std, marker='o', linestyle='-', color='gold', label='Liveness Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Liveness')
plt.title('Distribution of Liveness per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(liveness_avg, liveness_max, liveness_min, liveness_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : While having the lowest average of all subsets, the top 10 also has the highest minimum. We can conclude that having a minimum of 7% to 14% of liveness in the song could be a good idea.

In [None]:
speechiness_min = []
speechiness_max = []
speechiness_avg = []
speechiness_std = []

for df in data:
    speechiness_min.append(df['speechiness_%'].min())
    speechiness_max.append(df['speechiness_%'].max())
    speechiness_avg.append(df['speechiness_%'].mean())
    speechiness_std.append(df['speechiness_%'].std())

plt.figure(figsize=(12, 6))

plt.plot(labels, speechiness_avg, marker='o', linestyle='-', color='skyblue', label='Average Speechiness')
plt.plot(labels, speechiness_max, marker='o', linestyle='-', color='lightcoral', label='Max Speechiness')
plt.plot(labels, speechiness_min, marker='o', linestyle='-', color='lightgreen', label='Min Speechiness')
plt.plot(labels, speechiness_std, marker='o', linestyle='-', color='gold', label='Speechiness Std Deviation')

plt.xlabel('Subset of Songs')
plt.ylabel('Speechiness')
plt.title('Distribution of Speechiness per Subsets')
plt.legend()
plt.grid(True)

for i, (avg, max_val, min_val, std_val) in enumerate(zip(speechiness_avg, speechiness_max, speechiness_min, speechiness_std)):
    plt.text(i, avg, f'Avg: {avg:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, max_val, f'Max: {max_val:.2f}', ha='center', va='top', fontsize=10)
    plt.text(i, min_val, f'Min: {min_val:.2f}', ha='center', va='bottom', fontsize=10)
    plt.text(i, std_val, f'Std: {std_val:.2f}', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Same conclusion as for liveness here: having 3% to 9% of speechiness seems to improve your place in the tops.

# **📈 Music Evolution Over the Years**

This part will not help us understand what makes a song more popular than another, but I think it is interesting to look into the evolution of music over the years. It will not be as precise as it could be because we only have the most popular songs, and recent ones are far more represented than older ones in the dataset. Still, I think we can discover some interesting patterns and evolutions just by analyzing this dataset.
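
Because some years contain only a handful of songs, it can help to look at the number of songs behind each yearly average before reading too much into it. Here is a minimal sketch on toy data; the `min_songs` threshold is an arbitrary value chosen for illustration, not something tuned on the dataset.

```python
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({
    'released_year': [1985, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'bpm': [150, 100, 110, 120, 90, 100, 110, 120],
})

# Yearly mean together with the number of songs behind each mean
yearly = toy_df.groupby('released_year')['bpm'].agg(['mean', 'count'])

# Keep only the years with enough songs for the average to be meaningful
min_songs = 3  # arbitrary illustrative threshold
reliable = yearly[yearly['count'] >= min_songs]
print(reliable)
```

In the toy data, 1985 is dropped because its "average" comes from a single song, which is the situation of many older years in this dataset.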

In [None]:
average_bpm_by_year = data_df.groupby('released_year')['bpm'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_bpm_by_year.index, y=average_bpm_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average BPM')
plt.title('Trends in BPM Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : We can see here that the average BPM stays in the 100-130 BPM range: each year has its own tempo, while still staying in that range of rhythm. As I said, the average moves a lot more in the older years than in the most recent ones because there are far fewer songs from those years than from 2018-2023 in the top of Spotify.

In [None]:
average_danceability_by_year = data_df.groupby('released_year')['danceability_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_danceability_by_year.index, y=average_danceability_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Danceability (%)')
plt.title('Trends in Danceability Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : It seems that danceability of songs is slowly increasing over the years.
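
A simple way to quantify a "slow increase" like this one is to fit a straight line through the yearly averages and look at its slope. The sketch below uses invented yearly values and `np.polyfit` with degree 1; a linear fit is my own choice of tool here, and it only gives a rough order of magnitude.

```python
import numpy as np
import pandas as pd

# Toy yearly averages for illustration only
toy_yearly = pd.Series(
    [60.0, 62.0, 61.0, 65.0, 66.0],
    index=[2019, 2020, 2021, 2022, 2023],
)

# Shift the years so they start at 0, which keeps the fit numerically simple
years_since_start = (toy_yearly.index - toy_yearly.index.min()).to_numpy()

# Degree-1 polynomial fit: slope is the average change per year
slope, intercept = np.polyfit(years_since_start, toy_yearly.to_numpy(), deg=1)
print(f'Average change per year: {slope:.2f} points')
```

On the real data, passing `average_danceability_by_year` instead of `toy_yearly` would give the average yearly change in danceability, keeping in mind that older years rest on very few songs.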

In [None]:
average_valence_by_year = data_df.groupby('released_year')['valence_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_valence_by_year.index, y=average_valence_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Valence (%)')
plt.title('Trends in Valence Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Valence is, in comparison, much more balanced, staying on a straight line around 50% over the years.

In [None]:
average_energy_by_year = data_df.groupby('released_year')['energy_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_energy_by_year.index, y=average_energy_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Energy (%)')
plt.title('Trends in Energy Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Just like danceability, songs are slowly becoming more and more energetic over the years, averaging 70% today.

In [None]:
average_acousticness_by_year = data_df.groupby('released_year')['acousticness_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_acousticness_by_year.index, y=average_acousticness_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Acousticness (%)')
plt.title('Trends in Acousticness Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : In contrast, we can see that acousticness is slowly decreasing, with a small bump around 2010, but it seems to be going down again.

In [None]:
average_instrumentalness_by_year = data_df.groupby('released_year')['instrumentalness_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_instrumentalness_by_year.index, y=average_instrumentalness_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Instrumentalness (%)')
plt.title('Trends in Instrumentalness Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Instrumentalness appears mostly between 1980 and 1990 and, just like acousticness, around 2010. Other than that, no instrumentalness can be found in the top songs of Spotify.

In [None]:
average_liveness_by_year = data_df.groupby('released_year')['liveness_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_liveness_by_year.index, y=average_liveness_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Liveness (%)')
plt.title('Trends in Liveness Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : If we focus only on the most recent years, it seems that liveness is slowly going up, while starting from a really low percentage.

In [None]:
average_speechiness_by_year = data_df.groupby('released_year')['speechiness_%'].mean()

plt.figure(figsize=(10, 6))
sns.lineplot(x=average_speechiness_by_year.index, y=average_speechiness_by_year.values, color='#1d8954')
plt.xlabel('Year')
plt.ylabel('Average Speechiness (%)')
plt.title('Trends in Speechiness Over the Years')
plt.xticks(rotation=45)
plt.show()

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

**Analysis** : Speechiness seems to slowly increase too over the years.

<div style="border-radius:10px; border:#366632 solid; padding: 15px; background-color: #EBFBEA; font-size:100%; text-align:left">

# **🎓 Conclusion**

After exploring all those variables, it seems that the most streamed songs on Spotify have some things in common. If you want to create a song and want it to be the most popular possible, here are some guidelines you should follow to maximize your chances of success : 

* **Key** : As we've seen throughout the analysis, the C# key is the most popular and most successful one for listeners.
* **Mode** : Using the Major mode is preferable, as it is the most popular and most successful mode on Spotify.
* **BPM** : A rhythm above the heart rate, but not too fast, is recommended. The range should be between 90 BPM and 120 BPM.
* **Energy** : Your song needs to have a minimum of energy to be among the most popular ones. A minimum of 40% is recommended.
* **Danceability** : The more danceable it is, the better it will perform.
* **Valence** : Positivity is the key, be positive in the song and it will have a higher chance of success.
* **Acousticness** : Lower the acousticness of the song, as highly acoustic songs do not perform that well.
* **Instrumentalness** : Do not put any instrumentalness at all if possible.
* **Liveness** : A low percentage of liveness tends to perform better so don't put too much of it. 7% to 14% should be enough.
* **Speechiness** : Just like liveness, a low percentage of speechiness is acceptable with 3% to 9% being a good range.
* **Playlists Presence** : The most important thing is the presence of the song in playlists across all the streaming platforms, because it is the factor most strongly tied to the number of streams in this analysis.
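
The numeric guidelines above can be turned into a small checklist. This is only a sketch: `check_guidelines` and `GUIDELINE_RANGES` are hypothetical names, the ranges are the ones quoted in the list, and the example song is invented.

```python
# Numeric ranges quoted in the guidelines above; None means no bound was stated.
GUIDELINE_RANGES = {
    'bpm': (90, 120),
    'energy_%': (40, None),
    'liveness_%': (7, 14),
    'speechiness_%': (3, 9),
}

def check_guidelines(song):
    """Return {feature: True/False} for each guideline range the song has a value for."""
    results = {}
    for feature, (low, high) in GUIDELINE_RANGES.items():
        if feature not in song:
            continue  # skip features the song does not provide
        value = song[feature]
        above_low = low is None or value >= low
        below_high = high is None or value <= high
        results[feature] = above_low and below_high
    return results

# Hypothetical song used only to demonstrate the helper
example_song = {'bpm': 128, 'energy_%': 72, 'liveness_%': 10, 'speechiness_%': 4}
report = check_guidelines(example_song)
print(report)
```

These ranges describe what the most streamed songs have in common in this dataset; ticking every box is of course no guarantee of success.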

# **📝 Note of the Author**

Firstly, I would like to express my sincerest gratitude to all of you who took the time to read this notebook. I am a French Data Science engineer and I am still learning lots of things in this field.

I am always looking to improve, and I would love to hear your thoughts on how I can make this notebook and/or analysis even better. So please, feel free to reach out to me with any comments or suggestions.

If you found this notebook helpful or interesting, please consider upvoting it. Your support means the world to me, and it will encourage me to continue sharing my work with the community.

Thank you once again for your time and for being a part of my learning journey. **--Lucas**
