Analysis by Larissa Takou-Ayaoh

10/22/2023

## 2023 Spotify Songs Exploratory Data Analysis
### What were the most popular songs, artists and genres in 2023?

![Spotify](https://play-lh.googleusercontent.com/cShys-AmJ93dB0SV8kE6Fl5eSaf4-qMMZdwEDKI5VEmKAXfzOqbiaeAsqqrEBCTdIEs)

**Table of Contents**
* 0. Executive Summary (Key Findings)
* 1. Introduction
* 2. Data Exploration and Cleaning
* 3. Data Analysis and Visualization
* 5. Conclusion

## 0. Executive Summary
### Key Findings

*********
- The most popular artist was The Weeknd, who also performs the most popular song, 'Blinding Lights'. 'Blinding Lights' was released in 2019 but still maintained a large number of streams.
- Among the top 10 most streamed songs on Spotify, chart rank did not coincide with the number of streams. A high stream count may instead indicate that a song has been on the charts for a longer period.
- There was no strong correlation between the numerical factors and the number of streams; however, there were moderate correlations among some of the factors themselves.
- The majority of songs in the dataset had danceability, valence and energy around 60%.
- The majority of songs have a tempo of approximately 120 bpm.
- Songs in the key of C# had the largest average number of streams.

### Objective and Scope
* Collect, clean and analyze the Spotify dataset from Kaggle: [2023_Spotify_Songs](https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023)
* Determine variables that contribute to song popularity
* Visualize data to show relationships between variables
* Share findings and insights to help make data-driven business decisions

## 1. Introduction

In this project I explore a 2023 Spotify Songs data set to gain insights into which variables have the most significant impact on the popularity of a track.
Below is a list of the data set columns and their descriptions:

|Column Name | Description|
| -------------|------------|
|track_name| Name of the song|
|artist(s)_name| Name of the artist(s)|
|artist_count| Number of artists featured on the song|
|released_year| Year the song was released|
|released_month| Month the song was released|
|released_day| Day the song was released|
|in_spotify_playlists| Number of Spotify playlists the song is a part of|
|in_spotify_charts| Presence and rank of the song on Spotify charts|
|streams| Total number of streams on Spotify|
|in_apple_playlists| Number of Apple Music playlists the song is a part of|
|in_apple_charts| Presence and rank of the song on Apple Music charts|
|in_deezer_playlists| Number of Deezer playlists the song is a part of|
|in_deezer_charts| Presence and rank of the song on Deezer charts|
|in_shazam_charts| Rank of the song on Shazam charts|
|bpm| Beats per minute, a measure of song tempo|
|key| Key of the song|
|mode| Mode of the song (major or minor)|
|danceability_%| Percentage indicating how suitable the song is for dancing|
|valence_%| Positivity of the song's musical content|
|energy_%| Perceived energy level of the song|
|acousticness_%| Amount of acoustic sound in the song|
|instrumentalness_%| Amount of instrumental content in the song|
|liveness_%| Presence of live performance elements|
|speechiness_%| Amount of spoken words in the song|


Some pertinent questions to answer are the following:

* What is the most popular song of 2023?
* Who is the most popular artist of 2023?
* What is the most popular genre of 2023?
* What are the most significant factors contributing to a song's popularity in 2023?
* How many albums released in 2023 were in the top 20?

This analysis will be performed entirely with Python and its libraries. Additional interactive visualizations will be created in Tableau.

## 2. Data Exploration and Cleaning

In [None]:
#import pertinent libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
spotify = pd.read_csv("/kaggle/input/top-spotify-songs-2023/spotify-2023.csv", encoding = 'latin')

spotify.head()

In [None]:
spotify.tail()

In [None]:
spotify.info()

In [None]:
# 'in_shazam_charts' and 'key' contain null values
#return number of null values in each column
spotify.isnull().sum() 


In [None]:
#check if there are duplicate rows
spotify.duplicated().sum()

In [None]:
#combine 'year', 'month' and 'day' into one datetime column

spotify["release_date"] = spotify["released_year"].astype(str)+"-" + spotify["released_month"].astype(str)+"-" +spotify["released_day"].astype(str)

#assign the converted values back, otherwise the column stays a string
spotify["release_date"] = pd.to_datetime(spotify["release_date"])
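
As an alternative sketch, `pd.to_datetime` can also assemble a date directly from year, month and day columns, which avoids building an intermediate string. The small frame below is made-up illustrative data, not rows from the dataset, and `errors="coerce"` is an optional choice that turns impossible dates into `NaT` instead of raising.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
demo = pd.DataFrame({
    "released_year": [2019, 2023, 2022],
    "released_month": [11, 2, 2],
    "released_day": [29, 14, 30],   # 30 February does not exist
})

# to_datetime accepts a frame whose columns are named year, month and day
parts = demo.rename(columns={
    "released_year": "year",
    "released_month": "month",
    "released_day": "day",
})
demo["release_date"] = pd.to_datetime(parts, errors="coerce")

print(demo["release_date"])
```

Any row that comes back as `NaT` points to a date worth checking by hand.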

In [None]:
#streams is object type but we expect it to be numeric
spotify["streams"].sort_values(ascending= False)

#record 574 seems to contain a typo
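
Rather than spotting the bad record by eye in the sorted output, one possible approach is to let pandas flag every value that cannot be parsed as a number. This is a minimal sketch on made-up values (the series `demo_streams` and its contents are illustrative, not taken from the dataset).

```python
import pandas as pd

# Made-up values for illustration; the second entry mimics a corrupted record
demo_streams = pd.Series(["141381703", "not-a-number", "800840817"], name="streams")

# errors="coerce" converts anything unparseable to NaN instead of raising
parsed = pd.to_numeric(demo_streams, errors="coerce")

# Rows that failed to parse are the ones to inspect by hand
bad_rows = demo_streams[parsed.isna()]
print(bad_rows)
```

Applied to the real column, the index of `bad_rows` would show which records need a manual correction.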

In [None]:
#search song on the internet to find number of streams and assign it to row 574, column 'streams'
spotify.loc[574, 'streams'] = '210581761'
#convert 'streams' column to 64-bit integers (the largest stream counts exceed the 32-bit integer range)
spotify['streams'] =  spotify['streams'].astype('int64')

spotify["streams"].sort_values(ascending = False)

## 3. Data Analysis and Visualization

### What was the Most Popular Song of 2023?

In [None]:
spotify['artist(s)_name'].unique().size

#There are 953 song entries and 645 unique 'artist(s)_name' values, which shows that some artists had multiple popular records in 2023

In [None]:
spotify_df = spotify.copy()
top_10_songs = spotify_df.sort_values(by = "streams", ascending = False).iloc[:10,:]

top_10_songs

In [None]:
t_a = sns.barplot(data = top_10_songs, y ="track_name", x= "streams")
plt.title("2023 Top 10 Songs Streamed on Spotify")
plt.xlabel("Number of Streams (Billions)")
plt.ylabel("Track Name")
#mark the most streamed song with a red star (first row, since top_10_songs is sorted by streams)
top_song = top_10_songs.iloc[0]
t_a.plot(top_song['streams'], top_song['track_name'], marker ="*", color = 'red')


* The most streamed song was 'Blinding Lights', with 3.7 billion streams. The bar plot also shows that only the top two songs had more than 3 billion streams.

### Who was the most popular Artist of 2023?

In [None]:
popular_artists = spotify_df.groupby('artist(s)_name')['streams'].sum()

#nlargest already returns the values sorted in descending order
top_10_artists = popular_artists.nlargest(10)

top_10_artists

In [None]:
print(top_10_artists.idxmax())
top_10_artists.max()
popular_artists.max()

In [None]:
top_a = top_10_artists.plot.barh(color ='skyblue')
plt.title("Most Streamed Artists of 2023")
plt.xlabel("Total Number of Streams (Billions)")
plt.ylabel('Artist(s) Name')
#tick every 2.5 billion streams, with labels derived from the tick positions
x_label_range = np.arange(0, popular_artists.max(), 2500000000)
plt.xticks(x_label_range, labels = [f"{x / 1e9:g}" for x in x_label_range])
top_a.invert_yaxis()

* The Weeknd holds first place as the most streamed artist, and also has the most streamed song, 'Blinding Lights'.
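
One caveat with this grouping is that each distinct value of `artist(s)_name` is treated as its own artist, so a collaboration is counted separately from the solo artist. A possible refinement, sketched below on made-up rows and assuming collaborators are separated by a comma and a space, credits every listed artist with the song's full stream count. The artist names and the `artist` column are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "artist(s)_name": ["Artist A", "Artist A, Artist B", "Artist B"],
    "streams": [100, 50, 30],
})

# Split the combined name into a list, then give each artist their own row
per_artist = (
    demo.assign(artist=demo["artist(s)_name"].str.split(", "))
        .explode("artist")
        .groupby("artist")["streams"]
        .sum()
        .sort_values(ascending=False)
)
print(per_artist)
```

Whether a collaboration should count fully toward each artist is a judgment call; splitting the streams evenly would be an equally defensible choice.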

### How did the Top 10 Artists and Songs perform on Different platforms?

In [None]:
#retrieve top 10 songs chart rank on each platform
platforms = ["spotify", "apple", "deezer", "shazam"]
spotify_charts, apple_charts, deezer_charts, shazam_charts = [top_10_songs[['artist(s)_name', f'in_{p}_charts', 'streams']].sort_values(f'in_{p}_charts') for p in platforms]


In [None]:
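#remove chart ranks of 0 (chart scale starts from 1 (best))
#assign the result back instead of using inplace=True, which does not reliably modify the DataFrame column

spotify_charts['in_spotify_charts'] = spotify_charts['in_spotify_charts'].replace(0, np.nan)
apple_charts['in_apple_charts'] = apple_charts['in_apple_charts'].replace(0, np.nan)
deezer_charts['in_deezer_charts'] = deezer_charts['in_deezer_charts'].replace(0, np.nan)

#'in_shazam_charts' contains null values and may be stored as text, so strip any thousands separators and convert to numbers first
shazam_ranks = shazam_charts['in_shazam_charts'].astype(str).str.replace(',', '', regex = False)
shazam_charts['in_shazam_charts'] = pd.to_numeric(shazam_ranks, errors = 'coerce').replace(0, np.nan)

shazam_charts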

In [None]:
fig, axes = plt.subplots(1, 4, figsize = (20,10), layout = 'constrained')

#one panel per platform: (title, data, rank column, bar color)
panels = [
    ('Spotify', spotify_charts, 'in_spotify_charts', 'purple'),
    ('Apple Music', apple_charts, 'in_apple_charts', 'lightblue'),
    ('Deezer', deezer_charts, 'in_deezer_charts', 'C0'),
    ('Shazam', shazam_charts, 'in_shazam_charts', 'pink'),
]

for ax, (title, charts, rank_col, color) in zip(axes, panels):
    #bar chart of streams per artist
    ax.bar(charts['artist(s)_name'], charts['streams'], color = color)
    ax.set_title(title, loc='left', fontstyle='oblique', fontsize='medium')
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    ax.set_ylabel('Streams')

    #chart rank on a secondary axis, inverted so that rank 1 is at the top
    rank_ax = ax.twinx()
    rank_ax.plot(charts['artist(s)_name'], charts[rank_col], marker = 'o', label = 'Rank', color = 'black')
    rank_ax.invert_yaxis()
    rank_ax.legend()

fig.suptitle("Artist Streams and Ranks on Different Platforms")



**Note that some songs have missing ranks; our assumption is that those songs were not present on the charts. The plots show that a song's chart rank does not track its number of streams within a year, and vice versa.**
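
The relationship between rank and streams can also be quantified rather than judged from the plots. The sketch below computes Spearman's rank correlation by hand (as the Pearson correlation of the ranked values) on made-up numbers; the column name `chart_rank` is hypothetical, and songs with a missing rank are dropped under the same assumption that they were not on the chart.

```python
import numpy as np
import pandas as pd

# Made-up ranks and stream counts; NaN marks a song absent from the chart
demo = pd.DataFrame({
    "chart_rank": [1, 2, 3, np.nan, 5],
    "streams": [500, 400, 300, 900, 100],
})

# Keep only songs that actually appear on the chart
charted = demo.dropna(subset=["chart_rank"])

# Spearman's rho is the Pearson correlation of the ranked values
rho = charted["chart_rank"].rank().corr(charted["streams"].rank())
print(round(rho, 3))
```

A value near -1 would mean that better (lower) ranks go with more streams, while a value near 0 would support the observation that rank and streams are unrelated.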

### What Were the Most Significant Factors that Impacted a Song's Popularity in 2023?

In [None]:
numerical_factors = ["streams","bpm","danceability_%","valence_%","energy_%","acousticness_%","instrumentalness_%","liveness_%","speechiness_%"]
spotify_df_factors = spotify_df.sort_values(by='streams', ascending = False)[numerical_factors]

spotify_df_factors.iloc[:10, :]

In [None]:
#numerical factors distribution plots
numerical_factors = numerical_factors[1:]
fig, axesf = plt.subplots(2,4,figsize= (18,10))

#one histogram per factor, filling the 2x4 grid row by row
for ax, factor in zip(axesf.flat, numerical_factors):
    sns.histplot(spotify_df[factor], kde = True, ax = ax, bins = 20)

### Top 10 Songs, Factors Distribution

In [None]:
#take the 10 most streamed songs (spotify_df itself is not sorted by streams)
spotify_df_10 = spotify_df.sort_values(by = 'streams', ascending = False).iloc[:10,:]
fig, axesf10 = plt.subplots(2,4,figsize= (18,10), sharey = True)
for ax, factor in zip(axesf10.flat, numerical_factors):
    sns.histplot(spotify_df_10[factor], kde = True, ax = ax)

In [None]:
#correlation between number of streams and each factor
spotify_df_factors.corr()["streams"]

The correlation coefficients between the number of streams and the other columns are not as strong as expected. The largest in magnitude is -0.111, with 'speechiness', so all of these correlations are weak.
A multiple linear regression might still yield a model whose predictors, taken together, can effectively predict the success of a song (number of streams).
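
As a rough illustration of what such a regression involves, the sketch below fits an ordinary least-squares model with NumPy on synthetic data. The features, coefficients and noise level are all invented for the example; this is not a model of the Spotify data.

```python
import numpy as np

# Synthetic data for illustration only: 200 "songs", 3 standardized features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coefs = np.array([2.0, 0.0, -1.0])
y = 5.0 + X @ true_coefs + rng.normal(scale=0.1, size=200)

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Coefficient of determination (R^2) of the fitted model
residuals = y - X_design @ coefs
r_squared = 1 - residuals.var() / y.var()
print(coefs.round(2), round(r_squared, 3))
```

On the real data, the weak pairwise correlations suggest that R^2 would be far lower than in this synthetic case, so the fit quality should be checked before drawing conclusions.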

In [None]:
numerical_factors_corr = spotify_df_factors.corr()
numerical_factors_corr

In [None]:
sns.heatmap(numerical_factors_corr, cmap = 'magma', linewidths = 0.5)
plt.grid(visible = False)
plt.title('Numerical Factors Correlations')

### Numerical Factors Relationship Scatter Plots

In [None]:

#pairwise scatter plots of the first four numerical factors ('streams' was dropped from numerical_factors above)
sns.pairplot(spotify_df[numerical_factors[:4]])

In [None]:
sns.pairplot(spotify_df[numerical_factors[4:]])

### How Are Streams Distributed per key?

In [None]:
keys_factors = ["streams","bpm","danceability_%","valence_%","energy_%","acousticness_%","instrumentalness_%","liveness_%","speechiness_%", "key"]
spotify_df_keys = spotify_df[keys_factors]

#avg value of numerical factors per key
group_spotify_df_keys = spotify_df_keys.groupby('key').mean().sort_values(by = 'streams', ascending = False)
group_spotify_df_keys

In [None]:
sns.catplot(data= spotify_df_keys, x= 'key', y = 'streams', kind = 'box')
plt.title('Streams Distribution per Key')

* Key 'C#' has the largest average number of streams; however, the box plot shows that outliers contribute to this larger average, so it does not represent the entire population of songs in that key.
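
The effect of outliers on the average can be checked by comparing the mean with the median, which is robust to extreme values. This is a minimal sketch on made-up stream counts, not values from the dataset.

```python
import pandas as pd

# Made-up stream counts; key "C#" contains one extreme outlier
demo = pd.DataFrame({
    "key": ["C#", "C#", "C#", "G", "G", "G"],
    "streams": [100, 120, 3000, 400, 420, 440],
})

# The mean is pulled up by the outlier, the median is not
summary = demo.groupby("key")["streams"].agg(["mean", "median"])
print(summary)
```

In this toy example "C#" has the higher mean but the lower median, which is the pattern to look for when deciding whether a key's average is driven by a few hit songs.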

In [None]:
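#total streams per key (the per-key mean cannot be used for a share of total; songs with a missing key are excluded by groupby)
streams_per_key = spotify_df_keys.groupby('key')['streams'].sum().sort_values(ascending = False)

#label each wedge with the key and its share of the total streams
shares = streams_per_key / streams_per_key.sum()
labels = [f"{key}-{share:.2%}" for key, share in shares.items()]

plt.pie(streams_per_key, labels = labels)
plt.legend(loc ='best',bbox_to_anchor=(1.05, 1))
plt.title('Percentage of Total Streams per Key')

print(labels)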

### Investigating Factors with Strongest Correlation Coefficients
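
One way to pick out the factor pairs with the strongest correlations, rather than reading them off the heatmap, is to rank the off-diagonal entries of the correlation matrix. The sketch below uses synthetic columns `a`, `b` and `c` (illustrative names, not dataset columns).

```python
import numpy as np
import pandas as pd

# Synthetic features for illustration: "b" is built to correlate negatively with "a"
rng = np.random.default_rng(1)
a = rng.normal(size=300)
demo = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.5, size=300),
    "c": rng.normal(size=300),
})

corr = demo.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order pairs by absolute correlation, strongest first
strongest = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(strongest)
```

Sorting by absolute value keeps strong negative relationships at the top alongside strong positive ones.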

In [None]:
sns.relplot(data = spotify_df, x = 'energy_%', y = 'acousticness_%', hue = 'key')

In [None]:
sns.relplot(data = spotify_df, x = 'danceability_%', y = 'valence_%', hue = 'key')

In [None]:
#ask for a column name, then plot that column against streams
column = input("Enter column label to graph against streams for key categories:")

sns.relplot(data = spotify_df, x = column, y = 'streams', hue = 'key')

## 5. Conclusion

This exploratory data analysis provided various insights into the data and allowed us to investigate factors that may contribute to a song's popularity. Understanding what makes a song 'popular' is essential from a marketing or song creation standpoint if the goal is to garner a large number of streams, and thus listeners and potential album purchasers.
It is important to note, however, that various other factors could play a part in a song's popularity, such as genre, country or geographical location, or season (winter, spring, summer or fall).
Including more predictors would require a larger dataset, which could make it possible to build prediction or classification models for tasks such as recognizing a song from its features or classifying a song into the appropriate genre based on musical key or mode. There is a vast array of possibilities for how this data could be used to make data-driven business decisions for artists and their teams, or simply to streamline the organization of new songs in a user's library.