# What Makes A Song Become A Hit?

### Spotify Content Analysis - Top 50 songs
_____________________________________________________________________________________________________________________________________________

## 0.1. Introduction

Understanding what makes a song a hit is crucial for platforms like Spotify, where content performance directly impacts user engagement and business decisions. 

**This project aims to analyze Spotify’s Top 50 Tracks of 2020 dataset to uncover patterns and insights that define a successful song.** By leveraging data analysis techniques in Python with Pandas, we will explore key factors that contribute to a track’s popularity, including artist influence, genre distribution, and audio features such as danceability, loudness, and acousticness.

The analysis will begin with data cleaning: handling missing values, eliminating duplicates, and treating potential outliers. Then, through exploratory data analysis (EDA), we will assess the dataset’s structure, identifying the number of observations and features and separating categorical from numerical variables. Key business questions will be addressed, such as which artists and albums dominate the charts, what characteristics distinguish high-ranking tracks, and how different genres compare in terms of danceability, loudness, and acousticness. Additionally, correlation analysis will reveal which audio features are most strongly related to hit status, providing valuable insights for Spotify’s content curation strategies.

The outcomes of this study will help refine music recommendation algorithms, optimize playlist curation, and guide marketing and promotional efforts by identifying trends in consumer music preferences. Further improvements could involve expanding the dataset to multiple years, incorporating listener engagement metrics, and applying machine learning techniques to predict future hits.

## 0.2. Data preparation

This analysis uses the Spotify Top 50 Tracks of 2020 dataset, obtained from Kaggle. It contains structured information about the most popular songs on Spotify during that year, allowing for an in-depth analysis of the key attributes that contribute to a track’s success.

*Let's see what the table looks like:*

In [91]:
import pandas as pd

# Load the dataset from GitHub, using the first CSV column as the row index
df = pd.read_csv('https://raw.githubusercontent.com/leonardovaloppi/Spotify-Top-50-Songs/refs/heads/main/spotifytoptracks.csv', index_col=0)
df.head()

Unnamed: 0,artist,album,track_name,track_id,energy,danceability,key,loudness,acousticness,speechiness,instrumentalness,liveness,valence,tempo,duration_ms,genre
0,The Weeknd,After Hours,Blinding Lights,0VjIjW4GlUZAMYd2vXMi3b,0.73,0.514,1,-5.934,0.00146,0.0598,9.5e-05,0.0897,0.334,171.005,200040,R&B/Soul
1,Tones And I,Dance Monkey,Dance Monkey,1rgnBhdG2JDFTbYkYRZAku,0.593,0.825,6,-6.401,0.688,0.0988,0.000161,0.17,0.54,98.078,209755,Alternative/Indie
2,Roddy Ricch,Please Excuse Me For Being Antisocial,The Box,0nbXyq5TXYPCO7pr3N8S4I,0.586,0.896,10,-6.687,0.104,0.0559,0.0,0.79,0.642,116.971,196653,Hip-Hop/Rap
3,SAINt JHN,Roses (Imanbek Remix),Roses - Imanbek Remix,2Wo6QQD1KMDWeFkkjLqwx5,0.721,0.785,8,-5.457,0.0149,0.0506,0.00432,0.285,0.894,121.962,176219,Dance/Electronic
4,Dua Lipa,Future Nostalgia,Don't Start Now,3PfIrDoz19wz7qK7tYeu62,0.793,0.793,11,-4.521,0.0123,0.083,0.0,0.0951,0.679,123.95,183290,Nu-disco


The dataset is organized into **16 columns**, each capturing different aspects of a song’s metadata and audio features:

-	**Metadata-related columns:** These include ***artist***, ***album***, ***track_name***, and ***track_id***, which provide basic identification details about each song and its creator.

-	**Genre classification:** The ***genre*** column categorizes tracks into different musical styles.

-	**Musical and audio features:** These are numerical attributes that describe different characteristics of a song’s composition and production:
  - ***energy:*** Measures the intensity and activity level of a song.
  - ***danceability:*** Quantifies how suitable a track is for dancing based on tempo, rhythm stability, and beat strength.
  - ***key:*** Represents the overall musical key of the track.
  - ***loudness:*** Indicates the overall volume of the track in decibels (dB).
  - ***acousticness:*** Predicts the likelihood of a song being acoustic.
  - ***speechiness:*** Evaluates the presence of spoken words in the track.
  - ***instrumentalness:*** Measures the extent to which a track is purely instrumental.
  - ***liveness:*** Detects whether the song was recorded in a live setting.
  - ***valence:*** Describes the musical positivity of a track.
  - ***tempo:*** Measures the speed of the song in beats per minute (BPM).
  - ***duration_ms:*** Represents the song’s length in milliseconds.

To ensure the reliability of the analysis, the dataset underwent pre-processing steps, including:

  - **Handling missing values:** Checking for and addressing any missing data to prevent biases in the analysis.
    
  - **Removing duplicates:** Ensuring each track appears only once to maintain data integrity.
    
  - **Outlier treatment:** Identifying and managing extreme values that could distort insights, especially in features like loudness or tempo.

This structured and well-defined dataset serves as the foundation for exploring the characteristics that define a hit song, uncovering trends in music popularity, and drawing actionable insights for content strategy and artist recommendations.
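The three pre-processing checks listed above can be sketched roughly as follows. This is a minimal illustration, not the exact cleaning code used in this project: the `toy` DataFrame and its values are made up, the helper `iqr_outlier_mask` is a hypothetical name, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Spotify data (values are made up).
toy = pd.DataFrame({
    "track_id": ["a", "b", "b", "c", "d", "e"],
    "loudness": [-5.9, -6.4, -6.4, -6.7, None, -20.0],
    "tempo": [171.0, 98.1, 98.1, 117.0, 122.0, 124.0],
})

# 1. Missing values: count the nulls in each column.
missing_per_column = toy.isna().sum()

# 2. Duplicates: keep only the first occurrence of each track_id.
deduplicated = toy.drop_duplicates(subset="track_id", keep="first")


# 3. Outliers: flag values lying more than k * IQR outside the quartiles.
def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value is an IQR outlier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


loudness_outliers = iqr_outlier_mask(deduplicated["loudness"])
print(missing_per_column)
print(deduplicated[loudness_outliers])
```

Flagging outliers with a mask, rather than dropping them immediately, leaves room to inspect each flagged track before deciding how to treat it. With only 50 songs, removing rows can noticeably change the results.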

## 1. Data Exploration

Before diving into the core analysis, it is essential to explore and understand the dataset through an Exploratory Data Analysis (EDA) phase. This step provides valuable insights into the dataset’s structure, identifies potential data quality issues, and helps uncover patterns or trends that might influence the final results.

The EDA process begins by examining the overall composition of the dataset, including the total number of observations (songs) and features. A key focus is on distinguishing categorical variables (such as artist and genre) from numerical attributes (such as danceability, loudness, and tempo) to determine the appropriate analysis techniques.
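One possible way to separate numerical from categorical columns is `DataFrame.select_dtypes`. The sketch below uses a small made-up frame (`toy`) rather than the real dataset, and the column values are purely illustrative.

```python
import pandas as pd

# Hypothetical toy frame with a mix of text and numeric columns.
toy = pd.DataFrame({
    "artist": ["A", "B", "A"],
    "genre": ["Pop", "Hip-Hop/Rap", "Pop"],
    "danceability": [0.51, 0.83, 0.90],
    "key": [1, 6, 10],
})

# Columns with a numeric dtype (integers and floats).
numerical_columns = toy.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical here.
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_columns)
print("Categorical:", categorical_columns)
```

A dtype-based split is only a first pass. For example, `key` is stored as an integer but encodes a pitch class, so it may be more appropriate to treat it as categorical in later analysis.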

Additionally, summary statistics are used to assess the distribution of key numerical features, helping to detect potential outliers or inconsistencies. Visualizations such as histograms, box plots, and correlation matrices further assist in understanding the relationships between different attributes, revealing patterns that contribute to a song’s popularity.
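As a rough illustration of the summary statistics and correlation matrix mentioned above, the following sketch applies `describe()` and `corr()` to a small made-up frame. The `toy` values are assumptions chosen for readability and are not taken from the Spotify data.

```python
import pandas as pd

# Hypothetical toy frame of audio features (values are made up).
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-12.0, -9.0, -6.0, -3.0],
    "acousticness": [0.9, 0.6, 0.5, 0.1],
})

# Count, mean, standard deviation, min, quartiles, and max per column.
summary = toy.describe()

# Pairwise Pearson correlation between the numeric columns.
correlations = toy.corr(numeric_only=True)

print(summary)
print(correlations.round(2))
```

Pearson correlation only captures linear relationships, and with 50 observations individual coefficients should be read with caution.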

By conducting this initial exploration, we establish a solid foundation for deeper analysis, ensuring that the data is well-structured, meaningful, and ready for more advanced insights.

### 1.1. Number of observations (rows)

In [90]:
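# Count the non-null values in each column
a = df.count()

# Check whether all values in each column share a single Python type
b = df.map(type).nunique() == 1

# Record the type name of each column's value in the first row (index label 0)
c = df.loc[0].map(lambda x: type(x).__name__)

# Combine the three results into one summary table
pd.concat([a, b, c], axis=1, keys=["Non-Nulls", "Single DType?", "DType"])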

Unnamed: 0,Non-Nulls,Single DType?,DType
artist,50,True,str
album,50,True,str
track_name,50,True,str
track_id,50,True,str
energy,50,True,float64
danceability,50,True,float64
key,50,True,int64
loudness,50,True,float64
acousticness,50,True,float64
speechiness,50,True,float64


In [34]:
df[["track_id", "track_name", "artist", "album"]].nunique()

track_id      50
track_name    50
artist        40
album         45
dtype: int64

45