### Overall Observations

- **Genre Differentiation:** Genres exhibit distinct patterns in danceability, loudness, and acousticness, reflecting the character of each genre.
- **Feature Relationships:** Some feature pairs are strongly correlated, most notably energy and loudness (0.76), indicating that they tend to vary together.
- **Popularity Indicators:** The counts of unique genres, artists, and albums, together with the tracks at the extremes of each attribute, give insight into what resonates with listeners in the Top 50.

**Conclusions**

1. **Genre Characteristics**
    - Hip-Hop/Rap stands out with the highest average danceability, confirming its strong association with rhythmic and dance-oriented music.
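    - Dance/Electronic has the highest average loudness (-5.34), closely followed by Alternative/Indie (-5.42), while Pop (-6.46) and Hip-Hop/Rap (-6.92) are quieter on average. The result for Dance/Electronic aligns with the energetic and often intense nature of the genre.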
    - Alternative/Indie has the highest average acousticness, suggesting a preference for organic and less processed sounds in this genre.
2. **Feature Correlations**
    - The strong positive correlation between 'energy' and 'loudness' suggests that songs perceived as energetic also tend to be loud. This could be attributed to the use of production techniques that simultaneously increase both energy and loudness.
3. **Popularity and Diversity**
    - The presence of 16 unique genres in the Top 50 highlights the diversity of popular music and the wide range of listener preferences.
    - Pop is the most popular genre, which reflects its widespread appeal and its ability to consistently produce hit songs.
    - There are 40 unique artists in the Top 50, and 7 of them have more than one song, together accounting for 17 of the 50 tracks. So although the list is diverse, listeners tend to favor a small group of artists.
    - The identification of albums and artists with multiple popular tracks indicates consistent success and the ability to create music that resonates with a large audience.
4. **Track Attributes**
    - The wide range of 'loudness' and 'danceability' values in the Top 50 demonstrates the diverse sonic landscape of popular music and the varied preferences of listeners.
    - The identification of longest and shortest tracks, as well as those with extreme 'loudness' and 'danceability' values, provides specific examples of how these attributes can vary in popular music.

**Recommendations**

- Further explore the factors that contribute to the success of albums and artists with multiple popular tracks.
- Investigate the potential impact of 'loudness' and 'danceability' on song popularity, considering both extreme and moderate values.
- Analyze the evolution of genre characteristics and feature correlations over time to identify emerging trends and patterns.

In [22]:

# Genres of interest
genres_of_interest = ['Pop', 'Hip-Hop/Rap', 'Dance/Electronic', 'Alternative/Indie']

# Filter the dataset for the genres of interest
df_filtered = df[df['genre'].isin(genres_of_interest)]

# Calculate the mean scores for danceability, loudness, and acousticness for each genre
genre_grouped = df_filtered.groupby('genre')[['danceability', 'loudness', 'acousticness']].mean()

# Display the results
print("Comparison of Danceability, Loudness, and Acousticness Scores for Different Genres:")
print(genre_grouped)


Comparison of Danceability, Loudness, and Acousticness Scores for Different Genres:
                   danceability  loudness  acousticness
genre                                                  
Alternative/Indie      0.661750 -5.421000      0.583500
Dance/Electronic       0.755000 -5.338000      0.099440
Hip-Hop/Rap            0.765538 -6.917846      0.188741
Pop                    0.677571 -6.460357      0.323843


### Comparison of Danceability, Loudness, and Acousticness Across Genres

In this analysis, we focused on comparing the **danceability**, **loudness**, and **acousticness** scores across four specific genres: **Pop**, **Hip-Hop/Rap**, **Dance/Electronic**, and **Alternative/Indie**. The dataset was filtered to include only these genres of interest.

The **mean scores** for **danceability**, **loudness**, and **acousticness** were calculated for each genre, providing insights into how these audio characteristics vary across the selected genres.

The results of this comparison help to understand genre-specific trends in terms of how songs in each genre score on these attributes, which can be useful for further analysis or genre-based recommendations.

- **Hip-Hop/Rap:** Highest average danceability (0.77). Emphasizes rhythm and beat patterns.
- **Dance/Electronic:** Highest average loudness (-5.34). Focuses on energy and production techniques.
- **Alternative/Indie:** Highest average acousticness (0.58). Prioritizes organic instrumental sounds.
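
Each genre mean above is computed from only a handful of tracks, so it can help to report the track count next to each mean. The following is a minimal sketch, assuming a made-up `toy` DataFrame that only borrows the column names of `df`; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Hip-Hop/Rap", "Hip-Hop/Rap", "Dance/Electronic"],
    "danceability": [0.60, 0.70, 0.80, 0.90, 0.75],
    "loudness": [-6.0, -7.0, -7.5, -6.5, -5.0],
})

# Mean per genre plus the number of tracks behind each mean.
summary = toy.groupby("genre").agg(
    n_tracks=("danceability", "size"),
    mean_danceability=("danceability", "mean"),
    mean_loudness=("loudness", "mean"),
)
print(summary)
```

A genre represented by one or two tracks would then be easy to spot before reading too much into its average.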

In [22]:
# Correlation matrix for numerical features (after removing outliers)
correlation_matrix = df_no_outliers[numerical_features].corr()

# Display the correlation matrix
print("Correlation Matrix for Numerical Features (After Removing Outliers):")
print(correlation_matrix)

# Extract strongly positive correlations (above 0.7)
strong_positive_corr = correlation_matrix[correlation_matrix > 0.7]

# Extract strongly negative correlations (below -0.7)
strong_negative_corr = correlation_matrix[correlation_matrix < -0.7]

# Extract features that are not correlated (near 0)
not_correlated = correlation_matrix[(correlation_matrix > -0.1) & (correlation_matrix < 0.1)]

# Print the results
print("\nFeatures that are strongly positively correlated (above 0.7):")
print(strong_positive_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))

print("\nFeatures that are strongly negatively correlated (below -0.7):")
print(strong_negative_corr.dropna(how='all', axis=0).dropna(how='all', axis=1))

print("\nFeatures that are not correlated (between -0.1 and 0.1):")
print(not_correlated.dropna(how='all', axis=0).dropna(how='all', axis=1))



Correlation Matrix for Numerical Features (After Removing Outliers):
                    energy  danceability       key  loudness  acousticness  \
energy            1.000000      0.158797  0.047830  0.764852     -0.641366   
danceability      0.158797      1.000000  0.269747  0.188406     -0.351675   
key               0.047830      0.269747  1.000000 -0.008471     -0.097976   
loudness          0.764852      0.188406 -0.008471  1.000000     -0.412957   
acousticness     -0.641366     -0.351675 -0.097976 -0.412957      1.000000   
speechiness       0.088558      0.217779 -0.019792 -0.053342     -0.099744   
instrumentalness -0.182112     -0.006744  0.059855 -0.352444      0.032809   
liveness          0.031172     -0.153177  0.168719 -0.086091     -0.021335   
valence           0.358533      0.510354  0.112497  0.387957     -0.189840   
tempo             0.073577      0.142349  0.103269  0.092170     -0.241888   
duration_ms       0.130896     -0.143058 -0.047521  0.150379      0.01514

### Correlation Analysis of Numerical Features

In this step, we performed a **correlation analysis** on the numerical features of the dataset to understand the relationships between them. A correlation matrix was calculated, which shows the strength and direction of the linear relationships between the selected numerical features.

The **outcome** of the analysis was:

- **Strong Positive Correlations (above 0.7):** We extracted and displayed pairs of features that have a strong positive relationship, indicating that as one feature increases, the other tends to increase as well.
- **Strong Negative Correlations (below -0.7):** We identified and displayed pairs of features with a strong negative relationship, meaning that as one feature increases, the other decreases.
- **Features with No Significant Correlation (between -0.1 and 0.1):** We extracted features that have a negligible correlation, suggesting no meaningful linear relationship between them.

The results help identify which features are highly related, which can inform feature selection for further analysis or modeling.

**Energy & Loudness:** Strong positive correlation (0.765). Energetic songs tend to be louder.

**Danceability & Valence:** Positive correlation (0.51). Danceable songs often have a happier feel.

**Energy & Acousticness:** Negative correlation (-0.641). High-energy tracks typically have less acoustic sound.
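
One thing to keep in mind when thresholding a correlation matrix is that the diagonal is always 1.0 and every pair appears twice. A possible way to list each pair once is sketched below. `strong_pairs` is a hypothetical helper (not part of the notebook), and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds threshold."""
    # Keep only the strict upper triangle so self-pairs and mirrored
    # duplicates are excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() > threshold]

# Hypothetical toy data: b tracks a closely, c runs opposite to a.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 1.9, 3.2, 3.9, 5.1],
    "c": [5.0, 4.0, 3.0, 2.0, 1.0],
})
print(strong_pairs(toy.corr()))
```

In the notebook this could be called as `strong_pairs(correlation_matrix)`, assuming `correlation_matrix` is defined as in the cell above.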


In [21]:
# Create a filtered DataFrame without outliers
df_no_outliers = df[(z_scores <= 3).all(axis=1)]

# Print the new dataset size
print(f"Original dataset: {df.shape[0]} rows")
print(f"Dataset after outlier removal: {df_no_outliers.shape[0]} rows")

# Define numerical features by selecting columns with numeric data types
numerical_features = df_no_outliers.select_dtypes(include=['number']).columns



Original dataset: 50 rows
Dataset after outlier removal: 45 rows


In [18]:
from scipy import stats
import numpy as np

# Compute Z-scores for numerical columns
z_scores = np.abs(stats.zscore(df[numeric_features]))

# Find rows where any Z-score is greater than 3 (outliers)
outliers = df[(z_scores > 3).any(axis=1)]
print(outliers)



             artist                                  album  \
2       Roddy Ricch  Please Excuse Me For Being Antisocial   
19           Future                          High Off Life   
24    Billie Eilish                    everything i wanted   
41  Black Eyed Peas                            Translation   
49     Travis Scott                             ASTROWORLD   

                    track_name                track_id  energy  danceability  \
2                      The Box  0nbXyq5TXYPCO7pr3N8S4I   0.586         0.896   
19  Life Is Good (feat. Drake)  1K5KBOgreBi5fkEHvg5ap3   0.574         0.795   
24         everything i wanted  3ZCTVFBt2Brf31RLEnCkWJ   0.225         0.704   
41   RITMO (Bad Boys For Life)  4NCsrTzgVfsDo8nWyP8PPc   0.704         0.723   
49                  SICKO MODE  2xLMifQCjDGFmkHkpNLD9h   0.730         0.834   

    key  loudness  acousticness  speechiness  instrumentalness  liveness  \
2    10    -6.687       0.10400       0.0559           0.00000     0.7

In [17]:
# Number of Unique Genres

unique_genres = df["genre"].nunique()
print("Total number of genres in top 50:", unique_genres)


Total number of genres in top 50: 16



### Identifying Number of Unique Genres in Top 50

The dataset was analyzed to find the total number of unique genres in the Top 50.

In [16]:
# Genres with Only One Song in Top 50

scarce_genres = df["genre"].value_counts()
scarce_genres = scarce_genres[scarce_genres == 1]
print("Genres with only one song in top 50:", scarce_genres)


Genres with only one song in top 50: genre
R&B/Hip-Hop alternative               1
Nu-disco                              1
Pop/Soft Rock                         1
Pop rap                               1
Hip-Hop/Trap                          1
Dance-pop/Disco                       1
Disco-pop                             1
Dreampop/Hip-Hop/R&B                  1
Alternative/reggaeton/experimental    1
Chamber pop                           1
Name: count, dtype: int64


### Identifying Genres with Only One Song in Top 50

The dataset was analyzed to find the genres that have only one song in the Top 50. Ten genres fall into this group.

In [15]:
# Most Popular Genre

most_popular_genre = df["genre"].value_counts().idxmax()
print("Most popular genre:", most_popular_genre)


Most popular genre: Pop


### Identifying Most Popular Genre

The dataset was analyzed to find the most popular genre, defined here as the genre with the most tracks in the Top 50. The result is Pop.

In [14]:
# Longest & Shortest Tracks

longest_track = df.loc[df["duration_ms"].idxmax(), ["track_name", "duration_ms"]]
shortest_track = df.loc[df["duration_ms"].idxmin(), ["track_name", "duration_ms"]]

print("Longest track:", longest_track)
print("Shortest track:", shortest_track)


Longest track: track_name     SICKO MODE
duration_ms        312820
Name: 49, dtype: object
Shortest track: track_name     Mood (feat. iann dior)
duration_ms                    140526
Name: 23, dtype: object


### Identifying Longest & Shortest Tracks

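The dataset was analyzed to find the longest and shortest songs by duration: SICKO MODE (312,820 ms, roughly 5.2 minutes) and Mood (feat. iann dior) (140,526 ms, roughly 2.3 minutes).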
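
Durations in milliseconds are hard to read at a glance. A small sketch of a conversion to minutes and seconds follows. `format_duration` is a hypothetical helper name, and it truncates to whole seconds rather than rounding.

```python
def format_duration(duration_ms: int) -> str:
    """Convert milliseconds to a 'minutes:seconds' string, truncating to whole seconds."""
    total_seconds = duration_ms // 1000
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

print(format_duration(312820))  # longest track duration from the output above
print(format_duration(140526))  # shortest track duration from the output above
```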

In [13]:
# Tracks with Loudness Below -8

low_loudness = df[df["loudness"] < -8][["track_name", "loudness"]]
print(low_loudness)


                                        track_name  loudness
7                 death bed (coffee for your head)    -8.765
8                                          Falling    -8.756
15                                    Toosie Slide    -8.820
20                Savage Love (Laxed - Siren Beat)    -8.520
24                             everything i wanted   -14.454
26                                         bad guy   -10.965
36                             HIGHEST IN THE ROOM    -8.764
44                            lovely (with Khalid)   -10.109
47  If the World Was Ending - feat. Julia Michaels   -10.086


### Identifying Tracks with Loudness Below -8

The dataset was analyzed to find all songs with a loudness below -8. Nine tracks meet this threshold, the quietest being "everything i wanted" at -14.454.

In [12]:
# Tracks with Loudness Above -5

high_loudness = df[df["loudness"] > -5][["track_name", "loudness"]]
print(high_loudness)


                                       track_name  loudness
4                                 Don't Start Now    -4.521
6                                Watermelon Sugar    -4.209
10                                           Tusa    -3.280
12                                        Circles    -3.497
16                                  Before You Go    -4.858
17                                         Say So    -4.577
21                                      Adore You    -3.675
23                         Mood (feat. iann dior)    -3.558
31                                 Break My Heart    -3.434
32                                       Dynamite    -4.410
33               Supalonely (feat. Gus Dapperton)    -4.746
35                Rain On Me (with Ariana Grande)    -3.764
37  Sunflower - Spider-Man: Into the Spider-Verse    -4.368
38                                          Hawái    -3.454
39                                        Ride It    -4.258
40                                     g

### Identifying Tracks with Loudness Above -5

The dataset was analyzed to find all songs with a loudness above -5.
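
The two loudness filters (below -8 and above -5) could also be expressed as a single banding step. The sketch below uses `pandas.cut` on a made-up `toy` DataFrame; the band labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be df.
toy = pd.DataFrame({
    "track_name": ["quiet", "middle", "loud"],
    "loudness": [-10.1, -6.4, -3.3],
})

# Bin edges mirror the notebook's thresholds. With right=False a value of
# exactly -8 or -5 falls into the higher band, which differs slightly from
# the strict < -8 and > -5 filters.
toy["band"] = pd.cut(
    toy["loudness"],
    bins=[float("-inf"), -8, -5, float("inf")],
    labels=["below -8", "-8 to -5", "above -5"],
    right=False,
)
print(toy["band"].value_counts().sort_index())
```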

In [11]:
# Tracks with Danceability Below 0.4

low_danceability = df[df["danceability"] < 0.4][["track_name", "danceability"]]
print(low_danceability)


              track_name  danceability
44  lovely (with Khalid)         0.351


### Identifying Tracks with Danceability Below 0.4

The dataset was analyzed to find all songs with a danceability below 0.4. Only one track qualifies: "lovely (with Khalid)" at 0.351.

In [10]:
# Tracks with Danceability Above 0.7

high_danceability = df[df["danceability"] > 0.7][["track_name", "danceability"]]
print(high_danceability)


                                       track_name  danceability
1                                    Dance Monkey         0.825
2                                         The Box         0.896
3                           Roses - Imanbek Remix         0.785
4                                 Don't Start Now         0.793
5                    ROCKSTAR (feat. Roddy Ricch)         0.746
7                death bed (coffee for your head)         0.726
8                                         Falling         0.784
10                                           Tusa         0.803
13                                Blueberry Faygo         0.774
14                       Intentions (feat. Quavo)         0.806
15                                   Toosie Slide         0.830
17                                         Say So         0.787
18                                       Memories         0.764
19                     Life Is Good (feat. Drake)         0.795
20               Savage Love (Laxed - Si

### Identifying Tracks with Danceability Above 0.7

The dataset was analyzed to find all songs with a danceability above 0.7.

In [9]:
# Number of Unique Albums in Top 50

unique_albums = df["album"].nunique()
print("Number of unique albums in top 50:", unique_albums)


Number of unique albums in top 50: 45


### Identifying Number of Unique Albums in Top 50

The dataset was analyzed to find how many unique albums were part of the Top 50 list. The result is 45.

In [8]:
#Albums with More Than 1 Popular Track

popular_albums = df["album"].value_counts()
popular_albums = popular_albums[popular_albums > 1]
print(popular_albums)


album
Future Nostalgia        3
Hollywood's Bleeding    2
Fine Line               2
Changes                 2
Name: count, dtype: int64


### Identifying Albums with More Than 1 Popular Track

The dataset was analyzed to find albums with multiple songs in the Top 50 list. Four albums qualify, led by Future Nostalgia with three tracks.

In [7]:
# Unique Artists in Top 50

unique_artists = df["artist"].nunique()
print("Number of unique artists in top 50:", unique_artists)


Number of unique artists in top 50: 40


### Identifying Unique Artists

The dataset was analyzed to find how many unique artists were part of the Top 50 list. The result is 40.

In [6]:
# Most Popular Artists

popular_artists = df["artist"].value_counts()
popular_artists = popular_artists[popular_artists > 1]
print(popular_artists)


artist
Dua Lipa         3
Billie Eilish    3
Travis Scott     3
Harry Styles     2
Lewis Capaldi    2
Justin Bieber    2
Post Malone      2
Name: count, dtype: int64


### Identifying Most Popular Artists

The dataset was analyzed to find the most popular artists, defined here as those with more than one song in the Top 50. Seven artists meet this criterion.
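
To quantify how concentrated the list is, the share of tracks belonging to repeat artists can be computed from the same `value_counts()` result. This is a minimal sketch on a made-up `artists` Series standing in for `df["artist"]`.

```python
import pandas as pd

# Hypothetical toy column standing in for df["artist"].
artists = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E", "F", "G"])

counts = artists.value_counts()
repeat = counts[counts > 1]

# Fraction of all tracks that belong to artists with more than one entry.
share = repeat.sum() / counts.sum()
print(f"{len(repeat)} repeat artists account for {share:.0%} of tracks")
```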

In [5]:
# Identify Categorical & Numeric Features

categorical_features = df.select_dtypes(include=['object']).columns
numeric_features = df.select_dtypes(include=['number']).columns


print("Categorical features:", categorical_features.tolist())
print("Numeric features:", numeric_features.tolist())


Categorical features: ['artist', 'album', 'track_name', 'track_id', 'genre']
Numeric features: ['energy', 'danceability', 'key', 'loudness', 'acousticness', 'speechiness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms']


In [4]:
# Drop the unnecessary column
df = df.drop(columns=["Unnamed: 0"])

# Verify column is removed
print(df.head())
print(df.info())


        artist                                  album             track_name  \
0   The Weeknd                            After Hours        Blinding Lights   
1  Tones And I                           Dance Monkey           Dance Monkey   
2  Roddy Ricch  Please Excuse Me For Being Antisocial                The Box   
3    SAINt JHN                  Roses (Imanbek Remix)  Roses - Imanbek Remix   
4     Dua Lipa                       Future Nostalgia        Don't Start Now   

                 track_id  energy  danceability  key  loudness  acousticness  \
0  0VjIjW4GlUZAMYd2vXMi3b   0.730         0.514    1    -5.934       0.00146   
1  1rgnBhdG2JDFTbYkYRZAku   0.593         0.825    6    -6.401       0.68800   
2  0nbXyq5TXYPCO7pr3N8S4I   0.586         0.896   10    -6.687       0.10400   
3  2Wo6QQD1KMDWeFkkjLqwx5   0.721         0.785    8    -5.457       0.01490   
4  3PfIrDoz19wz7qK7tYeu62   0.793         0.793   11    -4.521       0.01230   

   speechiness  instrumentalness  live

### Identifying Categorical and Numeric Features

The dataset was analyzed to distinguish between categorical and numeric features. 

In [3]:
import pandas as pd

df = pd.read_csv("spotifytoptracks.csv")

# Data Exploration

print(df.head())  # Show first rows
print(df.info())  # Show column info


   Unnamed: 0       artist                                  album  \
0           0   The Weeknd                            After Hours   
1           1  Tones And I                           Dance Monkey   
2           2  Roddy Ricch  Please Excuse Me For Being Antisocial   
3           3    SAINt JHN                  Roses (Imanbek Remix)   
4           4     Dua Lipa                       Future Nostalgia   

              track_name                track_id  energy  danceability  key  \
0        Blinding Lights  0VjIjW4GlUZAMYd2vXMi3b   0.730         0.514    1   
1           Dance Monkey  1rgnBhdG2JDFTbYkYRZAku   0.593         0.825    6   
2                The Box  0nbXyq5TXYPCO7pr3N8S4I   0.586         0.896   10   
3  Roses - Imanbek Remix  2Wo6QQD1KMDWeFkkjLqwx5   0.721         0.785    8   
4        Don't Start Now  3PfIrDoz19wz7qK7tYeu62   0.793         0.793   11   

   loudness  acousticness  speechiness  instrumentalness  liveness  valence  \
0    -5.934       0.00146      

### Data Loading and Initial Exploration

The dataset was loaded using **Pandas**, and an initial exploration was conducted using the `head()` and `info()` methods.
The exploration revealed no missing values or duplicates. The data was well structured, and the only cleanup needed was dropping the leftover index column `Unnamed: 0`.
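
Since `head()` and `info()` do not report duplicates directly, an explicit check can make the claim reproducible. The following is a sketch on a made-up `toy` DataFrame; treating `track_id` as the identifier for duplicate detection is an assumption.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be df.
toy = pd.DataFrame({
    "track_id": ["t1", "t2", "t3"],
    "energy": [0.73, 0.59, 0.58],
})

missing_per_column = toy.isna().sum()                # NaN count per column
duplicate_rows = toy.duplicated().sum()              # fully identical rows
duplicate_ids = toy["track_id"].duplicated().sum()   # repeated track identifiers

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Duplicate track IDs:", duplicate_ids)
```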


# Spotify's Top 50 Tracks Analysis 
## Objective
The goal of this project is to analyze the top 50 music tracks and extract insights about:
- Most popular artists, albums, and genres
- Correlations between features (danceability, loudness, etc.)
- Comparison of features across music genres
