For our final project, we have chosen to analyze Spotify music data, focusing on key attributes of songs. This dataset comprises various features, including track duration, explicit content, danceability, energy, and more.

Our central question revolves around predicting the popularity of a song based on its musical characteristics and qualities. Specifically, we aim to investigate the following:

Can we predict the popularity of a song based on musical characteristics/qualities? What qualities of a song contribute the most to its popularity?

The motivation behind this question is to give recording companies and artists a better understanding of the factors that influence a song's popularity. By determining which musical attributes contribute most to a song's success, stakeholders can make informed decisions to enhance the appeal and marketability of their music.

Our question is relevant to the music industry, where the ability to predict a song's popularity directly impacts revenue generation.


We found our dataset on this website: https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/tree/main 

The columns that we are working with are as follows:
track_id
artists
album_name
track_name
popularity
duration_ms
explicit
danceability
energy
key
loudness
mode
speechiness
acousticness
instrumentalness
liveness
valence
tempo
time_signature
track_genre

In [18]:
# step 1: import the libraries and load the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# open the dataset.csv file 
music = pd.read_csv('dataset.csv')
# inspect the types of each column
music.dtypes


Unnamed: 0            int64
track_id             object
artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object

We decided that the best model for our question would be a random forest, because we have a lot of key features that are both numerical and categorical.

We want to use a random forest to determine which factors contribute the most to popularity. Random forests are a suitable model for our question because they work with both numerical and categorical data, are flexible, and are easy to interpret through their feature importance scores.
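
As a rough illustration of the idea above, here is a minimal sketch of fitting a random forest and ranking features by importance. It assumes scikit-learn is available and treats popularity as a regression target, which is our assumption rather than something fixed earlier in this notebook. The helper name `fit_popularity_forest` and the small synthetic `demo` frame are hypothetical and only stand in for the real `music` data. Note that scikit-learn's forests expect numeric inputs, so text columns such as `track_genre` would need to be encoded first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def fit_popularity_forest(df, feature_cols, target_col="popularity", seed=0):
    """Fit a random forest regressor and return it with ranked importances.

    Hypothetical helper for illustration; all feature columns must be numeric.
    """
    X = df[feature_cols]
    y = df[target_col]
    # hold out 20% of the rows to get an honest R^2 estimate
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    # impurity-based importances, sorted from most to least important
    importances = pd.Series(
        model.feature_importances_, index=feature_cols
    ).sort_values(ascending=False)
    r2 = model.score(X_test, y_test)
    return model, importances, r2


# synthetic stand-in data (NOT the Spotify dataset): popularity is driven
# almost entirely by danceability, so it should rank first
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "danceability": rng.random(300),
    "energy": rng.random(300),
    "valence": rng.random(300),
})
demo["popularity"] = 100 * demo["danceability"] + rng.normal(0, 1, 300)

model, importances, r2 = fit_popularity_forest(
    demo, ["danceability", "energy", "valence"]
)
print(importances)
```

On the real data, the same helper could be called with the numeric columns of `music` once the cleaning steps are done. Impurity-based importances are only one way to rank features, so the ranking should be read as a guide rather than a definitive answer.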



First, let's clean our dataset a little before we work with it.

In [19]:
# count the missing values (NAs) in each column
music.isna().sum()
# there are not too many NAs, so we can drop the affected rows
music = music.dropna()
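
Before dropping rows, it can help to check what share of the data would be lost. The sketch below shows one way to do that with pandas. The helper `na_report` and the tiny `demo` frame are hypothetical and used only for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd


def na_report(df):
    """Return per-column NA counts and the share of rows with any NA."""
    counts = df.isna().sum()
    share_rows = df.isna().any(axis=1).mean()
    return counts, share_rows


# made-up example frame with one missing value in each column
demo = pd.DataFrame({
    "track_name": ["A", None, "C", "D"],
    "popularity": [10, 20, np.nan, 40],
})

counts, share_rows = na_report(demo)
cleaned = demo.dropna()  # drops every row that has at least one NA
print(counts)
print(f"share of rows with an NA: {share_rows:.2f}")
print(f"rows kept after dropna: {len(cleaned)}")
```

If the share of affected rows is small, as we found for this dataset, dropping them is a reasonable choice.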


In [20]:
# let's fix some of the column values; for example, some track names are not legible
# keep only the track names made up entirely of unaccented English letters and spaces
# note: this also removes names that contain digits, punctuation, or accented characters
music = music[music['track_name'].str.contains(r'^[a-zA-Z ]+$')]
music['track_name'].value_counts()
# let's drop track_id because it is just an opaque identifier string
music = music.drop(['track_id'], axis=1)
# also remove the 'Unnamed: 0' column
music = music.drop(['Unnamed: 0'], axis=1)
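
To make the effect of the track-name filter concrete, the short sketch below applies the same regular expression to a handful of sample titles. The titles are illustrative values chosen by us, not rows taken from the dataset.

```python
import pandas as pd

# illustrative titles only; they are not drawn from the dataset
titles = pd.Series(
    ["Hold On", "Señorita", "24K Magic", "Don't Stop", "Yesterday"]
)

# same pattern as the cleaning step: only A-Z, a-z, and spaces are allowed
mask = titles.str.contains(r'^[a-zA-Z ]+$')
kept = titles[mask].tolist()
dropped = titles[~mask].tolist()

print("kept:", kept)
print("dropped:", dropped)
```

The filter is stricter than "English only": titles with an accent, a digit, or an apostrophe are removed as well, which is worth keeping in mind when interpreting results on the filtered data.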



In [22]:
music.dtypes


artists              object
album_name           object
track_name           object
popularity            int64
duration_ms           int64
explicit               bool
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
time_signature        int64
track_genre          object
dtype: object