## Data Science Exploration

This notebook performs an exploratory data analysis (EDA) of the Top Hits on Spotify from 2000-2019 dataset.
- https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019/data
- This dataset contains audio statistics of the top 2000 tracks on Spotify from 2000-2019. It has 18 columns, each describing the track and its qualities.

In [92]:
# Import required libraries
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [85]:
# Read in the data 
spotifyData = pd.read_csv('spotify.csv')
print(spotifyData.shape) # Size of the data set 
print(spotifyData.sample()) # Print a sample of the data 

(2000, 18)
       artist                song  duration_ms  explicit  year  popularity  \
691  The Fray  How to Save a Life       262533     False  2005          79   

     danceability  energy  key  loudness  mode  speechiness  acousticness  \
691          0.64   0.743   10     -4.08     1       0.0379         0.269   

     instrumentalness  liveness  valence    tempo genre  
691               0.0     0.101    0.361  122.035   pop  


In [86]:
print(spotifyData.describe())

         duration_ms        year   popularity  danceability       energy  \
count    2000.000000  2000.00000  2000.000000   2000.000000  2000.000000   
mean   228748.124500  2009.49400    59.872500      0.667438     0.720366   
std     39136.569008     5.85996    21.335577      0.140416     0.152745   
min    113000.000000  1998.00000     0.000000      0.129000     0.054900   
25%    203580.000000  2004.00000    56.000000      0.581000     0.622000   
50%    223279.500000  2010.00000    65.500000      0.676000     0.736000   
75%    248133.000000  2015.00000    73.000000      0.764000     0.839000   
max    484146.000000  2020.00000    89.000000      0.975000     0.999000   

               key     loudness         mode  speechiness  acousticness  \
count  2000.000000  2000.000000  2000.000000  2000.000000   2000.000000   
mean      5.378000    -5.512434     0.553500     0.103568      0.128955   
std       3.615059     1.933482     0.497254     0.096159      0.173346   
min       0.000
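
The summary above reports a minimum `year` of 1998 and a maximum of 2020, although the dataset is described as covering 2000-2019. A minimal sketch of how such rows could be flagged is shown here; the `toy` frame is a hypothetical stand-in, and in the notebook the same mask would be applied to `spotifyData`.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook would use spotifyData instead.
toy = pd.DataFrame({
    "song": ["a", "b", "c", "d"],
    "year": [1998, 2005, 2019, 2020],
})

# Flag rows whose year lies outside the advertised 2000-2019 window
# (Series.between is inclusive on both ends by default).
out_of_window = ~toy["year"].between(2000, 2019)
print(toy[out_of_window])

# Keep only the rows inside the window.
in_window = toy[~out_of_window]
```

Whether to drop or keep such rows is a judgment call; the sketch only makes the mismatch visible.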

# Fit A GLM
#### Will fit a GLM (here an ordinary least-squares linear regression) to the data to predict the year of the song

In [87]:
# Define features and target: drop the target (year) and the text columns
X = spotifyData.drop(columns=['year', 'artist', 'song', 'genre'])
y = spotifyData['year']

# Split data into training and testing sets with an 80-20 split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [88]:
# Fit the GLM
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)
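
`LinearRegression` fits ordinary least squares, which corresponds to a GLM with a Gaussian family and an identity link. As a hedged illustration on synthetic data (the `X_toy` and `y_toy` arrays are made up for this sketch and are not the Spotify features), the fitted coefficients can be compared with the closed-form least-squares solution:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 2000 + X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Fit with scikit-learn, as the notebook does.
ols = LinearRegression().fit(X_toy, y_toy)

# Solve the same least-squares problem directly, with an explicit intercept column.
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

# Both routes should agree up to floating-point error.
print(np.allclose(beta[0], ols.intercept_), np.allclose(beta[1:], ols.coef_))
```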

In [89]:
# Evaluate the model performance 
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 30.849836548281324
R-squared: 0.14436197290701924


Results:
- Mean Squared Error: 30.85
- R-squared: 0.144

The model explains only about 14% of the variance in release year, so it is not very useful for predicting the year of a song.
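
To put the error on the scale of years, one can take the square root of the MSE and back out the error of a predict-the-mean baseline from the identity R² = 1 - MSE / Var(y_test). The sketch below uses only the two numbers printed above; the variable names are illustrative.

```python
import math

# Values copied from the evaluation output above.
mse = 30.849836548281324
r2 = 0.14436197290701924

# Root mean squared error, in years.
rmse = math.sqrt(mse)

# R^2 = 1 - MSE / Var(y_test), so the variance of y_test is MSE / (1 - R^2).
# Its square root is the RMSE of always predicting the test-set mean year.
baseline_rmse = math.sqrt(mse / (1 - r2))

print(round(rmse, 2))           # about 5.55 years
print(round(baseline_rmse, 2))  # about 6.00 years
```

Under these numbers the regression improves on the constant baseline by less than half a year of RMSE.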

# Fit A Random Forest Classifier 
#### Will fit a Random Forest model to the data to predict the genre of the song 

In [136]:
# Select relevant features and target variable
X2 = spotifyData.drop(columns=['artist', 'song', 'genre'])
y2 = spotifyData['genre']

# Encode categorical target variable (genre)
label_encoder = LabelEncoder()
y2 = label_encoder.fit_transform(y2)

# Split the data into training and testing sets
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=41)

# Initialize and train the Random Forest classifier (this reuses the name `model` from the regression above)
model = RandomForestClassifier()
model.fit(X2_train, y2_train)

# Make predictions on the testing set
y2_pred = model.predict(X2_test)
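
Genre values in this dataset can be comma-separated combinations such as `hip hop, pop`, and `LabelEncoder` treats each distinct combination as its own class. One possible alternative, sketched here on a hypothetical `toy_genres` series rather than the real column, is to split the strings into individual tags and multi-hot encode them. This is not what the notebook does; it would turn the task into multi-label classification.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical genre strings in the same comma-separated style as the dataset.
toy_genres = pd.Series(["pop", "hip hop, pop", "rock, pop", "hip hop"])

# Number of distinct combined labels, i.e. the classes LabelEncoder would create.
n_combined = toy_genres.nunique()

# Split each string into its individual genre tags.
tags = toy_genres.str.split(", ")

# Multi-hot encode: one column per individual tag.
mlb = MultiLabelBinarizer()
multi_hot = mlb.fit_transform(tags)

print(n_combined)
print(list(mlb.classes_))
print(multi_hot)
```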


In [137]:
# Evaluate the model using accuracy, precision and recall 
accuracy = accuracy_score(y2_test, y2_pred)
precision = precision_score(y2_test, y2_pred, average='weighted', zero_division=0)  # Weighted average precision
recall = recall_score(y2_test, y2_pred, average='weighted', zero_division=0)  # Weighted average recall

# Print the metrics 
print("Accuracy: ", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy:  0.36
Precision: 0.3249078025778987
Recall: 0.36
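
Weighted-average recall is mathematically equal to accuracy, which is why both print 0.36. An accuracy of 0.36 is also hard to judge without a reference point, so a majority-class baseline can help. The sketch below uses made-up labels (`y_toy`) to show both ideas; in the notebook the baseline would be fitted on `X2_train`, `y2_train` and scored on the test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy labels standing in for the genre classes.
y_toy = np.array(["pop"] * 6 + ["rock"] * 3 + ["hip hop, pop"])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base = baseline.predict(X_toy)

# Accuracy and weighted recall coincide by construction.
baseline_acc = accuracy_score(y_toy, y_base)
baseline_recall = recall_score(y_toy, y_base, average="weighted", zero_division=0)
print(baseline_acc, baseline_recall)
```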




In [135]:
example_input = {
    'duration_ms': 228700,
    'explicit': 1,  
    'year': 2003, 
    'popularity': 66,
    'danceability': 0.3,
    'energy': 0.6,
    'key': 3,
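    'loudness': 3,  # note: loudness in the data averages about -5.5, so a positive value is atypical
    'mode': 3.5,  # note: mode looks binary in the data (mean 0.55, std 0.50), so 3.5 is out of range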
    'speechiness': 0.9,
    'acousticness': 0.3, 
    'instrumentalness': 0.3, 
    'liveness': 0.2,
    'valence': 0.3,
    'tempo': 35
}

example_df = pd.DataFrame([example_input])

predicted_genre = model.predict(example_df)

# Decode the predicted genre labels back to their original names
predicted_genre_names = label_encoder.inverse_transform(predicted_genre)

# Print the predicted genre
print("Predicted Genre:", predicted_genre_names)

Predicted Genre: ['hip hop, pop']
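
A tree-based model returns a prediction for any numeric input, including values it never saw during training, so it can be worth checking an example against the training ranges first. The helper below, `out_of_range_features`, is a hypothetical name introduced for this sketch, and `toy_train` stands in for the training features.

```python
import pandas as pd

def out_of_range_features(train_df, example):
    """Return {feature: (value, train_min, train_max)} for values outside the training range."""
    flagged = {}
    for name, value in example.items():
        lo, hi = train_df[name].min(), train_df[name].max()
        if not lo <= value <= hi:
            flagged[name] = (value, lo, hi)
    return flagged

# Hypothetical training features; the notebook would pass X2_train instead.
toy_train = pd.DataFrame({
    "mode": [0, 1, 1, 0],
    "tempo": [95.0, 122.0, 140.0, 80.0],
})

# Only "mode" falls outside the range observed in toy_train.
flags = out_of_range_features(toy_train, {"mode": 3.5, "tempo": 100.0})
print(flags)
```

In the notebook this could be called as `out_of_range_features(X2_train, example_input)` before trusting the predicted genre.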
