---
title: "Modeling for predicting decades"
subtitle: "DSAN 5300 Final Project"
authors: ["Jorge Bris Moreno", "William McGloin", "Kangheng Liu", "Isfar Baset"]
date: last-modified
date-format: long
format:
  html:
    self-contained: true
    toc: true
    code-overflow: wrap
    code-fold: true
---

## Introduction

In this document, we build models to predict the decade in which a track was released. We use the balanced dataset, in which underrepresented decades were augmented with synthetic samples generated by SMOTE. We compare three model families: logistic regression, support vector machines (SVMs), and neural networks, each with hyperparameter tuning. We evaluate the models using accuracy, precision, recall, and F1 score.
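To make the evaluation metrics concrete, the following minimal sketch computes accuracy and per-class precision, recall, and F1 by hand, then macro-averages them. The helper names (`per_class_metrics`, `macro_average`) and the six toy labels are illustrative assumptions, not part of the project code, which relies on scikit-learn's `classification_report`.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 across classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy labels (hypothetical, for illustration only)
y_true = ["1950s", "1950s", "1960s", "1960s", "1970s", "1970s"]
y_pred = ["1950s", "1960s", "1960s", "1960s", "1970s", "1950s"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
macro = macro_average(metrics)

print("accuracy:", round(accuracy, 3))
for label, (p, r, f) in metrics.items():
    print(label, round(p, 3), round(r, 3), round(f, 3))
print("macro avg:", tuple(round(v, 3) for v in macro))
```

Macro averaging treats every decade equally, which is a reasonable choice here because the balanced dataset gives each decade roughly the same number of rows.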

## Data Preparation

In [9]:
# imports for loading the data and splitting it into training, validation, and test sets
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# read the balanced tracks data
tracks = pd.read_csv('../data/clean_data/balanced_tracks.csv')

# convert time_signature, decade, key, and mode to categorical columns
tracks['time_signature'] = tracks['time_signature'].astype('category')
tracks['decade'] = tracks['decade'].astype('category')
tracks['key'] = tracks['key'].astype('category')
tracks['mode'] = tracks['mode'].astype('category')

tracks.head()

Unnamed: 0,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,duration_ms,key,mode,decade
0,0.787,0.889,-3.125,0.128,0.00951,0.000322,0.652,0.677,156.027,4,172399,2,1,2020s
1,0.759,0.833,-5.01,0.0779,0.00026,0.0573,0.178,0.522,140.026,4,183919,11,1,2020s
2,0.84,0.934,-3.717,0.119,0.0484,0.0,0.0961,0.67,149.994,4,145842,0,1,2020s
3,0.894,0.767,-4.695,0.137,0.0231,2.4e-05,0.574,0.412,144.077,4,140288,10,0,2020s
4,0.78,0.78,-2.857,0.0858,0.00147,0.0,0.472,0.446,118.014,4,177289,0,1,2020s


In [10]:
# split the data into training (64%), validation (16%), and test (20%) sets, stratified by decade
train, test = train_test_split(tracks, test_size=0.2, stratify=tracks['decade'])
train, val = train_test_split(train, test_size=0.2, stratify=train['decade'])
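Because the validation set is carved out of the remaining 80% rather than out of the full data, the effective proportions are not 60/20/20. A small worked sketch of the arithmetic follows; the `split_fractions` helper is hypothetical and only illustrates the two `test_size=0.2` calls above.

```python
def split_fractions(test_size, val_size):
    """Return (train, val, test) fractions for two successive hold-out splits.

    The first split removes `test_size` of all rows; the second removes
    `val_size` of whatever remains.
    """
    test_frac = test_size
    remaining = 1.0 - test_size
    val_frac = remaining * val_size
    train_frac = remaining - val_frac
    return train_frac, val_frac, test_frac


train_frac, val_frac, test_frac = split_fractions(0.2, 0.2)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
# prints: train=0.64, val=0.16, test=0.20
```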

In [11]:
# specifying targets and features
# (the 'Unnamed: 0' index column read from the CSV stays in the features)
train_target = train['decade']
train_features = train.drop(columns=['decade'])

val_target = val['decade']
val_features = val.drop(columns=['decade'])

test_target = test['decade']
test_features = test.drop(columns=['decade'])

## Logistic regression

In [12]:
# logistic regression (one-vs-rest) with a manual grid search over C and L1/L2 regularization, scored on the validation set
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score, f1_score, recall_score, precision_score

# Define the parameter grid
C_values = np.logspace(-4, 4, 20)
penalties = ['l1', 'l2']

best_score = 0
best_params = {'C': None, 'penalty': None}

for C in C_values:
    for penalty in penalties:
        if penalty == 'l1':
            solver = 'liblinear'  # 'liblinear' works well with L1 penalty
        else:
            solver = 'lbfgs'  # 'lbfgs' is good for L2 penalty
        
        # Initialize the Logistic Regression model
        model = LogisticRegression(C=C, penalty=penalty, solver=solver, multi_class='ovr')
        
        # Fit the model
        model.fit(train_features, train_target)
        
        # Evaluate the model on the validation set
        val_predictions = model.predict(val_features)
        score = accuracy_score(val_target, val_predictions)
        
        # Update best model if the current model is better
        if score > best_score:
            best_score = score
            best_params['C'] = C
            best_params['penalty'] = penalty
            best_model = model

# Print best parameters and best score
print("Best Parameters:", best_params)
print("Best Validation Score:", best_score)

Best Parameters: {'C': 29.763514416313132, 'penalty': 'l1'}
Best Validation Score: 0.3905570511493063
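The manual loop above can also be expressed with `GridSearchCV` and a `PredefinedSplit`, which keeps the same idea of a single fixed validation set. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` rather than the tracks data, it standardizes the features inside a pipeline, and it tunes only `C` with scikit-learn's default penalty and solver, so it is not equivalent to the L1/L2 one-vs-rest search above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the tracks data (assumption: 3 classes, 10 features)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# -1 marks rows that are always used for training; 0 marks the validation fold
test_fold = np.array([-1] * 480 + [0] * 120)
cv = PredefinedSplit(test_fold)

# Scaling lives inside the pipeline so it is fit on the training rows only
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
C_grid = np.logspace(-4, 4, 20)
param_grid = {"logisticregression__C": C_grid}

# refit=True (the default) refits the best pipeline on all rows passed to fit,
# i.e. training plus validation
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print("Best Parameters:", search.best_params_)
print("Best Validation Score:", search.best_score_)
```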


In [13]:
# Predictions on test set
test_predictions = best_model.predict(test_features)

# Evaluation
print("Accuracy on test set: ", accuracy_score(test_target, test_predictions))
print("Classification Report:\n", classification_report(test_target, test_predictions))


Accuracy on test set:  0.386070706736026
Classification Report:
               precision    recall  f1-score   support

       1950s       0.45      0.76      0.56      3772
       1960s       0.38      0.38      0.38      3772
       1970s       0.33      0.15      0.21      3773
       1980s       0.39      0.55      0.45      3772
       1990s       0.36      0.16      0.22      3773
       2000s       0.35      0.42      0.38      3773
       2010s       0.29      0.20      0.24      3773
       2020s       0.43      0.46      0.45      3773

    accuracy                           0.39     30181
   macro avg       0.37      0.39      0.36     30181
weighted avg       0.37      0.39      0.36     30181
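As a quick sanity check on the report above, the macro averages can be recomputed from the per-class rows. Since every decade has 3772 or 3773 test rows, the macro and weighted averages should coincide up to rounding, and uniform random guessing over eight equally sized classes would be expected to reach an accuracy of about 0.125. The values below are copied from the report, which rounds to two decimals, so the recomputed means agree with the `macro avg` row only up to that rounding.

```python
# Per-class values copied from the classification report (1950s ... 2020s)
precision = [0.45, 0.38, 0.33, 0.39, 0.36, 0.35, 0.29, 0.43]
recall = [0.76, 0.38, 0.15, 0.55, 0.16, 0.42, 0.20, 0.46]
f1 = [0.56, 0.38, 0.21, 0.45, 0.22, 0.38, 0.24, 0.45]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


macro_precision = mean(precision)
macro_recall = mean(recall)
macro_f1 = mean(f1)

# Expected accuracy of uniform random guessing over equally sized classes
chance_accuracy = 1 / len(precision)

print("macro precision:", macro_precision)
print("macro recall:", macro_recall)
print("macro f1:", macro_f1)
print("chance accuracy:", chance_accuracy)
```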



## SVM

In [14]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score


# Define the parameter grid
C_values = [0.1, 1, 10, 100]
gamma_values = ['scale', 'auto']
kernel_types = ['rbf']

best_score = 0
best_params = {}

for C in C_values:
    for gamma in gamma_values:
        for kernel in kernel_types:
            # Initialize the SVM model
            svm_model = SVC(C=C, gamma=gamma, kernel=kernel)

            # Fit the SVM model to the training data
            svm_model.fit(train_features, train_target)

            # Predictions on validation set
            val_predictions = svm_model.predict(val_features)

            # Calculate accuracy on the validation set
            score = accuracy_score(val_target, val_predictions)

            # Update best model if current model is better
            if score > best_score:
                best_score = score
                best_params = {'C': C, 'gamma': gamma, 'kernel': kernel}
                best_model = svm_model

# Display the best parameters and the best validation score
print("Best Parameters:", best_params)
print("Best Validation Score:", best_score)
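One point that may be worth considering for the RBF kernel is feature scale. The kernel value is `exp(-gamma * squared_distance)`, and with `gamma='auto'` scikit-learn uses `1 / n_features`. The sketch below takes `danceability` and `duration_ms` from the first two rows of the table in Data Preparation and assumes 14 feature columns (the columns left after dropping `decade`). Converting milliseconds to minutes is used here only as a simple stand-in for standardization; it is not what the code above does.

```python
import math


def rbf_kernel(u, v, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between u and v)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


gamma = 1 / 14  # gamma='auto' is 1 / n_features; 14 features assumed

# (danceability, duration_ms) for the first two rows of the table
raw_a = [0.787, 172399]
raw_b = [0.759, 183919]
raw_similarity = rbf_kernel(raw_a, raw_b, gamma)

# Same rows with duration expressed in minutes (illustrative rescaling)
scaled_a = [0.787, 172399 / 60000]
scaled_b = [0.759, 183919 / 60000]
scaled_similarity = rbf_kernel(scaled_a, scaled_b, gamma)

print("raw features:     ", raw_similarity)
print("rescaled features:", scaled_similarity)
```

On the raw features the millisecond difference dominates the distance and the kernel value underflows to zero, so the two tracks look completely dissimilar regardless of their other attributes. Wrapping the model in a pipeline with `StandardScaler` is one common way to address this.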

In [None]:
# Predictions on test set
test_predictions = best_model.predict(test_features)

# Evaluation
print("Accuracy on test set: ", accuracy_score(test_target, test_predictions))
print("Classification Report on Test Set:\n", classification_report(test_target, test_predictions))

## Neural Nets