# Project Check In Week 5: Classification
---

The code below implements two simple classification models, a Decision Tree and a Random Forest, for the Spotify dataset.

We use the same binary variable as the previous week, which is the **genre of the song**, specifically the subset of the data that contains the genres **'cantopop'** and **'chicago-house'**. 

We will use the Random Forest algorithm to classify the genre of the song based on the other features in the dataset.

In [14]:
# Imports
import pandas as pd 
import numpy as np
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split


pd.set_option('display.max_columns', None)
pd.options.display.max_colwidth = 500
pd.options.display.max_rows = 100

In [15]:
# Load data
classification_data_orig = pd.read_excel('clean_data.xlsx')

### Data Preprocessing and Selection


To apply binary classification, we select two specific music genres from our dataset and assign the classes for them:
- `genre_1` (`chicago-house`) is set as the positive class (1).
- `genre_2` (`cantopop`) is set as the negative class (0).

To ensure compatibility with classification models in scikit-learn, we: 
- Filter our dataset to only include numeric columns and the selected binary genre labels.
- Remove any unnecessary columns, such as 'Unnamed: 0', and focus only on relevant predictor variables.
- Split the data into training and validation sets using a stratified split, so that both sets contain the same proportion of class 1 and class 0 and each set stays representative of the class balance.

In [31]:
# Select a subset of the data s.t. we only have 2 classes
# We will use the 'genre' column to create the classes
genre_1 = "chicago-house" # will convert to be the positive class (1)
genre_2 = "cantopop" # will convert to be the negative class (0)


# sklearn does not directly support categorical variables, so we encode them numerically.
# The only categorical variable here is the binary response from the previous check-in, so we map its two classes to 0 and 1.
d = {
    genre_1: 1,
    genre_2: 0
}

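# Keep only the rows for the two selected genres, encode the labels, and keep the numeric columns
# (the train/validation split follows below; there is no test set for this project check-in)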

classification_data = classification_data_orig[classification_data_orig["track_genre"].isin([genre_1, genre_2])]
classification_data = classification_data.replace({"track_genre": d}).select_dtypes(include=[np.number]).drop(columns="Unnamed: 0")

# With selected predictor variables:
classification_data = classification_data[['danceability', 'acousticness', 'instrumentalness','time_signature', 'track_genre']]

# Split data into training and validation sets - stratified split to ensure same proportion of classes in both sets
classification_data_train, classification_data_val = train_test_split(classification_data, test_size=0.2, random_state=42, stratify=classification_data["track_genre"])

# Check the proportion of classes in the training and validation sets

prop_zero_train = (classification_data_train["track_genre"] == 0).sum() / len(classification_data_train)
prop_zero_val = (classification_data_val["track_genre"] == 0).sum() / len(classification_data_val)

print("Proportion of Cantopop in training set:", prop_zero_train)
print("Proportion of Cantopop in validation set:", prop_zero_val)

assert np.isclose(prop_zero_train, prop_zero_val, atol=0.05), "Class proportions differ between the training and validation sets"

Proportion of Cantopop in training set: 0.5
Proportion of Cantopop in validation set: 0.5



Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
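
The `FutureWarning` above is emitted by `replace` when it silently downcasts the relabelled column to integers. One possible way to avoid it is to map the labels explicitly with `Series.map` and cast the result. The sketch below uses a small hypothetical frame (the rows and the name `label_map` are made up for illustration; only the column name and the genre labels come from this notebook):

```python
import pandas as pd

# Hypothetical toy rows; only the column name and genre labels match the notebook.
toy = pd.DataFrame({
    "danceability": [0.81, 0.52, 0.77, 0.49],
    "track_genre": ["chicago-house", "cantopop", "chicago-house", "cantopop"],
})

label_map = {"chicago-house": 1, "cantopop": 0}

# Series.map returns the mapped values directly; astype(int) makes the dtype explicit.
toy["track_genre"] = toy["track_genre"].map(label_map).astype(int)

print(toy["track_genre"].tolist())
```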



###  Decision Tree Classifier
We start by training a Decision Tree Classifier to classify the selected genres. 

We fit the classifier on the training set and predict the probabilities for the validation set, specifically extracting probabilities for class "1" (chicago-house). 

We then calculate the confusion matrix, accuracy, prediction error, F1 score, true positive rate (TPR), and true negative rate (TNR).

In [38]:
# Create the Decision Tree Classifier
classification_cart = DecisionTreeClassifier(min_samples_split=20, max_depth=30)

# Fit the classifier to the training data 
classification_cart.fit(classification_data_train.drop(columns="track_genre"), classification_data_train["track_genre"])

# Compute validation set predictions (probability) and only keep the positive class (chicago-house) probabilities
classification_val_pred = classification_cart.predict_proba(classification_data_val.drop(columns="track_genre"))[:, 1]


In [39]:
classification_data_val["track_genre"]

12835    1
12460    1
12498    1
11283    0
12248    1
        ..
11431    0
11857    0
12685    1
11438    0
11383    0
Name: track_genre, Length: 400, dtype: int64

In [40]:
# Compute the confusion matrix for the Decision Tree Classifier model
conf_matrix_dt = metrics.confusion_matrix(classification_data_val["track_genre"], classification_val_pred > 0.5)

pred_accuracy_dt = (conf_matrix_dt[0, 0] + conf_matrix_dt[1, 1]) / np.sum(conf_matrix_dt)
pred_error_dt = 1 - pred_accuracy_dt
f1_score_dt = metrics.f1_score(classification_data_val["track_genre"], classification_val_pred > 0.5)

print(f"Decision Tree Classifier Accuracy: {pred_accuracy_dt}")
print(f"Decision Tree Classifier Error: {pred_error_dt}")
print(f"Decision Tree Classifier F1 Score: {f1_score_dt}")
print(f"Decision Tree Classifier Model TPR: {conf_matrix_dt[1, 1] / np.sum(conf_matrix_dt[1, :])}")
print(f"Decision Tree Classifier Model TNR: {conf_matrix_dt[0, 0] / np.sum(conf_matrix_dt[0, :])}")

Decision Tree Classifier Accuracy: 0.955
Decision Tree Classifier Error: 0.04500000000000004
Decision Tree Classifier F1 Score: 0.9538461538461539
Decision Tree Classifier Model TPR: 0.93
Decision Tree Classifier Model TNR: 0.98
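
As a worked check of the arithmetic, the metrics above can be reproduced from a confusion matrix. The counts below are not printed anywhere in this notebook; they are inferred from the reported TPR and TNR together with the 400 validation rows split evenly between the classes (200 each), so treat them as a reconstruction rather than actual output:

```python
import numpy as np

# Counts inferred from the printed TPR/TNR, with 200 validation tracks per class.
# Layout follows sklearn's confusion_matrix: rows = true class, columns = predicted class.
conf = np.array([[196, 4],     # true cantopop (0): TN, FP
                 [14, 186]])   # true chicago-house (1): FN, TP

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / conf.sum()
tpr = tp / (tp + fn)           # recall on the positive class
tnr = tn / (tn + fp)           # recall on the negative class
precision = tp / (tp + fp)
f1 = 2 * precision * tpr / (precision + tpr)

print(accuracy, tpr, tnr, f1)
```

The four values agree with the accuracy, TPR, TNR, and F1 score reported for the Decision Tree.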


### Random Forest Classifier
Now we train a Random Forest Classifier on the same data.

We also calculate these metrics: confusion matrix, accuracy, prediction error, F1 score, true positive rate (TPR), and true negative rate (TNR).
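
The same metrics can also be obtained from scikit-learn's helper functions instead of indexing the confusion matrix by hand. This is a minimal sketch on hypothetical toy labels (not the Spotify data); the TNR is computed as the recall of class 0:

```python
from sklearn import metrics

# Hypothetical toy labels and predictions, purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

accuracy = metrics.accuracy_score(y_true, y_pred)
tpr = metrics.recall_score(y_true, y_pred, pos_label=1)  # recall on class 1
tnr = metrics.recall_score(y_true, y_pred, pos_label=0)  # recall on class 0
f1 = metrics.f1_score(y_true, y_pred)

# For binary labels, ravel() unpacks the matrix in the order TN, FP, FN, TP.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

print(accuracy, tpr, tnr, f1)
print(tp / (tp + fn), tn / (tn + fp))
```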

In [41]:
# Random Forest Algorithm
rf = RandomForestClassifier(random_state=8743)
rf.fit(classification_data_train.drop(columns="track_genre"), classification_data_train["track_genre"]) 

rf_val_pred = rf.predict_proba(classification_data_val.drop(columns="track_genre"))[:, 1]

In [10]:
# Compute the confusion matrix for the RF model
conf_matrix_rf = metrics.confusion_matrix(classification_data_val["track_genre"], rf_val_pred > 0.5)

pred_accuracy_rf = (conf_matrix_rf[0, 0] + conf_matrix_rf[1, 1]) / np.sum(conf_matrix_rf)
pred_error_rf = 1 - pred_accuracy_rf
f1_score_rf = metrics.f1_score(classification_data_val["track_genre"], rf_val_pred > 0.5)

print(f"Random Forest Model Accuracy: {pred_accuracy_rf}")
print(f"Random Forest Model Error: {pred_error_rf}")
print(f"Random Forest Model F1 Score: {f1_score_rf}")
print(f"Random Forest Model TPR: {conf_matrix_rf[1, 1] / np.sum(conf_matrix_rf[1, :])}")
print(f"Random Forest Model TNR: {conf_matrix_rf[0, 0] / np.sum(conf_matrix_rf[0, :])}")

Random Forest Model Accuracy: 0.98
Random Forest Model Error: 0.020000000000000018
Random Forest Model F1 Score: 0.9798994974874372
Random Forest Model TPR: 0.975
Random Forest Model TNR: 0.985


### ROC Curve and AUC Calculation


#### Plotting the ROC Curve and Calculating AUC

We plot the ROC curve and calculate the AUC score for the Random Forest model to assess its classification performance. 


In [11]:
# Display the ROC curve for the RF model
fpr_rf, tpr_rf, rf_thresholds = metrics.roc_curve(classification_data_val["track_genre"], rf_val_pred)

roc_rf = pd.DataFrame({
    'False Positive Rate': fpr_rf,
    'True Positive Rate': tpr_rf,
    'Model': 'RF'
})

px.line(roc_rf, y='True Positive Rate', x='False Positive Rate',
        color='Model',
        width=700, height=500)

AUC (Area Under the Curve) summarizes in a single number the classifier's ability to distinguish between the two classes. We calculate it below:

In [12]:
# Calculate the AUC for the RF model
roc_auc_rf = metrics.auc(fpr_rf, tpr_rf)
print(f"Random Forest Model AUC: {roc_auc_rf}")

Random Forest Model AUC: 0.9958875


This ROC curve indicates very strong performance:

- The curve rises sharply towards the top-left corner, so the model achieves a high True Positive Rate (TPR) at a low False Positive Rate (FPR).

- The curve stays close to the top edge (TPR = 1) over almost the whole FPR range, indicating that the model is very good at distinguishing between the two classes.

The AUC of 0.9958875 is close to 1.0, the value a perfect classifier would achieve.
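
To make the meaning of AUC concrete, the sketch below computes it three ways on a handful of hypothetical scores (invented for illustration, not taken from the model): by integrating the ROC curve as in the cell above, with `roc_auc_score`, and as the fraction of (positive, negative) pairs in which the positive example receives the higher score.

```python
import numpy as np
from sklearn import metrics

# Hypothetical scores for illustration: 3 negatives and 3 positives, no ties.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.6, 0.35, 0.8, 0.9])

fpr, tpr, _ = metrics.roc_curve(y_true, scores)
auc_trapezoid = metrics.auc(fpr, tpr)               # area under the ROC curve
auc_direct = metrics.roc_auc_score(y_true, scores)  # same quantity in one call

# Pairwise view: fraction of (positive, negative) pairs where the positive scores higher.
pos, neg = scores[y_true == 1], scores[y_true == 0]
auc_pairwise = np.mean(pos[:, None] > neg[None, :])

print(auc_trapezoid, auc_direct, auc_pairwise)
```

All three agree when there are no tied scores, which is why an AUC near 1.0 means almost every chicago-house track is scored above almost every cantopop track.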

#### Cross-Validation for AUC and Accuracy
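To evaluate the model's consistency, we run 5-fold cross-validation on the validation set and report the AUC and accuracy of each fold. Note that `cross_val_score` refits a fresh copy of the Random Forest on the training portion of each fold, so these scores come from models trained on subsets of the 400 validation rows rather than from the model fitted above.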
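
What 5-fold cross-validation does for a classifier can be sketched by hand. The example below is an illustration on synthetic data from `make_classification` (a stand-in, not the Spotify data), and it assumes the default behaviour for an integer `cv` with a classifier, which is stratified folds without shuffling:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the validation set (not the Spotify data).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

cv = StratifiedKFold(n_splits=5)
manual_auc = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model)                    # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])   # train on four folds
    fold_scores = fold_model.predict_proba(X[test_idx])[:, 1]
    manual_auc.append(roc_auc_score(y[test_idx], fold_scores))  # score the held-out fold

library_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(manual_auc)
print(library_auc.tolist())
```

Because the model uses a fixed `random_state`, the manual loop and `cross_val_score` should produce the same per-fold scores.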



In [13]:
# Use 5-fold Cross Validation on the validation set to calculate the AUC and accuracy of each fold
from sklearn.model_selection import cross_val_score

rf_cv_auc = cross_val_score(rf, classification_data_val.drop(columns="track_genre"), classification_data_val["track_genre"], cv=5, scoring="roc_auc")
rf_cv_accuracy = cross_val_score(rf, classification_data_val.drop(columns="track_genre"), classification_data_val["track_genre"], cv=5, scoring="accuracy")

rf_cv_auc_str = ', '.join([str(x) for x in rf_cv_auc])
rf_cv_accuracy_str = ', '.join([str(x) for x in rf_cv_accuracy])

print(f"Random Forest Model CV AUC: {rf_cv_auc_str}")
print(f"Random Forest Model CV Accuracy: {rf_cv_accuracy_str}")

Random Forest Model CV AUC: 0.9984374999999999, 0.9896875, 0.9975, 1.0, 0.996875
Random Forest Model CV Accuracy: 0.975, 0.9625, 0.9625, 1.0, 0.95
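
The per-fold scores are easier to compare once summarized. A small sketch, using the fold values copied from the output above, computes their mean and sample standard deviation:

```python
import numpy as np

# Fold scores copied from the output above.
cv_auc = np.array([0.9984374999999999, 0.9896875, 0.9975, 1.0, 0.996875])
cv_accuracy = np.array([0.975, 0.9625, 0.9625, 1.0, 0.95])

for name, scores in [("AUC", cv_auc), ("Accuracy", cv_accuracy)]:
    # ddof=1 gives the sample standard deviation across the five folds.
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std(ddof=1):.4f}")
```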


The AUC is very high in every fold (roughly 0.99 to 1.0), which may be a sign that our model is overfitting. However, since the two genres we are classifying are very different, it is also possible that the model is simply very good at distinguishing between them.
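
One common way to probe the overfitting question is to compare the score on the training data with the score on held-out data: a large gap would point to overfitting, while a small gap would not. The sketch below illustrates the idea on synthetic data from `make_classification` (a stand-in, not the Spotify data), so its numbers say nothing about the model above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Spotify dataset).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# A large gap between the two would point to overfitting; a small gap would not.
print(f"train AUC={train_auc:.3f}, validation AUC={val_auc:.3f}, gap={train_auc - val_auc:.3f}")
```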