# Week 5 Check-In
## Team Spotiflies: Joanna, Aaron, Aubrey, Kennedy, Aster, Ethan
GitHub Link: https://github.com/ketexon/csm148-spotiflies

In [2]:
%pip install pandas numpy matplotlib seaborn scikit-learn mlxtend plotly

In [None]:
# Set up the data set as in Week 4


import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Reading in the cleaned data from previous week check in
spotify = pd.read_csv("csv_outputs/cleaned_spotify.csv")

# Select the variables of interest
selected_spotify = spotify[['mode', 'valence']]

random_seed = 42
response = 'mode'
predictor = 'valence'

# Splitting the data
# First split: separate out 20% for the test set
spotify_train_val, spotify_test = train_test_split(selected_spotify, test_size=0.2, random_state=random_seed)

# Second split: hold out 25% of the remaining 80% for validation (60% train / 20% validation / 20% test overall)
spotify_train, spotify_val = train_test_split(spotify_train_val, test_size=0.25, random_state=random_seed)  # 0.25 * 0.8 = 0.2

# Separate the predictor column from the response
X_train = spotify_train.drop(columns=response)
y_train = spotify_train[response]

# Fit the logistic regression model
logistic_reg = LogisticRegression(solver='liblinear')
logistic_reg.fit(X=X_train, y=y_train)

# Generate evenly spaced valence values for plotting the fitted curve.
# A DataFrame with the same column name as the training data avoids a feature-name warning.
x_values = pd.DataFrame(np.linspace(0, 1, 100), columns=[predictor])

# Predicted probability of mode = 1 at each valence value
y_values = logistic_reg.predict_proba(x_values)[:, 1]

### KNN Algorithm:

In [14]:
# Import necessary libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Define the response and predictor variables
response = 'mode'
predictor = 'valence'

# Reuse the train/validation split from the setup cell
X_train = spotify_train.drop(columns=response)
y_train = spotify_train[response]

X_val = spotify_val.drop(columns=response)
y_val = spotify_val[response]

# Initialize the KNN model
# Set n_neighbors to the desired number (e.g., 5) - you can tune this hyperparameter later
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the KNN model on the training data
knn.fit(X_train, y_train)

# Predict the values on the validation set
y_val_pred = knn.predict(X_val)

# Evaluate the model
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print(classification_report(y_val, y_val_pred))

Validation Accuracy: 0.5777192982456141
              precision    recall  f1-score   support

           0       0.38      0.27      0.32      8276
           1       0.64      0.75      0.69     14524

    accuracy                           0.58     22800
   macro avg       0.51      0.51      0.51     22800
weighted avg       0.55      0.58      0.56     22800



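Unfortunately, our KNN model's accuracy for predicting `mode` from `valence` isn't very good: about 0.58 on the validation set, which is below the 0.637 that always predicting class 1 would achieve (14524 of the 22800 validation rows are class 1). Recall for class 0 is only 0.27. We can continue to check how well `valence` predicts `mode` with the confusion matrix and other metrics.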
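The KNN cell above fixes `n_neighbors=5` and notes that the value can be tuned later. A minimal sketch of that tuning step follows. It assumes synthetic single-feature data in place of the Spotify set, and the data, `candidate_ks`, and variable names are illustrative rather than part of our notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): one feature in [0, 1] and a binary
# label whose probability of being 1 rises with the feature.
rng = np.random.default_rng(42)
n = 2000
X = rng.random((n, 1))
y = (rng.random(n) < 0.3 + 0.5 * X[:, 0]).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit one model per candidate k and score each on the validation set.
candidate_ks = [1, 5, 15, 51, 101]
val_acc = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc[k] = accuracy_score(y_va, model.predict(X_va))

best_k = max(val_acc, key=val_acc.get)

# Accuracy of always predicting the more common class, for comparison.
baseline = max(y_va.mean(), 1 - y_va.mean())

for k, acc in val_acc.items():
    print(f"k={k:>3}: validation accuracy = {acc:.3f}")
print(f"best k = {best_k}, majority-class baseline = {baseline:.3f}")
```

Comparing each accuracy against the majority-class baseline shows whether any choice of k adds information beyond the class imbalance.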

### Calculating the Confusion Matrix + Metrics

In [17]:
# calculate the confusion matrix using the validation set
y_pred = logistic_reg.predict(spotify_val.drop(columns=response))
y_true = spotify_val[response]
conf = metrics.confusion_matrix(y_pred=y_pred, y_true=y_true)
print('confusion matrix:\n', conf)
print('Prediction Accuracy:', metrics.accuracy_score(y_true=y_true, y_pred=y_pred))
print('Prediction Error:', 1 - metrics.accuracy_score(y_true=y_true, y_pred=y_pred))
print('True Positive Rate:', metrics.recall_score(y_true=y_true, y_pred=y_pred))
print('True Negative Rate:', metrics.recall_score(y_true=y_true, y_pred=y_pred, pos_label=0))
print('F1 score:', metrics.f1_score(y_true=y_true, y_pred=y_pred))

confusion matrix:
 [[    0  8276]
 [    0 14524]]
Prediction Accuracy: 0.6370175438596491
Prediction Error: 0.36298245614035085
True Positive Rate: 1.0
True Negative Rate: 0.0
F1 score: 0.7782659950701961


Based on the confusion matrix, the logistic regression model predicts the positive class (`mode` = 1) for every validation example: the true positive rate is 1.0 and the true negative rate is 0.0. This is a sign that the model doesn't fit our data very well. However, we can continue to investigate using the ROC curve and AUC.
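As a check on the printed metrics, the same values can be recomputed by hand from the four cells of the confusion matrix. This is a small sketch that copies the counts from the output above; the variable names are illustrative.

```python
# Counts copied from the printed confusion matrix
# (rows = true class, columns = predicted class).
tn, fp = 0, 8276
fn, tp = 0, 14524

total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of correct predictions
tpr = tp / (tp + fn)                  # true positive rate (recall for class 1)
tnr = tn / (tn + fp)                  # true negative rate (recall for class 0)
precision = tp / (tp + fp)            # share of predicted positives that are correct
f1 = 2 * precision * tpr / (precision + tpr)

# Share of the validation set that belongs to class 1.
majority_share = (tp + fn) / total

print("accuracy:", accuracy)
print("TPR:", tpr, "TNR:", tnr)
print("F1:", f1)
print("share of class 1:", majority_share)
```

Because every example is predicted as class 1, the accuracy equals the share of class 1 in the validation set.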

### ROC Curve + AUC Calculation

In [23]:
import plotly.express as px

# Create the ROC curve variables
logistic_reg_fpr_sample, logistic_reg_tpr_sample, logistic_reg_thresholds_sample = metrics.roc_curve(
    spotify_val[response], logistic_reg.predict_proba(spotify_val.drop(columns=response))[:, 1]
)

# Calculate AUC
logistic_reg_auc_sample = metrics.roc_auc_score(
    spotify_val[response], logistic_reg.predict_proba(spotify_val.drop(columns=response))[:, 1]
)
print('Logistic regression AUC:', logistic_reg_auc_sample.round(3))

# Prepare DataFrame for plotting
roc_logistic_reg_sample = pd.DataFrame({
    'False Positive Rate': logistic_reg_fpr_sample,
    'True Positive Rate': logistic_reg_tpr_sample,
    'Model': f'Logistic Regression (AUC = {logistic_reg_auc_sample:.3f})'
}, index=logistic_reg_thresholds_sample)

roc_sample_df = pd.concat([roc_logistic_reg_sample])

# ROC Plot with AUC in the title
fig = px.line(
    roc_sample_df,
    y='True Positive Rate',
    x='False Positive Rate',
    color='Model',
    width=700,
    height=500,
    title=f"ROC Plot (AUC = {logistic_reg_auc_sample:.3f})"
)

# Show plot
fig.show()


Logistic regression AUC: 0.511


The AUC is about 0.511, which is essentially equivalent to random guessing (an AUC of 0.5). In other words, `valence` gives the logistic model almost no ability to rank songs with `mode` = 1 above songs with `mode` = 0, so the logistic model is probably not a good predictor for our data.
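To make the "random guessing" comparison concrete, the AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity directly on synthetic labels and scores. `pairwise_auc` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Synthetic labels (assumption) with roughly the class balance seen above.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.637).astype(int)

auc_random = pairwise_auc(y, rng.random(1000))        # scores unrelated to labels
auc_constant = pairwise_auc(y, np.full(1000, 0.6))    # same score for everyone
auc_perfect = pairwise_auc(y, y.astype(float))        # scores separate the classes

print("unrelated scores:", round(auc_random, 3))
print("constant scores: ", auc_constant)
print("perfect scores:  ", auc_perfect)
```

Scores that carry no information about the label land near 0.5 regardless of class imbalance, which is why an AUC of 0.511 indicates almost no predictive signal.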

### 5-Fold CV + AUC Calculation

In [25]:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=5)
i = 1
X = spotify_val.drop(columns=response)
y = spotify_val[response]
for train_index, test_index in skfolds.split(X, y):
    clone_lr = clone(logistic_reg)
    X_train_folds = X.iloc[train_index]
    y_train_folds = y.iloc[train_index]
    X_test_fold = X.iloc[test_index]
    print(test_index)
    clone_lr.fit(X_train_folds, y_train_folds)
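    y_pred = clone_lr.predict(X_test_fold)

    # Note: roc_auc_score receives hard class labels here, not probabilities.
    # When every prediction is the same class, the result is exactly 0.5.
    auc_sample = metrics.roc_auc_score(y.iloc[test_index], y_pred)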
    print('Fold: ', i)
    print('AUC: ', auc_sample)
    print('Accuracy: ', metrics.accuracy_score(y.iloc[test_index], y_pred))

    i += 1

[   0    1    2 ... 4661 4662 4670]
Fold:  1
AUC:  0.5
Accuracy:  0.6370614035087719
[4491 4492 4493 ... 9161 9164 9167]
Fold:  2
AUC:  0.5
Accuracy:  0.6370614035087719
[ 9086  9089  9092 ... 13753 13754 13757]
Fold:  3
AUC:  0.5
Accuracy:  0.6370614035087719
[13641 13642 13643 ... 18306 18310 18311]
Fold:  4
AUC:  0.5
Accuracy:  0.6370614035087719
[18216 18218 18219 ... 22797 22798 22799]
Fold:  5
AUC:  0.5
Accuracy:  0.6368421052631579


We used the default classification threshold of 0.5 as a starting point. With it, the logistic model reaches an accuracy of about 0.637 and an AUC of about 0.511 on the validation set, so it does not do meaningfully better than random guessing. The accuracy matches the share of class 1 in `mode` (14524 of 22800 validation rows, about 0.637), so the apparent gain over 0.5 comes from the class imbalance and not from the model: it predicts the majority class for every song. The cross-validation folds show the same accuracy, and their AUC of exactly 0.5 reflects that the AUC there was computed from these constant hard predictions. In conclusion, a logistic model on `valence` alone does not capture the relationship with `mode` well.
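The cross-validation folds report an AUC of exactly 0.5 while the validation AUC was 0.511. One likely reason is that the fold loop passes hard class labels to `roc_auc_score`, whereas the ROC cell passes predicted probabilities. The sketch below illustrates the difference on synthetic scores. The labels and probabilities are made up for the example and are not model output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and "probabilities" (assumption): every probability is
# above 0.5, and class 1 gets a slightly higher score on average.
rng = np.random.default_rng(42)
n = 5000
y_true = (rng.random(n) < 0.637).astype(int)
proba = 0.55 + 0.10 * rng.random(n) + 0.02 * y_true

# Thresholding at 0.5 turns every prediction into class 1.
hard_labels = (proba >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, hard_labels)  # constant input
auc_from_proba = roc_auc_score(y_true, proba)         # uses the ranking

print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", round(auc_from_proba, 3))
```

Passing `predict_proba(...)[:, 1]` inside the fold loop would make the per-fold AUC comparable to the validation AUC, since both would then measure ranking quality and not the fixed 0.5 threshold.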