# Project Check In Week 4: Logistic Regression

The code below is a simple demonstration of logistic regression on the Spotify dataset. To make sure the data has the proper format, we use the output of the dataclean.py script we wrote in an earlier week. For this check-in, we classify tracks into one of two genres (for simplicity) based on the other features in the data set.

In [2]:
# Import relevant libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

In [6]:
# Load the cleaned data (the duplicate check runs in the next cell)
logistic_data_orig = pd.read_excel('clean_data.xlsx')

In [36]:
assert logistic_data_orig.duplicated("track_id").sum() == 0, "There are duplicate track_ids in the data"
logistic_data_orig["track_genre"].value_counts()

track_genre
chicago-house    998
cantopop         998
alt-rock         997
breakbeat        996
forro            996
                ... 
metal            230
punk             226
house            210
indie            132
reggaeton         74
Name: count, Length: 113, dtype: int64

Since we will be classifying the data into genres, it is important that each song has only one genre. The assertion above verifies that no track_id appears more than once, so our data cleaning left each track with exactly one genre. The value_counts() call shows how many tracks each genre has. This matters because the proportion of each genre affects how the model's accuracy should be read: if one genre is much more common than the other, the model may be biased toward that genre. For this check-in, we therefore select two genres with nearly equal numbers of tracks.
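To make the balance argument concrete, the sketch below computes the accuracy a trivial "always predict the most common genre" model would reach. The counts are copied from the value_counts() output above; the helper name `majority_baseline` is hypothetical and not part of the project code.

```python
# Hypothetical helper: accuracy of always predicting the most common class.
def majority_baseline(counts):
    total = sum(counts.values())
    return max(counts.values()) / total

# Track counts copied from the value_counts() output above.
balanced = {"chicago-house": 998, "alt-rock": 997}
imbalanced = {"chicago-house": 998, "reggaeton": 74}

print(f"Balanced pair baseline:   {majority_baseline(balanced):.3f}")
print(f"Imbalanced pair baseline: {majority_baseline(imbalanced):.3f}")
```

With the balanced pair the baseline is about 0.50, so an accuracy well above that reflects real signal. With the imbalanced pair a model could score about 0.93 while ignoring the features entirely.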

In [61]:
# Select a subset of the data s.t. we only have 2 classes
# We will use the 'track_genre' column to create the classes
genre_1 = "chicago-house" # will convert to be the positive class (1)
genre_2 = "alt-rock" # will convert to be the negative class (0)

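# Split data into training and validation sets (no testing set for project check in)
# Note: sample() draws a simple random 80/20 split, so the class proportions in the
# two sets are only approximately equal (the split is not stratified)
logistic_data = logistic_data_orig[logistic_data_orig["track_genre"].isin([genre_1, genre_2])].copy()
# Encode the label explicitly; replace() on this column triggers a pandas downcasting FutureWarning
logistic_data["track_genre"] = (logistic_data["track_genre"] == genre_1).astype(int)
logistic_data = logistic_data.select_dtypes(include=[np.number]).drop(columns="Unnamed: 0")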
logistic_data_train = logistic_data.sample(frac=0.8, random_state=0)
logistic_data_val = logistic_data.drop(logistic_data_train.index)
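If the class proportions need to match exactly in both sets, a stratified split is one option. The sketch below is not the method used in this notebook: it uses scikit-learn's `train_test_split` with `stratify=` on a synthetic stand-in frame (`toy` and its `feature` column are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered data: 60% class 1, 40% class 0.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "track_genre": np.r_[np.ones(600, dtype=int), np.zeros(400, dtype=int)],
})

# stratify= keeps the class proportions the same in both parts.
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["track_genre"]
)

# Share of class 1 in each part.
print(toy_train["track_genre"].mean(), toy_val["track_genre"].mean())
```

With two genres of nearly equal size, as here, a plain random sample is usually close enough; stratification matters more when the classes are skewed or the data set is small.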


In [62]:
# Compute logistic regression over entire training data set
lr_all = LogisticRegression(solver='liblinear')
lr_all.fit(X=logistic_data_train.drop(columns="track_genre"), y=logistic_data_train["track_genre"])
lr_all.intercept_, lr_all.coef_

(array([-0.0006144]),
 array([[-3.98596532e-02,  1.30839237e-05,  9.07309268e-04,
         -5.01523843e-04, -2.37200507e-03, -1.00132704e-02,
         -1.41801006e-03,  8.47046317e-05, -6.25468669e-04,
          1.99745016e-03, -4.33626362e-04,  1.87967475e-04,
         -2.32481693e-02, -2.31026332e-03]]))
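The intercept and coefficients above define a linear score that the logistic function turns into a probability. The sketch below checks that relationship on synthetic data from `make_classification` (an assumption for illustration, not the Spotify features), using the same `liblinear` solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (not the Spotify features).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(solver="liblinear").fit(X, y)

# Linear score z = intercept + X @ coef, then the logistic function.
z = model.intercept_ + X @ model.coef_.ravel()
manual_prob = 1.0 / (1.0 + np.exp(-z))

# predict_proba[:, 1] is the probability of the positive class.
sklearn_prob = model.predict_proba(X)[:, 1]
print(np.allclose(manual_prob, sklearn_prob))
```

Because each coefficient is a change in log-odds per unit of its feature, coefficient magnitudes are only directly comparable across features that share a scale.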

In [68]:
# Evaluate the model on all observations in the validation set
X_val = logistic_data_val.drop(columns="track_genre")
y_val = logistic_data_val["track_genre"]

pred_val = pd.DataFrame({"actual": y_val, "predicted": lr_all.predict(X_val), "prob": lr_all.predict_proba(X_val)[:,1]})
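# Map the numeric labels back to genre names (label columns only, so "prob" is never altered)
label_names = {1: genre_1, 0: genre_2}
pred_val["actual"] = pred_val["actual"].map(label_names)
pred_val["predicted"] = pred_val["predicted"].map(label_names)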
pred_val

              actual      predicted      prob
1996        alt-rock       alt-rock  0.025192
2008        alt-rock  chicago-house  0.628505
2009        alt-rock  chicago-house  0.802282
2010        alt-rock  chicago-house  0.623586
2013        alt-rock  chicago-house  0.934922
...              ...            ...       ...
13014  chicago-house  chicago-house  0.750419
13018  chicago-house  chicago-house  0.713422
13029  chicago-house  chicago-house  0.908595
13036  chicago-house  chicago-house  0.736378


In [69]:
# Confusion matrix
# y_val already holds the 0/1 labels, so there is no need to convert the genre names back
conf_matrix = metrics.confusion_matrix(y_true=y_val, y_pred=lr_all.predict(X_val))
conf_matrix


array([[146,  65],
       [ 41, 147]])
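For reference, scikit-learn lays the matrix out with actual classes as rows and predicted classes as columns, in sorted label order, so the binary case reads `[[TN, FP], [FN, TP]]`. A minimal sketch on toy labels (not the validation data):

```python
from sklearn import metrics

# Toy labels: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Rows are actual classes, columns are predicted classes, both in label order (0, 1).
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Read this way, the matrix above says 146 alt-rock tracks were classified correctly, 65 alt-rock tracks were predicted as chicago-house, 41 chicago-house tracks were predicted as alt-rock, and 147 chicago-house tracks were classified correctly.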

In [70]:
# Metrics
accuracy = metrics.accuracy_score(y_true=y_val, y_pred=lr_all.predict(X_val))
error = 1 - accuracy
TPR = conf_matrix[1,1] / (conf_matrix[1,1] + conf_matrix[1,0])
FPR = conf_matrix[0,1] / (conf_matrix[0,1] + conf_matrix[0,0])
TNR = conf_matrix[0,0] / (conf_matrix[0,0] + conf_matrix[0,1])
FNR = conf_matrix[1,0] / (conf_matrix[1,0] + conf_matrix[1,1])
print(f"Accuracy: {accuracy}\nError: {error}\nTPR: {TPR}\nFPR: {FPR}\nTNR: {TNR}\nFNR: {FNR}")

Accuracy: 0.7343358395989975
Error: 0.2656641604010025
TPR: 0.7819148936170213
FPR: 0.3080568720379147
TNR: 0.6919431279620853
FNR: 0.21808510638297873



In [73]:
# Predicted probability densities 
px.histogram(pred_val, x="prob", color="actual", opacity=0.5, barmode="overlay", title="Predicted Probability Densities")

In [75]:
# ROC Curve
lr_fpr, lr_tpr, lr_thresholds = metrics.roc_curve(y_true=y_val, y_score=pred_val["prob"])
lr_thresholds


array([       inf, 0.99845447, 0.96140726, 0.95164291, 0.93672395,
       0.93492169, 0.90790959, 0.90681839, 0.88364834, 0.877199  ,
       0.85377196, 0.85197817, 0.84630286, 0.84579307, 0.83742341,
       0.83625486, 0.82441056, 0.82279503, 0.80558991, 0.80474557,
       0.80314327, 0.80228241, 0.80008431, 0.79940871, 0.7949291 ,
       0.79446726, 0.77937968, 0.77862762, 0.7712918 , 0.76563984,
       0.75498321, 0.75472398, 0.71185209, 0.705953  , 0.70426907,
       0.7028007 , 0.69638795, 0.69128113, 0.68837466, 0.68751651,
       0.6860564 , 0.68130565, 0.67817322, 0.67310935, 0.66793707,
       0.66427915, 0.65919984, 0.65458575, 0.64990299, 0.64913763,
       0.64806375, 0.64211338, 0.64158257, 0.62850512, 0.62622647,
       0.62358612, 0.61970432, 0.61202201, 0.61085412, 0.6041244 ,
       0.59826999, 0.59652682, 0.58962088, 0.58256043, 0.57946978,
       0.57511587, 0.5675682 , 0.56621561, 0.56230356, 0.55622023,
       0.55607018, 0.5544284 , 0.54620985, 0.5365761 , 0.52625

In [76]:
roc_lr = pd.DataFrame({"False Positive Rate": lr_fpr, "True Positive Rate": lr_tpr, "Model": "Logistic Regression"}, index=lr_thresholds)
roc_df = pd.concat([roc_lr]) # A no-op with a single model; ROC frames for other models can be added to this list for comparison
px.line(roc_df, y="True Positive Rate", x="False Positive Rate", color="Model", title="LR ROC Curve")
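Each point on the ROC curve corresponds to one threshold: a track is predicted positive when its score is at least that threshold. The sketch below recomputes every point returned by `roc_curve` by thresholding directly, using toy labels and scores (assumed values, not the model's output).

```python
import numpy as np
from sklearn import metrics

# Toy labels and scores for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)

# Recompute every ROC point by thresholding the scores directly.
for t, f, r in zip(thresholds, fpr, tpr):
    pred = scores >= t
    manual_tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()
    manual_fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()
    assert np.isclose(manual_tpr, r) and np.isclose(manual_fpr, f)

print(list(zip(fpr, tpr)))
```

The first threshold is chosen so that no sample is predicted positive, which gives the (0, 0) corner of the curve; the last threshold admits every sample and gives (1, 1).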

In [78]:
# Compute AUC
lr_auc = metrics.roc_auc_score(y_true=y_val, y_score=pred_val["prob"])
print("Logistic Regression AUC:", round(lr_auc, 3))

# The AUC measures the area under the ROC curve. The closer the AUC is to 1, the better the model is at distinguishing between the two classes; 0.5 corresponds to random guessing.
# AUC is threshold independent: it summarizes performance over all possible classification thresholds.
# The AUC is also less sensitive to class imbalance than accuracy, because TPR and FPR are each computed within a single class.
# Interpretation: an AUC of X (0 <= X <= 1) is the probability that a randomly chosen positive track receives a higher predicted probability than a randomly chosen negative track.

Logistic Regression AUC: 0.873
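The ranking interpretation of AUC can be checked directly. The sketch below, on toy labels and scores (assumed values, not the model's output), counts the fraction of positive/negative pairs in which the positive example scores higher and compares it with `roc_auc_score`.

```python
import itertools
import numpy as np
from sklearn import metrics

# Toy labels and scores for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs where the positive scores higher; ties count as half.
pairs = list(itertools.product(pos, neg))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

auc = metrics.roc_auc_score(y_true, scores)
print(pairwise, auc)
```

Here 7 of the 9 pairs are ranked correctly, so both values equal 7/9. By the same reading, the AUC of 0.873 reported above means a randomly chosen chicago-house track outranks a randomly chosen alt-rock track about 87% of the time.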




