# Audio data classification

## Run prior notebooks

`%%capture` hides the output of the prior notebooks.

In [1]:
%%capture
%run 06_logistic_regression_classifier.ipynb

## 07 Balance data

Both of our models classify hip-hop songs poorly. One likely cause is that the data set contains fewer hip-hop songs than rock songs, so hip-hop songs carry less weight in the fitting process. To address this, we can balance the data by downsampling the rock songs to match the number of hip-hop songs.
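
Before balancing, it can help to confirm how uneven the genres actually are. The sketch below counts tracks per genre on a small made-up dataframe; `toy` and its 7-to-3 split are assumptions for illustration only, not the real `spotify` data.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `spotify` dataframe
toy = pd.DataFrame({
    "track_id": range(10),
    "genre_top": ["Rock"] * 7 + ["Hip-Hop"] * 3,
})

# Number of tracks per genre, and each genre's share of the data
genre_counts = toy["genre_top"].value_counts()
genre_share = toy["genre_top"].value_counts(normalize=True)

print(genre_counts)
print(genre_share)
```

Running the same two `value_counts` calls on `spotify['genre_top']` should show the size of the imbalance in the real data.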

In [2]:
# Subset only the hip-hop tracks, and then only the rock tracks
hop_only = spotify.loc[spotify['genre_top'] == 'Hip-Hop']
rock_only = spotify.loc[spotify['genre_top'] == 'Rock']

# Sample the rock songs so that there are as many as there are hip-hop songs
rock_only = rock_only.sample(hop_only.shape[0], random_state=10)

# concatenate the dataframes rock_only and hop_only
rock_hop_bal = pd.concat([rock_only, hop_only])

# Create the features, labels, and PCA projection for the balanced dataframe
features = rock_hop_bal.drop(['genre_top', 'track_id'], axis=1)
labels = rock_hop_bal['genre_top']
pca_projection = pca.fit_transform(scaler.fit_transform(features))

# Redefine the train and test set with the pca_projection from the balanced data
train_features, test_features, train_labels, test_labels = train_test_split(pca_projection, labels, random_state=10)
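
A plain random split does not guarantee that the two genres stay equally represented in the train and test sets. One possible refinement, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up data; `toy_features` and `toy_labels` are hypothetical names.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 20 samples per genre, 2 features each
toy_features = np.arange(80).reshape(40, 2)
toy_labels = np.array(["Rock"] * 20 + ["Hip-Hop"] * 20)

# stratify=toy_labels keeps the genre proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.25, stratify=toy_labels, random_state=10
)

train_counts = Counter(y_train.tolist())
test_counts = Counter(y_test.tolist())
print(train_counts, test_counts)
```

With stratification, each split holds the same number of tracks from each genre, so the per-genre support in the classification report would be equal.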

Refit the decision tree and logistic regression classifiers on the balanced data and compare them.

In [3]:
# Train our decision tree on the balanced data
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)
pred_labels_tree = tree.predict(test_features)

# Train our logistic regression on the balanced data
logreg = LogisticRegression(random_state=10)
logreg.fit(train_features, train_labels)
pred_labels_logit = logreg.predict(test_features)

# Compare the models
print("Decision Tree: \n", classification_report(test_labels, pred_labels_tree))
print("Logistic Regression: \n", classification_report(test_labels, pred_labels_logit))

Decision Tree: 
               precision    recall  f1-score   support

     Hip-Hop       0.79      0.81      0.80       230
        Rock       0.80      0.78      0.79       225

    accuracy                           0.79       455
   macro avg       0.79      0.79      0.79       455
weighted avg       0.79      0.79      0.79       455

Logistic Regression: 
               precision    recall  f1-score   support

     Hip-Hop       0.81      0.83      0.82       230
        Rock       0.82      0.80      0.81       225

    accuracy                           0.82       455
   macro avg       0.82      0.82      0.82       455
weighted avg       0.82      0.82      0.82       455
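
To read these reports, recall that the f1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values printed above; `f1_from` is a hypothetical helper, and because the inputs are already rounded the results are only approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the two reports above
reported = {
    ("Decision Tree", "Hip-Hop"): (0.79, 0.81),
    ("Decision Tree", "Rock"): (0.80, 0.78),
    ("Logistic Regression", "Hip-Hop"): (0.81, 0.83),
    ("Logistic Regression", "Rock"): (0.82, 0.80),
}

# Recompute each f1-score and round to two decimals, as the report does
f1_values = {key: round(f1_from(p, r), 2) for key, (p, r) in reported.items()}

for (model, genre), f1 in f1_values.items():
    print(f"{model:20s} {genre:8s} f1 = {f1:.2f}")
```

The recomputed values match the f1-score column of both reports. On the balanced data, both models now score about the same on the two genres, with logistic regression slightly ahead of the decision tree.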

