### Final Project: Rockify

This project develops a machine-learning model that classifies rock music into sub-genres. Our model, Rockify, is trained on a dataset of rock tracks from Spotify to classify songs into specific sub-genres, including but not limited to classic rock, alternative, and heavy metal. This report walks through the methods used, the models, the results, and a discussion of the implications for music technology.

- Department: Master of Science in Technology Innovation, University of Washington
- Team Members: Yvonne Yang, Jiahui Kao, Emily Chou, Chia-Wei Chang

##### Part A. Rockify's Data Extraction

The following code uses the Spotify API and the Spotipy library to extract information about the tracks in a Spotify playlist and save it in a CSV file. 

Step 1. Import necessary libraries

In [None]:
# spotipy is a Python library for interacting with the Spotify API.
import spotipy

# SpotifyClientCredentials is a class from spotipy.oauth2 used to 
# authenticate with the Spotify API using client ID and client secret.
from spotipy.oauth2 import SpotifyClientCredentials

# csv is a Python library for reading and writing CSV files.
import csv

Step 2. Load credentials from .env file

In [None]:
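# client_id and client_secret are the unique identifiers provided by Spotify 
# when you register your application to use the Spotify API.
# They are read from a .env file instead of being hard-coded, so the secret is
# not shared along with the notebook. This assumes the python-dotenv package is
# installed and that .env defines SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET.
import os
from dotenv import load_dotenv

load_dotenv()
client_id = os.environ["SPOTIPY_CLIENT_ID"]
client_secret = os.environ["SPOTIPY_CLIENT_SECRET"]

# SpotifyClientCredentials obtains an access token from client_id and 
# client_secret; the token is then used to authenticate the spotipy client.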
client_credentials_manager = SpotifyClientCredentials(
    client_id=client_id, client_secret=client_secret)

# Spotify object, sp, that we will use later in the code.
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

Step 3. Specify the ID of the Spotify playlist you want to extract

In [None]:
# playlist_id is the unique identifier of the Spotify playlist that 
# we want to extract information from.
playlist_id = "1YC2hYS5awhGQBNaCObjyK"

Step 4. Get all the tracks in the playlist

In [None]:
# sp.playlist_tracks() method is used to get a list of all the tracks in the 
# specified playlist, playlist_id, using the spotipy.Spotify object, sp.
tracks = sp.playlist_tracks(playlist_id)

Step 5. Create a new CSV file to store the dataset <br/>
Step 6. Write the header row of the csv file <br/>
Step 7. Loop through each track and extract its information

In [None]:
# A new file named Rockify_Dataset.csv is created in write mode using Python's built-in open() function.
with open("Rockify_Dataset.csv", mode="w", newline="") as file:
    
    # A csv.writer object is created, which will be used to write data to the CSV file.
    writer = csv.writer(file)

    # A list of header row column names is created and written to the CSV file 
    # using the writerow() method from the csv.writer object created earlier.
    header = ['Track ID', 'Track Name', 'Artist', 'Album', 'Duration (ms)', 
              'Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechiness', 
              'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo']
    
    # Write the header row of the csv file.
    writer.writerow(header)
    
    # Loop through each track and extract its information.
    for track in tracks['items']:
       
       ## Get the track information.
       
       # track_id: The unique ID of the track in the Spotify database.
       track_id = track['track']['id']    
       # track_name: The name of the track.
       track_name = track['track']['name']
       # artist: The name of the artist.
       artist = track['track']['artists'][0]['name']
       # album: The name of the album the track is from.
       album = track['track']['album']['name']
       # duration_ms: The duration of the track in milliseconds.
       duration_ms = track['track']['duration_ms']

       # Use the track_id to obtain the audio features for the track 
       # using the sp.audio_features() function. 
       audio_features = sp.audio_features(track_id)[0]
       
       ## Extract the audio features.
       
       # danceability: A value representing how suitable the track is 
       #               for dancing based on a combination of musical elements.
       danceability = audio_features['danceability']
       # energy: A value representing the intensity and activity level of the track.
       energy = audio_features['energy']
       # key: The musical key the track is in.
       key = audio_features['key']
       # loudness: The overall loudness of the track in decibels.
       loudness = audio_features['loudness']
       # mode: Whether the track is in a major or minor key.
       mode = audio_features['mode']
       # speechiness: A value representing the amount of spoken word in the track.
       speechiness = audio_features['speechiness']
       # acousticness: A value representing how acoustic the track is.
       acousticness = audio_features['acousticness']
       # instrumentalness: A value representing how instrumental the track is.
       instrumentalness = audio_features['instrumentalness']
       # liveness: A value representing the presence of a live audience in the recording.
       liveness = audio_features['liveness']
       # valence: A value representing the musical positiveness conveyed by a track.
       valence = audio_features['valence']
       # tempo: The overall tempo or speed of the track in beats per minute.
       tempo = audio_features['tempo']

       # Write the row to the CSV file.
       row = [track_id, track_name, artist, album, duration_ms, 
              danceability, energy, key, loudness, mode, speechiness, 
              acousticness, instrumentalness, liveness, valence, tempo]
       writer.writerow(row)
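
The loop above makes one `sp.audio_features()` request per track. The same spotipy method also accepts a list of track IDs, so the requests can be batched. The sketch below is one possible way to do that; the helper names `chunked` and `fetch_audio_features`, the batch size of 100, and the `_FakeFeaturesSpotify` stand-in (used so the example runs offline) are assumptions for illustration.

```python
def chunked(values, size):
    """Yield consecutive slices of `values` with at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def fetch_audio_features(sp, track_ids, batch_size=100):
    """Return {track_id: features} using one request per batch (hypothetical helper)."""
    features_by_id = {}
    for batch in chunked(track_ids, batch_size):
        for track_id, features in zip(batch, sp.audio_features(batch)):
            # An entry can be None when no features are available; skip those tracks.
            if features is not None:
                features_by_id[track_id] = features
    return features_by_id


class _FakeFeaturesSpotify:
    """Offline stand-in for spotipy.Spotify (illustration only)."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        return [None if t == 'missing' else {'id': t, 'tempo': 120.0} for t in tracks]


fake_sp = _FakeFeaturesSpotify()
features = fetch_audio_features(fake_sp, ['a', 'b', 'missing', 'c'], batch_size=2)
print(sorted(features), fake_sp.calls)
```

Skipping `None` entries also guards against the `TypeError` that indexing a missing result such as `audio_features['danceability']` would raise.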

##### Part B. Rockify's ML Training Models & Evaluations

The following code builds and evaluates several classifier models using the data extracted in Part A.

Step 1. Import necessary libraries

In [None]:
# Used in Step 2-4
import pandas as pd

# Used in Step 4
import seaborn as sns
import matplotlib.pyplot as plt

# Used in Step 5
from sklearn.model_selection import train_test_split

# Used in Step 6
from sklearn.tree import DecisionTreeClassifier

# Used in Step 6-10
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# Used in Step 7
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Used in Step 8
from sklearn.neighbors import KNeighborsClassifier

# Used in Step 9
from sklearn.svm import SVC

# Used in Step 10
from sklearn.linear_model import LogisticRegression

Step 2. Load the dataset file

In [None]:
# Load the 'Rockify_Dataset.csv' file written in Part A using the pandas 
# library's read_csv function and store it in the data variable. The relative
# path assumes the notebook runs in the directory where the file was saved.
data = pd.read_csv("Rockify_Dataset.csv")

Step 3. Cleaning non-numeric data

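The code drops the identifier and text columns ('Track ID', 'Track Name', 'Artist', 'Album'), which the classifiers cannot use as numeric features, and encodes the categorical 'Genre' labels as integers using a dictionary. **This step ensures that every remaining column, including 'Genre', contains only numeric values, which is necessary for the machine learning algorithms used later in the code.** The CSV written in Part A has no 'Genre' column, so the genre label is assumed to have been added to the dataset before this step.

In [None]:
# Drop the non-numeric identifier and text columns. errors='ignore' skips any
# of these columns that are not present in the loaded file.
data = data.drop(columns=['Track ID', 'Track Name', 'Artist', 'Album'], errors='ignore')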

# Next, we convert the values in the 'Genre' column, which is a categorical variable, 
# to numeric values. It first obtains the unique values of the 'Genre' column and converts 
# them to a list using the unique and tolist functions, respectively.
genre_values = data['Genre'].unique().tolist()

# We then create a dictionary that maps each genre to a unique integer value. 
# The for loop iterates through the values in the genre_values list and 
# assigns each value to a unique integer. The resulting dictionary is stored 
# in the genre_dic variable.
genre_dic = {}
for i in range(len(genre_values)):
    genre_dic[genre_values[i]] = i

# Finally, the map function of the Pandas library is used to apply the dictionary to 
# the 'Genre' column, converting each genre value to its corresponding integer value.
data['Genre'] = data['Genre'].map(genre_dic)
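
Because the models predict integers, it helps to keep the inverse mapping so that predictions can be turned back into genre names. The sketch below rebuilds the same label-to-integer dictionary with a dict comprehension and adds that inverse. The sample labels and the name `genre_names` are made up for illustration.

```python
# Hypothetical genre labels standing in for data['Genre'].
genres = ['classic rock', 'heavy metal', 'classic rock', 'alternative', 'heavy metal']

# Unique labels in order of first appearance, like Series.unique().tolist().
genre_values = list(dict.fromkeys(genres))

# Same mapping as the loop above, written as a dict comprehension.
genre_dic = {genre: index for index, genre in enumerate(genre_values)}

# Inverse mapping, used to turn predicted integers back into genre names.
genre_names = {index: genre for genre, index in genre_dic.items()}

encoded = [genre_dic[g] for g in genres]
decoded = [genre_names[i] for i in encoded]
print(genre_dic)
print(encoded)
```

Note that the integer codes are arbitrary labels: they carry no ordering between genres.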

Step 4. Visualizing variable correlation

The code computes the pairwise correlations between the columns of the data and draws them as a heatmap using the seaborn library's heatmap function; plt.show() then displays the figure. The purpose is to gain insight into the relationships between the different features in the dataset.

With the 'RdYlGn' color map used here, greener cells indicate more positive correlations and redder cells indicate more negative correlations. **By examining the heatmap, we can identify which features are most strongly correlated with each other, which can be useful for feature selection and model building.**

In [None]:
plt.figure(figsize=(10, 10))

# We use the sns.heatmap() function to create a heatmap of the correlation matrix, 
# passing in the result of data.corr(), which computes 
# the correlation between each pair of columns. 
# The annot=True parameter is used to display the correlation coefficients in each cell.
# The cmap parameter is used to set the color map for the heatmap.
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn', linewidths=0.2)

fig = plt.gcf()
plt.show()
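
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `data.corr()`. As a worked illustration of what a single cell means, the sketch below computes the coefficient by hand. The function name `pearson_r` and the feature values are made up for the example and are not taken from the Rockify dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up feature values for four tracks.
energy = [0.2, 0.4, 0.6, 0.8]
loudness = [-12.0, -9.0, -6.0, -3.0]   # rises linearly with energy
acousticness = [0.9, 0.7, 0.5, 0.3]    # falls linearly with energy

print(round(pearson_r(energy, loudness), 3))
print(round(pearson_r(energy, acousticness), 3))
```

A value near 1 means two features rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.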

Step 5. Splitting the data

The code splits the data into training and testing sets using the train_test_split function from the scikit-learn library.

- Features: These are the **input variables** or independent variables that are used to make predictions or **to train the model**. 
- Target: This is the **output variable** or dependent variable that **we want to predict** or classify.

The number 42 is a reference to the book "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. In the book, a group of hyper-intelligent beings ask a supercomputer named Deep Thought to calculate the answer to the ultimate question of life, the universe, and everything. After much anticipation, Deep Thought finally reveals that the answer is 42, but the characters are disappointed because they don't know what the question is.

The use of 42 as the value for random_state in the train_test_split() function is a playful reference to this book and has become a convention among many data scientists and machine learning practitioners. However, in practice, any non-negative integer value can be used for random_state. 

In [None]:
# X, y are the features and target variables respectively.

X = data.drop(['Genre'], axis=1)
y = data['Genre']

# test_size = 0.2 specifies that 20% of the data should be used for testing.
# random_state = 42 ensures that the same random split is obtained each time the code is run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
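
When some sub-genres have far fewer tracks than others, a purely random split can leave a class under-represented in the test set. `train_test_split` accepts a `stratify` argument that preserves the class proportions in both parts. The sketch below shows this on made-up, imbalanced labels; whether the Rockify dataset needs it depends on its class balance, which is not stated here.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 80 tracks of class 0 and 20 of class 1.
X_demo = [[i] for i in range(100)]
y_demo = [0] * 80 + [1] * 20

# stratify=y_demo keeps the 80/20 class ratio in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(Counter(y_tr))
print(Counter(y_te))
```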

In the following steps, we implement Decision Tree, Random Forest, KNN, SVM, and Logistic Regression classifiers.

Step 6. Building and Evaluating: Decision Tree Classifier Model

In [None]:
# A DecisionTreeClassifier() model is initialized with the default hyperparameters.
dt_model = DecisionTreeClassifier()

# The model is trained using the training data using the .fit() function.
dt_model.fit(X_train, y_train)

# The predictions for the test data are then made using the .predict() function, 
# and these predictions are stored in the y_pred variable.
y_pred = dt_model.predict(X_test)

## Evaluate the model.

# The accuracy_score() function calculates the accuracy of the model on the test data, 
# which is the fraction of correctly classified examples over the total number of examples.
print('Accuracy:', accuracy_score(y_test, y_pred))

# The confusion_matrix() function returns a matrix whose entry (i, j) is the number of 
# test samples whose true class is i and whose predicted class is j.
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))

# The classification_report() function returns a string that summarizes 
# the classification metrics for each class in a tabular format. 
# The metrics include precision, recall, f1-score, and support
# (the number of samples in each class).
print('Classification Report:', classification_report(y_test, y_pred))
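
To make the metrics concrete, the sketch below builds a three-class confusion matrix by hand and derives accuracy, recall, and precision from it. The labels are made up for illustration, and the variable names are not part of the notebook.

```python
# Made-up true and predicted labels for six test tracks (three genres: 0, 1, 2).
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

n_classes = 3
# matrix[i][j] counts samples whose true class is i and predicted class is j,
# the same row/column convention that scikit-learn's confusion_matrix uses.
matrix = [[0] * n_classes for _ in range(n_classes)]
for true_label, predicted_label in zip(y_true, y_hat):
    matrix[true_label][predicted_label] += 1

# Accuracy is the sum of the diagonal divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(n_classes)) / len(y_true)

# Recall for class i: correct predictions for i divided by the row total.
recall = [matrix[i][i] / sum(matrix[i]) for i in range(n_classes)]
# Precision for class j: correct predictions for j divided by the column total.
precision = [matrix[j][j] / sum(row[j] for row in matrix) for j in range(n_classes)]

print(matrix)
print(accuracy, recall, precision)
```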

Step 7-1. Building and Evaluating: Random Forest Classifier Model

In [None]:
# A random forest classifier model is created with 100 estimators and a maximum depth of 10.
rfc_model = RandomForestClassifier(n_estimators=100, max_depth=10)

# The model is trained using the training data using the .fit() function.
rfc_model.fit(X_train, y_train)

# The model is used to predict the genre of the test data (X_test). 
# The predicted genre values are stored in y_pred.
y_pred = rfc_model.predict(X_test)

# Evaluate the model.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))

Step 7-2. Perform a grid search to find the best hyperparameters for the random forest model

In [None]:
# The param_grid dictionary contains a set of values to be tested for the hyperparameters 
# such as the number of trees in the forest (n_estimators), 
# the maximum depth of each tree (max_depth), 
# the minimum number of samples required to split a node (min_samples_split), 
# the minimum number of samples required to be at a leaf node (min_samples_leaf), and 
# the method of sampling data (bootstrap).
param_grid = {'n_estimators': [100, 200, 300, 400, 500],
              'max_depth': [5, 10, 15, 20, 25, None],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [1, 2, 4],
              'bootstrap': [True, False]}

model_for_grid_search = RandomForestClassifier()

# The GridSearchCV function from the scikit-learn library performs the grid search 
# by fitting the model with each set of hyperparameters and 
# evaluating its performance using cross-validation.
# The cv parameter specifies the number of folds for the cross-validation.
grid_search = GridSearchCV(model_for_grid_search, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Once the grid search is complete, the best hyperparameters are available in the
# best_params_ attribute, a dictionary keyed by parameter name.
best_params = grid_search.best_params_
print('Best parameters:', best_params)

# Retrain the model using the optimal parameters. They are passed by keyword
# because the order of the dictionary's values does not match the order of
# RandomForestClassifier's arguments.
model = RandomForestClassifier(**best_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate the model.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))
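
`best_params_` is a dictionary keyed by parameter name, so its entries are safest passed to the classifier by keyword. With the default `refit=True`, `GridSearchCV` also retrains the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below illustrates both on synthetic data with a deliberately tiny grid; the data and the grid values are assumptions chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train (not the Rockify dataset).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_classes=3, random_state=42)

# A deliberately tiny grid so the sketch runs in seconds.
small_grid = {'n_estimators': [10, 20], 'max_depth': [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), small_grid, cv=3)
search.fit(X_demo, y_demo)

# best_params_ is a dict keyed by parameter name, so it can be unpacked with **.
rebuilt = RandomForestClassifier(random_state=42, **search.best_params_)

# With the default refit=True, the best model is already retrained and available.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.predict(X_demo[:5]))
```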

Step 8. Building and Evaluating: K-Nearest Neighbors Classifier Model

In [None]:
# A K-nearest neighbors classifier model is created with 5 neighbors.
knn_model = KNeighborsClassifier(n_neighbors=5)

# The model is trained using the training data using the .fit() function.
knn_model.fit(X_train, y_train)

# The model is used to predict the genre of the test data (X_test). 
# The predicted genre values are stored in y_pred.
y_pred = knn_model.predict(X_test)

# Evaluate the model.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))

Step 9. Building and Evaluating: Support Vector Machine (SVM) Classifier Model

In [None]:
# A support vector machine (SVM) classifier model is created with 
# a radial basis function (RBF) kernel, 
# a regularization parameter of 1.0, and 
# gamma='auto', which sets the kernel coefficient gamma to 1 / n_features.
svm_model = SVC(kernel='rbf', C=1.0, gamma='auto')

# The model is trained using the training data using the .fit() function.
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

# Evaluate the model.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))
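
KNN and the RBF-kernel SVM both rely on distances between feature vectors, so a feature measured on a large scale, such as 'Duration (ms)' or 'Tempo', can dominate features bounded between 0 and 1. A common remedy is to standardize the features inside a pipeline. The sketch below shows this on synthetic data with one artificially inflated column; it is an illustration under those assumptions and not a result from the Rockify dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the Rockify features (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=42)
# Inflate one column to mimic a large-scale feature such as 'Duration (ms)'.
X_demo[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='auto'))
scaled_svm.fit(X_tr, y_tr)

print('Accuracy with scaling:', scaled_svm.score(X_te, y_te))
```

Keeping the scaler inside the pipeline means the test data never influences the scaling statistics.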

Step 10. Building and Evaluating: Logistic Regression Classifier Model

In [None]:
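# A logistic regression model is created to predict Genre. max_iter is raised from
# its default of 100 to give the solver more iterations to converge on unscaled features.
model = LogisticRegression(max_iter=1000)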
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate the model.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion Matrix:', confusion_matrix(y_test, y_pred))
print('Classification Report:', classification_report(y_test, y_pred))
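
Steps 6 to 10 repeat the same fit, predict, and evaluate pattern, so the models can also be compared side by side in a single loop. The sketch below shows one possible layout on synthetic data with three of the classifiers; the data, the `models` dictionary, and the chosen hyperparameters are assumptions for illustration and do not reproduce the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Rockify features and genre labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}

# Fit every model on the same split and record its test accuracy.
scores = {}
for name, candidate in models.items():
    candidate.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, candidate.predict(X_te))

# Print the models from highest to lowest accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {score:.3f}')
```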