#### Your objective is to classify fMRI brain images taken while listening to music in five different genres: label 0=Ambient Music, 1=Country Music, 2=Heavy Metal, 3=Rock 'n Roll, 4=Classical Symphonic. The data consists of train_data.csv, train_labels.csv, and test_data.csv, for a one-person subset of a larger 20-subject study, linked below.

#### The training data (train_data.csv) consist of 160 event-related brain images (trials), corresponding to twenty 6-second music clips, four clips in each of the five genres, repeated in-order eight times (runs). The labels (train_labels.csv) correspond to the correct musical genres, listed above, for each of the 160 trials.

#### There are 22036 features in each brain image, corresponding to blood-oxygenation levels at each 2mm-cubed 3D location within a section of the auditory cortex. In human brain imaging, there are often many more features (brain sites) than samples (trials), thus making the task a relatively challenging multiway classification problem.

#### The testing data (test_data.csv) consists of 40 event-related brain images corresponding to novel 6-second music clips in the five genres. The test data is in randomized order with no labels. You must predict, using only the given brain images, the correct genre labels (0-4) for the 40 test trials.


# Final Project

# "Classifying The Brain on Music"

Michael Casey, https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2017.01179/full


## **1. Multi-Class Genre Classifier** [[12 points]](https://)

#### Build a multi-class classifier for the 5 music genres. Your goal is to train a model to classify brain images into corresponding genre categories. You are free to choose any machine learning models from the class.

#### **1-1. Hyper-parameter Search.** [[4 points]](https://) Demonstrate your hyperparameter search process using cross-validation. Provide details for at least one hyperparameter with 10 different possible values.

#### **1-2. Model Training and Testing.** [[4 points]](https://) Following the hyperparameter search, train your model with the best combination of hyperparameters. Run the model on the test set and submit the results to the Kaggle competition. To get full marks, your model should outperform the baseline model, which is provided in Kaggle. You **must** show your test accuracy computed by Kaggle in this report.

#### **1-3. Model Analysis.** [[4 points]](https://) Conduct a thorough analysis of your model, including:

#### **1-3-1. Confusion Matrix:** Split the training set into train/validation sets. The data is organized into eight runs, in order, with each run repeating the same 20 music trials. You should split the data by run. Retrain your model using the best hyperparameter combination. Present the confusion matrix on the validation set.

#### **1-3-2. Example Examination:** Examine four validation samples where your model fails to classify into the correct category. Display the true label and the predicted label. Looking at the confusion matrix, how might you explain your results from the perspectives of human brain data and music genre similarity?
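
The analysis in 1-3-1 and 1-3-2 can be sketched as follows. This is only an illustration under stated assumptions: synthetic arrays stand in for the real brain images, the rows are assumed to be ordered run by run, the last two runs are arbitrarily held out for validation, and `LogisticRegression` is a placeholder for whichever model is finally chosen. The short `GENRES` names are introduced here for readability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

GENRES = ["Ambient", "Country", "Heavy Metal", "Rock 'n Roll", "Classical"]

# Synthetic stand-in: 8 runs x 20 trials, 4 clips per genre in each run
rng = np.random.default_rng(0)
run_ids = np.repeat(np.arange(8), 20)
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)
X_demo = rng.normal(size=(160, 50))
X_demo[:, :5] += 2.0 * np.eye(5)[y_demo]  # inject a weak class signal

# Hold out whole runs (here runs 6 and 7) for validation
val = run_ids >= 6
X_val, y_val = X_demo[val], y_demo[val]

model = LogisticRegression(max_iter=1000).fit(X_demo[~val], y_demo[~val])
y_pred = model.predict(X_val)

# Rows are true genres, columns are predicted genres
cm = confusion_matrix(y_val, y_pred, labels=np.arange(5))
print(cm)

# Show up to four misclassified validation trials
wrong = np.flatnonzero(y_pred != y_val)[:4]
for i in wrong:
    print(f"trial {i}: true={GENRES[y_val[i]]}, predicted={GENRES[y_pred[i]]}")
```

Large off-diagonal entries in `cm` point to the genre pairs the model confuses most often, which is the starting point for the discussion asked for in 1-3-2.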


---

## **A. Data Download**

#### For your convenience, we have provided code to download the dataset, which includes true labels, training data (features), training labels, and testing data (features).


#### **A-1. Download Features and Labels.**

#### Run the following code to download the brain features and labels of the music clips.


In [None]:
import numpy as np
!pip install gdown

In [None]:
!gdown --id 1aFDPryEDcT5wg0k8NhWYpF8lulGmot5J # train data
!gdown --id 11kgAdB_hkEcC4npCEWJcAOOmGe3495yY # train labels
!gdown --id 1wXq56F6RIUtDzPceZegZAMA-JGW21Gqu # test data

In [2]:
# Data Import Method 1, with pandas
import pandas as pd

train_data = pd.read_csv("../train_data.csv", header=None)
train_labels = pd.read_csv("../train_labels.csv", header=None)
test_data = pd.read_csv("../test_data.csv", header=None)

print('train_data.shape: {}'.format(train_data.shape))
print('train_labels.shape: {}'.format(train_labels.shape))
print('test_data.shape: {}'.format(test_data.shape))

train_data.shape: (160, 22036)
train_labels.shape: (160, 1)
test_data.shape: (40, 22036)


#### Data exploration


In [None]:
print("\nFirst few rows of the dataset:\n")
train_data.head(2)

In [None]:
print("\nDescriptive statistics for numerical columns:\n")
train_data.describe()

In [None]:
print("\nInformation about the dataset:\n")
print(train_data.info())

print("\nShape of the dataset (rows, columns):\n")
print(train_data.shape)

print("\nData types of each column:\n")
print(train_data.dtypes)

# print(df['categorical_column'].value_counts())

print("\nNumber of missing values in each column:\n")
print(train_data.isnull().sum())

#### Step 1: Split the data into training and validation sets


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels, test_size=0.3, random_state=0) # 70% to train
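
The random split above places trials from the same run in both sets, whereas task 1-3-1 asks for a split by run. A minimal sketch of a run-wise split follows, assuming the 160 rows are ordered run by run (8 runs of 20 trials, as described in the introduction). The helper `split_by_run`, the choice of validation runs, and the synthetic arrays are illustrative assumptions, not part of the provided code.

```python
import numpy as np

N_RUNS, TRIALS_PER_RUN = 8, 20

def split_by_run(X, y, val_runs=(6, 7)):
    """Hold out whole runs for validation (assumes rows are ordered by run)."""
    run_ids = np.repeat(np.arange(N_RUNS), TRIALS_PER_RUN)  # 0,0,...,7,7
    val_mask = np.isin(run_ids, val_runs)
    return X[~val_mask], X[val_mask], y[~val_mask], y[val_mask]

# Synthetic stand-in for train_data / train_labels (few features, for speed)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(N_RUNS * TRIALS_PER_RUN, 50))
y_demo = np.tile(np.repeat(np.arange(5), 4), N_RUNS)  # placeholder label order

X_tr, X_val, y_tr, y_val = split_by_run(X_demo, y_demo)
print(X_tr.shape, X_val.shape)
```

With the real data, the same helper could be called as `split_by_run(train_data.to_numpy(), train_labels[0].to_numpy())`.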

#### Step 2: Normalize the features using StandardScaler


In [79]:
from sklearn.preprocessing import StandardScaler

# Seems to decrease accuracy.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### Step 3: One-hot encode the target variable


In [78]:
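from sklearn.preprocessing import OneHotEncoder

# One-hot encode the integer genre labels (0-4) into five indicator columns.
# Note: despite the "_scaled" suffix, these arrays are one-hot encoded, not scaled.
encoder = OneHotEncoder(sparse_output=False)
y_train_scaled = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = encoder.transform(y_test.values.reshape(-1, 1))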
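
The classifiers evaluated in Step 4 are passed the integer labels (`y_train[0]`) directly, so the one-hot arrays are only needed for models that expect indicator targets. The sketch below, on a made-up label vector, shows the encoding and how to map one-hot rows back to genre indices. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label vector standing in for train_labels[0]
y_demo = np.array([0, 3, 1, 4, 2, 3])

encoder = OneHotEncoder()  # returns a sparse matrix by default
y_onehot = encoder.fit_transform(y_demo.reshape(-1, 1)).toarray()

# Recover the integer labels from the one-hot rows
y_back = encoder.inverse_transform(y_onehot).ravel()

print(y_onehot.shape)
print(y_back)
```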

#### Step 4: Test various models


#### RandomForestClassifier


In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# For classification
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(rf_clf, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"Random Forest Classifier Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Random Forest Classifier Accuracy: 0.599 (+/- 0.069)


#### KNeighborsClassifier


In [65]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# For classification
kn_clf = KNeighborsClassifier()
scores = cross_val_score(kn_clf, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"KNeighborsClassifier Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

KNeighborsClassifier Accuracy: 0.393 (+/- 0.093)


#### LogisticRegression


In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(max_iter=1000)
scores = cross_val_score(lr, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"Logistic Regression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Logistic Regression Accuracy: 0.678 (+/- 0.078)


In [19]:
from sklearn.decomposition import PCA

pca = PCA(n_components=80)
X_train_pca = pca.fit_transform(X_train)

lr = LogisticRegression(max_iter=1000)
scores = cross_val_score(lr, X_train_pca, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"Logistic Regression on PCA data Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

Logistic Regression on PCA data Accuracy: 0.750 (+/- 0.098)
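
In the cell above, PCA is fitted on all of `X_train` before cross-validation, so every validation fold has already influenced the projection and the score may be optimistic. One way to avoid this is to wrap PCA and the classifier in a pipeline, so that PCA is refitted on the training part of each fold. The sketch below uses synthetic data and a smaller `n_components` purely for illustration. With 112 training rows and `cv=5`, each training fold has roughly 89 rows, so `n_components=80` would still be valid on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for X_train / y_train (112 rows, as in a 70% split of 160)
rng = np.random.default_rng(0)
y_demo = np.arange(112) % 5
X_demo = rng.normal(size=(112, 300))
X_demo[:, :5] += 3.0 * np.eye(5)[y_demo]  # inject a class signal

# PCA is fitted inside each cross-validation fold, on training rows only
pipe = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(f"PCA + Logistic Regression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```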


#### SVC


In [7]:
from sklearn.svm import SVC

svc = SVC(kernel='linear', decision_function_shape='ovr')
scores = cross_val_score(svc, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"SVC Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

SVC Accuracy: 0.642 (+/- 0.114)


In [22]:
pca = PCA(n_components=80)
X_train_pca = pca.fit_transform(X_train)

svc = SVC(kernel='linear', decision_function_shape='ovr')
scores = cross_val_score(svc, X_train_pca, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"SVC Accuracy on PCA data: {scores.mean():.3f} (+/- {scores.std():.3f})")

SVC Accuracy on PCA data: 0.688 (+/- 0.061)
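
Task 1-1 asks for a cross-validated search over at least one hyperparameter with 10 candidate values. A minimal sketch using `GridSearchCV` over the SVC regularization strength `C` follows. The grid range, the step names `"pca"` and `"svc"`, the reduced `n_components`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train
rng = np.random.default_rng(0)
y_demo = np.arange(112) % 5
X_demo = rng.normal(size=(112, 300))
X_demo[:, :5] += 3.0 * np.eye(5)[y_demo]  # inject a class signal

pipe = Pipeline([
    ("pca", PCA(n_components=20)),
    ("svc", SVC(kernel="linear")),
])

# Ten log-spaced candidate values for C (hypothetical range)
param_grid = {"svc__C": np.logspace(-4, 2, 10)}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy for each candidate value
for C, score in zip(param_grid["svc__C"], search.cv_results_["mean_test_score"]):
    print(f"C={C:.4g}: {score:.3f}")
print("Best:", search.best_params_)
```

Reporting the per-value scores, rather than only the best one, documents the search process that the task asks for.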


#### DecisionTreeClassifier


In [8]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()
scores = cross_val_score(dt, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"DecisionTreeClassifier Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

DecisionTreeClassifier Accuracy: 0.562 (+/- 0.064)


#### GaussianNB

In [9]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
scores = cross_val_score(gnb, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"GaussianNB Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

GaussianNB Accuracy: 0.455 (+/- 0.055)
