#### Your objective is to classify fMRI brain images taken while listening to music in five different genres: label 0=Ambient Music, 1=Country Music, 2=Heavy Metal, 3=Rock 'n Roll, 4=Classical Symphonic. The data consists of train_data.csv, train_labels.csv, and test_data.csv, for a one-person subset of a larger 20-subject study, linked above.

#### The training data (train_data.csv) consist of 160 event-related brain images (trials), corresponding to twenty 6-second music clips, four clips in each of the five genres, repeated in-order eight times (runs). The labels (train_labels.csv) correspond to the correct musical genres, listed above, for each of the 160 trials.

#### There are 22036 features in each brain image, corresponding to blood-oxygenation levels at each 2mm-cubed 3D location within a section of the auditory cortex. In human brain imaging, there are often many more features (brain sites) than samples (trials), thus making the task a relatively challenging multiway classification problem.

#### The testing data (test_data.csv) consists of 40 event-related brain images corresponding to novel 6-second music clips in the five genres. The test data is in randomized order with no labels. You must predict, using only the given brain images, the correct genre labels (0-4) for the 40 test trials.


# Final Project

# "Classifying The Brain on Music"

Michael Casey, https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2017.01179/full


## **1. Multi-Class Genre Classifier** [12 points]

#### Build a multi-class classifier for the 5 music genres. Your goal is to train a model to classify brain images into corresponding genre categories. You are free to choose any machine learning models from the class.

#### **1-1. Hyper-parameter Search.** [4 points] Demonstrate your hyperparameter search process using cross-validation. Provide details for at least one hyperparameter with 10 different possible values.

#### **1-2. Model Training and Testing.** [4 points] Following the hyperparameter search, train your model with the best combination of hyperparameters. Run the model on the test set and submit the results to the Kaggle competition. To get full marks, your model should outperform the baseline model, which is provided in Kaggle. You **must** show your test accuracy computed by Kaggle in this report.

#### **1-3. Model Analysis.** [4 points] Conduct a thorough analysis of your model, including:

#### **1-3-1. Confusion Matrix:** Split the training set into train/validation sets. The data is organized into eight runs, in order, with each run repeating the same 20 music trials. You should split the data by run. Retrain your model using the best hyperparameter combination. Present the confusion matrix on the validation set.

#### **1-3-2. Example Examination:** Examine four validation samples where your model fails to classify into the correct category. Display the true label and the predicted label. Looking at the confusion matrix, how might you explain your results from the perspectives of human brain data and music genre similarity?


---

## **A. Data Download**

#### For your convenience, we have provided code to download the dataset, which includes true labels, training data (features), training labels, and testing data (features).


#### **A-1. Download Features and Labels.**

#### Run the following code to download the brain features and labels of the music clips.


In [1]:
import numpy as np
!pip install gdown


In [1]:
!gdown --id 1aFDPryEDcT5wg0k8NhWYpF8lulGmot5J # train data
!gdown --id 11kgAdB_hkEcC4npCEWJcAOOmGe3495yY # train labels
!gdown --id 1wXq56F6RIUtDzPceZegZAMA-JGW21Gqu # test data

Downloading...
From: https://drive.google.com/uc?id=1aFDPryEDcT5wg0k8NhWYpF8lulGmot5J
To: /home/jonathan/external/le-wagon-data-science/project/project/notebook-models-experimentation/jonathan/train_data.csv
100%|██████████████████████████████████████| 89.7M/89.7M [01:00<00:00, 1.48MB/s]
Downloading...
From: https://drive.google.com/uc?id=11kgAdB_hkEcC4npCEWJcAOOmGe3495yY
To: /home/jonathan/external/le-wagon-data-science/project/project/notebook-models-experimentation/jonathan/train_labels.csv
100%|██████████████████████████████████████████| 320/320 [00:00<00:00, 1.88MB/s]
Downloading...
From: https://drive.google.com/uc?id=1wXq56F6RIUtDzPceZegZAMA-JGW21Gqu
To: /home/jonathan/external/le-wagon-data-science/project/project/notebook-models-experimentation/jonathan/test_data.csv
100%|██████████████████████████████████████| 22.5M/22.5M [00:08<00:00, 2.70MB/s]


In [2]:
# Data Import Method 1, with pandas
import pandas as pd

train_data = pd.read_csv("../data/train_data.csv", header=None)
train_labels = pd.read_csv("../data/train_labels.csv", header=None)
test_data = pd.read_csv("../data/test_data.csv", header=None)

print('train_data.shape: {}'.format(train_data.shape))
print('train_labels.shape: {}'.format(train_labels.shape))
print('test_data.shape: {}'.format(test_data.shape))

train_data.shape: (160, 22036)
train_labels.shape: (160, 1)
test_data.shape: (40, 22036)


In [21]:
test_data[:1].transpose().to_json()

'{"0":{"0":-0.7178518774,"1":-1.4899767689,"2":-0.7884903493,"3":-0.7662788071,"4":-0.6728429014,"5":-0.4850071408,"6":0.3966876826,"7":0.6679982176, ...

In [11]:
test_data.head(0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22026,22027,22028,22029,22030,22031,22032,22033,22034,22035


#### Data exploration


In [None]:
print("\nFirst few rows of the dataset:\n")
train_data.head(2)

In [None]:
print("\nDescriptive statistics for numerical columns:\n")
train_data.describe()

In [None]:
print("\nInformation about the dataset:\n")
print(train_data.info())

print("\nShape of the dataset (rows, columns):\n")
print(train_data.shape)

print("\nData types of each column:\n")
print(train_data.dtypes)

# print(df['categorical_column'].value_counts())

print("\nNumber of missing values in each column:\n")
print(train_data.isnull().sum())

#### Step 1: Split the data into training and held-out test sets


In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_data, train_labels, test_size=0.3, random_state=0) # 70% to train
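
The task statement asks for a split by run, whereas `train_test_split` shuffles trials across runs. The following is a minimal sketch of a run-wise split, assuming the rows are stored run by run with 20 trials per run (as the task statement describes). The helper names `make_run_ids` and `split_by_run`, the choice of runs 6 and 7 for validation, and the synthetic arrays are all illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def make_run_ids(n_trials=160, trials_per_run=20):
    """Assign a run index to each trial, assuming rows are stored run by run."""
    return np.arange(n_trials) // trials_per_run

def split_by_run(X, y, run_ids, val_runs):
    """Hold out whole runs for validation so no run appears in both sets."""
    val_mask = np.isin(run_ids, val_runs)
    return X[~val_mask], X[val_mask], y[~val_mask], y[val_mask]

# Synthetic stand-in for the real data (the real label order may differ).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(160, 50))
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)  # 4 clips per genre, 8 runs

run_ids = make_run_ids()
X_tr, X_val, y_tr, y_val = split_by_run(X_demo, y_demo, run_ids, val_runs=[6, 7])
print(X_tr.shape, X_val.shape)
```

With the real data, `train_data.to_numpy()` and `train_labels[0].to_numpy()` could take the place of the synthetic arrays.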

#### Step 2: Normalize the features using StandardScaler


In [79]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#### Step 3: One-hot encode the target variable


In [78]:
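from sklearn.preprocessing import OneHotEncoder

# One-hot encode the integer genre labels (0-4). LogisticRegression accepts
# integer class labels directly, so the models below do not use these arrays.
encoder = OneHotEncoder(sparse_output=False)
y_train_onehot = encoder.fit_transform(y_train.values.reshape(-1, 1))
y_test_onehot = encoder.transform(y_test.values.reshape(-1, 1))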

#### Step 4: Test various model parameters


In [1]:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#### LogisticRegression simple


In [None]:
logreg = LogisticRegression()

# The original grid, which also paired 'l1' with every solver, ran in 173m.
# Only 'liblinear' and 'saga' support the 'l1' penalty, so the grid is split
# in two to avoid fitting invalid penalty/solver combinations.
common = {
    'C': [0.01, 0.1, 1, 10, 100],
    'max_iter': [2000],
    'class_weight': ['balanced']
}
param_grid = [
    {**common, 'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
    {**common, 'penalty': ['l1'], 'solver': ['liblinear', 'saga']},
]

grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train, y_train[0].tolist())

# Best Parameters:  {'C': 1, 'class_weight': 'balanced', 'max_iter': 2000, 'penalty': 'l1', 'solver': 'saga'}
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_) # 0.7584980237154151

# Train the final model with the best parameters
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train, y_train[0].tolist())
test_acc = accuracy_score(y_test, best_logreg.predict(X_test))
print(f"Test Accuracy: {test_acc:.3f}")
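
Task 1-1 asks for at least one hyperparameter searched over 10 values, while the grid above tries 5 values of `C`. The sketch below shows one way this could look: 10 log-spaced values of `C`, with the scaler inside a pipeline and folds formed from whole runs. The range `1e-4` to `1e2`, the use of `GroupKFold` with 4 splits, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 160 trials, 200 features, 5 genres, 8 runs of 20 trials.
rng = np.random.default_rng(0)
y_demo = np.tile(np.repeat(np.arange(5), 4), 8)
X_demo = rng.normal(size=(160, 200))
X_demo[np.arange(160), y_demo] += 3.0  # inject class signal into features 0-4
run_ids = np.arange(160) // 20

# Scaling happens inside the pipeline, so it is refit on each training fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
c_values = np.logspace(-4, 2, 10)  # 10 candidate values for C
param_grid = {"logisticregression__C": c_values}

search = GridSearchCV(pipe, param_grid, scoring="accuracy",
                      cv=GroupKFold(n_splits=4))
search.fit(X_demo, y_demo, groups=run_ids)  # folds never split a run

for C, score in zip(c_values, search.cv_results_["mean_test_score"]):
    print(f"C={C:.4g}  mean CV accuracy={score:.3f}")
print("Best:", search.best_params_)
```

Printing the mean score for every candidate documents the search process itself, not only the winning value.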

In [10]:
model = LogisticRegression(max_iter=5000, C=1, penalty="l1", solver="saga")
scores = cross_val_score(model, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"LogisticRegression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")



LogisticRegression Accuracy: 0.749 (+/- 0.087)




In [5]:
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"LogisticRegression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

LogisticRegression Accuracy: 0.678 (+/- 0.078)


#### LogisticRegression with Decomposition


In [None]:
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=90)
X_train_fa = fa.fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train_fa, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"LogisticRegression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

LogisticRegression Accuracy: 0.403 (+/- 0.073)


In [8]:
from sklearn.decomposition import FastICA

fa = FastICA(n_components=90)
X_train_fa = fa.fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train_fa, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"LogisticRegression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

LogisticRegression Accuracy: 0.464 (+/- 0.041)


In [39]:
from sklearn.decomposition import TruncatedSVD

fa = TruncatedSVD(n_components=90)
X_train_fa = fa.fit_transform(X_train)

model = LogisticRegression(max_iter=1000, C=1)
scores = cross_val_score(model, X_train_fa, y_train[0].tolist(), cv=5, scoring='accuracy')
print(f"LogisticRegression Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

LogisticRegression Accuracy: 0.769 (+/- 0.138)


In [40]:
from sklearn.decomposition import TruncatedSVD

fa = TruncatedSVD(n_components=90)
X_train_fa = fa.fit_transform(X_train)
X_test_fa = fa.transform(X_test)

model = LogisticRegression(max_iter=1000, C=1)
model.fit(X_train_fa, y_train[0].tolist())

test_acc = accuracy_score(y_test, model.predict(X_test_fa))
print(f"Test Accuracy: {test_acc:.3f}")

Test Accuracy: 0.708
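
In the cross-validated cells above, the decomposition is fit on all of `X_train` before `cross_val_score` runs, so the held-out trials of every fold have already influenced the components. A hedged sketch of one way to avoid this is to put the decomposition and the classifier in a single `Pipeline`, which is then refit on each training fold. The step names, `n_components=20`, and the synthetic data are illustrative assumptions; the number of components must not exceed the number of rows in each training fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 120 trials, 300 features, 5 balanced classes.
rng = np.random.default_rng(0)
y_demo = np.tile(np.arange(5), 24)
X_demo = rng.normal(size=(120, 300))
X_demo[np.arange(120), y_demo] += 4.0  # inject class signal into features 0-4

# PCA is fit only on the training part of each fold, then applied to the rest.
pipe = Pipeline([
    ("reduce", PCA(n_components=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"Pipeline accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```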


### Best overall model

In [11]:
# BEST OVERALL MODEL
model = LogisticRegression(max_iter=5000, C=1, penalty="l1", solver="saga")
model.fit(X_train, y_train[0].tolist())

test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test Accuracy: {test_acc:.3f}")

Test Accuracy: 0.771
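
Task 1-3 asks for a confusion matrix on a run-wise validation set and a look at four misclassified trials. The sketch below outlines both steps on synthetic data. The shortened genre names come from the task statement; holding out the last two runs, the plain `LogisticRegression`, and the data itself are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

GENRES = ["Ambient", "Country", "Heavy Metal", "Rock 'n Roll", "Classical"]

# Synthetic stand-in: 8 runs of 20 trials, weak class signal so errors occur.
rng = np.random.default_rng(0)
y_all = np.tile(np.repeat(np.arange(5), 4), 8)
X_all = rng.normal(size=(160, 100))
X_all[np.arange(160), y_all] += 1.5
run_ids = np.arange(160) // 20
val = run_ids >= 6  # hold out the last two runs

clf = LogisticRegression(max_iter=2000).fit(X_all[~val], y_all[~val])
y_true = y_all[val]
y_pred = clf.predict(X_all[val])

# Rows are true genres, columns are predicted genres.
cm = confusion_matrix(y_true, y_pred, labels=np.arange(5))
print(cm)

# Show up to four validation trials the model got wrong.
wrong = np.flatnonzero(y_pred != y_true)[:4]
for i in wrong:
    print(f"trial {i}: true={GENRES[y_true[i]]}, predicted={GENRES[y_pred[i]]}")
```

Off-diagonal cells that are large in both directions for a pair of genres would indicate that the model confuses those two genres with each other.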


In [14]:
best_model = LogisticRegression(max_iter=5000, C=1, penalty="l1", solver="saga")
best_model.fit(train_data, train_labels[0].tolist())



#### LogisticRegression with PCA


In [25]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=90)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

logreg = LogisticRegression(max_iter=5000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_pca, y_train[0].tolist())

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Train the final model with the best parameters
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train_pca, y_train[0].tolist())
test_acc = accuracy_score(y_test, best_logreg.predict(X_test_pca))
print(f"Test Accuracy: {test_acc:.3f}")

Best Parameters:  {'C': 0.001}
Best Score:  0.7588932806324111
Test Accuracy: 0.771


In [26]:
pca = PCA(n_components=90)
X_train_pca = pca.fit_transform(train_data)

logreg = LogisticRegression(max_iter=5000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_pca, train_labels[0].tolist())

print(X_train_pca.shape)
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

best_logreg = grid_search.best_estimator_

(160, 90)
Best Parameters:  {'C': 0.001}
Best Score:  0.80625


In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [15]:
from musicbrain.ml_logic.registry import save_model

MODEL_TARGET="local"
save_model(best_model)

✅ Model saved locally


In [9]:
from musicbrain.ml_logic.registry import load_model
from dotenv import load_dotenv

# load_dotenv("../../")
# print(MODEL_TARGET)

model = load_model()
model

Load latest model from GCS...

❌ No model found in GCS bucket final-project-lewagon


In [30]:
from musicbrain.ml_logic.preprocessor import preprocess_features

y_pred = model.predict(preprocess_features(test_data))
y_pred


Preprocessing features...


ValueError: n_components=90 must be less or equal to the batch number of samples 40.

In [31]:
test_data.shape

(40, 22036)
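
The `ValueError` above suggests that `preprocess_features` (whose source is not shown here) refits a PCA-type transformer on the 40 test rows, which cannot yield 90 components. A minimal sketch of an alternative, assuming the preprocessing can live in the same object as the classifier: fit one `Pipeline` on the training data, save that whole object, and only call `predict` on the test rows. The `pickle` round trip merely stands in for the project's `save_model` and `load_model`, and the data is synthetic.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in: 160 training trials and 40 test trials, 300 features.
rng = np.random.default_rng(0)
y_train_demo = np.tile(np.arange(5), 32)
X_train_demo = rng.normal(size=(160, 300))
X_train_demo[np.arange(160), y_train_demo] += 4.0
X_test_demo = rng.normal(size=(40, 300))

pipe = Pipeline([
    ("pca", PCA(n_components=90)),  # fit on 160 rows, so 90 components is valid
    ("clf", LogisticRegression(max_iter=5000)),
])
pipe.fit(X_train_demo, y_train_demo)

# Persist and restore the fitted pipeline (stand-in for save_model/load_model).
restored = pickle.loads(pickle.dumps(pipe))

# predict() only applies the already-fitted PCA; nothing is refit on 40 rows.
y_pred = restored.predict(X_test_demo)
print(y_pred.shape)
```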

#### LogisticRegression with PCA on scaled features


In [81]:
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pca = PCA(n_components=90)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

logreg = LogisticRegression(max_iter=10000)

param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}
grid_search = GridSearchCV(logreg, param_grid, scoring='accuracy', cv=5)
grid_search.fit(X_train_pca, y_train[0].tolist())

print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)

# Train the final model with the best parameters
best_logreg = grid_search.best_estimator_
best_logreg.fit(X_train_pca, y_train[0].tolist())
test_acc = accuracy_score(y_test, best_logreg.predict(X_test_pca))
print(f"Test Accuracy: {test_acc:.3f}")

Best Parameters:  {'C': 1}
Best Score:  0.6869565217391305
Test Accuracy: 0.688
