# Documentation

In this file we continue working with CCA, this time using the `CCA` class from `sklearn.cross_decomposition`. We assess the performance of this model and how different parameters influence the outcome.

# Work

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import KFold
from sklearn.cross_decomposition import CCA
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import statistics

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the previously processed data that we used for the by-hand CCA.

In [3]:
cca = pd.read_csv('/content/drive/MyDrive/UW/sphscAudiogram/audiogram_cca.csv')

In [4]:
# predictors: default predictors without any measurements from the audiogram
predictors = ['AGE', 'GENDER_Female', 'GENDER_Male', 'LT1', 'RT1']
X = cca[predictors]
Y = cca.drop(columns = predictors)

In [5]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=6)

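**Note:** We did not scale the data ourselves in this case. `CCA` in `sklearn` standardizes X and Y internally by default (`scale=True`), so it remains to be determined how a different scaling choice might change the performance of CCA.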

In [6]:
# cca_model: cca fitted on X_train and Y_train with 5 components
cca_model = CCA(n_components=5)
cca_model.fit(X_train, Y_train)

# X_cca, Y_cca: X_train, Y_train transformed by cca_model
X_cca, Y_cca = cca_model.transform(X_train, Y_train)
score = cca_model.score(X_test, Y_test)
print(score)

0.07630164278131113


**How to interpret score:** The `score` method of the CCA model in `sklearn` comes from the regressor interface and returns the coefficient of determination (R^2) of `predict(X_test)` against `Y_test`, averaged uniformly over the target columns. It is not the sum of squared canonical correlations. The best possible value is 1, and the value can be negative when the model predicts worse than always outputting the mean of each target. A score of about 0.076 therefore means the model explains only about 7.6% of the variance in the test targets.

In [7]:
# sklearn metrics take (y_true, y_pred) in that order
root_mean_squared_error(Y_test, cca_model.predict(X_test))

17.6425584221586

Recall that we obtained an RMSE of 14 with our by-hand CCA process.

Does this mean the by-hand algorithm has better performance?

**Using CV** to validate the RMSE and to observe how the performance of the CCA model changes with n_comp

**Note:** This function differs from the earlier version in that it lets us pass in the list of predictors we want to use

In [11]:
def cca_cv(predictors, data, n_splits = 10, shuffle = True, random_state = 42):
  '''
  Input
  predictors: features used to predict
  data: the dataframe of all the features
  n_splits: number of splits for cv, defaults to 10
  shuffle: whether to shuffle the rows before splitting, defaults to True
  random_state: seed used for the shuffle, defaults to 42

  Returns a dict mapping each number of components to the list of
  per-fold rmse of the cca model
  '''
  # get the cv with n_splits
  cv = KFold(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

  # define X and Y
  X = data[predictors]
  Y = data.drop(columns = predictors)

  # rmse[i] is the list of rmse of the cca model with i canonical components for each fold
  rmse = {}
  for n_comp in range(1, len(predictors)+1):
    # list of rmse for each fold
    rmse[n_comp] = []
    for train_index, test_index in cv.split(X):
      # split into train and test set
      X_train, X_test = X.iloc[train_index], X.iloc[test_index]
      Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

      # fit cca model with current fold
      cca_model = CCA(n_components=n_comp)
      cca_model.fit(X_train, Y_train)
      # sklearn metrics take (y_true, y_pred) in that order
      rmse[n_comp].append(root_mean_squared_error(Y_test, cca_model.predict(X_test)))

  return rmse

In [12]:
# predictors: default predictors without any measurements from the audiogram
rmse = cca_cv(predictors, cca)

**Calculate the mean RMSE across the CV folds for each number of components**

In [13]:
{key: statistics.mean(value) for key, value in rmse.items()}

{1: 14.094567291119887,
 2: 14.398533052646007,
 3: 15.110606007150562,
 4: 16.792411817525764,
 5: 19.39073679737196}

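**Observation:** The RMSE increases as n_comp increases. This seems to contradict our intuition that more components mean lower loss and should therefore give a lower RMSE. One possible explanation is that these RMSE values are measured on held-out folds, so additional components may fit noise in the training folds rather than structure that generalizes.

Overall, the sklearn CCA model with 5 components does not predict as well as the by-hand CCA (RMSE of about 17.6 versus 14). With a single component, however, its cross-validated RMSE of about 14.1 is comparable.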
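To see how stable each mean is, it can help to look at the spread across folds as well. The sketch below summarizes a dict shaped like the output of `cca_cv` into a table and picks the number of components with the lowest mean RMSE. The fold values in `rmse_by_comp` are made-up placeholders, not results from this notebook, and `summary` and `best_n_comp` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Placeholder fold values (hypothetical): replace with the dict returned by cca_cv.
rmse_by_comp = {
    1: [14.0, 14.2],
    2: [14.3, 14.5],
    3: [15.0, 15.2],
}

# One row per number of components, with the mean and sample std across folds.
summary = pd.DataFrame({
    n_comp: {"mean_rmse": np.mean(folds), "std_rmse": np.std(folds, ddof=1)}
    for n_comp, folds in rmse_by_comp.items()
}).T
summary.index.name = "n_comp"

# Number of components with the lowest mean RMSE.
best_n_comp = summary["mean_rmse"].idxmin()
print(summary)
print(best_n_comp)
```

If the standard deviations are large relative to the gaps between the means, the ranking of `n_comp` values should be read with caution.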