# Classical CCA: Practice

This is the short practical part of the introduction to CCA. We work with the NYC school dataset and split its variables into two groups: one group measures the quality of the educational environment at each school, the other measures student performance. We will find two canonical directions for these sets.

Start by importing the packages and loading the dataset.

In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('./data/2016 School Explorer.csv')
# choose relevant features
x_cols = ['Rigorous Instruction %',
          'Collaborative Teachers %',
          'Supportive Environment %',
          'Effective School Leadership %',
          'Strong Family-Community Ties %',
          'Trust %']
y_cols = ['Average ELA Proficiency',
          'Average Math Proficiency']
df = df[x_cols + y_cols]
# drop missing values
df = df.dropna()
# separate X and Y groups
X = df[x_cols]
Y = df[y_cols]

Look at the two groups of variables. An easy question: can more than two canonical directions be found here?

Next, we convert the percentage strings to numbers and standardize both groups.

In [2]:
# Strip the trailing percent sign and convert to integers.
# apply returns a new DataFrame, so nothing is assigned into a slice of df
X = X.apply(lambda s: s.str.strip('%').astype('int'))
# Standardise the data (one scaler per group, so both fitted scalers are kept)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler(with_mean=True, with_std=True)
sc_y = StandardScaler(with_mean=True, with_std=True)
X_sc = sc_x.fit_transform(X)
Y_sc = sc_y.fit_transform(Y)
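To make the standardization step concrete, here is a minimal sketch on made-up numbers (the array `X_toy` and the name `n_rows` are illustrative and not part of the school dataset). It assumes the usual convention of dividing by the population standard deviation (`ddof=0`), which is what `StandardScaler` does, and shows that for standardized data the product `X.T @ X` divided by the number of rows is the correlation matrix.

```python
import numpy as np

# Toy data, invented for illustration: 5 rows, 2 features
X_toy = np.array([[89.0, 2.1],
                  [71.0, 2.6],
                  [95.0, 3.0],
                  [80.0, 2.4],
                  [65.0, 2.9]])

# Column-wise standardization with the population standard deviation (ddof=0)
X_toy_sc = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

n_rows = X_toy_sc.shape[0]
# For standardized columns, X^T X / n_rows is the sample correlation matrix
corr_from_products = X_toy_sc.T @ X_toy_sc / n_rows
print(np.allclose(corr_from_products, np.corrcoef(X_toy, rowvar=False)))
```

This is why the matrix products in Step 1 below can be read as (unnormalized) correlation matrices of the standardized variables.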


Now let's code our own implementation of CCA. We do not care about efficiency here; we just walk through the
pipeline introduced in the theoretical part.

Step 1. Compute the covariance matrices. You do not need to scale them by the number of instances. Why? What would
the scaling affect?

In [3]:
# Compute CCA by hand
Sigma_XY = np.dot(X_sc.T, Y_sc)
Sigma_XX = np.dot(X_sc.T, X_sc)
Sigma_YY = np.dot(Y_sc.T, Y_sc)
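One way to approach the question about scaling: suppose all three matrices are multiplied by the same constant $c > 0$, for example $c = 1/(N-1)$ with $N$ the number of schools (the symbols $c$ and $N$ are introduced here only for illustration). The product that Step 2 calls $T$ then becomes

$$(c\,\Sigma_{XX})^{-\frac{1}{2}}\,(c\,\Sigma_{XY})\,(c\,\Sigma_{YY})^{-\frac{1}{2}} = c^{-\frac{1}{2}}\, c\, c^{-\frac{1}{2}}\;\Sigma_{XX}^{-\frac{1}{2}}\Sigma_{XY}\Sigma_{YY}^{-\frac{1}{2}} = \Sigma_{XX}^{-\frac{1}{2}}\Sigma_{XY}\Sigma_{YY}^{-\frac{1}{2}}$$

so the matrix whose SVD is taken, and with it the singular values, does not change. Only the length of the coefficient vectors changes, by the factor $c^{-\frac{1}{2}}$, while their directions stay the same.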




Step 2. Compute the inverse square roots of the covariance matrices. They are needed to compute the matrix T and the matrices A and B. The main formulas are given below.

$$T = \Sigma_{XX}^{-\frac{1}{2}}\Sigma_{XY}\Sigma_{YY}^{-\frac{1}{2}}$$

$$T = U\Lambda V^T$$

$$A_k = U_k^T \Sigma_{XX}^{-\frac{1}{2}} \in \mathbb{R}^{k \times n}, \quad B_k = V_k^T \Sigma_{YY}^{-\frac{1}{2}} \in \mathbb{R}^{k \times m}$$

Here $n$ and $m$ are the numbers of variables in the X and Y groups, and $U_k$, $V_k$ contain the first $k$ columns of $U$ and $V$.

In [4]:
# Sigma_XX and Sigma_YY are symmetric, so eigh returns real eigenvalues
# and orthonormal eigenvectors (the inverse of L is simply L.T)
S_x, L_x = np.linalg.eigh(Sigma_XX)
Sigma_XX_sqrt_inv = L_x @ np.diag(1 / np.sqrt(S_x)) @ L_x.T
S_y, L_y = np.linalg.eigh(Sigma_YY)
Sigma_YY_sqrt_inv = L_y @ np.diag(1 / np.sqrt(S_y)) @ L_y.T
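A quick sanity check for this step is that the inverse square root, applied on both sides of the original matrix, gives the identity. The sketch below does this on synthetic data (the names `Z`, `Sigma` and `Sigma_inv_sqrt` are illustrative, not taken from the notebook), and assumes the matrix is symmetric positive definite so that `np.linalg.eigh` applies.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))   # synthetic data, not the school dataset
Sigma = Z.T @ Z                # symmetric, positive definite for generic data

# Eigendecomposition of a symmetric matrix: Sigma = V diag(w) V^T
eigvals, eigvecs = np.linalg.eigh(Sigma)
# Inverse square root: V diag(w^(-1/2)) V^T
Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Sigma^(-1/2) Sigma Sigma^(-1/2) should be the identity
identity_check = Sigma_inv_sqrt @ Sigma @ Sigma_inv_sqrt
print(np.allclose(identity_check, np.eye(4)))
```

The same check can be run on `Sigma_XX_sqrt_inv` and `Sigma_XX` from the cells above.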

Step 3. Compute T and its SVD

In [5]:
T = Sigma_XX_sqrt_inv @ Sigma_XY @ Sigma_YY_sqrt_inv

In [6]:
U, S, Vh = np.linalg.svd(T)

Step 4. Compute A and B

In [7]:
A = U[:,:2].T@Sigma_XX_sqrt_inv
B = Vh[:2,:]@Sigma_YY_sqrt_inv

Step 5. Compute the maximized correlations between the pairs of canonical variables

In [8]:
direction_1_x = A.T[:,0]/np.linalg.norm(A.T[:,0])
direction_2_x = A.T[:,1]/np.linalg.norm(A.T[:,1])
direction_1_y = B.T[:,0]/np.linalg.norm(B.T[:,0])
direction_2_y = B.T[:,1]/np.linalg.norm(B.T[:,1])
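Rescaling the coefficient vectors to unit length is harmless because a correlation does not change when either variable is multiplied by a positive constant. A minimal sketch with synthetic vectors (the names `u`, `v` and the helper `corr` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(size=100)
v = 0.5 * u + rng.normal(size=100)   # v is correlated with u by construction

def corr(a, b):
    """Pearson correlation of two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# Positive rescaling of either variable leaves the correlation unchanged
print(np.isclose(corr(u, v), corr(3.0 * u, 0.2 * v)))
```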

In [9]:
# Correlation between each pair of canonical variables. The data are centered,
# so the normalized inner product equals the Pearson correlation
u1, v1 = X_sc @ direction_1_x, Y_sc @ direction_1_y
u2, v2 = X_sc @ direction_2_x, Y_sc @ direction_2_y
corr_1_by_hand = (u1 @ v1) / np.sqrt((u1 @ u1) * (v1 @ v1))
corr_2_by_hand = (u2 @ v2) / np.sqrt((u2 @ u2) * (v2 @ v2))
print(corr_1_by_hand)
print(corr_2_by_hand)



0.4605990186268907
0.18447786228102248
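With the formulas from Step 2, the singular values of T should coincide with these correlations, which gives another way to check the result. The sketch below runs the same pipeline end to end on synthetic data (all names ending in `_syn` and the helper `inv_sqrt` are illustrative) and compares the singular values with the correlations of the projected variables.

```python
import numpy as np

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(42)
n_samples = 500
X_syn = rng.normal(size=(n_samples, 4))
# Y depends on the first two X columns plus noise, so the groups are correlated
mix = np.array([[1.0, 0.3],
                [0.0, 0.5]])
Y_syn = X_syn[:, :2] @ mix + rng.normal(size=(n_samples, 2))
X_syn = X_syn - X_syn.mean(axis=0)
Y_syn = Y_syn - Y_syn.mean(axis=0)

# Steps 1-3: (unscaled) covariance matrices, inverse square roots, T and its SVD
Sxx_inv_sqrt = inv_sqrt(X_syn.T @ X_syn)
Syy_inv_sqrt = inv_sqrt(Y_syn.T @ Y_syn)
T_syn = Sxx_inv_sqrt @ (X_syn.T @ Y_syn) @ Syy_inv_sqrt
U_syn, s_syn, Vh_syn = np.linalg.svd(T_syn)

# Step 4: coefficient matrices for k canonical pairs
k = 2
A_syn = U_syn[:, :k].T @ Sxx_inv_sqrt   # shape (k, 4)
B_syn = Vh_syn[:k, :] @ Syy_inv_sqrt    # shape (k, 2)

# Step 5: correlations of the canonical variables
Xc_syn = X_syn @ A_syn.T
Yc_syn = Y_syn @ B_syn.T
corrs = np.array([np.corrcoef(Xc_syn[:, i], Yc_syn[:, i])[0, 1] for i in range(k)])
print(np.allclose(corrs, s_syn[:k]))
```

In the notebook itself, the analogous check is to compare the array `S` returned by `np.linalg.svd(T)` with the two printed values.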


Now we do the same with the CCA implementation from scikit-learn and check that our results match.

In [10]:
from sklearn.cross_decomposition import CCA
nComponents = 2 # min(p,q) components
cca = CCA(n_components=nComponents)
cca.fit(X_sc, Y_sc)
X_c, Y_c = cca.transform(X_sc, Y_sc)

In [11]:
# Check correlations between the canonical variables
corr1 = (X_c.T@Y_c)[0,0]/np.sqrt((X_c.T@X_c)[0,0]*(Y_c.T@Y_c)[0,0])
corr2 = (X_c.T@Y_c)[1,1]/np.sqrt((X_c.T@X_c)[1,1]*(Y_c.T@Y_c)[1,1])
print("Maximized correlation between first pair of canonical variables : ", corr1)
print("Maximized correlation between second pair of canonical variables : ", corr2)

Maximized correlation between first pair of canonical variables :  0.46059901511953844
Maximized correlation between second pair of canonical variables :  0.18447786368577776


In [12]:
# Canonical coefficients (rotation matrices) for X and Y
print(cca.x_rotations_)
print(cca.y_rotations_)

[[ 0.13341004  0.24115923]
 [-0.02182311 -0.11012058]
 [ 0.72897535 -0.25737072]
 [ 0.44467451  0.91707609]
 [-0.01654807 -0.27890374]
 [-0.50230588 -0.38948744]]
[[-0.23566075  1.3150408 ]
 [ 0.97183538 -1.17967538]]
