# Module 3: Feature extraction - Practice

In this session you will practice feature extraction with **Principal Component Analysis**
and **Factor Analysis**.

We are going to use the **titanic** dataset.

sklearn API reference:

+ [sklearn.decomposition.PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
+ [sklearn.decomposition.FactorAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.FactorAnalysis.html)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

from scipy.stats import pearsonr

np.random.seed(18937)

## Load dataset

In [None]:
# Dataset location (Question #P101)
DATASET = '/dsa/data/all_datasets/titanic_ML/titanic.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)
dataset.describe()

Create variables **X** and **y** to hold the features and the labels, respectively.

In [None]:
X = dataset.iloc[:, :-1]
y = dataset.survived
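The positional slice `iloc[:, :-1]` only excludes the label if `survived` happens to be the last column. As a minimal sketch, assuming nothing about the real column order, the split can also be done by name. The toy frame and the names `toy`, `X_toy`, and `y_toy` are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy frame; the real titanic.csv column order is not assumed here.
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# Select the label by name and drop it from the features by name.
y_toy = toy["survived"]
X_toy = toy.drop(columns="survived")

print(list(X_toy.columns))  # ['age', 'fare']
```

Selecting by name keeps the label out of the feature matrix regardless of where the column sits in the file.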

**Initialize** and **fit** both a PCA and a FactorAnalysis feature extractor <span style="background: yellow">with 5 components</span>.

In [None]:
# Add your code below this comment for PCA (Question #P102)
# ----------------------------------
pca = PCA(n_components=5)
pca.fit(X)

# Add your code below this comment for FA (Question #P103)
# ----------------------------------
fa = FactorAnalysis(n_components=5)
fa.fit(X)
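To see what the two fitted extractors expose, here is a small sketch on synthetic data. The array `X_demo` and the names `pca_demo` and `fa_demo` are hypothetical stand-ins for the notebook's variables. Both estimators expect a numeric matrix without missing values.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))  # hypothetical stand-in for the feature matrix

pca_demo = PCA(n_components=5).fit(X_demo)
fa_demo = FactorAnalysis(n_components=5, random_state=0).fit(X_demo)

# Each row of components_ is one extracted direction over the 8 input features.
print(pca_demo.components_.shape, fa_demo.components_.shape)
# transform() projects the samples onto the 5 extracted features.
print(pca_demo.transform(X_demo).shape)
```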

## Print explained variance ratio for each extracted feature

In [None]:
# Complete code below this comment for PCA (Question #P104)
# ----------------------------------
print('PCA', pca.explained_variance_ratio_)
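As a rough illustration of what the explained variance ratio measures, the sketch below computes it by hand with NumPy on hypothetical data (`X_demo` is not the titanic data). It takes the variance along each principal axis from the singular values of the centered matrix and divides by the total variance. The result is expected to agree with `PCA.explained_variance_ratio_` on the same data up to numerical precision.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 4 features with different spreads.
X_demo = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

# Center the data, then take the singular values of the centered matrix.
X_c = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_c, compute_uv=False)

# Variance along each principal axis, and its share of the total variance.
explained_variance = singular_values**2 / (X_c.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```

When all components are kept, the ratios sum to 1. With only 5 components kept out of a larger feature set, the printed ratios sum to less than 1.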

In [None]:
# Complete code below this comment for FA (Question #P105)
# ----------------------------------
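def FA_explained_variance_ratio(fa):
    # Sum of squared loadings per factor, sorted from largest to smallest
    fa.explained_variance_ = np.flip(np.sort(np.sum(fa.components_**2, axis=1)), axis=0)
    # Total variance = variance carried by the factors + per-feature noise variance
    total_variance = np.sum(fa.explained_variance_) + np.sum(fa.noise_variance_)
    fa.explained_variance_ratio_ = fa.explained_variance_ / total_variance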

FA_explained_variance_ratio(fa)
print('FA', fa.explained_variance_ratio_)
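The same calculation can be written as a pure function that returns the ratios instead of attaching new attributes to the fitted estimator. This is only a sketch: the function name `fa_variance_ratio` and the demo arrays are hypothetical, chosen so the arithmetic can be checked by hand.

```python
import numpy as np

def fa_variance_ratio(components, noise_variance):
    """Share of total model variance attributed to each factor, largest first."""
    # Sum of squared loadings per factor (rows are factors, columns are features).
    ss_loadings = np.sum(np.asarray(components) ** 2, axis=1)
    total_variance = ss_loadings.sum() + np.sum(noise_variance)
    return np.sort(ss_loadings)[::-1] / total_variance

# Hypothetical loadings for 2 factors over 3 features, plus per-feature noise.
components_demo = np.array([[1.0, 1.0, 0.0],
                            [0.0, 0.0, 2.0]])
noise_demo = np.array([1.0, 1.0, 2.0])

# Squared loadings sum to 2 and 4, noise sums to 4, so the total is 10.
print(fa_variance_ratio(components_demo, noise_demo))  # [0.4 0.2]
```

With a fitted estimator, the call would be `fa_variance_ratio(fa.components_, fa.noise_variance_)`.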

## Compute the correlation coefficient between each extracted feature and the target

1) Use [scipy.stats.pearsonr()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)

In [None]:
# Complete code below this comment for PCA (Question #P106)
# ----------------------------------
X_PCA = pca.transform(X)
print([pearsonr(X_PCA[:,i], y)[0] for i in range(X_PCA.shape[1])])

# Complete code below this comment for FA (Question #P107)
# ----------------------------------
X_FA = fa.transform(X)
print([pearsonr(X_FA[:,i], y)[0] for i in range(X_FA.shape[1])])

2) We encourage you to attempt the same using the following equation, where $j$ indexes the features.

$$ r_j = \frac{\sigma_{X_j y}}{\sigma_{X_j} \sigma_y} = \frac{(y-\bar y)^T (X_j-\bar{X}_j)}{\lVert X_j-\bar{X}_j\rVert \cdot \lVert y-\bar y\rVert} = \cos \measuredangle(X_j-\bar{X}_j, y-\bar y) $$



$(X-\bar X)$ and $(y-\bar y)$ are given as **X_centered** and **y_centered** respectively.


In [None]:
# Complete code below this comment for PCA (Question #P108)
# ----------------------------------

X_centered = scale(X_PCA, with_std = False)
y_centered = scale(y.astype(float), with_std = False)

# Either of the following is a possible answer
# ----------------------------------
cosine = lambda a,b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print([cosine(X_centered[:, j], y_centered) for j in range(X_PCA.shape[1])])

# --  OR  --------------------------

print(np.dot(y_centered, X_centered) / np.linalg.norm(X_centered, axis=0) / np.linalg.norm(y_centered))
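The identity between Pearson's r and the cosine of the angle between centered vectors can be checked numerically. The sketch below uses hypothetical synthetic data (`features`, `target`) and compares the cosine formula against `np.corrcoef` as an independent reference.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 3 features and a target that depends on the first one.
features = rng.normal(size=(100, 3))
target = 2.0 * features[:, 0] + rng.normal(size=100)

features_c = features - features.mean(axis=0)
target_c = target - target.mean()

# Cosine of the angle between each centered feature column and the centered target.
r_cosine = (target_c @ features_c) / (
    np.linalg.norm(features_c, axis=0) * np.linalg.norm(target_c)
)

# Reference: Pearson r from the correlation matrix (last row is the target).
stacked = np.column_stack([features, target])
r_reference = np.corrcoef(stacked, rowvar=False)[-1, :-1]

print(np.allclose(r_cosine, r_reference))  # True
```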


## Scree plot

Create a scree plot for PCA's explained variance ratio below.

In [None]:
x_ticks = np.arange(len(pca.components_))+1
plt.xticks(x_ticks) # this enforces integers on the x-axis
# Complete code below this comment for PCA (Question #P109)
# ----------------------------------
plt.plot(x_ticks, pca.explained_variance_ratio_)

Create a scree plot for FA's explained variance ratio below.

In [None]:
x_ticks = np.arange(len(fa.components_))+1
plt.xticks(x_ticks) # this enforces integers on the x-axis
# Complete code below this comment for FA (Question #P110)
# ----------------------------------
plt.plot(x_ticks, fa.explained_variance_ratio_)

Plot both in the same figure below <span style="background: yellow;">in log-scale</span>.

In [None]:
# Complete code below this comment for PCA and FA (Question #P111)
# ----------------------------------
x_ticks = np.arange(len(pca.components_))+1
plt.xticks(x_ticks) # this enforces integers on the x-axis
plt.plot(x_ticks, np.log(pca.explained_variance_ratio_), 'b')
plt.plot(x_ticks, np.log(fa.explained_variance_ratio_), 'r')
plt.show()
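An alternative to plotting `np.log` of the ratios is to keep the raw values and put the y-axis itself on a log scale, so the tick labels stay in the original units. The sketch below assumes hypothetical ratio arrays (`pca_ratio_demo`, `fa_ratio_demo`) in place of the fitted estimators' values, and adds a legend so the two curves can be told apart.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ratios standing in for pca/fa.explained_variance_ratio_.
pca_ratio_demo = np.array([0.60, 0.25, 0.10, 0.04, 0.01])
fa_ratio_demo = np.array([0.50, 0.20, 0.08, 0.02, 0.005])
component_index = np.arange(1, len(pca_ratio_demo) + 1)

fig, ax = plt.subplots()
ax.plot(component_index, pca_ratio_demo, "b", label="PCA")
ax.plot(component_index, fa_ratio_demo, "r", label="FA")
ax.set_yscale("log")            # log-scaled axis, values stay in original units
ax.set_xticks(component_index)  # integer ticks on the x-axis
ax.set_xlabel("Component")
ax.set_ylabel("Explained variance ratio")
ax.legend()
```

Both approaches show the same curve shapes. The difference is only in how the y-axis is labeled.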

Which do you think performed better on this dataset, PCA or FA? What makes you think so?

# Save your notebook!