<div style="text-align: right"> Provided on April 23, Due on May 7 [BRI516, Spring/2020] </div>

For homework in general:
* Install `Anaconda` and create an environment with `NumPy`, `Pandas`, `Matplotlib`, `scikit-learn` in Python 3.5
* Please type the equation and/or text using markdown in jupyter-lab or jupyter-notebook
* Please upload your jupyter-notebook file for homework to `Blackboard` (In case of 1.(a)-(c) and 2.(a)-(d), any format is fine; such as .docx, hand writing, etc.)
* Please discuss your results in at least one line of text

#### [Hw#2] 

##### (1) Linear discriminant analysis (LDA):

Suppose we have two classes and $m$-dimensional samples $\{ \bf{x}^1, \bf{x}^2, \cdots, \bf{x}^{N_i} \}$ belonging to class $\omega_i$, where $i \in \{1, 2\}$.

The aim is to map each sample $\bf{x}$ to a scalar $y$ by projecting it onto a line:
$$ y = \bf{w}^T \bf{x} $$ 
where $\bf{w}$ is a projection vector.

(a) Show that an objective function to maximize for LDA can be represented as follows:

$$ J({\bf w}) \triangleq \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2} = \frac{{\bf w}^T {\bf S}_B {\bf w}}{{\bf w}^T {\bf S}_W {\bf w}}, $$

where $\tilde{\mu}_i$ and $\tilde{s}_i^2$ are the mean value and variance of the $i^{th}$ class in the feature space $y$, respectively, and $\bf{S}_W$ and $\bf{S}_B$ are the within-class scatter matrix and between-class scatter matrix, respectively. 
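The identity in (a) can be checked numerically before proving it. The sketch below is only an illustration under stated assumptions: the two Gaussian classes, the random direction `w`, and all variable names are made up for this example, and $\tilde{s}_i^2$ is taken as the unnormalized scatter (sum of squared deviations) of the projected samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes (hypothetical data, not part of the assignment)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.0], size=(60, 2))
w = rng.normal(size=2)  # arbitrary projection vector

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter: sum of the per-class scatter matrices
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

# Between-class scatter: outer product of the mean difference
diff = (mu1 - mu2).reshape(-1, 1)
S_B = diff @ diff.T

# Left-hand side: computed directly on the projected scalars y = w^T x
y1, y2 = X1 @ w, X2 @ w
s1_sq = np.sum((y1 - y1.mean()) ** 2)
s2_sq = np.sum((y2 - y2.mean()) ** 2)
J_projected = (y1.mean() - y2.mean()) ** 2 / (s1_sq + s2_sq)

# Right-hand side: the matrix form of the same criterion
J_matrix = (w @ S_B @ w) / (w @ S_W @ w)

print(J_projected, J_matrix)
```

Both printed values should agree up to floating-point error for any nonzero `w`.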

<br><br>

(b) Show that the solution of LDA is given by the eigenvector (with the largest eigenvalue) of the following matrix:

$$ \bf{S}_X = \bf{S}_W^{-1} \bf{S}_B $$
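As a sanity check for (b), the leading eigenvector of $\bf{S}_W^{-1} \bf{S}_B$ can be compared with the two-class closed form $\bf{S}_W^{-1}(\mu_1 - \mu_2)$ and with random directions. This is a minimal sketch on synthetic data; the data, the helper `J`, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic 2-D classes (hypothetical data)
X1 = rng.normal(loc=[0.0, 0.0], size=(50, 2))
X2 = rng.normal(loc=[2.0, 1.0], size=(60, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)
diff = (mu1 - mu2).reshape(-1, 1)
S_B = diff @ diff.T


def J(w):
    """Fisher criterion w^T S_B w / w^T S_W w."""
    return (w @ S_B @ w) / (w @ S_W @ w)


# Eigenvector of S_W^{-1} S_B with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w_star = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# Closed form for two classes: w is proportional to S_W^{-1} (mu1 - mu2)
w_closed = np.linalg.solve(S_W, mu1 - mu2)

# |cosine| between the two directions (sign and scale are arbitrary)
cosine = abs(w_star @ w_closed) / (
    np.linalg.norm(w_star) * np.linalg.norm(w_closed)
)

# No random direction should score higher than the eigenvector
random_ws = rng.normal(size=(200, 2))
best_random = max(J(w) for w in random_ws)

print(cosine, J(w_star), best_random)
```

The cosine should be 1 up to rounding, and `J(w_star)` should be at least as large as the best random direction.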

<br><br>

(c) Apply PCA and LDA to the MNIST digit dataset for feature extraction into two-dimensional space and compare the results.
* Please use `from sklearn import datasets` and `load_digits()` to load the dataset (scikit-learn's 8×8 handwritten digits set, a small MNIST-like dataset), then split it into train and test sets.


<br><br>

(d) Apply the LR and SVM classifiers to the extracted features from (c) and compare the classification performance (i) between the two classifiers and (ii) between the original features and dimension reduced features. 
    
<br><br><br><br>


Load MNIST digit dataset

In [18]:
from sklearn import datasets
import numpy as np
dataset = datasets.load_digits()
x_data = dataset.data
y_data = dataset.target
print(np.shape(x_data))
print(np.shape(y_data))

(1797, 64)
(1797,)


Split them into train and test sets.

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data, test_size = 0.3, stratify = y_data)
print(np.shape(X_train))
print(np.shape(X_test))
print(np.shape(Y_train))
print(np.shape(Y_test))

(1257, 64)
(540, 64)
(1257,)
(540,)


Standardize data

In [20]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)

Apply PCA

In [21]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train_pca = pca.fit_transform(X_train_std)
X_test_pca = pca.transform(X_test_std)
pca.explained_variance_ratio_

array([0.11978151, 0.09575799])
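The two ratios above sum to roughly 0.22, so two components retain only a small part of the total variance. The sketch below shows one way to see how many components a given variance target would need. It is an illustration only: it fits on the full standardized dataset rather than the train split used above, and the 95% threshold is an arbitrary choice.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: fit on all samples for simplicity (the notebook fits on X_train_std)
X = StandardScaler().fit_transform(datasets.load_digits().data)

# Keep every component so the ratios sum to 1
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_95 = int(np.searchsorted(cumulative, 0.95) + 1)

print(f"variance kept by the first two components: {cumulative[1]:.3f}")
print(f"components needed for 95% of the variance: {n_95}")
```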

Apply LDA

In [22]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)
X_train_lda = lda.fit_transform(X_train_std, Y_train)
X_test_lda = lda.transform(X_test_std)
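To compare the two 2-D embeddings requested in (c) with a number rather than a plot, one option is to evaluate the multi-class Fisher criterion $\mathrm{tr}(\bf{S}_W^{-1} \bf{S}_B)$ in the embedded space. This is a sketch under assumptions: the helper `fisher_ratio` is hypothetical, and both projections are fit on the full standardized dataset instead of the train split.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler


def fisher_ratio(Z, y):
    """Return trace(S_W^{-1} S_B) for an embedding Z with class labels y."""
    overall_mean = Z.mean(axis=0)
    dim = Z.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for label in np.unique(y):
        Zc = Z[y == label]
        mean_c = Zc.mean(axis=0)
        S_W += (Zc - mean_c).T @ (Zc - mean_c)        # within-class scatter
        d = (mean_c - overall_mean).reshape(-1, 1)
        S_B += len(Zc) * (d @ d.T)                     # between-class scatter
    return float(np.trace(np.linalg.solve(S_W, S_B)))


digits = datasets.load_digits()
X = StandardScaler().fit_transform(digits.data)
y = digits.target

Z_pca = PCA(n_components=2).fit_transform(X)
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

ratio_pca = fisher_ratio(Z_pca, y)
ratio_lda = fisher_ratio(Z_lda, y)
print(f"PCA: {ratio_pca:.2f}  LDA: {ratio_lda:.2f}")
```

A larger value means the classes are better separated relative to their spread. LDA optimizes this kind of criterion using the labels, while PCA ignores them.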

Apply logistic regression and SVM to the PCA, LDA, and original features

In [23]:
from sklearn.linear_model import LogisticRegression
from sklearn import svm
# Classifiers trained on the 2-D PCA features
lr_pca = LogisticRegression(solver='lbfgs', multi_class = 'auto')
lr_pca = lr_pca.fit(X_train_pca, Y_train)
svc_pca = svm.SVC()
svc_pca = svc_pca.fit(X_train_pca, Y_train)
# Classifiers trained on the 2-D LDA features
lr_lda = LogisticRegression(solver='lbfgs', multi_class = 'auto')
lr_lda = lr_lda.fit(X_train_lda, Y_train)
svc_lda = svm.SVC()
svc_lda = svc_lda.fit(X_train_lda, Y_train)
# Classifiers trained on the original 64-D features (raw X_train, not standardized)
lr = LogisticRegression(solver='lbfgs', multi_class = 'auto')
lr = lr.fit(X_train, Y_train)
svc = svm.SVC()
svc = svc.fit(X_train, Y_train)


Compare Performance

In [24]:
from sklearn.metrics import classification_report


print(classification_report(Y_test, lr_pca.predict(X_test_pca)))
print(classification_report(Y_test, lr_lda.predict(X_test_lda)))
print(classification_report(Y_test, lr.predict(X_test)))

print(classification_report(Y_test, svc_pca.predict(X_test_pca)))
print(classification_report(Y_test, svc_lda.predict(X_test_lda)))
print(classification_report(Y_test, svc.predict(X_test)))

              precision    recall  f1-score   support

           0       0.72      0.76      0.74        54
           1       0.43      0.60      0.50        55
           2       0.53      0.74      0.62        53
           3       0.44      0.49      0.46        55
           4       0.88      0.93      0.90        54
           5       0.32      0.22      0.26        55
           6       0.86      0.80      0.83        54
           7       0.67      0.78      0.72        54
           8       0.50      0.15      0.24        52
           9       0.34      0.30      0.32        54

    accuracy                           0.58       540
   macro avg       0.57      0.58      0.56       540
weighted avg       0.57      0.58      0.56       540

              precision    recall  f1-score   support

           0       0.98      0.98      0.98        54
           1       0.45      0.53      0.49        55
           2       0.44      0.42      0.43        53
           3       0.55      0.58     

Comparing the PCA, LDA, and original features: the original features give the highest accuracy, LDA follows, and PCA gives the lowest accuracy.

Comparing LR and SVC: SVC performs better than LR on all three feature sets (PCA, LDA, original).
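The six classification reports are long, so a compact accuracy table can make the comparison easier to read. The sketch below is one possible way to produce it and is not the code that generated the reports above. It assumes a fixed `random_state=0` for a reproducible split, uses `max_iter=1000` for logistic regression, and (unlike the cells above) feeds standardized data to the classifiers on the original features.

```python
from sklearn import datasets, svm
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target,
    test_size=0.3, stratify=digits.target, random_state=0,  # assumed seed
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

pca = PCA(n_components=2).fit(X_tr)
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_tr, y_tr)

features = {
    "PCA": (pca.transform(X_tr), pca.transform(X_te)),
    "LDA": (lda.transform(X_tr), lda.transform(X_te)),
    "original": (X_tr, X_te),
}

accuracy = {}
for feat_name, (F_tr, F_te) in features.items():
    classifiers = [
        ("LR", LogisticRegression(max_iter=1000)),
        ("SVM", svm.SVC()),
    ]
    for clf_name, clf in classifiers:
        clf.fit(F_tr, y_tr)
        accuracy[(feat_name, clf_name)] = accuracy_score(y_te, clf.predict(F_te))
        print(f"{feat_name:>8} {clf_name:>3}: {accuracy[(feat_name, clf_name)]:.3f}")
```

The exact numbers depend on the split, so they will not match the reports above digit for digit.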

##### (2) Kernel principal component analysis (KPCA):

Suppose that the mean of the $d$-dimensional data in the kernel feature space is:
$$ \mu = \frac{1}{n} \sum^n_{i=1} \phi (x_i) = 0 $$

and the covariance is:
$$ C = \frac{1}{n} \sum^n_{i=1} \phi ( x_i) {\phi(x_i)}^T $$

Thus, eigen-decomposition is as follows:
$$ C \bf{\nu} = \lambda \bf{\nu} $$

(a) Show that the $j^{th}$ eigenvector can be expressed as a linear combination of features:

$$ {\bf{\nu}}_j = \sum^n_{i=1} \alpha_{ji} \phi(x_i), $$
where $\alpha_{ji}$ is a coefficient.

<br>

(b) Show that the coefficient $\alpha_{ji}$ is obtained from the eigenvector of the kernel matrix:

$$ K \alpha_j = n\lambda_j \alpha_j, $$
where $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j) $ 
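Parts (a) and (b) can be checked numerically when the feature map is known explicitly. The sketch below assumes a small quadratic feature map `phi` (a hypothetical choice made only so that $C$ can be formed directly) and random 2-D data. It verifies that the nonzero eigenvalues of $K$ equal $n\lambda_j$ and that $\sum_i \alpha_{ji} \phi(x_i)$ is an eigenvector of $C$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # hypothetical data


def phi(X):
    """Explicit quadratic feature map (an assumption for this check)."""
    return np.column_stack(
        [X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]]
    )


# Center the features so that the zero-mean assumption holds
Phi = phi(X)
Phi = Phi - Phi.mean(axis=0)
n, D = Phi.shape

C = Phi.T @ Phi / n   # covariance in feature space (D x D)
K = Phi @ Phi.T       # kernel matrix (n x n)

# Eigenvalues sorted in descending order
lam_C = np.linalg.eigvalsh(C)[::-1]
lam_K, A = np.linalg.eigh(K)
lam_K, A = lam_K[::-1], A[:, ::-1]

# (b): the top D eigenvalues of K are n times the eigenvalues of C
eigenvalues_match = np.allclose(lam_K[:D], n * lam_C)

# (a): v = sum_i alpha_i phi(x_i) built from the top eigenvector of K
alpha = A[:, 0]
v = Phi.T @ alpha
is_eigenvector = np.allclose(C @ v, lam_C[0] * v)

print(eigenvalues_match, is_eigenvector)
```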

<br>

(c) Show that the zero-meaned kernel matrix is represented as follows:

$$ \tilde{K} = K - \bf{1}_{1/n} K - K \bf{1}_{1/n} + \bf{1}_{1/n} K \bf{1}_{1/n}, $$
where $\bf{1}_{1/n}$ is a matrix with all elements $1/n$.
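The centering identity in (c) can be verified by comparing it with the Gram matrix of explicitly centered features. The sketch below uses the symmetric four-term form $K - \bf{1}_{1/n} K - K \bf{1}_{1/n} + \bf{1}_{1/n} K \bf{1}_{1/n}$, with random uncentered features standing in for $\phi(x_i)$ (an assumption for illustration), and cross-checks against scikit-learn's `KernelCenterer`.

```python
import numpy as np
from sklearn.preprocessing import KernelCenterer

rng = np.random.default_rng(0)

# Hypothetical uncentered features phi(x_i), one row per sample
Phi = rng.normal(size=(20, 4)) + 3.0
n = Phi.shape[0]

K = Phi @ Phi.T                      # uncentered kernel matrix
one_n = np.full((n, n), 1.0 / n)     # matrix with every element equal to 1/n

# Centered kernel from the identity
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Reference: center the features first, then form the Gram matrix
Phi_centered = Phi - Phi.mean(axis=0)
K_direct = Phi_centered @ Phi_centered.T

# scikit-learn applies the same centering to a precomputed kernel
K_sklearn = KernelCenterer().fit_transform(K)

print(np.allclose(K_tilde, K_direct), np.allclose(K_tilde, K_sklearn))
```

Note that $\bf{1}_{1/n} K$ averages over rows and $K \bf{1}_{1/n}$ averages over columns, so the two middle terms are transposes of each other rather than identical matrices.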

<br>

(d) Show that the projection of any data point $x$ onto the $j^{th}$ principal component can be represented as:

$$ y_j = \sum^n_{i=1} \alpha_{ji} K(x, x_i), j = 1, \cdots, d $$
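The projection formula in (d) can be illustrated by computing $y_j$ twice for a new point: once from kernel evaluations only, and once from the explicit eigenvectors $\nu_j$. This is a sketch under assumptions: the quadratic feature map `phi`, the random data, and the scaling of $\alpha_j$ (chosen so that each $\nu_j$ has unit norm) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)


def phi(X):
    """Explicit quadratic feature map (an assumption for this check)."""
    X = np.atleast_2d(X)
    return np.column_stack(
        [X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]]
    )


X = rng.normal(size=(25, 2))          # hypothetical training data
mean_phi = phi(X).mean(axis=0)
Phi = phi(X) - mean_phi               # centered training features

K = Phi @ Phi.T
lam_K, A = np.linalg.eigh(K)
top = np.argsort(lam_K)[::-1][:2]     # indices of the two largest eigenvalues

# Scale alpha_j so that nu_j = sum_i alpha_ji phi(x_i) has unit norm:
# ||nu_j||^2 = alpha_j^T K alpha_j, which equals the eigenvalue for unit alpha_j
alphas = A[:, top] / np.sqrt(lam_K[top])

# New point, centered with the training mean in feature space
x_new = rng.normal(size=2)
phi_new = phi(x_new)[0] - mean_phi

# Kernel route: y_j = sum_i alpha_ji K(x, x_i)
k_new = Phi @ phi_new
y_kernel = alphas.T @ k_new

# Explicit route: y_j = nu_j^T phi(x)
V = Phi.T @ alphas
y_explicit = V.T @ phi_new

print(y_kernel, y_explicit)
```

Both routes give the same coordinates, which is why KPCA never needs $\phi$ explicitly.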

<br>

(e) Apply the KPCA to the MNIST digit data for two dimensional feature extraction and compare the results with (1-c).

<br>

(f) Apply the LR and SVM classifiers to the extracted features from (e) and compare the classification performance with (1-d)


Apply KPCA

In [40]:
from sklearn.decomposition import KernelPCA
# kernel is left at its default ('linear'), so this projection matches ordinary PCA
kpca = KernelPCA(n_components=2)
X_train_kpca = kpca.fit_transform(X_train_std)
X_test_kpca = kpca.transform(X_test_std)


Apply LR and SVC to the feature extracted by KPCA

In [41]:
lr_kpca = LogisticRegression(solver='lbfgs', multi_class = 'auto')
lr_kpca = lr_kpca.fit(X_train_kpca, Y_train)
svc_kpca = svm.SVC()
svc_kpca = svc_kpca.fit(X_train_kpca, Y_train)

Compare Performance

In [42]:
print(classification_report(Y_test, lr_kpca.predict(X_test_kpca)))
print(classification_report(Y_test, svc_kpca.predict(X_test_kpca)))


              precision    recall  f1-score   support

           0       0.72      0.76      0.74        54
           1       0.43      0.60      0.50        55
           2       0.53      0.74      0.62        53
           3       0.44      0.49      0.46        55
           4       0.88      0.93      0.90        54
           5       0.32      0.22      0.26        55
           6       0.86      0.80      0.83        54
           7       0.67      0.78      0.72        54
           8       0.50      0.15      0.24        52
           9       0.34      0.30      0.32        54

    accuracy                           0.58       540
   macro avg       0.57      0.58      0.56       540
weighted avg       0.57      0.58      0.56       540

              precision    recall  f1-score   support

           0       0.74      0.80      0.77        54
           1       0.49      0.75      0.59        55
           2       0.71      0.60      0.65        53
           3       0.43      0.49     

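KPCA gives results very similar to PCA, or slightly higher accuracy. This is expected: `KernelPCA` was used with its default linear kernel, which is equivalent to ordinary PCA up to the sign of each component.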

On the features extracted by KPCA, SVC achieves higher accuracy than LR.
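The similarity between the KPCA and PCA results can be checked directly, and a nonlinear kernel can be tried for contrast. The sketch below is an illustration under assumptions: it uses only the first 300 samples to keep the kernel matrix small, and the RBF `gamma=0.01` is an untuned, hypothetical value.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

# Assumption: a 300-sample subset, standardized on its own
X = StandardScaler().fit_transform(datasets.load_digits().data[:300])

Z_pca = PCA(n_components=2, svd_solver="full").fit_transform(X)

# Linear kernel (the KernelPCA default): should reproduce PCA up to sign
Z_lin = KernelPCA(
    n_components=2, kernel="linear", eigen_solver="dense"
).fit_transform(X)
same_up_to_sign = np.allclose(np.abs(Z_pca), np.abs(Z_lin), atol=1e-6)

# RBF kernel: a genuinely nonlinear embedding (gamma is an assumed value)
Z_rbf = KernelPCA(
    n_components=2, kernel="rbf", gamma=0.01, eigen_solver="dense"
).fit_transform(X)

print("linear KPCA matches PCA up to sign:", same_up_to_sign)
print("RBF embedding shape:", Z_rbf.shape)
```

Passing `Z_rbf`-style features to the LR and SVM classifiers would give a comparison against PCA that is not identical by construction, although the result depends on the choice of `gamma`.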