# MLCC - Laboratory 3 - Dimensionality reduction and feature selection
In this laboratory we will address the problem of data analysis with reference to a classification problem.
Follow the instructions below.

Import all the functions from the file "lab3ImpFunction.py" by: <br>
`from lab3ImpFunction import *` <br>
Also import NumPy (in case `lab3ImpFunction` does not already expose `np`) and pyplot for plotting: <br>
`import numpy as np` <br>
`import matplotlib.pyplot as plt`

## 1. Warm up - data generation

* **1.A** Generate a training and a test set of D-dimensional points (N points for each class), with `N=100, D=30`. Only two of these dimensions will be meaningful; the remaining ones will be noise.

    `n = 100`
    
    `dim = 30`

    `Xtr, Ytr = MixGauss(means=[[1,-1],[1,1]],sigmas=[[0.7], [0.7]],n=n)`

    `Xts, Yts = MixGauss(means=[[1,-1],[1,1]],sigmas=[[0.7], [0.7]],n=n)`

    `Ytr = 2*np.mod(Ytr, 2)-1`

    `Yts = 2*np.mod(Yts, 2)-1`
    
    
* **1.B** You may want to plot the relevant variables of the data

     `plt.scatter(Xtr[:,0], Xtr[:,1], s=30, c=np.squeeze(Ytr), alpha=0.5)`
     
     `plt.title('Plot data', fontsize=14, color='red')`
     
     `plt.show()`
     
     
* **1.C** The remaining variables will be generated as Gaussian noise:

    `sigma_noise = 0.01`
    
    `Xtr_noise = sigma_noise * np.random.randn(2*n, dim-2)`
    
    `Xts_noise = sigma_noise * np.random.randn(2*n, dim-2)`
    
    To compose the final data matrix, execute:
    
    `Xtr = np.concatenate((Xtr, Xtr_noise), axis=1)`
    
    `Xts = np.concatenate((Xts, Xts_noise), axis=1)`
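The construction in section 1 can be sketched without the lab helpers. The block below is only an illustration: `make_two_gaussians` is a hypothetical stand-in for `MixGauss` (whose exact behavior should be checked with `help(MixGauss)`), and it assumes isotropic Gaussians with `n` points per class.

```python
import numpy as np

def make_two_gaussians(means, sigma, n, rng):
    """Hypothetical stand-in for MixGauss: n isotropic Gaussian points per class."""
    means = np.asarray(means, dtype=float)
    X = np.vstack([m + sigma * rng.standard_normal((n, means.shape[1])) for m in means])
    Y = np.repeat(np.arange(len(means)), n)   # class index 0, 1, ... for each point
    return X, Y

rng = np.random.default_rng(0)
n, dim, sigma_noise = 100, 30, 0.01

# Two meaningful dimensions: the classes differ only in the second coordinate.
X_rel, Y = make_two_gaussians([[1, -1], [1, 1]], 0.7, n, rng)
Y = 2 * np.mod(Y, 2) - 1                      # map class indices {0, 1} to labels {-1, +1}

# Remaining dim-2 dimensions: pure Gaussian noise with small standard deviation.
X_noise = sigma_noise * rng.standard_normal((2 * n, dim - 2))
X = np.concatenate((X_rel, X_noise), axis=1)

print(X.shape, np.unique(Y))
```

The resulting matrix has one row per point and `dim` columns, of which only the first two carry information about the label.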


## 2. Principal Component Analysis

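* **2.A** Compute the principal components of the data (see `help(PCA)`). The snippets below assume that the projected data are stored in `X_proj` and the eigenvalues in `d`.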


* **2.B** Plot the first two components of `X_proj` using the following lines:

    `plt.scatter(X_proj[:,0], X_proj[:,1], s=30, c=np.squeeze(Ytr), alpha=0.5)`
    
    `plt.show()`
    
    
    
* **2.C** Try now with the first 3 components, by using

    `fig = plt.figure()`

    `ax = fig.add_subplot(projection='3d')`

    `x = X_proj[:,0].real`

    `y = X_proj[:,1].real`

    `z = X_proj[:,2].real`

    `ax.scatter(x, y, z, c=np.squeeze(Ytr), marker='o')`

    `ax.set_xlabel('X Label')`

    `ax.set_ylabel('Y Label')`

    `ax.set_zlabel('Z Label')`

    `plt.show()`
    
Reflect on the meaning of the results you obtain.
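For reference, PCA can be sketched as an eigendecomposition of the sample covariance matrix. This is not the lab's `PCA` function: `pca_sketch`, its return order, and the toy data are assumptions made for illustration. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns real values, so no `.real` is needed here.

```python
import numpy as np

def pca_sketch(X, k):
    """Hypothetical PCA: top-k eigenpairs of the sample covariance and the projected data."""
    Xc = X - X.mean(axis=0)                  # center each column
    cov = Xc.T @ Xc / X.shape[0]             # sample covariance, shape (dim, dim)
    evals, evecs = np.linalg.eigh(cov)       # real eigenpairs, eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    V, d = evecs[:, order], evals[order]
    return V, d, Xc @ V                      # eigenvectors, eigenvalues, projected data

rng = np.random.default_rng(0)
# Toy data: two informative columns (std 2 and 1) plus 28 low-variance noise columns.
X = np.hstack([rng.standard_normal((200, 2)) * [2.0, 1.0],
               0.01 * rng.standard_normal((200, 28))])

V, d, X_proj = pca_sketch(X, k=3)
print(np.sqrt(d))                            # square roots of the three largest eigenvalues
print(int(np.abs(V[:, 0]).argmax()))         # variable with the largest weight in component 1
```

In this toy setting the third eigenvalue is tiny compared with the first two, which is why a third axis in the 3D plot adds little when the noise is small.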
    
    
    
* **2.D** Display the square root of the first 10 eigenvalues with `print(np.sqrt(d[:10]))`. Then plot the magnitude of all the eigenvalues:

    `plt.scatter(range(dim), abs(d))`
    
    `plt.show()`



* **2.E** Repeat the above steps with datasets generated using different values of `sigma_noise` `(0, 0.01, 0.1, 0.5, 1, 1.5, 2)`. To what extent is data visualization by PCA affected by the noise?
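One way to quantify the effect of the noise, offered here only as a sketch, is to measure how much of the two leading principal directions lies on the two meaningful variables. The data generation below mimics section 1 with plain NumPy rather than `MixGauss`, and the `weights` dictionary is a name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 100, 30

# Two informative columns: class means (1, -1) and (1, 1), standard deviation 0.7.
labels = np.repeat([-1.0, 1.0], n)
X_rel = np.column_stack([np.ones(2 * n), labels]) + 0.7 * rng.standard_normal((2 * n, 2))

weights = {}
for sigma_noise in (0, 0.01, 0.1, 0.5, 1, 1.5, 2):
    X = np.hstack([X_rel, sigma_noise * rng.standard_normal((2 * n, dim - 2))])
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / (2 * n))   # eigenvalues in ascending order
    top2 = evecs[:, -2:]                                  # eigenvectors of the two largest eigenvalues
    # Each eigenvector has unit norm, so the total squared weight of the pair is 2.
    weights[sigma_noise] = np.sum(top2[:2, :] ** 2) / 2.0
    print(f"sigma_noise={sigma_noise:<4}  weight on relevant variables={weights[sigma_noise]:.2f}")
```

A value close to 1 means the first two components are essentially the meaningful variables; as the noise variance exceeds the variance of the meaningful variables, the leading components align with noise directions instead.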

## 3. Variable selection

* **3.A** Use the data generated in section 1. Standardize the data matrix, so that each column has mean 0 and standard deviation 1

    `m = np.mean(Xtr, axis=0)`

    `s = np.std(Xtr, axis=0)`

    `Xtr = (Xtr - m) / s`

    Do the same for `Xts`, using the mean and standard deviation computed on `Xtr` (that is, the `m` and `s` above).
    

* **3.B** Use the orthogonal matching pursuit algorithm (type `help(OMatchingPursuit)`)


* **3.C** You may want to check the predicted labels on the test set

    `Ypred = np.sign(Xts.dot(w))`
    
    `err = calcErr(Yts, Ypred)`
    
    and plot the coefficients `w` with `plt.scatter(range(dim), np.abs(w))`. How does the error change with the number of iterations of the method?


* **3.D** Repeat the experiment, this time using a dataset where the first two dimensions are Gaussians with `means=[[1,1],[-1,-1]]`. Plot the coefficients `w` with `plt.scatter(range(dim), np.abs(w))`. What difference do you observe?


* **3.E** Use the function `holdoutCVOMP` to find the best number of iterations with `intIter = range(dim)` (and, for instance, `perc=0.75`, `nrip=20`). Plot the training and validation errors with the following lines of code:

    `it, Vm, Vs, Tm, Ts = holdoutCVOMP(Xtr, Ytr, perc, nrip, intIter)`

    `plt.plot(intIter, Tm, 'r+')`

    `plt.plot(intIter, Vm, 'b+')`

    `plt.title('Cross validation results', fontsize=20, color='red')`

    `plt.xlabel('Number of iterations', fontsize=12, color='red')`
    
    `plt.ylabel('error', fontsize=12, color='red')`

    `plt.show()`
    
    What is the behavior of the training and the validation errors with respect to the number of iterations?
    
    
* **3.F** Try increasing the number of relevant variables (d = 3, 5, ...) and the corresponding standard deviation of the Gaussians, and see how this change is reflected in the cross-validation.
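The steps of section 3 can be sketched end to end with plain NumPy. The block below is an illustration under stated assumptions, not the lab's implementation: `omp_sketch` and `holdout_sketch` are hypothetical names, and the internals of `OMatchingPursuit` and `holdoutCVOMP` may differ. The sketch greedily selects the column most correlated with the current residual, refits least squares on the selected columns, and estimates the validation error over random splits.

```python
import numpy as np

def omp_sketch(X, Y, n_iter):
    """Hypothetical OMP: greedily add the column most correlated with the residual."""
    dim = X.shape[1]
    w, active, residual = np.zeros(dim), [], Y.astype(float)
    for _ in range(n_iter):
        scores = np.abs(X.T @ residual) / np.linalg.norm(X, axis=0)
        if active:
            scores[active] = -np.inf             # never pick the same column twice
        active.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(X[:, active], Y, rcond=None)  # refit on the active set
        w = np.zeros(dim)
        w[active] = coef
        residual = Y - X @ w
    return w

def holdout_sketch(X, Y, perc, n_rep, iters, rng):
    """Mean validation error for each iteration count, over n_rep random splits."""
    n_train = int(perc * len(Y))
    val_err = np.zeros((n_rep, len(iters)))
    for r in range(n_rep):
        idx = rng.permutation(len(Y))
        tr, va = idx[:n_train], idx[n_train:]
        for j, it in enumerate(iters):
            w = omp_sketch(X[tr], Y[tr], it)
            val_err[r, j] = np.mean(np.sign(X[va] @ w) != Y[va])
    return val_err.mean(axis=0)

rng = np.random.default_rng(0)
n, dim = 100, 30
Y = np.repeat([-1.0, 1.0], n)
# Two informative columns (classes differ in the second one) plus small Gaussian noise.
X = np.hstack([np.column_stack([np.ones(2 * n), Y]) + 0.7 * rng.standard_normal((2 * n, 2)),
               0.01 * rng.standard_normal((2 * n, dim - 2))])
X = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize: mean 0, std 1 per column

w1 = omp_sketch(X, Y, 1)
iters = list(range(1, 11))
Vm = holdout_sketch(X, Y, perc=0.75, n_rep=5, iters=iters, rng=rng)
print("first selected variable:", int(np.argmax(np.abs(w1))))
print("best number of iterations:", iters[int(np.argmin(Vm))])
```

Note that after standardization the noise columns have the same scale as the meaningful ones, so the selection is driven by correlation with the labels rather than by variance.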