# How to Scikit-Learn

## Dimension Reduction

### PCA

User guide: https://scikit-learn.org/stable/modules/decomposition.html

PCA finds successive orthogonal components that each explain as much as possible of the remaining variance of the centered dataset.
A very simple video on the topic: https://www.youtube.com/watch?v=FgakZw6K1QQ

The scikit-learn API documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

    from sklearn.decomposition import PCA
    pca = PCA(n_components=2)  # an int, 'mle', or a float strictly between 0 and 1
    pca.fit(X)

In `n_components` you can specify:
* an integer: the number of components to keep
* 'mle' to let Minka's MLE algorithm choose the number of components for you https://vismod.media.mit.edu/tech-reports/TR-514.pdf
* a float strictly between 0 and 1: the fraction of the total variance that the kept components must explain
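
As a rough illustration of the three options above, the sketch below fits PCA on synthetic data and prints how many components each setting keeps. The data, the 0.95 threshold and the variable names are assumptions made for the example, not part of the original notes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for this sketch):
# 3 latent factors spread over 10 features, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

pca_int = PCA(n_components=2).fit(X)                          # keep exactly 2 components
pca_mle = PCA(n_components='mle', svd_solver='full').fit(X)   # let Minka's MLE choose
pca_var = PCA(n_components=0.95, svd_solver='full').fit(X)    # explain at least 95% of the variance

for name, model in [('int', pca_int), ('mle', pca_mle), ('float', pca_var)]:
    # n_components_ is the number of components actually kept after fitting
    print(name, model.n_components_, model.explained_variance_ratio_.sum())
```

The fitted attribute `n_components_` (with a trailing underscore) holds the number that was finally selected, which is the value to inspect for the 'mle' and float settings.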

Useful attributes
* components_ : array, shape (n_components, n_features) -- Gives you the n_components components (rows) and the contribution of each feature (columns)
* explained_variance_ and explained_variance_ratio_ : arrays, shape (n_components,) -- Give you the variance explained by each component, in absolute terms and as a fraction of the total variance
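
A small sketch of what these attributes look like in practice. The iris dataset is only a convenient stand-in chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features (example data, an assumption)
pca = PCA(n_components=2).fit(X)

print(pca.components_.shape)          # (n_components, n_features)
print(pca.explained_variance_)        # variance carried by each component
print(pca.explained_variance_ratio_)  # the same, as a fraction of the total variance

# The ratio is the explained variance divided by the total variance of the data
total_variance = X.var(axis=0, ddof=1).sum()
ratio_by_hand = pca.explained_variance_ / total_variance

# The rows of components_ are orthonormal directions in feature space
gram = pca.components_ @ pca.components_.T
```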

Some Methods
* fit(X) : fits the model with X
* fit_transform(X) : fits AND returns the transformed data
* transform(X) : returns the transformed data using the fitted model
* inverse_transform(X) : transform your data back to the original space
* get_covariance() : computes the covariance matrix $cov \in \mathscr{M}_{n_{features}}$  
$$cov = components^T \cdot S^2 \cdot components + \sigma^2 \cdot I_{n_{features}}$$
where $S^2$ contains the explained variances, and $\sigma^2$ contains the noise variances.
* get_precision() : computes the precision (inverse of the covariance)
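
The methods above can be checked against each other on a small example. This is a sketch on synthetic data (the data and all variable names are assumptions); it only verifies relationships that should hold, it does not show how the methods are implemented.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data

# fit_transform(X) gives the same projection as fit(X) followed by transform(X)
X_a = PCA(n_components=2).fit_transform(X)
pca = PCA(n_components=2).fit(X)
X_b = pca.transform(X)
same_projection = np.allclose(X_a, X_b)

# inverse_transform maps back to the original space;
# with 2 of 5 components kept it is only an approximation of X
X_back = pca.inverse_transform(X_b)
reconstruction_error = np.mean((X - X_back) ** 2)

# With all components kept, get_covariance() matches the sample covariance
full = PCA().fit(X)
cov_matches = np.allclose(full.get_covariance(), np.cov(X, rowvar=False))

# get_precision() is the inverse of get_covariance()
product = pca.get_precision() @ pca.get_covariance()
precision_ok = np.allclose(product, np.eye(X.shape[1]), atol=1e-6)

print(same_projection, cov_matches, precision_ok, reconstruction_error)
```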

If you only need the first few components of a large dataset, you can use
* svd_solver='randomized' : approximates only the n_components leading components with a randomized SVD instead of computing the full decomposition
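
To get a feel for the trade-off, the sketch below compares the exact and the randomized solver on synthetic data with a few dominant directions. The data shape and the tolerance are assumptions; on data without a clear gap in the spectrum the approximation may be less tight.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 2000 samples, 100 features, driven by 5 strong latent factors plus noise (synthetic)
X = 10 * rng.normal(size=(2000, 5)) @ rng.normal(size=(5, 100)) + rng.normal(size=(2000, 100))

exact = PCA(n_components=5, svd_solver='full').fit(X)
approx = PCA(n_components=5, svd_solver='randomized', random_state=0).fit(X)

print(exact.explained_variance_ratio_)
print(approx.explained_variance_ratio_)
```

Setting `random_state` makes the randomized result reproducible from run to run.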

In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

## X is the dataset : rows are instances, columns are features ##

n_components = 2  # number of components to keep

# Fit the model first, then transform with that same fitted model
pca = PCA(n_components).fit(X)
X_pca = pca.transform(X)

# Or do both in a single call
X_pca = PCA(n_components).fit_transform(X)

# This function plots an elbow curve representing the cumulative variance explained by the components
def plot_elbow(X, n_components=10):
    pca = PCA(n_components).fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # x axis starts at 1 so that it reads as a number of components
    plt.plot(np.arange(1, len(cumulative) + 1), cumulative)
    plt.xlabel('number of components')
    plt.ylabel('cumulative explained variance')
    plt.title('Ratio of variance explained by the number of components')
    plt.show()

# A more general implementation for visualizing data is available under Kernel PCA

#### Incremental PCA

For datasets too large to fit in memory, you can process the data in chunks.
Incremental PCA computes estimates of the components and noise variances from one batch, then updates them with each following batch. <br>
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html
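
A minimal sketch of the batch-by-batch update, assuming the chunks arrive one at a time. Here they are simply slices of an in-memory random array, which stands in for data read from disk.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # stand-in for data too large to load at once

ipca = IncrementalPCA(n_components=3)
for batch in np.array_split(X, 10):    # 10 chunks of 100 rows each
    ipca.partial_fit(batch)            # update the running estimates with this chunk

X_reduced = ipca.transform(X[:5])      # project new rows with the fitted model
print(ipca.n_samples_seen_, X_reduced.shape)
```

Each chunk passed to `partial_fit` needs at least `n_components` rows. When the data is a memory-mapped array, calling `fit` with the `batch_size` constructor argument is an alternative that does the chunking for you.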

#### Kernel PCA

Documentation : https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html

- You can use a non-linear kernel to separate datasets that are not linearly separable : https://scikit-learn.org/stable/modules/metrics.html

    - linear : $$ K(x,x') = x^Tx' $$
    - poly : $$ K(x,x') = ( \color {green} \gamma x^T x' + \color {blue} c_0)^\color {red}d $$
    - sigmoid : $$ K(x,x') = \tanh( \color {green} \gamma x^T x' + \color {blue} c_0 ) \;\;\; $$
    - Radial basis function (RBF) : $$ K(x,x') = \exp(- \color {green} \gamma \|{x-x'}\|^2) $$
    - cosine : $$ K(x,x') = \frac {x^T x'}{\|x\| \|x'\|} $$
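
To illustrate why a non-linear kernel can help, here is a sketch on two concentric circles. The dataset, `gamma=10` and the use of a logistic regression as a linear "separability probe" are all assumptions chosen for the example; the accuracies are training accuracies, used only for comparison.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not separable by a straight line in the original space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                                 # linear projection
X_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)   # RBF kernel projection

# Training accuracy of a linear classifier on each projection
acc_lin = LogisticRegression().fit(X_lin, y).score(X_lin, y)
acc_rbf = LogisticRegression().fit(X_rbf, y).score(X_rbf, y)
print(acc_lin, acc_rbf)
```

Linear PCA only rotates the circles, so a linear classifier should stay close to chance, while the RBF projection should make the two classes much easier to split.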

You can tune some hyperparameters:

$\color {green} \gamma $ <br>
`gamma  (default = 1/n_features) is used by poly / sigmoid / rbf`<br>
$\color {blue} {c_0} $ <br>
`coef0  (default = 1)            is used by poly / sigmoid` <br>
$\color {red} d $ <br>
`degree (default = 3)            is used by poly`<br>
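
As a sanity check on how these three parameters enter the formulas, the sketch below evaluates each kernel by hand for one pair of vectors and compares it with the helpers in `sklearn.metrics.pairwise`. The vectors and parameter values are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    cosine_similarity, linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))     # one sample with 4 features
x2 = rng.normal(size=(1, 4))    # a second sample
gamma, coef0, degree = 0.5, 1.0, 3   # arbitrary values for the check

dot = (x @ x2.T).item()
by_hand = {
    'linear': dot,
    'poly': (gamma * dot + coef0) ** degree,
    'sigmoid': np.tanh(gamma * dot + coef0),
    'rbf': np.exp(-gamma * np.sum((x - x2) ** 2)),
    'cosine': dot / (np.linalg.norm(x) * np.linalg.norm(x2)),
}
by_sklearn = {
    'linear': linear_kernel(x, x2).item(),
    'poly': polynomial_kernel(x, x2, degree=degree, gamma=gamma, coef0=coef0).item(),
    'sigmoid': sigmoid_kernel(x, x2, gamma=gamma, coef0=coef0).item(),
    'rbf': rbf_kernel(x, x2, gamma=gamma).item(),
    'cosine': cosine_similarity(x, x2).item(),
}
for name in by_hand:
    print(name, by_hand[name], by_sklearn[name])
```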


More info on kernels : http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/

In [4]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

# This function plots the projection of the data on the 1, 2 or 3 main components and returns
# the fitted KernelPCA, using whichever kernel and parameters you give it.
# y is used to colour the points (the colour scale is fixed to [-3, 3]).

def plot_pca(X, y, kernel='linear', n_components=2, gamma=None, coef0=1, degree=3):
    # kernel and the other options are passed by keyword; coef0 and degree default to the library defaults
    pca = KernelPCA(n_components=n_components, kernel=kernel, gamma=gamma, degree=degree, coef0=coef0)
    X_pca = pca.fit_transform(X)
    print("original shape:   ", X.shape)
    print("transformed shape:", X_pca.shape)
    colors = np.asarray(y)  # accepts a pandas Series, a list or an array
    if n_components == 1:
        sc = plt.scatter(X_pca[:, 0], np.zeros(len(X_pca)), alpha=0.2, c=colors, vmin=-3, vmax=3)
        plt.xlabel('Component 1')
        plt.title("data projected on the main component \n using " + kernel + " kernel")
    elif n_components == 2:
        sc = plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.2, c=colors, vmin=-3, vmax=3)
        plt.xlabel('Component 1')
        plt.ylabel('Component 2')
        plt.title("data projected on the 2 main components \n using " + kernel + " kernel")
    elif n_components == 3:
        fig = plt.figure()
        ax = fig.add_subplot(111, projection='3d')
        sc = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], alpha=0.2, c=colors, vmin=-3, vmax=3)
        ax.set_xlabel('Component 1')
        ax.set_ylabel('Component 2')
        ax.set_zlabel('Component 3')
        plt.title("data projected on the 3 main components \n using " + kernel + " kernel")
    else:
        # Nothing to plot beyond 3 dimensions: return the fitted model only
        print("how am I supposed to show you that with your 2-D eyes, beta !")
        return pca
    plt.colorbar(sc)  # pass the scatter explicitly so the colorbar also works in 3-D
    plt.show()
    return pca

#### Sparse PCA

You can use Sparse PCA to obtain sparse components; sparsity is enforced through a Lasso ($l_1$) penalty on the components.
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html#sklearn.decomposition.SparsePCA
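
A hedged sketch of the difference with ordinary PCA, on synthetic data where each latent factor only drives its own group of features. The data layout and `alpha=2` are assumptions; how many loadings end up exactly zero depends on `alpha` and on the data.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
# Synthetic data: 3 latent factors, each driving its own group of 5 features
factors = rng.normal(size=(200, 3))
loadings = np.kron(np.eye(3), np.ones((1, 5)))     # shape (3, 15), block structure
X = factors @ loadings + 0.1 * rng.normal(size=(200, 15))

pca = PCA(n_components=3).fit(X)
spca = SparsePCA(n_components=3, alpha=2, random_state=0).fit(X)

# Fraction of loadings that are exactly zero in each model
dense_zero_fraction = np.mean(pca.components_ == 0)
sparse_zero_fraction = np.mean(spca.components_ == 0)
print(dense_zero_fraction, sparse_zero_fraction)
```

Larger values of `alpha` push more loadings to zero, at the cost of explaining less variance.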



#### Truncated SVD

If you have a large sparse dataset that you don't want to center (centering would make it dense and can cause an out-of-memory error), use this algorithm (e.g. on term count or tf-idf matrices).
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
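
A minimal sketch on a toy tf-idf matrix. The corpus is invented for the example and is far too small to be meaningful; it only shows that the sparse matrix is reduced directly, without centering or converting it to a dense array.

```python
from scipy.sparse import issparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for this sketch)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors watch the stock market",
]

X_tfidf = TfidfVectorizer().fit_transform(corpus)     # sparse matrix: documents x vocabulary
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_tfidf)                # dense array: documents x 2

print(issparse(X_tfidf), X_tfidf.shape, X_reduced.shape)
```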



### LLE



### MDS



### Isomap



### t-Distributed Stochastic Neighbor Embedding



### Linear Discriminant Analysis



## Cross validation

In [65]:
import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9,10], [11,12]])  # your features (inputs)
y = np.array([0, 1, 2, 3, 4, 5])                              # your targets (values to predict)
kf = KFold(n_splits=3)   # 3-fold split
print(X.shape, y.shape)
scores=list()

for train_index, test_index in kf.split(X,y):
    print("TRAINindex:", train_index, "TESTindex:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("TrainSet: \n", X_train, "\n", y_train,"\n TestSet: \n",X_test, "\n",y_test)
    
    # DEFINE A MODEL HERE
    
    # FIT A MODEL HERE ON X_TRAIN + y_train
    
    # EVALUATE MODEL HERE X_TEST + y_test
    
    # STORE THE RESULT in the scores list: scores.append(accuracy)
    
# scores is still empty at this point, so this prints nan until a model is filled in above
print('Estimated Accuracy %.3f (%.3f)' % (np.mean(scores), np.std(scores)))


(6, 2) (6,)
TRAINindex: [2 3 4 5] TESTindex: [0 1]
TrainSet: 
 [[ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]] 
 [2 3 4 5] 
 TestSet: 
 [[1 2]
 [3 4]] 
 [0 1]
TRAINindex: [0 1 4 5] TESTindex: [2 3]
TrainSet: 
 [[ 1  2]
 [ 3  4]
 [ 9 10]
 [11 12]] 
 [0 1 4 5] 
 TestSet: 
 [[5 6]
 [7 8]] 
 [2 3]
TRAINindex: [0 1 2 3] TESTindex: [4 5]
TrainSet: 
 [[1 2]
 [3 4]
 [5 6]
 [7 8]] 
 [0 1 2 3] 
 TestSet: 
 [[ 9 10]
 [11 12]] 
 [4 5]
Estimated Accuracy nan (nan)
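
The nan above comes from the empty `scores` list: the loop only contains placeholders. A filled-in version might look like the sketch below. The iris dataset, the logistic regression model and the shuffling are assumptions made for the example, not choices from the original notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
# Shuffle before splitting: iris is sorted by class, so unshuffled folds would be unbalanced
kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = []

for train_index, test_index in kf.split(X, y):
    model = LogisticRegression(max_iter=1000)        # define a fresh model for each fold
    model.fit(X[train_index], y[train_index])        # fit on the training part
    y_pred = model.predict(X[test_index])            # evaluate on the held-out part
    scores.append(accuracy_score(y[test_index], y_pred))

print('Estimated Accuracy %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

# The same loop, written with the built-in helper
shortcut_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
```

The manual loop is useful when you need custom logic per fold; otherwise `cross_val_score` gives the same per-fold accuracies in one call.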


