***SkLearn Reference***
=======================

**Standard Import**
-------------------
```
import sklearn.model_selection as ms
import sklearn.impute as impute
import sklearn.preprocessing as pp
import sklearn.pipeline as pipe
import sklearn.compose as compose
import sklearn.decomposition as decom

import sklearn.metrics as metrics 
```
---
<br>

**Processing Data**
-------------------

*Handle Missing Data*
---------------------
[API - simpleimputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

```
# strategy is one of 'mean', 'median', 'most_frequent', 'constant'
# fill_value is only used when strategy='constant'
imputer = impute.SimpleImputer(strategy='constant', fill_value=10, copy=True)
imputer.fit_transform(X)  # returns a NumPy array
```
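
As a minimal sketch of the imputer in action, the snippet below fills missing values with the column medians. The toy matrix is made up for illustration.

```python
import numpy as np
import sklearn.impute as impute

# Toy matrix with one missing value per column (made-up data).
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [5.0, 60.0]])

imputer = impute.SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)

# statistics_ holds the per-column fill values learned during fit.
print(imputer.statistics_)  # column medians: 3.0 and 20.0
print(X_filled)
```

Calling `imputer.transform(X_new)` later reuses the medians learned here, which is what you want for a test set.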

*Transformation*
----------------
[API - preprocessing](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)<br>
[API - polyfeatures](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)<br>
[API - minmax](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)<br>
[API - standard](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)<br>
[API - oneHot](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)<br>
[API - getdummies](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

```
pp.PolynomialFeatures(degree, include_bias)
pp.MinMaxScaler()
pp.StandardScaler()

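one_hot = pp.OneHotEncoder(sparse_output=False)  # returns a NumPy array if sparse_output=False, else a sparse matrix (the argument was named `sparse` before scikit-learn 1.2)
one_hot.get_feature_names_out()  # returns the feature names of the transformed data (replaces get_feature_names(), removed in scikit-learn 1.2)

pd.get_dummies(X)  # one-hot encode with pandas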
```
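
A small sketch of one-hot encoding and standard scaling on made-up data. It keeps the encoder's default sparse output and converts it with `.toarray()`, so it does not depend on the name of the sparse argument in your scikit-learn version.

```python
import numpy as np
import sklearn.preprocessing as pp

# One categorical column (made-up values).
colors = np.array([['red'], ['green'], ['red'], ['blue']])

one_hot = pp.OneHotEncoder()                       # sparse output by default
encoded = one_hot.fit_transform(colors).toarray()  # convert to a dense array

print(one_hot.categories_[0])  # learned categories, sorted alphabetically
print(encoded)                 # one column per category

# StandardScaler centers each column to mean 0 and unit variance.
scaled = pp.StandardScaler().fit_transform(np.array([[1.0], [2.0], [3.0]]))
print(scaled.ravel())
```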

*Splitting Data*
----------------
[API - train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

```
ms.train_test_split(X, y, train_size, random_state, shuffle, stratify)
```
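
A short sketch of a stratified split, assuming a made-up imbalanced label vector. `stratify=y` keeps the class proportions the same in both parts.

```python
import numpy as np
import sklearn.model_selection as ms

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)   # 75% / 25% class balance

X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, train_size=0.8, random_state=42, shuffle=True, stratify=y)

# Class counts per split: 12/4 in train and 3/1 in test.
print(np.bincount(y_train), np.bincount(y_test))
```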

*Data Pipeline*
---------------
[API - pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)<br>
[API - columntransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html)

```
data_pipe = pipe.Pipeline([
                            ( 'scaler', pp.StandardScaler() ), 
                            ( 'impute', impute.SimpleImputer(strategy='median') ),
                            ( 'tree_classifier', tree.DecisionTreeClassifier(random_state=42) )
                            ])
# ----------
num_pipeline = pipe.Pipeline([
                        ( 'impute', impute.SimpleImputer(strategy='mean') ),
                        ( 'scaler', pp.StandardScaler() ),
                        ])

full_pipeline = compose.ColumnTransformer([
                        ( 'num', num_pipeline, lst_num ),
                        ( 'cat', pp.OneHotEncoder(), lst_cat ),
                        ], remainder='passthrough')
```
* lst_num and lst_cat are lists holding the names of the numeric and categorical columns to be transformed.

```
full_pipeline.fit_transform(X)
full_pipeline.named_transformers_['cat'].get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
```
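
An end-to-end sketch of the column-wise pipeline on a tiny DataFrame. The column names and values are hypothetical, chosen only to show numeric and categorical columns going through different transformers.

```python
import numpy as np
import pandas as pd
import sklearn.compose as compose
import sklearn.impute as impute
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp

# Hypothetical data: two numeric columns with gaps, one categorical column.
df = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'income': [30.0, 45.0, np.nan, 52.0],
    'city': ['A', 'B', 'A', 'C'],
})
lst_num = ['age', 'income']
lst_cat = ['city']

num_pipeline = pipe.Pipeline([
    ('impute', impute.SimpleImputer(strategy='mean')),
    ('scaler', pp.StandardScaler()),
])
full_pipeline = compose.ColumnTransformer([
    ('num', num_pipeline, lst_num),
    ('cat', pp.OneHotEncoder(), lst_cat),
])

X_prepared = full_pipeline.fit_transform(df)
print(X_prepared.shape)  # 2 scaled numeric columns + 3 one-hot columns
print(full_pipeline.named_transformers_['cat'].categories_)
```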

---
<br>

**Dimensionality Reduction**
----------------------------
* Remember to scale the data before applying dimensionality reduction on the data

```
import sklearn.decomposition as decom
```
[API - pca](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)<br>
[API - incrementalpca](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html)<br>
[API - kernelpca](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html)

| Type            	| Require the data to fit in memory? 	| Notes                                                                                                                            	|
|-----------------	|------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------	|
| Regular PCA     	| Yes                                	|                                                                                                                                  	|
| Randomized PCA  	| Yes                                	| * Quickly finds an approximation of the first d principal components <br>    * Faster than full SVD when d is much smaller than n 	|
| Incremental PCA 	| No                                 	| * Good for large training set that cannot be fitted into memory <br> * Good for applying PCA online                              	|
| Kernel PCA      	|                                    	| * For complex nonlinear projections                                                                                              	|


<strong>Regular PCA</strong>

```
pca = decom.PCA(n_components=int/float, svd_solver='auto')  # creates a PCA object
```
* If n_components is an int, it is the absolute number of dimensions to retain
* If n_components is a float between 0 and 1, it is the fraction of variance to preserve, and PCA computes the minimum number of dimensions required to reach it

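A hedged sketch of the float form of n_components on synthetic data (the data generation and the 0.95 target are assumptions for illustration). `svd_solver='full'` is set explicitly because the float form relies on an exact solver.

```python
import numpy as np
import sklearn.decomposition as decom
import sklearn.preprocessing as pp

rng = np.random.RandomState(0)
# Synthetic data: 5 features that are noisy mixes of 2 latent factors.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))
X_scaled = pp.StandardScaler().fit_transform(X)

# Keep as many components as needed to preserve 95% of the variance.
pca = decom.PCA(n_components=0.95, svd_solver='full')
X_reduced = pca.fit_transform(X_scaled)

print(pca.n_components_)                    # number of dimensions kept
print(pca.explained_variance_ratio_.sum())  # variance actually preserved
```
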
<strong>Randomized PCA</strong>

```
pca = decom.PCA(n_components=int/float, svd_solver= 'randomized')
```
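* By default, svd_solver='auto': scikit-learn picks the solver from the shape of X and n_components, and switches to the randomized PCA algorithm when the data is large and the number of requested components is small relative to it (the exact rule depends on the scikit-learn version, see the PCA API page)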


---
<br>

**Classifiers**
---------------

*Linear Classifiers*
--------------------
```
import sklearn.linear_model as lm
```
<strong>Logistic Regression</strong>

[API - logisticregression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

```
lm.LogisticRegression()  # natively a binary classifier
lm.LogisticRegression(
        multi_class="multinomial",  # deprecated since scikit-learn 1.5, where multinomial is already the default for multiclass targets
        solver="lbfgs",  # a solver that supports the multinomial (softmax) loss
        C=10,  # inverse of the regularization strength: the higher the value, the weaker the regularization
        )  # this is a multiclass classifier, a.k.a. softmax regression

estimator.fit(X, y)
estimator.predict(X)
estimator.predict_proba(X)  # returns probability estimates
estimator.decision_function(X)  # returns confidence scores
estimator.score(X, y)  # returns the mean accuracy
```
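
A minimal sketch of the estimator methods listed above, using the bundled iris dataset as an assumed example. `multi_class` is left at its default so the snippet runs without warnings on recent versions.

```python
import sklearn.datasets as datasets
import sklearn.linear_model as lm
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp

X, y = datasets.load_iris(return_X_y=True)  # 3 classes

# Scale first, then fit a softmax-style logistic regression.
estimator = pipe.make_pipeline(
    pp.StandardScaler(),
    lm.LogisticRegression(C=10, max_iter=1000))
estimator.fit(X, y)

proba = estimator.predict_proba(X[:3])        # one row per sample, one column per class
scores = estimator.decision_function(X[:3])   # raw per-class scores
accuracy = estimator.score(X, y)              # mean accuracy on the training data

print(proba.round(3))
print(accuracy)
```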

<strong>SGD Classifier</strong>

[API - sgdclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

```
lm.SGDClassifier()  # handles multiclass targets directly (internally combines binary classifiers in a one-vs-all scheme)
```

*NN Classifiers*
----------------
```
import sklearn.neural_network as nn
```

<strong>Multi-Layer Perceptron  Classifier</strong>

[API - mlpclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)

```
nn.MLPClassifier(
        hidden_layer_sizes= 10, 
        activation= {'identity', 'logistic', 'tanh', 'relu'}, 
        solver= {'lbfgs', 'sgd', 'adam'}
        )
```

*Tree Classifiers*
------------------
```
import sklearn.tree as tree
```

<strong>Decision Tree Classifier</strong>

[API - dtclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)<br>
[API - plottree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)

```
tree.DecisionTreeClassifier(
        max_depth=10,
        ccp_alpha=0.0,  # cost-complexity pruning strength, a non-negative float
        max_features=None,
        max_leaf_nodes=None,
        min_samples_split=2,
        min_samples_leaf=1
        )

estimator.cost_complexity_pruning_path(X, y)  # returns the pruning path: the effective alphas (ccp_alphas) and the matching leaf impurities

tree.plot_tree(estimator,
                filled=True,
                rounded=True,
                class_names=['negative_class', 'positive_class'],
                )  # draws the tree diagram

```
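
A sketch of how the pruning path can be used: refit one tree per candidate alpha and watch the tree shrink. The breast cancer dataset and the step of 5 through the alphas are assumptions for illustration.

```python
import sklearn.datasets as datasets
import sklearn.tree as tree

X, y = datasets.load_breast_cancer(return_X_y=True)

estimator = tree.DecisionTreeClassifier(random_state=42)
path = estimator.cost_complexity_pruning_path(X, y)

# Refit one tree per candidate alpha; larger alphas prune more nodes.
node_counts = []
for alpha in path.ccp_alphas[::5]:
    alpha = max(alpha, 0.0)  # guard against tiny negative values from rounding
    pruned = tree.DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X, y)
    node_counts.append(pruned.tree_.node_count)

print(node_counts)
```

In practice the alpha would be chosen with cross-validation rather than by node count.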

*Ensemble Classifiers*
----------------------
```
import sklearn.ensemble as ensemble
```

<strong>Random Forest Classifier</strong>

[API - rfclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

```
ensemble.RandomForestClassifier(
        n_estimators=100,
        bootstrap=True,
        max_samples=None,
        oob_score=False,  # if True, uses out-of-bag samples to estimate the generalization accuracy
        n_jobs=None  # number of CPU cores to use for training and prediction (-1 uses all cores)
        )
```

<strong>Voting Classifier</strong>

[API - votingclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)

```
ensemble.VotingClassifier(
                estimators=[ ('name', estimator),.... ],
                voting= {'hard', 'soft'},
                flatten_transform= {True, False}
                )
```
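| flatten_transform 	| voting 	| What calling the transform method on X returns                                                                  	|
|-------------------	|--------	|-----------------------------------------------------------------------------------------------------------------	|
| {False, True}     	| hard   	| Class labels predicted by each estimator, shape (n_samples, n_classifiers)                                      	|
| True              	| soft   	| Class probabilities from all estimators, flattened to shape (n_samples, n_classifiers * n_classes)              	|
| False             	| soft   	| Class probabilities from all estimators, shape (n_classifiers, n_samples, n_classes)                            	|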

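A quick sketch that checks the shapes returned by transform for each setting. The iris dataset and the two base estimators are assumptions for illustration.

```python
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble
import sklearn.linear_model as lm
import sklearn.tree as tree

X, y = datasets.load_iris(return_X_y=True)  # 150 samples, 3 classes
estimators = [
    ('lr', lm.LogisticRegression(max_iter=1000)),
    ('dt', tree.DecisionTreeClassifier(random_state=42)),
]

hard = ensemble.VotingClassifier(estimators, voting='hard').fit(X, y)
soft_flat = ensemble.VotingClassifier(
    estimators, voting='soft', flatten_transform=True).fit(X, y)
soft_raw = ensemble.VotingClassifier(
    estimators, voting='soft', flatten_transform=False).fit(X, y)

print(hard.transform(X).shape)       # (150, 2): one label per estimator
print(soft_flat.transform(X).shape)  # (150, 6): 2 estimators x 3 classes
print(soft_raw.transform(X).shape)   # (2, 150, 3)
```
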
<strong>Bagging Classifier</strong>

[API - baggingclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html)

```
ensemble.BaggingClassifier(
        estimator,  # named base_estimator before scikit-learn 1.2
        n_estimators, 
        max_samples, 
        max_features, 
        bootstrap, 
        bootstrap_features, 
        oob_score
        )
```

<strong>Ada Boost Classifier</strong>

[API - adaboostclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)

```
ensemble.AdaBoostClassifier(
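        estimator,  # named base_estimator before scikit-learn 1.2
        n_estimators=50, 
        learning_rate=1.0, 
        algorithm='SAMME.R'  # 'SAMME.R' was deprecated in scikit-learn 1.4 in favour of 'SAMME'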
        )
```

<strong>Gradient Boosting Classifier</strong>

[API - gbclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)

```
ensemble.GradientBoostingClassifier()
```

*SVM Classifiers*
-----------------
```
import sklearn.svm as svm 
```
* By default, SVM classifiers do not output probabilities (SVC can be asked to estimate them with probability=True)
* SVM classifiers are sensitive to feature scales so remember to scale the features before fitting them 


<strong>Linear Data</strong>

[API - svc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)<br>
[API - linearsvc](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html)

```
svm.SVC()  # natively a binary classifier; multiclass targets are handled with the OvO strategy

svm.SVC(
        kernel='linear', 
        C=1, 
        probability=False)
```

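* This is similar to the LinearSVC class, but the two rely on different underlying implementations (libsvm vs. liblinear), so results can differ slightly. 
* If probability = True, the estimator also provides the predict_proba() and predict_log_proba() methods.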

```
svm.LinearSVC(C=1, loss='hinge')  # linear support vector classifier
```
* C is the inverse of the regularization strength: the higher the value, the weaker the regularization
* The LinearSVC class regularizes the bias term as well, so the training set should be centered first by subtracting its mean
* Scaling the data with StandardScaler during the preprocessing stage takes care of this automatically
* Remember to set loss = 'hinge', as it is not the default value (the default is 'squared_hinge')

<strong>Non-Linear Data</strong>

```
svm.SVC(
        kernel='poly' , 
        degree=2, 
        coef0=1, 
        C=5)
```
* This trains an SVM classifier with a polynomial kernel of the given degree (here 2)
* It gets the same results as if we had added polynomial features, without actually having to add them
* coef0 controls how much the model is influenced by high-degree terms versus low-degree terms
* With degree=d, the poly kernel computes the relationship between each pair of observations in a higher-dimensional space built from polynomial terms up to degree d, and these relationships are used to find the threshold/SVC

```
svm.SVC(
        kernel='rbf', 
        gamma=int, 
        C=0.001)
```
* This trains an SVM classifier with the RBF kernel 
* gamma and C both act like inverse regularization: the higher the value, the weaker the regularization

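A sketch comparing the two kernels on a nonlinear toy problem. The make_moons data and the hyperparameter values are assumptions picked for illustration, and the scores are training accuracy only.

```python
import sklearn.datasets as datasets
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp
import sklearn.svm as svm

X, y = datasets.make_moons(n_samples=200, noise=0.15, random_state=42)

# Scaling sits inside the pipeline because SVMs are sensitive to feature scales.
models = {
    'poly': pipe.make_pipeline(
        pp.StandardScaler(), svm.SVC(kernel='poly', degree=3, coef0=1, C=5)),
    'rbf': pipe.make_pipeline(
        pp.StandardScaler(), svm.SVC(kernel='rbf', gamma=5, C=1)),
}

train_scores = {}
for name, model in models.items():
    model.fit(X, y)
    train_scores[name] = model.score(X, y)  # accuracy on the training data
    print(name, train_scores[name])
```
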

<strong>Computational Complexity</strong>

| Class                    	| Kernel Trick 	| Time                                                                                                                                                                                                                                                                                                  	| Out of Core Support 	|
|--------------------------	|--------------	|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|---------------------	|
| LinearSVC <br> LinearSVR 	| No           	| * Scale   linearly with number of training instances and number of features O(m x n) <br> * m ~ number of instances <br> * n ~ number of features                                                                                                                                                     	| No                  	|
| SVC <br>  SVR            	| Yes          	| * When the number of training instances gets large, it gets very slow <br> * O(m^2 x n) to O(m^3 x n) <br> * Good for complex small to medium datasets with a large number of features <br> * Especially for datasets with sparse features (i.e. with very few nonzero features per instance)    	| No                  	|
| SGDClassifier            	| No           	| * Scale linearly with number of training instances and number of features <br> * O(m x n)                                                                                                                                                                                                     	|     Yes             	|

<br>


*Strategy Conversion*
---------------------

```
import sklearn.multiclass as mc
```
[API - 1v1classifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html)<br>
[API - 1vRclassifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)

```
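mc.OneVsOneClassifier(estimator)  # wraps an estimator and trains one classifier per pair of classes
mc.OneVsRestClassifier(svm.SVC())  # to use SVC as a multiclass classifier with the one-vs-rest strategy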
```
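
A small sketch showing how many underlying classifiers each strategy trains. It assumes the bundled digits dataset (10 classes), truncated to 500 samples to keep it quick.

```python
import sklearn.datasets as datasets
import sklearn.multiclass as mc
import sklearn.svm as svm

X, y = datasets.load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # small subset, still contains all 10 digits

ovr = mc.OneVsRestClassifier(svm.SVC()).fit(X, y)
ovo = mc.OneVsOneClassifier(svm.SVC()).fit(X, y)

# OvR trains one classifier per class, OvO one per pair of classes.
print(len(ovr.estimators_), len(ovo.estimators_))  # 10 and 45
```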


---
<br>

**Regression**
--------------
```
import sklearn.linear_model as lm
```

*Base Models*
-------------

[API - linearregression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)<br>
[API - sgdregression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html)

```
lm.LinearRegression()
lm.SGDRegressor()
```

*Tree Regressors*
-----------------
```
import sklearn.tree as tree
import sklearn.ensemble as ensemble
```

[API - decisiontreeregression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)<br>
[API - randomforestregression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

```
tree.DecisionTreeRegressor()
ensemble.RandomForestRegressor() 
```

---
<br>

**Clustering**
--------------

```
import sklearn.cluster as cluster
```
* Remember to scale the data before performing clustering on the data
* Might be insightful to plot the inertia and silhouette score as a function of k (number of clusters)

```
kms = cluster.KMeans(
        n_clusters=8,  # number of clusters
        n_init=10,  # number of times the k-means algorithm is run with different centroid seeds; the final result is the best of the n_init runs in terms of inertia
        init={'k-means++', 'random', array},  # method used to select the initial cluster centers
        algorithm={'lloyd', 'elkan'}  # type of algorithm to run ('auto' and 'full' were removed in scikit-learn 1.3)
        )

kms.inertia_  # inertia score after fitting (sum of squared distances between each instance and its closest centroid)
kms.cluster_centers_  # returns the centroid coordinates
kms.labels_  # returns each instance's label (index of the cluster the instance was assigned to)
kms.transform(X)  # returns the distance between each instance and every cluster center

mb_kms = cluster.MiniBatchKMeans(n_clusters=int, ...)
```
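
A sketch of scanning k and recording inertia and silhouette score, as suggested in the notes above. The make_blobs data and the range of k are assumptions for illustration.

```python
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.metrics as metrics
import sklearn.preprocessing as pp

X, _ = datasets.make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X_scaled = pp.StandardScaler().fit_transform(X)

results = {}
for k in range(2, 7):
    kms = cluster.KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    silhouette = metrics.silhouette_score(X_scaled, kms.labels_)
    results[k] = (kms.inertia_, silhouette)
    print(k, round(kms.inertia_, 2), round(silhouette, 3))
```

Inertia keeps falling as k grows, so look for the elbow in the inertia curve or the peak in the silhouette score rather than the minimum inertia.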

---
<br>

**Metrics for Model Evaluation**
--------------------------------
[API - modelevaluation](https://scikit-learn.org/stable/modules/model_evaluation.html)

```
import sklearn.metrics as metrics
```

<strong>Regression</strong>

```
metrics.mean_squared_error(y_actual, y_prediction)  # returns the mean squared error
```

<strong>Classification</strong>

```
metrics.confusion_matrix(y_actual, y_prediction) 
metrics.ConfusionMatrixDisplay.from_estimator(estimator, X, y, display_labels=display_labels)  # replaces plot_confusion_matrix(), removed in scikit-learn 1.2

metrics.precision_score(y_actual, y_prediction)
metrics.recall_score(y_actual, y_prediction)
metrics.f1_score(y_actual, y_prediction)
metrics.roc_auc_score(y_actual, y_prediction_score)  # expects scores or probabilities rather than hard class predictions
metrics.classification_report(y_actual, y_prediction)

metrics.precision_recall_curve(
                        y_actual, 
                        y_prediction_score)  # returns precision and recall values at different thresholds
metrics.roc_curve(
        y_actual, 
        y_prediction_score)  # returns FPR and TPR values at different thresholds

```
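
A minimal sketch of the threshold-based metrics on four made-up scores. The 0.5 cut-off used to derive hard predictions is an assumption for illustration.

```python
import numpy as np
import sklearn.metrics as metrics

y_actual = np.array([0, 0, 1, 1])
y_prediction_score = np.array([0.1, 0.4, 0.35, 0.8])  # made-up scores

precision, recall, thresholds = metrics.precision_recall_curve(
    y_actual, y_prediction_score)
fpr, tpr, roc_thresholds = metrics.roc_curve(y_actual, y_prediction_score)
auc = metrics.roc_auc_score(y_actual, y_prediction_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly

# Turn scores into hard predictions at a chosen threshold.
y_prediction = (y_prediction_score >= 0.5).astype(int)
print(metrics.precision_score(y_actual, y_prediction))  # 1.0
print(metrics.recall_score(y_actual, y_prediction))     # 0.5
```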

<strong>Cross-Validation</strong>

[API - crossvalscore](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)<br>
[API - crossvalpredict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html)

```
ms.cross_val_score(estimator, X, y, cv=10,
                        scoring={'accuracy', 'neg_mean_squared_error'} )

ms.cross_val_predict(estimator, X, y, cv=10,
                        method={'decision_function', 'predict_proba'} ) 

```
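
A sketch of both cross-validation helpers on an assumed example (breast cancer dataset, scaled logistic regression). `cross_val_predict` returns one out-of-fold score per sample, which is what `precision_recall_curve` and `roc_curve` expect.

```python
import sklearn.datasets as datasets
import sklearn.linear_model as lm
import sklearn.model_selection as ms
import sklearn.pipeline as pipe
import sklearn.preprocessing as pp

X, y = datasets.load_breast_cancer(return_X_y=True)
estimator = pipe.make_pipeline(
    pp.StandardScaler(), lm.LogisticRegression(max_iter=1000))

# One accuracy value per fold.
scores = ms.cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())

# Out-of-fold decision scores, one per training instance.
y_scores = ms.cross_val_predict(estimator, X, y, cv=5, method='decision_function')
print(y_scores.shape)
```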

---
<br>

**Grid Search and Randomized Search**
-------------------------------------

<strong>Grid Search</strong>

[API - gridsearchcv](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

```
parameter_grid= [ 
                {'n_estimators' :[2,4,6,8], 'max_features': [2,4,6]},  # tests every combination of the values in this dictionary (4 x 3 = 12)
                {defined parameters} ]


grid_search = ms.GridSearchCV(estimator, parameter_grid, 
                                scoring= 'xxxx', cv=10, 
                                return_train_score= True)

grid_search.fit(X, y)
grid_search.best_estimator_
grid_search.best_estimator_.feature_importances_
grid_search.best_score_
grid_search.best_params_

grid_search.cv_results_
grid_search.cv_results_['mean_train_score']  # average training score (requires return_train_score=True)
grid_search.cv_results_['mean_test_score']  # average score on the held-out folds
grid_search.cv_results_['params']  # parameter settings matching each result
```
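
A runnable sketch of the grid search workflow, assuming a random forest on the iris dataset and a deliberately tiny grid so it finishes quickly.

```python
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble
import sklearn.model_selection as ms

X, y = datasets.load_iris(return_X_y=True)

# 2 x 2 = 4 parameter combinations, each evaluated with 3-fold CV.
parameter_grid = [{'n_estimators': [5, 10], 'max_features': [1, 2]}]

grid_search = ms.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=42), parameter_grid,
    scoring='accuracy', cv=3, return_train_score=True)
grid_search.fit(X, y)

print(grid_search.best_params_, grid_search.best_score_)
for mean_score, params in zip(grid_search.cv_results_['mean_test_score'],
                              grid_search.cv_results_['params']):
    print(round(mean_score, 3), params)
```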

<strong>Randomized Search</strong>

[API - randomizedsearchcv](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

```
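# from scipy.stats import randint  (n_estimators must be an integer, so a continuous distribution such as reciprocal does not fit here)
parameter_dist_grid = {'n_estimators': randint(20, 20000), 'max_features': some_distribution_function}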

ms.RandomizedSearchCV(estimator, parameter_dist_grid, 
                        n_iter=10, cv=10, 
                        scoring= 'xxxx', )
```
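
A sketch of randomized search with integer distributions from `scipy.stats`. The dataset, the parameter ranges and n_iter are assumptions kept small for speed.

```python
import sklearn.datasets as datasets
import sklearn.ensemble as ensemble
import sklearn.model_selection as ms
from scipy.stats import randint

X, y = datasets.load_iris(return_X_y=True)

# randint(low, high) samples integers from low up to high - 1.
parameter_dist_grid = {
    'n_estimators': randint(5, 50),
    'max_features': randint(1, 5),
}

random_search = ms.RandomizedSearchCV(
    ensemble.RandomForestClassifier(random_state=42), parameter_dist_grid,
    n_iter=5, cv=3, scoring='accuracy', random_state=42)
random_search.fit(X, y)

print(random_search.best_params_, random_search.best_score_)
```

Unlike the grid search, only n_iter sampled combinations are evaluated, so the cost stays fixed however wide the distributions are.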

