### ML cheatsheets
### ML comparisons

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

https://medium.com/machine-learning-in-practice/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6

## Scaling & Centering (Normalization)

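 - StandardScaler, MinMaxScaler, RobustScaler
 - RobustScaler may handle outliers better, since it centers on the median and scales by the interquartile range
 - Scaling does not affect tree-based models
 - It rarely harms a model, so it is usually safe to apply by default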
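
As a rough illustration of the scalers listed above, the sketch below applies each one to a single made-up feature that contains one large outlier. The data values are assumptions chosen only to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single large outlier (made-up values for illustration).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # (x - mean) / standard deviation
minmax = MinMaxScaler().fit_transform(X)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(X)      # (x - median) / interquartile range

print("standard:", standard.ravel().round(2))
print("minmax:  ", minmax.ravel().round(2))
print("robust:  ", robust.ravel().round(2))
```

With the robust scaler the four ordinary points keep a sensible spread, while min-max scaling squeezes them close to zero because the outlier defines the range.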

## Imbalanced Dataset
 - SMOTE
 - Oversampling
 - The Precision/Recall curve is more informative than accuracy for imbalanced data
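
A minimal sketch of computing a precision/recall curve with scikit-learn. The synthetic dataset (roughly 5% positives) and the choice of `LogisticRegression` are assumptions made for illustration, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives (an assumption for illustration).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the minority class

# One (precision, recall) pair per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
avg_precision = average_precision_score(y_test, scores)
print("Average precision:", round(avg_precision, 3))
```

The `precision` and `recall` arrays can be passed straight to a plotting library to draw the curve.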

## 5-fold Cross validation methods
 - Train, Test 
    - Isolate Test
    - CV on training set
 - Kfold
    - Loop over 5 folds
    - Manual
    - only use this method if the feature engineering you do is very complex
 - cross_val_score, cross_validate
    - Pipeline
    - Get a feel for how your model performs
    - See how new features affect performance
    - Compare different algorithms
    - Simple hyperparameter checks over a small range, e.g. max_depth = [5, 10]
 - GridSearch
    - Pipeline
    - Complex Hyperparameter tuning with all combination of hyperparameters
    - Automatically does cross validation
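
The workflow above could be sketched roughly as follows: isolate a test set, get a quick estimate with `cross_val_score` on a `Pipeline`, then run `GridSearchCV` over a small grid. The dataset, the pipeline steps, and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 1. Isolate the test set; everything below only touches the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# 2. Quick 5-fold estimate of how the model performs.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# 3. GridSearchCV tries every combination and cross-validates each one.
param_grid = {"tree__max_depth": [5, 10], "tree__min_samples_leaf": [1, 5]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print("Best params:", grid.best_params_)
print("Held-out test accuracy:", round(test_accuracy, 3))
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into preprocessing.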

## Machine Learning Techniques

 - **Regression/Estimation** (*Predicting continuous values*) is used for predicting a continuous value, for example the price of a house based on its characteristics, or the CO2 emission from a car’s engine.
 - **Classification** (*Predicting the item class/category of a case*) is used for predicting the class or category of a case, for example whether a cell is benign or malignant, or whether or not a customer will churn.
 - **Clustering** (*Finding the structure of data; summarization*) finds groups of similar cases, for example similar patients, or can be used for customer segmentation in the banking field.
 - **Association** (*Associating frequent co-occurring items/events*) is used for finding items or events that often co-occur, for example grocery items that are usually bought together by a particular customer.
 - **Anomaly detection** (*Discovering abnormal and unusual cases*) is used to discover abnormal and unusual cases, for example in credit card fraud detection.
 - **Sequence mining** (*Predicting next events; click-stream: Markov Model, HMM*) is used for predicting the next event, for instance in the click-stream of a website.
 - **Dimension reduction** (*PCA*) is used to reduce the number of features in the data.
 - **Recommendation systems** (*Recommending items*) associate people's preferences with those of others who have similar tastes, and recommend new items to them, such as books or movies.

### Supervised
 - Regression
 - Classification

### Unsupervised
 - **Dimension reduction**: Dimensionality reduction and/or feature selection remove redundant features, which makes a downstream task such as classification easier.
 - **Density estimation** : Density estimation is a very simple concept that is mostly used to explore the data to find some structure within it.
 - **Market basket analysis**: Market basket analysis is a modeling technique based upon the theory that if you buy a certain group of items, you're more likely to buy another group of items.
 - **Clustering**: Clustering is considered to be one of the most popular unsupervised machine learning techniques used for grouping data points or objects that are somehow similar. Cluster analysis has many applications in different domains, whether it be a bank's desire to segment its customers based on certain characteristics, or helping an individual organize their favorite types of music into groups. Generally speaking though, clustering is used mostly for discovering structure, summarization, and anomaly detection.

<img src="image/sup_unsup.jpg" width="800"> 

<img src="image/Regression_Algos.jpg" width="500"> 

<img src="image/Classification_Algos.jpg" width="500"> 

<img src="image/train_test.jpg" width="700"> 

<img src="image/KFold.jpg" width="700"> 

<img src="image/errors.jpg" width="700"> 

 - Mean absolute error is the mean of the absolute value of the errors. This is the easiest of the metrics to understand, since it's just the average error.
 - Mean squared error is the mean of the squared error. It's more popular than mean absolute error because it puts more weight on large errors: squaring grows larger errors quadratically, so they dominate the smaller ones.
 - Root mean squared error is the square root of the mean squared error. This is one of the most popular of the evaluation metrics because root mean squared error is interpretable in the same units as the response vector or y units, making it easy to relate its information.
 - Relative absolute error takes the total absolute error and normalizes it by dividing by the total absolute error of the simple predictor, which always predicts y bar, the mean value of y.
 - Relative squared error is very similar to relative absolute error but uses squared errors. It is widely adopted by the data science community, as it is used for calculating R squared (R squared = 1 - relative squared error).
 - R squared is not an error per se but is a popular metric for how well your model fits. It represents how close the data values are to the fitted regression line. The higher the R-squared, the better the model fits your data.
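
The metrics described above can be written out in a few lines of NumPy. This is a small sketch; `y_true` and `y_pred` hold made-up values used only to exercise the formulas.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # made-up actual values
y_pred = np.array([2.5, 5.0, 8.0, 8.5])  # made-up predictions

errors = y_true - y_pred
y_bar = y_true.mean()  # the "simple predictor" always predicts the mean

mae = np.mean(np.abs(errors))                                   # mean absolute error
mse = np.mean(errors ** 2)                                      # mean squared error
rmse = np.sqrt(mse)                                             # root mean squared error
rae = np.sum(np.abs(errors)) / np.sum(np.abs(y_true - y_bar))   # relative absolute error
rse = np.sum(errors ** 2) / np.sum((y_true - y_bar) ** 2)       # relative squared error
r2 = 1 - rse                                                    # R squared

for name, value in [("MAE", mae), ("MSE", mse), ("RMSE", rmse),
                    ("RAE", rae), ("RSE", rse), ("R2", r2)]:
    print(f"{name}: {value:.4f}")
```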
 
##### Each of these metrics can be used for quantifying the error of your predictions. The choice of metric depends on the type of model, your data type, and your domain knowledge.

##### Approaches for estimating Multiple Linear Regression:
 - Ordinary Least Squares
    - Linear algebra operations
    - Takes a long time on large datasets
 - An optimization algorithm
    - Gradient Descent
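
A minimal sketch contrasting the two approaches on synthetic data. The true coefficients (intercept 4, slopes 2 and -3), the learning rate, and the iteration count are assumptions picked so that gradient descent converges quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 + X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is learned as the first coefficient.
Xb = np.column_stack([np.ones(len(X)), X])

# Ordinary Least Squares: solve the least-squares problem with linear algebra.
theta_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Gradient descent: repeatedly step against the gradient of the mean squared error.
theta_gd = np.zeros(Xb.shape[1])
learning_rate = 0.1
for _ in range(1000):
    gradient = 2 / len(y) * Xb.T @ (Xb @ theta_gd - y)
    theta_gd -= learning_rate * gradient

print("OLS:             ", theta_ols.round(3))
print("Gradient descent:", theta_gd.round(3))
```

Both approaches should land on nearly the same coefficients here; the difference is in cost, since the linear algebra solution becomes expensive as the dataset grows.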

## ML Lifecycle

#### Objective
 - Business Case
 - Metric

#### Data
 - Data Collection
 - Cleaning
 - EDA
 - Feature Engineering
 
#### ML
 - Model
 - Tuning
 - Evaluation
 - Tuning
 
#### Reporting/Deployment




### Choosing the Right Estimator

When to use KNN?
 - Great for small data.

 - If we have lots of features and data, KNN does not perform very well.

SVM creates the decision boundary by maximising the margin between the support vectors on each side.

RandomForest - Wisdom of the crowd: the average of the guesses of many trees.
 - Each tree looks at a random sample of the data and features, e.g. 60% of the rows and the square root of the number of features (4 features out of a total of 16).
 - If we have 100 trees, each looks at a different snapshot of the data, and the predictions of all trees are then aggregated.
 - Relatively fast
 - Requires less cleaning
 - Hence often used as a baseline model
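
The numbers in the list above map fairly directly onto `RandomForestClassifier` parameters. The sketch below assumes synthetic data with 16 features; note that scikit-learn draws the random feature subset at each split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, so sqrt(16) = 4 features are tried per split.
X, y = make_classification(n_samples=500, n_features=16, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # 100 trees
    max_samples=0.6,      # each tree is trained on a bootstrap sample of 60% of the rows
    max_features="sqrt",  # each split considers sqrt(n_features) candidate features
    random_state=42,
)
forest.fit(X, y)

n_trees = len(forest.estimators_)
features_per_split = forest.estimators_[0].max_features_
print("Trees:", n_trees)
print("Features considered per split:", features_per_split)
```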

An ensemble method looks at different snapshots of the data, and each member creates decision boundaries based on its own perspective.

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

##### Tree algorithms have feature importances
##### Regression models have coefficients

 - Remove less important features
 - Do Feature Engineering to important features
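
A short sketch of reading both kinds of signal from fitted scikit-learn models. The dataset and the two estimators are assumptions for illustration; the features are standardized before the linear model so that coefficient magnitudes are comparable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target

# Tree-based model: feature_importances_ sums to 1 across all features.
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = forest.feature_importances_
top5 = importances.argsort()[::-1][:5]
for i in top5:
    print(f"{data.feature_names[i]}: {importances[i]:.3f}")

# Linear model: one coefficient per feature (sign shows direction of the effect).
linear = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
coefficients = linear.named_steps["logisticregression"].coef_.ravel()
print("Largest |coefficient|:", data.feature_names[abs(coefficients).argmax()])
```

Features near the bottom of either ranking are candidates for removal, and those near the top are candidates for further feature engineering.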

### Model Evaluation

Precision = TP / (TP + FP): of all predicted positives, the share that is actually positive (False Positives cost the model).
e.g. when churn incentives are given to customers who might leave, we need Precision.

Recall = TP / (TP + FN): of all actual positives, the share that the model finds (False Negatives cost the model).
e.g. Fraud Detection, Disease Prediction
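
The two formulas above can be checked against a confusion matrix. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # made-up actual labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # made-up predictions

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("Precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("Recall:   ", recall, "sklearn:", recall_score(y_true, y_pred))
```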

#### Regression metrics
 - Mean Squared Error
 - R2

#### Classification metrics
 - Logloss
 - Accuracy
 - Precision
 - Recall
 - F1
 
#### Adjust the classification threshold to trade off precision against recall
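
One way to move the threshold is to compare predicted probabilities against a cut-off of your choosing instead of calling `predict`, which uses 0.5 for this model. The sketch below assumes synthetic data and a `LogisticRegression` model; the three threshold values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_test, y_pred, zero_division=0),
        recall_score(y_test, y_pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never goes up; precision usually improves but is not guaranteed to.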

## Decision Trees

Try every possible split of each feature to move towards pure nodes, stopping at the maximum depth. Use Gini/Entropy to measure how mixed the data in a node is.

Loop through all the features.

Try different parameters to make it less mixed.

Gini Impurity is the probability of incorrectly classifying a randomly chosen element in the dataset if it were randomly labeled according to the class distribution in the dataset. 
https://victorzhou.com/blog/gini-impurity/

Information Gain is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.
https://victorzhou.com/blog/information-gain/
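
The two definitions above can be worked through by hand. In this sketch the helper functions `gini`, `entropy`, and `information_gain` are hypothetical names written for illustration, and the label arrays are made up.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropies of the two branches."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # perfectly mixed node

perfect_gain = information_gain(parent, parent[:4], parent[4:])
mixed_gain = information_gain(parent, np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1]))

print("Gini of parent:", gini(parent))
print("Gain of a perfect split:", perfect_gain)
print("Gain of a mixed split:", round(mixed_gain, 4))
```

The perfect split removes all uncertainty, so it would be chosen over the mixed split when maximizing information gain.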

Use One Hot Encoding for categorical variables. Decision Trees do not care about the magnitude of the encoded values, so Label Encoding can also be used.
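
A small sketch of the two encodings on a made-up colour column. It uses `OrdinalEncoder`, which is scikit-learn's feature-side counterpart to `LabelEncoder` (the latter is intended for target labels).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["red"], ["green"], ["blue"], ["green"]])  # made-up category values

# One column per category; exactly one 1 per row.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# One integer per category, assigned in sorted order: blue=0, green=1, red=2.
ordinal = OrdinalEncoder().fit_transform(colors)

print(one_hot)
print(ordinal.ravel())
```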

### Bias - Variance Tradeoff

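To check for over-fitting or under-fitting:
 - Compare the train and test accuracy scores. A high train score with a much lower test score suggests over-fitting; low scores on both suggest under-fitting.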

#### Example

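    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # X (features) and y (labels) are assumed to be loaded already.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    model = DecisionTreeClassifier(max_depth=3)
    model.fit(X_train, y_train)

    # Score on the data the model was trained on.
    y_pred_train = model.predict(X_train)
    print('Train score:', accuracy_score(y_train, y_pred_train))

    # Score on the held-out data.
    y_pred = model.predict(X_test)
    print('Test score:', accuracy_score(y_test, y_pred))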
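
To see the tradeoff more directly, the same comparison can be repeated for several tree depths. This is a sketch under assumptions: the dataset and the depth values are chosen for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scores = {}
for depth in (1, 3, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A very shallow tree tends to score low on both sets (high bias), while an unrestricted tree fits the training set almost perfectly and may score lower on the test set (high variance).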

### How can we make this model better?
 - Use different ML algorithm
 - Feature Engineering
 - Hyperparameter Tuning
 - Feature selection

objective -> data -> model -> reporting/deployment

 - data collection = data engineer
 - eda/feature engineering/model = data scientist
 - deployment = machine learning engineer

## Model Validation

### Data Leakage
 - Information that would not be available at prediction time leaks into the train/test set
 - In Vinny's example, the Customer ID's relationship with the label causes it

### Imbalanced Learning
 - The algorithm gets really good at learning the majority class and not so good at the minority class.
 - Accuracy is no longer a useful metric.

#### Random Oversampling
 - Take the minority class samples and randomly duplicate them
 - You can also undersample the majority class, or use a combination of both
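
Random oversampling can be sketched with `sklearn.utils.resample`. The 90/10 class split and the random features below are made up for illustration; in practice only the training set should be resampled, so that the test set keeps the real class distribution.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up features
y = np.array([0] * 90 + [1] * 10)    # 90 majority samples, 10 minority samples

X_maj, X_min = X[y == 0], X[y == 1]

# Draw minority rows with replacement until both classes have the same size.
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.array([0] * len(X_maj) + [1] * len(X_min_up))

print("Class counts before:", np.bincount(y))
print("Class counts after: ", np.bincount(y_balanced))
```

Undersampling works the same way in reverse: resample the majority class with `replace=False` and `n_samples=len(X_min)`.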

If you have lots of data (millions of rows), you can use Holdout Cross Validation.

But if you have less data, use K-Fold Cross Validation.

#### If the K-Fold Score **varies** a lot then the model is **Unstable**.
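
A quick way to check stability is to look at the spread of the per-fold scores. The sketch below assumes a small decision tree on a built-in dataset; what counts as "a lot" of variation is a judgment call that depends on the problem.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42)
fold_scores = cross_val_score(model, X, y, cv=cv)

print("Fold scores:", fold_scores.round(3))
print("Mean: %.3f, std: %.3f" % (fold_scores.mean(), fold_scores.std()))
```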

<img src="image/SVM_app.jpg" width="700"> 