# Mistakes to avoid in Machine Learning

Always check to avoid the following mistakes in ML.

+ [ ] [Assuming Data is good to go](#1)
+ [ ] [Neglecting to consult subject matter experts](#2)
+ [ ] [Overfitting your models](#3)
+ [ ] [Not standardizing your data](#4)
+ [ ] [Focusing on Wrong Factors](#5)
+ [ ] [Data Leakage](#6)
+ [ ] [Forgetting traditional statistics tools](#7)
+ [ ] [Assuming Deployment is a breeze](#8)
+ [ ] [Assuming Machine Learning is the answer](#9)
+ [ ] [Developing in a silo](#10)
+ [ ] [Not treating for imbalanced sampling](#11)
+ [ ] [Interpreting your coefficients without properly treating for multicollinearity](#12)
+ [ ] [Evaluating by accuracy alone](#13)
+ [ ] [Giving overly technical presentations](#14)

------

# <a name='1'>1) Assuming Data is good to go</a>

### Visualize your data
+ use `describe()` functions in pandas
+ use `pandas-profiling`

### Check for Duplicates
+ use the pandas `duplicated()` function to find them
+ use `drop_duplicates()` to remove them

### Beware of Missing Values
+ use `isnull().sum()` on a dataframe to determine missing values.
+ use `.dropna()` to remove records with null values
+ or replace null values with a constant such as `0` using the `.fillna()` function.
+ Estimate null values by **imputing** data
    + use scikit-learn's `SimpleImputer`
    + fill missing values with the mean, median, or mode
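
The checks above can be sketched on a toy dataframe. The column names and values here are made up for illustration, and the median strategy is just one of the options `SimpleImputer` offers.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one duplicated row and two missing values.
df = pd.DataFrame({
    "age": [25, 32, 32, np.nan, 47],
    "income": [40_000, 52_000, 52_000, 61_000, np.nan],
})

# Quick summary statistics for every numeric column.
print(df.describe())

# Count exact duplicate rows, then drop them.
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)

# Count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Impute the remaining gaps with each column's median.
imputer = SimpleImputer(strategy="median")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```

Whether to drop, fill with a constant, or impute depends on why the values are missing, so treat the median here as a placeholder choice.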

-----

# <a name='2'>2) Neglecting to consult subject matter experts</a>

+ SMEs can be Product Managers or the customers you are creating the model for
+ always check with your customers:
    + Are there any known issues with the data?
    + Have there been any prior issues with modelling?
    + Are there any common hang-ups?
    + Are there any additional concerns?
+ get their feedback

--------

# <a name='3'>3) Overfitting your models</a>

### **Overfitting**<br> 
Overfitting happens when your model captures patterns in your training data too well, including noise, so it doesn't generalize well to unseen data.

## **Preventing Overfitting** 

**Regularization:** Introducing a penalty on model complexity (large coefficients) that reduces - or eliminates - the weight of some features in our model.

Two common types of regularization include 
+ **Lasso or L1 regularization**
+ **Ridge or L2 regularization**.


Each of these will shrink the coefficients in the model, but only L1 can reduce the weight of some features to exactly zero, thereby removing them from the model entirely.
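
A minimal sketch of the difference, using scikit-learn's `Lasso` and `Ridge` on synthetic data. The data, the `alpha` values, and the single informative feature are all assumptions chosen to make the effect easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first of five features actually drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the penalty in both models.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 drives the irrelevant coefficients to exactly zero;
# L2 only shrinks them towards zero.
print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
```

In practice `alpha` is usually tuned with cross-validation rather than fixed by hand.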

--------

# <a name='4'>4) Not standardizing your data</a>


### **Why do we need to scale features?**

* Many machine learning techniques will incorrectly assign a **higher weight to features of a higher magnitude**.

* There are two common approaches for scaling.
    + **MinMaxScaler**
    + **StandardScaler**


**NOTE: tree based algorithms don't require Scaling**

-------

## Min Max Scaler

+ Min-max scaling involves scaling your feature to `a range between 0 and 1`, as defined by the `min and max of your feature`.
+ It is recommended when your algorithm `doesn't require assumptions about the distributions` of your variables, as in the case of KNN.


## Standard Scaler

+ The StandardScaler rescales each feature to `zero mean and unit variance`, so every value is expressed as the `number of standard deviations from the mean for that feature`. Thus we have a `range of values both positive and negative`.
+ This approach is most effective when your variables roughly follow a `bell curve distribution (Normal Distribution)`.
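
To make the two scalers concrete, here is a small sketch on a single made-up feature, using scikit-learn's `MinMaxScaler` and `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with five evenly spaced values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling maps the smallest value to 0 and the largest to 1.
minmax_scaled = MinMaxScaler().fit_transform(X)

# Standard scaling subtracts the mean and divides by the standard deviation.
standard_scaled = StandardScaler().fit_transform(X)

print(minmax_scaled.ravel())
print(standard_scaled.ravel())
```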


-------

### Before we scale, we need to perform the train-test split.

The reason we split before scaling is that we derive the scaling parameters (such as the min and max, or the mean and standard deviation) from the training set only, then apply them to the test set.

Remember, in machine learning anything our model learns must come from the training set, not the test set.
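
A minimal sketch of that order of operations, assuming random placeholder data and a `StandardScaler`; the same pattern applies to any scaler.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 3))
y = rng.integers(0, 2, size=200)

# 1) Split first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2) Fit the scaler on the training set only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3) Reuse the training-set parameters to transform the test set.
X_test_scaled = scaler.transform(X_test)
```

The test set will generally not have exactly zero mean after this step, which is expected: it was scaled with the training set's parameters.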

--------


# <a name='5'>5)  Focusing on Wrong Factors</a>

Sometimes data scientists hit a performance wall even after tuning and refining the model.

If we want to avoid this outcome:
+ take a step back and ask yourself and others around you: are there any other datasets the model could benefit from?
+ revisit the data that we collected at the start of the project.
+ there is a good chance that incorporating more data sources will significantly improve the model.

-----

## Suggested Approach

+ Make a **wish list** of the data you want. Then map this wishlist to the data sources that exist within your business.

+ Find the **available data sources** and incorporate **relevant new variables**. If they meet the feature selection criteria, measure their feature importance and overall impact on the model's performance.

+ Reach out to **other data teams**. Ask what data they would think to incorporate. Data scientists often work in silos, so approach teams that focus on other data sources. They may be able to provide a starter script or query that saves time on data prep. Put aside any bias about the data first and let the evaluation criteria be the judge.

+ More is not always better. Adding more data may require additional work on feature selection and may lead to overfitting.

-------

# <a name='6'>6)  Data Leakage</a>

## **What is data leakage?** ##

Data leakage can be thought of as any time information from outside your training set enters your model.


It is especially prevalent when working with time series data and in environments where there are data cleanliness issues. The end result may fool you into thinking your model generalizes much better than it really does.

Be mindful of this, and remember: if your results look too good to be true, pump the brakes and follow the steps below before sharing your model's results.


## **How to detect and prevent data leakage** ##
+ **Are any features surprisingly highly correlated with your target variable?**
    + use `corr()` to find out
+ Similarly, after training your model - **review the feature importance to see if anything stands out.**
+ **If using time series data, train-test split along your date variable.**
    + it is not appropriate to do a random train-test split as you normally would.
    + sort by date first
    + then split along the date variable.
    
+ When **Scaling**, fit your scaler to your training group only, then transform both training and test group

+ when using **K-fold cross validation**, repeat the preprocessing steps within each fold separately to prevent data leakage.
    + use the pipeline to handle preprocessing steps and use it via GridSearchCV, RandomizedSearchCV.
    + https://towardsdatascience.com/pre-process-data-with-pipeline-to-prevent-data-leakage-during-cross-validation-e3442cca7fdc
    
![cv_pipeline.png](cv_pipeline.png)
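
The pipeline approach described above can be sketched as follows. The synthetic dataset, the logistic regression model, and the `C` grid are assumptions for illustration; the point is that the scaler lives inside the `Pipeline`, so it is re-fitted on the training portion of each fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Preprocessing and model are bundled, so each CV fold fits its own scaler.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"model__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` accepts the same pipeline object if a random search is preferred.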

### [INCORRECT WAY] AVOID SCALING and passing those data into CV. 

![avoidthis.png](avoidthis.png)

-------

# <a name='7'>7)  Forgetting traditional statistics tools</a>


Familiarize yourself with traditional regression techniques if you are trying to explain the past rather than predict the future.

## Regression Approach
+ R-squared value
+ Variable coefficients
+ P-values
+ Interpretability

Always be mindful of the statistical assumptions behind regression, such as...
+ treating multicollinearity if you intend to interpret the output.


## A/B Testing
+ uses t-tests
+ Randomized Experiment: We determine whether a treatment yields a statistically significant impact in a randomized experiment.
+ Causal Inference: this can yield the coveted causal inference, in which we can reasonably say X caused Y.
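
As a rough sketch of the t-test step, here is a two-sample comparison with `scipy.stats.ttest_ind`. The simulated control and treatment groups, their sizes, and the 0.05 threshold are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Simulated metric for two randomly assigned groups.
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=1.0, size=500)
treatment = rng.normal(loc=10.5, scale=1.0, size=500)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Conventional (but arbitrary) significance threshold.
is_significant = p_value < 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4g}, significant = {is_significant}")
```

The causal reading only holds because assignment to the groups is random; the same test on observational data would show association, not causation.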

--------

# <a name='8'>8)  Assuming Deployment is a breeze</a>

Depending on the use case, deployment can be complex.


## Start with the end in mind
+ Plan your deployment strategy **at the beginning** of the project.
    + this will help to illuminate the limitations you'll need to consider when you are creating the model. (Example: if you are planning for real-time predictions in your deployment, check that all the data you use will actually be available when predictions are generated.)
+ Will you be scheduling batch predictions or predicting in real time?
    + if real time, you'll need to look into things like Flask and deploying APIs.
+ What are the compute requirements?
+ How will you monitor performance over time?
+ Will you be updating and re-deploying your model? If so, How?
+ Is your model driving behaviour that you intended it to?
    + use A/B testing to evaluate its significance.




-----

# <a name='9'>9)  Assuming Machine Learning is the answer</a>

Assess upfront whether the project requires machine learning.

Use the **following criteria to judge whether machine learning is necessary and likely to succeed for your use case**:
+ Do I have a large and diverse set of data to start with?
+ How well-defined is the problem that I am trying to solve?
+ Do I have a clear outcome that I am trying to predict?
+ Do I have a hypothesis?
+ Is a quick ad-hoc analysis enough, or is a full-fledged machine learning model required?
    + often quick descriptive statistics can provide the insights.
+ For a classification problem, is your data labeled?
    + perhaps a data cleaning exercise is needed before any modelling can be performed.
+ If successful, will my results drive meaningful action?
    + don't predict for prediction's sake

-----

# <a name='10'>10)  Developing in a silo</a>

With so much time spent heads down in your code, it's easy to lose perspective about some of the intangibles that go into a successful machine learning project. 

So avoid the mistake of developing in a silo through these tips:

+ **Invite others** to look at your code.
    + don't worry about being judged. Maybe you will get some recommendations or will even get questions that will prepare you to socialize your work to a broader audience
+ Reach out to **established data scientists** in your field.
+ Thoroughly **version and document** your code so it can be reproduced.
+ **Communicate regularly** with your subject matter experts.
    + most importantly, regularly communicate your progress to your managers or customers and diplomatically share any roadblocks you're encountering. You want to avoid disappearing into your script after those initial meetings. Your customers may be left wondering, what's taking him so long? Or did she incorporate that feedback I gave her?
    + don't share your script with nontechnical audiences, but **communicating your progress and initial findings through visualizations will help set expectations and create happier customers**. 
+ Take a break from the screen and **go for a walk**.

-----

# <a name='11'>11)  Not treating for imbalanced sampling</a>

## **Imbalanced Data**
Encountered in classification problems in which the **number of observations per class is disproportionately distributed.**


## **How to treat for Imbalanced Data?**
+ use `imbalanced-learn` (imblearn) package.
+ it can provide various sampling techniques.

-------

# 1) Over-Sampling Approach


## 1.1) naive approach known as Random Over-Sampling
+ We will upsample our minority classes, that is sample with replacement until the number of observations is uniform across all classes.
+ As we can imagine, this approach should give us pause depending on the scale of upsampling we'll be doing, since the same observations get repeated many times.
+ `from imblearn.over_sampling import RandomOverSampler`

## 1.2) another approach is SMOTE (Synthetic Minority Oversampling Technique)
+ in this case, we generate new synthetic observations within the existing feature space of our minority classes.



--------

# 2) Under-Sampling Technique

## 2.1) Naive approach to randomly under-sample our majority class
+ this time we are actually throwing out data from our majority class until the number of observations is uniform.
+ `from imblearn.under_sampling import RandomUnderSampler`
+ always check the number of observations per class after resampling. **Because of the infrequency of our smallest minority class, we may have thrown out a huge percentage of the data**. If that is the case and we lost a lot of data, we might want to consider other under-sampling methods for this kind of dataset (like `k-means` and `near-miss`)

------

# <a name='12'>12)  Interpreting your coefficients without properly treating for multicollinearity</a>

+ **Multicollinearity** is when one predictor variable in your regression model can be accurately predicted from the others.

-----

## Use Logit (https://pypi.org/project/statsmodels/)

+ `import statsmodels.api as sm`, `from statsmodels.tools.tools import add_constant` (provide full summary of the data)
+ check for `R-Squared` value (Pseudo R-squ).
+ check for `Coefficient` of independent variables (coef). These tell us how you can expect the likelihood of being one class to respond to changes in features.
+ check for `P-values`, which tell us the relative statistical significance of each variable. (P>|z|)

**Before you interpret those values, we need to understand if multicollinearity is present in our data.**

Multicollinearity won't affect the quality of predictions in our model, only our ability to interpret individual coefficients and p-values. This brings us to the `Variance Inflation Factor (VIF)`.

-----
## Variance Inflation Factor (VIF)
 
+ **Variance Inflation Factor (VIF)** tells us the extent to which we have multicollinearity in our result
+ a `factor of 5 to 10 or more is considered high` and tells us to be wary of the model coefficients.

`from statsmodels.stats.outliers_influence import variance_inflation_factor`
`vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]`

+ To treat for multicollinearity, you can remove those variables with the high VIF. 

**Note this will affect the explanatory power of your overall model.**

-----

# <a name='13'>13)  Evaluating by accuracy alone</a>

## **Beyond Accuracy**

+ **Accuracy**: The share of all total predictions that were correct.
    + As accuracy alone doesn't tell the whole story, check the following.
    + **True Positive Rate (Sensitivity)**: this tells us how well our predictions perform on the positive subset of data.
    + **True Negative Rate (Specificity)**: this tells us how well our predictions perform on the negative subset of data.
+ **Recall** is the ability of the classifier to find all the positive samples (the same quantity as the true positive rate)
+ **Precision** tells us how relevant our result is: the share of positive predictions that were actually positive
+ **F1 Score:** the harmonic mean of precision and recall. We want this to be as close to 1 as possible.

------
+ use **Confusion matrix** and **Classification Report**
    + Confusion Matrix: this matrix will reveal how well our predictions line up with the actuals across the positive and negative subsets of the data.
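
These metrics can all be read off a confusion matrix. A minimal sketch with scikit-learn, using ten hand-written labels (made up for illustration) so the numbers are easy to verify by hand:

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn), the sensitivity
specificity = tn / (tn + fp)                 # true negative rate
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(classification_report(y_true, y_pred))
```

With these labels there are 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, giving an accuracy of 0.8 and a precision, recall, and F1 of 0.75 each.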
    
------

### ROC Curve and AUC

+ A ROC curve is the most commonly used way to visualize the performance of a binary classifier
+ AUC is a common way to summarise its performance in a single number.


+ **ROC curve (receiver operating characteristic curve):**
    + This is a great way to evaluate your model when you have actual predicted probabilities instead of just zeros and ones. 
    + The ROC curve allows you to see the true positive rate plotted against the false positive rate across varying thresholds for deeming a prediction positive or negative. 
    + The path of this curve tells us how well your model performs. 
    + And we can compute the area under this curve to give us a clean metric to work with. This is known as **AUC, area under the curve**. It ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to a perfect classifier.
        + we can think of this as a **letter grade**. 
        + So 0.85 is like a B, but the actual interpretation depends on your use case. 
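
A minimal sketch of computing the curve and its area with scikit-learn, assuming four hand-picked predicted probabilities (made up for illustration):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# True labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# Area under the ROC curve.
auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")
```

Here the AUC is 0.75: of the four positive/negative pairs, the positive example receives the higher score in three.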
        
Take note that accuracy as a metric will only tell you part of the picture, particularly when you have imbalanced data.


------

# <a name='14'>14)  Giving overly technical presentations</a>

## Art of story telling
+ Use **Data Visualizations** instead of showing your code.
+ stick to summaries of your results.
+ Make data visualizations **relevant and easy to read**.
+ **Lead with the results**. Not the technicalities.
    + sell the organization's benefits, like "What time will be saved?" or the value created above the current baseline.
    + you can use confusion matrix to get your point across.
+ Speak to how your approach **addresses the problem to be solved.**
+ learn to use storytelling techniques to convey why your work is going to have a big impact and drive success.


------