### Data Calibration

* Suppose we are given a model $f$ that takes $x_q$ as input and returns $y_q$. This output may or may not be the actual probability $P(y_q = 1 | x_q)$.

* We already have a trained model, i.e., $f$.

* To obtain reliable estimates of $P(y_q = 1 | x_q)$ and $P(y_q = 0 | x_q)$, we follow a method called calibration.

* We need this because, for a 2-class classification problem in ML, we extensively use the log-loss metric, which is computed from predicted probabilities.

$$\text{log-loss} = -\frac{1}{n} \sum_{i=1}^n \big[y_i \log{(p_i)} + (1 - y_i)\log{(1 - p_i)}\big]$$
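As a minimal sketch of the formula above, the function below computes the log-loss in plain Python. The clipping constant `eps` and the toy labels and probabilities are assumptions added for illustration; they are not part of the original notes.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; p_pred holds the predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip so that predictions of exactly 0 or 1 do not produce log(0).
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical toy data.
y = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
uninformative = [0.5, 0.5, 0.5, 0.5]

print(log_loss(y, confident))
print(log_loss(y, uninformative))
```

Because the metric works directly on the probabilities, a model that ranks points well but outputs poorly scaled scores can still get a poor log-loss, which is the motivation for calibrating.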

* We take the output of the model and build a calibration model on top of it.
    - This gives us a calibrated estimate of $P(y_q = 1 | x_q)$.

> Predicting the probabilities allows some flexibility including -
> * deciding how to interpret the probabilities.
> * presenting predictions with uncertainty.
> * providing more nuanced ways to evaluate the skill of the model.

### Procedure to Visualize Calibration Data

**Steps**

![calibration-plot](https://user-images.githubusercontent.com/63333753/126610858-11e971b5-1f20-45ff-91eb-31718cf1095b.PNG)

* For a 2 class classification, fit a model to the training data, let that be $f$.
    - $y_i$ is either 0 or 1.

* Take the cross-validation data and predict $y_i$.
    - Let the prediction value be $\hat{y_i}$.
    - Let the actual value be $y_i$.

* Construct a table $\{x_{q_i}, y_i, \hat{y_i}\}$.
    - $x_{q_i} \in$ cross-validation data.

* Sort the table in increasing order of $\hat{y_i}$.

* Split the sorted table into chunks of size `m`.

* For each chunk, compute the average of $y_i$ and the average of $\hat{y_i}$.

* Take the mean of $y_i$ and the mean of $\hat{y_i}$ from each chunk. This data is known as calibration data.
     - The data set consists of $\{\hat{y}^{(i)}, y^{(i)}\}$, one pair per chunk.

* The mean of $y_i$ in a chunk estimates $P(y_q = 1 | x_q)$ for that chunk: because the labels are either 1 or 0, the mean is the fraction of positive points.

* Plot $\{\hat{y}^{(i)}, y^{(i)}\}$.
    - x-axis → $\hat{y}^{(i)}$
    - y-axis → $y^{(i)}$

* We can also fit a linear regression model to the calibration data and draw the best-fit line.
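The steps above (sort, chunk, average) can be sketched in a few lines of plain Python. The function name `calibration_data` and the toy scores and labels are assumptions made for this example; the last chunk is allowed to be smaller than `m`.

```python
def calibration_data(y_true, y_score, m):
    """Return (mean predicted score, mean actual label) for each chunk of size m."""
    # Sort (score, label) pairs by increasing predicted score.
    pairs = sorted(zip(y_score, y_true))
    points = []
    for start in range(0, len(pairs), m):
        chunk = pairs[start:start + m]
        mean_score = sum(score for score, _ in chunk) / len(chunk)
        # Labels are 0 or 1, so their mean is the fraction of positives.
        mean_label = sum(label for _, label in chunk) / len(chunk)
        points.append((mean_score, mean_label))
    return points

# Hypothetical cross-validation predictions and labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

points = calibration_data(labels, scores, m=4)
for mean_score, mean_label in points:
    print(f"{mean_score:.2f} -> {mean_label:.2f}")
```

Plotting the first element of each pair on the x-axis and the second on the y-axis gives the calibration plot; a well-calibrated model lies close to the diagonal.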

### Platt’s Calibration / Scaling

* Given the calibration data $\{\hat{y}^{(i)}, y^{(i)}\}$, we need to predict $y^{(i)}$ from $\hat{y}^{(i)}$.

* The calibration plot often looks like a sigmoid function. So Platt proposed using a modified form of the sigmoid function, where -
    - $\hat{y}^{(i)}$ comes from the model output $[f(x_q) = y_q]$
    - $y^{(i)}$ comes from the actual $y_i$'s and estimates $P(y_q = 1 | x_q)$
    - Scaling equation
    
    $$\implies P(y_q = 1 | x_q) = \frac{1}{1 + \exp{[A f(x_q) + B]}}$$
    
    - $A$ and $B$ are found by solving an optimization problem.

* Platt scaling works well only if the calibration plot looks like a sigmoid function. Otherwise, we opt for isotonic calibration (isotonic regression).

* Isotonic calibration is widely used in practice.
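To make the optimization step for $A$ and $B$ concrete, here is a rough sketch that fits them by plain gradient descent on the log-loss, using the same form $1 / (1 + \exp[A f(x_q) + B])$ as the scaling equation. The learning rate, iteration count, and toy scores are assumptions for illustration; production implementations typically use a more robust optimizer.

```python
import math

def platt_probability(score, A, B):
    """P(y = 1 | x) = 1 / (1 + exp(A * f(x) + B))."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def fit_platt(scores, targets, lr=0.1, n_iter=5000):
    """Estimate A and B by gradient descent on the mean log-loss."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, targets):
            p = platt_probability(f, A, B)
            # For this parameterization:
            # d(log-loss)/dA = (y - p) * f and d(log-loss)/dB = (y - p).
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Hypothetical uncalibrated scores f(x) with their actual labels.
scores = [-3.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

A, B = fit_platt(scores, labels)
print(A, B)
print(platt_probability(2.5, A, B))
```

With this sign convention, $A$ comes out negative when larger scores correspond to more positive labels. The targets may be the raw 0/1 labels, as here, or the per-chunk means from the calibration data.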

### Isotonic Regression / Calibration

* When the calibration data does not have a sigmoidal shape, we choose isotonic regression to predict $y^{(i)}$.

* Isotonic regression uses piece-wise linear models: it breaks model building down into multiple sets of data (one for every step) and tries to get the best approximation on each, while keeping the overall fit non-decreasing.

![isotonic-piece-wise](https://user-images.githubusercontent.com/63333753/126634652-d63093b8-0abe-467c-bf40-e6f2d6b504e5.png)

* The only problem is knowing which points each model $L_i$ should cover. For example, if a point lies in the range (7, 10), we use the model $L_3$ to predict $y^{(i)}$. The question is how to find the exact thresholds that decide which model handles which set of points.

* To find them, we solve an optimization problem (minimizing the errors between the points and the lines). This determines the threshold values.

    - The optimization problem finds the thresholds along with the slopes and intercepts of the lines, under a constraint that there should not be too many lines.
    
    - This optimization problem looks like an extended linear regression problem.

* In isotonic regression we have more parameters to estimate than in Platt's scaling technique. Thus, we need more data or more points (calibration data).

* When the calibration data is small, it is suggested to use Platt's scaling rather than isotonic regression.
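One standard way to solve the monotone least-squares problem is the Pool Adjacent Violators algorithm (PAVA). The sketch below uses it, which is a swapped-in technique rather than the exact formulation in these notes: it produces a non-decreasing step function (flat segments), and the break points between segments play the role of the thresholds. The function name `isotonic_fit` and the toy values are assumptions for this example.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to y using Pool Adjacent Violators.

    y must already be ordered by increasing model score.
    """
    # Each block stores [sum of values, number of points].
    blocks = []
    for value in y:
        blocks.append([value, 1])
        # Merge backwards while the previous block's mean exceeds the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand every block back to one fitted value per input point.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical per-chunk means of y, ordered by mean predicted score.
observed = [0.1, 0.3, 0.2, 0.4, 0.8, 0.6, 0.9]
fitted = isotonic_fit(observed)
print(fitted)
```

Here the pairs (0.3, 0.2) and (0.8, 0.6) violate monotonicity, so each pair is pooled into its average. Every pooled block adds a parameter to estimate, which is consistent with the point above that isotonic regression needs more calibration data than Platt's scaling.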

<h3><span style="color:red">RANSAC</span> (<span style="color:red">RAN</span>dom <span style="color:red">SA</span>mpling <span style="color:red">C</span>onsensus)</h3>

* It is a technique for building a machine learning model in the presence of outliers in the data.
    - How to build a robust model in the presence of outliers?

* This concept or technique is mostly used in the field of image processing and computer vision.

* RANSAC is a model-independent technique.

* Let's say we build a linear regression model without using RANSAC. Any outliers in the data tend to pull the best-fit line towards them.

![without_ransac](https://user-images.githubusercontent.com/63333753/126858877-f57c4286-0463-40d1-8cdc-4a0453889ff1.png)

* **RANSAC model steps**
    - From the whole training data ($D_{Train}$), take a random sample ($D_0$).
        - Build a model on $D_0$ and let $L_0$ be a model for the best fit line.
            - Most of the points in $D_0$ are going to be inliers and very few will be outliers.
            - The line $L_0$ is less impacted by the few outliers.
    - Create an outlier set $O_0$ from $D_{Train}$ (points that lie far from $L_0$) and construct another data set $D_{Train}^1 = D_{Train} - O_0$.
        - Take a random sample ($D_1$) from $D_{Train}^1$.
        - Build a model on $D_1$ and let $L_1$ be a model for the best fit line.
    - Create an outlier set $O_1$ from $D_{Train}^1$ (points that lie far from $L_1$) and construct another data set $D_{Train}^2 = D_{Train}^1 - O_1$.
        - ...
        - ...
    - Repeat this whole process up to $k$ times, or until there is no difference between the models $L_i$ and $L_{i+1}$.
    - The final robust model ($L_*$) is built on $D_{Train}^{i+1}$, from which the outliers have been removed.
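The steps above can be sketched for a straight-line model as follows. This follows the iterative procedure in these notes; the classic RANSAC formulation instead fits many small random samples and keeps the model with the largest set of inliers. The residual `threshold` used to define the outlier set, the `sample_size`, the stopping tolerance `tol`, and the toy data are all assumptions for illustration.

```python
import random

def fit_line(points):
    """Ordinary least-squares slope and intercept for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def iterative_ransac(data, sample_size, threshold, max_rounds=20, tol=1e-9, seed=0):
    rng = random.Random(seed)
    remaining = list(data)   # D_Train^i
    previous = None          # model from the previous round
    for _ in range(max_rounds):
        # D_i: random sample from the current training set.
        sample = rng.sample(remaining, min(sample_size, len(remaining)))
        slope, intercept = fit_line(sample)  # L_i
        # Remove the outlier set O_i: points whose residual under L_i is large.
        remaining = [(x, y) for x, y in remaining
                     if abs(y - (slope * x + intercept)) <= threshold]
        # Stop once two consecutive models are effectively identical.
        if previous is not None and (abs(slope - previous[0]) < tol
                                     and abs(intercept - previous[1]) < tol):
            break
        previous = (slope, intercept)
    # L_*: final model fitted on everything that survived.
    return fit_line(remaining), remaining

# Hypothetical data: points on y = 2x + 1 plus two gross outliers.
inliers = [(x, 2.0 * x + 1.0) for x in range(20)]
outliers = [(5, 60.0), (15, -40.0)]

(slope, intercept), kept = iterative_ransac(
    inliers + outliers, sample_size=15, threshold=15.0
)
print(slope, intercept, len(kept))
```

The threshold matters: too small and good points are discarded along with the outliers, too large and the outliers survive every round.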

### Retraining Models Periodically

* When new data is collected regularly (e.g., daily), we need to re-train the model so that it continues to perform well.

**How do we know when to re-train the model?**

* We need to re-train when the model's performance drops.
* If the data distribution itself keeps changing (non-stationary), then the model should be re-trained regularly.
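A performance-drop rule like the first point can be sketched as a simple check. The function name `should_retrain`, the `max_drop` tolerance, and the example scores are hypothetical; the right metric and tolerance depend on the application.

```python
def should_retrain(baseline_score, recent_scores, max_drop=0.05):
    """Flag re-training when the recent average falls more than max_drop below baseline."""
    recent_mean = sum(recent_scores) / len(recent_scores)
    return (baseline_score - recent_mean) > max_drop

# Hypothetical daily accuracy values measured in production.
stable_week = [0.89, 0.88, 0.90]
degraded_week = [0.84, 0.82, 0.80]

print(should_retrain(0.90, stable_week))
print(should_retrain(0.90, degraded_week))
```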

### A/B Testing

* It is one of the most important parts of machine learning. It is as important as modelling itself.

* It is also known as bucket testing, split-run testing, or controlled experiments.

* It is a user experience research methodology - randomized experiment with two variants, **A** and **B**.
    - It includes application of **statistical hypothesis testing** or **two-sample hypothesis testing** as used in the field of statistics.

<img src="https://miro.medium.com/max/1838/1*aPUSwnuyUsxF_ACHuxiDlA.png">

* A/B testing is used to test a model in the production environment.

* We use the concept of **Confidence Interval** to compare the performances of two models **A** and **B**.
    - If the CIs of the two models overlap, then we should keep the old model.
    - Otherwise, we can use the new model (provided it is the better one).
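As an illustration of the overlap rule, the sketch below builds a normal-approximation 95% confidence interval for a success rate (for example, a click-through rate) and checks whether two intervals overlap. The metric, the choice of `z = 1.96`, and the counts are assumptions for this example.

```python
import math

def proportion_ci(successes, trials, z=1.96):
    """Normal-approximation confidence interval for a success rate."""
    p = successes / trials
    half_width = z * math.sqrt(p * (1 - p) / trials)
    return p - half_width, p + half_width

def intervals_overlap(ci_a, ci_b):
    """True when the two (low, high) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical results: model A (old) and model B (new), 2000 users each.
ci_old = proportion_ci(successes=200, trials=2000)
ci_new = proportion_ci(successes=280, trials=2000)

if intervals_overlap(ci_old, ci_new):
    print("CIs overlap: keep the old model")
else:
    print("CIs do not overlap: the new model can replace the old one")
```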

**Credits** - Image from Internet

### Data Science / ML Project Life Cycle

1. Understand the business requirement.
    - Defining the problems and objectives.

2. Data Acquisition
    - ETL
    - SQL

3. Data Preparation
    - Cleaning
    - Preprocessing

4. Exploratory Data Analysis
    - Data Analysis
    - Data Visualization
    - Hypothesis Testing

5. Modelling, Evaluation & Interpretation

6. Produce analytical reports
    - Communication
    - 1-pager reports

7. Deployment
    - Engineering

8. Real world testing
    - A/B testing

9. Customer / Business buy-in

10. Operations
    - Retrain models
    - Handle failure processes

11. Optimization
    - Improve models
    - More data
    - More features
    - More optimization of code

### Productionization and Deployment of ML Models

There are two ways of productionizing the models -

1. Using `sklearn` itself: store the model on the hard disk and load it into `RAM` whenever we want to make a prediction.
    - We can use a `pickle` file to store the model.

2. Custom Implementation Approach
    - a
        - Store the parameters in a file or anything.
        - During runtime, to predict for an input, implement the `predict()` function in `C`, `C++`, or `Java`.
        - Thus, we will have a low latency system.
    - b
        - If we are using a logistic regression model, storing the coefficients in a hash table (dictionary) works well because looking up a weight by feature name takes constant time on average.
    - c
        - If we are using a decision tree model, once we get the if-else conditions, we implement the same in `C`, `C++`, or `Java` to get the results faster.
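Both approaches can be illustrated together in a small sketch: logistic regression coefficients are kept in a dictionary, stored on disk with `pickle`, loaded back, and used by a hand-written prediction function. It is written in Python for brevity, whereas the notes suggest `C`, `C++`, or `Java` for low latency. The feature names, weights, and file name are hypothetical.

```python
import math
import os
import pickle
import tempfile

# Hypothetical trained coefficients, keyed by feature name.
weights = {"bias": -1.0, "age": 0.03, "income": 0.00001}

def predict_proba(weights, features):
    """Logistic regression prediction using coefficients from a dictionary."""
    z = weights["bias"] + sum(
        weights[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")
    # Store the parameters on disk.
    with open(path, "wb") as handle:
        pickle.dump(weights, handle)
    # Later, at serving time: load them back into memory.
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

probability = predict_proba(loaded, {"age": 40, "income": 50000})
print(probability)
```

A fitted `sklearn` estimator can be stored with `pickle.dump` in the same way. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.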