# <center><span style='background :yellow'>Data Science - Statistics & Mathematics</span></center>

---

**Ref URL**
- [Analytics Vidhya - Ultimate Data Science - Statistics Mathematics Cheat Sheet](https://medium.com/analytics-vidhya/your-ultimate-data-science-statistics-mathematics-cheat-sheet-d688a48ad3db)
- [Analytics Vidhya - Confusion Matrix-Accuracy-Precision-Recall-F1 Score](https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd)
- [Towards Data Science - Accuracy, Precision, Recall or F1](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9)
- [Towards Data Science - Decoding Confusion Matrix](https://towardsdatascience.com/decoding-the-confusion-matrix-bb4801decbb)
- [Medium - Classification Report](https://medium.com/@kennymiyasato/classification-report-precision-recall-f1-score-accuracy-16a245a437a5)
- [Building Intelligence Together - Model Selection: Accuracy, Precision, Recall or F1](https://koopingshung.com/blog/machine-learning-model-selection-accuracy-precision-recall-f1/)
- [Data School - Guide to Confusion Matrix](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
- [Blogspot - Confusion Matrix](https://manisha-sirsat.blogspot.com/2019/04/confusion-matrix.html)

## <font color = green>Topics</font>

### Classifier Metrics
1. Confusion Matrix
2. Sensitivity / Recall
3. Specificity
4. Precision
5. F1 score

---

### Regressor Metrics
1. Mean Absolute Error (MAE)
2. Median Absolute Error
3. Mean Square Error (MSE)
4. Root Mean Square Error (RMSE)
5. Mean Squared Logarithmic Error (MSLE)
6. R² (coefficient of determination)

---

### Statistical Indicators
1. Correlation Coefficient
2. Covariance
3. Variance
4. Standard Deviation

---

### Types of Distributions
1. Uniform Distribution
2. Normal Distribution
3. Poisson Distribution

## <font color = red>Classifier Metrics</font>
Metrics used to evaluate the performance of machine learning classifiers — models that put each training example into one of several discrete categories.

### 1. Confusion Matrix
- It’s a matrix that tabulates a classifier’s predictions against the actual labels. 
- For a binary classifier it contains four cells, each corresponding to one combination of a predicted true or false and an actual true or false. 
- Many classifier metrics are based on the confusion matrix, so it’s helpful to keep an image of it stored in your mind.

![](https://miro.medium.com/max/1400/1*dXXjDqSN6jx9Y0KHId7ypg.png)

                                 Predicted Value - Machine Learning Model prediction
                                 Real Value - The true value/label in the data

- <font color = olive>**True Negative (TN)**</font>
    - The actual value was **False, and the model predicted False**.
    - It correctly identified that the person does not like dogs.
- <font color = salmon>**False Positive (FP)**</font> 
    - The actual value was **False, and the model predicted True**.
    - This is also known as a Type I error (rejection of a true null hypothesis).
    - It predicted yes, the person likes dogs, but they actually don’t.
- <font color = red>**False Negative (FN)**</font>
    - The actual value was **True, and the model predicted False**. 
    - This is also known as a Type II error (non-rejection of a false null hypothesis).
    - It predicted no, the person does not like dogs, but they actually do.
- <font color = green>**True Positive (TP)**</font>
    - The actual value was **True, and the model predicted True**.
    - It correctly identified that the person does like dogs.
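
The four cells can be counted directly from paired labels. The sketch below is only an illustration: the function name `confusion_counts` and the "likes dogs" data are made up for this example, and `True` is assumed to be the positive class.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN and TN for two equal-length lists of booleans."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1      # actual True, predicted True
        elif not a and p:
            fp += 1      # actual False, predicted True (Type I error)
        elif a and not p:
            fn += 1      # actual True, predicted False (Type II error)
        else:
            tn += 1      # actual False, predicted False
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical data: True means "likes dogs".
actual = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, False, True, False, False, True]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```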


![](https://miro.medium.com/max/1400/1*Ub0nZTXYT8MxLzrz0P7jPA.png)

<br>

**Important points about Model Selection metrics**
1. <font color = blue>Accuracy</font> - use when both <font color = green>True Positives</font> and <font color = olive>True Negatives</font> matter to the business
2. <font color = blue>Recall</font> - use when a large number of <font color = red>False Negatives</font> has a higher cost to the business
3. <font color = blue>Precision</font> - use when a large number of <font color = salmon>False Positives</font> has a higher cost to the business
4. <font color = blue>F1-Score</font> - use when you want to balance <font color = red>False Negatives</font> and <font color = salmon>False Positives</font>.


![](https://miro.medium.com/max/552/1*5CATnJ2FyNOF9xTpJ7qA5w.jpeg)

---

### 2. Sensitivity/Recall 
- The proportion of actual `positives` that were correctly predicted.
- Recall calculates <span style='background :pink'>how many of the **Actual Positives** our model captures by labeling them as Positive (True Positive).</span>
- This is calculated as **TP/(TP+FN)** (note that false negatives are actually positives). 
- Sensitivity is a good metric to use in contexts where <span style='background :pink'>correctly predicting **positives** is important or when there is a high cost associated with a **False Negative**</span>, like medical diagnoses. 
- In some cases, false positives can be dangerous, but it is generally agreed that false negatives (*e.g. a diagnosis of ‘no cancer’ in someone who does have cancer*) are more deadly. 
- Having a model maximize sensitivity prioritizes its ability to correctly classify positives.
- <span style='background :turquoise'>Recall should ideally be 1 (high) for a good classifier. This happens when True Positives = True Positives + False Negatives, meaning False Negatives is zero.</span>
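
As a rough illustration of the **TP/(TP+FN)** formula, the sketch below computes recall from raw counts. The function name `recall` and the screening numbers are hypothetical.

```python
def recall(tp, fn):
    """Recall (sensitivity) = TP / (TP + FN)."""
    actual_positives = tp + fn
    if actual_positives == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return tp / actual_positives


# Hypothetical screening test: 90 sick patients caught, 10 missed.
print(recall(tp=90, fn=10))  # 0.9
# With zero false negatives, recall reaches its ideal value.
print(recall(tp=90, fn=0))   # 1.0
```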
 
---

### 3. Specificity 
- It is the proportion of actual `negatives` that were correctly predicted
- Specificity should ideally be high (1 means there are no False Positives)
- This is calculated as **TN/(TN+FP)** (note that false positives are actually negatives). 
- Like sensitivity, specificity is a helpful metric, in this case for scenarios where <span style='background :pink'>accurately classifying negatives is more important than classifying positives.</span>
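
A minimal sketch of **TN/(TN+FP)**, assuming made-up counts. It also prints the false positive rate, which is simply 1 minus specificity. The name `specificity` is chosen for this example only.

```python
def specificity(tn, fp):
    """Specificity (true negative rate) = TN / (TN + FP)."""
    actual_negatives = tn + fp
    if actual_negatives == 0:
        raise ValueError("specificity is undefined when there are no actual negatives")
    return tn / actual_negatives


# Hypothetical counts: 950 healthy people cleared, 50 flagged by mistake.
spec = specificity(tn=950, fp=50)
false_positive_rate = 1 - spec

print(spec)                           # 0.95
print(round(false_positive_rate, 2))  # 0.05
```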

---

### 4. Precision 
- It shows correctness achieved in positive prediction.
- It can be thought of as the `counterpart of sensitivity or recall`
    - Sensitivity measures the proportion of actually true observations that were predicted as true
    - Precision measures <span style='background :pink'>the proportion of predicted true observations that actually were true.</span>
    - In other words, precision tells you <span style='background :pink'>how precise your model is: out of those predicted positive, how many of them are actually positive.</span>
- This is measured as **TP/(TP+FP)**.
- Precision is a good measure to use when the `cost of a False Positive is high`. 
- For instance, email spam detection. In email spam detection, a false positive means that an email that is non-spam (actual negative) has been identified as spam (predicted spam).
- Precision and recall together provide a rounded view of a model’s performance.
- <span style='background :turquoise'>Precision should ideally be 1 (high) for a good classifier (meaning False Positives is zero).</span>
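
To see how precision and recall can disagree on the same model, here is a small sketch with a hypothetical spam filter. The counts and the function name `precision` are invented for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positives = tp + fp
    if predicted_positives == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return tp / predicted_positives


# Hypothetical spam filter: 40 spam emails flagged, 10 regular emails
# flagged by mistake, and 60 spam emails that slipped through.
tp, fp, fn = 40, 10, 60

print(precision(tp, fp))  # 0.8 -> most flagged emails really are spam
print(tp / (tp + fn))     # 0.4 -> but recall shows most spam is missed
```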

---

<img src="https://miro.medium.com/max/1400/1*93lHZUzCKeGc60-bW8pVOw.png" width="700"/>

---

### 5. F1-Score
- Ideally, in a good classifier, we want both Precision and Recall to be 1, which also means False Positives and False Negatives are zero. Therefore we need a metric that takes into account both precision and recall, and F1-Score is that metric.
- F1-score combines precision and recall through the **harmonic mean** (*the harmonic mean is pulled toward the smaller of the two values, so it penalizes a model that is excellent on one metric and poor on the other, whereas the arithmetic mean weights both values the same*)
- The `F1 score` is used to <span style='background :pink'>measure a **test’s accuracy**</span>, and it balances the use of **precision** and **recall** to do it.

$$
F_{1}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

- <span style='background :turquoise'>F1 Score becomes 1 only when both Precision and Recall are 1</span>
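
The effect of the harmonic mean can be checked numerically. The sketch below compares F1 with the plain arithmetic mean for a balanced and a lopsided model. The precision and recall values are made up, and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.8, 0.8)    # both metrics are decent
lopsided = f1_score(1.0, 0.2)    # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.2) / 2

print(round(balanced, 3))         # 0.8
print(round(lopsided, 3))         # 0.333
print(round(arithmetic_mean, 3))  # 0.6, hides the poor recall
```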

---

<span style='background :orange'>**IMBALANCED DATASET -- Why to use F1-score and not Accuracy**</span>
- F1 Score is needed when you want to seek a balance between Precision and Recall.
- Accuracy can be inflated by a large number of `True Negatives`, which in most business circumstances we do not focus on much, whereas `False Negatives` and `False Positives` usually have business costs (tangible & intangible)
- Thus F1 Score might be a better measure to use if we need to seek a balance between Precision and Recall AND `there is an uneven class distribution (large number of Actual Negatives)`.
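
A small numeric sketch makes the point. Assume a hypothetical dataset of 1000 examples with only 10 actual positives, and a lazy model that predicts "negative" for everything. The function names are illustrative, and F1 is written in its equivalent count form 2TP / (2TP + FP + FN).

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Lazy model: every example is predicted negative.
lazy = {"tp": 0, "fp": 0, "fn": 10, "tn": 990}

print(accuracy(**lazy))                                   # 0.99
print(f1_from_counts(lazy["tp"], lazy["fp"], lazy["fn"]))  # 0.0
```

Accuracy looks excellent because of the 990 True Negatives, while F1 exposes that not a single positive was found.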

<span style='background :pink'>The F1 score should be used when both False Positives and False Negatives are costly, so mistakes of either kind need to be penalized.</span>
<br>
<span style='background :yellow'>Accuracy should be used when the classes are roughly balanced and True Positives and True Negatives matter equally.</span>

---

### Summary
- <font color = red>**Precision**</font> is how certain you are that your predicted positives are true positives.
    - Choose Precision if you want to be more **confident of your true positives**.
        - For example, in case of spam emails, you would rather have some spam emails in your inbox than some regular emails in your spam box. 
        - You would like to be extra sure that email X is spam before you put it in the spam box.
- <font color = red>**Recall**</font> is how certain you are that you are not missing any positives.
    - Choose Recall if the occurrence of **false negatives is unacceptable/intolerable**. 
- <font color = red>**Specificity**</font> is, out of **all the actual negatives, how many we have predicted correctly**.
    - Specificity should be high.
    - Choose specificity if you want to **cover all true negatives**, i.e. you do not want any false alarms or false positives. 
        - For example, in case of a drug test in which all people who test positive will immediately go to jail, you would not want anyone drug-free going to jail.
        
<img src="https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg" width="600"/>

---

### Implementing the Metrics

<img src="https://miro.medium.com/max/1400/1*aABip6ltJG6meLlu07EHPw.png" width="500"/>

## <font color = red>Regressor Metrics</font>
Regression metrics are used to measure performance of a model that puts a training example on a continuous scale, such as determining the price of a house.

### 1. Mean Absolute Error (MAE) 
- The most common and interpretable regression metric.
>- The error between two numbers is simply the difference between them. 
>- The absolute error is the absolute difference. 
>- To find the mean absolute error, you must find the absolute error between corresponding values in the sets, and then find the mean of those errors.
- It measures the average magnitude of the errors in a set of predictions, without considering their direction.
- MAE calculates the `difference between each data point’s predicted y-value and the real y-value, then averages every difference` (the difference being calculated as the absolute value of one value minus the other).

$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left | y_{i} - \hat{y}_{i} \right |
$$

Here $y_{i}$ is the real value and $\hat{y}_{i}$ is the predicted value for data point $i$.
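
The MAE calculation can be sketched in a few lines of plain Python. The function name and the sample values are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |real - predicted| over all data points."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Absolute errors are 0.5, 0.0, 2.0 and 1.0, so the mean is 3.5 / 4.
print(mean_absolute_error(y_true, y_pred))  # 0.875
```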

---

### 2. Median Absolute Error
- A metric that evaluates the typical error by taking the median, rather than the mean, of the absolute errors. 
- Like MAE, it is a measure of errors between paired observations expressing the same phenomenon.
- Because it focuses on the middle error value, it is robust to outliers, but it also ignores the extremely high or low errors that are factored into the mean absolute error.
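
The difference from MAE shows up as soon as one prediction is badly off. In this sketch (made-up numbers, hypothetical function name) a single outlier dominates the mean but leaves the median untouched.

```python
import statistics


def median_absolute_error(y_true, y_pred):
    """Median of |real - predicted| over all data points."""
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


y_true = [10, 20, 30, 40, 50]
y_pred = [11, 19, 31, 39, 150]   # the last prediction is an outlier

abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
print(sum(abs_errors) / len(abs_errors))      # 20.8 (mean absolute error)
print(median_absolute_error(y_true, y_pred))  # 1 (median absolute error)
```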

---

### 3. Mean Square Error (MSE)
- Regression metric that `punishes higher errors more`. 
- MSE calculates the square of the difference between each data point’s predicted y-value and real y-value, then averages the squares.
    - For example
        - An error (difference) of 2 would be weighted as 4
        - An error of 5 would be weighted as 25
        - With squared errors, the gap between the two contributions is 21.
        - Mean Absolute Error (MAE) weights the gap at its face value, 3.
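
The example with errors of 2 and 5 can be reproduced directly. This is a sketch with made-up values, and the function name is illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (real - predicted)^2 over all data points."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)


y_true = [10, 10]
y_pred = [12, 15]   # errors of 2 and 5

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
print(squared_errors)                       # [4, 25]
print(mean_squared_error(y_true, y_pred))   # 14.5
```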

---

### 4. Root Mean Square Error (RMSE) 
- It is used to give a level of interpretability that mean square error lacks.
- By taking the square root of the MSE, we get a metric that is in the same units as the target (like MAE), while still weighting higher errors more heavily.
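
A short sketch, using invented numbers, shows RMSE sitting on the same scale as MAE while reacting more strongly to the one large error. Function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error, for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 20, 30, 40]
y_pred = [11, 21, 31, 47]   # errors of 1, 1, 1 and 7

print(mae(y_true, y_pred))             # 2.5
print(round(rmse(y_true, y_pred), 3))  # 3.606
```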

---

### 5. Mean Squared Logarithmic Error (MSLE) 
- Another common variation of the mean squared error, computed on the logarithms of the predicted and real values. 
- Because of the logarithmic nature of the error, MSLE mostly cares about `relative (percent) differences`. 
- This means that MSLE will treat small differences between small values (for example, 4 and 3) roughly the same as large differences on a large scale (for example, 1200 and 900), since both pairs have the same ratio.
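
The sketch below compares MSE and MSLE on the two pairs mentioned above. It assumes the common `log(1 + x)` convention (so zero values are allowed). Because of the added 1, the two MSLE values are close but not identical. Function names are illustrative.

```python
import math


def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + real) and log(1 + predicted)."""
    terms = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)


def mse(y_true, y_pred):
    """Mean squared error, for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


small = ([3], [4])        # small values, ratio 4:3
large = ([900], [1200])   # large values, same ratio

print(mse(*small), mse(*large))    # 1.0 90000.0
print(msle(*small), msle(*large))  # same order of magnitude
```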

---

### 6. R² 
- A commonly used metric, also called the coefficient of determination, which measures `the proportion of the variance in the dependent variable` that can be explained by the independent variables. In simple linear regression it equals the square of the correlation coefficient r. 
- In short, it is a good metric of `how well the regression model fits the data`.
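
One common way to compute R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition. The data and the function name are made up.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]
good_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
mean_only = [3, 3, 3, 3, 3]   # always predicts the mean of y_true

print(round(r_squared(y_true, good_fit), 2))  # 0.99
print(r_squared(y_true, mean_only))           # 0.0
```

A model that only ever predicts the mean explains none of the variance, which is why it scores 0.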

---

### Implementing the Metrics

<img src="https://miro.medium.com/max/1400/1*aABip6ltJG6meLlu07EHPw.png" width="500"/>

## <font color = red>Statistical Indicators</font>

### 1. Correlation
- A statistical measure of `how well two variables fluctuate together`. 
- A positive correlation means that two variables move in the same direction (a positive change in one goes with a positive change in the other)
- A negative correlation means that two variables move in opposite directions (a positive change in one goes with a negative change in the other). 
- Correlation is more useful for `determining the strength of the relationship between two variables`.
- The correlation coefficient, from -1 to +1, is also known as r.
- Formula for Correlation Coefficient (r) is 

$$
r = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}
{\sqrt{\sum (x_i - \overline{x})^2 \sum (y_i - \overline{y})^2}}
$$

<br> 

<img src="https://miro.medium.com/max/1400/1*Q44XbGSYlc0pz9ZrE9SL-Q.png" width="800"/>

- The correlation coefficient can be computed with the `.corr()` method of a Pandas DataFrame.
- Calling `table.corr()` yields a correlation table (*The correlation table is symmetric, and its diagonal equals 1 because each sequence is compared against itself*)
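
For intuition, the Pearson coefficient can also be computed by hand. The sketch below is a plain-Python version with made-up data. It is not how Pandas implements `.corr()`, just the textbook formula.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator


x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))  # 1.0  (moves perfectly together)
print(pearson_r(x, [10, 8, 6, 4, 2]))  # -1.0 (moves perfectly opposite)
print(pearson_r(x, [1, 3, 2, 5, 4]))   # 0.8  (strong but imperfect)
```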

<img src="https://miro.medium.com/max/1400/1*DvyePGWrErXFXHXl9303Kg.png" width="600"/>

---

### 2. Covariance 
- A measure of the `joint variability of two variables`: it is positive when they tend to move together and negative when they tend to move in opposite directions.
- Unlike correlation, covariance can `take on any number, while correlation is limited to the range -1 to +1`.
- Covariance has units (unlike correlation) and is `affected by changes in scale`
- It is less widely used as a stand-alone statistic.
- Covariance is used in many statistics formulas, and is a useful figure to know.
- Covariance can be computed in Python with **numpy.cov(a,b)[0][1]**, where a and b are the sequences to be compared.
- Formula for the sample Covariance (Cov) is 

$$
Cov_{xy} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{n-1} = \frac{\sum_{i=1}^{n} x_i y_i - n\,\overline{x}\,\overline{y}}{n-1}
$$
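
The dependence on scale is easy to demonstrate. The sketch below computes the sample covariance by hand (dividing by n - 1) and then rescales one variable. The data and the function name are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(sample_covariance(x, y))  # 5.0

# Expressing x in different units (multiplying by 100) scales the covariance too.
x_scaled = [100 * a for a in x]
print(sample_covariance(x_scaled, y))  # 500.0
```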

---

### 3. Variance 
- It is the `expectation of the squared deviation of a random variable from its mean`. 
- It informally measures `how far a set of numbers is spread out from their mean`. 
- Variance can be measured with the statistics library (**import statistics**): **statistics.pvariance(list)** gives the population variance shown below, while **statistics.variance(list)** gives the sample variance, which divides by n - 1 instead of N.
- Formula for the population Variance ($\sigma^{2}$) is 

$$
\sigma^{2} = \frac{\sum (x - \mu)^{2}}{N}
$$
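
The sketch below applies the population formula by hand and compares it with the two functions from Python's `statistics` module. The data set and the helper name `population_variance` are made up for illustration.

```python
import statistics


def population_variance(values):
    """Mean of squared deviations from the mean (divides by N)."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32

print(population_variance(data))   # 4.0
print(statistics.pvariance(data))  # population variance, divides by N
print(statistics.variance(data))   # sample variance, divides by N - 1
```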

---

### 4. Standard Deviation 
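- It’s the `square root of the variance`
- It is expressed in the `same units as the data`, which makes it easier to interpret than variance as a measure of how spread out a distribution is. 
- Standard deviation can be measured in the statistics library with **statistics.stdev(list)**, which computes the sample standard deviation (**statistics.pstdev(list)** divides by n instead of n - 1).
- Formula for the sample Standard Deviation ($s$) is 

$$
s = \sqrt{\frac{\sum (x - \overline{x})^2}{n - 1}}
$$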
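
As a quick check that the standard deviation is just the square root of the variance, here is a sketch using Python's `statistics` module on invented height data.

```python
import math
import statistics

heights_cm = [150, 160, 170, 180, 190]   # hypothetical sample

sample_variance = statistics.variance(heights_cm)   # in cm squared
sample_std = statistics.stdev(heights_cm)           # back in cm

print(sample_variance)
print(round(sample_std, 2))
print(math.isclose(sample_std, math.sqrt(sample_variance)))  # True
```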

## <font color = red>Types of Distribution</font>

---

<img src="https://miro.medium.com/max/1400/1*epy6pFCxBuBIglmWaLTlgg.png" width="600"/>

---

### 1. Uniform Distribution 
- It is completely flat: every outcome within the bounds is equally likely. 
- This distribution describes an experiment where there is an arbitrary outcome that lies between `certain bounds`.
- The bounds are defined by the parameters, a and b, which are the minimum and maximum values. 
- The interval can be either closed (e.g. [a, b]) or open (e.g. (a, b)).
- Therefore, the distribution is often abbreviated U(a, b), where U stands for uniform distribution.
    - For example, recording the number of dots a die lands on (from 1 to 6) for each of 6000 throws would yield a flat distribution, with approximately 1000 throws per number of dots (this is the discrete version of the uniform distribution). 
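
The die example can be simulated in a few lines. This is a sketch using Python's `random` module with a fixed seed (an arbitrary choice so the run is repeatable). The exact counts will hover around 1000 rather than hit it exactly.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed, chosen arbitrarily for repeatability
rolls = [rng.randint(1, 6) for _ in range(6000)]   # 6000 throws of a fair die

counts = Counter(rolls)
for face in range(1, 7):
    print(face, counts[face])   # each count should be close to 1000
```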

---

### 2. Normal Distribution (Gaussian curve)
- It’s a very common distribution shaped like a bell (one name for the distribution is the ‘Bell Curve’).
- A normal variable has a mean “μ” and a standard deviation “σ”. Regardless of the values of the mean and standard deviation, all normal distributions have a distinguishable bell shape.
- Besides its common use in data science, it approximately describes many real-world quantities, such as IQ scores. It is characterized by the following features:
    - About 68% of the data is within one standard deviation of the mean.
    - About 95% of the data is within two standard deviations of the mean.
    - About 99.7% of the data is within three standard deviations of the mean.
- Many statistical and machine learning methods assume normally distributed data or errors. 
    - For example, classical linear regression inference assumes the residuals are normally distributed. The residuals can be visualized with seaborn’s **residplot()**.
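
The 68 / 95 / 99.7 figures can be verified with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is made up for this sketch.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))   # roughly 0.68, 0.95 and 0.997
```

Because the rule is stated in units of σ, the same shares hold for any mean and standard deviation.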
    
<img src="https://cdn.analystprep.com/cfa-level-1-exam/wp-content/uploads/2019/10/05093814/page-123.jpg" width="600">

---

### 3. Poisson Distribution 
- For large values of λ, it is well approximated by a normal distribution.
- A discrete probability distribution that `expresses the probability of a given number of events` occurring in a fixed interval of time or space (*if these events occur with a known constant mean rate and independently of the time since the last event*). 
- Poisson distributions are valid only for non-negative `integers` on the horizontal axis. λ (also written as μ) is the expected number of event occurrences, and it is both the mean and the variance of the distribution.
- Changing λ shifts the mean left or right. The distribution is right-skewed for small λ and becomes more symmetric as λ grows.
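
The probability of observing exactly k events is P(k) = e^(-λ) λ^k / k!. The sketch below evaluates it for a hypothetical rate of λ = 3 and checks that the probabilities sum to about 1 with a mean of about λ. Truncating at k = 39 is an assumption that is safe for small λ.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 3   # hypothetical average number of events per interval
probs = [poisson_pmf(k, lam) for k in range(40)]   # the tail beyond 39 is negligible here

total = sum(probs)
mean = sum(k * p for k, p in enumerate(probs))

print(round(total, 6))           # close to 1
print(round(mean, 6))            # close to lam
print(round(probs[0], 4))        # chance of zero events
```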

<img src="https://www.statisticshowto.com/wp-content/uploads/2013/10/poisson-distribution.png" width="400">