<a href="https://colab.research.google.com/github/nickstone1911/data-analysis-practice/blob/main/Model_Evaluation_Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Business Analytics

## Lesson 07 Tech Exercises: Calculating Model Evaluation Metrics

### Learning Objectives
In this exercise we:

>- Develop a deeper understanding of the various model evaluation metrics
>- Practice calculating model evaluation metrics for classification problems using a confusion matrix
>- Practice calculating regression model evaluation metrics
>- Continue developing Python programming skills by working with `pandas` and `numpy` to create model evaluation functions

Note: in this exercise we will not use `scikit-learn` to calculate the metrics for us. Instead, we will calculate them more "manually" with pandas and numpy so that we develop a deeper understanding of the metrics before using `scikit-learn`.


---



# Section 1: Classification Metrics

## Learning Objectives
In this section we are going to:
1. Gain experience working with a confusion matrix
2. Review `pandas` by creating a confusion matrix DataFrame with additional columns and rows for column and row totals
3. Continue developing fundamental Python programming skills by creating a classification metrics function to calculate accuracy, precision, recall, and f1_score.

---

## Confusion Matrix Use Case

Consider a binary classification problem where a model predicts whether emails are spam (positive class) or not spam (negative class). The confusion matrix for the model is as follows:

\begin{array}{|c|c|}
\hline
195 (TN) & 5 (FP) \\
\hline
25 (FN) & 175 (TP) \\
\hline
\end{array}

>- Note: this is the default way `scikit-learn` displays a confusion matrix (rows are actual classes, columns are predicted classes, negative class first), so we will stay consistent with it. Other documentation may arrange the confusion matrix differently, such as:

\begin{array}{|c|c|}
\hline
(TP) &  (FN) \\
\hline
(FP) & (TN) \\
\hline
\end{array}
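
To make the two layouts concrete, here is a minimal sketch that stores the spam counts in a `numpy` array using the `scikit-learn` layout, unpacks the four cells, and rebuilds the alternate layout. The variable names (`cm`, `cm_alt`, `tn`, `fp`, `fn`, `tp`) are illustrative choices, not names used elsewhere in this exercise.

```python
import numpy as np

# Counts from the spam example, laid out the way scikit-learn displays them:
# rows are actual classes, columns are predicted classes, negative class first.
cm = np.array([[195, 5],
               [25, 175]])

# ravel() flattens row by row, so the order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

# The alternate layout puts the positive class first: [[TP, FN], [FP, TN]].
cm_alt = np.array([[tp, fn],
                   [fp, tn]])

print(tn, fp, fn, tp)
print(cm_alt)
```

Both arrays hold the same four counts; only the positions differ, which is why it matters to know which layout a given source is using before reading off TP, TN, FP, and FN.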

---

## 1.2: Pandas Review and Confusion Matrix DataFrame

In the next cell, create a pandas DataFrame, `cm_df`, that:

1. Has columns labeled "Predicted (0/N)" and "Predicted (1/P)" that store the values from the confusion matrix given above
2. Has the index values "Actual (0/N)" and "Actual (1/P)"
3. Has a "Row Totals" index label (a final row) that shows the sum over the rows, one total per column
4. Has a "Column Totals" column that shows the sum across the columns, one total per row

---

Here is what your dataframe should look like:

|              	| Predicted (0/N) 	| Predicted (1/P) 	| Column Totals 	|
|--------------	|-----------------	|-----------------	|---------------	|
| Actual (0/N) 	| 195             	| 5               	| 200           	|
| Actual (1/P) 	| 25              	| 175             	| 200           	|
| Row Totals   	| 220             	| 180             	| 400           	|

In [None]:
import pandas as pd
import numpy as np

In [None]:
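d = {"Predicted (0/N)": [195, 25], "Predicted (1/P)": [5, 175]}
cm_df = pd.DataFrame(data=d, index=["Actual (0/N)", "Actual (1/P)"])
# Sum across the columns: one total per row, stored in a new column
cm_df["Column Totals"] = cm_df.sum(axis=1)
# Sum over the rows: one total per column, stored in a new row
cm_df.loc["Row Totals"] = cm_df.sum(axis=0)
cm_df

|              | Predicted (0/N) | Predicted (1/P) | Column Totals |
|--------------|-----------------|-----------------|---------------|
| Actual (0/N) | 195             | 5               | 200           |
| Actual (1/P) | 25              | 175             | 200           |
| Row Totals   | 220             | 180             | 400           |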


## 1.3 Function to Calculate Classification Metrics

In the next cell, create a function that takes a DataFrame representing a confusion matrix as defined in `1.2` and returns accuracy, precision, recall, and the F1 score.

#### Steps
1. **Import pandas**: Import the pandas library to work with DataFrames.

2. **Define the Function**:
   - Define a function called `calculate_metrics` that takes one parameter: `df`
   - `df` is the DataFrame representing the confusion matrix.
   
3. **Calculate Metrics**:
   - Calculate True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) from the DataFrame using the labels we created for columns and rows in `1.2`.
   - Calculate accuracy, precision, recall, and F1 score using the following formulas:
     - Accuracy = (TP + TN) / (TP + TN + FP + FN)
     - Precision = TP / (TP + FP)
     - Recall = TP / (TP + FN)
     - F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
   - Round all the metrics to three decimal places

4. **Return Metrics**: Return accuracy, precision, recall, and F1 score from the function.
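
Before writing the function, it can help to check the formulas by hand. The sketch below plugs the four counts from the spam confusion matrix straight into the formulas listed above, using plain Python numbers and no DataFrame. The variable name `f1` is an illustrative choice.

```python
# Counts taken from the spam confusion matrix in this section.
TP, TN, FP, FN = 175, 195, 5, 25

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 370 / 400
precision = TP / (TP + FP)                   # 175 / 180
recall = TP / (TP + FN)                      # 175 / 200
f1 = 2 * (precision * recall) / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# prints: 0.925 0.972 0.875 0.921
```

These are the values your function should reproduce once it reads the same counts from the DataFrame.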


---
### Example Usage
```python
import pandas as pd

def calculate_metrics(df):
    # Calculate TP, TN, FP, FN
    # Calculate accuracy, precision, recall, and F1 score
    return accuracy, precision, recall, f1_score

# Example DataFrame
df = pd.DataFrame({...}, index=[...])

# Calculate metrics
accuracy, precision, recall, f1_score = calculate_metrics(df)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1_score)
```

---


In [None]:
def calculate_metrics(df):
    # Look up each count by its row (actual) and column (predicted) label
    TP = df.at["Actual (1/P)", "Predicted (1/P)"]
    TN = df.at["Actual (0/N)", "Predicted (0/N)"]
    FP = df.at["Actual (0/N)", "Predicted (1/P)"]
    FN = df.at["Actual (1/P)", "Predicted (0/N)"]
    # Calculate accuracy, precision, recall, and F1 score
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2 * (precision * recall) / (precision + recall)
    # Round all the metrics to three decimal places
    return round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1_score, 3)


# Calculate metrics
accuracy, precision, recall, f1_score = calculate_metrics(cm_df)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1_score)


Accuracy: 0.925
Precision: 0.972
Recall: 0.875
F1 Score: 0.921


## 1.3b: Call `calculate_metrics()` Function and Print Results

In the next cell, call your `calculate_metrics()` function and pass in our example dataframe from `1.2`.
>- Print a summary statement similar to:

```
Classification model summary statistics:
  Model accuracy: 0.XXX
  Model precision: 0.XXX
  Model recall: 0.XXX
  Model f1_score: 0.XXX

```

In [None]:
accuracy, precision, recall, f1_score = calculate_metrics(cm_df)
print("Classification model summary statistics:")
print(f"  Model accuracy: {accuracy:.3f}")
print(f"  Model precision: {precision:.3f}")
print(f"  Model recall: {recall:.3f}")
print(f"  Model f1_score: {f1_score:.3f}")


Classification model summary statistics:
  Model accuracy: 0.925
  Model precision: 0.972
  Model recall: 0.875
  Model f1_score: 0.921


## 1.3c: Alternate Function

*Optional*: If you want your function to work with any row/column labels in a confusion matrix DataFrame, use `iat` instead of `at`.
>- `iat` accesses the data in a DataFrame using integer row and column positions instead of labels.

In [None]:
# Inside calculate_metrics(df), replace the label lookups with positions: iat[row, column]
TP = df.iat[1, 1]
TN = df.iat[0, 0]
FP = df.iat[0, 1]
FN = df.iat[1, 0]
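
Putting those four lookups into a complete function might look like the sketch below. The function name `calculate_metrics_by_position` and the `other_df` example with its "ham"/"spam" labels are hypothetical, introduced only for illustration. The sketch assumes the DataFrame follows the `scikit-learn` layout, with the negative class in row 0 and column 0.

```python
import pandas as pd

def calculate_metrics_by_position(df):
    """Hypothetical variant of calculate_metrics() that ignores labels."""
    # Assumes scikit-learn layout: row 0 / column 0 is the negative class,
    # row 1 / column 1 is the positive class. Totals are never read.
    TN, FP = df.iat[0, 0], df.iat[0, 1]
    FN, TP = df.iat[1, 0], df.iat[1, 1]
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1_score = 2 * (precision * recall) / (precision + recall)
    # Round every metric to three decimal places
    return tuple(round(float(m), 3) for m in (accuracy, precision, recall, f1_score))

# Same counts as the spam example, but with different row/column labels.
other_df = pd.DataFrame({"pred_ham": [195, 25], "pred_spam": [5, 175]},
                        index=["ham", "spam"])
print(calculate_metrics_by_position(other_df))
```

The trade-off is that position-based access silently gives wrong answers if the rows or columns are in a different order, whereas label-based access with `at` raises an error when a label is missing.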

# Section 2: Regression Metrics

## Learning Objectives

In this section we are going to:
1. Gain experience working with regression data
2. Review `numpy` and `pandas` by creating a DataFrame from scratch
3. Develop a strong understanding of the fundamental regression model evaluation metrics
4. Continue developing fundamental Python programming skills by creating a regression metrics function to calculate `MAE`, `MSE`, and `RMSE`

---

## 2.1: Regression Dataset

**Create a DataFrame**:
   >- Create a small dataset for a regression problem with an 'Actual' column containing actual values and a 'Predicted' column containing predicted values.
   >>- Use numpy to set a random seed of 42
   >>- Create 10 random integers between 0 and 100 (inclusive) for both 'Actual' and 'Predicted'
   >>- Hint: check out the [numpy random.randint doc](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) if you need a refresher on creating random data with `numpy`
   >- Store the data in a DataFrame, `reg_df`
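
One detail worth checking before building the DataFrame is that the upper bound of `np.random.randint` is exclusive. The sketch below draws a single sample to show the call shape; the variable name `sample` is an illustrative choice.

```python
import numpy as np

np.random.seed(42)  # fix the seed so the same numbers come back on every run

# randint(low, high, size) draws from [low, high): high itself is never returned,
# so high=101 is needed for 100 to be a possible value.
sample = np.random.randint(0, 101, size=10)

print(sample)
print(sample.min() >= 0, sample.max() <= 100)
```

Because each call to `randint` advances the random state, calling it once for 'Actual' and once for 'Predicted' after a single seed gives two different columns.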



In [None]:
np.random.seed(42)
reg_df = pd.DataFrame(data={
    'Actual': np.random.randint(101, size=10),     # upper bound is exclusive, so 101 allows 100
    'Predicted': np.random.randint(101, size=10),
})
reg_df

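|   | Actual | Predicted |
|---|--------|-----------|
| 0 | 51     | 87        |
| 1 | 92     | 99        |
| 2 | 14     | 23        |
| 3 | 71     | 2         |
| 4 | 60     | 21        |
| 5 | 20     | 52        |
| 6 | 82     | 1         |
| 7 | 86     | 87        |
| 8 | 74     | 29        |
| 9 | 74     | 37        |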


## 2.2: Define a Function

In the next cell,
   - Define a function called `calc_reg_metrics` that takes a DataFrame as input.
   - Inside the function, extract the 'Actual' and 'Predicted' columns from the DataFrame.

**Calculate Metrics**:

Your function should calculate the following regression metrics *without using scikit-learn*. A major goal of this exercise is to help you understand the metrics, so it is important to learn how the calculations are performed. Code them into your function using `numpy` functions, not prebuilt scikit-learn functions.

>- Calculate Mean Absolute Error (MAE) as the mean of the absolute differences between 'Actual' and 'Predicted' values.
>- Calculate Mean Squared Error (MSE) as the mean of the squared differences between 'Actual' and 'Predicted' values.
>- Calculate Root Mean Squared Error (RMSE) as the square root of MSE.

>- Round all the metrics to two decimals

**Return Metrics**:
- Return MAE, MSE, and RMSE from the function, calculated with these formulas:

>- **Mean Absolute Error (MAE):** $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$

>- **Mean Squared Error (MSE):** $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

>- **Root Mean Squared Error (RMSE):** $RMSE = \sqrt{MSE}$
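
As a quick sanity check on the three formulas, the sketch below applies them to four made-up pairs of values that are small enough to verify by hand. The arrays and variable names are illustrative and are not part of the exercise dataset.

```python
import numpy as np

# Made-up values chosen so the arithmetic is easy to check by hand.
actual = np.array([10, 20, 30, 40])
predicted = np.array([12, 18, 33, 37])

errors = actual - predicted          # [-2, 2, -3, 3]
mae = np.mean(np.abs(errors))        # (2 + 2 + 3 + 3) / 4 = 2.5
mse = np.mean(errors ** 2)           # (4 + 4 + 9 + 9) / 4 = 6.5
rmse = np.sqrt(mse)                  # sqrt(6.5), about 2.55

print(mae, mse, round(float(rmse), 2))
```

Note that MSE is in squared units, so it grows quickly with large errors, while RMSE brings the result back to the same units as the original values.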


In [None]:
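def calc_reg_metrics(df):
    actual = df['Actual']
    predicted = df['Predicted']
    # Mean of the absolute differences
    MAE = np.mean(np.abs(actual - predicted))
    # Mean of the squared differences
    MSE = np.mean((actual - predicted) ** 2)
    # Square root of MSE
    RMSE = np.sqrt(MSE)
    # Round all the metrics to two decimals
    return round(MAE, 2), round(MSE, 2), round(RMSE, 2)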

## 2.3. Display Results

In the next cell:
1. Assign values to the variables `mae`, `mse`, and `rmse` by calling your `calc_reg_metrics` function and passing in the sample DataFrame.
2. Print the calculated metrics.

In [None]:
mae, mse, rmse = calc_reg_metrics(reg_df)
print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}")

# Wrap-Up

Nice work!

After completing this exercise, you should have a stronger understanding of both classification and regression model evaluation metrics, along with more practice in Python programming.