# Model Fine-Tuning

- Learn several evaluation metrics and how to visualize them
- Optimize classification and regression with hyperparameter tuning

---

## Why Metrics Do Not Always Work

Accuracy on a dataset with class imbalance can be misleading.

Example: if your model is trained on a dataset where 99% of the observations are `yes` and 1% are `no`, a model that never predicts `no` still scores 99% accuracy when accuracy is measured as correct predictions / total observations, even though it fails to identify a single `no` case.
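
The arithmetic behind this example can be checked with a minimal sketch. The labels below are hypothetical and the "model" is simply a rule that always predicts the majority class; no real classifier is trained.

```python
# Hypothetical imbalanced labels: 99 "yes" observations and 1 "no"
y_true = ["yes"] * 99 + ["no"]

# A trivial "model" that always predicts the majority class
y_pred = ["yes"] * len(y_true)

# Accuracy: correct predictions / total observations
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the "no" class: correctly predicted "no" / actual "no"
actual_no = sum(t == "no" for t in y_true)
found_no = sum(t == "no" and p == "no" for t, p in zip(y_true, y_pred))
recall_no = found_no / actual_no

print(f"Accuracy: {accuracy:.2f}")          # Accuracy: 0.99
print(f"Recall for 'no': {recall_no:.2f}")  # Recall for 'no': 0.00
```

The accuracy looks excellent while the recall for the minority class is zero, which is why per-class metrics are needed on imbalanced data.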

---

## Confusion Matrix for Assessing a Classifier

To compute a confusion matrix and its metrics, follow these steps:
- Import `classification_report` and `confusion_matrix`
- Instantiate a classification model
- Fit
- Predict
- Print the confusion matrix
- Print the classification report

```python
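from sklearn.metrics import confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

# X_train, X_test, y_train, y_test are assumed to come from an earlier train/test split
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Compare the predictions against the true test labels
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))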
```

The tables below list the arguments and functions used for the confusion matrix and classification report:

**Arguments**

| Arguments       | Description                                                   | Syntax                                      |
|----------------|---------------------------------------------------------------|---------------------------------------------|
| y_true         | Actual labels for the test data                               | `confusion_matrix(y_true, y_pred)`          |
| y_pred         | Predicted labels for the test data                            | `confusion_matrix(y_true, y_pred)`          |
| labels         | List of labels to index the matrix                            | `confusion_matrix(y_true, y_pred, labels=)` |
| target_names   | List of target names to index the report                      | `classification_report(y_true, y_pred, target_names=)` |
| output_dict    | If True, return output as dict                                | `classification_report(y_true, y_pred, output_dict=)` |
| digits         | Number of digits for formatting output                        | `classification_report(y_true, y_pred, digits=)` |

**Functions**

| Functions            | Description                                                   | Syntax                                      |
|----------------------|---------------------------------------------------------------|---------------------------------------------|
| confusion_matrix     | Computes the confusion matrix to evaluate the accuracy of a classification | `confusion_matrix(y_true, y_pred)`          |
| classification_report| Builds a text report showing the main classification metrics  | `classification_report(y_true, y_pred, target_names=, output_dict=, digits=)` |
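
To make the optional arguments concrete, here is a small sketch on hand-written labels. The label values and the names `no`/`yes` are assumptions chosen for illustration, not data from these notes.

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels: 0 = "no", 1 = "yes"
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# Default ordering: rows are true labels, columns are predicted labels (sorted: 0, 1)
cm_default = confusion_matrix(y_true, y_pred)

# labels= reorders the matrix so that class 1 comes first
cm_reordered = confusion_matrix(y_true, y_pred, labels=[1, 0])

# target_names= gives readable row names; digits= sets the decimals in the text report
print(classification_report(y_true, y_pred, target_names=["no", "yes"], digits=3))

# output_dict=True returns a nested dict instead of a formatted string
report = classification_report(
    y_true, y_pred, target_names=["no", "yes"], output_dict=True
)
yes_recall = report["yes"]["recall"]
```

The dict form is convenient when a single metric, such as the recall of one class, needs to be logged or compared across models.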


---

## Confusion Matrix with Cross-Validation

The goal is to aggregate the results from all folds to get a comprehensive evaluation of the model's performance:
- Use k-fold cross-validation to split the data into k folds.
- For each fold:
  - Use the fold as the test set (its labels are `y_true`).
  - Use the remaining k-1 folds as the training set.
  - Train the model on the training set.
  - Predict the labels for the test set (`y_pred`).
  - Store the true and predicted labels.
- Combine the true labels and predicted labels from all folds.
- Compute the overall confusion matrix using the aggregated true and predicted labels.

**Flow of Cross-Validation for Confusion Matrix:**

1. **Initialize K-Fold Cross-Validation**: 
   - Define the number of folds (e.g., `kf = KFold(n_splits=5, shuffle=True, random_state=42)`).

2. **Prepare Lists for Aggregation**:
   - Initialize empty lists to store true and predicted labels (`all_y_true = []`, `all_y_pred = []`).

3. **Iterate Through Folds**:
   - For each fold:
     - Split the data into training and testing sets.
     - Train the model on the training set.
     - Predict the labels for the test set.
     - Append the true and predicted labels to the respective lists.

4. **Aggregate Results**:
   - Combine the true labels and predicted labels from all folds into single lists.

5. **Compute Confusion Matrix**:
   - Use the aggregated true and predicted labels to compute the overall confusion matrix (`cm = confusion_matrix(all_y_true, all_y_pred)`).

**Example Code**:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Create a sample DataFrame
data = {
    'Feature1': np.random.rand(100),
    'Feature2': np.random.rand(100),
    'Target': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)

# Split the data into features (X) and target (y)
X = df[['Feature1', 'Feature2']]
y = df['Target']

# Initialize KFold with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Initialize lists to store true and predicted values
all_y_true = []
all_y_pred = []

# Perform cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Instantiate and train the logistic regression model
    model = LogisticRegression()
    model.fit(X_train, y_train)
    
    # Predict the target values for the test set
    y_pred = model.predict(X_test)
    
    # Store the true and predicted values
    all_y_true.extend(y_test)
    all_y_pred.extend(y_pred)

# Compute the overall confusion matrix
cm = confusion_matrix(all_y_true, all_y_pred)

print("Overall Confusion Matrix:")
print(cm)
```

---

## Logistic Regression and ROC Curve

**Functions**  

| Function               | Description                                      | Syntax                        |
|------------------------|--------------------------------------------------|-------------------------------|
| `LogisticRegression`   | Instantiates a logistic regression classifier   | `LogisticRegression()`        |
| `fit`                 | Trains the model on training data                | `model.fit(X_train, y_train)` |
| `predict`             | Makes predictions on test data                   | `model.predict(X_test)`       |
| `predict_proba`       | Predicts class probabilities                     | `model.predict_proba(X_test)` |
| `roc_curve`           | Computes ROC curve values                        | `roc_curve(y_test, y_pred_probs)` |
| `roc_auc_score`       | Computes AUC score                               | `roc_auc_score(y_test, y_pred_probs)` |

**Arguments**  

| Argument              | Description                                        | Syntax                           |
|----------------------|--------------------------------------------------|---------------------------------|
| `X_train`           | Training data features                            | `model.fit(X_train, y_train)`  |
| `y_train`           | Training data labels                              | `model.fit(X_train, y_train)`  |
| `X_test`            | Test data features                                | `model.predict(X_test)`        |
| `y_test`            | Test data labels                                  | `roc_curve(y_test, y_pred_probs)` |
| `y_pred_probs`      | Predicted probabilities for the positive class, e.g. `model.predict_proba(X_test)[:, 1]` | `roc_curve(y_test, y_pred_probs)` |
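
The functions in the tables above can be combined roughly as follows. This is a sketch on synthetic data: the dataset from `make_classification`, the 70/30 split, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is the positive class
y_pred_probs = model.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc = roc_auc_score(y_test, y_pred_probs)
print(f"AUC: {auc:.3f}")
```

Plotting `tpr` against `fpr` gives the ROC curve. Note that `roc_curve` and `roc_auc_score` take probabilities (or scores), not the hard labels returned by `predict`.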
