[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ktnspr/py4ac/blob/main/05_python_car_insurance_classification_exercise.ipynb) [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/ktnspr/py4ac/blob/main/05_python_car_insurance_classification_exercise.ipynb)

# Exercise – Predicting Car‑Insurance Claims  

Welcome to the practical part of our **Visualization and Machine Learning course**.  
In this notebook you will guide a small insurance company through the complete **classification workflow**:  
* exploratory data analysis (EDA)  
* data cleaning and feature engineering  
* model training & evaluation  
* first steps towards model interpretability  

💡 **How to work with the notebook**  
* Every upcoming code cell is announced by a markdown block (like this one) that explains what needs to be done.
* Feel free to experiment – the dataset is small and reloads in a second.

## 1 – Import the required libraries  

The next code cell should collect **all library imports** you will need later, including:  

* **Pandas** – data handling  
* **Seaborn / Matplotlib** – quick visualisations  
* **Scikit‑learn** – preprocessing, model selection and metrics  

👉 **Your task:** Run the cell and make sure that all imports succeed.

In [None]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
)
from sklearn.dummy import DummyClassifier
import matplotlib.pyplot as plt
import kagglehub
from pathlib import Path

# Enable nicer plots inside the notebook
%matplotlib inline


## 2 – Load the dataset  

Our CSV file contains one row per customer and the target column **`OUTCOME`** that indicates whether the customer filed a claim last year.  

👉 **Your task:**  
1. Read the CSV (hint: `pd.read_csv`).  
2. Show the first five rows – this immediately reveals spelling issues in column names.  
3. Print the shape so you know how many observations you have to play with.  
4. (Optional) Adjust the path if the dataset is in a different folder.
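
A minimal sketch of steps 1 to 3 on a made-up three-row CSV. The rows and the in-memory `io.StringIO` buffer are illustrative assumptions so that the snippet runs without the download; the real file has more columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (values are invented)
csv_text = """ID,AGE,CREDIT_SCORE,OUTCOME
1,26-39,0.63,0.0
2,16-25,,1.0
3,40-64,0.49,0.0
"""

# pd.read_csv accepts file paths as well as file-like objects
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.head())        # first rows (at most five)
print(toy_df.shape)         # (number of rows, number of columns)
print(toy_df.isna().sum())  # missing values per column
```

The `isna().sum()` line is an optional extra; it already hints at which columns will need imputation later.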

In [None]:
# 2 – Load data
path = kagglehub.dataset_download("sagnik1511/car-insurance-data")
print("Path to dataset:", path)
df = pd.read_csv(Path(path)/'Car_Insurance_Claim.csv')


## 3 – Quick exploratory analysis (EDA)  

Before jumping into modelling we **look at the raw data**. This helps us to spot data‑quality issues and build an intuition for useful features.  

👉 **Your tasks:**  
1. **Class balance:** Create a count plot of `OUTCOME` with `sns.countplot`. How imbalanced is the target?
2. **Outcome by age:** Visualise the distribution of claims across age groups; a stacked or grouped bar chart works well.  
3. **Feature correlations:** Compute a correlation matrix for the **numeric columns** and show it as a heat‑map with `sns.heatmap`. Select the numeric features with the **pandas method** `df.select_dtypes(include=['int64', 'float64'])`, then look for highly correlated features that could be removed later.
4. **Credit‑score vs outcome:** Use `sns.boxplot` to check whether lower credit scores go along with more claims.  

*(Tip: keep the figures small – `figsize=(6,4)` is usually enough.)*
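
Each of the four plots visualises a simple table. The sketch below computes those tables on a made-up six-row frame (all values are invented for illustration), which can help when checking whether a plot shows what you expect.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "AGE": ["16-25", "16-25", "26-39", "26-39", "40-64", "40-64"],
    "CREDIT_SCORE": [0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
    "OUTCOME": [1, 1, 1, 0, 0, 0],
})

# 1) Class balance: the numbers behind a count plot
class_share = toy["OUTCOME"].value_counts(normalize=True)

# 2) Outcome by age: share of each outcome within every age group
claims_by_age = pd.crosstab(toy["AGE"], toy["OUTCOME"], normalize="index")

# 3) Correlation matrix of the numeric columns (input for a heat-map)
corr = toy.select_dtypes(include=["int64", "float64"]).corr()

# 4) Median credit score per class (the centre line of a box plot)
score_by_class = toy.groupby("OUTCOME")["CREDIT_SCORE"].median()

print(class_share, claims_by_age, corr, score_by_class, sep="\n\n")
```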

In [None]:
# 3 – Exploratory plots
# --- Class balance, variable OUTCOME
...

# --- AGE vs OUTCOME
...

# --- Correlation heat‑map (numeric features only)
...

# --- CREDIT_SCORE distribution per class
...


## 4 – Separate feature matrix **X** and target **y**  

Machine‑learning APIs expect **`X`** (all predictors) and **`y`** (label) as separate objects. We also drop non‑informative identifier columns such as `ID` and `POSTAL_CODE`.  

👉 **Your tasks:**  
1. Assign `df["OUTCOME"]` to `y`.  
2. Create `X` by dropping `OUTCOME`, `ID`, and `POSTAL_CODE`.  
3. Double‑check: `X.shape[0]` should equal `y.shape[0]`. 
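
As an illustration of the pattern, here is a sketch on a made-up three-row frame (values invented). `DataFrame.drop` returns a new object, so the original frame stays untouched.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "ID": [101, 102, 103],
    "POSTAL_CODE": [10238, 32765, 10238],
    "CREDIT_SCORE": [0.63, 0.36, 0.49],
    "OUTCOME": [0, 1, 0],
})

y_toy = toy["OUTCOME"]                                      # label as a Series
X_toy = toy.drop(columns=["OUTCOME", "ID", "POSTAL_CODE"])  # predictors only

print(X_toy.columns.tolist())
print(X_toy.shape, y_toy.shape)
```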

In [None]:
# 4 – Target and features
y = ...
X = ...

# check that shape of X and y match
assert X.shape[0] == y.shape[0], "Mismatch between number of samples in X and y"

## 5 – Detect categorical vs numerical columns  

We have to treat **categorical** and **numerical** features differently (dummy‑encoding vs. scaling).  
The **pandas method** `select_dtypes` helps us identify the relevant columns. Since you only want the column names, use `.columns.tolist()`.

👉 **Your tasks:**  
1. Build a list `cat_cols` containing all `object` columns.  
2. Build a list `num_cols` containing `int64` and `float64` columns.  
3. Print both lists – are the column names plausible?
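
A small sketch of `select_dtypes` on a made-up frame. The column values are invented, and the text columns are created with an explicit `object` dtype so that the snippet behaves the same across pandas versions.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "AGE": pd.Series(["16-25", "40-64"], dtype="object"),
    "VEHICLE_TYPE": pd.Series(["sedan", "sports car"], dtype="object"),
    "CREDIT_SCORE": [0.36, 0.63],  # float64
    "CHILDREN": [0, 1],            # int64
})

# select_dtypes returns a DataFrame; .columns.tolist() keeps only the names
toy_cat_cols = toy.select_dtypes(include=["object"]).columns.tolist()
toy_num_cols = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

print("categorical:", toy_cat_cols)
print("numerical:  ", toy_num_cols)
```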


In [None]:
# 5 – Column type detection
cat_cols = ...
num_cols = ...


## 6 – Pre‑processing pipeline  

We perform three steps in sequence:  
1. **One‑hot encoding** of the categorical columns with `pd.get_dummies`. Pass `columns=cat_cols` (the list created in section 5) and `drop_first=True` to avoid perfect multicollinearity.  
2. **Missing‑value imputation** – replacing NA with the **most frequent value** of each column is a decent default for this dataset. Use `SimpleImputer` with `strategy="most_frequent"`. By default `fit_transform` returns a NumPy array, so wrap the result in a `pd.DataFrame` with the original column names before moving on.
3. **Standardization** of the numerical columns so that gradient‑based learners (like an MLP) converge faster. Use `StandardScaler` and its `fit_transform` method on the columns listed in `num_cols`.

👉 **Your tasks:** Implement each step in the given order. After scaling, print `X_encoded.head()` to verify the transformation.
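
The three steps can be sketched on a made-up four-row frame as follows. The data and the `toy_*` names are invented, and `dtype=float` in `pd.get_dummies` is an optional choice made here so that every column is numeric before imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "VEHICLE_TYPE": ["sedan", "sports car", "sedan", "sedan"],
    "CREDIT_SCORE": [0.2, np.nan, 0.4, 0.4],
    "ANNUAL_MILEAGE": [10000.0, 12000.0, 14000.0, 12000.0],
})
toy_cat, toy_num = ["VEHICLE_TYPE"], ["CREDIT_SCORE", "ANNUAL_MILEAGE"]

# Step 1: one-hot encode; drop_first removes one redundant dummy per feature
encoded = pd.get_dummies(toy, columns=toy_cat, drop_first=True, dtype=float)

# Step 2: impute; fit_transform returns an array, so rebuild the DataFrame
imputer = SimpleImputer(strategy="most_frequent")
encoded = pd.DataFrame(
    imputer.fit_transform(encoded),
    columns=encoded.columns,
    index=encoded.index,
)

# Step 3: scale the numerical columns to zero mean and unit variance
scaler = StandardScaler()
encoded[toy_num] = scaler.fit_transform(encoded[toy_num])

print(encoded.head())
```

Fitting the imputer and the scaler on the full data, as this exercise does, keeps the workflow simple. Fitting them on the training split only is the stricter alternative, because no information from the test set can then leak into preprocessing.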

In [None]:
# 6 – One‑hot encoding of all categorical features
X_encoded = ...

# 6b – Impute missing values
# (fit_transform returns a NumPy array by default; wrap it in a DataFrame to keep the column names)
imputer = SimpleImputer(strategy="most_frequent")
X_encoded = ...

# 6c – Standardise numerical features
scaler = StandardScaler()
X_encoded[num_cols] = ...


## 7 – Train / test split  

To evaluate our models we keep **20 %** of the data as an **unseen hold‑out set**. Use `train_test_split` with a fixed `random_state` so that your results are reproducible.  

👉 **Your tasks:**  
* Call `train_test_split` with `test_size=0.2` and `random_state=42`.  
* Examine the class ratio in `y_train` and `y_test` – do they match the original distribution?
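
A sketch on made-up data with a 30 % positive class. The second call adds `stratify`, which is an optional extra that the task does not require; it keeps the class ratio identical in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 30 positives, 70 negatives
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 30 + [0] * 70, name="OUTCOME")

# Plain random split as described in the task
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Optional variant: stratify on the label to preserve the class ratio
Xs_tr, Xs_te, ys_tr, ys_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("random split:    ", y_te.value_counts(normalize=True).to_dict())
print("stratified split:", ys_te.value_counts(normalize=True).to_dict())
```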

In [None]:
# 7 – Train/test split
X_train, X_test, y_train, y_test = ...

# Class ratios in train and test sets
print("Class ratios in training set:")
print(y_train.value_counts(normalize=True))
print("\nClass ratios in test set:")
print(y_test.value_counts(normalize=True))


## 8 – Configure the models  

We will compare three real classifiers against a **baseline DummyClassifier**:  

| Model | Key hyper‑parameters | Strength | Weakness |
|-------|----------------------|----------|----------|
| Naïve Bayes | default | works on small data | assumes feature independence |
| Decision Tree | `max_depth=5` | interpretable & non‑linear | prone to overfitting |
| MLP | `hidden_layer_sizes=(50,)` (one hidden layer with 50 neurons) | captures complex patterns | needs scaling + more compute |
| Dummy | `strategy="most_frequent"` | baseline for comparison | ignores all features |

👉 **Your tasks:** Instantiate all four models in a dictionary called `models`. Feel free to tweak `max_depth`, `hidden_layer_sizes`, or `max_iter` later.
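
Hyper‑parameters are passed as keyword arguments to the constructor and can be read back with `get_params`. In the sketch below, `random_state=42` and `max_iter=500` are illustrative assumptions, not values prescribed by the exercise.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# random_state makes repeated runs comparable (assumed value)
tree = DecisionTreeClassifier(max_depth=5, random_state=42)

# hidden_layer_sizes is a tuple with one entry per hidden layer
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42)

print(tree.get_params()["max_depth"])
print(mlp.get_params()["hidden_layer_sizes"])
```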

In [None]:
# 8 – Model dictionary
models = {
    "Naive Bayes": ...,  # GaussianNB()
    "Decision Tree": ...,  # DecisionTreeClassifier(max_depth=5)
    "MLP": ...,  # MLPClassifier(hidden_layer_sizes=(50,))
    "Dummy": DummyClassifier(strategy="most_frequent"),
}


## 9 – Train, predict, collect metrics  

For each model:  
1. Fit on the training data: `model.fit`
2. Predict the test set: `model.predict`
3. Store **accuracy** (`accuracy_score`) and the macro‑averaged **precision** (`precision_score`), **recall** (`recall_score`) and **F1** (`f1_score`) as one dictionary per model in the `results` list. Always pass `average="macro"`.

👉 **Your tasks:** Fill the loop and ensure each metric lands in the `results` list.
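
The fit/predict/score pattern can be sketched on synthetic data. `make_classification`, the two toy models and all `toy_*` names are assumptions for illustration and are not part of the exercise data. `zero_division=0` is an optional argument that silences the warning raised when a class is never predicted.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy problem (about 70 % / 30 %)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=0
)
Xa, Xb, ya, yb = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

toy_models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Dummy": DummyClassifier(strategy="most_frequent"),
}

toy_results = []
for name, model in toy_models.items():
    model.fit(Xa, ya)          # learn from the training split
    pred = model.predict(Xb)   # predict the held-out split
    toy_results.append({
        "Model": name,
        "Accuracy": accuracy_score(yb, pred),
        "Precision (macro)": precision_score(yb, pred, average="macro", zero_division=0),
        "Recall (macro)": recall_score(yb, pred, average="macro"),
        "F1-Score (macro)": f1_score(yb, pred, average="macro", zero_division=0),
    })

for row in toy_results:
    print(row)
```

The Dummy baseline always predicts the majority class, so its macro recall is exactly 0.5: perfect recall on one class and zero on the other. This is why macro averages are more telling than accuracy on imbalanced data.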

In [None]:
# 9 – Loop over models
results = []
for name, model in models.items():
    print(f"\n=== {name} ===")
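    ...  # fit the model on the training data
    y_pred = ...  # predict the test set

    # zero_division=0 avoids warnings when a class is never predicted (e.g. by the Dummy baseline)
    results.append({
        "Model": name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision (macro)": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "Recall (macro)": recall_score(y_test, y_pred, average="macro"),
        "F1-Score (macro)": f1_score(y_test, y_pred, average="macro", zero_division=0),
    })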


## 10 – Compare the results  

Turn `results` into a DataFrame and round to three decimals so differences are visible. The Dummy baseline shows how far a trivial guess gets us – our models should beat it.  

👉 **Your tasks:** Just display the DataFrame. You might consider using `results_df.plot.bar(x="Model", y="Accuracy")` for a quick visual comparison.
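
A sketch of the table handling with placeholder numbers. The two rows below are invented and are not results of this exercise. Sorting by accuracy and subtracting the baseline makes the margin over the trivial guess explicit.

```python
import pandas as pd

# Placeholder metrics, invented for illustration only
toy_results = [
    {"Model": "Dummy", "Accuracy": 0.68812, "F1-Score (macro)": 0.40761},
    {"Model": "Model A", "Accuracy": 0.81234, "F1-Score (macro)": 0.74567},
]

toy_table = (
    pd.DataFrame(toy_results)
    .round(3)
    .sort_values("Accuracy", ascending=False)
    .set_index("Model")
)

# Accuracy gain of every model relative to the Dummy baseline
gain = toy_table["Accuracy"] - toy_table.loc["Dummy", "Accuracy"]

print(toy_table)
print(gain.round(3))
```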

In [None]:
# 10 – Results table
results_df = pd.DataFrame(results)
display(results_df.round(3))

## 11 – Inspect Decision‑Tree feature importance  

Tree‑based models come with built‑in feature importance scores. For each feature, the score is the total impurity reduction (Gini impurity by default) achieved by all splits on that feature, normalised so that the scores sum to 1.  

👉 **Your tasks:**  
1. Extract `feature_importances_` from the trained tree.  
2. Build a DataFrame with columns `Feature` and `Importance`.  
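3. Plot a horizontal bar chart with the most important feature on top so that the most influential features are easy to spot. Sorting in ascending order achieves this, because `plt.barh` draws the first row at the bottom.  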
4. Reflect: Does the ranking align with your domain intuition from section 3?
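
Steps 1 and 2 can be sketched on synthetic data. `make_classification` and the `feature_i` names are assumptions for illustration. The printed table is sorted in ascending order, which is the order `plt.barh` needs to show the top feature at the top of the chart.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy problem: 2 informative features, 3 pure-noise features
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X_syn.shape[1])]  # hypothetical names

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_syn, y_syn)

toy_importance = (
    pd.DataFrame({
        "Feature": feature_names,
        "Importance": toy_tree.feature_importances_,  # normalised, sums to 1
    })
    .sort_values(by="Importance", ascending=True)
)

print(toy_importance)
print("sum of importances:", toy_importance["Importance"].sum())
```

Features that the tree never splits on receive an importance of exactly 0, which is common for shallow trees.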

In [None]:
# 11 – Feature importance plot
if isinstance(models["Decision Tree"], DecisionTreeClassifier):
    feature_importances = models["Decision Tree"].feature_importances_
    importance_df = (
        pd.DataFrame({
            "Feature": X_encoded.columns,
            "Importance": feature_importances
        })
        .sort_values(by="Importance", ascending=True)
    )

    plt.figure(figsize=(10, 6))
    plt.barh(importance_df["Feature"], importance_df["Importance"], color='skyblue')
    plt.xlabel("Importance")
    plt.title("Feature importance – Decision Tree")
    plt.tight_layout()
    plt.show()
