# Travel insurance prediction model

This is the main notebook of the travel insurance prediction model. It consists of 3 parts:
1. Data preparation
2. Model evaluation
3. Saving the trained model

Note: the reasons for the specific data preparation and modelling choices are described in the [Exploratory_analysis.ipynb](./Exploratory_analysis.ipynb) file.

## Data Set
The dataset contains information about customers, including demographics and travel history. The target variable is `TravelInsurance`, which indicates whether a customer purchased travel insurance (1) or not (0). The data can be downloaded from [Kaggle](https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data).

## Goal

- Build a model to predict which customers will buy travel insurance.

## Biases

The data contains some biases that affect model predictions:
- Class imbalance: there are roughly twice as many customers who did not buy insurance as customers who did.
- The data comes from the customers of a tour and travel company, who might differ from the general population and from the customers of other insurance companies.
- The data covers travel insurance only, which might differ from other types of insurance.
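
The class imbalance noted above can be quantified with a quick count. The sketch below uses a small made-up label list rather than the real data; on the actual frame, the equivalent would be `train_df["TravelInsurance"].value_counts(normalize=True)`.

```python
from collections import Counter

# Illustrative labels only (not the real dataset): 1 = bought insurance, 0 = did not.
labels = [0, 0, 1, 0, 1, 0, 0, 1, 0]

counts = Counter(labels)
share_of_buyers = counts[1] / len(labels)
# Number of non-buyers for every buyer.
imbalance_ratio = counts[0] / counts[1]

print(f"buyers: {share_of_buyers:.1%}, non-buyers per buyer: {imbalance_ratio:.1f}")
```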

## Domain knowledge
I am not an insurance expert, but I use statistical and machine learning knowledge to build a model to predict travel insurance purchases.

## Libraries

In [1]:
import pickle
from pathlib import Path

import pandas as pd
import xgboost as xgb
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from utils.evaluations import evaluate_pipeline, print_single_result

## Get Data

In [2]:
train_df = pd.read_csv("data/train_travel_insurance.csv")
val_df = pd.read_csv("data/val_travel_insurance.csv")
test_df = pd.read_csv("data/test_travel_insurance.csv")

## Process Data
In this section we prepare the data for modeling by handling categorical features, missing values, and scaling numerical features.

In [3]:
numerical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(drop="first", sparse_output=False)),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, ["Age", "AnnualIncome", "FamilyMembers"]),
        (
            "cat",
            categorical_transformer,
            [
                "Employment Type",
                "GraduateOrNot",
                "FrequentFlyer",
                "EverTravelledAbroad",
                "ChronicDiseases",
            ],
        ),
    ],
    remainder="drop",
    verbose_feature_names_out=False,
)
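
To make the two branches concrete, the plain-Python sketch below mimics what the numerical and categorical transformers do to a single column each. It is an illustration on made-up values, not the scikit-learn implementation, and the helper names `impute_median_and_scale` and `one_hot_drop_first` are hypothetical.

```python
from statistics import fmean, median, pstdev


def impute_median_and_scale(values):
    """Fill missing entries with the median, then standardise to mean 0, std 1."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    mean = fmean(filled)
    std = pstdev(filled)  # population std, which is what StandardScaler uses
    return [(v - mean) / std for v in filled]


def one_hot_drop_first(values):
    """One-hot encode, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]
    return kept, [[1.0 if v == c else 0.0 for c in kept] for v in values]


# Made-up example values, not rows from the dataset.
ages = [25, None, 31, 34]
scaled_ages = impute_median_and_scale(ages)

kept_categories, encoded = one_hot_drop_first(["No", "Yes", "No", "Yes"])
print(scaled_ages)
print(kept_categories, encoded)
```

With `drop="first"`, a two-valued column such as `FrequentFlyer` ends up as a single indicator column, which avoids carrying a redundant second column.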

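Next, I combine the train and validation sets into a single training set and separate the features from the target. I then do the same split for the test set.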

In [7]:
train_and_val_df = pd.concat([train_df, val_df])
X_train = train_and_val_df.drop("TravelInsurance", axis=1)
y_train = train_and_val_df["TravelInsurance"]

In [8]:
X_test = test_df.drop("TravelInsurance", axis=1)
y_test = test_df["TravelInsurance"]

## Evaluation

In this section I evaluate the model on the test set. The model has not seen this data during training or validation, so the test score is a good indicator of its generalization performance.

In [9]:
full_pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        (
            "classifier",
            xgb.XGBClassifier(
                learning_rate=0.1, max_depth=5, n_estimators=200, random_state=42
            ),
        ),
    ]
)
results = evaluate_pipeline(full_pipeline, X_train, X_test, y_train, y_test)
print_single_result(results)

| Model | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| XGBoost | 0.822742 | 0.846154 | 0.616822 | 0.713514 | 0.787992 |


The model achieves similar performance on the test set as in cross-validation, which is a good sign that it is not overfitting. On the test set, recall and precision are better by 1%.
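
As a sanity check on the reported numbers, the F1 score can be recomputed from the precision and recall in the results above, since F1 is the harmonic mean of the two. This is a small worked calculation, not part of the original evaluation code.

```python
# Precision and recall as reported in the test-set results above.
precision = 0.846154
recall = 0.616822

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))
```

The result agrees with the reported F1 score of 0.713514 up to rounding of the inputs.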

## Save the trained model

In [10]:
Path("models").mkdir(exist_ok=True)
with Path("models/full_pipeline.pkl").open("wb") as f:
    pickle.dump(results["pipeline"], f)
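
A saved pipeline can later be restored with `pickle.load`. The sketch below shows the round trip on a stand-in dictionary written to a temporary directory, so it runs without the trained model. `load_pipeline` is a hypothetical helper, and with the real file the path would be `models/full_pipeline.pkl`. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_pipeline(path):
    """Load a pickled object (here, the fitted pipeline) from disk."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Stand-in for the fitted pipeline so the example is self-contained.
stand_in = {"name": "full_pipeline", "n_estimators": 200}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "full_pipeline.pkl"
    with model_path.open("wb") as f:
        pickle.dump(stand_in, f)
    restored = load_pipeline(model_path)

print(restored == stand_in)
```

Loading the real pipeline also assumes that compatible versions of scikit-learn and XGBoost are installed in the environment that reads the file.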

## Conclusion

The model predicts the likelihood of a travel insurance purchase. On the test set:
- Accuracy: the model correctly classifies 82.3% of customers as buyers or non-buyers. Because the data has class imbalance, precision and recall are more informative.
- Precision: when the model predicts that a customer will buy insurance, it is correct 84.6% of the time.
- Recall: the model identifies 61.7% of all actual insurance buyers.
- AUC: a value of 0.788 indicates good discriminative ability, that is, the model distinguishes between customers who will and will not buy insurance better than random guessing, which scores 0.5.

This model can be particularly useful for targeted marketing campaigns where the focus is on customers who are more likely to buy insurance.
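
One way such a campaign could use the model is to rank customers by predicted purchase probability and contact only the top of the list. The sketch below assumes the probabilities have already been computed (for a fitted scikit-learn pipeline, `predict_proba(X)[:, 1]` gives the positive-class probability). The customer ids, scores, and the `top_k_customers` helper are made up for illustration.

```python
def top_k_customers(scores, k):
    """Return the k customer ids with the highest predicted probability."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [customer_id for customer_id, _ in ranked[:k]]


# Hypothetical predicted probabilities of buying insurance, keyed by customer id.
scores = {"c1": 0.12, "c2": 0.91, "c3": 0.47, "c4": 0.78, "c5": 0.05}

targets = top_k_customers(scores, k=2)
print(targets)
```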

## Improvements
- Use more data.