<center> <img src="https://www.goodreturns.in/img/2016/01/insuranceauto-25-1453703850.jpg" /> </center>


# Introduction

# Problem Statement

VahanBima is one of the leading insurance companies in India. It provides motor vehicle insurance at the best prices with 24/7 claim settlement. It offers different types of policies for both personal and commercial vehicles and has established its brand across different regions of India.

Around 90% of businesses today use personalized services. The company wants to launch different personalized experience programs for customers of VahanBima, such as dedicated resources for claim settlement or different kinds of doorstep services. In order to do so, it would like to segment customers into different tiers based on their customer lifetime value (CLTV).

To do that, it would like to predict the customer lifetime value based on the customer's activity and interaction with the platform. So, as part of this challenge, your task is to build a high-performance, interpretable machine learning model that predicts CLTV from the user and policy data.
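
Once CLTV predictions exist, the tiering step described above could look roughly like the sketch below. It assumes three equally sized tiers built with `pd.qcut`; the predicted values and the tier names are made up for illustration and are not part of the challenge data.

```python
import pandas as pd

# Hypothetical predicted CLTV values for six customers (illustrative numbers only).
predicted_cltv = pd.Series([52000, 61000, 75000, 98000, 150000, 310000])

# Split customers into three equally sized tiers by predicted CLTV.
# The tier names are assumptions, not VahanBima's actual programme names.
tiers = pd.qcut(predicted_cltv, q=3, labels=["Standard", "Priority", "Premium"])

print(pd.DataFrame({"predicted_cltv": predicted_cltv, "tier": tiers}))
```

Quantile-based tiers keep each group the same size; fixed rupee thresholds would be an equally valid choice if the business prefers stable cut-offs.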

# Data Description

You are provided with a sample dataset from the company that holds customer and policy information, such as the user's highest qualification, total income earned in a year, employment status, the policy opted for, the type of policy, and so on, along with the target variable indicating the total customer lifetime value.

## Data Dictionary

You are provided with 3 files - train.csv, test.csv and sample_submission.csv


### Training Data 

You are provided with around 90K records containing the attributes of the user and policy and the target variable cltv indicating the total customer lifetime value.

|Variable       |Description                                                |
| ------------- |:---------------------------------------------------------:|
|id             |Unique identifier of a customer                            |
|gender         |Gender of the customer                                     |
|area           |Area of the customer                                       |
|qualification  |Highest Qualification of the customer                      |
|income         |Income earned in a year (in rupees)                        |
|marital_status |Marital Status of the customer {0:Single, 1: Married}      |
|vintage        |No. of years since the first policy date                   |
|claim_amount   |Total Amount Claimed by the customer (in rupees)           |
|num_policies   |Total no. of policies issued to the customer               |
|policy         |Active policy of the customer                              |
|type_of_policy |Type of active policy                                      |
|cltv           |Customer lifetime value (Target Variable)                  |



# Importing Libraries

In [None]:
# For Numerical Python
import numpy as np

# For Panel Data Analysis
import pandas as pd
from pandas_profiling import ProfileReport
pd.set_option('display.precision', 2)

# For Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# To Disable Warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# For Data Model Development
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
# For Machine Learning Model Evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Importing Data 

In [None]:
data = pd.read_csv("train_BRCpofr.csv")

In [None]:
data.head(10)


In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.duplicated().value_counts()

### Observations

- The data has 89392 observations and no missing entries.
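
The "no missing entries" check can be made explicit rather than read off `data.info()`. The following is a minimal sketch on a made-up toy frame (the values are not from the real dataset); the same two calls would apply to `data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data; values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "claim_amount": [5000, np.nan, 0, 0],
    "vintage": [5, 1, 7, 7],
})

missing_per_column = toy.isna().sum()    # count of NaN values in each column
duplicate_rows = toy.duplicated().sum()  # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```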

In [None]:
data['income'].value_counts()


In [None]:
data['qualification'].value_counts()

In [None]:
data['area'].value_counts()

In [None]:
data['vintage'].value_counts()

In [None]:
data['num_policies'].value_counts()

In [None]:
data['policy'].value_counts()

In [None]:
data['type_of_policy'].value_counts()

In [None]:
sns.displot(data.vintage)
plt.axvline(data.vintage.mean(), color='r')
plt.axvline(data.vintage.median(), color='m')
plt.show()

In [None]:
#from pandas_profiling import ProfileReport 
#profile = ProfileReport(df=data)
#profile.to_file(output_file='Pre Profiling Report.html')
#print('Accomplished!')

### Observations
 
- There are 89392 observations and 12 columns in the data, with **no missing** values.
- Of the 12 columns, **8 are categorical and 4 are numerical**.
- The Male-Female ratio among customers is 56-44.
- The **majority (70%)** of customers are from the **Urban** `area`.
- **59%** of customers are in the **5L-10L income** slab, followed by **23% in the 2L-5L** slab.
- **57%** of customers are married.
- **20%** of customers have **claimed nothing**.
- The **claim amount** column has a **positive skewness** of 1. The 95th percentile is 10078 and the maximum value is 31894, suggesting that **there are some outliers**.
- **67.4%** of customers have **more than 1 policy**.
- **63.4%** of customers have **policy A**, followed by **27.6%** with **policy B**.
- The **Platinum policy type** has the **largest share (53.5%)** in the dataset. The other two types are equally distributed.
- `CLTV` has a **mean of 97952** and a **median of 66396**.
    - The skewness is **2.75**, suggesting the data is skewed to the right.
    - The 95th percentile is 307265 and the maximum value is 724068.
- There is some **correlation** between
    - **Area and claim amount**.
    - **CLTV and No. of policies**.
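
The skewness and percentile figures quoted above are the kind of summary pandas produces directly. As a rough illustration on made-up numbers (not the real `cltv` column), a right-skewed series shows a mean above its median and a positive skewness:

```python
import pandas as pd

# Made-up right-skewed values: most are moderate, one is very large.
values = pd.Series([40, 50, 55, 60, 65, 70, 80, 400])

print("Mean:", values.mean())
print("Median:", values.median())
print("Skewness:", values.skew())
print("95th percentile:", values.quantile(0.95))
print("Max:", values.max())
```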
 

In [None]:
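# Correlation heatmap of the numerical columns only; corr() cannot use the text columns.
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='viridis')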

## Relation between CLTV and claim amount

## Vintage, CLTV and policy


- **20%** of customers took their **first policy 6 years ago.**

In [None]:
def creat(data):
    # Average claim amount per year of vintage.
    # Rows with vintage == 0 get 0 instead of a division by zero.
    safe_vintage = data['vintage'].where(data['vintage'] != 0)
    return (data['claim_amount'] / safe_vintage).fillna(0)


In [None]:
data['claim_per_year'] = creat(data)

In [None]:
data.head(10)

<a id = Section7></a>
## **7. Post Data Processing & Analysis**

<a id = Section701></a>
### **7.1 Encoding Categorical Data**

In [None]:
# Creating dummy variables for all categorical columns (the first level of each is dropped)
data = pd.get_dummies(data, drop_first=True)
data.head()
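
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy frame. The category values are illustrative only; the point is that each categorical column loses its first level, which becomes the implicit baseline.

```python
import pandas as pd

# Toy frame; the category values are illustrative.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "area": ["Urban", "Rural", "Urban"],
    "vintage": [5, 1, 7],
})

# Each text column becomes one indicator per level, minus the first level.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here `gender_Female` and `area_Rural` are not created: a row with `gender_Male` equal to 0 is understood to be Female.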

<a id = Section704></a>
### **7.2 Data Splitting**

- Now, we will **split** the dataset into **Train** and **Test** subsets.

- We will use **80%** of the data for **training** and the remaining **20%** for **testing** our models.

In [None]:
# Creating the feature matrix by removing the target variable
X = data.drop(['cltv','id'], axis=1)
X.head()


In [None]:
# Creating the target vector
y = data['cltv']
y.head()

In [None]:
s = MinMaxScaler()
sc = s.fit_transform(X[['claim_amount','claim_per_year']])
sc = pd.DataFrame(sc ,columns=['claim_amount','claim_per_year'])

In [None]:
# Replacing the raw claim columns with their scaled versions.
X = X.drop(labels=['claim_amount', 'claim_per_year'], axis=1)
X = pd.concat(objs=[X, sc], axis=1)

In [None]:
X.head()
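
For reference, `MinMaxScaler` with its default settings maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below checks that on made-up claim amounts; the numbers are assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up claim amounts in rupees.
claims = np.array([[0.0], [2500.0], [10000.0]])

scaled = MinMaxScaler().fit_transform(claims)

# The same result computed by hand: (x - min) / (max - min).
manual = (claims - claims.min()) / (claims.max() - claims.min())

print(scaled.ravel())
```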

In [None]:
# Using scikit-learn's train_test_split function to split the dataset into train and test sets.
# 80% of the data will be in the train set and 20% in the test set, as specified by test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
# Checking the shapes of the training and test sets.
print('Training Data Shape:', X_train.shape, y_train.shape)
print('Testing Data Shape:', X_test.shape, y_test.shape)

<a id = Section8></a>
## **8. Model Development & Evaluation**

- In this section, we will be **building** our Machine Learning models and fitting them with the training data.

- We will be building models using:

  - **All** the **features** of the training set.

  - The most **important features** of the training set, according to the Random Forest algorithm.

- We will use **K-fold Cross Validation** to validate our models and select the best one.

- We are creating a **helper function** `display_scores` that will help us in **displaying** our *K-fold cross validation* **scores**.

In [None]:
# A helper function to display the scores along with the mean and standard deviation of scores.
def display_scores(scores):
    scores_rmse = np.sqrt(-scores)
    print('Scores:', scores_rmse)
    print('Mean:', scores_rmse.mean())
    print('Standard Deviation:', scores_rmse.std())
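
The sign flip inside `display_scores` exists because scikit-learn scorers follow a "higher is better" convention, so `neg_mean_squared_error` returns the MSE negated. A minimal sketch on synthetic data (`make_regression` stands in for the real features and target) shows the conversion to RMSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the real features and target.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negated MSE value per fold; all values are zero or below.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")

# Flip the sign, then take the square root to get RMSE in the target's units.
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold, rmse_per_fold.mean())
```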

<a id = Section801></a>
### **8.1 Baseline Models**

- In the baseline models, we will be using **all** the **features** of the dataset in our models.

- We will be performing **5-fold** cross-validation to **validate** our models.

<a id = Section80101></a>
#### **8.1.1 Linear Regression Model**

In [None]:
base_lr = LinearRegression()

In [None]:
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=base_lr, X=X_train, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

<a id = Section80102></a>
#### **8.1.2 Decision Tree Model**

In [None]:
base_dt = DecisionTreeRegressor(random_state=0)

In [None]:
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=base_dt, X=X_train, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

<a id = Section80103></a>
#### **8.1.3 Random Forest Model**

In [None]:
# Creating a Random Forest model.
base_rf = RandomForestRegressor(n_estimators=10, random_state=0, n_jobs=-1)

In [None]:
%%time
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=base_rf, X=X_train, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

#### **Checking Feature Importances**

In [None]:
# Fitting the baseline Random Forest model on the entire train set to obtain the feature importances of each feature. 
base_rf.fit(X_train, y_train)

In [None]:
# Checking the feature importances of the various features.
# Sorting the importances in descending order (lowest importance at the bottom).
for score, name in sorted(zip(base_rf.feature_importances_, X_train.columns), reverse=True):
    print('Feature importance of', name, ':', score*100, '%')

In [None]:
# Plotting the Feature Importance of each feature.
plt.figure(figsize=(12, 7))
plt.bar(X_train.columns, base_rf.feature_importances_*100, color='green')
plt.xlabel('Features', fontsize=14)
plt.ylabel('Importance', fontsize=14)
plt.xticks(rotation=90)
plt.title('Feature Importance of each Feature', fontsize=16)

<a id = Section802></a>
### **8.2 Essential Feature Models**

- In the essential feature models, we will be using only the **most important features** of the dataset in our models.

- The features are selected on the basis of the **feature importance** obtained from the Random Forest model.

- We will be performing **5-fold** cross-validation to **validate** our models.
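
The selection described above can also be done programmatically instead of typing column names by hand. The following is only a sketch on synthetic data: the column names `f0` to `f7` and the cut-off `k = 3` are assumptions chosen for illustration, not the choices made in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; column names f0..f7 are placeholders.
X_arr, y_demo = make_regression(n_samples=300, n_features=8, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Rank features by importance and keep the top k (k=3 is an arbitrary choice here).
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
top_features = importances.sort_values(ascending=False).head(3).index.tolist()

X_demo_essential = X_demo[top_features]
print(top_features)
```

Building the list from `feature_importances_` avoids typos and accidental duplicates in a hand-written column list.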

In [None]:
X_train_essential = X_train[['claim_amount', 'num_policies_More than 1', 'vintage', 'qualification_High School','qualification_High School', 'type_of_policy_Platinum', 'gender_Male', 'income_5L-10L','type_of_policy_Silver', 'income_More than 10L','claim_per_year' ]]
X_train_essential.head()

<a id = Section80201></a>
#### **8.2.1 Linear Regression Model**

In [None]:
essential_lr = LinearRegression()

In [None]:
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=essential_lr, X=X_train_essential, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

<a id = Section80202></a>
#### **8.2.2 Decision Tree Model**

In [None]:
essential_dt = DecisionTreeRegressor(random_state=0)

In [None]:
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=essential_dt, X=X_train_essential, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

<a id = Section80203></a>
#### **8.2.3 Random Forest Model**

In [None]:
# Creating a Random Forest model.
essential_rf = RandomForestRegressor(n_estimators=10, random_state=0, n_jobs=-1)

In [None]:
%%time
# Performing K-fold Cross-validation for 5 folds.
scores = cross_val_score(estimator=essential_rf, X=X_train_essential, y=y_train, cv=5, scoring='neg_mean_squared_error')

In [None]:
# Displaying the RMSE scores with the display_scores helper function.
display_scores(scores)

**Observations:**

- The mean **RMSE** score for the Essential Feature Random Forest Model is **3686.24**.

- Our model has improved even though we are using only a **subset** (i.e. 10 distinct features) of the features from the entire dataset.

- It took **53 seconds** to perform 5-fold cross-validation on our Random Forest model having 10 trees.

- The **training time** has **reduced** significantly and the **performance** has **improved**.

- The RMSE is still significantly **lower** than that of the Decision Tree model.

#### **Model Comparison**

**Baseline Models**

| Model | RMSE Score |
| :--: | :--: |
| **Linear Regression** | **21612.39** |
| **Decision Tree** | **5322.68** |
| **Random Forest** | **3997.32** |

<br>

**Essential Feature Models**

| Model | RMSE Score |
| :--: | :--: |
| **Linear Regression** | **21669.89** |
| **Decision Tree** | **4644.98** |
| **Random Forest** | **3686.24** |
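
To put the two tables side by side, the relative change in RMSE per model can be computed from the reported scores. This is a small worked calculation using only the numbers listed above; a negative percentage means the essential-feature version scored slightly worse.

```python
# RMSE values copied from the two tables above.
baseline = {"Linear Regression": 21612.39, "Decision Tree": 5322.68, "Random Forest": 3997.32}
essential = {"Linear Regression": 21669.89, "Decision Tree": 4644.98, "Random Forest": 3686.24}

# Percentage reduction in RMSE when moving from all features to the essential subset.
# A negative value means the RMSE got worse.
reduction_pct = {
    model: 100 * (baseline[model] - essential[model]) / baseline[model]
    for model in baseline
}

for model, pct in reduction_pct.items():
    print(f"{model}: {pct:.2f}%")
```

On these figures the tree-based models benefit from the smaller feature set, while Linear Regression is essentially unchanged.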

<a id = Section803></a>
### **8.3 Hyperparameter Tuning of Model**

In [None]:
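# max_features=1.0 uses all features; the equivalent 'auto' option was removed in scikit-learn 1.3.
param_grid = [{'n_estimators': [60, 70, 80, 90], 'max_depth': [5, 7, 9], 'max_features': [1.0, 2, 4, 6, 8]}]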

In [None]:
temp_rf = RandomForestRegressor(random_state=0, n_jobs=-1)

In [None]:
grid_search = GridSearchCV(estimator=temp_rf, param_grid=param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)

In [None]:
%%time
grid_search.fit(X_train_essential, y_train)

In [None]:
# Calculating the best RMSE score found by Grid Search 
np.sqrt(-grid_search.best_score_)

In [None]:
# The hyperparameter values which provide us the best RMSE score
grid_search.best_params_
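
It helps to know how much work a grid like this implies before running it. The sketch below counts the candidates with `ParameterGrid`; `demo_grid` is a stand-in name with the same shape as the grid above, where `1.0` means "use all features" (older scikit-learn versions spelled this `'auto'` for regressors).

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid above; demo_grid is a stand-in name for illustration.
demo_grid = [{"n_estimators": [60, 70, 80, 90],
              "max_depth": [5, 7, 9],
              "max_features": [1.0, 2, 4, 6, 8]}]

n_candidates = len(ParameterGrid(demo_grid))  # every combination of the listed values
n_folds = 5

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_candidates * n_folds)
```

With its default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training data after the search.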

<a id = Section804></a>
### **8.4 Final Model**

- We found out the **best hyperparameter combinations** for our Random Forest model.

- Now, we will use the model with those hyperparameters as our **final model**.

- Using this final model, we will make **predictions** on our test set.

**Creating the Final Model**

In [None]:
# Creating the final random forest model from the grid search's best estimator.
final_rf = grid_search.best_estimator_

**Fitting the Final Model**

In [None]:
# Fitting the final model with training set
final_rf.fit(X_train_essential, y_train)

- After fitting the final model with the training data, we are ready to make **predictions** on the test set. 

**Removing Non-Essential Features from the Test Set**

- We trained our model on only the most important features of the dataset.

- So, we need to **remove** the **non-important features** from our test set as well.

- If we don't remove the non-essential features our model will give an **error** while making predictions due to the **difference in shapes** of training and testing sets.
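
One way to guard against such mismatches is to align any new frame to the training columns explicitly. The sketch below uses `DataFrame.reindex` on toy data; the column names are placeholders, and filling absent dummy columns with 0 is an assumption that only suits one-hot indicator columns.

```python
import pandas as pd

# Placeholder column names; the real ones come from the training feature frame.
train_columns = ["claim_amount", "vintage", "gender_Male"]

new_data = pd.DataFrame({
    "vintage": [3, 6],
    "claim_amount": [0.2, 0.7],
    "area_Urban": [1, 0],  # extra column the model never saw
})

# Keep only the training columns, in training order; fill absent dummies with 0.
aligned = new_data.reindex(columns=train_columns, fill_value=0)
print(aligned)
```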

In [None]:
# Creating the test set with only the essential features, in the same order as the training set
X_test_essential = X_test[X_train_essential.columns]
X_test_essential.head()

**Making Predictions**

- Now, we will make **predictions** on both our training and testing sets.

In [None]:
# Making predictions on the train set
y_train_pred = final_rf.predict(X_train_essential)

In [None]:
# Making predictions on the test set
y_test_pred = final_rf.predict(X_test_essential)

In [None]:
pd.DataFrame({'Actual Test Set Values': y_test[0:5].values, 'Predicted Test Set Values': y_test_pred[0:5]})

**Calculating the RMSE Score**

In [None]:
# Estimating RMSE on Train & Test Data
print('RMSE for Train Set:', np.round(np.sqrt(mean_squared_error(y_train, y_train_pred)), decimals=2))
print('RMSE for Test Set:', np.round(np.sqrt(mean_squared_error(y_test, y_test_pred)), decimals=2))

**Calculating R-Squared Value**

In [None]:
# Estimating R-Squared on Train & Test Data
print('R-Squared for Train Set:', np.round(r2_score(y_train, y_train_pred), decimals=2))
print('R-Squared for Test Set:', np.round(r2_score(y_test, y_test_pred), decimals=2))
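
For readers who want to see what these two metrics compute, here is a short sketch that reproduces them by hand on made-up values and compares the result with scikit-learn. The arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 230.0])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the units of the target (rupees for CLTV), whereas R-squared is unitless, which is why the two are usually reported together.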

In [None]:
# Creating a helper function to plot the actual and predicted values for train and test sets.
def plot_score(y_train, y_train_pred, y_test, y_test_pred):
  '''
  Plot actual and predicted values for train & test data
  y_train: actual y_train values
  y_train_pred: predicted values of y_train
  y_test: actual y_test values
  y_test_pred: predicted values of y_test
  '''
  plt.figure(figsize=[16, 6])
  plt.subplot(1, 2, 1)
  sns.regplot(x=y_train, y=y_train_pred, color='red')
  plt.xlabel('Actual', size=14)
  plt.ylabel('Predicted', size=14)
  plt.title('For Train Data', size=16)

  plt.subplot(1, 2, 2)
  sns.regplot(x=y_test, y=y_test_pred, color='green')
  plt.xlabel('Actual', size=14)
  plt.ylabel('Predicted', size=14)
  plt.title('For Test Data', size=16)
  plt.show()

In [None]:
# Plotting Actual vs Predicted Values
# This will take some time
plot_score(y_train, y_train_pred, y_test, y_test_pred)

<a id = Section9></a>
## **9. Conclusion**

In [None]:
# Loading the competition test file (it has no cltv column) and repeating the training preprocessing.
test_data = pd.read_csv("test_koRSKBP.csv")
test_data['claim_per_year'] = creat(test_data)
test_data = pd.get_dummies(test_data, drop_first=True)
test_data = test_data.drop(['id'], axis=1)

# Scaling with the MinMaxScaler already fitted on the training data (transform only, no refit).
test_data[['claim_amount', 'claim_per_year']] = s.transform(test_data[['claim_amount', 'claim_per_year']])

In [None]:
# Predicting CLTV using the same columns, in the same order, as the training features.
y_test_pred = final_rf.predict(test_data[X_train_essential.columns])

In [None]:
print(y_test_pred)