# Predict the claims

As per the problem statement, we need to analyze the available data and predict whether to sanction the insurance claim or not. This makes it a binary classification problem. The dataset has the following features:

- Target: Claim Status (Claim)
- Name of agency (Agency)
- Type of travel insurance agencies (Agency.Type)
- Distribution channel of travel insurance agencies (Distribution.Channel)
- Name of the travel insurance products (Product.Name)
- Duration of travel (Duration)
- Destination of travel (Destination)
- Amount of sales of travel insurance policies (Net.Sales)
- The commission received for travel insurance agency (Commission)
- Gender of insured (Gender)
- Age of insured (Age)

The ML pipeline we will use comprises the following steps:

1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Modeling and Tuning

## Data Analysis

In the following cells, I will analyse the dataset and walk through the different aspects of the analysis for each variable.

Let's go ahead and load the dataset.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# path for dataset
path = r'D:\Work\DS_ML\hackathon\greyatom_hack\data\train.csv'

# to display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

# creating a dataframe
data = pd.read_csv(path)

# visualize the data frame
data.head()

,ID,Agency,Agency Type,Distribution Channel,Product Name,Claim,Duration,Destination,Net Sales,Commision (in value),Gender,Age
0,3433,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,0,7,MALAYSIA,0.0,17.82,,31
1,4339,EPX,Travel Agency,Online,Cancellation Plan,0,85,SINGAPORE,69.0,0.0,,36
2,34590,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,0,11,MALAYSIA,19.8,11.88,,75
3,55816,EPX,Travel Agency,Online,2 way Comprehensive Plan,0,16,INDONESIA,20.0,0.0,,32
4,13816,EPX,Travel Agency,Online,Cancellation Plan,0,10,"KOREA, REPUBLIC OF",15.0,0.0,,29
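
The absolute Windows path used above only resolves on the author's machine. As a minimal sketch (not part of the original notebook), the path could instead be built relative to a project folder with `pathlib`. The `data/` folder layout and the tiny CSV written here are assumptions, used only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical layout: <project_root>/data/train.csv.
# A temporary directory stands in for the project root so the sketch runs anywhere.
project_root = Path(tempfile.mkdtemp())
data_dir = project_root / "data"
data_dir.mkdir()

# Tiny stand-in file with a few of the real column names.
(data_dir / "train.csv").write_text("ID,Claim,Duration\n3433,0,7\n4339,0,85\n")

# Build the path from parts instead of hard-coding a drive letter.
train_path = data_dir / "train.csv"
df = pd.read_csv(train_path)

print(df.shape)
```

In a real project, `project_root` would point at the repository folder rather than a temporary directory.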


In [2]:
# shape of the dataframe 'data'
data.shape

(50553, 12)

The loaded dataset contains 50553 rows and 12 columns; of these columns, 'Claim' is the target.
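
Since 'Claim' is a binary target, it can be useful to look at how the two classes are distributed before modeling. The following is a minimal sketch on a toy frame; the values in `toy` are made up for illustration and say nothing about the real class balance.

```python
import pandas as pd

# Toy stand-in for the training data (values are illustrative only).
toy = pd.DataFrame({"Claim": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]})

# Absolute counts and relative frequencies of each class.
class_counts = toy["Claim"].value_counts()
class_share = toy["Claim"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

On the real data, the same two calls would be applied to `data['Claim']`.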

We will analyse the dataset to identify:

1. Missing values
2. Numerical features
3. Distribution of the numerical features
4. Outliers
5. Categorical features
6. Cardinality of the categorical features
7. Binary features

### Missing Values
Let's find the columns that have missing values.

In [3]:
# make a list of the features that contain at least one missing value
vars_with_na = [var for var in data.columns if data[var].isnull().sum() > 0]

# print the feature name and the percentage of missing values
for var in vars_with_na:
    print(var, np.round(data[var].isnull().mean() * 100, 1), ' % missing values')

Gender 71.1  % missing values


In our dataset, only one column, 'Gender', has missing values: 71.1% of its entries are missing.
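
The same check can also be written as a small helper that returns a table instead of printing. This is a sketch, not the notebook's own code: the function name `missing_summary` and the toy frame are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    pct = (df.isnull().mean() * 100).round(1)
    summary = pd.DataFrame({"n_missing": counts, "pct_missing": pct})
    # Keep only columns with at least one missing value, worst first.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame: half of the 'Gender' values are missing, 'Age' is complete.
toy = pd.DataFrame({
    "Gender": ["M", np.nan, np.nan, "F"],
    "Age": [31, 36, 75, 32],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(data)` on the real frame should list only 'Gender', in line with the output above.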

### Numerical Features
Let's find out which features are numeric and which are not.

In [4]:
# list of numerical features
num_vars = [var for var in data.columns if data[var].dtypes != 'O']

print('Number of numerical features: ', len(num_vars))

# visualise the numerical features
data[num_vars].head()

Number of numerical features:  6


,ID,Claim,Duration,Net Sales,Commision (in value),Age
0,3433,0,7,0.0,17.82,31
1,4339,0,85,69.0,0.0,36
2,34590,0,11,19.8,11.88,75
3,55816,0,16,20.0,0.0,32
4,13816,0,10,15.0,0.0,29


### Non-Numeric Features 

By non-numeric features we mean categorical variables, so let's find them.

In [5]:
# categorical variables
cat_var = [var for var in data.columns if var not in num_vars]

print('Number of categorical features: ', len(cat_var))

# visualise the categorical features
data[cat_var].head()

Number of categorical features:  6


,Agency,Agency Type,Distribution Channel,Product Name,Destination,Gender
0,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,MALAYSIA,
1,EPX,Travel Agency,Online,Cancellation Plan,SINGAPORE,
2,CWT,Travel Agency,Online,Rental Vehicle Excess Insurance,MALAYSIA,
3,EPX,Travel Agency,Online,2 way Comprehensive Plan,INDONESIA,
4,EPX,Travel Agency,Online,Cancellation Plan,"KOREA, REPUBLIC OF",


Of the 12 columns in our dataset, 6 hold numeric data and 6 hold non-numeric data.
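
An alternative to comparing each column's dtype against `'O'` is pandas' `select_dtypes`, which splits columns by type family and does not depend on text columns being stored as `object`. The sketch below uses a toy frame with a few of the real column names; it also counts distinct values per categorical column as a first look at cardinality.

```python
import pandas as pd

# Toy frame mixing text and numeric columns (values are illustrative only).
toy = pd.DataFrame({
    "Agency": ["CWT", "EPX"],
    "Duration": [7, 85],
    "Net Sales": [0.0, 69.0],
    "Gender": [None, None],
})

# Numeric columns (ints and floats) versus everything else.
num_cols = toy.select_dtypes(include="number").columns.tolist()
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct non-missing values per categorical column.
cardinality = toy[cat_cols].nunique()

print(num_cols)
print(cat_cols)
print(cardinality)
```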

### Discrete vs. Continuous Numeric Features

Let's find the number of unique values for each numeric variable; this will help us understand which numerical features are discrete and which are continuous.

In [6]:
# dictionary of numerical variables and no of unique values
num_unique_dict = {x:data[x].nunique() for x in num_vars}

for k,v in num_unique_dict.items():
    print('{} has {} no of unique values'.format(k,v))

ID has 50553 no of unique values
Claim has 2 no of unique values
Duration has 444 no of unique values
Net Sales has 1053 no of unique values
Commision (in value) has 964 no of unique values
Age has 88 no of unique values


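From the above we can conclude that Claim is the only binary (discrete) feature. The rest take many distinct values and are treated as continuous here, although ID is unique for every row and is really just an identifier.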
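
The split by number of unique values can also be made explicit with a cutoff. The following is a rough sketch: the helper name `split_discrete_continuous` and the threshold of 20 are assumptions chosen for illustration, not values taken from the analysis.

```python
import pandas as pd


def split_discrete_continuous(df, columns, max_discrete=20):
    """Label a column discrete if it has at most `max_discrete` unique values."""
    discrete = [c for c in columns if df[c].nunique() <= max_discrete]
    continuous = [c for c in columns if c not in discrete]
    return discrete, continuous


# Toy frame: 'Claim' has 2 unique values, 'Duration' 50, 'Age' 30.
toy = pd.DataFrame({
    "Claim": [0, 1] * 25,
    "Duration": range(50),
    "Age": [20 + i % 30 for i in range(50)],
})

discrete, continuous = split_discrete_continuous(toy, toy.columns, max_discrete=20)
print(discrete)
print(continuous)
```

A unique-value count is only a heuristic: an identifier such as ID has as many unique values as rows but carries no numeric meaning.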

In [7]:
# list of continuous vars
cont_var = [var for var in num_vars if var != 'Claim']

# getting summary of data
data[cont_var].describe()

,ID,Duration,Net Sales,Commision (in value),Age
count,50553.0,50553.0,50553.0,50553.0,50553.0
mean,31679.740134,49.425969,40.800977,9.83809,40.011236
std,18288.26535,101.434647,48.899683,19.91004,14.076566
min,0.0,-2.0,-389.0,0.0,0.0
25%,15891.0,9.0,18.0,0.0,35.0
50%,31657.0,22.0,26.5,0.0,36.0
75%,47547.0,53.0,48.0,11.55,44.0
max,63325.0,4881.0,810.0,283.5,118.0


From the above summary of the continuous numeric features, we can see that Duration has negative value(s) (its minimum is -2), and a duration of travel can't be negative.
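
A quick way to see which numeric columns contain negative values at all is to count them per column. The sketch below uses a toy frame that reuses a few extreme values from the summary table (-2, -389, 118); the other values are made up.

```python
import pandas as pd

# Toy frame with a few of the extreme values seen in the summary above.
toy = pd.DataFrame({
    "Duration": [7, -2, 0, 16],
    "Net Sales": [0.0, 69.0, -389.0, 20.0],
    "Age": [31, 36, 118, 32],
})

# Element-wise comparison gives a boolean frame; summing counts the True cells per column.
negative_counts = (toy < 0).sum()
print(negative_counts)
```

Whether a negative value is an error depends on the column: it is impossible for Duration, while for Net Sales it would need a domain explanation that the source data does not provide.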

In [42]:
# indices of rows with zero or negative duration (note the <= comparison)
neg_dura = data[data['Duration']<=0].index.tolist()

print(neg_dura)

[181, 314, 1864, 3068, 4063, 4282, 7324, 7482, 7885, 8205, 8512, 8897, 9501, 10465, 10558, 11515, 12579, 12924, 16251, 17641, 17741, 18198, 18299, 19419, 20538, 21442, 23561, 23606, 23766, 24505, 25266, 25785, 26076, 27002, 30597, 31258, 31932, 33674, 34178, 34511, 36144, 36403, 38935, 39014, 39027, 39318, 41302, 41861, 43410, 43464, 43485, 43719, 45395, 45474, 47625, 48367, 50357]


In [43]:
data.iloc[181]

ID                           46888
Agency                         JWT
Agency Type               Airlines
Distribution Channel        Online
Product Name            Value Plan
Claim                            0
Duration                         0
Destination                  INDIA
Net Sales                       31
Commision (in value)          12.4
Gender                           M
Age                            118
Name: 181, dtype: object
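
Row 181 has a Duration of 0 rather than a negative value, because the filter uses `<=`. Instead of inspecting flagged rows one at a time, they can be viewed together and split into zero and negative cases. This is a sketch on a toy frame with made-up values; on the real data the same mask would be built from `data['Duration']`.

```python
import pandas as pd

# Toy frame: two zero durations, one negative duration, two valid ones.
toy = pd.DataFrame({
    "Duration": [7, 0, -2, 16, 0],
    "Age": [31, 118, 36, 32, 118],
})

# Boolean mask of rows with a non-positive duration.
mask = toy["Duration"] <= 0
flagged = toy.loc[mask]

# Separate the two cases, since zero and negative may need different treatment.
n_zero = int((flagged["Duration"] == 0).sum())
n_negative = int((flagged["Duration"] < 0).sum())

print(flagged)
print(n_zero, n_negative)
```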