## Is a customer interested in a caravan insurance policy?

Direct mailings to a company's potential customers ("junk mail" to many) can be a very effective way
to market a product or a service. However, much of this mail is of no interest to the
people who receive it. Most of it ends up thrown away, which wastes the money the company spent on it
and adds to landfill waste or to the volume that needs recycling.

If the company had a better understanding of who its potential customers were, it would know more
accurately whom to mail, and some of this waste and expense could be avoided.

We're going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Modelling

## 1. Problem Definition

In one sentence:
> We want you to predict whether a customer is interested in a caravan insurance policy from other data about the
> customer.

## 2. Data

Training data: `carvan_train.csv` (includes the target column `V86`).

Test data: `carvan_test.csv` (no target column).

## 3. Evaluation

> The project counts as a success if the model reaches an F-beta score above 0.26 (beta = 2) when predicting whether or not to mail a customer.
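
To make the metric concrete, here is a minimal sketch of how F-beta combines precision and recall. The helper name `fbeta` and the precision/recall values are hypothetical, chosen only for illustration. With beta = 2, recall counts more heavily than precision, which suits a mailing problem where missing an interested customer costs more than sending one extra letter.

```python
def fbeta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta from precision and recall; beta > 1 weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical model that mails widely: low precision, high recall.
p, r = 0.2, 0.8
f1 = fbeta(p, r, beta=1.0)  # balanced weighting
f2 = fbeta(p, r, beta=2.0)  # recall-weighted, as in this project
print(round(f1, 3), round(f2, 3))
```

For the same precision and recall, the F2 value is higher than F1 here because the model's strength is recall.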

## Preparing the tools

We're going to use `pandas` and `numpy` for data analysis and manipulation, `matplotlib` for plotting, and `scikit-learn` for modelling and evaluation.

In [1]:
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import fbeta_score, roc_auc_score

## Load Data

In [4]:
df_train=pd.read_csv('carvan_train.csv')
df_test=pd.read_csv('carvan_test.csv')

In [7]:
df_train.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V77,V78,V79,V80,V81,V82,V83,V84,V85,V86
0,33,1,3,2,8,0,5,1,3,7,...,0,0,0,1,0,0,0,0,0,0
1,37,1,2,2,8,1,4,1,4,6,...,0,0,0,1,0,0,0,0,0,0
2,37,1,2,2,8,0,4,2,4,3,...,0,0,0,1,0,0,0,0,0,0
3,9,1,3,3,3,2,3,2,4,5,...,0,0,0,1,0,0,0,0,0,0
4,40,1,4,2,10,1,4,1,4,7,...,0,0,0,1,0,0,0,0,0,0


In [14]:
df_test.tail()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V76,V77,V78,V79,V80,V81,V82,V83,V84,V85
3995,33,1,2,4,8,0,7,2,0,5,...,0,0,0,0,1,0,0,0,0,0
3996,24,1,2,3,5,1,5,1,3,4,...,1,0,0,0,1,0,0,0,0,0
3997,36,1,2,3,8,1,5,1,3,7,...,0,0,0,0,1,0,0,0,1,0
3998,33,1,3,3,8,1,4,2,3,7,...,0,0,0,0,0,0,0,0,0,0
3999,8,1,2,3,2,4,3,0,3,5,...,0,0,0,0,1,0,0,0,0,0


In [15]:
df_test['V86'] = np.nan
df_train['data'] = 'train' 
df_test['data'] = 'test'
df_test = df_test[df_train.columns]
df_all = pd.concat([df_train,df_test],axis=0)

In [16]:
df_test.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V78,V79,V80,V81,V82,V83,V84,V85,V86,data
0,33,1,4,2,8,0,6,0,3,5,...,0,0,1,0,0,0,0,0,,test
1,6,1,3,2,2,0,5,0,4,5,...,0,0,1,0,0,0,0,0,,test
2,39,1,3,3,9,1,4,2,3,5,...,0,0,1,0,0,0,0,0,,test
3,9,1,2,3,3,2,3,2,4,5,...,0,0,1,0,0,0,0,0,,test
4,31,1,2,4,7,0,2,0,7,9,...,0,0,1,0,0,0,0,0,,test


In [17]:
(df_train['V86'][df_train['V86']==1].value_counts()/df_train.shape[0])*100

1    5.977327
Name: V86, dtype: float64
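
Only about 6% of the training customers hold a policy, so accuracy alone says little about model quality. The following sketch uses hypothetical labels with roughly that class balance (not the real data) to show that a model which never mails anyone still scores high accuracy while finding no interested customers.

```python
# Hypothetical labels with roughly the class balance reported above (6 positives in 100).
y_true = [1] * 6 + [0] * 94
y_pred = [0] * len(y_true)  # a "model" that never mails anyone

# Share of labels predicted correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of truly interested customers that the model finds.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

This is why the evaluation relies on a recall-weighted F-beta score rather than accuracy.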

## Creating Dummies

In [18]:
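def dummies(data, var, freq_cutoff=0):
    """Replace column `var` with 0/1 indicator columns, one per retained category."""
    # Relative frequency of each category; keep only those above the cutoff.
    t = data[var].value_counts(normalize=True)
    t = t[t.values > freq_cutoff]
    # Drop the least frequent category so that it acts as the reference level.
    t = t.sort_values()
    t = t.drop([t.idxmin()])
    categories = t.index

    for cat in categories:
        # Build a column name free of spaces and special characters.
        name = var + '_' + cat
        name = re.sub(" ", "", name)
        name = re.sub("-", "_", name)
        name = re.sub("\\?", "Q", name)
        name = re.sub("<", "LT_", name)
        name = re.sub("\\+", "", name)
        name = re.sub("\\/", "_", name)
        name = re.sub(">", "GT_", name)
        name = re.sub("=", "EQ_", name)
        name = re.sub(",", "", name)
        data[name] = (data[var] == cat) + 0  # boolean mask converted to 0/1

    # The original column is no longer needed once the indicators exist.
    data = data.drop(columns=[var])
    return data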

In [19]:
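# Encode every feature column; the target `V86` and the `data` marker are left untouched.
for col in df_all.columns:
    if col not in ('V86', 'data'):
        df_all[col] = df_all[col].astype(str)  # treat each value as a category label
        df_all = dummies(df_all, col)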
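
For comparison, pandas ships a built-in one-hot encoder, `pd.get_dummies`. The sketch below applies it to a small hypothetical frame (the values are made up for illustration). It is not a drop-in replacement for `dummies`: `drop_first=True` drops the first category in sorted order, whereas `dummies` drops the least frequent one, and column names are not sanitised.

```python
import pandas as pd

# Hypothetical toy frame: one categorical feature stored as strings, plus the target.
toy = pd.DataFrame({"V1": ["33", "37", "37", "9"], "V86": [0, 1, 0, 0]})

# One indicator column per category of V1, minus the first category as reference.
encoded = pd.get_dummies(toy, columns=["V1"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The encoded columns are appended after the untouched ones, so the target stays in place.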

In [20]:
# Split the combined frame back apart. drop() returns a new frame here, so nothing
# is modified in place on a slice and no SettingWithCopyWarning is raised.
df_train = df_all[df_all['data'] == 'train'].drop(columns=['data'])
df_test = df_all[df_all['data'] == 'test'].drop(columns=['V86', 'data'])


In [23]:
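# Hold out 20% of the labelled data as a validation part (train2).
train1, train2 = train_test_split(df_train, test_size=0.2, random_state=2)

# Separate the features from the target; the axis must be named explicitly
# because recent pandas versions no longer accept it as a positional argument.
x_train1 = train1.drop(columns=["V86"])
y_train1 = train1["V86"]
x_train2 = train2.drop(columns=["V86"])
y_train2 = train2["V86"]

x_train1.reset_index(drop=True, inplace=True)
y_train1.reset_index(drop=True, inplace=True)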
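
With so few positive cases, a plain random split can leave the validation part with an unrepresentative share of them. One option, shown here as a sketch on hypothetical data rather than as the method used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 10 of them positive.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

# stratify=y preserves the 10% positive share in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

print(y_fit.sum(), y_val.sum())
```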

In [24]:
svclassifier = SVC(kernel='rbf', class_weight='balanced', C=1.0, random_state=0)
svclassifier.fit(x_train1, y_train1)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)

In [25]:
y_pred1 = svclassifier.predict(x_train1).round()

In [26]:
y_pred2 = svclassifier.predict(x_train2).round()

In [27]:
y_pred2[y_pred2==1].size

211

In [28]:
print(confusion_matrix(y_train1,y_pred1))
print(classification_report(y_train1,y_pred1))

[[3759  625]
 [   3  270]]
              precision    recall  f1-score   support

         0.0       1.00      0.86      0.92      4384
         1.0       0.30      0.99      0.46       273

    accuracy                           0.87      4657
   macro avg       0.65      0.92      0.69      4657
weighted avg       0.96      0.87      0.90      4657



In [29]:
print(confusion_matrix(y_train2,y_pred2))
print(classification_report(y_train2,y_pred2))

[[914 176]
 [ 40  35]]
              precision    recall  f1-score   support

         0.0       0.96      0.84      0.89      1090
         1.0       0.17      0.47      0.24        75

    accuracy                           0.81      1165
   macro avg       0.56      0.65      0.57      1165
weighted avg       0.91      0.81      0.85      1165



In [30]:
roc_auc_score(y_train1, y_pred1)  # AUC on the training part (train1)

0.9232235601989252

In [31]:
roc_auc_score(y_train2, y_pred2)  # AUC on the validation part (train2)

0.652599388379205
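
The AUC values above are computed from hard 0/1 predictions, which reduces the ROC curve to a single operating point. A common alternative is to pass continuous scores, for example from `SVC.decision_function`. The sketch below shows both variants on synthetic data from `make_classification`; the data and variable names are assumptions for illustration, not the caravan data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the real features and target.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = SVC(kernel="rbf", class_weight="balanced", C=1.0, random_state=0)
clf.fit(X_tr, y_tr)

# AUC from hard labels: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_va, clf.predict(X_va))
# AUC from signed distances to the decision boundary: all thresholds are evaluated.
auc_from_scores = roc_auc_score(y_va, clf.decision_function(X_va))

print(auc_from_labels, auc_from_scores)
```

The two numbers generally differ, so it helps to state which variant a reported AUC refers to.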

In [32]:
fbeta_score(y_train1, y_pred1, beta=2)

0.6794162053346754

In [33]:
fbeta_score(y_train2, y_pred2, beta=2)

0.34246575342465757
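
As a cross-check, both F2 scores can be reproduced by hand from the confusion matrices printed earlier. The helper `fbeta_from_counts` is a hypothetical name; the counts are read directly from those matrices (rows are true classes, columns are predicted classes).

```python
def fbeta_from_counts(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta expressed in true positives, false positives and false negatives."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Training part:   [[3759, 625], [3, 270]] -> TP=270, FP=625, FN=3
train_f2 = fbeta_from_counts(tp=270, fp=625, fn=3)
# Validation part: [[914, 176], [40, 35]]  -> TP=35,  FP=176, FN=40
val_f2 = fbeta_from_counts(tp=35, fp=176, fn=40)

print(train_f2, val_f2)
```

The validation value of about 0.34 is above the 0.26 target, while the gap to the training value of about 0.68 suggests the model fits the training part considerably better than unseen data.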

In [35]:
y_test_pred = svclassifier.predict(df_test).round()

In [36]:
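# Name the prediction column explicitly; `header` expects a bool or a list of names,
# so passing the string "V86" would not rename the column.
pd.DataFrame({"V86": y_test_pred}).to_csv("Project2_Paramvir_Yadav_Submission.csv", index=False)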