# 1. Business Understanding

## 1.1 Objective
In this competition, you will predict the probability that an auto insurance policy holder files a claim.

## 1.2 Description
Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance companies’ claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

In the train and test data, features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). In addition, feature names carry the suffix `bin` to indicate binary features and `cat` to indicate categorical features. Features without these designations are either continuous or ordinal. A value of -1 indicates that the feature was missing from the observation. The `target` column indicates whether or not a claim was filed for that policy holder.
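
The naming convention described above can be turned into a quick per-feature summary. The following is a minimal sketch, not part of the original notebook: the helper `summarize_features` is hypothetical, the `ps_<group>_<number>[_bin|_cat]` column pattern is an assumption about how the names are laid out, and the toy frame is made-up data standing in for `train`.

```python
import pandas as pd


def summarize_features(df: pd.DataFrame) -> pd.DataFrame:
    """Report group, kind, and share of -1 (missing) values per feature.

    Assumes feature names look like ps_<group>_<number>[_bin|_cat];
    columns that do not start with "ps_" (such as id and target) are skipped.
    """
    rows = []
    for col in df.columns:
        if not col.startswith("ps_"):
            continue
        group = col.split("_")[1]  # e.g. "ind", "reg", "car", "calc"
        if col.endswith("_bin"):
            kind = "binary"
        elif col.endswith("_cat"):
            kind = "categorical"
        else:
            kind = "continuous or ordinal"
        rows.append({
            "feature": col,
            "group": group,
            "kind": kind,
            "missing_share": (df[col] == -1).mean(),
        })
    return pd.DataFrame(rows)


# Made-up toy data, only to show the shape of the output
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "target": [0, 0, 1, 0],
    "ps_ind_01": [2, 1, 5, 0],
    "ps_ind_06_bin": [0, 1, 0, 1],
    "ps_car_03_cat": [-1, -1, 1, 0],
    "ps_reg_03": [0.7, -1.0, 0.9, 0.6],
})
summary = summarize_features(toy)
print(summary)
```

On the real data, calling `summarize_features(train)` and sorting by `missing_share` would show which columns are dominated by -1 before any of them are dropped.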

# 2. Data Understanding

## 2.1 Import Libraries

In [None]:
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Modelling Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Visualisation
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

## 2.2 Load data

In [None]:
# Load the train and test sets as DataFrames
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

In [None]:
#train.info()

In [None]:
#train.head()

In [None]:
#train.describe()

In [None]:
#test.info()

In [None]:
#test.head()

In [None]:
#test.describe()

## 2.3 Data analysis/summaries

In [None]:
# Pairwise Pearson correlation between all columns (id and target included)
corr=train.corr()
#corr

In [None]:
sns.heatmap(corr)

In [None]:
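# Correlation of every column with target, excluding target itself
corrt=corr[["target"]].drop("target")
# x = column name, y = its correlation with target
corrt=corrt.reset_index().rename(columns={'index': 'x', 'target': 'y'})
corrt=corrt.sort_values("y", ascending=False)
#corrt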

In [None]:
sns.barplot(x="y", y="x", data=corrt)

In [None]:
# Keep only the columns with a positive correlation with target
corrf=corrt[corrt.y>0]
#corrf

# 3. Data Preparation

## 3.1 Shortlisting features having positive correlation with target

In [None]:
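# Positively correlated columns, excluding id (a row identifier, not a feature)
lcn=[c for c in corrf['x'].tolist() if c!="id"]
# Take copies so the in-place edits below do not act on views of train/test
testf=test[lcn].copy()
trainf=train[lcn+["target"]].copy()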

## 3.2 Removing columns having mode=-1

In [None]:
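# Most frequent value of each train column (row 0 of DataFrame.mode)
mc=trainf.mode().transpose().reset_index().rename(columns={'index': 'colu', 0: 'modeofc'})
# Drop columns whose most frequent value is the missing marker -1
for index,row in mc.iterrows():
    if(row['modeofc']==-1):
        trainf=trainf.drop(row['colu'],axis=1)
        testf=testf.drop(row['colu'],axis=1)
# Recompute the modes of the remaining columns for the imputation in 3.3
mc=trainf.mode().transpose().reset_index().rename(columns={'index': 'colu', 0: 'modeofc'})
mc2=testf.mode().transpose().reset_index().rename(columns={'index': 'colu', 0: 'modeofc'})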

## 3.3 Replace missing values with mode

In [None]:
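# Replace -1 in each train column with that column's mode
for index,row in mc.iterrows():
    c=row['colu']
    val=row['modeofc']
    # Cast back to int so integer columns keep their dtype
    if(trainf.dtypes[c]==np.int64):
        val=np.int64(val)
    trainf.loc[trainf[c] == -1, c] = val
# Same for test, using the test set's own modes
for index,row in mc2.iterrows():
    c=row['colu']
    val=row['modeofc']
    if(testf.dtypes[c]==np.int64):
        val=np.int64(val)
    testf.loc[testf[c] == -1, c] = val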
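
Sections 3.2 and 3.3 can also be written as one small function. The sketch below is an illustration under stated assumptions, not the notebook's code: the helper `drop_and_impute` is hypothetical, the toy frames are made up, and it deliberately differs from the cells above in one respect by filling the test set with the train modes rather than the test set's own modes.

```python
import pandas as pd


def drop_and_impute(train_df, test_df, missing=-1):
    """Drop columns whose train mode is `missing`; fill the rest with the train mode.

    Assumption: the fill value is learned from train only and reused for test,
    which is a swapped-in choice, not what the cells above do.
    """
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        # Most frequent value in train, counting the missing marker itself
        fill = train_df[col].mode().iloc[0]
        if fill == missing:
            train_out = train_out.drop(columns=col)
            test_out = test_out.drop(columns=col, errors="ignore")
            continue
        # Keep observed values, substitute the mode where the marker appears
        train_out[col] = train_out[col].where(train_out[col] != missing, fill)
        if col in test_out.columns:
            test_out[col] = test_out[col].where(test_out[col] != missing, fill)
    return train_out, test_out


# Made-up toy data: column "b" is mostly missing, column "a" has a few gaps
toy_train = pd.DataFrame({"a": [1, 1, -1, 2], "b": [-1, -1, -1, 3], "target": [0, 0, 1, 0]})
toy_test = pd.DataFrame({"a": [-1, 2, 2, 2], "b": [1, -1, 2, 2]})

train_filled, test_filled = drop_and_impute(toy_train, toy_test)
print(train_filled)
print(test_filled)
```

Learning the fill value on train alone keeps the preprocessing identical for both sets; whether that changes the score on this data is not something the notebook measures.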

# 4. Modeling

## 4.1 Creating test train split

In [None]:
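# Features and label from the prepared training data
x=trainf.drop("target",axis=1)
y=trainf["target"]
# X_test (capital X) is the competition test set used for the submission
X_test=pd.DataFrame(testf)
# x_test/y_test (lowercase) are a 33% validation split held out from train
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=42)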

## 4.2 Running models

In [None]:
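# Fit a logistic regression on the training split
clf=LogisticRegression()
clf.fit(x_train,y_train)
# The objective asks for a claim probability, so take the positive-class
# column of predict_proba instead of the hard 0/1 labels from predict
Y_pred=clf.predict_proba(X_test)[:,1]
# Accuracy (in percent) on the held-out validation split
acc = round(clf.score(x_test, y_test) * 100, 2)
acc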
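
Since the objective in 1.1 is a probability, it helps to see how `predict` and `predict_proba` differ, and how accuracy behaves when one class is rare. The sketch below is an illustration on synthetic data only: the 95/5 class balance, the `max_iter=1000` setting, and the use of ROC AUC as a ranking score are all assumptions made for this example and are not taken from the notebook or the competition data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed 95% negatives, 5% positives)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = model.predict(X_val)             # hard 0/1 class labels
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

accuracy = model.score(X_val, y_val)
# Accuracy of always predicting the majority class, for comparison
baseline_accuracy = 1 - y_val.mean()
# ROC AUC scores how well the probabilities rank positives above negatives
auc = roc_auc_score(y_val, proba)

print(f"accuracy:          {accuracy:.3f}")
print(f"majority baseline: {baseline_accuracy:.3f}")
print(f"ROC AUC:           {auc:.3f}")
```

When positives are rare, the accuracy of a fitted model can sit close to the majority baseline, so comparing the two numbers gives a quick check on whether `acc` above says much about the model.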

In [None]:
submission = pd.DataFrame({"id": test["id"],"target": Y_pred})
submission.to_csv('mysubmission.csv', index=False)