# Introduction

You may already have seen the [Overview](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) of Porto Seguro’s Safe Driver Prediction competition.
Here we explore the data provided for the competition to get a better understanding of the problem.

### Porto Seguro
[Porto Seguro](https://www.portoseguro.com.br/) is one of the largest auto and homeowner insurance companies in Brazil. Inaccurate claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones, so the company wants more accurate prediction models.

### Challenge
Build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year.

In this notebook, we will explore the features given in the dataset.

Import the modules required

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

Load the training data

In [None]:
train_df = pd.read_csv("../input/train.csv")
train_df.head()

In [None]:
train_df.describe()

## Checking null values
As mentioned in the data description, "Values of -1 indicate that the feature was missing from the observation".
So first, let's verify that statement.

In [None]:
labels = []
values = []
for col in train_df.columns:
    labels.append(col)
    values.append(train_df[col].isnull().sum())
    print(col, values[-1])

The output shows no NaN values in the dataset, so any missing values must be the ones entered as -1. Let's replace those with NaN.

In [None]:
# replace() returns a new DataFrame, so train_df itself keeps its -1 values
train_copy = train_df.replace(-1, np.nan)
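
Before plotting, it can help to count the -1 sentinels per column. The following is a minimal sketch on a made-up frame: `toy_df` and its values are hypothetical stand-ins for `train_df`, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "ps_ind_01": [0, 1, 2, 3],
    "ps_reg_03": [0.5, -1.0, -1.0, 0.7],
    "ps_car_03_cat": [-1, -1, -1, 1],
})

# Count the -1 sentinels per column, then express them as a fraction of rows.
missing_count = (toy_df == -1).sum()
missing_frac = (missing_count / len(toy_df)).sort_values(ascending=False)

# After the replacement, isnull() should report the same counts.
toy_nan = toy_df.replace(-1, np.nan)
nan_count = toy_nan.isnull().sum()

print(missing_frac)
```

On the real data, the same two lines applied to `train_df` would give a ranked list of the columns with the most missing entries.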

Now we can simply visualize the missing values.

In [None]:
import missingno as msno
msno.matrix(df=train_copy.iloc[:,2:42], figsize=(20, 14), color=(0.42, 0.1, 0.05))

## Target Distribution

In [None]:
plt.figure(figsize=(12,8))
sns.countplot(x="target", data=train_df, color=color[0])
plt.ylabel('Count', fontsize=12)
plt.xlabel('Target', fontsize=12)
plt.xticks(rotation='vertical')
plt.title("Frequency of targets (0 : Claim not filed; 1 : Claim filed)", fontsize=15)
plt.show()

We can see that no claim was filed for the vast majority of policy holders; a claim was filed for only a few.
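
The imbalance can also be read as numbers rather than bar heights. A minimal sketch, assuming a made-up `toy_target` series in place of `train_df["target"]` (the 24-to-1 split is invented and says nothing about the real ratio):

```python
import pandas as pd

# Hypothetical target column: 1 claim among 25 policy holders (made-up data).
toy_target = pd.Series([0] * 24 + [1], name="target")

# Absolute counts per class, and the same counts as shares of all rows.
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)

print(counts)
print(shares)
```

Running `train_df["target"].value_counts(normalize=True)` in the notebook would give the actual class shares.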

## Correlation between features

We will now plot some linear correlation graphs.

In [None]:
train_float = train_df.select_dtypes(include=['float64'])
train_int = train_df.select_dtypes(include=['int64'])
colormap = plt.cm.inferno
plt.figure(figsize=(16,12))
plt.title('Pearson correlation of continuous features', y=1.05, size=15)
sns.heatmap(train_float.corr(),linewidths=0.1,vmax=1.0, square=True, cmap=colormap, linecolor='white', annot=True)

From the plot we can see that only a few features show correlation with one another, namely:

* **ps_car_13** and **ps_car_12** : 0.64
* **ps_reg_03** and **ps_reg_01** : 0.67
* **ps_car_15** and **ps_car_13** : 0.53
* **ps_reg_03** and **ps_reg_02** : 0.52

Other features have little or no correlation with one another.
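
Reading pairs off a heatmap by eye gets tedious as the number of columns grows. One possible way to list them programmatically is sketched below; the `toy` frame, its column names, and the 0.5 cut-off are all assumptions made for illustration, not part of the original analysis.

```python
from itertools import combinations

import pandas as pd

# Made-up data: "b" closely follows "a", while "c" is unrelated to both.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0],
    "c": [1.0, -1.0, 0.0, 0.0, -1.0, 1.0],
})

corr = toy.corr()  # Pearson correlation by default
threshold = 0.5    # arbitrary cut-off chosen for this example

# Visit each unordered pair of columns once and keep the strongly correlated ones.
strong_pairs = [
    (left, right, corr.loc[left, right])
    for left, right in combinations(corr.columns, 2)
    if abs(corr.loc[left, right]) > threshold
]

for left, right, value in strong_pairs:
    print(f"{left} and {right}: {value:.2f}")
```

Swapping `toy` for `train_float` would apply the same filter to the continuous features plotted above.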

