# Beginner's First Look
*Update: 2017-10-04*

**Table of Contents**

* [1 Introduction](#intro)
* [2 Preparation](#prep)
    * [2.1 Import](#import)
    * [2.2 Basic Stats](#basic_stats)
    * [2.3 Zero One Columns](#zero_one)
    * [2.4 Float64 Columns](#float64)
* [3 Data Cleaning](#clean)

<a id="intro"></a>
## 1 / Introduction

This is my first look at and exploration of the [Porto Seguro's Safe Driver Prediction](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) competition, using Python 3.

The **goal** of this challenge is to predict the probability that a driver will initiate an auto insurance claim in the next year.

I created this notebook to record my findings and to help other beginners like me get into this competition quickly.

<a id="prep"></a>
## 2 / Preparation

### Import

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
train = pd.read_csv('../input/train.csv')
print(train.shape)
train.head(10)

The `id` column corresponds to an individual policy holder, while the `target` column indicates whether a claim was filed (it contains only two values, 0 and 1). At first glance, the data seems to contain only integers, with no negative numbers. We can investigate that first.
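
A quick way to test first impressions like these is to check the column dtypes and look for negative values directly. The sketch below is only an illustration: it uses a small made-up frame (`toy`) in place of `train`, and the helper `quick_checks` is a hypothetical name, not part of the original notebook. The toy frame deliberately contains one float column and one `-1`, since the competition's data description uses -1 to mark missing values.

```python
import pandas as pd

# Made-up stand-in for `train`; values are invented for illustration only.
toy = pd.DataFrame({
    'id': [7, 9, 13, 16],
    'target': [0, 0, 1, 0],
    'ps_ind_01': [2, 1, 5, 0],
    'ps_reg_01': [0.7, 0.8, 0.0, 0.9],
    'ps_car_03_cat': [-1, 0, -1, 1],
})

def quick_checks(df):
    """Return distinct target values, non-integer columns, and columns with negatives."""
    features = df.drop(columns=['id', 'target'])
    target_values = sorted(df['target'].unique().tolist())
    # columns whose dtype is not an integer type
    non_int_cols = [c for c in features.columns
                    if not pd.api.types.is_integer_dtype(features[c])]
    # columns that contain at least one negative value
    has_negative = (features < 0).any()
    negative_cols = has_negative[has_negative].index.tolist()
    return target_values, non_int_cols, negative_cols

target_values, non_int_cols, negative_cols = quick_checks(toy)
print('target values:', target_values)
print('non-integer columns:', non_int_cols)
print('columns with negative values:', negative_cols)
```

Running `quick_checks(train)` on the real data would apply the same three checks to every feature column.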

<a id='basic_stats'></a>
### Basic Stats

In [None]:
# skip the first two columns (id and target) and summarize the features
train_stat = train.iloc[:,2:].describe().T
train_stat

In [None]:
plt.figure(figsize=(12,16))
sns.barplot(x='max',y=train_stat.index, data=train_stat, color='0.75')
sns.barplot(x='min',y=train_stat.index, data=train_stat, color='r')
plt.xticks(rotation='vertical')
plt.xlabel('Min(red), Max(gray)')
plt.show()

<a id='zero_one'></a>
### zero_one_cols
Columns that contain only the values 0 and 1

In [None]:
# cols with only 0 and 1 value
zero_one_cols = train_stat.loc[(train_stat['max']==1) & (train_stat['min']==0)].index.tolist()
zero_one_cols
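
Note that filtering on `min == 0` and `max == 1` would also match a column that holds fractional values between 0 and 1. As a hedged sketch of a stricter alternative, the check below tests the values themselves with `isin`. The frame `toy`, its column names, and the helper `strict_binary_cols` are all hypothetical and only serve to show the difference.

```python
import pandas as pd

# Made-up data: one true 0/1 flag, one ratio in [0, 1], one count column.
toy = pd.DataFrame({
    'flag_a': [0, 1, 1, 0],
    'ratio_b': [0.0, 0.5, 1.0, 0.2],   # min 0 and max 1, but not binary
    'count_c': [0, 3, 7, 1],
})

def strict_binary_cols(df):
    """Return the columns in which every value is exactly 0 or 1."""
    return [col for col in df.columns if df[col].isin([0, 1]).all()]

# Same min/max filter as in the cell above, applied to the toy frame
toy_stat = toy.describe().T
by_min_max = toy_stat.loc[(toy_stat['max'] == 1) & (toy_stat['min'] == 0)].index.tolist()

# Value-based filter
by_values = strict_binary_cols(toy)

print('min/max filter:', by_min_max)
print('value filter:  ', by_values)
```

If both lists agree on the real data, the simpler min/max filter is sufficient there.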

In [None]:
zero,one = [],[]
for col in zero_one_cols:
    zero.append(train[col].value_counts()[0])
    one.append(train[col].value_counts()[1])
zero_one_stat = pd.DataFrame({'index':zero_one_cols,'zero':zero,'one':one})
zero_one_stat.set_index('index',inplace=True)
zero_one_stat
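
Because these columns hold only 0 and 1, the number of ones in a column equals its sum, so the same table can be built without a loop. This is a minimal sketch on a made-up frame (`toy`, with hypothetical column names), not a replacement for the cell above.

```python
import pandas as pd

# Made-up 0/1 columns for illustration.
toy = pd.DataFrame({
    'flag_a': [0, 1, 1, 0, 1],
    'flag_b': [0, 0, 0, 0, 1],
})

one = toy.sum()            # for a 0/1 column, the sum is the count of ones
zero = len(toy) - one      # every remaining row is a zero
counts = pd.DataFrame({'zero': zero, 'one': one})
print(counts)
```

With the real data, `toy` would be replaced by `train[zero_one_cols]`.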

In [None]:
# blue bars span the total row count and red bars show the count of 1s;
# the overlap appears purple, so the remaining blue part is the count of 0s
f, ax1 = plt.subplots(figsize=(12,8))
sns.barplot(x=[train.shape[0]]*len(zero_one_stat), y=zero_one_stat.index, color='blue', alpha=.5)
sns.barplot(x='one',y=zero_one_stat.index, data=zero_one_stat,color='red', alpha=0.5)
plt.xlabel("One Count(red), Zero Count(blue)")
plt.show()

<a id='float64'></a>
### float64_cols

In [None]:
# cols with dtype of 'float64'
float_cols = train.select_dtypes(include=['float64']).columns.tolist()
print('Length of float64 columns:',len(float_cols))
float_cols