In this notebook, I'm sharing my implementation for predicting whether a driver will file an insurance claim or not.
* Here I'm using a Random Forest model to predict the target variable (Yes/No).

**Objective**
* Develop a machine learning model for predicting whether a driver will file an insurance claim next year or not.

In [None]:
# Import all needed libraries for seeing the train and test files
import numpy as np 
import pandas as pd

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
# Let us load in the training data provided using Pandas:
train = pd.read_csv("../input/train.csv")
print(train.shape)
train.head()

* The training dataset contains 595212 rows and 59 columns.
* **target** is the column which we are going to predict.

In [None]:
# Let us load the testing data.
test = pd.read_csv("../input/test.csv")
print(test.shape)
test.head()

* The testing dataset contains 892816 rows and 58 columns (bigger than the training dataset).
* It has all the columns of the training set except **target**.
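The claim about matching columns can be verified directly. Below is a minimal sketch on two toy frames (`toy_train` and `toy_test` are hypothetical stand-ins, not the competition data); the same comparison should work on `train` and `test`.

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test frames
toy_train = pd.DataFrame({"id": [1, 2], "feat_a": [0, 1], "target": [0, 1]})
toy_test = pd.DataFrame({"id": [3, 4], "feat_a": [1, 0]})

# Drop 'target' from the training columns and compare names and order
expected_cols = toy_train.columns.drop("target")
same_columns = list(expected_cols) == list(toy_test.columns)
print(same_columns)
```

Comparing the order as well as the names matters later, because the feature selection in this notebook uses positional indices.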

Let's see the column names.

In [None]:
train.columns

* Check for missing values in both the training and testing data columns.
* Before checking, replace all **-1** values with **np.nan**.
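As a numeric complement to the plots, the fraction of missing values per column can be computed with pandas. This is only a sketch on a toy frame (`toy` and its column names are made up for illustration); it assumes, as above, that -1 marks a missing value.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; -1 marks a missing value
toy = pd.DataFrame({
    "feat_a": [1, -1, 3, -1],
    "feat_b": [0, 1, 1, 0],
    "feat_c": [-1, -1, -1, 2],
})

# Replace the -1 sentinel with NaN, then compute the fraction missing per column
toy_nan = toy.replace(-1, np.nan)
missing_fraction = toy_nan.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Applied to the real data, the same two lines on `train` would list the columns with the most missing values first.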

In [None]:
# replace() returns a new DataFrame, so the originals stay untouched.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
train_copy = train.replace(-1, np.nan)
test_copy = test.replace(-1, np.nan)

In [None]:
import missingno as msno
%matplotlib inline
msno.bar(train_copy)

In [None]:
msno.bar(test_copy)

* From the graphs, we find that 2 features have more than 50% missing values.
* Before applying models, we will find the important features among these 58 columns.
* **Finding important features using ExtraTreesClassifier**

In [None]:
# We will not use all training samples for finding important features, so split the data first.
from sklearn.model_selection import train_test_split

# Note: X_train still contains the 'id' column, so it takes part in the ranking too.
X_train = train.drop(['target'], axis=1).values
y_train = train['target'].values
X_train_main, X_train_validate, y_train_main, y_train_validate = train_test_split(
    X_train, y_train, test_size=0.5, stratify=y_train)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X_train_main, y_train_main)
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X_train_main.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(20,10))
plt.title("Feature importances")
plt.bar(range(X_train_main.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train_main.shape[1]), indices)
plt.xlim([-1, X_train_main.shape[1]])
plt.show()

From the graph, we find that the first 28 features are the most important in this case; the others contribute little to the predictions. So we keep only these top 28 features.
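The ranking prints positional indices, which are hard to read. A small sketch of mapping them back to column names follows; `feature_names` and the importance values are made up for illustration, and `top_k` is a hypothetical parameter (28 in this notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical column names and made-up importance values
feature_names = pd.Index(["id", "feat_a", "feat_b", "feat_c"])
importances = np.array([0.10, 0.40, 0.15, 0.35])

# Sort indices by importance, highest first (same idiom as in the cell above)
indices = np.argsort(importances)[::-1]

# Translate the top positional indices into column names
top_k = 2
top_features = feature_names[indices[:top_k]].tolist()
print(top_features)
```

In the notebook itself, the names would presumably come from `train.drop(['target'], axis=1).columns`, since that is the frame the indices refer to.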

In [None]:
# Keep the positional indices of the top 28 features
important_feature = indices[:28].tolist()
print(important_feature)

In [None]:
# Final dataframe with only important features.
# The indices were computed on the columns without 'target', so drop it before selecting.
train_copy = train.drop(['target'], axis=1)
final_train = train_copy.iloc[:, important_feature]
X_train = final_train.values
y_train = train['target'].values
X_train_main, X_train_validate, y_train_main, y_train_validate = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train)

Now it is time to fit our model, **RandomForestClassifier**.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train_main, y_train_main)

In [None]:
predicted_train_validate = clf.predict(X_train_validate)
actual_train_validate = y_train_validate

Let's check our model's accuracy using the accuracy score.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(actual_train_validate, predicted_train_validate)

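Wow! We got an accuracy of 96%. Interesting. Note that accuracy alone can be misleading if claims are rare, so it is worth comparing this number with the share of the majority class.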
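To put an accuracy figure in context, it helps to compare it with a model that always predicts the majority class. The sketch below uses made-up labels (96 negatives, 4 positives) purely for illustration; it is not the notebook's validation data.

```python
import numpy as np

# Made-up labels: 96 drivers without a claim, 4 with a claim
y_true = np.array([0] * 96 + [1] * 4)

# A trivial "model" that always predicts class 0
y_pred = np.zeros_like(y_true)

# Accuracy of the trivial model and the number of positives it predicts
accuracy = float(np.mean(y_true == y_pred))
predicted_positives = int(y_pred.sum())
print(accuracy, predicted_positives)
```

On the real validation split, the analogous checks would be `1 - y_train_validate.mean()` for the majority share and `predicted_train_validate.sum()` for the number of predicted claims. If the model's accuracy is close to the majority share and it predicts almost no claims, accuracy is not telling us much.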

In [None]:
# Prepare submission file.
# test has the same columns as train without 'target', so the positional indices line up.
test_copy = test.iloc[:, important_feature]
X_test = test_copy.values
predicted_test = clf.predict(X_test)

In [None]:
output = pd.DataFrame({'id': test['id'].values, 'target': predicted_test})

In [None]:
output.to_csv("submission_output.csv", index=False) 

I hope you all learned at least one new thing from my kernel. Please upvote and encourage me to write more.
* If you have any queries, please comment below. I will help as far as my understanding goes.