Here we'll try to find the best decision tree for credit card fraud detection, and then compare it with a random forest.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

In [None]:
df = pd.read_csv("../input/creditcard.csv")
# print(df.describe())
print(df['Class'].value_counts())

In [None]:
y = df['Class']
df = df.drop("Class",axis=1)

print(df.shape)
print(df.columns)

Let us prepare training and test sets.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.1, random_state=0)
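
When one class is as rare as fraud usually is, a purely random split can leave the test set with a different fraud rate than the training set. Below is a minimal sketch of a stratified split. It uses a synthetic imbalanced dataset as a stand-in for `creditcard.csv`; the data, the 1% positive rate and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (hypothetical data): 10,000 rows, 5 features, exactly 1% positives.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(10000, 5))
y_demo = np.zeros(10000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0, stratify=y_demo
)

print("positive rate (train):", y_tr.mean())
print("positive rate (test): ", y_te.mean())
```

Applied to the split above, this would amount to passing `stratify=y` to `train_test_split`; note that doing so changes which rows land in the test set, so the scores reported later would shift.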

## Decision Tree

Let's begin with a simple base classifier, a decision tree. Then we will move on to a random forest :).

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

base_clf = DecisionTreeClassifier()

We now try different depths for the decision tree to choose a good one.

In [None]:
from sklearn.model_selection import GridSearchCV
tree_params = {'max_depth': np.arange(1, 11, 1)}
gs_base = GridSearchCV(base_clf, tree_params, n_jobs=-1, verbose=1)

In [None]:
gs_base = gs_base.fit(X_train, y_train)
base_pred = gs_base.predict(X_test)
# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test, base_pred))

print('best depth of decision tree: %d' %gs_base.best_params_['max_depth'] )
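
Besides the single best depth, `GridSearchCV` keeps the mean cross-validated score of every candidate in `cv_results_`, which shows how sensitive the score is to the depth. The sketch below runs on a synthetic dataset from `make_classification` (an assumption, not the fraud data) and passes `scoring="f1"` explicitly; without a `scoring` argument a classifier is scored by accuracy, which can look high on imbalanced data even for a weak model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (hypothetical data): roughly 10% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Search depths 1..5, scoring each candidate by f1 on the positive class.
gs_demo = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": np.arange(1, 6)},
    scoring="f1",
    cv=3,
)
gs_demo.fit(X_demo, y_demo)

# One entry per candidate, in the same order as cv_results_["params"].
results = gs_demo.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    print("max_depth=%d  mean CV f1=%.3f" % (params["max_depth"], score))

print("best depth:", gs_demo.best_params_["max_depth"])
```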

So the best decision tree has a depth of **5** levels and achieves an f1-score of 0.84.
Now we move on to random forests.

## Random Forest

In [None]:
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)
y_pred = rfc.predict(X_test) 

# classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test, y_pred))

Now we use **grid search** to explore which random forest settings give the best result.

Because grid search can be expensive, we run the fits in parallel by setting the `n_jobs` parameter.
With `n_jobs=-1`, grid search uses all available CPU cores.
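
To get a feel for why parallelism helps, we can count the model fits a grid implies. This is a rough sketch: the grid mirrors the one used in the next cell, while the 5 folds are an assumption (the default `cv` depends on the scikit-learn version, and no `cv` is set in this notebook).

```python
import os
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as the random forest search: 2 criteria x 4 forest sizes.
rf_params = {'criterion': ('gini', 'entropy'),
             'n_estimators': np.arange(5, 25, 5)}

n_candidates = len(ParameterGrid(rf_params))
n_folds = 5  # assumption: 5-fold cross-validation
n_fits = n_candidates * n_folds  # plus one final refit on the full training set

print("candidates:", n_candidates)
print("cross-validation fits:", n_fits)
print("CPU cores visible to Python:", os.cpu_count())
```

Each of these fits is independent of the others, which is why they can be spread across cores.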

In [None]:

rf_params = {'criterion': ('gini', 'entropy'), 
             'n_estimators': np.arange(5, 25, 5) } # 'max_depth': np.arange(1, 11, 1)
gs_rfc = GridSearchCV(rfc, rf_params, n_jobs=-1, verbose=1)

gs_rfc = gs_rfc.fit(X_train, y_train)
gs_rfc_pred = gs_rfc.predict(X_test)
print(classification_report(y_test, y_pred=gs_rfc_pred))

for param_name in rf_params.keys():
    print('%s %r' %(param_name, gs_rfc.best_params_[param_name]))

From the result, we can see quite an improvement in the precision of fraud predictions (an increase of **16%**). The f1-score also increases slightly (by 2%).
The best setting is a random forest with:

 - **20** decision trees,
 - the **gini** splitting criterion.
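
To compare models on the fraud class alone, the per-class numbers can also be computed directly instead of being read off the report. Here is a small sketch on hand-made labels; the arrays and the `_demo` names are hypothetical and are not the notebook's results. Like `classification_report`, these functions take `(y_true, y_pred)` in that order: reversing the arguments swaps precision and recall and leaves the f1-score unchanged, so precision figures from two reports are only comparable when both were called the same way.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = fraud, 0 = normal.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 0]

# Correct order: 2 true positives, 2 false positives, 1 false negative.
precision = precision_score(y_true_demo, y_pred_demo)  # 2 / (2 + 2)
recall = recall_score(y_true_demo, y_pred_demo)        # 2 / (2 + 1)
f1 = f1_score(y_true_demo, y_pred_demo)

# Reversed order: precision and recall trade places, f1 stays the same.
precision_swapped = precision_score(y_pred_demo, y_true_demo)
recall_swapped = recall_score(y_pred_demo, y_true_demo)
f1_swapped = f1_score(y_pred_demo, y_true_demo)

print("correct order : precision=%.3f recall=%.3f f1=%.3f" % (precision, recall, f1))
print("reversed order: precision=%.3f recall=%.3f f1=%.3f"
      % (precision_swapped, recall_swapped, f1_swapped))
```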