#### In this notebook we will work with DecisionTreeClassifier


In [19]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
from sklearn.model_selection import GridSearchCV

%matplotlib inline
SEED = 42

In [20]:
# Read the processed file
df = pd.read_csv('../data/processed_balanced_transaction.csv')
df.shape

(872136, 34)

#### Let's separate the label from the features


In [21]:
X, y = df.drop(['isFraud'],axis=1), df['isFraud']

#### Split data into train and test sets
We split the data into train and test sets with a test ratio of 30%, that is, 30% test data and 70% train data.

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)
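
To make the 70/30 split concrete, here is a minimal pure-Python sketch of a shuffled split. The helper `simple_train_test_split` and the toy `rows` list are hypothetical names used only for illustration; this is not how `train_test_split` is implemented internally, just the same idea on a small scale.

```python
import random

def simple_train_test_split(rows, test_size=0.30, seed=42):
    """Shuffle the row indices, then cut them into a train part and a test part."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # a fixed seed makes the split reproducible
    n_test = round(len(rows) * test_size)     # number of rows held out for testing
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))                       # toy data: 100 row ids
train_rows, test_rows = simple_train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Every row ends up in exactly one of the two parts, so the model is never evaluated on a row it was trained on.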

#### Create a DecisionTree classifier and fit it on all features

In [23]:
dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=SEED)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

In [24]:
from sklearn import metrics
target_names = ['Not Fraud', 'Fraud']
print(metrics.classification_report(y_test, y_pred, digits=3, target_names=target_names))

              precision    recall  f1-score   support

   Not Fraud      0.667     0.706     0.686    130734
       Fraud      0.688     0.648     0.667    130907

    accuracy                          0.677    261641
   macro avg      0.678     0.677     0.677    261641
weighted avg      0.678     0.677     0.677    261641
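
To read the report above, it helps to see how precision, recall and f1-score are derived from counts. The sketch below uses made-up toy labels (`y_true_toy`, `y_pred_toy`) and a hypothetical helper `precision_recall_f1`; it is only meant to show the formulas, not to replace `classification_report`.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration (1 = Fraud, 0 = Not Fraud)
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true_toy, y_pred_toy)
print(precision, recall, f1)

# The same F1 formula applied to the Fraud row of the report (precision 0.688, recall 0.648)
f1_fraud_row = 2 * 0.688 * 0.648 / (0.688 + 0.648)
print(round(f1_fraud_row, 3))
```

The last line reproduces the 0.667 f1-score shown for the Fraud class, since F1 is the harmonic mean of precision and recall.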



#### Convert to standard scale and fit again

In [25]:
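# Split first, then fit the scaler on the training rows only to avoid leaking test statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=SEED)
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)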
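
As a rough illustration of what standard scaling computes, the sketch below standardizes one toy feature column by hand, assuming the mean and standard deviation are learned from the training values and then reused for the test values. The names `fit_standardizer`, `standardize`, `train_amounts` and `test_amounts` are hypothetical and the numbers are made up.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and (population) standard deviation of one feature column."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0   # fall back to 1.0 for a constant column
    return mu, sigma

def standardize(values, mu, sigma):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]

train_amounts = [10.0, 20.0, 30.0, 40.0]   # toy training column
test_amounts = [25.0, 50.0]                # toy test column

mu, sigma = fit_standardizer(train_amounts)            # statistics from train only
train_scaled = standardize(train_amounts, mu, sigma)
test_scaled = standardize(test_amounts, mu, sigma)     # reuse the train statistics
print(train_scaled, test_scaled)
```

The scaled training column has mean 0 and standard deviation 1, and the ordering of the values is unchanged, which matters for threshold-based models such as decision trees.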

In [26]:
dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=SEED)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)

In [27]:
print(metrics.classification_report(y_test, y_pred, digits=3, target_names=target_names))

              precision    recall  f1-score   support

   Not Fraud      0.667     0.706     0.686    130734
       Fraud      0.688     0.648     0.667    130907

    accuracy                          0.677    261641
   macro avg      0.678     0.677     0.677    261641
weighted avg      0.678     0.677     0.677    261641

The scores are identical to the unscaled run. This is expected: a decision tree splits on per-feature thresholds, and standardization does not change the ordering of values within a feature.



#### Cross validation
We are going to apply k-fold cross-validation.

It splits the original data set into k subsets, uses one subset as the testing set and the remaining subsets as the training set. This process is repeated k times until every subset has been used as the testing set. Since 10-fold cross-validation is a popular choice, we use it here.
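
The fold rotation described above can be sketched in a few lines of plain Python. The generator `k_fold_indices` is a hypothetical helper written for this illustration (no shuffling, contiguous folds); `cross_val_score` handles the splitting itself, so this is only meant to show the idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold serves as the test set exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=25, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), test_idx)
```

Averaging the score over all k test folds gives a more stable estimate than a single train/test split.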

In [28]:
dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=SEED)
cv_scores = cross_val_score(dt_classifier, X_train, y_train, cv=10)
print('Average score: {}'.format(round(np.mean(cv_scores),3)))

Average score: 0.677


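The cross-validated accuracy (0.677) matches the hold-out accuracy. Cross-validation does not change the model; it gives a more reliable estimate of its performance, so no improvement is expected here.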

#### Parameter Tuning

A classifier has parameters that can be tuned to improve the classification.
In DecisionTreeClassifier we can tune:

- max_depth (the maximum depth of the tree)
- max_features (the number of features considered when looking for the best split)
- criterion (the function used to measure the quality of a split)
- splitter (the strategy used to choose the split at each node)
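
To give some intuition for the `criterion` options, the sketch below computes Gini impurity and entropy for two toy sets of labels. The helpers `gini` and `entropy` and the toy lists are written for this illustration only; they follow the standard textbook formulas and are not taken from the scikit-learn source.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

balanced_node = [0, 1, 0, 1]   # toy node with a 50/50 class mix
pure_node = [1, 1, 1, 1]       # toy node containing a single class

print(gini(balanced_node), entropy(balanced_node))
print(gini(pure_node), entropy(pure_node))
```

Both measures are highest for the evenly mixed node and zero for the pure node, so a split is considered good when it produces child nodes with lower impurity.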

Grid search evaluates every combination of the listed parameter values with cross-validation and keeps the combination with the best average score.
Let's use grid search to find the best parameters.
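
To see how much work an exhaustive search implies, the sketch below enumerates the combinations of the grid used in this notebook with `itertools.product`. The names `parameter_grid_example`, `combinations` and `total_fits` are illustrative, and the fit count assumes one model fit per combination per fold (plus a final refit of the best model, which `GridSearchCV` does by default).

```python
from itertools import product

# Same parameter values as the grid searched in this notebook
parameter_grid_example = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [4, 5, 6],
    'max_features': [15, 20, 25],
}

# Build one dict per combination of parameter values
names = list(parameter_grid_example)
combinations = [
    dict(zip(names, values))
    for values in product(*parameter_grid_example.values())
]

n_folds = 10                                   # 10-fold cross-validation
total_fits = len(combinations) * n_folds       # fits during the search itself
print(len(combinations), total_fits)
print(combinations[0])
```

With 2 x 2 x 3 x 3 = 36 combinations and 10 folds, the search trains 360 trees, which is why grids are usually kept small on a data set of this size.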


In [31]:
from sklearn.model_selection import StratifiedKFold

dt_classifier = DecisionTreeClassifier()

parameter_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [4, 5, 6],
    'max_features': [15, 20, 25],
}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(dt_classifier, param_grid=parameter_grid, cv=cross_validation)

grid_search.fit(X_train, y_train)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

best_dt_classifier = grid_search.best_estimator_

Best score: 0.6867722097643963
Best parameters: {'criterion': 'gini', 'max_depth': 6, 'max_features': 25, 'splitter': 'best'}


The best score improves slightly (0.687 versus 0.677). Let's fit the best model and evaluate it on the test set.


In [32]:
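# GridSearchCV already refits best_estimator_ on X_train (refit=True by default), so this fit is optional
best_dt_classifier.fit(X_train, y_train)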
y_pred = best_dt_classifier.predict(X_test)
print(metrics.classification_report(y_test, y_pred, digits=3, target_names=target_names))

              precision    recall  f1-score   support

   Not Fraud      0.695     0.670     0.682    130734
       Fraud      0.682     0.707     0.694    130907

    accuracy                          0.688    261641
   macro avg      0.689     0.688     0.688    261641
weighted avg      0.689     0.688     0.688    261641



#### Save best model 

In [33]:
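import pickle

# Map predicted class ids to readable labels
lookup = {1: 'Fraud', 0: 'Not Fraud'}

# Save the tuned model, then load it back; `with` closes the files reliably
with open('../saved_models/dtc_model.pkl', 'wb') as f:
    pickle.dump(best_dt_classifier, f)
with open('../saved_models/dtc_model.pkl', 'rb') as f:
    model = pickle.load(f)

# Test the loaded model on a single test row
pred = model.predict([X_test[0]])
lookup[pred[0]]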

'Not Fraud'
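
The save-and-load round trip can be tried in isolation without touching `../saved_models/`. The sketch below pickles a plain dict, `fake_model`, as a hypothetical stand-in for the fitted classifier and writes it to a temporary directory; the file name is reused only for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model (a real run would pickle the classifier)
fake_model = {'criterion': 'gini', 'max_depth': 6}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'dtc_model.pkl')

    # Serialize the object to disk
    with open(model_path, 'wb') as f:
        pickle.dump(fake_model, f)

    # Load it back into a new object
    with open(model_path, 'rb') as f:
        restored_model = pickle.load(f)

print(restored_model == fake_model)
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.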

#### Code Reference

- https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
- https://medium.com/@haydar_ai/learning-data-science-day-22-cross-validation-and-parameter-tuning-b14bcbc6b012
- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html