Hi guys, this is my first notebook on Kaggle. I hope you will find it interesting and I hope you will forgive my beginner mistakes. 

In this notebook we are going to see the effect of synthetic data generation techniques on the class-imbalance problem.

The synthetic generation techniques we are going to use are: 
- SMOTE
- SMOTE ENN 
- SMOTE TOMEK
- ADASYN 

The open-source implementations of these techniques can be found in the imblearn (imbalanced-learn) library:
http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
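For intuition, here is a minimal pure-Python sketch of the interpolation idea behind SMOTE: a synthetic minority point is placed at a random position on the segment between a minority point and one of its nearest minority neighbours. This is not the imblearn implementation; the function name `smote_like_samples` and the toy points are assumptions made for illustration. ADASYN builds on the same interpolation but generates more points around minority samples that are harder to learn, while SMOTE ENN and SMOTE Tomek add a cleaning step after the oversampling.

```python
import math
import random


def smote_like_samples(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority-class neighbours.
    (Hypothetical helper, not part of imblearn.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        # Random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy 2-D "fraud" points, made up for this example
fraud_points = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like_samples(fraud_points, n_new=6, k=2)
print(new_points)
```

Because every synthetic point is a convex combination of two real minority points, it always stays inside the region already covered by the minority class.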

The models that we are going to use are: 
- Logistic Regression
- Random Forest 
- Gradient Boosting 

In the end we are going to tackle the problem of the imbalanced dataset by adopting a reweighting solution. 

Spoiler alert, the best obtained results are: 
- Gradient Boosting + ADASYN with an AUC = 0.948922348202
- Logistic Regression + ADASYN with an AUC = 0.946214132108

Logistic Regression seems to be the model that performs best across the tests, probably because of the scarce amount of data. 

ADASYN seems to be the best synthetic data generation technique for this problem. 



In [1]:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc
from imblearn.metrics import classification_report_imbalanced
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import ADASYN 
from imblearn.combine import SMOTEENN 
from imblearn.combine import SMOTETomek 
from sklearn import ensemble


# Load the dataset
credit_cards = pd.read_csv("../input/creditcard.csv")

# Divide labels and features
labels = credit_cards['Class']
features = credit_cards.drop('Class', axis=1)

# Train-test split
features_train, features_test, labels_train, labels_test = train_test_split(features,
                                                                            labels,
                                                                            test_size=0.2,
                                                                            random_state=1234)

fraudulent_transactions_amount = len(labels_train[labels_train == 1])
genuine_transactions_amount = len(labels_train[labels_train == 0])
ratio = genuine_transactions_amount / fraudulent_transactions_amount

# Check how imbalanced the training set is
print('Number of fraudulent transactions = {}'.format(fraudulent_transactions_amount))
print('Number of genuine transactions = {}'.format(genuine_transactions_amount))
print('The ratio between genuine and fraudulent transactions is approximately {}:1'.format(round(ratio)))


A ratio of 577:1 is a severe imbalance. This means that classical evaluation measures such as accuracy are not useful here, so we use AUC as the evaluation criterion instead. A classifier that predicts every transaction as genuine would have an accuracy of about 99.8% but would be quite useless, right? 
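To make the point concrete, here is a small sketch with toy labels that follow the same 577:1 ratio. The helper names `accuracy` and `recall` are defined here only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the positive (fraud) cases that were detected."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    total = sum(1 for t in y_true if t == positive)
    return hits / total


# Toy labels: 577 genuine transactions for every fraud
y_true = [0] * 577 + [1]
# A "classifier" that always answers genuine
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(round(baseline_accuracy, 4))  # 0.9983
print(baseline_recall)              # 0.0
```

The accuracy looks excellent, yet not a single fraud is caught, which is why the rest of the notebook reports AUC and the confusion matrix.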

# Augmented Dataset Generation: 

In [2]:
# Let's first instantiate the synthetic data generators

sm = SMOTE(random_state=42)
ada = ADASYN(random_state=42)
smnn = SMOTEENN(random_state=42)
smt = SMOTETomek(random_state=42)

# Generate the new datasets
# (fit_resample is the current name of the method that older imblearn releases called fit_sample)
features_train_sm, labels_train_sm = sm.fit_resample(features_train, labels_train)
features_train_ada, labels_train_ada = ada.fit_resample(features_train, labels_train)
features_train_smnn, labels_train_smnn = smnn.fit_resample(features_train, labels_train)
features_train_smt, labels_train_smt = smt.fit_resample(features_train, labels_train)

# Let's see if the classes are balanced now
print(len(labels_train_sm[labels_train_sm == 1]))
print(len(labels_train_sm[labels_train_sm == 0]))

Now that the training set is balanced we can start training our models: 

## Logistic regression

In [None]:
#LOGISTIC REGRESSION  WITHOUT DATA AUGMENTATION

#train Logistic Regression
lr = LogisticRegression()
lr.fit(features_train,labels_train)

# predictions on test set
predictions=lr.predict(features_test)

#compute and print confusion matrix
cm = confusion_matrix(labels_test,predictions)

false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)

#measure and print auc 
auc_res = auc(false_positive_rate, true_positive_rate)

target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print ('The obtained AUC is {}'.format(auc_res))
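
A note on how the AUC is computed: the cells in this notebook pass the hard 0/1 predictions to `roc_curve`. AUC is usually computed from continuous scores, and the two can differ. The sketch below uses a hand-written pairwise AUC (the function `pairwise_auc` and the toy scores are assumptions for illustration) to show the difference on six made-up transactions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (fraud, genuine) pairs in which the fraud
    gets the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 0, 0, 1, 1]
# Hypothetical fraud probabilities from some model
fraud_probability = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
# Hard labels obtained by thresholding at 0.5, as predict() would do
hard_labels = [int(p >= 0.5) for p in fraud_probability]

auc_from_scores = pairwise_auc(y_true, fraud_probability)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores)  # 0.875
print(auc_from_labels)  # 0.625
```

With hard labels the AUC collapses to the average of the true positive rate and the true negative rate at a single threshold. In scikit-learn the continuous scores would typically come from `lr.predict_proba(features_test)[:, 1]`; using them in the cells here would change the AUC values reported in this notebook.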

In [None]:
# Logistic Regression + SMOTE
lr = LogisticRegression()
lr.fit(features_train_sm,labels_train_sm)

# predictions on test set
predictions=lr.predict(features_test)

#compute and print confusion matrix
cm = confusion_matrix(labels_test,predictions)

target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)
false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)

#measure and print auc 
auc_res = auc(false_positive_rate, true_positive_rate)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print ('The obtained AUC is {}'.format(auc_res))

# true negatives is C_{0,0}, false negatives is C_{1,0}, true positives is C_{1,1} and false positives is C_{0,1}.

False positives (903) are genuine transactions classified as fraud.
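
The layout described above (row = true class, column = predicted class) can be sketched in a few lines of plain Python. The helper `confusion_counts` and the toy labels are made up for illustration; for binary labels, scikit-learn's `confusion_matrix(...).ravel()` returns the counts in the same order.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means fraud."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1          # genuine, predicted genuine
        elif t == 0 and p == 1:
            fp += 1          # genuine, predicted fraud
        elif t == 1 and p == 0:
            fn += 1          # fraud, predicted genuine
        else:
            tp += 1          # fraud, predicted fraud
    return tn, fp, fn, tp


y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)            # [[3, 2], [1, 2]]

precision = tp / (tp + fp)   # share of fraud alerts that are real frauds
recall = tp / (tp + fn)      # share of real frauds that raise an alert
print(precision, recall)
```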

In [None]:
#LOGISTIC REGRESSION + SMOTE ENN 

#train Logistic Regression
lr = LogisticRegression()
lr.fit(features_train_smnn,labels_train_smnn)

# predictions on test set
predictions=lr.predict(features_test)

#compute and print confusion matrix
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)

#measure and print auc 
auc_res = auc(false_positive_rate, true_positive_rate)

target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print ('The obtained AUC is {}'.format(auc_res))
# true negatives is C_{0,0}, false negatives is C_{1,0}, true positives is C_{1,1} and false positives is C_{0,1}.

In [None]:
#LOGISTIC REGRESSION + ADASYN

#train Logistic Regression
lr = LogisticRegression()
lr.fit(features_train_ada,labels_train_ada)

# predictions on test set
predictions=lr.predict(features_test)

#compute and print confusion matrix
cm = confusion_matrix(labels_test,predictions)

false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)

#measure and print auc 
auc_res = auc(false_positive_rate, true_positive_rate)

target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print ('The obtained AUC is {}'.format(auc_res))

In [None]:
# LOGISTIC REGRESSION + REWEIGHTING (and no data augmentation) 

lr = LogisticRegression()

param_grid = {
    'class_weight': [None, {1:5, 0:1}, {1:10, 0:1}, {1:100, 0:1}, {1:700, 0:1},
                     {1:1000, 0:1}, {1:10000, 0:1}, {1:100000, 0:1}]
}

CV_lr = GridSearchCV(estimator=lr, param_grid=param_grid, scoring='roc_auc')


#train logistic regression 


CV_lr.fit(features_train,labels_train)

print('Best parameters found by grid search are:', CV_lr.best_params_)

best_lr = CV_lr.best_estimator_
predictions=best_lr.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

As you can see, Logistic Regression performed pretty well.
Using data augmentation techniques and reweighting boosts the performance of Logistic Regression, raising the AUC from 0.81 to about 0.95. 
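
As a reference point for the class weights tried in the reweighting grid, the sketch below computes weights that are inversely proportional to the class frequencies, using the formula n_samples / (n_classes * class_count). This is the heuristic behind scikit-learn's `class_weight='balanced'` option, which this notebook does not use; the helper `balanced_class_weights` and the toy labels are assumptions for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy labels with the same 577:1 ratio as the training set
labels = [0] * 577 + [1]
weights = balanced_class_weights(labels)
print(weights)

# Relative weight of a fraud compared with a genuine transaction
relative_weight = weights[1] / weights[0]
print(round(relative_weight))  # 577
```

With these weights both classes contribute the same total weight to the loss, so a grid entry such as `{1:700, 0:1}` sits close to the fully balanced setting for this dataset.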

## Random Forest

In [None]:
#use random forest + SMOTE


#train random forest


rfc=RandomForestClassifier(n_jobs=2)

param_grid = {
    'n_estimators': [10, 100, 1000],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc')


#train random forest


CV_rfc.fit(features_train_sm,labels_train_sm)

print('Best parameters found by grid search are:', CV_rfc.best_params_)

best_rfc = CV_rfc.best_estimator_

predictions=best_rfc.predict(features_test)

cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

Compared with Logistic Regression + SMOTE, we notice that far fewer genuine transactions are classified as fraud (16 instead of 903), but on the other hand more fraudulent transactions are classified as genuine (22 instead of 11).
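
One caveat about the grid searches in this section: they cross-validate on data that has already been oversampled, so synthetic points derived from a validation sample can end up in the corresponding training fold. A common way around this is to resample inside each fold, on the training part only. The pure-Python sketch below illustrates the idea with plain random oversampling as a stand-in for SMOTE; the function names and toy labels are assumptions for illustration.

```python
import random


def oversample_minority(train_idx, labels, rng):
    """Duplicate random minority-class indices until both classes have the
    same size. Plain random oversampling, standing in for SMOTE here."""
    pos = [i for i in train_idx if labels[i] == 1]
    neg = [i for i in train_idx if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return train_idx + extra


def stratified_folds_with_resampling(labels, n_folds=3, seed=0):
    """Yield (balanced_train_idx, valid_idx); only the training part is resampled."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    for fold in range(n_folds):
        # Each validation fold keeps the original class ratio and is never resampled
        valid_idx = pos[fold::n_folds] + neg[fold::n_folds]
        held_out = set(valid_idx)
        train_idx = [i for i in pos + neg if i not in held_out]
        yield oversample_minority(train_idx, labels, rng), valid_idx


# Toy labels: 30 genuine transactions and 6 frauds
labels = [0] * 30 + [1] * 6
for train_idx, valid_idx in stratified_folds_with_resampling(labels):
    n_fraud = sum(labels[i] for i in train_idx)
    print(len(train_idx), n_fraud, len(valid_idx))
```

With imbalanced-learn the same effect can be obtained by wrapping the sampler and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `GridSearchCV`, since the pipeline applies samplers only while fitting.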

In [None]:
#use random forest + ADASYN

#train random forest
rfc=RandomForestClassifier(n_jobs=2)

param_grid = {
    'n_estimators': [10, 100, 1000],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc')


#train random forest


CV_rfc.fit(features_train_ada,labels_train_ada)

print('Best parameters found by grid search are:', CV_rfc.best_params_)

best_rfc = CV_rfc.best_estimator_

predictions=best_rfc.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

In [None]:
#use random forest + SMOTE ENN 

#train random forest
rfc=RandomForestClassifier(n_jobs=2)

param_grid = {
    'n_estimators': [10, 100, 1000],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc')


#train random forest


CV_rfc.fit(features_train_smnn,labels_train_smnn)

print('Best parameters found by grid search are:', CV_rfc.best_params_)

best_rfc = CV_rfc.best_estimator_

predictions=best_rfc.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

In [None]:
#use random forest + SMOTE TOMEK

#train random forest
rfc=RandomForestClassifier(n_jobs=2)

param_grid = {
    'n_estimators': [10, 100, 1000],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc')


#train random forest


CV_rfc.fit(features_train_smt,labels_train_smt)

print('Best parameters found by grid search are:', CV_rfc.best_params_)

best_rfc = CV_rfc.best_estimator_

predictions=best_rfc.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

In [None]:
#use random forest + reweighting

#train random forest
rfc=RandomForestClassifier(n_jobs=2, n_estimators = 1000, max_features = 'log2')

param_grid = {
    'class_weight': [{1:3 , 0:1}, {1:5, 0:1}, {1:10, 0:1}]
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, scoring='roc_auc')


#train random forest


CV_rfc.fit(features_train,labels_train)

print('Best parameters found by grid search are:', CV_rfc.best_params_)

best_rfc = CV_rfc.best_estimator_

predictions=best_rfc.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud'] 
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

## Gradient Boosting

In [None]:
#GRADIENT BOOSTING WITHOUT DATA AUGMENTATION

params = {'n_estimators': 1200, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf = ensemble.GradientBoostingClassifier(**params)

param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, scoring='roc_auc')

#train


CV_clf.fit(features_train,labels_train)

print('Best parameters found by grid search are:', CV_clf.best_params_)

best_clf = CV_clf.best_estimator_

predictions=best_clf.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud']
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

In [None]:
#GRADIENT BOOSTING + ADASYN

params = {'n_estimators': 1200, 'subsample': 0.5,
          'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
clf2 = ensemble.GradientBoostingClassifier(**params)

param_grid = {
    'max_features': ['sqrt', 'log2']  # 'auto' (an alias of 'sqrt' here) was removed in scikit-learn 1.3
}

CV_clf2 = GridSearchCV(estimator=clf2, param_grid=param_grid, scoring='roc_auc')

#train


CV_clf2.fit(features_train_ada,labels_train_ada)

print('Best parameters found by grid search are:', CV_clf2.best_params_)

best_clf2 = CV_clf2.best_estimator_

predictions=best_clf2.predict(features_test)
cm = confusion_matrix(labels_test,predictions)


false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test,predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)


target_names = ['genuine', 'fraud']
report = classification_report_imbalanced(labels_test, predictions, target_names=target_names)

print('Confusion Matrix:')
print(cm)
print('Report:')
print(report)
print('The obtained AUC is {}'.format(roc_auc))

I am not going to try Gradient Boosting with SMOTE ENN and SMOTE Tomek because ADASYN seems to be the best. 

# Conclusions

From these experiments we learned that using a synthetic data generation technique drastically increases the performance of our learners. The reweighting technique seems to work for Logistic Regression and seems to perform poorly with Random Forest. It apparently cannot be applied to Gradient Boosting in the same way, since scikit-learn's GradientBoostingClassifier does not take a class_weight parameter. 
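
Although `GradientBoostingClassifier` has no `class_weight` parameter, its `fit` method accepts per-sample weights, which gives a possible route to reweighting. The sketch below was not part of the experiments above and says nothing about how well it would score on this dataset; it uses a synthetic imbalanced dataset from `make_classification` (an assumption, not the credit card data) and scikit-learn's `compute_sample_weight`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in data: roughly 97% class 0 and 3% class 1
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)

# One weight per sample, inversely proportional to its class frequency
sample_weight = compute_sample_weight(class_weight='balanced', y=y)

clf = GradientBoostingClassifier(n_estimators=50, random_state=3)
clf.fit(X, y, sample_weight=sample_weight)

# Continuous scores for the positive class, usable for ROC/AUC
fraud_scores = clf.predict_proba(X)[:, 1]
print(fraud_scores[:5])
```

The weights make both classes carry the same total weight during training, which mirrors what `class_weight` does for Logistic Regression and Random Forest.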

Thank you for reading and I hope you enjoyed it!