The goal of this assignment is to identify fraudulent transactions from credit card data. The dataset is highly imbalanced, meaning the positive class (fraudulent transactions) only accounts for about 0.2% of the training data. This assignment will help you develop strategies for dealing with imbalanced datasets.
Dataset description:
The dataset contains credit card transactions from 2013. Due to privacy concerns, the original features have been transformed using Principal Component Analysis (PCA). As a result, nearly all of the features have no intrinsic meaning. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount. The feature 'Class' is the target variable; it takes a value of 1 in case of fraud and 0 otherwise.

Training set: https://wikispaces.psu.edu/download/attachments/395383213/credit_card_train.csv?api=v2
Test set: https://wikispaces.psu.edu/download/attachments/395383213/credit_card_test.csv?api=v2
 
Evaluation metrics:
For all experiments, use the following evaluation metric:
Area under the receiver operating characteristic curve (AUC score) (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)
The final test set performance will also be evaluated using the AUC score. 
Instructions:
Q1) Using 5-fold cross-validation, implement a very naive baseline classifier where the majority class (no fraud) is predicted for each sample. Report the mean and standard deviation of the AUC score in a table.
Q2) Using 5-fold cross-validation, perform hyperparameter tuning and model selection. Evaluate each of the following models:
Random forest
XGBoost
SVM
KNN
Naive Bayes
Report the following:
Provide the hyperparameter values tried, as well as the mean and standard deviation for the AUC score. Tune the following hyperparameters:
Random forest: n_estimators
XGBoost: learning rate
SVM: C (regularization penalty)
KNN: number of neighbors
Q3) Retrain the models from Q2 using cross-validation. This time train each model using the SMOTE algorithm. 
Tune the same parameters as the previous section.
Report the mean and standard deviation for the AUC score for each model with SMOTE in a table.
Q4) Identify the best performing model from all previous questions. Using 5-fold cross-validation, plot the full ROC curve (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) for each validation fold. There should be five figures in total.
Q5) Retrain the best performing model from all previous questions on all the training data. Predict on the test data.
Describe the model selection process you used. 
Which model and why? 
Did you use oversampling?
What hyperparameter values did you select?
Describe the performance of your chosen model and parameters on the training data.

In [None]:
import os
import urllib.request
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
URL = 'https://wikispaces.psu.edu/download/attachments/395383213/credit_card_train.csv?api=v2'
LOCAL_FOLDER = 'datasets'
CC_TRAIN = 'credit_card_train.csv'
CC_TRAIN_PATH = os.path.join(LOCAL_FOLDER, CC_TRAIN)
os.makedirs(LOCAL_FOLDER, exist_ok = True)
urllib.request.urlretrieve(URL, CC_TRAIN_PATH)
URL = 'https://wikispaces.psu.edu/download/attachments/395383213/credit_card_test.csv?api=v2'
LOCAL_FOLDER = 'datasets'
CC_TEST = 'credit_card_test.csv'
CC_TEST_PATH = os.path.join(LOCAL_FOLDER, CC_TEST)
os.makedirs(LOCAL_FOLDER, exist_ok = True)
urllib.request.urlretrieve(URL, CC_TEST_PATH)
def load_cc_train(cc_train_path=CC_TRAIN_PATH):
    #read the downloaded training CSV into a DataFrame
    return pd.read_csv(cc_train_path)
def load_cc_test(cc_test_path=CC_TEST_PATH):
    #read the downloaded test CSV into a DataFrame
    return pd.read_csv(cc_test_path)
cc_train = load_cc_train()
cc_test = load_cc_test()

In [None]:
#count the non-fraud rows; len() of the boolean mask would return the total row count instead
print((cc_train['Class'] == 0).sum())

In [None]:
cc_train.iloc[:,:30]
cc_train[['Class']]
#need to normalize data using StandardScaler
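
The note above about normalizing matters most for distance-based and margin-based models such as KNN and SVM. The following is a minimal sketch, not the notebook's own method, of putting `StandardScaler` in a pipeline so that the scaler is fit on the training folds only. The synthetic data from `make_classification`, the inflated first column, and the names `X_demo`, `y_demo`, and `scaled_knn` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the credit card data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
# Inflate one column so the features sit on very different scales,
# loosely mimicking 'Amount' next to the PCA components.
X_demo[:, 0] *= 1000.0

# The scaler is refit inside every training fold, so no statistics
# from the validation fold leak into the preprocessing step.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scaled_scores = cross_val_score(scaled_knn, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("Mean AUC:", scaled_scores.mean())
print("Standard Deviation:", scaled_scores.std())
```

The same pipeline object can be passed to `GridSearchCV`; the parameter name then takes the step prefix, for example `kneighborsclassifier__n_neighbors`.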

In [None]:
cc_test.info()

In [6]:
from sklearn.model_selection import train_test_split
X = cc_train.iloc[:,:30]
y = cc_train[['Class']]


X_train, X_test, y_train, y_test = train_test_split(X,y.values.ravel(), test_size = 0.2, random_state = 42)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

182276 45569
182276 45569
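
With only about 0.2% positives, a plain random split can leave the hold-out set with noticeably more or fewer fraud cases than expected. A small sketch, assuming synthetic data and hypothetical names (`X_demo`, `y_demo`), of how the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 20 positives (0.2%), an assumption for illustration.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10000, 5))
y_demo = np.zeros(10000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo splits each class 80/20 separately.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positives in train:", y_tr.sum(), "of", len(y_tr))
print("Positives in test:", y_te.sum(), "of", len(y_te))
```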


In [None]:
X

In [29]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

model = DummyClassifier(strategy='most_frequent')
model.fit(X_train, y_train)

DummyClassifier(strategy='most_frequent')

In [30]:
from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print('Mean:', scores.mean())
    print('Standard Deviation:', scores.std())
    

dummy_train_scores = cross_val_score(DummyClassifier(strategy = 'most_frequent'),
                                                     X_train, y_train,
                                                     scoring = 'accuracy', cv=5)
display_scores(dummy_train_scores)
prediction = model.predict(X_test)
accuracy_score(y_test, prediction)

Scores: [0.99829932 0.9983267  0.9983267  0.9983267  0.9983267 ]
Mean: 0.998321227270225
Standard Deviation: 1.0953771166910543e-05


0.9981566415765103

In [None]:
from sklearn.metrics import auc
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score


y_pred_prob = model.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_prob)

In [None]:
from sklearn.model_selection import cross_val_score

dummy_auc = cross_val_score(model, X_test, y_test, cv=5, scoring='roc_auc')
def display_scores(scores):
    print("Scores:", scores)
    print('Mean:', scores.mean())
    print('Standard Deviation:', scores.std())
display_scores(dummy_auc)
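
As a sanity check on the baseline, a majority-class predictor assigns the same score to every sample, so it cannot rank fraud above non-fraud and its AUC is 0.5 regardless of the class balance. A minimal sketch on synthetic data (the data and the names `X_demo`, `y_demo`, and `baseline_auc` are assumptions, not the notebook's variables):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with 1% positives (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 4))
y_demo = np.zeros(5000, dtype=int)
y_demo[:50] = 1

# Stratified folds guarantee that every validation fold contains both classes,
# which roc_auc needs in order to be defined.
baseline = DummyClassifier(strategy="most_frequent")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
baseline_auc = cross_val_score(baseline, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("Scores:", baseline_auc)
print("Mean:", baseline_auc.mean())
print("Standard Deviation:", baseline_auc.std())
```

This contrasts with the accuracy of the same baseline, which is close to 1.0 on imbalanced data and is therefore a poor indicator of model quality here.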

In [2]:
from sklearn.ensemble import RandomForestClassifier
import xgboost
from sklearn import svm
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB



xgb = XGBClassifier()
rf = RandomForestClassifier()
svc = svm.SVC(probability=True)
kn = KNeighborsClassifier()
gnb = GaussianNB()

In [None]:
xgb_model = xgb.fit(X_train, y_train)
rf_model = rf.fit(X_train, y_train)
svc_model = svc.fit(X_train, y_train)
kn_model = kn.fit(X_train,y_train)
gnb_model = gnb.fit(X_train, y_train)

In [None]:
#AUC scores for validation set using 5 fold cross-validation
cv_auc1 = cross_val_score(xgb_model, X_test, y_test, cv=5,scoring='roc_auc')
cv_auc2 = cross_val_score(rf_model, X_test, y_test, cv=5,scoring='roc_auc')
cv_auc3 = cross_val_score(svc_model, X_test, y_test, cv=5,scoring='roc_auc')
cv_auc4 = cross_val_score(kn_model, X_test, y_test, cv=5,scoring='roc_auc')
cv_auc5 = cross_val_score(gnb_model, X_test, y_test, cv=5,scoring='roc_auc')

print('Cross-validation AUC score for XGB')
display_scores(cv_auc1)
print('\n')
print('Cross-validation AUC score for RF')
display_scores(cv_auc2)
print('\n')
print('Cross-validation AUC score for SVC')
display_scores(cv_auc3)
print('\n')
print('Cross-validation AUC score for KN')
display_scores(cv_auc4)
print('\n')
print('Cross-validation AUC score for GNB')
display_scores(cv_auc5)

In [None]:
from sklearn.metrics import classification_report
gnb_predictions = gnb_model.predict(X_test)
print(classification_report(y_test, gnb_predictions))

In [None]:
from sklearn.model_selection import GridSearchCV
xgb_param_test = {'learning_rate': np.arange(start=0.02, stop = 1.2, step=0.02)}
xgb_cv =GridSearchCV(xgb, param_grid=xgb_param_test,scoring='roc_auc',cv=5)
xgb_cv.fit(X_train,y_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",xgb_cv.cv_results_[i])
print('Best parameter:',xgb_cv.best_params_ ,'\n',"Best score:", xgb_cv.best_score_)

In [None]:
svc_param_test = {'C': np.arange(0.06, 1.2, 0.03)}
svc_cv=GridSearchCV(svc, param_grid=svc_param_test,scoring='roc_auc',cv=5)
svc_cv.fit(X_train,y_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",svc_cv.cv_results_[i])
print('Best parameter:',svc_cv.best_params_ ,'\n',"Best score:", svc_cv.best_score_)

In [None]:
svc_param_test2 = {'C': np.arange(2, 27, 5)}
svc_cv2 = GridSearchCV(svc, param_grid=svc_param_test2,scoring='roc_auc',cv=5)
svc_cv2.fit(X_train,y_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",svc_cv2.cv_results_[i])
print('Best parameter:',svc_cv2.best_params_ ,'\n',"Best score:", svc_cv2.best_score_)

In [None]:
rf_param_test = {'n_estimators': np.arange(start=200, stop = 1200, step=200)}
rf_cv=GridSearchCV(rf, param_grid=rf_param_test,scoring='roc_auc', cv=5)
rf_cv.fit(X_train,y_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",rf_cv.cv_results_[i])
print('Best parameter:',rf_cv.best_params_ ,'\n',"Best score:", rf_cv.best_score_)

In [None]:
kn_param_test = {'n_neighbors': np.arange(start=2, stop = 200, step=10)}
kn_cv=GridSearchCV(kn, param_grid=kn_param_test,scoring='roc_auc',cv=5, n_jobs=2)
kn_cv.fit(X_train,y_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",kn_cv.cv_results_[i])
print('Best parameter:',kn_cv.best_params_ ,'\n',"Best score:", kn_cv.best_score_)
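
The assignment asks for the tried hyperparameter values with the mean and standard deviation of the AUC in a table. One possible way to get that from a fitted `GridSearchCV` is to load `cv_results_` into a DataFrame. The sketch below uses synthetic data and a small hypothetical grid; the column names `n_neighbors`, `mean_auc`, and `std_auc` are chosen here for readability.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=8, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 11, 21]},  # illustrative grid only
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

# One row per parameter value, best mean AUC first.
results_table = (
    pd.DataFrame(search.cv_results_)[
        ["param_n_neighbors", "mean_test_score", "std_test_score"]
    ]
    .rename(columns={
        "param_n_neighbors": "n_neighbors",
        "mean_test_score": "mean_auc",
        "std_test_score": "std_auc",
    })
    .sort_values("mean_auc", ascending=False)
    .reset_index(drop=True)
)
print(results_table.to_string(index=False))
```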

In [None]:
#fit models on training data based on parameters found from first round of parameter tuning
xgb2 = XGBClassifier(learning_rate = 0.36000000000000004)
xgb_model2 = xgb2.fit(X_train,y_train)
rf2 = RandomForestClassifier(n_estimators=1000,n_jobs=-1)
rf_model2 = rf2.fit(X_train,y_train)
svc2 = svm.SVC(C=18, probability=True)
svc_model2 = svc2.fit(X_train, y_train)
kn2 = KNeighborsClassifier(n_neighbors=192, n_jobs=-1)
kn_model2 = kn2.fit(X_train, y_train)

In [None]:
#calculate auc scores on validation set using 5 fold cross-validation

xgb_auc = cross_val_score(xgb_model2, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=-1)
rf_auc = cross_val_score(rf_model2, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=-1)
svc_auc = cross_val_score(svc_model2, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=-1)
kn_auc = cross_val_score(kn_model2, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=-1)

#print scores 

print('XGB AUC Scores from first round of parameter tuning')
display_scores(xgb_auc)
print('\n')
print('RF AUC Scores from first round of parameter tuning')
display_scores(rf_auc)
print('\n')
print('SVM AUC Scores from first round of parameter tuning')
display_scores(svc_auc)
print('\n')
print('KN AUC Scores from first round of parameter tuning')
display_scores(kn_auc)

In [24]:
#utilizing smote method 
from imblearn.over_sampling import SMOTE 
sm = SMOTE(random_state = 42) 
#fit_resample is the current name; fit_sample was removed in newer imbalanced-learn releases
X_sm, y_sm = sm.fit_resample(X, y.values.ravel()) 
X_sm_train,y_sm_train = sm.fit_resample(X_train, y_train)
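
Because the oversampled set above is later split again by 5-fold cross-validation, synthetic points interpolated from a validation row can end up in the training folds, which tends to inflate the AUC. A common alternative is to oversample inside each training fold only. The sketch below illustrates that idea with a simplified SMOTE-style interpolation written in NumPy; it is a stand-in for illustration, not imbalanced-learn's `SMOTE`, and the function `smote_like_oversample`, the synthetic data, and the model settings are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestNeighbors


def smote_like_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE-style oversampling (illustrative, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == 1]
    n_new = int((y == 0).sum() - (y == 1).sum())
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Drop the first neighbor column, which is normally the point itself.
    neighbors = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # New point = random position on the segment between a minority
    # sample and one of its minority-class neighbors.
    gap = rng.random((n_new, 1))
    synthetic = X_min[base] + gap * (X_min[picked] - X_min[base])
    X_out = np.vstack([X, synthetic])
    y_out = np.concatenate([y, np.ones(n_new, dtype=y.dtype)])
    return X_out, y_out


# Synthetic, imbalanced stand-in for the credit card data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_aucs = []
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Oversample the training fold only; the validation fold stays untouched.
    X_res, y_res = smote_like_oversample(X_demo[train_idx], y_demo[train_idx])
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_res, y_res)
    val_prob = clf.predict_proba(X_demo[val_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y_demo[val_idx], val_prob))

fold_aucs = np.array(fold_aucs)
print("Mean:", fold_aucs.mean())
print("Standard Deviation:", fold_aucs.std())
```

imbalanced-learn also ships its own `Pipeline` class in `imblearn.pipeline`, which applies samplers only when fitting; combining it with `SMOTE` and `GridSearchCV` would achieve the same separation with less code.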

In [None]:
#tuning XGBClassifier hyperparameters trained on smote split set
sm_xgb_param_test = {'learning_rate': np.arange(start=0.02, stop = 1.2, step=0.02)}
sm_xgb_cv =GridSearchCV(xgb, param_grid=sm_xgb_param_test,scoring='roc_auc',cv=5)
sm_xgb_cv.fit(X_sm_train,y_sm_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",sm_xgb_cv.cv_results_[i])
print('Best parameter:',sm_xgb_cv.best_params_ ,'\n',"Best score:", sm_xgb_cv.best_score_)

In [None]:
#tuning Random Forest Classifier hyperparameters trained on smote
sm_rf_param_test = {'n_estimators': np.arange(start=200, stop = 1200, step=200)}
sm_rf_cv =GridSearchCV(rf,param_grid=sm_rf_param_test,scoring='roc_auc',cv=5, n_jobs=-1)
sm_rf_cv.fit(X_sm_train,y_sm_train)

for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",sm_rf_cv.cv_results_[i])
print('Best parameter:',sm_rf_cv.best_params_ ,'\n',"Best score:", sm_rf_cv.best_score_)

In [None]:
#tuning Support Vector Machine Classifier hyperparameters trained on smote split set
sm_svc_param_test = {'C': np.arange(2, 22, 5)}
sm_svc_cv = GridSearchCV(svc, param_grid=sm_svc_param_test,scoring='roc_auc',cv=5, n_jobs=4)
sm_svc_cv.fit(X_sm_train,y_sm_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",sm_svc_cv.cv_results_[i])
print('Best parameter:',sm_svc_cv.best_params_ ,'\n',"Best score:", sm_svc_cv.best_score_)

In [None]:
from sklearn.model_selection import GridSearchCV
sm_kn_param_test = {'n_neighbors': np.arange(start=10, stop = 200, step=10)}
sm_kn_cv =GridSearchCV(kn, param_grid=sm_kn_param_test,scoring='roc_auc',cv=5, n_jobs=2)
sm_kn_cv.fit(X_sm_train,y_sm_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",sm_kn_cv.cv_results_[i])
print('Best parameter:',sm_kn_cv.best_params_ ,'\n',"Best score:", sm_kn_cv.best_score_)

In [None]:
from sklearn.model_selection import GridSearchCV
sm_kn_param_test = {'n_neighbors': np.arange(start=1, stop = 10, step=1)}
sm_kn_cv =GridSearchCV(kn, param_grid=sm_kn_param_test,scoring='roc_auc',cv=5, n_jobs=2)
sm_kn_cv.fit(X_sm_train,y_sm_train)
for i in ['mean_test_score', 'std_test_score', 'params']:
    print(i," : ",sm_kn_cv.cv_results_[i])
print('Best parameter:',sm_kn_cv.best_params_ ,'\n',"Best score:", sm_kn_cv.best_score_)

In [None]:
#fit models based on parameters found in second round of parameter tuning 

xgb3 = XGBClassifier(learning_rate=0.54, n_jobs=-1)
rf3 = RandomForestClassifier(n_estimators=400, n_jobs=-1)
kn3 = KNeighborsClassifier(n_neighbors=5,n_jobs=-1 )
#train models on smote data
xgb_model3 = xgb3.fit(X_sm_train, y_sm_train)
rf_model3=rf3.fit(X_sm_train, y_sm_train)
#svm_model3=svm.SVC(C=)
kn_model3=kn3.fit(X_sm_train, y_sm_train)

In [None]:
#calculate auc scores on validation set using 5 fold cross-validation
from sklearn.model_selection import cross_val_score
xgb_auc2 = cross_val_score(xgb_model3, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=4)
rf_auc2 = cross_val_score(rf_model3, X_test, y_test, cv=5, scoring='roc_auc', n_jobs=4)
#svm_auc2 = cross_val_score(svm_clf, X_test, y_test, cv=5, scoring='roc_auc')
kn_auc2 = cross_val_score(kn_model3, X_test, y_test, cv=5, scoring='roc_auc')

#print scores 

print('XGB AUC Scores from second round of parameter tuning')
display_scores(xgb_auc2)
print('\n')
print('RF AUC Scores from second round of parameter tuning')
display_scores(rf_auc2)
print('\n')
#print('SVM AUC Scores from first round of parameter tuning')
#display_scores(svm_auc)
#print('\n')
print('KN AUC Scores from second round of parameter tuning')
display_scores(kn_auc2)

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

prediction1 = xgb_model3.predict(X_test)
prediction2 = rf_model3.predict(X_test)
prediction3 = rf_model2.predict(X_test)
print('Accuracy for XGB model trained on smote:',accuracy_score(prediction1, y_test))
print('Accuracy for RF model trained on smote:',accuracy_score(prediction2, y_test))
print('Accuracy for RF model not trained on smote:',accuracy_score(prediction3, y_test))
#wrap each report in print(); only the last bare expression in a cell is displayed
print('Classification report for XGB model trained on smote')
print(classification_report(y_test, prediction1))
print('\n')
print('Classification report for RF model trained on smote') 
print(classification_report(y_test, prediction2))
print('\n')
print('Classification report for RF model not trained on smote')
print(classification_report(y_test, prediction3))

In [None]:
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
import xgboost
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import auc

cv = KFold(n_splits=5)

#initialize the accumulators and fold counter used inside the loop
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 1

for train,test in cv.split(X,y):
    #cv.split yields positional indices, so index with iloc
    prob = RandomForestClassifier(n_estimators=1000, n_jobs=-1).fit(X.iloc[train],y.iloc[train].values.ravel()).predict_proba(X.iloc[test])[:,1]
    fpr, tpr, t = roc_curve(y.iloc[test].values.ravel(), prob)
    #interpolate each fold's curve onto a common FPR grid so the curves can be averaged
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=2, alpha=0.3, label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i= i+1
plt.plot([0,1],[0,1],linestyle = '--',lw = 2,color = 'black')
mean_tpr = np.mean(tprs, axis=0)
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='blue',
         label=r'Mean ROC (AUC = %0.2f )' % (mean_auc),lw=2, alpha=1)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.legend(loc="lower right")
plt.show()
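
Q4 asks for five figures, one per validation fold, whereas the cell above overlays all folds on a single plot. A possible variant that creates a separate figure for each fold is sketched below. It runs on synthetic data with a smaller forest so that it is quick to execute; the data, `n_estimators=50`, and the `figures` list are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced stand-in for the training data (assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
figures = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X_demo, y_demo), start=1):
    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    prob = clf.predict_proba(X_demo[val_idx])[:, 1]
    fpr, tpr, _ = roc_curve(y_demo[val_idx], prob)

    # One figure per validation fold.
    fig, ax = plt.subplots()
    ax.plot(fpr, tpr, lw=2, label="ROC fold %d (AUC = %0.2f)" % (fold, auc(fpr, tpr)))
    ax.plot([0, 1], [0, 1], linestyle="--", lw=2, color="black")
    ax.set_xlabel("False Positive Rate")
    ax.set_ylabel("True Positive Rate")
    ax.set_title("ROC, validation fold %d" % fold)
    ax.legend(loc="lower right")
    figures.append(fig)

print("Figures created:", len(figures))
```

In a notebook, calling `plt.show()` after the loop would display the five figures one after another.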

In [25]:
first_best = RandomForestClassifier(n_estimators=1000, n_jobs=-1)
second_best = XGBClassifier(learning_rate=0.54) #trained on smote
first_train = first_best.fit(X_train, y_train)
second_train = second_best.fit(X_sm_train, y_sm_train)

In [31]:
from sklearn.model_selection import cross_val_score
first_cv = cross_val_score(first_best, X_test, y_test, cv=5, scoring='roc_auc')
second_cv = cross_val_score(second_best, X_test, y_test, cv=5, scoring='roc_auc')
display_scores(first_cv)
display_scores(second_cv)

Scores: [0.9995409  0.90388234 0.9948076  0.96670848 0.99987633]
Mean: 0.9729631294091782
Standard Deviation: 0.03665830059909268
Scores: [0.99857096 0.93419938 0.97705772 0.95749083 0.99973205]
Mean: 0.9734101901725843
Standard Deviation: 0.025020482606171044


In [34]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

pred=first_train.predict(X_test)
pred2 = second_train.predict(X_test)
print(classification_report(y_test, pred))
print(classification_report(y_test, pred2))
print(accuracy_score(y_test, pred))
print(accuracy_score(y_test, pred2))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     45485
           1       0.94      0.81      0.87        84

    accuracy                           1.00     45569
   macro avg       0.97      0.90      0.94     45569
weighted avg       1.00      1.00      1.00     45569

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     45485
           1       0.90      0.85      0.87        84

    accuracy                           1.00     45569
   macro avg       0.95      0.92      0.94     45569
weighted avg       1.00      1.00      1.00     45569

0.9995611051372644
0.9995391603941276


In [45]:
from sklearn.metrics import roc_auc_score

y_pred_prob = first_train.predict_proba(X_test)[:,1]
print(roc_auc_score(y_test, y_pred_prob))
print(pd.DataFrame(y_pred_prob))

1.0
         0
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
...    ...
45564  0.0
45565  0.0
45566  0.0
45567  0.0
45568  0.0

[45569 rows x 1 columns]


In [50]:
best_model = first_best.fit(X, y.values.ravel())

In [51]:
final_predictions = best_model.predict_proba(cc_test)[:,1]

In [52]:
richard_cruz_labels = pd.DataFrame(final_predictions)

In [53]:
richard_cruz_labels

         0
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
...    ...
56957  0.0
56958  0.0
56959  0.0
56960  0.0


In [None]:
richard_cruz_labels.to_csv(r'richard_cruz_labels.csv',header=False, index=False)

Final AUC score on the unlabeled test data: 0.96062