## Supervised Learning: Challenge

In this challenge, we will try to predict credit card fraud.

Download the data from [here](https://drive.google.com/file/d/1FCQY1SiWIjh_ME6Wtb3FG8Y1sKoRwAUc/view?usp=sharing). The data is originally from a [Kaggle Competition](https://www.kaggle.com/mlg-ulb/creditcardfraud).

The dataset contains transactions made by European cardholders over two days in September 2013. **It has 492 occurrences of fraud out of a total of 284,807 transactions**. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
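
To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and shows the accuracy a model would reach by labelling every transaction as legitimate. The variable names are illustrative only.

```python
# Counts quoted in the dataset description
n_fraud = 492
n_total = 284_807

# Share of transactions that are fraudulent
fraud_rate = n_fraud / n_total

# Accuracy of a model that labels every transaction as legitimate (class 0)
baseline_accuracy = (n_total - n_fraud) / n_total

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Always-legitimate accuracy: {baseline_accuracy:.3%}")
```

A model can therefore score well above 99% accuracy without catching a single fraud, which is why accuracy is a poor yardstick for this challenge.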

____________________
### **Challenge:** Identify fraudulent credit card transactions.

Features V1, V2, … V28 are the principal components obtained with PCA. The only features that are not transformed with PCA are `'Time'` and `'Amount'`.  

- The feature `'Time'` contains the seconds elapsed between each transaction and the first transaction in the dataset.
- The feature `'Amount'` is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. 
- The feature `'Class'` is the target variable, and it takes the value of 1 in case of fraud and 0 otherwise.
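
As a rough illustration of what example-dependent cost could mean here, the sketch below charges each missed fraud its transaction amount and each alert a fixed review fee. The function `total_cost`, the `admin_cost` value and the toy numbers are all assumptions for illustration, not part of the dataset or of any library.

```python
import numpy as np

def total_cost(y_true, y_pred, amount, admin_cost=5.0):
    """Hypothetical cost: a missed fraud loses its amount, each alert costs a fixed fee."""
    y_true, y_pred, amount = map(np.asarray, (y_true, y_pred, amount))
    missed = (y_true == 1) & (y_pred == 0)   # frauds the model let through
    flagged = y_pred == 1                    # transactions sent for review
    return amount[missed].sum() + admin_cost * flagged.sum()

# Toy example: two frauds, one large and one small (made-up values)
y_true = [0, 1, 1, 0]
amount = [20.0, 500.0, 3.0, 80.0]

cost_a = total_cost(y_true, [0, 0, 1, 0], amount)  # misses the 500.0 fraud
cost_b = total_cost(y_true, [0, 1, 0, 0], amount)  # misses the 3.0 fraud
print(cost_a, cost_b)
```

Both prediction vectors catch exactly one of the two frauds, so their recall is identical, yet their costs differ widely because the missed amounts differ.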

> #### Warning
> The class imbalance is severe, so we need to be careful when evaluating. It might be better to use `.predict_proba()` with a custom cut-off, rather than the default labels from `.predict()`, to search for fraudulent transactions.
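
A minimal sketch of the custom cut-off idea, assuming a synthetic imbalanced dataset from `make_classification` in place of the real file. The helper `flag` and the cut-off of 0.1 are hypothetical choices; a sensible value would have to be tuned on validation data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with roughly 2% positives (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.98, 0.02], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=0)

model = GaussianNB().fit(Xtr, ytr)

# Column 1 of predict_proba holds the estimated probability of class 1 (fraud)
p_fraud = model.predict_proba(Xte)[:, 1]

def flag(p, cutoff):
    """Label a transaction as fraud when its probability reaches the cut-off."""
    return (p >= cutoff).astype(int)

default_flags = flag(p_fraud, 0.5)  # close to what .predict() returns
eager_flags = flag(p_fraud, 0.1)    # hypothetical lower cut-off

print("Flagged at 0.5:", default_flags.sum())
print("Flagged at 0.1:", eager_flags.sum())
```

Lowering the cut-off can only flag the same or more transactions, trading extra false alarms for a chance of catching more frauds.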

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
data = pd.read_csv("creditcard.csv")

In [None]:
data.keys()

In [None]:
data.shape

In [None]:
X = data.iloc[:,:-1]
y = data.Class

# Naive Bayes first?

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)
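
With so few frauds, a plain random split can leave the train and test sets with slightly different fraud rates. One option is the `stratify` argument of `train_test_split`, sketched below on toy labels (the 1% positive rate and the array names are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with 1% positives (made-up, not the real data)
y_toy = np.array([1] * 20 + [0] * 1980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both parts
_, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                    stratify=y_toy, random_state=42)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```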

In [None]:
from sklearn import naive_bayes
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)

In [None]:
y_proba_bayes = gnb.predict_proba(X_test)

In [None]:
len(y_proba_bayes[:,0]) # number of test rows: huge (about 85k)

In [None]:
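plt.hist(y_proba_bayes[:,0], density=True, bins=10, alpha=0.5, label='P(class 0)')
plt.hist(y_proba_bayes[:,1], density=True, bins=10, alpha=0.5, label='P(class 1)')
plt.ylabel('Density')  # density=True normalises the bars, so these are not raw probabilities
plt.xlabel('Predicted probability')
plt.legend(); # nearly all predictions pile up at 0 or 1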

# Naive Bayes try 2 (without predict proba)

In [None]:
from sklearn import naive_bayes
gnb2 = naive_bayes.GaussianNB()
gnb2.fit(X_train, y_train)

In [None]:
y_pred_gnb2 = gnb2.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred_gnb2)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# from_estimator runs the prediction on X_test itself, so y_pred is not passed in
ConfusionMatrixDisplay.from_estimator(gnb2, X_test, y_test)
plt.show() # rows are y_test (true labels), columns are the predictions
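
For an imbalanced problem, the confusion matrix is easier to read once it is turned into precision and recall. The sketch below uses a made-up matrix in scikit-learn's layout (rows are true classes, columns are predicted classes); the counts are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix: [[TN, FP], [FN, TP]]
cm = np.array([[85000, 300],
               [30, 113]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of flagged transactions that really are fraud
recall = tp / (tp + fn)          # share of frauds that were caught
accuracy = (tp + tn) / cm.sum()  # dominated by the legitimate class

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.4f}")
```

In this made-up case accuracy stays above 99% even though most alerts are false alarms, so precision and recall tell the more useful story.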

# SVM time (Poly Kernel) Caught 0 frauds

In [None]:
from sklearn import svm # C=1 (default), C=0.5 and C=0.1 all gave the same results
clf = svm.SVC(kernel='poly')
clf.fit(X_train, y_train)

In [None]:
y_pred2 = clf.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred2)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# from_estimator runs the prediction on X_test itself, so y_pred is not passed in
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show() # rows are y_test (true labels), columns are the predictions

# SVM (rbf kernel) - same as Poly - Caught 0 frauds

In [None]:
from sklearn import svm
clf3 = svm.SVC(kernel='rbf')
clf3.fit(X_train, y_train)

In [None]:
y_pred3 = clf3.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred3)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# from_estimator runs the prediction on X_test itself, so y_pred is not passed in
ConfusionMatrixDisplay.from_estimator(clf3, X_test, y_test)
plt.show() # rows are y_test (true labels), columns are the predictions
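
One plausible reason both SVMs caught no frauds is that the features were not scaled (`'Time'` and `'Amount'` are on a much larger scale than the PCA components) and the classes were not weighted. This has not been verified on the real data. The sketch below shows the general pattern on synthetic data, with one column inflated on purpose to mimic an unscaled feature; all names and sizes are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with about 3% positives (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=3000, n_features=10,
                                     weights=[0.97, 0.03], random_state=0)
X_demo[:, 0] *= 10_000  # mimic one feature on a much larger scale, like 'Time'

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3,
                                      stratify=y_demo, random_state=0)

# Standardise the features, then weight classes inversely to their frequency
svm_pipe = make_pipeline(StandardScaler(),
                         SVC(kernel="rbf", class_weight="balanced"))
svm_pipe.fit(Xtr, ytr)

y_hat = svm_pipe.predict(Xte)
svm_recall = recall_score(yte, y_hat)
print(f"Recall on the synthetic test set: {svm_recall:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no test-set statistics leak into training.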

# Poor results. GridSearch time?

In [None]:
# Yes: with more time, grid search for the best SVM hyperparameters.
# Also worth learning how to plot probabilities that are heavily concentrated at 0 and 1.
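
A grid search over SVM hyperparameters could look roughly like the sketch below. It runs on a small synthetic dataset so that it finishes quickly; the grid values, the step names `scale` and `svc`, and the choice of `average_precision` as the score are assumptions, not tuned settings for the fraud data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in with about 5% positives (assumption)
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.95, 0.05], random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"svc__kernel": ["rbf", "poly"], "svc__C": [0.1, 1, 10]}

# Stratified folds keep some positives in every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, scoring="average_precision", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean average precision: {search.best_score_:.3f}")
```

Average precision summarises the precision-recall curve, which tends to be more informative than accuracy when positives are rare.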

In [None]:
plt.hist(y_proba_bayes, density=True, bins=30)  # density=False would plot counts instead
plt.ylabel('Density')
plt.xlabel('Predicted probability');

# Class imbalance problem. Lazy model.
# Oversample = create copies of the class 1 rows to balance against class 0!
# SMOTE? It simulates new minority samples, check it out
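
The simplest form of oversampling just repeats minority rows until the classes are the same size. The NumPy sketch below shows that idea on toy data (array names and sizes are made up). SMOTE differs in that it interpolates between a minority row and its nearest minority neighbours instead of copying rows. Either way, resampling is normally applied to the training rows only, so the test set stays made of real transactions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 95 legitimate rows and 5 fraud rows (made-up)
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = rng.normal(size=(100, 3))

minority_idx = np.flatnonzero(y_toy == 1)
majority_idx = np.flatnonzero(y_toy == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra = rng.choice(minority_idx, size=n_extra, replace=True)

keep = np.concatenate([majority_idx, minority_idx, extra])
X_bal, y_bal = X_toy[keep], y_toy[keep]

print("Class counts after oversampling:", np.bincount(y_bal))
```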

In [None]:
import imblearn

In [None]:
import collections
counter = collections.Counter(y)
counter  # class counts before resampling

In [None]:
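# split first, so the test set contains only real transactions
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.3,
                                                    random_state=42)

In [None]:
# transform the training set only: resampling before the split would leak synthetic samples into the test set
from imblearn.over_sampling import SMOTE 
oversample = SMOTE(random_state=42)
X_train, y_train = oversample.fit_resample(X_train, y_train)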

In [None]:
y_train.shape

In [None]:
# The SVM fit below did not finish after 1 hour, so I gave up and stopped it.

In [None]:
# from sklearn import svm
# clf4 = svm.SVC(kernel='rbf')
# clf4.fit(X_train, y_train)

In [None]:
# y_pred4 = clf4.predict(X_test)

In [None]:
# y_pred4.shape

In [None]:
# from sklearn.metrics import confusion_matrix
# confusion_matrix(y_test, y_pred4)
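
Kernel SVMs such as `svm.SVC` scale poorly with the number of rows (fit time grows at least quadratically), which probably explains why the fit on the oversampled data did not finish. One possible workaround, sketched below on synthetic data, is a linear SVM via `LinearSVC`. This swaps the RBF kernel for a linear decision boundary, so it is a different model and not a drop-in replacement; the dataset size and settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in; 30 features to match the shape of the fraud data (assumption)
X_big, y_big = make_classification(n_samples=20000, n_features=30,
                                   random_state=0)

# dual=False is usually the faster choice when rows far outnumber features
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
linear_svm.fit(X_big, y_big)

train_accuracy = linear_svm.score(X_big, y_big)
print(f"Training accuracy on synthetic data: {train_accuracy:.3f}")
```

Another option would be to fit the kernel SVM on a random subsample of the training rows, at the price of discarding data.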