## Credit Card Fraud Detection

### This notebook explores how shifting the classification threshold in favor of one class affects precision and recall. It gives an intuitive picture of the trade-off between precision and recall in the fraud detection case.

In [None]:
## Importing necessary libraries
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np

%matplotlib inline

### Loading the Dataset

In [None]:
df=pd.read_csv('../input/creditcard.csv')

In [None]:
df.shape

#### Visualizing the Labels Count


In [None]:
count_classes = df['Class'].value_counts().sort_index()  # pd.value_counts is deprecated in recent pandas
count_classes.plot(kind = 'bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")
Class_split = df.groupby(['Class']).size()
print(Class_split)

#### The target variable 'Class' consists of 1s and 0s:
#### 1 indicates a fraudulent transaction
#### 0 indicates a clean transaction
#### The classes are clearly imbalanced: there are 284315 clean transactions and just 492 fraudulent ones
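
To make the imbalance concrete, the short sketch below works out the fraud share from the two counts quoted above. It also shows the accuracy of a hypothetical model that labels every transaction as clean, which is why accuracy alone says little here. The variable names are illustrative and not part of the original notebook.

```python
# Class counts quoted in the text above.
n_clean = 284315
n_fraud = 492
n_total = n_clean + n_fraud

# Share of all transactions that are fraudulent.
fraud_share = n_fraud / n_total

# Accuracy of a hypothetical model that always predicts "clean".
always_clean_accuracy = n_clean / n_total

print(f"Total transactions: {n_total}")
print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of always predicting clean: {always_clean_accuracy:.4%}")
```

Such a model would catch no fraud at all, so the rest of the notebook looks at precision and recall for the fraud class instead.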

In [None]:
f, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(12,4))

bins = 50

ax1.hist(df.Time[df.Class == 1], bins = bins)
ax1.set_title('Fraud')

ax2.hist(df.Time[df.Class == 0], bins = bins)
ax2.set_title('Normal')

plt.xlabel('Time (in Seconds)')
plt.ylabel('Number of Transactions')
plt.show()

In [None]:
df.isnull().values.any() # We see that there are no missing values in the data set

#### Preparing the Features and Labels data set from the entire data

In [None]:
columns=df.columns
# The labels are in the last column ('Class'). 
features_columns=columns.delete(len(columns)-1)

features=df[features_columns].copy()  # copy to avoid SettingWithCopyWarning when scaling columns below
labels=df['Class']

#### Normalizing the 'Time' and 'Amount' Variable

In [None]:
features['Amount'] = (features['Amount'] - features['Amount'].min()) /  (features['Amount'].max() - features['Amount'].min())
features['Time'] = (features['Time'] - features['Time'].min()) /  (features['Time'].max() - features['Time'].min())
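
The two lines above apply min-max scaling, which maps each column onto the range 0 to 1. As a minimal sketch of the same formula in plain Python, using made-up amounts and a hypothetical helper name `min_max_scale`:

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical transaction amounts, for illustration only.
amounts = [0.0, 25.0, 100.0]
scaled = min_max_scale(amounts)
print(scaled)
```

The guard for a constant column is an addition for safety; the notebook's pandas version would produce NaN values in that case.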

#### Determining feature importance with a random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
# A single tree keeps this quick; more trees would give steadier importance estimates
model = RandomForestRegressor(n_estimators=1, oob_score=True, random_state=99)
model.fit(features, labels)

In [None]:
feature_importance = pd.Series(model.feature_importances_, index = features.columns)
feature_importance.plot( kind = 'barh', figsize = (7,6));

In [None]:
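# Dropping the least important features
# Note: `features` was built from df earlier, so dropping columns from df here
# does not change the columns used for training below
df = df.drop(['V2','V8','V9','V5','V3','V23','V18','V6','V25','V24','V28'], axis=1)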

#### Splitting the Train and Test set in the ratio of 70:30

In [None]:
features_train, features_test, labels_train, labels_test = train_test_split(features, 
                                                                            labels, 
                                                                            test_size=0.3, 
                                                                            random_state=1)

#### To counter the class imbalance, I oversampled the training set with SMOTE

In [None]:
oversampler=SMOTE(random_state=1)
# fit_sample was removed from imbalanced-learn; fit_resample is the current name
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
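
SMOTE creates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step in plain Python. It is a simplified illustration, not imbalanced-learn's implementation, and the helper name and sample values are hypothetical.

```python
import random

def interpolate(sample, neighbor, rng):
    """Return a point on the line segment between sample and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(1)

# Two hypothetical fraud samples with two features each.
fraud_a = [0.2, 1.5]
fraud_b = [0.4, 1.1]

synthetic = interpolate(fraud_a, fraud_b, rng)
print(synthetic)
```

Because the synthetic points are built from existing samples, the notebook oversamples only the training split and leaves the test split untouched.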

### Building the Classifiers

### 1. Random Forest Classifier

In [None]:
from sklearn.model_selection import KFold, cross_val_score  # sklearn.cross_validation was removed
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,recall_score,classification_report 

In [None]:
# max_features='auto' was removed in recent scikit-learn; for classifiers it meant 'sqrt'
clf=RandomForestClassifier(n_estimators = 100, max_depth = 4, max_features = 'sqrt', random_state=99)
clf.fit(os_features,os_labels)

In [None]:
actual=labels_test
predictions=clf.predict(features_test)

In [None]:
confusion_matrix(actual,predictions)

In [None]:
print(classification_report(actual,predictions))

In [None]:
from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(actual, predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print ('AUC:', roc_auc)
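
One caveat about the cell above: `roc_curve` receives hard 0/1 predictions rather than scores, so the curve has a single interior point. The area under it then reduces to the average of the true positive rate and the true negative rate. Passing `clf.predict_proba(features_test)[:, 1]` instead would trace the full curve. The sketch below checks the single-point case with the trapezoid rule; the rates are made up and the function name is hypothetical.

```python
def auc_from_single_point(tpr, fpr):
    """Trapezoid-rule area under the ROC polyline (0,0) -> (fpr,tpr) -> (1,1)."""
    points = [(0.0, 0.0), (fpr, tpr), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical operating point, for illustration only.
tpr, fpr = 0.8, 0.1
area = auc_from_single_point(tpr, fpr)

# Average of the true positive rate and the true negative rate.
balanced = (tpr + (1 - fpr)) / 2

print(area, balanced)
```

The two values agree, so the AUC figures reported from hard predictions describe one operating point rather than the model's ranking quality across all thresholds.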

#### Observations
#### We correctly identified 113 of the 135 fraud cases.
#### On the other hand, the model classified 342 clean transactions as fraud.
#### This leaves us with a recall of 0.84 and a precision of 0.25.
#### The AUC achieved was 0.916.
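
The recall and precision above follow directly from the confusion-matrix counts. A minimal sketch of the arithmetic, using the counts quoted in the observations (the variable names are illustrative):

```python
# Counts quoted in the observations for the random forest.
true_positives = 113          # fraud cases correctly flagged
total_fraud = 135             # fraud cases in the test set
false_positives = 342         # clean transactions flagged as fraud

false_negatives = total_fraud - true_positives

# Recall: share of actual fraud cases that were caught.
recall = true_positives / (true_positives + false_negatives)

# Precision: share of flagged transactions that were actually fraud.
precision = true_positives / (true_positives + false_positives)

print(f"Recall: {recall:.2f}")
print(f"Precision: {precision:.2f}")
```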

In [None]:
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### 2. Logistic Regression Model

#### a) Logistic Regression Model with the default probability threshold of 0.5

In [None]:
from sklearn.linear_model import LogisticRegression
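# The L1 penalty needs a solver that supports it; the current default (lbfgs) does not
lr = LogisticRegression(C = 10, penalty = 'l1', solver = 'liblinear', random_state=99)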

In [None]:
lr.fit(os_features,os_labels)

In [None]:
LR_predictions=lr.predict(features_test)

In [None]:
confusion_matrix(labels_test,LR_predictions)

#### Here the model failed to detect 16 of the 135 fraud transactions and misclassified 1745 clean transactions as fraud

In [None]:
print(classification_report(labels_test,LR_predictions))

In [None]:
from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test, LR_predictions)
roc_auc = auc(false_positive_rate, true_positive_rate)
print ('AUC:',roc_auc)

### Changing the classification Threshold

### b) Logistic Regression Model with a threshold of 0.40: if the predicted probability of fraud exceeds 0.40, we classify the transaction as fraud

In [None]:
predprob = lr.predict_proba(features_test) # Predicted class probabilities (column 1 is the fraud class)

In [None]:
predprob

In [None]:
prob_dataframe = pd.DataFrame(predprob)

In [None]:
prob_dataframe['class'] = np.where(prob_dataframe[1] > .40, 1, 0)

In [None]:
prob_dataframe.head(10)
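
The thresholding step can be seen in isolation with a handful of made-up probabilities. In this sketch, lowering the threshold from 0.50 to 0.40 flips any transaction whose fraud probability lies between the two values. The probabilities and the helper name `classify` are hypothetical.

```python
def classify(fraud_probabilities, threshold):
    """Label a transaction 1 (fraud) when its fraud probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in fraud_probabilities]

# Hypothetical fraud probabilities, for illustration only.
fraud_probs = [0.10, 0.35, 0.45, 0.55, 0.90]

labels_at_050 = classify(fraud_probs, 0.50)
labels_at_040 = classify(fraud_probs, 0.40)

print(labels_at_050)
print(labels_at_040)
```

Only the transaction scored 0.45 changes label, which is how a lower threshold raises recall while also admitting more false alarms.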

In [None]:
predicted40 = prob_dataframe['class']

In [None]:
confusion_matrix(labels_test,predicted40)

In [None]:
print(classification_report(labels_test,predicted40))

In [None]:
from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(labels_test, predicted40)
roc_auc = auc(false_positive_rate, true_positive_rate)
print (roc_auc)

#### Observations
#### If the predicted probability of fraud exceeds 0.40, we classify the transaction as fraud; this captures more of the fraud cases.
#### We correctly identified 123 of the 135 fraud cases.
#### On the other hand, the model classified 2504 clean transactions as fraud.
#### This leaves us with a recall of 0.91 and a precision of 0.05.
#### The AUC achieved was 0.9408.
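
Comparing the counts reported at the two thresholds shows what the extra recall costs. The sketch below uses only the figures quoted in this notebook for the logistic regression model; the variable names are illustrative.

```python
# Counts quoted in the text for the logistic regression model.
missed_at_050, false_alarms_at_050 = 16, 1745
caught_at_040, false_alarms_at_040 = 123, 2504
total_fraud = 135

missed_at_040 = total_fraud - caught_at_040

# How many more frauds are caught, and how many more clean
# transactions are flagged, when the threshold drops to 0.40.
extra_caught = missed_at_050 - missed_at_040
extra_false_alarms = false_alarms_at_040 - false_alarms_at_050

false_alarms_per_extra_catch = extra_false_alarms / extra_caught

print(f"Extra frauds caught: {extra_caught}")
print(f"Extra false alarms: {extra_false_alarms}")
print(f"False alarms per extra fraud caught: {false_alarms_per_extra_catch:.2f}")
```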

In [None]:
import matplotlib.pyplot as plt
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate, true_positive_rate, 'b', label='AUC = %0.2f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

### Key Insights 
#### 1. By lowering the classification threshold in favor of the fraud class, we can detect more of the fraud cases at the expense of precision,
#### i.e. we classify many more non-fraud transactions as fraud. This can lead to customer dissatisfaction and lost customers, and it raises the cost the bank incurs to confirm whether each flagged transaction was actually fraud or a false alarm.
#### 2. The right threshold depends on the cost the bank suffers for every missed fraud transaction versus the cost incurred for flagging a non-fraud transaction as fraud, including the dissatisfaction of customers who are mistakenly flagged.
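
The second insight can be turned into a small calculation. The sketch below compares the two thresholds under assumed costs. The error counts come from this notebook's logistic regression results, but the cost figures and function names are purely hypothetical and would have to come from the bank.

```python
# (missed frauds, false alarms) reported for each threshold in this notebook.
counts_by_threshold = {0.50: (16, 1745), 0.40: (12, 2504)}

def total_cost(missed, false_alarms, cost_per_missed_fraud, cost_per_false_alarm):
    """Total cost of the errors made at one threshold."""
    return missed * cost_per_missed_fraud + false_alarms * cost_per_false_alarm

def cheapest_threshold(cost_per_missed_fraud, cost_per_false_alarm):
    """Return the threshold with the lowest total cost, plus all costs."""
    costs = {
        threshold: total_cost(missed, alarms, cost_per_missed_fraud, cost_per_false_alarm)
        for threshold, (missed, alarms) in counts_by_threshold.items()
    }
    return min(costs, key=costs.get), costs

# Hypothetical costs: a missed fraud costs 500, a false alarm costs 2 or 5.
best_cheap_alarm, costs_cheap_alarm = cheapest_threshold(500, 2)
best_costly_alarm, costs_costly_alarm = cheapest_threshold(500, 5)

print(best_cheap_alarm, costs_cheap_alarm)
print(best_costly_alarm, costs_costly_alarm)
```

Under these assumed costs the preferred threshold flips from 0.40 to 0.50 as false alarms become more expensive, which is the dependence the second insight describes.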