# Welcome to My Kernel

In this dataset we have to identify fraudulent transactions, which is essentially an anomaly detection problem.

**Anomaly detection** is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. It has many applications, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.

This dataset is heavily imbalanced: fraudulent transactions are very few compared with non-fraudulent ones. For this reason, many machine learning and deep learning models give unsatisfactory results on it or fail to identify the fraudulent transactions.

**Observations from the dataset:**
1. The dataset consists of numerical values from the 28 ‘Principal Component Analysis (PCA)’ transformed features, namely V1 to V28. Furthermore, no metadata about the original features is provided, so pre-analysis or feature study could not be done.
2. The ‘Time’ and ‘Amount’ features are not transformed.
3. There are no missing values in the dataset.

My approach/workflow for solving this problem is detailed below:
1. EDA
2. Finding the correlation of attributes with the target variable
3. Preprocessing the data
4. Apply Deep Neural Network
5. Apply Machine Learning Classifiers (Random Forest, Decision Tree Classifier)
6. Apply Undersampling
7. Apply Deep Neural Network to check the accuracy and false negatives
8. Apply SMOTE - Oversampling
9. Apply Deep Neural Network to check the accuracy and false negatives
10. Final remarks


# Importing Essential Libraries

In [None]:
import os
import itertools

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import keras
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(2)  # fix NumPy's seed so the random sampling below is repeatable
print(os.listdir("../input"))

In [None]:
# Function to plot Confusion Matrix (to be used later).
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

In [None]:
data=pd.read_csv('../input/creditcard.csv')

# Exploratory Data Analysis (EDA)

1. **Checking what the data looks like by viewing the top 5 rows of the dataset.**

In [None]:
data.head()

2. **Checking the last five rows to get an idea of how the data looks from top to end.**

In [None]:
data.tail()

In [None]:
# To check the count of fraudulent and normal transactions
# (the column is passed by keyword, which recent seaborn versions require)
sns.countplot(x='Class', data=data, facecolor=(0, 0, 0, 0), linewidth=5, edgecolor=sns.color_palette("dark", 3), label="Count")

**It seems that there are very few Fraudulent Transactions in comparison to Normal Transactions.**


In [None]:
# Now Checking actual number of fraudulent transactions
fraud_indices=np.array(data[data.Class==1].index)
no_records_fraud=len(fraud_indices)
normal_indices=np.array(data[data.Class==0].index)
no_records_normal=len(normal_indices)

print("No. of Fraudulent Transaction is {} and No. of Normal Transaction is {}".format(no_records_fraud, no_records_normal))

In [None]:
# To see the actual distribution of data 
sns.pairplot(data, hue = 'Class', vars = ['V1', 'V2', 'V3', 'V15', 'V18','Amount'] )

In [None]:
sns.kdeplot(data['Amount'],shade=True)

In [None]:
# To see the actual distribution of Amount for each class

fig=sns.FacetGrid(data,hue='Class',aspect=4)
fig.map(sns.kdeplot,'Amount',shade=True)
max_amount=data['Amount'].max()
fig.set(xlim=(0,max_amount))
fig.add_legend()

In [None]:
sns.scatterplot(x = 'Amount', y = 'V1',hue='Class',  data = data)


Because fraudulent transactions are so few compared with normal transactions, they are barely visible in these plots.

In [None]:
dataset2 = data.drop(columns = ['Class'])



# Finding Correlation with target variable

In [None]:
dataset2.corrwith(data.Class).plot.bar(
        figsize = (20, 10), title = "Correlation with Class", fontsize = 20,
        rot = 45, grid = True)

# Preprocessing

Since all the features from V1 to V28 are already normalized, we only have to normalize Amount.
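For reference, the standardization that `StandardScaler` applies can be sketched by hand: subtract the mean and divide by the standard deviation. The amounts below are made-up values for illustration, not rows from the dataset.

```python
import numpy as np

# Hypothetical transaction amounts (illustrative only).
amounts = np.array([1.0, 10.0, 100.0, 1000.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
standardized = (amounts - amounts.mean()) / amounts.std()

print(standardized)
print(standardized.mean(), standardized.std())
```

After this transformation the column has mean 0 and standard deviation 1, which puts it on a scale comparable to the other features.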

In [None]:

data['normalized_amount']=StandardScaler().fit_transform(data['Amount'].values.reshape(-1,1))
# Dropping the actual Amount column from the dataset.
data=data.drop(['Amount'],axis=1)

In [None]:
# To check the dataset for changed column
data.head()

In [None]:
# Time looks irrelevant for this approach, so we drop the Time column from the dataset.
data=data.drop(['Time'],axis=1)

In [None]:
data.head()

In [None]:
# Assigning X and Y 
X=data.iloc[:,data.columns!='Class']
y=data.iloc[:,data.columns=='Class']

In [None]:
X.head()

In [None]:
y.head()

# Splitting data into Train and Test set

I split the data into a training set (70% of the data) and a test set (30% of the data).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# The deep learning model expects NumPy arrays, so convert X_train, X_test, y_train and y_test.
X_train = np.array(X_train)
X_test=np.array(X_test)
y_train=np.array(y_train)
y_test=np.array(y_test)

# Deep Neural Network

**Model Definition:**

I used the Keras Sequential API, where you just add one layer at a time, starting from the input.

The first layer is a Dense layer with 16 units. `units` is a positive integer that specifies the dimensionality of the layer's output space. The activation function used in this layer is relu.

**'relu'** is the rectifier activation function, max(0, x). It is used to add non-linearity to the network.

**'sigmoid'** is used because its output lies between 0 and 1, which makes it especially suitable for models that have to predict a probability. Since a probability always lies between 0 and 1, sigmoid is the right choice for the output layer.
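The two activation functions described above can be written out in a few lines of NumPy. This is only a sketch for intuition; Keras uses its own implementations, and the input values are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectifier: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical pre-activation values of a layer.
z = np.array([-2.0, 0.0, 3.0])

print(relu(z))     # negative inputs are clipped to 0
print(sigmoid(z))  # every output can be read as a probability
```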

In the second layer I have used 24 units with the relu activation function.

**Dropout** is a regularization method in which a proportion of the nodes in a layer are randomly ignored (their outputs are set to zero) at each training step. This randomly drops a proportion of the network and forces the network to learn features in a distributed way. The technique also improves generalization and reduces overfitting.
As the dataset is large I have opted for a dropout rate of 0.5.
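To make the idea concrete, here is a minimal NumPy sketch of dropout at training time. It assumes the common "inverted dropout" formulation, in which the surviving activations are scaled up by 1 / (1 - rate); the function name and the example batch are made up for illustration.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Zero a random fraction `rate` of the activations and rescale the
    survivors by 1 / (1 - rate) so the expected sum stays the same."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((4, 24))  # hypothetical batch: 4 samples, 24 hidden units
dropped = dropout(hidden, rate=0.5, rng=rng)

print(dropped[0])  # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
```

At prediction time no units are dropped, which is why the layer only affects training.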

In the third layer I have used 20 units with the relu activation function.

In the fourth layer I have used 24 units with the relu activation function.

The last layer has to produce a single output, so I have used 1 unit with the sigmoid activation function.

In [None]:
from keras.models import Sequential
from keras.layers import Dropout
from keras.layers import Dense

In [None]:
model = Sequential([
     #First Layer
     Dense(units=16, input_dim=29, activation='relu'),
      #Second Layer
     Dense(units=24,activation='relu'),
     Dropout(0.5),
      #Third Layer
     Dense(20,activation='relu'),
     #Fourth Layer
     Dense(24,activation='relu'),
     #Fifth Layer
     Dense(1,activation='sigmoid')  
    
    
])

model.summary()

# Setting Optimizer and Loss Function

Once our layers are added to the model, we need to set up a score function, a loss function and an optimisation algorithm.

**Loss Function**:
We define the loss function to measure how poorly our model performs on transactions with known labels. It measures the error between the observed labels and the predicted ones. For binary classification (2 classes) we use a specific form called "binary_crossentropy".
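As a rough illustration of what this loss computes, the sketch below implements binary cross-entropy in NumPy. The clipping constant `eps` and the example labels and probabilities are assumptions chosen for the demonstration.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y_true = np.array([0, 0, 1, 1])                   # hypothetical labels
confident_right = np.array([0.1, 0.1, 0.9, 0.9])  # predictions close to the labels
confident_wrong = np.array([0.9, 0.9, 0.1, 0.1])  # predictions far from the labels

loss_right = binary_crossentropy(y_true, confident_right)
loss_wrong = binary_crossentropy(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

The loss is small when the predicted probabilities agree with the labels and grows quickly when the model is confidently wrong.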

The most important function is the optimizer. It iteratively improves the parameters (the weights and biases of the neurons) in order to minimise the loss.

I chose the **Adam optimizer** because it combines the advantages of two other extensions of stochastic gradient descent. Specifically:

**1. Adaptive Gradient Algorithm (AdaGrad)** that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).

**2. Root Mean Square Propagation (RMSProp)** that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).

Adam realizes the benefits of both **AdaGrad** and **RMSProp**.
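A single Adam update can be sketched as follows. This is a simplified NumPy version of the standard update rule with its usual default hyperparameters, not the Keras implementation; the toy objective f(w) = (w - 3)^2 and the learning rate of 0.05 are assumptions for the demonstration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `m` and `v` are running averages of the gradient
    and of the squared gradient; `t` is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print(w)  # ends up close to the minimiser w = 3
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective learning rate.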

Adam is a popular algorithm in the field of deep learning because it achieves good results fast.

The metric function "accuracy" is used to evaluate the performance of our model. A metric function is similar to the loss function, except that the results of the metric evaluation are not used when training the model (only for evaluation).

I have used 5 epochs and a batch size of 15.

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train, batch_size=15, epochs=5)

In [None]:
score=model.evaluate(X_test,y_test)
print(score)

In [None]:
y_pred=model.predict(X_test)
y_test=pd.DataFrame(y_test)

# Confusion Matrix

In [None]:
cnf_matrix=confusion_matrix(y_test,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

Our goal is to reduce the false negatives to a bare minimum. A false negative is a fraudulent transaction that the model predicted as normal, which is a very serious error because it means the model failed to identify the fraud.
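The sketch below shows how predicted probabilities become a confusion matrix and where the false negatives sit in it. It uses the same layout as `confusion_matrix` (rows are true labels, columns are predicted labels). The labels and probabilities are made up, and the `threshold` parameter is an assumption added for illustration; calling `.round()` on the predictions corresponds to a threshold of 0.5.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return [[TN, FP], [FN, TP]] for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_hat == 0)))
    fp = int(np.sum((y_true == 0) & (y_hat == 1)))
    fn = int(np.sum((y_true == 1) & (y_hat == 0)))
    tp = int(np.sum((y_true == 1) & (y_hat == 1)))
    return np.array([[tn, fp], [fn, tp]])

y_true = [0, 0, 1, 1, 1]                 # hypothetical labels (1 = fraud)
y_prob = [0.05, 0.40, 0.30, 0.45, 0.90]  # hypothetical model outputs

cm_default = confusion_counts(y_true, y_prob)               # [[2, 0], [2, 1]]
cm_lower = confusion_counts(y_true, y_prob, threshold=0.25)  # [[1, 1], [0, 3]]
print(cm_default[1, 0], cm_lower[1, 0])  # false negatives: 2, then 0
```

In this toy case a lower threshold removes the false negatives at the price of one extra false positive, which is the trade-off to keep in mind when reading the matrices in this kernel.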

# Plotting Confusion matrix for entire dataset

In [None]:
y_pred=model.predict(X)
y_test=pd.DataFrame(y)
cnf_matrix=confusion_matrix(y_test,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

Our model is not able to identify all the fraudulent transactions in the entire dataset either, as the number of false negatives is 122.

# Applying Random Forest Classifier

In [None]:
X=data.iloc[:,data.columns!='Class']
y=data.iloc[:,data.columns=='Class']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

In [None]:
from sklearn.ensemble import RandomForestClassifier
random_forest=RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train,y_train.values.ravel())

In [None]:
y_pred=random_forest.predict(X_test)

In [None]:
cnf_matrix=confusion_matrix(y_test,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

There is a significant improvement: the false negatives are down to 35, but we still have to reduce this number to a bare minimum.

# Confusion Matrix for Entire dataset

In [None]:
y_pred=random_forest.predict(X)

cnf_matrix=confusion_matrix(y,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

For the entire dataset the false negatives are reduced to 35, compared with 122 for the deep learning model on the entire dataset. There is still room for further improvement.

# Applying Decision Tree Classifier

In [None]:
X=data.iloc[:,data.columns!='Class']
y=data.iloc[:,data.columns=='Class']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)


In [None]:
from sklearn.tree import DecisionTreeClassifier
decc=DecisionTreeClassifier()
decc.fit(X_train,y_train.values.ravel())

In [None]:
y_pred=decc.predict(X_test)

In [None]:
decc.score(X_test,y_test)

In [None]:
cnf_matrix=confusion_matrix(y_test,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

# Confusion Matrix for Entire dataset

In [None]:
y_pred=decc.predict(X)

cnf_matrix=confusion_matrix(y,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

There is still a significant number of false negatives, so we apply undersampling and oversampling techniques to see whether the performance of the model improves.

# What is the Class Imbalance Problem?
It is the problem in machine learning where the total number of a class of data (positive) is far less than the total number of another class of data (negative). This problem is extremely common in practice and can be observed in various disciplines including fraud detection, anomaly detection, medical diagnosis, oil spillage detection, facial recognition, etc.

# Why is it a problem?
Most machine learning algorithms work best when the number of instances of each class is roughly equal. When the number of instances of one class far exceeds the other, problems arise. This is best illustrated below with an example.

Given a dataset of transaction data, we would like to find out which transactions are fraudulent and which are genuine. It is highly costly to the e-commerce company if a fraudulent transaction goes through, as this impacts our customers' trust in us and costs us money. So we want to catch as many fraudulent transactions as possible.

If there is a dataset consisting of 10000 genuine and 10 fraudulent transactions, the classifier will tend to classify fraudulent transactions as genuine transactions. The reason can be easily explained by the numbers. Suppose the machine learning algorithm has two possible outputs as follows:

1. Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions and 10 out of 10000 genuine transactions as fraudulent transactions.
2. Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions and 100 out of 10000 genuine transactions as fraudulent transactions.

If the classifier’s performance is determined by the number of mistakes, then clearly Model 1 is better, as it makes only 17 mistakes in total while Model 2 makes 102. However, as we want to minimize the number of fraudulent transactions that go through, we should pick Model 2 instead, which made only 2 mistakes in classifying the fraudulent transactions. Of course, this comes at the expense of more genuine transactions being classified as fraudulent, but that is a cost we can bear for now. A general machine learning algorithm, however, will pick Model 1 over Model 2, which is a problem. In practice, this means we will let a lot of fraudulent transactions go through although we could have stopped them by using Model 2. This translates to unhappy customers and money lost for the company.

**How to tell the machine learning algorithm which is the better solution?**

To tell the machine learning algorithm (or the researcher) that Model 2 is better than Model 1, we need better metrics than just counting the number of mistakes made.

We introduce the concept of True Positive, True Negative, False Positive and False Negative:

1. True Positive (TP) – An example that is positive and is classified correctly as positive<br>
2. True Negative (TN) – An example that is negative and is classified correctly as negative<br>
3. False Positive (FP) – An example that is negative but is classified wrongly as positive<br>
4. False Negative (FN) – An example that is positive but is classified wrongly as negative<br>
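Applying these four counts to the Model 1 and Model 2 example above shows why counting mistakes alone is misleading. The sketch below treats fraud as the positive class and derives each count from the numbers in the example (10 fraudulent and 10000 genuine transactions); the helper name `summarize` is made up for illustration.

```python
def summarize(tp, fn, fp, tn):
    """Compute simple metrics from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "mistakes": fn + fp,
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),     # share of frauds that were caught
        "precision": tp / (tp + fp),  # share of fraud alerts that were real
    }

# Model 1: 7 of 10 frauds missed, 10 of 10000 genuine transactions flagged.
model_1 = summarize(tp=3, fn=7, fp=10, tn=9990)
# Model 2: 2 of 10 frauds missed, 100 of 10000 genuine transactions flagged.
model_2 = summarize(tp=8, fn=2, fp=100, tn=9900)

print(model_1)
print(model_2)
```

Model 1 wins on mistakes and accuracy, while Model 2 wins on recall, which is the quantity that matters when missing a fraud is the expensive error.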

**Sampling based approaches**

These can be roughly classified into two categories:

1. Undersampling, by removing some of the majority class so it has less effect on the machine learning algorithm
2. Oversampling, by adding more of the minority class so it has more effect on the machine learning algorithm

**1. Undersampling:** By undersampling, we risk removing some of the more representative majority class instances, thus discarding useful information. This can be illustrated as follows:
<img src="https://image.ibb.co/cuD6Vq/Undersampling-580x197.png">

Here the green line is the ideal decision boundary we would like to have, and the blue line is the actual result. On the left is the result of applying a general machine learning algorithm without undersampling. On the right, we undersampled the negative class but removed some informative negative examples, which caused the blue decision boundary to slant, so that some negative examples are wrongly classified as positive.

# Applying Undersampling

In [None]:
fraud_indices=np.array(data[data.Class==1].index)
no_records_fraud=len(fraud_indices)
print(no_records_fraud)

In [None]:
normal_indices=data[data.Class==0].index

In [None]:
random_normal_indices=np.random.choice(normal_indices,no_records_fraud,replace=False)
random_normal_indices=np.array(random_normal_indices)
print(len(random_normal_indices))

In [None]:
under_sample_indices=np.concatenate([fraud_indices,random_normal_indices])
print(len(under_sample_indices))

In [None]:
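# The collected values are index labels, so select with .loc
# (with the default integer index this picks the same rows as .iloc).
under_sample_data=data.loc[under_sample_indices,:]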

In [None]:
under_sample_data.head()

In [None]:
X_undersample=under_sample_data.iloc[:,under_sample_data.columns!='Class']
y_undersample=under_sample_data.iloc[:,under_sample_data.columns=='Class']

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X_undersample, y_undersample, test_size = 0.3, random_state=0)

In [None]:
X_train = np.array(X_train)
X_test=np.array(X_test)
y_train=np.array(y_train)
y_test=np.array(y_test)

# Applying Keras Sequential model on undersampled dataset

In [None]:
model = Sequential([
     Dense(units=16, input_dim=29, activation='relu'),
     Dense(units=24,activation='relu'),
     Dropout(0.5),
     Dense(20,activation='relu'),
     Dense(24,activation='relu'),
     Dense(1,activation='sigmoid')  
    
    
])

model.summary()

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train, batch_size=15, epochs=5)

In [None]:
y_pred=model.predict(X_test)
y_expected=pd.DataFrame(y_test)

cnf_matrix=confusion_matrix(y_expected,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

# Confusion Matrix for Entire dataset

In [None]:
y_pred=model.predict(X)

cnf_matrix=confusion_matrix(y,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

# Synthetic Minority Over-sampling Technique (SMOTE)
Using a machine learning algorithm out of the box is problematic when one class in the training set dominates the other. SMOTE addresses this problem by synthesizing additional minority class examples.
<img src="https://image.ibb.co/iZjL5q/SMOTE.png">

This is a statistical technique for increasing the number of cases in your dataset in a balanced way. It works by generating new instances from the existing minority cases that you supply as input. SMOTE does not change the number of majority cases.

The new instances are not just copies of existing minority cases; instead, the algorithm takes samples of the feature space for each target class and its nearest neighbors, and generates new examples that combine features of the target case with features of its neighbors. This approach increases the features available to each class and makes the samples more general.
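The interpolation idea can be sketched in NumPy as follows. This is a simplified illustration, not the imblearn implementation: the function name `smote_sketch`, the brute-force neighbour search, and the random example points are all assumptions.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, rng=None):
    """Create `n_new` synthetic points, each lying on the line segment
    between a random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X_minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances; a sample must not be its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))  # hypothetical minority class: 10 points, 2 features
X_new = smote_sketch(X_min, n_new=20, k=3, rng=rng)
print(X_new.shape)
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class.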

SMOTE takes the entire dataset as an input, but it increases the percentage of only the minority cases. For example, suppose you have an imbalanced dataset where just 1% of the cases have the target value A (the minority class), and 99% of the cases have the value B. In implementations that expose a SMOTE percentage setting, you would enter 200 to increase the percentage of minority cases to twice the previous percentage. The imblearn `SMOTE` class used below is instead controlled through its `sampling_strategy` argument and, by default, oversamples the minority class until it matches the majority class.

<img src="https://image.ibb.co/iEudQq/smote-1.png">

In [None]:

# fit_resample replaces fit_sample, which was removed from newer imblearn releases.
X_resample,y_resample=SMOTE().fit_resample(X,y.values.ravel())

In [None]:
y_resample=pd.DataFrame(y_resample)
X_resample=pd.DataFrame(X_resample)



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resample, y_resample, test_size = 0.3, random_state=0)

In [None]:
X_train = np.array(X_train)
X_test=np.array(X_test)
y_train=np.array(y_train)
y_test=np.array(y_test)

# Applying Keras Sequential Model on Oversampled dataset

In [None]:
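# Note: `model` is the network already trained on the undersampled data.
# Compiling again does not reset its weights, so training continues from them.
model.compile(optimizer='adam', loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train, batch_size=15, epochs=5)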

In [None]:
y_pred=model.predict(X_test)
y_expected=pd.DataFrame(y_test)

cnf_matrix=confusion_matrix(y_expected,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

# Confusion Matrix for entire dataset

In [None]:
y_pred=model.predict(X)

cnf_matrix=confusion_matrix(y,y_pred.round())
plot_confusion_matrix(cnf_matrix,classes=[0,1])
plt.show()

# Conclusion
Our goal was to reduce the false negatives, and with the SMOTE oversampling technique we reduced them significantly, to 2. Going further, we can increase the number of epochs and try the machine learning algorithms applied above to improve the result and bring the false negatives down to zero.

**If you like this kernel, please UPVOTE it, and don't forget to share your valuable suggestions.**