
# 1. Introduction
In this project we analyze a dataset of credit card transactions made over a two-day period in September 2013 by European cardholders. The dataset contains 284,807 transactions, of which 492 (0.17%) are fraudulent.

Each transaction has 30 features, all of which are numerical. The features `V1, V2, ..., V28` are the result of a PCA transformation. To protect confidentiality, background information on these features is not available. The `Time` feature contains the time elapsed since the first transaction, and the `Amount` feature contains the transaction amount. The response variable, `Class`, is 1 in the case of fraud, and 0 otherwise.

Our goal in this project is to construct models to predict whether a credit card transaction is fraudulent. We'll attempt a supervised learning approach. We'll also create visualizations to help us understand the structure of the data and unearth any interesting patterns.

# 2. Getting Started
Import basic libraries:

In [1]:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Pandas options
pd.set_option('display.max_colwidth', 1000, 'display.max_rows', None, 'display.max_columns', None)

# Plotting options
%matplotlib inline
mpl.style.use('ggplot')
sns.set(style='whitegrid')

Read the data into a pandas DataFrame.

In [4]:
transactions = pd.read_csv('creditcard.csv')

Check basic metadata.

In [5]:
transactions.shape

(27819, 31)
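The row count above is far below the 284,807 transactions described in the introduction, which suggests that the CSV loaded here may be a truncated copy. A minimal sketch of a sanity check follows. The helper name `check_row_count` and the toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

EXPECTED_ROWS = 284_807  # full dataset size quoted in the introduction

def check_row_count(df, expected=EXPECTED_ROWS):
    """Return the fraction of the expected rows actually present in df."""
    return len(df) / expected

# Toy stand-in for `transactions` with the same number of rows (27,819).
toy = pd.DataFrame({'Time': range(27_819)})
coverage = check_row_count(toy)
print(f"Loaded {len(toy):,} of {EXPECTED_ROWS:,} expected rows ({coverage:.1%})")
```

Running the same check on the real `transactions` frame right after `pd.read_csv` would flag an incomplete download before any modeling starts.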

In [6]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27819 entries, 0 to 27818
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    27819 non-null  int64  
 1   V1      27818 non-null  float64
 2   V2      27818 non-null  float64
 3   V3      27818 non-null  float64
 4   V4      27818 non-null  float64
 5   V5      27818 non-null  float64
 6   V6      27818 non-null  float64
 7   V7      27818 non-null  float64
 8   V8      27818 non-null  float64
 9   V9      27818 non-null  float64
 10  V10     27818 non-null  float64
 11  V11     27818 non-null  float64
 12  V12     27818 non-null  float64
 13  V13     27818 non-null  float64
 14  V14     27818 non-null  float64
 15  V15     27818 non-null  float64
 16  V16     27818 non-null  float64
 17  V17     27818 non-null  float64
 18  V18     27818 non-null  float64
 19  V19     27818 non-null  float64
 20  V20     27818 non-null  float64
 21  V21     27818 non-null  float64
 22

Are there any variables with missing data?

In [7]:
transactions.isnull().any().any()

True
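`isnull().any().any()` only reports that something is missing. To see where, one option is to count missing values per column and pull out the affected rows. The sketch below uses a small made-up frame in place of `transactions`; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `transactions`: the last row is incomplete.
toy = pd.DataFrame({
    'Time':   [0, 1, 2, 3],
    'V1':     [0.1, -1.2, 0.7, np.nan],
    'Amount': [10.0, 2.5, 99.9, np.nan],
    'Class':  [0.0, 0.0, 1.0, np.nan],
})

# Missing values per column, keeping only columns that have any.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

# Rows with at least one missing value.
incomplete_rows = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(incomplete_rows.index.tolist())
```

In the `info()` output above, the `V` columns show 27,818 non-null values out of 27,819 entries, which is consistent with a single incomplete row.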

Let's view five randomly chosen transactions.

In [8]:
transactions.sample(5)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
20,16,0.694885,-1.361819,1.029221,0.834159,-1.191209,1.309109,-0.878586,0.44529,-0.446196,0.568521,1.019151,1.298329,0.42048,-0.372651,-0.80798,-2.044557,0.515663,0.625847,-1.300408,-0.138334,-0.295583,-0.571955,-0.050881,-0.304215,0.072001,-0.422234,0.086553,0.063499,231.71,0.0
1759,1358,-2.029822,1.866869,0.604924,-1.310795,0.571774,0.386385,0.33167,-1.694331,0.752595,1.175716,0.328247,1.164243,1.192176,-0.717345,-0.977696,0.313715,-1.064555,0.059005,0.370919,0.289391,1.134681,-0.941047,0.066733,-0.827689,0.18481,0.341179,0.543111,0.173462,5.0,0.0
11310,19686,-0.662631,0.605865,1.835166,-1.753022,0.043807,-1.205914,0.636009,-0.523757,2.331162,-1.550088,0.753518,-1.929827,2.076437,1.165789,0.463515,-0.324997,0.01599,0.305714,-0.201912,-0.161475,-0.187687,-0.11512,-0.152142,0.311345,-0.333907,-1.168425,-0.2503,-0.076448,9.85,0.0
24457,33246,1.287311,-0.19639,-0.245885,-1.203786,0.178708,0.246528,-0.136541,0.084885,1.086622,-0.944045,0.638881,1.550376,0.669171,0.203636,0.444027,-0.661157,-0.3795,0.022808,1.203313,-0.080333,-0.185025,-0.268175,-0.159404,-1.131976,0.728215,-0.727489,0.073112,-0.000983,1.0,0.0
15532,26922,1.270554,0.186507,0.189673,0.310732,-0.000698,-0.188303,-0.038634,-0.020051,-0.19506,0.095972,0.784917,0.914331,0.618352,0.338621,0.365549,0.72926,-0.988052,0.193508,0.479656,-0.034356,-0.255666,-0.765728,0.016557,-0.495445,0.304888,0.127299,-0.032882,0.001449,4.49,0.0


How balanced are the classes, i.e. how common are fraudulent transactions?

In [9]:
transactions['Class'].value_counts()

0.0    27725
1.0       93
Name: Class, dtype: int64

In [10]:
transactions['Class'].value_counts(normalize=True)

0.0    0.996657
1.0    0.003343
Name: Class, dtype: float64

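Only 0.33% of the labeled transactions loaded here (93 out of 27,818) are fraudulent. The rate differs from the 0.17% quoted in the introduction because the file loaded here holds only 27,819 of the 284,807 rows.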
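The arithmetic behind the fraud rate can be reproduced directly from the `value_counts()` output. This is a small sketch; the variable names are illustrative.

```python
# Counts taken from the value_counts() output above.
n_legit = 27_725
n_fraud = 93

n_labeled = n_legit + n_fraud
fraud_rate = n_fraud / n_labeled
imbalance_ratio = n_legit / n_fraud  # legitimate transactions per fraudulent one

print(f"{n_fraud} of {n_labeled:,} labeled transactions are fraudulent ({fraud_rate:.2%})")
print(f"Roughly {imbalance_ratio:.0f} legitimate transactions per fraudulent one")
```

With an imbalance this strong, plain accuracy is a weak yardstick: a model that always predicts class 0 would already be right on more than 99% of these rows.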

# 3. Train/Test Split
Before we begin preprocessing, we split off a test data set. First split the data into features and response variable:

In [11]:
X = transactions.drop(labels='Class', axis=1) # Features
y = transactions.loc[:,'Class']               # Response
del transactions                              # Delete the original data

We'll use a test size of 20%. We also stratify the split on the response variable. This matters because fraudulent transactions are so rare: without stratification, the test set could end up with very few fraud cases, or none at all.

In [12]:
from sklearn.model_selection import train_test_split

In [14]:
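# Drop the incomplete row: its NaN label is the likely cause of the ValueError on a stratified split
complete = X.notna().all(axis=1) & y.notna()
X_train, X_test, y_train, y_test = train_test_split(X[complete], y[complete], test_size=0.2, random_state=1, stratify=y[complete])
del X, y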
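One way to confirm that stratification does what the text describes is to compare class proportions in the two subsets. The sketch below does this on synthetic labels rather than the real data; the 990/10 class counts are an assumption chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 990 negatives and 10 positives.
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# With stratification, both subsets keep the 1% positive rate.
print(y_train.mean(), y_test.mean())
```

The same comparison on the real split would use `y_train.mean()` and `y_test.mean()`, which should both sit close to the overall fraud rate.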