# Project: Credit Card Fraud Detection

## Table of Contents

1. [Importing Libraries](#section-one)
2. [Loading the raw data](#section-two)
3. [Data Pre-Processing](#section-three)
4. [Model Training](#section-four)
5. [Model Evaluation](#section-five)

<a id="section-one"></a>
## 1. Importing Libraries 

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

<a id="section-two"></a>
## 2. Loading the raw data 

In [2]:
# Load the CSV file into a pandas DataFrame

raw_data = pd.read_csv('C:\\Users\\Nishant Gupta\\Desktop\\Projects\\classification_project_Nishant_Gupta\\creditcard.csv')

# Let's have a look at the data

raw_data.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


The first column, 'Time', holds the transaction time. Columns V1 to V28 are anonymized features of the transaction; their real names are withheld because they are sensitive information. The 'Amount' column gives the transaction amount in US dollars. Finally, the 'Class' column is the target: it takes only two values, 0 for a legit transaction and 1 for a fraudulent one.
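The column layout described above can also be checked in code. This is a minimal sketch that is not part of the original notebook: `expected_columns` and `has_expected_layout` are hypothetical names, and the two-row DataFrame is only a stand-in for the real file.

```python
import pandas as pd

# Expected layout: Time, V1..V28, Amount, Class (31 columns in total)
expected_columns = ['Time'] + [f'V{i}' for i in range(1, 29)] + ['Amount', 'Class']


def has_expected_layout(df):
    """Return True if df has exactly the expected columns, in order,
    and 'Class' contains only the values 0 and 1."""
    same_columns = list(df.columns) == expected_columns
    return same_columns and set(df['Class'].unique()) <= {0, 1}


# A two-row stand-in for the real data (all feature values set to 0.0)
toy = pd.DataFrame(0.0, index=range(2), columns=expected_columns)
toy['Class'] = [0, 1]

print(len(expected_columns))
print(has_expected_layout(toy))
```

Running the same check on `raw_data` right after loading would catch a wrong or truncated file before any modelling starts.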

### Information about the dataset 

In [3]:
# let's check the number of rows and columns in the dataset

raw_data.shape

(284807, 31)

We have 284807 rows and 31 columns in this dataset. 

In [4]:
# Some information about the dataset

raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [5]:
# Let's have a look at the descriptive statistics of the dataset

raw_data.describe(include='all')

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,...,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.918649e-15,5.682686e-16,-8.761736e-15,2.811118e-15,-1.552103e-15,2.04013e-15,-1.698953e-15,-1.893285e-16,-3.14764e-15,...,1.47312e-16,8.042109e-16,5.282512e-16,4.456271e-15,1.426896e-15,1.70164e-15,-3.662252e-16,-1.217809e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,...,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,...,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,...,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,...,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,...,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,...,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


In [6]:
# Checking the total number of missing values

raw_data.isnull().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

Great! There are no missing values, so we don't need to drop or impute anything.

In [7]:
# We have checked the raw data and there aren't any missing values!

data_no_mv = raw_data.copy()

In [8]:
# Checking the number of legit transactions and fraudulent transactions

data_no_mv['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

The dataset is highly imbalanced: only 492 of the 284,807 transactions (about 0.17%) are fraudulent, and all the others are legit.
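The degree of imbalance can be worked out from the counts printed above. The sketch below is illustrative only; the variable names are assumptions, and the two counts are copied from the `value_counts()` output.

```python
# Class counts taken from the value_counts() output above
legit_count = 284315
fraud_count = 492
total = legit_count + fraud_count

# Share of fraudulent transactions in the whole dataset
fraud_share = fraud_count / total
print(f'Fraud share: {fraud_share:.4%}')

# A model that always predicts "legit" would still reach this accuracy,
# which is why plain accuracy is misleading on the raw data
baseline_accuracy = legit_count / total
print(f'Always-legit accuracy: {baseline_accuracy:.4%}')
```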

In [9]:
# Separating the data of legit and fraud transactions

legit_trans = data_no_mv[data_no_mv.Class == 0]
fraud_trans = data_no_mv[data_no_mv.Class == 1]

# Let's have a look at the shape

print(legit_trans.shape)
print(fraud_trans.shape)

(284315, 31)
(492, 31)


In [10]:
# Let's see the statistical measures of this data

legit_trans.Amount.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [12]:
# Compare the mean of every column for legit (0) and fraudulent (1) transactions

data_no_mv.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94838.202258,0.008258,-0.006271,0.012171,-0.00786,0.005453,0.002419,0.009637,-0.000987,0.004467,...,-0.000644,-0.001235,-2.4e-05,7e-05,0.000182,-7.2e-05,-8.9e-05,-0.000295,-0.000131,88.291022
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


<a id="section-three"></a>
## 3. Data Pre-Processing 

### Under-Sampling 

We'll build a sample dataset with an equal number of legit and fraudulent transactions. Since there are 492 fraudulent transactions, we randomly pick 492 legit transactions and combine the two groups into a balanced dataset for the model.

In [13]:
# taking random data from 'legit_trans'

legit_trans_data = legit_trans.sample(n = 492)

### Concatenating the two dataframes 

In [14]:
data = pd.concat([legit_trans_data, fraud_trans], axis = 0)

# Let's have a look at the new dataframe

data.head()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
93818,64597.0,-0.918553,0.01583,2.12092,1.553505,-0.423409,0.833818,0.339391,0.406425,-0.122008,...,0.163667,0.463342,0.142858,0.233866,0.061847,-0.168984,0.119155,0.123019,144.0,0
274770,166197.0,2.062917,0.598278,-3.171718,0.619652,1.216218,-1.370544,0.673803,-0.300302,-0.126312,...,-0.023905,0.0656,-0.040816,0.574319,0.340788,0.661243,-0.090195,-0.024159,0.76,0
37795,39112.0,-0.991217,-1.526123,0.378366,-3.452506,2.041132,2.919554,-1.165466,0.955882,-2.424556,...,-0.240513,-0.93063,0.234,0.921573,0.066079,-0.557546,0.048891,0.119853,70.65,0
147906,89150.0,-0.607481,1.591984,-0.50144,-0.831155,1.14326,-0.806412,1.430396,-0.582221,0.82991,...,-0.566223,-0.769652,0.068244,0.578502,-0.214609,0.073541,0.338678,-0.111048,9.82,0
180048,124355.0,2.020676,-1.233668,0.383631,-0.270398,-1.88789,-0.314573,-1.55564,0.105692,0.600893,...,-0.143898,0.160355,0.318351,0.039336,-0.710591,0.62327,0.021355,-0.026448,33.85,0


In [15]:
data.tail()

,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
279863,169142.0,-1.927883,1.125653,-4.518331,1.749293,-1.566487,-2.010494,-0.88285,0.697211,-2.064945,...,0.778584,-0.319189,0.639419,-0.294885,0.537503,0.788395,0.29268,0.147968,390.0,1
280143,169347.0,1.378559,1.289381,-5.004247,1.41185,0.442581,-1.326536,-1.41317,0.248525,-1.127396,...,0.370612,0.028234,-0.14564,-0.081049,0.521875,0.739467,0.389152,0.186637,0.76,1
280149,169351.0,-0.676143,1.126366,-2.2137,0.468308,-1.120541,-0.003346,-2.234739,1.210158,-0.65225,...,0.751826,0.834108,0.190944,0.03207,-0.739695,0.471111,0.385107,0.194361,77.89,1
281144,169966.0,-3.113832,0.585864,-5.39973,1.817092,-0.840618,-2.943548,-2.208002,1.058733,-1.632333,...,0.583276,-0.269209,-0.456108,-0.183659,-0.328168,0.606116,0.884876,-0.2537,245.0,1
281674,170348.0,1.991976,0.158476,-2.583441,0.40867,1.151147,-0.096695,0.22305,-0.068384,0.577829,...,-0.16435,-0.295135,-0.072173,-0.450261,0.313267,-0.289617,0.002988,-0.015309,42.53,1


In [16]:
data['Class'].value_counts()

0    492
1    492
Name: Class, dtype: int64

Nice! Now we have a balanced dataset with 492 legit and 492 fraudulent transactions.
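Because `sample()` draws different rows on every run, the numbers in this notebook change from run to run. The following is a minimal sketch of a reproducible variant, assuming a fixed `random_state` is acceptable. The toy DataFrame and the names `toy`, `legit`, `fraud`, `legit_sample` and `balanced` are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 1,000 legit rows (Class 0) and 20 fraud rows (Class 1)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Amount': rng.uniform(0, 100, size=1020),
    'Class': [0] * 1000 + [1] * 20,
})

legit = toy[toy['Class'] == 0]
fraud = toy[toy['Class'] == 1]

# Draw as many legit rows as there are fraud rows; a fixed random_state
# makes the same rows come back on every run
legit_sample = legit.sample(n=len(fraud), random_state=42)

balanced = pd.concat([legit_sample, fraud], axis=0)

# Shuffle so the two classes are not stored in two contiguous blocks
balanced = balanced.sample(frac=1, random_state=42)

print(balanced['Class'].value_counts())
```

Using `n=len(fraud)` instead of a hard-coded 492 keeps the two classes the same size if the input data ever changes.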

In [17]:
data.groupby('Class').mean()

Class,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount
0,94374.121951,-0.022516,-0.043333,-0.037764,-0.070177,-0.031141,0.051094,-0.016333,-0.057324,-0.11134,...,0.035366,-0.055067,0.017863,-0.017418,0.006146,0.033882,0.004187,0.00665,0.004861,99.98378
1,80746.806911,-4.771948,3.623778,-7.033281,4.542029,-3.151225,-1.397737,-5.568731,0.570636,-2.581123,...,0.372319,0.713588,0.014049,-0.040308,-0.10513,0.041449,0.051648,0.170575,0.075667,122.211321


### Splitting the Features and Targets 

In [18]:
# Target: the 'Class' label; features: every other column
y = data['Class']
x = data.drop(columns = 'Class')

In [19]:
# Let's have a look

print(x)

            Time        V1        V2        V3        V4        V5        V6  \
93818    64597.0 -0.918553  0.015830  2.120920  1.553505 -0.423409  0.833818   
274770  166197.0  2.062917  0.598278 -3.171718  0.619652  1.216218 -1.370544   
37795    39112.0 -0.991217 -1.526123  0.378366 -3.452506  2.041132  2.919554   
147906   89150.0 -0.607481  1.591984 -0.501440 -0.831155  1.143260 -0.806412   
180048  124355.0  2.020676 -1.233668  0.383631 -0.270398 -1.887890 -0.314573   
...          ...       ...       ...       ...       ...       ...       ...   
279863  169142.0 -1.927883  1.125653 -4.518331  1.749293 -1.566487 -2.010494   
280143  169347.0  1.378559  1.289381 -5.004247  1.411850  0.442581 -1.326536   
280149  169351.0 -0.676143  1.126366 -2.213700  0.468308 -1.120541 -0.003346   
281144  169966.0 -3.113832  0.585864 -5.399730  1.817092 -0.840618 -2.943548   
281674  170348.0  1.991976  0.158476 -2.583441  0.408670  1.151147 -0.096695   

              V7        V8        V9  .

In [20]:
print(y)

93818     0
274770    0
37795     0
147906    0
180048    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64


### Splitting the data into training data & testing data 

In [21]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, stratify = y, random_state = 101)

In [22]:
# The shapes show how many observations went into the training and testing sets

print(x.shape, x_train.shape, x_test.shape)

(984, 30) (738, 30) (246, 30)
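The `stratify = y` argument keeps the class ratio the same in both splits. The sketch below shows this on a synthetic stand-in with the same size and balance as the dataset above; the `toy_*` names and the single `feature` column are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class balance as above: 492 rows per class
toy_x = pd.DataFrame({'feature': range(984)})
toy_y = pd.Series([0] * 492 + [1] * 492, name='Class')

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=101
)

# With stratify, each split keeps the 50/50 class ratio of the full data
print(toy_y_train.value_counts())
print(toy_y_test.value_counts())
```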


In [23]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

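# Fit the scaler on the training data only, then reuse its mean and
# standard deviation for the test data so both sets share one scale
x_train = scaler.fit_transform(x_train)

x_test = scaler.transform(x_test)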
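Scaling statistics are best learned from the training split only, so that no information from the test data reaches the model. One way to guarantee this is to wrap the scaler and the model in a scikit-learn pipeline. This is a hedged sketch of that alternative, not the approach used in this notebook; the data is synthetic (`make_classification`) and the `toy_*` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the balanced dataset (984 rows, 30 features)
toy_x, toy_y = make_classification(n_samples=984, n_features=30, random_state=0)
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=101
)

# fit() learns the scaling statistics from the training data only;
# score() then applies those same statistics to the test data
toy_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
toy_pipeline.fit(toy_x_train, toy_y_train)

toy_score = toy_pipeline.score(toy_x_test, toy_y_test)
print(toy_score)
```

A pipeline also makes it harder to call `fit_transform` on the test data by mistake, because the scaler is never handled separately.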

<a id="section-four"></a>
## 4. Model Training 

### Logistic Regression 

In [24]:
model = LogisticRegression()

In [25]:
# training the Logistic Regression Model with Training Data

model.fit(x_train, y_train)

LogisticRegression()

<a id="section-five"></a>
## 5. Model Evaluation 

In [26]:
# Accuracy Score
# we will find the accuracy on training data

from sklearn.metrics import accuracy_score

x_train_pred = model.predict(x_train)

train_data_accuracy = accuracy_score(y_train, x_train_pred)

In [27]:
print('Accuracy on training data: ', train_data_accuracy)

Accuracy on training data:  0.9471544715447154


In [28]:
# we will find accuracy on test data

x_test_pred = model.predict(x_test)

test_data_accuracy = accuracy_score(y_test, x_test_pred)

In [29]:
print('Accuracy on testing data: ', test_data_accuracy)

Accuracy on testing data:  0.9552845528455285


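Comparing the two scores shows how well the model generalizes. If the training accuracy is much higher than the testing accuracy, the model is most likely overfitted; if both scores are low, it is most likely underfitted. Here the two scores are close (about 0.95 and 0.96), so there is no sign of overfitting.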
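Accuracy alone does not show which kind of mistake the model makes. As a hedged sketch that is not part of the original notebook, a confusion matrix together with precision and recall breaks the errors down per class. The data below is synthetic (`make_classification`), and all `toy_*` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced dataset (984 rows, 30 features)
toy_x, toy_y = make_classification(n_samples=984, n_features=30, random_state=0)
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, stratify=toy_y, random_state=101
)

toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_x_train, toy_y_train)
toy_pred = toy_model.predict(toy_x_test)

# Rows are the true classes, columns are the predicted classes
toy_cm = confusion_matrix(toy_y_test, toy_pred)

# Precision: share of predicted positives that are truly positive
# Recall: share of true positives that the model found
toy_precision = precision_score(toy_y_test, toy_pred)
toy_recall = recall_score(toy_y_test, toy_pred)

print(toy_cm)
print('Precision:', toy_precision)
print('Recall:', toy_recall)
```

In this notebook the same calls would take `y_test` and `x_test_pred`, where recall on class 1 tells how many fraudulent transactions were caught.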