# Project 3: Detecting Credit Card Fraud

The dataset is from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).

### Dataset Overview:
* Columns:
    * Time (float)
    * V1 - V28 (float)
    * Amount (float)
    * Class (int): 1 for fraud, 0 otherwise
* About the dataset:
    * The data contains credit card transactions made in September 2013 by European cardholders. The input variables V1 - V28 are numerical values produced by a PCA transformation.
    * The dataset contains 492 frauds out of 284,807 transactions recorded over two days. It is highly imbalanced: the positive class (fraud) accounts for only 0.172% of all transactions.

## 1. Importing Libraries


In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


## 2. Importing and Reading the Dataset

In [9]:
data = pd.read_csv('/Users/johnnysin/Downloads/creditcard.csv')
data.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


In [10]:
# data information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

In [11]:
# count the missing values in each column
data.isnull().sum()


Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
Amount    0
Class     0
dtype: int64

## 2. Distribution of Normal and Fraud transactions

In [13]:
data['Class'].value_counts()

0    284315
1       492
Name: Class, dtype: int64

The dataset is heavily imbalanced: only 492 of the 284,807 transactions are fraud cases (Class = 1).
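
To make the imbalance concrete, the two class counts printed above can be turned into rates. The sketch below uses only those two counts; the variable names are illustrative and not part of the notebook. It also suggests why plain accuracy would be misleading on the full dataset: a model that labels every transaction as normal would already be right more than 99.8% of the time.

```python
# Class counts copied from the value_counts() output above.
n_normal = 284_315
n_fraud = 492
n_total = n_normal + n_fraud

# Share of fraud cases in the full dataset.
fraud_rate = n_fraud / n_total

# Accuracy of a trivial "model" that always predicts class 0 (normal).
baseline_accuracy = n_normal / n_total

print(f"Fraud rate: {fraud_rate:.3%}")
print(f"Accuracy of always predicting 'normal': {baseline_accuracy:.3%}")
```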

In [14]:
# separate the normal and fraud transactions
normal = data[data['Class'] == 0]
fraud = data[data['Class'] == 1]

In [16]:
normal.shape

(284315, 31)

In [17]:
fraud.shape

(492, 31)

## 3. Creating a new dataset
##### *choosing 492 random transactions from the normal dataset*

In [19]:
normal_sample = normal.sample(n=492)

In [20]:
normal_sample

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 237705 | 149360.0 | -0.246109 | 1.194286 | -0.912552 | -0.421167 | 0.886964 | -1.165933 | 1.027895 | -0.131679 | -0.250516 | ... | 0.481089 | 1.593507 | -0.134918 | 1.104093 | -0.797107 | 0.586616 | -0.210922 | -0.009845 | 5.95 | 0 |
| 168910 | 119436.0 | -4.222861 | -2.125824 | -3.610840 | -0.686268 | 1.559709 | -2.227341 | 0.967344 | -0.325003 | -0.584377 | ... | -0.786836 | 1.263283 | 1.629390 | 0.732590 | -0.591560 | 0.702508 | 0.590679 | -0.579535 | 77.60 | 0 |
| 200867 | 133614.0 | 0.101036 | 0.880332 | 0.834792 | 2.666703 | 1.335936 | 0.904359 | 0.225514 | 0.261193 | -1.091926 | ... | -0.062196 | -0.251843 | 0.358570 | -0.098604 | -1.474830 | -0.586392 | 0.202270 | 0.192468 | 4.57 | 0 |
| 163482 | 115971.0 | 2.262538 | -1.745051 | -0.334884 | -1.650563 | -1.595966 | 0.311496 | -1.910591 | 0.169614 | -0.400143 | ... | -0.197128 | -0.105322 | 0.295715 | 0.031698 | -0.468493 | -0.237298 | 0.053180 | -0.028969 | 28.00 | 0 |
| 248052 | 153784.0 | 2.096868 | -1.576352 | -1.090182 | -1.641218 | 0.747902 | 4.381270 | -2.271850 | 1.244104 | 0.857574 | ... | 0.133372 | 0.597006 | 0.280175 | 0.737922 | -0.302255 | -0.168415 | 0.082647 | -0.044151 | 1.00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 274326 | 165959.0 | 0.497891 | -3.579021 | -2.202132 | -0.041862 | -1.558152 | -0.947046 | 0.835022 | -0.562582 | -0.336945 | ... | 0.493985 | -0.552604 | -0.552775 | 0.008024 | -0.413355 | -0.418322 | -0.203340 | 0.091928 | 899.00 | 0 |
| 35841 | 38290.0 | 1.258509 | 0.119476 | 0.476111 | 0.639000 | -0.641945 | -1.036544 | -0.059790 | -0.103594 | 0.317235 | ... | -0.277918 | -0.891559 | 0.158060 | 0.331059 | 0.156361 | 0.100524 | -0.040847 | 0.015202 | 1.98 | 0 |
| 252060 | 155633.0 | -0.173904 | 0.273750 | 0.777996 | -0.557276 | 0.645484 | 1.586147 | -0.142234 | 0.561740 | 0.533848 | ... | 0.499599 | 1.731249 | 0.034243 | -1.380754 | -1.434138 | -0.254201 | 0.240537 | 0.264270 | 9.47 | 0 |
| 154400 | 101419.0 | 1.720521 | -0.347084 | -0.271200 | 1.861305 | -0.372049 | -0.152705 | -0.160627 | -0.144238 | 2.378427 | ... | -0.601098 | -1.401193 | 0.344930 | -0.210332 | -0.371421 | -1.149790 | 0.013549 | -0.017576 | 117.99 | 0 |


The sample of normal transactions now has the same number of rows (492) as the fraud data. Concatenating the two gives a balanced dataset of 984 rows.
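
Because `normal.sample(n=492)` is called without a seed, it draws a different random subset on every run, so the rows shown above and all downstream results can change between runs. One possible way to make the undersampling repeatable is to fix the seed; in pandas, `DataFrame.sample` accepts a `random_state` argument for this. The sketch below illustrates the same idea in plain Python on made-up `(row_id, label)` pairs, so every name and value in it is a stand-in rather than the notebook's code or data.

```python
import random

# Made-up stand-in for the real data: (row_id, label) pairs, 20 of 1000 are "fraud".
rows = [(i, 1 if i % 50 == 0 else 0) for i in range(1000)]
normal_rows = [r for r in rows if r[1] == 0]
fraud_rows = [r for r in rows if r[1] == 1]

def undersample(majority, n, seed):
    """Draw n rows from the majority class without replacement, using a fixed seed."""
    return random.Random(seed).sample(majority, k=n)

# Sample as many normal rows as there are fraud rows, then combine both classes.
normal_subset = undersample(normal_rows, len(fraud_rows), seed=42)
balanced = normal_subset + fraud_rows

# Same seed, same subset: the balanced dataset is reproducible.
same_again = undersample(normal_rows, len(fraud_rows), seed=42)
print(len(balanced), normal_subset == same_again)
```

In the notebook itself, the analogous change would be passing `random_state` to `normal.sample`.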

In [23]:
new_data = pd.concat([normal_sample, fraud], axis=0)
new_data.describe()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | ... | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 | 984.0 |
| mean | 89533.723577 | -2.357206 | 1.791793 | -3.553761 | 2.272277 | -1.59531 | -0.694789 | -2.769727 | 0.297267 | -1.271739 | ... | 0.34482 | 0.009189 | -0.037797 | -0.054524 | 0.006981 | 0.028873 | 0.099728 | 0.042734 | 113.0994 | 0.5 |
| std | 47927.39201 | 5.537376 | 3.731543 | 6.208962 | 3.192536 | 4.303719 | 1.840402 | 5.945396 | 4.877834 | 2.323035 | ... | 2.795788 | 1.181572 | 1.24585 | 0.564618 | 0.685135 | 0.457613 | 1.088605 | 0.470901 | 358.791967 | 0.500254 |
| min | 406.0 | -30.55238 | -16.417395 | -31.103685 | -3.099974 | -27.752964 | -6.406267 | -43.557242 | -41.044261 | -13.434066 | ... | -22.797604 | -8.887017 | -19.254328 | -2.37579 | -4.781606 | -1.158898 | -7.263482 | -4.009839 | 0.0 | 0.0 |
| 25% | 48362.5 | -2.867222 | -0.182297 | -5.14662 | -0.155495 | -1.845482 | -1.599839 | -3.105154 | -0.192932 | -2.298358 | ... | -0.172669 | -0.543407 | -0.243526 | -0.405936 | -0.32557 | -0.265555 | -0.048206 | -0.053064 | 1.52 | 0.0 |
| 50% | 84204.0 | -0.719808 | 1.003518 | -1.422658 | 1.315483 | -0.42169 | -0.639088 | -0.687471 | 0.136416 | -0.697871 | ... | 0.159252 | 0.035816 | -0.033769 | 0.008012 | 0.039665 | -0.006892 | 0.053886 | 0.03173 | 18.755 | 0.5 |
| 75% | 135991.25 | 1.056312 | 2.814266 | 0.301433 | 4.235631 | 0.456639 | 0.085101 | 0.258779 | 0.885597 | 0.151853 | ... | 0.644249 | 0.594179 | 0.193367 | 0.367402 | 0.389374 | 0.282435 | 0.459847 | 0.219111 | 99.99 | 1.0 |
| max | 172087.0 | 2.354755 | 22.057729 | 3.528608 | 12.114672 | 11.095089 | 18.072031 | 28.504065 | 20.007208 | 4.536487 | ... | 27.202839 | 8.361985 | 5.46623 | 1.550407 | 2.208209 | 2.745261 | 12.152401 | 4.585234 | 8790.26 | 1.0 |


### Splitting the "new" Dataset

In [29]:
# features: every column except the label (columns= already implies axis=1)
X = new_data.drop(columns='Class')
# target: the fraud label
y = new_data['Class']

In [26]:
y

237705    0
168910    0
200867    0
163482    0
248052    0
         ..
279863    1
280143    1
280149    1
281144    1
281674    1
Name: Class, Length: 984, dtype: int64

In [28]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
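
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in the training and test sets, so the 50/50 balance of `new_data` carries over to both. The plain-Python sketch below illustrates that idea on toy labels by splitting each class separately; `stratified_split` is a hypothetical helper written for this example and is not how scikit-learn implements it.

```python
import random
from collections import Counter

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so that each class is split in the same proportion."""
    rng = random.Random(seed)

    # Group row positions by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test share
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy balanced labels: 50 normal (0) and 50 fraud (1).
labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(train_counts), dict(test_counts))
```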

In [31]:
print(X_train, X_test, Y_train, Y_test)

            Time         V1         V2         V3        V4         V5  \
49613    44135.0   0.634838  -1.143511   1.808769  1.682620  -1.492012   
27941    34777.0   1.183732   0.040298   1.060949  2.685373  -0.408925   
56703    47545.0   1.176716   0.557091  -0.490800  0.756424   0.249192   
150684   93888.0 -10.040631   6.139183 -12.972972  7.740555  -8.684705   
6882      8808.0  -4.617217   1.695694  -3.114372  4.328199  -1.873257   
...          ...        ...        ...        ...       ...        ...   
39183    39729.0  -0.964567  -1.643541  -0.187727  1.158253  -2.458336   
205716  135880.0  -5.472082  -3.952467  -0.927776 -1.231708  -0.711065   
143335   85285.0  -6.713407   3.921104  -9.746678  5.148263  -5.151563   
273206  165478.0  -1.913675   1.678326  -0.949002 -0.699536   0.211417   
17317    28625.0 -27.848181  15.598193 -28.923756  6.418442 -20.346228   

              V6         V7         V8        V9  ...       V20       V21  \
49613   1.488588  -1.170296   0.60

In [32]:
model = LogisticRegression()

In [33]:
model.fit(X_train, Y_train)

LogisticRegression()
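
The model above is fitted on the raw features. In the `describe()` output, `Time` reaches 172087.0 and `Amount` reaches 8790.26, while the V columns shown span a far smaller range, and iterative solvers can converge more slowly when features sit on such different scales. A common optional step, not used in this notebook, is to standardize each feature to zero mean and unit variance (scikit-learn provides `StandardScaler` for this). A minimal sketch of the computation, using the five `Amount` values from `data.head()` purely as example input:

```python
import statistics

# The five Amount values shown in the data.head() output, used only as sample input.
amounts = [149.62, 2.69, 378.66, 123.5, 69.99]

# Standardize: subtract the mean, divide by the population standard deviation.
mean = statistics.fmean(amounts)
std = statistics.pstdev(amounts)
scaled = [(a - mean) / std for a in amounts]

print([round(z, 3) for z in scaled])
```

In a real pipeline the mean and standard deviation would be estimated on the training split only and then reused to transform the test split.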

In [34]:
y_predict = model.predict(X_test)

In [38]:
acc = accuracy_score(Y_test, y_predict)
print(f'The accuracy of the Logistic Regression model is {acc}')

The accuracy of the Logistic Regression model is 0.9644670050761421
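
An accuracy of 0.9645 on the 197-row test set corresponds to 190 correct predictions and 7 errors, but it does not say whether those errors are missed frauds or false alarms. Precision and recall separate the two. The sketch below computes them by hand on made-up labels (not the notebook's predictions); `confusion_counts` is a hypothetical helper. With scikit-learn, `confusion_matrix`, `precision_score`, and `recall_score` from `sklearn.metrics` could be applied to `Y_test` and `y_predict` instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # fraud correctly flagged
        elif pred == positive:
            fp += 1          # normal transaction flagged as fraud
        elif truth == positive:
            fn += 1          # fraud that was missed
        else:
            tn += 1          # normal transaction correctly passed
    return tp, fp, tn, fn

# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the flagged transactions, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For fraud detection, recall is often the figure to watch, since a missed fraud usually costs more than a false alarm.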
