
# Credit Card Fraud Detection

<li>Build a machine learning model to identify fraudulent credit card transactions.
<li>Preprocess and normalize the transaction data, handle class imbalance, and split the dataset into training and testing sets.
<li>Train a classification algorithm, such as logistic regression or a random forest, to classify transactions as fraudulent or genuine.
<li>Evaluate the model's performance using metrics such as precision, recall, and F1-score, and consider techniques such as oversampling or undersampling to improve results.
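
The evaluation metrics named in the objectives can be computed directly from confusion-matrix counts. The sketch below is a minimal illustration, not the notebook's own evaluation code: the helper name `precision_recall_f1` and the counts passed to it are hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    tp: frauds correctly flagged, fp: genuine transactions wrongly flagged,
    fn: frauds that were missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=23)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

None of the three metrics uses the count of true negatives, which is why they stay informative when genuine transactions vastly outnumber frauds.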


The dataset contains credit card transactions made by European cardholders in September 2013.
It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
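
As a quick check of the imbalance described above, the fraud rate can be recomputed from the two counts. The same arithmetic gives the accuracy of a trivial model that labels every transaction as genuine, which is a useful baseline to keep in mind. This is only a sketch of the arithmetic; the variable names are illustrative.

```python
frauds = 492
total = 284_807

# Share of transactions that are fraudulent (the positive class).
fraud_rate = frauds / total

# A model that always predicts "genuine" is wrong only on the frauds.
baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"accuracy of always predicting genuine: {baseline_accuracy:.4%}")
```

The baseline accuracy is above 99.8% even though this model never detects a single fraud.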

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; it can be used, for example, for example-dependent cost-sensitive learning. The feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring performance with the Area Under the Precision-Recall Curve (AUPRC). Accuracy read from the confusion matrix is not meaningful for imbalanced classification.
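
To make the AUPRC recommendation concrete, here is a minimal pure-Python sketch of average precision, a common step-wise summary of the precision-recall curve. It assumes binary labels (1 = fraud) and model scores where higher means more suspicious, and it does not treat tied scores specially. The function name and the toy data are illustrative; in practice scikit-learn's `average_precision_score` is the usual tool.

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where recall increases by 1 / total_pos.
            ap += hits / rank
    return ap / total_pos


# Toy example: 2 frauds among 10 transactions, ranked 1st and 3rd by the model.
y_true = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(y_true, y_score)
print(f"average precision: {ap:.3f}")
```

Unlike accuracy, this score drops quickly when genuine transactions are ranked above frauds, so it reflects how well the model separates the rare class.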

# Import Statements

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read Data

In [2]:
dataset = pd.read_csv('creditcard.csv')


# Data Exploration

In [3]:
dataset.shape

(35742, 31)

In [4]:
dataset.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [5]:
dataset.tail()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
35737,38240,1.23239,0.12201,0.157352,0.261906,0.158523,0.273022,-0.143169,0.138664,-0.146805,...,-0.233618,-0.677894,0.052511,-0.816369,0.221553,0.158446,-0.018118,-0.003168,1.79,0.0
35738,38240,1.11404,0.571203,0.427035,2.442135,-0.020967,-0.501774,0.320237,-0.082876,-1.182685,...,0.028826,-0.032705,-0.056634,0.535225,0.54093,-0.025773,-0.035775,0.011219,24.99,0.0
35739,38241,1.05702,0.007895,0.239256,1.236048,0.032239,0.350868,0.023279,0.137328,0.037981,...,-0.043316,-0.022866,-0.155991,-0.283221,0.662538,-0.314989,0.027467,0.010613,53.96,0.0
35740,38241,-1.546226,0.693338,1.002815,-1.528992,0.294692,-0.464031,0.26488,0.307358,0.022915,...,-0.176713,-0.164637,0.197999,-0.46307,-0.118578,0.739989,0.043625,-0.140629,0.76,0.0
35741,38241,-0.231062,0.243033,1.071749,-0.324598,-0.0,,,,,...,,,,,,,,,,


In [6]:
dataset.columns

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

In [7]:
dataset.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,35742.0,35742.0,35742.0,35742.0,35742.0,35742.0,35741.0,35741.0,35741.0,35741.0,...,35741.0,35741.0,35741.0,35741.0,35741.0,35741.0,35741.0,35741.0,35741.0,35741.0
mean,23997.300823,-0.208427,0.072579,0.718292,0.195733,-0.216686,0.095575,-0.116847,0.032755,0.259459,...,-0.030876,-0.113555,-0.041571,0.007469,0.135958,0.021813,0.010836,0.003834,84.203356,0.002882
std,12423.654094,1.836736,1.540565,1.540896,1.409063,1.388746,1.310633,1.257788,1.242285,1.238161,...,0.769829,0.640398,0.545241,0.593343,0.435886,0.506559,0.388306,0.302016,227.279858,0.053606
min,0.0,-30.55238,-40.978852,-31.103685,-5.172595,-42.147898,-23.496714,-26.548144,-41.484823,-7.175097,...,-20.262054,-8.593642,-26.751119,-2.836627,-7.495741,-1.43865,-8.567638,-9.617915,0.0,0.0
25%,12283.5,-0.960139,-0.499301,0.244741,-0.714475,-0.818199,-0.644998,-0.598057,-0.155602,-0.523422,...,-0.239644,-0.536461,-0.178485,-0.326932,-0.12731,-0.331727,-0.0632,-0.007248,6.99,0.0
50%,28992.0,-0.23448,0.114024,0.827554,0.188607,-0.255034,-0.163054,-0.073069,0.043456,0.135169,...,-0.081611,-0.087516,-0.051996,0.061818,0.175722,-0.063275,0.008848,0.021087,22.0,0.0
75%,34258.0,1.162263,0.755015,1.456358,1.078531,0.302991,0.485169,0.43656,0.307599,0.991696,...,0.094995,0.296665,0.076214,0.398791,0.421085,0.301153,0.086772,0.076006,76.0,0.0
max,38241.0,1.960497,16.713389,4.101716,13.143668,34.099309,22.529298,36.677268,20.007208,10.392889,...,22.614889,5.805795,13.876221,4.014444,5.525093,3.517346,11.13574,5.678671,7879.42,1.0
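
Because 'Time' is measured in seconds since the first transaction, its maximum shows how much of the two-day window the loaded file covers. The sketch below converts the maximum from the `describe()` output above into hours, minutes and seconds. The result suggests, as an inference rather than a confirmed fact, that this file is a partial extract of the full 284,807-row dataset.

```python
# Maximum of the 'Time' column, taken from the describe() output above.
max_time_seconds = 38_241

hours, remainder = divmod(max_time_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(f"time span covered: {hours}h {minutes}m {seconds}s")
print(f"as a fraction of two days: {max_time_seconds / (2 * 24 * 3600):.1%}")
```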


In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35742 entries, 0 to 35741
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Time    35742 non-null  int64  
 1   V1      35742 non-null  float64
 2   V2      35742 non-null  float64
 3   V3      35742 non-null  float64
 4   V4      35742 non-null  float64
 5   V5      35742 non-null  float64
 6   V6      35741 non-null  float64
 7   V7      35741 non-null  float64
 8   V8      35741 non-null  float64
 9   V9      35741 non-null  float64
 10  V10     35741 non-null  float64
 11  V11     35741 non-null  float64
 12  V12     35741 non-null  float64
 13  V13     35741 non-null  float64
 14  V14     35741 non-null  float64
 15  V15     35741 non-null  float64
 16  V16     35741 non-null  float64
 17  V17     35741 non-null  float64
 18  V18     35741 non-null  float64
 19  V19     35741 non-null  float64
 20  V20     35741 non-null  float64
 21  V21     35741 non-null  float64
 22

In [9]:
dataset.dtypes

Time        int64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class     float64
dtype: object

In [10]:
# highest amount
print(f"{dataset['Amount'].max()} is the highest amount")
# least amount
print(f"{dataset['Amount'].min()} is the least amount")

7879.42 is the highest amount
0.0 is the least amount


In [11]:
dataset[-20:]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
35722,38236,1.257126,0.34487,0.303501,0.694626,-0.385154,-1.07723,0.075644,-0.194959,0.055911,...,-0.289668,-0.842265,0.131665,0.341118,0.211031,0.094973,-0.023988,0.030407,1.29,0.0
35723,38236,1.118498,0.01518,1.019209,1.171215,-0.648885,-0.096776,-0.4132,-0.003221,0.353776,...,0.129004,0.361717,-0.129786,-0.092493,0.427945,-0.324783,0.061365,0.046471,54.0,0.0
35724,38237,0.925073,-0.699487,-0.326355,-0.82088,-0.541933,-1.048971,0.409409,-0.357541,1.019987,...,0.084841,-0.086474,-0.341205,-0.079781,0.695983,-0.680736,0.00853,0.05592,218.63,0.0
35725,38237,-1.358746,0.772213,2.094395,0.877924,0.073985,0.960314,0.022082,-0.111108,0.323251,...,0.164933,0.787049,0.066406,-0.26778,-0.491877,-0.402691,-0.837289,-0.021249,1.0,0.0
35726,38237,-1.98924,-1.728345,-0.119015,-1.133178,0.912164,-2.426498,-0.720731,0.375397,-1.61308,...,0.639345,0.89137,0.080938,0.576996,-0.696322,-0.473688,0.328053,-0.126211,40.0,0.0
35727,38237,1.006415,-0.065173,0.146954,1.199599,-0.217985,-0.221551,0.150001,0.078562,0.012614,...,0.087635,0.199602,-0.155577,0.212407,0.649391,-0.28035,0.002556,0.011802,73.94,0.0
35728,38238,0.989699,-0.381695,0.67852,0.954111,-0.893174,-0.446359,-0.162044,0.030962,0.726544,...,0.008777,0.071786,-0.049254,0.664742,0.357171,0.534647,-0.028542,0.023588,88.95,0.0
35729,38238,1.13498,-0.727165,1.282437,0.642479,-1.30827,0.661894,-1.185847,0.412603,-0.356676,...,-0.344395,-0.411711,0.090428,-0.039538,0.15822,-0.378503,0.100142,0.031178,29.5,0.0
35730,38238,-1.572626,0.612082,0.129682,-1.290429,-1.681315,0.188992,1.968965,0.081226,-0.442638,...,0.029837,0.383578,0.163517,0.080378,-0.081701,0.906341,0.169001,-0.081592,380.0,0.0
35731,38239,-1.270414,-0.724967,1.952693,0.155589,0.908796,0.278143,-0.137787,0.038018,0.588239,...,-0.059405,0.356225,0.511902,-0.714071,-0.420289,0.379984,-0.067049,-0.061966,22.0,0.0


In [22]:
# the class counts show 103 fraudulent and 35,638 genuine transactions

dataset.value_counts(subset='Class')

Class
0.0    35638
1.0      103
dtype: int64
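
The counts above can feed the two usual ways of handling class imbalance: reweighting the classes or resampling the rows. The sketch below is illustrative only and assumes class 1 is the minority. It computes "balanced" class weights with the common heuristic `n_samples / (n_classes * class_count)` and shows random undersampling on a toy label list; the helper name `undersample` is hypothetical.

```python
import random

# Class counts from the value_counts output above.
counts = {0: 35_638, 1: 103}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
class_weight = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print(class_weight)


def undersample(labels, seed=0):
    """Keep every minority row (label 1) and an equal-size random
    sample of majority rows (label 0); return the kept row indices."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return sorted(minority + kept)


# Toy labels: 95 genuine and 5 fraudulent rows.
labels = [0] * 95 + [1] * 5
idx = undersample(labels)
print(len(idx), sum(labels[i] for i in idx))
```

If resampling is used, it should be applied to the training split only, so that the test set keeps the real class ratio.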

In [26]:
# total amount transacted across all rows of the dataset
print(f'{dataset["Amount"].sum():.0f} is the total amount transacted ')

3009512 is the total amount transacted 


In [28]:
# total transaction amount per class (0 = genuine, 1 = fraud)
dataset.groupby('Class')['Amount'].sum()

Class
0.0    3000193.61
1.0       9318.53
Name: Amount, dtype: float64
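
The per-class totals are easier to interpret as a share and as a mean per transaction. The following sketch redoes that arithmetic from the numbers printed above (the amount totals and the class counts); the variable names are illustrative.

```python
# Totals per class from the groupby output, counts from the value_counts output.
amount_by_class = {0: 3_000_193.61, 1: 9_318.53}
count_by_class = {0: 35_638, 1: 103}

total_amount = sum(amount_by_class.values())

# Share of the transacted amount that belongs to fraudulent transactions.
fraud_share = amount_by_class[1] / total_amount

# Mean amount per transaction for each class.
mean_amount = {
    label: amount_by_class[label] / count_by_class[label]
    for label in amount_by_class
}

print(f"fraud share of total amount: {fraud_share:.2%}")
print(f"mean genuine amount: {mean_amount[0]:.2f}")
print(f"mean fraud amount: {mean_amount[1]:.2f}")
```

In this extract the fraudulent transactions account for well under 1% of the total amount, and their mean amount is close to that of genuine transactions.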

# Data Cleaning

In [12]:
dataset.isna().any()

Time      False
V1        False
V2        False
V3        False
V4        False
V5        False
V6         True
V7         True
V8         True
V9         True
V10        True
V11        True
V12        True
V13        True
V14        True
V15        True
V16        True
V17        True
V18        True
V19        True
V20        True
V21        True
V22        True
V23        True
V24        True
V25        True
V26        True
V27        True
V28        True
Amount     True
Class      True
dtype: bool

In [13]:
dataset.isna().sum()

Time      0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        1
V7        1
V8        1
V9        1
V10       1
V11       1
V12       1
V13       1
V14       1
V15       1
V16       1
V17       1
V18       1
V19       1
V20       1
V21       1
V22       1
V23       1
V24       1
V25       1
V26       1
V27       1
V28       1
Amount    1
Class     1
dtype: int64

In [14]:
dataset[dataset['V23'].isna()]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
35741,38241,-0.231062,0.243033,1.071749,-0.324598,-0.0,,,,,...,,,,,,,,,,


In [15]:
dataset = dataset.dropna().reset_index(drop=True)
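
To show what the cleaning step does, here is a small sketch on a toy frame with one incomplete row, similar to the truncated last row found above. It also casts 'Class' to an integer type after the missing values are gone, which is an optional extra step and not something the notebook itself does. The frame `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: the last row is incomplete, like the truncated row in the data.
toy = pd.DataFrame(
    {
        "Time": [0, 1, 2],
        "Amount": [149.62, 2.69, np.nan],
        "Class": [0.0, 1.0, np.nan],
    }
)

# Drop rows with any missing value and rebuild a clean 0..n-1 index.
cleaned = toy.dropna().reset_index(drop=True)

# With no NaN left, the label can be stored as an integer (optional step).
cleaned["Class"] = cleaned["Class"].astype(int)

print(cleaned)
print(cleaned.dtypes)
```

The float dtype of 'Class' in the original frame is a side effect of the missing value, since a NaN cannot be stored in a plain integer column.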

In [16]:
dataset.isna().any()

Time      False
V1        False
V2        False
V3        False
V4        False
V5        False
V6        False
V7        False
V8        False
V9        False
V10       False
V11       False
V12       False
V13       False
V14       False
V15       False
V16       False
V17       False
V18       False
V19       False
V20       False
V21       False
V22       False
V23       False
V24       False
V25       False
V26       False
V27       False
V28       False
Amount    False
Class     False
dtype: bool

In [17]:
# The dataset contains transactions that occurred over two days.
# The Time column contains the seconds elapsed between each transaction and the first transaction.
# The Class feature is the response variable: 1 for a fraudulent transaction and 0 for a genuine one.


In [20]:
dataset['Class'].nunique()

2

In [21]:
dataset.dtypes

Time        int64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class     float64
dtype: object

# Data Visualization

In [None]:
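# Amount against Time, colored by Class (the axes are an assumed choice; the original call was left incomplete)
sns.scatterplot(data=dataset, x='Time', y='Amount', hue='Class')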