
# **Credit Card Fraud Detection**
Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection/data

Model: Logistic Regression

Logistic Regression is a classification algorithm used for predicting categorical outcomes (in this case, whether a transaction is fraudulent or not).

In essence, the Logistic Regression model learns the relationship between the selected features and the probability of a transaction being fraudulent. It then uses this learned relationship to predict the likelihood of fraud for new, unseen transactions.
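
To make the idea of a learned relationship concrete, the sketch below shows the core computation behind a logistic regression prediction: a weighted sum of the features passed through the sigmoid function. The weights, bias, and feature values are hypothetical numbers chosen only for illustration; they are not learned from this dataset.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_fraud_probability(features, weights, bias):
    """Weighted sum of the features, passed through the sigmoid."""
    return sigmoid(np.dot(features, weights) + bias)

# Hypothetical weights, bias, and features (illustration only, not fitted values)
example_weights = np.array([0.004, -0.5])
example_bias = -1.0
example_features = np.array([500.0, 1.0])

probability = predict_fraud_probability(example_features, example_weights, example_bias)
predicted_label = int(probability >= 0.5)  # 1 = fraud, 0 = legit
print(probability, predicted_label)
```

In scikit-learn, `LogisticRegression.predict_proba` returns such probabilities and `predict` applies the 0.5 threshold.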

There are two datasets: fraudTest.csv and fraudTrain.csv, loaded as test_data and train_data, respectively. The Logistic Regression model is trained on train_data and then applied to test_data.

---

## Importing Dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Importing Data

In [None]:
import kagglehub

path = kagglehub.dataset_download("kartik2112/fraud-detection")
print("Path to dataset files:", path)

test_data = pd.read_csv(path + "/fraudTest.csv")
train_data = pd.read_csv(path + "/fraudTrain.csv")

Downloading from https://www.kaggle.com/api/v1/datasets/download/kartik2112/fraud-detection?dataset_version_number=1...
100%|██████████| 202M/202M [00:01<00:00, 175MB/s]
Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/kartik2112/fraud-detection/versions/1


## Data Pre-processing and Analysis

In [None]:
train_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [None]:
#checking for null values
train_data.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
trans_date_trans_time,0
cc_num,0
merchant,0
category,0
amt,0
first,0
last,0
gender,0
street,0


### Analysing data

Legitimate and fraudulent transactions are distinguished by the value of the is_fraud column (legit = 0, fraud = 1).

In [None]:
train_data['is_fraud'].value_counts()
#there are 1,289,169 legitimate and 7,506 fraudulent transactions

Unnamed: 0_level_0,count
is_fraud,Unnamed: 1_level_1
0,1289169
1,7506


In [None]:
#splitting by class so that legit and fraud transactions can be examined separately
legit = train_data[train_data.is_fraud == 0]
fraud = train_data[train_data.is_fraud == 1]

In [None]:
#observing the differences in the transaction amounts
legit.amt.describe()

Unnamed: 0,amt
count,1289169.0
mean,67.66711
std,154.008
min,1.0
25%,9.61
50%,47.28
75%,82.54
max,28948.9


In [None]:
fraud.amt.describe()

Unnamed: 0,amt
count,7506.0
mean,531.320092
std,390.56007
min,1.06
25%,245.6625
50%,396.505
75%,900.875
max,1376.04


In [16]:
train_data.groupby('is_fraud').mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long
is_fraud,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,648473.169029,4.172901e+17,67.66711,48805.107481,38.536888,-90.228142,88775.228137,1349249000.0,38.536659,-90.228274
1,624949.724354,4.003577e+17,531.320092,48038.714229,38.663609,-89.916041,97276.763256,1348389000.0,38.653901,-89.915808


The summaries above show clear differences between legit and fraud transactions.

The count differs simply because the classes are imbalanced, and the standard deviation describes how spread out the amounts are rather than their typical size, so neither is a direct basis for comparing the two classes.

The mean is the most direct summary here: the mean 'amt' is about 67.67 for legit transactions and about 531.32 for fraud transactions.

### Making a balanced, unbiased dataset
This is done by undersampling: a random sample of legit transactions, equal in number to the fraud transactions (7,506), is combined with all of the fraud transactions.

In [148]:
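#sample as many legit rows as there are fraud rows (7,506); pass random_state to make the sample reproducible
legit_sample = legit.sample(n=len(fraud))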
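
The undersampling step can also be written as a small reusable helper. The sketch below is one possible way to do it, not the notebook's own method: the function name `undersample_majority` and the toy DataFrame are hypothetical, and a fixed `random_state` is assumed so that the sample is reproducible.

```python
import pandas as pd

def undersample_majority(df, label_col, random_state=None):
    """Return a frame in which every class has as many rows as the rarest class."""
    minority_size = df[label_col].value_counts().min()
    parts = [
        group.sample(n=minority_size, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "amt": [5.0, 12.5, 40.0, 7.2, 900.0, 650.0, 33.1, 18.9],
    "is_fraud": [0, 0, 0, 0, 1, 1, 0, 0],
})

balanced_toy = undersample_majority(toy, "is_fraud", random_state=1)
print(balanced_toy["is_fraud"].value_counts())
```

Sizing the sample from the data, rather than hardcoding the row count, keeps the helper correct if the number of fraud rows changes.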

In [149]:
balanced_data = pd.concat([legit_sample, fraud], axis=0)
balanced_data.head()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
1111848,1111848,2020-04-07 10:55:27,675945690716,"fraud_Stokes, Christiansen and Sipes",grocery_net,25.65,Ellen,Smith,F,285 George Lake,...,40.7687,-80.3592,28425,Podiatrist,2000-06-09,b76342566a17bbdcd6dcf943e9d7b88b,1365332127,40.716321,-80.074888,0
146705,146705,2019-03-20 08:47:54,30238755902988,fraud_Huels-Hahn,gas_transport,49.13,Danielle,Yu,F,5395 Colon Burgs Suite 037,...,30.592,-97.2893,1766,Press sub,1976-01-02,7a3a684ac63600ecababc1d13a6ccbfe,1332233274,29.73134,-97.388875,0
512053,512053,2019-08-10 08:23:11,373905417449658,"fraud_Morissette, Weber and Wiegand",grocery_net,79.05,Sarah,Bishop,F,554 Mcdonald Valley Apt. 539,...,31.929,-97.6443,2526,Phytotherapist,1970-11-12,2e3db904683953a20b2cdcc064127250,1344586991,31.194313,-97.925548,0
733609,733609,2019-11-09 22:05:43,3577578023716568,fraud_Fahey Inc,kids_pets,45.39,Debbie,Hughes,F,0182 Owens Burgs Suite 480,...,41.0935,-81.0425,2644,"Engineer, biomedical",1983-08-25,b4ce8600a48acd6c0861a7398549d89e,1352498743,40.130885,-81.076528,0
1131880,1131880,2020-04-16 11:26:38,3519607465576254,fraud_Kozey-Boehm,shopping_net,22.51,Audrey,Gonzalez,F,34180 Lopez Plaza,...,40.7268,-124.2174,276,"Scientist, audiological",1929-05-06,c0887f385a2baf318019c663d3cff0aa,1366111598,40.252501,-124.063262,0


In [150]:
balanced_data.tail()

Unnamed: 0.1,Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
1295399,1295399,2020-06-21 01:00:08,3524574586339330,fraud_Kassulke PLC,shopping_net,977.01,Ashley,Cabrera,F,94225 Smith Springs Apt. 617,...,27.633,-80.4031,105638,"Librarian, public",1986-05-07,a83b093f0c1d9068fa0089f7c722615f,1371776408,26.888686,-80.834389,1
1295491,1295491,2020-06-21 01:53:35,3524574586339330,fraud_Schumm PLC,shopping_net,1210.91,Ashley,Cabrera,F,94225 Smith Springs Apt. 617,...,27.633,-80.4031,105638,"Librarian, public",1986-05-07,f75b35bed13b9e692f170dba45a15b21,1371779615,28.216707,-79.855648,1
1295532,1295532,2020-06-21 02:16:56,4005676619255478,"fraud_Tillman, Dickinson and Labadie",gas_transport,10.24,William,Perry,M,458 Phillips Island Apt. 768,...,30.459,-90.9027,71335,Herbalist,1994-05-31,a0ba2472cd3fc9731f2a18d3f308f5c3,1371781016,29.700456,-91.361632,1
1295666,1295666,2020-06-21 03:26:20,3560725013359375,fraud_Corwin-Collins,gas_transport,21.69,Brooke,Smith,F,63542 Luna Brook Apt. 012,...,31.8599,-102.7413,23,Cytogeneticist,1969-09-15,daa281350b1e16093c7b4bf97bf4d6ed,1371785180,32.675272,-103.484949,1
1295733,1295733,2020-06-21 03:59:46,4005676619255478,fraud_Koss and Sons,gas_transport,10.2,William,Perry,M,458 Phillips Island Apt. 768,...,30.459,-90.9027,71335,Herbalist,1994-05-31,0c1c20470fc0d16019b5c368cadf563a,1371787186,31.363252,-89.932309,1


In [151]:
balanced_data['is_fraud'].value_counts()
#both classes now have the same number of rows

Unnamed: 0_level_0,count
is_fraud,Unnamed: 1_level_1
0,7506
1,7506


In [152]:
balanced_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15012 entries, 1111848 to 1295733
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15012 non-null  int64  
 1   trans_date_trans_time  15012 non-null  object 
 2   cc_num                 15012 non-null  int64  
 3   merchant               15012 non-null  object 
 4   category               15012 non-null  object 
 5   amt                    15012 non-null  float64
 6   first                  15012 non-null  object 
 7   last                   15012 non-null  object 
 8   gender                 15012 non-null  object 
 9   street                 15012 non-null  object 
 10  city                   15012 non-null  object 
 11  state                  15012 non-null  object 
 12  zip                    15012 non-null  int64  
 13  lat                    15012 non-null  float64
 14  long                   15012 non-null  float64
 15 

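Logistic Regression expects numerical inputs, so the non-numerical columns listed in columns_to_drop are dropped. The trans_date_trans_time and trans_num columns remain as object columns, but they are not selected as features later.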

In [140]:
columns_to_drop = ['merchant', 'category', 'first', 'last', 'gender', 'street', 'city', 'state', 'job', 'dob']
balanced_data = balanced_data.drop(columns=columns_to_drop)
balanced_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15012 entries, 314704 to 1295733
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             15012 non-null  int64  
 1   trans_date_trans_time  15012 non-null  object 
 2   cc_num                 15012 non-null  int64  
 3   amt                    15012 non-null  float64
 4   zip                    15012 non-null  int64  
 5   lat                    15012 non-null  float64
 6   long                   15012 non-null  float64
 7   city_pop               15012 non-null  int64  
 8   trans_num              15012 non-null  object 
 9   unix_time              15012 non-null  int64  
 10  merch_lat              15012 non-null  float64
 11  merch_long             15012 non-null  float64
 12  is_fraud               15012 non-null  int64  
dtypes: float64(5), int64(6), object(2)
memory usage: 1.6+ MB
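
The columns dropped above were listed by hand. As an alternative sketch, pandas can report which columns are non-numerical through `select_dtypes`. The toy DataFrame below is made up for illustration and only borrows a few column names from the dataset.

```python
import pandas as pd

# Toy frame for illustration only; values are not taken from the dataset
toy = pd.DataFrame({
    "amt": [4.97, 107.23],
    "category": ["misc_net", "grocery_pos"],
    "zip": [10001, 20002],
    "is_fraud": [0, 0],
})

# Columns whose dtype is not numeric
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

# Keep only the numeric columns
numeric_only = toy.drop(columns=non_numeric_columns)
print(non_numeric_columns)
print(numeric_only.columns.tolist())
```

Applied to balanced_data, this approach would also flag trans_date_trans_time and trans_num, which the hand-written list leaves in place.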


In [141]:
balanced_data.groupby('is_fraud').mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long
is_fraud,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,654069.876898,3.880621e+17,65.622419,48586.638156,38.553122,-90.119656,87926.445643,1349452000.0,38.552404,-90.122109
1,624949.724354,4.003577e+17,531.320092,48038.714229,38.663609,-89.916041,97276.763256,1348389000.0,38.653901,-89.915808


## Training the Model

### Splitting the balanced_data into features and target

In [207]:
#features
selected_features = ['amt', 'unix_time', 'zip', 'city_pop']
X = balanced_data[selected_features]

#target
Y = balanced_data['is_fraud']
print(X)
print(Y)

             amt   unix_time    zip  city_pop
1111848    25.65  1365332127  15010     28425
146705     49.13  1332233274  76578      1766
512053     79.05  1344586991  76665      2526
733609     45.39  1352498743  44412      2644
1131880    22.51  1366111598  95537       276
...          ...         ...    ...       ...
1295399   977.01  1371776408  32960    105638
1295491  1210.91  1371779615  32960    105638
1295532    10.24  1371781016  70726     71335
1295666    21.69  1371785180  79759        23
1295733    10.20  1371787186  70726     71335

[15012 rows x 4 columns]
1111848    0
146705     0
512053     0
733609     0
1131880    0
          ..
1295399    1
1295491    1
1295532    1
1295666    1
1295733    1
Name: is_fraud, Length: 15012, dtype: int64


### Splitting into training and testing data

In [208]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5, stratify=Y, random_state=1)
#test_size=0.5 => 50% of the data is used for testing and 50% for training
#stratify=Y => both splits keep the same proportion of Y values (0, 1)
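
The effect of `test_size` and `stratify` can be checked on a small example. The sketch below uses made-up arrays (16 rows of class 0 and 4 rows of class 1), not the notebook's data, and shows that each half keeps the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features, 16 rows of class 0 and 4 rows of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, stratify=y_toy, random_state=1
)

# Each half has 10 rows, and the 4:1 class ratio is preserved in both
print(len(y_tr), int(y_tr.sum()))
print(len(y_te), int(y_te.sum()))
```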

### Applying Logistic Regression

In [209]:
model = LogisticRegression(max_iter=10000)

In [210]:
model.fit(X_train, Y_train)

## Model Evaluation
using accuracy score

In [211]:
X_train_prediction = model.predict(X_train)
#accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy of training data: ', training_data_accuracy)

Accuracy of training data:  0.8533173461231015


In [212]:
X_test_prediction = model.predict(X_test)
#accuracy_score expects (y_true, y_pred)
testing_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print('Accuracy of testing data: ', testing_data_accuracy)

Accuracy of testing data:  0.8595790034638956


## Applying the model to test_data

### Processing test_data for prediction

In [213]:
#note: this call inspects train_data again; test_data is examined in the following cells
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long              

In [214]:
test_data.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
trans_date_trans_time,0
cc_num,0
merchant,0
category,0
amt,0
first,0
last,0
gender,0
street,0


In [216]:
columns_to_drop = ['merchant', 'category', 'first', 'last', 'gender', 'street', 'city', 'state', 'job', 'dob']
testing_data = test_data.drop(columns=columns_to_drop)
testing_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   amt                    555719 non-null  float64
 4   zip                    555719 non-null  int64  
 5   lat                    555719 non-null  float64
 6   long                   555719 non-null  float64
 7   city_pop               555719 non-null  int64  
 8   trans_num              555719 non-null  object 
 9   unix_time              555719 non-null  int64  
 10  merch_lat              555719 non-null  float64
 11  merch_long             555719 non-null  float64
 12  is_fraud               555719 non-null  int64  
dtypes: float64(5), int64(6), object(2)
memory usage: 55.1+ MB


### Applying model to test_data

In [217]:
#features
selected_features = ['amt', 'unix_time', 'zip', 'city_pop']
X1 = testing_data[selected_features]

#target
Y1 = testing_data['is_fraud']

In [219]:
test_prediction = model.predict(X1)
data_accuracy = accuracy_score(Y1, test_prediction)
print('Accuracy of testing data: ', data_accuracy)

Accuracy of testing data:  0.9560263370516394
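
If fraud is as rare in test_data as it is in train_data (7,506 of 1,296,675 rows), accuracy alone can look high even when many fraudulent transactions are missed. The sketch below uses made-up labels, not the notebook's predictions, to show how a confusion matrix, precision, and recall can add detail to the accuracy score.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 95 legit (0) and 5 fraud (1) transactions
y_true = np.array([0] * 95 + [1] * 5)
# Made-up predictions: 5 legit rows flagged as fraud, 2 fraud rows missed
y_pred = np.array([0] * 90 + [1] * 5 + [1] * 3 + [0] * 2)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # share of flagged rows that are truly fraud
recall = recall_score(y_true, y_pred)        # share of fraud rows that were flagged
matrix = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

print(accuracy, precision, recall)
print(matrix)
```

In this toy case accuracy is 0.93, yet only 3 of the 5 fraud rows are caught and most flagged rows are legit. The same calls could be applied to Y1 and test_prediction.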
