# FindDefault (Prediction of Credit Card fraud) - Capstone Project


## Problem Statement
A credit card is one of the most widely used financial products for online purchases and payments. Although credit cards can be a convenient way to manage your finances, they can also be risky. Credit card fraud is the unauthorized use of someone else's credit card or credit card information to make purchases or withdraw cash.
It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. 



### Introduction:
The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.


In this project, we build a classification model to predict whether a transaction is fraudulent. We will compare several predictive models to see how well each one distinguishes a normal payment from a fraud. Let's start!
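Because frauds are such a small share of the rows, plain accuracy can look excellent for a model that is useless in practice. The sketch below uses only the two counts quoted above; the variable names are illustrative.

```python
# Counts quoted in the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions

# A trivial "model" that labels every transaction as normal is correct
# on every non-fraud row and wrong on every fraud row.
baseline_accuracy = (n_transactions - n_frauds) / n_transactions

# The same model catches none of the frauds, so its fraud recall is zero.
baseline_recall = 0 / n_frauds

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'normal': {baseline_accuracy:.4%}")
print(f"Fraud recall of that baseline: {baseline_recall:.0%}")
```

For this reason it is usually safer to judge models on this kind of data by recall, precision, or the area under the precision-recall curve rather than by accuracy alone.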

### Project Outline:
- **Exploratory Data Analysis:** Analyze and understand the data to identify patterns, relationships, and trends in the data by using Descriptive Statistics and Visualizations. 
- **Data Cleaning:** This might include standardization and handling missing values and outliers in the data. 
- **Dealing with Imbalanced data:** This dataset is highly imbalanced. The data should be balanced using appropriate methods before moving on to model building.
- **Feature Engineering:** Create new features or transform the existing features for better performance of the ML Models. 
- **Model Selection:** Choose the most appropriate model that can be used for this project. 
- **Model Training:** Split the data into train & test sets and use the train set to estimate the best model parameters. 
- **Model Validation:** Evaluate the performance of the model on data that was not used during the training process. The goal is to estimate the model's ability to generalize to new, unseen data and to identify any issues with the model, such as overfitting. 
- **Model Deployment:** Model deployment is the process of making a trained machine learning model available for use in a production environment. 
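The training and validation steps in the outline can be sketched roughly as follows. This is only one possible shape, not the model chosen in this project: the data is synthetic (generated with `make_classification`), and logistic regression with `class_weight="balanced"` is an assumed stand-in for whichever imbalance-handling method and model are eventually selected.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: roughly 2% positives).
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)

# Stratify so that train and test keep a similar fraud proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" reweights the classes instead of resampling rows.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data with metrics suited to imbalanced classes.
fraud_scores = model.predict_proba(X_test)[:, 1]
avg_precision = average_precision_score(y_test, fraud_scores)
fraud_recall = recall_score(y_test, model.predict(X_test))

print(f"Average precision: {avg_precision:.3f}")
print(f"Fraud recall: {fraud_recall:.3f}")
```

Splitting before any balancing step keeps the test set representative of the original class distribution.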

#### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import RobustScaler
import os

import warnings
warnings.filterwarnings("ignore")

In [3]:
# Get the parent directory (project folder)
parent_directory = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

# Specify the path to the data file relative to the project folder
data_file_path = os.path.join(parent_directory, 'data', 'raw.csv')

# read the dataset
card_df = pd.read_csv(data_file_path)
card_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


This dataset comprises 284,807 rows and 31 columns. Apart from 'Time', 'Amount', and the target 'Class', the meaning of the remaining columns (V1 to V28) is undisclosed for privacy reasons. 

These undisclosed columns have already been scaled and transformed with PCA (a dimensionality reduction technique), while the 'Class' column serves as the target variable. As a result, our preprocessing focuses on the 'Time' and 'Amount' columns.
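The shape, missing values, and class balance described above can be checked with a few pandas calls. The sketch below runs on a small toy frame so that it is self-contained; the first five 'Time' and 'Amount' values come from the preview, while the rounded 'V1' values and the final fraud row are invented for illustration.

```python
import pandas as pd

# Toy stand-in with the same column roles as the real frame.
toy_df = pd.DataFrame({
    "Time": [0.0, 0.0, 1.0, 1.0, 2.0, 3.0],
    "V1": [-1.36, 1.19, -1.36, -0.97, -1.16, 0.50],
    "Amount": [149.62, 2.69, 378.66, 123.5, 69.99, 10.0],
    "Class": [0, 0, 0, 0, 0, 1],  # last row is a made-up fraud
})

n_rows, n_cols = toy_df.shape

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Share of each class; on the real data this exposes the imbalance.
class_share = toy_df["Class"].value_counts(normalize=True)

print(f"{n_rows} rows, {n_cols} columns")
print(missing_per_column)
print(class_share)
```

The same three calls can be applied to `card_df` once it is loaded.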

### Scaling 
We first scale the Time and Amount columns so that they are on a scale comparable to the other columns, which have already been scaled.
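To make the transformation concrete: with its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (IQR). The sketch below compares a manual version against the library on the five 'Amount' values from the preview. Fitting on five rows is only for illustration, so the results will not match the values produced by fitting on the full dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The five 'Amount' values shown in the data preview.
amounts = np.array([149.62, 2.69, 378.66, 123.5, 69.99]).reshape(-1, 1)

# Manual version: subtract the median, divide by the interquartile range.
median = np.median(amounts)
q1, q3 = np.percentile(amounts, [25, 75])
manual_scaled = (amounts - median) / (q3 - q1)

# Library version with default settings (quantile_range=(25.0, 75.0)).
library_scaled = RobustScaler().fit_transform(amounts)

matches = np.allclose(manual_scaled, library_scaled)
print(matches)
```

Because the median and IQR are barely affected by a few extreme values, a handful of very large transaction amounts does not distort the scale for all the others.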

In [4]:
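# RobustScaler centers on the median and scales by the IQR,
# so it is less sensitive to outliers than mean/variance scaling.
rob_scaler = RobustScaler()

# Fit once on both columns so the fitted scaler keeps the parameters for
# 'Amount' and 'Time' (refitting per column would overwrite the first fit).
scaled_columns = ['Amount', 'Time']
card_df[scaled_columns] = rob_scaler.fit_transform(card_df[scaled_columns])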


In [5]:
card_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,-0.994983,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,1.783274,0
1,-0.994983,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,-0.269825,0
2,-0.994972,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,4.983721,0
3,-0.994972,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,1.418291,0
4,-0.99496,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0.670579,0


### Saving the preprocessed data to the data folder

In [6]:
# Specify the path to save the preprocessed data file in the data folder
preprocessed_file_path = os.path.join(parent_directory, 'data', 'preprocessed_data.csv')

# Save the preprocessed DataFrame to a CSV file in the data folder
card_df.to_csv(preprocessed_file_path, index=False)

In [None]:
# Optionally, print a message to confirm that the file has been saved
print(f"Preprocessed data saved to: {preprocessed_file_path}")

In [8]:
import pickle

In [9]:
# Get the parent directory of the current working directory
parent_directory = os.path.dirname(os.getcwd())

# Define the path to the scaler pickle file within the model folder
model_directory = os.path.join(parent_directory, 'model')
scaler_pickle_path = os.path.join(model_directory, 'scaler.pkl')

# Create the model directory if it doesn't exist
os.makedirs(model_directory, exist_ok=True)

# Save scaler object to a file
with open(scaler_pickle_path, 'wb') as file:
    pickle.dump(rob_scaler, file)
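
A saved scaler is only useful if it can be loaded later and applied to new transactions. The following is a minimal round-trip sketch under stated assumptions: the toy values and the temporary directory are hypothetical, and only the `scaler.pkl` file name is taken from the cell above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column standing in for the training data.
train_values = np.array([[2.69], [69.99], [123.5], [149.62], [378.66]])
scaler = RobustScaler().fit(train_values)

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "scaler.pkl")

    # Save the fitted scaler, as in the cell above.
    with open(scaler_path, "wb") as file:
        pickle.dump(scaler, file)

    # Load it back, as a prediction script would do.
    with open(scaler_path, "rb") as file:
        loaded_scaler = pickle.load(file)

# New data must have the same number of columns the scaler was fitted on.
new_values = np.array([[50.0], [500.0]])
same_result = np.allclose(
    scaler.transform(new_values), loaded_scaler.transform(new_values)
)
print(same_result)
```

Note that `transform` is used on new data rather than `fit_transform`, so the median and IQR learned during preprocessing are reused. Pickle files should only be loaded from sources you trust.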