# Detecting credit card fraud using logistic regression: Part I

In this example, we'll build a model to help detect credit card fraud. This is a classic application of machine learning, and it illustrates the kind of work that goes into creating a useful model. In Part II, we'll host the model as an API endpoint on Google Cloud. In Part III, we'll build an end-to-end ML pipeline for model training, experimentation, deployment, and monitoring, using [this repo](https://github.com/DataTalksClub/mlops-zoomcamp) as a running resource.


## Motivation

Fraud detection is a classic example of a problem solved using classification algorithms. It's a good case study in basic machine learning development for the following reasons:

1. Real-world applications: Credit card fraud detection is a real-world problem.
2. Feature engineering: To make credit card data useful, it has to be transformed and manipulated in various ways.
3. Imbalanced datasets: Credit card fraud is (thankfully) a relatively rare occurrence, so detecting fraud requires managing imbalanced datasets.
4. Interpretability: Since this problem sits in a non-ML domain (finance), working on it in industry will likely mean talking with people outside ML. They will likely care not only that a model can predict fraud, but also about what the model looks for when it flags fraud. We therefore want a model that is interpretable.
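
To make the imbalance point concrete, here is a minimal sketch of why plain accuracy can mislead on rare-event data. The labels and the 1% fraud rate are hypothetical values chosen for illustration, not statistics from the dataset used in this series.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate.
# The 1% fraud rate is an assumption for illustration only.
labels = [1] * 10 + [0] * 990

# A trivial "model" that never flags fraud.
predictions = [0] * len(labels)

# Accuracy: share of predictions that match the label.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall: share of actual fraud cases that the model caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy = {accuracy:.2%}")  # accuracy = 99.00%
print(f"recall   = {recall:.2%}")    # recall   = 0.00%
```

The trivial model scores 99% accuracy while catching no fraud at all, which is why imbalanced problems are usually judged on metrics such as recall and precision rather than accuracy alone.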


## Setup and loading data


For our data, we'll use [this sample dataset](https://www.kaggle.com/datasets/mishra5001/credit-card/data) from Kaggle, which is intended for credit card fraud detection.


Let's import the packages we need and load our data.


In [31]:
import pandas as pd

pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 5)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)

In [2]:
df = pd.read_csv("application_data.csv")

## Data Exploration


Let's now take a quick look at our data.


In [32]:
df.head()

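|   | SK_ID_CURR | TARGET | ... | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|------------|--------|-----|---------------------------|----------------------------|
| 0 | 100002 | 1 | ... | 0.0 | 1.0 |
| 1 | 100003 | 0 | ... | 0.0 | 0.0 |
| 2 | 100004 | 0 | ... | 0.0 | 0.0 |
| 3 | 100006 | 0 | ... | NaN | NaN |
| 4 | 100007 | 0 | ... | 0.0 | 0.0 |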
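
Two things stand out even in these five rows: most `TARGET` values are 0, and row 3 has missing values. A quick way to quantify both is sketched below. The `sample` frame is a toy copy of the rows shown above, used so the snippet runs on its own; on the real data the same two expressions could be applied to `df`, and the numbers would differ.

```python
import pandas as pd

# Toy frame copied from the five rows printed above (assumption: the
# empty cells in row 3 are missing values). Not representative of the
# full dataset.
sample = pd.DataFrame(
    {
        "SK_ID_CURR": [100002, 100003, 100004, 100006, 100007],
        "TARGET": [1, 0, 0, 0, 0],
        "AMT_REQ_CREDIT_BUREAU_QRT": [0.0, 0.0, 0.0, None, 0.0],
        "AMT_REQ_CREDIT_BUREAU_YEAR": [1.0, 0.0, 0.0, None, 0.0],
    }
)

# Share of each class in the target column.
class_share = sample["TARGET"].value_counts(normalize=True)

# Fraction of missing values per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)

print(class_share)
print(missing_share)
```

`value_counts(normalize=True)` returns proportions rather than counts, which makes the degree of class imbalance easy to read off directly.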


What features do we have available in our data? The `columns_description.csv` file describes each one.


In [16]:
column_descriptions = pd.read_csv("columns_description.csv")

This file describes the features in our dataset, with a `Table` column indicating which file each feature belongs to. For our use case, we'll only look at the features in `application_data.csv`.

In [17]:
column_descriptions = column_descriptions[column_descriptions['Table'] == "application_data"][["Row", "Description", "Special"]]

In [29]:
column_descriptions.head(100)

|   | Row | Description | Special |
|---|-----|-------------|---------|
| 0 | SK_ID_CURR | ID of loan in our sample | NaN |
| 1 | TARGET | Target variable (1 - client with payment difficulties: he/she had late payment more than X days on at least one of the first Y installments of the loan in our sample, 0 - all other cases) | NaN |
| ... | ... | ... | ... |
| 98 | FLAG_DOCUMENT_4 | Did client provide document 4 | NaN |
| 99 | FLAG_DOCUMENT_5 | Did client provide document 5 | NaN |
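
Because `display.max_rows` was set to 5 earlier, the output above is truncated, so looking up a single feature by name can be more practical than scrolling. The helper below is a hypothetical convenience function (`describe_column` is not part of pandas or the dataset), shown on a toy frame built from two rows of the output above so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the filtered `column_descriptions` frame, using two
# rows taken from the output above.
descriptions = pd.DataFrame(
    {
        "Row": ["SK_ID_CURR", "FLAG_DOCUMENT_4"],
        "Description": [
            "ID of loan in our sample",
            "Did client provide document 4",
        ],
    }
)


def describe_column(descriptions: pd.DataFrame, name: str) -> str:
    """Return the description of one feature, or raise KeyError if absent."""
    match = descriptions.loc[descriptions["Row"] == name, "Description"]
    if match.empty:
        raise KeyError(f"no description found for column {name!r}")
    return match.iloc[0]


print(describe_column(descriptions, "FLAG_DOCUMENT_4"))
```

With the real data, the same function could be called as `describe_column(column_descriptions, "TARGET")`, assuming the frame keeps its `Row` and `Description` columns.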


## Data Preprocessing


## Dealing with Data Imbalances


## Model Development


## Model Evaluation and Iteration

## Model Interpretation


## Summary and Next Steps
