#### Load and Inspect the data

In [2]:
# Step 0: Import libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

In [3]:
# Step 1: Load the dataset

df = pd.read_csv("../data/creditcard.csv")

In [7]:
# Step 2: Inspect

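df.head()                                        # first five rows (not displayed: only a cell's last expression is shown)
df.info()                                        # column dtypes and non-null counts (printed directly)
df.describe()                                    # summary statistics (also not displayed here)
df['Class'].value_counts(normalize=True)         # see class balance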

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     28

Class
0    0.998273
1    0.001727
Name: proportion, dtype: float64

df['Class'].value_counts(normalize=True) calculates the proportion of each class (0 = non‑fraud, 1 = fraud) in the dataset.
This shows the class imbalance: about 99.83% of transactions are non‑fraud (Class 0) and only about 0.17% are fraud (Class 1), so fraud cases are extremely rare compared to normal transactions.
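
Proportions are easier to read next to absolute counts. The sketch below shows one way to print both, plus the non-fraud to fraud ratio. It uses a small toy series named `labels` as a stand-in for `df['Class']`; the toy values are hypothetical and are not the real dataset.

```python
import pandas as pd

# Toy labels standing in for df['Class'] (hypothetical data, not the real dataset)
labels = pd.Series([0] * 997 + [1] * 3, name="Class")

counts = labels.value_counts()                  # absolute count per class
shares = labels.value_counts(normalize=True)    # proportion per class

# Put counts and percentages side by side
summary = pd.DataFrame({"count": counts, "percent": (shares * 100).round(2)})
print(summary)

# Imbalance ratio: how many non-fraud rows exist for each fraud row
imbalance_ratio = counts[0] / counts[1]
print(f"non-fraud : fraud is roughly {imbalance_ratio:.0f} : 1")
```

On the real data, replacing `labels` with `df['Class']` should give the same kind of table.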

#### Basic Preprocessing

In [10]:
# Step 3: Split the dataset (X = all columns except "Class", y = only the "Class" column)

X = df.drop(columns=['Class'])
y = df['Class']

In [13]:
# Step 4: Train / Test Split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
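
As a quick sanity check on `stratify`, the following sketch confirms that a stratified split keeps the same positive rate in both parts. It runs on synthetic arrays (`X_demo`, `y_demo` are hypothetical names, not the notebook's variables), so it only illustrates the behaviour.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows with 20 positives (0.2%), loosely mimicking rare fraud
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:20] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 0.2% positive rate
print("train positives:", y_tr.sum(), "rate:", y_tr.mean())
print("test positives: ", y_te.sum(), "rate:", y_te.mean())
```

The same check on the real split would be `y_train.mean()` versus `y_test.mean()`.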

In [28]:
# Step 5: Train logistic regression

clf = LogisticRegression(max_iter=1000, n_jobs=-1)
clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
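
The warning above recommends scaling the data. One possible way to do that is to wrap `StandardScaler` and `LogisticRegression` in a pipeline, so the scaler is fitted on the training data only. The sketch below is an assumption about how this could look, demonstrated on synthetic data with one deliberately large-valued column (`X_demo`, `y_demo`, and `scaled_clf` are hypothetical names); it has not been run on the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (hypothetical stand-in for X_train, y_train)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_demo[:, 0] *= 100_000  # mimic a column whose range is far larger than the others

# Standardize every feature, then fit logistic regression
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_clf.fit(X_demo, y_demo)

# Number of solver iterations actually used (well below max_iter when it converges)
n_iter = scaled_clf.named_steps["logisticregression"].n_iter_[0]
print("iterations used:", n_iter)
```

In the notebook this would correspond to calling `scaled_clf.fit(X_train, y_train)` and `scaled_clf.predict(X_test)`; the resulting metrics may differ from the baseline numbers reported here.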


In [29]:
# Step 6: Prediction + Evaluation

y_pred = clf.predict(X_test)

print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification report:")
print(classification_report(y_test, y_pred, digits=4))

Confusion matrix:
[[56852    12]
 [   36    62]]

Classification report:
              precision    recall  f1-score   support

           0     0.9994    0.9998    0.9996     56864
           1     0.8378    0.6327    0.7209        98

    accuracy                         0.9992     56962
   macro avg     0.9186    0.8162    0.8603     56962
weighted avg     0.9991    0.9992    0.9991     56962



### Confusion Matrix Breakdown

Confusion matrix (test set):

[[56852    12]  
 [   36    62]]

- **True Negatives (56,852)**  
  - Non-fraud transactions correctly predicted as non-fraud.  
  - This is by far the largest group: 56,852 of the 56,962 test transactions.

- **False Positives (12)**  
  - Legitimate transactions incorrectly flagged as fraud.  
  - Very few false alarms → helps keep precision high.

- **False Negatives (36)**  
  - Fraud transactions incorrectly predicted as non-fraud.  
  - **This is our main problem**: 36 missed frauds out of 98 total in the test set.

- **True Positives (62)**  
  - Fraud transactions correctly detected as fraud.  
  - We want to increase this number in future models.

Overall, the model is conservative: it flags only 74 of the 56,962 test transactions as fraud (62 correctly, 12 incorrectly). This gives high precision but hurts recall, which is risky for fraud detection.

## Interpretation of Baseline Results

### Current Performance (Logistic Regression)

**For Class 1 (Fraud):**

- **Precision: 0.8378 (≈ 0.84)**  
  - When the model predicts "fraud", it is correct about 84% of the time (62 of 74 alerts).  
  - Around 16% of fraud alerts are actually normal transactions (false positives).

- **Recall: 0.6327 (≈ 0.63)**  
  - The model catches about 63% of actual fraud cases (62 of 98).  
  - **About 37% of fraud transactions are missed** (false negatives).

- **F1-Score: 0.7209 (≈ 0.72)**  
  - Harmonic mean of precision and recall.  
  - Shows there is room to improve, especially in recall.

- **Accuracy: 0.9992 (≈ 99.92%)**, computed over both classes  
  - Very high overall, but mainly because 99.8% of transactions are non-fraud: a model that always predicts non-fraud would already reach about 99.83% on this test set.  
  - Accuracy is misleading here due to extreme class imbalance.
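
The figures above follow directly from the four confusion matrix counts. As a small worked check, the sketch below recomputes them by hand and adds the accuracy of a trivial "always non-fraud" predictor for comparison. The variable names are illustrative only.

```python
# Counts taken from the confusion matrix of the baseline model (test set)
tn, fp, fn, tp = 56852, 12, 36, 62
total = tn + fp + fn + tp

precision = tp / (tp + fp)          # share of fraud alerts that are real fraud
recall = tp / (tp + fn)             # share of real fraud that is caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a model that labels every transaction as non-fraud
always_negative_accuracy = (tn + fp) / total

print(f"precision = {precision:.4f}")   # 0.8378
print(f"recall    = {recall:.4f}")      # 0.6327
print(f"f1        = {f1:.4f}")          # 0.7209
print(f"accuracy  = {accuracy:.4f}")    # 0.9992
print(f"always-non-fraud accuracy = {always_negative_accuracy:.4f}")  # 0.9983
```

The gap between 0.9992 and 0.9983 is the entire accuracy gain of the baseline over predicting nothing, which is why precision and recall are the more informative numbers here.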

### Why these metrics matter for fraud detection

1. **High recall is critical**  
   - Missed fraud (false negatives) means financial loss and harm to customers.  
   - Each missed case can be expensive and damage trust.

2. **Precision matters too**  
   - Too many false positives annoy customers with unnecessary fraud alerts.  
   - They also create extra manual review work for the bank.

3. **The imbalance problem**  
   - Only about 0.17% of samples are fraud, so the model is biased toward predicting non-fraud.  
   - This explains why recall is lower than precision: the model plays it safe and predicts "fraud" only when very confident.

### What to improve in Week 2

- Handle class imbalance:
  - Use class weights in LogisticRegression.
  - Try SMOTE (oversampling) and undersampling of the majority class.
- Try more powerful models:
  - Random Forest, XGBoost, etc.
- Use EDA insights:
  - Focus on strong features like V14, V12, V17.

**Target for Week 2**:  
Get **recall above 75%** while keeping **precision above 80%** for the fraud class.
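
Class weights are the simplest item on this list to try. A minimal sketch, assuming scikit-learn's `class_weight="balanced"` option and synthetic imbalanced data (`X_demo`, `y_demo`, `plain`, and `balanced` are hypothetical names), compares the two settings. The numbers it prints say nothing about the credit card dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 1% positives (hypothetical stand-in for the fraud data)
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.99, 0.01], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Same model twice: default weights vs. weights inversely proportional to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

pred_plain = plain.predict(X_te)
pred_balanced = balanced.predict(X_te)

for name, pred in [("plain", pred_plain), ("balanced", pred_balanced)]:
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    print(f"{name:9s} precision={p:.3f} recall={r:.3f} flagged={pred.sum()}")
```

Balanced weights typically raise recall at the cost of precision, so both numbers need to be checked against the Week 2 target.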

## Baseline model summary

- Class distribution: fraud is about 0.17% of all transactions, and non‑fraud is about 99.8%, so the dataset is extremely imbalanced.
- Test accuracy of the baseline Logistic Regression model is approximately 0.9992 (very high overall, mainly because most transactions are non‑fraud).
- For the fraud class (Class = 1), precision ≈ 0.8378, recall ≈ 0.6327, and F1‑score ≈ 0.7209 on the test set.
- The model detects 62 of the 98 fraud cases in the test set but misses the other 36, which holds recall to about 0.63.
- Because of the severe class imbalance and the missed frauds, the next step will be to try techniques like class weights, resampling, or threshold tuning to increase recall for fraud without making precision too low.
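
Threshold tuning, mentioned in the last point, can be sketched without retraining: take the fraud probabilities from `predict_proba` and compare them with a cutoff lower than the default 0.5. The example below runs on synthetic data with hypothetical names (`X_demo`, `y_demo`, `fraud_proba`), and the thresholds 0.3 and 0.1 are arbitrary illustrations, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with about 2% positives (hypothetical stand-in for the fraud data)
X_demo, y_demo = make_classification(
    n_samples=20_000, n_features=10, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
fraud_proba = model.predict_proba(X_te)[:, 1]   # estimated probability of class 1

recalls = []
for threshold in (0.5, 0.3, 0.1):
    # Flag a transaction as fraud when its probability reaches the threshold
    y_hat = (fraud_proba >= threshold).astype(int)
    p = precision_score(y_te, y_hat, zero_division=0)
    r = recall_score(y_te, y_hat)
    recalls.append(r)
    print(f"threshold={threshold:.1f} precision={p:.3f} recall={r:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision usually falls. In practice the cutoff should be chosen on a validation set rather than on the test set.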