
# Exploratory Data Analysis (EDA)

## Introduction

This exploratory data analysis aims to understand the characteristics of the fraud detection dataset. By analyzing various aspects of the data, we can uncover patterns and insights that will aid in building an effective predictive model to identify fraudulent transactions.

## Data Overview

- **Dataset:** `fraud_test.csv`
- **Number of Records:** 555,719 (as indicated by the output)
- **Number of Features:** 22
- **Features Description:**
  - `trans_date_trans_time`: Date and time of the transaction.
  - `cc_num`: Credit card number.
  - `merchant`: Merchant name where the transaction took place.
  - `category`: Category of the merchant.
  - `amt`: Amount of the transaction.
  - `state`: State where the transaction occurred.
  - `zip`: ZIP code of the transaction location.
  - `lat`: Latitude of the cardholder's location.
  - `long`: Longitude of the cardholder's location.
  - `job`: Job title of the cardholder.
  - `dob`: Date of birth of the cardholder.
  - `trans_num`: Transaction number.
  - `unix_time`: Unix timestamp of the transaction.
  - `merch_lat`: Merchant latitude.
  - `merch_long`: Merchant longitude.
  - `is_fraud`: Target variable indicating fraudulent transactions.

## Data Cleaning

- **Handling Missing Values:**
  - Identified missing values using `data.isnull().sum()`.
  - Filled missing values with `0` using `data.fillna(0)`.
  
- **Data Type Conversion:**
  - Converted datetime-like columns to numerical format (timestamp) for modeling purposes.
  - Encoded categorical text columns using `LabelEncoder` to transform them into numerical values.
  
- **Removing Unnecessary Columns:**
  - Dropped columns containing 'Unnamed' in their names to eliminate irrelevant data.

## Statistical Summary

- **Descriptive Statistics:**
  - Provided using `data.describe()` to summarize the central tendency, dispersion, and shape of the dataset’s distribution.
  
- **Target Variable Distribution:**
  - Analyzed the distribution of fraudulent (`is_fraud = 1`) vs. non-fraudulent (`is_fraud = 0`) transactions to understand class imbalance.
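
The class-imbalance check described above can be made concrete by tabulating counts and shares per class. The following is a minimal sketch: the helper name `class_balance` and the toy frame are illustrative assumptions, not part of the original notebook, and the numbers are not taken from `fraud_test.csv`.

```python
import pandas as pd


def class_balance(frame, target="is_fraud"):
    """Return the count and share of each class in the target column."""
    counts = frame[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Tiny made-up frame for illustration only; not taken from fraud_test.csv
toy = pd.DataFrame({"is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]})
balance = class_balance(toy)
print(balance)
```

On the real data, calling `class_balance(data)` would show how rare the positive class is, which matters when choosing evaluation metrics later.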

## Data Visualization

1. **Distribution of Transaction Amounts:**
   - Visualized using histograms or box plots to identify skewness and outliers in transaction amounts.
   
2. **Correlation Heatmap:**
   - Displayed using seaborn's `heatmap` to identify correlations between numerical features, helping in feature selection.
   
3. **Geographical Distribution:**
   - Plotted merchant locations using latitude and longitude to visualize concentration areas and potential geographical patterns in fraudulent activities.
   
4. **Time Series Analysis:**
   - Examined transaction times to detect any temporal patterns or trends associated with fraudulent transactions.
   
5. **Category-wise Fraud Analysis:**
   - Analyzed fraud occurrence across different merchant categories to identify high-risk sectors.
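
The temporal and category-wise analyses listed above both reduce to grouping transactions and comparing fraud rates per group. The sketch below shows one way to do that, assuming the raw (un-encoded) `category` and `trans_date_trans_time` columns are still available. The helper `fraud_rate_by` and the toy rows are hypothetical and serve only as illustration.

```python
import pandas as pd


def fraud_rate_by(frame, key, target="is_fraud"):
    """Group by a column name or a Series and return transactions, frauds, and fraud rate."""
    grouped = frame.groupby(key)[target].agg(transactions="count", frauds="sum")
    grouped["fraud_rate"] = grouped["frauds"] / grouped["transactions"]
    return grouped.sort_values("fraud_rate", ascending=False)


# Made-up rows for illustration only; the real analysis would use the raw data
toy = pd.DataFrame({
    "trans_date_trans_time": [
        "21/06/2020 12:14", "21/06/2020 23:50",
        "22/06/2020 23:05", "22/06/2020 12:40",
    ],
    "category": ["travel", "misc_pos", "misc_pos", "travel"],
    "is_fraud": [0, 1, 0, 0],
})

# Fraud rate per merchant category
by_category = fraud_rate_by(toy, "category")

# Fraud rate per hour of day, parsed from the day-first timestamp strings
hours = pd.to_datetime(toy["trans_date_trans_time"], dayfirst=True).dt.hour.rename("hour")
by_hour = fraud_rate_by(toy, hours)

print(by_category)
print(by_hour)
```

Running this before label encoding keeps the category names readable in the result.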





In [2]:
import numpy as np
import pandas as pd

In [3]:
data = pd.read_csv("data/fraud_test.csv")
data = data.loc[:, ~data.columns.str.contains('^Unnamed')]
data.head()


Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,21/06/2020 12:14,2291160000000000.0,fraud_Kirlin and Sons,personal_care,2.86,Jeff,Elliott,M,351 Darlene Green,Columbia,...,33.9659,-80.9355,333497,Mechanical engineer,19/03/1968,2da90c7d74bd46a0caf3777415b3ebd3,1371816865,33.986391,-81.200714,0
1,21/06/2020 12:14,3573030000000000.0,fraud_Sporer-Keebler,personal_care,29.84,Joanne,Williams,F,3638 Marsh Union,Altonah,...,40.3207,-110.436,302,"Sales professional, IT",17/01/1990,324cc204407e99f51b0d6ca0055005e7,1371816873,39.450498,-109.960431,0
2,21/06/2020 12:14,3598220000000000.0,"fraud_Swaniawski, Nitzsche and Welch",health_fitness,41.28,Ashley,Lopez,F,9333 Valentine Point,Bellmore,...,40.6729,-73.5365,34496,"Librarian, public",21/10/1970,c81755dbbbea9d5c77f094348a7579be,1371816893,40.49581,-74.196111,0
3,21/06/2020 12:15,3591920000000000.0,fraud_Haley Group,misc_pos,60.05,Brian,Williams,M,32941 Krystal Mill Apt. 552,Titusville,...,28.5697,-80.8191,54767,Set designer,25/07/1987,2159175b9efe66dc301f149d3d5abf8c,1371816915,28.812398,-80.883061,0
4,21/06/2020 12:15,3526830000000000.0,fraud_Johnston-Casper,travel,3.19,Nathan,Massey,M,5783 Evan Roads Apt. 465,Falmouth,...,44.2529,-85.017,1126,Furniture designer,06/07/1955,57ff021bd3f328f8738bb535c302a31b,1371816917,44.959148,-85.884734,0


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

# Load the dataset
# data = pd.read_csv("path_to_your_data.csv")

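# Preprocessing: convert date/time columns to numeric timestamps and encode text columns
# read_csv loads dates as plain strings, so parse the day-first date columns explicitly
for column in ['trans_date_trans_time', 'dob']:
    data[column] = pd.to_datetime(data[column], dayfirst=True)

for column in data.columns:
    if pd.api.types.is_datetime64_any_dtype(data[column]):
        # Convert datetime to whole seconds since the Unix epoch
        data[column] = (data[column] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    elif pd.api.types.is_object_dtype(data[column]):
        # Encode categorical text columns as integer labels
        encoder = LabelEncoder()
        data[column] = encoder.fit_transform(data[column].astype(str))

# Replace any remaining missing values with 0
data = data.fillna(0)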










In [None]:
# Overview of the dataset
print(data.info())
print(data.describe())

In [None]:
print(data.isnull().sum())


In [None]:
# Visualization 1: Distribution of the target variable
plt.figure(figsize=(6, 4))
sns.countplot(x='is_fraud', data=data)  # The target column is 'is_fraud'
plt.title('Distribution of Target Variable')
plt.show()


In [None]:
# Visualization 2: Correlation matrix heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [21]:
from sklearn.model_selection import train_test_split
np.set_printoptions(threshold=np.inf)
X = data.drop('is_fraud', axis=1)
y = data['is_fraud']
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



0         0
1         0
2         0
3         0
4         0
         ..
555714    0
555715    0
555716    0
555717    0
555718    0
Name: is_fraud, Length: 555719, dtype: int64
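
Because the printed target is dominated by zeros, a plain random split can give the train and test sets slightly different fraud shares. A possible alternative, sketched here on synthetic data, is to pass `stratify=y` to `train_test_split`. The synthetic arrays and the 2% positive rate are assumptions for illustration and do not reflect the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, made-up data: 1000 rows with 20 positives (2%), for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:20] = 1

# stratify=y_demo keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train fraud share:", y_tr.mean())
print("test fraud share:", y_te.mean())
```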


In [19]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Standardize features (fit on the training split only to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


# Step 2: Define the logistic regression class
class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        """Apply sigmoid function."""
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        """Train the logistic regression model."""
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)  # Initialize weights
        self.bias = 0                       # Initialize bias

        # Gradient descent
        for _ in range(self.epochs):
            # Linear model
            linear_model = np.dot(X, self.weights) + self.bias
            # Apply sigmoid to get predictions
            y_predicted = self.sigmoid(linear_model)

            # Compute gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update weights and bias
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        """Make predictions."""
        linear_model = np.dot(X, self.weights) + self.bias
        y_predicted = self.sigmoid(linear_model)
        # Convert probabilities to binary output (0 or 1) with a 0.5 threshold
        return (y_predicted > 0.5).astype(int)

# Step 3: Train and evaluate the model
model = LogisticRegressionScratch(learning_rate=0.01, epochs=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(y_train)

# Step 4: Evaluate performance
# accuracy = accuracy_score(y_test, y_pred)
# precision = precision_score(y_test, y_pred)
# recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# print("Model Performance:")
# print(f"Accuracy: {accuracy:.2f}")
# print(f"Precision: {precision:.2f}")
# print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")

139526    0
395747    0
395119    0
552207    0
487836    0
         ..
110268    0
259178    0
365838    0
131932    0
121958    0
Name: is_fraud, Length: 444575, dtype: int64
F1-Score: 0.00
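
One plausible reading of an F1-score of 0.00, not verified against this run, is that the model predicts the majority class for every transaction. The sketch below uses made-up labels to show how such a predictor can score high accuracy while its F1-score is zero. The arrays are synthetic assumptions, not results from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 1 fraudulent transaction out of 100, for illustration only
y_true_demo = np.zeros(100, dtype=int)
y_true_demo[0] = 1

# A degenerate predictor that always outputs the majority class (0)
y_pred_demo = np.zeros(100, dtype=int)

demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
# zero_division=0 makes the undefined precision count as 0 without a warning
demo_f1 = f1_score(y_true_demo, y_pred_demo, zero_division=0)

print(f"Accuracy: {demo_accuracy:.2f}")
print(f"F1-Score: {demo_f1:.2f}")
```

If this is what happened here, checking `np.unique(y_pred, return_counts=True)` on the real predictions would confirm it.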
