# Data Processing for Real-Time Fraud Detection

This notebook cleans and prepares the transaction dataset for analysis and modeling. It covers three steps: handling missing values, encoding categorical variables, and creating new features.

## 1. Import Libraries

First, we need to import the necessary libraries.


In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer  # OneHotEncoder is not needed; encoding below uses pd.get_dummies

# Set display options for better readability
pd.set_option('display.max_columns', None)


## 2. Load the Data

Load the dataset to process. The cell below reads a CSV file from a local path; a file kept in Azure Blob Storage must first be downloaded or otherwise made readable by `pd.read_csv`.


In [None]:
# Load data from Azure Blob Storage or local path
data_path = 'path_to_your_data_file.csv'  # Update with your file path
data = pd.read_csv(data_path)

# Display the first few rows of the dataframe
data.head()


## 3. Data Overview

Check the shape of the dataset and get basic statistics.


In [None]:
# Check the shape of the dataset
print("Shape of the dataset:", data.shape)

# Get basic statistics
data.describe(include='all')
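
Beyond shape and summary statistics, column types, duplicate rows, and per-column cardinality are often worth a quick look before cleaning. The sketch below runs on a small made-up frame (`sample`) that stands in for `data`; the rows are illustrative, not taken from the real dataset.

```python
import pandas as pd

# Made-up rows standing in for `data`; values are illustrative only.
sample = pd.DataFrame({
    "amount": [120.0, 75.5, 120.0],
    "location": ["NY", "CA", "NY"],
    "is_international": [False, True, False],
})

# Data type of each column (helps spot numbers stored as strings).
column_types = sample.dtypes

# Number of fully duplicated rows (the third row repeats the first).
duplicate_rows = int(sample.duplicated().sum())

# Number of distinct values per column (a rough guide to cardinality).
unique_counts = sample.nunique()

print(column_types)
print("Duplicate rows:", duplicate_rows)
print(unique_counts)
```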


## 4. Check for Missing Values

Identify and handle missing values in the dataset.


In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
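
Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch, the cell below builds a count-and-percentage summary on a made-up frame (`sample`) standing in for `data`; the same expressions should apply to the real dataframe.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for `data`; values are illustrative only.
sample = pd.DataFrame({
    "amount": [120.0, np.nan, 75.5, 980.0],
    "location": ["NY", "CA", None, "NY"],
    "account_age_days": [30, 400, 12, 95],
})

# Count and percentage of missing values per column.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,
})

# Keep only columns that have gaps, with the largest share first.
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
missing_summary = missing_summary.sort_values("missing_pct", ascending=False)
print(missing_summary)
```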


In [None]:
# Handle missing values
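# Example: impute numerical features with the mean and categorical features with the mode.
# Note: the mean can be fractional, which may be undesirable for count columns such as
# 'previous_fraud_count'; strategy='median' is a common alternative.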
numerical_features = ['amount', 'account_age_days', 'previous_fraud_count']
categorical_features = ['location', 'is_international']

imputer_num = SimpleImputer(strategy='mean')
data[numerical_features] = imputer_num.fit_transform(data[numerical_features])

imputer_cat = SimpleImputer(strategy='most_frequent')
data[categorical_features] = imputer_cat.fit_transform(data[categorical_features])

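# Confirm that no missing values remain in the imputed columns
remaining_missing = int(data[numerical_features + categorical_features].isnull().sum().sum())
print("Missing values remaining in imputed columns:", remaining_missing)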
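
If the data is later split for training and evaluation, or if new transactions are scored as they arrive, it is usually safer to learn the imputation statistics from the training rows only and reuse them elsewhere. The sketch below assumes two made-up frames, `train` and `new_batch`, which are not part of the original notebook; it fits the imputer once and applies it to both.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

numerical_features = ["amount", "account_age_days", "previous_fraud_count"]

# Made-up rows standing in for a training split and a later batch.
train = pd.DataFrame({
    "amount": [100.0, 200.0, np.nan, 300.0],
    "account_age_days": [10.0, np.nan, 30.0, 20.0],
    "previous_fraud_count": [0.0, 1.0, 0.0, np.nan],
})
new_batch = pd.DataFrame({
    "amount": [np.nan, 5000.0],
    "account_age_days": [400.0, np.nan],
    "previous_fraud_count": [np.nan, 2.0],
})

# Learn the column means from the training rows only.
imputer_num = SimpleImputer(strategy="mean")
train[numerical_features] = imputer_num.fit_transform(train[numerical_features])

# Reuse the learned means on new rows; transform() does not refit.
new_batch[numerical_features] = imputer_num.transform(new_batch[numerical_features])

print(new_batch)
```

The large `amount` in `new_batch` has no influence on the fill values, because the means come from `train` alone.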


## 5. Encoding Categorical Variables

One-hot encode the categorical variables so that models can consume them as numeric columns.


In [None]:
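# One-hot encode the categorical variables.
# drop_first=True drops one level per feature to avoid a redundant column;
# dtype=int produces 0/1 columns (recent pandas versions default to booleans).
data_encoded = pd.get_dummies(data, columns=categorical_features, drop_first=True, dtype=int)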

# Display the first few rows of the processed data
data_encoded.head()
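
In a real-time setting, incoming transactions must be encoded into the same columns that the model saw during training, even when a batch contains only some of the categories. One way to do this with pandas alone is to store the training column list and reindex new data against it. The frames `train` and `incoming` below are made-up stand-ins, and the `"yes"`/`"no"` values for `is_international` are an assumption about how that column is stored.

```python
import pandas as pd

categorical_features = ["location", "is_international"]

# Made-up rows standing in for historical data and one incoming transaction.
train = pd.DataFrame({
    "amount": [50.0, 1200.0, 300.0],
    "location": ["CA", "NY", "TX"],
    "is_international": ["no", "yes", "no"],
})
incoming = pd.DataFrame({
    "amount": [75.0],
    "location": ["NY"],
    "is_international": ["no"],
})

# Encode the historical data and remember the resulting column layout.
train_encoded = pd.get_dummies(
    train, columns=categorical_features, drop_first=True, dtype=int
)
encoded_columns = train_encoded.columns

# Encode incoming rows WITHOUT drop_first: with a single row, drop_first
# would discard the only category present. Reindexing then drops columns
# the model never saw and fills absent ones with 0.
incoming_encoded = pd.get_dummies(incoming, columns=categorical_features, dtype=int)
incoming_encoded = incoming_encoded.reindex(columns=encoded_columns, fill_value=0)

print(incoming_encoded)
```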


## 6. Feature Engineering

Create any additional features that may enhance the model's performance.


In [None]:
# Example: Creating a feature that indicates whether a transaction amount is above a certain threshold
threshold = 1000  # Define a threshold for high-value transactions
data_encoded['high_value_transaction'] = np.where(data_encoded['amount'] > threshold, 1, 0)

# Display the first few rows of the updated dataframe
data_encoded.head()
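
Transaction amounts are often heavily right-skewed, so a log-scaled copy of `amount` can be a useful companion to the threshold flag. The sketch below is one possible addition, not part of the original pipeline: the column name `amount_log` is hypothetical, and `sample` is a made-up frame standing in for `data_encoded`.

```python
import numpy as np
import pandas as pd

# Made-up amounts standing in for data_encoded['amount'].
sample = pd.DataFrame({"amount": [0.0, 9.0, 999.0, 25000.0]})

# log1p computes log(1 + x), so zero amounts map to 0 instead of -inf.
sample["amount_log"] = np.log1p(sample["amount"])

# Same high-value flag as above, written as a boolean comparison cast to int.
threshold = 1000
sample["high_value_transaction"] = (sample["amount"] > threshold).astype(int)

print(sample)
```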


## 7. Save Processed Data

Save the processed dataset to a new CSV file for further use. The cell below writes to a local path; uploading the file to Azure Blob Storage is a separate step that is not shown here.


In [None]:
# Save processed data to a CSV file
processed_data_path = 'processed_fraud_detection_data.csv'  # Update with your desired file path
data_encoded.to_csv(processed_data_path, index=False)

print("Processed data saved to:", processed_data_path)
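
After saving, a quick round-trip read can confirm that the file contains the expected rows and columns. The sketch below writes a made-up frame (`sample`, standing in for `data_encoded`) to a temporary directory so that it does not touch any real file.

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for data_encoded.
sample = pd.DataFrame({
    "amount": [120.0, 1500.0],
    "location_NY": [1, 0],
    "high_value_transaction": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    processed_data_path = os.path.join(tmp_dir, "processed_fraud_detection_data.csv")
    sample.to_csv(processed_data_path, index=False)

    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_csv(processed_data_path)

# Compare shape and column order between the saved and reloaded frames.
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
print("Round trip OK:", same_shape and same_columns)
```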
