# Data Exploration for Real-Time Fraud Detection

This notebook explores and visualizes the data for the fraud detection project. We perform basic exploratory analysis to understand the feature distributions and to identify patterns that may indicate fraudulent transactions.

## 1. Import Libraries

First, we need to import the necessary libraries.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for seaborn
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an alias


## 2. Load the Data

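We will load the data from Azure Blob Storage or a local file. The cell below reads a CSV file; JSON and XML files can be read with the corresponding pandas readers (`pd.read_json`, `pd.read_xml`).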


In [None]:
# Load data from Azure Blob Storage or local path
data_path = 'path_to_your_data_file.csv'  # Update with your file path
data = pd.read_csv(data_path)

# Display the first few rows of the dataframe
data.head()
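
Since the source file may be CSV, JSON, or XML, one option is to pick the pandas reader from the file extension. The sketch below is only an illustration: `load_table` and `_READERS` are hypothetical names that do not appear in the original notebook, and it assumes the file is reachable as a local path or URL.

```python
from pathlib import Path

import pandas as pd

# Hypothetical helper (not part of the original notebook): choose a pandas
# reader based on the file extension.
_READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".xml": pd.read_xml,  # needs pandas >= 1.3 and an XML parser such as lxml
}


def load_table(path):
    """Load a CSV, JSON, or XML file into a DataFrame based on its suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return _READERS[suffix](path)
```

In this notebook it could replace the direct `pd.read_csv` call, for example `data = load_table(data_path)`.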


## 3. Data Overview

Let's get a brief overview of the dataset, including its shape and basic statistics.


In [None]:
# Check the shape of the dataset
print("Shape of the dataset:", data.shape)

# Get basic statistics
data.describe(include='all')
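
Alongside `describe`, a per-column view of data types and cardinality can help separate numerical from categorical features before plotting. The following is a minimal sketch; `column_summary` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def column_summary(df):
    """Return one row per column: dtype, non-null count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })
```

Calling `column_summary(data)` in a cell displays the table. Columns with very few distinct values are often better treated as categorical.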


## 4. Check for Missing Values

It's important to identify any missing values in the dataset.


In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_values[missing_values > 0]
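
Raw counts are easier to interpret next to the share of rows they represent. As a sketch, the hypothetical helper `missing_report` below (not part of the original notebook) reports both, keeping only columns that have at least one missing value.

```python
import pandas as pd


def missing_report(df):
    """Missing-value count and percentage per column, worst columns first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only columns that actually contain missing values.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)
```

For example, `missing_report(data)` would show at a glance whether a column is missing in a handful of rows or in most of them.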


## 5. Visualize the Distribution of Key Features

We will plot histograms with a kernel density estimate (KDE) overlay for key numerical features to understand their distributions. Boxplots are a useful complement for spotting outliers.


In [None]:
# Set of features to explore
numerical_features = ['amount', 'account_age_days', 'previous_fraud_count']

# Plot histograms
plt.figure(figsize=(15, 5))
for i, feature in enumerate(numerical_features, 1):
    plt.subplot(1, len(numerical_features), i)
    sns.histplot(data[feature], kde=True)
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()
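
The histogram cell does not draw boxplots, so the following sketch shows one way they might be added. `plot_boxplots` is a hypothetical helper name; it uses plain matplotlib and assumes the listed columns are numeric.

```python
import matplotlib.pyplot as plt


def plot_boxplots(df, features):
    """Draw one boxplot per feature, side by side, and return the figure."""
    fig, axes = plt.subplots(
        1, len(features), figsize=(5 * len(features), 5), squeeze=False
    )
    for ax, feature in zip(axes[0], features):
        # Drop missing values first; boxplot statistics are undefined for NaN.
        ax.boxplot(df[feature].dropna().to_numpy())
        ax.set_title(f"Boxplot of {feature}")
        ax.set_xticks([])
    fig.tight_layout()
    return fig
```

In this notebook it could be called as `plot_boxplots(data, numerical_features)` followed by `plt.show()`. Points drawn beyond the whiskers are candidates for closer inspection, not necessarily errors.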


## 6. Explore Categorical Features

Let's visualize the distribution of categorical features, such as the transaction location and whether a transaction is international.


In [None]:
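# Plot categorical features
# Assigning x to hue avoids the palette-without-hue deprecation warning
# in seaborn >= 0.13; on older versions, drop hue and legend.
plt.figure(figsize=(15, 5))
sns.countplot(data=data, x='location', hue='location', palette='viridis', legend=False)
plt.title('Transaction Location Distribution')
plt.xticks(rotation=45)
plt.show()

plt.figure(figsize=(8, 5))
sns.countplot(data=data, x='is_international', hue='is_international', palette='viridis', legend=False)
plt.title('International Transactions Distribution')
plt.show()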
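
If `location` has many distinct values, a count plot can become hard to read. One possible approach, sketched below with a hypothetical helper `category_shares`, is to tabulate the share of each category and lump everything beyond the most frequent ones into a single "Other" row. The default of 10 categories is an arbitrary choice for illustration.

```python
import pandas as pd


def category_shares(df, column, top_n=10):
    """Share of rows per category, with rare categories lumped into 'Other'."""
    # value_counts sorts by frequency, most common first.
    shares = df[column].value_counts(normalize=True, dropna=False)
    top = shares.iloc[:top_n]
    rest = shares.iloc[top_n:].sum()
    if rest > 0:
        # Assumes no real category is already called "Other".
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top
```

For example, `category_shares(data, 'location', top_n=10)` returns a Series that can be displayed directly or plotted with `.plot(kind='bar')`.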


## 7. Correlation Matrix

We will visualize the pairwise Pearson correlation (the pandas default) between numerical features to identify linear relationships.


In [None]:
# Compute the correlation matrix
correlation_matrix = data[numerical_features].corr()

# Plot the heatmap
plt.figure(figsize=(8, 6))
# Fix the color scale to [-1, 1] so the diverging colormap is centered at zero
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
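
Pearson correlation measures linear association and can be distorted by heavy-tailed features such as transaction amounts. As a complementary sketch, the hypothetical helper `strongest_correlations` below ranks feature pairs by absolute Spearman (rank-based) correlation, which captures monotonic relationships and is less sensitive to outliers. Choosing Spearman here is an assumption for illustration, not something the original notebook prescribes.

```python
from itertools import combinations

import pandas as pd


def strongest_correlations(df, features, method="spearman"):
    """List feature pairs sorted by absolute correlation, strongest first."""
    corr = df[features].corr(method=method)
    # One entry per unordered pair; self-correlations are skipped.
    pairs = pd.Series(
        {(a, b): corr.loc[a, b] for a, b in combinations(features, 2)},
        dtype=float,
    )
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)
```

In this notebook it could be called as `strongest_correlations(data, numerical_features)`. Passing `method="pearson"` gives the same ranking for the matrix shown in the heatmap.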


## 8. Initial Insights

Based on the exploratory analysis, summarize any initial insights or observations about the dataset. Discuss potential feature importance for predicting fraud.


In [None]:
# This section can be filled out with observations based on the visualizations
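
One way to start discussing feature importance is to compare fraud rates across the values of a categorical feature. The sketch below assumes the dataset has a 0/1 label column; the name `is_fraud` is a placeholder, since the original notebook does not name the target column, and `fraud_rate_by` is a hypothetical helper.

```python
import pandas as pd


def fraud_rate_by(df, column, label="is_fraud"):
    """Fraud rate and transaction count per category of `column`.

    `label` is an ASSUMED 0/1 fraud indicator column; adjust the name to
    match your schema.
    """
    grouped = df.groupby(column)[label].agg(
        fraud_rate="mean", transactions="count"
    )
    return grouped.sort_values("fraud_rate", ascending=False)
```

For example, `fraud_rate_by(data, 'is_international')` would show whether international transactions have a higher observed fraud rate. Rates computed from very few transactions should be read with caution.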
