
In [None]:
Unzipping the file: It extracts the CSV from the ZIP file and places it in the /content/ directory.
Loading and previewing the data: The dataset is loaded into a pandas DataFrame and the first five rows are displayed to give you a glimpse of its structure.
Checking for missing values and duplicates: The script will identify any missing or duplicate data and handle them.
Data types and summary statistics: This will provide insights into the data types of each column and give a summary of the numerical features.
Class distribution check: This checks the distribution of the target variable (Class), which indicates fraudulent and non-fraudulent transactions.
Saving the cleaned dataset: The cleaned dataset will be saved as creditcard_cleaned.csv in the /content/ directory for further analysis.

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import zipfile

# Step 2: Extract and load the dataset from the ZIP file
zip_file_path = '/content/creditcard.csv.zip'  # Path to the ZIP file

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall('/content/')  # Extracts into the /content/ directory in Colab

# Load the CSV file (assuming it is named 'creditcard.csv' after extraction)
csv_file_path = '/content/creditcard.csv'  # Path to the extracted CSV file
df = pd.read_csv(csv_file_path)

# Step 3: Display the first few rows to understand the dataset structure
print("First 5 rows of the dataset:")
print(df.head())

# Step 4: Check for missing values
print("\nChecking for missing values:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Step 5: Handle missing values (if any) - here we drop them, but you could impute as necessary
df_cleaned = df.dropna()  # Drop rows with missing values
# Alternatively, you can fill missing values like this:
# df_cleaned = df.fillna(df.median())  # Impute missing values with median

# Step 6: Remove duplicate rows (if any)
print("\nChecking for duplicate rows:")
duplicate_rows = df_cleaned.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")
df_cleaned = df_cleaned.drop_duplicates()

# Step 7: Check for data types of the columns
print("\nData types of each column:")
print(df_cleaned.dtypes)

# Step 8: Summary statistics of the numerical features
print("\nSummary statistics of numerical features:")
print(df_cleaned.describe())

# Step 9: Inspect the class distribution (fraudulent vs non-fraudulent transactions)
print("\nClass distribution (fraud vs non-fraud):")
print(df_cleaned['Class'].value_counts())

# Step 10: Save the cleaned dataset to a new CSV file in the /content/ directory
output_file_path = '/content/creditcard_cleaned.csv'  # You can download this file from Colab after cleaning
df_cleaned.to_csv(output_file_path, index=False)
print(f"\nCleaned dataset saved as {output_file_path}")


Step 1: Exploratory Data Analysis (EDA)


In [None]:
# Step 1: Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style for the plots
sns.set(style="whitegrid")

# Step 2: Check for class imbalance (fraudulent vs non-fraudulent transactions)
plt.figure(figsize=(6,4))
sns.countplot(x='Class', data=df_cleaned)
plt.title('Class Distribution (0: Non-fraud, 1: Fraud)')
plt.show()

# Step 3: Correlation matrix
plt.figure(figsize=(12,10))
corr_matrix = df_cleaned.corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Step 4: Distribution of the transaction amounts
plt.figure(figsize=(6,4))
sns.histplot(df_cleaned['Amount'], bins=50, kde=True)
plt.title('Distribution of Transaction Amounts')
plt.show()

# Step 5: Time feature analysis (if 'Time' is a feature in your dataset)
plt.figure(figsize=(6,4))
sns.histplot(df_cleaned['Time'], bins=50, kde=True)
plt.title('Distribution of Transaction Times')
plt.show()


Feature Engineering

In [None]:
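# Feature Engineering: Transaction Velocity
# Work on an explicit copy to avoid a possible SettingWithCopyWarning on the assignment below
df_cleaned = df_cleaned.copy()
# Note: grouping by 'Time' means the rolling sum (up to 10 amounts) only covers
# transactions that share the same timestamp, in their original row order
df_cleaned['Transaction_Velocity'] = df_cleaned.groupby('Time')['Amount'].transform(lambda x: x.rolling(window=10, min_periods=1).sum())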

# Inspect new feature
print(df_cleaned[['Time', 'Amount', 'Transaction_Velocity']].head())
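
A common alternative reading of "transaction velocity" is the total amount spent within a trailing time window. The sketch below illustrates that idea on toy data using only the standard library. The function name `trailing_window_sums` and the one-hour window are assumptions made for illustration, and because this dataset has no customer identifier, the window covers all transactions rather than one customer's.

```python
from collections import deque

def trailing_window_sums(times, amounts, window_seconds=3600):
    """For each transaction, sum the amounts seen in the trailing time window.

    Assumes `times` is sorted in ascending order (seconds), as in the 'Time' column.
    """
    window = deque()      # (time, amount) pairs currently inside the window
    running_total = 0.0
    result = []
    for t, amount in zip(times, amounts):
        window.append((t, amount))
        running_total += amount
        # Drop transactions that fell out of the trailing window
        while window[0][0] <= t - window_seconds:
            _, old_amount = window.popleft()
            running_total -= old_amount
        result.append(running_total)
    return result

# Toy data: three transactions within one hour, then one much later
times = [0, 600, 1200, 10000]
amounts = [10.0, 20.0, 30.0, 5.0]
velocity = trailing_window_sums(times, amounts, window_seconds=3600)
print(velocity)  # [10.0, 30.0, 60.0, 5.0]
```

The last value drops back to 5.0 because the three earlier transactions are more than an hour old by then.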


b. Handling Imbalance: Fraudulent transactions are usually much less frequent, so you'll need to handle this class imbalance. You can use techniques like SMOTE (Synthetic Minority Over-sampling Technique) or undersampling. Here's an example using SMOTE:

In [None]:
# Step 1: Install imbalanced-learn library (if needed)
!pip install -U imbalanced-learn

# Step 2: Import SMOTE and apply to balance the classes
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate features and target
X = df_cleaned.drop('Class', axis=1)  # Features
y = df_cleaned['Class']  # Target (fraud or non-fraud)

# Split data into training and test sets (stratify keeps the fraud ratio the same in both)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Apply SMOTE to balance the training set
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Check the class distribution after SMOTE
print("Class distribution before SMOTE:", y_train.value_counts())
print("Class distribution after SMOTE:", y_train_sm.value_counts())
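
SMOTE creates new minority samples by interpolating between an existing fraud sample and one of its nearest fraud neighbours. The sketch below shows only that interpolation step on two made-up points. The helper `smote_like_sample` is hypothetical and skips the nearest-neighbour search that the real algorithm performs.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the line segment between `point` and `neighbor`."""
    u = rng.random()  # Interpolation factor in [0, 1)
    return [p + u * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
fraud_a = [1.0, 5.0]   # Toy minority-class sample
fraud_b = [3.0, 9.0]   # One of its nearest minority-class neighbours
synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)
```

Every synthetic point lies between two real fraud samples, which is why SMOTE should only be applied to the training set and never to the test set.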


Step 3: Model Training
Now that the data is clean, the features are engineered, and the class imbalance is addressed, you're ready for model training. Here is a short example that trains a Random Forest classifier:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: Train a Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_sm, y_train_sm)

# Step 2: Make predictions
y_pred = rf.predict(X_test)

# Step 3: Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
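
When reading the confusion matrix, it helps to see how precision and recall for the fraud class follow from its four cells. The sketch below works this out by hand. The counts are made up for illustration and are not results from this notebook, and `fraud_metrics` is a hypothetical helper.

```python
def fraud_metrics(cm):
    """Compute precision, recall and F1 for class 1 from a 2x2 confusion matrix.

    `cm` follows scikit-learn's layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up counts for illustration only (not results from this notebook)
example_cm = [[9980, 10], [4, 6]]
metrics = fraud_metrics(example_cm)
print(metrics)
```

With these counts the accuracy is above 99% while recall on fraud is only 60%, which is why accuracy alone can be misleading on the imbalanced test set.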


Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train_sm, y_train_sm)

print("Best parameters:", grid_search.best_params_)
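
Before launching a grid search, it is worth estimating how many model fits it will trigger. The sketch below counts them for the grid and fold count used above, using only the standard library.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 3

# Every combination of values is one candidate configuration
candidates = list(product(*param_grid.values()))
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
```

That is 81 candidates and 243 fits, plus one final refit of the best configuration on the whole training set (GridSearchCV's default `refit=True`). On a dataset of this size, a smaller grid or a randomized search may be more practical.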


Ensemble Models

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.svm import SVC

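# Fixed seeds keep the results reproducible. Note that SVC fit time grows at least
# quadratically with the number of samples, so this step can be very slow on the full training set.
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42))
]

stack_model = StackingClassifier(estimators=estimators, final_estimator=RandomForestClassifier(random_state=42))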
stack_model.fit(X_train_sm, y_train_sm)

y_pred_stack = stack_model.predict(X_test)
print(confusion_matrix(y_test, y_pred_stack))
print(classification_report(y_test, y_pred_stack))


Model Explainability: To make the model explainable, especially in a financial fraud context, you can use SHAP or LIME to interpret how each feature contributes to an individual prediction.

In [None]:
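import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

# Older SHAP releases return a list (one array per class); newer ones return a 3-D array
fraud_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(fraud_shap, X_test)  # For class 1 (fraud)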


Cross-Validation: Use cross-validation to evaluate the model’s performance across multiple folds of the data, ensuring that the model generalizes well and isn't overfitting.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf, X_train_sm, y_train_sm, cv=5, scoring='accuracy')
print(f"Cross-Validation Accuracy: {scores.mean()}")
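
For classifiers, `cross_val_score` with an integer `cv` uses stratified folds, so each fold keeps roughly the same class ratio. The sketch below illustrates what stratification means on toy labels. The function `stratified_folds` is a hypothetical, simplified stand-in and not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=42):
    """Assign each sample index to a fold so that every fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each class is spread evenly across folds
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

# Toy labels: 95 non-fraud (0) and 5 fraud (1)
labels = [0] * 95 + [1] * 5
folds = stratified_folds(labels, n_folds=5)
for k, fold in enumerate(folds):
    n_fraud = sum(labels[i] for i in fold)
    print(f"Fold {k}: {len(fold)} samples, {n_fraud} fraud")
```

Each of the five folds ends up with 20 samples, exactly one of them fraud. Without stratification, some folds could contain no fraud cases at all.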


1. Ethical Considerations
Ethical issues are key when implementing AI models, particularly in sensitive areas like fraud detection. The major concerns include bias and fairness, transparency, and customer privacy.

a. Bias and Fairness:
Problem: AI models can unintentionally exhibit bias, especially when the training data contains imbalances. For example, if certain demographics are underrepresented in the data, the model may perform poorly on those groups.
Solution: Regularly check for bias in your model’s predictions and use fairness metrics such as Demographic Parity, Equalized Odds, or Disparate Impact.
Example to detect bias:

In [None]:
# Checking for bias in fraud detection predictions
import pandas as pd
from sklearn.metrics import confusion_matrix

# Let's assume you have demographic data available (like gender, age groups, etc.)
# For example, you could check performance across gender:
# df_cleaned['Gender'] is a hypothetical column representing demographic group
# (the credit card dataset used above does not contain one)

# y_pred is a NumPy array, so give it the test-set index to allow label-based selection
y_pred_series = pd.Series(y_pred, index=y_test.index)
test_groups = df_cleaned.loc[y_test.index, 'Gender']

for group in test_groups.unique():
    group_indices = test_groups[test_groups == group].index
    y_true_group = y_test.loc[group_indices]
    y_pred_group = y_pred_series.loc[group_indices]

    print(f"Confusion Matrix for Gender Group {group}:")
    print(confusion_matrix(y_true_group, y_pred_group, labels=[0, 1]))

# Further analysis could calculate fairness metrics for each demographic group
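
As a rough illustration of the fairness metrics mentioned above, the sketch below computes a demographic parity difference (the gap in flag rates) and a true positive rate gap (one half of the Equalized Odds criterion) for two groups. The labels, predictions, and group memberships are made up, and `group_rates` is a hypothetical helper.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group flag rate and true positive rate for a binary classifier."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        flagged = sum(p for _, p in rows)
        positives = [(t, p) for t, p in rows if t == 1]
        true_pos = sum(p for _, p in positives)
        stats[g] = {
            "flag_rate": flagged / len(rows),
            "tpr": true_pos / len(positives) if positives else float("nan"),
        }
    return stats

# Made-up labels, predictions and group memberships for illustration only
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

stats = group_rates(y_true, y_pred, groups)
parity_gap = abs(stats["A"]["flag_rate"] - stats["B"]["flag_rate"])
tpr_gap = abs(stats["A"]["tpr"] - stats["B"]["tpr"])
print(stats)
print(f"Demographic parity difference: {parity_gap:.2f}")
print(f"True positive rate gap: {tpr_gap:.2f}")
```

A gap of 0 means both groups are treated alike on that metric. What size of gap is acceptable is a policy decision rather than a purely technical one.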


b. Transparency:
Problem: Stakeholders (banks, regulators, and customers) need to understand how decisions are made by the AI model, especially in cases where a transaction is flagged as fraudulent.
Solution: Use explainability tools like SHAP or LIME to provide instance-level explanations of why a transaction was flagged as fraud.

In [None]:
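import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Older SHAP releases return a list (one array per class); newer ones return a 3-D array
fraud_shap = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(fraud_shap, X_test)  # For class 1 (fraud)

# You can also explain individual predictions:
sample = X_test.iloc[0]
shap.initjs()  # Load the JavaScript needed to render force plots in a notebook
shap.force_plot(explainer.expected_value[1], fraud_shap[0], sample)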


c. Customer Privacy:
Problem: Your fraud detection model may require sensitive personal data (e.g., transaction history, location). This raises privacy concerns, especially with regulations like GDPR (in Europe) and CCPA (in California).
Solution: Ensure compliance with these regulations by anonymizing personal data, using encryption, and allowing customers to opt out of data collection.

In [None]:
# Pseudonymizing sensitive data by hashing customer IDs
# ('Customer_ID' is a hypothetical column; the credit card dataset above does not include one)
import hashlib

df_cleaned['Customer_ID_Hashed'] = df_cleaned['Customer_ID'].apply(lambda x: hashlib.sha256(str(x).encode()).hexdigest())

# Drop or mask any personal identifiers before saving or transmitting the data
df_cleaned.drop(columns=['Customer_ID'], inplace=True)
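
A plain SHA-256 hash of a short or guessable ID can be reversed by hashing every candidate ID and comparing. One common mitigation is a keyed hash (HMAC), sketched below with the standard library. The function name `pseudonymize` and the way the key is generated here are assumptions for illustration only.

```python
import hashlib
import hmac
import secrets

def pseudonymize(customer_id, secret_key):
    """Return a keyed SHA-256 digest (HMAC) of the customer ID as a hex string."""
    return hmac.new(secret_key, str(customer_id).encode(), hashlib.sha256).hexdigest()

# In practice the key would be loaded from a secrets manager, not generated per run
secret_key = secrets.token_bytes(32)

token_a = pseudonymize(12345, secret_key)
token_b = pseudonymize(12345, secret_key)
token_c = pseudonymize(12346, secret_key)
print(token_a)
```

The same ID always maps to the same token under one key, so records can still be joined, but someone without the key cannot rebuild the mapping by brute force.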


2. IT Security
Your AI model must be secure against external threats such as hacking, data breaches, and adversarial attacks.

a. Data Encryption:
Problem: Sensitive transaction and customer data need to be protected, especially during transmission between systems or storage.
Solution: Use encryption protocols (e.g., AES, TLS) to protect data at rest and in transit.

In [None]:
from cryptography.fernet import Fernet

# Generate key for encryption (store it securely; the data cannot be decrypted without it)
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Encrypt sensitive columns (here 'Amount', stored as 'Transaction_Amount_Encrypted')
df_cleaned['Transaction_Amount_Encrypted'] = df_cleaned['Amount'].apply(lambda x: cipher_suite.encrypt(str(x).encode()))

# To decrypt the data (in deployment environment):
# df_cleaned['Transaction_Amount_Decrypted'] = df_cleaned['Transaction_Amount_Encrypted'].apply(lambda x: cipher_suite.decrypt(x).decode())


b. Adversarial Attacks:
Problem: Hackers could use adversarial attacks to manipulate your model's predictions, leading to incorrect fraud detection.
Solution: Implement adversarial training or defenses like defensive distillation to make the model more robust.

In [None]:
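import numpy as np
import tensorflow as tf

# Example of generating adversarial examples using the Fast Gradient Sign Method (FGSM)
def create_adversarial_pattern(input_data, input_label, model):
    with tf.GradientTape() as tape:
        tape.watch(input_data)
        prediction = model(input_data)
        loss = tf.keras.losses.binary_crossentropy(input_label, prediction)

    # Get the gradients of the loss w.r.t. the input features.
    gradient = tape.gradient(loss, input_data)
    # Get the sign of the gradients to create the perturbation
    signed_grad = tf.sign(gradient)
    return signed_grad

# Applying it to the test set. `model` is assumed to be a trained TensorFlow/Keras binary
# classifier with a sigmoid output; it is not defined elsewhere in this notebook
# (the Random Forest above is not differentiable, so FGSM does not apply to it directly).
x_tensor = tf.convert_to_tensor(X_test.values, dtype=tf.float32)
y_tensor = tf.convert_to_tensor(y_test.values.reshape(-1, 1), dtype=tf.float32)
epsilon = 0.01  # Perturbation size (illustrative value)
perturbations = create_adversarial_pattern(x_tensor, y_tensor, model)
adversarial_test_data = x_tensor + epsilon * perturbations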
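
The FGSM idea can be checked by hand on a model small enough to differentiate on paper. The sketch below applies it to a logistic regression with made-up weights, using only the standard library. All parameter values and the transaction are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, weights, bias):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def bce_loss(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm(x, y, weights, bias, epsilon):
    """Shift each feature by epsilon in the direction that increases the loss."""
    p = predict(x, weights, bias)
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i
    grad = [(p - y) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]

# Hypothetical model parameters and one fraudulent transaction (y = 1)
weights, bias = [2.0, -1.0], 0.5
x, y = [1.0, 0.5], 1

x_adv = fgsm(x, y, weights, bias, epsilon=0.2)
loss_before = bce_loss(predict(x, weights, bias), y)
loss_after = bce_loss(predict(x_adv, weights, bias), y)
print(x_adv, loss_before, loss_after)
```

The perturbed transaction gets a lower fraud probability and a higher loss than the original, which is the effect an attacker is after. Adversarial training would add such perturbed samples, with their true labels, back into the training set.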


3. Governance
Fraud detection models must comply with relevant legal and regulatory requirements, especially in the financial services sector. Key governance concerns include compliance with financial regulations like AML (Anti-Money Laundering) laws and ensuring auditability of AI decisions.

a. Regulatory Compliance:
Problem: The model needs to comply with regulatory frameworks such as AML, KYC (Know Your Customer), and international fraud detection guidelines.
Solution: Ensure that the model adheres to industry regulations by keeping decision logs and offering interpretability.

b. Auditability:
Problem: Financial institutions need to audit the model’s predictions for potential legal issues. They need to understand how fraud was detected, and this process must be auditable.
Solution: Keep a detailed log of model decisions, including inputs, outputs, and the reasons why a transaction was flagged.

In [None]:
# Log the model's predictions for auditing purposes
import logging

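# force=True replaces handlers that notebook environments such as Colab pre-install on the root logger
logging.basicConfig(filename='fraud_detection_audit.log', level=logging.INFO, force=True)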

for i, (input_data, prediction) in enumerate(zip(X_test.values, y_pred)):
    logging.info(f"Transaction {i}: Input={input_data}, Prediction={prediction}")
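
Audit logs are easier to query later if each decision is written as one structured record with a timestamp and a model version. The sketch below writes JSON lines with the standard library. The field names, the `rf-v1` version label, and the feature values are assumptions for illustration, and an in-memory stream stands in for the audit file.

```python
import io
import json
import logging
from datetime import datetime, timezone

def build_audit_logger(stream):
    """Create a logger that writes one JSON record per line to `stream`."""
    logger = logging.getLogger("fraud_audit_example")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger

def log_decision(logger, transaction_id, features, prediction, model_version):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "transaction_id": transaction_id,
        "model_version": model_version,  # Hypothetical version label
        "features": features,
        "prediction": int(prediction),
    }
    logger.info(json.dumps(record))

# An in-memory stream stands in for the audit file in this sketch
stream = io.StringIO()
audit_logger = build_audit_logger(stream)
log_decision(audit_logger, 0, {"Amount": 149.62, "Time": 0.0}, 0, "rf-v1")
log_decision(audit_logger, 1, {"Amount": 2125.87, "Time": 406.0}, 1, "rf-v1")

records = [json.loads(line) for line in stream.getvalue().splitlines()]
print(records[1]["prediction"])
```

Swapping the `StreamHandler` for a `logging.FileHandler` would write the same records to disk. The reasons for a flag, such as the top SHAP contributions, could be added as another field.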
