
In [None]:
Unzipping the file: Extracts the CSV from the ZIP archive into the /content/ directory.
Loading and previewing the data: Loads the dataset into a pandas DataFrame and displays the first five rows to show its structure.
Checking for missing values and duplicates: Reports any missing values and duplicate rows, then drops both.
Data types and summary statistics: Prints the data type of each column and summary statistics for the numerical features.
Class distribution check: Counts the values of the target variable (Class), which marks each transaction as fraudulent (1) or non-fraudulent (0).
Saving the cleaned dataset: Writes the cleaned dataset to creditcard_cleaned.csv in the /content/ directory for further analysis.

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
import zipfile

# Step 2: Extract and load the dataset from the ZIP file
zip_file_path = '/content/creditcard.csv.zip'  # Path to the ZIP file

# Unzip the file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall('/content/')  # Extracts into the /content/ directory in Colab

# Load the CSV file (assuming it is named 'creditcard.csv' after extraction)
csv_file_path = '/content/creditcard.csv'  # Path to the extracted CSV file
df = pd.read_csv(csv_file_path)

# Step 3: Display the first few rows to understand the dataset structure
print("First 5 rows of the dataset:")
print(df.head())

# Step 4: Check for missing values
print("\nChecking for missing values:")
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])

# Step 5: Handle missing values (if any) - here we drop them, but you could impute as necessary
df_cleaned = df.dropna()  # Drop rows with missing values
# Alternatively, you can fill missing values like this:
# df_cleaned = df.fillna(df.median())  # Impute missing values with median

# Step 6: Remove duplicate rows (if any)
print("\nChecking for duplicate rows:")
duplicate_rows = df_cleaned.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")
df_cleaned = df_cleaned.drop_duplicates()

# Step 7: Check for data types of the columns
print("\nData types of each column:")
print(df_cleaned.dtypes)

# Step 8: Summary statistics of the numerical features
print("\nSummary statistics of numerical features:")
print(df_cleaned.describe())

# Step 9: Inspect the class distribution (fraudulent vs non-fraudulent transactions)
print("\nClass distribution (fraud vs non-fraud):")
print(df_cleaned['Class'].value_counts())

# Step 10: Save the cleaned dataset to a new CSV file in the /content/ directory
output_file_path = '/content/creditcard_cleaned.csv'  # You can download this file from Colab after cleaning
df_cleaned.to_csv(output_file_path, index=False)
print(f"\nCleaned dataset saved as {output_file_path}")
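
The commented-out median imputation in Step 5 can be tried on a small frame before touching the real data. The following is a minimal sketch under stated assumptions: the toy values and the `feature_cols` list are made up for illustration, and only feature columns are imputed so the target column is left alone.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Amount": [10.0, np.nan, 30.0, 1000.0],
    "V1": [0.5, -1.2, np.nan, 0.1],
    "Class": [0, 0, 1, 0],
})

# Impute only the feature columns; the target column is left untouched
feature_cols = ["Amount", "V1"]  # hypothetical selection for this sketch
toy_imputed = toy.copy()
toy_imputed[feature_cols] = toy[feature_cols].fillna(toy[feature_cols].median())

print(toy_imputed)
```

The median is less sensitive than the mean to extreme amounts such as the 1000.0 above, which is one reason it is often preferred for skewed columns.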


Step 1: Exploratory Data Analysis (EDA)


In [None]:
# Step 1: Import necessary libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style for the plots (set_theme is the current name for sns.set)
sns.set_theme(style="whitegrid")

# Step 2: Check for class imbalance (fraudulent vs non-fraudulent transactions)
plt.figure(figsize=(6,4))
sns.countplot(x='Class', data=df_cleaned)
plt.title('Class Distribution (0: Non-fraud, 1: Fraud)')
plt.show()

# Step 3: Correlation matrix
plt.figure(figsize=(12,10))
corr_matrix = df_cleaned.corr()
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Step 4: Distribution of the transaction amounts
plt.figure(figsize=(6,4))
sns.histplot(df_cleaned['Amount'], bins=50, kde=True)
plt.title('Distribution of Transaction Amounts')
plt.show()

# Step 5: Time feature analysis (if 'Time' is a feature in your dataset)
plt.figure(figsize=(6,4))
sns.histplot(df_cleaned['Time'], bins=50, kde=True)
plt.title('Distribution of Transaction Times')
plt.show()
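
The count plot shows the imbalance visually, but it can help to put a number on it as well. Here is a small sketch on made-up labels (998 legitimate and 2 fraudulent transactions, chosen only for illustration); on the real data the same lines could be run against `df_cleaned['Class']`.

```python
import pandas as pd

# Hypothetical labels: 998 legitimate transactions and 2 fraudulent ones
labels = pd.Series([0] * 998 + [1] * 2, name="Class")

counts = labels.value_counts()
fraud_share = labels.mean()  # the mean of a 0/1 column is the share of 1s
imbalance_ratio = counts[0] / counts[1]

print(f"Fraud share: {fraud_share:.2%}")
print(f"Non-fraud to fraud ratio: {imbalance_ratio:.0f}:1")
```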


Feature Engineering

In [None]:
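# Feature Engineering: Transaction Velocity
# Grouping by 'Time' limits each rolling window to rows sharing one timestamp, so the sum
# barely differs from 'Amount'. Instead, sum the 10 most recent transactions in time order.
df_cleaned = df_cleaned.sort_values('Time').reset_index(drop=True)
df_cleaned['Transaction_Velocity'] = df_cleaned['Amount'].rolling(window=10, min_periods=1).sum()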

# Inspect new feature
print(df_cleaned[['Time', 'Amount', 'Transaction_Velocity']].head())
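
A velocity feature can also be defined over a span of time rather than a fixed number of rows. The sketch below is one possible variant, not the notebook's method: it assumes `Time` holds seconds since the first transaction, uses made-up values, and picks a 60-second window purely for illustration. The column names `Amount_Last_60s` and `Count_Last_60s` are hypothetical.

```python
import pandas as pd

# Toy transactions; 'Time' is assumed to be seconds since the first transaction
toy = pd.DataFrame({
    "Time": [0, 10, 30, 100, 110, 400],
    "Amount": [5.0, 20.0, 1.0, 50.0, 7.0, 3.0],
})

# A time-based rolling window needs a datetime-like index
toy = toy.sort_values("Time")
toy.index = pd.to_timedelta(toy["Time"], unit="s")

# Sum and count of amounts in the trailing 60 seconds (current row included)
window = toy["Amount"].rolling("60s")
toy["Amount_Last_60s"] = window.sum()
toy["Count_Last_60s"] = window.count()

print(toy.reset_index(drop=True))
```

Because the window only looks backwards, each row's feature depends on earlier transactions alone, which keeps future information out of the feature.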


Handling Imbalance: Fraudulent transactions are far rarer than legitimate ones, so the class imbalance has to be addressed before training. Common techniques are oversampling with SMOTE (Synthetic Minority Over-sampling Technique) and undersampling the majority class. The example below applies SMOTE to the training set only, so the test set keeps the real class distribution:

In [None]:
# Step 1: Install imbalanced-learn library (if needed)
!pip install -U imbalanced-learn

# Step 2: Import SMOTE and apply to balance the classes
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Separate features and target
X = df_cleaned.drop('Class', axis=1)  # Features
y = df_cleaned['Class']  # Target (fraud or non-fraud)

# Split data into training and test sets (stratify so both keep the same fraud ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Apply SMOTE to balance the training set
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Check the class distribution after SMOTE
print("Class distribution before SMOTE:", y_train.value_counts())
print("Class distribution after SMOTE:", y_train_sm.value_counts())
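
Undersampling, the other technique mentioned above, can be sketched with pandas alone. This is a minimal illustration on a made-up training frame (95 legitimate rows, 5 fraud rows), not a replacement for the SMOTE cell: it keeps every fraud row and draws an equally sized random sample of legitimate rows.

```python
import pandas as pd

# Toy training set: 95 legitimate rows and 5 fraud rows (made-up values)
train = pd.DataFrame({
    "Amount": range(100),
    "Class": [0] * 95 + [1] * 5,
})

fraud = train[train["Class"] == 1]
legit = train[train["Class"] == 0]

# Keep every fraud row and draw an equally sized random sample of legitimate rows
legit_down = legit.sample(n=len(fraud), random_state=42)

# Combine and shuffle so the two classes are interleaved
train_under = pd.concat([fraud, legit_down]).sample(frac=1, random_state=42)

print(train_under["Class"].value_counts())
```

The trade-off is that undersampling discards most of the legitimate transactions, whereas SMOTE keeps them and adds synthetic fraud rows.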


Step 3: Model Training
Now that the data is clean, the features are engineered, and the class imbalance is addressed, the model can be trained. The code below trains a Random Forest classifier as an example:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: Train a Random Forest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_sm, y_train_sm)

# Step 2: Make predictions
y_pred = rf.predict(X_test)

# Step 3: Evaluate the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
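
To read the confusion matrix, it helps to see how precision, recall, and F1 follow from its four cells. The counts below are hypothetical and chosen only to illustrate the arithmetic; they are not results from this notebook.

```python
# Hypothetical confusion-matrix counts, laid out as scikit-learn prints them:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = 85000, 20, 30, 120

precision = tp / (tp + fp)  # of the transactions flagged as fraud, how many were fraud
recall = tp / (tp + fn)     # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.4f}")
```

With counts like these, accuracy stays very high even though a fifth of the frauds are missed, which is why precision and recall on the fraud class are usually more informative for imbalanced data.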


Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

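# Score on F1 for the fraud class; the default (accuracy) is a weak guide for fraud detection
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, scoring='f1', n_jobs=-1, verbose=2)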
grid_search.fit(X_train_sm, y_train_sm)

print("Best parameters:", grid_search.best_params_)
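
Before launching a grid search on a dataset of this size, it can be worth counting how many model fits it implies. The sketch below uses scikit-learn's `ParameterGrid` on the same grid as the cell above; `cv_folds` simply mirrors the `cv=3` setting.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the hyperparameter tuning cell
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv_folds = 3  # mirrors cv=3 in GridSearchCV

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds

print(f"{n_candidates} parameter combinations x {cv_folds} folds = {n_fits} fits")
```

If that number is too large for the available time, `RandomizedSearchCV` from the same module evaluates only `n_iter` randomly chosen combinations instead of all of them.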


Ensemble Models

In [None]:
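from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# SVC is sensitive to feature scale, so standardize its inputs inside a pipeline
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42)))
]

stack_model = StackingClassifier(estimators=estimators, final_estimator=RandomForestClassifier(random_state=42))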
stack_model.fit(X_train_sm, y_train_sm)

y_pred_stack = stack_model.predict(X_test)
print(confusion_matrix(y_test, y_pred_stack))
print(classification_report(y_test, y_pred_stack))


Model Explainability: To make the model explainable, which matters especially in a financial fraud context, you can use SHAP or LIME to show how much each feature contributed to a given prediction.

In [None]:
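import numpy as np
import shap

# Explain a sample of the test set; TreeExplainer is slow on the full set
X_sample = X_test.sample(n=min(1000, len(X_test)), random_state=42)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_sample)

# Older SHAP returns a list (one array per class); newer returns one 3-D array
sv = np.asarray(shap_values)
fraud_shap = sv[1] if isinstance(shap_values, list) else sv[..., 1]
shap.summary_plot(fraud_shap, X_sample)  # For class 1 (fraud)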


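Cross-Validation: Use cross-validation to evaluate the model's performance across multiple folds of the data and check that it generalizes rather than overfits. SMOTE is applied inside each fold here, because cross-validating on already-resampled data would place synthetic samples in the validation folds and inflate the score.

In [None]:
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score

# Resample inside each fold so synthetic samples never reach the validation data
cv_model = ImbPipeline([('smote', SMOTE(random_state=42)),
                        ('rf', RandomForestClassifier(n_estimators=100, random_state=42))])
scores = cross_val_score(cv_model, X_train, y_train, cv=5, scoring='f1')
print(f"Cross-Validation F1 (fraud class): {scores.mean():.4f}")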
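
When an integer is passed as `cv` for a classifier, scikit-learn splits the data with stratified folds, so each fold keeps roughly the same share of fraud cases. The sketch below makes that visible on made-up labels (90 legitimate, 10 fraud); `X_toy` is only a placeholder feature.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 90 legitimate (0) and 10 fraud (1) transactions
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fraud_per_fold = [int(y_toy[val_idx].sum()) for _, val_idx in skf.split(X_toy, y_toy)]

print("Fraud cases in each validation fold:", fraud_per_fold)
```

Without stratification, a rare class can end up missing from some folds entirely, which makes per-fold recall and F1 unstable.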
