# Malaria Detection using Machine Learning with Dimension Reduction

## Project Overview
This notebook detects malaria from blood cell images by comparing several machine learning classifiers, with a focus on dimension reduction methods such as PCA and t-SNE.

### Dataset Challenges and Preprocessing Strategy

**Mixed Image Formats Issue:**
- The dataset contains images in different formats (.tiff and .png)
- For consistent processing, we will standardize all images to PNG format
- This ensures uniform handling across the entire pipeline
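
A minimal sketch of the format standardization, assuming Pillow is installed and the images sit under a single root folder with one sub-folder per class. The `convert_to_png` helper and the directory layout are hypothetical and not taken from the dataset. The output folder should live outside the source folder, and two source files that differ only by extension would overwrite each other.

```python
from pathlib import Path

from PIL import Image


def convert_to_png(src_dir, dst_dir):
    """Write every .tif/.tiff/.png image under src_dir into dst_dir as PNG.

    Hypothetical helper: the folder layout is an assumption.
    Returns the list of PNG paths that were written.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    written = []
    for path in sorted(src_dir.rglob("*")):
        if path.suffix.lower() not in {".tif", ".tiff", ".png"}:
            continue
        # Mirror the source sub-folder structure (for example, class folders).
        target = dst_dir / path.relative_to(src_dir).with_suffix(".png")
        target.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(path) as img:
            # Convert to 8-bit RGB so every output image has the same mode.
            img.convert("RGB").save(target, format="PNG")
        written.append(target)
    return written
```

A call such as `convert_to_png("data/raw", "data/png")` (both paths are placeholders) would then produce a PNG-only copy of the dataset for the rest of the pipeline.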

**Class Imbalance Problem:**
- Training set: 4,000 negative vs 800 positive samples (~5:1 ratio)
- Validation set: 1,531 negative vs 1,035 positive samples (~1.5:1 ratio)
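- The 5:1 training imbalance requires special handling techniques
- The two sets also have different class ratios, so a model tuned to the training distribution will be evaluated on a noticeably more balanced one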
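
As a quick check of the ratios above, the following sketch recomputes them from the quoted counts and derives "balanced" class weights with scikit-learn. The label encoding (0 = negative, 1 = positive) is an assumption made for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts quoted above (assumed encoding: 0 = negative, 1 = positive).
train_counts = {0: 4000, 1: 800}
val_counts = {0: 1531, 1: 1035}

for name, counts in [("train", train_counts), ("validation", val_counts)]:
    ratio = counts[0] / counts[1]
    print(f"{name}: {ratio:.2f} negatives per positive")

# Rebuild a label vector with the training counts and derive "balanced" weights,
# defined as n_samples / (n_classes * class_count).
y_train = np.repeat([0, 1], [train_counts[0], train_counts[1]])
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = {0: float(weights[0]), 1: float(weights[1])}
print(class_weight)
```

With these counts the balanced weights come out to 0.6 for the negative class and 3.0 for the positive class, which mirrors the 5:1 ratio.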

### Techniques to Handle Class Imbalance:

1. **Data-Level Approaches:**
   - **SMOTE (Synthetic Minority Oversampling Technique):** Generate synthetic positive samples
   - **Random Oversampling:** Duplicate minority class samples by sampling with replacement
   - **Random Undersampling:** Reduce majority class samples (with caution)
   - **Data Augmentation:** Apply transformations specifically to minority class

2. **Algorithm-Level Approaches:**
   - **Class Weights:** Assign higher weights to minority class during training
   - **Focal Loss:** Focus learning on hard-to-classify examples
   - **Cost-Sensitive Learning:** Penalize misclassification of minority class more heavily

3. **Evaluation Strategies:**
   - **Stratified Cross-Validation:** Maintain class distribution in folds
   - **Balanced Metrics:** Use F1-score, precision, recall, and AUC-ROC instead of just accuracy
   - **Confusion Matrix Analysis:** Detailed analysis of true/false positives and negatives

4. **Ensemble Methods:**
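   - **Balanced Random Forest:** Random forest that trains each tree on a class-balanced bootstrap sample
   - **EasyEnsemble:** Combine classifiers trained on different balanced subsets of the data
   - **BalanceCascade:** Sequential ensemble with balanced sampling (no longer shipped in recent imbalanced-learn releases, so it would need a custom implementation)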

### Implementation Plan:
We will implement several of these techniques and compare them with the balanced metrics listed above, rather than accuracy alone, to find the most effective combination for this dataset.


## Data Exploration and Preprocessing

In this section, we will:
1. Load and examine the dataset structure
2. Analyze image properties and distributions
3. Visualize sample images from both classes
4. Implement preprocessing pipeline for format standardization
5. Apply class imbalance handling techniques
6. Prepare data for dimension reduction experiments
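
The last step can be sketched as follows, assuming the images have already been resized to a common shape. The random array below is only a placeholder for real image data, and the 32x32 size and 95% variance threshold are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Placeholder batch: 64 RGB images of 32x32 pixels with values in [0, 255].
images = rng.integers(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)

# Flatten each image into one feature vector and rescale to [0, 1].
X = images.reshape(len(images), -1).astype(np.float32) / 255.0

# Standardize the features, then keep enough principal components
# to explain 95% of the variance.
reducer = make_pipeline(
    StandardScaler(), PCA(n_components=0.95, svd_solver="full")
)
X_reduced = reducer.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```

In the real pipeline the reducer would be fitted on the training images only and then reused to transform the validation images.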


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import cv2
import os
import glob
from pathlib import Path
import warnings
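# Suppress all warnings to keep notebook output readable
# (note: this also hides deprecation and convergence warnings)
warnings.filterwarnings('ignore')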

# Machine Learning libraries
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Deep Learning libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import VGG16, ResNet50

# Imbalanced learning libraries
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.ensemble import BalancedRandomForestClassifier

# Set random seeds for reproducibility (Python, NumPy, and TensorFlow)
import random
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# Set up plotting parameters
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")
