# 1. Introduction
## Understanding Imbalanced Classification and Proposed Methodologies

In real-world classification problems, class imbalance is a frequent challenge: one class (typically the "positive" or event class) is significantly underrepresented compared to the other. Models trained on such data tend to favor the majority class and miss minority class instances. Data-level approaches, such as oversampling and undersampling, address this by modifying the training data distribution without changing the model. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples to boost the minority class, while Edited Nearest Neighbors (ENN) removes noisy or ambiguous examples. Hybrid solutions, such as SMOTEENN, combine oversampling with neighbor-based cleaning to improve class balance while preserving data quality. Applied to the training data before model fitting, these strategies mainly aim to raise minority class recall and F1-score, sometimes at some cost in precision.






The primary difficulties in learning from imbalanced data include:

- **Biased predictions** toward the majority class
- **Misleading accuracy metrics** that overlook minority class performance
- **Poor generalization** to unseen data, particularly for minority instances
- **Increased false negatives**, where the model fails to identify critical minority class cases

Since accuracy is not a reliable metric in imbalanced settings, alternative evaluation metrics such as
**precision**, **recall**, **F1-score**, and the **area under the precision-recall curve (PR AUC)** are preferred.
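
To make the point concrete, here is a minimal sketch on made-up labels (the toy data and the helper names `confusion_counts` and `summarize` are assumptions for illustration, not results from these experiments). A classifier that always predicts the majority class scores high on accuracy while its recall is zero. PR AUC is left out because it needs predicted scores rather than hard labels.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
# 5 false alarms among the negatives, then 3 hits and 2 misses among the positives.
catches_some = [0] * 90 + [1] * 5 + [1] * 3 + [0] * 2

baseline = summarize(y_true, always_negative)
improved = summarize(y_true, catches_some)
print(baseline)
print(improved)
```

The second classifier has lower accuracy than the first but is the only one that finds any positive cases, which is why recall and F1 are the more informative numbers here.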




# 2. Problem and Methodologies

To address these challenges, the literature suggests a taxonomy of solutions including:
### 1. Data-Level Methods
These techniques adjust the training data distribution to better represent all classes. Common approaches include:

- **Oversampling** the minority class (e.g., **SMOTE**, **Borderline-SMOTE**, **ADASYN**)
- **Undersampling** the majority class (e.g., random or **cluster-based undersampling**)
- **Combined resampling** strategies that use both oversampling and undersampling

Data-level methods are advantageous as they are model-agnostic and can be used independently of the classifier.
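
As an illustration of how simple a data-level method can be, the sketch below implements random undersampling with NumPy. The function name `random_undersample`, its parameters, and the toy data are assumptions made for this example.

```python
import numpy as np


def random_undersample(X, y, majority_label, n_keep, seed=0):
    """Keep all non-majority rows plus a random subset of n_keep majority rows."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    kept_majority = rng.choice(majority_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([other_idx, kept_majority]))
    return X[keep], y[keep]


# Hypothetical toy data: 90 majority (0) and 10 minority (1) rows.
X = np.random.default_rng(42).normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = random_undersample(X, y, majority_label=0, n_keep=10)
print(np.bincount(y), "->", np.bincount(y_res))
```

Because the function only touches the training arrays, any classifier can be fitted on `X_res, y_res` afterwards, which is what makes the approach model-agnostic.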

### 2. Hybrid Methods
Hybrid techniques combine data-level preprocessing with algorithm-level improvements to harness the strengths of both. For example:

- **SMOTE + Cost-Sensitive Learning**
- **SMOTEBoost** and **RUSBoost**, which integrate data balancing with boosting algorithms
- **Ensemble learning** combined with hybrid sampling techniques

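These approaches aim to improve model robustness and reduce overfitting while retaining as much useful information as possible. SMOTEBoost and RUSBoost, for instance, resample the training data at every boosting round rather than once up front.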


---

## Basic SMOTE Explained

![SMOTE Technique Explained](smote.jpeg)


# 3. Implemented Techniques


### 🔹 1. SMOTE (Synthetic Minority Oversampling Technique)

**Concept:**  
SMOTE is a powerful oversampling technique that generates synthetic samples for the minority class rather than duplicating existing ones. It interpolates between a sample and its nearest neighbors in the feature space.

**How it works:**
1. For each minority class sample, SMOTE identifies its *k nearest neighbors* within the minority class (typically `k=5`).
2. A neighbor is selected randomly, and a synthetic sample is created on the line segment joining the original sample and the neighbor.
3. This synthetic point is added to the dataset.
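
The three steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not the code used in these experiments: the function name `smote`, the brute-force distance computation, and the choice to draw base samples at random (rather than looping over every minority sample) are assumptions.

```python
import numpy as np


def smote(X_min, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)
    # Step 1: brute-force pairwise distances inside the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Step 2: pick a base sample and one of its k neighbors at random.
    base = rng.integers(0, n, size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]

    # Step 3: place the new point somewhere on the connecting line segment.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])


# Hypothetical toy minority class: 10 points in 2-D.
X_min = np.random.default_rng(1).normal(size=(10, 2))
X_syn = smote(X_min, n_synthetic=40)
print(X_syn.shape)
```

Every synthetic point is a convex combination of two real minority points, so it always lies between them in feature space.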

**Why it's useful:**  
It increases diversity among the minority class samples and helps the classifier better learn the minority class boundaries. However, by itself, SMOTE may introduce overlapping or noisy samples near the majority class.

---

### 🔹 2. SVM-SMOTE (Support Vector Machine–Based SMOTE)

**Concept:**  
SVM-SMOTE is an enhancement over classical SMOTE. It uses a trained Support Vector Machine (SVM) classifier to **identify the support vectors** — data points near the decision boundary — and focuses sample generation near these critical regions.

**How it works:**
1. An SVM is trained to find the boundary between classes.
2. Support vectors from the minority class are selected.
3. Synthetic samples are generated near these support vectors.
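
A rough sketch of these steps, assuming scikit-learn is available, is shown below. The function name `svm_smote_sketch`, the RBF kernel, and the toy data are assumptions. Published SVM-SMOTE variants add further rules (for example, treating dense and sparse boundary regions differently), which this simplified version omits.

```python
import numpy as np
from sklearn.svm import SVC


def svm_smote_sketch(X, y, minority_label, n_synthetic, k=5, seed=0):
    """Interpolate new minority points starting from minority support vectors."""
    rng = np.random.default_rng(seed)
    # Step 1: fit an SVM to locate the boundary between the classes.
    clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
    # Step 2: keep only the support vectors that belong to the minority class.
    sv_idx = clf.support_[y[clf.support_] == minority_label]
    X_sv = X[sv_idx]
    X_min = X[y == minority_label]
    k = min(k, len(X_min) - 1)
    # Step 3: interpolate from a support vector toward one of its
    # k nearest minority-class neighbors.
    dists = np.linalg.norm(X_sv[:, None, :] - X_min[None, :, :], axis=2)
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]  # column 0 is the point itself
    base = rng.integers(0, len(X_sv), size=n_synthetic)
    chosen = neighbors[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))
    return X_sv[base] + gap * (X_min[chosen] - X_sv[base])


# Hypothetical toy data: two overlapping Gaussian blobs, 90 vs. 10 samples.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(90, 2)), rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_syn = svm_smote_sketch(X, y, minority_label=1, n_synthetic=30)
print(X_syn.shape)
```

Compared with plain SMOTE, the only change is the pool of base samples: synthetic points start from minority support vectors instead of from arbitrary minority samples.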

**Why it's useful:**  
By concentrating synthetic data near the decision boundary, SVM-SMOTE improves the **discriminative power** of the model and reduces the risk of overgeneralization from too many easy examples.

---

### 🔹 3. Distance-Weighted Edited Nearest Neighbors (ENN)

**Concept:**  
ENN is a data cleaning method that removes noisy or ambiguous samples, particularly from the majority class, by evaluating how well a point agrees with its neighbors.

**Enhancement with Distance Weights:**  
I implemented **distance-weighted ENN** with various strategies for weighting the influence of neighbors:
- **Inverse distance:** closer neighbors have higher weight.
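- **Gaussian decay:** weight decays exponentially with the squared distance.
- **Exponential decay:** weight decays exponentially with the distance itself, which drops faster than Gaussian at short range but leaves a heavier tail.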
- **Rank-based:** closer ranks get more importance.
- **Adaptive power decay:** tunable control over influence using a beta parameter.
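
The sketch below shows one plausible form for each weighting scheme and how the weights feed into a neighbor vote. The exact formulas, the parameter names `sigma` and `beta`, and the function names are assumptions; the list above names the schemes but does not fix their definitions.

```python
import numpy as np


def neighbor_weights(distances, scheme="inverse", sigma=1.0, beta=2.0, eps=1e-12):
    """Map neighbor distances (sorted, nearest first) to voting weights."""
    d = np.asarray(distances, dtype=float)
    if scheme == "inverse":
        return 1.0 / (d + eps)
    if scheme == "gaussian":
        return np.exp(-d**2 / (2 * sigma**2))
    if scheme == "exponential":
        return np.exp(-d / sigma)
    if scheme == "rank":
        return 1.0 / np.arange(1, len(d) + 1)  # depends on rank only
    if scheme == "power":
        return 1.0 / (d**beta + eps)  # beta tunes how fast influence fades
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_vote(distances, labels, scheme="inverse", **kwargs):
    """Return the label with the largest total neighbor weight."""
    w = neighbor_weights(distances, scheme, **kwargs)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    totals = np.array([w[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]


# Hypothetical neighborhood of a majority-class (0) sample:
# one close majority neighbor and two distant minority (1) neighbors.
distances = [0.1, 2.0, 2.5]
labels = [0, 1, 1]

weighted = weighted_vote(distances, labels, scheme="inverse")
unweighted = np.bincount(labels).argmax()
print("weighted vote:", weighted, "| plain majority vote:", unweighted)
```

In this toy neighborhood the plain vote disagrees with the sample's own label, so standard ENN would remove it, while the weighted vote sides with the single close neighbor and keeps it.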

**Why it's useful:**  
These weighting schemes allow more nuanced filtering. For example, if a majority class sample has minority class neighbors that are relatively far away, their votes are down-weighted, so the sample is not removed on the strength of distant neighbors alone.

---

### 🔹 4. SMOTEENN with Weighted Voting

**Concept:**  
This is a **hybrid resampling pipeline** combining oversampling (SMOTE) and undersampling (ENN).

**Steps:**
1. Use SMOTE to generate new minority samples.
2. Apply distance-weighted ENN to remove noisy or misclassified samples, mainly from the majority class.
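
The two steps can be chained as in the sketch below. It is a simplified stand-in for the actual pipeline: the function names, the inverse-distance weighting, the choice to balance the classes exactly, and the fact that the cleaning step may drop samples from either class are all assumptions.

```python
import numpy as np


def pairwise(A, B):
    """Euclidean distance between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)


def smote(X_min, n_new, k, rng):
    """Interpolate n_new points between minority samples and their neighbors."""
    d = pairwise(X_min, X_min)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    return X_min[base] + rng.random((n_new, 1)) * (X_min[chosen] - X_min[base])


def weighted_enn(X, y, k=3, eps=1e-12):
    """Drop each sample whose inverse-distance-weighted neighbor vote
    disagrees with its own label."""
    d = pairwise(X, X)
    np.fill_diagonal(d, np.inf)
    idx = np.argsort(d, axis=1)[:, :k]
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)
    classes = np.unique(y)
    # votes[i, j] = total weight of sample i's neighbors carrying classes[j]
    votes = np.stack([(w * (y[idx] == c)).sum(axis=1) for c in classes], axis=1)
    keep = classes[votes.argmax(axis=1)] == y
    return X[keep], y[keep]


def smote_enn(X, y, minority_label=1, k_smote=5, k_enn=3, seed=0):
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    n_new = int((y != minority_label).sum() - len(X_min))  # balance the classes
    # Step 1: oversample the minority class.
    X_syn = smote(X_min, n_new, min(k_smote, len(X_min) - 1), rng)
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([y, np.full(n_new, minority_label)])
    # Step 2: clean the enlarged dataset.
    return weighted_enn(X_all, y_all, k=k_enn)


# Hypothetical toy data: 90 majority (0) and 10 minority (1) points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(90, 2)), rng.normal(3, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

X_res, y_res = smote_enn(X, y)
print(np.bincount(y), "->", np.bincount(y_res))
```

The order matters: cleaning after oversampling lets ENN also remove synthetic points that SMOTE placed inside majority territory.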

**Why it's useful:**  
This two-step process balances the dataset while **retaining only clean and informative samples**, improving the decision boundary and reducing overfitting caused by noisy or borderline examples.

---

### 🔹 5. SMOTified-GAN (SMOTE + GAN)

**Concept:**  
To go beyond traditional synthetic sample generation, I developed a **SMOTified-GAN**, a type of Generative Adversarial Network (GAN) that creates realistic and diverse samples for the minority class by learning its data distribution.

**How it works:**
1. A generator network learns to create synthetic data conditioned on the minority class.
2. A discriminator network tries to distinguish between real and fake data.
3. The generator improves over time to fool the discriminator, producing high-quality samples.
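
For reference, the adversarial game in steps 2 and 3 is usually written as the standard GAN minimax objective below. The symbols are assumptions introduced here: $p_{\min}$ is the distribution of real minority samples and $p_z$ is the distribution of the generator's input. In a SMOTE-seeded variant, $z$ would be drawn from SMOTE-interpolated samples instead of random noise; whether that matches the implementation used here is an assumption suggested by the name.

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\min}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\bigl(1 - D(G(z))\bigr)\right]
$$

The discriminator $D$ maximizes this value by scoring real samples high and generated ones low, while the generator $G$ minimizes it by producing samples that $D$ scores as real.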

**Why it's useful:**  
Unlike SMOTE, which relies on linear interpolation, **SMOTified-GAN can learn complex nonlinear data distributions**, resulting in synthetic samples that better resemble real data. This boosts the model’s ability to generalize to unseen minority class examples.

---

### ✅ Summary of Advantages

| Method             | Type         | Key Strengths |
|--------------------|--------------|----------------|
| SMOTE              | Oversampling | Simple and effective; increases minority presence |
| SVM-SMOTE          | Oversampling | Focuses on decision boundary; better for hard examples |
| Distance-weighted ENN | Undersampling | Flexible cleaning with smart neighbor voting |
| SMOTEENN           | Hybrid       | Balances dataset + denoises it |
| SMOTified-GAN      | Oversampling | Realistic and diverse synthetic samples |

---

By integrating these **data-level and hybrid solutions**, I tackled the class imbalance problem from multiple perspectives — improving not just quantity but also the **quality and diversity** of training data. This led to **more robust, generalizable, and fair classifiers** across my experiments.

---

