# 🔄 1. SVM-SMOTE (Support Vector Machine – Synthetic Minority Oversampling Technique)

## **Introduction**

SVM-SMOTE is an advanced version of SMOTE that uses a Support Vector Machine (SVM) to detect the decision boundary. Synthetic samples are then generated **only around the minority class instances that lie near the boundary**, helping the classifier to focus on difficult cases.

---

## **Algorithm Steps**

1. Train an SVM classifier on the dataset.
2. Identify the **support vectors** of the **minority class**.
3. For each support vector:

   - Find its **k-nearest neighbors** among the minority class.
   - Generate synthetic samples as:

     $$
     x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
     $$

     where:

     $$
     x_i = \text{support vector from the minority class}
     $$

     $$
     x_{nn} = \text{a randomly chosen one of the } k \text{ nearest minority-class neighbors of } x_i
     $$

     $$
     \lambda = \text{a random value drawn from the uniform distribution between 0 and 1}
     $$

4. Repeat the process until the desired number of synthetic samples is generated.
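The interpolation in steps 3 and 4 can be sketched as below. This is a minimal illustration, not a full SVM-SMOTE implementation: it assumes the minority-class support vectors have already been identified by a trained SVM (for example via the `support_` indices of a fitted scikit-learn `SVC`), and the function names `k_nearest` and `svm_smote_samples` are hypothetical.

```python
import math
import random

def k_nearest(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding the point itself."""
    others = [c for c in candidates if c != point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]

def svm_smote_samples(support_vectors, minority, k, n_new, seed=0):
    """Generate n_new synthetic points by interpolating from minority support vectors.

    Assumption: `support_vectors` were obtained beforehand from an SVM
    trained on the full dataset; no SVM is trained here.
    """
    rng = random.Random(seed)
    synthetic = []
    while len(synthetic) < n_new:                        # step 4: repeat until enough samples
        x_i = rng.choice(support_vectors)                # a minority-class support vector
        x_nn = rng.choice(k_nearest(x_i, minority, k))   # random one of its k minority neighbors
        lam = rng.random()                               # lambda ~ U(0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x_i, x_nn)))
    return synthetic

# Toy minority class; the first two points stand in for the support vectors.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
support_vectors = [(0.0, 0.0), (1.0, 0.0)]
new_points = svm_smote_samples(support_vectors, minority, k=2, n_new=10)
```

Because every synthetic point lies on a segment between a support vector and one of its near neighbors, the distant point `(5.0, 5.0)` never contributes in this toy example.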

---

## **Mathematical Formulation**

$$
x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
$$

Where:

$$
x_i = \text{support vector from the minority class}
$$

$$
x_{nn} = \text{a randomly chosen one of the } k \text{ nearest minority-class neighbors of } x_i
$$

$$
\lambda = \text{random value between 0 and 1}
$$
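
As a worked instance of the formula, assume a two-dimensional support vector $x_i = (1, 2)$, a selected neighbor $x_{nn} = (3, 6)$, and a drawn value $\lambda = 0.25$ (all three values are made up for illustration):

$$
x_{\text{new}} = (1, 2) + 0.25 \cdot \big( (3, 6) - (1, 2) \big) = (1, 2) + (0.5, 1) = (1.5, 3)
$$

The new point lies on the line segment between $x_i$ and $x_{nn}$, a quarter of the way from $x_i$.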

---

## **Advantages**

- Targets borderline samples, which are critical for classification.
- Helps improve performance on imbalanced datasets.
- Avoids oversampling safe or redundant regions.

---

## **Limitations**

- Requires SVM training, which can be computationally expensive.
- Sensitive to SVM hyperparameters (e.g., kernel, regularization parameter $C$).
- Might not perform well if the SVM does not find a reliable decision boundary.

---

# 🔁 2. SMOTE-ENN (SMOTE + Edited Nearest Neighbors)

## **Introduction**

SMOTE-ENN is a **hybrid technique** that combines:

- **SMOTE** to oversample the minority class with synthetic samples.
- **ENN (Edited Nearest Neighbors)** to remove noisy or misclassified samples based on nearest neighbor disagreement.

The combination leads to a **balanced and cleaner dataset**, improving classification performance.

---

## **Algorithm Steps**

1. Generate synthetic samples using SMOTE:

   $$
   x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
   $$

   Where:

   $$
   x_i = \text{a minority class instance}
   $$

   $$
   x_{nn} = \text{a randomly chosen one of the } k \text{ nearest minority-class neighbors of } x_i
   $$

   $$
   \lambda = \text{random value between 0 and 1}
   $$

2. Apply ENN. For each sample \( x \) in the dataset:
   - Find the **k-nearest neighbors** of \( x \).
   - If the **majority of neighbors have a different class label**, remove \( x \).
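
The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper names (`nearest`, `smote`, `enn`), the toy data, and the choices `k=2` for SMOTE and `k=3` for ENN are all hypothetical, and the brute-force neighbor search is only suitable for tiny datasets.

```python
import math
import random

def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding idx itself."""
    others = (j for j in range(len(points)) if j != idx)
    return sorted(others, key=lambda j: math.dist(points[idx], points[j]))[:k]

def smote(minority, n_new, k, rng):
    """Step 1: interpolate between a minority point and a random one of its k neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        lam = rng.random()  # lambda ~ U(0, 1)
        synthetic.append(tuple(a + lam * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic

def enn(points, labels, k):
    """Step 2: drop every sample whose k nearest neighbors mostly carry a different label."""
    kept_points, kept_labels = [], []
    for i in range(len(points)):
        neighbors = nearest(i, points, k)
        differing = sum(labels[j] != labels[i] for j in neighbors)
        if differing <= len(neighbors) / 2:  # keep unless a strict majority disagrees
            kept_points.append(points[i])
            kept_labels.append(labels[i])
    return kept_points, kept_labels

# Toy data: label 0 is the majority class; (5.2, 5.2) is a majority point
# sitting inside the minority cluster, i.e. the kind of sample ENN targets.
majority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
            (0.5, 0.5), (2.0, 0.0), (5.2, 5.2)]
minority = [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

rng = random.Random(42)
synthetic = smote(minority, n_new=len(majority) - len(minority), k=2, rng=rng)

points = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
clean_points, clean_labels = enn(points, labels, k=3)
```

In this toy setup SMOTE first balances the classes at seven samples each, and ENN then removes the majority point embedded in the minority cluster while leaving the other samples in place.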

---

## **Mathematical Formulation**

**Step 1 – SMOTE:**

$$
x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)
$$

Where:

$$
x_i = \text{minority class instance}
$$

$$
x_{nn} = \text{a randomly chosen one of the } k \text{ nearest minority-class neighbors of } x_i
$$

$$
\lambda = \text{random value between 0 and 1}
$$

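**Step 2 – ENN:**

For each sample $x$ in the dataset with label $y$, let $N_k(x)$ be its $k$ nearest neighbors and $y_j$ the label of neighbor $x_j$. The sample $x$ is removed if the majority of its neighbors have a different class label:

$$
\sum_{x_j \in N_k(x)} \mathbb{1}[y_j \neq y] > \frac{k}{2}
$$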

---

## **Advantages**

- Balances the dataset while reducing noise.
- Eliminates mislabeled or borderline instances that confuse the classifier.
- Improves model performance by simplifying the class boundaries.

---

## **Limitations**

- Can remove valuable borderline points.
- Computationally more expensive than standalone SMOTE.
- Needs careful tuning of \( k \) in both SMOTE and ENN.

---




# 🔄 3. ENN (Edited Nearest Neighbors) with Distance-Weighted Voting

- After oversampling, the dataset may contain **noisy or overlapping instances**. The ENN step removes samples whose class label **disagrees with the majority of their neighbors**.
- Instead of simple majority voting, this approach uses **distance-weighted voting**, so that each neighbor's contribution depends on its proximity to the point.

---

## Weighting Strategies for ENN

This version of ENN supports multiple **distance-based weighting mechanisms**, allowing for more nuanced decisions during noise removal.

| **Strategy**         | **Mathematical Formula** | **Description** |
|----------------------|--------------------------|------------------|
| **Inverse Distance** | $w = \frac{1}{d + \varepsilon}$ | Closer neighbors get more weight. The small constant $\varepsilon$ avoids division by zero. |
| **Gaussian**         | $w = \exp\left( -\frac{d^2}{2\sigma^2} \right)$ | Weights decrease smoothly with squared distance. $\sigma$ controls the spread. |
| **Exponential**      | $w = \exp(-\alpha d)$ | Weights decrease exponentially as distance increases. $\alpha$ controls the decay rate. |
| **Rank-based**       | $w = \frac{1}{\text{rank} + 1}$ | Uses neighbor rank instead of distance. The nearest neighbor gets the highest weight (weight 1 if ranks start at 0). |
| **Adaptive Power**   | $w = \frac{1}{d^\beta + \varepsilon}$ | Decay is controlled by $\beta$: larger values give sharper weighting, smaller values smoother weighting. |
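
The formulas in the table translate directly into small functions. The sketch below is illustrative only: the function names and the default values for `eps`, `sigma`, `alpha`, and `beta` are assumptions, not values prescribed by the text.

```python
import math

def inverse_distance(d, eps=1e-8):
    """w = 1 / (d + eps); eps guards against division by zero."""
    return 1.0 / (d + eps)

def gaussian(d, sigma=1.0):
    """w = exp(-d^2 / (2 sigma^2)); sigma controls the spread."""
    return math.exp(-d ** 2 / (2 * sigma ** 2))

def exponential(d, alpha=1.0):
    """w = exp(-alpha d); alpha controls the decay rate."""
    return math.exp(-alpha * d)

def rank_based(rank):
    """w = 1 / (rank + 1); rank is assumed to start at 0 for the nearest neighbor."""
    return 1.0 / (rank + 1)

def adaptive_power(d, beta=2.0, eps=1e-8):
    """w = 1 / (d^beta + eps); larger beta gives sharper decay."""
    return 1.0 / (d ** beta + eps)

# Compare how quickly each distance-based strategy decays.
distances = [0.5, 1.0, 2.0]
for fn in (inverse_distance, gaussian, exponential, adaptive_power):
    print(fn.__name__, [round(fn(d), 4) for d in distances])
```

All four distance-based functions decrease as the distance grows; they differ only in how fast the weight falls off.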

---

## Decision Rule (ENN Cleaning)

For each sample in the dataset:
1. Find its \( k \)-nearest neighbors.
2. Assign weights to neighbor labels using one of the selected strategies.
3. Predict the sample's class label via **weighted majority voting**.
4. **Remove the sample** from the dataset if its true label **does not match** the predicted label.

This process effectively removes:
- Ambiguous samples near the decision boundary,
- Potential label noise,
- Overlapping samples caused by SMOTE.
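
The decision rule can be sketched as a single function that accepts any distance-based weighting. This is a minimal illustration under stated assumptions: `weighted_enn` and the toy data are hypothetical, neighbors are found by brute force, and only distance-based weights (not the rank-based variant) are handled.

```python
import math
from collections import defaultdict

def weighted_enn(points, labels, k, weight_fn):
    """Return indices of samples whose label matches the weighted vote of their k neighbors."""
    keep = []
    for i, p in enumerate(points):
        # 1. Find the k nearest neighbors (excluding the sample itself).
        others = (j for j in range(len(points)) if j != i)
        neighbors = sorted(others, key=lambda j: math.dist(p, points[j]))[:k]
        # 2. Accumulate a weight per class label.
        score = defaultdict(float)
        for j in neighbors:
            score[labels[j]] += weight_fn(math.dist(p, points[j]))
        # 3. Predict by weighted majority; 4. keep only if it matches the true label.
        predicted = max(score, key=score.get)
        if predicted == labels[i]:
            keep.append(i)
    return keep

# Two tight class-1 points next to three class-0 points.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.0), (1.2, 0.0)]
labels = [1, 1, 0, 0, 0]

kept_weighted = weighted_enn(points, labels, k=3,
                             weight_fn=lambda d: 1.0 / (d + 1e-8))  # inverse distance
kept_uniform = weighted_enn(points, labels, k=3,
                            weight_fn=lambda d: 1.0)                # plain majority vote
```

With uniform weights, the two class-1 points are outvoted by two farther class-0 neighbors and get removed. With inverse-distance weights, their very close same-class neighbor dominates the vote, so all five samples are kept.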

---

## Benefits of This Approach

- More **robust noise removal** compared to standard ENN.
- Customizable weighting provides **flexibility** for different datasets.
- Reduces the risk of **overfitting** on synthetic or noisy examples.

---

## Considerations

- Requires **tuning** of additional hyperparameters (e.g., `sigma`, `alpha`, `beta`) for best results.
- Computationally more **intensive** due to extra neighbor weighting.
- Risk of **over-cleaning** the dataset if ENN is too aggressive.

---

## Summary

This SMOTE-ENN variant blends **synthetic oversampling** with **neighbor-based undersampling**. By using **distance-weighted voting**, it removes misleading data points more selectively than plain majority voting, which makes it a flexible tool for tackling **class imbalance problems** in real-world datasets.

---

# 🔄 4. SMOTified-GAN for the Class Imbalance Problem

## **Introduction**

SMOTified-GAN is a **combination** of **SMOTE** (Synthetic Minority Oversampling Technique) and **GAN** (Generative Adversarial Networks) that aims to overcome the deficiencies of both methods. The process starts by using **SMOTE** to generate synthetic minority class samples and then refines these samples using **GAN**. This results in **more diverse** and **realistic** minority class samples.

---

## **Mathematical Formulation**

### **SMOTE Step**

SMOTE generates synthetic samples by **interpolating** between nearest neighbors of the minority class samples. Given a minority class data point $x_i$ and its **k-nearest neighbors** $\{x_1, x_2, \dots, x_k\}$, the synthetic sample is generated by interpolation:

$$
x_{\text{synthetic}} = x_i + \lambda (x_j - x_i), \quad j \in \{1, 2, \dots, k\}, \quad \lambda \sim \mathcal{U}(0, 1)
$$

Where:
- $x_i$ = an original sample from the minority class,
- $x_j$ = one of the k-nearest neighbors of $x_i$,
- $\lambda$ = random scalar chosen from a uniform distribution.

### **GAN Step**

The **GAN** consists of two networks: the **Generator ($G$)** and the **Discriminator ($D$)**. The goal of the **Generator** is to produce synthetic data $x_{\text{fake}}$ that cannot be distinguished from real data by the **Discriminator**.

- **Generator Output:**

$$
x_{\text{fake}} = G(z), \quad z \sim P_z
$$

Where:
- $z$ = random noise vector drawn from a prior distribution $P_z$ (e.g., uniform or Gaussian),
- $x_{\text{fake}}$ = synthetic data sample generated by the Generator.

- **Discriminator Objective:**

$$
\mathcal{L}_D = \frac{1}{2} \left[ \mathbb{E}_{x \sim P_{\text{real}}} \left[ (D(x) - 1)^2 \right] + \mathbb{E}_{z \sim P_z} \left[ (D(G(z)))^2 \right] \right]
$$

Where:
- $D(x)$ = discriminator prediction for real data $x$,
- $D(G(z))$ = discriminator prediction for generated data $x_{\text{fake}}$.

- **Generator Objective:**

$$
\mathcal{L}_G = \mathbb{E}_{z \sim P_z} \left[ (D(G(z)) - 1)^2 \right]
$$

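The Generator minimizes $\mathcal{L}_G$, which pushes $D(G(z))$ toward 1 so that the Discriminator classifies generated data as real. The Discriminator minimizes $\mathcal{L}_D$, which pushes $D(x)$ toward 1 on real data and $D(G(z))$ toward 0 on generated data. Both objectives here are written in least-squares form.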
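
To make the two least-squares objectives concrete, the sketch below evaluates them on a handful of discriminator outputs, replacing each expectation with a sample mean. The function names and the example outputs are hypothetical; no network is trained here.

```python
def lsgan_discriminator_loss(d_real, d_fake):
    """L_D = 1/2 * ( mean((D(x) - 1)^2) + mean(D(G(z))^2) )."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(d_fake):
    """L_G = mean((D(G(z)) - 1)^2)."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)

# Made-up discriminator outputs for two real and two generated samples.
d_real = [1.0, 0.5]   # D(x) on real data
d_fake = [0.0, 0.5]   # D(G(z)) on generated data

loss_d = lsgan_discriminator_loss(d_real, d_fake)  # 0.5 * (0.125 + 0.125) = 0.125
loss_g = lsgan_generator_loss(d_fake)              # (1.0 + 0.25) / 2 = 0.625
```

The generator loss is large here because the discriminator scores the generated samples far from 1; it would drop to zero if every `D(G(z))` equaled 1.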

---

### **SMOTified-GAN Formulation**

**SMOTified-GAN** enhances the GAN architecture by replacing the random noise $z$ in the generator with the synthetic samples generated by **SMOTE**. This allows the GAN to start with **realistic synthetic samples** instead of random noise, improving the quality of generated samples.

1. **Discriminator Score** (maximized by the Discriminator):

$$
\mathcal{L}_D = \mathbb{E}_{x^* \sim P_{\text{real}}} \left[ \log D(x^*) \right] + \mathbb{E}_{u \sim P_{\text{SMOTE}}} \left[ \log(1 - D(G(u))) \right]
$$

Where:
- $x^*$ = real minority class samples,
- $u$ = oversampled synthetic minority class data from **SMOTE**.

2. **Generator Score** (minimized by the Generator):

$$
\mathcal{L}_G = - \mathbb{E}_{u \sim P_{\text{SMOTE}}} \left[ \log D(G(u)) \right]
$$

Where:
- $u$ = over-sampled data from **SMOTE** used as input to the **Generator**,
- $G(u)$ = synthetic data generated by the Generator.
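
The two scores can be evaluated numerically once the discriminator outputs are known. The sketch below is a minimal illustration, assuming the expectations are estimated by sample means; the function names and the example probabilities are hypothetical, and the Generator and Discriminator networks themselves are not modeled.

```python
import math

def discriminator_score(d_real, d_gen):
    """mean(log D(x*)) + mean(log(1 - D(G(u)))); the Discriminator maximizes this."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    gen_term = sum(math.log(1.0 - d) for d in d_gen) / len(d_gen)
    return real_term + gen_term

def generator_score(d_gen):
    """-mean(log D(G(u))); the Generator minimizes this."""
    return -sum(math.log(d) for d in d_gen) / len(d_gen)

# Made-up discriminator outputs:
# d_real = D(x*) on real minority samples,
# d_gen  = D(G(u)) on generator outputs whose inputs u came from SMOTE.
undecided = discriminator_score(d_real=[0.5, 0.5], d_gen=[0.5, 0.5])
confident = discriminator_score(d_real=[0.9, 0.9], d_gen=[0.1, 0.1])

fooled = generator_score(d_gen=[0.9, 0.9])      # discriminator believes G(u) is real
not_fooled = generator_score(d_gen=[0.1, 0.1])  # discriminator rejects G(u)
```

A discriminator that separates real from generated samples well gets a higher score (`confident > undecided`), while a generator whose outputs are accepted as real gets a lower score (`fooled < not_fooled`).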

---

## **Summary**

**SMOTified-GAN** is an advanced technique for addressing **class imbalance** by combining the strengths of **SMOTE** and **GAN**:
- **SMOTE** generates initial synthetic minority samples using interpolation.
- **GAN** refines these samples, making them more diverse and realistic by training the generator to fool the discriminator.

This synergy results in high-quality synthetic samples that can be used to balance datasets and improve model performance.

---


https://medium.com/totalenergies-digital-factory/imbalanced-data-ml-smote-and-its-variants-c69a4b32f7e7