Imbalanced Classes

1) Undersampling Methods
1. **Undersampling Majority** - undersampling is the process of **randomly discarding <u>majority</u> class observations** to balance the training sample
    ![undersampling](undersampling.png)
    * (+) Reduces runtime on very large datasets
    * (-) Discards potentially **important observations**
    * (-) Increases **bias**
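
A minimal sketch of random undersampling, assuming the data is held in plain Python lists and that the goal is a 50/50 class balance. The function name `random_undersample` and its parameters are illustrative, not taken from any library.

```python
import random

def random_undersample(X, y, majority_label, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    minority_idx = [i for i, label in enumerate(y) if label != majority_label]
    # Keep a random subset of the majority class, the same size as the minority class.
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, k=n_keep)
    keep = sorted(kept_majority + minority_idx)  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]
```

Every discarded row is lost to the model, which is the "discards potentially important observations" cost listed above.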

2) Oversampling Methods
1. **Oversampling Minority** - oversampling is the process of **replicating observations from the <u>minority</u> class** to balance the training sample
    ![oversampling](oversampling.png)
    * (+) Doesn't discard information
    * (-) Increases **variance** (more likely to overfit)
    * (-) SMOTE is usually a better oversampling technique
2. **Synthetic Minority Oversampling Technique (SMOTE)** - randomly generate new observations for the **minority class** by **interpolating between a minority observation and one of its $k$ nearest neighbors (kNN)**
    ![smote1](smote_1.png)
    ![smote2](smote_2.png)
    ![smote3](smote_3.png)
    ![smote4](smote_4.png)
    ![smote5](smote_5.png)
    * **SMOTE Steps**:
        1. For **each minority class observation**, **randomly select** one of its $k$ nearest neighbors
        2. For **each feature**, **randomly** draw a weight $w$ between 0 and 1
        3. Set the new feature value to $w$ times the observation's value plus $(1-w)$ times the nearest neighbor's value
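    * **SMOTE sketch (Python)**:
    ```python
    import random

    def knn(obs, observations, k):
        """Return the k observations closest to obs (squared Euclidean distance)."""
        others = [o for o in observations if o is not obs]
        return sorted(others, key=lambda o: sum((obs[f] - o[f]) ** 2 for f in obs))[:k]

    def smote(minority_observations, target, k=5):
        """Create synthetic observations (feature -> value dicts) until the class reaches target."""
        synthetic_observations = []
        while len(synthetic_observations) + len(minority_observations) < target:
            obs = random.choice(minority_observations)
            neighbor = random.choice(knn(obs, minority_observations, k))  # randomly selected neighbor
            new_observation = {}
            for feature in obs:
                weight = random.random()  # random float in [0, 1)
                new_observation[feature] = (weight * obs[feature]
                                            + (1 - weight) * neighbor[feature])
            synthetic_observations.append(new_observation)
        return synthetic_observations
    ```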
3. **Random Oversampling Examples (ROSE)** - **estimate a kernel density** around an observation, then **draw from that probability distribution** to get a synthetic observation
    ![rose](rose.png)
4. Other Techniques:
    ![other_tech_sampling](other_tech_sampling.gif)
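
The two simplest oversamplers above can be sketched side by side. This is a rough illustration, not the reference ROSE implementation: it assumes numeric features stored as lists, a Gaussian kernel, and a single fixed `bandwidth` chosen by hand (the names `random_oversample`, `rose_oversample`, and `bandwidth` are hypothetical).

```python
import random

def random_oversample(minority, target, seed=0):
    """Replicate minority rows (sampling with replacement) until there are `target` rows."""
    rng = random.Random(seed)
    n_extra = max(0, target - len(minority))
    extra = [list(rng.choice(minority)) for _ in range(n_extra)]
    return [list(row) for row in minority] + extra

def rose_oversample(minority, target, bandwidth=0.1, seed=0):
    """Smoothed bootstrap: pick a minority row, then draw from a Gaussian kernel centred on it."""
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target:
        center = rng.choice(minority)
        # Assumption: the same bandwidth (standard deviation) for every feature.
        synthetic.append([rng.gauss(value, bandwidth) for value in center])
    return synthetic
```

Plain replication returns exact copies of existing rows, while the kernel-based draw returns new points scattered around them; a larger `bandwidth` spreads the synthetic points further from the originals.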

3) Sampling Technique Intuition
* What's the right amount of oversampling/undersampling?
    * If you know the cost-benefit matrix, then pick the target class proportion that **maximizes the profit curve** to **determine the right amount of sampling**
    * If you **don't** know the cost-benefit matrix, there is no clear answer; however, comparing the **ROC AUC** across candidate proportions can help determine the right amount of sampling
* Cost Sensitivity vs. Sampling:
    * Neither is strictly superior
    * Generally, **oversampling tends to work better** than undersampling **on small datasets**
    * Some algorithms don't have an obvious cost-sensitive adaptation, thus requiring sampling
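
When the cost-benefit matrix is known, choosing the amount of sampling can be sketched as follows. This assumes a model has already been trained at each candidate target proportion and scored on an untouched validation set; the helper names and every number below are made up purely for illustration.

```python
def expected_profit(confusion, cost_benefit):
    """Average profit per observation: sum of count * payoff over the four outcomes."""
    total = sum(confusion.values())
    return sum(confusion[k] * cost_benefit[k] for k in confusion) / total

def best_proportion(confusions_by_proportion, cost_benefit):
    """Return the target minority proportion whose confusion matrix earns the most."""
    return max(
        confusions_by_proportion,
        key=lambda p: expected_profit(confusions_by_proportion[p], cost_benefit),
    )

# Hypothetical payoffs per outcome (not from any real problem).
cost_benefit = {"tp": 10.0, "fp": -2.0, "fn": -5.0, "tn": 0.0}

# Hypothetical validation confusion matrices, keyed by target minority proportion.
confusions = {
    0.1: {"tp": 20, "fp": 10, "fn": 30, "tn": 940},
    0.3: {"tp": 35, "fp": 60, "fn": 15, "tn": 890},
    0.5: {"tp": 45, "fp": 200, "fn": 5, "tn": 750},
}

best = best_proportion(confusions, cost_benefit)
```

With these toy numbers the middle proportion wins: heavier resampling catches more positives but pays for it in false positives. If the payoffs are unknown, the same loop could rank proportions by ROC AUC instead.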