# Unbalanced and Balanced Datasets

For example, in fraud detection, fraudulent transactions are rare, so a model is likely to be trained on an imbalanced dataset with few fraud transactions and many more non-fraud transactions. Such a model may achieve high accuracy in training while simply predicting non-fraud every time.

A dataset is balanced when the positive and negative classes contain roughly the same number of samples; a large difference in class sizes indicates an imbalanced dataset.
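To make the fraud example concrete, the sketch below scores a degenerate "model" that always answers non-fraud. The class counts (10 fraud cases in 1,000 transactions) are made up for illustration.

```python
# Hypothetical data: 1 = fraud, 0 = non-fraud (10 fraud cases out of 1,000).
y_true = [1] * 10 + [0] * 990

# A degenerate "model" that predicts non-fraud for every transaction.
y_pred = [0] * len(y_true)

# Accuracy: share of transactions labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: share of real fraud cases that were caught.
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = caught / sum(y_true)

print(f"accuracy = {accuracy:.2%}, fraud recall = {fraud_recall:.2%}")
```

Accuracy comes out at 99% even though not a single fraud case is detected, which is why accuracy alone is a poor guide on imbalanced data.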

## Techniques to Convert Imbalanced Dataset into Balanced Dataset


There are two main strategies for handling the class imbalance problem: data-level and algorithm-level techniques. Data-level techniques involve modifying the dataset to balance the classes, while algorithm-level techniques modify the learning algorithm to handle the imbalance.

- Data-level techniques include undersampling, oversampling, and hybrid approaches. Undersampling involves randomly removing instances from the majority class to balance the dataset. Oversampling involves replicating instances from the minority class to balance the dataset. Hybrid approaches combine both undersampling and oversampling techniques.


- Algorithm-level techniques include cost-sensitive learning, threshold-moving, and ensemble learning. Cost-sensitive learning involves assigning different misclassification costs to different classes. Threshold-moving involves adjusting the decision threshold to favor the minority class. Ensemble learning involves combining multiple classifiers to improve the overall performance.
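Threshold-moving is the easiest of these to show in a few lines. The sketch below assumes some already-trained model has produced minority-class probabilities; the function name `predict_with_threshold`, the probabilities, and the threshold of 0.30 are all illustrative assumptions.

```python
def predict_with_threshold(probabilities, threshold=0.5):
    """Label an example positive (1) when its predicted probability
    of belonging to the minority class is at least `threshold`."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical minority-class probabilities from an already-trained model.
probs = [0.05, 0.20, 0.35, 0.45, 0.60, 0.90]

default_labels = predict_with_threshold(probs)        # usual 0.5 cut-off
shifted_labels = predict_with_threshold(probs, 0.30)  # favors the minority class

print(default_labels)
print(shifted_labels)
```

Lowering the threshold turns more borderline cases into minority predictions, trading extra false positives for fewer false negatives.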

### 1. Use the right evaluation metrics:

Because accuracy can be misleading on imbalanced data, use evaluation metrics such as:

- Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
- Precision: the number of true positives divided by all positive predictions. Precision is also called Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high number of false positives.
- Recall: the number of true positives divided by the number of positive values in the test data. Recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness. Low recall indicates a high number of false negatives.
- F1-Score: the harmonic mean of precision and recall.
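The definitions above can be sketched in plain Python as follows. The function names and the toy labels are assumptions for illustration; returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    # Precision: true positives over all positive predictions.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: true positives over all actual positives.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(confusion_counts(y_true, y_pred))  # (tp, fp, fn, tn)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

On these toy labels there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.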

### 2. Oversampling / Upsampling:

When there is too little data for the rare class, oversampling balances the dataset by increasing the number of rare samples. **Over-sampling increases the number of minority class members in the training set.**
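As a minimal sketch of the simplest variant, random duplication of minority examples, the function below resamples with replacement until both classes have the same size. The name `random_oversample`, the binary-label assumption, and the toy data are illustrative only.

```python
import random

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until the classes are equal in size.

    Assumes a binary problem with at least one minority example.
    """
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority_idx)
    n_extra = max(0, n_majority - len(minority_idx))
    # Sample minority indices with replacement to fill the gap.
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(y))) + extra_idx
    return [X[i] for i in keep], [y[i] for i in keep]

X = [[0.1], [0.2], [0.3], [0.4], [0.5], [2.0]]
y = [0, 0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y, minority_label=1)
print(y_res.count(0), y_res.count(1))  # 5 5
```

Every original row is kept, and the only additions are copies of existing minority rows.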

Advantages:
- No information from the original training set is lost.

Disadvantage:
- Risk of overfitting, especially when minority examples are simply duplicated.
        
Some of the more widely used and implemented oversampling methods include:

- Random Oversampling: randomly duplicating examples from the minority class


- Synthetic Minority Oversampling Technique (SMOTE): works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample as a point along that line.


- Borderline-SMOTE: Borderline-SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model, and only generating synthetic samples that are “difficult” to classify.


- Borderline Oversampling with SVM: Borderline Oversampling is an extension to SMOTE that fits an SVM to the dataset and uses the decision boundary as defined by the support vectors as the basis for generating synthetic examples, again based on the idea that the decision boundary is the area where more minority examples are required.


- Adaptive Synthetic Sampling (ADASYN): ADASYN is another extension to SMOTE that generates synthetic samples inversely proportional to the density of the examples in the minority class, so more samples are generated where minority examples are sparse.
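The core SMOTE step described above (pick a minority example, pick one of its nearest minority neighbors, and draw a point on the line between them) can be sketched as follows. This is a simplified illustration, not a reference implementation: the function name `smote_sketch`, the default `k=3`, and the toy points are assumptions, and it needs at least two minority points.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

minority_points = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote_sketch(minority_points, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Each synthetic point lies on a segment between two real minority points, so it stays inside the region the minority class already occupies. The borderline and ADASYN variants differ mainly in which base points they choose and how many samples they generate around each.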


### 3. Undersampling / Downsampling

Undersampling balances an imbalanced dataset by reducing the size of the abundant (majority) class. Methods for downsampling in classification include:

- The **cluster centroids method** replaces clusters of majority-class samples with the cluster centroids found by a K-means algorithm.

- **Tomek link method** removes unwanted overlap between classes until all minimally distanced nearest neighbors are of the same class. A Tomek Link refers to a pair of examples in the training dataset that are both nearest neighbors (have the minimum distance in feature space) and belong to different classes. Tomek Links are often misclassified examples found along the class boundary and the examples in the majority class are deleted.

- The **Condensed Nearest Neighbors rule, or CNN** for short, was designed for reducing the memory required for the k-nearest neighbors algorithm. It works by enumerating the examples in the dataset and adding them to the store only if they cannot be classified correctly by the current contents of the store, and can be applied to reduce the number of examples in the majority class after all examples in the minority class have been added to the store.
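Of the methods above, Tomek links are compact enough to sketch directly. The brute-force version below finds pairs that are mutual nearest neighbors with different labels and drops the majority-class member of each pair. The function names and the one-dimensional toy data are assumptions, and the nearest-neighbor search is quadratic, so this is only suitable for small examples.

```python
import math

def nearest_neighbor(i, X):
    """Index of the point closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def remove_tomek_links(X, y, majority_label):
    """Drop the majority-class member of every Tomek link."""
    to_drop = set()
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        # Tomek link: i and j are each other's nearest neighbor
        # and belong to different classes.
        if nearest_neighbor(j, X) == i and y[i] != y[j]:
            to_drop.add(i if y[i] == majority_label else j)
    keep = [i for i in range(len(X)) if i not in to_drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy 1-D data: the points at 1.5 (class 0) and 1.6 (class 1) form a Tomek link.
X = [[0.0], [0.3], [0.7], [1.5], [1.6], [2.4], [2.9]]
y = [0, 0, 0, 0, 1, 1, 1]
X_clean, y_clean = remove_tomek_links(X, y, majority_label=0)
print(X_clean)
print(y_clean)
```

Only the majority point at 1.5 is removed, which clears the overlap at the class boundary while leaving every minority example in place.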

Advantages:
- Training run-time can improve because the training set is smaller.
- Helps with memory problems.

Disadvantage:
- Some critical information may be lost.
        
### 4. Feature Selection:

Feature selection acts as a form of intelligent subsampling and can help reduce the imbalance problem.

Methods for feature selection on both classes:
- One-sided metrics, such as the correlation coefficient (CC) and odds ratio (OR)
- Two-sided metrics, such as information gain (IG) and chi-square (CHI)

Based on the scores, we identify the significant features for each class and take the union of these features as the final feature set.

We then train the classifier on the data restricted to these features.
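The score-then-union procedure can be sketched with a one-sided metric, here the odds ratio on binary features. Everything specific in this sketch is an assumption made for illustration: the function names, the choice of one feature per class, the toy data, and the 0.5 added to each count to avoid division by zero.

```python
def odds_ratio(X, y, feature, target):
    """Odds ratio of a binary feature being 1 in class `target`
    versus all other classes (0.5 is added to each count as an
    assumed smoothing choice)."""
    a = b = c = d = 0
    for row, label in zip(X, y):
        present = row[feature] == 1
        if label == target:
            a += present      # target class, feature present
            b += not present  # target class, feature absent
        else:
            c += present      # other class, feature present
            d += not present  # other class, feature absent
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

def select_features(X, y, per_class=1):
    """Take the top-scoring features for each class and return their union."""
    n_features = len(X[0])
    selected = set()
    for target in sorted(set(y)):
        ranked = sorted(range(n_features),
                        key=lambda f: odds_ratio(X, y, f, target),
                        reverse=True)
        selected.update(ranked[:per_class])
    return sorted(selected)

# Toy data: feature 0 marks class 1, feature 1 marks class 0, feature 2 is noise.
X = [[1, 0, 1], [1, 0, 0],
     [0, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0], [0, 1, 0], [0, 1, 1]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

features = select_features(X, y)
print(features)  # [0, 1]
X_selected = [[row[f] for f in features] for row in X]
```

Scoring each class separately means the minority class contributes its own best features to the union, even though it has far fewer rows.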

### 5. Cost Sensitive Learning:

Cost-sensitive learning (CSL) takes misclassification costs into account and aims to minimise the total cost rather than the number of errors.
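One simple way to see the idea is to apply costs at prediction time: for each example, choose the label with the lower expected cost. In the sketch below the costs (a missed positive is ten times as expensive as a false alarm), the probabilities, and the function names are all assumptions for illustration.

```python
def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
    """Sum the cost of every false positive and false negative."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 1:
            cost += cost_fp
        elif t == 1 and p == 0:
            cost += cost_fn
    return cost

def min_cost_prediction(p_positive, cost_fp=1.0, cost_fn=10.0):
    """Predict the label whose expected misclassification cost is lower."""
    expected_cost_if_negative = p_positive * cost_fn        # risk of a miss
    expected_cost_if_positive = (1 - p_positive) * cost_fp  # risk of a false alarm
    return 1 if expected_cost_if_positive < expected_cost_if_negative else 0

y_true = [0, 0, 0, 1, 1]
probs = [0.05, 0.20, 0.40, 0.30, 0.80]  # hypothetical P(positive) per example

plain_pred = [1 if p >= 0.5 else 0 for p in probs]
cost_pred = [min_cost_prediction(p) for p in probs]

print(plain_pred, total_cost(y_true, plain_pred))
print(cost_pred, total_cost(y_true, cost_pred))
```

With these numbers the plain 0.5 cut-off misses one positive for a cost of 10, while the cost-aware rule accepts two false alarms for a total cost of 2.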

### 6. Ensemble methods:

Ensemble techniques combine the results of several classifiers to improve on the performance of a single classifier, thereby improving generalisability. Two common families are bagging and boosting.


#### Bagging
- Bagging-style methods aim to reduce the variance of the base classifiers by generating multiple bootstrap samples from the original dataset.


- In the context of class imbalance learning, bagging can be used to create multiple models that are each trained on a different resampled, rebalanced subset of the data, which can help to improve the overall performance of the classifier.


- One popular variant of bagging for class imbalance learning is SMOTEBagging, which combines the SMOTE oversampling technique with bagging to create multiple models that are trained on balanced subsets of the data.
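A rough sketch of bagging on balanced subsets follows. Note that this is not SMOTEBagging: to stay short, each bag pairs a bootstrap sample of the minority class with an equally sized random sample of the majority class, and the base learner is a simple nearest-centroid classifier. All function names and the toy data are assumptions.

```python
import random

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def fit_nearest_centroid(X, y):
    """Base learner: store one centroid per class."""
    return {label: centroid([x for x, l in zip(X, y) if l == label])
            for label in set(y)}

def predict_nearest_centroid(model, x):
    """Return the label of the closest class centroid."""
    return min(model, key=lambda label: sum((a - b) ** 2
                                            for a, b in zip(x, model[label])))

def balanced_bagging(X, y, minority_label, n_models=5, seed=0):
    rng = random.Random(seed)
    minority = [i for i, l in enumerate(y) if l == minority_label]
    majority = [i for i, l in enumerate(y) if l != minority_label]
    models = []
    for _ in range(n_models):
        bag = [rng.choice(minority) for _ in minority]  # bootstrap of the minority class
        bag += rng.sample(majority, len(minority))      # equally sized majority subset
        models.append(fit_nearest_centroid([X[i] for i in bag],
                                           [y[i] for i in bag]))
    return models

def predict_vote(models, x):
    """Majority vote over the base learners."""
    votes = [predict_nearest_centroid(m, x) for m in models]
    return max(set(votes), key=votes.count)

# Toy data: 8 majority points near (0, 0), 2 minority points near (3, 3).
X = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2], [0.3, 0.1],
     [-0.2, -0.2], [0.0, -0.3], [0.2, 0.3], [3.0, 3.1], [2.9, 2.8]]
y = [0] * 8 + [1] * 2

models = balanced_bagging(X, y, minority_label=1)
print(predict_vote(models, [2.5, 2.5]), predict_vote(models, [0.1, 0.0]))
```

Each base learner sees a balanced bag, so no single model is dominated by the majority class, and the vote averages out the randomness of the individual bags.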

#### Boosting
- Boosting-based methods combine multiple weak classifiers into a strong classifier, improving accuracy by iteratively reweighting the training examples.


- In the context of class imbalance learning, boosting can be used to give more weight to misclassified instances of the minority class, which can help to improve the overall performance of the classifier.


- One popular variant of boosting for class imbalance learning is RUSBoost, which combines random undersampling with boosting to create multiple models that are trained on balanced subsets of the data.
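The reweighting step at the heart of boosting can be illustrated with a single AdaBoost-style weight update. This is only the reweighting step, not RUSBoost itself, which would also randomly undersample the majority class before fitting each round. The function name and the toy labels are assumptions, and the update requires a weighted error strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One boosting round: increase the weight of misclassified examples
    and decrease the weight of correctly classified ones.

    Assumes the weights sum to 1 and the weighted error is in (0, 1).
    """
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this weak learner
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return [w / total for w in updated], alpha

# Four majority examples (0) and one minority example (1), equal starting weights.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]  # a weak learner that misses the minority example
weights = [0.2] * 5

new_weights, alpha = adaboost_reweight(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights], round(alpha, 3))
```

After one round the missed minority example carries half of the total weight (0.5 versus 0.125 for each of the others), so the next weak learner is pushed to classify it correctly.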

Advantages:
- A more stable model, with predictions that are typically better than those of any single base classifier

#### References:
1. https://medium.com/analytics-vidhya/what-is-balance-and-imbalance-dataset-89e8d7f46bc5
2. https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/
3. https://thecontentfarm.net/ensemble-techniques-for-handling-class-imbalance/