# Imbalanced Data
- One class outnumbers the others by a large proportion

## Strategies
- Collect More Data?
- Sampling
    - Consider testing random and non-random (e.g. stratified) sampling schemes
    - Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios)
    - Undersampling
        - Consider undersampling the majority class when you have a lot of data (tens or hundreds of thousands of instances or more)
        - Can be random or informed; informed undersampling techniques include EasyEnsemble and BalanceCascade
    - Oversampling
        - Oversampling the minority class when you have less data (<10k)
    - Synthetic Samples
        - A type of oversampling that generates artificial data
        - Synthetic minority oversampling technique (SMOTE) is a powerful and widely used method
        - The SMOTE algorithm creates artificial data based on feature-space (rather than data-space) similarities between minority samples
        - Other advanced methods: cluster-based sampling, adaptive synthetic sampling, borderline-SMOTE, SMOTEBoost, DataBoost-IM, kernel-based methods, etc.
- Cost Sensitive Learning
    - Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class
    - Often the handling of class penalties or weights is specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA
    - It is also possible to have generic frameworks for penalized models. For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification
    - Using penalization is desirable if you are locked into a specific algorithm and are unable to resample or you’re getting poor results. It provides yet another way to “balance” the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.
- Try Different Algorithms
    - Decision trees often perform well on imbalanced datasets; popular algorithms include C4.5, C5.0, CART, and Random Forest (an ensemble of trees). The splitting rules, which look at the class variable when building the trees, can force both classes to be addressed
- Fields of study dedicated to imbalanced datasets
    - Anomaly detection is the detection of rare events. This might be a machine malfunction indicated by its vibrations, or malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation. This shift in thinking treats the minority class as the outlier class, which might help you think of new ways to separate and classify samples
    - Change detection is similar to anomaly detection except rather than looking for an anomaly it is looking for a change or difference. This might be a change in behavior of a user as observed by usage patterns or bank transactions.
- Other ideas
    - Decompose the larger class into a number of smaller classes
    - Use a one-class classifier (treat the problem like outlier detection)
    - Resample the unbalanced training set into several balanced sets rather than one, and run an ensemble of classifiers on these sets
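
The resampling and cost-sensitive ideas above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`undersample_majority`, `oversample_minority`, `smote_like`, `balanced_class_weights`) are hypothetical, `ratio` means the desired majority:minority ratio, `smote_like` only mimics the core SMOTE idea of interpolating between nearby minority samples, and the class-weight formula is one common inverse-frequency heuristic. In practice a dedicated library would normally be used instead.

```python
import random
from collections import Counter


def undersample_majority(X, y, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority rows until majority:others is about ratio:1."""
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority_label]
    rest = [i for i, label in enumerate(y) if label != majority_label]
    keep = min(len(maj), int(round(ratio * len(rest))))
    idx = sorted(rng.sample(maj, keep) + rest)
    return [X[i] for i in idx], [y[i] for i in idx]


def oversample_minority(X, y, minority_label, ratio=1.0, seed=0):
    """Randomly duplicate minority rows until others:minority is about ratio:1."""
    rng = random.Random(seed)
    mino = [i for i, label in enumerate(y) if label == minority_label]
    rest = [i for i, label in enumerate(y) if label != minority_label]
    target = max(len(mino), int(round(len(rest) / ratio)))
    extra = [rng.choice(mino) for _ in range(target - len(mino))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


def smote_like(minority_rows, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (needs at least 2 rows)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Nearest neighbours by squared Euclidean distance, excluding the row itself.
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * c) for label, c in counts.items()}
```

The `ratio` parameter makes it easy to test targets other than 1:1, as suggested under Sampling, and the weights returned by `balanced_class_weights` can be passed to any learner that accepts per-class penalties.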

## Evaluation
- Confusion Matrix
- Cost Matrix: Similar to a confusion matrix, but concerned with the cost of false positives and false negatives. There is no cost penalty associated with true positives and true negatives, as they are correctly identified.
- Precision: A measure of a classifier's exactness (the fraction of predicted positives that are actually positive).
- Recall: A measure of a classifier's completeness (the fraction of actual positives that are predicted positive).
- F1 Score (or F-score): The harmonic mean of precision and recall.
- Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
- ROC Curves: Like precision and recall, these split accuracy into two measures, sensitivity and specificity; models can be chosen based on the thresholds that balance these values.
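
As a rough sketch of how the metrics above relate to the four cells of a binary confusion matrix, the following computes them from raw counts. The function names `binary_metrics` and `total_cost`, and the default costs, are illustrative assumptions; zero denominators are mapped to 0.0 here as a convention, not a standard.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and Cohen's kappa from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


def total_cost(tp, fp, fn, tn, cost_fp=1.0, cost_fn=1.0):
    """Cost-matrix total: only the two error cells carry a penalty."""
    return fp * cost_fp + fn * cost_fn


# A classifier that always predicts the majority class on 990 negatives
# and 10 positives: high accuracy, but nothing else looks good.
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=990)
```

For the `always_negative` case, accuracy is 0.99 while recall, F1, and kappa are all 0, which is why accuracy alone is misleading on imbalanced data.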