# Selection bias

One type of bias that can be handled with technology is selection bias. The Forest Cover (Covertype) dataset is a good example of it.

## Data import

This dataset can be fetched using sklearn. We'll import it and combine X and y into one dataframe.

In [None]:
from sklearn.datasets import fetch_covtype

# Load the Covertype dataset
data = fetch_covtype(as_frame=True)
X = data.data
y = data.target

print(X.head())
print(y.head())

df = X.copy()
df['target'] = y

## Detecting selection bias

The easiest way of detecting selection bias is to count how many rows there are per label in the target column.

Show the value_counts of the target column, and add each label's share of the total as a percentage.
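As a hint, here is a minimal sketch of the pattern on a made-up Series. The name `toy` and its values are illustrative and not taken from the dataset; the same two steps should carry over to `df['target']`.

```python
import pandas as pd

# Hypothetical labels, only to illustrate the pattern
toy = pd.Series([1, 1, 1, 2, 2, 2, 2, 2, 3, 3], name="target")

# Absolute count per label
counts = toy.value_counts()

# Put counts and their share of the total side by side
summary = pd.DataFrame({
    "count": counts,
    "percent": (counts / len(toy) * 100).round(1),
})
print(summary)
```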

In [None]:
# Up to you!



This shows that the situation is quite dire. Almost 50% of the rows are of class 2, and only 0.5% are of class 4. A classic case of class imbalance or selection bias. Let's fix it!

## Get more data

If possible, this is the best approach: collect more data for the classes that have fewer samples. But in this case it's not a possibility.

## Use Algorithms that Handle Imbalance Naturally

Some algorithms are more robust to imbalance: Gradient Boosted Trees (e.g., LightGBM, CatBoost, or XGBoost with scale_pos_weight, which applies to binary problems), or imblearn's BalancedBaggingClassifier and BalancedRandomForestClassifier. These algorithms handle much of the imbalance internally, although they usually still need tuning.

## Use Class Weights

Set class weights inversely proportional to class frequencies. Many scikit-learn estimators (LogisticRegression, SVC, RandomForestClassifier, etc.) accept the class_weight='balanced' argument, or a dictionary of manually computed weights. XGBoost has no class_weight argument; pass per-row weights through sample_weight when fitting instead. This is often the simplest and most robust first approach, especially for tree-based models.

Example in sklearn:
```Python
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(class_weight='balanced')
```
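To see what 'balanced' means in numbers, here is a small sketch on a made-up label array (`y_toy` is illustrative). It compares sklearn's `compute_class_weight` with the formula n_samples / (n_classes * count_per_class).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 rows of class 0, 2 rows of class 1
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.unique(y_toy)

# Weights as sklearn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights computed by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rare class gets the larger weight, so each of its rows counts for more when the model is fitted.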

Let's try this! First split the data into a train/validation/test set, say 60/20/20. Make sure you do a stratified split, so every label keeps the same proportion in the training, validation and test sets.
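One possible way to get a 60/20/20 split is to call train_test_split twice. The sketch below does this on synthetic data; `make_classification` and all `_toy`, `_tr`, `_val`, `_te` names are stand-ins, not the Covertype data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data as a stand-in
X_toy, y_toy = make_classification(
    n_samples=1000, n_classes=3, n_informative=4,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

# Step 1: keep 60% for training, hold out 40%, preserving label proportions
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42
)

# Step 2: split the held-out 40% in half: 20% validation, 20% test
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

print(len(y_tr), len(y_val), len(y_te))
# Label shares in the test part, to compare with the full data
print(pd.Series(y_te).value_counts(normalize=True).round(2))
```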

In [None]:
# Up to you!



Show the distribution of labels in the test-set (along with percentages) to check if the stratification worked.

In [None]:
# Up to you!



Now train a RandomForestClassifier (using the training-set). Don't use the "class_weight" yet. Also show the accuracy_score.

In [None]:
# Up to you!



Now train a RandomForestClassifier (using the training-set) but use the "class_weight" this time. Show the accuracy_score and don't forget to give your model a different name!

In [None]:
# Up to you!



Accuracy when using weights: 0.0008% better. Not really the gains we were hoping for. But maybe we are getting the gains we want, just in the underrepresented labels.
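Why look beyond accuracy? A toy illustration with made-up labels: a "model" that always predicts the majority class still scores a high accuracy, while its recall on the rare class is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical ground truth: 95 rows of class 0, 5 rows of class 1
y_true_toy = np.array([0] * 95 + [1] * 5)
# A useless model that always predicts the majority class
y_pred_toy = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_pred_toy)
rare_recall = recall_score(y_true_toy, y_pred_toy, pos_label=1)
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)

print(f"accuracy={acc:.2f}, recall on rare class={rare_recall:.2f}, balanced accuracy={bal_acc:.2f}")
```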

Show the confusion matrix and classification_report for the weighted and unweighted results.

In [None]:
# Up to you!



The difference is marginal at best. At least it doesn't make the model worse (as it can with an untuned decision tree).

There is a special random forest implementation for dealing with unbalanced data ([here](https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedRandomForestClassifier.html)). Let's try that one!

First install the library...

In [None]:
# !pip install imbalanced-learn

And create and train the model on the existing split.

In [None]:
# Up to you!



Accuracy went down further, dropping all the way below 80% now. And what do the confusion matrix and classification report say?

In [None]:
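from sklearn.metrics import classification_report, confusion_matrix

# Balanced random forest model
# Assumes y_pred_weighted_v2 holds its predictions on X_test (the name used in the report below)
conf_matrix = confusion_matrix(y_test, y_pred_weighted_v2)
# Display the confusion matrix
print("Confusion Matrix balanced random forest model:")
print(conf_matrix)

print(classification_report(y_test, y_pred_weighted_v2))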

As expected, bad news all around. Precision even drops to 31% for class 5.

This can be due to a number of reasons:
* It Undersamples the Majority Class (Aggressively): Randomly undersamples each majority class within each tree, so that classes are balanced. This can throw away a lot of useful majority-class data, reducing the model's capacity to learn overall patterns. Especially on a large dataset like the Forest Cover dataset, this can harm overall accuracy and generalization unless tuned well.
* It Needs More Trees to Be Stable: Because each tree sees only a small, randomly sampled (and heavily undersampled) portion of the data, you often need more trees to average out the noise.
* It's Sensitive to Tree Depth and Leaf Size: By default, it uses deep trees, which overfit the small, undersampled data and hurt generalization, especially on the majority classes.
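A back-of-the-envelope sketch of the first point, assuming every class is sampled down to the size of the smallest class for each tree. The counts in `toy_counts` are hypothetical, not the Covertype numbers.

```python
# Hypothetical number of training rows per class
toy_counts = {1: 5000, 2: 7000, 3: 100}

# Under the assumption above, each tree gets `smallest` rows from every class
smallest = min(toy_counts.values())
rows_per_tree = smallest * len(toy_counts)
total_rows = sum(toy_counts.values())

share = rows_per_tree / total_rows
print(f"Each tree sees {rows_per_tree} of {total_rows} rows ({share:.1%})")
```

With counts this skewed, a single tree only ever sees a few percent of the data, which is why more trees are needed to cover it.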

So this is not the end of this story, but the beginning: tune all three models and see where you can get. Use the validation-set when you're happy with the model to get a final reading on the accuracy.

But for now, let's look at other ways of dealing with the imbalance.

## Resampling the Dataset

Resampling means working with the samples we already have. We have two options:

1. Oversampling minority classes
    * SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic examples.
    * Random Oversampling: Simply duplicates examples from minority classes.
1. Undersampling majority classes
    * Random Undersampling: Drops examples from majority classes.
    * Tomek Links / Edited Nearest Neighbors: Smarter undersampling using data geometry.

But before we begin, beware: over- and undersampling work for tree-based models, but there is a risk of overfitting (oversampling repeats or interpolates the same few minority rows) or losing information (undersampling drops rows).
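To make the SMOTE idea from the list above concrete, here is a bare-bones sketch of its core step: a synthetic point is placed somewhere on the line between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation; the `minority` points and all names are made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

# For every sample, find itself plus its 2 nearest minority neighbours
nn = NearestNeighbors(n_neighbors=3).fit(minority)
_, neighbor_idx = nn.kneighbors(minority)  # column 0 is the sample itself

synthetic = []
for i, point in enumerate(minority):
    j = rng.choice(neighbor_idx[i, 1:])    # pick one real neighbour
    gap = rng.random()                     # random position between 0 and 1
    synthetic.append(point + gap * (minority[j] - point))
synthetic = np.array(synthetic)

print(synthetic)
```

Because every synthetic row lies between two existing rows, SMOTE adds no information from outside the region the minority class already occupies.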

First apply SMOTE (from imblearn, which we installed earlier) to the training dataset. Then use "Counter" from collections to check the number of rows in every category.

In [None]:
# Up to you!



Now train a new random forest classifier. Don't use any tuning parameters (n_estimators, ...), only random_state=42. Give it a new name when fitting and don't overwrite the y_pred results we had earlier.

Remember that training on the original dataset took about 2 minutes. Now we have a much, much bigger dataset.

While waiting, here's something to think about: we only resampled the training-set, not the test-set. Why not? What would have happened to our precision had we done that?

In [None]:
# Up to you!



A (slightly, but still) better model than without using SMOTE! Let's look at all the specs first.

In [None]:
# Up to you!



Precision is sometimes down on the previous model, but general accuracy is up. Once again we didn't find a silver bullet to fix all, but I hope that idea is gone by now: it doesn't exist. Machine learning is not magic but hard work, tinkering with the dataset, trying new tools and stopping before going too far. But this small increase is a good sign: start tuning and you're on to something.

The next thing to try is undersampling. We could do a random undersampling, but that would lead to roughly the same results as the imblearn version of the random forest classifier. Instead we'll try two of the other undersampling methods: Tomek links and ENN.

First import TomekLinks from imblearn and use it to resample the training-data.

In [None]:
# Up to you!



Look at the numbers and decide: do we really need to train a new model on this data? Will it be any better?

It won't. A Tomek link is a pair of data points from different classes that are each other's nearest neighbour, and the method removes points from these pairs ([read more](https://imbalanced-learn.org/stable/under_sampling.html#tomek-links)). Apparently there are not enough of these to delete, so the dataset remains roughly the same.
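A small sketch of what a Tomek link is, using sklearn's NearestNeighbors on made-up one-dimensional data rather than imblearn itself. A pair counts as a link when both points are each other's nearest neighbour and their labels differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 1D data: two classes that touch around x = 1
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [2.0], [2.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# Nearest neighbour of every point (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=2).fit(X_toy)
_, idx = nn.kneighbors(X_toy)
nearest = idx[:, 1]

# Mutual nearest neighbours with different labels form a Tomek link
tomek_pairs = [
    (i, int(j)) for i, j in enumerate(nearest)
    if i < j and nearest[j] == i and y_toy[i] != y_toy[j]
]
print(tomek_pairs)
```

Only the two points at the class boundary form a link, which illustrates why so few rows get removed when the classes do not overlap much.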

But hey, we're not doing deep learning, so let's train another model.

In [None]:
# Up to you!



And once more for ENN!

In [None]:
# Up to you!



(You know the drill by now.)

In [None]:
# Up to you!



When to use these techniques:
* Tomek Links: Cleaning borderline cases, especially in combo with oversampling (like SMOTE + Tomek)
* ENN: Removing noisy samples, cleaning clusters; can be aggressive and may underfit if data is sparse
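The core idea behind ENN can be sketched in a few lines: drop a sample when most of its nearest neighbours carry a different label. imblearn's EditedNearestNeighbours has its own defaults for which classes are cleaned and how strict the vote is, so treat this only as an illustration on made-up data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical 1D data; index 3 is a class-1 point sitting inside class 0
X_toy = np.array([[0.0], [0.1], [0.2], [0.15], [1.0], [1.1], [1.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1, 1])

k = 3
# k + 1 neighbours, because the first neighbour is the point itself
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_toy)
_, idx = nn.kneighbors(X_toy)

# Count how many of the k neighbours share the sample's own label
neighbor_labels = y_toy[idx[:, 1:]]
agree = (neighbor_labels == y_toy[:, None]).sum(axis=1)

# Keep a sample only if the majority of its neighbours agree with it
keep = agree > k / 2
X_clean, y_clean = X_toy[keep], y_toy[keep]
print(keep)
```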

And how would you combine these?

In [None]:
from imblearn.combine import SMOTETomek, SMOTEENN

# smote_tomek = SMOTETomek(random_state=42)
# X_resampled, y_resampled = smote_tomek.fit_resample(X_train, y_train)

# OR
# smote_enn = SMOTEENN(random_state=42)
# X_resampled, y_resampled = smote_enn.fit_resample(X_train, y_train)


... but since we seem to have decided to stay away from tuning for now, let's not dig deeper there.

## Anomaly Detection Perspective

If one or more classes are extremely rare (like class 4 in this case), treat them as anomalies. We could use one-vs-rest classifiers and train models to detect "normal" vs "abnormal" classes.
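A minimal sketch of the "rare vs rest" idea on synthetic data: the multi-class target is turned into a binary one and a detector is trained on it. The data from `make_classification`, the choice of class 2 as the rare class and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data where class 2 is rare
X_toy, y_toy = make_classification(
    n_samples=2000, n_classes=3, n_informative=5,
    weights=[0.6, 0.37, 0.03], random_state=42,
)

# Binary target: 1 for the rare class, 0 for everything else
rare_class = 2
y_is_rare = (y_toy == rare_class).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_is_rare, test_size=0.25, stratify=y_is_rare, random_state=42
)

# Class weights keep the detector from ignoring the rare class
detector = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
).fit(X_tr, y_tr)

pred = detector.predict(X_te)
print("Recall on the rare class:", recall_score(y_te, pred))
```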

Let's start by looking at the original data distribution once again.

In [None]:
# Up to you!



It looks like classes 3 to 7 are the smallest. What if we were to combine these into one giant class and retrain the model? We'll remap them all to class 3.

In [None]:
y_test_mapped = y_test.replace({4: 3, 5: 3, 6: 3, 7: 3})
y_train_mapped = y_train.replace({4: 3, 5: 3, 6: 3, 7: 3})

And train yet another random forest classifier to distinguish in our mapped datasets.

In [None]:
# Up to you!



The accuracy is pretty good, among the highest we have had so far. But at inference time this model can only say that a patch is of type 3, 4, 5, 6 or 7, so we need another model to distinguish between these.

Start by getting all rows from the original dataset that map to class 3 or higher. Use the "df", that has the X and y combined.

In [None]:
# Up to you!



Now apply a train/test-split to this filtered subset of df (maybe even a validation-set?)

In [None]:
# Up to you!



Maybe one more model? It'll be faster because it's much smaller.

In [None]:
# Up to you!



We have an accuracy, but it's an accuracy that has to be applied after the previous accuracy has been applied so it's hard to compare.
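One way to get a comparable number is to chain both models on the same rows and score the final label. The sketch below uses made-up prediction arrays (`stage1_pred`, `stage2_pred`) instead of real model output; it assumes those rows were not used to train either model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels for 8 unseen rows
y_true_toy = np.array([1, 2, 2, 3, 5, 7, 1, 4])

# Stage 1 predicts 1, 2 or 3, where 3 means "one of classes 3 to 7"
stage1_pred = np.array([1, 2, 1, 3, 3, 2, 1, 3])
# Stage 2 predicts 3 to 7; only used where stage 1 said 3
stage2_pred = np.array([3, 3, 3, 3, 5, 7, 3, 6])

# Final label: take stage 2 where stage 1 routed the row to the merged class
final_pred = np.where(stage1_pred == 3, stage2_pred, stage1_pred)

overall = accuracy_score(y_true_toy, final_pred)
print(final_pred, overall)
```

A row from classes 3 to 7 only counts as correct when both stages get it right, so the chained accuracy can be lower than either stage's own score.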

What we've been doing is also very tricky: what if classes 3-7 have no relation whatsoever? What if class 3 falls somewhere between 1 and 2 in a multidimensional space? In that case it would have been better to combine 1 and 3 vs 2 and the rest. When going down this rabbit hole be prepared to:

- Undo all your work and start over
- Keep enough data separate to be able to test your model(s) on unseen data

## Extra

Just because we can: a script that visualizes what SMOTE combined with Tomek Links or ENN does to your data on a subset of 10k rows. It uses a 2D PCA projection to visualize the class distributions in the 54 dimensions of our original data.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.datasets import fetch_covtype
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Load the Forest Cover dataset as NumPy arrays
X, y = fetch_covtype(return_X_y=True)

# Downsample to 10,000 rows (stratified) to keep resampling and plotting fast
X_small, _, y_small, _ = train_test_split(X, y, train_size=10000, stratify=y, random_state=42)

# Reduce to 2D with PCA for visualization
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_small)

# Helper: Plot function
def plot_pca(X_proj, y, title):
    plt.figure(figsize=(8, 6))
    sns.scatterplot(x=X_proj[:, 0], y=X_proj[:, 1], hue=y, palette='tab10', s=20, alpha=0.7, linewidth=0)
    plt.title(title)
    plt.xlabel("PCA 1")
    plt.ylabel("PCA 2")
    plt.legend(title="Class", bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.tight_layout()
    plt.show()

# Visualize original
plot_pca(X_pca, y_small, "Original Dataset (PCA Projection)")

# ---- SMOTE + Tomek ----
smote_tomek = SMOTETomek(random_state=42)
X_res_tomek, y_res_tomek = smote_tomek.fit_resample(X_small, y_small)
X_res_tomek_pca = pca.transform(X_res_tomek)
print("SMOTE + Tomek class distribution:", Counter(y_res_tomek))
plot_pca(X_res_tomek_pca, y_res_tomek, "SMOTE + Tomek Links (PCA Projection)")

# ---- SMOTE + ENN ----
smote_enn = SMOTEENN(random_state=42)
X_res_enn, y_res_enn = smote_enn.fit_resample(X_small, y_small)
X_res_enn_pca = pca.transform(X_res_enn)
print("SMOTE + ENN class distribution:", Counter(y_res_enn))
plot_pca(X_res_enn_pca, y_res_enn, "SMOTE + ENN (PCA Projection)")
