# Outlier detection in the forest dataset

We'll be using the [covertype](https://scikit-learn.org/stable/datasets/real_world.html#covtype-dataset) dataset from sklearn for this example.

## Load data

First load the data!

In [None]:
# !pip install scikit-learn

In [None]:
from sklearn.datasets import fetch_covtype

# Load the Covertype dataset
data = fetch_covtype(as_frame=True)
X = data.data
y = data.target

print(X.head())
print(y.head())

This dataset is pretty large, not only in rows (over 500k) but also in dimensionality (number of columns): 54. That gives us room to do some serious searching and cleaning. (It also means the `head()` output is not very pretty.)

scikit-learn was really nice in giving us the data pre-split, with X and y separately. But we want to work with outliers, meaning we'll probably delete rows at some point. For that it's better to have one dataframe "df" with both X and y, where the y-column is called "target". We can still split them later on.

(It also has the added advantage that we can freely delete rows in "df" and restore them by copying "X" and "y" again.)

In [None]:
# Up to you!



## Data exploration

Always start with exploration. First, get the info on X.

In [None]:
# Up to you!



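54 fields, all float64 (good for models!) and no null values. Promising. But which graph works best to see if any outliers are present? The box plot. By default its whiskers reach at most 1.5\*IQR beyond the quartiles, and everything outside them is drawn as a potential outlier and becomes a candidate for deletion. (Later on we use a wider 3\*IQR fence when actually deleting.)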

In [None]:
# Up to you!

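To make the whisker rule concrete, here is a minimal sketch of how the IQR fences can be computed by hand. The helper name `iqr_fences` and the toy values are assumptions made for illustration; they are not part of the dataset or of any library.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data: nine ordinary values and one extreme one
toy = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 16, 60])

lower, upper = iqr_fences(toy, k=1.5)

# Everything outside the fences would be drawn as a separate point in a box plot
outliers = toy[(toy < lower) | (toy > upper)]
print(f"fences: {lower} .. {upper}")
print(outliers.tolist())
```

Raising `k` (for example to 3) widens the fences, so fewer values are flagged.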


The first couple of fields look normal, but once we reach the soil types it's obvious that something is going on.

![](../files/2025-05-12-21-09-12.png)

This graph shows that all values are around 0, but there is one value (or maybe a couple of values, but not a lot of them) that is 1, and that messes up our graph. With any luck it's the same (couple of) row(s) that have these values.

Show all rows that have a value of 1 in "Soil_Type_30".

In [None]:
# Up to you!



No luck. But strangely enough the other soils are all "0". And we said the other values were around 0, but they could have also been exactly 0.

How many different values are stored in the soil-columns (and the other columns)?

In [None]:
# Up to you!

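One possible way to count distinct values per column, sketched on a hypothetical miniature frame rather than the real data. The column names follow the pattern I assume scikit-learn uses for this dataset (`Soil_Type_0`, ...); the values are made up.

```python
import pandas as pd

# Hypothetical miniature of the dataset: one continuous and two indicator columns
toy = pd.DataFrame({
    "Elevation": [2596.0, 2590.0, 2804.0, 2785.0],
    "Soil_Type_0": [0.0, 0.0, 1.0, 0.0],
    "Soil_Type_1": [1.0, 0.0, 0.0, 0.0],
})

# Number of distinct values in every column
distinct = toy.nunique()
print(distinct)

# Columns with at most two distinct values behave like 0/1 flags,
# so an IQR-based outlier search makes little sense for them
binary_cols = distinct[distinct <= 2].index.tolist()
print(binary_cols)
```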


Found it! The "Wilderness area" and "soil" columns are simply 1 or 0. We won't find any outliers there. Let's refocus our attention on the other columns and draw the box plots again.

Draw box plots for the first 10 columns.

In [None]:
# Up to you!



Aspect is fine (some skewing, tail to the right) but the others all have outliers.

## Dealing with outliers

Deleting all rows with outliers is one option, but not always the best. It has its downsides:

**The outliers can be informative**: suppose there are some patches of wood where very special trees grow, and these only grow in places where there is no (or very little) Hillshade_9am. By deleting the outliers in that column, we'd be deleting that entire type of tree.

**You risk introducing bias**: by deleting all data where Horizontal_Distance_To_Fire_Points is large, you delete all the information in that category. That means you've chopped a specific part of your dataset off. Chopping off a random part is bad, but chopping off one particular part is really bad.

So how bad is simply deleting all outliers? Go over the first ten columns and delete all outliers, printing the number of rows you've deleted.

In [None]:
# Up to you!

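A minimal sketch of column-by-column deletion, shown on a hypothetical toy frame rather than the real data. The helper name `drop_outliers_sequentially`, the columns `a` and `b`, and the choice of a 3\*IQR fence are assumptions made for illustration.

```python
import pandas as pd

def drop_outliers_sequentially(frame, columns, k=3.0):
    """Drop rows outside Q1 - k*IQR .. Q3 + k*IQR, one column at a time.

    Returns the filtered frame and the number of rows removed per column.
    """
    removed = {}
    for col in columns:
        # Quartiles are recomputed on whatever rows are still left
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep = frame[col].between(q1 - k * iqr, q3 + k * iqr)
        removed[col] = int((~keep).sum())
        frame = frame[keep]
    return frame, removed

# Hypothetical toy frame: the last row is extreme in both columns
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 500],
    "b": [5, 6, 5, 6, 5, 6, 5, 6, 5, 900],
})

cleaned, removed = drop_outliers_sequentially(toy, ["a", "b"])
for col, n in removed.items():
    print(f"{col}: deleted {n} rows")
print(f"rows left: {len(cleaned)}")
```

Here column `b` reports zero deleted rows only because its extreme row had already disappeared while filtering `a`. The per-column numbers therefore depend on the order of the columns.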


"Vertical_Distance_To_Hydrology" is responsible for deleting 5157 rows. That is a lot, even in 500k rows. (About 1%, in fact.) And our count isn't fair: "Horizontal_Distance_To_Roadways" doesn't seem to have outliers, but it did have them. They were simply already gone, deleted along with the "Vertical_Distance_To_Hydrology" outliers.

First run the next cell to restore the data, and then count the number of rows that deleting the outliers would remove (without actually deleting them).

In [None]:
# Run to restore the data
df = X.copy()
df['target'] = y

In [None]:
# Up to you!

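Counting without deleting can be sketched as follows, again on a hypothetical toy frame. The helper name `count_outliers`, the columns `a` and `b`, and the 3\*IQR fence are assumptions made for illustration. Because every column is evaluated on the full, untouched frame, the order of the columns no longer matters.

```python
import pandas as pd

def count_outliers(frame, columns, k=3.0):
    """Count values outside Q1 - k*IQR .. Q3 + k*IQR per column, deleting nothing.

    Returns the per-column counts and the number of rows flagged in any column.
    """
    counts = {}
    flagged = pd.Series(False, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = ~frame[col].between(q1 - k * iqr, q3 + k * iqr)
        counts[col] = int(outside.sum())
        flagged |= outside  # remember every row that is an outlier somewhere
    return pd.Series(counts), int(flagged.sum())

# Hypothetical toy frame: the last row is extreme in both columns
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 500],
    "b": [5, 6, 5, 6, 5, 6, 5, 6, 5, 900],
})

per_column, rows_affected = count_outliers(toy, ["a", "b"])
print(per_column)
print(f"rows with at least one outlier: {rows_affected}")
```

The per-column counts can add up to more than the number of affected rows, because one row can be an outlier in several columns at once.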


Another question we should be asking ourselves: suppose we delete all the outliers, what kind of damage are we inflicting on our output labels? If we delete the outliers, are we deleting rows from all output classes equally or are we targeting one class in particular?

Store the number of rows per class (value_counts in the target-column). Then delete the outliers and store the result again.

In [None]:
# Up to you!

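A small sketch of storing the class counts before and after a deletion. The toy frame, its values, and the single-column 3\*IQR filter are assumptions made for illustration; only the names `pre_class_counts` and `post_class_counts` are taken from the cell that merges them.

```python
import pandas as pd

# Hypothetical toy frame: class 7 is rare and owns the extreme value
toy = pd.DataFrame({
    "Elevation": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 300.0],
    "target": [1, 1, 1, 2, 2, 2, 7, 7],
})

# Rows per class before touching anything
pre_class_counts = toy["target"].value_counts().sort_index()

# Keep only rows inside the 3*IQR fences of one column
q1, q3 = toy["Elevation"].quantile([0.25, 0.75])
iqr = q3 - q1
keep = toy["Elevation"].between(q1 - 3 * iqr, q3 + 3 * iqr)

# Rows per class after the deletion
post_class_counts = toy.loc[keep, "target"].value_counts().sort_index()

print(pre_class_counts)
print(post_class_counts)
```

In this toy example a single deleted row halves class 7 while the other classes are untouched, which is the kind of imbalance the comparison is meant to reveal.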


Merge both value_counts and show the percentage decline.

In [None]:
import pandas as pd
# Merge pre_class_counts and post_class_counts into a DataFrame
class_counts_comparison = pd.DataFrame({
    'Before': pre_class_counts,
    'After': post_class_counts
})

# Calculate the percentage decrease
class_counts_comparison['Percentage Decrease'] = ((class_counts_comparison['Before'] - class_counts_comparison['After']) / class_counts_comparison['Before']) * 100

print(class_counts_comparison)

Doable for most classes, but if you want valid predictions on class 7 this is not the way forward. If that is your goal, you should now start looking at models that can handle outliers well (like tree-based models).

You could also try other methods of dealing with outliers (as we will be doing), but remember that we're actively interfering with our data. There is a line between "helping" and "overstepping" that is very easy to cross (as anyone with a mother-in-law can tell you, or so I've heard). Clipping does not remove rows, but it still alters those same rows (4.4% of class 7), and the same goes for scaling.

But we have an advantage! We can train many, many models (unlike the folks doing large language models). So maybe try a model with the outliers and another without?

## Winsorization

New plan: we delete everything outside 4\*IQR (way out of bounds) and clip all values beyond 3\*IQR to 3\*IQR. This way we save most values (or so we think) and get rid of the pesky remaining outliers.

First, delete rows outside of 4*IQR. (And remember to restore the data before starting here.)

In [None]:
# Up to you!



If you stored the values you would have noticed we deleted 1.46% of class 7. The rest is all below 0.4%. Next up is clipping. Let's clip the values at 20% and 80%. That way we still keep some outliers, but the data is better in check.

In [None]:
# Up to you!

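Clipping at quantiles can be sketched like this. The helper name `clip_to_quantiles` and the toy values are assumptions made for illustration; `Series.quantile` and `Series.clip` are the pandas calls doing the work.

```python
import pandas as pd

def clip_to_quantiles(frame, columns, lower=0.20, upper=0.80):
    """Return a copy in which each column is clipped to its own quantile range."""
    clipped = frame.copy()
    for col in columns:
        lo, hi = clipped[col].quantile([lower, upper])
        # Values below lo become lo, values above hi become hi
        clipped[col] = clipped[col].clip(lower=lo, upper=hi)
    return clipped

# Hypothetical toy column with one extreme value
toy = pd.DataFrame({
    "Slope": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
})

clipped = clip_to_quantiles(toy, ["Slope"], lower=0.20, upper=0.80)
print(clipped["Slope"].tolist())
print(len(toy), len(clipped))
```

No rows disappear. The values outside the range are moved onto the boundaries, so they pile up there.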


We can't see the result anymore by showing the value_counts because we never deleted rows. The only way of showing the results is by drawing the box plots again. In fact, we have X (the original data) and df (the new data). Put them next to each other in the same graph!

In [None]:
# Up to you!



The whiskers have grown shorter. This is normal, but it does show an important effect. Let's focus on 2 columns, Vertical_Distance_To_Hydrology and Horizontal_Distance_To_Roadways. We'll start with the first, plotting the values from X and df in a histogram together.

In [None]:
import matplotlib.pyplot as plt

# Plot histogram for "Vertical_Distance_To_Hydrology" from X and df
plt.figure(figsize=(10, 6))
bin_size = 10
min_value = min(X["Vertical_Distance_To_Hydrology"].min(), df["Vertical_Distance_To_Hydrology"].min())
max_value = max(X["Vertical_Distance_To_Hydrology"].max(), df["Vertical_Distance_To_Hydrology"].max())
bins = range(int(min_value), int(max_value) + bin_size, bin_size)

plt.hist(X["Vertical_Distance_To_Hydrology"], bins=bins, alpha=0.5, label="Original Data (X)", color='blue')
plt.hist(df["Vertical_Distance_To_Hydrology"], bins=bins, alpha=0.5, label="Modified Data (df)", color='orange')

# Add labels, title, and legend
plt.xlabel("Vertical_Distance_To_Hydrology")
plt.ylabel("Frequency")
plt.title("Histogram of Vertical_Distance_To_Hydrology")
plt.legend()

# Show the plot
plt.show()

And now Horizontal_Distance_To_Roadways.

In [None]:
import matplotlib.pyplot as plt

# Plot histogram for "Horizontal_Distance_To_Roadways" from X and df
plt.figure(figsize=(10, 6))
bin_size = 100
min_value = min(X["Horizontal_Distance_To_Roadways"].min(), df["Horizontal_Distance_To_Roadways"].min())
max_value = max(X["Horizontal_Distance_To_Roadways"].max(), df["Horizontal_Distance_To_Roadways"].max())
bins = range(int(min_value), int(max_value) + bin_size, bin_size)

plt.hist(X["Horizontal_Distance_To_Roadways"], bins=bins, alpha=0.5, label="Original Data (X)", color='blue')
plt.hist(df["Horizontal_Distance_To_Roadways"], bins=bins, alpha=0.5, label="Modified Data (df)", color='orange')

# Add labels, title, and legend
plt.xlabel("Horizontal_Distance_To_Roadways")
plt.ylabel("Frequency")
plt.title("Histogram of Horizontal_Distance_To_Roadways")
plt.legend()

# Show the plot
plt.show()

These fields were chosen because the first lost a lot of data to 3\*IQR and the second lost none. But the effect of clipping is very obvious: we're introducing giant spikes at the ends of the orange histogram. That means we've been clipping way too aggressively. There shouldn't be any visible spikes here. In fact, the guidelines are:

| Percentile range | Portion modified | Description         |
| ---------------- | ---------------- | ------------------- |
| 0.5%–99.5%       | 1%               | Conservative        |
| 1%–99%           | 2%               | Mild                |
| 2.5%–97.5%       | 5%               | Moderate/Aggressive |
| 5%–95%           | 10%              | Very aggressive     |

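The "Portion modified" column can be checked with a quick experiment: clipping at the p-th and (100 - p)-th percentiles changes roughly 2p percent of the values. The sketch below uses synthetic normal data (an assumption, not the forest data), and the helper name `share_modified` is made up for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one continuous column
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(size=100_000))

def share_modified(series, lower_q, upper_q):
    """Fraction of values that change when clipping to the given quantile range."""
    lo, hi = series.quantile([lower_q, upper_q])
    clipped = series.clip(lower=lo, upper=hi)
    return float((clipped != series).mean())

ranges = [(0.005, 0.995), (0.01, 0.99), (0.025, 0.975), (0.05, 0.95), (0.20, 0.80)]
for lower_q, upper_q in ranges:
    share = share_modified(values, lower_q, upper_q)
    print(f"{lower_q:.1%} .. {upper_q:.1%}: {share:.1%} of values modified")
```

The last range is the 20%/80% clip: it rewrites about 40% of the column, which is why it produces such tall spikes at the boundaries.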

Let's try again with the conservative 0.5%–99.5% range, this time without deleting the extremes first.

In [None]:
# Up to you!



And the graph for Vertical_Distance_To_Hydrology?

In [None]:
import matplotlib.pyplot as plt

# Plot histogram for "Vertical_Distance_To_Hydrology" from X and df
plt.figure(figsize=(10, 6))
bin_size = 10
min_value = min(X["Vertical_Distance_To_Hydrology"].min(), df["Vertical_Distance_To_Hydrology"].min())
max_value = max(X["Vertical_Distance_To_Hydrology"].max(), df["Vertical_Distance_To_Hydrology"].max())
bins = range(int(min_value), int(max_value) + bin_size, bin_size)

plt.hist(X["Vertical_Distance_To_Hydrology"], bins=bins, alpha=0.5, label="Original Data (X)", color='blue')
plt.hist(df["Vertical_Distance_To_Hydrology"], bins=bins, alpha=0.5, label="Modified Data (df)", color='orange')

# Add labels, title, and legend
plt.xlabel("Vertical_Distance_To_Hydrology")
plt.ylabel("Frequency")
plt.title("Histogram of Vertical_Distance_To_Hydrology")
plt.legend()

# Show the plot
plt.show()

Way better. Still some spikes, but at the very least they're not higher than the data anymore. The other graph?

In [None]:
import matplotlib.pyplot as plt

# Plot histogram for "Horizontal_Distance_To_Roadways" from X and df
plt.figure(figsize=(10, 6))
bin_size = 100
min_value = min(X["Horizontal_Distance_To_Roadways"].min(), df["Horizontal_Distance_To_Roadways"].min())
max_value = max(X["Horizontal_Distance_To_Roadways"].max(), df["Horizontal_Distance_To_Roadways"].max())
bins = range(int(min_value), int(max_value) + bin_size, bin_size)

plt.hist(X["Horizontal_Distance_To_Roadways"], bins=bins, alpha=0.5, label="Original Data (X)", color='blue')
plt.hist(df["Horizontal_Distance_To_Roadways"], bins=bins, alpha=0.5, label="Modified Data (df)", color='orange')

# Add labels, title, and legend
plt.xlabel("Horizontal_Distance_To_Roadways")
plt.ylabel("Frequency")
plt.title("Histogram of Horizontal_Distance_To_Roadways")
plt.legend()

# Show the plot
plt.show()

So to conclude: when clipping (or doing winsorization), keep it chill.