# Lab assignment: Isolation Forests

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/omaForest.jpg"/>

<div style="float: right;">(Oma Forest, Basque Country, photo by <a href=https://commons.wikimedia.org/wiki/File:Bosque_de_Oma_%2821%29.JPG>Wikipedia Commons</a>)</div>

In this assignment we will use the Isolation Forest method for density estimation and outlier detection tasks. We will first use this model to estimate the density of an unknown distribution, and then apply it to the everyday task of saving our lives on board a submarine. Because lab assignments are so much fun when you might not get to survive them. ^__^

## Guidelines

Throughout this notebook you will find empty cells that you will need to fill with your own code. Follow the instructions in the notebook and pay special attention to the following symbols.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
You will need to solve a question by writing your own code or answer in the cell immediately below or in a different file, as instructed.</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
This is a hint or useful observation that can help you solve this assignment. You should pay attention to these hints to better understand the assignment.
</font>

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/pro.png" height="80" width="80" style="float: right;"/>

***
<font color=#259b4c>
This is an advanced exercise that can help you gain a deeper knowledge into the topic. Good luck!</font>

***

To avoid missing packages and compatibility issues you should run this notebook under one of the [recommended Ensembles environment files](https://github.com/albarji/teaching-environments-ensembles).

The following code will embed any plots into the notebook instead of generating a new window, and set the random seed.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import numpy as np
np.random.seed(42)

Lastly, if you need any help on the usage of a Python function you can place the writing cursor over its name and press Shift+Tab to produce a pop-out with related documentation. This will only work inside code cells. 

Let's go!

## Isolation Forest basics

The Isolation Forest algorithm is a simple modification of the Random Forest algorithm, and scikit-learn already implements these modifications for us, so we can easily make use of this anomaly detection model. We just need to import the **IsolationForest** class:

In [None]:
from sklearn.ensemble import IsolationForest

The IsolationForest object follows closely the interface of other machine learning methods available in scikit-learn. That means we just need to follow the usual steps of creating, fitting and then making use of the fitted model. At creation time we can also specify the different parameters of the model. Creating an IsolationForest without arguments will use the recommended values, as we can see here:

In [None]:
model = IsolationForest()
model.get_params()

The most important parameters are the number of trees in the forest (*n_estimators*) and the ratio of training data that will be regarded as outliers (*contamination*). Other parameters of interest are the maximum number of instances to use for each tree in the forest (*max_samples*), and the maximum number of training features to consider in each tree (*max_features*). We will come back to these parameters later. First, let's try the Isolation Forest on a simple problem.
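As a quick illustration of these four parameters, the following sketch builds a forest with explicit values. The numbers and the name `example_model` are arbitrary choices for demonstration, not recommendations from this assignment.

```python
from sklearn.ensemble import IsolationForest

# Illustrative values only: none of these numbers come from the assignment
example_model = IsolationForest(
    n_estimators=200,      # number of trees in the forest
    contamination=0.05,    # expected fraction of outliers in the training data
    max_samples=256,       # instances drawn to build each tree
    max_features=1.0,      # fraction of features drawn for each tree
    random_state=42,       # fix the seed so results are reproducible
)
params = example_model.get_params()
print(params["n_estimators"], params["contamination"])
```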

## Density estimation

Let us solve a simple density estimation problem. First we need data, which we will generate synthetically:

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
Create a numpy array <b>X</b> of two dimensions (a matrix), with 10000 rows and a single column. Each row should contain a random number following a gaussian distribution.

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
You can generate random numbers following a given distribution using the functions in the <a href=https://docs.python.org/3/library/random.html>random</a> module.
</font>

***
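As a sketch of the expected array shape only, NumPy's random generators accept a `size` argument that directly produces a matrix with the requested rows and columns. The example deliberately uses a uniform distribution rather than the Gaussian the exercise asks for, and the names `rng` and `example` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)        # illustrative generator, arbitrary seed
example = rng.uniform(size=(5, 1))    # 5 rows, 1 column, uniform in [0, 1)

print(example.shape)   # (5, 1)
print(example.ndim)    # 2
```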

In [None]:
####### INSERT YOUR CODE HERE

Let's plot the data to make sure they make sense. The following should look like a gaussian distribution:

In [None]:
import seaborn as sns
sns.displot(X)

For the sake of this assignment let's now assume that we have forgotten where this data came from. You can attribute this to extreme hangover, unexpected effects of a global pandemic, or an extraterrestrial eldritch conspiracy. In such unfortunate circumstances we will make use of the Isolation Forest. First, let's train a forest with 1000 trees:

In [None]:
isoforest = IsolationForest(n_estimators=1000)
isoforest.fit(X)

Now let's create some synthetic test data, uniformly spaced between -4 and 4. We will estimate the density of the original distribution over these data points:

In [None]:
testpoints = [[x] for x in np.linspace(-4, 4, 200)]

Scikit-learn's implementation of the IsolationForest is mostly focused on anomaly detection, not on density estimation. Because of this, the IsolationForest model does not provide a direct way of obtaining the average depth reached by a test data point on the forest trees, which is the usual way of computing densities on a tree. It does, however, provide a normalized depth in the form of the *decision_function* method. For instance, if we take a point in the center of the Gaussian (0) and another point far away from it (3) we get:

In [None]:
isoforest.decision_function([[0], [3]])

That is, we obtain lower (more negative) values for the points corresponding to lower density. These values are actually a normalization of the average depths reached by each test point across the trees in the IsolationForest. The normalization is done in such a way that points with negative values should be considered anomalies. This normalization depends on the data seen at training time and on the contamination parameter provided. We will cover more on this later. For now, the relevant point is that the values returned by *decision_function* can be thought of as "unnormalized" probability densities, and as such we can plot them for our test data as follows:

In [None]:
densities = isoforest.decision_function(testpoints)
plt.plot(testpoints, densities)

If everything has worked correctly you should be able to see something that resembles the shape of the original Gaussian distribution.
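To make the normalization behind `decision_function` a bit more concrete, the sketch below compares it with `score_samples`, which returns the scores before the anomaly threshold is subtracted. In scikit-learn the two differ only by the fitted `offset_` attribute. The data and the names `X_demo`, `forest_demo` and `points` are illustrative and independent of the cells above.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 1))              # illustrative Gaussian sample

forest_demo = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

points = np.array([[0.0], [3.0]])
raw = forest_demo.score_samples(points)          # scores before thresholding
shifted = forest_demo.decision_function(points)  # same scores minus offset_

# decision_function is score_samples shifted by the fitted offset
print(np.allclose(shifted, raw - forest_demo.offset_))
# The point near the mode scores higher than the point in the tail
print(shifted[0] > shifted[1])
```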

## Submarine mine hunting

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/radar.png" style="width:300px;height:300;">

Now we will apply this machine learning technique to a real-world problem: spotting mines through a submarine radar. This knowledge might be useful the next time you visit your regular grocery store, so pay attention.

Our submarine is equipped with a radar that returns readings of our surroundings at different wavelengths. Whenever an object is detected, a pattern of variables is created, which we need to analyze to tell whether that object is a dangerous mine or just a simple and boring rock. The problem is, enemy mines are built with brand-new technology and we don't know what they look like. So even though we have data telling us what a rock looks like, we don't have any mine data. You might have noticed, however, how important it is for your near-future survival to spot these mines.

Isolation Forests to the rescue! Let's start with the available data:

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 Load the training data into a numpy array <b>X</b>. You can find the data as the <i>sonar.train</i> file. All patterns are unlabeled, so all variables present in the file are input features.

***

In [None]:
####### INSERT YOUR CODE HERE

In [None]:
print("Rock training data:", X)

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 Train an IsolationForest on the rocks data. Then generate a visualization of the distribution of unnormalized densities that are obtained when using the IsolationForest over the same training data.

***

In [None]:
####### INSERT YOUR CODE HERE

If the model is correctly trained you should see that most values are positive.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/alarm.jpg" style="width:512px;height:384;">
<center><pre>
<span style="color:#FF8080">
#################################################
### ALERT ALERT ALERT ALERT ALERT ALERT ALERT ###
#################################################
</span>
</pre></center>

There is no more time, the enemy ships are attacking! Our intelligence reports say there are **15 mines** in this sector. You must find as many as you can, or else...

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 The file <i>sonar.test</i> contains labeled data, where boring rocks have been labeled as -1 and life-threatening mines as +1. Load the contents of this file into a numpy matrix <i>Xtest</i> containing the input features, and a list <i>ytest</i> containing the labels.

***
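The exact layout of <i>sonar.test</i> is not described here, so the following is only a sketch under two assumptions: values are comma-separated, and the label sits in the last column. It reads from an in-memory string (the hypothetical `fake_file`) so that it runs without the real file. Check the actual file before adapting it.

```python
import io
import numpy as np

# Hypothetical file contents: two features followed by a label (assumed layout)
fake_file = io.StringIO("0.10,0.20,-1\n0.90,0.80,1\n0.15,0.25,-1\n")

data = np.loadtxt(fake_file, delimiter=",")   # the delimiter is an assumption
features = data[:, :-1]                       # every column except the last
labels = data[:, -1].astype(int).tolist()     # last column as a list of ints

print(features.shape)   # (3, 2)
print(labels)           # [-1, 1, -1]
```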

In [None]:
####### INSERT YOUR CODE HERE

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 Now run the test input features through the isolation forest you trained above. The 15 objects with lowest density must be the mines! Label those objects as mines and check the accuracy of your prediction! How many mines did you manage to spot?

***

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/exclamation.png" height="80" width="80" style="float: right;"/>

***
<font color=#2655ad>
You will need to use the <b>decision_function</b> method of the IsolationForest to obtain density values, sort those values to find the lowest ones, and recover which objects were assigned such values. Then you can check against <b>ytest</b> whether those objects are real mines or not.
</font>

***
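The sort-and-recover step in the hint can be sketched with `np.argsort`, which returns the indices that would sort an array. The scores and labels below are made up for illustration and are not taken from the sonar data.

```python
import numpy as np

# Hypothetical density values and labels for six objects (+1 mine, -1 rock)
scores = np.array([0.12, -0.30, 0.05, -0.10, 0.20, 0.01])
labels = np.array([-1, 1, -1, 1, -1, 1])

k = 2                                      # number of objects to flag
lowest = np.argsort(scores)[:k]            # indices of the k smallest scores
hits = int(np.sum(labels[lowest] == 1))    # flagged objects that are real mines

print(lowest)   # [1 3]
print(hits)     # 2
```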

In [None]:
####### INSERT YOUR CODE HERE

## Using labeled samples

If some labeled samples are available, even if there are only a few of them, we can use them to adjust the parameters of the IsolationForest. Let's suppose we have a small sample of data from 5 mines.

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 The file <i>mines.train</i> contains labeled data from 5 mines we can use for training. Create a new numpy array named <i>X2</i> that contains the input training data from both rocks and mines.

***

In [None]:
####### INSERT YOUR CODE HERE

With this we have a mix of rocks and mines in the training data. We can use this mixture to choose an appropriate value for the **contamination** argument of the IsolationForest. Ideally, the contamination should be equal to the fraction of anomalies we have in our training data.
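As a sketch of this idea on made-up data, the fraction of anomalies can be computed from the sizes of the two groups and passed to the constructor. Scikit-learn expects a numeric contamination value in the range (0, 0.5]. All arrays and names below are hypothetical and unrelated to the sonar files.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal_demo = rng.normal(size=(95, 2))               # hypothetical normal patterns
anomalous_demo = rng.normal(loc=6.0, size=(5, 2))    # hypothetical anomalies
mixed = np.vstack([normal_demo, anomalous_demo])

fraction = len(anomalous_demo) / len(mixed)          # 5 / 100 = 0.05
forest_demo = IsolationForest(contamination=fraction, random_state=0).fit(mixed)

# With this setting, roughly 5% of the training points fall below zero
flagged = int(np.sum(forest_demo.decision_function(mixed) < 0))
print(fraction, flagged)
```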

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
 Build a new IsolationForest named <i>isoforest</i>, setting the <i>contamination</i> parameter adequately. The value for this parameter should be the fraction of mines present in the training data X2. Then fit this model using X2.
    
***

In [None]:
####### INSERT YOUR CODE HERE

Once we have an IsolationForest with a well-adjusted contamination ratio, we can use it to directly generate predictions on which test data points are likely to be anomalies. We can do this with the *predict* method from the IsolationForest model we just fitted:

In [None]:
isoforest.predict(Xtest)

The IsolationForest returns 1 for those patterns that seem normal, and -1 for those that look like anomalies. Note that this is the opposite sign convention to the labels in <i>sonar.test</i>, where mines are labeled +1. Also note that we have not told the IsolationForest how many mines there are in the test data: it is trying to infer this quantity by itself. This should work correctly if we have provided an adequate contamination value at construction time.
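A sketch of how such predictions could be compared against labels, using made-up arrays rather than the real test set. The names `predictions` and `ytest_demo` are hypothetical.

```python
import numpy as np

# Hypothetical outputs: predict uses -1 for anomalies, the labels use +1 for mines
predictions = np.array([1, -1, 1, -1, -1])
ytest_demo = np.array([-1, 1, -1, -1, 1])

predicted_mines = predictions == -1    # anomalies flagged by the forest
true_mines = ytest_demo == 1           # actual mines according to the labels

detected = int(np.sum(predicted_mines & true_mines))
false_alarms = int(np.sum(predicted_mines & ~true_mines))
print(detected, false_alarms)   # 2 1
```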

<img src="https://albarji-labs-materials.s3-eu-west-1.amazonaws.com/question.png" height="80" width="80" style="float: right;"/>

***

<font color=#ad3e26>
How many mines from the test set have you successfully detected now?
    
***

In [None]:
####### INSERT YOUR CODE HERE

As a final note, the IsolationForest model does not implement a `predict_proba` method. If we need to obtain soft predictions (for instance, to plot a ROC curve), we can resort again to the `decision_function` method.

In [None]:
isoforest.decision_function(Xtest)
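As a sketch of the soft-prediction use case, the values from `decision_function` can be negated, so that larger means more anomalous, and passed to `roc_auc_score` from `sklearn.metrics`. The data below is synthetic and the variable names are illustrative. With the real data you would use your own test features and labels instead.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
train_demo = rng.normal(size=(500, 2))                  # hypothetical normal data
test_normal = rng.normal(size=(100, 2))
test_anomalies = rng.normal(loc=6.0, scale=0.5, size=(20, 2))

X_eval = np.vstack([test_normal, test_anomalies])
y_eval = np.array([-1] * 100 + [1] * 20)                # +1 marks the anomalies

forest_demo = IsolationForest(random_state=0).fit(train_demo)

# Negate so that larger values mean "more anomalous", matching the +1 label
anomaly_score = -forest_demo.decision_function(X_eval)
auc = roc_auc_score(y_eval, anomaly_score)
print(round(auc, 3))
```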