# Random Forest Introduction

John Lawson
Class 496

Random forests (RFs) are machine-learning (ML) models built from a large collection of decision trees. Each tree is trained on a slightly different random sample of the data to simulate "alternate realities" that reflect our uncertainty about the true state of the system (system = lake-breeze front, for our project).

We will use an introductory form of the RF, a *classifier*. Let's say we want to predict the species of a flower from characteristics of the flower itself (the lengths and widths of its petals and sepals). This will draw from a famous dataset, "Iris", that is a good way to test any ML model and become familiar with it. In our case, the Python package we will use for ML, `scikit-learn`, has this dataset baked in, making it easy to load when testing new ML systems (the RF is only one of many!).

OK. Bear with the AI jargon. It's coming.

This RF Classifier takes input variables (like the length and width of a flower's petal) and predicts from these which class (like a species) the flower belongs to. Because this RF may only make a choice from the classes ("labels"), its answer alone does not tell us how likely that choice is to be correct. Let's jump for a minute back to meteorology. Swap species for weather events occurring, and petal width/length for sensible-weather variables (temperature, moisture, etc.), and you can see where we are going: __We want our RF to predict one of two classes (a LBF will pass; a LBF will not pass) using available weather data.__ Here, "available" is down to you as the student and researcher.

Anyway, back to our example so we get an initial feel.

In [110]:
import sklearn
from sklearn import datasets
iris = datasets.load_iris()

# The classes, or labels, are the species of plant. It's our predictand.
# Here, they are called "target names" -- see what I mean about jargon?!
labels = list(iris.target_names)

# The features (or predictors) are the variables we are learning from to
# determine the system's class (i.e., the species of plant).
features = list(iris.feature_names)

# Fun trick for checking your Python code:
# This is called an f-string (formatted string literal)
# Use an equals sign after a variable to have its name printed (Python 3.8+)
print(f"{labels=}")
print(f"{features=}")

labels=['setosa', 'versicolor', 'virginica']
features=['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


Now let's look at the data. We'll use the package `pandas` to interact with the dataset. It's like breeding a spreadsheet (like Excel) with `numpy`.

In [111]:
import pandas as pd
# import numpy as np
df_features = pd.DataFrame(iris.data)
print(df_features)

       0    1    2    3
0    5.1  3.5  1.4  0.2
1    4.9  3.0  1.4  0.2
2    4.7  3.2  1.3  0.2
3    4.6  3.1  1.5  0.2
4    5.0  3.6  1.4  0.2
..   ...  ...  ...  ...
145  6.7  3.0  5.2  2.3
146  6.3  2.5  5.0  1.9
147  6.5  3.0  5.2  2.0
148  6.2  3.4  5.4  2.3
149  5.9  3.0  5.1  1.8

[150 rows x 4 columns]


As you can see, we need to make this more organised. Here, columns 0-3 are the features, and each row is a different sample. We can leave the rows un-named and their order is arbitrary.


In [112]:
df_features.columns = features
print(df_features)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]


Let's see what each sample's species turned out to be (this is *truth*, and we assume the investigator logging the data is 100% sure of the species):


In [113]:
df_labels = pd.DataFrame(iris.target)
df_labels.columns = ["Species"]
print(df_labels)

     Species
0          0
1          0
2          0
3          0
4          0
..       ...
145        2
146        2
147        2
148        2
149        2

[150 rows x 1 columns]


Rename the numbers with their label names just for humans.

In [114]:
label_dict = {n:labels[n] for n in (0,1,2)}
df_labels_named = df_labels.replace({"Species":label_dict})
print(df_labels_named)

       Species
0       setosa
1       setosa
2       setosa
3       setosa
4       setosa
..         ...
145  virginica
146  virginica
147  virginica
148  virginica
149  virginica

[150 rows x 1 columns]
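
Going between integers and names (and back again) is worth practising, because we score forecasts with the integers but read them with the names. Here is a minimal sketch of the round trip using `Series.map`, as an alternative to the `replace` call above. The variable names (`codes`, `int_to_name`, `name_to_int`) are my own, chosen just for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
labels = list(iris.target_names)

# Integer codes (0, 1, 2) exactly as stored in the dataset
codes = pd.Series(iris.target, name="Species")

# Codes -> names: look up each integer in a small dictionary
int_to_name = dict(enumerate(labels))
names = codes.map(int_to_name)

# Names -> codes: invert the dictionary and map back
name_to_int = {name: n for n, name in int_to_name.items()}
codes_again = names.map(name_to_int)

# The round trip should hand back exactly what we started with
print(names.value_counts())
print((codes_again == codes).all())
```

If the last line prints anything other than `True`, the mapping has gone wrong somewhere and the labels can no longer be trusted.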


So there's our data looking quite neat. It's time to start our machine learning!

In [115]:
# Split the dataset up between training and testing.
# We train the RF with half of the dataset, showing it properties of different species.
# Then we want to test it like a student, and hold back the answers (labels) so we can grade it!

# I'll use f_ for features (variables, predictors) and
# l_ for labels (species, classes, predictands)
from sklearn.model_selection import train_test_split as tts
f_train, f_test, l_train, l_test = tts(df_features.values,df_labels,test_size=0.5)

In [116]:
# Have a play with the data here:

Importantly, this function `train_test_split` (which I aliased to `tts` to make it easier to type repeatedly) shuffles the features and labels together, in the same way, before slicing off half for testing (as we picked a value of 0.5 for test size). Because the shuffle is random, your numbers will differ from mine each time you run it. This also means we should be careful to preserve the order of the outputs. This is a frequent place for trip-ups, in my experience, as is re-assigning class names (e.g., species names) to integers.
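
If you want to convince yourself that features and labels stay paired through the shuffle, here is a minimal sketch. It passes a third array of row numbers (`row_ids`, my own helper, not part of the dataset) through the same split, and fixes `random_state` so the split is repeatable. The value 0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Tag each sample with its original row number so we can track the shuffle
row_ids = np.arange(len(iris.target))

# train_test_split accepts any number of arrays and shuffles them identically
f_train, f_test, l_train, l_test, id_train, id_test = train_test_split(
    iris.data, iris.target, row_ids,
    test_size=0.5,
    random_state=0,  # assumption: any fixed integer makes the split repeatable
)

# Every test feature row still sits next to its own label
print(np.array_equal(f_test, iris.data[id_test]))
print(np.array_equal(l_test, iris.target[id_test]))
print(f_train.shape, f_test.shape)
```

Leaving `random_state` out, as in the cell above, gives a different split on every run.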

Now let's train our forest of trees! (That's a tautology, surely?)

In [117]:
from sklearn.ensemble import RandomForestClassifier as RFC
import numpy as np

# Time to turn to the official online (or locally installed) documentation
# What are the estimators? How do we pick this one number?
rfc = RFC(n_estimators=100)

rfc.fit(f_train,np.ravel(l_train))
fcst = rfc.predict(f_test)

# Make this easier to read with DataFrames and first 10 entries
print(pd.DataFrame(fcst).head(10))

   0
0  2
1  2
2  1
3  2
4  1
5  0
6  2
7  0
8  0
9  1


This is the predicted species for each sample. Shall we look at the answers for the same samples? (Row *n* of the forecast must be compared with row *n* of the answers, but which sample landed in which row is arbitrary.)



In [118]:
# print(pd.DataFrame({"Forecast":fcst,"Actual":np.ravel(l_test)}))
print(pd.DataFrame(l_test.values).head(10))

   0
0  2
1  1
2  1
3  2
4  1
5  0
6  2
7  0
8  0
9  1


How did it do? Use your eyes first with any data analysis. It's our first defence against rubbish.
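
To make the eyeballing easier, it helps to put forecast and truth in one table. Here is a minimal, self-contained sketch that does this and then counts hits and misses per species in a confusion matrix. The fixed `random_state=0` is my assumption for repeatability, so the numbers will not match the cells above.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
f_train, f_test, l_train, l_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(f_train, l_train)
fcst = rfc.predict(f_test)

# Forecast and truth side by side, plus a column flagging agreement
compare = pd.DataFrame({"Forecast": fcst, "Actual": l_test})
compare["Match"] = compare["Forecast"] == compare["Actual"]
print(compare.head(10))

# Confusion matrix: rows are the true species, columns the forecast species
cm = confusion_matrix(l_test, fcst, labels=[0, 1, 2])
print(pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names))
```

Numbers on the diagonal are correct forecasts. Anything off the diagonal tells you which species got confused with which.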

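Of course, it's important to evaluate the performance with maths rather than just by eye, and here we'll look at the R-squared score, also called the coefficient of determination (remember evaluating lines of best fit?), to keep things simple. We score the integer labels rather than the species names. Strictly, R-squared treats those class numbers as quantities, so for a classifier it is only a rough guide.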

In [119]:


from sklearn import metrics
print(metrics.r2_score(l_test.values,fcst))

0.8471170646476412


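An R-squared of about 0.85?! That's pretty sweet. (Note this is not the same as being 85% accurate: R-squared is not the fraction of samples we got right.)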
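
For a classifier, the more natural score is plain accuracy: the fraction of test samples whose forecast species matches the truth. Here is a minimal sketch that prints both scores side by side using `metrics.accuracy_score` and `metrics.r2_score`. The fixed `random_state=0` is my assumption for repeatability, so your values will differ from the output above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
f_train, f_test, l_train, l_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(f_train, l_train)
fcst = rfc.predict(f_test)

# Accuracy: fraction of forecasts that exactly match the true label
accuracy = metrics.accuracy_score(l_test, fcst)

# R-squared: treats the labels 0, 1, 2 as numbers on a scale
r2 = metrics.r2_score(l_test, fcst)

print(f"{accuracy=:.3f}")
print(f"{r2=:.3f}")
```

The two numbers answer different questions, so don't be surprised when they disagree.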

Already you can see how scarily easy (relatively!) this appears, but how difficult it gets making sense of the various dials you can tune. (Look at that documentation for this `RFC` class!) Not to mention we haven't talked about how these RFs __really__ work. But one step at a time.
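
As a first peek under the bonnet, here is a minimal sketch that asks each tree in a fitted forest for its own opinion on a single test sample. It uses the fitted classifier's `estimators_` list (one entry per tree) and `predict_proba`, which in scikit-learn averages the class probabilities from all the trees. The `random_state=0` and the choice of the first test sample are my assumptions, purely for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
f_train, f_test, l_train, l_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(f_train, l_train)

# The fitted forest keeps its individual trees in a list
print(len(rfc.estimators_))

# Ask every tree about the first test sample and count the votes per class
sample = f_test[:1]
votes = np.array([int(tree.predict(sample)[0]) for tree in rfc.estimators_])
vote_counts = np.bincount(votes, minlength=3)
print(dict(zip(iris.target_names, vote_counts)))

# The forest's combined view: averaged class probabilities, then the final pick
proba = rfc.predict_proba(sample)
print(proba)
print(rfc.predict(sample))
```

So although `predict` only hands back a label, the spread of opinion among the trees gives a rough sense of how confident the forest is.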


Brainstorm what else you need to know so you're not just using the RF like a 6-year-old using a microwave. Also think about:

- It will be important not only to see tables printed out as a sanity check (see above), but also to visualise your data in graphical form. Take a look at the scikit-learn.org examples (such as `plot_forest_iris`) to see different ways people like to see the results. It can be quite confusing as a meteorologist wading into this.
- We don't just have four variables. We will have a lot to sort through. How do we rank the importance of each one to our prediction accuracy? Why, it's literally *feature importance*, and it will help us chuck out variables we care less about.
- How are we getting our features and labels? There is an online repository for weather observations called MesoWest (`mesowest.utah.edu`), but I won't expect you to spend weeks writing a Python script to download these variables. I will give you a dataset in time that will have lots of observations from the Chicago area. The time period will depend on the observation stations we have. We probably want as large a dataset as possible (as long as the forecast doesn't take too long to do).
- What is our "label"? It is "yes/no": a LBF will pass, or it won't. We are missing a vital variable that cannot be measured with an instrument: the LBF passage itself. I will have a script for you coming up that will take some observed variables (dry-bulb and dewpoint temperatures; wind components; pressure; large-scale wind strength) to create a best-guess dataset of which days in our training time period we can use to train our RF.
- Don't forget time of day, latitude/longitude of observation station, etc. will be important. I'll show you a funky way of getting time of day and day of year into the weather model. (Because it doesn't work if you just enter numbers from 0 to 23! Think about why when dealing with time-of-day, which is cyclical...)
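
On the feature-importance point in the list above, here is a minimal sketch using the Iris data again. A fitted scikit-learn forest exposes a `feature_importances_` attribute, one number per feature, which sum to one. This is only one way of measuring importance (it is based on how much each feature helps the trees split the data), and for brevity I fit on the whole dataset with an arbitrary `random_state=0`.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(iris.data, iris.target)

# One importance value per feature; attach the feature names for readability
importance = pd.Series(rfc.feature_importances_, index=iris.feature_names)

# Largest first: these are the variables the forest leans on most
print(importance.sort_values(ascending=False))
```

With dozens of weather variables instead of four, a ranked list like this is a starting point for deciding what to keep.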