# Part 3: Machine Learning

Now that we have prepared our data and done some feature engineering, we can try to use a machine learning algorithm to predict the target feature (direction) from the training features (brightness and motor speed).

First, we import the required modules and load the data.

In [None]:
import os
import collections

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score

In [None]:
DATAPATH_INTERMEDIATE = "../../.assets/data/pelf/temp/"

In [None]:
df = pd.read_pickle(os.path.join(DATAPATH_INTERMEDIATE, "data_total_normalized.pkl.zip"))

### Reduce the size of the data
Running the following algorithms on the whole dataset would take a long time, so we randomly sample half of the rows. That still leaves around 100,000 data points to look at!

In [None]:
df = df.sample(frac=0.5)
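
Because `sample` draws rows at random, every run of the notebook works on a different half of the data. A minimal sketch of how the draw could be made repeatable, assuming a small toy DataFrame (`toy` and its values are hypothetical and stand in for `df`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's DataFrame (hypothetical values).
toy = pd.DataFrame({"value": np.arange(10)})

# Passing random_state fixes the random draw, so the subset is repeatable.
half_a = toy.sample(frac=0.5, random_state=0)
half_b = toy.sample(frac=0.5, random_state=0)

print(len(half_a))            # number of rows kept
print(half_a.equals(half_b))  # True if both draws picked the same rows
```

Whether a fixed seed is wanted depends on the goal: it helps when results should be comparable between runs, while an unseeded draw shows how much the results fluctuate.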

## Split into train and test set

When using machine learning, you usually divide your dataset into two subsets. The first subset is called the "training set" and is used to fit the model. The second subset is called the "test set" and is used to validate the model's predictions on data it did not see during fitting.

Here we collect the column names that we want to use for the modelling. `features` contains all columns whose name includes "brightness" or "motor"; `target` contains the four one-hot encoded direction columns.

In [None]:
features = [name for name in df.columns if ("brightness" in name) or ("motor" in name)]
target = ["direction_S", "direction_N", "direction_E", "direction_W"]

We could use different sizes for the training and test set; here we use a split of 90% training and 10% test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.1, train_size=0.9)

We use the column names defined above to select the relevant columns. The x-y positions stay in `df`, so we can look them up via the row index for later investigation.
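
The following sketch illustrates the split on toy data. It assumes made-up column names and values (`toy_features`, `toy_target` and the seed are hypothetical) and only shows that the two subsets are disjoint and that features and targets keep the same row index, which is what makes a later index-based lookup possible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical feature columns, loosely modelled on the notebook's naming.
toy_features = pd.DataFrame({
    "brightness_1": rng.random(100),
    "motor_1": rng.random(100),
})

# One-hot target with 25 samples per direction.
labels = np.tile(np.arange(4), 25)
toy_target = pd.DataFrame(
    np.eye(4, dtype=int)[labels],
    columns=["direction_S", "direction_N", "direction_E", "direction_W"],
)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_features, toy_target, test_size=0.1, random_state=0
)

print(X_tr.shape, X_te.shape)            # 90 training rows, 10 test rows
print(X_tr.index.equals(y_tr.index))     # features and targets stay aligned
print(set(X_tr.index) & set(X_te.index)) # empty set: no row is in both subsets
```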

## Decision Tree Classifier

As a first example we use a decision tree classifier, which applies a chain of threshold decisions on different columns (e.g. `val > 0.5` or `val <= 0.5`) to arrive at a prediction.
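
To make the "chain of decisions" concrete, here is a minimal sketch on toy data with a single hypothetical feature `val`. It fits a tree of depth 1 and prints the learned rule with scikit-learn's `export_text`; the data are invented for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: class 1 whenever the single feature is larger than 0.5.
X_toy = np.array([[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_toy, y_toy)

# Human-readable version of the decision rule the tree has learned.
print(export_text(tree, feature_names=["val"]))

# Points on either side of the learned threshold get different classes.
toy_prediction = tree.predict([[0.45], [0.55]])
print(toy_prediction)
```

A deeper tree simply nests more of these threshold tests below each other.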

In [None]:
depth = 10

DTR = DecisionTreeClassifier(max_depth=depth)

In [None]:
DTR.fit(X_train, y_train)

In [None]:
predicted_y = DTR.predict(X_test)

The first metric we can look at is the pure accuracy of the predictions.

In [None]:
metrics.accuracy_score(y_test, predicted_y)

The accuracy is high, which already looks very promising. Let us also try other validation techniques, such as the confusion matrix.

In [None]:
metrics.confusion_matrix(y_test.values.argmax(axis=1), predicted_y.argmax(axis=1))

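The rows correspond to the true classes and the columns to the predicted classes, in the order of `target`. The diagonal holds the correct predictions; every off-diagonal entry counts predictions that are wrong.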
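
The two metrics treat one-hot targets slightly differently, which a small hand-made example can show. In this sketch (all arrays are invented), `accuracy_score` on a 0/1 indicator matrix counts a row as correct only if all four columns match, while the `argmax` step used for the confusion matrix maps an all-zero prediction to the first class. Whether such all-zero rows occur for the models in this notebook is not checked here.

```python
import numpy as np
from sklearn import metrics

# Columns in the order S, N, E, W (hypothetical toy labels).
y_true = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])
y_pred = np.array([
    [1, 0, 0, 0],  # correct
    [0, 1, 0, 0],  # correct
    [0, 1, 0, 0],  # E predicted as N
    [0, 0, 0, 1],  # correct
    [0, 0, 0, 0],  # no class predicted at all
])

# Row-wise exact match: 3 of 5 rows agree in every column.
acc = metrics.accuracy_score(y_true, y_pred)
print(acc)

# argmax turns each row into a single class index; the all-zero row
# becomes index 0 (S) and therefore lands on the diagonal.
cm = metrics.confusion_matrix(y_true.argmax(axis=1), y_pred.argmax(axis=1))
print(cm)
print(cm.trace() / cm.sum())
```

If the diagonal share of the confusion matrix is noticeably higher than the accuracy score, all-zero or multi-class predictions are a likely cause.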

These are surprisingly good results. If we reduce the depth, however, performance drops sharply, because the tree becomes too shallow to resolve the differences needed to assign the directions accurately.
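
The effect of the depth can be explored with a simple loop. The sketch below uses synthetic four-class data from `make_classification` instead of the notebook's dataset, so the scores it prints say nothing about the real data; the depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four classes (not the notebook's data).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

depth_scores = {}
distinct_predictions = {}
for d in (1, 2, 5, 10):
    model = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_tr, y_tr)
    depth_scores[d] = model.score(X_te, y_te)
    # Number of different classes the tree actually predicts on the test set.
    distinct_predictions[d] = len(set(model.predict(X_te)))
    print(f"depth={d:2d}  accuracy={depth_scores[d]:.3f}  "
          f"classes predicted={distinct_predictions[d]}")
```

A tree of depth 1 has only two leaves and can therefore predict at most two of the four classes, which puts a hard limit on its accuracy.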

Let us check where the incorrect cases are located.

In [None]:
# A prediction is correct only if all four one-hot columns match.
correct_mask = (y_test == predicted_y).all(axis=1)
X_test_correct = X_test[correct_mask]
X_test_incorrect = X_test[~correct_mask]

df_correct = df[df.index.isin(X_test_correct.index)]
df_incorrect = df[df.index.isin(X_test_incorrect.index)]

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df_correct["x_measured"].values, df_correct["y_measured"].values, marker=".", color="blue")
plt.scatter(df_incorrect["x_measured"].values, df_incorrect["y_measured"].values, marker="^", color="red", label="Failure")
plt.legend()
plt.title(f"Performance of the decision tree with depth: {depth}")
plt.xlabel("Relative X-Position", fontsize=20)
plt.ylabel("Relative Y-Position", fontsize=20);

### Feature importance

An important aspect of decision trees is the importance of the individual features, i.e. how much each feature contributes to the decisions the tree makes for a data point. Looking into these can (but will not necessarily) yield insight into possible correlations between input and output.

In [None]:
feature_importance_df = pd.DataFrame()
feature_importance_df["Col_name"] = X_test.columns
feature_importance_df["Col_weight"] = DTR.feature_importances_
feature_importance_df.sort_values("Col_weight",ascending=False).head(10)
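
To get a feeling for these numbers, it helps to look at a case where the answer is known. In the following sketch the target depends on only one of three invented columns (`relevant`, `noise_a`, `noise_b` are hypothetical names), so the tree should assign practically all importance to that column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Three random columns; only the first one determines the class.
X_imp = pd.DataFrame(
    rng.random((500, 3)), columns=["relevant", "noise_a", "noise_b"]
)
y_imp = (X_imp["relevant"] > 0.5).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_imp, y_imp)

# Importances are normalized so that they sum to 1 across all features.
importances = pd.Series(tree.feature_importances_, index=X_imp.columns)
print(importances.sort_values(ascending=False))
```

On real data the picture is rarely this clean: correlated features can share or swap importance, which is one reason why the ranking does not necessarily reveal the true dependencies.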

### Cross-Validation

So far we have tested the whole procedure for a single combination of training and test set. This carries a risk:

What if we happened to select a training set with a certain bias? Then the measured performance of the predictor can vary (strongly) from one split to the next.

To rule this out, we use so-called cross-validation. The dataset is split into X evenly sized subsets (X-fold cross-validation, here X=5). In each of X rounds, X-1 subsets are used for training and the remaining subset for testing, so that every subset serves as the test set exactly once. This reduces the risk of judging the model on a randomly very good or very bad training/test split.

The function below returns the X accuracy scores, one per round.

In [None]:
cross_val_score(DTR, df[features], df[target], cv=5)

This now tells us something about the average accuracy of the model in general (within the settings of the model). A high variance in the scores would be a reason to investigate further.
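
What `cross_val_score` does internally can be approximated with an explicit loop. The sketch below assumes synthetic data and an explicit `KFold` splitter with a fixed seed (both are choices made for this illustration, not taken from the notebook) and summarizes the scores by their mean and standard deviation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four classes (not the notebook's data).
X_cv, y_cv = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=4, random_state=0
)
model = DecisionTreeClassifier(max_depth=5, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual version: fit a fresh copy of the model on each training part
# and score it on the part that was held out.
manual_scores = []
for train_idx, test_idx in folds.split(X_cv):
    fold_model = clone(model).fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))
manual_scores = np.array(manual_scores)

# Library version using the same splitter.
library_scores = cross_val_score(model, X_cv, y_cv, cv=folds)

print(manual_scores)
print(f"mean={manual_scores.mean():.3f}  std={manual_scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it easy to spot the high variance mentioned above.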

## Random Forest Classifier

Now we can try other classifiers to see whether the results differ. The random forest classifier combines many decision trees, each of which was fitted under some kind of randomization (for example on a bootstrap sample of the training data), so that the trees are only weakly correlated. The prediction of the forest is then typically obtained by a majority vote of the trees (or by averaging the trees' class probabilities, if you are interested in those).
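
How the individual trees are combined can be checked directly on a fitted forest. The sketch below uses synthetic data (not the notebook's) and relies on the fact that scikit-learn's `RandomForestClassifier` averages the class probabilities of its trees rather than counting hard votes; it rebuilds that average by hand from `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with four classes (not the notebook's data).
X_rf, y_rf = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_classes=4, random_state=0
)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_rf, y_rf)

X_new = X_rf[:10]

# Class probabilities of every single tree, shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([t.predict_proba(X_new) for t in forest.estimators_])

# Averaging over the trees gives the forest's class probabilities.
avg_proba = tree_probas.mean(axis=0)
forest_proba = forest.predict_proba(X_new)

print(np.allclose(avg_proba, forest_proba))
print(avg_proba.argmax(axis=1))  # index of the most probable class per sample
```

For trees that are grown until their leaves are pure, each tree's probabilities are 0 or 1, and the average then coincides with a plain majority vote.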

In [None]:
n_est = 20

RFC = RandomForestClassifier(n_estimators=n_est)

In [None]:
RFC.fit(X_train, y_train)

Let's get the accuracy score and the confusion matrix again.

In [None]:
predicted_y = RFC.predict(X_test)
metrics.accuracy_score(y_test, predicted_y)

In [None]:
metrics.confusion_matrix(y_test.values.argmax(axis=1), predicted_y.argmax(axis=1))

In [None]:
# A prediction is correct only if all four one-hot columns match.
correct_mask = (y_test == predicted_y).all(axis=1)
X_test_correct = X_test[correct_mask]
X_test_incorrect = X_test[~correct_mask]

df_correct = df[df.index.isin(X_test_correct.index)]
df_incorrect = df[df.index.isin(X_test_incorrect.index)]

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df_correct["x_measured"].values, df_correct["y_measured"].values, marker=".", color="blue")
plt.scatter(df_incorrect["x_measured"].values, df_incorrect["y_measured"].values, marker="^", color="red", label="Failure")
plt.legend()
plt.title(f"Performance of the Random Forest Classifier with {n_est} estimators")
plt.xlabel("Relative X-Position", fontsize=20)
plt.ylabel("Relative Y-Position", fontsize=20);

## K-Neighbours

We can also use the k-nearest neighbours classifier. Fitting essentially stores the training data as points in the feature space, which has one dimension per feature. When the model is given a new data point to predict, it places this point in the same space, finds the k nearest training points (k = 5 by default in scikit-learn) and assigns the class that is most common among them.
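
The neighbour search itself is simple enough to write out with NumPy. The following sketch uses six invented 2D points and k = 3 (both arbitrary choices for illustration) and compares the hand-made vote with `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six known points in a 2D feature space, three per class (toy data).
X_known = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9],
])
y_known = np.array([0, 0, 0, 1, 1, 1])

query = np.array([0.3, 0.3])
k = 3

# Euclidean distance from the query point to every known point.
distances = np.linalg.norm(X_known - query, axis=1)

# Indices of the k closest points, then a vote over their classes.
nearest = np.argsort(distances)[:k]
votes = np.bincount(y_known[nearest])
manual_label = votes.argmax()

# The same prediction with scikit-learn.
knn = KNeighborsClassifier(n_neighbors=k).fit(X_known, y_known)
library_label = knn.predict(query.reshape(1, -1))[0]

print(manual_label, library_label)
```

Because every prediction requires distances to the stored training points, prediction time grows with the size of the training set, unlike for the tree-based models.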

In [None]:
KNN = KNeighborsClassifier()

In [None]:
KNN.fit(X_train, y_train)

In [None]:
predicted_y = KNN.predict(X_test)
metrics.accuracy_score(y_test, predicted_y)

In [None]:
metrics.confusion_matrix(y_test.values.argmax(axis=1), predicted_y.argmax(axis=1))

In [None]:
# A prediction is correct only if all four one-hot columns match.
correct_mask = (y_test == predicted_y).all(axis=1)
X_test_correct = X_test[correct_mask]
X_test_incorrect = X_test[~correct_mask]

df_correct = df[df.index.isin(X_test_correct.index)]
df_incorrect = df[df.index.isin(X_test_incorrect.index)]

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df_correct["x_measured"].values, df_correct["y_measured"].values, marker=".", color="blue")
plt.scatter(df_incorrect["x_measured"].values, df_incorrect["y_measured"].values, marker="^", color="red", label="Failure")
plt.legend()
plt.title("Performance of the KNN classifier")
plt.xlabel("Relative X-Position", fontsize=20)
plt.ylabel("Relative Y-Position", fontsize=20);

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_