<a href="https://colab.research.google.com/github/daniel-falk/ai-ml-principles-exercises/blob/main/ML-training/intro-to-libraries/intro_to_sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning with sklearn
Scikit-learn is a machine learning library that provides many traditional machine learning algorithms, as well as useful functions for the other tasks in machine learning code:

* ML algorithms
* Sample datasets
* Data preparation functions
* Evaluation metrics and functions

The library and project are called scikit-learn, but the Python module is named `sklearn`.

## Line fitting
To start with, we are going to create a fake dataset and fit a line to the data points. Line fitting is useful for predicting a continuous value from some parameters, such as the price of an asset, the weight of an object, the age of a person or how satisfied a customer is with a service.

In [None]:
import sklearn
import numpy as np
import matplotlib.pyplot as plt

In [None]:
x = np.linspace(start=0, stop=2*np.pi, num=20)
y_sin = np.sin(x)  # seasonal variation component
y = 0.1 * x + 0.05 * y_sin

plt.plot(x, y, "*", color="green")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("A fake dataset")

In [None]:
from sklearn import linear_model

regressor = linear_model.LinearRegression()
regressor.fit(np.expand_dims(x, axis=1), y)
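
To make it concrete what "fitting a line" means, the sketch below reads the fitted slope and intercept from the regressor and reproduces its predictions by hand. It recreates the dataset so that it runs on its own; the names `features`, `slope`, `intercept` and `manual_prediction` are only illustrative. `x.reshape(-1, 1)` is used as an equivalent of `np.expand_dims(x, axis=1)`, since scikit-learn expects a 2-D array of shape `(n_samples, n_features)`.

```python
import numpy as np
from sklearn import linear_model

# Recreate the fake dataset so this cell runs on its own
x = np.linspace(start=0, stop=2 * np.pi, num=20)
y = 0.1 * x + 0.05 * np.sin(x)

# scikit-learn expects inputs of shape (n_samples, n_features)
features = x.reshape(-1, 1)

regressor = linear_model.LinearRegression()
regressor.fit(features, y)

# The fitted line is y = slope * x + intercept
slope = regressor.coef_[0]
intercept = regressor.intercept_
manual_prediction = slope * x + intercept

print(f"slope: {slope:.4f}, intercept: {intercept:.4f}")
print("Manual prediction matches predict():",
      np.allclose(manual_prediction, regressor.predict(features)))
```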

In [None]:
predicted = regressor.predict(np.expand_dims(x, axis=1))

plt.plot(x, y, "*", color="green", label="training dataset")
plt.plot(x, predicted, color="red", label="predicted")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Predicted vs. true")
plt.legend()

As can be seen, the linear model fits a straight line to data that does not follow a straight line. At most points there is therefore an error between the predicted value and the true value.

In [None]:
print(f"True value at x={x[2]:.4f} is {y[2]:.4f} but predicted value is {predicted[2]:.4f}")

We can calculate the mean squared error (MSE) and the R² score over the full training dataset.

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"MSE: {mean_squared_error(y, predicted)}")
print(f"R² score: {r2_score(y, predicted) * 100:.2f}%")
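
To show what these two metrics compute, here is a minimal sketch that derives them by hand with NumPy and compares the result to the scikit-learn functions. The dataset and model are recreated so that the cell runs on its own; the names `residuals`, `mse_manual` and `r2_manual` are only illustrative.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Recreate the fake dataset and the linear fit so this cell runs on its own
x = np.linspace(start=0, stop=2 * np.pi, num=20)
y = 0.1 * x + 0.05 * np.sin(x)
features = x.reshape(-1, 1)
predicted = linear_model.LinearRegression().fit(features, y).predict(features)

# MSE: the mean of the squared differences between true and predicted values
residuals = y - predicted
mse_manual = np.mean(residuals ** 2)

# R²: one minus the ratio of the residual sum of squares to the total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE manual: {mse_manual:.6f}, sklearn: {mean_squared_error(y, predicted):.6f}")
print(f"R² manual: {r2_manual:.6f}, sklearn: {r2_score(y, predicted):.6f}")
```

An R² score of 1 means that the predictions match the true values exactly, while a score of 0 is what a model that always predicts the mean of `y` would get.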

Since a straight line cannot fit our data well, let's try a polynomial. Instead of using a dedicated polynomial regressor, we can use the linear regressor with polynomial features of the input. Let's consider the following case where we transform 5 data points into 3rd-order polynomial features, that is:
$\{a\} \to \{1, a, a^2, a^3\}$

In [None]:
from sklearn.preprocessing import PolynomialFeatures

to_polynom = PolynomialFeatures(3)
to_polynom.fit_transform(np.expand_dims(np.array([1,2,3,4,5]), axis=1))
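
As a quick sanity check of the transformation, the sketch below builds the same columns by hand with `np.vander` and compares them to the output of `PolynomialFeatures`. The names `a`, `sklearn_features` and `manual_features` are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

a = np.array([1, 2, 3, 4, 5])

# Polynomial features as computed by scikit-learn
sklearn_features = PolynomialFeatures(3).fit_transform(a.reshape(-1, 1))

# The same columns built by hand: a**0, a**1, a**2, a**3
manual_features = np.vander(a, N=4, increasing=True)

print(manual_features)
print("Same result:", np.allclose(sklearn_features, manual_features))
```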

In [None]:
x_3rd_order = to_polynom.fit_transform(np.expand_dims(x, axis=1))
regressor_3rd_order = linear_model.LinearRegression()
regressor_3rd_order.fit(x_3rd_order, y)

In [None]:
predicted_3rd_order = regressor_3rd_order.predict(x_3rd_order)

plt.plot(x, y, "*", color="green", label="training dataset")
plt.plot(x, predicted_3rd_order, color="red", label="predicted (3rd order)")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Predicted vs. true")
plt.legend()

In [None]:
print(f"3rd order prediction MSE: {mean_squared_error(y, predicted_3rd_order)}")
print(f"3rd order prediction R² score: {r2_score(y, predicted_3rd_order) * 100:.2f}%")
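
The two steps (feature transformation and linear regression) can also be chained into a single model object. The sketch below is one possible way to do this with `make_pipeline` from `sklearn.pipeline`. It is not used in the rest of this notebook, and the names `poly_model`, `pipeline_mse` and `manual_mse` are only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the fake dataset so this cell runs on its own
x = np.linspace(start=0, stop=2 * np.pi, num=20)
y = 0.1 * x + 0.05 * np.sin(x)
features = x.reshape(-1, 1)

# The pipeline applies the polynomial transform before the regressor,
# both when fitting and when predicting
poly_model = make_pipeline(PolynomialFeatures(3), LinearRegression())
poly_model.fit(features, y)
pipeline_mse = mean_squared_error(y, poly_model.predict(features))

# The same thing done in two manual steps, as in the cells above
poly_features = PolynomialFeatures(3).fit_transform(features)
manual_model = LinearRegression().fit(poly_features, y)
manual_mse = mean_squared_error(y, manual_model.predict(poly_features))

print(f"Pipeline MSE: {pipeline_mse:.8f}")
print(f"Two-step MSE: {manual_mse:.8f}")
```

A benefit of the pipeline is that new data can be passed directly to `poly_model.predict` without remembering to transform it first.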

We can now see that the function fits the training data very well, but how about data outside of this interval? We can use the same function as before to create a larger dataset by sampling x-values from a larger interval.

In [None]:
x_large = np.linspace(start=-10, stop=10, num=20)
y_large_sin = np.sin(x_large)  # seasonal variation component
y_large = 0.1 * x_large + 0.05 * y_large_sin

plt.plot(x_large, y_large, "*", color="green")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("A larger fake dataset")

In [None]:
# Reuse the already fitted transformer on the new x-values
x_large_3rd_order = to_polynom.transform(np.expand_dims(x_large, axis=1))
predicted_3rd_order = regressor_3rd_order.predict(x_large_3rd_order)

plt.plot(x_large, y_large, "*", color="green", label="larger dataset")
plt.plot(x_large, predicted_3rd_order, color="red", label="predicted (3rd order)")
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title("Predicted vs. true")
plt.legend()

It can be seen that the model only fits well within the range it was trained on ($0$ to $2\pi$), i.e. it is good at interpolating but not at extrapolating.
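
The difference between interpolation and extrapolation can also be expressed in numbers. The sketch below trains the 3rd-order model on the original interval and then computes the MSE separately for points inside and outside that interval. The helper `fake_function`, the denser sampling of 200 points and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures


def fake_function(x):
    """The same fake function as used above (hypothetical helper)."""
    return 0.1 * x + 0.05 * np.sin(x)


# Train on the original interval [0, 2*pi]
x_train = np.linspace(start=0, stop=2 * np.pi, num=20)
to_polynom = PolynomialFeatures(3)
model = LinearRegression()
model.fit(to_polynom.fit_transform(x_train.reshape(-1, 1)), fake_function(x_train))

# Evaluate on a larger interval, sampled more densely than in the plot above
x_large = np.linspace(start=-10, stop=10, num=200)
y_large = fake_function(x_large)
predicted_large = model.predict(to_polynom.transform(x_large.reshape(-1, 1)))

# Boolean mask of the points that lie within the training range
inside = (x_large >= 0) & (x_large <= 2 * np.pi)
mse_inside = mean_squared_error(y_large[inside], predicted_large[inside])
mse_outside = mean_squared_error(y_large[~inside], predicted_large[~inside])

print(f"MSE inside training range (interpolation):  {mse_inside:.6f}")
print(f"MSE outside training range (extrapolation): {mse_outside:.6f}")
```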

We manually selected a 3rd-degree polynomial (the degree is a so-called hyperparameter). Is that model able to fit all kinds of training data?

In [None]:
x2 = np.linspace(start=0, stop=2*np.pi, num=40)
y2_sin = np.sin(x2)
y2_cos = np.cos(x2)
y2 = np.maximum(y2_sin, y2_cos)

plt.plot(x2, y2, "*", color="green")
plt.xlabel('x2')
plt.ylabel('f(x2)')
plt.title("A fake dataset")

In [None]:
# Train a new linear regressor
regressor = linear_model.LinearRegression()
regressor.fit(np.expand_dims(x2, axis=1), y2)
predicted = regressor.predict(np.expand_dims(x2, axis=1))

# Train a new 3rd order polynomial regressor
x2_3rd_order = to_polynom.fit_transform(np.expand_dims(x2, axis=1))
regressor_3rd_order = linear_model.LinearRegression()
regressor_3rd_order.fit(x2_3rd_order, y2)
predicted_3rd_order = regressor_3rd_order.predict(x2_3rd_order)

plt.plot(x2, y2, "*", color="green", label="training dataset")
plt.plot(x2, predicted, color="red", label="linear")
plt.plot(x2, predicted_3rd_order, color="blue", label="3rd order")
plt.xlabel('x2')
plt.ylabel('f(x2)')
plt.legend()
plt.title("Predicted vs. true")

In [None]:
print(f"linear prediction MSE: {mean_squared_error(y2, predicted)}")
print(f"linear prediction R² score: {r2_score(y2, predicted) * 100:.2f}%")
print(f"3rd order prediction MSE: {mean_squared_error(y2, predicted_3rd_order)}")
print(f"3rd order prediction R² score: {r2_score(y2, predicted_3rd_order) * 100:.2f}%")
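
Since the polynomial degree is a hyperparameter, one way to explore it is to try several values and compare the scores. The sketch below does this for the new dataset. The list of degrees and the names `degrees` and `training_mse` are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Recreate the new dataset so this cell runs on its own
x2 = np.linspace(start=0, stop=2 * np.pi, num=40)
y2 = np.maximum(np.sin(x2), np.cos(x2))
features = x2.reshape(-1, 1)

degrees = [1, 3, 5, 7]  # arbitrary selection of degrees to try
training_mse = {}
for degree in degrees:
    # Polynomial features followed by a linear regressor, as above
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(features, y2)
    training_mse[degree] = mean_squared_error(y2, model.predict(features))
    print(f"degree {degree}: training MSE {training_mse[degree]:.6f}")
```

Note that this only measures the error on the training data. A higher degree can always fit the training data at least as well as a lower one, so the training error alone is not enough to select the best degree.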

Can we do better? Some ML algorithms are more powerful and capable of fitting a more diverse set of functions. Let's try with a *Support Vector Regressor*.

In [None]:
from sklearn.svm import SVR

fig, axs = plt.subplots(2)

# Fit an SVR to the new dataset
regressor = SVR()
regressor.fit(np.expand_dims(x2, axis=1), y2)
predicted = regressor.predict(np.expand_dims(x2, axis=1))

axs[0].plot(x2, y2, "*", color="green", label="training dataset")
axs[0].plot(x2, predicted, color="red", label="SVR")
axs[0].set_xlabel('x2')
axs[0].set_ylabel('f(x2)')
axs[0].legend()
axs[0].set_title("Predicted vs. true: new dataset")

# Fit an SVR to the first dataset
regressor = SVR()
regressor.fit(np.expand_dims(x, axis=1), y)
predicted = regressor.predict(np.expand_dims(x, axis=1))

axs[1].plot(x, y, "*", color="green", label="training dataset")
axs[1].plot(x, predicted, color="red", label="SVR")
axs[1].set_xlabel('x')
axs[1].set_ylabel('f(x)')
axs[1].legend()
axs[1].set_title("Predicted vs. true: first dataset")

plt.gcf().tight_layout() # fix margin between subplots
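
To compare the SVR with the earlier models in numbers and not only visually, the same metrics as before can be computed. The sketch below recreates both datasets and prints the MSE and R² score of an `SVR` with default parameters on each of them. The dictionary names `datasets` and `svr_scores` are only illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

# Recreate both datasets so this cell runs on its own
x = np.linspace(start=0, stop=2 * np.pi, num=20)
y = 0.1 * x + 0.05 * np.sin(x)
x2 = np.linspace(start=0, stop=2 * np.pi, num=40)
y2 = np.maximum(np.sin(x2), np.cos(x2))

datasets = {"first dataset": (x, y), "new dataset": (x2, y2)}
svr_scores = {}
for name, (inputs, targets) in datasets.items():
    features = inputs.reshape(-1, 1)
    # SVR with default parameters, as in the cell above
    predicted = SVR().fit(features, targets).predict(features)
    svr_scores[name] = (mean_squared_error(targets, predicted),
                        r2_score(targets, predicted))
    print(f"{name}: MSE {svr_scores[name][0]:.6f}, "
          f"R² score {svr_scores[name][1] * 100:.2f}%")
```

The `SVR` class has its own hyperparameters, for example `C` and `epsilon`, which influence how closely the model follows the training data.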

## Data classification with sklearn
Next we are going to create an artificial 2-dimensional dataset with two clusters. We're going to use the [Support Vector Machine](https://scikit-learn.org/stable/modules/svm.html) implemented in scikit-learn and train that to differentiate data from the two clusters.

In [None]:
# Create a fake dataset
num_dots = 100
data = np.random.random(size=(num_dots, 2))
labels = np.random.random(size=num_dots) > 0.5

data[labels,:] += 0.7  # Make the "True" samples different in x and y

plt.title("Fake dataset")
plt.scatter(data[:,0], data[:,1], c=labels, alpha=0.3, cmap='viridis')

In [None]:
from sklearn import svm

classifier = svm.SVC() # Create the SVM model
classifier.fit(data, labels) # train the model

In [None]:
predicted = classifier.predict(data)
correct = predicted == labels

edge_colors = np.zeros(shape=(len(labels), 3))
edge_colors[~correct] = (1, 0, 0)

plt.title("Training data predicted")
plt.scatter(data[:,0], data[:,1], c=predicted, edgecolors=edge_colors, alpha=0.6, cmap='viridis')

In [None]:
# Create more random data and predict labels
num_dots = 1000
random_data = np.random.random(size=(num_dots, 2))
random_data *= 1.7

plt.title("Predicted random data")
plt.scatter(random_data[:,0], random_data[:,1], c=classifier.predict(random_data), alpha=0.3, cmap='viridis')

We can also use the `metrics` module in `sklearn` to measure how many of the data points in the training data were correctly classified by the classifier.

In [None]:
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

print(classification_report(labels, predicted))

In [None]:
ConfusionMatrixDisplay.from_predictions(labels, predicted)

Note, however, that the error measured above on the training dataset is not the error we can expect to see when the model is used in production. Since we are testing on the training dataset, the model could have memorized it without having any understanding or capability of generalizing.
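
A common way to get a more realistic estimate is to hold out part of the data and only use it for testing. The sketch below shows one possible way to do this with `train_test_split` from `sklearn.model_selection`. The fixed random seeds and the 30% test size are assumptions made so that the example is repeatable; they are not part of the notebook above.

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Recreate a fake dataset like the one above (fixed seed is an assumption)
rng = np.random.default_rng(seed=0)
num_dots = 100
data = rng.random(size=(num_dots, 2))
labels = rng.random(size=num_dots) > 0.5
data[labels, :] += 0.7

# Hold out 30% of the samples; the model never sees them during training
train_data, test_data, train_labels, test_labels = train_test_split(
    data, labels, test_size=0.3, random_state=0, stratify=labels
)

classifier = svm.SVC()
classifier.fit(train_data, train_labels)

train_accuracy = accuracy_score(train_labels, classifier.predict(train_data))
test_accuracy = accuracy_score(test_labels, classifier.predict(test_data))

print(f"Accuracy on training data: {train_accuracy * 100:.1f}%")
print(f"Accuracy on held-out data: {test_accuracy * 100:.1f}%")
```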

What we measured above is the "bias error", which indicates how capable the model is of fitting the dataset. A linear classifier cannot correctly classify all data points in two overlapping clusters since its decision boundary is always a straight line; it will thus have a high "bias error".

Another type of error is the "variance error", which indicates how much two models trained on two different subsets of the same data differ. That is, the "variance error" is high for models that generalize badly.
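
As a rough illustration of this idea (not a formal estimate of the variance), the sketch below trains two classifiers on two disjoint halves of a fake dataset and measures how often they disagree on new random points. The dataset size, the fixed seed and the names `model_a`, `model_b`, `probe` and `disagreement` are assumptions made for this example.

```python
import numpy as np
from sklearn import svm

# Recreate a fake dataset like the one above (fixed seed is an assumption)
rng = np.random.default_rng(seed=0)
num_dots = 200
data = rng.random(size=(num_dots, 2))
labels = rng.random(size=num_dots) > 0.5
data[labels, :] += 0.7

# Train two models on two disjoint halves of the same data
half = num_dots // 2
model_a = svm.SVC().fit(data[:half], labels[:half])
model_b = svm.SVC().fit(data[half:], labels[half:])

# Compare the predictions of the two models on new random points
probe = rng.random(size=(1000, 2)) * 1.7
disagreement = np.mean(model_a.predict(probe) != model_b.predict(probe))

print(f"The two models disagree on {disagreement * 100:.1f}% of the points")
```

A large disagreement would suggest that the model depends strongly on exactly which samples it was trained on.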