# Notebook 1 - Machine Learning with SKLearn

SKLearn (scikit-learn) is a library (a set of functions) that is useful for machine learning. It includes a range of machine learning models, including the Logistic Regression and Perceptron models that you have already used. Unlike the implementations that you saw last week, the finer details of the algorithms are hidden away, which makes this library much easier to use.

In practice, you would rarely choose to write machine learning algorithms from scratch, and would instead rely on libraries like SKLearn (or PyTorch, for instance, for neural networks). In this session, we will first introduce you to the basic structure of training a machine learning model using SKLearn. In the second notebook, End-to-end Machine Learning, we will go through a typical selection of functions that one would use to fully train a model, including pre-processing.

Hopefully, in this process, you will see that training and switching between different Machine Learning models is very simple.

In the third (and final) notebook, we provide a skeleton set of tasks, to guide you through performing end-to-end machine learning on a new set of data.

## Import some data
For this simple example, we are going to import a small data set that has been included with the SKLearn library for teaching purposes. In real life (and indeed, in notebooks 2 and 3), we would typically import our data from an external file.

Here, we will not consider any data pre-processing - we assume (rightly or wrongly) that the dataset is ready to be input into a machine learning model. The only thing that we will do is split the data into a training set (70%) and a test set (30%).

In [14]:
# Import functions from the sklearn library
# note that we're not importing the whole sklearn library - just the functions we want
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron

iris = datasets.load_iris()
# The iris data set has 150 examples and 4 features; here we keep only the first 2 features.
# You can check the result by running X.shape
X = iris.data[:, :2]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Setting random_state to a fixed value means that the split between train and test is the same on every run.
# This is useful for checking that our code is running properly, but we might remove it if we wanted to study variability over multiple runs.
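As a quick sanity check on the split described above, the sketch below repeats the same loading and splitting steps and prints the array shapes. It uses only the calls already shown in this notebook; the expected shapes in the comments follow from 150 examples and a 30% test set.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, :2]  # keep the first 2 of the 4 features
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X.shape)        # (150, 2)
print(X_train.shape)  # (105, 2)
print(X_test.shape)   # (45, 2)
```

The 45 test examples match the total in the `support` column of the classification report later in this notebook.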


We then instantiate our machine learning model. In this case, we use Logistic Regression. The full list of models included in SKLearn can be found in the documentation: https://scikit-learn.org/stable/

In [15]:
model = LogisticRegression()

After we have created an empty (untrained) instance of the model, we apply it to our data using the .fit method. This trains the model. We can then see what the trained model predicts for the unseen test data using the .predict method.

In [16]:
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

LogisticRegression()
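To get a feel for what .predict returns, a minimal sketch such as the one below places the first few predictions next to the true test labels and counts how many agree. It repeats the loading and fitting steps so that it runs on its own; the variable `n_correct` is introduced here purely for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)  # one predicted class label per test example

print("predicted:", y_preds[:10])
print("actual:   ", y_test[:10])

# Count the test examples where the prediction matches the true label
n_correct = int((y_preds == y_test).sum())
print(f"{n_correct} of {len(y_test)} test examples predicted correctly")
```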

Note that this procedure of .fit and .predict is always the same for sklearn. For instance, we can choose a different machine learning model, and apply the same fitting and predicting methods.

In [22]:
perceptronModel = Perceptron()
perceptronModel.fit(X_train, y_train)
y_preds_perceptron = perceptronModel.predict(X_test)

SKLearn also has a number of built-in functions that report metrics to help us assess the performance of a machine learning model. Several of them are calculated below, and their meaning should be familiar from the lectures this week.

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8222222222222222

In [19]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       0.78      0.54      0.64        13
           2       0.65      0.85      0.73        13

    accuracy                           0.82        45
   macro avg       0.81      0.79      0.79        45
weighted avg       0.83      0.82      0.82        45



In [20]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

[[19  0  0]
 [ 0  7  6]
 [ 0  2 11]]
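The metrics above can all be recovered from the confusion matrix, where each row is a true class and each column is a predicted class. The sketch below copies the matrix printed above into a NumPy array by hand (it does not retrain the model) and recomputes the accuracy, per-class recall and per-class precision.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[19, 0, 0],
               [0, 7, 6],
               [0, 2, 11]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # correct / number of true examples per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / number of predictions per class

print(round(accuracy, 2))      # 0.82
print(np.round(recall, 2))     # [1.   0.54 0.85]
print(np.round(precision, 2))  # [1.   0.78 0.65]
```

These values agree with the accuracy score and the classification report: for example, class 1 has 7 correct predictions out of 13 true examples (recall 0.54) and out of 9 predictions (precision 0.78).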


That's all for this notebook. You should ensure that you are comfortable with this training process. You may also wish to convince yourself that the process is consistent across sklearn by trying some other classification models (note that you can apply these even if you do not know what they are doing under the hood!).
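One possible way to try this is sketched below. The three classifiers are chosen only as examples and are not otherwise covered in this notebook; all are used with default settings, apart from a fixed `random_state` for the decision tree. The `models` and `results` dictionaries are names introduced here for illustration.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Example classifiers; any sklearn classifier could be swapped in here
models = {
    "k-nearest neighbours": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "support vector machine": SVC(),
}

results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)     # same .fit call for every model
    preds = clf.predict(X_test)   # same .predict call for every model
    results[name] = accuracy_score(y_test, preds)
    print(f"{name}: {results[name]:.2f}")
```

Only the line that creates each model changes; the fitting, predicting and scoring code is identical for all three.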

You may also wish to see what happens if you use the whole data set, rather than just 2 of the 4 features.
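A minimal sketch of that comparison is given below. It fits one Logistic Regression model on the first 2 features and another on all 4, using the same train/test split for both. Setting `max_iter=1000` is an assumption made here to give the solver plenty of iterations to converge; it is not part of the original example, and the helper `fit_and_score` is a hypothetical name.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
y = iris.target

def fit_and_score(X, y):
    """Split the data, train Logistic Regression, and return the test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = LogisticRegression(max_iter=1000)  # max_iter raised as a precaution (assumption)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

acc_two = fit_and_score(iris.data[:, :2], y)  # first 2 features only
acc_all = fit_and_score(iris.data, y)         # all 4 features

print(f"2 features: {acc_two:.2f}")
print(f"4 features: {acc_all:.2f}")
```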