<h1 align="center"> Components of Machine Learning.</h1> 


<div class=" alert alert-danger">
   <h3 align="center"> <b>NOTE: This notebook is not graded. Do not submit it.</b> </h3>
    
</div>


Machine learning (ML) studies methods that enable a computer to learn from data without explicitly telling it (programming) what to do. Technically, ML methods fit models to data to be able to make accurate predictions and/or inferences about phenomena such as the weather or the behavior of humans. You might use ML applications in your everyday life without even noticing it. Recommendation systems, spam filtering, voice recognition, and chatbots are examples of such applications. Some of us might have our current partner or job (partly) chosen by an ML algorithm used in recommender systems in social networks. In a nutshell, ML tries to "find a hypothesis that allows predicting a quantity of interest for any data point". This notebook aims at explaining this informal statement by discussing the three main components of ML: data, hypothesis space, and loss function. 

After completing this notebook, you

will be familiar with the following topics:

- Three components of machine learning: data, hypothesis space (model) and loss functions. 
- Properties of data points: features (low-level measurements) and labels (represent high-level facts).
- Difference between regression and classification problems based on the nature of the labels.
- Validation and test errors as indicators for the performance of a hypothesis outside the training set. 

will be able to:

- find useful definitions for data points, their features, and labels for your "real-life applications".
- train a linear classifier by minimizing the average logistic loss (logistic regression).
- use validation and test sets to compute validation and test errors. 

## 1. Component - Data

ML methods view data as collections of atomic units of information called **data points**. A data point can represent very different things or objects. Data points might represent different days, different countries, different persons, or different planets. The concept of data points is very abstract and, in turn, highly flexible. However, it is important to clearly define the meaning of data points when starting to develop ML applications. 
Data points are defined by their properties which we roughly divide into two fundamentally different groups referred to as **features** and **labels**.  

**Features.** Features are properties of a data point that can be measured or computed in an automated fashion without requiring extensive human supervision. For data points representing smartphone snapshots, a natural choice for the features is the red, green, and blue intensities of each pixel in the snapshot. Each data point is characterized by several (typically a lot of) features, which we will stack into a **numeric array**. The simplest forms of such arrays are either vectors or matrices. However, it will be convenient to allow features of a data point to form numeric arrays of arbitrary (but finite) dimensions.  
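As a minimal sketch of this idea, the snippet below builds a tiny synthetic 2x2 RGB "snapshot" and stacks its pixel intensities into a numeric array and into a single feature vector. The pixel values and the variable names `snapshot` and `x` are made up for illustration.

```python
import numpy as np

# hypothetical 2x2 snapshot: one (red, green, blue) intensity triple per pixel
snapshot = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.uint8)

# the features can be kept as a 3-dimensional numeric array ...
print(snapshot.shape)   # (height, width, colour channels)

# ... or stacked into a single feature vector
x = snapshot.reshape(-1)
print(x.shape)
```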

**Labels.** Besides features, data points are also characterized by higher-level properties which we refer to as **labels**. The labels of a datapoint typically represent some higher-level fact or quantity of interest for that data point. In contrast to features, the label of a data point can often only be determined by a human expert. Consider a data point representing a smartphone snapshot. We could then define the label of this datapoint as $y=1$ if the snapshot contains a cat or $y=0$ if the snapshot does not contain a cat. Depending on the type of label values we distinguish between different ML problems.  

**Regression Problems.** We speak of regression problems when data points have numeric labels which are typically represented by a real-number $y \in \mathbb{R}$. Having numeric label values allows comparing the quality of different predictions. Consider a data point with the true label value $y=10$ and two predicted label values $\hat{y}^{(1)}=20$ and $\hat{y}^{(2)} = 100$ obtained from two different ML methods. Then we can say that the prediction $\hat{y}^{(1)}$ is better than prediction $\hat{y}^{(2)}$ since it is closer to the true label value $y=10$.  

**Classification Problems.** In classification problems there is only a finite number of different label values. The label of a data point typically indicates the category or class to which that data point belongs. The simplest setting is **binary classification** where data points belong to exactly one out of two different classes. Here, the label values take on values from a set that contains two elements, e.g., $\{0,1\}$ or $\{-1,1\}$ or $\{\mbox{shows cat},\mbox{shows no cat}\}$. If data points belong to exactly one out of more than two categories we speak of a **multiclass classification** problem (e.g., the image categories "no cat shown", "one cat shown" and "more than one cat shown"). If there are $K$ different categories we might use the label values $\{1,2,\ldots, K\}$. There are also applications where data points can belong to several categories simultaneously (an image can "contain cat" and "contain dog" at the same time). **Multilabel classification** methods use several labels $y_{1},y_{2},\ldots$ for each data point. The label $y_{j}$ represents the $j$th category and its value is $y_{j}=1$ if the data point belongs to the $j$th category and $y_{j}=0$ if not.
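The three kinds of classification labels can be sketched with small NumPy arrays. The label values below are invented for four hypothetical images and only illustrate the encodings described above.

```python
import numpy as np

# binary classification: one label per data point, two possible values
y_binary = np.array([1, 0, 0, 1])        # 1 = "shows cat", 0 = "shows no cat"

# multiclass classification: one label per data point, K = 3 possible values
y_multiclass = np.array([1, 3, 2, 1])    # 1 = no cat, 2 = one cat, 3 = more than one cat

# multilabel classification: several binary labels per data point
# column 0: "contains cat", column 1: "contains dog"
y_multilabel = np.array([
    [1, 0],   # cat only
    [0, 1],   # dog only
    [1, 1],   # cat and dog
    [0, 0],   # neither
])

print(np.unique(y_binary))
print(np.unique(y_multiclass))
print(y_multilabel.sum(axis=1))   # number of categories each data point belongs to
```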


ML can *roughly* be divided into **supervised** and **unsupervised** learning. A supervised ML model uses labeled data points as examples to learn a predictor function that takes the features of a data point as input and outputs a predicted label. A trained model can then be used to predict labels of data points for which the true labels are unknown. Examples of supervised learning methods:

- linear regression
- logistic regression
- support vector machines
- decision trees

In contrast to supervised methods, unsupervised methods do not require the data to be labeled and are in general used for problems related to the structure and distribution of the data. Examples of unsupervised ML methods:

- clustering algorithms, which aim to identify different clusters of data points in the dataset
- generative models that are used to generate data (see this [example](https://www.youtube.com/watch?v=kSLJriaOumA)).
- dimensionality reduction (PCA, t-SNE) that is used for visualization of high dimensional data or as a pre-processing step before other ML methods

## 2. Component - Hypothesis Space ("Model")

When applying an ML model to labeled data, we want the model to learn a predictor function $h(\mathbf{x})$ that takes the features of a data point as input and outputs a predicted label $\hat{y}$. Ideally, we would like our ML model to be able to learn any possible function so that it can find the one that best represents the relationship between the features and the label. This is, however, impossible in practice, and therefore we have to restrict the set of functions that the ML model can learn. This restricted set of predictor functions is referred to as the **hypothesis space** and denoted $\mathcal{H}$.

The choice of which hypothesis space to use in a ML method is often informed by some assumption (or intuition) about the relationship between the features and label of a data point. For example, by selecting the set of linear functions of the form
\begin{equation}
    h(x) = w \cdot x, \; \text{ where } w \in \mathbb{R}
\end{equation}
we are effectively assuming that the relationship between the feature $x$ and the label $y$ is linear. It is important to understand that such assumptions can rarely be justified in advance, and it is in practice necessary to experiment with models using different hypothesis spaces to find the one that results in the best predictions.

### Weights are Model Parameters

Consider the above hypothesis space $\mathcal{H}$ which is constituted by the linear maps

\begin{equation}
    h(x) = w \cdot x, \; \text{ where } w \in \mathbb{R}. 
\end{equation}
The elements of $\mathcal{H}$ are predictor functions $h$ that take as input $x$ and return a prediction $\hat{y} = wx$. This hypothesis space is an example of a parametrized hypothesis space. Each hypothesis is fully determined by the value of the weight $w$. We can denote the hypothesis obtained for a given weight $w$ by $h^{(w)}$. Searching for (learning) a good hypothesis out of a parametrized hypothesis space is equivalent to searching for (or learning) a weight (in general, a weight vector) that corresponds to a good hypothesis. Another important family of parametrized hypothesis spaces are those obtained from artificial neural networks. 
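The correspondence between a weight $w$ and a hypothesis $h^{(w)}$ can be sketched in a few lines of Python. The helper name `make_hypothesis` is hypothetical and only serves this illustration.

```python
def make_hypothesis(w):
    """Return the linear hypothesis h^(w) with h(x) = w * x."""
    def h(x):
        return w * x
    return h

# each choice of the weight w selects one element of the hypothesis space
x = 2.0
for w in [-1.0, 0.5, 3.0]:
    h = make_hypothesis(w)
    print(f"w = {w}: h(x) = {h(x)}")
```

Choosing a different value for `w` picks a different predictor, which is why learning a hypothesis here amounts to learning a number.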

We refer to the weight $w$ in the above definition of linear maps as a **model parameter**. Roughly speaking, model parameters are variables that the learning algorithm tunes during the training process to find the best predictor function in the hypothesis space. Models with larger hypothesis spaces than our tiny example have a larger number of parameters. An extreme example is the [GPT-3](https://en.wikipedia.org/wiki/GPT-3) deep learning model, which has **~175 billion model parameters**!

## 3. Component - Loss

Loosely speaking, ML methods are optimization or search algorithms that aim at finding (learning) the best hypothesis out of a hypothesis space. However, finding the best hypothesis requires a measure of the "success" or "quality" of a specific hypothesis $h$. ML methods use loss functions to obtain such a quality measure.

A loss function is a rule (or recipe) for quantifying the discrepancy between the predicted label $\hat{y}=h(\mathbf{x})$ and the true label $y$ of a data point with features $\mathbf{x}$. Formally, a loss function is a map that assigns each pair of data point and hypothesis some (non-negative) number $\mathcal{L}\big( \big(\mathbf{x},y\big),h \big)$. In general, ML methods use loss functions that deliver smaller values if the predicted label is "closer" to the true label. However, the precise meaning of "closer" depends on the choice of the possible label values (which are real numbers in regression problems but might be arbitrarily structured sets for classification problems). 

**Loss Functions for Regression.** The above (rather abstract and) generic definition of a loss function is best understood by looking at specific examples of loss functions. If the label values are numeric, two widely used loss functions are the **squared error loss** $\mathcal{L}\big(\big(\mathbf{x},y\big),h \big) = \big( y - h(\mathbf{x}) \big)^{2}$ and the **absolute error loss** $\mathcal{L}\big( \big(\mathbf{x},y\big),h \big) = \big| y - h(\mathbf{x}) \big|.$ 
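As a small sketch, the two regression losses can be written as plain Python functions and evaluated for the example used earlier (true label $y=10$, predictions $20$ and $100$). The function names are chosen for this illustration only.

```python
def squared_error_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error_loss(y, y_hat):
    return abs(y - y_hat)

y = 10
for y_hat in (20, 100):
    print(y_hat, squared_error_loss(y, y_hat), absolute_error_loss(y, y_hat))
```

Both losses rank the prediction $20$ ahead of $100$, but the squared error penalizes the large deviation far more strongly ($8100$ versus $90$).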

**Loss Functions for Classification.** One important criterion when choosing loss functions is if labels are numeric or categorical. In general, loss functions that are suitable for assessing predictions of numeric labels are not a good choice for assessing predictions of label values that represent categories. For binary classification problems, where the labels take on only two different values, we could use the **$0/1$ loss**   

\begin{equation} 
\mathcal{L}\big( \big(\mathbf{x},y\big),h \big)  = \begin{cases} 1 & \mbox{ if } y \neq h(\mathbf{x}) \\ 0 & \mbox{ otherwise.}\end{cases}
\end{equation}
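A minimal sketch of the $0/1$ loss, assuming the usual convention that a wrong prediction costs $1$ and a correct prediction costs $0$, so that the average loss equals the fraction of misclassified data points. The labels and predictions below are made up.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Return 1 for a wrong prediction and 0 for a correct one."""
    return int(y != y_hat)

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])

losses = [zero_one_loss(y, y_hat) for y, y_hat in zip(y_true, y_pred)]
error_rate = np.mean(losses)   # fraction of wrongly classified data points
print(losses, error_rate)
```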


You can read more about loss functions for classification problems in Chapter 2 of the [MLBook](https://github.com/alexjungaalto/MachineLearningTheBasics/blob/master/MLBasicsBook.pdf). 

**Ordinal Label Values.** Some ML applications involve data points with ordinal label values. These label values are somewhat in between numeric (regression) and categorical (classification). Similar to categorical label values, ordinal label values take on values from a finite set. Moreover, similar to numeric label values, ordinal label values have an order. As an example, consider data points representing countries and their label being an indicator 0,...,4 of the Covid-19 incidence level. There are loss functions that are particularly suited for assessing predictions of ordinal label values (read more in Chapter 2 of the [MLBook](https://github.com/alexjungaalto/MachineLearningTheBasics/blob/master/MLBasicsBook.pdf)). 


**Training Error is Average Loss on Training Set.** For many loss functions, we can only evaluate the loss $\mathcal{L}(\hat{y}, y)$ incurred by the prediction $\hat{y}=h(\mathbf{x})$ if we know the true label value $y$ of the data point. Thus, most ML methods need to be fed with a set of labeled data points $\mathcal{D} = \big\{ \big(\mathbf{x}^{(1)},y^{(1)}\big),\ldots,\big(\mathbf{x}^{(m)},y^{(m)}\big) \big\}$. The dataset $\mathcal{D}$ is referred to as being labeled since it contains data points for which we know the true label values. These labeled data points are then used as a training set by the ML method in the following sense: ML methods learn a predictor map $h \in \mathcal{H}$ that incurs minimal **average loss** on the training data, 

\begin{equation}
    \mathcal{E}\big(h|\mathcal{D} \big) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\underbrace{\hat{y}^{(i)}}_{h\big(\mathbf{x}^{(i)}\big)}, y^{(i)} \big)
\end{equation}

The average loss $\mathcal{E}\big(h|\mathcal{D} \big)$ is the average of the individual losses incurred by using the predictor $h$ to predict the labels of the individual data points in the training set. We sometimes refer to the average loss of a predictor on the training set as the **training error**. 
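The training error can be computed directly from its definition. The sketch below uses a made-up training set with a single feature, a linear hypothesis with weight $w=2$, and the squared error loss; the helper `average_loss` is a hypothetical name.

```python
import numpy as np

def average_loss(h, loss, X, y):
    """Average of the losses incurred by hypothesis h on the data points (X, y)."""
    return np.mean([loss(y_i, h(x_i)) for x_i, y_i in zip(X, y)])

# toy training set with a single feature (values made up for illustration)
X_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 7.0])

h = lambda x: 2.0 * x                              # linear hypothesis with weight w = 2
squared_error = lambda y, y_hat: (y - y_hat) ** 2

training_error = average_loss(h, squared_error, X_train, y_train)
print(training_error)
```

The first two data points are predicted exactly and the third is off by $1$, so the training error is $(0 + 0 + 1)/3$.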

**Design Choice.** The loss function used by an ML method is a design choice that must balance different design constraints. These constraints might arise from limited computational resources (storage, number of GPUs, maximum processing time), statistical properties of the data (presence of outliers) and interpretability. One example of a computationally appealing loss function is the squared error loss since it allows the use of simple gradient-based methods for learning a hypothesis. One example of a loss function that results in robustness against outliers is the absolute error loss. One example of an interpretable loss function is the $0/1$ loss which results in an "error rate", i.e., the fraction of wrongly classified data points.

**Loss, Score and Metric.** Unfortunately, there is rarely a single loss function that satisfies all design constraints. Loss functions that are easy to minimize (e.g. using gradient descent) might not be robust against outliers or might be difficult to interpret. Therefore, it might be useful to use different loss functions within the same ML method. One loss function that is easy to minimize is used for learning a good hypothesis map. Another loss function that is easier to interpret is then used for the final performance evaluation of the learnt hypothesis on a test set. The latter loss function, which is used for the final performance evaluation, is often referred to as a "metric" or "score".

<a id='1.1'></a>
<div class=" alert alert-info">
    <h3 align='center'><b>Notation</b></h3>

The symbol $\{ \}$ indicates a set. The set is a collection of elements, e.g. $\{1,2,3\}$ is a set of three numbers and $\mathcal{D} = \big\{ \big(\mathbf{x}^{(1)},y^{(1)}\big),\ldots,\big(\mathbf{x}^{(m)},y^{(m)}\big) \big\}$ is a set of $m$ data points.

\
Expression $ \mathcal{E}\big(h|\mathcal{D} \big)$ denotes the empirical risk or average loss incurred by hypothesis $h$ on the data points in the set $\mathcal{D}$.

More about sets:
    
- [Introduction to sets](https://www.mathsisfun.com/sets/sets-introduction.html) from mathsisfun.com
- [Set symbols](https://www.mathsisfun.com/sets/symbols.html) from mathsisfun.com
</div>

## Putting Together the Pieces. 

Now that we have discussed the three main components of ML, let us discuss two particular ML methods that use different choices for data representation and loss function but the same hypothesis space. These two methods are referred to as **linear (least-squares) regression** and **logistic regression**. Both methods consider data points characterized by a feature vector $\mathbf{x} = \big(x_{1},\ldots,x_{n} \big)$ consisting of numeric features $x_{1},\ldots,x_{n}$ of a data point. Moreover, both methods use the same hypothesis space which is constituted by all linear maps $h(\mathbf{x}) = \sum_{j} x_{j} w_{j}$ with (tunable) weights $w_{1},\ldots,w_{n} \in \mathbb{R}$.


### Linear (Least-Squares) Regression

For regression problems, where data points have numeric labels, a widely used choice for the loss function is the **squared error loss** $(y - \hat{y})^{2}$. The average squared error loss incurred on a training set is referred to as the **mean squared error** (MSE):

\begin{equation}
\frac{1}{m} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^{2} = \frac{1}{m} \sum_{i=1}^{m} \big(y^{(i)} - \underbrace{h\big(x^{(i)}\big)}_{\mbox{predicted label } \hat{y}^{(i)}} \big)^{2}
\end{equation}



### Logistic Regression 

We now consider ML applications that involve data points with a binary label. Thus, the label of each data point is one out of two possible values. Without loss of generality we assume these two possible label values are $0$ and $1$. (You can easily transform the label values of a binary classification problem to $0$ and $1$ using scikit-learn's `LabelEncoder()`.) The squared error loss is a bad choice for the loss function for classification problems where labels represent class memberships and have no numeric meaning. Consider a binary classification problem where data points represent a webcam snapshot of an animal. Here, we might define the label of a data point as $y=0$ if it shows a cat and $y=1$ if the snapshot does not show any cat. We can use ML methods to learn a linear hypothesis $h(\mathbf{x})$ from which we compute an estimate $p(y=1) \in [0,1]$ of the probability that the label value is $1$. If the estimated probability is less than $1/2$, we define the predicted label as $\hat{y}=0$; if the probability is at least $1/2$, the predicted label is $\hat{y}=1$. We can denote this classification rule as 

\begin{equation}
 \hat{y} = \begin{cases} 1 & \mbox{ for } p(y=1) \geq 0.5 \\ 0 & \mbox{ for } p(y=1) < 0.5 \end{cases}
 \tag{4}
\end{equation}

We use the value $h(\mathbf{x})$ to construct the estimate for the probability $p(y=1)$ via $p(y=1) = \frac{1}{1+e^{-h(\mathbf{x})}}$, with $p(y=0) = 1 - p(y=1)$. Since this estimate is at least $0.5$ exactly when $h(\mathbf{x}) \geq 0$, the rule (4) predicts $\hat{y}=1$ whenever $h(\mathbf{x}) \geq 0$. To learn a useful hypothesis $h(\mathbf{x})$ we minimize the logistic (or cross-entropy) loss 

\begin{equation}
\mathcal{L}(y,h(\cdot)) := -y\ln\big(p(y=1)\big)-(1-y)\ln\big(p(y=0)\big).
\end{equation}

<img src="R0_data/logreg.png" width=600>

Logistic regression learns the weights of a linear classifier $h(\mathbf{x})$ by minimizing the average **logistic loss**:

\begin{equation}
\begin{aligned}
    (1/m)\sum_{i=1}^{m}\big[ -y^{(i)}\ln\big(p(y^{(i)}=1)\big)-(1-y^{(i)})\ln\big(p(y^{(i)}=0)\big) \big] \\ = (1/m)    \sum_{i=1}^{m}\big[ -y^{(i)}\ln\big(\sigma(\mathbf{w} \cdot \mathbf{x}^{(i)})\big)-(1-y^{(i)})\ln\big(1-\sigma(\mathbf{w} \cdot \mathbf{x}^{(i)})\big) \big]
\end{aligned}
\end{equation}
Here, $\sigma(z) = 1/(1+e^{-z})$ denotes the sigmoid function. The average logistic loss is evaluated for a set of data points (the training set) with known labels $y^{(i)}$. 
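The average logistic loss can be evaluated with a few lines of NumPy. This is a sketch under stated assumptions: labels take values in $\{0,1\}$, there is no intercept term, and the toy data and the function names `sigmoid` and `average_logistic_loss` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_logistic_loss(w, X, y):
    """Average logistic loss of the linear map x -> w . x on data (X, y), labels in {0, 1}."""
    p1 = sigmoid(X @ w)   # estimated probability of y = 1 for each data point
    return np.mean(-y * np.log(p1) - (1 - y) * np.log(1 - p1))

# toy data set: 4 data points with 2 features each (values made up)
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1, 1, 0, 0])

loss_zero = average_logistic_loss(np.zeros(2), X, y)           # all probabilities equal 0.5
loss_good = average_logistic_loss(np.array([1.0, 1.0]), X, y)  # weights that separate the classes
print(loss_zero, loss_good)
```

With all weights equal to zero every estimated probability is $0.5$ and the average loss equals $\ln 2$; weights that assign high probability to the correct labels yield a smaller loss.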

If you are interested in logistic regression & logistic loss details, you can find more explanations in the video tutorials: 
- [Andrew Ng, ML course Lecture 6.1](https://www.youtube.com/watch?v=-la3q9d7AKQ) 
- [Andrew Ng, ML course Lecture 6.2](https://www.youtube.com/watch?v=t1IT5hZfS48)
- [StatQuest: Logistic Regression](https://www.youtube.com/watch?v=yIYKR4sgzI8)

<a id='1.1'></a>
<div class=" alert alert-info">
    <h3 align='center'><b>Notation</b></h3>

The symbol $\sum$ (uppercase greek letter sigma) indicates a sum. The sum of all elements in the list $(x_1, x_2, \ldots, x_n)$ (e.g., forming a vector $\mathbf{x}$) is denoted by

\begin{equation}
\sum_{i=1}^{n}{x}_{i} = {x_1}+...+{x_n}
\end{equation}

We can then express an average $\frac{1}{n}({x_1}+...+{x_n})$ of $n$ numbers conveniently as


\begin{equation}
    \frac{1}{n}\sum_{i=1}^{n}{x}_{i} =(1/n) \big( {x_1}+...+{x_n} \big) 
\end{equation}

</div>

### The Machine Learning Pipeline

A typical ML workflow is as follows: 

\
<img src="R0_data/MLsteps2.png" width=700>

- analyze data:\
    dataset quality has a huge impact on the performance of ML methods, as the whole idea of ML is to learn from data. You will not be able to solve ML problems with low-quality data. As the saying goes: "garbage in, garbage out". What is bad data and how do we check the quality of a dataset? There are a few typical problems with data:
    
    * small dataset
    * missing values
    * high noise
    * biased sampling procedure \
    (for more details see pp.23-27 of "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron)
    

- choose hypothesis space (model):

   in addition to good quality data, an appropriate model must be chosen. For example, for a regression problem, you can choose linear regression (hypothesis space of linear predictors) or a decision tree (hypothesis space of piecewise constant functions).  
   <img src="R0_data/hyp1.png"> 
   
   
- choose a loss function:

  Although machine learning engineers can choose or even invent new loss functions, there are some popular ones that are used the majority of the time:
  - mean square error for regression problems
  - logistic loss for classification problems
    
    
- Find the best predictor (model):
  
  given a hypothesis space and a loss function, we need to find the best model within this hypothesis space. The best model is the one with the lowest average loss on the training set. There are many different algorithms for finding optimal predictors (models). Iterative algorithms, such as gradient descent, perform the same step many times in order to approach the optimal solution. Sometimes, we can find the best predictor in just one step. For example, there is a so-called 'closed-form solution' for linear regression and it is possible to find the best parameters for linear predictors by using the [normal equation](https://www.youtube.com/watch?v=NN7mBupK-8o).
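The last step of the workflow can be sketched for linear regression, where both routes are available. The snippet below compares the closed-form solution of the normal equation with plain gradient descent on the mean squared error. The synthetic data, the learning rate and the number of iterations are assumptions chosen for this illustration, and the hypothesis has no intercept term.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: labels are a noisy linear function of two features
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.1 * rng.normal(size=50)

# closed-form solution: solve the normal equation (X^T X) w = X^T y
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# iterative solution: gradient descent on the mean squared error
w_gd = np.zeros(2)
learning_rate = 0.1
for _ in range(500):
    gradient = (2 / len(y)) * X.T @ (X @ w_gd - y)
    w_gd = w_gd - learning_rate * gradient

print(w_closed, w_gd)
```

Both approaches minimize the same training error, so after enough iterations gradient descent should arrive at essentially the same weights as the closed-form solution.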

## Generalization and Validation

The basic workflow of ML seems quite simple: get some data to form a training set, choose a hypothesis space and loss function and then solve an optimization problem to get the hypothesis with the smallest training error. Are we done? The answer is a resounding NO! The goal of ML is to find a hypothesis that yields good predictions for **any** data point and not just those in the training set. We must ensure that a hypothesis that has small (or minimum) training error **generalizes** well to data points outside the training data. 

Validation is a simple but powerful technique to probe a learnt hypothesis (that minimizes the training error) outside the training set. The idea is to split the available data points into two subsets, a training set and a validation set. We illustrate this split in the plots below by colouring the training set blue and the validation set orange. 
The left plot shows a linear hypothesis (predictor) $h(x)=w_{1}+w_{2}x$ fitted to the training set (blue dots). The orange dots are "new" samples, i.e., data points that the model did not see during the fitting process. These held-out data points play the role of a validation (or test) set. 
 
It seems that there is a non-linear relationship between features and labels of data points. Therefore, there is no good linear hypothesis map for approximating the relationship between features and labels which means that the training error is large. We might say that the linear model underfits the training data. 

The plot in the middle shows a polynomial of degree $3$, $h(x)=w_{1}+w_{2}x+w_3{x}^{2}+w_{4}{x}^{3}$, which seems to fit the data quite well. The plot on the right shows a polynomial of degree $6$, $h(x)=w_{1}+w_{2}x+w_{3}{x}^{2}+...+w_{7}{x}^{6}$. It fits the training data very well. In fact, it fits the training data too well, as the predictions for a few points from the "new" dataset are quite bad (the two orange points on the right side). This is an example of overfitting, which occurs when a model has too many parameters.

<img src="R0_data/hyp2.png"> 

If we plot the MSE loss against the model complexity (here the degree of the polynomial), we obtain a plot like this:

<img src="R0_data/complexity.png" width=600> 

The blue line shows the MSE on the data set used for training and the orange line shows the MSE on new test data, which the model did not see before. Low complexity (degree) models underfit both the training and the test set. High complexity models fit the training data almost perfectly, while the loss on the new data set skyrockets (note the log scale of the y-axis). In the middle, there is a "sweet spot" where both the training and the test loss values are relatively low.
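The experiment behind such a plot can be sketched with NumPy alone. The data below are synthetic (noisy samples of a cubic function, an assumption made for this sketch) and are split into a training and a validation half; polynomials of degree 1, 3 and 6 are then fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: noisy samples of a cubic function
x = rng.uniform(-1, 1, size=40)
y = x**3 - 0.5 * x + 0.1 * rng.normal(size=40)

# split into a training set and a validation set
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

train_errors, val_errors = [], []
for degree in (1, 3, 6):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_errors.append(mse(y_train, np.polyval(coeffs, x_train)))
    val_errors.append(mse(y_val, np.polyval(coeffs, x_val)))
    print(f"degree {degree}: training MSE = {train_errors[-1]:.4f}, "
          f"validation MSE = {val_errors[-1]:.4f}")
```

The training MSE cannot increase with the degree, since each larger hypothesis space contains the smaller ones. The validation MSE carries no such guarantee, which is why it is the quantity to inspect when choosing the model complexity.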

## Components of Machine Learning - Practice

<div class=" alert alert-info">
    <h3 align='center'><b>Data</b></h3>
</div>

In this part, you will learn how to formulate and solve a classification problem. Classification problems arise from data points whose labels have only a finite number of different values. Each of these values represents a particular class or category to which data points can belong. The simplest classification problem is a binary classification problem where the label can take on only two distinct values such as $y=0$ vs. $y=1$ or $y$="picture includes a cat" vs. $y$="picture does not include a cat". The label $y$ of a data point indicates to which class (or category) the data point belongs.

We consider a widely used method for solving classification problems - logistic regression. Logistic regression is a classification algorithm that uses a linear function to classify data points into distinct categories. 

We will use sklearn [iris dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset) consisting of 150 data points, 3 classes (Setosa, Versicolour, Virginica). Data points are characterized by 4 features: sepal length, sepal width, petal length, petal width, but we will use only the first two, as it is easier to visualize in 2D. We will also restrict ourselves to two classes for simplicity.

<img src='R0_data/iris.png' width=700>

<a href="https://www.datacamp.com/community/tutorials/machine-learning-in-r"><p style="text-align:center">source</p></a>

In [1]:
# $ pip3 install python-utils 

In [1]:
import numpy as np                   # the "numpy" package provides methods for processing and manipulating numerical arrays
import matplotlib.pyplot as plt      # library providing tools for plotting data

from utils.styles import load_styles # custom CSS style 

load_styles()

In the notebooks you will see yellow sections with student tasks. You need to read the instructions and complete the code. The place to put your code is marked as:
    
`# YOUR CODE HERE`\
`raise NotImplementedError`
    




Often you will see comments and pre-filled code lines in cell:
    
`# import load_iris module` <--- This is a comment\
`# from ... import ...`   <---  This is a code line you need to complete

`# load data` <--- This is a comment\
`# data = ... `  <---  This is a code line you need to complete

<div class=" alert alert-warning">
    <h3><b>Student Task.</b> Load Iris dataset.</h3>

Your task is to:
 
- import load_iris module from sklearn.datasets
- load dataset and store in variable `data`
</div>

<details>
    <summary>
        <span class="summary-title">Hints. Click to open.</span>
    </summary>
    <div class="summary-content">
This is a hint cell. Sometimes we put extra information about the task here. It is advised to try to solve the task without hints first. 
        <p>
            For this task you need to fill in gaps (...) and remove comment tag <code>#</code>:
        </p>
        <ul>
             <li>
                 <code># import load_iris module</code>
             </li>
             <li>
                 <code>from sklearn.datasets import load_iris</code>
            </li>
            <li>
                 <code># load data</code>
             </li>
             <li>
                 <code>data = load_iris()</code>
            </li>
        </ul>
      Finally, remove <code>raise NotImplementedError</code> otherwise it will return an error when code cell is run.
     </div>    
</details>

In [14]:
# import load_iris module
from sklearn.datasets import load_iris
import pandas as pd

# load data; load_iris() returns a dictionary-like Bunch object
data = load_iris()

# print out features' names
print("\nFeatures:", data.feature_names)

# print out classes for classification
print("\nClasses:", data.target_names)

# collect features and labels in a DataFrame for easier inspection
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
print(df.head(2))

# number of entries (keys) in the Bunch object
print(len(data))

Below you can see sanity checks. They check a few properties of your solution to make sure it is not completely off. 


<div class=" alert alert-danger">
    <h3 align="center">Note, that passing sanity checks does not mean that the solution is correct!</h3>
</div>

In [15]:
# Sanity checks 
assert len(data) == 7, "length of data should be 7!"
assert data.target_names[0]=='setosa', "First target name should be 'setosa'!"

print("Sanity checks passed!")

AssertionError: length of data should be 7!

Below is a hidden cell. It contains test for student task solution and is hidden from students. The hidden test will be run after deadline. You can see the content of hidden cells after deadline in your html feedback files.

In [13]:
# this cell is for tests


In [16]:
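# load data as numpy arrays
X, y = load_iris(return_X_y=True)
# keep only the data points of classes 1 and 2 (versicolor and virginica)
ind = np.where((y==1) | (y==2))[0]
y = y[ind]
# choose first 2 features (sepal length and sepal width)
X = X[ind,:2]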

print("Feature matrix dimensions: ", X.shape)
print("Label vector dimensions: ", y.shape)

Feature matrix dimensions:  (100, 2)
Label vector dimensions:  (100,)


It is a good idea to study your data first. Let's visualize the dataset and see how the two classes of iris plants relate to each other. We will plot histograms of the two features and a scatter plot of the data points.

In [17]:
import seaborn as sns

# set seaborn theme for plots
sns.set_theme()

# plot data
fig, axes = plt.subplots(1,3, figsize=(15,4))
# plot histogram of first feature
sns.histplot(X[y==1,0], kde=True, ax=axes[0], color='b').set_title('first feature')
sns.histplot(X[y==2,0], kde=True, ax=axes[0], color='r')
# plot histogram of second feature
sns.histplot(X[y==1,1], kde=True, ax=axes[1], color='b').set_title('second feature')
sns.histplot(X[y==2,1], kde=True, ax=axes[1], color='r')

# plot data points
sns.scatterplot(ax=axes[2], x=X[:,0],y=X[:,1], hue=y, palette=['b','r'], legend=False)

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.show()


As you can see, the feature distributions look roughly normal, and there are no apparent outliers. The scatter plot and the histograms show considerable overlap between the feature distributions of the two classes. This means that separating the two classes might be difficult.
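
The overlap between the two classes can also be checked numerically. The following is a minimal sketch, not part of the original notebook, that prints the per-class mean and standard deviation of the two selected features. The names `X_demo`, `y_demo` and `summary` are illustrative and chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_iris

# reload the data so that this snippet is self-contained
X_demo, y_demo = load_iris(return_X_y=True)
mask = (y_demo == 1) | (y_demo == 2)
X_demo, y_demo = X_demo[mask, :2], y_demo[mask]

# per-class mean and standard deviation of each feature
summary = {}
for label in (1, 2):
    X_c = X_demo[y_demo == label]
    summary[label] = (X_c.mean(axis=0), X_c.std(axis=0))
    print(f"class {label}: mean = {summary[label][0]}, std = {summary[label][1]}")
```

If the difference between the class means of a feature is small compared with the standard deviations, the two classes overlap strongly in that feature.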

<div class=" alert alert-info">
    <h3 align='center'><b>Hypothesis space</b></h3>
</div>

In [None]:
from sklearn.linear_model import LogisticRegression

# define hypothesis space / model
clf = LogisticRegression(random_state=0)
clf

<div class=" alert alert-info">
    <h3 align='center'><b>Loss function</b></h3>
</div>

The loss function for logistic regression implemented in sklearn is described here: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
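
To make the loss more concrete, here is a minimal sketch that computes the average logistic loss of a fitted linear hypothesis $h(\mathbf{x}) = \mathbf{w}^{T}\mathbf{x} + b$ by hand and compares it with `sklearn.metrics.log_loss`. The names `X_demo`, `y_demo`, `model` and `y01` are illustrative. Note that, with its default settings, `LogisticRegression` minimizes this loss plus an L2 regularization term, so the value below is only the data-fitting part of the objective.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# reload the data so that this snippet is self-contained
X_demo, y_demo = load_iris(return_X_y=True)
mask = (y_demo == 1) | (y_demo == 2)
X_demo, y_demo = X_demo[mask, :2], y_demo[mask]

model = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# linear hypothesis h(x) = w^T x + b evaluated for every data point
h = X_demo @ model.coef_.reshape(-1) + model.intercept_[0]
# predicted probability of the second class (label 2)
p = 1.0 / (1.0 + np.exp(-h))
# recode the labels: class 1 -> 0, class 2 -> 1
y01 = (y_demo == model.classes_[1]).astype(float)

# average logistic loss over all data points
avg_logistic_loss = -np.mean(y01 * np.log(p) + (1 - y01) * np.log(1 - p))

print("average logistic loss (manual): ", avg_logistic_loss)
print("average logistic loss (sklearn):", log_loss(y_demo, model.predict_proba(X_demo)))
```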

<div class=" alert alert-info">
    <h3 align='center'><b>Training/ Fitting a model</b></h3>
</div>

In [None]:
# fit logistic regression
clf.fit(X, y)
# predict the labels of the training data points
y_pred = clf.predict(X)
# calculate the accuracy (fraction of correctly classified data points)
accuracy = clf.score(X, y)
print(f"Accuracy of classification: {round(100*accuracy, 2)}%")

Below we plot the linear decision boundary:

In [None]:
# get the weights of the fitted model
w = clf.coef_ 
w = w.reshape(-1)

# minimum and maximum values of features x1 and x2
x1_min, x2_min = np.min(X, axis=0)
x1_max, x2_max = np.max(X, axis=0)

# plot the decision boundary h(x) = 0
# for data with 2 features this means w1x1 + w2x2 + bias = 0 --> x2 = (-1/w2)*(w1x1+bias)
x_grid = np.linspace(x1_min, x1_max, 100)
y_boundary = (-1/w[1])*(x_grid*w[0] + clf.intercept_)

fig, axes = plt.subplots(1, 1, figsize=(5, 4))

# plot data points belonging to class 1 and 2
sns.scatterplot(ax=axes, x=X[:,0],y=X[:,1], hue=y, palette=['b','r'], s=50, legend=False)
# plot decision boundary
axes.plot(x_grid, y_boundary, color='green')
# display x- and y-axis labels
axes.set_xlabel(r'$x_{1}$')
axes.set_ylabel(r'$x_{2}$')
# display title of figure
axes.set_title('Decision boundary', fontsize=16)
# set axes limits
axes.set_xlim(x1_min-.5, x1_max+.5)
axes.set_ylim(x2_min-0.5, x2_max+0.5)
    
plt.show()

We have briefly mentioned that when overfitting happens, a model shows good results on the training set (the data set used to fit/train the model) but performs poorly on new data. In this case, the model is said not to generalize well. **Generalization** is the ability to perform well on new data. How can we estimate how well a model generalizes? Since loss/score values obtained on the training set tend to be overly optimistic, we need an additional data set that is used only for the final model evaluation. This is the so-called **test set**.\
If you also want to choose between different models (e.g. logistic regression vs. decision tree), you will need one more "new" data set, the **validation set**. The procedure is then as follows:

- train two models on the training set
- evaluate the two trained models on the validation set
- choose the model with the better validation performance
- do a final evaluation of the chosen model on the test set
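
The procedure above can be sketched as follows. This is only an illustration under stated assumptions: the 60/20/20 split, the `max_depth=3` decision tree and the names `X_tr`, `X_val`, `X_te`, `candidates` and `val_scores` are arbitrary choices and not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# reload the data so that this snippet is self-contained
X_demo, y_demo = load_iris(return_X_y=True)
mask = (y_demo == 1) | (y_demo == 2)
X_demo, y_demo = X_demo[mask, :2], y_demo[mask]

# first split off 20% as test set, then 25% of the rest as validation set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# two candidate models (hypothesis spaces)
candidates = {
    "logistic regression": LogisticRegression(random_state=0),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# train each model on the training set and evaluate it on the validation set
val_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    val_scores[name] = model.score(X_val, y_val)
    print(f"{name}: validation accuracy = {val_scores[name]:.2f}")

# choose the model with the highest validation accuracy
best_name = max(val_scores, key=val_scores.get)
# the test set is used only once, for the final evaluation
test_score = candidates[best_name].score(X_te, y_te)
print(f"chosen model: {best_name}, test accuracy = {test_score:.2f}")
```

With only 20 data points in the validation and test sets, the resulting accuracies are rather noisy estimates.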

More about training-validation-test sets:

- [What is the Difference Between Test and Validation Datasets? (Machine Learning Mastery blog)](https://machinelearningmastery.com/difference-test-validation-datasets/)
- [Training, validation, and test sets (Wikipedia)](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)
- [Cross-validation: evaluating estimator performance (sklearn documentation)](https://scikit-learn.org/stable/modules/cross_validation.html)

Let's classify iris plants again, but now split the dataset into training and test sets. The performance score on the test set gives a more realistic estimate of how the trained model performs on new data.

In [None]:
from sklearn.model_selection import train_test_split

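# load data as numpy arrays
X, y = load_iris(return_X_y=True)
# keep only the data points of classes 1 (versicolor) and 2 (virginica)
ind = np.where((y==1) | (y==2))[0]
y = y[ind]
# choose the first 2 features (sepal length and sepal width)
X = X[ind,:2]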

# split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Training set dimensions: ", X_train.shape)
print("Test set dimensions: ", X_test.shape)

In [None]:
# fit logistic regression
clf = LogisticRegression(random_state=0).fit(X_train, y_train)

# calculate the accuracy of the predictions
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(f"Training accuracy of classification: {round(100*train_accuracy, 2)}%")
print(f"Test accuracy of classification: {round(100*test_accuracy, 2)}%")

Note that the accuracy on the test set is lower than on the training set. In addition, the accuracy on the training set itself is lower than before. This might be due to the small size of the dataset, an effect which can be alleviated by using [cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html):

In [None]:
from sklearn.model_selection import cross_val_score

# create a logistic regression model
clf = LogisticRegression(random_state=0)
# data splitting to train-val sets, fitting and evaluation 
# is performed "under the hood" of `cross_val_score()` function.
# output scores are accuracies on validation folds
scores = cross_val_score(clf, X, y, cv=5)

print(f"Cross-validation scores: {scores}")
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))

Now we can say that, on average, our model achieves 74±12% accuracy.
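
To see what happens "under the hood", the following sketch reproduces the cross-validation loop by hand. It assumes, as described in the scikit-learn documentation, that `cross_val_score` with an integer `cv` and a classifier uses stratified K-fold splitting without shuffling. The names `X_demo`, `y_demo`, `manual_scores` and `library_scores` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# reload the data so that this snippet is self-contained
X_demo, y_demo = load_iris(return_X_y=True)
mask = (y_demo == 1) | (y_demo == 2)
X_demo, y_demo = X_demo[mask, :2], y_demo[mask]

model = LogisticRegression(random_state=0)

manual_scores = []
# each fold serves once as validation set; the remaining folds form the training set
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    # fit a fresh copy of the model on the training folds
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    # accuracy on the held-out validation fold
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("manual scores: ", manual_scores)
print("library scores:", library_scores)
```

Every data point is used for validation exactly once, which is why the averaged score is less sensitive to one particular split than a single train/test split.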

<div class=" alert alert-info">
    <h3 align='center'><b>Predictions</b></h3>
</div>

In [None]:
# train the model
clf.fit(X, y)

# get predictions 
predict = clf.predict(X)

# plot true and predicted labels and decision boundary
fig, axes = plt.subplots(1,2, sharex=True, sharey=True,  figsize=(9,3))

# plot data points with true labels
sns.scatterplot(ax=axes[0], x=X[:,0],y=X[:,1], hue=y, palette=['b','r'], legend=False)
# plot decision boundary
axes[0].plot(x_grid, y_boundary, color='green')
# plot data points with predicted labels
sns.scatterplot(ax=axes[1], x=X[:,0],y=X[:,1], hue=predict, palette=['b','r'], legend=False)
# plot decision boundary
axes[1].plot(x_grid, y_boundary, color='green')

# set axes limits
axes[0].set_xlim(x1_min-.5, x1_max+.5)
axes[0].set_ylim(x2_min-0.5, x2_max+0.5)

plt.show()

More about the `LogisticRegression` class in sklearn:\
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

## Key takeaways

- ML methods can be decomposed into three components: **data**, **hypothesis space** (or **model**), and a **loss function**.

- Data(sets) consist of (typically large sets of) individual data points of similar type (e.g. "persons", "flowers", "images", "random variables" or "1 km by 1 km areas on the Earth's surface").

- Each individual data point is characterized by its features $\mathbf{x}$ and label(s) $y$. The features can be determined easily (by hardware and software), whereas the labels are higher-level facts whose determination needs "intelligence" (human or artificial).

- ML methods search for (or "learn") a hypothesis map (or predictor function) $h(\mathbf{x})$ that allows us to estimate/predict/approximate the label $y$ of a data point based solely on its features $\mathbf{x}$, i.e. $y \approx h(\mathbf{x})$.

- ML methods have only finite computational resources and therefore can only search a subset of all possible maps (there are too many of them!). This subset is the hypothesis space (or model) $\mathcal{H}$ and consists of the (feasible or allowed) predictor maps that might be learnt by an ML method.

- ML methods use different **loss functions** to measure the quality of a prediction $\hat{y} = h(\mathbf{x})$ for the true label value $y$. Perhaps the most widely used loss function (for numeric labels) is the squared error loss $(y - \hat{y})^{2}$.

- The goal of an ML method is: "Find a hypothesis in the hypothesis space such that the loss incurred by its predictions is minimized for any data point".
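
The interplay of the three components can be illustrated with a small toy example. The sketch below uses made-up data, a deliberately tiny hypothesis space (linear maps $h(x) = w x$ with $w$ taken from a finite grid) and the squared error loss. All names and numbers are assumptions made for illustration only. In practice we can only minimize the average loss over the available data points and hope that the learnt hypothesis also predicts well for new data points.

```python
import numpy as np

# data: made-up feature and label values (the label is roughly 2 * feature)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# hypothesis space: linear maps h(x) = w * x for a finite grid of weights w
weights = np.linspace(0.0, 4.0, 41)

# loss function: average squared error loss of each hypothesis on the data
avg_losses = np.array([np.mean((y - w * x) ** 2) for w in weights])

# learning: pick the hypothesis with the smallest average loss
best_w = weights[np.argmin(avg_losses)]
print(f"best weight: {best_w:.1f}, average squared error: {avg_losses.min():.4f}")
```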