# 11 Basic concepts of machine learning

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Grus, J. (2019). Data Science from Scratch: First Principles with Python (2nd ed.). Sebastopol, CA: O'Reilly Media
1. Müller, A. and Guido, S. (2017). Introduction to Machine Learning with Python. O'Reilly Media
1. Shmueli, B. Matthews Correlation Coefficient Is the Best Classification Metric You've Never Heard Of. https://towardsdatascience.com/the-best-classification-metric-youve-never-heard-of-the-matthews-correlation-coefficient-3bf50a2f3e9a

The following Python modules will be required. Make sure that you have them installed.
- `matplotlib`
- `numpy`
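- `collections` (part of the Python standard library, no installation needed)
- `sklearn` (installed as the `scikit-learn` package)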

## Lesson 1

### Classical science vs machine learning

All science is about predicting the future.

First of all, one collects data from a domain of interest: planet locations vs time of year, air temperature vs wind
speed, cannonball distance vs amount of gunpowder, and so on.

Then one tries to extract dependencies: does one value really depend on another? What mathematical formula can describe this
dependence?

The next step is to check the revealed formulas against new observations. Do they really work? After this step experimental laws appear: Newton's second law and Coulomb's law in physics, Mendelian inheritance in genetics, and so on.

When enough laws have been discovered, a mathematical theory appears that integrates them and provides tools for making predictions.

Examples of predictions done in this way are: marine navigation (one predicts ship location observing stars), safe bridge load (one can compute it knowing properties of materials), probability of genetic diseases (using genetics laws).

The main problem in building science this way is to find the simplest, most fundamental governing laws that lie at
the basis of all other laws.

In physics such fundamental laws are, for example, Newton's laws of motion; in genetics it is the knowledge of DNA structure, and so on.

And a law means a law: things always occur as it states. No exceptions.

The success of the whole field depends on the success in finding such basic laws.

All of this can be called the classical approach to knowledge creation.

In the second half of the 20th century, when computers made it possible to deal with large amounts of data, scientists started trying to apply the classical ideas to more complicated areas, for example social relations or medicine.

But in such areas no fundamental law can be found. Only some probabilistic dependencies.

If you have a lot of practical experience in a certain area, say medicine, you are an expert. 

Maybe you do not know the exact fundamental laws (because nobody knows them), but you are sure that an event A will most probably be followed by an event B.

For example a medical doctor observing certain symptoms can be sure that a patient has a certain disease.

How can we formalize this experience? How can we take the knowledge from an expert and put it into a computer?

Computer expert systems were created. Mostly they were lists of manually coded "if then else" rules.

The rules were created after questioning experts in the field.

In fact this was just an attempt to apply classical methods to fields where they did not work properly:
recall that the classical science approach requires moving from raw observations to extracting laws. But the computer expert systems were stuck at the first step.

Two main problems of computer systems based purely on expert rules are:

- No generalization. Such a system can only answer the questions that the experts were asked. No laws are derived, so previously unseen cases cannot be processed properly.

- No universality. Such a system solves only specific problems and cannot solve any others. Computer systems based on expert opinion could not recognize faces in photos, because no human expert can formulate strict rules that describe how he or she does it.

That is why the concept of machine learning appeared.

**Machine learning is a collection of mathematical methods and computer algorithms that automatically
extract knowledge from data.** 

Generalization and universality are the key requirements.

The extracted knowledge must be generic.

It means that a machine learning method cannot just memorize the answers.

It must derive common dependencies in the data and must be able to correctly process new, previously unseen cases.

This is what classical science does when it extracts fundamental laws from experimental data.

There are two differences.

- Classical science laws are discovered by humans. In machine learning the generalization of data is performed automatically
by a computer program in the course of the learning process.

- Since the classical laws are discovered by humans, they have a form understandable to humans. These laws are represented
as mathematical equations or text in natural language. Machine learning knowledge obtained after generalization of data
consists of numbers. Typically these are huge arrays of numbers. In most cases humans cannot understand this knowledge. We can only
use it without knowing why it works.

Machine learning methods are universal. They are applicable in every field that can be described by a dataset.

Of course some methods work better for face recognition while others are preferred for time series prediction.

But the key ideas are common, and any domain of interest can be analyzed using machine learning methods.

### Machine learning

The following paragraphs are taken from book [2].

---

Machine learning is about extracting knowledge from data. 

The application of machine learning methods has in recent years become ubiquitous in everyday life. 

From automatic recommendations of which movies to watch, to what food to order or which
products to buy, to personalized online radio and recognizing your friends in your
photos, many modern websites and devices have machine learning algorithms at their
core. 

When you look at a complex website like Facebook, Amazon, or Netflix, it is
very likely that every part of the site contains multiple machine learning models.

Outside of commercial applications, machine learning has had a tremendous 
influence on the way data-driven research is done today. 

The \[machine learning\] tools ... have been applied to diverse scientific problems 
such as understanding stars, finding distant planets, discovering new particles, 
analyzing DNA sequences, and providing personalized cancer treatments.

---

Quite possibly the most important part in the machine learning process is understanding 
the data you are working with and how it relates to the task you want to
solve. 

It will not be effective to randomly choose an algorithm and throw your data at
it. 

It is necessary to understand what is going on in your dataset before you begin
building a model. 

Each algorithm is different in terms of what kind of data and what
problem setting it works best for. 

While you are building a machine learning solution,
you should answer, or at least keep in mind, the following questions:

- What question(s) am I trying to answer? Do I think the data collected can answer that question?

- What is the best way to phrase my question(s) as a machine learning problem?

- Have I collected enough data to represent the problem I want to solve?

- What features of the data did I extract, and will these enable the right predictions?

- How will I measure success in my application?

- How will the machine learning solution interact with other parts of my research or business product?

In a larger context, the algorithms and methods in machine learning are only one
part of a greater process to solve a particular problem, and it is good to keep the big
picture in mind at all times. 

Many people spend a lot of time building complex machine learning solutions, only to find 
out they don’t solve the right problem.

When going deep into the technical aspects of machine learning ..., it is easy to lose sight 
of the ultimate goals. 

---

End of citation

### Modeling

The key role in machine learning belongs to models.

Models extract knowledge and make predictions.

The following paragraphs are taken from book [1].

---

What is a model? 

It’s simply a specification of a mathematical (or probabilistic) relationship 
that exists between different variables.

For instance, if you're trying to raise money for your social networking site,
you might build a business model (likely in a spreadsheet) that takes inputs
like "number of users," "ad revenue per user," and "number of employees"
and outputs your annual profit for the next several years. 

A cookbook recipe entails a model that relates inputs like "number of eaters" 
and "hungriness" to quantities of ingredients needed. 

And if you've ever watched poker on television, you know that each player's 
"win probability" is estimated in real time based on a model that takes into 
account the cards that have been revealed so far and the distribution of cards in the deck.

The business model is probably based on simple mathematical
relationships: profit is revenue minus expenses, revenue is units sold times
average price, and so on. 

The recipe model is probably based on trial and
error - someone went in a kitchen and tried different combinations of
ingredients until they found one they liked. 

And the poker model is based on probability theory, the rules of poker, 
and some reasonably innocuous
assumptions about the random process by which cards are dealt.

---

End of citation

More rigorously, a model is a simplified system that possesses the most important properties and relations of another system.

Models are created using a formal language, such as mathematical notation or a programming language.

Human language is basically inappropriate for creating models due to its vagueness. However, it is sometimes used, for example in psychology.

Models are created to obtain knowledge about the modeled system.

Typically natural systems (e.g., social ones) are too complicated for a straightforward study.

In this case a model is created that keeps only the most essential features of the original system.

Studying the model gives knowledge about the original system.

Since models always have fewer features than the original system, errors are unavoidable.

Building a model is always a trade-off between simplicity (and thus the possibility of studying it) and accuracy.

### Machine learning models

A machine learning model is a combination of a computer program and an array of parameters (which are mere computer variables).

The model has inputs, where the information from the domain of interest is fed, and outputs, where the prediction
appears.

The predictions are made by the model's computer program, which processes the inputs using the parameters.

While the computer program is often standard, the values of parameters are specially tuned in 
the course of training.

The parameters are the most precious part of the model. Their number is usually large and they are kept in 
files in external memory.

While predictions are done by the model itself, its training requires additional software. 

It can be standard or written by a data scientist for the particular task.

The training software

- gathers data from different sources (data mining, big data),
- provides data access (reading from files, downloading from the web, etc.),
- feeds the data to the model,
- receives the prediction,
- estimates its quality,
- corrects the model parameters to improve it if needed.

### Supervised models

These models are trained on a labeled dataset.

It means each input, i.e., each data record, is accompanied by a desired output. 

In the course of training the input is passed to the model. 

The model produces the output that is compared  with the desired one. 

If they are different the model parameters are corrected to decrease the difference.
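The training procedure described above can be sketched as follows. This is only a minimal illustration, not a definitive implementation: the dataset is hypothetical (generated by the rule $y = 3x + 1$), the model is a straight line with two parameters `w` and `b`, and the step size `learning_rate` is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled dataset: inputs x with desired outputs 3x + 1
x = rng.uniform(-1, 1, size=100)
y_desired = 3.0 * x + 1.0

# Model parameters (slope and intercept), starting from arbitrary values
w, b = 0.0, 0.0
learning_rate = 0.1  # assumed step size

for step in range(500):
    y_model = w * x + b          # the model produces its output
    diff = y_model - y_desired   # the output is compared with the desired one
    # The parameters are corrected to decrease the mean squared difference
    w -= learning_rate * 2 * np.mean(diff * x)
    b -= learning_rate * 2 * np.mean(diff)

print(f"w={w:.3f}, b={b:.3f}")
```

After training, the parameters should be close to the values 3 and 1 that were used to generate the labels.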

Labeled datasets are created by humans. Often this is done by volunteers via crowdsourcing platforms.

The famous example is ImageNet. 

The ImageNet project is a large visual database designed for use in visual object recognition software research.

More than 14 million images have been hand-annotated by the project to indicate what objects are pictured. In at least one million of the images bounding boxes are also provided.

The purpose of supervised learning is to extract the essential features of the data and store them as model parameters.

The desired result of the training: the algorithm is able to create a correct output for an input
it has never seen before without any help from a human.

For example, we can create a spam filter that analyzes the content of emails and decides whether an email is spam or not.

To train it we have to create a dataset that includes various emails labeled as spam or not spam.

After training, given a new email, the algorithm will then produce a prediction as to whether the new email is
spam.

More examples of supervised machine learning tasks:

- Identifying users by their faces.
- Identifying the zip code from handwritten digits on an envelope.
- Detecting fraudulent activity in credit card transactions.
- Identifying license plate numbers in road camera images.

### Unsupervised models

In unsupervised learning, only the input data are known, and no known output data is given to the model. 

The purpose of unsupervised learning is to detect patterns in data.

One type of unsupervised learning task is to discover groups of similar examples within the data. This is called __data clustering__.

This class of tasks is also called __non-parametric unsupervised learning__ (in contrast to the next one).
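As a minimal sketch of data clustering, the example below uses the k-means method from `sklearn`. The data are hypothetical: two well-separated groups of two-dimensional points generated at random. The choice of k-means and of two clusters is an assumption made for illustration; other clustering methods exist.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical unlabeled data: two well-separated groups of 2D points
group_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2))
group_b = rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# We ask for two clusters; no labels are passed to the model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

print(cluster_ids[:5], cluster_ids[-5:])
```

The model only receives the points. It assigns a cluster number to each of them, and points from the same group should get the same number.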

The second type is known as __density estimation__ or __parametric learning__. 

In this case, we expect that the data are not grouped into clusters but distributed according to a certain law, e.g., normal distribution. 

We assume a certain distribution function followed by the data and the goal is to compute its parameters. 
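A minimal sketch of this idea, assuming that the data follow a normal distribution: its two parameters, the mean and the standard deviation, are estimated directly from the samples. The data here are hypothetical and generated at random with known parameters so that the estimates can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data that we assume to be normally distributed
data = rng.normal(loc=10.0, scale=2.0, size=10_000)

# The parameters of the assumed normal law are estimated from the data
mu_estimate = data.mean()
sigma_estimate = data.std(ddof=1)

print(f"mu={mu_estimate:.2f}, sigma={sigma_estimate:.2f}")
```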

Examples of unsupervised learning tasks:

- Identifying topics in a set of blog posts
- Segmenting customers into groups with similar preferences
- Detecting abnormal access patterns to a website

### Reinforcement models

These models are a little bit similar to the supervised ones. Nevertheless, there is a serious difference.

If the learning is supervised, we know a global loss function.

It means that we always know the exact desired output for each training input.

Thus we can compare the desired and the obtained outputs and compute the distance between them.

Given the distance, we use an optimization
algorithm (e.g., gradient descent) and compute parameter updates directed towards the optimum.

In the course of reinforcement learning a model interacts with a dynamic environment that sends it inputs, and the model replies with outputs (examples are driving a vehicle or playing a game against an opponent).

The difference from supervised learning is that the exact desired output is unknown.

The environment is dynamic, i.e., the situation changes in time.

The model trainer cannot evaluate an output globally (whether this move definitely leads to a win or to a loss), but it can say whether the output is good or not locally, for this particular situation.

Thus each model output is evaluated in terms of rewards and punishments, and parameter updates are computed to minimize the punishments and to increase the rewards in future situations.

Reinforcement learning is used to teach a computer to play chess or Go.

In fact the computer teaches itself.

The learning process starts and the model makes moves for both sides.

The training software evaluates its moves using known functions that characterize the quality of a game configuration.

Using these estimates the model parameters are updated.

After sufficiently many games the model plays at the grandmaster level.

### Exercises

1\. Answer in writing what is a model and what is a machine learning model

2\. Write brief descriptions of three types of model learning

## Lesson 2

### Overfitting and underfitting

A standard problem in machine learning is overfitting. 

An overfitted model performs very well on the training data but fails on any new data.

Overfitting means that instead of generalizing the training data the model has merely
memorized all of them. It can also involve learning noise in the data.

The opposite problem is underfitting.

An underfitted model performs badly even on the training data.

Overfitting usually appears when the model is too complex
for a given dataset, i.e., has too many parameters.

Underfitting, on the contrary, means that the model is too simple: the number of
its degrees of freedom (parameters) is not enough to store the extracted information.

Another reason is that the dataset does not have enough features (columns) and does not describe the problem well.

Fighting underfitting and overfitting is the central problem in creating an appropriate machine learning model.

When the model is underfitted we need to add more capacity to it. 

It means that its structure must be changed to add more parameters. 

We can add more structure elements to the model (e.g., more layers to a neural network), or we can collect 
more features from the domain of interest. 

A simple trick is to add new data columns as powers of the old ones: squares, cubes and so on. 
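This trick can be sketched with `numpy` as follows. The single feature column `x` is a hypothetical example; the new columns are its square and cube.

```python
import numpy as np

# Hypothetical dataset with a single feature column
x = np.array([1.0, 2.0, 3.0, 4.0])

# New columns are powers of the old one: x, x^2, x^3
features = np.column_stack([x, x**2, x**3])

print(features)
```

Each row of `features` now describes one data record with three columns instead of one.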

And if overfitting occurs, we can either simplify the model by decreasing the number of its parameters, or we can try to
find more data records (often this is impossible) to enlarge the training dataset.

One more way to fight overfitting is called regularization.

Regularization means limiting parameter variation.

Overfitted models often have parameters that are very large in magnitude.

When we apply the regularization we add a penalty to the parameter magnitude: the smaller the better.
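The effect of such a penalty can be illustrated with a small sketch. It assumes one particular form of regularization, a penalty on the sum of squared parameters (known as L2 regularization or ridge regression), applied to a least squares fit. The dataset is hypothetical: 10 noisy points of a linear dependence described by 8 polynomial features, which is deliberately too many. The helper `fit` and the penalty value 1.0 are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dataset: 10 noisy samples of a linear dependence,
# described by polynomial features up to the 7th power
x = np.linspace(-1, 1, 10)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
X = np.column_stack([x**k for k in range(8)])

def fit(X, y, penalty):
    """Least squares with a penalty on the squared parameter magnitude."""
    n_params = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_params), X.T @ y)

w_plain = fit(X, y, penalty=0.0)
w_regularized = fit(X, y, penalty=1.0)

print("parameter norm without penalty:", np.linalg.norm(w_plain))
print("parameter norm with penalty:   ", np.linalg.norm(w_regularized))
```

The parameters found with the penalty have a smaller magnitude than those found without it.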

One more way to defeat overfitting is called dropout.

In the course of training we randomly switch off some parts of the model (typically units of a neural network together with their parameters), i.e., temporarily exclude them from the training routine.
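A minimal sketch of the random switching off is given below. It assumes a common variant of dropout in which the outputs of a model layer are zeroed at random and the remaining ones are rescaled; the function `dropout`, the drop probability 0.5 and the layer outputs are hypothetical and serve only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(values, drop_probability, rng):
    """Zero out a random subset of values and rescale the rest."""
    keep_mask = rng.random(values.shape) >= drop_probability
    # Rescaling keeps the expected sum of the values unchanged
    return values * keep_mask / (1.0 - drop_probability)

# Hypothetical outputs of one model layer
layer_outputs = np.ones(10)
dropped = dropout(layer_outputs, drop_probability=0.5, rng=rng)

print(dropped)
```

A new random mask is generated at every training step, so the model cannot rely on any single element.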

As we can see, all ways of fighting overfitting aim at preventing the model from merely memorizing the dataset.


### Splitting the dataset into training, validation and test parts

Underfitting is clearly seen: the model just works badly.

Overfitting, however, cannot be detected by considering only the training dataset.

To reveal overfitting we need to split the dataset into training and testing parts.

But here we have another problem.

Imagine that we have done several training steps and want to see how it is going.

We take the test part of the data and estimate the model performance.

If it is as good as for the training data, everything is fine: there is no overfitting.

But if the model performs badly on the testing data, we detect overfitting.

In this case we change something in the model.

It means that the test data also take part in training.

They influence, although indirectly, the modifications of the model.

But the final score of the model can be estimated only on data that have never been seen by the model and that
have never influenced the model training.

It means that splitting the initial dataset into training and testing parts is not enough.

We must split the dataset into three parts, training, validation and test:
- training data are used directly to update the model parameters
- validation data are used after several steps of training to estimate the performance and overfitting of the model; on the basis of this estimate the training can be corrected
- test data are used just once to get the final score of the model; no model parameter updates can be done after that.

Usually the dataset is split as 70\%, 20\%, 10\% or 80\%, 10\%, 10\%. 

The training part is usually larger than all others.
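A 70%, 20%, 10% split can be sketched as follows. The dataset size `n_records` is hypothetical; the records are represented by their indices, which are shuffled before splitting so that each part is a random sample.

```python
import numpy as np

rng = np.random.default_rng(0)

n_records = 1000                       # hypothetical dataset size
indices = rng.permutation(n_records)   # shuffle record indices before splitting

n_train = n_records * 70 // 100
n_valid = n_records * 20 // 100

train_idx = indices[:n_train]
valid_idx = indices[n_train:n_train + n_valid]
test_idx = indices[n_train + n_valid:]

print(len(train_idx), len(valid_idx), len(test_idx))
```

The three index arrays do not overlap and together cover the whole dataset.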

### Cross-validation

If the size of the whole dataset is not too large, we can improve the training by using cross-validation.

First we set aside the test data that will be used to find the final model score.

The rest of the data is not split into training and validation parts permanently.

Instead we split all remaining data into $k$ parts. The first $k-1$ parts are used for the model training and then
the performance is checked using the last one.

Then we rotate the data so that the next part becomes the validation set.

The procedure is repeated $k$ times and the results of the checks are averaged.

Such a rolling estimate allows a more uniform use of the data.
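The rotation of the parts can be sketched as follows. This is a minimal illustration under stated assumptions: the data are hypothetical (a noisy linear dependence), the model is a straight line fitted with `np.polyfit`, the performance is measured by the mean squared error, and $k=5$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data left after the test part has been set aside
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)

k = 5
# Shuffle the record indices and cut them into k parts
folds = np.array_split(rng.permutation(x.size), k)

scores = []
for i in range(k):
    valid_idx = folds[i]  # the part used for checking at this rotation
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Train a straight-line model on the remaining k-1 parts
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    prediction = slope * x[valid_idx] + intercept
    scores.append(np.mean((prediction - y[valid_idx]) ** 2))

print(f"averaged validation error: {np.mean(scores):.4f}")
```

Every record is used for validation exactly once and for training $k-1$ times.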

### Stochastic gradient descent, epoch

Often the dataset is very large and the computer does not have enough memory to keep it all at once.

In this case the dataset is divided into pieces called minibatches or just batches.

Instead of applying the training algorithm to the whole dataset, it is applied to minibatches, and the model parameters are updated after each minibatch.

Usually this is used when the parameters are optimized with the gradient descent method.

Gradient descent applied to minibatches is called stochastic gradient descent.

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data). 

Especially in high-dimensional optimization problems this reduces the computational burden, achieving faster iterations in trade for a lower convergence rate.

When the dataset is split into minibatches and we have fed all of them one by one to the training routine, this is called an epoch.

In other words, an epoch passes when we show all minibatches, i.e., the whole training dataset, to the model.
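Minibatches and epochs can be sketched as follows. This is only an illustration: the dataset is hypothetical (generated by the rule $y = 3x + 1$), the model is a straight line with parameters `w` and `b`, and the values of `learning_rate`, `batch_size` and `n_epochs` are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training dataset generated by y = 3x + 1
x = rng.uniform(-1, 1, size=1000)
y = 3.0 * x + 1.0

w, b = 0.0, 0.0
learning_rate = 0.1   # assumed step size
batch_size = 50       # assumed minibatch size
n_epochs = 20

for epoch in range(n_epochs):
    order = rng.permutation(x.size)   # new random order in every epoch
    for start in range(0, x.size, batch_size):
        batch = order[start:start + batch_size]
        diff = w * x[batch] + b - y[batch]
        # Gradient of the mean squared error estimated from one minibatch only
        w -= learning_rate * 2 * np.mean(diff * x[batch])
        b -= learning_rate * 2 * np.mean(diff)
    # Here one epoch has passed: all minibatches have been shown to the model

print(f"w={w:.3f}, b={b:.3f}")
```

With 1000 records and minibatches of 50 records, the parameters are updated 20 times per epoch.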

### Regression and classification

Supervised machine learning usually solves two main problems:

- regression,
- classification.

Classification is the task of assigning labels to data samples belonging to different classes. For example training a model to distinguish between cats and dogs is a classification problem with cats and dogs being the two classes.

Regression, on the other hand, is the task of predicting continuous values by learning from various independent features. For example predicting the price of a house based on features like the number of bedrooms, locality etc.

### Metrics for regression

The performance of machine learning models is estimated with metrics.

In regression tasks we have a true vector $\hat y$ and a vector $y$ predicted by a model.

Metrics tell how different they are.

Usually some version of the distance is used.

Assume the vectors $\hat y$ and $y$ have $n$ elements.

For example the following vectors have $n=5$:
$$
\hat y = (2.1, 3.4, -5.1, 1.6, -3.3)
$$
$$
y = (2.2, 3.45, -5.0, 1.7, -3.2)
$$

Let us denote vector elements as $\hat y_i$ and $y_i$, where $i=1,2,\ldots, n$.

For the example above $y_1=2.2$, $y_2=3.45$.

__MSE, Mean Squared Error__, the most often used metric:
$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^n (\hat y_i - y_i)^2
$$
This is the squared Euclidean distance between two vectors divided by $n$.


__RMSE, Root Mean Squared Error__. This is the square root of MSE, the Euclidean distance divided by $\sqrt{n}$.
$$
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (\hat y_i - y_i)^2}
$$

__MAE, Mean Absolute Error__.
$$
\text{MAE} = \frac{1}{n}\sum_{i=1}^n |\hat y_i - y_i|
$$
This is the taxi-cab distance divided by $n$.

These metrics differ in how they scale small and large errors.

For example, MSE squares the errors, so it assigns more penalty to large errors in comparison with RMSE.

And when the errors are small, RMSE puts more penalty than MSE.

MAE, unlike those two, gives uniform penalties to small and large errors.
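The three metrics can be computed for the example vectors given above with a few lines of `numpy`. The variable names `y_true` and `y_pred` are chosen here for illustration; they stand for $\hat y$ and $y$ respectively.

```python
import numpy as np

y_true = np.array([2.1, 3.4, -5.1, 1.6, -3.3])    # the true vector from the example
y_pred = np.array([2.2, 3.45, -5.0, 1.7, -3.2])   # the predicted vector from the example

errors = y_true - y_pred

mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(errors))

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```

For these vectors MSE is about 0.0085, RMSE is about 0.092 and MAE is 0.09. All errors here are smaller than 1, so RMSE is larger than MSE.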

### Metrics for classification

When a model performs classification, it must return a class label.

But often the actual output of the model is a vector $y$ of $n$ elements, where $n$ is
the number of classes.

To obtain a class label this vector is normalized. Often the so-called softmax function is used:
$$
q_i = \frac{e^{y_i}}{\sum_{j=1}^n e^{y_j}}
$$

After the normalization all vector elements are positive and their sum equals 1. If one of the elements approaches 1, all others go to zero.

The predicted class label corresponds to the vector entry with the largest value after the normalization.

To estimate how good the prediction is, this vector is compared with the desired prediction.

For this purpose the desired class labels are represented in one-hot form:
$$
p=(1,0,0,0,0) \text{ : class 1}
$$
$$
p=(0,1,0,0,0) \text{ : class 2}
$$
$$
p=(0,0,1,0,0) \text{ : class 3}
$$
and so on. 

Then the model prediction $q$ is compared with the desired one-hot output via 
__cross entropy__:
$$
H(p,q) = - \sum_{i=1}^n p_i \log q_i
$$

This cross entropy is very high if the model predicts the wrong class (a wrong vector entry is the largest) and becomes very small when the correct entry of the predicted vector is much higher than the others.
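The softmax normalization and the cross entropy can be sketched as follows. The raw model outputs `y_good` and `y_bad` are hypothetical values for a problem with three classes; subtracting the maximum inside `softmax` is a common numerical precaution that does not change the result.

```python
import numpy as np

def softmax(y):
    """Normalize raw model outputs into positive values with unit sum."""
    shifted = y - np.max(y)   # does not change the result but avoids overflow
    exps = np.exp(shifted)
    return exps / exps.sum()

def cross_entropy(p, q):
    """Cross entropy between a one-hot vector p and a prediction q."""
    return -np.sum(p * np.log(q))

p = np.array([0.0, 1.0, 0.0])          # desired output: class 2
y_good = np.array([0.5, 4.0, -1.0])    # hypothetical outputs, class 2 dominates
y_bad = np.array([4.0, 0.5, -1.0])     # hypothetical outputs, class 1 dominates

q_good, q_bad = softmax(y_good), softmax(y_bad)
print(f"correct prediction: H={cross_entropy(p, q_good):.3f}")
print(f"wrong prediction:   H={cross_entropy(p, q_bad):.3f}")
```

The cross entropy is small for the correct prediction and large for the wrong one.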

In the case of binary classification the model returns only one real value $y$. First of all it
must be mapped into the range $[0,1]$.

Usually the sigmoid function $\sigma(y)$ is used:
$$
q = \sigma(y) = \frac{1}{1+e^{-y}}
$$
The output of this function lies between 0 and 1.

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import matplotlib.pyplot as plt
xs = np.linspace(-5, 5, 100)
ys = 1 / (1 + np.exp(-xs))  # The sigmoid function is computed here
fig, ax = plt.subplots()
ax.plot(xs, ys)
ax.grid()
ax.set_title("Sigmoid function")
ax.set_xlabel("y")
ax.set_ylabel("s");

The labels, i.e., the desired outputs $p$ in the binary classification 
are either 0 or 1.

Thus, given the desired label $p$ and the predicted $q$ 
a __binary cross entropy__ is computed:
$$
H(p,q) = -p \log q - (1-p) \log (1-q)
$$
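A minimal sketch of the binary cross entropy is given below. The raw model output 2.0 is a hypothetical value; the same prediction is compared with both possible labels.

```python
import numpy as np

def sigmoid(y):
    """Map a real value into the interval between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-y))

def binary_cross_entropy(p, q):
    """p is the desired label (0 or 1), q is the predicted value between 0 and 1."""
    return -p * np.log(q) - (1 - p) * np.log(1 - q)

q = sigmoid(2.0)   # hypothetical raw model output y = 2.0
print(f"q={q:.3f}")
print(f"label 1: H={binary_cross_entropy(1, q):.3f}")
print(f"label 0: H={binary_cross_entropy(0, q):.3f}")
```

Since $q$ is close to 1 here, the binary cross entropy is small when the desired label is 1 and large when it is 0.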

### Metrics for binary classification: accuracy, precision and recall. Matthews Correlation Coefficient

When dealing with a classification problem, it seems obvious at first glance
to estimate the performance via accuracy, the fraction or percentage of correct predictions.

But in fact this is not a good idea when the data are not balanced, i.e.,
when categories appear in the data with different probabilities.

Let us for the sake of simplicity consider binary classification:
is this email spam or not, is this person diseased or not, will it rain tomorrow or not, and so on.

There are two types of predictions: positive and negative. And each one can be true or false.
We know it because we consider supervised learning and have ground truth labels.

- True positive: model says that this message is spam and it is indeed spam.
- False positive: model says that this is spam, but this is actually not spam. Type I error, wrong discovery.
- False negative: model says that the message is not spam, but this is incorrect. Type II error, missed discovery.
- True negative: model says that the message is not spam and the message is indeed not spam.

All these cases can be collected into a table that is called a confusion matrix.

|                    | Actually spam                           | Actually not spam                      |
|--------------------|-----------------------------------------|----------------------------------------|
| Predicted spam     | TP (number of <br>true positive cases)  | FP (number of<br>false positive cases) |
| Predicted not spam | FN (number of <br>false negative cases) | TN (number of <br>true negative cases) |

Let us now create a spam filter that will have 90% accuracy. It means that 90% of its predictions will be correct.

The filter will produce its predictions at random. 

In most cases it will mark messages as non-spam and
only with the probability $p=0.001$ it will report spam. 

Let us model its operation. First let us create messages. 

Assume that one message out of every 10 is spam. Our dataset will contain $10^6$ messages.

In [None]:
import numpy as np
rng = np.random.default_rng()

# One spam message (marked as 1) and nine non-spam messages (marked as 0) in each ten, 1000000 in total
messages = ([1] + [0] * 9) * 100000

# Shuffle spam and non-spam messages
rng.shuffle(messages)

Now we create the filter. It just generates 1 (means spam is predicted) with small probability.

In [None]:
p_spam_filter = 0.001

# Make predictions without even looking at the messages
predicts = [1 if rng.random() < p_spam_filter else 0 for _ in range(len(messages))]

Compute the accuracy of our filter: run along the messages and predictions in parallel and put 1 if they coincide.

In [None]:
# Put 1 when the prediction is correct
success = [1 if m == p else 0 for m, p in zip(messages, predicts)]

accuracy = sum(success) / len(messages)
print(f"accuracy={accuracy:.3f}")

Our random filter predicts correct results in 90% of cases.

Why? 

The message list contains many zeros (non-spam messages). And the spam filter basically reports
zeros (prefers to predict non-spam).

These zeros coincide very often.

To have a more appropriate metric, let us compute the confusion matrix.

In [None]:
from collections import Counter

def report_result(mes, pred):
    if mes == 1:
        if pred == 1:
            return 'tp'  # True positive prediction
        else:
            return 'fn'  # False negative
    else:
        if pred == 1:
            return 'fp'  # False positive
        else:
            return 'tn'  # True negative

check = [report_result(m, p) for m, p in zip(messages, predicts)]

# Count results of predictions
cm = Counter(check)

# Print confusion matrix
print(f"tp={cm['tp']:10}", f"  fp={cm['fp']:10}")
print(f"fn={cm['fn']:10}", f"  tn={cm['tn']:10}")

We see that there are extremely many true negative predictions: as we mentioned above, both messages
and predictions have many zeros and they coincide very often.

It means that the value TN must be excluded since it is high just by chance, due to the highly unbalanced data.

Instead of accuracy, precision and recall as well as their combination are considered.

Precision is the fraction of true positive predictions in the total number of positive predictions.

$$
\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$

Recall is the fraction of true positive predictions in the total number of positive cases.

$$
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
$$

Given these two values, their harmonic mean is computed, which is called the F1-score:

$$
\text{F1-score} = 2\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$$

Let us compute these metrics for our model.

In [None]:
precision = cm['tp'] / (cm['tp'] + cm['fp'])
recall = cm['tp'] / (cm['tp'] + cm['fn'])
f1_score = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f}")
print(f"recall={recall:.3f}")
print(f"f1_score={f1_score:.3f}")

We see that all three values are very small, which agrees with our intuition that our spam filter is not as good as the accuracy indicates.

In some cases precision or recall can be more important than the other one.

Imagine, for example, that a classifier needs to detect some disease in human patients. 

Positive means that the patient has the disease, and negative means the patient is healthy. 

In this case recall is more important because we need the classifier to reveal as many truly diseased patients as possible.

Another example is a recommendation system. The classifier prediction is considered positive when the recommendation is relevant and negative for non-relevant recommendations.

In this case we need high precision: most of the positive recommendations must be relevant.

To summarize, the relative importance assigned to precision and recall should depend on the problem.

Classifying a sick person as healthy has a different cost from classifying a healthy person as sick.

That is why the F1-score as a single characteristic of a classifier should be used with care: it assigns identical weights to recall and precision.

One possible solution is to use the generalized F-score (also called the Fbeta-score), which has an additional parameter $\beta$ that changes the relative weights of precision and recall:

$$
\text{F-score} = (1+\beta^2)\frac{\text{Precision} \times \text{Recall}}{(\beta^2 \cdot \text{Precision}) + \text{Recall}}
$$

At $\beta=1$ the general F-score becomes F1-score.

A smaller beta value, such as 0.5, gives more weight to precision and less to recall, whereas a larger beta value, such as 2.0, gives less weight to precision and more weight to recall in the calculation of the score.
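
The effect of $\beta$ can be illustrated with a minimal sketch. The helper name `f_beta`, the zero-denominator convention and the precision and recall values below are assumptions made only for this illustration; they are not taken from the spam filter example.

```python
def f_beta(precision, recall, beta=1.0):
    """Generalized F-score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        # Convention assumed here: the score is 0 when both inputs are 0
        return 0.0
    return (1 + b2) * precision * recall / denom

# Hypothetical classifier: precise, but it misses many positive cases
precision, recall = 0.9, 0.3
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(precision, recall, beta):.3f}")
```

For this precise but low-recall classifier the score drops as $\beta$ grows, because more weight is moved to the poor recall.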

All the metrics considered so far have some issues. Accuracy works badly for imbalanced data. Precision and recall are asymmetric, i.e., each of them ignores some information. The F1-score combines them, but it is unclear what weights to assign to each of them.

In some cases, as mentioned above, precision or recall alone is enough. 

But what if both classes are of interest and true predictions for both are very important? There is a way to combine all the results of binary classification into a single value.

This is based on the idea that the true class and the predicted class can be treated as two (binary) variables. 

The measure of quality of prediction is their correlation coefficient. 

Previously we considered the Pearson correlation coefficient when analyzing time series. Now we have two random binary variables, and TP, TN, FP, FN, divided by the total number of cases, estimate the probabilities of their four joint outcomes. 

The higher the correlation between true and predicted values, the better the prediction. 

This is called the phi-coefficient or the Matthews Correlation Coefficient:

$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

When the classifier is perfect (FP = FN = 0) the value of MCC is 1, indicating perfect positive correlation. 

Conversely, when the classifier always misclassifies (TP = TN = 0), we get a value of -1, representing perfect negative correlation. 

In this case, we can simply reverse the classifier’s outcome to get the ideal classifier. 

In fact, the MCC value is always between -1 and 1, with 0 meaning that the classifier is no better than a random flip of a fair coin. 

MCC is also perfectly symmetric, so no class is more important than the other; if we switch the positive and negative, we will still get the same value.

MCC takes into account all four values in the confusion matrix, and a high value (close to 1) means that both classes are predicted well, even if one class is disproportionately under- (or over-) represented.

Although the equation for MCC is rather simple, its straightforward computation can fail for large datasets.

Observe the many multiplications in the numerator and in the denominator. For large counts these products become really huge and can overflow fixed-width integer types.

After the division, however, we obtain a reasonable value. 

Thus, to compute this value reliably, we first need to rescale the numerator and the denominator properly. For this reason we will use a ready-made routine from the `sklearn` module.

In [None]:
from sklearn.metrics import matthews_corrcoef
MCC = matthews_corrcoef(messages, predicts)
print(f"MCC={MCC:10}")

Observe that MCC rates our stupid spam classifier as very bad. 

As mentioned above, a value very close to zero indicates that the classifier is just tossing a coin.
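
The properties of MCC listed above can be checked with a minimal sketch that works directly on confusion-matrix counts. The function name `mcc_from_counts`, the counts, and the convention of returning 0 for a zero denominator are assumptions made for this illustration. Taking the square root of each factor separately is one possible way to keep intermediate values moderate; it is not claimed to be how `sklearn` does it.

```python
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = float(tp) * tn - float(fp) * fn
    # Square roots are taken factor by factor to avoid one huge product
    denominator = (math.sqrt(tp + fp) * math.sqrt(tp + fn)
                   * math.sqrt(tn + fp) * math.sqrt(tn + fn))
    if denominator == 0:
        # Assumed convention when a row or a column of the matrix is empty
        return 0.0
    return numerator / denominator

print(mcc_from_counts(tp=50, tn=40, fp=0, fn=0))    # perfect classifier
print(mcc_from_counts(tp=0, tn=0, fp=40, fn=50))    # always wrong
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))  # coin-like behaviour

# Swapping the roles of the classes (tp <-> tn, fp <-> fn) keeps MCC the same
print(mcc_from_counts(30, 50, 5, 15), mcc_from_counts(50, 30, 15, 5))
```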

### Example of model training, underfitting and overfitting

Before considering the example let us discuss some mathematical concepts.

A polynomial is an expression of the following form:

$$
y = a_0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \ldots
$$

The largest exponent is called the degree of the polynomial:

$$
y = 2 \text{ :  degree 0}
$$
$$
y = 3-4x \text{ :  degree 1}
$$
$$
y = 6+2x+3x^2 \text{ :  degree 2}
$$

A polynomial can be considered as a function. One can substitute a value of $x$ there and obtain the corresponding $y$.

For example, consider the polynomial of degree 2.
$$
y = 2 + x^2
$$

If $x=2$ the corresponding $y$ is 6, if $x=4$, $y=18$ and so on.
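
The substitution above can be sketched in plain Python. The helper `eval_poly` is a hypothetical name introduced here; it uses Horner's scheme, which is one common way to evaluate a polynomial from its list of coefficients.

```python
def eval_poly(coeffs, x):
    """Evaluate a0 + a1*x + a2*x**2 + ... using Horner's scheme."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a
    return result

coeffs = [2, 0, 1]  # y = 2 + 0*x + 1*x**2, lowest degree first
for x in (2, 4):
    print(f"x={x}, y={eval_poly(coeffs, x)}")
```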

The polynomials can be plotted as functions.

In `numpy` there is a submodule `polynomial` that provides a class `Polynomial` for working with polynomials.

Let us consider the polynomial of the 4th degree

$$
y = 4 + 2 x - 4 x^2 -2 x^3 + 5 x^4
$$

In [None]:
import numpy as np
from numpy.polynomial import Polynomial as P

# We create a polynomial by passing its coefficients (lowest degree first)
# to the class constructor P
poly1 = P([4, 2, -4, -2, 5])

# Print the created polynomial
print("y =", poly1)
print("degree =", poly1.degree())

# Check how it works: create an array of x values and compute the corresponding y
x = np.array([-3.8, 4.4, 0.55, 1])
y = poly1(x)
print("x =", x)
print("y =", y)

Now we plot its graph

In [None]:
import matplotlib.pyplot as plt

# linspace() returns evenly spaced x values over the polynomial's domain
# (by default [-1, 1]) and the corresponding y values
xp, yp = poly1.linspace()

fig, ax = plt.subplots()
ax.plot(xp, yp)
ax.set_xlabel("x")
ax.set_ylabel("y");

Polynomials can be used to fit data. 

Fitting means searching for such a polynomial that the sum of squared distances from the data points to the polynomial curve is the smallest.

![poly_fit.svg](attachment:poly_fit.svg)
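
In symbols, this can be written as follows. The notation is introduced here only for illustration: the data points are $(x_i, y_i)$, $i=1,\ldots,N$, the polynomial $p(x)$ has degree $d$, and the distances are assumed to be measured along the $y$ axis, as in an ordinary least-squares fit.

$$
p(x) = a_0 + a_1 x + \ldots + a_d x^d
$$
$$
S(a_0, \ldots, a_d) = \sum_{i=1}^{N} \left( y_i - p(x_i) \right)^2 \to \min
$$

The coefficients $a_0, \ldots, a_d$ that make $S$ the smallest define the fitted polynomial.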

Fitting of the data is performed with the method `.fit`. 

In what follows we will fit a dataset with polynomials of various degrees. 

The polynomials will be our models. 

Parameters of the models are the polynomial coefficients.

The training is performed when we fit the data with the polynomial.

Finding the best approximation of the data using certain functions, polynomials for example, is called regression analysis.
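
As a minimal sketch of such a fit, consider noise-free samples of a known polynomial and check that `.fit` recovers it. The polynomial and the sampling interval below are arbitrary illustrative choices. Note that `.fit` works internally on a rescaled $x$ interval, so `.convert()` is used here to express the coefficients in terms of the original $x$.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Noise-free samples of a known polynomial y = 1 + 2x - 0.5x^2 (illustrative)
true_poly = P([1.0, 2.0, -0.5])
x = np.linspace(-2.0, 3.0, 20)
y = true_poly(x)

# Least-squares fit with a polynomial of degree 2
fitted = P.fit(x, y, deg=2)

# Express the fitted coefficients in terms of the original, unscaled x
recovered = fitted.convert()
print("recovered coefficients:", np.round(recovered.coef, 6))
print("max abs error:", np.max(np.abs(fitted(x) - y)))
```

Printing `fitted` directly shows coefficients for the rescaled variable, which is why they may look different from the coefficients of the original formula.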

Imagine that a certain natural process is described by a polynomial.
$$
y = -1 + 1.5 x + 0.1 x^2 - 1.5 x^3
$$


In [None]:
# fun_orig is the true dependence from nature that we will try to recover
from numpy.polynomial import Polynomial as P
import matplotlib.pyplot as plt

fun_orig = P([-1, 1.5, 0.1, -1.5])
print("fun_orig")
print("y =", fun_orig)

xp, yp = fun_orig.linspace()
fig, ax = plt.subplots()
ax.plot(xp, yp)
ax.set_xlabel("x")
ax.set_ylabel("y");

We pretend that this formula is unknown. 

To recover it we can take some measurements: $x$ values and the corresponding $y$ values. 

Unfortunately, the measurements are not exact: they are spoiled by noise.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
rng = np.random.default_rng()

n_total = 30
data = np.zeros((n_total, 2))

# Measured values of x. Simulate them by uniform random numbers
xs = rng.uniform(-1, 1, size=(n_total,))

# Measured values of y are computed from xs and spoiled by noise.
ys = fun_orig(xs) + 0.1 * rng.normal(size=(n_total,))

fig, ax = plt.subplots()
ax.scatter(xs, ys)
ax.set_xlabel("x")
ax.set_ylabel("y");

Collect all the data into a dataset. The $x$ values are in the first column and the $y$ values are in the second column.

In [None]:
data = np.array([xs, ys]).T
print(data)

Now we will create machine learning models that will try to recover `fun_orig`.

First we need to split the dataset into training, validation and test parts.

Here is the function for it.

In [None]:
def split_dataset(data, p_train, p_valid, shuffle=True):
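    """Split dataset into train, validation and test parts.
    p_train and p_valid are fractions of the dataset; their sum must be
    strictly between 0 and 1, and the remainder forms the test part.
    Note that shuffling is done in place and uses the global `rng`.
    """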
    assert 0 < p_train + p_valid < 1 
    n_tot = len(data)
    n_train = round(p_train * n_tot)
    n_valid = round(p_valid * n_tot)
    n_test = n_tot - n_train - n_valid
    
    if shuffle:
        # Shuffle data if needed
        rng.shuffle(data, axis=0)
        
    # Extract train dataset
    data_train = data[:n_train]
    # Validation dataset
    data_valid = data[n_train:n_train+n_valid]
    # Test dataset
    data_test = data[n_train+n_valid:n_train+n_valid+n_test]
    return data_train, data_valid, data_test
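
The function above shuffles `data` in place, so the caller's array is reordered as a side effect. When this is not desired, one possible variant is sketched below. The name `split_dataset_copy` and the `seed` argument are hypothetical and are not part of the course code; the idea is to shuffle row indices instead of the rows themselves.

```python
import numpy as np

def split_dataset_copy(data, p_train, p_valid, seed=None):
    """Split into train, validation and test parts without touching `data`."""
    assert 0 < p_train + p_valid < 1
    local_rng = np.random.default_rng(seed)
    n_tot = len(data)
    n_train = round(p_train * n_tot)
    n_valid = round(p_valid * n_tot)
    # A random permutation of row indices replaces the in-place shuffle
    idx = local_rng.permutation(n_tot)
    train_idx = idx[:n_train]
    valid_idx = idx[n_train:n_train + n_valid]
    test_idx = idx[n_train + n_valid:]
    return data[train_idx], data[valid_idx], data[test_idx]

demo = np.arange(60).reshape(30, 2)  # stand-in dataset with 30 rows
train, valid, test = split_dataset_copy(demo, 0.4, 0.3, seed=0)
print(len(train), len(valid), len(test))
```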

Do the splitting. 

We take 40\% for training and 30\% each for validation and test.

In [None]:
p_train, p_valid = 0.4, 0.3

data_train, data_valid, data_test = split_dataset(data, p_train, p_valid)

fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(12, 3))
axs[0].scatter(data_train[:,0], data_train[:,1], color='C0')
axs[0].set_title(label=f"training, size={len(data_train)}")

axs[1].scatter(data_valid[:,0], data_valid[:,1], color='C1')
axs[1].set_title(f"validation, size={len(data_valid)}")

axs[2].scatter(data_test[:,0], data_test[:,1], color='C2')
axs[2].set_title(f"test, size={len(data_test)}")

for ax in axs:
    ax.grid();

Now we start training models. 

The model is a polynomial, and different models have different polynomial degrees.

The training is very simple. All is done in one step: we perform a polynomial fit.

Notice that considering polynomials of degree higher than 1 means using the trick with powers.

Our original dataset has only one input feature, the $x$ values, and the target $y$ values. 

We add more features by taking into account higher powers of $x$.

In [None]:
from numpy.polynomial import Polynomial as P

# Number of models: polynomial degrees run from 0 to max_degree - 1
max_degree = 12

models = []
for deg in range(max_degree):
    model = P.fit(data_train[:, 0], data_train[:, 1], deg)
    print("deg=", deg, ":", model)
    models.append(model)
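
The trick with powers mentioned above can be made explicit with a minimal sketch: a polynomial fit is an ordinary linear least-squares problem whose features are the powers of $x$. The data below are synthetic and generated with a fixed seed only for this illustration; `rng_demo`, `X`, `coef_linear` and `coef_poly` are names introduced here.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

rng_demo = np.random.default_rng(0)
x = rng_demo.uniform(-1, 1, size=12)
y = -1 + 1.5 * x + 0.1 * x**2 - 1.5 * x**3 + 0.1 * rng_demo.normal(size=12)

deg = 3
# Feature matrix with columns 1, x, x^2, x^3: powers of x act as extra features
X = np.vander(x, deg + 1, increasing=True)

# Ordinary linear least squares on these features
coef_linear, *_ = np.linalg.lstsq(X, y, rcond=None)

# The polynomial fit, converted to coefficients in the original x
coef_poly = P.fit(x, y, deg).convert().coef

print("linear regression on powers:", np.round(coef_linear, 4))
print("Polynomial.fit:             ", np.round(coef_poly, 4))
```

Both approaches should give the same coefficients up to rounding errors.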

To estimate the performance of these models we need to pass $x$ values to each of them and compute the corresponding $y$.

Then we compare the computed $y$ with those in the dataset.

We will use the MSE metric to estimate the difference.

In [None]:
def mse(a, b):
    """Mean squared error"""
    return np.mean((a-b)**2)

The function that computes the approximation error reads:

In [None]:
def approx_error(model, data):
    """Approximation error"""
    xs = data[:, 0]
    ys_true = data[:, 1]
    ys_pred = model(xs)
    return mse(ys_true, ys_pred)

Let us see how the polynomials approximate our data.

We plot the graphs and compute the approximation errors, which serve as estimates of performance.

In [None]:
fig, axs = plt.subplots(nrows=max_degree//3, ncols=3, figsize=(12, (max_degree//3)*2.5))

data_plot = data_train
for model, ax in zip(models, axs.reshape(-1)):
    xp_orig, yp_orig = fun_orig.linspace()
    xp, yp = model.linspace()
    ax.scatter(data_plot[:, 0], data_plot[:, 1], color='C0');
    ax.plot(xp_orig, yp_orig, color='C0')  # original dependence that we try to recover
    ax.plot(xp, yp, color='C1')
    error = approx_error(model, data_plot)
    ax.set_title(f"deg={model.degree()}, error={error:.3}")
    ax.grid()
    
plt.tight_layout()

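We observe that the polynomial of degree 11 approximates the training data perfectly. 

It has 12 coefficients, exactly as many as there are training points, so the curve can pass through every point and the error is actually zero. The small nonzero value appears only due to numerical errors.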
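
The reason for the vanishing training error can be seen on a smaller case: a polynomial with as many coefficients as there are data points can pass through all of them. The sketch below uses five arbitrary points chosen only for illustration and compares a degree 2 fit with a degree 4 fit.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

# Five arbitrary points (illustrative values)
x = np.linspace(-1, 1, 5)
y = np.array([0.3, -1.2, 0.8, 0.1, -0.5])

errors = {}
for deg in (2, 4):
    model = P.fit(x, y, deg)
    errors[deg] = np.mean((model(x) - y) ** 2)
    print(f"deg={deg}, coefficients={deg + 1}, training MSE={errors[deg]:.3g}")
```

With five coefficients for five points the training error is zero up to rounding, which says nothing about how the curve behaves between or beyond the points.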

But what about the validation data?

These are the data that our models have never seen.

In [None]:
fig, axs = plt.subplots(nrows=max_degree//3, ncols=3, figsize=(12, (max_degree//3)*2.5))

data_plot = data_valid
for model, ax in zip(models, axs.reshape(-1)):
    xp_orig, yp_orig = fun_orig.linspace()
    xp, yp = model.linspace()
    ax.scatter(data_plot[:, 0], data_plot[:, 1], color='C0');
    ax.plot(xp_orig, yp_orig, color='C0')  # original dependence that we try to recover
    ax.plot(xp, yp, color='C1')
    error = approx_error(model, data_plot)
    ax.set_title(f"deg={model.degree()}, error={error:.3}")
    ax.grid()
    
plt.tight_layout()

Now everything is different.

The approximation error for the polynomial of degree 11 is huge.

For better visualization let us collect approximation errors as arrays and plot them.

In [None]:
import matplotlib.pyplot as plt

degrees = [model.degree() for model in models]
error_train = [approx_error(model, data_train) for model in models]
error_valid = [approx_error(model, data_valid) for model in models]

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 3))

# Approximation error vs degree of the model
ax = axs[0]
ax.plot(degrees, error_train, '-*', label='train')
ax.plot(degrees, error_valid, '-*', label='valid')
ax.set_xlabel("degree")
ax.legend()
ax.set_yscale('log')
ax.grid()

# Same, but the y scale is linear and large values are truncated
ax = axs[1]
ax.plot(degrees, error_train, '-*', label='train')
ax.plot(degrees, error_valid, '-*', label='valid')
ax.set_xlabel("degree")
ax.legend()
ax.set_ylim([0, 0.05])
ax.grid()

We see that the models with low degrees 0 and 1 are underfitted.

They perform badly on both the training and the validation data.

Models with high degrees are overfitted. The model of degree 11 has zero error on the training data because it merely memorizes them. 

Showing it unseen data results in a huge error.

The optimal models have degree near 3, the same as the original polynomial.
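
The choice of the degree can also be automated by taking the model with the smallest validation error. The sketch below is self-contained and uses its own synthetic data with a fixed seed, so the selected degree may differ from the one seen in the plots above. The names `rng_demo`, `fits`, `err_valid` and `best` are introduced here for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

rng_demo = np.random.default_rng(1)  # fixed seed only for repeatability
fun = P([-1, 1.5, 0.1, -1.5])

# Synthetic noisy measurements: 12 training and 9 validation points
x = rng_demo.uniform(-1, 1, size=21)
y = fun(x) + 0.1 * rng_demo.normal(size=21)
x_train, y_train = x[:12], y[:12]
x_valid, y_valid = x[12:], y[12:]

degrees = list(range(12))
fits = [P.fit(x_train, y_train, d) for d in degrees]
err_valid = [np.mean((m(x_valid) - y_valid) ** 2) for m in fits]

# The selected model is the one with the smallest validation error
best = int(np.argmin(err_valid))
print("validation errors:", np.round(err_valid, 4))
print("selected degree:", degrees[best])
```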

We used the training data to compute the model parameters, i.e., the polynomial coefficients.

These data cannot be used to compute the final model score.

The validation data were used to select the better model. This is also a sort of training: we take the model that performs best on the validation data. 

The validation data also cannot be used for computing the final score.

That is why we need the test data.

They are shown to the selected model just once, and no model selection is done after that.

Let us select the model of degree 3 as the result of our training and find its final score on the test data.

In [None]:
import matplotlib.pyplot as plt

the_model = models[3]
print("the_model:\n", the_model)

fig, ax = plt.subplots()

xp_orig, yp_orig = fun_orig.linspace()
xp, yp = the_model.linspace()
ax.scatter(data_test[:, 0], data_test[:, 1], color='C0');
ax.plot(xp_orig, yp_orig, color='C0')  # original dependence that we try to recover
ax.plot(xp, yp, color='C1')

error = approx_error(the_model, data_test)
ax.set_title(f"deg={the_model.degree()}, error={error:.3}")
ax.grid()

Let us discuss why the overfitting occurs for the high degree polynomials.

These polynomials have too many parameters in comparison with the dataset size.

The model can easily memorize all the inputs without finding their generalization.

Thus we can fight overfitting by decreasing the number of parameters, i.e., by considering lower degree polynomials.

Or we can (if we can) find more data. The models will be unable to memorize all these data and will be forced to search for a generalization. 

Another way to avoid overfitting is called regularization. 

Applying regularization to the model parameters means limiting the ranges of their variation. 

The idea is the same: by squeezing the range of allowed variations we prevent mere memorization of the data.

Let us look at our models again:

In [None]:
for model in models:
    print("deg=", model.degree(), '\n', model, '\n')

Models of high degree, i.e., those that are highly overfitted, have very large coefficients.

Regularization means that when searching for the optimal parameters we add a penalty for the parameter amplitudes, requiring that they not be too large.

When the penalty is the sum of squared parameters, this method is called ridge regression.
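
A minimal sketch of this idea is given below, assuming the simplest form of the penalty: the sum of squares of all coefficients, including the constant term, multiplied by a strength `lam`. The helper `ridge_fit`, the value `lam=0.1` and the synthetic data are illustrative assumptions; the closed-form solution used is the standard one for this penalty.

```python
import numpy as np

rng_demo = np.random.default_rng(0)  # fixed seed only for repeatability
x = rng_demo.uniform(-1, 1, size=12)
y = -1 + 1.5 * x + 0.1 * x**2 - 1.5 * x**3 + 0.1 * rng_demo.normal(size=12)

deg = 11
X = np.vander(x, deg + 1, increasing=True)  # columns 1, x, ..., x^11

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 using the closed-form solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Unregularized least squares for comparison
w_plain, *_ = np.linalg.lstsq(X, y, rcond=None)
w_ridge = ridge_fit(X, y, lam=0.1)

print("largest |coefficient| without penalty:", np.max(np.abs(w_plain)))
print("largest |coefficient| with penalty:   ", np.max(np.abs(w_ridge)))
```

The penalty keeps the coefficients of the degree 11 model moderate, at the price of no longer passing exactly through every training point.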

### Exercises

3\. Answer in writing: what are overfitting and underfitting? Why must a dataset be split into parts?

4\. Describe in writing the ways of fighting overfitting and underfitting.

5\. Describe in writing what cross-validation is.

6\. List the metrics that are used for regression.

7\. Reproduce the analysis of a random spam filter for the case when spam messages emerge more often. In what case does the accuracy metric become appropriate?