# Ten Quick Tips for Machine Learning 

Adapted from: 
**Ten quick tips for Machine Learning in Computational Biology** [paper](https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0155-3)

<span class="fn"><i>[REF]</i>: Chicco, D. Ten quick tips for machine learning in computational biology. BioData Mining 10, 35 (2017). https://doi.org/10.1186/s13040-017-0155-3</span>

---

### Motivation

*(from the paper)*

> Recent advances in high-throughput sequencing technologies have made large biological datasets available to the scientific community. 
>
> Together with the growth of these datasets, internet web services expanded, and enabled biologists to put large data online for scientific audiences.
> 
> As a result, scientists have begun to search for novel ways to interrogate, analyze, and process data [$\ldots$]

> A machine learning algorithm is a computational method based upon statistics, implemented in software, able to discover hidden non-obvious patterns in a dataset, and moreover to make reliable statistical predictions about similar new data. 
> 

As explained by _Kevin Yip and colleagues_ : 

> “The ability [of machine learning] to automatically identify patterns in data [$\ldots$] is particularly important when the expert knowledge is incomplete or inaccurate, when the amount of available data is too large to be handled manually, or when there are exceptions to the general cases”
[\[$1$\]](#fn1). 

**This is clearly the case for computational biology and bioinformatics.**

> Machine learning (*as well as Deep Learning*, ed.) has thus been applied to multiple computational biology problems [$\ldots$]. Despite its importance, often researchers with biology or healthcare backgrounds do not have the specific skills to run a data mining project. 
> This lack of skills (could) result in incorrect practices, which lead to error-prone analyses, or give them the illusion of success.

<span id="fn1"><i>[1]</i>: Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome Biol. 2013; 14(5):205.</span>

## Decalogue 

(*in other words:* Ten simple rules to keep in mind, whenever possible)

<a name="top"></a>
1.  [T1](#t1)  Check and arrange your input dataset properly 
2.  [T2](#t2)  Split your input dataset into three independent subsets 
3.  [T3](#t3)  Frame your biological problem into the right algorithm category 
4.  [T4](#t4)  Which algorithm should you choose to start?
5.  [T5](#t5)  Take care of the imbalanced data problem 
6.  [T6](#t6)  Optimize each hyper-parameter 
7.  [T7](#t7)  Minimize overfitting 
8.  [T8](#t8)  Evaluate your algorithm performance using the right metric 
9.  [T9](#t9)  Program your software with open source code and platforms 
10. [T10](#t10) Ask for feedback and engage with the community 

---

<a name="t1"></a>
### T1. Check and arrange your input dataset properly

>Even though it might seem surprising, the most important key point of a machine learning project does not regard machine learning: it regards your dataset properties and arrangement.

This is indeed the pillar of **Data Science**.

After addressing the issue of the dataset size, the most important priority of your project is the dataset arrangement.

<img src="toy_pipeline.png" class="maxw80" alt="Toy ML Pipeline" />


###### MLOps (Machine Learning in Production)

<img src="complex_pipeline.png" class="maxw80" alt="Full-fledged ML Pipeline" />

##### So.. how to arrange data properly?

We will see that the relationship between **Machine Learning** and **Data** 
is exactly the point! 

Therefore, we need to really understand what it means to _learn from data_. 

> This is true in general, and particularly true for the _Bio/Med_ context.

Therefore, it will be important to understand:

- _how to **represent** data_ for Machine learning;
- _how to **use** data_ in (different) Machine learning settings.

This topic will be analysed from two angles, shortly summarised in the following two 
**Research Questions** (RQ):

* **Data for Machine Learning**
    - (RQ1): How to represent data for Machine/Deep learning models?
    
* __Machine Learning for Data__
    - (RQ2): Which model should I use for different (types of) data?
    (See [T4](#t4))

###### The Idea

*They say* :

> A picture is worth a thousand words

(hope this applies to drawings and sketches too)

I am going to guide our discussion with sketches, in order to finally derive a *mind-map-like* schema that will help us make decisions about data and (deep learning) models.

#### Data for Machine Learning

In this section we will try to answer the following question:

1. How to represent data for machine learning?

Before going into the details, let's have a brief look at the **general framework** in which we are operating.

> "Machine learning is about **mapping data to predictions**"

![from data to predictions](ml_model.png)

The ultimate goal of a machine learning model is to discover **general**[$^1$](#fn1) patterns.
So, ideally, a `Model` takes some `Data` as input and generates a `prediction`.
The nature of this prediction depends on the specific learning problem at hand: it can be either the `class` the data belongs to, or its corresponding `cluster`. 


<span id="fn1"><i>[1]: </i>We will better understand the meaning of _general_ pattern, and _generalisation_ of a Machine learning model, working on a concrete example.</span>

##### A slightly more complete picture

However, the real _pipeline_ is **never** that simple. 

It involves a lot of **data science** to _preprocess_ `raw` data in order to **generate** a representation which will be suitable for the `Model`. 

Moreover, the Model itself goes through a series of further steps and operations before it is 
actually used to generate predictions. 
In fact, the Model needs to be deployed on a real system to go into production. 

This usually requires monitoring, logging, and validation of the model before it can be _served_.

<img src="ml_full_picture.png" class="maxw100" />

So, how should we **transform** data to make it ready to be processed by a Machine learning `Model`? 

#### Data Representation for Machine learning

Machine learning is about creating models from data: for that reason, we'll start by
discussing how data can (or should) be represented in order to be understood by **models**.

(With very few exceptions) In Machine learning data is assumed to be stored as a
**two-dimensional array**, of size `[n_samples, n_features]`. 

This array is usually referred to as the **feature matrix**.

In *Supervised learning settings*, there is also the **label vector**, of size `n_samples`, containing the label
of each sample.

$$
{\rm feature~matrix:~~~} {\bf X}~=~\left[
\begin{matrix}
x_{11} & x_{12} & \cdots & x_{1D}\\
x_{21} & x_{22} & \cdots & x_{2D}\\
x_{31} & x_{32} & \cdots & x_{3D}\\
\vdots & \vdots & \ddots & \vdots\\
\vdots & \vdots & \ddots & \vdots\\
x_{N1} & x_{N2} & \cdots & x_{ND}\\
\end{matrix}
\right]
$$

$$
{\rm label~vector:~~~} {\bf y}~=~ [y_1, y_2, y_3, \cdots, y_N]
$$

Here there are $N$ samples and $D$ features.

- $N$ (`n_samples`):   The number of samples: each sample is an item to process (e.g. classify).
  A sample can be a document, a picture, a sound, a video, an astronomical object,
  a row in database or CSV file,
  or whatever you can describe with a fixed set of quantitative traits.
- $D$ (`n_features`):  The number of features or distinct traits that can be used to describe each
  item in a quantitative manner.  Features are generally real-valued, but may be boolean or
  discrete-valued in some cases.

The number of **features** _must_ be fixed in advance.
Each sample (data point) is a row in the feature data array, and each feature **may be** a column 
(if features can be expressed by `1D vector`). 
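As a minimal sketch of this layout, the snippet below builds a toy feature matrix and label vector with NumPy. The values are made up for illustration and do not come from any real dataset.

```python
import numpy as np

# Hypothetical toy dataset: 4 samples (rows), each described by 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])
# One label per sample, in the same order as the rows of X.
y = np.array([0, 0, 1, 1])

n_samples, n_features = X.shape
print(n_samples, n_features)  # 4 3
print(y.shape)                # (4,)

# Supervised estimators rely on the two arrays being aligned row by row.
assert X.shape[0] == y.shape[0]
```

Row `i` of `X` and entry `i` of `y` describe the same sample, which is why the two arrays must always be shuffled or split together.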

Feature matrices can also be very high dimensional (e.g. millions of features) and, at the same time, very sparse. This is a case where `scipy.sparse` matrices (and `torch.sparse` [tensors](https://pytorch.org/docs/stable/sparse.html?highlight=torch%20sparse)) can be very useful. 
These structures are much more memory-efficient than **dense** `numpy` arrays.
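To give a rough idea of the saving, the following sketch compares the memory footprint of a sparse matrix in CSR format with that of an equivalent dense array. The matrix size and density are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical feature matrix: 1,000 samples x 10,000 features, ~0.1% non-zero entries.
X_sparse = sparse.random(1000, 10_000, density=0.001, format="csr",
                         random_state=0, dtype=np.float64)

# Memory used by the CSR representation (values + column indices + row pointers).
sparse_bytes = (X_sparse.data.nbytes + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
# Memory a dense float64 array of the same shape would need (8 bytes per entry).
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8

print(f"sparse: {sparse_bytes / 1e6:.2f} MB, dense: {dense_bytes / 1e6:.2f} MB")
```

Many scikit-learn estimators accept such matrices directly, so the data does not need to be densified before training.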

(Here, the **Supervised Machine Learning** setting is considered, without loss of generality)

<img src="ml_full_supervised.png" class="maxw100" />


[top](#top)

<a name="t2"></a>

### T2. Split your input dataset into three independent subsets (training set, validation set, test set), and use the test set only once you complete training and optimization phases


#### Training and Test Data


To evaluate how well our supervised models generalize, we can split our data into a training and a test set:

<img src="train_test_split.svg" />




* Thinking about how machine learning is normally performed, the idea of a train/test split makes sense. 

* Real world systems train on the data they have, and as other data comes in (from customers, sensors, or other sources) the classifier that was trained must predict on fundamentally *new* data. 

* We can simulate this during training using a train/test split - the test data is a simulation of "future data" which will come into the system during production. 

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=1999)
```

If we evaluate our classifier on data that it has already seen during training, we get 
**false confidence** in the predictive power of our system. 

This might lead to putting a system into production which *fails* at predicting new data! 

It is much better to use a train/test split in order to properly see how your trained model is doing on new data. (**Ultimately, this is the goal of ML**)

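```python
# using sklearn API
import numpy as np

model.fit(X_train, y_train)     # `model` is any scikit-learn estimator
y_hat = model.predict(X_test)   # predictions on data not seen during training
# Accuracy: fraction of test samples predicted correctly
print(np.mean(y_hat == y_test))
```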

If $\hat{y}_i$ is the prediction generated by the model for the $i$-th sample, and $y_i$ is the corresponding **expected** target in the ground truth, the **Accuracy** is defined as:

$$
\mathtt{accuracy}(y, \hat{y}) = \frac{1}{n_{samples}} \sum_{i=0}^{n_{samples}-1} 1(\hat{y}_i = y_i)
$$

where $1(x)$ is the **indicator function**
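A small worked example may help here. In the sketch below (with made-up labels and predictions), the indicator function becomes an element-wise comparison, and accuracy is the fraction of matches.

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions for 5 samples.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# The indicator function 1(y_hat_i = y_i) is the element-wise comparison.
matches = (y_pred == y_true)      # [True, True, False, True, False]
accuracy = matches.sum() / len(y_true)
print(accuracy)                   # 0.6
```

Three of the five predictions agree with the ground truth, hence an accuracy of 3/5.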

##### Cross-Validation and Scoring Methods

A further specialisation of the previous experimental schema considers a **three-way** split (also suggested as the reference strategy by this tip).

![Train Validation Test](train_validation_test2.svg)
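One possible way to obtain a three-way split, sketched here on synthetic data, is to call `train_test_split` twice. The 60/20/20 proportions and the use of `stratify` are assumptions made for this example, not a prescription from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# First hold out the test set (20%), then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

# 0.25 of the remaining 80% is 20% of the total: a 60/20/20 split.
print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The validation set is then used for model selection and tuning, while the test set is touched only once, at the very end.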

However, (labeled) data is often precious, and this approach lets us use only ~3/4 of our data for training. 

On the other hand, we only ever apply our model to 1/4 of our data for testing.

A common way to use more of the data to build a model, while also getting a more robust estimate of the generalization performance, is cross-validation.

In cross-validation, the data is split repeatedly into a **training** and **test-set**, with a separate model built for every pair. 

The test-set scores are then aggregated for a more robust estimate.


The most common way to do cross-validation is **k-fold cross-validation**, in which the data is first split into k (often 5 or 10) equal-sized folds, and then for each iteration, one of the k folds is used as test data, and the rest as training data:

![Cross Validation Schema](cross_validation.svg)

This way, each data point is in the held-out (**validation**) fold exactly once, and in every iteration we can use all but one `k`-th of the data for training.
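The k-fold procedure described above can be sketched with scikit-learn as follows. The synthetic dataset and the choice of logistic regression are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 5 folds: each iteration trains on 4/5 of the data and evaluates on the rest.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold, aggregated into a mean and a spread.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Every sample index appears in exactly one held-out fold.
held_out = np.concatenate([test_idx for _, test_idx in cv.split(X)])
assert sorted(held_out.tolist()) == list(range(len(X)))
```

Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is.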

[top](#top)

<a name="t3"></a>

### T3. Frame your biological problem into the right algorithm category

This is a simple and intuitive (**yet important**) piece of advice:

> You have your biological dataset, your scientific question, and a scientific goal for your project. You have arranged and engineered your dataset, as explained in Tip 1. You decide you want to solve your scientific project with machine learning, but you are undecided about what algorithm to start with.

> Before choosing the data mining method, you have to frame your biological problem into the right algorithm category, which will then help you find the right tool to answer your scientific question.

**In other words**: It is important to understand how to turn the **biological** problem into a **learning** problem.

Once the learning objective is clear, the (learning) problem has to be cast into the 
most appropriate (*learning*) framework:

* Supervised Learning
    * Classification
    * Regression
* Unsupervised Learning
    * Clustering
    * Dimensionality Reduction
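To make the four categories above concrete, here is a rough sketch with one scikit-learn estimator per category. The estimators and the synthetic data are illustrative choices, not recommendations from the paper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y_class = make_classification(n_samples=200, n_features=10, random_state=0)
y_real = X[:, 0] * 2.0 + 1.0  # hypothetical continuous target

# Supervised: both X and a target are passed to fit().
class_pred = LogisticRegression(max_iter=1000).fit(X, y_class).predict(X)  # discrete labels
real_pred = LinearRegression().fit(X, y_real).predict(X)                   # real values

# Unsupervised: only X is passed to fit().
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)  # cluster ids
X_2d = PCA(n_components=2).fit_transform(X)                                # reduced features

print(class_pred[:5], real_pred[:2], clusters[:5], X_2d.shape)
```

The key difference is whether a target is available: supervised estimators learn a mapping from `X` to `y`, whereas unsupervised ones only look for structure in `X`.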

[top](#top)

<a name="t4"></a>

### T4. Which algorithm should you choose to start?

The **Answer** to this question is the most obvious: **The simplest one!**

>The **No Free Lunch** theorem states that there is no one model that works best for every problem.  
>
> The assumptions of a great model for one problem may not hold for another problem, so it is common in machine learning to try multiple models and find one that works best for a particular problem.  
>
> This is especially true in supervised learning, and cross-validation is commonly used to assess the predictive accuracies of multiple models of varying complexity to find the best model.  A model that works well could also be trained by multiple algorithms – for example, linear regression could be trained by the normal equations or by gradient descent.

<span class="fn"><i>[Source]</i>: https://chemicalstatistician.wordpress.com/2014/01/24/machine-learning-lesson-of-the-day-the-no-free-lunch-theorem/ </span>

This is **mostly** to emphasise that **even in the DL** era, classical **ML** methods still have their say, and it is always important to include them in our experimental pipelines (**also because** they're generally super fast to try, if compared to Deep network training).
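As a sketch of what starting simple can look like in practice, the snippet below compares a trivial majority-class predictor with a plain logistic regression. The dataset is synthetic and both models are merely example baselines.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

baselines = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each baseline.
results = {}
for name, model in baselines.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

Any more complex model, deep networks included, should then be judged by how much it improves over these cheap reference points.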



(from `arXiv` - **11 Jun 2020** )

##### Is Deep learning necessary for simple Classification Tasks? 

(paper [link](https://arxiv.org/abs/2006.06730))

> Automated machine learning (`AutoML`) and deep learning (`DL`) are two cutting-edge paradigms 
> used to solve a myriad of inductive learning tasks. $\ldots$ 
> Our observations suggest that `AutoML` outperforms simple `DL` classifiers when trained on similar datasets for binary classification but integrating DL into `AutoML` improves classification performance even further.
>
> $\rightarrow$ However, the substantial time needed to train `AutoML+DL` pipelines will likely outweigh performance advantages in many applications.

[top](#top)

<a name="t5"></a>

### T5. Take care of the imbalanced data problem

> In computational biology and in biomedicine, it is often common to have **imbalanced** datasets. 
>
> An imbalanced (or unbalanced) dataset is a dataset in which one class is over-represented with respect to the other(s)


**Some workarounds**:

- _class weighting_
- Imputation and/or synthetic data
- (at the very least) under-sampling (hardly the case, ed.)

**In addition**:

- Use appropriate metrics (see [T8](#t8))
- Increase the statistics with Cross-Validation (see [T2](#t2))
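The _class weighting_ workaround can be sketched with scikit-learn's `compute_class_weight`, here applied to a made-up label vector with a 90/10 imbalance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights)  # roughly 0.56 for the majority class, 5.0 for the minority class

# Many estimators can apply the same weighting internally during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
```

With these weights, an error on a minority-class sample costs about nine times as much as an error on a majority-class one.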

[top](#top)

<a name="t6"></a>

### T6. Optimise each hyper-parameter

This is the only rule I would argue with.

Optimising _each_ hyper-parameter is **hardly** feasible, especially in Deep Learning, where models have millions of parameters and plenty of *hyper-parameters*.

So, **NO**! Instead, some **general rules of thumb** might apply:

In [T1](#t1), we explored how Machine learning models expect the data to be represented 
(i.e. `Data` $\mapsto$ `Machine Learning`). Here the focus is on the reverse relationship:
`Machine Learning` $\mapsto$ `Data`.

In particular, we will try to come up with some practical _rules of thumb_ that can guide us in choosing among the different _families_ of **Deep Learning** models, given the **nature** and the **type** of data.

<img src="data.png" class="maxw50" />

###### Data Shape

The first characteristic of the data that we will analyse refers to the **shape** of data, also related to **representation**.

Data can be **structured** or **un**structured, and its representation can be **dense** or **sparse** (whatever the shape).

<img src="data_shape.png" class="maxw85" />

###### Data Type

<img src="data_type.png" class="maxw85" />

###### ML for Data: Choosing the right estimator

A very popular page in the **scikit-learn** documentation reports a [guide map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)[$^1$](#dt) containing practical rules and conditions that could be used to choose the right estimator to use.

<img src="https://scikit-learn.org/stable/_static/ml_map.png" alt="scikit-learn map" class="maxw85" />


<span class="fn" id="dt"><i>[1] :</i> In the form of a Decision Tree</span>

###### The (almost) complete picture

<img src="the_full_map.png" class="maxw100" />

As for **Unsupervised learning**, we have options too!

* AutoEncoders (`AE`)
    - Convolutional `AE`

* **Variational** AutoEncoders (`VAE`)

* Generative Adversarial Networks (`GAN`)

[top](#top)

<a name="t7"></a>

### T7. Minimise Overfitting

Ultimately, the goal of machine learning algorithms is to be able to **discover patterns** in data. 

The question is *how to be sure that we have truly learnt a **general** pattern, and not simply memorised our (training) data?*

>In order to discuss this phenomenon more formally, 
we need to differentiate between **training error** and **generalization error**.

The *training error* is the error of our model as calculated on the training dataset; 
while *generalization error* is the expectation of our model's error when applied to
an infinite stream of additional data points
drawn from the **same underlying data distribution as our original sample**.
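The gap between the two errors can be observed directly. In the sketch below, an unconstrained decision tree is fitted on noisy synthetic data; the error on the held-out test set is used as an estimate of the generalization error, and the dataset and model are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y=0.2 assigns a random class to 20% of the samples.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# An unconstrained tree is flexible enough to memorise the training set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_error = 1 - tree.score(X_train, y_train)
test_error = 1 - tree.score(X_test, y_test)  # estimate of the generalization error
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```

A training error close to zero combined with a clearly larger test error is the typical signature of overfitting.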

##### Bias-Variance Tradeoff

(*ML Theory and Statistical Learning*)

- **VARIANCE**: "The amount by which the model **varies** as we change training data is **Variance**"

- **BIAS**: "The **bias** reflects the amount of **assumptions** we make about the model"

##### Regularisation (against Overfitting)

*Weight decay* (commonly called *L2* regularization),
might be the most widely-used technique
for regularizing parametric machine learning models.

In practice, we characterize the regularisation via the *regularisation constant* $\lambda \geq 0$, 
a non-negative hyperparameter that we fit using validation data:

$$l(\mathbf{w}, b) + \frac{\lambda}{2} \|\mathbf{w}\|^2.$$

For $\lambda = 0$, we recover our original loss function.
For $\lambda > 0$, we restrict the size of $|| \mathbf{w} ||$.

While L2-regularized linear models constitute
the classic *ridge regression* algorithm,
L1-regularized linear regression
is a similarly fundamental model in statistics
(popularly known as *lasso regression*).
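The effect of $\lambda$ on $\|\mathbf{w}\|$ can be checked numerically. The sketch below assumes a squared-error loss $l(\mathbf{w}) = \frac{1}{2}\|X\mathbf{w} - \mathbf{y}\|^2$ with no bias term, for which the penalised objective has the closed-form minimiser $(X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$; the data is synthetic.

```python
import numpy as np

# Synthetic linear-regression data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ true_w + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, lam):
    """Minimise 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2 (no bias term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# The norm of the solution shrinks as the regularisation constant grows.
norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(X, y, lam)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:>5}: ||w|| = {norms[-1]:.3f}")
```

With `lam=0.0` the function returns the ordinary least-squares solution; larger values pull the weights towards zero.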

(More on **Weight Decay** [here](https://github.com/dsgiitr/d2l-pytorch/blob/master/Ch06_Multilayer_Perceptrons/Weight_Decay.ipynb))

We will talk about another _technique_ for model regularisation, also known as **structural regularisation**: **Dropout**.

[top](#top)


<a name="t8"></a>

### T8. Evaluate your algorithm performance with the Matthews correlation coefficient (MCC) or the Precision-Recall curve

>When you apply your trained model to the validation set or to the test set, you need statistical scores to measure your performance.

In fact, in a typical supervised binary classification problem, for each element of the validation set (or test set) you have a label stating if the element is `positive` or `negative` (i.e. `1` or `0`). 

Your machine learning algorithm makes a prediction for each element of the validation set, 
expressing if it is `1` or `0`, and, based upon these predictions and the gold-standard labels, 
it will assign each element to one of the following categories: 

- true negatives (`TN`)  $\mapsto$ predicted 0; expected 0;
- true positives (`TP`)  $\mapsto$ predicted 1; expected 1; 
- false positives (`FP`) $\mapsto$ predicted 1; expected 0; 
- false negatives (`FN`) $\mapsto$ predicted 0; expected 1; 

###### Confusion Matrix 

<table class="wikitable" style="border:none; float:left; margin-top:0;">
<tbody><tr>
<th style="background:white; border:none;" colspan="2" rowspan="2">
</th>
<th colspan="2" style="background:none;">Actual class
</th></tr>
<tr>
<th>P
</th>
<th>N
</th></tr>
<tr>
<th rowspan="2" style="height:6em;"><div style="display: inline-block; -ms-transform: rotate(-90deg); -webkit-transform: rotate(-90deg); transform: rotate(-90deg);;">Predicted<br>class</div>
</th>
<th>P
</th>
<td><b>TP</b>
</td>
<td>FP
</td></tr>
<tr>
<th>N
</th>
<td>FN
</td>
<td><b>TN</b>
</td></tr>
</tbody></table>

**Accuracy**

$$
accuracy = \frac{TP+TN}{TP+TN+FP+FN}
$$

**F1-Score**

$$
F1 \; score = \frac{2 \cdot TP}{2 \cdot TP+FP+FN}
$$

**MCC**

$$
MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\cdot(TP+FN)\cdot(TN+FP)\cdot(TN+FN)}}
$$

As also reported in [2](#fnmcc), the **most informative metric** to evaluate a confusion matrix is the Matthews correlation coefficient (`MCC`).


###### Cases & Flaws:

`TP = 90, FP = 5; TN = 1, FN = 4.` $\mapsto$ OK w/ positive; KO w/ negative

`ACC=91%`; `F1=95.24%`; `MCC=0.14`

---

`TP = 95, FP = 5; TN = 0, FN = 0` $\mapsto$ Just predicting the majority class (very often the case)

`ACC=95%`; `F1=97.44%`; `MCC=undefined` (`TN=0` and `FN=0`, so the denominator is zero)
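The two cases above can be reproduced with a few lines of plain Python. The helper `binary_metrics` is a hypothetical name introduced for this sketch; it simply implements the three formulas given earlier and returns `nan` when the MCC denominator is zero.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, F1, MCC) computed from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any factor of the denominator is zero.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else float("nan")
    return accuracy, f1, mcc

# Case 1: good on positives, poor on negatives.
acc1, f1_1, mcc1 = binary_metrics(tp=90, fp=5, tn=1, fn=4)
print(f"ACC={acc1:.2%} F1={f1_1:.2%} MCC={mcc1:.2f}")

# Case 2: always predicting the majority (positive) class.
acc2, f1_2, mcc2 = binary_metrics(tp=95, fp=5, tn=0, fn=0)
print(f"ACC={acc2:.2%} F1={f1_2:.2%} MCC={mcc2}")
```

In both cases accuracy and F1 look excellent, while MCC is either close to zero or undefined, exposing that the negative class is essentially never predicted correctly.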


<span id="fnmcc"><i>[2]</i>: Chicco D, Jurman G (January 2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation". BMC Genomics. 21 (6). doi:10.1186/s12864-019-6413-7 </span>

[top](#top)

<a name="t9"></a>

### T9. Program your software with open source code and platforms

*in other words:* **DO NOT RE-Invent, but contribute** _plus_

![](ds_code.png)

[top](#top)

<a name="t10"></a>

### T10. Ask for feedback and help to computer science experts, or to collaborative Q&A online communities

- Stack Overflow
- Open issues on GitHub
- Write to mailing lists

Be polite, and respectful!


[top](#top)

---

### (Extra) T11. Reproducibility and Replicability

![](reproducible.png)