# Everything you need to know (not really actually) about machine learning, as a non-data scientist

This is a short guide to the code and machine learning theory used in my project. 

Note 1: It is strongly recommended that you read this document in order, as ML concepts are introduced and explained as we go along.

Note 2: This introduction does not delve too deep into the mathematical theory of ML because this crash course is only meant to cover enough material to follow the basic models in the `Brain_Decoding.ipynb` file. 

## The Basics

### So, what the heck is machine learning? 

First, let's begin with artificial intelligence. Its definition is quite broad, but one could say that it describes machines that can complete complex tasks usually done by humans. These tasks include problems that seem trivial to humans, like image classification, but which have historically been nearly impossible for computers to crack. However, the tasks tackled by artificial intelligence have not been limited by human performance; in fact, many algorithms *outperform* humans. For example, Google DeepMind's AlphaGo program beat Ke Jie, the world's top-ranked Go player at the time, in 2017.

<img src="https://i.postimg.cc/bNTvMtkM/ml.png" width="400" height="400">

source: [Sumo Logic](https://www.sumologic.com/blog/machine-learning-deep-learning/)

Anyways, machine learning is a specific field within artificial intelligence. Traditionally, programming consisted of writing a set of instructions for the computer to run, and correcting these instructions as needed. The computer executes these instructions and returns an output. However, this approach has a severe limitation: we can only use computers to solve problems where we already "know what to do". The computer is used purely to carry out our calculations, so it is limited by the rules we give it. With machine learning, we give the computer *data* instead of *instructions*, and the computer's task is to figure out the *instructions*. In a sentence, machine learning describes models that correct themselves through the use of applied math rather than through manual instruction.

[![trad.png](https://i.postimg.cc/vH4Lhbpp/trad.png)](https://postimg.cc/RNzHV5CR)

source: [Data Science Central](https://www.datasciencecentral.com/profiles/blogs/traditional-programming-versus-machine-learning-in-one-picture)

***

### A quick note on data

In general, data is stored as *samples*. A sample is a data point from your dataset. For this tutorial, each sample is made of a *label* and *features*. A feature is a property of your data. A label is the characteristic of your data that you want to predict.

Here is an example to tie these terms together:

Imagine you have a dataset of NBA players:

**NBA Players of the 2018-2019 Season**

| Player  | Points/Game | Rebounds/Game | Assists/Game | Salary (USD) |
|---------|-------------|---------------|--------------|--------------|
| Harden  | 36.1        | 6.6           | 7.5          | 30 421 854   |
| Curry   | 27.3        | 5.3           | 5.2          | 37 457 154   |
| Durant  | 26.0        | 6.4           | 5.9          | 30 000 000   |
| James   | 27.4        | 8.5           | 8.3          | 35 654 150   |
| Leonard | 26.6        | 7.3           | 3.3          | ?            |

Each player is a sample, their features are their in-game statistics (PPG, RPG, APG) and their salaries are the labels. An example of a machine learning problem would be to try to predict a player's salary based on their statistics. Our model would learn *how* PPG, RPG and APG affect salary by using the known samples: Harden, Curry, Durant and James. Then, we can use the model to predict the salary of players whose salary we don't know, like Leonard.
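To make these terms concrete, here is a minimal sketch of how the table above could be stored in Python. The variable names (`features`, `labels`, `leonard`) are my own choice for illustration and are not taken from the notebook.

```python
# Each row of `features` is one sample (a player): [PPG, RPG, APG].
players = ["Harden", "Curry", "Durant", "James"]
features = [
    [36.1, 6.6, 7.5],
    [27.3, 5.3, 5.2],
    [26.0, 6.4, 5.9],
    [27.4, 8.5, 8.3],
]

# One label per sample: the salary we want the model to learn to predict.
labels = [30_421_854, 37_457_154, 30_000_000, 35_654_150]

# Leonard has features but no known label; this is the sample we would predict.
leonard = [26.6, 7.3, 3.3]

n_samples = len(features)
n_features = len(features[0])
print(n_samples, "samples with", n_features, "features each")
```

Keeping features and labels in two separate, equally long containers is the layout most machine learning libraries expect.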

***

This tutorial deals exclusively with *supervised* learning. This means that when we train our model, we give it the labels so that it can learn the connections between features and label. In *unsupervised* learning tasks, we do *not* give the labels during training. It's a whole other topic which is beyond the scope of this tutorial.

## Okay, but what does it *look* like?

Generally, machine learning models have the same steps:

1. **Split the data**: As I just mentioned, machine learning models "learn" by going through loads of data. So, the first step is to split the data into a training set and a test set (and potentially use cross-validation, which we will get to later). Unsurprisingly, the training set is used to train the model and the test set is used to test the model's accuracy. We keep these separate because results from testing the model on data it has already seen are useless. Imagine you wrote a test, gave your friends the answers, and then had them take it. Their near-perfect scores would not be representative of how well they know the material; they already knew what the answers were! Back to machine learning: testing your model on data it has already seen does not tell you whether your model learned to draw meaningful connections between features and labels.

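2. **Train the model**: The model is given the training set, features and labels included, so that it can learn the connections between them.
3. **Test the model**: The model predicts labels for the test set, and these predictions are compared against the true labels to measure the model's accuracy.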
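The three steps above can be sketched with scikit-learn as follows. This is only an illustration: the toy dataset from `make_moons`, the 80/20 split and the choice of classifier (a kNN, introduced further down) are assumptions made for the example, not necessarily what `Brain_Decoding.ipynb` does.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Toy data (an assumption for this sketch): 200 two-feature samples, two classes.
X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# 1. Split the data: hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Train the model: it sees both the features and the labels.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 3. Test the model: accuracy on samples it has never seen.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out samples.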

***

Roughly in order of complexity, the models explored in this tutorial are:

- (Supervised) k-Nearest Neighbours (kNN)
- Logistic Regression
- Support Vector Machines (SVM)
- Convolutional Neural Network (CNN)

Note: There are typically many, *many* variants of each machine learning model. I have chosen the most basic versions but I will mention the extra bells and whistles you can explore on your own. 

### k-Nearest Neighbours (kNN)

**The idea**: <br>
Say, for example, I have two sets of points, green and blue, situated on a Cartesian plane. 


<img src="https://i.postimg.cc/QMGhdgL0/Screen-Shot-2019-08-23-at-2-02-04-PM.png" width="400" height="400">

Their coordinates are their features, and their colour is their label. The idea is that if I have a point whose colour I do not know, I can place this point onto the same coordinate system, and assume that it belongs to the same group as the points closest to it in the coordinate space (i.e., most similar to it). For example, if I were considering the 3 closest points, then the unknown point in the visual above would be classified as green because there are 2 green points and 1 blue point.

Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data.

**The 'k' value**: <br>
The 'k' in "k-Nearest Neighbours" defines how many points we want to consider in the classification of our point. For example, if we define `k` to be 3, we look at the 3 closest points. In the diagram, that would be 2 green points and 1 blue point. So, we would conclude our unknown point is green. However, if we define our `k` to be 5, we would look at the 5 closest points and conclude that the point is blue. 
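Here is a minimal from-scratch sketch of that vote, written in plain Python. The coordinates are made up for illustration (they are not read off the diagram); they are simply arranged so that the 3 closest points are mostly green while the 5 closest are mostly blue.

```python
import math
from collections import Counter

# Made-up training points: (x, y) coordinates are the features, colour is the label.
training_points = [
    ((1.0, 0.0), "green"),
    ((0.0, 1.5), "green"),
    ((-2.0, 0.0), "blue"),
    ((0.0, -2.5), "blue"),
    ((3.0, 0.0), "blue"),
    ((4.0, 4.0), "green"),
]


def knn_predict(point, training_points, k):
    """Label `point` by majority vote among its k closest training points."""
    by_distance = sorted(training_points, key=lambda item: math.dist(point, item[0]))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]


unknown = (0.0, 0.0)
print(knn_predict(unknown, training_points, k=3))  # green: 2 green vs 1 blue
print(knn_predict(unknown, training_points, k=5))  # blue: 2 green vs 3 blue
```

Using an odd `k` with two classes avoids tied votes, which is one reason values like 3 and 5 are common starting points.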

`k` is an example of a *hyper-parameter*. A hyper-parameter is a setting that we must decide before the training and testing of our model. This is a first look at how selecting hyper-parameters can strongly influence our model.

Choosing the optimal `k` value is strongly dependent on your data. In a classification task, we assume that a dataset can be classified into two (or more) categories. However, almost all real-world data is affected by noise, meaning that the classes are not completely distinct (i.e., your data is not perfectly separated). For example, consider the following plot of blue and red points, where the areas of the graph are coloured according to the classifier's prediction:

<img src="https://i.postimg.cc/j5bnFB86/visual.png" width="500" height="500">

[source: Corsica](http://www.corsica.hockey/blog/2016/10/31/hockey-and-euclid-predicting-aav-with-k-nearest-neighbours/)

The smaller the `k` value, the more closely the predictions will fit the training data, because the classifier will correctly classify the points that are "unique" (like those that are blue but situated among red points; they are the reason why the classification is not very clear). However, the classifier will also be more sensitive to noise. Please note that the classifier would be more accurate on *the dataset in question* and not on datasets in general. This means that your classifier will be prone to *overfitting*, because not all datasets will contain those "unique" points that your classifier has adapted to.

A greater `k` value will be less accurate on the dataset in question (as it will incorrectly classify the "unique" points), but it will be more generalized, meaning that it will be more applicable to unknown datasets. By considering more points for each prediction, the classifier spreads its vote over a greater number of neighbours, so it is less affected by noise (the "unique" points). Taken too far, though, a very large `k` blurs the boundary between classes, and predictions drift toward the most common label.
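One way to see this trade-off is to compare the accuracy on the training data with the accuracy on held-out data for several values of `k`. The sketch below assumes a noisy toy dataset from scikit-learn's `make_moons`; the particular `k` values and the 70/30 split are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy toy data (an assumption for this sketch), so some points sit in the "wrong" region.
X, y = make_moons(n_samples=300, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for k in (1, 5, 15, 45):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Accuracy on data the model has stored vs. data it has never seen.
    results[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"k={k:>2}  train={results[k][0]:.2f}  test={results[k][1]:.2f}")
```

With `k=1` every training point is its own nearest neighbour, so the training score is perfect no matter how noisy the data is. That is memorising rather than learning, which is why the test column is the one to watch when choosing `k`.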

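Something to keep in mind is that a higher `k` value makes each prediction somewhat more expensive, because more neighbours have to be found and tallied for every unknown point. With a brute-force search, however, the distance to every stored training point is computed regardless of `k`, so the size of the training set usually matters more than `k` itself.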

There is another neighbours-based classifier, implemented through the [RadiusNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.RadiusNeighborsClassifier.html#sklearn.neighbors.RadiusNeighborsClassifier) class. It selects points for classification using a fixed radius `r` around the unknown point. We will focus on the `KNeighborsClassifier`, but feel free to explore on your own.
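If you do want to try it, here is a small sketch of the radius-based variant. The six points and the radius of `2.0` are made up for illustration.

```python
from sklearn.neighbors import RadiusNeighborsClassifier

# Made-up points: two small clusters with labels 0 and 1.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

# Every training point within distance 2.0 of the unknown point gets a vote.
model = RadiusNeighborsClassifier(radius=2.0)
model.fit(X, y)

predictions = model.predict([[0.5, 0.5], [5.5, 5.5]])
print(predictions)  # [0 1]
```

Unlike kNN, the number of voters here changes from point to point: an unknown point in a dense region gets many votes, while one in a sparse region may get very few.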

These neighbors-based classifiers are not like typical machine learning models. They do not actually model the data; rather, they keep all the training data in memory and search this collection when asked to classify unknown data. Thus, they are called *non-generalizing* machine learning methods.

**The notion of 'distance'**: <br>
The "distance" that is computed for kNNs has several forms, because there are different ways to define it. It is set through the `metric` argument of the kNN class. The default metric in scikit-learn is the `'minkowski'` distance, defined as `sum(|x - y|^p)^(1/p)`, which also depends on the `p` parameter. The default value of `p` is `2`, which makes it equivalent to the classic Euclidean distance. Other choices for the distance metric can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html).
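The Minkowski formula is short enough to write out by hand. The helper below is my own sketch of the formula (it is not scikit-learn's implementation), and it shows how the choice of `p` changes the result for the same two points.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length points: sum(|x - y|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


a = (0.0, 0.0)
b = (3.0, 4.0)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean: the straight-line distance)
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan: the sum of the differences)
```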

**A note on dimensionality**: <br>
Practically speaking, we often don't restrict ourselves to data that can only be described in 2-D. Real-world data is often characterized by many more features than just two. Although hard to visualize, k-Nearest Neighbours can be applied to data of much higher dimensions and the math is still valid. In this tutorial, each sample is a brain image, and the *features* of each sample are the activation values of each voxel.
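As a rough sketch of what "one feature per voxel" means in practice, the snippet below flattens a few fake 3-D volumes into feature vectors with NumPy. The random values and the 4 x 4 x 4 shape are stand-ins chosen for the example; real brain images are far larger and would be loaded from file.

```python
import numpy as np

# Stand-in for real data: 10 fake "brain images" of 4 x 4 x 4 voxels each.
rng = np.random.default_rng(0)
images = rng.random((10, 4, 4, 4))

# Flatten each image so that every voxel becomes one feature.
X = images.reshape(len(images), -1)
print(X.shape)  # (10, 64)

# The distance formula is unchanged; it just sums over 64 coordinates instead of 2.
distance = np.linalg.norm(X[0] - X[1])
print(distance)
```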

**The notion of 'weights'**: <br>
When implementing a kNN, another hyper-parameter to consider is the weight function used in prediction. You must choose how strongly you want each neighbouring point to count. In scikit-learn, the default argument is `'uniform'`, meaning that all points used for the classification are valued equally. Conversely, the `'distance'` argument will weigh points by the inverse of their distance. This means that points further away will have less influence on the prediction than closer points.
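To see how the two weighting schemes can disagree, here is a small from-scratch sketch. The three points are made up, and the function is my own illustration of the idea rather than scikit-learn's code; it also assumes the unknown point never sits exactly on a training point, which would cause a division by zero.

```python
import math
from collections import defaultdict

# Made-up points: one green point very close to the origin, two blue points further away.
training_points = [
    ((0.5, 0.0), "green"),
    ((2.0, 0.0), "blue"),
    ((0.0, 2.5), "blue"),
]


def weighted_vote(point, training_points, k, weights="uniform"):
    """Vote among the k nearest points, with equal or inverse-distance weights."""
    nearest = sorted(training_points, key=lambda item: math.dist(point, item[0]))[:k]
    votes = defaultdict(float)
    for coords, label in nearest:
        d = math.dist(point, coords)
        # 'uniform': one vote each; 'distance': a vote is worth 1 / distance.
        votes[label] += 1.0 if weights == "uniform" else 1.0 / d
    return max(votes, key=votes.get)


unknown = (0.0, 0.0)
print(weighted_vote(unknown, training_points, k=3, weights="uniform"))   # blue: 2 votes vs 1
print(weighted_vote(unknown, training_points, k=3, weights="distance"))  # green: 2.0 vs 0.9
```

With inverse-distance weights, the single green point at distance 0.5 is worth 2.0, which outweighs the two blue points (0.5 + 0.4 = 0.9).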

**The algorithm: how the model searches its memory**: <br>
For each prediction, the kNN has to compute the distance between the points it has in memory and the unknown point. If you have tens, hundreds or thousands of features and points, you can imagine that this amount of computation can be extremely demanding. To increase the speed at which we process data, we can keep improving our hardware (think [Moore's Law](https://en.wikipedia.org/wiki/Moore%27s_law)), but another (perhaps more important) way is to think of better searching algorithms.
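In scikit-learn, the search strategy is chosen through the `algorithm` argument of `KNeighborsClassifier`. `'brute'` compares the unknown point against every stored point, while `'kd_tree'` and `'ball_tree'` organise the stored points into a tree so that many of them can be skipped. The sketch below uses random toy data (an assumption for the example) to check that the strategies agree on their predictions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random toy data (an assumption for this sketch): 500 stored points, 100 unknown points.
rng = np.random.default_rng(0)
X_train = rng.random((500, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 1).astype(int)
X_unknown = rng.random((100, 2))

predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    model = KNeighborsClassifier(n_neighbors=3, algorithm=algorithm)
    model.fit(X_train, y_train)
    predictions[algorithm] = model.predict(X_unknown)

# The search strategy changes how fast neighbours are found, not which ones are found.
same = (
    np.array_equal(predictions["brute"], predictions["kd_tree"])
    and np.array_equal(predictions["brute"], predictions["ball_tree"])
)
print(same)
```

Because the answers match, picking an algorithm is purely a question of speed and memory, not of accuracy.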

Searching algorithms are described in *Big O notation*, which is basically a mathematical notation that describes how well an algorithm can scale to larger datasets. When we are dealing with small datasets, the choice of algorithm and the strength of the processor are often insignificant. However, as we deal with larger and larger datasets, the choice of algorithm matters more and more.

This is often described through the concept of "order of complexity", that is, a mathematical expression that basically describes how the cost of an algorithm grows as the dataset grows.

**Other `KNeighborsClassifier` arguments**: <br>

- `weights`: For example, if you specify `'distance'`, points will be weighted in proportion to the inverse of their distance, so closer points will have a greater influence than further points.
- `algorithm`: If unspecified, `'auto'` will be chosen and scikit-learn will attempt to decide the most appropriate algorithm based on the values passed to the `fit` method.
- `leaf_size`: The leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- `metric_params`: Additional keyword arguments for the metric function.
- `n_jobs`: The number of parallel jobs to run for the neighbours search.

On the technical side, a kNN places each sample in a vector space so that it can compute the distance between points and make predictions for new ones.

source: [scikit-learn](https://scikit-learn.org/stable/modules/neighbors.html)