# 2. Core concepts

In this notebook, we will review:
- Estimators in _scikit-learn_, and some of their functions.
- How estimators can be supervised models that perform classification or regression tasks, as well as unsupervised models.
---

# Some important concepts

Let's quickly review some conceptual distinctions in Machine Learning (ML). This section is a refresher. If any of these concepts are unfamiliar, please consult the suggested reading in [Notebook 1](./01-preliminaries.ipynb).

## What machine learning is about
An excellent working definition of ML can be found in [Tal Yarkoni's tutorial](https://github.com/neurohackademy/nh2020-curriculum/blob/master/tu-machine-learning-yarkoni/02-core-concepts.ipynb): ML is the field of science/engineering that seeks to build systems capable of learning from experience. The goal of ML is to develop algorithms that can learn from data with a minimum set of explicitly programmed rules on how to do so.

There are two main types of ML models depending on how they learn from data: supervised and unsupervised.

## Supervised ML
In supervised ML, we have available the real values of the variables we want to predict. The model can then use this information to train itself: it compares its predicted values with the real ones using a __loss function__, and uses an __optimization algorithm__ to iteratively make small adjustments and improve its performance.
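
To make the roles of the loss function and the optimization algorithm concrete, here is a minimal sketch using plain NumPy. It assumes a model with a single parameter `w`, mean squared error as the loss, and gradient descent as the optimizer. The toy data, the learning rate and the number of iterations are made up for illustration.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 3 * x plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y_true = 3.0 * x + rng.normal(scale=0.1, size=100)


def mse_loss(y_true, y_pred):
    """Loss function: mean squared difference between real and predicted values."""
    return np.mean((y_true - y_pred) ** 2)


w = 0.0              # single model parameter, starting at an arbitrary value
learning_rate = 0.1  # size of each adjustment (hypothetical choice)
initial_loss = mse_loss(y_true, w * x)

for _ in range(200):
    y_pred = w * x
    # Gradient of the mean squared error with respect to w
    grad = -2.0 * np.mean((y_true - y_pred) * x)
    # Optimization step: move w a little in the direction that lowers the loss
    w -= learning_rate * grad

final_loss = mse_loss(y_true, w * x)
print(f"Learned w: {w:.3f}")
print(f"Loss before training: {initial_loss:.3f}, after training: {final_loss:.3f}")
```

The learned `w` should end up close to the value used to generate the data, and the loss after training should be much smaller than before.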

### Regression vs classification
Supervised learning tasks can be further divided into regression and classification. Regression models seek to predict a continuous variable (e.g. age), while classification models predict discrete labels (e.g. wine class).
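
As a small illustration, _scikit-learn_ provides the utility `type_of_target()`, which reports what kind of target an array looks like. The two arrays below are made-up examples of an age variable and a wine class variable.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Made-up targets for illustration
y_age = np.array([23.5, 41.2, 35.8, 62.1, 18.9, 54.3, 29.7, 47.6])  # continuous values
y_wine = np.array([0, 2, 1, 1, 0, 2, 0, 1])                         # discrete labels

print(type_of_target(y_age))   # kind of target suited for regression
print(type_of_target(y_wine))  # kind of target suited for classification
```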

## Unsupervised ML
In unsupervised ML, the real values (labels) are unknown. The algorithm instead seeks to find a pattern in the data that might be useful.


# Estimators

In _scikit-learn_ an [estimator](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html#estimators-objects) is a Python object that __learns from data__. This means that both supervised (classification or regression) and unsupervised models can be constructed and fitted using estimators. We will review some properties of estimators in _scikit-learn_ using an example for each of these types of models.


## Linear Regression

A linear regression is an example of a supervised regression model. Used as a machine learning tool, linear regression predicts the values of a continuous variable from a __linear combination__ of one or more features.

For example, if we had a feature matrix $X$ containing the values of features $x_1$ and $x_2$, the value $\hat{y}_i$ predicted by linear regression for sample $i$ could be expressed as:

$$\hat{y}_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2}$$

> - where $\beta$ are the parameters the model learns from the data to make the predictions
> - $\beta_1$ and $\beta_2$ are also called the coefficients, and $\beta_0$ the intercept

Let's now see how we can fit a linear regression model using _scikit-learn_. 
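
Before fitting anything, it may help to evaluate the linear regression equation above by hand. The sketch below uses made-up values for $\beta_0$, $\beta_1$ and $\beta_2$ (they are not learned from data) and a tiny feature matrix with three samples.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
beta_0 = 1.0                  # intercept
beta = np.array([2.0, -0.5])  # coefficients beta_1 and beta_2

# Feature matrix with 3 samples (rows) and 2 features (columns)
X_toy = np.array([
    [1.0, 2.0],
    [0.0, 4.0],
    [3.0, 0.0],
])

# y_hat_i = beta_0 + beta_1 * x_i1 + beta_2 * x_i2, computed for all samples at once
y_hat = beta_0 + X_toy @ beta
print(y_hat)  # [ 2. -1.  7.]
```

For the first sample, for instance, the prediction is $1 + 2 \cdot 1 - 0.5 \cdot 2 = 2$.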


We will first need to create a dataset for this exercise. With _scikit-learn_ we can do so using the `make_regression()` function (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression)). Let's create one containing 400 samples and 100 features. We will also define 20 of these features as informative, and add some Gaussian noise to the target to make the task harder for the model:

In [None]:
import numpy as np
from sklearn.datasets import make_regression

# Create fake dataset
X, y = make_regression(
    n_samples=400, n_features=100, n_informative=20, noise=10, random_state=0
)

# Print shape of feature matrix and labels
print(f"Shape of dataset: {np.shape(X)}")
print(f"Shape of labels: {np.shape(y)}")

Since it's a regression problem, let's make sure the target of our model is a continuous variable. Let's print the first ten values of `y`:

In [None]:
print(y[:10])

Now let's create a linear regression estimator using `LinearRegression` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression)).

In [None]:
from sklearn.linear_model import LinearRegression

# Create model
reg = LinearRegression()

Estimator objects contain certain parameters that define how they will behave when learning from the data, as well as their outputs. These are called __estimator parameters__. Let's inspect the ones of `reg`:

In [None]:
# Print estimator parameters
vars(reg)

These parameters can be set by passing them as arguments when creating the estimator, or changed afterwards using `set_params()`:

In [None]:
# Set new model parameters
# `n_jobs` only affects how the computation is run, not the fitted values
reg.set_params(**{"n_jobs": 1})
vars(reg)
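
Besides `vars()`, estimators expose their parameters through the `get_params()` method, which returns them as a dictionary. The sketch below uses a separate estimator, hypothetically named `model`, so that it does not interfere with `reg`.

```python
from sklearn.linear_model import LinearRegression

# Set a parameter when creating the estimator
model = LinearRegression(fit_intercept=False)
print(model.get_params())

# Change it afterwards; set_params() returns the estimator itself
model.set_params(fit_intercept=True)
print(model.get_params()["fit_intercept"])
```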

### Training the model

Once the estimator object has been created, it can learn the values of its parameters from the data. For this we need to call the `fit()` method, and pass our feature matrix (`X`) and true values (`y`) as input:

In [None]:
# Fit linear regression model
reg = reg.fit(X, y)

Let's inspect the attributes of `reg` again:

In [None]:
# Print the names of the attributes
vars(reg).keys()

`reg` now contains new attributes. These are referred to as __estimated parameters__, because they have been learned from the data. In _scikit-learn_, their names end with an underscore (`_`).
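
One way to see this naming convention in action is to list the attributes whose names end in an underscore before and after fitting. The helper `fitted_attributes()` below is a hypothetical convenience function written for this illustration, not part of _scikit-learn_, and the small dataset is generated only so that the example runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression


def fitted_attributes(estimator):
    """Hypothetical helper: names of public attributes ending in an underscore."""
    return sorted(
        name for name in vars(estimator)
        if name.endswith("_") and not name.startswith("_")
    )


X_demo, y_demo = make_regression(n_samples=50, n_features=5, random_state=0)
model = LinearRegression()

before_fit = fitted_attributes(model)
after_fit = fitted_attributes(model.fit(X_demo, y_demo))

print(f"Before fit: {before_fit}")
print(f"After fit: {after_fit}")
```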

For example, we can now access the coefficients learned by our linear model. We should have as many coefficients as features in our dataset:

In [None]:
print(f"Number of coefficients: {reg.coef_.shape[0]}")

Let's also print the values of some of them, and the value of the intercept.

In [None]:
# Define coefficients and intercept
coefs = reg.coef_
intercept = reg.intercept_

# Print
print(f"Model coefficients (first 10):\n {coefs[:10]} \n")
print(f"Model intercept: \n {intercept}")

Be careful if your intention is to interpret the coefficients of the model. This process is far from straightforward. Read this very useful [example](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html) to learn more about this issue.

### Making predictions with the model

Now that our model is fitted, we can use it to make predictions. In _scikit-learn_, this is achieved by calling the `predict()` method.

Let's predict the target values for the samples in `X` using our fitted model, and compare them to the real values for the first ten samples:

In [None]:
import pandas as pd

# Predict labels with trained model
y_pred = reg.predict(X)

# Create dataframe for printing the predictions
df = pd.DataFrame({"y_pred": y_pred[:10], "y_real": y[:10]})
df
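
For a linear regression, `predict()` amounts to the linear combination of the features with the learned coefficients, plus the intercept. The sketch below checks this on a small, separately generated dataset (the sizes are arbitrary, and the names `X_demo`, `y_demo` and `model` are used only for this illustration).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small dataset so the example runs on its own
X_demo, y_demo = make_regression(
    n_samples=50, n_features=5, noise=10, random_state=0
)
model = LinearRegression().fit(X_demo, y_demo)

# Linear combination of the features with the learned parameters
manual_pred = model.intercept_ + X_demo @ model.coef_

# Should agree with predict() up to floating point precision
print(np.allclose(manual_pred, model.predict(X_demo)))
```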

### Scoring the model

We can use the predicted values to evaluate the performance of the model by quantifying the difference between these and the real values.

In _scikit-learn_ we can evaluate the performance of the estimator using the function `score()`:

In [None]:
# Score the model using r2
score = reg.score(X, y)

# Print score
print(f"Linear model R2: {np.round(score,3)}")

By default, regression estimators such as `LinearRegression` are scored with $R^2$, also called the __coefficient of determination__. $R^2$ quantifies how much of the total variance of the outcome variable (`y`) is explained by the fitted model. The best possible value is 1, and the higher the value, the better the model explains the data. You can read more about the implementation of $R^2$ in _scikit-learn_ [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score).
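
To see what this means in practice, $R^2$ can be computed by hand as one minus the ratio between the residual sum of squares and the total sum of squares. The sketch below does this on a small, separately generated dataset and compares the result with `score()` and `r2_score()`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Small dataset so the example runs on its own (sizes are arbitrary)
X_demo, y_demo = make_regression(
    n_samples=50, n_features=5, noise=10, random_state=0
)
model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# Variance left unexplained by the model vs. total variance of the target
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R2:  {r2_manual:.4f}")
print(f"score():    {model.score(X_demo, y_demo):.4f}")
print(f"r2_score(): {r2_score(y_demo, y_hat):.4f}")
```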

#### ✍️ Exercise

There are other scoring metrics for regression problems. Check the module [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for an overview of the alternatives. Pick one, and implement it in the cell below. Press the three dots to reveal the solution.

_Hint!_ If you want to implement a scoring function that is not the default one, you won't be able to do so using the `score()` method. You will need to use a function specifically designed for the scoring metric, and pass the real and predicted values as input.

In [None]:
#### Answer using mean squared error
from sklearn.metrics import mean_squared_error

# Compute mean squared error
mse = mean_squared_error(y, y_pred)

# Print score
print(f"Mean squared error: {mse}")

## Logistic regression
Logistic regression is a very popular classification model. It uses a [logistic function](https://en.wikipedia.org/wiki/Logistic_function) to estimate the probability that an observation belongs to different classes.

Let's implement a logistic regression in _scikit-learn_.

We will create a fake dataset ready for classification using the `make_classification()` function (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification)):

In [None]:
from sklearn.datasets import make_classification

# Create fake dataset
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, random_state=0
)

Our `y` should now be a categorical variable. Let's print 10 samples to make sure:

In [None]:
print(y[:10])

Classifiers are also estimators in _scikit-learn_. This means we can also use them with the functions illustrated for the linear regression case.

Let's create a `LogisticRegression` estimator (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)), and fit it to our data:

In [None]:
from sklearn.linear_model import LogisticRegression

# Create model
clf = LogisticRegression()

# Fit model
clf = clf.fit(X, y)

#### ✍️ Exercise

Can you compare the first 10 predictions of the fitted model to their real values?  Write your answer in the cell below, and press the three dots to reveal the solution.

In [None]:
#### Answer
# Predict labels with trained model
y_pred = clf.predict(X[:10])
y_real = y[:10]

# Create dataframe for printing the predictions
df = pd.DataFrame({"y_pred": y_pred, "y_real": y_real})
df

### Probabilistic predictions

Logistic Regression is a [probabilistic classifier](https://en.wikipedia.org/wiki/Probabilistic_classification), meaning it predicts a probability distribution over the classes.

In _scikit-learn_ we can inspect the probabilities assigned to each class using `predict_proba()`:

In [None]:
# Predict the probability of each class
y_pred_proba = clf.predict_proba(X[:10])

# Create dataframe for printing the predictions for each class
df = pd.DataFrame(y_pred_proba, columns=["class 0", "class 1"])
df
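
Two properties of these probabilistic predictions are worth checking: the probabilities in each row sum to 1, and the label returned by `predict()` is the class with the highest probability. The sketch below verifies both on a small, separately generated dataset (sizes are arbitrary, and `clf_demo` is a name used only for this illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small dataset so the example runs on its own
X_demo, y_demo = make_classification(
    n_samples=100, n_features=10, n_informative=5, random_state=0
)
clf_demo = LogisticRegression().fit(X_demo, y_demo)

proba = clf_demo.predict_proba(X_demo)

# Each row is a probability distribution over the classes
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# The predicted label is the class with the highest probability
labels_from_proba = clf_demo.classes_[np.argmax(proba, axis=1)]
same_as_predict = np.array_equal(labels_from_proba, clf_demo.predict(X_demo))

print(f"Rows sum to 1: {rows_sum_to_one}")
print(f"Argmax matches predict(): {same_as_predict}")
```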

By default, the predictions made by `LogisticRegression` when calling `score()` are evaluated by computing the __mean accuracy__ of the predictions:

In [None]:
# Score predictions
score = clf.score(X, y)
print(f"Mean accuracy: {np.round(score, 2)}")
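
Mean accuracy is simply the fraction of samples whose predicted label matches the real one. As a quick check, the sketch below computes it by hand on a small, separately generated dataset and compares it with `score()` and with `accuracy_score()` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Small dataset so the example runs on its own (sizes are arbitrary)
X_demo, y_demo = make_classification(
    n_samples=100, n_features=10, n_informative=5, random_state=0
)
clf_demo = LogisticRegression().fit(X_demo, y_demo)
y_hat = clf_demo.predict(X_demo)

# Fraction of correct predictions
acc_manual = np.mean(y_hat == y_demo)

print(f"Manual accuracy:  {acc_manual:.2f}")
print(f"score():          {clf_demo.score(X_demo, y_demo):.2f}")
print(f"accuracy_score(): {accuracy_score(y_demo, y_hat):.2f}")
```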

## K-Means

Unsupervised models are also estimators in _scikit-learn_, since they also learn from data. One type of unsupervised model is the __clustering algorithm__. Clustering algorithms learn to group the data based on their feature values, so that observations within a group are more similar to each other than to observations in other groups. You can read more about clustering [here](https://github.com/martinagvilas/intro_stat_learning/blob/master/notebooks/lab2_clustering.ipynb).

A very popular clustering algorithm is __k-means__. This method partitions the data into a __pre-specified number of clusters__ ($k$) in a way that minimizes the within-cluster variance.

Let's implement k-means using _scikit-learn_. We first need to generate a dataset suitable for clustering using `make_blobs()` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs)) which generates Gaussian shaped blobs. We will create a very simple dataset with only two features, to simplify visualization of the clusters:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# Create fake dataset
X, y = make_blobs(
    n_samples=400, n_features=2, random_state=0, cluster_std=1
)

Let's visualize our dataset with a scatterplot, and color the observations according to their real labels:

In [None]:
# Plot dataset
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y,
    marker='o', s=25, edgecolor='k', legend=True
).set_title("Data")
plt.show()

There are 3 clusters in our fake dataset. In practice we usually don't know the true number of clusters, so we have to choose ourselves how many clusters the algorithm should look for.

Let's see an example of this and perform k-means clustering by calling `KMeans` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)) with 5 clusters ($k=5$):

In [None]:
from sklearn.cluster import KMeans

# Create model
kmeans = KMeans(n_clusters=5)

# Fit model
kmeans = kmeans.fit(X)

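Since this is an unsupervised method, we don't need to provide `y` as input to `fit()`. We also cannot compute the accuracy of the fitted model, because the model never sees the real labels. But we can use the `score()` method, which for `KMeans` returns the negative of the sum of squared distances between each sample and the center of its assigned cluster (values closer to 0 indicate tighter clusters):

In [None]:
# Compute the negative sum of squared distances to the assigned cluster centers
score = kmeans.score(X)
print(f"Score (negative sum of squared distances): {score}")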

If you want to read more about the meaning behind the returned value, read [this answer](https://stackoverflow.com/questions/32370543/understanding-score-returned-by-scikit-learn-kmeans) on stackoverflow.
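
The value returned by `score()` for `KMeans` can also be reproduced by hand. The sketch below, run on a small, separately generated dataset, sums the squared distances between each sample and the center of its assigned cluster, and compares the result with the `inertia_` attribute and with `score()`. The values of `n_init` and `random_state` are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small dataset so the example runs on its own
X_demo, _ = make_blobs(n_samples=200, n_features=2, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from each sample to the center of its assigned cluster
assigned_centers = km.cluster_centers_[km.labels_]
sq_dist = np.sum((X_demo - assigned_centers) ** 2, axis=1)

print(f"Sum of squared distances: {sq_dist.sum():.2f}")
print(f"inertia_:                 {km.inertia_:.2f}")
print(f"score():                  {km.score(X_demo):.2f}")
```

The first two values should match, and `score()` should return the same number with a negative sign.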

More importantly, we can now use our fitted model to predict which cluster each observation belongs to. Let's predict the cluster assignments and print those of the first 10 observations:

In [None]:
# Predict cluster label
y_pred = kmeans.predict(X)
print(f"Predicted labels (first 10): {y_pred[:10]}")

We can use a scatterplot to inspect the predicted labels from the model:

In [None]:
# Plot predicted labels
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y_pred,
    marker='o', s=25, edgecolor='k', legend=False
).set_title("Predicted labels")
plt.show()

The number of distinct predicted labels is equal to $k$, the number of clusters we asked for.
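
This can be checked directly by counting the distinct predicted labels and looking at the shape of the `cluster_centers_` attribute, which holds one center per cluster. The sketch below does so on a small, separately generated dataset, with `n_init` and `random_state` set to arbitrary values for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small dataset so the example runs on its own
X_demo, _ = make_blobs(n_samples=200, n_features=2, random_state=0)

k = 5
km5 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
labels = km5.predict(X_demo)

# One distinct label per cluster, and one center (row) per cluster
print(f"Distinct labels: {np.unique(labels)}")
print(f"Shape of cluster centers: {km5.cluster_centers_.shape}")
```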

### ✍️ Exercise 

Can you create a `KMeans` model specifying the correct number of clusters (`n_clusters=3`) and plot its predictions? Compare it with the plot of the real labels. Write your code in the cell below and press the three dots to see the solution.

In [None]:
#### Answer
# Create model
kmeans = KMeans(n_clusters=3)

# Fit model
kmeans = kmeans.fit(X)

# Predict labels
y_pred = kmeans.predict(X)

# Plot predicted labels
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y_pred,
    marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
plt.show()

---
# ✏️ Check your knowledge

Load the ABIDE 2 dataset and:

1. Use logistic regression to predict `group` from the features encoding brain data. How accurate is the model? Try different scoring metrics and inspect their results.
2. Select two features encoding brain data, run a clustering analysis on them, and plot the predicted labels as shown in this notebook. Compare it with a similar plot displaying the true group labels.

If you forgot how to load your data and define `X` and `y`, go back to [Notebook 1](./01-preliminaries.ipynb) and refresh this knowledge.


---
# Additional reading

- [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html): A useful map to decide which estimator is best given your dataset and learning goal.
- [Core concepts of machine learning](https://github.com/neurohackademy/nh2020-curriculum/blob/master/tu-machine-learning-yarkoni/02-core-concepts.ipynb) by _Tal Yarkoni_.