# 02 - Core concepts

In this notebook, we will review:
- What estimators are in scikit-learn, and some of the methods they provide.
- How estimators can be supervised models that perform classification or regression tasks, as well as unsupervised models.
- Common metrics used to evaluate the estimator performance.

---

# Estimators

In scikit-learn an [estimator](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html#estimators-objects) is a Python object that learns from data.

(...) Estimators come with associated functions. We will review some of these in this tutorial.

Most importantly, in scikit-learn both supervised and unsupervised models are created using estimator objects. Let's review each of them in turn.


# Supervised models

(...) Explain supervised models

(...) They divide into regression and classification models. (difference between regression and classification)


## Linear Regression

We will make a quick recap of linear regression models, but we will not provide a detailed description. Be sure to read one of our additional resources if you want to refresh your knowledge or dig deeper into the topic.

As a machine learning model, linear regression predicts the values of a continuous variable from a linear combination of the values in one or more features.

For example, if we had a dataset $X$ containing the values of features $x_1$ and $x_2$, the value $\hat{y}$ predicted by linear regression could be expressed as:

$$\hat{y} = a x_1 + b x_2 + c$$

> where $a$, $b$ and $c$ are the parameters the model learns from the data to make the predictions.
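
In vector form, the same prediction can be written as $\hat{y} = \mathbf{w}^\top \mathbf{x} + c$, where $\mathbf{w}$ collects the feature parameters ($a$ and $b$ above). The sketch below illustrates this with NumPy; the array `X_toy` and the parameter values are made up for illustration and are not learned from any data:

```python
import numpy as np

# Hypothetical values, chosen only for illustration
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0]])  # two samples, two features
w = np.array([0.5, -1.0])       # feature parameters (a and b)
c = 2.0                         # intercept (c)

# One prediction per sample: a * (first feature) + b * (second feature) + c
y_hat = X_toy @ w + c
print(y_hat)
```

With more features, only the length of `w` and the number of columns of `X_toy` change; the expression `X_toy @ w + c` stays the same.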

- !! Add vector notation

- !! Explain training and loss
     - Supervised approach --> The model learns the parameters that minimize the distance between the predicted value and the real value


Let's now see how we can fit a linear regression model using scikit-learn.

In scikit-learn, we can easily create a fake dataset for fitting a linear regression model using the `make_regression` function (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression)).

Let's create one dataset with 400 samples and 100 features. We will also define 20 of these features as informative, and add some Gaussian noise to the data to make the task harder for the model.

In [None]:
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=400, n_features=100, n_informative=20, noise=10, random_state=0
)
print(f"Shape of dataset: {np.shape(X)}")
print(f"Shape of targets: {np.shape(y)}")

Since it's a regression problem, let's make sure the target of our model is a continuous variable. Let's print the first ten values of `y`:

In [None]:
print(y[:10])

Now let's create a linear regression estimator using `LinearRegression` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression)).

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

Estimator objects contain parameters that define how they behave when learning from the data, as well as their outputs. Let's inspect the parameters of `reg`:

In [None]:
vars(reg)

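These parameters can be set by passing them as keyword arguments when creating the estimator, or changed afterwards with `set_params`:

In [None]:
# `normalize` was removed from LinearRegression in scikit-learn 1.2,
# so we change `n_jobs` instead (it does not affect the fitted coefficients)
reg.set_params(n_jobs=1)
vars(reg)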
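
Besides `vars`, estimators expose their parameters through the `get_params` method, which returns them as a dictionary. The sketch below uses a separate, illustrative estimator named `reg_demo` (not part of the notebook's own variables) so that it does not alter `reg`:

```python
from sklearn.linear_model import LinearRegression

# Illustrative estimator, separate from `reg`
reg_demo = LinearRegression()

# get_params returns the constructor parameters as a dictionary
params_before = reg_demo.get_params()
print(params_before)

# set_params updates one or more parameters and returns the estimator itself
reg_demo.set_params(fit_intercept=False)
print(reg_demo.get_params()["fit_intercept"])
```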

### Training the model

Once the object has been created, we can now train the model using our data. For this we need to call the `fit` method, and pass our data (`X`) and target (`y`) as input:

In [None]:
# Fit linear regression model
reg = reg.fit(X, y)

Let's inspect the attributes of `reg` again:

In [None]:
vars(reg).keys()

`reg` now contains new attributes, which are referred to as _estimated parameters_ because they have been learned from the data. In scikit-learn, these are marked by a trailing underscore (`_`).

For example, we can now access the coefficients learned by our linear model. We should have as many coefficients as features in our dataset:

In [None]:
print(f"Number of coefficients: {reg.coef_.shape[0]}")

Let's also print the values of some of them, and the value of the intercept.

In [None]:
coefs = reg.coef_
intercept = reg.intercept_

print(f"Model coefficients (first 10):\n {coefs[:10]} \n")
print(f"Model intercept: \n {intercept}")

### Making predictions with the model

Now that our model is fitted, we can use it to make predictions. In scikit-learn, this is achieved by calling the method `predict`. 

Let's predict the targets for `X` using our fitted model, and compare them to the real values for the first ten samples:

In [None]:
import pandas as pd

# Predict labels with trained model
y_pred = reg.predict(X)

# Create dataframe for printing the predictions
df = pd.DataFrame({"y_pred": y_pred[:10], "y_real": y[:10]})
df
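
For a linear regression, `predict` amounts to multiplying the features by the learned coefficients and adding the intercept. The sketch below checks this on a small synthetic dataset; the names `X_demo`, `y_demo` and `reg_demo` are illustrative and independent of the variables used above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small illustrative dataset and model
X_demo, y_demo = make_regression(
    n_samples=50, n_features=5, noise=10, random_state=0
)
reg_demo = LinearRegression().fit(X_demo, y_demo)

# Reproduce the predictions from the estimated parameters
manual_pred = X_demo @ reg_demo.coef_ + reg_demo.intercept_
matches = np.allclose(manual_pred, reg_demo.predict(X_demo))
print(f"Manual predictions match predict(): {matches}")
```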

### Scoring the model

We can use these predictions to evaluate the performance of the model. That is, estimate how wrong the model is by quantifying the difference between the predicted values and the real ones.

In scikit-learn we can evaluate this performance using the method `score`. Let's use this method to evaluate how well our model predicts the targets of our fake dataset:

In [None]:
score = reg.score(X, y)
print(f"Linear model R2: {np.round(score,3)}")

By default, the `score` method of regression estimators returns the coefficient of determination, $R^2$.

(...) Explain $R^2$

There are other scoring metrics for regression problems besides $R^2$. Check the module [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) for an overview of the alternatives.
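
As a rough illustration, $R^2$ can be computed by hand as one minus the ratio between the residual sum of squares and the total sum of squares around the mean of the targets. The sketch below compares that calculation with `r2_score` and with the estimator's `score` method, and also computes the mean squared error as one alternative metric. The dataset and variable names are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Small illustrative dataset and model
X_demo, y_demo = make_regression(
    n_samples=50, n_features=5, noise=10, random_state=0
)
reg_demo = LinearRegression().fit(X_demo, y_demo)
y_hat = reg_demo.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R2:   {r2_manual:.3f}")
print(f"r2_score:    {r2_score(y_demo, y_hat):.3f}")
print(f"reg.score:   {reg_demo.score(X_demo, y_demo):.3f}")

# Mean squared error: average squared difference between targets and predictions
mse = mean_squared_error(y_demo, y_hat)
print(f"MSE:         {mse:.3f}")
```

An $R^2$ of 1 corresponds to perfect predictions, while a model that always predicts the mean of the targets gets an $R^2$ of 0.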

#### ✍️ Exercise

Suppose we already have `X` and `y` defined and we want to perform linear regression on our data. Why would the following code fail?

```
reg = LinearRegression()
reg.score(X, y)
```

Can you fix it? Click the three dots to reveal the answer.

The code did not fit the model! For it to work it should read:

```
reg = LinearRegression()
reg.fit(X, y)
reg.score(X, y)
```
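
When an estimator is used before being fitted, scikit-learn raises a `NotFittedError` (defined in `sklearn.exceptions`). The sketch below catches it explicitly, using an illustrative dataset and the illustrative name `reg_demo`:

```python
from sklearn.datasets import make_regression
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression

# Small illustrative dataset
X_demo, y_demo = make_regression(n_samples=50, n_features=5, random_state=0)

reg_demo = LinearRegression()
try:
    # Scoring requires predictions, which require a fitted model
    reg_demo.score(X_demo, y_demo)
    raised = False
except NotFittedError as err:
    raised = True
    print(f"Scoring failed: {err}")

# After fitting, the same call works
fitted_score = reg_demo.fit(X_demo, y_demo).score(X_demo, y_demo)
print(f"Score after fitting: {fitted_score:.3f}")
```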

## Logistic regression

As mentioned, classification models are also estimators. One of the most popular classification models is __logistic regression__.

(...) Explain logistic regression


Let's create a fake dataset ready for classification using the `make_classification` function (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification)).

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, random_state=0
)

Our `y` should now be a categorical variable. Let's print 10 samples of it to make sure:

In [None]:
print(y[:10])

Let's now create a `LogisticRegression` estimator (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)) and fit it to our data:

In [None]:
from sklearn.linear_model import LogisticRegression

# Create model
clf = LogisticRegression()

# Fit model
clf = clf.fit(X, y)

#### ✍️ Exercise

Do you remember how to inspect the coefficients and intercept of the model? Try it below, and press the three dots to check your answer.

In [None]:
## Answer
coefs = clf.coef_
intercept = clf.intercept_

print(f"Coefficients:\n{coefs}\n")
print(f"Intercept:{intercept}")

Let's also compare the model predictions of the first 10 samples of `X` to their real labels:

In [None]:
# Predict labels with trained model
y_pred = clf.predict(X[:10])
y_real = y[:10]

# Create dataframe for printing the predictions
df = pd.DataFrame({"y_pred": y_pred, "y_real": y_real})
df

### Probabilistic predictions

Logistic Regression is a [probabilistic classifier](https://en.wikipedia.org/wiki/Probabilistic_classification), meaning it predicts a probability distribution over the classes.

In _scikit-learn_ we can inspect the probabilities assigned to each class in the following manner:

In [None]:
# Predict the probability of each class
y_pred_proba = clf.predict_proba(X[:10])

# Create dataframe for printing the predictions
df = pd.DataFrame(y_pred_proba, columns=["class 0", "class 1"])
df
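
For a binary problem, the probability of class 1 is the logistic (sigmoid) function applied to the model's linear score, which scikit-learn exposes through `decision_function`. The sketch below checks this relationship on a small illustrative dataset; `X_demo`, `y_demo` and `clf_demo` are names introduced only for this example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small illustrative binary classification problem
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, random_state=0
)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score of each sample: X @ coef_ + intercept_
scores = clf_demo.decision_function(X_demo[:5])

# The sigmoid maps the score to the probability of class 1
proba_class_1 = 1 / (1 + np.exp(-scores))
proba = clf_demo.predict_proba(X_demo[:5])

print(np.allclose(proba[:, 1], proba_class_1))  # column 1 is class 1
print(np.allclose(proba.sum(axis=1), 1.0))      # each row sums to 1

# The predicted label is the class with the highest probability
labels_from_proba = clf_demo.classes_[proba.argmax(axis=1)]
print(np.array_equal(labels_from_proba, clf_demo.predict(X_demo[:5])))
```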

### Confusion matrix

By default, the predictions made by `LogisticRegression` when calling `score` are evaluated by computing the __mean accuracy__ of the predictions:

In [None]:
# Score predictions
score = clf.score(X, y)
print(f"Mean accuracy: {np.round(score, 2)}")

Besides scoring the model, in classification problems, it is very common to plot the __confusion matrix__ of the predictions.

(...) Explain what is a confusion matrix. Paste picture.


We will now learn how to plot the confusion matrix of a model's predictions in scikit-learn.

We will first create an imbalanced classification dataset, meaning one containing more samples from one of the classes than the other. This dataset will make the example more interesting. We can create the imbalance by setting the parameter `weights` of `make_classification`: 

In [None]:
# Create dataset
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, 
    weights=[0.8, 0.2], random_state=0
)

Let's now create and fit a logistic regression model, and use it to make predictions.

In [None]:
# Create and fit model
clf = LogisticRegression().fit(X, y)

# Use model to make predictions
y_pred = clf.predict(X)

Using the predicted labels, we can now compute the confusion matrix with the function `confusion_matrix` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)), and display it using `ConfusionMatrixDisplay` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay)):

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

conf_matrix = confusion_matrix(y, y_pred, labels=clf.classes_)
cm_display = ConfusionMatrixDisplay(conf_matrix).plot()

(...) Explain output
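
In the matrix returned by `confusion_matrix`, rows correspond to the true classes and columns to the predicted classes. For a binary problem, the four entries are therefore the true negatives, false positives, false negatives and true positives. The sketch below reads them from a tiny set of made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical, not model outputs) and recovers the accuracy from the diagonal:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, made up for illustration
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true_demo, y_pred_demo)

# Row i, column j: samples of true class i predicted as class j
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

# Correct predictions lie on the diagonal
accuracy_manual = np.trace(cm) / cm.sum()
print(f"Accuracy from the matrix: {accuracy_manual}")
print(f"accuracy_score:           {accuracy_score(y_true_demo, y_pred_demo)}")
```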

#### ✍️ Exercise

There are other ways of scoring your model besides computing its mean accuracy. Read the documentation about scoring the [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) and the [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score) of a model. 

Can you use these score functions yourself? Try them below, and press the three dots to reveal the solution.

In [None]:
# Answer
from sklearn.metrics import precision_score, recall_score

precision = precision_score(y, y_pred)
print(f"Precision: {precision}")

recall = recall_score(y, y_pred)
print(f"Recall: {recall}")
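
Both scores can also be derived from the confusion matrix: precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model finds. The sketch below checks this on made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical values for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels, made up for illustration
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Precision: TP / (TP + FP); recall: TP / (TP + FN)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)

print(f"Precision: {precision_manual} vs {precision_score(y_true_demo, y_pred_demo)}")
print(f"Recall:    {recall_manual} vs {recall_score(y_true_demo, y_pred_demo)}")
```

On an imbalanced dataset these two scores are often more informative than accuracy, because a model can reach a high accuracy by mostly predicting the majority class.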

## K-Means

Unsupervised models are also estimators in scikit-learn, since they also learn from data.

- (...) recap of unsupervised learning:
    - the goal is to find interesting or useful structure in the data
    - we don't have the ground truth

Clustering methods are unsupervised models.
- (...) explain clustering methods

One popular clustering method is k-means.
- (...) Explain how k-means works

Let's generate a dataset suitable for clustering using `make_blobs` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs)), which generates Gaussian-shaped blobs, and visualize it:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# Create fake dataset
X, y = make_blobs(
    n_samples=400, n_features=2, random_state=0, cluster_std=1
)

# Plot dataset
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y,
    marker='o', s=25, edgecolor='k', legend=True
).set_title("Data")
plt.show()

We can perform k-means clustering using the `KMeans` estimator in scikit-learn (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)). We will pre-define $k$ to be equal to 5.

In [None]:
from sklearn.cluster import KMeans

# Create model
kmeans = KMeans(n_clusters=5)

# Fit model (an unsupervised estimator does not need the labels `y`)
kmeans = kmeans.fit(X)

Since this is an unsupervised method and there are no ground-truth labels, we cannot compute the accuracy of the fitted model. But we can use the `score` method, which returns the negative of the k-means objective: the sum of squared distances from each sample to its closest cluster center, with the sign flipped so that higher values are better:

In [None]:
# Compute the negative sum of squared distances to the closest cluster centers
score = kmeans.score(X)
print(f"Score (negative sum of squared distances): {score}")

If you want to read more about the meaning behind the returned value, read [this answer](https://stackoverflow.com/questions/32370543/understanding-score-returned-by-scikit-learn-kmeans) on Stack Overflow.
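
To make the returned value more concrete, the sketch below recomputes it by hand on a small illustrative dataset and compares it with the fitted attribute `inertia_`. The names `X_demo` and `km_demo` are introduced only for this example, and `n_init` and `random_state` are set explicitly to keep the run reproducible:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small illustrative dataset and model
X_demo, _ = make_blobs(n_samples=100, n_features=2, random_state=0)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from each sample to the center of its assigned cluster
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
sq_dist = np.sum((X_demo - assigned_centers) ** 2, axis=1)

print(f"score(X):                 {km_demo.score(X_demo):.3f}")
print(f"-inertia_:                {-km_demo.inertia_:.3f}")
print(f"-sum of squared distance: {-sq_dist.sum():.3f}")
```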

More importantly, we can now use our fitted model to predict which cluster each observation belongs to. Let's predict the assignments and print the first 10:

In [None]:
# Predict cluster label
y_pred = kmeans.predict(X)
print(f"Predicted labels (first 10): {y_pred[:10]}")
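
For k-means, predicting a cluster amounts to finding the closest cluster center. The sketch below reproduces `predict` with a plain NumPy distance computation on a small illustrative dataset (`X_demo` and `km_demo` are names used only in this example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small illustrative dataset and model
X_demo, _ = make_blobs(n_samples=100, n_features=2, random_state=0)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Euclidean distance from every sample to every cluster center:
# shape (n_samples, n_clusters)
diffs = X_demo[:, np.newaxis, :] - km_demo.cluster_centers_[np.newaxis, :, :]
distances = np.linalg.norm(diffs, axis=2)

# The predicted label is the index of the nearest center
nearest = distances.argmin(axis=1)
print(np.array_equal(nearest, km_demo.predict(X_demo)))
```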

We can use a scatterplot to inspect the predicted labels from the model:

In [None]:
# Plot predicted labels
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y_pred,
    marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
plt.show()

The number of distinct predicted labels is equal to $k$.

### ✍️ Exercise 

Can you create a `KMeans` model specifying the correct number of clusters (`k=3`) and plot its predictions? Try it below and press the three dots to see the solution.

In [None]:
# Answer
# Create model
kmeans = KMeans(n_clusters=3)

# Fit model (the labels `y` are not needed)
kmeans = kmeans.fit(X)

# Predict labels
y_pred = kmeans.predict(X)

# Plot predicted labels
sns.scatterplot(
    x=X[:, 0], y=X[:, 1], hue=y_pred,
    marker='o', s=25, edgecolor='k', legend=False
).set_title("Data")
plt.show()

# ✏️ Check your knowledge

Load the ABIDE 2 dataset and:

1. Use logistic regression to predict "group" from the features encoding brain data.
    - How accurate is the model?
    - Compute the confusion matrix and inspect the proportion of false positives and false negatives.
2. Select two features encoding brain data, run a clustering analysis and plot the predicted labels as shown in this example. Compare it with a similar plot showing the true group labels.


# Additional reading

- [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html): A useful map to decide which estimator is best given your dataset and learning goal.