# 02 - Estimators

In scikit-learn an [estimator](https://scikit-learn.org/stable/tutorial/statistical_inference/settings.html#estimators-objects) is a Python object that learns from data. 

In this notebook, we will show:
- How estimators can be supervised models in machine learning that perform classification or regression tasks, as well as unsupervised models.
- The methods associated with an estimator.
- Common metrics used to evaluate the estimator performance.

# Linear Regression

- !! Description of linear regression
    - Linear combination of features
    - Explain training and loss
- !! Links to detail tutorials explaining linear regression
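
As a rough sketch of the points listed above: a linear regression model predicts the target as a linear combination of the features plus an intercept, $\hat{y} = Xw + b$, and training chooses $w$ and $b$ to minimize the squared error between predictions and targets. The toy data and the names `X_toy`, `true_w` and `true_b` below are made up for illustration, and the fit uses NumPy's least-squares solver rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 samples, 3 features, known weights and intercept
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 3.0
y_toy = X_toy @ true_w + true_b  # noiseless linear combination of the features

# Append a column of ones so the intercept is learned as an extra weight
X_aug = np.hstack([X_toy, np.ones((X_toy.shape[0], 1))])

# Ordinary least squares: minimize ||X_aug @ w - y_toy||^2
w_hat, *_ = np.linalg.lstsq(X_aug, y_toy, rcond=None)

# Mean squared error (the loss) of the fitted model
mse = np.mean((X_aug @ w_hat - y_toy) ** 2)
print(f"Estimated weights: {w_hat[:3]}, intercept: {w_hat[3]}")
print(f"Mean squared error: {mse}")
```

Because the toy targets contain no noise, the estimated weights should match `true_w` and `true_b` up to floating-point error.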

In scikit-learn, we can easily create a fake dataset for fitting a linear regression model using the `make_regression` function (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression)).

Let's create one such dataset with 400 samples and 100 features. We will make 20 of these features informative and add some Gaussian noise to the targets to make the task harder for the model.

In [44]:
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(
    n_samples=400, n_features=100, n_informative=20, noise=10, random_state=0
)
print(f"Shape of dataset: {np.shape(X)}")
print(f"Shape of targets: {np.shape(y)}")

Shape of dataset: (400, 100)
Shape of targets: (400,)


Since it's a regression problem, let's make sure the target of our model is a continuous variable. Let's print the first ten values of `y`:

In [63]:
print(y[:10])

[-258.45661829 -357.87561738  108.00450947   -8.40675451 -226.74854238
  -38.23202934   18.14753729 -821.27365083  320.76896452  145.63900011]


Now let's create a linear regression estimator using `LinearRegression` (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression)).

In [45]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

Estimator objects contain parameters that define how the estimator behaves when learning from the data, as well as its outputs. Let's inspect the parameters of `reg`:

In [46]:
vars(reg)

{'fit_intercept': True,
 'normalize': False,
 'copy_X': True,
 'n_jobs': None,
 'positive': False}

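These parameters can be set by passing them as keyword arguments when creating the estimator, or changed afterwards with `set_params`, as follows. (Note that the `normalize` parameter used here was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions you will need to try this with another parameter, such as `fit_intercept`.)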

In [47]:
reg.set_params(**{"normalize": True})
vars(reg)

{'fit_intercept': True,
 'normalize': True,
 'copy_X': True,
 'n_jobs': None,
 'positive': False}
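
As a minimal sketch of both ways of setting a parameter, the example below uses `fit_intercept`, which should be available regardless of the installed scikit-learn version, and reads the values back with `get_params`. The name `reg_demo` is introduced here only for illustration, so that `reg` is left untouched.

```python
from sklearn.linear_model import LinearRegression

# Set a parameter when creating the estimator
reg_demo = LinearRegression(fit_intercept=False)
print(reg_demo.get_params()["fit_intercept"])

# Change it afterwards; set_params returns the estimator itself
reg_demo.set_params(fit_intercept=True)
print(reg_demo.get_params()["fit_intercept"])
```

Unlike `vars`, `get_params` only lists the parameters accepted by the constructor, so its output does not grow once the estimator has been fitted.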

Once the object has been created, we can train the model on our data. To do this, we call the `fit` method and pass our data (`X`) and targets (`y`) as input:

In [48]:
# Fit linear regression model
reg = reg.fit(X, y)

Let's inspect `reg` again:

In [49]:
vars(reg)

{'fit_intercept': True,
 'normalize': True,
 'copy_X': True,
 'n_jobs': None,
 'positive': False,
 'n_features_in_': 100,
 'coef_': array([-1.13225733e+00, -3.77616990e-01, -5.50097816e-01,  1.16898332e-01,
        -3.78358698e-01,  2.82099643e-01, -1.17569483e-01,  9.62901136e+01,
        -1.00224858e+00,  8.19910907e+01, -8.78135425e-02, -1.82280556e-01,
         8.81973951e-01,  7.07777762e-01, -3.64732376e-01,  1.16557370e-01,
        -1.41480911e-01,  9.76797502e+01,  2.44778366e-01,  1.18660618e-01,
        -2.56474406e-01, -2.63234326e-01, -3.79093461e-01, -7.56963260e-01,
         8.10451096e+01, -2.89729858e-01, -5.87770902e-01,  1.10550635e+00,
         6.33377915e-01,  8.79685725e+01, -2.13671654e-01,  9.45154025e+01,
         3.27848144e-01, -1.20963595e+00,  6.96968792e-01, -3.35168170e-01,
        -1.30714932e-01,  1.89034742e+01,  5.66612978e-01,  8.67670415e+01,
         5.59028417e-01, -4.83315208e-01,  5.86503099e-01, -3.26951768e-02,
        -4.81610251e-01, -4.95151

`reg` now contains new attributes, which are referred to as _estimated parameters_ because they have been learned from the data. In scikit-learn, these are marked by a trailing underscore (`_`) in their name.
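
The trailing-underscore convention can be checked with a short sketch. The tiny dataset (`X_tiny`, `y_tiny`) and the name `reg_demo` are made up for this illustration; the exact list of fitted attributes may vary between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

reg_demo = LinearRegression()
print(hasattr(reg_demo, "coef_"))  # the estimator has not been fitted yet

# Tiny made-up dataset following y = 2 * x
X_tiny = np.array([[0.0], [1.0], [2.0]])
y_tiny = np.array([0.0, 2.0, 4.0])
reg_demo.fit(X_tiny, y_tiny)

# Attributes created by fit end with an underscore
fitted_attrs = [
    name for name in vars(reg_demo)
    if name.endswith("_") and not name.startswith("_")
]
print(fitted_attrs)
```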

For example, we can now access the coefficients learned by our linear model. We should have as many coefficients as features in our dataset:

In [50]:
print(f"Number of coefficients: {reg.coef_.shape[0]}")

Number of coefficients: 100


Let's also print the values of some of them, and the value of the intercept.

In [51]:
coefs = reg.coef_
intercept = reg.intercept_

print(f"Model coefficients (first 10):\n {coefs[:10]} \n")
print(f"Model intercept: \n {intercept}")

Model coefficients (first 10):
 [-1.13225733 -0.37761699 -0.55009782  0.11689833 -0.3783587   0.28209964
 -0.11756948 96.29011361 -1.00224858 81.99109074] 

Model intercept: 
 -0.49212396559274585


Now that our model is fitted, we can use it to make predictions and evaluate how the predictions differ from the real values. 

In scikit-learn we can evaluate this performance using the method `score`. Let's use this method to evaluate how well our model predicts the targets of our fake dataset:

In [56]:
score = reg.score(X, y)
print(f"Linear model R2: {np.round(score,3)}")

Linear model R2: 0.999


By default, the `score` method of regression estimators such as `LinearRegression` returns the coefficient of determination, $R^2$.

- !! Explain $R^2$
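
As a brief sketch of what this number means: $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared differences between the real and predicted values, and $SS_{tot}$ is the sum of squared differences between the real values and their mean. A value of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean. The values in `y_true` and `y_hat` below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The manual value should agree with scikit-learn's r2_score
print(r2_manual, r2_score(y_true, y_hat))
```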

Besides evaluating the predictive accuracy of our model, we can also obtain the predicted values that the model calculates for each observation using the method `predict`. Let's predict the values of the first ten observations of our dataset, and compare them to their real values:

In [61]:
import pandas as pd

y_pred = reg.predict(X[:10])
y_real = y[:10]

df = pd.DataFrame({"y_pred": y_pred, "y_real": y_real})
df

|   | y_pred | y_real |
|---|---|---|
| 0 | -269.240138 | -258.456618 |
| 1 | -349.576199 | -357.875617 |
| 2 | 106.003478 | 108.004509 |
| 3 | -31.193701 | -8.406755 |
| 4 | -240.331344 | -226.748542 |
| 5 | -44.646219 | -38.232029 |
| 6 | 3.240374 | 18.147537 |
| 7 | -819.226112 | -821.273651 |
| 8 | 322.174534 | 320.768965 |
| 9 | 158.470684 | 145.639 |
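
For a linear model, the predicted values are simply the features multiplied by the learned coefficients, plus the intercept. The sketch below checks this on a small dataset generated only for this illustration (`X_demo`, `y_demo` and `reg_demo` are names introduced here).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Small synthetic dataset (sizes chosen only for this illustration)
X_demo, y_demo = make_regression(
    n_samples=50, n_features=5, noise=1.0, random_state=0
)
reg_demo = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: linear combination of the features plus the intercept
manual_pred = X_demo @ reg_demo.coef_ + reg_demo.intercept_

# Compare with the estimator's own predict method
print(np.allclose(manual_pred, reg_demo.predict(X_demo)))
```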


### Exercise

Suppose we already have `X` and `y` defined. What's wrong with the following code if we want to fit a linear regression model to the data?

```
reg = LinearRegression()
reg.score(X, y)
```

Can you fix it?

# Logistic Regression

- !! What is logistic regression
- !! A logistic regression model is an estimator too.

Let's create a fake dataset for classification using the `make_classification` function (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#sklearn.datasets.make_classification)).

In [64]:
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, random_state=0
)

Our `y` should now be a categorical variable. Let's print 10 samples of it to make sure:

In [66]:
print(y[:10])

[1 0 0 1 0 1 1 0 1 1]


Let's now create a `LogisticRegression` estimator (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)) and fit it to our data:

In [74]:
from sklearn.linear_model import LogisticRegression

# Create model
clf = LogisticRegression()

# Fit model
clf = clf.fit(X, y)

By default, the `score` method of `LogisticRegression` returns the mean accuracy of the predictions, that is, the fraction of observations assigned to the correct class:

In [76]:
# Score predictions
score = clf.score(X, y)
print(f"Mean accuracy: {np.round(score, 2)}")

Mean accuracy: 0.86


Let's also compare the model's predictions for the first 10 observations in `X` to the real classes:

In [68]:
y_pred = clf.predict(X[:10])
y_real = y[:10]

df = pd.DataFrame({"y_pred": y_pred, "y_real": y_real})
df

|   | y_pred | y_real |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 2 | 0 | 0 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |
| 5 | 1 | 1 |
| 6 | 0 | 1 |
| 7 | 0 | 0 |
| 8 | 1 | 1 |
| 9 | 1 | 1 |


Classifiers such as `LogisticRegression` can also predict the probability of each class for every observation, using the method `predict_proba`:

In [72]:
y_pred_proba = clf.predict_proba(X[:10])

df = pd.DataFrame(y_pred_proba, columns=["class 0", "class 1"])
df

|   | class 0 | class 1 |
|---|---|---|
| 0 | 0.388956 | 0.611044 |
| 1 | 0.786049 | 0.213951 |
| 2 | 0.971892 | 0.028108 |
| 3 | 0.052677 | 0.947323 |
| 4 | 0.664012 | 0.335988 |
| 5 | 0.015693 | 0.984307 |
| 6 | 0.661506 | 0.338494 |
| 7 | 0.992583 | 0.007417 |
| 8 | 0.045168 | 0.954832 |
| 9 | 0.027215 | 0.972785 |
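
Where do these probabilities come from? For a binary problem with the default settings, the sketch below suggests that the probability of class 1 is the logistic (sigmoid) function applied to the linear score returned by `decision_function`. The dataset and the names `X_demo`, `y_demo` and `clf_demo` are introduced only for this illustration, and `max_iter=1000` is an arbitrary choice to give the solver room to converge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score for each observation: X @ coef_ + intercept_
z = clf_demo.decision_function(X_demo[:5])

# The logistic (sigmoid) function maps the score to a probability for class 1
proba_manual = 1 / (1 + np.exp(-z))
proba_sklearn = clf_demo.predict_proba(X_demo[:5])[:, 1]
print(np.allclose(proba_manual, proba_sklearn))

# The predicted class is the one with the highest probability
print(clf_demo.predict(X_demo[:5]), (proba_sklearn > 0.5).astype(int))
```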


### Exercise

Do you remember how to inspect the coefficients and intercept of the model? Try it here.

# Support Vector Machine

- !! What is SVM 
    - Very common in fMRI research
- !! Link to SVM resources

Let's train a support vector classifier using `SVC` (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)). By default `SVC` uses `rbf` as its kernel.

In [73]:
from sklearn.svm import SVC

# Create model
svc = SVC()

# Fit model
svc = svc.fit(X, y)

# Score predictions
svc.score(X, y)

0.9875
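
The kernel is just another estimator parameter, so it can be inspected and changed like any other. The sketch below, on a dataset generated only for this illustration, reads the default kernel and fits one classifier per kernel. Keep in mind that these are scores on the training data, so they say little about how each kernel would perform on unseen data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Read the default kernel from the estimator parameters
default_kernel = SVC().get_params()["kernel"]
print(default_kernel)

# Fit one classifier per kernel and record the accuracy on the training data
train_scores = {}
for kernel in ["linear", "rbf"]:
    model = SVC(kernel=kernel).fit(X_demo, y_demo)
    train_scores[kernel] = model.score(X_demo, y_demo)
print(train_scores)
```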

### Exercise 

Can you print the predictions of the model for the first 10 observations? Can you print the predicted probabilities?

# KMeans

- !! Not only supervised models can be estimators in scikit-learn. An unsupervised model also learns from the data. 
- !! One example of such unsupervised method is KMeans, a clustering method.
- !! Explain how KMeans work

Let's generate a dataset suitable for clustering using `make_blobs` (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html#sklearn.datasets.make_blobs)), which generates Gaussian-shaped blobs:

In [65]:
from sklearn.datasets import make_blobs

X, y = make_blobs(
    n_samples=400, n_features=100, random_state=0
)

We can perform k-means clustering by calling `KMeans` in scikit-learn (read documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans)). We will pre-define `k` to be equal to 5.

In [81]:
from sklearn.cluster import KMeans

# Create model
kmeans = KMeans(n_clusters=5)

# Fit model
kmeans = kmeans.fit(X, y)

Since this is an unsupervised method and there are no ground-truth labels, we cannot compute the accuracy of the fitted model. Instead, the `score` method of `KMeans` returns the opposite of the k-means objective on the data, that is, the negative sum of squared distances between each observation and its closest cluster center. Values closer to zero indicate tighter clusters:

In [83]:
# Compute the negative sum of squared distances to the closest cluster center
score = kmeans.score(X, y)
print(f"Score: {score}")

Score: -88424.37254302565


If you want to read more about the meaning behind the returned value, read [this answer](https://stackoverflow.com/questions/32370543/understanding-score-returned-by-scikit-learn-kmeans).
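
As a sketch of what the returned value contains, the example below recomputes the sum of squared distances by hand and compares it with the fitted attribute `inertia_` and with the negated `score`. The dataset and the names `X_demo` and `kmeans_demo` are introduced only for this illustration; `n_init=10` and `random_state=0` are arbitrary choices to make the run reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)
kmeans_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Squared distance from each observation to the center of its assigned cluster
assigned_centers = kmeans_demo.cluster_centers_[kmeans_demo.labels_]
sq_dist = np.sum((X_demo - assigned_centers) ** 2, axis=1)

# All three values should agree up to floating-point error
print(sq_dist.sum(), kmeans_demo.inertia_, -kmeans_demo.score(X_demo))
```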

More importantly, we can now use our fitted model to predict which cluster each observation belongs to. Let's predict the assignment of the first 10 observations:

In [84]:
# Predict cluster label
kmeans.predict(X[:10])

array([2, 3, 1, 3, 4, 0, 2, 2, 3, 0], dtype=int32)

We get one predicted label per observation, and each label is an integer between 0 and `k - 1` identifying the assigned cluster.

### Exercise 

Can you try reproducing the code above running `KMeans` with 3 clusters?

# Performance metrics

- !! There are many ways of evaluating the performance of a model
- !! Explain how to choose the performance metrics

## Area Under the Curve

- !! Explain area under the curve

Let's create an imbalanced classification dataset, where one class has more samples than the other. We can do this by setting the parameter `weights` of `make_classification`:

In [124]:
# Create dataset
X, y = make_classification(
    n_samples=400, n_features=100, n_informative=20, 
    weights=[0.8, 0.2], random_state=0
)

# Create model
clf = LogisticRegression()

# Fit model
clf = clf.fit(X, y)

Let's make sure one of the classes has more samples than the other:

In [125]:
print(f"Proportion of class 1: {np.sum(y)/y.size}")

Proportion of class 1: 0.2


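In scikit-learn, we can compute the area under the ROC curve using `roc_auc_score` (read the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score)). Here we pass the predicted labels; passing predicted probabilities or decision scores instead lets the curve take every classification threshold into account: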

In [126]:
from sklearn.metrics import roc_auc_score

# Predict labels with model
y_pred = clf.predict(X)

# Compute AUC score
auc_score = roc_auc_score(y, y_pred)
print(f"AUC score: {np.round(auc_score, 2)}")

AUC score: 0.87
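
The sketch below compares the AUC computed from predicted labels with the AUC computed from the predicted probabilities of class 1. With hard 0/1 predictions only a single threshold is evaluated, and for a binary problem the result coincides with the balanced accuracy. The dataset sizes and the names `X_demo`, `y_demo` and `clf_demo` are assumptions made for this illustration, so the numbers will differ from the ones above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0
)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
labels_demo = clf_demo.predict(X_demo)

# AUC from hard 0/1 predictions: only one threshold is evaluated
auc_labels = roc_auc_score(y_demo, labels_demo)

# AUC from predicted probabilities of class 1: every threshold is evaluated
auc_proba = roc_auc_score(y_demo, clf_demo.predict_proba(X_demo)[:, 1])

bal_acc = balanced_accuracy_score(y_demo, labels_demo)
print(f"AUC from labels: {auc_labels:.3f}")
print(f"AUC from probabilities: {auc_proba:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
```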


How does this score compare to the mean accuracy score?

In [127]:
# Use another scoring method
print(f"Mean accuracy score: {np.round(clf.score(X, y), 2)}")

Mean accuracy score: 0.94
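
To put accuracy on imbalanced data in context, it helps to compare against a baseline that ignores the features altogether. The sketch below uses scikit-learn's `DummyClassifier`, which here always predicts the most frequent class, on a dataset generated only for this illustration. Such a baseline reaches an accuracy equal to the proportion of the majority class, while its AUC stays at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0
)

# Baseline that ignores the features and always predicts the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

dummy_acc = dummy.score(X_demo, y_demo)
dummy_auc = roc_auc_score(y_demo, dummy.predict_proba(X_demo)[:, 1])
print(f"Baseline accuracy: {dummy_acc:.2f}")
print(f"Baseline AUC: {dummy_auc:.2f}")
```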


## Precision-recall

- !! Explain what it is
- !! Explain why it's a good metric when you have imbalanced datasets

In [None]:
# TODO
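
Until this section is written, here is one possible sketch of how a precision-recall analysis could look, using `precision_recall_curve` and `average_precision_score` from `sklearn.metrics`. Precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model finds. The dataset and the names `X_demo`, `y_demo`, `clf_demo` and `scores` are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve

X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5,
    weights=[0.8, 0.2], random_state=0
)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Use the predicted probability of the minority class (class 1) as the score
scores = clf_demo.predict_proba(X_demo)[:, 1]

# Precision and recall at every threshold on the scores
precision, recall, thresholds = precision_recall_curve(y_demo, scores)

# Average precision summarizes the curve in a single number
ap = average_precision_score(y_demo, scores)
print(f"Number of thresholds: {thresholds.size}")
print(f"Average precision: {ap:.3f}")
```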

## Confusion matrix

- !! Explain confusion matrix

In [128]:
from sklearn.metrics import confusion_matrix

conf_matrix = confusion_matrix(y, y_pred, labels=clf.classes_)
conf_matrix

array([[313,   7],
       [ 19,  61]])

!! Explain output
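
As a sketch of how to read this output: in scikit-learn's confusion matrix, rows correspond to the true classes and columns to the predicted classes, so for the labels `[0, 1]` the four cells are the true negatives, false positives, false negatives and true positives. The example below copies the matrix printed above into a new array (`conf_matrix_demo`, a name introduced here) and derives accuracy, precision and recall for class 1 from it.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes
conf_matrix_demo = np.array([[313, 7], [19, 61]])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in this order
tn, fp, fn, tp = conf_matrix_demo.ravel()

accuracy = (tp + tn) / conf_matrix_demo.sum()
precision = tp / (tp + fp)  # of the observations predicted as class 1, how many were correct
recall = tp / (tp + fn)     # of the true class 1 observations, how many were found

print(f"Accuracy: {accuracy}, precision: {precision}, recall: {recall}")
```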

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix computed above (requires matplotlib)
conf_matrix_disp = ConfusionMatrixDisplay(
    confusion_matrix=conf_matrix, display_labels=clf.classes_
)
conf_matrix_disp.plot()

In [86]:
# TODO: add classification report

### Exercise

Read the documentation of `classification_report` [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html). Can you work out what this function does, and apply it yourself to our predictions?

#### Answer

In [130]:
from sklearn.metrics import classification_report

# Per-class precision, recall and F1-score for the predictions above
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.94      0.98      0.96       320
           1       0.90      0.76      0.82        80

    accuracy                           0.94       400
   macro avg       0.92      0.87      0.89       400
weighted avg       0.93      0.94      0.93       400



# Check your knowledge

{TO-DO}

Load the dataset pre-processed in notebook 1 {give name}, and:

1. 
2. 
3. 

Answer the following questions:

1. 
2.
3.



# Additional reading

- [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html): A useful map to decide which estimator is best given your dataset and learning goal.