# Programming for Data Science and Artificial Intelligence

## Linear Regression using Sklearn

### Readings: 
- https://scikit-learn.org/stable/

[Scikit-Learn](http://scikit-learn.org) provides quick access to a huge pool of machine learning algorithms.

Before using sklearn, there is **one thing you need to know**, i.e., the **data shape that sklearn wants**.

To apply the majority of its algorithms, sklearn requires two inputs, i.e., $\mathbf{X}$ and $\mathbf{y}$.

-  $\mathbf{X}$, or the **feature matrix**, *typically* has the shape of ``[n_samples, n_features]``
-  $\mathbf{y}$, or the **target/label vector**, *typically* has the shape of ``[n_samples, ]`` or ``[n_samples, n_targets]``, depending on whether that algorithm supports multiple targets

<img src = "figures/shape.png">

Note 1:  if your $\mathbf{X}$ has only 1 feature, the shape must be ``[n_samples, 1]`` NOT ``[n_samples, ]``

Note 2:  sklearn supports both numpy and pandas, as long as the shape is right.  For example, if you use pandas, $\mathbf{X}$ would be a dataframe, and $\mathbf{y}$ could be a series or dataframe.

Tip:  it's always a good idea to check the sklearn documentation before applying any algorithm.
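To make the shape requirement concrete, here is a minimal sketch with made-up numbers (the array values and the column names ``x`` and ``y`` are assumptions for illustration only). It shows how a single feature can be brought into the ``[n_samples, 1]`` layout in both numpy and pandas.

```python
import numpy as np
import pandas as pd

# A single feature stored as a flat 1-D array: shape (5,)
x_flat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Reshape to the 2-D layout sklearn expects: shape (5, 1)
X_np = x_flat.reshape(-1, 1)

# The target can stay 1-D: shape (5,)
y_np = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# pandas equivalent: double brackets keep a DataFrame (2-D),
# single brackets return a Series (1-D)
df = pd.DataFrame({"x": x_flat, "y": y_np})  # hypothetical column names
X_pd = df[["x"]]
y_pd = df["y"]

print(x_flat.shape, X_np.shape, y_np.shape)
print(X_pd.shape, y_pd.shape)
```

``reshape(-1, 1)`` lets numpy infer the number of samples, so the same line works for any length of input.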

### Basics of the sklearn API

Most commonly, the steps in using the Scikit-Learn API are as follows:

1. Import a class of model
2. Choose model hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion above.
4. Fit the model to your data by calling the ``fit()`` method of the model instance.
5. Perform inference using the ``predict()`` method.
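As a compact preview, the five steps above can be sketched on synthetic data. The data-generating line ``y = 3x + 2`` plus noise is an assumption chosen only so that the fitted values are easy to check by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # step 1: import the model class

# Synthetic data (an assumption for illustration): y = 3x + 2 plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=50)

model = LinearRegression(fit_intercept=True)     # step 2: choose hyperparameters
X = x.reshape(-1, 1)                             # step 3: features as [n_samples, 1]
model.fit(X, y)                                  # step 4: fit the model
y_new = model.predict(np.array([[0.0], [1.0]]))  # step 5: predict on new inputs

print(model.coef_, model.intercept_, y_new)
```

The learned slope and intercept should land close to 3 and 2, the values used to generate the data.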

### Let's try!

Before anything, let's load a toy regression dataset.

In [1]:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### 1. Import a class of model

In [2]:
from sklearn.linear_model import LinearRegression

#### 2. Choose model hyperparameters

For our linear regression example, we can instantiate the ``LinearRegression`` class and specify that we would like to fit the intercept using the ``fit_intercept`` hyperparameter:

In [3]:
model = LinearRegression(fit_intercept=True)
model

LinearRegression()

#### 3. Arrange data into a features matrix and target vector

In [4]:
assert len(X_train.shape) == 2 and len(X_test.shape) == 2  #correct shape!

In [5]:
assert len(y_train.shape) == 1 and len(y_test.shape) == 1  #correct shape!

#### 4. Fit the model to your data

In [6]:
model.fit(X_train, y_train)  #when you train your model, use your training set

LinearRegression()

This ``fit()`` command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore.
In Scikit-Learn, by convention all model parameters that were learned during the ``fit()`` process have trailing underscores; for example in this linear model, we have the following:

In [7]:
model.coef_

array([ -0.87547741,  -9.26843436,  23.24283857,  18.52697371,
       -30.72400693,  17.48415588,   2.06334652,   7.45377418,
        33.45802239,   2.61129274])

In [8]:
model.intercept_

151.29126213592232

These two attributes hold the learned coefficients (one slope per feature, ten here) and the intercept of the linear fit to the data.
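To see what ``coef_`` and ``intercept_`` mean in practice, the sketch below pairs each coefficient with its feature name and reproduces ``predict()`` by hand. It fits on the full, unscaled diabetes data for brevity, so its numbers will differ from the outputs above; the variable names ``reg`` and ``manual`` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target

reg = LinearRegression().fit(X, y)

# One coefficient per feature, plus a single intercept
for name, coef in zip(data.feature_names, reg.coef_):
    print(f"{name:>4}: {coef: .2f}")
print("intercept:", reg.intercept_)

# For a linear model, predict() is X @ coef_ + intercept_
manual = X @ reg.coef_ + reg.intercept_
print(np.allclose(manual, reg.predict(X)))
```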

#### 5. Predict labels for unknown data

Once the model is trained, we can evaluate it on data it has not seen; this step is called **inference** or **testing**, and it is done with the test set.

In Scikit-Learn, this can be done using the ``predict()`` method.
Here, our "new data" is the held-out test set ``X_test``, and we will ask what *y* values the model predicts:

In [9]:
y_hat = model.predict(X_test)  #inference

In [10]:
# compute mean squared error

from sklearn.metrics import mean_squared_error

# mean_squared_error(y_true, y_pred)
mean_squared_error(y_test, y_hat)

3180.1621027594665

As you can see, it's very close to what we got before, using our code from scratch, but with 10x fewer lines :-)!
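For reference, ``mean_squared_error`` is simply the average of the squared residuals. A minimal check on made-up values (the two small arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Small made-up values, only to illustrate the formula
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse_manual = np.mean((y_true - y_pred) ** 2)      # average of squared residuals
mse_sklearn = mean_squared_error(y_true, y_pred)  # same quantity via sklearn

print(mse_manual, mse_sklearn)  # both are 0.375
```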

## Cross-validation

Having only two sets, i.e., training and test set is NOT recommended because:

1. What if I want to check which hyperparameter is good?  How to check when I should NEVER touch test set?

2. What if somehow I got lucky with my split and my training set is very good, and my test set is also very good, just **by chance**?
 
The recommended way is to do **cross-validation**.

- **Idea**:  further **split the training set into an actual training set and a validation set**.  To make sure we don't get lucky with our validation set, we repeat this split several times, either randomly or in a walk-forward fashion like in this picture:

<img width="400" src = "figures/cv.png" >

Here we split the data into five groups, and use each of them in turn to evaluate the model fit on the other 4/5 of the data.  This would be rather tedious to do by hand, so we can use Scikit-Learn's ``cross_val_score`` convenience routine to do it succinctly:

In [11]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X_train, y_train, cv=5) 

array([0.21229232, 0.45349073, 0.51633375, 0.58399592, 0.64328844])

If we pass an integer as ``cv``, <code>cross_val_score</code> uses the KFold strategy for regressors such as ours (KFold is basically the picture above); for classifiers it uses StratifiedKFold instead.  We can also manually specify the CV strategy we want.

<img width="400" src = "figures/kfold.png">
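To show what <code>cross_val_score</code> does under the hood, here is a sketch of the same 5-fold procedure written as an explicit loop. It uses the full diabetes data rather than ``X_train`` so that it is self-contained; ``reg``, ``fold_model``, and ``manual_scores`` are illustrative names. Note that the scores are $R^2$ values, since that is what ``LinearRegression.score()`` returns and no ``scoring`` argument is given.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
reg = LinearRegression()
kf = KFold(n_splits=5)  # no shuffling, as with cv=5 for a regressor

manual_scores = []
for train_idx, val_idx in kf.split(X):
    fold_model = clone(reg)                     # fresh, unfitted copy per fold
    fold_model.fit(X[train_idx], y[train_idx])  # fit on 4/5 of the data
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))  # R^2 on the rest

auto_scores = cross_val_score(reg, X, y, cv=5)
print(np.allclose(manual_scores, auto_scores))
```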

For example, **ShuffleSplit**:

ShuffleSplit is a good alternative to KFold cross validation that allows a finer control on the number of iterations and the proportion of samples on each side of the train / test split.

<img width="400" src = "figures/shuffle.png">

In [12]:
from sklearn.model_selection import ShuffleSplit

shuffle_cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)  #splitting in a randomized way
cross_val_score(model, X_train, y_train, cv=shuffle_cv) 

array([0.55653483, 0.61272515, 0.49418558, 0.2213325 , 0.40555013])
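To see what ShuffleSplit actually produces, the sketch below prints the index sets on a tiny made-up dataset (20 samples; the data itself is an assumption and plays no role in the splitting). Unlike KFold, each split is drawn independently, so the same sample may land in the validation set of more than one split.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(40).reshape(20, 2)  # 20 made-up samples, 2 features

ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(ss.split(X_demo))

for i, (train_idx, val_idx) in enumerate(splits):
    # 25% of 20 samples -> 5 validation samples, 15 training samples
    print(i, "train size:", len(train_idx), "validation:", np.sort(val_idx))
```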

Another common strategy is **Stratified KFold**

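StratifiedKFold is a variation of k-fold which returns stratified folds: **each set contains approximately the same percentage of samples of each target class**.  Because stratification is defined in terms of class labels, it is mainly meant for classification; we apply it to our regression target below only to show the API.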

<img width="400" src = "figures/skfold.png">

In [13]:
from sklearn.model_selection import StratifiedKFold #mostly used for classification

sk_cv = StratifiedKFold(n_splits=3)  #there's also StratifiedShuffleSplit!
cross_val_score(model, X_train, y_train, cv=sk_cv) 



array([0.30962551, 0.5635614 , 0.59665989])
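Since stratification is about class proportions, it is easier to see on a classification target. A minimal sketch with made-up, imbalanced labels (12 samples of class 0 and 6 of class 1, an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 12 samples of class 0, 6 of class 1
y_cls = np.array([0] * 12 + [1] * 6)
X_cls = np.zeros((len(y_cls), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=3)
fold_counts = []
for train_idx, val_idx in skf.split(X_cls, y_cls):
    counts = np.bincount(y_cls[val_idx], minlength=2)  # samples per class
    fold_counts.append(counts.tolist())
    print("validation class counts:", counts)
```

Every validation fold keeps the 2:1 class ratio of the full label vector (4 samples of class 0 and 2 of class 1).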

Another strategy is **Group KFold**:

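This is very useful when samples come in groups (for example, several measurements from the same person) and you don't want the same group to appear in both the training and validation sets.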

<img width="400" src = "figures/groupkfold.png">

In [14]:
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)  #there's also GroupShuffleSplit!

#we have to specify which group each sample belongs to
#here we assign random groups, just for the sake of illustration
groups = np.random.randint(0, 5, size=y_train.shape[0])
print(groups.shape)
#print(groups)

cross_val_score(model, X_train, y_train, cv=gkf, groups=groups) 

(309,)


array([0.53971909, 0.44107877, 0.52276067, 0.47754004, 0.4587158 ])
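To verify the guarantee that GroupKFold gives, the sketch below checks that no group ever appears on both sides of a split. The 12 samples and 4 groups are made up for illustration, and the ``_demo`` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Made-up example: 12 samples belonging to 4 groups, 3 samples each
groups_demo = np.repeat([0, 1, 2, 3], 3)
X_demo = np.arange(12).reshape(-1, 1)
y_demo = np.arange(12, dtype=float)

gkf_demo = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf_demo.split(X_demo, y_demo, groups=groups_demo):
    # Groups present in both the training and the validation indices
    shared = set(groups_demo[train_idx]) & set(groups_demo[val_idx])
    overlaps.append(shared)
    print("validation groups:", np.unique(groups_demo[val_idx]), "shared:", shared)
```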

For more details, read https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

Note: **one big reminder is that cross-validation is for finding optimal hyperparameters/models; you still need to evaluate on the final test set**

## Double/Nested Cross-validation

Recall the two problems we have:

1. What if I want to check which hyperparameter is good?  How to check when I should NEVER touch test set?

2. What if somehow I got lucky with my split and my training set is very good, and my test set is also very good, just **by chance**?

Actually, cross-validation solved number 1 only!

How about number 2?  We may still have been lucky with our test set!

**Idea: put another loop around the initial split into training and test set, i.e., we now have two loops**

<img width="400" src = "figures/nestedcv.png">

**Then the final performance is simply the average of all outer-loop performances, instead of the score on a final test set, because here we don't have a final test set**

To do nested cross-validation, let's first learn <code>GridSearchCV</code>, which will be needed for doing nested cross-validation quickly.

## Grid Search

Recall that we can compare models/hyperparameters using <code>cross_val_score</code>, right?   But doing this by hand for every combination gets tiring.  Scikit-Learn provides an automated tool for this called <code>GridSearchCV</code>:

In [15]:
from sklearn.model_selection import GridSearchCV

param_grid = {'fit_intercept': [True, False],
              'positive': [True, False]}

#GridSearchCV(algorithm, dictionary, cross-validation strategy)
grid = GridSearchCV(model, param_grid, cv=shuffle_cv, refit=True)

grid.fit(X_train, y_train)

GridSearchCV(cv=ShuffleSplit(n_splits=5, random_state=0, test_size=0.3, train_size=None),
             estimator=LinearRegression(),
             param_grid={'fit_intercept': [True, False],
                         'positive': [True, False]})

Now that this is fit, we can ask for the best parameters as follows:

In [16]:
grid.best_params_

{'fit_intercept': True, 'positive': False}
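Besides ``best_params_``, a fitted <code>GridSearchCV</code> also stores the scores of every combination in ``cv_results_``. The sketch below is self-contained, so it refits a search on the full diabetes data with a plain ``cv=5`` (an assumption made for brevity; it is not the ``shuffle_cv`` search above, and ``search`` is an illustrative name).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)
param_grid = {"fit_intercept": [True, False], "positive": [True, False]}

search = GridSearchCV(LinearRegression(), param_grid, cv=5)
search.fit(X, y)

# cv_results_ holds one entry per hyperparameter combination
results = search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, f"mean R^2 = {mean:.3f} (std {std:.3f})")

# best_score_ is the mean validation score of the winning combination
print("best:", search.best_params_, search.best_score_)
```

Looking at the spread (``std_test_score``) next to the mean helps judge whether the winner is clearly better or just marginally ahead.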

Now that we have the best model, we can test it on the final test set.

In [17]:
y_hat = grid.predict(X_test)  #grid can predict right away: with refit=True, the best model was refit on all of X_train
mean_squared_error(y_test, y_hat)

3180.1621027594665

**Please note that there are also other forms of hyperparameter search, such as randomized search (<code>RandomizedSearchCV</code>), which can be more efficient.**
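A randomized search is available in Scikit-Learn as <code>RandomizedSearchCV</code>. The sketch below reuses the same four-combination search space and samples only three of them via ``n_iter=3``; with such a tiny space the saving is negligible, so treat it purely as an illustration of the API (``rand_search`` is an illustrative name, and the full diabetes data is used for brevity).

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)
param_distributions = {"fit_intercept": [True, False],
                       "positive": [True, False]}

# n_iter=3 samples 3 of the 4 combinations instead of trying all of them
rand_search = RandomizedSearchCV(LinearRegression(), param_distributions,
                                 n_iter=3, cv=5, random_state=0)
rand_search.fit(X, y)

print(len(rand_search.cv_results_["params"]), rand_search.best_params_)
```

The benefit grows with the size of the search space, because ``n_iter`` caps the number of fits regardless of how many combinations exist.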

## Coming back to nested cross-validation

Now that we know about grid search, we can combine <code>GridSearchCV</code> and <code>cross_val_score</code> to perform nested cross-validation like this:

In [18]:
from sklearn.model_selection import KFold

#specify the inner cv and outer cv
inner_cv = KFold(n_splits=4, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

#inner loop
inner_model = GridSearchCV(model, param_grid=param_grid, cv=inner_cv)

#outer loop
nested_score = cross_val_score(inner_model, X, y, scoring='neg_mean_squared_error', cv=outer_cv)
                              
nested_score #higher means better: sklearn scorers follow this convention, which is why the MSE is negated

#see this ==> https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values

array([-2903.10000132, -3315.33400852, -3144.27434507, -3045.8808507 ])

Now you just average them, and this gives you a more robust estimate of the model performance.

In [19]:
nested_score.mean()

-3102.1473014039575

There are two big problems with nested cross-validation:

1. It is time consuming and resource hungry.
2. You no longer know which hyperparameters or model is best, because the model selected in the inner loop can vary across outer folds. **So yes, nested cross-validation does not give you a final model!! XD**

**So how do we use nested cross-validation?**

1. First, use nested cross-validation to look for **model instability**.  If there is a lot of instability, you want to **skip the model or change the search space**.  

2. Once you have a model that is very stable, run a typical (simple, non-nested) cross-validation to find the best version of that model, so you can deploy it.
    - you can train it either (1) on the entire dataset (if you are super sure), or (2) on the training set only (preferable).
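One possible way to look for instability is to keep the fitted inner search of each outer fold and compare what it chose. The sketch below does this with ``cross_validate(..., return_estimator=True)``; reading "instability" as "different winners or widely varying scores across outer folds" is an assumption of this sketch, not a formal definition.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = load_diabetes(return_X_y=True)
param_grid = {"fit_intercept": [True, False], "positive": [True, False]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)
inner_model = GridSearchCV(LinearRegression(), param_grid, cv=inner_cv)

# return_estimator=True keeps the fitted GridSearchCV from each outer fold
out = cross_validate(inner_model, X, y, cv=outer_cv,
                     scoring="neg_mean_squared_error", return_estimator=True)

chosen = [est.best_params_ for est in out["estimator"]]
for params, score in zip(chosen, out["test_score"]):
    print(params, score)

# Same winner in every outer fold and similar scores suggest a stable search
print("distinct winners:", len({tuple(sorted(p.items())) for p in chosen}))
print("score std:", out["test_score"].std())
```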

**What if it takes too much time?**

1. Make the outer loop smaller (fewer outer splits), OR
2. Don't use it, OR
3. Implement the loops yourself, adding checkpoints to save progress.
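As a rough sketch of the do-it-yourself option, the outer loop can be written by hand so that each finished fold is saved to disk and skipped on a rerun. The helper ``run_nested_cv``, the JSON file name, and the use of a temporary directory are all hypothetical choices for this example; in practice you would pick a path that survives a restart.

```python
import json
import os
import tempfile

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_diabetes(return_X_y=True)
param_grid = {"fit_intercept": [True, False], "positive": [True, False]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)  # fixed seed: same folds on rerun


def run_nested_cv(checkpoint_path):
    """Hypothetical helper: nested CV that saves progress after every outer fold."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)  # scores of folds finished in an earlier run

    for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X)):
        key = str(fold)  # JSON object keys are strings
        if key in done:
            continue  # skip folds that were already completed
        search = GridSearchCV(LinearRegression(), param_grid, cv=inner_cv,
                              scoring="neg_mean_squared_error")
        search.fit(X[train_idx], y[train_idx])  # inner loop: tune on the outer training part
        y_hat = search.predict(X[test_idx])     # outer loop: evaluate on the held-out part
        done[key] = float(mean_squared_error(y[test_idx], y_hat))
        with open(checkpoint_path, "w") as f:   # save after every fold
            json.dump(done, f)
    return done


# A fresh temporary directory keeps the example self-contained
checkpoint = os.path.join(tempfile.mkdtemp(), "nested_cv_progress.json")
scores = run_nested_cv(checkpoint)
print(scores)
```

Calling ``run_nested_cv(checkpoint)`` a second time returns the saved scores without refitting anything, which is the point of the checkpoint.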