# Optional: Scikit-learn primer.
In this additional assignment, you will learn to use the scikit-learn library.   
**Note:** It is highly recommended to go through this notebook before starting with the bonus assignment.
 $\newcommand{\q}[1]{\rightarrow \textbf{Question #1}}$
 $\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}}$

## Introduction
All algorithms, both for training models and for pre-processing data, in scikit-learn have been implemented with the same `fit`, `predict` and `transform` API. As soon as you have learned this API you can use any algorithm without having to implement it on your own. For a given learning problem, you can then apply all those algortihms in the same way. The API also hides all the complex optimization choices that have to be made. You can control these by changing the hyper-parameters. The effects of these choices have been well documented in the API documentation and the provided tutorials of scikit-learn.  



## Dataset

In this assignment, we will use the Iris dataset to keep things simple.

In [2]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

## Using classifiers
Using a classifier in scikit-learn consist of 3 steps:
1. Initialize the model. During this step, you can already give it some default hyper-parameters.
2. Fitting the model on the training data.
3. Making predictions and/or evaluating the model.

### Create
Creating models is very easy in scikit-learn. All you have to do is create a new instance of the model's class.

$\ex{1}$ Extend the list of models with the `SVC` and `LogisticRegression` algorithms. Give the SVM a `poly` kernel. Give the Logistic Regression a maximum number of iterations of `1000`. Also, give both algorithms a regularization constant `C=0.5` and `random_state=42`.

In [3]:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

random_state = 42

models = {
    "GaussianNB": GaussianNB(),
    "DummyClassifier": DummyClassifier(strategy="most_frequent"),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=None, min_samples_leaf=2, random_state=random_state),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=3, weights="distance"),
    # START ANSWER
    "SVM": SVC(kernel="poly", C=0.5, random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, C=0.5, random_state=42),
    # END ANSWER
}

assert "GaussianNB" in models and isinstance(models["GaussianNB"], GaussianNB), "There is no GaussianNB in models"
assert "DecisionTreeClassifier" in models and isinstance(models["DecisionTreeClassifier"], DecisionTreeClassifier), "There is no DecisionTreeClassifier in models"
assert "KNeighborsClassifier" in models and isinstance(models["KNeighborsClassifier"], KNeighborsClassifier), "There is no KNeighborsClassifier in models"
assert "SVM" in models and isinstance(models["SVM"], SVC), "There is no SVC in models"
assert "LogisticRegression" in models and isinstance(models["LogisticRegression"], LogisticRegression), "There is no LogisticRegression in models"

### Fit
$ \ex{2} $ Fit each of your models on the entire training set by calling the `.fit` method of the model.

In [4]:
for name, model in models.items():
    # START ANSWER
    model.fit(X, y)
    # END ANSWER

In [5]:
from sklearn.utils.validation import check_is_fitted

for model in models.values():
    check_is_fitted(model)

### Evaluate
The `sklearn.metrics` module has lots of metrics that can evaluate a model's predictions. Here is an example of how to calculate a model's F1 and accuracy score.

In [6]:
from sklearn.metrics import f1_score, accuracy_score

for name, model in models.items():
    prediction = model.predict(X)
    f1_score_value = f1_score(prediction, y, average="weighted")
    accuracy = accuracy_score(prediction, y)
    print(name)
    print("- accuracy_score", accuracy)
    print("- f1_score", f1_score_value)

GaussianNB
- accuracy_score 0.96
- f1_score 0.96
DummyClassifier
- accuracy_score 0.3333333333333333
- f1_score 0.5
DecisionTreeClassifier
- accuracy_score 0.98
- f1_score 0.9800020002000202
KNeighborsClassifier
- accuracy_score 1.0
- f1_score 1.0
SVM
- accuracy_score 0.9733333333333334
- f1_score 0.973344004268374
LogisticRegression
- accuracy_score 0.9666666666666667
- f1_score 0.9666700003333667


## Data splitting
Models usually achieve a high evaluation score on the training set. However, this doesn't say anything about how well it generalizes to unseen data. We usually evaluate models using either a test/validation split or k-fold validation. Scikit-learn  makes our life easier here by implementing both functions for us.

### Test/validation split
We can split datasets into training and test sets using the `train_test_split` function. The `test_size` parameter indicates the percentage of data that should go to the test set. The `stratify`  parameter indicate that the split should take the distribution of target labels `y` into account during the split. This parameter ensures that both the train and test have the same distribution of target variables.

$ \ex{3} $ The data has already been split into a training and a test set. Fit the model using the training set and evaluate them for `accuracy` and `weighted F1-score` using the test set.

The result on the test set should roughly be equal to:

|                  Model |    Accuracy  | F1 |
|-----------------------:|------:|---------:|
|             GaussianNB |  0.86 |     0.86 |
| DummyClassifier        | 0.33  | 0.166      |
| DecisionTreeClassifier | 0.866 |    0.866 |
| KNeighborsClassifier   | 1     | 1        |
| SVM                    | 0.93     | 0.932        |
| LogisticRegression     | 0.933 | 0.932    |

Manually verify that this is indeed the case.

In [7]:
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

# START ANSWER
for name, model in models.items():
    model.fit(X_train, y_train)

    prediction = model.predict(X_test)
    accuracy = accuracy_score(y_test, prediction)
    f1_score_value = f1_score(y_test, prediction, average="weighted")
    print(name)
    print("- accuracy_score", accuracy)
    print("- f1_score", f1_score_value)
# END ANSWER 

GaussianNB
- accuracy_score 0.8666666666666667
- f1_score 0.8666666666666667
DummyClassifier
- accuracy_score 0.3333333333333333
- f1_score 0.16666666666666666
DecisionTreeClassifier
- accuracy_score 0.8666666666666667
- f1_score 0.8666666666666667
KNeighborsClassifier
- accuracy_score 1.0
- f1_score 1.0
SVM
- accuracy_score 0.9333333333333333
- f1_score 0.9326599326599326
LogisticRegression
- accuracy_score 0.9333333333333333
- f1_score 0.9326599326599326


## K-fold validation
Setting up k-fold validation is a bit more work but we can do it as follows:

In [8]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate

def k_fold_fit_and_evaluate(X, y, model, scoring_method, n_splits=5):
    # Define evaluation procedure
    cv = KFold(n_splits=n_splits, random_state=42, shuffle=True)
    # Evaluate model
    scores = cross_validate(model, X, y, scoring=scoring_method, cv=cv, n_jobs=-1)
    
       
    return scores["test_score"]

**Note:** `cross_validate` expects a `scoring_method`. We can create a `scoring_method` using the `make_scorer` function from scikit-learn.

$ \ex{4} $ Use the example below to calculate the mean and std for both the F1 and the accuracy score. The `k_fold_fit_and_evaluate` method returns the resulting k-fold validation score from the provided `scoring_method`.

Hint: use `np.mean` and `np.std`.

The result using k-fold validation should roughly be equal to:


|                  Model | mean F1 | std F1 | mean Accuracy | std Accuracy |
|-----------------------:|--------:|--------|--------------:|--------------|
|             GaussianNB |   0.960 | 0.0249 |        0.960 | 0.0249        |
| DummyClassifier        |0.1079   | 0.01866 | 0.26        | 0.0249       |
| DecisionTreeClassifier |   0.946 | 0.0340 |       0.946 | 0.0339       |
| KNeighborsClassifier   | 0.966   | 0.0214 | 0.966        | 0.02144      |
| SVM                    | 0.980   | 0.0163 | 0.980        | 0.0163      |
| LogisticRegression     | 0.966   | 0.0298 | 0.966        | 0.0298     |

In [9]:
from sklearn.metrics import make_scorer
import numpy as np

n_splits = 5


scoring_method_f1 = make_scorer(lambda prediction, true_target: f1_score(true_target, prediction, average="weighted"))
# START ANSWER
scoring_method_accuracy = make_scorer(lambda prediction, true_target: accuracy_score(true_target, prediction))
# END ANSWER 


for name, model in models.items():
    print(name)
    metrics_f1 = k_fold_fit_and_evaluate(X, y, model, scoring_method_f1, n_splits=n_splits) 
    # START ANSWER
    metrics_accuracy = k_fold_fit_and_evaluate(X, y, model, scoring_method_accuracy, n_splits=n_splits)
    print(np.mean(metrics_f1), np.std(metrics_f1), np.mean(metrics_accuracy), np.std(metrics_accuracy))
    # END ANSWER

GaussianNB
0.960083877475182 0.02496762328890136 0.9600000000000002 0.024944382578492935
DummyClassifier
0.4120800962906226 0.031237010908414645 0.26 0.024944382578492935
DecisionTreeClassifier
0.9467734439473571 0.034088683237891465 0.9466666666666667 0.03399346342395189
KNeighborsClassifier
0.9669716271573856 0.020723569597320978 0.9666666666666668 0.02108185106778919
SVM
0.9800214654282765 0.0163127186195678 0.9800000000000001 0.01632993161855452
LogisticRegression
0.9666666666666668 0.029814239699997188 0.9666666666666668 0.029814239699997188


## Grid search
Scikit-learn also makes it easier to tune hyper-parameters using `GridSearchCV`.

$ \ex{5} $ Extend the `model_parameters` dictionary by specifying a grid search for the `KNeighborsClassifier`, `SVM` and the `LogisticRegression` models. Choose a set of hyper-parameters values that seems reasonable for you.

In [10]:
from sklearn.model_selection import GridSearchCV

random_state = 42
n_splits = 5
scoring_method = make_scorer(lambda true_target, prediction: f1_score(true_target, prediction, average="weighted"))

model_parameters = {
    "GaussianNB": {
    
    },
    "DummyClassifier": {
        
    },
    "DecisionTreeClassifier": {
        'random_state': [random_state],
        'max_depth': [None, 2, 5, 10]
    },
    # START ANSWER
    "KNeighborsClassifier" : {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"]
    },
    "SVM" : {
        "C": [5, 10, 15],
        "kernel": ["poly", "linear", "rbf"],
        "gamma": ["scale", "auto"]
    },
    "LogisticRegression": {
        "penalty": ["none", "l2"]
    }
    # END ANSWER
}

for model_name, parameters in model_parameters.items():
    model = models[model_name]
    
    cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, verbose=False, scoring=scoring_method).fit(X, y)
    
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    best_params = grid_search.best_params_
    
    print(model_name)
    print("- best_score =", best_score)
    print("best paramters:")
    for k,v in best_params.items():
        print("-", k, v)


GaussianNB
- best_score = 0.9599161225248183
best paramters:
DummyClassifier
- best_score = 0.10791990370937739
best paramters:
DecisionTreeClassifier
- best_score = 0.9465598893859765
best paramters:
- max_depth None
- random_state 42
KNeighborsClassifier
- best_score = 0.9797720797720798
best paramters:
- n_neighbors 7
- weights distance
SVM
- best_score = 0.9799785345717235
best paramters:
- C 5
- gamma scale
- kernel rbf
LogisticRegression
- best_score = 0.9666666666666668
best paramters:
- penalty l2


## Using Transformers
The transformers have a similar but slightly different API than the models. Transformers still have the `fit` method. The fit method is, for example, used in the `StandardScaler` to find the `mean` and `std` values. However, the `predict` method is replaced with the `transform` method. Scikit-learn did this to make it clear to the users that this is not a model but a feature transformer.  
**Note:** [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) can be used to center the features and scale them to unit variance.

In [11]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler().fit(X_train)

scaler.mean_, scaler.scale_

(array([5.82962963, 3.05703704, 3.75111111, 1.20518519]),
 array([0.82210877, 0.44297659, 1.74999965, 0.763842  ]))

After fitting the transformer, you can call the `transform` method. It will transform the input features based on the parameters found during the last `fit` call.

In [12]:
import numpy as np

X_train_transformed = scaler.transform(X_train)
print("X_train")
print("mean", X_train.mean())
print("std", X_train.std())
print()
print("X_train_transformed")
print("mean", X_train_transformed.mean())
print("std", X_train_transformed.std())

assert np.isclose(X_train_transformed.mean(), 0)
assert np.isclose(X_train_transformed.std(), 1)

X_train
mean 3.460740740740741
std 1.9662465199534571

X_train_transformed
mean 6.579099405186112e-17
std 1.0


$ \ex{6} $ First, transform the dataset using the `Normalizer` transformer. The fit and evaluate each model using the transformed features.  
**Note:** [`Normalizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) scales the features of each sample so that it has unit norm.

In [13]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42, shuffle=True, stratify=y)

scaler = preprocessing.Normalizer()

# START ANSWER
scaler = scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

for name, model in models.items():
    model.fit(X_train, y_train)

    prediction = model.predict(X_test)
    accuracy = accuracy_score(y_test, prediction)
    f1_score_value = f1_score(y_test, prediction, average="weighted")
    print(name)
    print("- accuracy_score", accuracy)
    print("- f1_score", f1_score_value)

# END ANSWER

mean 0.43844230430986214
std 0.24035046451267422
GaussianNB
- accuracy_score 1.0
- f1_score 1.0
DummyClassifier
- accuracy_score 0.3333333333333333
- f1_score 0.16666666666666666
DecisionTreeClassifier
- accuracy_score 0.8
- f1_score 0.7802197802197803
KNeighborsClassifier
- accuracy_score 1.0
- f1_score 1.0
SVM
- accuracy_score 1.0
- f1_score 1.0
LogisticRegression
- accuracy_score 0.8666666666666667
- f1_score 0.861111111111111
