# Exercise 4 (solution)

In [None]:
from sklearn.datasets import load_digits
import pandas as pd
import seaborn as sns

## Note on import statements

- In all real projects, all import statements should be in the first cell of a notebook
- It is part of this exercise that you learn how to import what you need from sklearn
- Therefore, in these exercise notebooks you will see imports in many places

## Task 1: Load and inspect the dataset

In this task you will load the digits dataset from `sklearn.datasets`, using scikit-learn's `load_digits` function, which will return a dictionary-like `Bunch` object. 

The goal of this warm-up task is that you use your Python knowledge to inspect the object you get from `load_digits`. You do not need to google.


1. List the keys of the object
2. Look at some of the entries and understand their format (e.g. using `type()` and `.shape`)
3. Look at the description inside digits and find all the terms mentioned on the terminology slide

In [None]:
digits = load_digits()

In [None]:
digits.keys()

In [None]:
type(digits["data"])

In [None]:
digits["data"].shape

In [None]:
print(digits["DESCR"])
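
The inspection steps above can be condensed into one loop. The following is a minimal sketch, not part of the original solution; the variable names `shape` and `first_image` are chosen here for illustration. It relies on the fact that a `Bunch` behaves like a dictionary and that entries without a `.shape` attribute (lists, strings, `None`) can be handled with `getattr`.

```python
from sklearn.datasets import load_digits

digits = load_digits()

# Print the type and, where available, the shape of every entry.
for key in digits.keys():
    value = digits[key]
    shape = getattr(value, "shape", None)  # None for lists, strings, etc.
    print(f"{key:15s} {type(value).__name__:10s} {shape}")

# Each row of "data" is a flattened 8x8 image from "images".
first_image = digits["data"][0].reshape(8, 8)
print((first_image == digits["images"][0]).all())
```

Looking at both `"data"` and `"images"` helps to connect the feature matrix to the pictures it was built from.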

## Task 2: Data splitting

Split the data and assign the splits to the variables `X_train`, `X_test`, `y_train`, `y_test`. Set a `random_state` of your choice. Split such that the training sets contain 75 percent of the data. Confirm that by looking at the shapes of the resulting arrays. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)
X_train.shape

In [None]:
X_test.shape
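
Instead of comparing the shapes by eye, the shares can also be computed directly. This is a small sketch that repeats the split from above so that it runs on its own; `n_total`, `train_share` and `test_share` are names introduced here.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

# Shares of observations in each split; they will only be approximately
# 0.75 and 0.25 because the number of rows is not divisible by four.
n_total = len(digits["data"])
train_share = len(X_train) / n_total
test_share = len(X_test) / n_total
print(f"train share: {train_share:.3f}, test share: {test_share:.3f}")
```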

## Task 3: Logistic Regression

1. Run a logistic regression without regularization and with intercept
2. Use the fitted model to create predictions on the test dataset

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    fit_intercept=True,
    penalty=None,
)
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
y_pred

In [None]:
model.score(X_test, y_test)

## Task 4: Assess model quality

1. Calculate the accuracy score
2. Calculate the f1 score
3. Convert the `"target_names"` to a `string` data type
4. Create a classification report
5. Calculate a confusion_matrix
6. Plot the confusion matrix using seaborn's [heatmap function](https://seaborn.pydata.org/generated/seaborn.heatmap.html) (Optional)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)
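
To see what the accuracy score measures, it can help to compute it by hand. The sketch below uses small made-up label arrays (`y_true_toy` and `y_pred_toy` are illustrative, not taken from the digits data) and shows that accuracy is simply the share of correct predictions. For classifiers, `model.score` returns the same quantity.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true_toy = np.array([0, 1, 2, 2, 1, 0])
y_pred_toy = np.array([0, 2, 2, 2, 1, 1])

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = np.mean(y_true_toy == y_pred_toy)
sklearn_accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(manual_accuracy, sklearn_accuracy)
```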

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, y_pred, average=None)
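
With `average=None`, `f1_score` returns one value per class. If a single number is needed, the per-class scores can be averaged. The sketch below uses made-up labels (illustrative only) and shows how the `"macro"` and `"weighted"` averages relate to the per-class scores.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only.
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 0])

per_class = f1_score(y_true_toy, y_pred_toy, average=None)
macro = f1_score(y_true_toy, y_pred_toy, average="macro")
weighted = f1_score(y_true_toy, y_pred_toy, average="weighted")

# "macro" is the unweighted mean of the per-class scores;
# "weighted" weights each class by its number of true observations.
support = np.bincount(y_true_toy)
print(per_class, macro, weighted)
```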

In [None]:
digits["target_names"] = digits["target_names"].astype(str)

In [None]:
from sklearn.metrics import classification_report

report = classification_report(
    y_test,
    y_pred,
    target_names=digits["target_names"],
)
print(report)

In [None]:
from sklearn.metrics import confusion_matrix

confusion = confusion_matrix(y_test, y_pred, normalize="true")
confusion = pd.DataFrame(
    confusion, columns=digits["target_names"], index=digits["target_names"]
)
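
The effect of `normalize="true"` is easier to see on a tiny example. The following sketch uses made-up labels (illustrative only) and compares the raw counts with the row-normalized version, in which every row sums to one and the diagonal contains the share of correctly classified observations per true class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_true_toy = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows correspond to true labels, columns to predicted labels.
counts = confusion_matrix(y_true_toy, y_pred_toy)
shares = confusion_matrix(y_true_toy, y_pred_toy, normalize="true")

print(counts)
print(shares.round(2))
```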

In [None]:

# Set the figure size before plotting so that it applies to this heatmap
sns.set(rc={"figure.figsize": (12, 8.27)})
sns.heatmap(
    confusion.round(3),
    cmap=sns.color_palette("Blues", as_cmap=True),
    annot=True,
)

## Task 5: Logit fitting with penalty

1. Run a logistic regression with an "l2" penalty. Set the penalty parameter C = $1 / \lambda$ to 1.
2. You will get a warning. You have two options to solve it:
    1. Find a good explanation of why it is acceptable to ignore this warning. Relate this to the differences between machine learning and econometrics
    2. Change the settings so you don't get the warning

In [None]:
logit = LogisticRegression(fit_intercept=True, max_iter=4500, C=1)
logit.fit(X_train, y_train)
logit.score(X_test, y_test)

In econometrics, it would be a serious problem if a numerical optimization terminated without convergence because it reached the maximum number of iterations, since we would have no way of knowing whether this introduces a large bias in our parameter estimates. In supervised machine learning, we can simply try it out, because what matters is the predictive performance on data that was not used for fitting. Fewer iterations can even work better than more, because stopping early can reduce overfitting.
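
One way to "try it out" is to fit the same model with different iteration limits and compare the scores. The sketch below is only an illustration: the values in the loop are arbitrary choices, `test_scores` and `candidate` are names introduced here, and the convergence warning is silenced on purpose.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

test_scores = {}
for max_iter in [50, 100, 4500]:
    with warnings.catch_warnings():
        # Deliberately ignore the warning; we judge the model by its score.
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        candidate = LogisticRegression(C=1, max_iter=max_iter)
        candidate.fit(X_train, y_train)
    test_scores[max_iter] = candidate.score(X_test, y_test)

print(test_scores)
```

In a real project, such a comparison would be done on validation data or with cross-validation, so that the test set stays untouched until the final evaluation.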

## Task 6: Understanding decision trees and random forests in group work

Read the following two sections of the Python Data Science Handbook

- [Decision trees](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Motivating-Random-Forests:-Decision-Trees)
- [Random forests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html#Ensembles-of-Estimators:-Random-Forests)

Discuss decision trees and random forests with your neighbor or in groups of up to 5 people. Make sure everyone understands the basic idea and no one gets hung up on small technicalities.

After everyone has a good understanding of the two methods, go through the basic steps (import, create model instance, fit, evaluate score) for a decision tree and a random forest.

In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest.score(X_test, y_test)
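
To connect the code to the idea from the reading, the following sketch shows that a fitted random forest is literally a collection of decision trees. The `random_state=0` arguments and the names `single_tree` and `forest_demo` are choices made here for reproducibility and illustration; they are not part of the original solution.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)
forest_demo.fit(X_train, y_train)

# The fitted forest stores its individual trees in `estimators_`.
n_trees = len(forest_demo.estimators_)
tree_score = single_tree.score(X_test, y_test)
forest_score = forest_demo.score(X_test, y_test)
print(n_trees, tree_score, forest_score)
```

Comparing the two scores gives a first impression of how much averaging over many trees helps on this dataset.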

## Task 7: K-fold Cross Validation

Do a five-fold cross-validation for a model of your choice on the training dataset.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(logit, X_train, y_train, cv=5)
scores
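
The five scores are usually summarized by their mean and standard deviation. The sketch below does this for a decision tree, which is used here only because it fits quickly and without warnings; the name `cv_scores` and the `random_state=0` argument are choices made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

# One accuracy score per fold; the test set is not touched.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5
)
print(f"mean accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```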

## Task 8: Hyperparameter tuning

Tune the hyperparameters of one of the methods used above using a grid search with cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

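param_grid = {
    "penalty": ["l2", "l1"],
    "max_iter": [100, 2000],
    "C": [0.01, 0.1, 100],
}

grid = GridSearchCV(
    LogisticRegression(
        fit_intercept=True,
        penalty="l2",
        # The default "lbfgs" solver does not support the "l1" penalty;
        # "saga" supports both "l1" and "l2".
        solver="saga",
    ),
    param_grid,
    cv=7,
)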

In [None]:
grid.fit(X_train, y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_estimator_.score(X_test, y_test)
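
Besides the best parameters, it is often useful to look at the results for all parameter combinations. The following sketch does this for a small, quickly fitting example; the decision tree, the grid `small_grid` and the names `search`, `results` and `summary` are assumptions made for this illustration and not part of the original solution.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits["data"],
    digits["target"],
    random_state=1234,
    test_size=0.25,
)

# Hypothetical grid for illustration: 3 x 2 = 6 combinations.
small_grid = {"max_depth": [5, 10, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), small_grid, cv=5)
search.fit(X_train, y_train)

# `cv_results_` is a dict of arrays that converts directly to a DataFrame.
results = pd.DataFrame(search.cv_results_)
columns = [
    "param_max_depth",
    "param_min_samples_leaf",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score")
print(summary)
```

Note that `mean_test_score` refers to the held-out folds of the cross-validation, not to the test set from Task 2.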