In [None]:
import pandas as pd
import numpy as np
import sklearn
from matplotlib import pyplot as plt
import seaborn as sns

## Learning goals

To know the basics regarding...

-   our example datasets/tasks,
-   machine learning in R (with the **mlr3** package),
-   selected learning algorithms for this course,
-   ML task definition.

## 1.1 Introduction / Reading (30 min)

### Data

We prepared four ML tasks based on two datasets for this course:

(a) **ctg** (simplified) $\rightarrow$ a binary classification task (target: "status")
(b) **ctg3** (original) $\rightarrow$ a 3-class classification task (target: "status")
(c) **support_regr** $\rightarrow$ a regression task (target: "totcst" - total costs)
(d) **support_surv** $\rightarrow$ a survival task (target: "death" & "d.time")

Please read the basic documentation on the two datasets, which is provided here:

-   ctg:
    -   <https://archive.ics.uci.edu/ml/datasets/cardiotocography>
-   support:
    -   <https://pubmed.ncbi.nlm.nih.gov/7810938/> $\rightarrow$ Abstract
    -   <https://hbiostat.org/data/repo/supportdesc> $\rightarrow$ "support2 Dataset"

Throughout the course, all assignments will initially focus on binary classification tasks and hence the **ctg** (simplified) dataset. For these tasks a sample solution will be provided. If you are a beginner, we recommend sticking to this dataset!

If you have additional time or bring some prior knowledge, we encourage you to consider translating the assignments to other datasets. Of course, you are also allowed to work on datasets from your own work!

### Machine learning in R (with mlr3)

In R, several frameworks are available for model development (mlr3, caret, tidymodels). In this course, mlr3-based sample solutions will be provided. However, you can utilize any other framework (or even another language) if you prefer.

If you indeed use **mlr3** for the exercises, we recommend the **mlr3 book** at <https://mlr3book.mlr-org.com/index.html> as a useful resource, in particular the first two sections "Introduction and Overview" and "Basics". You can open a separate R script (ctrl-shift-n) and experiment with the code from the book yourself.

We encourage you to revisit the mlr3 book whenever you need help during the course. In addition, a wide variety of code examples is provided in the **mlr3 gallery** at <https://mlr3gallery.mlr-org.com/>.

### Machine learning in Python

In Python, there are also several packages for model development and testing. For simple machine learning and related functions, we recommend the package scikit-learn, which contains many of the most commonly used models and utility functions.

For anything related to tables or dataframes, pandas is the most commonly used package in Python.

numpy is a package built around arrays, i.e. vectors, matrices and their higher-dimensional generalizations. Arrays are often used as the internal data structure in machine learning models because they are easier to use in computations than dataframes.
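As a small illustration of the difference, the following sketch converts a dataframe into an array and uses it in a matrix-vector product. The column names and the weights are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe; the names "a" and "b" are placeholders.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

arr = df.to_numpy()             # 2-D array with one row per observation
weights = np.array([0.5, 2.0])  # made-up coefficients, one per column

# Matrix-vector product: one linear score per row.
linear_score = arr @ weights
print(arr.shape)
print(linear_score)
```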

matplotlib and seaborn are two packages for creating plots. Seaborn plots usually look nicer out of the box, whereas matplotlib (on which seaborn is built) is more bare-bones, but its plots are easier to customize.

scipy is a package that we will use mainly in Ex.3 to work with probability distributions more easily, for example to evaluate density functions of common distributions.
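For instance, evaluating the density of a standard normal distribution with `scipy.stats` might look like the sketch below. The choice of distribution and evaluation point is only an illustration.

```python
from scipy import stats

# Density (pdf) and cumulative distribution function (cdf) of N(0, 1) at x = 0.
density_at_zero = stats.norm.pdf(0.0, loc=0.0, scale=1.0)
cdf_at_zero = stats.norm.cdf(0.0, loc=0.0, scale=1.0)

print(density_at_zero)  # 1 / sqrt(2 * pi), roughly 0.3989
print(cdf_at_zero)      # 0.5 by symmetry
```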

### ML algorithms

While this is not a course on the fundamentals of machine learning, you may want to quickly refresh your knowledge on two basic ML algorithms for tabular data which we will utilize later:

-   **Elastic Net** (*glmnet*): <https://en.wikipedia.org/wiki/Elastic_net_regularization>
-   **Random Forests** (*ranger*): <https://en.wikipedia.org/wiki/Random_forest>

## 1.2 Data loading and exploration (20 mins)

### Task

For one (e.g. the **ctg** dataset) or several of the datasets mentioned above, load the data (e.g. with the pyreadr package)
and explore it via common pandas attributes and methods such as

-   `shape` (attribute)
-   `columns` (attribute)
-   `head()` (method)

You could also check for NA entries or create exploratory (scatter) plots with matplotlib or seaborn.
A helpful reference can be found at <https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf>.

Additionally, create a (simple) plot to visualize the target variable ("status" for the **ctg** dataset/task) to check if the data is balanced.
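The exploration steps could be sketched as follows. Since the actual file is not part of this text, the sketch builds a small synthetic dataframe as a stand-in; the feature names and class labels are assumptions and will differ from the real **ctg** data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ctg data; real column names and labels may differ.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "status": rng.choice(["normal", "suspect"], size=100, p=[0.8, 0.2]),
})
df.loc[0, "feature_a"] = np.nan  # inject one missing value for illustration

print(df.shape)    # attribute: (number of rows, number of columns)
print(df.columns)  # attribute: column labels
print(df.head())   # method: first five rows

# Number of NA entries per column.
na_counts = df.isna().sum()
print(na_counts)

# Relative class frequencies of the target, a quick check for class balance.
class_share = df["status"].value_counts(normalize=True)
print(class_share)
```

Calling `class_share.plot(kind="bar")` would turn the class frequencies into a simple bar chart.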

In [None]:
# You can use this codebox for your solution or add additional ones if needed

## 1.3 ML task definition (10 minutes)

### Task

Based on the variable names and/or documentation, should any of the feature variables not be used for training?

How can we describe the ML task now in terms of

-   task type
-   outcome
-   features

To my (Tom's) knowledge, no equivalent to the `as_task_classif` function of **mlr3** exists in Python, at least not in sklearn.

Usually, one just manually implements the corresponding steps ad hoc.

First, you should separate your data into a target array (y) and a feature matrix (X), for example with the `pop()` method of a pandas DataFrame.

Make sure that the data is coded in a way that the algorithm you use later on can handle, and that irrelevant columns are dropped. For example, it might make sense to recode binary variables as 0/1. Hint: `replace()`
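A minimal sketch of these steps on a toy dataframe is shown below. The column names, the identifier column and the class labels are hypothetical. The sketch recodes the labels with `map()` rather than the hinted `replace()`; both accept a dictionary of old and new values.

```python
import pandas as pd

# Hypothetical toy data; column names and labels are placeholders.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "feature_a": [0.1, 0.5, 0.3, 0.9],
    "status": ["normal", "suspect", "normal", "suspect"],
})

# Drop a column that carries no predictive information (here: an identifier).
X = df.drop(columns=["id"])

# pop() removes the target column from X and returns it as a Series.
y = X.pop("status")

# Recode the binary target as 0/1.
y = y.map({"normal": 0, "suspect": 1})

print(X.columns.tolist())
print(y.tolist())
```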


In [None]:
# You can use this codebox for your solution or add additional ones if needed

## 1.4 ML algorithm investigation (10 minutes)

### Task

Identify suitable learning algorithms for the **ctg** task (or your task of interest) in **sklearn** (or your framework of choice) and study the implementation and the documentation (for example on <https://scikit-learn.org/stable/api/index.html>).

Usually, it is especially important to look at the parameters needed for initialization, the `fit` method and the `predict` method.
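The common pattern of sklearn estimators can be sketched as below, here with a random forest on synthetic data. The dataset and the hyperparameter values are arbitrary choices for illustration, not recommendations for the **ctg** task.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data as a stand-in for a real task.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1) Hyperparameters are passed when the estimator is initialized.
model = RandomForestClassifier(n_estimators=50, random_state=0)

# 2) fit() estimates the model from features X and target y.
model.fit(X, y)

# 3) predict() returns class labels, predict_proba() class probabilities.
labels = model.predict(X[:5])
probs = model.predict_proba(X[:5])

# get_params() lists all hyperparameters, including defaults.
print(model.get_params()["n_estimators"])
print(labels)
print(probs)
```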

If necessary, download additional required packages for model development.

In [None]:
# You can use this codebox for your solution or add additional ones if needed

## 1.5 First model training (20 minutes)

### Task

For the **ctg** dataset/task, train a simple elastic net model and estimate the AUC on an independent test set. Only train the model with the hyperparameter $\lambda = 0.01$. For this purpose, use 70% of observations (rows) for training and 30% for testing.

Create an evaluation DataFrame, i.e. a table with two columns: the true labels and the model predictions for the test data. How could we summarize this table further?

Hints:
- sklearn contains a function for train/test splitting
- sklearn contains functions for a wide range of model metrics, including ROC-AUC
- there is also a function for the confusion matrix in sklearn
- a seaborn heatmap could be a nice visualization for the confusion matrix you just obtained
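One possible way to put these hints together is sketched below on synthetic data, not on the actual **ctg** data. Several choices are assumptions: the elastic net is fitted as a penalized logistic regression with a mixing parameter `l1_ratio=0.5` (the task does not specify one), the features are standardized first, and $\lambda$ is translated to sklearn's inverse regularization strength via $C = 1/(n \cdot \lambda)$, which should be read as an approximate correspondence to the glmnet parameterization rather than an exact one.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ctg task.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 70% training data, 30% test data, stratified by the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Assumed mapping from lambda to sklearn's C (see the paragraph above).
lam = 0.01
C = 1.0 / (len(y_train) * lam)

model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=C, max_iter=5000
    ),
)
model.fit(X_train, y_train)

# Evaluation table: true labels and predicted probabilities for class 1.
eval_df = pd.DataFrame({
    "truth": y_test,
    "prob": model.predict_proba(X_test)[:, 1],
})
eval_df["pred"] = (eval_df["prob"] >= 0.5).astype(int)

auc = roc_auc_score(eval_df["truth"], eval_df["prob"])
cm = confusion_matrix(eval_df["truth"], eval_df["pred"])
print(auc)
print(cm)
```

The confusion matrix `cm` can be passed to `seaborn.heatmap(cm, annot=True)` for a quick visualization.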


In [None]:
# You can use this codebox for your solution or add additional ones if needed