In [220]:
# Initialize Otter
import otter

grader = otter.Notebook("11-exercise-pids2024.ipynb")

# Exercise sheet 11

**Hello everyone!**

**Points: 15**

Topics of this exercise sheet are:
* Classification
* Cross validation
* Grid search
* Data cleaning

Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be graded automatically with Otter-Grader. To see how many points you have earned, run `grader.check_all()` in a new cell.

In [221]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_validate
from sklearn.datasets import fetch_openml

pd.options.mode.copy_on_write = True

# Question 1: Classification with neural networks (5 Points)


In this first task, you will use a neural network for classification. For this, we load the dataset `cancer.csv`. When you run `df.head()`, you will get

```text
   radius_mean  texture_mean  ...  fractal_dimension_worst  diagnosis
0        13.74         17.91  ...                  0.07014          0
1        13.37         16.39  ...                  0.07628          0
2        14.69         13.98  ...                  0.09208          0
3        12.91         16.33  ...                  0.06949          0
4        13.62         23.23  ...                  0.06953          0
```

Each row represents a cell. The last column, `diagnosis`, is 1 if the cell is cancerous and 0 if it is benign. All other columns describe geometric features of the cell.

In [222]:
df = pd.read_csv("daten/cancer.csv")
df.head()

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,13.74,17.91,88.12,585.0,0.07944,0.06376,0.02881,0.01329,0.1473,0.0558,...,22.46,97.19,725.9,0.09711,0.1824,0.1564,0.06019,0.235,0.07014,0
1,13.37,16.39,86.1,553.5,0.07115,0.07325,0.08092,0.028,0.1422,0.05823,...,22.75,91.99,632.1,0.1025,0.2531,0.3308,0.08978,0.2048,0.07628,0
2,14.69,13.98,98.22,656.1,0.1031,0.1836,0.145,0.063,0.2086,0.07406,...,18.34,114.1,809.2,0.1312,0.3635,0.3219,0.1108,0.2827,0.09208,0
3,12.91,16.33,82.53,516.4,0.07941,0.05366,0.03873,0.02377,0.1829,0.05667,...,22.0,90.81,600.6,0.1097,0.1506,0.1764,0.08235,0.3024,0.06949,0
4,13.62,23.23,87.19,573.2,0.09246,0.06747,0.02974,0.02443,0.1664,0.05801,...,29.09,97.58,729.8,0.1216,0.1517,0.1049,0.07174,0.2642,0.06953,0


### Question 1a: Loading and preprocessing the data (1 Point)

Create two DataFrames again, `X` and `y`, where `y` contains only the diagnoses and `X` contains all the other columns. 
*Hint: Don't forget to scale your data using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)*
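
As a minimal sketch of what the scaling step does, the snippet below standardizes a small made-up array (the values are illustrative and not taken from `cancer.csv`). `StandardScaler` subtracts each column's mean and divides by its standard deviation, so every scaled column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): 4 samples, 2 features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

scaler = StandardScaler()
X_toy_scaled = scaler.fit_transform(X_toy)  # fit() and transform() in one call

# The same result computed by hand: (x - mean) / std, column by column
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_scaled.mean(axis=0))  # approximately 0 for every column
print(X_toy_scaled.std(axis=0))   # approximately 1 for every column
```

Without scaling, the second column would dominate purely because of its larger numeric range, which tends to make training a neural network harder.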

In [223]:
class Question1a:
    X = df[df.columns.drop(['diagnosis'])]
    y = df['diagnosis']

    scaler = StandardScaler()
    scaler.fit(X)
    X_scaled = scaler.transform(X)

In [224]:
grader.check("Question 1a")

### Question 1b: Train the model (2 Points)

Train an [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier) 
with two hidden layers of `10` neurons each, and set `max_iter` to 30000. 
Test how well your model performs on the data it was trained on by computing the *accuracy*. 
The accuracy is the number of correct predictions divided by the total number of predictions.
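
The definition of accuracy can be sketched on a handful of made-up labels (the arrays below are hypothetical). The manual computation is compared with `accuracy_score` from `sklearn.metrics`, which computes the same fraction.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for five samples
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and label agree (4 of 5 here)
accuracy_manual = np.mean(y_pred == y_true)

# scikit-learn's helper computes the same quantity
accuracy_sklearn = accuracy_score(y_true, y_pred)

print(accuracy_manual, accuracy_sklearn)
```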


In [225]:
class Question1b:
    mlp = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=30000, random_state=999)
    mlp.fit(Question1a.X_scaled, np.squeeze(Question1a.y))
    predictions = mlp.predict(Question1a.X_scaled)

    accuracy = predictions[predictions == Question1a.y].size / predictions.size
    print(f"accuracy is {accuracy}")

accuracy is 0.9899328859060402


In [226]:
grader.check("Question 1b")

### Question 1c: Test on a test dataset (2 Points)

Load the test set `cancer_test.csv`, which contains a new set of patients, and compute the accuracy again. What do you observe?
Use the scaler from `Question1a` to scale the data. 
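
Why the scaler from training is reused can be illustrated with a small sketch on made-up one-feature data (all values are hypothetical). The fitted scaler keeps the training mean and standard deviation, so new data is expressed on the same scale the model was trained on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature training and test data
X_train_toy = np.array([[0.0], [2.0], [4.0]])  # mean 2
X_test_toy = np.array([[4.0], [6.0]])          # shifted relative to the training data

# The statistics come from the training data only
scaler = StandardScaler().fit(X_train_toy)

# Reusing the fitted scaler keeps the shift visible: both test values are above the training mean
X_test_scaled = scaler.transform(X_test_toy)

# Refitting on the test data would center the test set on itself and hide the shift
X_test_refit = StandardScaler().fit_transform(X_test_toy)

print(X_test_scaled.ravel())
print(X_test_refit.ravel())
```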

In [227]:
class Question1c:
    df_test = pd.read_csv("daten/cancer_test.csv")
    X_test = df_test[df_test.columns.drop(['diagnosis'])]
    y_test = df_test['diagnosis']

    X_test_scaled = Question1a.scaler.transform(X_test)
    predictions = Question1b.mlp.predict(X_test_scaled)

    accuracy = predictions[predictions == y_test].size / predictions.size
    print(f"accuracy is {accuracy}")

accuracy is 0.97


In [228]:
grader.check("Question 1c")

# Question 2: Cross-validation and Grid search (5 Points)

In this exercise, we compare the quality of different models using cross-validation on the training data. 
We use the cancer dataset from the previous exercise.

### Question 2a: Cross-validation (2 Points)

Read the documentation of [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) (and use a search engine or an AI assistant if needed) to find out how to do cross-validation in scikit-learn. Perform a 5-fold cross-validation on the scaled data from Question1a using the classifier
defined in Question1b. Compute the mean and standard deviation of the `test_score`. 
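
The call pattern can be sketched on synthetic data (the dataset from `make_classification` and all parameter values below are assumptions, not the exercise data). As a design variation, the sketch wraps the scaler and the classifier in a `Pipeline`, so the scaler is refit on the training part of every fold; the exercise itself scales the data once up front.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cancer data (assumption: 200 samples, 10 features)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaler and classifier are fitted together on the training part of each fold
model = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=2000, random_state=0)),
])

# cv=5 splits the data into 5 folds; each fold serves once as the held-out part
cv_results = cross_validate(model, X_demo, y_demo, cv=5)

scores = cv_results["test_score"]  # one accuracy value per fold
print(f"mean {np.mean(scores):.3f}, standard deviation {np.std(scores):.3f}")
```

For a classifier, `cross_validate` uses the estimator's own `score` method by default, which is the accuracy.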

In [229]:
class Question2a:
    mlp, X, y = Question1b.mlp, Question1a.X_scaled, Question1a.y
    cv = cross_validate(mlp, X, y, cv=5)
    mean_test_score = np.mean(cv["test_score"])
    std_test_score = np.std(cv["test_score"])
    print(f"mean {mean_test_score} standard deviation {std_test_score}")

mean 0.9732203389830507 standard deviation 0.017042492514851577


In [230]:
grader.check("Question 2a")

Is the average test score closer to what you got in training or what you got when applying the classifier for new patients?

### Question 2b: Setting up a param grid (1 Point)

Next, we want to do a grid search. For this, we set up a dictionary with all parameters we want to search over. 
The keys are the parameters of the classifier we want to vary, and the values
are lists of all the values each parameter can take. (Check the documentation of [MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) to find the parameter names.) Define a dictionary `param_grid` for a grid search over
six different models, with layer sizes `(10,)`, `(10, 10)`, `(10, 10, 10)` and with activation functions `relu` and `tanh`. 
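
To see why such a dictionary describes six models, the combinations can be enumerated with `ParameterGrid` from `sklearn.model_selection`. This is only an illustrative sketch; the name `demo_grid` is made up for the example.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid with two parameters of MLPClassifier
demo_grid = {
    "hidden_layer_sizes": [(10,), (10, 10), (10, 10, 10)],
    "activation": ["relu", "tanh"],
}

# ParameterGrid enumerates every combination: 3 layer layouts x 2 activations
combinations = list(ParameterGrid(demo_grid))

for params in combinations:
    print(params)
print(len(combinations), "models in total")
```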

In [231]:
class Question2b:
    param_grid = {
        'hidden_layer_sizes': [(10,), (10, 10), (10, 10, 10)],
        'activation': ['tanh', 'relu'],
    }

In [232]:
grader.check("Question 2b")

### Question 2c: Do the grid search (2 Points)

Perform a grid search using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) where you use the given MLPClassifier as the estimator and the parameter grid you set up above. 
Use `accuracy` as the scoring metric and 5-fold cross-validation. 
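
A minimal sketch of the search on synthetic data is shown below. The dataset, the small grid, and the classifier settings are assumptions chosen so that the example runs quickly; they are not the values asked for in the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption), kept small so the search runs quickly
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

demo_grid = {
    "hidden_layer_sizes": [(5,), (5, 5)],
    "activation": ["relu", "tanh"],
}

# Every combination in the grid is evaluated with 5-fold cross-validation
search = GridSearchCV(
    MLPClassifier(solver="lbfgs", max_iter=1000, random_state=0),
    param_grid=demo_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)                # combination with the highest mean score
print(round(search.best_score_, 3))       # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))  # number of combinations that were tried
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best parameters, because `refit=True` is the default.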

In [233]:

class Question2c:
    mlp = MLPClassifier(max_iter=30000, random_state=999, solver='lbfgs')
    g = GridSearchCV(mlp, param_grid=Question2b.param_grid, scoring='accuracy', cv=5)
    grid_search = g.fit(Question1a.X_scaled, Question1a.y)

    print(grid_search.best_params_.values())

dict_values(['relu', (10,)])


In [234]:
grader.check("Question 2c")


# Question 3: A more complex example (5 Points)

In this exercise, we walk through a data cleaning process and work with non-numerical data before we 
do a classification. 

In [235]:
# Load the dataset
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True, parser='auto')
y = y.apply(lambda x: int(x))

### Question 3a) Data cleaning (I) (2 Points)

Not all columns contain meaningful information for prediction. 
Extract the columns `pclass`, `sex`, `fare`, and `boat`. 
You can read about the meaning of the columns [here](https://www.openml.org/search?type=data&sort=runs&id=40945&status=active).
Fill the `nan`s in column `boat` with a value of `0` and replace the non-numerical values with another unique value. 
Change `sex` to 0 for male and 1 for female. Assign the resulting dataframe to the variable `X_cleaned`. 


*Hint: To find out what values you have in the column 'boat' use the method `unique`*. 
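
The cleaning steps can be sketched on a tiny made-up table (the DataFrame `toy` and its values are hypothetical). One possible approach, shown here as an assumption rather than the required solution, uses `pd.factorize`, which assigns `-1` to missing values and consecutive integers to the distinct labels.

```python
import pandas as pd

# Hypothetical mini table with the same kind of columns as the Titanic data
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "boat": ["2", None, "A", "2"],
})

print(toy["boat"].unique())  # shows the distinct entries, including the missing one

# Map the two categories to integers, following the encoding stated in the task
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# factorize assigns codes 0, 1, ... to the distinct labels and -1 to missing values,
# so adding 1 turns missing values into 0 and every label into a unique positive code
codes, labels = pd.factorize(toy["boat"])
toy["boat"] = codes + 1

print(toy)
```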

In [236]:

class Question3a:
    X = X
    X_cleaned = X[['pclass', 'sex', 'fare', 'boat']]

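    # Encode sex as integers (as written here: male -> 1, female -> 0)
    X_cleaned['sex'] = X_cleaned['sex'].replace({'male': 1, 'female': 0})

    # factorize marks NaN with -1, so adding 1 maps NaN to 0 and each boat label to a unique code >= 1
    codes, uniques = pd.factorize(X_cleaned['boat'], use_na_sentinel=True)
    X_cleaned['boat'] = codes + 1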


Question3a.X_cleaned


Unnamed: 0,pclass,sex,fare,boat
0,1,0,211.3375,1
1,1,1,151.5500,2
2,1,0,151.5500,0
3,1,1,151.5500,0
4,1,0,151.5500,0
...,...,...,...,...
1304,3,0,14.4542,0
1305,3,0,14.4542,0
1306,3,1,7.2250,0
1307,3,1,7.2250,0


In [237]:
grader.check("Question 3a")

### Question 3b) Data cleaning (II) (2 Points)

Now drop all rows that contain `nan`s. Make sure you also drop them from the labels `y`. 
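
Keeping features and labels aligned can be sketched on made-up data (the names `X_toy` and `y_toy` are hypothetical). The idea is to drop the incomplete rows from the features first and then select the labels by the surviving index.

```python
import numpy as np
import pandas as pd

# Hypothetical features with one missing value, plus matching labels
X_toy = pd.DataFrame({"fare": [7.25, np.nan, 30.0], "pclass": [3, 1, 2]})
y_toy = pd.Series([0, 1, 1], name="survived")

# dropna() removes every row that contains at least one NaN
X_toy_nonan = X_toy.dropna()

# Select the labels whose index survived, so that X and y stay aligned
y_toy_nonan = y_toy.loc[X_toy_nonan.index]

print(X_toy_nonan)
print(y_toy_nonan)
```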

In [238]:
class Question3b:
    X = Question3a.X_cleaned

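    # Index labels of all rows that contain at least one NaN
    dropped = X.index.difference(X.dropna().index).tolist()

    # Drop the same rows from the features and the labels so they stay aligned
    X_nonan = X.drop(dropped)
    y_nonan = y.drop(dropped)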


Question3b.y_nonan

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1308, dtype: category
Categories (2, int64): [0, 1]

In [239]:
grader.check("Question 3b")

### Question 3c) Split into training, validation and test sets (1 Point)

Use scikit-learn's [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
to split the data into a training, a validation and a test set. The training set should contain around 60% of the data, and the other two sets around 20% each. Assign the data to the variables `X_train, y_train, X_val, y_val, X_test, y_test`.
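
One way to obtain a 60/20/20 split is to call `train_test_split` twice, sketched here on made-up arrays (the data and the `random_state` values are assumptions). Note that `test_size` is the fraction that is held out, so a 60% training set corresponds to `test_size=0.4` in the first call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 3 features and binary labels
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100) % 2

# Step 1: keep 60% for training, hold out the remaining 40%
X_train, X_rest, y_train, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

# Step 2: split the held-out 40% in half, giving 20% validation and 20% test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```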

In [240]:

class Question3c:
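    # test_size=0.6 holds out 60% of the rows, so 40% remain for training
    X_train, X_test_val, y_train, y_test_val = train_test_split(Question3b.X_nonan, Question3b.y_nonan, test_size=0.6,
                                                                random_state=0)
    # Cut the held-out rows into two halves: the first becomes validation, the second test
    X_val, X_test = np.array_split(X_test_val, 2)
    y_val, y_test = np.array_split(y_test_val, 2)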


Question3c.X_train


Unnamed: 0,pclass,sex,fare,boat
245,1,0,86.5000,10
685,3,1,16.1000,0
518,2,1,36.7500,0
1188,3,0,16.7000,17
1255,3,1,7.2292,0
...,...,...,...,...
763,3,0,20.5750,4
835,3,1,8.0500,0
1216,3,0,7.7333,17
559,2,0,36.7500,2


In [241]:
grader.check("Question 3c")

### Question 3d) Training the classifier

Now you can train the classifier. You won't get points for it, but wouldn't it be unsatisfying not to do it?


In [242]:
class Question3d:
    scaler = StandardScaler()
    scaler.fit(Question3c.X_train)

    X_train_scaled = scaler.transform(Question3c.X_train)
    X_test_scaled = scaler.transform(Question3c.X_test)
    X_val_scaled = scaler.transform(Question3c.X_val)

    mlp = MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=30000, random_state=666)
    mlp.fit(X_train_scaled, np.squeeze(Question3c.y_train))

    test_predictions = mlp.predict(X_test_scaled)
    val_predictions = mlp.predict(X_val_scaled)

    test_accuracy = test_predictions[test_predictions == Question3c.y_test].size / test_predictions.size
    print(f"test_accuracy is {test_accuracy}")

    val_accuracy = val_predictions[val_predictions == Question3c.y_val].size / val_predictions.size
    print(f"val_accuracy is {val_accuracy}")

test_accuracy is 0.9719387755102041
val_accuracy is 0.9567430025445293


In [243]:
grader.check_all()

Question 1a results: All test cases passed!

Question 1b results: All test cases passed!

Question 1c results: All test cases passed!

Question 2a results: All test cases passed!

Question 2b results: All test cases passed!

Question 2c results: All test cases passed!

Question 3a results: All test cases passed!

Question 3b results: All test cases passed!

Question 3c results: All test cases passed!