**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
DATA_HOME = "https://github.com/michalgregor/ml_notebooks/blob/main/data/{}?raw=1"

from class_utils.download import download_file_maybe_extract
download_file_maybe_extract(DATA_HOME.format("adult_income.zip"), directory="data/adult_income")

# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

## Decision Tree Classifier: The Adult Income Dataset

Having shown how to apply a decision tree classifier to the Titanic dataset, we will now go through one more example and apply it to the [Adult Income Dataset](https://archive.ics.uci.edu/ml/datasets/adult). Since we have already worked with this dataset, the code that loads and preprocesses the data is in the next cell, hidden for conciseness.



In [None]:
#@title -- Loading and Preprocessing the Data: X_train, Y_train, X_test, Y_test -- { display-mode: "form" }
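# missing values are marked with " ?" in the raw files
df_train = pd.read_csv("data/adult_income/adult.data",
                       header=None, na_values=" ?")
# the first line of adult.test is not a data row, so it is skipped
df_test = pd.read_csv("data/adult_income/adult.test",
                      header=None, skiprows=1, na_values=" ?")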
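# labels in adult.test carry a trailing period (e.g. ">50K."); drop it to match the training labels
df_test[14] = df_test[14].apply(lambda x: x[:-1])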
    
categorical_inputs = [1, 3, 5, 6, 7, 8, 9, 13]
numeric_inputs = [0, 2, 4, 10, 11, 12]

output = 14

input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OrdinalEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

output_enc = OrdinalEncoder()
    
X_train = input_preproc.fit_transform(df_train)
Y_train = output_enc.fit_transform(df_train[[output]]).reshape(-1)

X_test = input_preproc.transform(df_test)
Y_test = output_enc.transform(df_test[[output]]).reshape(-1)

---
### Task 1: Apply a Decision Tree Classifier, Tuning Its Hyperparameters

**The dataset is stored in arrays `X_train`, `Y_train`, `X_test`, `Y_test`. Apply a decision tree classifier to it and tune its hyperparameters using cross-validation on the training set. Once the hyperparameters are tuned, retrain the final model on the entire training set.** 

NOTE 1: You might be tempted to reuse the hyperparameters from the previous example, and they might even work. In general, however, the optimal hyperparameters depend on the dataset and not just on the model.

NOTE 2: Be careful if you decide to plot the model. Unless you apply sufficient pruning, the tree may be too large to visualize conveniently, and plotting it may take an excessive amount of time.
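
One way to keep a tree inspectable is sketched below, assuming cost-complexity pruning via `ccp_alpha` and a depth-limited text dump via `export_text`. The synthetic data and the value `ccp_alpha=0.01` are illustrative assumptions, not recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption, so that the sketch runs on its own).
X_demo, Y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# An unpruned tree keeps splitting until its leaves are pure
# or cannot be split any further.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, Y_demo)

# Cost-complexity pruning; the value of ccp_alpha is illustrative only.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_demo, Y_demo)

print("nodes (unpruned):", full_tree.tree_.node_count)
print("nodes (pruned):  ", pruned_tree.tree_.node_count)

# Print only the top levels of the unpruned tree; deeper branches are truncated.
summary = export_text(full_tree, max_depth=2)
print(summary)
```

`plot_tree` from `sklearn.tree` accepts a `max_depth` argument as well, which limits how many levels are drawn.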

---
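
The procedure described in the task can be sketched as follows: score each candidate hyperparameter value with cross-validation on the training data, pick the best one, and refit on all of the training data. This is a minimal sketch, not the reference solution: the helper `tune_max_depth`, the candidate depths and the synthetic stand-in data are all assumptions made for illustration, and only a single hyperparameter is searched.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def tune_max_depth(X, Y, depths, cv=5):
    """Hypothetical helper: return (best_depth, scores), where scores maps
    each candidate depth to its mean cross-validated accuracy."""
    scores = {}
    for depth in depths:
        candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
        result = cross_validate(candidate, X, Y, cv=cv, scoring="accuracy")
        scores[depth] = result["test_score"].mean()
    best_depth = max(scores, key=scores.get)
    return best_depth, scores


# Synthetic stand-in data (an assumption, so that the sketch runs on its own).
X_demo, Y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The candidate depths are illustrative, not tuned recommendations.
best_depth, scores = tune_max_depth(X_demo, Y_demo, depths=[2, 4, 8, 16])
print("mean CV accuracy per depth:", scores)
print("selected max_depth:", best_depth)

# Retrain the selected configuration on the entire training set.
demo_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
demo_model.fit(X_demo, Y_demo)
```

In the notebook itself, you would pass `X_train` and `Y_train` instead of the synthetic arrays and store the refitted classifier in `model`, which is the name the testing cell below expects. Other hyperparameters, such as `min_samples_leaf` or `ccp_alpha`, can be searched in the same way.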


In [None]:


# ----



### Testing

Having tuned and trained the model, we can now check how well it generalizes by evaluating it on the testing set.



In [None]:
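# predict the labels of the testing set
y_pred = model.predict(X_test)

# confusion matrix: rows are actual classes, columns are predicted classes
cm = pd.crosstab(Y_test, y_pred,
                 rownames=['actual'],
                 colnames=['predicted'])
print(cm, "\n")

acc = accuracy_score(Y_test, y_pred)
print("Accuracy = {}".format(acc))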