# Training a Decision Tree Classifier

In [None]:
import os
os.makedirs('/tmp/wine', exist_ok=True)
os.chdir('/tmp/wine')
!sudo pacman -S wget --noconfirm

In this tutorial we will use sklearn to train and validate a simple wine classifier using the [wine quality data set](https://archive.ics.uci.edu/ml/datasets/wine+quality). We will use `dud` to track the data and version the model weights.

## Environment Setup

First we need to install `dud`, which requires `go` to be on your `PATH`. Currently, the preferred way to install `dud` is to clone the repo and run `make install`.

In [None]:
!git clone https://github.com/kevin-hanselman/dud

In [None]:
!cd dud && make install && cd ../

We can verify that `dud` installed correctly by outputting the version.

In [None]:
!dud version

We will be using Python with the scikit-learn and pandas packages to train our classifier. To keep these packages isolated, we recommend first creating a Python virtual environment.

In [None]:
!python -m venv .env
!source .env/bin/activate
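
If you want to confirm which interpreter a notebook or script is actually using, the following minimal sketch may help. It relies only on the standard library; the helper name `in_virtualenv` is a hypothetical name chosen for illustration. Keep in mind that each `!` command in a notebook runs in its own subshell, so an activation performed there may not carry over to later cells.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(f"interpreter: {sys.executable}")
print(f"inside a virtual environment: {in_virtualenv()}")
```

If this reports `False` when you expected `True`, packages installed in the next step will likely end up outside of `.env`.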

We can then install the Python packages we need.

In [None]:
!pip install scikit-learn pandas --user

Now that `dud` and our Python packages are installed, let's make a new directory for our work.

In [None]:
!mkdir wine_classifier && cd wine_classifier

In [None]:
os.chdir('wine_classifier')

Before we add any data, let's initialize a `dud` repo.

In [None]:
!dud init

This creates a `.dud` folder in the current folder and populates it with some config files that have sensible defaults. A `.dud/cache` folder is also created, but it's empty for now.

Let's verify that the above is true.

In [None]:
!tree .dud

Now we're ready to add some data.

## Adding Data

We will be using the "Wine Quality Data Set" from the UCI Machine Learning Repository, which we can download easily with `wget`. If you're unfamiliar with `wget`, all you need to know is that this command downloads a couple of CSV files and saves them in the `data` folder. The command is shown below.

    wget -q -r -np -nd -A csv https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ -P data

First, let's make the `data` directory that the dataset will be downloaded into.

In [None]:
!mkdir data/

We can then create a `dud` stage to run the above command for us. `dud` refers to a group of files or directories as "artifacts". A stage is either a collection of artifacts or an _operation_ on a collection of artifacts. A stage is defined in a YAML file, which can be tracked with source control.

`dud` provides an easy way to generate stages, although stages can always be created and edited manually.

In [None]:
!dud stage gen

The `-o` flag indicates that the stage will generate an artifact. We want to tell `dud` that our command will generate the `data` directory so that it knows to track it as an artifact.

In [None]:
!dud stage gen -o data/

We also want to tell `dud` _how_ the artifact is generated, so we pass it the `wget` command from above. We also add `--`, which by convention tells a command-line program to stop parsing flags. This is needed so that `dud` doesn't try to parse the `wget` flags itself!
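
The `--` convention is not specific to `dud`. As an analogy only (this is not how `dud` itself is implemented), here is a minimal sketch using Python's `argparse`, which follows the same convention. The program name `stage-gen-demo` and the destination names `outputs` and `command` are hypothetical.

```python
import argparse

# A toy parser with one flag of its own (-o) and a trailing command.
parser = argparse.ArgumentParser(prog="stage-gen-demo")
parser.add_argument("-o", dest="outputs", action="append", default=[])
parser.add_argument("command", nargs="*")

# Everything after "--" is treated as a positional argument, even if it
# looks like a flag, so the wget flags pass through untouched.
argv = ["-o", "data/", "--", "wget", "-q", "-r", "-P", "data"]
args = parser.parse_args(argv)

print("outputs:", args.outputs)
print("command:", args.command)
```

Without the `--`, `argparse` would try to interpret `-q` as one of its own flags and reject it, which is the same kind of confusion the `--` in the `dud stage gen` call avoids.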

In [None]:
!dud stage gen -o data/ -- wget -q -r -np -nd -A csv https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ -P data

It looks like this has all of the information we need! However, we haven't actually saved the generated stage to a file yet. We can use `>` to redirect `dud`'s autogenerated stage to a YAML file.

In [None]:
!dud stage gen -o data/ -- wget -q -r -np -nd -A csv https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ -P data > get_data.yaml

When we add a stage to `dud`, we're just letting `dud` know that the stage exists. `dud` keeps track of its stages in an index file.

In [None]:
!cat .dud/index

Let's tell `dud` to track that stage and check the index file and the status.

In [None]:
!dud stage add get_data.yaml

In [None]:
!cat .dud/index

In [None]:
!dud status

Notice that the stage hasn't been checksummed, but that `dud` is aware of the data artifact and knows that it is uncommitted. This is by design: `dud` waits to commit until you tell it to, because committing is one of its most costly operations (it involves a lot of hashing). Let's fix that by running the stage and committing the output.

In [None]:
!dud run get_data.yaml

In [None]:
!dud commit get_data.yaml

In [None]:
!dud status

Excellent! Remember how the cache was empty before? You'll find that after the commit, the cache is no longer empty.

In [None]:
!tree .dud

Those content-addressed files in the cache correspond to the data in the `data` folder. By default, `dud` symlinks the files in the working directory to point to the files in the cache.
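
To make "content-addressed" concrete, here is a minimal sketch of how a cache path can be derived from a file's contents. It is an illustration under assumptions, not `dud`'s implementation: SHA-256 is used as a stand-in hash (the tutorial does not say which hash `dud` uses), the two-character directory split is inferred from the cache paths that `tree` prints, and `cache_path` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def cache_path(content: bytes, cache_dir: str = ".dud/cache") -> Path:
    """Map file contents to a location inside a content-addressed cache."""
    # Assumption: SHA-256 stands in for whatever hash the real tool uses.
    digest = hashlib.sha256(content).hexdigest()  # 64 hex characters
    # The first two characters name a subdirectory; the rest names the file.
    return Path(cache_dir) / digest[:2] / digest[2:]


first = cache_path(b"fixed acidity;volatile acidity\n")
same = cache_path(b"fixed acidity;volatile acidity\n")
other = cache_path(b"fixed acidity;citric acid\n")

print(first)
print("identical content, identical path:", first == same)
print("different content, different path:", first != other)
```

Because the path depends only on the bytes, two identical files share one cache entry, and any change to a file produces a new entry rather than overwriting the old one.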

The files in the cache are important because that's where the real copy of the file lives. `dud` tries to protect you from monkeying around with those cache files.

In [None]:
!ls -l .dud/cache/3f/89718d7db7e8983db992bbe63b63c912b510aa279eef55fd6927f98a4a72f5

You can see that this cache file has the permissions "-r--r--r--", which means that the file is read-only.
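
The same permission string can be inspected from Python. The sketch below uses a throwaway temporary file rather than a real cache entry; the file name `cached-file` and the helper `mode_string` are hypothetical.

```python
import os
import stat
import tempfile


def mode_string(path: str) -> str:
    """Return the ls-style permission string for a path."""
    return stat.filemode(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cached-file")
    with open(path, "w") as f:
        f.write("example\n")

    # Drop all write bits, leaving read permission for user, group and others.
    os.chmod(path, 0o444)
    read_only_mode = mode_string(path)
    print(read_only_mode)

    # Restore write permission so the temporary directory cleans up everywhere.
    os.chmod(path, 0o644)
```

The octal mode `0o444` corresponds to the `r--r--r--` part of the string; the leading `-` marks a regular file.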

`dud` also tries to protect us from mishaps in our working directory. What if, through some freak accident, we delete our `data` folder? Since we committed, `dud` has us covered.

In [None]:
!ls

In [None]:
!rm -r data/

In [None]:
!ls

No worries! We can simply restore the folder to the working directory with a `dud checkout`.

In [None]:
!dud checkout get_data.yaml

In [None]:
!ls
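
Why is the deletion harmless? Since the working files are symlinks into the cache by default, removing them only removes the links. The following sketch illustrates that idea on a POSIX system with throwaway files. It is not `dud`'s checkout logic, and every path and file name in it is hypothetical.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # The "real" copy lives in a cache directory.
    cache_file = os.path.join(tmp, "cache", "ab", "cdef")
    os.makedirs(os.path.dirname(cache_file))
    with open(cache_file, "w") as f:
        f.write("quality;alcohol\n")

    # The working directory only holds a symlink to the cached file.
    work_dir = os.path.join(tmp, "data")
    work_file = os.path.join(work_dir, "wine.csv")
    os.makedirs(work_dir)
    os.symlink(cache_file, work_file)

    # "Accidentally" delete the working directory: only the link is removed.
    shutil.rmtree(work_dir)
    cache_survived = os.path.exists(cache_file)

    # Restoring the working copy is just a matter of recreating the link.
    os.makedirs(work_dir)
    os.symlink(cache_file, work_file)
    with open(work_file) as f:
        restored = f.read()

print("cache survived deletion:", cache_survived)
print("restored contents:", restored.strip())
```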

As another protection, `dud` will refuse to run as root.

In [None]:
!sudo dud status
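
A guard like this is simple to write in your own scripts as well. The sketch below is a Python analogy, not `dud`'s code; `ensure_not_root` is a hypothetical helper that takes the effective user ID as a parameter so it can be demonstrated without real root access.

```python
def ensure_not_root(euid: int) -> None:
    """Raise PermissionError if the given effective user ID is root (0)."""
    if euid == 0:
        raise PermissionError("refusing to run as root")


# In a real script on a POSIX system you would pass os.geteuid().
for euid in (1000, 0):
    try:
        ensure_not_root(euid)
        print(f"euid {euid}: ok")
    except PermissionError as exc:
        print(f"euid {euid}: {exc}")
```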

## Training a Decision Tree Classifier

Here's a quick script that uses scikit-learn to train a decision tree on the wine dataset and saves the resulting model to a pickle file. The `%%writefile` cell below saves it as `train.py`.

In [None]:
%%writefile train.py
import pickle

import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

red_wine = pd.read_csv('data/winequality-red.csv', sep=';')

X_train, X_test, y_train, y_test = train_test_split(
    red_wine[['sulphates', 'alcohol']],
    red_wine['quality'],
    test_size=0.25,
    random_state=0)

clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

print(f'Training Accuracy {round(accuracy_score(y_train, clf.predict(X_train)) * 100, 2)}%')
print(f'Testing Accuracy {round(accuracy_score(y_test, clf.predict(X_test)) * 100, 2)}%')

with open('dt.pkl', 'wb') as f:
    pickle.dump(clf, f)

Let's use the same technique as for our `get_data.yaml` stage, but this time add the `-d` flag to tell `dud` that `data/` and `train.py` are dependencies of this stage.

In [None]:
!dud stage gen -d data/ -d train.py -o dt.pkl python train.py > train.yaml

In [None]:
!dud stage add train.yaml

In [None]:
!dud st

We can then train the model with `dud run`.

In [None]:
!dud run train.yaml

In [None]:
!dud st

As before, we can commit whenever we're ready.

In [None]:
!dud commit