In [18]:
from src.data.datasets import load_dataset, available_datasets
from src.models import available_algorithms

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

In [4]:
available_datasets()

['f-mnist']

## In this example we'll use the training data to build and deploy the initial model

In [5]:
ds = load_dataset('f-mnist', kind='train')

2018-10-10 21:11:12,162 - fetch - INFO - Ungzipping train-images-idx3-ubyte
2018-10-10 21:11:12,588 - fetch - INFO - Ungzipping train-labels-idx1-ubyte
2018-10-10 21:11:12,590 - fetch - INFO - Ungzipping t10k-images-idx3-ubyte
2018-10-10 21:11:12,669 - fetch - INFO - Ungzipping t10k-labels-idx1-ubyte
2018-10-10 21:11:12,671 - fetch - INFO - Copying f-mnist.license
2018-10-10 21:11:12,672 - fetch - INFO - Copying f-mnist.readme


In [6]:
ds.data.shape

(60000, 784)

In [7]:
ds.target.shape

(60000,)

## Train an algorithm

`make train`

We're going to build a visualization tool for our stylists to use. Let's use the UMAP dimension reduction algorithm for this.

https://umap-learn.readthedocs.io/en/latest/

We can pip install umap via:

`pip install umap-learn`

Time to show off your new reproducible environment skills!

Add `umap-learn` under your pip requirements in the `environment.yml` and run `make requirements`.

In [10]:
from umap import UMAP

In [12]:
# We want a 2 dimensional visualization
model = UMAP(n_components=2, random_state=42)

In [14]:
%%time
model.fit(ds.data)

CPU times: user 2min 9s, sys: 6.5 s, total: 2min 16s
Wall time: 1min 58s


UMAP(a=None, angular_rp_forest=False, b=None, init='spectral',
   learning_rate=1.0, local_connectivity=1.0, metric='euclidean',
   metric_kwds=None, min_dist=0.1, n_components=2, n_epochs=None,
   n_neighbors=15, negative_sample_rate=5, random_state=42,
   repulsion_strength=1.0, set_op_mix_ratio=1.0, spread=1.0,
   target_metric='categorical', target_metric_kwds=None,
   target_n_neighbors=-1, target_weight=0.5, transform_queue_size=4.0,
   transform_seed=42, verbose=False)

### Check that UMAP is a BaseEstimator

In [16]:
from sklearn.base import BaseEstimator

In [17]:
isinstance(model, BaseEstimator)

True

## Exercise

Add UMAP to `available_algorithms` and add an entry to `model_list.json` for training UMAP on `f-mnist`.

In [21]:
assert 'umap' in available_algorithms()

AssertionError: 

In [22]:
!cd .. && make train

python3 -m src.models.train_model model_list.json
  import imp
2018-10-10 21:31:01,223 - train_model - INFO - Building models from model_list.json
Traceback (most recent call last):
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ava00125/src/devel/bus_number/src/models/train_model.py", line 112, in <module>
    main()
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File 

## Use it to predict
`make predict`

In [60]:
p_test = model.predict(X_test)
p_test

array([6, 9, 3, 7, 2, 1, 5, 2, 5, 2, 1, 8, 4, 0, 4, 2, 3, 7, 8, 8, 4, 3,
       9, 7, 1, 6, 3, 5, 6, 3, 4, 9, 1, 4, 4, 6, 9, 4, 7, 6, 6, 9, 1, 3,
       6, 1, 3, 0, 6, 5, 5, 1, 9, 5, 6, 0, 2, 0, 0, 1, 0, 4, 5, 2, 4, 5,
       7, 0, 7, 5, 9, 5, 5, 4, 7, 0, 4, 5, 5, 9, 9, 0, 2, 3, 8, 0, 6, 4,
       4, 9, 1, 2, 8, 3, 5, 2, 9, 4, 4, 4, 4, 3, 5, 3, 1, 3, 5, 9, 4, 2,
       7, 7, 4, 4, 1, 9, 2, 7, 8, 7, 2, 6, 9, 4, 0, 7, 2, 7, 5, 8, 7, 5,
       7, 9, 0, 6, 6, 4, 2, 8, 0, 9, 4, 6, 9, 9, 6, 9, 0, 5, 5, 6, 6, 0,
       6, 4, 2, 9, 3, 8, 7, 2, 9, 0, 4, 5, 3, 6, 5, 8, 9, 8, 4, 2, 1, 3,
       7, 3, 2, 2, 3, 9, 8, 0, 3, 2, 2, 5, 6, 9, 9, 4, 1, 2, 4, 2, 3, 6,
       4, 8, 5, 9, 5, 7, 8, 9, 4, 8, 1, 5, 4, 4, 9, 6, 1, 8, 6, 0, 4, 5,
       2, 7, 1, 6, 4, 5, 6, 0, 3, 2, 3, 6, 7, 1, 9, 1, 4, 7, 6, 5, 8, 5,
       5, 1, 5, 2, 8, 8, 9, 8, 7, 6, 2, 2, 2, 3, 4, 8, 8, 3, 6, 0, 8, 7,
       7, 0, 1, 0, 4, 5, 8, 5, 3, 6, 0, 4, 1, 0, 0, 3, 6, 5, 9, 7, 3, 5,
       5, 9, 9, 8, 5, 3, 3, 2, 0, 5, 8, 3, 4, 0, 2,

## Test the quality of the prediction
`make analysis`

In [57]:
model.score(X_test, y_test)

0.9292929292929293

In [58]:
from sklearn.metrics import accuracy_score

In [65]:
accuracy_score(y_test, p_test)

0.9292929292929293
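
The two numbers agree because, for a scikit-learn classifier, `score` reports mean accuracy: the fraction of predictions that match the true labels. Here is a minimal pure-Python sketch of that computation. The labels are made up for illustration and are not the notebook's `y_test` and `p_test`, and `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Illustrative labels only (not taken from the notebook's data).
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 1, 2, 1, 1]

# 4 of the 5 predictions match the true labels.
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```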

## Next: 
* automate the basic workflow
* compare a bunch of algorithms for our Swedish Chef paper

## Add our algorithm to available_algorithms

In [67]:
from src.models import available_algorithms

There are currently no available algorithms.

In [70]:
available_algorithms()

{}

To add an algorithm, add a key:value pair to the dict `_ALGORITHMS` in `src/models/algorithms.py`.

For example, add
```
'linearSVC': LinearSVC()
```
to the `_ALGORITHMS` dict, and add
```
from sklearn.svm import LinearSVC
```
to the top of the file.

Also, add `linearSVC` to the docstring of `available_algorithms`.
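
Put together, the edited part of `src/models/algorithms.py` might look roughly like the sketch below. This is a guess at the file's shape, not a copy of it. In particular, the `keys_only` parameter is an assumption, introduced here to reconcile the docstring (which says a dict is returned) with the list shown in the cell output.

```python
from sklearn.svm import LinearSVC

# Registry mapping algorithm names to estimator instances.
# (Sketch of the `_ALGORITHMS` dict described above.)
_ALGORITHMS = {
    'linearSVC': LinearSVC(),
}


def available_algorithms(keys_only=True):
    """Valid Algorithms for training or prediction

    The valid algorithm names, and the function they map to, are:

    Algorithm        Function
    linearSVC        sklearn.svm.LinearSVC
    """
    # `keys_only` is a hypothetical switch, not confirmed by the source.
    if keys_only:
        return list(_ALGORITHMS.keys())
    return _ALGORITHMS


print(available_algorithms())
```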

In [74]:
available_algorithms()

['linearSVC']

In [76]:
print(available_algorithms.__doc__)

Valid Algorithms for training or prediction

    This function simply returns a dict of known
    algorithms strings and their corresponding estimator function.

    It exists to allow for a description of the mapping for
    each of the valid strings as a docstring

    The valid algorithm names, and the function they map to, are:

    Algorithm        Function
    linearSVC        sklearn.svm.LinearSVC
    


Now we're in a position where the `make train` script can run using `linearSVC`. 

You'll notice that `make train` takes a `models/model_list.json` as input. Let's make one.
```
## train / fit / build models
train: models/model_list.json
	$(PYTHON_INTERPRETER) -m src.models.train_model model_list.json
```

A `model_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset`: A valid dataset name from `available_datasets`
* `dataset_params`: A dictionary of parameters that can be passed to `load_dataset()` with the specified `dataset`
* `algorithm`: A valid algorithm name from `available_algorithms`
* `algorithm_params`: A dictionary of parameters to use when running the specified `algorithm`
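
Because a typo in any of these four keys only surfaces once training starts, it can help to check the list up front. The sketch below is one way to do that. `validate_model_list` is a hypothetical helper, not part of `src`, and the dataset and algorithm names are passed in explicitly so the example stays self-contained.

```python
REQUIRED_KEYS = {'dataset', 'dataset_params', 'algorithm', 'algorithm_params'}


def validate_model_list(model_list, datasets, algorithms):
    """Return a list of problem descriptions; an empty list means valid."""
    problems = []
    for i, entry in enumerate(model_list):
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
            continue  # the remaining checks need every key present
        if entry['dataset'] not in datasets:
            problems.append(f"entry {i}: unknown dataset {entry['dataset']!r}")
        if entry['algorithm'] not in algorithms:
            problems.append(f"entry {i}: unknown algorithm {entry['algorithm']!r}")
        for key in ('dataset_params', 'algorithm_params'):
            if not isinstance(entry[key], dict):
                problems.append(f"entry {i}: {key} must be a dict")
    return problems


good_entry = {
    'dataset': 'lvq-pak',
    'dataset_params': {},
    'algorithm': 'linearSVC',
    'algorithm_params': {'random_state': 42, 'max_iter': 200000},
}
bad_entry = {'dataset': 'lvq-pak', 'algorithm': 'linearSVC'}

good_problems = validate_model_list([good_entry], ['lvq-pak'], ['linearSVC'])
bad_problems = validate_model_list([bad_entry], ['lvq-pak'], ['linearSVC'])
print(good_problems)
print(bad_problems)
```

In the notebook, the last two arguments would presumably come from `available_datasets()` and `available_algorithms()`.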

In [79]:
model_list = [
    {
        'dataset': 'lvq-pak',
        'dataset_params': {},
        'algorithm': 'linearSVC',
        'algorithm_params': {'random_state': 42, 'max_iter': 200000},
    }
]

In [83]:
from src.paths import model_path
from src.utils import save_json

In [85]:
save_json(model_path / 'model_list.json', model_list)

In [11]:
!cat ../models/model_list.json

[
  {
    "algorithm": "linearSVC",
    "algorithm_params": {
      "max_iter": 200000,
      "random_state": 42
    },
    "dataset": "lvq-pak",
    "dataset_params": {}
  }
]
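
The file above has sorted keys and two-space indentation, which is what the standard library's `json.dump` produces with `sort_keys=True` and `indent=2`. The real `src.utils.save_json` is not shown in this notebook, so the helper below (`save_json_sketch`, a hypothetical name) is only a guess at an equivalent implementation. It writes to a temporary directory rather than to `models/`.

```python
import json
import tempfile
from pathlib import Path


def save_json_sketch(filename, obj, indent=2, sort_keys=True):
    """Write `obj` to `filename` as indented JSON with sorted keys."""
    with open(filename, 'w') as fw:
        json.dump(obj, fw, indent=indent, sort_keys=sort_keys)


model_list_demo = [
    {
        'dataset': 'lvq-pak',
        'dataset_params': {},
        'algorithm': 'linearSVC',
        'algorithm_params': {'random_state': 42, 'max_iter': 200000},
    }
]

# Write to a throwaway directory, then read the file back.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'model_list.json'
    save_json_sketch(out_path, model_list_demo)
    written_text = out_path.read_text()
    round_tripped = json.loads(written_text)

print(written_text)
```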

Now running `make train` will train `LinearSVC` on `lvq-pak` with the specified parameters.

The output will be:
* A trained model in `models/trained_models`
* A json file `models/trained_models.json` that keeps track of the models that we've trained

In [14]:
!cd .. && make train

python3 -m src.models.train_model model_list.json
  import imp
2018-10-10 19:04:43,569 - train_model - INFO - Building models from model_list.json
Traceback (most recent call last):
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ava00125/src/devel/bus_number/src/models/train_model.py", line 112, in <module>
    main()
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/opt/software/anaconda3/envs/bus_number/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File 

Any algorithm will work, provided that it:
* is a subclass of the sklearn `BaseEstimator` class (needed for setting and getting params)
* has a `fit` method (needed for `make train`)
* has either a `predict` method (supervised) or a `transform` method (unsupervised) (needed for `make predict`)

In particular, you can use an sklearn pipeline, or an sklearn meta-estimator such as `GridSearchCV`, as an algorithm.

If your algorithm of choice is **not** a `BaseEstimator` with the appropriate API, it is fairly easy to wrap it to be used in this way. While we won't have time to cover an example of this during the in-person part of this tutorial, the EDA Text Embedding (advanced usage tutorial) has an example of how to do this with gensim's FastText model.
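
To give a flavor of what such a wrapper involves, here is a minimal sketch using a toy model instead of FastText. `ThirdPartyCentering`, `CenteringWrapper` and `meets_requirements` are all hypothetical names invented for this illustration. The wrapper stores its constructor arguments as attributes of the same name, which is what lets `BaseEstimator` provide `get_params` and `set_params`.

```python
from sklearn.base import BaseEstimator


class ThirdPartyCentering:
    """Hypothetical non-sklearn model with its own method names."""

    def __init__(self, shift):
        self.shift = shift
        self.means = None

    def learn(self, rows):
        # Column-wise means of the training rows.
        self.means = [sum(col) / len(rows) for col in zip(*rows)]

    def apply(self, rows):
        # Subtract the learned means, then add the constant shift.
        return [[x - m + self.shift for x, m in zip(row, self.means)]
                for row in rows]


class CenteringWrapper(BaseEstimator):
    """Expose ThirdPartyCentering through fit/transform."""

    def __init__(self, shift=0.0):
        self.shift = shift  # stored under the same name for get_params

    def fit(self, X, y=None):
        self.model_ = ThirdPartyCentering(self.shift)
        self.model_.learn(X)
        return self

    def transform(self, X):
        return self.model_.apply(X)


def meets_requirements(estimator):
    """Check the three conditions listed above."""
    has_fit = callable(getattr(estimator, 'fit', None))
    has_output = any(callable(getattr(estimator, name, None))
                     for name in ('predict', 'transform'))
    return isinstance(estimator, BaseEstimator) and has_fit and has_output


rows = [[1.0, 10.0], [3.0, 30.0]]
wrapped = CenteringWrapper(shift=1.0).fit(rows)
centered = wrapped.transform(rows)
print(centered)
print(wrapped.get_params())
```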
