In [None]:
from src.data.datasets import load_dataset, available_datasets

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

In [None]:
available_datasets()

## Get data

In [None]:
ds_test = load_dataset('lvq-pak', kind='test')

In [None]:
ds_test.data.shape

In [None]:
ds = load_dataset('lvq-pak', kind='train')

In [None]:
ds.data.shape

In [None]:
ds.target

In [None]:
print(ds.DESCR)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(ds.data, ds.target, test_size=0.33, random_state=42)

## TODO: Save the train and test sets as part of the dataset process if you plan to do supervised learning (probably worth showing how to do this)
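
One possible way to persist a split, so that later steps reuse exactly the same rows, is sketched below. It is only an illustration: a small synthetic array stands in for `ds.data` and `ds.target`, and the file name `lvq-pak_split.npz` and the temporary output directory are hypothetical, not part of the project template.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for ds.data / ds.target (assumption: numeric arrays).
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 3, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Hypothetical output location; a real project would use its data directory.
out_dir = Path(tempfile.mkdtemp())
split_file = out_dir / "lvq-pak_split.npz"
np.savez(split_file, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)

# Reload and confirm that the saved split round-trips unchanged.
with np.load(split_file) as saved:
    X_train_saved = saved["X_train"]
    y_test_saved = saved["y_test"]

print(X_train_saved.shape, y_test_saved.shape)
```

Saving all four arrays in one file keeps the train and test portions from drifting apart if the split parameters change later.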

## Train an algorithm

`make train`

In [None]:
from sklearn.svm import LinearSVC

In [None]:
model = LinearSVC(random_state=42)

In [None]:
model.fit(X_train, y_train)

In [None]:
model = LinearSVC(random_state=42, max_iter=200000)
model.fit(X_train, y_train)

## Use it to predict
`make predict`

In [None]:
p_test = model.predict(X_test)
p_test

## Test the quality of the prediction
`make analysis`

In [None]:
model.score(X_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(y_test, p_test)

## Next: 
* automate the basic workflow
* compare a bunch of algorithms for our Swedish Chef paper

# Step 1: `make train`

## Add our algorithm to available_algorithms

In [None]:
from src.models import available_algorithms

There are currently no available algorithms.

In [None]:
list(available_algorithms().keys())

To add an algorithm, add a key:value pair to the `_ALGORITHMS` dict in `src/models/algorithms.py`.

For example, add
```
'linearSVC': LinearSVC()
```
to the `_ALGORITHMS` dict, and add
```
from sklearn.svm import LinearSVC
```
to the top of the file.

Also, add `linearSVC` to the docstring of `available_algorithms`.
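
Put together, the relevant part of `src/models/algorithms.py` might look roughly like the sketch below. Only `_ALGORITHMS`, `available_algorithms`, and the `linearSVC` entry come from the text above; the docstring wording and the rest of the layout are assumptions, and the actual file may differ.

```python
from sklearn.svm import LinearSVC

# Assumed layout: a module-level dict mapping algorithm names to estimator instances.
_ALGORITHMS = {
    'linearSVC': LinearSVC(),
}


def available_algorithms():
    """Valid algorithms for training or prediction.

    Available algorithms (docstring wording is illustrative):

    linearSVC
        sklearn.svm.LinearSVC
    """
    return _ALGORITHMS


print(list(available_algorithms().keys()))
```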

In [None]:
list(available_algorithms().keys())

In [None]:
print(available_algorithms.__doc__)

Now we're in a position where the `make train` script can run using `linearSVC`. 

You'll notice that `make train` takes `models/model_list.json` as input, so let's make one. The relevant Makefile target is:
```
## train / fit / build models
train: models/model_list.json
	$(PYTHON_INTERPRETER) -m src.models.train_model model_list.json
```

A `model_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset`: A valid dataset name from `available_datasets`
* `dataset_params`: A dictionary of parameters that can be passed to `load_dataset()` with the specified `dataset`
* `algorithm`: A valid algorithm name from `available_algorithms`
* `algorithm_params`: A dictionary of parameters to use when running the specified `algorithm`
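
Because a missing key would only surface once `make train` runs, it can help to check the structure first. The helper below is a hypothetical sketch (`check_model_list` and `REQUIRED_KEYS` are not part of `src`) that checks only the four keys listed above.

```python
REQUIRED_KEYS = {'dataset', 'dataset_params', 'algorithm', 'algorithm_params'}


def check_model_list(model_list):
    """Return a list of problems found in a model list (empty if none).

    Hypothetical helper for illustration; not part of `src`.
    """
    problems = []
    for i, entry in enumerate(model_list):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        for key in ('dataset_params', 'algorithm_params'):
            if key in entry and not isinstance(entry[key], dict):
                problems.append(f"entry {i}: {key} must be a dict")
    return problems


# An entry that forgot `algorithm_params`.
incomplete = [
    {
        'dataset': 'lvq-pak',
        'dataset_params': {'kind': 'train'},
        'algorithm': 'linearSVC',
    }
]

print(check_model_list(incomplete))
```

This sketch does not verify that the names exist in `available_datasets` or `available_algorithms`; that check would need the project's own functions.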

In [None]:
model_list = [
    {
        'dataset': 'lvq-pak',
        'dataset_params': {'kind': 'train'},
        'algorithm': 'linearSVC',
        'algorithm_params': {'random_state': 42, 'max_iter': 200000},
    }
]

In [None]:
from src.paths import model_path
from src.utils import save_json

In [None]:
save_json(model_path / 'model_list.json', model_list)

In [None]:
!cat ../models/model_list.json

Now running `make train` will train `LinearSVC` on `lvq-pak` with the specified parameters.

The output will be:
* A trained model in `models/trained_models`
* A JSON file `models/trained_models.json` that keeps track of the models that we've trained

In [None]:
!cd .. && make train

## TODO: Caching! Then, checking against existing files and metadata and looking for caching!

## TODO: Don't overwrite the trained_models.json, append to it (as long as the files are still there)

### Let's take a look at the output from `make train`

In [None]:
from src.paths import trained_model_path
from src.data.utils import list_dir
from src.utils import load_json

In [None]:
list_dir(model_path)

In [None]:
load_json(model_path / 'trained_models.json')

In [None]:
list(load_json(model_path / 'trained_models.json').keys())

## TODO: have an "available_trained_models()" as function to access the results of this .json

In [None]:
list_dir(trained_model_path)

In [None]:
# load up the trained model
from src.models.train import load_model

s_model, s_model_metadata = load_model(model_name='linearSVC_lvq-pak_0', model_path=trained_model_path)

In [None]:
s_model

In [None]:
s_model_metadata

## TODO: explore the effects of caching once it's implemented

Any algorithm will work, as long as it:
* is a subclass of the sklearn `BaseEstimator` class (needed for setting and getting params)
* has a `fit` method (needed for `make train`)
* has either a `predict` method (supervised) or a `transform` method (unsupervised) (needed for `make predict`)
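
The three requirements above can be expressed as a quick check. The function `is_usable_algorithm` is a hypothetical helper written for this illustration, not something the training scripts are known to call.

```python
from sklearn.base import BaseEstimator
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC


def is_usable_algorithm(estimator):
    """Check the three requirements listed above (hypothetical helper)."""
    return (
        isinstance(estimator, BaseEstimator)
        and hasattr(estimator, 'fit')
        and (hasattr(estimator, 'predict') or hasattr(estimator, 'transform'))
    )


print(is_usable_algorithm(LinearSVC()))  # supervised: has predict
print(is_usable_algorithm(PCA()))        # unsupervised: has transform
print(is_usable_algorithm(object()))     # not a BaseEstimator
```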

We will see how things work in the unsupervised case in the next example. 

Note that an **algorithm** here can be a combination of "algorithms" as long as that combination is a `BaseEstimator` with the above methods. For example, you can use an sklearn `Pipeline`, or an sklearn meta-estimator like `GridSearchCV`, as an algorithm.
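
As a small illustration of such a combination, the sketch below wraps a scaling step and a `LinearSVC` in a pipeline and then tunes it with `GridSearchCV`. The synthetic data, the scaling step, and the grid of `C` values are assumptions chosen for the example, not settings used elsewhere in this tutorial.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=60, n_features=5, random_state=42)

# make_pipeline names each step after its lowercased class name.
pipeline = make_pipeline(StandardScaler(), LinearSVC(random_state=42, max_iter=200000))
search = GridSearchCV(pipeline, param_grid={'linearsvc__C': [0.1, 1.0]}, cv=3)

# The combined object still exposes fit and predict.
search.fit(X, y)
predictions = search.predict(X)

print(isinstance(pipeline, BaseEstimator), isinstance(search, BaseEstimator))
print(predictions.shape)
```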

If your algorithm of choice is **not yet** a `BaseEstimator` with the appropriate API, it is fairly easy to wrap it to be used in this way. While we won't have time to cover an example of this during the in-person part of this tutorial, the EDA Text Embedding (advanced usage tutorial) has an example of how to do this with gensim's FastText model.
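
To give a rough idea of what such a wrapper involves, here is a toy sketch. It is not the FastText example from the other tutorial: `MajorityLabeler` is a made-up stand-in for a third-party model with its own API, and `MajorityClassifier` is a hypothetical wrapper around it.

```python
import numpy as np
from sklearn.base import BaseEstimator


class MajorityLabeler:
    """Stand-in for a third-party model with a non-sklearn API (hypothetical)."""

    def learn(self, labels):
        values, counts = np.unique(labels, return_counts=True)
        self.label = values[np.argmax(counts)]

    def guess(self, n):
        return np.full(n, self.label)


class MajorityClassifier(BaseEstimator):
    """Wrapper exposing fit/predict; BaseEstimator supplies get_params/set_params."""

    def __init__(self, verbose=False):
        # Constructor arguments are stored unchanged, as BaseEstimator expects.
        self.verbose = verbose

    def fit(self, X, y):
        self.model_ = MajorityLabeler()
        self.model_.learn(np.asarray(y))
        return self

    def predict(self, X):
        return self.model_.guess(len(X))


clf = MajorityClassifier().fit([[0], [1], [2]], ['a', 'b', 'b'])
print(clf.predict([[5], [6]]))
print(clf.get_params())
```

The wrapper keeps all learned state on attributes set in `fit`, which leaves the constructor parameters untouched for `get_params` and `set_params`.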



# Step 2: `make predict`

```
## predict / transform / run experiments
predict: models/predict_list.json
	$(PYTHON_INTERPRETER) -m src.models.predict_model predict_list.json
```

As with `model_list.json`, in `predict_list.json` we specify the dataset to operate on and, in this case, the trained model to apply to that dataset.


A `predict_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset`: A valid dataset name from `available_datasets`
* `dataset_params`: A dictionary of parameters that can be passed to `load_dataset()` with the specified `dataset`
* `trained_model`: A valid trained model name from `available_trained_models` (i.e., a key in `trained_models.json`)
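
A minimal sketch of such a list, reusing the `lvq-pak` test set and the `linearSVC_lvq-pak_0` model name from Step 1, might look like this. It uses the standard `json` module and a temporary directory so that it stands alone; that choice is an assumption made for the example, and in the project the file would be `models/predict_list.json`.

```python
import json
import tempfile
from pathlib import Path

predict_list = [
    {
        'dataset': 'lvq-pak',
        'dataset_params': {'kind': 'test'},
        'trained_model': 'linearSVC_lvq-pak_0',
    }
]

# Temporary location for illustration only.
out_file = Path(tempfile.mkdtemp()) / 'predict_list.json'
with open(out_file, 'w') as fw:
    json.dump(predict_list, fw, indent=2)

# Read the file back to confirm that it holds what was written.
with open(out_file) as fr:
    loaded = json.load(fr)

print(loaded)
```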
