In [None]:
from src.data import Dataset
from src import workflow

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

In [None]:
help(workflow)

In [None]:
dataset_list = workflow.available_datasets()
dataset_list

## Overall task: 

Train a supervised model on the lvq-pak Finnish phoneme dataset. Try three different algorithms, each with three different random states, and pick the one with the best accuracy score.

## Get data
Recall we created training and test versions of the datasets.

In [None]:
ds_test = Dataset.load('lvq-pak_test')

In [None]:
ds_test.data.shape

In [None]:
ds_train = Dataset.load('lvq-pak_train')

In [None]:
ds_train.data.shape

In [None]:
ds_train.target

In [None]:
print(ds_train.DESCR)

## Train an algorithm

Let's start with one algorithm!

`make train`

In [None]:
from sklearn.svm import LinearSVC

In [None]:
model = LinearSVC(random_state=42)

In [None]:
model.fit(ds_train.data, ds_train.target)

In [None]:
%%time
model = LinearSVC(random_state=42, max_iter=200000)
model.fit(ds_train.data, ds_train.target)

## Use it to predict
`make predict`

In [None]:
our_prediction = model.predict(ds_test.data)
our_prediction

## Test the quality of the prediction
`make analysis`

In [None]:
model.score(ds_test.data, ds_test.target)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy_score(ds_test.target, our_prediction)

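For an sklearn classifier, `score` returns the mean accuracy on the given data, so the two cells above should report the same number. As a minimal pure-Python sketch of what that number means (the helper name `accuracy` and the toy labels are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels for illustration only, not taken from the lvq-pak data
y_true = ["a", "o", "u", "a"]
y_pred = ["a", "o", "a", "a"]
print(accuracy(y_true, y_pred))  # 3 of 4 correct -> 0.75
```
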
## Next: 
* automate the basic workflow
* compare 3 different algorithms run with 3 different random states for our Swedish Chef

# Step 1: `make train`

## Add our algorithm to available_algorithms

In [None]:
help(workflow.available_algorithms)

There are currently no available algorithms.

In [None]:
workflow.available_algorithms()

To add an algorithm, add a key:value pair to the dict `_ALGORITHMS` in `src/models/algorithms.py`.

For example, add
```
'linearSVC': LinearSVC()
```
to the `_ALGORITHMS` dict, and add
```
from sklearn.svm import LinearSVC
```
to the top of the file.

Also, add `linearSVC` to the docstring of `available_algorithms`.

In [None]:
workflow.available_algorithms()

Now we can add instructions for generating the model to our reproducible data science workflow:

In [None]:
workflow.add_model(dataset_name='lvq-pak_train',
                   algorithm_name="linearSVC",
                   algorithm_params={'random_state': 42, 'max_iter': 200000})

In [None]:
workflow.get_model_list()

Now running `make train` or `workflow.build_models()` will train `LinearSVC` on `lvq-pak_train` with the specified parameters.

The output will be:
* A trained model in `models/trained_models`
* A json file `models/trained_models.json` that keeps track of the models that we've trained

In [None]:
workflow.build_models()

Or alternatively, from the Makefile:

In [None]:
!cd .. && make train

In [None]:
workflow.available_models()

### ASIDE: Under the Hood

If you take a peek into the `Makefile`, you'll notice that `make train` takes a `models/model_list.json` as input.
```
## train / fit / build models
train: models/model_list.json
	$(PYTHON_INTERPRETER) -m src.models.train_model model_list.json
```

Under the hood, a `model_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset_name`: A valid dataset name from `available_datasets()`
* `algorithm_name`: A valid algorithm name from `available_algorithms()`
* `algorithm_params`: A dictionary of parameters to use when running the specified algorithm
* `run_number`: (optional, default 1) A unique integer used to distinguish between different builds with otherwise identical parameters

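To make the format concrete, here is a sketch of what a single entry could look like when written out as JSON. The keys come from the list above and the values mirror the `add_model` call made earlier in this notebook; the exact on-disk layout is an assumption (for example, `run_number` may be omitted when it takes its default).

```python
import json

# One dict per model to build. Keys follow the list above; values mirror
# the add_model call used earlier in this notebook.
model_list = [
    {
        "dataset_name": "lvq-pak_train",
        "algorithm_name": "linearSVC",
        "algorithm_params": {"random_state": 42, "max_iter": 200000},
        "run_number": 1,  # optional; shown here with its default value
    }
]

serialized = json.dumps(model_list, indent=2)
print(serialized)

# Round-trip to confirm the structure is plain JSON
restored = json.loads(serialized)
```
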


In [None]:
!cat ../models/model_list.json

You don't necessarily need to know any of this, but sometimes it's nice to know what's going on under the hood.

## TODO: Caching! Then, checking against existing files and metadata and looking for caching! (note: will need a force parameter eventually)

## TODO: Don't overwrite the trained_models.json, append to it (as long as the files are still there) --- add call to available_models in build_models and give it a force option.

### Let's take a look at the output from `make train`

In [None]:
from src.paths import trained_model_path
from src.utils import list_dir
from src.utils import load_json

In [None]:
workflow.available_models()

In [None]:
# load up the trained model
from src.models.train import load_model

tm, tm_metadata = load_model(model_name='linearSVC_lvq-pak_train_1', model_path=trained_model_path)

In [None]:
tm

In [None]:
tm_metadata

Just to check, we can verify that the stored dataset called `lvq-pak_train` was the same one used to train this model: (**data provenance** in action!)

In [None]:
ds = Dataset.load('lvq-pak_train')
ds.DATA_HASH

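The idea behind this check is simple: hash the data when a model is trained, store the hash with the model, and compare later. The sketch below illustrates this with `hashlib` on raw bytes. The helper `hash_bytes`, the choice of SHA-1, and the `data_hash` metadata key are assumptions for illustration, not necessarily how `Dataset` computes `DATA_HASH`.

```python
import hashlib


def hash_bytes(payload: bytes) -> str:
    """Return a hex digest identifying the payload (hypothetical helper)."""
    return hashlib.sha1(payload).hexdigest()


training_data = b"toy training data"

# Stored alongside the model at training time (hypothetical metadata key)
model_metadata = {"data_hash": hash_bytes(training_data)}

# Later: identical bytes reproduce the hash, any change does not
same = hash_bytes(b"toy training data") == model_metadata["data_hash"]
changed = hash_bytes(b"toy training data!") == model_metadata["data_hash"]
print(same, changed)  # True False
```
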
## TODO: explore the effects of caching once it's implemented

## What exactly is a "model" in this process?
To implement the notion of a model, we borrow a basic data type from scikit-learn: the **Estimator**. To use an algorithm as a model, we must build it into a class that:
* is a subclass of the sklearn `BaseEstimator` class (needed for setting and getting params)
* has a `fit` method (needed for `make train`)
* has either a `predict` method (if it's a **supervised learning** problem) or a `transform` method (**unsupervised learning** problem) (needed for `make predict`)

We will see how things work in the unsupervised case in the next workbook. 

One of the advantages of using the sklearn **Estimator** API is that a model can consist of any combination of "algorithms" as long as that combination is a `BaseEstimator` implementing the above methods. For example, you can use an sklearn `Pipeline`, or an sklearn meta-estimator like `GridSearchCV` to implement a model.

If your algorithm of choice is **not yet** a `BaseEstimator` with the appropriate API, it is fairly easy to wrap it to be used in this way. While we won't have time to cover an example of this during the in-person part of this tutorial, the Text Embedding (advanced usage tutorial notebook) has an example of implementing gensim's FastText algorithm as an Estimator.

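As a rough illustration of the three requirements above, here is a minimal toy estimator. The class `MajorityLabel` and its `default_label` parameter are invented for this sketch and are not part of `src` or sklearn; it is deliberately far simpler than the FastText example mentioned above.

```python
from collections import Counter

from sklearn.base import BaseEstimator


class MajorityLabel(BaseEstimator):
    """Toy supervised model: always predicts the most common training label."""

    def __init__(self, default_label=None):
        # BaseEstimator derives get_params/set_params from the __init__ arguments
        self.default_label = default_label

    def fit(self, X, y):
        counts = Counter(y)
        # Fall back to default_label if no training labels were given
        self.label_ = counts.most_common(1)[0][0] if counts else self.default_label
        return self

    def predict(self, X):
        # One prediction per input row
        return [self.label_ for _ in X]


toy_model = MajorityLabel(default_label="a")
toy_model.fit([[0.1], [0.4], [0.9]], ["a", "o", "a"])
print(toy_model.get_params())
print(toy_model.predict([[0.5], [0.7]]))
```

Because the class has `fit` and `predict` and inherits parameter handling from `BaseEstimator`, it satisfies the supervised-learning requirements listed above.
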


# Step 2: `make predict`

```
## predict / transform / run experiments
predict: models/predict_list.json
	$(PYTHON_INTERPRETER) -m src.models.predict_model predict_list.json
```

Similar to `model_list.json`, in `predict_list.json` we specify the dataset to operate on and, in this case, the trained model to apply to that dataset. Again, we do this using the `workflow` module.


A `predict_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset_name`: A valid dataset name from `available_datasets`
* `dataset_params`: A dictionary of parameters that can be passed to `load_dataset()` with the specified `dataset`
* `model_name`: A valid model name from `available_trained_models` (i.e., a key name in `trained_models.json`)
* `is_supervised`: Whether to use the `predict` (supervised) or `transform` (unsupervised) method

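The role of `is_supervised` can be sketched as a simple dispatch between the two methods. In the sketch below, `ToyModel` and `apply_model` are hypothetical stand-ins, not the actual code in `src.models.predict_model`; the entry's keys follow the list above and its values mirror this notebook's `add_prediction` call.

```python
class ToyModel:
    """Stand-in for a trained model that exposes both methods."""

    def predict(self, X):
        return ["label" for _ in X]

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


def apply_model(model, data, is_supervised):
    """Call predict for supervised models, transform otherwise."""
    method = model.predict if is_supervised else model.transform
    return method(data)


# Keys follow the list above; dataset_params is left empty here
entry = {
    "dataset_name": "lvq-pak_test",
    "dataset_params": {},
    "model_name": "linearSVC_lvq-pak_train_1",
    "is_supervised": True,
}

data = [[1, 2], [3, 4]]
print(apply_model(ToyModel(), data, entry["is_supervised"]))  # ['label', 'label']
print(apply_model(ToyModel(), data, False))  # [[2, 4], [6, 8]]
```
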

Let's use the test set here to do the prediction.

In [None]:
workflow.add_prediction(dataset_name='lvq-pak_test', model_name='linearSVC_lvq-pak_train_1', is_supervised=True)

In [None]:
workflow.get_prediction_list()

In [None]:
workflow.run_predictions(predict_file='predict_list.json')

In [None]:
!cd .. && LOGLEVEL=INFO make predict

In [None]:
workflow.available_predictions()

We didn't specify an output dataset name, so it just inferred one that makes sense (though it is a bit of a mouthful). Let's fix that.

In [None]:
workflow.get_prediction_list()

In [None]:
prediction = workflow.pop_prediction()
prediction['output_dataset'] = 'lvq-test-svc'
workflow.add_prediction(**prediction)
workflow.get_prediction_list()

In [None]:
workflow.run_predictions()

In [None]:
workflow.available_predictions()

#### Note: Predictions are just Datasets tagged with experiment metadata

In [None]:
from src.paths import model_output_path

In [None]:
predict_ds = Dataset.load('lvq-test-svc', data_path=model_output_path)

In [None]:
predict_ds.data.shape

In [None]:
predict_ds.metadata['experiment']

In [None]:
## Check that our prediction matches what we got before we turned this into a reproducible workflow:
all(predict_ds.data == our_prediction)

# Step 3: Analysis

## TODO: Add all of this to the standard workflow

In [None]:
summarizer_list = [{
    'summarizer_name': 'supervised_score_df',
    'summarizer_params': {}
}
]

In [None]:
from src.paths import reports_path
from src.utils import save_json

In [None]:
save_json(reports_path / 'summary_list.json', summarizer_list)

In [None]:
!cd .. && make summary

## TODO: Outputs available via available_summaries

## TODO: Add caching of the summary dfs to know if you're about to overwrite one

# Step 4: Add other algorithms


### Exercise: Add GradientBoostingClassifier and some other sklearn Classifier of your choice

### Advanced Exercise: Use GridSearchCV applied to your classifier of choice as the 3rd alg

In [None]:
workflow.add_model(
    dataset_name = 'lvq-pak_train',
    algorithm_name = 'GradientBoostingClassifier',
    algorithm_params = {'random_state': 42}    
)

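For the full comparison (three algorithms, three random states each), the nine model specifications can be generated rather than typed by hand. This sketch only builds the keyword arguments; `yourClassifier` is a placeholder for whichever third algorithm you register, the seed values are arbitrary, and it assumes every algorithm accepts a `random_state` parameter and has been added to `_ALGORITHMS`.

```python
from itertools import product

# The third name is a placeholder for the classifier of your choice
algorithm_names = ["linearSVC", "GradientBoostingClassifier", "yourClassifier"]
random_states = [42, 43, 44]  # arbitrary seeds

model_specs = [
    {
        "dataset_name": "lvq-pak_train",
        "algorithm_name": name,
        "algorithm_params": {"random_state": seed},
    }
    for name, seed in product(algorithm_names, random_states)
]

print(len(model_specs))  # 3 algorithms x 3 seeds -> 9
print(model_specs[0])
```

Each dict could then be passed along as `workflow.add_model(**spec)`, in the same way as the call above.
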
In [None]:
### Add your choice of classifier here

In [None]:
### Take a look to see what's there
workflow.get_model_list()

In [None]:
workflow.available_algorithms(keys_only=False)

In [None]:
!cd .. && make train

In [None]:
workflow.available_models()

In [None]:
workflow.get_prediction_list()

In [None]:
## Set up predictions using all of the available models
for tm in workflow.available_models():
    workflow.add_prediction(
        dataset_name = 'lvq-pak_test',
        model_name = tm,
        is_supervised = True,
    )

In [None]:
workflow.get_prediction_list()

In [None]:
!cd .. && LOGLEVEL=DEBUG make predict

In [None]:
workflow.available_predictions()

By default, the summary df is computed over all available predictions, so we don't need to add anything to our existing script to get the new scores.

In [None]:
!cd .. && make summary

## TODO: add the next part to a `load_summary` call

In [None]:
import pandas as pd

In [None]:
from src.paths import summary_path

In [None]:
list_dir(summary_path)

In [None]:
pd.read_csv(summary_path / 'supervised_score_df', index_col=0)

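With three algorithms and three random states, one way to pick a winner is to average the accuracy per algorithm and take the maximum. The names and scores below are invented for illustration and are not results from this workflow; the actual column layout of `supervised_score_df` may differ.

```python
from collections import defaultdict
from statistics import mean

# (algorithm_name, random_state, accuracy): made-up numbers for illustration
scores = [
    ("algo_a", 42, 0.90), ("algo_a", 43, 0.92), ("algo_a", 44, 0.91),
    ("algo_b", 42, 0.95), ("algo_b", 43, 0.93), ("algo_b", 44, 0.94),
    ("algo_c", 42, 0.80), ("algo_c", 43, 0.85), ("algo_c", 44, 0.75),
]

# Group the accuracies by algorithm, ignoring the seed
by_algorithm = defaultdict(list)
for name, _seed, acc in scores:
    by_algorithm[name].append(acc)

mean_accuracy = {name: mean(values) for name, values in by_algorithm.items()}
best = max(mean_accuracy, key=mean_accuracy.get)
print(best)  # algo_b
```
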
## TODO: Figure out where and how to include a "lesson" on random_state
