In [1]:
from src.data import Dataset, available_datasets
from src import workflow

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger()

In [4]:
dataset_list = workflow.available_datasets()
dataset_list

['lvq-pak_test', 'lvq-pak_train']

## Overall task: 

Train a supervised model on the lvq-pak dataset. Try three different techniques, three times each, and pick the one with the best accuracy score.

## Get data

In [5]:
ds_test = Dataset.load('lvq-pak_test')

In [6]:
ds_test.data.shape

(981, 20)

In [7]:
ds_train = Dataset.load('lvq-pak_train')

In [8]:
ds_train.data.shape

(2943, 20)

In [9]:
ds_train.target

array([12,  9,  0, ...,  2,  6,  7])

In [10]:
print(ds_train.DESCR)

************************************************************************
*                                                                      *
*                              LVQ_PAK                                 *
*                                                                      *
*                                The                                   *
*                                                                      *
*                   Learning  Vector  Quantization                     *
*                                                                      *
*                          Program  Package                            *
*                                                                      *
*                   Version 3.1 (April 7, 1995)                        *
*                                                                      *
*                          Prepared by the                             *
*                    LVQ Programming Team of the   

## Train an algorithm

Let's start with one algorithm!

`make train`

In [11]:
from sklearn.svm import LinearSVC

In [12]:
model = LinearSVC(random_state=42)

In [13]:
model.fit(ds_train.data, ds_train.target)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0)

In [14]:
%%time
model = LinearSVC(random_state=42, max_iter=200000)
model.fit(ds_train.data, ds_train.target)

CPU times: user 33.1 s, sys: 182 ms, total: 33.2 s
Wall time: 33.8 s


## Use it to predict
`make predict`

In [15]:
p_test = model.predict(ds_test.data)
p_test

array([ 7, 10, 14,  7,  1,  9,  4,  0,  0, 15, 18, 14, 11, 15,  2,  0,  9,
        2, 15, 18, 14,  0,  2, 10,  0,  9,  0, 14,  2,  7,  0,  4, 15,  0,
        2, 18,  8,  0,  0, 15,  7, 12,  2, 14,  4,  4,  0,  4,  0,  0, 11,
        0, 18, 11,  0, 14,  2,  0,  7,  0, 14, 14,  9,  7,  2,  0, 18,  2,
        0, 14,  0,  2,  2, 18, 18,  0,  2, 11, 18,  7,  2,  0, 12,  7, 19,
        2, 16,  2,  2, 18,  9, 18, 15,  7,  0,  2,  7,  4, 14, 11, 10,  7,
        2,  4,  0, 14,  4,  9,  0,  7,  5,  9,  7,  7, 19,  0,  0,  9,  0,
       11,  0,  0, 18,  9, 11, 11,  2,  2,  0,  2, 11,  7, 12,  7, 19,  2,
       18,  2,  0,  7, 11,  4,  0,  0,  7, 14,  0,  0,  2, 15, 15, 18,  4,
       14,  0,  0,  0,  0,  0,  0,  7,  0,  0, 14, 14,  0,  2,  2,  7,  0,
        2,  7,  2,  0,  0,  7,  2, 11, 14, 11,  2, 18, 11, 14, 11,  2,  2,
       10,  0, 11,  0,  2, 10,  2,  0,  4, 15,  0, 15, 11, 18,  9,  4, 15,
        3, 10, 11, 12, 13,  0,  0, 18, 12,  2,  7,  0, 12, 18,  0,  0, 18,
        0,  7,  0,  0,  7

## Test the quality of the prediction
`make analysis`

In [16]:
model.score(ds_test.data, ds_test.target)

0.8756371049949032

In [17]:
from sklearn.metrics import accuracy_score

In [18]:
accuracy_score(ds_test.target, p_test)

0.8756371049949032
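
The two numbers agree because, for scikit-learn classifiers, `score` reports mean accuracy: the fraction of test labels that were predicted correctly. A minimal plain-Python sketch of that computation follows; the helper name `simple_accuracy` and the toy labels are illustrative and not part of this project.

```python
def simple_accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (made up for illustration): 3 of the 4 predictions are correct.
toy_true = [7, 10, 14, 7]
toy_pred = [7, 10, 14, 1]
toy_score = simple_accuracy(toy_true, toy_pred)
print(toy_score)
```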

## Next: 
* automate the basic workflow
* compare 3 different algorithms run with 3 different random states for our Swedish Chef
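
One way to think about the three-by-three comparison is as a grid of nine model specifications. The sketch below builds that grid as plain dicts using the same keys that `model_list.json` entries carry later in this notebook. The choice of algorithms, their base parameters, and the use of a distinct `run_number` per seed are assumptions made for illustration; the seeds are the ones that appear in the grid search example further down.

```python
from itertools import product

# Algorithm names as registered in this notebook; the base parameters
# are illustrative choices, not project defaults.
algorithms = {
    "linearSVC": {"max_iter": 200000},
    "GradientBoostingClassifier": {},
    "RandomForestClassifier": {"n_estimators": 200},
}
random_states = [42, 62345, 3457]

model_specs = []
for (name, base_params), (run_number, seed) in product(
    algorithms.items(), enumerate(random_states, start=1)
):
    model_specs.append(
        {
            "dataset_name": "lvq-pak_train",
            "algorithm_name": name,
            "algorithm_params": {**base_params, "random_state": seed},
            # Assumption: one run_number per seed keeps each build
            # separately identifiable.
            "run_number": run_number,
        }
    )

print(len(model_specs))
```

Each dict could then be passed to `workflow.add_model(**spec)`, provided that helper accepts a `run_number` argument, which this notebook does not show.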

# Step 1: `make train`

## Add our algorithm to available_algorithms

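In a fresh checkout there are no available algorithms. (The output below was captured after the algorithms used in this tutorial had already been added.)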

In [19]:
workflow.available_algorithms()

['linearSVC',
 'GradientBoostingClassifier',
 'GridSearchCV',
 'RandomForestClassifier']

To add an algorithm, add a key:value pair to the `_ALGORITHMS` dict in `src/models/algorithms.py`.

For example, add
```
'linearSVC': LinearSVC()
```
to the `_ALGORITHMS` dict, and add
```
from sklearn.svm import LinearSVC
```
to the top of the file.

Also, add `linearSVC` to the docstring of `available_algorithms`.

In [20]:
workflow.available_algorithms()

['linearSVC',
 'GradientBoostingClassifier',
 'GridSearchCV',
 'RandomForestClassifier']

In [21]:
help(workflow.available_algorithms)

Help on function available_algorithms in module src.models.algorithms:

available_algorithms(keys_only=True)
    Valid Algorithms for training or prediction
    
    This function simply returns a dict of known
    algorithms strings and their corresponding estimator function.
    
    It exists to allow for a description of the mapping for
    each of the valid strings as a docstring
    
    The valid algorithm names, and the function they map to, are:
    
    
    Algorithm                    Function
    LinearSVC                    sklearn.svm.LinearSVC
    GradientBoostingClassifier   sklearn.ensemble.GradientBoostingClassifier
    
    Parameters
    ----------
    keys_only: boolean
        If True, return only keys. Otherwise, return a dictionary mapping keys to algorithms



Now we're in a position where the `make train` script can run using `linearSVC`. 

In [22]:
!cd .. && make -n train

make: *** No rule to make target `models/model_list.json', needed by `train'.  Stop.


You'll notice that `make train` takes a `models/model_list.json` as input.
```
## train / fit / build models
train: models/model_list.json
	$(PYTHON_INTERPRETER) -m src.models.train_models model_list.json
```

Under the hood, a `model_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset_name`: A valid dataset name from `available_datasets`
* `algorithm_name`: A valid algorithm name from `available_algorithms`
* `algorithm_params`: A dictionary of parameters to use when running the specified `algorithm`
* `run_number`: (optional, default 1) A unique integer used to distinguish between different builds with otherwise identical parameters

We don't need to know this, as we will use helper functions in `workflow` to build it.
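
For readers who do want to see the structure, the following sketch builds one entry and round-trips it through JSON. The helper `add_model_entry` is hypothetical and only mimics what `workflow.add_model` appears to do; in particular, skipping duplicate entries is an assumption.

```python
import json
import tempfile
from pathlib import Path


def add_model_entry(model_list, dataset_name, algorithm_name,
                    algorithm_params=None, run_number=1):
    """Append one model specification unless an identical one is already present."""
    entry = {
        "dataset_name": dataset_name,
        "algorithm_name": algorithm_name,
        "algorithm_params": dict(algorithm_params or {}),
        "run_number": run_number,
    }
    if entry not in model_list:
        model_list.append(entry)
    return model_list


model_list = []
add_model_entry(model_list, "lvq-pak_train", "linearSVC",
                {"random_state": 42, "max_iter": 200000})

# Write to a temporary directory so the sketch never touches the real models/ folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "model_list.json"
    json_path.write_text(json.dumps(model_list, indent=2))
    reloaded = json.loads(json_path.read_text())

print(reloaded)
```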

In [23]:
workflow.add_model(dataset_name='lvq-pak_train',
                   algorithm_name="linearSVC",
                   algorithm_params={'random_state': 42, 'max_iter': 200000})

In [24]:
workflow.get_model_list()

[{'algorithm_name': 'linearSVC',
  'algorithm_params': {'max_iter': 200000, 'random_state': 42},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1}]

Now running `make train` will train `LinearSVC` on `lvq-pak_train` with the specified parameters.

Alternatively, we can run `workflow.build_models()`.

The output will be:
* A trained model in `models/trained_models`
* A JSON file `models/trained_models.json` that keeps track of the models that we've trained

In [25]:
workflow.build_models()

{'linearSVC_lvq-pak_train_1': {'algorithm_name': 'linearSVC',
  'algorithm_params': {'C': 1.0,
   'class_weight': None,
   'dual': True,
   'fit_intercept': True,
   'intercept_scaling': 1,
   'loss': 'squared_hinge',
   'max_iter': 200000,
   'multi_class': 'ovr',
   'penalty': 'l2',
   'random_state': 42,
   'tol': 0.0001,
   'verbose': 0},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1,
  'data_hash': 'ee5ba3c3b1cae9cac275d47832c404b688344dd1',
  'target_hash': '2918cb40ea4eca1c1bd770a0d2fd179249407202',
  'model_hash': 'b56909bb97424fd9ada4ad7c4306a95cece7b311'}}

In [26]:
!cd .. && make train

python3 -m src.models.train_models model_list.json
  import imp
2018-10-14 10:39:30,732 - train_models - INFO - Training complete! Access results via workflow.available_models


## TODO: Caching! Then, checking against existing files and metadata and looking for caching! (note: will need a force parameter eventually)

## TODO: Don't overwrite the trained_models.json, append to it (as long as the files are still there) --- add call to available_models in build_models and give it a force option.

### Let's take a look at the output from `make train`

In [27]:
from src.paths import trained_model_path
from src.utils import list_dir
from src.utils import load_json

In [28]:
workflow.available_models()

['GradientBoostingClassifier_lvq-pak_train_1',
 'linearSVC_lvq-pak_train_1',
 'GridSearchCV_lvq-pak_train_1']

In [29]:
# load up the trained model
from src.models.train import load_model

tm, tm_metadata = load_model(model_name='linearSVC_lvq-pak_train_1', model_path=trained_model_path)

In [30]:
tm

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=200000,
     multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
     verbose=0)

In [31]:
tm_metadata

{'algorithm_name': 'linearSVC',
 'algorithm_params': {'C': 1.0,
  'class_weight': None,
  'dual': True,
  'fit_intercept': True,
  'intercept_scaling': 1,
  'loss': 'squared_hinge',
  'max_iter': 200000,
  'multi_class': 'ovr',
  'penalty': 'l2',
  'random_state': 42,
  'tol': 0.0001,
  'verbose': 0},
 'data_hash': 'ee5ba3c3b1cae9cac275d47832c404b688344dd1',
 'dataset_name': 'lvq-pak_train',
 'model_hash': 'b56909bb97424fd9ada4ad7c4306a95cece7b311',
 'run_number': 1,
 'target_hash': '2918cb40ea4eca1c1bd770a0d2fd179249407202'}

In [32]:
ds = Dataset.load('lvq-pak_train')
ds.DATA_HASH

'ee5ba3c3b1cae9cac275d47832c404b688344dd1'
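
The matching `data_hash` values are what tie a trained model back to the exact data it was fit on. The sketch below illustrates the idea with SHA-1 over toy values. How the project serializes arrays before hashing is not shown in this notebook, so `sha1_of_values` and `trained_on` are hypothetical helpers and their digests will not match the real ones.

```python
import hashlib
from array import array


def sha1_of_values(values):
    """Hex SHA-1 digest of a sequence of floats packed as 8-byte doubles."""
    return hashlib.sha1(array("d", values).tobytes()).hexdigest()


def trained_on(model_metadata, dataset_hash):
    """True when the model's recorded data_hash matches the dataset's hash."""
    return model_metadata.get("data_hash") == dataset_hash


# Toy data standing in for ds.data; these are not the real lvq-pak values.
toy_data = [0.5, 1.25, -3.0]
toy_hash = sha1_of_values(toy_data)
toy_metadata = {"dataset_name": "toy_train", "data_hash": toy_hash}

same = trained_on(toy_metadata, sha1_of_values(toy_data))
changed = trained_on(toy_metadata, sha1_of_values([0.5, 1.25, -3.1]))
print(same, changed)
```

Changing a single value changes the digest, so a mismatch is a cheap signal that a model and a dataset no longer belong together.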

## TODO: explore the effects of caching once it's implemented

Any algorithm will work, provided that it:
* is a subclass of the sklearn `BaseEstimator` class (needed for setting and getting params)
* has a `fit` method (needed for `make train`)
* has either a `predict` method (supervised) or a `transform` method (unsupervised) (needed for `make predict`)

We will see how things work in the unsupervised case in the next example. 

Note that an **algorithm** here can be a combination of "algorithms" as long as that combination is a `BaseEstimator` with the above methods. For example, you can use an sklearn pipeline, or an sklearn meta estimator like GridSearchCV as an algorithm. 

If your algorithm of choice is **not yet** a `BaseEstimator` with the appropriate API, it is fairly easy to wrap it to be used in this way. While we won't have time to cover an example of this during the in-person part of this tutorial, the EDA Text Embedding (advanced usage tutorial) has an example of how to do this with gensim's FastText model.
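
As a rough illustration of the interface requirements listed above, the sketch below checks an object for the needed methods by duck typing. It deliberately avoids importing scikit-learn, so `get_params` and `set_params` are written by hand here; a real wrapper would normally subclass `sklearn.base.BaseEstimator`, which supplies both. All names in the snippet are hypothetical.

```python
def check_algorithm_interface(estimator):
    """Return a list of missing requirements (empty when the estimator qualifies)."""
    problems = []
    for method in ("get_params", "set_params", "fit"):
        if not callable(getattr(estimator, method, None)):
            problems.append(f"missing {method}()")
    has_predict = callable(getattr(estimator, "predict", None))
    has_transform = callable(getattr(estimator, "transform", None))
    if not (has_predict or has_transform):
        problems.append("needs predict() or transform()")
    return problems


class MajorityClassifier:
    """Toy classifier that always predicts the most frequent training label."""

    def __init__(self, fallback=0):
        self.fallback = fallback

    def get_params(self, deep=True):
        return {"fallback": self.fallback}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count) if labels else self.fallback
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class FitOnly:
    """Counter-example: has fit() but nothing else the workflow needs."""

    def fit(self, X, y=None):
        return self


print(check_algorithm_interface(MajorityClassifier()))
print(check_algorithm_interface(FitOnly()))
```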



# Step 2: `make predict`

```
## predict / transform / run experiments
predict: models/predict_list.json
	$(PYTHON_INTERPRETER) -m src.models.predict_model predict_list.json
```

As with `model_list.json`, in `predict_list.json` we specify the dataset to operate on and, in this case, the `trained_model` to apply to it. Again, we do this using the `workflow` module.


A `predict_list.json` is a list of dicts, where each dict specifies a combination of:
* `dataset_name`: A valid dataset name from `available_datasets`
* `dataset_params`: A dictionary of parameters that can be passed to `load_dataset()` with the specified `dataset`
* `model_name`: A valid model name from `available_trained_models` (i.e., a key in `trained_models.json`)
* `is_supervised`: Whether to use the `predict` (supervised) or `transform` (unsupervised) method
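
The role of `is_supervised` can be sketched as a simple dispatch between the two methods. The function `apply_model` and the two toy models are hypothetical and exist only to show the branching; the project's own prediction code is not reproduced here.

```python
def apply_model(model, data, is_supervised):
    """Use predict() for supervised models and transform() for unsupervised ones."""
    method_name = "predict" if is_supervised else "transform"
    method = getattr(model, method_name, None)
    if not callable(method):
        raise AttributeError(f"model has no {method_name}() method")
    return method(data)


class ThresholdClassifier:
    """Toy supervised model: label 1 when the first feature exceeds 0."""

    def predict(self, X):
        return [1 if row[0] > 0 else 0 for row in X]


class Doubler:
    """Toy unsupervised model: doubles every feature."""

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


# An entry shaped like the ones stored in predict_list.json.
predict_entry = {
    "dataset_name": "lvq-pak_test",
    "model_name": "linearSVC_lvq-pak_train_1",
    "is_supervised": True,
}

toy_rows = [[-1.0, 2.0], [3.0, 0.5]]
labels = apply_model(ThresholdClassifier(), toy_rows, predict_entry["is_supervised"])
doubled = apply_model(Doubler(), toy_rows, is_supervised=False)
print(labels, doubled)
```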


Let's use the test set here to do the prediction.

In [33]:
workflow.add_prediction(dataset_name='lvq-pak_test', model_name='linearSVC_lvq-pak_train_1', is_supervised=True)

In [34]:
workflow.get_prediction_list()

[{'dataset_name': 'lvq-pak_test',
  'is_supervised': True,
  'model_name': 'linearSVC_lvq-pak_train_1'}]

In [35]:
workflow.run_predictions(predict_file='predict_list.json')

2018-10-14 10:39:31,812 - predict - INFO - Experiment has already been run. Returning Cached Result


{'linearSVC_lvq-pak_train_1_exp_lvq-pak_test_1': {'dataset_name': 'linearSVC_lvq-pak_train_1_exp_lvq-pak_test_1',
  'hash_type': 'sha1',
  'data_hash': '37b8e0111dc53ab39717d79fe0a67e13a9ebb265',
  'target_hash': '5dab31bc1f020abc091c927aa9d420880171cb36',
  'experiment': {'model_name': 'linearSVC_lvq-pak_train_1',
   'dataset_name': 'lvq-pak_test',
   'run_number': 1,
   'hash_type': 'sha1',
   'data_hash': '5561f5d951ec546bf9d221a2c7e60173c1f9beba',
   'target_hash': '5dab31bc1f020abc091c927aa9d420880171cb36',
   'model_hash': 'b56909bb97424fd9ada4ad7c4306a95cece7b311',
   'start_time': 1539527535.5877252,
   'duration': 0.00220489501953125}}}

In [36]:
!cd .. && LOGLEVEL=INFO make predict

python3 -m src.models.predict_model predict_list.json
  import imp
2018-10-14 10:39:33,480 - predict - INFO - Experiment has already been run. Returning Cached Result
2018-10-14 10:39:33,485 - predict_model - INFO - Predict complete! Results accessible via workflow.available_predictions


In [37]:
workflow.available_predictions()

['GridSearchCV_lvq-pak_train_1_exp_lvq-pak_test_1',
 'GradientBoostingClassifier_lvq-pak_train_1_exp_lvq-pak_test_1',
 'linearSVC_lvq-pak_train_1_exp_lvq-pak_test_1']

In [38]:
from src.paths import model_output_path

#### Note: Predictions are just Datasets tagged with experiment data

In [39]:
predict_ds = Dataset.load('linearSVC_lvq-pak_train_1_exp_lvq-pak_test_1', data_path=model_output_path)

In [40]:
predict_ds.data.shape

(981,)

In [41]:
predict_ds.metadata['experiment']

{'model_name': 'linearSVC_lvq-pak_train_1',
 'dataset_name': 'lvq-pak_test',
 'run_number': 1,
 'hash_type': 'sha1',
 'data_hash': '5561f5d951ec546bf9d221a2c7e60173c1f9beba',
 'target_hash': '5dab31bc1f020abc091c927aa9d420880171cb36',
 'model_hash': 'b56909bb97424fd9ada4ad7c4306a95cece7b311',
 'start_time': 1539527535.5877252,
 'duration': 0.00220489501953125}
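
The "Returning Cached Result" log message suggests that stored experiment metadata is compared against the current inputs before a prediction is rerun. The exact rule is not shown in this notebook. The sketch below assumes a result can be reused when the data, target, and model hashes all match and no `force` flag is set; `can_reuse` is a hypothetical helper.

```python
HASH_KEYS = ("data_hash", "target_hash", "model_hash")


def can_reuse(stored_experiment, current_hashes, force=False):
    """Return True when a stored prediction matches the current inputs."""
    if force or stored_experiment is None:
        return False
    return all(
        key in stored_experiment
        and stored_experiment[key] == current_hashes.get(key)
        for key in HASH_KEYS
    )


# Hash values copied from the experiment metadata shown above.
stored = {
    "data_hash": "5561f5d951ec546bf9d221a2c7e60173c1f9beba",
    "target_hash": "5dab31bc1f020abc091c927aa9d420880171cb36",
    "model_hash": "b56909bb97424fd9ada4ad7c4306a95cece7b311",
}
current = dict(stored)
# Placeholder hash simulating a model that has been retrained since.
retrained = dict(stored, model_hash="0" * 40)

print(can_reuse(stored, current))
print(can_reuse(stored, retrained))
print(can_reuse(stored, current, force=True))
```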

In [42]:
## Check that our prediction matches
all(predict_ds.data == p_test)

True

# Step 3: Analysis

## TODO: Move this function to a module

In [43]:
from src.data import Dataset
from src.paths import model_path, model_output_path

In [44]:
import pandas as pd

In [45]:
def available_scorers():
    _SCORERS = {
        'accuracy_score': accuracy_score
    }
    return _SCORERS

In [46]:
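import pathlib

def supervised_score_df(predict_json='predictions.json', predict_json_path=None,
                        score_list=None):
    """Build a DataFrame with one row per (scorer, prediction) pair."""
    # Avoid a mutable default argument; fall back to accuracy only.
    if score_list is None:
        score_list = ['accuracy_score']
    if predict_json_path is None:
        predict_json_path = model_path
    else:
        predict_json_path = pathlib.Path(predict_json_path)
    predictions = load_json(predict_json_path / predict_json)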
    score_df = pd.DataFrame(columns=['score_name', 'algorithm_name', 'dataset_name',
                                     'model_name', 'run_number'])
    for current_scorer_name in score_list:
        current_scorer = available_scorers()[current_scorer_name]

        score_dict = {}
        score_dict['score_name'] = current_scorer_name
        for key in predictions.keys():
            prediction = predictions[key]
            exp = prediction['experiment']
            pred_ds = Dataset.load(prediction['dataset_name'], data_path=model_output_path)

            ds_name = exp['dataset_name']
            ds = Dataset.load(ds_name)
            score_dict['dataset_name'] = ds_name

            score_dict['score'] = current_scorer(ds.target, pred_ds.data)

            model_metadata = load_model(model_name=exp['model_name'], metadata_only=True)
            score_dict['algorithm_name'] = model_metadata['algorithm_name']
            score_dict['model_name'] = exp['model_name']
            score_dict['run_number'] = exp['run_number']
            new_score_df = pd.DataFrame(score_dict, index=[0])
            # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
            score_df = pd.concat([score_df, new_score_df], sort=True)
    return score_df

In [47]:
supervised_score_df()

| | algorithm_name | dataset_name | model_name | run_number | score | score_name |
|---|---|---|---|---|---|---|
| 0 | linearSVC | lvq-pak_test | linearSVC_lvq-pak_train_1 | 1 | 0.875637 | accuracy_score |


## TODO: Add caching of the summary dfs to know if you're about to overwrite one

# Step 4: Add other algorithms


### Exercise: Add GradientBoostingClassifier and some other sklearn Classifier of your choice

### Advanced Exercise: Use GridSearchCV applied to your classifier of choice as the third algorithm

In [48]:
workflow.add_model(
    dataset_name = 'lvq-pak_train',
    algorithm_name = 'GradientBoostingClassifier',
    algorithm_params = {'random_state': 42}    
)

In [49]:
### Add your choice of classifier here

In [50]:
### Take a look to see what's there
workflow.get_model_list()

[{'algorithm_name': 'linearSVC',
  'algorithm_params': {'max_iter': 200000, 'random_state': 42},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1},
 {'algorithm_name': 'GradientBoostingClassifier',
  'algorithm_params': {'random_state': 42},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1}]

In [51]:
## Advanced example...see if you can make it work with this as well.

workflow.add_model(
    dataset_name = 'lvq-pak_train',
    algorithm_name = 'GridSearchCV',
    algorithm_params = {'alg_name': 'RandomForestClassifier',
                             'alg_params': {'n_estimators': 200},
                             'gridsearch_params':{'max_features':['sqrt', 'log2', 10],
                                                   'max_depth':[5, 7, 9],
                                                   'random_state':[42, 62345, 3457],
                                                   },
                             'params': {'cv': 3}
                       }  
)

In [52]:
workflow.get_model_list()

[{'algorithm_name': 'linearSVC',
  'algorithm_params': {'max_iter': 200000, 'random_state': 42},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1},
 {'algorithm_name': 'GradientBoostingClassifier',
  'algorithm_params': {'random_state': 42},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1},
 {'algorithm_name': 'GridSearchCV',
  'algorithm_params': {'alg_name': 'RandomForestClassifier',
   'alg_params': {'n_estimators': 200},
   'gridsearch_params': {'max_depth': [5, 7, 9],
    'max_features': ['sqrt', 'log2', 10],
    'random_state': [42, 62345, 3457]},
   'params': {'cv': 3}},
  'dataset_name': 'lvq-pak_train',
  'run_number': 1}]

In [53]:
workflow.available_algorithms(keys_only=False)

{'linearSVC': LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
      intercept_scaling=1, loss='squared_hinge', max_iter=200000,
      multi_class='ovr', penalty='l2', random_state=42, tol=0.0001,
      verbose=0),
 'GradientBoostingClassifier': GradientBoostingClassifier(criterion='friedman_mse', init=None,
               learning_rate=0.1, loss='deviance', max_depth=3,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=100,
               n_iter_no_change=None, presort='auto', random_state=None,
               subsample=1.0, tol=0.0001, validation_fraction=0.1,
               verbose=0, warm_start=False),
 'GridSearchCV': ComboGridSearchCV(alg_name=None, alg_params=None, gridsearch_params=None,
          params=None),
 'RandomForestClassifier': RandomForestClassifier(bootstrap=True, c

In [54]:
!cd .. && make train

python3 -m src.models.train_models model_list.json
  import imp
2018-10-14 10:43:27,803 - train_models - INFO - Training complete! Access results via workflow.available_models


In [72]:
workflow.available_models()

['GradientBoostingClassifier_lvq-pak_train_1',
 'linearSVC_lvq-pak_train_1',
 'GridSearchCV_lvq-pak_train_1']

In [111]:
workflow.get_prediction_list()

[]

In [112]:
## Set up predictions using all of the available models
for tm in workflow.available_models():
    workflow.add_prediction(
        dataset_name = 'lvq-pak_test',
        model_name = tm,
        is_supervised = True,
    )

In [113]:
workflow.get_prediction_list()

[{'dataset_name': 'lvq-pak_test',
  'force': False,
  'is_supervised': True,
  'model_name': 'GradientBoostingClassifier_lvq-pak_train_1'},
 {'dataset_name': 'lvq-pak_test',
  'force': False,
  'is_supervised': True,
  'model_name': 'linearSVC_lvq-pak_train_1'},
 {'dataset_name': 'lvq-pak_test',
  'force': False,
  'is_supervised': True,
  'model_name': 'GridSearchCV_lvq-pak_train_1'}]

In [114]:
!cd .. && LOGLEVEL=DEBUG make predict

python3 -m src.models.predict_model predict_list.json
  import imp
2018-10-14 11:04:28,179 - predict_model - DEBUG - Executing models from predict_list.json
2018-10-14 11:04:30,886 - predict - DEBUG - Predict: Applying GradientBoostingClassifier_lvq-pak_train_1 on lvq-pak_test
2018-10-14 11:04:30,887 - predict - INFO - Experiment has already been run. Returning Cached Result
2018-10-14 11:04:31,029 - predict - DEBUG - Predict: Applying linearSVC_lvq-pak_train_1 on lvq-pak_test
2018-10-14 11:04:31,030 - predict - INFO - Experiment has already been run. Returning Cached Result
2018-10-14 11:04:31,516 - predict - DEBUG - Predict: Applying GridSearchCV_lvq-pak_train_1 on lvq-pak_test
2018-10-14 11:04:31,517 - predict - INFO - Experiment has already been run. Returning Cached Result
2018-10-14 11:04:31,521 - predict_model - INFO - Predict complete! Results accessible via workflow.available_predictions


In [115]:
workflow.available_predictions()

['GridSearchCV_lvq-pak_train_1_exp_lvq-pak_test_1',
 'GradientBoostingClassifier_lvq-pak_train_1_exp_lvq-pak_test_1',
 'linearSVC_lvq-pak_train_1_exp_lvq-pak_test_1']

In [116]:
df = supervised_score_df()
df

| | algorithm_name | dataset_name | model_name | run_number | score | score_name |
|---|---|---|---|---|---|---|
| 0 | GradientBoostingClassifier | lvq-pak_test | GradientBoostingClassifier_lvq-pak_train_1 | 1 | 0.879715 | accuracy_score |
| 0 | GridSearchCV | lvq-pak_test | GridSearchCV_lvq-pak_train_1 | 1 | 0.888889 | accuracy_score |
| 0 | linearSVC | lvq-pak_test | linearSVC_lvq-pak_train_1 | 1 | 0.875637 | accuracy_score |
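
The overall task ends with picking the model with the best accuracy. A minimal sketch of that last step, using the three scores from the summary table above copied into plain dicts (the helper `best_model` is hypothetical):

```python
# Scores copied from the summary table above.
score_rows = [
    {"model_name": "GradientBoostingClassifier_lvq-pak_train_1", "score": 0.879715},
    {"model_name": "GridSearchCV_lvq-pak_train_1", "score": 0.888889},
    {"model_name": "linearSVC_lvq-pak_train_1", "score": 0.875637},
]


def best_model(rows, score_key="score"):
    """Return the row with the highest score; ties go to the earliest row."""
    if not rows:
        raise ValueError("no scores to compare")
    return max(rows, key=lambda row: row[score_key])


best = best_model(score_rows)
print(best["model_name"], best["score"])
```

With the DataFrame itself, `df.sort_values('score', ascending=False).iloc[0]` selects the same row. Note that every row in `df` carries the index label 0, so a label-based lookup such as `df.loc[0]` returns all rows rather than one.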


## TODO: Figure out where and how to include a "lesson" on random_state
