# Thoughts on Hyperparameter Selection/Storage

Analysis by Jeremy P Mann

The main goal of this notebook is to nail down the hyperparameters for the basic ML pipeline I outlined previously. These will be stored as a YAML file alongside this notebook. No training is done at this point; the notebook was mainly me getting the hang of storing and loading YAML files.

The models in the pipeline will be:

- Logistic Regression 
- RBF SVC
- XGBoost 
- Random Forest

I chose hyperparameters in a really ad hoc way, mainly by looking at people's choices in other competitions. This brings up some subtle points that I don't really understand:

- Although there's obviously no "best" initial choice of hyperparameters, one shouldn't be reinventing the wheel when doing *preliminary*, "benchmark" ML experiments. There should be some *convention* for model choices/hyperparameters for every fixed type of learning problem.
    - Although sklearn's "default" values are the obvious choice, they change between releases and so aren't reproducible on their own. The software part of the reproducibility can be fixed with requirements.txt files, but this doesn't address the human part of the equation.
    - The goal of this would not be to get a really good model, but just to set a context, established by convention, for the nature and difficulty of the learning problem.
        - I can't emphasize the *convention* part enough!
    - A simple use case would be to say with confidence that, for example, some 'expensive' ML algorithm gives *statistically significant* improvements over more conventional methods.
    - The analysis should be extremely automated and thorough.
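One way to take version drift out of the picture is to write the defaults down explicitly instead of relying on whatever the installed sklearn happens to use. The following is a minimal sketch, not a settled convention: `to_plain`, `snapshot_defaults`, and the model keys are hypothetical names introduced here, and it only relies on the standard `get_params()` method of sklearn estimators.

```python
import yaml
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


def to_plain(value):
    """Keep YAML-friendly scalars as-is; fall back to repr() for anything else."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return repr(value)


def snapshot_defaults(estimators):
    """Map each (hypothetical) model name to its explicit hyperparameter values."""
    return {
        name: {key: to_plain(val) for key, val in est.get_params().items()}
        for name, est in estimators.items()
    }


# Model keys are illustrative assumptions, not names taken from the YAML file.
estimators = {
    "logistic_regression": LogisticRegression(),
    "rbf_svc": SVC(kernel="rbf"),
    "random_forest": RandomForestClassifier(),
}

defaults = snapshot_defaults(estimators)
defaults_yaml = yaml.safe_dump(defaults, sort_keys=True)
print(defaults_yaml)
```

Storing this snapshot next to the requirements.txt means a later run can pass the recorded values back in explicitly, whatever the library defaults are by then.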

For concreteness, I will restrict myself to classification problems where the features have already been engineered (including the "trivial" engineering where the features are just the raw data).

The analysis should give:
- Breakdowns of the distribution of features within each class.
- Distributions of confusion matrices, and summary statistics of these distributions.
- Precision/sensitivity/accuracy broken down by class.
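The confusion-matrix and per-class items above could be sketched roughly as follows. This is an illustration under assumptions: the data is synthetic (from `make_classification`), the model and its settings are placeholders, and the "distribution" is simply taken over stratified cross-validation folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem standing in for a real dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
labels = np.unique(y)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
matrices, precisions, recalls = [], [], []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(C=1.0, max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    # One confusion matrix per fold: rows are true classes, columns predictions.
    matrices.append(confusion_matrix(y[test_idx], y_pred, labels=labels))
    p, r, _, _ = precision_recall_fscore_support(
        y[test_idx], y_pred, labels=labels, zero_division=0
    )
    precisions.append(p)
    recalls.append(r)

matrices = np.stack(matrices)  # shape: (n_folds, n_classes, n_classes)
mean_cm = matrices.mean(axis=0)  # summary statistics over the folds
std_cm = matrices.std(axis=0)
accuracy_per_fold = matrices.trace(axis1=1, axis2=2) / matrices.sum(axis=(1, 2))
mean_precision = np.mean(precisions, axis=0)  # one value per class
mean_recall = np.mean(recalls, axis=0)  # recall is the per-class sensitivity

print(mean_cm.round(1))
print(mean_precision.round(3), mean_recall.round(3))
```

Repeating the cross-validation with different seeds would give a richer distribution than five folds; that is left out here to keep the sketch short.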


In [1]:
import numpy as np
import yaml
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

In [5]:
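# Load the hyperparameter settings for every model from the YAML file.
with open("sample_hyperparameters.yml", "r") as ymlfile:
    params = yaml.safe_load(ymlfile)

# The top-level keys are the model names.
model_names = list(params.keys())
params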
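Since `ParameterGrid` is imported above, here is a hedged sketch of how a loaded mapping could be expanded into concrete settings. The YAML layout below (model name, then hyperparameter, then a list of candidate values) is an assumption made for illustration; the actual contents of `sample_hyperparameters.yml` are not shown in this notebook.

```python
import yaml
from sklearn.model_selection import ParameterGrid

# Hypothetical layout: model name -> hyperparameter -> list of candidate values.
example_yaml = """
logistic_regression:
  C: [0.1, 1.0, 10.0]
  penalty: [l2]
rbf_svc:
  C: [1.0, 10.0]
  gamma: [scale, 0.01]
"""

example_params = yaml.safe_load(example_yaml)

# ParameterGrid yields one dict per combination of candidate values.
grids = {name: list(ParameterGrid(grid)) for name, grid in example_params.items()}

for name, combos in grids.items():
    print(name, len(combos), "settings, e.g.", combos[0])
```

Each dict in a grid can be passed straight to the matching estimator's constructor with `**`, which keeps the YAML file as the single place where candidate values are listed.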

In [6]:
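# Write the settings back out. With a stream argument the dump returns None,
# so there is nothing to assign.
with open("stored_file.yaml", "w") as file:
    yaml.safe_dump(params, file)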
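A quick way to check that nothing is lost in storage is a round trip: dump, reload, and compare. The sketch below uses a temporary directory and a made-up `example_params` dict (both assumptions) so that it does not touch the files used above.

```python
import os
import tempfile

import yaml

# Made-up settings for illustration; None is written as YAML null.
example_params = {
    "random_forest": {"n_estimators": [100, 500], "max_depth": [None, 10]},
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "stored_file.yaml")
    with open(path, "w") as f:
        yaml.safe_dump(example_params, f, sort_keys=False)
    with open(path, "r") as f:
        reloaded = yaml.safe_load(f)

# True when the reloaded mapping matches what was written.
round_trip_ok = reloaded == example_params
print(round_trip_ok)
```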