# Overview
This notebook shows how to take a template kernel and launch a set of parallel runs with slightly different hyperparameters. For a parameter to be identified as a _hyperparameter_, it needs to appear in a code cell as a simple assignment of the following form:
```py
LEARNING_RATE=0.1
EPOCHS=5
MODEL='VGG16'
```
The notebook automatically extracts those parameters, creates a series of runs based on them, and submits all of the runs to Kaggle as Kernels.

# Set Up the Environment
Here we set up the variables for our Kaggle account.
- You need a USER_ID and USER_SECRET, which you can get by following the instructions here: https://github.com/Kaggle/kaggle-api#api-credentials
- The credentials below have already been invalidated, so you cannot use them.

In [1]:
USER_ID = 'kevinbot'
USER_SECRET = ''
DRY_RUN = True # if True, the notebooks are generated but not pushed to Kaggle

In [2]:
import os, json, nbformat, pandas as pd
import ast
from itertools import product
import copy
from nbformat import v4 as nbf
from time import time, sleep
import hashlib

In [3]:
kaggle_conf_dir = os.path.join(os.path.expandvars('$HOME'), '.kaggle')
os.makedirs(kaggle_conf_dir, exist_ok = True)
with open(os.path.join(kaggle_conf_dir, 'kaggle.json'), 'w') as f:
    json.dump({'username': USER_ID, 'key': USER_SECRET}, f)
!chmod 600 {kaggle_conf_dir}/kaggle.json
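
The cell above relies on IPython's `!chmod` shell escape. As a minimal sketch of the same step in plain Python, assuming a POSIX file system, the helper below writes the credentials file and restricts its permissions with `os.chmod`. The name `write_kaggle_credentials` and the example values are hypothetical, and the demo writes to a temporary directory rather than to `~/.kaggle`.

```python
import json
import os
import stat
import tempfile


def write_kaggle_credentials(conf_dir, username, key):
    """Write kaggle.json into conf_dir and restrict it to the owner."""
    os.makedirs(conf_dir, exist_ok=True)
    cred_path = os.path.join(conf_dir, 'kaggle.json')
    with open(cred_path, 'w') as f:
        json.dump({'username': username, 'key': key}, f)
    # same intent as `chmod 600`: read/write for the owner only
    os.chmod(cred_path, stat.S_IRUSR | stat.S_IWUSR)
    return cred_path


# demo in a throwaway directory so the real ~/.kaggle is left untouched
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials(demo_dir, 'example-user', 'example-key')
with open(demo_path) as f:
    demo_creds = json.load(f)
print(demo_creds['username'])
```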

# Download a Template Notebook/Kernel
Here we use the "Hot Dog not Hot Dog" kernel as a basis. To work well as a template, the file should contain a number of simple assignment lines like the ones below, which the code later in this notebook can change:
```py
LEARNING_RATE=0.1
EPOCHS=5
MODEL='VGG16'
```
We also download the kernel's metadata to use as a template.

In [17]:
base_dir = 'base_kernel'
kernel_path = os.path.join(base_dir, 'hot-dog-not-hot-dog-gpu.ipynb')
meta_path = os.path.join(base_dir, 'kernel-metadata.json')
if not (os.path.exists(kernel_path) and os.path.exists(meta_path)):
    !kaggle kernels pull -k kmader/hot-dog-not-hot-dog-gpu -p {base_dir} -m
with open(meta_path, 'r') as f:
    base_metadata = json.load(f)
base_metadata

Source code and metadata downloaded to base_kernel


{'id': 'kmader/hot-dog-not-hot-dog-gpu',
 'title': 'Hot Dog/Not Hot Dog (GPU)',
 'code_file': 'hot-dog-not-hot-dog-gpu.ipynb',
 'language': 'python',
 'kernel_type': 'notebook',
 'is_private': False,
 'enable_gpu': True,
 'enable_internet': False,
 'keywords': [],
 'dataset_sources': ['gaborfodor/keras-pretrained-models', 'kmader/food41'],
 'kernel_sources': [],
 'competition_sources': []}

# Notebook Metadata
Here we make the notebook metadata template for submitting new notebooks. It specifies which datasets to include, whether the notebook should be private, whether the GPU is enabled, and so forth.

In [18]:
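def notebook_meta_template(user_id, title, file_id, nb_path):
    """
    Copy the base metadata and override the id, title, code file and keywords.
    Settings inherited unchanged from the base kernel include:
     'is_private': False, # probably better to make them private but for the demo notebook it is useful to see them
     'enable_gpu': True,
     'enable_internet': False,
    """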
    c_dict = copy.deepcopy(base_metadata)
    kv_list = [('id', f'{user_id}/{file_id}')]
    kv_list += [('title', f'{title}')]
    kv_list += [('code_file', nb_path)]
    kv_list += [('keywords', ['hyperparameter-optimization'])]
    
    for k, v in kv_list:
        c_dict[k] = v
    return c_dict
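
To illustrate the copy-then-override pattern used by the template function, here is a minimal self-contained sketch. The helper `derive_metadata`, the trimmed `example_base` dictionary, and the `example-user/...` identifiers are assumptions made for illustration; only the field names come from the metadata shown earlier.

```python
import copy

# trimmed stand-in for the pulled kernel-metadata.json
example_base = {
    'id': 'kmader/hot-dog-not-hot-dog-gpu',
    'title': 'Hot Dog/Not Hot Dog (GPU)',
    'code_file': 'hot-dog-not-hot-dog-gpu.ipynb',
    'enable_gpu': True,
    'keywords': [],
}


def derive_metadata(base, **overrides):
    """Return a deep copy of base with the given fields replaced."""
    new_meta = copy.deepcopy(base)
    new_meta.update(overrides)
    return new_meta


example_meta = derive_metadata(
    example_base,
    id='example-user/example-run-0000',
    title='example-run-0000',
    code_file='example-run-0000.ipynb',
    keywords=['hyperparameter-optimization'],
)
print(example_meta['id'], example_meta['enable_gpu'])
print(example_base['keywords'])  # the template itself is left unchanged
```

Working on a deep copy means every generated run starts from the same untouched template, so settings such as `enable_gpu` carry over without being restated.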

## Parse the Notebook
Here we parse the notebook, looking for parameters to play with. We use pandas to show a bit of what is inside.

In [6]:
kernel_data = nbformat.read(kernel_path, as_version=4)
cell_df = pd.DataFrame(kernel_data['cells'])
cell_df.query('cell_type=="code"')

Unnamed: 0,cell_type,execution_count,metadata,outputs,source
2,code,1.0,"{'collapsed': True, '_cell_guid': 'e94de3e7-de...",[],!mkdir ~/.keras\n!mkdir ~/.keras/models\n!cp ....
3,code,2.0,"{'collapsed': True, '_cell_guid': 'c3cc4285-bf...",[],%matplotlib inline\nimport numpy as np # linea...
5,code,3.0,{'_cell_guid': '1d79959c-4921-48f9-a660-f1d550...,[],from sklearn.preprocessing import LabelEncoder...
7,code,4.0,{'_cell_guid': '5c8bd288-8261-4cbe-a954-e62ac7...,[],"all_paths_df['source'].hist(figsize = (20, 7),..."
9,code,5.0,{'_cell_guid': '1192c6b3-a940-4fa0-a498-d7e0d4...,[],from sklearn.model_selection import train_test...
11,code,46.0,{'_cell_guid': '21b5d30f-c645-41ad-85bc-4b51d2...,[],train_df = raw_train_df.groupby(['source_id'])...
12,code,7.0,{'_cell_guid': '9954bfda-29bd-4c4d-b526-0a972b...,[],from keras.preprocessing.image import ImageDat...
13,code,8.0,"{'collapsed': True, '_cell_guid': 'b5767f42-da...",[],"def flow_from_dataframe(img_data_gen, in_df, p..."
14,code,47.0,{'_cell_guid': '810bd229-fec9-43c4-b3bd-afd62e...,[],"train_gen = flow_from_dataframe(core_idg, trai..."
15,code,10.0,{'_cell_guid': '2d62234f-aeb0-4eba-8a38-d713d8...,[],"t_x, t_y = next(train_gen)\nfig, m_axs = plt.s..."


## Use Abstract Syntax Tree
We can use the abstract syntax tree (AST) to find assignments in the code that we can change in order to run notebooks with new settings.

In [7]:
all_asgn = []
for cell_idx, c_cell in enumerate(kernel_data['cells']):
    if c_cell['cell_type']=='code':
        c_src = c_cell['source']
        # remove jupyter things
        c_src = '\n'.join(['' if (c_block.strip().startswith('!') or 
                                  c_block.strip().startswith('%')) else
                           c_block
                           for c_block in c_src.split('\n')])
        
        for c_statement in ast.parse(c_src).body:
            if isinstance(c_statement, ast.Assign):
                # only keep single-name targets that are not assigned from function calls or lambdas
                if (len(c_statement.targets) == 1 and
                        isinstance(c_statement.targets[0], ast.Name) and
                        not isinstance(c_statement.value, (ast.Call, ast.Lambda))):
                    
                    all_asgn += [{'cell_id': cell_idx,
                                  'line_no': c_statement.lineno,
                                  'line_code': c_src.split('\n')[c_statement.lineno-1],
                                  #'value': c_statement.value,
                                  'target':  c_statement.targets[0].id}
                                  ]
assignment_df = pd.DataFrame(all_asgn)
assignment_df['line_replacement'] = assignment_df['line_code'] 
assignment_df

Unnamed: 0,cell_id,line_code,line_no,target,line_replacement
0,12,"IMG_SIZE = (299, 299) # slightly smaller than ...",7,IMG_SIZE,"IMG_SIZE = (299, 299) # slightly smaller than ..."
1,19,pt_depth = base_pretrained_model.get_output_sh...,8,pt_depth,pt_depth = base_pretrained_model.get_output_sh...
2,19,use_attention = False,29,use_attention,use_attention = False
3,20,"callbacks_list = [checkpoint, early, reduceLRO...",16,callbacks_list,"callbacks_list = [checkpoint, early, reduceLRO..."
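
To see the AST approach in isolation, the sketch below runs it on the three example lines from the overview plus one function call. It is a stricter variant of the filter above, not the same one: it keeps only assignments whose right-hand side is a literal, using `ast.literal_eval`, so expressions such as a list of variable names would be skipped. The `build_model` call is a hypothetical placeholder and is never executed, because the code is only parsed.

```python
import ast

example_src = (
    "LEARNING_RATE=0.1\n"
    "EPOCHS=5\n"
    "MODEL='VGG16'\n"
    "model = build_model(MODEL)\n"
)

found = {}
for stmt in ast.parse(example_src).body:
    # keep single-target assignments to a plain name ...
    if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
            and isinstance(stmt.targets[0], ast.Name)):
        try:
            # ... whose right-hand side is a literal (number, string, tuple, ...)
            found[stmt.targets[0].id] = ast.literal_eval(stmt.value)
        except ValueError:
            # calls, names and other non-literal expressions are skipped
            pass

print(found)  # {'LEARNING_RATE': 0.1, 'EPOCHS': 5, 'MODEL': 'VGG16'}
```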


# Make our batches
Here we make the batches of code to run. Each batch has a parameter DataFrame associated with it, which records the replacement line for every assignment and is later written into a metadata cell of the generated notebook.

We use the `product` function to perform a grid search over all the possibilities.

In [8]:
batch_dict = {'IMG_SIZE': [(139, 139), (299, 299), (384, 384), (512, 512)],
             'use_attention': [False, True]}
batch_keys = list(batch_dict.keys())
batches = []
for c_vec in product(*[batch_dict[k] 
                       for k in batch_keys]):
    cur_df = assignment_df.copy()
    sub_lines = dict(zip(batch_keys, c_vec))
    print(sub_lines)
    for c_key, c_value in sub_lines.items():
        cur_df.loc[cur_df['target']==c_key, 'line_replacement'] = cur_df[cur_df['target']==c_key]['line_code'].map(lambda x: '{}= {!r}'.format(
            x.split('=')[0],
            c_value))  # repr keeps the quotes on string values such as 'VGG16'
    batches+=[(sub_lines, cur_df)]

{'IMG_SIZE': (139, 139), 'use_attention': False}
{'IMG_SIZE': (139, 139), 'use_attention': True}
{'IMG_SIZE': (299, 299), 'use_attention': False}
{'IMG_SIZE': (299, 299), 'use_attention': True}
{'IMG_SIZE': (384, 384), 'use_attention': False}
{'IMG_SIZE': (384, 384), 'use_attention': True}
{'IMG_SIZE': (512, 512), 'use_attention': False}
{'IMG_SIZE': (512, 512), 'use_attention': True}
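
The grid construction can be tried on its own with a smaller, made-up grid. In this sketch the parameter values (including `'ResNet50'`) and the helper `make_assignment` are hypothetical; the point is that `product` yields one tuple per combination and that `repr` keeps the quotes when a string value is written back as source code.

```python
from itertools import product

example_grid = {'EPOCHS': [5, 10], 'MODEL': ['VGG16', 'ResNet50']}
example_keys = list(example_grid)

# one dict per combination, in the same order product() yields them
example_runs = [dict(zip(example_keys, combo))
                for combo in product(*(example_grid[k] for k in example_keys))]


def make_assignment(name, value):
    """Build a source line; repr() keeps quotes around string values."""
    return '{} = {!r}'.format(name, value)


for run in example_runs:
    print([make_assignment(k, v) for k, v in run.items()])
```

The number of runs is the product of the list lengths, so adding values to any parameter multiplies the number of kernels that will be launched.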


## Replace the lines in the notebook
The code here surgically replaces just the necessary lines in the notebook and (hopefully) leaves everything else exactly the way it is.

In [9]:
def replace_line(in_code, in_line_idx, in_replacement):
    return '\n'.join([j if i!=in_line_idx else in_replacement for i, j in enumerate(in_code.split('\n'), 1)])
def apply_replacement_df(in_nb, rep_df):
    cur_nb = copy.deepcopy(in_nb)
    for _, c_row in rep_df.iterrows():
        if c_row['line_code']!=c_row['line_replacement']:
            # lines to fix
            cell_idx = c_row['cell_id']
            cur_nb['cells'][cell_idx]['source'] = replace_line(cur_nb['cells'][cell_idx]['source'], c_row['line_no'], c_row['line_replacement'])
    return cur_nb
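
A quick check of the line replacement on a small, made-up cell source shows the 1-based line numbering, which matches the `lineno` reported by `ast`. The helper is restated here (with the same behavior) so the snippet runs on its own; `example_cell` is an assumption for illustration.

```python
def replace_line(in_code, in_line_idx, in_replacement):
    # line numbers are 1-based, matching ast's lineno
    return '\n'.join(in_replacement if i == in_line_idx else line
                     for i, line in enumerate(in_code.split('\n'), 1))


example_cell = "import numpy as np\nIMG_SIZE = (299, 299)\nprint(IMG_SIZE)"
patched_cell = replace_line(example_cell, 2, "IMG_SIZE = (139, 139)")
print(patched_cell)
```

A line number outside the cell leaves the source unchanged, since no index ever matches.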

# Add the relevant information
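We want to attach all the information about the current run to each generated notebook so we can harvest it later. As a first step, we create a run ID by hashing the start time together with the notebook contents.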

In [10]:
run_start_time = time()
# utf-8 rather than ascii, so notebooks containing non-ASCII text can still be hashed
run_id = hashlib.md5('{:2.2f}-{}'.format(run_start_time, kernel_data).encode('utf-8')).hexdigest()
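
As a sketch of how the run ID and the per-run names fit together, the snippet below wraps the hashing in a helper and derives the numbered names used for the generated notebooks. The function name `make_run_id` and the sample payload are hypothetical; the sketch encodes the text as UTF-8 so that content with non-ASCII characters can be hashed.

```python
import hashlib
from time import time


def make_run_id(start_time, payload):
    """Hash the start time and the notebook contents into a 32-character hex id."""
    key = '{:2.2f}-{}'.format(start_time, payload)
    # utf-8 can encode any notebook text, including non-ASCII characters
    return hashlib.md5(key.encode('utf-8')).hexdigest()


example_id = make_run_id(time(), 'notebook contents with an accent: é')
# each run in the batch gets the shared id plus a zero-padded index
example_names = ['{}-{:04d}'.format(example_id, i) for i in range(3)]
print(example_names)
```

Because the start time is part of the hash, rerunning the notebook produces a new ID, which keeps separate sweeps from colliding.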

# Launch the kernels
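Here we use the Kaggle API to launch the kernels with the different settings. When `DRY_RUN` is set, the notebooks and metadata are still written locally, but nothing is pushed.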

In [11]:
launched_kernels_list = []
kernel_id_list = []
cur_nb = nbformat.read(kernel_path, as_version = 4)
for i, (sub_lines, cur_df) in enumerate(batches):
    out_name = '{}-{:04d}'.format(run_id, i)
    out_kernel_path = '{}.ipynb'.format(out_name)
    new_nb = apply_replacement_df(kernel_data, cur_df)
    # append cells containing useful metadata we might need later
    last_cells = [nbf.new_markdown_cell('# Notebook Settings\nThe last cell is just for metadata settings that will be read out later.')]
    last_cells += [nbf.new_markdown_cell(json.dumps({'run_id': run_id,
                                      'run_time': run_start_time,
                                      'run_settings': sub_lines,
                                                     'run_df': list(cur_df.T.to_dict().values())
                                                    }))]
    new_nb['cells']+=last_cells
    nbformat.write(new_nb, out_kernel_path)
    with open('kernel-metadata.json', 'w') as f:
        meta_dict = notebook_meta_template(USER_ID, 
                           out_name, 
                           out_name, 
                           out_kernel_path)
        json.dump(meta_dict, f)
    if not DRY_RUN:
        out_str = !kaggle kernels push -p .
    else:
        out_str = ['Not Run']
    kernel_id_list += [dict(id=meta_dict['id'], **sub_lines)]
    launched_kernels_list += [out_str] 

In [12]:
for c_line in launched_kernels_list:
    print(c_line[0])

Not Run
Not Run
Not Run
Not Run
Not Run
Not Run
Not Run
Not Run


In [13]:
# export results to a text file
kernel_df = pd.DataFrame(kernel_id_list)
kernel_df.to_csv('kernels.csv', index=False)
kernel_df

Unnamed: 0,IMG_SIZE,id,use_attention
0,"(139, 139)",kevinbot/6c52f3e29082feb805bc87daaf490850-0000,False
1,"(139, 139)",kevinbot/6c52f3e29082feb805bc87daaf490850-0001,True
2,"(299, 299)",kevinbot/6c52f3e29082feb805bc87daaf490850-0002,False
3,"(299, 299)",kevinbot/6c52f3e29082feb805bc87daaf490850-0003,True
4,"(384, 384)",kevinbot/6c52f3e29082feb805bc87daaf490850-0004,False
5,"(384, 384)",kevinbot/6c52f3e29082feb805bc87daaf490850-0005,True
6,"(512, 512)",kevinbot/6c52f3e29082feb805bc87daaf490850-0006,False
7,"(512, 512)",kevinbot/6c52f3e29082feb805bc87daaf490850-0007,True
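
Since each generated notebook ends with a markdown cell holding the run settings as JSON, they can be read back later. The following is a minimal sketch under the assumption that the settings cell is the last cell; `read_run_settings` and the stand-in notebook dictionary are hypothetical. For a real file, loading the `.ipynb` with `json.load` gives a dictionary of this shape.

```python
import json

# settings as they would be passed to json.dumps when the notebook is generated
example_settings = {'run_id': 'example-run',
                    'run_settings': {'IMG_SIZE': (139, 139), 'use_attention': False}}

# minimal stand-in for a generated notebook as stored on disk (nbformat 4 JSON)
example_nb = {
    'nbformat': 4, 'nbformat_minor': 2, 'metadata': {},
    'cells': [
        {'cell_type': 'markdown', 'metadata': {}, 'source': ['# Notebook Settings']},
        {'cell_type': 'markdown', 'metadata': {}, 'source': [json.dumps(example_settings)]},
    ],
}


def read_run_settings(nb_dict):
    """Parse the JSON stored in the last cell of a notebook dict."""
    source = nb_dict['cells'][-1]['source']
    # on disk, a cell's source may be a single string or a list of lines
    if isinstance(source, list):
        source = ''.join(source)
    return json.loads(source)


harvested = read_run_settings(example_nb)
print(harvested['run_settings'])  # {'IMG_SIZE': [139, 139], 'use_attention': False}
```

Note that JSON has no tuple type, so `IMG_SIZE` comes back as a list and may need converting before it is compared with the original settings.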


# Status
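We can check the status of a kernel like so if we want to follow up on it. Because this notebook was executed as a dry run, the kernel was never pushed, so the request below returns a 404.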

In [14]:
!kaggle kernels status -k {meta_dict['id']}

(404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'private', 'Content-Length': '33', 'Content-Type': 'application/json; charset=utf-8', 'Set-Cookie': 'TempData=.16yH3HWBoWlSCfBIW/2rR02+JsrGGeGU/Npuv8afoyKRDpTtGe34k0RJ/b8S+OLbAeGMRENUSnsmgDnzsqKY3mOTDFsq7NYlpr0nI+HDkNibg/dMeGQxmH2Wwt4XLMJWxlfq3X9lahiqnziz81vZvfAa7Z+86Xhs77LcIQ/XzrtBIgGXkc3mgaAU+FHMR8MESqTDDVemn7BFORLaLgpWe+K7xEw=; path=/; secure; HttpOnly, ARRAffinity=f22663401b2cea415e7583dbe472f1a5aa4c7c97922fc8f06f7718766a4c8037;Path=/;HttpOnly;Domain=www.kaggle.com', 'X-Kaggle-MillisecondsElapsed': '67', 'X-Kaggle-RequestId': '59aa4ed0ffb85e1afed016b45d81bafc', 'X-Kaggle-ApiVersion': '1.4.2', 'Access-Control-Allow-Origin': '*', 'X-Frame-Options': 'SAMEORIGIN', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Date': 'Tue, 24 Jul 2018 15:57:01 GMT'})
HTTP response body: {"code":404,"message":"NotFound"}
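
To follow up on every run rather than only the last one, the IDs saved in `kernels.csv` can be turned into one status command each. This sketch only builds the commands and does not call Kaggle; the CSV content and IDs are placeholders, and the `-k` flag is copied from the status cell above, so it is an assumption that should be checked against the installed CLI version.

```python
import csv
import io

# stand-in for kernels.csv; the ids are placeholders, not real kernels
example_csv = io.StringIO(
    'IMG_SIZE,id,use_attention\n'
    '"(139, 139)",example-user/example-run-0000,False\n'
    '"(139, 139)",example-user/example-run-0001,True\n'
)

status_commands = [
    # flag copied from the status cell above; check it against your CLI version
    ['kaggle', 'kernels', 'status', '-k', row['id']]
    for row in csv.DictReader(example_csv)
]
for cmd in status_commands:
    print(' '.join(cmd))
```

Once the kernels have actually been pushed, each command list could be passed to `subprocess.run` to query the status outside of a notebook.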

