# Initialization

Change to the `valuenet` directory and add its `src` directory to Python's module search path (`sys.path`).

In [None]:
%cd /home/ec2-user/SageMaker/valuenet

In [None]:
import sys
sys.path.insert(0, '/home/ec2-user/SageMaker/valuenet/src')
sys.path

Read the credentials and database settings from environment variables.

In [None]:
NER_API_SECRET=%env NER_API_SECRET
API_KEY=%env API_KEY
DB_USER=%env DB_USER
DB_PW=%env DB_PW
DB_HOST=%env DB_HOST
DB_PORT=%env DB_PORT
DB_SCHEMA="public"
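
If one of these variables is not set, later steps that depend on it may fail in ways that are hard to trace. The following is a minimal sketch of a check in plain Python. The helper name `missing_env_vars` is hypothetical and not part of the repository, and the list of names is simply copied from the cell above.

```python
import os

# Names copied from the cell above. DB_SCHEMA is hard-coded there, so it is not checked.
REQUIRED_VARS = ["NER_API_SECRET", "API_KEY", "DB_USER", "DB_PW", "DB_HOST", "DB_PORT"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

The optional `environ` argument only exists so the helper can be tried against a plain dictionary instead of the real process environment.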

# Prepare & Preprocess Data

## Add your custom data

TODO..

## Transform into Spider representation

In [None]:
%run src/tools/training_data_builder/training_data_builder.py --data hack_zurich

You will now find your custom data in the two files [data/hack_zurich/original/train.json](data/hack_zurich/original/train.json) and [data/hack_zurich/original/dev.json](data/hack_zurich/original/dev.json).

## Extract Value Candidates using Named Entity Recognition

In [None]:
%run src/named_entity_recognition/api_ner/extract_values.py --data_path=data/hack_zurich/original/train.json --output_path=data/hack_zurich/ner_train.json --ner_api_secret={NER_API_SECRET}

In [None]:
%run src/named_entity_recognition/api_ner/extract_values.py --data_path=data/hack_zurich/original/dev.json --output_path=data/hack_zurich/ner_dev.json --ner_api_secret={NER_API_SECRET}

## Extract the ground truth values from the SQL query

In [None]:
%run src/tools/get_values_from_sql.py --data_path data/hack_zurich/original/train.json --table_path data/hack_zurich/original/tables.json --ner_path data/hack_zurich/ner_train.json
%run src/tools/get_values_from_sql.py --data_path data/hack_zurich/original/dev.json --table_path data/hack_zurich/original/tables.json --ner_path data/hack_zurich/ner_dev.json

This last script does not create a new file. Instead, it adds the ground truth values to the existing *ner_train.json* and *ner_dev.json* files under a new attribute, *values*:

```json
    "values": [
      "Wetzikon",
      "2016"
    ]
```
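
To get a quick impression of what was extracted, you could summarize the *values* attribute across all entries. The sketch below assumes that each NER file is a JSON array of objects, which the excerpt above suggests but does not confirm. The helper `summarize_values` is hypothetical, and the sample data only mirrors the excerpt.

```python
import json


def summarize_values(entries):
    """Count entries with a non-empty "values" list and collect the distinct values."""
    with_values = [entry for entry in entries if entry.get("values")]
    distinct = sorted({str(value) for entry in with_values for value in entry["values"]})
    return len(with_values), distinct


# Hypothetical sample mirroring the excerpt above. In the notebook you would load
# the real file instead (assuming it is a JSON array), e.g.:
#   with open("data/hack_zurich/ner_train.json") as f:
#       entries = json.load(f)
entries = json.loads('[{"values": ["Wetzikon", "2016"]}, {"values": []}]')

count, distinct = summarize_values(entries)
print(f"{count} entries with values; distinct values: {distinct}")
```

Entries without a *values* attribute, or with an empty list, are not counted.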

## Pre-processing

In [None]:
%run src/preprocessing/pre_process.py --data_path=data/hack_zurich/original/train.json --ner_data_path=data/hack_zurich/ner_train.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_train.json --database_host={DB_HOST} --database_port={DB_PORT} --database_user={DB_USER} --database_password={DB_PW} --database_schema={DB_SCHEMA}

In [None]:
%run src/preprocessing/pre_process.py --data_path=data/hack_zurich/original/dev.json --ner_data_path=data/hack_zurich/ner_dev.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_dev.json --database_host={DB_HOST} --database_port={DB_PORT} --database_user={DB_USER} --database_password={DB_PW} --database_schema={DB_SCHEMA}

## Modelling JOINs and SQL-to-SemQL

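We start by modelling some JOINs as filters. This step is of minor importance and most probably has no effect on your data, so you might skip it. Note, however, that the SQL-to-SemQL step below reads the `preprocessed_with_joins_*.json` files written here, so its `--data_path` would need adjusting if you do.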

In [None]:
%run src/preprocessing/model_joins_as_filter.py --data_path=data/hack_zurich/preprocessed_train.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_with_joins_train.json 

In [None]:
%run src/preprocessing/model_joins_as_filter.py --data_path=data/hack_zurich/preprocessed_dev.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_with_joins_dev.json 

Then transform the SQL queries to SemQL:

In [None]:
%run src/preprocessing/sql2SemQL.py --data_path data/hack_zurich/preprocessed_with_joins_train.json --table_path data/hack_zurich/original/tables.json --output data/hack_zurich/train.json

In [None]:
%run src/preprocessing/sql2SemQL.py --data_path data/hack_zurich/preprocessed_with_joins_dev.json --table_path data/hack_zurich/original/tables.json --output data/hack_zurich/dev.json 
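
Each step above writes a file that a later step reads, so a silent failure early on can surface much later. The following sketch checks that the expected output files exist. The paths are copied from the `--output` and `--output_path` arguments above, and the helper `missing_files` is hypothetical.

```python
from pathlib import Path

# Output paths as passed to the scripts in the cells above.
EXPECTED_OUTPUTS = [
    "data/hack_zurich/ner_train.json",
    "data/hack_zurich/ner_dev.json",
    "data/hack_zurich/preprocessed_train.json",
    "data/hack_zurich/preprocessed_dev.json",
    "data/hack_zurich/preprocessed_with_joins_train.json",
    "data/hack_zurich/preprocessed_with_joins_dev.json",
    "data/hack_zurich/train.json",
    "data/hack_zurich/dev.json",
]


def missing_files(paths, base_dir="."):
    """Return the paths that do not exist as regular files under base_dir."""
    base = Path(base_dir)
    return [path for path in paths if not (base / path).is_file()]


missing = missing_files(EXPECTED_OUTPUTS)
if missing:
    print("Missing output files:")
    for path in missing:
        print("  " + path)
else:
    print("All expected output files are present.")
```

If you skipped the JOIN modelling step, the two `preprocessed_with_joins_*.json` entries will be reported as missing and can be removed from the list.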

# Train Model

In [None]:
exp_name="train-01"

In [None]:
%run src/main.py --exp_name {exp_name} --cuda --batch_size 8 --num_epochs 5 --loss_epoch_threshold 70 --sketch_loss_weight 1.0 --beam_size 1 --seed 90