# Clone or pull the valuenet repository

In [None]:
!git clone https://github.com/brunnurs/valuenet.git
# !git pull

fatal: destination path 'valuenet' already exists and is not an empty directory.


In [None]:
%cd valuenet/

/content/valuenet


In [None]:
import sys
sys.path.insert(0, '/content/valuenet/src')

# Prepare & Preprocess Data
### We follow the [user manual from Valuenet](https://github.com/brunnurs/valuenet).

## Add your custom data

An example of custom data preparation can be found in the statbot repository: [generate_sql_statments_and_questions.ipynb](https://github.com/statistikZH/statbot/blob/main/hackathon_hackzurich/generate_sql_statments_and_questions.ipynb). In that notebook, random values are taken from the hack_zurich database and fed through a template to generate questions and queries.
We then convert these questions and queries into the required format and save them as:
- statbot/hackathon_hackzurich/handmade_data_dev.json
- statbot/hackathon_hackzurich/handmade_data_train.json

If you then want to preprocess the generated data, copy the handmade_data_xxx.json files from statbot to valuenet:
- valuenet/data/hack_zurich/handmade_training_data/handmade_data_dev.json
- valuenet/data/hack_zurich/handmade_training_data/handmade_data_train.json
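
The copy step can be sketched in Python as follows. The helper `copy_handmade_data` is hypothetical and not part of statbot or valuenet, and where statbot is checked out on your machine is an assumption you need to adapt.

```python
import shutil
from pathlib import Path

# File names taken from the list above.
FILES = ("handmade_data_dev.json", "handmade_data_train.json")


def copy_handmade_data(src_dir, dst_dir, files=FILES):
    """Copy the generated JSON files from src_dir into dst_dir.

    Returns the list of destination paths that were written.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    copied = []
    for name in files:
        source = src_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"expected {source} to exist")
        # copy2 also preserves file metadata and returns the destination path
        copied.append(Path(shutil.copy2(source, dst_dir / name)))
    return copied
```

Assuming statbot was cloned next to valuenet under `/content`, a call could look like `copy_handmade_data('/content/statbot/hackathon_hackzurich', 'data/hack_zurich/handmade_training_data')`.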

You are now ready to preprocess the dataset by running the following steps.

**Note**: If you create your own query template, pay close attention to the syntax and stay close to the one in the example notebook, because the valuenet codebase is quite sensitive to how queries are written. For instance:
- it tokenizes SQL queries based on spaces, so always separate tokens with spaces (`T1.year=2006` will raise an error and should be written `T1.year = 2006`)
- always use the keyword `AS` when defining aliases

If the next step fails, one of these two points is the most likely cause. If the error persists, check what the function `tokenize()` in `valuenet/src/spider/test_suite_eval/process_sql.py` does with your query and adapt the codebase.
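
To illustrate why the missing spaces matter, here is a minimal sketch. It is not valuenet's `tokenize()`: `naive_tokens` is a simplified stand-in that splits on whitespace only, and `pad_operators` is a hypothetical helper you could apply to your template output. The helper is deliberately naive and would also pad operators that appear inside quoted string values.

```python
import re

# Comparison operators, longest first so ">=" is matched before ">".
_OPERATORS = re.compile(r"\s*(<>|!=|>=|<=|=|>|<)\s*")


def naive_tokens(sql):
    """Split on whitespace only, as a stand-in for a space-based tokenizer."""
    return sql.split()


def pad_operators(sql):
    """Put exactly one space on each side of every comparison operator."""
    return _OPERATORS.sub(r" \1 ", sql).strip()


query = "SELECT T1.name FROM spatialunit AS T1 WHERE T1.year=2006"
print(naive_tokens(query)[-1])                # the condition stays one token
print(naive_tokens(pad_operators(query))[-3:])  # column, operator and value are separate
```

With the unpadded query, the whole condition `T1.year=2006` ends up as a single token, which is the kind of input a space-based tokenizer cannot interpret.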

# Set up the environment

In [None]:
!pip install ipywidgets
!spacy download en_core_web_sm
!pip install -r requirements.txt

Collecting en-core-web-sm==3.1.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


***When you run this part for the first time, you need to restart the runtime.***
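
After the restart, a quick check can confirm that the freshly installed packages are visible to the runtime. This is only a sketch: `missing_modules` is a hypothetical helper, and the list of names to check is an assumption based on the install cell above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package can be imported.
print(missing_modules(["spacy", "en_core_web_sm", "ipywidgets"]))
```

If the list is not empty, rerun the install cell and restart the runtime again.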

In [None]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()
!nvidia-smi

Thu Sep 23 13:22:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Transform into Spider representation

In [None]:
import nltk
nltk.download('punkt')
%run src/tools/training_data_builder/training_data_builder.py --data hack_zurich

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Load data from data/hack_zurich/original/tables.json. N=1
successfully transformed 1 samples for train split
successfully transformed 1 samples for dev split


You will now find your custom data in the two files [data/hack_zurich/original/train.json](data/hack_zurich/original/train.json) and [data/hack_zurich/original/dev.json](data/hack_zurich/original/dev.json).
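
A quick sanity check on the generated files can catch malformed samples before the later steps. The sketch below assumes each file holds a JSON list of objects with the Spider-style keys `db_id`, `question` and `query`; the helper names are hypothetical and not part of valuenet.

```python
import json

REQUIRED_KEYS = ("db_id", "question", "query")  # assumed Spider-style keys


def load_samples(path):
    """Load a JSON file that is expected to contain a list of samples."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_incomplete_samples(samples, required=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for samples that lack a required key."""
    problems = []
    for idx, sample in enumerate(samples):
        missing = [key for key in required if key not in sample]
        if missing:
            problems.append((idx, missing))
    return problems
```

For example, `find_incomplete_samples(load_samples('data/hack_zurich/original/train.json'))` should return an empty list if every sample carries the assumed keys.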

## Extract Value Candidates using Named Entity Recognition

In [None]:
NER_API_SECRET = 'YOUR_NER_API_SECRET'  # replace with your own NER API secret; never commit a real key

In [None]:
%run src/named_entity_recognition/api_ner/extract_values.py --data_path=data/hack_zurich/original/train.json --output_path=data/hack_zurich/ner_train.json --ner_api_secret={NER_API_SECRET}

HTTP: 200. for request 'How high was the accessibility by bus in Schlieren in 2014?'
Extracted 1 values. 0 requests failed.


In [None]:
%run src/named_entity_recognition/api_ner/extract_values.py --data_path=data/hack_zurich/original/dev.json --output_path=data/hack_zurich/ner_dev.json --ner_api_secret={NER_API_SECRET}

HTTP: 200. for request 'What is the telephone number of Urdorf?'
Extracted 1 values. 0 requests failed.


## Extract the ground truth values from the SQL query

In [None]:
%run src/tools/get_values_from_sql.py --data_path data/hack_zurich/original/train.json --table_path data/hack_zurich/original/tables.json --ner_path data/hack_zurich/ner_train.json
%run src/tools/get_values_from_sql.py --data_path data/hack_zurich/original/dev.json --table_path data/hack_zurich/original/tables.json --ner_path data/hack_zurich/ner_dev.json

Found values ['Schlieren', '2014'] for question: "How high was the accessibility by bus in Schlieren in 2014?"
Read out values from 1 questions and added it to NER-file data/hack_zurich/ner_train.json
Found values ['Urdorf'] for question: "What is the telephone number of Urdorf?"
Read out values from 1 questions and added it to NER-file data/hack_zurich/ner_dev.json


This last script doesn't create a new file, but adds the ground truth values to the *ner_dev.json* and *ner_train.json* files, see the new attribute *values*:

```json
    "values": [
      "Wetzikon",
      "2016"
    ]
```
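
To see which values were attached to which question, you can summarise the NER file. The sketch below assumes the file is a JSON list of objects that each carry a `question` key next to the `values` attribute; `values_by_question` is a hypothetical helper, not part of valuenet.

```python
def values_by_question(entries):
    """Map each question to its ground-truth values (empty list if none were added)."""
    return {entry["question"]: list(entry.get("values", [])) for entry in entries}


# Small in-memory example in the assumed shape of an NER file.
example = [
    {"question": "Example question with values", "values": ["Wetzikon", "2016"]},
    {"question": "Example question without values"},
]
for question, values in values_by_question(example).items():
    print(question, values)
```

For the real data you would pass the result of `json.load` on `data/hack_zurich/ner_train.json` instead of `example`. Questions mapped to an empty list are the ones to inspect.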

## Pre-processing

In [None]:
DB_HOST = 'YOUR_DB_HOST'
DB_PORT = 'YOUR_DB_PORT'
DB_USER = 'YOUR_DB_USER' # Read Only User cannot access commune_type
DB_PW = 'YOUR_DB_PASSWORD'
DB_SCHEMA = 'YOUR_DB_SCHEMA'
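
Instead of typing credentials into the notebook, you could read them from environment variables. This is a sketch under the assumption that you export variables with the same names as the constants above; `load_db_settings` is a hypothetical helper.

```python
import os

DB_KEYS = ("DB_HOST", "DB_PORT", "DB_USER", "DB_PW", "DB_SCHEMA")


def load_db_settings(env=None):
    """Return the database settings from env, failing early if one is missing or empty."""
    env = os.environ if env is None else env
    missing = [key for key in DB_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in DB_KEYS}
```

You would then assign, for example, `DB_HOST = load_db_settings()['DB_HOST']`, which keeps the password out of the saved notebook.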


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')
%run src/preprocessing/pre_process.py --data_path=data/hack_zurich/original/train.json --ner_data_path=data/hack_zurich/ner_train.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_train.json --database_host={DB_HOST} --database_port={DB_PORT} --database_user={DB_USER} --database_password={DB_PW} --database_schema={DB_SCHEMA}

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!



Process example idx: 0
Question: How high was the accessibility by bus in Schlieren in 2014?
SQL: select access_by_bus
from accessibility_bus as b
         join spatialunit as s on b.spatialunit_id = s.spatialunit_id
where s.name = 'Schlieren' and b.year = 2014
Look for potential candidates "[('Schlieren', 0.7), ('accessibility', 0.7), ('bus', 0.7), ('2014', 1.0)]" in database hack_zurich (include primary keys: False)
Confirmed the following candidates "[('Schlieren', 'name', 'city'), ('Schlieren', 'name combined', 'city')]"
Elapsed time is 3.181026 seconds.
Total pre-processing took 3.278246 seconds.
Could not find all values in 0 examples. All examples where values could not get extracted, will get disable on evaluation


In [None]:
%run src/preprocessing/pre_process.py --data_path=data/hack_zurich/original/dev.json --ner_data_path=data/hack_zurich/ner_dev.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_dev.json --database_host={DB_HOST} --database_port={DB_PORT} --database_user={DB_USER} --database_password={DB_PW} --database_schema={DB_SCHEMA}


Process example idx: 0
Question: What is the telephone number of Urdorf?
SQL: select tel from spatialunit where name = 'Urdorf'
Look for potential candidates "[('Urdorf', 0.7), ('telephone number', 0.7), ('telephone', 0.7), ('number', 0.7)]" in database hack_zurich (include primary keys: False)
Confirmed the following candidates "[('Urdorf', 'name combined', 'city'), ('Urdorf', 'name', 'city')]"
Elapsed time is 3.167395 seconds.
Total pre-processing took 3.175103 seconds.
Could not find all values in 0 examples. All examples where values could not get extracted, will get disable on evaluation


## Modelling JOINs and SQL-to-SemQL
We start by modelling some JOINs as filters. This step is of minor importance and most probably has no effect on your data, so you may skip it.


In [None]:
%run src/preprocessing/model_joins_as_filter.py --data_path=data/hack_zurich/preprocessed_train.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_with_joins_train.json 

Load data from data/hack_zurich/original/tables.json. N=1


In [None]:
%run src/preprocessing/model_joins_as_filter.py --data_path=data/hack_zurich/preprocessed_dev.json --table_path=data/hack_zurich/original/tables.json --output=data/hack_zurich/preprocessed_with_joins_dev.json 

Load data from data/hack_zurich/original/tables.json. N=1


Then we transform SQL to SemQL:

In [None]:
%run src/preprocessing/sql2SemQL.py --data_path data/hack_zurich/preprocessed_with_joins_train.json --table_path data/hack_zurich/original/tables.json --output data/hack_zurich/train.json

Root1(3) Root(3) Sel(0) N(0) A(0) C(20) T(3) Filter(0) Filter(2) A(0) C(12) T(1) V(1) Filter(2) A(0) C(18) T(3) V(0)
Finished 1 datas and failed 0 datas


In [None]:
%run src/preprocessing/sql2SemQL.py --data_path data/hack_zurich/preprocessed_with_joins_dev.json --table_path data/hack_zurich/original/tables.json --output data/hack_zurich/dev.json 

Root1(3) Root(3) Sel(0) N(0) A(0) C(11) T(1) Filter(2) A(0) C(12) T(1) V(0)
Finished 1 datas and failed 0 datas
