# Templates
The feedback generation templates and the reimplementation explanation templates from the NL-Edit paper.

### Corrections since last meeting:
1. Minor errors are fixed.
2. Use foreign key relationships to judge whether two column names are equal. If one column is another's foreign key, they are equivalent. Otherwise they are not, even if they have identical column names. A foreign key group is defined as a subgraph of all connected columns.
3. Connect the conditions in the `WHERE` clause with the correct logical operator, `AND` or `OR`.
   1. *cond<sub>1</sub> (and | or) cond<sub>2</sub> ...*
   2. *you should consider (both | either) of the conditions rather than (either | both) of them.*
4. When adding an additional table, also mention the old table.
   * *additionally use the information from the tab_name<sub>add</sub> table besides the tab_name<sub>old</sub>.*
5. Applied three types of tags to categorize the tokens in feedback templates.
   1. `add` represents added **column names**, **table names**, and **order directions** or **largest / smallest**.
   2. `sub` represents removed **column names**, **table names**, and **order directions** or **largest / smallest**.
   3. `info` represents other information which is necessary to correct the wrong query.
6. The order in which the parts are corrected follows the order of SQL execution.
7. If a `JOIN ... ON` condition in the correct parse equates two columns, their foreign key groups are unioned.
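
Items 2 and 7 can be read as a union-find over columns. The block below is a minimal sketch of that reading, not the project's actual code: the class `ForeignKeyGroups`, its methods, and the example tables and columns are all hypothetical.

```python
class ForeignKeyGroups:
    """Union-find over (table, column) pairs. Hypothetical helper, not project code."""

    def __init__(self):
        self.parent = {}

    def find(self, col):
        # Register unseen columns as their own group, then walk up to the root.
        self.parent.setdefault(col, col)
        while self.parent[col] != col:
            self.parent[col] = self.parent[self.parent[col]]  # path halving
            col = self.parent[col]
        return col

    def union(self, a, b):
        # Merge the groups of two columns linked by a foreign key
        # or by a JOIN ... ON equality in the correct parse.
        self.parent[self.find(a)] = self.find(b)

    def equivalent(self, a, b):
        return self.find(a) == self.find(b)


groups = ForeignKeyGroups()

# Schema-level foreign key (assumed example schema).
groups.union(("employees", "department_id"), ("departments", "department_id"))
same_fk = groups.equivalent(("employees", "department_id"),
                            ("departments", "department_id"))

# Identical names without a foreign key link are not equivalent (item 2).
same_name_only = groups.equivalent(("employees", "name"), ("departments", "name"))

# A JOIN ... ON equality in the correct parse unions two groups (item 7).
groups.union(("departments", "department_id"), ("jobs", "department_id"))
after_join = groups.equivalent(("employees", "department_id"),
                               ("jobs", "department_id"))
```

Keying on `(table, column)` pairs rather than bare column names is what keeps two same-named columns apart until a foreign key or join links them.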

### Considerations
1. When mentioning column names, literal ambiguity depends only on whether identical column names exist in other tables that are also used in the current SQL query. In feedback generation, we should consider all tables mentioned in the wrong and the correct queries.
2. When judging the equivalence of two column names from different tables, they are considered the same column only if they are foreign keys of each other.
3. When adding an extra table to the wrong query, users tend to mention the existing table in the wrong query.
4. The BART rephrasing model should take both the question and the template feedback as input, because users sometimes rephrase the question.
5. There are some column name **abbreviations** in the SPIDER dataset, such as `stuid` for `student id` and `long` for `longitude`. Users sometimes use the expanded names and sometimes the original names.
6. Of the 9314 samples, 618 add or remove an entire `UNION`/`INTERSECT`/`EXCEPT` sub-query, and 3 add a sub-query in the `FROM` clause.
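
The ambiguity rule in item 1 can be sketched as follows. This is an illustration under stated assumptions, not the template code: the schema representation (table name mapped to a set of column names), the helpers `is_ambiguous` and `mention`, and the wording "of the ... table" are all hypothetical.

```python
def is_ambiguous(column, table, schema, used_tables):
    """Return True if `column` also exists in another table used by the queries.

    `schema` maps table name -> set of column names (assumed representation).
    """
    return any(
        column in schema[other]
        for other in used_tables
        if other != table
    )


def mention(column, table, schema, used_tables):
    # Qualify the column with its table only when the bare name would be ambiguous.
    if is_ambiguous(column, table, schema, used_tables):
        return f"{column} of the {table} table"
    return column


schema = {
    "student": {"stuid", "name", "age"},
    "pet": {"petid", "name"},
    "course": {"cid", "name"},  # never used by the queries, so it causes no ambiguity
}

# For feedback, take the union of the tables in the wrong and the correct query.
wrong_tables, correct_tables = {"student"}, {"student", "pet"}
used = wrong_tables | correct_tables

print(mention("name", "student", schema, used))  # qualified: `pet` also has `name`
print(mention("age", "student", schema, used))   # left bare: only `student` has `age`
```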

### TODO:
- [ ] Correct the **remove the entire GROUPBY clause** template to follow the Overleaf file.
- [ ] Check the importance of each clause, 10 samples for each case:
  - [ ] SELECT vs FROM
  - [ ] SELECT vs WHERE
  - [ ] SELECT vs GROUP BY
  - [ ] SELECT vs ORDER BY
  - [ ] FROM vs WHERE
  - [ ] FROM vs GROUP BY
  - [ ] FROM vs ORDER BY
  - [ ] WHERE vs GROUP BY
  - [ ] WHERE vs ORDER BY
  - [ ] GROUP BY vs ORDER BY
- [x] Implement the `primary` and `secondary` tags.
- [ ] Test the Explanation on the entire SPIDER dataset.

## Testing Cases

### Test NLTK tokenizer

In [2]:
from nltk import word_tokenize, sent_tokenize
sent = 'in step 1, find for each value of MANAGER_ID and DEPARTMENT_ID whose number of EMPLOYEE_ID greater than or equals 4 . in step 2, make sure no repetition in the results.'

In [3]:
print(sent_tokenize(sent))

['in step 1, find for each value of MANAGER_ID and DEPARTMENT_ID whose number of EMPLOYEE_ID greater than or equals 4 .', 'in step 2, make sure no repetition in the results.']


## Generate samples of modifying / adding / removing the GROUP BY Clause

In [4]:
import sys, os
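from pathlib import Path
# __file__ is undefined inside a notebook kernel, so fall back to the working directory.
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
# Make the project packages (utils, templates, ...) importable.
sys.path.append(os.path.dirname(Path(SCRIPT_DIR).parent))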

from utils.utils import load_json, edit_size, store_json
from tqdm import tqdm
from sqlComponents import Query
import json
import config
from templates import explanation, feedback, find_span
from utils.utils import print_dict
from config import ADD_TAG1, ADD_TAG2, SUB_TAG1, SUB_TAG2
data = load_json('/Users/taiyintao/Interactive_Semantic_Parsing_Correction/splash_structure_error_removed/train.json')
data += load_json('/Users/taiyintao/Interactive_Semantic_Parsing_Correction/splash_structure_error_removed/dev.json')

In [5]:
add_groupby = []
remove_groupby = []
change_or_remove_groupby_column = []
add_groupby_column = []

for idx, sample in tqdm(list(enumerate(data))):
    sample.pop('beam', None)  # drop the beam candidates if present; they are not used here
    db_id = sample['db_id']
    gold_sql = sample['gold_parse']
    pred_sql = sample['predicted_parse_with_values']
    original_explanation = sample['predicted_parse_explanation']

    pred_query = Query(pred_sql, db_id)
    gold_query = Query(gold_sql, db_id)
    sample['edit_size'] = edit_size(pred_sql, gold_sql, db_id)
    sample['generated_feedback'] = feedback(pred_sql, gold_sql, db_id)

    # Regenerate the feedback with tags shown to locate the tagged spans,
    # and restore the original setting even if something raises.
    original_show_tag = config.SHOW_TAG
    config.SHOW_TAG = True
    try:
        tagged_fb = feedback(pred_sql, gold_sql, db_id)
        primary_span = find_span(ADD_TAG1, ADD_TAG2, [SUB_TAG1, SUB_TAG2], tagged_fb)
        secondary_span = find_span(SUB_TAG1, SUB_TAG2, [ADD_TAG1, ADD_TAG2], tagged_fb)
    finally:
        config.SHOW_TAG = original_show_tag

    sample['primary_span'] = primary_span
    sample['secondary_span'] = secondary_span

    gold_has_groupby = not gold_query.group_by.is_empty
    pred_has_groupby = not pred_query.group_by.is_empty

    if gold_has_groupby and not pred_has_groupby:            # add group by
        add_groupby.append(sample)
    elif pred_has_groupby and not gold_has_groupby:          # remove group by
        remove_groupby.append(sample)
    elif gold_has_groupby and pred_has_groupby:              # modify the group by
        if gold_query.group_by != pred_query.group_by:
            gold_groupby = set(gold_query.group_by.args)
            pred_groupby = set(pred_query.group_by.args)

            added = gold_groupby - pred_groupby
            removed = pred_groupby - gold_groupby

            if added and not removed:                        # add column(s) in group by
                add_groupby_column.append(sample)
            else:                                            # change or remove column(s) in group by
                change_or_remove_groupby_column.append(sample)





100%|██████████| 8536/8536 [05:45<00:00, 24.67it/s]
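
The bucketing done in the loop above can be isolated into a small pure function, which makes the four categories easier to test without the `Query` class. This is a sketch only: `classify_groupby_edit` is a hypothetical name, and it works on plain column-name collections instead of `Query.group_by`. Unlike the loop, it returns `None` when the two column sets are equal, whereas the loop compares the `group_by` objects themselves and may still count such a pair as changed.

```python
def classify_groupby_edit(gold_cols, pred_cols):
    """Classify how the predicted GROUP BY must change to match the gold one.

    Both arguments are iterables of column names; empty means no GROUP BY clause.
    Returns None when the column sets already agree.
    """
    gold, pred = set(gold_cols), set(pred_cols)
    if gold and not pred:
        return "add_groupby"
    if pred and not gold:
        return "remove_groupby"
    if gold == pred:
        return None
    added, removed = gold - pred, pred - gold
    if added and not removed:
        return "add_groupby_column"
    # Anything that drops a predicted column, with or without adding one.
    return "change_or_remove_groupby_column"


print(classify_groupby_edit(["manager_id", "department_id"], ["manager_id"]))
print(classify_groupby_edit(["department_id"], ["manager_id"]))
```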


In [None]:
current = add_groupby_column
# Guard against an empty category to avoid a ZeroDivisionError.
average = sum(s['edit_size'] for s in current) / len(current) if current else 0.0
print(f"{len(current)} cases in {len(data)} samples, average Edit Size: {average}")
print_dict(current)
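
To make the `add` / `sub` tagging behind `primary_span` and `secondary_span` more concrete, here is a rough sketch of pulling tagged phrases out of a feedback string. It is not the project's `find_span`, whose exact behavior and return type are not shown here. The tag strings `[add]`, `[/add]`, `[sub]`, `[/sub]`, the helper names, and the example sentence are all assumptions made for illustration; the real tag values live in `config`.

```python
import re

# Hypothetical tag strings; the real ones are ADD_TAG1/ADD_TAG2/SUB_TAG1/SUB_TAG2 in config.
ADD_OPEN, ADD_CLOSE = "[add]", "[/add]"
SUB_OPEN, SUB_CLOSE = "[sub]", "[/sub]"


def tagged_phrases(text, open_tag, close_tag):
    """Return the phrases enclosed by open_tag ... close_tag, in order of appearance."""
    pattern = re.escape(open_tag) + r"(.*?)" + re.escape(close_tag)
    return [phrase.strip() for phrase in re.findall(pattern, text)]


def strip_tags(text, tags):
    """Remove every tag and normalise whitespace, giving the untagged feedback."""
    for tag in tags:
        text = text.replace(tag, "")
    return " ".join(text.split())


fb = "find [add] department id [/add] in place of [sub] manager id [/sub] ."
added = tagged_phrases(fb, ADD_OPEN, ADD_CLOSE)      # tokens the user should add
removed = tagged_phrases(fb, SUB_OPEN, SUB_CLOSE)    # tokens the user should drop
plain = strip_tags(fb, [ADD_OPEN, ADD_CLOSE, SUB_OPEN, SUB_CLOSE])
print(added, removed, plain)
```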