# 05. Annotation

We prepare a custom dataset to train a model for V-XNLI in this notebook.

The dataset was annotated by a single person: me, the first author of this project.
We originally planned to ask several people to annotate the dataset.
However, preparing an annotation UI and training annotators to understand V-XNLI, vega_zero, etc. takes a lot of time.
That cost is why there was no (large-scale) V-NLI dataset for many years.
In any case, our goal is to compare V-XNLI with typical V-NLI in the user study, not to provide a high-quality dataset.
There is a concern that the trained model overfits to the phrasing of one annotator, but we chose single-person annotation to advance our study quickly.

The dataset contains 1500 annotated inputs.
We randomly sample 350 items for the training set from the train subset of the preprocessed nvBench dataset,
75 items for the test set from the test subset, and 75 items for the validation set from the validation subset.
After that, I annotate 3 input patterns for each item (500 items × 3 = 1500 inputs).
Each input takes about 10 seconds.
While working on other tasks, I annotated the 1500 examples over 3 weeks.
As we saw in the EDA notebook, the VegaZero parser can't parse some examples.
In those cases, I tried to imagine the output figure, but I skipped some of them.
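
As a quick sanity check of the sizes quoted above, the following sketch recomputes the per-subset and total annotation counts. The constant names are illustrative and not part of the project code, and the time estimate covers only the roughly 10 seconds of writing per input, not the time spent inspecting tables and charts.

```python
# Illustrative constants taken from the description above (names are hypothetical).
ITEMS_PER_SUBSET = {"train": 350, "test": 75, "val": 75}
PATTERNS_PER_ITEM = 3
SECONDS_PER_INPUT = 10

annotations_per_subset = {
    subset: n_items * PATTERNS_PER_ITEM for subset, n_items in ITEMS_PER_SUBSET.items()
}
total_annotations = sum(annotations_per_subset.values())
pure_annotation_hours = total_annotations * SECONDS_PER_INPUT / 3600

print(annotations_per_subset)  # {'train': 1050, 'test': 225, 'val': 225}
print(total_annotations)  # 1500
print(round(pure_annotation_hours, 1))  # 4.2
```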

This dataset is used to train model-v1 in the next notebook.

## Setup

In [1]:
import json
import sqlite3
import sys

from pathlib import Path

import altair as alt
import pandas as pd

from IPython.display import display

from vxnli._vega_zero import VegaZero


In [2]:
DATA_DIR: Path = Path("../data")
DATABASE_DIR: Path = DATA_DIR.joinpath("datasets/nvBench/database/")
DATASET_V0_DIR: Path = DATA_DIR.joinpath("datasets/vxnli-v0/")
DATASET_V1_DIR: Path = DATA_DIR.joinpath("datasets/vxnli-v1/")


In [3]:
# HACK: Add ncnet parser path (it doesn't support pip install)
sys.path.append(str(DATA_DIR.joinpath("datasets/ncNet/utilities").resolve()))

from vis_rendering import VegaZero2VegaLite


In [4]:
DATASET_V1_DIR.mkdir(exist_ok=True)


In [5]:
dataset_v0_train_df = pd.read_csv(DATASET_V0_DIR.joinpath("train.csv"))
dataset_v0_test_df = pd.read_csv(DATASET_V0_DIR.joinpath("test.csv"))
dataset_v0_val_df = pd.read_csv(DATASET_V0_DIR.joinpath("val.csv"))

dataset_v0_train_df["subset"] = "train"
dataset_v0_test_df["subset"] = "test"
dataset_v0_val_df["subset"] = "val"

dataset_v0_df = pd.concat([dataset_v0_train_df, dataset_v0_test_df, dataset_v0_val_df])
dataset_v0_df = dataset_v0_df.reset_index(drop=True)

dataset_v0_df


Unnamed: 0,db_id,chart,hardness,query,question,vega_zero,SQL,table,subset
0,customers_and_products_contacts,Bar,Medium,"Visualize BAR SELECT product_name , COUNT(prod...",bar chart x axis product name y axis how many ...,mark bar encoding x product_name y aggregate c...,"SELECT product_name , COUNT(product_name) FROM...",products,train
1,network_2,Bar,Easy,"Visualize BAR SELECT job , min(age) FROM Perso...",how old is the youngest person for each job ? ...,mark bar encoding x job y aggregate min age tr...,"SELECT job , min(age) FROM Person GROUP BY job...",person,train
2,pets_1,Bar,Medium,"Visualize BAR SELECT PetType , avg(pet_age) FR...",please give me a bar chart to show the average...,mark bar encoding x pettype y aggregate mean p...,"SELECT PetType , avg(pet_age) FROM pets GROUP ...",pets,train
3,products_for_hire,Bar,Extra Hard,"Visualize BAR SELECT payment_date , COUNT(paym...",what are the payment date of the payment with ...,mark bar encoding x payment_date y aggregate c...,"SELECT payment_date , COUNT(payment_date) FROM...",payments,train
4,election,Bar,Easy,"Visualize BAR SELECT County_name , Population ...",what are the name and population of each count...,mark bar encoding x county_name y aggregate no...,"SELECT County_name , Population FROM county",county,train
...,...,...,...,...,...,...,...,...,...
15721,school_finance,Bar,Medium,"Visualize BAR SELECT County , count(*) FROM sc...",draw a bar chart of county versus the total nu...,mark bar encoding x county y aggregate count c...,"SELECT County , count(*) FROM school GROUP BY ...",school,val
15722,cre_Doc_Tracking_DB,Stacked Bar,Hard,"Visualize BAR SELECT Date_in_Locaton_To , COUN...",stacked bar of date in locaton to and the numb...,mark bar encoding x date_in_locaton_to y aggre...,"SELECT Date_in_Locaton_To , COUNT(Date_in_Loca...",document_locations,val
15723,cre_Doc_Tracking_DB,Bar,Medium,"Visualize BAR SELECT Location_Code , count(*) ...",show the location codes and the number of docu...,mark bar encoding x location_code y aggregate ...,"SELECT Location_Code , count(*) FROM Document_...",document_locations,val
15724,cre_Doc_Tracking_DB,Bar,Easy,"Visualize BAR SELECT Location_Code , count(*) ...",what is the code of each location and the numb...,mark bar encoding x location_code y aggregate ...,"SELECT Location_Code , count(*) FROM Document_...",document_locations,val


In [6]:
dataset_v1_train_df = pd.read_json(DATASET_V1_DIR.joinpath("train.ndjson"), lines=True)
dataset_v1_test_df = pd.read_json(DATASET_V1_DIR.joinpath("test.ndjson"), lines=True)
dataset_v1_val_df = pd.read_json(DATASET_V1_DIR.joinpath("val.ndjson"), lines=True)

dataset_v1_train_df["subset"] = "train"
dataset_v1_test_df["subset"] = "test"
dataset_v1_val_df["subset"] = "val"

dataset_v1_df = pd.concat([dataset_v1_train_df, dataset_v1_test_df, dataset_v1_val_df])
dataset_v1_df = dataset_v1_df.reset_index(drop=True)

dataset_v1_df


Unnamed: 0,db_id,table,vega_zero,args,kwargs,hardness,subset
0,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[],"{'scatter': True, 'x': 'shop', 'y': 'employee'...",Easy,train
1,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[],"{'graph': 'point', 'x': 'shop_id', 'y': 'emplo...",Easy,train
2,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[scatter],"{'x': 'shop', 'y': 'employee', 'color': 'full ...",Easy,train
3,employee_hire_evaluation,hiring,mark line encoding x start_from y aggregate su...,[],"{'chart': 'line', 'x': 'start from (bin by yea...",Easy,train
4,employee_hire_evaluation,hiring,mark line encoding x start_from y aggregate su...,[show a line chart],"{'x': 'start from (year)', 'y': 'sum of shop'}",Easy,train
...,...,...,...,...,...,...,...
1489,match_season,match_season,mark arc encoding x position y aggregate count...,[],"{'chart': 'pie', 'value': 'count position'}",Easy,val
1490,match_season,match_season,mark arc encoding x position y aggregate count...,[],{'proportion': 'position'},Easy,val
1491,school_finance,school,mark bar encoding x county y aggregate sum enr...,[],"{'sum': 'enrollment', 'axis': 'country', 'grap...",Easy,val
1492,school_finance,school,mark bar encoding x county y aggregate sum enr...,[],"{'bar': True, 'x': 'country', 'y': 'sum enroll...",Easy,val


In [7]:
tasks_df = dataset_v0_df[["db_id", "table", "SQL", "vega_zero", "subset", "hardness"]]
tasks_df = tasks_df.drop_duplicates()
tasks_df["done"] = False

for _, row in dataset_v1_df.iterrows():
    tasks_df.loc[
        (tasks_df["db_id"] == row["db_id"])
        & (tasks_df["vega_zero"] == row["vega_zero"])
        & (tasks_df["table"] == row["table"]),
        "done",
    ] = True

tasks_df


Unnamed: 0,db_id,table,SQL,vega_zero,subset,hardness,done
0,customers_and_products_contacts,products,"SELECT product_name , COUNT(product_name) FROM...",mark bar encoding x product_name y aggregate c...,train,Medium,False
1,network_2,person,"SELECT job , min(age) FROM Person GROUP BY job...",mark bar encoding x job y aggregate min age tr...,train,Easy,False
2,pets_1,pets,"SELECT PetType , avg(pet_age) FROM pets GROUP ...",mark bar encoding x pettype y aggregate mean p...,train,Medium,False
3,products_for_hire,payments,"SELECT payment_date , COUNT(payment_date) FROM...",mark bar encoding x payment_date y aggregate c...,train,Extra Hard,False
4,election,county,"SELECT County_name , Population FROM county",mark bar encoding x county_name y aggregate no...,train,Easy,True
...,...,...,...,...,...,...,...
15668,voter_2,voting_record,"SELECT Registration_Date , COUNT(Registration_...",mark bar encoding x registration_date y aggreg...,val,Medium,True
15674,school_finance,school,"SELECT County , count(*) FROM school GROUP BY ...",mark bar encoding x county y aggregate count c...,val,Easy,False
15675,cre_Doc_Tracking_DB,document_locations,"SELECT Date_in_Locaton_To , COUNT(Date_in_Loca...",mark bar encoding x date_in_locaton_to y aggre...,val,Extra Hard,True
15692,department_store,products,"SELECT product_type_code , min(product_price) ...",mark bar encoding x product_type_code y aggreg...,val,Easy,True
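
The row-by-row loop in cell 7 scans `tasks_df` once per annotation, so its cost grows with the number of annotations times the number of tasks. One vectorized alternative, sketched here on toy data, is a left merge on the key columns with `indicator=True`. The helper name `mark_done` and the toy frames (including the placeholder `vega_zero` strings) are assumptions for illustration; only the column names `db_id`, `table`, and `vega_zero` come from the notebook.

```python
import pandas as pd

KEY_COLUMNS = ["db_id", "table", "vega_zero"]


def mark_done(tasks: pd.DataFrame, annotated: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `tasks` with a boolean `done` column.

    A task counts as done when its (db_id, table, vega_zero) key appears in `annotated`.
    """
    annotated_keys = annotated[KEY_COLUMNS].drop_duplicates()
    merged = tasks.merge(annotated_keys, on=KEY_COLUMNS, how="left", indicator=True)
    result = tasks.copy()
    # A left merge against de-duplicated keys keeps the row order and count of `tasks`,
    # so the flags can be assigned positionally even if `tasks` has a non-default index.
    result["done"] = (merged["_merge"] == "both").to_numpy()
    return result


# Toy data: three tasks, one of which has already been annotated twice.
tasks = pd.DataFrame(
    {
        "db_id": ["election", "pets_1", "network_2"],
        "table": ["county", "pets", "person"],
        "vega_zero": ["mark bar a", "mark bar b", "mark bar c"],
    },
    index=[4, 2, 1],
)
annotated = pd.DataFrame(
    {
        "db_id": ["election", "election"],
        "table": ["county", "county"],
        "vega_zero": ["mark bar a", "mark bar a"],
    }
)

done_tasks = mark_done(tasks, annotated)
print(done_tasks["done"].tolist())  # [True, False, False]
```

De-duplicating the annotated keys first matters: each item has several annotations, and without it the merge would repeat the matching task rows.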


In [8]:
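def execute_sql(db_id: str, sql: str) -> pd.DataFrame:
    # sqlite3's `with` block only manages transactions and does not close the
    # connection, so close it explicitly.
    con = sqlite3.connect(DATABASE_DIR.joinpath(f"{db_id}/{db_id}.sqlite"))
    try:
        return pd.read_sql(sql, con)
    finally:
        con.close()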


def load_table(db_id: str, table: str) -> pd.DataFrame:
    return execute_sql(db_id, f"SELECT * FROM {table}")


def preprocess_table(df: pd.DataFrame) -> pd.DataFrame:
    # As we see in the EDA notebook, vega_zero (in ncnet) is lower-cased
    df = df.rename(columns={col: col.lower() for col in df.columns})

    for col_name, col_dtype in zip(df.columns, df.dtypes):
        if pd.api.types.is_string_dtype(col_dtype):
            df[col_name] = df[col_name].str.lower()

    return df


def solve_task(task_idx: int):
    # task_idx is an index label taken from sampled_tasks_df, so look it up by label
    task = dataset_v0_df.loc[task_idx]

    db_id, table, vega_zero, hardness, sql, subset, chart = (
        task["db_id"],
        task["table"],
        task["vega_zero"],
        task["hardness"],
        task["SQL"],
        task["subset"],
        task["chart"],
    )

    print(f"Task: {task_idx}")
    print(
        "Task Details:", task[["db_id", "table", "hardness", "chart", "SQL"]].to_dict()
    )

    df = load_table(db_id, table)
    df = preprocess_table(df)

    print("Table:")
    display(df.head(n=5))

    print("SQL Result:")

    display(execute_sql(db_id, sql).head(n=5))

    questions = dataset_v0_df.loc[
        (dataset_v0_df["db_id"] == db_id)
        & (dataset_v0_df["table"] == table)
        & (dataset_v0_df["vega_zero"] == vega_zero),
        "question",
    ]

    vega_zero = VegaZero.parse(vega_zero)

    # Restore the data field for the ncNet vega-zero parser (it was removed in preprocessing because it is not needed)
    vega_zero.data = table

    try:
        ncnet_vega_lite = VegaZero2VegaLite().to_VegaLite(str(vega_zero), df)
        ncnet_vega_lite_w_o_data = {
            k: v for k, v in ncnet_vega_lite.items() if k != "data"
        }
    except Exception as e:
        ncnet_vega_lite = None
        ncnet_vega_lite_w_o_data = None

        print(e)

    # Remove the data field again.
    # The vega_zero.to_vega_lite method does not require this,
    # but the plot function defined below should save vega_zero without the data field,
    # which is not needed for training / prediction.
    vega_zero.data = None

    try:
        our_vega_lite = vega_zero.to_vega_lite(df)
        our_vega_lite_w_o_data = {k: v for k, v in our_vega_lite.items() if k != "data"}
    except Exception as e:
        our_vega_lite = None
        our_vega_lite_w_o_data = None

        print(e)

    if (
        str(ncnet_vega_lite_w_o_data) == str(our_vega_lite_w_o_data)
        and ncnet_vega_lite is not None
        and our_vega_lite is not None
    ):
        print("Vega-Lite: ", our_vega_lite_w_o_data)

        display(alt.Chart.from_dict(our_vega_lite))
    else:
        if ncnet_vega_lite is None:
            print("Vega-Lite (ncNet): ERROR")
        else:
            print("Vega-Lite (ncNet):", ncnet_vega_lite_w_o_data)

            display(alt.Chart.from_dict(ncnet_vega_lite))

        if our_vega_lite is None:
            print("Vega-Lite (ours): ERROR")
        else:
            print("Vega-Lite (ours): ", our_vega_lite_w_o_data)

            display(alt.Chart.from_dict(our_vega_lite))

    print("Questions:")

    for q in questions.iloc[:3]:
        print(f"    {q}")

    # Emphasize vega_zero because it is the most important info in annotation
    print()
    print("Vega-Zero:")
    print()
    print(f"    {vega_zero}")
    print()

    def plot(*args, **kwargs):
        if len(args) + len(kwargs) == 0:
            raise ValueError("Provide at least one positional or keyword argument")

        with DATASET_V1_DIR.joinpath(f"{subset}.ndjson").open(mode="a") as f:
            f.write(
                json.dumps(
                    {
                        "db_id": db_id,
                        "table": table,
                        "chart": chart,
                        "hardness": hardness,
                        "vega_zero": str(vega_zero),
                        "args": list(args),
                        "kwargs": kwargs,
                    }
                )
            )
            f.write("\n")

    return plot
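
The `plot` closure returned by `solve_task` appends one JSON object per line to `{subset}.ndjson`. The sketch below isolates that append-and-read-back pattern using only the standard library and a temporary directory. `make_recorder` and the sample record are hypothetical stand-ins, not the project's code.

```python
import json
import tempfile
from pathlib import Path


def make_recorder(path: Path, base_record: dict):
    """Return a function that appends one annotation per call to an NDJSON file."""

    def record(*args, **kwargs):
        if not args and not kwargs:
            raise ValueError("Provide at least one positional or keyword argument")
        row = {**base_record, "args": list(args), "kwargs": kwargs}
        # Mode "a" appends, so earlier annotations are never overwritten.
        with path.open(mode="a") as f:
            f.write(json.dumps(row) + "\n")

    return record


with tempfile.TemporaryDirectory() as tmp_dir:
    ndjson_path = Path(tmp_dir) / "train.ndjson"
    record = make_recorder(ndjson_path, {"db_id": "toy_db", "table": "toy_table"})
    record("show a line chart", x="year")
    record(chart="line", x="year", y="count")

    # Each non-empty line is an independent JSON document.
    rows = [json.loads(line) for line in ndjson_path.read_text().splitlines() if line]

print(len(rows))  # 2
```

Appending line by line means an interrupted session loses at most the annotation being written, which suits annotation work spread over several weeks.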


In [9]:
N_ANNOTATIONS_PER_EXAMPLE: int = 3

TARGETS = {
    "train": 1050,
    "test": 225,
    "val": 225,
}

remaining_tasks = {
    "train": (TARGETS["train"] - len(dataset_v1_train_df)) // N_ANNOTATIONS_PER_EXAMPLE,
    "test": (TARGETS["test"] - len(dataset_v1_test_df)) // N_ANNOTATIONS_PER_EXAMPLE,
    "val": (TARGETS["val"] - len(dataset_v1_val_df)) // N_ANNOTATIONS_PER_EXAMPLE,
}

remaining_tasks


{'train': 1, 'test': 1, 'val': 0}
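
Note that the floor division counts whole items only, and the result would turn negative if a subset ever exceeded its target, which `DataFrame.sample` would then reject. A minimal sketch of a clamped version follows. The function name `remaining_items` and the row counts passed to it are hypothetical; the first call uses counts chosen to reproduce the output above.

```python
N_ANNOTATIONS_PER_EXAMPLE = 3
TARGETS = {"train": 1050, "test": 225, "val": 225}


def remaining_items(current_rows: dict) -> dict:
    """Number of whole items still to annotate per subset, never negative."""
    return {
        subset: max(0, (target - current_rows.get(subset, 0)) // N_ANNOTATIONS_PER_EXAMPLE)
        for subset, target in TARGETS.items()
    }


print(remaining_items({"train": 1047, "test": 222, "val": 225}))
# {'train': 1, 'test': 1, 'val': 0}
print(remaining_items({"train": 1053, "test": 224, "val": 0}))
# {'train': 0, 'test': 0, 'val': 75}
```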

In [10]:
sampled_tasks_df = [
    tasks_df[~tasks_df["done"] & (tasks_df["subset"] == subset)].sample(n)
    for subset, n in remaining_tasks.items()
]

sampled_tasks_df = pd.concat(sampled_tasks_df)
sampled_tasks_df = sampled_tasks_df.sample(frac=1)
sampled_tasks_df


Unnamed: 0,db_id,table,SQL,vega_zero,subset,hardness,done
667,employee_hire_evaluation,hiring,"SELECT Start_from , SUM(Employee_ID) FROM hiri...",mark bar encoding x start_from y aggregate sum...,train,Medium,False
12860,customers_and_invoices,accounts,"SELECT date_account_opened , COUNT(date_accoun...",mark line encoding x date_account_opened y agg...,test,Medium,False


## Annotation


In [11]:
i: int = 0


In [14]:
print(f"i: {i}")
p = solve_task(sampled_tasks_df.index[i])
i += 1


i: 1
Task: 12860
Task Details: {'db_id': 'customers_and_invoices', 'table': 'accounts', 'hardness': 'Medium', 'chart': 'Line', 'SQL': 'SELECT date_account_opened , COUNT(date_account_opened) FROM Accounts  ORDER BY date_account_opened ASC '}
Table:


Unnamed: 0,account_id,customer_id,date_account_opened,account_name,other_account_details
0,1,8,2016-07-30 22:22:24,900,regular
1,2,3,2017-05-29 16:45:17,520,vip
2,3,8,2012-05-04 18:50:32,323,regular
3,4,15,2011-03-29 15:06:59,390,vip
4,5,15,2014-08-11 18:15:14,935,regular


SQL Result:


Unnamed: 0,date_account_opened,COUNT(date_account_opened)
0,2016-07-30 22:22:24,15


Vega-Lite (ncNet): {'mark': 'line', 'encoding': {'x': {'field': 'date_account_opened', 'type': 'temporal', 'timeUnit': 'year'}, 'y': {'field': 'date_account_opened', 'type': 'quantitative', 'aggregate': 'count', 'sort': 'x'}}}


Vega-Lite (ours):  {'mark': 'line', 'encoding': {'x': {'field': 'date_account_opened', 'type': 'temporal', 'timeUnit': 'year', 'sort': 'ascending'}, 'y': {'field': 'date_account_opened', 'type': 'quantitative', 'aggregate': 'count'}}}


Questions:
    show the number of accounts opened in each year for all accounts in a line chart , list by the x-axis from low to high .
    how many accounts are opened in each year ? show a line chart , and list by the x-axis from low to high .
    i want to see trend of how many date account opened by date account opened , and show by the x from low to high please .

Vega-Zero:

    mark line encoding x date_account_opened y aggregate count date_account_opened transform sort x asc bin x by year
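
In this example the two parsers disagree only in where the sort is encoded: ncNet puts `'sort': 'x'` on the y channel, while our parser puts `'sort': 'ascending'` on the x channel. Because `solve_task` compares the two specs as strings, any such difference triggers the side-by-side display. The sketch below copies the two dictionaries printed above and lists the per-channel differences; `diff_encodings` is a hypothetical helper, not part of the project.

```python
ncnet_spec = {
    "mark": "line",
    "encoding": {
        "x": {"field": "date_account_opened", "type": "temporal", "timeUnit": "year"},
        "y": {
            "field": "date_account_opened",
            "type": "quantitative",
            "aggregate": "count",
            "sort": "x",
        },
    },
}
our_spec = {
    "mark": "line",
    "encoding": {
        "x": {
            "field": "date_account_opened",
            "type": "temporal",
            "timeUnit": "year",
            "sort": "ascending",
        },
        "y": {"field": "date_account_opened", "type": "quantitative", "aggregate": "count"},
    },
}


def diff_encodings(left: dict, right: dict) -> dict:
    """Map each channel to the keys whose values differ between the two specs."""
    differences = {}
    channels = sorted(set(left["encoding"]) | set(right["encoding"]))
    for channel in channels:
        left_channel = left["encoding"].get(channel, {})
        right_channel = right["encoding"].get(channel, {})
        changed = {
            key: (left_channel.get(key), right_channel.get(key))
            for key in sorted(set(left_channel) | set(right_channel))
            if left_channel.get(key) != right_channel.get(key)
        }
        if changed:
            differences[channel] = changed
    return differences


print(diff_encodings(ncnet_spec, our_spec))
# {'x': {'sort': (None, 'ascending')}, 'y': {'sort': ('x', None)}}
```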



In [15]:
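# Each call records one annotated input pattern (3 per item).
# Pass the annotation as args / kwargs; as written, p() with no arguments raises ValueError.
p()
p()
p()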


## Analysis

In [16]:
# Load dataset again because the sample size can increase
dataset_v1_train_df = pd.read_json(DATASET_V1_DIR.joinpath("train.ndjson"), lines=True)
dataset_v1_test_df = pd.read_json(DATASET_V1_DIR.joinpath("test.ndjson"), lines=True)
dataset_v1_val_df = pd.read_json(DATASET_V1_DIR.joinpath("val.ndjson"), lines=True)

dataset_v1_train_df["subset"] = "train"
dataset_v1_test_df["subset"] = "test"
dataset_v1_val_df["subset"] = "val"

dataset_v1_df = pd.concat([dataset_v1_train_df, dataset_v1_test_df, dataset_v1_val_df])
dataset_v1_df = dataset_v1_df.reset_index(drop=True)

dataset_v1_df


Unnamed: 0,db_id,table,vega_zero,args,kwargs,hardness,subset
0,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[],"{'scatter': True, 'x': 'shop', 'y': 'employee'...",Easy,train
1,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[],"{'graph': 'point', 'x': 'shop_id', 'y': 'emplo...",Easy,train
2,employee_hire_evaluation,hiring,mark point encoding x shop_id y aggregate none...,[scatter],"{'x': 'shop', 'y': 'employee', 'color': 'full ...",Easy,train
3,employee_hire_evaluation,hiring,mark line encoding x start_from y aggregate su...,[],"{'chart': 'line', 'x': 'start from (bin by yea...",Easy,train
4,employee_hire_evaluation,hiring,mark line encoding x start_from y aggregate su...,[show a line chart],"{'x': 'start from (year)', 'y': 'sum of shop'}",Easy,train
...,...,...,...,...,...,...,...
1495,match_season,match_season,mark arc encoding x position y aggregate count...,[],"{'chart': 'pie', 'value': 'count position'}",Easy,val
1496,match_season,match_season,mark arc encoding x position y aggregate count...,[],{'proportion': 'position'},Easy,val
1497,school_finance,school,mark bar encoding x county y aggregate sum enr...,[],"{'sum': 'enrollment', 'axis': 'country', 'grap...",Easy,val
1498,school_finance,school,mark bar encoding x county y aggregate sum enr...,[],"{'bar': True, 'x': 'country', 'y': 'sum enroll...",Easy,val


In [17]:
dataset_v1_df["subset"].value_counts()


train    1050
test      225
val       225
Name: subset, dtype: int64