# 05. Annotation

We prepare a custom dataset to train a model for V-XNLI in this notebook.

There is a single annotator: me (the first author of this project).
We originally planned to ask several people to annotate the dataset.
However, preparing an annotation UI and training annotators to understand V-XNLI, vega_zero, etc. takes a lot of time.
That's why no (large-scale) V-NLI dataset existed for many years.
There is a concern that the trained model is overfitted to the annotator, but we decided to have a single person annotate the dataset so that our study could advance quickly.
This is acceptable because our goal is to compare V-XNLI with typical V-NLI in the user study, not to provide a good dataset.

The dataset contains 1800 annotated inputs.
We sample 420 items for the training dataset from the training subset of the preprocessed nvBench dataset,
90 items for the test dataset from the test subset, and 90 items for the validation dataset from the validation subset.
After that, I annotate 3 input patterns for each item (600 items x 3 = 1800 inputs).
Each input takes about 10 to 30 seconds (this is only my impression; I did not record timestamps, which I should have).
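
The sample sizes above follow from a simple product. A minimal sketch of that arithmetic (the names `ITEMS_PER_SUBSET` and `PATTERNS_PER_ITEM` are made up for this illustration and are not part of the project code):

```python
# Number of sampled nvBench items per subset (taken from the text above)
ITEMS_PER_SUBSET = {"train": 420, "test": 90, "val": 90}
# Number of input patterns annotated for each item
PATTERNS_PER_ITEM = 3

inputs_per_subset = {
    subset: n_items * PATTERNS_PER_ITEM
    for subset, n_items in ITEMS_PER_SUBSET.items()
}
total_inputs = sum(inputs_per_subset.values())

print(inputs_per_subset)  # {'train': 1260, 'test': 270, 'val': 270}
print(total_inputs)  # 1800
```

These per-subset numbers are the same as the `TARGETS` used later in this notebook.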

There are some caveats.
First, as mentioned above, the annotation was done by a single person (me).
Second, as we saw in the EDA notebook, the vega zero parser can't handle some examples.
In those cases, I tried to imagine the output figure, but I skipped most of them.
Finally, the annotation process was iterative, and I did not keep the comparison against V-NLI in mind (that was my mistake).
We first annotated about 250 "easy" examples (according to nvBench's hardness label) so that we could train a model for prototyping.
After that, we annotated about 500 "uneasy" (non-easy) examples (750 in total) to balance the hardness distribution, and then another 750 examples (1500 in total) at random.
However, we noticed that the trained model could not predict rare operations well.
For example, I couldn't plot a stacked bar chart with the trained model, even though I could with our baseline (V-NLI) model.
We therefore cherry-picked "uneasy" examples and "rare" examples containing between, like, filter, and color operations intermittently between the 1050th and the 1800th example.
As a result, our dataset contains 15 topk examples (1.96 times their share in nvBench), 141 color examples (1.48 times), and 1251 "uneasy" examples (1.05 times).
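
To make the "times nvBench" figures concrete, here is a minimal sketch of how such a ratio can be computed, assuming it is the share in our dataset divided by the share in nvBench. The helper name `enrichment` is hypothetical; the nvBench color share is copied from the analysis output later in this notebook.

```python
def enrichment(count_ours: int, total_ours: int, share_nvbench: float) -> float:
    """Return how many times more frequent an operation is in our dataset than in nvBench."""
    return (count_ours / total_ours) / share_nvbench


# 141 color examples out of 1800 inputs; nvBench share from the analysis section
color_ratio = enrichment(141, 1800, 0.05277883759379372)

print(round(color_ratio, 2))  # 1.48
```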

The annotated examples don't have timestamps (I should have recorded them).
However, the annotation order is preserved,
so you can trace our annotation process to some extent.
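
For reference, recording a timestamp would have been cheap. The following is only a sketch of one way to do it, not what this notebook does: the helper `append_annotation`, the field name `annotated_at`, and the example records are all hypothetical.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_annotation(path: Path, record: dict) -> dict:
    """Append one annotation to an NDJSON file, adding a UTC timestamp."""
    stamped = {**record, "annotated_at": datetime.now(timezone.utc).isoformat()}

    with path.open(mode="a") as f:
        f.write(json.dumps(stamped) + "\n")

    return stamped


# Usage with a throwaway file; the record values are made up for illustration
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "train.ndjson"

    append_annotation(out_path, {"db_id": "election", "args": ["bar chart"], "kwargs": {}})
    append_annotation(out_path, {"db_id": "election", "args": [], "kwargs": {"x": "county"}})

    records = [json.loads(line) for line in out_path.read_text().splitlines()]

print(len(records), sorted(records[0]))
```

Because the file is opened in append mode, the line order still reflects the annotation order, and the timestamp adds the duration information that is missing here.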

We use this dataset to train model-v1 in the following notebook.


## Setup

In [1]:
import json
import sqlite3
import sys

from pathlib import Path

import altair as alt
import pandas as pd

from IPython.display import display

from vxnli._vega_zero import VegaZero


In [2]:
DATA_DIR: Path = Path("../data")
DATABASE_DIR: Path = DATA_DIR.joinpath("datasets/nvBench/database/")
DATASET_V0_DIR: Path = DATA_DIR.joinpath("datasets/vxnli-v0/")
DATASET_V1_DIR: Path = DATA_DIR.joinpath("datasets/vxnli-v1/")


In [4]:
# HACK: Add ncnet parser path (it doesn't support pip install)
sys.path.append(str(DATA_DIR.joinpath("datasets/ncNet/utilities").resolve()))

from vis_rendering import VegaZero2VegaLite


In [3]:
DATASET_V1_DIR.mkdir(exist_ok=True)


In [4]:
dataset_v0_train_df = pd.read_csv(DATASET_V0_DIR.joinpath("train.csv"))
dataset_v0_test_df = pd.read_csv(DATASET_V0_DIR.joinpath("test.csv"))
dataset_v0_val_df = pd.read_csv(DATASET_V0_DIR.joinpath("val.csv"))

dataset_v0_train_df["subset"] = "train"
dataset_v0_test_df["subset"] = "test"
dataset_v0_val_df["subset"] = "val"

dataset_v0_df = pd.concat([dataset_v0_train_df, dataset_v0_test_df, dataset_v0_val_df])
dataset_v0_df = dataset_v0_df.reset_index(drop=True)

dataset_v0_df


Unnamed: 0,db_id,chart,hardness,query,question,vega_zero,SQL,table,subset
0,customers_and_products_contacts,Bar,Medium,"Visualize BAR SELECT product_name , COUNT(prod...",bar chart x axis product name y axis how many ...,mark bar encoding x product_name y aggregate c...,"SELECT product_name , COUNT(product_name) FROM...",products,train
1,network_2,Bar,Easy,"Visualize BAR SELECT job , min(age) FROM Perso...",how old is the youngest person for each job ? ...,mark bar encoding x job y aggregate min age tr...,"SELECT job , min(age) FROM Person GROUP BY job...",person,train
2,pets_1,Bar,Medium,"Visualize BAR SELECT PetType , avg(pet_age) FR...",please give me a bar chart to show the average...,mark bar encoding x pettype y aggregate mean p...,"SELECT PetType , avg(pet_age) FROM pets GROUP ...",pets,train
3,products_for_hire,Bar,Extra Hard,"Visualize BAR SELECT payment_date , COUNT(paym...",what are the payment date of the payment with ...,mark bar encoding x payment_date y aggregate c...,"SELECT payment_date , COUNT(payment_date) FROM...",payments,train
4,election,Bar,Easy,"Visualize BAR SELECT County_name , Population ...",what are the name and population of each count...,mark bar encoding x county_name y aggregate no...,"SELECT County_name , Population FROM county",county,train
...,...,...,...,...,...,...,...,...,...
15721,school_finance,Bar,Medium,"Visualize BAR SELECT County , count(*) FROM sc...",draw a bar chart of county versus the total nu...,mark bar encoding x county y aggregate count c...,"SELECT County , count(*) FROM school GROUP BY ...",school,val
15722,cre_Doc_Tracking_DB,Stacked Bar,Hard,"Visualize BAR SELECT Date_in_Locaton_To , COUN...",stacked bar of date in locaton to and the numb...,mark bar encoding x date_in_locaton_to y aggre...,"SELECT Date_in_Locaton_To , COUNT(Date_in_Loca...",document_locations,val
15723,cre_Doc_Tracking_DB,Bar,Medium,"Visualize BAR SELECT Location_Code , count(*) ...",show the location codes and the number of docu...,mark bar encoding x location_code y aggregate ...,"SELECT Location_Code , count(*) FROM Document_...",document_locations,val
15724,cre_Doc_Tracking_DB,Bar,Easy,"Visualize BAR SELECT Location_Code , count(*) ...",what is the code of each location and the numb...,mark bar encoding x location_code y aggregate ...,"SELECT Location_Code , count(*) FROM Document_...",document_locations,val


In [5]:
dataset_v1_train_df = pd.read_json(DATASET_V1_DIR.joinpath("train.ndjson"), lines=True)
dataset_v1_test_df = pd.read_json(DATASET_V1_DIR.joinpath("test.ndjson"), lines=True)
dataset_v1_val_df = pd.read_json(DATASET_V1_DIR.joinpath("val.ndjson"), lines=True)

dataset_v1_train_df["subset"] = "train"
dataset_v1_test_df["subset"] = "test"
dataset_v1_val_df["subset"] = "val"

dataset_v1_df = pd.concat([dataset_v1_train_df, dataset_v1_test_df, dataset_v1_val_df])
dataset_v1_df = dataset_v1_df.reset_index(drop=True)

dataset_v1_df


Unnamed: 0,db_id,table,chart,hardness,vega_zero,args,kwargs,subset
0,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[],"{'scatter': True, 'x': 'shop', 'y': 'employee'...",train
1,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[],"{'graph': 'point', 'x': 'shop_id', 'y': 'emplo...",train
2,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[scatter],"{'x': 'shop', 'y': 'employee', 'color': 'full ...",train
3,employee_hire_evaluation,hiring,line,Easy,mark line encoding x start_from y aggregate su...,[],"{'chart': 'line', 'x': 'start from (bin by yea...",train
4,employee_hire_evaluation,hiring,line,Easy,mark line encoding x start_from y aggregate su...,[show a line chart],"{'x': 'start from (year)', 'y': 'sum of shop'}",train
...,...,...,...,...,...,...,...,...
1795,activity_1,faculty,Bar,Medium,mark bar encoding x building y aggregate count...,[],"{'value': 'count', 'axis': 'building', 'sort_y...",val
1796,activity_1,faculty,Bar,Medium,mark bar encoding x building y aggregate count...,[sort values in descending order],"{'count_records': True, 'for_each': 'building'}",val
1797,assets_maintenance,assets,Bar,Medium,mark bar encoding x asset_make y aggregate cou...,[draw a bar chart of asset make versus the num...,{'z_to_a': True},val
1798,assets_maintenance,assets,Bar,Medium,mark bar encoding x asset_make y aggregate cou...,[asset_make vs count],{'sort_alphabetically': True},val


In [7]:
tasks_df = dataset_v0_df[["db_id", "table", "SQL", "vega_zero", "subset", "hardness"]]
tasks_df = tasks_df.drop_duplicates()
tasks_df["done"] = False

for _, row in dataset_v1_df.iterrows():
    tasks_df.loc[
        (tasks_df["db_id"] == row["db_id"])
        & (tasks_df["vega_zero"] == row["vega_zero"])
        & (tasks_df["table"] == row["table"]),
        "done",
    ] = True

tasks_df


Unnamed: 0,db_id,table,SQL,vega_zero,subset,hardness,done
0,customers_and_products_contacts,products,"SELECT product_name , COUNT(product_name) FROM...",mark bar encoding x product_name y aggregate c...,train,Medium,False
1,network_2,person,"SELECT job , min(age) FROM Person GROUP BY job...",mark bar encoding x job y aggregate min age tr...,train,Easy,False
2,pets_1,pets,"SELECT PetType , avg(pet_age) FROM pets GROUP ...",mark bar encoding x pettype y aggregate mean p...,train,Medium,False
3,products_for_hire,payments,"SELECT payment_date , COUNT(payment_date) FROM...",mark bar encoding x payment_date y aggregate c...,train,Extra Hard,False
4,election,county,"SELECT County_name , Population FROM county",mark bar encoding x county_name y aggregate no...,train,Easy,True
...,...,...,...,...,...,...,...
15668,voter_2,voting_record,"SELECT Registration_Date , COUNT(Registration_...",mark bar encoding x registration_date y aggreg...,val,Medium,True
15674,school_finance,school,"SELECT County , count(*) FROM school GROUP BY ...",mark bar encoding x county y aggregate count c...,val,Easy,False
15675,cre_Doc_Tracking_DB,document_locations,"SELECT Date_in_Locaton_To , COUNT(Date_in_Loca...",mark bar encoding x date_in_locaton_to y aggre...,val,Extra Hard,True
15692,department_store,products,"SELECT product_type_code , min(product_price) ...",mark bar encoding x product_type_code y aggreg...,val,Easy,True
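
The row-by-row loop above is fine at this scale, but the same "done" flag can also be computed with a single join. The sketch below uses toy stand-ins for `tasks_df` and `dataset_v1_df` (all values are made up) and is an alternative formulation, not the code that produced the table above.

```python
import pandas as pd

KEYS = ["db_id", "table", "vega_zero"]

# Toy stand-in for tasks_df; the index is non-contiguous, as after drop_duplicates()
tasks = pd.DataFrame(
    {
        "db_id": ["election", "pets_1", "network_2"],
        "table": ["county", "pets", "person"],
        "vega_zero": ["mark bar a", "mark bar b", "mark bar c"],
    },
    index=[0, 2, 5],
)

# Toy stand-in for dataset_v1_df; annotated items appear several times
annotated = pd.DataFrame(
    {
        "db_id": ["election", "election", "network_2"],
        "table": ["county", "county", "person"],
        "vega_zero": ["mark bar a", "mark bar a", "mark bar c"],
    }
)

# Left join on the key columns; "_merge" tells whether a task has an annotation
merged = tasks.merge(
    annotated[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
)

# merge() returns a fresh 0..n-1 index, so assign by position rather than by label
tasks["done"] = (merged["_merge"] == "both").to_numpy()

print(tasks["done"].tolist())  # [True, False, True]
```

Deduplicating the right-hand side first keeps one output row per task, which is what makes the positional assignment safe.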


In [8]:
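from contextlib import closing


def execute_sql(db_id: str, sql: str) -> pd.DataFrame:
    # sqlite3's own context manager only commits or rolls back the transaction,
    # so wrap the connection in closing() to make sure it is closed as well
    with closing(
        sqlite3.connect(DATABASE_DIR.joinpath(f"{db_id}/{db_id}.sqlite"))
    ) as con:
        return pd.read_sql(sql, con)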


def load_table(db_id: str, table: str) -> pd.DataFrame:
    return execute_sql(db_id, f"SELECT * FROM {table}")


def preprocess_table(df: pd.DataFrame) -> pd.DataFrame:
    # As we see in the EDA notebook, vega_zero (in ncnet) is lower-cased
    df = df.rename(columns={col: col.lower() for col in df.columns})

    for col_name, col_dtype in zip(df.columns, df.dtypes):
        if pd.api.types.is_string_dtype(col_dtype):
            df[col_name] = df[col_name].str.lower()

    return df


def solve_task(task_idx: int):
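    # task_idx is an index label of dataset_v0_df (a RangeIndex after reset_index),
    # so look the row up by label
    task = dataset_v0_df.loc[task_idx]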

    db_id, table, vega_zero, hardness, sql, subset, chart = (
        task["db_id"],
        task["table"],
        task["vega_zero"],
        task["hardness"],
        task["SQL"],
        task["subset"],
        task["chart"],
    )

    print(f"Task: {task_idx}")
    print(
        "Task Details:", task[["db_id", "table", "hardness", "chart", "SQL"]].to_dict()
    )

    df = load_table(db_id, table)
    df = preprocess_table(df)

    print("Table:")
    display(df.head(n=5))

    print("SQL Result:")

    display(execute_sql(db_id, sql).head(n=5))

    questions = dataset_v0_df.loc[
        (dataset_v0_df["db_id"] == db_id)
        & (dataset_v0_df["table"] == table)
        & (dataset_v0_df["vega_zero"] == vega_zero),
        "question",
    ]

    vega_zero = VegaZero.parse(vega_zero)

    # Restore the data field for the ncNet vega-zero parser (it was removed in preprocessing because we don't need it)
    vega_zero.data = table

    try:
        ncnet_vega_lite = VegaZero2VegaLite().to_VegaLite(str(vega_zero), df)
        ncnet_vega_lite_w_o_data = {
            k: v for k, v in ncnet_vega_lite.items() if k != "data"
        }
    except Exception as e:
        ncnet_vega_lite = None
        ncnet_vega_lite_w_o_data = None

        print(e)

    # Remove the data field again.
    # The vega_zero.to_vega_lite method does not require this,
    # but the plot function defined below should save vega_zero without the data field
    # because it is not needed for training / prediction.
    vega_zero.data = None

    try:
        our_vega_lite = vega_zero.to_vega_lite(df)
        our_vega_lite_w_o_data = {k: v for k, v in our_vega_lite.items() if k != "data"}
    except Exception as e:
        our_vega_lite = None
        our_vega_lite_w_o_data = None

        print(e)

    if (
        str(ncnet_vega_lite_w_o_data) == str(our_vega_lite_w_o_data)
        and ncnet_vega_lite is not None
        and our_vega_lite is not None
    ):
        print("Vega-Lite: ", our_vega_lite_w_o_data)

        display(alt.Chart.from_dict(our_vega_lite))
    else:
        if ncnet_vega_lite is None:
            print("Vega-Lite (ncNet): ERROR")
        else:
            print("Vega-Lite (ncNet):", ncnet_vega_lite_w_o_data)

            display(alt.Chart.from_dict(ncnet_vega_lite))

        if our_vega_lite is None:
            print("Vega-Lite (ours): ERROR")
        else:
            print("Vega-Lite (ours): ", our_vega_lite_w_o_data)

            display(alt.Chart.from_dict(our_vega_lite))

    print("Questions:")

    for q in questions.iloc[:3]:
        print(f"    {q}")

    # Emphasize vega_zero because it is the most important info in annotation
    print()
    print("Vega-Zero:")
    print()
    print(f"    {vega_zero}")
    print()

    def plot(*args, **kwargs):
        if len(args) + len(kwargs) == 0:
            raise ValueError("Provide at least one positional or keyword argument")

        with DATASET_V1_DIR.joinpath(f"{subset}.ndjson").open(mode="a") as f:
            f.write(
                json.dumps(
                    {
                        "db_id": db_id,
                        "table": table,
                        "chart": chart,
                        "hardness": hardness,
                        "vega_zero": str(vega_zero),
                        "args": list(args),
                        "kwargs": kwargs,
                    }
                )
            )
            f.write("\n")

    return plot


In [9]:
N_ANNOTATIONS_PER_EXAMPLE: int = 3

TARGETS = {
    "train": 1260,
    "test": 270,
    "val": 270,
}

remaining_tasks = {
    "train": (TARGETS["train"] - len(dataset_v1_train_df)) // N_ANNOTATIONS_PER_EXAMPLE,
    "test": (TARGETS["test"] - len(dataset_v1_test_df)) // N_ANNOTATIONS_PER_EXAMPLE,
    "val": (TARGETS["val"] - len(dataset_v1_val_df)) // N_ANNOTATIONS_PER_EXAMPLE,
}

remaining_tasks


{'train': 2, 'test': 0, 'val': 0}

In [10]:
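# Variant 1: sample uniformly from the remaining tasks (used for the random phase).
# NOTE: this result is overwritten by variant 2 below; keep only the one you need.
sampled_tasks_df = [
    tasks_df[~tasks_df["done"] & (tasks_df["subset"] == subset)].sample(n)
    for subset, n in remaining_tasks.items()
]

# Variant 2: sample only from the "uneasy" (non-Easy) remaining tasks.
sampled_tasks_df = [
    tasks_df[
        ~tasks_df["done"]
        & (tasks_df["subset"] == subset)
        & (tasks_df["hardness"] != "Easy")
    ].sample(n)
    for subset, n in remaining_tasks.items()
]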

sampled_tasks_df = pd.concat(sampled_tasks_df)
sampled_tasks_df = sampled_tasks_df.sample(frac=1)
sampled_tasks_df


Unnamed: 0,db_id,table,SQL,vega_zero,subset,hardness,done
619,university_basketball,basketball_match,"SELECT ACC_Road , SUM(Team_ID) FROM basketball...",mark bar encoding x acc_road y aggregate sum t...,train,Medium,False
5757,station_weather,station,"SELECT local_authority , COUNT(local_authority...",mark bar encoding x local_authority y aggregat...,train,Hard,False
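
The sampling above draws different tasks on every run. If a reproducible draw is wanted, `DataFrame.sample` accepts a `random_state`. The sketch below shows this on a made-up task table; the seed value and the toy data are assumptions for illustration only.

```python
import pandas as pd

# Toy task table (made-up values) with the columns used for filtering
tasks = pd.DataFrame(
    {
        "subset": ["train"] * 6 + ["test"] * 2,
        "hardness": ["Easy", "Medium", "Hard", "Medium", "Extra Hard", "Easy", "Hard", "Easy"],
        "done": [False, False, True, False, False, False, False, False],
    }
)
remaining = {"train": 2, "test": 1}

# Keep only tasks that are neither annotated nor "Easy"
candidates = tasks[~tasks["done"] & (tasks["hardness"] != "Easy")]

# A fixed random_state makes the draw repeatable across runs
sampled = pd.concat(
    [
        candidates[candidates["subset"] == subset].sample(n, random_state=0)
        for subset, n in remaining.items()
    ]
)

print(sampled)
```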


## Annotation


In [12]:
i: int = 0


In [15]:
print(f"i: {i}")
p = solve_task(sampled_tasks_df.index[i])
i += 1


i: 1
Task: 5757
Task Details: {'db_id': 'station_weather', 'table': 'station', 'hardness': 'Hard', 'chart': 'Stacked Bar', 'SQL': 'SELECT local_authority , COUNT(local_authority) FROM station GROUP BY services ,  local_authority ORDER BY local_authority ASC'}
Table:


Unnamed: 0,id,network_name,services,local_authority
0,1,amersham,metropolitan line and chiltern railways,chiltern
1,2,bushey,london overground and london midland,watford
2,3,brentwood,greater anglia,brentwood
3,4,broxbourne,greater anglia,broxbourne
4,5,carpenders park,london overground,three rivers


SQL Result:


Unnamed: 0,local_authority,COUNT(local_authority)
0,Brentwood,1
1,Broxbourne,2
2,Chiltern,1
3,Chiltern,2
4,Three Rivers,1


Vega-Lite (ncNet): {'mark': 'bar', 'encoding': {'x': {'field': 'local_authority', 'type': 'nominal'}, 'y': {'field': 'local_authority', 'type': 'quantitative', 'aggregate': 'count', 'sort': 'x'}, 'color': {'field': 'services', 'type': 'nominal'}}}


Vega-Lite (ours):  {'mark': 'bar', 'encoding': {'x': {'field': 'local_authority', 'type': 'nominal', 'sort': 'ascending'}, 'y': {'field': 'local_authority', 'type': 'quantitative', 'aggregate': 'count'}, 'color': {'field': 'services', 'type': 'nominal'}}}


Questions:
    compute the number of services by services and then split by local authorities show the result with a stacked bar graph , show by the x-axis from low to high .
    count services by services , and split by local authorities with a stacked bar chart , and show by the bars from low to high .
    stack bar chart of the number of local authority vs services based on local authority , and could you order from low to high by the bars ?

Vega-Zero:

    mark bar encoding x local_authority y aggregate count local_authority color services transform group x sort x asc



In [16]:
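# One call per input pattern: pass the annotated input as args and / or kwargs.
# Calling p() without any argument raises ValueError (see plot in solve_task).
p()
p()
p()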


## Analysis

In [6]:
# Load dataset again because the sample size can increase
dataset_v1_train_df = pd.read_json(DATASET_V1_DIR.joinpath("train.ndjson"), lines=True)
dataset_v1_test_df = pd.read_json(DATASET_V1_DIR.joinpath("test.ndjson"), lines=True)
dataset_v1_val_df = pd.read_json(DATASET_V1_DIR.joinpath("val.ndjson"), lines=True)

dataset_v1_train_df["subset"] = "train"
dataset_v1_test_df["subset"] = "test"
dataset_v1_val_df["subset"] = "val"

dataset_v1_df = pd.concat([dataset_v1_train_df, dataset_v1_test_df, dataset_v1_val_df])
dataset_v1_df = dataset_v1_df.reset_index(drop=True)

dataset_v1_df


Unnamed: 0,db_id,table,chart,hardness,vega_zero,args,kwargs,subset
0,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[],"{'scatter': True, 'x': 'shop', 'y': 'employee'...",train
1,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[],"{'graph': 'point', 'x': 'shop_id', 'y': 'emplo...",train
2,employee_hire_evaluation,hiring,point,Easy,mark point encoding x shop_id y aggregate none...,[scatter],"{'x': 'shop', 'y': 'employee', 'color': 'full ...",train
3,employee_hire_evaluation,hiring,line,Easy,mark line encoding x start_from y aggregate su...,[],"{'chart': 'line', 'x': 'start from (bin by yea...",train
4,employee_hire_evaluation,hiring,line,Easy,mark line encoding x start_from y aggregate su...,[show a line chart],"{'x': 'start from (year)', 'y': 'sum of shop'}",train
...,...,...,...,...,...,...,...,...
1795,activity_1,faculty,Bar,Medium,mark bar encoding x building y aggregate count...,[],"{'value': 'count', 'axis': 'building', 'sort_y...",val
1796,activity_1,faculty,Bar,Medium,mark bar encoding x building y aggregate count...,[sort values in descending order],"{'count_records': True, 'for_each': 'building'}",val
1797,assets_maintenance,assets,Bar,Medium,mark bar encoding x asset_make y aggregate cou...,[draw a bar chart of asset make versus the num...,{'z_to_a': True},val
1798,assets_maintenance,assets,Bar,Medium,mark bar encoding x asset_make y aggregate cou...,[asset_make vs count],{'sort_alphabetically': True},val


In [7]:
dataset_v1_df["subset"].value_counts()


train    1260
test      270
val       270
Name: subset, dtype: int64

In [16]:
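def fix_chart(vega_zero: str) -> str:
    parsed = VegaZero.parse(vega_zero)

    mark = parsed.mark
    color = parsed.encoding.color

    # A color encoding turns the basic chart into its stacked / grouping variant
    if color is not None:
        if mark == "bar":
            return "Stacked Bar"

        if mark == "line":
            return "Grouping Line"

        # nvBench's label for a colored scatter plot (see dataset_v0_df["chart"])
        if mark == "point":
            return "Grouping Scatter"

        raise ValueError(f"Unexpected mark with a color encoding: {mark!r}")

    if mark == "bar":
        return "Bar"

    if mark == "line":
        return "Line"

    if mark == "arc":
        return "Pie"

    if mark == "point":
        return "Scatter"

    raise ValueError(f"Unexpected mark: {mark!r}")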


dataset_v1_df["chart"] = dataset_v1_df["vega_zero"].apply(fix_chart)


In [20]:
dataset_v1_df["chart"].value_counts()


Bar              1218
Pie               156
Scatter           144
Line              141
Stacked Bar       114
Grouping Line      27
Name: chart, dtype: int64

In [21]:
dataset_v0_df["chart"].value_counts()


Bar                 11592
Pie                  1124
Line                 1078
Scatter               722
Stacked Bar           624
Grouping Scatter      377
Grouping Line         209
Name: chart, dtype: int64

In [24]:
# v1 shares among Pie, Scatter, Line, and Stacked Bar (in this order; 156 + 144 + 141 + 114 = 555)
[156 / 555, 144 / 555, 141 / 555, 114 / 555]

[0.2810810810810811,
 0.2594594594594595,
 0.25405405405405407,
 0.20540540540540542]

In [27]:
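# v0 shares among Pie, Line, Scatter, and Stacked Bar (in this order, which differs from the v1 list)
[1124 / 3548, 1078 / 3548, 722 / 3548, 624 / 3548]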

[0.31679819616685456,
 0.3038331454340473,
 0.20349492671927846,
 0.17587373167981962]

In [26]:
sum([1124, 1078, 722, 624])

3548
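
The two hand-written lists above put Scatter and Line in different positions, so they cannot be compared element by element. A small sketch that lets pandas align the shares by chart name (the counts are copied from the two `value_counts()` outputs above; Bar and the grouping charts are left out, as in the lists):

```python
import pandas as pd

# Counts copied from the value_counts() outputs above
v1_counts = pd.Series({"Pie": 156, "Scatter": 144, "Line": 141, "Stacked Bar": 114})
v0_counts = pd.Series({"Pie": 1124, "Line": 1078, "Scatter": 722, "Stacked Bar": 624})

# Building a DataFrame from Series aligns them by index label, not by position
shares = pd.DataFrame(
    {
        "v1": v1_counts / v1_counts.sum(),
        "v0": v0_counts / v0_counts.sum(),
    }
)
shares["v1 / v0"] = shares["v1"] / shares["v0"]

print(shares.round(3))
```

A ratio above 1 in the last column means the chart type is over-represented in our dataset relative to nvBench within this group of charts.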

In [33]:
# Share of examples containing "between": (v1, v0)
(
    len(dataset_v1_df[dataset_v1_df["vega_zero"].str.contains(" between ")]) / len(dataset_v1_df),
    len(dataset_v0_df[dataset_v0_df["vega_zero"].str.contains(" between ")]) / len(dataset_v0_df),
)

(0.04833333333333333, 0.060027979142820806)

In [34]:
# Share of examples containing "color": (v1, v0)
(
    len(dataset_v1_df[dataset_v1_df["vega_zero"].str.contains(" color ")]) / len(dataset_v1_df),
    len(dataset_v0_df[dataset_v0_df["vega_zero"].str.contains(" color ")]) / len(dataset_v0_df),
)

(0.07833333333333334, 0.05277883759379372)
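
The two cells above repeat the same expression for each keyword. A small helper can express the same test once; this is a sketch with a hypothetical name (`keyword_share`) and toy vega_zero strings, not code from the project.

```python
import pandas as pd


def keyword_share(vega_zero: pd.Series, keyword: str) -> float:
    """Share of vega_zero strings that contain `keyword` surrounded by spaces."""
    # regex=False treats the keyword as a plain substring
    return float(vega_zero.str.contains(f" {keyword} ", regex=False).mean())


# Toy vega_zero strings (shortened, for illustration only)
toy = pd.Series(
    [
        "mark bar encoding x a y aggregate count a color b transform group x",
        "mark bar encoding x a y aggregate count a transform filter a between 1 and 2",
        "mark point encoding x a y aggregate none b",
        "mark line encoding x a y aggregate sum b color c",
    ]
)

print(keyword_share(toy, "color"))  # 0.5
print(keyword_share(toy, "between"))  # 0.25
```

The mean of a boolean Series is the fraction of `True` values, which matches the `len(...) / len(...)` computation used above as long as the column has no missing values.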

In [38]:
dataset_v0_df["hardness"].value_counts() / dataset_v0_df["hardness"].value_counts().sum()

Medium        0.440735
Easy          0.354954
Hard          0.116622
Extra Hard    0.087689
Name: hardness, dtype: float64

In [37]:
dataset_v1_df["hardness"].value_counts() / dataset_v1_df["hardness"].value_counts().sum()

Medium        0.455000
Easy          0.305000
Hard          0.161667
Extra Hard    0.078333
Name: hardness, dtype: float64

In [50]:
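# Number of color examples in the v0 train subset
print(len(dataset_v0_train_df[dataset_v0_train_df["vega_zero"].str.contains(" color ")]))
print()

# Cumulative number of color examples in the v1 train subset at each annotation checkpoint.
# NOTE: the slice assumes that the train subset holds 80% of all annotations, but it holds
# 1260 / 1800 = 70%, so the 1650 and 1800 checkpoints both cover the whole train subset.
for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_train_df[:int(i*0.8)]
    tmp = tmp[tmp["vega_zero"].str.contains(" color ")]

    print(f"{i:4d}: {len(tmp)}")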


610

 900: 30
1050: 33
1200: 33
1350: 54
1500: 63
1650: 81
1800: 81


In [53]:
63 / 15000

0.0042

In [54]:
15 / 1800

0.008333333333333333

In [57]:
print(len(dataset_v0_train_df[dataset_v0_train_df["vega_zero"].str.contains(" topk ")]))
print()

for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_train_df[:int(i*0.8)]
    tmp = tmp[tmp["vega_zero"].str.contains(" topk ")]

    print(f"{i:4d}: {len(tmp)}")


print(len(dataset_v0_test_df[dataset_v0_test_df["vega_zero"].str.contains(" topk ")]))
print()

for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_test_df[:int(i*0.1)]
    tmp = tmp[tmp["vega_zero"].str.contains(" topk ")]

    print(f"{i:4d}: {len(tmp)}")


print(len(dataset_v0_val_df[dataset_v0_val_df["vega_zero"].str.contains(" topk ")]))
print()

for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_val_df[:int(i*0.1)]
    tmp = tmp[tmp["vega_zero"].str.contains(" topk ")]

    print(f"{i:4d}: {len(tmp)}")



63

 900: 3
1050: 3
1200: 6
1350: 12
1500: 15
1650: 15
1800: 15
0

 900: 0
1050: 0
1200: 0
1350: 0
1500: 0
1650: 0
1800: 0
4

 900: 0
1050: 0
1200: 0
1350: 0
1500: 0
1650: 0
1800: 0


In [51]:
print(len(dataset_v0_train_df[dataset_v0_train_df["vega_zero"].str.contains(" like ")]))
print()

for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_train_df[:int(i*0.8)]
    tmp = tmp[tmp["vega_zero"].str.contains(" like ")]

    print(f"{i:4d}: {len(tmp)}")


754

 900: 48
1050: 54
1200: 60
1350: 60
1500: 72
1650: 75
1800: 75


In [52]:
print(len(dataset_v0_train_df[dataset_v0_train_df["vega_zero"].str.contains(" between ")]))
print()

for i in [900, 1050, 1200, 1350, 1500, 1650, 1800]:
    tmp = dataset_v1_train_df[:int(i*0.8)]
    tmp = tmp[tmp["vega_zero"].str.contains(" between ")]

    print(f"{i:4d}: {len(tmp)}")


924

 900: 33
1050: 45
1200: 57
1350: 60
1500: 81
1650: 81
1800: 81
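
The checkpoint loops above all follow the same pattern: count how many of the first n rows contain a keyword. A minimal sketch of a reusable version, in which the checkpoints are given directly in rows of the subset so that no conversion from the total annotation count is needed (the helper name and the toy data are hypothetical):

```python
import pandas as pd


def cumulative_keyword_counts(
    vega_zero: pd.Series, keyword: str, checkpoints: list
) -> dict:
    """Count rows containing ` keyword ` among the first n rows, for each checkpoint n."""
    hits = vega_zero.str.contains(f" {keyword} ", regex=False).cumsum()

    counts = {}
    for n in checkpoints:
        # Clamp checkpoints that exceed the number of rows
        last = min(n, len(hits))
        counts[n] = int(hits.iloc[last - 1]) if last > 0 else 0

    return counts


# Toy strings in annotation order (for illustration only)
toy = pd.Series(["a color b", "a b", "x color y", "p q", "m color n"])

print(cumulative_keyword_counts(toy, "color", [2, 4, 6]))  # {2: 1, 4: 2, 6: 3}
```

This relies on the annotation order being preserved in the NDJSON files, as noted at the top of this notebook.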

