# Column Type Annotation with ArcheType

The following notebook is designed to get you up and running with [ArcheType](https://arxiv.org/abs/2310.18208), a new framework for column type annotation using large language models.

Unlike most existing CTA solutions, ArcheType can operate *zero-shot* -- it doesn't require any labeled data to get up and running with new column types. This flexibility is extremely helpful when tackling real-world data cleaning challenges.



## Setup

In [1]:
!git clone https://github.com/penfever/archetype/
import os
os.chdir("archetype")

!pip install -r requirements.txt -qq

Cloning into 'archetype'...
remote: Enumerating objects: 463, done.
remote: Counting objects: 100% (308/308), done.
remote: Compressing objects: 100% (179/179), done.
remote: Total 463 (delta 185), reused 221 (delta 126), pack-reused 155
Receiving objects: 100% (463/463), 8.98 MiB | 14.72 MiB/s, done.
Resolving deltas: 100% (258/258), done.

**IMPORTANT**

ArcheType is a *model-agnostic* CTA system -- it is compatible with a variety of large language models, both public and private.

If you're interested in public models, we recommend trying it out with one of Google's [Flan-T5](https://huggingface.co/docs/transformers/en/model_doc/flan-t5) family of models. *Flan-T5-base* is what we use by default in this notebook.

If you prefer using an API-based model, ArcheType is compatible with GPT-3.5 and GPT-4.

**SETUP INSTRUCTIONS**

If you don't follow these instructions, ArcheType may not run as expected!

IF USING T5:

1. Make sure that your Colab notebook has GPU enabled.
2. In const.py, replace the ARCHETYPE_PATH variable with the absolute path to your ArcheType installation. We provide an example of how to do this in the cells below.
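
Before loading a T5 model, it can help to confirm that the runtime actually exposes a GPU. The check below is a rough sketch that uses only the standard library. It assumes an NVIDIA runtime (as on Colab) where the `nvidia-smi` tool is on the `PATH` whenever a GPU is attached; `gpu_runtime_visible` is a hypothetical helper name, not part of ArcheType.

```python
import shutil
import subprocess


def gpu_runtime_visible() -> bool:
    """Return True if `nvidia-smi` is installed and exits successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # The tool is not on PATH, so we assume no NVIDIA GPU is attached.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_runtime_visible())
```

If PyTorch is already installed, `torch.cuda.is_available()` is a more direct check.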

IF USING GPT:

1. Create an account with OpenAI and use python-dotenv to store your OpenAI API key.
2. In const.py, replace the DOTENV_PATH variable with the absolute path to the `.env` file containing your OpenAI API key.
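
A dotenv file is a plain text file of `NAME=value` lines. The sketch below writes one with the standard library. The variable name `OPENAI_API_KEY` is an assumption (check which name ArcheType actually reads), and `write_dotenv` is a hypothetical helper, not part of ArcheType or python-dotenv.

```python
import os
from pathlib import Path


def write_dotenv(path: str, api_key: str, var_name: str = "OPENAI_API_KEY") -> Path:
    """Write a one-line dotenv file and restrict access to the current user."""
    env_file = Path(path)
    env_file.write_text(f"{var_name}={api_key}\n", encoding="utf-8")
    # Owner read/write only, since the file holds a secret.
    os.chmod(env_file, 0o600)
    return env_file


# Example call (assumed Colab location; substitute your own key):
# write_dotenv("/content/.env", "<your key>")
```

Keep the resulting file out of version control, since it contains a secret.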

In [2]:
!cat /content/archetype/src/const.py

EST_CHARS_PER_TOKEN=4
MAX_LEN=2000*EST_CHARS_PER_TOKEN
INTEGER_SET = set(r"0123456789,/\+-.^_()[] :")
BOOLEAN_SET = set(["True", "true", "False", "false", "yes", "Yes", "No", "no"])

ARCHETYPE_PATH = "/home/bf996/archetype"
DOTENV_PATH = "/home/bf996/.env"


In [3]:
!sed 's/\/home\/bf996\//\/content\//g' /content/archetype/src/const.py > /content/archetype/src/const_mod.py
!rm /content/archetype/src/const.py
!mv /content/archetype/src/const_mod.py /content/archetype/src/const.py
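
The same path rewrite can be done in Python, which avoids the temporary file and the escaped slashes. This is a minimal sketch; `replace_in_file` is a hypothetical helper, and the paths in the commented call are the Colab paths used in this notebook.

```python
from pathlib import Path


def replace_in_file(path: str, old: str, new: str) -> int:
    """Replace every occurrence of `old` with `new` in the file; return the count."""
    target = Path(path)
    text = target.read_text(encoding="utf-8")
    count = text.count(old)
    if count:
        # Only rewrite the file when there is something to change.
        target.write_text(text.replace(old, new), encoding="utf-8")
    return count


# Mirrors the sed call when run inside the Colab notebook:
# replace_in_file("/content/archetype/src/const.py", "/home/bf996/", "/content/")
```

A return value of 0 signals that the expected prefix was not found, which is a quick way to notice that const.py has changed upstream.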

## Data

In [4]:
import pandas as pd
from src.predict import ArcheTypePredictor

TEST_FILE_PATH = "./table_samples/Book_5sentidoseditora.pt_September2020_CTA.json"
df = pd.read_json(TEST_FILE_PATH, lines=True)

In [5]:
df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Desejo Subtil (eBook) | 978-972-0-68199-7 | 336 | 2013-06-12 |
| 1 | Desejo Subtil | 978-972-0-04396-2 | 336 | 2014-04-23 |
| 2 | Rendida | 978-972-0-04429-7 | 352 | 2016-05-25 |
| 3 | Preferida | 978-989-745-023-5 | 448 | 2016-12-05 |
| 4 | Confia em mim | 978-989-745-028-0 | 272 | 2016-06-16 |


## Model

In [16]:
args = {
    "model_name": "flan-t5-base-zs",
    "custom_labels": ["text", "number", "id", "place"],
}

arch = ArcheTypePredictor(input_files=[df], user_args=args)
new_df = arch.annotate_columns()
new_df.head()

Initializing model...



|   | text | number | number.1 | text.1 |
|---|---|---|---|---|
| 0 | Desejo Subtil (eBook) | 978-972-0-68199-7 | 336 | 2013-06-12 |
| 1 | Desejo Subtil | 978-972-0-04396-2 | 336 | 2014-04-23 |
| 2 | Rendida | 978-972-0-04429-7 | 352 | 2016-05-25 |
| 3 | Preferida | 978-989-745-023-5 | 448 | 2016-12-05 |
| 4 | Confia em mim | 978-989-745-028-0 | 272 | 2016-06-16 |


ArcheType has filled in the missing column names from the set of options we provided. However, categorizing date strings as 'text' doesn't seem ideal, and 'text' isn't a very targeted column type. What if we try a different set of options?
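
Headers such as `number.1` and `text.1` appear when the same label is assigned to more than one column. The sketch below recovers the predicted label for each column by stripping that suffix. It assumes the trailing `.<digits>` exists only to disambiguate repeated headers, which is an assumption about how the output DataFrame is built; `base_label` is a hypothetical helper.

```python
import re
from collections import Counter

# Matches a trailing ".<digits>" such as ".1" or ".12".
_DUP_SUFFIX = re.compile(r"\.\d+$")


def base_label(column_name: str) -> str:
    """Strip a trailing numeric de-duplication suffix from a column header."""
    return _DUP_SUFFIX.sub("", column_name)


headers = ["text", "number", "number.1", "text.1"]
labels = [base_label(h) for h in headers]
print(labels)
print(Counter(labels))
```

Note that a custom label which itself ends in a dot followed by digits would be truncated by this helper, so it is only safe for label sets like the ones used here.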

In [12]:
args = {
    "model_name": "flan-t5-base-zs",
    "custom_labels": ["text", "number", "date", "book title"],
}

arch = ArcheTypePredictor(input_files=[df], user_args=args)
new_df = arch.annotate_columns()
new_df.head()

Initializing model...



|   | book title | number | date | date.1 |
|---|---|---|---|---|
| 0 | Desejo Subtil (eBook) | 978-972-0-68199-7 | 336 | 2013-06-12 |
| 1 | Desejo Subtil | 978-972-0-04396-2 | 336 | 2014-04-23 |
| 2 | Rendida | 978-972-0-04429-7 | 352 | 2016-05-25 |
| 3 | Preferida | 978-989-745-023-5 | 448 | 2016-12-05 |
| 4 | Confia em mim | 978-989-745-028-0 | 272 | 2016-06-16 |


As we can see, ArcheType is very sensitive to the semantics of the class names (as we would expect for a zero-shot model). It correctly detected the more fine-grained semantic type "book title" and labeled the date column as "date", but the third column (values such as 336), which was previously labeled "number", is now mislabeled as "date".

Sometimes we get better results when we change the prompt, which we can do by appending a prompt suffix to the model name (for examples of what these prompts look like, please refer to our [paper](https://arxiv.org/abs/2310.18208)).

Unlike methods that present prompt engineering as a contribution in itself, ArcheType treats the prompt as something to optimize, like any other hyperparameter.

Some of the suffixes we can choose from are:
```
-koriniprompt
-shortprompt
-chorusprompt
-invertedprompt
```
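
Because the prompt is selected through the model name, trying several prompts amounts to building several model names. The sketch below generates them from the suffixes listed above. It assumes that every suffix can be combined with `flan-t5-base-zs`; only `-koriniprompt` is exercised in this notebook. `candidate_model_names` is a hypothetical helper.

```python
from typing import List

PROMPT_SUFFIXES = ["-koriniprompt", "-shortprompt", "-chorusprompt", "-invertedprompt"]


def candidate_model_names(base: str, suffixes: List[str] = PROMPT_SUFFIXES,
                          include_default: bool = True) -> List[str]:
    """Return the base model name (optionally) followed by one name per suffix."""
    names = [base] if include_default else []
    names.extend(f"{base}{suffix}" for suffix in suffixes)
    return names


for name in candidate_model_names("flan-t5-base-zs"):
    args = {
        "model_name": name,
        "custom_labels": ["text", "number", "date", "book title"],
    }
    print(args["model_name"])
    # In the notebook, each `args` dict could then be passed to
    # ArcheTypePredictor(input_files=[df], user_args=args).
```

Comparing the resulting column headers across these runs is a cheap way to treat the prompt as a hyperparameter.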

In [15]:
args = {
    "model_name": "flan-t5-base-zs-koriniprompt",
    "custom_labels": ["text", "number", "date", "book title"],
}

arch = ArcheTypePredictor(input_files=[df], user_args=args)
new_df = arch.annotate_columns()
new_df.head()

Initializing model...



|   | book title | number | number.1 | date |
|---|---|---|---|---|
| 0 | Desejo Subtil (eBook) | 978-972-0-68199-7 | 336 | 2013-06-12 |
| 1 | Desejo Subtil | 978-972-0-04396-2 | 336 | 2014-04-23 |
| 2 | Rendida | 978-972-0-04429-7 | 352 | 2016-05-25 |
| 3 | Preferida | 978-989-745-023-5 | 448 | 2016-12-05 |
| 4 | Confia em mim | 978-989-745-028-0 | 272 | 2016-06-16 |


Perfect! Now we have every single column labeled just as we'd like -- and we didn't have to do any model training at all.

In a zero-shot domain, it's easy to experiment with different variations.

We can also inspect the output of our model a bit more closely, if we like, by parsing the JSON results files produced by ArcheType.

In [20]:
import json

with open("/content/archetype/results/archetype_predict.json", 'r', encoding='utf-8') as f:
  j = json.load(f)

ArcheType outputs are structured as a standard key-value store (dictionary in Python-speak). The keys are the prompts which the model received.

Because ArcheType operates in a column-at-once fashion, each entry corresponds to a single column in the DataFrame.

In [24]:
key0 = list(j.keys())[0]
key0

'INSTRUCTION: Select the option which best describes the input. \n INPUT: [Eve e a destruição (eBook), Nas asas de um coração, Paixão Sublime (eBook), Confia em mim (eBook), Desejo Subtil (eBook)] .\n  \n OPTIONS:\n - number\n- place\n- text\n- id \n ANSWER: '
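
Since each key is the full prompt text, the label options shown to the model can be recovered from it. The sketch below parses the prompt printed above. It assumes the `OPTIONS:` / `ANSWER:` layout of this particular prompt style (other prompt suffixes may be formatted differently), and `extract_options` is a hypothetical helper.

```python
def extract_options(prompt: str) -> list:
    """Return the bulleted options between 'OPTIONS:' and 'ANSWER:'."""
    _, _, tail = prompt.partition("OPTIONS:")
    block, _, _ = tail.partition("ANSWER:")
    options = []
    for line in block.splitlines():
        line = line.strip()
        if line.startswith("-"):
            # Drop the bullet marker and surrounding whitespace.
            options.append(line[1:].strip())
    return options


prompt = (
    "INSTRUCTION: Select the option which best describes the input. \n"
    " INPUT: [Eve e a destruição (eBook), Nas asas de um coração, "
    "Paixão Sublime (eBook), Confia em mim (eBook), Desejo Subtil (eBook)] .\n"
    "  \n OPTIONS:\n - number\n- place\n- text\n- id \n ANSWER: "
)
print(extract_options(prompt))
```

The input values are deliberately left unparsed: they are joined with commas, so a cell value that itself contains a comma cannot be split back out reliably.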

Each key points to a results dictionary for that prompt, containing a number of different fields.

Prompts can be uniquely identified by their `prompt_hash` value, so you can look them up later.

`original_model_answer` is the output of the model in response to the prompt.

`response` is the final output of ArcheType (after some string cleaning and normalization).

In [26]:
list(j[key0].keys())

['response',
 'context',
 'ground_truth',
 'correct',
 'original_model_answer',
 'rules',
 'prompt_hash',
 'prompt_hash_count',
 'original_label',
 'file+idx']
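
With these fields, a couple of small helpers make the results easier to inspect. The sketch below builds a lookup from `prompt_hash` to prompt and lists the entries whose `response` differs from `original_model_answer`. It assumes the three field names shown in the list above; the helper names are hypothetical, and the `sample` values are invented for illustration and are not real ArcheType output.

```python
def index_by_hash(results: dict) -> dict:
    """Map each entry's prompt_hash to its prompt text."""
    return {entry["prompt_hash"]: prompt for prompt, entry in results.items()}


def changed_by_cleaning(results: dict) -> list:
    """Return (prompt_hash, original_model_answer, response) where the two differ."""
    return [
        (e["prompt_hash"], e["original_model_answer"], e["response"])
        for e in results.values()
        if e["original_model_answer"] != e["response"]
    ]


# Hypothetical stand-in for the loaded `j` dictionary; all values are made up.
sample = {
    "prompt A": {"prompt_hash": "h1", "original_model_answer": " Text ", "response": "text"},
    "prompt B": {"prompt_hash": "h2", "original_model_answer": "number", "response": "number"},
}
print(index_by_hash(sample))
print(changed_by_cleaning(sample))
```

In the notebook, the same functions could be called on `j` directly.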