# Few-Shot Demonstration Selection

Select the best example to use as a demonstration in the few-shot prompt tests.

The goal is to find a document that contains not only a large number of the entities to be extracted but also a good diversity of entity types.
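
As a minimal sketch of the two quantities behind this goal, the snippet below counts the annotations in one document and the distinct classes they cover. The helper `quantity_and_diversity` and the toy document built with `SimpleNamespace` are hypothetical stand-ins for illustration; they are not part of `src.reader`.

```python
from types import SimpleNamespace


def quantity_and_diversity(doc, entities, attribute):
    """Return (number of annotations, number of distinct classes) for one document."""
    items = getattr(doc, entities)
    # only annotations that actually carry the class attribute contribute a class
    classes = {getattr(item, attribute) for item in items if hasattr(item, attribute)}
    return len(items), len(classes)


# hypothetical toy document: three event annotations covering two classes
toy_doc = SimpleNamespace(
    id="toy_1",
    events=[
        SimpleNamespace(text="said", class_="Reporting"),
        SimpleNamespace(text="voted", class_="Occurrence"),
        SimpleNamespace(text="passed", class_="Occurrence"),
    ],
)

print(quantity_and_diversity(toy_doc, "events", "class_"))  # (3, 2)
```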

In [17]:
from collections import Counter
from pathlib import Path

import numpy as np

from src.reader import read_lusa, read_timebank
from src.prompts import Prompter

ROOT = Path().resolve().parent

## Portuguese

In [18]:
DATA_PATH = ROOT / "resources" / "lusa_news"
dataset = read_lusa(DATA_PATH)

In [3]:
SAMPLE_DOCS_IDS = [
    "lusa_189",
    "lusa_100",
    "lusa_197",
    "lusa_161",
    "lusa_116",
    "lusa_176",
    "lusa_195",
    "lusa_173",
    "lusa_172",
    "lusa_13",
    "lusa_142",
    "lusa_126",
    "lusa_188",
    "lusa_107",
    "lusa_203",
    "lusa_191",
    "lusa_170",
    "lusa_133",
    "lusa_179",
    "lusa_155",
]

# remove documents used in the selection of the prompts
dataset = [doc for doc in dataset if doc.id not in SAMPLE_DOCS_IDS]

In [4]:
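def get_best_document(documents, entities, attribute):
    """Return a document with many annotations that also cover many classes.

    `entities` is the name of the annotation list on each document
    (e.g. "events") and `attribute` is the name of the class field on each
    annotation (e.g. "class_").
    """
    best_doc = None
    max_n_entities, max_n_classes = None, None
    for doc in documents:
        # distinct classes among the annotations that carry the attribute
        doc_classes = {
            getattr(entity, attribute)
            for entity in getattr(doc, entities)
            if hasattr(entity, attribute)
        }

        n_doc_classes = len(doc_classes)
        n_entities = len(getattr(doc, entities))

        # the first document seeds the running best
        if best_doc is None:
            best_doc = doc
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
            continue

        # replace the best only if the candidate is at least as good on both counts
        if n_doc_classes >= max_n_classes and n_entities >= max_n_entities:
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
            best_doc = doc
    return best_doc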
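
One property of this selection rule is worth keeping in mind: a candidate replaces the current best only when it matches or exceeds it on both counts, so two documents that each win on one count are never swapped, and the result can depend on iteration order. The sketch below reproduces the function compactly and runs it on two hypothetical toy documents (built with `SimpleNamespace`, not loaded from the corpus) to illustrate this.

```python
from types import SimpleNamespace


def get_best_document(documents, entities, attribute):
    """Compact copy of the selection rule used in this notebook."""
    best_doc = None
    max_n_entities, max_n_classes = None, None
    for doc in documents:
        doc_classes = {
            getattr(entity, attribute)
            for entity in getattr(doc, entities)
            if hasattr(entity, attribute)
        }
        n_doc_classes = len(doc_classes)
        n_entities = len(getattr(doc, entities))
        if best_doc is None or (
            n_doc_classes >= max_n_classes and n_entities >= max_n_entities
        ):
            best_doc = doc
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
    return best_doc


def make_doc(doc_id, classes):
    """Hypothetical helper: build a toy document with one event per class label."""
    return SimpleNamespace(id=doc_id, events=[SimpleNamespace(class_=c) for c in classes])


# 10 events, 1 class versus 5 events, 3 classes: neither dominates the other
many_entities = make_doc("many_entities", ["Occurrence"] * 10)
many_classes = make_doc(
    "many_classes", ["Occurrence", "State", "Reporting", "State", "Occurrence"]
)

first = get_best_document([many_entities, many_classes], "events", "class_")
second = get_best_document([many_classes, many_entities], "events", "class_")
print(first.id, second.id)  # many_entities many_classes
```

In this toy setup whichever document comes first is kept. Whether that matters for the real corpus depends on how often such incomparable pairs occur.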

### Events

In [5]:
event_classes = set(event.class_ for doc in dataset for event in doc.events if hasattr(event, "class_"))
print("Event classes:", event_classes)

Event classes: {'State', 'I_Action', 'Reporting', 'Perception', 'I_State', 'Aspectual', 'Occurrence'}


In [6]:
n_event_per_doc = [len(doc.events) for doc in dataset] 
max_events = max(n_event_per_doc)

print("Max events per document:", max_events)

Max events per document: 46


In [7]:
max_events_idx = np.argmax(n_event_per_doc)
doc_max_events = dataset[max_events_idx]

max_doc_class_count = Counter([event.class_ for event in doc_max_events.events if hasattr(event, "class_")])
print("Max events per document class count:", max_doc_class_count)

Max events per document class count: Counter({'Occurrence': 20, 'State': 18, 'Reporting': 6, 'I_Action': 1, 'I_State': 1})


In [8]:
best_doc = get_best_document(dataset, "events", "class_")

In [9]:
print("Document with the most events and number of classes:", best_doc.id)

Document with the most events and number of classes: lusa_119


In [10]:
annotation = [ent.text for ent in best_doc.events]
prompter = Prompter(entity="event triggers", example=best_doc)
print(prompter.template.template)

Task:
Extract all event triggers.

Example:
	Input:
	"Covi-19: Governo de estado australiano pede desculpa por erros em quarentenas em hotéis
O líder do governo do estado australiano de Victoria pediu hoje desculpa pelos erros do programa de quarentena em dois hotéis que levaram à maioria das mortes por covid-19 no país.
Após a divulgação do relatório de investigação, o primeiro-ministro de Victoria, Dan Andrews, explicou que o sistema de quarentena tinha sido implementado rapidamente e sem um livro de regras pandémico.
"Quero pedir desculpa à comunidade vitoriana pelos erros muito claros que foram cometidos neste programa", disse Andrews.
O fraco controlo em dois hotéis de quarentena desencadearam uma onda de infeções na segunda maior cidade da Austrália, enquanto o resto do país tinha estado em grande parte livre de vírus.
Das 908 mortes australianas por covid-19, 820 morreram em Victoria.
A polícia fornece agora segurança nos hotéis de quarentena de Melbourne, algo que neste dois oi

### Time Expressions

In [11]:
timex_classes = set(timex.time_type for doc in dataset for timex in doc.timexs if hasattr(timex, "time_type"))
print("Timex classes:", timex_classes)

Timex classes: {'Date', 'Time', 'Duration', 'Set'}


In [12]:
n_timex_per_doc = [len(doc.timexs) for doc in dataset] 
max_timex = max(n_timex_per_doc)

print("Max timexs per document:", max_timex)

Max timexs per document: 9


In [13]:
max_timex_idx = np.argmax(n_timex_per_doc)
doc_max_timexs = dataset[max_timex_idx]

max_doc_class_count = Counter([timex.time_type for timex in doc_max_timexs.timexs if hasattr(timex, "time_type")])
print("Max timexs per document class count:", max_doc_class_count)

Max timexs per document class count: Counter({'Time': 4, 'Date': 3, 'Duration': 2})


In [14]:
best_doc = get_best_document(dataset, "timexs", "time_type")
print("Document with the most timexs and number of classes:", best_doc.id)
print("Number of timexs:", len(best_doc.timexs))
print("Number of timex classes:", len(set([timex.time_type for timex in best_doc.timexs if hasattr(timex, "time_type")])))

Document with the most timexs and number of classes: lusa_11
Number of timexs: 9
Number of timex classes: 3


In [15]:
annotation = [ent.text for ent in best_doc.timexs]
prompter = Prompter(entity="time expressions", example=best_doc)
print(prompter.template.template)

Task:
Extract all time expressions.

Example:
	Input:
	"Autoridades moçambicanas apreendem mais de uma tonelada de caranguejo
A fiscalização marítima moçambicana apreendeu 1.100 quilos de caranguejo, no centro do país, em menos de uma semana, capturado na "época de veda", quando é proibido apanhar a espécie, disse hoje à Lusa fonte das autoridades.
A última apreensão aconteceu no sábado quando as autoridades descobriram uma embarcação com 600 quilos de caranguejo.
“Estamos a apreender caranguejo e embarcações e os responsáveis incorrem em pesadas multas, caso sejam neutralizados”, explicou o chefe da fiscalização, César Maphossa.
No sábado, os tripulantes abandonaram o barco, fundeado nos arredores da cidade da Beira, centro de Moçambique, quando se aperceberam da chegada dos fiscais.
A embarcação foi confiscada e os caranguejos, dissimulados em caixas, foram posteriormente devolvidos ao seu habitat natural, no mangal do rio Maria, arredores da capital provincial de Sofala.
A apreensão

### Participants

In [16]:
participant_classes = set(participant.participant_type_domain for doc in dataset for participant in doc.participants if hasattr(participant, "participant_type_domain"))
print("Participant classes:", participant_classes)

Participant classes: {'Per', 'Fac', 'Nat', 'Other', 'Obj', 'Loc', 'Org'}


In [17]:
n_part_per_doc = [len(doc.participants) for doc in dataset] 
max_part = max(n_part_per_doc)

print("Max participants per document:", max_part)

Max participants per document: 48


In [18]:
max_part_idx = np.argmax(n_part_per_doc)
doc_max_parts = dataset[max_part_idx]

max_doc_class_count = Counter([part.participant_type_domain for part in doc_max_parts.participants if hasattr(part, "participant_type_domain")])
print("Max participants per document class count:", max_doc_class_count)

Max participants per document class count: Counter({'Per': 17, 'Loc': 15, 'Org': 8, 'Nat': 6, 'Other': 2})


In [19]:
best_doc = get_best_document(dataset, "participants", "participant_type_domain")
print("Document with the most participants and number of classes:", best_doc.id)
print("Number of participants:", len(best_doc.participants))
print("Number of participant classes:", len(set([part.participant_type_domain for part in best_doc.participants if hasattr(part, "participant_type_domain")])))

Document with the most participants and number of classes: lusa_156
Number of participants: 47
Number of participant classes: 6


In [20]:
annotation = [ent.text for ent in best_doc.participants]
prompter = Prompter(entity="participants", example=best_doc)
print(prompter.template.template)

Task:
Extract all participants.

Example:
	Input:
	"Homem armado faz vários reféns dentro de um banco na Geórgia
Um homem armado fez hoje à tarde vários reféns, ainda em número incerto, dentro das instalações de um banco na Geórgia, informaram as autoridades desta ex-república soviética.
O Ministério do Interior da Geórgia não precisou, até ao momento, quantas pessoas foram feitas reféns dentro do banco, localizado na cidade de Zugdidi (região oeste), ou quais são as exigências do agressor.
A polícia isolou, entretanto, a zona onde fica a sucursal bancária e montou uma operação "para neutralizar o agressor", informou o ministério num comunicado.
A televisão estatal da Geórgia, a Mtavari TV, noticiou que o sequestrador está armado com uma granada de mão e exige 500.000 dólares (cerca de 420.000 euros) em dinheiro.
A Mtavari TV conseguiu falar com um dos reféns que indicou que o agressor mantém 19 pessoas dentro das instalações bancárias.
O canal de televisão também divulgou um vídeo que

## English

In [19]:
DATA_PATH = ROOT / "resources" / "timebank"
dataset = read_timebank(DATA_PATH)

100%|██████████| 183/183 [00:00<00:00, 365.45it/s]


In [20]:
SAMPLE_DOCS_IDS = [
    "wsj_0551",
    "wsj_0815",
    "wsj_0135",
    "wsj_1042",
    "wsj_0266",
    "wsj_0924",
    "PRI19980306.2000.1675",
    "wsj_0332",
    "wsj_0176",
    "wsj_0348",
    "wsj_0144",
    "wsj_0670",
    "ABC19980114.1830.0611",
    "wsj_0674",
    "wsj_0376",
    "VOA19980305.1800.2603",
    "APW19980301.0720",
    "wsj_0938",
    "wsj_0745",
    "wsj_0584",
]

# remove documents used in the selection of the prompts
dataset = [doc for doc in dataset if doc.id not in SAMPLE_DOCS_IDS]

In [21]:
def get_best_document(documents, entities, attribute):
    """Same selection rule as for Portuguese, skipping documents over 600 words."""
    best_doc = None
    max_n_entities, max_n_classes = None, None
    for doc in documents:
        if len(doc.text.split(" ")) > 600:  # avoid big examples
            continue

        # distinct classes among the annotations that carry the attribute
        doc_classes = {
            getattr(entity, attribute)
            for entity in getattr(doc, entities)
            if hasattr(entity, attribute)
        }

        n_doc_classes = len(doc_classes)
        n_entities = len(getattr(doc, entities))

        # the first document within the length limit seeds the running best
        if best_doc is None:
            best_doc = doc
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
            continue

        # replace the best only if the candidate is at least as good on both counts
        if n_doc_classes >= max_n_classes and n_entities >= max_n_entities:
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
            best_doc = doc
    return best_doc
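
The English and Portuguese definitions differ only in the length filter, so one possible way to avoid the duplication is a single function with an optional word cap. The sketch below assumes a hypothetical `max_words` parameter (not present in the notebook) and uses toy documents built with `SimpleNamespace`; with `max_words=None` it behaves like the Portuguese version, and with `max_words=600` like the English one.

```python
from types import SimpleNamespace


def get_best_document(documents, entities, attribute, max_words=None):
    """Selection rule from this notebook with an optional word cap (hypothetical)."""
    best_doc = None
    max_n_entities, max_n_classes = None, None
    for doc in documents:
        # skip documents longer than the cap, if a cap is given
        if max_words is not None and len(doc.text.split(" ")) > max_words:
            continue
        doc_classes = {
            getattr(entity, attribute)
            for entity in getattr(doc, entities)
            if hasattr(entity, attribute)
        }
        n_doc_classes = len(doc_classes)
        n_entities = len(getattr(doc, entities))
        if best_doc is None or (
            n_doc_classes >= max_n_classes and n_entities >= max_n_entities
        ):
            best_doc = doc
            max_n_entities = n_entities
            max_n_classes = n_doc_classes
    return best_doc


# hypothetical toy documents: a longer, richer one and a shorter, poorer one
long_doc = SimpleNamespace(
    id="long_doc",
    text="a b c d e f",
    events=[SimpleNamespace(class_=c) for c in ["STATE", "OCCURRENCE", "STATE", "STATE"]],
)
short_doc = SimpleNamespace(
    id="short_doc",
    text="a b",
    events=[SimpleNamespace(class_="OCCURRENCE")],
)

print(get_best_document([long_doc, short_doc], "events", "class_").id)  # long_doc
print(get_best_document([long_doc, short_doc], "events", "class_", max_words=3).id)  # short_doc
```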

### Events

In [22]:
event_classes = set(event.class_ for doc in dataset for event in doc.events if hasattr(event, "class_"))
print("Event classes:", event_classes)

Event classes: {'I_ACTION', 'STATE', 'OCCURRENCE', 'PERCEPTION', 'ASPECTUAL', 'REPORTING', 'I_STATE'}


In [23]:
n_event_per_doc = [len(doc.events) for doc in dataset] 
max_events = max(n_event_per_doc)

print("Max events per document:", max_events)

Max events per document: 269


In [24]:
max_events_idx = np.argmax(n_event_per_doc)
doc_max_events = dataset[max_events_idx]

max_doc_class_count = Counter([event.class_ for event in doc_max_events.events if hasattr(event, "class_")])
print("Max events per document class count:", max_doc_class_count)

Max events per document class count: Counter({'OCCURRENCE': 189, 'REPORTING': 26, 'STATE': 17, 'I_STATE': 15, 'PERCEPTION': 8, 'ASPECTUAL': 7, 'I_ACTION': 7})


In [25]:
best_doc = get_best_document(dataset, "events", "class_")

In [26]:
print("Document with the most events and number of classes:", best_doc.id)

Document with the most events and number of classes: APW19980213.1310


In [27]:
annotation = [ent.text for ent in best_doc.events]
prompter = Prompter(entity="event triggers", example=best_doc)
print(prompter.template.substitute(text=""))

Task:
Extract all event triggers.

Example:
	Input:
	"Turning its back on 210 years of loyalty to the British royal family, a constitutional convention voted overwhelmingly Friday to make Australia a republic under its own president. Prime Minister John Howard, a monarchist himself, promised to put the question to a national referendum next year after convention delegates voted 89-52 for a republic, with 11 abstentions. Spontaneous applause echoed through the chamber and public galleries as the crucial vote passed by a wide margin. ``I want a referendum,'' Howard said. ``The Australian people are owed the opportunity of expressing an opinion on this.'' ``It would be a travesty in common sense terms of Australian democracy for that proposition not to be put to the Australian people,'' Howard said. Even in his own Cabinet, Howard is becoming increasingly isolated with his monarchist stance. Treasurer Peter Costello, Environment Minister Robert Hill and Attorney General Daryl Williams all

### Time Expressions

In [28]:
timex_classes = set(timex.time_type for doc in dataset for timex in doc.timexs if hasattr(timex, "time_type"))
print("Timex classes:", timex_classes)

Timex classes: {'SET', 'DURATION', 'DATE', 'TIME'}


In [29]:
n_timex_per_doc = [len(doc.timexs) for doc in dataset] 
max_timex = max(n_timex_per_doc)

print("Max timexs per document:", max_timex)

Max timexs per document: 34


In [30]:
max_timex_idx = np.argmax(n_timex_per_doc)
doc_max_timexs = dataset[max_timex_idx]

max_doc_class_count = Counter([timex.time_type for timex in doc_max_timexs.timexs if hasattr(timex, "time_type")])
print("Max timexs per document class count:", max_doc_class_count)

Max timexs per document class count: Counter({'DATE': 29, 'DURATION': 4, 'TIME': 1})


In [31]:
best_doc = get_best_document(dataset, "timexs", "time_type")
print("Document with the most timexs and number of classes:", best_doc.id)
print("Number of timexs:", len(best_doc.timexs))
print("Number of timex classes:", len(set([timex.time_type for timex in best_doc.timexs if hasattr(timex, "time_type")])))

Document with the most timexs and number of classes: APW19980306.1001
Number of timexs: 16
Number of timex classes: 4


In [32]:
annotation = [ent.text for ent in best_doc.timexs]
prompter = Prompter(entity="time expressions", example=best_doc)
print(prompter.template.substitute(text=""))

Task:
Extract all time expressions.

Example:
	Input:
	"BAGHDAD, Iraq (AP)_ An American leader of a U.N. weapons inspection team resumed work in Iraq Friday, nearly two months after his team was effectively blocked. Scott Ritter led his team on a 10-hour tour of three suspected weapons sites classified as ``sensitive'' by the Iraqi authorities, U.N. spokesman Alan Dacey said. ``All sites were inspected to the satisfaction of the inspection team and with full cooperation of Iraqi authorities,'' Dacey said. At least one of the sensitive sites was a barracks of the elite Republican Guard, a well-placed source told The Associated Press. Previously the Iraqis have resisted attempts to inspect such quarters. The U.N. Security Council has charged the inspectors with verifying that Iraq has destroyed its long-range missiles and weapons of mass destruction. It was the first time that Ritter, who arrived Thursday with some 50 inspectors for a tour likely to last over a week, had been allowed to 