# Introduction

Large-scale language models (LMs) such as BERT are optimized to predict masked-out textual inputs and have notably advanced performance on a range of downstream NLP tasks. Recently, LMs have also gained attention for their purported ability to yield structured pieces of knowledge directly from their parameters. This is promising because current knowledge bases (KBs) such as Wikidata and ConceptNet are part of the backbone of the Semantic Web ecosystem, yet are inherently incomplete. In the seminal LAMA paper [(Petroni et al., 2019)](https://arxiv.org/pdf/1909.01066.pdf), the authors showed that LMs can rank correct object tokens highly when given an input prompt specifying the subject-entity and relation. Despite much follow-up work reporting further advancements, the prospect of using LMs for knowledge base construction remains unexplored.

We invite participants to present solutions that use **LMs for KB construction** without prior information on the cardinality of relations, i.e., for a given subject-relation pair, the total number of possible object-entities is not given. Participants submit a system that takes a subject-entity and a relation as input, uses an LM according to the chosen track (BERT-type or open), generates subject-relation-object tuples, and makes an actual accept/reject decision for each generated triple. Finally, we evaluate the resulting KBs using the established F1-score metric (harmonic mean of precision and recall).
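
For reference, the metric can be written out explicitly. The sketch below assumes set-based precision and recall for a single subject-relation pair, where $\hat{O}$ (the accepted object-entities) and $O$ (the ground-truth object-entities) are notation introduced here for illustration. How scores are aggregated across pairs is not specified in this section.

$$
P = \frac{|\hat{O} \cap O|}{|\hat{O}|}, \qquad
R = \frac{|\hat{O} \cap O|}{|O|}, \qquad
F_1 = \frac{2 \, P \, R}{P + R}
$$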

**NOTE:** Before continuing, follow the steps given in [README.md](https://github.com/lm-kbc/dataset/blob/main/README.md) to install the required Python packages if you have not already done so.

## LM Probing

The Knowledge Base Construction from Language Models (LM-KBC) pipeline has the following important modules:

1. Choosing the subject-entity (e.g., Germany) and relation (e.g., CountryBordersWithCountry)
2. Creating a prompt (e.g., "_Germany shares border with [MASK]_.", a masked prompt for BERT-type masked language models)
3. Probing an existing language model using the above prompt as an input
4. Obtaining the LM's output for the input prompt, i.e., the object-entities predicted for the [MASK] position, ranked by likelihood
5. Applying a selection criterion to the LM's output to keep only the factually correct object-entities for the given subject-entity and relation

<font color='blue'>Participants can propose solutions that either improve the performance of these modules compared to the given baseline system or introduce a new idea to better generate the object-entities, with the goal of beating the baseline's F1-score. Below we explain how some of these modules affect the LM's output when probed.</font>

In [1]:
import pandas as pd
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline, logging

from baseline import create_prompt  # our baseline's prompt templates
from file_io import read_lm_kbc_jsonl_to_df  # function to read the ground-truth files

logging.set_verbosity_error()  # avoid irritating transformers warnings

In [2]:
# Our probing function
def probe_lm(model_name, subject_entity, relation, top_k=100, prompt=None):
    # Load the model
    print(f"Loading model \"{model_name}\"...", end=" ")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    pipe = pipeline(
        task="fill-mask",
        model=model,
        tokenizer=tokenizer,
        top_k=top_k,
    )

    mask_token = tokenizer.mask_token
    
    # Create prompt
    if not prompt:
        prompt = create_prompt(subject_entity, relation, mask_token)
    
    # Probe the LM
    print("Probing...")
    outputs = pipe(prompt)
    
    return [{
        "Prompt": prompt,
        "SubjectEntity": subject_entity,
        "Relation": relation,
        "ObjectEntity": out["token_str"],
        "Probability": out["score"]
    } for out in outputs]
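
Note that `probe_lm` reloads the tokenizer and model on every call, so probing the same model with several prompts repeats the loading cost. A minimal sketch of one way to avoid this is a small cache keyed by model name and `top_k`. The names `get_pipeline`, `_default_loader`, and the `loader` parameter are hypothetical and not part of the baseline; passing the model name directly to `pipeline` is assumed to load the matching tokenizer as well.

```python
# Hypothetical cache for fill-mask pipelines (not part of the baseline code).
_PIPELINES = {}


def _default_loader(model_name, top_k):
    # Imported lazily so the caching logic itself does not require transformers.
    from transformers import pipeline

    return pipeline(task="fill-mask", model=model_name, top_k=top_k)


def get_pipeline(model_name, top_k=100, loader=None):
    """Return a cached pipeline for (model_name, top_k), loading it only once."""
    key = (model_name, top_k)
    if key not in _PIPELINES:
        load = loader if loader is not None else _default_loader
        _PIPELINES[key] = load(model_name, top_k)
    return _PIPELINES[key]
```

Inside `probe_lm`, the pipeline construction could then be replaced by `pipe = get_pipeline(model_name, top_k)`, with the mask token read from `pipe.tokenizer.mask_token`.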

In [3]:
# Assume the following subject entity and relation from here on
subject_entity = "Singapore"
relation = "CountryBordersWithCountry"

## Effect of Language Model

Let's see how the output object-entities vary across three different pre-trained LMs:
- BERT-base-cased
- BERT-large-cased
- RoBERTa-base

In [4]:
# probing the three different LMs on the chosen subject-entity and relation
bert_base_cased_output = probe_lm("bert-base-cased", subject_entity, relation)
bert_large_cased_output = probe_lm("bert-large-cased", subject_entity, relation)
roberta_base_output = probe_lm("roberta-base", subject_entity, relation)

Loading model "bert-base-cased"... Probing...
Loading model "bert-large-cased"... Probing...
Loading model "roberta-base"... Probing...


We set the probability threshold to 0.5 (our selection criterion) and filter the LMs' outputs by this threshold.

In [5]:
prob_threshold = 0.5

filtered_bert_base_cased_output = [
    out["ObjectEntity"] for out in bert_base_cased_output if out["Probability"] >= prob_threshold
]

filtered_bert_large_cased_output = [
    out["ObjectEntity"] for out in bert_large_cased_output if out["Probability"] >= prob_threshold
]

filtered_roberta_base_output = [
    out["ObjectEntity"] for out in roberta_base_output if out["Probability"] >= prob_threshold
]

In [6]:
# retrieving the ground truth labels from the given train dataset for the chosen subject-entity and relation
df = read_lm_kbc_jsonl_to_df("data/train.jsonl")
ground_truth = df[(df["SubjectEntity"] == subject_entity) & (df["Relation"] == relation)]["ObjectEntities"].tolist()
print("Ground truth object-entities are:", ground_truth)

# printing the filtered outputs
print("bert_base_output:", filtered_bert_base_cased_output)
print("bert_large_output:", filtered_bert_large_cased_output)
print("roberta_base_output:", filtered_roberta_base_output)

Ground truth object-entities are: [[['malaysia'], ['indonesia']]]
bert_base_output: []
bert_large_output: ['Malaysia']
roberta_base_output: []


<font color='blue'>**Observation**: From the above output, we see that the choice of the pre-trained language model has a direct effect on the generated output. Participants can try to further fine-tune the BERT model (for track 1) on this task or experiment with other existing pre-trained LMs (for track 2).</font>
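
One detail visible in the output above is that the ground truth is lowercased and nested (`[[['malaysia'], ['indonesia']]]`), whereas BERT-large-cased returns `'Malaysia'`. A hedged sketch of a case-insensitive comparison follows. It assumes each inner list holds the accepted surface forms of one object-entity, and the helper names are hypothetical rather than part of the official evaluation script.

```python
def normalize(text):
    """Lowercase and strip a surface form so cased LM tokens can be compared."""
    return text.strip().lower()


def count_matches(predictions, object_entities):
    """Return (correct predictions, ground-truth entities found).

    `object_entities` is assumed to be a list of alias lists,
    e.g. [["malaysia"], ["indonesia"]].
    """
    alias_sets = [{normalize(a) for a in aliases} for aliases in object_entities]
    predicted = {normalize(p) for p in predictions}
    correct = sum(1 for p in predicted if any(p in s for s in alias_sets))
    found = sum(1 for s in alias_sets if s & predicted)
    return correct, found


ground_truth_row = [["malaysia"], ["indonesia"]]
print("Malaysia" in ["malaysia", "indonesia"])        # False: naive comparison misses it
print(count_matches(["Malaysia"], ground_truth_row))  # (1, 1)
```

In the notebook, the row passed to such a helper would be `ground_truth[0]`, since `.tolist()` wraps the single matching row in an outer list.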

## Effect of prompt formulation

Let's see how the output object-entities vary when using different prompt structures with the BERT-large-cased LM.

In [7]:
# creating different prompts:

bert_large_masked_token = "[MASK]"

prompt0 = f"{subject_entity} shares border with {bert_large_masked_token}."
prompt1 = f"{subject_entity} borders {bert_large_masked_token}"
prompt2 = f"{subject_entity} borders {bert_large_masked_token}, which is a country" 
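
The prompts above hard-code `[MASK]`, which is the mask token of BERT tokenizers. RoBERTa tokenizers use `<mask>` instead, which is why `probe_lm` reads `tokenizer.mask_token` when it builds the default prompt. A small sketch of model-agnostic templates follows; the `PROMPT_TEMPLATES` dictionary and the `build_prompts` helper are hypothetical names, not the baseline's `create_prompt`.

```python
# Hypothetical templates mirroring prompt0, prompt1 and prompt2 above.
PROMPT_TEMPLATES = {
    "prompt0": "{subject_entity} shares border with {mask_token}.",
    "prompt1": "{subject_entity} borders {mask_token}",
    "prompt2": "{subject_entity} borders {mask_token}, which is a country",
}


def build_prompts(subject_entity, mask_token):
    """Fill every template with the subject-entity and the model's mask token."""
    return {
        name: template.format(subject_entity=subject_entity, mask_token=mask_token)
        for name, template in PROMPT_TEMPLATES.items()
    }


bert_prompts = build_prompts("Singapore", "[MASK]")
roberta_prompts = build_prompts("Singapore", "<mask>")
print(bert_prompts["prompt0"])     # Singapore shares border with [MASK].
print(roberta_prompts["prompt0"])  # Singapore shares border with <mask>.
```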

In [8]:
# probing the BERT-large-cased LM using the three different prompts for same subject-entity and relation
prompt0_output = probe_lm("bert-large-cased", subject_entity, relation, prompt=prompt0)
prompt1_output = probe_lm("bert-large-cased", subject_entity, relation, prompt=prompt1)
prompt2_output = probe_lm("bert-large-cased", subject_entity, relation, prompt=prompt2)

Loading model "bert-large-cased"... Probing...
Loading model "bert-large-cased"... Probing...
Loading model "bert-large-cased"... Probing...


Let's see the top-3 results of each output

In [9]:
pd.DataFrame(prompt0_output).head(3)

| | Prompt | SubjectEntity | Relation | ObjectEntity | Probability |
|---|---|---|---|---|---|
| 0 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Malaysia | 0.690764 |
| 1 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Thailand | 0.112451 |
| 2 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Indonesia | 0.067123 |


In [10]:
pd.DataFrame(prompt1_output).head(3)

| | Prompt | SubjectEntity | Relation | ObjectEntity | Probability |
|---|---|---|---|---|---|
| 0 | Singapore borders [MASK] | Singapore | CountryBordersWithCountry | `;` | 0.517871 |
| 1 | Singapore borders [MASK] | Singapore | CountryBordersWithCountry | `.` | 0.481356 |
| 2 | Singapore borders [MASK] | Singapore | CountryBordersWithCountry | `\|` | 0.000378 |


In [11]:
pd.DataFrame(prompt2_output).head(3)

| | Prompt | SubjectEntity | Relation | ObjectEntity | Probability |
|---|---|---|---|---|---|
| 0 | Singapore borders [MASK], which is a country | Singapore | CountryBordersWithCountry | Malaysia | 0.382691 |
| 1 | Singapore borders [MASK], which is a country | Singapore | CountryBordersWithCountry | Indonesia | 0.14675 |
| 2 | Singapore borders [MASK], which is a country | Singapore | CountryBordersWithCountry | Thailand | 0.094743 |


<font color='blue'>**Observation**: From the above output, we see that the prompt used for probing affects the quality of the generated output. Participants can propose a solution that automatically designs better prompts for this task.</font>
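
As one illustration of how prompts could be compared automatically, the sketch below scores each prompt by the probability mass its top-3 predictions place on ground-truth object-entities, using the numbers from the tables above. The scoring rule and the function name `ground_truth_mass` are assumptions made for this example, not the baseline's method, and a real system would score over many training pairs rather than a single one.

```python
def ground_truth_mass(predictions, gold):
    """Sum the probabilities of predicted tokens that are ground-truth objects."""
    gold = {g.lower() for g in gold}
    return sum(prob for token, prob in predictions if token.lower() in gold)


# Top-3 (token, probability) pairs copied from the three result tables above.
top3 = {
    "prompt0": [("Malaysia", 0.690764), ("Thailand", 0.112451), ("Indonesia", 0.067123)],
    "prompt1": [(";", 0.517871), (".", 0.481356), ("|", 0.000378)],
    "prompt2": [("Malaysia", 0.382691), ("Indonesia", 0.14675), ("Thailand", 0.094743)],
}
gold = ["malaysia", "indonesia"]

scores = {name: ground_truth_mass(preds, gold) for name, preds in top3.items()}
best_prompt = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("best:", best_prompt)
```

Under this rule, `prompt0` scores highest and `prompt1`, which only predicts punctuation, scores zero.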

## Effect of selection criteria

Let's see how choosing different probability thresholds affects the generated output object-entities.

In [12]:
# initializing different probability thresholds
prob_threshold1 = 0.1
prob_threshold2 = 0.5
prob_threshold3 = 0.9

# filtering bert-large outputs using the thresholds
thres1_result = [out for out in bert_large_cased_output if out["Probability"] >= prob_threshold1]
thres2_result = [out for out in bert_large_cased_output if out["Probability"] >= prob_threshold2]
thres3_result = [out for out in bert_large_cased_output if out["Probability"] >= prob_threshold3]

Let's see the top-3 results

In [13]:
pd.DataFrame(thres1_result).head(3)

| | Prompt | SubjectEntity | Relation | ObjectEntity | Probability |
|---|---|---|---|---|---|
| 0 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Malaysia | 0.690764 |
| 1 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Thailand | 0.112451 |


In [14]:
pd.DataFrame(thres2_result).head(3)

| | Prompt | SubjectEntity | Relation | ObjectEntity | Probability |
|---|---|---|---|---|---|
| 0 | Singapore shares border with [MASK]. | Singapore | CountryBordersWithCountry | Malaysia | 0.690764 |


In [15]:
pd.DataFrame(thres3_result).head(3)

<font color='blue'>**Observation**: From the above output, we see that changing the threshold leads to very different performance scores. When the threshold is 0.1, the F1-score would be 0.5 (1 of the 2 generations is correct and 1 of the 2 ground-truth object-entities was selected); for threshold 0.9, however, no object-entity is selected and the F1-score would be 0. Participants can propose a solution that uses a better thresholding mechanism or further calibrates the LM's likelihoods on this task.</font>
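
The arithmetic above can be checked with a short sketch. It assumes set-based precision and recall with case-insensitive matching, takes the top-3 BERT-large-cased predictions shown earlier as input (lower-ranked tokens fall below all three thresholds), and defines F1 as 0 when no correct object-entity is selected. `f1_at_threshold` is a hypothetical helper, not the official evaluation script.

```python
def f1_at_threshold(predictions, gold, threshold):
    """F1 of the tokens whose probability is at least `threshold`."""
    gold = {g.lower() for g in gold}
    selected = {token.lower() for token, prob in predictions if prob >= threshold}
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0  # covers the empty selection as well
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Top-3 predictions for "Singapore shares border with [MASK]."
predictions = [("Malaysia", 0.690764), ("Thailand", 0.112451), ("Indonesia", 0.067123)]
gold = ["malaysia", "indonesia"]

for threshold in (0.1, 0.5, 0.9):
    print(threshold, round(f1_at_threshold(predictions, gold, threshold), 3))
# 0.1 0.5
# 0.5 0.667
# 0.9 0.0
```

The middle threshold gives the best score here because it keeps the one correct prediction and drops Thailand, raising precision to 1 while recall stays at 0.5.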