In this homework, you can pick one of the two sections (Automated Fact-checking or Relatio) to receive full completion credit.

# Automated Fact-checking
In the notebook, we saw an off-the-shelf fact-checking model based on RoBERTa.
However, that setting is closer to textual entailment, because the evidence is already given. A real-world fact-checking pipeline requires an extra module: evidence retrieval. In this homework, we will add an evidence retrieval model based on <b>sentence-bert</b> to the RoBERTa fact-checker.

Note: SBERT was introduced in Notebook 6 (06_transformers.ipynb).
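
The two-stage pipeline described above (retrieve evidence, then verify the claim against it) can be sketched with toy stand-ins. This is a minimal illustration, not the homework solution: `bag_of_words` replaces the SBERT encoder, the `classifier` argument replaces the RoBERTa fact-checker, and `toy_docs`, `toy_claim`, `retrieve`, and `verify` are hypothetical names introduced only for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an SBERT embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(claim, documents, k=1):
    """Stage 1: rank documents by similarity to the claim and keep the top k ids."""
    query = bag_of_words(claim)
    ranked = sorted(documents, key=lambda d: cosine(query, bag_of_words(documents[d])), reverse=True)
    return ranked[:k]


def verify(claim, evidence, classifier):
    """Stage 2: hand the (claim, evidence) pair to any entailment-style classifier."""
    return classifier(claim, evidence)


# Toy corpus and a placeholder classifier (both hypothetical).
toy_docs = {
    "d1": "Rare variants can create synthetic genome-wide associations.",
    "d2": "The Miami Heat extended their winning streak.",
}
toy_claim = "Rare variants explain genome-wide associations."
top_ids = retrieve(toy_claim, toy_docs, k=1)
label = verify(toy_claim, toy_docs[top_ids[0]], classifier=lambda c, e: "SUPPORT")
print(top_ids, label)
```

In the homework itself, stage 1 is the SBERT retrieval in TODO 1 and stage 2 is the RoBERTa model in TODO 2.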

In [1]:
!wget https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz
!tar -xzf data.tar.gz

--2023-05-25 06:40:06--  https://scifact.s3-us-west-2.amazonaws.com/release/latest/data.tar.gz
Resolving scifact.s3-us-west-2.amazonaws.com (scifact.s3-us-west-2.amazonaws.com)... 3.5.84.112, 3.5.77.189, 3.5.83.150, ...
Connecting to scifact.s3-us-west-2.amazonaws.com (scifact.s3-us-west-2.amazonaws.com)|3.5.84.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3115079 (3.0M) [application/x-tar]
Saving to: ‘data.tar.gz’


2023-05-25 06:40:07 (3.59 MB/s) - ‘data.tar.gz’ saved [3115079/3115079]



In [2]:
import json

claim_file = 'data/claims_dev.jsonl'
corpus_file = 'data/corpus.jsonl'

corpus = {}
with open(corpus_file) as f:
    for line in f:
        abstract = json.loads(line)
        corpus[str(abstract["doc_id"])] = abstract
        
claims = []
with open(claim_file) as f:
    for line in f:
        claim = json.loads(line)
        claims.append(claim)

print(claims[1])
print(corpus['14717500'])

print("Number of documents in corpus:", len(corpus))

{'id': 3, 'claim': '1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.', 'evidence': {'14717500': [{'sentences': [2, 5], 'label': 'SUPPORT'}, {'sentences': [7], 'label': 'SUPPORT'}]}, 'cited_doc_ids': [14717500]}
{'doc_id': 14717500, 'title': 'Rare Variants Create Synthetic Genome-Wide Associations', 'abstract': ['Genome-wide association studies (GWAS) have now identified at least 2,000 common variants that appear associated with common diseases or related traits (http://www.genome.gov/gwastudies), hundreds of which have been convincingly replicated.', 'It is generally thought that the associated markers reflect the effect of a nearby common (minor allele frequency >0.05) causal site, which is associated with the marker, leading to extensive resequencing efforts to find causal sites.', 'We propose as an alternative explanation that variants much less common than the associated one may crea
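
The claim record printed above stores its gold evidence as a mapping from document id to a list of evidence sets, each with sentence indices and a label. A minimal sketch of resolving those indices into sentences follows. It assumes the indices are 0-based positions in the document's `abstract` list; `gold_evidence` is a hypothetical helper, and the toy abstract is a placeholder rather than the real text.

```python
def gold_evidence(claim, corpus):
    """Return (doc_id, label, sentences) tuples for every annotated evidence set."""
    results = []
    for doc_id, evidence_sets in claim["evidence"].items():
        abstract = corpus[doc_id]["abstract"]
        for evidence_set in evidence_sets:
            # Assumption: sentence indices are 0-based positions in the abstract list.
            sentences = [abstract[i] for i in evidence_set["sentences"]]
            results.append((doc_id, evidence_set["label"], sentences))
    return results


# Evidence structure copied from the output above; the abstract is a placeholder.
toy_claim = {
    "id": 3,
    "evidence": {"14717500": [{"sentences": [2, 5], "label": "SUPPORT"},
                              {"sentences": [7], "label": "SUPPORT"}]},
    "cited_doc_ids": [14717500],
}
toy_corpus = {"14717500": {"doc_id": 14717500,
                           "abstract": [f"sentence {i}" for i in range(8)]}}

evidence = gold_evidence(toy_claim, toy_corpus)
for doc_id, label, sentences in evidence:
    print(doc_id, label, sentences)
```

Note that the corpus dictionary is keyed by the string form of `doc_id`, which matches the string keys used under `evidence`.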

In [3]:
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm import tqdm


def cosine_similarity(a, b):
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# utility function for cosine similarity calculation
def cosine_similarity_matrix(vector, matrix):
    return np.apply_along_axis(cosine_similarity, 1, matrix, vector)

# preprocessing function for SciFact corpus
def preprocess_sentence(text):
    text = text.replace('/', ' / ')
    text = text.replace('.-', ' .- ')
    text = text.replace('.', ' . ')
    text = text.replace('\'', ' \' ')
    text = text.lower()

    return text

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.14.1-py3-
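
The `cosine_similarity_matrix` helper above calls `cosine_similarity` once per row through `np.apply_along_axis`. A vectorized variant, sketched below, computes the same scores with a single matrix-vector product. `cosine_similarity_matrix_vectorized` is a hypothetical name, and the sketch assumes no row of `matrix` is all zeros.

```python
import numpy as np


def cosine_similarity_matrix_vectorized(vector, matrix):
    """Cosine similarity between `vector` (d,) and every row of `matrix` (n, d)."""
    vector = np.asarray(vector, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    dots = matrix @ vector  # one dot product per row
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return dots / norms


# Small made-up example: a parallel, an orthogonal, and a 45-degree row.
query = np.array([1.0, 0.0])
rows = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
scores = cosine_similarity_matrix_vectorized(query, rows)
print(scores)
```

For a corpus of a few thousand documents either version is fast enough; the vectorized form mainly matters when scoring many claims.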

### TODO 1: Find the top-10 evidence documents for the following claim using SBERT.

In [4]:
claim_1 = claims[1]['claim']

corpus_ids, corpus_texts = [], []
for k, v in corpus.items():
  original_sentences = [v['title']] + v['abstract']
  processed_paragraph = " ".join([preprocess_sentence(sentence) for sentence in original_sentences])
  corpus_ids.append(k)
  corpus_texts.append(processed_paragraph)

# TODO: find the top-10 evidence documents for claim_1 using SBERT
# Hint 1: use Colab's GPU (or your local GPUs) to accelerate SBERT encoding.
# Hint 2: use parallel encoding (i.e. batch encoding) from SBERT to accelerate encoding.
# Hint 3: SciFact is a scientific-domain dataset. Are there SBERT models for the same domain?

sbert_model_name = "bert-base-nli-mean-tokens"
embedder = SentenceTransformer(sbert_model_name)

# Encode the claim into a single embedding vector
c1_embedded = embedder.encode(claim_1)

# Encode the corpus in batches of 128 documents
batch_size = 128
corpus_emb = []
for i in tqdm(range(0, len(corpus_texts), batch_size)):
    batch = corpus_texts[i:i + batch_size]
    corpus_emb.extend(embedder.encode(batch))

# Rank documents by cosine similarity to the claim, highest first
similarity = cosine_similarity_matrix(c1_embedded, np.array(corpus_emb))
top_index = np.argsort(similarity)[::-1][:10]
top_evidences = [(corpus_ids[i], corpus_texts[i]) for i in top_index]

for evidence in top_evidences:
    print(evidence)




100%|██████████| 41/41 [00:37<00:00,  1.10it/s]

('1388704', 'the essence of snps .  single nucleotide polymorphisms (snps) are an abundant form of genome variation, distinguished from rare variations by a requirement for the least abundant allele to have a frequency of 1% or more .  a wide range of genetics disciplines stand to benefit greatly from the study and use of snps .  the recent surge of interest in snps stems from, and continues to depend upon, the merging and coincident maturation of several research areas, i . e .  (i) large-scale genome analysis and related technologies, (ii) bio-informatics and computing, (iii) genetic analysis of simple and complex disease states, and (iv) global human population genetics .  these fields will now be propelled forward, often into uncharted territories, by ongoing discovery efforts that promise to yield hundreds of thousands of human snps in the next few years .  major questions are now being asked, experimentally, theoretically and ethically, about the most effective ways to unlock the
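
The top hit printed above ('1388704') is not the document annotated as evidence for this claim ('14717500'), so it is worth checking where the gold document lands in the ranking. The sketch below shows one way to do that. `gold_doc_ranks` and `recall_at_k` are hypothetical helpers, and every id in `retrieved` other than the first is a placeholder, not an actual retrieval result.

```python
def gold_doc_ranks(ranked_doc_ids, gold_doc_ids):
    """Map each gold doc id to its 1-based rank in the retrieved list (None if missing)."""
    positions = {doc_id: rank for rank, doc_id in enumerate(ranked_doc_ids, start=1)}
    return {gold: positions.get(gold) for gold in gold_doc_ids}


def recall_at_k(ranked_doc_ids, gold_doc_ids, k):
    """Fraction of gold documents that appear among the top-k retrieved ids."""
    top_k = set(ranked_doc_ids[:k])
    gold = set(gold_doc_ids)
    return len(gold & top_k) / len(gold) if gold else 0.0


# '1388704' is the top hit printed above; the other positions are made up for illustration.
retrieved = ["1388704", "doc_b", "14717500", "doc_d"]
gold = ["14717500"]  # the key under claims[1]['evidence']
ranks = gold_doc_ranks(retrieved, gold)
print(ranks)
print(recall_at_k(retrieved, gold, k=2), recall_at_k(retrieved, gold, k=3))
```

In the notebook, `retrieved` would be `[doc_id for doc_id, _ in top_evidences]` and `gold` would be `list(claims[1]['evidence'].keys())`.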




### TODO 2: Use Dzeniks/roberta-fact-check and the retrieved evidence to verify the claim. You can use one or more pieces of evidence.

In [5]:
#!pip install transformers
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

In [6]:
# TODO: use Dzeniks/roberta-fact-check and the retrieved evidence to verify the claim. You can use one or more pieces of evidence.

model_name = "Dzeniks/roberta-fact-check"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name)



In [8]:
evidence_text = top_evidences[0][1]
print(evidence_text[:70], '...')

encoded_input = tokenizer.encode_plus(
    claim_1,
    evidence_text,
    add_special_tokens=True,
    truncation=True,
    padding="longest",
    return_tensors="pt"
)

input_ids = encoded_input["input_ids"]
attention_mask = encoded_input["attention_mask"]

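# Inference only, so disable gradient tracking
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
logits = outputs.logits
predicted_labels = torch.argmax(logits, dim=1).flatten().tolist()

# NOTE: bool(prediction) only reports whether the predicted class index is non-zero.
# Check the model card for what each index means before reading it as "the claim is true".
prediction = predicted_labels[0]
print("Claim prediction:", bool(prediction))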


the essence of snps .  single nucleotide polymorphisms (snps) are an a ...
Claim prediction: True
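
The cell above verifies the claim against a single evidence passage. When several retrieved passages are used, their predictions need to be combined somehow. One simple option, sketched below, is to average the softmax probabilities across passages and take the argmax. This is an assumption made for illustration, not a method prescribed by the homework or the model; `aggregate` is a hypothetical helper and `toy_logits` are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(per_evidence_logits):
    """Average class probabilities across evidence passages and pick the argmax."""
    probs = [softmax(logits) for logits in per_evidence_logits]
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
    label = max(range(n_classes), key=mean.__getitem__)
    return label, mean


# Made-up logits for three evidence passages and two classes.
toy_logits = [[0.2, 1.5], [1.0, 0.0], [0.1, 0.9]]
label, mean_probs = aggregate(toy_logits)
print(label, [round(p, 3) for p in mean_probs])
```

In the notebook, each row of `toy_logits` would come from `outputs.logits[0].tolist()` for one (claim, evidence) pair. Majority voting over the per-passage labels is an equally reasonable alternative.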


# Relatio

In [9]:
!pip install relatio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting relatio
  Downloading relatio-0.3.0-py3-none-any.whl (28 kB)
Collecting allennlp-models>=2.3 (from relatio)
  Downloading allennlp_models-2.10.1-py3-none-any.whl (464 kB)
Collecting pyvis>=0.1.9 (from relatio)
  Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
Collecting umap-learn>=0.5.3 (from relatio)
  Downloading umap-learn-0.5.3.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting hdbscan>=0.8.28 (from relatio)
  Downloading hdbscan-0.8.29.tar.gz (5.2 MB)

In [10]:
# Catch warnings for an easy ride
from relatio import FileLogger
logger = FileLogger(level = 'WARNING')

  warn(f"Failed to load image Python extension: {e}")


In [33]:
#!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

import pandas as pd
import nltk

# The AG News CSV has no header row, so name the columns explicitly;
# otherwise the first article is consumed as the header.
df = pd.read_csv('train.csv', header=None, names=["label", "title", "lead"])

label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}
df["label"] = df["label"].map(label_map)
df["doc"] = df["title"] + " " + df["lead"]
df["id"] = df.index
df = df[['id', 'doc']]

# Sample 5,000 articles to keep SRL prediction fast
df = df.sample(n=5000)

### TODO 1: Predict the Semantic Roles in the AG news dataset. 

In [34]:
df.head()

|  | id | doc |
|---|---|---|
| 7978 | 7978 | Cisco Ties Microsoft CRM to Your Phone System ... |
| 97007 | 97007 | Date Set for Auction Of Russian Oil Giant The ... |
| 63060 | 63060 | Astros Recover From Oswalt's Poor Start (AP) A... |
| 119773 | 119773 | NBA Wrap: Heat Tame Bobcats to Extend Winning ... |
| 5280 | 5280 | AMD using strained silicon on 90-nanometer chi... |


In [None]:
from relatio import Preprocessor, SRL, extract_roles

# TODO 1: Predict the Semantic Role Labels in the AG news dataset.
# Hint: you can sample a few sentences to accelerate SRL prediction.

preprocessor = Preprocessor(
    spacy_model = "en_core_web_sm",
    remove_punctuation = True,
    remove_digits = True,
    lowercase = True,
    lemmatize = True,
    remove_chars = ["\"",'-',"^",".","?","!",";","(",")",",",":","\'","+","&","|","/","{","}",
                    "~","_","`","[","]",">","<","=","*","%","$","@","#","’"],
    stop_words = [],
    n_process = -1,
    batch_size = 100
)

srl = SRL(
    path = "https://storage.googleapis.com/allennlp-public-models/openie-model.2020.03.26.tar.gz",
    batch_size = 10,
    cuda_device = -1
)

df = preprocessor.split_into_sentences(
    df, output_path = None, progress_bar = True
)

In [38]:
srl_pred = srl(df['sentence'], progress_bar=True)

roles, _ = extract_roles(
    srl_pred,
    used_roles = ["ARG0","B-V","B-ARGM-NEG","B-ARGM-MOD","ARG1","ARG2"],
    only_triplets = True,
    progress_bar = True
)

for d in roles[0:20]: print(d)

Running SRL...


100%|██████████| 652/652 [24:51<00:00,  2.29s/it]


Extracting semantic roles...


100%|██████████| 6516/6516 [00:01<00:00, 4842.97it/s]

{'ARG0': 'Cisco Systems', 'B-V': 'Ties', 'ARG1': 'Microsoft CRM to Your Phone System'}
{'ARG0': 'Cisco Systems', 'B-V': 'announced', 'ARG1': 'the Cisco CRM Communications Connector'}
{'ARG0': 'the government', 'B-V': 'set', 'ARG1': 'a date to auction off a majority stake in the company # 39;s'}
{'ARG0': 'the government', 'B-V': 'auction', 'ARG1': 'a majority stake in the company # 39;s'}
{'ARG0': "Astros Recover From Oswalt 's Poor Start ( AP ) AP - Roy Oswalt", 'B-V': 'worry', 'ARG1': "the Houston Astros had n't counted on against the St. Louis Cardinals in the NL championship series"}
{'ARG0': 'Astros', 'B-V': 'counted', 'ARG2': 'against the St. Louis Cardinals', 'B-ARGM-NEG': True}
{'ARG0': 'Tame Bobcats', 'B-V': 'Winning', 'ARG1': 'Streak'}
{'ARG0': 'the Miami Heat', 'B-V': 'scorched', 'ARG1': 'the expansion Charlotte Bobcats'}
{'ARG0': 'the Miami Heat', 'B-V': 'extend', 'ARG1': 'their franchise record win streak', 'ARG2': 'to 14 games'}
{'ARG0': '90 - nanometer chips Advanced Micr
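
Each extracted role is a plain dictionary, so the output can be explored with standard Python before any postprocessing. The sketch below flattens a few of the dictionaries printed above into (agent, verb, patient) triplets and counts the agents. `to_triplet` is a hypothetical helper, and folding negation into the verb is an illustrative choice, not relatio's behavior.

```python
from collections import Counter

# Role dictionaries copied from the SRL output above.
sample_roles = [
    {'ARG0': 'Cisco Systems', 'B-V': 'Ties', 'ARG1': 'Microsoft CRM to Your Phone System'},
    {'ARG0': 'Cisco Systems', 'B-V': 'announced', 'ARG1': 'the Cisco CRM Communications Connector'},
    {'ARG0': 'Astros', 'B-V': 'counted', 'ARG2': 'against the St. Louis Cardinals', 'B-ARGM-NEG': True},
    {'ARG0': 'the Miami Heat', 'B-V': 'scorched', 'ARG1': 'the expansion Charlotte Bobcats'},
    {'ARG0': 'the Miami Heat', 'B-V': 'extend', 'ARG1': 'their franchise record win streak', 'ARG2': 'to 14 games'},
]


def to_triplet(role):
    """Flatten a role dict into (agent, verb, patient); negation is folded into the verb."""
    verb = role.get('B-V', '')
    if role.get('B-ARGM-NEG'):
        verb = 'not ' + verb
    return role.get('ARG0', ''), verb, role.get('ARG1', '')


triplets = [to_triplet(r) for r in sample_roles]
agent_counts = Counter(agent for agent, _, _ in triplets)
print(agent_counts.most_common(2))
print(triplets[2])
```

Roles without an `ARG1` (such as the negated "counted" example) yield an empty patient, which is why missing keys are handled with `dict.get`.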




### TODO 2: Postprocess the retrieved semantic roles

In [None]:
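# TODO 2: Postprocess the retrieved semantic roles
# Hint: use p.process_roles (here the Preprocessor instance is named `preprocessor`)

postproc_roles = preprocessor.process_roles(roles)

# Print the postprocessed roles rather than the raw SRL output
for d in postproc_roles[0:20]: print(d)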


Cleaning phrases for role ARG0...
Cleaning phrases for role B-V...
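
To make the effect of postprocessing concrete, the sketch below applies a simplified version of the cleaning options configured in the `Preprocessor` above (lowercasing, removing digits and punctuation) to two of the extracted roles. It is a toy approximation, not relatio's implementation: it skips lemmatization, and `clean_phrase` and `clean_roles` are hypothetical helpers.

```python
import re
import string


def clean_phrase(phrase, stop_words=()):
    """Lowercase, drop digits and punctuation, remove stop words, collapse whitespace."""
    phrase = phrase.lower()
    phrase = re.sub(r"\d+", " ", phrase)
    phrase = phrase.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    tokens = [t for t in phrase.split() if t not in stop_words]
    return " ".join(tokens)


def clean_roles(roles):
    """Apply clean_phrase to every string-valued role; booleans (e.g. negation) pass through."""
    return [
        {k: clean_phrase(v) if isinstance(v, str) else v for k, v in role.items()}
        for role in roles
    ]


# Role dictionaries copied from the SRL output above.
sample = [
    {'ARG0': 'the Miami Heat', 'B-V': 'extend', 'ARG1': 'their franchise record win streak', 'ARG2': 'to 14 games'},
    {'ARG0': 'Astros', 'B-V': 'counted', 'ARG2': 'against the St. Louis Cardinals', 'B-ARGM-NEG': True},
]
cleaned = clean_roles(sample)
for role in cleaned:
    print(role)
```

The real `process_roles` output will differ wherever lemmatization changes a token.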


### TODO 3: Extract the named entities that can be recognized from the semantic roles.

In [None]:
# TODO 3: Extract the named entities that can be recognized from the semantic roles.
# Hint: use p.mine_entities (here the Preprocessor instance is named `preprocessor`)


### TODO 4: Model the narratives

In [None]:
from relatio.narrative_models import NarrativeModel

# TODO 4: modeling the narratives using NarrativeModel
# Hint: follow the notebook's hyperparameter setting
 

In [None]:
narrative_model.plot_clusters(path = './clusters.pdf')