<a href="https://colab.research.google.com/github/kevin-rn/AISTR-Lab/blob/main/fact_check.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Extract sentences from CNN/DailyMail articles and index them, using a claim detection or evidence sentence selection model to choose which sentences to keep. Treat each model-generated summary as a claim and retrieve the closest sentences from the index. Then use an off-the-shelf stance detection model to verify the summary against the retrieved evidence.


In [1]:
import os
from google.colab import drive
drive.mount('/content/drive')
%cd drive/MyDrive/Grounding_LM/

Mounted at /content/drive
/content/drive/MyDrive/Grounding_LM


In [2]:
%pip install -q transformers
%pip install -q sentence-transformers
%pip install -q -U annoy


In [3]:
from annoy import AnnoyIndex
import ast
from collections import Counter
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, GPT2LMHeadModel, GPT2Tokenizer
import torch
from tqdm.auto import tqdm
import nltk
from nltk.tokenize import sent_tokenize
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import random
import time

nltk.download('punkt')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tqdm.pandas()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Load data

In [None]:
df_t5_cnn = pd.read_csv("results/generated summaries/t5_large_cnn_dailymail.csv", index_col=0)
df_t5_cnn.head()

Unnamed: 0,text,summary,id,generated
0,(CNN)The Palestinian Authority officially beca...,Membership gives the ICC jurisdiction over all...,f001ec5c4704938247d27a44948eebb37ae98d01,The Palestinians have become a member of the I...
1,(CNN)Never mind cats having nine lives. A stra...,"Theia, a bully breed mix, was apparently hit b...",230c522854991d053fe98a718b1defa077a8efef,A dog that was apparently buried alive after b...
2,"(CNN)If you've been following the news lately,...",Mohammad Javad Zarif has spent more time with ...,4495ba8f3a340d97a9df1476f8a35502bcce1f69,It's been a busy week for Iran.
3,(CNN)Five Americans who were monitored for thr...,17 Americans were exposed to the Ebola virus w...,a38e72fed88684ec8d60dd5856282e999dc8c0ca,Five Americans who were being treated for Ebol...
4,(CNN)A Duke student has admitted to hanging a ...,Student is no longer on Duke University campus...,c27cf1b136cc270023de959e7ab24638021bc43f,A student at Duke University has admitted hang...


In [4]:
df_halueval = pd.read_csv('data/halueval/summarization_data.csv')
sampled_df = df_halueval.sample(n=50, random_state=42)
sampled_df['index'] = sampled_df.index
sampled_df.head()

Unnamed: 0,document,right_summary,hallucinated_summary,index
6252,Driving around in their mother's consular BMW ...,"Marc Wabafiyebazu, 15, bragged to officials th...",Brothers Marc and Jean Wabafiyebazu were arres...,6252
4684,Lance Armstrong has said the World Anti-Doping...,WADA director general David Howman said he was...,Lance Armstrong has apologized to the World An...,4684
1731,Andy King thinks his 50th goal for Leicester C...,Andy King scored his 50th goal to earn Leicest...,Leicester City secured a crucial win against W...,1731
4742,West Ham have announced a new five-year multi-...,West Ham have signed a new kit deal with Umbro...,West Ham have announced a partnership with Umb...,4742
4521,"At half-time, everything pointed to another hu...",George Ford scythed through the Leinster defen...,Bath's George Ford scored a hat-trick of tries...,4521


In [5]:
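def tokenize_sentences(df_input):
  # Split each document into sentences with NLTK and store them in a new 'sentences' column (in place)
  df_input['sentences'] = df_input['document'].apply(sent_tokenize)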

In [6]:
# tokenize_sentences(df_t5_cnn)
tokenize_sentences(sampled_df)


### Claim detection

1. Load a pre-trained claim detection model (BERT fine-tuned on the ClaimBuster dataset).
2. Split each source document into sentences using NLTK's `sent_tokenize`.
3. Keep only the claim-worthy sentences.
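
As a rough illustration of step 3, the sketch below filters sentences by the class with the highest logit. It uses plain Python lists in place of model output, and the helper names (`argmax`, `select_by_label`) and toy logits are hypothetical. Treating label index 0 as "claim-worthy" is an assumption carried over from the `== 0` check in this notebook's `extract_claimworthy`; it should be confirmed against the model card.

```python
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Return the index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def select_by_label(
    sentences: List[str],
    logits: List[Sequence[float]],
    target_label: int = 0,  # assumption: index 0 means "claim-worthy"
) -> List[str]:
    """Keep the sentences whose highest-scoring class equals target_label."""
    if len(sentences) != len(logits):
        raise ValueError("need exactly one row of logits per sentence")
    return [s for s, row in zip(sentences, logits) if argmax(row) == target_label]


# Toy values standing in for claim_model(**inputs).logits
sentences = [
    "The court is based in The Hague.",
    "What a day!",
    "The Palestinian Authority became the 123rd member.",
]
toy_logits = [[2.1, -0.3], [-1.0, 0.8], [0.4, 0.1]]
claimworthy = select_by_label(sentences, toy_logits)
print(claimworthy)
```

Working on a list of rows rather than a tensor also sidesteps the 0-d tensor case that the notebook's function has to handle explicitly.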

In [None]:
claim_tokenizer = AutoTokenizer.from_pretrained("Nithiwat/bert-base_claimbuster")
claim_model = AutoModelForSequenceClassification.from_pretrained("Nithiwat/bert-base_claimbuster").to(device)


In [None]:
def extract_claimworthy(sentences):
    tokenized_inputs = claim_tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = claim_model(**tokenized_inputs).logits
        logits = logits.cpu()
    label_indices = torch.nonzero(logits.argmax(dim=1) == 0).squeeze().cpu()
    # Prevent looping over 0d-tensor error.
    if label_indices.dim() == 0:
        label_indices = label_indices.unsqueeze(0)

    claimworthy = [sentences[idx] for idx in label_indices]
    return claimworthy

In [None]:
# df_test['claims'] = df_test['sentences'].progress_apply(extract_claimworthy)
# df_test.to_csv('claims.csv', index=False)

In [None]:
sentences = extract_claimworthy(df_test['sentences'][0])

print(f"evidence: {' '.join(sentences)} \nclaim: {df_test['generated'][0]}")

evidence: The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. These are substantive commitments, which cannot be taken lightly," she said. Prosecutor Fatou Bensouda said her office would "conduct its analysis in full independence and impartiality." 
claim: The Palestinians have become a member of the International Criminal Court (ICC).


In [None]:
df_test['text'][0]

'(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC\'s founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians\' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday\'s ceremony

### Construct Index

1. Load a sentence-transformers model to create text embeddings for sentences and paragraphs.
2. Calculate an embedding for each claimworthy sentence.
3. Store the embeddings in an ANNOY index for retrieval.
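
To make the retrieval step concrete, here is a brute-force sketch of the search that ANNOY approximates. Annoy documents its `angular` metric as the Euclidean distance between normalized vectors, i.e. `sqrt(2 * (1 - cos(u, v)))`. The function names and the 2-dimensional toy vectors are hypothetical; the notebook itself uses 384-dimensional embeddings and Annoy's tree-based approximate search rather than this exhaustive scan.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def angular_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """sqrt(2 * (1 - cos)), clamped at 0 to absorb rounding error."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine(u, v))))


def brute_force_top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k vectors closest to the query, nearest first."""
    order = sorted(range(len(vectors)), key=lambda i: angular_distance(query, vectors[i]))
    return order[:k]


# Toy 2-d "embeddings" standing in for sentence vectors
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
nearest = brute_force_top_k([1.0, 0.1], toy_vectors, k=2)
print(nearest)
```

Because the angular distance is a monotone function of cosine similarity, ranking by one is equivalent to ranking by the other.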

In [7]:
model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2') # 384 dimensional dense vector space


In [8]:
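def index_annoy(df_input, df_name, embedding_dim=384, number_of_trees=100):
  # Build and save one Annoy index per document; item ids are sentence positions.
  os.makedirs(f"data/{df_name}/annoy", exist_ok=True)  # ann.save fails if the folder is missing
  for doc_id, row in df_input.iterrows():
    # Encode all sentences of the document in a single batch
    embeddings = model.encode(row['sentences'])
    ann = AnnoyIndex(embedding_dim, metric="angular")
    for index, embed in enumerate(embeddings):
        ann.add_item(index, embed)
    ann.build(number_of_trees)
    ann.save(f"data/{df_name}/annoy/{doc_id}_{df_name}.annoy")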

In [9]:
index_annoy(sampled_df, 'halueval')
sampled_df.to_csv('data/sample_df.csv', index=False)

# Inference
1. Retrieve the top-k claimworthy source sentences from ANNOY for a given claim (the generated summary).
2. Calculate the cosine similarity between the claim and each retrieved sentence, and keep the sentences above a certain similarity threshold.
3. Load a pre-trained fact-checking model and infer whether each evidence sentence supports, refutes, or is neutral toward the claim.
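
Step 2 can be sketched as a simple threshold filter, mirroring the variant that is left commented out inside `get_top_nn_neighbours`. The function name `filter_by_similarity`, the toy sentences, and the toy scores are hypothetical; the default of 0.7 only echoes the notebook's `beta` default and is not a tuned value.

```python
from typing import List, Sequence


def filter_by_similarity(
    sentences: List[str],
    similarities: Sequence[float],
    threshold: float = 0.7,  # assumption: same value as the notebook's beta default
) -> List[str]:
    """Return sentences scoring above the threshold, most similar first."""
    if len(sentences) != len(similarities):
        raise ValueError("need exactly one similarity score per sentence")
    ranked = sorted(zip(sentences, similarities), key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, sim in ranked if sim > threshold]


# Toy cosine similarities between a claim and three retrieved sentences
retrieved = ["sentence a", "sentence b", "sentence c"]
toy_sims = [0.65, 0.91, 0.74]
evidence = filter_by_similarity(retrieved, toy_sims)
print(evidence)
```

A fixed threshold can return an empty list for claims with no close match, so callers would need to decide how to label a claim that has no evidence.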

### KNN

In [10]:
class KnnSearch:
    def __init__(self, emb_dim=384):
        self.emb_dim = emb_dim
        # Use the configured dimension instead of a hard-coded 384
        self.annoy = AnnoyIndex(emb_dim, metric="angular")
        self.model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L6-v2')

    def get_embeddings_for_data(self, data_ls):
        embeddings = self.model.encode(data_ls)
        return embeddings

    def standardize_normalize_cosine_similarities(self, cosine_similarities):
        cosine_sims_norm = (cosine_similarities - np.min(cosine_similarities)) / (np.max(cosine_similarities) - np.min(cosine_similarities))
        cosine_sims_norm = 0.5 + (cosine_sims_norm - np.mean(cosine_sims_norm)) / np.std(cosine_sims_norm)
        return cosine_sims_norm

    def max_normalize_cosine_similarities(self, cosine_similarities):
        return 1 / np.max(cosine_similarities) * cosine_similarities.squeeze(axis=1)

    def max_normalize_cosine_similarities_pairwise(self, cosine_similarities):
        cosine_sims_norm = np.copy(cosine_similarities)
        # Mask self-similarity so the nan-aware statistics below ignore the diagonal (np.NaN was removed in NumPy 2.0)
        np.fill_diagonal(cosine_sims_norm, np.nan)
        cosine_sims_norm = (cosine_sims_norm - np.nanmin(cosine_sims_norm, axis=0)) / (np.nanmax(cosine_sims_norm, axis=0) - np.nanmin(cosine_sims_norm, axis=0))
        cosine_sims_norm = 0.5 + (cosine_sims_norm - np.nanmean(cosine_sims_norm, axis=0)) / np.nanstd(cosine_sims_norm, axis=0)
        return cosine_sims_norm

    def get_top_nn_neighbours(self, df_name, df_input, df_index, k=15, beta=0.7):
        annoy_index = df_input['index'][df_index]
        self.annoy.load(f"data/{df_name}/annoy/{annoy_index}_{df_name}.annoy")

        query_sentence = df_input['right_summary'][df_index]
        new_emb = self.model.encode(query_sentence)

        top_matches = self.annoy.get_nns_by_vector(new_emb, k)
        evidence_sentences =  [df_input["sentences"][df_index][i] for i in top_matches]
        evidence_embeddings = self.get_embeddings_for_data(evidence_sentences)
        # top_sentences = []
        # for idx, similarity in sorted(enumerate(text_sims[0]), key=lambda x: x[1], reverse=True):
        #     if similarity > beta:
        #       top_sentences.append(evidence_sentences[idx])
        # return top_sentences

        text_sims = cosine_similarity(evidence_embeddings,[new_emb]).tolist()
        candidate_sims = cosine_similarity(evidence_embeddings)
        text_sims_norm = self.standardize_normalize_cosine_similarities(text_sims)
        phrase_sims_norm = self.max_normalize_cosine_similarities_pairwise(candidate_sims)

        selected_data_indices = []
        data_len = len(evidence_sentences)
        unselected_data_indices = list(range(data_len))

        best_idx = np.argmax(text_sims)
        selected_data_indices.append(best_idx)
        unselected_data_indices.remove(best_idx)

        # Select top N data
        for _ in range(min(data_len, k) - 1):
            unselected_data_distances_to_text = text_sims_norm[unselected_data_indices, :]
            unselected_data_distances_pairwise = phrase_sims_norm[unselected_data_indices][:,selected_data_indices]
            # if dimension of data distances is 1 we add additional axis to the end
            if unselected_data_distances_pairwise.ndim == 1:
                unselected_data_distances_pairwise = np.expand_dims(unselected_data_distances_pairwise, axis=1)

            # find new candidate with MMR retrieval
            idx = int(np.argmax(beta * unselected_data_distances_to_text - (1 - beta) * np.max(unselected_data_distances_pairwise, axis=1).reshape(-1, 1)))
            best_idx = unselected_data_indices[idx]

            # select new best phrase and update selected/unselected phrase indices list
            selected_data_indices.append(best_idx)
            unselected_data_indices.remove(best_idx)

        # Map the selected indices back to sentences after the loop, so a single candidate is handled too
        top_sent = [evidence_sentences[i] for i in selected_data_indices]
        return top_sent
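
The selection loop above follows the MMR (maximal marginal relevance) pattern: start with the most relevant candidate, then repeatedly add the candidate that best balances relevance to the claim against redundancy with what is already selected. Below is a stripped-down sketch of that loop on plain lists. It deliberately omits the normalization steps used in `KnnSearch`, and the function name and toy similarity values are hypothetical.

```python
from typing import List, Sequence


def mmr_select(
    query_sims: Sequence[float],
    pairwise_sims: Sequence[Sequence[float]],
    top_n: int,
    beta: float = 0.7,
) -> List[int]:
    """Greedy MMR: pick indices trading relevance (beta) against redundancy (1 - beta)."""
    n = len(query_sims)
    if n == 0 or top_n <= 0:
        return []
    # Seed with the candidate most similar to the query
    selected = [max(range(n), key=lambda i: query_sims[i])]
    remaining = [i for i in range(n) if i != selected[0]]
    while remaining and len(selected) < top_n:
        def score(i: int) -> float:
            # Redundancy is the similarity to the closest already-selected item
            redundancy = max(pairwise_sims[i][j] for j in selected)
            return beta * query_sims[i] - (1 - beta) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
toy_query_sims = [0.9, 0.85, 0.5]
toy_pairwise = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
diverse = mmr_select(toy_query_sims, toy_pairwise, top_n=2, beta=0.5)
relevant_only = mmr_select(toy_query_sims, toy_pairwise, top_n=2, beta=1.0)
print(diverse, relevant_only)
```

With `beta=1.0` the redundancy term vanishes and the selection reduces to plain ranking by similarity to the claim.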


### FactCheck

In [18]:
knn = KnnSearch()
checkpoint = 'Dzeniks/roberta-fact-check'
factcheck_model = AutoModelForSequenceClassification.from_pretrained(checkpoint).to(device)  # keep model and inputs on the same device
factcheck_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
label_mapping = ['support', 'refute', 'neutral']

def fact_check(claim, evidences):
    factcheck_model.eval()
    labels = []
    for evidence in evidences:
      features = factcheck_tokenizer.encode_plus(claim, evidence, truncation=True, return_tensors="pt", max_length=512).to(device)
      print(features)
      with torch.no_grad():
        prediction = factcheck_model(**features).logits
        logits = prediction.cpu().numpy()
        result = label_mapping[logits.argmax().item()]
        labels.append(result)
    return labels
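
`fact_check` returns one label per evidence sentence, and the cells that follow reduce these to a single verdict with `Counter(labels).most_common(1)`. That call breaks ties by first-encountered order, which may hide an undecided vote. The sketch below is one possible alternative that makes ties and empty evidence explicit; the function name and the `tie_label` parameter are hypothetical and not part of the notebook.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(labels: List[str], tie_label: Optional[str] = None) -> Optional[str]:
    """Return the most frequent label, or tie_label if there is no clear winner."""
    if not labels:
        return tie_label  # no evidence was retrieved
    counts = Counter(labels).most_common()
    # A tie exists when the two highest counts are equal
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]


verdict = majority_vote(["support", "refute", "refute"])
undecided = majority_vote(["support", "refute"], tie_label="neutral")
print(verdict, undecided)
```

Whether a tie should map to a neutral label or to a separate "undecided" outcome is a design choice that depends on how the stance column is evaluated downstream.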

In [19]:
# Load sampled Halueval data
sampled_df = pd.read_csv('data/sample_df.csv')
sampled_df['sentences'] = sampled_df['sentences'].apply(ast.literal_eval)

# Get closest evidence sentences and infer labels support or refute
results = knn.get_top_nn_neighbours(df_name='halueval', df_input=sampled_df, df_index=2, k=15, beta=0.7)
labels = fact_check(sampled_df['right_summary'][2], results)

# Majority vote
vote_counts = Counter(labels)
majority_vote = vote_counts.most_common(1)[0][0]
print(f"Final label: {majority_vote}\n evidence: {results}\n claim: {sampled_df['right_summary'][2]}")

{'input_ids': tensor([[    0, 32743,  1745,  1008,    39,   654,   212,   724,     7,  4073,
          9035,   130,   332,     4,  5441, 26528, 26955,   118, 21413,    18,
           724,    56,    57,  8102,    66,    30,  3576,  1758, 27866,   219,
           877,     4, 16734, 16116, 28506,    39,  1175,  7293,   338,    54,
            37,  1146,     9,     5,  3638,     4, 50118,     2,     2, 32743,
          1745,  4265,    39,   654,   212,   724,    13,  9035,   412,   115,
          3364,     7,    28,    39,   144,   505,   648,     4,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[    0, 32743,  1745,  1008,    39,   654,   212,   724,     7,  4073,
          9035,   130,   332,     4,  5441, 26528, 26955,   118, 21413,    18,
          

In [15]:
stances, times = [], []
for idx in sampled_df.index:
    start_time = time.time()
    # Retrieve sentences
    top_sent = knn.get_top_nn_neighbours(df_name='halueval', df_input=sampled_df, df_index=idx)
    # Use sentences as evidence and summary as claim for factchecking
    labels = fact_check(sampled_df['right_summary'][idx], top_sent)
    # Majority vote on labels
    vote_counts = Counter(labels)
    majority_vote = vote_counts.most_common(1)[0][0]
    stances.append(majority_vote)
    elapsed_time = time.time() - start_time
    times.append(elapsed_time)

sampled_df['stance'] = stances
sampled_df['inference_time'] = times
sampled_df

Unnamed: 0,document,right_summary,hallucinated_summary,index,sentences,stance,inference_time
0,Driving around in their mother's consular BMW ...,"Marc Wabafiyebazu, 15, bragged to officials th...",Brothers Marc and Jean Wabafiyebazu were arres...,6252,[Driving around in their mother's consular BMW...,refute,15.213089
1,Lance Armstrong has said the World Anti-Doping...,WADA director general David Howman said he was...,Lance Armstrong has apologized to the World An...,4684,[Lance Armstrong has said the World Anti-Dopin...,refute,5.039494
2,Andy King thinks his 50th goal for Leicester C...,Andy King scored his 50th goal to earn Leicest...,Leicester City secured a crucial win against W...,1731,[Andy King thinks his 50th goal for Leicester ...,refute,6.339607
3,West Ham have announced a new five-year multi-...,West Ham have signed a new kit deal with Umbro...,West Ham have announced a partnership with Umb...,4742,[West Ham have announced a new five-year multi...,refute,4.814542
4,"At half-time, everything pointed to another hu...",George Ford scythed through the Leinster defen...,Bath's George Ford scored a hat-trick of tries...,4521,"[At half-time, everything pointed to another h...",refute,7.709097
5,Veteran airman Andrew Danziger claims to have ...,Andrew Danziger flew President Obama during 20...,"Andrew Danziger, a veteran pilot, has witnesse...",6340,[Veteran airman Andrew Danziger claims to have...,refute,5.70843
6,"Shortly after being elected chief prosecutor, ...",Prosecutor Marilyn Mosby has only been on the ...,"Marilyn Mosby, Baltimore's chief prosecutor, i...",576,"[Shortly after being elected chief prosecutor,...",refute,4.450048
7,Pictures of the Australian man reportedly faci...,An Australian man has reportedly been sentence...,An Australian man was caught trying to smuggle...,5202,[Pictures of the Australian man reportedly fac...,refute,6.575221
8,Oregon-based defense contractor FLIR Systems I...,FLIR Systems Inc. has agreed to pay $9.5 milli...,FLIR Systems Inc. has been accused of giving l...,6363,[Oregon-based defense contractor FLIR Systems ...,refute,7.438112
9,Sawyer Sweeten grew up before the eyes of mill...,Sawyer Sweeten played across from his twin bro...,Child star Sawyer Sweeten tragically took his ...,439,[Sawyer Sweeten grew up before the eyes of mil...,refute,3.675971


In [17]:
sampled_df.values[:2]

array([["Driving around in their mother's consular BMW in Miami to shoot up homes and steal hordes of marijuana: this was the indulgent life of two wealthy teenage sons of a Canadian diplomat. But the antics of 15-year-old Marc and 17-year-old Jean Wabafiyebazu were brought to a sudden halt on March 30 when a drugs raid went wrong and the older brother was shot dead. Marc is waiting to discover if he will be charged with his murder. Now, as the details of that night unfold, investigators are learning more about the gangster lifestyle they led - despite attending top private schools bankrolled by their mother Roxanne Dube, Canada's Consul General in Miami. Indulgent life: Jean Wabafiyebazu, 17, (left) and his 15-year-old brother Marc (right) drove around Canada raiding homes and buying drugs, the younger boy has told investigators. Bloody aftermath: The boys and a friend had driven to a house, pictured here with blood on the floor, to reportedly purchase two pounds of marijuana for $5,0