# Assignment 4

**Due date**: TBD

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Fact checking, Natural Language Inference (**NLI**)

# Intro

This assignment is centred on an emerging NLP task known as **fact checking** (or fake checking). As AI techniques grow more powerful and achieve impressive results in tasks such as image and text generation, tools that can distinguish what is real from what is fake are needed more than ever.

Here we focus on a small portion of the whole fact checking problem: determining whether a given statement (fact) conveys trustworthy information or not.

More precisely, given a set of evidence sentences and a fact to verify, we would like our model to correctly predict whether the fact is true or fake.

In particular, we will see:

*   Dataset preparation (analysis and pre-processing)
*   Problem formulation: multi-input binary classification
*   Defining an evaluation method
*   Simple sentence embedding
*   Neural building blocks
*   Neural architecture extension

# The FEVER dataset

First of all, we need to choose a dataset. In this assignment we will rely on the [FEVER dataset](https://fever.ai).

The dataset consists of facts taken from Wikipedia documents that have to be verified. Some facts were manually modified, either to create fake information or to give a different formulation of the same concept.

The dataset consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as ```Supported```, ```Refuted``` or ```NotEnoughInfo```. For the first two classes, systems and annotators need to also return the combination of sentences forming the necessary evidence supporting or refuting the claim.

## Dataset structure

Relevant data is divided into two file types. Information concerning the fact to verify, its verdict and the associated supporting/opposing statements is stored in **.jsonl** format. In particular, each JSON element maps to a Python dictionary with the following relevant fields:

*    **ID**: ID associated to the fact to verify.

*    **Verifiable**: whether the fact has been verified or not: ```VERIFIABLE``` or ```NOT VERIFIABLE```.
    
*    **Label**: the final verdict on the fact to verify: ```SUPPORTS```, ```REFUTES``` or ```NOT ENOUGH INFO```.
    
*    **Claim**: the fact to verify.
    
*    **Evidence**: a nested list pairing document IDs with the IDs of the sentences associated with the fact to verify. In particular, each innermost list element is a tuple of four elements: the first two are internal annotator IDs that can be safely ignored; the third is the document ID (called URL) and the last is the number (ID) of the sentence to consider in that document.

**Some Examples**

---

**Verifiable**

{"id": 202314, "verifiable": "VERIFIABLE", "label": "REFUTES", "claim": "The New Jersey Turnpike has zero shoulders.", "evidence": [[[238335, 240393, "New_Jersey_Turnpike", 15]]]}

---

**Not Verifiable**

{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}

---
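
As a rough illustration of the record layout described above, the following sketch parses the verifiable example with the standard `json` module and extracts the (document ID, sentence number) pairs. The helper name `extract_evidence_pointers` is hypothetical and not part of any FEVER tooling.

```python
import json

# One line of a FEVER .jsonl file (copied from the verifiable example above).
line = ('{"id": 202314, "verifiable": "VERIFIABLE", "label": "REFUTES", '
        '"claim": "The New Jersey Turnpike has zero shoulders.", '
        '"evidence": [[[238335, 240393, "New_Jersey_Turnpike", 15]]]}')

def extract_evidence_pointers(record):
    """Return (document ID, sentence number) pairs, skipping empty pointers."""
    pointers = []
    for evidence_group in record["evidence"]:
        # The first two items are annotator IDs and are ignored here.
        for _annotation_id, _evidence_id, doc_id, sent_id in evidence_group:
            if doc_id is not None:
                pointers.append((doc_id, sent_id))
    return pointers

record = json.loads(line)
pointers = extract_evidence_pointers(record)
print(record["label"], pointers)
```

Non-verifiable records such as the second example carry `null` pointers, so the helper returns an empty list for them.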

## Some simplifications and pre-processing

We are only interested in verifiable facts. Thus, we can filter out all non-verifiable claims.

Additionally, the current dataset format does not contain all necessary information for our classification purposes. In particular, we need to download Wikipedia documents and replace reported evidence IDs with the corresponding text.

Don't worry about that! We provide you with the pre-processed dataset so that you can concentrate on the classification pipeline (pre-processing, model definition, evaluation and training).

You can download the zip file containing all set splits (train, validation and test) of the FEVER dataset by clicking on this [link](https://drive.google.com/file/d/1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1/view?usp=sharing). Alternatively, run the code cell below to download it automatically in this notebook.

**Note**: each dataset split is in .csv format. Feel free to inspect the whole dataset!

In [None]:
import os
import requests
import zipfile

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

def download_data(data_path):
    toy_data_path = os.path.join(data_path, 'fever_data.zip')
    toy_data_url_id = "1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1"
    toy_url = "https://docs.google.com/uc?export=download"

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(toy_data_path):
        print("Downloading FEVER data splits...")
        with requests.Session() as current_session:
            response = current_session.get(toy_url,
                                           params={'id': toy_data_url_id},
                                           stream=True)
            # consume the streamed body while the session is still open
            save_response_content(response, toy_data_path)
        print("Download completed!")

        print("Extracting dataset...")
        with zipfile.ZipFile(toy_data_path) as loaded_zip:
            loaded_zip.extractall(data_path)
        print("Extraction completed!")

download_data('dataset')

Downloading FEVER data splits...
Download completed!
Extracting dataset...
Extraction completed!


# Classification dataset

At this point, you should have a ready-to-go dataset! Note that the dataset format has changed as well: we split the evidence set associated with each claim in order to build (claim, evidence) pairs. The classification label is propagated to each pair.

We'll motivate this decision in the next section!

Just for clarity, here's an example of the pre-processed dataset:

---

**Claim**: "Wentworth Miller is yet to make his screenwriting debut."

**Evidence**: "2	He made his screenwriting debut with the 2013 thriller film Stoker .	Stoker	Stoker (film)"

**Label**: Refutes

---

**Note**: The dataset requires some text cleaning as you may notice!


# Problem formulation

As mentioned at the beginning of the assignment, we are going to formulate the fact checking problem as a binary classification task.

In particular, each dataset sample is comprised of:

*     A claim to verify
*     A set of semantically related statements (evidence set)
*     Fact checking label: the evidence either supports or refutes the claim.

Handling the evidence set as a whole adds complexity for neural models: if the evidence set comprises several sentences, we might run into memory problems.

To this end, we further simplify the problem by building (claim, evidence) pairs. The fact checking label is propagated as well.

Example:

     Claim: c1 
     Evidence set: [e1, e2, e3]
     Label: S (support)

--->

    (c1, e1, S),
    (c1, e2, S),
    (c1, e3, S)
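
The expansion above can be sketched in a few lines of Python. The function name `expand_to_pairs` is an illustrative assumption; the provided .csv splits already contain the expanded pairs, so this only shows the idea.

```python
def expand_to_pairs(claim, evidence_set, label):
    """Turn one (claim, evidence set, label) sample into (claim, evidence, label) triples."""
    # The label of the claim is copied unchanged onto every pair.
    return [(claim, evidence, label) for evidence in evidence_set]

pairs = expand_to_pairs("c1", ["e1", "e2", "e3"], "S")
for pair in pairs:
    print(pair)
```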

## Schema

The overall binary classification problem is summed up by the following (simplified) schema

![](https://drive.google.com/uc?export=view&id=1Wm_YBnFwgJtxcWEBpPbTBEVkpKaL08Jp)

Don't worry too much about the **Encoding** block for now. We'll give you some simple guidelines about its definition. For the moment, stick to the binary classification task definition where, in this case, we have 2 inputs: the claim to verify and one of its associated evidence sentences.

# Architecture Guidelines

There are many neural architectures that follow the above schema. To help you avoid writer's block, this section gives you some implementation guidelines.

In particular, we would like you to test some implementations so that you explore basic approaches (neural baselines) and use them as building blocks for possible extensions.

## Handling multiple inputs

The first thing to notice is that we are in a multi-input scenario. In particular, each sample comprises a fact and its associated evidence statement.

Each of these inputs is encoded as a sequence of tokens. In particular, we will have the following input matrices:

*    Claim: [batch_size, max_tokens]
*    Evidence: [batch_size, max_tokens]

Moreover, after the embedding layer, we'll have:

*    Claim: [batch_size, max_tokens, embedding_dim]
*    Evidence: [batch_size, max_tokens, embedding_dim]

However, we would like to have a 2D input to our classifier, since we have to give an answer at pair level. Therefore, for each sample, we would expect the following input shape to our classification block:

*   Classification input shape: [batch_size, dim]

**How to do that?**

We inherently need to reduce the token sequence to a single representation. This operation is formally known as **sentence embedding**. Indeed, we are trying to compress the information of a whole sequence into a single embedding vector.

Here are some simple solutions that we ask you to try out:

*   Encode token sequences via an RNN and take the last state as the sentence embedding.

*   Encode token sequences via an RNN and average all the output states.

*   Encode token sequences via a simple MLP layer. In particular, if your input is a [batch_size, max_tokens, embedding_dim] tensor, the matrix multiplication works on the **max_tokens** dimension, resulting in a [batch_size, embedding_dim] 2D matrix.

*   Compute the sentence embedding as the mean of its token embeddings (**bag of vectors**).
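
To make the shapes concrete, here is a minimal NumPy sketch of two of the non-recurrent options listed above: the bag of vectors and the reduction over the **max_tokens** dimension. It is not a Keras implementation. The padding mask and the single weight vector of shape [max_tokens] are assumptions made for illustration; in a real model the weights would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_tokens, embedding_dim = 2, 4, 3

# Toy token embeddings, as they would come out of an embedding layer.
token_embeddings = rng.normal(size=(batch_size, max_tokens, embedding_dim))
# 1 marks a real token, 0 marks padding (the second sentence has 2 real tokens).
mask = np.array([[1, 1, 1, 1],
                 [1, 1, 0, 0]], dtype=float)

def mean_pooling(embeddings, mask):
    """Bag of vectors: average token embeddings, ignoring padded positions."""
    summed = (embeddings * mask[:, :, None]).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1.0)
    return summed / counts

def token_axis_projection(embeddings, weights):
    """MLP-style reduction: one weight per position collapses max_tokens."""
    # weights has shape [max_tokens]; the contraction runs over the token axis.
    return np.einsum("bte,t->be", embeddings, weights)

bov = mean_pooling(token_embeddings, mask)
mlp = token_axis_projection(token_embeddings, rng.normal(size=max_tokens))
print(bov.shape, mlp.shape)  # both are [batch_size, embedding_dim]
```

Both reductions return one vector per sentence, which is the 2D shape the classification block expects.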

## Merging multi-inputs

At this point, we have to think about **how** we should merge evidence and claim sentence embeddings.

For simplicity, we stick to simple merging strategies:

*     **Concatenation**: define the classification input as the concatenation of evidence and claim sentence embeddings

*     **Sum**: define the classification input as the sum of evidence and claim sentence embeddings

*     **Mean**: define the classification input as the mean of evidence and claim sentence embeddings

For clarity, if the sentence embedding of a single input has shape [batch_size, embedding_dim], then the classification input has shape:

*     **Concatenation**: [batch_size, 2 * embedding_dim]

*     **Sum**: [batch_size, embedding_dim]

*     **Mean**: [batch_size, embedding_dim]
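
The three merging strategies and their output shapes can be checked with a small NumPy sketch. The function name `merge` and the toy values are illustrative assumptions; in a Keras model the same operations would normally be done with merge layers.

```python
import numpy as np

batch_size, embedding_dim = 2, 3
claim_emb = np.ones((batch_size, embedding_dim))
evidence_emb = np.full((batch_size, embedding_dim), 3.0)

def merge(claim, evidence, strategy):
    """Combine two [batch_size, embedding_dim] sentence embeddings."""
    if strategy == "concat":
        return np.concatenate([claim, evidence], axis=-1)
    if strategy == "sum":
        return claim + evidence
    if strategy == "mean":
        return (claim + evidence) / 2.0
    raise ValueError(f"unknown strategy: {strategy}")

for strategy in ("concat", "sum", "mean"):
    print(strategy, merge(claim_emb, evidence_emb, strategy).shape)
```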

# A simple extension

Lastly, we ask you to modify the previously defined neural architectures by adding one more feature to the classification input.

We would like to see whether some similarity information between the claim to verify and one of its associated evidence sentences might be useful to the classification.

Compute the cosine similarity metric between the two sentence embeddings and concatenate the result to the classification input.

For clarity, since the cosine similarity of two vectors outputs a scalar value, the classification input shape is modified as follows:

*     **Concatenation**: [batch_size, 2 * embedding_dim + 1]

*     **Sum**: [batch_size, embedding_dim + 1]

*     **Mean**: [batch_size, embedding_dim + 1]
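
A possible NumPy sketch of this extension, shown for the concatenation case, follows. The `eps` guard against zero-norm vectors is an assumption added here for numerical safety; it is not required by the assignment.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity; returns shape [batch_size, 1]."""
    dot = (a * b).sum(axis=-1, keepdims=True)
    norms = (np.linalg.norm(a, axis=-1, keepdims=True)
             * np.linalg.norm(b, axis=-1, keepdims=True))
    return dot / np.maximum(norms, eps)

claim_emb = np.array([[1.0, 0.0], [1.0, 1.0]])
evidence_emb = np.array([[0.0, 2.0], [2.0, 2.0]])

merged = np.concatenate([claim_emb, evidence_emb], axis=-1)  # [batch_size, 2 * embedding_dim]
similarity = cosine_similarity(claim_emb, evidence_emb)      # [batch_size, 1]
classifier_input = np.concatenate([merged, similarity], axis=-1)
print(classifier_input.shape)  # [batch_size, 2 * embedding_dim + 1]
```

The first pair is orthogonal (similarity 0) and the second is parallel (similarity 1), which makes the extra column easy to inspect.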



# Performance evaluation

Due to our simplifications, the results obtained are not directly comparable with those of a traditional fact checking method that considers the evidence set as a whole.

Thus, we need to consider two types of evaluations.

**Multi-input classification evaluation**

This type of evaluation is the easiest: it consists of computing evaluation metrics, such as accuracy, F1-score, recall and precision, on our pre-processed dataset.

In other words, we assess the performance of chosen classifiers.
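
For reference, these pair-level metrics can be computed by hand as in the sketch below. It assumes labels are encoded as 1 for `SUPPORTS` and 0 for `REFUTES`; the function name `binary_metrics` is hypothetical. In practice a library such as scikit-learn provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```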

**Claim verification evaluation**

However, if we want to give an answer concerning the claim itself, we need to consider the whole evidence set. 

Intuitively, for a given claim, we consider all its corresponding (claim, evidence) pairs and their corresponding classification outputs. 

At this point, all we need to do is to compute the final predicted claim label via majority voting.

Example:

    Claim: c1
    Evidence set: e1, e2, e3
    True label: S

    Pair outputs:
    (c1, e1) -> S (supports)
    (c1, e2) -> S (supports)
    (c1, e3) -> R (refutes)

    Majority voting:
    S -> 2 votes
    R -> 1 vote

    Final label:
    c1 -> S

Lastly, we have to compute classification metrics just like before.
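
A minimal sketch of the majority voting step is given below. It assumes each pair prediction comes with the ID of the claim it belongs to. The function name `majority_vote` is illustrative, and ties are resolved in favour of the label seen first, which is an arbitrary choice.

```python
from collections import Counter, defaultdict

def majority_vote(claim_ids, pair_predictions):
    """Aggregate pair-level predictions into one label per claim."""
    votes = defaultdict(list)
    for claim_id, prediction in zip(claim_ids, pair_predictions):
        votes[claim_id].append(prediction)
    # most_common(1) returns the label with the most votes;
    # on a tie, the label encountered first wins.
    return {cid: Counter(preds).most_common(1)[0][0] for cid, preds in votes.items()}

claim_ids = ["c1", "c1", "c1", "c2"]
pair_predictions = ["S", "S", "R", "R"]
claim_labels = majority_vote(claim_ids, pair_predictions)
print(claim_labels)
```

The resulting per-claim labels can then be compared with the true claim labels using the same classification metrics as in the pair-level evaluation.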

# Tips and Extras

## Extensions are welcome!

Is this task too easy for you? Are you curious to try out things you have seen during lectures (e.g. attention)? Feel free to try everything you want!

Don't forget to try neural baselines first!

## Comments and documentation

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

## Organization

We suggest dividing your work into sections. This lets you write clean, modular code that is easy to read and debug.

A possible schema:

*   Dataset pre-processing
*   Dataset conversion
*   Model definition
*   Training
*   Evaluation
*   Comments/Summary

# Contact

For any doubts, questions or issues, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

Don't forget that your feedback is very important! Your suggestions help us improve the course material.

# FAQ

---

**Q: Can I do some text pre-processing?**

**A:** You have to! If you check text data, the majority of sentences need some cleaning.

---

**Q: I'm struggling with the implementation. Can you help me?**

**A:** Yes, sure! Write us an email about your issue. If you are looking for a particular type of operation, you can easily check the documentation of the deep learning framework you are using (Google is your friend).

---

**Q: Can I try other encoding strategies or neural architectures?**

**A:** Absolutely! Remember to try out the recommended neural baselines first and only then proceed with your extensions.

---

## Dataset pre-processing

### Dataset loading and inspection

In [None]:
import pandas as pd
import numpy as np

# wider pandas columns
pd.options.display.max_colwidth = 1000

train_df = pd.read_csv("dataset/train_pairs.csv")
val_df   = pd.read_csv("dataset/val_pairs.csv")
test_df  = pd.read_csv("dataset/test_pairs.csv")

In [None]:
# inspect dataset
train_df.head(-1)

Unnamed: 0.1,Unnamed: 0,Claim,Evidence,ID,Label
0,0,Chris Hemsworth appeared in A Perfect Getaway.,"2\tHemsworth has also appeared in the science fiction action film Star Trek -LRB- 2009 -RRB- , the thriller adventure A Perfect Getaway -LRB- 2009 -RRB- , the horror comedy The Cabin in the Woods -LRB- 2012 -RRB- , the dark-fantasy action film Snow White and the Huntsman -LRB- 2012 -RRB- , the war film Red Dawn -LRB- 2012 -RRB- , and the biographical sports drama film Rush -LRB- 2013 -RRB- .\tStar Trek\tStar Trek (film)\tA Perfect Getaway\tA Perfect Getaway\tThe Cabin in the Woods\tThe Cabin in the Woods\tSnow White and the Huntsman\tSnow White and the Huntsman\tRed Dawn\tRed Dawn (2012 film)\tRush\tRush (2013 film)",3,SUPPORTS
1,1,Roald Dahl is a writer.,"0\tRoald Dahl -LRB- -LSB- langpronˈroʊ.əld _ ˈdɑːl -RSB- , -LSB- ˈɾuːɑl dɑl -RSB- ; 13 September 1916 -- 23 November 1990 -RRB- was a British novelist , short story writer , poet , screenwriter , and fighter pilot .\tfighter pilot\tfighter pilot",7,SUPPORTS
2,2,Roald Dahl is a governor.,"0\tRoald Dahl -LRB- -LSB- langpronˈroʊ.əld _ ˈdɑːl -RSB- , -LSB- ˈɾuːɑl dɑl -RSB- ; 13 September 1916 -- 23 November 1990 -RRB- was a British novelist , short story writer , poet , screenwriter , and fighter pilot .\tfighter pilot\tfighter pilot",8,REFUTES
3,3,Ireland has relatively low-lying mountains.,"10\tThe island 's geography comprises relatively low-lying mountains surrounding a central plain , with several navigable rivers extending inland .\tisland\tisland\tgeography\tgeography\tseveral navigable rivers\tRivers of Ireland",9,SUPPORTS
4,4,Ireland does not have relatively low-lying mountains.,"10\tThe island 's geography comprises relatively low-lying mountains surrounding a central plain , with several navigable rivers extending inland .\tisland\tisland\tgeography\tgeography\tseveral navigable rivers\tRivers of Ireland",10,REFUTES
...,...,...,...,...,...
121734,121734,Anderson Silva is a former UFC heavyweight Champion.,"0\tAnderson da Silva -LRB- -LSB- ˈɐ̃deʁsõ ˈsiwvɐ -RSB- ; born April 14 , 1975 -RRB- is a Brazilian mixed martial artist and former UFC Middleweight Champion .\tMiddleweight\tMiddleweight (MMA)\tmixed martial artist\tmixed martial arts\tUFC Middleweight Champion\tUFC Middleweight Championship",229439,REFUTES
121735,121735,April was the month Anderson Silva was born.,"0\tAnderson da Silva -LRB- -LSB- ˈɐ̃deʁsõ ˈsiwvɐ -RSB- ; born April 14 , 1975 -RRB- is a Brazilian mixed martial artist and former UFC Middleweight Champion .\tMiddleweight\tMiddleweight (MMA)\tmixed martial artist\tmixed martial arts\tUFC Middleweight Champion\tUFC Middleweight Championship",229440,SUPPORTS
121736,121736,Anderson Silva is an American Brazilian mixed martial artist.,"0\tAnderson da Silva -LRB- -LSB- ˈɐ̃deʁsõ ˈsiwvɐ -RSB- ; born April 14 , 1975 -RRB- is a Brazilian mixed martial artist and former UFC Middleweight Champion .\tMiddleweight\tMiddleweight (MMA)\tmixed martial artist\tmixed martial arts\tUFC Middleweight Champion\tUFC Middleweight Championship",229443,REFUTES
121737,121737,Anderson Silva is incapable of being a Brazilian mixed martial artist.,"0\tAnderson da Silva -LRB- -LSB- ˈɐ̃deʁsõ ˈsiwvɐ -RSB- ; born April 14 , 1975 -RRB- is a Brazilian mixed martial artist and former UFC Middleweight Champion .\tMiddleweight\tMiddleweight (MMA)\tmixed martial artist\tmixed martial arts\tUFC Middleweight Champion\tUFC Middleweight Championship",229444,REFUTES


### Preprocessing
Now we are going to apply some preprocessing to the dataset.

In [None]:
# examine evidences
print(train_df["Evidence"][0])
print(train_df["Evidence"][993])

2	Hemsworth has also appeared in the science fiction action film Star Trek -LRB- 2009 -RRB- , the thriller adventure A Perfect Getaway -LRB- 2009 -RRB- , the horror comedy The Cabin in the Woods -LRB- 2012 -RRB- , the dark-fantasy action film Snow White and the Huntsman -LRB- 2012 -RRB- , the war film Red Dawn -LRB- 2012 -RRB- , and the biographical sports drama film Rush -LRB- 2013 -RRB- .	Star Trek	Star Trek (film)	A Perfect Getaway	A Perfect Getaway	The Cabin in the Woods	The Cabin in the Woods	Snow White and the Huntsman	Snow White and the Huntsman	Red Dawn	Red Dawn (2012 film)	Rush	Rush (2013 film)
25	She has appeared in Time 100 most influential people in the world -LRB- 2010 and 2015 -RRB- , Forbes top-earning women in music -LRB- 2011 -- 2015 -RRB- , Forbes 100 most powerful women -LRB- 2015 -RRB- , and Forbes Celebrity 100 -LRB- 2016 -RRB- .	Time	Time (magazine)	100 most influential people in the world	Time 100	Forbes	Forbes	100 most powerful women	The World's 100 Most Powerfu

In [None]:
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords
try:
    STOPWORDS = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    return ' '.join([x for x in text.split() if x and x not in STOPWORDS])

def remove_leading_tabs(text) :
  # remove the leading sentence ID and the tab that follows it
  pattern = r'^[0-9]+\t'
  return re.sub(pattern, '', text)

def remove_pronunciations(text) :
  # remove pronunciations
  pattern = r'-LSB-.*?-RSB-(\s;)*?'
  return re.sub(pattern, '', text)

def convert_round_brackets(text) :
  # convert -LRB- -RRB- to ( or )
  pattern = r'-LRB-'
  text = re.sub(pattern, '(', text)
  pattern = r'-RRB-'
  return re.sub(pattern, ')', text)

def fix_double_dashes(text) :
  # fix: double dashes (--)
  pattern = r'\-\-'
  return re.sub(pattern, '-', text)

def remove_trailing_words(text) :
  # remove trailing words (hyperlinks)
  pattern = r'.\t.*?$'
  return re.sub(pattern, '.', text)

def split_genitive(text) :
  # make sure all possessive 's are split from other words (e.g. "island's" -> "island 's")
  pattern = r"(\S)'s\b"
  return re.sub(pattern, r"\1 's", text)

def split_periods(text) :
  pattern = r'(\s.+?)\.'
  return re.sub(pattern, r'\1 .', text)

def fix_days(text) :
  # fix: drop ordinal suffixes, e.g. 31st -> 31
  pattern = r'([0-9]{1,2})(st|nd|rd|th)'
  return re.sub(pattern, r'\1', text)

def separate_years(text) :
  # fix: separate years from other words
  pattern = r'(\s.+?)([0-9]{4})'
  return re.sub(pattern, r'\1 \2', text) 

def fix_comma_thousands(text) :
  # fix: comma thousands notations 
  pattern = r'([0-9]{1,3}),([0-9]{1,3})'
  text = re.sub(pattern, r'\1\2', text)
  pattern = r'([0-9]{1,3}),'
  return re.sub(pattern, r'\1', text)

def fix_weird_dash(text) :
  # fix: replace weird dash with normal one
  pattern = r'–'
  return re.sub(pattern, '-', text)

def fix_years_ranges(text) :
  # fix years ranges
  pattern = r'([0-9]{4})\-([0-9]{4})'
  text = re.sub(pattern, r'\1 - \2', text)
  pattern = r'([0-9]{2})([0-9]{2})\-([0-9]{2})'
  text = re.sub(pattern, r'\1\2 - \1\3', text)
  pattern = r'\'([0-9]{2})-\'([0-9]{2})'
  return re.sub(pattern, r'19\1 - 19\2', text)

def fix_number_ranges(text) :
  # fix: numbers ranges
  pattern = r'([0-9]+?[,\.][0-9]+?)+?-([0-9]+?[,\.][0-9]+?)+'
  return re.sub(pattern, r'\1 - \2', text)

def fix_double_tick(text) :
  # fix: double tick
  pattern = r'\`\`'
  return re.sub(pattern, '"', text)

def fix_date_merged(text) :
  # fix: year/day merged with other word
  pattern = r'([0-9]{1,4})([a-zA-Z]+?)'
  return re.sub(pattern, r'\1 \2', text)

def fix_double_ending_periods(text) :
  # fix: double ending periods in claims
  pattern = r'([a-zA-Z]{1,2}\.)\.$'               # except abbreviations (e.g jr. or c. k.)
  text = re.sub(pattern, r'\1 .', text)           # raw string, so \1 refers to the captured group
  pattern = r'\.\.$'
  return re.sub(pattern, '.', text)

def fix_slashes_words(text) :
  # fix: slashes separated from second word/number  e.g. "2006/ 2007"
  pattern = r'(\s.+?)\/\s(.+?\s)'
  text = re.sub(pattern, r'\1 / \2', text)   # plain "/" in the replacement: "\/" would emit a backslash
  # fix: separate strings between slashes
  pattern = r'(.+?)\/(.+?)'
  return re.sub(pattern, r'\1 / \2', text)

def fix_string_ending_dash(text) :
  # fix: fix strings ending with dash
  pattern = r'(.+?)\-\s'
  return re.sub(pattern, r'\1 - ', text)

def separate_non_words(text) :
  # fix: separate words like non-something
  pattern = r'([a-zA-Z]+?)\-([a-zA-Z]+?)'
  return re.sub(pattern, r'\1 - \2', text)
  
def fix_remove_round_brackets(text) :
  # remove between round brackets
  pattern = r'\([^\(\)]+?\)'
  return re.sub(pattern, ' ', text)

def remove_double_spaces(text) :
  # collapse each run of whitespace into a single whitespace character
  pattern = r'(\s)\s+'
  return re.sub(pattern, r'\1', text)

#test
def replace_special_characters(text):
  REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
  return REPLACE_BY_SPACE_RE.sub(' ', text)

def good_symbols(text) :
  GOOD_SYMBOLS_RE = re.compile(r'[^0-9a-zA-Z\s\+_]')
  return GOOD_SYMBOLS_RE.sub('', text)

def lower(text) :
  return text.lower()

PREPROCESSING_PIPELINE = [
                          remove_leading_tabs,
                          remove_trailing_words,
                          #remove_stopwords,
                          convert_round_brackets,
                          #fix_remove_round_brackets,
                          remove_pronunciations,
                          fix_double_dashes,
                          split_genitive,
                          split_periods,
                          fix_days,
                          separate_years,
                          fix_comma_thousands,
                          fix_weird_dash,
                          fix_years_ranges,
                          fix_number_ranges,
                          fix_double_tick,
                          fix_date_merged,
                          fix_double_ending_periods,
                          fix_slashes_words,
                          fix_string_ending_dash,
                          separate_non_words,
                          remove_double_spaces,
                          #replace_special_characters,
                          #good_symbols,
                          lower
]


def preprocess_text(text, filter_methods=None):
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
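
The remark that the order matters can be seen on a toy pipeline. The two filter functions below are simplified stand-ins written for this illustration, not the ones defined in the cell above; only the `reduce` composition is the same.

```python
import re
from functools import reduce

def strip_brackets(text):
    # replace "( ... )" groups with a single space
    return re.sub(r"\([^()]*\)", " ", text)

def collapse_spaces(text):
    # squeeze runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", text).strip()

def apply_pipeline(text, steps):
    """Apply each step to the output of the previous one, left to right."""
    return reduce(lambda txt, step: step(txt), steps, text)

sample = "Star Trek (2009) is a film"
good = apply_pipeline(sample, [strip_brackets, collapse_spaces])
bad = apply_pipeline(sample, [collapse_spaces, strip_brackets])
print(repr(good))
print(repr(bad))
```

Running the space cleanup before the bracket removal leaves the extra spaces in place, which is why whitespace normalisation sits near the end of the pipeline.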


In [None]:
print("Preprocessing dataset...")

train_df["Claim"] = train_df["Claim"].apply(preprocess_text)
train_df["Evidence"] = train_df["Evidence"].apply(preprocess_text)
print("Training data done.")
val_df["Claim"] = val_df["Claim"].apply(preprocess_text)
val_df["Evidence"] = val_df["Evidence"].apply(preprocess_text)
print("Validation data done.")
test_df["Claim"] = test_df["Claim"].apply(preprocess_text)
test_df["Evidence"] = test_df["Evidence"].apply(preprocess_text)
print("Testing data done.")

print("Preprocessing complete!")

Preprocessing dataset...
Training data done.
Validation data done.
Testing data done.
Preprocessing complete!


In [None]:
print(train_df["Claim"][0])
print(train_df["Evidence"][0])

chris hemsworth appeared in a perfect getaway .
hemsworth has also appeared in the science fiction action film star trek ( 2009 ) , the thriller adventure a perfect getaway ( 2009 ) , the horror comedy the cabin in the woods ( 2012 ) , the dark - fantasy action film snow white and the huntsman ( 2012 ) , the war film red dawn ( 2012 ) , and the biographical sports drama film rush ( 2013 ) .


In [None]:
train_df

Unnamed: 0.1,Unnamed: 0,Claim,Evidence,ID,Label
0,0,chris hemsworth appeared in a perfect getaway .,"hemsworth has also appeared in the science fiction action film star trek ( 2009 ) , the thriller adventure a perfect getaway ( 2009 ) , the horror comedy the cabin in the woods ( 2012 ) , the dark - fantasy action film snow white and the huntsman ( 2012 ) , the war film red dawn ( 2012 ) , and the biographical sports drama film rush ( 2013 ) .",3,SUPPORTS
1,1,roald dahl is a writer .,"roald dahl ( , ; 13 september 1916 - 23 november 1990 ) was a british novelist , short story writer , poet , screenwriter , and fighter pilot .",7,SUPPORTS
2,2,roald dahl is a governor .,"roald dahl ( , ; 13 september 1916 - 23 november 1990 ) was a british novelist , short story writer , poet , screenwriter , and fighter pilot .",8,REFUTES
3,3,ireland has relatively low - lying mountains .,"the island 's geography comprises relatively low - lying mountains surrounding a central plain , with several navigable rivers extending inland .",9,SUPPORTS
4,4,ireland does not have relatively low - lying mountains .,"the island 's geography comprises relatively low - lying mountains surrounding a central plain , with several navigable rivers extending inland .",10,REFUTES
...,...,...,...,...,...
121735,121735,april was the month anderson silva was born .,"anderson da silva ( ; born april 14 , 1975 ) is a brazilian mixed martial artist and former ufc middleweight champion .",229440,SUPPORTS
121736,121736,anderson silva is an american brazilian mixed martial artist .,"anderson da silva ( ; born april 14 , 1975 ) is a brazilian mixed martial artist and former ufc middleweight champion .",229443,REFUTES
121737,121737,anderson silva is incapable of being a brazilian mixed martial artist .,"anderson da silva ( ; born april 14 , 1975 ) is a brazilian mixed martial artist and former ufc middleweight champion .",229444,REFUTES
121738,121738,anderson silva was born on the month of april 14 1975 .,"anderson da silva ( ; born april 14 , 1975 ) is a brazilian mixed martial artist and former ufc middleweight champion .",229445,SUPPORTS


In [None]:
train_df[train_df['Claim'].str.contains(r'k\.\.') == True]

Unnamed: 0.1,Unnamed: 0,Claim,Evidence,ID,Label


## Dataset conversion

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
import gensim.downloader as api
from nltk.tokenize.treebank import TreebankWordTokenizer
from nltk.tokenize.toktok import ToktokTokenizer

embedding_dimension = 50
glove = api.load(f"glove-wiki-gigaword-{embedding_dimension}")



In [None]:
# tokenize claims and evidence with NLTK's ToktokTokenizer
def tokenize_df(df) :
  word_tokenizer = ToktokTokenizer()
  tmp = pd.DataFrame(columns=["Claim", "Evidence", "Label"])
  for col in ["Claim", "Evidence"]:
    tmp[col] = df[col].apply(word_tokenizer.tokenize)
  tmp["Label"] = df["Label"]
  return tmp

In [None]:
def get_embedding_matrix(embedding_model, tokenizer) :
  embedding_matrix = []
  # read the dimension from the model instead of relying on the global variable
  embedding_dimension = embedding_model.vectors[0].shape[0]
  for word in tokenizer.index_word.values() :
    if word in embedding_model:
      embedding_matrix.append(embedding_model[word])
    else:
      # OOV words get a random vector
      embedding_matrix.append(np.random.uniform(low=-1.0, high=1.0, size=(embedding_dimension,)))
  return embedding_matrix

In [None]:
tok_train_df = tokenize_df(train_df)
tok_train_df.head()

Unnamed: 0,Claim,Evidence,Label
0,"[chris, hemsworth, appeared, in, a, perfect, getaway, .]","[hemsworth, has, also, appeared, in, the, science, fiction, action, film, star, trek, (, 2009, ), ,, the, thriller, adventure, a, perfect, getaway, (, 2009, ), ,, the, horror, comedy, the, cabin, in, the, woods, (, 2012, ), ,, the, dark, -, fantasy, action, film, snow, white, and, the, huntsman, (, 2012, ), ,, the, war, film, red, dawn, (, 2012, ), ,, and, the, biographical, sports, drama, film, rush, (, 2013, ), .]",SUPPORTS
1,"[roald, dahl, is, a, writer, .]","[roald, dahl, (, ,, ;, 13, september, 1916, -, 23, november, 1990, ), was, a, british, novelist, ,, short, story, writer, ,, poet, ,, screenwriter, ,, and, fighter, pilot, .]",SUPPORTS
2,"[roald, dahl, is, a, governor, .]","[roald, dahl, (, ,, ;, 13, september, 1916, -, 23, november, 1990, ), was, a, british, novelist, ,, short, story, writer, ,, poet, ,, screenwriter, ,, and, fighter, pilot, .]",REFUTES
3,"[ireland, has, relatively, low, -, lying, mountains, .]","[the, island, ', s, geography, comprises, relatively, low, -, lying, mountains, surrounding, a, central, plain, ,, with, several, navigable, rivers, extending, inland, .]",SUPPORTS
4,"[ireland, does, not, have, relatively, low, -, lying, mountains, .]","[the, island, ', s, geography, comprises, relatively, low, -, lying, mountains, surrounding, a, central, plain, ,, with, several, navigable, rivers, extending, inland, .]",REFUTES


In [None]:
# build vocab
def build_vocab_from_df(df) :
  ret = []
  for col in ["Claim", "Evidence"] :
    for r in df[col] :
      ret += r
  ret = pd.unique(ret)    # much faster than np.unique
  return ret

vocab = list(glove.vocab.keys())               # glove vocab (gensim < 4.0; on gensim >= 4.0 use glove.key_to_index)

vocab_v1 = np.array(build_vocab_from_df(tok_train_df), dtype=str)   # add unique terms from train_df
oov_v1 = vocab_v1[~np.in1d(vocab_v1, vocab)] # find OOV terms
vocab = np.concatenate((vocab, oov_v1))   #update vocab

vocab_v2 = np.array(build_vocab_from_df(tokenize_df(val_df)), dtype=str)   # add unique terms from val_df
oov_v2 = vocab_v2[~np.in1d(vocab_v2, vocab)]
vocab = np.concatenate((vocab, oov_v2))

vocab_v3 = np.array(build_vocab_from_df(tokenize_df(test_df)), dtype=str)   # add unique terms from test_df
oov_v3 = vocab_v3[~np.in1d(vocab_v3, vocab)]
vocab = np.concatenate((vocab, oov_v3))

print(f"vocab len: {len(vocab)}")

vocab len: 403506


In [None]:
# build word -> int encoding 
word_to_idx = dict(zip(vocab, range(1, len(vocab)+1))) # start from 1 to reserve 0 to padding
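
The vocabulary extension and the index mapping built in the cells above can be illustrated on a toy example in plain Python. The names `extend_vocab`, `toy_vocab` and `toy_word_to_idx` are hypothetical, and the three-word base vocabulary merely stands in for the GloVe vocabulary.

```python
def extend_vocab(base_vocab, tokens):
    """Append tokens missing from base_vocab (OOV terms), keeping first-seen order."""
    known = set(base_vocab)
    extended = list(base_vocab)
    for token in tokens:
        if token not in known:
            known.add(token)
            extended.append(token)
    return extended

base_vocab = ["the", "film", "."]   # stands in for the GloVe vocabulary
corpus_tokens = ["the", "huntsman", "film", "huntsman", "stoker", "."]

toy_vocab = extend_vocab(base_vocab, corpus_tokens)
# Indices start from 1 so that 0 stays free for padding.
toy_word_to_idx = {word: idx for idx, word in enumerate(toy_vocab, start=1)}
encoded = [toy_word_to_idx[w] for w in corpus_tokens]
print(toy_vocab)
print(encoded)
```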

In [None]:
def encode_sent(sent, word_to_idx=word_to_idx) :
  return [word_to_idx[w] for w in sent]

def encode_df(tok_df, word_to_idx) :
  enc_df = pd.DataFrame(columns=["Claim", "Evidence", "Label"])
  for col in ["Claim", "Evidence"] :
    enc_df[col] = tok_df[col].apply(lambda s: encode_sent(s, word_to_idx))
  enc_df["Label"] = tok_df["Label"].apply(lambda x: 1 if x=="SUPPORTS" else 0)
  return enc_df

In [None]:
# encode dataset
enc_train_df = encode_df(tok_train_df, word_to_idx)

In [None]:
enc_train_df.head()

Unnamed: 0,Claim,Evidence,Label
0,"[2103, 107954, 790, 7, 8, 2616, 20647, 3]","[107954, 32, 53, 790, 7, 1, 1122, 3955, 609, 320, 754, 9780, 24, 704, 25, 2, 1, 8966, 6041, 8, 2616, 20647, 24, 704, 25, 2, 1, 5989, 2842, 1, 7741, 7, 1, 2508, 24, 940, 25, 2, 1, 2238, 12, 5848, 609, 320, 2643, 299, 6, 1, 34011, 24, 940, 25, 2, 1, 137, 320, 640, 4650, 24, 940, 25, 2, 6, 1, 18899, 885, 2693, 320, 3993, 24, 1280, 25, 3]",1
1,"[53403, 21758, 15, 8, 1542, 3]","[53403, 21758, 24, 2, 90, 677, 442, 6907, 12, 1022, 488, 1456, 25, 16, 8, 298, 8998, 2, 637, 524, 1542, 2, 4820, 2, 12604, 2, 6, 3511, 2499, 3]",1
2,"[53403, 21758, 15, 8, 1005, 3]","[53403, 21758, 24, 2, 90, 677, 442, 6907, 12, 1022, 488, 1456, 25, 16, 8, 298, 8998, 2, 637, 524, 1542, 2, 4820, 2, 12604, 2, 6, 3511, 2499, 3]",0
3,"[1323, 32, 2224, 654, 12, 4740, 2755, 3]","[1, 584, 58, 1535, 4214, 7817, 2224, 654, 12, 4740, 2755, 2724, 8, 324, 5406, 2, 18, 202, 30455, 4058, 5787, 8331, 3]",1
4,"[1323, 261, 37, 34, 2224, 654, 12, 4740, 2755, 3]","[1, 584, 58, 1535, 4214, 7817, 2224, 654, 12, 4740, 2755, 2724, 8, 324, 5406, 2, 18, 202, 30455, 4058, 5787, 8331, 3]",0


In [None]:
def build_embedding_matrix(vocab=vocab, embedding_model=glove, embedding_dimension=50) :
  matrix = [np.zeros((embedding_dimension))] # first element reserved to padding and set to all zeros
  for w in vocab :
    if w in embedding_model.vocab :
      matrix.append(embedding_model[w])
    else:
      matrix.append(np.random.uniform(low=-1.0, high=1.0, size=embedding_dimension))
  return np.array(matrix)

In [None]:
embedding_matrix = build_embedding_matrix(vocab, glove, 50)
embedding_matrix.shape

(403507, 50)

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
max_length = pad_sequences(enc_train_df['Evidence'], padding='post').shape[1]

In [None]:
enc_val_df = encode_df(tokenize_df(val_df), word_to_idx)
enc_test_df = encode_df(tokenize_df(test_df), word_to_idx)

In [None]:
pad_train_claim = pad_sequences(enc_train_df['Claim'], maxlen=max_length, padding='post')
pad_train_evidence = pad_sequences(enc_train_df['Evidence'], maxlen=max_length, padding='post')

pad_val_claim = pad_sequences(enc_val_df['Claim'], maxlen=max_length, padding='post')
pad_val_evidence = pad_sequences(enc_val_df['Evidence'], maxlen=max_length, padding='post')

pad_test_claim = pad_sequences(enc_test_df['Claim'], maxlen=max_length, padding='post')
pad_test_evidence = pad_sequences(enc_test_df['Evidence'], maxlen=max_length, padding='post')

In [None]:
pad_train_claim.shape, pad_train_evidence.shape

((121740, 144), (121740, 144))
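
The encoding and padding steps above can be traced on a toy vocabulary. The sketch below uses plain Python only; `toy_vocab`, `toy_word_to_idx` and `pad_post` are hypothetical names introduced for illustration, and `pad_post` only mimics what `pad_sequences(..., padding='post')` does with its default `truncating='pre'`. Note that, with these defaults, a validation or test sequence longer than `max_length` loses its first tokens rather than its last ones.

```python
def pad_post(seq, maxlen, value=0):
    """Hypothetical helper: post-padding with 'pre' truncation, like the pad_sequences calls above."""
    seq = seq[-maxlen:]                          # 'pre' truncation keeps the LAST maxlen tokens
    return seq + [value] * (maxlen - len(seq))   # 'post' padding appends the padding index

toy_vocab = ["ireland", "has", "mountains", "."]
# indices start from 1 because 0 is reserved for padding
toy_word_to_idx = {w: i for i, w in enumerate(toy_vocab, start=1)}

encoded = [toy_word_to_idx[w] for w in ["ireland", "has", "mountains", "."]]
print(encoded)               # [1, 2, 3, 4]
print(pad_post(encoded, 6))  # [1, 2, 3, 4, 0, 0]
print(pad_post(encoded, 3))  # [2, 3, 4]
```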

## Model definition

In [None]:
import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Embedding, Bidirectional, \
                                    LSTM, GRU, Lambda, GlobalMaxPooling1D, GlobalAveragePooling1D, \
                                    Concatenate, Add, Average, Dropout, Flatten, TimeDistributed
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import SGD, Adam

from sklearn.metrics.pairwise import cosine_similarity

In [None]:
def build_embedding_layer(embedding_matrix, name=None) :
    embedding_layer = Embedding(
        embedding_matrix.shape[0],    # vocab size 
        embedding_matrix.shape[1],    # embedding dimension
        weights = [embedding_matrix],
        mask_zero = True,
        name = name,
        trainable = False
    )
    return embedding_layer

We implemented the four required sentence embedding strategies, plus an extra one, in a single function, `build_sent_emb()`. The strategy is selected through the `multi_input` parameter. The default is `average`: the token sequence is encoded by an RNN (a bidirectional LSTM) and all its output states are averaged.
The parameter `multi_input` can also be set to:
- **last_state**: the last state of the RNN is taken as the sentence embedding
- **mlp**: the token embeddings are flattened and encoded by an MLP
- **bag_of_vectors**: the sentence embedding is computed as the mean of its token embeddings
- **max**: the token sequence is encoded by an RNN and the element-wise maximum over all output states is taken. This strategy worked slightly better than the others in our runs, which is why we included it even though the assignment did not ask for it.
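
To make the pooling-based strategies concrete, the NumPy sketch below applies mean and max pooling to a toy batch of token vectors. It is an illustration under stated assumptions, not the Keras code of this notebook: `masked_mean` and `masked_max` are hypothetical helpers that skip padded positions, whereas a plain reduction over the time axis (as in `tf.reduce_mean(x, axis=1)`) also includes the padded time steps.

```python
import numpy as np

def masked_mean(x, mask):
    """Average token vectors over real (non-padded) positions only."""
    mask = mask[..., None].astype(x.dtype)                      # (batch, time, 1)
    return (x * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1.0)

def masked_max(x, mask):
    """Element-wise max over real positions; padded positions are set to -inf first."""
    return np.where(mask[..., None], x, -np.inf).max(axis=1)

# toy batch: 1 sentence, 4 time steps, 2 features; the last two steps are padding
x = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]])
mask = np.array([[True, True, False, False]])

print(x.mean(axis=1))        # plain mean, diluted by the padding rows
print(masked_mean(x, mask))  # mean over the two real tokens
print(masked_max(x, mask))   # max over the two real tokens
```

On this toy input the plain mean is `[1.0, 1.5]`, while the masked mean is `[2.0, 3.0]`: the shorter the sentence relative to `max_length`, the stronger the dilution.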

In [None]:
def build_sent_emb(units, multi_input='average', name=None) :
  sent_emb = Sequential(name=name)
  if multi_input == 'last_state' :
    sent_emb.add(Bidirectional(LSTM(units, return_sequences=False)))
  elif multi_input == 'average' :
    sent_emb.add(Bidirectional(LSTM(units, return_sequences=True)))
    sent_emb.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))
  elif multi_input == 'max' :
    sent_emb.add(Bidirectional(LSTM(units, return_sequences=True)))
    sent_emb.add(Lambda(lambda x: tf.reduce_max(x, axis=1)))
  elif multi_input == 'mlp':
    sent_emb.add(Flatten())
    sent_emb.add(Dense(144, activation='relu'))
    sent_emb.add(Dense(50, activation='relu'))
  elif multi_input == 'bag_of_vectors' :
    sent_emb.add(Lambda(lambda x: tf.reduce_mean(x, axis=1)))

  return sent_emb

The `merge_mode` parameter of the `build_model()` function selects how the two inputs are merged. In particular:
- **concat**: concatenates the claim and evidence sentence embeddings
- **sum**: sums the two sentence embeddings element-wise
- **mean**: averages the two sentence embeddings element-wise
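
The effect of each merge mode on the tensor shapes can be checked on toy vectors. The NumPy sketch below is only an illustration: `claim_emb`, `evidence_emb` and `cosine_similarity_rows` are hypothetical names, and the cosine helper follows the standard definition (dot product of the L2-normalised vectors), which is the kind of extra scalar feature that could be appended to the merged representation.

```python
import numpy as np

claim_emb = np.array([[1.0, 2.0, 3.0]])      # (batch, dim) toy claim embedding
evidence_emb = np.array([[3.0, 0.0, 1.0]])   # (batch, dim) toy evidence embedding

merged = {
    "concat": np.concatenate([claim_emb, evidence_emb], axis=-1),  # doubles the feature size
    "sum": claim_emb + evidence_emb,                               # keeps the feature size
    "mean": (claim_emb + evidence_emb) / 2.0,                      # keeps the feature size
}
for mode, out in merged.items():
    print(mode, out.shape)

def cosine_similarity_rows(a, b):
    """Row-wise cosine similarity, returned with shape (batch, 1)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1, keepdims=True)

cos_sim = cosine_similarity_rows(claim_emb, evidence_emb)
extended = np.concatenate([merged["mean"], cos_sim], axis=-1)      # one extra feature
print(extended.shape)
```

Only `concat` changes the size of the representation passed to the classifier; `sum` and `mean` assume that both sentence embeddings have the same dimension.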

In [None]:
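tf.keras.backend.clear_session()
# K is already imported from tensorflow.keras.backend; mixing it with standalone keras can cause conflicts

def cosine_distance(vects):
    # negative cosine similarity between two batches of vectors, output shape (batch, 1)
    x, y = vects
    x = K.l2_normalize(x, axis=-1)
    y = K.l2_normalize(y, axis=-1)
    # the dot product of normalised vectors is the SUM (not the mean) of their element-wise products
    return -K.sum(x * y, axis=-1, keepdims=True)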

def cos_dist_output_shape(shapes):
    shape1, shape2 = shapes
    return (shape1[0],1)

def build_model(embedding_matrix, max_length, multi_input='average', merge_mode='concat', cos_sim_feature=False) :

  claim = Input((max_length,), name='claim')
  evidence = Input((max_length,), name='evidence')

  embedding_layer = build_embedding_layer(embedding_matrix, "glove_embedding")  
    
  embedding_c = embedding_layer(claim)
  embedding_e = embedding_layer(evidence)

  sent_emb_c = build_sent_emb(max_length, multi_input=multi_input, name="sent_emb_claim") (embedding_c)
  sent_emb_e = build_sent_emb(max_length, multi_input=multi_input, name="sent_emb_evidence") (embedding_e)

  if merge_mode == 'concat' :
    output = Concatenate(name='refined_input')([sent_emb_c, sent_emb_e])    # option 1
  elif merge_mode == 'sum' :
    output = Add()([sent_emb_c, sent_emb_e])                                # option 2
  elif merge_mode == 'mean' :
    output = Average()([sent_emb_c, sent_emb_e])                            # option 3
  
  if cos_sim_feature :
    distance = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([sent_emb_c, sent_emb_e])
    output = Concatenate(name='cossim_refined_input')([output, distance])          # co sim
  output = Dense(max_length // 2, activation='relu')(output)        # integer division: layer sizes must be ints
  output = Dropout(0.5)(output)
  output = Dense(max_length // (2**2), activation='relu')(output)
  output = Dense(max_length // (2**3), activation='relu')(output)
  output = Dense(1, activation='sigmoid')(output)
  return Model([claim, evidence], output)

model = build_model(embedding_matrix, max_length, multi_input='max', merge_mode='mean', cos_sim_feature=True)

## Training

In [None]:
import tensorflow.keras.backend as K

def recall_m(y_true, y_pred):
    true_positives = tf.reduce_sum(tf.math.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    possible_positives = tf.reduce_sum(tf.math.round(tf.clip_by_value(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = tf.reduce_sum(tf.math.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    predicted_positives = tf.reduce_sum(tf.math.round(tf.clip_by_value(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))


model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy',f1_m,precision_m, recall_m])
model.summary()
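
Keras evaluates function metrics such as `f1_m` batch by batch and reports the average of the per-batch values, so the numbers shown during training can differ slightly from scores computed once over the whole dataset. As a reference for what these metrics measure, here is a minimal dataset-level sketch in plain Python; `binary_prf` and the toy values are hypothetical and only illustrate the definitions.

```python
def binary_prf(y_true, y_prob, threshold=0.5):
    """Precision, recall and F1 of the positive class after thresholding the probabilities."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.6, 0.4, 0.8, 0.1]   # toy sigmoid outputs
precision, recall, f1 = binary_prf(y_true, y_prob)
print(precision, recall, f1)
```

In this toy case there are two true positives, one false positive and one false negative, so precision, recall and F1 are all 2/3.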

In [None]:
model.fit(x=(pad_train_claim, pad_train_evidence), 
          y=enc_train_df['Label'],
          batch_size=64,
          epochs=20,
          validation_data=((pad_val_claim, pad_val_evidence), enc_val_df['Label']),
          callbacks = [EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)],   # callbacks expects a list
          )

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20

## Evaluation

### Classic metrics

In [None]:
# evaluate the model
loss, accuracy, f1_score, precision, recall = model.evaluate((pad_test_claim, pad_test_evidence), enc_test_df["Label"], verbose=0)
print("Accuracy: {}".format(accuracy))
print("F1-score: {}".format(f1_score))
print("Precision: {}".format(precision))
print("Recall: {}".format(recall))

###  Majority voting metrics

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
import nltk
nltk.download('punkt')

def build_df_from_model(model, df, pad_claims, pad_evidences) :
  predictions = model.predict((pad_claims, pad_evidences))
  pred_df = pd.DataFrame()
  pred_df["Claim"] = df["Claim"]
  pred_df["Label"] = np.rint(predictions).ravel()   # flatten the (n, 1) model output into a 1-D column
  pred_df["Label"] = pred_df["Label"].apply(lambda x: "SUPPORTS" if x == 1.0 else "REFUTES")
  return pred_df

def get_majority_voting(df) :
  unique_claims = pd.DataFrame(columns=["Claim", "Result"])
  unique_claims["Claim"] = pd.unique(df["Claim"])

  for claim in unique_claims["Claim"]:
    labels = df[df["Claim"] == claim]
    supports = len(labels[labels["Label"] == "SUPPORTS"])
    refutes = len(labels[labels["Label"] == "REFUTES"])
    # write through .loc: chained indexing may silently assign to a copy; ties count as REFUTES (0)
    unique_claims.loc[unique_claims["Claim"] == claim, "Result"] = 1 if supports > refutes else 0

  return unique_claims

def compute_metrics_majority_voting(true, pred):
  print("Accuracy:  ", accuracy_score(true, pred))
  print("F1 score:  ", f1_score(true, pred))
  print("Precision: ", precision_score(true, pred))
  print("Recall:    ", recall_score(true, pred))

trues = get_majority_voting(test_df)["Result"].to_numpy(int)
preds = build_df_from_model(model, test_df, pad_test_claim, pad_test_evidence)
preds = get_majority_voting(preds)["Result"].to_numpy(int)

compute_metrics_majority_voting(trues, preds)
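
The loop in `get_majority_voting()` filters the whole DataFrame once per unique claim, which can become slow on large test sets. As a possible alternative, the sketch below expresses the same voting rule with a single pandas `groupby`. The `majority_vote` helper and the toy `pairs` DataFrame are hypothetical, and the sketch assumes that every label is either `SUPPORTS` or `REFUTES`.

```python
import pandas as pd

# toy per-pair predictions: the same claim can appear with several evidence sentences
pairs = pd.DataFrame({
    "Claim": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "Label": ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES", "SUPPORTS", "REFUTES"],
})

def majority_vote(df):
    """Return one 0/1 verdict per claim; ties fall back to 0 (REFUTES), as in the loop version."""
    is_support = df["Label"].eq("SUPPORTS")
    # share of SUPPORTS votes per claim, in order of first appearance
    support_share = is_support.groupby(df["Claim"], sort=False).mean()
    return (support_share > 0.5).astype(int)

verdicts = majority_vote(pairs)
print(verdicts.to_dict())   # {'c1': 1, 'c2': 0, 'c3': 0}
```

With only two possible labels, a share of `SUPPORTS` votes above 0.5 is equivalent to `supports > refutes`, so claim `c2` (one vote each) is resolved to 0.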