# Assignment 2

**Due date**: 23/12/2021 (dd/mm/yyyy)

**Credits**: Andrea Galassi, Federico Ruggeri, Paolo Torroni

**Summary**: Fact checking, Natural Language Inference (**NLI**)

# Intro

This assignment is centred on an emerging NLP task known as **fact checking** (or fake checking). As AI techniques become more powerful and reach impressive results in areas such as image and text generation, it is more necessary than ever to build tools that can distinguish what is real from what is fake.

Here we focus on a small portion of the whole fact checking problem: determining whether a given statement (fact) conveys trustworthy information.

More precisely, given a set of evidence sentences and a fact to verify, we would like our model to correctly predict whether the fact is true or fake.

In particular, we will see:

*   Dataset preparation (analysis and pre-processing)
*   Problem formulation: multi-input binary classification
*   Defining an evaluation method
*   Simple sentence embedding
*   Neural building blocks
*   Neural architecture extension

# The FEVER dataset

First of all, we need to choose a dataset. In this assignment we will rely on the [FEVER dataset](https://fever.ai).

The dataset contains facts taken from Wikipedia documents that have to be verified. Some facts were manually modified, either to create fake information or to give different formulations of the same concept.

The dataset consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as ```Supported```, ```Refuted``` or ```NotEnoughInfo```. For the first two classes, systems and annotators need to also return the combination of sentences forming the necessary evidence supporting or refuting the claim.

## 2.1 Dataset structure

Relevant data is divided into two file types. Information concerning the fact to verify, its verdict and the associated supporting/opposing statements is stored in **.jsonl** format. In particular, each JSON element is a Python dictionary with the following relevant fields:

*    **ID**: ID associated to the fact to verify.

*    **Verifiable**: whether the fact has been verified or not: ```VERIFIABLE``` or ```NOT VERIFIABLE```.
    
*    **Label**: the final verdict on the fact to verify: ```SUPPORTS```, ```REFUTES``` or ```NOT ENOUGH INFO```.
    
*    **Claim**: the fact to verify.
    
*    **Evidence**: a nested list of document IDs along with the sentence ID that is associated to the fact to verify. In particular, each list element is a tuple of four elements: the first two are internal annotator IDs that can be safely ignored; the third term is the document ID (called URL) and the last one is the sentence number (ID) in the pointed document to consider.

**Some Examples**

---

**Verifiable**

```
{"id": 202314, "verifiable": "VERIFIABLE", "label": "REFUTES", "claim": "The New Jersey Turnpike has zero shoulders.", "evidence": [[[238335, 240393, "New_Jersey_Turnpike", 15]]]}
```

---

**Not Verifiable**

```
{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}
```

---
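
As a rough illustration of the record layout described above, the sketch below parses the two example records and keeps only the document ID and sentence number of each evidence entry. The helper name `evidence_pointers` is hypothetical and not part of any FEVER tooling; it only assumes the four-element layout of each evidence entry described in the field list.

```python
import json

# The two example records shown above, one JSON object per line.
VERIFIABLE_LINE = '{"id": 202314, "verifiable": "VERIFIABLE", "label": "REFUTES", "claim": "The New Jersey Turnpike has zero shoulders.", "evidence": [[[238335, 240393, "New_Jersey_Turnpike", 15]]]}'
NOT_VERIFIABLE_LINE = '{"id": 113501, "verifiable": "NOT VERIFIABLE", "label": "NOT ENOUGH INFO", "claim": "Grease had bad reviews.", "evidence": [[[133128, null, null, null]]]}'


def evidence_pointers(record):
    """Hypothetical helper: return (document ID, sentence number) pairs."""
    pointers = []
    for evidence_group in record["evidence"]:
        # Each entry is [annotation ID, evidence ID, document ID, sentence number];
        # the first two values are internal IDs and are ignored here.
        for _annotation_id, _evidence_id, doc_id, sentence_id in evidence_group:
            if doc_id is not None:  # non-verifiable claims carry null pointers
                pointers.append((doc_id, sentence_id))
    return pointers


records = [json.loads(line) for line in (VERIFIABLE_LINE, NOT_VERIFIABLE_LINE)]
for record in records:
    print(record["label"], evidence_pointers(record))
```

The non-verifiable record yields an empty list, which is one simple way to spot claims that carry no usable evidence.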

## 2.2 Some simplifications and pre-processing

We are only interested in verifiable facts. Thus, we can filter out all non-verifiable claims.

Additionally, the current dataset format does not contain all necessary information for our classification purposes. In particular, we need to download Wikipedia documents and replace reported evidence IDs with the corresponding text.

Don't worry about that! We are providing you the already pre-processed dataset so that you can concentrate on the classification pipeline (pre-processing, model definition, evaluation and training).

You can download the zip file containing all set splits (train, validation and test) of the FEVER dataset by clicking on this [link](https://drive.google.com/file/d/1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1/view?usp=sharing). Alternatively, run the code cell below to download it automatically in this notebook.

**Note**: each dataset split is in .csv format. Feel free to inspect the whole dataset!

In [3]:
import os
import requests
import zipfile

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)

def download_data(data_path):
    toy_data_path = os.path.join(data_path, 'fever_data.zip')
    toy_data_url_id = "1wArZhF9_SHW17WKNGeLmX-QTYw9Zscl1"
    toy_url = "https://docs.google.com/uc?export=download"

    if not os.path.exists(data_path):
        os.makedirs(data_path)

    if not os.path.exists(toy_data_path):
        print("Downloading FEVER data splits...")
        with requests.Session() as current_session:
            response = current_session.get(toy_url,
                                   params={'id': toy_data_url_id},
                                   stream=True)
            # Consume the streamed body while the session is still open
            save_response_content(response, toy_data_path)
        print("Download completed!")

        print("Extracting dataset...")
        with zipfile.ZipFile(toy_data_path) as loaded_zip:
            loaded_zip.extractall(data_path)
        print("Extraction completed!")

download_data('dataset')

# Classification dataset

At this point, you should have a ready-to-go dataset! Note that the dataset format changed as well: we split the evidence set associated with each claim in order to build `(claim, evidence)` pairs. The classification label is propagated as well.

We'll motivate this decision in the next section!

Just for clarity, here's an example of the pre-processed dataset:

---

**Claim**: "Wentworth Miller is yet to make his screenwriting debut."

**Evidence**: "2	He made his screenwriting debut with the 2013 thriller film Stoker .	Stoker	Stoker (film)"

**Label**: Refutes

---

[**Note**]: The dataset requires some text cleaning as you may have noticed!
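
To make the cleaning requirement concrete, the sketch below splits a raw evidence string on tab characters. The reading of the fields (sentence number, sentence text, then trailing link-related entries) is an assumption inferred from the example above, not a documented format, and `split_evidence` is a hypothetical helper.

```python
def split_evidence(raw):
    """Hypothetical helper: split a raw evidence string into its tab-separated parts."""
    fields = raw.split("\t")
    sentence_number = int(fields[0])   # leading index of the sentence
    sentence = fields[1]               # the evidence sentence itself
    extras = fields[2:]                # assumed: link text / linked page titles
    return sentence_number, sentence, extras


# The evidence string from the example above.
raw_evidence = ("2\tHe made his screenwriting debut with the 2013 thriller film Stoker ."
                "\tStoker\tStoker (film)")
number, sentence, extras = split_evidence(raw_evidence)
print(number)
print(sentence)
print(extras)
```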


In [4]:
# Import pandas 
import pandas as pd 
import numpy as np
# reading our datasets
train_df = pd.read_csv("dataset/train_pairs.csv",index_col=0) 
val_df = pd.read_csv("dataset/val_pairs.csv",index_col=0) 
test_df = pd.read_csv("dataset/test_pairs.csv",index_col=0) 

train_df.head()

Unnamed: 0,Claim,Evidence,ID,Label
0,Chris Hemsworth appeared in A Perfect Getaway.,2\tHemsworth has also appeared in the science ...,3,SUPPORTS
1,Roald Dahl is a writer.,0\tRoald Dahl -LRB- -LSB- langpronˈroʊ.əld _ ˈ...,7,SUPPORTS
2,Roald Dahl is a governor.,0\tRoald Dahl -LRB- -LSB- langpronˈroʊ.əld _ ˈ...,8,REFUTES
3,Ireland has relatively low-lying mountains.,10\tThe island 's geography comprises relative...,9,SUPPORTS
4,Ireland does not have relatively low-lying mou...,10\tThe island 's geography comprises relative...,10,REFUTES


In [5]:
train_df['Evidence'][6]

'16\tAfter suppressing a minor rebellion in Wales in 1276 -- 77 , Edward responded to a second rebellion in 1282 -- 83 with a full-scale war of conquest .\tfull-scale war of conquest\tConquest of Wales by Edward I'

In [6]:
# Let's do some data cleaning on the Evidence field of our data set
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords

# Config
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
GOOD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
try:
    STOPWORDS = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    STOPWORDS = set(stopwords.words('english'))
    
def lower(text):
    """
    Transforms given text to lower case.
    Example:
    Input: 'I really like New York city'
    Output: 'i really like new york city'
    """

    return text.lower()

def replace_special_characters(text):
    """
    Replaces special characters, such as parentheses,
    with spacing character
    """

    return REPLACE_BY_SPACE_RE.sub(' ', text)

def filter_out_uncommon_symbols(text):
    """
    Removes any special character that is not in the
    good symbols list (check regular expression)
    """

    return GOOD_SYMBOLS_RE.sub('', text)

def remove_stopwords(text):
    return ' '.join([x for x in text.split() if x and x not in STOPWORDS])

def remove_extra_spaces(text):
    """
    Collapses any run of whitespace (spaces, tabs, newlines) into a single space.
    Example:
    Input: 'This   assignment  is cool\n'
    Output: 'This assignment is cool '
    """

    return re.sub(r'\s+', ' ', text)

def strip_text(text):
    """
    Removes any left or right spacing (including carriage return) from text.
    Example:
    Input: '  This assignment is cool\n'
    Output: 'This assignment is cool'
    """

    return text.strip()

def remove_starting_index(text):
    """
    Removes the leading sentence index (digits followed by a tab) from the text.
    Example:
    Input: '2\tHemsworth has also appeared in the science'
    Output: 'Hemsworth has also appeared in the science'
    """

    return re.sub(r'^[0-9]+\t', '', text)


def remove_pronunciation(text):
    """
    Removes pronunciation descriptions, which appear between -LSB- and -RSB-.
    Example:
    Input: 'Vietnam -LRB- ˌ ; -LSB- vîət nāːm -RSB- -RRB- , officially the Socialist Republic of Vietnam -LRB- \n'
    Output: 'Vietnam -LRB- ˌ ; -RRB- , officially the Socialist Republic of Vietnam -LRB- \n'
    """

    # Non-greedy match, so that text between two separate bracketed spans is kept
    return re.sub(r'-LSB- .*? -RSB- ', '', text)

NLTK_SYM = re.compile(r'-[A-Z]{3}-')

def remove_NLTK_sym(text):
    """
    Removes: NLTK special tokens, eg. LRB (Left Round Bracket) and RRB (Right Round Brackets) .
    Example:
    Input: 'Vietnam -LRB- ˌ ; -LSB- vîət nāːm -RSB- -RRB- , officially the Socialist Republic of Vietnam -LRB- \n'
    Output: 'Vietnam ˌ ; vîət nāːm , officially the Socialist Republic of Vietnam\n'
    """

    return NLTK_SYM.sub('', text)

def remove_tabs(text):
    """
    Replaces tab characters with spaces.
    Example:
    Input: 'Stoker .\tStoker\tStoker (film)'
    Output: 'Stoker . Stoker Stoker (film)'
    """

    return re.sub(r'\t', ' ', text)

def label_parse(label):
    """
    Maps string labels to integers: 1 for 'SUPPORTS', 0 otherwise.
    """
    return (label == "SUPPORTS").astype(int)
    
PREPROCESSING_PIPELINE = [
                          remove_starting_index,
                          remove_pronunciation,
                          remove_NLTK_sym,
                          lower,
                          replace_special_characters,
                          remove_extra_spaces,
                          remove_tabs,
                          strip_text
                          ]

# Anchor method

def text_prepare(text, filter_methods=None):
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """

    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE

    return reduce(lambda txt, f: f(txt), filter_methods, text)

# Pre-processing

print('Pre-processing evidence...')
# Replace each sentence with its pre-processed version
train_df['Evidence']=train_df['Evidence'].apply(lambda txt: text_prepare(txt))
val_df['Evidence']=val_df['Evidence'].apply(lambda txt: text_prepare(txt))
test_df['Evidence']=test_df['Evidence'].apply(lambda txt: text_prepare(txt))

# Converting claims into lower case
train_df['Claim']=train_df['Claim'].str.strip().str.lower()
val_df['Claim']=val_df['Claim'].str.strip().str.lower()
test_df['Claim']=test_df['Claim'].str.strip().str.lower()

train_df['Label'] = label_parse(train_df['Label'])
test_df['Label'] = label_parse(test_df['Label'])
val_df['Label'] = label_parse(val_df['Label'])


print("Pre-processing completed!")
train_df.head(10)

Pre-processing evidence...
Pre-processing completed!


Unnamed: 0,Claim,Evidence,ID,Label
0,chris hemsworth appeared in a perfect getaway.,hemsworth has also appeared in the science fic...,3,1
1,roald dahl is a writer.,roald dahl 13 september 1916 -- 23 november 19...,7,1
2,roald dahl is a governor.,roald dahl 13 september 1916 -- 23 november 19...,8,0
3,ireland has relatively low-lying mountains.,the island 's geography comprises relatively l...,9,1
4,ireland does not have relatively low-lying mou...,the island 's geography comprises relatively l...,10,0
5,there have been many notable performances by d...,his most commercially successful role to date ...,14,1
6,edward i of england responded to a second rebe...,after suppressing a minor rebellion in wales i...,17,1
7,h. h. holmes owned a building west of chicago.,many victims were said to have been killed in ...,19,1
8,h. h. holmes was the owner of a building locat...,many victims were said to have been killed in ...,20,1
9,the beastie boys released paul's boutique.,in 2009 they released digitally remastered del...,21,1


In [7]:
train_df['Evidence'][1]

'roald dahl 13 september 1916 -- 23 november 1990 was a british novelist short story writer poet screenwriter and fighter pilot . fighter pilot fighter pilot'
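
Patterns of the form `-LSB- .* -RSB- ` deserve a closer look when cleaning evidence: with a greedy `.*`, a sentence containing two bracketed spans loses everything between them. The sketch below compares a greedy and a non-greedy pattern on a made-up string (the input is a toy example, not taken from the dataset).

```python
import re

GREEDY = re.compile(r'-LSB- .* -RSB- ')
NON_GREEDY = re.compile(r'-LSB- .*? -RSB- ')

# Toy input with two separate bracketed spans (not taken from the dataset).
toy = 'alpha -LSB- x -RSB- beta -LSB- y -RSB- gamma'

print(GREEDY.sub('', toy))      # the word 'beta' is removed as well
print(NON_GREEDY.sub('', toy))  # only the two bracketed spans are removed
```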

# Problem formulation

As mentioned at the beginning of the assignment, we are going to formulate the fact checking problem as a binary classification task.

In particular, each dataset sample is comprised of:

*     A claim to verify
*     A set of semantically related statements (evidence set)
*     Fact checking label: whether the evidence supports or refutes the claim.

Handling the evidence set may add complexity from the point of view of neural models: if the evidence set is comprised of several sentences, we might run into memory problems.

To this end, we further simplify the problem by building (claim, evidence) pairs. The fact checking label is propagated as well.

Example:

     Claim: c1 
     Evidence set: [e1, e2, e3]
     Label: S (support)

--->

    (c1, e1, S),
    (c1, e2, S),
    (c1, e3, S)
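
The expansion above can be sketched in a few lines of plain Python. `expand_pairs` is a hypothetical helper and the sample mirrors the symbolic example (`c1`, `e1` to `e3`, `S`); it is only meant to show the shape of the transformation.

```python
def expand_pairs(samples):
    """Hypothetical helper: turn (claim, evidence set, label) samples
    into (claim, evidence, label) triples."""
    return [
        (claim, evidence, label)          # the label is copied to every pair
        for claim, evidence_set, label in samples
        for evidence in evidence_set
    ]


samples = [("c1", ["e1", "e2", "e3"], "S")]
for triple in expand_pairs(samples):
    print(triple)
```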

In [8]:
#%%time
# Let's build new dataframes with the format (C,E,S)
from nltk import sent_tokenize
nltk.download('punkt')
# train_df: Generate the normalized version

def xtr_evidence(df):
    """
    From a dataframe with the format [Claim, Evidence1.Evidence2..., Label] 
    Returns a list of the form [Claim, Evidence1,Label], [Claim, Evidence2, Label]...
    """
    ev_list=[]
    for index, row in df.iterrows():
        for sentence in sent_tokenize(row['Evidence']):
            ev_list.append([row['Claim'],sentence,row['Label']])
    return(ev_list)

cols = ['Claim', 'Evidence', 'Label']
xtrain_df = pd.DataFrame(xtr_evidence(train_df),columns=cols)
ytrain_df = xtrain_df["Label"]
xtrain_df = xtrain_df.drop("Label",axis=1)
xval_df = pd.DataFrame(xtr_evidence(val_df),columns=cols)
yval_df = xval_df["Label"]
xval_df = xval_df.drop("Label",axis=1)
xtest_df = pd.DataFrame(xtr_evidence(test_df),columns=cols)
ytest_df = xtest_df["Label"]
xtest_df = xtest_df.drop("Label",axis=1)
# An inspection
xtrain_df.head()

[nltk_data] Downloading package punkt to /home/riemann/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,Claim,Evidence
0,chris hemsworth appeared in a perfect getaway.,hemsworth has also appeared in the science fic...
1,chris hemsworth appeared in a perfect getaway.,star trek star trek film a perfect getaway a p...
2,roald dahl is a writer.,roald dahl 13 september 1916 -- 23 november 19...
3,roald dahl is a writer.,fighter pilot fighter pilot
4,roald dahl is a governor.,roald dahl 13 september 1916 -- 23 november 19...


## Tokenizing, vocabulary building, encoding - embedding



In [9]:
## tokenizing
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
# 1st - sentence to words
from tensorflow.keras.preprocessing.text import text_to_word_sequence
xtest_df['Claim'] = xtest_df['Claim'].apply(text_to_word_sequence)
xtest_df['Evidence'] = xtest_df['Evidence'].apply(text_to_word_sequence)
xval_df['Claim'] = xval_df['Claim'].apply(text_to_word_sequence)
xval_df['Evidence'] = xval_df['Evidence'].apply(text_to_word_sequence)
xtrain_df['Claim'] = xtrain_df['Claim'].apply(text_to_word_sequence)
xtrain_df['Evidence'] = xtrain_df['Evidence'].apply(text_to_word_sequence)

xtrain_df.head()



Unnamed: 0,Claim,Evidence
0,"[chris, hemsworth, appeared, in, a, perfect, g...","[hemsworth, has, also, appeared, in, the, scie..."
1,"[chris, hemsworth, appeared, in, a, perfect, g...","[star, trek, star, trek, film, a, perfect, get..."
2,"[roald, dahl, is, a, writer]","[roald, dahl, 13, september, 1916, 23, novembe..."
3,"[roald, dahl, is, a, writer]","[fighter, pilot, fighter, pilot]"
4,"[roald, dahl, is, a, governor]","[roald, dahl, 13, september, 1916, 23, novembe..."


In [53]:
%%time 
# Let's build our vocabulary just on our train data, both Evidence and Claim
word_tokenizer = Tokenizer()              # we'll address OOV
word_tokenizer.fit_on_texts(pd.concat([
    xtrain_df['Evidence'],xtrain_df['Claim']]))

word_index = word_tokenizer.word_index
word_index = {k:v-1 for k,v in word_index.items()} #v-1
index_word = word_tokenizer.index_word
index_word = {k-1:v for k,v in index_word.items()} #k-1
word_count = dict(word_tokenizer.word_counts)
word_listing = sorted([*word_count.keys()])
print(len(word_listing))

33806
CPU times: user 5.31 s, sys: 0 ns, total: 5.31 s
Wall time: 5.31 s
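
For intuition, the sketch below mimics what the vocabulary-building step produces: words ranked by frequency and numbered from 1, with index 0 left free for padding, which is the convention the Keras `Tokenizer` follows. It is a plain-Python approximation written for illustration, not the Keras implementation; `build_vocab` and `encode` are hypothetical names.

```python
from collections import Counter


def build_vocab(tokenised_texts):
    """Hypothetical helper: map each word to a frequency-ranked index starting at 1."""
    counts = Counter(token for text in tokenised_texts for token in text)
    # most_common() sorts by descending count; ties keep first-seen order.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(tokens, word_index):
    """Replace known words with their index; unknown words are skipped,
    as the Keras tokenizer does when no oov_token is set."""
    return [word_index[token] for token in tokens if token in word_index]


train_texts = [
    ["roald", "dahl", "is", "a", "writer"],
    ["roald", "dahl", "is", "a", "governor"],
]
vocab = build_vocab(train_texts)
print(vocab)
print(encode(["roald", "dahl", "was", "a", "pilot"], vocab))
```

Because index 0 is never assigned to a word, it can safely be used as the padding value later on.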


In [54]:
print(index_word)

{0: 'the', 1: 'of', 2: 'in', 3: 'a', 4: 'and', 5: 'is', 6: 'film', 7: 'was', 8: 'by', 9: 'an', 10: 'for', 11: 'american', 12: 'to', 13: 'on', 14: 'as', 15: 'award', 16: 'has', 17: 'united', 18: 'with', 19: 'born', 20: 'series', 21: 'states', 22: 'best', 23: 'from', 24: 'television', 25: 'actor', 26: 'drama', 27: 'his', 28: "'s", 29: 'he', 30: 'comedy', 31: 'academy', 32: 'world', 33: 'one', 34: 'album', 35: 'directed', 36: 'actress', 37: 'awards', 38: 'known', 39: 'at', 40: 'music', 41: "''", 42: 'first', 43: 'released', 44: 'john', 45: 'it', 46: 'new', 47: 'her', 48: 'rock', 49: 'that', 50: 'city', 51: 'who', 52: 'name', 53: 'which', 54: 'producer', 55: 'films', 56: 'stars', 57: 'written', 58: 'not', 59: 'also', 60: 'or', 61: 'director', 62: 'she', 63: 'fiction', 64: 'tv', 65: 'british', 66: 'won', 67: 'david', 68: 'singer', 69: 'role', 70: 'list', 71: 'golden', 72: 'character', 73: 'only', 74: 'band', 75: 'state', 76: 'science', 77: 'records', 78: 'are', 79: 'game', 80: 'novel', 81: 

In [59]:
print(word_listing)

["'", "''", "'49", "'64", "'70s", "'80s", "'90s", "'a'", "'abd", "'d", "'dan", "'disco", "'em", "'j'", "'ll", "'m", "'m'", "'n", "'n'", "'re", "'s", "'s'", "'smart'", "'til", "'ve", '0', '00', "00's", '000', '001', '005', '006', '007', '009', '00s', '01', '016', '02', '022', '023', '024', '026', '027', '03', '030', '035', '039', '04', '040', '042', '046', '05', '05383', '06', '062', '07', '076', '08', '084', '09', '0914918327', '095', '096', '1', '10', '100', "100's", '1000', '10000', '1000th', '1001', '1003', '1005', '1008', '1009', '100th', '101', '1011', '101st', '102', '102nd', '103', '103000', '1038', '104', '1045', '1047', '1048', '105', '1050s', '106', '1066', '1068', '107', '1070', '108', '109', '1093', '1096', '1099', '109th', '10s', '10th', '11', '110', '1100', '1113', '111th', '112', '112077', '113', '1132', '1135', '1136', '1138', '1139', '114', '115', '1150', '115th', '116', '1162', '117', '118', '1181', '11823', '1186', '118th', '119', '11a', '11th', '11y', '11处特工皇妃', '12

In [56]:
import simplejson as sj

dataset_name = "factchecking"
vocab_path = os.path.join(os.getcwd(), 'dataset', dataset_name, 'vocab.json')

print("Saving vocabulary to {}".format(vocab_path))
with open(vocab_path, mode='w') as f:
    sj.dump(word_index, f, indent=4)
print("Saving completed!")

Saving vocabulary to /media/riemann/mnt/university/fifth_year/NLP/assignments/dataset/factchecking/vocab.json
Saving completed!


In [57]:
# Embed the sentences above using pre-trained GloVe word vectors

import gensim
import gensim.downloader as gloader

def load_embedding_model(embedding_dimension=50):
    """
    Loads a pre-trained word embedding model via gensim library.

    :param embedding_dimension: size of the embedding space to consider

    :return
        - pre-trained word embedding model (gensim KeyedVectors object)
    """
    
    emb_model = gloader.load("glove-wiki-gigaword-{}".format(embedding_dimension))

    return emb_model


embedding_dimension = 50

embedding_model = load_embedding_model(embedding_dimension)
print(embedding_model)

<gensim.models.keyedvectors.KeyedVectors object at 0x7f0a32c92d00>


In [58]:
# Function definition

def check_OOV_terms(embedding_model, word_listing):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms.

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_listing: dataset specific vocabulary (list)

    :return
        - list of OOV terms
    """
    return list(set(word_listing) - set(embedding_model.key_to_index))


oov_terms = check_OOV_terms(embedding_model, word_listing)

print("Total OOV terms: {0} ({1:.2%})".format(len(oov_terms), len(oov_terms) / len(word_listing)))

# for i in range(len(oov_terms) // 10):
#   print(oov_terms[i * 10:(i+1) * 10])

Total OOV terms: 4028 (11.92%)


In [60]:
import random
import scipy.sparse

def build_embedding_matrix(embedding_model, embedding_dimension, word_index, vocab_size, oov_terms):
    """
    Builds the embedding matrix of a specific dataset given a pre-trained word embedding model

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_index: vocabulary map (word -> index) (dict)
    :param vocab_size: size of the vocabulary
    :param oov_terms: list of OOV terms (list)

    :return
        - embedding matrix that assigns a high dimensional vector to each word in the dataset specific vocabulary (shape |V| x d)
    """
    
    embedding_matrix = np.zeros((vocab_size,embedding_dimension),dtype=np.float32)
    print(word_index)
    print(vocab_size)
    
    oov_set = set(oov_terms)  # set membership is O(1); a list lookup scans all OOV terms per word

    for w,i in word_index.items():
        if w not in oov_set:
            embedding_matrix[i] = embedding_model[w]
        else:
            embedding_matrix[i] = np.random.uniform(low=-0.05, high=0.05, size=embedding_dimension)
    
    return embedding_matrix

embedding_matrix = build_embedding_matrix(embedding_model, embedding_dimension, word_index, len(word_index), oov_terms)

print("Embedding matrix shape: {}".format(embedding_matrix.shape))

{'the': 0, 'of': 1, 'in': 2, 'a': 3, 'and': 4, 'is': 5, 'film': 6, 'was': 7, 'by': 8, 'an': 9, 'for': 10, 'american': 11, 'to': 12, 'on': 13, 'as': 14, 'award': 15, 'has': 16, 'united': 17, 'with': 18, 'born': 19, 'series': 20, 'states': 21, 'best': 22, 'from': 23, 'television': 24, 'actor': 25, 'drama': 26, 'his': 27, "'s": 28, 'he': 29, 'comedy': 30, 'academy': 31, 'world': 32, 'one': 33, 'album': 34, 'directed': 35, 'actress': 36, 'awards': 37, 'known': 38, 'at': 39, 'music': 40, "''": 41, 'first': 42, 'released': 43, 'john': 44, 'it': 45, 'new': 46, 'her': 47, 'rock': 48, 'that': 49, 'city': 50, 'who': 51, 'name': 52, 'which': 53, 'producer': 54, 'films': 55, 'stars': 56, 'written': 57, 'not': 58, 'also': 59, 'or': 60, 'director': 61, 'she': 62, 'fiction': 63, 'tv': 64, 'british': 65, 'won': 66, 'david': 67, 'singer': 68, 'role': 69, 'list': 70, 'golden': 71, 'character': 72, 'only': 73, 'band': 74, 'state': 75, 'science': 76, 'records': 77, 'are': 78, 'game': 79, 'novel': 80, 'mos

Embedding matrix shape: (33806, 50)
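
When token ids come from a tokenizer that numbers words from 1 and reserves 0 for padding, the embedding matrix is usually laid out so that row `i` holds the vector of the word with id `i`, which means `|V| + 1` rows. The sketch below shows this layout on toy data; `build_aligned_matrix`, the toy vocabulary and the toy vectors are all assumptions made for illustration and do not come from the notebook.

```python
import numpy as np


def build_aligned_matrix(word_index, vectors, dim, seed=0):
    """Hypothetical helper. word_index maps word -> id starting at 1;
    vectors maps word -> pre-trained vector (toy stand-in for GloVe)."""
    rng = np.random.default_rng(seed)
    # One extra row: row 0 stays all-zero and is used only by padding tokens.
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]                         # known word: copy its vector
        else:
            matrix[idx] = rng.uniform(-0.05, 0.05, size=dim)    # OOV word: small random vector
    return matrix


toy_word_index = {"the": 1, "film": 2, "zzyzx": 3}          # ids start at 1
toy_vectors = {"the": np.ones(4), "film": np.full(4, 2.0)}  # "zzyzx" is out of vocabulary
matrix = build_aligned_matrix(toy_word_index, toy_vectors, dim=4)
print(matrix.shape)
```

With this layout an embedding layer would take `input_dim = len(word_index) + 1`, and token ids can be used as row indices without shifting or clamping.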


In [30]:
print(type(xtrain_df))

<class 'pandas.core.frame.DataFrame'>


In [81]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

def flatten(input):
    new_list = []
    for i in input:
        for j in i:
            new_list.append(j)
    return np.array(new_list)

tmp = []

def parse_arrays(arr):
    tmp2 = []
    tmp3 = []
    for i in range(len(arr)):
        tmp2.append(arr[i][0])
        tmp3.append(arr[i][1])

    tmp2, tmp3 = pad_sequences(tmp2), pad_sequences(tmp3)
    tmp4 = [1,1]
    tmp4[0] = tmp2
    tmp4[1] = tmp3

    tmp5 = []

    for i in range(len(tmp4[0])):
        l = pad_sequences([tmp4[0][i], tmp4[1][i]])
        tmp5.append(list(l[0]))
        tmp5.append(list(l[1]))

    tmp5 = np.reshape(tmp5,(int(len(tmp5)/2),2,-1))

    return tmp5

def convert_text(texts, tokenizer, is_training=False, max_seq_length=None):
    """
    Converts input text sequences using a given tokenizer

    :param texts: either a list or numpy ndarray of strings
    :tokenizer: an instantiated tokenizer
    :is_training: whether input texts are from the training split or not
    :max_seq_length: the max token sequence previously computed with
    training texts.

    :return
        text_ids: a nested list on token indices
        max_seq_length: the max token sequence previously computed with
        training texts.
    """
    #print(texts)
    text_ids = tokenizer.texts_to_sequences(texts)
    
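    # The Keras tokenizer assigns ids 1..|V| (|V| = 33806 here), while the embedding
    # matrix has rows 0..|V|-1, so the largest id is clamped to the last valid row
    # to avoid an out-of-range lookup. Note that this does not undo the one-position
    # shift between tokenizer ids and matrix rows.
    for i in range(len(text_ids)):
        for j in range(len(text_ids[i])):
            if text_ids[i][j] == 33806:
                print("haha")
                text_ids[i][j] = 33805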
    #print(text_ids[:5])
    # Padding
    if is_training:
        max_seq_length = int(np.quantile([len(seq) for seq in text_ids], 0.99))
    else:
        assert max_seq_length is not None

    text_ids = [seq + [0] * (max_seq_length - len(seq)) for seq in text_ids]
    text_ids = np.array([seq[:max_seq_length] for seq in text_ids])
    #print(text_ids)
    #print(aaa)
    if is_training:
        return np.array(text_ids), max_seq_length
    else:
        return np.array(text_ids)

def encode_words(df,wt,train=False,max_seq_length=None):
    """
    From a dataframe with the format [Claim, Evidence] and a word tokenizer,
    returns the encoded claims and evidences.
    """
    ev_list=[]
    #for index, row in df.iterrows():
    if train:
        cl,max_seq_length1 = convert_text(df["Claim"].to_numpy(),wt,train,max_seq_length)
        ev,max_seq_length2 = convert_text(df["Evidence"].to_numpy(),wt,train,max_seq_length)
        max_seq_length = max_seq_length1 if max_seq_length1 > max_seq_length2 else max_seq_length2
        if max_seq_length == max_seq_length1:
            ev = convert_text(df["Evidence"].to_numpy(),wt,False,max_seq_length)
        else:
            cl = convert_text(df["Claim"].to_numpy(),wt,False,max_seq_length)
        ev_list.append([cl,ev])
        return ev_list[0],max_seq_length
    else:
        cl = convert_text(df["Claim"].to_numpy(),wt,train,max_seq_length)
        ev = convert_text(df["Evidence"].to_numpy(),wt,train,max_seq_length)
        ev_list.append([cl,ev])
        return ev_list[0]

"""
xtrain = parse_arrays(encode_words(xtrain_df,word_tokenizer,True))
ytrain = ytrain_df.to_numpy()
xval = parse_arrays(encode_words(xval_df,word_tokenizer))
yval = yval_df.to_numpy()
xtest = parse_arrays(encode_words(xtest_df,word_tokenizer))
ytest = ytest_df.to_numpy()

print(xtrain[:5])
"""
xtrain, max_input_length = encode_words(xtrain_df,word_tokenizer,True)
ytrain = ytrain_df.to_numpy()
xval = encode_words(xval_df,word_tokenizer,False,max_input_length)
yval = yval_df.to_numpy()
xtest = encode_words(xtest_df,word_tokenizer,False,max_input_length)
ytest = ytest_df.to_numpy()

haha
haha
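
The padding step boils down to cutting long sequences and right-filling short ones with the padding id. A minimal sketch, assuming 0 is the padding id as in the code above (`pad_or_truncate` is a hypothetical name):

```python
def pad_or_truncate(sequences, max_len, pad_id=0):
    """Hypothetical helper: force every id sequence to exactly max_len entries."""
    fixed = []
    for seq in sequences:
        clipped = seq[:max_len]                                       # drop tokens beyond max_len
        fixed.append(clipped + [pad_id] * (max_len - len(clipped)))   # right-pad the remainder
    return fixed


print(pad_or_truncate([[5, 8], [3, 1, 4, 1, 5]], max_len=3))
```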


In [82]:
print(list(train_df["Claim"])[:5])

['chris hemsworth appeared in a perfect getaway.', 'roald dahl is a writer.', 'roald dahl is a governor.', 'ireland has relatively low-lying mountains.', 'ireland does not have relatively low-lying mountains.']


In [83]:
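def np_and_reshape(x):
    # x has shape (2, n_samples, seq_len): all claims stacked on all evidences.
    # Swap the first two axes so that sample i holds (claim i, evidence i);
    # np.reshape would keep the flat memory order and mix up the pairs.
    x = np.array(x)
    return np.transpose(x, (1, 0, 2))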

xtrain, xval, xtest = np_and_reshape(xtrain), np_and_reshape(xval), np_and_reshape(xtest)

max_input_length = len(xtrain[0][0])

print("Max token sequence: {}".format(max_input_length))

print('X train shape: ', xtrain.shape)
print('Y train shape: ', ytrain.shape)

print('X val shape: ', xval.shape)
print('Y val shape: ', yval.shape)

print('X test shape: ', xtest.shape)
print('Y test shape: ', ytest.shape)

print("Embedding matrix shape: ",embedding_matrix.shape)
print("Vocab size: ",len(word_index))

Max token sequence: 67
X train shape:  (234126, 2, 67)
Y train shape:  (234126,)
X val shape:  (13567, 2, 67)
Y val shape:  (13567,)
X test shape:  (13633, 2, 67)
Y test shape:  (13633,)
Embedding matrix shape:  (33806, 50)
Vocab size:  33806
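
Going from a `(2, n_samples, seq_len)` array to `(n_samples, 2, seq_len)` can be done with either `np.reshape` or `np.transpose`, and both give the same shape, but only the transpose keeps claim `i` next to evidence `i`. The toy arrays below (made-up values, three samples of length two) illustrate the difference.

```python
import numpy as np

claims = np.array([[1, 1], [2, 2], [3, 3]])           # toy claim i is filled with i + 1
evidences = np.array([[10, 10], [20, 20], [30, 30]])  # toy evidence i is filled with 10 * (i + 1)
stacked = np.array([claims, evidences])               # shape (2, 3, 2)

reshaped = np.reshape(stacked, (3, 2, 2))      # same shape, but rows are re-cut in memory order
transposed = np.transpose(stacked, (1, 0, 2))  # axes swapped: sample i = (claim i, evidence i)

print(reshaped[0].tolist())    # two claims end up in the same sample
print(transposed[0].tolist())  # claim 0 paired with evidence 0
```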


In [116]:
from tensorflow import keras
from tensorflow.keras import layers

def create_model(layers_info, compile_info):
    
    model = keras.Sequential()
    for info in layers_info:
        layer = info["layer"](**{key:value for key,value in info.items() if key != "layer"})
        model.add(layer)
    
    model.summary()
    
    model.compile(**compile_info)
    
    return model


layers_info = [
    {
        "layer": layers.Embedding,
        "output_dim": 50,
        "input_dim": embedding_matrix.shape[0],  # must match the number of rows of the pre-trained matrix
        "input_length": max_input_length,
        #"input_length"
        "weights": embedding_matrix if embedding_matrix is None else [embedding_matrix],
        #"mask_zero": True,
        "name": "embedding_layer",
        "trainable": False
    },
    {
        "layer": layers.SimpleRNN,
        "units": 50,
        "name": "sentence_embedding"
    },
    {
        'layer': layers.Dense,
        "units": 256,
        "activation": "relu",
        "name": "dense_1"
    },
    {
        "layer": layers.Dense,
        "units": 64,
        "activation": "relu",
        "name": "dense_2"
    },
    {
        "layer": layers.Dense,
        "units": 2,
        "activation": "softmax",
        "name": "logits"
    }
]

compile_info = {
    'optimizer': keras.optimizers.Adam(learning_rate=1e-3),
    'loss': 'sparse_categorical_crossentropy',
    'metrics': [keras.metrics.SparseCategoricalAccuracy()],
}

model = create_model(layers_info, compile_info)

Model: "sequential_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_layer (Embedding)  (None, 67, 50)           1690300   
                                                                 
 sentence_embedding (SimpleR  (None, 50)               5050      
 NN)                                                             
                                                                 
 dense_1 (Dense)             (None, 256)               13056     
                                                                 
 dense_2 (Dense)             (None, 64)                16448     
                                                                 
 logits (Dense)              (None, 2)                 130       
                                                                 
Total params: 1,724,984
Trainable params: 34,684
Non-trainable params: 1,690,300
______________________________________
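
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer sizes. The formulas below are the standard ones for these layer types; the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim                  # one vector per vocabulary entry


def simple_rnn_params(input_dim, units):
    return units * (input_dim + units + 1)   # input kernel + recurrent kernel + bias


def dense_params(input_dim, units):
    return input_dim * units + units         # weights + bias


embedding = embedding_params(33806, 50)      # frozen, hence non-trainable
trainable = (simple_rnn_params(50, 50) + dense_params(50, 256)
             + dense_params(256, 64) + dense_params(64, 2))
print(embedding, trainable, embedding + trainable)
```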

In [59]:
print(xtrain[:,0].shape)
print(xtrain.shape)

(234126, 122)
(234126, 2, 122)


In [49]:
tmp = 0
for j in range(33807):
    found=False
    for i in range(len(xtrain)):
        if tmp in xtrain[i,0,:] or tmp in xtrain[i,1,:]:
            found = True
            break
    if not found:
        print(tmp,"not found")
        break
    tmp+=1

KeyboardInterrupt: 

In [84]:
#from tensorflow.keras.layers import Embedding, Input, Flatten
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

embed_lookup = layers.Embedding(
    input_dim=len(word_index),
    output_dim=50,
    input_length=max_input_length,
    # Initialise from the pre-trained matrix when one is available
    weights=None if embedding_matrix is None else [embedding_matrix],
    # mask_zero=True,
    name="embedding_layer",
)

x = layers.Input(shape=xtrain[0,0,:].shape)
x_lookup = embed_lookup(x)
#x_lookup_flattened = layers.Flatten()(x_lookup)
embed_model = Model(x,x_lookup,name="embed_model")

embed_model.summary()

Model: "embed_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 67)]              0         
                                                                 
 embedding_layer (Embedding)  (None, 67, 50)           1690300   
                                                                 
Total params: 1,690,300
Trainable params: 1,690,300
Non-trainable params: 0
_________________________________________________________________


In [94]:
claim = layers.Input(shape=xtrain[0,0,:].shape, name='claim')
evidence = layers.Input(shape=xtrain[0,0,:].shape, name='evidence')

claim_embedded = embed_model(claim)
evidence_embedded = embed_model(evidence)

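out = layers.Concatenate()([claim_embedded, evidence_embedded])
# NOTE: the RNN below reads claim_embedded, not `out`, so the concatenation above
# is discarded and the evidence input never reaches the model output.
out = layers.SimpleRNN(50,activation="relu",name="RNN")(claim_embedded)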
#out = layers.Dense(256,activation="relu",name="dense_1")(out)
#out = layers.Dense(64,activation="relu",name="dense_2")(out)
#out = layers.Dense(2,activation="softmax",name="logits")(out)

model2 = Model([claim,evidence],out)

model2.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 claim (InputLayer)             [(None, 67)]         0           []                               
                                                                                                  
 embed_model (Functional)       (None, 67, 50)       1690300     ['claim[0][0]']                  
                                                                                                  
 evidence (InputLayer)          [(None, 67)]         0           []                               
                                                                                                  
 RNN (SimpleRNN)                (None, 50)           5050        ['embed_model[6][0]']            
                                                                                            

In [87]:
print(xtrain[:,0,:].shape)

(234126, 67)


In [95]:
model2.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss='sparse_categorical_crossentropy',
    metrics= [keras.metrics.SparseCategoricalAccuracy()])

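# NOTE: validation_data passes xval as a single (N, 2, 67) array, while the model
# has two inputs; it needs the same claim/evidence split as the training data,
# i.e. ([xval[:,0,:], xval[:,1,:]], yval). This mismatch raises the error below.
model2.fit([xtrain[:,0,:],xtrain[:,1,:]],
          ytrain, validation_data=(xval, yval),
          epochs=1, verbose=True, batch_size=64)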




ValueError: in user code:

    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 1366, in test_function  *
        return step_function(self, iterator)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 1356, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 1349, in run_step  **
        outputs = model.test_step(data)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 1303, in test_step
        y_pred = self(x, training=False)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/input_spec.py", line 199, in assert_input_compatibility
        raise ValueError(f'Layer "{layer_name}" expects {len(input_spec)} input(s),'

    ValueError: Layer "model_8" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, 2, 67) dtype=int64>]
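
The error above says that the two-input model received a single `(None, 2, 67)` tensor, which is consistent with the stacked validation array being passed as-is. A minimal sketch of splitting a stacked array into separate claim and evidence inputs, assuming axis 1 holds the claim at index 0 and the evidence at index 1 (as in the training inputs of the `fit` call); `split_pair_inputs` is a hypothetical helper, not part of Keras:

```python
import numpy as np

def split_pair_inputs(x):
    """Split a (samples, 2, max_tokens) array into [claims, evidence]."""
    if x.ndim != 3 or x.shape[1] != 2:
        raise ValueError(f"expected shape (samples, 2, max_tokens), got {x.shape}")
    return [x[:, 0, :], x[:, 1, :]]

# Toy stand-in for xval: 4 samples, 67 tokens per sentence
xval_toy = np.zeros((4, 2, 67), dtype=np.int64)
claims, evidence = split_pair_inputs(xval_toy)
print(claims.shape, evidence.shape)  # (4, 67) (4, 67)
```

With such a helper, the validation argument could be written as `validation_data=(split_pair_inputs(xval), yval)`, mirroring how the training inputs are passed.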


In [41]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import f1_score, accuracy_score
from functools import partial

def show_history(history):
    """
    Shows training history data stored by the History Keras callback

    :param history: History Keras callback
    """

    history_data = history.history
    print("Displaying the following history keys: ", history_data.keys())

    for key, value in history_data.items():
        if not key.startswith('val'):
            fig, ax = plt.subplots(1, 1)
            ax.set_title(key)
            ax.plot(value)
            if 'val_{}'.format(key) in history_data:
                ax.plot(history_data['val_{}'.format(key)])
            else:
                print("Couldn't find validation values for metric: ", key)

            ax.set_ylabel(key)
            ax.set_xlabel('epoch')
            ax.legend(['train', 'val'], loc='best')

    plt.show()


def train_model(model: keras.Model,
                x_train: np.ndarray,
                y_train: np.ndarray,
                x_val: np.ndarray,
                y_val: np.ndarray,
                training_info):
    """
    Training routine for the Keras model.
    At the end of the training, retrieved History data is shown.

    :param model: Keras built model
    :param x_train: training data in np.ndarray format
    :param y_train: training labels in np.ndarray format
    :param x_val: validation data in np.ndarray format
    :param y_val: validation labels in np.ndarray format
    :param training_info: dictionary storing model fit() argument information

    :return
        model: trained Keras model
    """
    print("Start training! \nParameters: {}".format(training_info))
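    # x_train / x_val are stacked as (samples, 2, max_tokens): index 0 on axis 1
    # holds the claim, index 1 the evidence. Split them into the two model inputs.
    history = model.fit(x=[x_train[:, 0, :], x_train[:, 1, :]], y=y_train,
                        validation_data=([x_val[:, 0, :], x_val[:, 1, :]], y_val),
                        **training_info)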
    print("Training completed! Showing history...")

    show_history(history)

    return model

def predict_data(model,
                 x,
                 prediction_info) -> np.ndarray:
    """
    Inference routine of a given input set of examples

    :param model: Keras built and possibly trained model
    :param x: input set of examples in np.ndarray format
    :param prediction_info: dictionary storing model predict() argument information

    :return
        predictions: predicted labels in np.ndarray format
    """

    print('Starting prediction: \n{}'.format(prediction_info))
    print('Predicting on {} samples'.format(x.shape[0]))

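    # x is stacked as (samples, 2, max_tokens): feed claim and evidence as two inputs
    predictions = model.predict([x[:, 0, :], x[:, 1, :]], **prediction_info)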
    return predictions


def evaluate_predictions(predictions,
                         y,
                         metrics,
                         metric_names):
    """
    Evaluates given model predictions on a list of metric functions

    :param predictions: model predictions in np.ndarray format
    :param y: ground-truth labels in np.ndarray format
    :param metrics: list of metric functions
    :param metric_names: list of metric names

    :return
        metric_info: dictionary containing metric values for each input metric
    """

    assert len(metrics) == len(metric_names)

    print("Evaluating predictions! Total samples: ", y.shape[0])

    metric_info = {}

    for metric, metric_name in zip(metrics, metric_names):
        metric_value = metric(y_pred=predictions, y_true=y)
        metric_info[metric_name] = metric_value

    return metric_info


# Training

training_info = {
    'verbose': 1,
    'epochs': 5,
    'batch_size': 64,
}
model = train_model(model=model, x_train=xtrain, y_train=ytrain,
                    x_val=xval, y_val=yval, training_info=training_info)

# Inference

prediction_info = {
    'batch_size': 64,
    'verbose': 1
}
test_predictions = predict_data(model=model, x=xtest,
                                      prediction_info=prediction_info)

# Retrieving labels from raw predictions
test_predictions = np.argmax(test_predictions, axis=-1)

# Evaluation

metrics = [
    accuracy_score,
    partial(f1_score, pos_label=1, average='binary')
]
metric_names = [
    "accuracy",
    "binary_f1"
]
metric_info = evaluate_predictions(predictions=test_predictions,
                                   y=ytest,
                                   metrics=metrics,
                                   metric_names=metric_names)

print('Metrics info: \n{}'.format(metric_info))

Start training! 
Parameters: {'verbose': 1, 'epochs': 5, 'batch_size': 64}
Epoch 1/5


2021-12-13 11:15:19.435178: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 228506976 exceeds 10% of free system memory.


ValueError: in user code:

    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/training.py", line 808, in train_step
        y_pred = self(x, training=True)
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/home/riemann/anaconda3/envs/test_env/lib/python3.9/site-packages/keras/engine/input_spec.py", line 263, in assert_input_compatibility
        raise ValueError(f'Input {input_index} of layer "{layer_name}" is '

    ValueError: Input 0 of layer "sequential_5" is incompatible with the layer: expected shape=(None, 122), found shape=(None, 2, 122)


## 4.1 Schema

The overall binary classification problem is summed up by the following (simplified) schema:

![](https://drive.google.com/uc?export=view&id=1Wm_YBnFwgJtxcWEBpPbTBEVkpKaL08Jp)

Don't worry too much about the **Encoding** block for now. We'll give you some simple guidelines about its definition. For the moment, stick to the binary classification task definition where, in this case, we have 2 inputs: the claim to verify and one of its associated pieces of evidence.

# Architecture Guidelines

There are many neural architectures that follow the above schema. To help you avoid writer's block, this section gives you some implementation guidelines.

In particular, we would like you to test some implementations so that you explore basic approaches (neural baselines) and use them as building blocks for possible extensions.

## 5.1 Handling multiple inputs

The first thing to notice is that we are in a multi-input scenario. In particular, each sample consists of a fact and its associated evidence statement.

Each of these inputs is encoded as a sequence of tokens. In particular, we will have the following input matrices:

*    Claim: `[batch_size, max_tokens]`
*    Evidence: `[batch_size, max_tokens]`

Moreover, after the embedding layer, we'll have:

*    Claim: `[batch_size, max_tokens, embedding_dim]`
*    Evidence: `[batch_size, max_tokens, embedding_dim]`

However, we would like to have a 2D input to our classifier, since we have to give an answer at pair level. Therefore, for each sample, we would expect the following input shape to our classification block:

*   Classification input shape: `[batch_size, dim]`

**How to do that?**

We inherently need to reduce the token sequence to a single representation. This operation is formally known as **sentence embedding**. Indeed, we are trying to compress the information of a whole sequence into a single embedding vector.

Here are some simple solutions that we ask you to try out:

1.   Encode token sequences via an RNN and take the last state as the sentence embedding.

2.  Encode token sequences via an RNN and average all the output states.

3.  Encode token sequences via a simple MLP layer. In particular, if your input is a `[batch_size, max_tokens, embedding_dim]` tensor, the matrix multiplication works on the **max_tokens** dimension, resulting in a `[batch_size, embedding_dim]` 2D matrix. Alternatively, you can reshape the 3D input tensor from `[batch_size, max_tokens, embedding_dim]` to `[batch_size, max_tokens * embedding_dim]` and then apply the MLP layer.

4.   Compute the sentence embedding as the mean of its token embeddings (**bag of vectors**).
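
To make the shapes concrete, here is a minimal NumPy sketch of the four reductions on random data. It is only a shape illustration, not a trainable model: the weights are random, the recurrent cell is a plain tanh RNN written by hand, and all names and sizes (`simple_rnn_states`, `W_tok`, `W_flat`, `units`) are assumptions made for this example. In an actual model each strategy would be built from the corresponding layers of your deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, max_tokens, embedding_dim, units = 4, 6, 5, 3
x = rng.normal(size=(batch_size, max_tokens, embedding_dim))  # embedded tokens

def simple_rnn_states(x, units, rng):
    """Hand-written tanh RNN; returns all states, shape [batch, tokens, units]."""
    W = rng.normal(size=(x.shape[-1], units))  # input-to-hidden weights
    U = rng.normal(size=(units, units))        # hidden-to-hidden weights
    h = np.zeros((x.shape[0], units))
    states = []
    for t in range(x.shape[1]):
        h = np.tanh(x[:, t, :] @ W + h @ U)
        states.append(h)
    return np.stack(states, axis=1)

states = simple_rnn_states(x, units, rng)

# 1. RNN, last state as the sentence embedding
emb_last = states[:, -1, :]                    # [batch_size, units]

# 2. RNN, average of all output states
emb_avg = states.mean(axis=1)                  # [batch_size, units]

# 3a. Dense layer acting on the max_tokens dimension
W_tok = rng.normal(size=(max_tokens, 1))
emb_mlp = (np.transpose(x, (0, 2, 1)) @ W_tok)[..., 0]  # [batch_size, embedding_dim]

# 3b. Flatten to [batch_size, max_tokens * embedding_dim], then a dense layer
W_flat = rng.normal(size=(max_tokens * embedding_dim, units))
emb_flat = x.reshape(batch_size, -1) @ W_flat  # [batch_size, units]

# 4. Bag of vectors: mean of the token embeddings
emb_bov = x.mean(axis=1)                       # [batch_size, embedding_dim]

for name, emb in [("last", emb_last), ("avg", emb_avg), ("mlp", emb_mlp),
                  ("flat", emb_flat), ("bov", emb_bov)]:
    print(name, emb.shape)
```

Note that a plain mean (strategies 2 and 4) also averages over padding positions; whether to mask them out is a design choice left to you.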

## 5.2 Merging multi-inputs

At this point, we have to think about **how** we should merge evidence and claim sentence embeddings.

For simplicity, we stick to simple merging strategies:

*     **Concatenation**: define the classification input as the concatenation of evidence and claim sentence embeddings

*     **Sum**: define the classification input as the sum of evidence and claim sentence embeddings

*     **Mean**: define the classification input as the mean of evidence and claim sentence embeddings

For clarity, if the sentence embedding of a single input has shape `[batch_size, embedding_dim]`, then the classification input has shape:

*     **Concatenation**: `[batch_size, 2 * embedding_dim]`

*     **Sum**: `[batch_size, embedding_dim]`

*     **Mean**: `[batch_size, embedding_dim]`
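
The shapes above can be checked with a small NumPy sketch. The function name `merge` and the strategy strings are assumptions made for this illustration; in Keras the same operations would typically be expressed with the `Concatenate`, `Add` and `Average` layers.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, embedding_dim = 4, 5
claim_emb = rng.normal(size=(batch_size, embedding_dim))
evidence_emb = rng.normal(size=(batch_size, embedding_dim))

def merge(claim, evidence, strategy):
    """Merge two [batch_size, embedding_dim] sentence embeddings."""
    if strategy == "concat":
        return np.concatenate([claim, evidence], axis=-1)
    if strategy == "sum":
        return claim + evidence
    if strategy == "mean":
        return (claim + evidence) / 2.0
    raise ValueError(f"unknown merging strategy: {strategy}")

for strategy in ("concat", "sum", "mean"):
    print(strategy, merge(claim_emb, evidence_emb, strategy).shape)
# concat (4, 10)
# sum (4, 5)
# mean (4, 5)
```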

# A simple extension

Lastly, we ask you to modify previously defined neural architectures by adding an additional feature to the classification input.

We would like to see whether some similarity information between the claim to verify and one of its associated pieces of evidence might be useful for the classification.

Compute the cosine similarity metric between the two sentence embeddings and concatenate the result to the classification input.

For clarity, since the cosine similarity of two vectors outputs a scalar value, the classification input shape is modified as follows:

*     **Concatenation**: `[batch_size, 2 * embedding_dim + 1]`

*     **Sum**: `[batch_size, embedding_dim + 1]`

*     **Mean**: `[batch_size, embedding_dim + 1]`
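
A minimal NumPy sketch of this extension for the concatenation case. The helper name `cosine_similarity` and the small `eps` guard against all-zero vectors are assumptions of this example, not requirements of the assignment.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity of two [batch_size, dim] arrays -> [batch_size, 1]."""
    dot = np.sum(a * b, axis=-1, keepdims=True)
    norms = (np.linalg.norm(a, axis=-1, keepdims=True)
             * np.linalg.norm(b, axis=-1, keepdims=True))
    return dot / np.maximum(norms, eps)  # eps avoids division by zero

claim_emb = np.array([[1.0, 0.0], [1.0, 1.0]])
evidence_emb = np.array([[2.0, 0.0], [-1.0, -1.0]])

sim = cosine_similarity(claim_emb, evidence_emb)  # about 1 for row 0, about -1 for row 1
merged = np.concatenate([claim_emb, evidence_emb], axis=-1)  # concatenation merge
clf_input = np.concatenate([merged, sim], axis=-1)
print(clf_input.shape)  # (2, 5), i.e. [batch_size, 2 * embedding_dim + 1]
```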



# Performance evaluation

Due to our simplifications, the results obtained are not directly comparable with those of a traditional fact checking method that considers the evidence set as a whole.

Thus, we need to consider two types of evaluations.

---

A. **Multi-input classification evaluation**

This type of evaluation is the easiest and concerns computing evaluation metrics, such as accuracy, F1-score, recall and precision, on our pre-processed dataset.

In other words, we assess the performance of chosen classifiers.

---

B. **Claim verification evaluation**

However, if we want to give an answer concerning the claim itself, we need to consider the whole evidence set. 

Intuitively, for a given claim, we consider all its corresponding (claim, evidence) pairs and their corresponding classification outputs. 

At this point, all we need to do is to compute the final predicted claim label via majority voting.

---

Example:

    Claim: c1
    Evidence set: e1, e2, e3
    True label: S

    Pair outputs:
    (c1, e1) -> S (supports)
    (c1, e2) -> S (supports)
    (c1, e3) -> R (refutes)

    Majority voting:
    S -> 2 votes
    R -> 1 vote

    Final label:
    c1 -> S

Lastly, we have to compute classification metrics just like before.

In short, implement both evaluation strategies for your classification metrics.
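
The majority voting step could be sketched as follows. The function name `majority_vote` and the assumption that each pair prediction comes with the id of its claim are made for this example. The assignment does not specify how to break ties: here `Counter.most_common` keeps the label encountered first, so you may want an explicit tie policy.

```python
from collections import Counter, defaultdict

def majority_vote(claim_ids, pair_predictions):
    """Map each claim id to the most frequent label among its pair predictions."""
    votes = defaultdict(list)
    for claim_id, label in zip(claim_ids, pair_predictions):
        votes[claim_id].append(label)
    # Ties are resolved in favour of the label seen first for that claim
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in votes.items()}

claim_ids = ["c1", "c1", "c1", "c2"]
pair_predictions = ["S", "S", "R", "R"]
claim_labels = majority_vote(claim_ids, pair_predictions)
print(claim_labels)  # {'c1': 'S', 'c2': 'R'}
```

The resulting claim-level labels can then be compared with the claim-level ground truth using the same metric functions as in the pair-level evaluation.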

# Tips and Extras

## 8.1 Extensions are welcome!

Is this task too easy for you? Are you curious to try out things you have seen during lectures (e.g. attention)? Feel free to try everything you want!

**Don't forget to try neural baselines first!**

## 8.2 Comments and documentation

Remember to properly comment your code (it is not necessary to comment each single line) and don't forget to describe your work!

## 8.3 Organization

We suggest dividing your work into sections. This helps you write clean, modular code that is easy to read and debug.

A possible schema:

*   Dataset pre-processing
*   Dataset conversion
*   Model definition
*   Training
*   Evaluation
*   Comments/Summary

# Evaluation

What are the criteria by which we'll evaluate you and your work?

1. Pre-processing: whether you have done some pre-processing or not.
2. Sentence embedding: you should implement all required strategies (with an example and working code for each). That is, we, as evaluators, should be able to test all strategies without writing down new code.
3. Multiple inputs merging strategies: you should implement all required strategies (with an example and working code for each).
4. Similarity extension: you should implement the cosine similarity extension (with an example and working code).
5. Voting strategy: you should implement the majority voting strategy and provide results.
6. Report: when submitting your notebook, you should also attach a small summary report that describes what you have done (provide motivations as well for arbitrary steps, for instance, "We've applied L2 regularization since the model was overfitting").

Extras (possible extra points):

1. Any well defined extension is welcome!
2. Well organized and commented code is as important as any other criterion.

# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Andrea Galassi -> a.galassi@unibo.it
* Federico Ruggeri -> federico.ruggeri6@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

*Note*: We highly recommend checking the [course useful material](https://virtuale.unibo.it/pluginfile.php/1036039/mod_resource/content/2/NLP_Course_Useful_Material.pdf) for additional information before contacting us!

# FAQ

---

**Question**: Can I do some text pre-processing?

**Answer:** You have to! If you check text data, the majority of sentences need some cleaning.

---

**Question**: The model architecture schema is not so clear, are we doing end-to-end training?

**Answer**: Exactly! All models can be thought of as:

1. Input
2. (word) Embedding
3. Sentence embedding
4. Multiple inputs merging
5. Classification

---

**Question**: Can I extend models by adding more layers?

**Answer**: Feel free to define model architectures as you wish, but remember to satisfy our requirements. This assignment should not be thought of as a competition to achieve the best performing model: students who show off with fancy models but miss the required assignment objectives will be punished!!

---

**Question**: I'm struggling with the implementation. Can you help me?

**Answer**: Yes, sure! Contact us and describe your issue. If you are looking for a particular type of operation, you can easily check the documentation of the deep learning framework you are using (Google is your friend).

---

**Question**: Can I try other encoding strategies or neural architectures?

**Answer:** Absolutely! Remember to try out recommended neural baselines first and only then proceed with your extensions.

---

**Question**: Do we have to test all possible sentence embedding and input merging combinations?

**Answer**: Absolutely not! Feel free to pick one sentence embedding strategy and try all possible input merging strategies with it! For instance, pick the best performing sentence embedding method and proceed with the next steps (extras included). Please note that you still have to implement all mentioned strategies!

---

**Question**: I'm hitting an out-of-memory error when training my models, do you have any suggestions?

**Answer**: Here are some common workarounds:

1. Try decreasing the mini-batch size
2. Try applying a different padding strategy (if you are applying padding): e.g. use quantiles instead of maximum sequence length
3. Check the efficiency of your custom code implementation (if any)
4. Try to define same length mini-batches to avoid padding (**It should not be necessary here!**)
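
As an illustration of workaround 2, the following sketch pads (and truncates) token sequences to a quantile of the observed lengths instead of the maximum. The function name `pad_to_quantile`, the 0 padding value and the quantile used are assumptions of this example; pick values that suit your data.

```python
import numpy as np

def pad_to_quantile(sequences, q=0.95, pad_value=0):
    """Pad/truncate token id lists to the q-th quantile of their lengths."""
    lengths = np.array([len(seq) for seq in sequences])
    max_len = int(np.ceil(np.quantile(lengths, q)))
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        trimmed = seq[:max_len]  # sequences longer than max_len are truncated
        padded[i, :len(trimmed)] = trimmed
    return padded

seqs = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10, 11, 12, 13, 14, 15, 16]]
padded = pad_to_quantile(seqs, q=0.75)
print(padded.shape)  # (4, 5) instead of (4, 10) with maximum-length padding
```

A single very long sentence no longer dictates the width of every batch, at the cost of truncating the longest sequences.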

---

**Question**: I'm hitting a CUDNN_STATUS_BAD_PARAM error! What am I doing wrong?

**Answer**: This error is a little bit tricky since the stack trace is not meaningful at all! This error occurs when the RNN is fed with a sequence of all 0s and pad masking is enabled (e.g. from the embedding layer). Please, check your conversion step, since there might be an error that leads to the encoding of a sentence to all 0s.
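
A quick way to look for such sentences is to search the encoded inputs for rows made only of padding. This is a minimal sketch, assuming padding is encoded as 0 and the input is a `[samples, max_tokens]` integer array; `all_padding_rows` is a hypothetical helper name.

```python
import numpy as np

def all_padding_rows(x, pad_value=0):
    """Indices of rows in a [samples, max_tokens] array that contain only padding."""
    return np.flatnonzero((x == pad_value).all(axis=1))

x = np.array([[4, 7, 0],
              [0, 0, 0],
              [1, 0, 0]])
print(all_padding_rows(x))  # [1]
```

Any index returned points to a sentence whose conversion step produced no tokens and is worth inspecting in the raw text.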

---