The notebook is supplementary to the paper [Semantics and Deep Learning](https://lingbuzz.net/lingbuzz/007736).  
It is assembled by [Lasha Abzianidze](mailto:lasha.abzianidze@gmail.com).

# Setup 🛠️

Preparing the environment for running the demo.

In [None]:
import importlib
import transformers # preinstalled in colab
print(f"transformers ver. = {transformers.__version__}")
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from nltk.metrics import ConfusionMatrix, scores
from tqdm import tqdm
from sklearn import metrics
import matplotlib.pyplot as plt

In [None]:
# !pip install sentencepiece

In [None]:
# cloning SemDL package which includes utility functions
!rm -fr SemDL # helps to rerun this cell without errors if recloning is needed
!git clone https://github.com/kovvalsky/SemDL.git

In [None]:
# importing utility functions from the SemDL package
import SemDL.reasoning
importlib.reload(SemDL.reasoning) # useful when updating the module files
from SemDL.reasoning import gen_syllogism, load_tok_model, predict_nli

# Loading models 📦

We will load a model from the 🤗[huggingface model](https://huggingface.co/models) hub. With the transformers library this is simple: one needs to provide a huggingface model hub name.  
We will load [Nie et al. (2020)](https://aclanthology.org/2020.acl-main.441/)'s natural language inference (NLI) model that is based on the *large* model of [RoBERTa](https://arxiv.org/abs/1907.11692) fine-tuned on four textual inference datasets: [SNLI](https://nlp.stanford.edu/projects/snli/), [MNLI](https://cims.nyu.edu/~sbowman/multinli/), [FEVER-NLI](https://huggingface.co/datasets/pietrolesci/nli_fever), and Adversarial NLI. The model card of the model can be found [here](https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli).  
Let's load the tokenizer and the inference model (we are not going to use a GPU, as the demo only runs inference and involves no model training).



In [None]:
model_name = 'ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli'
# the tokenizer model to preprocess the natural language input
anli_tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# the model that is responsible for textual inference prediction
anli_model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenization 🪓

Before predicting an inference label for a sentence pair, we need to tokenize input sentences.

In [None]:
s1, s2 = "A cat is napping on the mat.", "An animal is sleeping."

In [None]:
toks1 = anli_tokenizer(s1)
print(toks1)

`input_ids` holds the IDs of the tokens. Note that the tokenizer may split longer or relatively rare words into smaller pieces to avoid out-of-vocabulary cases.  
We can map the token IDs to tokens as follows:

In [None]:
[ anli_tokenizer.convert_ids_to_tokens(tok_id) for tok_id in toks1["input_ids"] ]

Here, we can see that the input is circumfixed by the sequence tags `<s>` and `</s>`. Many tokens start with an odd-looking symbol `Ġ`. It stands for a white space and has the code point 256 + 32: 32 is the code point of the white space, and adding 256 is a trick to consistently map invisible characters to visible ones. This design decision is shared, for example, by the RoBERTa and GPT-2 tokenizers.  
`A` and `apping` have no prefix as they were not preceded by a white space.  
Probably you also noticed that `napping` is chopped into `n` and `apping`. While this is not ideal, some relatively rare words get such unfair treatment in order to keep the number of tokens tractable and avoid out-of-vocabulary tokens.    
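
The space-marker arithmetic described above can be checked in plain Python. The snippet is only a sketch that does not call the tokenizer: `mark_spaces` is a hypothetical helper that imitates the marker convention and does no subword splitting.

```python
# Visible stand-in for a space: code point 32 (space) shifted by 256
SPACE_MARKER = chr(ord(" ") + 256)


def mark_spaces(text: str) -> str:
    """Hypothetical helper: replace every space with the visible marker.

    It only imitates the marker convention; unlike the real tokenizer,
    it does not split words into subword pieces.
    """
    return text.replace(" ", SPACE_MARKER)


print(SPACE_MARKER, ord(SPACE_MARKER))
print(mark_spaces("A cat is napping"))
```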

To explain the role of `attention_mask`, we need more than one input to the tokenizer. Usually, to make the processing fast, a batch of input is processed in parallel. We will consider here a batch of size 2.

In [None]:
# tokenization without padding by default
pair_toks = anli_tokenizer([s1, s2])
print(pair_toks)

# tokenization with padding
pair_toks_padded = anli_tokenizer([s1, s2], padding=True)
print(pair_toks_padded)

print(f"padding symbol is {anli_tokenizer.convert_ids_to_tokens(1)}")
print(f"contrasting attention_mask for the 2nd input\n{pair_toks['attention_mask'][1]}\n{pair_toks_padded['attention_mask'][1]}")

Since the calculations in deep learning are carried out with tensor operations, it is handy to represent a batch of tokenized input as a rectangular matrix. When we set `padding=True` for the tokenizer, all inputs in a batch are brought to the length of the longest input, and shorter inputs are padded with a special `<pad>` token with ID 1. That's why in the padded version the token IDs of the 2nd sentence are followed by 1s.  
`attention_mask` records for each input in the batch which tokens are relevant (marked with 1) and which are due to padding (marked with 0).
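
What padding does to a batch can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the tokenizer's implementation: `pad_batch` is a hypothetical helper and the token IDs in the toy batch are made up.

```python
from typing import Dict, List

PAD_ID = 1  # ID of `<pad>` for this tokenizer, as noted in the text above


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one and build the mask."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs, only for illustration
toy_batch = [[0, 250, 4758, 16, 2], [0, 660, 2]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

After padding, both rows have the same length, so the batch can be stored as a single matrix.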

Note that different models might tokenize differently. For example, the [BERT](https://aclanthology.org/N19-1423/)-base tokenizer splits `napping` into `nap` and `##ping`, where the `##` prefix marks a piece that continues a word.  
Let's see this with a BERT-base model. We will use the `.tokenize` method to directly obtain a string representation of the tokens.
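
Before looking at the real output, here is a toy sketch of how a WordPiece-style tokenizer can split a single word: it greedily takes the longest piece found in the vocabulary and marks word-internal pieces with `##`. The function `wordpiece` and the tiny vocabulary are assumptions made for illustration; the real BERT vocabulary has tens of thousands of entries, and the library's implementation handles more cases.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Made-up toy vocabulary, only for illustration
toy_vocab = {"nap", "##ping", "cat", "##s"}
print(wordpiece("napping", toy_vocab))
print(wordpiece("cats", toy_vocab))
print(wordpiece("dog", toy_vocab))
```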

In [None]:
# note that 'bert-base-uncased' is not fine-tuned on an NLI dataset,
# however, the way the tokenizer chops the input remains the same with or without fine-tuning
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)

In [None]:
bert_tokenizer.tokenize(s1)

Different language models might also insert different _invisible_ tokens. For example, BERT uses the `[CLS]` token to model an entire sequence with a single vector. `CLS` stands for _classification_ (not for _clause_, as linguists might think :)). `[SEP]` is a sequence separator.

In [None]:
[ bert_tokenizer.convert_ids_to_tokens(tok_id) for tok_id in bert_tokenizer(s1)["input_ids"] ]

When classifying two sequences, like NLI and QA tasks require, the class-prediction model takes as input `[CLS]sequence_1[SEP]sequence_2[SEP]`, where sequence_N is a tokenized sequence. That's why two sequences should be fed as two arguments to the corresponding tokenizer.   
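
The pair layout described above can be sketched by hand. `build_pair_input` is a hypothetical helper, not part of the transformers library; besides the tokens it also returns BERT-style segment IDs (the BERT tokenizer reports these as `token_type_ids`), which tell the model which sequence each token belongs to.

```python
from typing import List, Tuple


def build_pair_input(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[str], List[int]]:
    """Assemble `[CLS] A [SEP] B [SEP]` plus BERT-style segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# Made-up, already tokenized sequences
tokens, segment_ids = build_pair_input(["a", "cat", "naps"], ["an", "animal", "sleeps"])
print(tokens)
print(segment_ids)
```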

In [None]:
[ bert_tokenizer.convert_ids_to_tokens(tok_id) for tok_id in bert_tokenizer(s1, s2)["input_ids"] ]

# Inference prediction 🔬

## Single problem

Let's consider a toy inference problem (whose correct label is entailment) with the following premise and hypothesis:

In [None]:
p, h = "A cat is napping on the mat.", "An animal is sleeping."

In [None]:
# we tokenize the pair together as two sequences
# we ask for PyTorch tensors as output so that it can be fed directly to the prediction model
tokenized_pair = anli_tokenizer(p, h, return_tensors="pt")

The output of the tokenizer can be used like a dictionary (but it is not a `dict` type!), hence it can be fed to the prediction model as a set of parameter-value pairs. We use `**` to unpack dict-like objects into parameter-value pairs.  
A good thing is that the tokenizer _knows_ what input the corresponding prediction model needs and returns an output that can be directly given to the prediction model.
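
To illustrate why `**` works on an object that is not a `dict`, here is a minimal sketch: Python only needs a `keys()` method and item access to unpack an object into keyword arguments. `ToyEncoding` and `toy_model` are hypothetical stand-ins, not the classes used by the transformers library.

```python
class ToyEncoding:
    """Hypothetical stand-in for a tokenizer output: dict-like, not a dict."""

    def __init__(self, data):
        self._data = dict(data)

    def keys(self):
        return self._data.keys()

    def __getitem__(self, key):
        return self._data[key]


def toy_model(input_ids, attention_mask):
    """Hypothetical model: just reports what it received."""
    return f"{len(input_ids)} ids, {sum(attention_mask)} attended"


enc = ToyEncoding({"input_ids": [0, 250, 2], "attention_mask": [1, 1, 1]})
print(isinstance(enc, dict))  # the object is not a dict
print(toy_model(**enc))       # yet it unpacks into keyword arguments
```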

In [None]:
print(anli_model(**tokenized_pair))
print(f"Mapping positions/indices to inference classes/labels {anli_model.config.id2label}")

The prediction model returns logits over the possible inference classes. The probability distribution over the classes can be obtained by applying softmax to the logits. The correspondence between the logit positions and the classes can be obtained from the configuration of the prediction model.  
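
The logits-to-label step can be sketched without any deep learning library. The logits and the label mapping below are made up for illustration; for the real model the mapping should be read from `anli_model.config.id2label`.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits and an assumed label mapping, for illustration only
toy_logits = [3.2, 0.1, -2.5]
toy_id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}

probs = softmax(toy_logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
print([round(p, 3) for p in probs])
print(toy_id2label[best])
```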

All these steps are executed under the hood of our wrapper function that returns a dictionary containing the info about the predicted label distribution and the most probable label.  
🎉 The model correctly predicts the entailment label for our toy inference problem.

In [None]:
prediction = predict_nli(anli_tokenizer, anli_model, (p, h))
print(f"probability distribution = {prediction['probs']}")
print(f"predicted label = {prediction['label']}")

Let's try a different model, [BART](https://aclanthology.org/2020.acl-main.703/) fine-tuned only on MNLI, on the same inference problem.

In [None]:
model_name = 'facebook/bart-large-mnli'
tokenizer, model = load_tok_model(model_name)
prediction = predict_nli(tokenizer, model, (p, h))
print(prediction['probs'])

The prediction is again entailment with almost perfect probability.

In [None]:
predict_nli(tokenizer, model, ("John is sleeping", "John is sleeping"))

## Classifying syllogisms
<div>
<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*9rpYBtSjreRD_NBlzE5nuA.png" width="64"/>
</div>

[Aristotle's syllogisms](https://en.wikipedia.org/wiki/Syllogism) can be regarded as the oldest textual inference problems.

The provided `gen_syllogism` function generates all syllogisms.
Read the above-cited link to better understand the structure of syllogisms (e.g., categorization of syllogisms based on the figure value).
We can also inject desired concepts in the generated syllogisms.
In this example we will use nouns for professions/expertise.  
Note that the generated syllogisms are labeled with three inference labels.
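
To make the structure of syllogisms concrete, the sketch below enumerates all forms: each of the two premises and the conclusion takes one of four statement types (A, E, I, O), and there are four figures, giving 4 × 4 × 4 × 4 = 256 forms. This is not how `gen_syllogism` is implemented, and it assigns no inference labels; the function name, the sentence templates, and the assignment of the nouns to the roles S, M, and P are assumptions made for illustration.

```python
from itertools import product

# Sentence templates for the four categorical statement types
TEMPLATES = {
    "A": "All {} are {}",
    "E": "No {} are {}",
    "I": "Some {} are {}",
    "O": "Some {} are not {}",
}

# Term order in (major premise, minor premise) for each figure;
# the conclusion is always about (S, P).
FIGURES = {
    1: (("M", "P"), ("S", "M")),
    2: (("P", "M"), ("S", "M")),
    3: (("M", "P"), ("M", "S")),
    4: (("P", "M"), ("M", "S")),
}


def enumerate_syllogisms(s, m, p):
    """Yield (name, major premise, minor premise, conclusion) for all forms."""
    terms = {"S": s, "M": m, "P": p}

    def fill(stype, pair):
        return TEMPLATES[stype].format(terms[pair[0]], terms[pair[1]])

    for t1, t2, t3 in product("AEIO", repeat=3):
        for fig, (major, minor) in FIGURES.items():
            yield (f"{t1}{t2}{t3}-{fig}", fill(t1, major), fill(t2, minor), fill(t3, ("S", "P")))


forms = list(enumerate_syllogisms("logicians", "linguists", "engineers"))
print(len(forms))
print(forms[0])
```

The first form, `AAA-1`, is the classic Barbara syllogism.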




In [None]:
# generating a couple of syllogisms
for name_label, (p1, p2, c) in gen_syllogism('logicians', 'linguists', 'engineers', figures="1"):
    if "neutral" not in name_label:
        print(f"{name_label}\n{p1}\n{p2}\n{'':->30}\n{c}\n")

Let's generate all 256 syllogistic inference problems and classify them with the [ANLI](https://huggingface.co/ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli) model.

In [None]:
pred_ref = dict() # keeps gold and predicted labels
# looping over the syllogistic problems and classifying them
for name_label, (p1, p2, c) in tqdm(gen_syllogism('logicians', 'linguists', 'engineers')):
    name, ref_label = name_label.rsplit('-', 1)
    pred = predict_nli(anli_tokenizer, anli_model, (f"{p1}. {p2}.", f"{c}."))
    pred_ref[name] = pred['label'], ref_label

In [None]:
# Draw confusion matrix
preds, refs = zip(*pred_ref.values())
# fix the label order explicitly so that rows/columns always match the display labels
labels = ['contradiction', 'entailment', 'neutral']
cm = metrics.confusion_matrix(refs, preds, labels=labels)
draw_cm = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
draw_cm.plot()
plt.show()

In [None]:
# calculate accuracy
acc = metrics.accuracy_score(refs, preds)
print(f"Accuracy = {acc}")

As we can see, the model performs poorly on the syllogisms. It is known that inference systems based on large language models are not good at logical reasoning, and our results are in line with this.
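
To see which label confusions lie behind an accuracy score, the (prediction, reference) pairs can also be tallied by hand. The sketch below uses made-up pairs in the same (prediction, reference) order as the values of `pred_ref`; `tally` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter


def tally(pairs):
    """Count (reference, prediction) combinations and compute accuracy."""
    confusion = Counter((ref, pred) for pred, ref in pairs)
    correct = sum(n for (ref, pred), n in confusion.items() if ref == pred)
    return confusion, correct / sum(confusion.values())


# Made-up (prediction, reference) pairs, only for illustration
toy_pairs = [
    ("entailment", "entailment"),
    ("contradiction", "entailment"),
    ("neutral", "neutral"),
    ("entailment", "neutral"),
]
confusion, accuracy = tally(toy_pairs)
# number of entailment problems that were predicted as contradiction
print(confusion[("entailment", "contradiction")])
print(accuracy)
```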

For analysis, below we print the interesting cases of syllogisms: problems where the model confuses entailment with contradiction, in either direction.  

In [None]:
# Analysis
for name_label, (p1, p2, c) in gen_syllogism('logicians', 'linguists', 'engineers'):
    name, ref_label = name_label.rsplit('-', 1)
    pred, ref = pred_ref[name]
    if pred != ref and {ref, pred} == {"entailment", "contradiction"}:
        print(f"{name}\t{ref.upper()}\t{pred}\n{p1}\n{p2}\n{'':->30}\n{c}\n")

In practice, neural models are run on GPUs and on batched input. In this way, prediction and training are a lot faster than on a CPU.
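
As a small sketch of the batching step, the helper below splits a list of premise-hypothesis pairs into fixed-size batches. `batched` is a hypothetical helper and the pairs are made up; each batch could then be tokenized with `padding=True` and passed to the model in one call.

```python
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def batched(pairs: Sequence[Pair], batch_size: int) -> Iterator[List[Pair]]:
    """Yield consecutive slices of at most `batch_size` pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(pairs), batch_size):
        yield list(pairs[start:start + batch_size])


# Made-up problems, only for illustration
toy_problems = [(f"premise {i}", f"hypothesis {i}") for i in range(5)]
batches = list(batched(toy_problems, batch_size=2))
print([len(b) for b in batches])
```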