<a href="https://colab.research.google.com/github/koad7/NLP_PYTORCH/blob/main/Semantic_claim_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we will build a semantic search engine for fact-checked claims using BERT. The overall approach will be to:
1. Create a fine-tuned version of BERT that is able to produce claim embeddings in such a way that semantically similar claims are close to each other in the embedding space.
2. Use the BERT claim encoder to create an index for a dataset of fact-checked claims and use it to find claims.


**Remember to use a runtime with GPU**

# Train a Semantic Claim Encoder
As explained above, we want to train a deep learning model that is capable of:
* given a claim $c$, produce an embedding for that claim $v_c$ in such a way that:
 * if $c_1$ and $c_2$ are semantically similar (e.g. they are paraphrases of each other), then $f_{dist}(v_{c_1}, v_{c_2}) \approx 0$ for some distance function $f_{dist}$.
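To make the requirement above concrete, here is a minimal sketch that uses cosine distance as one possible choice for $f_{dist}$. The three-dimensional vectors are hypothetical stand-ins for real claim embeddings, and `cosine_distance` is an illustrative helper, not part of the notebook's code.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 3-dimensional "claim embeddings" (BERT-base embeddings have 768 dimensions)
v_c1 = [0.9, 0.1, 0.3]
v_c2 = [0.8, 0.2, 0.3]    # imagined paraphrase of c1
v_c3 = [-0.2, 0.9, -0.4]  # imagined unrelated claim

print(cosine_distance(v_c1, v_c2))  # close to 0 for similar claims
print(cosine_distance(v_c1, v_c3))  # much larger for unrelated claims
```

A well-trained encoder should behave like the first pair for paraphrases and like the second pair for unrelated claims.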




## Training Dataset: STS-B
Fortunately, the [SemEval](https://aclweb.org/aclwiki/SemEval_Portal) series of workshops/challenges has produced many tasks that aim to test exactly this kind of *semantic similarity*.

[Several of these SemEval task datasets have been bundled together](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark) into what is known as **STS-B: the Semantic Textual Similarity Benchmark**, which is part of the [GLUE collection of NLP benchmark datasets](https://gluebenchmark.com/tasks).

STS-B consists of pairs of sentences which have been manually rated on a scale between $0$ (no semantic similarity) and $5$ (semantically equivalent).

We can download and load the dataset into a pandas `DataFrame`:

In [None]:
!wget http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
!tar -xzf Stsbenchmark.tar.gz

--2019-10-06 17:13:42--  http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz
Resolving ixa2.si.ehu.es (ixa2.si.ehu.es)... 158.227.106.100
Connecting to ixa2.si.ehu.es (ixa2.si.ehu.es)|158.227.106.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 409630 (400K) [application/x-gzip]
Saving to: ‘Stsbenchmark.tar.gz’


2019-10-06 17:13:43 (606 KB/s) - ‘Stsbenchmark.tar.gz’ saved [409630/409630]



Unfortunately, we cannot use the standard pandas `read_csv` method, because some lines in the csv have additional, poorly documented fields that cause the pandas parser to fail. Instead, we implement our own reader, which keeps only the first 7 tab-separated columns of each line:

In [None]:
import pandas as pd
def read_sts_csv(path, columns=['source', 'type', 'year', 'id', 'score', 'sent_a', 'sent_b']):
  rows = []
  with open(path, mode='r', encoding='utf-8') as f:
    lines = f.readlines()
    print('Reading', len(lines), 'lines from', path)
    for lnr, line in enumerate(lines):
      cols = line.split('\t')
      assert len(cols) >= 7, 'line %s has %s columns instead of %s:\n\t%s' % (
          lnr, len(cols), 7, "\n\t".join(cols)
      )
      cols = cols[:7]
      assert len(cols) == 7
      rows.append(cols)
  result = pd.DataFrame(rows, columns=columns)
  # score is read as a string, so add a copy with correct type
  result['score_f'] = result['score'].astype('float64')
  return result

In [None]:
sts_dev_df = read_sts_csv('stsbenchmark/sts-dev.csv')
sts_train_df = read_sts_csv('stsbenchmark/sts-train.csv')

Reading 1500 lines from stsbenchmark/sts-dev.csv
Reading 5749 lines from stsbenchmark/sts-train.csv


You can explore the dataset by looking at a small sample:

In [None]:
sts_train_df.sample(n=5)

| | source | type | year | id | score | sent_a | sent_b | score_f |
|---|---|---|---|---|---|---|---|---|
| 3946 | main-news | headlines | 2013 | 243 | 2.4 | 2 Traffic Accidents Leave 47 Dead in China | 3 traffic accidents leave 56 dead in China\n | 2.4 |
| 4836 | main-news | headlines | 2014 | 606 | 5.0 | China's new PM rejects US hacking claims | China Premier Li rejects 'groundless' US hacki... | 5.0 |
| 4794 | main-news | headlines | 2014 | 555 | 2.6 | West hails Syria opposition vote to join peace... | Syrian opposition to name delegation for talks\n | 2.6 |
| 3281 | main-news | MSRpar | 2012train | 508 | 4.333 | "PNC regrets its involvement" in the deals, Ch... | James Rohr, chairman and chief executive offic... | 4.333 |
| 5534 | main-news | headlines | 2015 | 1460 | 1.8 | FAA continues ban on US flights to Tel Aviv | FAA lifts ban on U.S. flights to Tel Aviv\n | 1.8 |


## Load the BERT model
We will use BERT as a starting point, since at the time of writing it is the state of the art among deep learning architectures for NLP tasks and is representative of Transformer-based models. The advantage of using BERT is that it has already been pre-trained on a large corpus, so we only need to *fine-tune* it on the STS-B dataset.

We will use the [Hugging Face Pytorch-Transformers](https://github.com/huggingface/pytorch-transformers) library as an interface to the BERT model. We can install it in our environment as follows:

In [None]:
!pip install pytorch-transformers

Collecting pytorch-transformers
  Downloading https://files.pythonhosted.org/packages/a3/b7/d3d18008a67e0b968d1ab93ad444fc05699403fa662f634b2f2c318a508b/pytorch_transformers-1.2.0-py3-none-any.whl (176kB)
     |████████████████████████████████| 184kB 6.3MB/s
Collecting sacremoses (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
     |████████████████████████████████| 860kB 44.5MB/s
Collecting sentencepiece (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
     |████████████████████████████████| 1.0MB 39.1MB/s
Collecting regex (from pytorch-transformers)
  Downloading https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd97926812406

Now we can import various libraries:

In [None]:
import torch
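# import only the names this notebook uses instead of a wildcard import
from pytorch_transformers import BertTokenizer, BertModel, AdamW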
import torch.nn.functional as F

And we can load BERT, which consists of two main parts:
1. the **model** itself, which:
 * receives as input a sequence of *token ids* according to a vocabulary defined during pre-training,
 * has an initial embedding layer that combines non-contextual and positional embeddings, and
 * has $n$ Transformer layers (each mapping a sequence to a sequence), which produce increasingly contextual embeddings for the input tokens.
2. a **tokenizer** that converts the input sentence into a sequence of *token ids*
 * BERT (and other Transformer-based architectures) usually tokenize the input sentence based on wordpieces or subword units. See the [sentencepiece](https://github.com/google/sentencepiece) repo for more information about variants.
 * as part of the tokenization, BERT (and other models) adds special tokens that help the model understand where sentences begin and end; useful during training.
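The subword splitting described above can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration only: `wordpiece_tokenize` and `toy_vocab` are hypothetical, and the real BERT tokenizer also performs steps that are omitted here, such as splitting off punctuation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword units."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary; the real bert-base-cased vocabulary is far larger
toy_vocab = {"Here", "is", "some", "text", "to", "en", "##code"}
tokens = []
for word in "Here is some text to encode".split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
print(tokens)
```

With this toy vocabulary, `encode` is split into `en` and `##code` because the full word is not in the vocabulary but both pieces are.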

BERT has two main variants: `base` (12 layers) and `large` (24 layers). In this notebook we will use the `bert-base-cased` variant, but feel free to [explore alternative pre-trained models](https://huggingface.co/transformers/pretrained_models.html).

We load the tokenizer and the model as follows:

In [None]:
bert_model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=False)
bert = BertModel.from_pretrained(bert_model_name, output_hidden_states=True)
if torch.cuda.is_available():
  bert = bert.cuda()

### Implement function to produce sentence encodings based on the model

Now that we have the bert tokenizer and model, we can pass it a sentence, but we need to define which output of BERT we want to use as the sentence embedding. We have several options:
 * input sequences are prepended with a special token `[CLS]`, whose output is meant to be used for classification of the whole sequence.
 * we can combine the final layer of contextual embeddings, e.g. by concatenating or pooling them (take the sum or average).
 * we can combine any combination of layers (e.g. the final 4 layers).

Also, since the model and tokenizer need to be used together, we define a `tok_model` dict that we can pass to the function. We'll split the implementation into the following methods:
1. `pad_encode` creates token ids (and an attention mask) of a uniform sequence length for a given sentence
2. `tokenize_batch` tokenizes a batch of sentences and produces tensors that can be fed to the model
3. `embedding_from_bert_output` produces a sentence embedding from the outputs of a BERT model, based on some encoding strategy
4. `calc_sent_emb` receives a list of sentences and produces a tensor of sentence embeddings. It orchestrates the other methods.

In [None]:
def pad_encode(text, tokenizer, max_length=50):
  """creates token ids of a uniform sequence length for a given sentence"""
  tok_ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
  tok_ids2 = tokenizer.add_special_tokens_single_sentence(tok_ids)
  att_mask = [1 for _ in tok_ids2]
  n_spectoks = len(tok_ids2) - len(tok_ids)
  if len(tok_ids2) > max_length: # need to truncate
    #print('Truncating from', len(tok_ids2))
    n_to_trunc = len(tok_ids2) - max_length
    tok_ids2 = tokenizer.add_special_tokens_single_sentence(tok_ids[:-n_to_trunc])
    att_mask = [1 for _ in tok_ids2]
  elif len(tok_ids2) < max_length: # need to pad
    padding = []
    for i in range(len(tok_ids2), max_length):
      padding.append(tokenizer.pad_token_id)
    att_mask += [0 for _ in padding]
    tok_ids2 = tok_ids2 + padding
  assert len(tok_ids2) == max_length
  assert len(att_mask) == max_length
  return tok_ids2, att_mask

def tokenize_batch(sentences, tok_model, max_len=50, debug=False):
  assert type(sentences) == list
  # encode each sentence once, then split the (token ids, attention mask) pairs
  pairs = [pad_encode(s, tokenizer=tok_model['tokenizer'],
                      max_length=max_len) for s in sentences]
  input_ids = torch.tensor([ids for ids, _ in pairs])
  att_masks = torch.tensor([mask for _, mask in pairs])
  if debug: print(input_ids.shape)

  if torch.cuda.is_available():
    input_ids = input_ids.cuda()
    att_masks = att_masks.cuda()
  return input_ids, att_masks

def embedding_from_bert_output(bert_output, strategy="pooled", debug=False):
  """Given the output tuple from a BERT model, return embeddings for the batch.
  :param strategy can be:
    1. a tuple ("reduce_mean_layer", n): mean over the sequence dimension of
      hidden layer n (padding positions are included in the mean)
    2. a tuple ("layer", n): the full token-level output of hidden layer n
    3. "pooled": the default pooled embedding for the model. E.g. for BERT,
      this is derived from the last output for token [CLS]
  :param debug if True, print the shapes involved in the pooling
  """
  assert len(bert_output) == 3, "Expecting 3 outputs, make sure model outputs hidden states"
  last_layer, pooled, hidden_layers = bert_output
  if strategy == "pooled":
    return pooled
  if not isinstance(strategy, tuple):
    raise ValueError("Expecting a tuple, but found %s " % (type(strategy)))
  strat_name, strat_val = strategy
  if strat_name == "reduce_mean_layer":
    layer_index = strat_val
    layer_to_pool = hidden_layers[layer_index]
    pooled_layer = torch.sum(layer_to_pool, dim=1) / (layer_to_pool.shape[1] + 1e-10)
    if debug: print('pooled layer %s of %s' % (layer_index, len(hidden_layers)),
                    pooled_layer.shape,
                    'pooled from', layer_to_pool.shape)
    return pooled_layer
  if strat_name == "layer":
    layer_index = strat_val
    return hidden_layers[layer_index]
  # wrap in a 1-tuple so that a tuple-valued strategy formats correctly
  raise ValueError("Unsupported strategy %s " % (strategy,))

def calc_sent_emb(sentences, tok_model, strategy="pooled", seq_len=50, debug=False):
  """Returns the embeddings for the input sentences, based on the `tok_model`
  :param tok_model dict with keys `tokenizer` and `model`
  :param strategy see `embedding_from_bert_output`
  """
  input_ids, att_masks = tokenize_batch(sentences, tok_model, debug=debug, max_len=seq_len)

  model = tok_model['model']
  model.eval() # needed to deactivate any Dropout layers

  with torch.no_grad():
    model_out = model(input_ids, attention_mask=att_masks)

  return embedding_from_bert_output(model_out, strategy)


#### Play around with the model and encoder
If you want, before starting to train the model, now's a good time to explore the model.

In [None]:
bert_tok_model = {'tokenizer': tokenizer, "model": bert}

For example, see what the tokenizer does to an input text:

In [None]:
tokenizer.tokenize("Here is some text to encode")

['Here', 'is', 'some', 'text', 'to', 'en', '##code']

In [None]:
pad_encode("Here is some text to encode", tokenizer, max_length=12)[0]  # [0] selects the token ids; [1] is the attention mask

[101, 3446, 1110, 1199, 3087, 1106, 4035, 13775, 102, 0, 0, 0]

In [None]:
calc_sent_emb(['Here is some text to encode', 'Here is another text'], tok_model=bert_tok_model).shape

torch.Size([2, 768])

### Define Pytorch Encoder Module for fine-tuning
The pre-trained BERT is optimized to predict masked tokens or the next sentence in a pair of sentences. This means that we cannot expect the pre-trained BERT to perform well in our task of semantic similarity. Therefore, we need to fine-tune the model.

In pytorch, we can do this by defining a pytorch `Module` as follows:

In [None]:
class BERT_Finetuned_Encoder(torch.nn.Module):
  def __init__(self,
               bert_model_name='bert-base-cased',
               pooling_strategy="pooled",
               train_from_layer=6,
               seq_len=50):
    super(BERT_Finetuned_Encoder, self).__init__()
    tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=False)
    bert_model=BertModel.from_pretrained(bert_model_name, output_hidden_states=True)
    if train_from_layer is not None:
      assert type(train_from_layer) == int
      assert train_from_layer >= 0 and train_from_layer <= len(bert_model.encoder.layer)
      print("Freezing wordpiece embeddings")
      for param in bert_model.embeddings.parameters():
        param.requires_grad = False
      for i, layer in enumerate(bert_model.encoder.layer):
        if i < train_from_layer:
          print("Freezing layer", i)
          for param in layer.parameters():
            param.requires_grad = False
        else:
          print("Trainable layer", i)
      print("Trainable pooling layer") # pooler layer is always trained
    self.tokenizer = tokenizer
    self.bert_model = bert_model
    self.pooling_strategy = pooling_strategy
    self.seq_len = seq_len

    # power func parameters
    self.min_val = 0.8
    self.k = 1.0


  def forward(self, sentences, sents_to_compare=None):
    assert type(sentences) == list
    if sents_to_compare is not None:
      return self.predict_similarity(sentences, sents_to_compare)
    else:
      return self.encode(sentences)

  def predict_encoded_similarity(self, semembs_as, semembs_bs):
    cosim = F.cosine_similarity(semembs_as, semembs_bs) # shape: (batch_size,)
    # make prediction a value between 0.0 and 1.0
    return self.power_fun_cosim2predfn(cosim)

  def predict_similarity(self, sentsA, sentsB):
    """Predict pairwise similarity between two lists of sentences
    Predicted values range from 0 (no similarity) to 1 (semantically equal)
    """
    assert type(sentsB) == list
    assert len(sentsB) == len(sentsA)
    #print('semembs_as', type(semembs_as))
    return self.predict_encoded_similarity(
        self.encode(sentsA), self.encode(sentsB))

  def power_fun_cosim2predfn(self, cosim, min_val=0.8, k=25, steps=100):
    """Converts a cosine similarity result onto a value in range [0.0, 1.0] using
    a non-linear mapping. This is useful because cosine similarities between
    vectors in embedding spaces are usually skewed towards a specific value."""
    assert min_val < 1.0
    cosim_step = (1.0-min_val)/steps
    val = torch.clamp(cosim, min=min_val, max=1.0)
    step_i = (val - min_val)/cosim_step
    pred = (step_i/steps)**k
    assert len(pred.shape) == 1, pred.shape # (batch_size)
    return torch.clamp(pred, min=0.0, max=1.0)

  def linear_cosim2predfn(self, cosim):
    """Alternative mapping from a cosim tensor to a prediction range
    Use `power_fun_cosim2predfn` instead since it better aligns with the
    distribution of cosine similarities.
    """
    return (cosim + 1.0) / 2.0 # make prediction a value between 0.0 and 1.0

  def encode(self, sentences):
    # essentially the same as calc_sent_emb, but without explicitly setting model
    #  for evaluation (since we can be in training mode)
    input_ids, att_masks = tokenize_batch(sentences, {"tokenizer": self.tokenizer,
         "model": self.bert_model}, max_len=self.seq_len)
    model_out = self.bert_model(input_ids, attention_mask=att_masks)
    return embedding_from_bert_output(model_out, self.pooling_strategy)
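To get a feel for the non-linear mapping in `power_fun_cosim2predfn`, note that the `steps` parameter cancels out algebraically, so the prediction reduces to $\left(\frac{\mathrm{clamp}(cos) - min\_val}{1 - min\_val}\right)^k$. The following scalar sketch uses the same default values (`min_val=0.8`, `k=25`); the function name `cosim_to_pred` is introduced here only for illustration.

```python
def cosim_to_pred(cosim, min_val=0.8, k=25):
    """Scalar version of the power mapping: clamp, rescale to [0, 1], raise to k."""
    val = min(max(cosim, min_val), 1.0)
    return ((val - min_val) / (1.0 - min_val)) ** k

# Cosine similarities below min_val map to 0; only values very close to 1
# produce predictions near 1
for c in (0.5, 0.9, 0.99, 1.0):
    print(c, cosim_to_pred(c))
```

The high exponent means that only cosine similarities very close to $1$ yield a high predicted similarity, which matches the observation that raw cosine similarities in the embedding space are skewed towards high values.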

### Define training method

We are now ready to define the main training loop. This is a pretty standard loop for PyTorch. The main thing here is that we:
 * iterate over batches of the STS-B dataset and produce encodings for both sentences.
 * then we calculate the cosine similarity between the two encodings and map that onto a predicted similarity score in a range between 0 and 1.
 * we use the STS-B value (normalised to the same range) to define a loss and train the model
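The loss and the evaluation metric used in this loop can be sketched without PyTorch. The block below is a plain-Python approximation of a sum-reduced smooth L1 loss (with the usual threshold of 1) and of the Pearson correlation. The gold scores are taken from the sampled rows shown earlier, while the predictions are hypothetical.

```python
import math

def smooth_l1_sum(preds, targets, beta=1.0):
    """Sum-reduced smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(preds, targets):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

sts_scores = [2.4, 5.0, 2.6, 4.333, 1.8]  # gold scores on the 0-5 scale
labels = [s / 5.0 for s in sts_scores]    # normalised to [0, 1], as in the loop
preds = [0.40, 0.90, 0.55, 0.80, 0.30]    # hypothetical model predictions

loss = smooth_l1_sum(preds, labels)
r = pearson_r(sts_scores, preds)
print(loss, r)
```

Because Pearson correlation is invariant to linear rescaling, it does not matter whether it is computed against the raw $0$ to $5$ scores or the normalised labels.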

In [None]:
import time
import copy
from scipy import stats

def train_semantic_encoder(semantic_encoder,
                           dataloaders,
                           optimizer, criterion, scheduler, num_epochs=25,
                           device="cuda"):
  """ Trains a semantic encoder model
  :param semantic_encoder maps a list of sentences onto a semantic embedding
    space; when called with two lists of sentences it returns similarity
    predictions in the range [0.0, 1.0]
  :param dataloaders a dict with keys `train` and `val`, the values must be PyTorch
    DataLoader instances providing STS-B item batches
  :param criterion loss between predictions and labels scaled to [0.0, 1.0]
  :param scheduler optional, stepped with the loss of each `val` epoch
  """
  since = time.time()

  assert getattr(semantic_encoder, 'state_dict', None) is not None, "No model to train!!"

  def run_epoch(phase):
    """Execute a single epoch through the datasets.
    :param phase can be `train` or `val`
    returns a result dict with `loss` and `pearson`
    """

    def run_step(sts_itembatch):
      """Execute a step in this epoch, ie process a batch.
      Returns a triple with the batch (loss float, labels floats, predictions floats)
      """
      #print('sts_itembatch', type(sts_itembatch))
      sent_as = [item['sent_a'][0] for item in sts_itembatch]
      sent_bs = [item['sent_b'][0] for item in sts_itembatch]
      assert type(sent_as[0]) == str
      label_scores = torch.tensor([float(item['score'][0]) for item in sts_itembatch])

      label_scores = label_scores.to(device)
      optimizer.zero_grad()

      with torch.set_grad_enabled(phase == 'train'):
        pred_score = semantic_encoder(sent_as, sent_bs)
        loss = criterion(pred_score, label_scores/5.0) # make label between 0.0 and 1.0

        if phase == 'train':
          loss.backward()
          optimizer.step()
      return loss.item(), label_scores.tolist(), pred_score.tolist()

    # run epoch:
    if phase == 'train':
      semantic_encoder.train()  # Set model to training mode
    else:
      semantic_encoder.eval()   # Set model to evaluate mode (important for Dropout layers)

    running_loss, _label_scores, _pred_scores = 0.0, [], []
    for sts_itembatch in dataloaders[phase]: # Iterate over data in epoch
      batch_loss, batch_labels, batch_preds = run_step(sts_itembatch)
      running_loss += batch_loss # * len(sts_itembatch) # update state
      _label_scores += batch_labels
      _pred_scores += batch_preds

    if phase == 'val' and scheduler is not None:
      scheduler.step(running_loss) #

    epoch_loss = running_loss / len(dataloaders[phase])
    assert len(_label_scores) == len(_pred_scores), "%s %s" % (len(_label_scores), len(_pred_scores))
    epoch_correl, p_val = stats.pearsonr(_label_scores, _pred_scores)
    print('{} Loss: {:.4f}, Pearson: r={:.4f} p={:.4f} n={}'.format(
        phase, epoch_loss, epoch_correl, p_val, len(_label_scores)))
    return {"loss": epoch_loss,
            "pearson": {"r": epoch_correl,
                        "p": p_val,
                        "n": len(_label_scores)}} # run_epoch

  def is_better_result(current_best, new_val):
    return new_val['pearson']['r'] > current_best['pearson']['r']

  best_weights = copy.deepcopy(semantic_encoder.state_dict())
  print('Validating initial model')
  best_val = run_epoch('val') # run a validation epoch before the actual training

  for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
      epoch_result = run_epoch(phase)
      if phase == 'val' and is_better_result(best_val, epoch_result):
        best_val = epoch_result  # store state of best model
        best_weights = copy.deepcopy(semantic_encoder.state_dict())
    print()

  time_elapsed = time.time() - since
  print('Training complete in {:.0f}m {:.0f}s'.format(
      time_elapsed // 60, time_elapsed % 60))
  print('Best loss: {:4f} correl: {:.4f}'.format(best_val['loss'],
                                                 best_val['pearson']['r']))

  # load best model weights
  semantic_encoder.load_state_dict(best_weights)
  return semantic_encoder

The `train_semantic_encoder` method expects the data to be provided via PyTorch's [Dataset](https://pytorch.org/docs/stable/data.html?highlight=dataset#torch.utils.data.Dataset) and [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) mechanisms, so we need to wrap our STS train and dev sets (at the moment pandas `DataFrame`s) into a `Dataset` class:

In [None]:
import torch.utils.data
import math

class STSDataset(torch.utils.data.Dataset):
  def __init__(self, sts_df, batch_size=20):
    super().__init__()
    self.sts_df = sts_df
    self.batch_size = batch_size

  def __len__(self):
    n_sents = self.sts_df.shape[0]
    n_batch = n_sents/self.batch_size
    result = math.ceil(n_batch)
    return result

  def __getitem__(self, index):
    begin, end = index*self.batch_size, (index+1)*self.batch_size
    values = self.sts_df[begin:end].values
    result = []
    for row in values:
      result.append({col: row[i] for i, col in enumerate(self.sts_df.columns.values)})
    return result

  def __iter__(self):
    raise NotImplementedError()
    #return self.sts_df.iterrows()
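Note that each item of `STSDataset` is already a whole batch of rows. The slicing arithmetic can be checked with a small stand-alone sketch; `batch_bounds` is a hypothetical helper, and the row counts are the ones reported when the files were read.

```python
import math

def batch_bounds(n_rows, batch_size):
    """Yield the (begin, end) row slices used by STSDataset-style batching."""
    n_batches = math.ceil(n_rows / batch_size)
    for index in range(n_batches):
        begin = index * batch_size
        end = min((index + 1) * batch_size, n_rows)  # the last batch may be shorter
        yield begin, end

train_batches = list(batch_bounds(5749, 64))
dev_batches = list(batch_bounds(1500, 64))
print(len(train_batches), train_batches[-1])
print(len(dev_batches), dev_batches[-1])
```

Because the batching happens inside the dataset, the `DataLoader` can be used with its default `batch_size` of 1. Its default collation then wraps every field in a one-element container, which is presumably why the training loop reads `item['sent_a'][0]`.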

We are now ready to train the model. First, we define the data loaders:

In [None]:
dataloaders = {'train': torch.utils.data.DataLoader(STSDataset(sts_train_df, batch_size=64)),
               'val': torch.utils.data.DataLoader(STSDataset(sts_dev_df, batch_size=64))}

Next, we create the model to fine-tune:

In [None]:
bert_finetuned_semencoder = BERT_Finetuned_Encoder(train_from_layer=8)
if torch.cuda.is_available():
  bert_finetuned_semencoder = bert_finetuned_semencoder.cuda()

Freezing wordpiece embeddings
Freezing layer 0
Freezing layer 1
Freezing layer 2
Freezing layer 3
Freezing layer 4
Freezing layer 5
Freezing layer 6
Freezing layer 7
Trainable layer 8
Trainable layer 9
Trainable layer 10
Trainable layer 11
Trainable pooling layer


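Before defining the optimizer and starting the training (which can take about 10 minutes with a GPU), we run a quick sanity check: we encode a single sentence and count the trainable parameter tensors: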

In [None]:
print(bert_finetuned_semencoder(['This is a sentence to encode']).shape)
len([p for p in bert_finetuned_semencoder.parameters() if p.requires_grad])

torch.Size([1, 768])


66

In [None]:
# using learning rate for fine-tuning as suggested in BERT paper
adam_optim = AdamW([p for p in bert_finetuned_semencoder.parameters() if p.requires_grad], lr=5e-5)

bert_finetuned_semencoder = train_semantic_encoder(
    bert_finetuned_semencoder,
    dataloaders=dataloaders,
    optimizer=adam_optim,
    criterion=torch.nn.SmoothL1Loss(reduction='sum'), # also an option torch.nn.MSELoss(),
    scheduler=torch.optim.lr_scheduler.ReduceLROnPlateau(adam_optim),
    num_epochs=5
    )

Validating initial model
val Loss: 4.1100, Pearson: r=0.2335 p=0.0000 n=1500
Epoch 0/4
----------
train Loss: 3.0001, Pearson: r=0.2724 p=0.0000 n=5749
val Loss: 2.0837, Pearson: r=0.6648 p=0.0000 n=1500

Epoch 1/4
----------
train Loss: 1.6692, Pearson: r=0.6261 p=0.0000 n=5749
val Loss: 1.8330, Pearson: r=0.7431 p=0.0000 n=1500

Epoch 2/4
----------
train Loss: 1.2347, Pearson: r=0.7415 p=0.0000 n=5749
val Loss: 1.8352, Pearson: r=0.7724 p=0.0000 n=1500

Epoch 3/4
----------
train Loss: 1.0340, Pearson: r=0.7901 p=0.0000 n=5749
val Loss: 1.7462, Pearson: r=0.7845 p=0.0000 n=1500

Epoch 4/4
----------
train Loss: 0.8830, Pearson: r=0.8251 p=0.0000 n=5749
val Loss: 1.6302, Pearson: r=0.7923 p=0.0000 n=1500

Training complete in 11m 21s
Best loss: 1.630164 correl: 0.7923


Exact numbers vary between runs, but the output should look something like:
```
Validating initial model
val Loss: 5.2647, Pearson: r=0.1906 p=0.0000 n=1500
Epoch 0/5
...
Epoch 5/5
...
Training complete in 9m 20s
Best loss: 1.688949 correl: 0.7717
```

Note that before training, we validate using the `dev` part of the dataset and achieve a Pearson correlation of only about $0.2$ ($r_{pearson}=0.2335$ in the run above, $0.1906$ in the sample output), which is what the pre-trained BERT produces. This shows that the default BERT embeddings are not very *semantic*, or at least not well-aligned with what humans regard as semantic similarity.

The fine-tuned model should achieve an $r_{pearson}$ score close to $0.8$ ($0.7923$ in the run above), which is much better aligned with human ratings.

# Create a semantic index of embeddings and explore it

Now that we have a model for producing semantic embeddings of sentences, we can create a simple semantic index and define methods to populate and query it.

Our semantic index is simply a python `dict` with fields `sent_encoder`, our semantic encoder, and `sent2emb` a `dict` from the sentence to its embedding.

In [None]:
index = {
    'sent_encoder': bert_finetuned_semencoder,
    'sent2emb': {}
}

## Define a method to populate the index

In [None]:
def populate_index(sentence_generator, index, debug=False):
  """Populates a semantic sentence index with sentences from a generator
  Returns the `index` with the new embeddings."""

  def add_batch(index, batch):
    with torch.no_grad():
      batch_embs = index['sent_encoder'](batch)
    assert batch_embs.shape[0] == len(batch)
    for i, s in enumerate(batch):
      index['sent2emb'][s] = batch_embs[i]

  index['sent_encoder'].eval() # put into evaluation mode

  batch = []
  for snr, sentence in enumerate(sentence_generator):
    batch.append(sentence)
    if len(batch) > 32:
      if debug: print('At', snr, "processing batch..", )
      add_batch(index, batch)
      batch = []
  if len(batch) > 0:
    add_batch(index, batch)

  print('Index now has', len(index['sent2emb']), 'sentences')
  return index
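The batching pattern used in `populate_index` can be isolated into a small generic helper. This is a sketch, not the notebook's code: `batched` is a hypothetical name, and it flushes at exactly `batch_size` items, whereas the loop above flushes once a batch holds more than 32 sentences.

```python
def batched(items, batch_size):
    """Group any iterable into lists of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch

sentences = ("claim %d" % i for i in range(70))  # hypothetical sentence generator
sizes = [len(b) for b in batched(sentences, 32)]
print(sizes)
```

Since the index is a `dict` keyed by sentence, a sentence that occurs more than once is stored only once, so the index can end up with fewer entries than the number of sentences fed to it.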

And a method to iterate over all the STS-B items in one of the `DataFrame`s we loaded at the beginning of the notebook:

In [None]:
def sts_df_as_sent_generator(df):
  """Create a sentence generator given a DataFrame with STS-B rows"""
  for rnr, row in df.iterrows():
    for s in [row['sent_a'], row['sent_b']]:
      yield s

## Populate index with STS-B `dev`

In [None]:
index = populate_index(sts_df_as_sent_generator(sts_dev_df), index)

Index now has 2941 sentences


To explore the newly populated index, we can define a method to find the top $k$ most similar sentences in the index for a given sentence:

In [None]:
# do not trim the sentences in the pandas tables too much...
pd.set_option('display.max_colwidth', 150)

def find_most_similar(text, semb_index, k=5):
  text_emb = semb_index['sent_encoder']([text])
  if len(text_emb.shape) == 2:
    text_emb = text_emb[0]
  assert len(text_emb.shape) == 1, "" + str(text_emb.shape)
  s2cosim = {}
  for s, s_emb in semb_index['sent2emb'].items():
    assert len(s_emb.shape) == 1, "%s" % (s_emb.shape)
    s2cosim[s] = F.cosine_similarity(text_emb, s_emb, dim=0).item()
  sorted_s2cosim = sorted(s2cosim.items(), key=lambda kv: kv[1], reverse=True)
  results = [{'sentence': kv[0], 'cosim': kv[1]} for kv in sorted_s2cosim[:k]]
  return pd.DataFrame(results).sort_values(by=['cosim'], ascending=False)
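The search above is a brute-force scan: it scores every sentence in the index and keeps the best $k$. A minimal stand-alone sketch of the same idea, using `heapq.nlargest` instead of a full sort, could look as follows. The two-dimensional embeddings and the sentences are hypothetical, and `top_k` is an illustrative name.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query_emb, sent2emb, k=5):
    """Return the k (sentence, cosine similarity) pairs most similar to query_emb."""
    scored = ((s, cosine_similarity(query_emb, emb)) for s, emb in sent2emb.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

toy_index = {  # hypothetical 2-dimensional embeddings
    "floods in region A": [1.0, 0.1],
    "earthquake in region B": [0.9, 0.3],
    "markets close higher": [0.1, 1.0],
}
print(top_k([1.0, 0.2], toy_index, k=2))
```

A linear scan like this is fine for a few thousand sentences; for much larger indexes, an approximate nearest-neighbour library would likely be a better fit.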

### Explore the dataset using some examples

#### news about traffic accidents in China

In [None]:
find_most_similar("3 traffic accidents leave 56 dead in China", index)

| | cosim | sentence |
|---|---|---|
| 0 | 0.993376 | 'Around 100 dead or injured' after China earthquake |
| 1 | 0.992287 | Hundreds dead or injured in China quake\n |
| 2 | 0.990573 | Floods leave six dead in Philippines |
| 3 | 0.989853 | At least 28 people die in Chinese coal mine explosion\n |
| 4 | 0.989653 | Heavy rains leave 18 dead in Philippines\n |


#### economic output in US

In [None]:
find_most_similar("US' industrial output growth slows to 9.2 pct in July", index)

| | cosim | sentence |
|---|---|---|
| 0 | 0.997261 | North American markets grabbed early gains Monday morning, as earnings season begins to slow and economic indicators take the spotlight.\n |
| 1 | 0.996923 | North American markets finished mixed in directionless trading Monday as earnings season begins to slow and economic indicators move into the spot... |
| 2 | 0.996634 | S. Korean economic growth falls to near 3-year low\n |
| 3 | 0.996336 | The blue-chip Dow Jones industrial average .DJI climbed 164 points, or 1.91 percent, to 8,765.38, brushing its highest levels since mid-January. |
| 4 | 0.996165 | That took the benchmark 10-year note US10YT=RR down 9/32, its yield rising to 3.37 percent from 3.34 percent late on Thursday. |


# Create another index for a Claims dataset
So the results on STS-B `dev` seem OK. Now, let's create an index for a dataset of checked facts from [datacommons factcheck](https://www.datacommons.org/factcheck/download#research-data).

First, let's download the dataset:

In [None]:
!wget https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json
!mv data.json datacommons-factcheck.json

--2019-10-06 17:37:51--  https://storage.googleapis.com/datacommons-feeds/claimreview/latest/data.json
Resolving storage.googleapis.com (storage.googleapis.com)... 173.194.76.128, 2a00:1450:400c:c08::80
Connecting to storage.googleapis.com (storage.googleapis.com)|173.194.76.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9801768 (9.3M) [application/json]
Saving to: ‘data.json’


2019-10-06 17:37:51 (68.8 MB/s) - ‘data.json’ saved [9801768/9801768]



## Load dataset into a pandas `DataFrame`
This dataset is formatted using JSON-LD, so we can simply parse it as JSON:

In [None]:
import json
with open('datacommons-factcheck.json', mode='r', encoding='utf-8') as f:
  js_datafeed = json.load(f)

and define a method to convert the nested python `dict` into a pandas `DataFrame`. We are not interested in all the data in the json feed, so we only populate a few columns.

In [None]:
def load_datacommons_feed_df(js_datafeed):
  claims = []
  for feed_item in js_datafeed['dataFeedElement']:
    claim_items = feed_item.get('item', [])
    if claim_items is None:
      claim_items = []
    for claim_in_feed in claim_items:
      claim = claim_in_feed.get('claimReviewed', None)
      if claim is not None:
        claims.append({
          'claimReviewed': claim,
          'reviewed_by': claim_in_feed.get('author', {}).get('name', 'unknown'),
          'review_altName': claim_in_feed.get('reviewRating', {}).get('alternateName', ""),
          'claim_date': claim_in_feed.get('itemReviewed', {}).get('datePublished', None),
          'claimed_by': claim_in_feed.get('itemReviewed', {}).get('author', {}).get('name', None)
        })
  return pd.DataFrame(claims)
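To illustrate the nesting that `load_datacommons_feed_df` walks through, here is a sketch on a tiny hand-made feed. All values in `toy_feed` are hypothetical, and `nested_get` is an illustrative helper that is not part of the notebook. Unlike chained `.get(key, {})` calls, it also tolerates a key that is present but set to `None`.

```python
toy_feed = {  # hypothetical feed with the same nesting as the fields read above
    "dataFeedElement": [
        {"item": [{
            "claimReviewed": "Example claim text",
            "author": {"name": "Example Checker"},
            "reviewRating": {"alternateName": "False"},
            "itemReviewed": {"datePublished": "2019-01-01",
                             "author": {"name": "Example Speaker"}},
        }]},
        {"item": None},                               # element without reviews
        {"item": [{"author": {"name": "Someone"}}]},  # review without a claim text
    ]
}

def nested_get(d, *keys, default=None):
    """Walk nested dicts, returning default as soon as a key is missing or not a dict."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d

rows = []
for element in toy_feed["dataFeedElement"]:
    for review in element.get("item") or []:
        claim = review.get("claimReviewed")
        if claim is not None:  # skip reviews that do not state the claim
            rows.append({
                "claimReviewed": claim,
                "reviewed_by": nested_get(review, "author", "name", default="unknown"),
                "claimed_by": nested_get(review, "itemReviewed", "author", "name"),
            })
print(rows)
```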

In [None]:
claims_df = load_datacommons_feed_df(js_datafeed)
claims_df.shape

(5647, 5)

In [None]:
claims_df.sample(n=5)

| | claimReviewed | claim_date | claimed_by | review_altName | reviewed_by |
|---|---|---|---|---|---|
| 4409 | “Sumber daya yang sebelumnya dikuasai asing, berhasil dikuasai oleh negara. [Blok] Mahakam, Rokan, Freeport sebagai contoh.“ | 2019-04-13 | Joko Widodo dalam Debat Capres-Cawapres Kelima | Benar | Tempo.co |
| 1014 | The push by Assembly Democrats seeking Americans with Disabilities Act accommodations for a lawmaker were timed to make Vos look bad as he became ... | 2019-08-14 | Robin Vos | False | PolitiFact |
| 5307 | 錯誤內容與截圖訊息來自中國網民討論內容，事實上完全沒有任何的根據，對於「國立故宮博物院2000件文物，將與日本100件文物互換，且雙方交換展期50年」，故宮澄清絕無此事。 | 2018-09-10 | Charles Yeh | 不實 | MyGoPen |
| 2285 | The EU sends Northern Ireland €500 million a year | 2019-05-08 | SDLP leader, Colum Eastwood | ACCURATE WITH CONSIDERATION. The €500 million figure quoted by the SDLP is substantiated by European Commission figures for EU regional funding of... | Fact Check NI |
| 620 | A claim that herdsmen walked into the terminal of Big Joe transport in Kogi and shot all passengers boarding to travel to Edo state. | 2019-08-13 | A Facebook Post | False | DUBAWA |


## Create claim iterator

The datafeed contains claims in many different languages, and since our model only works for English, we should only take English claims into account. Unfortunately, the feed does not include a language tag, so we need to detect the language of each claim ourselves.

In [None]:
!pip install langdetect

Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/59/59/4bc44158a767a6d66de18c4136c8aa90491d56cc951c10b74dd1e13213c9/langdetect-1.0.7.zip (998kB)
     |████████████████████████████████| 1.0MB 6.3MB/s 
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.7-cp36-none-any.whl size=993460 sha256=1098442956688161b1fb5efba9e29bf00c8d42654aaf5f774eb5b596c4926cb7
  Stored in directory: /root/.cache/pip/wheels/ec/0c/a9/1647275e7ef5014e7b83ff30105180e332867d65e7617ddafe
Successfully built langdetect
Installing collected packages: langdetect
Successfully installed langdetect-1.0.7


In [None]:
from langdetect import detect

def is_english(sentence):
  try:
    return detect(sentence) == 'en'
  except Exception:
    # detection fails e.g. when the sentence is empty
    return False

In [None]:
is_english("Tin bài hàng đầu"), is_english(
    "Claim: H Raja and S Ve Sekher supporters fighting in BJP TN office"), is_english(" ")

(False, True, False)

In [None]:
def claims_df_english_row_generator(df):
  """Lazily yields the rows of `df` (as dicts) whose claim is detected as English."""
  for _, row in df.iterrows():
    s = row['claimReviewed']
    if is_english(s):
      yield row.to_dict()

## Populate a `claim_index`

We could just reuse the `populate_index` function we defined above, but we already have some useful metadata about the reviewed claims, and it is worth keeping it in our index. So we define a slightly modified version:

In [None]:
def populate_claim_index(claim_rows, index, debug=False):
  """Populates a semantic sentence index with claims from a generator of rows.
  Stores each claim's embedding and review metadata, and returns the `index`."""

  def add_batch(index, batch):
    sent_batch = [row['claimReviewed'] for row in batch]
    with torch.no_grad():
      batch_embs = index['sent_encoder'](sent_batch)
    assert batch_embs.shape[0] == len(batch)
    for i, s in enumerate(sent_batch):
      index['sent2emb'][s] = batch_embs[i]
      index['claim_meta'][s] = {
          'review_altName': batch[i]['review_altName'],
          'reviewed_by': batch[i]['reviewed_by']
          }

  index['sent_encoder'].eval() # put into evaluation mode

  batch = []
  for snr, claim_row in enumerate(claim_rows):
    batch.append(claim_row)
    if len(batch) >= 32:  # encode in batches of 32 claims
      if debug: print('At', snr, 'processing batch..')
      add_batch(index, batch)
      batch = []
  if len(batch) > 0:
    add_batch(index, batch)

  print('Index now has', len(index['sent2emb']), 'sentences')
  return index

In [None]:
claim_index = {
      'sent_encoder': bert_finetuned_semencoder,
      'sent2emb': {},
      'claim_meta': {}
  }

In [None]:
claim_index = populate_claim_index(
    claims_df_english_row_generator(claims_df), claim_index)

Index now has 3519 sentences
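
The batching loop inside `populate_claim_index` can be isolated as a small generator. This is a sketch rather than the notebook's code; the names `batched` and `batch_size` are introduced here for illustration.

```python
def batched(items, batch_size=32):
  """Yields lists of at most `batch_size` items, including a final partial batch."""
  batch = []
  for item in items:
    batch.append(item)
    if len(batch) >= batch_size:
      yield batch
      batch = []
  if batch:  # do not drop the leftover items
    yield batch

sizes = [len(b) for b in batched(range(70), batch_size=32)]
print(sizes)  # [32, 32, 6]
```

Because it consumes any iterable lazily, such a helper could be fed directly from `claims_df_english_row_generator`. Python 3.12 and later also ship `itertools.batched`, which yields tuples instead of lists.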


## Explore dataset
We define a custom version of `find_most_similar` to display more relevant info about the most similar claims:

In [None]:
def find_most_similar_claim(text, claim_index, k=5):
  """Returns a DataFrame with the `k` indexed claims most similar to `text`."""
  encoder = claim_index['sent_encoder']
  s2pred = {}
  with torch.no_grad():  # inference only, no gradients needed
    text_emb = encoder([text]) # shape (1, emb_dim)
    for s, s_emb in claim_index['sent2emb'].items():
      ts_emb = s_emb.unsqueeze(0) # shape (1, emb_dim)
      pred_score = encoder.predict_encoded_similarity(text_emb, ts_emb)
      s2pred[s] = pred_score.item()

  # rank all indexed claims by predicted similarity, best first
  sorted_s2pred = sorted(s2pred.items(), key=lambda kv: kv[1], reverse=True)
  claim_meta = claim_index['claim_meta']
  results = [{'claim': claim,
              'true?': claim_meta[claim].get('review_altName', '??'),
              'reviewed by': claim_meta[claim].get('reviewed_by', "??"),
              'pred': pred,
              } for claim, pred in sorted_s2pred[:k]]

  return pd.DataFrame(results).sort_values(by=['pred'], ascending=False)
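
The function above scores the query against every indexed claim and then sorts all the scores. The sketch below shows the same brute-force top-$k$ pattern on plain Python lists. It swaps in cosine similarity for the model's `predict_encoded_similarity` and `heapq.nlargest` for the full sort; the toy 3-dimensional vectors and the names `cosine` and `top_k` are assumptions for illustration only.

```python
import heapq
import math

def cosine(u, v):
  """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
  dot = sum(a * b for a, b in zip(u, v))
  norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
  return dot / norm if norm else 0.0

def top_k(query_emb, sent2emb, k=5):
  """Returns the k (sentence, score) pairs with the highest score, best first."""
  scored = ((s, cosine(query_emb, emb)) for s, emb in sent2emb.items())
  return heapq.nlargest(k, scored, key=lambda kv: kv[1])

# Invented toy embeddings, standing in for `claim_index['sent2emb']`.
toy_index = {
  'claim a': [1.0, 0.0, 0.0],
  'claim b': [0.7, 0.7, 0.0],
  'claim c': [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for sentence, score in hits:
  print(sentence, round(score, 3))
```

With a few thousand claims, scanning the whole index per query is fast enough; for much larger indexes an approximate nearest-neighbour library would be the usual next step.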

### Brexit, UK

In [None]:
find_most_similar_claim("Most people in UK now want Brexit", claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,77% of young people in the UK don’t want Brexit.,0.766278,unknown,"Inaccurate. Polls for Great Britain show support from those aged 18-24 for remaining in the EU at between 57% and 71%; for Northern Ireland, betwe..."
1,A claim that says Nigeria’s Independent National Electoral Commission [INEC] ban phones at polling stations,0.681251,DUBAWA,The claim that INEC has banned the use of phones and cameras at polling stations is NOT ENTIRELY FALSE. While you are not banned from going to the...
2,"The DUP at no point has ever agreed to establish an Irish Language Act with the UK government, with the Irish government, with Sinn Féin or anybod...",0.636494,unknown,"Accurate. The St Andrew’s Agreement committed the UK Government to an Irish Language Act, but subsequent legislation compelled the Northern Irelan..."
3,Says there could be a potential mass shooting at a Walmart nearby.,0.636324,PolitiFact,It's a widespread hoax message
4,Claim video claiming muslims protesting in Kashmir after Eid prayers against article 370 dissolution,0.630145,Fact Crescendo,FALSE


The `pred` column is the model's output: a similarity score between 0 (not similar at all) and 1 (semantically very similar). In this case, a related but narrower claim was found with a similarity score of about $0.77$. The other results score below $0.7$ and are not about Brexit at all.

### Northern Ireland and EU contributions

In [None]:
find_most_similar_claim("Northern Ireland receives yearly half a billion pounds from the European Union", claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,The EU sends Northern Ireland €500 million a year,0.751983,Fact Check NI,ACCURATE WITH CONSIDERATION. The €500 million figure quoted by the SDLP is substantiated by European Commission figures for EU regional funding of...
1,Northern Ireland is a net contributor to the EU.,0.746601,unknown,"This claim is false, as we estimate that Northern Ireland was a net recipient of £74 million in the 2014/15 financial year. Others have claimed th..."
2,"Arlene Foster, the leader of the Democratic Unionist Party, said that the party delivered “an extra billion pounds” for Northern Ireland.",0.735942,Fact Check NI,ACCURATE. The £1bn is specific to the jurisdiction of Northern Ireland and is in addition to funding pledged as a result of the Stormont House Agr...
3,Northern Ireland were once net contributors of revenue to HM Treasury.,0.732447,unknown,"True, up until the 1930s. But data show that Northern Ireland has run a fiscal deficit since 1966. The most recent figure, from 2013-14, is a subv..."
4,The number of homes in Northern Ireland that have had their housing supplementary payments removed has increased five times more in the past year.,0.731485,unknown,Accurate with considerations. The 140 households whose top-up payments ceased in the past year is four times more than the 35 households in the pr...


In this case the first two matches are on topic, with scores above $0.74$. Notice that *the only content words shared by the query and the top result are 'Northern Ireland'*.

The rest of the top $5$ is still about money and Northern Ireland but no longer relates to the EU, even though the similarity scores are still in the range $[0.73, 0.74]$.

### Religion/archeology

In [None]:
find_most_similar_claim("A Bible was found in depths of ocean", claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,"An ancient Bible, which has been found at the bottom of the ocean is still readable.",0.748978,FACTLY,FALSE
1,Crystallised book is a bible found at the bottom of the ocean.,0.667992,Africa Check,False
2,Body of ancient Egyptian pharaoh “Fir’auna” miraculously preserved without any mummification despite being “inside the sea for more than 3000 years”,0.49265,Africa Check,Incorrect
3,"In Shirsal village of Beed district, lava was seen coming out of the ground when bore well was drilled for 1200 feet",0.472769,FACTLY,FALSE
4,Says an image shows a photo of hurricane-ravaged Abaco Island in the Bahamas.,0.467003,PolitiFact,False


This was an easy case: the top 2 matches have scores above $0.66$, while the remaining, unrelated claims score below $0.5$.

### Climate change

In [None]:
find_most_similar_claim("There is NO climate emergency", claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,Prime Minister Scott Morrison has defending his government’s action on climate change and telling the the United Nations General Assembly that the...,0.61238,AAP FactCheck,Mostly True - Mostly accurate but there is more than one error or problem.
1,"""The vast majority of Americans believe that climate change is real and we need to do something about it.""",0.611185,Michael Bennet,Mostly True
2,"""I share the sense of urgency. I’m a scientist, so I recognize that we’re within 10 or 12 years of actually suffering irreversible damage (of clim...",0.608243,Andrew Yang,Deadline lacks nuance
3,"""91% of the world’s population are exposed to air pollution above the World Health Organization’s suggested level. NONE ARE IN THE U.S.A.!""",0.60067,FactCheck.org,False
4,The time taken for environmental approvals has been brought down to 180 days from 600 days.,0.590571,FACTLY,TRUE


In this case, we fail to find any similar claims (they all have scores under $0.62$), although the top results are topically related.

In [None]:
find_most_similar_claim(
    "Current climate changes are to be expected from the cyclic behaviour " +
    "of the climate system",
    claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,Today’s global warming is no different from previous warming periods in Earth’s past.,0.741014,"National Academies of Sciences, Engineering, and Medicine",False
1,Extreme weather can be linked to global warming.,0.714379,"National Academies of Sciences, Engineering, and Medicine",In some cases
2,"""The vast majority of Americans believe that climate change is real and we need to do something about it.""",0.642089,Michael Bennet,Mostly True
3,The time taken for environmental approvals has been brought down to 180 days from 600 days.,0.563477,FACTLY,TRUE
4,Claim NASA develops rain cloud generator engine,0.562113,Fact Crescendo,FALSE


We find two claims with scores above $0.70$. The first result in particular is a paraphrase of the query sentence.

### State hacking of digital devices

In [None]:
find_most_similar_claim("The state can hack into any digital device", claim_index)

Unnamed: 0,claim,pred,reviewed by,true?
0,Claim: All computers can now be monitored by government agencies,0.703854,Fact Crescendo,Fact Crescendo Rating: True
1,Claim unrelated image from a random FB profile used to recirculate an old incident,0.641392,Fact Crescendo,FALSE
2,EVMs hacked by JIO network,0.641277,Fact Crescendo,Fact Crescendo Rating: False
3,"A video of Mark Zuckerberg shows him talking about controlling ""billions of people’s stolen data"" to control the future.",0.632752,Instagram post,Pants on Fire
4,A claim that the Government of the United States has asked INEC to release the “real” figures of the 2019 Presidential Elections.,0.59749,DUBAWA,False


In our final example, one claim has a score above $0.7$ and is again related to the query. The other results are somewhat related, but not directly relevant for assessing the query claim.
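
The examples suggest that scores of roughly $0.7$ and above tend to mark genuinely related claims, so one could apply a simple cut-off to the results. A minimal sketch, using the scores from the Bible query above; the threshold values and the function name `filter_by_score` are assumptions for illustration, and a real threshold would need tuning on labelled data.

```python
def filter_by_score(results, threshold=0.7):
  """Keeps the (claim, score) pairs whose score is at least `threshold`."""
  return [(claim, score) for claim, score in results if score >= threshold]

# Scores copied from the Bible query above; claim texts are abbreviated.
bible_results = [
  ('An ancient Bible ... is still readable.', 0.748978),
  ('Crystallised book is a bible ...', 0.667992),
  ('Body of ancient Egyptian pharaoh ...', 0.49265),
  ('In Shirsal village of Beed district ...', 0.472769),
  ('Says an image shows a photo ...', 0.467003),
]

print(len(filter_by_score(bible_results, threshold=0.7)))  # 1
print(len(filter_by_score(bible_results, threshold=0.6)))  # 2
```

The trade-off is visible even on this small sample: a cut-off of $0.7$ drops the second relevant Bible match, while $0.6$ keeps it but would also admit four of the five results for the "There is NO climate emergency" query, none of which is a similar claim.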

# Acknowledgements
This notebook is based on work performed as part of the [**Co-inform** project](https://coinform.eu/).

![](https://coinform.eu/wp-content/uploads/2018/06/EC-H2020.png)

> <sub> Co-inform project is co-funded by Horizon 2020 – the Framework Programme for Research and Innovation (2014-2020) </sub>

> <sub> H2020-SC6-CO-CREATION-2016-2017 (CO-CREATION FOR GROWTH AND INCLUSION) </sub>

> <sub> Type of action: RIA (Research and Innovation action) </sub>

> <sub> Proposal number: 770302 </sub>