# Introduction

Semantic Similarity is the task of determining how similar two sentences are, in terms of what they mean. This example demonstrates the use of SNLI (Stanford Natural Language Inference) Corpus to predict sentence semantic similarity with Transformers. We will fine-tune a BERT model that takes two sentences as inputs and that outputs a similarity score for these two sentences.

## Dataset

- [SNLI](https://nlp.stanford.edu/projects/snli/)

 Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). [[pdf](https://nlp.stanford.edu/pubs/snli_paper.pdf)]

<img width="685" alt="image" src="https://user-images.githubusercontent.com/37654013/111018451-2d4dc200-83fc-11eb-9f23-11ec849d85e4.png">



# Setup

In [8]:
%autoreload 2
%reload_ext autoreload

import numpy as np
import pandas as pd
import torch
import transformers
import warnings

warnings.filterwarnings(action='ignore')

print('transformers version: ',transformers.__version__)

transformers version:  4.3.3


In [10]:
from utils import progress_bar

In [11]:
device = torch.device(f'cuda:1' if torch.cuda.is_available() else 'cpu')

# Configuration

In [12]:
max_length = 128 # Maximun length of input sentence to the model
batch_size = 32
epochs = 2

# Labels in our dataset
labels = ["contradiction", "entailment", "neutral"]

# Load the data

In [13]:
# !curl -LO https://raw.githubusercontent.com/MohamadMerchant/SNLI/master/data.tar.gz
# !tar -xvzf data.tar.gza

In [112]:
# There are more than 550k samples in total; we will use 100k for this example.
train_df = pd.read_csv("SNLI_Corpus/snli_1.0_train.csv", nrows=100000)
valid_df = pd.read_csv("SNLI_Corpus/snli_1.0_dev.csv")
test_df = pd.read_csv("SNLI_Corpus/snli_1.0_test.csv")

# Shape of the data
print(f"Total train samples: {train_df.shape[0]}")
print(f"Total validation samples: {valid_df.shape[0]}")
print(f"Total test samples: {test_df.shape[0]}")

Total train samples: 100000
Total validation samples: 10000
Total test samples: 10000


Dataset Overview:

- sentence1: The premise caption that was supplied to the author of the pair.
- sentence2: The hypothesis caption that was written by the author of the pair.
- similarity: This is the label chosen by the majority of annotators. Where no majority exists, the label "-" is used (we will skip such samples here).

Here are the "similarity" label values in our dataset:

- Contradiction: The sentences share no similarity.
- Entailment: The sentences have similar meaning.
- Neutral: The sentences are neutral.

Let's look at one sample from the dataset:

In [113]:
print(f"Sentence1: {train_df.loc[1, 'sentence1']}")
print(f"Sentence2: {train_df.loc[1, 'sentence2']}")
print(f"Similarity: {train_df.loc[1, 'similarity']}")

Sentence1: A person on a horse jumps over a broken down airplane.
Sentence2: A person is at a diner, ordering an omelette.
Similarity: contradiction


## Missing values

In [114]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   similarity  100000 non-null  object
 1   sentence1   100000 non-null  object
 2   sentence2   99997 non-null   object
dtypes: object(3)
memory usage: 2.3+ MB


In [115]:
train_df[train_df.sentence2.isnull()]

Unnamed: 0,similarity,sentence1,sentence2
91479,neutral,Cannot see picture to describe.,
91480,entailment,Cannot see picture to describe.,
91481,contradiction,Cannot see picture to describe.,


# Preprocessing

In [116]:
# We have some NaN entries in our train data, we will simply drop them.
print("Number of missing values")
print(train_df.isnull().sum())
train_df.dropna(axis=0, inplace=True)

Number of missing values
similarity    0
sentence1     0
sentence2     3
dtype: int64


Distribution of our training targets.

In [117]:
print("Train Target Distribution")
print(train_df.similarity.value_counts())

Train Target Distribution
entailment       33384
contradiction    33310
neutral          33193
-                  110
Name: similarity, dtype: int64


Distribution of our validation targets.

In [118]:
print("Validation Target Distribution")
print(valid_df.similarity.value_counts())

Validation Target Distribution
entailment       3329
contradiction    3278
neutral          3235
-                 158
Name: similarity, dtype: int64


The value "-" appears as part of our training and validation targets. We will skip these samples.

In [119]:
train_df = (
    train_df[train_df.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)

valid_df = (
    valid_df[valid_df.similarity != "-"]
    .sample(frac=1.0, random_state=42)
    .reset_index(drop=True)
)

Label encoding train, valid and test labels

In [120]:
encoder = {
    'contradiction':0,
    'entailment':1,
    'neutral':2
}

train_df['similarity'] = train_df['similarity'].map(encoder)
valid_df['similarity'] = valid_df['similarity'].map(encoder)
test_df['similarity'] = test_df['similarity'].map(encoder)

# Create a custom data generator

## Tokeinzer Example

**Tokenizer return values description**
- **input_ids** : 입력값으로 들어간 각 토큰에 대한 id
- **attention_mask** : 인코딩된 토큰 중에서 focus를 취해야하는 곳은 '1'로 아닌 곳은 '0'으로 나타냄 
- **token_type_ids** : Sequence classification 또는 QA 같은 task 경우 여러 sentence를 함께 encoding하는데 각 sentence를 구분하기 위해 나타냄. 첫 sentence는 '0', 두 번째는 '1'. 이런식으로 구분함

**source:** https://huggingface.co/transformers/glossary.html

In [121]:
tokenizer = transformers.BertTokenizer.from_pretrained(
    "bert-base-uncased", do_lower_case=True
)

In [122]:
example = train_df[['sentence1','sentence2']].values.astype('str')[0].tolist()
encoded = tokenizer.encode_plus(
    text=example[0],
    text_pair=example[1],
    add_special_tokens=True, # Such as '[CLS]', '[SEP]'
    max_length=max_length, # maximun length 
    return_attention_mask=True, # whether returns attention_mask
    return_token_type_ids=True, # whether returns token_type_ids
    pad_to_max_length=True, # padding
    return_tensors="pt" # 'pt': pytorch, 'tf': tensorflow
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [123]:
print('[input]')
print(f'Sentence1: {example[0]}')
print(f'Sentence2: {example[1]}')
print('\n[encoded]')
print(f"input_ids: {encoded['input_ids']}")
print(f'attention_mask: ',encoded['attention_mask'])
print(f'token_type_ids: ',encoded['token_type_ids'])
print('\n[decoded]')
print(f"decode: {tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])}")

[input]
Sentence1: Two male clowns, one in a plaid suit and the other in black, performing a musical number in a theater setting.
Sentence2: The clowns are in the dressing room.

[encoded]
input_ids: tensor([[  101,  2048,  3287, 15912,  2015,  1010,  2028,  1999,  1037, 26488,
          4848,  1998,  1996,  2060,  1999,  2304,  1010,  4488,  1037,  3315,
          2193,  1999,  1037,  4258,  4292,  1012,   102,  1996, 15912,  2015,
          2024,  1999,  1996, 11225,  2282,  1012,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
          

## Create dataloader using tokeinzer

1. Make a Dataset
2. Build a DataLoader

In [163]:
class BertSemanticDataset(torch.utils.data.Dataset):
    """Generates batches of data.

    Args:
        sentence_pairs: Array of premise and hypothesis input sentences.
        targets: Array of labels.
        max_length: maximun length of sentence
        include_targets: boolean, whether to incude the labels.

    Returns:
        Dictionary keys : ['input_ids','attention_mask','token_type_ids','target']
        (or just [input_ids, attention_mask, token_type_ids] if include_targets=False)
    """
    
    def __init__(
        self,
        sentence_pairs,
        tokenizer,
        max_length,
        targets=None,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.targets = targets
        self.include_targets = include_targets
        self.tokenizer = tokenizer
        
    def __len__(self):
        # Denotes the number of sentence pairs
        return len(self.sentence_pairs)

    def __getitem__(self, idx):
        encoded = self.tokenizer.encode_plus(
            self.sentence_pairs[idx][0],
            text_pair=self.sentence_pairs[idx][1],
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_tensors="pt",
        )
        
        if self.include_targets:
            return {
                'input_ids':encoded['input_ids'][0],
                'attention_mask':encoded['attention_mask'][0],
                'token_type_ids':encoded['token_type_ids'][0],
                'target': self.targets[idx]
            }
        else:
            return {
                'input_ids':encoded['input_ids'][0],
                'attention_mask':encoded['attention_mask'][0],
                'token_type_ids':encoded['token_type_ids'][0]
            }

In [125]:
trainset = BertSemanticDataset(
    sentence_pairs=train_df[['sentence1','sentence2']].values.astype('str'),
    targets=train_df['similarity'].values,
    tokenizer=tokenizer,
    max_length=max_length
)

validset = BertSemanticDataset(
    sentence_pairs=valid_df[['sentence1','sentence2']].values.astype('str'),
    targets=valid_df['similarity'].values,
    tokenizer=tokenizer,
    max_length=max_length
)

testset = BertSemanticDataset(
    sentence_pairs=test_df[['sentence1','sentence2']].values.astype('str'),
    targets=test_df['similarity'].values,
    tokenizer=tokenizer,
    max_length=max_length,
    include_targets=True
)

In [126]:
trainloader = torch.utils.data.DataLoader(
    dataset=trainset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=4
)
validloader = torch.utils.data.DataLoader(
    dataset=validset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=4
)
testloader = torch.utils.data.DataLoader(
    dataset=testset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=4
)

# Build the model

- **model.eval()** will notify all your layers that you are in eval mode, that way, batchnorm or dropout layers will work in eval mode instead of training mode.

- **torch.no_grad()** impacts the autograd engine and deactivate it. It will reduce memory usage and speed up computations but you won’t be able to backprop (which you don’t want in an eval script).

**Problem**
- 학습이 제대로 진행히 안되고 33%에 수렴.

**Reason**
- `forward`에서 bert를 torch.no_grad()로 감싸줘야함. 
    - `__init__`에서 bert의 parameter를 requires_grad=False로 설정해줘도 안됨
- fine tuning을 할때는 learning rate를 작게해줘야함. Bert가 크다보니 제대로 수렴이 되지않음

In [129]:
class BertSemanticModel(torch.nn.Module):
    def __init__(self):
        super(BertSemanticModel, self).__init__() 
        
        self.fine_tuning = False
        self.bert = transformers.BertModel.from_pretrained('bert-base-uncased')
            
        self.bi_lstm = torch.nn.LSTM(input_size=self.bert.config.hidden_size, 
                                     hidden_size=64,
                                     bidirectional=True)
        
        self.linear = torch.nn.Linear(in_features=64*2*2, out_features=3) 
        self.dropout = torch.nn.Dropout(p=0.3)
        
    def forward(self, input_ids, attention_mask, token_type_ids):
        
        if self.fine_tuning:
            embedding = self.bert(input_ids,
                                  attention_mask=attention_mask,
                                  token_type_ids=token_type_ids)
        else:
            with torch.no_grad():
                embedding = self.bert(input_ids,
                                      attention_mask=attention_mask,
                                      token_type_ids=token_type_ids)
        
        # sequence_output (batch size x #token x hidden size) : (batch size x 128 x 768)        
        # pooled_output (batch size x  hidden size)           : (batch size x  768) CLS token에 linear mapping 후 tanh 결과
        sequence_output, pooled_output = embedding[0], embedding[1]
        
        # lstm_out (batch size x #token x hidden size)        : (batch size x 128 x 128)
        lstm_out, _ = self.bi_lstm(sequence_output)

        # gap_out (batch size x hidden size)                  : (batch size x 128)
        gap_out = lstm_out.mean(dim=1) # GAP
        
        # gmp_out (batch size x hidden size)                  : (batch size x 128)
        gmp_out, _ = lstm_out.max(dim=1) # GMP
           
        # out (batch size x hidden size)                      : (batch size x 256)
        out = torch.cat([gap_out, gmp_out], dim=1)
        out = self.dropout(out)
        
        # out (batch size x #class)                           : (batch size x 3)
        out = self.linear(out)
        
        return out

In [130]:
model = BertSemanticModel().to(device)

## Model Summary

In [131]:
print('Ther number of parameters from pytorch model: ',sum([np.prod(param.size()) for param in model.parameters()]))

Ther number of parameters from pytorch model:  109910019


In [132]:
print('The difference of the number of model parameters between Pytorch and Keras')
print('109910019 - 109909507 = ',109910019 - 109909507)

The difference of the number of model parameters between Pytorch and Keras
109910019 - 109909507 =  512


In [133]:
lstm_params = sum([np.prod(param.size()) for param in model.bi_lstm.parameters()])
print('The difference of the number of LSTM parameters between Pytorch and Keras')
print(f'{lstm_params} - 426496 = {lstm_params - 426496}')

The difference of the number of LSTM parameters between Pytorch and Keras
427008 - 426496 = 512


# Train the Model

In [134]:
def train(model, dataloader, criterion, optimizer, device):
    total = 0
    correct = 0 
    total_loss = 0
    
    model.train()
    for batch_idx, batch_i in enumerate(dataloader):
        # inputs and targets
        input_ids = batch_i['input_ids'].to(device)
        attention_mask = batch_i['attention_mask'].to(device)
        token_type_ids = batch_i['token_type_ids'].to(device)
        targets = batch_i['target'].to(device)
        
        # reset optimizer
        optimizer.zero_grad()
        
        # model output
        outputs = model(input_ids, attention_mask, token_type_ids)
        
        # accuracy
        _, predict = outputs.max(1)
        correct += predict.eq(targets.long()).cpu().float().sum().item()
        total += input_ids.size(0)
        
        # loss
        loss = criterion(outputs, targets)
        loss.backward()
        
        # update optimizer
        optimizer.step()
        
        total_loss += loss.item()
    
        
        # massage
        progress_bar(current=batch_idx, 
                     total=len(dataloader),
                     msg='Loss: %.3f | Acc: %.3f%%' % (total_loss/(batch_idx + 1), 
                                                               100.*(correct/total)),
                     term_width=100)
        
        
def validation(model, dataloader, criterion, device):
    total = 0
    correct = 0 
    total_loss = 0
    
    model.eval()
    with torch.no_grad():
        for batch_idx, batch_i in enumerate(dataloader):
            # inputs and targets
            input_ids = batch_i['input_ids'].to(device)
            attention_mask = batch_i['attention_mask'].to(device)
            token_type_ids = batch_i['token_type_ids'].to(device)
            targets = batch_i['target'].to(device)


            # model output
            outputs = model(input_ids, attention_mask, token_type_ids)

            # accuracy
            _, predict = outputs.max(1)
            correct += predict.eq(targets.long()).cpu().float().sum().item()
            total += input_ids.size(0)

            # loss
            loss = criterion(outputs, targets)
            total_loss += loss.item()

            # massage
            progress_bar(current=batch_idx, 
                         total=len(dataloader),
                         msg='Loss: %.3f | Acc: %.3f%%' % (total_loss/(batch_idx + 1), 
                                                                   100.*(correct/total)),
                         term_width=100)
            
            
def fit(model, epochs, trainloader, criterion, optimizer, device, validloader=None):
    for epoch in range(epochs):
        print('Fit start')
        print(f'\nEpochs: {epoch+1}/{epochs}')
        train(model, trainloader, criterion, optimizer, device)
        if validloader is not None:
            validation(model, validloader, criterion, device)

In [135]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())

In [136]:
fit(model=model,
    epochs=1,
    trainloader=trainloader,
    validloader=validloader,
    criterion=criterion,
    optimizer=optimizer,
    device=device)

Fit start

Epochs: 1/1


# Fine-tuning

In [138]:
model.fine_tuning = True
optimizer.param_groups[0]['lr'] = 0.00001

In [139]:
fit(model=model,
    epochs=1,
    trainloader=trainloader,
    validloader=validloader,
    criterion=criterion,
    optimizer=optimizer,
    device=device)

Fit start

Epochs: 1/1


# Train the entire model end-to-end

# Evaluate model on the test set

In [149]:
def predict(model, dataloader, device):
    preds = np.zeros(len(dataloader.dataset))
    
    model.eval()
    
    with torch.no_grad():
        for batch_idx, batch_i in enumerate(dataloader):
            # inputs and targets
            input_ids = batch_i['input_ids'].to(device)
            attention_mask = batch_i['attention_mask'].to(device)
            token_type_ids = batch_i['token_type_ids'].to(device)

            # model output
            outputs = model(input_ids, attention_mask, token_type_ids)

            # predict
            _, predict = outputs.max(1)
            
            start_idx = batch_idx * input_ids.size(0)
            end_idx = (batch_idx+1) * input_ids.size(0)
            preds[start_idx:end_idx] = predict.detach().cpu().numpy()
            
            
            # massage
            progress_bar(current=batch_idx, 
                         total=len(testloader),
                         term_width=100)
                         
    return preds
                         
def evaluate(preds, trues):
    return np.sum(preds == trues) / len(trues)

In [148]:
preds = predict(model, testloader, device)



In [158]:
test_acc = evaluate(preds=preds, trues=test_df['similarity'].values)
print('Test Accuracy: {0:.3%}'.format(test_acc))

Test Accuracy: 84.670%


# Inference on custom sentences

In [209]:
def check_similarity(sentence1, sentence2, device):
    labels = ["contradiction", "entailment", "neutral"]
    
    sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
    test_data = BertSemanticDataset(
        sentence_pairs, 
        max_length=max_length, 
        tokenizer=tokenizer, 
        include_targets=False,
    )
    
    testloader = torch.utils.data.DataLoader(
        dataset=test_data,
        batch_size=1,
        shuffle=False,
        num_workers=4
    )

    test_input = next(iter(testloader))
    
    model.eval()
    output = model(test_input['input_ids'].to(device),
                   attention_mask=test_input['attention_mask'].to(device),
                   token_type_ids=test_input['token_type_ids'].to(device))
    
    output = torch.nn.functional.softmax(output, dim=1)
    proba, idx = output.max(1)
    
    print('Sentence1: ',sentence1)
    print('Sentence2: ',sentence2)
    print("{0:}: {1: .2%}".format(labels[idx].upper(), proba.item()))

In [210]:
sentence1 = "Two women are observing something together."
sentence2 = "Two women are standing with their eyes closed."
check_similarity(sentence1, sentence2, device)

Sentence1:  Two women are observing something together.
Sentence2:  Two women are standing with their eyes closed.
CONTRADICTION:  77.36%


In [211]:
sentence1 = "A smiling costumed woman is holding an umbrella"
sentence2 = "A happy woman in a fairy costume holds an umbrella"
check_similarity(sentence1, sentence2, device)

Sentence1:  A smiling costumed woman is holding an umbrella
Sentence2:  A happy woman in a fairy costume holds an umbrella
NEUTRAL:  90.67%


In [212]:
sentence1 = "A soccer game with multiple males playing"
sentence2 = "Some men are playing a sport"
check_similarity(sentence1, sentence2, device)

Sentence1:  A soccer game with multiple males playing
Sentence2:  Some men are playing a sport
ENTAILMENT:  94.69%
