## **Technical cells**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!pip install --upgrade torch
!pip install --upgrade wandb
!pip install --upgrade catalyst
!pip install --upgrade torchtext
!pip install --upgrade transformers
!wandb login c54b2fcb6b8ca2808f5be303a8a3b6e464f52cca

Requirement already up-to-date: torch in /usr/local/lib/python3.6/dist-packages (1.5.0+cu101)
Requirement already up-to-date: wandb in /usr/local/lib/python3.6/dist-packages (0.8.36)
Requirement already up-to-date: catalyst in /usr/local/lib/python3.6/dist-packages (20.5.1)
Requirement already up-to-date: torchtext in /usr/local/lib/python3.6/dist-packages (0.6.0)
Requirement already up-to-date: transformers in /usr/local/lib/python3.6/dist-packages (2.10.0)
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[32mSuccessfully logged in to Weights & Biases![0m


In [3]:
import torch
import torchtext
import transformers
import wandb
import catalyst
print("Torchtext Version:", torchtext.__version__)
print("PyTorch Version:", torch.__version__)
print("Catalyst Version:", catalyst.__version__)
print("Wandb Version:", wandb.__version__)
print("Transformers Version:", transformers.__version__)

Torchtext Version: 0.6.0
PyTorch Version: 1.5.0+cu101
Catalyst Version: 20.05.1
Wandb Version: 0.8.36
Transformers Version: 2.10.0


In [0]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [0]:
def get_device():
	if torch.cuda.is_available():
		return torch.device('cuda:0')
	else:
		return torch.device('cpu')

device = get_device()


## **Introducing cells**

In [6]:
import pdb
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torchtext import data
from torchtext import datasets
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm
from catalyst import dl
import wandb
import sklearn.datasets
import re
import copy
import time
import copy

DEVICE = 'cuda'
COURSEWORK_PATH = 'drive/My Drive/text-augmentation/'
MAX_SENTENCES_LEN = 40
SPECIAL_TOKENS = ['[SEP]', '[CLS]', '[MASK]']
SYNONYMS_NUMBER = 5


numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject


numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject


numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject



Let's fix random seed to perform augmentation deterministically:

In [0]:
np.random.seed(2007)

## **Examples of using HuggingFace**

In [0]:
from transformers import BertModel, BertTokenizer

In [0]:
premise = 'You have considerably more knowledge about this extremely difficult subject.'
hypothesis = 'I feel that you are far more qualified than I am.'

**TODO:** consider the work principle of Bert Tokenizer.

**(1.)** Take a look on a standard BERT sentence encoding:

In [10]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
premise_tokens = tokenizer.tokenize(premise)
hypothesis_tokens = tokenizer.tokenize(hypothesis)
tokens = ['[CLS]'] + premise_tokens + ['[SEP]'] + hypothesis_tokens + ['[SEP]']
print(tokens)
print(len(tokens))

['[CLS]', 'you', 'have', 'considerably', 'more', 'knowledge', 'about', 'this', 'extremely', 'difficult', 'subject', '.', '[SEP]', 'i', 'feel', 'that', 'you', 'are', 'far', 'more', 'qualified', 'than', 'i', 'am', '.', '[SEP]']
26


**(2.)** Set the paddings to the sentence. `MAX_SENTENCES_LEN` defines the size of input pair to the bert. <br>
**Note:** attention mask here helps us skip the value of `[PAD]` token:

In [0]:
def apply_padding(tokens):
    global MAX_SENTENCES_LEN
    return tokens + ['[PAD]' for _ in range(MAX_SENTENCES_LEN - len(tokens))]

In [12]:
padded = apply_padding(tokens)
attention = [1 if token != '[PAD]' else 0 for token in padded]
print("Attention mask:", attention)

Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


**(3.)** Converting paddings to their id's:

In [13]:
ids = tokenizer.convert_tokens_to_ids(padded)
print("IDs for sentence:", ids)
print("len of sentence:", len(ids))

IDs for sentence: [101, 2017, 2031, 9839, 2062, 3716, 2055, 2023, 5186, 3697, 3395, 1012, 102, 1045, 2514, 2008, 2017, 2024, 2521, 2062, 4591, 2084, 1045, 2572, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
len of sentence: 40


**(4.)** Create segmented tokens to differ premise and hypothesis:


In [14]:
segment = []
switcher = 0
for token in padded:
    segment.append(switcher)
    if token == '[SEP]':
        switcher = 1
print("Segment:", segment)

Segment: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [0]:
ids = torch.tensor(ids).unsqueeze(0)
attention = torch.tensor(attention).unsqueeze(0)
segment = torch.tensor(segment).unsqueeze(0)

## **Usage of BertForMaskedLM**

**(5.)** Apply masking to the sentence:

In [0]:
# If the we spent more than 3 * number_to_mask iterations without success, let's inform user and return
def mask_random_word(tokens, number_to_mask=1):
    masked_tokens = copy.deepcopy(tokens)
    index = np.random.randint(len(masked_tokens))
    cnt, done = 0, 0
    while cnt < number_to_mask * 3 and done < number_to_mask:
        if masked_tokens[index] not in SPECIAL_TOKENS:
            masked_tokens[index] = '[MASK]'
            done += 1
        index = np.random.randint(len(masked_tokens))
        cnt += 1
    if cnt == number_to_mask * 3:
        print(f"tokens={tokens} were not masked due to iterations exceed")
    return masked_tokens

In [0]:
# here we apply a mask directly
def set_mask(tokens, index):
    tokens[index] = '[MASK]'
    return tokens

In [18]:
masked_sentence = apply_padding(mask_random_word(tokens))
masked_indices = np.where(np.array(masked_sentence) == '[MASK]')[0]
print("Masked sentence:", masked_sentence)
print("Masked indices:", masked_indices)

Masked sentence: ['[CLS]', 'you', 'have', 'considerably', 'more', 'knowledge', 'about', 'this', 'extremely', 'difficult', 'subject', '.', '[SEP]', '[MASK]', 'feel', 'that', 'you', 'are', 'far', 'more', 'qualified', 'than', 'i', 'am', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Masked indices: [13]


**(6.)** Encode masked sentence in order to apply it into Bert:

In [19]:
masked_ids = tokenizer.convert_tokens_to_ids(masked_sentence)
print(masked_ids)
masked_ids = torch.tensor(masked_ids).unsqueeze(0)

[101, 2017, 2031, 9839, 2062, 3716, 2055, 2023, 5186, 3697, 3395, 1012, 102, 103, 2514, 2008, 2017, 2024, 2521, 2062, 4591, 2084, 1045, 2572, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


Load Bert for Masked LM and output the example set of synonyms.

### **Here we see the inner structure of the model:**

In [20]:
from transformers import BertForMaskedLM
pretrained_model = "bert-base-uncased"

model = BertForMaskedLM.from_pretrained(pretrained_model)
model.eval()

BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

### **Switching to GPU**:

In [21]:
# If you have a GPU, put everything on cuda
ids = ids.to(DEVICE)
masked_ids = masked_ids.to(DEVICE)
segment = segment.to(DEVICE)
attention = attention.to(DEVICE)

model.to(DEVICE)

print("Successfully switched to GPU")

Successfully switched to GPU


In [0]:
# outputs = model(input_ids=ids, attention_mask=attention, token_type_ids=segment, masked_lm_labels=masked_ids)
# loss, prediction_scores = outputs[:2]

In [0]:
# print(prediction_scores[:, masked_indices, :].squeeze(0).squeeze(0).size())

In [0]:
# sub_tensor_for_masked_indices = prediction_scores[:, masked_indices, :].squeeze(0)
# print(sub_tensor_for_masked_indices.size())
# synonyms_ids = sub_tensor_for_masked_indices.detach().numpy().reshape((-1,)).argsort()[-SYNONYMS_NUMBER:]
# print(synonyms_ids)
# result = tokenizer.convert_ids_to_tokens(torch.tensor(synonyms_ids).squeeze(0))
# print(synonyms_ids)

In [25]:
print(type(masked_ids))
print(type(attention))
print(type(segment))

<class 'torch.Tensor'>
<class 'torch.Tensor'>
<class 'torch.Tensor'>


In [36]:
print(masked_ids.size(), attention.size(), segment.size())

torch.Size([1, 40]) torch.Size([1, 40]) torch.Size([1, 40])


In [26]:
with torch.no_grad():
    outputs = model(input_ids=masked_ids, attention_mask=attention, token_type_ids=segment) #masked_lm_labels=masked_ids)
    #loss, predictions = outputs[:2]
    predictions = outputs[0]

predicted_index = torch.argmax(predictions[0, masked_indices[0]]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
print(masked_sentence)
print("The most appropriate word in [MASK] position:", predicted_token)

['[CLS]', 'you', 'have', 'considerably', 'more', 'knowledge', 'about', 'this', 'extremely', 'difficult', 'subject', '.', '[SEP]', '[MASK]', 'feel', 'that', 'you', 'are', 'far', 'more', 'qualified', 'than', 'i', 'am', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
The most appropriate word in [MASK] position: i


In [27]:
sorted_predictions = torch.argsort(predictions[0, masked_indices[0]], descending=True)
print("Five most appropriate words in [MASK] position:", tokenizer.convert_ids_to_tokens(sorted_predictions[:5]))

Five most appropriate words in [MASK] position: ['i', 'you', 'we', 'they', 'please']


## **After considering the example let's aggregate the functionality into classes**:

In [0]:
# Firstly we obtain only two ndarrays of sentences
class PreprocessingDataset:
    def __init__(self, seed=2007, premises=None, hyps=None, max_len=MAX_SENTENCES_LEN, verbose=False, GPU=True):
        np.random.seed(seed)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.premises = premises
        self.hyps = hyps
        assert self.premises.shape[0] == self.hyps.shape[0]
        self.size = self.premises.shape[0]

        # define maximum length
        mx = 0
        for i in range(self.size):
            mx = max(mx, len(self.premises[i]) + len(self.hyps[i]) + 3)
        self.max_len = max(mx, MAX_SENTENCES_LEN)

        # all lengths will have a bound of 256
        self.data = np.zeros((0, self.max_len)).astype('U256')
        self.GPU = GPU
        self.verbose = verbose
        if self.verbose:
            print("Preprocessing dataset columns number:", self.max_len)
    
    # just to define as a min(MAX_SENTENCE_LEN, max(premises, hyps))
    def find_max_len(premises, hyps):
        mx = 0
        for i in premises:
            mx = max(mx, len(prem))
        for hyp in hyps:
            mx = max(mx, len(hyp))
        return min(mx, MAX_SENTENCES_LEN)
    
    def Encode(self, premise, hyp):
        premise_tokens = tokenizer.tokenize(premise)
        hypothesis_tokens = tokenizer.tokenize(hypothesis)
        return ['[CLS]'] + premise_tokens + ['[SEP]'] + hypothesis_tokens + ['[SEP]']

    def ConvertToIDs(self, tokenized_sentence):
        return self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

    def GetMaskingId(self):
        return self.tokenizer.convert_tokens_to_ids(['[MASK]'])[0]

    def AddPaddings(self, tokens):
        return tokens + ['[PAD]' for _ in range(self.max_len - len(tokens))]

    def ConvertToMatrix(self):
        for i in tqdm(range(self.size)):
            sentence = self.Encode(self.premises[i], self.hyps[i])
            assert len(sentence) <= self.max_len
            sentence = self.AddPaddings(sentence)
            self.data = np.vstack((self.data, np.array(sentence).astype('U256')))
        
        if self.verbose:
            print("self.ConvertToMatrix successfully finished.")
            print("self.data[0, 0].dtype:", self.data[0, 0].dtype)
            print("5x20 elements of data:")
            print(self.data[:5, :20])
        return
    
    def GetSegment(self):
        assert self.data is not None
        def segment_for_one_row(row):
            left, right = np.argwhere(row == '[SEP]')
            assert left.shape[0] == 1
            assert right.shape[0] == 1
            left = left[0]
            right = right[0]
            row[left + 1: ] = 1
            row[0: left + 1] = 0
            return row
        segment_matrix = np.copy(self.data)
        segment_matrix = np.apply_along_axis(segment_for_one_row, 1, segment_matrix)
        segment_matrix = np.vectorize(int)(segment_matrix)

        if self.verbose:
            print("self.GetSegment successfully finished.")
            print("segment_matrix[0, 0].dtype:", segment_matrix[0, 0].dtype)
            print("5x20 elements of segment:")
            print(segment_matrix[:5, :20])
        return segment_matrix
    
    def GetAttention(self):
        assert self.data.shape[0] > 0
        assert self.data.shape[1] > 0
        attention = np.vectorize(int)(self.data != '[PAD]')
        if self.verbose:
            print("self.GetAttention successfully finished.")
            print("5x5 elements of attention:")
            print(attention[:5, :5])
        return attention
    
    # Get Masking Matrix And Mask All Possible Items
    def MaskIndices(self, indices):
        matrix_indices = np.zeros((self.size, self.max_len))
        for pair in indices:
            matrix_indices[pair] = 1
        masked_data = copy.deepcopy(self.data)
        masked_data[matrix_indices] = '[MASK]'
        masked_indices = np.argwhere(masked_data == '[MASK]')

        if self.verbose:
            print("self.MaskIndices successfully finished.")
        return masked_data, masked_indices

    def PrepareData(self, ids, segment, attention):
        tensor_ids = torch.tensor(ids).unsqueeze(0)
        tensor_segment = torch.tensor(segment).unsqueeze(0)
        tensor_attention = torch.tensor(attention).unsqueeze(0)
        if self.GPU:
            tensor_ids = tensor_ids.to(DEVICE)
            tensor_segment = tensor_segment.to(DEVICE)
            tensor_attention = tensor_attention.to(DEVICE)
        return tensor_ids, tensor_segment, tensor_attention
    
    def ConvertMatrixToIDs(self, matrix):
        data_shape = matrix.shape
        list_to_ids = np.squeeze(matrix.reshape((-1, 1))).tolist()

        ids = self.ConvertToIDs(list_to_ids)
        ids = np.array(ids).reshape(data_shape)
        return ids
    
    def ConvertIDsToMatrix(tokenized_words):
        np_tokenized_words = tokenized_words.numpy()
        tokens = np.apply_along_axis(self.tokenizer.convert_tokens_to_ids, 1, np_tokenized_words)
        return tokens

    # pipeline returns 3 preprocessed matrices + data converted to ids rather than words
    def Pipeline(self):
        assert self.premises.shape[0] > 0
        assert self.hyps.shape[0] > 0

        self.ConvertToMatrix()

        ids = self.ConvertMatrixToIDs(self.data)

        segment = self.GetSegment()
        attention = self.GetAttention()
        return self.data, segment, attention, ids

    def GetData(self):
        return self.data

    def SetData(self, data):
        self.data = data
    
    def UpdateData(self, premise, hyp):
        sentence = self.Encode(self.premises[i], self.hyps[i])
        assert len(sentence) <= self.max_len
        sentence = self.AddPaddings(sentence)
        if self.GPU:
            sentence = sentence.to(DEVICE)
        self.data = np.vstack((self.data, sentence))

## **Now we implement a class with a functionality:**
*    predicting top $K$ words;
*    apply transformation and predictions to a several sentences;

In [0]:
# preload models once
BERT_TOKENIZER = BertTokenizer.from_pretrained('bert-base-uncased')
BERT_MLM = BertForMaskedLM.from_pretrained('bert-base-uncased')

In [0]:
class RangingMaskingPrediction:
    def __init__(self, seed=2007, top_words=5, save_probability=0.2, preprocessing_dataset=None, GPU=True):
        global BERT_TOKENIZER, BERT_MLM
        np.random.seed(seed)
        self.top_words = top_words
        self.tokenizer = BERT_TOKENIZER
        self.basic_model = BERT_MLM
        self.save_probability = save_probability
        self.preprocessing_dataset = preprocessing_dataset
        self.GPU = GPU

        if self.GPU:
            self.basic_model.to(DEVICE)

    def transform_sentence(self, sentence, segment, attention):
        for i, word in enumerate(sentence):
            sentence[i] = self.tokenizer.convert_tokens_to_ids(['[MASK]'])[0]
            sentenceGPU, segmentGPU, attentionGPU = \
                self.preprocessing_dataset.PrepareData(sentence, segment, attention)

            with torch.no_grad():
                outputs = self.basic_model(input_ids=sentenceGPU, attention_mask=attentionGPU, token_type_ids=segmentGPU)
                predictions = outputs[0]

            top_predicted_ids = torch.argsort(predictions[0, i], descending=True)[:self.top_words]
            proba = np.random.uniform()
            if proba > self.save_probability:
                # TODO: different distributions over top words
                sentence[i] = top_predicted_ids[np.random.randint(self.top_words)]
            else:
                sentence[i] = word
        return sentence

    def TransformData(self):
        _, segment, attention, tokenized_words = self.preprocessing_dataset.Pipeline()

        for i in tqdm(range(self.preprocessing_dataset.size)):
            new_sentence = self.transform_sentence(tokenized_words[i, :], segment[i, :], attention[i, :])
            tokenized_words[i, :] = new_sentence
        
        masked_ids = self.preprocessing_dataset.ConvertIDsToMatrix(tokenized_words)
        return masked_ids, segment, attention

## **Now we preprocess a short text corpus to see that the functionality works properly.**

In [73]:
corpus = pd.read_csv("drive/My Drive/text-augmentation/aug_data/synonym-default.csv", nrows=50)
print(corpus.head())
print(corpus.shape)

                                             premise  ...          label
0  a individual on a gymnastic horse jumps comple...  ...        neutral
1  a person on a horse cavalry jump over a wiped ...  ...  contradiction
2  a person on a horse leap all over a broken dow...  ...     entailment
3                children smiling and wave at camera  ...        neutral
4                children smile and waving at camera  ...     entailment

[5 rows x 3 columns]
(50, 3)


In [74]:
preprocessing_dataset = PreprocessingDataset(premises=corpus['premise'].values, hyps=corpus['hypothesis'].values,
                                             verbose=True)
print("Preprocessing is done!")

Preprocessing dataset columns number: 229
Preprocessing is done!


In [75]:
ranging_masking_prediction = RangingMaskingPrediction(preprocessing_dataset=preprocessing_dataset)
tensor_masked_ids, tensor_segment, tensor_attention = ranging_masking_prediction.TransformData()

HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))


self.ConvertToMatrix successfully finished.
self.data[0, 0].dtype: <U5
5x20 elements of data:
[['[CLS]' 'a' 'individual' 'on' 'a' 'gymnast' '##ic' 'horse' 'jumps'
  'complete' 'a' 'broken' 'down' 'airplane' '.' '[SEP]' 'i' 'feel' 'that'
  'you']
 ['[CLS]' 'a' 'person' 'on' 'a' 'horse' 'cavalry' 'jump' 'over' 'a'
  'wiped' 'out' 'down' 'airplane' '.' '[SEP]' 'i' 'feel' 'that' 'you']
 ['[CLS]' 'a' 'person' 'on' 'a' 'horse' 'leap' 'all' 'over' 'a' 'broken'
  'down' 'airplane' '.' '[SEP]' 'i' 'feel' 'that' 'you' 'are']
 ['[CLS]' 'children' 'smiling' 'and' 'wave' 'at' 'camera' '[SEP]' 'i'
  'feel' 'that' 'you' 'are' 'far' 'more' 'qualified' 'than' 'i' 'am' '.']
 ['[CLS]' 'children' 'smile' 'and' 'waving' 'at' 'camera' '[SEP]' 'i'
  'feel' 'that' 'you' 'are' 'far' 'more' 'qualified' 'than' 'i' 'am' '.']]
self.GetSegment successfully finished.
segment_matrix[0, 0].dtype: int64
5x20 elements of segment:
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
 [0 

HBox(children=(FloatProgress(value=0.0, max=50.0), HTML(value='')))

KeyboardInterrupt: ignored

## **Algorithm for creating an augmentation add-on to the source dataset**