# References

1. http://jalammar.github.io/illustrated-bert/
1. https://mccormickml.com/2019/07/22/BERT-fine-tuning/

# Problem statement

The aim is to develop a machine learning algorithm to predict whether a tweet is about a real disaster or not.

# Aproach

Transfer learning technique is used to perform the text classification problem. We load pretrained BERT model and finetune the weights.

## Advantages of fine-tuning

* **Time** - Pretrained BERT model weights already encode a lot of information. As a result, it takes much less time to finetune the model

* **Data** - As the pretrained model is trained on large text, the model performs well even with small datasets.

We don't go into the details of BERT architecture. Here is an overview about how BERT is pretrained, and how it can be used for classification.


### BERT (Bidirectional Encoder Representations from Transformers)

Language modeling is a common method of pretraining on unlabeled text (self supervised learning). Most of the language models learned by iteratively predicting next word in a sequence auto regressively across enormous data sets of text like wikepedia. This can be left to right, right to left or bi-directional. 

There are two strategies of applying pretrained language representations to downstream tasks:

1. Feature based approach
1. Fine tuning approach

The feauture based approach, such as **ELMo** uses task specific architectures that include the pretrained representations as additional features.

The fine tuning approach, such as **OpenAI GPT**, introduces minimal task specific parameters, and is trained on the downstream task by fine tuning all the pretrained parameters.

BERT model can be used for both the approaches. BERT reformulates the language modeling pretrained task of iteratively predicting the next word in sequence to instead incorporate bidirectional context and predict mask of intermediate tokens of the sequence and predict the mask token. BERT presented a new self supervised learning task for pretaining transformers inorder to fine tune them for different tasks. They major difference between BERT and prior methods of pretraining transformer models is using the bidirectional context of language modeling. Most of the models either move left to right or right to left to predict next word in sequence, where BERT tries to learn intermediate tokens (by MASK), making the name Bidirectional Encoder.



BERT uses Masked language model and also use "Next sentence prediction" task.

BERT uses 3 embeddings to compute the input representations. They are token embeddings, segment embeddings and position embeddings. 

BERT Transformer will preserve the length of the (dimention of the) input. The final output will take this vector and pass these to seperate tasks (classification, in this case).

# BERT for Classification

BERT consists of stacked encoder layers. Just like the input of encoder of the transformer model, BERT model takes the sequence of numeric representation of the tokens as input. For classification tasks, we must prepend the special [CLS] token to the beginning of every sentence. 

Encoder block of transformer outputs a vector with same length as of input. First position of the vector, corresponding to the [CLS] token, can now be used as the input for a classifier. 

In [2]:
import numpy as np
import pandas as pd
import time
import tqdm
import datetime
import gc
import random
import nltk
from nltk.corpus import stopwords
import re

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler,random_split
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

import transformers
from transformers import BertForSequenceClassification, AdamW, BertConfig,BertTokenizer,get_linear_schedule_with_warmup


In [3]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=0)

In [4]:
df = pd.read_csv("../data/raw/train.csv")

In [5]:
df['2category'].isna().sum()

18362

In [6]:
df

Unnamed: 0.1,Unnamed: 0,sentence,1category,2category,sentiment
0,4754,При этом всегда получал качественные услуги.,Communication,,+
1,4417,"Не вижу, за что хотя бы 2 поставить, сервис на 1!",?,,−
2,3629,"Вот так ""Мой любимый"" банк МКБ меня обманул.",?,,−
3,11640,Отвратительное отношение к клиентам.,Communication,,−
4,5571,"Всегда в любое время дня и ночи помогут, ответ...",Communication,,+
...,...,...,...,...,...
19356,8004,Никогда и ни в коем случае не открывайте счет ...,Communication,,−
19357,18182,ТИ откровенно забили на качество и развивают с...,Quality,,−
19358,744,"Я считаю, это прорыв и лидерство финансовых ус...",?,,+
19359,6220,"Писал мужчина очень доходчиво, не финансовым я...",Communication,,+


In [7]:
df[df['2category'].notna()]

Unnamed: 0.1,Unnamed: 0,sentence,1category,2category,sentiment
32,13007,"Начну с того, что я пользовался и пользуюсь ус...",Price,Quality,+
86,18828,Точка идеально походит для таких «чайников» ка...,Quality,Communication,+
122,19214,"Открывали счет 2 недели... Открыли, пока готов...",Quality,Price,−
157,19220,Итого что имеем обещанная ставка выросла более...,Quality,Price,−
175,18944,"Резюме: не ходите в Росбанк, он очень непорядо...",Quality,Communication,−
...,...,...,...,...,...
19190,12378,Почему работают неквалифицированные специалист...,Communication,Quality,−
19248,18821,Это реально круто.2) Очень грамотные менеджеры...,Quality,Communication,+
19311,12593,"Ответа Банка я так и не получила, и, хуже того...",Communication,Safety,−
19335,4600,* Удобство: 10 из 10* Работа сотрудников: 10 и...,Communication,Quality,+


# Data preprocessing

We are using custom functions to perform the following tasks. Cleaning up the data for modeling should be carried out carefully and with the help of subject matter experts, if possible. This cleaning is done completely based on observation, and can not be considered as a generic preprocessing step for all the NLP tasks. This preprocessing function ensures:

* Removing urls from tweet
* Removing html tags
* Removing punctuations
* Removing stopwords
* Removing emoji

In [8]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
sw = stopwords.words('russian')

def clean_text(text):
    
    text = text.lower()
    
    text = re.sub(r"[^а-яА-Я?.!,¿]+", " ", text) # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",")

    text = re.sub(r"http\S+", "",text) #Removing URLs 
    #text = re.sub(r"http", "",text)
    
    html=re.compile(r'<.*?>') 
    
    text = html.sub(r'',text) #Removing html tags
    
    punctuations = '@#!?+&*[]-%.:/();$=><|{}^' + "'`" + '_'
    for p in punctuations:
        text = text.replace(p,'') #Removing punctuations
        
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    
    text = " ".join(text) #removing stopwords
    
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text) #Removing emojis
    
    return text

In [10]:
df['text'] = df.sentence

In [11]:
df['text'] = df['text'].apply(lambda x: clean_text(x))

In [12]:
df['sentiment'].unique()

array(['+', '−', '?'], dtype=object)

In [13]:
dict_lavel = {'+':2, '?':1, '−':0}
df['target'] = df['sentiment'].apply(lambda x: dict_lavel[x])

In [14]:
df

Unnamed: 0.1,Unnamed: 0,sentence,1category,2category,sentiment,text,target
0,4754,При этом всегда получал качественные услуги.,Communication,,+,получал качественные услуги,2
1,4417,"Не вижу, за что хотя бы 2 поставить, сервис на 1!",?,,−,"вижу, хотя поставить, сервис",0
2,3629,"Вот так ""Мой любимый"" банк МКБ меня обманул.",?,,−,любимый банк мкб обманул,0
3,11640,Отвратительное отношение к клиентам.,Communication,,−,отвратительное отношение клиентам,0
4,5571,"Всегда в любое время дня и ночи помогут, ответ...",Communication,,+,"любое время дня ночи помогут, ответят, решат",2
...,...,...,...,...,...,...,...
19356,8004,Никогда и ни в коем случае не открывайте счет ...,Communication,,−,коем случае открывайте счет недостойном довери...,0
19357,18182,ТИ откровенно забили на качество и развивают с...,Quality,,−,ти откровенно забили качество развивают свои м...,0
19358,744,"Я считаю, это прорыв и лидерство финансовых ус...",?,,+,"считаю, это прорыв лидерство финансовых услуг ...",2
19359,6220,"Писал мужчина очень доходчиво, не финансовым я...",Communication,,+,"писал мужчина очень доходчиво, финансовым язык...",2


In [15]:
tweets = df.text.values
labels = df.target.values

### BERT Tokenizer

In BERT, WordPiece tokenizer (a subword tokenizer) is used for tokenization. A word can be broken down into more than one sub-word, which helps in dealing with unknown words. For best results, it is adviced to tokenize with the same tokenizer the BERT model was trained on. 

Next, we need to convert each token to an id as present in the tokenizer vocabulary. If there’s a token that is not present in the vocabulary, the tokenizer will use the special [UNK] token and use its id.



In [16]:
from transformers import BertForMaskedLM, BertTokenizer, RobertaTokenizer, RobertaForMaskedLM

In [17]:
# Load the BERT tokenizer
#tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
#tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruRoberta-large")
tokenizer = RobertaTokenizer.from_pretrained('ai-forever/ruRoberta-large')

In [18]:
tokenizer.decode([819, 40564, 2280, 3895, 12142])

'получал качественные услуги'

In [19]:
print(' Original: ', tweets[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(tweets[0]))

# Print the sentence mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(tweets[0])))

 Original:  получал качественные услуги
Tokenized:  ['Ð¿Ð¾Ð»', 'ÑĥÑĩÐ°Ð»', 'ĠÐºÐ°ÑĩÐµ', 'ÑģÑĤÐ²ÐµÐ½Ð½ÑĭÐµ', 'ĠÑĥÑģÐ»ÑĥÐ³Ð¸']
Token IDs:  [819, 40564, 2280, 3895, 12142]


In [20]:
max_len = 0

# For every sentence...
for sent in tweets:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  156


In [21]:
input_ids = []
attention_masks = []

# For every tweet...
for tweet in tweets:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        tweet,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,           # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.from_numpy(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', tweets[0])
print('Token IDs:', input_ids[0])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Original:  получал качественные услуги
Token IDs: tensor([    1,   819, 40564,  2280,  3895, 12142,     2,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,

In [22]:
input_ids[0].shape

torch.Size([156])

#### Train-validation split
80% of data is split into train and 20% to validation sets.

In [23]:

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.8 * len(dataset))
#val_size = int(0.2 * len(dataset))
val_size = len(dataset)  - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

15,488 training samples
3,873 validation samples


In [24]:

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 16

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [25]:

# Load BertForSequenceClassification, the pretrained BERT model with a single 
# linear classification layer on top. 
model = RobertaForMaskedLM.from_pretrained(
    "ai-forever/ruRoberta-large", # Use the 12-layer BERT model, with an uncased vocab.
     num_labels = 3, # The number of output labels--2 for binary classification.
                    # You can increase this for multi-class tasks.   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# if device == "cuda:0":
# # Tell pytorch to run this model on the GPU.
#     model = model.cuda()
model = model.to(device)

In [26]:
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )



# Fine tuning the model

In [27]:

# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

In [28]:
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [29]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [30]:
#torch.ones((1, batch_size)).to(device)

In [31]:
# b_labels

In [32]:
# b_labels[1]

In [33]:
# torch.cat([b_labels[i].item() * torch.ones((1, 156)).to(device) for i in range(batch_size)])

In [None]:
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    # Measure how long the training epoch takes.
    t0 = time.time()
    total_train_loss = 0
    model.train()
    for step, batch in tqdm.tqdm(enumerate(train_dataloader)):
        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the device using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        b_labels = torch.cat([b_labels[i].item() * torch.ones((1, 156)).to(device) for i in range(batch_size)]).long()
#         print(b_input_ids.shape)
#         print(b_input_mask.shape)
#         b_labels = batch[2].item() * torch.ones((1, batch_size)).to(device)
#         b_labels = torch.ones((1, 156)).long().to(device)
#         print(b_labels.shape)
        optimizer.zero_grad()
        output = model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)        
        loss = output.loss
        total_train_loss += loss.item()
        # Perform a backward pass to calculate the gradients.
        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()
        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epcoh took: {:}".format(training_time))
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.
    print("")
    print("Running Validation...")
    t0 = time.time()
    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()
    # Tracking variables 
    total_eval_accuracy = 0
    best_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        
            output= model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask,
                                   labels=b_labels)
        loss = output.loss
        total_eval_loss += loss.item()
        # Move logits and labels to CPU if we are using GPU
        logits = output.logits
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))
    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    if avg_val_accuracy > best_eval_accuracy:
        torch.save(model, f'../models/bert_{epoch_i}')
        best_eval_accuracy = avg_val_accuracy
    #print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    #print("  Validation took: {:}".format(validation_time))
    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )
print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


Training...


390it [05:35,  1.15it/s]

In [None]:
print('ok')

print('ok')

print('ok')

print('ok')

print('ok')

print('ok')


# Loading the best model

In [None]:
model = torch.load('bert_model')

# Submission

In [20]:
df_test = pd.read_csv('../input/nlp-getting-started/test.csv')
df_test['text'] = df_test['text'].apply(lambda x:clean_text(x))
test_tweets = df_test['text'].values

In [21]:
test_input_ids = []
test_attention_masks = []
for tweet in test_tweets:
    encoded_dict = tokenizer.encode_plus(
                        tweet,                     
                        add_special_tokens = True, 
                        max_length = max_len,           
                        pad_to_max_length = True,
                        return_attention_mask = True,
                        return_tensors = 'pt',
                   )
    test_input_ids.append(encoded_dict['input_ids'])
    test_attention_masks.append(encoded_dict['attention_mask'])
test_input_ids = torch.cat(test_input_ids, dim=0)
test_attention_masks = torch.cat(test_attention_masks, dim=0)

In [22]:
test_dataset = TensorDataset(test_input_ids, test_attention_masks)
test_dataloader = DataLoader(
            test_dataset, # The validation samples.
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [23]:
predictions = []
for batch in test_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        with torch.no_grad():        
            output= model(b_input_ids, 
                                   token_type_ids=None, 
                                   attention_mask=b_input_mask)
            logits = output.logits
            logits = logits.detach().cpu().numpy()
            pred_flat = np.argmax(logits, axis=1).flatten()
            
            predictions.extend(list(pred_flat))

In [24]:
df_output = pd.DataFrame()
df_output['id'] = df_test['id']
df_output['target'] =predictions
df_output.to_csv('submission.csv',index=False)