# BERT embeddings and LSTM model for predicting user rating of hotels based on the review text
The dataset was provided by HSE "Intro to Deep Learning" [course](http://wiki.cs.hse.ru/Основы_глубинного_обучения).


In [1]:
#imports and setting up wandb

import torch 
import re
import string
import random
import wandb 

import transformers as ts
import pandas as pd
import numpy as np
import seaborn as sns

from tqdm import tqdm
from torch import nn
from torch.nn import functional as F
from torchtext.legacy import data
from torchtext.legacy import datasets
from sklearn.model_selection import train_test_split

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
TRAIN_PATH = '../input/hotelreviews/train.csv'

wandb.login(key='XXX') #placeholder key

[34m[1mwandb[0m: W&B API key is configured (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

The train dataset consists of 100.000 reviews, each review consists of two separate parts: positive feedback and negative feedback. The rating is provided in the `score` column - a real number in range of 0-10.

In [2]:
reviews = pd.read_csv(TRAIN_PATH).drop(['review_id'], axis=1)
reviews.head(10)

Unnamed: 0,negative,positive,score
0,There were issues with the wifi connection,No Positive,7.1
1,TV not working,No Positive,7.5
2,More pillows,Beautiful room Great location Lovely staff,10.0
3,Very business,Location,5.4
4,Rooms could do with a bit of a refurbishment ...,Nice breakfast handy for Victoria train stati...,6.7
5,Hotel is under reconstruction and should be c...,Location is excellent for congress activities,6.3
6,Noise from the trains and road but ok for one...,Great location to tube station and local shop...,8.8
7,No Negative,Great location friendly staff and lovely acco...,10.0
8,I known you re renovating but having concierg...,Staff were super helpful and friendly Great l...,9.2
9,Location of room no phone signal,friendly staff,6.7


In [3]:
def seed_all(seed_value): #function to fix random seed
    random.seed(seed_value) 
    np.random.seed(seed_value)
    torch.manual_seed(seed_value) 
    if torch.cuda.is_available() :
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) 
        torch.backends.cudnn.deterministic = True 
        torch.backends.cudnn.benchmark = False

We'll do a bit of data cleansing, such as removing some bad symbols and converting the text to lowercase. This dataset does not have punctuation included, so we don't have to worry about handling that.

In [4]:
def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text) #non-ASCII
    text = re.sub('[\\r\\t\\n]+', ' ', text) #delimiters
    return ' '.join(text.lower().split())

To work with one text we will concatenate both parts of the review with a special `BERT` separator token `[SEP]` in between. 

In [5]:
def get_train_val_data():
    df = pd.read_csv(TRAIN_PATH).drop(['review_id'], axis=1)    
    df['review'] = df.negative + ' [SEP] ' + df.positive
    df.review = df.review.apply(lambda x: clean_text(x))
    return train_test_split(df.review, df.score, test_size=0.2, random_state=21)
        
reviews_train, reviews_val, ratings_train, ratings_val = get_train_val_data()

**Defining basic `PyTorch` `Dataset`.**

Since we will be handling vocabulary and tokenization on our own, there is no point in using `torchtext` `Field` and `TabularDataset`.

In [6]:
class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, reviews, ratings):
        self.reviews = reviews
        self.ratings = ratings
        
    def __len__(self):
        return len(self.ratings)
    
    def __getitem__(self, idx):
        return self.reviews[idx], self.ratings[idx]

**Creating `PyTorch` `DataLoader` istances**

In [7]:
BATCH_SIZES = (16, 256)

def get_train_val_dataloaders():
    train_dataset = ReviewsDataset(reviews_train.values, ratings_train.values)
    val_dataset = ReviewsDataset(reviews_val.values, ratings_val.values)
    
    train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZES[0], shuffle=True)
    val_dataloader = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZES[1], shuffle=False)
    
    return train_dataloader, val_dataloader

train_dataloader, val_dataloader = get_train_val_dataloaders()

**Downloading pretrained `BERT` and `tokenizer`.**

Since we'll be using pretrained `BERT`, we will need to use exactly the same `tokenizer` which it has been trained with. This is done by using `ts.AutoTokenizer.from_pretrained` and specifying model name as an argument.

In [8]:
%%capture
MODEL_NAME = 'distilbert-base-uncased'

def get_bert_and_tokenizer():
    BERT = ts.DistilBertModel.from_pretrained(MODEL_NAME, output_hidden_states=True).to(DEVICE).eval()
    TOKENIZER = ts.AutoTokenizer.from_pretrained(MODEL_NAME)
    return BERT, TOKENIZER

BERT, TOKENIZER = get_bert_and_tokenizer()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Defining the model.**

Instead of using an embedding layer to get embeddings for the text, we'll be using the pretrained `BERT` model. These embeddings will then be fed to the same `LSTM` architecture used in the previous notebook.

Within the forward pass, we wrap the transformer in a `no_grad` to ensure no gradients are calculated over this part of the model. 

Input sequence is tokenized with the following parameters:
* `padding='longest'` - this will ensure that all instances in the batch will be padded to the length of a longest instance in the batch. 
* `truncation=True` - this will ensure that if the sequence is longer than the maximum length of a `BERT` model input then it will be truncated.
* `return_tensors='pt'` - this will ensure that the tokenizer outputs `PyTorch` tensors.

We retrieve the `input_ids` of internal `tokenizer` vocabulary for every word in a sequence as well as an `attention_mask` - in this case a mask which ignores padded elements - and feed it to the `BERT` model. 

There are a few different ways to get embedding representation of the input sequence as shown in the original [BERT paper](https://arxiv.org/pdf/1810.04805.pdf). The approach with taking second-to-last hidden state was chosen in this architecture.

Finally, a sequence of embeddings is fed to the same `LSTM` architecture used in the previous notebook.

In [9]:
class BERT_LSTM(torch.nn.Module) :
    def __init__(self, 
                 bert, 
                 tokenizer, 
                 lstm_hidden_dim=300, 
                 num_layers=1, 
                 bidirectional=True,
                 dropout=0.3):
        super().__init__()
        
        self.bert = bert
        self.tokenizer = tokenizer
        
        self.lstm = nn.LSTM(input_size=self.bert.config.to_dict()['dim'], 
                            hidden_size=lstm_hidden_dim, 
                            bidirectional=bidirectional, 
                            num_layers=num_layers,
                            batch_first=True)
        
        self.dropout = nn.Dropout(dropout)
        
        self.linear = nn.Linear(lstm_hidden_dim * 2, 1)
        
    def forward(self, text):
        with torch.no_grad():
            tokens = self.tokenizer(list(text), 
                                    padding='longest', 
                                    truncation=True, 
                                    return_tensors='pt')
            
            input_ids, attention_mask = tokens['input_ids'], tokens['attention_mask']
            input_ids, attention_mask = input_ids.to(DEVICE), attention_mask.to(DEVICE)
            
            outputs = self.bert(input_ids=input_ids,
                                attention_mask=attention_mask)
            hidden_state = outputs.hidden_states[-2]
        
        hidden_state = self.dropout(hidden_state)
        lstm_out, (hst, cst) = self.lstm(hidden_state)
        lstm_hidden = torch.cat([hst[-2, :, :], hst[-1, :, :]], dim=1)
        return self.linear(lstm_hidden).squeeze()

**Defining the model and moving it to GPU if available.**

In [10]:
seed_all(21)

HIDDEN_DIM = 300
NUM_LAYERS = 3
DROPOUT = 0.3
BIDIRECTIONAL = True

bert_lstm = BERT_LSTM(
    bert=BERT,
    tokenizer=TOKENIZER,
    lstm_hidden_dim=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    bidirectional=BIDIRECTIONAL,
    dropout=DROPOUT)

bert_lstm = bert_lstm.to(DEVICE)

**Freezing the `BERT` part of the model since we are only interested in embeddings.**

In [11]:
def freeze_bert():
    for name, param in bert_lstm.named_parameters():                
        if name.startswith('bert'):
            param.requires_grad = False
            
freeze_bert()

**Defining learning rate, optimizer, scheduler and loss function.**   

We'll be using MAE loss function.

In [12]:
LEARNING_RATE = 3e-4
LOSS_NAME = 'MAE'

optimizer = torch.optim.AdamW(bert_lstm.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.98)
criterion = F.l1_loss

**Defining the train loop.**

`predict` function will return three tensors:  
* `losses` tensor with the loss value for every element of iterator  
* `predicted_ratings` tensor with the model predictions  
* `true_ratings` tensor with the actual ratings

We'll also be [exporting](https://wandb.ai/porfiry/hotels-reviews?workspace=user-porfiry) the results to `wandb` to easily track and visualise the metrics.

In [13]:
def train_one_epoch(model, train_iterator, criterion, optimizer, scheduler=None):
    model.train()
    for reviews, ratings in tqdm(train_iterator):
        ratings = ratings.to(DEVICE)
        predictions = model(reviews)
        loss = criterion(predictions, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if scheduler != None:
        scheduler.step()

def predict(model, iterator, criterion):
    model.eval()
    predicted_ratings = torch.tensor([], device=DEVICE)
    true_ratings = torch.tensor([], device=DEVICE)
    losses = torch.tensor([], device=DEVICE)
    with torch.no_grad():
        for reviews, ratings in tqdm(iterator):
            ratings = ratings.to(DEVICE)
            batch_predictions = model(reviews)
            predicted_ratings = torch.cat([predicted_ratings, batch_predictions])
            true_ratings = torch.cat([true_ratings, ratings])
            batch_losses = criterion(batch_predictions, ratings, reduction='none')
            losses = torch.cat([losses, batch_losses])
    return losses, predicted_ratings, true_ratings

def train(model, train_iterator, criterion, optimizer, n_epochs=10, 
          val_iterator=None, scheduler=None, project=None, resume=False, params=None):
    
    # resume is a flag to continue the same wandb run in case we want to continue training
    # the model after initial train loop
    if resume:
        wandb.init(project=project, resume='must', id=wandb.run.id, config=params)
    else:
        wandb.init(project=project, config=params)
        
    wandb.watch(model)

    for epoch in range(n_epochs):
        train_one_epoch(model, train_iterator, criterion, optimizer, scheduler)
        if val_iterator != None:
            val_loss, val_preds, val_ratings = predict(model, val_iterator, criterion)
            print(f'Epoch: {epoch+1}, validation {LOSS_NAME}: {val_loss.mean()}')
        train_loss, train_preds, train_ratings = predict(model, train_iterator, criterion)
        print(f'Epoch: {epoch+1}, train {LOSS_NAME}: {train_loss.mean()}')
        
        # logging has to be done separately because we can't keep track of steps for wandb
        # in case of continuing training after initial train loop
        if val_iterator != None:
            wandb.log({
              f'mean train {LOSS_NAME}': train_loss.mean(),
              f'mean val {LOSS_NAME}': val_loss.mean()
              }) 
        else:
            wandb.log({
              f'mean train {LOSS_NAME}': train_loss.mean()
              }) 

**Beginning training.**

In [14]:
def log_params(): #log model and training parameters to wandb
    params = {}
    params['bidirectional'] = BIDIRECTIONAL
    params['num layers'] = NUM_LAYERS
    params['hidden dim'] = HIDDEN_DIM
    params['dropout'] = DROPOUT
    params['learning rate'] = LEARNING_RATE
    params['train batch size'] = BATCH_SIZES[0]
    params['model'] = bert_lstm
    params['optimizer'] = optimizer
    params['scheduler'] = scheduler
    return params

def start_training(params):
    resume = True if (str.lower(input('RESUME PREVIOUS RUN? ')) == 'y') else False
    print('Resuming previous run:', resume)
    
    n_epochs = int(input('N EPOCHS? '))
    print('Training', n_epochs, 'epochs', end='\n')

    train(bert_lstm, 
          train_dataloader, 
          criterion, 
          optimizer, 
          n_epochs=n_epochs,
          val_iterator=val_dataloader,
          scheduler=scheduler,
          project='hotels-reviews', 
          resume=resume, 
          params=params)

params = log_params()
start_training(params)

RESUME PREVIOUS RUN?  no


Resuming previous run: False


N EPOCHS?  10


[34m[1mwandb[0m: Currently logged in as: [33mporfiry[0m (use `wandb login --relogin` to force relogin)


Training 10 epochs


100%|██████████| 5000/5000 [08:44<00:00,  9.53it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 1, validation MAE: 0.8343373537063599


100%|██████████| 5000/5000 [05:12<00:00, 16.01it/s]


Epoch: 1, train MAE: 0.8318880796432495


100%|██████████| 5000/5000 [08:40<00:00,  9.60it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 2, validation MAE: 0.7672584652900696


100%|██████████| 5000/5000 [05:10<00:00, 16.11it/s]


Epoch: 2, train MAE: 0.7548600435256958


100%|██████████| 5000/5000 [08:41<00:00,  9.58it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 3, validation MAE: 0.7461735010147095


100%|██████████| 5000/5000 [05:10<00:00, 16.10it/s]


Epoch: 3, train MAE: 0.7248134016990662


100%|██████████| 5000/5000 [08:42<00:00,  9.57it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 4, validation MAE: 0.789193868637085


100%|██████████| 5000/5000 [05:09<00:00, 16.13it/s]


Epoch: 4, train MAE: 0.7558698058128357


100%|██████████| 5000/5000 [08:42<00:00,  9.57it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 5, validation MAE: 0.7319957613945007


100%|██████████| 5000/5000 [05:10<00:00, 16.11it/s]


Epoch: 5, train MAE: 0.6908090710639954


100%|██████████| 5000/5000 [08:40<00:00,  9.60it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 6, validation MAE: 0.730614960193634


100%|██████████| 5000/5000 [05:10<00:00, 16.12it/s]


Epoch: 6, train MAE: 0.6758401393890381


100%|██████████| 5000/5000 [08:40<00:00,  9.60it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 7, validation MAE: 0.7352918386459351


100%|██████████| 5000/5000 [05:09<00:00, 16.17it/s]


Epoch: 7, train MAE: 0.6714523434638977


 52%|█████▏    | 2602/5000 [04:29<04:11,  9.53it/s]wandb: Network error (ReadTimeout), entering retry loop.
100%|██████████| 5000/5000 [08:39<00:00,  9.62it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 8, validation MAE: 0.7214157581329346


100%|██████████| 5000/5000 [05:10<00:00, 16.12it/s]


Epoch: 8, train MAE: 0.6377750039100647


100%|██████████| 5000/5000 [08:42<00:00,  9.57it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 9, validation MAE: 0.717763364315033


100%|██████████| 5000/5000 [05:09<00:00, 16.15it/s]


Epoch: 9, train MAE: 0.6236904859542847


100%|██████████| 5000/5000 [08:41<00:00,  9.59it/s]
100%|██████████| 79/79 [01:55<00:00,  1.46s/it]


Epoch: 10, validation MAE: 0.7225016951560974


100%|██████████| 5000/5000 [05:10<00:00, 16.10it/s]


Epoch: 10, train MAE: 0.6116275787353516


As a result minor improvement over GloVe embeddings has been achieved, but at the cost of a very high computational and time complexity.