# GloVe embeddings and LSTM model for predicting user rating of hotels based on the review text
The dataset was provided by HSE "Intro to Deep Learning" [course](http://wiki.cs.hse.ru/Основы_глубинного_обучения).


In [1]:
#imports and setting up wandb

import torch 
import re
import string
import random
import wandb 

import pandas as pd
import numpy as np
import seaborn as sns

from tqdm import tqdm
from torch import nn
from torch.nn import functional as F
from torchtext.legacy import data
from torchtext.legacy import datasets

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
TRAIN_PATH = '../input/hotelreviews/train.csv'

wandb.login(key='XXX') #placeholder key

[34m[1mwandb[0m: W&B API key is configured (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

The training dataset consists of 100,000 reviews. Each review has two separate parts: positive feedback and negative feedback. The rating is given in the `score` column as a real number in the range 0 to 10.

In [2]:
reviews = pd.read_csv(TRAIN_PATH).drop(['review_id'], axis=1)
reviews.head(10)

| Unnamed: 0 | negative | positive | score |
|---|---|---|---|
| 0 | There were issues with the wifi connection | No Positive | 7.1 |
| 1 | TV not working | No Positive | 7.5 |
| 2 | More pillows | Beautiful room Great location Lovely staff | 10.0 |
| 3 | Very business | Location | 5.4 |
| 4 | Rooms could do with a bit of a refurbishment ... | Nice breakfast handy for Victoria train stati... | 6.7 |
| 5 | Hotel is under reconstruction and should be c... | Location is excellent for congress activities | 6.3 |
| 6 | Noise from the trains and road but ok for one... | Great location to tube station and local shop... | 8.8 |
| 7 | No Negative | Great location friendly staff and lovely acco... | 10.0 |
| 8 | I known you re renovating but having concierg... | Staff were super helpful and friendly Great l... | 9.2 |
| 9 | Location of room no phone signal | friendly staff | 6.7 |


In [3]:
def seed_all(seed_value): #function to fix random seed
    random.seed(seed_value) 
    np.random.seed(seed_value) 
    torch.manual_seed(seed_value) 
    if torch.cuda.is_available() :
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) 
        torch.backends.cudnn.deterministic = True 
        torch.backends.cudnn.benchmark = False

We'll do a bit of data cleaning: replacing non-ASCII characters with spaces, collapsing tabs and newlines into single spaces, and converting text to lowercase. This dataset does not include punctuation, so we don't have to worry about handling that.

In [4]:
def clean_text(text):
    text = re.sub(r'[^\x00-\x7F]+', ' ', text) #non-ASCII
    text = re.sub('[\\r\\t\\n]+', ' ', text) #delimiters
    return ' '.join(text.lower().split())
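
As a quick check of what the cleaning step does, here is a small self-contained sketch that repeats the function and applies it to a made-up string (the sample text is an assumption, not taken from the dataset). Note that each run of non-ASCII characters becomes a space, so an accented word is split in two.

```python
import re

def clean_text(text):
    # Same steps as the cell above: replace non-ASCII runs, collapse
    # tab/newline delimiters, lowercase, and normalise whitespace.
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    text = re.sub(r'[\r\t\n]+', ' ', text)
    return ' '.join(text.lower().split())

sample = 'Très  bon\tHotel\n'  # made-up string, not from the dataset
cleaned = clean_text(sample)
print(cleaned)  # tr s bon hotel
```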

To work with a single text, we concatenate both parts of the review with a special separator token `SEP` in between (`clean_text` then lowercases it to `sep`). An approach with two separate `LSTM` models for the positive and negative feedback was also tried but showed worse results.

In [5]:
def get_train_data():
    df = pd.read_csv(TRAIN_PATH).drop(['review_id'], axis=1)    
    df['review'] = df.negative + ' SEP ' + df.positive
    df.review = df.review.apply(clean_text)
    df.to_csv('train_merged.csv', index=False, columns=['review', 'score'])
    
get_train_data()

**Setting up `torchtext` `Field` structures to define how we will process the data.**

We'll be using the built-in `spaCy` tokenizer, and we will include the length of each text by specifying `include_lengths=True`. This will be useful later during the batching and padding steps.  

For the `RATING` field we will not be using a vocabulary since it's a real number.

In [6]:
REVIEW = data.Field(
    tokenize='spacy', 
    tokenizer_language='en_core_web_sm', 
    include_lengths=True,
    batch_first=True
)

RATING = data.LabelField(
    dtype=torch.float, 
    use_vocab=False, 
    preprocessing=float, 
    batch_first=True
)

**Creating train instance of `TabularDataset` and splitting to train and validation.**

In [7]:
train_fields = [('review', REVIEW), ('rating', RATING)]

train_data = data.TabularDataset(
    path='./train_merged.csv',
    format='csv',
    fields=train_fields,
    skip_header=True
)

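# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(21)
# the default split_ratio of 0.7 gives a 70/30 train/validation split
train_data, valid_data = train_data.split(random_state=random.getstate())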
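
A side note on the `random_state` argument: in Python's standard library, `random.seed` seeds the global generator in place and returns `None`, while `random.getstate` returns a state object that can be replayed. The sketch below uses only the `random` module to show the difference; it makes no claim about `torchtext` internals.

```python
import random

# random.seed() seeds the global generator in place and returns None.
seed_result = random.seed(21)

# random.getstate() returns the generator state, which can be replayed later.
state = random.getstate()
first = [random.random() for _ in range(3)]

random.setstate(state)  # restore the captured state
second = [random.random() for _ in range(3)]

print(seed_result is None, first == second)  # True True
```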

**Building vocabulary of the `REVIEW` field.**

Instead of training our own word embeddings, we'll be using pretrained `GloVe` embeddings. Different types of embeddings and dimensions were tested; `GloVe` embeddings with dimension `200` showed the best results. Vectors for words that are not present in the `GloVe` vocabulary are initialized from a standard normal distribution by specifying `unk_init=torch.Tensor.normal_`.  

We will also limit our vocabulary to tokens that occur at least three times in the training data by setting the `min_freq` parameter to 3. This reduces the vocabulary size and helps avoid overfitting.  


In [8]:
EMBEDDING_DIM = 200

REVIEW.build_vocab(
    train_data, 
    min_freq=3, 
    unk_init=torch.Tensor.normal_,
    vectors = 'glove.6B.' + str(EMBEDDING_DIM) + 'd',
)

VOCAB_SIZE = len(REVIEW.vocab)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.36MB/s]                               
100%|█████████▉| 399999/400000 [00:31<00:00, 12585.64it/s]
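
To make the effect of `min_freq` concrete, here is a simplified pure-Python sketch of frequency-based vocabulary filtering. The toy corpus and the `build_vocab` helper are hypothetical; the real `torchtext` vocabulary orders tokens differently and also attaches the `GloVe` vectors, which this sketch does not attempt.

```python
from collections import Counter

# Toy tokenised corpus (hypothetical, not from the dataset).
corpus = [
    ['great', 'location', 'sep', 'great', 'staff'],
    ['no', 'negative', 'sep', 'great', 'location'],
    ['noisy', 'room', 'sep', 'location'],
]

def build_vocab(tokenised_texts, min_freq, specials=('<unk>', '<pad>')):
    """Keep tokens whose total count is at least min_freq."""
    counts = Counter(tok for text in tokenised_texts for tok in text)
    kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab(corpus, min_freq=3)
unk = stoi['<unk>']

# Rare tokens such as 'noisy' fall back to the unknown-token index.
encoded = [stoi.get(tok, unk) for tok in ['great', 'noisy', 'location']]
print(itos)     # ['<unk>', '<pad>', 'great', 'location', 'sep']
print(encoded)  # [2, 0, 3]
```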


**Creating train and validation iterators.**

We'll be using a `BucketIterator` - a special type of iterator that will return a batch of instances, where each instance is of a similar length, minimizing the amount of padding per instance.  

The `sort_within_batch` parameter is specified to sort instances within each batch. This is necessary in order to use `nn.utils.rnn.pack_padded_sequence` later, which packs a sequence so that only the non-padded elements are processed by the `LSTM` model. We also need to define the `sort_key` parameter, which says how to sort instances within a batch; in this case it is the length of the review text. Since the text has already been tokenized, we can just use `len` of the instance.   

Since we don't backpropagate during the validation loop, we can set a larger batch size of 256 for the validation iterator.

In [9]:
BATCH_SIZES = (8, 256)

train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data),
    sort_within_batch=True,
    sort_key = lambda x: len(x.review), 
    batch_sizes=BATCH_SIZES,
    device=DEVICE
)
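
The benefit of grouping reviews of similar length can be illustrated without `torchtext`. The sketch below counts how many padding tokens are needed when batches are formed in arbitrary order versus after sorting by length. The lengths are randomly generated and hypothetical, and the real `BucketIterator` is more involved (it also shuffles batches), so treat this as a simplified illustration only.

```python
import random

def padding_needed(lengths, batch_size):
    """Total pad tokens when consecutive groups are padded to their longest item."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

rng = random.Random(0)
lengths = [rng.randint(3, 60) for _ in range(64)]  # hypothetical review lengths

random_order = padding_needed(lengths, batch_size=8)
bucketed = padding_needed(sorted(lengths), batch_size=8)

# Sorting by length before batching never needs more padding here.
print(random_order, bucketed)
```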

We'll set the embeddings for `PAD` and `UNK` tokens used for padding and unknown words respectively to zero vectors to explicitly tell our model that they are irrelevant for determining a score.

In [10]:
PAD_IDX = REVIEW.vocab.stoi[REVIEW.pad_token]
UNK_IDX = REVIEW.vocab.stoi[REVIEW.unk_token]

pretrained_embeddings = REVIEW.vocab.vectors

pretrained_embeddings[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
pretrained_embeddings[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
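
One detail worth checking is how these zero vectors behave once the embeddings are trainable. The sketch below uses a tiny random matrix as a stand-in for the pretrained vectors and assumes index 0 plays the role of `UNK` and index 1 the role of `PAD`. It suggests that the row passed as `padding_idx` receives no gradient, while the `UNK` row does, so with `freeze=False` the zero vector for `UNK` is only a starting point.

```python
import torch
from torch import nn

# Tiny stand-in for the pretrained matrix: 4 tokens, 3 dimensions.
# Index 0 plays the role of UNK and index 1 the role of PAD (assumed here).
vectors = torch.randn(4, 3)
vectors[0] = torch.zeros(3)
vectors[1] = torch.zeros(3)

emb = nn.Embedding.from_pretrained(vectors, padding_idx=1, freeze=False)

tokens = torch.tensor([[0, 2, 1, 1]])  # one sequence: UNK, word, PAD, PAD
emb(tokens).sum().backward()

pad_grad = emb.weight.grad[1]
unk_grad = emb.weight.grad[0]
print(pad_grad)  # all zeros: the PAD row is never updated
print(unk_grad)  # non-zero: the UNK row can move away from zero
```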

**Defining the model.**

We'll be using a bidirectional `LSTM` model.  

First, an `nn.Embedding` layer, which is a simple word-embedding lookup table, processes the input sequence. We initialize this layer from the `pretrained_embeddings` tensor we set up earlier. We also set the `freeze` parameter to `False`, allowing the embeddings to be updated during backpropagation. After that, `nn.Dropout` is applied to control overfitting.  

Next, we pack the embedded input with `nn.utils.rnn.pack_padded_sequence` so that the `LSTM` can process it efficiently, and feed the packed input to the model.

The `LSTM` itself returns `output` and a pair `(h_n, c_n)`:
* `output` - a tensor containing the hidden states of the last layer at every time step
* `h_n` - a tensor containing the final hidden state of every layer and direction
* `c_n` - a tensor containing the final cell state of every layer and direction  

We expect the final hidden state to encode information about the whole input sequence. Since we are using a bidirectional `LSTM`, we concatenate the final hidden states of the last layer from both directions and pass the result to the `nn.Linear` layer for the final prediction.

In [11]:
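class LSTM(torch.nn.Module):
    def __init__(self, pretrained_embeddings, embedding_dim, padding_idx,
                 hidden_dim, num_layers=1, bidirectional=True, dropout=0.3):
        super().__init__()
        self.bidirectional = bidirectional
        
        # embedding table initialised from GloVe; freeze=False keeps it trainable
        self.embeddings = nn.Embedding.from_pretrained(pretrained_embeddings, 
                                                       padding_idx=padding_idx, 
                                                       freeze=False)
        
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, 
                            batch_first=True, 
                            bidirectional=bidirectional, 
                            num_layers=num_layers)
        
        self.dropout = nn.Dropout(dropout)
        
        # the regression head sees one final hidden state per direction
        num_directions = 2 if bidirectional else 1
        self.linear = nn.Linear(hidden_dim * num_directions, 1)
        
    def forward(self, text, text_lengths):
        embedded = self.embeddings(text)
        embedded = self.dropout(embedded)
        # lengths must be on the CPU; batches arrive sorted by decreasing length
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, 
                                                            text_lengths.to('cpu'), 
                                                            batch_first=True)
        lstm_out, (hst, cst) = self.lstm(packed_embedded)
        if self.bidirectional:
            # last layer: forward state is hst[-2], backward state is hst[-1]
            hidden = torch.cat([hst[-2, :, :], hst[-1, :, :]], dim=1)
        else:
            hidden = hst[-1, :, :]
        # squeeze only the feature dimension so a batch of one keeps its shape
        return self.linear(hidden).squeeze(1)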
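
To check the tensor shapes involved in packing and in picking the final hidden states, here is a small standalone sketch with made-up dimensions and random inputs (none of these sizes come from the notebook). It only illustrates the indexing; it is not the model above.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, max_len, emb_dim, hidden_dim, num_layers = 3, 5, 4, 6, 2

embedded = torch.randn(batch, max_len, emb_dim)
lengths = torch.tensor([5, 3, 2])  # sorted in decreasing order

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths, batch_first=True)
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers,
               batch_first=True, bidirectional=True)
packed_out, (h_n, c_n) = lstm(packed)

# h_n is (num_layers * num_directions, batch, hidden_dim); the last two
# slices are the forward and backward states of the top layer.
hidden = torch.cat([h_n[-2], h_n[-1]], dim=1)

# Unpack only to inspect the per-step outputs.
output, out_lengths = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)

print(h_n.shape)     # torch.Size([4, 3, 6])
print(hidden.shape)  # torch.Size([3, 12])
print(output.shape)  # torch.Size([3, 5, 12])
```

For the second sequence (length 3), the forward half of `hidden` matches `output` at its last real time step and the backward half matches `output` at time step 0, which is why padding positions do not leak into the prediction.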

**Instantiating the model and moving it to GPU if available.**

In [12]:
seed_all(21)

HIDDEN_DIM = 300
NUM_LAYERS = 3
DROPOUT = 0.3
BIDIRECTIONAL = True
        
lstm = LSTM(
    pretrained_embeddings = pretrained_embeddings,
    embedding_dim = EMBEDDING_DIM,
    padding_idx = PAD_IDX,
    hidden_dim = HIDDEN_DIM,
    num_layers = NUM_LAYERS,
    bidirectional = BIDIRECTIONAL,
    dropout = DROPOUT
)

lstm = lstm.to(DEVICE)

**Defining learning rate, optimizer, scheduler and loss function.**   

We'll be using the MAE (mean absolute error, L1) loss function.

In [13]:
LEARNING_RATE = 3e-4
LOSS_NAME = 'MAE'

optimizer = torch.optim.AdamW(lstm.parameters(), lr = LEARNING_RATE)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, 0.98)
criterion = F.l1_loss
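
For intuition about the scheduler, exponential decay multiplies the learning rate by a fixed factor at every scheduler step. Assuming one step per epoch, a plain-Python sketch of the resulting schedule looks like this (the `lr_at_epoch` helper is hypothetical and only mirrors the formula, not the PyTorch class):

```python
LEARNING_RATE = 3e-4
GAMMA = 0.98

def lr_at_epoch(epoch, base_lr=LEARNING_RATE, gamma=GAMMA):
    """Learning rate after `epoch` scheduler steps under exponential decay."""
    return base_lr * gamma ** epoch

schedule = [lr_at_epoch(e) for e in range(11)]
for epoch in (0, 1, 5, 10):
    print(epoch, f'{schedule[epoch]:.3e}')
```

With these values the decay is gentle: after 10 steps the learning rate is still roughly 82% of its initial value.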

**Defining the training loop.**

The `predict` function returns three tensors:  
* `losses` - a tensor with the loss value for every element of the iterator  
* `predicted_ratings` - a tensor with the model predictions  
* `true_ratings` - a tensor with the actual ratings

We'll also be [exporting](https://wandb.ai/porfiry/hotels-reviews?workspace=user-porfiry) the results to `wandb` to easily track and visualise the metrics.

In [14]:
def train_one_epoch(model, train_iterator, criterion, optimizer, scheduler=None):
    model.train()
    for reviews, ratings in tqdm(train_iterator):
        reviews, reviews_lengths = reviews[0], reviews[1]
        reviews, reviews_lengths, ratings = reviews.to(DEVICE), reviews_lengths.to(DEVICE), ratings.to(DEVICE)
        predictions = model(reviews, reviews_lengths)
        loss = criterion(predictions, ratings)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    if scheduler is not None:
        scheduler.step()

def predict(model, iterator, criterion):
    model.eval()
    predicted_ratings = torch.tensor([], device=DEVICE)
    true_ratings = torch.tensor([], device=DEVICE)
    losses = torch.tensor([], device=DEVICE)
    with torch.no_grad():
        for reviews, ratings in tqdm(iterator):
            reviews, reviews_lengths = reviews[0], reviews[1]
            reviews, reviews_lengths, ratings = reviews.to(DEVICE), reviews_lengths.to(DEVICE), ratings.to(DEVICE)
            batch_predictions = model(reviews, reviews_lengths)
            predicted_ratings = torch.cat([predicted_ratings, batch_predictions])
            true_ratings = torch.cat([true_ratings, ratings])
            batch_losses = criterion(batch_predictions, ratings, reduction='none')
            losses = torch.cat([losses, batch_losses])
    return losses, predicted_ratings, true_ratings

def train(model, train_iterator, criterion, optimizer, n_epochs=10, 
          val_iterator=None, scheduler=None, project=None, resume=False, params=None):
    
    # resume is a flag to continue the same wandb run in case we want to continue training
    # the model after initial train loop
    if resume:
        wandb.init(project=project, resume='must', id=wandb.run.id, config=params)
    else:
        wandb.init(project=project, config=params)
        
    wandb.watch(model)

    for epoch in range(n_epochs):
        train_one_epoch(model, train_iterator, criterion, optimizer, scheduler)
        if val_iterator is not None:
            val_loss, val_preds, val_ratings = predict(model, val_iterator, criterion)
            print(f'Epoch: {epoch+1}, validation {LOSS_NAME}: {val_loss.mean()}')
        train_loss, train_preds, train_ratings = predict(model, train_iterator, criterion)
        print(f'Epoch: {epoch+1}, train {LOSS_NAME}: {train_loss.mean()}')
        
        # logging has to be done separately because we can't keep track of steps for wandb
        # in case of continuing training after initial train loop
        if val_iterator is not None:
            wandb.log({
              f'mean train {LOSS_NAME}': train_loss.mean(),
              f'mean val {LOSS_NAME}': val_loss.mean()
              }) 
        else:
            wandb.log({
              f'mean train {LOSS_NAME}': train_loss.mean()
              }) 
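
One reason to collect per-element losses with `reduction='none'` and average them at the end is that batches can differ in size, so the mean of per-batch means is generally not the dataset MAE. A small pure-Python sketch with made-up numbers:

```python
def mae(preds, targets):
    """Mean absolute error over one list of predictions."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)

# Hypothetical predictions and targets, split into batches of unequal size.
batches = [
    ([8.0, 9.0, 7.0, 6.0], [8.5, 9.5, 7.5, 6.5]),  # 4 samples, each off by 0.5
    ([5.0], [8.0]),                                 # 1 sample, off by 3.0
]

# Average over every individual sample: the true dataset-level MAE.
per_sample = [abs(p - t) for preds, targets in batches for p, t in zip(preds, targets)]
dataset_mae = sum(per_sample) / len(per_sample)

# Averaging the batch means instead over-weights the small batch.
batch_means = [mae(preds, targets) for preds, targets in batches]
mean_of_batch_means = sum(batch_means) / len(batch_means)

print(dataset_mae)          # 1.0
print(mean_of_batch_means)  # 1.75
```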

**Beginning training.**

In [15]:
def log_params(): #log model and training parameters to wandb
    params = {}
    params['bidirectional'] = BIDIRECTIONAL
    params['num layers'] = NUM_LAYERS
    params['hidden dim'] = HIDDEN_DIM
    params['embedding dim'] = EMBEDDING_DIM
    params['dropout'] = DROPOUT
    params['vocab size'] = VOCAB_SIZE
    params['learning rate'] = LEARNING_RATE
    params['train batch size'] = BATCH_SIZES[0]
    params['model'] = lstm
    params['optimizer'] = optimizer
    params['scheduler'] = scheduler
    return params

def start_training(params):
    resume = input('RESUME PREVIOUS RUN? ').lower() == 'y'
    print('Resuming previous run:', resume)
    
    n_epochs = int(input('N EPOCHS? '))
    print('Training', n_epochs, 'epochs', end='\n')

    train(lstm, 
          train_iterator, 
          criterion, 
          optimizer, 
          n_epochs=n_epochs, 
          val_iterator=valid_iterator, 
          scheduler=scheduler, 
          project='hotels-reviews', 
          resume=resume, 
          params=params)
    
params = log_params()
start_training(params)

RESUME PREVIOUS RUN?  no


Resuming previous run: False


N EPOCHS?  10


[34m[1mwandb[0m: Currently logged in as: [33mporfiry[0m (use `wandb login --relogin` to force relogin)


Training 10 epochs


100%|██████████| 8750/8750 [02:14<00:00, 64.98it/s]
100%|██████████| 118/118 [00:02<00:00, 40.47it/s]


Epoch: 1, validation MAE: 0.8558124899864197


100%|██████████| 8750/8750 [00:40<00:00, 217.85it/s]


Epoch: 1, train MAE: 0.8344889283180237


100%|██████████| 8750/8750 [02:12<00:00, 65.84it/s]
100%|██████████| 118/118 [00:02<00:00, 42.20it/s]


Epoch: 2, validation MAE: 0.7614631652832031


100%|██████████| 8750/8750 [00:40<00:00, 217.95it/s]


Epoch: 2, train MAE: 0.7184240221977234


100%|██████████| 8750/8750 [02:12<00:00, 65.98it/s]
100%|██████████| 118/118 [00:02<00:00, 41.85it/s]


Epoch: 3, validation MAE: 0.7579575181007385


100%|██████████| 8750/8750 [00:39<00:00, 219.90it/s]


Epoch: 3, train MAE: 0.6949464082717896


100%|██████████| 8750/8750 [02:12<00:00, 65.87it/s]
100%|██████████| 118/118 [00:02<00:00, 41.60it/s]


Epoch: 4, validation MAE: 0.7493607401847839


100%|██████████| 8750/8750 [00:40<00:00, 217.83it/s]


Epoch: 4, train MAE: 0.6655190587043762


100%|██████████| 8750/8750 [02:12<00:00, 66.02it/s]
100%|██████████| 118/118 [00:02<00:00, 39.90it/s]


Epoch: 5, validation MAE: 0.7399376630783081


100%|██████████| 8750/8750 [00:39<00:00, 218.91it/s]


Epoch: 5, train MAE: 0.6349638104438782


100%|██████████| 8750/8750 [02:12<00:00, 66.14it/s]
100%|██████████| 118/118 [00:02<00:00, 41.73it/s]


Epoch: 6, validation MAE: 0.7475746273994446


100%|██████████| 8750/8750 [00:39<00:00, 219.23it/s]


Epoch: 6, train MAE: 0.6125217080116272


100%|██████████| 8750/8750 [02:12<00:00, 66.07it/s]
100%|██████████| 118/118 [00:02<00:00, 42.17it/s]


Epoch: 7, validation MAE: 0.7524238228797913


100%|██████████| 8750/8750 [00:40<00:00, 218.65it/s]


Epoch: 7, train MAE: 0.6039032340049744


100%|██████████| 8750/8750 [02:12<00:00, 65.84it/s]
100%|██████████| 118/118 [00:02<00:00, 42.19it/s]


Epoch: 8, validation MAE: 0.755949854850769


100%|██████████| 8750/8750 [00:40<00:00, 218.51it/s]


Epoch: 8, train MAE: 0.5822247266769409


100%|██████████| 8750/8750 [02:12<00:00, 65.91it/s]
100%|██████████| 118/118 [00:02<00:00, 40.47it/s]


Epoch: 9, validation MAE: 0.7548894882202148


100%|██████████| 8750/8750 [00:40<00:00, 217.77it/s]


Epoch: 9, train MAE: 0.5589897036552429


100%|██████████| 8750/8750 [02:12<00:00, 65.88it/s]
100%|██████████| 118/118 [00:02<00:00, 41.42it/s]


Epoch: 10, validation MAE: 0.7504647374153137


100%|██████████| 8750/8750 [00:40<00:00, 216.85it/s]


Epoch: 10, train MAE: 0.5262979865074158


Different architectures and sets of hyperparameters were tested. The model was able to [achieve](https://wandb.ai/porfiry/hotels-reviews?workspace=user-porfiry) on average 0.75 MAE on validation and 0.55 MAE on train on this train-validation split.
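
The log above can also be read for signs of overfitting. The sketch below copies the per-epoch MAE values from that log (rounded to four decimals) and finds the epoch with the lowest validation MAE. Picking a checkpoint this way is a common practice, not something the training loop above does.

```python
# Validation and train MAE per epoch, copied from the log above (rounded).
val_mae = [0.8558, 0.7615, 0.7580, 0.7494, 0.7399,
           0.7476, 0.7524, 0.7559, 0.7549, 0.7505]
train_mae = [0.8345, 0.7184, 0.6949, 0.6655, 0.6350,
             0.6125, 0.6039, 0.5822, 0.5590, 0.5263]

# Epoch numbers are 1-based, as in the log.
best_epoch = min(range(len(val_mae)), key=val_mae.__getitem__) + 1
gap_at_end = val_mae[-1] - train_mae[-1]

print(best_epoch)            # 5
print(round(gap_at_end, 4))  # 0.2242
```

In this run the validation MAE is lowest at epoch 5 and then plateaus around 0.75 while the train MAE keeps falling, which suggests the later epochs mostly fit the training set.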