# 1. Imports and Keras environment setup

## Import R libraries

# BERT-reviews-score-prediction
Prediction of Amazon review ratings using a BERT model.

Structure of the project:
* **bertconfig**: Configuration of the pre-trained BERT model, downloaded from Hugging Face ([link](https://huggingface.co/google/bert_uncased_L-12_H-768_A-12)).
* **amazon-cell-phones-reviews**: Dataset by Griko Nibras, shared on Kaggle ([link](https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews?select=20191226-reviews.csv)).
* **nlp-project.ipynb**: Notebook with the project.

## How to run the code

1) Download the files, including the *bertconfig* and *amazon-cell-phones-reviews* directories.

2) Open the notebook in an environment that provides Python alongside R, since the R code calls Python libraries through *reticulate*. Kaggle Notebooks is recommended: Conda is installed by default, so no further setup is required. Kaggle Notebooks was the cloud platform used for this development.

3) Check that the packages listed in the first cell are installed.

4) Run the cells sequentially, preferably using an accelerator, such as a GPU.


Project developed for the NLP course of Intelligent Systems (Universidad Politécnica de Madrid)


In [24]:
library(magrittr)
library(dplyr)
library(rtweet)
library(textclean)
library(utf8)
library(reticulate)
library(keras)
library(tensorflow)
library(tfdatasets)

library(mltools)
library(data.table)

## Install transformers and Keras libraries in Conda

In [25]:
library(reticulate)

reticulate::py_install('transformers', pip = TRUE)
reticulate::py_install('keras-bert', pip = TRUE)

config_path = file.path("../input/bertconfig/bert_config.json")
pretrained_path = file.path("../input/bertconfig/bert_model.ckpt")
vocab_path = file.path("../input/bertconfig/vocab.txt")

k_bert = import('keras_bert')

## Load "Amazon cell phones reviews" dataset

In [26]:
df_reviews <- read.csv("../input/amazon-cell-phones-reviews/20191226-reviews.csv")

# 2. Data preprocessing

### Dataframe structure
- ASIN: Identifier of the device
- Name: Name of the Amazon user
- Rating: Number of stars given by the user to the product
- Date: Date of the review
- Verified: Indicates whether the user has been verified on the platform
- Title: Title of the review
- Body: Content of the review
- Helpful votes: Number of users who have marked the review as 'helpful'

In [27]:
head(df_reviews, n=5)

### Selection of the variables used for the prediction, plus the ID of the device
Several combinations of variables could have been used for this project. However, as the intention is to create a model that can also predict the score of reviews from other platforms, only the most basic ones have been chosen:
- **ASIN**: Identifies which device is being reviewed
- **Body**: Used as input for the model
- **Rating**: The target to predict

In [28]:
df_reviews <- subset(df_reviews, select = c(asin, body, rating))

Rows with missing values in any of these variables are removed.

In [29]:
# Keep only the rows where body, rating and asin are all present
df_reviews <- df_reviews %>%
    filter(!is.na(body), !is.na(rating), !is.na(asin))

### Rating distribution
The distribution of the target is analyzed using a barplot.

In [30]:
barplot(prop.table(table(df_reviews$rating)), col="blue", main = "Rating Distribution")

### Undersampling reviews
As the ratings are clearly not uniformly distributed, the classes are rebalanced by sampling. The dataset includes enough entries for each rating, so oversampling, which could have led to overfitting, is avoided. Instead, the dataframe is undersampled so that every rating ends up with a similar number of reviews: between 1 and 1.5 times the size of the smallest class, using hand-picked factors.

In [31]:
nsample <- as.numeric(df_reviews %>% 
        group_by(rating) %>% 
        count() %>% 
        ungroup() %>% 
        summarise(min(n)))

df_rating1 <- df_reviews %>% filter(df_reviews$rating == 1)
df_rating2 <- df_reviews %>% filter(df_reviews$rating == 2)
df_rating3 <- df_reviews %>% filter(df_reviews$rating == 3)
df_rating4 <- df_reviews %>% filter(df_reviews$rating == 4)
df_rating5 <- df_reviews %>% filter(df_reviews$rating == 5)

df_list <- list(df_rating1[sample(nrow(df_rating1), 1.3*nsample), ],
                df_rating2[sample(nrow(df_rating2), nsample), ],
                df_rating3[sample(nrow(df_rating3), 1.05*nsample), ],
                df_rating4[sample(nrow(df_rating4), 1.1*nsample), ],
                df_rating5[sample(nrow(df_rating5), 1.5*nsample), ])
# Stack the per-rating samples into a single dataframe
df_under <- bind_rows(df_list)

In [32]:
head(df_under)

In [33]:
barplot(prop.table(table(df_under$rating)), col="blue", main = "Rating Distribution")
df_reviews <- df_under
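
The per-class sampling can also be written compactly with `split` and `lapply`. The following is a minimal sketch in base R on toy data: the dataframe `toy` and its class sizes are hypothetical, while the factors are the ones used above. Sample sizes are rounded down explicitly and capped at the number of rows available in each class.

```r
# Toy data: a hypothetical, imbalanced set of ratings (not the real dataset).
set.seed(42)
toy <- data.frame(
  rating = rep(1:5, times = c(40, 10, 15, 30, 105)),
  body   = paste("review", seq_len(200))
)

# Size of the smallest class and the per-rating factors used in the notebook.
n_min   <- min(table(toy$rating))
factors <- c(`1` = 1.3, `2` = 1, `3` = 1.05, `4` = 1.1, `5` = 1.5)

# Sample floor(factor * n_min) rows within each rating, then stack the pieces.
pieces <- lapply(split(toy, toy$rating), function(d) {
  k <- floor(factors[[as.character(d$rating[1])]] * n_min)
  d[sample(nrow(d), min(k, nrow(d))), ]
})
toy_under <- do.call(rbind, pieces)

# Number of reviews kept per rating.
table(toy_under$rating)
```

Capping the size with `min(k, nrow(d))` avoids an error from `sample` when a class holds fewer rows than requested.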

### Text cleaning
First, as several reviews include emojis, these are replaced by text: with the *textclean* library, each emoji is transformed into a description of itself.

However, some emojis come in variations (for instance, the skin tone of a face symbol) that are encoded with a modifier accepted by UTF-8 but not by ASCII, so any remaining non-ASCII characters are deleted.

Finally, the body of the review is UTF-8 normalized and the reviews left without any characters are removed.

In [34]:
df_reviews$body <- replace_non_ascii(replace_emoji(df_reviews$body), "")
df_reviews$body <- replace_emoticon(df_reviews$body)
df_reviews$body <- utf8_normalize(df_reviews$body)

df_reviews <- df_reviews %>% filter(nchar(body) >= 1)
head(df_reviews)
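
The reason for the final filter can be seen in a small base R sketch. It does not use *textclean*: non-ASCII characters are simply dropped with `iconv`, so emojis are not converted into descriptions here. The review bodies are hypothetical.

```r
# Hypothetical review bodies; escapes are used so the code itself stays ASCII.
bodies <- c("Great caf\u00e9 phone", "\u2764", "ok")

# Drop every character that has no ASCII representation.
ascii_only <- iconv(bodies, from = "UTF-8", to = "ASCII", sub = "")

# Trim whitespace and keep only the reviews that still contain text.
ascii_only <- trimws(ascii_only)
kept <- ascii_only[nchar(ascii_only) >= 1]
kept
```

A review made only of symbols becomes an empty string once those symbols are removed, which is why empty bodies are filtered out after cleaning rather than before.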

### One-hot encoding
Before encoding the rating, a copy of the column is saved in the dataframe, to be used as the target when measuring the performance of the predictions.

Then, the 5 star values are one-hot encoded so that the task can be treated as a classification problem.

In [35]:
df_reviews$target <- df_reviews$rating

df_reviews$rating <- as.factor(df_reviews$rating)
df_reviews <- one_hot(as.data.table(df_reviews))

head(df_reviews)
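
As an illustration of what the encoding produces, the sketch below builds the same kind of indicator columns in base R, without *mltools*. The ratings are hypothetical, and the column names are chosen here only to mirror the `rating_1` ... `rating_5` pattern.

```r
# Hypothetical ratings, one per review.
ratings <- c(5, 1, 3, 3, 2)

# Compare each rating with every star level; TRUE/FALSE becomes 1/0.
onehot <- outer(ratings, 1:5, FUN = "==") * 1L
colnames(onehot) <- paste0("rating_", 1:5)
onehot

# The original rating is recovered as the position of the 1 in each row.
recovered <- max.col(onehot)
recovered
```

Each row contains exactly one 1, so the position of that 1 maps back to the number of stars. The same mapping is what allows the predicted class index to be compared with the saved target later on.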

# 3. Modeling BERT
### Sets for evaluation
First, the data is split into a training set (70% of the rows) and a test set (the remaining 30%), so that the model can be evaluated later.

In [36]:
train_index <- sample(seq_len(nrow(df_reviews)), size = floor(0.7 *  nrow(df_reviews)))
df_train <- df_reviews[train_index,]
df_test <- df_reviews[-train_index,]
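
Since `sample` draws different rows on every run, the split changes between executions unless a seed is fixed. A minimal sketch on toy indices, where the seed value and the number of rows are hypothetical:

```r
# Hypothetical seed; any fixed value makes the split repeatable.
set.seed(123)

# Hypothetical number of rows.
n <- 10

# 70% of the row indices go to training, the rest to testing.
train_index <- sample(seq_len(n), size = floor(0.7 * n))
test_index  <- setdiff(seq_len(n), train_index)

length(train_index)
length(test_index)
```

The two index sets are disjoint and together cover every row, which is the property that the negative indexing `-train_index` relies on.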

### Tokenization
A tokenizer is created from the vocabulary file of the pre-trained BERT model, which was built for general NLP purposes.

In [38]:
token_dict = k_bert$load_vocabulary(vocab_path)
tokenizer = k_bert$Tokenizer(token_dict)

The following function builds the input of the model.

Using the encode method of the tokenizer, the token ids and segment ids of each row are obtained, padded or truncated to 50 positions. The values saved for the tokens are not strings, but the integers that represent them in the vocabulary.
The second output holds segment ids rather than a mask: it marks which sentence each token belongs to, and it is all zeros here because every review is encoded as a single text. Token masking, which hides part of the input so that the model generalizes better and becomes more robust to changes, belongs to the pre-training of BERT and is not applied in this step.


In [39]:
get_tokens = function(df_set) {
    # Token ids and segment ids, one list element per review
    c(tokens, segments) %<-% list(list(), list())

    for (i in seq_len(nrow(df_set))) {
        # encode() returns the token ids and the segment ids of one review
        c(tk, seg) %<-% tokenizer$encode(df_set[["body"]][i], max_len=50L)

        tokens = tokens %>% append(list(as.matrix(tk)))
        segments = segments %>% append(list(as.matrix(seg)))
    }

    return(list(tokens, segments))
}
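
To make the output of the encoding step concrete, the sketch below imitates it in base R. It is only an illustration under simplifying assumptions: `toy_encode` is a hypothetical helper that splits on whitespace instead of applying WordPiece, and the ids of the two words in `vocab` are made up.

```r
# Toy vocabulary: special tokens plus two words with hypothetical ids.
vocab <- c("[PAD]" = 0L, "[UNK]" = 100L, "[CLS]" = 101L, "[SEP]" = 102L,
           "good" = 2204L, "phone" = 3042L)

# Hypothetical helper: whitespace tokenization, then padding to max_len.
toy_encode <- function(text, max_len) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  words <- head(words, max_len - 2L)        # leave room for [CLS] and [SEP]
  ids <- vocab[words]
  ids[is.na(ids)] <- vocab[["[UNK]"]]       # words missing from the vocabulary
  ids <- c(vocab[["[CLS]"]], unname(ids), vocab[["[SEP]"]])
  ids <- c(ids, rep(vocab[["[PAD]"]], max_len - length(ids)))
  # A single text belongs to segment 0 at every position.
  list(tokens = ids, segments = rep(0L, max_len))
}

enc <- toy_encode("Good phone overall", max_len = 8L)
enc$tokens
enc$segments
```

Every review is mapped to two integer vectors of the same fixed length, which is what allows them to be stacked row by row into the input matrices of the model.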

In [40]:
# Train set
c(tk_train, seg_train) %<-% get_tokens(df_train)

tokens_train = do.call(cbind, tk_train) %>% t()
segments_train = do.call(cbind, seg_train) %>% t()
# Targets: the one-hot rating columns (rating_1 ... rating_5)
y_train = data.matrix(df_train[ , 3:7])

x_train = c(list(tokens_train), list(segments_train))

# Test set
c(tk_test, seg_test) %<-% get_tokens(df_test)

tokens_test = do.call(cbind, tk_test) %>% t()
segments_test = do.call(cbind, seg_test) %>% t()
# Targets: the one-hot rating columns (rating_1 ... rating_5)
y_test = data.matrix(df_test[ , 3:7])

x_test = c(list(tokens_test), list(segments_test))

### Transfer learning
First, a pre-trained model is loaded. This model has not been trained for any specific task; it serves as a general-purpose base to be fine-tuned for others.

In [41]:
model = k_bert$load_trained_model_from_checkpoint(
  config_path,
  pretrained_path,
  training=TRUE,
  trainable=TRUE)

### Hyperparameter-tuning
The decay and warmup steps of the optimizer are computed from the size of the training set, the batch size and the number of epochs. Then, the input layers are taken from the BERT model and a new output layer is added on top of it, choosing the activation function and the number of output units.

In [42]:
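# Number of training examples = rows of y_train (length() would also count its 5 columns)
c(decay_steps, warmup_steps) %<-% k_bert$calc_train_steps(
  nrow(y_train),
  batch_size=70L,
  epochs=3L  # same number of epochs as in the call to fit()
)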
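
The arithmetic behind the two step counts can be sketched as follows. All the numbers are hypothetical, and the warmup proportion of 0.1 is an assumption made for the illustration rather than a value taken from this notebook.

```r
# Hypothetical sizes; warmup_proportion = 0.1 is an assumed value.
n_examples        <- 1000L
batch_size        <- 70L
epochs            <- 3L
warmup_proportion <- 0.1

# Batches needed to see every example once (the last batch may be partial).
steps_per_epoch <- ceiling(n_examples / batch_size)

# Total optimizer steps, and the share of them spent warming up.
decay_steps  <- steps_per_epoch * epochs
warmup_steps <- floor(decay_steps * warmup_proportion)

c(decay_steps = decay_steps, warmup_steps = warmup_steps)
```

Because the counts scale with the number of examples and epochs, they only describe the training run correctly when those two values match the ones passed to `fit`.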

In [43]:
input_token = get_layer(model,name = 'Input-Token')$input
input_segment = get_layer(model,name = 'Input-Segment')$input
input_layers = list(input_token,input_segment)

dense = get_layer(model,name = 'NSP-Dense')$output

outputs = dense %>% layer_dense(units=5L, activation='softmax',
                         kernel_initializer=initializer_truncated_normal(stddev = 0.02),
                         name = 'output')

model = keras_model(inputs = input_layers,outputs = outputs)

In [44]:
model %>% compile(
  k_bert$AdamWarmup(decay_steps=decay_steps, 
                    warmup_steps=warmup_steps, lr=1e-4),
  loss = "categorical_crossentropy",
  metrics = "categorical_accuracy"
)
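
The role of the two step counts can be illustrated with a small sketch of a learning-rate schedule with linear warm-up followed by linear decay. The function `lr_at` and the numbers are hypothetical, and the exact formula applied by `AdamWarmup` may differ from this one.

```r
# Hypothetical schedule: the rate grows linearly during warm-up and then
# shrinks linearly towards min_lr as the step approaches decay_steps.
lr_at <- function(step, base_lr, warmup_steps, decay_steps, min_lr = 0) {
  if (step <= warmup_steps) {
    base_lr * step / warmup_steps
  } else {
    min_lr + (base_lr - min_lr) * (1 - min(step, decay_steps) / decay_steps)
  }
}

# Learning rate at a few steps of a run with 45 steps, 4 of them for warm-up.
steps <- c(1, 2, 4, 20, 45)
rates <- sapply(steps, lr_at, base_lr = 1e-4, warmup_steps = 4, decay_steps = 45)
data.frame(step = steps, lr = rates)
```

Starting with small updates protects the pre-trained weights while the new output layer is still randomly initialized, and the decay reduces the size of the updates towards the end of training.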

### Model fitting

In [45]:
model %>% fit(
    x=x_train,
    y=y_train,
    epochs=3,
    batch_size=70,
    validation_split=0.2)

# 4. Predictions and evaluation
### Predicting values from the Test Set

In [46]:
predicted_values = model$predict(x_test)

# Index of the most probable class in each row; columns follow rating_1 ... rating_5
y_predicted <- apply(predicted_values, 1, which.max)
y_target <- as.vector(df_test[["target"]])

### Prediction vs Target comparison

In [47]:
for(i in 1:10){
    print(paste0("-prediction = ", y_predicted[i]))
    print(paste0("-target = ", y_target[i]))
    print("---------------------------")
}

In [48]:
print(table(y_predicted))
print(table(y_target))

### Confusion matrix

In [49]:
library(caret)

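# Fixed levels keep both factors aligned even if some rating is never predicted
cfm = caret::confusionMatrix(factor(y_predicted, levels = 1:5), factor(y_target, levels = 1:5))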
print(cfm)
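
For reference, the core of the report can be reproduced in base R with `table`. This is a minimal sketch on hypothetical vectors, covering only the matrix itself and the overall accuracy, not the per-class statistics that *caret* adds.

```r
# Hypothetical predictions and targets on the 1-5 star scale.
y_pred_toy   <- c(1, 2, 2, 5, 4, 5, 3, 1)
y_target_toy <- c(1, 2, 3, 5, 5, 5, 3, 2)

# Fixing the levels keeps the table 5 x 5 even if a rating never appears.
stars <- 1:5
cm <- table(prediction = factor(y_pred_toy, levels = stars),
            target     = factor(y_target_toy, levels = stars))

# Correct predictions lie on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
cm
accuracy
```

As the ratings are ordered, the cells next to the diagonal are also informative: predicting 4 stars for a 5-star review is a smaller error than predicting 1 star.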