# Summmary

This notebook will show some very basic NLP capability.  
Creating a binary classifier to classify restaurant reviews as positive or negative.  
The purpose is to demonstrate the ability to learn from text by beating a naive baseline.

# Data
The dataset is taken from Kaggle (European Restaurant Reviews).  
It will use only the text columns (ReviewTitle, Review) and the sentiment output class (Positive/Negative)

# Approach
The approach is very simple - use a pre-trained sentiment classifier from huggingface.  
The chosen model is:  cardiffnlp/twitter-roberta-base-sentiment-latest  
It is a fine-tuned version of RoBERTa using Twitter data and the TweetEval benchmark.  
The intuition here is that sentiment classification will transfer well enough to do a good job on the European Restaurant Reviews dataset.  
This represnts a pragmatic apprach to the sentiment classification task as beating the naive baseline can likely be done without a great deal of effort.

This is certainly not the only way to do this, some other options:
*  Use a pre-trained encoder to get an 2d output for each word in the input sequence, then train a traditional neural network classifier on this output.
*  Fine tune an encoder model such as RoBERTa on this, or a more appropriate dataset.
*  You could even do something like train a language model from scratch with gensim and use those encodings to train a traditional nueral network classifier.

Note that this is a small 'toy' dataset and considering any of these other options in reality would only make sense with much more data specialized to the task at hand.


In [1]:
import numpy as np
import pandas as pd
from scipy.special import softmax
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig

  from .autonotebook import tqdm as notebook_tqdm


# Data

read data from csv into a pd dataframe.    
combine the relevant text columns into a single column.  
Split X, y into train/test sets.  


In [2]:
restaurant_review_data: pd.DataFrame = pd.read_csv("European Restaurant Reviews.csv")

In [3]:
restaurant_review_data["ReviewText"] = restaurant_review_data["Review Title"] + " " + restaurant_review_data["Review"]

In [4]:
restaurant_review_data.head()

Unnamed: 0,Country,Restaurant Name,Sentiment,Review Title,Review Date,Review,ReviewText
0,France,The Frog at Bercy Village,Negative,Rude manager,May 2024 •,The manager became agressive when I said the c...,Rude manager The manager became agressive when...
1,France,The Frog at Bercy Village,Negative,A big disappointment,Feb 2024 •,"I ordered a beef fillet ask to be done medium,...",A big disappointment I ordered a beef fillet a...
2,France,The Frog at Bercy Village,Negative,Pretty Place with Bland Food,Nov 2023 •,"This is an attractive venue with welcoming, al...",Pretty Place with Bland Food This is an attrac...
3,France,The Frog at Bercy Village,Negative,Great service and wine but inedible food,Mar 2023 •,Sadly I used the high TripAdvisor rating too ...,Great service and wine but inedible food Sadly...
4,France,The Frog at Bercy Village,Negative,Avoid- Worst meal in Rome - possibly ever,Nov 2022 •,From the start this meal was bad- especially g...,Avoid- Worst meal in Rome - possibly ever From...


In [5]:
restaurant_review_data.shape

(1502, 7)

In [6]:
x_train, x_test, y_train, y_test = train_test_split(restaurant_review_data["ReviewText"],
                                                    restaurant_review_data["Sentiment"],
                                                    test_size=0.1)

In [7]:
print('Train:')
print(x_train.shape)
print(y_train.shape)

print('Test:')
print(x_test.shape)
print(y_test.shape)

Train:
(1351,)
(1351,)
Test:
(151,)
(151,)


In [8]:
# Get some summary statisitics on the number of tokens in the training data.
# This will help to give some intuition on the number of tokens to use in the model.
x_train.str.split().apply(len).describe()

count    1351.000000
mean       70.244264
std        73.774759
min         3.000000
25%        30.000000
50%        46.000000
75%        80.500000
max       650.000000
Name: ReviewText, dtype: float64

# Baseline

Determine a naive baseline from the training data.  
This is done solely based on the majority class.  
Any classifier capable of learning from the training data must beat this baseline.  

In [9]:
baseline = round(sum(y_train == "Positive")/y_train.shape[0],2)

In [10]:
print(f"The majority class Positive occurs in {baseline} % of the training data.")

The majority class Positive occurs in 0.82 % of the training data.


# Model

The model is a hugging face fine-tuned sentiment classifier.  
It will output the probability of each class Positive/Negative/Neutral.  
Coerce this output into a binary classifier Positive/Negative.

In [11]:
MODEL = f"cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [12]:
def get_model_outputs(model, model_inputs):
    # Getting OOM errors trying to put the entire train set through the model.
    # I can batch them by hand or with the PyTorch DataSet/DataLoader classes.
    # In this case I will simply truncate inputs to 50 tokens, this will provide the entire input for about 1/2 of the training data.
    # Intuition says this is enough data to determine sentiment accurately.
    encoded_input = tokenizer(list(model_inputs), padding=True, truncation=True, max_length=50, return_tensors='pt')

    model_output = model(**encoded_input)

    # Get logits output only from model into a np array
    model_output = model_output[0].detach().numpy()

    # Softmax to force to p(x) that adds to 1
    model_output = softmax(model_output)

    # Get the max arg for each input.
    model_output = np.argmax(model_output, axis=1)

    model_output = ['Negative' if output == 0 else 'Positive' for output in model_output]

    return model_output
    

In [13]:
predicted_train = get_model_outputs(model, x_train)

# Evaluation

Evaluate by predicting first on training data then on test.

In [14]:
accuracy_train = accuracy_score(y_train, predicted_train)

In [15]:
print(f"Accuracy on training data: {round(accuracy_train,2)}")

Accuracy on training data: 0.95


In [16]:
# Get predictions for test set
predicted_test = get_model_outputs(model, x_test)

In [17]:
accuracy_test = accuracy_score(y_test, predicted_test)

In [18]:
print(f"Accuracy on test data: {round(accuracy_train,2)}")

Accuracy on test data: 0.95


# Summary

This notebook demonstrated some very basic NLP techniques using pre-trained transformers (encoders).  
The model used was fine-tuned on a sentiment classification task and this notebook proves the learning from this task transfers well to the Restaurant Review dataset.

Train/Test accuracy of 95% beats the naive baseline of 82%.

Any discussion of next-steps to go beyond 95% is outside of the scope of this notebook.