# IMDB to Hotel Sentiment

In this notebook we will take a sentiment analysis model trained on IMDB reviews and fine-tune it to analyse tweets about hotels. We will use AdaTest to help us generate a suitable test suite.

## Seeding the PRNG

Before we do anything else, we first seed the PRNG, to ensure that we have reproducible results:

In [None]:
import torch
torch.manual_seed(1012351)
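
Note that `torch.manual_seed` only seeds PyTorch's own generators; Python's `random` module and NumPy keep separate state. As a minimal sketch, the helper below seeds whichever of the three are installed. The name `seed_everything` is our own for illustration and is not part of any library.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed every PRNG that happens to be installed (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(1012351)
first = random.random()
seed_everything(1012351)
second = random.random()
print(first == second)  # True: the same seed replays the same sequence
```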

## The Base Model

We will use [`aychang/roberta-base-imdb` from Hugging Face](https://huggingface.co/aychang/roberta-base-imdb) as our base model. This is a binary classifier which has been trained on a collection of IMDB reviews. First, we load the model itself:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import pipeline

base_model_name = "aychang/roberta-base-imdb"

model = AutoModelForSequenceClassification.from_pretrained(base_model_name,num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

original_pipeline = pipeline("sentiment-analysis",
                             model=model,
                             tokenizer=tokenizer,
                             top_k=2)

Now, let's try a few sentences:

In [None]:
original_pipeline("Great cinematography but a poor movie overall")

In [None]:
original_pipeline("Snappy dialogue makes for enjoyable entertainment")

In [None]:
original_pipeline("Located on a busy street with much traffic")

We can see that the two statements about movies are well classified, but the final one, about the hotel, is not.

## Using AdaTest

AdaTest is a tool to help create training/test suites for language models. The basic workflow is:

1. User provides some sample input
1. User flags whether the model output is correct or not
1. AdaTest uses a second language model to generate more inputs from those already provided
1. User decides which of the AdaTest proposed inputs to incorporate (and whether the model provided a correct response)

Iterating through this process a few times can generate a lot of tests quite quickly.
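
The workflow above can be sketched in plain Python. Everything in this block is a stand-in that we made up for illustration: `toy_model`, `toy_generator` and `toy_reviewer` play the roles of the model under test, the second language model and the human, and none of them use the AdaTest API. For brevity, the sketch keeps every proposed input, whereas in AdaTest the user chooses which ones to incorporate.

```python
from typing import Callable, List, Tuple

TestRow = Tuple[str, str, str]  # (sentence, model output, 'pass' or 'fail')


def toy_model(sentence: str) -> str:
    """Stand-in classifier: 'neg' only if the sentence contains 'bad'."""
    return "neg" if "bad" in sentence else "pos"


def toy_generator(seeds: List[str]) -> List[str]:
    """Stand-in for the second language model: swaps one word per seed."""
    return [s.replace("room", "lobby") for s in seeds]


def toy_reviewer(sentence: str, output: str) -> bool:
    """Stand-in for the human: 'noisy' and 'bad' both count as negative."""
    expected = "neg" if ("noisy" in sentence or "bad" in sentence) else "pos"
    return output == expected


def run_round(tests: List[TestRow],
              model: Callable[[str], str],
              generator: Callable[[List[str]], List[str]],
              reviewer: Callable[[str, str], bool]) -> List[TestRow]:
    """One iteration: propose inputs, score them, record the verdict."""
    known = {sentence for sentence, _, _ in tests}
    for candidate in generator(sorted(known)):
        if candidate in known:
            continue  # skip inputs we already have
        output = model(candidate)
        verdict = "pass" if reviewer(candidate, output) else "fail"
        tests.append((candidate, output, verdict))
        known.add(candidate)
    return tests


suite = [("The room was bad", "neg", "pass"),
         ("The room was noisy", "pos", "fail")]
suite = run_round(suite, toy_model, toy_generator, toy_reviewer)
for row in suite:
    print(row)
```

Calling `run_round` repeatedly grows the suite, which is the sense in which a few iterations can yield many tests.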

In [None]:
import adatest

For our generator, we use OpenAI's GPT-3 model. For this, we need to read the access key in from a file:

In [None]:
import os
with open(os.path.expanduser('~/.openai_api_key'), 'r') as file:
    OPENAI_API_KEY = file.read().replace('\n', '')
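
As a slightly more defensive variant, the sketch below strips all surrounding whitespace from the key and falls back to an environment variable when the file is missing. The helper name `read_api_key` and the choice of fallback are our own assumptions, and the key used in the demonstration is a dummy value.

```python
import os
import tempfile
from pathlib import Path


def read_api_key(path: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `path`, else the value of `env_var` (hypothetical helper)."""
    key_file = Path(path).expanduser()
    if key_file.is_file():
        return key_file.read_text().strip()
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"No API key found in {key_file} or ${env_var}")
    return key


# Demonstration with a throwaway file and a dummy key
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_key")
    Path(demo_path).write_text("sk-dummy\n")
    demo_key = read_api_key(demo_path)

print(repr(demo_key))  # 'sk-dummy'
```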

First, we create the generator object which AdaTest will use to suggest more tests which are similar to the ones we provide:

In [None]:
generator = adatest.generators.OpenAI('curie', api_key=OPENAI_API_KEY)

Now we create the test tree. We will load a set of tests which we have already started work on, to make the process faster:

In [None]:
tests = adatest.TestTree("imdb_hotel_conversion.csv")

And fire up the AdaTest interface:

In [None]:
tests.adapt(original_pipeline, generator, auto_save=True, recompute_scores=True)

In [None]:
import pandas as pd

def load_adatest_data(csv_file: str):
    tmp = pd.read_csv(csv_file)
    
    # Drop topic marker rows
    tmp2 = tmp[tmp['label'] != 'topic_marker']
    # Drop suggestion rows
    tmp3 = tmp2[tmp2['topic'] != 'suggestion']
    
    # Remove columns we don't need
    tmp4 = tmp3.drop(labels=['labeler', 'description', 'author', 'Unnamed: 0'], axis=1)
    
    # Rename columns
    tmp5 = tmp4.rename(mapper={'input': 'sentence', 'label': 'model_is_correct'}, axis=1)
    
    # Don't need to track original rows
    tmp6 = tmp5.reset_index(drop=True)
    
    return tmp6


test_data = load_adatest_data('imdb_hotel_conversion.csv')
display(test_data)

Next, we need to get the actual labels corresponding to each sentence. For this we need to combine the column which contains the output of our model and the column containing our manual labelling of whether the model was correct or incorrect.

In [None]:
def generate_label(row):
    # The model output is either 'pos' or 'neg'
    model_result = row['output']
    # Return based on whether the model response was marked correct or incorrect
    if row['model_is_correct'] == 'pass':
        return model_result
    else:
        if model_result == 'pos':
            return 'neg'
        else:
            return 'pos'

Apply this to the data:

In [None]:
test_data['label'] = test_data.apply(generate_label, axis=1)
test_data
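
To make the rule concrete, here is the same logic applied to the four possible combinations of model output and verdict. The rows are hypothetical, and `recover_label` is simply a standalone restatement of `generate_label` that does not need a DataFrame.

```python
def recover_label(model_output: str, verdict: str) -> str:
    """Keep the model's output on a 'pass'; flip it otherwise."""
    if verdict == "pass":
        return model_output
    return "neg" if model_output == "pos" else "pos"


# Hypothetical (output, verdict) pairs covering every case
examples = [("pos", "pass"), ("pos", "fail"), ("neg", "pass"), ("neg", "fail")]
recovered = [recover_label(output, verdict) for output, verdict in examples]
print(recovered)  # ['pos', 'neg', 'neg', 'pos']
```

One consequence of this rule is that any verdict other than `'pass'` flips the output, so rows which have not yet been reviewed should be removed before it is applied.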

We can also call the pipeline directly on the sentences we have generated, and make sure that we get the same results as those stored by AdaTest:

In [None]:
import numpy as np

def get_label(label_probabilities):
    # The pipeline returns all of the label probabilities
    # We need to extract the largest
    max_score = 0
    label = None
    for l in label_probabilities:
        if l['score'] > max_score:
            max_score = l['score']
            label = l['label']
    return label

y_pred = [get_label(x) for x in original_pipeline(test_data.sentence.to_list())]


test_data['my_y_pred'] = y_pred
assert np.array_equal(test_data['my_y_pred'], test_data['output'])

display(test_data)
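
For reference, a pipeline created with `top_k=2` returns, for each input sentence, a list of label/score dictionaries. The sketch below uses hypothetical scores of that shape and picks the top label with `max`, which is an equivalent, more compact form of the loop in `get_label`.

```python
# Hypothetical scores, shaped like the output of a pipeline created with top_k=2
sample_output = [
    [{"label": "neg", "score": 0.91}, {"label": "pos", "score": 0.09}],
    [{"label": "pos", "score": 0.75}, {"label": "neg", "score": 0.25}],
]


def top_label(label_probabilities):
    """Return the label of the entry with the highest score."""
    return max(label_probabilities, key=lambda entry: entry["score"])["label"]


top_labels = [top_label(x) for x in sample_output]
print(top_labels)  # ['neg', 'pos']
```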

We can also evaluate our chosen metric, and check that the accuracy score matches what we expect from the summary at the top level of the AdaTest widget:

In [None]:
from datasets import load_metric

metric_name = 'accuracy'

metric = load_metric(metric_name)

def label_to_int(l: str) -> int:
    # Use the mapping provided by the model
    return model.config.label2id[l]

metric.compute(predictions=test_data['my_y_pred'].apply(label_to_int), references=test_data['label'].apply(label_to_int))
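
Accuracy is just the fraction of predictions which equal their reference label. The sketch below computes it by hand on four hypothetical rows; the `label2id` mapping shown is an assumption for illustration, and the real one comes from `model.config.label2id`.

```python
# Hypothetical mapping; the real one comes from model.config.label2id
label2id = {"neg": 0, "pos": 1}

predicted = ["pos", "neg", "pos", "pos"]  # model outputs
actual = ["pos", "neg", "neg", "pos"]     # labels recovered from the verdicts

pred_ids = [label2id[label] for label in predicted]
ref_ids = [label2id[label] for label in actual]

accuracy = sum(p == r for p, r in zip(pred_ids, ref_ids)) / len(ref_ids)
print({"accuracy": accuracy})  # {'accuracy': 0.75}
```

Because each label was derived from the model output and the pass/fail verdict, a prediction matches its label exactly when the row was marked 'pass', so this number is also the fraction of passing tests.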

There is one final tweak to make to our data prior to fine-tuning the model: the Hugging Face `Trainer`s do not use the human-friendly labels, but the corresponding integer ids. So we use the mapping provided by the model to convert the 'label' column:

In [None]:
test_data['label'] = test_data['label'].apply(label_to_int)
print(test_data.dtypes)

Now, we can split our dataset into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

We stratify based on the 'topic' column, to ensure that we have samples from all of the various topics we have generated:

In [None]:
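# Fix random_state so that the split is reproducible too
# (torch.manual_seed does not affect scikit-learn)
train_df, test_df = train_test_split(test_data, stratify=test_data['topic'], test_size=0.3, random_state=1012351)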
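
To show what stratifying buys us, here is a dependency-free sketch of the idea: shuffle each topic separately and take the same share of each for the test set. This is a simplification that we wrote for illustration and not scikit-learn's actual algorithm; the topic names and the `stratified_split` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, test_size=0.3, seed=0):
    """Send roughly `test_size` of every group (as defined by `key`) to the test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    train, test = [], []
    for group in groups.values():
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_size))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical data: two topics with ten sentences each
rows = [{"topic": topic, "sentence": f"{topic} example {i}"}
        for topic in ("/rooms", "/location") for i in range(10)]
train_rows, test_rows = stratified_split(rows, key=lambda r: r["topic"])

print(Counter(r["topic"] for r in test_rows))
```

Every topic contributes three of its ten rows to the test set, so no topic can end up entirely in one split.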

Convert our DataFrames into Hugging Face `Dataset`s:

In [None]:
from datasets import Dataset

train_ds = Dataset.from_pandas(df = train_df)
test_ds = Dataset.from_pandas(df = test_df)
train_ds

Encode our datasets:

In [None]:
def preprocess_function(examples):
    result = tokenizer(examples["sentence"],
                       add_special_tokens = True,
                       truncation = True,
                       padding = "max_length",
                       return_attention_mask = True
                      )
    return result

train_encoded = train_ds.map(preprocess_function, batched=True)
test_encoded = test_ds.map(preprocess_function, batched=True)

drop_cols = ['topic', '__index_level_0__','model_is_correct', 'model score', 'my_y_pred', 'output']

train_encoded = train_encoded.remove_columns(drop_cols)
test_encoded = test_encoded.remove_columns(drop_cols)
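
The effect of `truncation`, `padding` and `return_attention_mask` can be illustrated with a toy encoder. This is a sketch only: the real tokenizer splits text into subword units and adds special tokens, whereas the stand-in below assigns one made-up id per word and uses a tiny `max_length` so that the result is readable.

```python
def toy_encode(sentence, max_length=6, pad_id=0):
    """Toy encoder: one id per whitespace-separated word, then truncate and pad."""
    ids = [len(word) for word in sentence.split()]  # stand-in ids, not vocabulary ids
    ids = ids[:max_length]                          # truncation
    attention_mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,           # padding up to max_length
        "attention_mask": attention_mask + [0] * n_pad,  # 0 marks padding
    }


short = toy_encode("Located on a busy street")
long = toy_encode("The room was clean but the street outside was noisy")
print(short)  # five real tokens followed by one pad
print(long)   # truncated to the first six words, no padding
```

With `padding="max_length"`, every example is padded to the same fixed length, which keeps batching simple at the cost of extra computation on short sentences.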

Configure a new training run:

In [None]:
from transformers import TrainingArguments

batch_size = 4

args_ft = TrainingArguments(
    "hotel_fine_tuned",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)
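
These settings determine how long training runs. As a rough sketch, assuming a single device, no gradient accumulation and a hypothetical training set of 70 sentences, the number of optimizer steps works out as follows:

```python
import math

n_train = 70          # hypothetical number of training sentences
batch_size = 4        # per_device_train_batch_size
num_train_epochs = 5

# The last batch of an epoch may be smaller than the rest, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch, total_steps)  # 18 90
```

Since both `evaluation_strategy` and `save_strategy` are set to `"epoch"`, the model is evaluated and checkpointed once per epoch, five times in total.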

Now, load a fresh copy of the model for fine-tuning. This will allow us to compare the two models side-by-side:

In [None]:
ft_model = AutoModelForSequenceClassification.from_pretrained(base_model_name,num_labels=2)

Create our new `Trainer` object, using the model we've just loaded. We pass in our new datasets for training and evaluation:

In [None]:
from transformers import Trainer

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Predictions are the model's raw scores (logits), so the predicted class is the index with the highest score
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

trainer_ft = Trainer(
    ft_model,
    args_ft,
    train_dataset=train_encoded,
    eval_dataset=test_encoded,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
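
The `predictions` array handed to `compute_metrics` holds one row of raw scores (logits) per example, with one column per class. Taking the `argmax` of each row gives the predicted class, and because softmax preserves ordering, the result is the same as it would be for the probabilities. A dependency-free sketch with hypothetical logits:

```python
import math


def softmax(logits):
    """Convert one row of logits into probabilities."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for three sentences and two classes
logit_rows = [[2.0, -1.0], [-0.5, 0.3], [0.1, 0.4]]

from_logits = [argmax(row) for row in logit_rows]
from_probs = [argmax(softmax(row)) for row in logit_rows]
print(from_logits)                # [0, 1, 1]
print(from_logits == from_probs)  # True
```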

Now, we can run the training. On a CPU, this may take a few minutes (large values of 'few' may be experienced):

In [None]:
trainer_ft.train()

In [None]:
trainer_ft.evaluate()

## Assessing the Fine-Tuned Model

Now that we have fine-tuned the model with some examples which talk about hotels, we can see if it performs better. First, we put the new model into a scoring pipeline:

In [None]:
ft_pipeline = pipeline("sentiment-analysis",
                       model=trainer_ft.model.to('cpu'),
                       tokenizer=tokenizer,
                       top_k=2)

We can re-run the initial samples we tried above:

In [None]:
ft_pipeline("Great cinematography but a poor movie overall")

In [None]:
ft_pipeline("Snappy dialogue makes for enjoyable entertainment")

In [None]:
ft_pipeline("Located on a busy street with much traffic")

The sentences about movies are still well classified, and the final one, about a hotel, is now classified correctly.

For a more systematic comparison, we can run our `test_df` through both pipelines:

In [None]:
def get_label(label_probabilities):
    # The pipeline returns all of the label probabilities
    # We need to extract the largest
    max_score = 0
    label = None
    for l in label_probabilities:
        if l['score'] > max_score:
            max_score = l['score']
            label = l['label']
    # Convert back to the id
    return ft_model.config.label2id[label]

y_pred_orig = [get_label(x) for x in original_pipeline(test_df.sentence.to_list())]
y_pred_ft = [get_label(x) for x in ft_pipeline(test_df.sentence.to_list())]

print("Original  : ", metric.compute(predictions=y_pred_orig, references=test_df.label))
print("Fine Tuned: ", metric.compute(predictions=y_pred_ft, references=test_df.label))
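
Two accuracy figures do not show which examples changed. As a sketch, the helper below counts how many test sentences were fixed by fine-tuning, how many regressed, and how many were unaffected. The name `compare_predictions` and the sample ids are hypothetical; with the variables above, one would pass `test_df.label`, `y_pred_orig` and `y_pred_ft`.

```python
from collections import Counter


def compare_predictions(references, before, after):
    """Count per-example outcomes when moving from one model to another."""
    outcomes = Counter()
    for ref, old, new in zip(references, before, after):
        if old == ref and new == ref:
            outcomes["both correct"] += 1
        elif old != ref and new == ref:
            outcomes["fixed"] += 1
        elif old == ref and new != ref:
            outcomes["regressed"] += 1
        else:
            outcomes["both wrong"] += 1
    return outcomes


# Hypothetical label ids for six test sentences
refs = [0, 0, 1, 1, 0, 1]
before = [1, 0, 1, 1, 1, 0]
after = [0, 0, 1, 0, 0, 0]

summary = compare_predictions(refs, before, after)
print(dict(summary))
```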

We see a noticeable improvement in accuracy.