In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Lesson 4: NLP Models

In lesson 4 we walked through an example of setting up and fine-tuning a pre-trained NLP model for the *U.S. Patents* dataset. We scored phrases on a scale from 0 to 1 according to how similar they were, given the patent's context area.

In this mini-project, I'm going to fine-tune an NLP model for a different sort of task: sentiment analysis. I'll use the Yelp Reviews dataset and transfer learning to fine-tune an existing model, and will hopefully get some nice results on the test set.

In [3]:
# URLs and untar_data are defined in fastai.data.external; no vision code is needed here
from fastai.data.external import URLs, untar_data

First let's download the dataset. If I wanted to be exhaustive I could train the model on the entire dataset, but I'd rather not waste my precious Kaggle GPU credits on a purely educational project. For this mini-project, a subset (~120,000 rows) of the dataset should be fine, though I guess we'll see.

In [4]:
DOWNLOAD_PATH = '/kaggle/working'
path = untar_data(URLs.YELP_REVIEWS_POLARITY, data=DOWNLOAD_PATH)

In [5]:
train_df = pd.read_csv('/kaggle/working/yelp_review_polarity_csv/train.csv', nrows=120000, names=['label', 'review'])

In [6]:
train_df.head()

Unnamed: 0,label,review
0,1,"Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I..."
1,2,"Been going to Dr. Goldberg for over 10 years. I think I was one of his 1st patients when he started at MHMG. He's been great over the years and is really all about the big picture. It is because of him, not my now former gyn Dr. Markoff, that I found out I have fibroids. He explores all options with you and is very patient and understanding. He doesn't judge and asks all the right questions. Very thorough and wants to be kept in the loop on every aspect of your medical health and your life."
2,1,"I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. I was going to Dr. Johnson before he left and Goldberg took over when Johnson left. He is not a caring doctor. He is only interested in the co-pay and having you come in for medication refills every month. He will not give refills and could less about patients's financial situations. Trying to get your 90 days mail away pharmacy prescriptions through this guy is a joke. And to make matters even worse, his office staff is incompetent. 90% of the time when you c..."
3,1,"I'm writing this review to give you a heads up before you see this Doctor. The office staff and administration are very unprofessional. I left a message with multiple people regarding my bill, and no one ever called me back. I had to hound them to get an answer about my bill. \n\nSecond, and most important, make sure your insurance is going to cover Dr. Goldberg's visits and blood work. He recommended to me that I get a physical, and he knew I was a student because I told him. I got the physical done. Later, I found out my health insurance doesn't pay for preventative visits. I received an..."
4,2,"All the food is great here. But the best thing they have is their wings. Their wings are simply fantastic!! The \""Wet Cajun\"" are by the best & most popular. I also like the seasoned salt wings. Wing Night is Monday & Wednesday night, $0.75 whole wings!\n\nThe dining area is nice. Very family friendly! The bar is very nice is well. This place is truly a Yinzer's dream!! \""Pittsburgh Dad\"" would love this place n'at!!"


In [7]:
train_df['label'] -= 1


I'll also want to make sure that there aren't any null values in the reviews column. We can simply replace those with an empty string.

In [8]:
train_df.fillna('', inplace=True)
train_df['input'] = train_df.review

Two concepts introduced in class were *tokenization* and *numericalization*. Tokenization means breaking our document (in this case, the review text) into bite-sized chunks called tokens, which are then *numericalized* into numerical input that an ML model can interpret. Each NLP model provides its own specification for how it breaks text into tokens, which we can access and use via the model's *tokenizer*. So, in order to proceed, we first need to know which model we'd like to use so we can go grab its tokenizer.

For this project I've just grabbed an off-the-shelf sentiment analysis model, *finiteautomata/bertweet-base-sentiment-analysis*. It was trained on around 40,000 tweets, and will give us a great starting point to fine-tune our model for Yelp reviews.
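
To make tokenization and numericalization concrete before loading the real tokenizer, here is a toy sketch. It is only an illustration: it splits on whitespace and builds its own tiny vocabulary, whereas a real tokenizer such as BERTweet's uses a learned subword vocabulary. The helper names (`toy_tokenize`, `build_vocab`, `numericalize`) are made up for this example.

```python
# Toy illustration only: real tokenizers use learned subword vocabularies,
# not whitespace splitting.

def toy_tokenize(text):
    """Split text into lowercase, whitespace-delimited tokens."""
    return text.lower().split()

def build_vocab(token_lists, specials=('<pad>', '<unk>')):
    """Assign an integer id to every distinct token, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(tokens, vocab):
    """Map tokens to ids, falling back to <unk> for unseen tokens."""
    return [vocab.get(tok, vocab['<unk>']) for tok in tokens]

reviews = ["good doctor terrible staff", "the wings are simply fantastic"]
tokenized = [toy_tokenize(r) for r in reviews]
vocab = build_vocab(tokenized)

ids = numericalize(toy_tokenize("terrible wings"), vocab)
print(ids)  # [4, 7]
```

The model never sees the text itself, only lists of ids like this one, which is why the tokenizer has to match the model it was trained with.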

In [9]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'finiteautomata/bertweet-base-sentiment-analysis'
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [10]:
train_df = train_df[['input', 'label']]

Once we have the tokenizer handy, we can wrap it in a function with a few of the necessary parameters we need to pass to the tokenizer. 

* We need to limit reviews to a length of 128 tokens, since the model only supports this many.
* If a review is too long, we truncate it
* If a review is too short, we pad it
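
The truncation and padding rules above can be sketched in plain Python. This is a simplified stand-in, not what the tokenizer does internally: the hypothetical `pad_or_truncate` helper just cuts the list, while a real tokenizer also makes sure the end-of-sequence token survives truncation. The pad id of 1 is an assumption for this sketch; the mask marks which positions are real tokens (1) and which are padding (0).

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Return (ids, attention_mask), each exactly max_length long."""
    ids = list(ids)[:max_length]                   # truncate long inputs
    n_pad = max_length - len(ids)                  # how much padding is needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, attention_mask

# A short input gets padded on the right...
short_ids, short_mask = pad_or_truncate([0, 1292, 54559, 2], max_length=6, pad_id=1)
print(short_ids, short_mask)  # [0, 1292, 54559, 2, 1, 1] [1, 1, 1, 1, 0, 0]

# ...and a long input gets cut down to max_length.
long_ids, long_mask = pad_or_truncate(range(10), max_length=6, pad_id=1)
print(long_ids, long_mask)    # [0, 1, 2, 3, 4, 5] [1, 1, 1, 1, 1, 1]
```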

In [11]:
def tokenize_input(x):
    return tokenizer(x['input'], max_length=128, truncation=True, padding='max_length')

Then, we can convert our dataframe to a *Dataset* from the Hugging Face *datasets* library (the Hugging Face ecosystem is what we'll use for everything NLP), and run our reviews through the tokenizer.

In [12]:
from datasets import Dataset
train_ds = Dataset.from_pandas(train_df).map(tokenize_input, batched=True)



In [67]:
train_ds[0]['input_ids'][:10]

[0, 1292, 54559, 1065, 13892, 7, 6, 15297, 15, 166]

We can see that our review has been turned into a nice set of numbers!

We also need to split our data into training and validation sets. Let's do that now:

In [13]:
train_validation_dds = train_ds.train_test_split(0.2, seed=98)


Now we're ready to train our model. We'll need to define the necessary hyperparameters; in this case I'll just use the ones we used in lecture.

In [14]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments('outputs', learning_rate=8e-5, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=128, per_device_eval_batch_size=128*2,
    num_train_epochs=4, weight_decay=0.01, report_to='none')
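
To get a feel for what `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` mean for the learning rate, here is a rough sketch of the schedule's shape. It assumes the usual linear warmup followed by a half-cosine decay to zero, and 3,000 total optimization steps (96,000 training rows at batch size 128 for 4 epochs). The `lr_at_step` function is a hypothetical helper, not part of the library.

```python
import math

def lr_at_step(step, base_lr=8e-5, total_steps=3000, warmup_ratio=0.1):
    """Approximate learning rate at a given step (linear warmup, cosine decay)."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Half-cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 150, 300, 1650, 3000):
    print(step, f"{lr_at_step(step):.2e}")
```

Under these assumptions the rate peaks at 8e-5 after the first 300 steps and then decays smoothly, so the later epochs take much smaller steps than the early ones.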

One important note is that this model was trained to give three output labels:
* 0 - Negative
* 1 - Neutral
* 2 - Positive

However, our training set only defines:
* 0 - Negative
* 1 - Positive

This is not a big deal: we just need to make sure to train the model to give two output labels instead of three. (You'd apply a softmax across these outputs to get class probabilities.)

We'll also need to set *ignore_mismatched_sizes* to True so that loading doesn't fail on the final layer, whose shape no longer matches the checkpoint and which therefore gets newly initialized.
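
As a small illustration of the softmax step mentioned above, here is a sketch that turns two raw model outputs (logits) into probabilities and picks a label. The logit values are made up for the example, and the label order (0 = negative, 1 = positive) follows our training set.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ['negative', 'positive']   # index 0 = negative, index 1 = positive
logits = [-1.2, 2.3]                # made-up logits, not real model output

probs = softmax(logits)
predicted = LABELS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Softmax preserves the ordering of the logits, so taking the largest logit directly gives the same label; the probabilities are mainly useful for seeing how confident the model is.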

In [15]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, ignore_mismatched_sizes=True)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at finiteautomata/bertweet-base-sentiment-analysis and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([2]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we can start training:

In [16]:
trainer = Trainer(model, args, train_dataset=train_validation_dds['train'], eval_dataset=train_validation_dds['test'],
                  tokenizer=tokenizer)

Using cuda_amp half precision backend


In [17]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input. If input are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 96000
  Num Epochs = 4
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 128
  Gradient Accumulation steps = 1
  Total optimization steps = 3000


Epoch,Training Loss,Validation Loss
1,0.2448,0.146267
2,0.105,0.145621
3,0.0483,0.172505
4,0.016,0.21067


Saving model checkpoint to outputs/checkpoint-500
Configuration saved in outputs/checkpoint-500/config.json
Model weights saved in outputs/checkpoint-500/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-500/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-500/special_tokens_map.json
added tokens file saved in outputs/checkpoint-500/added_tokens.json
The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input. If input are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 24000
  Batch size = 256
Saving model checkpoint to outputs/checkpoint-1000
Configuration saved in outputs/checkpoint-1000/config.json
Model weights saved in outputs/checkpoint-1000/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-1000/tokenizer_config.json
Special tokens fi

TrainOutput(global_step=3000, training_loss=0.09652806345621745, metrics={'train_runtime': 4564.416, 'train_samples_per_second': 84.129, 'train_steps_per_second': 0.657, 'total_flos': 2.525866131456e+16, 'train_loss': 0.09652806345621745, 'epoch': 4.0})

It looks like after 2 epochs our model began to overfit, as our training loss continued to go down while our validation loss began to climb back up.
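
As a quick check of that reading, the sketch below picks the epoch with the lowest validation loss from the table above and works out which optimization step it ends on. The step arithmetic assumes the 96,000 training examples and batch size of 128 reported in the training log.

```python
import math

# (epoch, training loss, validation loss), copied from the training table above
history = [
    (1, 0.2448, 0.146267),
    (2, 0.1050, 0.145621),
    (3, 0.0483, 0.172505),
    (4, 0.0160, 0.210670),
]

# The best epoch is the one with the lowest validation loss
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

num_train_examples = 96_000
batch_size = 128
steps_per_epoch = math.ceil(num_train_examples / batch_size)
best_step = best_epoch * steps_per_epoch

print(best_epoch, steps_per_epoch, best_step)  # 2 750 1500
```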

We can grab the checkpoint after the second epoch and continue from there:

In [19]:
from transformers import RobertaForSequenceClassification
model = RobertaForSequenceClassification.from_pretrained('./outputs/checkpoint-1500')

loading configuration file ./outputs/checkpoint-1500/config.json
Model config RobertaConfig {
  "_name_or_path": "finiteautomata/bertweet-base-sentiment-analysis",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 130,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "tokenizer_class": "BertweetTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 64001
}

loading weights file ./outputs/checkpoint-1500/pytorch_model

Now we can run our model against the test set to see how we perform:

In [43]:
test_df = pd.read_csv('/kaggle/working/yelp_review_polarity_csv/test.csv', nrows=120000, names=['label', 'review'])
test_df['label'] -= 1
test_df.fillna('', inplace=True)
test_df['input'] = test_df.review
test_df = test_df[['input', 'label']]
test_ds = Dataset.from_pandas(test_df).map(tokenize_input, batched=True)


In [45]:
eval_trainer = Trainer(model, args, eval_dataset=test_ds)

Using cuda_amp half precision backend


In [46]:
eval_trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: input. If input are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 38000
  Batch size = 256


{'eval_loss': 0.14627839624881744,
 'eval_runtime': 132.8684,
 'eval_samples_per_second': 285.997,
 'eval_steps_per_second': 1.121}

Our loss on the test set is roughly the same as our loss on our validation set, which is an encouraging sign that we didn't mess up badly.
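
Loss alone is a bit hard to interpret for a classifier. One option, which I did not run here, is to also report accuracy by passing a `compute_metrics` function to the `Trainer`. Below is a minimal sketch, assuming the function receives a (logits, labels) pair as the `Trainer` provides; it is written in plain Python so it works on nested lists as well as arrays, and `accuracy_from_logits` is a helper name made up for this example.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

def compute_metrics(eval_pred):
    """Metrics callback: unpack (logits, labels) and report accuracy."""
    logits, labels = eval_pred
    return {'accuracy': accuracy_from_logits(logits, labels)}

# Tiny made-up example: two of the three predictions match their labels
example_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
example_labels = [1, 0, 0]
print(compute_metrics((example_logits, example_labels)))
```

It would be wired in with something like `Trainer(model, args, eval_dataset=test_ds, compute_metrics=compute_metrics)`, after which `evaluate()` should include an `eval_accuracy` entry next to `eval_loss`.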

Now let's test out our tuned sentiment classifier! First I'll write a quick function to pass text through the transformer:

In [53]:
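import torch

def predict_on_text(text):
    # Truncate to the 128-token limit used in training and move inputs to the model's device
    tokenized = tokenizer(text, truncation=True, max_length=128, return_tensors='pt').to(model.device)
    model.eval()  # make sure dropout is disabled for inference
    with torch.no_grad():
        logits = model(**tokenized).logits
    # Index 1 is positive and index 0 is negative, matching our training labels
    predicted_class = torch.argmax(logits, dim=-1).item()
    return 'positive' if predicted_class == 1 else 'negative'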

I'd love to test it out on some real reviews for my favorite restaurant in Chicago, *Au Cheval* (best burgers in the world).

In [57]:
positive_review = "Everything was delicious and all the staff were very friendly/helpful! They don’t take any reservations so we arrived around 5:30 to put our name down. We waited about an hour, a little less. \
After we were seated, our server arrived quickly and took our orders. He was super attentive-he removed empty plates quickly and refilled our drinks right away. \
The burger was perfect, as expected. The salad was bright and fresh, a great complement to the heavy main course. The fries were perfectly crispy and salty."

neutral_ish_review = "Not as good as it used to be but still solid. Probably not worth waiting for anymore, but definitely worth it without the line."

negative_review = "I really don’t get it. \
We waited two hours after putting our name down. They text you when the table is ready and you have 10 minutes to show up before you lose your spot. \
The burger was painfully average, even with the added egg and bacon. The $9 side of fries were also… just plain old fries. \
I’m usually not one to be overly-critical, but seriously not worth the hype. Maybe my expectations were just much too high thanks to reading reviews. \
Regardless, save your time and money by going elsewhere."
# don't listen to this, you should go

In [58]:
predict_on_text(positive_review)

'positive'

In [59]:
predict_on_text(neutral_ish_review)

'positive'

In [60]:
predict_on_text(negative_review)

'negative'

It's interesting that our neutral review was marked as positive; I would have guessed it would be negative. Our model performed as expected on the positive and negative reviews.

That's it for this mini-project! I was able to successfully fine-tune an existing NLP transformer and train it to perform well on Yelp reviews. This was a nice intro to the Hugging Face ecosystem. It seems like there is tons of cool work going on over there: even within sentiment analysis I had endless models to choose from on the Model Hub. Looking forward to revisiting NLP at some point soon!