In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/trip-advisor/tripadvisor_hotel_reviews.csv


### Introduction
Nearly 9 in 10 consumers rely on online reviews. Reviews matter to companies because they help boost sales and show whether customers are satisfied with a product or service. Reading and responding to reviews is therefore as valuable as collecting them. However, many businesses lack the budget and time to assess every review and respond promptly. In this project, we apply techniques that help companies analyze reviews efficiently.

The data comes from TripAdvisor, the world’s largest travel website. We use a Kaggle dataset of hotel reviews from 2020.

The dataset has two columns: Review (text) and Rating (1–5).

The goal is to use Bidirectional Encoder Representations from Transformers (BERT) for sentiment analysis. Specifically, we use DistilBERT, a smaller distilled version of BERT, to turn each review into a feature vector, and then classify those vectors.

**Step 1: We first install the transformers library and import all the necessary packages.**

In [2]:
!pip install -qq transformers

In [3]:
import transformers as ppb
from transformers import BertModel, BertTokenizer
import torch

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

**Step 2: We then import the data and use df.head() to look at the first five rows of the data frame. In this step, we create a binary variable called label that splits the reviews into two classes by rating: ratings above 3 (that is, 4 or 5) get label 1, and ratings of 3 or below get label 0.**

To keep the computation light, we work with only the first 500 reviews, which we call batch_1.

In [4]:
df = pd.read_csv('../input/trip-advisor/tripadvisor_hotel_reviews.csv')
df.head()

| | Review | Rating |
|---|---|---|
| 0 | nice hotel expensive parking got good deal sta... | 4 |
| 1 | ok nothing special charge diamond member hilto... | 2 |
| 2 | nice rooms not 4* experience hotel monaco seat... | 3 |
| 3 | unique, great stay, wonderful time hotel monac... | 5 |
| 4 | great stay great stay, went seahawk game aweso... | 5 |


In [5]:
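# Ratings above 3 are positive (1); ratings of 3 or below are negative (0)
df.loc[df['Rating'] > 3, 'label'] = 1
df.loc[df['Rating'] <= 3, 'label'] = 0

# astype returns a new Series, so assign it back to keep integer labels
df['label'] = df['label'].astype(int)
df = df.drop(columns = 'Rating')

batch_1 = df[:500]
batch_1['label'].value_counts()

1    299
0    201
Name: label, dtype: int64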

**The counts above show a roughly 60/40 split between positive and negative reviews, so the classes are reasonably balanced and we do no resampling before modeling.**
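The labeling rule and the balance check can be sketched on toy data. This is a minimal illustration, assuming the rule used in the code cell (Rating > 3 is positive); the helper names `binarize_ratings` and `class_shares` are hypothetical and not part of the notebook.

```python
import numpy as np

def binarize_ratings(ratings, threshold=3):
    """Return 1 for ratings strictly above the threshold, else 0."""
    ratings = np.asarray(ratings)
    return (ratings > threshold).astype(int)

def class_shares(labels):
    """Return the fraction of samples in class 0 and in class 1."""
    counts = np.bincount(np.asarray(labels), minlength=2)
    return counts / counts.sum()

# Ratings of the five rows shown by df.head()
toy_ratings = [4, 2, 3, 5, 5]
toy_labels = binarize_ratings(toy_ratings)
print(toy_labels)  # [1 0 0 1 1]

# Shares implied by the counts reported above (299 positive, 201 negative)
shares = class_shares([1] * 299 + [0] * 201)
print(shares.round(3))  # [0.402 0.598]
```

Note that a rating of exactly 3 falls into the negative class under this rule.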

**Step 3: In this step, we load the pre-trained DistilBERT model and its tokenizer.**

In [6]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel,
                                                   ppb.DistilBertTokenizer,
                                                   'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


**Step 4: Here, we tokenize every review in the batch, truncating each one to at most 250 tokens. Printing the result shows that each review is now a list of token IDs.**

In [7]:
tokenized = batch_1['Review'].apply((lambda x: tokenizer.encode(x, add_special_tokens = True,
                                                               truncation = True,
                                                               max_length = 250,
                                                               padding = True)))
print(tokenized)

0      [101, 3835, 3309, 6450, 5581, 2288, 2204, 3066...
1      [101, 7929, 2498, 2569, 3715, 6323, 2266, 1548...
2      [101, 3835, 4734, 2025, 1018, 1008, 3325, 3309...
3      [101, 4310, 1010, 2307, 2994, 1010, 6919, 2051...
4      [101, 2307, 2994, 2307, 2994, 1010, 2253, 2712...
                             ...                        
495    [101, 21068, 3737, 3200, 6919, 3971, 5227, 603...
496    [101, 20783, 3976, 2100, 2994, 15481, 4156, 47...
497    [101, 26931, 1010, 2672, 3309, 11250, 3825, 39...
498    [101, 2307, 3199, 3309, 17463, 14326, 3446, 15...
499    [101, 5151, 20783, 26167, 2164, 4121, 2282, 44...
Name: Review, Length: 500, dtype: object
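To make the printed IDs easier to read, here is a toy sketch of what encoding does: it maps words to IDs, truncates, and wraps the result in the special tokens 101 ([CLS]) and 102 ([SEP]). The mini-vocabulary is hypothetical; its IDs are read off the output above, assuming each of these words maps to a single token. The real tokenizer uses WordPiece and may split a word into several pieces, which this sketch does not attempt.

```python
CLS_ID, SEP_ID, UNK_ID = 101, 102, 100  # special-token IDs in the BERT uncased vocabulary

# Hypothetical mini-vocabulary; IDs inferred from the printed output above
TOY_VOCAB = {
    "nice": 3835, "hotel": 3309, "expensive": 6450,
    "parking": 5581, "great": 2307, "stay": 2994,
}

def toy_encode(text, max_length=250):
    """Whitespace-split the text, look up IDs, truncate, and add [CLS]/[SEP]."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[: max_length - 2]  # leave room for the two special tokens
    return [CLS_ID] + ids + [SEP_ID]

print(toy_encode("nice hotel expensive parking"))
# [101, 3835, 3309, 6450, 5581, 102]

print(toy_encode("nice hotel expensive parking", max_length=4))
# [101, 3835, 3309, 102]
```

The second call shows why truncated reviews in the notebook still end with 102: the limit counts the special tokens too.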


**Step 5: Do the padding. The dataset is currently a pandas Series of lists. Before DistilBERT can process it as input, we need to make all the vectors the same size by padding the shorter reviews. We set the maximum length to 250 tokens: longer reviews were already truncated to 250 tokens during tokenization, and shorter ones are padded with 0 up to that length.**

In [8]:
# Padding: every token list must have the same length

# Start from 250 and grow only if some list is longer (none is after truncation)
max_length = max(250, max(len(i) for i in tokenized.values))

# Append the pad ID 0 to each list until it reaches max_length
padded = np.array([i + [0] * (max_length - len(i)) for i in tokenized.values])
input_ids = torch.tensor(padded)
print(input_ids)
input_ids.shape

tensor([[  101,  3835,  3309,  ...,     0,     0,     0],
        [  101,  7929,  2498,  ...,  2460,  9109,   102],
        [  101,  3835,  4734,  ...,  1005,  1056,   102],
        ...,
        [  101, 26931,  1010,  ...,     0,     0,     0],
        [  101,  2307,  3199,  ...,     0,     0,     0],
        [  101,  5151, 20783,  ...,     0,     0,     0]])


torch.Size([500, 250])
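The padding step can be tried on a few short lists. This is a sketch with a hypothetical helper `pad_or_truncate` and a toy length of 5 instead of 250; it also clips over-long lists, which the notebook leaves to the tokenizer.

```python
import numpy as np

def pad_or_truncate(sequences, max_length, pad_id=0):
    """Return a (len(sequences), max_length) array, right-padded with pad_id."""
    out = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        clipped = seq[:max_length]  # naive clip: may drop the trailing [SEP] (102)
        out[row, : len(clipped)] = clipped
    return out

toy = [
    [101, 3835, 3309, 102],              # shorter than 5: gets one pad
    [101, 2307, 102],                    # shorter than 5: gets two pads
    [101, 2307, 2994, 2307, 2994, 102],  # longer than 5: clipped
]
padded_toy = pad_or_truncate(toy, max_length=5)
print(padded_toy.tolist())
# [[101, 3835, 3309, 102, 0], [101, 2307, 102, 0, 0], [101, 2307, 2994, 2307, 2994]]
```

Because the clip here is naive, the third row loses its closing 102; truncating inside the tokenizer, as the notebook does, avoids that.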

**Step 6: Build the attention mask. The mask has the same shape as the padded matrix and holds 1 for every real token and 0 for every pad token, so that DistilBERT ignores the padded positions.**

In [9]:
# Masking

attention_mask = np.where(padded != 0, 1, 0)
attention_mask = torch.tensor(attention_mask)
print(attention_mask)
attention_mask.shape

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])


torch.Size([500, 250])
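On a small padded matrix, the mask is easy to inspect by eye. The sketch below assumes, as the notebook does, that ID 0 is used only for padding; the variable names are illustrative.

```python
import numpy as np

# Two toy reviews already padded to length 5 with the pad ID 0
padded_toy = np.array([
    [101, 3835, 3309, 102, 0],
    [101, 2307, 102, 0, 0],
])

# 1 where there is a real token, 0 where there is padding
mask_toy = (padded_toy != 0).astype(int)
print(mask_toy.tolist())  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]

# Summing each row recovers the unpadded length of each review
lengths = mask_toy.sum(axis=1)
print(lengths.tolist())  # [4, 3]
```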

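**Step 7: We now pass the padded token matrix and the attention mask to DistilBERT. From the output we keep only the hidden state at position 0 of each review (the [CLS] token), which gives one 768-dimensional feature vector per review.**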

In [10]:
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask = attention_mask, return_dict = False)

In [11]:
features = last_hidden_states[0][:, 0, :].numpy()
labels = batch_1['label']

In [12]:
features.shape

(500, 768)

In [13]:
labels.shape

(500,)
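The slice `[:, 0, :]` is what turns a three-dimensional output into a two-dimensional feature matrix. The sketch below reproduces it on a random array with toy sizes (4 reviews, 6 positions, 8 hidden units) standing in for the notebook's 500, 250, and 768; no model is involved.

```python
import numpy as np

batch_size, seq_len, hidden_size = 4, 6, 8  # toy stand-ins for 500, 250, 768

# Stand-in for the model's last hidden states: one vector per token position
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(batch_size, seq_len, hidden_size))

# Keep every review, only position 0 (the [CLS] slot), and all hidden units
cls_features = hidden_states[:, 0, :]
print(cls_features.shape)  # (4, 8)
```

Each row of `cls_features` is the vector at the first position of the corresponding review, which is why the feature matrix in the notebook has one row per review and one column per hidden unit.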

**Step 8: Split the features into training and test sets, fit a logistic regression classifier on the training set, and evaluate it on the test set.**

In [14]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression()

In [15]:
lr_clf.score(test_features, test_labels)

0.784

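**Conclusion: our model reaches an accuracy of 0.784 on the test set, compared with roughly 0.60 from always predicting the majority class in this batch. That is a good result for features taken from DistilBERT without any fine-tuning.**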
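As a sanity check on the score printed above (0.784), the arithmetic below works out how many test reviews it corresponds to and what a trivial baseline would score. It assumes scikit-learn's default test fraction of 0.25, since `train_test_split` was called without `test_size`, and it uses the class counts of the whole 500-review batch; the majority share inside the random test split may differ slightly.

```python
import math

n_samples = 500
test_fraction = 0.25  # scikit-learn's default when test_size is not given

n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test
print(n_train, n_test)  # 375 125

# Number of correctly classified test reviews implied by the reported score
reported_accuracy = 0.784
n_correct = round(reported_accuracy * n_test)
print(n_correct)  # 98

# Baseline: always predict the majority class (299 positives out of 500)
majority_share = 299 / 500
print(majority_share)  # 0.598
```

Because the split is random and no `random_state` is set, rerunning the notebook will give a somewhat different score each time.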