In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import tensorflow as tf


In [None]:
df1 = pd.read_csv(r'archive\1429_1.csv')

df1.head()


After going through the data in the CSV files, we keep the following columns for the final dataset, since these are the ones that contain meaningful information:

1. name
2. asins
3. categories
4. reviews.date
5. reviews.doRecommend
6. reviews.numHelpful
7. reviews.rating
8. reviews.text
9. reviews.title

In [None]:
# columns to keep in the final dataset
keys = ['name',
        'asins',
        'categories',
        'reviews.date',
        'reviews.doRecommend',
        'reviews.numHelpful',
        'reviews.rating',
        'reviews.text',
        'reviews.title']
# select the columns by label; .copy() gives an independent frame
main_df = df1[keys].copy()
print(main_df.shape)
main_df.head()


In [None]:
# checking for null values in each column
main_df.isna().sum()

In [None]:
main_df.to_csv('final_dataset.csv', index=False)

# Selecting Features for the Model

To train a sentiment analysis model we need two columns: **reviews.rating**, which serves as the label, and **reviews.text**, which provides the input features.
Rows whose rating is NaN cannot be used for training, but they can be set aside and used later as an unlabeled test set to spot-check the model's predictions.

In [3]:
main_df = pd.read_csv('/content/drive/MyDrive/Datasets/final_dataset.csv')

In [4]:
dataset = pd.DataFrame(columns=['data', 'label'])
dataset['data'] = main_df['reviews.text']
dataset['label'] = main_df['reviews.rating']

#checking for na values
print(dataset.isna().sum())

# dropping na values
dataset.dropna(inplace=True)
print(dataset.shape)

dataset.head()

data      1
label    33
dtype: int64
(34626, 2)


,data,label
0,This product so far has not disappointed. My c...,5.0
1,great for beginner or experienced person. Boug...,5.0
2,Inexpensive tablet for him to use and learn on...,5.0
3,I've had my Fire HD 8 two weeks now and I love...,4.0
4,I bought this for my grand daughter when she c...,5.0


In [5]:
print(dataset.where(dataset.label == 3.0).count())
print(dataset.where(dataset.label > 3.0).count())
print(dataset.where(dataset.label < 3.0).count())


data     1499
label    1499
dtype: int64
data     32315
label    32315
dtype: int64
data     812
label    812
dtype: int64


In [6]:
# converting the ratings on the label column such that 1: 'positive' and 0: 'negative'
# ratings 1 - 3 = negative, 4 - 5 = positive

dataset.replace({'label': {1.0:0, 2.0:0, 3.0:0, 4.0:1, 5.0:1}}, inplace=True)
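
The mapping above amounts to thresholding the rating at 4. The sketch below applies that threshold to a small made-up frame (`toy` is hypothetical, not part of the notebook's data) and then uses the counts printed by the previous cell to work out how balanced the two classes are after the conversion.

```python
import pandas as pd

# Hypothetical ratings, used only to illustrate the mapping.
toy = pd.DataFrame({"label": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Ratings of 4 and 5 become 1 (positive); ratings 1 to 3 become 0 (negative).
toy["binary"] = (toy["label"] >= 4).astype(int)

# Counts printed earlier for ratings == 3, > 3 and < 3.
neutral, positive, negative = 1499, 32315, 812
total = neutral + positive + negative
positive_share = positive / total
negative_share = (neutral + negative) / total

print(toy["binary"].tolist())
print(total, round(positive_share, 4), round(negative_share, 4))
```

Since roughly 93% of the reviews end up in the positive class, a model that always answers "positive" would already reach about 93% accuracy, so accuracy on its own should be read with some care here.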

In [7]:
# splitting into train and test datasets
train = dataset.sample(frac=0.8)
test = dataset.drop(train.index)

print(train.shape)
print(test.shape)

(27701, 2)
(6925, 2)
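
The split above is random, so it changes on every run and the rare negative class may be spread unevenly between the two parts. One possible alternative, sketched here on a hypothetical toy frame and assuming pandas 1.1 or newer (for `groupby(...).sample`), is to sample 80% of each class separately with a fixed `random_state`.

```python
import pandas as pd

# Hypothetical, imbalanced toy data standing in for `dataset`.
toy = pd.DataFrame({
    "data": [f"review {i}" for i in range(100)],
    "label": [0] * 10 + [1] * 90,
})

# Sample 80% of each class separately; random_state makes the split repeatable.
train = toy.groupby("label").sample(frac=0.8, random_state=42)
test = toy.drop(train.index)

print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

Both parts then keep the same ratio of negative to positive reviews as the full dataset.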


# Installing Hugging Face Transformers Library

Transformers provides thousands of pretrained models for text tasks such as classification, information extraction, question answering, summarization, translation, and text generation, in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone.

We can easily load a pre-trained BERT model from the Transformers library.

In [8]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transfor

Next, we import the pre-trained BERT tokenizer and sequence classifier, along with the InputExample and InputFeatures helper classes. We then build our model from the sequence classifier and our tokenizer from BERT’s tokenizer.

In [9]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [10]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
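
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard hidden size of 768 for bert-base models and the default of two output labels for the classification head; the BERT parameter count is copied from the summary.

```python
hidden_size = 768      # hidden width of bert-base models
num_labels = 2         # default number of labels for the classification head

# A dense layer has one weight per (input, output) pair plus one bias per output.
classifier_params = hidden_size * num_labels + num_labels
bert_params = 109_482_240          # "bert" row of the summary
total_params = bert_params + classifier_params

print(classifier_params, total_params)
```

The classifier is the only part that starts from random weights, which is why the loading message asks for the model to be trained on a downstream task before use.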


In [53]:
def convert_data_to_examples(train, test):
  '''
  This will accept the train and test datasets and convert each row into an InputExample object
  
  Params
  --------------
  train: train dataframe
  test: validation/test dataframe
  
  Returns
  ---------
  train_InputExamples
  validation_InputExamples
  '''
  train_InputExamples = train.apply(lambda x: InputExample(guid=None,
                                                           text_a=x['data'],
                                                           text_b=None,
                                                           label=x['label']), axis=1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None,
                                                               text_a=x['data'],
                                                               text_b=None,
                                                               label=x['label']), axis=1)
  return train_InputExamples, validation_InputExamples

def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    """
    This function will tokenize the InputExample objects, 
    then create the required input format with the tokenized objects, 
    finally, create an input dataset that we can feed to the model.
    """
    features = []

    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length,
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids = input_dict['input_ids']
        token_type_ids = input_dict['token_type_ids']
        attention_mask = input_dict['attention_mask']

        features.append(
            InputFeatures(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids,
                # cast to int so the label matches the tf.int64 type declared below
                label=int(e.label)
            )
        )
    def gen():
        for f in features:
            yield (
                {
                    'input_ids':f.input_ids, 
                    'attention_mask':f.attention_mask,
                    'token_type_ids':f.token_type_ids
                },
                f.label
            )
    return tf.data.Dataset.from_generator(
        gen,
        (
            {
                "input_ids": tf.int32, 
                "attention_mask": tf.int32, 
                "token_type_ids": tf.int32
            }, 
            tf.int64
        ),

        (
            {
            'input_ids':tf.TensorShape([None]),
            'attention_mask':tf.TensorShape([None]),
            'token_type_ids':tf.TensorShape([None])
            },
            tf.TensorShape([])
        )
    )
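
To make the effect of `max_length`, padding and truncation more concrete, here is a simplified sketch in plain Python. The helper `pad_or_truncate` and the token ids are made up for illustration; the real tokenizer also handles the special tokens when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncation: drop anything past max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    padding = max_length - len(ids)       # number of pad positions to append
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids; real ids come from the BERT vocabulary.
short = [101, 2307, 102]
long = [101, 1, 2, 3, 4, 5, 6, 102]

print(pad_or_truncate(short, max_length=6))
print(pad_or_truncate(long, max_length=6))
```

Every example ends up with the same length, and the attention mask tells the model which positions are padding and should be ignored.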

In [12]:
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test)

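train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
# shuffle with a 100-element buffer, batch, then repeat the batched data twice;
# together with epochs=2 in model.fit, each training example is seen four times
train_data = train_data.shuffle(100).batch(32).repeat(2)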

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32)
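
As a rough check on how much training this pipeline amounts to, the number of batches can be worked out from the shapes printed after the split. This is plain arithmetic, assuming that `repeat(2)` is applied after `batch(32)` as in the cell above and that training runs for two epochs as in the `model.fit` call.

```python
import math

train_rows, val_rows = 27701, 6925    # shapes printed after the split
batch_size, repeats, epochs = 32, 2, 2

batches_per_pass = math.ceil(train_rows / batch_size)   # the last batch is partial
steps_per_epoch = batches_per_pass * repeats            # repeat(2) doubles each epoch
total_steps = steps_per_epoch * epochs
val_steps = math.ceil(val_rows / batch_size)

print(batches_per_pass, steps_per_epoch, total_steps, val_steps)
```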



In [14]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data, verbose=1)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f9cd44dc610>

# Making Predictions

In [16]:
# sample reviews to check if model is able to predict the review as positive or negative
pred_sentences = ["All of them quit working. There's absolutely no way to revive them once that battery loses power If NOT charged immediately, if even just a few weeks go by, NO HOPE of ever using your Kindle again. A major flaw",
                  "Great. Love it.",
                  "I would highly recommend them to everyone. Got it when I was told that I would get it. Very happy."]


We first tokenize the reviews with our pre-trained BERT tokenizer and feed the tokenized sequences to the model, which returns one logit per class. Applying a softmax turns these logits into class probabilities, and the argmax function then picks the more likely class, telling us whether the predicted sentiment for each review is positive or negative.
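
As a small illustration of the softmax and argmax step, the sketch below uses NumPy on made-up logits (the values are hypothetical and are not model outputs).

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits for three reviews; columns are [negative, positive].
logits = np.array([[2.0, -1.0], [-0.5, 3.0], [0.1, 0.4]])

probs = softmax(logits)
predicted = probs.argmax(axis=-1)
labels = ["Negative", "Positive"]

print(probs.round(3))
print([labels[i] for i in predicted])
```

Softmax preserves the ordering of the logits, so taking the argmax of the raw logits would give the same class; the softmax is mainly useful when the probabilities themselves are of interest.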

In [17]:
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])

All of them quit working. There's absolutely no way to revive them once that battery loses power If NOT charged immediately, if even just a few weeks go by, NO HOPE of ever using your Kindle again. A major flaw : 
 Negative
Great. Love it. : 
 Positive
I would highly recommend them to everyone. Got it when I was told that I would get it. Very happy. : 
 Positive


In [23]:
model.save_weights('/content/drive/MyDrive/Colab Notebooks/Sentiment Analysis/weights.h5')

In [29]:
# creating a test set which contains all the reviews in the dataset which have missing ratings

dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Sentiment Analysis/final_dataset.csv')

test_set = pd.DataFrame(columns=['data', 'label'])

test_set['data'] = dataset['reviews.text']
test_set['label'] = dataset['reviews.rating']

test_set = test_set[test_set.label.isna()].copy()

test_set.head()

,data,label
2886,The Kindle is my first e-ink reader. I own an ...,
2887,"I'm a first-time Kindle owner, so I have nothi...",
2888,UPDATE NOVEMBER 2011:My review is now over a y...,
2889,"I'm a first-time Kindle owner, so I have nothi...",
2890,I woke up to a nice surprise this morning: a n...,


In [52]:
# designing a function to perform predictions on the test dataset

def predict_sentiment(test):
    '''
    Predicts a sentiment label for every review in test['data'].
    The whole frame is tokenized and passed through the model as one batch.
    Note: the label column of the frame passed in is overwritten.
    '''
    tf_batch = tokenizer(list(test['data']), max_length=128, padding=True, truncation=True, return_tensors='tf')
    tf_outputs = model(tf_batch)
    tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
    labels = ['Negative', 'Positive']
    label = tf.argmax(tf_predictions, axis=1).numpy()
    # map the class indices (0/1) to their names
    test['label'] = [labels[i] for i in label]
    return test

pred = predict_sentiment(test_set)
pred.head()

,data,label
2886,The Kindle is my first e-ink reader. I own an ...,Positive
2887,"I'm a first-time Kindle owner, so I have nothi...",Positive
2888,UPDATE NOVEMBER 2011:My review is now over a y...,Positive
2889,"I'm a first-time Kindle owner, so I have nothi...",Positive
2890,I woke up to a nice surprise this morning: a n...,Positive
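
Running every review through the model in a single call works for the small set of unlabeled reviews used here, but a much larger set would probably need to be processed in chunks. The sketch below shows one way to do that chunking in plain Python; `batched` and `fake_predict` are hypothetical helpers, with `fake_predict` standing in for the tokenizer and model call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_predict(texts):
    """Hypothetical stand-in for the model: labels a text by a keyword."""
    return ["Negative" if "broken" in t else "Positive" for t in texts]

reviews = ["love it", "arrived broken", "works well", "great", "broken screen"]

predictions = []
for chunk in batched(reviews, batch_size=2):
    # in the notebook, this is where a chunk would be tokenized and fed to the model
    predictions.extend(fake_predict(chunk))

print(predictions)
```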


In [51]:
pred.to_csv('/content/drive/MyDrive/Colab Notebooks/Sentiment Analysis/predictions.csv')