## 1. Import Statements

---



In [1]:
# %%capture
# !pip install transformers

In [5]:
import torch
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, BertModel, BertForSequenceClassification



In [9]:
# Set up the GPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## 2. Load the Data

---


The original code in this section is located in `bert-training.ipynb`. It is included here so that the `get_star_predictions()` function works.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
github_url = 'https://raw.githubusercontent.com/csbanon/bert-product-rating-predictor/master/data/reviews_comments_stars.csv'
df = pd.read_csv(github_url)
df = df[['comment', 'stars']]
df

Unnamed: 0,comment,stars
0,I could sit here and write all about the specs...,5
1,A very reasonably priced laptop for basic comp...,4
2,"This is the best laptop deal you can get, full...",5
3,A few months after the purchase....It is still...,5
4,BUYER BE AWARE: This computer has Microsoft 10...,1
...,...,...
195760,I have not tried this camera without the SD ca...,5
195761,"Hello, I bought this item months ago and I tho...",1
195762,This is an incredible camera for the money!! ...,5
195763,Great cameras. Purchased some for my mother af...,5


In [6]:
train_dataset, test_dataset = train_test_split(df, test_size=0.2, random_state=1)
test_dataset = test_dataset.reset_index(drop=True)

In [7]:
# Switch to the Amazon reviews dataset (columns: rating, reviewText) and keep a reproducible 10% sample.
df = pd.read_csv('../data/amazon_reviews_reviewText_ratings.csv')
df = df.sample(frac=0.1, random_state=34)

## 3. Define the BERT Model

---



The original code in this section is located in `bert-training.ipynb`. It is included here so that the `get_star_predictions()` function works. The output is suppressed to make the notebook easier to read.

In [10]:
%%capture
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(df['rating'].unique()),  # Number of unique labels for our multi-class classification problem.
    output_attentions=False,
    output_hidden_states=False,
)
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
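
The classification head above gets one output per distinct rating, and the dataset class in Section 5 shifts ratings to 0-indexed labels. The sketch below illustrates that mapping in plain Python. The helper names `rating_to_label` and `label_to_rating` and the sample `ratings` list are hypothetical and not part of the original notebook.

```python
# Illustrative ratings only; the real values come from df['rating'].
ratings = [5, 4, 5, 1, 3, 2]

# One class per distinct rating, mirroring len(df['rating'].unique()).
num_labels = len(set(ratings))

def rating_to_label(rating):
    """Map a 1-based star rating to a 0-based class index (hypothetical helper)."""
    return int(rating) - 1

def label_to_rating(label):
    """Map a 0-based class index back to a 1-based star rating (hypothetical helper)."""
    return int(label) + 1

labels = [rating_to_label(r) for r in ratings]
```

Note that counting distinct values only gives the right head size if every star value appears in the data. If a sampled DataFrame happened to miss one rating, `num_labels` would be too small for the `rating - 1` labels.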


## 4. Load the Trained Model

---

Here, we load the trained weights from the `bert_3.pth` checkpoint and switch the model to evaluation mode.

In [None]:
# Load the trained model. map_location lets a checkpoint saved on GPU load on a CPU-only machine.
model.load_state_dict(torch.load('/home/user/IdeaProjects/velotix_ex/models/Model_V3/3/bert_3.pth', map_location=device))
model.eval()

In [18]:
from transformers import BertForSequenceClassification

loaded_model = BertForSequenceClassification.from_pretrained('/home/user/IdeaProjects/velotix_ex/models/Model_V3/3/md')

## 5. Define the Reviews Dataset

---



The original code in this section is located in `star_prediction.ipynb`. It is included here so that the `get_star_predictions()` function works.

In [24]:
class ReviewsDataset(Dataset):
    def __init__(self, df, max_length=512):
        self.df = df
        self.tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        self.max_length = max_length 
        
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
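        # input=review, label=stars
        # Use positional indexing so that sampled or split DataFrames whose index was not reset still work.
        review = self.df.iloc[idx]['reviewText']
        # labels are 0-indexed
        label = int(self.df.iloc[idx]['rating']) - 1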
        
        encoded = self.tokenizer(
            review,                      # Review to encode.
            add_special_tokens=True,
            max_length=self.max_length,  # Truncate all segments to max_length.
            padding='max_length',        # Pad all reviews with the [PAD] token to the max_length.
            return_attention_mask=True,  # Construct attention masks.
            truncation=True
        )
        
        input_ids = encoded['input_ids']
        attn_mask = encoded['attention_mask']
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attn_mask': torch.tensor(attn_mask), 
            'label': torch.tensor(label)
        }
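
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of how a token ID list becomes a fixed-length `input_ids` / `attention_mask` pair. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are illustrative, and the real BERT tokenizer additionally keeps the closing `[SEP]` token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration; pad_id=0 is assumed to be the [PAD] id.
    """
    ids = list(token_ids)[:max_length]   # Truncate sequences that are too long.
    mask = [1] * len(ids)                # 1 marks a real token.
    n_pad = max_length - len(ids)
    # Pad short sequences; 0 in the mask tells the model to ignore those positions.
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([101, 2307, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
```

Because every example ends up with the same length, the `DataLoader` can stack them into a single batch tensor without a custom collate function.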

## 6. Predict the Star Rating

---

The following code takes a string comment and returns a predicted star rating.

In [66]:
def get_single_prediction(comment, model):
  """
  Predict a star rating from a review comment.

  :param comment: the string containing the review comment.
  :param model: the model to be used for the prediction.
  :return: the predicted star rating, from 1 to the number of classes.
  """

  df = pd.DataFrame()
  df['reviewText'] = [comment]
  df['rating'] = ['0']

  dataset = ReviewsDataset(df)

  TEST_BATCH_SIZE = 1
  NUM_WORKERS = 0  # A single example does not need a worker process.

  test_params = {'batch_size': TEST_BATCH_SIZE,
              'shuffle': False,  # Shuffling has no effect on inference.
              'num_workers': NUM_WORKERS}

  data_loader = DataLoader(dataset, **test_params)

  for data in data_loader:

    # Get the tokenization values.
    input_ids = data['input_ids'].to(device)
    mask = data['attn_mask'].to(device)

    # Make the prediction with the trained model; gradients are not needed at inference time.
    with torch.no_grad():
      outputs = model(input_ids, attention_mask=mask)

    # Get the star rating (class indices are 0-based, stars are 1-based).
    big_val, big_idx = torch.max(outputs[0], dim=1)
    star_predictions = (big_idx + 1).cpu().numpy()

  return star_predictions[0]

In [72]:
from transformers import BertTokenizer

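def get_single_prediction2(comment, model, tokenizer):
    """Predict a star rating for a single comment without building a Dataset."""
    # Tokenize the comment by calling the tokenizer directly (the current form of encode_plus).
    inputs = tokenizer(
        comment,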
        return_tensors="pt",
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True
    )
    
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    # Predict
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
    
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    
    # Convert the 0-indexed class back to a 1-based star rating.
    return predictions.item() + 1
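
The last two steps of the function above, taking the argmax of the logits and shifting it by one, can be sketched in plain Python as follows. The softmax confidence is an addition for illustration and is not computed in the original notebook; `logits_to_star` and the sample logits are hypothetical.

```python
import math

def logits_to_star(logits):
    """Return (star_rating, probability) for one list of class logits.

    Hypothetical helper: assumes class index i corresponds to star rating i + 1.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Argmax over the classes, then shift from 0-based index to 1-based stars.
    best = max(range(len(probs)), key=probs.__getitem__)
    return best + 1, probs[best]

# Illustrative logits, not real model output.
star, confidence = logits_to_star([-1.2, -0.5, 0.1, 1.0, 3.4])
```

Softmax does not change which class wins, so the predicted star matches a plain argmax on the logits. The probability is only useful as a rough indication of how decisive the prediction is.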

In [73]:
from transformers import BertForSequenceClassification, BertTokenizer

# Get the star predictions.
model.to(device)
model.eval()
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
prediction = get_single_prediction2("This is an amazing product!", model, tokenizer)
prediction

AttributeError: 'BaseModelOutputWithPoolingAndCrossAttentions' object has no attribute 'logits'

This error arises because `BaseModelOutputWithPoolingAndCrossAttentions` is the output type of a bare `BertModel`, which has no classification head and therefore no `logits`. The `model` variable must refer to a `BertForSequenceClassification` instance, such as `loaded_model` from Section 4, for `get_single_prediction2()` to work.

In [35]:
print(prediction)

[424 373  44 127 206 470 404 398 261 469 257 166 322  99 417 450 368 209
 149 329 126 152 156 447  12 257 352 346 443  45 402 224  78 157 309 236
 320 505 324 364 192  76 483  30 496 394 304  69 489   5  13   8 334 425
  40 182  98 245 497 492 195 492 243   4 140 209 220 321 193 398 276 371
 483 481 414  48 391  38 106 257 253 296 305 276 489 469 410 500  82 300
  46 497 217 156 169 124  50 482  33 179 238 263 105 176 478 338 447 176
 226  80 298  16 325 120 202 267 309 269  92 370 252 508 111   5 344  30
 227   9  82 128 429 220  54 506 154 469 325  35  85 397 369   3 347 134
 198  14 228 369 101  44   4 461 215 299 472 114 357 219  36 126 295  68
 314 113 278 465 350 428 352 398 429 310 255  47 210 218 425  94 191  78
 182 309 262 227 350 199 174 250 255 233  18 118 275 256  31 209 166 114
 305  10 121  63 319 327 423  75 330 462 336 416 410 142 510 306 389 294
 191 448  51 222 325  33  94   9  88 261 360   5  42 231 277 199 316 228
  89  75 155 178 152  85 454 146 112 305 456  58 49

In [23]:
df.head()

Unnamed: 0,rating,reviewText
215769,5,Very Cute
183386,5,I bought this for my Petit1 Pilot pen because ...
240678,5,Nice!
75630,5,"I purchased a home with a sidewalk, driveway a..."
420949,5,a great looking plant
