# Sentiment Analyzer for Yelp Reviews with Transformer (Hugging Face)

Sentiment analysis of reviews scraped from Yelp

**Install the dependency packages**

In [None]:
!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
!pip install transformers==4.6.0

Collecting transformers==4.6.0
  Downloading transformers-4.6.0-py3-none-any.whl (2.3 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.12.1
    Uninstalling tokenizers-0.12.1:
      Successfully uninstalled tokenizers-0.12.1
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.5.1
    Uninstalling huggingface-hub-0.5.1:
      Successfully uninstalled huggingface-hub-0.5.1
  Attempting uninstall: transformers
    Found existing installation: transformers

In [None]:
# !pip install tensorflow

Packages for web scraping

In [None]:
!pip install requests beautifulsoup4



In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import numpy as np
import pandas as pd

Use the pre-trained sentiment model from the Hugging Face Hub:
https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

In [None]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

In [None]:
tokens = tokenizer('I love this pizza')
tokens

{'input_ids': [101, 151, 11157, 10372, 59371, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [None]:
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Create the tokenizer encodings


In [None]:
# return_tensors='pt' returns a PyTorch tensor ('tf' would return a TensorFlow tensor)
tokens = tokenizer.encode('I love this pizza', return_tensors='pt')
tokens

tensor([[  101,   151, 11157, 10372, 59371,   102]])

In [None]:
output = model(tokens)
output.logits

tensor([[-2.2257, -2.5091, -0.9815,  1.2772,  3.5961]],
       grad_fn=<AddmmBackward0>)

In [None]:
int(torch.argmax(output.logits)) + 1

5
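The model scores each of the five star ratings, and the `+ 1` converts the zero-based argmax index into a 1 to 5 star value. As a minimal sketch of how those raw logits relate to probabilities, the snippet below applies a softmax to the logits copied from the output above. It uses only the standard library, and the helper name `softmax` is chosen here for illustration.

```python
import math

# Logits copied from the model output shown above (one value per star rating, 1 to 5).
logits = [-2.2257, -2.5091, -0.9815, 1.2772, 3.5961]

def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# The index of the largest value is zero-based, so add 1 to get the star rating.
rating = probs.index(max(probs)) + 1
print(rating)
for stars, p in enumerate(probs, start=1):
    print(f"{stars} star(s): {p:.3f}")
```

Softmax preserves the ordering of the logits, so the argmax of the probabilities is the same as the argmax of the logits; the probabilities only add a sense of how confident the prediction is.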

Unit Test Code

In [None]:
def unit_test():
  """Prompt for text in a loop and print the predicted rating until 'quit' is entered."""
  test_data = ''
  while test_data != 'quit':
    test_data = input("Please enter the unit test input for the  Sentiment Analysis Model, or enter 'quit': ")
    if test_data == 'quit':
      print('Quitting the unit test')
    else:
      tokens = tokenizer.encode(test_data, return_tensors='pt')
      output = model(tokens)
      # argmax is zero-based, so add 1 to get the 1-5 star rating
      sentiment = int(torch.argmax(output.logits)) + 1
      print(f'The sentiment value is : {sentiment}')

if __name__ == '__main__':
  unit_test()

Please enter the unit test input for the  Sentiment Analysis Model, or enter 'quit': quit
The sentiment value is : 1
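The loop above needs someone typing at the prompt. A non-interactive check could look roughly like the sketch below, which compares predicted ratings against expected ranges. The names `run_checks`, `fake_predict`, and `cases` are hypothetical, and `fake_predict` is a keyword-based stand-in for the real model so that the sketch runs without downloading anything.

```python
def run_checks(predict, cases):
    """Return the cases whose predicted rating falls outside the expected range."""
    failures = []
    for text, (low, high) in cases:
        rating = predict(text)
        if not low <= rating <= high:
            failures.append((text, rating))
    return failures

# Hypothetical stand-in for the real model: 5 stars if the text contains "love", else 1.
def fake_predict(text):
    return 5 if "love" in text.lower() else 1

# Each case pairs an input text with the inclusive range of acceptable ratings.
cases = [
    ("I love this pizza", (4, 5)),
    ("Terrible service, never again", (1, 2)),
]

failures = run_checks(fake_predict, cases)
print(failures)
```

To exercise the actual model, `fake_predict` would be replaced by a function that tokenizes the text, runs the model, and returns the argmax index plus one, as in the cells above.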


Web-scraping the reviews

In [None]:
import requests
from bs4 import BeautifulSoup
import re

In [None]:
# This Yelp URL was chosen arbitrarily for testing
url_data = requests.get('https://www.yelp.ca/biz/seven-lives-tacos-y-mariscos-toronto')
soup = BeautifulSoup(url_data.text, 'html.parser')
# Select <p> tags whose class name contains "comment"
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class': regex})
reviews = [result.text for result in results]
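To see what the class-name regex selects without hitting the network, the same `find_all` call can be run on a small static snippet. This is a sketch only: the markup and class names below are made up for illustration and are not taken from Yelp's actual page.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are invented for this example.
html = """
<div>
  <p class="comment__abc123 css-xyz"><span>Great tacos!</span></p>
  <p class="footer-text">About this site</p>
  <p class="comment__abc123">Long line, worth it.</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Keep only <p> tags that have a class containing "comment"
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class': regex})
# .text flattens nested tags such as <span> into plain text
reviews = [result.text for result in results]
print(reviews)
```

Matching on a substring of the class name makes the selector tolerant of generated suffixes, but it will stop matching if the site renames the class.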

In [None]:
type(reviews),len(reviews)

(list, 10)

Load the reviews into a DataFrame

In [None]:
df_data = pd.DataFrame(np.array(reviews),columns=['reviews'])
df_data.head(2)

                                             reviews
0  TLDR: best taco place in Toronto, would eat ev...
1  The kings of tacos in Toronto.Seven Lives has ...


In [None]:
df_data['reviews']

0    TLDR: best taco place in Toronto, would eat ev...
1    The kings of tacos in Toronto.Seven Lives has ...
2    The best tacos in town!! Amazing selection, so...
3    It's sooooooo good ! I've been wanting to try ...
4    So every time I would look up what to eat in T...
5    MMMM tacos!! And this place does it BIG!They m...
6    Heard about this place from friends in the are...
7    Good Location | Authentic | Quick ServiceI vis...
8    Craving Tacos and Seven Lives is usually my go...
9    Super fun to try! The tacos were incredibly fl...
Name: reviews, dtype: object

In [None]:
def sentiment(review):
  """Return the predicted star rating (1-5) for a single review."""
  # To truncate at the token level instead, pass truncation=True, max_length=512 to the tokenizer
  tokens = tokenizer(review, return_tensors='pt')["input_ids"]
  output = model(tokens)
  return int(torch.argmax(output.logits)) + 1

https://stackoverflow.com/questions/68813979/bert-transformer-size-error-while-machine-traslation

In [None]:
# tokens = tokenizer(df_data['reviews'].iloc[1],return_tensors='pt')["input_ids"]
# type(tokens)
df_data['reviews'].iloc[1]

transformers.tokenization_utils_base.BatchEncoding

In [None]:
# Keep only the first 512 characters of each review, which in practice keeps the input within the model's 512-token limit
df_data['sentimentscore'] = df_data['reviews'].apply(lambda x: sentiment(x[:512]))
df_data

                                             reviews  sentimentscore
0  TLDR: best taco place in Toronto, would eat ev...               5
1  The kings of tacos in Toronto.Seven Lives has ...               5
2  The best tacos in town!! Amazing selection, so...               5
3  It's sooooooo good ! I've been wanting to try ...               4
4  So every time I would look up what to eat in T...               5
5  MMMM tacos!! And this place does it BIG!They m...               5
6  Heard about this place from friends in the are...               1
7  Good Location | Authentic | Quick ServiceI vis...               5
8  Craving Tacos and Seven Lives is usually my go...               4
9  Super fun to try! The tacos were incredibly fl...               4
