## Sentiment Analysis Using BERT Neural Network
The model used here is available at: https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment

This model suits review sentiment analysis because, rather than returning a confidence score between 0 and 1, it predicts the sentiment of the review as a number of stars (between 1 and 5).

Other libraries used include requests and BeautifulSoup. Requests lets us fetch the Yelp page we will be scraping, and BeautifulSoup lets us parse and work through the HTML we receive back from requests.

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

- AutoTokenizer converts a string into a sequence of token IDs (integers) that the NLP model takes as input.
- AutoModelForSequenceClassification gives us the architecture from transformers into which we load the pretrained NLP model.
- We are going to use the argmax function from torch to pick out the position of the highest score in the model's output.
- We import re to build a regular expression that matches the specific page elements we want.

# 1. Instantiate Model

We are going to create two variables, tokenizer and model.
- Tokenizer - we create our tokenizer by calling .from_pretrained on the AutoTokenizer class imported earlier from transformers, which loads the tokenizer that matches the pretrained model.
- Model - we create our model by calling .from_pretrained on the AutoModelForSequenceClassification class imported from transformers, which loads the pretrained weights.

In [2]:
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


# 2. Encode and Calculate Sentiment

Here I am going to encode a practice sentence and check that the tokenizer works as it should:

In [3]:
tokens = tokenizer.encode("I disliked this immensely, the worst outcome possible.", return_tensors='pt')

In [4]:
tokens

tensor([[  101,   151, 23145, 17172, 10163, 10372, 75572, 10563,   117, 10103,
         43060, 80196, 14312,   119,   102]])

Now I will pass the tokens through the model:

In [7]:
result = model(tokens)

In [6]:
result

SequenceClassifierOutput(loss=None, logits=tensor([[ 3.0633,  2.8742,  0.5110, -2.5527, -3.1332]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

### Understanding Results
The output from the model is a list of five raw scores (logits), one per star rating; it is not one-hot encoded. The position with the highest score represents the sentiment rating. E.g. [.9, .2, .1, -2, -5] is a rating of 1, as .9 is the highest value and it is in the first position.

Let's try to make this something more understandable.

In [8]:
result.logits

tensor([[ 3.0633,  2.8742,  0.5110, -2.5527, -3.1332]],
       grad_fn=<AddmmBackward0>)

In [9]:
int(torch.argmax(result.logits))+1

1
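
Taking the argmax discards how confident the model was. As a rough illustration, the logits printed above can be turned into probabilities with a softmax. The sketch below is plain Python so that it runs without the model loaded; the logits list is copied from the output above, and the helper name softmax is our own, not part of the notebook.

```python
import math

# Logits copied from the model output shown above (one per star rating, 1 to 5)
logits = [3.0633, 2.8742, 0.5110, -2.5527, -3.1332]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# Same position-plus-one rule as int(torch.argmax(result.logits)) + 1
stars = probs.index(max(probs)) + 1

for rating, p in enumerate(probs, start=1):
    print(f"{rating} star(s): {p:.3f}")
print("Predicted rating:", stars)
```

With the model loaded, torch.softmax(result.logits, dim=1) should give the same probabilities. For this sentence the 1-star and 2-star classes receive similar probability, which the bare argmax does not show.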

# 3. Collect Reviews

- We use the requests library to grab the webpage to scrape.
- We then pass the response text to BeautifulSoup and set our parser.
- Then we have to extract the specific elements we want from the webpage.
- The reviews sit in p tags whose class name contains "comment", so we search for p elements with a class matching that pattern.

In [13]:
r = requests.get("https://www.yelp.ca/biz/seoul-fried-chicken-edmonton")
soup = BeautifulSoup(r.text, 'html.parser')
regex = re.compile('.*comment.*')
results = soup.find_all('p', {'class':regex})
reviews = [result.text for result in results]
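
Why a regex rather than an exact class name? A plausible reason (an assumption on our part, not something the page states) is that the class attributes carry generated suffixes, so only the "comment" part is stable. The sketch below applies the same pattern to a few made-up class names; none of them is taken from the real Yelp markup.

```python
import re

# Same pattern as in the cell above
regex = re.compile('.*comment.*')

# Hypothetical class names, invented for illustration only
class_names = [
    'comment__abc123',  # generated suffix after "comment"
    'user-comment',     # "comment" at the end
    'raw__abc123',      # no "comment" anywhere
    'Comment',          # capital C: the pattern is case-sensitive
]

# Keep only the names the pattern matches
matched = [name for name in class_names if regex.search(name)]
print(matched)
```

Because of the leading and trailing .*, the pattern acts as a substring test on these names, so anything containing the lowercase text "comment" is kept.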

In [14]:
reviews[0]

"SFC around Whyte Ave has been one of the best Korean fried chicken places in Edmonton since they first opened. The overall quality and portion of the food for the price is unbeatable. I'd rate them 9.3/10 and recommend trying these flavours:- Garlic Soy- Curry"

# 4. Load Reviews into DataFrame and Score

- We will first create a dataframe from the reviews we have scraped.
- We will create a function that takes a review string and returns a sentiment score.
- Inside it, the tokenizer encodes the review into a tensor of token IDs.
- We then pass those tokens through the model, store the output in the result variable, and finally return the sentiment score (the argmax position plus 1).

In [16]:
import pandas as pd
import numpy as np

In [18]:
df = pd.DataFrame(np.array(reviews), columns=['Review'])

In [20]:
df['Review'].iloc[0]

"SFC around Whyte Ave has been one of the best Korean fried chicken places in Edmonton since they first opened. The overall quality and portion of the food for the price is unbeatable. I'd rate them 9.3/10 and recommend trying these flavours:- Garlic Soy- Curry"

In [25]:
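def sentiment_score(review):
    # Encode the review text as a PyTorch tensor of token IDs
    tokens = tokenizer.encode(review, return_tensors='pt')
    # Run the model; result.logits holds one raw score per star rating
    result = model(tokens)
    # argmax gives a 0-based position, so add 1 to get a 1-5 star rating
    return int(torch.argmax(result.logits))+1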

In [26]:
sentiment_score(df['Review'].iloc[1])

5

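Now this is useful. But what if we wanted to score the reviews en masse and store the sentiment scores within the dataframe? In the next cell, x[:512] keeps only the first 512 characters of each review, a rough guard against the model's 512-token input limit.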

In [27]:
df['Sentiment'] = df['Review'].apply(lambda x: sentiment_score(x[:512]))

In [31]:
df

| | Review | Sentiment |
|---|---|---|
| 0 | SFC around Whyte Ave has been one of the best ... | 4 |
| 1 | This is one of the best (Korean style) fried c... | 5 |
| 2 | Korean fried chicken. The best when they are h... | 5 |
| 3 | Moist moist moist! Been a minute since we pop... | 5 |
| 4 | I love Seoul Fried chicken, this is a popular ... | 5 |
| 5 | Came here for Valentine's Day take out and boy... | 5 |
| 6 | I've come here for take out only, a couple of ... | 5 |
| 7 | Free parking in the strip-mall parking lot jus... | 2 |
| 8 | The chicken is tender and juicy! One of the be... | 5 |
| 9 | This was a real disappointment. Overcooked, t... | 1 |
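
The x[:512] slice used when scoring the dataframe cuts each review at 512 characters, while the model's limit is counted in tokens. A possible alternative is to let the tokenizer truncate at the token level through its truncation and max_length arguments. In this sketch the function name sentiment_score_truncated and the max_tokens parameter are our own, and the tokenizer and model are passed in as arguments so the block stands alone.

```python
def sentiment_score_truncated(review, tokenizer, model, max_tokens=512):
    """Score a review as 1 to 5 stars, truncating by tokens, not characters."""
    # truncation=True with max_length caps the encoded sequence,
    # special tokens included, at max_tokens
    tokens = tokenizer.encode(
        review,
        return_tensors='pt',
        truncation=True,
        max_length=max_tokens,
    )
    result = model(tokens)
    # argmax is 0-based, so add 1 to get the star rating
    return int(result.logits.argmax()) + 1
```

With the tokenizer and model from section 1 loaded, it could be applied as df['Review'].apply(lambda x: sentiment_score_truncated(x, tokenizer, model)).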


### You can use this code to scrape any Yelp business page: just change the URL passed to requests.get in the 'r' variable!