# Pre-Trained BERT Model for Sentiment Prediction
- This notebook was created and run in a Google Colab environment
- The sentiment model comes from 'nlptown/bert-base-multilingual-uncased-sentiment'
- The model was used to predict the sentiment of Reddit comments scraped from a Dota2 patch update thread

In [1]:
# install the transformers library
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.4-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.3 transformers-4.27.4


In [2]:
# import the tokenizer and sequence-classification model classes
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
# instantiate tokenizer and model
tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')


# Test the Pre-trained Model
Everything is loaded, so test the model to confirm that it works

In [4]:
# import torch
import torch

In [5]:
# tokenize a sentence
tokens = tokenizer.encode('It was good but couldve been better. Great', return_tensors='pt')

In [6]:
# pass our tokenized sentence to the BERT model, view results
result = model(tokens)
result

SequenceClassifierOutput(loss=None, logits=tensor([[-2.7768, -1.2353,  1.4419,  1.9804,  0.4584]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

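The logits tensor holds one raw, unnormalized score for each sentiment rating of 1, 2, 3, 4, or 5 (with 5 being a good sentiment). These scores are not probabilities (some are negative), but a higher logit means the model considers that rating more likely

We can extract the tensor with the .logits attribute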

In [7]:
# grab the logits
result.logits

tensor([[-2.7768, -1.2353,  1.4419,  1.9804,  0.4584]],
       grad_fn=<AddmmBackward0>)

The largest logit sits at position 3 (counting from 0), so a sentiment of 4 is the most likely rating for this comment. We can extract the position of the max value

In [8]:
# pull the position (0,1,2,3,4) of the highest logit
torch.argmax(result.logits)

tensor(3)

Position 3 of the tensor has the highest score. We can convert this to an integer and add 1 to get an overall sentiment

In [9]:
# turn to int and add 1, since positions 0-4 correspond to sentiments 1-5
int(torch.argmax(result.logits))+1

4

The comment 'It was good but couldve been better. Great' has been predicted to have a sentiment score of 4/5. Pretty good overall
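
The logits printed earlier are raw scores rather than probabilities. As a minimal sketch of how the two relate, the softmax below turns the five logits from the test sentence into probabilities that sum to 1. It uses plain Python with the values copied from the output above, so it runs without the model; the names logits, softmax, probs, and best exist only for this illustration. With the real tensor, torch.softmax(result.logits, dim=-1) performs the same conversion.

```python
import math

# logits copied from the model output for the test sentence
logits = [-2.7768, -1.2353, 1.4419, 1.9804, 0.4584]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # subtract the max before exponentiating for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# position of the largest probability (same position as the largest logit)
best = max(range(len(probs)), key=lambda i: probs[i])

for stars, p in enumerate(probs, start=1):
    print(f"sentiment {stars}: {p:.3f}")
print("predicted sentiment:", best + 1)
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly, as this notebook does, picks the same rating.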

# Predict on Reddit Comments
- Load data and Score Dota2 Reddit Comments for patch 7.32e

In [10]:
# import libraries
import numpy as np
import pandas as pd

In [11]:
# mount google drive, gives access to google drive data
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
# read data
dota = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/dota_comments_optimized.csv')

In [13]:
# check data
dota.head()

Unnamed: 0,text,Logistic_Regression,SVM,Decision_Tree,KNN,PCA1,PCA2,PCA3
0,"Chill everyone, the Newspost says that 7.33 wi...",0,0,0,1,0.086364,-0.005339,0.057743
1,At least CM (Crystal Maiden) not touched.,1,1,1,1,0.0,0.0,0.0
2,I wanna work with this devs since the work two...,1,1,1,1,0.093037,-0.012519,0.075822
3,7.32e hahahahahah,1,1,1,1,0.0,0.0,0.0
4,# 7.32e Summary\n\n* New hero added to the gam...,0,0,0,1,0.070274,0.003729,0.049699


In [14]:
# separate text column
text = dota['text']

In [15]:
# check
text

0      Chill everyone, the Newspost says that 7.33 wi...
1              At least CM (Crystal Maiden) not touched.
2      I wanna work with this devs since the work two...
3                                      7.32e hahahahahah
4      # 7.32e Summary\n\n* New hero added to the gam...
                             ...                        
187    4.2 GB for minor Lina nerf.  \n\n\nI'm done wi...
188    This is bad lol. Barely any changes other than...
189    Hahahaha we overhyped the patch, it's literall...
190                                              No 7.33
191    This has to be what sunsfan was talking about ...
Name: text, Length: 192, dtype: object

In [16]:
# pull a comment from the Series to test
text[2]

'I wanna work with this devs since the work two days in a year'

In [17]:
# test sentiment on comment
tokens = tokenizer.encode(text[2], return_tensors='pt')
result = model(tokens)
# position of the highest logit (0-4) plus 1 gives the sentiment (1-5)
sentiment = int(torch.argmax(result.logits))+1
print(sentiment)

5


Comment ID 2 has a sentiment of 5 based on the pre-trained BERT model

In [18]:
# create for loop to score every comment for a sentiment ranging from 1-5
# save each score to a list
BERT = []

for i in text:
  tokens = tokenizer.encode(i, return_tensors='pt')
  # inference only, so skip gradient tracking
  with torch.no_grad():
    result = model(tokens)
  # position of the highest logit (0-4) plus 1 gives the sentiment (1-5)
  sentiment = int(torch.argmax(result.logits))+1
  BERT.append(sentiment)
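
One thing the loop above does not guard against is very long comments: BERT models of this size accept at most 512 tokens, and an over-long input would typically raise an error. The sketch below is one hedged way to wrap the scoring in a helper that truncates long inputs. The function name score_comments and its max_length parameter are hypothetical and not part of the original notebook; the tokenizer and model are passed in rather than loaded here.

```python
def score_comments(texts, tokenizer, model, max_length=512):
    """Return a 1-5 sentiment for each text (hypothetical helper)."""
    scores = []
    for comment in texts:
        # calling the tokenizer returns input_ids plus an attention mask;
        # truncation=True cuts inputs longer than max_length tokens
        encoded = tokenizer(comment, return_tensors='pt',
                            truncation=True, max_length=max_length)
        result = model(**encoded)
        # position of the highest logit (0-4) plus 1 gives the rating (1-5)
        scores.append(int(result.logits.argmax(dim=-1)) + 1)
    return scores
```

With the objects already loaded in this notebook, a call would look like BERT = score_comments(text, tokenizer, model). Truncation only changes the result for comments that exceed the limit.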

In [19]:
# save the list of scores as a new column in the dataframe
dota['BERT'] = BERT

In [20]:
# check
dota.head(5)

Unnamed: 0,text,Logistic_Regression,SVM,Decision_Tree,KNN,PCA1,PCA2,PCA3,BERT
0,"Chill everyone, the Newspost says that 7.33 wi...",0,0,0,1,0.086364,-0.005339,0.057743,4
1,At least CM (Crystal Maiden) not touched.,1,1,1,1,0.0,0.0,0.0,1
2,I wanna work with this devs since the work two...,1,1,1,1,0.093037,-0.012519,0.075822,5
3,7.32e hahahahahah,1,1,1,1,0.0,0.0,0.0,1
4,# 7.32e Summary\n\n* New hero added to the gam...,0,0,0,1,0.070274,0.003729,0.049699,4


In [21]:
# change numbers in column to 'Positive' (0), 'Neutral' (1) or 'Negative' (anything else)
dota['Logistic_Regression'] = np.where(dota['Logistic_Regression'] == 0, 'Positive',
                              np.where(dota['Logistic_Regression'] == 1, 'Neutral',
                                       'Negative'))

In [22]:
# same as above
dota['SVM'] = np.where(dota['SVM'] == 0, 'Positive',
              np.where(dota['SVM'] == 1, 'Neutral',
                                       'Negative'))

In [23]:
# same as above
dota['Decision_Tree'] = np.where(dota['Decision_Tree'] == 0, 'Positive',
                        np.where(dota['Decision_Tree'] == 1, 'Neutral',
                                       'Negative'))

In [24]:
# same as above
dota['KNN'] = np.where(dota['KNN'] == 0, 'Positive',
              np.where(dota['KNN'] == 1, 'Neutral',
                                       'Negative'))

In [25]:
# check
dota.head()

Unnamed: 0,text,Logistic_Regression,SVM,Decision_Tree,KNN,PCA1,PCA2,PCA3,BERT
0,"Chill everyone, the Newspost says that 7.33 wi...",Positive,Positive,Positive,Neutral,0.086364,-0.005339,0.057743,4
1,At least CM (Crystal Maiden) not touched.,Neutral,Neutral,Neutral,Neutral,0.0,0.0,0.0,1
2,I wanna work with this devs since the work two...,Neutral,Neutral,Neutral,Neutral,0.093037,-0.012519,0.075822,5
3,7.32e hahahahahah,Neutral,Neutral,Neutral,Neutral,0.0,0.0,0.0,1
4,# 7.32e Summary\n\n* New hero added to the gam...,Positive,Positive,Positive,Neutral,0.070274,0.003729,0.049699,4
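
The BERT column is still a 1-5 rating, while the other model columns now hold text labels. If a like-for-like comparison is wanted, one option is to bucket the ratings into the same three labels. The cut-offs below (1-2 Negative, 3 Neutral, 4-5 Positive) are an assumption for illustration, not something the notebook or the model defines, and the function name stars_to_label is hypothetical.

```python
def stars_to_label(stars):
    """Bucket a 1-5 rating into three labels (assumed cut-offs)."""
    if stars <= 2:
        return 'Negative'
    if stars == 3:
        return 'Neutral'
    return 'Positive'

# ratings for the first five comments, copied from the BERT column above
sample = [4, 1, 5, 1, 4]
labels = [stars_to_label(s) for s in sample]
print(labels)
```

On the dataframe itself this could be applied with dota['BERT'].map(stars_to_label) and stored in a new column, so the original ratings are kept.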


We now have a dataframe with sentiment predictions from each optimized model and from our pre-trained BERT model from Hugging Face.

This dataframe will be saved and made into a dashboard in R Shiny.

In [26]:
# save the dataframe to csv without the index column
dota.to_csv('dota_comments_BERT.csv', index = False)