# Sentiment analysis with Transformers
In the traditional NLP techniques section we used the VADER library. Here we use `DistilBERT`, a lighter variant of the BERT base model with fewer parameters, so it runs faster.

In [1]:
import pandas as pd

In [2]:
df= pd.read_excel('Popchip_Reviews_Sentiment.xlsx')
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan ga...,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,I like the puffed nature of this chip that mak...,0.7269


In [3]:
# to run the code faster, just take 30 rows of data
df = pd.read_excel('Popchip_Reviews_Sentiment.xlsx').head(30)
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan ga...,0.9244
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,I like the puffed nature of this chip that mak...,0.7269


## Sentiment Analysis

In [4]:
from transformers import pipeline

In [5]:
import sys
!{sys.executable} -m pip install transformers

Collecting transformers
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.36.0-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Downloading tokenizers-0.22.1-cp39-abi3-win_amd64.whl.metadata (6.9 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.7.0-cp38-abi3-win_amd64.whl.metadata (4.2 kB)
Downloading transformers-4.57.3-py3-none-any.whl (12.0 MB)

In [6]:
from transformers import pipeline

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [9]:
import sys
!{sys.executable} -m pip install torch



In [5]:
# device=-1 runs the model on the CPU
sentiment_analyzer = pipeline('sentiment-analysis', model='distilbert/distilbert-base-uncased-finetuned-sst-2-english', device=-1)

Device set to use cpu


In [6]:
text1 = 'When life gives you lemons, make lemonade!'
text2 = 'A dozen lemons will make a gallon of lemonade.'
text3 = 'I didn\'t like the taste of that lemonade at all'

In [7]:
sentiment_analyzer(text1)

[{'label': 'POSITIVE', 'score': 0.9983568787574768}]

In [8]:
sentiment_analyzer(text2)

[{'label': 'POSITIVE', 'score': 0.7781569361686707}]

In [9]:
sentiment_analyzer(text3)

[{'label': 'NEGATIVE', 'score': 0.995613694190979}]

In [10]:
# apply to the dataset
df.Text.apply(sentiment_analyzer)

Token indices sequence length is longer than the specified maximum sequence length for this model (668 > 512). Running this sequence through the model will result in indexing errors


RuntimeError: The size of tensor a (668) must match the size of tensor b (512) at non-singleton dimension 1

The dataset contains a review that is 668 tokens long, which exceeds the 512-token maximum the model can handle. To resolve this, set `truncation=True` so that longer inputs are cut down to the model's limit.
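
As a rough illustration of what truncation means, the sketch below cuts a list of token ids down to the 512-token limit reported in the error. It is a simplified stand-in, not the tokenizer's actual implementation: `truncate_ids` and the fake `long_review` ids are hypothetical names, and the real tokenizer also keeps its special tokens, which this sketch ignores.

```python
MAX_LENGTH = 512  # maximum sequence length reported in the error message above


def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep only the first `max_length` token ids and drop the rest."""
    return token_ids[:max_length]


# Hypothetical stand-in for the 668-token review; real ids come from the tokenizer
long_review = list(range(668))
short_review = truncate_ids(long_review)

print(len(long_review), "->", len(short_review))
```

One consequence worth keeping in mind: anything a reviewer says after the cut-off is never seen by the model, so sentiment expressed only at the end of a very long review can be missed.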

In [10]:
sentiment_analyzer = pipeline('sentiment-analysis', model='distilbert/distilbert-base-uncased-finetuned-sst-2-english', device=-1, truncation=True)

Device set to use cpu


In [11]:
df.Text.apply(sentiment_analyzer)

0     [{'label': 'POSITIVE', 'score': 0.993521273136...
1     [{'label': 'POSITIVE', 'score': 0.999605119228...
2     [{'label': 'NEGATIVE', 'score': 0.698487639427...
3     [{'label': 'NEGATIVE', 'score': 0.999630808830...
4     [{'label': 'POSITIVE', 'score': 0.999181449413...
5     [{'label': 'POSITIVE', 'score': 0.999419689178...
6     [{'label': 'POSITIVE', 'score': 0.999218821525...
7     [{'label': 'POSITIVE', 'score': 0.996904075145...
8     [{'label': 'POSITIVE', 'score': 0.989402770996...
9     [{'label': 'POSITIVE', 'score': 0.999183237552...
10    [{'label': 'POSITIVE', 'score': 0.999485135078...
11    [{'label': 'NEGATIVE', 'score': 0.725596308708...
12    [{'label': 'POSITIVE', 'score': 0.996617376804...
13    [{'label': 'POSITIVE', 'score': 0.999719560146...
14    [{'label': 'POSITIVE', 'score': 0.894436717033...
15    [{'label': 'POSITIVE', 'score': 0.998936831951...
16    [{'label': 'POSITIVE', 'score': 0.999853491783...
17    [{'label': 'POSITIVE', 'score': 0.96633774

### Sentiment analysis - round 2, but faster, nicer, better

In [12]:
import torch as t

In [None]:
from transformers import logging
logging.set_verbosity_error()  # show only errors, not warnings
# Use the first GPU if CUDA is available, otherwise fall back to the CPU
device = 0 if t.cuda.is_available() else -1  # 0 for GPU, -1 for CPU

sentiment_analyzer = pipeline('sentiment-analysis',
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',
                              device=device,
                              truncation=True)
df.Text.apply(sentiment_analyzer)

0     [{'label': 'POSITIVE', 'score': 0.993521273136...
1     [{'label': 'POSITIVE', 'score': 0.999605119228...
2     [{'label': 'NEGATIVE', 'score': 0.698487639427...
3     [{'label': 'NEGATIVE', 'score': 0.999630808830...
4     [{'label': 'POSITIVE', 'score': 0.999181449413...
5     [{'label': 'POSITIVE', 'score': 0.999419689178...
6     [{'label': 'POSITIVE', 'score': 0.999218821525...
7     [{'label': 'POSITIVE', 'score': 0.996904075145...
8     [{'label': 'POSITIVE', 'score': 0.989402770996...
9     [{'label': 'POSITIVE', 'score': 0.999183237552...
10    [{'label': 'POSITIVE', 'score': 0.999485135078...
11    [{'label': 'NEGATIVE', 'score': 0.725596308708...
12    [{'label': 'POSITIVE', 'score': 0.996617376804...
13    [{'label': 'POSITIVE', 'score': 0.999719560146...
14    [{'label': 'POSITIVE', 'score': 0.894436717033...
15    [{'label': 'POSITIVE', 'score': 0.998936831951...
16    [{'label': 'POSITIVE', 'score': 0.999853491783...
17    [{'label': 'POSITIVE', 'score': 0.96633774

# Clean up the output

In [21]:
pd.set_option('display.max_colwidth', None)

In [22]:
sentiment_scores= df.Text.apply(sentiment_analyzer)
sentiment_scores[:5]

0    [{'label': 'POSITIVE', 'score': 0.9935212731361389}]
1     [{'label': 'POSITIVE', 'score': 0.999605119228363}]
2    [{'label': 'NEGATIVE', 'score': 0.6984876394271851}]
3    [{'label': 'NEGATIVE', 'score': 0.9996308088302612}]
4    [{'label': 'POSITIVE', 'score': 0.9991814494132996}]
Name: Text, dtype: object

In [24]:
sentiment_scores[0][0]['label']

'POSITIVE'

In [25]:
sentiment_scores[0][0]['score']

0.9935212731361389

In [None]:
# Extract the label from every result so it can become a new DataFrame column.
# sentiment_scores is a Series in which each element is a one-item list holding a dict,
# so x[0] is that dict and x[0]['label'] is its label.
sentiment_scores.apply(lambda x: x[0]['label'])

0     POSITIVE
1     POSITIVE
2     NEGATIVE
3     NEGATIVE
4     POSITIVE
5     POSITIVE
6     POSITIVE
7     POSITIVE
8     POSITIVE
9     POSITIVE
10    POSITIVE
11    NEGATIVE
12    POSITIVE
13    POSITIVE
14    POSITIVE
15    POSITIVE
16    POSITIVE
17    POSITIVE
18    NEGATIVE
19    POSITIVE
20    NEGATIVE
21    POSITIVE
22    POSITIVE
23    POSITIVE
24    NEGATIVE
25    POSITIVE
26    NEGATIVE
27    POSITIVE
28    POSITIVE
29    NEGATIVE
Name: Text, dtype: object
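
The nested indexing can be easier to follow when it is written as a small named function. The sketch below is plain Python with no pandas or model needed; `unpack` is a hypothetical helper name, and the sample values are copied from the outputs above.

```python
def unpack(result):
    """Return (label, score) from one pipeline result.

    A result looks like [{'label': 'POSITIVE', 'score': 0.99}]:
    a list with a single dictionary inside.
    """
    first = result[0]  # the only dictionary in the list
    return first["label"], first["score"]


# Sample results copied from the notebook output above
results = [
    [{"label": "POSITIVE", "score": 0.9935212731361389}],
    [{"label": "NEGATIVE", "score": 0.6984876394271851}],
]

# Split the (label, score) pairs into two parallel tuples
labels, scores = zip(*(unpack(r) for r in results))
print(labels)
print(scores)
```

The same function could be passed to `sentiment_scores.apply(...)` in place of the lambda if both the label and the score are wanted in one pass.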

In [29]:
# save this output as a new column in the dataset
df['Label_HF'] = sentiment_scores.apply(lambda x: x[0]['label'])
df.head(3)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE
2,23691,A30NYUHEDLWI0Y,5,High,Great Alternative to Potato Chips,"I just love these chips! I was always a big fan of potato chips, but haven't had one since I discovered popchips. They are great for dipping or all alone. I am constantly re-ordering them. One note however-if you are on a low salt diet these chips are probably not for you. They are high in sodium. We go through a case every two months. If you love them it pays to join the subscribe and save program through Amazon. You save money and stay supplied!",0.979,NEGATIVE


In [30]:
df['Score_HF'] = sentiment_scores.apply(lambda x : x[0]['score'])
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605


In [31]:
# Sign the score: keep it as is when the label is POSITIVE, otherwise negate it
df.apply(lambda row: row['Score_HF'] if row['Label_HF'] == 'POSITIVE' else -row['Score_HF'], axis=1)

0     0.993521
1     0.999605
2    -0.698488
3    -0.999631
4     0.999181
5     0.999420
6     0.999219
7     0.996904
8     0.989403
9     0.999183
10    0.999485
11   -0.725596
12    0.996617
13    0.999720
14    0.894437
15    0.998937
16    0.999853
17    0.966338
18   -0.942053
19    0.999761
20   -0.965379
21    0.945946
22    0.998185
23    0.999004
24   -0.752333
25    0.999222
26   -0.990390
27    0.999484
28    0.999874
29   -0.930707
dtype: float64
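
The signing rule can also be expressed as a standalone function, which makes it easy to check on a few values before applying it to the whole DataFrame. This is a minimal sketch: `signed_score` is a hypothetical name, and the sample rows are rounded values from the output above.

```python
def signed_score(label, score):
    """Return the score unchanged for POSITIVE, negated for any other label."""
    return score if label == "POSITIVE" else -score


# (label, score) pairs taken from rows 0, 2 and 3 of the output above
samples = [
    ("POSITIVE", 0.993521),
    ("NEGATIVE", 0.698488),
    ("NEGATIVE", 0.999631),
]

signed = [signed_score(label, score) for label, score in samples]
print(signed)
```

Because this model chooses between two labels and reports the confidence of the winning one, the score should be at least 0.5, so the signed values cluster near -1 and 1 rather than spreading across the whole range the way VADER's compound score can.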

In [32]:
# create a new column for this
df['Sentiment_HF'] = df.apply(lambda row: row['Score_HF'] if row['Label_HF']=='POSITIVE' else -row['Score_HF'], axis =1)
df.head(2)

Unnamed: 0,Id,UserId,Rating,Priority,Title,Text,Sentiment_VADER,Label_HF,Score_HF,Sentiment_HF
0,23689,A21SYVGVNG8RAS,5,Low,Yummy snacks!,Popchips are the bomb!! I use the parmesan garlic to scoop up cottage cheese as a healthy alternative to chips and dip. My healthy eating program is saved.,0.9244,POSITIVE,0.993521,0.993521
1,23690,AQJYXC0MPRQJL,5,Low,Great chip that is different from the rest,"I like the puffed nature of this chip that makes it more unique in the chip market. I ordered the Salt and Vinegar and absolutely love that flavor, hands down my favorite chip ever. I have tried the cheddar and regular flavors as well. The cheddar is about a 4/5 and the regular is about a 3/5 because I prefer strong flavors and obviously that would not be the case for the regular. The Salt and Vinegar is kind of weak compared to some regular S&V chips, but is quite flavorful and makes you wanting to come back for more.",0.7269,POSITIVE,0.999605,0.999605


# Speeding up transformers code
Using a GPU is the fastest option, but a few techniques can speed up the code when only a CPU is available.

In [None]:
import torch
from transformers import pipeline

sentiment_analyzer = pipeline('sentiment-analysis',  # the task name uses a hyphen, not an underscore
                              model='distilbert/distilbert-base-uncased-finetuned-sst-2-english',  # 1. smaller model
                              device=-1,  # running on CPU
                              truncation=True,
                              use_fast=True)  # 2. faster tokenization

torch.set_num_threads(1)  # 3. control CPU threading (1 = single thread; try other values for your machine)

with torch.no_grad():  # 4. disable gradient tracking
    sentiment_scores = df['Text'].apply(sentiment_analyzer)
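
To check whether any of these settings actually help on a given machine, it is worth timing the run before and after each change. The sketch below is one simple way to do that with the standard library. `time_call` and `fake_analyzer` are hypothetical names; the stand-in analyzer only exists so the example runs without downloading a model, and in the notebook you would time the real `sentiment_analyzer` instead.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of func()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def fake_analyzer(text):
    """Hypothetical stand-in that mimics the shape of the pipeline's output."""
    return [{"label": "POSITIVE", "score": 0.5}]


texts = ["Popchips are the bomb!!", "Too salty for me."]
elapsed = time_call(lambda: [fake_analyzer(t) for t in texts])
print(f"best of 3 runs: {elapsed:.6f} s")
```

Taking the best of several runs reduces noise from other processes. In the notebook, something like `time_call(lambda: df['Text'].apply(sentiment_analyzer))` would measure the full scoring step.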