# NER and Sentiment

In this section we will apply basic sentiment analysis to our data using a pre-built *DistilBERT* model from the **Flair** library. We will then use the organization labels captured through NER in the previous section to build a list of the organizations with the highest and lowest average sentiment scores.

In [1]:
import pandas as pd
import flair


We initialize the English sentiment model `en-sentiment`:

In [2]:
model = flair.models.TextClassifier.load('en-sentiment')

2022-11-13 12:32:54,320 loading file /home/tola/.flair/models/sentiment-en-mix-distillbert_4.pt


For each sample there are a few steps we need to take to create the sentiment score: tokenize the input text, make a prediction, and extract the direction (*positive* or *negative*) and the confidence (a score from 0 to 1). If this is new to you, we cover the Flair sentiment model in more depth in **TK insert link**.

The following function carries out each of these steps for a single extract:

In [19]:
def get_sentiment(text):
    # tokenize input text
    sentence = flair.data.Sentence(text)
    # make sentiment prediction
    model.predict(sentence)
    # extract sentiment direction (label value) and confidence (score)
    value, score = sentence.labels[0].value, sentence.labels[0].score
    # return both as a (direction, confidence) tuple
    return value, score
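
The direction and confidence can also be folded into a single signed number, which is sometimes easier to aggregate. The following is a minimal sketch, assuming the label values are exactly the strings `POSITIVE` and `NEGATIVE` as they appear in the outputs of this section. `signed_score` is a hypothetical helper and not part of Flair:

```python
def signed_score(value, score):
    """Map a (direction, confidence) pair to a number in [-1, 1].

    Hypothetical helper: assumes `value` is 'POSITIVE' or 'NEGATIVE'.
    """
    if value == 'POSITIVE':
        return score
    if value == 'NEGATIVE':
        return -score
    raise ValueError(f'unexpected sentiment label: {value!r}')


# example pairs in the same shape that get_sentiment returns
print(signed_score('POSITIVE', 0.9982))
print(signed_score('NEGATIVE', 0.9991))
```

Raising on an unexpected label is a deliberate choice here, so that a model with different label names fails loudly rather than silently skewing the scores.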

We now need to load our previously processed dataframe (which includes the *organizations* column) and `apply` the `get_sentiment` function to the *title* column (the *selftext* column is empty in the rows shown below). The results are stored in a new *sentiment* column, which we then split into separate *sent_value* and *sent_score* columns.

In [52]:
# load data
df = pd.read_csv('./data/reddit_hawwkey_ner.csv', sep='|')
df.head()

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,organizations
0,t3_yua5fs,1668365000.0,hawwkey,Sabres welcome fans from Roswell Park Comprehe...,,1.0,17.0,0.0,17.0,['Roswell Park Comprehensive Cancer Center on ...
1,t3_ys0byl,1668138000.0,hawwkey,Julien Gauthiers dad does a little dance after...,,0.81,3.0,0.0,3.0,['Rangers']
2,t3_yn8i0r,1667687000.0,hawwkey,Tarasenko absolutely leveled by 9 year old :),,0.88,98.0,0.0,98.0,[]
3,t3_ymighg,1667617000.0,hawwkey,Local Finland Kids Participate in Avalanche-Bl...,,1.0,10.0,0.0,10.0,"['Local Finland Kids Participate', 'Avalanche-..."
4,t3_ymhthh,1667616000.0,hawwkey,some of the hurricanes dancing along with the ...,,0.98,257.0,0.0,257.0,[]


In [53]:
# get sentiment
df['sentiment'] = df['title'].apply(get_sentiment)

df['sent_value'] = df['sentiment'].apply(lambda sentiment: sentiment[0])
df['sent_score'] = df['sentiment'].apply(lambda sentiment: sentiment[1])
df.head()


Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,organizations,sentiment,sent_value,sent_score
0,t3_yua5fs,1668365000.0,hawwkey,Sabres welcome fans from Roswell Park Comprehe...,,1.0,17.0,0.0,17.0,['Roswell Park Comprehensive Cancer Center on ...,"(POSITIVE, 0.9956150054931641)",POSITIVE,0.995615
1,t3_ys0byl,1668138000.0,hawwkey,Julien Gauthiers dad does a little dance after...,,0.81,3.0,0.0,3.0,['Rangers'],"(POSITIVE, 0.9982194304466248)",POSITIVE,0.998219
2,t3_yn8i0r,1667687000.0,hawwkey,Tarasenko absolutely leveled by 9 year old :),,0.88,98.0,0.0,98.0,[],"(POSITIVE, 0.7387542724609375)",POSITIVE,0.738754
3,t3_ymighg,1667617000.0,hawwkey,Local Finland Kids Participate in Avalanche-Bl...,,1.0,10.0,0.0,10.0,"['Local Finland Kids Participate', 'Avalanche-...","(POSITIVE, 0.9980316758155823)",POSITIVE,0.998032
4,t3_ymhthh,1667616000.0,hawwkey,some of the hurricanes dancing along with the ...,,0.98,257.0,0.0,257.0,[],"(POSITIVE, 0.9984707236289978)",POSITIVE,0.998471



Now we need to extract each of the organizations alongside its sentiment score. We will then loop through each, collecting its positive and negative scores.

Before we do that, we need to convert each value in the *organizations* column back to a list. The values are currently strings because a CSV file cannot store Python lists, so each list was written out as its string representation when the dataframe was saved.

In [56]:
import ast

# parse each string such as "['Rangers']" back into a Python list
df['organizations'] = df['organizations'].apply(ast.literal_eval)
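
To see why this conversion is needed, the sketch below round-trips a list through a CSV file held in memory. It uses the standard-library `csv` module rather than Pandas purely to keep the illustration self-contained, and the single row is made up from values shown in the table above:

```python
import ast
import csv
import io

# write a row containing a Python list to an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|')
writer.writerow(['t3_ys0byl', ['Rangers']])

# read it back: the list returns as its string representation
buffer.seek(0)
name, orgs_raw = next(csv.reader(buffer, delimiter='|'))
print(type(orgs_raw), orgs_raw)

# ast.literal_eval parses the string back into a list without executing code
orgs = ast.literal_eval(orgs_raw)
print(type(orgs), orgs)
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts and so on), which makes it a safer choice than `eval` for data read from a file.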

In [82]:
# initialize sentiment dictionary
sentiment = {}

# loop through dataframe and extract org labels and sentiment scores into sentiment dictionary
for i, row in df.iterrows():
    # extract sentiment direction and score
    direction = row['sent_value']
    score = row['sent_score']
    # loop through each label in organizations column
    for org in row['organizations']:
        # check if org label exists in sentiment dictionary already
        if org not in sentiment:
            # if it doesn't, initialize new entry in dictionary
            sentiment[org] = {'POSITIVE': [], 'NEGATIVE': []}
        # append positive/negative score to respective dictionary entry
        sentiment[org][direction].append(score)
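
The same tally can be written without the explicit membership check by using `collections.defaultdict`. This is a sketch of an alternative, not the approach used in this section, and the `rows` list holds made-up values standing in for the dataframe rows:

```python
from collections import defaultdict

# hypothetical rows: (organizations, direction, score)
rows = [
    (['Rangers'], 'POSITIVE', 0.99),
    (['Rangers', 'Oilers'], 'NEGATIVE', 0.80),
    ([], 'POSITIVE', 0.75),
]

# every new organization starts with two empty score lists
tally = defaultdict(lambda: {'POSITIVE': [], 'NEGATIVE': []})

for orgs, direction, score in rows:
    for org in orgs:
        # the entry is created automatically on first access
        tally[org][direction].append(score)

print(dict(tally))
```

Rows with an empty organizations list contribute nothing, which mirrors the behavior of the loop above.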

In [58]:
df['organizations']

0      [Roswell Park Comprehensive Cancer Center on H...
1                                              [Rangers]
2                                                     []
3      [Local Finland Kids Participate, Avalanche-Blu...
4                                                     []
                             ...                        
994                                   [Nikita Tryampkin]
995                                     [Colton Sissons]
996                                                   []
997                                  [Foligno &, X-Post]
998                                                   []
Name: organizations, Length: 999, dtype: object

In [59]:
sentiment['Rangers']

{'POSITIVE': [0.9982194304466248,
  0.9996633529663086,
  0.9986209869384766,
  0.9953626394271851],
 'NEGATIVE': []}

Now we can loop through each organization in the sentiment dictionary and calculate its average positive score, its average negative score, and an overall score:

In [83]:
# initialize sentiment list
avg_sentiment = []

# loop through each organization and its collected scores
for org, scores in sentiment.items():
    # number of positive and negative ratings
    pos_freq = len(scores['POSITIVE'])
    neg_freq = len(scores['NEGATIVE'])
    freq = pos_freq + neg_freq
    # total positive and negative confidence (the sum of an empty list is 0)
    pos_total = sum(scores['POSITIVE'])
    neg_total = sum(scores['NEGATIVE'])
    # overall score: positives count for, negatives count against
    avg = (pos_total - neg_total) / freq
    # average confidence within each direction
    pos_avg = pos_total / pos_freq if pos_freq != 0 else 0
    neg_avg = neg_total / neg_freq if neg_freq != 0 else 0
    avg_sentiment.append({
        'entity': org,
        'positive': pos_avg,
        'negative': neg_avg,
        'frequency': freq,
        'score': avg
    })

In [85]:
avg_sentiment[:3]

[{'entity': 'Roswell Park Comprehensive Cancer Center on Hockey',
  'positive': 0.9956150054931641,
  'negative': 0,
  'frequency': 1,
  'score': 0.9956150054931641},
 {'entity': 'Rangers',
  'positive': 0.9979666024446487,
  'negative': 0,
  'frequency': 4,
  'score': 0.9979666024446487},
 {'entity': 'Local Finland Kids Participate',
  'positive': 0.9980316758155823,
  'negative': 0,
  'frequency': 1,
  'score': 0.9980316758155823}]

In [88]:
sentiment_df = pd.DataFrame(avg_sentiment)
sentiment_df.head()

Unnamed: 0,entity,positive,negative,frequency,score
0,Roswell Park Comprehensive Cancer Center on Ho...,0.995615,0.0,1,0.995615
1,Rangers,0.997967,0.0,4,0.997967
2,Local Finland Kids Participate,0.998032,0.0,1,0.998032
3,Avalanche-Blue Jackets Global Series Opener,0.998032,0.0,1,0.998032
4,Kraken,0.994651,0.0,2,0.994651


In [90]:
sentiment_df = sentiment_df[sentiment_df['frequency'] > 3]
sentiment_df.head()

Unnamed: 0,entity,positive,negative,frequency,score
1,Rangers,0.997967,0.0,4,0.997967
14,Oilers,0.967376,0.0,5,0.967376
22,Fleury,0.977178,0.0,8,0.977178
36,Ovi,0.962929,0.999074,4,0.472429
150,Backstrom,0.998438,0.0,4,0.998438


In [91]:
sentiment_df.sort_values('score', ascending=False).head(10)

Unnamed: 0,entity,positive,negative,frequency,score
150,Backstrom,0.998438,0.0,4,0.998438
1,Rangers,0.997967,0.0,4,0.997967
22,Fleury,0.977178,0.0,8,0.977178
14,Oilers,0.967376,0.0,5,0.967376
36,Ovi,0.962929,0.999074,4,0.472429
268,Preds,0.945144,0.950315,4,0.471279


In [92]:
sentiment_df.sort_values('score').head(10)

Unnamed: 0,entity,positive,negative,frequency,score
268,Preds,0.945144,0.950315,4,0.471279
36,Ovi,0.962929,0.999074,4,0.472429
14,Oilers,0.967376,0.0,5,0.967376
22,Fleury,0.977178,0.0,8,0.977178
1,Rangers,0.997967,0.0,4,0.997967
150,Backstrom,0.998438,0.0,4,0.998438


In [60]:
# initialize sentiment list
avg_sentiment = []

# loop through each organization
for org in sentiment.keys():
    # get number of positive and negative ratings
    freq = len(sentiment[org]['POSITIVE']) + len(sentiment[org]['NEGATIVE'])
    for direction in ['POSITIVE', 'NEGATIVE']:
        # assign to variable for cleaner code
        score = sentiment[org][direction]
        # if there are no entries, set to 0
        if len(score) == 0:
            sentiment[org][direction] = 0.0
        else:
            # otherwise calculate total
            sentiment[org][direction] = sum(score)
    # now calculate total amount
    total = sentiment[org]['POSITIVE'] - sentiment[org]['NEGATIVE']
    # and the average score
    avg = total/freq
    # add to sentiment list
    avg_sentiment.append({
        'entity': org,
        'positive': sentiment[org]['POSITIVE'],
        'negative': sentiment[org]['NEGATIVE'],
        'frequency': freq,
        'score': avg
    })

In [68]:
sentiment_df = pd.DataFrame(avg_sentiment)
sentiment_df.sort_values('frequency', ascending=False).head(10)

Unnamed: 0,entity,positive,negative,frequency,score
22,Fleury,7.817426,0.0,8,0.977178
14,Oilers,4.83688,0.0,5,0.967376
1,Rangers,3.991866,0.0,4,0.997967
268,Preds,2.835433,0.950315,4,0.471279
36,Ovi,2.888788,0.999074,4,0.472429
150,Backstrom,3.993751,0.0,4,0.998438
10,Canadiens,2.61165,0.0,3,0.87055
25,St. Louis Blues,2.9682,0.0,3,0.9894
195,Jeff Petry's,2.772388,0.0,3,0.924129
12,Flames,2.863718,0.0,3,0.954573
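
As a quick check of how the *score* column relates to the others, the sketch below recomputes it for the *Ovi* row using the rounded totals printed in the table above. Because the inputs are already rounded to six decimal places, the result should agree with the table only to within rounding:

```python
# values copied from the Ovi row in the table above
positive_total = 2.888788
negative_total = 0.999074
frequency = 4

# positive scores count for the entity, negative scores count against it
score = (positive_total - negative_total) / frequency
print(score)
```

With three positive mentions and one strongly negative mention, the single negative pulls the score down to roughly half of what the all-positive entities receive.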


Immediately we can see that many entities appear only once in our dataset, and because of this their scores are pushed to one extreme or the other. We can filter out anything with a frequency of `3` or less to remove many of these instances:

In [82]:
sentiment_df = sentiment_df[sentiment_df['frequency'] > 3]
sentiment_df

Unnamed: 0,entity,positive,negative,frequency,score
5,Fed,2.326646,10.933105,14,-0.614747
7,Treasury,0.610764,3.990731,5,-0.675993
8,ARK,6.525701,7.600950,15,-0.071683
11,Citadel,0.901277,2.829939,4,-0.482165
17,eBay,1.879811,2.979899,5,-0.220018
...,...,...,...,...,...
1349,IBM,2.965104,0.883970,4,0.520283
1487,PLTR,1.624318,1.911520,4,-0.071801
1553,LMND,0.000000,4.746248,5,-0.949250
1735,PLUG,0.951824,2.889800,4,-0.484494


Here we have some more relevant information. A few items, such as *Fed* and *Treasury*, could be removed with the `BLACKLIST` covered in earlier sections, but this list is nonetheless looking much better than before. We can apply `sort_values` to find the entities with the highest overall score:

In [84]:
sentiment_df.sort_values('score', ascending=False).head(10)

Unnamed: 0,entity,positive,negative,frequency,score
1349,IBM,2.965104,0.88397,4,0.520283
317,TAM,5.415756,1.880311,8,0.441931
2011,Sony,4.888052,1.970413,7,0.416806
287,AR,2.5488,0.999818,4,0.387246
504,Intel,3.26843,1.953941,6,0.219082
48,cannabis,6.215654,3.820414,11,0.217749
908,Verizon,3.618954,2.800269,7,0.116955
307,Google,6.957319,5.516039,14,0.102949
452,Company,7.041967,5.876644,14,0.083237
337,SaaS,5.450681,4.846271,11,0.054946


Very quickly, we have pulled together these results using simple, ready-to-use models and **zero** text preprocessing. With further fine-tuning and process development, these already good results can become great, which we will cover soon.