## NER and Sentiment
In this section we apply basic sentiment analysis to our data using a pre-built DistilBERT model from the Flair library. We then use the organization labels captured through NER in the previous section to build a list of the organizations with the highest and lowest average sentiment scores.

In [2]:
import pandas as pd
import flair
import ast

In [3]:
model = flair.models.TextClassifier.load('en-sentiment')

2022-01-16 17:33:40,953 loading file /home/ec2-user/.flair/models/sentiment-en-mix-distillbert_4.pt



In [4]:
def get_sentiment(text):
    # tokenize input text
    sentence = flair.data.Sentence(text)
    # make sentiment prediction
    model.predict(sentence)
    # extract sentiment direction and confidence (label and score) object
    sentiment = sentence.labels[0]
    return sentiment

We now need to load our previously processed dataframe (which includes the organizations column) and apply the get_sentiment function to the selftext column. These sentiment scores will then be stored in a new sentiment column.

In [5]:
# load data
df = pd.read_csv('processed_reddit_investing_ner.csv', sep='|')
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio,organizations
0,1642328000.0,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13,"['Hengell', 'Orca']"
1,1642327000.0,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.8,[]
2,1642322000.0,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45,['TYSM']
3,1642312000.0,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.5,"['Robinhood', 'VTI', 'Fidelity']"
4,1642306000.0,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87,[]


In [6]:
# get sentiment
df['sentiment'] = df['selftext'].apply(get_sentiment)
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio,organizations,sentiment
0,1642328000.0,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13,"['Hengell', 'Orca']",NEGATIVE (0.7773)
1,1642327000.0,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.8,[],NEGATIVE (0.9993)
2,1642322000.0,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45,['TYSM'],POSITIVE (0.848)
3,1642312000.0,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.5,"['Robinhood', 'VTI', 'Fidelity']",NEGATIVE (0.997)
4,1642306000.0,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87,[],NEGATIVE (0.9715)


In [7]:
# the CSV stores each list as a string, so parse it back into a Python list
df['organizations'] = df['organizations'].apply(ast.literal_eval)
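
This parsing step is needed because a CSV file stores every cell as text, so the lists in the organizations column come back as strings such as "['Robinhood', 'VTI', 'Fidelity']". A minimal sketch of what ast.literal_eval does to one such cell, using a value copied from the table above (the variable names raw and orgs are only for illustration):

```python
import ast

# one cell of the organizations column as it is read back from the CSV
raw = "['Robinhood', 'VTI', 'Fidelity']"

# literal_eval parses Python literals (lists, strings, numbers, ...)
# without executing arbitrary code, unlike eval
orgs = ast.literal_eval(raw)

print(type(raw).__name__, type(orgs).__name__)
print(orgs)

# an empty list stored as text is parsed the same way
print(ast.literal_eval("[]"))
```

After this conversion, each row can be iterated over one organization at a time rather than one character at a time.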

In [10]:
df.iloc[3].selftext

'I am closing my Betterment account after experiencing some problems. I am thinking of liquidating everything and dumping everything on VTI. The only thing that I am concerned abut is the lack of automatic tax loss harvesting in Robinhood or Fidelity. Would I be able to at least manually do this myself if I have a VTI only portfolio? Thanks!\n\n&amp;#x200B;\n\nI am also open to other ideas besides just creating a VTI only account.'

In [11]:
# initialize sentiment dictionary
sentiment = {}

# loop through dataframe and extract org labels and sentiment scores into sentiment dictionary
for _, row in df.iterrows():
    # extract sentiment direction and score
    direction = row['sentiment'].value
    score = row['sentiment'].score
    # loop through each label in organizations column
    for org in row['organizations']:
        # check if org label exists in sentiment dictionary already
        if org not in sentiment:
            # if it doesn't, initialize new entry in dictionary
            sentiment[org] = {'POSITIVE': [], 'NEGATIVE': []}
        # append positive/negative score to respective dictionary entry
        sentiment[org][direction].append(score)

In [13]:
sentiment['VTI']

{'POSITIVE': [],
 'NEGATIVE': [0.9970013499259949,
  0.5219786763191223,
  0.7596681118011475,
  0.9960135221481323]}
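
The membership check in the loop above can also be handled by collections.defaultdict from the standard library. The sketch below is an alternative under that assumption, not the notebook's own method. It runs on a small hand-made list of (organizations, direction, score) tuples taken from the first four rows shown earlier; the names rows and sentiment_sketch are hypothetical.

```python
from collections import defaultdict

# (organizations, direction, score) for the first four posts shown earlier
rows = [
    (['Hengell', 'Orca'], 'NEGATIVE', 0.7773),
    ([], 'NEGATIVE', 0.9993),
    (['TYSM'], 'POSITIVE', 0.848),
    (['Robinhood', 'VTI', 'Fidelity'], 'NEGATIVE', 0.997),
]

# every organization seen for the first time starts with two empty lists
sentiment_sketch = defaultdict(lambda: {'POSITIVE': [], 'NEGATIVE': []})

for orgs, direction, score in rows:
    # a post with no organizations contributes nothing
    for org in orgs:
        sentiment_sketch[org][direction].append(score)

print(dict(sentiment_sketch))
```

Note that a post's single sentiment score is attributed to every organization it mentions, which is why Robinhood, VTI and Fidelity all receive the same value from the fourth post.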

Now we can loop through each organization in the sentiment dictionary, sum its positive and negative scores, and calculate an average net score (the positive total minus the negative total, divided by the number of mentions):

In [15]:
# initialize sentiment list
avg_sentiment = []

# loop through each organization and its lists of scores
for org, scores in sentiment.items():
    # number of posts that mention this organization
    freq = len(scores['POSITIVE']) + len(scores['NEGATIVE'])
    # total score in each direction (0.0 when there are no entries);
    # the dictionary itself is left untouched so the cell can be re-run
    positive = float(sum(scores['POSITIVE']))
    negative = float(sum(scores['NEGATIVE']))
    # net total: positive minus negative
    total = positive - negative
    # and the average score per mention
    avg = total / freq
    # add to sentiment list
    avg_sentiment.append({
        'entity': org,
        'positive': positive,
        'negative': negative,
        'frequency': freq,
        'score': avg
    })

In [16]:
sentiment_df = pd.DataFrame(avg_sentiment)
sentiment_df.head()

Unnamed: 0,entity,positive,negative,frequency,score
0,Hengell,0.0,0.777287,1,-0.777287
1,Orca,0.0,0.777287,1,-0.777287
2,TYSM,0.848047,0.0,1,0.848047
3,Robinhood,0.521976,1.99268,3,-0.490235
4,VTI,0.0,3.274662,4,-0.818665
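
To make the score column concrete, the Robinhood row can be reproduced by hand: the score is the positive total minus the negative total, divided by the number of mentions. A small check using the rounded values from the table above (the helper name net_score is hypothetical and not part of the notebook):

```python
def net_score(positive_total, negative_total, frequency):
    """Average net sentiment: (positive - negative) / mentions."""
    return (positive_total - negative_total) / frequency

# Robinhood row: positive=0.521976, negative=1.99268, frequency=3
robinhood = net_score(0.521976, 1.99268, 3)

# rounded to six decimals for comparison with the score column
print(round(robinhood, 6))
```

Since every individual confidence lies between 0 and 1, the resulting score always falls between -1 and 1, and an entity mentioned only once simply inherits the signed confidence of that one post.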


Immediately we can see that many entities appear only once in our dataset, which pushes their score to one extreme or the other. We can filter out any entity with a frequency of 3 or less to remove many of these instances:

In [17]:
sentiment_df = sentiment_df[sentiment_df['frequency'] > 3]
sentiment_df

Unnamed: 0,entity,positive,negative,frequency,score
4,VTI,0.0,3.274662,4,-0.818665
5,Fidelity,1.604075,15.37509,18,-0.765056
6,OTM,0.967311,4.285027,6,-0.552953
9,EU,1.947235,9.920522,12,-0.664441
12,MSFT,0.877129,2.843752,4,-0.491656
14,Healthcare,0.0,3.998949,4,-0.999737
15,TIPS,0.0,3.998115,4,-0.999529
17,VIX,0.834557,2.283894,4,-0.362334
24,Amazon,3.728065,12.183163,17,-0.497359
27,Apple,3.733522,9.220969,14,-0.39196


Here we have some more relevant information. A few items, such as Fed and Treasury, could still be removed with the BLACKLIST covered in earlier sections, but this list already looks much better than before. Sorting by score in descending order shows the entities with the highest overall score:

In [20]:
sentiment_df.sort_values('score', ascending=False)

Unnamed: 0,entity,positive,negative,frequency,score
112,Intel,3.758579,2.698516,7,0.151438
115,CNBC,1.901295,1.998214,4,-0.02423
101,Ford,2.554175,2.991381,6,-0.072868
32,VC,1.674511,1.995839,4,-0.080332
204,VOO,2.167562,3.887769,7,-0.245744
135,Vanguard,4.342345,7.936739,14,-0.256742
378,EPS,2.920594,4.993761,8,-0.259146
113,Google,2.736281,4.933877,8,-0.2747
17,VIX,0.834557,2.283894,4,-0.362334
75,ETFs,3.877137,8.780845,13,-0.377208
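
The same sort in ascending order surfaces the most negative entities instead. As a self-contained sketch in plain Python, the block below ranks the entity and score pairs copied from the filtered table displayed earlier (the list name filtered is hypothetical). Only the rows visible in that output are included, so the ranking on the full data may differ. On the real DataFrame, sentiment_df.sort_values('score') should give the equivalent ordering.

```python
# (entity, score) pairs copied from the filtered table displayed earlier;
# this is only the visible part of the data, so the real ranking may differ
filtered = [
    ('VTI', -0.818665), ('Fidelity', -0.765056), ('OTM', -0.552953),
    ('EU', -0.664441), ('MSFT', -0.491656), ('Healthcare', -0.999737),
    ('TIPS', -0.999529), ('VIX', -0.362334), ('Amazon', -0.497359),
    ('Apple', -0.39196),
]

# sort ascending by score so the most negative entities come first
lowest = sorted(filtered, key=lambda pair: pair[1])[:3]

for entity, score in lowest:
    print(entity, score)
```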


Very quickly, we have pulled together results using simple, ready-to-use models and zero text preprocessing. With further fine-tuning and process development, these already good results can become great.