In [6]:
import spacy
import pandas as pd
from collections import Counter

In [None]:
nlp = spacy.load('en_core_web_sm')

The NER code will be reformatted into a function that takes a piece of text and returns a list of the organizations it mentions:

In [2]:
def get_orgs(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list
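
One side effect of set() is that it does not preserve order, so the returned list may not follow the order in which organizations appear in the text. If order matters, a minimal sketch of an order-preserving alternative follows. The helper name dedupe_keep_order and the sample mentions are hypothetical and not part of the original notebook.

```python
def dedupe_keep_order(items):
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this drops repeats while keeping the first occurrence of each item
    return list(dict.fromkeys(items))


# hypothetical entity texts, as they might be collected from doc.ents
mentions = ['Fidelity', 'Vanguard', 'Fidelity', 'Robinhood', 'Vanguard']
unique_mentions = dedupe_keep_order(mentions)
print(unique_mentions)
```

Inside get_orgs, this would stand in for the list(set(org_list)) line.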

## Applying NER
All we need to do now is load the /r/investing data and apply the get_orgs function to the selftext column to create a new organizations column.

Load the data and view the top five rows with df.head():

In [3]:
df = pd.read_csv('reddit_investing.csv', sep='|')
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio
0,1642328000.0,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13
1,1642327000.0,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.8
2,1642322000.0,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45
3,1642312000.0,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.5
4,1642306000.0,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87


In [4]:
df['organizations'] = df['selftext'].apply(get_orgs)
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio,organizations
0,1642328000.0,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13,"[Hengell, Orca]"
1,1642327000.0,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.8,[]
2,1642322000.0,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45,[TYSM]
3,1642312000.0,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.5,"[Robinhood, VTI, Fidelity]"
4,1642306000.0,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87,[]


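Each row of the organizations column now holds a list of the organizations mentioned in that post. Next we flatten these lists into one big list and use Counter to create a frequency table of organization mentions. Because get_orgs removes duplicates within each post, each count is the number of posts that mention the organization, not the total number of mentions.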

In [5]:
# merge organizations column into one big list
orgs = df['organizations'].to_list()
orgs = [org for sublist in orgs for org in sublist]
orgs[:10]

['Hengell',
 'Orca',
 'TYSM',
 'Robinhood',
 'VTI',
 'Fidelity',
 'OTM',
 'EWS',
 'PEG',
 'EV']
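
As an alternative to the nested list comprehension, the flattening and counting steps could be combined with itertools.chain.from_iterable. This is only a sketch: org_lists is a hypothetical stand-in for df['organizations'].to_list(), and org_freq_demo is an illustrative name.

```python
from collections import Counter
from itertools import chain

# hypothetical stand-in for df['organizations'].to_list()
org_lists = [['Hengell', 'Orca'], [], ['TYSM'],
             ['Robinhood', 'VTI', 'Fidelity'], ['Fidelity']]

# chain.from_iterable walks the nested lists lazily, one name at a time,
# and Counter tallies how often each name appears
org_freq_demo = Counter(chain.from_iterable(org_lists))
top_two = org_freq_demo.most_common(2)
print(top_two)
```

This avoids building the intermediate flat list, which may matter little at this scale but keeps the step to a single expression.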

In [7]:
# create dictionary of organization mention frequency
org_freq = Counter(orgs)

In [8]:
org_freq.most_common(10)

[('Fed', 27),
 ('Fidelity', 18),
 ('Amazon', 17),
 ('SPY', 16),
 ('Apple', 14),
 ('Vanguard', 14),
 ('Microsoft', 14),
 ('ETFs', 13),
 ('EV', 12),
 ('EU', 12)]

Clearly the data needs further pruning to remove labels that are not organizations, like EV (electric vehicle). Depending on the use case, it may be worth keeping a few of these or removing a few others.

To do this, we create a custom blacklist and check against it in our get_orgs function like so:

In [9]:
BLACKLIST = ['ev', 'covid', 'etf', 'nyse', 'sec', 'spac', 'fda']

def get_orgs(text):
    doc = nlp(text)
    org_list = []
    for entity in doc.ents:
        # keep the entity only if it is an ORG and its text is not in our blacklist
        # (lowercasing the text lets the single entry 'nyse' match both 'nyse' and 'NYSE')
        if entity.label_ == 'ORG' and entity.text.lower() not in BLACKLIST:
            org_list.append(entity.text)
    # if organization is identified more than once it will appear multiple times in list
    # we use set() to remove duplicates then convert back to list
    org_list = list(set(org_list))
    return org_list
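
Note that the check above is an exact match on the lowercased entity text, so 'ETFs' (lowercased to 'etfs') is not caught by the 'etf' entry. One possible refinement, sketched below, also tests the text with a trailing 's' removed. The names BLACKLIST_SET and is_blacklisted are hypothetical, and the plural handling is a crude heuristic, not real lemmatization.

```python
# same entries as BLACKLIST, stored as a set for constant-time membership tests
BLACKLIST_SET = {'ev', 'covid', 'etf', 'nyse', 'sec', 'spac', 'fda'}


def is_blacklisted(name):
    # normalize case and surrounding whitespace before comparing
    key = name.strip().lower()
    if key in BLACKLIST_SET:
        return True
    # crude plural handling (an assumption): 'etfs' -> 'etf', 'spacs' -> 'spac'
    return key.endswith('s') and key[:-1] in BLACKLIST_SET


# hypothetical entity texts to filter
candidates = ['ETFs', 'NYSE', 'Fidelity', 'SPACs', 'EV']
kept = [name for name in candidates if not is_blacklisted(name)]
print(kept)
```

In get_orgs, the condition would then read entity.label_ == 'ORG' and not is_blacklisted(entity.text).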

In [10]:
df['organizations'] = df['selftext'].apply(get_orgs)
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio,organizations
0,1642328000.0,0.0,t3_s58zdo,0.0,\n\nThis past September ClimeWorks launched t...,investing,Breakthrough That Could Reverse Climate Change...,0.0,0.13,"[Hengell, Orca]"
1,1642327000.0,0.0,t3_s58p7l,19.0,Have a general question? Want to offer some c...,investing,Daily General Discussion and Advice Thread - J...,19.0,0.8,[]
2,1642322000.0,0.0,t3_s57c11,0.0,I tried using crypto as a savings account but ...,investing,I've come in to a little money recently due to...,0.0,0.45,[TYSM]
3,1642312000.0,0.0,t3_s54zb3,0.0,I am closing my Betterment account after exper...,investing,Tax Loss Harvesting When Using a VTI and Chill...,0.0,0.5,"[Robinhood, VTI, Fidelity]"
4,1642306000.0,0.0,t3_s53082,79.0,All around the news that US inflation is at 4...,investing,High inflationary environment: Warren Buffett ...,79.0,0.87,[]


In [11]:
df.to_csv('processed_reddit_investing_ner.csv', sep='|', index=False)
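
One thing to keep in mind when loading this file later: CSV has no list type, so each organizations cell is written as the string form of a Python list, for example "['Robinhood', 'VTI', 'Fidelity']". A minimal sketch of turning such cells back into lists with ast.literal_eval follows. The sample cell values are hypothetical.

```python
import ast

# string forms of lists, as they would appear in the saved organizations column
saved_cells = ["['Robinhood', 'VTI', 'Fidelity']", "[]", "['TYSM']"]

# literal_eval only evaluates Python literals, so it is safer than eval() here
parsed_cells = [ast.literal_eval(cell) for cell in saved_cells]
print(parsed_cells)
```

With pandas, the same function can be applied at load time through the converters argument of pd.read_csv, for example converters={'organizations': ast.literal_eval}.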