# NER on Reddit Data

In this example we work through building an NER tool that extracts organization mentions from Reddit data and turns them into a frequency table.

In [None]:
import spacy
from spacy import displacy
import pandas as pd
from collections import Counter

In [3]:
nlp = spacy.load('en_core_web_sm')

We first load the data. The `get_orgs` function, defined after it, consumes a piece of text and returns a list of the organizations mentioned in it:

In [29]:
df = pd.read_csv('./data/reddit_hawwkey.csv', sep='|')
df.head()

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score
0,t3_yua5fs,1668365000.0,hawwkey,Sabres welcome fans from Roswell Park Comprehe...,,1.0,17.0,0.0,17.0
1,t3_ys0byl,1668138000.0,hawwkey,Julien Gauthiers dad does a little dance after...,,0.81,3.0,0.0,3.0
2,t3_yn8i0r,1667687000.0,hawwkey,Tarasenko absolutely leveled by 9 year old :),,0.88,98.0,0.0,98.0
3,t3_ymighg,1667617000.0,hawwkey,Local Finland Kids Participate in Avalanche-Bl...,,1.0,10.0,0.0,10.0
4,t3_ymhthh,1667616000.0,hawwkey,some of the hurricanes dancing along with the ...,,0.98,257.0,0.0,257.0


In [30]:
def get_orgs(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    for entity in doc.ents:
        # keep only the entities that the model labels as organizations
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if an organization is identified more than once it will appear multiple times in the list
    # we use set() to remove duplicates, then convert back to a list
    org_list = list(set(org_list))
    return org_list

In [31]:
df['organizations'] = df['title'].apply(get_orgs)

In [32]:
df.head()

Unnamed: 0,name,created_utc,subreddit,title,selftext,upvote_ratio,ups,downs,score,organizations
0,t3_yua5fs,1668365000.0,hawwkey,Sabres welcome fans from Roswell Park Comprehe...,,1.0,17.0,0.0,17.0,[Roswell Park Comprehensive Cancer Center on H...
1,t3_ys0byl,1668138000.0,hawwkey,Julien Gauthiers dad does a little dance after...,,0.81,3.0,0.0,3.0,[Rangers]
2,t3_yn8i0r,1667687000.0,hawwkey,Tarasenko absolutely leveled by 9 year old :),,0.88,98.0,0.0,98.0,[]
3,t3_ymighg,1667617000.0,hawwkey,Local Finland Kids Participate in Avalanche-Bl...,,1.0,10.0,0.0,10.0,"[Local Finland Kids Participate, Avalanche-Blu..."
4,t3_ymhthh,1667616000.0,hawwkey,some of the hurricanes dancing along with the ...,,0.98,257.0,0.0,257.0,[]


In [33]:
from collections import Counter

In [34]:
orgs = df['organizations'].to_list()

In [35]:
# Convert list of lists to a single list
orgs = [org for sublist in orgs for org in sublist]

In [37]:
# Commented out to reduce the file size of the notebook
# orgs

In [38]:
org_freq = Counter(orgs)

In [39]:
org_freq.most_common(10)

[('NHL', 37),
 ('Fleury', 8),
 ('Oilers', 5),
 ('Rangers', 4),
 ('Ovi', 4),
 ('Backstrom', 4),
 ('Instagram', 4),
 ('Preds', 4),
 ('TSN', 3),
 ('Canadiens', 3)]
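
The flatten-and-count steps above can be tried on a small, made-up sample. This is a minimal sketch: `sample_orgs` is hypothetical data standing in for the `organizations` column, and `itertools.chain.from_iterable` is used as an equivalent alternative to the nested list comprehension. Because `get_orgs` removes duplicates within each title, each count is the number of titles that mention an organization, not the total number of mentions.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for df['organizations'].to_list():
# one de-duplicated list of organizations per title.
sample_orgs = [["NHL", "Oilers"], [], ["NHL"], ["Rangers", "NHL"]]

# Flatten the list of lists into a single list.
flat_orgs = list(chain.from_iterable(sample_orgs))

# Count how many titles mention each organization.
sample_freq = Counter(flat_orgs)
print(sample_freq.most_common(1))
# [('NHL', 3)]
```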

In [44]:
BLACKLIST = ['nhl', 'instagram', 'gm', 'tsn']

def get_orgs_blacklist(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    for entity in doc.ents:
        # here we modify the original code to check that the entity text is not one of our 'blacklisted' organizations
        # (we also add .lower() to lowercase the text, which lets us match both 'nhl' and 'NHL' with just 'nhl')
        if entity.label_ == 'ORG' and entity.text.lower() not in BLACKLIST:
            org_list.append(entity.text)
    # if an organization is identified more than once it will appear multiple times in the list
    # we use set() to remove duplicates, then convert back to a list
    org_list = list(set(org_list))
    return org_list

In [45]:
df['organizations'] = df['title'].apply(get_orgs_blacklist)

In [46]:
orgs = df['organizations'].to_list()
orgs = [org for sublist in orgs for org in sublist]
org_freq = Counter(orgs)
org_freq.most_common(10)

[('Fleury', 8),
 ('Oilers', 5),
 ('Rangers', 4),
 ('Ovi', 4),
 ('Backstrom', 4),
 ('Preds', 4),
 ('Canadiens', 3),
 ('Flames', 3),
 ('St. Louis Blues', 3),
 ('Laila', 3)]
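
The filtered results still contain entries such as 'Fleury' and 'Ovi' that look like player names rather than organizations, so the blacklist will probably need to grow. The sketch below isolates the filtering step on plain strings so it can be tested without running the model. It is only an illustration: `BLACKLIST_SET` and `filter_blacklisted` are hypothetical names, the blacklist is stored as a set, and, unlike `set()` in the notebook, duplicates are removed case-insensitively while the order of first appearance is kept.

```python
# Assumption: the same terms as BLACKLIST above, stored as a set for fast lookups.
BLACKLIST_SET = {"nhl", "instagram", "gm", "tsn"}


def filter_blacklisted(candidates, blacklist=BLACKLIST_SET):
    """Return unique candidates whose normalized form is not blacklisted."""
    seen = set()
    kept = []
    for name in candidates:
        # Normalize so that ' nhl ' and 'NHL' are treated the same way.
        key = name.strip().lower()
        if key in blacklist or key in seen:
            continue
        seen.add(key)
        kept.append(name.strip())
    return kept


print(filter_blacklisted(["NHL", "Oilers", " nhl ", "Oilers", "TSN", "Rangers"]))
# ['Oilers', 'Rangers']
```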

In [47]:
df['organizations'].head()

0    [Roswell Park Comprehensive Cancer Center on H...
1                                            [Rangers]
2                                                   []
3    [Local Finland Kids Participate, Avalanche-Blu...
4                                                   []
Name: organizations, dtype: object

In [48]:
df.to_csv('./data/reddit_hawwkey_ner.csv', sep='|', index=False)

## Applying NER

All we need to do now is load in the */r/investing* data and apply the `get_orgs` function to our text column to create a new `organizations` column.

1. Load the data and view the top five rows with `df.head()`:

In [4]:
df = pd.read_csv('./data/reddit_investing.csv', sep='|')
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio
0,1614290000.0,0.0,t3_lshtjn,10.0,Bloomberg article: [https://www.bloomberg.com/...,investing,Fed Views Rising Yields as Bullish Sign Reflec...,10.0,0.86
1,1614286000.0,0.0,t3_lsgahw,56.0,Given the recent downturn in stocks especially...,investing,ARK ETFs implosion risk ------------------------,56.0,0.83
2,1614283000.0,0.0,t3_lsf8td,1.0,[https://twitter.com/desogames/status/13649710...,investing,The Counter-Party Risk Bubble,1.0,0.53
3,1614282000.0,0.0,t3_lsf3nh,6.0,"When you think of futures, what comes to your ...",investing,Futures were made for days like these,6.0,0.62
4,1614278000.0,0.0,t3_lsdcib,3.0,I've been on this sub for quite some time and ...,investing,Let's talk about liquidity premiums,3.0,0.67


2. Extract mentioned organizations from `selftext` and add to a new column called `organizations`:

In [5]:
df['organizations'] = df['selftext'].apply(get_orgs)
df.head()

Unnamed: 0,created_utc,downs,id,score,selftext,subreddit,title,ups,upvote_ratio,organizations
0,1614290000.0,0.0,t3_lshtjn,10.0,Bloomberg article: [https://www.bloomberg.com/...,investing,Fed Views Rising Yields as Bullish Sign Reflec...,10.0,0.86,"[Rebound \n&gt, Raphael Bostic, St. Louis Fed..."
1,1614286000.0,0.0,t3_lsgahw,56.0,Given the recent downturn in stocks especially...,investing,ARK ETFs implosion risk ------------------------,56.0,0.83,[ARK]
2,1614283000.0,0.0,t3_lsf8td,1.0,[https://twitter.com/desogames/status/13649710...,investing,The Counter-Party Risk Bubble,1.0,0.53,"[Citadel, OWN, ITM]"
3,1614282000.0,0.0,t3_lsf3nh,6.0,"When you think of futures, what comes to your ...",investing,Futures were made for days like these,6.0,0.62,[NQ]
4,1614278000.0,0.0,t3_lsdcib,3.0,I've been on this sub for quite some time and ...,investing,Let's talk about liquidity premiums,3.0,0.67,[]


*(This step can take a long time to run. It can be useful to break larger datasets into more manageable chunks if required.)*
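
One way to process the data in chunks is sketched below. It is a minimal illustration, not the notebook's own code: `chunked`, `extract_in_chunks` and `fake_get_orgs` are hypothetical helpers, and `fake_get_orgs` only stands in for `get_orgs` so that the sketch runs without a spaCy model. With real data you would pass `get_orgs` and something like `df['selftext'].to_list()` instead. spaCy also offers `nlp.pipe` for processing many texts in batches, which may be worth trying as an alternative.

```python
def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_in_chunks(texts, extractor, size=1000):
    """Apply `extractor` to every text, one chunk at a time."""
    results = []
    for number, chunk in enumerate(chunked(texts, size), start=1):
        results.extend(extractor(text) for text in chunk)
        # Progress message, so long runs show some sign of life.
        print(f"processed chunk {number} ({len(results)} of {len(texts)} texts)")
    return results


# Hypothetical stand-in for get_orgs: treats fully uppercase words as organizations.
def fake_get_orgs(text):
    return [word for word in text.split() if word.isupper()]


texts = ["ARK keeps buying TSLA", "nothing here", "GM and EPS"]
orgs_per_text = extract_in_chunks(texts, fake_get_orgs, size=2)
```

The result has one list per input text, in the original order, so it can be assigned back to a new column.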

The `organizations` column now holds a list of the organizations mentioned in each post. Next we take the full column, merge the lists into one, and use `Counter` to create a frequency table of organization mentions.

In [6]:
# merge organizations column into one big list
orgs = df['organizations'].to_list()
orgs = [org for sublist in orgs for org in sublist]
orgs[:10]

['Rebound  \n&gt',
 'Raphael Bostic',
 'St. Louis Fed',
 'Bullard',
 'the Atlanta Fed',
 'Bostic',
 'ARK',
 'Citadel',
 'OWN',
 'ITM']
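
The first entry, `'Rebound  \n&gt'`, suggests that HTML entities and newlines left in `selftext` are leaking into the extracted entities. Below is a minimal sketch of a cleaning step that could run before `get_orgs`, assuming that such residue is the cause. `clean_text` is a hypothetical helper that is not part of the original notebook, and the sample string is made up.

```python
import html
import re


def clean_text(text):
    """Decode HTML entities and collapse runs of whitespace."""
    # '&gt;' becomes '>', '&amp;' becomes '&', and so on.
    text = html.unescape(text)
    # Replace newlines, tabs and repeated spaces with a single space.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Bullish Sign Reflecting Rebound  \n&gt; quoted line"))
# Bullish Sign Reflecting Rebound > quoted line
```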

## Create Frequency Table

In [7]:
from collections import Counter

In [8]:
# create dictionary of organization mention frequency
org_freq = Counter(orgs)

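We now have a *Counter* dictionary containing all of our organization labels as *keys*, and their mention frequency as *values*. Because `get_orgs` removes duplicates within each post, each value is the number of posts that mention the organization. The `most_common(n)` method allows us to view the **n** most frequently mentioned organizations: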

In [9]:
org_freq.most_common(10)

[('Amazon', 26),
 ('Apple', 25),
 ('TSLA', 18),
 ('PE', 16),
 ('EPS', 16),
 ('ARK', 15),
 ('EBITDA', 15),
 ('GM', 14),
 ('Google', 14),
 ('Nasdaq', 12)]
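
Several of the top entries ('PE', 'EPS', 'EBITDA') are financial metrics rather than organizations, so the blacklist idea used for the hockey data applies here as well. The sketch below filters an existing `Counter` without re-running NER. The counts are copied from the output above, while `top_counts`, `FINANCE_BLACKLIST` and `drop_blacklisted` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Counts copied from the most_common(10) output above.
top_counts = Counter({
    "Amazon": 26, "Apple": 25, "TSLA": 18, "PE": 16, "EPS": 16,
    "ARK": 15, "EBITDA": 15, "GM": 14, "Google": 14, "Nasdaq": 12,
})

# Assumption: lowercase terms that should not be counted as organizations.
FINANCE_BLACKLIST = {"pe", "eps", "ebitda"}


def drop_blacklisted(freq, blacklist):
    """Return a new Counter without the blacklisted keys (case-insensitive)."""
    return Counter({org: n for org, n in freq.items() if org.lower() not in blacklist})


filtered_counts = drop_blacklisted(top_counts, FINANCE_BLACKLIST)
print(filtered_counts.most_common(3))
# [('Amazon', 26), ('Apple', 25), ('TSLA', 18)]
```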