# Part-of-Speech Tagging

This notebook covers part-of-speech (POS) tagging. These methods help us parse sentences computationally and better understand words in context.

For this notebook I will use Ruth Bader Ginsburg's obituary as input (link: https://www.legacy.com/news/celebrity-deaths/ruth-bader-ginsburg-1933-2020-influential-u-s-supreme-court-justice/).

Parts of speech are the grammatical categories of words, such as (in English) nouns, verbs, adjectives, adverbs, pronouns, and prepositions. Each part of speech plays a different role in a sentence. By identifying parts of speech computationally, one can start exploring syntax, that is, the relationships between words, rather than focusing only on words in isolation, as tf-idf does.
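
To make the contrast with plain word counts concrete, here is a minimal sketch. The tagged tokens are hand-labeled for illustration (they are not spaCy output), and the tag names simply follow the NOUN/VERB convention used later in this notebook.

```python
from collections import Counter

# Hand-labeled (word, tag) pairs for illustration only; not produced by spaCy.
tagged = [
    ("She", "PRON"), ("fought", "VERB"), ("the", "DET"), ("fight", "NOUN"),
    ("and", "CCONJ"), ("would", "AUX"), ("fight", "VERB"), ("again", "ADV"),
]

# Counting words in isolation merges both uses of "fight".
word_counts = Counter(word.lower() for word, _ in tagged)

# Counting (word, tag) pairs keeps the noun and the verb apart.
tagged_counts = Counter((word.lower(), tag) for word, tag in tagged)

print(word_counts["fight"])              # 2
print(tagged_counts[("fight", "NOUN")])  # 1
print(tagged_counts[("fight", "VERB")])  # 1
```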

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.options.display.max_rows = 600
pd.options.display.max_colwidth = 400

In [2]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import en_core_web_sm
nlp = en_core_web_sm.load()

In [4]:
PATHNAME = "ruth.txt"
#Read the whole obituary; the with-block closes the file afterwards
with open(PATHNAME, encoding='utf-8') as f:
    text = f.read()
document = nlp(text)

In [5]:
#Set some display options for the visualizer
options = {"compact": True, "distance": 90, "color": "yellow", "bg": "black", "font": "Gill Sans"}

displacy.render(document, style="dep", options=options)

In [6]:
for token in document:
    if token.pos_ == "NOUN":
        print(token, token.pos_)

justice NOUN
woman NOUN
woman NOUN
court NOUN
history NOUN
tenure NOUN
dissents NOUN
longevity NOUN
court NOUN
rulings NOUN
majority NOUN
opinion NOUN
admissions NOUN
policy NOUN
court NOUN
decision NOUN
election NOUN
recount NOUN
arguments NOUN
court NOUN
granting NOUN
marriage NOUN
rights NOUN
sex NOUN
couples NOUN
women NOUN
rights NOUN
aims NOUN
career NOUN
women NOUN
rights NOUN
women NOUN
law NOUN
school NOUN
class NOUN
respect NOUN
status NOUN
woman NOUN
lawyer NOUN
law NOUN
journal NOUN
women NOUN
rights NOUN
appointment NOUN
judge NOUN
cases NOUN
women NOUN
rights NOUN
women NOUN
men NOUN
appointment NOUN
judge NOUN
reputation NOUN
judge NOUN
time NOUN
confirmation NOUN
years NOUN
years NOUN
member NOUN
court NOUN
wing NOUN
Cancer NOUN
cancer NOUN
times NOUN
appointment NOUN
surgery NOUN
colon NOUN
cancer NOUN
chemotherapy NOUN
radiation NOUN
day NOUN
bench NOUN
cancer NOUN
lung NOUN
cancer NOUN
arguments NOUN
time NOUN
lung NOUN
lobectomy NOUN
bench NOUN
faculties NOUN
chemot

In [7]:
for token in document:
    if token.pos_ == "VERB":
        print(token, token.pos_)

appointed VERB
Nominated VERB
noted VERB
wrote VERB
stating VERB
dissented VERB
end VERB
made VERB
led VERB
Fighting VERB
advancing VERB
had VERB
fight VERB
went VERB
co VERB
- VERB
found VERB
focus VERB
founding VERB
argued VERB
including VERB
extended VERB
apply VERB
served VERB
nominated VERB
serving VERB
gained VERB
held VERB
known VERB
fights VERB
fought VERB
underwent VERB
followed VERB
missing VERB
followed VERB
missed VERB
recovering VERB
remained VERB
step VERB
had VERB
serve VERB
announced VERB
undergoing VERB
became VERB
towering VERB
elevated VERB
pop VERB
referred VERB
lifted VERB
beloved VERB
loved VERB
worn VERB
collected VERB
had VERB
wear VERB
issuing VERB
made VERB
uttered VERB
asked VERB
be VERB
say VERB
’d VERB
been VERB
raised VERB
lie VERB
lie VERB
be VERB
receive VERB
be VERB
draped VERB
dating VERB
lost VERB
lost VERB
mourn VERB
remember VERB
knew VERB
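
The verb list above contains several inflections of the same verb (Fighting, fight, fights, fought). spaCy exposes a base form for each token as `token.lemma_`, so one option is to tally lemmas instead of surface forms. The sketch below uses a hypothetical `FakeToken` stand-in with hand-assigned lemmas so that it runs without a model; with a real `document`, the same function should accept spaCy tokens directly, since it only reads `pos_` and `lemma_`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in exposing the same attribute names as a spaCy token.
FakeToken = namedtuple("FakeToken", ["text", "pos_", "lemma_"])

def count_lemmas(tokens, pos):
    """Tally lowercased lemmas of all tokens whose coarse POS tag equals `pos`."""
    return Counter(tok.lemma_.lower() for tok in tokens if tok.pos_ == pos)

# Lemmas here are assigned by hand for illustration, not taken from spaCy.
sample = [
    FakeToken("Fighting", "VERB", "fight"),
    FakeToken("fight", "VERB", "fight"),
    FakeToken("fights", "VERB", "fight"),
    FakeToken("fought", "VERB", "fight"),
    FakeToken("court", "NOUN", "court"),
]

print(count_lemmas(sample, "VERB").most_common())  # [('fight', 4)]
```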


In [8]:
for token in document:
    print(token.text, token.pos_, token.dep_)

Ruth PROPN compound
Bader PROPN compound
Ginsburg PROPN nsubj
was AUX ROOT
an DET det
associate ADJ amod
justice NOUN attr
of ADP prep
the DET det
U.S. PROPN compound
Supreme PROPN compound
Court PROPN pobj
, PUNCT punct
the DET det
second ADJ amod
woman NOUN appos
and CCONJ cc
the DET det
first ADJ amod
Jewish ADJ amod
woman NOUN conj
appointed VERB acl
to ADP prep
the DET det
court NOUN pobj
in ADP prep
U.S. PROPN compound
history NOUN pobj
. PUNCT punct



 SPACE nummod
Supreme PROPN compound
Court PROPN compound
tenure NOUN nsubj


 SPACE advcl
Nominated VERB dobj
to ADP prep
the DET det
Supreme PROPN compound
Court PROPN pobj
by ADP prep
President PROPN compound
Bill PROPN compound
Clinton PROPN pobj
in ADP prep
1993 NUM pobj
, PUNCT punct
Ginsburg PROPN nsubjpass
was AUX auxpass
noted VERB ROOT
for ADP prep
her PRON poss
liberal ADJ amod
dissents NOUN pobj
and CCONJ cc
her PRON poss
longevity NOUN conj
on ADP prep
the DET det
court NOUN pobj
. PUNCT punct
Among ADP prep
Ginsberg 

crystal NOUN nsubj
clear ADJ ccomp
before ADP mark
she PRON nsubj
ever ADV advmod
uttered VERB advcl
a DET det
word NOUN dobj
. PUNCT punct


 SPACE ROOT
Ginsburg PROPN nsubj
on ADP prep
gender NOUN compound
equality NOUN pobj


 SPACE ROOT
“ PUNCT punct
[ PUNCT punct
W]hen ADV advmod
I PRON nsubjpass
’m AUX auxpass
sometimes ADV advmod
asked VERB ccomp
when ADV advmod
there PRON expl
will AUX aux
be VERB ccomp
enough ADJ acomp
[ PUNCT punct
women NOUN npadvmod
on ADP prep
the DET det
Supreme PROPN compound
Court PROPN pobj
] PUNCT punct
and CCONJ cc
I PRON nsubj
say VERB conj
‘ PUNCT punct
When ADV advmod
there PRON expl
are AUX ccomp
nine NUM attr
, PUNCT punct
’ PUNCT punct
people NOUN nsubj
are AUX ccomp
shocked ADJ acomp
. PUNCT punct
But CCONJ cc
there PRON expl
’d VERB neg
been VERB ROOT
nine NUM nummod
men NOUN attr
, PUNCT punct
and CCONJ cc
nobody PRON nsubj
’s AUX auxpass
ever ADV advmod
raised VERB conj
a DET det
question NOUN dobj
about ADP prep
that DET pobj
. PUNCT punct
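
The third column above is the dependency label, which says what role a token plays relative to its head. As a small sketch of how these labels can be used, the snippet below picks out grammatical subjects and roots from a few (text, POS, dependency) rows copied from the output above. The helper name `tokens_with_dep` is my own; with a real `document`, the same filter could read `token.dep_` directly.

```python
# Rows copied by hand from the printed output above.
rows = [
    ("Ruth", "PROPN", "compound"),
    ("Bader", "PROPN", "compound"),
    ("Ginsburg", "PROPN", "nsubj"),
    ("was", "AUX", "ROOT"),
    ("an", "DET", "det"),
    ("associate", "ADJ", "amod"),
    ("justice", "NOUN", "attr"),
    ("Ginsburg", "PROPN", "nsubjpass"),
    ("was", "AUX", "auxpass"),
    ("noted", "VERB", "ROOT"),
]

def tokens_with_dep(rows, labels):
    """Return the text of every row whose dependency label is in `labels`."""
    return [text for text, pos, dep in rows if dep in labels]

# Active and passive subjects
print(tokens_with_dep(rows, {"nsubj", "nsubjpass"}))  # ['Ginsburg', 'Ginsburg']

# The root of each sentence fragment
print(tokens_with_dep(rows, {"ROOT"}))  # ['was', 'noted']
```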

In [9]:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.text)
        
adjs

['associate',
 'second',
 'first',
 'Jewish',
 'liberal',
 'notable',
 'male',
 'unconstitutional',
 'presidential',
 'decisive',
 'same',
 'primary',
 'young',
 'first',
 'co',
 '-',
 'several',
 'advanced',
 'federal',
 'moderate',
 'true',
 'later',
 'staunch',
 'liberal',
 'several',
 'first',
 'single',
 'pancreatic',
 'oral',
 'first',
 'left',
 'determined',
 'mental',
 'political',
 'later',
 'fiery',
 'human',
 'wide',
 'decorative',
 'judicial',
 'favorite',
 'clear',
 'enough',
 'shocked',
 '10th',
 'Funeral',
 'public',
 'available',
 'first',
 'first',
 'Jewish',
 'private',
 'black',
 'historic',
 'cherished',
 'future',
 'resolute']

In [10]:
adjs_tally = Counter(adjs)

In [11]:
adjs_tally.most_common()

[('first', 6),
 ('Jewish', 2),
 ('liberal', 2),
 ('several', 2),
 ('later', 2),
 ('associate', 1),
 ('second', 1),
 ('notable', 1),
 ('male', 1),
 ('unconstitutional', 1),
 ('presidential', 1),
 ('decisive', 1),
 ('same', 1),
 ('primary', 1),
 ('young', 1),
 ('co', 1),
 ('-', 1),
 ('advanced', 1),
 ('federal', 1),
 ('moderate', 1),
 ('true', 1),
 ('staunch', 1),
 ('single', 1),
 ('pancreatic', 1),
 ('oral', 1),
 ('left', 1),
 ('determined', 1),
 ('mental', 1),
 ('political', 1),
 ('fiery', 1),
 ('human', 1),
 ('wide', 1),
 ('decorative', 1),
 ('judicial', 1),
 ('favorite', 1),
 ('clear', 1),
 ('enough', 1),
 ('shocked', 1),
 ('10th', 1),
 ('Funeral', 1),
 ('public', 1),
 ('available', 1),
 ('private', 1),
 ('black', 1),
 ('historic', 1),
 ('cherished', 1),
 ('future', 1),
 ('resolute', 1)]
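
The tally above counts raw token text, so fragments such as 'co' and '-' (pieces of a hyphenated word) show up as adjectives. One possible cleanup, sketched below on a short hand-picked subset of the list, keeps only purely alphabetic strings. Note the trade-offs: 'co' is alphabetic and survives, while '10th' is dropped because it contains digits. Whether that is acceptable depends on the analysis.

```python
from collections import Counter

# A hand-picked subset of the adjective list above.
adjs_sample = ["first", "Jewish", "liberal", "first", "co", "-", "liberal", "10th"]

# str.isalpha() is False for "-" and for "10th" (it contains digits).
clean_tally = Counter(adj for adj in adjs_sample if adj.isalpha())

print(dict(clean_tally))  # {'first': 2, 'Jewish': 1, 'liberal': 2, 'co': 1}
```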

In [12]:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

                adj  count
0             first      6
1            Jewish      2
2           liberal      2
3           several      2
4             later      2
5         associate      1
6            second      1
7           notable      1
8              male      1
9  unconstitutional      1


In [13]:
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.text)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

          noun  count
0        women      7
1       rights      6
2        court      5
3       cancer      5
4        woman      4
5  appointment      3
6        judge      3
7        years      3
8        bench      3
9      justice      2


In [14]:
verbs = [token.text for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

        verb  count
0        had      3
1         be      3
2       made      2
3   followed      2
4        lie      2
5       lost      2
6  appointed      1
7  Nominated      1
8      noted      1
9      wrote      1


In [15]:
import re
from IPython.display import Markdown, display

In [16]:
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            #re.escape treats the keyword as literal text, and \1 keeps the capitalization found in the sentence
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(f"({re.escape(keyword)})", r"**\1**", sentence, flags=re.IGNORECASE)
            
            display(Markdown(sentence))

In [17]:
find_sentences_with_keyword(keyword="liberal", document=document)

   Supreme Court tenure  Nominated to the Supreme Court by President Bill Clinton in 1993, Ginsburg was noted for her **liberal** dissents and her longevity on the court.

It was only in later years that she became known as a staunch member of the court’s **liberal** wing.
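
The substring check in `find_sentences_with_keyword` would also match longer words that contain the keyword, for example "liberalism" when searching for "liberal". If only whole-word matches are wanted, one option is a regex with word boundaries. This is a sketch under that assumption; the helper name `bold_whole_word` is hypothetical, and the second test sentence is invented for illustration.

```python
import re

def bold_whole_word(sentence, keyword):
    """Return `sentence` with whole-word matches of `keyword` wrapped in ** **,
    or None if the keyword does not occur as a whole word."""
    pattern = re.compile(rf"\b({re.escape(keyword)})\b", flags=re.IGNORECASE)
    if not pattern.search(sentence):
        return None
    # \1 re-inserts the text exactly as it was matched, preserving its case.
    return pattern.sub(r"**\1**", sentence)

print(bold_whole_word("Ginsburg was noted for her liberal dissents and her longevity on the court.", "liberal"))
# Ginsburg was noted for her **liberal** dissents and her longevity on the court.

print(bold_whole_word("An essay on liberalism.", "liberal"))
# None
```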

In [18]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [19]:
#Make a function to get all runs of N consecutive words (two by default, i.e. bigrams)
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [20]:
bigrams = get_bigrams(tokens_and_labels)

In [21]:
bigrams[5:20]

[[('associate', 'ADJ'), ('justice', 'NOUN')],
 [('justice', 'NOUN'), ('of', 'ADP')],
 [('of', 'ADP'), ('the', 'DET')],
 [('the', 'DET'), ('Supreme', 'PROPN')],
 [('Supreme', 'PROPN'), ('Court', 'PROPN')],
 [('Court', 'PROPN'), ('the', 'DET')],
 [('the', 'DET'), ('second', 'ADJ')],
 [('second', 'ADJ'), ('woman', 'NOUN')],
 [('woman', 'NOUN'), ('and', 'CCONJ')],
 [('and', 'CCONJ'), ('the', 'DET')],
 [('the', 'DET'), ('first', 'ADJ')],
 [('first', 'ADJ'), ('Jewish', 'ADJ')],
 [('Jewish', 'ADJ'), ('woman', 'NOUN')],
 [('woman', 'NOUN'), ('appointed', 'VERB')],
 [('appointed', 'VERB'), ('to', 'ADP')]]
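
The same sliding window can also be written with `zip`, which pairs each item with the ones that follow it. This is an alternative sketch, not a replacement for `get_bigrams`; the name `ngrams_zip` is my own, and it returns tuples rather than lists.

```python
def ngrams_zip(items, n=2):
    """Return all runs of `n` consecutive items as tuples."""
    # items[i:] is the sequence shifted left by i; zip stops at the shortest one.
    return list(zip(*(items[i:] for i in range(n))))

words = ["the", "first", "Jewish", "woman"]

print(ngrams_zip(words))
# [('the', 'first'), ('first', 'Jewish'), ('Jewish', 'woman')]

print(ngrams_zip(words, n=3))
# [('the', 'first', 'Jewish'), ('first', 'Jewish', 'woman')]
```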

In [22]:
def get_neighbor_words(keyword, bigrams, pos_label=None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if label == pos_label or pos_label is None:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

In [23]:
get_neighbor_words("liberal", bigrams)

[('her', 1), ('dissents', 1), ('court', 1), ('wing', 1)]

In [25]:
get_neighbor_words("liberal", bigrams, pos_label='NOUN')

[('dissents', 1), ('court', 1), ('wing', 1)]
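
`get_neighbor_words` pools the word before and the word after the keyword. If direction matters (for instance, to see which nouns an adjective modifies rather than what precedes it), the two sides can be tallied separately. The sketch below is one way to do that. The function name is hypothetical, and the sample pairs are taken from the two "liberal" sentences above, assuming the same alphabetic-only token filter.

```python
from collections import Counter

def get_left_right_neighbors(keyword, tokens_and_labels):
    """Count the words immediately before and immediately after `keyword`."""
    keyword = keyword.lower()
    words = [word.lower() for word, _ in tokens_and_labels]
    left, right = Counter(), Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        if i > 0:
            left[words[i - 1]] += 1
        if i + 1 < len(words):
            right[words[i + 1]] += 1
    return left, right

# Two short fragments joined for illustration: "her liberal dissents"
# and "court liberal wing" (the possessive is dropped by the alphabetic filter).
sample = [
    ("her", "PRON"), ("liberal", "ADJ"), ("dissents", "NOUN"),
    ("court", "NOUN"), ("liberal", "ADJ"), ("wing", "NOUN"),
]

left, right = get_left_right_neighbors("liberal", sample)
print(dict(left))   # {'her': 1, 'court': 1}
print(dict(right))  # {'dissents': 1, 'wing': 1}
```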