# 3-1 Contexts & Collocations

Back in notebook 1-2, [Texts and NLTK](https://github.com/johnlaudun/text-as-data/blob/main/notebooks/1-2-texts-and-NLTK-lab.ipynb), we explored some of the ways words accrue meaning through the context of their usage. We return to that now to understand how extracting grammar-based features, keyphrases, entities, n-grams (and collocations) can help us analyze texts more thoroughly.

## Imports and Data

NLTK is a big library. I'm okay loading it in its entirety when I first start work, but as I go I like to narrow down which functions I actually use and gradually trim my imports to just those.

In [1]:
# IMPORTS
from pathlib import Path
import re
import nltk # See note above.
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from nltk.corpus import stopwords
stoplist = stopwords.words('english')

# MPL block
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 300
plt.rcParams["figure.figsize"] = (10,5)

These custom functions actually come into play later in the notebook, but as a habit I tend to migrate custom functions up in a notebook. Usually I put them just below the imported libraries as part of the overall "load-in."

In [2]:
# The doc string still needs work
def processed(a_string):
    """
    processed takes a string and returns a string of lemmas.
    Stopwords are kept (the removal step is commented out).
    Requires the following imports:
    -------------------------------
    import re
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    """
    # Get rid of HTML (or HTML-like) tags
    clean = re.sub('<.*?>', '', a_string)
    # first we lower-case everything
    lowered = clean.lower()
    # then tokenize
    words = word_tokenize(lowered)
    # remove stopwords (disabled; uncomment the next line to enable)
    # words = [word for word in words if word not in stoplist]
    # lemmatize
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    # rejoin the list of lemmas into a string and return
    return " ".join(lemmas)

def lemmify(a_string):
    """
    lemmify takes a string and returns a list of lemmas
    with stopwords removed.
    Requires the following imports:
    -------------------------------
    import re
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    stoplist = stopwords.words('english')
    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    """
    # Get rid of HTML (or HTML-like) tags
    clean = re.sub('<.*?>', '', a_string)
    # first we lower-case everything
    lowered = clean.lower()
    # then tokenize
    tokens = word_tokenize(lowered)
    # remove stopwords
    words = [token for token in tokens if token not in stoplist]
    # lemmatize
    lemmas = [lemmatizer.lemmatize(word) for word in words]
    # Return a list of lemmas
    return lemmas

In [3]:
# DATA
# Previously we did this three times.
# Now we're just going to use a list to feed an f-string
# https://realpython.com/python-f-strings/
folders = ["histories", "tragedies", "comedies"]
plays = []
for folder in folders:
    for p in Path(f"../data/shakespeare/{folder}/").glob('*.txt'):
        with open(p, mode="r", encoding="utf-16") as f:
            contents = f.read()
            plays.append(contents)

# If we have 37, we got them all
len(plays)

37

## Transform Data into NLTK Text

In [4]:
# To explore the plays:
# we are going to combine them into one string
text = " ".join(plays)
# Break them into a list of tokens
tokens = nltk.tokenize.word_tokenize(text)
# Create an NLTK Text object from this list
shakespeare = nltk.Text(tokens)

In [5]:
shakespeare.concordance("king")

Displaying 25 of 2074 matches:
< Shakespeare -- THE FIRST PART OF KING HENRY VI > < from Online Library of 
 < Dead March . Enter the Funeral of King Henry the Fifth ... > < ... attended
have consented unto Henry 's death ! King Henry the Fifth , too famous to live
 to live long ! England ne'er lost a king of so much worth . < /BEDFORD > < GL
CESTER > < 1 % > England ne'er had a king until his time . Virtue he had , des
ER > < WINCHESTER > < 1 % > He was a king bless 'd of the King of kings . Unto
 1 % > He was a king bless 'd of the King of kings . Unto the French the dread
ort : The Dauphin Charles is crowned king in Rheims ; The Bastard of Orleans w
EXETER > < 4 % > The Dauphin crowned king ! all fly to him ! O ! whither shall
ur laments , Wherewith you now bedew King Henry 's hearse , I must inform you 
And then I will proclaim young Henry king . < /GLOUCESTER > < STAGE DIR > < Ex
> To Eltham will I , where the young king is , Being ordain 'd his special gov
will not be Jack-out-of

If you are dealing with longer texts in your corpus, you might want to think about visualizing where certain words fall within the text. It would be interesting, for example, to see how "king" is distributed across the various plays: we could do that with NLTK's `FreqDist` and a for-loop.
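A minimal sketch of that per-play count, assuming a tiny stand-in list (`toy_plays`) in place of the `plays` list loaded above. It uses `collections.Counter`, which `nltk.FreqDist` subclasses, so swapping in `nltk.FreqDist` should behave the same way for this lookup. The regex tokenizer is a rough simplification of `word_tokenize`, and `count_word` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the full play texts in `plays`
toy_plays = [
    "The king is dead. Long live the king!",
    "No king rules here, only a duke.",
    "A comedy of errors and no crowns at all.",
]

def count_word(texts, word):
    """Return how many times `word` occurs in each text."""
    counts = []
    for text in texts:
        # Lower-case, then keep runs of letters/apostrophes (a rough tokenizer)
        tokens = re.findall(r"[a-z']+", text.lower())
        fdist = Counter(tokens)  # nltk.FreqDist(tokens) offers the same lookup
        counts.append(fdist[word])  # missing words count as 0
    return counts

king_counts = count_word(toy_plays, "king")
print(king_counts)
```

With one count per play in hand, a bar chart (one bar per play) is a quick way to see the distribution.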

<div class="alert alert-block alert-success">
<b>Action:</b> Look up how to create distribution or dispersion plots in Python. This is a common enough need that there are built-in functions in Plotly and Seaborn.
</div>

In [6]:
shakespeare.similar("king")

dir and lord duke man world prince queen day time crown love word
heart one way matter that lady but


In [7]:
shakespeare.similar("woman")

man time fool king day lady gentleman and love maid heart good word
life night is lord place dir thing


In [8]:
shakespeare.generate()

Building ngram index...


banishment ! 't is a lion That I have laboured for the house of York
Shall be brought unto the gentleman , and let them hear what pitiful
cries they made with this decree ; She 's the matter was , And revel
it with groans ; but I was sometime Milan.—Quickly , spirit ! > <
SANDS > < /STAGE DIR > < BEATRICE > < 77 % > How 's this ? still-vex
'd Bermoothes ; there art thou there , Enforce him with the bitterness
of soul To the Fool. < /FRENCH KING > < 86 %




In [9]:
shakespeare.collocations()

/STAGE DIR; STAGE DIR; /MARK ANTONY; MARK ANTONY; thou art; thou hast;
/DON PEDRO; Sir John; thou shalt; /ANTIPHOLUS SYR.; ANTIPHOLUS SYR.;
/DROMIO SYR.; DROMIO SYR.; Thou art; King Henry; dost thou; thou wilt;
Good morrow; art thou; /ANTIPHOLUS EPH.


A quick look at the bigram functionality in NLTK reveals two things: we need to be better about cleaning our texts, and it produces a long list that needs to be tallied. (That is, we don't want to see *all* the bigrams, only the most interesting ones, which means both cleaning, again, and counting and sorting.)
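The counting and sorting can be sketched with the standard library alone, assuming a short hypothetical token list (`toy_tokens`) in place of the full `tokens` list. `zip(toy_tokens, toy_tokens[1:])` yields the same adjacent pairs that `nltk.bigrams` does.

```python
from collections import Counter

# Hypothetical stand-in for the full list of tokens
toy_tokens = "thou art a king and thou art a fool and thou art mine".split()

# Pair each token with its right-hand neighbor, then tally the pairs
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# most_common() sorts by frequency, highest first
top_three = bigram_counts.most_common(3)
for pair, count in top_three:
    print(count, " ".join(pair))
```

The same pattern scales to the whole corpus, though the top of the list will be dominated by markup and function words until the tokens are cleaned.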

In [10]:
bigrams = nltk.bigrams(tokens[100:120])

for bigram in bigrams:
    print(bigram)

('SCENE', '1')
('1', '>')
('>', '<')
('<', 'Westminster')
('Westminster', 'Abbey.')
('Abbey.', '>')
('>', '<')
('<', 'STAGE')
('STAGE', 'DIR')
('DIR', '>')
('>', '<')
('<', 'Dead')
('Dead', 'March')
('March', '.')
('.', 'Enter')
('Enter', 'the')
('the', 'Funeral')
('Funeral', 'of')
('of', 'King')


Cleaning can mean simply filtering out stop tokens (a set that should now include those angle brackets), as well as normalizing and possibly (possibly) lemmatizing.
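One way to fold the angle brackets into the stop tokens, sketched with a deliberately tiny, hypothetical stop list (`toy_stoplist`) rather than NLTK's full English list. The input tokens are copied from the bigram output above.

```python
# Tokens as they appear in the bigram output above
raw_tokens = ["<", "STAGE", "DIR", ">", "<", "Dead", "March", ".",
              "Enter", "the", "Funeral", "of", "King"]

# Hypothetical mini stop list; NLTK's English list is much longer
toy_stoplist = {"the", "of", "a", "and"}

# Treat markup and punctuation tokens as stop tokens too
stop_tokens = toy_stoplist | {"<", ">", "."}

# Normalize (lower-case) each token, then drop anything in the stop set
cleaned = [
    token.lower()
    for token in raw_tokens
    if token.lower() not in stop_tokens
]
print(cleaned)
```

Note that the tag names themselves (`stage`, `dir`) survive this filter. Removing whole tags before tokenizing, as the regular expression in `processed` does, avoids that.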

In [11]:
# Say hello to our old friend:
vec = CountVectorizer(preprocessor = processed, ngram_range=(2,3))

# matrix of ngrams
ngrams = vec.fit_transform(plays)

# count frequency of ngrams
count_values = ngrams.toarray().sum(axis=0)

# list of ngrams
vocab = vec.vocabulary_
df_ngram = pd.DataFrame(sorted([(count_values[i],k) for k,i in vocab.items()],
                            reverse=True)).rename(columns={0: 'frequency', 1:'bigram/trigram'})
df_ngram.shape

(956676, 2)

In [12]:
df_ngram[20:40]

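|    | frequency | bigram/trigram |
|----|-----------|----------------|
| 20 | 552       | all the        |
| 21 | 548       | with the       |
| 22 | 524       | thou art       |
| 23 | 521       | no more        |
| 24 | 520       | will not       |
| 25 | 519       | let me         |
| 26 | 503       | of this        |
| 27 | 502       | this is        |
| 28 | 497       | of your        |
| 29 | 493       | in my          |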


## Part-of-Speech Tagging

*Also known as POS tagging.*

In [1]:
redshirts = """
From the top of the large boulder he sat on, Ensign Tom Davis looked across the expanse of the cave toward Captain Lucius Abernathy, Science Officer Qeeng and Chief Engineer Paul West perched on a second, larger boulder, and thought, Well, this sucks. “Borgovian Land Worms!” Captain Abernathy said, and smacked his boulder with an open palm. “I should have known.” You should have known? How the hell could you not have known? thought Ensign Davis, and looked at the vast dirt floor of the cave, its powdery surface moving here and there with the shadowy humps that marked the movement of the massive, carnivorous worms.
"""

# word_tokenize is imported in the IMPORTS cell at the top of the notebook;
# run that cell first to avoid a NameError here.
redshirts = word_tokenize(redshirts)

In [20]:
redshirts_tags = nltk.pos_tag(redshirts)
redshirts_tags

[('From', 'IN'),
 ('the', 'DT'),
 ('top', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('large', 'JJ'),
 ('boulder', 'NN'),
 ('he', 'PRP'),
 ('sat', 'VBD'),
 ('on', 'IN'),
 (',', ','),
 ('Ensign', 'NNP'),
 ('Tom', 'NNP'),
 ('Davis', 'NNP'),
 ('looked', 'VBD'),
 ('across', 'IN'),
 ('the', 'DT'),
 ('expanse', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('cave', 'NN'),
 ('toward', 'IN'),
 ('Captain', 'NNP'),
 ('Lucius', 'NNP'),
 ('Abernathy', 'NNP'),
 (',', ','),
 ('Science', 'NNP'),
 ('Officer', 'NNP'),
 ('Qeeng', 'NNP'),
 ('and', 'CC'),
 ('Chief', 'NNP'),
 ('Engineer', 'NNP'),
 ('Paul', 'NNP'),
 ('West', 'NNP'),
 ('perched', 'VBD'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('second', 'JJ'),
 (',', ','),
 ('larger', 'JJR'),
 ('boulder', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('thought', 'VBD'),
 (',', ','),
 ('Well', 'NNP'),
 (',', ','),
 ('this', 'DT'),
 ('sucks', 'NNS'),
 ('.', '.')]

A short list of the codes above:

| Tag | Part of Speech |
|-----|----------------|
| JJ  | Adjectives     |
| NN  | Nouns          |
| RB  | Adverbs        |
| PRP | Pronouns       |
| VB  | Verbs          |

But this [alphabetical list](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) on a UPenn web page is really the most helpful.
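Because each tagged token is a `(word, tag)` tuple, filtering by tag is a short list comprehension. A small sketch, using a hand-copied slice of the tagger output above (`sample_tags`) so that it runs without the tagger; `words_with_tag` is a hypothetical helper name.

```python
# The first fifteen (word, tag) pairs from the output above
sample_tags = [
    ("From", "IN"), ("the", "DT"), ("top", "NN"), ("of", "IN"),
    ("the", "DT"), ("large", "JJ"), ("boulder", "NN"), ("he", "PRP"),
    ("sat", "VBD"), ("on", "IN"), (",", ","), ("Ensign", "NNP"),
    ("Tom", "NNP"), ("Davis", "NNP"), ("looked", "VBD"),
]

def words_with_tag(tagged, prefix):
    """Return words whose tag starts with `prefix` ("NN" matches NN, NNS, NNP)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag(sample_tags, "NN")
verbs = words_with_tag(sample_tags, "VB")
print(nouns)
print(verbs)
```

Matching on the prefix is a convenience: the Penn Treebank tags refine the base categories with suffixes (`NNP` for proper nouns, `VBD` for past-tense verbs), so a prefix match gathers the whole family.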

In [15]:
# UNCOMMENT to run: Be aware it's long.
# For a fuller list:
# nltk.help.upenn_tagset()

# You have to make sure you download it first:
# nltk.download("tagsets")

## Chunking

In [18]:
# Create a chunk "grammar"
grammar = "NP: {<DT>?<JJ>*<NN>}"

# Instantiate the chunk parser
chunk_parser = nltk.RegexpParser(grammar)
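The grammar reads like a regular expression over tags: an optional determiner (`<DT>?`), any number of adjectives (`<JJ>*`), then a noun (`<NN>`). To make that concrete, here is a rough stand-in that applies the same pattern with Python's `re` module to a hand-tagged fragment. It mimics the idea only and is not NLTK's own implementation; the `tagged` list and variable names are illustrative.

```python
import re

# Hand-tagged fragment (tags copied from the tagger output above)
tagged = [("the", "DT"), ("large", "JJ"), ("boulder", "NN"),
          ("he", "PRP"), ("sat", "VBD"), ("on", "IN"),
          ("the", "DT"), ("cave", "NN")]

# Encode the tag sequence as one string: "<DT><JJ><NN><PRP>..."
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# The grammar's pattern, written as an ordinary regular expression
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

noun_phrases = []
for match in np_pattern.finditer(tag_string):
    # Convert character offsets back to token positions by counting "<"
    start = tag_string[:match.start()].count("<")
    end = start + match.group().count("<")
    noun_phrases.append(" ".join(word for word, _ in tagged[start:end]))

print(noun_phrases)
```

Because `<NN>` must match the whole tag, proper nouns (`NNP`) and plurals (`NNS`) are not chunked by this grammar; a pattern such as `<NN.*>` would be needed to include them.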

In [22]:
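sentence = "From the top of the large boulder he sat on, Ensign Tom Davis looked across the expanse of the cave toward Captain Lucius Abernathy and thought, Well, this sucks."

# Tokenize and tag the sentence, then chunk it with the grammar above
sentence_tags = nltk.pos_tag(word_tokenize(sentence))
tree = chunk_parser.parse(sentence_tags)

# draw() opens a separate window; print(tree) gives a plain-text view instead
tree.draw()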
