<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 / QSS 30.16 21F</h2>
</center>

----

# Lab 2
## Punctuation, Part of Speech Tagging, Named-Entity Recognition, Segmentation, and Vectorization

 <center><pre>Created: 10/09/2019; Revised 09/27/2021</pre></center>

<h3><font color="Green">Part One: Part of Speech Tagging</font></h3>

In [None]:
# Let's begin by loading up some important libraries/packages
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

import numpy as np
import glob

# allow for displaying of graphics
%matplotlib inline

In [None]:
# Let's learn about NLTK's Part of Speech (POS) Tagger. 
# Write a sample sentence here...

test_sentence = """
"""

In [None]:
# We need to tokenize a sentence in order to tag the words.
test_sentence_tokens = word_tokenize(test_sentence)

# Now we run the tagger:
nltk.pos_tag(test_sentence_tokens)

In [None]:
# The complete list of tag types appears at the bottom of this notebook

# Now let's return to the second cell and write some other kinds of sentences.
# Experiment with words that could be nouns or verbs depending on context.
# How well does this work?

<h3><font color="Green">Part Two: Named Entity Recognition</font></h3>

The following are the Named Entities that NLTK can recognize:

|NER|Example|
|------------|-----------|
|ORGANIZATION|Georgia-Pacific Corp., WHO|
|PERSON|Eddy Bonte, President Obama|
|LOCATION|Murray River, Mount Everest|
|DATE|June, 2008-06-29|
|TIME|two fifty a m, 1:30 p.m.|
|MONEY|175 million Canadian Dollars, GBP 10.40|
|PERCENT|twenty pct, 18.75 %|
|FACILITY|Washington Monument, Stonehenge|
|GPE|South East Asia, Midlothian|

In [None]:
# There are 150 English-language novels in Andrew Piper's Novel450 dataset:
for document in sorted(glob.glob("shared/ENGL64.05-21F/data/Novel450/EN*")):
    print(document)

In [None]:
# select one of these and read it into the variable raw_text.
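
One way to do this, as a minimal sketch: the helper name `read_text` and the UTF-8 encoding are assumptions rather than part of the lab, so change the encoding if your file fails to decode.

```python
import glob

def read_text(path, encoding="utf-8"):
    """Return the full contents of a plain-text file as one string."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Hypothetical choice: take the first English-language file in sorted order.
files = sorted(glob.glob("shared/ENGL64.05-21F/data/Novel450/EN*"))
if files:
    raw_text = read_text(files[0])
    print(files[0], "->", len(raw_text), "characters")
```

Replace `files[0]` with the path of whichever novel you want to study.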

In [None]:
# Okay, let's determine how long it is (token count) using our old friend, word_tokenize
tokens = nltk.word_tokenize(raw_text)
print("found",len(tokens),"tokens")

In [None]:
# Let's look at the first 300 words (roughly a page)
sample_tokens = tokens[:300]

In [None]:
# We'll use the 'Named Entity Chunker' ne_chunk to 'chunk' our tagged 
# tokens and then apply named entity recognition.
ner_data = ne_chunk(pos_tag(sample_tokens))

In [None]:
ner_type = "PERSON" # define NER category of interest

# we'll make a dictionary to store found Named Entities
found_objects = dict()

# Count every entity of the selected type.
# Note: i[0][0] keeps only the first token of each entity.
for i in ner_data.subtrees():
    if i.label() == ner_type:
        ner_object = i[0][0]
        if ner_object in found_objects:
            found_objects[ner_object] += 1
        else:
            found_objects[ner_object] = 1

top_objects = sorted(found_objects, key=found_objects.get, reverse=True)
for i in top_objects:
    print(i,found_objects[i])
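
The loop above records only the first token of each entity, so a two-word name is counted under its first word. A possible variant, sketched here under assumptions, joins every token in the entity and counts with `collections.Counter`. The function name `count_entities` and the hand-built `toy_tree` are hypothetical; the toy tree only stands in for real `ne_chunk` output.

```python
from collections import Counter
from nltk.tree import Tree

def count_entities(tree, ner_type="PERSON"):
    """Count named entities of one type, joining multi-token names."""
    counts = Counter()
    for subtree in tree.subtrees():
        if subtree.label() == ner_type:
            # leaves() yields (token, tag) pairs; keep only the tokens
            name = " ".join(token for token, tag in subtree.leaves())
            counts[name] += 1
    return counts

# Hand-built stand-in for the output of ne_chunk(pos_tag(...))
toy_tree = Tree("S", [
    Tree("PERSON", [("Elizabeth", "NNP"), ("Bennet", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("London", "NNP")]),
    ("with", "IN"),
    Tree("PERSON", [("Jane", "NNP")]),
])

for name, n in count_entities(toy_tree).most_common():
    print(name, n)
```

With real data you would call `count_entities(ner_data, ner_type)` instead of passing the toy tree.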

In [None]:
# Now go back and select a different range (different number of pages) of your text. 
# Then try another text.
# How well does this work?

<h3><font color="Green">Part Three: Document Segmentation</font></h3>

As we just saw, it is sometimes better to operate on a small section of text. We can call these units "segments" and produce them automatically. With a standardized set of segments we can better understand changes throughout narrative time (the "syuzhet" or emplotted narrative).

In [None]:
# select one of the above texts and (re)read it into the variable raw_text:

In [None]:
# Tokenize
tokens = nltk.word_tokenize(raw_text)
print("found",len(tokens),"tokens")

In [None]:
# Typically we predetermine the number of segments we want created.
ns = 100 # how many segments do we want to create?
segment_length = len(tokens) // ns # how many tokens go in each segment?
segments = list()
for j in range(ns):
    seg = tokens[segment_length*j:segment_length*(j+1)]
    segments.append(seg)
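
Because the segment length is rounded down, the loop above silently drops up to `ns - 1` tokens from the end of the text. If that matters for your analysis, one option is to spread the leftover tokens across the first few segments. This is a sketch, not the lab's method, and the function name `split_into_segments` is an assumption.

```python
def split_into_segments(tokens, ns):
    """Split tokens into ns contiguous segments whose lengths differ by at most one."""
    base, extra = divmod(len(tokens), ns)
    segments = []
    start = 0
    for j in range(ns):
        # the first `extra` segments each take one additional token
        end = start + base + (1 if j < extra else 0)
        segments.append(tokens[start:end])
        start = end
    return segments

# Toy example: 10 tokens into 3 segments
toy_tokens = list(range(10))
print(split_into_segments(toy_tokens, 3))
```

No token is lost this way, at the cost of segments that are not all exactly the same length.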

In [None]:
# Let's begin by tagging the first segment
pos_data = nltk.pos_tag(segments[0])

# find all the proper nouns (NNP)
found_words = [word for word in pos_data if word[1] == 'NNP']
print(len(set(found_words)))

In [None]:
# display them
found_words

In [None]:
# What is our percentage of proper nouns per segment?
data_to_plot=list()

for s in segments:
    total_tokens = len(s)
    
    # extract Part of Speech data 
    pos_data = nltk.pos_tag(s)
    
    # select only objects of interest
    found_words = [word for word in pos_data if word[1] == 'NNP']

    # add to list
    data_to_plot.append((round(len(found_words)/total_tokens * 100,2)))
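
The percentage calculation in the loop above can also be wrapped in a small helper so that the tag of interest becomes a parameter. This is only a sketch: the name `tag_percentage` and the hand-tagged `toy_tagged` list are hypothetical, and the helper returns 0.0 for an empty segment rather than dividing by zero.

```python
def tag_percentage(tagged_tokens, tag):
    """Percentage of (word, tag) pairs that carry the given tag."""
    if not tagged_tokens:
        return 0.0
    matches = sum(1 for _, t in tagged_tokens if t == tag)
    return round(matches / len(tagged_tokens) * 100, 2)

# Hand-tagged stand-in for the output of nltk.pos_tag(segment)
toy_tagged = [("Emma", "NNP"), ("Woodhouse", "NNP"), ("was", "VBD"), ("handsome", "JJ")]
print(tag_percentage(toy_tagged, "NNP"))
```

With real data, `tag_percentage(nltk.pos_tag(s), 'NNP')` would give the value for one segment `s`.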

In [None]:
# display these percentages over narrative time
import matplotlib.pyplot as plt
x = np.arange(len(data_to_plot))
plt.plot(x, data_to_plot)
plt.title("Distribution of Proper Nouns")
plt.show()

In [None]:
# What is this? What can this distribution of the percentage of
# proper nouns tell us?

# Now go back and change the tag to find foreign words (FW) instead

<h3><font color="Green">Part Four: Punctuation</font></h3>

Let's now compare the use of punctuation in two different authors.

1. Select one text from each of two different authors.
2. Read and tokenize each file.
3. Use the `punct_count` function to obtain a dictionary of counts for 1,000-word segments.
4. Compare the use of each punctuation mark as the mean number of instances per 1,000-word segment.

In [None]:
punctuation_list = [".",",",";",":","?","!","—","-","[","(","&","/"]

In [None]:
def punct_count(tokens):
    # create segments of 1,000 tokens
    segment_length = 1000
    ns = int(len(tokens) / segment_length) # how many segments are needed?
    segments = list()
    for j in range(ns):
        seg = tokens[segment_length*j:segment_length*(j+1)]
        segments.append(seg)
    punct_dict = dict()

    # process each segment and count appearance of punctuation marks
    for seg in segments:
        for p in punctuation_list:
            if p not in punct_dict:
                punct_dict[p] = [seg.count(p)]
            else:
                punct_dict[p].append(seg.count(p))
    return punct_dict
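
Step 4 above, the comparison of means, might look something like the following. To keep the sketch self-contained it uses its own helper, `mean_per_segment` (a hypothetical name), instead of `punct_count`, and two toy token lists in place of real tokenized novels.

```python
from statistics import mean

def mean_per_segment(tokens, marks, segment_length=1000):
    """Mean count of each mark across full segments of segment_length tokens."""
    n_segments = len(tokens) // segment_length
    means = {}
    for mark in marks:
        per_segment = [
            tokens[j * segment_length:(j + 1) * segment_length].count(mark)
            for j in range(n_segments)
        ]
        # a text shorter than one segment has no full segments to average
        means[mark] = mean(per_segment) if per_segment else 0.0
    return means

# Toy token lists standing in for two tokenized novels (hypothetical data)
author_a = ["word", ",", "word", ";"] * 500   # 2,000 tokens
author_b = ["word", "word", "word", "."] * 500

marks = [",", ";", "."]
means_a = mean_per_segment(author_a, marks)
means_b = mean_per_segment(author_b, marks)
for mark in marks:
    print(repr(mark), means_a[mark], means_b[mark])
```

With the lab's own function, the equivalent step is to take the mean of each list in the dictionary returned by `punct_count`.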

<h3><font color="Green">Part Five: Vectorization</font></h3>

Now we're going to convert our texts into a document-term matrix. We'll use Scikit-Learn to vectorize the files.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input='filename',
                             lowercase=True,
                             strip_accents='unicode')

In [None]:
# Vectorize all 150 English-language Novels
input_files = glob.glob("shared/ENGL64.05-21F/data/Novel450/EN*")

# This does the actual vectorization
counts = vectorizer.fit_transform(input_files)

# Return total number of documents and the number of items in the vocabulary
dc, vc = counts.shape
print("document count:",dc,"vocabulary count:",vc)
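
To see what a document-term matrix looks like at a readable scale, here is a small sketch on two in-memory strings. The strings and the names `toy_docs`, `toy_vectorizer`, and `toy_counts` are made up for illustration. Because `input` is left at its default, the vectorizer reads the strings themselves rather than filenames.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the cat saw the dog"]

toy_vectorizer = CountVectorizer(lowercase=True)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index in the matrix
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)

# one row per document, one column per vocabulary term
print(toy_counts.toarray())
```

Each row is a document and each column is a term, so the `shape` printed above for the novels is (number of documents, vocabulary size).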

In [None]:
# what are our top terms?
vocab_sums = counts.sum(axis=0)
sorted_vocab = [(v, vocab_sums[0, i]) for v, i in vectorizer.vocabulary_.items()]
sorted_vocab = sorted(sorted_vocab, key = lambda x: x[1], reverse=True)

# display top twenty words
for i in range(20):
    print(sorted_vocab[i][0],"->",sorted_vocab[i][1])

In [None]:
# We're now going to limit the vocabulary.
# Review the documentation for the vectorizer by executing this cell, then modify the line above in 
# which we initialize the vectorizer from CountVectorizer. 
#
# FIRST:
# Remove the English language "stopwords" and check the top terms. What was removed? What remains?
#
# THEN
# Limit the vocabulary to only those terms appearing in at least 75% of the documents

help(vectorizer)

POS tag list:
----

|Tag|Meaning|
|---|-------|
|CC|coordinating conjunction|
|CD|cardinal number|
|DT|determiner|
|EX|existential there|
|FW|foreign word|
|IN|preposition/subordinating conjunction|
|JJ|adjective|
|JJR|adjective, comparative|
|JJS|adjective, superlative|
|LS|list marker|
|MD|modal|
|NN|noun, singular|
|NNS|noun, plural|
|NNP|proper noun, singular|
|NNPS|proper noun, plural|
|PDT|predeterminer|
|POS|possessive ending|
|PRP|personal pronoun|
|PRP\$|possessive pronoun|
|RB|adverb|
|RBR|adverb, comparative|
|RBS|adverb, superlative|
|RP|particle|
|TO|"to", as in "to go"|
|UH|interjection|
|VB|verb, base form|
|VBD|verb, past tense|
|VBG|verb, gerund/present participle|
|VBN|verb, past participle|
|VBP|verb, non-3rd person sing. present|
|VBZ|verb, 3rd person sing. present|
|WDT|wh-determiner which|
|WP|wh-pronoun who, what|
|WP\$|possessive wh-pronoun whose|
|WRB|wh-adverb where, when|



