<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 Fall 2019</h2>
</center>

----

# Lab 3
## Segments, Text Processing, and Part of Speech Tagging 

 <center><pre>Rev: 10/09/2019</pre></center>

<h3><font color="Red">Part One: Part of Speech Tagging</font></h3>

In [None]:
# Let's begin by loading up some important libraries
import nltk

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag, ne_chunk

import numpy as np

# allow for displaying of graphics
%matplotlib inline

In [None]:
# Let's learn about Part of Speech Tagging. 
# Write a sample sentence here...

test_sentence = "TEXT GOES HERE"

In [None]:
# We need to tokenize a sentence in order to tag the words.
test_sentence_tokens = word_tokenize(test_sentence)

# Now we run the tagger:
nltk.pos_tag(test_sentence_tokens)

In [None]:
# the list of tag types appears at the bottom of this notebook

# Now let's return to the second cell and write some other kinds of sentences.
# Experiment with words that could be nouns or verbs depending on context.
# How well does this work?

In [None]:
# We can use this to "read" and quantify text not as word vectors but as
# an arrangement or pattern of specific parts of speech.

# This is a paragraph from F. Scott Fitzgerald's This Side of Paradise (his Princeton novel)
this_side_of_paradise="""
Amory was far from contented. He missed the place he had won at St. Regis', the being known and admired, 
yet Princeton stimulated him, and there were many things ahead calculated to arouse the Machiavelli latent 
in him, could he but insert a wedge. The upper-class clubs, concerning which he had pumped a reluctant graduate 
during the previous summer, excited his curiosity: Ivy, detached and breathlessly aristocratic; Cottage, 
an impressive mélange of brilliant adventurers and well-dressed philanderers; Tiger Inn, broad-shouldered and 
athletic, vitalized by an honest elaboration of prep-school standards; Cap and Gown, anti-alcoholic, faintly 
religious and politically powerful; flamboyant Colonial; literary Quadrangle; and the dozen others, varying in 
age and position."""

In [None]:
# Now we'll tokenize, tag it, and display what we've found:

# We need to tokenize a sentence in order to tag the words.
p_tokens = word_tokenize(this_side_of_paradise)

# Now we run the tagger:
tagged = nltk.pos_tag(p_tokens)

# find the set of found objects:
for tag_type in set([x[1] for x in tagged if x[1].isalpha()]):
    print("{0}: ".format(tag_type),end='')
    for x in tagged:
        if x[1] == tag_type:
            print(x[0],end=' ')
    print()
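
The grouping done by the nested loops above can also be built in a single pass with a dictionary. This is a minimal sketch, not the lab's own method: `sample_tagged` is a hand-written list in the same `(token, tag)` shape that `pos_tag` returns, its tags are illustrative rather than actual tagger output, and `group_by_tag` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-written (token, tag) pairs in the same shape that pos_tag returns.
# These tags are illustrative, not actual tagger output.
sample_tagged = [
    ("Amory", "NNP"), ("was", "VBD"), ("far", "RB"), ("from", "IN"),
    ("contented", "JJ"), (".", "."), ("He", "PRP"), ("missed", "VBD"),
    ("the", "DT"), ("place", "NN"),
]

def group_by_tag(tagged_tokens):
    """Map each alphabetic tag to the list of tokens that carry it."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        if tag.isalpha():  # skip punctuation tags such as "." and ","
            groups[tag].append(token)
    return dict(groups)

groups = group_by_tag(sample_tagged)
for tag in sorted(groups):
    print("{0}: {1}".format(tag, " ".join(groups[tag])))
```

Note that the `isalpha()` filter, here as in the cell above, also leaves out tags that contain a symbol, such as `PRP$` and `WP$`.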

In [None]:
# Let's convert our tokens for that text into a NLTK Text object.
p_text = nltk.Text(p_tokens)

In [None]:
# What can we do with this type of object?
help(p_text)

In [None]:
p_text.concordance("he")

In [None]:
# Try another method from the above list.
p_text.XXX

In [None]:
# Now let's open the entire text of This Side of Paradise

with open('data/Novel450/EN_1920_Fitzgerald,FScott_ThisSideofParadise_Novel.txt') as f:
    raw_text = f.read()
tokens = word_tokenize(raw_text)
nltk_text_object = nltk.Text(tokens)

In [None]:
# Here's a sample (collocation as a list).
# Try some other things
nltk_text_object.collocation_list()

In [None]:
# What is this showing us?
nltk_text_object.dispersion_plot(["Amory","Rosalind"])

<h3><font color="Red">Part Two: Named Entity Recognition</font></h3>

In [None]:
# We'll use the 'Named Entity Chunker' ne_chunk to 'chunk' our tagged
# tokens and then apply named entity recognition.

ner_data = ne_chunk(pos_tag(tokens))

|NER|Example|
|------------|-----------|
|ORGANIZATION|Georgia-Pacific Corp., WHO|
|PERSON|Eddy Bonte, President Obama|
|LOCATION|Murray River, Mount Everest|
|DATE|June, 2008-06-29|
|TIME|two fifty a m, 1:30 p.m.|
|MONEY|175 million Canadian Dollars, GBP 10.40|
|PERCENT|twenty pct, 18.75 %|
|FACILITY|Washington Monument, Stonehenge|
|GPE|South East Asia, Midlothian|

In [None]:
# we'll make a dictionary to store found Named Entities
found_objects = dict()

# GPE
for i in ner_data.subtrees():
    if i.label() == 'GPE':
        # i[0] is the first (token, tag) pair in the chunk, so
        # i[0][0] is only the first word of the entity
        ner_object = i[0][0]
        if ner_object in found_objects:
            found_objects[ner_object] += 1
        else:
            found_objects[ner_object] = 1


top_objects = sorted(found_objects, key=found_objects.get, reverse=True)
for i in top_objects:
    print(i,found_objects[i])
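
Because the loop above stores only `i[0][0]`, the first token of each chunk, a two-word place name is counted under its first word alone. The sketch below counts the full entity string instead and uses `collections.Counter` in place of the manual dictionary. It makes some assumptions: `sample_entities` is hypothetical hand-written data, not chunker output, and `count_entities` is an illustrative helper name. Pairs of this shape could be built from `ner_data.subtrees()` with `(st.label(), [tok for tok, tag in st.leaves()])`.

```python
from collections import Counter

# Hypothetical (label, tokens) pairs, the shape one could build from
# ner_data.subtrees() with (st.label(), [tok for tok, tag in st.leaves()]).
sample_entities = [
    ("GPE", ["New", "York"]),
    ("PERSON", ["Amory"]),
    ("GPE", ["Princeton"]),
    ("GPE", ["New", "York"]),
    ("PERSON", ["Rosalind"]),
]

def count_entities(labeled_entities, label):
    """Count full entity strings (all tokens joined) for one label."""
    return Counter(
        " ".join(tokens)
        for entity_label, tokens in labeled_entities
        if entity_label == label
    )

gpe_counts = count_entities(sample_entities, "GPE")
for name, count in gpe_counts.most_common():
    print(name, count)
```

Changing the `label` argument to another type from the table above, such as `"PERSON"`, gives the counts for that entity type.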

In [None]:
# Edit the above and try some other searches. 

In [None]:
# Now try another text. Go back to the raw_text line and swap for another.
# Here are some possibilities:

# data/Novel450/EN_1818_Shelley,Mary_Frankenstein_Novel.txt
# data/Novel450/EN_1847_Bronte,Emily_WutheringHeights_Novel.txt
# data/Novel450/EN_1884_Twain,Mark_TheAdventuresofHuckleberryFinn_Novel.txt
# data/Novel450/EN_1869_Alcott,Louisa_LittleWomen_Novel.txt
# data/Novel450/EN_1890_Wilde,Oscar_ThePictureofDorianGray_Novel.txt

<h3><font color="Red">Part Three: Segmentation</font></h3>

In [None]:
# Many, many tasks require us to compare units of text. Let's look at
# creating segments of text that can be used for comparing the internal
# distribution of words.

In [None]:
# This is the "raw text" of Virginia Woolf's Mrs. Dalloway
with open("data/Novel450/EN_1925_Woolf,Virginia_Mrs.Dalloway_Novel.txt") as f:
    raw_text = f.read()

In [None]:
# Okay, let's determine how long it is (word count)

# sent_tokenize is new for us--it attempts to give us individual sentences.
sentences = sent_tokenize(raw_text)
print("found",len(sentences),"sentences")

# our old friend, word_tokenize
words = word_tokenize(raw_text)
print("found",len(words),"words")
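
The count printed above includes punctuation, because `word_tokenize` returns marks such as `.` and `,` as separate tokens. A rough word count can leave those out. This is a minimal sketch under stated assumptions: `sample_tokens` is a hand-written list in the style of tokenizer output, and `count_words` is a hypothetical helper that treats any token containing a letter or digit as a word.

```python
# Hand-written tokens in the style word_tokenize produces (illustrative).
sample_tokens = ["Mrs.", "Dalloway", "said", "she", "would", "buy", "the",
                 "flowers", "herself", "."]

def count_words(tokens):
    """Count tokens that contain at least one letter or digit."""
    return sum(1 for tok in tokens if any(ch.isalnum() for ch in tok))

print("tokens:", len(sample_tokens))        # tokens: 10
print("words:", count_words(sample_tokens))  # words: 9
```

Calling `count_words(words)` on the full token list would give the corresponding figure for the novel.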

In [None]:
# What if we created segments of 1,000 words?
length = 1000
segment_list = list()
n_words = len(words)

# Step through the word list in strides of `length`, so that
# no word is skipped at a segment boundary. The final segment
# holds the remainder of the words and may be shorter than
# the others. Maybe we don't want this?
for start in range(0, n_words, length):
    segment_list.append(words[start:start + length])

In [None]:
# How many buckets did we create?
print(len(segment_list))
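
One way to make both the segment length and the treatment of the final, shorter segment adjustable is to wrap the segmentation in a function. This is a sketch rather than a required part of the lab: `make_segments` and its `keep_remainder` parameter are hypothetical names, and `demo_tokens` is a stand-in for a real token list.

```python
def make_segments(tokens, length, keep_remainder=True):
    """Split tokens into consecutive segments of `length` tokens."""
    if length <= 0:
        raise ValueError("length must be positive")
    segments = [tokens[start:start + length]
                for start in range(0, len(tokens), length)]
    # Optionally drop a final segment that is shorter than `length`.
    if segments and not keep_remainder and len(segments[-1]) < length:
        segments.pop()
    return segments

demo_tokens = list(range(10))  # stand-in for a token list
print(make_segments(demo_tokens, 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(make_segments(demo_tokens, 4, keep_remainder=False))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With the novel loaded, `make_segments(words, 500)` would produce 500-word buckets for comparison with the 1,000-word ones.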

In [None]:
# Want to try again with a different length?

In [None]:
# Part of Speech Tagging

# Let's begin with the first bucket
pos_data = nltk.pos_tag(segment_list[0])

# find all the proper nouns (NNP)
found_words = [word for word in pos_data if word[1] == 'NNP']
print(len(set(found_words)))

In [None]:
# display them
found_words

In [None]:
# What is our percent of proper nouns per bucket?

data_to_plot=list()
for seg in segment_list:
    total_tokens = len(seg)
    pos_data = nltk.pos_tag(seg)
    found_words = [word for word in pos_data if word[1] == 'NNP']
    data_to_plot.append((round(len(found_words)/total_tokens * 100,2)))

In [None]:
# display as chart
import matplotlib.pyplot as plt
x = np.arange(len(data_to_plot))
plt.bar(x, data_to_plot)
plt.show()

In [None]:
# What is this? What can this distribution of the percentage of
# proper nouns tell us?

# What else could we determine by processing and comparing segments of text?
# - Places mentioned 
# - Type/Token Ratio
# - Others?
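
The type/token ratio mentioned above can be computed per segment in a few lines. This is a minimal sketch under one possible definition (distinct lowercased word forms divided by total word tokens, punctuation excluded); `type_token_ratio` is a hypothetical helper and `demo_segments` is hand-written sample data, not text from the novel.

```python
def type_token_ratio(tokens):
    """Distinct lowercased word forms divided by total word tokens."""
    word_tokens = [tok.lower() for tok in tokens
                   if any(ch.isalnum() for ch in tok)]
    if not word_tokens:
        return 0.0
    return len(set(word_tokens)) / len(word_tokens)

# Two tiny hand-written segments (illustrative only).
demo_segments = [
    ["The", "cat", "saw", "the", "cat", "."],
    ["A", "dog", "chased", "a", "bird", "today", "."],
]
ttr_per_segment = [round(type_token_ratio(seg), 2) for seg in demo_segments]
print(ttr_per_segment)  # [0.6, 0.83]
```

Because the ratio tends to fall as a text gets longer, it is most meaningful when the segments being compared have the same length.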

POS tag list:
----

|Tag|Meaning|
|---|-------|
|CC|coordinating conjunction|
|CD|cardinal number|
|DT|determiner|
|EX|existential there|
|FW|foreign word|
|IN|preposition/subordinating conjunction|
|JJ|adjective|
|JJR|adjective, comparative|
|JJS|adjective, superlative|
|LS|list marker|
|MD|modal|
|NN|noun, singular|
|NNS|noun, plural|
|NNP|proper noun, singular|
|NNPS|proper noun, plural|
|PDT|predeterminer|
|POS|possessive ending|
|PRP|personal pronoun|
|PRP$|possessive pronoun|
|RB|adverb|
|RBR|adverb, comparative|
|RBS|adverb, superlative|
|RP|particle|
|TO|"to" (infinitive marker or preposition)|
|UH|interjection|
|VB|verb, base form|
|VBD|verb, past tense|
|VBG|verb, gerund/present participle|
|VBN|verb, past participle|
|VBP|verb, non-3rd person sing. present|
|VBZ|verb, 3rd person sing. present|
|WDT|wh-determiner (which)|
|WP|wh-pronoun (who, what)|
|WP\$|possessive wh-pronoun (whose)|
|WRB|wh-adverb (where, when)|



