We'll work with two different NLP packages: NLTK and spaCy. NLTK is good for learning language parsing because it is highly customizable and transparent. On the other hand, it also contains many older models and methods that are useful for teaching NLP but are not optimal for production code. spaCy is almost the direct opposite. Rather than offering a choice of language parsing options, spaCy just processes text data using whatever algorithms and methods are considered "state of the art". It is considerably leaner, and because it is written in Cython (a Python-like language that is compiled to C), it is considerably faster. On the other hand, we lose the virtue of choice, and if spaCy's algorithms change, our results could change as well.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re

In [2]:
import nltk
# Launch the installer to download the "gutenberg" and "stopwords" corpora.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
# Import the data we just downloaded and installed.
from nltk.corpus import gutenberg, stopwords

# Grab and process the raw data.
print(gutenberg.fileids())

[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']


In [4]:
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# Print the first 100 characters of Alice in Wonderland.
print('\nRaw:\n', alice[0:100])

('\nRaw:\n', u"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was")


In [5]:
#result = re.sub('abc',  '',    input)           # Delete pattern abc
#result = re.sub('abc',  'def', input)           # Replace pattern abc -> def

# This pattern matches a pair of square brackets and all text between them
# (non-greedy), so that re.sub can remove it.
pattern = r"\[.*?\]"
#pattern2 = "\n"

persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)
#alice = re.sub(pattern2, "", alice)


# Print the first 100 characters of Alice again.
print('Title removed:\n', alice[0:100])

('Title removed:\n', u'\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on')


In [6]:
# Now we'll match and remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# Ok, what's it look like now?
print('Chapter headings removed:\n', alice[0:100])

Chapter headings removed:




Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothin


In [7]:
# Remove newlines and other extra whitespace by splitting and rejoining.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# All done with cleanup? Let's see how it looks.
print('Extra whitespace removed:\n', alice[0:100])

Extra whitespace removed:
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to
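
The cleanup steps above can be gathered into one helper. This is only a sketch: the function name `clean_gutenberg_text` and its `chapter_pattern` parameter are made up for illustration, and the sample string is the opening of Alice printed earlier, extended by a few words.

```python
import re

def clean_gutenberg_text(raw, chapter_pattern):
    """Strip bracketed text, chapter headings, and extra whitespace."""
    # Remove anything enclosed in square brackets (e.g. the title line).
    text = re.sub(r'\[.*?\]', '', raw)
    # Remove chapter headings; the pattern differs from book to book.
    text = re.sub(chapter_pattern, '', text)
    # Collapse newlines and runs of whitespace into single spaces.
    return ' '.join(text.split())

sample = ("[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
          "CHAPTER I. Down the Rabbit-Hole\n\n"
          "Alice was beginning to get very tired")
print(clean_gutenberg_text(sample, r'CHAPTER .*'))
```

Passing the chapter pattern as an argument keeps the per-book difference (`Chapter \d+` for Persuasion, `CHAPTER .*` for Alice) visible at the call site.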


What information can we extract from text?

Tokens

Each individual meaningful piece of a text is called a token, and the process of breaking up the text into these pieces is called tokenization. Tokens are generally words and punctuation. We may discard some tokens, such as punctuation, that we don't think add informational value. One class of potentially uninformative tokens is stop words: words used very frequently that don't carry much information, such as "the" and "of". Some NLP approaches discard stop words, while other approaches retain them because stop words can make up part of meaningful phrases ("master of the universe" being more specific and informative than "master" and "universe" alone).
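
To make the idea concrete before bringing in a library, here is a minimal sketch of tokenization and stop word removal using only a regular expression. The `STOP_WORDS` set and both function names are hypothetical and far cruder than what NLTK or spaCy provide.

```python
import re

# A tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "to"}

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def drop_stop_words(tokens):
    """Keep alphanumeric tokens that are not in STOP_WORDS (case-insensitive)."""
    return [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]

tokens = simple_tokenize("He is the master of the universe.")
print(tokens)
print(drop_stop_words(tokens))
```

Note how dropping "of" and "the" leaves "master" and "universe" as separate words, which is exactly the loss of phrase-level meaning described above.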

In [8]:
# Here is a list of the stopwords identified by NLTK.
print(stopwords.words('english'))

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u"you're", u"you've", u"you'll", u"you'd", u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'eac

Let's go ahead and use spaCy to parse our novels into tokens. When we call spaCy on the novel it will immediately and automatically parse it, tokenizing the string by breaking it into words and punctuation (and many other things we will explore).

In [9]:
import spacy
nlp = spacy.load('en')

# All the processing work is done here, so it may take a while.
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [10]:
# Let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

The alice_doc object is a <type 'spacy.tokens.doc.Doc'> object.
It is 34408 tokens long
The first three tokens are 'Alice was beginning'
The type of each token is <type 'spacy.tokens.token.Token'>


In [11]:
from collections import Counter

# Utility function to calculate how frequently words appear in the text.
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)
    
# The most frequent words:
alice_freq = word_frequencies(alice_doc).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

('Alice:', [(u'the', 1524), (u'and', 796), (u'to', 724), (u'a', 611), (u'I', 533), (u'it', 524), (u'she', 508), (u'of', 499), (u'said', 453), (u'Alice', 394)])
('Persuasion:', [(u'the', 3120), (u'to', 2775), (u'and', 2738), (u'of', 2563), (u'a', 1529), (u'in', 1346), (u'was', 1329), (u'had', 1177), (u'her', 1159), (u'I', 1118)])


In [12]:
# Use our optional keyword argument to remove stop words.
alice_freq = word_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

('Alice:', [(u'said', 453), (u'Alice', 394), (u'little', 124), (u'like', 84), (u'went', 83), (u'know', 83), (u'thought', 74), (u'Queen', 73), (u'time', 68), (u'King', 61)])
('Persuasion:', [(u'Anne', 496), (u'Captain', 297), (u'Mrs', 291), (u'Elliot', 288), (u'Mr', 254), (u'Wentworth', 217), (u'Lady', 191), (u'good', 181), (u'little', 175), (u'Charles', 166)])


Next, we look at the popular words that appear in one book's top ten but not the other's, to see what is characteristic of each book.

In [13]:
# Pull out just the text from our frequency lists.
alice_common = [pair[0] for pair in alice_freq]
persuasion_common = [pair[0] for pair in persuasion_freq]

# Use sets to find the unique values in each top ten.
print('Unique to Alice:', set(alice_common) - set(persuasion_common))
print('Unique to Persuasion:', set(persuasion_common) - set(alice_common))

('Unique to Alice:', set([u'King', u'said', u'like', u'Queen', u'Alice', u'thought', u'know', u'time', u'went']))
('Unique to Persuasion:', set([u'good', u'Elliot', u'Charles', u'Mrs', u'Mr', u'Anne', u'Captain', u'Lady', u'Wentworth']))


Lemmas

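The words "think", "thought", and "thinking" share a common root, "think". That root form is called the lemma, and counting lemmas instead of raw word forms groups these variants together.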
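
As a small illustration of why this matters for word counts, the sketch below compares counting raw words with counting lemmas. The `TOY_LEMMAS` table and the `toy_lemma` function are hypothetical stand-ins, not part of spaCy or NLTK.

```python
from collections import Counter

# Hypothetical lookup table for illustration only; it is far smaller than
# anything a real lemmatizer would use.
TOY_LEMMAS = {"thought": "think", "thinking": "think", "thinks": "think",
              "went": "go", "going": "go"}

def toy_lemma(word):
    """Return the lemma for a word, falling back to the lowercased word."""
    return TOY_LEMMAS.get(word.lower(), word.lower())

words = ["think", "thought", "thinking", "went", "go"]

# Counting raw words treats every inflected form as a separate item.
print(Counter(words))

# Counting lemmas merges the inflected forms under their shared root.
print(Counter(toy_lemma(w) for w in words))
```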

In [18]:
# Utility function to calculate how frequently lemmas (or, optionally,
# prefixes or suffixes) appear in the text.
def lemma_frequencies(text, include_stop=True, token='Lemma'):

    # Map each supported mode to the spaCy token attribute to count.
    attributes = {'Lemma': 'lemma_', 'prefix': 'prefix_', 'suffix': 'suffix_'}

    # Build a list of lemmas (or prefixes/suffixes).
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    if token in attributes:
        # Use a separate loop variable so the `token` argument is not shadowed.
        for tok in text:
            if not tok.is_punct and (not tok.is_stop or include_stop):
                lemmas.append(getattr(tok, attributes[token]))

    # Build and return a Counter object containing the counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
alice_lemma_freq = lemma_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('Lemmas: \n')


print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))
print('\n')

# Repeat the comparison, this time counting prefixes instead of lemmas.
alice_lemma_freq = lemma_frequencies(alice_doc, include_stop=False, token='prefix').most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc, include_stop=False, token='prefix').most_common(10)
print('Prefix: \n')


print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)

# Identify the prefixes common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))
print('\n')


Lemmas: 


Alice: [(u'say', 477), (u'Alice', 394), (u'think', 130), (u'go', 130), (u'little', 125), (u'look', 106), (u'know', 103), (u'come', 96), (u'like', 92), (u'begin', 91)]
Persuasion: [(u'Anne', 493), (u'Captain', 294), (u'Mrs', 291), (u'Elliot', 288), (u'think', 256), (u'know', 255), (u'Mr', 254), (u'good', 224), (u'Wentworth', 215), (u'say', 191)]
Unique to Alice: set([u'begin', u'look', u'little', u'Alice', u'go', u'come', u'like'])
Unique to Persuasion: set([u'good', u'Elliot', u'Mrs', u'Mr', u'Anne', u'Captain', u'Wentworth'])


Prefix: 


Alice: [(u's', 1378), (u't', 842), (u'l', 655), (u'c', 630), (u'w', 498), (u'h', 453), (u'f', 437), (u'A', 417), (u'g', 407), (u'b', 405)]
Persuasion: [(u's', 3139), (u'c', 2273), (u'p', 1857), (u'a', 1784), (u'f', 1555), (u'h', 1420), (u'l', 1403), (u't', 1391), (u'd', 1384), (u'r', 1348)]
Unique to Alice: set([u'A', u'b', u'w', u'g'])
Unique to Persuasion: set([u'a', u'p', u'r', u'd'])




## Sentences

In [20]:
# Initial exploration of sentences.
sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

Alice in Wonderland has 1727 sentences.
Here is an example: 
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, '



In [21]:
# Look at some metrics around this sentence.
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 27 words in this sentence, and 23 of them are unique.


## Parts of speech, dependencies, entities
Tokens within each sentence are also tagged with the part of speech they play. This is useful for distinguishing between _homographs_, words with the same spelling but different meanings (a closely related notion is _polysemy_, where a single word has several related meanings). For example, the word "break" is a noun in "I need a break" but a verb in "I need to break the glass".

In [27]:
print(nlp(u"I need a break")[3].pos_)
print(nlp(u"I need to break the glass")[3].pos_)

NOUN
VERB


In [28]:
# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in example_sentence[:9]:
    print(token.orth_, token.pos_)


Parts of speech:
(u'There', u'ADV')
(u'was', u'VERB')
(u'nothing', u'NOUN')
(u'so', u'ADV')
(u'VERY', u'ADV')
(u'remarkable', u'ADJ')
(u'in', u'ADP')
(u'that', u'DET')
(u';', u'PUNCT')


## Dependencies
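Each token is also linked to a head token by a labeled dependency relation. For background on dependency labels, see https://nlp.stanford.edu/software/stanford-dependencies.shtml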

In [29]:
# View the dependencies for some tokens.
print('\nDependencies:')
for token in example_sentence[:9]:
    print(token.orth_, token.dep_, token.head.orth_)


Dependencies:
(u'There', u'expl', u'was')
(u'was', u'ROOT', u'was')
(u'nothing', u'attr', u'was')
(u'so', u'advmod', u'VERY')
(u'VERY', u'advmod', u'remarkable')
(u'remarkable', u'amod', u'nothing')
(u'in', u'prep', u'remarkable')
(u'that', u'pobj', u'in')
(u';', u'punct', u'was')
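
The (token, label, head) triples printed above describe a tree rooted at "was". As a rough sketch, the snippet below rebuilds that tree from the triples alone, without spaCy. It assumes each word appears only once in the fragment, which holds here; with real documents one would key on token positions instead of text. The names `children` and `tree_lines` are illustrative.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
triples = [
    ("There", "expl", "was"), ("was", "ROOT", "was"), ("nothing", "attr", "was"),
    ("so", "advmod", "VERY"), ("VERY", "advmod", "remarkable"),
    ("remarkable", "amod", "nothing"), ("in", "prep", "remarkable"),
    ("that", "pobj", "in"), (";", "punct", "was"),
]

# Map each head to its direct dependents; the root token is its own head.
children = defaultdict(list)
root = None
for word, dep, head in triples:
    if dep == "ROOT":
        root = word
    else:
        children[head].append((word, dep))

def tree_lines(word, label="ROOT", depth=0):
    """Return indented lines for a word and everything that depends on it."""
    lines = ["  " * depth + "{} ({})".format(word, label)]
    for child, dep in children.get(word, []):
        lines.extend(tree_lines(child, dep, depth + 1))
    return lines

print("\n".join(tree_lines(root)))
```

Reading the indented output from the top shows, for example, that "so" modifies "VERY", which modifies "remarkable", which in turn attaches to "nothing".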


In [30]:
# Extract the first ten entities with the .ents attribute.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

(u'PERSON', u'Alice')
(u'DATE', u'the hot day')
(u'PERSON', u'Alice')
(u'PERSON', u'Rabbit')
(u'PERSON', u'Rabbit')
(u'PERSON', u'Alice')
(u'PERSON', u'Alice')
(u'PERSON', u'Alice')
(u'ORDINAL', u'First')
(u'CARDINAL', u'one')


In [31]:
# All of the unique entities spaCy thinks are people.
people = [entity.text for entity in alice_doc.ents if entity.label_ == "PERSON"]
print(set(people))

set([u'Treacle', u'the Knave of Hearts', u'William the Conqueror.', u'Ou', u'William the Conqueror', u"W. RABBIT'", u'Brandy', u'Sing', u'a Lobster Quadrille', u'Tortoise', u'Shy', u'Stretching', u'Alice', u'Longitude', u'Pat', u'Canary', u'Duchess', u'Lobster', u"I'M", u'Mouse', u'Seaography', u'Latin Grammar', u'Shakespeare', u'Latitude', u'Run', u'Cheshire Puss', u'William', u'Game', u'Rabbit', u'Tortoise--', u'Rule Forty-two', u'King', u'Dinah', u'Ma', u'Bill', u'Cheshire', u'Elsie', u'Knave', u'FOOT', u'Turtle', u'Lacie', u'Hare', u'Queen', u'Hush', u'The Knave of Hearts', u'Said', u"the King: '", u'Panther', u'Magpie', u'Ada', u"Alice)--'and", u'Soles', u'HER', u'a Cheshire Cat,', u'Behead', u'Curiouser', u'Dinn', u'Boots', u'Down', u'Duck', u'Off--', u'Lory', u'Twinkle', u'Edwin', u'Normans--', u'Tis', u'Footman', u'Mercia', u'Swim', u'Tillie', u'Begin', u'Beau', u'FATHER WILLIAM', u'Mary Ann', u'Jack', u'Off', u'VERY', u'ALICE', u'Gryphon', u'Soup', u'Lizard', u'Owl', u'Majesty