# Lesson 1 - Basics of language parsing

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Processing and analysis

NLP is a two-part problem: 
- Process the data from its original form (blocks of text or speech) into a form the computer can understand
- Conduct analysis on the processed data. 

Step one, which we will address in this assignment, may involve elements of data cleaning and feature extraction. When dealing with verbal information, this step is called _language parsing_.  Domain knowledge (information about word frequency, meaning, and grammar) is applied to raw text to extract features of interest.

We'll work with two different NLP packages: [NLTK](http://www.nltk.org/) and [spaCy](https://spacy.io).  

NLTK, or Natural Language ToolKit, is a seasoned package with great richness and depth. It is good for learning language parsing because it is _highly customizeable and transparent._ On the other hand, it also contains many older models and methods that are useful for teaching NLP but are _not optimal for production code_. 

spaCy is almost the direct opposite. Rather than offering language parsing options, spaCy just processes text data using whatever algorithms and methods are considered "state of the art". It is considerably _leaner,_ and because it is written in Cython (meaning Python code is translated into C and then run), it is considerably _faster_. On the other hand, we _lose the virtue of choice_, and if spaCy's algorithms change, our results could change as well.

In this lesson we'll use some of the text corpora (bodies of text) from NLTK, but will process them with spaCy. We'll also use regular expressions (the standard library package `re`) when we want to pull a very specific element of text out of a string, usually before passing the text to spaCy.

### Setup

Let's begin. First, if you haven't used NLTK yet, you'll want to install the package with `pip install nltk`. Then, run the cell below to launch an [interactive installer](http://www.nltk.org/data.html#interactive-installer). Using the installer, choose the "corpora" tab and download the "gutenberg" and "stopwords" corpora.

In [2]:
import nltk
# Launch the installer to download "gutenberg" and "stop words" corpora.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### Cleaning
We're going to work specifically with two novels from the project Gutenberg corpora: _Alice in Wonderland_ by Lewis Carroll, and _Persuasion_ by Jane Austin. 

In [3]:
# Import the data we just downloaded and installed.
from nltk.corpus import gutenberg, stopwords

# Grab and process the raw data.
print(gutenberg.fileids())

persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# Print the first 100 characters of Alice in Wonderland.
print('\nRaw:\n', alice[0:100])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Raw:
 [Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was


We're going to use _regular expressions_ (specifically [re.sub()](https://docs.python.org/3/library/re.html#re.sub), short for "substitute") to identify and remove substrings we don't want. Specifically we'll match those substrings with a regular expression and substitute in an empty string for them.

We won't go into detail here about how regular expressions work, but you should be able to get a good sense for what's happening by reading the code. If you want more information the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) is an accessible starting point and reference, and [RegExr](http://regexr.com/) is a useful tool for visualizing and tinkering with regular expressions.

We'll start our cleaning by removing the title. We'll match all text between square brackets and replace it with an empty string.

Regular expressions cheat sheet: https://www.debuggex.com/cheatsheet/regex/python

" " = string

\\ = escape character

\[ \\[ ] = Square bracket are parenthesis for escapte chracter and first bracket.

\\[ = Look for square brackets

. = look for any character other than newline

* = match any number of characters (wild card)

? = eh?

\[ \\] ] = end at the end bracket.


In [4]:
# This pattern matches all text between square brackets.
pattern = "[\[].*?[\]]"  ## WHY DOES THE PATTERN LOOK LIKE THIS?
persuasion = re.sub(pattern, "", persuasion)
alice = re.sub(pattern, "", alice)


# Now we'll match and remove chapter headings.
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)


# Remove newlines and other extra whitespace by splitting and rejoining.
persuasion = ' '.join(persuasion.split())
alice = ' '.join(alice.split())

# All done with cleanup? Let's see how it looks.
print('Extra whitespace removed:\n', alice[0:100])

Extra whitespace removed:
 Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to


## What information can we extract from text?

### Tokens

Each individual meaningful piece from a text is called a _token_, and the process of breaking up the text into these pieces is called _tokenization_. Tokens are generally words and punctuation. We may discard some tokens, such as punctuation, that we don't think add informational value. One class of potentially-uninformative tokens is _stop words_, words used very frequently that don't have much informational value, such as "the" and "of". Some NLP approaches discard stop words, while other approaches retain them because stop words can make up part of meaningful phrases ("master of the universe" being more specific and informative than "master" and "universe" alone).

In [5]:
# Here is a list of the stopwords identified by NLTK.
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Let's go ahead and use spaCy to parse our novels into tokens. When we call spaCy on the novel it will immediately and automatically parse it, tokenizing the string by breaking it into words and punctuation (and many other things we will explore).

Now is a good time to run `pip install spacy` in your terminal if you don't have it yet, then follow that up with `python -m spacy download 'en'` to download the individual spaCy English module.

In [6]:
import spacy
nlp = spacy.load('en')

# All the processing work is done here, so it may take a while.
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [7]:
# Let's explore the objects we've built.
print("The alice_doc object is a {} object.".format(type(alice_doc)))
print("It is {} tokens long".format(len(alice_doc)))
print("The first three tokens are '{}'".format(alice_doc[:3]))
print("The type of each token is {}".format(type(alice_doc[0])))

The alice_doc object is a <class 'spacy.tokens.doc.Doc'> object.
It is 34430 tokens long
The first three tokens are 'Alice was beginning'
The type of each token is <class 'spacy.tokens.token.Token'>


We see from introspecting the spaCy objects above that we're playing around with [doc](https://spacy.io/docs/api/doc) and [token](https://spacy.io/docs/api/token) objects. That's nice, but what can we _do_ with them?

A simple way to extract information from tokenized text data is to just count how often various tokens occur in each piece of text.

In [8]:
from collections import Counter

# Utility function to calculate how frequently words appear in the text.
def word_frequencies(text, include_stop=True):
    
    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)
    
# The most frequent words:
alice_freq = word_frequencies(alice_doc).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

Alice: [('the', 1524), ('and', 796), ('to', 724), ('a', 611), ('I', 534), ('it', 524), ('she', 508), ('of', 499), ('said', 453), ('Alice', 394)]
Persuasion: [('the', 3120), ('to', 2775), ('and', 2738), ('of', 2563), ('a', 1529), ('in', 1346), ('was', 1329), ('had', 1177), ('her', 1159), ('I', 1121)]


Those word counts aren't very informative. Most of them are stop words. Let's try again, leaving those out.

In [9]:
# Use our optional keyword argument to remove stop words.
alice_freq = word_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_freq = word_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('Alice:', alice_freq)
print('Persuasion:', persuasion_freq)

Alice: [('I', 534), ('said', 453), ('Alice', 394), ("n't", 215), ("'s", 190), ('little', 124), ('The', 102), ('like', 84), ('went', 83), ('know', 83)]
Persuasion: [('I', 1121), ('Anne', 497), ("'s", 485), ('She', 326), ('Captain', 297), ('Mrs', 291), ('Elliot', 288), ('Mr', 255), ('He', 225), ('Wentworth', 217)]


That's better. Now let's identify which words are more characteristic of one text than another. Specifically, let's remove the words that are in the top ten for both books.

In [10]:
# Pull out just the text from our frequency lists.
alice_common = [pair[0] for pair in alice_freq]
persuasion_common = [pair[0] for pair in persuasion_freq]

# Use sets to find the unique values in each top ten.
print('Unique to Alice:', set(alice_common) - set(persuasion_common))
print('Unique to Persuasion:', set(persuasion_common) - set(alice_common))

Unique to Alice: {"n't", 'said', 'little', 'The', 'Alice', 'like', 'went', 'know'}
Unique to Persuasion: {'Wentworth', 'Captain', 'Mrs', 'He', 'She', 'Elliot', 'Anne', 'Mr'}


In [11]:
alice_common

['I', 'said', 'Alice', "n't", "'s", 'little', 'The', 'like', 'went', 'know']

## Lemmas

So far we've just looked at whether certain words are present and how frequently they appear. We can process these words further to remove a little more noise from our data. Consider the words "think", "thought", and "thinking". They're related. They all share the same root word: the verb "think". Sometimes we want to focus on the fact that the act of thinking comes up a lot in data, and not have that information split across all the different forms of "think".

To focus in like this, we can reduce each word to its root, or [_lemma_](https://simple.wikipedia.org/wiki/Lemma_%28linguistics%29), and do our counts again. This time we're building a count of _concepts_ rather than just _words_:

In [18]:
# Utility function to calculate how frequently lemas appear in the text.
def lemma_frequencies(text, include_stop=True):
    
    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
alice_lemma_freq = lemma_frequencies(alice_doc, include_stop=False).most_common(10)
persuasion_lemma_freq = lemma_frequencies(persuasion_doc, include_stop=False).most_common(10)
print('\nAlice:', alice_lemma_freq)
print('Persuasion:', persuasion_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
alice_lemma_common = [pair[0] for pair in alice_lemma_freq]
persuasion_lemma_common = [pair[0] for pair in persuasion_lemma_freq]
print('Unique to Alice:', set(alice_lemma_common) - set(persuasion_lemma_common))
print('Unique to Persuasion:', set(persuasion_lemma_common) - set(alice_lemma_common))


Alice: [('-PRON-', 758), ('say', 476), ('alice', 396), ('be', 254), ('not', 231), ('go', 133), ('think', 131), ('little', 126), ('the', 109), ('look', 105)]
Persuasion: [('-PRON-', 2241), ('anne', 497), ("'s", 466), ('captain', 303), ('elliot', 295), ('mrs', 291), ('good', 289), ('know', 258), ('think', 256), ('mr', 255)]
Unique to Alice: {'the', 'alice', 'be', 'not', 'little', 'say', 'look', 'go'}
Unique to Persuasion: {'mrs', 'elliot', 'captain', 'mr', 'good', "'s", 'anne', 'know'}


In [19]:
token = "thought"
print(token.lemma_)

AttributeError: 'str' object has no attribute 'lemma_'

Oh. This isn't working because we have to use the token object.

In [22]:
print(alice_doc[2])
print(alice_doc[2].lemma_)

beginning
begin


cool.

In addition to looking at lemmas, we could perform a similar analysis and pull out prefixes (`token.prefix_`) or suffixes (`token.suffix_`).

In [23]:
print(alice_doc[2].suffix_)

ing


## Sentences

Beyond individual words, text can also be considered at the level of sentences. Using punctuation cues, we can split up text into sentences. Each sentence can then be summarized by, for example, using sentiment analysis to categorize sentences as having positive or negative sentiment. We may also be interested in how long sentences tend to be, and how many unique words make up a sentence.  The sentence also provides _context_ for the individual words, allowing us to draw even more information from each word.

We get a lot of automatic sentence-level information from spaCy. The `doc.sents` property will give us each sentence as a [span](https://spacy.io/docs/api/span) object. Let's look at some of that.

In [24]:
# Initial exploration of sentences.
sentences = list(alice_doc.sents)
print("Alice in Wonderland has {} sentences.".format(len(sentences)))

example_sentence = sentences[2]
print("Here is an example: \n{}\n".format(example_sentence))

Alice in Wonderland has 1678 sentences.
Here is an example: 
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!



In [27]:
for i in range(3):
    print(sentences[i])

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!


Wow Lewis Carrol writes long sentences.

In [28]:
# Look at some metrics around this sentence.
example_words = [token for token in example_sentence if not token.is_punct]
unique_words = set([token.text for token in example_words])

print(("There are {} words in this sentence, and {} of them are"
       " unique.").format(len(example_words), len(unique_words)))

There are 29 words in this sentence, and 25 of them are unique.


## Parts of speech, dependencies, entities
Tokens within each sentence are also coded with the parts of speech they play. This is useful for distinguishing between _homographs_, words with the same spelling but different meaning (the umbrella term for this kind of linguistic feature is _polysemy_).  For example, the word "break" is a noun in "I need a break" but a verb in "I need to break the glass".

In [29]:
print(nlp("I need a break")[3].pos_)
print(nlp("I need to break the glass")[3].pos_)

NOUN
VERB


In [30]:
# View the part of speech for some tokens in our sentence.
print('\nParts of speech:')
for token in example_sentence[:9]:
    print(token.orth_, token.pos_)


Parts of speech:
There ADV
was VERB
nothing NOUN
so ADV
VERY ADV
remarkable ADJ
in ADP
that DET
; PUNCT


You also get _dependencies_, or how words relate to each other syntatically, with spaCy. Dependencies are a bit complicated – for a visual example of dependencies expressed as a tree, check the [About section of the Standford NLP Group Dependencies page](https://nlp.stanford.edu/software/stanford-dependencies.shtml). Stanford's NLP Group has had a lot of influence in this field, so you're likely to run across them frequently if you go deep into NLP. We aren't going to cover this in depth here.

Let's look at some dependencies of this sentence.

https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean
^ Explanation of tags!

In [32]:
# View the dependencies for some tokens.
print('\nDependencies:\n')
for token in example_sentence[:9]:
    print(token.orth_, token.dep_, token.head.orth_)


Dependencies:

There expl was
was ROOT was
nothing attr was
so advmod remarkable
VERY advmod remarkable
remarkable amod nothing
in prep nothing
that pobj in
; punct was


Finally, spaCy gives us access to the named entities with `.ents`. In the example below you'll see some errors creep in – we can see that the entity identification rules in spaCy assume that, if it doesn't fall under any other obvious rule, any word or phrase IN ALL CAPS is an organization (if a noun) or an event (if a verb).

"was" is not dependent on anything.
so is an adverbial modifier of remarkable.
in is a preposition for nothing.

NOTE: Hey, they seemed to have fixed those bugs! The example below has less errors than the lesson.

In [33]:
# Extract the first ten entities.
entities = list(alice_doc.ents)[0:10]
for entity in entities:
    print(entity.label_, ' '.join(t.orth_ for t in entity))

PERSON Alice
DATE the hot day
PERSON Alice
PRODUCT Rabbit
PRODUCT Rabbit
PRODUCT WAISTCOAT - POCKET
PERSON Alice
PERSON Alice
PERSON Alice
ORDINAL First


In [34]:
# All of the uniqe entities spaCy thinks are people.
people = [entity.text for entity in list(alice_doc.ents) if entity.label_ == "PERSON"]
print(set(people))

{'Brandy', 'The Queen', "the King: '", 'Prizes', 'Stupid', 'Edgar Atheling', 'Stand', 'William', "Dinah'll", 'Elsie', 'Run', 'indeed:--', 'Jack', 'Bill', 'Fish-Footman', 'Duchess', 'Sixteenth', 'Hush', 'Fury', 'this:--', 'Longitude', 'INSIDE', 'Sha', 'the Lobster Quadrille', 'Hjckrrh', 'Fetch', 'Curiouser', 'Edwin', "the Mock Turtle: '", 'Pinch', 'YOURS', 'Turn', 'The Fish-Footman', 'Ma', 'Drink', 'the King', 'Queen', 'Said', 'M--', 'Beau', 'WILLIAM', 'the March Hare', 'Crab', 'Mock Turtle', 'Footman', 'Serpent', 'FUL SOUP', 'Tut', 'Repeat', 'm--', 'Stolen', 'Turtle Soup', 'Soup of the evening', 'Begin', "Don't", 'the White Rabbit', 'Adventures', 'Shy', 'Shall', 'Rabbit', 'HAD', 'Cheshire Puss', 'Ou', 'began:--', '--or', 'William the Conqueror', 'Soo', 'Morcar', 'Pat', 'Latitude', 'Soles', 'a Lobster Quadrille', "the Duchess: '", 'Idiot', 'Seaography', 'Sentence', 'Kings', 'The White Rabbit', 'Duck', 'Mabel', 'Canary', 'Lacie', 'the Duchess', 'The Mock Turtle', 'the Queen of Hearts', '

## The Pen is Mightier

Hopefully at this point it's clear that we have the ability to programmatically extract _a lot_ of information from text. Like any other feature extraction problem, let ingenuity be your guide. In the next assignment, we'll use this information to build a few supervised learning models.

# Lesson 4.2 - As supervised problem

Supervised NLP requires a pre-labelled dataset for training and testing, and is generally interested in categorizing text in various ways. In this case, we are going to try to predict whether a sentence comes from Alice in Wonderland by Lewis Carroll or Persuasion by Jane Austen. We can use any of the supervised models we've covered previously, as long as they allow categorical outcomes. In this case, we'll try Random Forests, SVM, and KNN.

Our feature-generation approach will be something called BoW, or Bag of Words. BoW is quite simple: For each sentence, we count how many times each word appears. We will then use those counts as features.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter


In [2]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)


In [3]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [4]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


Time to bag some words!  Since spaCy has already tokenized and labelled our data, we can move directly to recording how often various words occur.  We will exclude stopwords and punctuation.  In addition, in an attempt to keep our feature space from exploding, we will work with lemmas (root words) rather than the raw text terms, and we'll only use the 2000 most common words for each text.

## Go over this code

In [7]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.  This creates a big bag of words
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common 2000 words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
# Each sentence is a row and each column is a count of words that appear in that row.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    
    # Add column of the sentence
    df['text_sentence'] = sentences[0]
    
    # Add column for source, Carroll or Austen
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # This creates a temporary list called words that stores tokens in the 2000 common words list 
        # after filtering out punc, stop workds, and lemmatizing them.
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Using the temp work list, count up the common words
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 10 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [4]:
# Create our data frame with features. This can take a while to run.
#word_counts = bow_features(sentences, common_words)
# ^This took an entire day to run...

# Here is the code to upload the csv instead
word_count = pd.read_csv('caroll_austen_wordcounts.csv')
word_count.head()

Unnamed: 0,shoe,contrast,purple,darkness,baby,king,fire,trample,remind,urge,...,flatter,open,directly,wretched,shrink,dig,end,advantage,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Alice was beginning to get very tired of sitti...,Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,So she was considering in her own mind (as wel...,Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,There was nothing so VERY remarkable in that; ...,Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Oh dear!,Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,I shall be late!',Carroll


In [9]:
word_counts.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5318 entries, 0 to 5317
Columns: 3064 entries, shoe to text_source
dtypes: object(3064)
memory usage: 124.3+ MB


In [15]:
# Exporting to csv so I never have to run that again.
# word_counts.to_csv('caroll_austen_wordcounts.csv', index=False)


## Trying out BoW

Now let's give the bag of words features a whirl by trying a random forest.

In [19]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], 1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9893416927899686

Test set score: 0.8919172932330827


Overfitting is a known problem when using bag of words, since it basically involves throwing a massive number of features at a model – some of those features (in this case, word frequencies) will capture noise in the training set. Since overfitting is also a known problem with Random Forests, the divergence between training score and test score is expected.

## BoW with Logistic Regression

Let's try a technique with some protection against overfitting due to extraneous features – logistic regression with ridge regularization (from ridge regression, also called L2 regularization).

In [20]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(3190, 3062) (3190,)
Training set score: 0.9579937304075236

Test set score: 0.9158834586466166


Logistic regression performs a bit better than the random forest.  

# BoW with Gradient Boosting

And finally, let's see what gradient boosting can do:

In [21]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.8846394984326019

Test set score: 0.8735902255639098


Looks like logistic regression is the winner, but there's room for improvement.

# Same model, new inputs

What if we feed the model a different novel by Jane Austen, like _Emma_?  Will it be able to distinguish Austen from Carroll with the same level of accuracy if we insert a different sample of Austen's writing?

First, we need to process _Emma_ the same way we processed the other data, and combine it with the Alice data:

In [22]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma)
print(emma[:100])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [23]:
# Parse our cleaned data.
emma_doc = nlp(emma)

In [36]:
# Group into sentences.
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

# Emma is quite long, let's cut it down to 500 sentences.
emma_sents = emma_sents[0:len(alice_sents)]

In [37]:
len(emma_sents)

1669

In [40]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
#emma_sentences = pd.DataFrame(emma_sents)
#emma_bow = bow_features(emma_sentences, common_words)

# print('done')

# emma_bow.to_csv('emma_bow.csv', index=False)

emma_bow = pd.read_csv('emma_bow.csv')

In [41]:
emma_bow.head()

Unnamed: 0,shoe,contrast,purple,darkness,baby,king,fire,trample,remind,urge,...,flatter,open,directly,wretched,shrink,dig,end,advantage,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"Emma Woodhouse, handsome, clever, and rich, wi...",Austen
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,She was the youngest of the two daughters of a...,Austen
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Her mother had died too long ago for her to ha...,Austen
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Sixteen years had Miss Taylor been in Mr. Wood...,Austen
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,Between _them,Austen


In [42]:
# Now we can model it!
# Let's use logistic regression again.

# Combine the Emma sentence data with the Alice data from the test set.

# Combine X_Train data that share the same index as y_train labelled Carroll
# With Emma data, dropping the y train data.
X_Emma_test = np.concatenate((
    X_train[y_train[y_train=='Carroll'].index], 
    emma_bow.drop(['text_sentence','text_source'], 1)
), axis=0)

# Combine Carroll y_train data with a column that says "Austen" for the number of emma sentences
y_Emma_test = pd.concat([y_train[y_train=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])

# Model.
print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)


Test set score: 0.6976137211036539


col_0,Austen,Carroll
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
Austen,1564,105
Carroll,706,307


## ^ Is this good?

Well look at that!  NLP approaches are generally effective on the same type of material as they were trained on. It looks like this model is actually able to differentiate multiple works by Austen from Alice in Wonderland.  Now the question is whether the model is very good at identifying Austen, or very good at identifying Alice in Wonderland, or both...

# Challenge 0:

Recall that the logistic regression model's best performance on the test set was 93%.  See what you can do to improve performance.  Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires.  Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 90%. 

In [28]:
import spacy
nlp = spacy.load('en')

# Since I uploaded the document onto a csv, I need to nlp it.
for i in range(0,len(word_count['text_sentence'])):
    word_count['text_sentence'][i] = nlp(word_count['text_sentence'][i])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


In [29]:
word_count['text_sentence'][0]

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'

In [30]:
type(word_count['text_sentence'][0])

spacy.tokens.doc.Doc

In [39]:
# Carroll seems to like longer sentences.  Let's make a sentence word count feature.

sent_leng = []

for r in range(0,len(word_count['text_sentence'])):
    sent_leng.append(len(word_count['text_sentence'][r]))

word_count['sent_leng'] = sent_leng    
    

In [40]:
word_count.head()

Unnamed: 0,shoe,contrast,purple,darkness,baby,king,fire,trample,remind,urge,...,open,directly,wretched,shrink,dig,end,advantage,text_sentence,text_source,sent_leng
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll,67
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll,63
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll,33
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll,3
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll,6


In [36]:
len(word_count['text_sentence'][0])

67

In [48]:
punct_count = []

for r in range(0,len(word_count['text_sentence'])):
    punct = 0
    for word in word_count['text_sentence'][r]:
        if word.is_punct:
            punct =+ 1
        punct_count.append(punct)
        return punct_count
    
word_count['punct_count'] = punct_count 
    

SyntaxError: 'return' outside function (<ipython-input-48-daf15c4257d3>, line 9)

In [45]:
word_count.head(30)

Unnamed: 0,shoe,contrast,purple,darkness,baby,king,fire,trample,remind,urge,...,directly,wretched,shrink,dig,end,advantage,text_sentence,text_source,sent_leng,punct_count
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll,67,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll,63,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll,33,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(Oh, dear, !)",Carroll,3,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll,6,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"((, when, she, thought, it, over, afterwards, ...",Carroll,126,1
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(In, another, moment, down, went, Alice, after...",Carroll,23,1
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(The, rabbit, -, hole, went, straight, on, lik...",Carroll,44,1
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(Either, the, well, was, very, deep, ,, or, sh...",Carroll,37,1
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,"(First, ,, she, tried, to, look, down, and, ma...",Carroll,49,1
