# Supervised NLP

In this exercise, we will attempt to predict whether a sentence comes from _Alice in Wonderland_ by Lewis Carroll or _Persuasion_ by Jane Austen. We'll use NLP to create the input features for the model, then use some run-of-the-mill classifiers for the prediction.

The main feature generation method that we'll use is **Bag of Words**. It simply counts how many times each word appears in each sentence.
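As a minimal sketch of the idea, assuming naive whitespace tokenization rather than the spaCy lemmas used later, the counts for two toy sentences could be built like this. The sentences and the `bag` helper are illustrative and not part of the notebook.

```python
from collections import Counter

def bag(sentence):
    # Lowercase, strip surrounding punctuation, and count each remaining word.
    words = [w.strip('.,!?;:').lower() for w in sentence.split()]
    return Counter(w for w in words if w)

sentences = ["Oh dear! Oh dear!", "Alice was very tired."]
bags = [bag(s) for s in sentences]

# The shared vocabulary becomes the feature columns; each sentence is one row.
vocab = sorted(set().union(*bags))
matrix = [[b[w] for w in vocab] for b in bags]

print(vocab)
print(matrix)
```

Each row of `matrix` is one sentence and each column is one vocabulary word, which is the same layout the feature DataFrame takes later on.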

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove bracketed text such as the title line at the top of each file.
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])
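To see what the cleaning steps do, here is a small self-contained check on a made-up string. The `sample` text is an assumption for illustration; the function body mirrors `text_cleaner` above.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)        # double dash becomes a space
    text = re.sub(r'\[.*?\]', '', text)    # drop bracketed text on a single line
    return ' '.join(text.split())          # collapse all runs of whitespace

sample = "[Sample Title 1865]\n\nAlice was--tired.\n\nVery   tired."
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that `.*?` does not cross line breaks, so only brackets that open and close on the same line are removed.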

In [3]:
# Parse the cleaned novels. This can take a bit.
# Note: spaCy 3+ removed the 'en' shortcut; there, load 'en_core_web_sm' instead.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [4]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(Oh, dear, !)",Carroll


We now have the texts cleaned, processed by spaCy, and combined into a single DataFrame, so we can bag the words. We'll use our own functions to perform the bagging, though scikit-learn's `CountVectorizer` class can do this too. Our focus will be on the 2000 most common lemmas in each book.
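For comparison, a rough sketch of the `CountVectorizer` route on a few toy strings might look like the following. The `docs` list is made up for illustration, and unlike the functions below this version counts raw lowercase tokens, not spaCy lemmas.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for sentences (illustrative only).
docs = [
    "Alice was beginning to get very tired",
    "Oh dear! Oh dear!",
    "Sir Walter Elliot was a man of vanity",
]

# max_features keeps only the most frequent terms, similar to our 2000-word cap.
vectorizer = CountVectorizer(max_features=2000)
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

dear_col = vectorizer.vocabulary_["dear"]
print(counts.shape)
print(counts[1, dear_col])
```

The result is a sparse matrix, which is usually far faster to build than filling a DataFrame cell by cell.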

In [5]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

In [6]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400


Unnamed: 0,pair,living,dispose,doorway,March,office,express,recommendation,sight,delight,...,New,neighbour,esteem,bit,month,design,successive,Gloucester,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll


In [7]:
word_counts[word_counts['neglect'] > 0]

Unnamed: 0,pair,living,dispose,doorway,March,office,express,recommendation,sight,delight,...,New,neighbour,esteem,bit,month,design,successive,Gloucester,text_sentence,text_source
309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(You, need, not, be, afraid, ,, Miss, Elliot, ...",Austen


The BoW feature creation is complete.  Let's feed these into some models.  Once again, we're trying to predict which book a particular sentence came from.

## Random Forest

In [8]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier(n_estimators=100)
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.9849624060150376

Test set score: 0.8539325842696629


The model is clearly overfit: it scores about 98% on the training set but only about 85% on the test set. How do other models compare?

## Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs', penalty='l2')
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(266, 1612) (266,)
Training set score: 0.9624060150375939

Test set score: 0.8707865168539326


That's a bit better than the random forest on the test set, but the model is still overfit. How about gradient boosting?

## Gradient Boosted Trees

In [10]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9661654135338346

Test set score: 0.8202247191011236


Boosted trees seem to be the most overfit model, with the largest gap between training and test scores. Besides evaluating the models on training/test scores, another way to test their efficacy and generalization is to use a completely new work from a given author.
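One simple way to compare how overfit the three models are is the gap between training and test accuracy. The sketch below just reuses the scores printed above; the dictionary layout is an illustrative choice.

```python
# (training score, test score) as reported by the three cells above.
scores = {
    "random forest": (0.9849624060150376, 0.8539325842696629),
    "logistic regression": (0.9624060150375939, 0.8707865168539326),
    "gradient boosting": (0.9661654135338346, 0.8202247191011236),
}

# A larger train-minus-test gap suggests more overfitting.
gaps = {name: train - test for name, (train, test) in scores.items()}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name}: gap = {gap:.3f}")

most_overfit = max(gaps, key=gaps.get)
print("Largest gap:", most_overfit)
```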

# Same model, new book

Let's see if the model can differentiate between _Alice in Wonderland_ and a Jane Austen book it has not been trained on, such as _Emma_.

In [11]:
# Loading and cleaning the text
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma[:(len(emma)//60)]) # Comparable length to Alice
print(emma[:100])


Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [12]:
# Parse the text with spaCy
emma_doc = nlp(emma)

# Group into sentences
emma_sents = [[sent, 'Austen'] for sent in emma_doc.sents]

# Bag of Words
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

Processing row 0
Processing row 50
Processing row 100
Processing row 150


In [13]:
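# Replacing the Persuasion data in the train test split with Emma data.
# Caution: y_train's index holds the original row labels, while X_train is a
# positional array, so this lookup is not guaranteed to select the Carroll rows.
X_emma_test = np.concatenate((
    X_train[y_train[y_train == 'Carroll'].index], 
    emma_bow.drop(['text_sentence', 'text_source'], axis=1)), axis=0)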
y_emma_test = pd.concat([y_train[y_train=='Carroll'],
                        pd.Series(['Austen'] * emma_bow.shape[0])])
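The cell above indexes the NumPy array `X_train` with the pandas index of `y_train`. After `train_test_split` shuffles the rows, that index still carries the original row labels, which no longer match positions in the array. A toy sketch of the difference, with made-up data in which a feature value of 1 marks a Carroll row:

```python
import numpy as np
import pandas as pd

# y keeps its original labels after shuffling; X is a plain positional array.
y = pd.Series(['Carroll', 'Austen', 'Carroll', 'Austen'], index=[3, 0, 1, 2])
X = np.array([[1], [0], [1], [0]])  # row k of X belongs to the k-th entry of y

# Label-based lookup: treats labels 3 and 1 as positions and picks Austen rows.
by_label = X[y[y == 'Carroll'].index]

# Boolean mask: selects by position and picks the actual Carroll rows.
by_mask = X[(y == 'Carroll').to_numpy()]

print(by_label.ravel(), by_mask.ravel())
```

A boolean mask such as `(y_train == 'Carroll').to_numpy()` is one way to select rows by position. Switching to it would change the scores reported below, so treat it as a suggestion and not as what the notebook ran.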

## Logistic Model

In [14]:
lr_emma_pred = lr.predict(X_emma_test)

print('Test score: ', lr.score(X_emma_test, y_emma_test))
pd.crosstab(y_emma_test, lr_emma_pred)

Test score:  0.7195121951219512


| Actual (rows) / Predicted (columns) | Austen | Carroll |
|---|---|---|
| Austen | 158 | 12 |
| Carroll | 57 | 19 |


So the model generalizes a bit: it labels 158 of the 170 _Emma_ sentences, from an Austen work it has not seen before, as Austen. However, only 19 of the 76 rows labelled Carroll are predicted as Carroll, so we can't yet tell whether the model is good at identifying Austen's works, _Alice_, or both.
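Per-class recall makes this easier to see than overall accuracy. The sketch below recomputes both from the crosstab counts reported above; the nested-dictionary layout is just an illustrative choice.

```python
# Rows are actual labels, inner keys are predicted labels (counts from the crosstab).
confusion = {
    'Austen': {'Austen': 158, 'Carroll': 12},
    'Carroll': {'Austen': 57, 'Carroll': 19},
}

# Recall: the share of each actual class that was predicted correctly.
recall = {actual: row[actual] / sum(row.values()) for actual, row in confusion.items()}

total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total

print(recall)
print(round(accuracy, 4))
```

The overall accuracy matches the printed test score, while the two recall values differ sharply between the classes.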

# Challenge 1:

The original logistic regression scored 87% on the test set.  Let's add additional features and try to make it better.

In [15]:
def new_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df['sent_length'] = np.zeros(sentences[0].shape)
    df['punct_length'] = np.zeros(sentences[0].shape)
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        for word in words:
            df.loc[i, word] += 1
            
        # Count the number of Parts of Speech in each sentence
        pos = [token.pos_ 
               for token in sentence 
               if (
                   not token.is_punct 
                   and not token.is_stop 
                   and token.lemma_ in common_words)]
        parts, counts = np.unique(np.array(pos), return_counts=True)
        for j, p in enumerate(parts):
            if p not in df.columns:
                df[p] = np.zeros(sentences[0].shape)
            df.loc[i, p] += counts[j]
        
        # Count the number of words in each sentence
        sent_len = [token.lemma_ 
                         for token in sentence 
                         if not token.is_punct]
        df.loc[i, "sent_length"] = len(sent_len)
        
        # Count the number of punctuation marks in each sentence
        punct_len = [token for token in sentence if token.is_punct]
        df.loc[i, 'punct_length'] = len(punct_len)
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df


In [16]:
new_word_counts = new_features(sentences, common_words)
new_word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400


Unnamed: 0,pair,living,dispose,doorway,March,office,express,recommendation,sight,delight,...,NOUN,PROPN,VERB,INTJ,ADP,NUM,AUX,PART,DET,PUNCT
0,0,0,0,0,0,0,0,0,0,0,...,10.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0,0,0,0,0,0,0,0,0,...,8.0,2.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0,0,0,0,0,0,0,0,0,...,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, in addition to the common words, we have sentence length, punctuation count, and per-sentence part-of-speech counts.
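As a rough illustration of the two surface features, here is a stdlib-only sketch. It assumes whitespace tokenization and `string.punctuation` in place of spaCy's `is_punct`, so its counts can differ from the notebook's; the `surface_features` helper is hypothetical.

```python
import string

def surface_features(sentence):
    # Words: whitespace tokens that still contain something after stripping punctuation.
    words = [t.strip(string.punctuation) for t in sentence.split()]
    return {
        'sent_length': sum(1 for w in words if w),
        # Punctuation: every ASCII punctuation character in the sentence.
        'punct_length': sum(1 for ch in sentence if ch in string.punctuation),
    }

features = surface_features("Oh dear! Oh dear!")
print(features)
```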

In [17]:
# New feature splits
y_new = new_word_counts['text_source']
X_new = np.array(new_word_counts.drop(['text_sentence','text_source'], axis=1))

X_new_train, X_new_test, y_new_train, y_new_test = train_test_split(X_new, 
                                                    y_new,
                                                    test_size=0.4,
                                                    random_state=0)

In [18]:
# New logistic regression model

lr_new = LogisticRegression(C=20, solver='lbfgs', penalty='l2', max_iter=1000000)
train = lr_new.fit(X_new_train, y_new_train)

tr_score = lr_new.score(X_new_train, y_new_train)
ts_score = lr_new.score(X_new_test, y_new_test)

print('Training score: ', tr_score)
print('Test score: ', ts_score)

Training score:  0.9924812030075187
Test score:  0.898876404494382


Adding additional features tended to lower the test score. However, after grid searching over `C`, I found a setting that gives a test score close to 90%.
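The notebook does not show the search itself. One common way to run it is scikit-learn's `GridSearchCV`, sketched here on synthetic data from `make_classification` as a stand-in for the real features. The grid values and the `*_demo` names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the bag-of-words features (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# Candidate values for the inverse regularization strength C.
param_grid = {'C': [0.01, 0.1, 1, 10, 20, 100]}
search = GridSearchCV(
    LogisticRegression(solver='lbfgs', max_iter=1000), param_grid, cv=5)
search.fit(X_tr, y_tr)  # cross-validates on the training data only

best_c = search.best_params_['C']
test_score = search.score(X_te, y_te)
print('Best C:', best_c)
print('Held-out test score:', test_score)
```

Because the search cross-validates within the training data, the test set is only touched once at the end, which keeps the reported test score an honest estimate.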

# Challenge 2:

Test the model to find out whether it is well trained on _Alice_, on _Persuasion_, or on Austen in general. To do this, we'll grab a text from a different author and test it against all three of the previous texts: _Alice_, _Persuasion_, and _Emma_.

In [19]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


Let's use _The Ball and The Cross_ by Chesterton.

In [45]:
ball = gutenberg.raw('chesterton-ball.txt')

In [46]:
ball[:2000]

'[The Ball and The Cross by G.K. Chesterton 1909]\n\n\nI. A DISCUSSION SOMEWHAT IN THE AIR\n\nThe flying ship of Professor Lucifer sang through the skies like\na silver arrow; the bleak white steel of it, gleaming in the\nbleak blue emptiness of the evening.  That it was far above the\nearth was no expression for it; to the two men in it, it seemed\nto be far above the stars.  The professor had himself invented\nthe flying machine, and had also invented nearly everything in\nit.  Every sort of tool or apparatus had, in consequence, to the\nfull, that fantastic and distorted look which belongs to the\nmiracles of science.  For the world of science and evolution is\nfar more nameless and elusive and like a dream than the world of\npoetry and religion; since in the latter images and ideas remain\nthemselves eternally, while it is the whole idea of evolution\nthat identities melt into each other as they do in a nightmare.\n\nAll the tools of Professor Lucifer were the ancient human tools\n

In [50]:
# Cleaning line breaks, the bracketed title, chapter numerals such as 'I.', and chapter titles
ball_clean = re.sub(r'\n', ' ', ball)
ball_clean = re.sub(r'\[.*?\]', '', ball_clean)
ball_clean = re.sub(r'I\.+?', '', ball_clean)
ball_clean = re.sub(r'([A-Z ]+?  )', '', ball_clean)
ball_clean[:2000]

'The flying ship of Professor Lucifer sang through the skies like a silver arrow; the bleak white steel of it, gleaming in the bleak blue emptiness of the evening.  That it was far above the earth was no expression for it; to the two men in it, it seemed to be far above the stars.  The professor had himself invented the flying machine, and had also invented nearly everything in it.  Every sort of tool or apparatus had, in consequence, to the full, that fantastic and distorted look which belongs to the miracles of science.  For the world of science and evolution is far more nameless and elusive and like a dream than the world of poetry and religion; since in the latter images and ideas remain themselves eternally, while it is the whole idea of evolution that identities melt into each other as they do in a nightmare.  All the tools of Professor Lucifer were the ancient human tools gone mad, grown into unrecognizable shapes, forgetful of their origin, forgetful of their names.  That thing

In [53]:
# Shortening the text length to better match the other works
ball_clean = ball_clean[:15000]

In [56]:
# Parse the text with spaCy
ball_doc = nlp(ball_clean)

# Group into sentences
ball_sents = [[sent, 'Chesterton'] for sent in ball_doc.sents]

# Bag of Words
ball_sentences = pd.DataFrame(ball_sents)
ball_bow = bow_features(ball_sentences, common_words)

Processing row 0
Processing row 50
Processing row 100
Processing row 150


Now that we have cleaned and processed _Ball_, let's combine it with each of the other three works to create three separate test sets.

In [59]:
# Replacing the Persuasion data in the train test split with Ball data
X_AliceBall_test = np.concatenate((
    X_train[y_train[y_train == 'Carroll'].index], 
    ball_bow.drop(['text_sentence', 'text_source'], axis=1)), axis=0)
y_AliceBall_test = pd.concat([y_train[y_train=='Carroll'],
                        pd.Series(['Chesterton'] * ball_bow.shape[0])])

In [60]:
np.unique(y_AliceBall_test, return_counts=True)

(array(['Carroll', 'Chesterton'], dtype=object), array([ 76, 172]))

In [90]:
# Replacing Alice data with the Ball data
X_PersBall = pd.concat((ball_bow.iloc[:129, :], word_counts.iloc[129:, :]))
X_PersBall = X_PersBall.drop(['text_sentence','text_source'], axis=1)
Y_PersBall = pd.concat((pd.Series(['Chesterton'] * 129), pd.Series(['Austen'] * 315)))

In [93]:
X_PersBall_train, X_PersBall_test, Y_PersBall_train, Y_PersBall_test = train_test_split(X_PersBall,
                                                                                       Y_PersBall,
                                                                                       test_size=.4,
                                                                                       random_state=0)

In [99]:
# Combining Emma and Ball data
X_BallEmma = pd.concat((ball_bow.iloc[:129, :], emma_bow))
X_BallEmma = X_BallEmma.drop(['text_sentence','text_source'], axis=1)
Y_BallEmma = pd.concat((pd.Series(['Chesterton'] * 129), pd.Series(['Austen'] * 170)))

In [102]:
X_BallEmma_train, X_BallEmma_test, Y_BallEmma_train, Y_BallEmma_test = train_test_split(X_BallEmma,
                                                                                       Y_BallEmma,
                                                                                       test_size=.4,
                                                                                       random_state=0)

## Test Results

In [104]:
# AliceBall Results
lr_AliceBall_pred = lr.predict(X_AliceBall_test)

print('AliceBall Test score: ', lr.score(X_AliceBall_test, y_AliceBall_test))
pd.crosstab(y_AliceBall_test, lr_AliceBall_pred)

AliceBall Test score:  0.07661290322580645


| Actual (rows) / Predicted (columns) | Austen | Carroll |
|---|---|---|
| Carroll | 57 | 19 |
| Chesterton | 163 | 9 |


In [105]:
# PersBall Results
lr_PersBall_pred = lr.predict(X_PersBall_test)

print('PersBall Test score: ', lr.score(X_PersBall_test, Y_PersBall_test))
pd.crosstab(Y_PersBall_test, lr_PersBall_pred)

PersBall Test score:  0.702247191011236


| Actual (rows) / Predicted (columns) | Austen | Carroll |
|---|---|---|
| Austen | 125 | 0 |
| Chesterton | 48 | 5 |


In [107]:
# BallEmma Results
lr_BallEmma_pred = lr.predict(X_BallEmma_test)

print('BallEmma Test score: ', lr.score(X_BallEmma_test, Y_BallEmma_test))
pd.crosstab(Y_BallEmma_test, lr_BallEmma_pred)

BallEmma Test score:  0.5916666666666667


| Actual (rows) / Predicted (columns) | Austen | Carroll |
|---|---|---|
| Austen | 71 | 3 |
| Chesterton | 44 | 2 |


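It seems our model is well trained on Jane Austen's _Persuasion_ and generalizes fairly well to her other work, _Emma_. Keep in mind, though, that the model can only answer Austen or Carroll, and it assigns most Chesterton sentences to Austen, so part of that apparent strength may be a general bias toward the Austen label.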
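To put a number on how the Chesterton sentences were labelled, the sketch below takes the Chesterton rows from the three crosstabs above and computes the share assigned to Austen. The dictionary names are illustrative.

```python
# Chesterton rows from the three crosstabs: predicted label -> count.
chesterton_rows = {
    'AliceBall': {'Austen': 163, 'Carroll': 9},
    'PersBall': {'Austen': 48, 'Carroll': 5},
    'BallEmma': {'Austen': 44, 'Carroll': 2},
}

# Share of Chesterton sentences that the model labelled as Austen.
austen_share = {
    name: row['Austen'] / sum(row.values())
    for name, row in chesterton_rows.items()
}

for name, share in austen_share.items():
    print(f"{name}: {share:.1%} of Chesterton sentences predicted as Austen")
```

Since `lr` was never trained on a Chesterton class, every one of these predictions counts as an error, which is why the AliceBall score in particular is so low.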