# Natural Language Processing

In this exercise we will learn the basics of language parsing and manipulation, then try to predict which of two authors wrote a given sentence. Methods we will implement include bag-of-words, spaCy feature generation, list comprehensions, and several different classifier models.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter

from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import VarianceThreshold

import nltk

Create a function for cleaning text that manually removes '--' (which spaCy does not recognize). <br>
Load and clean data (Alice in Wonderland by Lewis Carroll and Persuasion by Jane Austen).<br>
Delete chapter titles.<br>

In [2]:
def text_cleaner(text):

    text = re.sub(r'--',' ',text)
    # Raw string avoids invalid-escape warnings; removes anything in square brackets.
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)
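
To make the effect of `text_cleaner` concrete, here is a small sketch that applies the same three steps to a made-up string. The sample text is an assumption for illustration only; it is not taken from either novel.

```python
import re

def text_cleaner(text):
    # Replace the double dash, which the notebook says spaCy does not recognize.
    text = re.sub(r'--', ' ', text)
    # Drop anything in square brackets, such as a bracketed title line.
    text = re.sub(r'[\[].*?[\]]', '', text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(text.split())

# Hypothetical sample input, invented for this example.
sample = "[Sample Title 1900]\n\nShe paused--then   went on.\n"
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that the bracket pattern is non-greedy and does not span newlines, so it only removes bracketed text that sits on a single line.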

Parse the cleaned novels into spaCy docs.

In [3]:
nlp = spacy.load('en')  # spaCy 3+ dropped this shortcut; load a named model such as 'en_core_web_sm'
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

Extract sentences from spaCy docs. <br>
Combine sentences from the two novels into one dataframe for analysis. <br>

In [4]:
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(I, shall, be, late, !, ')",Carroll


"Time to bag some words! Since spaCy has already tokenized and labelled our data, we can move directly to recording how often various words occur. We will exclude stopwords and punctuation. In addition, in an attempt to keep our feature space from exploding, we will work with lemmas (root words) rather than the raw text terms, and we'll only use the 2000 most common words for each text." <br>

Create bag-of-words utility function to create a list of the 2000 most common words, filtering out punctuation and stop words. <br>
Create function that constructs a dataframe with features for every word in the combined word set. (Each row represents a sentence in either novel. Each column is the count of the times that word appears in the sentence). <br>
Scaffold the dataframe and initialize counts to 0. <br>
Process each row, counting the occurrence of words in each sentence. <br>
Populate the row with word counts for all lemmas that are neither punctuation nor stop words.<br>
Set up the bag-of-words for each novel and combine into a list, then make a dataframe from that combined list.

In [5]:
def bag_of_words(text):

    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]

    return [item[0] for item in Counter(allwords).most_common(2000)]
    
def bow_features(sentences, common_words):
    
    # Use a list for the columns: newer pandas versions reject sets as indexers.
    columns = list(common_words)
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, columns] = 0
    
    for i, sentence in enumerate(df['text_sentence']):

        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 500 == 0:
            print("Processing row {}".format(i))
            
    return df

alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

common_words = set(alicewords + persuasionwords)
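
The cell-by-cell `df.loc[i, word] += 1` update in `bow_features` is easy to follow but slow on thousands of rows. As a rough sketch of the same counting idea, the example below uses `Counter` on plain lists of lemmas. The token lists and the small vocabulary are hypothetical stand-ins for spaCy sentences and `common_words`, and the stop-word and punctuation filter is left out for brevity.

```python
from collections import Counter

# Hypothetical, already-lemmatized sentences standing in for spaCy spans.
sentences = [
    ["alice", "begin", "get", "tired", "sit", "sister"],
    ["oh", "dear", "oh", "dear", "late"],
]
# Hypothetical vocabulary, sorted so the column order is stable.
common_words = sorted({"alice", "dear", "late", "oh", "sister"})

def count_row(lemmas, vocabulary):
    # Count every lemma once, then read off only the vocabulary columns.
    counts = Counter(lemmas)
    return [counts[word] for word in vocabulary]

rows = [count_row(lemmas, common_words) for lemmas in sentences]
for row in rows:
    print(dict(zip(common_words, row)))
```

A list of rows like this can be handed to `pd.DataFrame(rows, columns=common_words)` in one step, which avoids updating the frame one cell at a time.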

In [6]:
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000


Unnamed: 0,introduce,sufficient,don't,force,somewhere,dunce,wear,grove,hayters,delight,...,puss,nineteen,renew,leave,overcome,coils,negative,sensation,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",Carroll


"Give the bag of words features a whirl by trying a random forest."

In [7]:
rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.990282131661

Test set score: 0.886748120301


"Holy overfitting, Batman! Overfitting is a known problem when using bag of words, since it basically involves throwing a massive number of features at a model – some of those features (in this case, word frequencies) will capture noise in the training set. Since overfitting is also a known problem with Random Forests, the divergence between training score and test score is expected." <br>


Let's try a technique with some protection against overfitting due to extraneous features: logistic regression with lasso. (No penalty is specified in the cell below, so scikit-learn's default L2 penalty, i.e. ridge, is what actually runs.)

In [8]:
lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

y_pred = train.predict(X_test)
cm = pd.crosstab(y_test, y_pred)
print(cm)

(3190, 3002) (3190,)
Training set score: 0.956112852665

Test set score: 0.914003759398
col_0        Austen  Carroll
text_source                 
Austen         1434       38
Carroll         145      511


Logistic regression performs a bit better than the random forest. The testing score increases from 0.8867 to **0.914**, and the overfitting is much less prominent. <br>
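
Overall accuracy hides how differently the two authors are treated. As a quick check, the sketch below recomputes accuracy and per-author recall from the crosstab counts printed above; only the dictionary layout is my own choice.

```python
# Counts copied from the logistic regression crosstab above
# (rows are the true author, columns the predicted author).
confusion = {
    "Austen": {"Austen": 1434, "Carroll": 38},
    "Carroll": {"Austen": 145, "Carroll": 511},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Recall: share of each author's sentences that were labelled correctly.
recall = {label: row[label] / sum(row.values())
          for label, row in confusion.items()}

print(f"accuracy: {accuracy:.4f}")
for label, value in recall.items():
    print(f"recall for {label}: {value:.4f}")
```

The model is clearly better at recognizing Austen sentences than Carroll sentences, which is worth keeping in mind when comparing models on accuracy alone.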

Let's see what gradient boosting can do.

In [9]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.876489028213

Test set score: 0.86795112782


With a gradient boosting score of **0.8679** using default parameters, it looks like logistic regression is the winner. However, there is always room for improvement! <br>

# Same model, new inputs

"What if we feed the model a different novel by Jane Austen, like Emma? Will it be able to distinguish Austen from Carroll with the same level of accuracy if we insert a different sample of Austen's writing?
First, we need to process Emma the same way we processed the other data, and combine it with the Alice data:"

In [10]:
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma)
print(emma[:100])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to


In [11]:
emma_doc = nlp(emma)

In [12]:
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

emma_sents = emma_sents[0:len(alice_sents)]

Emma is quite long, so we cut it down to the same length as Alice. <br>
Build a new Bag of Words dataframe for Emma word counts, and use same common words from Alice and Persuasion.

In [13]:
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500


Now we can model it! Let's use logistic regression again, since it performed the best last time. <br>
Combine the Emma sentence data with the Carroll sentences (the cell below takes them from the training split, not the test set) and score the model.

In [14]:
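# Caution: y_train keeps the original DataFrame row labels as its index,
# while X_train is a NumPy array indexed by position, so this lookup is
# not guaranteed to return the Carroll rows.
X_Emma_test = np.concatenate((
    X_train[y_train[y_train=='Carroll'].index],
    emma_bow.drop(['text_sentence','text_source'], axis=1)
), axis=0)
y_Emma_test = pd.concat([y_train[y_train=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])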

print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)


Test set score: 0.687173750932


col_0    Austen  Carroll
row_0                   
Austen     1534      135
Carroll     704      309
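
One caveat about the cell above: `X_train` is a NumPy array, which is indexed by position, while `y_train` is a pandas Series that keeps the original row labels after the shuffle. Passing `y_train[...].index` to the array therefore treats labels as positions. The toy sketch below, with made-up values and hypothetical names `y_toy` and `X_toy`, shows a position-aligned boolean mask as one way around this.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: after a shuffle, y keeps its original row labels as index,
# while the NumPy feature matrix is indexed purely by position.
y_toy = pd.Series(["Carroll", "Austen", "Carroll", "Austen"], index=[7, 2, 0, 5])
X_toy = np.array([[10], [20], [30], [40]])  # row k belongs to y_toy.iloc[k]

# Boolean mask: position-aligned, so it selects exactly the Carroll rows.
carroll_mask = (y_toy == "Carroll").to_numpy()
X_carroll = X_toy[carroll_mask]
print(X_carroll.ravel())

# The label-based lookup would instead read labels 7 and 0 as positions.
labels = y_toy[y_toy == "Carroll"].index
print(list(labels))
```

In the notebook itself, the equivalent would be something like `X_train[(y_train == 'Carroll').to_numpy()]`, though I have not re-run the Emma comparison with that change.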


"Well look at that! NLP approaches are generally effective on the same type of material as they were trained on. It looks like this model is actually able to differentiate multiple works by Austen from Alice in Wonderland. Now the question is whether the model is very good at identifying Austen, or very good at identifying Alice in Wonderland, or both..."

# Challenge 0:

Recall that the logistic regression model's best performance on the test set was 93%. See what you can do to improve performance. Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires. Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 97%.

First, I will take advantage of spaCy token information to add spaCy features into the sentence dataframe. <br>
spaCy features include text, lemma, pos, tag, dep, shape, is_alpha, is_stop, and is_punct. These features are all per token, so I need to gather summary statistics on them per sentence. Since logistic regression with default ridge settings performed best, I will use it again here to evaluate.<br>

Using list comprehensions, I will add features for the number of words in the sentence, the average word length, the number of stop words, and the number of quotation marks. These are all features that could potentially differentiate one author's writing style from another.

In [36]:
def create_spacy_features(df):
    # Note: the new columns are added to df in place.

    # Number of tokens in the sentence (punctuation tokens included).
    df['sentence_length'] = df['text_sentence'].apply(len)
    # Mean number of characters per token.
    df['avg_word_length'] = df['text_sentence'].apply(
        lambda sentence: sum(len(word) for word in sentence)/len(sentence))

    def sum_stops(sentence):
        # Count the stop-word tokens in the sentence.
        return sum(1 for token in sentence if token.is_stop)

    def sum_quote(sentence):
        # Count the quotation-mark tokens in the sentence.
        return sum(1 for token in sentence if token.is_quote)

    df['num_stop'] = df['text_sentence'].apply(sum_stops)
    df['num_quotes'] = df['text_sentence'].apply(sum_quote)

    print(df.head())

    return df

alice_persuasion_spacy = create_spacy_features(word_counts)

  introduce sufficient don't force somewhere dunce wear grove hayters delight  \
0         0          0     0     0         0     0    0     0       0       0   
1         0          0     0     0         0     0    0     0       0       0   
2         0          0     0     0         0     0    0     0       0       0   
3         0          0     0     0         0     0    0     0       0       0   
4         0          0     0     0         0     0    0     0       0       0   

     ...     coils negative sensation  \
0    ...         0        0         0   
1    ...         0        0         0   
2    ...         0        0         0   
3    ...         0        0         0   
4    ...         0        0         0   

                                       text_sentence text_source  \
0  (Alice, was, beginning, to, get, very, tired, ...     Carroll   
1  (So, she, was, considering, in, her, own, mind...     Carroll   
2  (There, was, nothing, so, VERY, remarkable, in...     Carro

Let's try iterating through multiple C values for the logistic regressor.
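
The `run_logistic` helper called in the next cell is not defined anywhere in this notebook as shown. The sketch below is my assumption of what it does, modelled on `run_lasso` further down: same split, default L2 penalty, and the same printed scores. The `max_iter=1000` setting, the returned tuple, and the synthetic `demo` frame are my additions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_logistic(df, c):
    # Assumed definition, modelled on run_lasso; the original is not shown.
    Y = df['text_source']
    X = df.drop(['text_sentence', 'text_source'], axis=1)

    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.4, random_state=0)

    print(f'C = {c}')
    # scikit-learn's default penalty is L2 (ridge); max_iter is raised
    # here only to help the solver converge.
    lr = LogisticRegression(C=c, max_iter=1000)
    lr.fit(X_train, y_train)
    train_score = lr.score(X_train, y_train)
    test_score = lr.score(X_test, y_test)
    print('Training set score:', train_score)
    print('Test set score:', test_score)
    print(pd.crosstab(y_test, lr.predict(X_test)))
    return train_score, test_score

# Tiny synthetic frame, only to show the call shape.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'word_a': rng.integers(0, 3, 60),
                     'word_b': rng.integers(0, 3, 60)})
demo['text_sentence'] = 'placeholder'
demo['text_source'] = np.where(demo['word_a'] > demo['word_b'],
                               'Carroll', 'Austen')
scores = run_logistic(demo, 1)
```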

In [56]:
c_values = [1e-1, 1, 3, 5, 10, 100]
for c in c_values:
    run_logistic(alice_persuasion_spacy, c)

C = 0.1
Training set score: 0.921630094044
Test set score: 0.90507518797


col_0        Austen  Carroll
text_source                 
Austen         1411       61
Carroll         141      515


C = 1
Training set score: 0.972727272727
Test set score: 0.918703007519


col_0        Austen  Carroll
text_source                 
Austen         1416       56
Carroll         117      539


C = 3
Training set score: 0.985579937304
Test set score: 0.917293233083


col_0        Austen  Carroll
text_source                 
Austen         1411       61
Carroll         115      541


C = 5
Training set score: 0.986833855799
Test set score: 0.918233082707


col_0        Austen  Carroll
text_source                 
Austen         1409       63
Carroll         111      545


C = 10
Training set score: 0.989655172414
Test set score: 0.916353383459


col_0        Austen  Carroll
text_source                 
Austen         1402       70
Carroll         108      548


C = 100
Training set score: 0.99122257

The default parameter C=1 gives the highest score at **0.9187.** <br>
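
Picking C by comparing test scores, as done here, means the test set is also doing the tuning. The challenge text suggests cross-validation with multiple folds, so here is a sketch of that approach using `GridSearchCV` on synthetic data. The data set is a stand-in (an assumption); the C grid is the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the word-count features.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Five-fold cross-validation on the training split chooses C.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [1e-1, 1, 3, 5, 10, 100]},
    cv=5,
)
search.fit(X_train, y_train)

best_c = search.best_params_['C']
print('Best C:', best_c)
print('Mean CV accuracy:', search.best_score_)
# The test split is scored once, after C has been fixed.
print('Test set score:', search.score(X_test, y_test))
```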

Let's try using lasso regression, which protects against overfitting due to extraneous features.

In [57]:
def run_lasso(df, c):
    
    Y = df['text_source']
    X = df.drop(['text_sentence','text_source'], axis=1)
    
    X_dummies = pd.get_dummies(X)
    X_train, X_test, y_train, y_test = train_test_split(X_dummies, 
                                                        Y,
                                                        test_size=0.4,
                                                        random_state=0)
    print(f'C = {c}')
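    # liblinear supports the L1 penalty; newer scikit-learn defaults to lbfgs, which does not.
    lr = LogisticRegression(C=c, penalty='l1', solver='liblinear', random_state=1)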
    lr.fit(X_train, y_train)
    print('Training set score:', lr.score(X_train, y_train))
    print('Test set score:', lr.score(X_test, y_test))

    y_pred = lr.predict(X_test)
    cm = pd.crosstab(y_test, y_pred)
    print('\n')
    print(cm)
    print('\n')
    
for c in c_values:
    run_lasso(alice_persuasion_spacy, c)

C = 0.1
Training set score: 0.847962382445
Test set score: 0.850093984962


col_0        Austen  Carroll
text_source                 
Austen         1380       92
Carroll         227      429


C = 1
Training set score: 0.936677115987
Test set score: 0.906015037594


col_0        Austen  Carroll
text_source                 
Austen         1412       60
Carroll         140      516


C = 3
Training set score: 0.982131661442
Test set score: 0.911654135338


col_0        Austen  Carroll
text_source                 
Austen         1408       64
Carroll         124      532


C = 5
Training set score: 0.986206896552
Test set score: 0.910714285714


col_0        Austen  Carroll
text_source                 
Austen         1408       64
Carroll         126      530


C = 10
Training set score: 0.98934169279
Test set score: 0.90977443609


col_0        Austen  Carroll
text_source                 
Austen         1398       74
Carroll         118      538


C = 100
Training set score: 0.988087774

It turns out that lasso regression does not perform as well as ridge: the highest testing score was **0.9116**, at C=3. This is slightly lower than the highest ridge score of 0.9187. <br>
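
What sets the L1 (lasso) penalty apart is that it can push coefficients to exactly zero, which is how it guards against extraneous features. The sketch below illustrates this on synthetic data (an assumption, not the novel data) by counting non-zero coefficients under each penalty with everything else held equal.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of them informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

# liblinear supports both penalties, so only the penalty differs.
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.1, solver='liblinear').fit(X, y)

n_l1 = int(np.count_nonzero(l1.coef_))
n_l2 = int(np.count_nonzero(l2.coef_))
print('Non-zero coefficients with L1:', n_l1)
print('Non-zero coefficients with L2:', n_l2)
```

Running the same count on the fitted lasso models here would show how many of the ~3000 word columns each value of C actually keeps.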

Next, I will try using a support vector classifier to distinguish between Carroll and Austen's writing styles. I will iterate through several values of the penalty parameter C and pick the best score. I will continue using the spaCy feature set since it produced a better score in logistic regression.

In [46]:
def run_svc(df, c):
    print(f'C = {c}')
    
    Y = df['text_source']
    X = df.drop(['text_sentence','text_source'], axis=1)
    X_dummies = pd.get_dummies(X)

    X_train, X_test, y_train, y_test = train_test_split(X_dummies, 
                                                        Y,
                                                        test_size=0.4,
                                                        random_state=0)
    svc = SVC(C = c)
    svc.fit(X_train, y_train)
    print('Training set score:', svc.score(X_train, y_train))
    print('Test set score:', svc.score(X_test, y_test))

    y_pred = svc.predict(X_test)
    cm = pd.crosstab(y_test, y_pred)
    print(cm)
    print('\n')

c_values = [1e-3, 1e-2, 1e-1, 1, 100, 1000]
for c in c_values:
    run_svc(alice_persuasion_spacy, c)

C = 0.001
Training set score: 0.682445141066
Test set score: 0.691729323308
col_0        Austen
text_source        
Austen         1472
Carroll         656


C = 0.01
Training set score: 0.682445141066
Test set score: 0.691729323308
col_0        Austen
text_source        
Austen         1472
Carroll         656


C = 0.1
Training set score: 0.682445141066
Test set score: 0.691729323308
col_0        Austen
text_source        
Austen         1472
Carroll         656


C = 1
Training set score: 0.682445141066
Test set score: 0.691729323308
col_0        Austen
text_source        
Austen         1472
Carroll         656


C = 100
Training set score: 0.910971786834
Test set score: 0.897086466165
col_0        Austen  Carroll
text_source                 
Austen         1418       54
Carroll         165      491


C = 1000
Training set score: 0.977115987461
Test set score: 0.915883458647
col_0        Austen  Carroll
text_source                 
Austen         1404       68
Carroll         111  

Up to a certain value of C, all the training/testing scores are the same, and they are not very good: the testing score stays at 0.69. At C=100 the test score rises to 0.897, and at C=1000 it reaches **0.9159.** This isn't bad, but it doesn't beat the logistic regression score of 0.9187.<br>
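
The flat 0.69 score has a simple reading: for small C the crosstabs show only an Austen column, so the classifier is predicting the majority class for every sentence. The sketch below checks that the score is just Austen's share of the test set, using the counts printed above.

```python
# Test-set class counts, read from the SVC crosstabs above.
test_counts = {"Austen": 1472, "Carroll": 656}

majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / sum(test_counts.values())

print(majority_label, round(baseline_accuracy, 6))
```

Any model worth keeping has to beat this majority-class baseline. One possible contributor to the slow start (an untested assumption on my part) is that the word counts and the sentence-level features sit on different scales, which the default SVC kernel can be sensitive to.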

Let's try some forms of feature selection to see if reducing the features down from ~3000 will help the score.

In [64]:
k_values = [500,1000,1500,2000,2500]

Y = alice_persuasion_spacy['text_source']
X = alice_persuasion_spacy.drop(['text_sentence','text_source'], axis=1)

for k in k_values:
    print('k:', k)
    kb = SelectKBest(k=k)
    k_reduced = kb.fit_transform(X,y=Y)

    X_train_k, X_test_k, y_train_k, y_test_k = train_test_split(k_reduced, 
                                                        Y,
                                                        test_size=0.4,
                                                        random_state=0)

    lr = LogisticRegression()
    lr.fit(X_train_k, y_train_k)
    print('Training set score:', lr.score(X_train_k, y_train_k))
    print('Test set score:', lr.score(X_test_k, y_test_k))

    y_pred = lr.predict(X_test_k)
    cm = pd.crosstab(y_test_k, y_pred)
    print(cm)
    print('\n')

k: 500
Training set score: 0.927586206897
Test set score: 0.921052631579
col_0        Austen  Carroll
text_source                 
Austen         1425       47
Carroll         121      535


k: 1000
Training set score: 0.938871473354
Test set score: 0.925281954887
col_0        Austen  Carroll
text_source                 
Austen         1425       47
Carroll         112      544


k: 1500
Training set score: 0.943260188088
Test set score: 0.926691729323
col_0        Austen  Carroll
text_source                 
Austen         1424       48
Carroll         108      548


k: 2000
Training set score: 0.94670846395
Test set score: 0.926221804511
col_0        Austen  Carroll
text_source                 
Austen         1425       47
Carroll         110      546


k: 2500
Training set score: 0.952664576803
Test set score: 0.925281954887
col_0        Austen  Carroll
text_source                 
Austen         1424       48
Carroll         111      545




When using SelectKBest to reduce the feature set to the 1500 best features, the score improves from 0.9187 to **0.9267**! This was achieved using the default value of C=1.<br>
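
A caveat on this result: the selector above is fitted on all rows before the train/test split, so the test labels help choose the features and the test score may be slightly optimistic. The sketch below shows one way to avoid that by putting `SelectKBest` inside a scikit-learn pipeline. The synthetic data and `k=10` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the ~3000 word-count columns.
X, y = make_classification(n_samples=400, n_features=50,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# The selector is fitted inside the pipeline, on training rows only.
model = make_pipeline(SelectKBest(k=10), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print('Training set score:', model.score(X_train, y_train))
print('Test set score:', model.score(X_test, y_test))
kept = model.named_steps['selectkbest'].get_support().sum()
print('Features kept:', kept)
```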

Let's try one more method of feature selection before moving on to the next challenge.

In [65]:
var = VarianceThreshold(threshold=(.999 * (1 - .999)))
var_reduced = var.fit_transform(X)
print(var_reduced.shape)

X_train_var, X_test_var, y_train_var, y_test_var = train_test_split(var_reduced, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

lr = LogisticRegression()
lr.fit(X_train_var, y_train_var)
print('Training set score:', lr.score(X_train_var, y_train_var))
print('Test set score:', lr.score(X_test_var, y_test_var))

y_pred = lr.predict(X_test_var)
cm = pd.crosstab(y_test_var, y_pred)
print(cm)
print('\n')

(5318, 1411)
Training set score: 0.954545454545
Test set score: 0.924812030075
col_0        Austen  Carroll
text_source                 
Austen         1424       48
Carroll         112      544




With a score of **0.9248**, variance threshold produces results that are good, but not better than SelectKBest. <br>
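
To get a feel for what the threshold `.999 * (1 - .999)` removes, the sketch below treats a word column as binary (present or absent in a sentence) and works out its variance as p(1 - p). Real columns hold counts rather than strict 0/1 values, so this is an approximation.

```python
n_rows = 5318                      # rows in the feature matrix, from the shape above
threshold = 0.999 * (1 - 0.999)    # same expression as in the cell above

def binary_variance(n_nonzero, n_total):
    # Variance of a 0/1 column where n_nonzero rows equal 1.
    p = n_nonzero / n_total
    return p * (1 - p)

for n_nonzero in (5, 6):
    kept = binary_variance(n_nonzero, n_rows) > threshold
    print(n_nonzero, 'sentences ->', 'kept' if kept else 'dropped')
```

Under that approximation, a word has to appear in at least six sentences to survive the filter, which fits with the feature count falling to 1411.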

Let's see if using k=1500 with SelectKBest can improve the SVC classifier.

In [77]:
print('k: 1500')
kb = SelectKBest(k=1500)
k_reduced = kb.fit_transform(X,y=Y)

for c in c_values:
    print(f'C = {c}')
    
    X_dummies = pd.get_dummies(X)

    X_train, X_test, y_train, y_test = train_test_split(k_reduced, 
                                                        Y,
                                                        test_size=0.4,
                                                        random_state=0)
    svc = SVC(C = c)
    svc.fit(X_train, y_train)
    print('Training set score:', svc.score(X_train, y_train))
    print('Test set score:', svc.score(X_test, y_test))

    y_pred = svc.predict(X_test)
    cm = pd.crosstab(y_test, y_pred)
    print(cm)
    print('\n')

k: 1500
C = 0.1
Training set score: 0.682445141066
Test set score: 0.691729323308
col_0        Austen
text_source        
Austen         1472
Carroll         656


C = 1
Training set score: 0.761128526646
Test set score: 0.759868421053
col_0        Austen  Carroll
text_source                 
Austen         1388       84
Carroll         427      229


C = 3
Training set score: 0.797178683386
Test set score: 0.803571428571
col_0        Austen  Carroll
text_source                 
Austen         1373       99
Carroll         319      337


C = 5
Training set score: 0.821943573668
Test set score: 0.827537593985
col_0        Austen  Carroll
text_source                 
Austen         1382       90
Carroll         277      379


C = 10
Training set score: 0.857053291536
Test set score: 0.858082706767
col_0        Austen  Carroll
text_source                 
Austen         1389       83
Carroll         219      437


C = 100
Training set score: 0.929467084639
Test set score: 0.91212406015
co

These scores look good! C=1000 didn't run because `c_values` was last redefined in cell In [56] as a list that stops at 100, so I will make another cell for just that parameter.

In [78]:
print('k: 1500')
kb = SelectKBest(k=1500)
k_reduced = kb.fit_transform(X,y=Y)

print(f'C = 1000')

X_dummies = pd.get_dummies(X)

X_train, X_test, y_train, y_test = train_test_split(k_reduced, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
svc = SVC(C = 1000)
svc.fit(X_train, y_train)
print('Training set score:', svc.score(X_train, y_train))
print('Test set score:', svc.score(X_test, y_test))

y_pred = svc.predict(X_test)
cm = pd.crosstab(y_test, y_pred)
print(cm)
print('\n')

k: 1500
C = 1000
Training set score: 0.968652037618
Test set score: 0.915413533835
col_0        Austen  Carroll
text_source                 
Austen         1391       81
Carroll          99      557




Even better! However, this does not beat our score of 0.9267 using k=1500 with logistic regression. Now, let's move on to the next challenge.

# Challenge 1:

Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work. This will involve pulling a new book from the Project Gutenberg corpus (print(gutenberg.fileids()) for a list) and processing it.
Record your work for each challenge in a notebook and submit it below.

Let's see whether we can distinguish sentences from another novel. We will use G.K. Chesterton's "The Ball and the Cross" (`chesterton-ball.txt` in the Gutenberg corpus).

Import the text and get rid of the chapter titles (denoted with Roman numerals).

In [66]:
chesball = gutenberg.raw('chesterton-ball.txt')
chesball = re.sub(r'^[IVXLMC]+[.][A-Z ]+$','', chesball, flags=re.MULTILINE)

chesball = text_cleaner(chesball)
print(chesball[:200])

The flying ship of Professor Lucifer sang through the skies like a silver arrow; the bleak white steel of it, gleaming in the bleak blue emptiness of the evening. That it was far above the earth was n


Process the text the same way we processed the text above.

In [67]:
chesball_doc = nlp(chesball)

chesball_sents = [[sent, "Chesterton"] for sent in chesball_doc.sents]
ches_al_sentences = pd.DataFrame(chesball_sents + alice_sents)

cheswords = bag_of_words(chesball_doc)

ches_al_words = set(cheswords + alicewords)

ches_al_word_counts = bow_features(ches_al_sentences, ches_al_words)
print(ches_al_word_counts.head())

Processing row 0
Processing row 500
Processing row 1000
Processing row 1500
Processing row 2000
Processing row 2500
Processing row 3000
Processing row 3500
Processing row 4000
Processing row 4500
Processing row 5000
Processing row 5500
Processing row 6000
  introduce don't force somewhere dunce wear science delight count  \
0         0     0     0         0     0    0       0       0     0   
1         0     0     0         0     0    0       0       0     0   
2         0     0     0         0     0    0       0       0     0   
3         0     0     0         0     0    0       1       0     0   
4         0     0     0         0     0    0       1       0     0   

  indescribable     ...     leave bewilder overcome legend negative coils  \
0             0     ...         0        0        0      0        0     0   
1             0     ...         0        0        0      0        0     0   
2             0     ...         0        0        0      0        0     0   
3             0

In [68]:
print(len(chesball_doc))
print(len(alice_doc))

97608
34363


Looks like we may have a bit of class imbalance here. Let's proceed anyway and create the spaCy features dataframe.

In [70]:
ches_al_spacy = create_spacy_features(ches_al_word_counts)

  introduce don't force somewhere dunce wear science delight count  \
0         0     0     0         0     0    0       0       0     0   
1         0     0     0         0     0    0       0       0     0   
2         0     0     0         0     0    0       0       0     0   
3         0     0     0         0     0    0       1       0     0   
4         0     0     0         0     0    0       1       0     0   

  indescribable    ...     negative coils conceivable sensation  \
0             0    ...            0     0           0         0   
1             0    ...            0     0           0         0   
2             0    ...            0     0           0         0   
3             0    ...            0     0           0         0   
4             0    ...            0     0           0         0   

                                       text_sentence text_source  \
0  (The, flying, ship, of, Professor, Lucifer, sa...  Chesterton   
1  (That, it, was, far, above, the, eart

In [72]:
for c in c_values:
    run_logistic(ches_al_spacy, c)

C = 0.1
Training set score: 0.900476442562
Test set score: 0.877380952381


col_0        Carroll  Chesterton
text_source                     
Carroll          389         264
Chesterton        45        1822


C = 1
Training set score: 0.965854949709
Test set score: 0.906746031746


col_0        Carroll  Chesterton
text_source                     
Carroll          470         183
Chesterton        52        1815


C = 3
Training set score: 0.979354155638
Test set score: 0.905952380952


col_0        Carroll  Chesterton
text_source                     
Carroll          480         173
Chesterton        64        1803


C = 5
Training set score: 0.981206987824
Test set score: 0.903571428571


col_0        Carroll  Chesterton
text_source                     
Carroll          479         174
Chesterton        69        1798


C = 10
Training set score: 0.983589200635
Test set score: 0.900793650794


col_0        Carroll  Chesterton
text_source                     
Carroll          481     

Not bad! Identifying Lewis Carroll vs. G.K. Chesterton using logistic regression gives us a testing accuracy score of **0.9067**. Let's see what happens when we reduce features using SelectKBest.

In [73]:
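# Note: these lines reuse the Alice/Persuasion frame, so the output below is
# for Austen vs. Carroll; build X and Y from ches_al_spacy to test Chesterton.
Y = alice_persuasion_spacy['text_source']
X = alice_persuasion_spacy.drop(['text_sentence','text_source'], axis=1)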

print('k: 1500')
kb = SelectKBest(k=1500)
k_reduced = kb.fit_transform(X,y=Y)

X_train_k, X_test_k, y_train_k, y_test_k = train_test_split(k_reduced, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

lr = LogisticRegression()
lr.fit(X_train_k, y_train_k)
print('Training set score:', lr.score(X_train_k, y_train_k))
print('Test set score:', lr.score(X_test_k, y_test_k))

y_pred = lr.predict(X_test_k)
cm = pd.crosstab(y_test_k, y_pred)
print(cm)
print('\n')

k: 1500
Training set score: 0.943260188088
Test set score: 0.926691729323
col_0        Austen  Carroll
text_source                 
Austen         1424       48
Carroll         108      548




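Note that the cell above builds `X` and `Y` from `alice_persuasion_spacy`, not `ches_al_spacy`, so it simply reproduces the earlier Austen vs. Carroll score of **0.9267**; it says nothing yet about Carroll vs. Chesterton. For that comparison, the best score so far remains 0.9067. Now let's run SVC on the Carroll/Chesterton data.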

In [75]:
for c in c_values:
    run_svc(ches_al_spacy, c)

C = 0.1
Training set score: 0.731074642668
Test set score: 0.740873015873
col_0        Chesterton
text_source            
Carroll             653
Chesterton         1867


C = 1
Training set score: 0.731604023293
Test set score: 0.741666666667
col_0        Carroll  Chesterton
text_source                     
Carroll            4         649
Chesterton         2        1865


C = 3
Training set score: 0.732662784542
Test set score: 0.74246031746
col_0        Carroll  Chesterton
text_source                     
Carroll            6         647
Chesterton         2        1865


C = 5
Training set score: 0.733456855479
Test set score: 0.742857142857
col_0        Carroll  Chesterton
text_source                     
Carroll            8         645
Chesterton         3        1864


C = 10
Training set score: 0.744044467972
Test set score: 0.75
col_0        Carroll  Chesterton
text_source                     
Carroll           26         627
Chesterton         3        1864


C = 100
Traini

Interestingly, these results are much worse than those of the other methods. This looks like a case of class imbalance: the classifier aggressively predicts Chesterton, the longer document, and at C=0.1 it labels all 2,520 test sentences as Chesterton, so its 0.7409 score is just Chesterton's share of the test set. In the interest of time, I will not re-run these. In the future, I would randomly sub-sample the larger class (Chesterton) to create two equally sized classes and then re-run this model.
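
The sub-sampling idea can be sketched in a few lines of pandas. The toy frame below is a made-up stand-in for `ches_al_spacy`, and the example assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# Toy frame with a 3:1 imbalance, standing in for ches_al_spacy.
toy = pd.DataFrame({
    'text_source': ['Chesterton'] * 6 + ['Carroll'] * 2,
    'sentence_length': [12, 30, 8, 21, 17, 9, 14, 6],
})

# Size of the smallest class sets the per-class sample size.
n_per_class = toy['text_source'].value_counts().min()

# Draw the same number of rows from each author without replacement.
balanced = toy.groupby('text_source').sample(n=n_per_class, random_state=0)

print(balanced['text_source'].value_counts())
```

Fixing `random_state` keeps the sub-sample reproducible, so later model comparisons are run on the same rows.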

# Conclusion

When classes are comparably sized, logistic regression with feature reduction by SelectKBest likely gives the best results. Although SVC is a very powerful classifier, like anything else it requires some fine-tuning and parameter optimizing, and this takes too long to run for this simple drill.<br>

The class imbalance does not seem to hurt the overall accuracy of logistic regression (0.9067 at C=1), which speaks to this method's robustness and versatility compared to SVC, although it still misclassifies 183 of 653 Carroll sentences. I acknowledge that it is good practice to represent the classes equally going forward.