# Natural Language Processing Supervised Learning -- Kristofer Schobert

In this assignment, we will seek to predict which novel a given sentence is from. The two novels are Jane Austen's *Persuasion* and Lewis Carroll's *Alice in Wonderland*. 

We start by following Thinkful's curriculum.

We will improve upon this initial model, which only uses word frequency per sentence as features, by also including parts of speech per sentence. We also use scikit-learn's GridSearchCV to scan the parameter space to find the best model. 

Afterwards, we see how this model performs when comparing *Alice in Wonderland* to *Sense and Sensibility* by Jane Austen.

In [3]:
# Importing packages

%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords
from collections import Counter
import nltk

nltk.download('gutenberg')
!python -m spacy download en

[nltk_data] Downloading package gutenberg to /Users/Kris/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/anaconda3/envs/first_sandbox/lib/python3.7/site-packages/en_core_web_sm -->
/anaconda3/envs/first_sandbox/lib/python3.7/site-packages/spacy/data/en
You can now load the model via spacy.load('en')


Now, we will clean each text, removing chapter titles and new line characters.  

In [4]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Drop anything inside square brackets (raw string avoids invalid escapes).
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic 
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)


# including the whole novels will require a lot of runtime, 
# so let's only include the first tenth of each novel. 
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])

We now have spaCy parse the cleaned texts and group them by sentence.

In [6]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [7]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,"(So, she, was, considering, in, her, own, mind...",Carroll
2,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,"(Oh, dear, !)",Carroll
4,"(Oh, dear, !)",Carroll


From here, we create some functions that construct our desired data frame. Each row corresponds to a sentence from one of the two books. Each column is a word drawn from the 2000 most common words of either book (after removing punctuation and stop words). Each cell holds the number of times that word appears in the given sentence. We also include a column holding the row's full sentence and a target column naming the novel the sentence is from.
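
To make the layout of that data frame concrete, here is a minimal sketch on two made-up sentences. It is an illustration only: the toy sentences are hypothetical, already-lemmatized stand-ins rather than text from the novels, and plain dictionaries are used in place of a pandas data frame.

```python
from collections import Counter

# Hypothetical, already-lemmatized toy sentences (not taken from the novels).
toy_sentences = [
    (["alice", "see", "rabbit", "rabbit"], "Carroll"),
    (["anne", "see", "captain"], "Austen"),
]

# Vocabulary: every word seen in either "book", sorted so column order is stable.
vocabulary = sorted({word for words, _ in toy_sentences for word in words})

# One row per sentence: a count for each vocabulary word, plus the source label.
rows = []
for words, source in toy_sentences:
    counts = Counter(words)
    row = {word: counts[word] for word in vocabulary}
    row["text_source"] = source
    rows.append(row)

for row in rows:
    print(row)
```

Most cells are zero even in this tiny example, which is why the real feature table is so sparse.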


In [11]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]
    

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags into a list of unique words. A list is used because recent
# pandas versions reject a set as an indexer.
common_words = list(set(alicewords + persuasionwords))

In [15]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400


Unnamed: 0,door,plan,undue,pass,age,carry,climate,bottle,place,eagerly,...,1806,dozen,kid,circumstance,miserably,mindedness,flushed,release,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",Carroll


Let's now try Random Forest, Logistic Regression, and Gradient Boosting as our supervised models for predicting which book a sentence is from. We will be splitting the data we have into training data (60% of sentences) and testing data (40% of sentences). The model with the best testing accuracy we will dub the winner. 

In [34]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)
train = rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

Training set score: 0.981203007518797

Test set score: 0.8426966292134831




In [33]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(266, 1612) (266,)
Training set score: 0.9699248120300752

Test set score: 0.8764044943820225




In [36]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.9661654135338346

Test set score: 0.8202247191011236


Our best model is Logistic Regression! Let's see how it performs when comparing *Alice in Wonderland* to *Emma* by Jane Austen. We will use the same model (logistic regression whose features are the most common words of *Alice in Wonderland* and *Persuasion*) and see how it does. Note that we are not using the most common words from *Emma* as features in this model.
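
The point about not using *Emma*'s vocabulary can be sketched as follows. This is a toy illustration under stated assumptions: `training_vocabulary` is a hypothetical stand-in for `common_words`, and `count_against_vocabulary` is a helper invented here, not part of the notebook.

```python
from collections import Counter

# Hypothetical feature vocabulary, standing in for `common_words`
# (built from the two training novels only).
training_vocabulary = ["alice", "captain", "rabbit", "see"]

def count_against_vocabulary(lemmas, vocabulary):
    """Count only the lemmas that are already features; ignore the rest."""
    counts = Counter(lemma for lemma in lemmas if lemma in vocabulary)
    return [counts[word] for word in vocabulary]

# A made-up sentence from a third book: "emma" and "handsome" are not features.
new_sentence = ["emma", "see", "handsome", "captain", "see"]
features = count_against_vocabulary(new_sentence, training_vocabulary)
print(features)  # [0, 1, 0, 2]
```

Words that never appeared in the training vocabulary simply contribute nothing, so the model can only judge a new book by the words it shares with the books it was trained on.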

In [38]:
# Clean the Emma data.
emma = gutenberg.raw('austen-emma.txt')
emma = re.sub(r'VOLUME \w+', '', emma)
emma = re.sub(r'CHAPTER \w+', '', emma)
emma = text_cleaner(emma[:int(len(emma)/60)])
print(emma[:500])

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her. She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period. Her mother had died too long ago for her to have more than an indistinct 


In [39]:
# Parse our cleaned data.
emma_doc = nlp(emma)

In [40]:
# Group into sentences.
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]
emma_sents = [[sent, "Austen"] for sent in emma_doc.sents]

In [41]:
# Build a new Bag of Words data frame for Emma word counts.
# We'll use the same common words from Alice and Persuasion.
emma_sentences = pd.DataFrame(emma_sents)
emma_bow = bow_features(emma_sentences, common_words)

print('done')

Processing row 0
Processing row 50
Processing row 100
Processing row 150
done


In [75]:
# Now we can model it!
# Let's use logistic regression again.

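# Combine the Emma sentence data with the Alice data from the test set.
# X_test is a NumPy array, so select rows with a positional boolean mask;
# y_test.index holds the original DataFrame labels, not positions in X_test.
carroll_mask = (y_test == 'Carroll').values
X_Emma_test = np.concatenate((
    X_test[carroll_mask],
    emma_bow.drop(['text_sentence','text_source'], axis=1)
), axis=0)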
y_Emma_test = pd.concat([y_test[y_test=='Carroll'],
                         pd.Series(['Austen'] * emma_bow.shape[0])])

# Model.
print('\nTest set score:', lr.score(X_Emma_test, y_Emma_test))
lr_Emma_predicted = lr.predict(X_Emma_test)
pd.crosstab(y_Emma_test, lr_Emma_predicted)


Test set score: 0.7623318385650224


col_0,Austen,Carroll
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
Austen,158,12
Carroll,41,12


Our model still does fairly well, even though its features are only the 2000 most common words from each of *Alice in Wonderland* and *Persuasion* (the other Jane Austen novel) and include no vocabulary drawn from *Emma*. It can still tell sentences from *Alice in Wonderland* apart from sentences from *Emma*.

## Challenge 0:

We will now try to optimize the original model. We will concern ourselves with discerning sentences from *Alice in Wonderland* versus *Persuasion*.

We will employ GridSearchCV to find the optimal Logistic Regression model. 
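
As a rough picture of what a grid search enumerates, the sketch below lists every parameter combination and counts the cross-validation fits. It assumes two penalties, six distinct `C` values between 0.001 and 100, and 7 folds. It illustrates the bookkeeping only and is not scikit-learn's implementation.

```python
from itertools import product

# Assumed grid: every penalty paired with every C value.
penalties = ["l1", "l2"]
c_values = [0.001, 0.01, 0.1, 1, 10, 100]
n_folds = 7

candidates = [{"penalty": p, "C": c} for p, c in product(penalties, c_values)]

# Each candidate is fitted once per fold and scored on the held-out fold.
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

The candidate with the best mean held-out accuracy wins, and by default `GridSearchCV` then refits that candidate on the whole training set.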

In [138]:
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

from sklearn.linear_model import LogisticRegression


from sklearn.model_selection import GridSearchCV
parameters = {'penalty':['l1','l2'], 'C':[.001, .01, .1, 1, 10, 100]}
lr = LogisticRegression(penalty='l2', solver = 'liblinear')
clf = GridSearchCV(lr, parameters, cv=7, scoring='accuracy')
clf.fit(X_train, y_train)
#print(clf.cv_results_)
print(clf.best_estimator_)



# lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
# train = lr.fit(X_train, y_train)
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training set score: 0.9849624060150376

Test set score: 0.8876404494382022




# Let's try to add more useful features

## We will add parts of speech frequency for each sentence. 

It seems reasonable to think the two authors will have different sentence structures. Maybe one uses more verbs per sentence than the other. 

Hopefully we can raise our test-set accuracy to 90%. Again, we are inputting sentences and trying to predict which novel each sentence came from.

We will run the code from the top. A lot of this code repeats what has been done before.
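
Before rerunning the pipeline, here is a small sketch of the part-of-speech feature itself. The tags are written by hand for illustration (they are assumptions, not spaCy output), and punctuation is skipped by tag here, in the spirit of the `is_punct` filter used in the notebook.

```python
from collections import Counter

# Hypothetical pre-tagged tokens as (text, coarse part-of-speech) pairs.
# The tags are hand-written for illustration, not produced by spaCy.
tagged_sentence = [
    ("Alice", "PROPN"), ("was", "AUX"), ("beginning", "VERB"),
    ("to", "PART"), ("get", "VERB"), ("very", "ADV"),
    ("tired", "ADJ"), (".", "PUNCT"),
]

# Count tags per sentence, skipping punctuation.
pos_counts = Counter(tag for _, tag in tagged_sentence if tag != "PUNCT")
print(pos_counts["VERB"], sum(pos_counts.values()))  # 2 7
```

The tag counts for a sentence add up to its number of non-punctuation tokens, so these features also carry information about sentence length.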

In [139]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Drop anything inside square brackets (raw string avoids invalid escapes).
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# The Chapter indicator is idiosyncratic 
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
alice = text_cleaner(alice[:int(len(alice)/10)])
persuasion = text_cleaner(persuasion[:int(len(persuasion)/10)])

In [140]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [142]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + persuasion_sents)
sentences.tail()

Unnamed: 0,0,1
439,"(But, it, was, not, a, merely, selfish, cautio...",Austen
440,"(Had, she, not, imagined, herself, consulting,...",Austen
441,"(The, belief, of, being, prudent, ,, and, self...",Austen
442,"(He, had, left, the, country, in, consequence, .)",Austen
443,"(A, few, months, had, seen, the, beginning, an...",Austen


In [162]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

# Utility function to create a list of the parts of speech present in a text.
def bag_of_pos(text):

    # Collect the part-of-speech tag of every non-punctuation token.
    allpos = [token.pos_
              for token in text
              if not token.is_punct]

    # Return the unique part-of-speech tags.
    return list(set(allpos))


# Creates a data frame with features for each word in our common word set and pos in our pos set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words, common_pos):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words + common_pos)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words + common_pos] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
            
        # Convert words to parts of speech
        pos_ = [token.pos_
               for token in sentence
               if (
                   not token.is_punct)]
        
        # Populate the row with part-of-speech counts.
        for part in pos_:
            df.loc[i, part] += 1

        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

alicepos = bag_of_pos(alice_doc)
persuasionpos = bag_of_pos(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = list(set(alicewords + persuasionwords))
common_pos = list(set(alicepos + persuasionpos))

In [163]:
common_pos

['NOUN',
 'PRON',
 'ADJ',
 'CCONJ',
 'DET',
 'PART',
 'AUX',
 'ADV',
 'VERB',
 'INTJ',
 'ADP',
 'PUNCT',
 'PROPN',
 'NUM']

In [164]:
# Create our data frame with features. This can take a while to run.
word_counts_pos = bow_features(sentences, common_words, common_pos)
word_counts_pos.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400


Unnamed: 0,door,plan,undue,pass,age,carry,climate,bottle,place,eagerly,...,AUX,ADV,VERB,INTJ,ADP,PUNCT,PROPN,NUM,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,3,13,0,8,0,2,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,0,7,11,0,8,0,2,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,0,0,0,0,0,0,...,0,6,5,0,4,0,2,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,"(Oh, dear, !)",Carroll


Now that we have the desired data frame, we can begin optimizing our logistic regression.

In [167]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

Y = word_counts_pos['text_source']
X = np.array(word_counts_pos.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV
parameters = {'penalty':['l1','l2'], 'C':[.001, .01, .1, 1, 10, 100]}
lr = LogisticRegression(penalty='l2', solver = 'liblinear')
clf = GridSearchCV(lr, parameters, cv=7, scoring='accuracy')
clf.fit(X_train, y_train)
#print(clf.cv_results_)
print(clf.best_estimator_)



# lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
# train = lr.fit(X_train, y_train)
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:',clf.score(X_test, y_test))

LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training set score: 0.9924812030075187

Test set score: 0.898876404494382




## Results
We raised our test-set score to roughly 90% (0.899, up from 0.888 with word counts alone). This was our goal.

# Challenge 1:

## Compare *Alice in Wonderland* to Jane Austen's *Sense and Sensibility* using the same pipeline

We will use the same pipeline to create a similar model for comparing these two books.

To summarize the process: we start by finding the 2000 most commonly used words (excluding stop words) in each book. We create a data frame in which each row corresponds to one sentence from one of the books. The columns of this data frame are the most commonly used words and the parts of speech of the two books. Each cell contains the count of that word or part of speech in the given sentence. We then fit a logistic regression model to the dataset while holding out a portion for testing. The target of the model is the book the sentence is from. Finally, we predict which book each sentence in the testing dataset is from.

In [168]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [184]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Drop anything inside square brackets (raw string avoids invalid escapes).
    text = re.sub(r'\[.*?\]', '', text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
sense = gutenberg.raw('austen-sense.txt')


# The Chapter indicator is idiosyncratic 
sense = re.sub(r'CHAPTER .*', '', sense)

# The novel is longer so only taking the first 50th
sense = text_cleaner(sense[:int(len(sense)/50)])


In [211]:
# Parse the cleaned novels. This can take a bit.
nlp = spacy.load('en')
sense_doc = nlp(sense)

In [212]:
# Group into sentences.
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
sense_sents = [[sent, "Austen"] for sent in sense_doc.sents]

# Combine the sentences from the two novels into one data frame.
sentences = pd.DataFrame(alice_sents + sense_sents)
sentences.tail()

Unnamed: 0,0,1
237,"(As, it, is, ,, without, any, addition, of, mi...",Austen
238,"("")",Austen
239,"(To, be, sure, it, is, ;, and, ,, indeed, ,, i...",Austen
240,"(They, will, have, ten, thousand, pounds, divi...",Austen
241,"(If, they, marry, ,, they, will, be, sure, of,...",Austen


In [213]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

# Utility function to create a list of the parts of speech present in a text.
def bag_of_pos(text):

    # Collect the part-of-speech tag of every non-punctuation token.
    allpos = [token.pos_
              for token in text
              if not token.is_punct]

    # Return the unique part-of-speech tags.
    return list(set(allpos))


# Creates a data frame with features for each word in our common word set and pos in our pos set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words, common_pos):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words + common_pos)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words + common_pos] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
            
        # Convert words to parts of speech
        pos_ = [token.pos_
               for token in sentence
               if (
                   not token.is_punct)]
        
        # Populate the row with part-of-speech counts.
        for part in pos_:
            df.loc[i, part] += 1

        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
alicewords = bag_of_words(alice_doc)
sensewords = bag_of_words(sense_doc)

alicepos = bag_of_pos(alice_doc)
sensepos = bag_of_pos(sense_doc)

# Combine bags to create a set of unique words.
common_words = list(set(alicewords + sensewords))
common_pos = list(set(alicepos + sensepos))

In [214]:
# Create our data frame with features. This can take a while to run.
word_counts_pos = bow_features(sentences, common_words, common_pos)
word_counts_pos.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200


Unnamed: 0,door,plan,Park,acquaintance,think,pass,age,earth,carry,neighbourhood,...,PART,AUX,ADV,VERB,INTJ,ADP,PROPN,NUM,text_sentence,text_source
0,0,0,0,0,1,0,0,0,0,0,...,2,0,3,13,0,8,2,0,"(Alice, was, beginning, to, get, very, tired, ...",Carroll
1,0,0,0,0,0,0,0,0,0,0,...,1,0,7,11,0,8,2,0,"(So, she, was, considering, in, her, own, mind...",Carroll
2,0,0,0,0,1,0,0,0,0,0,...,1,0,6,5,0,4,2,0,"(There, was, nothing, so, VERY, remarkable, in...",Carroll
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,"(Oh, dear, !)",Carroll
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,"(Oh, dear, !)",Carroll


In [218]:
# Checking to see that we have a similar number of sentences from each book.
# And we find that we do not have a class imbalance.
Counter(word_counts_pos['text_source'])

Counter({'Carroll': 129, 'Austen': 113})
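
As a point of reference for the scores that follow, the class counts above imply a simple majority-class baseline. The sketch below uses only the two reported counts and assumes a classifier that always guesses the larger class.

```python
from collections import Counter

# Class counts reported above.
class_counts = Counter({"Carroll": 129, "Austen": 113})

total = sum(class_counts.values())
majority_label, majority_count = class_counts.most_common(1)[0]

# Accuracy of always predicting the majority class on the full data set.
baseline_accuracy = majority_count / total
print(majority_label, round(baseline_accuracy, 3))  # Carroll 0.533
```

Any model worth keeping should score well above this baseline.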

In [215]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

Y = word_counts_pos['text_source']
X = np.array(word_counts_pos.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV
parameters = {'penalty':['l1','l2'], 'C':[.001, .01, .1, 1, 10, 100]}
lr = LogisticRegression(penalty='l2', solver = 'liblinear')
clf = GridSearchCV(lr, parameters, cv=7, scoring='accuracy')
clf.fit(X_train, y_train)
#print(clf.cv_results_)
print(clf.best_estimator_)



# lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
# train = lr.fit(X_train, y_train)
print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:',clf.score(X_test, y_test))

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='liblinear',
          tol=0.0001, verbose=0, warm_start=False)
Training set score: 0.9862068965517241

Test set score: 0.9072164948453608




In [219]:
y_pred = clf.predict(X_test)
pd.crosstab(y_test, y_pred)

col_0,Austen,Carroll
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
Austen,41,3
Carroll,6,47


# Conclusion and Future Thoughts

Our model works slightly better here! We are only misclassifying 9 of the 97 testing sentences. It seems the two authors use noticeably different language and sentence structures. Our inclusion of the frequency of parts of speech seems to be helpful. It is reasonable to think that one author may use more complex sentences with more varied parts of speech than the other. Calculating the frequency of parts of speech also gives us a measure of sentence length: the longer the sentence, the more nouns, verbs, adverbs, articles, etc. it contains.

It would be interesting to see how this pipeline could discern Shakespeare from Lewis Carroll. I assume it would be even more accurate. The two writers use very different language, and thus our model shouldn't have much trouble discerning them. To drive home this point, if we were comparing Lewis Carroll to a French author who wrote in French, I would hope our model would have perfect accuracy. When the language becomes that different, discerning between the two novels becomes incredibly easy. However, the features that count the frequencies of parts of speech per sentence would be much less useful than the most common words in this case, since French and English use similar parts of speech.

The most challenging task would be to discern two works by the same author. Would their language vary strongly enough from one novel to the next to tell the difference? How consistent is their style within one novel anyway? Maybe we could discern chapters in one novel with this pipeline... If one chapter is happy in tone while another is sad, this could be possible. Chapters vary in emotion, so maybe this would not be as difficult as it seems. This would be an interesting future project.