For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf.
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.
4. Assess your models using cross-validation and determine whether one model performed better.
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.

In [22]:
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.corpus import gutenberg
nltk.download('punkt')
nltk.download('gutenberg')
import re
from collections import Counter
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedShuffleSplit


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nickdelucchi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/nickdelucchi/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [2]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


# TF-IDF

In [3]:
#reading in the data in the form of paragraphs
paradise = gutenberg.paras('milton-paradise.txt')
#processing

paradise_paras=[]
for paragraph in paradise:
    #each paragraph is a list of sentences; keep only the first sentence (a list of words)
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Joining the words into a string and adding it to the list of strings.
    paradise_paras.append(' '.join(para))

print(paradise_paras[0:10])



In [4]:
#reading in the data in the form of paragraphs
blake = gutenberg.paras('blake-poems.txt')
#processing

blake_paras=[]
for paragraph in blake:
    #each paragraph is a list of sentences; keep only the first sentence (a list of words)
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Joining the words into a string and adding it to the list of strings.
    blake_paras.append(' '.join(para))

print(blake_paras[0:10])

['[ Poems by William Blake 1789 ]', 'SONGS OF INNOCENCE AND OF EXPERIENCE and THE BOOK of THEL', 'SONGS OF INNOCENCE', 'INTRODUCTION', 'Piping down the valleys wild , Piping songs of pleasant glee , On a cloud I saw a child , And he laughing said to me :', '" Pipe a song about a Lamb !"', '" Drop thy pipe , thy happy pipe ; Sing thy songs of happy cheer :!"', '" Piper , sit thee down and write In a book , that all may read ."', "And I made a rural pen , And I stain ' d the water clear , And I wrote my happy songs Every child may joy to hear .", 'THE SHEPHERD']


In [7]:
# Label each paragraph with its author.
paradiseparas = [[para, "Milton"] for para in paradise_paras]
blakeparas = [[para, "Blake"] for para in blake_paras]

# Combine the paragraphs from the two texts into one data frame.
paras = pd.DataFrame(paradiseparas + blakeparas)
paras.head()

Unnamed: 0,0,1
0,[ Paradise Lost by John Milton 1667 ],Milton
1,Book I,Milton
2,"Of Man ' s first disobedience , and the fruit ...",Milton
3,Book II,Milton
4,"High on a throne of royal state , which far Ou...",Milton


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(paras.drop(1,axis=1),paras[1], test_size=0.4, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=2, # only use words that appear in at least two paragraphs
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case so capitalized and lowercase forms of a word share one feature
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm='l2', #Scales each paragraph vector to unit length so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )


#Applying the vectorizer
paras_tfidf=vectorizer.fit_transform(paras.drop(1,axis=1)[0].tolist())
print("Number of features: %d" % paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(paras_tfidf, test_size=0.4, random_state=0)


#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()
X_test_tfidf_csr = X_test_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bypara = [{} for _ in range(0,n)]
#List of features (on scikit-learn 1.0+, use get_feature_names_out() instead)
terms = vectorizer.get_feature_names()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]


Number of features: 466
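
The `smooth_idf` and `norm` settings above can be checked by hand on a toy corpus. The sketch below is a minimal pure-Python illustration, assuming the smoothed formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The three `docs` strings are made up for the example, and tokenization is a plain whitespace split rather than the vectorizer's own.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the Gutenberg data).
docs = [
    "heaven and hell",
    "the lamb and the child",
    "heaven and the lamb",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothing adds one to both the document count and the document frequency.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_l2(tokens):
    # Raw term counts times idf, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = {t: c * smoothed_idf(t) for t, c in counts.items()}
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

vec = tfidf_l2(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term:>7}: {weight:.3f}")
```

A term that appears in every document, such as `and` here, gets an idf of exactly 1 rather than 0, and every vector has unit length, which is what puts long and short paragraphs on the same scale.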


In [9]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()

rfc.fit(X_train_tfidf_csr, y_train)

print('Training set score:', rfc.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', rfc.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(rfc, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 1.0

Test set score: 0.9603174603174603

Cross Validation:
    0.95 (+/- 0.04)
[0.92307692 1.         0.92307692 0.92307692 1.         1.
 1.         0.92307692 0.92307692 0.92307692]
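
The fold scores printed above take only two distinct values, which is worth unpacking. A minimal sketch, assuming `StratifiedShuffleSplit` is left at its default 10% test fraction and that the held-out set has 126 paragraphs (inferred from the test score 0.9603... = 121/126, not stated in the notebook):

```python
import math
from fractions import Fraction
from statistics import mean, pstdev

# Assumption: 126 held-out paragraphs, inferred from 121/126 = 0.9603...
n_heldout = 126
# Assumption: each shuffle split scores 10% of the data, rounded up.
fold_size = math.ceil(0.1 * n_heldout)

# Every fold score must be a multiple of 1/fold_size.
possible = [Fraction(k, fold_size) for k in (fold_size, fold_size - 1)]
print(fold_size, [float(p) for p in possible])

# The ten scores reported above: six folds with one error, four with none.
scores = [12 / 13] * 6 + [1.0] * 4
# pstdev is the population standard deviation, matching NumPy's default .std().
print("%0.2f (+/- %0.2f)" % (mean(scores), pstdev(scores)))
```

With so few items per fold, a single misclassification moves a fold's accuracy by about 7.7 percentage points, so small differences between models at this scale should be read cautiously.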




In [10]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
lr.fit(X_train_tfidf_csr, y_train)

print('Training set score:', lr.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', lr.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(lr, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 0.9090909090909091

Test set score: 0.9047619047619048

Cross Validation:
    0.96 (+/- 0.04)
[1.         1.         0.92307692 0.92307692 1.         1.
 1.         0.92307692 0.92307692 0.92307692]




In [11]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train_tfidf_csr, y_train)

print('Training set score:', clf.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', clf.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

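# NOTE: this call passes `lr` rather than `clf`, so the scores printed below repeat the
# logistic regression results; pass `clf` to cross-validate the gradient boosting model.
score = cross_val_score(lr, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')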
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 1.0

Test set score: 0.9603174603174603

Cross Validation:
    0.96 (+/- 0.04)
[1.         1.         0.92307692 0.92307692 1.         1.
 1.         0.92307692 0.92307692 0.92307692]




# Bag of Words

In [12]:
# Utility function for standard text cleaning.
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove any text enclosed in square brackets, such as the title line.
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text
    
# Load and clean the data.
#macbeth = gutenberg.raw('shakespeare-macbeth.txt')
paradise = gutenberg.raw('milton-paradise.txt')
blake = gutenberg.raw('blake-poems.txt')

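# Corpus-specific cleaning.
# Remove the "Book I", "Book II", ... headings from Paradise Lost.
paradise = re.sub(r'Book \D{1,3}', '', paradise)
# Remove runs of capitals (the poem titles). Note that this also strips the pronoun "I".
blake = re.sub(r"[A-Z]+\b","",blake)
# Remove the "and   of" left over once the capitalized title words are gone.
blake = re.sub(r"and   of","",blake)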

paradise = text_cleaner(paradise[:int(len(paradise)/10)])
blake = text_cleaner(blake[:int(len(blake)/10)])

print(paradise[:100])
print(blake[:100])

Of Man's first disobedience, and the fruit Of that forbidden tree whose mortal taste Brought death i
Piping down the valleys wild, Piping songs of pleasant glee, On a cloud saw a child, And he laughing
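
The effect of each substitution is easiest to see on a short sample, including why the printed Blake excerpt reads "On a cloud saw a child" without the "I". The sketch below reuses the patterns from the cell above on a made-up string. The alternative pattern `\b[A-Z]{2,}\b` is a hypothetical suggestion, not something the notebook uses, and is shown only to illustrate how the pronoun could be kept.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)        # double dash -> space
    text = re.sub(r'\[.*?\]', '', text)    # drop bracketed text
    return ' '.join(text.split())          # collapse runs of whitespace

# Made-up sample in the style of the Blake file.
sample = ("[Poems by William Blake 1789]\n\nINTRODUCTION\n\n"
          "On a cloud I saw a child,--And he laughing said to me:")

# Pattern used in the notebook: any run of capitals ending at a word boundary.
as_in_notebook = text_cleaner(re.sub(r"[A-Z]+\b", "", sample))

# Hypothetical alternative: only whole words made of two or more capitals.
alternative = text_cleaner(re.sub(r"\b[A-Z]{2,}\b", "", sample))

print(as_in_notebook)
print(alternative)
```

Both versions remove the all-caps title; only the second leaves single-letter words such as "I" in place.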


In [15]:
# Parse the cleaned texts. This can take a bit.
# On spaCy 3+, the 'en' shortcut is gone; load a full model name such as 'en_core_web_sm'.
nlp = spacy.load('en')
paradise_doc = nlp(paradise)
blake_doc = nlp(blake)

In [16]:
# Group into sentences.
paradise_sents = [[sent, "Milton"] for sent in paradise_doc.sents]
blake_sents = [[sent, "Blake"] for sent in blake_doc.sents]

# Combine the sentences from the two texts into one data frame.
sentences = pd.DataFrame(paradise_sents + blake_sents)
sentences.head()

Unnamed: 0,0,1
0,"(Of, Man, 's, first, disobedience, ,, and, the...",Milton
1,"(And, chiefly, thou, ,, O, Spirit, ,, that, do...",Milton
2,"(Say, first, for, Heaven, hides, nothing, from...",Milton
3,"(Who, first, seduced, them, to, that, foul, re...",Milton
4,"(Th, ', infernal, Serpent, ;, he, it, was, who...",Milton


In [19]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

# Set up the bags.
blakewords = bag_of_words(blake_doc)
paradisewords = bag_of_words(paradise_doc)

# Combine bags to create a set of unique words.
common_words = set(blakewords + paradisewords)

In [20]:
# Create our data frame with features. This can take a while to run.
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300


Unnamed: 0,audacious,music,glory,roll,seed,realm,perfidious,pain,afric,stream,...,aim,altar,damp,reed,orders,furnace,servile,reiterated,text_sentence,text_source
0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Of, Man, 's, first, disobedience, ,, and, the...",Milton
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(And, chiefly, thou, ,, O, Spirit, ,, that, do...",Milton
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Say, first, for, Heaven, hides, nothing, from...",Milton
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Who, first, seduced, them, to, that, foul, re...",Milton
4,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,"(Th, ', infernal, Serpent, ;, he, it, was, who...",Milton
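
Filling the frame one cell at a time with `df.loc[i, word] += 1` is the slow part of this step. As a hedged alternative, the counts can be built per sentence with `Counter` and turned into rows in one pass. The sketch below is pure Python and uses made-up token lists in place of filtered spaCy lemmas; `count_rows` is a hypothetical helper name, not part of the notebook.

```python
from collections import Counter

def count_rows(token_lists, vocabulary):
    """Return one row of counts per sentence, with columns ordered as in `vocabulary`."""
    vocab = list(vocabulary)
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)  # words not in the sentence count as 0
        rows.append([counts[word] for word in vocab])
    return vocab, rows

# Made-up lemmas standing in for filtered spaCy tokens.
sentences = [
    ["man", "disobedience", "fruit", "tree"],
    ["lamb", "child", "lamb"],
]
vocabulary = sorted({word for sent in sentences for word in sent})

columns, rows = count_rows(sentences, vocabulary)
print(columns)
for row in rows:
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=columns)`. Sorting the vocabulary also gives a stable column order, which a `set` does not guarantee.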


In [23]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)


In [24]:
rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(rfc, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)



Training set score: 1.0

Test set score: 0.9692307692307692

Cross Validation:
    0.92 (+/- 0.00)
[0.92307692 0.92307692 0.92307692 0.92307692 0.92307692 0.92307692
 0.92307692 0.92307692 0.92307692 0.92307692]


In [25]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
lr.fit(X_train, y_train)

print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(lr, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 1.0

Test set score: 0.9461538461538461

Cross Validation:
    0.92 (+/- 0.00)
[0.92307692 0.92307692 0.92307692 0.92307692 0.92307692 0.92307692
 0.92307692 0.92307692 0.92307692 0.92307692]




In [26]:
clf = ensemble.GradientBoostingClassifier()
clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))


split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(clf, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 1.0

Test set score: 0.9384615384615385

Cross Validation:
    0.91 (+/- 0.03)
[0.92307692 0.92307692 0.92307692 0.92307692 0.92307692 0.92307692
 0.92307692 0.92307692 0.84615385 0.84615385]


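Because TF-IDF and Bag of Words both represent text through word frequencies, we expected similar performance from the two feature sets. In cross-validation, TF-IDF scored higher (0.95 to 0.96) than Bag of Words (0.91 to 0.92). The held-out test scores are mixed: Bag of Words was ahead for the random forest and logistic regression, and TF-IDF was ahead for gradient boosting.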
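
The comparison can be checked directly against the scores printed in the cells above. A small sketch using only those reported numbers; the dictionary layout and names are illustrative, and the third TF-IDF cross-validation figure came from a call that passed `lr`, so it is not an independent gradient boosting result.

```python
from statistics import mean

# Scores copied from the outputs above, in the order
# random forest, logistic regression, gradient boosting.
reported = {
    "tfidf": {
        "cv":   [0.95, 0.96, 0.96],   # third value was computed with `lr`
        "test": [0.9603174603174603, 0.9047619047619048, 0.9603174603174603],
    },
    "bow": {
        "cv":   [0.92, 0.92, 0.91],
        "test": [0.9692307692307692, 0.9461538461538461, 0.9384615384615385],
    },
}

# Average each kind of score within each feature set.
summary = {
    name: {kind: mean(values) for kind, values in scores.items()}
    for name, scores in reported.items()
}
for name, row in summary.items():
    print(f"{name:>5}  cv={row['cv']:.3f}  test={row['test']:.3f}")
```

Averaged this way, the cross-validation mean and the single-split test mean do not rank the two feature sets the same way, so the conclusion depends on which evaluation is used.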