For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles: anything you'd like, as long as it has multiple entries of varying characteristics. At least 100 entries should be enough. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature) so that classification is harder than it would be for obviously different subjects.

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.
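As a rough illustration of reserving 25% of a corpus, the sketch below splits a made-up list of texts while keeping each author's share the same in both parts. The toy `texts` and `authors` lists are assumptions for illustration only; `stratify` is optional but can help when some authors have few entries.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 8 texts by author "A" and 4 by author "B".
texts = ["text %d" % i for i in range(12)]
authors = ["A"] * 8 + ["B"] * 4

# Hold out 25% of the texts, preserving the author proportions.
train_texts, test_texts, train_authors, test_authors = train_test_split(
    texts, authors, test_size=0.25, stratify=authors, random_state=0
)

print(Counter(train_authors))
print(Counter(test_authors))
```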

1. The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

2. Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. 

3. Using those features, build models to classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.

4. Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent? If the stability of your model and the stability of your clusters diverge, delve into why.
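One possible way to put a number on "are authors consistently grouped into the same cluster?" is to compare cluster assignments with author labels. The sketch below uses made-up labels (not results from this notebook) with a contingency table, the adjusted Rand index, and the homogeneity score from scikit-learn.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Hypothetical author labels and cluster assignments for six texts.
authors = ["Milton", "Milton", "Blake", "Blake", "Keats", "Keats"]
clusters = [0, 0, 1, 1, 2, 1]

# Rows are clusters, columns are authors.
table = pd.crosstab(pd.Series(clusters, name="cluster"),
                    pd.Series(authors, name="author"))
print(table)

# 1.0 means the clusters match the authors exactly (up to renaming);
# values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(authors, clusters)
# 1.0 means every cluster contains texts from a single author.
homogeneity = homogeneity_score(authors, clusters)
print(round(ari, 3), round(homogeneity, 3))
```

Computing the same scores on the training set and on the holdout set gives one simple view of whether the clusters stay stable.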

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [22]:
import time
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.corpus import gutenberg
import re
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import ensemble
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize
from sklearn.mixture import GaussianMixture
from sklearn import neighbors
from sklearn.svm import SVC
from sklearn import tree
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedShuffleSplit

import warnings 
warnings.simplefilter('ignore')

In [11]:
# Load and clean the data.
leaves = gutenberg.paras('whitman-leaves.txt')
leaves_paras=[]
for paragraph in leaves:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    leaves_paras.append(' '.join(para))

paradise = gutenberg.paras('milton-paradise.txt')
paradise_paras=[]
for paragraph in paradise:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    paradise_paras.append(' '.join(para))
    
    
blake = gutenberg.paras('blake-poems.txt')
blake_paras=[]
for paragraph in blake:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    blake_paras.append(' '.join(para))

    
# Read local .txt files from the working directory (the pattern is a regular expression).
new_corpus = PlaintextCorpusReader("", r'.*\.txt')
    
hiawatha = new_corpus.paras("The Song Of Hiawatha, by Henry W. Longfellow.txt")
hiawatha_paras=[]
for paragraph in hiawatha:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    hiawatha_paras.append(' '.join(para))


endymion = new_corpus.paras("Endymion, by John Keats.txt")
endymion_paras=[]
for paragraph in endymion:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    endymion_paras.append(' '.join(para))


odyssey = new_corpus.paras("The Odyssey by Homer.txt")
odyssey_paras=[]
for paragraph in odyssey:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    odyssey_paras.append(' '.join(para))


burns = new_corpus.paras("The Complete Works of Robert Burns.txt")
burns_paras=[]
for paragraph in burns:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    burns_paras.append(' '.join(para))


sea = new_corpus.paras("Sea Garden by Hilda Doolittle.txt")
sea_paras=[]
for paragraph in sea:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    sea_paras.append(' '.join(para))
    

beowulf = new_corpus.paras("Beowulf by Leslie Hall.txt")
beowulf_paras=[]
for paragraph in beowulf:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    beowulf_paras.append(' '.join(para))


sappho = new_corpus.paras("Sappho- One Hundred Lyrics by Bliss Carman.txt")
sappho_paras=[]
for paragraph in sappho:
    para=paragraph[0]
    #removing the double-dash from all words
    para=[re.sub(r'--','',word) for word in para]
    #Forming each paragraph into a string and adding it to the list of strings.
    sappho_paras.append(' '.join(para))
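The ten loops above repeat the same steps, so they could be folded into one helper. The sketch below is one way to do that; `first_sentences` is a hypothetical name, and the toy input only mimics the nested structure that `paras()` returns (paragraphs, then sentences, then tokens). Note that `paragraph[0]` keeps only the first sentence of each paragraph, exactly as the loops above do.

```python
import re

def first_sentences(paragraphs):
    """Join the first sentence of each paragraph into a string, dropping '--'."""
    cleaned = []
    for paragraph in paragraphs:
        # paragraph[0] is the first tokenized sentence, as in the loops above.
        words = [re.sub(r'--', '', word) for word in paragraph[0]]
        cleaned.append(' '.join(words))
    return cleaned

# Toy input shaped like the output of a corpus reader's paras() method.
toy = [[["Of", "Man", "--", "first"], ["ignored", "sentence"]],
       [["Book", "I"]]]
print(first_sentences(toy))
```

With this helper, a cell such as the Whitman one could shrink to `leaves_paras = first_sentences(gutenberg.paras('whitman-leaves.txt'))`.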


In [12]:
# Label each paragraph with its author (or with the work's name where there is no single author).
paradise_paras = [[para, "Milton"] for para in paradise_paras]
blake_paras = [[para, "Blake"] for para in blake_paras]
leaves_paras = [[para, "Whitman"] for para in leaves_paras]
hiawatha_paras = [[para, "Longfellow"] for para in hiawatha_paras]
endymion_paras = [[para, "Keats"] for para in endymion_paras]
odyssey_paras = [[para, "Homer"] for para in odyssey_paras]
burns_paras = [[para, "Burns"] for para in burns_paras]
sea_paras = [[para, "Doolittle"] for para in sea_paras]
beowulf_paras = [[para, "Beowulf"] for para in beowulf_paras]
sappho_paras = [[para, "Sappho"] for para in sappho_paras]

# Combine the labeled paragraphs from all ten works into one data frame.
paragraphs = pd.DataFrame(paradise_paras + blake_paras + 
                         leaves_paras + hiawatha_paras + 
                         endymion_paras + odyssey_paras +
                         burns_paras + sea_paras +
                         beowulf_paras + sappho_paras)
paragraphs.head()

                                                   0       1
0              [ Paradise Lost by John Milton 1667 ]  Milton
1                                             Book I  Milton
2  Of Man ' s first disobedience , and the fruit ...  Milton
3                                            Book II  Milton
4  High on a throne of royal state , which far Ou...  Milton


In [13]:
print(len(paradise_paras))
print(len(blake_paras))
print(len(leaves_paras))
print(len(hiawatha_paras))
print(len(endymion_paras))
print(len(odyssey_paras))
print(len(burns_paras))
print(len(sea_paras))
print(len(beowulf_paras))
print(len(sappho_paras))
print(len(paragraphs))

29
284
2478
681
140
216
11621
252
1164
461
17326
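The counts above are heavily unbalanced, which matters when reading the accuracy scores later on. As a quick worked check, the sketch below recomputes the total and the accuracy a classifier would get by always predicting the most common label. The dictionary simply restates the printed counts.

```python
# Paragraph counts per label, copied from the output above.
counts = {
    "Milton": 29, "Blake": 284, "Whitman": 2478, "Longfellow": 681,
    "Keats": 140, "Homer": 216, "Burns": 11621, "Doolittle": 252,
    "Beowulf": 1164, "Sappho": 461,
}

total = sum(counts.values())
majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])

# Accuracy of always guessing the most common label.
baseline = majority_count / total
print(total)
print(majority_label, round(baseline, 3))
```

Any model accuracy is best judged against this majority-class baseline (about 0.67 here) rather than against zero.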


# TF-IDF

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

X_train, X_test, y_train, y_test = train_test_split(paragraphs.drop(1,axis=1),paragraphs[1], 
                                                    test_size=0.25, random_state=0)

vectorizer = TfidfVectorizer(max_df=0.5, # drop words that occur in more than half the paragraphs
                             min_df=3, # only use words that appear in at least three paragraphs
                             stop_words='english', 
                             lowercase=True, #convert everything to lower case so capitalized and lower-case forms count as the same word
                             use_idf=True,#we definitely want to use inverse document frequencies in our weighting
                             norm=u'l2', #Applies a correction factor so that longer paragraphs and shorter paragraphs get treated equally
                             smooth_idf=True #Adds 1 to all document frequencies, as if an extra document existed that used every word once.  Prevents divide-by-zero errors
                            )
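To make the effect of `max_df` and `min_df` concrete, here is a small sketch on four made-up documents (not the corpus above). With `max_df=0.5`, a word found in more than half of the documents is dropped; with `min_df=2`, a word must appear in at least two documents to be kept.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents.
docs = [
    "the sea the shore",
    "the sea and sky",
    "the rose and thorn",
    "the rose in bloom",
]

toy_vectorizer = TfidfVectorizer(max_df=0.5, min_df=2)
toy_vectorizer.fit(docs)

# "the" is in all four documents (too common); "shore", "sky", "thorn",
# "in" and "bloom" are in only one each (too rare).
print(sorted(toy_vectorizer.vocabulary_))
```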

In [15]:
#Applying the vectorizer
paras_tfidf=vectorizer.fit_transform(paragraphs.drop(1,axis=1)[0].tolist())
print("Number of features: %d" % paras_tfidf.get_shape()[1])

#splitting into training and test sets
X_train_tfidf, X_test_tfidf= train_test_split(paras_tfidf, test_size=0.25, random_state=0)

#Reshapes the vectorizer output into something people can read
X_train_tfidf_csr = X_train_tfidf.tocsr()
X_test_tfidf_csr = X_test_tfidf.tocsr()

#number of paragraphs
n = X_train_tfidf_csr.shape[0]
#A list of dictionaries, one per paragraph
tfidf_bysent = [{} for _ in range(0,n)]
#List of features (on scikit-learn versions before 1.0, use get_feature_names() instead)
terms = vectorizer.get_feature_names_out()
#for each paragraph, lists the feature words and their tf-idf scores
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bysent[i][terms[j]] = X_train_tfidf_csr[i, j]


Number of features: 9301
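In the cell above, the vectorizer is fit on every paragraph before the split, so the vocabulary and document frequencies also reflect the test paragraphs. One possible alternative, sketched here on made-up texts, is to split the raw strings first and wrap the vectorizer and classifier in a pipeline so that the vocabulary is learned from the training portion only. The toy data and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy texts from two authors.
toy_texts = ["sea shore wave", "wave sea foam", "foam sea tide", "tide shore sea",
             "rose thorn bloom", "bloom rose petal", "petal thorn rose", "rose garden bloom"]
toy_authors = ["A"] * 4 + ["B"] * 4

# Split the raw strings first, then let the pipeline fit TF-IDF on the training part.
txt_train, txt_test, y_tr, y_te = train_test_split(
    toy_texts, toy_authors, test_size=0.25, stratify=toy_authors, random_state=0
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(txt_train, y_tr)

# The vocabulary contains only words seen in the training texts.
learned_vocabulary = set(model.named_steps["tfidfvectorizer"].vocabulary_)
print(sorted(learned_vocabulary))
print(model.score(txt_test, y_te))
```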


In [18]:
paragraphs.drop(1,axis=1)[0]

0                    [ Paradise Lost by John Milton 1667 ]
1                                                   Book I
2        Of Man ' s first disobedience , and the fruit ...
3                                                  Book II
4        High on a throne of royal state , which far Ou...
5                                                 Book III
6        Hail , holy Light , offspring of Heaven firstb...
7        00021053 Thou , therefore , whom thou only can...
8                                                  Book IV
10       00081429 Which to our general sire gave prospe...
11                                                  Book V
12       Now Morn , her rosy steps in the eastern clime...
13                                                 Book VI
14       All night the dreadless Angel , unpursued , Th...
15                                                Book VII
16       Descend from Heaven , Urania , by that name If...
17                                               Book VI

In [23]:
rfc = ensemble.RandomForestClassifier()

rfc.fit(X_train_tfidf_csr, y_train)

print('Training set score:', rfc.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', rfc.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(rfc, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 0.9696783130675697

Test set score: 0.8047091412742382

Cross Validation:
    0.76 (+/- 0.01)
[0.75345622 0.73502304 0.76267281 0.78341014 0.7718894  0.76958525
 0.75576037 0.74884793 0.76728111 0.74884793]


In [20]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
lr.fit(X_train_tfidf_csr, y_train)

print('Training set score:', lr.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', lr.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

score = cross_val_score(lr, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 0.8355394797598892

Test set score: 0.801477377654663

Cross Validation:
    0.71 (+/- 0.01)
[0.71889401 0.70967742 0.70967742 0.70506912 0.71198157 0.7235023
 0.70967742 0.70506912 0.71658986 0.71889401]


In [24]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train_tfidf_csr, y_train)

print('Training set score:', clf.score(X_train_tfidf_csr, y_train))
print('\nTest set score:', clf.score(X_test_tfidf_csr, y_test))

split = StratifiedShuffleSplit(n_splits=10, random_state=1337)

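# Score the gradient boosting model (clf). The cross-validation output recorded
# below came from a run that passed lr here, so it repeats the logistic regression scores.
score = cross_val_score(clf, X_test_tfidf_csr, y_test, cv=split, scoring='accuracy')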
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

Training set score: 0.8356164383561644

Test set score: 0.7705447830101569

Cross Validation:
    0.71 (+/- 0.01)
[0.71889401 0.70967742 0.70967742 0.70506912 0.71198157 0.7235023
 0.70967742 0.70506912 0.71658986 0.71889401]


# Bag Of Words

In [None]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
def bow_features(sentences, common_words):
    
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

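# Set up the bags. The *_doc variables are assumed to be spaCy Doc objects
# parsed from each work's text; they are not created in this notebook.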
blakewords = bag_of_words(blake_doc)
paradisewords = bag_of_words(paradise_doc)
leaveswords = bag_of_words(leaves_doc)
hiawathawords = bag_of_words(hiawatha_doc)
endymionwords = bag_of_words(endymion_doc)
odysseywords = bag_of_words(odyssey_doc)
burnswords = bag_of_words(burns_doc)
seawords = bag_of_words(sea_doc)
beowulfwords = bag_of_words(beowulf_doc)
sapphowords = bag_of_words(sappho_doc)

# Combine bags to create a set of unique words.
common_words = set(blakewords + paradisewords + 
                   leaveswords + hiawathawords + 
                   endymionwords + odysseywords +
                   burnswords + seawords +
                   beowulfwords +sapphowords)
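The row-by-row `df.loc` updates in `bow_features` can be slow on roughly 17,000 paragraphs. A possible faster route, sketched below, is scikit-learn's `CountVectorizer` with a fixed vocabulary. This is a swapped-in technique rather than the notebook's method: it works on plain strings, so it does not lemmatize the way the spaCy-based function does. The toy word list and paragraphs are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for common_words and the paragraph strings.
toy_common_words = ["rose", "sea", "thorn"]
toy_paragraphs = ["The sea , the sea !", "A rose and a thorn", "No match here"]

# With a fixed vocabulary, only these words are counted, in this column order.
counter = CountVectorizer(vocabulary=toy_common_words, lowercase=True)
counts = counter.transform(toy_paragraphs)

toy_word_counts = pd.DataFrame(counts.toarray(), columns=toy_common_words)
print(toy_word_counts)
```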

In [None]:
print(common_words)

In [None]:
# Create our data frame with features. This can take a while to run.
start_time = time.time()
print('Processing...')

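# `sentences` is not defined in this notebook; it is assumed to be a two-column
# frame like `paragraphs`, with spaCy-parsed text in column 0 and the label in column 1.
word_counts = bow_features(sentences, common_words)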

t= round((time.time() - start_time),4)
print("\n -- %s seconds --\n" % t)

word_counts.head()

In [None]:
#increase common words
#explore tf-idf

In [None]:
# Define the features and the outcome.
y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))



In [None]:
start_time = time.time()

# Normalize the data.
X_norm = normalize(X)

# Reduce to half the original number of features; only the first two
# components are plotted below.
X_pca = PCA(n_components=X_norm.shape[1] // 2).fit_transform(X_norm)

# Calculate predicted values.
y_pred = KMeans(n_clusters=10, random_state=42).fit_predict(X_pca)

# Plot the solution.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_pred, alpha = 0.5)
plt.show()

# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred, y))

print("--- %s seconds for model fit ---" % (time.time() - start_time))

In [None]:
X_norm.shape

In [None]:
#plot true values
start_time = time.time()

# Normalize the data.
X_norm = normalize(X)

#pca = PCA(2).fit(X_norm)
# Reduce to half the original number of features; only the first two
# components are plotted below.
X_pca = PCA(n_components=X_norm.shape[1] // 2).fit_transform(X_norm)

# Calculate predicted values.
y_pred = KMeans(n_clusters=10, random_state=42).fit_predict(X_pca)

# Encode each author as an integer so that every author gets its own colour.
labels = y.astype('category').cat.codes
# Plot the solution.
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels, alpha = 0.5)
plt.show()

# Check the solution against the data.
print('Comparing k-means clusters against the data:')
print(pd.crosstab(y_pred, y))

print("--- %s seconds to model ---" % (time.time() - start_time))

In [None]:
start_time = time.time()

# Each batch will be made up of 200 data points.
minibatchkmeans = MiniBatchKMeans(
    init='random',
    n_clusters=10,
    batch_size=200)
minibatchkmeans.fit(X_pca)

# Predict cluster memberships with the mini-batch model.
predict_mini = minibatchkmeans.predict(X_pca)

# Check the MiniBatch model against our earlier one.
print('Comparing k-means and mini batch k-means solutions:')
print(pd.crosstab(predict_mini, y_pred))

print("--- %s seconds to model ---" % (time.time() - start_time))
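For the holdout question in the assignment, one simple stability check is to fit the clusters on the training portion only and then see how the held-out points are distributed across them. The sketch below uses synthetic blobs from `make_blobs` as a stand-in for the reduced features, so the numbers are illustrative and say nothing about this corpus.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PCA-reduced features.
points, _ = make_blobs(n_samples=400, centers=3, random_state=0)
pts_train, pts_test = train_test_split(points, test_size=0.25, random_state=0)

# Fit on the training points only, then assign the held-out points.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(pts_train)
test_clusters = km.predict(pts_test)

# Share of points per cluster in each portion; similar shares suggest stability.
train_share = np.bincount(km.labels_, minlength=3) / len(pts_train)
test_share = np.bincount(test_clusters, minlength=3) / len(pts_test)
print(np.round(train_share, 2))
print(np.round(test_share, 2))
```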

In [None]:
#pca(reduce by half then only plot first 2) or lsa
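The note above mentions LSA as an alternative to PCA. A minimal sketch of that idea, assuming TF-IDF input and scikit-learn's `TruncatedSVD` (which accepts sparse matrices directly), could look like this. The six toy documents and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy documents on two themes.
toy_docs = [
    "sea shore wave", "wave sea foam", "foam sea tide",
    "rose thorn bloom", "bloom rose petal", "petal thorn rose",
]

tfidf = TfidfVectorizer().fit_transform(toy_docs)  # sparse matrix

# LSA: truncated SVD on the TF-IDF matrix, then rescale each row to unit length.
lsa = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                    Normalizer(copy=False))
docs_lsa = lsa.fit_transform(tfidf)
print(docs_lsa.shape)
```

The two columns of `docs_lsa` can then be passed to a clustering step or plotted, in the same way the first two PCA components are plotted above.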

In [None]:
from sklearn import ensemble
from sklearn.model_selection import train_test_split

y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y,
                                                    test_size=0.25,
                                                    stratify = y,
                                                    random_state=0)

split = StratifiedShuffleSplit(n_splits=5, random_state=1337)

In [None]:
y_test.value_counts()

In [None]:
# Spot Check Algorithms
models = []
models.append(('NBB', BernoulliNB()))
models.append(('RFC', ensemble.RandomForestClassifier()))
models.append(('KNN', neighbors.KNeighborsClassifier()))
models.append(('DTC', tree.DecisionTreeClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('SVC', SVC()))
models.append(('GBC', ensemble.GradientBoostingClassifier()))
models.append(('ABC', ensemble.AdaBoostClassifier()))
models.append(('ETC', ensemble.ExtraTreesClassifier()))
models.append(('QDA', QuadraticDiscriminantAnalysis()))
#models.append(('GMM', GaussianMixture()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    split = StratifiedShuffleSplit(n_splits=10, random_state=1337)
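    # cross_val_score clones the model and refits it on every split, so no
    # separate fit is needed; scoring on the training data keeps the holdout set unseen.
    cv_results = cross_val_score(model, X_train, y_train, cv=split, scoring='accuracy')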
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

In [None]:
rfc = ensemble.RandomForestClassifier()
rfc.fit(X_train, y_train)

print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

split = StratifiedShuffleSplit(n_splits=5, random_state=1337)

score = cross_val_score(rfc, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # No need to specify l2 as it's the default. But we put it for demonstration.
lr.fit(X_train, y_train)

print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

split = StratifiedShuffleSplit(n_splits=5, random_state=1337)

score = cross_val_score(lr, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)

In [None]:
clf = ensemble.GradientBoostingClassifier()
clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))


split = StratifiedShuffleSplit(n_splits=5, random_state=1337)

score = cross_val_score(clf, X_test, y_test, cv=split, scoring='accuracy')
print("\nCross Validation:\n    %0.2f (+/- %0.2f)" % (score.mean(), score.std()))
print(score)
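Because one label dominates this corpus, overall accuracy can hide poor results on the smaller classes. A per-author breakdown is one way to check, sketched below with made-up labels (not results from the models above) and scikit-learn's `confusion_matrix`, `recall_score`, and `classification_report`.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Hypothetical true labels and predictions: the model always says "Burns".
y_true = ["Burns"] * 6 + ["Keats"] * 2
y_hat = ["Burns"] * 8

label_order = ["Burns", "Keats"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_true, y_hat, labels=label_order)
print(cm)

# Recall per author: the share of each author's texts that were identified.
per_author_recall = recall_score(y_true, y_hat, labels=label_order, average=None)
print(per_author_recall)

# zero_division=0 avoids a warning for authors that are never predicted.
print(classification_report(y_true, y_hat, zero_division=0))
```

In this toy case the accuracy is 0.75 even though no Keats text is ever recognized, which is the kind of gap the per-author view exposes.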