# Text Mining DocSouth Slave Narrative Archive
---

*Note:* This notebook is one in [a series of documents and notebooks](https://jeddobson.github.io/textmining-docsouth/) that document and evaluate machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, _Digital Humanities and the Search for a Method_. I have published a critique of some existing methods (Dobson 2016) that takes up some of these concerns and provides theoretical background for my account of computational methods as used within the humanities.

### Revision Date and Notes:

10/12/2017: Initial version (james.e.dobson@dartmouth.edu)

### Producing Topic Models from DocSouth North American Slave Narrative Texts



In [1]:
# load Natural Language Toolkit
import nltk
print("nltk version: ",nltk.__version__)

# load scikit-learn 
import sklearn
from sklearn.feature_extraction import text
from sklearn import decomposition
from sklearn import datasets
print("sklearn version: ",sklearn.__version__)

nltk version:  3.2.2
sklearn version:  0.18.1


In [2]:
# load the local helper library and all the texts
import sys

sys.path.append("lib")
import docsouth_utils

neh_slave_archive = docsouth_utils.load_narratives()

In [3]:
#
# create input list of documents for the topic model
# and perform additional preprocessing (stopword removal, 
#  lowercase, dropping non-alpha characters, etc.)
#

topic_model_source=list()
for i in neh_slave_archive:
    preprocessed=docsouth_utils.preprocess(i['text'])
    topic_model_source.append(' '.join(preprocessed))
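
The `docsouth_utils.preprocess` helper is not shown in this notebook. As a rough illustration of the steps named in the comment above (lowercasing, dropping non-alphabetic characters, stopword removal), a minimal sketch might look like the following. The function name `preprocess_sketch`, the tiny stopword set, and the sample sentence are all assumptions made for illustration; the real helper may tokenize differently and use a much fuller stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption); the real helper
# likely relies on a much larger list.
SKETCH_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "it"}

def preprocess_sketch(raw_text, stopwords=SKETCH_STOPWORDS):
    """Hypothetical stand-in for docsouth_utils.preprocess."""
    # lowercase, then keep only runs of ASCII letters as tokens,
    # which drops digits and punctuation
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    # remove stopwords
    return [t for t in tokens if t not in stopwords]

# invented sample sentence, not taken from the archive
sample = "The Author was born in 1818, and went North!"
print(' '.join(preprocess_sketch(sample)))
```

Like the real helper, the sketch returns a list of tokens, which is why the cell above joins them with spaces before handing each document to the vectorizer.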

In [11]:
# topics to model
num_topics = 20

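# vocabulary size: each vectorizer keeps only its num_features most
# frequent terms (the topic listings below also print this many terms)
num_features = 50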

print('reading files and loading into vectorizer')
print('generating',num_topics,'topics with',num_features,'features')

# make explicit default flags and parameters
# (input='content' because fit_transform receives the document strings themselves)
vectorizer = text.CountVectorizer(input='content',
#                                  ngram_range=(2,4),
                                  lowercase=True,
                                  max_features=num_features, 
                                  strip_accents='unicode',
                                  stop_words='english')

# replace text that the vectorizer cannot read
vectorizer.decode_error='replace'
counts = vectorizer.fit_transform(topic_model_source)
tfidf = text.TfidfTransformer().fit_transform(counts)


# for LDA (raw term counts only, no TF-IDF weighting);
# features here are 2- to 4-word n-grams
lda_vectorizer = text.CountVectorizer(max_df=0.95, min_df=2,
                                max_features=num_features,
                                lowercase=True,
                                ngram_range=(2,4),
                                strip_accents='unicode',
                                stop_words='english')

lda_tf = lda_vectorizer.fit_transform(topic_model_source)


# fit to models
nmf_model = decomposition.NMF(n_components=num_topics).fit(tfidf)


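# note: n_topics is the keyword in scikit-learn 0.18.x (the version used here);
# releases from 0.19 onward call it n_components
lda_model = decomposition.LatentDirichletAllocation(n_topics=num_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda_model.fit(lda_tf)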

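# extract topics: argsort() orders term indices by ascending weight, so the
# reversed slice [:-num_features - 1:-1] yields the top num_features terms,
# heaviest first (here that is the entire 50-term vocabulary)
# (get_feature_names() matches the 0.18.1 install above; scikit-learn 1.0+
#  renames it get_feature_names_out())
print("NMF Model:")
feature_names = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(nmf_model.components_):
    print("Topic #%d:" % topic_idx)
    print(", ".join([feature_names[i] for i in topic.argsort()[:-num_features - 1:-1]]))
    print()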

print("--------------------")
print("LDA Model:")
feature_names = lda_vectorizer.get_feature_names()
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" % topic_idx)
    print(", ".join([feature_names[i] for i in topic.argsort()[:-num_features - 1:-1]]))
    print()


reading files and loading into vectorizer
generating 20 topics with 50 features
NMF Model:
Topic #0:
people, work, time, way, make, great, come, place, came, say, slavery, home, day, know, went, life, new, long, little, good, church, thought, men, god, free, country, colored, house, left, children, called, years, lord, man, white, took, told, state, soon, slaves, slave, shall, saw, said, old, night, negro, mother, master, away

Topic #1:
church, years, men, new, called, good, lord, great, state, people, little, god, work, shall, white, life, took, children, say, went, said, left, time, told, place, day, free, country, know, come, colored, home, came, house, master, long, make, man, way, mother, negro, night, old, saw, slave, slavery, slaves, soon, thought, away

Topic #2:
great, country, people, man, good, place, soon, time, called, told, took, long, house, little, left, new, saw, make, thought, mother, said, slaves, way, life, say, shall, came, colored, children, god, free, day, come,