# Text Mining DocSouth Slave Narrative Archive
---

*Note:* This is one in [a series of documents and notebooks](https://jeddobson.github.io/textmining-docsouth/) that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, _Digital Humanities and the Search for a Method_. I have published a critique of some existing methods (Dobson 2016) that takes up some of these concerns and provides some theoretical background for my account of computational methods as used within the humanities.

### Revision Date and Notes:

10/12/2017: Initial version (james.e.dobson@dartmouth.edu)

### Producing Topic Models from DocSouth North American Slave Narrative Texts
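
The cells in this section chain three steps: raw term counts, TF-IDF weighting, and a non-negative matrix factorization (NMF) whose components are read as topics. As a minimal sketch of that chain, the block below runs it on a four-document toy corpus. The corpus, the `top_terms` helper, and the choice of two topics are assumptions made for illustration; none of them come from the DocSouth archive or from `docsouth_utils`.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy corpus (an assumption for illustration, not archive text).
toy_corpus = [
    "church meeting brother lord church prayer",
    "lord prayer church brother meeting",
    "school teacher children school books",
    "children books teacher school learning",
]


def top_terms(model, feature_names, n_top):
    """Return the n_top highest-weighted terms for each fitted component."""
    topics = []
    for component in model.components_:
        # argsort is ascending, so reverse it and keep the first n_top indices
        top_idx = component.argsort()[::-1][:n_top]
        topics.append([feature_names[i] for i in top_idx])
    return topics


# Step 1: raw term counts (lowercased, English stop words removed).
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(toy_corpus)

# Step 2: reweight the counts with TF-IDF.
tfidf = TfidfTransformer().fit_transform(counts)

# Step 3: factorize the TF-IDF matrix into two non-negative components.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500).fit(tfidf)

# Recover the terms in column order from the fitted vocabulary mapping.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

toy_topics = top_terms(nmf, feature_names, n_top=3)
for idx, terms in enumerate(toy_topics):
    print("Topic #%d: %s" % (idx, ", ".join(terms)))
```

Listing only a few top-weighted terms per component, rather than every extracted feature, tends to make the topics easier to tell apart. The sketch reads the feature names from `vocabulary_` so that it does not depend on a particular scikit-learn release.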



In [2]:
# load Natural Language Toolkit
import nltk
print("nltk version: ",nltk.__version__)

# load scikit-learn 
import sklearn
from sklearn.feature_extraction import text
from sklearn import decomposition
from sklearn import datasets
print("sklearn version: ",sklearn.__version__)

nltk version:  3.2.2
sklearn version:  0.18.1


In [3]:
# load the local helper library and all of the texts
import sys

sys.path.append("lib")
import docsouth_utils

neh_slave_archive = docsouth_utils.load_narratives()
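
The source of `docsouth_utils.preprocess`, which the next cell applies to each text, is not shown in this notebook. The comments in that cell describe it as removing stopwords, lowercasing, and dropping non-alphabetic characters. The block below is a hypothetical stand-in that performs those three steps with the standard library only; the function name `preprocess_sketch`, the tiny stopword set, and the sample sentence are all assumptions, and the real helper may behave differently.

```python
import re

# Deliberately tiny stopword set (an assumption; the real helper
# presumably uses a much fuller list).
ILLUSTRATIVE_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was"}


def preprocess_sketch(raw_text, stopwords=ILLUSTRATIVE_STOPWORDS):
    """Lowercase, keep only alphabetic runs, and drop stopwords."""
    lowered = raw_text.lower()
    # Runs of a-z become tokens; digits and punctuation act as separators.
    tokens = re.findall(r"[a-z]+", lowered)
    return [token for token in tokens if token not in stopwords]


# Invented sample sentence, not a quotation from the archive.
sample = "In 1845 the ship left New York and sailed to the South."
print(" ".join(preprocess_sketch(sample)))
```

A pattern this simple also splits contractions at the apostrophe and discards accented letters, so it should be read as an outline of the steps rather than as a substitute for the library function.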

In [6]:
#
# create input list of documents for the topic model
# and perform additional preprocessing (stopword removal, 
#  lowercase, dropping non-alpha characters, etc.)
#

topic_model_source=list()
for i in neh_slave_archive:
    preprocessed=docsouth_utils.preprocess(i['text'])
    topic_model_source.append(' '.join(preprocessed))

# topics to model
num_topics = 20

# features to extract
num_features = 100

print('reading files and loading into vectorizer')
print('generating',num_topics,'topics with',num_features,'features')

# make explicit default flags and parameters; `input` names the kind of
# object passed to fit_transform() ('content' means raw strings), and
# `lowercase` takes a boolean rather than a string
vectorizer = text.CountVectorizer(input='content',lowercase=True,
                                  max_features=num_features, 
                                  strip_accents='unicode',
                                  stop_words='english',
                                  # replace text that the vectorizer cannot read
                                  decode_error='replace')

counts = vectorizer.fit_transform(topic_model_source)
tfidf = text.TfidfTransformer().fit_transform(counts)

# fit to model
nmf = decomposition.NMF(n_components=num_topics).fit(tfidf)

# extract topics 
feature_names = vectorizer.get_feature_names()
for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(", ".join([feature_names[i] for i in topic.argsort()[:-num_features - 1:-1]]))
    print()

reading files and loading into vectorizer
generating 20 topics with 100 features
Topic #0:
time, day, city, new, world, state, place, hand, left, night, shall, soon, days, years, seen, states, year, make, hands, home, given, took, men, life, taken, morning, long, country, say, received, death, heard, woman, called, come, large, let, great, young, far, brought, friends, kind, land, mother, money, gave, away, right, know, house, black, think, family, way, sent, freedom, things, colored, saw, free, brother, came, children, church, felt, god, better, best, father, good, man, heart, john, work, wife, white, went, told, thought, tell, south, slaves, slavery, slave, school, said, race, power, poor, people, old, negro, mind, meeting, master, lord, little, knew, asked

Topic #1:
church, god, brother, years, city, meeting, year, shall, called, men, great, new, good, people, received, young, say, state, school, life, took, best, time, gave, friends, work, white, little, left, world, power, death,