# Text Mining DocSouth Slave Narrative Archive
---

*Note:* This is one in [a series of documents and notebooks](https://jeddobson.github.io/textmining-docsouth/) that will document and evaluate various machine learning and text mining tools for use in literary studies. These notebooks form the practical and critical archive of my book-in-progress, _Digital Humanities and the Search for a Method_. I have published a critique of some existing methods (Dobson 2016) that takes up some of these concerns and provides some theoretical background for my account of computational methods as used within the humanities.

### Revision Date and Notes:

10/12/2017: Initial version (james.e.dobson@dartmouth.edu)

### Producing Topic Models from DocSouth North American Slave Narrative Texts



In [None]:
# local Natural Language Toolkit
import nltk
print("nltk version:", nltk.__version__)

# load scikit-learn 
import sklearn
from sklearn.feature_extraction import text
from sklearn import decomposition
from sklearn import datasets
print("sklearn version:", sklearn.__version__)

In [None]:
# load the local helper library and all of the texts
import sys

sys.path.append("lib")
import docsouth_utils

neh_slave_archive = docsouth_utils.load_narratives()

In [None]:
#
# create input list of documents for the topic model
# and perform additional preprocessing (stopword removal, 
#  lowercase, dropping non-alpha characters, etc.)
#

topic_model_source=list()
for i in neh_slave_archive:
    preprocessed=docsouth_utils.preprocess(i['text'])
    topic_model_source.append(' '.join(preprocessed))
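
The comment in the cell above lists the preprocessing steps (stopword removal, lowercasing, dropping non-alpha characters), but the body of `docsouth_utils.preprocess` is not shown here. As a rough illustration of those steps only, here is a minimal sketch. The function name `simple_preprocess`, the tiny `STOPWORDS` set, and the sample sentence are all hypothetical; the real helper may behave differently.

```python
import re

# Hypothetical stand-in for docsouth_utils.preprocess; the real helper may differ.
# A deliberately tiny stopword set for illustration; an actual run would
# likely use a fuller list (for example, NLTK's English stopwords).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "was"}

def simple_preprocess(raw_text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    # Lowercasing first means the pattern only needs to match a-z;
    # digits and punctuation act as token separators and are discarded.
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return [t for t in tokens if t not in stopwords]

# Made-up sample sentence, not a quotation from the archive.
sample = "The river was wide, and the road to the North was long."
print(simple_preprocess(sample))
```

Like the cell above, this returns a list of tokens, so the result can be joined back into a single string with `' '.join(...)` before it is handed to the vectorizer.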

In [None]:
# topics to model
num_topics = 20

# features to extract
num_features = 50

print('reading files and loading into vectorizer')
print('generating',num_topics,'topics with',num_features,'features')

# LDA works on raw term counts (TF only, no IDF weighting)
lda_vectorizer = text.CountVectorizer(max_df=0.95, min_df=2,
                                max_features=num_features,
                                lowercase=True,
                                ngram_range=(2,4),
                                strip_accents='unicode',
                                stop_words='english',
                                decode_error='replace')

lda_tf = lda_vectorizer.fit_transform(topic_model_source)

# fit to model
lda_model = decomposition.LatentDirichletAllocation(n_components=num_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                batch_size=128,
                                max_doc_update_iter=100,
                                random_state=None)
lda_model.fit(lda_tf)

print("LDA Model:")
feature_names = lda_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda_model.components_):
    print("Topic #%d:" % topic_idx)
    print(", ".join([feature_names[i] for i in topic.argsort()[:-num_features - 1:-1]]))
    print()
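
The vectorizer in the cell above is configured with `ngram_range=(2,4)`, so every feature in the topic listing is a phrase of two to four consecutive words; single words never appear as features. The sketch below shows, in plain Python, which phrases such a range yields for a short token list. The helper name `word_ngrams` and the sample tokens are hypothetical, and this is an illustration of the idea rather than scikit-learn's own implementation.

```python
def word_ngrams(tokens, min_n=2, max_n=4):
    """Return every contiguous word n-gram with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# Made-up tokens for illustration only.
tokens = ["river", "wide", "road", "north"]
for gram in word_ngrams(tokens):
    print(gram)
```

Four tokens produce three bigrams, two trigrams, and one four-gram. With `max_features` set to `num_features`, only the most frequent of these phrases across the whole corpus are kept as the vocabulary.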