The goal of the code below is to use topic modeling to investigate the relationships among some philosophical texts. In a previous step, I scraped them from Project Gutenberg and stored them locally as text files.

To start building the topic model, load some libraries, including numpy and sklearn.

In [1]:
import glob, sys, string
import numpy as np
from sklearn import feature_extraction, decomposition

Next, define some functions to fit the model and inspect the output.

In [2]:
def print_top_words(model, feature_names, n_top_words):
        for topic_idx, topic in enumerate(model.components_):
                print("Topic #%d:" % topic_idx)
                print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

def topic_modeling_prep(filenames, n_features):
        # Build a document-term count matrix straight from the files on disk.
        # Drop words found in more than 95% of documents or in fewer than 2.
        vectorizer = feature_extraction.text.CountVectorizer(
                input='filename', max_df=0.95, min_df=2,
                max_features=n_features, stop_words='english')
        tf_matrix = vectorizer.fit_transform(filenames)
        tf_array = tf_matrix.toarray()
        return vectorizer, tf_matrix, tf_array

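def calc_lda(tf_array, n_topics=10, max_iter=20):
        # newer scikit-learn releases take n_components in place of n_topics
        lda = decomposition.LatentDirichletAllocation(n_components=n_topics, max_iter=max_iter,
                                learning_method='online',
                                learning_offset=50.)
        # doctopic has one row per document and one column per topic
        doctopic = lda.fit_transform(tf_array)
        return doctopic, lda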

def assign_topic(text_names,doctopic,num_topics):
        doctopic_grouped = np.zeros((text_names.shape[0], num_topics))
        for i, name in enumerate(text_names):
                doctopic_grouped[i, :] = np.mean(doctopic[text_names == name, :], axis=0)
        doc_topic_assign = np.argmax(doctopic_grouped, axis=1)
        return doc_topic_assign

def summarize_results(doc_topic_assign, lda, vectorizer, filenames):
        summary = read_summary()
        # strip the leading "data/" and the trailing ".txt" to get the book id
        bookids = [f[5:-4] for f in filenames]

        # for each topic, print out words and documents.
        # get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0+
        feature_names = vectorizer.get_feature_names_out()
        n_top_words = 5
        for topic_idx, topic in enumerate(lda.components_):
                print("Topic #%d:" % topic_idx, end=" ")
                print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

                # which docs are placed in this topic?
                ids = np.where(doc_topic_assign==topic_idx)[0].tolist()
                print("\n".join([summary[(bookids[i],'title')] + " by " + summary[(bookids[i],'author')]  for i in ids]))
                print("\n")

def read_summary():
        # read in summary file with title/author/year info
        summary = {}
        with open('data/summary.txt','r') as f1:
                for line in f1:
                        cols                            = line.split('\t')
                        summary[(str(cols[0]),'title')] = str(cols[1])
                        summary[(str(cols[0]),'author')]= str(cols[2])
                        summary[(str(cols[0]),'year')]  = str(cols[3]).rstrip()
        return summary
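
The slice `topic.argsort()[:-n_top_words - 1:-1]` used above is compact but easy to misread. Here is a minimal sketch of what it does, on a made-up five-word vocabulary with made-up weights (none of these numbers come from the real model):

```python
import numpy as np

# Hypothetical vocabulary and one topic's word weights (illustrative values only).
feature_names = ["moral", "law", "beauty", "greek", "reason"]
topic = np.array([0.9, 0.1, 0.7, 0.05, 0.3])
n_top_words = 3

# argsort() returns indices ordered from smallest to largest weight.
# The slice [:-n_top_words - 1:-1] walks that list backwards from the end,
# so it yields the n_top_words heaviest words, heaviest first.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(" ".join(top_words))  # moral beauty reason
```

An equivalent and arguably more readable form is `topic.argsort()[::-1][:n_top_words]`.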


Now let's run it! This step takes a few seconds. The output, grouped by topic, follows.

In [3]:
# constants
num_topics = 10
num_features = 1000

# read input files
filenames = glob.glob("data/[0-9]*.txt")
text_names = np.asarray(filenames)

# create model
vectorizer, tf_matrix, tf_array = topic_modeling_prep(filenames,num_features)
doctopic, lda = calc_lda(tf_array,num_topics)
doc_topic_assign = assign_topic(text_names,doctopic,num_topics)

# describe results
summarize_results(doc_topic_assign, lda, vectorizer,filenames)


Topic #0: moral principle beauty human object
The Aesthetical Essays by Frederich Schiller
Literary and Philosophical Essays by  Various
Beyond Good and Evil by Friedrich Nietzsche
Utilitarianism by John Stuart Mill
Lectures on the true, the beautiful and the good by Victor Cousin
The Philosophical Letters by Frederich Schiller
The Philosophy of the Moral Feelings by John Abercrombie


Topic #1: greek letter small experience existence
Poetics by Aristotle
The Critique of Pure Reason by Immanuel Kant
Ontology or the Theory of Being by Peter Coffey


Topic #2: government people socrates law opinion
Critical Miscellanies, Vol. 3 (of 3) by John Morley
Apology by Plato
Euthyphro by Plato
The Poetics by Aristotle
The Queen's Matrimonial Ladder by William Hone
Beyond Good and Evil by Friedrich Nietzsche
The English Utilitarians, Volume I. by Leslie Stephen
Second Treatise of Government by John Locke
Considerations on Representative Government by John Stuart Mill
The Republic by Plato
Ion by P
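
To see how a book ends up listed under a topic: `assign_topic` takes the highest-weight topic in each document's row of the document-topic matrix, and `summarize_results` then collects the documents for each topic with `np.where`. A minimal sketch of that idea, using a made-up 4 x 3 matrix and placeholder titles (assumptions for illustration, not values from the real model):

```python
import numpy as np

# Hypothetical document-topic weights: 4 documents (rows) x 3 topics (columns).
doctopic = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])
titles = ["Doc A", "Doc B", "Doc C", "Doc D"]  # placeholder names

# Hard assignment: each document goes to its single highest-weight topic.
doc_topic_assign = np.argmax(doctopic, axis=1)

# Group the documents by their assigned topic.
groups = {}
for topic_idx in range(doctopic.shape[1]):
    ids = np.where(doc_topic_assign == topic_idx)[0].tolist()
    groups[topic_idx] = [titles[i] for i in ids]
    print("Topic #%d:" % topic_idx, groups[topic_idx])
```

A hard assignment like this throws away the secondary weights, so a book that is 51% one topic and 49% another shows up only under the first.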

That's pretty cool! Some of the groups almost make sense! Is there a way to visualize this?

In [4]:
import pyLDAvis
# note: pyLDAvis 3.4+ renamed this module to pyLDAvis.lda_model
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [5]:
pyLDAvis.sklearn.prepare(lda,tf_matrix, vectorizer)