# Topic Modeling on Bukowski's Poems

In the first post about Bukowski's poems we explored the top words and their polarity. From inspection, these groups seemed to be associated with 4 main topics, which also happen to be mentioned on the writer's legacy website. It would be interesting to see if these same topics show up when applying a generative statistical model such as Latent Dirichlet Allocation (LDA). To do this and visualize the results I'll use the pyLDAvis and scikit-learn packages.

In [2]:
## Setup and importing libraries.
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning, module='.*/IPython/.*')

from tqdm import tqdm
import pandas as pd
import gensim
import pyLDAvis
import pyLDAvis.sklearn

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer 

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

pyLDAvis.enable_notebook()

In [3]:
## Load the pickle file.
content = pd.read_pickle('data/content.pkl')

## Add each poem as an element of a list.
raw_poems = []
for p in content.poem.map(int).unique():
    flat_poem = ' '.join([text for text in content.loc[content.poem == p, 'text']])
    raw_poems.append(flat_poem)

In order to perform the LDA, we need to generate a document-term matrix representation of all the poems. Rows in our matrix represent poems (documents), while columns correspond to the different words (terms). The `CountVectorizer` class helps us do this, and it can also perform some pre-processing (removing stop words, generating bigrams, setting minimum and maximum document frequencies, setting everything to lowercase). Here I keep only single words (`ngram_range = (1,1)`).
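To make the shape of this matrix concrete, here is a minimal pure-Python sketch of the counting and filtering steps. It is not how `CountVectorizer` is implemented; the `build_dtm` helper and the toy poems are made up for illustration, and stop-word removal is left out for brevity.

```python
import re
from collections import Counter

def build_dtm(docs, min_df=1, max_df=1.0):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    # Lowercase and keep alphabetic tokens of 3+ letters (same pattern as in this post).
    tokenized = [re.findall(r'\b[a-zA-Z]{3,}\b', doc.lower()) for doc in docs]
    # Document frequency: the number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    # Keep terms found in at least min_df documents and in at most
    # a fraction max_df of all documents.
    vocabulary = sorted(term for term, df in doc_freq.items()
                        if df >= min_df and df <= max_df * n_docs)
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Hypothetical toy corpus: one string per "poem".
toy_poems = ["the wine and the horses", "wine wine and roses", "horses and music"]
vocabulary, matrix = build_dtm(toy_poems, min_df=2, max_df=0.9)
print(vocabulary)  # ['horses', 'wine']
print(matrix)      # [[1, 1], [0, 2], [1, 0]]
```

Each row is a poem and each column a surviving term: "and" is dropped because it appears in every poem, while "the", "roses" and "music" are dropped because they appear in only one.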

In [31]:
## Generate sparse document-term matrix.
dtm_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b',
                                ngram_range = (1,1), 
                                min_df = 10,
                                max_df = 0.2)
dtm_poems = dtm_vectorizer.fit_transform(raw_poems)
print(dtm_poems.shape)

(1363, 1883)


## Fitting and Visualizing the LDA Model
To fit the LDA model, we have to set in advance the number of topics it will associate the words with. This is an arbitrary decision, and in this case I'll start with 7.
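For intuition about why the number of topics is an input rather than something the model discovers, it helps to look at the generative story LDA assumes: each document draws its own mixture over a fixed set of topics, and each word is drawn from one of those topics. The sketch below simulates that process in plain Python. The two topics, their word probabilities and the prior value are hypothetical, and this is the forward process only, not the inference that scikit-learn runs.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word_probs, vocabulary, n_words, doc_topic_prior, rng):
    """Generate one bag of words following LDA's generative story."""
    n_topics = len(topic_word_probs)
    # 1. Draw this document's mixture over the fixed set of topics.
    theta = sample_dirichlet([doc_topic_prior] * n_topics, rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(n_topics), weights=theta)[0]
        word = rng.choices(vocabulary, weights=topic_word_probs[topic])[0]
        words.append(word)
    return theta, words

rng = random.Random(0)
vocabulary = ['wine', 'beer', 'horse', 'track']
# Two hypothetical topics: one about drinking, one about the racetrack.
topic_word_probs = [[0.5, 0.5, 0.0, 0.0],
                    [0.0, 0.0, 0.5, 0.5]]
theta, words = generate_document(topic_word_probs, vocabulary, n_words=8,
                                 doc_topic_prior=0.5, rng=rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the word counts, it estimates the topic-word probabilities and each document's mixture for the number of topics we chose.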

In [32]:
## Note: newer scikit-learn versions name this parameter n_components.
lda_model = LatentDirichletAllocation(n_topics=7)
lda_model.fit(dtm_poems)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_jobs=1, n_topics=7, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [33]:
pyLDAvis.sklearn.prepare(lda_model, dtm_poems, dtm_vectorizer, mds='tsne')

Examining each of the word clusters, or topics, we see some very general ones and others that are too noisy. However, there are a couple of interesting ones:

- **Cluster 3:** The general topic seems to be love and death. Other related words are lives, laughter, reason and madness.
- **Cluster 5:** Horse tracks. The author's visits to this place are also a common subject in his writings, and this gets reflected here.
- **Cluster 6:** Alcohol and its multiple manifestations.
- **Cluster 7:** Music and art, which are also commonly present in Bukowski's life.
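The same kind of inspection can be done outside the interactive plot by listing the highest-weighted terms of each topic. The sketch below assumes a topic-by-term weight matrix like the fitted model's `components_` attribute; the `top_words` helper and the toy numbers are made up for illustration.

```python
def top_words(topic_word_weights, feature_names, n_top=10):
    """For each topic (row), return the n_top terms with the largest weights."""
    result = []
    for weights in topic_word_weights:
        # Column indices sorted from the largest weight to the smallest.
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        result.append([feature_names[j] for j in ranked[:n_top]])
    return result

# Toy stand-in for lda_model.components_ (rows are topics, columns are terms).
toy_weights = [[0.1, 5.0, 2.0, 0.3],
               [4.0, 0.2, 0.1, 3.0]]
toy_terms = ['beer', 'horse', 'track', 'wine']
print(top_words(toy_weights, toy_terms, n_top=2))
# [['horse', 'track'], ['beer', 'wine']]
```

With the objects from this post, the call would be along the lines of `top_words(lda_model.components_, dtm_vectorizer.get_feature_names())`; in newer scikit-learn versions the vectorizer method is `get_feature_names_out()`. Keep in mind that pyLDAvis numbers topics by their size in the plot, so its cluster numbers will not necessarily match the row order of `components_`.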

## Expanding Number of Topics
If we define a larger number of clusters for the LDA model, these will tend to become more granular, particularly the smaller ones. Below I redo the exercise with 25 groups. I won't go into details, but the reader is encouraged to explore.

In [341]:
lda_model = LatentDirichletAllocation(n_topics=25)
lda_model.fit(dtm_poems)
pyLDAvis.sklearn.prepare(lda_model, dtm_poems, dtm_vectorizer, R=20)