# Brief interlude for topic modeling
50,000 reviews is far too many documents for anyone to read just to get an impression of what people are saying. We can use `Latent Dirichlet Allocation` (LDA) to reduce each review to a vector of topic weights, which gives a general sense of what the documents discuss with relatively little overhead.

In a nutshell, LDA finds words that frequently appear together across documents. It learns two matrices, (1) `document-to-topic` and (2) `word-to-topic`, that when multiplied together approximately reproduce the bag-of-words matrix. As with the number of components in PCA, you `need to define the number of topics` up front.
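
To make the two-matrix picture concrete, here is a minimal sketch on a tiny made-up corpus. The four sentences and the names `docs`, `toy_lda`, `doc_topic`, `topic_word`, and `approx_bow` are illustrative assumptions, not part of the `imdb` analysis. It normalizes the rows of `components_` into word probabilities and combines them with the document-topic weights to get expected word counts, which only approximate the real counts.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
docs = [
    "the spaceship landed on the alien planet",
    "aliens attack the spaceship crew in space",
    "the comedy had funny jokes and a funny cast",
    "a romantic comedy with jokes that made us laugh",
]

vec = CountVectorizer(stop_words="english")
bow = vec.fit_transform(docs)            # (n_docs, n_words) word counts

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(bow)   # (n_docs, n_topics), rows sum to 1

# components_ holds unnormalized weights; normalize each row to word probabilities
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

# Expected counts: document length * P(topic | doc) * P(word | topic)
doc_lengths = np.asarray(bow.sum(axis=1)).reshape(-1, 1)
approx_bow = doc_lengths * (doc_topic @ topic_word)

print(doc_topic.shape, topic_word.shape, approx_bow.shape)
```

With only two topics the product cannot match the counts exactly; it spreads each document's words across the vocabulary according to the topics that document leans on.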

Starting with the word frequencies of the `imdb` data:

In [None]:
import datetime

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# time the vectorization and the LDA fit together
st = datetime.datetime.now()

# drop English stop words and words that occur in more than 10% of the documents,
# then keep only the 5,000 most frequent remaining words
count = CountVectorizer(stop_words='english', max_df=.1, max_features=5000)

# sparse bag-of-words matrix: one row per review, one column per word
X = count.fit_transform(df['review'].values)

lda = LatentDirichletAllocation(n_components=10, 
                               random_state=10, 
                               learning_method='batch'
                              )

X_topics = lda.fit_transform(X)

en = datetime.datetime.now()

el = en-st

print(f'Time to complete: {el}')

`lda.components_` holds a weight (importance) for each of the 5,000 words in each topic, with one row per topic and one column per word:

In [None]:
lda.components_.shape

In [None]:
top_words = 5

feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx+1}')
    print(','.join([feature_names[i] for i in topic.argsort()[:-top_words - 1:-1]]))
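
The slice `topic.argsort()[:-top_words - 1:-1]` used above is compact but easy to misread. As a small sketch with made-up numbers (the `weights` and `vocab` arrays are hypothetical), it picks the indices of the largest weights in descending order, and it should be equivalent to reversing the sort and keeping the first few entries:

```python
import numpy as np

# Hypothetical weights for one topic over a six-word vocabulary
weights = np.array([0.2, 3.1, 0.7, 5.4, 1.0, 2.2])
vocab = np.array(["plot", "alien", "music", "space", "actor", "ship"])
top_n = 3

# Original idiom: sort ascending, then step backwards from the end top_n times
idx_slice = weights.argsort()[:-top_n - 1:-1]

# Two-step form: reverse to descending order, then keep the first top_n
idx_plain = weights.argsort()[::-1][:top_n]

print(vocab[idx_slice])  # ['space' 'alien' 'ship']
```

Either form works; the two-step version is arguably easier to read at a glance.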

> To present these to your boss, you would need to turn each word list into a human-readable category. You should also check that the topics match the underlying text, since LDA can extract topics that aren't much stronger than random noise.
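
One rough way to spot weak topics before reading reviews is to count how many documents each topic dominates and how much weight it carries on average. The sketch below is a heuristic under stated assumptions, not a formal coherence measure: `summarize_topics` is a hypothetical helper, and `fake_topics` is a randomly generated stand-in for `X_topics`.

```python
import numpy as np

def summarize_topics(doc_topic):
    """Per topic: number of documents it dominates and its mean weight."""
    dominant = doc_topic.argmax(axis=1)  # strongest topic for each document
    counts = np.bincount(dominant, minlength=doc_topic.shape[1])
    mean_weight = doc_topic.mean(axis=0)
    return counts, mean_weight

# Hypothetical doc-topic matrix standing in for X_topics (each row sums to 1)
rng = np.random.default_rng(0)
fake_topics = rng.dirichlet(alpha=[0.5, 0.5, 0.5], size=200)

counts, mean_weight = summarize_topics(fake_topics)
for topic_idx, (n_docs, weight) in enumerate(zip(counts, mean_weight)):
    print(f'Topic {topic_idx+1}: dominant in {n_docs} docs, mean weight {weight:.2f}')
```

On the real data you would pass `X_topics` instead. A topic that dominates very few reviews, or whose weight is low everywhere, may deserve a closer look before you give it a label.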

In [None]:
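# Assumes topic 1 (column 0) looks like a comedy topic in the output above;
# change the column index if your topics come out differently.
comedy = X_topics[:, 0].argsort()[::-1]
for iter_idx, movie_idx in enumerate(comedy[:3]):
    print(f'\nComedy {iter_idx+1}:')
    # .iloc selects by position, matching the row order of X_topics
    print(df['review'].iloc[movie_idx][:300], '...')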

In [None]:
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx+1}')
    # top words for this topic, highest weight first
    print(','.join([feature_names[i] for i in topic.argsort()[:-top_words - 1:-1]]))

    # reviews ranked by their weight on this topic, highest first
    top_docs = X_topics[:, topic_idx].argsort()[::-1]

    for iter_idx, movie_idx in enumerate(top_docs[:3]):
        print(f'- {iter_idx+1}:')
        print(df['review'].iloc[movie_idx][:300], '...')

    print('------------')

> Do these topics seem reasonable?