![@mikegchambers](../../images/header.png)

# Latent Dirichlet Allocation

In this notebook, we explore Latent Dirichlet Allocation using scikit-learn to carry out topic modeling.

![Letters](letters.png)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

import numpy as np
from scipy import sparse
import pandas as pd

import random

# Data

Supplied with this notebook is a text file.  Each line of the text file is a document taken from the AWS documentation from one of three topics:

- Amazon EC2
- Amazon S3
- Amazon SageMaker

Let's load the documents and print how many we have.

In [None]:
# Read one document per line; the context manager closes the file for us.
with open("corpus.txt", "r") as text_file:
    data_samples = text_file.readlines()
random.shuffle(data_samples)
print(len(data_samples))

Soon we will be processing these documents, so to be able to reference them during testing we store them, whole, in a train and a test set. We set the percentage split here, and will use it again when we split the processed documents.

In [None]:
split_percentage = 90
X_train_document, X_test_document = np.split(data_samples, [int(len(data_samples)*(split_percentage/100))])
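
To make the split concrete, here is a minimal sketch on a hypothetical list of ten toy documents (the `toy_docs` list is an assumption for illustration, not part of the notebook's corpus). It uses the same `np.split` pattern as the cell above.

```python
import numpy as np

# Ten hypothetical documents standing in for the real corpus.
toy_docs = ["doc {}".format(i) for i in range(10)]

split_percentage = 90
# Index at which the list is cut: everything before it is train, the rest is test.
cut = int(len(toy_docs) * (split_percentage / 100))

toy_train, toy_test = np.split(toy_docs, [cut])
print(len(toy_train), len(toy_test))
```

Because the same cut index is reused later on the processed matrix, row `i` of the processed test set should still correspond to document `i` of the raw test set.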

# tf–idf (Term Frequency–Inverse Document Frequency)

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.transform

max_df = 0.95 (Removes terms that appear in more than 95% of the documents)

min_df = 100 (Terms must have DF >= 100 to be considered)

This example is by no means perfect, and the values for min and max DF reflect this. We could get better results by doing more pre-processing of the data, such as removing dates.

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, 
                                   min_df=100,
                                   stop_words='english')

Now we vectorize the documents. We do this to the whole set, and then split the processed documents, to ensure that both the train and test datasets have the same features.

In [None]:
tfidf = tfidf_vectorizer.fit_transform(data_samples)

The vectorizer produces a 'sparse matrix'. We quickly convert it to a normal (dense) array, split it, and then convert each part back to a sparse matrix, which is a format LDA accepts.

A 'sparse matrix' is a matrix in which most of the elements are zero.

In [None]:
tfidf = tfidf.toarray()
l, _ = tfidf.shape

X_train, X_test = np.split(tfidf, [int(l*(split_percentage/100))])

X_train = sparse.csr_matrix(X_train)
X_test = sparse.csr_matrix(X_test)

print(X_train.shape)
print(X_test.shape)
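
As a small illustration of what 'sparse' means in practice, the sketch below builds a hypothetical 4x4 matrix (the values are made up) and stores it in SciPy's CSR format. It also shows that a CSR matrix can be sliced by rows directly, which suggests the dense round trip above is a convenience rather than a requirement.

```python
import numpy as np
from scipy import sparse

# Hypothetical matrix in which most entries are zero.
dense = np.array([
    [0.0, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.2],
])
matrix = sparse.csr_matrix(dense)

# Only the non-zero entries are stored.
stored = matrix.nnz
density = stored / (matrix.shape[0] * matrix.shape[1])
print(stored, density)

# Row slicing works on the sparse matrix itself.
split_at = 3
top, bottom = matrix[:split_at], matrix[split_at:]
print(top.shape, bottom.shape)
```

For a corpus of this size either approach should be fine; on a much larger corpus, slicing the sparse matrix directly would avoid holding the full dense array in memory.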

## Term Frequency Matrix

Let's use Pandas to render a view of the tf-idf weighted term matrix from the tfidf vectorizer. We will add the feature names to the columns, and the rows are the document numbers.

In [None]:
# UPDATE: This cell has had minor changes to work with the newer version of scikit-learn.
df = pd.DataFrame.sparse.from_spmatrix(X_train)
df.columns = tfidf_vectorizer.get_feature_names_out()
df

# Model

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

Now we create our model.  All we need to do is tell it how many topics we want to find.

In [None]:
topics = 3
model = LatentDirichletAllocation(n_components=topics)

This is an unsupervised model, so we have no 'y' data.

In [None]:
model.fit(X_train)
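
LDA fitting involves random initialization, so topic numbering and weights can differ from run to run. The sketch below, on a hypothetical random count matrix, shows one way to make a run repeatable by passing `random_state` (a real `LatentDirichletAllocation` parameter, but one this notebook does not set; the value `0` is an arbitrary assumption).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical data: 20 "documents" over an 8-term vocabulary.
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(20, 8))

# Two models with the same seed, fitted on the same data.
lda_a = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_counts)
lda_b = LatentDirichletAllocation(n_components=3, random_state=0).fit(toy_counts)

# transform() gives one row per document and one column per topic.
doc_topics = lda_a.transform(toy_counts)
print(doc_topics.shape)
print(doc_topics.sum(axis=1))

# Compare the learned topic-word weights of the two runs.
print(np.allclose(lda_a.components_, lda_b.components_))
```

Each row of `doc_topics` sums to 1, which is why it can be read as a distribution over topics.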

# Topic Results

This useful function* formats and prints a summary of the topics.

(* which I found here: https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html)

In [None]:
# UPDATE: This cell has had minor changes to work with the newer version of scikit-learn.

tf_feature_names = tfidf_vectorizer.get_feature_names_out()

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
    
print_top_words(model, tf_feature_names, 10)
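
The slice `topic.argsort()[:-n_top_words - 1:-1]` in the function above is compact but easy to misread. This sketch walks through it on a hypothetical topic with five made-up words and weights.

```python
import numpy as np

# Hypothetical weights for one topic, and the matching vocabulary.
weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])
words = np.array(["api", "bucket", "key", "object", "url"])
n_top_words = 3

# argsort() returns indices ordered from smallest to largest weight.
ascending = weights.argsort()

# Walking backwards from the end and stopping after n_top_words items
# yields the indices of the largest weights, biggest first.
top_idx = ascending[:-n_top_words - 1:-1]
top_words = [str(w) for w in words[top_idx]]
print(top_words)
```

The result lists the three highest-weighted words in descending order of weight.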

# Test

Let's provide a document to the model from our X_test dataset and see what topic it determines.

In [None]:
test_sample = 1

In [None]:
p = model.transform(X_test[test_sample])
print(p)

Models often produce an array of probabilities, one for each possible outcome. Here, `transform()` returns one value per topic for the document. We can use `argmax()` to quickly find the position of the largest value, and therefore the predicted topic.

In [None]:
t = p.argmax()
print("Topic #{}".format(t))
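
The plain `argmax()` above works because `p` holds a single row. As a hedged sketch with made-up numbers, the example below shows the single-document case and what would change when several documents are transformed at once.

```python
import numpy as np

# Hypothetical output for one document: one row, three topics.
p_single = np.array([[0.08, 0.85, 0.07]])
predicted = int(p_single.argmax())
confidence = float(p_single.max())
print("Topic #{} ({:.2f})".format(predicted, confidence))

# Hypothetical output for two documents: argmax needs axis=1
# so that each row gets its own prediction.
p_batch = np.array([
    [0.80, 0.10, 0.10],
    [0.20, 0.30, 0.50],
])
predicted_batch = p_batch.argmax(axis=1)
print(predicted_batch)
```

Without `axis=1`, `argmax()` on a multi-row array returns a position in the flattened array, which would no longer be a topic number.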

And what did the document say?

In [None]:
print(X_test_document[test_sample])