# Topic Modeling on the 20 Newsgroups Dataset

This brief tutorial walks through an end-to-end example of training a topic model on the 20 Newsgroups dataset. To keep things simple, it explores only a few of the parameters and choices involved in training a topic model. Beyond training, the tutorial briefly covers how to examine the results of the topic modeling process, including methods for visualizing the resulting topic assignments.

The first step is to download the 20 Newsgroups dataset using scikit-learn's ``fetch_20newsgroups``:

In [2]:
from sklearn.datasets import fetch_20newsgroups
from tmnt.preprocess.vectorizer import TMNTVectorizer

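# Strip headers, footers, and quoted replies so only the message bodies remain.
# With return_X_y=True the call returns (documents, newsgroup labels).
data, y = fetch_20newsgroups(shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'),
                             return_X_y=True)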

Next, we use :py:class:`tmnt.preprocess.vectorizer.TMNTVectorizer` to map the documents into a document-term matrix (i.e., a matrix in which each row corresponds to a document in the dataset and each column corresponds to a term in the vocabulary).

In [3]:
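# Cap the vocabulary at 2000 terms.
tf_vectorizer = TMNTVectorizer(vocab_size=2000)
# fit_transform returns a pair; only the document-term matrix X is needed here.
X, _ = tf_vectorizer.fit_transform(data)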
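To make the idea of a document-term matrix concrete, here is a small pure-Python sketch on a toy corpus. It does not use or reproduce ``TMNTVectorizer``; the helper ``build_doc_term_matrix``, the whitespace tokenization, and the rule of keeping the most frequent terms are all assumptions made for illustration only.

```python
from collections import Counter


def build_doc_term_matrix(docs, vocab_size):
    """Hypothetical helper: return (vocab, matrix) of raw term counts.

    Assumes lowercased whitespace tokenization and a vocabulary made of the
    vocab_size most frequent terms; a real vectorizer may behave differently.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Corpus-wide term frequencies.
    totals = Counter(tok for toks in tokenized for tok in toks)

    # Keep the most frequent terms; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked[:vocab_size]]
    index = {term: j for j, term in enumerate(vocab)}

    # One row per document, one column per vocabulary term.
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            j = index.get(tok)
            if j is not None:  # out-of-vocabulary terms are dropped
                row[j] += 1
        matrix.append(row)
    return vocab, matrix


toy_docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vocab, matrix = build_doc_term_matrix(toy_docs, vocab_size=3)
print(vocab)
for row in matrix:
    print(row)
```

Each row of ``matrix`` plays the role of one row of ``X`` above: a vector of term counts for a single document, restricted to the chosen vocabulary.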

Given the resulting document-term matrix ``X``, we can fit a model:

In [4]:
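from tmnt.estimator import BowEstimator

# The estimator is constructed from the vectorizer's vocabulary.
vocabulary = tf_vectorizer.get_vocab()
estimator = BowEstimator(vocabulary)
# Train the topic model on the document-term matrix.
_ = estimator.fit(X)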