# nbgallery topic model demo

This example is based on the [scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html).

In [None]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

import nbgallery.database.orm as nbgorm
import nbgallery.notebooks as nbgnb

First, build a corpus from all the notebooks stored in nbgallery.

In [None]:
session = nbgorm.Session()

In [None]:
corpus = []
for nb in session.query(nbgorm.Notebook).all():
    doc = nbgnb.from_model(nb)
    corpus.append(' '.join(doc.sources()))
len(corpus)

Next, use [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) to build a document-term matrix.
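
As a rough sketch of what that weighting does, the pure-Python function below follows the formula scikit-learn documents for the vectorizer's default settings: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The name `tfidf_rows` and the whitespace tokenizer are simplifications made for this illustration; they are not part of scikit-learn or nbgallery.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document.

    Illustrative only: tokenizes on whitespace, unlike TfidfVectorizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)

    # Document frequency: the number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (scikit-learn's default form).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


vocab, rows = tfidf_rows(["import numpy", "import pandas pandas"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the first toy document, `numpy` gets a larger weight than `import` because `import` appears in every document and so carries the minimum IDF of 1.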

In [None]:
ntopics = 5
nwords = 10

In [None]:
# The corpus includes code, so to avoid picking up numeric constants, change
# the default token pattern to require that tokens start with a letter.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b[a-z]\w+\b')
tfidf = vectorizer.fit_transform(corpus)
tfidf
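
To see what the custom token pattern changes, the sketch below applies both regular expressions with the standard `re` module to a made-up line of code. The default pattern shown is the one documented for `TfidfVectorizer`. The `tokens` helper and the sample string are illustrative assumptions, and the explicit `lower()` call stands in for the vectorizer's default lowercasing step.

```python
import re

DEFAULT_PATTERN = r'(?u)\b\w\w+\b'            # scikit-learn's documented default
LETTER_FIRST_PATTERN = r'(?u)\b[a-z]\w+\b'    # pattern used in this notebook


def tokens(pattern, text):
    """Lowercase the text, then return every match of the pattern."""
    return re.findall(pattern, text.lower())


snippet = "x = 42; total_3 = foo(0x1F) + 3.14"

print(tokens(DEFAULT_PATTERN, snippet))
# ['42', 'total_3', 'foo', '0x1f', '14']
print(tokens(LETTER_FIRST_PATTERN, snippet))
# ['total_3', 'foo']
```

Both patterns require at least two characters, so the single-letter name `x` is dropped in either case. The letter-first pattern also drops numeric literals such as `42` and `0x1f`.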

Finally, use [LDA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) to extract topics.

In [None]:
lda = LDA(n_components=ntopics)
lda.fit(tfidf)

In [None]:
# Taken from the scikit-learn example linked at the top of this notebook.
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
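
The slice `[:-n_top_words - 1:-1]` in `print_top_words` is terse, so here is a small sketch of what it selects. It uses a plain-Python `sorted` call as a stand-in for NumPy's `argsort`. The helper `top_indices`, the weights, and the feature names are made up for illustration.

```python
def top_indices(weights, n_top):
    """Return the indices of the n_top largest weights, largest first."""
    # Equivalent of argsort: indices ordered by ascending weight.
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after n_top items.
    return ascending[:-n_top - 1:-1]


weights = [0.1, 0.7, 0.3, 0.9, 0.5]
names = ["alpha", "beta", "gamma", "delta", "epsilon"]

best = top_indices(weights, 3)
print(best)                       # [3, 1, 4]
print([names[i] for i in best])   # ['delta', 'beta', 'epsilon']
```

The negative step reverses the ascending order, and the stop index `-n_top - 1` is exclusive, so exactly `n_top` indices are returned.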

In [None]:
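# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onward.
print_top_words(lda, vectorizer.get_feature_names_out(), nwords)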