<div style="font-size:200%; font-weight:bold; font-variant:small-caps;">Topic Modeling with SciKit Learn</div>

In this notebook we create topic models from our corpus using scikit-learn's text feature extraction and decomposition modules. We save the resulting tables and then use another notebook to explore them.

# Set Up

## Imports

In [None]:
import pandas as pd
import numpy as np
from lib import tapi

## Configuration

In [None]:
tapi.list_dbs()

In [None]:
tapi.list_corpora()

In [None]:
data_prefix = 'winereviews'

In [None]:
db = tapi.Edition(data_prefix)

## Parameters

In [None]:
n_terms = 4000      # Vocabulary size
ngram_range = (1,4) # ngram min and max lengths
n_topics = 20       # Number of topics
max_iter = 5        # Number of iterations for topic model

## Create Tables Object

The tables attached to the `db` object created above constitute a "digital critical edition."

# Import Corpus Data

We import the corpus in our standard format.

In [None]:
corpus = db.get_corpus()

## Inspect contents

In [None]:
corpus

In [None]:
corpus.doc_content.sample(10).to_list()

In [None]:
# corpus.head(10)

# Convert to Bag of Words 

i.e., a __Count Vector Space__

We use scikit-learn's `CountVectorizer` to convert our corpus of documents into a document-term vector space of term counts, where each term is an n-gram whose length falls within `ngram_range`.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
count_engine = CountVectorizer(max_features=n_terms, stop_words='english', ngram_range=ngram_range)
count_model = count_engine.fit_transform(corpus.doc_content)
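
To make the vectorizer's behavior concrete, here is a minimal sketch on a made-up three-document corpus. The documents and the `toy_` names are hypothetical and are not part of our data; the n-gram range is narrowed to (1, 2) only to keep the vocabulary readable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for corpus.doc_content
toy_docs = [
    "ripe cherry and soft tannins",
    "soft tannins and bright acidity",
    "bright cherry fruit",
]

# Unigrams and bigrams, with English stop words (e.g. "and") removed
toy_engine = CountVectorizer(stop_words='english', ngram_range=(1, 2))
toy_counts = toy_engine.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index
toy_terms = sorted(toy_engine.vocabulary_, key=toy_engine.vocabulary_.get)

print(toy_counts.shape)  # (number of documents, number of terms)
print(toy_terms)
```

Note that stop words are removed before n-grams are formed, so bigrams can span a removed word.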

## Get Generated VOCAB

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
db.VOCAB = pd.DataFrame(count_engine.get_feature_names_out(), columns=['term_str'])
db.VOCAB = db.VOCAB.set_index('term_str')
db.VOCAB['ngram_len'] = None # To be added later

In [None]:
db.VOCAB.sample(10)

## Get Generated BOW

We do this just to show what the count vectorizer produced. `DTM` stands for document-term matrix. We convert this sparse matrix into a "thin" dataframe (`BOW`) that keeps, for each document, only the terms with nonzero counts.

In [None]:
db.DTM = pd.DataFrame(count_model.toarray(), index=corpus.index, columns=db.VOCAB.index)
db.BOW = db.DTM.stack().to_frame('n')
db.BOW = db.BOW[db.BOW.n > 0]
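
The wide-to-thin conversion can be seen on a small, made-up count matrix. This is only a sketch; the `toy_` names and values are hypothetical.

```python
import pandas as pd

# Hypothetical 2-document, 3-term count matrix
toy_dtm = pd.DataFrame(
    [[2, 0, 1],
     [0, 3, 0]],
    index=pd.Index(['d1', 'd2'], name='doc_id'),
    columns=pd.Index(['cherry', 'oak', 'tannins'], name='term_str'),
)

# Stack into one row per (doc_id, term_str) pair, then drop the zero counts
toy_bow = toy_dtm.stack().to_frame('n')
toy_bow = toy_bow[toy_bow.n > 0]

print(toy_bow)
```

Of the six cells in the wide matrix, only the three nonzero ones survive in the thin table.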

In [None]:
db.DTM.info(verbose=False)

In [None]:
db.BOW.info(verbose=False)

## Compute TF-IDF

In [None]:
tfidf_engine = TfidfTransformer()
tfidf_model = tfidf_engine.fit_transform(count_model)
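
As a rough check on what `TfidfTransformer` computes with its default settings (smoothed IDF followed by L2 row normalization), the sketch below reproduces its output by hand on a made-up count matrix. The `toy_` names and values are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 documents, 3 terms
toy_counts = np.array([[2, 0, 1],
                       [0, 3, 1]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)          # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF (smooth_idf=True)

weighted = toy_counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfTransformer().fit_transform(toy_counts).toarray()
print(np.allclose(manual, from_sklearn))
```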

In [None]:
db.TFIDF = pd.DataFrame(tfidf_model.toarray(), index=corpus.index, columns=db.VOCAB.index)

In [None]:
db.BOW['tfidf'] = db.TFIDF.stack()

In [None]:
db.BOW

## Add Features to VOCAB

In [None]:
db.VOCAB['ngram_len'] = db.VOCAB.apply(lambda x: len(x.name.split()), 1)
db.VOCAB['n'] = db.DTM.sum()
db.VOCAB['tfidf_mean'] = db.TFIDF.mean()

In [None]:
db.VOCAB

In [None]:
db.VOCAB.ngram_len.value_counts()
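
The `ngram_len` feature is just the number of whitespace-separated tokens in each term. As an alternative to the row-wise `apply`, a vectorized version might look like the following sketch; the `toy_vocab` table is hypothetical.

```python
import pandas as pd

# Hypothetical vocabulary indexed by term string
toy_vocab = pd.DataFrame(
    index=pd.Index(['cherry', 'soft tannins', 'long dry finish'], name='term_str')
)

# Split each term on whitespace and count the resulting tokens
toy_vocab['ngram_len'] = toy_vocab.index.to_series().str.split().str.len()

print(toy_vocab)
```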

# Generate Topic Models

We run scikit-learn's [LatentDirichletAllocation algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) and extract the THETA (document-topic) and PHI (topic-term) tables.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation as LDA, NMF

## Using LDA

In [None]:
lda_engine = LDA(n_components=n_topics, max_iter=max_iter, learning_offset=50., random_state=0)

### THETA

The Document-Topic Matrix

In [None]:
db.THETA = pd.DataFrame(lda_engine.fit_transform(count_model), index=corpus.index)
db.THETA.index.name = 'doc_id'
db.THETA.columns.name = 'topic_id'

In [None]:
db.THETA.sample(20).style.background_gradient(axis=1)

### PHI

In [None]:
db.PHI = pd.DataFrame(lda_engine.components_, columns=db.VOCAB.index)
db.PHI.index.name = 'topic_id'
db.PHI.columns.name  = 'term_str'

In [None]:
db.PHI.head().style.background_gradient()
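
To clarify what the two tables contain, here is a minimal sketch on random, made-up counts. It illustrates that `fit_transform` returns document-topic rows that sum to 1, while `components_` holds unnormalized topic-term weights that can be normalized into per-topic distributions. The `toy_` names and sizes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical count matrix: 12 documents, 8 terms
rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 5, size=(12, 8))

toy_lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
toy_theta = toy_lda.fit_transform(toy_counts)  # documents x topics
toy_phi = toy_lda.components_                  # topics x terms (unnormalized)

print(toy_theta.shape, toy_phi.shape)
print(toy_theta.sum(axis=1))  # each document's topic weights sum to 1

# Normalize each topic row to obtain a distribution over terms
toy_phi_norm = toy_phi / toy_phi.sum(axis=1, keepdims=True)
```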

### Create Topic Glosses

In [None]:
n_top_words = 7

In [None]:
db.TOPICS = db.PHI.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)
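
The gloss for each topic is simply its highest-weighted terms. A simpler way to express the same idea, shown here as a sketch on a made-up topic-term table rather than as a replacement for the cell above, uses `nlargest` on each row. The `toy_` names and weights are hypothetical.

```python
import pandas as pd

# Hypothetical topic-term weights: 2 topics, 4 terms
toy_phi = pd.DataFrame(
    [[5.0, 0.1, 3.0, 0.2],
     [0.3, 4.0, 0.1, 2.5]],
    columns=pd.Index(['cherry', 'oak', 'tannins', 'vanilla'], name='term_str'),
)
toy_phi.index.name = 'topic_id'

toy_n_top = 2  # number of top terms to keep per topic

# For each topic row, take the terms with the largest weights, in rank order
toy_topics = pd.DataFrame(
    [row.nlargest(toy_n_top).index.tolist() for _, row in toy_phi.iterrows()],
    index=toy_phi.index,
)

print(toy_topics)
```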

In [None]:
db.TOPICS

In [None]:
db.TOPICS['topwords'] = db.TOPICS.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

In [None]:
db.TOPICS

### Add Doc Weights

In [None]:
db.TOPICS['doc_weight_sum'] = db.THETA.sum()

In [None]:
db.TOPICS.iloc[:, n_top_words:].sort_values('doc_weight_sum', ascending=False).style.bar()

## Using NMF

In [None]:
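# `alpha` was removed in scikit-learn 1.2 in favor of `alpha_W`/`alpha_H`.
# The new penalty terms are scaled by the data dimensions, so this value may need to be lowered.
nmf_engine = NMF(n_components=n_topics, init='nndsvd', random_state=1, alpha_W=.1, l1_ratio=.5)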

### THETA

In [None]:
db.THETA_NMF = pd.DataFrame(nmf_engine.fit_transform(tfidf_model), index=corpus.index)
db.THETA_NMF.columns.name = 'topic_id'

In [None]:
db.THETA_NMF.sample(20).style.background_gradient()

### PHI

In [None]:
db.PHI_NMF = pd.DataFrame(nmf_engine.components_, columns=db.VOCAB.index)

In [None]:
db.PHI_NMF.index.name = 'topic_id'
db.PHI_NMF.columns.name = 'term_str'

In [None]:
db.PHI_NMF.T.head().style.background_gradient()
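
NMF factors a non-negative matrix X into two non-negative matrices, W (documents by topics, our THETA) and H (topics by terms, our PHI), such that W @ H approximates X. The sketch below shows this on random, made-up data, without regularization; the `toy_` names and sizes are hypothetical.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative matrix: 10 documents, 6 terms
rng = np.random.default_rng(1)
toy_X = rng.random((10, 6))

toy_nmf = NMF(n_components=3, init='nndsvd', random_state=1, max_iter=500)
W = toy_nmf.fit_transform(toy_X)  # documents x topics
H = toy_nmf.components_           # topics x terms

print(W.shape, H.shape)
print((W >= 0).all() and (H >= 0).all())  # both factors are non-negative
print(np.linalg.norm(toy_X - W @ H))      # Frobenius reconstruction error
```

Unlike the LDA document-topic rows, the rows of W are not normalized to sum to 1.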

### Topics

In [None]:
db.TOPICS_NMF = db.PHI_NMF.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: 
           x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)

In [None]:
db.TOPICS_NMF

In [None]:
db.TOPICS_NMF['topwords'] = db.TOPICS_NMF.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [None]:
db.TOPICS_NMF['doc_weight_sum'] = db.THETA_NMF.sum()

In [None]:
db.TOPICS_NMF.iloc[:, n_top_words:].sort_values('doc_weight_sum', ascending=False).style.bar()

# Save the Model

## Keep Corpus Label Info

This is effectively the LIB table.

In [None]:
# Drop the identifier and text columns; indexing a DataFrame with a set fails in pandas 2.x
db.LABELS = corpus.drop(columns=['doc_key', 'doc_content', 'doc_original'], errors='ignore')

## Save Tables

In [None]:
db.save_tables()

In [None]:
# See if it worked ...

!ls -l ./db/{data_prefix}*.csv