# Topic Modeling with scikit-learn: Experimental

In this notebook we create a topic model from our corpus using scikit-learn. We save the results so that a separate notebook can explore them.

Experimental: this notebook develops a low-code interface for working with scikit-learn and related libraries.

# Set Up

## Imports

In [1]:
import pandas as pd
import numpy as np
from lib import tapi, etal

## Configuration

### Show and pick a corpus to work with

In [2]:
tapi.list_corpora()

['airbnb',
 'anphoblacht',
 'arxiv',
 'covid19',
 'jstor_hyperparameter',
 'novels',
 'okcupid',
 'tamilnet',
 'winereviews',
 'yelp',
 'zuboff']

In [17]:
# The outputs shown further down come from a run with 'winereviews'.
# data_prefix = 'winereviews'
data_prefix = 'tamilnet'

## Create Tables Object

These tables constitute a "digital critical edition."

In [4]:
db = tapi.Edition(data_prefix)

## Parameters

In [5]:
db.n_terms = 4000        # Vocabulary size
db.ngram_range = (2, 4)  # ngram min and max lengths
db.n_topics = 20         # Number of topics
db.max_iter = 5          # Number of iterations for topic model
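The `tapi` internals are not shown in this notebook, so the following is only a sketch of how these four parameters would plausibly map onto scikit-learn estimators. The mapping itself, the `stop_words='english'` setting, and the `random_state` values are assumptions, not confirmed behavior of `tapi.Edition`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

# Values copied from the parameter cell above.
n_terms = 4000
ngram_range = (2, 4)
n_topics = 20
max_iter = 5

# Assumed mapping: vocabulary size -> max_features, ngram lengths -> ngram_range.
# Stop-word removal is a guess; it is not set anywhere in this notebook.
count_engine = CountVectorizer(
    max_features=n_terms,
    ngram_range=ngram_range,
    stop_words="english",
)

# Assumed mapping: number of topics -> n_components, iterations -> max_iter.
lda_engine = LatentDirichletAllocation(
    n_components=n_topics, max_iter=max_iter, random_state=0
)
nmf_engine = NMF(n_components=n_topics, random_state=0)
```

With `ngram_range = (2, 4)` the vocabulary contains only phrases of two to four words, which is consistent with the multi-word topic labels shown later in the notebook.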

# Import Corpus Data

We import the corpus in our standard format and build a bag-of-words (BOW) table from it.

In [6]:
db.import_corpus().create_bow()

Initializing Count Engine.
Generating Count Model.
Initializing TFIDF Engine.
Generating TFIDF Model.
Extracting VOCABulary.
Creating Bag of Words table.
Applying stats to VOCAB.


<lib.tapi.Edition at 0x7ffc850ce040>

# Generate Topic Models

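We run scikit-learn's [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) and [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) algorithms and, for each model, extract the document-topic (THETA) and term-topic (PHI) tables.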

In [8]:
db.create_nmf()

Initializing NMF Engine.
Generating NMF Model.
Extracting NMF Doc-Topic Matrix.
Extracting NMF Term-Topic Matrix.
Extracting NMF Topics.


<lib.tapi.Edition at 0x7ffc850ce040>

In [9]:
db.create_lda()

Initializing LDA Engine.
Generating LDA Model.
Extracting LDA Doc-Topic Matrix.
Extracting LDA Term-Topic Matrix.
Extracting LDA Topics.


<lib.tapi.Edition at 0x7ffc850ce040>

In [12]:
db.TOPICS.sort_values('preponderance', ascending=False).style.bar()

| topic_id | preponderance | label |
|---|---|---|
| 10 | 950.292998 | 10: black cherry, palate offers, sauvignon blanc, palate delivers, opens aromas, tannins drink, lead nose |
| 2 | 695.833071 | 2: ready drink, wine shows, nose palate, cherry fruit, soft tannins, drink 2015, wine ready |
| 0 | 643.925211 | 0: fruit flavors, black fruit, black fruits, black pepper, dark fruit, berry aromas, cherry plum |
| 9 | 587.844451 | 9: black currant, dark chocolate, ripe fruit, fruit flavors, drink 2018, new oak, blackberry black |
| 17 | 562.232363 | 17: red fruit, fruit flavors, red cherry, cherry raspberry, flavored wine, red wine, lively acidity |
| 3 | 557.031719 | 3: firm tannins, drink 2020, cabernet sauvignon, petit verdot, drink 2017, citrus flavors, acidity drink |
| 13 | 548.195831 | 13: cabernet franc, crisp acidity, cabernet sauvignon, fruity wine, finish drink, spice flavors, peach flavors |
| 11 | 543.592389 | 11: fruit aromas, red plum, plum berry, wood aging, bright acidity, palate shows, berry flavors |
| 15 | 502.006179 | 15: apple flavors, light bodied, creamy texture, ready drink, wine light, wine dry, pink grapefruit |
| 14 | 488.516511 | 14: pinot noir, medium bodied, bodied wine, cherry flavors, black cherry, white wine, yellow fruits |


In [13]:
db.TOPICS_NMF.sort_values('preponderance', ascending=False).style.bar()

| topic_id | preponderance | label |
|---|---|---|
| 6 | 34.594302 | 6: palate offers, grained tannins, fine grained, fine grained tannins, opens aromas, lead nose, tannins drink |
| 0 | 29.073242 | 0: black cherry, black cherry fruit, black cherry flavors, ripe black, blackberry black cherry, aromas black, ripe black cherry |
| 1 | 28.680874 | 1: fruit flavors, red fruit, red fruit flavors, dark fruit, black fruit flavors, dark fruit flavors, stone fruit flavors |
| 3 | 28.455032 | 3: cabernet sauvignon, cabernet franc, petit verdot, blend cabernet, blend cabernet sauvignon, sauvignon merlot, cabernet sauvignon merlot |
| 2 | 24.823145 | 2: ready drink, wine ready, wine ready drink, fruity wine, red fruits, wine soft, acidity wine |
| 7 | 21.646815 | 7: medium bodied, bodied wine, medium bodied wine, aromas flavors, medium bodied palate, bodied palate, flavors like |
| 4 | 20.688912 | 4: pinot noir, raspberry cherry, silky texture, cherry flavors, chardonnay pinot noir, pinot noir dry, noir dry |
| 18 | 19.99352 | 18: berry fruits, red berry, red berry fruits, red berry flavors, ripe red berry, ripe red, drink 2017 |
| 8 | 19.799163 | 8: black fruit, black fruit flavors, fruit aromas, black fruit aromas, red black fruit, red black, black fruit palate |
| 5 | 17.59062 | 5: green apple, white peach, apple flavors, green apple flavors, flavors green apple, flavors green, lime green |
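Both topic tables list a `preponderance` value and a `label` per topic. The notebook does not show how `tapi` computes them, so the sketch below rests on two guesses: preponderance is the column sum of the doc-topic matrix, and the label is the topic id followed by the highest-weighted terms in the term-topic matrix. The toy matrices are hypothetical.

```python
import pandas as pd

# Hypothetical doc-topic weights (3 documents x 2 topics).
THETA = pd.DataFrame([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
THETA.index.name = "doc_id"
THETA.columns.name = "topic_id"

# Hypothetical topic-term weights (2 topics x 4 terms).
PHI = pd.DataFrame(
    [[5.0, 3.0, 0.1, 0.2], [0.2, 0.1, 4.0, 6.0]],
    columns=["black cherry", "firm tannins", "green apple", "crisp acidity"],
)
PHI.index.name = "topic_id"

n_top_terms = 2  # the tables above show seven terms per label

TOPICS = pd.DataFrame(index=PHI.index)

# Assumed definition: total weight a topic receives across all documents.
TOPICS["preponderance"] = THETA.sum()

# Assumed definition: topic id plus its highest-weighted terms.
TOPICS["label"] = PHI.apply(
    lambda row: f"{row.name}: " + ", ".join(row.nlargest(n_top_terms).index),
    axis=1,
)

TOPICS = TOPICS.sort_values("preponderance", ascending=False)
print(TOPICS)
```

Under this reading, the much larger LDA preponderance values compared with NMF would simply reflect the different scales of the two doc-topic matrices rather than any difference in topic quality.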


# Save the Model

## Keep Corpus Label Info

This is effectively the LIB table.

In [14]:
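# Keep every corpus column except the document key and text fields, preserving column order.
label_cols = [c for c in db.corpus.columns if c not in ('doc_key', 'doc_content', 'doc_original')]
db.LABELS = db.corpus[label_cols].dropna(axis=1)  # drop columns that contain missing values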

## Save data

In [15]:
db.save_tables()

In [16]:
!ls -l ./db/{data_prefix}*.csv

-rw-r--r--@ 1 rca2t1  staff  2524273 Jun 16 01:18 ./db/winereviews-BOW.csv
-rw-r--r--@ 1 rca2t1  staff        3 Jun 16 01:18 ./db/winereviews-DTM.csv
-rw-r--r--@ 1 rca2t1  staff  1504341 Jun 16 01:18 ./db/winereviews-LABELS.csv
-rw-r--r--@ 1 rca2t1  staff  2083714 Jun 16 01:18 ./db/winereviews-PHI.csv
-rw-r--r--@ 1 rca2t1  staff   554277 Jun 16 01:18 ./db/winereviews-PHI_NMF.csv
-rw-r--r--@ 1 rca2t1  staff  4195814 Jun 16 01:18 ./db/winereviews-THETA.csv
-rw-r--r--@ 1 rca2t1  staff  1024505 Jun 16 01:18 ./db/winereviews-THETA_NMF.csv
-rw-r--r--@ 1 rca2t1  staff     2485 Jun 16 01:18 ./db/winereviews-TOPICS.csv
-rw-r--r--@ 1 rca2t1  staff     2842 Jun 16 01:18 ./db/winereviews-TOPICS_NMF.csv
-rw-r--r--@ 1 rca2t1  staff   418844 Jun 16 01:18 ./db/winereviews-VOCAB.csv
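The listing above suggests the naming pattern `./db/{data_prefix}-{TABLE}.csv`. A follow-up notebook could read the tables back with pandas along these lines. The helper `load_table` is hypothetical, and treating the first column as the index is an assumption about how the tables were written. The round trip runs in a temporary directory so the sketch does not depend on the real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(db_dir, data_prefix, table, index_col=0):
    """Read one saved table, e.g. ./db/winereviews-TOPICS.csv (hypothetical helper)."""
    path = Path(db_dir) / f"{data_prefix}-{table}.csv"
    return pd.read_csv(path, index_col=index_col)


# Round trip with a small stand-in table.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame(
        {
            "preponderance": [2.0, 1.0],
            "label": ["0: black cherry", "1: green apple"],
        },
        index=pd.Index([0, 1], name="topic_id"),
    )
    demo.to_csv(Path(tmp) / "demo-TOPICS.csv")

    TOPICS = load_table(tmp, "demo", "TOPICS")

print(TOPICS)
```

Tables with a two-level index, such as a long-format BOW, would need `index_col=[0, 1]` instead of the default.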
