<b class="page-title" style="font-size:200%;">Topic Modeling with SciKit Learn &mdash; Experimental</b>

In this notebook we build LDA and NMF topic models from a corpus using scikit-learn. We save the resulting tables here and explore them in a separate notebook.

"Experimental" here means that we are developing a low-code interface for working with scikit-learn and related tools.

In [1]:
import pandas as pd
import numpy as np
from lib import tapi, etal

# Select Corpus

In [2]:
tapi.list_corpora()

/Users/rca2t1/Dropbox/Courses/NEH/TAPI_Topic_Models


['jstor_hyperparameter',
 'okcupid',
 'pitchfork',
 'poliblogs2008',
 'tamilnet',
 'winereviews']

In [3]:
# data_prefix = 'winereviews'
data_prefix = 'pitchfork'

# Create Object

This object stores the tables that constitute a "digital critical edition," and the algorithms to generate them.

In [7]:
db = tapi.Edition(data_prefix)

# Set Parameters

In [8]:
db.n_terms = 4000        # Vocabulary size
db.ngram_range = (2, 4)  # ngram min and max lengths
db.n_topics = 20         # Number of topics
db.max_iter = 5          # Number of iterations for topic model

# Create Models

In [10]:
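(
    db.import_corpus()  # load the corpus selected above
    .create_bow()       # count and TF-IDF models, VOCAB, bag-of-words table
    .create_lda()       # LDA model with doc-topic and term-topic matrices
    .create_nmf()       # NMF model with doc-topic and term-topic matrices
)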

Initializing Count Engine.
Generating Count Model.
Initializing TFIDF Engine.
Generating TFIDF Model.
Extracting VOCABulary.
Creating Bag of Words table.
Applying stats to VOCAB.
Initializing LDA Engine.
Generating LDA Model.
Extracting LDA Doc-Topic Matrix.
Extracting LDA Term-Topic Matrix.
Extracting LDA Topics.
Initializing NMF Engine.
Generating NMF Model.
Extracting NMF Doc-Topic Matrix.
Extracting NMF Term-Topic Matrix.
Extracting NMF Topics.


<lib.tapi.Edition at 0x7f96648e79a0>

# View Results

## LDA Topics

In [11]:
db.TOPICS.sort_values('preponderance', ascending=False).style.bar()

topic_id,preponderance,label
18,2007.726833,"18: sounds like, sound like, feel like, post rock, indie rock, songs like, rock band"
2,1544.76603,"2: feels like, post punk, sounds like, sound like, songs sound, joy division, don want"
8,1323.654567,"8: hip hop, feels like, sounds like, drum bass, old school, tracks like, feel like"
10,1306.286648,"10: new wave, ve got, songs like, punk rock, synth pop, title track, rock roll"
17,1193.975732,"17: rock roll, title track, indie rock, lot like, liner notes, greatest hits, box set"
4,710.373603,"4: dance music, title track, sounds like, album cover, house music, second half, electro pop"
13,677.657002,"13: feels like, sounds like, beach boys, free jazz, acoustic guitar, folk music, couple years"
14,504.054714,"14: electronic music, making music, field recordings, bit like, boards canada, seven minutes, title track"
7,481.494866,"7: indie pop, power pop, big star, pop songs, singer songwriter, acoustic guitar, dirty projectors"
1,473.054137,"1: opening track, early 90s, album best, death metal, solo album, new york, multi tracked"


## NMF Topics

In [13]:
db.TOPICS_NMF.sort_values('preponderance', ascending=False).style.bar()

topic_id,preponderance,label
0,195.563208,"0: acoustic guitar, songs like, feel like, ve got, years ago, singer songwriter, liner notes"
1,50.58164,"1: hip hop, underground hip hop, underground hip, old school, stones throw, def jux, boom bap"
5,45.519602,"5: sounds like, album sounds, album sounds like, track sounds like, track sounds, sounds like work, music sounds like"
10,38.894601,"10: sound like, doesn sound, doesn sound like, songs sound, don sound, don sound like, songs sound like"
13,33.809199,"13: title track, opening title, opening title track, album title track, closing title, closing title track, album title"
11,31.874498,"11: feels like, feel like, album feels like, album feels, record feels, felt like, music feels"
9,30.553077,"9: indie rock, broken social, broken social scene, social scene, indie rock band, rock band, modest mouse"
4,30.450185,"4: new york, new york city, york city, new york times, york times, arthur russell, avant garde"
3,30.291418,"3: dance music, house music, deep house, house techno, daft punk, acid house, electronic dance"
6,27.823234,"6: post punk, new wave, joy division, bloc party, early 80s, franz ferdinand, punk band"
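
Each label in the tables above is the topic id followed by seven terms, and each topic has a `preponderance` score. The notebook does not define either column, so the sketch below rests on two assumptions: that the label lists the highest-weighted terms of the topic, and that preponderance is the sum of the topic's weights across all documents. The small matrices are made up for illustration, and only two terms per label are kept to suit them.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary and matrices for two topics and three documents.
terms = np.array(["hip hop", "indie rock", "sounds like", "title track"])
phi = np.array([            # topic-term weights, one row per topic
    [0.10, 0.60, 0.25, 0.05],
    [0.70, 0.05, 0.05, 0.20],
])
theta = np.array([          # doc-topic weights, one row per document
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
])

n_top = 2  # the notebook output shows 7 terms per label

rows = []
for topic_id, weights in enumerate(phi):
    # Indices of the n_top largest weights, largest first.
    top_idx = np.argsort(weights)[::-1][:n_top]
    rows.append({
        "topic_id": topic_id,
        # Assumed definition: total weight of the topic over all documents.
        "preponderance": theta[:, topic_id].sum(),
        "label": f"{topic_id}: " + ", ".join(terms[top_idx]),
    })

topics = pd.DataFrame(rows).set_index("topic_id")
print(topics.sort_values("preponderance", ascending=False))
```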


# Save the Model

In [14]:
db.export_tables()

/Users/rca2t1/Dropbox/Courses/NEH/TAPI_Topic_Models


In [16]:
!ls -l ./db/{data_prefix}*.csv

-rw-r--r--@ 1 rca2t1  staff  10932606 Jul  1 22:14 ./db/pitchfork-BOW.csv
-rw-r--r--@ 1 rca2t1  staff         3 Jul  1 22:14 ./db/pitchfork-DTM.csv
-rw-r--r--@ 1 rca2t1  staff   2167106 Jul  1 22:13 ./db/pitchfork-LABELS.csv
-rw-r--r--@ 1 rca2t1  staff   2059123 Jul  1 22:14 ./db/pitchfork-PHI.csv
-rw-r--r--@ 1 rca2t1  staff    595850 Jul  1 22:14 ./db/pitchfork-PHI_NMF.csv
-rw-r--r--@ 1 rca2t1  staff   5917473 Jul  1 22:14 ./db/pitchfork-THETA.csv
-rw-r--r--@ 1 rca2t1  staff   1818030 Jul  1 22:14 ./db/pitchfork-THETA_NMF.csv
-rw-r--r--@ 1 rca2t1  staff      2309 Jul  1 22:14 ./db/pitchfork-TOPICS.csv
-rw-r--r--@ 1 rca2t1  staff      2481 Jul  1 22:14 ./db/pitchfork-TOPICS_NMF.csv
-rw-r--r--@ 1 rca2t1  staff    498047 Jul  1 22:13 ./db/pitchfork-VOCAB.csv
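
To explore the results in another notebook, the exported tables need to be read back in. The helper below is a hypothetical sketch, not part of `tapi`. It assumes the `<prefix>-<TABLE>.csv` naming seen in the listing above and that the first column of each file holds the index. It is demonstrated on small stand-in files in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(db_dir, data_prefix):
    """Read every '<data_prefix>-<TABLE>.csv' in db_dir into a dict keyed by TABLE."""
    tables = {}
    for path in sorted(Path(db_dir).glob(f"{data_prefix}-*.csv")):
        # Strip the prefix and the hyphen to recover the table name.
        name = path.stem[len(data_prefix) + 1:]
        tables[name] = pd.read_csv(path, index_col=0)
    return tables


# Demonstration with two small stand-in tables (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame(
        {"preponderance": [1.7, 1.3]},
        index=pd.Index([0, 1], name="topic_id"),
    )
    demo.to_csv(Path(tmp) / "demo-TOPICS.csv")
    demo.to_csv(Path(tmp) / "demo-TOPICS_NMF.csv")

    tables = load_tables(tmp, "demo")

print(sorted(tables))
```

For the real export, the call would be along the lines of `load_tables("./db", data_prefix)`. Note that `pitchfork-DTM.csv` is only 3 bytes in the listing above, so that table was probably empty at export time and is worth checking before relying on it.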
