<div style="font-size:200%; font-weight:bold; font-variant:small-caps;">Topic Modeling with SciKit Learn</div>

In this notebook we build topic models from our corpus with scikit-learn: its text feature extraction module produces the document-term counts, and its decomposition module fits the models. We save the resulting tables and explore them in a separate notebook.

# Set Up

## Imports

In [1]:
import pandas as pd
import numpy as np
from lib import tapi

## Configuration

In [2]:
tapi.list_dbs()

/Users/rca2t1/Dropbox/Courses/NEH/TAPI_Topic_Models


['jstor_hyperparameter',
 'jstor_hyperparameter_demo',
 'pitchfork',
 'poliblogs2008',
 'tamilnet',
 'ussc',
 'winereviews']

In [3]:
tapi.list_corpora()

/Users/rca2t1/Dropbox/Courses/NEH/TAPI_Topic_Models


['jstor_hyperparameter',
 'okcupid',
 'pitchfork',
 'poliblogs2008',
 'tamilnet',
 'ussc',
 'winereviews']

In [4]:
data_prefix = 'ussc'

In [5]:
db = tapi.Edition(data_prefix)

## Parameters

In [6]:
n_terms = 4000      # Vocabulary size
ngram_range = (1,4) # ngram min and max lengths
n_topics = 40       # Number of topics
max_iter = 5        # Number of iterations for topic model

## Create Tables Object

The `db` object created above (a `tapi.Edition`) holds the tables that constitute a "digital critical edition."

# Import Corpus Data

We import a corpus in our standard format.

In [7]:
corpus = db.get_corpus()

## Inspect contents

In [8]:
corpus

doc_id,doc_year,doc_key,doc_position,doc_content,doc_len
0,1796,3_US_171,concur,CONCUR BY: PATERSON; IREDELL; WILSON\nPATERSON...,22824
1,1796,3_US_171,dissent,"DISSENT BY: CUSHING\nCUSHING, Justice. As I ha...",252
2,1796,3_US_171,opinion,THE COURT delivered their opinions seriatim in...,7826
3,1798,3_US_386,concur,"CONCUR BY: PATERSON\nPATERSON, Justice. The Co...",7835
4,1798,3_US_386,dissent,"DISSENT BY: IREDELL\nIREDELL, Justice. Though ...",8908
...,...,...,...,...,...
5374,2008,554_US_527,dissent,DISSENT BY: Stevens \nDISSENT \nJustice Steven...,25244
5375,2008,554_US_527,opinion,Justice Scalia delivered the opinion of the Co...,45723
5376,2008,554_US_84,concur,CONCUR BY: SCALIA; THOMAS (In Part) \nCONCUR \...,2948
5377,2008,554_US_84,dissent,DISSENT BY: THOMAS (In Part) \nDISSENT \nJUSTI...,1237


In [9]:
# corpus.doc_content.sample(10).to_list()

In [10]:
# corpus.head(10)

# Convert to Bag of Words 

That is, a __count vector space__.

We use scikit-learn's `CountVectorizer` to convert our corpus of documents into a document-term vector space of word counts.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [12]:
count_engine = CountVectorizer(max_features=n_terms, stop_words='english', ngram_range=ngram_range)
count_model = count_engine.fit_transform(corpus.doc_content)
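To make the effect of `stop_words` and `ngram_range` concrete, here is a minimal sketch on two invented sentences (the toy documents and the `toy_*` names are assumptions for illustration, not part of the corpus). It uses the same kind of settings as the cell above, scaled down to unigrams and bigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration only
toy_docs = [
    "the court held that the statute applies",
    "the district court denied the motion",
]

# Unigrams and bigrams, with English stop words removed before n-grams are built
toy_engine = CountVectorizer(stop_words='english', ngram_range=(1, 2))
toy_counts = toy_engine.fit_transform(toy_docs)

# The fitted vocabulary maps each n-gram to a column of the sparse count matrix
toy_vocab = toy_engine.vocabulary_

print(toy_counts.shape)
print(sorted(toy_vocab))
```

Because stop words are dropped before n-grams are formed, a bigram such as "district court" survives while "the" never enters the vocabulary.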

## Get Generated VOCAB

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
db.VOCAB = pd.DataFrame(count_engine.get_feature_names_out(), columns=['term_str'])
db.VOCAB = db.VOCAB.set_index('term_str')
db.VOCAB['ngram_len'] = None # To be added later

In [14]:
db.VOCAB.sample(10)

term_str,ngram_len
fund,
compensatory,
indictment,
pre,
cong 1st,
invasion,
128,
republic,
contempt,
february,


## Get Generated BOW

We do this just to show what the count vectorizer produced. `DTM` stands for document-term matrix. We convert this sparse matrix into a dense dataframe, then stack it into a "thin" dataframe (`BOW`) that keeps only the terms with nonzero counts in each document.

In [15]:
db.DTM = pd.DataFrame(count_model.toarray(), index=corpus.index, columns=db.VOCAB.index)
db.BOW = db.DTM.stack().to_frame('n')
db.BOW = db.BOW[db.BOW.n > 0]

In [16]:
db.DTM.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5379 entries, 0 to 5378
Columns: 4000 entries, 000 to zone
dtypes: int64(4000)
memory usage: 164.3 MB


In [17]:
db.BOW.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 2544973 entries, (0, '000') to (5378, 'years')
Columns: 1 entries, n to n
dtypes: int64(1)
memory usage: 29.4+ MB
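The dense `DTM` above takes about 164 MB, most of it zeros. If memory becomes a constraint, one possible alternative is to build the thin table directly from the sparse matrix without calling `toarray()`. The sketch below assumes toy stand-ins (`toy_counts`, `toy_doc_ids`, `toy_terms`) for `count_model`, `corpus.index`, and `db.VOCAB.index`; it is not how the notebook itself builds `BOW`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for count_model, corpus.index, and db.VOCAB.index
toy_counts = csr_matrix([[2, 0, 1],
                         [0, 3, 0]])
toy_doc_ids = pd.Index([10, 11], name='doc_id')
toy_terms = pd.Index(['act', 'court', 'tax'], name='term_str')

# COO format exposes a (row, column, value) triple for every stored entry
coo = toy_counts.tocoo()
toy_bow = pd.DataFrame(
    {'n': coo.data},
    index=pd.MultiIndex.from_arrays(
        [toy_doc_ids.to_numpy()[coo.row], toy_terms.to_numpy()[coo.col]],
        names=['doc_id', 'term_str'],
    ),
)

# Drop any explicitly stored zeros and sort to match the stacked layout
toy_bow = toy_bow[toy_bow.n > 0].sort_index()
print(toy_bow)
```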


## Compute TF-IDF

In [18]:
tfidf_engine = TfidfTransformer()
tfidf_model = tfidf_engine.fit_transform(count_model)

In [19]:
db.TFIDF = pd.DataFrame(tfidf_model.toarray(), index=corpus.index, columns=db.VOCAB.index)

In [20]:
db.BOW['tfidf'] = db.TFIDF.stack()

In [21]:
db.BOW

doc_id,term_str,n,tfidf
0,000,2,0.018704
0,10,1,0.005761
0,100,2,0.017461
0,105,3,0.029116
0,14,1,0.006636
...,...,...,...
5378,writing,1,0.009378
5378,wrong,1,0.007966
5378,wrote,1,0.009504
5378,year,2,0.013173
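With its default settings, `TfidfTransformer` uses a smoothed inverse document frequency, $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, multiplies it by the raw counts, and L2-normalizes each document row. The sketch below reproduces that calculation by hand on an invented count matrix (`toy_counts` is an assumption for illustration) and compares it with the library result.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents x 3 terms (invented values)
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [0, 4, 1]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)            # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1    # smoothed IDF (the default)
weighted = toy_counts * idf                        # raw counts times IDF
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2 rows

# Library result with default settings, for comparison
sk_tfidf = TfidfTransformer().fit_transform(csr_matrix(toy_counts)).toarray()

print(np.allclose(manual_tfidf, sk_tfidf))
```

The L2 normalization is why the `tfidf` values in `BOW` are small fractions: each document's vector has unit length.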


## Add Features to VOCAB

In [22]:
db.VOCAB['ngram_len'] = db.VOCAB.apply(lambda x: len(x.name.split()), 1)
db.VOCAB['n'] = db.DTM.sum()
db.VOCAB['tfidf_mean'] = db.TFIDF.mean()

In [23]:
db.VOCAB

term_str,ngram_len,n,tfidf_mean
000,1,2762,0.004429
10,1,6323,0.006809
100,1,2025,0.003155
101,1,1337,0.002201
102,1,1335,0.002264
...,...,...,...
york,1,6368,0.010286
york times,2,546,0.001932
young,1,1120,0.002581
youth,1,423,0.001047


In [24]:
db.VOCAB.ngram_len.value_counts()

1    3608
2     349
3      40
4       3
Name: ngram_len, dtype: int64
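The n-gram length can also be computed without a row-wise `apply`. As a small sketch, assuming terms are joined by single spaces (as `CountVectorizer` produces them), counting spaces and adding one gives the same number; the `toy_vocab` table below is invented for illustration.

```python
import pandas as pd

# Toy vocabulary table indexed by term, standing in for db.VOCAB
toy_terms = pd.Index(['court', 'district court', 'mr justice frankfurter'],
                     name='term_str')
toy_vocab = pd.DataFrame(index=toy_terms)

# Vectorized n-gram length: number of single spaces plus one
toy_vocab['ngram_len'] = (toy_terms.str.count(' ') + 1).to_numpy()

# Distribution of n-gram lengths, as in the cell above
toy_ngram_counts = toy_vocab.ngram_len.value_counts()
print(toy_vocab)
```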

# Generate Topic Models

We run Scikit Learn's [LatentDirichletAllocation algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) and extract the THETA and PHI tables.

In [25]:
from sklearn.decomposition import LatentDirichletAllocation as LDA, NMF

## Using LDA

In [26]:
lda_engine = LDA(n_components=n_topics, max_iter=max_iter, learning_offset=50., random_state=0)

### THETA

The Document-Topic Matrix

In [27]:
db.THETA = pd.DataFrame(lda_engine.fit_transform(count_model), index=corpus.index)
db.THETA.index.name = 'doc_id'
db.THETA.columns.name = 'topic_id'

In [28]:
db.THETA.sample(20).style.background_gradient(axis=1)

doc_id \ topic_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
4056,0.010587,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.014726,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.52377,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.000142,0.024784,0.000142,0.040495,0.000142,0.000142,0.171723,0.000142,0.000142,0.000142,0.000142,0.209227
3088,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,0.019257,4e-05,4e-05,4e-05,0.372766,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,0.263651,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,0.099278,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,4e-05,0.243632
3685,9e-06,9e-06,0.027255,9e-06,0.481099,0.007526,9e-06,9e-06,9e-06,9e-06,0.478943,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,9e-06,0.004844
3676,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.035475,2.8e-05,2.8e-05,0.058262,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.050839,2.8e-05,2.8e-05,0.328002,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.493146,0.033309
946,4.7e-05,4.7e-05,0.034,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,0.354963,4.7e-05,0.021319,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,0.095353,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,0.300417,4.7e-05,4.7e-05,4.7e-05,0.172731,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,4.7e-05,0.019663
5023,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,0.176291,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,0.59141,3.2e-05,3.2e-05,3.2e-05,3.2e-05,3.2e-05,0.20753,0.023628
1178,0.060732,3.4e-05,0.517806,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,0.23597,0.025545,3.4e-05,3.4e-05,3.4e-05,3.4e-05,0.033457,3.4e-05,3.4e-05,0.027933,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,3.4e-05,0.097424,3.4e-05,3.4e-05,3.4e-05,3.4e-05
1358,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,0.128661,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,0.3775,2.6e-05,2.6e-05,0.081217,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,0.217317,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,2.6e-05,0.076065,2.6e-05,2.6e-05,0.024378,0.094006,2.6e-05
1102,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,0.776538,4.2e-05,4.2e-05,4.2e-05,0.001424,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,0.069967,4.2e-05,4.2e-05,4.2e-05,0.018418,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,4.2e-05,0.132184,4.2e-05,4.2e-05
1366,2.8e-05,2.8e-05,0.020426,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.183333,2.8e-05,0.041461,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.449569,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,2.8e-05,0.30423
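Each row of THETA is a distribution over topics for one document, which is why most entries in the sample above are tiny and a few carry nearly all the weight. The sketch below checks this on a toy corpus (the documents and `toy_*` names are invented for illustration): the rows returned by `fit_transform` should be non-negative and sum to 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, invented for illustration only
toy_docs = [
    "tax income revenue tax",
    "income tax taxpayer revenue",
    "search warrant police arrest",
    "police search arrest warrant",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)

# Small LDA model; same kind of call as above, with only two topics
toy_lda = LatentDirichletAllocation(n_components=2, max_iter=5, random_state=0)
toy_theta = toy_lda.fit_transform(toy_counts)

# One row per document, one column per topic; each row sums to 1
toy_row_sums = toy_theta.sum(axis=1)
print(toy_theta.round(3))
print(toy_row_sums)
```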


### PHI

In [29]:
db.PHI = pd.DataFrame(lda_engine.components_, columns=db.VOCAB.index)
db.PHI.index.name = 'topic_id'
db.PHI.columns.name  = 'term_str'

In [30]:
db.PHI.T.head().style.background_gradient()

term_str \ topic_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
0,17.261673,13.55353,376.798377,0.598921,58.537397,12.471511,25.474955,5.380009,70.930995,116.041064,103.769356,228.141371,4.325445,84.61197,18.516242,77.641724,134.598987,6.178166,86.822207,4.744243,175.688999,48.075773,76.303767,0.451017,21.358211,8.557587,3.423902,84.042166,1.401106,77.473366,34.23585,449.546835,61.313797,26.63483,24.467677,64.124922,25.358526,113.734646,14.865258,5.543621
10,259.700123,200.474589,442.803719,66.829358,128.332483,350.267224,51.674417,141.101147,284.077234,126.498763,315.497367,120.624424,80.303678,152.009966,124.48173,79.595843,168.663682,63.95538,139.729251,87.36769,107.073462,62.098726,163.997509,139.303404,186.955146,243.123069,120.336975,50.733889,50.357288,268.697537,221.938366,371.39407,199.584329,51.806587,39.604764,120.29253,83.356021,236.460104,160.452623,62.445534
100,135.962482,47.802454,91.221868,98.099019,38.746094,65.120656,29.175681,21.385476,122.503297,195.28652,68.625676,67.683756,11.781897,51.257514,19.696513,29.592616,82.18061,13.425478,26.978289,62.622746,111.735275,63.566241,37.759883,13.612905,22.863086,7.90209,36.208635,7.065874,48.103822,27.518771,14.002308,48.728513,151.786318,12.410114,16.043565,8.906029,19.811081,53.848842,11.828682,33.149321
101,80.123603,5.105609,18.351133,34.474159,19.117823,33.292099,80.013173,27.90196,7.084128,159.566828,43.778981,21.261694,8.15754,20.406937,2.389746,12.527283,4.971912,14.447126,9.315454,4.681392,63.644172,47.906621,12.994376,40.572167,88.849652,15.583374,40.280344,1.433825,29.514853,103.811267,6.026258,9.695715,143.728652,15.230961,4.292297,11.934149,20.319871,48.128921,9.497612,17.586329
102,87.225975,26.584457,13.375542,58.691068,49.041322,44.389908,129.70098,8.21967,5.303331,106.247273,89.993797,18.814752,8.719642,17.713015,0.798751,7.569236,13.541215,7.709677,1.666662,5.305516,38.180403,22.562411,5.458281,20.289356,14.601615,6.228625,95.281491,9.974469,117.172752,25.930062,1.594913,22.728104,143.578275,0.025012,10.286448,9.47562,13.317286,26.631289,13.347018,38.724782
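The values in PHI are not probabilities: scikit-learn's `components_` holds unnormalized pseudo-counts, which is why entries such as 376.8 appear above. If a topic-term probability table is wanted, each topic row can be divided by its sum. A minimal sketch on invented numbers (`toy_phi` is an assumption standing in for `db.PHI`):

```python
import pandas as pd

# Toy pseudo-count table: 2 topics x 3 terms (invented values)
toy_phi = pd.DataFrame(
    [[8.0, 1.0, 1.0],
     [2.0, 2.0, 4.0]],
    columns=pd.Index(['court', 'tax', 'union'], name='term_str'),
)
toy_phi.index.name = 'topic_id'

# Divide each topic row by its total so that every row sums to 1
toy_phi_prob = toy_phi.div(toy_phi.sum(axis=1), axis=0)
print(toy_phi_prob)
```

Normalizing does not change the ranking of terms within a topic, so the topic glosses built from PHI are unaffected either way.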


### Create Topic Glosses

In [31]:
n_top_words = 7

In [32]:
db.TOPICS = db.PHI.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)

In [33]:
db.TOPICS

topic_id \ term_str,0,1,2,3,4,5,6
0,death,sentence,court,penalty,sentencing,punishment,jury
1,search,court,fourth,amendment,fourth amendment,warrant,states
2,act,congress,court,law,price,statute,liability
3,ed,2d,ed 2d,ct,state,court,pre
4,court,respondents,fees,attorney,case,rights,district court
5,congress,cable,discrimination,vii,title vii,title,broadcast
6,ed,ct,ed 2d,2d,government,court,id
7,trial,court,counsel,defendant,right,criminal,case
8,court,state,district,voting,equal,racial,discrimination
9,ed,ct,2d,ed 2d,state,states,law


In [34]:
db.TOPICS['topwords'] = db.TOPICS.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

In [35]:
db.TOPICS

topic_id \ term_str,0,1,2,3,4,5,6,topwords
0,death,sentence,court,penalty,sentencing,punishment,jury,"0 death, sentence, court, penalty, sentencing,..."
1,search,court,fourth,amendment,fourth amendment,warrant,states,"1 search, court, fourth, amendment, fourth ame..."
2,act,congress,court,law,price,statute,liability,"2 act, congress, court, law, price, statute, l..."
3,ed,2d,ed 2d,ct,state,court,pre,"3 ed, 2d, ed 2d, ct, state, court, pre"
4,court,respondents,fees,attorney,case,rights,district court,"4 court, respondents, fees, attorney, case, ri..."
5,congress,cable,discrimination,vii,title vii,title,broadcast,"5 congress, cable, discrimination, vii, title ..."
6,ed,ct,ed 2d,2d,government,court,id,"6 ed, ct, ed 2d, 2d, government, court, id"
7,trial,court,counsel,defendant,right,criminal,case,"7 trial, court, counsel, defendant, right, cri..."
8,court,state,district,voting,equal,racial,discrimination,"8 court, state, district, voting, equal, racia..."
9,ed,ct,2d,ed 2d,state,states,law,"9 ed, ct, 2d, ed 2d, state, states, law"


### Add Doc Weights

In [36]:
db.TOPICS['doc_weight_sum'] = db.THETA.sum()

In [37]:
db.TOPICS.iloc[:, 7:].sort_values('doc_weight_sum', ascending=False).style.bar()

topic_id \ term_str,topwords,doc_weight_sum
39,"39 justice, court, opinion, mr, mr justice, concurring, case",749.66936
11,"11 tax, state, commerce, court, interstate, business, property",215.81521
14,"14 court, district, case, judge, district court, appeals, trial",205.8146
1,"1 search, court, fourth, amendment, fourth amendment, warrant, states",202.773081
29,"29 union, labor, court, employees, employer, board, act",199.31692
30,"30 petitioner, court, trial, evidence, state, defendant, case",193.598535
24,"24 state, federal, court, states, law, congress, courts",187.276298
13,"13 public, speech, court, amendment, city, ordinance, right",184.16155
7,"7 trial, court, counsel, defendant, right, criminal, case",171.430667
2,"2 act, congress, court, law, price, statute, liability",166.903637
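Because each row of the LDA THETA sums to 1, `doc_weight_sum` can be read as the number of "document equivalents" a topic accounts for, and the sums over all topics add up to the number of documents. The sketch below illustrates this on an invented table (`toy_theta` stands in for `db.THETA`), and also counts how often each topic is a document's dominant topic.

```python
import pandas as pd

# Toy document-topic table: 3 documents x 3 topics, rows sum to 1 (invented)
toy_theta = pd.DataFrame([[0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.6, 0.3, 0.1]])
toy_theta.index.name = 'doc_id'
toy_theta.columns.name = 'topic_id'

# Column sums: total weight each topic receives across all documents
toy_weight_sum = toy_theta.sum()

# The same quantity as a share of the corpus
toy_weight_share = toy_weight_sum / toy_weight_sum.sum()

# How many documents have each topic as their single heaviest topic
toy_dominant = toy_theta.idxmax(axis=1).value_counts()

print(toy_weight_sum)
print(toy_dominant)
```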


## Using NMF

In [38]:
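# Note: `alpha` was removed in scikit-learn 1.2; newer versions use `alpha_W` and `alpha_H`, which are scaled differently
nmf_engine = NMF(n_components=n_topics, init='nndsvd', random_state=1, alpha=.1, l1_ratio=.5)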

### THETA

In [39]:
db.THETA_NMF = pd.DataFrame(nmf_engine.fit_transform(tfidf_model), index=corpus.index)
db.THETA_NMF.columns.name = 'topic_id'



In [40]:
db.THETA_NMF.sample(20).style.background_gradient()

doc_id \ topic_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
2918,0.076837,0.0,0.0,0.0,0.0,0.0,0.0,0.002754,0.0,0.0,0.0,0.0,0.013531,0.0,0.0,0.0,0.0,0.0,0.0,0.061992,0.008616,0.0,0.0,0.0,0.114783,0.0,0.0,0.016324,0.0,0.026903,0.0,0.0,0.0,0.0,0.0,0.0,0.01375,0.0,0.0,0.0
363,0.0,0.120376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1428,0.004738,0.052093,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.134991,0.0,0.0,0.0,0.0,0.0,0.044666,0.0,0.0,0.0,0.0,0.045113,0.0,0.0,0.0,0.0,0.0,0.0
1013,0.025041,0.000344,0.0,0.0,0.0,0.0,0.0,0.000394,0.0,0.0,0.0,0.001522,0.0,0.065454,0.0,0.0,0.0,0.0,0.000589,0.012118,0.0,0.0,0.0,0.0048,0.0,0.0,0.0,0.041169,0.0,0.0,0.0,0.0,0.0,0.0,0.086236,0.002444,0.0,0.0,0.0,0.0
2465,0.033324,0.0,0.0,0.0,0.0,0.0,0.004635,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.031194,0.0,0.008126,0.0,0.116314,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1303,0.05941,0.016128,0.0,0.0,0.0,0.0,0.0,0.00572,0.0,0.003299,0.0,0.008255,0.0,0.00056,0.0,0.0,0.0,0.001815,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024422,0.020256,0.011149,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4802,0.028772,0.0,0.0,0.024391,0.0,0.044506,0.0,0.0,0.0,0.0,0.0,0.0,0.00221,0.00993,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007662,0.0,0.126017,0.0,0.055782,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003174,0.0
1062,0.03503,0.012069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.007186,0.0,0.066885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013127,0.0,0.0,1.6e-05,0.0,0.0,0.00683,0.0,0.279995,0.0,0.0,0.0,0.0
4030,0.027765,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.004644,0.0,0.115087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.006257,0.0,0.0,0.0,0.0,0.0,0.005573,0.0,0.0,0.0,0.0,0.0,0.0
4101,0.019237,0.0,0.003296,0.001008,0.001174,8.6e-05,0.0,0.0,0.0,0.00851,0.001578,0.0,0.0,0.004694,0.002748,0.0,0.0,0.0,0.00637,0.0,0.0,0.005997,0.0,0.069037,0.0,0.001413,0.0,0.007953,0.0,0.0,0.0,0.0,0.0,0.011838,0.0,0.0,0.0,0.0,0.0,0.0
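NMF factors the TF-IDF matrix into two non-negative matrices, a document-topic matrix W (our THETA) and a topic-term matrix H (our PHI), whose product approximates the input. Unlike LDA, the rows of W are not constrained to sum to 1, which is why the weights above are small and mostly zero. A minimal sketch on an invented matrix (`toy_X` and the other `toy_*` names are assumptions), with the regularization arguments omitted so it runs across scikit-learn versions:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative matrix with two obvious blocks (invented values)
toy_X = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.8, 1.0, 0.1, 0.0],
                  [0.0, 0.0, 1.0, 0.9],
                  [0.0, 0.1, 0.9, 1.0]])

toy_nmf = NMF(n_components=2, init='nndsvd', random_state=1, max_iter=500)
toy_W = toy_nmf.fit_transform(toy_X)   # documents x topics
toy_H = toy_nmf.components_            # topics x terms

# Relative reconstruction error of the factorization
toy_rel_error = np.linalg.norm(toy_X - toy_W @ toy_H) / np.linalg.norm(toy_X)
print(toy_W.round(3))
print(toy_rel_error)
```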


### PHI

In [41]:
db.PHI_NMF = pd.DataFrame(nmf_engine.components_, columns=db.VOCAB.index)

In [42]:
db.PHI_NMF.index.name = 'topic_id'
db.PHI_NMF.columns.name = 'term_str'

In [43]:
db.PHI_NMF.T.head().style.background_gradient()

term_str \ topic_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39
0,0.035353,0.0,0.0,0.0,0.0,0.027284,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.116405,0.0,0.0,0.002047,0.001414,0.0,0.0,0.0,0.063669,0.0,0.04347,0.0,0.0,0.0,0.056243,0.0,0.0,0.043224,0.0,0.097834,0.0,0.034699,0.0
10,0.109435,0.0,0.004901,0.032102,0.01447,0.005301,0.0,0.016045,0.0,0.0,0.012053,0.0,0.006583,0.0,0.0,0.0,0.009014,0.005356,0.008362,0.018112,0.0,0.0,0.0,0.016267,0.0,0.014023,0.0,0.112158,0.0,0.0,0.0,0.003657,0.0,0.0,0.0,0.019252,0.000192,0.0,0.0,0.0
100,0.030257,0.0,0.0,0.087716,0.003106,0.002059,0.0,0.0,0.0,0.0,0.000732,0.0,0.0,0.0,0.004645,0.006621,0.0,0.009532,0.0,0.037548,0.0,0.0,0.0,0.005558,0.0,0.001317,0.0,0.018275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101,0.016576,0.0,0.0,0.071231,0.0,0.0,0.0,0.020241,0.0,0.004006,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018192,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
102,0.019776,0.0,0.0,0.099619,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024264,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Topics

In [44]:
db.TOPICS_NMF = db.PHI_NMF.stack()\
    .to_frame('weight')\
    .groupby('topic_id')\
    .apply(lambda x: 
           x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str).rename_axis(columns={'term_str':'topic_features'})

In [45]:
db.TOPICS_NMF

topic_id \ topic_features,0,1,2,3,4,5,6
0,court,district,case,district court,appeals,court appeals,rule
1,mr justice,mr,justice,result,jackson,mr justice frankfurter,justice frankfurter
2,trial,defendant,counsel,witness,right,accused,criminal
3,ed 2d,ed,ct,2d,ante,id,scalia
4,search,fourth amendment,fourth,warrant,probable cause,arrest,police
5,tax,income,taxes,taxation,taxpayer,taxing,revenue
6,religious,religion,establishment clause,establishment,secular,church,free exercise
7,union,labor,board,employer,employees,bargaining,collective
8,arbitration,contract,agreement,collective,collective bargaining,bargaining,grievance
9,state,federal,law,court,courts,state court,jurisdiction


In [46]:
db.TOPICS_NMF['topwords'] = db.TOPICS_NMF.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [47]:
db.TOPICS_NMF['doc_weight_sum'] = db.THETA_NMF.sum()

In [48]:
db.TOPICS_NMF.iloc[:, 7:].sort_values('doc_weight_sum', ascending=False).style.bar()

topic_id \ topic_features,topwords,doc_weight_sum
0,"0 court, district, case, district court, appeals, court appeals, rule",148.294915
27,"27 congress, act, statute, legislative, power, section, secretary",77.40415
3,"3 ed 2d, ed, ct, 2d, ante, id, scalia",62.724377
9,"9 state, federal, law, court, courts, state court, jurisdiction",53.155546
13,"13 states, united states, united, government, constitution, power, foreign",52.746992
2,"2 trial, defendant, counsel, witness, right, accused, criminal",50.378963
23,"23 speech, ordinance, public, amendment, city, expression, commercial",47.044944
33,"33 concur, concurring, opinion, join, court, blackmun, court opinion",45.813438
17,"17 property, company, railroad, contract, owner, corporation, value",45.731211
4,"4 search, fourth amendment, fourth, warrant, probable cause, arrest, police",42.900933


# Save the Model

## Keep Corpus Label Info

This is effectively the LIB table.

In [49]:
# Drop the content columns by name; indexing with a set is unordered and fails in pandas 2.x
db.LABELS = corpus.drop(columns=['doc_key', 'doc_content', 'doc_original'], errors='ignore')

## Save Tables

In [50]:
db.save_tables()

/Users/rca2t1/Dropbox/Courses/NEH/TAPI_Topic_Models


In [51]:
# See if it worked ...

!ls -l ./db/{data_prefix}*.csv

-rw-r--r--@ 1 rca2t1  staff  90951293 Sep 11 10:13 ./db/ussc-BOW.csv
-rw-r--r--@ 1 rca2t1  staff  43188474 Sep 11 10:13 ./db/ussc-DTM.csv
-rw-r--r--@ 1 rca2t1  staff    123269 Sep 11 10:13 ./db/ussc-LABELS.csv
-rw-r--r--@ 1 rca2t1  staff   2917490 Sep 11 10:13 ./db/ussc-PHI.csv
-rw-r--r--@ 1 rca2t1  staff   1113140 Sep 11 10:13 ./db/ussc-PHI_NMF.csv
-rw-r--r--@ 1 rca2t1  staff   4757981 Sep 11 10:13 ./db/ussc-THETA.csv
-rw-r--r--@ 1 rca2t1  staff   1717321 Sep 11 10:13 ./db/ussc-THETA_NMF.csv
-rw-r--r--@ 1 rca2t1  staff      5398 Sep 11 10:13 ./db/ussc-TOPICS.csv
-rw-r--r--@ 1 rca2t1  staff      6378 Sep 11 10:13 ./db/ussc-TOPICS_NMF.csv
-rw-r--r--@ 1 rca2t1  staff    145820 Sep 11 10:13 ./db/ussc-VOCAB.csv
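The listing above suggests one CSV file per table, named `<prefix>-<TABLE>.csv`. The internals of `tapi` are not shown here, so the following is only a hypothetical sketch of that pattern: `save_tables_sketch` and the toy tables are invented for illustration and are not the `tapi` implementation.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_tables_sketch(tables, prefix, out_dir):
    """Hypothetical helper: write each DataFrame to <out_dir>/<prefix>-<NAME>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, table in tables.items():
        path = out_dir / f"{prefix}-{name}.csv"
        table.to_csv(path)  # the index is written as the first column
        paths.append(path)
    return paths


# Toy tables, invented for illustration only
toy_tables = {
    'VOCAB': pd.DataFrame({'n': [3, 1]},
                          index=pd.Index(['court', 'tax'], name='term_str')),
    'TOPICS': pd.DataFrame({'topwords': ['0 court, tax']},
                           index=pd.Index([0], name='topic_id')),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_paths = save_tables_sketch(toy_tables, 'toy', tmp_dir)
    toy_saved_names = sorted(p.name for p in toy_paths)
    # Read one table back to confirm the index survives the round trip
    toy_reloaded = pd.read_csv(Path(tmp_dir) / 'toy-VOCAB.csv', index_col=0)

print(toy_saved_names)
```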

