# Topic Modeling with scikit-learn

In this notebook we create a topic model from our corpus using scikit-learn. We save the results and explore them in a separate notebook.

# Set Up

## Imports

In [1]:
import pandas as pd
import numpy as np

## Configuration

In [2]:
corpus_file = './corpora/jstor_hyperparameter-tapi.csv'
db_dir = './db'
data_prefix = corpus_file.split('/')[-1].split('-')[0]
csv_sep = '|'

In [3]:
data_prefix

'jstor_hyperparameter'

## Parameters

In [4]:
n_terms = 4000 # Vocabulary size
ngram_range = (1,4)
use_tfidf = True
n_topics = 10 # Number of topics
max_iter = 5 # Number of iterations for topic model

In [5]:
topic_cols = [t for t in range(n_topics)]

# Import Corpus Data

We import a corpus in our standard format: a pipe-delimited CSV with one document per row.

In [6]:
corpus = pd.read_csv(corpus_file, sep=csv_sep)
corpus.index.name = 'doc_id'

## Inspect contents

In [7]:
corpus.head()

doc_id,doc_content,doc_title,doc_url,doc_key,doc_date,doc_year,doc_lang,doc_tdmcat,doc_srccat
0,Two Bayesian optimal design criteria for hiera...,BAYESIAN DESIGNS FOR HIERARCHICAL LINEAR MODELS,http://www.jstor.org/stable/24310154,24310154,2012-01-01,2012,['eng'],['Mathematics - Mathematical logic'],"['Mathematics', 'Science and Mathematics', 'St..."
1,A regular supply of applicants to Queen's Univ...,Bayesian Break-Point Forecasting in Parallel T...,http://www.jstor.org/stable/3314857,3314857,1987-03-01,1987,['eng'],['Mathematics - Applied mathematics'],"['Science & Mathematics', 'Statistics']"
2,Multivariate hierarchical Bayesian models prov...,Inferring Upon Heterogeneous Associations in D...,http://www.jstor.org/stable/23208850,23208850,2012-04-01,2012,['eng'],['Applied sciences - Engineering'],"['Science & Mathematics', 'Agriculture', 'Stat..."
3,The sampling/importance resampling algorithm i...,POOL SIZE SELECTION FOR THE SAMPLING/IMPORTANC...,http://www.jstor.org/stable/24307704,24307704,2007-07-01,2007,['eng'],['Mathematics - Applied mathematics'],"['Mathematics', 'Science and Mathematics', 'St..."
4,Spline and generalized spline smoothing is sho...,"Improper Priors, Spline Smoothing and the Prob...",http://www.jstor.org/stable/2984701,2984701,1978-01-01,1978,['eng'],['Mathematics - Mathematical logic'],"['Science & Mathematics', 'Statistics']"


In [8]:
corpus.sample(5)

doc_id,doc_content,doc_title,doc_url,doc_key,doc_date,doc_year,doc_lang,doc_tdmcat,doc_srccat
30,We propose a probability model for random part...,A Product Partition Model With Regression on C...,http://www.jstor.org/stable/23113387,23113387,2011-03-01,2011,['eng'],['Mathematics - Mathematical logic'],"['Science & Mathematics', 'Computer Science', ..."
1042,A set of unknown normal means (treatment effec...,A Bayesian Approach to Ranking and Selection o...,http://www.jstor.org/stable/2288851,2288851,1988-06-01,1988,['eng'],['Mathematics - Mathematical logic'],"['Science & Mathematics', 'Statistics']"
447,Many popular Bayesian nonparametric priors can...,Generalized Species Sampling Priors With Laten...,http://www.jstor.org/stable/24247385,24247385,2014-12-01,2014,['eng'],['Applied sciences - Engineering'],"['Science & Mathematics', 'Statistics']"
1241,We are interested in predicting one or more co...,A Hierarchical Model for Quantifying Forest Va...,http://www.jstor.org/stable/41415531,41415531,2011-03-01,2011,['eng'],['Applied sciences - Engineering'],"['Science and Mathematics', 'Statistics']"
131,The first five sections of the paper describe ...,The 1988 Wald Memorial Lectures: The Present P...,http://www.jstor.org/stable/2245880,2245880,1990-02-01,1990,['eng'],"['Philosophy - Logic', 'Mathematics - Mathemat...","['Science and Mathematics', 'Statistics']"


In [9]:
corpus.shape

(1406, 9)

In [10]:
corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1406 entries, 0 to 1405
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   doc_content  1406 non-null   object
 1   doc_title    1406 non-null   object
 2   doc_url      1406 non-null   object
 3   doc_key      1406 non-null   object
 4   doc_date     1406 non-null   object
 5   doc_year     1406 non-null   int64 
 6   doc_lang     1406 non-null   object
 7   doc_tdmcat   1404 non-null   object
 8   doc_srccat   1401 non-null   object
dtypes: int64(1), object(8)
memory usage: 99.0+ KB


# Create Bag-of-Words 

i.e., a __count vector space__

We use scikit-learn's `CountVectorizer` to convert our corpus of documents into a document-term vector space of word counts.

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [12]:
count_engine = CountVectorizer(max_features=n_terms, stop_words='english', ngram_range=ngram_range)
count_model = count_engine.fit_transform(corpus.doc_content)
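
To make the document-term idea concrete, here is a minimal pure-Python sketch of n-gram counting on a two-document toy corpus. It is an illustration only, not how `CountVectorizer` is implemented: the real class also lowercases, drops stop words, and caps the vocabulary at `max_features`. The `ngrams` helper and the toy `docs` are assumptions made up for this example.

```python
from collections import Counter

def ngrams(tokens, ngram_range=(1, 2)):
    """Yield space-joined n-grams for every n in the inclusive range."""
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

# Hypothetical toy corpus keyed by doc_id
docs = {
    0: "bayesian model selection",
    1: "bayesian model averaging for model selection",
}

# One Counter per document: the "thin" form that stores only nonzero counts
bow = {doc_id: Counter(ngrams(text.split())) for doc_id, text in docs.items()}

# The dense form: one column per vocabulary term, zeros included
vocab = sorted(set().union(*bow.values()))
dtm = [[bow[doc_id][term] for term in vocab] for doc_id in docs]
```

Each row of `dtm` corresponds to a document and each column to a unigram or bigram, which is the same shape of object that `count_model` holds in sparse form.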

## Get Generated VOCAB

In [13]:
VOCAB = pd.DataFrame(count_engine.get_feature_names_out(), columns=['term_str']) # get_feature_names() was removed in scikit-learn 1.2
VOCAB = VOCAB.set_index('term_str')
# VOCAB.index.name = 'term_id' # For convenience, we'll use strings for IDs

## Get Generated Bag-of-Words

We do this just to show what the count vectorizer produced. `DTM` stands for document-term matrix. We convert this sparse matrix into a "thin" dataframe that keeps only the terms with nonzero counts in each document.

In [15]:
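# Dense document-term matrix: one row per document, one column per term
DTM = pd.DataFrame(count_model.toarray(), index=corpus.index, columns=VOCAB.index)
# Long ("thin") form: one row per (doc_id, term_str) pair with a nonzero count
BOW = DTM.stack().to_frame('n')
BOW = BOW[BOW.n > 0]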

In [16]:
DTM.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1406 entries, 0 to 1405
Columns: 4000 entries, 000 to θi
dtypes: int64(4000)
memory usage: 42.9 MB


In [17]:
BOW.info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 103759 entries, (0, 'approximate') to (1405, 'various')
Columns: 1 entries, n to n
dtypes: int64(1)
memory usage: 1.3+ MB


## Compute TF-IDF

In [18]:
tfidf_engine = TfidfTransformer()
tfidf_model = tfidf_engine.fit_transform(count_model)
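
For reference, with its default settings `TfidfTransformer` multiplies each raw count by a smoothed inverse document frequency, `idf = ln((1 + n_docs) / (1 + df)) + 1`, and then scales every document row to unit L2 length. The sketch below reproduces that arithmetic in plain Python on a made-up 3x3 count matrix; the matrix and variable names are assumptions for illustration.

```python
import math

# Hypothetical counts: rows are documents, columns are terms
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
]
n_docs = len(counts)
n_cols = len(counts[0])

# Document frequency: in how many documents each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_cols)]

# Smoothed IDF, matching the formula quoted above
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
```

The middle term occurs in every document, so its IDF is exactly 1, the minimum possible; the rarer terms are weighted up relative to it.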

In [19]:
TFIDF = pd.DataFrame(tfidf_model.toarray(), index=corpus.index, columns=VOCAB.index)

In [20]:
BOW['tfidf'] = TFIDF.stack()

In [21]:
BOW

doc_id,term_str,n,tfidf
0,approximate,1,0.066517
0,bayesian,1,0.023505
0,case,2,0.099956
0,cases,1,0.059501
0,compare,1,0.056196
...,...,...,...
1405,stick breaking,1,0.077530
1405,terms,1,0.046874
1405,text,1,0.080439
1405,unknown,1,0.045688


## Add Features to VOCAB

In [22]:
VOCAB['ngram_len'] = [len(term.split()) for term in VOCAB.index] # Number of words in each n-gram
VOCAB['n'] = DTM.sum()
VOCAB['tfidf_mean'] = TFIDF.mean()

In [23]:
VOCAB.ngram_len.value_counts()

1    2918
2     960
3     101
4      21
Name: ngram_len, dtype: int64

In [24]:
# VOCAB[VOCAB.ngram_len == VOCAB.ngram_len.max()].sort_values('n', ascending=False)

In [25]:
# VOCAB.sort_values('n', ascending=False)

In [26]:
# VOCAB[VOCAB.ngram_len > 1].sort_values('n', ascending=False)

In [27]:
# VOCAB[VOCAB.ngram_len > 1].sort_values('tfidf_mean', ascending=False)

# Generate Topic Models

We run scikit-learn's [LatentDirichletAllocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation) algorithm and extract two tables: THETA, the document-topic weights, and PHI, the topic-term weights.

In [28]:
from sklearn.decomposition import NMF, LatentDirichletAllocation as LDA

## Using LDA

In [29]:
lda = LDA(n_components=n_topics, max_iter=max_iter, learning_offset=50., random_state=0)

### THETA

In [30]:
if use_tfidf:
    THETA = pd.DataFrame(lda.fit_transform(tfidf_model), index=corpus.index)
else:
    THETA = pd.DataFrame(lda.fit_transform(count_model), index=corpus.index)
THETA.index.name = 'doc_id'
THETA.columns.name = 'topic_id'

In [31]:
THETA.sample(20).style.background_gradient()

doc_id / topic_id,0,1,2,3,4,5,6,7,8,9
825,0.028749,0.028749,0.02875,0.028751,0.028749,0.028749,0.741251,0.028751,0.028749,0.02875
808,0.010271,0.010271,0.010289,0.1043,0.010284,0.010274,0.737267,0.086495,0.010273,0.010276
1014,0.014683,0.014683,0.014694,0.057978,0.014683,0.014684,0.575155,0.014692,0.014683,0.264064
681,0.01068,0.01068,0.010681,0.010681,0.01068,0.01068,0.903875,0.010682,0.01068,0.010681
1350,0.013855,0.013851,0.013955,0.013855,0.013851,0.01387,0.66658,0.02177,0.214489,0.013925
379,0.012875,0.012875,0.012877,0.012877,0.012875,0.012879,0.56809,0.328885,0.01289,0.012878
1244,0.014029,0.014029,0.014033,0.014032,0.014029,0.014029,0.425163,0.462596,0.014029,0.01403
1332,0.010478,0.010478,0.010478,0.010479,0.010478,0.010478,0.160555,0.010479,0.010478,0.755619
773,0.01261,0.012615,0.038351,0.012621,0.01261,0.012611,0.673786,0.199564,0.012615,0.012617
1193,0.011612,0.011612,0.011615,0.011616,0.011613,0.011612,0.895481,0.011613,0.011612,0.011613


In [32]:
# THETA.sum(1).sum()

### PHI

In [33]:
PHI = pd.DataFrame(lda.components_, columns=VOCAB.index)
PHI.index.name = 'topic_id'
PHI.columns.name  = 'term_str'

In [34]:
PHI.T.head().style.background_gradient()

term_str / topic_id,0,1,2,3,4,5,6,7,8,9
000,0.1,0.467921,0.293428,0.765594,0.1,0.100005,0.101092,0.287468,0.192108,0.163059
10,0.1,0.159717,0.235152,0.161052,0.1,0.1,1.90611,0.659263,0.10002,0.1069
11,0.160186,0.1,0.25057,0.489035,0.100001,0.100015,0.103241,0.196769,0.1,0.484361
15,0.100174,0.107632,0.1,0.384663,0.1,0.100002,1.654074,0.222629,0.100003,0.1
1970s,0.1,0.1,0.1,0.720603,0.231773,0.100007,0.100074,0.294247,0.1,0.251533
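
Note that the PHI values come from `lda.components_`, which scikit-learn describes as pseudo-counts rather than probabilities, so the rows do not sum to one. The recurring floor of 0.1 in the output is likely the default topic-word prior (1 / `n_components`, with 10 topics here). A minimal sketch of row normalization on made-up weights, assuming a plain dict-of-dicts in place of the dataframe:

```python
# Hypothetical topic-term weights shaped like PHI (topic_id -> term -> weight)
toy_phi = {
    0: {'gene': 4.9, 'prior': 0.1, 'species': 1.0},
    1: {'gene': 0.1, 'prior': 7.7, 'species': 0.2},
}

def normalize(row):
    """Divide each weight by the row total so the row sums to one."""
    total = sum(row.values())
    return {term: weight / total for term, weight in row.items()}

# Each topic becomes a probability distribution over terms
toy_phi_norm = {topic: normalize(row) for topic, row in toy_phi.items()}
```

On the real table the same step can be written as `PHI.div(PHI.sum(axis=1), axis=0)`. Ranking terms within a topic gives the same order either way, but normalized rows are easier to compare across topics.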


### Create Topic Glosses

In [35]:
n_top_words = 7

In [36]:
TOPICS = PHI.stack().to_frame().rename(columns={0:'weight'})\
    .groupby('topic_id')\
    .apply(lambda x: 
           x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)

In [37]:
TOPICS

topic_id / term_str,0,1,2,3,4,5,6
0,imputation,resonance,imaging,multiple imputation,magnetic,magnetic resonance,resonance imaging
1,species,ozone,species richness,richness,incidence,morphological,climate
2,designs,bayesian,simulator,statistical,distributions,methods,data
3,gene,expression,genes,model,data,models,species
4,dose,toxicity,mdl,θi,loglinear,stationarity,cure
5,ps,structural equation,sectional,cross sectional,assumptions,molecules,gpu
6,model,data,models,bayesian,prior,distribution,approach
7,model,tests,data,models,bayesian,assumptions,mcmc
8,earthquake,pollution,air pollution,dsge,bonds,stress,topic
9,des,les,la,et,une,pour,en
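
The gloss step amounts to sorting each topic's term weights in descending order and keeping the first few terms. A small pure-Python sketch of that logic follows, using made-up weights; `top_terms` and `toy_phi` are hypothetical names, not part of the notebook.

```python
def top_terms(weights, n):
    """Return the n highest-weighted terms, heaviest first."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:n]]

# Hypothetical topic-term weights (topic_id -> term -> weight)
toy_phi = {
    0: {'gene': 4.9, 'prior': 0.1, 'expression': 3.2, 'species': 1.0},
    1: {'gene': 0.1, 'prior': 7.7, 'expression': 0.3, 'species': 0.2},
}

# Build a label per topic: the topic id followed by its top terms
glosses = {
    topic: f"{topic} " + ', '.join(top_terms(weights, 2))
    for topic, weights in toy_phi.items()
}
# glosses[0] is '0 gene, expression'
```

Generic terms such as "model" and "data" carry high weight in several of the topics above, so glosses built this way can overlap; ranking by a relevance score instead of raw weight is one possible refinement.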


In [38]:
TOPICS['topwords'] = TOPICS.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [39]:
TOPICS['doc_weight_sum'] = THETA.sum()

## Using NMF

In [40]:
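# Note: `alpha` was removed in scikit-learn 1.2; newer releases use `alpha_W` and `alpha_H` instead
nmf_engine = NMF(n_components=n_topics, init='nndsvd', random_state=1, alpha=.1, l1_ratio=.5)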

In [41]:
THETA_NMF = pd.DataFrame(nmf_engine.fit_transform(tfidf_model), index=corpus.index)
THETA_NMF.columns.name = 'topic_id'
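
For intuition about what NMF computes, the sketch below factors a small nonnegative matrix V into W (documents by topics, the role of THETA_NMF) and H (topics by terms, the role of PHI_NMF) using the classic multiplicative update rules. This is a different technique from scikit-learn's default coordinate-descent solver, it omits the regularization set above, and the toy matrix and helper names are assumptions for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor V (n x m) into nonnegative W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Toy "document-term" matrix with two obvious blocks of co-occurring terms
V = [
    [2, 2, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 3],
    [0, 0, 1, 1],
]
W, H = nmf(V, k=2)
```

Because every update multiplies by a nonnegative ratio, W and H stay nonnegative throughout, which is why NMF document weights such as those in THETA_NMF can be exactly zero while LDA's THETA rows are always strictly positive.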

In [57]:
THETA_NMF.sample(20).style.background_gradient()

doc_id / topic_id,0,1,2,3,4,5,6,7,8,9
948,0.037622,0.0,0.053502,0.012304,0.0,0.0,0.015117,0.0,0.0,0.317648
557,0.065663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1265,0.036988,0.0,0.0,0.009396,0.0,0.0,0.0,0.0,0.0,0.0
633,0.060368,0.0,0.080028,0.0,0.0,0.0,0.009255,0.0,0.0,0.0
277,0.070159,0.0,0.0,0.0,0.0,0.014152,0.0,0.0,0.0,0.0
1051,0.033152,0.0,0.0,0.014666,0.0,0.0,0.0,0.018486,0.0,0.0
1190,0.058591,0.0,0.0,0.001162,0.0,0.0,0.0,0.0,0.0,0.0
1255,0.035362,0.242218,0.009529,0.0,0.0,0.0,0.0,0.0,0.0,0.0
321,0.065633,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
815,0.074633,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.128439,0.0


In [58]:
PHI_NMF = pd.DataFrame(nmf_engine.components_, columns=VOCAB.index)

In [59]:
PHI_NMF.index.name = 'topic_id'
PHI_NMF.columns.name  = 'term_str'

In [60]:
PHI_NMF.T.head().style.background_gradient()

term_str / topic_id,0,1,2,3,4,5,6,7,8,9
000,0.002938,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005307,0.001521
10,0.017489,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000443,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0.004849,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002139,0.0
1970s,0.0,0.0,0.0,0.0,0.092153,0.0,0.0,0.0,0.0,0.0


In [61]:
TOPICS_NMF = PHI_NMF.stack().to_frame().rename(columns={0:'weight'})\
    .groupby('topic_id')\
    .apply(lambda x: 
           x.weight.sort_values(ascending=False)\
               .head(n_top_words)\
               .reset_index()\
               .drop(columns='topic_id')\
               .term_str)

In [62]:
TOPICS_NMF

topic_id / term_str,0,1,2,3,4,5,6
0,model,data,models,bayesian,approach,time,using
1,des,les,la,et,une,pour,le
2,markov,chain,markov chain,monte,monte carlo,carlo,markov chain monte
3,prior,distribution,distributions,posterior,priors,prior distributions,prior distribution
4,inflation,policy,monetary,monetary policy,shocks,forecasts,rate
5,effects,random effects,random,mixed,linear,linear mixed,mixed model
6,selection,variable selection,variable,model selection,regression,bayesian variable selection,bayesian variable
7,bayes,empirical bayes,empirical,intervals,estimators,eb,estimation
8,species,tree,population,abundance,growth,climate,forest
9,expression,gene,genes,gene expression,microarray,differentially expressed,differentially


In [63]:
TOPICS_NMF['topwords'] = TOPICS_NMF.apply(lambda x: str(x.name) + ' ' + ', '.join(x), 1)

### Add Doc Weights

In [64]:
TOPICS_NMF['doc_weight_sum'] = THETA_NMF.sum()

In [66]:
TOPICS_NMF[['topwords']]

topic_id,topwords
0,"0 model, data, models, bayesian, approach, tim..."
1,"1 des, les, la, et, une, pour, le"
2,"2 markov, chain, markov chain, monte, monte ca..."
3,"3 prior, distribution, distributions, posterio..."
4,"4 inflation, policy, monetary, monetary policy..."
5,"5 effects, random effects, random, mixed, line..."
6,"6 selection, variable selection, variable, mod..."
7,"7 bayes, empirical bayes, empirical, intervals..."
8,"8 species, tree, population, abundance, growth..."
9,"9 expression, gene, genes, gene expression, mi..."


# Save the Model

## Keep Corpus Label Info

In [67]:
# Indexing with a set is not supported in pandas 2.x; drop() also keeps the original column order
LABELS = corpus.drop(columns=['doc_key', 'doc_content', 'doc_original'], errors='ignore')

## Save each dataframe

This could of course be generalized as a function or class method.

In [68]:
LABELS.to_csv(f"{db_dir}/{data_prefix}-LABELS.csv", index=True)
VOCAB.to_csv(f"{db_dir}/{data_prefix}-VOCAB.csv", index=True)
BOW.to_csv(f"{db_dir}/{data_prefix}-BOW.csv", index=True)
TOPICS.to_csv(f"{db_dir}/{data_prefix}-TOPICS.csv", index=True)
THETA.to_csv(f"{db_dir}/{data_prefix}-THETA.csv", index=True)
PHI.to_csv(f"{db_dir}/{data_prefix}-PHI.csv", index=True)
TOPICS_NMF.to_csv(f"{db_dir}/{data_prefix}-TOPICS_NMF.csv", index=True)
THETA_NMF.to_csv(f"{db_dir}/{data_prefix}-THETA_NMF.csv", index=True)
PHI_NMF.to_csv(f"{db_dir}/{data_prefix}-PHI_NMF.csv", index=True)
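
One possible way to fold the repeated calls above into a function is sketched here. `save_tables` is a hypothetical helper, not part of the notebook; it assumes each table exposes a pandas-style `to_csv(path, index=...)` method and reuses the `{db_dir}/{data_prefix}-{NAME}.csv` naming scheme.

```python
from pathlib import Path

def save_tables(tables, db_dir, data_prefix, **to_csv_kwargs):
    """Write each table to {db_dir}/{data_prefix}-{NAME}.csv and return the paths.

    `tables` maps a table name (e.g. 'VOCAB') to an object with a
    pandas-style `to_csv` method.
    """
    out_dir = Path(db_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create ./db if it is missing
    paths = []
    for name, table in tables.items():
        path = out_dir / f"{data_prefix}-{name}.csv"
        table.to_csv(path, index=True, **to_csv_kwargs)
        paths.append(path)
    return paths

# Intended use in this notebook (assumes the dataframes defined above):
# save_tables({'LABELS': LABELS, 'VOCAB': VOCAB, 'BOW': BOW, 'TOPICS': TOPICS,
#              'THETA': THETA, 'PHI': PHI}, db_dir, data_prefix)
```

Passing the tables as a dictionary keeps the file names next to the objects they label, so adding or removing a table is a one-line change.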

In [69]:
LABELS.iloc[182]

doc_title     Regression Selection Strategies and Revealed P...
doc_url                     http://www.jstor.org/stable/2286604
doc_lang                                                ['eng']
doc_srccat              ['Science & Mathematics', 'Statistics']
doc_year                                                   1978
doc_date                                             1978-09-01
doc_tdmcat               ['Mathematics - Mathematical objects']
Name: 182, dtype: object