# Metadata

```yaml
Course:    DS 5001 
Module:    08 Lab
Topic:     Gibbs Sampler
Author:    R.C. Alvarado
Date:      03 March 2023 (revised)
```
**Purpose:** We develop a simple topic modeler using collapsed Gibbs sampling as described by [Griffiths and Steyvers (2004)](https://collab.its.virginia.edu/access/content/group/b9e58ce7-0f44-48fe-9861-b7a7657f551a/Articles/sciencetopics.pdf).

# Setup

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# Convert F1 Corpus 

We want to convert any given F1 corpus (DOC) into unannotated TOKEN and VOCAB tables.

This is so we can work with ad hoc training data.

In [2]:
class Corpus:

    def __init__(self, doc_list:list, doc_col='doc_str'):
        "Create DOC table from F1 list"
        self.doc_col = doc_col
        self.DOC = pd.DataFrame(doc_list, columns=[doc_col])
        self.DOC.index.name = 'doc_id'
        self.stop_words = set(stopwords.words('english')) 
        
    def convert_corpus(self):        
        "Convert raw docs into TOKEN and BOW tables"
        tokens = []
        for i, row in self.DOC.iterrows():
            # Use the configured column name rather than a hard-coded one
            for j, token in enumerate(row[self.doc_col].split()):
                # Strip punctuation and underscores, then lowercase
                term_str = re.sub(r'[\W_]+', '', token).lower()
                # Skip stop words and tokens that were only punctuation
                if term_str and term_str not in self.stop_words:
                    tokens.append((i, j, term_str))
        self.TOKEN = pd.DataFrame(tokens, columns=['doc_id','token_num','term_str'])\
            .set_index(['doc_id','token_num'])
        self.BOW = self.TOKEN.groupby(['doc_id','term_str']).term_str.count().to_frame('n')
        return self
        
    def extract_vocab(self):
        "Extract vocabulary"
        self.VOCAB = self.TOKEN.term_str.value_counts().to_frame('n')
        self.VOCAB.index.name = 'term_str'   
        return self 
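
To see what the normalization step in `convert_corpus` does to individual tokens, here is a minimal sketch that applies the same regular expression outside the class. The `normalize` helper and the tiny `stop_words` set are hypothetical stand-ins for illustration; the notebook itself uses NLTK's English stop word list.

```python
import re

# Hypothetical mini stop list; the notebook uses NLTK's English list instead
stop_words = {'i', 'a', 'and', 'for'}

def normalize(token: str) -> str:
    """Strip non-word characters and underscores, then lowercase."""
    return re.sub(r'[\W_]+', '', token).lower()

doc = "I ate a banana and a spinach smoothie for breakfast."
terms = [normalize(t) for t in doc.split()]
kept = [t for t in terms if t and t not in stop_words]
print(kept)  # ['ate', 'banana', 'spinach', 'smoothie', 'breakfast']

# Punctuation inside a token is removed rather than split on
print(normalize('scikit-learn'))  # scikitlearn
print(normalize('C++'))           # c
```

Because the documents are split on whitespace, a multiword term such as "Big Data" becomes two separate terms, and a term such as "C++" collapses to a single letter.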

In [3]:
raw_docs = """
I ate a banana and a spinach smoothie for breakfast.
I like to eat broccoli and bananas.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
""".split("\n")[1:-1]

In [4]:
corpus1 = Corpus(raw_docs).convert_corpus().extract_vocab()

In [5]:
corpus1.BOW

doc_id,term_str,n
0,ate,1
0,banana,1
0,breakfast,1
0,smoothie,1
0,spinach,1
1,bananas,1
1,broccoli,1
1,eat,1
1,like,1
2,chinchillas,1


# Gibbs Sampler

We sample each document and word combination in the BOW table. In each case,
we are looking for two values:

* the topic with which a word has been most frequently labeled
* the topic with which the document has the most labeled words

We combine these values in order to align the label of the current word with the rest of the data.\
If a topic is highly associated with both the word and the document, then that topic will get a high value.
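
As a rough illustration of how the two values combine, the sketch below resamples a single word using made-up counts for three topics. All numbers and variable names are assumptions for the example; it follows the same smooth, normalize, and multiply pattern as the `GibbsSampler` class in this notebook, but with plain NumPy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts for K = 3 topics, with the current word already removed
doc_topic = np.array([3, 0, 1])    # words in this document labeled with each topic
term_topic = np.array([2, 1, 0])   # times this term is labeled with each topic
a, b = 1.0, 0.1                    # smoothing values (assumed)

# Smooth and normalize each count vector, then combine them
p_doc = (doc_topic + a) / (doc_topic + a).sum()
p_term = (term_topic + b) / (term_topic + b).sum()
weights = p_doc * p_term
p_topic = weights / weights.sum()

# Draw the new topic label in proportion to the combined weights
new_topic = rng.choice(len(p_topic), p=p_topic)
```

Topic 0 gets the largest weight here because it is the most common label for both the document and the term, but the other topics can still be drawn with small probability.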

Note that all that is going on here is a sorting operation -- the random assignment does not predict anything.\
Instead, we are just gathering words under topics and topics under documents.

**From Darling 2011:**
<hr />
<div style="float:left;">
<img src="images/gibbs-algo-text.png" width="650px" />
<img src="images/gibbs-algo.png" width="650px" />
</div>
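
For reference, the collapsed Gibbs update in Griffiths and Steyvers (2004) has the general form shown here. The notation is an assumption chosen to line up with the code: $n_{d,k}$ is the number of words in document $d$ labeled with topic $k$ (`THETA`), $n_{k,w}$ is the number of times term $w$ is labeled with topic $k$ (`PHI`), $n_k$ is the total count for topic $k$ (`TOPIC.n`), $W$ is the vocabulary size, $K$ is the number of topics, and $\alpha$ and $\beta$ play the roles of `a` and `b`. All counts exclude the word currently being resampled.

$$
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{k,w_i} + \beta}{n_{k} + W\beta} \cdot \frac{n_{d_i,k} + \alpha}{n_{d_i} + K\alpha}
$$

The second denominator does not depend on $k$, so it can be dropped when sampling. That gives the `A * (B/C)` form in the commented-out lines of `compute_topics`. The active code instead normalizes `A` and `B` across topics before multiplying them.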

In [6]:
class GibbsSampler:

    def __init__(self, n_topics=10, iters=100, a = 1, b = .1):

        # Map arguments
        self.n_topics = n_topics
        self.iters = iters
        self.a = a
        self.b = b 
        
        # Define topic table
        topic_names = [f"T{str(t).zfill(len(str(self.n_topics)))}" for t in range(self.n_topics)]
        self.TOPIC = pd.DataFrame({'top_terms':'TBD'}, index=topic_names)
        self.TOPIC.index.name = 'topic_id'

    def add_corpus(self, corpus:Corpus):
        
        # Copy BOW and assign random topics        
        self.BOW = corpus.BOW.copy()
        self.BOW['topic_id'] = self.TOPIC.sample(len(self.BOW), replace=True).index
        
        # Get vocab length
        self.VOCAB = corpus.VOCAB
        self.W = self.VOCAB.shape[0]       
        
        return self
            
    def compute_topics(self):

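        # Create count tables, keeping every topic even if none was assigned at random
        self.THETA = self.BOW.value_counts(['doc_id', 'topic_id']).unstack()\
            .reindex(columns=self.TOPIC.index).fillna(0)
        self.PHI = self.BOW.value_counts(['topic_id', 'term_str']).unstack()\
            .reindex(index=self.TOPIC.index).fillna(0)
        self.TOPIC['n'] = self.BOW.value_counts('topic_id')\
            .reindex(self.TOPIC.index, fill_value=0)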
        
        # Iterate 
        for i in tqdm(range(self.iters)):  
            
            # Estimate topic per word
            for doc_id, term_str in self.BOW.index:
            
                # Get the currently assigned topic
                z = self.BOW.loc[(doc_id, term_str)].topic_id

                # ... and remove from counts
                self.THETA.loc[doc_id, z] -= 1
                self.PHI.loc[z, term_str] -= 1
                self.TOPIC.loc[z, 'n']    -= 1
                                
                # Estimate probability of each topic for this word
                # A and B are topic vectors of smoothed counts in a given context
                A = self.THETA.loc[doc_id] + self.a # Context = document
                B = self.PHI[term_str] + self.b # Context = term
                AP = A / A.sum() # Share of each topic within the document
                BP = B / B.sum() # Share of each topic within the term
                PZ = AP * BP

                # Darling 2011
                # C = self.TOPIC['n'] + (self.b * self.W) # Context = corpus
                # PZ = A * (B/C)
                
                # Sample from new distribution and reassign
                z2 = PZ.sample(weights=PZ).index[0]
                self.THETA.loc[doc_id, z2] += 1
                self.PHI.loc[z2, term_str] += 1
                self.TOPIC.loc[z2, 'n']    += 1
                self.BOW.loc[(doc_id, term_str), 'topic_id'] = z2
                
            # TODO: Compute perplexity of each iteration

        return self
    
    def get_top_terms(self):
        # Get top words for each topic
        for topic_id in self.TOPIC.index:
            self.TOPIC.loc[topic_id, 'top_terms'] = ' '.join(self.PHI.loc[topic_id, self.PHI.loc[topic_id] > 0].index.values)
            
        return self
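
The comment at the end of the sampling loop mentions computing perplexity, but the class does not implement it. Here is a minimal sketch of one way to do so, assuming the usual definition: the exponential of the negative average log-likelihood per token. The `perplexity` function, its arguments, and the toy tables are all hypothetical; the count tables are assumed to have the same shape as `THETA` and `PHI`.

```python
import numpy as np
import pandas as pd

def perplexity(bow_n, theta_counts, phi_counts, a=1.0, b=0.1):
    """Perplexity of a bag-of-words under smoothed count tables (a sketch)."""
    # Smoothed P(topic | doc): rows are documents, columns are topics
    theta = (theta_counts + a).div((theta_counts + a).sum(axis=1), axis=0)
    # Smoothed P(term | topic): rows are topics, columns are terms
    phi = (phi_counts + b).div((phi_counts + b).sum(axis=1), axis=0)
    # P(term | doc) = sum over topics of P(topic | doc) * P(term | topic)
    p_wd = theta.dot(phi)
    log_lik = sum(n * np.log(p_wd.loc[d, w]) for (d, w), n in bow_n.items())
    return float(np.exp(-log_lik / bow_n.sum()))

# Toy tables: two documents, two topics, four terms (made-up values)
topics = ['T0', 'T1']
theta_counts = pd.DataFrame([[2, 0], [0, 2]], index=[0, 1], columns=topics)
phi_counts = pd.DataFrame([[1, 1, 0, 0], [0, 0, 1, 1]], index=topics,
                          columns=['banana', 'spinach', 'kitten', 'hamster'])
bow_n = pd.Series(1, index=pd.MultiIndex.from_tuples(
    [(0, 'banana'), (0, 'spinach'), (1, 'kitten'), (1, 'hamster')],
    names=['doc_id', 'term_str']))

pp = perplexity(bow_n, theta_counts, phi_counts)
```

Lower values mean the count tables explain the corpus better. A uniform guess over this four-term vocabulary would score 4, so a well-separated toy model like this one should land below that.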

In [7]:
def do_all(f1_list:list, k=4, iters=100):
    corpus = Corpus(f1_list).convert_corpus().extract_vocab()
    model = GibbsSampler(n_topics=k, iters=iters, a=1, b=1).add_corpus(corpus).compute_topics().get_top_terms()
    return corpus, model
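
Because the sampler starts from random topic labels and resamples at random, two calls to `do_all` on the same data will generally produce different topics. The sketch below shows how a single draw can be made repeatable with the `random_state` parameter of the pandas `sample` method. The `draw_topic` helper is hypothetical and not part of the notebook.

```python
import pandas as pd

def draw_topic(pz: pd.Series, seed: int):
    """Draw one topic label from a weight vector with a fixed seed."""
    return pz.sample(weights=pz, random_state=seed).index[0]

# Made-up topic weights for illustration
pz = pd.Series([0.7, 0.2, 0.1], index=['T0', 'T1', 'T2'])

first = draw_topic(pz, seed=42)
second = draw_topic(pz, seed=42)
print(first == second)  # True
```

Inside a sampling loop, a fixed integer seed would repeat the same random draw every time, so a single shared `numpy.random.RandomState` object passed as `random_state` is probably the better fit.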

# Demo 1

We use a toy example to see if the method works.\
Because our code is not very efficient, we keep the corpus small.

## Data

A small F1 corpus.

In [8]:
raw_docs = """
I ate a banana and a spinach smoothie for breakfast.
I like to eat broccoli and bananas.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
""".split("\n")[1:-1]

## Process

In [9]:
cp1, tm1 = do_all(raw_docs, k=5, iters=500)

100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [00:19<00:00, 25.32it/s]


In [10]:
tm1.TOPIC

topic_id,top_terms,n
T0,breakfast broccoli cute yesterday,4
T1,ate cute kittens piece smoothie,5
T2,adopted broccoli eat kitten sister spinach,6
T3,bananas chinchillas like munching,4
T4,banana hamster look,3


In [11]:
cp1.DOC.join(tm1.THETA.astype('int')).style.background_gradient(axis=None)

doc_id,doc_str,T0,T1,T2,T3,T4
0,I ate a banana and a spinach smoothie for breakfast.,1,2,1,0,1
1,I like to eat broccoli and bananas.,1,0,1,2,0
2,Chinchillas and kittens are cute.,1,1,0,1,0
3,My sister adopted a kitten yesterday.,1,0,3,0,0
4,Look at this cute hamster munching on a piece of broccoli.,0,2,1,1,2
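
The table above shows raw counts. A minimal sketch of turning them into smoothed document-topic proportions follows, using the counts copied from that table and the same `a=1` that `do_all` passes to the sampler. The variable names are assumptions for the example.

```python
import pandas as pd

# Document-topic counts copied from the table above
theta_counts = pd.DataFrame(
    [[1, 2, 1, 0, 1],
     [1, 0, 1, 2, 0],
     [1, 1, 0, 1, 0],
     [1, 0, 3, 0, 0],
     [0, 2, 1, 1, 2]],
    columns=['T0', 'T1', 'T2', 'T3', 'T4'])
theta_counts.index.name = 'doc_id'

a = 1  # smoothing value, matching the one used in do_all

# Add the smoothing value, then divide each row by its total
theta = (theta_counts + a).div((theta_counts + a).sum(axis=1), axis=0)

# Topic with the largest share in each document (ties go to the first topic)
dominant = theta.idxmax(axis=1)
```

With so few words per document, several rows have ties or near ties, which is one reason the topics from this toy corpus are hard to interpret.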


# Demo 2

## Data

In [12]:
some_documents = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]
raw_docs2  = [' '.join(item) for item in some_documents]

## Process

In [13]:
cp2, tm2 = do_all(raw_docs2, k=10, iters=500)

100%|████████████████████████████████████████████████████████████████████████████████| 500/500 [01:13<00:00,  6.76it/s]


In [14]:
tm2.TOPIC

topic_id,top_terms,n
T00,big data hbase java languages learning postgre...,11
T01,data databases libsvm neural python support,6
T02,c decision deep mysql statsmodels storm vector,9
T03,artificial haskell neural probability python s...,6
T04,deep learning machine networks pandas postgres r,7
T05,hadoop machines mapreduce nosql probability sc...,6
T06,data hbase java learning mathematics networks ...,12
T07,big cassandra hbase learning machine mongodb s...,8
T08,artificial cassandra hadoop libsvm mongodb spa...,7
T09,intelligence java mahout numpy python r regres...,10


In [15]:
cp2.DOC.join(tm2.THETA.astype('int')).style.background_gradient(axis=None)

doc_id,doc_str,T00,T01,T02,T03,T04,T05,T06,T07,T08,T09
0,Hadoop Big Data HBase Java Spark Storm Cassandra,1,1,1,0,0,0,1,2,2,0
1,NoSQL MongoDB Cassandra HBase Postgres,1,0,0,0,0,1,1,1,1,0
2,Python scikit-learn scipy numpy statsmodels pandas,0,0,1,1,1,0,0,0,0,3
3,R Python statistics regression probability,0,1,0,0,1,0,1,1,0,1
4,machine learning regression decision trees libsvm,0,0,1,0,0,0,1,2,2,0
5,Python R Java C++ Haskell programming languages,1,0,1,1,0,0,1,0,0,3
6,statistics probability mathematics theory,2,0,0,1,0,0,1,0,0,0
7,machine learning scikit-learn Mahout neural networks,0,0,0,1,3,1,0,0,0,1
8,neural networks deep learning Big Data artificial intelligence,1,1,1,1,0,0,3,0,0,1
9,Hadoop Java MapReduce Big Data,2,0,0,0,0,2,0,1,0,0
