# Metadata

```
Course:   DS5001
Module:   098 Lab
Topic:    Gibbs Sampler
Author:   R.C. Alvarado

Purpose:  We develop an LDA topic modeler using collapsed Gibbs sample as described by [Griffiths and Steyvers (2004)].
```

## Setup

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import re
from nltk.corpus import stopwords 

## Functions

### Convert Corpus 

We convert the list of token lists (DOC) into TOKEN and VOCAB tables.

In [2]:
class Corpus(): 
    
    def __init__(self, doclist):

        self.docs = doclist
        
        # Create DOC table from F1 doclist
        self.DOC = pd.DataFrame(doclist, columns=['doc_str'])
        self.DOC.index.name = 'doc_id'
        self.DOC

        # Convert docs into tokens
        stop_words = set(stopwords.words('english')) 
        tokens = []
        for i, doc in enumerate(doclist):
            for j, token in enumerate(doc.split()):
                term_str = re.sub(r'[\W_]+', '', token).lower()
                if term_str not in stop_words:
                    tokens.append((i, j, term_str))
        self.TOKEN = pd.DataFrame(tokens, columns=['doc_id','token_num','term_str'])\
            .set_index(['doc_id','token_num'])

        # Extract vocabulary
        self.VOCAB = self.TOKEN.term_str.value_counts().to_frame('n')
        self.VOCAB.index.name = 'term_str'    
        

### Gibbs Sampler

We sample each document and word combination in the BOW table. In each case,
we are looking for two values:

* the topic with which a word has been most frequently labeled
* the topic with which the document has the most labeled words

We combine these values in order to align the label of the current word with the rest of the data.\
If a the topic is highly associated with both the word and the document, then that topic will get a high value.

Note that all that is going on here is a sorting operation -- the random assignment does not predict anything.\
Instead, we are just gathering words under topics and topics under documents.

**From Darling 2011:**
<hr />
<div style="float:left;">
<img src="images/gibbs-algo-text.png" width="650px" />
<img src="images/gibbs-algo.png" width="650px" />
</div>

In [209]:
class GibbsSampler():
    
    n_topics:int = 10
    n_iters:int = 100
    a:float = 1.
    b:float = .1

    # See Griffiths and Steyvers 2004
    #     a = 1 # 50 / n_topics
    #     b = .1 # 200 / W

    def __init__(self, corpus:Corpus):
        self.corpus = corpus
        self.N = len(corpus.TOKEN)
        self.W = len(corpus.VOCAB)
        
    def _estimate_z(self, row):

        # Get row elements
        d = row.name[0]  # Current document
        z = row.topic_id # Current assigned topic
        w = row.term_str # Current term

        # Zero out the current topic assignment
        # We want current state of everything else
        row[z] = 0

        # Number of words assigned to each topic k in the document -- C(w|d,k)
        n_dk = self.Z.loc[d, self.zcols].sum()

        # Number of times word w is assigned to each topic -- C(w|k)
        n_kw = self.Z.loc[self.Z.term_str == w, self.zcols].sum()

        # Number of times any word is assigned to each topic -- C(W|k)
        n_k = self.Z[self.zcols].sum()

        # Generate probalities
        # Note formula involves a LOCAL and a GLOBAL measure, kinda like TF-IDF
        pz = (n_dk + self.a) * ((n_kw + self.b) / (n_k + self.b * self.W))  

        # Sample to get new z
        z2 = pz.sample().index[0]

        # Update the token assignment (redundantly)
        row[z2] = 1
        row.topic_id = z2
        
    def generate_model(self):
        
        # Create topics table
        self.zcols = range(self.n_topics)
        self.topics = pd.DataFrame(index=self.zcols)

        # Randomly assign topics to toknes
        self.corpus.TOKEN['topic_id'] = self.topics.sample(self.N, replace=True).index

        # Create one-hot-encoding topic columns for easier computation
        self.Z = pd.concat([self.corpus.TOKEN, pd.get_dummies(self.corpus.TOKEN.topic_id)], axis=1)
        
        # Iterate
        for x in tqdm(range(self.n_iters)):
            self.Z.apply(self._estimate_z, 1)
        
        # Create topic model tables
        self.topics['n_tokens'] = self.Z.value_counts('topic_id')
        self.theta = self.Z.value_counts(['doc_id','topic_id']).unstack(fill_value=0)
        self.phi = self.Z.value_counts(['term_str','topic_id']).unstack(fill_value=0)
        self.theta = (self.theta.T / self.theta.T.sum()).T
        
        # Get top words for each topic
        self.topics['top_terms'] = self.topics\
            .apply(lambda x: self.phi.loc[self.phi[x.name] > 0, x.name]\
                   .sort_values(ascending=False)\
                   .head().index.to_list(), 1)   
        

## Demo 1

We use a toy example to see if the method works.\
Because our codd is not vert efficient, we just 

### Data

A small F1 corpus.

In [205]:
raw_docs = """
I ate a banana and a spinach smoothie for breakfast.
I like to eat broccoli and bananas.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
""".split("\n")[1:-1]

### Process

In [206]:
pd.options.mode.chained_assignment = None

In [207]:
corpus1 = Corpus(raw_docs)

In [210]:
model1 = GibbsSampler(corpus1)
model1.n_topics = 2
model1.n_iters = 1000
model1.generate_model()

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [01:11<00:00, 13.94it/s]


In [211]:
model1.topics

Unnamed: 0,n_tokens,top_terms
0,10,"[adopted, ate, banana, bananas, chinchillas]"
1,12,"[broccoli, cute, breakfast, eat, kittens]"


## Demo 2

### Data

In [213]:
some_documents = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]
raw_docs2  = [' '.join(item) for item in some_documents]

### Process

In [214]:
corpus2 = Corpus(raw_docs2)

In [215]:
model2 = GibbsSampler(corpus2)
model2.n_topics = 10
model2.n_iters = 200
model2.generate_model()

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:52<00:00,  3.82it/s]


In [216]:
model2.topics

Unnamed: 0,n_tokens,top_terms
0,9,"[databases, hbase, java, learning, neural]"
1,8,"[c, cassandra, haskell, libsvm, programming]"
2,10,"[pandas, big, c, cassandra, deep]"
3,8,"[intelligence, java, languages, learning, libsvm]"
4,5,"[big, machines, probability, scipy, statistics]"
5,13,"[networks, decision, deep, hbase, machine]"
6,11,"[data, hadoop, hbase, java, learning]"
7,7,"[big, machine, mahout, regression, scikitlearn]"
8,5,"[artificial, data, intelligence, mongodb, prob..."
9,6,"[artificial, hadoop, postgres, r, regression]"


In [185]:
corpus2.DOC.join(model2.theta).style.background_gradient(cmap='GnBu', high=.5, axis=1)

Unnamed: 0_level_0,doc_str,0,1,2,3,4,5,6,7,8,9
doc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,Hadoop Big Data HBase Java Spark Storm Cassandra,0.0,0.0,0.125,0.0,0.125,0.125,0.0,0.0,0.375,0.25
1,NoSQL MongoDB Cassandra HBase Postgres,0.0,0.0,0.4,0.0,0.2,0.0,0.0,0.0,0.2,0.2
2,Python scikit-learn scipy numpy statsmodels pandas,0.333333,0.0,0.0,0.166667,0.0,0.0,0.333333,0.0,0.166667,0.0
3,R Python statistics regression probability,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.0,0.2,0.2
4,machine learning regression decision trees libsvm,0.166667,0.0,0.0,0.0,0.166667,0.166667,0.0,0.166667,0.0,0.333333
5,Python R Java C++ Haskell programming languages,0.142857,0.0,0.142857,0.142857,0.142857,0.285714,0.0,0.142857,0.0,0.0
6,statistics probability mathematics theory,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0
7,machine learning scikit-learn Mahout neural networks,0.166667,0.166667,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.333333
8,neural networks deep learning Big Data artificial intelligence,0.125,0.25,0.125,0.125,0.0,0.0,0.125,0.0,0.0,0.25
9,Hadoop Java MapReduce Big Data,0.2,0.0,0.0,0.0,0.0,0.2,0.2,0.2,0.2,0.0


In [186]:
model2.topics.sort_values('n_tokens', ascending=False).style.bar()

Unnamed: 0,n_tokens,top_terms
9,12,"['learning', 'vector', 'hadoop', 'neural', 'mahout']"
0,11,"['statistics', 'java', 'big', 'mapreduce', 'decision']"
5,11,"['regression', 'java', 'scikitlearn', 'pandas', 'mongodb']"
8,10,"['probability', 'mysql', 'statistics', 'big', 'numpy']"
2,7,"['artificial', 'spark', 'machines', 'programming', 'hbase']"
6,7,"['python', 'big', 'c', 'pandas', 'scipy']"
7,7,"['c', 'data', 'deep', 'mathematics', 'hbase']"
1,6,"['neural', 'r', 'theory', 'postgres', 'deep']"
4,6,"['networks', 'mongodb', 'support', 'storm', 'haskell']"
3,5,"['statsmodels', 'artificial', 'python', 'intelligence', 'support']"
