[View in Colaboratory](https://colab.research.google.com/github/librairy/notebooks/blob/master/Intro_TopicModels.ipynb)

This Google Colab notebook is an introduction to probabilistic topic models. It loads text from a Google Sheet and fits Latent Dirichlet Allocation (LDA) topics to it.

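First, authenticate with Google and define a helper that prints the top words and top documents of each topic. The training Google Sheet and the number of words to show per topic are set in the cells that follow.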


In [88]:
#@title Google Colab Authentication
!pip install --upgrade -q gspread
#!pip install -q gensim

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np



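def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words and the most representative documents of each topic.

    H: topic-word matrix (topics x vocabulary); W: document-topic matrix (documents x topics).
    """
    for topic_idx, topic in enumerate(H):
        print("-"*30)
        print(" Topic ",(topic_idx)," :")
        # Indices of the no_top_words largest weights, in descending order
        print("["," | ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]),"]")
        # Documents with the highest weight for this topic
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print("(",doc_index,")",documents[doc_index], W[doc_index])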

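The slice `[:-no_top_words - 1:-1]` used in `display_topics` is compact but easy to misread. The sketch below shows what it selects. The word list and weights are hypothetical values chosen for illustration, not output from the model.

```python
import numpy as np

# Hypothetical topic-word weights for a 6-word vocabulary (illustrative only).
feature_names = ["graph", "trees", "minors", "user", "time", "system"]
topic = np.array([0.5, 2.0, 0.1, 3.5, 1.2, 0.3])

no_top_words = 3

# argsort() returns indices in ascending order of weight. The slice
# [:-no_top_words - 1:-1] walks backwards from the end and stops after
# no_top_words items, so the largest weights come first.
top_indices = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_indices]

print(top_indices)
print(top_words)
```

The same idiom with `[::-1][0:n]`, as used for the top documents, gives an equivalent result.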





In [95]:
#@title Load and preview data from a Google Sheet

googlesheet_filename = 'sample' #@param {type:"string"}
data_rows_to_preview = 10 #@param {type:"integer"}


gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open(googlesheet_filename).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

# convert the 2nd column values to a list
documents = []
for row in rows[1:]:
  documents.append(row[1])
  
#print(documents)

# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows)
# display() is needed here because only a cell's last expression renders automatically
display(dataset_df.head(n=data_rows_to_preview))

# LDA takes raw term counts as input because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(
    stop_words='english',
    min_df=2,
    max_df=0.95,
    lowercase=True,
    ngram_range=(1,1)
)
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
vocab = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(tf_feature_names))

2018-10-15 11:08:41,064 : INFO : Refreshing access_token


Vocabulary Size:  10

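The small vocabulary is largely a result of the `min_df` and `max_df` filters. As a rough illustration of how they prune terms, here is a standard-library sketch on a hypothetical toy corpus (not the Google Sheet data). It skips stop-word removal for brevity, so it is a simplification of what `CountVectorizer` does rather than a replacement for it.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not the data loaded from the Google Sheet.
docs = [
    "Graph minors and trees",
    "Trees and graph paths",
    "User response time and graph width",
]

# Lowercase, then keep tokens of two or more word characters
# (the default token pattern of CountVectorizer).
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Document frequency: the number of documents in which each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

min_df = 2      # integer: keep terms found in at least 2 documents
max_df = 0.95   # float: drop terms found in more than 95% of documents
n_docs = len(docs)

vocabulary = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df <= max_df * n_docs
)
print(vocabulary)
```

Here "graph" and "and" appear in every document and are dropped by `max_df`, while terms seen only once are dropped by `min_df`.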

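Now it's time to build a topic model by setting values for:
- number of topics
- alpha (prior on the document-topic distributions, passed as `doc_topic_prior`)
- beta (prior on the topic-word distributions, passed as `topic_word_prior`)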

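Both alpha and beta act as concentration parameters of symmetric Dirichlet priors. To get a feel for their effect, the following sketch draws samples from a Dirichlet distribution with a small and a large concentration and compares how much mass the largest component receives on average. The helper name `mean_largest_share` and the sample size are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 4
n_samples = 1000

def mean_largest_share(concentration):
    """Average size of the largest component over many Dirichlet draws."""
    samples = rng.dirichlet([concentration] * n_topics, size=n_samples)
    return samples.max(axis=1).mean()

sparse = mean_largest_share(0.1)   # small prior: mass tends to pile onto one topic
smooth = mean_largest_share(10.0)  # large prior: mass tends to spread across topics

print(sparse, smooth)
```

Values below 1, such as the 0.1 used in this notebook, favor documents dominated by a few topics and topics dominated by a few words.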
In [96]:
#@title Run LDA

topics = 4 #@param {type:"integer"}

alpha = 0.1 #@param {type:"number"}

beta = 0.1 #@param {type:"number"}

no_top_words = 5

no_top_documents = 3


# Run LDA
lda_model = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=20, 
    learning_method='online', 
    learning_offset=50.,
    random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)


LDA Topics
------------------------------
 Topic  0  :
[ graph | trees | minors | computer | response ]
( 7 ) Graph minors: Widths of trees and quasi-ordering [0.91176294 0.02941196 0.02941266 0.02941244]
( 8 ) Graph minors: a suervey in data from graphical point of view [0.87499749 0.04166696 0.04166814 0.04166742]
( 6 ) the intersection graph of paths in trees [0.87499729 0.04166692 0.04166786 0.04166793]
------------------------------
 Topic  1  :
[ response | computer | time | human | user ]
( 1 ) A survey of user opinion of computer system response time [0.02272786 0.9318136  0.02272864 0.02272991]
( 4 ) Relation of user-perceived response time to error measurement [0.02941249 0.91175796 0.02941341 0.02941613]
( 0 ) human machine interface for Lab ABC computer applications [0.02941265 0.9117546  0.02941397 0.02941878]
------------------------------
 Topic  2  :
[ computer | eps | user | time | minors ]
( 5 ) The generation of random, binary, unordered trees [0.78570752 0.07142924 

Get Topic Distributions:


In [97]:
#@title Topic Distributions


print("Topic Distributions: ")

bounds = (0,5)
tds = lda_model.transform(tf[bounds[0]:bounds[1]])
# tds is indexed from 0, so pair each document number with its row position
for i, x in enumerate(range(bounds[0],bounds[1])):
  print("Doc",x,tds[i])


Topic Distributions: 
Doc 0 [0.02941265 0.9117546  0.02941397 0.02941878]
Doc 1 [0.02272786 0.9318136  0.02272864 0.02272991]
Doc 2 [0.02941243 0.02941805 0.02941369 0.91175584]
Doc 3 [0.0416682  0.87497362 0.04167149 0.04168668]
Doc 4 [0.02941249 0.91175796 0.02941341 0.02941613]

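A common next step is to assign each document to its most probable topic. A minimal sketch, using the five distributions printed above as input:

```python
import numpy as np

# Topic distributions copied from the output above (one row per document).
tds = np.array([
    [0.02941265, 0.9117546,  0.02941397, 0.02941878],
    [0.02272786, 0.9318136,  0.02272864, 0.02272991],
    [0.02941243, 0.02941805, 0.02941369, 0.91175584],
    [0.0416682,  0.87497362, 0.04167149, 0.04168668],
    [0.02941249, 0.91175796, 0.02941341, 0.02941613],
])

# Index of the largest probability in each row.
dominant_topic = tds.argmax(axis=1)

for doc_idx, topic_idx in enumerate(dominant_topic):
    print("Doc", doc_idx, "-> topic", topic_idx, "with weight", tds[doc_idx, topic_idx])
```

With these values, documents 0, 1, 3 and 4 fall under topic 1 and document 2 under topic 3.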

Get the topic distribution for a given text sample:

In [98]:
#@title Inference

text = "this is an example" #@param {type:"string"}

print("Topic Distribution: ", lda_model.transform(tf_vectorizer.transform([text])))


Topic Distribution:  [[0.25 0.25 0.25 0.25]]

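The uniform result is consistent with an input that contains no in-vocabulary terms: "this", "is" and "an" are English stop words, and "example" is presumably absent from the 10-word vocabulary. The sketch below is a simplified picture of why that yields equal weights, assuming the document contributes zero counts to every topic. It is not scikit-learn's actual inference code.

```python
import numpy as np

n_topics = 4
alpha = 0.1  # doc_topic_prior used when fitting the model above

# Assumption: no input token survives vectorization, so the document
# contributes zero expected counts to every topic.
expected_topic_counts = np.zeros(n_topics)

# Simplified per-document update: prior plus expected counts,
# then normalized so the weights sum to 1.
gamma = alpha + expected_topic_counts
topic_distribution = gamma / gamma.sum()

print(topic_distribution)
```

Any symmetric prior gives the same 1/4 per topic here, so the output says nothing about the text itself. Using a sample that shares words with the training documents gives a more informative distribution.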