[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/librairy/notebooks/blob/master/Intro_TopicModels.ipynb)

This Google Colab notebook is an introduction to probabilistic topic models.

It loads text from a Google Sheet and fits Latent Dirichlet Allocation (LDA) topics to it.

First, authenticate with Google, then specify the training Google Sheet (`corpus`) and the number of words to show per topic (`no_top_words`).


In [0]:
#@title Google Colab Authentication
!pip install --upgrade -q gspread
#!pip install -q gensim

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np



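def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words and the most representative documents of each topic.

    H is the topic-word matrix and W the document-topic matrix.
    """
    for topic_idx, topic in enumerate(H):
        print("-" * 30)
        print(" Topic ", topic_idx, " :")
        # Indices of the highest-weighted words, in descending order.
        top_words = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        print("[", " | ".join(top_words), "]")
        # Documents with the largest weight for this topic.
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][:no_top_documents]
        for doc_index in top_doc_indices:
            # +2 maps the list index to the sheet row (1-based, after the skipped first row).
            print("(", doc_index + 2, ")", documents[doc_index][:100], " ...")
            print("\t", W[doc_index])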
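
The slice `argsort()[:-no_top_words - 1:-1]` used in `display_topics` can be hard to read at first. The following is a minimal sketch of the idiom on its own; the vocabulary and weights are made up for illustration and do not come from the corpus.

```python
import numpy as np

# Hypothetical topic-word weights for a 6-word vocabulary (illustration only).
feature_names = ["goal", "the", "season", "puck", "win", "ice"]
topic = np.array([0.9, 0.1, 0.4, 2.5, 0.3, 1.7])
no_top_words = 3

# argsort() sorts ascending, so a reversed slice from the end
# yields the indices of the largest weights first.
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]
print(top_words)  # ['puck', 'ice', 'goal']
```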




In [43]:
#@title Load and preview data from a Google Sheet

corpus = 'texts' #@param {type:"string"}
preview = 10 #@param {type:"integer"}


gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open(corpus).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

# Collect the 2nd column of every row after the first into a list.
documents = [row[1] for row in rows[1:]]
  
#print(documents)

# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows)
dataset_df.head(n=preview)



Unnamed: 0,0,1
0,1,Dean J. Falcione (posting from jrmst+8@pitt.ed...
1,2,In article <1993Mar29.190650.28940@ramsey.cs.l...
2,3,"I can only comment on the Kings, but the most ..."
3,4,In article <C4zCII.Ftn@watserv1.uwaterloo.ca> ...
4,5,Maine beat LSSU 5-4.
5,6,"Well, I gotta tell ya, last night's Leafs game..."
6,7,>The shirts are believe or not from a Bob Prob...
7,8,Lake State/Maine in finals...WHO WON? Please...
8,9,Also sprach slegge@kean.ucs.mun.ca ... >TSN Sp...
9,10,In article <C4zHJ1.7xB@idacom.hp.com> andrew@i...


Tokenize the input data and build the document-term count matrix:


In [44]:
#@title Tokenization

tf_vectorizer = CountVectorizer(
    stop_words=None,
    min_df=1,
    max_df=0.95,
    lowercase=False,
    ngram_range=(1,1),
    analyzer = 'word'
)
tf = tf_vectorizer.fit_transform(documents)
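# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0.
tf_feature_names = tf_vectorizer.get_feature_names_out()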
vocab = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(tf_feature_names))

Vocabulary Size:  5189
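
To see how the vectorizer settings shape the vocabulary, here is a small sketch on a made-up three-document corpus (the sentences are hypothetical, not taken from the sheet). It uses the same `max_df=0.95` and `lowercase=False` as the cell above: a term that appears in every document is dropped, and differently cased forms are kept as separate terms.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (illustration only).
toy_docs = ["The Leafs won", "the Leafs lost", "Leafs game tonight"]

toy_vectorizer = CountVectorizer(max_df=0.95, lowercase=False)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# "Leafs" occurs in 3 of 3 documents (100% > 95%), so it is pruned.
# "The" and "the" stay distinct because lowercasing is disabled.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)         # ['The', 'game', 'lost', 'the', 'tonight', 'won']
print(toy_counts.shape)  # (3, 6)
```

With a real corpus, setting `lowercase=True` would usually shrink the vocabulary, at the cost of merging proper nouns with common words.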


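Now build a topic model by setting values for:
- number of topics (`topics`)
- alpha (`alpha`), the document-topic prior, passed as `doc_topic_prior`
- beta (`beta`), the topic-word prior, passed as `topic_word_prior`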
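
As a rough intuition for the two priors, the sketch below draws samples directly from symmetric Dirichlet distributions with NumPy. It samples from the prior only and does not fit a model, so it only suggests how a small value such as 0.1 favours distributions concentrated on a few topics, while a large value favours flat mixtures. The value 10.0 is an arbitrary contrast chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
topics = 10

# Symmetric Dirichlet draws: each row is one document-topic distribution.
sparse = rng.dirichlet([0.1] * topics, size=1000)   # small alpha
flat = rng.dirichlet([10.0] * topics, size=1000)    # large alpha (arbitrary contrast)

# Average weight of the single largest topic in each draw.
sparse_peak = sparse.max(axis=1).mean()
flat_peak = flat.max(axis=1).mean()
print("mean largest weight, alpha=0.1 :", round(sparse_peak, 3))
print("mean largest weight, alpha=10.0:", round(flat_peak, 3))
```

The same reasoning applies to beta, with words per topic in place of topics per document.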

In [48]:
#@title Run LDA

topics = 10 #@param {type:"integer"}

alpha = 0.1 #@param {type:"number"}

beta = 0.01 #@param {type:"number"}

no_top_words = 5

no_top_documents = 3


# Run LDA
lda_model = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=20, 
    learning_method='online', 
    learning_offset=50.,
    random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)


LDA Topics
------------------------------
 Topic  0  :
[ the | to | Biggest | and | in ]
( 5 ) Maine beat LSSU 5-4.  ...
	 [0.02500072 0.02500046 0.02500464 0.02500077 0.02501747 0.02500023
 0.02500064 0.02500061 0.77497393 0.02500053]
( 8 ) Lake State/Maine in finals...WHO WON?   Please post.  ...
	 [0.01000009 0.01000019 0.08505583 0.01000025 0.83494263 0.01000036
 0.01000009 0.01000021 0.01000019 0.01000015]
( 27 ) ktgeiss@miavx1.acs.muohio.edu writes: > Lake State/Maine in finals...WHO WON?   Please post. Maine 5  ...
	 [0.00588241 0.00588245 0.15394474 0.00588248 0.79899565 0.00588245
 0.00588242 0.00588245 0.00588252 0.00588243]
------------------------------
 Topic  1  :
[ Rangers | Hartford | the | Amonte | Turcotte ]
( 78 ) Hartford                         1 1 3--5 NY Rangers                       1 2 1--4 First period      ...
	 [0.0009804  0.74982755 0.12890367 0.00098041 0.00098041 0.11440593
 0.00098041 0.00098041 0.0009804  0.0009804 ]
( 80 ) x - Clinched Division Title y

Get the topic distributions for a range of documents:


In [49]:
#@title Topic Distributions


print("Topic Distributions: ")

bounds = (0,5)
tds = lda_model.transform(tf[bounds[0]:bounds[1]])
# tds is indexed from 0, so offset by the lower bound to keep labels and rows aligned.
for x in range(bounds[0],bounds[1]):
  print("Doc",x,tds[x - bounds[0]])


Topic Distributions: 
Doc 0 [3.42472944e-04 3.42474341e-04 9.96917737e-01 3.42474380e-04
 3.42473456e-04 3.42473470e-04 3.42473104e-04 3.42474792e-04
 3.42473294e-04 3.42473269e-04]
Doc 1 [5.29122340e-04 5.29129635e-04 9.95237882e-01 5.29127051e-04
 5.29122158e-04 5.29126909e-04 5.29121516e-04 5.29124067e-04
 5.29121323e-04 5.29122886e-04]
Doc 2 [0.00119051 0.00119051 0.98928541 0.00119051 0.00119051 0.0011905
 0.00119051 0.00119051 0.00119051 0.00119051]
Doc 3 [0.02500072 0.02500046 0.02500464 0.02500077 0.02501747 0.02500023
 0.02500064 0.02500061 0.77497393 0.02500053]
Doc 4 [3.50893495e-04 3.50894558e-04 9.96841946e-01 3.50894216e-04
 3.50893683e-04 3.50899034e-04 3.50893747e-04 3.50893998e-04
 3.50895423e-04 3.50896102e-04]
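
A topic distribution is often reduced to its dominant topic. The sketch below shows one way to do that with `argmax`; the two rows are rounded from the output above (Doc 0 and Doc 3) and are typed in by hand, so the snippet runs without the fitted model.

```python
import numpy as np

# Two rows rounded from the printed output (Doc 0 and Doc 3).
tds_sample = np.array([
    [0.0003, 0.0003, 0.9969, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003, 0.0003],
    [0.0250, 0.0250, 0.0250, 0.0250, 0.0250, 0.0250, 0.0250, 0.0250, 0.7750, 0.0250],
])

dominant = tds_sample.argmax(axis=1)  # index of the heaviest topic per row
weight = tds_sample.max(axis=1)       # weight of that topic

for row, (k, p) in enumerate(zip(dominant, weight)):
    print(f"row {row}: topic {k} (weight {p:.2f})")
```

Short documents such as Doc 3 carry little evidence, so the remaining weight stays spread evenly over the other topics.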


Infer the topic distribution of a new text sample:

In [0]:
#@title Inference

text = "this is an example" #@param {type:"string"}

print("Topic Distribution: ", lda_model.transform(tf_vectorizer.transform([text])))
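
One caveat when inferring topics for new text: only tokens already in the fitted vocabulary are counted, and with `lowercase=False` the match is case-sensitive. The sketch below illustrates this on a tiny made-up corpus with a hypothetical two-topic model, so its numbers say nothing about the notebook's real model. A text with no known tokens yields an empty count vector, leaving the model with only its prior to go on.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus and model (illustration only).
toy_docs = ["Maine beat LSSU", "Rangers beat Hartford",
            "Maine in finals", "Hartford lost again"]
toy_vectorizer = CountVectorizer(lowercase=False)
toy_counts = toy_vectorizer.fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

known = toy_vectorizer.transform(["Maine beat Hartford"])
unknown = toy_vectorizer.transform(["maine BEAT hartford"])  # case differs: no token matches

print("matched tokens:", known.nnz, "vs", unknown.nnz)
known_dist = toy_lda.transform(known)
unknown_dist = toy_lda.transform(unknown)
print(known_dist)
print(unknown_dist)
```

Checking `nnz` on the transformed vector before calling `transform` on the model is a cheap way to detect inputs the vocabulary does not cover.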
