# Topic Modeling for Everybody with Google Colab

**Super simple topic modeling using both the Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) algorithms.**

This Google Colab notebook makes topic modeling accessible to everybody. It loads text from a Google Sheet and generates topics with both NMF and LDA. The only input required is a few form fields:

* the name of the Google Sheet
* the number of topics to generate
* the number of top words and top documents to print for each topic





In [None]:
#@title Install gspread, authenticate and load data from a Google Sheet
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
# google-auth replaces the deprecated oauth2client; recent gspread releases
# expect google-auth credentials.
from google.auth import default

# Default data from
# http://web.eecs.utk.edu/~berry/order/node4.html#SECTION00022000000000000000

googlesheet_filename = 'mydata' #@param {type:"string"}
data_rows_to_preview = 10 #@param {type:"integer"}


In [None]:
#@title Load and preview data from a Google Sheet

creds, _ = default()
gc = gspread.authorize(creds)

worksheet = gc.open(googlesheet_filename).sheet1

# get_all_values() returns a list of rows; the first row is the header.
rows = worksheet.get_all_values()

# Collect the second column (the text) of every data row, skipping the header.
documents = [row[1] for row in rows[1:]]

# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows)
dataset_df.head(n=data_rows_to_preview)


| | 0 | 1 |
|---|---|---|
| 0 | id | text |
| 1 | 1 | Human machine interface for Lab ABC computer a... |
| 2 | 2 | A survey of user opinion of computer system re... |
| 3 | 3 | The EPS user interface management system |
| 4 | 4 | System and human system engineering testing of... |
| 5 | 5 | Relation of user-perceived response time to er... |
| 6 | 6 | The generation of random, binary, unordered trees |
| 7 | 7 | The intersection graph of paths in trees |
| 8 | 8 | Graph minors IV: Widths of trees and quasi-ord... |
| 9 | 9 | Graph minors: A survey |
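
`get_all_values()` returns the sheet as a list of rows, each row a list of strings, with the header in the first row. The cell above takes the second column by position. As a minimal sketch, the column could instead be looked up by its header name, which keeps working if the columns are reordered. The `rows` literal and the `column_name` variable below are illustrative stand-ins, not part of the notebook.

```python
# Hypothetical stand-in for worksheet.get_all_values(): header row first,
# then one row per document.
rows = [
    ["id", "text"],
    ["1", "Human machine interface for Lab ABC computer applications"],
    ["2", "A survey of user opinion of computer system response time"],
    ["3", ""],  # empty cell, skipped below
]

column_name = "text"  # assumed header of the column holding the documents

header, *data_rows = rows
text_col = header.index(column_name)  # raises ValueError if the header is missing

# Keep non-empty cells only; empty strings carry no words for the vectorizers.
documents = [row[text_col] for row in data_rows if row[text_col].strip()]

print(len(documents), "documents loaded")
```

Note that dropping empty cells means positions in `documents` no longer line up one-to-one with sheet rows, which only matters if results need to be mapped back to the sheet.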




---






In [None]:
#@title Set topic modeling algorithm arguments

no_topics = 3 #@param {type:"integer"}

no_top_words = 4 #@param {type:"integer"}

no_top_documents = 3 #@param {type:"integer"}

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [None]:
#@title Run NMF

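def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print the top words and top documents for each topic.

    H is the topic-word matrix (model.components_); W is the document-topic
    matrix (model.transform(...)).
    """
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Indices of the no_top_words largest weights, largest first.
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Documents with the largest weight for this topic, largest first.
        top_doc_indices = np.argsort(W[:, topic_idx])[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])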

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

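# Run NMF. The single `alpha` argument was removed in scikit-learn 1.2. Its
# replacements alpha_W and alpha_H are scaled by n_features and n_samples,
# so dividing by those sizes approximates the original alpha=.1.
n_samples, n_features = tfidf.shape
nmf_model = NMF(n_components=no_topics, random_state=1,
                alpha_W=.1 / n_features, alpha_H=.1 / n_samples,
                l1_ratio=.5, init='nndsvd').fit(tfidf)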
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

print("NMF Topics")
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
print("--------------")



NMF Topics
Topic 0:
trees graph minors survey
Graph minors IV: Widths of trees and quasi-ordering
The intersection graph of paths in trees
The generation of random, binary, unordered trees
Topic 1:
time response user survey
Relation of user-perceived response time to error measurement
A survey of user opinion of computer system response time
The EPS user interface management system
Topic 2:
human eps interface computer
System and human system engineering testing of EPS
Human machine interface for Lab ABC computer applications
The EPS user interface management system
--------------
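
`display_topics` ranks words and documents with two `argsort` idioms that are easy to misread. The sketch below applies both to small made-up matrices (`H`, `W`, `feature_names` and the `results` dictionary are invented for illustration and are not the notebook's values). `argsort()` sorts in ascending order, so reading the indices backwards puts the largest weight first, and the slice keeps only the top entries.

```python
import numpy as np

# Invented topic-word weights: 2 topics x 5 words.
feature_names = ["graph", "trees", "user", "time", "survey"]
H = np.array([
    [0.9, 0.7, 0.0, 0.1, 0.3],
    [0.0, 0.1, 0.8, 0.6, 0.2],
])
# Invented document-topic weights: 4 documents x 2 topics.
W = np.array([
    [0.1, 0.9],
    [0.8, 0.0],
    [0.5, 0.2],
    [0.0, 0.4],
])

no_top_words = 2
no_top_documents = 2

results = {}
for topic_idx, topic in enumerate(H):
    # [:-no_top_words - 1:-1] walks backwards from the end of the ascending
    # index array and stops after no_top_words entries.
    top_words = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
    # Same idea for documents: sort one column of W, reverse, keep the first few.
    top_docs = np.argsort(W[:, topic_idx])[::-1][:no_top_documents].tolist()
    results[topic_idx] = (top_words, top_docs)
    print(topic_idx, top_words, top_docs)
```

Both idioms give the same ordering; the first simply folds the reversal and the truncation into a single slice.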


In [None]:
#@title Run LDA

# LDA is a probabilistic model of word counts, so it takes raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)

LDA Topics
Topic 0:
user response time survey
A survey of user opinion of computer system response time
Relation of user-perceived response time to error measurement
The EPS user interface management system
Topic 1:
trees human graph minors
Graph minors IV: Widths of trees and quasi-ordering
Human machine interface for Lab ABC computer applications
The intersection graph of paths in trees
Topic 2:
trees survey time minors
The generation of random, binary, unordered trees
System and human system engineering testing of EPS
The intersection graph of paths in trees
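
Both vectorizers in this notebook are created with `max_df=0.95, min_df=2`. As a rough, pure-Python sketch of what those two thresholds mean: a term is kept only if it occurs in at least `min_df` documents and in no more than a `max_df` fraction of them. The sketch ignores stop words and scikit-learn's actual tokenizer, and `docs`, `doc_freq` and `kept` are invented for illustration.

```python
from collections import Counter

# Toy corpus, not the notebook's data.
docs = [
    "graph minors survey",
    "graph of paths in trees",
    "random binary trees",
    "user response time survey",
]
min_df = 2      # int: minimum number of documents a term must appear in
max_df = 0.95   # float: maximum fraction of documents a term may appear in

# Document frequency: in how many documents each term occurs (counted once per doc).
doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))

max_count = max_df * len(docs)
kept = sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)
print(kept)
```

On a corpus this small, `min_df=2` discards every term that appears in only one document, so the vocabulary the models see is quite limited. Lowering `min_df` to 1 would keep those terms.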
