# Topic Modeling for Everybody with Google Colab

**Super simple topic modeling using both the Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) algorithms.**

This Google Colab notebook makes topic modeling accessible to everybody. It loads textual data from a Google Sheet and generates topics with both NMF and LDA. The only input required is a simple form that sets:

* the name of the Google Sheet
* the number of topics to generate
* the number of top words and top documents to print for each topic





In [1]:
#@title Install pyLDAvis (specific version for Google Colab)
!pip install pyLDAvis==2.1.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=dd7110df176e84cb83c981cb1639b22054271ecbc60575e0f4530984832ff257
  Stored in directory: /root/.cache/pip/wheels/3b/fb/41/e32e5312da9f440d34c4eff0d2207b46dc9332a7b931ef1e89
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-2.1.2


In [5]:
#@title Install gspread, authenticate and set the Google Sheet name
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

googlesheet_filename = 'manet_tags' #@param {type:"string"}
data_rows_to_preview = 10 #@param {type:"integer"}


In [7]:
#@title Load and preview data from a Google Sheet

# The authorized gspread client `gc` created in the previous cell is reused here.

worksheet = gc.open(googlesheet_filename).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

# Use the 2nd column of every row after the header row as the documents.
documents = [row[1] for row in rows[1:]]

# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows)
dataset_df.head(n=data_rows_to_preview)


|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | ['amtshaus' | modernism' | reformation' | entropy' | corbusier'].rtf |
| 1 | ['amtshaus' | modernism' | reformation' | entropy' | corbusier'].txt |
| 2 | ['amtshaus | modernism' | continuity' | proportion' | values'].rtf |
| 3 | ['amtshaus | modernism' | continuity' | proportion' | values'].txt |
| 4 | ['art' | continuity' | rupture' | modernlife' | history'].rtf |
| 5 | ['art' | continuity' | rupture' | modernlife' | history'].txt |
| 6 | ['artist' | dandy' | baudelaire' | flaneur' | ephemeral'].rtf |
| 7 | ['artist' | dandy' | baudelaire' | flaneur' | ephemeral'].txt |
| 8 | ['artist' | socialclass' | crowd' | flaneur' | time'].rtf |
| 9 | ['artist' | socialclass' | crowd' | flaneur' | time'].txt |
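
The loading cell above assumes that every row has at least two columns and that no cell in the second column is blank. A minimal sketch of a more defensive extraction follows, assuming the same list-of-rows shape that `get_all_values()` returns. `extract_documents` is a hypothetical helper written for illustration, not part of gspread, and the sample rows are made up.

```python
def extract_documents(rows, column=1, has_header=True):
    """Return (row_index, text) pairs for non-empty cells in one column.

    `rows` is a list of lists of strings, one inner list per sheet row.
    This is a hypothetical helper, not a gspread function.
    """
    start = 1 if has_header else 0
    pairs = []
    for row_index, row in enumerate(rows[start:], start=start):
        # Skip rows that are too short to contain the requested column.
        if len(row) <= column:
            continue
        text = row[column].strip()
        # Skip blank cells so they do not become empty documents.
        if text:
            pairs.append((row_index, text))
    return pairs


# Made-up rows for demonstration only.
sample_rows = [
    ["id", "text"],
    ["a", "modernism and continuity"],
    ["b", ""],
    ["c"],
    ["d", "  flaneur in the crowd "],
]
pairs = extract_documents(sample_rows)
documents_clean = [text for _, text in pairs]
```

Keeping the original row index next to each text makes it possible to trace a document printed under a topic back to its row in the sheet, even after blank rows have been dropped.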




---



In [14]:
#@title Set topic modeling algorithm arguments

no_topics = 6 #@param {type:"integer"}

no_top_words = 9 #@param {type:"integer"}

no_top_documents = 6 #@param {type:"integer"}

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [15]:
#@title Run NMF

def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    # H holds topic-by-term weights; W holds document-by-topic weights.
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Terms with the highest weight for this topic, highest first.
        print(" ".join([ (feature_names[i] + " (" + str(topic[i].round(2)) + ")")
          for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Documents with the highest weight for this topic, highest first.
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(str(doc_index) + ". " + documents[doc_index])

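# NMF can work on tf-idf weighted term features
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names_out() replaces get_feature_names() (deprecated in scikit-learn 1.0, removed in 1.2)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# Run NMF
# Note: `alpha` is deprecated since scikit-learn 1.0 in favour of `alpha_W` and `alpha_H`.
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)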
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

print("NMF Topics")
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
print("--------------")



NMF Topics
Topic 0:
public (1.78) woman (0.0) modernity (0.0) modern (0.0) love (0.0) institution (0.0) impressionism (0.0) identity (0.0) haussman (0.0)
53. public'
58. public'
57. public'
56. public'
55. public'
54. public'
Topic 1:
spectacle (1.78) woman (0.0) flaubert (0.0) modern (0.0) love (0.0) institution (0.0) impressionism (0.0) identity (0.0) haussman (0.0)
76. spectacle'
75. spectacle'
71. spectacle'
72. spectacle'
73. spectacle'
74. spectacle'
Topic 2:
socialclass (1.78) woman (0.0) flaubert (0.0) modern (0.0) love (0.0) institution (0.0) impressionism (0.0) identity (0.0) haussman (0.0)
66. socialclass'
65. socialclass'
67. socialclass'
68. socialclass'
7. socialclass'
8. socialclass'
Topic 3:
family (1.55) flaubert (0.0) modern (0.0) love (0.0) institution (0.0) impressionism (0.0) identity (0.0) haussman (0.0) woman (0.0)
40. family'
39. family'
38. family'
37. family'
89. 
31. artificial'
Topic 4:
salon (1.55) woman (0.0) flaubert (0.0) modern (0.0) love (0.0) institut
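
`display_topics` ranks terms and documents with NumPy's `argsort`. The same selection can be sketched in plain Python, assuming `H` is a topic-by-term matrix and `W` a document-by-topic matrix stored as nested lists. The helper names and the toy numbers below are illustrative only, and the order of tied weights may differ from NumPy's.

```python
def top_indices(weights, n):
    """Indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


def summarize_topic(H, W, feature_names, topic_idx, n_words, n_docs):
    """Top terms and top document indices for one topic."""
    topic_row = H[topic_idx]
    top_terms = [feature_names[i] for i in top_indices(topic_row, n_words)]
    # Column `topic_idx` of W holds each document's weight for this topic.
    doc_weights = [doc_row[topic_idx] for doc_row in W]
    top_docs = top_indices(doc_weights, n_docs)
    return top_terms, top_docs


# Toy matrices, not taken from the notebook's data.
feature_names_demo = ["public", "salon", "family", "crowd"]
H_demo = [[0.1, 0.9, 0.0, 0.3],   # topic 0
          [0.8, 0.0, 0.2, 0.5]]   # topic 1
W_demo = [[0.7, 0.1],             # document 0
          [0.2, 0.9],             # document 1
          [0.4, 0.3]]             # document 2

terms, docs = summarize_topic(H_demo, W_demo, feature_names_demo, 1, 2, 2)
```

Here `terms` holds the two heaviest terms of topic 1 and `docs` the indices of the two documents that load most strongly on it.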



In [13]:
#@title Visualise NMF with pyLDAvis

import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(nmf_model, tfidf, tfidf_vectorizer)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)

  return dists / dists.sum(axis=1)[:, None]


ValidationError: ignored
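
The fragment `dists / dists.sum(axis=1)[:, None]` in the output above is a row normalization: each row is divided by its own sum so that it can be read as a probability distribution. One plausible cause of the failure, stated here as an assumption rather than a diagnosis, is that some row of the NMF output sums to zero (an empty document, for example), which leaves that division undefined. The plain-Python sketch below shows a normalization that guards against this case. `row_normalize` and its uniform fallback are illustrative choices, not pyLDAvis behaviour, and rows are assumed to be non-empty and non-negative.

```python
def row_normalize(matrix):
    """Divide each row by its sum; all-zero rows become uniform rows."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total > 0:
            normalized.append([value / total for value in row])
        else:
            # Assumption: fall back to a uniform distribution so the row
            # still sums to 1 instead of triggering a division by zero.
            normalized.append([1.0 / len(row)] * len(row))
    return normalized


# Toy document-by-topic weights; the middle row mimics an empty document.
doc_topic_demo = [[2.0, 2.0], [0.0, 0.0], [1.0, 3.0]]
normalized_demo = row_normalize(doc_topic_demo)
```

Whether a uniform row is an acceptable stand-in depends on the analysis; dropping empty documents before fitting the model is another option.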

In [None]:
#@title Run LDA

# LDA needs raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
# get_feature_names_out() replaces get_feature_names() (deprecated in scikit-learn 1.0, removed in 1.2)
tf_feature_names = tf_vectorizer.get_feature_names_out()

# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50., random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)

LDA Topics
Topic 0:
user (1.73) response (1.4) time (1.38) survey (1.35)
1. A survey of user opinion of computer system response time
4. Relation of user-perceived response time to error measurement
2. The EPS user interface management system
Topic 1:
trees (1.67) human (1.52) graph (1.39) minors (1.21)
7. Graph minors IV: Widths of trees and quasi-ordering
0. Human machine interface for Lab ABC computer applications
6. The intersection graph of paths in trees
Topic 2:
trees (0.88) survey (0.86) time (0.82) minors (0.81)
5. The generation of random, binary, unordered trees
3. System and human system engineering testing of EPS
6. The intersection graph of paths in trees
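
The NMF cell feeds tf-idf weights to the model, while the LDA cell feeds raw term counts. The difference can be sketched in plain Python. The weighting below follows scikit-learn's documented defaults for `TfidfVectorizer` (smoothed idf, then L2 normalization of each row). As simplifying assumptions, tokenization is plain whitespace splitting, stop-word removal and the `min_df`/`max_df` filters are omitted, and the three documents are made up.

```python
import math
from collections import Counter

docs_demo = ["salon public salon", "public crowd", "crowd flaneur crowd"]
tokenized = [doc.split() for doc in docs_demo]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# Raw term counts: one row per document, one column per vocabulary term.
counts = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

# Document frequency: number of documents that contain each term.
n_docs = len(tokenized)
df = [sum(1 for tokens in tokenized if term in tokens) for term in vocabulary]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def tfidf_row(count_row):
    """Weight counts by idf, then scale the row to unit L2 norm."""
    weighted = [c * w for c, w in zip(count_row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]


tfidf_demo = [tfidf_row(row) for row in counts]
```

`counts` is the kind of integer matrix the LDA cell works on, while `tfidf_demo` down-weights terms that occur in many documents, which is the representation the NMF cell uses.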


In [None]:
#@title Visualise LDA with pyLDAvis

import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(lda_model, tf, tf_vectorizer)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)