[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/librairy/notebooks/blob/master/Intro_TopicModels.ipynb)

This Google Colab Notebook serves as an introduction to Probabilistic Topic Models. 

Textual data can be loaded from a Google Sheet and topics derived from  LDA can be generated. 

First, it is necessary to indicate the training google sheet and the number of words to show per topic.


In [0]:
#@title Google Colab Authentication
!pip install --upgrade -q gspread
#!pip install -q gensim

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np




In [111]:
#@title Load and preview data from a Google Sheet

corpus = 'texts' #@param {type:"string"}
preview = 10 #@param {type:"integer"}


gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open(corpus).sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()

# convert the 3rd column values to a list
documents = []
for row in rows[1:]:
  documents.append(row[2])
  
#print(documents)

# Convert to a DataFrame and render.
import pandas as pd
dataset_df = pd.DataFrame.from_records(rows)
dataset_df.head(n=preview)



Unnamed: 0,0,1,2
0,EU100000,Visual object population codes relating human...,Two major challenges facing systems neuroscien...
1,EU100001,New Opportunities for Research Funding Agency ...,NORFACE is a co-ordinated common action of fif...
2,EU100002,USA and Europe Cooperation in Mini UAVs,Unmanned Aerial Systems have been an active ar...
3,EU100003,Sustainable Infrastructure for Resilient Urban...,This fellowship aim is to identify how the use...
4,EU100004,Modelling star formation in the local universe,The goal of this proposal is to revolutionize ...
5,EU100005,Coherent spin manipulation in hybrid nanostruc...,The rapid development of novel nanoelectronic ...
6,EU100006,Stochastic Modeling of Spatially Extended Ecos...,The coupling between ecosystems and the climat...
7,EU100007,Secure Multiparty Computation in the von Neuma...,The ultimate aim of this project is to design ...
8,EU100008,Infrastructures for Community-Based Data Manag...,"Structured data is increasingly created, trans..."
9,EU100009,Reducing Environmental Footprint based on Mult...,Reduction of CO2 emissions is the great challe...


Tokenize input data:


In [112]:
#@title Tokenization

tf_vectorizer = CountVectorizer(
    stop_words=None,
    min_df=2,
    max_df=0.95,
    lowercase=False,
    max_features=None,
    ngram_range=(1,1),
    analyzer = 'word'
)
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()
vocab = tf_vectorizer.vocabulary_

print("Vocabulary Size: ", len(tf_feature_names))

Vocabulary Size:  2077


Now it's time to build a topic model by setting values for:
- number of topics
- alpha
- beta

In [115]:
#@title Run LDA

topics = 2 #@param {type:"integer"}

alpha = 0.1 #@param {type:"number"}

beta = 0.01 #@param {type:"number"}

no_top_words = 10

no_top_documents = 3


# Run LDA
lda_model = LatentDirichletAllocation(
    n_components=topics, 
    doc_topic_prior=alpha, 
    topic_word_prior=beta, 
    max_iter=100, 
    learning_method='online', 
    learning_offset=50.,
    random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
for topic_idx, topic in enumerate(lda_H):
    print("-"*30)
    print(" Topic ",(topic_idx)," :")
    print("["," | ".join([tf_feature_names[i]
                    for i in topic.argsort()[:-no_top_words - 1:-1]]),"]")
    top_doc_indices = np.argsort( lda_W[:,topic_idx] )[::-1][0:no_top_documents]
    for doc_index in top_doc_indices:
        print("[",doc_index,"] (",rows[doc_index-1][0],") \'",rows[doc_index-1][1],"\'")
        print("\t",lda_W[doc_index])
            
#display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)


LDA Topics
------------------------------
 Topic  0  :
[ will | for | is | The | be | on | as | that | with | are ]
[ 36 ] ( EU100035 ) ' GLObal Robotic telescopes Intelligent Array for e-Science '
	 [9.99673416e-01 3.26583934e-04]
[ 35 ] ( EU100034 ) ' Trans-Atlantic Research and Education Agenda in System of Systems '
	 [9.99656593e-01 3.43406595e-04]
[ 27 ] ( EU100026 ) ' Quantitative Graph Games: Theory and Applications '
	 [9.99629904e-01 3.70096227e-04]
------------------------------
 Topic  1  :
[ will | for | The | as | is | be | on | by | are | that ]
[ 21 ] ( EU100020 ) ' The importance of submarine groundwater discharge for the southwestern Baltic Sea '
	 [0.99804687 0.00195313]
[ 67 ] ( EU100068 ) ' A Treatment-Oriented Research Project of NCL Disorders as a Major Cause of Dementia in Childhood '
	 [0.99841772 0.00158228]
[ 80 ] ( EU100081 ) ' Comprehensive European Approach to the Protection of Civil Aviation '
	 [0.99859551 0.00140449]


In [108]:
#@title Term Frequencies
s = tf.toarray().sum(axis=0)
st = sorted(range(len(s)), key=lambda k: s[k], reverse=True)
for i,x in enumerate(st[:20]):
  print(tf_vectorizer.get_feature_names()[x],s[x])

technology 70
model 66
approach 62
application 53
provide 53
need 51
base 50
challenge 47
propose 47
development 46
support 46
europe 45
cell 44
focus 43
activity 41
tool 40
area 39
energy 39
service 39
analysis 38


Get Topic Distributions:


In [88]:
#@title Topic Distributions
for i,v in enumerate(lda_W):
  print(rows[i][0],":",v)


EU100000 : [5.54323736e-04 5.54323735e-04 5.54323728e-04 9.98337029e-01]
EU100001 : [6.15763563e-04 9.98152709e-01 6.15763558e-04 6.15763558e-04]
EU100002 : [5.54323737e-04 9.98337029e-01 5.54323735e-04 5.54323738e-04]
EU100003 : [9.97863248e-01 7.12250723e-04 7.12250728e-04 7.12250719e-04]
EU100004 : [9.98218527e-01 5.93824241e-04 5.93824245e-04 5.93824238e-04]
EU100005 : [6.15763554e-04 6.15763554e-04 9.98152709e-01 6.15763553e-04]
EU100006 : [6.00961556e-04 9.98197115e-01 6.00961571e-04 6.00961546e-04]
EU100007 : [9.98218527e-01 5.93824243e-04 5.93824235e-04 5.93824241e-04]
EU100008 : [4.89236802e-04 9.98532290e-01 4.89236800e-04 4.89236800e-04]
EU100009 : [5.04032264e-04 9.98487903e-01 5.04032262e-04 5.04032267e-04]
EU100010 : [9.98502994e-01 4.99002009e-04 4.99002000e-04 4.99002005e-04]
EU100011 : [7.44047634e-04 9.97767857e-01 7.44047633e-04 7.44047630e-04]
EU100012 : [6.15763555e-04 9.98152709e-01 6.15763561e-04 6.15763556e-04]
EU100013 : [5.60538126e-04 5.60538127e-04 5.6053812

Get topics for a given  sample:

In [116]:
#@title Topic Inference

text = "It\u2019s amazing how much can be achieved with just 36 lines of Python code and some Scikit Learn magic. The full code listing is provided below" #@param {type:"string"}

print("Topic Distribution: ", lda_model.transform(tf_vectorizer.transform([text])))


Topic Distribution:  [[0.99242424 0.00757576]]
