# Visualizing a Topic Model using pyLDAVis

For this project, we try to see if we can conduct some topic modelling using sklearn's in-built LDA and pyLDAVis to visualise the topic models for the TIA posts that we extracted from TIA's publicly available REST APIs. We will attempt to see if we can better understand what content the writers are writing in TIA. With a better understanding of the available topics in TIA's posts, we can then look into how this information can benefit TIA if TIA were to move towards monetization through premium content for its posts.

We will be using Latent Dirichlet Allocation (LDA), a method to allow the discovery of topics within different documents given the text in the documents. We then use pyLDAVis to visualise the results and then from there attempt to find out any interesting topics from the visualisation.

The steps are as follows:

1. import the necessary dependencies for this notebook
2. connect to the database and retrieve the necessary data
3. text preprocessing of the posts
4. creating the term document matrix
5. topic modelling with LDA
6. visualisation using pyLDAVis
7. analyse the results

## 1. Import Dependencies

First, we import the libraries needed, including:

- **sklearn** library for running LDA on the posts
- [**pyLDAVis**](https://github.com/bmabey/pyLDAvis), a python port of [LDAVis](https://github.com/cpsievert/LDAvis) which is a R package for interactive topic model visualization
- **MySQLDb** library for connecting to the database and extract the posts
- Stanford's **nltk** (natural language toolkit) for text processing
- **pandas** for its dataframe and to retrieve the data from MySQL using the MySQLDB connection.

Note that this notebook runs on Python 2.7 due to pyLDAVis not being able to run on Python 3.

In [1]:
from __future__ import print_function

# import pyLDAVis
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# import MySQLdb to use the connection later
import MySQLdb

# import re for regex
import re

# import sklearn vectorizer and LDA library
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# import pandas for dataframe
import pandas as pd

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords');

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We run the following code segment below to suppress some warnings from pyLDAVis due to version issues.

In [2]:
# suppress warnings on sklearn LDA
import warnings
warnings.filterwarnings("ignore")

## 2. Retrieve Data from Database

Specify the connection information to the MySQL database and then connect to the database using pandas, retrieving the data as a dataframe. We only need the post_content column from the posts table for the topic modelling.

In [3]:
conn = MySQLdb.connect(host="localhost",    # your host ip, or if its running locally, localhost
                     user="root",             # your username to connect to the db
                     passwd="user1991",       # your password for the user you are connecting using
                     db="tia_test")           # name of the database/schema you are connecting to

# connect to the database using the connection you previously specified and the query
posts_df = pd.read_sql('select * from posts;', con=conn)

# use only the post_content column
posts_doc = posts_df['post_content']

## 3. Text Preprocessing

Next we look at how we can clean up the text in the posts. Some approaches are as follows:

- setting all words to lowercase
- removing all punctuations
- removing stopwords based on stopword list
- removing all tokens with only numbers
- strip leading and trailing spaces
- customise some terms

The stop_word_list consists of the top 100 words in Wikipedia + some custom stop words based on my observations of the posts data. We then combine this with the nltk stop word list to form a combined stop word list for stop word removal.

For term customisation, what we are trying to do is to modify certain terms such that they will retain their intended meaning even after the text preprocessing. For instance, if we did not modify the word go-jek, after going through the text preprocessing it will become two words: go, jek (furthermore, go is in the stop word list). This is because the text preprocessing strips the punctuation and replaces it with a space. So to prevent such a thing from happening, before we strip the hyphen, we modify the entire term:

- original: go-jek
- after: gojek

so that the content will still be there after the text preprocessing. There are also a few other terms modified as well besides this, but this is one example.

In [4]:
# custom stopword list based on understanding of the dataset as well as common words
stop_word_list = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "i", "it", "for", "not", "on", "with", "he", "as", 
                  "you", "do", "at", "this", "but", "his", "by", "from", "they", "we", "say", "her", "she", "or", "an", "will", 
                  "my", "one", "all", "would", "there", "their", "what", "so", "up", "out", "if", "about", "who", "get", 
                  "which", "go", "me", "when", "make", "can", "like", "time", "no", "just", "him", "know", "take", "people", 
                  "into", "year", "your", "good", "some", "could", "them", "see", "other", "than", "then", "now", "look", 
                  "only", "come", "its", "over", "think", "also", "back", "after", "use", "two", "how", "our", "work", "first", 
                  "well", "way", "even", "new", "want", "because", "any", "these", "give", "day", "most", "us", 'https', 'www', 
                  'etc', 'is', 'youre', 'were', 'im', 'hes', 'shes', 'theyre', 'are', 'yes', 'hi', 'isnt', 'wasnt', 'hey', 
                  'co', 'dont']

# nltk corpus stoplist
stops = list(stopwords.words('english'))

def treat_text(text):
    # decode text to work with special characters
    text = text.decode('utf8')
    
    # lower case the words
    text = text.lower()
    
    # substitution for consistency for words
    text = re.sub("e-commerce", "ecommerce", text)
    text = re.sub("go-jek", "gojek", text)
    
    # custom regexes based on text   
    # remove all symbols except letters numbers and blank spaces
    text = re.sub(r'[^A-Za-z0-9\s]+', ' ', text)
    
    # remove all words that are purely numbers only
    text = re.sub(r'\b[0-9]+\b\s*', '', text)
    
    # remove stopwords based on nltk stop list and some custom words from looking at the dataset
    text = ' '.join([word for word in text.split() if word not in stop_word_list and word not in stops])
    
    # substitution for consistency for words
    text = re.sub("tia", "tech-in-asia", text)
    text = re.sub("tech in asia", "tech-in-asia", text)
    text = re.sub("data sets", "datasets", text)
    text = re.sub(" ai ", " artificial intelligence ", text)
    text = re.sub(" ml ", " machine learning ", text)
    text = re.sub("data science", "data-science", text)
    text = re.sub("data sciences", "data-science", text)
    text = re.sub("se asia", "se-asia", text)
    text = re.sub("startups", "startup", text)
    text = re.sub("southeast asia", "se-asia", text)
    text = re.sub("south-east asia", "se-asia", text)
    text = re.sub("facebooks", "facebook", text)
    text = re.sub("icos", "ico", text)
    
    # compress all more than 1 blank spaces into a single blank space and finally strip the trailing and leading blank spaces
    text = re.sub("\s\s+" , " ", text).strip()
    
    return text

The entire text preprocessing for a single document is contained inside the function treat_text. Because of this, we can then do text preprocessing on every single document in the dataframe using the apply method on the treat_text function. 

In [5]:
# apply text cleaning on entire dataframe
posts_parsed = (posts_doc).apply(treat_text)

## 4. Creating term-document matrix

After preprocessing the text, we can then convert the collection of posts into a matrix of token counts using sklearn's CountVectorizer. Since LDA is a probabilistic model that only requires raw counts of the number of words, we use the CountVectorizer to retrieve the raw counts without any idf weighing.

We then apply the fit_transform on the parsed documents to be processed using LDA.

In [6]:
tf_vectorizer = CountVectorizer(analyzer='word',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(posts_parsed)
print(dtm_tf.shape)

(900, 4188)


## 5. Topic Modelling with LDA

With the term document matrix, we can then run LDA on it to try and identify latent variables that will allow us to discover topics in the documents that may not be visible just on the posts content alone.

For the LDA, there are many parameters that we can play around with, but for now we will be modifying only 3 arguments:

- n_components: number of topics to generate from the LDA. For this example, I use 20 topics but it can be increased or decreased depending on whether one thinks that there might be more or less common topics among the posts. Choosing a good number of topics may be tricky though. Choosing too few topics may result in trivial topics that are overly broad, while choosing too many topics may result in "over-clustering" of a corpus into many small, highly-similar topics. Number of topics is also based on your corpus size, for documents with over 100,000 documents there can be more than a thousand topics, but for this case with 900 posts, having a relative smaller number of 20 topics seems to be a reasonable number.

- random_state: a seed for the random number generator so that the results will be consistent and replicable. This is necessary as LDA is a probabilistic model that carrys out sampling of the dataset to do its computations

- max_iter: number of iterations to run for the LDA model. It might be possible that for certain sets of data the model takes longer to converge but increasing the number of iterations comes at a cost of more computation time.

In [7]:
lda_tf = LatentDirichletAllocation(n_topics=20, 
                                   random_state=0,
                                   max_iter=100)
lda_tf.fit(dtm_tf);

## 6. Visualising using pyLDAVis

The last step of this is to visualise the results of the LDA using pyLDAVis. Some things to take note:

- the Intertopic Distance Map on the left allows us to visualise in 2D each topic as a bubble. The size of the bubble indicates how many percent of the entire corpus (posts) corresponds to this topic, while the distance between two topic bubbles indicate how closely relevant to each other the two topics are. For this case the technique for scaling the distance is using Principal Components, but other methods can be used as well such as t-SNE (technique for dimensionality reduction well suited for visualization of high-dimensional datasets).
- the horizontal bar charts on the right show the top 30 terms for each topic that is selected. These are the individual terms that are the most useful for interpreting the currently selected topic on the left arranged from top to bottom in order of importance.  A pair of overlaid bars represent both the corpus-wide frequency of a given term as well as the topic-specific frequency of the term. Adjusting the lamdba parameter (discussed later) affects the ranking as well as the terms displayed. 
- the relevance metric λ (lambda) is used to determine the ranking of terms within topics. By changing the value of λ, we can adjust the term rankings. Having a smaller values of λ (near 0) highlight potentially rare, but exclusive terms for the selected topic, while having a larger values of λ (near 1) highlight frequent, but not necessarily exclusive, terms for the selected topic. The original paper for LDAVis did an empirical test to show that setting a λ close to 0.6 seems to help users in interpreting topics well with the top relevant terms. However, it is still a metric that is dependent on the dataset itself as well as the topics, hence the slider. (Refer to the original LDAVis paper for a formal definition of the relevance metric λ)

Below are two visualisations, the first one is using the default PCoA for the intertopic distance while the second one is using t-SNE.

In [8]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [9]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds="tsne") # tsne for the mds

## 7. Inference of Results

For this, I'm using λ = 0.6 referencing the original LDAVis paper, and using t-SNE for the dimensional scaling/reduction. Choosing t-SNE was because for my own inference purposes, I was more interested in the relative distance between different topics to look at how closely related topics are to each other, something t-SNE is good at.

From the results, we can make some observations for the topics. Lets take a look at some of the main ones: 

1. Some of the key words in topic 1 such as product, team, need seems to be about posts that describe about existing products teams, how they are being built and run, as well as advices on starting one. Simple google search using these key words that yielded results like "How Carousell runs its product team" and "How to form a tech team that actually solves problems..". These posts might interest people who are managing or are part of teams that are trying to develop their own product (eg. product managers), who can draw inspiration from these posts to either form their own impression of an ideal product team or how to better manage their existing teams.
2. Topic 2 is probably for posts related to e-commerce in general, with key words such as ecommerce, online, amazon, lazada, shopee; to name a few keywords that one can associate with e-commerce. These posts might relate to people who are interested in the e-commerce scene, especially in southeast asia and countries like Indonesia or Malaysia.
3. From the key terms in topic 3, it is likely that the posts that majority belong to this topic are related to startup funding in general, with terms like funding, capital, investors, investment, round, series, raised synonymous with startup funding rounds.
4. Topic 4 is likely to be related on latest trends in data, artificial intelligence (AI), analytics and machine learning that can interest people who are active in the data science community, interested in data science or may want to break into the field.
5. Topic 5 potentially contains topics on the different ride sharing/bike sharing/ride hailing transportation posts, given the key words (grab, uber, ofo, gojek, mobike). Posts from this topic are suitable for people who are actively following the transportationd and ride sharing scene. Interesting to note that topics 1, 2 and 5 are quite close to each other. One possible reason might be because ecommerce/ride-sharing and having a good product team goes hand in hand. Ride-hailing companies like Uber and Grab would need to have strong product teams to develop their mobile platforms, while this is also true for ecommerce like Lazada and Carousell.
6. Topic 6 is probably about China's social media social media scene, with key terms such as wechat, tencent, app, social media being strong hints of this. Posts from this topic are for people who are interested in social media applications or the china tech scence in general.
7. Topic 7 is most likely about one of the latest craze, the development of cryptocurrency, especially in Singapore. With key terms such as blockchain, ico and bitcoin, it is highly likely related to cryptocurrency. One interesting thing to note are some terms associated with this topic such as government, strengthening the idea of Singapore aiming to be a crypto haven; and also security, another concern that comes with crypto for many people. This is definitely one to recommend to crytomaniacs, and interestingly also seems to be closely related with the data science topic as well.

Above are some of my observations and analysis of the main topics in the TIA posts. Similar inferences can be made for the rest of the posts, but I'll leaeve it at this 7 of them for now in case I bore you guys too much. While for this notebook, I have only run the model and shown the visualisation for the results, it is possible to find out the main topics for each post with additional lines of python code as well.

This topic modelling is important as it is essential to better understand what kind of possible topics that Tech in Asia writers are generating for the posts, beyond the usual tags that is included, which some of it may be abit generic for good post recommendation (for instance the News and Startups categories that are in more than 1/3 of the posts based on the PowerBI dashboard), especially if TIA is to go for monetization using recommendation of premium content. 

A better understanding of the topics generated by the writers can help in recommending content to them if their interests can be captured. In fact, TIA members, when signing up or modifying their profile, can even choose to fill up articles they are interested in based on a range of topics available. These topics available for members to choose from can even draw reference from topic modelling results such as this. Also, a good understanding of the relationship between topics can help in prospective recommendations as well if two topics are found to be closely related to each other, with the idea that related topics can quip a reader's interest as well.

## Conclusion

This notebook serves to show the topic modelling I carried out on the TIA posts, with the aim of discovering latent posts' topics through the observable posts content. It walks through both the process of arriving at the topic model as well as the analysis of the topics derived. If given enough time, one thing I would want to do better is to maybe play around more with the parameters and see if its possible to derive even better topics and subsequently insights from this topic modelling. 

Credits to [Carson Sievert](https://github.com/cpsievert/LDAvis) for the LDAVis R package and the python port [pyLDAVis](https://github.com/bmabey/pyLDAvis) by Ben Mabey.