# Visualizing a Topic Model using pyLDAVis

For this project, we try to see if we can conduct some topic modelling using sklearn's in-built LDA and pyLDAVis to visualise the topic models.

First, we import the libraries needed, including:

- **sklearn** library for running LDA on the comments
- **pyLDAVis** for visualisation
- **MYSQLDb** library for connecting to the database and extract the comments
- Stanford's **nltk** (natural language toolkit) for text processing
- **pandas** for its dataframe

Note that this notebook runs on Python 2.7 due to pyLDAVis not being able to run on Python 3.

In [1]:
# supress warnings on sklearn LDA
import warnings
warnings.filterwarnings("ignore")

from __future__ import print_function

# import pyLDAVis
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

# import MySQLdb to use the connection later
import MySQLdb

# import re for regex
import re

# import sklearn vectorizer and LDA library
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# import pandas for dataframe
import pandas as pd

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords');

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Specify the connection information to the MySQL database and then connect to the database using pandas, retrieving the data as a dataframe. We only need the comments column for the topic modelling.

In [2]:
connec = MySQLdb.connect(host="localhost",    # your host, usually localhost
                     user="root",         # your username
                     passwd="user1991",  # your password
                     db="tia_test")        # name of the data base

comments_df = pd.read_sql('select * from comments;', con=connec)
comments_doc = comments_df['comment']

Next we look at how we can clean up the text in the comments. Some approaches are as follows:

- setting all words to lowercase
- removing all punctuations
- removing stopwords based on stopword list
- removing all tokens with only numbers
- strip leading and trailing spaces
- customise some terms (will see later)

In [3]:
# custom stopword list based on understanding of the dataset
#stop_word_list = ["salary", "cum", "up", "to", "mnc", "mncs", "job", "id", "etc", "bonus", "per", "month", "aws", "days",
#                 "immediate", "position", "positions", "temp", "temporary", "perm", "permanant", "urgent", "sea", 
#                  "singapore", "apac", "asia", "pacific", "ot", "weekend", "weekday"]
stop_word_list = ["the", "be", "to", "of", "and", "a", "in", "that", "have", "i", "it", "for", "not", "on", "with", "he", "as", "you", "do", "at",
                  "this", "but", "his", "by", "from", "they", "we", "say", "her", "she", "or", "an", "will", "my", "one", "all", "would", "there",
                  "their", "what", "so", "up", "out", "if", "about", "who", "get", "which", "go", "me", "when", "make", "can", "like", "time", "no",
                  "just", "him", "know", "take", "people", "into", "year", "your", "good", "some", "could", "them", "see", "other", "than", "then",
                  "now", "look", "only", "come", "its", "over", "think", "also", "back", "after", "use", "two", "how", "our", "work", "first", "well",
                  "way", "even", "new", "want", "because", "any", "these", "give", "day", "most", "us", 'https', 'www', 'crystal',
                  'etc', 'is', 'youre', 'were', 'im', 'hes', 'shes', 'theyre', 'are', 'yes', 'hi', 'isnt', 'wasnt', 'hey', 'co']

stops = list(stopwords.words('english'))

def treat_text(text):
    # decode text to work with special characters
    text = text.decode('utf8')
    
    # lower case the words
    text = text.lower()
    
    # substitution for consistency for words
    text = re.sub("e-commerce", "ecommerce", text)
    text = re.sub("go-jek", "gojek", text)
    
    # custom regexes based on text   
    # remove all symbols except letters numbers and blank spaces
    text = re.sub(r'[^A-Za-z0-9\s]+', ' ', text)
    
    text = re.sub(r'\b[0-9]+\b\s*', '', text)
    # remove stopwords based on nltk stop list and some custom words from looking at the dataset
    text = ' '.join([word for word in text.split() if word not in stop_word_list and word not in stops])
    
    # substitution for consistency for words
    text = re.sub("tia", "tech-in-asia", text)
    text = re.sub("tech in asia", "tech-in-asia", text)
    text = re.sub("data sets", "datasets", text)
    text = re.sub(" ai ", " artificial-intelligence ", text)
    text = re.sub(" ml ", " machine-learning ", text)
    text = re.sub("data science", "data-science", text)
    text = re.sub("data sciences", "data-science", text)
    text = re.sub("se asia", "se-asia", text)
    text = re.sub("startups", "startup", text)
    text = re.sub("southeast asia", "se-asia", text)
    text = re.sub("south-east asia", "se-asia", text)
    text = re.sub("facebooks", "facebook", text)
    text = re.sub("icos", "ico", text)
    
    # compress all more than 1 blank spaces into a single blank space and finally strip the trailing and leading blank spaces
    text = re.sub("\s\s+" , " ", text).strip()
    
    return text

In [4]:
def check_cleaning(text):
    print('Before:', text.lower())                   # title before cleaning
    print('After:', treat_text(text))        # title after cleaning

In [5]:
check_cleaning(comments_doc[172])
print()
check_cleaning(comments_doc[555])
print()
check_cleaning(comments_doc[722])
print()
check_cleaning(comments_doc[922])
print()

Before: i would actually say that data science is not mission-critical for most e-commerce/marketplace/etc startup companies. data science has the most predictive power and application when you have large data sets, which frankly, most startups will not have.
in the early years of go-jek, the role and vision of data started very early  we started small, and ensured that we were exceptionally involved in the product development from the get-go. that is, just ensure that there is a working feedback loop between data teams and other parts of the company, such as finance, product owners, and marketing, and it should happen as early as possible. because the bigger a company is, the harder it is to integrate effectively, or at all. if you have proper data sets, the data science can come later. solve the easy problems first 
make sure that data is structured properly first. having vast amounts of unstructured data is a sign that the bi team and a product development team were not coordinating

In [6]:
# remove duplicates and apply text cleaning on entire dataframe
comments_parsed = (comments_doc).apply(treat_text)

In [7]:
tf_vectorizer = CountVectorizer(analyzer='word',
                                max_df = 0.5, 
                                min_df = 10)

dtm_tf = tf_vectorizer.fit_transform(comments_parsed)
print(dtm_tf.shape)

(1437, 513)


In [8]:
lda_tf = LatentDirichletAllocation(n_topics=20, 
                                   random_state=0,
                                   max_iter=100)
lda_tf.fit(dtm_tf)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=100,
             mean_change_tol=0.001, n_components=10, n_jobs=1, n_topics=20,
             perp_tol=0.1, random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [9]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  topic_term_dists = topic_term_dists.ix[topic_order]
