## Topic Modeling On Translink Tweets

Topic modeling aims to discover the topics (or clusters) in a corpus of texts, such as emails or news articles, without knowing those topics in advance. This is its real strength: it needs no labeled or annotated data, only raw text, and from that raw text the algorithm infers what the documents are about.

One of the most popular approaches to topic modeling is Latent Dirichlet Allocation (LDA).

### What does LDA do?

LDA treats each document as a mixture of topics in certain proportions, and each topic as a mixture of keywords, again in certain proportions.

A topic is simply a collection of dominant keywords that characterize it. Looking at the keywords is usually enough to tell what the topic is about.

Once you give the algorithm the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics until it reaches a good composition of both.
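The two kinds of proportions described above can be sketched in a few lines. This is only a toy illustration: the topic names, words and numbers are invented for the example, not learned from the tweets, and `word_probability` is a hypothetical helper.

```python
# Toy illustration of the two distributions LDA works with.
# All names and numbers are invented for the example.

# Topic -> keyword proportions (each row sums to 1)
topic_word = {
    "delays": {"bus": 0.5, "late": 0.3, "train": 0.1, "card": 0.1},
    "fares": {"bus": 0.1, "late": 0.1, "train": 0.1, "card": 0.7},
}

# Document -> topic proportions (sums to 1)
doc_topic = {"delays": 0.8, "fares": 0.2}


def word_probability(word, doc_topic, topic_word):
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


# Word distribution implied for this document by the two mixtures
doc_word = {
    word: word_probability(word, doc_topic, topic_word)
    for word in ["bus", "late", "train", "card"]
}
print(doc_word)
```

Training LDA goes in the opposite direction: it only sees the words of each document and has to infer both tables.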




### Preprocessing
We need to remove frequently used words that carry little meaning, such as 'the' and 'a'. These are called stopwords.

We also need to lemmatize the words. Lemmatization converts a word to its dictionary form (its lemma). For example, the lemma of ‘machines’ is ‘machine’. Likewise, ‘walking’ → ‘walk’, ‘mice’ → ‘mouse’ and so on.
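As a rough sketch of these two steps, the snippet below filters a token list against a stopword set and then maps each remaining token to its lemma. `TOY_STOPWORDS` and `TOY_LEMMAS` are hypothetical stand-ins; the notebook itself relies on NLTK's stopword list and spaCy's lemmatizer.

```python
# Hypothetical mini stopword list and lemma table, for illustration only
TOY_STOPWORDS = {"the", "a", "of", "to", "are"}
TOY_LEMMAS = {"machines": "machine", "walking": "walk", "mice": "mouse"}


def preprocess(tokens):
    """Drop stopwords, then replace each remaining token by its lemma if known."""
    kept = [token for token in tokens if token not in TOY_STOPWORDS]
    return [TOY_LEMMAS.get(token, token) for token in kept]


result = preprocess(["the", "mice", "are", "walking", "to", "the", "machines"])
print(result)  # ['mouse', 'walk', 'machine']
```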

### Remove Stopwords, Make Bigrams and Lemmatize
Bigrams are pairs of words that frequently occur together in the corpus; trigrams are sequences of three such words.

Gensim’s Phrases model can detect bigrams, trigrams, quadgrams and longer phrases. Its two most important arguments are min_count and threshold. The higher these values, the harder it is for words to be combined into phrases.
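To see why higher values make phrases rarer, here is a simplified sketch of the collocation score that Gensim's default scorer is based on (the formula from Mikolov et al., 2013). `find_phrases` and the toy sentences are hypothetical, and the vocabulary size here counts single words only, so the numbers will not match Gensim's exactly.

```python
from collections import Counter


def find_phrases(sentences, min_count, threshold):
    """Return adjacent word pairs whose collocation score exceeds `threshold`."""
    unigram_counts = Counter(word for sent in sentences for word in sent)
    bigram_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigram_counts)

    phrases = set()
    for (first, second), pair_count in bigram_counts.items():
        # Pairs seen min_count times or fewer score <= 0,
        # so they never pass a non-negative threshold
        score = (
            (pair_count - min_count)
            / (unigram_counts[first] * unigram_counts[second])
            * vocab_size
        )
        if score > threshold:
            phrases.add((first, second))
    return phrases


toy_sentences = [
    ["new", "west", "station"],
    ["new", "west", "train"],
    ["new", "west", "bus"],
    ["bus", "late"],
    ["train", "late"],
]

low_threshold = find_phrases(toy_sentences, min_count=1, threshold=1)
high_threshold = find_phrases(toy_sentences, min_count=1, threshold=2)
print(low_threshold)   # {('new', 'west')}
print(high_threshold)  # set()
```

Raising min_count shrinks the numerator and raising threshold raises the bar, so both make it harder for a pair to be merged.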

### Loading required packages

In [1]:
import re
import numpy as np
import pandas as pd
import nltk
from pprint import pprint
# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gsdmm.mgp import MovieGroupProcess
from gensim.models import CoherenceModel
nltk.download('stopwords')

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\acer\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['https', '@', 's', 'http', 'co', 'woman', 'man','translink'])

In [3]:
# Import Dataset
df = pd.read_csv('translink.csv').sample(frac=1, random_state=0)
df[['TweetText','TweetDateTime','Followers','Friends','Statuses','Favourites','Sentiment']].head()



| | TweetText | TweetDateTime | Followers | Friends | Statuses | Favourites | Sentiment |
|---|---|---|---|---|---|---|---|
| 25559 | @TransLink kudos to the driver of 100 bus this... | 2015-11-15 21:22 | 62 | 142 | 3130 | 9385 | Negative |
| 251957 | @scronide Thank you, will let control know.^jd | 2013-11-14 2:43 | 0 | 0 | 0 | 0 | |
| 84769 | @TransLink first trains leaving new west stn | 2016-04-10 14:57 | 89 | 610 | 160 | 214 | Positive |
| 40636 | @TransLink whats going on with 16 southbound o... | 2016-12-19 3:49 | 225 | 320 | 209 | 4 | Negative |
| 299766 | @TransLink Lougheed station is so crowded righ... | 2018-11-13 15:51 | 11 | 7 | 5 | 1 | Neutral |


In [4]:
# Convert to list
data = df.TweetText.values.tolist()

print(data[0])

@TransLink kudos to the driver of 100 bus this AM who refused to leave woman in distress along Marine Dr #thingsmoreimportantthanbeinglate


### Tokenize each sentence

Here we tokenize each tweet into a list of words, dropping punctuation and other unwanted characters.

Gensim’s simple_preprocess() handles this well.

In [5]:
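def sent_to_words(sentences):
    """Yield each sentence as a list of lowercase tokens of 3 to 15 characters."""
    for sentence in sentences:
        # deacc=True strips accent marks; punctuation is already dropped by the tokenizer
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True, min_len=3, max_len=15)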

data_words = list(sent_to_words(data))

print(data_words[0])

['translink', 'kudos', 'the', 'driver', 'bus', 'this', 'who', 'refused', 'leave', 'woman', 'distress', 'along', 'marine']
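To make the effect of the arguments concrete, the following standard-library sketch approximates what simple_preprocess does with these settings: lowercase the text, strip accents, keep runs of letters, and filter by length. `toy_preprocess` is a hypothetical approximation, not Gensim's implementation, and may differ on edge cases such as underscores.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def toy_preprocess(text, min_len=3, max_len=15):
    """Lowercase, strip accents, keep alphabetic tokens within the length bounds."""
    text = strip_accents(text.lower())
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return [token for token in tokens if min_len <= len(token) <= max_len]


tweet = (
    "@TransLink kudos to the driver of 100 bus this AM who refused to leave "
    "woman in distress along Marine Dr #thingsmoreimportantthanbeinglate"
)
tokens = toy_preprocess(tweet)
print(tokens)
```

Short words such as "to" and "Dr", the number 100 and the long hashtag are all dropped, which matches the token list printed above.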


In [6]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=10, threshold=50) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=200)  

# Phraser is a lighter, faster object for applying a trained Phrases model to sentences
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])

['translink', 'kudos', 'the', 'driver', 'bus', 'this', 'who', 'refused', 'leave', 'woman', 'distress', 'along', 'marine']


In [7]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    # texts is already tokenized, so filter the tokens directly
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

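def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is allowed.

    Relies on the global spaCy pipeline `nlp`, loaded in a later cell.
    Tag reference: https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out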

In [8]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Load spaCy's small English model
# (install it with: python -m spacy download en_core_web_sm)
import en_core_web_sm
nlp = en_core_web_sm.load()
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[0])

['driver', 'bus', 'refuse', 'leave', 'distress']


## Create the Dictionary and Corpus needed for LDA
The two main inputs to the LDA topic model are the dictionary (id2word) and the corpus. Let’s create them:

In [9]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Documents to convert
texts = data_lemmatized

# Bag-of-words corpus: each document becomes a list of (token_id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[0])

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]


In [10]:
# Human-readable view of the corpus: (token, count) instead of (token_id, count)
[[(id2word[token_id], freq) for token_id, freq in doc] for doc in corpus[:1]]

[[('bus', 1), ('distress', 1), ('driver', 1), ('leave', 1), ('refuse', 1)]]
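The mapping from tokens to ids and then to (token_id, count) pairs can be sketched with the standard library. `ToyDictionary` is a hypothetical stand-in that only mimics the idea behind corpora.Dictionary and doc2bow; the way it assigns ids is an assumption chosen to reproduce the output above.

```python
from collections import Counter


class ToyDictionary:
    """Hypothetical stand-in for a token <-> id mapping."""

    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            # Assign the next free id to each unseen token, in sorted order
            for token in sorted(set(doc)):
                self.token2id.setdefault(token, len(self.token2id))
        self.id2token = {token_id: token for token, token_id in self.token2id.items()}

    def doc2bow(self, doc):
        """Return sorted (token_id, count) pairs for the known tokens in `doc`."""
        counts = Counter(token for token in doc if token in self.token2id)
        return sorted((self.token2id[token], n) for token, n in counts.items())


toy_docs = [
    ["driver", "bus", "refuse", "leave", "distress"],
    ["bus", "late", "bus"],
]
toy_dictionary = ToyDictionary(toy_docs)
bow_first = toy_dictionary.doc2bow(toy_docs[0])
bow_second = toy_dictionary.doc2bow(toy_docs[1])
print(bow_first)   # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
print(bow_second)  # [(0, 2), (5, 1)]
```

Word order is discarded: the model only sees which tokens occur in a document and how often.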

## Building the Topic Model
We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

Apart from that, alpha and eta are hyperparameters that affect the sparsity of the document-topic and topic-word distributions, respectively. According to the Gensim docs, both default to a symmetric 1.0/num_topics prior. The model below sets alpha='auto' instead, so Gensim learns an asymmetric prior from the corpus.

chunksize is the number of documents used in each training chunk, update_every determines how often the model parameters are updated, and passes is the total number of training passes over the corpus.
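A small sketch of how these settings interact, under the assumption that with update_every=1 the model is updated once per chunk. The corpus size of 450 and the helper names are hypothetical, chosen only to keep the arithmetic visible.

```python
def symmetric_prior(num_topics):
    """Symmetric prior: every topic gets the same weight, 1.0 / num_topics."""
    return [1.0 / num_topics] * num_topics


def iter_chunks(corpus, chunksize):
    """Yield consecutive slices of at most `chunksize` documents."""
    for start in range(0, len(corpus), chunksize):
        yield corpus[start:start + chunksize]


num_topics = 12
alpha = symmetric_prior(num_topics)

toy_corpus = list(range(450))  # hypothetical stand-in for 450 documents
chunks = list(iter_chunks(toy_corpus, chunksize=200))
passes = 10

# Assumption: one model update per chunk, repeated on every pass
total_updates = len(chunks) * passes
print([len(chunk) for chunk in chunks])  # [200, 200, 50]
print(total_updates)                     # 30
```

Smaller chunks mean more frequent updates per pass, while more passes let the model revisit every document.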

In [11]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=200,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [12]:
# Print the top keywords for each of the 12 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.107*"due" + 0.095*"today" + 0.092*"would" + 0.073*"need" + 0.048*"trip" + '
  '0.039*"detail" + 0.038*"give" + 0.038*"well" + 0.034*"waterfront" + '
  '0.032*"detour"'),
 (1,
  '0.195*"train" + 0.094*"line" + 0.057*"back" + 0.040*"much" + '
  '0.033*"feedback" + 0.031*"expo" + 0.029*"always" + 0.020*"nice" + '
  '0.019*"area" + 0.017*"ride"'),
 (2,
  '0.201*"thank" + 0.119*"time" + 0.071*"station" + 0.051*"take" + '
  '0.046*"schedule" + 0.036*"make" + 0.034*"number" + 0.026*"may" + '
  '0.025*"info" + 0.022*"sure"'),
 (3,
  '0.104*"happen" + 0.102*"report" + 0.099*"issue" + 0.081*"morning" + '
  '0.034*"open" + 0.033*"customer" + 0.031*"main" + 0.030*"yet" + '
  '0.025*"operate" + 0.023*"put"'),
 (4,
  '0.077*"route" + 0.073*"people" + 0.055*"right" + 0.046*"arrive" + '
  '0.045*"amp" + 0.044*"able" + 0.044*"try" + 0.041*"have" + 0.038*"head" + '
  '0.037*"downtown"'),
 (5,
  '0.099*"get" + 0.074*"skytrain" + 0.067*"know" + 0.065*"s" + 0.055*"minute" '
  '+ 0.047*"pass" + 0.

In [13]:
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.20589083447406642
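Coherence measures how often the top words of a topic actually appear together in documents. The c_v measure used above is fairly involved, so the sketch below swaps in the simpler UMass coherence (Mimno et al., 2011) purely to illustrate the idea; it is not what CoherenceModel computes for 'c_v'. The toy documents are invented, and the function assumes every top word occurs in at least one document.

```python
import math


def umass_coherence(top_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / doc_freq(top_words[l]))
    return score


toy_docs = [
    ["bus", "late", "delay"],
    ["bus", "delay", "wait"],
    ["bus", "late", "wait"],
    ["compass", "card", "tap"],
    ["compass", "card", "fare"],
]

coherent = umass_coherence(["bus", "late", "delay"], toy_docs)
mixed = umass_coherence(["bus", "card", "delay"], toy_docs)
print(coherent, mixed)
```

A topic whose keywords co-occur in the same documents scores higher than one that mixes unrelated words.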


## Short Text Topic Modeling

LDA performs well on medium or large texts (more than about 50 words, the typical size of emails and news articles), but it performs poorly on short texts such as tweets, Reddit posts or StackOverflow question titles.

LDA assumes that a text is a mixture of topics. That assumption rarely holds for short texts, so from here on we assume that each short text is about a single topic.

### Gibbs Sampling Dirichlet Mixture Model (GSDMM)

The Gibbs Sampling Dirichlet Mixture Model (GSDMM) is an “altered” LDA algorithm that performs well on short text topic modeling (STTM) tasks. It starts from the assumption that each document belongs to exactly one topic: all the words in a document are generated from that single topic, not from a mixture of topics as in the original LDA.
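The one-topic-per-document assumption shows up in the sampling step: each document is reassigned as a whole to one cluster, with a probability that grows with the cluster's size and with how many of the document's words the cluster already contains. The sketch below follows the conditional probability given in the GSDMM paper (Yin and Wang, 2014). `cluster_probabilities` and the toy cluster state are hypothetical, and a real implementation would typically work in log space for numerical stability.

```python
from collections import Counter


def cluster_probabilities(doc, clusters, n_docs, vocab_size, alpha=0.1, beta=0.1):
    """Probability of assigning `doc` to each cluster.

    `clusters` is a list of dicts holding 'doc_count' and 'word_counts',
    both computed with `doc` itself removed from its current cluster.
    """
    n_clusters = len(clusters)
    doc_word_counts = Counter(doc)
    scores = []
    for cluster in clusters:
        cluster_size = sum(cluster["word_counts"].values())
        # Popularity term: larger clusters attract more documents
        score = (cluster["doc_count"] + alpha) / (n_docs - 1 + n_clusters * alpha)
        # Similarity term: reward words the cluster already contains
        for word, freq in doc_word_counts.items():
            for j in range(1, freq + 1):
                score *= cluster["word_counts"][word] + beta + j - 1
        for i in range(1, len(doc) + 1):
            score /= cluster_size + vocab_size * beta + i - 1
        scores.append(score)
    total = sum(scores)
    return [score / total for score in scores]


toy_clusters = [
    {"doc_count": 2, "word_counts": Counter({"bus": 3, "late": 2, "delay": 1})},
    {"doc_count": 2, "word_counts": Counter({"compass": 2, "card": 3, "tap": 1})},
]

probabilities = cluster_probabilities(
    ["bus", "late"], toy_clusters, n_docs=5, vocab_size=6
)
print(probabilities)
```

Here the tweet about a late bus is assigned to the first cluster with near certainty, because the second cluster contains none of its words.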

In [15]:
K = 10  # Upper bound on the number of topics (clusters)
docs = data_lemmatized

# Initialize the Gibbs Sampling Dirichlet Mixture Model
mgp = MovieGroupProcess(K=K, alpha=0.1, beta=0.1, n_iters=30)

# Vocabulary size, required by fit()
vocab = set(x for doc in docs for x in doc)
n_terms = len(vocab)
n_docs = len(docs)

# Fit the model; y holds the cluster assigned to each document
y = mgp.fit(docs, n_terms)


In stage 0: transferred 338964 clusters with 10 clusters populated
In stage 1: transferred 315467 clusters with 10 clusters populated
In stage 2: transferred 264938 clusters with 10 clusters populated
In stage 3: transferred 187426 clusters with 10 clusters populated
In stage 4: transferred 151606 clusters with 10 clusters populated
In stage 5: transferred 139882 clusters with 10 clusters populated
In stage 6: transferred 135618 clusters with 10 clusters populated
In stage 7: transferred 133751 clusters with 10 clusters populated
In stage 8: transferred 133439 clusters with 10 clusters populated
In stage 9: transferred 132368 clusters with 10 clusters populated
In stage 10: transferred 132399 clusters with 10 clusters populated
In stage 11: transferred 131354 clusters with 10 clusters populated
In stage 12: transferred 131124 clusters with 10 clusters populated
In stage 13: transferred 130877 clusters with 10 clusters populated
In stage 14: transferred 131035 clusters with 10 clusters 

In [23]:
for i in range(10):
    print('Cluster Number'+' '+str(i)+' '+'Most Popular Words are:'+'\t')
    arg=np.argsort(-np.array([v for v in mgp.cluster_word_distribution[i].values()]))[0:100]
    print(np.array([k for k in mgp.cluster_word_distribution[i].keys()])[arg])
    print()

Cluster Number 0 Most Popular Words are:	
['bus' 'wait' 'run' 'delay' 'due' 'sorry' 'late' 'cancel' 'stop' 'trip'
 'report' 'time' 'min' 'leave' 'see' 'show' 'today' 'next' 'check'
 'problem' 'long' 'issue' 'apology' 'minute' 'jkd' 'schedule'
 'unfortunately' 'currently' 'able' 'get' 'way' 'traffic' 'arrive'
 'really' 'running' 'morning' 'mechanical' 'make' 'still' 'go' 'hope'
 'look' 'early' 'miss' 'amp' 'route' 'driver' 'coach' 'hear' 'couple'
 'gps' 'approx' 'right' 'come' 'thank' 'shortly' 'hopefully' 'moment'
 'departure' 'momentarily' 'behind_schedule' 'station' 'pass' 'close'
 'cause' 'service' 'be' 'appear' 'depart' 'behind' 'good' 'experience'
 'may' 'away' 'board' 'soon' 'cancellation' 'sure' 'already' 'far' 'yet'
 'control' 'last' 'happen' 'train' 'number' 'specific' 'tonight' 'back'
 'almost' 'response' 'result' 'earlier' 'full' 's' 'longer' 'bridgeport'
 'downtown' 'catch' 'one']

Cluster Number 1 Most Popular Words are:	
['bus' 'driver' 'train' 'skytrain' 'stop' 'get' 'ca

 'jrlezfsrpu' 'change' 'dept' 'could']
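The argsort expression above ranks the words of a cluster by count. Assuming each entry of cluster_word_distribution is a plain word-to-count dictionary, the same ranking can be sketched with collections.Counter. The toy counts and the `top_words` helper are hypothetical.

```python
from collections import Counter

# Hypothetical word counts for two clusters
toy_cluster_word_distribution = [
    {"bus": 40, "wait": 25, "delay": 30, "sorry": 10},
    {"compass": 22, "card": 35, "tap": 15, "bus": 5},
]


def top_words(word_counts, n=10):
    """Return the n most frequent words of a cluster, most frequent first."""
    return [word for word, _ in Counter(word_counts).most_common(n)]


top_three = [top_words(counts, n=3) for counts in toy_cluster_word_distribution]
for cluster_id, words in enumerate(top_three):
    print(cluster_id, words)
```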



In [17]:
doc_count = np.array(mgp.cluster_doc_count)
print('Number of documents per topics :', doc_count)
print('*'*20)

# Topics sorted by document inside
top_index = doc_count.argsort()[-10:][::-1]
print('Most important clusters (by number of docs inside):', top_index)
print('*'*20)

Number of documents per topics : [32377 32307 39365 48616 31391 53049 29411 53943 35514 28915]
********************
Most important clusters (by number of docs inside): [7 5 3 2 8 0 1 4 6 9]
********************


In [26]:
# Hand-written names, listed in cluster order: name i describes cluster i above
topic_names = ['bus delay', 'bus issues like bad smell or door not working', 'delay in skytrain stations', 'bus route change', 'police control at skytrain', 'appreciation, good service', 'transit fare', 'bus is full', 'bus delay due to an issue like construction or weather', 'translink appreciates feedback']

# Map each cluster index to its name
topic_dict = {i: name for i, name in enumerate(topic_names)}

In [30]:
for i in range(10):
    print(topic_names[i]+': '+'Most Popular Words are:'+'\t')
    arg=np.argsort(-np.array([v for v in mgp.cluster_word_distribution[i].values()]))[0:10]
    print(np.array([k for k in mgp.cluster_word_distribution[i].keys()])[arg])
    print()

bus delay: Most Popular Words are:	
['bus' 'wait' 'run' 'delay' 'due' 'sorry' 'late' 'cancel' 'stop' 'trip']

bus issues like bad smell or door not working: Most Popular Words are:	
['bus' 'driver' 'train' 'skytrain' 'stop' 'get' 'car' 'people' 'station'
 'door']

delay in skytrain stations: Most Popular Words are:	
['train' 'station' 'line' 'skytrain' 'go' 'waterfront' 'expo' 'run'
 'leave' 's']

bus route change: Most Popular Words are:	
['bus' 'stop' 'time' 'wait' 'next' 'come' 'leave' 'go' 'show' 'schedule']

police control at skytrain: Most Popular Words are:	
['thank' 'know' 'control' 'let' 'skytrain' 'info' 'call' 'report' 'look'
 'jkd']

appreciation, good service: Most Popular Words are:	
['thank' 'bus' 'good' 'get' 'time' 'day' 'work' 'great' 'help' 'make']

transit fare: Most Popular Words are:	
['bus' 'zone' 'use' 'compass' 'pass' 'ticket' 'card' 'tap' 'get' 'pay']

bus is full: Most Popular Words are:	
['bus' 'stop' 'wait' 'driver' 'late' 'go' 'get' 'time' 'minute' 'people
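To combine the names with the cluster sizes printed earlier, the sketch below ranks the named clusters by document count. It assumes the hand-written names are listed in cluster-index order, as the keyword lists suggest; the counts are copied from the output of cell 17.

```python
# Document counts per cluster, copied from the output above
doc_count = [32377, 32307, 39365, 48616, 31391, 53049, 29411, 53943, 35514, 28915]

# Assumption: name i describes cluster i
topic_names = [
    "bus delay",
    "bus issues like bad smell or door not working",
    "delay in skytrain stations",
    "bus route change",
    "police control at skytrain",
    "appreciation, good service",
    "transit fare",
    "bus is full",
    "bus delay due to an issue like construction or weather",
    "translink appreciates feedback",
]
topic_dict = dict(enumerate(topic_names))

# Cluster indices sorted from largest to smallest cluster
ranked = sorted(topic_dict, key=lambda cluster: doc_count[cluster], reverse=True)

total_docs = sum(doc_count)
for cluster in ranked:
    share = doc_count[cluster] / total_docs
    print(f"{cluster}: {topic_dict[cluster]:<55} {doc_count[cluster]:>6} ({share:.1%})")
```

The resulting order matches the "most important clusters" line printed earlier.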

## References
https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
https://github.com/rwalk/gsdmm
https://github.com/mamrou/short_text_topic_modeling/blob/master/notebook_sttm_example.ipynb
https://www.coursera.org/learn/python-text-mining
https://www.coursera.org/learn/language-processing