### Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus. It uncovers hidden patterns among the words in the corpus in an unsupervised manner. A topic is defined as “a repeating pattern of co-occurring terms in a corpus”. Put differently, topic modeling finds the groups of words (i.e., topics) that best represent the information in a collection of documents.

By surfacing these hidden patterns, topic modeling can support better decision making.

Topic modeling differs from rule-based text mining approaches that rely on regular expressions or dictionary-based keyword searches. It is an unsupervised approach for finding and observing groups of words (called “topics”) in large collections of text.

A good topic model yields, for example, “health”, “doctor”, “patient”, “hospital” for a “Healthcare” topic, and “farm”, “crops”, “wheat” for a “Farming” topic.

Topic models are very useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection. For example, the New York Times uses topic models to boost its user-article recommendation engines. In the recruitment industry, professionals use topic models to extract latent features of job descriptions and map them to the right candidates. Topic models are also used to organize large datasets of emails, customer reviews, and users' social media profiles.

There are many approaches to obtaining topics from text, such as term frequency-inverse document frequency (TF-IDF) weighting and non-negative matrix factorization (NMF). Latent Dirichlet Allocation (LDA) is among the most popular topic modeling techniques, and it is the one discussed in this article.

LDA assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution. Given a dataset of documents, LDA backtracks and tries to figure out what topics would create those documents in the first place.
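
To make the generative story above concrete, here is a minimal sketch that samples a toy document the way LDA assumes documents are produced. The vocabulary, the two topic distributions, and the Dirichlet parameter `alpha` are made-up values chosen for illustration; they are not taken from the model trained later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
vocab = ["exam", "study", "grade", "football", "kids", "field"]
# Each row is one topic's probability distribution over the vocabulary.
topic_word = np.array([
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # a "school"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],   # a "sports"-like topic
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Sample one document following LDA's generative assumptions."""
    # 1. Draw the document's topic mixture from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word position according to the mixture.
        z = rng.choice(len(theta), p=theta)
        # 3. Pick a word from that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print("topic mixture:", theta)
print("sampled words:", words)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions that could plausibly have produced them.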



In [1]:
import numpy as np
import sys, re
import nltk

utils_dir = '/Users/gshyam/utils/'
sys.path.append(utils_dir)

from nlp_utils import prepare_text
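
The `prepare_text` helper comes from a local `nlp_utils` module that is not shown here. Judging by the output further down, it appears to lowercase the text, strip punctuation, and remove stop words. The sketch below is a hypothetical stand-in under those assumptions; the name `prepare_text_sketch` and the short stop-word list are illustrative and are not the author's implementation.

```python
import re

# A small, hand-picked stop-word list for illustration only; the real helper
# may rely on a fuller list (for example NLTK's English stop words).
STOP_WORDS = {
    "i", "me", "my", "we", "they", "it", "s", "a", "an", "the", "this",
    "is", "are", "have", "and", "but", "to", "at", "in", "on", "with",
    "out", "while",
}

def prepare_text_sketch(text, TOKENIZE=True):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    # Return a token list, or a cleaned string when TOKENIZE is False.
    return tokens if TOKENIZE else " ".join(tokens)

print(prepare_text_sketch("My wife plans to go out tonight."))
# ['wife', 'plans', 'go', 'tonight']
```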

In [2]:
# Let's see how it works with the following sentences.

doc1 = "I have big exam tomorrow and I need to study hard to get a good grade. this Exam is hard."
doc2 = "My wife likes to go out with me but I prefer staying at home and studying."
doc3 = "Kids are playing football in the field and they seem to have fun"
doc4 = "Sometimes I feel depressed while driving and it's hard to focus on the road."
doc5 = "I usually prefer reading at home but my wife prefers watching a TV."

# array of documents aka corpus
corpus = [doc1, doc2, doc3, doc4, doc5]

## Processing and Tokenizing the text 

In [3]:
tokenized_data = [prepare_text(doc, TOKENIZE=True) for doc in corpus]
tokenized_data

[['big',
  'exam',
  'tomorrow',
  'need',
  'study',
  'hard',
  'get',
  'good',
  'grade',
  'exam',
  'hard'],
 ['wife', 'likes', 'go', 'prefer', 'staying', 'home', 'studying'],
 ['kids', 'playing', 'football', 'field', 'seem', 'fun'],
 ['sometimes', 'feel', 'depressed', 'driving', 'hard', 'focus', 'road'],
 ['usually', 'prefer', 'reading', 'home', 'wife', 'prefers', 'watching', 'tv']]

In [4]:
from gensim import corpora
dictionary = corpora.Dictionary(tokenized_data)

print("First 10 items in the dictionary (key: index, value: word)")
for item in list(dictionary.items())[:10]:
    print(item)

First 10 items in the dictionary (key: index, value: word)
(0, 'big')
(1, 'exam')
(2, 'get')
(3, 'good')
(4, 'grade')
(5, 'hard')
(6, 'need')
(7, 'study')
(8, 'tomorrow')
(9, 'go')


## Bag of Words (BoW) method 
Bag of Words is a common and very popular method for converting a text document into numerical values that can be fed into a model. Each unique word in the corpus is assigned an integer ID, and each document is represented as a list of (word ID, count) pairs, where the count is the number of times that word appears in the document.


In [5]:
# Transform the collection of texts to a numerical form
numerical_corpus = [dictionary.doc2bow(text) for text in tokenized_data]

In [6]:
numerical_corpus

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1)],
 [(9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)],
 [(16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)],
 [(5, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1)],
 [(10, 1), (12, 1), (15, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)]]

In the first line, the second entry `(1, 2)` represents the word `exam`, which has index 1 and appears 2 times in the document. Similarly, the word `hard` (index 5) appears twice, hence `(5, 2)`.
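
The same counting can be reproduced by hand for the first document, which may help clarify what `doc2bow` returns. This is a plain-Python sketch: IDs are assigned alphabetically here, which happens to reproduce the IDs shown above for this document, but gensim's own ID assignment may differ in general.

```python
from collections import Counter

# Tokens of the first document, as produced by the preprocessing step.
tokens = ["big", "exam", "tomorrow", "need", "study", "hard",
          "get", "good", "grade", "exam", "hard"]

# Assumption: assign integer IDs to the unique tokens in alphabetical order.
token2id = {tok: i for i, tok in enumerate(sorted(set(tokens)))}

# Count occurrences and express them as sorted (id, count) pairs.
counts = Counter(tokens)
bow = sorted((token2id[tok], n) for tok, n in counts.items())
print(bow)
# [(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1)]
```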

## LDA Model

The LDA model discovers the different topics that the documents represent and how much of each topic is present in a document. 

Python provides many great libraries for text mining. “gensim” is one such library for handling text data: it is clean, scalable, robust, and efficient.

In [7]:
from gensim.models import LdaModel

model = LdaModel(corpus=numerical_corpus, num_topics=10, id2word=dictionary)

all_topics = model.print_topics()

# Print the 5 highest-weighted words for each of the 10 topics
for i in range(10):
    print(f"Topic #{i} : {model.print_topic(i, 5)}")


Topic #0 : 0.030*"hard" + 0.030*"wife" + 0.030*"field" + 0.030*"prefer" + 0.030*"fun"
Topic #1 : 0.064*"wife" + 0.064*"focus" + 0.064*"driving" + 0.064*"depressed" + 0.064*"hard"
Topic #2 : 0.030*"hard" + 0.030*"prefer" + 0.030*"wife" + 0.030*"kids" + 0.030*"fun"
Topic #3 : 0.030*"home" + 0.030*"wife" + 0.030*"prefer" + 0.030*"hard" + 0.030*"sometimes"
Topic #4 : 0.030*"hard" + 0.030*"prefer" + 0.030*"wife" + 0.030*"fun" + 0.030*"home"
Topic #5 : 0.030*"hard" + 0.030*"wife" + 0.030*"field" + 0.030*"prefer" + 0.030*"sometimes"
Topic #6 : 0.147*"exam" + 0.147*"hard" + 0.077*"grade" + 0.077*"big" + 0.077*"tomorrow"
Topic #7 : 0.030*"home" + 0.030*"hard" + 0.030*"prefer" + 0.030*"wife" + 0.030*"field"
Topic #8 : 0.064*"home" + 0.064*"prefer" + 0.064*"football" + 0.064*"usually" + 0.064*"watching"
Topic #9 : 0.030*"hard" + 0.030*"wife" + 0.030*"prefer" + 0.030*"home" + 0.030*"kids"


The LDA model is now trained on the five simple sentences. To detect the topic of a new sentence or text, we first prepare the text in the same way as the training data and then pass it to the model.

## Testing the model

Let's find the most likely topics for a new document using the previously trained model:
`My wife plans to go out tonight.`

In [8]:
doc_new = "My wife plans to go out tonight."
doc_new_prepared = prepare_text(doc_new, TOKENIZE=True)
print ( doc_new_prepared )
doc_bow = dictionary.doc2bow(doc_new_prepared)
print (doc_bow)


['wife', 'plans', 'go', 'tonight']
[(9, 1), (15, 1)]


Notice that the bag of words for the new document is missing the words `plans` and `tonight`. The dictionary was built from only five sentences, so any word it has not seen is silently dropped. Only the words `wife` (index 15) and `go` (index 9) exist in the dictionary.
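
A quick way to see which tokens are dropped is to compare them against the dictionary's vocabulary. The sketch below uses a two-entry excerpt of the token-to-ID mapping, with the IDs taken from the text above; the helper name `split_known_unknown` is hypothetical.

```python
# Excerpt of the token -> id mapping (IDs as reported above).
token2id = {"go": 9, "wife": 15}

def split_known_unknown(tokens, token2id):
    """Separate tokens the dictionary knows from those it would drop."""
    known = [tok for tok in tokens if tok in token2id]
    unknown = [tok for tok in tokens if tok not in token2id]
    return known, unknown

known, unknown = split_known_unknown(["wife", "plans", "go", "tonight"], token2id)
print("kept:   ", known)
print("dropped:", unknown)
```

With the full gensim dictionary, the same check could be run against `dictionary.token2id`.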

In [9]:
def sort_list(A, key=0):
    # sort the list by the element at position `key` of each item
    A_sorted = sorted(A, key=lambda x: x[key])
    # reverse the sorted list so the largest element comes first
    A_sorted.reverse()
    return A_sorted

A = [(2, 1), (3, 4), (4, 1), (1, 3)]
A0 = sort_list(A, 0)
A1 = sort_list(A, 1)
print(f"Original list:\t\t{A} \nSorted by first:\t{A0} \nSorted by second:\t{A1}")


Original list:		[(2, 1), (3, 4), (4, 1), (1, 3)] 
Sorted by first:	[(4, 1), (3, 4), (2, 1), (1, 3)] 
Sorted by second:	[(3, 4), (1, 3), (4, 1), (2, 1)]


In [10]:
def print_top_k(topics, all_topics, k=2):
    # rank the (topic_id, probability) pairs by probability, highest first
    topics_sorted = sort_list(topics, key=1)
    for rank, (topic_id, _prob) in enumerate(topics_sorted[:k]):
        print(rank, all_topics[topic_id])
        
        

In [11]:
topics = model.get_document_topics(doc_bow)
# print_top_k only prints and returns nothing, so there is no result to assign
print_top_k(topics, all_topics, k=2)

0 (1, '0.064*"wife" + 0.064*"focus" + 0.064*"driving" + 0.064*"depressed" + 0.064*"hard" + 0.064*"go" + 0.064*"road" + 0.064*"staying" + 0.064*"likes" + 0.064*"home"')
1 (8, '0.064*"home" + 0.064*"prefer" + 0.064*"football" + 0.064*"usually" + 0.064*"watching" + 0.064*"reading" + 0.064*"tv" + 0.064*"prefers" + 0.064*"wife" + 0.064*"playing"')


The top two predicted topics for the new sentence `My wife plans to go out tonight.` are printed above.

## Similarity between documents


In [12]:
from gensim import similarities

lda_index = similarities.MatrixSimilarity(model[numerical_corpus])

doc_new = "We are going play soccer with the kids"
doc_new_prepared = prepare_text(doc_new, TOKENIZE=True)
print ( doc_new_prepared )
doc_bow = dictionary.doc2bow(doc_new_prepared)
print (doc_bow)

# use a new name so the imported `similarities` module is not overwritten
sims = lda_index[model[doc_bow]]

print(sims)

['going', 'play', 'soccer', 'kids']
[(19, 1)]
[0.08770634 0.1110798  0.9765234  0.11107814 0.97384095]


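This means the new sentence is closest to `doc3`, with a similarity score of about `0.977`. Note that this score is a cosine similarity between topic distributions, not a probability. It makes sense that the new doc `We are going play soccer with the kids` is closest in meaning to `Kids are playing football in the field and they seem to have fun`. However, `doc5` scores almost as high (`0.974`), and only one word of the new sentence (`kids`) was found in the dictionary, so the result should be read with caution on such a tiny corpus.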
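
`MatrixSimilarity` compares documents by the cosine similarity of their vectors, here the topic distributions produced by the model. The sketch below shows that computation with NumPy. The three-topic vectors are made-up values for illustration, not the actual distributions from the model above.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up topic mixtures over three topics, for illustration only.
query = [0.8, 0.1, 0.1]
doc_a = [0.7, 0.2, 0.1]   # similar mixture to the query
doc_b = [0.1, 0.1, 0.8]   # dominated by a different topic

print(cosine_similarity(query, doc_a))  # close to 1
print(cosine_similarity(query, doc_b))  # much lower
```

Two documents dominated by the same topic score close to 1 even if they share few words, which is why the scores above depend heavily on how well the topics themselves were learned.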

## Things that can be added here

* An n-gram vocabulary
* Word embeddings, so that related forms such as `play` and `playing` are treated as meaning the same thing.