 # <div align="center">Latent Dirichlet Allocation(LDA): Topic Modeling</div>
---------------------------------------------------------------------

Levon Khachatryan:
  
  
 <img src="pics/LDA.jpg" />

 <a id="top"></a> <br>
## Notebook  Content
1. [LDA Algorithm](#1)
  
  
2. [Problem Definition](#2)
  
  
3. [Import Packages](#3)
  
  
4. [Load Data](#4)
  
  
5. [Used functions for Data Preprocessing](#5)
  
  
6. [Data Preprocessing](#6)
  
  
7. [Model Deployment](#7)
  
  
8. [Save Model to Disk](#8)
  
  
9. [Load Model From Disk](#9)
  
  
10. [Detailed Information of Topics](#10)
  
  
11. [Word Cloud](#11)
  
  
12. [Message Analysis by Country](#12)

<a id="1"></a> <br>

# <div align="center">1. LDA Algorithm</div>
---------------------------------------------------------------------


### Background
Topic modeling is the process of identifying topics in a set of documents. This can be useful for search engines, customer service automation, and any other instance where knowing the topics of documents is important. There are multiple methods of going about doing this, but here I will explain one: Latent Dirichlet Allocation (LDA).  
  
  
  
### The Algorithm
LDA is a form of unsupervised learning that views documents as bags of words (**ie order does not matter**). LDA works by first making a key assumption: the way a document was generated was by picking a set of topics and then for each topic picking a set of words. Now you may be asking “ok so how does it find topics?” Well the answer is simple: it reverse engineers this process. To do this it does the following for each document m:  
  
1. Assume there are k topics across all of the documents
2. Distribute these k topics across document m (this distribution is known as **α** and can be symmetric or asymmetric, more on this later) by assigning each word a topic.
3. For each word w in document m, assume its topic is wrong but every other word is assigned the correct topic.
4. Probabilistically assign word w a topic based on two things:
    1. what topics are in document m
    2. how many times word w has been assigned a particular topic across all of the documents (this distribution is called β, more on this later)
5. Repeat this process a number of times for each document and you’re done!
  
  
  
### The Model
<img src="pics/model.png" />  
  
Smoothed LDA from https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation  
  
Above is what is known as a plate diagram of an LDA model where:  
α is the per-document topic distributions,  
β is the per-topic word distribution,  
θ is the topic distribution for document m,  
φ is the word distribution for topic k,  
z is the topic for the n-th word in document m, and  
w is the specific word  
  
  
  
### Tweaking the Model
In the plate model diagram above, you can see that w is grayed out. This is because it is the only observable variable in the system while the others are latent. Because of this, to tweak the model there are a few things you can mess with and below I focus on two.  
  
  
α is a matrix where each row is a document and each column represents a topic. A value in row i and column j represents how likely document i contains topic j. A symmetric distribution would mean that each topic is evenly distributed throughout the document while an asymmetric distribution favors certain topics over others. This affects the starting point of the model and can be used when you have a rough idea of how the topics are distributed to improve results.  
  
  
β is a matrix where each row represents a topic and each column represents a word. A value in row i and column j represents how likely that topic i contains word j. Usually each word is distributed evenly throughout the topic such that no topic is biased towards certain words. This can be exploited though in order to bias certain topics to favor certain words. For example if you know you have a topic about Apple products it can be helpful to bias words like “iphone” and “ipad” for one of the topics in order to push the model towards finding that particular topic.  
  
  
  
### Conclusion
This part is not meant to be a full-blown LDA tutorial, but rather to give an overview of how LDA models work and how to use them. There are many implementations out there such as Gensim that are easy to use and very effective.

<a id="2"></a> <br>

# <div align="center">2. Problem Definition</div>
---------------------------------------------------------------------
[go to top](#top)

So here we have a text data (from messenger) and our aim is to find some topics from data (do topic modeling). Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is an example of topic model and is used to classify text in a document to a particular topic. It builds a topic per document model and words per topic model, modeled as Dirichlet distributions.

<a id="3"></a> <br>

# <div align="center">3. Import Packages</div>
---------------------------------------------------------------------
[go to top](#top)

In [1]:
'''
Loading numpy and pandas libraries
'''
import numpy as np
import pandas as pd


'''
Loading Gensim and nltk libraries
'''
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
# from nltk.stem.porter import * 
# from nltk.stem.porter import PorterStemmer


'''
Load english wards from nltk , and english stemmers
'''
import nltk
nltk.download('wordnet')
nltk.download('words')
words = set(nltk.corpus.words.words())
stemmer = SnowballStemmer("english")


'''
Load Regular expressions
'''
import re


'''
Load operator package, this will be used in dictionary sort
'''
import operator


'''
fix random state
'''
np.random.seed(42)


'''
Suppress warnings
'''
import warnings
warnings.filterwarnings("ignore")


'''
Load punctuation for data preprocesing
'''
from string import punctuation


'''
Word cloud implementation
'''
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from matplotlib import pyplot as plt

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


<a id="4"></a> <br>

# <div align="center">4. Load Data</div>
---------------------------------------------------------------------
[go to top](#top)

In [2]:
'''
Load data from csv file, which is in the same folder
'''
data = pd.read_csv('***.csv')


'''
Delete messages created by ***
'''
data = data[data.u_id != 1]


'''
Correct the Date column format
'''
data.date = data.date.str.slice(0, 10)
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')


'''
Choose only same part from data, in this example I Chose the messages created on last month
'''
data = data.loc[data.date >= '20190407']


'''
Delete messages containing no more than 3 characters
'''
data = data[data.text.str.len() > 3]


'''
Remaining conversations
'''
print('After delete unused messages the Remaining count of conversation is: {}'.format(data.c_id.nunique()))


After delete unused messages the Remaining count of conversation is: 24860


In [3]:
'''
Group messages to appropriate conversations which we will consider as documents
'''
conversation = data.groupby('c_id')['text'].apply(lambda x: "%s" % ', '.join(x))
documents = conversation.to_frame(name=None)

<a id="5"></a> <br>

# <div align="center">5. Used Functions for Data Preprocessing</div>
---------------------------------------------------------------------
[go to top](#top)

In [4]:
def preprocess_word(word):
    """ 
    Word preprocessing 
  
    This function will preprocess particular word 
  
    Parameters: 
    word: string
  
    Returns: 
    string: will return initial string input but preprocessed ,
            so from input string will delete all punctuation and repeated symbols.
    """
    
    
    # Remove punctuation
    word = ''.join(c for c in word if c not in punctuation)
    
    # Convert more than 2 letter repetitions to 2 letter
    # funnnnny --> funny
    word = re.sub(r'(.)\1+', r'\1\1', word)
    
    return word

# preprocess_word('aaa|sd''f,gh!jg&')

In [5]:
def is_valid_word(word):
    """ 
    Word checking
  
    This function will check if word starts with alphabet 
  
    Parameters: 
    word: string
  
    Returns: 
    Boolean: Is valid or not , True means that word is valid
    """
    
    
    # Check if word begins with an alphabet
    return (re.search(r'^[a-zA-Z][a-z0-9A-Z\._]*$', word) is not None)

# is_valid_word('1dgh')

In [6]:
def handle_emojis(document):
    """ 
    Emoji classifier
  
    This function will replace emojis with EMO_POS or EMO_NEG , depending on its meaning 
  
    Parameters: 
    document: string
  
    Returns: 
    string: initial string input replaced emojis by their meaning, 
            for example :) will replaced with EMO_POS but ): will replaced with EMO_NEG
    """
    
    
    # Smile -- :), : ), :-), (:, ( :, (-:, :')
    document = re.sub(r'(:\s?\)|:-\)|\(\s?:|\(-:|:\'\))', ' EMO_POS ', document)
    
    # Laugh -- :D, : D, :-D, xD, x-D, XD, X-D
    document = re.sub(r'(:\s?D|:-D|x-?D|X-?D)', ' EMO_POS ', document)
    
    # Love -- <3, :*
    document = re.sub(r'(<3|:\*)', ' EMO_POS ', document)
    
    # Wink -- ;-), ;), ;-D, ;D, (;,  (-;
    document = re.sub(r'(;-?\)|;-?D|\(-?;)', ' EMO_POS ', document)
    
    # Sad -- :-(, : (, :(, ):, )-:
    document = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', ' EMO_NEG ', document)
    
    # Cry -- :,(, :'(, :"(
    document = re.sub(r'(:,\(|:\'\(|:"\()', ' EMO_NEG ', document)
    
    return document

# handle_emojis('dsf):ghj')

In [7]:
def preprocess_document(document, use_stemmer = False):
    """ 
    Text preprocessing
  
    This function will preprocess the input text 
  
    Parameters: 
    document: string (we can put the entire string row , for instance in our case I will pass conversation)
    use_stemmer: Boolean (If True I will use stemmer as well as all other processes)
  
    Returns: 
    string: processed input string
    """
    
    
    def lemmatize_stemming(text):
        return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

    processed_document = []
    
    # Convert to lower case
    document = document.lower()
    
    # Replaces URLs with the word URL
    document = re.sub(r'((www\.[\S]+)|(https?://[\S]+))', ' URL ', document)
    
    # Replace @handle with the word USER_MENTION
    document = re.sub(r'@[\S]+', 'USER_MENTION', document)
    
    # Replaces #hashtag with hashtag
    document = re.sub(r'#(\S+)', r' \1 ', document)
    
    # Replace 2+ dots with space
    document = re.sub(r'\.{2,}', ' ', document)
    
    # Strip space, " and ' from document
    document = document.strip(' "\'')
    
    # Replace emojis with either EMO_POS or EMO_NEG
    document = handle_emojis(document)
    
    # Replace multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document)
    words = document.split()

    for word in words:
        word = preprocess_word(word)
        if is_valid_word(word):
            if use_stemmer:
                word = lemmatize_stemming(word)
            if word not in gensim.parsing.preprocessing.STOPWORDS and len(word) > 3:
                processed_document.append(word)
            
    processed_internal_state = ' '.join(processed_document)
    
    processed_internal_state = re.sub(r'\b\w{1,3}\b', '', processed_internal_state)
    
    processed_internal_state = ' '.join(processed_internal_state.split())

    return processed_internal_state

In [8]:
def preprocess(preprocessed_document):
    """ 
    tokenize and combine already preprocessed document
  
    This function will tokenize document and will combine document such a way ,
    that we can containing the number of times a word appears in the training set 
    using gensim.corpora.Dictionary
  
    Parameters: 
    preprocessed_document: string (particular document obtained from preprocess_document function)
  
    Returns: 
    list: tokenized documents in approprite form
    """
    
    
    result=[]
    
    for token in gensim.utils.simple_preprocess(preprocessed_document) :
        result.append(token)
            
    return result

<a id="6"></a> <br>

# <div align="center">6. Data Preprocessing</div>
---------------------------------------------------------------------
[go to top](#top)

In [9]:
'''
Create a list from 'documents' DataFrame and call it 'processed_docs'
'''

processed_docs = []

for doc in documents.values:
    processed_docs.append(preprocess(preprocess_document(doc[0])))

In [10]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the data set using gensim.corpora.Dictionary and call it 'dictionary'
'''

dictionary = gensim.corpora.Dictionary(processed_docs)

In [11]:
# '''
# Checking dictionary created
# '''

# count = 0
# for k, v in dictionary.iteritems():
#     print(k, v)
#     count += 1
#     if count > 10:
#         break

In [12]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing less than 15 times
- words appearing in more than 10% of all documents
'''

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)

In [13]:
'''
Create the Bag-of-words model for each document i.e for each document we create a dictionary reporting how many
words and how many times those words appear. Save this to 'bow_corpus'
'''

bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [14]:
# '''
# Preview BOW for our sample preprocessed document
# '''

# document_num = 50
# bow_doc_x = bow_corpus[document_num]

# for i in range(len(bow_doc_x)):
#     print("Word {} (\"{}\") appears {} time.".format(bow_doc_x[i][0], 
#                                                      dictionary[bow_doc_x[i][0]], 
#                                                      bow_doc_x[i][1]))

<a id="7"></a> <br>

# <div align="center">7. Model Deployment</div>
---------------------------------------------------------------------
[go to top](#top)

Online Latent Dirichlet Allocation (LDA) in Python, using all CPU cores to parallelize and speed up model training.  
  
The parallelization uses multiprocessing; in case this doesn’t work for you for some reason, try the gensim.models.ldamodel.LdaModel class which is an equivalent, but more straightforward and single-core implementation.  
  
  
The training algorithm:
1. is streamed: training documents may come in sequentially, no random access required,
2. runs in constant memory w.r.t. the number of documents: size of the training corpus does not affect memory footprint, can process corpora larger than RAM  
  
  
This module allows both LDA model estimation from a training corpus and inference of topic distribution on new, unseen documents. The model can also be updated with new documents for online training.  
  
  
class **gensim.models.ldamulticore.LdaMulticore**(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None, minimum_probability=0.01, minimum_phi_value=0.01, per_word_topics=False, dtype=<type 'numpy.float32'>)  
  
  
Bases: gensim.models.ldamodel.LdaModel  
  
An optimized implementation of the LDA algorithm, able to harness the power of multicore CPUs. Follows the similar API as the parent class LdaModel.    
  
  
**Parameters:**
  
1. corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents). If not given, the model is left untrained (presumably because you want to call update() manually).
2. num_topics (int, optional) – The number of requested latent topics to be extracted from the training corpus.
3. id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.
4. workers (int, optional) – Number of workers processes to be used for parallelization. If None all available cores (as estimated by workers=cpu_count()-1 will be used. Note however that for hyper-threaded CPUs, this estimation returns a too high number – set workers directly to the number of your real cores (not hyperthreads) minus one, for optimal performance.
5. chunksize (int, optional) – Number of documents to be used in each training chunk.
6. passes (int, optional) – Number of passes through the corpus during training.
7. alpha ({np.ndarray, str}, optional) – Can be set to an 1D array of length equal to the number of expected topics that expresses our a-priori belief for the each topics’ probability. Alternatively default prior selecting strategies can be employed by supplying a string: ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / topicno.
8. gamma_threshold (float, optional) – Minimum change in the value of the gamma parameters to continue iterating.
9. minimum_probability (float, optional) – Topics with a probability lower than this threshold will be filtered out.
10. per_word_topics (bool) – If True, the model also computes a list of topics, sorted in descending order of most likely topics for each word, along with their phi values multiplied by the feature length (i.e. word count).
11. minimum_phi_value (float, optional) – if per_word_topics is True, this represents a lower bound on the term probabilities.
  
  
**Methods and functions**
  
  
There are varous methods for lda model which we can find in the lda documentation:   https://radimrehurek.com/gensim/models/ldamulticore.html

In [15]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
number_of_topics = 7

lda_model =  gensim.models.LdaMulticore(bow_corpus, 
                                   num_topics = number_of_topics, 
                                   id2word = dictionary,                                    
                                   passes = 10,
                                   workers = 2)

In [1]:
'''
For each topic, we will explore the words occuring in that topic and its relative weight
'''

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

<a id="8"></a> <br>

# <div align="center">8. Save Model to Disk</div>
---------------------------------------------------------------------
[go to top](#top)

In [17]:
'''
Save model to disk.
'''

directory_to_save = 'C:\\_Files\\MyProjects\\***_TopicExtraction\\model\\model'
lda_model.save(directory_to_save)

<a id="9"></a> <br>

# <div align="center">9. Load Model From Disk</div>
---------------------------------------------------------------------
[go to top](#top)

In [18]:
'''
Load a potentially pretrained model from disk.
'''
lda_model = gensim.models.LdaMulticore.load(directory_to_save)

<a id="10"></a> <br>

# <div align="center">10. Detailed Information of Topics</div>
---------------------------------------------------------------------
[go to top](#top)

In [19]:
'''
Create (num_of_conv x num_of_topic) matrix with all 0 values and call it conversation_topic
'''

conversation_topic = np.zeros(shape=(len(bow_corpus), number_of_topics), dtype=float)
print(conversation_topic.shape)
print(conversation_topic)

(24860, 7)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [20]:
'''
Fill appropriate probability of conversation i to belong topic j to conversation_topic matrix
'''

for i in range(len(bow_corpus)):
    prob = lda_model.get_document_topics(bow_corpus[i], per_word_topics = False)
    for k in range(len(prob)):
        conversation_topic[i, prob[k][0]] = prob[k][1]

In [21]:
'''
Calculate summed probabilities of each topic and call it prob_dict
'''

prob_dict = dict()
for i in range(number_of_topics):
    prob_dict[i] = round(conversation_topic.sum(axis = 0)[i] / len(bow_corpus), 2)

In [22]:
'''
Sort prob_dict dictionary t find the most probable topic over all conversation dataset
'''

sorted_prob = sorted(prob_dict.items(), key=operator.itemgetter(1))
print(sorted_prob)

[(6, 0.08), (1, 0.09), (2, 0.09), (0, 0.14), (3, 0.15), (4, 0.21), (5, 0.23)]


In [2]:
'''
For each topic, we will explore the words occuring in that topic and its relative weight
'''

for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

<a id="11"></a> <br>

# <div align="center">11. Word Cloud</div>
---------------------------------------------------------------------
[go to top](#top)

In [24]:
'''
Combine all preprocessed conversations to one string and call it word_cloud_messenger
'''

word_cloud_messenger = []

for doc in processed_docs:
    s = " "
    word_cloud_messenger.append(s.join( doc ))

s = " "

word_cloud_messenger = s.join( word_cloud_messenger )

In [33]:
'''
Save generated word cloud to disk
'''

np.save('word_cloud_messenger.npy', word_cloud_messenger)

In [43]:
'''
Read generated word cloud from disk
'''

word_cloud_messenger = np.load('word_cloud_messenger.npy')
word_cloud_messenger = str(word_cloud_messenger)

In [4]:
'''
Generate Picture of words, so called word cloud
'''

# Create stopword list:
stopwords = set()
stopwords.update(["doritos", "doritosdoritos", "chirp", "chirpchirp", "mexico"])

# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, 
                      background_color="white",
                      width = 800, 
                      height = 800, 
                      min_font_size = 10).generate(word_cloud_messenger)

# Save the image in the img folder:
wordcloud.to_file("first_review.png")

# Display the generated image:
# the matplotlib way:
plt.figure(figsize=(18, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show();

In [3]:
'''
Generate Picture of words following a color pattern (with mask).
'''

# Create stopword list:
stopwords = set()
stopwords.update(["doritos", "doritosdoritos", "chirp", "chirpchirp", "mexico"])

# Generate a word cloud image
mask = np.array(Image.open("Icon.jpg"))
wordcloud_ddxk_learn = WordCloud(stopwords=stopwords, 
                                 background_color="white", 
                                 mode="RGBA", 
                                 max_words=1000,
#                                  width = 800, 
#                                  height = 800, 
#                                  min_font_size = 10,
                                 mask=mask).generate(word_cloud_messenger)

# create coloring from image
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=(18, 10))
plt.imshow(wordcloud_ddxk_learn.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")

# store to file
plt.savefig("second_review.png", format="png")

plt.show();

<a id="12"></a> <br>

# <div align="center">12. Message Analysis by Country</div>
---------------------------------------------------------------------
[go to top](#top)