# Topic Modeling (LDA) Using Pandas libraries

Topic modeling is a technique for determining the topics in a document. It can also be used to discover patterns of words in a collection of documents. By analyzing the frequency of words and phrases in the documents, it estimates the probability that a word or phrase belongs to a certain topic and clusters documents based on their similarity.

Latent Dirichlet Allocation (LDA) is an unsupervised technique that is commonly used for text analysis. Here, each topic is represented as a probability distribution over words, and each document is represented as a mixture of these topics.
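To make the two levels of LDA concrete, here is a minimal sketch in plain Python. The topic names, words and probabilities are made up for illustration and do not come from the dataset; `word_probability` and `dominant_topic` are hypothetical helper names.

```python
# Toy illustration of the LDA view of a document (all numbers are made up).
# Each topic is a probability distribution over the vocabulary.
topics = {
    "battery": {"charge": 0.5, "last": 0.3, "price": 0.2},
    "value": {"charge": 0.1, "last": 0.1, "price": 0.8},
}

# Each document is a mixture of topics.
doc_topic_mix = {"battery": 0.75, "value": 0.25}


def word_probability(word, doc_mix, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_mix.items())


def dominant_topic(doc_mix):
    """Return the topic with the largest share in the document."""
    return max(doc_mix, key=doc_mix.get)


print(word_probability("price", doc_topic_mix, topics))
print(dominant_topic(doc_topic_mix))
```

The notebook later keeps only the dominant topic per review, which corresponds to `dominant_topic` in this sketch.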

Since the dataset contains ~3M rows, and because pandas runs on a single node rather than as a distributed computation, this topic modeling takes ~**45 hours** to run. Part of that time comes from filtering the corpus by sentiment and category: there are 3 sentiments and 4 categories, so a new corpus is generated and an LDA model is trained for each of the 12 combinations. Filtering reduces the size of each corpus, but rebuilding the dataframe in the format described below requires nested for loops, which add to the run time. Fewer runs are therefore better. In addition, the more the corpus is filtered, the smaller it becomes, and a small corpus may not produce distinct topics.
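The per-combination split can be sketched as follows, using plain Python and a few invented rows. The literal sentiment and category values are taken from the calls further down in this notebook; `split_corpus` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import product

sentiments = ["Positive", "Negative", "Neutral"]
categories = ["Category1", "Category2", "Category3", "Category4"]

# One LDA run is needed for every (sentiment, category) pair.
combinations = list(product(sentiments, categories))
print(len(combinations))

# Invented rows standing in for the real reviews.
rows = [
    {"Index": 2, "Sentiment": "Positive", "Category": "Category1", "words": ["read", "book"]},
    {"Index": 7, "Sentiment": "Positive", "Category": "Category3", "words": ["film", "fun"]},
    {"Index": 9, "Sentiment": "Positive", "Category": "Category1", "words": ["good", "plot"]},
]


def split_corpus(rows):
    """Group the token lists by (sentiment, category) in a single pass."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Sentiment"], row["Category"])].append(row["words"])
    return groups


groups = split_corpus(rows)
for key in combinations:
    # Combinations without rows yield an empty corpus, which cannot be modeled.
    print(key, len(groups.get(key, [])))
```

Each sub-corpus is smaller than the full one, which is why heavy filtering can leave too little text for distinct topics.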
 
The output is saved in the table **pandasLDATopics** in the following format:

Index | ProductName | Category | ReviewContent | Sentiment | Topic | Topic Terms | Topic Weight
--- | --- | --- | --- | --- | --- | --- | ---
001 | name1 | Category1 | Text Description | Positive | 3 | \[term1, term2, term3\] | \[weight1, weight2, weight3\]
011 | name2 | Category2 | Text Description | Negative | 1 | \[term1, term2, term3\] | \[weight1, weight2, weight3\]

The following table is saved as **pandasLDATopicTerms**. 

Index | Topic Terms | Topic Weight
--- | --- | --- |
001 | term1 | weight1
001 | term2 | weight2

## Import packages

In [0]:
%pip install --upgrade numpy

Python interpreter will be restarted.


In [0]:
# spark packages
from pyspark.sql.functions import *
from pyspark.sql.types import *

from pyspark.ml import Pipeline
from pyspark.ml.feature import StopWordsRemover
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

import re
import pandas as pd
import numpy as np
import random

# Warnings
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

# gensim packages
from gensim.corpora import Dictionary
from gensim.models import LdaModel



## Load Data
Currently, this is being loaded from DBFS. This can be changed to read directly from ADLS as well, with the correct permissions, or from hive_metastore.

In [0]:
SQL = f'''select * from test_db.resultSentiment'''
df = spark.sql(SQL)

In [0]:
df.show(5)

+-----+-------------------+-------------+---------+-----+--------------------+---------+------+
|Index|               Date|  ProductName| Category|Price|             Content|Sentiment|Rating|
+-----+-------------------+-------------+---------+-----+--------------------+---------+------+
|    2|2022-06-26 18:29:13|ProductName-5|Category1| 78.8|I'm reading a lot...| Positive|   4.0|
|    7|2020-08-13 13:20:02|ProductName-4|Category3|81.03|I was a dissapoin...| Positive|   4.7|
|   11|2021-04-04 20:09:20|ProductName-4|Category1|45.74|"When you hear fo...| Positive|   4.4|
|   36|2020-05-23 15:51:47|ProductName-5|Category1|85.97|It seems somebody...| Positive|   4.8|
|   54|2022-10-30 14:06:38|ProductName-5|Category1| 30.5|Recieved this ite...| Positive|   4.8|
+-----+-------------------+-------------+---------+-----+--------------------+---------+------+
only showing top 5 rows



## Processing text columns

In [0]:
@udf("string")
def clean_sentence(sentence):

    '''function to clean up the sentence
    - remove punctuation, special characters and
    numbers, collapse additional spaces between words,
    and remove any words of length < 3'''

    # null rows would otherwise raise a TypeError inside re.sub
    if sentence is None:
        return None

    sentence = re.sub(r"[^a-z A-Z]", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    sentence = " ".join([ele for ele in sentence.split() if len(ele) >= 3])
    return sentence
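The cleaning rules can be checked without a Spark session. The sketch below is a plain-Python version of the same steps, assuming the intent is to keep only letters, collapse whitespace and drop words shorter than three characters; `clean_sentence_py` is a hypothetical name used only here.

```python
import re


def clean_sentence_py(sentence):
    """Plain-Python sketch of the cleaning rules (not the Spark UDF itself)."""
    if sentence is None:
        return None
    # Replace everything that is not a letter or a space with a space.
    sentence = re.sub(r"[^a-zA-Z ]", " ", sentence)
    # Collapse runs of whitespace into a single space.
    sentence = re.sub(r"\s+", " ", sentence)
    # Keep only words with at least three characters.
    return " ".join(word for word in sentence.split() if len(word) >= 3)


print(clean_sentence_py("It's a GREAT buy, 10/10!!"))
```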

In [0]:
# replace any run of 3 identical consecutive characters (e.g. "ooo") with a space

df = df.withColumn("Text", clean_sentence(regexp_replace(col("Content"), r"(\w)\1{2}", " ")))
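The effect of the back-reference pattern `(\w)\1{2}` is easy to verify with Python's `re` module, which treats this pattern the same way for these inputs. This is only an illustration; `strip_triples` is a hypothetical helper, not part of the pipeline.

```python
import re

# (\w) captures one word character; \1{2} requires two more copies of it,
# so the pattern matches exactly three identical characters in a row.
REPEAT = re.compile(r"(\w)\1{2}")


def strip_triples(text):
    """Replace every run of three identical word characters with a space."""
    return REPEAT.sub(" ", text)


print(repr(strip_triples("sooo good")))
print(repr(strip_triples("good")))
```

Note that only the repeated run is removed, so the rest of the word survives as a fragment (here a lone `s`), which the later minimum-length filter then drops.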

Define the following Spark NLP pipelines to tokenize, lemmatize and remove stopwords from the sentences before topic modeling can be done.

In [0]:
documentAssembler = DocumentAssembler() \
    .setInputCol("Text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NN>", "<NNS>", "<NNP>", "<JJ>", "<ADJ>"])

lemmatizer = LemmatizerModel.pretrained() \
     .setInputCols(["token"]) \
     .setOutputCol("lemmatized")

stopwordsCleaner = StopWordsCleaner() \
    .setStopWords(StopWordsRemover \
    .loadDefaultStopWords("english")) \
    .setInputCols(["lemmatized"]) \
    .setOutputCol("unigram")

pos_anc download started this may take some time.
Approximate size to download 3.9 MB
[ | ][OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[ | ][OK!]
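For intuition about what the tokenizer, lemmatizer and stop-word stages do to a sentence, here is a rough plain-Python stand-in. The lemma dictionary and stop-word set are tiny, invented substitutes for the pretrained models, and `preprocess` is a hypothetical helper; real Spark NLP output will differ.

```python
# Hypothetical, tiny stand-ins for the pretrained lemmatizer and stop-word list.
LEMMAS = {"batteries": "battery", "lasted": "last", "was": "be"}
STOPWORDS = {"the", "be", "a", "and"}


def preprocess(text):
    """Tokenize on whitespace, map tokens to lemmas, then drop stop words."""
    tokens = text.lower().split()
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in lemmas if tok not in STOPWORDS]


print(preprocess("The batteries lasted and the price was fair"))
```

Lemmatizing before stop-word removal matters here: `was` only becomes removable once it has been mapped to `be`.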


In [0]:
pipeline1 = Pipeline() \
            .setStages([
                documentAssembler,
                sentence,
                tokenizer,
                POSTag,
                chunker
            ])

In [0]:
pipelinedf = pipeline1.fit(df).transform(df)

In [0]:
pipelinedf = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", "Text", \
                               col("chunk.result").alias("chunked")) \
                .withColumn("chunked", clean_sentence(
                            regexp_replace(clean_sentence(concat_ws(", ", array_distinct(split( \
                                           regexp_replace(concat_ws(", ", col("chunked")), "[^A-Za-z0-9]", " "), " ")))), r"\s*[A-Z]\w*\s*", " ")))
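A note on the character classes used in cleanup steps like this one: in a regex, the range `A-z` is wider than `A-Za-z`, because the ASCII characters `[`, `\`, `]`, `^`, `_` and the backtick sit between `Z` and `a`. The sketch below shows the difference with Python's `re` module on an invented string.

```python
import re

text = "good_value [new] item^2"

# "A-z" also covers [ \ ] ^ _ and the backtick, so those characters are kept.
loose = re.sub(r"[^A-za-z0-9]", " ", text)

# "A-Za-z" covers letters only, so the same characters are replaced.
strict = re.sub(r"[^A-Za-z0-9]", " ", text)

print(repr(loose))
print(repr(strict))
```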

In [0]:
pipelinedf = pipelinedf.filter(col("chunked").isNotNull()) \
                        .select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", "Text", "chunked")

In [0]:
pipeline2 = Pipeline() \
            .setStages([
                documentAssembler,
                sentence,
                tokenizer,
                lemmatizer,
                stopwordsCleaner
            ])

In [0]:
pipelinedf = pipeline2.fit(pipelinedf).transform(pipelinedf)

In [0]:
pipelinedf = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", \
                               "chunked", col("unigram.result").alias("unigrams")) \
                .withColumn("unigrams", clean_sentence(regexp_replace( \
                                          regexp_replace(clean_sentence(concat_ws(", ", \
                                                  array_distinct(split(regexp_replace(concat_ws(", ", \
                                                        col("unigrams")), "[^A-Za-z0-9]", " "), " ")))), \
                                             r"\s*[A-Z]\w*\s*", " "), r"(\w)\1{2}", " "))) \
                .withColumn("words", split(col("unigrams"), " ")) 

In [0]:
pipelinedf = pipelinedf.filter(col("words").isNotNull())

In [0]:
data = pipelinedf.select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", "words")

## Converting to pandas dataframe

In [0]:
# convert pyspark dataframe to pandas dataframe

data_pd = data.select("Index", "Category", "Sentiment", "words")

In [0]:
data_pd = data_pd.toPandas()

In [0]:
data_pd.shape

Out[18]: (2999981, 4)

In [0]:
def getPandasLDA(df_pd, category, sentiment):

    '''function to get corresponding model, 
    corpus and the list of unique indices 
    (this is important to map back to the
    original data) for each separate category
    and sentiment.
    '''

    dftmp = df_pd[df_pd["Sentiment"] == sentiment]
    df = dftmp[dftmp["Category"] == category]

    index = [idx for idx in df["Index"]]

    # Create Dictionary
    id2word = Dictionary(df["words"].tolist())

    # Create Corpus
    texts = df["words"].tolist()

    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in texts]

    # View
    #print(corpus[:1])

    # Human readable format of corpus (term-frequency), for inspection only
    #print([[(id2word[token_id], freq) for token_id, freq in cp] for cp in corpus[:1]])

    # Build LDA model
    num_topics = 10
    lda_model = LdaModel(corpus=corpus,
                id2word=id2word,
                num_topics=num_topics,
                random_state=12,
                update_every=1,
                chunksize=100,
                passes=15,
                alpha = "auto",
                per_word_topics=True)
    
    return lda_model, corpus, index
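The dictionary and bag-of-words steps can be illustrated without gensim. The sketch below is a hand-written stand-in for what `Dictionary` and `doc2bow` produce, namely a token-to-id mapping and a list of `(token_id, count)` pairs per document; `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ from the ones shown here.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for tokens in texts:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())


texts = [["good", "battery", "good"], ["battery", "price"]]
vocab = build_vocab(texts)
corpus = [to_bow(tokens, vocab) for tokens in texts]

print(vocab)
print(corpus)
```

Word order is discarded at this point; LDA only sees how often each token occurs in each document.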

In [0]:
def getPandasTopics(df, category, sentiment):
    '''
    function to generate the pandas dataframe
    for each separate category and sentiment.
    the columns in the dataframe are - list
    of terms in a particular topic, the
    corresponding term weights and the dominant
    topic. there is also the index column,
    required to map back to the original dataframe
    '''
    
    model, corpus, index = getPandasLDA(df, category, sentiment)

    rows = []

    # Get main topic in each document
    for row in model[corpus]:
        # with per_word_topics=True the first element of each result holds
        # the (topic_id, probability) pairs for the document
        topic_num, perc_topic = max(row[0], key=lambda x: x[1])

        # Get the dominant topic and its keywords for each document
        wp = model.show_topic(topic_num)
        terms = [word for word, prop in wp]
        weights = [prop for word, prop in wp]
        rows.append([int(topic_num + 1), terms, weights])

    # build the dataframe once: DataFrame.append was removed in pandas 2.0
    # and copies the whole frame on every call
    topics_df = pd.DataFrame(rows, columns=["Topic", "TopicTerms", "TermWeights"])
    topics_df["Index"] = index

    return topics_df
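The dominant-topic selection can be sketched on its own with invented numbers. Here `doc_topics` stands in for the per-document `(topic_id, probability)` lists and `topic_terms` for the per-topic `(term, weight)` lists that the model returns; `dominant_topic_rows` is a hypothetical helper.

```python
def dominant_topic_rows(doc_topics, topic_terms, indices):
    """Build one plain row per document, keeping only its dominant topic."""
    rows = []
    for idx, dist in zip(indices, doc_topics):
        # The dominant topic is the pair with the highest probability.
        topic_id, _prob = max(dist, key=lambda pair: pair[1])
        terms = [term for term, _ in topic_terms[topic_id]]
        weights = [weight for _, weight in topic_terms[topic_id]]
        rows.append({"Index": idx, "Topic": topic_id + 1,
                     "TopicTerms": terms, "TermWeights": weights})
    return rows


# Invented stand-ins for the model output.
doc_topics = [[(0, 0.2), (1, 0.8)], [(0, 0.6), (1, 0.4)]]
topic_terms = {0: [("good", 0.03), ("buy", 0.02)], 1: [("slow", 0.04), ("late", 0.01)]}

rows = dominant_topic_rows(doc_topics, topic_terms, indices=[2, 7])
print(rows)
```

Collecting plain rows first and converting them in one step, for example with `pd.DataFrame(rows)`, avoids growing a dataframe inside the loop.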

In [0]:
# this section runs the topic model once per sentiment-category combination.
# there are 3 sentiments and 4 categories, which results in a total of 12 dataframes.
# the try and except block is added to catch any ValueError/IndexError etc., in case there is no related data.
# in case there is no data for the corresponding sentiment-category combination,
# an empty dataframe is used so as to not disrupt the pipeline

sentiments = ["Positive", "Negative", "Neutral"]
categories = ["Category1", "Category2", "Category3", "Category4"]

topic_frames = []
for sentiment in sentiments:
    for category in categories:
        try:
            topic_frames.append(getPandasTopics(data_pd, category = category, sentiment = sentiment))
        except Exception:
            print(f"no data for {sentiment.lower()} sentiment for {category}")
            topic_frames.append(pd.DataFrame())

In [0]:
topic_df = spark.createDataFrame(pd.concat(topic_frames))

In [0]:
# join the resultant output dataframe from the topic modeling with the original dataset to get the final output table

result_df = df.join(topic_df, on = ["Index"], how = "inner").drop("Text")
# result_df.display()

In [0]:
# expand the columns with arrays to get another view of the resultant output from the topic modeling

result_topic = result_df.withColumn("new", arrays_zip("TopicTerms", "TermWeights")) \
                        .withColumn("new", explode("new")) \
                                .select("Index", "Date", "ProductName", "Category", "Price", "Rating", "Content", "Sentiment", "Topic", \
                                        col("TopicTerms").alias("TopicTermsList"), col("TermWeights").alias("TopicWeightsList"), \
                                        col("new.TopicTerms").alias("TopicTerms"), col("new.TermWeights").alias("TermWeights"))
# result_topic.display()
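What `arrays_zip` followed by `explode` does to a single row can be pictured in plain Python: the two parallel lists are paired element by element, and each pair becomes its own row while the other columns are repeated. The record below is invented and `explode_terms` is a hypothetical helper.

```python
def explode_terms(record):
    """Return one output row per (term, weight) pair of a single input row."""
    base = {key: value for key, value in record.items()
            if key not in ("TopicTerms", "TermWeights")}
    return [
        {**base,
         "TopicTermsList": record["TopicTerms"],
         "TopicWeightsList": record["TermWeights"],
         "TopicTerms": term,
         "TermWeights": weight}
        for term, weight in zip(record["TopicTerms"], record["TermWeights"])
    ]


record = {"Index": 0, "Topic": 9,
          "TopicTerms": ["good", "one", "like"],
          "TermWeights": [0.024, 0.022, 0.021]}

exploded = explode_terms(record)
for row in exploded:
    print(row["Index"], row["Topic"], row["TopicTerms"], row["TermWeights"])
```

This is why each review appears once per topic term in the saved table, with the full lists repeated alongside the single term and weight.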

## Write data to DBFS

In [0]:
result_topic.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable("test_db.pandasLDATopics")

## Validation of data/tables

In [0]:
%sql
select * from test_db.pandasLDATopics

Index,Date,ProductName,Category,Price,Rating,Content,Sentiment,Topic,TopicTermsList,TopicWeightsList,TopicTerms,TermWeights
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",good,0.023918254
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",one,0.021876363
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",like,0.021571169
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",get,0.02007495
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",well,0.019242281
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",great,0.017435873
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",make,0.014986804
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",time,0.0147079835
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",buy,0.014079191
0,2022-12-22T08:56:31.000+0000,ProductName-3,Category3,58.88,4.4,"""Gave this to my dad for a gag gift after directing """"Nunsense",Positive,9.0,"List(good, one, like, get, well, great, make, time, buy, think)","List(0.023918254, 0.021876363, 0.021571169, 0.02007495, 0.019242281, 0.017435873, 0.014986804, 0.0147079835, 0.014079191, 0.011814969)",think,0.011814969
