#  2) WCD BIG DATA PROJECT - TWITTER - ALL TOPICS - CLUSTERING / TOPIC MODELLING

**WeCloudData Bootcamp 2022 (Part-time Cohort)**<br>
By: Kevin Jeswani & Junaid Zafar <br>
The set of notebooks is segmented for the purpose of clarity & convenience <br>
The following is the suggested order for running the scripts:
- '1_WCD_Twitter_Inflation_Classification' - Mounted S3 bucket for inflation tweets, copied over twitter data, tweet cleaning. VADER & a Spark-NLP pre-trained model are used to apply labels to the inflation tweets. The data is then transformed with spark-ml. Logistic regression & random forest models are built and trained with gridsearchCV on the label and transformed token features.
- '2_WCD_Twitter_AllTopics_Clustering'  **This Notebook** - All topics in the WCD twitter bucket are filtered, custom transformers are built and inserted into an extensive pipeline to process the raw data from Kinesis Firehose. Clustering with Latent Dirichlet Allocation is conducted using a custom gridsearch to perform topic modelling.<br>

**Appendices** - Please note these notebooks are included simply as supporting information and to show that other experiments and exercises were conducted. Less time and effort was spent on formatting these notebooks, whereas Notebook 1) and 2) are the main submission documents.
- 'AppA_WCD_Twitter_Inflation_Classification_MLPOnly' - Experimentation for classification with multi-layer perceptron models - originally at the end of Notebook 1)
- 'AppB_WCD_Twitter_Inflation_Clustering' - Inflation tweet data with Spark-NLP labels imported, custom transformer for data cleaning built and combined with standard nlp transformers in a pipeline. LDA clustering implemented to model topics in the inflation dataset. An attempt was made with a GMM clustering model.
- 'AppC_WCD_Twitter_AllTopics_52mil_Clustering' - ALL streamed tweets (55mil+) are loaded from the WCD bucket, a transformation pipeline is built and all the data is transformed. An LDA clustering model is built to cluster all the topics. 
- 'AppD_WCD_Twitter_AllTopics_Clustering_Evaluation' - An attempt was made to visualize the clustering using principal component analysis and t-SNE, but the data transformation required was too heavy to process and other issues occurred. <br> 

## 1.0 Setup

### 1.1 Installation & Imports

#### 1.1.1 Install NLTK Package - Update Configuration on First Run - Restart

In [0]:
# Write a shell script into the Databricks scripts folder that installs nltk on ALL NODES OF THE CLUSTER, NOT JUST THE DRIVER, and then downloads wordnet
# Then register the script in the cluster configuration: Advanced options - Init Scripts
dbutils.fs.put("dbfs:/databricks/scripts/nltk_wordnet.sh", """#!/bin/bash
pip install nltk
python -m nltk.downloader wordnet""",True)

Wrote 62 bytes.
Out[1]: True

In [0]:
!pip install nltk

You should consider upgrading via the '/local_disk0/.ephemeral_nfs/envs/pythonEnv-0774d75f-3e7b-4f8a-98c2-bd3147a3dd76/bin/python -m pip install --upgrade pip' command.


#### 1.1.2 PySpark & PySpark SQL

In [0]:
# Spark Imports
from pyspark.sql import SparkSession #create a spark session
from pyspark import SparkContext, SparkConf #for Spark NLP
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, FloatType, IntegerType #for schema & for data processing
import pyspark.sql.functions as F #spark sql functions with alias

In [0]:
# Initialize Spark Session
spark = SparkSession \
        .builder \
        .appName('Twitter Clustering') \
        .getOrCreate()
print('Session created')
# Follow instructions for spark config: https://nlp.johnsnowlabs.com/docs/en/install
#sparknlp_config = SparkConf().set('spark.kryoserializer.buffer.max', 2000).set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Standard Spark Context
sc = SparkContext.getOrCreate() #conf=sparknlp_config #reset the spark context with the updated settings

Session created


#### 1.1.3 Feature Transformation

In [0]:
# Feature Transformer Imports
# Pyspark ML Feature Transformer:
from pyspark.ml.feature import Tokenizer #tokenization - break strings into words
from pyspark.ml.feature import StopWordsRemover, CountVectorizer, IDF #remove stopwords (words integral to sentence structure, but no value w.r.t sentiment); Get term-frequency (count vectorization)
                                                                      #Inverse-Document Frequency
from pyspark.ml.feature import NGram, HashingTF #NGram; hashing term-frequency
from pyspark.ml.feature import VectorAssembler #feature assembly: final stage to combine 1gram_idf and 2gram_idf

#Pyspark ML Pipeline
from pyspark.ml import Pipeline, PipelineModel #standard pipeline

#Custom Transformer Building
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol, Param, Params, TypeConverters
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import DataFrame
import re #regex
# NLTK
import nltk
nltk.download('wordnet',download_dir='dbfs:/mnt/nltk_data')
from nltk.stem import WordNetLemmatizer #lemmatizer - associate different variations of the same word

[nltk_data] Downloading package wordnet to dbfs:/mnt/nltk_data...


#### 1.1.4 Latent Dirichlet Allocation & Evaluation

In [0]:
# PySpark ML Imports
from pyspark.ml.clustering import LDA, LocalLDAModel #Latent Dirichlet allocation (LDA)
from pyspark.ml.evaluation import ClusteringEvaluator #Clustering silhouette measure evaluation
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.feature import PCA #principal component analysis for cluster visualization

#Pandas for smaller df & plotting in 3.4
import pandas as pd
import pyspark.pandas as pspd
import numpy as np

# Viz
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# skLearn - Validation of topic model with dimension reduction (t-SNE model)
from sklearn.manifold import TSNE



## 1.2 Load Data

The bucket was already mounted in Notebook 1); just access the Databricks /mnt/ path.

In [0]:
# Generic file path for ALL tweets
filePath = '/mnt/data/*/*/*/*/*/*'

In [0]:
%fs ls /mnt/data

path,name,size,modificationTime
dbfs:/mnt/data/AI/,AI/,0,1673607277034
dbfs:/mnt/data/BankofCanada/,BankofCanada/,0,1673607277034
dbfs:/mnt/data/BlackFriday/,BlackFriday/,0,1673607277034
dbfs:/mnt/data/CERB/,CERB/,0,1673607277034
dbfs:/mnt/data/CSIS/,CSIS/,0,1673607277034
dbfs:/mnt/data/CanadaHousing/,CanadaHousing/,0,1673607277034
dbfs:/mnt/data/ElonMusk/,ElonMusk/,0,1673607277034
dbfs:/mnt/data/Flames/,Flames/,0,1673607277035
dbfs:/mnt/data/Inflation/,Inflation/,0,1673607277035
dbfs:/mnt/data/Interest_rate/,Interest_rate/,0,1673607277035


In [0]:
# Define the Schema resulting from the Kinesis Firehose output csvs
schema = StructType([
    StructField('id', StringType(), True),
    StructField('name', StringType(), True),
    StructField('username', StringType(), True),
    StructField('tweet', StringType(), True),
    StructField('followers_count', StringType(), True),
    StructField('location', StringType(), True),
    StructField('geo', StringType(), True),
    StructField('created_at', StringType(), True)
])

An attempt was made to put all processing in the transformation pipeline in section 2; however, it is not easy to create a custom transformer that can drop rows in the df without causing issues with the pipeline. In one iteration of processing the entire scraped dataset on the WCD/twitter bucket, a topic model could be built, but at a very large computational and monetary cost. However, the exercise taught a valuable lesson about optimizing scripts before running them on full datasets. After the model was built, it was hard to evaluate it with the dimensionality reduction methods t-SNE or PCA, as the dense matrix required for the exercise was too large to process. Thus manual cleaning will be done below before pushing the data through the transformation pipeline and fitting the clustering model.

In [0]:
# Create function to open scraped folder, read to df, process, compile
def read_tweetfolder(dir,downsample,downsample_ratio):
    '''
    Look through subfolders & files in the given path, load the twitter csv files from Kinesis Firehose; downsample specified folders
    dir = path in the Databricks file system
    downsample = str list containing twitter topic folders to downsample
    downsample_ratio = decimal ratio to downscale the collected tweets in specified downsample folders
    '''
    subfolders = dbutils.fs.ls(dir) #can also check .isDir() to ensure that the element is a directory
    
    for i,p in enumerate(subfolders):
        path = p.path #get the path only
        # Read from path with input prefix
        df_in = (spark.read.schema(schema).option('delimiter','\t').csv(path+'/*/*/*/*')) #assume all twitter output files are organized in the same way
        df_in.cache()
        # Filter out retweets - not much value added and need to downsample the entire dataset
        df_in = df_in.filter(~df_in['tweet'].rlike(r"^(RT )@([a-zA-Z0-9_]{1,50}):"))
        # Keep track of initial #hashtag bin - semi-unsupervised, but will permit better understanding of the algorithm
        df_in = df_in.withColumn('bin',F.lit(path))

        #Downsample specified datasets 
        if bool([ds for ds in downsample if(ds in path)]): #Bool(if any element in downsample in path)
            df_in = df_in.sample(False,downsample_ratio,seed=123)

        # Compile main df
        if i == 0: #initialize main df for first folder
            df_all = df_in
        else:
            df_all = df_all.union(df_in) #concatenate vertically
    
    #Preview rows that contain no nulls (the result is not assigned, so df_all itself is unchanged)
    df_all.dropna().show(truncate=False)
    
    #Return the compiled downsampled df
    return df_all
 
# Topics to downsample
downsample = ['WorldCup','twitter'] #tags to be loaded and downsampled
# Call the function on entire bin, specify downsampling of 2 ~5gb bins
tweets = read_tweetfolder(dir='/mnt/data/',downsample=downsample,downsample_ratio=0.05)

+-------------------+-------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------+------------------------------+----+------------------------------+------------------+
|id                 |name         |username   |tweet                                                                                                                                       |followers_count|location                      |geo |created_at                    |bin               |
+-------------------+-------------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------+---------------+------------------------------+----+------------------------------+------------------+
|1601172099045158912|YUNUS HANBAL |HanbalYunus|@CryptoEmdarks There are many innovations and surprises in the future project #D

In [0]:
# Still 3.6million but with a better distribution for topic modelling
tweets.count()

Out[7]: 3594430

In [0]:
display(tweets)

id,name,username,tweet,followers_count,location,geo,created_at,bin
1601172099045158912,YUNUS HANBAL,HanbalYunus,@CryptoEmdarks There are many innovations and surprises in the future project #DXGM metaverse universe. You can par… https://t.co/R4PzLNEGFk,21.0,,,Fri Dec 09 11:09:11 +0000 2022,dbfs:/mnt/data/AI/
1601172122730041344,ASLI HANBAL,HanbalAsli,"@Crypto__Diva #GPLEX with blockchain technology in the gaming world, with the unique #metaverse world waiting to be… https://t.co/scFwwFVAvz",152.0,,,Fri Dec 09 11:09:17 +0000 2022,dbfs:/mnt/data/AI/
1601172161372491778,YUNUS HANBAL,HanbalYunus,@CryptoThro There are many innovations and surprises in the future project #DXGM metaverse universe. You can partic… https://t.co/HvRw4EbJZU,21.0,,,Fri Dec 09 11:09:26 +0000 2022,dbfs:/mnt/data/AI/
1601172171602419712,ASLI HANBAL,HanbalAsli,"@belufrancese #GPLEX with blockchain technology in the gaming world, with the unique #metaverse world waiting to be… https://t.co/FltRhZlt3J",152.0,,,Fri Dec 09 11:09:28 +0000 2022,dbfs:/mnt/data/AI/
1601172214056767489,YUNUS HANBAL,HanbalYunus,@cryptojack There are many innovations and surprises in the future project #DXGM metaverse universe. You can partic… https://t.co/n0FmdR4I81,21.0,,,Fri Dec 09 11:09:38 +0000 2022,dbfs:/mnt/data/AI/
1601172226631311362,ASLI HANBAL,HanbalAsli,"@CryptoThro #GPLEX with blockchain technology in the gaming world, with the unique #metaverse world waiting to be d… https://t.co/D0TacL3rxu",152.0,,,Fri Dec 09 11:09:41 +0000 2022,dbfs:/mnt/data/AI/
1601172266460454914,ASLI HANBAL,HanbalAsli,"@cryptoworld202 #GPLEX with blockchain technology in the gaming world, with the unique #metaverse world waiting to… https://t.co/eflMFNfp2K",152.0,,,Fri Dec 09 11:09:51 +0000 2022,dbfs:/mnt/data/AI/
1601172313017581568,ASLI HANBAL,HanbalAsli,"@pascualprincipe #GPLEX with blockchain technology in the gaming world, with the unique #metaverse world waiting to… https://t.co/gWtYa8bIf6",152.0,,,Fri Dec 09 11:10:02 +0000 2022,dbfs:/mnt/data/AI/
1601172334810783744,YUNUS HANBAL,HanbalYunus,@riccardogems There are many innovations and surprises in the future project #DXGM metaverse universe. You can part… https://t.co/MGNxYHe1cK,21.0,,,Fri Dec 09 11:10:07 +0000 2022,dbfs:/mnt/data/AI/
1601172340188254208,Space ☆ Bruce,spacebruce,"The VF-0 Phoenix variable fighter was an prototype for the VF-1 Valkyrie, it served in 2008 as a front-line fighter… https://t.co/25Omth4G4a",952.0,68000 HEART ON FIRE /🔞 please,,Fri Dec 09 11:10:08 +0000 2022,dbfs:/mnt/data/AI/


Here the dataset of all the topics is reduced to 3.6mil tweets by removing retweets and downsampling the 'WorldCup' and 'twitter' tweets. Refer to Notebook 'AppC_WCD_Twitter_AllTopics_52mil_Clustering' in this series, where a topic model of the original 55 million tweets was built with results that were too difficult to discern.
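As a small illustration of the retweet filter, the sketch below applies the same pattern with plain `re` on a few made-up tweets. The sample strings and the names `RETWEET_PATTERN` and `originals` are hypothetical; Spark's `rlike` looks for the pattern anywhere in the string, which `search` mimics here, with the `^` anchor restricting it to the start of the tweet.

```python
import re

# Same pattern as the Spark rlike filter: "RT @<handle>:" at the start of the tweet
RETWEET_PATTERN = re.compile(r"^(RT )@([a-zA-Z0-9_]{1,50}):")

# Made-up sample tweets for illustration only
sample_tweets = [
    "RT @BankofCanada: Rate announcement today",      # retweet, removed
    "Inflation is cooling, says RT @someone: maybe",  # 'RT' not at the start, kept
    "@friend did you see the match?",                 # reply, kept
]

# Keep only tweets that do not start with the retweet prefix
originals = [t for t in sample_tweets if not RETWEET_PATTERN.search(t)]
print(originals)
```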

## 2.0 Transformation Pipeline

### 2.1 Custom Transformers

1) Use regular expression replacement (`re.sub` applied in a UDF) to remove urls (can use this for another analysis later on)  <br>
2) Remove "RT @___:" for retweets  <br>
3) Remove mentions "@___ " & replies "@___: "  <br>
4) Replace "& amp;" (without the space) with "&" only  <br>
5) Replace '&' directly between alphanumeric characters with '-' (e.g. 's&p500' becomes 's-p500'), then remove all special characters except the dash  <br>
6) Substitute multiple spaces with single space <br>
7) Lowercase all text (can examine this later on for RTs and also for emotions with all capital words)  <br>
8) Trim leading/trailing whitespaces

In [0]:
# Define regex replacement list for text processing
regex_list = [(r"http\S+", " "), #remove urls
              (r"^(RT @)([a-zA-Z0-9_]{1,50}):", " "), #remove retweets
              (r"@([a-zA-Z0-9_]{1,50}) ", " "), #remove mentions
              (r"@([a-zA-Z0-9_]{1,50}): ", " "), #remove replies
              (r"amp;", ""), #change &amp; to & only
              (r"(?<=[a-zA-Z0-9])&(?=[a-zA-Z0-9])|\s+-|-$", "-"), # change all '&' directly adjacent to words into '-'
              (r'[^a-zA-Z0-9\s-]+', " "), # remove all special characters except dash
              (r" rt ", " "), #remove other traces of retweets not at the beginning of the tweet
              (r"\s+", " ") #replace multi space with single 
             ]
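
As a rough local check of the rules above, the sketch below applies the same rule list in order with plain `re.sub`, outside Spark, followed by the lowercasing and trimming of steps 7) and 8). The helper name `clean_tweet` and the sample tweet are made up for illustration.

```python
import re

# Same rules, in the same order, as regex_list above
regex_list = [
    (r"http\S+", " "),                                           # remove urls
    (r"^(RT @)([a-zA-Z0-9_]{1,50}):", " "),                      # remove retweet prefix
    (r"@([a-zA-Z0-9_]{1,50}) ", " "),                            # remove mentions
    (r"@([a-zA-Z0-9_]{1,50}): ", " "),                           # remove replies
    (r"amp;", ""),                                               # change &amp; to & only
    (r"(?<=[a-zA-Z0-9])&(?=[a-zA-Z0-9])|\s+-|-$", "-"),          # '&' between words becomes '-'
    (r"[^a-zA-Z0-9\s-]+", " "),                                  # remove special characters except dash
    (r" rt ", " "),                                              # remove other traces of retweets
    (r"\s+", " "),                                               # collapse multiple spaces
]

def clean_tweet(text, rules):
    """Hypothetical helper: apply each (pattern, replacement) pair in order, then lowercase and trim."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text.lower().strip()

# Made-up sample tweet
sample = "RT @user_1: Check the S&amp;P500 today!! https://t.co/abc @friend wow"
cleaned = clean_tweet(sample, regex_list)
print(cleaned)  # check the s-p500 today wow
```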

In [0]:
# Develop a custom transformer class to perform a series of regex replace functions on an input df
# Source with modifications: https://csyhuang.github.io/2020/08/01/custom-transformer/ & https://www.crowdstrike.com/blog/deep-dive-into-custom-spark-transformers-for-machine-learning-pipelines/
class RegexTransformer(Transformer, #Main class
                        HasInputCol, #Setup for inputCol parameter
                        HasOutputCol, #Setup for outputCol parameter
                        DefaultParamsReadable, 
                        DefaultParamsWritable):
    '''
    Custom Transformer for Spark >3.0; wrapper for a regex replace function
    Input: inputCol = column name to be transformed, outputCol = column_name resulting from transformations, regex_rules = tuple list for regex substitute procedure rules, f_trim / f_lower = flags to trim and lowercase the output
    Output: transformed df when RegexTransformer(inputCol,outputCol,regex_rules).transform(df) is called or when put into a pyspark ml pipeline
    '''
    inputCol = Param(Params._dummy(), "inputCol", "input column name.", typeConverter=TypeConverters.toString) #inputCol parameter
    outputCol = Param(Params._dummy(), "outputCol", "output column name.", typeConverter=TypeConverters.toString) #outputCol parameter
    regex_rules = Param(Params._dummy(), "regex_rules","list of tuples for regex replace procedures; first el is regex str, 2nd el is str to be replaced with") 
    f_trim = Param(Params._dummy(), "f_trim","flag to apply .trim() to input text")
    f_lower = Param(Params._dummy(), "f_lower","flag to apply .lower() to input text")
    
    @keyword_only
    # Initialize
    def __init__(self, inputCol: str = "input", outputCol: str = "output", regex_rules=None,f_trim=True,f_lower=True):
        super(RegexTransformer, self).__init__()
        self._setDefault(inputCol='text', outputCol='text_clean', regex_rules=None,f_trim=True,f_lower=True) #default settings; regex_rules defaults to None and must always be supplied
        kwargs = self._input_kwargs
        self.set_params(**kwargs)

    @keyword_only
    def set_params(self, inputCol: str = "input", outputCol: str = "output", regex_rules=None,f_trim=True,f_lower=True): #initialize the parameters
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def get_inputCol(self):
        return self.getOrDefault(self.inputCol)

    def get_outputCol(self):
        return self.getOrDefault(self.outputCol)       
      
    def get_regex_rules(self):
        return self.getOrDefault(self.regex_rules) 
    
    def get_f_trim(self):
        return self.getOrDefault(self.f_trim)
    
    def get_f_lower(self):
        return self.getOrDefault(self.f_lower)
                      
    # Main Transformer Function
    def _transform(self, df: DataFrame): 
        # Get the input parameters
        inputCol = self.get_inputCol()
        outputCol = self.get_outputCol()
        regex_rules = self.get_regex_rules() #custom parameter input, list of tuples
        f_trim = self.get_f_trim()
        f_lower = self.get_f_lower()
        #for every tuple in the regex_rules, use the regex pattern in the 1st el. and replace with 2nd el.
        #udf applying regex substitute to each row
        for i,(regex,repl) in enumerate(regex_rules):
            transform_udf = F.udf(lambda x: re.sub(str(regex),str(repl),str(x)), StringType()) 
            if i==0:
                target_col = inputCol
            else:
                target_col = outputCol
            df = df.withColumn(outputCol, transform_udf(target_col))     #apply udf to the inputCol
        # Trim & Lower-case if flags are true
        if f_lower==True:
            df = df.withColumn(outputCol, F.lower(outputCol))  
        if f_trim==True:
            df = df.withColumn(outputCol, F.trim(outputCol))  
        return df

In [0]:
# Develop a custom transformer class to lemmatize tokens with the NLTK WordNetLemmatizer and drop short tokens
class Lemmatizer(Transformer, #Main class
                HasInputCol, #Setup for inputCol parameter
                HasOutputCol, #Setup for outputCol parameter
                DefaultParamsReadable,
                DefaultParamsWritable):

    # Initialize
    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(Lemmatizer, self).__init__()
        nltk.download('wordnet')
        self.lemmatizer = WordNetLemmatizer() #initialize the lemmatizer from NLTK
        self._setDefault(inputCol="text", outputCol="lemmas")
        kwargs = self._input_kwargs
        self.set_params(**kwargs)
    
    @keyword_only
    def set_params(self, inputCol: str = "input", outputCol: str = "output"): #initialize the parameters
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def get_inputCol(self):
        return self.getOrDefault(self.inputCol)

    def get_outputCol(self):
        return self.getOrDefault(self.outputCol)       
    
    # Main transformer function
    def _transform(self, df: DataFrame):
        # Get the input parameters
        inputCol = self.get_inputCol()
        outputCol = self.get_outputCol()
        # Lemmatizer udf calling lemmatizer function of NLTK
        lemmatize = F.udf(lambda tokens: [self.lemmatizer.lemmatize(token) for token in tokens], ArrayType(StringType()))
        # Additional udf to filter out lemmatized tokens with 2 or fewer characters
        filter_tokens = F.udf(lambda tokens: [token for token in tokens if len(token) > 2], ArrayType(StringType()))
        # Apply to df
        df = df.withColumn(outputCol, filter_tokens(lemmatize(inputCol)))
        # Return the transformed df
        return df

In [0]:
# Feature Transformers
# See appendix for custom transformer class - must be run before this!
regex_replacer = RegexTransformer(inputCol='tweet',outputCol='tweet_clean',regex_rules=regex_list,f_trim=True,f_lower=True) #pass regex_list, apply lowercase and trim
tokenizer = Tokenizer(inputCol="tweet_clean", outputCol="tokens") #Create tokenizer - break down text into "building blocks" to be used in NLP algos - separate by words
lemmatizer = Lemmatizer(inputCol="tokens", outputCol="tokens_lemmatized") #Create lemmatizer - lump words with syntactical meaning
stopword_remover = StopWordsRemover(inputCol="tokens_lemmatized", outputCol="tokens_noStopWords") # Remove stopwords from the token list = no significant meaning added to sentence

countvec = CountVectorizer(vocabSize=2**16, inputCol="tokens_noStopWords", outputCol='countvec') #term-frequency: 2^16 is an arbitrary large number = how many unique words should be considered; Create count vectorizer function
idf = IDF(inputCol='countvec', outputCol="1gram_idf", minDocFreq=5) #Inverse-Document Frequency - minDocFreq: ignore sparse terms appearing in fewer than 5 tweets
ngram = NGram(n=2, inputCol="tokens_noStopWords", outputCol="2gram") #n-gram two word combinations with more meaning than single words
ngram_countvec = CountVectorizer(vocabSize=2**12, inputCol="2gram", outputCol='2gram_countvec') #term frequency for 2-grams with count vectorizer
#ngram_hashingtf = HashingTF(inputCol="2gram", outputCol="2gram_tf", numFeatures=20000) #Hashing Term-Frequency - NEED COUNT VECTORIZER FOR LDA as hashingtf skips vocabulary
ngram_idf = IDF(inputCol='2gram_countvec', outputCol="2gram_idf", minDocFreq=5) #Inverse-Document Frequency for 2-gram

#Feature Assembly
assembler = VectorAssembler(inputCols=["1gram_idf", "2gram_idf"], outputCol="features") #Vector Assembler (1gram + 2gram inverse doc freq.)

# Feature selection from 1-gram and 2-gram token frequencies - requires a 'label' column
# Normally use ch2selector but can cause issues in the latter half of the pipeline - use all the features available from the 1 & 2 gram

# Transformation Pipeline: staged process for feature processing
t_pipeline = Pipeline(stages=[regex_replacer,tokenizer, lemmatizer, stopword_remover,countvec, idf, ngram, ngram_countvec, ngram_idf, assembler]) # #

# Warning: chi2selector is deprecated - should update to univariate selection:
    # https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.UnivariateFeatureSelector.html
    # Additional information: https://medium.com/insiderengineering/a-deep-dive-into-sparks-univariate-feature-selector-3f8b726d7d32

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
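
For reference, Spark ML's `IDF` weights a term as log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term; terms that appear in fewer than `minDocFreq` documents receive a weight of 0. The pure-Python sketch below reproduces that calculation on a toy corpus. The function name `idf_weights` and the toy documents are hypothetical and only meant to show the arithmetic.

```python
import math
from collections import Counter

def idf_weights(docs, min_doc_freq=0):
    """Hypothetical helper: IDF per term as log((m + 1) / (df + 1)); 0 below min_doc_freq."""
    m = len(docs)
    # Document frequency: number of documents that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {
        term: (math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
        for term, df in doc_freq.items()
    }

# Toy corpus of tokenized "tweets"
docs = [["rate", "hike"], ["rate", "cut"], ["rate", "hike", "bank"]]
weights = idf_weights(docs, min_doc_freq=2)
print(weights)
```

A term present in every document ("rate") gets weight 0, and rare terms below the threshold ("cut", "bank") are zeroed out as well.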


### 2.2 Fit the Model & Transform Data

This section does not need to be re-run

In [0]:
# Fit the transformation pipeline on full input dataset
t_pipeline_model = t_pipeline.fit(tweets) 

In [0]:
# Write the pipeline model
t_pipeline_model.write().overwrite().save("~/mlmodels/models/Twitter_tPipeline_AllTopics_3")

In [0]:
# Transform full input dataset
tweets_transformed = t_pipeline_model.transform(tweets)

In [0]:
# Save to avoid re-running
tweets_transformed.write.mode("overwrite").parquet('/mnt/my_bucket/twitterAllTopics_transformed3.parquet')

## 3.0 Clustering on All Topics: Latent Dirichlet Allocation (LDA)

**Background Info**:
NLP text clustering using LDA = topic modelling. <br>

**Terms**
- Word = basic unit of text data index from a vocabulary set with V elements: {1,...,V}
- Document = sequence of N words, w = (w_1,w_2,…w_N) ,w_n is the nth word in document
- Corpus = sequence of M documents, D = (w_1, w_2,…w_M)
- Topic = distribution of words, with LDA treating each document in the corpus as mixture of topics

**How LDA works: 3-level hierarchical Bayesian model**
- Estimates two sets of distributions from the corpus: 
    - topic distribution of each document (θ_d), which follows a Dirichlet distribution
    - word distribution of each topic, from which each word is drawn according to its topic assignment (Z_dn)
- Latent variables (θ & Z) are defined in the text generation process
- The probability distributions P(θ_d), P(Z_dn|θ_d), and P(w_dn|Z_dn) are defined and the joint distribution, P(w,z,θ), is calculated <br>
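
For reference, in the standard formulation of LDA the joint distribution for a single document d with N words factorizes as shown below. Here θ_d is the topic mixture of the document, z_dn the topic assigned to the n-th word, w_dn the observed word, and α and β the Dirichlet prior and topic-word parameters; this notation follows the usual textbook presentation and is not taken from the notebook's code.

$$
p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta) = p(\theta_d \mid \alpha) \prod_{n=1}^{N} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)
$$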

**Advantages:**
- Can find latent topics in the document; which can be interpretable
- Unsupervised = no label required
- Easy to train, can retain word representation <br>

**Disadvantages:**
- Document considered as bag of words; word order and semantic information (multiple meanings) ignored
- Number of topics is fixed in advance; topics are assumed to be uncorrelated, so the topic distribution cannot capture correlations between topics
- Static (no topic evolution over time)<br>

**Source & More Info:**
<br>
https://towardsdatascience.com/nlp-with-lda-latent-dirichlet-allocation-and-text-clustering-to-improve-classification-97688c23d98 <Br>
https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda <Br>
https://towardsdatascience.com/the-ultimate-guide-to-clustering-algorithms-and-topic-modeling-3a65129df324 <Br>
https://algotech.netlify.app/blog/topic-modeling-lda/

From the preliminary results (clustering on inflation only, and clustering on the entire 55mil tweet dataset) there are still too many words repeating in different topics - research hints at what can be done:
- Increase the number of topics for the overall corpus, which can result in more distinct topics (grid search for this done in this section)
- Increase iterations (default is 20, while 50 is used in this section)
- Adjust alpha & beta: should affect sparsity of the document-topic and topic-word distributions (done in this section)
  - alpha: controls the mixture of topics per document; lower alpha = documents are less likely to contain a mixture of topics; higher alpha = documents will have more of a mixture of topics
  - beta: controls the distribution of words per topic; decrease beta = topics will have fewer words; increase beta = topics will have more words
  - Ideally want each document (or part of a document) to belong to only a few topics
- More pre-processing: add lemmatization (added in Section 2)
- Try other algorithms (not supported by pyspark directly): LSA (Latent Semantic Analysis) or NMF (Non-negative Matrix Factorization) <br>
Handling imbalanced datasets:
- Sampling = initialization method;  Gibbs sampling method = less prone to overfitting = ideally more balanced topics.
- Use coherence instead of perplexity as eval metric - harder to calculate without advanced LDA & sklearn packages; pyspark does not have a distributed version of these tools
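
To make the effect of the concentration parameters discussed above more concrete, the sketch below draws topic-word distributions from a symmetric Dirichlet prior with NumPy and compares how much weight the single heaviest word receives for a low and a high beta. The vocabulary size, the beta values and the helper `mean_top_word_weight` are arbitrary choices for illustration, not values used by the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(123)
vocab_size = 10  # toy vocabulary size (assumption for illustration)

def mean_top_word_weight(beta, n_topics=1000):
    """Hypothetical helper: average weight of the heaviest word over sampled topics."""
    # Each row is one topic: a probability distribution over the toy vocabulary
    topics = rng.dirichlet([beta] * vocab_size, size=n_topics)
    return float(topics.max(axis=1).mean())

sparse_topics = mean_top_word_weight(0.1)   # low beta: a few dominant words per topic
flat_topics = mean_top_word_weight(10.0)    # high beta: weight spread over many words
print(sparse_topics, flat_topics)
```

The same reasoning applies to alpha, with documents in place of topics and topics in place of words.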

Load Transformed Data & Pipeline Model to skip re-running 2.2)

In [0]:
# Read the parquet file instead of re-running; no need to specify schema
tweets_transformed_in = spark.read.parquet('/mnt/my_bucket/twitterAllTopics_transformed3.parquet')
tweets_transformed_in.cache()

Out[11]: DataFrame[id: string, name: string, username: string, tweet: string, followers_count: string, location: string, geo: string, created_at: string, bin: string, tweet_clean: string, tokens: array<string>, tokens_lemmatized: array<string>, tokens_noStopWords: array<string>, countvec: vector, 1gram_idf: vector, 2gram: array<string>, 2gram_countvec: vector, 2gram_idf: vector, features: vector]

In [0]:
# Read the pipeline model
t_pipeline_model = PipelineModel.load("dbfs:/~/mlmodels/models/Twitter_tPipeline_AllTopics_3")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 3.1 Assemble Model

There is no "cross validation" with unsupervised learning (although the origin bucket labels are attached to the df); the intent is to simulate a real clustering exercise without labels. <br>
To determine the best model parameters and to understand the model, a custom grid search will be run.

In [0]:
num_topics =5  #clusters/topics to find - need to input this into the LDA model first then let param grid iterate on it
max_iter = 50 #allow for more iterations to have more distinct clusters
lda = LDA(k=num_topics, 
          maxIter=max_iter, 
          featuresCol='features',
          optimizeDocConcentration=True, #default; will optimize docConcentration = alpha parameter; prior placed on documents' distributions over topics 
          optimizer='online', #online variational Bayes optimizer (default)
          seed=123 #fixed random seed for reproducibility
          ) 
k_grid = [5,7,10,13] #number of topics to iterate through
beta_grid = [0.01,0.1,0.3] #topicConcentration = beta parameter; low beta will place more weight on having each topic composed of only few dominant words
# Ideally examine these as well:
#  learningDecay,[0.51, 0.7, 0.9] #= learning_rate
modelpath="dbfs:/~/mlmodels/models/Twitter_LDA_AllTopics" #models to be saved here
predpath = "/mnt/my_bucket/twitterAllTopics_lda_pred" #predictions to be saved here

In [0]:
# Run different number of topics & topicConcentration as a custom parameter grid and fix the other parameters to either default or a tweaked value
def lda_gridearch(df_in,lda_estimator,k_grid,beta_grid,modelpath,predpath):
    '''
    input:
        df_in = dataframe with tokenized and vectorized features with column labelled 'features'
        lda_estimator = pyspark ml LDA() model
    output: will save output df and model for each parameter combination
    '''
    for k_ in k_grid: #iterate through input #topics (k)
        for beta in beta_grid:
            # Set k and topicConcentration (beta) in the estimator
            lda_estimator.setK(k_).setTopicConcentration(beta) #update the number of topics and beta
            # Fit the model & Save
            lda_model = lda_estimator.fit(df_in)
            lda_model.write().overwrite().save(modelpath+'_k'+str(k_)+'_beta'+str(beta))
            # Predict & Save
            df_pred = lda_model.transform(df_in) 
            df_pred.write.mode('overwrite').parquet(predpath+'_k'+str(k_)+'_beta'+str(beta)+".parquet")    
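
As a quick sanity check on what the grid search writes to storage, the sketch below enumerates the parameter combinations and the resulting path suffixes without fitting anything. The helper `grid_suffixes` is hypothetical; the suffix format mirrors the string concatenation used in the function above.

```python
from itertools import product

k_grid = [5, 7, 10, 13]
beta_grid = [0.01, 0.1, 0.3]
modelpath = "dbfs:/~/mlmodels/models/Twitter_LDA_AllTopics"

def grid_suffixes(k_grid, beta_grid):
    """Hypothetical helper: one suffix per (k, beta) pair, e.g. '_k5_beta0.01'."""
    return ["_k" + str(k) + "_beta" + str(beta) for k, beta in product(k_grid, beta_grid)]

suffixes = grid_suffixes(k_grid, beta_grid)
model_paths = [modelpath + s for s in suffixes]
print(len(model_paths))  # 4 values of k x 3 values of beta = 12 models
print(model_paths[0])
```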

Do not need to run this again if just examining the clusters

In [0]:
# Call the LDA grid search function
lda_gridearch(df_in=tweets_transformed_in,lda_estimator=lda,k_grid=k_grid,beta_grid=beta_grid,modelpath=modelpath,predpath=predpath)

### 3.2 Compile Results

Create a series of functions to load and interpret the LDA gridsearch results

#### 3.2.1 Topic Assignment Based on Distribution

In [0]:
# Function to assign topic to each tweet from its 'topicDistribution' column prediction output:
def assign_topic(prediction):
    """
    Takes clustering topic distribution structure and assigns prevalent topic to the text row
    Input: df of prediction from clustering model
    Output: same df with additional column for prevalent topic
    """
    # Extract values of the topic distribution array, sort ascending, reverse, and take the first index = topic with highest probability assignment
    udf_assign_topic = F.udf(lambda x: int(x.values.argsort()[::-1][0]), IntegerType()) 
    return prediction.select('*', udf_assign_topic('topicDistribution').alias("topic")) #UDF to apply to df
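
The indexing inside the UDF can be checked locally with NumPy. The sketch below uses a made-up topic distribution and a hypothetical helper `dominant_topic`; it shows that reversing the ascending `argsort` and taking the first element returns the index of the largest probability.

```python
import numpy as np

def dominant_topic(topic_distribution):
    """Hypothetical helper: index of the topic with the highest probability."""
    probs = np.asarray(topic_distribution, dtype=float)
    # argsort is ascending, so reverse it and take the first index
    return int(probs.argsort()[::-1][0])

# Made-up distribution over 5 topics
dist = [0.05, 0.10, 0.70, 0.05, 0.10]
topic = dominant_topic(dist)
print(topic)  # 2
```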

#### 3.2.2 Link Topic Distributions to Vocabulary

Collect the vocabulary

In [0]:
# Connect the topics from the LDA model with the vocabulary from the count vectorization model 
vocab_1gram = t_pipeline_model.stages[4].vocabulary #access a specific stage of the fitted pipeline and read the fitted model's vocabulary
vocab_2gram = t_pipeline_model.stages[7].vocabulary #repeat for 2gram vocab
vocab = vocab_1gram + vocab_2gram #full vocabulary for Token frequency

Function & SQL UDF to link term indices to the vocabulary words for the topics in the clustering model

In [0]:
# Convert term indices to words in the vocabulary
def tokens_vocab(token_list,vocab):
    '''
    Given a token_list (int list of term indices, e.g. 'termIndices' from describeTopics) and vocabulary (str list) from the count vectorized model
    Will link the indices to the vocabulary to produce the words associated with topics of a clustering model
    '''
    return [vocab[token_id] for token_id in token_list]
# Create SQL udf to attach the function
udf_tokens_vocab = F.udf(lambda row: tokens_vocab(row,vocab), ArrayType(StringType()))
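
The lookup can be tried on a toy vocabulary without Spark. The sketch below assumes, as the concatenation `vocab_1gram + vocab_2gram` implies, that 2-gram indices are offset by the length of the 1-gram vocabulary in the assembled feature vector. The toy vocabularies and indices are invented for illustration.

```python
def tokens_vocab(token_list, vocab):
    """Map a list of integer term indices to the corresponding vocabulary entries."""
    return [vocab[token_id] for token_id in token_list]

# Invented toy vocabularies standing in for the fitted count vectorizer vocabularies
toy_vocab_1gram = ['world', 'cup', 'inflation']
toy_vocab_2gram = ['world cup', 'black friday']
toy_vocab = toy_vocab_1gram + toy_vocab_2gram  # 2-gram indices start at len(toy_vocab_1gram)

# Term indices such as those returned in the 'termIndices' column of describeTopics
toy_term_indices = [0, 1, 3]
print(tokens_vocab(toy_term_indices, toy_vocab))
```

If the feature assembler placed the 2-gram block first, the concatenation order would have to be swapped for the indices to line up.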

#### 3.2.3 Aggregate Function

Create function to load all the saved models and predictions & combine results

In [0]:
def process_lda_gridsearch(vocab,k_grid,beta_grid,num_top_words,modelpath,predpath,savepath):
    '''
    input:
    vocab = vocabulary (str list) from the count vectorizer stages
    k_grid, beta_grid = parameter values of the saved models to load
    num_top_words = number of words per topic
    modelpath, predpath = path prefixes used when the models and predictions were saved
    savepath = folder for the combined outputs
    output: combined predictions df lda_pred_all and combined topic summary topics_all
    '''
    # Initialize evaluation metrics (not necessarily the deciding factors in how good a model is but can offer some value)
    #The evaluation metrics do not work for all models and slow down the process significantly
    #llikelihoods ={} #log likelihood = higher is better
    #lperplexities = {} #log perplexity = lower is better
    silhouettes = {} #silhouette clustering evaluation [-1,1]
    for i,k_ in enumerate(k_grid): #iterate through input #topics (k) - reduced list
        for j,beta in enumerate(beta_grid):
            # Create a tag for the grid parameters
            tag = '_k'+str(k_)+'_beta'+str(beta)
            coltag = tag.replace('.','_') #keep column names without decimals, replace
            lda_model_in = LocalLDAModel.load(modelpath+tag)

            # Load the predictions, assign topics to tweets, and combine
            if (i==0) and (j==0): #if first model being processed
                lda_pred_in = spark.read.parquet(predpath+tag+".parquet").select('id','tweet_clean','bin','features','topicDistribution') 
                #select columns for final df from the first loaded model
                # Assign topics using topicDistribution to each tweet
                lda_pred_in = assign_topic(lda_pred_in)

                # Evaluation Metrics
                #llikelihoods[coltag] = lda_model_in.logLikelihood(lda_pred_in)
                #lperplexities[coltag] = lda_model_in.logPerplexity(lda_pred_in)
                #silhouettes[coltag] = ClusteringEvaluator(predictionCol='topic',featuresCol='features',distanceMeasure='cosine').evaluate(lda_pred_in) 
                                   #cosine rule - commonly used for NLP over squared euclidean

                #Adjust df formatting /column
                lda_pred_in = lda_pred_in.select('id','tweet_clean','bin','features',F.col('topic').alias('topic'+coltag))  

                # Assign corresponding word to each topic
                topics = lda_model_in.describeTopics(num_top_words).withColumn('topicWords',udf_tokens_vocab(F.col('termIndices'))).select('topic',F.col('topicWords').alias('topicWords'+coltag)).toPandas()
                # Get the count per topic
                postpertopic = lda_pred_in.groupby('topic'+coltag).count().toPandas().sort_values(by='topic'+coltag).rename(columns={'count':'count'+coltag})
                # Merge topic count and topic words
                topics = pd.merge(topics,postpertopic,left_on='topic',right_on='topic'+coltag,how='left').drop(columns=['topic'+coltag])

                # Store df as the main df that all model results will be joined back to
                lda_pred_all = lda_pred_in
                # Store the topic summary
                topics_all = topics

            else: #for every model after the first, join its results onto the stored common dfs
                lda_pred_in = spark.read.parquet(predpath+tag+".parquet").select('id','features','topicDistribution') #only take id, topic distribution, features
                # Assign topics using topicDistribution to each tweet
                lda_pred_in = assign_topic(lda_pred_in)

                # Evaluation Metrics - 
                #llikelihoods[coltag] = lda_model_in.logLikelihood(lda_pred_in)
                #lperplexities[coltag] = lda_model_in.logPerplexity(lda_pred_in)
               # silhouettes[coltag] = ClusteringEvaluator(predictionCol='topic',featuresCol='features',distanceMeasure='cosine').evaluate(lda_pred_in) 
                                   #cosine rule - commonly used for NLP over squared euclidean

                #Adjust df formatting /column
                lda_pred_in = lda_pred_in.select(F.col('id').alias('id2'),F.col('topic').alias('topic'+coltag)) #remove topic distribution, rename id to id2 (will assist with joining)

                # Assign corresponding word to each topic
                topics = lda_model_in.describeTopics(num_top_words).withColumn('topicWords',udf_tokens_vocab(F.col('termIndices'))).select('topic',F.col('topicWords').alias('topicWords'+coltag)).toPandas()
                # Get the count per topic
                postpertopic = lda_pred_in.groupby('topic'+coltag).count().toPandas().sort_values(by='topic'+coltag).rename(columns={'count':'count'+coltag})
                # Merge topic count and topic words
                topics = pd.merge(topics,postpertopic,left_on='topic',right_on='topic'+coltag,how='left').drop(columns=['topic'+coltag])

                # Join current predictions and topic df back to main
                lda_pred_all = lda_pred_all.join(lda_pred_in,lda_pred_all.id == lda_pred_in.id2).drop('id2')
                topics_all = pd.merge(topics_all,topics,on='topic',how='outer')

    #silhouette_df = pd.DataFrame.from_dict(silhouettes,orient='index')
    lda_pred_all.write.mode('overwrite').parquet(savepath+'/LDA_GridSearch')
    lda_pred_all.drop('features').write.mode("overwrite").option('header','false').option("delimiter","|").csv(savepath+'/LDA_GridSearch.csv') #write pipe-delimited csv without headers & without the vector column, for athena
    topics_all.to_csv(savepath+'/LDA_GridSearch_Topics.csv',mode='w',sep= '|') #write the topic summary as pipe-delimited csv (pandas keeps the header and index by default)
    #silhouette_df.to_csv(savepath+'/LDA_GridSearch_Silhouettes.csv',mode='w',sep= '|')
    return topics_all, lda_pred_all #, silhouette_df - omitted for now, causes issues

#### 3.2.4 Run Batch Collector

In [0]:
savepath="/dbfs/mnt/my_bucket/"

In [None]:
# Call the function to compile the model results - run only selected k & beta values, otherwise the cluster is overloaded and exits with code 143 (lack of memory and issues with map reducing)
df_topics, df_pred = process_lda_gridsearch(vocab=vocab,k_grid=[7,10,13],beta_grid=[0.01,0.1],num_top_words=10,modelpath=modelpath,predpath=predpath,savepath=savepath) #silhouettes would normally be unpacked too, but results could not be obtained for all the models
                                                                                                                                                                       #will omit for now

### 3.3 Examine Key Results

In [0]:
#Load the topics summary
topics_all_in = pd.read_csv(savepath+'/LDA_GridSearch_Topics.csv',sep='|')

In [0]:
display(topics_all_in)

Unnamed: 0,topic,topicWords_k7_beta0_01,count_k7_beta0_01,topicWords_k7_beta0_1,count_k7_beta0_1,topicWords_k10_beta0_01,count_k10_beta0_01,topicWords_k10_beta0_1,count_k10_beta0_1,topicWords_k13_beta0_01,count_k13_beta0_01,topicWords_k13_beta0_1,count_k13_beta0_1
0,0,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],251314.0,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],251202.0,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],176628.0,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],176688.0,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],139013,['might' 'moment' 'cancer' 'task' 'discovering' 'aptitude' 'aptitude task'  'might moment' 'moment discovering' 'discovering aptitude'],139137
1,1,['thanksgiving' 'inflation' 'family' 'happy' 'happy thanksgiving' 'day'  'hope' 'food' 'great' 'year'],658415.0,['thanksgiving' 'inflation' 'family' 'happy' 'happy thanksgiving' 'day'  'hope' 'food' 'great' 'year'],658575.0,['thanksgiving' 'happy' 'family' 'happy thanksgiving' 'hope' 'day'  'inflation' 'good' 'great' 'dinner'],542711.0,['thanksgiving' 'happy' 'family' 'happy thanksgiving' 'hope' 'day'  'inflation' 'good' 'great' 'dinner'],542810.0,['thanksgiving' 'happy' 'hope' 'happy thanksgiving' 'family' 'day' 'good'  'great' 'everyone' 'dinner'],379381,['thanksgiving' 'happy' 'hope' 'happy thanksgiving' 'family' 'day' 'good'  'great' 'everyone' 'dinner'],379492
2,2,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'thanksgiving' 'deal' 'friday sale' 'friday deal'],722036.0,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'thanksgiving' 'deal' 'friday sale' 'friday deal'],722004.0,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'deal' 'thanksgiving' 'friday sale' 'friday deal'],588330.0,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'deal' 'thanksgiving' 'friday sale' 'friday deal'],588362.0,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'deal' 'friday sale' 'thanksgiving' 'friday deal'],530439,['black' 'friday' 'black friday' 'happy' 'happy thanksgiving' 'sale'  'deal' 'friday sale' 'thanksgiving' 'friday deal'],530465
3,3,['thanksgiving' 'ukraine' 'first' 'well' 'get' 'got' 'like' 'people'  'much' 'inflation'],661351.0,['thanksgiving' 'ukraine' 'first' 'well' 'get' 'got' 'like' 'people'  'much' 'inflation'],661396.0,['ukraine' 'well' 'thanksgiving' 'inflation' 'get' 'yes' 'people' 'got'  'money' 'russia'],485419.0,['ukraine' 'well' 'thanksgiving' 'inflation' 'get' 'yes' 'people' 'got'  'money' 'russia'],485641.0,['well' 'ukraine' 'money' 'yes' 'giving' 'get' 'inflation' 'pretty'  'country' 'world'],312825,['well' 'ukraine' 'money' 'yes' 'giving' 'get' 'inflation' 'pretty'  'country' 'world'],312695
4,4,['world' 'cup' 'world cup' 'live' '2022' 'war' 'iran' 'fifa' 'link'  'fifa world'],381357.0,['world' 'cup' 'world cup' 'live' '2022' 'war' 'iran' 'fifa' 'link'  'fifa world'],381335.0,['world' 'cup' 'world cup' 'live' 'war' 'fifa' '2022' 'fifa world'  'stream' 'iran'],300857.0,['world' 'cup' 'world cup' 'live' 'war' 'fifa' '2022' 'fifa world'  'stream' 'iran'],300879.0,['war' 'ukraine' 'update' 'iran' 'latest' 'war ukraine' 'latest update'  'update war' 'talk' 'world'],170384,['war' 'ukraine' 'update' 'iran' 'latest' 'war ukraine' 'latest update'  'update war' 'talk' 'world'],170413
5,5,['thanksgiving' 'know' 'believe' 'probably' 'bring' 'cancer' 'thankful'  'happy' 'happy thanksgiving' 'pursuit'],408460.0,['thanksgiving' 'know' 'believe' 'probably' 'bring' 'cancer' 'thankful'  'happy' 'happy thanksgiving' 'pursuit'],408447.0,['know' 'believe' 'probably' 'bring' 'cancer' 'pursuit' 'believe know'  'bring cancer' 'probably believe' 'know pursuit'],275278.0,['know' 'believe' 'probably' 'bring' 'cancer' 'pursuit' 'believe know'  'bring cancer' 'probably believe' 'know pursuit'],275196.0,['know' 'believe' 'probably' 'bring' 'cancer' 'pursuit' 'believe know'  'bring cancer' 'probably believe' 'know pursuit'],180273,['know' 'believe' 'probably' 'bring' 'cancer' 'pursuit' 'believe know'  'bring cancer' 'probably believe' 'know pursuit'],180170
6,6,['elon' 'musk' 'elon musk' 'twitter' 'inflation' 'call' 'make' 'trump'  'thanksgiving' 'hear'],511497.0,['elon' 'musk' 'elon musk' 'twitter' 'inflation' 'call' 'make' 'trump'  'thanksgiving' 'hear'],511471.0,['inflation' 'leftover' 'fed' 'thanksgiving leftover' 'trump' 'make'  'need' 'little' 'market' 'hear'],355321.0,['inflation' 'leftover' 'fed' 'thanksgiving leftover' 'trump' 'make'  'need' 'little' 'market' 'hear'],355182.0,['inflation' 'fed' 'market' 'twitter' 'make' 'elon' 'musk' 'trump' 'rate'  'elon musk'],256810,['inflation' 'fed' 'market' 'twitter' 'make' 'elon' 'musk' 'trump' 'rate'  'elon musk'],256659
7,7,,,,,['elon' 'musk' 'elon musk' 'twitter' 'world' 'thanksgiving' 'feel' 'love'  'nft' 'messi'],229748.0,['elon' 'musk' 'elon musk' 'twitter' 'world' 'thanksgiving' 'feel' 'love'  'nft' 'messi'],229746.0,['world' 'cup' 'world cup' 'elon' 'musk' 'elon musk' 'messi' 'twitter'  'goal' 'feel'],230762,['world' 'cup' 'world cup' 'elon' 'musk' 'elon musk' 'messi' 'twitter'  'goal' 'feel'],231354
8,8,,,,,['thanksgiving' 'year' 'one' 'inflation' 'american' 'look' 'guy' 'time'  'like' 'thanksgiving weekend'],342690.0,['thanksgiving' 'year' 'one' 'inflation' 'look' 'american' 'guy' 'time'  'like' 'thanksgiving weekend'],342573.0,['one' 'year' 'thanksgiving' 'guy' 'look' 'ago' 'time' 'luck' 'american'  'loved'],246494,['one' 'year' 'thanksgiving' 'guy' 'look' 'ago' 'time' 'luck' 'american'  'loved'],245947
9,9,,,,,['thank' 'god' 'thanksgiving' 'black' 'record' 'friday' 'black friday'  'beautiful' 'via' 'holiday'],297448.0,['thank' 'god' 'thanksgiving' 'black' 'record' 'friday' 'black friday'  'beautiful' 'via' 'holiday'],297353.0,['thank' 'god' 'thanksgiving' 'beautiful' 'via' 'wait' 'youtube' 'turkey'  'black' 'friday'],187416,['thank' 'god' 'thanksgiving' 'beautiful' 'via' 'wait' 'youtube' 'turkey'  'black' 'friday'],187261
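
The empty cells in the k=7 columns for topics 7 to 9 are a consequence of the outer merge on `topic` in `process_lda_gridsearch`: a model with fewer topics has no row for the higher topic ids, so pandas fills those cells with NaN. The same mechanism likely explains why some count columns display as floats (for example `251314.0`), since an integer column that receives NaN is converted to float. A minimal sketch with invented counts, using hypothetical 2-topic and 3-topic summaries:

```python
import pandas as pd

# Invented topic summaries for a 2-topic and a 3-topic model
topics_k2 = pd.DataFrame({'topic': [0, 1], 'count_k2': [10, 20]})
topics_k3 = pd.DataFrame({'topic': [0, 1, 2], 'count_k3': [5, 15, 25]})

# Same merge style as in process_lda_gridsearch: keep every topic id from both sides
merged = pd.merge(topics_k2, topics_k3, on='topic', how='outer')
print(merged)
print(merged.dtypes)
```

Topic 2 has no counterpart in the smaller model, so `count_k2` is NaN on that row and the column becomes float, while `count_k3` stays integer.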


In [0]:
# Load the tweets with topic assignments for key models
predictions_in =spark.read.parquet(savepath+'/LDA_GridSearch')
predictions_in.cache()

Out[47]: DataFrame[id: string, tweet_clean: string, bin: string, features: vector, topic_k7_beta0_01: int, topic_k7_beta0_1: int, topic_k10_beta0_01: int, topic_k10_beta0_1: int, topic_k13_beta0_01: int, topic_k13_beta0_1: int]

In [0]:
display(predictions_in)

id,tweet_clean,bin,features,topic_k7_beta0_01,topic_k7_beta0_1,topic_k10_beta0_01,topic_k10_beta0_1,topic_k13_beta0_01,topic_k13_beta0_1
1594729900203970562,world cup 2022 qater watch senegal vs netherlands 4k live stream link,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 11, 19, 58, 67, 71, 512, 744, 4441, 65538, 65545, 65546, 66852, 66951, 67714, 67728, 68104), values -> List(2.6587759527068164, 2.7735414824702795, 3.8661656228920176, 3.6175772487270277, 4.618735681599739, 4.570617190324005, 4.6094445170926255, 6.340104497527907, 6.717885100411563, 8.775928147481503, 2.7854054852492904, 4.810338757418368, 4.847039166610232, 8.43048724087753, 8.495025762015102, 8.810762100157136, 8.814500422267743, 8.952858855640581))",4,4,4,4,10,10
1594730049399361538,fifaworldcupqatar2022 qatar world cup 2022 senegal vs netherlands live stream,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 11, 19, 67, 89, 512, 737, 744, 65538, 65545, 65546, 65702, 66852, 68104), values -> List(2.6587759527068164, 2.7735414824702795, 3.8661656228920176, 3.6175772487270277, 4.570617190324005, 4.7666966832211894, 6.340104497527907, 6.527770701063492, 6.717885100411563, 2.7854054852492904, 4.810338757418368, 4.847039166610232, 6.929817002422871, 8.43048724087753, 8.952858855640581))",4,4,4,4,10,10
1594730538371276800,watch live stream world cup 2022 senegal vs netherlands live link hd s,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 11, 19, 58, 67, 71, 512, 744, 65538, 65545, 65546, 65567, 65604, 65642, 66852, 68104), values -> List(2.6587759527068164, 2.7735414824702795, 7.732331245784035, 3.6175772487270277, 4.618735681599739, 4.570617190324005, 4.6094445170926255, 6.340104497527907, 6.717885100411563, 2.7854054852492904, 4.810338757418368, 4.847039166610232, 5.551948253446982, 6.090719420531281, 6.610019671288244, 8.43048724087753, 8.952858855640581))",4,4,4,4,10,10
1594730712519053312,watch senned 2022 fifaworldcup engirn senned senegal vs netherlands fifa world cu,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 19, 47, 71, 201, 512, 744, 2935, 3375, 65542, 66852, 68684), values -> List(2.6587759527068164, 3.6175772487270277, 4.353036275979086, 4.6094445170926255, 5.362374981447631, 6.340104497527907, 6.717885100411563, 16.400452443588915, 8.38559192096964, 4.54674298995477, 8.43048724087753, 9.093481383266788))",4,4,4,4,10,10
1594730828474757121,new episode of the shogun soccer sit down is now up on all streaming platforms we celebrate the start of the 202,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(32, 132, 161, 276, 543, 636, 934, 1163, 5273, 35494, 67452), values -> List(4.187326903869934, 5.0787498905528725, 5.184085588312029, 5.712621425344288, 6.239232830527815, 6.417627121965066, 6.8141851855650915, 7.049949211610216, 9.00358637915024, 12.099163987673947, 8.71816931332931))",1,1,1,1,1,1
1594731643117477894,it is a personal decision proud of the iranian team they could all get jailed for t,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(8, 73, 247, 683, 998, 1316, 8546), values -> List(3.3941560215168245, 4.642738317952469, 5.559722551169182, 6.463838813662654, 6.887221836872656, 7.198343559584898, 9.74778873051047))",6,6,6,6,6,6
1594731665762537473,face it folks it s the world cup it s my one major holiday and i ll be insufferable for an entire month,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 15, 57, 126, 405, 480, 710, 847, 10272, 65538, 67793), values -> List(2.6587759527068164, 2.7735414824702795, 3.6091672909461816, 4.464174160387725, 5.079286107554862, 5.97207648678125, 6.095030037465302, 6.491341904163657, 6.718115223528446, 10.007299925995554, 2.7854054852492904, 8.841067449652465))",3,3,8,8,3,3
1594731705847656450,fifa world cup inu- bsc stealth launch today next moonshot launching today las,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 23, 29, 47, 1611, 2987, 6615, 7785, 19038, 30068, 65538, 65542), values -> List(2.6587759527068164, 2.7735414824702795, 7.603514976268642, 4.165618040395565, 4.353036275979086, 7.438559094797755, 8.218631649337173, 9.342323622402306, 9.57744336476323, 10.984022397054627, 11.762691751052735, 2.7854054852492904, 4.54674298995477))",1,1,5,5,10,10
1594731915604787200,female tv reporter is robbed live on air at qatar world cup,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(4, 5, 11, 89, 345, 2592, 3038, 4664, 65538, 65702), values -> List(2.6587759527068164, 2.7735414824702795, 3.8661656228920176, 4.7666966832211894, 5.847163770034693, 8.036138108709274, 8.267267026725087, 8.868359591940473, 2.7854054852492904, 6.929817002422871))",4,4,4,4,10,10
1594732434213388288,finally been waiting for them to upload this,dbfs:/mnt/data/WorldCup/,"Map(vectorType -> sparse, length -> 69632, indices -> List(322, 537, 7693), values -> List(5.752738464631569, 6.203935048446791, 9.565467173716515))",3,3,3,3,1,1


The clustering exercise was evaluated through a subjective review of the top words per topic. Altering the beta value between 0.01 and 0.3 did not appear to change the words showing up in each topic much. However, these clusters look somewhat more distinct than those from the run on all 55 million tweets, where the World Cup bucket dominated all topics. <br>
**Subjective Best Parameters** k=10 topics, beta (topicConcentration)=0.01

**Note: Evaluation Metric** The PySpark LDA model should allow two evaluation metrics to be produced: the log perplexity and the log likelihood. However, optimizing for perplexity/likelihood may not yield human-interpretable topics. A similar problem occurred with the silhouette clustering evaluation metric for at least one of the models. Trying to collect these values for the gridsearch models caused memory issues and the values could not be extracted, so the comparison of these values among all models was omitted for this exercise. Instead, the best (subjective) model is evaluated in the following cells.

The silhouette score (squared Euclidean distance is Spark's default; cosine distance is used in the cells below) measures consistency within clusters and ranges over [-1,1] <br>
A value close to 1 means that the points in a cluster are close to other points in the same cluster and far from points of other clusters <br>
The values observed were close to 0, meaning there is little distinction between the clusters, although the score could also be substantially worse (negative) <br>
More info: https://towardsdatascience.com/silhouette-coefficient-validating-clustering-techniques-e976bb81d10c
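
To make the score concrete, here is a small pure-Python sketch of the silhouette coefficient with cosine distance (1 minus cosine similarity). For each point, `a` is the mean distance to the other points in its own cluster, `b` is the lowest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. This is an illustrative brute-force version on invented 2-D points, assuming at least two clusters and no zero vectors; it is not how Spark's `ClusteringEvaluator` is implemented.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def silhouette_score(points, labels):
    """Mean silhouette over all points (brute force, assumes >= 2 clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster label
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(cosine_distance(p, q))
        if label not in dists:  # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(dists[label]) / len(dists[label])
        b = min(sum(d) / len(d) for l, d in dists.items() if l != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Invented points: two pointing roughly along x, two roughly along y
points = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
well_separated = silhouette_score(points, [0, 0, 1, 1])
badly_assigned = silhouette_score(points, [0, 1, 0, 1])
print(well_separated, badly_assigned)
```

With labels that match the geometry the score approaches 1, while labels that split each natural group across clusters give a negative score.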

In [0]:
# Load the selected model - can be combined with the summary df containing all the assigned topics & base tweet info
lda_model_selected = LocalLDAModel.load("dbfs:/~/mlmodels/models/Twitter_LDA_AllTopics_k10_beta0.01/")

The silhouette value was determined with the models developed in Appendix B and C but failed with this gridsearch model set. The silhouette values in those appendices were roughly 0, and it is unlikely that this analysis would do much better (i.e. it won't be close to 1).

In [0]:
# Silhouette Clustering Evaluation - not working on this data (see appendix B & C for typical results)
silhouette = ClusteringEvaluator(predictionCol='topic_k10_beta0_01',featuresCol='features',distanceMeasure='cosine').evaluate(predictions_in) #cosine rule - commonly used for NLP over squared euclidean