This notebook explores latent Dirichlet allocation (LDA) as a way to uncover hidden structure in a collection of texts (Wikipedia articles). The material and code are adapted from two online articles, https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0 and https://yanlinc.medium.com/how-to-build-a-lda-topic-model-using-from-text-601cdcbfd3a6, and from SIADS 543 (Unsupervised Learning).

In [1]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

np.set_printoptions(precision = 3)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

## The choice of text processing has a significant impact on final classification performance

- The way text is represented has a major impact on classification performance.
- Text representation is a function of the many different parameter settings for Vectorizer objects in scikit-learn.
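
To make the second point concrete, here is a minimal pure-Python sketch of document-frequency filtering, the idea behind the `min_df` and `max_df` settings used later in this notebook. The helper `build_vocabulary` and the toy corpus are hypothetical and only mimic the filtering rule; this is not scikit-learn's implementation.

```python
from collections import Counter

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within the given bounds.

    min_df is an absolute document count and max_df a fraction of all
    documents (an assumption matching how the two settings are used here).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(
        term for term, freq in doc_freq.items()
        if freq >= min_df and freq / n_docs <= max_df
    )

# Hypothetical toy corpus, not taken from the notebook's data
toy_docs = [
    "the city is in france",
    "the player plays football",
    "the football club is in the city",
    "the king of france",
]

full_vocab = build_vocabulary(toy_docs)
filtered_vocab = build_vocabulary(toy_docs, min_df=2, max_df=0.95)
print(len(full_vocab), full_vocab)
print(len(filtered_vocab), filtered_vocab)
```

Raising `min_df` drops rare terms and lowering `max_df` drops terms that appear in almost every document, so the same corpus can yield quite different feature spaces.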

In [2]:
df = pd.read_pickle("data/features.pkl")

In [3]:
df.head()

Unnamed: 0,original_text,label,preprocessed,word_count,avg_word_count,syllable_count,uncommon,difficult_words,stem,discourse,cohesive_features,flesch,dale,mcalpine,nouns_adjs,normalized
0,There is manuscript evidence that Austen conti...,1,"[there, is, manuscript, evidence, that, austen...",35,4.485714,1.371429,14,7,there is manuscript evid that austen continu t...,4,2,52.87,11.24,48.0,0.228571,there is manuscript evidence that austen conti...
1,"In a remarkable comparative analysis , Mandaea...",1,"[in, a, remarkable, comparative, analysis, man...",19,6.0,1.789474,14,8,there is manuscript evid that austen continu t...,2,1,35.27,14.55,23.0,0.315789,in a remarkable comparative analysis mandaean ...
2,"Before Persephone was released to Hermes , who...",1,"[before, persephone, was, released, to, hermes...",40,4.725,1.4,15,9,there is manuscript evid that austen continu t...,7,3,47.8,11.15,57.0,0.175,before persephone was released to hermes who h...
3,Cogeneration plants are commonly found in dist...,1,"[cogeneration, plants, are, commonly, found, i...",32,6.28125,1.78125,22,14,there is manuscript evid that austen continu t...,0,1,22.08,14.6,38.0,0.59375,cogeneration plants are commonly found in dist...
4,"Geneva -LRB- , ; , ; , ; ; -RRB- is the second...",1,"[geneva, is, the, city, in, switzerland, after...",20,4.65,1.35,7,4,there is manuscript evid that austen continu t...,0,2,68.1,8.58,29.0,0.4,geneva is the city in switzerland after zürich...


In [4]:
documents_train = [text for text in df['normalized']]

In [47]:
tfidf_vectorizer = TfidfVectorizer(max_features=10000, lowercase=False, ngram_range=(1, 1),
                                   min_df=10, max_df=0.95, stop_words='english')

tfidf_documents = tfidf_vectorizer.fit_transform(documents_train)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
features_tfidf = tfidf_vectorizer.get_feature_names()

tf_vectorizer = CountVectorizer(stop_words='english')
tf_documents = tf_vectorizer.fit_transform(documents_train)
tf_feature_names = tf_vectorizer.get_feature_names()


In [48]:
n_topics = 10

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components = n_topics, random_state=0)
lda.fit(tfidf_documents)
topic_models = lda.components_

In [49]:
num_top_words = 10

def display_topics(model, feature_names, no_top_words):
    """Return, for each topic of a fitted LDA model, the
    no_top_words terms with the largest weights, in descending order.
    """
    topics = []
    for topic in model.components_:
        # argsort is ascending, so slice backwards from the end to get the largest weights first
        term_list = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        topics.append(term_list)
    return topics


#display_topics(lda, features_tfidf, num_top_words)

# Topic coherence

One measure of topic model quality, often used to determine the optimal number of topics for a corpus, is *topic coherence*. It measures how semantically related the top terms of a topic are.

- Low coherence: topics tend to be filled with seemingly random words and are hard to interpret
- High coherence: topics tend to show a clear semantic theme that is easy to interpret

Word embeddings are an ideal tool for computing topic coherence because they have the ability to represent word semantics. 
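
As a rough illustration of the idea, the sketch below scores a topic by the mean cosine similarity over all distinct pairs of its top terms. The two-dimensional vectors in `toy_embeddings` are made up for the example; real word2vec vectors have many more dimensions. The function defined later in this notebook averages over the full similarity matrix with a zeroed diagonal instead, so its values come out somewhat lower than a pairs-only average.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(terms, embeddings):
    """Average cosine similarity over all distinct pairs of terms."""
    pairs = list(combinations(terms, 2))
    return sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs) / len(pairs)

# Hypothetical 2-d embeddings, chosen by hand for illustration only
toy_embeddings = {
    "football": (1.0, 0.1),
    "league": (0.9, 0.2),
    "club": (0.8, 0.3),
    "church": (0.1, 1.0),
    "commune": (-0.5, 0.4),
}

coherent_topic = ["football", "league", "club"]
mixed_topic = ["football", "church", "commune"]

coherent_score = mean_pairwise_similarity(coherent_topic, toy_embeddings)
mixed_score = mean_pairwise_similarity(mixed_topic, toy_embeddings)
print(f"coherent topic: {coherent_score:.3f}")
print(f"mixed topic:    {mixed_score:.3f}")
```

Terms that point in similar directions in the embedding space give a score near 1, while unrelated terms pull the average toward 0.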

We're going to use `text8_model`, a W2VTransformer object that implements the word2vec embedding.

In [50]:
# pip install "gensim==3.8.3"  (use this version of gensim)

In [51]:
import pickle

with open("data/text8_W2V.pickle", "rb") as f:
    text8_model = pickle.load(f)

# Average semantic similarity as a topic coherence measure

In [68]:
from sklearn.metrics.pairwise import cosine_similarity

def calc_avg_cosine(s):
    """Return the mean pairwise cosine similarity between the word2vec vectors of the terms in s."""
    try:
        # For each input term, compute its word2vec embedding vector
        wordvecs = np.asarray(text8_model.transform(s))
    except KeyError:
        # A term is missing from the embedding vocabulary: score the whole topic as 0
        return 0.0

    # Pairwise cosine similarities, shape len(s) x len(s)
    matrix = cosine_similarity(wordvecs)
    # Zero the diagonal so a word's similarity with itself does not count
    np.fill_diagonal(matrix, 0)
    # The mean is taken over all len(s)**2 cells, including the zeroed diagonal
    return np.mean(matrix)

In [73]:
#display_topics(lda, features_tfidf, num_top_words)

In [71]:
scores =[]
for topic in display_topics(lda, features_tfidf, num_top_words):
    scores.append(calc_avg_cosine(topic))

In [72]:
print(scores)

[0.18844251058995723, 0.22881923768669366, 0.18896742030978203, 0.23224354784935713, 0.26277119332458826, 0.1510866433568299, 0.32901755705475805, 0.0, 0.1663243638537824, 0.1324111014790833]


In [79]:
def coherence_b():
    """Return the median topic coherence of LDA models fitted with 2 to 10 topics."""
    from sklearn.decomposition import LatentDirichletAllocation
    top = 10
    medians = []
    for n in range(2, 11):
        # Fit one model per candidate number of topics
        lda = LatentDirichletAllocation(n_components=n, random_state=0)
        lda.fit(tfidf_documents)
        # Top terms of every topic, then one coherence score per topic
        topics = display_topics(lda, features_tfidf, top)
        scorelist = [calc_avg_cosine(topic) for topic in topics]
        medians.append(np.median(scorelist))

    return medians

In [80]:
coherence_b()

[0.1576924245734699,
 0.11390801172703505,
 0.12204309355001897,
 0.16601227690465747,
 0.21220865251030774,
 0.1433943632710725,
 0.1633485022885725,
 0.19491060564294457,
 0.18870496544986964]

# Diagnose model performance with perplexity and log-likelihood

In [83]:
# Log-likelihood: higher is better
print("Log Likelihood: ", lda.score(tfidf_documents))
# Perplexity: lower is better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda.perplexity(tfidf_documents))
# See model parameters
print(lda.get_params())

Log Likelihood:  -9233917.46092212
Perplexity:  8274.853376250132
{'batch_size': 128, 'doc_topic_prior': None, 'evaluate_every': -1, 'learning_decay': 0.7, 'learning_method': 'batch', 'learning_offset': 10.0, 'max_doc_update_iter': 100, 'max_iter': 10, 'mean_change_tol': 0.001, 'n_components': 10, 'n_jobs': None, 'perp_tol': 0.1, 'random_state': 0, 'topic_word_prior': None, 'total_samples': 1000000.0, 'verbose': 0}
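
The relation between the two numbers printed above can be checked by hand. The sketch below rests on the assumption that the "per word" normalizer is the sum of all entries of the input matrix, which for a TF-IDF matrix is a total weight rather than a token count. The function names are hypothetical helpers, not part of scikit-learn.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, total_count):
    """Perplexity = exp(-log-likelihood / total count)."""
    return math.exp(-log_likelihood / total_count)

def implied_total_count(log_likelihood, perplexity):
    """Invert the formula to recover the normalizer that was used."""
    return -log_likelihood / math.log(perplexity)

# Values copied from the cell output above
log_likelihood = -9233917.46092212
perplexity = 8274.853376250132

total = implied_total_count(log_likelihood, perplexity)
print(f"implied total weight of the input matrix: {total:,.0f}")
print("recomputed perplexity:", perplexity_from_log_likelihood(log_likelihood, total))

# Sanity check: a model that spreads probability uniformly over V terms
# has perplexity V, whatever the corpus size (V = 4 is a toy value)
toy_vocab_size = 4
uniform_log_likelihood = total * math.log(1 / toy_vocab_size)
uniform_perplexity = perplexity_from_log_likelihood(uniform_log_likelihood, total)
print("uniform-model perplexity:", uniform_perplexity)
```

Read this way, perplexity is the effective number of equally likely terms the model is choosing among, which is why lower values indicate a better fit.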


#  Use GridSearch to determine the best LDA model

In [86]:
from sklearn.model_selection import GridSearchCV
# Define the search parameters
search_params = {'n_components': [2, 3, 4, 8, 10, 13], 'learning_decay': [.5, .7, .9]}
# Initialize the model
lda = LatentDirichletAllocation(max_iter=5, learning_method='online', learning_offset=50., random_state=0)
# Initialize the grid search
model = GridSearchCV(lda, param_grid=search_params)
# Run the grid search
model.fit(tf_documents)

GridSearchCV(cv=None, error_score='raise',
       estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=10, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_topics': [2,3,4,8,10,13], 'learning_decay': [0.5, 0.7, 0.9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
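
Before running a search like this, it helps to count how many models it will fit. The sketch below enumerates the candidates of the grid used above with the standard library only. The fold count `n_folds = 5` is an assumption (the default number of folds depends on the scikit-learn version), and the final extra fit assumes `refit=True`.

```python
from itertools import product

# Same grid as in the search above
search_params = {'n_components': [2, 3, 4, 8, 10, 13], 'learning_decay': [.5, .7, .9]}

# Every combination of one value per parameter is one candidate model
names = sorted(search_params)
candidates = [
    dict(zip(names, values))
    for values in product(*(search_params[name] for name in names))
]

n_folds = 5  # assumption: 5-fold cross-validation
# One fit per candidate per fold, plus one refit of the best candidate
n_fits = len(candidates) * n_folds + 1

print(len(candidates), "candidates,", n_fits, "fits in total")
print(candidates[:3])
```

Because the number of fits grows multiplicatively with each parameter added to the grid, a small grid and a low `max_iter` keep the search affordable on a large corpus.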

# Visualizing Topics with pyLDAvis

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [111]:
#pip install pyLDAvis
import spacy 
from spacy import displacy

import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel, LsiModel, HdpModel
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords

In [112]:
stop = stopwords.words('english')

In [113]:
df['no_stop']= df['preprocessed'].apply(lambda x: [item for item in x if item not in stop])

In [115]:
texts= df['no_stop']

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
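
For readers unfamiliar with the bag-of-words format built above, here is a simplified pure-Python sketch of the idea: each token gets an integer id, and each document becomes a sorted list of `(token_id, count)` pairs. The helpers `build_token_ids` and `to_bow` are hypothetical stand-ins; gensim's `Dictionary` may assign ids in a different order.

```python
from collections import Counter

def build_token_ids(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    token_ids = {}
    for tokens in texts:
        for token in tokens:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bow(tokens, token_ids):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token_ids[t] for t in tokens if t in token_ids)
    return sorted(counts.items())

# Hypothetical tokenized documents, not taken from the notebook's data
toy_texts = [
    ["football", "club", "football"],
    ["city", "club"],
]

ids = build_token_ids(toy_texts)
bows = [to_bow(tokens, ids) for tokens in toy_texts]
print(ids)
print(bows)
```

Word order is discarded in this representation; only which tokens occur and how often is passed on to the topic model.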

In [116]:
lda_model = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)
lda_model.show_topics()

[(0,
  '0.040*"born" + 0.032*"player" + 0.028*"football" + 0.020*"former" + 0.012*"plays" + 0.009*"may" + 0.007*"italian" + 0.007*"spanish" + 0.006*"battle" + 0.005*"types"'),
 (1,
  '0.022*"used" + 0.018*"also" + 0.014*"name" + 0.013*"called" + 0.012*"many" + 0.010*"like" + 0.009*"people" + 0.009*"often" + 0.009*"different" + 0.008*"one"'),
 (2,
  '0.024*"team" + 0.022*"national" + 0.019*"university" + 0.014*"football" + 0.012*"language" + 0.011*"japanese" + 0.010*"club" + 0.009*"league" + 0.008*"church" + 0.007*"world"'),
 (3,
  '0.072*"france" + 0.071*"found" + 0.062*"region" + 0.056*"department" + 0.042*"commune" + 0.026*"de" + 0.016*"la" + 0.013*"ã" + 0.011*"calvados" + 0.010*"aisne"'),
 (4,
  '0.020*"born" + 0.015*"american" + 0.011*"died" + 0.010*"september" + 0.009*"first" + 0.009*"king" + 0.009*"became" + 0.009*"november" + 0.009*"july" + 0.009*"president"'),
 (5,
  '0.035*"city" + 0.030*"united" + 0.029*"states" + 0.027*"north" + 0.017*"state" + 0.017*"south" + 0.015*"county"

In [117]:
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)