<h1> Topical Analysis </h1>
<h1> Topic Detection and Distribution </h1>

~ 16 hrs.

Although some of the topics are listed in the corpus documentation, the full list would probably be too long to compile by hand. Moreover, the topics listed in the documentation are the prompts assigned to students; they are not grounded in the lexis actually used. Clustering the documents with an unsupervised method into a k-topic distribution shows how topics are distributed across target groups in terms of the tokens themselves.
    
There are various methods (LSA, pLSA, LDA, CTM) for extracting topics. Here I try Latent Dirichlet Allocation (LDA). For each document, LDA outputs the probability of that document belonging to each of k topic clusters; k itself must be chosen by the researcher.


Using Latent Dirichlet Allocation (LDA), I assess the following:
    
   1. The composition of the corpora across k topics
   2. Token frequency and distribution per topic
   3. Topic-wise probability distribution per sample
   4. Coefficient of variation (CV) of each topic across targets
   5. Plotted topic-wise distribution, scaled by CV, per target

LDA was chosen because it assigns a probability distribution to each document: documents are not given 'hard' topical categories, but percentage 'mixtures' of the topics inherent in them (Alghamdi & Alfalqi, 2015). LSA does not have this fine-grained capability. LDA can also use the same token to represent more than one topic with a given probability, which accounts for lexical overlap and polysemous word use between topics. Unlike CTM, it does not create a 'topic network' in which the relationships between topics are also represented.
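
To make the 'mixture' idea concrete, here is a toy sketch with made-up numbers. The topic names, words, and probabilities are all hypothetical and are not taken from the fitted model; `word_probability` is an illustrative helper, not a gensim function. It shows a document as a distribution over topics, and a single word receiving probability from more than one topic.

```python
# Toy illustration of LDA's soft assignments; all numbers are invented.

# p(topic | document): a mixture, not a single hard label
doc_topics = {"environment": 0.7, "economy": 0.3}

# p(word | topic): the same word may appear in several topics
topic_words = {
    "environment": {"energy": 0.5, "waste": 0.4, "money": 0.1},
    "economy": {"energy": 0.2, "waste": 0.0, "money": 0.8},
}


def word_probability(word, doc_topics, topic_words):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topic_words[topic].get(word, 0.0)
               for topic, weight in doc_topics.items())


p_energy = word_probability("energy", doc_topics, topic_words)
p_money = word_probability("money", doc_topics, topic_words)
print(round(p_energy, 2), round(p_money, 2))  # 0.41 0.31
```

Here 'energy' draws probability from both topics, which is the kind of lexical overlap a hard clustering could not express.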

LDA automatically clusters the corpus into k topics. I chose k=24 to be considerably greater than the number of target groups (6) while remaining easy to visualize, although in theory k could be as large as the number of samples. (The run shown below uses k=18; both values were tried.) Realistically, 24 topics in a corpus of 100,000 samples is too coarse, but the point here is to capture potential imbalances across the target groups rather than to assign hard labels to the samples.

If we think of k topics as histogram bins, a larger k (more, but smaller, bins) gives us more definition, although we lose some notion of what the overall emerging distribution looks like. So a smaller k is really more of a visualization (and computational) convenience here.

Code modified only slightly from Kapadia (2019): https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0


In [17]:
import os
import re
import json
import pickle
import pandas as pd
import numpy as np
from pprint import pprint
import random

random.seed(10)  # for reproducible results; gensim's LDA is seeded separately via random_state when the model is built

# plotting
import plotly.express as px
import plotly
import plotly.io as pio

# corpus tools
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
import nltk
from common_prep_eda import get_CV

# dataframe display settings
pio.renderers.default='notebook'
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 10)

# main directories
project_dir = "/Users/paulp/Library/CloudStorage/OneDrive-UniversityofEasternFinland/UEF/Thesis"
data_dir = os.path.join(project_dir, 'Data')
model_dir = os.path.join(project_dir, 'Models')

os.chdir(data_dir)

#load data
with open('target_idx.json', 'r') as file:
    target_idx = json.load(file)

with open('stopwords_iso.txt', 'r') as file:
    stop_words = file.readlines()
stop_words = [a.strip() for a in stop_words]
# also exclude the BIOES tag pieces (loc, misc, per, org) that would otherwise
# end up in the frequency list, plus a few frequent but uninformative words
stop_words.extend(['loc', 'misc', 'per', 'org', 'people', 'would', 'could'])

masked_data = pd.read_csv('masked_data_set.csv', index_col=0).reset_index(drop=True)
masked_data.iloc[0:25]

Unnamed: 0,Corpus,Target,Text,Length,k
0,ICLE,GE,I've been making music now for 20 years. You c...,379,8
1,ICLE,GE,A quick inspection of the waste-paper basket n...,554,6
2,ICLE,CN,Recycling of waste has long been a controversi...,572,1
3,ICLE,CN,"Few years age, government in some cities such ...",773,4
4,ICLE,JP,"Gender discrimination. These Days, we often co...",824,7
5,ICLE,CN,"After 1997, more and more co-operation between...",469,8
6,ICLE,GE,""" The more I get to know of people, the more I...",478,4
7,ICLE,CN,Having drinks in cyber cafes (PC cafes) is ver...,315,5
8,ICLE,JP,"When the subject given us first, I thought tha...",623,0
9,ICLE,GE,Yesterday evening I saw a report on the world ...,756,4


In [18]:
# remove punctuation, lowercase, and assign to pd.Series object
LDA_texts = masked_data['Text'].map(lambda x: re.sub(r'[,\.!?]', '', x)).map(lambda x: x.lower())

In [19]:

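def sent_to_words(sentences):
    """Tokenize each text into lowercase word tokens."""
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself drops punctuation
        yield simple_preprocess(str(sentence), deacc=True)


def remove_stopwords(texts):
    """Drop stop words from documents that are already tokenized."""
    stop_set = set(stop_words)  # set lookup is much faster than scanning a list
    return [[word for word in doc if word not in stop_set] for doc in texts]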

data = LDA_texts.values.tolist()
data_words = list(sent_to_words(data))
# remove stop words
data_words = remove_stopwords(data_words)

# sanity check: number of documents and token count of the first one
print('number of samples: ', len(data_words), '\n length of sample 1: ', len(data_words[0]))

number of samples:  16522 
 length of sample 1:  106


In [20]:
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# uncomment to view corpus entry
# print(corpus[:1][0][:30])
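
For readers unfamiliar with the bag-of-words format, the sketch below mimics in plain Python what the dictionary and `doc2bow` steps produce: each document becomes a list of `(token_id, count)` pairs. It is an illustration only; `build_vocab` and `to_bow` are hypothetical helpers, and the ids that gensim assigns may differ.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens known to vocab."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_docs = [["waste", "paper", "waste"], ["paper", "recycling"]]
vocab = build_vocab(toy_docs)
bows = [to_bow(doc, vocab) for doc in toy_docs]
print(bows)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is discarded at this step, so the topic model only ever sees token counts per document.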

In [21]:

num_topics = 18
# Build model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=99)

# uncomment below to print keywords per topic
# pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [22]:
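# one column per target group, one row per topic
iterables = [["Target"], list(target_idx.keys())]
col = pd.MultiIndex.from_product(iterables, names=["1", "2"])

# initialize with floats, since probabilities are accumulated below
topic_dist = pd.DataFrame(data=0.0, columns=col, index=range(num_topics))

# add each document's topic probabilities to the column of its target group
for a in range(len(doc_lda)):
    tgt = masked_data.loc[a, 'Target']
    for topic_id, prob in doc_lda[a]:
        topic_dist.loc[topic_id, ('Target', tgt)] += prob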
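
The accumulation step in this cell adds each document's topic probabilities to the column of its target group. The same logic in plain Python, on invented data, may make the resulting table easier to read. `doc_topic_probs` and `targets` are hypothetical stand-ins for `doc_lda` and `masked_data['Target']`.

```python
from collections import defaultdict

# Hypothetical stand-ins: one list of (topic_id, probability) pairs per document,
# and the target label of each document. All values are invented.
doc_topic_probs = [
    [(0, 0.75), (1, 0.25)],
    [(1, 1.0)],
    [(0, 0.5), (1, 0.5)],
]
targets = ["GE", "CN", "GE"]

# mass[topic][target] = summed probability of that topic over the target's documents
mass = defaultdict(lambda: defaultdict(float))
for probs, target in zip(doc_topic_probs, targets):
    for topic_id, prob in probs:
        mass[topic_id][target] += prob

for topic_id in sorted(mass):
    print(topic_id, dict(mass[topic_id]))
```

Because the sums are not divided by the number of documents per group, a group with more documents contributes more mass to every topic.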
        

In [23]:
# get CV and add as column
topic_dist['CV'] = get_CV(topic_dist, 'Target')

# normalize each topic's row to proportions, then scale by CV so that the row sums to that topic's CV
norm = topic_dist['Target'].div(topic_dist['Target'].sum(axis=1), axis = 0).mul(topic_dist['CV'], axis = 0)
topic_dist['Target'] = norm
topic_dist['Topic'] = topic_dist.index
topic_dist

1,Target,Target,Target,Target,Target,Target,CV,Topic
2,GE,CN,JP,RU,SP,AR,,
0,0.099164,0.123175,0.038554,0.050038,0.082702,0.041721,0.435355,0
1,0.087643,0.053125,0.056175,0.029732,0.05259,0.044193,0.323457,1
2,0.114622,0.070974,0.037903,0.059559,0.075014,0.03783,0.395903,2
3,0.137031,0.094183,0.063914,0.021825,0.083908,0.059322,0.460182,3
4,0.178837,0.060366,0.034188,0.1029,0.116618,0.049357,0.542265,4
5,0.114658,0.059879,0.043041,0.071343,0.070978,0.033967,0.393866,5
6,0.144671,0.060669,0.043441,0.05871,0.101327,0.052067,0.460886,6
7,0.068385,0.048491,0.050208,0.026883,0.054382,0.034959,0.283308,7
8,0.1996,0.062231,0.051913,0.105116,0.106493,0.04044,0.565793,8
9,0.144217,0.068886,0.059397,0.053466,0.080229,0.041411,0.447607,9
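
The implementation of `get_CV` lives in `common_prep_eda` and is not shown here. As a rough check, assuming it computes the coefficient of variation as standard deviation divided by mean across the target columns, the sketch below recomputes the value for topic 0 from the printed row. Since CV is unchanged by rescaling, the scaled values in the table can be used directly. The function name `coefficient_of_variation` is hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# Topic 0 row from the table above (columns GE, CN, JP, RU, SP, AR)
row_0 = [0.099164, 0.123175, 0.038554, 0.050038, 0.082702, 0.041721]

cv = coefficient_of_variation(row_0)  # spread across the six target groups
row_sum = sum(row_0)                  # the scaled row should sum to the CV
print(round(cv, 3), round(row_sum, 3))
```

Both values come out at roughly 0.435, in line with the printed CV for topic 0. This suggests, without confirming, that `get_CV` uses the population rather than the sample standard deviation.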


In [24]:
# save
topic_dist.to_csv(f'topic_dist_{num_topics}.csv')

In [25]:
#topic_dist = pd.read_csv('topic_dist_18.csv', index_col = 0, header = 1)

In [52]:
fig = px.bar(topic_dist['Target'],
       x = topic_dist.index,
       y = list(topic_dist['Target'].columns),
      title = 'LDA, k = 18 : Contributions to Topic per Target')
fig.show()

In [53]:
fig.write_image(os.path.join(data_dir, "TARGET_LDA_18.png"))

Note the topics that are heavily represented by Chinese and German writers. To remedy this, some samples from these topics may have to be randomly dropped from the training data.
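
One way such trimming could look, as a minimal sketch only: assign each sample its most probable topic, then randomly keep at most a fixed number of samples per (topic, target) pair. The function `cap_group`, the cap value, and the toy records are assumptions for illustration, not part of the project code.

```python
import random
from collections import defaultdict


def cap_group(records, cap, seed=10):
    """Keep at most `cap` randomly chosen records per (topic, target) pair.

    `records` is a list of (sample_id, dominant_topic, target) tuples.
    Returns the kept sample ids in sorted order.
    """
    rng = random.Random(seed)  # local generator, so the global state is untouched
    groups = defaultdict(list)
    for sample_id, topic, target in records:
        groups[(topic, target)].append(sample_id)

    kept = []
    for ids in groups.values():
        if len(ids) > cap:
            ids = rng.sample(ids, cap)  # random subset without replacement
        kept.extend(ids)
    return sorted(kept)


# Invented example: topic 8 is dominated by GE samples
toy = [(0, 8, "GE"), (1, 8, "GE"), (2, 8, "GE"), (3, 8, "GE"),
       (4, 8, "CN"), (5, 4, "JP")]
kept_ids = cap_group(toy, cap=2)
print(len(kept_ids))  # 4
```

Capping per (topic, target) pair leaves small groups untouched and only thins the over-represented ones.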

In [57]:
[a[0] for a in lda_model.show_topic(8, topn = 20)]

['time',
 'power',
 'life',
 'money',
 'women',
 'students',
 'demographic',
 'products',
 'lot',
 'product',
 'energy',
 'magazine',
 'cars',
 'day',
 'ideas',
 'purchasing',
 'travel',
 'hand',
 'concepts',
 'magazines']

In [46]:
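# one column per source corpus, one row per topic
iterables = [["Corpus"], masked_data['Corpus'].unique()]
col = pd.MultiIndex.from_product(iterables, names=["1", "2"])

# initialize with floats, since probabilities are accumulated below
topic_dist_corpus = pd.DataFrame(data=0.0, columns=col, index=range(num_topics))

# add each document's topic probabilities to the column of its source corpus
for a in range(len(doc_lda)):
    corp = masked_data.loc[a, 'Corpus']
    for topic_id, prob in doc_lda[a]:
        topic_dist_corpus.loc[topic_id, ('Corpus', corp)] += prob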
        

In [47]:
# get CV and add as column
topic_dist_corpus['CV'] = get_CV(topic_dist_corpus, 'Corpus')

# normalize each topic's row to proportions, then scale by CV so that the row sums to that topic's CV
norm = topic_dist_corpus['Corpus'].div(topic_dist_corpus['Corpus'].sum(axis=1), axis = 0).mul(topic_dist_corpus['CV'], axis = 0)
topic_dist_corpus['Corpus'] = norm
topic_dist_corpus['Topic'] = topic_dist_corpus.index
topic_dist_corpus

1,Corpus,Corpus,Corpus,Corpus,CV,Topic
2,ICLE,EFCAM,PELIC,TOEFL11,,
0,0.143818,0.224492,0.017581,0.16252,0.548412,0
1,0.051745,0.393372,0.027505,0.338743,0.811366,1
2,0.108118,0.499602,0.035746,0.197061,0.840528,2
3,0.041404,0.2972,0.014256,0.68544,1.038299,3
4,0.080098,0.980038,0.027959,0.158753,1.246847,4
5,0.135491,0.602229,0.049799,0.143071,0.93059,5
6,0.068212,0.731747,0.066041,0.183104,1.049104,6
7,0.051479,0.370593,0.023525,0.372466,0.818063,7
8,0.11011,1.097831,0.036128,0.086483,1.330552,8
9,0.099623,0.598631,0.025954,0.215144,0.939352,9


In [49]:
px.bar(topic_dist_corpus['Corpus'],
       x = topic_dist_corpus.index,
       y = list(topic_dist_corpus['Corpus'].columns),
      title = 'LDA, k = 18 : Contributions to Topic per Corpus')

In [50]:
lda_model.show_topic(8, topn = 20)

[('time', 0.014143742),
 ('power', 0.00770389),
 ('life', 0.0074340594),
 ('money', 0.006425502),
 ('women', 0.0057231337),
 ('students', 0.0053042592),
 ('demographic', 0.0050194687),
 ('products', 0.0050153895),
 ('lot', 0.0045684185),
 ('product', 0.004027994),
 ('energy', 0.003912709),
 ('magazine', 0.0036095257),
 ('cars', 0.003562119),
 ('day', 0.003298004),
 ('ideas', 0.0031382232),
 ('purchasing', 0.003097289),
 ('travel', 0.0029325415),
 ('hand', 0.0028622712),
 ('concepts', 0.0028181558),
 ('magazines', 0.0028081378)]

In [51]:
lda_model.show_topic(4, topn = 20)

[('life', 0.013356047),
 ('time', 0.008302079),
 ('product', 0.0077226213),
 ('person', 0.0075067924),
 ('cars', 0.0067143766),
 ('products', 0.0058959825),
 ('live', 0.005222048),
 ('successful', 0.004689139),
 ('car', 0.0045310464),
 ('enjoy', 0.0042872275),
 ('money', 0.0034828153),
 ('companies', 0.0034162123),
 ('energy', 0.0032567976),
 ('hand', 0.0031265637),
 ('reason', 0.0029781936),
 ('job', 0.002976368),
 ('lot', 0.002906235),
 ('statement', 0.0028460824),
 ('society', 0.0028323047),
 ('agree', 0.0027491339)]

In [63]:
import pyLDAvis
import pyLDAvis.gensim_models


# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_data_filepath = os.path.join(model_dir, 'lda_model_'+str(num_topics))

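# set to False to reuse a previously pickled visualization instead of recomputing it
recompute_vis = True
if recompute_vis: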
    LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
    with open(LDAvis_data_filepath, 'wb') as f:
        pickle.dump(LDAvis_prepared, f)
        
# load the prepared pyLDAvis data from disk
with open(LDAvis_data_filepath, 'rb') as f:
    LDAvis_prepared = pickle.load(f)
pyLDAvis.save_html(LDAvis_prepared, 'lda_model_'+str(num_topics) +'.html')

LDAvis_prepared




<h2> Results and Next Steps </h2>

The results initially pointed to the topics about smoking in public spaces and work-study appearing disproportionately in the Chinese subcorpus (both are suggested writing prompts from ICLE). However, the addition of the TOEFL11 corpus seemed to diminish this. Trials with k=18 and k=24 both point to similar imbalances, but overall CV values dropped to <0.60, whereas imbalanced topics were >0.90 in some cases before the addition of TOEFL11. The additional data seems to have reduced the lopsided distribution of topics in the corpora. I am going to wait before trimming samples from the corpus and will return to this issue later if topical cues are found to bias the model.

<h1> References </h1>

Alghamdi, R., & Alfalqi, K. (2015). A Survey of Topic Modeling in Text Mining. (IJACSA) International Journal of Advanced Computer Science and Applications, 6(1). https://thesai.org/Downloads/Volume6No1/Paper_21-A_Survey_of_Topic_Modeling_in_Text_Mining.pdf

Kapadia, S. (2019, April 15). Topic Modeling in Python: Latent Dirichlet Allocation (LDA). Medium; Towards Data Science. https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

Stopwords ISO. (2022, October 6). GitHub. https://github.com/stopwords-iso

