# LDA analysis

This workbook demonstrates how to run a latent Dirichlet allocation (LDA) topic analysis on a set of newspaper articles. The code was written by Andres Azqueta and edited by Martin Lynge Rasmussen.

We will:

1. Install the required packages
2. Load and format the text files
3. Carry out LDA analysis
4. Train a word2vec model and look up the words most similar to a given term


Before proceeding, you should:
1. Create a folder called "data" with two subfolders called "input" and "output".
2. Save all of the newspaper articles you have downloaded to the "input" folder.
3. Name the files Chinese_Risk_%d.json, where %d is a number starting at 1, so the first document is Chinese_Risk_1.json. You can use other names, but in that case remember to change the names in the remainder of this file.
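
Before running the notebook it can help to confirm that the files follow the naming scheme above. The snippet below is a minimal sketch: the helper name `missing_input_files`, the relative path `data/input` and the range 1 to 10 (which mirrors the loop used later in this workbook) are assumptions to adjust to your own setup.

```python
from pathlib import Path


def missing_input_files(folder, first=1, last=10, pattern="Chinese_Risk_%d.json"):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    expected = [pattern % i for i in range(first, last + 1)]
    return [name for name in expected if not (folder / name).is_file()]


# Hypothetical location: adjust to wherever your "input" folder lives.
missing = missing_input_files("data/input")
print("Missing files:", missing)
```

An empty list means every expected file was found.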

## 1. Load the required packages

In [3]:
## Install the packages (restart the kernel afterwards if pip asks you to)
%pip install -U gensim
%pip install pyLDAvis

## Libraries to import
import os
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from gensim.parsing.preprocessing import STOPWORDS

Requirement already up-to-date: gensim in c:\anaconda3\lib\site-packages (3.8.3)
Note: you may need to restart the kernel to use updated packages.


## 2. Load and format the textual data

In [6]:
os.chdir('C:/Users/rasmusa/Desktop/python/Chinese_Risk/data/output') # Change this to the folder that holds the JSON files and English_Stopwords.txt
## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')

# create English stop words list
#en_stop = stopwords.words('english')
#stop_words = set(stopwords.words('english'))

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()


## Reading the data
import json
import nltk
import re
import pandas

appended_data = []

for i in range(1,11):
    # each file holds one JSON object (one article) per line
    with open('Chinese_Risk_%d.json' % i) as f:
        df0 = pandas.DataFrame([json.loads(l) for l in f])
    appended_data.append(df0)
    
appended_data = pandas.concat(appended_data)
doc_set = appended_data.body
print(len(doc_set))


## English Stopwords
English_Stopwords = open("English_Stopwords.txt").read() # also contain uni-characters
English_Stopwords1 = set(English_Stopwords.split('\n')) # a set makes the membership test below fast


# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:
    
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    
    
    # remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]

    # remove stop words from tokens
    #stopped_tokens = [i for i in tokens if not i in en_stop]
    stopped_tokens = [i for i in words if i not in English_Stopwords1]
    
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    
    # add tokens to list
    texts.append(stemmed_tokens)

# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
#The function doc2bow() simply counts the number of occurrences of each distinct word, 
#converts the word to its integer word id and returns the result as a sparse vector


1000
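
The bag-of-words conversion performed by `doc2bow()` in the cell above can be imitated in plain Python to see what the resulting `corpus` looks like. This is only an illustrative sketch: `build_vocab` and `to_bow` are hypothetical helpers, not gensim functions, and the ids gensim assigns to words may differ from the ones produced here.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Two made-up, already tokenized documents.
toy_texts = [["bank", "loan", "bank"], ["yuan", "bank"]]
vocab = build_vocab(toy_texts)
toy_corpus = [to_bow(tokens, vocab) for tokens in toy_texts]

print(vocab)       # {'bank': 0, 'loan': 1, 'yuan': 2}
print(toy_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1)]]
```

Each document becomes a sparse list of (word id, count) pairs, so words that do not occur in a document take up no space.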


## 3. LDA analysis
You can change the number of topics if you wish. The output lists every identified topic. We will later show how to explore the topics in a more visual way.

Below is an example for topic 0. Each topic is enclosed in parentheses as a pair of topic number and word list, and the number in front of each word is the weight of that word in the topic. Words appear in stemmed form, e.g. "compani" for "company".

[(0, '0.029*"ma" + 0.020*"alibaba" + 0.012*"yuan" + 0.012*"dollar" + 0.010*"china" + 0.009*"compani" + 0.009*"chines" + 0.008*"alipay" + 0.008*"financi" + 0.007*"intern" + 0.007*"asset" + 0.006*"investor" + 0.006*"even" + 0.006*"global" + 0.005*"currenc" + 0.005*"would" + 0.005*"first" + 0.004*"execut" + 0.004*"trillion" + 0.004*"jack" + 0.004*"america" + 0.004*"prasad" + 0.004*"commerc" + 0.004*"bank" + 0.004*"system" + 0.004*"trade" + 0.004*"busi" + 0.003*"market" + 0.003*"among" + 0.003*"chairman"')
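
If you want to work with the weights rather than read them off the printed string, the string can be split into (word, weight) pairs. The snippet below is a minimal sketch using only the standard library; `parse_topic` is a hypothetical helper, and the example string is the first three terms of topic 0 above.

```python
import re


def parse_topic(topic_string):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    pairs = re.findall(r'(\d*\.\d+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


example = '0.029*"ma" + 0.020*"alibaba" + 0.012*"yuan"'
for word, weight in parse_topic(example):
    print(f"{word:10s} {weight:.3f}")
```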

### 3.1. Initial analysis

In [9]:
# Generate the LDA model. minimum_probability must be set to zero, otherwise low-probability topics will be suppressed in the next steps
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=20, id2word = dictionary, passes=50, minimum_probability=0)
ldamodel.save("model.ldaFinanceRisk20") 
#class gensim.models.ldamodel.LdaModel(corpus=None, num_topics=100, id2word=None, distributed=False, 
#chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, 
#iterations=50, gamma_threshold=0.001, minimum_probability=0.01)
#passes: optional. The number of passes the model makes through the corpus.
#More passes generally help the model converge, but many passes can be slow on a very large corpus.

print(ldamodel.print_topics(num_topics=20, num_words=30))

## visualization of the topics
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

# prepare the visualization once and save it as a standalone HTML file
vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.save_html(vis_data, 'Chinse_Financial_Risk_Topics_20.html')

[(0, '0.029*"ma" + 0.020*"alibaba" + 0.012*"yuan" + 0.012*"dollar" + 0.010*"china" + 0.009*"compani" + 0.009*"chines" + 0.008*"alipay" + 0.008*"financi" + 0.007*"intern" + 0.007*"asset" + 0.006*"investor" + 0.006*"even" + 0.006*"global" + 0.005*"currenc" + 0.005*"would" + 0.005*"first" + 0.004*"execut" + 0.004*"trillion" + 0.004*"jack" + 0.004*"america" + 0.004*"prasad" + 0.004*"commerc" + 0.004*"bank" + 0.004*"system" + 0.004*"trade" + 0.004*"busi" + 0.003*"market" + 0.003*"among" + 0.003*"chairman"'), (1, '0.044*"bank" + 0.017*"china" + 0.013*"loan" + 0.011*"financi" + 0.011*"year" + 0.010*"risk" + 0.009*"credit" + 0.009*"per" + 0.009*"cent" + 0.008*"financ" + 0.008*"lend" + 0.007*"yuan" + 0.007*"product" + 0.006*"compani" + 0.006*"fund" + 0.006*"billion" + 0.006*"sector" + 0.005*"debt" + 0.005*"market" + 0.005*"shadow" + 0.005*"mainland" + 0.005*"growth" + 0.005*"system" + 0.005*"regul" + 0.005*"new" + 0.005*"last" + 0.004*"manag" + 0.004*"govern" + 0.004*"trust" + 0.004*"would"'), 



### 3.2. Save the topics in a readable csv file
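The cell below saves the 20 highest-weighted words of each of the 20 topics to a csv file; you can change either number. It also saves the topic distribution of every article and the article dates.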

In [67]:

## -----------------------------------------------------------------------------------------------------------##
# ------------------------------------- OBTAINING ARTICLES-TOPICS DISTRIBUTIONS -------------------------------#
## -----------------------------------------------------------------------------------------------------------##

# https://web.stanford.edu/class/stats202/content/lab18.htm very complete link with LDA information

## Here we save the topics in a readable csv file

numTopics = 20
topics = {"topic":[],"word":[],"weight":[]}
for topic in range(numTopics):
    x = ldamodel.show_topic(topic,20)
    for word, weight in x:  # show_topic returns (word, probability) pairs
        topics["topic"].append(topic)
        topics["word"].append(word)
        topics["weight"].append(weight)
topics = pandas.DataFrame(topics)

topics.to_csv("topics20.csv")

# Here, we store the distribution of topics in every article

topicDists = [ldamodel[bow] for bow in corpus]

# convert it into a dataframe
topic_article_Dists = pandas.DataFrame(topicDists)

# select only the probabilities for each column
topic_article_Dists = pandas.concat([topic_article_Dists[x].str[1] for x in topic_article_Dists.columns], axis=1)

# save the output as a csv file
topic_article_Dists.to_csv("Article-Topic-Distri20.csv")        
           
## Save the date in a csv file         
Date = pandas.DataFrame(appended_data.date)   
Date.to_csv("Date_20.csv")     
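
With `minimum_probability=0`, `topicDists` holds one list per article, and each list contains a (topic number, probability) pair for every topic. The sketch below shows, on made-up numbers, how such a structure turns into the probability table saved above and how to pick each article's dominant topic. `probability_matrix` and `dominant_topics` are hypothetical helpers, not part of gensim.

```python
def probability_matrix(topic_dists):
    """Keep only the probabilities: one row per article, one column per topic."""
    return [[prob for _, prob in dist] for dist in topic_dists]


def dominant_topics(topic_dists):
    """For each article, return the (topic, probability) pair with the highest probability."""
    return [max(dist, key=lambda pair: pair[1]) for dist in topic_dists]


# Made-up distributions for two articles over three topics.
toy_dists = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.3), (2, 0.6)],
]

print(probability_matrix(toy_dists))  # [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(dominant_topics(toy_dists))     # [(0, 0.7), (2, 0.6)]
```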


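## 4. Word2vec
The cell below trains a word2vec model on the tokenized articles. Unlike in the LDA step, the tokens are neither stemmed nor filtered for stop words.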

In [11]:
os.chdir('C:/Users/rasmusa/Desktop/python/Chinese_Risk/data/output')

## Libraries to download
from nltk.tokenize import RegexpTokenizer
#from nltk.corpus import stopwords
#from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from gensim import corpora, models
import gensim
import csv

#from gensim.parsing.preprocessing import STOPWORDS

## Tokenizing
tokenizer = RegexpTokenizer(r'\w+')

# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()


## Reading the data
import json
#import nltk
#import re
import pandas


## English Stopwords
English_Stopwords = open("English_Stopwords.txt").read()
English_Stopwords1=English_Stopwords.split('\n')


# list for tokenized documents in loop
texts = []

# loop through document list
for i in doc_set:

    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]


    texts.append(words)


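# Train 150-dimensional word vectors; words that occur fewer than 2 times are dropped.
# Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
model = gensim.models.Word2Vec(
    texts,
    size=150,
    window=10,
    min_count=2,
    workers=10)

# The constructor already trains on `texts`; this call runs 10 further epochs.
model.train(texts, total_examples=len(texts), epochs=10)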

(6994419, 9000230)

### 4.1 Words most similar to risk
We can now explore the 100 words that are most similar to "risk" and to "finance". Each result is a pair of a word and its similarity score.
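
The score that gensim reports for each word is the cosine similarity between word vectors. The sketch below reproduces the idea in plain Python on made-up two-dimensional vectors; `cosine_similarity`, `most_similar_toy` and `toy_vectors` are hypothetical names for illustration, and the real model uses the 150-dimensional vectors it learned from the articles.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors, chosen only so the arithmetic is easy to follow.
toy_vectors = {"risk": [1.0, 0.0], "risks": [1.0, 1.0], "payout": [0.0, 2.0]}
toy_result = most_similar_toy("risk", toy_vectors)
print(toy_result)
```

A score of 1 means the two vectors point in the same direction, and a score of 0 means they are orthogonal.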

In [8]:
# query the trained vectors through model.wv; calling most_similar on the model itself is deprecated
result_risk = model.wv.most_similar(['risk'], topn=100)
result_risk



[('risks', 0.6448798179626465),
 ('leverage', 0.487862765789032),
 ('problem', 0.43997374176979065),
 ('conditions', 0.4376317858695984),
 ('negative', 0.417447566986084),
 ('nature', 0.40873077511787415),
 ('unstable', 0.3954346776008606),
 ('payout', 0.3952161967754364),
 ('opaque', 0.3927227258682251),
 ('quality', 0.3884698152542114),
 ('standards', 0.3877885341644287),
 ('backdrop', 0.3865804672241211),
 ('stable', 0.378500372171402),
 ('distortion', 0.3751464784145355),
 ('hazard', 0.3748941123485565),
 ('defaults', 0.374076247215271),
 ('fairly', 0.36818063259124756),
 ('causes', 0.366767019033432),
 ('argument', 0.365559458732605),
 ('given', 0.3650939464569092),
 ('painful', 0.3640231192111969),
 ('probability', 0.3633217215538025),
 ('danger', 0.362758994102478),
 ('proponent', 0.3617668151855469),
 ('grind', 0.36068451404571533),
 ('transparency', 0.36030688881874084),
 ('problems', 0.3601911664009094),
 ('regulation', 0.35937705636024475),
 ('roules', 0.35761839151382446),


In [94]:
# query the trained vectors through model.wv; calling most_similar on the model itself is deprecated
result_finance = model.wv.most_similar(['finance'], topn=100)
result_finance



[('financing', 0.4269069731235504),
 ('proposed', 0.38867759704589844),
 ('govern', 0.37845876812934875),
 ('jiwei', 0.37800031900405884),
 ('prime', 0.37565547227859497),
 ('banking', 0.3700164556503296),
 ('lending', 0.3570162057876587),
 ('commerce', 0.35553139448165894),
 ('securitiesfinance', 0.3537396490573883),
 ('tsipras', 0.35154497623443604),
 ('mckinsey', 0.34610962867736816),
 ('jointly', 0.34447425603866577),
 ('tighten', 0.3406780958175659),
 ('drafting', 0.33930468559265137),
 ('regulate', 0.33681434392929077),
 ('trustee', 0.32900261878967285),
 ('chiefly', 0.323551207780838),
 ('factual', 0.32240742444992065),
 ('nagano', 0.31778717041015625),
 ('broaden', 0.3177682161331177),
 ('jams', 0.31592032313346863),
 ('padoan', 0.31486770510673523),
 ('amplified', 0.3141745924949646),
 ('guidelines', 0.3136821985244751),
 ('poorest', 0.31070470809936523),
 ('benefitting', 0.3103598952293396),
 ('peer', 0.3101802468299866),
 ('designed', 0.3046668767929077),
 ('shinzo', 0.30075

0               risks
1            leverage
2             problem
3          conditions
4           standards
5              nature
6              opaque
7               given
8             painful
9            backdrop
10           unstable
11          extremely
12          technical
13             stable
14          situation
15             danger
16          vasudevan
17           solvency
18          proponent
19         regulation
20           defaults
21         governance
22            quality
23          unwinding
24      deterioration
25             causes
26              avoid
27    vulnerabilities
28         volatility
29              risky
           ...       
70             roules
71          deflation
72             assess
73         corporates
74         structured
75          exhibited
76               pain
77             mature
78         assessment
79        possibility
80               seem
81             shocks
82              moral
83         allocation
84        