<a href="https://colab.research.google.com/github/kobemawu/www/blob/master/LDA_EN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLTK Corpus Analysis with Gensim's LDA Model 

## Preparation
First, install the required libraries with pip:
* nltk
* gensim
* pyLDAvis

In [None]:
!pip install nltk
!pip install gensim
!pip install pyLDAvis

After installing the dependencies, download the following NLTK datasets.

In [None]:
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("reuters")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Datasets
Load the Reuters corpus from the NLTK package.

In [None]:
from nltk.corpus import reuters as corpus

Let us check out the content of the corpus.

In [None]:
# In some environments, you may need to unzip the corpus manually first:
#!unzip /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora

In [None]:
# print the first 300 tokens of the first document, 25 tokens per line
for n, item in enumerate(corpus.words(corpus.fileids()[0])[:300]):
    print(item, end=" ")
    if n % 25 == 24:
        print(" ")

ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears  
among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said . They  
told Reuter correspondents in Asian capitals a U . S . Move against Japan might boost protectionist sentiment in the U . S . And  
lead to curbs on American imports of their products . But some exporters said that while the conflict would hurt them in the long -  
run , in the short - term Tokyo ' s loss might be their gain . The U . S . Has said it will  
impose 300 mln dlrs of tariffs on imports of Japanese electronics goods on April 17 , in retaliation for Japan ' s alleged failure to  
stick to a pact not to sell semiconductors on world markets at below cost . Unofficial Japanese estimates put the impact of the tariffs at  
10 billion dlrs and spokesmen for major electronics firms said they would virtually halt exports 

Check the total number of documents.

In [None]:
len(corpus.fileids())

10788

You can train the model on the first K documents or on all documents.

In [None]:
# First K documents
# K=1000
# docs=[corpus.words(fileid) for fileid in corpus.fileids()[:K]]

# All documents
docs=[corpus.words(fileid) for fileid in corpus.fileids()]

print(docs[:5])
print("num of docs:", len(docs))

[['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...], ['CHINA', 'DAILY', 'SAYS', 'VERMIN', 'EAT', '7', '-', ...], ['JAPAN', 'TO', 'REVISE', 'LONG', '-', 'TERM', ...], ['THAI', 'TRADE', 'DEFICIT', 'WIDENS', 'IN', 'FIRST', ...], ['INDONESIA', 'SEES', 'CPO', 'PRICE', 'RISING', ...]]
num of docs: 10788


## Data preprocessing
First, let us define the stopwords. We combine the English stopwords from the NLTK package with noise tokens (punctuation, numbers, and very frequent domain words) that may distort the LDA result.  
(Optional) Try filtering out numbers and unwanted words with a regular expression.

In [None]:
# English stopwords defined by the NLTK package.
en_stop = nltk.corpus.stopwords.words('english')

# Noise tokens that might distort the result.
en_stop = ["``","/",",.",".,",";","--",":",")","(",'"','&',"'",'),',',"','-','.,','.,"','.-',"?",">","<"]                  \
         +["0","1","2","3","4","5","6","7","8","9","10","11","12","86","1986","1987","000"]                                                      \
         +["said","say","u","v","mln","ct","net","dlrs","tonne","pct","shr","nil","company","lt","share","year","billion","price"]          \
         +en_stop
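
The optional regular-expression filter mentioned above could look roughly like the following. This is a minimal sketch, not part of the original pipeline: the helper name `is_noise` and the rule "a token with no ASCII letter is noise" are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: a token counts as noise if it contains no ASCII letter,
# which covers pure numbers ("1986", "000") and punctuation runs (".-", "--").
_HAS_LETTER = re.compile(r"[a-z]", re.IGNORECASE)

def is_noise(token):
    return _HAS_LETTER.search(token) is None

tokens = ["oil", "1986", ".-", "opec", "000", "u"]
kept = [t for t in tokens if not is_noise(t)]
print(kept)  # ['oil', 'opec', 'u']
```

A rule like this would replace most of the hand-written number and punctuation entries, while domain words such as "mln" or "dlrs" would still need the explicit list.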

Next, let us define several preprocessing functions.

In [None]:
from nltk.corpus import wordnet as wn # used for lemmatization

def preprocess_word(word, stopwordset):
    
    #1.convert words to lowercase (e.g., Python =>python)
    word=word.lower()
    
    #2.remove "," and "."
    if word in [",","."]:
        return None
    
    #3.remove stopwords  (e.g., the => (None)) 
    if word in stopwordset:
        return None
    
    #4.lemmatize  (e.g., cooked=>cook)
    lemma = wn.morphy(word)
    if lemma is None:
        return word

    # lemmatized words could be in the stopwords set
    elif lemma in stopwordset: 
        return None
    else:
        return lemma
    

def preprocess_document(document):
    document=[preprocess_word(w, en_stop) for w in document]
    document=[w for w in document if w is not None]
    return document

def preprocess_documents(documents):
    return [preprocess_document(document) for document in documents]
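
To see why the stopword check runs a second time after lemmatization, here is a small self-contained sketch of the same steps. It is only an illustration: `toy_stop` and `toy_lemmas` are hypothetical stand-ins for `en_stop` and WordNet's `morphy`, so the example runs without any NLTK data.

```python
# Toy stand-ins (assumptions): a tiny stopword set and a lemma table
# replacing WordNet's morphy, so the example runs without NLTK data.
toy_stop = frozenset({"the", "say", "u"})
toy_lemmas = {"says": "say", "cooked": "cook", "exporters": "exporter"}

def toy_preprocess_word(word, stopwordset):
    word = word.lower()                 # 1. lowercase
    if word in stopwordset:             # 2. drop stopwords
        return None
    lemma = toy_lemmas.get(word, word)  # 3. lemmatize, fall back to the word
    # 4. the lemma itself may be a stopword ("says" -> "say")
    return None if lemma in stopwordset else lemma

raw = ["The", "Exporters", "says", "cooked", "U"]
cleaned = [w for w in (toy_preprocess_word(t, toy_stop) for t in raw) if w is not None]
print(cleaned)  # ['exporter', 'cook']
```

Here "says" survives the first check but is removed by the second one, because its lemma "say" is a stopword. Using a `frozenset` instead of a list also makes each membership test constant-time, which may matter when processing all 10788 documents.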

Let us check out the preprocessing result.

In [None]:
# before
print(docs[0][:25]) 

# after
print(preprocess_documents(docs)[0][:25])

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears']
['asian', 'exporter', 'fear', 'damage', 'japan', 'rift', 'mounting', 'trade', 'friction', 'japan', 'raise', 'fear', 'among', 'many', 'asia', 'exporting', 'nation', 'row', 'could', 'inflict', 'far', 'reaching', 'economic', 'damage', 'businessmen']


Next, we need to convert the documents into the format that the gensim LDA model expects.

In [None]:
import gensim
from gensim import corpora

In [None]:
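# preprocess once and reuse the result
processed_docs = preprocess_documents(docs)
# build the dictionary (mapping between words and integer IDs)
dictionary = corpora.Dictionary(processed_docs)
# construct the bag-of-words corpus
corpus_ = [dictionary.doc2bow(doc) for doc in processed_docs]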
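
Conceptually, the dictionary maps each distinct token to an integer ID, and the bag-of-words form counts how often each ID occurs in a document. The pure-Python sketch below mimics that idea; the helper names are hypothetical, and the first-seen ID order is an assumption of this sketch that may differ from the IDs gensim actually assigns.

```python
from collections import Counter

def build_token2id(documents):
    """Assign each new token the next free integer ID, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["oil", "price", "oil"], ["wheat", "export", "price"]]
toy_ids = build_token2id(toy_docs)
print(toy_ids)   # {'oil': 0, 'price': 1, 'wheat': 2, 'export': 3}
print(to_bow(["oil", "oil", "wheat", "unknown"], toy_ids))  # [(0, 2), (2, 1)]
```

Word order is discarded in this representation, and tokens missing from the dictionary are silently dropped.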

Let us check out the contents of the built dictionary and corpus.

In [None]:
# token2id is the attribute that maps each word to its dictionary ID

print(dictionary.token2id)



In [None]:
# corpus_ stores each document as a list of (word ID, frequency) pairs

# note that the pairs follow the order of dictionary IDs, not the order in which the words appear in the document
print(corpus_[0][:10]) 

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1)]


Let us compare an original sentence with its bag-of-words representation, which is the form the LDA model consumes.

In [None]:
# before
print([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]])

# after
print(dictionary.doc2bow([w.lower() for w in corpus.sents(corpus.fileids()[0])[0]]))

## Training

In [None]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus_,
                                           num_topics=20,
                                           id2word=dictionary,
                                           alpha=0.1,                 # optional: document-topic prior (LDA hyperparameter alpha)
                                           eta=0.1,                   # optional: topic-word prior (LDA hyperparameter beta, named eta in gensim)
                                           #minimum_probability=0.0   # optional: topics with a probability below this threshold are not reported
                                          )
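
To build some intuition for the `alpha` hyperparameter, the sketch below draws document-topic distributions from a symmetric Dirichlet prior using only the standard library. It illustrates the prior alone and is not gensim's training procedure; `sample_dirichlet` is a hypothetical helper.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    # Normalizing independent Gamma(alpha, 1) draws yields a Dirichlet sample.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)
for alpha in (0.1, 10.0):
    theta = sample_dirichlet(alpha, 20, rng)
    print(alpha, "largest topic weight:", round(max(theta), 3))
```

A small value such as 0.1 tends to concentrate most of the probability mass on a few topics per document, while a large value spreads it more evenly across all 20 topics.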

Check out the learned parameters.

In [None]:
# for each topic, show the top num_words words as (topic ID, 'probability*"word" + ...')

topics = ldamodel.print_topics(num_words=15)
for topic in topics:
    print(topic)

(0, '0.016*"expect" + 0.016*"quarter" + 0.014*"report" + 0.014*"earnings" + 0.012*"increase" + 0.009*"sales" + 0.009*"1985" + 0.008*"export" + 0.008*"last" + 0.008*"first" + 0.007*"result" + 0.007*"oil" + 0.007*"product" + 0.006*"revenue" + 0.006*"see"')
(1, '0.022*"oil" + 0.013*"opec" + 0.011*"saudi" + 0.009*"bpd" + 0.009*"crude" + 0.008*"would" + 0.008*"government" + 0.006*"production" + 0.006*"port" + 0.006*"raise" + 0.005*"50" + 0.005*"output" + 0.005*"coin" + 0.005*"arabia" + 0.005*"barrel"')
(2, '0.036*"january" + 0.030*"february" + 0.021*"rose" + 0.016*"surplus" + 0.015*"december" + 0.014*"deficit" + 0.013*"rise" + 0.010*"offer" + 0.010*"figure" + 0.010*"account" + 0.009*"revise" + 0.009*"fell" + 0.008*"adjust" + 0.008*"compare" + 0.008*"current"')
(3, '0.021*"wheat" + 0.019*"corn" + 0.014*"department" + 0.014*"export" + 0.012*"week" + 0.010*"february" + 0.010*"january" + 0.009*"soybean" + 0.008*"loan" + 0.008*"usda" + 0.008*"agriculture" + 0.007*"taiwan" + 0.007*"grain" + 0.007
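
The topic strings printed above are convenient to read but awkward to process. A minimal sketch for turning such a string into (word, probability) pairs with a regular expression follows; `parse_topic` is a hypothetical helper, and gensim's own `show_topic` method should return comparable pairs directly.

```python
import re

# Matches one weight*"word" term in the string produced by print_topics.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.022*"oil" + 0.013*"opec"' into [('oil', 0.022), ('opec', 0.013)]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

sample = '0.022*"oil" + 0.013*"opec" + 0.011*"saudi"'
print(parse_topic(sample))  # [('oil', 0.022), ('opec', 0.013), ('saudi', 0.011)]
```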

In [None]:
# for each document, show the topics whose probability exceeds minimum_probability, as [(topic ID, probability)]

for n,item in enumerate(corpus_[:10]):
    print("document ID "+str(n)+":" ,end="")
    print(ldamodel.get_document_topics(item))

document ID 0:[(5, 0.021844955), (6, 0.31530783), (12, 0.020751594), (13, 0.1513013), (15, 0.4111117), (16, 0.076291814)]
document ID 1:[(2, 0.061180912), (5, 0.041856203), (10, 0.78335434), (11, 0.08553141)]
document ID 2:[(0, 0.07458029), (2, 0.051225573), (6, 0.2539899), (11, 0.14738245), (15, 0.2743513), (16, 0.18596743)]
document ID 3:[(0, 0.582091), (2, 0.097981416), (3, 0.05456297), (18, 0.24833897)]
document ID 4:[(0, 0.3120863), (2, 0.028656058), (3, 0.058121484), (8, 0.044398688), (9, 0.060160752), (11, 0.06853315), (13, 0.14956833), (15, 0.22724403), (16, 0.04022898)]
document ID 5:[(1, 0.8856952), (8, 0.02569027), (15, 0.07356801)]
document ID 6:[(0, 0.10329278), (1, 0.01599051), (5, 0.016534308), (8, 0.020233648), (9, 0.12145227), (12, 0.12480388), (13, 0.24918804), (15, 0.34344)]
document ID 7:[(3, 0.55864465), (14, 0.24422465), (15, 0.143993)]
document ID 8:[(5, 0.65719223), (9, 0.3170888)]
document ID 9:[(0, 0.12270873), (9, 0.1125541), (10, 0.14849241), (12, 0.04964843
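
A common next step is to reduce each distribution to its single most probable topic. A minimal sketch, where `dominant_topic` is a hypothetical helper and the input is the distribution of document 0 copied from the output above:

```python
def dominant_topic(doc_topics):
    """Return the (topic ID, probability) pair with the highest probability."""
    return max(doc_topics, key=lambda pair: pair[1])

# Topic distribution of document 0, copied from the output above.
doc0 = [(5, 0.021844955), (6, 0.31530783), (12, 0.020751594),
        (13, 0.1513013), (15, 0.4111117), (16, 0.076291814)]
print(dominant_topic(doc0))  # (15, 0.4111117)
```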

In [None]:
# the categories of documents
categories = [corpus.categories(fileid) for fileid in corpus.fileids()]

Let us check out the `n`-th document in the result.

In [None]:
n=0

# nth document's topic distribution
print(ldamodel.get_document_topics(corpus_[n]))

# nth document's category
print(categories[n])

# show the original document
print(" ".join(docs[n]))

[(5, 0.021852423), (6, 0.315329), (12, 0.020674141), (13, 0.15133592), (15, 0.41112706), (16, 0.07629063)]
['trade']
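
To compare topics with the Reuters categories across many documents, one option is to tally, per category, how often each topic is the most probable one. The sketch below assumes one dominant topic ID per document is already available; the helper name and the toy inputs are hypothetical and are not results from the trained model.

```python
from collections import Counter, defaultdict

def tally_topics_by_category(categories, dominant_topics):
    """Count how often each dominant topic occurs within each category."""
    table = defaultdict(Counter)
    for cats, topic in zip(categories, dominant_topics):
        for cat in cats:  # a Reuters document can carry several categories
            table[cat][topic] += 1
    return table

# Hypothetical toy inputs, not results from the trained model.
toy_categories = [["trade"], ["grain"], ["trade", "grain"], ["crude"]]
toy_dominant = [15, 3, 15, 1]
table = tally_topics_by_category(toy_categories, toy_dominant)
print(table["trade"].most_common(1))  # [(15, 2)]
```

If a category is dominated by one topic, that topic is a reasonable candidate for representing the category.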


## Visualization
We can further analyze our result through visualization.

In [None]:
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()

In [None]:
# visualizing the result takes about 20 minutes if the model was trained on all documents
# note that gensim numbers topics from 0 to K-1, whereas pyLDAvis numbers them from 1 to K


lda_display = pyLDAvis.gensim_models.prepare(ldamodel, corpus_, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)
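
Because of the index offset noted in the comments, it can help to convert explicitly between the two numbering schemes. The helpers below are hypothetical and assume `sort_topics=False`, as in the call above, so that the topic order is preserved and only the starting index differs.

```python
def gensim_to_pyldavis(topic_id):
    """Map a 0-based gensim topic ID to the 1-based label shown by pyLDAvis."""
    return topic_id + 1

def pyldavis_to_gensim(label):
    """Map a 1-based pyLDAvis label back to the 0-based gensim topic ID."""
    return label - 1

print(gensim_to_pyldavis(15))  # 16
print(pyldavis_to_gensim(1))   # 0
```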