This notebook follows the approach in this blog post:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

In [1]:
from gensim import corpora, models 
from scripts.normalization import normalize_corpus
import numpy as np
from docx import Document
import sys
import os
import gensim
from gensim.models import Phrases
from gensim.models.phrases import Phraser
import pickle
import nltk
from collections import Counter
#import pyLDAvis
#import pyLDAvis.gensim  # don't skip this
python_root = './scripts'
sys.path.insert(0, python_root)

import normalization_spacy as util
from contractions import CONTRACTION_MAP



#### Load data

In [2]:
# use a context manager so the file handle is closed after loading
with open('./data/xml_docs.p', 'rb') as f:
    doc_dict = pickle.load(f)
ids = list(doc_dict.keys())
print('sample document ids: \n',ids[:5],'\n')
test_docs = doc_dict[ids[0]]
print('sample paragraphs: \n',test_docs.paras[0])

sample document ids: 
 ['9781451823295', '9781462328451', '9781451806069', '9781451815733', '9781451814002'] 

sample paragraphs: 
 1. As a small, open, tourism-based economy, St. Lucia is highly vulnerable to exogenous shocks. Tourism accounts for over three-quarters of exports, and the import content of both consumption and foreign direct investment (FDI) is very high (Figure 1). The economy has been buffeted by the global economic downturn, which has hobbled the tourism and construction sectors, with potential spillovers to the financial sector.


In [3]:
## flatten all paragraphs into a single list
paras = [doc_dict[i].paras for i in ids]
corpus = list()
for ps in paras:
    corpus.extend(ps)

print('Total number of paragraphs in the corpus: {}'.format(len(corpus)))

Total number of paragraphs in the corpus: 255915
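
The flattening step above can also be written with `itertools.chain`, which avoids the explicit loop. This is a minimal sketch on made-up data: `paras_per_doc` is a hypothetical stand-in for `[doc_dict[i].paras for i in ids]`, not the real corpus.

```python
from itertools import chain

# Hypothetical stand-in for [doc_dict[i].paras for i in ids]:
# one list of paragraph strings per document (a document may be empty).
paras_per_doc = [
    ["1. First paragraph.", "2. Second paragraph."],
    [],
    ["1. Only paragraph of another document."],
]

# chain.from_iterable concatenates the inner lists in document order
flat = list(chain.from_iterable(paras_per_doc))

print('Total number of paragraphs in the corpus: {}'.format(len(flat)))
```

Both versions preserve document order, so paragraph indices in the flat list stay comparable.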


### Tokenize and lemmatize corpus

In [4]:
import en_core_web_md
nlp = en_core_web_md.load()

In [5]:
## Lemmatize the corpus, or load cached lemmas (one paragraph per line)
n_core = 30   # 1 = single-threaded; >1 = multi-threaded via nlp.pipe
load = True   # True: read cached lemmas from disk instead of recomputing
trigram_reviews_filepath = 'data/lemma_docs.txt'
if load:
    with open(trigram_reviews_filepath, 'r', encoding='utf_8') as f:
        docs_lemma = [d.strip('\n').split() for d in f]
else:
    if n_core == 1:
        # single-threaded: results stay in memory and are not cached to disk
        docs = [nlp(d) for d in corpus]
        docs_lemma = [[token.lemma_ for token in doc if not util.punct_space(token)] for doc in docs]
    else:
        # multi-threaded: stream lemmas to disk, one paragraph per line.
        # Note: n_threads is deprecated in later spaCy releases, which use n_process instead.
        with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:
            for doc in nlp.pipe(corpus, batch_size=10000, n_threads=n_core):
                lemmas = [token.lemma_ for token in doc if not util.punct_space(token)]
                f.write(' '.join(lemmas) + '\n')

        # read the cached lemmas back as lists of tokens
        with open(trigram_reviews_filepath, 'r', encoding='utf_8') as f:
            docs_lemma = [d.strip('\n').split() for d in f]

In [6]:
print(corpus[3])
print(docs_lemma[3])

4. Real GDP growth slowed in 2007-08. Spurred by preparations for the Cricket World Cup, St. Lucia’s economy grew by about 5 percent in 2006. However, slowing construction and tourism activity, together with a hurricane-induced contraction in banana exports, reduced growth to an estimated 1.7 percent and 0.7 percent in 2007 and 2008, respectively. The unemployment rate increased by three percentage points to 16.8 percent during the same period. Despite being underpinned by the regional currency board arrangement, annual inflation reached 7.2 percent in 2008, reflecting high international prices of energy and food. With the decline in these prices, inflation has fallen to 3.2 percent by end-March 2009.
['4', 'real', 'gdp', 'growth', 'slow', 'in', '2007', '08', 'spur', 'by', 'preparation', 'for', 'the', 'cricket', 'world', 'cup', 'st.', 'lucia', '’s', 'economy', 'grow', 'by', 'about', '5', 'percent', 'in', '2006', 'however', 'slow', 'construction', 'and', 'tourism', 'activity', 'together
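
The helper `util.punct_space` is not shown in this notebook. The output above (no periods or commas, `2007-08` split into `'2007', '08'`) suggests it filters out punctuation and whitespace tokens. Below is a minimal sketch under that assumption, using spaCy's real `is_punct` and `is_space` token attributes; `FakeToken` is a hypothetical stand-in so the example runs without a spaCy model, and the actual helper may differ.

```python
from collections import namedtuple

# Hypothetical stand-in exposing the spaCy Token attributes the sketch relies on
FakeToken = namedtuple("FakeToken", ["lemma_", "is_punct", "is_space"])


def punct_space(token):
    """Assumed behavior: True for tokens that are pure punctuation or whitespace."""
    return token.is_punct or token.is_space


def lemmatize(doc):
    """Keep the lemma of every token that is not punctuation or whitespace."""
    return [tok.lemma_ for tok in doc if not punct_space(tok)]


doc = [
    FakeToken("real", False, False),
    FakeToken("gdp", False, False),
    FakeToken(".", True, False),    # punctuation: dropped
    FakeToken("\n", False, True),   # whitespace: dropped
]
lemmas = lemmatize(doc)
print(lemmas)
```

With a real pipeline, `doc` would be the result of `nlp(text)` and the same comprehension would apply unchanged.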

### Bigram and Trigram transform

In [7]:
train_phrase_model = False   # True: retrain the phrase models; False: load saved transformers
bigram_transformer_path = os.path.join('data','bigram_transformer')
trigram_transformer_path = os.path.join('data','trigram_transformer')
# connecting words allowed inside a detected phrase
common_terms = ['a','an','of',',','i','about','to',"with", "without"]

if train_phrase_model:
    # first pass learns bigrams; second pass runs on the bigram output to learn trigrams
    paras = util.phrase_detect_train(docs_lemma, min_count=10, threshold=15,
                                     common_terms=common_terms,
                                     phrase_model_save_path='./data/bigram')
    paras = util.phrase_detect_train(paras, min_count=10, threshold=15,
                                     common_terms=common_terms,
                                     phrase_model_save_path='./data/trigram')
else:
    # apply the saved bigram and trigram transformers to the lemmatized paragraphs
    bigram_transformer = Phraser.load(bigram_transformer_path)
    trigram_transformer = Phraser.load(trigram_transformer_path)
    paras = util.phrase_detect(bigram_transformer, trigram_transformer, docs_lemma)
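
Applying a bigram pass and then a second pass over its output is what yields trigrams: the second pass sees an already merged pair as a single token. The toy sketch below illustrates only that idea. `merge_phrases` and the two phrase sets are hypothetical; this is not gensim's implementation, which also scores candidates and handles `common_terms`.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Greedily join adjacent token pairs that appear in `phrases` (toy version)."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase sets; a trained model would learn these from counts
bigrams = {("foreign", "direct")}
trigrams = {("foreign_direct", "investment")}

tokens = ["foreign", "direct", "investment", "be", "high"]
first_pass = merge_phrases(tokens, bigrams)        # joins "foreign" + "direct"
second_pass = merge_phrases(first_pass, trigrams)  # joins the bigram with "investment"
print(first_pass)
print(second_pass)
```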
    

- Examine phrases

In [8]:
# 'data/trigram' holds the full Phrases model; export_phrases needs its counts
trigram = Phrases.load('data/trigram')
trigram_transformer = Phraser.load('data/trigram_transformer')
for phrase, score in trigram.export_phrases(docs_lemma[:2]):
    print(phrase, score)

b'st. lucia' 8940797.333333334
b'highly vulnerable' 111.43352532509945
b'exogenous shock' 530.652559063452
b'three quarter' 42.71415494158645
b'import content' 117.81433383936209
b'foreign direct' 40.30253436117819
b'figure 1' 23.037583272373663
b'exogenous shocks' 140299.9065190652
b'rac esf' 81261.88405797101
b'rac esf' 81261.88405797101
b'adverse impact' 22.163130299899233
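
To help read the scores above: gensim's default phrase scorer computes `(bigram_count - min_count) / (count_a * count_b) * vocab_size` and accepts a pair when the score exceeds `threshold`. The sketch below assumes `util.phrase_detect_train` uses that default scorer, which this notebook does not confirm, and all counts are hypothetical.

```python
def default_phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the pair (a, b) under gensim's default scoring formula."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


# Hypothetical counts, not taken from the corpus
score = default_phrase_score(
    count_a=200,       # occurrences of the first word
    count_b=150,       # occurrences of the second word
    count_ab=110,      # occurrences of the two words side by side
    vocab_size=50000,  # number of distinct tokens in the vocabulary
    min_count=10,      # same value as in the training call above
)

threshold = 15  # same value as in the training call above
is_phrase = score > threshold
print(score, is_phrase)
```

Under this formula, words that rarely occur apart produce a small denominator and hence a large score, which would be consistent with a pair like `st. lucia` scoring far above the others.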


In [10]:
with open('./data/processed_corpus.p','wb') as f:
    pickle.dump(paras,f)

In [12]:
# with open('./data/processed_corpus.p','rb') as f:
#     cs = pickle.load(f)