## Practice with the Gensium Tutorial for LDA Topic Modeling

Topic modeling is a technique for taking some unstructured text and automatically extracting its common themes, it is a great way to get a bird's eye view on a large text collection. 

Gensim = “Generate Similar” is a popular open source natural language processing library used for unsupervised topic modeling.

Gensim uses top academic models and modern statistical machine learning to perform various complex tasks such as −

* Building document or word vectors
* Corpora
* Performing topic identification
* Performing document comparison (retrieving semantically similar documents)
* Analysing plain-text documents for semantic structure

However, unlike Scikit-Learn, Gensim doesn’t do any work on behalf of your documents for tokenization or stemming.

The Gensim library uses a popular algorithm for doing topic model, namely Latent Dirichlet Allocation. Latent Dirichlet Allocation (LDA). LDA requires documents to be represented as a bag of words (for the gensim library, some of the API calls will shorten it to "bow"). This representation ignores word ordering in the document but retains information on how many times each word appears.

The main distinguishing feature for LDA is it allows for mixed membership, which means that each document can partially belong to several different topics. Note that the vocabulary probability will sum up to 1 for every topic, but often times, words that have lower weights will be truncated from the output.

Text modified from: 
* <https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA>
* <https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py>
* <https://www.tutorialspoint.com/gensim/index.htm>


In [1]:
## General Dependencies
import re
import numpy as np
import pandas as pd
from pprint import pprint
import sys, os
import glob
from tika import parser # pip install tika

## Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim import models
#from gensim.models.coherencemodel import CoherenceModel
from gensim.models import CoherenceModel
from gensim.models import LdaModel

## Preprocessing
import spacy
import nltk as nltk
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

## Plotting
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

## Other Libraries
from operator import itemgetter

## ScikitLearn
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS

  from PIL import PILLOW_VERSION
  from PIL import PILLOW_VERSION


In [2]:
## Import data on fetch 20 news groups
# from sklearn.datasets import fetch_20newsgroups
# newsgroups_train = fetch_20newsgroups(subset='train')
# data = newsgroups_train.data
# print(data[1])

In [3]:
directory = "News_Industry"
files = list(glob.glob(os.path.join(directory,'*.*')))
print(files)
#https://stackoverflow.com/questions/34000914/how-to-create-a-list-from-filenames-in-a-user-specified-directory-in-python
#https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
#https://stackoverflow.com/questions/33912773/python-read-txt-files-into-a-dataframe

['News_Industry\\Bibliography.10AGGRESSION AND PHYSICAL HEALTH IN MARRIED WOMEN.pdf', 'News_Industry\\Bibliography.12Impact of Socio-demographic Factors on Awareness of Smoking Effects on Oral Health among Smokers and.pdf', 'News_Industry\\Bibliography.17Health-Promoting Factors related to lifestyle among nursing students in University of Hail.pdf', 'News_Industry\\Bibliography.17Multinomial logit analysis of the effects of five different app-based incentives to encourage cyclin.pdf', 'News_Industry\\Bibliography.1PREVALENCE OF DYSLIPIDEMIA IN YOUNG ADULTS.pdf', 'News_Industry\\Bibliography.20Risk Factors for Atherosclerotic Cardiovascular Disease in the South Asian Population.pdf', 'News_Industry\\Bibliography.29Is the Gay Community the Neo-marginalised of Modern Society_.pdf', 'News_Industry\\Bibliography.33A Biological Effect of Sex Hormone Binding Globulin and Testosterone in Polycystic Ovary Syndrome (P.pdf', 'News_Industry\\Bibliography.34DETERMINANTS OF DEPRESSION ANXIETY STRESS

In [4]:
# Open files, convert from PDF to text file, append each file to a document list
#https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

document_list = []
for f in files:
    raw = parser.from_file(f)
    document_list.append(raw)

# print(document_list)

In [5]:
## Create a dataframe form the document list
text_df = pd.DataFrame(document_list)
text_df.head()
# print(text_df["content"][1])

Unnamed: 0,metadata,content,status
0,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
1,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
2,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
3,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
4,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200


In [6]:
## Pre-process the text to lower case, remove special characters, etc. 
## https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn
## Test regex here: https://pythex.org/

def preprocess(text):
    
    ## Lowercase words
    text_lower = text.lower()
    
    ## Remove Emails from text
    ## if you need to match a \, you can precede them with a backslash to remove their special meaning: \\.
    ## \S matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    ## Code below matches any character, then an @ sign, then more characters, end matching when a white space is found.
    text_email = re.sub('\\S*@\\S*\\s?', '', text_lower) 
    
    ## Remove URLS from text
    ## https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
    ## text_urls = re.sub(r'http\S+', '', text_email)
    ## https://www.geeksforgeeks.org/python-check-url-string/
    text_urls = re.sub(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))",'', text_email)
    
    
    ## Remove tabs and new lines from text
    ## https://stackoverflow.com/questions/16355732/how-to-remove-tabs-and-newlines-with-a-regex
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    text_spaces = re.sub(r'\s+',' ',text_urls)
        
    ## Remove \n from text
    text_space_character = text_spaces.replace('\n','')
    
    ## Remove \t from text
    text_tab_character = text_space_character.replace('\t','')
    
    ## Remove special characters and numbers
    ## \W matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
    ## \d matches any decimal digit; this is equivalent to the class [0-9]
    text_numbers = re.sub("(\\d|\\W)+"," ",text_tab_character)
    
    ## Remove tags
    ##text_tags = re.sub("","",text_numbers)

    ## Remove special characters and space, but leave in periods and numbers
    ## ^ means any character except. So [^5] will match any character except '5'
    ## [^a-zA-Z0-9_] matches any non-alphanumeric character.
    ## text_special = re.sub('[^A-Za-z0-9.]+|\s',' ',text_tab_character)
    
    ## Remove a sepcial list of terms
    ## https://stackoverflow.com/questions/15435726/remove-all-occurrences-of-words-in-a-string-from-a-python-list
    
    REMOVE_LIST = ['right reserved section',
                   'reserved section',
                   "length word byline", 
                   "byline", 
                   "word byline",
                   "journal code", 
                   "dr", 
                   "publication type magazine",
                   "type magazine",
                   "magazine",
                   "type newspaper",
                   "publication type newspaper",
                   'newspaper',
                   "group right reserved",
                   'section:',
                   'copyright',
                   'body',
                   'length:',
                   'keywords:',
                   'introduction',
                   'page',
                   'methodology',
                   'table',
                   'discussion',
                   'conclusions',
                   'references',
                   'classification',
                   'language',
                   'industry',
                   'geographic',
                   'load-date',
                   'end of document',
                   'mg dl',
                   'mg'
                   
                  ]

    remove = '|'.join(REMOVE_LIST)
    regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
    text_special_remove = regex.sub("", text_numbers)

    return text_special_remove

## New column "preprocess" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['preprocess'] = text_df['content'].apply(lambda x:preprocess(x))

print(text_df['preprocess'][1])

#https://www.machinelearningplus.com/nlp/lemmatization-examples-python/



In [7]:
## Tokenize the data using Gensim Utils Simple Preprocess

data_words = []
def tokenize(documents):
    for doc in documents:
        token_list = gensim.utils.simple_preprocess(str(doc), deacc=True)  # deacc=True removes punctuations
        data_words.append(token_list)
    return data_words


tokenize(text_df['preprocess'])
# print(type(data_words))
print(data_words[1])



In [8]:
## Remove Stopwords using a custom stopword list
documents_nostop_list = []

def remove_stopwords(documents):
    
    ##Open stop words text file and save to stop_set variable
    with open("stop_words.txt", 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        f.close()

    ##Stopword list comes from the Terrier pacakge with 733 words and another 86 custom terms: 
    ##https://github.com/kavgan/stop-words/blob/master/terrier-stop.txt
    ##https://github.com/kavgan/stop-words/blob/master/minimal-stop.txt
    
    ##Other stopword list options can be reviewed here:
    ##https://medium.com/towards-artificial-intelligence/stop-the-stopwords-using-different-python-libraries-ffa6df941653


    for doc in documents:

        # Remove stop words from token_list
        token_nostop_list = [i for i in doc if not i in stop_set]
        
        documents_nostop_list.append(token_nostop_list)
        
    return documents_nostop_list

remove_stopwords(data_words)
print(documents_nostop_list[1])



In [9]:
## Create Bigram and Trigram Tokens from non-stop word data, and then compare to stopword

bigram_token = []
trigram_token = []

def build_bigram_trigram_models(documents, documents_nostop):
    
    ##Building Bigram & Trigram Models
    ##higher threshold fewer phrases.
    bigram = gensim.models.Phrases(documents, min_count=5, threshold=100) 
    ## min_count: Ignore all words and bigrams with total collected count lower than this value.
    ## threshold: Represent a score threshold for forming the phrases (higher means fewer phrases).
    trigram = gensim.models.Phrases(bigram[documents], threshold=100)
        
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
        
   
    for doc in documents_nostop:
        bigram_token.append(bigram_mod[doc])
    
    for doc in bigram_token:
        trigram_token.append(trigram_mod[bigram_mod[doc]])
        
    return trigram_token


build_bigram_trigram_models(data_words, documents_nostop_list)

print(trigram_token[1])



In [10]:
## Lemmetize the Data

texts_out = []

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    
    nlp = spacy.load(r'C:\Users\keg827\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.3.1')
    #nlp = spacy.load('C:\Users\keg827\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.3.1', disable=['parser', 'ner'])
    
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    
    return texts_out


lemmatization(trigram_token, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(texts_out[1])

#pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
#https://stackoverflow.com/questions/54334304/spacy-cant-find-model-en-core-web-sm-on-windows-10-and-python-3-5-3-anacon

['socio_demographic', 'factor', 'awareness', 'smoking', 'effect', 'oral', 'health', 'smoker', 'non_smokers_dental', 'patient', 'visiting_private_clinic', 'impact', 'socio_demographic', 'factor', 'awareness', 'smoking', 'effect', 'oral', 'health', 'smoker', 'non_smokers_dental', 'patient', 'biomedical', 'research', 'society', 'asad', 'awareness', 'smoker', 'non', 'smoker', 'oral', 'health', 'global', 'tobacco', 'epidemic', 'major', 'public', 'health', 'concern', 'cause', 'people', 'death', 'world', 'fatality', 'death', 'occur', 'due', 'direct', 'consumption', 'tobacco', 'mortality', 'non', 'smoker', 'result', 'passive', 'smoke', 'owe', 'tobacco', 'consumption', 'low', 'middle', 'income', 'country', 'vulnerable', 'country', 'contribute', 'high', 'morbidity', 'mortality', 'rate', 'currently', 'world', 'smoker', 'reside', 'addition', 'harmful', 'effect', 'tobacco', 'health', 'productivity', 'loss', 'health', 'expenditure', 'significantly', 'contribute', 'economic', 'burden', 'smoking', 'es

In [11]:
##Run the gensim topic modeling and return the topics
##Code from: https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA

def get_gensim_corpus_dictionary(data):
    ##If content is not yet a list, make it a list and build the id2word dictionary and the corpus (map the word to id)
    ##texts = text_df['content'].apply(lambda x: x.split(' ')).tolist()
    ##print(texts)

    ##Build the id2word dictionary and the corpus
    ##The dictionary associates each word in the corpus with a unique integer ID
    dictionary = corpora.Dictionary(data)
    print('Number of unique tokens: ', len(dictionary))

    ## Filter out words that appear in less than 2 documents (appear only once),
    dictionary.filter_extremes(no_below = 2)

    ## Filter out words that appears in more than certain % of documents
    ## no_above = 0.5 would remove words that appear in more than 50% of the documents
    # dictionary.filter_extremes(no_above = 0.5)

    # Remove gaps in id sequence after words that were removed
    dictionary.compactify()
    print('Number of unique tokens used 2 or more times: ', len(dictionary))

    ##Use code below to print terms in dictionary with their IDs
    ##This will show you the number of the terms in the dictionary
    #print("Dictionary Tokens with ID: ")
    #pprint.pprint(dictionary.token2id)
    
    ##Map terms in corpus to words in dictionary with ID
    ##This will show you the ID of the term in the dictionary, and the number of times the terms occurs in the corpus
    bow_corpus = [dictionary.doc2bow(text) for text in data]
    #print("Tokens in Corpus with Occurrence: ")
    #pprint.pprint(corpus)
    
    ##Print word count by vector 
    id_words_count = [[(dictionary[id], count) for id, count in line] for line in bow_corpus]
    print("Word Count in each Vector: ")
    pprint(id_words_count[1])
    
     
    return bow_corpus, dictionary




bow_corpus, dictionary = get_gensim_corpus_dictionary(texts_out)

Number of unique tokens:  5966
Number of unique tokens used 2 or more times:  3048
Word Count in each Vector: 
[('affect', 2),
 ('argument', 1),
 ('association', 1),
 ('behavior', 2),
 ('care', 4),
 ('carry', 1),
 ('cause', 2),
 ('chronic', 1),
 ('community', 1),
 ('day', 1),
 ('demographic', 1),
 ('diabetes', 1),
 ('direct', 1),
 ('disorder', 1),
 ('domestic', 2),
 ('due', 2),
 ('education', 8),
 ('educational', 4),
 ('effect', 33),
 ('ensure', 2),
 ('estimate', 1),
 ('ethical', 1),
 ('evidence', 1),
 ('exclude', 1),
 ('family', 1),
 ('finally', 1),
 ('form', 3),
 ('frequency', 1),
 ('full', 1),
 ('gender', 7),
 ('harmful', 2),
 ('household', 4),
 ('intend', 1),
 ('involve', 1),
 ('knowledge', 1),
 ('linear', 1),
 ('married', 1),
 ('multiple', 2),
 ('negative', 1),
 ('passive', 2),
 ('people', 5),
 ('percentage', 1),
 ('person', 1),
 ('present', 1),
 ('procedure', 1),
 ('public', 3),
 ('questionnaire', 2),
 ('rate', 1),
 ('reason', 2),
 ('reliability', 1),
 ('response', 1),
 ('review'

In [12]:
## Run the Gensim Library LDA Model
## See link below if you want to save and load a model
## https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA

def run_gensim_LDA_model(corpus, dictionary):
    ##Directory for storing all lda models
    model_dir = 'lda_checkpoint'

    ##If model_dir directionry is not in the folder, then make the directory
    if not os.path.isdir(model_dir):
        os.mkdir(model_dir)

    ##Load the model if we've already trained it before
   
    path = os.path.join(model_dir, 'gensim_tutorial_topic_model.lda')
    if not os.path.isfile(path):
        ##Training LDA can take some time, we could set eval_every = None to not evaluate the model perplexity
        ##Other parameters for LdaModel, include: random_state=100, update_every=1,chunksize=100,passes=10,alpha='auto',per_word_topics=True
        topic_model = LdaModel(corpus, id2word = dictionary, num_topics = 2, iterations = 200, per_word_topics=True)
        topic_model.save(path)
 
    topic_model = LdaModel.load(path)

    # Each element of the list is a tuple containing the topic and word / probability list
    topics = topic_model.show_topics(num_words = 15, formatted = False)

    print(type(topics))
    
  
    
    return topic_model, topics

topic_model, topics = run_gensim_LDA_model(bow_corpus, dictionary)

<class 'list'>


In [13]:
# Save topics to CSV

def create_topic_CSV(topics):
    
    ##Create dataframe for topics
    df_topics = pd.DataFrame(topics, columns = ['TopicNum', 'Terms'])
    #df_topics.head()

    ## Save dataframe to csv
    with open(r"gensim_tutorial_topic_modeling.csv", 'w', encoding='utf-8') as file:
        df_topics.to_csv(file)
        file.close()
    
    return df_topics
    
create_topic_CSV(topics)

Unnamed: 0,TopicNum,Terms
0,0,"[(patient, 0.027202243), (diabetes, 0.00904576..."
1,1,"[(obesity, 0.01770158), (weight, 0.011255539),..."
2,2,"[(child, 0.0105029885), (weight, 0.008337434),..."
3,3,"[(diabetes, 0.010245176), (obesity, 0.00927900..."
4,4,"[(hypertension, 0.016347283), (obesity, 0.0144..."


In [14]:
## Test Model Perplexity and Coherence

def model_perplexity_coherence(bow_corpus, dictionary, texts_out, topic_model):
    
    ##Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. 
    ##In my experience, topic coherence score, in particular, has been more helpful.
    #https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#:~:text=Topic%20Modeling%20is%20a%20technique,in%20the%20Python's%20Gensim%20package.

    
    ##The LDA model (lda_model) we have created above can be used to compute the model’s perplexity, 
    ##i.e. how good the model is. The lower the score the better the model will be.
    # Compute Perplexity
    perplexity_lda = topic_model.log_perplexity(bow_corpus)
    print('\nPerplexity: ',  perplexity_lda)  # a measure of how good the model is. lower the better.
    
    ## Compute Coherence Score
#     coherence_model_lda = CoherenceModel(model=topic_model, texts=corpus, dictionary=dictionary, coherence='c_v')
#     coherence_lda = coherence_model_lda.get_coherence()
#     print('\nCoherence Score: ', coherence_lda)

    ##The LDA model (lda_model) we have created above can be used to compute the model’s coherence score 
    ##i.e. the average /median of the pairwise word-similarity scores of the words in the topic. 
    
    
    coherence_model_lda = CoherenceModel(model=topic_model, texts=texts_out, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)
    
    return perplexity_lda, coherence_lda

perplexity_lda, coherence_lda = model_perplexity_coherence(bow_corpus, dictionary, texts_out, topic_model)


Perplexity:  -8.838628522210273

Coherence Score:  0.6139217977356327


In [None]:
## Run the Gensim Library TFIDF Model 
##The words that will occur more frequently in the document will get the smaller weights.
##https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py
##new_list = []

tfidf_frequency = []

def run_gensim_tfidf_model(corpus, dictionary): 
    
    ##Initialize the tf-idf model, training it on our corpus 
    tfidf = models.TfidfModel(corpus)
    
    ##if working with a new document, you can get tfidf from the model
    #new_doc = "abbott bra adolesc".lower().split()
    #print(new_doc)
    #new_list.append(tfidf[dictionary.doc2bow(new_doc)])
    
    corpus_tfidf = tfidf[corpus]
    for doc in corpus_tfidf:
        ##pprint.pprint(doc)
        tfidf_frequency.append(doc)
    
    #Print word frequencies by vector 
    id_words_frequency = [[(dictionary[id], frequency) for id, frequency in line] for line in tfidf_frequency]
    print("Word Frequency by Vector: ")
    pprint.pprint(id_words_frequency[2])
    
run_gensim_tfidf_model(bow_corpus, dictionary)

#pprint.pprint(tfidf_frequency)
    