## Practice with the Gensium Tutorial for LDA Topic Modeling

Topic modeling is a technique for taking some unstructured text and automatically extracting its common themes, it is a great way to get a bird's eye view on a large text collection. 

Gensim = “Generate Similar” is a popular open source natural language processing library used for unsupervised topic modeling.

Gensim uses top academic models and modern statistical machine learning to perform various complex tasks such as −

* Building document or word vectors
* Corpora
* Performing topic identification
* Performing document comparison (retrieving semantically similar documents)
* Analysing plain-text documents for semantic structure

However, unlike Scikit-Learn, Gensim doesn’t do any work on behalf of your documents for tokenization or stemming.

The Gensim library uses a popular algorithm for doing topic model, namely Latent Dirichlet Allocation. Latent Dirichlet Allocation (LDA). LDA requires documents to be represented as a bag of words (for the gensim library, some of the API calls will shorten it to "bow"). This representation ignores word ordering in the document but retains information on how many times each word appears.

The main distinguishing feature for LDA is it allows for mixed membership, which means that each document can partially belong to several different topics. Note that the vocabulary probability will sum up to 1 for every topic, but often times, words that have lower weights will be truncated from the output.

Text modified from: 
* <https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA>
* <https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py>
* <https://www.tutorialspoint.com/gensim/index.htm>


In [1]:
## General Dependencies
import re
import numpy as np
import pandas as pd
from pprint import pprint
import sys, os
import glob
from tika import parser # pip install tika

## Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim import models
#from gensim.models.coherencemodel import CoherenceModel
from gensim.models import CoherenceModel
from gensim.models import LdaModel

## Preprocessing
import spacy
import nltk as nltk
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

## Plotting
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt

## Other Libraries
from operator import itemgetter

## ScikitLearn
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS

  from PIL import PILLOW_VERSION
  from PIL import PILLOW_VERSION


In [2]:
## Import data on fetch 20 news groups
# from sklearn.datasets import fetch_20newsgroups
# newsgroups_train = fetch_20newsgroups(subset='train')
# data = newsgroups_train.data
# print(data[1])

In [3]:
directory = "News_Industry"
files = list(glob.glob(os.path.join(directory,'*.*')))
print(files)
#https://stackoverflow.com/questions/34000914/how-to-create-a-list-from-filenames-in-a-user-specified-directory-in-python
#https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
#https://stackoverflow.com/questions/33912773/python-read-txt-files-into-a-dataframe

['News_Industry\\Bibliography.10AGGRESSION AND PHYSICAL HEALTH IN MARRIED WOMEN.pdf', 'News_Industry\\Bibliography.12Impact of Socio-demographic Factors on Awareness of Smoking Effects on Oral Health among Smokers and.pdf', 'News_Industry\\Bibliography.17Health-Promoting Factors related to lifestyle among nursing students in University of Hail.pdf', 'News_Industry\\Bibliography.17Multinomial logit analysis of the effects of five different app-based incentives to encourage cyclin.pdf', 'News_Industry\\Bibliography.1PREVALENCE OF DYSLIPIDEMIA IN YOUNG ADULTS.pdf', 'News_Industry\\Bibliography.20Risk Factors for Atherosclerotic Cardiovascular Disease in the South Asian Population.pdf', 'News_Industry\\Bibliography.29Is the Gay Community the Neo-marginalised of Modern Society_.pdf', 'News_Industry\\Bibliography.33A Biological Effect of Sex Hormone Binding Globulin and Testosterone in Polycystic Ovary Syndrome (P.pdf', 'News_Industry\\Bibliography.34DETERMINANTS OF DEPRESSION ANXIETY STRESS

In [4]:
# Open files, convert from PDF to text file, append each file to a document list
#https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file

document_list = []
for f in files:
    raw = parser.from_file(f)
    document_list.append(raw)

# print(document_list)

2021-01-08 15:47:11,875 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...


In [5]:
## Create a dataframe form the document list
text_df = pd.DataFrame(document_list)
text_df.head()
# print(text_df["content"][1])

Unnamed: 0,metadata,content,status
0,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
1,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
2,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
3,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200
4,"{'Content-Type': 'application/pdf', 'Creation-...",\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n...,200


In [6]:
text_df.to_csv('gensim_test.csv')

### Pre-process the text to lower case, remove special characters, etc. 

In [7]:
## Pre-process the text to lower case, remove special characters, etc. 
## https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn
## Test regex here: https://pythex.org/

def preprocess(text):
    
    ## Lowercase words
    text_lower = text.lower()
    
    ## Remove Emails from text
    ## if you need to match a \, you can precede them with a backslash to remove their special meaning: \\.
    ## \S matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    ## Code below matches any character, then an @ sign, then more characters, end matching when a white space is found.
    text_email = re.sub('\\S*@\\S*\\s?', '', text_lower) 
    
    ## Remove URLS from text
    ## https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
    ## text_urls = re.sub(r'http\S+', '', text_email)
    ## https://www.geeksforgeeks.org/python-check-url-string/
    text_urls = re.sub(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))",'', text_email)
    
    
    ## Remove tabs and new lines from text
    ## https://stackoverflow.com/questions/16355732/how-to-remove-tabs-and-newlines-with-a-regex
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    text_spaces = re.sub(r'\s+',' ',text_urls)
        
    ## Remove \n from text
    text_space_character = text_spaces.replace('\n','')
    
    ## Remove \t from text
    text_tab_character = text_space_character.replace('\t','')
    
    ## Remove special characters and numbers
    ## \W matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
    ## \d matches any decimal digit; this is equivalent to the class [0-9]
    text_numbers = re.sub("(\\d|\\W)+"," ",text_tab_character)
    
    ## Remove tags
    ##text_tags = re.sub("","",text_numbers)

    ## Remove special characters and space, but leave in periods and numbers
    ## ^ means any character except. So [^5] will match any character except '5'
    ## [^a-zA-Z0-9_] matches any non-alphanumeric character.
    ## text_special = re.sub('[^A-Za-z0-9.]+|\s',' ',text_tab_character)
    
    ## Remove a sepcial list of terms
    ## https://stackoverflow.com/questions/15435726/remove-all-occurrences-of-words-in-a-string-from-a-python-list
    
    REMOVE_LIST = ['right reserved section',
                   'reserved section',
                   "length word byline", 
                   "byline", 
                   "word byline",
                   "journal code", 
                   "dr", 
                   "publication type magazine",
                   "type magazine",
                   "magazine",
                   "type newspaper",
                   "publication type newspaper",
                   'newspaper',
                   "group right reserved",
                   'section:',
                   'copyright',
                   'body',
                   'length:',
                   'keywords:',
                   'introduction',
                   'page',
                   'methodology',
                   'table',
                   'discussion',
                   'conclusions',
                   'references',
                   'classification',
                   'language',
                   'industry',
                   'geographic',
                   'load-date',
                   'end of document',
                   'mg dl',
                   'mg'
                   
                  ]

    remove = '|'.join(REMOVE_LIST)
    regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
    text_special_remove = regex.sub("", text_numbers)

    return text_special_remove

## New column "preprocess" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['preprocess'] = text_df['content'].apply(lambda x:preprocess(x))

print(text_df['preprocess'][1])

#https://www.machinelearningplus.com/nlp/lemmatization-examples-python/



### Tokenize the data using Gensim Utils Simple Preprocess

In [8]:
## Tokenize the data using Gensim Utils Simple Preprocess

def tokenize(text):
    token_list = gensim.utils.simple_preprocess(str(text), deacc=True)  # deacc=True removes punctuations
    return token_list

## New column "tokens" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['tokens'] = text_df['preprocess'].apply(lambda x:tokenize(x))


print(text_df['tokens'][1])



In [9]:
## Tokenize the data using Gensim Utils Simple Preprocess

# data_words = []
# def tokenize(documents):
#     for doc in documents:
#         token_list = gensim.utils.simple_preprocess(str(doc), deacc=True)  # deacc=True removes punctuations
#         data_words.append(token_list)
#     return data_words


# tokenize(text_df['preprocess'])
# # print(type(data_words))
# print(data_words[1])

### Remove Stopwords using a custom stopword list

In [10]:
## Remove Stopwords using a custom stopword list


def remove_stopwords(text):
    
    ##Open stop words text file and save to stop_set variable
    with open("stop_words.txt", 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        f.close()

    ##Stopword list comes from the Terrier pacakge with 733 words and another 86 custom terms: 
    ##https://github.com/kavgan/stop-words/blob/master/terrier-stop.txt
    ##https://github.com/kavgan/stop-words/blob/master/minimal-stop.txt
    
    ##Other stopword list options can be reviewed here:
    ##https://medium.com/towards-artificial-intelligence/stop-the-stopwords-using-different-python-libraries-ffa6df941653

    ## Remove stop words from token_list
    ## https://stackoverflow.com/questions/29771168/how-to-remove-words-from-a-list-in-python
    token_nostop_list = [word for word in text if word not in stop_set]
        

    return token_nostop_list

## New column "no_stop" is formed from applying pre_process function to each item in the "content" column in dataframe
text_df['no_stop'] = text_df['tokens'].apply(lambda x:remove_stopwords(x))


print(text_df['no_stop'][1])



In [11]:
# ## Remove Stopwords using a custom stopword list
# documents_nostop_list = []

# def remove_stopwords(documents):
    
#     ##Open stop words text file and save to stop_set variable
#     with open("stop_words.txt", 'r', encoding="utf-8") as f:
#         stopwords = f.readlines()
#         stop_set = set(m.strip() for m in stopwords)
#         f.close()

#     ##Stopword list comes from the Terrier pacakge with 733 words and another 86 custom terms: 
#     ##https://github.com/kavgan/stop-words/blob/master/terrier-stop.txt
#     ##https://github.com/kavgan/stop-words/blob/master/minimal-stop.txt
    
#     ##Other stopword list options can be reviewed here:
#     ##https://medium.com/towards-artificial-intelligence/stop-the-stopwords-using-different-python-libraries-ffa6df941653


#     for doc in documents:

#         # Remove stop words from token_list
#         token_nostop_list = [i for i in doc if not i in stop_set]
        
#         documents_nostop_list.append(token_nostop_list)
        
#     return documents_nostop_list

# remove_stopwords(data_words)
# print(documents_nostop_list[1])

### Create Bigram and Trigram Tokens from non-stop word data

In [17]:
## Create Bigram and Trigram Tokens from non-stop word data, and then compare to stopword

 

def build_bigrams_trigrams(text):
#     print("This is the text:", text)
#     print("---------------------------------------")
    bigram_token = []
    trigram_token = []
    
    ##Building Bigram & Trigram Models
    ##higher threshold fewer phrases.
    bigram = gensim.models.Phrases(text, min_count=5, threshold=100) 
    ## min_count: Ignore all words and bigrams with total collected count lower than this value.
    ## threshold: Represent a score threshold for forming the phrases (higher means fewer phrases).
    trigram = gensim.models.Phrases(bigram[text], threshold=100)
#     print(bigram)
#     print(trigram)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
#     print(bigram_mod)
#     print(trigram_mod)

    bigram_token.append(bigram_mod[text])
    

    trigram_token.append(trigram_mod[bigram_mod[text]])
    
    print('This is the bigram_token list:', bigram_token)
    print("---------------------------------------")
    print('This is the trigram_token list:',trigram_token)
    return trigram_token



# from gensim.models.phrases import Phrases, Phraser
# phrases = Phrases(sentences, min_count=1, threshold=1)
# bigram_model = Phraser(phrases)
# sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
# print(bigram_model[sent])
# [u'the', u'mayor', u'of', u'new_york', u'was', u'there']



## New column "ngrams" is formed from applying function to each item in the column in the dataframe
#text_df['no_stop'] = text_df['tokens'].apply(lambda x:remove_stopwords(x))

text_df['ngrams'] = text_df['no_stop'].apply(lambda x:build_bigrams_trigrams(x))

print(text_df['ngrams'][1])

This is the bigram_token list: [['aggression', 'physical', 'health', 'married', 'women', 'aggression', 'physical', 'health', 'married', 'women', 'journal', 'postgraduate', 'medical', 'institute', 'december', 'tuesday', 'postgraduate', 'medical', 'institute', 'rights', 'vol', 'length', 'words', 'faiza', 'shafique', 'riffat', 'sadiq', 'keywords', 'aggression', 'health', 'women', 'aggression', 'instinctive', 'drive', 'person', 'dark', 'side', 'human', 'nature', 'includes', 'variety', 'range', 'behaviors', 'aggression', 'involves', 'verbal', 'physical', 'assault', 'expression', 'results', 'intense', 'violence', 'aggression', 'unwanted', 'maladaptive', 'behavior', 'causing', 'damage', 'obliteration', 'exhibited', 'forms', 'encompassing', 'physical', 'aggression', 'verbal', 'aggression', 'anger', 'hostility', 'person', 'physical', 'aggression', 'causing', 'physical', 'emotional', 'harm', 'harming', 'hurting', 'verbally', 'verbal', 'aggression', 'anger', 'involves', 'physiological', 'activati

This is the bigram_token list: [['prevalence', 'dyslipidemia', 'young', 'adults', 'prevalence', 'dyslipidemia', 'young', 'adults', 'professional', 'medical', 'journal', 'may', 'sunday', 'professional', 'medical', 'journal', 'rights', 'vol', 'length', 'words', 'mir', 'tahir', 'hussain', 'talpur', 'muhammad', 'tauqeer', 'katbar', 'khalil', 'ullah', 'shabir', 'kashif', 'ullah', 'shabir', 'uzair', 'yaqoob', 'shah', 'jabeen', 'danish', 'zia', 'key', 'words', 'dyslipidemia', 'hyperlipidemia', 'ldl', 'hdl', 'triglyceride', 'young', 'adults', 'term', 'dyslipidemia', 'signifies', 'abnormal', 'lipid', 'fat', 'content', 'blood', 'including', 'triglycerides', 'phospholipids', 'cholesterol', 'lipoprotein', 'fats', 'lipoproteinemia', 'denotes', 'fatty', 'substances', 'circulate', 'blood', 'attached', 'proteins', 'exceed', 'normal', 'level', 'called', 'similarly', 'cholesterol', 'triglycerides', 'increased', 'term', 'given', 'condition', 'non', 'optimal', 'levels', 'low', 'density', 'lipoprotein', 'l

This is the bigram_token list: [['determinants', 'depression', 'anxiety', 'stress', 'left', 'families', 'azad', 'jammu', 'kashmir', 'determinants', 'depression', 'anxiety', 'stress', 'left', 'families', 'azad', 'jammu', 'kashmir', 'pakistan', 'armed', 'forces', 'medical', 'journal', 'june', 'friday', 'pafmj', 'rights', 'length', 'words', 'rabia', 'mushtaq', 'muhammad', 'tahir', 'khalily', 'seema', 'gul', 'abstract', 'objective', 'investigate', 'determinants', 'depression', 'anxiety', 'unmanageable', 'stress', 'left', 'wives', 'overseas', 'migrants', 'azad', 'jammu', 'kashmir', 'ajandk', 'study', 'design', 'cross', 'sectional', 'study', 'place', 'duration', 'study', 'international', 'islamic', 'university', 'islamabad', 'materials', 'method', 'sample', 'comprised', 'left', 'wives', 'selected', 'districts', 'ajandk', 'purposive', 'sampling', 'technique', 'depression', 'anxiety', 'stress', 'scale', 'demographic', 'sheet', 'data', 'collection', 'test', 'way', 'analysis', 'variance', 'anova

This is the bigram_token list: [['frequency', 'dyslipidemias', 'years', 'old', 'pakistani', 'children', 'cross', 'sectional', 'study', 'frequency', 'dyslipidemias', 'years', 'old', 'pakistani', 'children', 'cross', 'sectional', 'study', 'kmuj', 'khyber', 'medical', 'university', 'journal', 'march', 'thursday', 'kohat', 'university', 'science', 'technology', 'rights', 'vol', 'length', 'words', 'amina', 'zehra', 'iqbal', 'sarah', 'basharat', 'adeel', 'basharat', 'shifa', 'basharat', 'aneela', 'ambareen', 'saima', 'gillani', 'abstract', 'objectives', 'estimate', 'frequency', 'dyslipidemias', 'years', 'old', 'pakistani', 'children', 'attending', 'pediatric', 'clinic', 'methods', 'cross', 'sectional', 'study', 'conducted', 'private', 'clinic', 'peshawar', 'pakistan', 'parental', 'consent', 'male', 'female', 'children', 'age', 'ranging', 'years', 'measurement', 'height', 'weight', 'waist', 'circumference', 'mass', 'index', 'bmi', 'blood', 'pressure', 'fasting', 'lipid', 'profile', 'fasting',

This is the bigram_token list: [['psychological', 'stress', 'distress', 'anthropometric', 'lifestyle', 'correlates', 'hypertension', 'sample', 'pakistani', 'population', 'psychological', 'stress', 'distress', 'anthropometric', 'lifestyle', 'correlates', 'hypertension', 'sample', 'pakistani', 'population', 'journal', 'behavioral', 'sciences', 'december', 'thursday', 'university', 'punjab', 'rights', 'vol', 'length', 'words', 'rafia', 'rafique', 'afifa', 'anjum', 'study', 'examines', 'psychological', 'lifestyle', 'anthropometric', 'correlates', 'hypertension', 'objective', 'study', 'find', 'high', 'level', 'stress', 'distress', 'bmi', 'greater', 'kg', 'high', 'whr', 'waist', 'hip', 'ratio', 'lack', 'useful', 'level', 'physical', 'activity', 'smoking', 'cigarettes', 'presence', 'family', 'history', 'hypertension', 'likely', 'predict', 'hypertension', 'conduct', 'study', 'seventy', 'cases', 'diagnosis', 'hypertension', 'aged', 'years', 'community', 'matched', 'controls', 'recruited', 'purp

This is the bigram_token list: [['effect', 'physical', 'activity', 'borderline', 'hypertension', 'effect', 'physical', 'activity', 'borderline', 'hypertension', 'biomedica', 'june', 'tuesday', 'biomedical', 'research', 'society', 'rights', 'vol', 'length', 'words', 'ali', 'malik', 'iqbal', 'saleem', 'abbas', 'ahsen', 'abstract', 'background', 'objectives', 'hypertension', 'key', 'contributor', 'current', 'pandemic', 'cardiovascular', 'diseases', 'randomized', 'control', 'trial', 'compared', 'decrease', 'systolic', 'diastolic', 'blood', 'pressures', 'study', 'control', 'groups', 'response', 'lifestyle', 'intervention', 'borderline', 'hypertensives', 'intervention', 'carried', 'period', 'months', 'total', 'adult', 'participants', 'included', 'borderline', 'hypertension', 'mmhg', 'lahore', 'pakistan', 'participants', 'randomized', 'study', 'control', 'groups', 'study', 'group', 'minutes', 'daily', 'exercise', 'advice', 'written', 'verbal', 'regular', 'monthly', 'follow', 'may', 'december'

This is the bigram_token list: [['relationship', 'platelet', 'indices', 'lipidemias', 'cross', 'sectional', 'study', 'karachi', 'relationship', 'platelet', 'indices', 'lipidemias', 'cross', 'sectional', 'study', 'karachi', 'journal', 'postgraduate', 'medical', 'institute', 'june', 'monday', 'postgraduate', 'medical', 'institute', 'rights', 'vol', 'length', 'words', 'sikandar', 'hayat', 'khan', 'syed', 'azhar', 'ahmad', 'abstract', 'objective', 'compare', 'platelet', 'indices', 'patients', 'low', 'high', 'levels', 'cholesterolemia', 'triglyceridemia', 'cross', 'sectional', 'study', 'conducted', 'january', 'may', 'departments', 'pathology', 'pns', 'rahat', 'baqai', 'medical', 'dental', 'university', 'karachi', 'total', 'subjects', 'presenting', 'estimation', 'fasting', 'triglycerides', 'total', 'cholesterol', 'selected', 'excluding', 'patients', 'receiving', 'anti', 'platelet', 'lipid', 'medication', 'pregnancy', 'acute', 'infectious', 'disorders', 'interviewed', 'examined', 'sampled', '

This is the bigram_token list: [['physical', 'activity', 'effect', 'forced', 'expiratory', 'volume', 'physical', 'activity', 'effect', 'forced', 'expiratory', 'volume', 'journal', 'pakistan', 'medical', 'association', 'march', 'sunday', 'jpma', 'rights', 'vol', 'length', 'words', 'syeda', 'sadia', 'fatima', 'rehana', 'rehman', 'saifullah', 'yumna', 'khan', 'abstract', 'objective', 'assess', 'compare', 'changes', 'pulmonary', 'functions', 'exercise', 'young', 'healthy', 'adults', 'methods', 'case', 'control', 'study', 'carried', 'january', 'march', 'medical', 'students', 'aged', 'years', 'bahria', 'university', 'medical', 'dental', 'college', 'karachi', 'baseline', 'values', 'forced', 'expiratory', 'volume', 'fev', 'forced', 'vital', 'capacity', 'fvc', 'measured', 'digital', 'spirometer', 'fev', 'fvc', 'ratio', 'calculated', 'lung', 'functions', 'group', 'comprising', 'volunteers', 'aerobic', 'exercise', 'minutes', 'daily', 'days', 'weeks', 'compared', 'group', 'ii', 'controls', 'who', 

This is the bigram_token list: [['practical', 'guide', 'metabolic', 'syndrome', 'practical', 'guide', 'metabolic', 'syndrome', 'clinical', 'advisor', 'february', 'haymarket', 'media', 'rights', 'feature', 'clinical', 'feature', 'cme', 'ce', 'length', 'words', 'jacinta', 'thomas', 'aprn', 'deborah', 'walker', 'dnp', 'crnp', 'aocn', 'highlight', 'obesity', 'rise', 'patients', 'risk', 'metabolic', 'syndrome', 'diagnosis', 'given', 'set', 'simultaneous', 'disorders', 'take', 'post', 'test', 'click', 'here', 'reading', 'article', 'take', 'post', 'test', 'mycme', 'com', 'glance', 'united', 'states', 'approximately', 'adults', 'carry', 'diagnosis', 'metabolic', 'syndrome', 'obesity', 'lipid', 'levels', 'bp', 'insulin', 'resistance', 'considered', 'diagnosing', 'metabolic', 'syndrome', 'screening', 'include', 'physical', 'examination', 'dietary', 'history', 'laboratory', 'workup', 'goal', 'treatment', 'delay', 'prevent', 'cvd', 'diabetes', 'national', 'cholesterol', 'education', 'program', 'nc

This is the bigram_token list: [['marijuana', 'affect', 'kidney', 'function', 'marijuana', 'affect', 'kidney', 'function', 'indian', 'agriculture', 'news', 'august', 'ffc', 'solution', 'rights', 'reserved', 'length', 'words', 'dateline', 'new', 'delhi', 'aug', 'new', 'delhi', 'aug', 'new', 'study', 'found', 'little', 'evidence', 'marijuana', 'affects', 'kidney', 'function', 'healthy', 'young', 'adults', 'analysis', 'upcoming', 'issue', 'clinical', 'journal', 'american', 'society', 'nephrology', 'cjasn', 'marijuana', 'increasingly', 'accepted', 'united', 'states', 'critical', 'need', 'studies', 'examining', 'risks', 'benefits', 'kidney', 'health', 'animal', 'studies', 'suggest', 'marijuana', 'affect', 'kidney', 'function', 'data', 'humans', 'limited', 'first', 'study', 'kind', 'julie', 'ishida', 'md', 'mas', 'university', 'california', 'san', 'francisco', 'san', 'francisco', 'va', 'medical', 'center', 'colleagues', 'examined', 'potential', 'links', 'marijuana', 'kidney', 'function', 'he

This is the bigram_token list: [['study', 'negative', 'early', 'life', 'experiences', 'adverse', 'effect', 'health', 'study', 'negative', 'early', 'life', 'experiences', 'adverse', 'effect', 'health', 'health', 'daily', 'digest', 'september', 'monday', 'ht', 'media', 'ltd', 'rights', 'reserved', 'length', 'words', 'dateline', 'los', 'angeles', 'los', 'angeles', 'sept', 'researchers', 'university', 'california', 'los', 'angeles', 'conducted', 'study', 'revealed', 'facing', 'negative', 'experiences', 'early', 'life', 'can', 'severely', 'affect', 'physical', 'health', 'people', 'causing', 'ailments', 'diabetes', 'heart', 'disease', 'researchers', 'selected', 'adults', 'participants', 'study', 'called', 'coronary', 'artery', 'risk', 'development', 'young', 'adults', 'detailed', 'measurements', 'biological', 'markers', 'carried', 'amount', 'existing', 'health', 'risk', 'summary', 'index', 'called', 'allostatic', 'load', 'created', 'adding', 'risks', 'markers', 'markers', 'included', 'heart'

This is the bigram_token list: [['regular', 'exercise', 'reduces', 'midlife', 'weight', 'gain', 'regular', 'exercise', 'reduces', 'midlife', 'weight', 'gain', 'clinical', 'advisor', 'december', 'haymarket', 'media', 'rights', 'news', 'length', 'words', 'nicole', 'blazek', 'highlight', 'department', 'health', 'human', 'services', 'recommends', 'least', 'minutes', 'moderate', 'intensity', 'activity', 'days', 'week', 'long', 'term', 'data', 'corroborated', 'amount', 'activity', 'keep', 'weight', 'now', 'people', 'who', 'maintain', 'high', 'activity', 'levels', 'young', 'adulthood', 'gain', 'significantly', 'excess', 'weight', 'middle', 'age', 'who', 'research', 'published', 'dec', 'issue', 'journal', 'american', 'medical', 'association', 'men', 'who', 'reported', 'high', 'activity', 'levels', 'hour', 'day', 'gained', 'pounds', 'women', 'minutes', 'day', 'gained', 'pounds', 'low', 'activity', 'levels', 'according', 'findings', 'prospective', 'longitudinal', 'study', 'followed', 'men', 'wom

This is the bigram_token list: [['clinical', 'trial', 'cardia', 'salt', 'sensitivity', 'blood', 'pressure', 'ssbp', 'clinical', 'trial', 'cardia', 'salt', 'sensitivity', 'blood', 'pressure', 'ssbp', 'health', 'daily', 'digest', 'february', 'tuesday', 'ht', 'media', 'ltd', 'rights', 'reserved', 'length', 'words', 'dateline', 'feb', 'clinicaltrials', 'gov', 'registry', 'received', 'study', 'nct', 'titled', 'cardia', 'salt', 'sensitivity', 'blood', 'pressure', 'ssbp', 'feb', 'brief', 'summary', 'salt', 'sensitivity', 'blood', 'pressure', 'ssbp', 'defined', 'change', 'blood', 'pressure', 'bp', 'relation', 'change', 'salt', 'intake', 'increase', 'bp', 'low', 'high', 'salt', 'diet', 'common', 'associated', 'increased', 'risk', 'cardiovascular', 'morbidity', 'mortality', 'normotensive', 'individuals', 'pathophysiology', 'ssbp', 'well', 'understood', 'prevailing', 'paradigm', 'abnormalities', 'neurohormones', 'regulate', 'sodium', 'na', 'retention', 'excretion', 'na', 'transporting', 'pathways

This is the bigram_token list: [['prevalence', 'valvular', 'heart', 'disease', 'elderly', 'may', 'top', 'prevalence', 'valvular', 'heart', 'disease', 'elderly', 'may', 'top', 'cardiology', 'news', 'february', 'elsevier', 'international', 'medical', 'news', 'group', 'rights', 'pg', 'vol', 'issn', 'length', 'words', 'mitchel', 'zoler', 'philadelphia', 'bureau', 'new', 'orleans', 'americans', 'aged', 'years', 'older', 'valvular', 'heart', 'disease', 'according', 'findings', 'unselected', 'population', 'people', 'prevalence', 'valvular', 'heart', 'disease', 'high', 'unselected', 'group', 'americans', 'aged', 'years', 'vuyisile', 'nkomo', 'reported', 'poster', 'annual', 'scientific', 'sessions', 'american', 'heart', 'association', 'high', 'prevalence', 'valvular', 'heart', 'disease', 'elderly', 'whom', 'asymptomatic', 'suggests', 'physicians', 'need', 'assess', 'elderly', 'patients', 'carefully', 'valvular', 'disease', 'history', 'physical', 'examination', 'nkomo', 'cardiologist', 'mayo', '

In [None]:
## Create Bigram and Trigram Tokens from non-stop word data, and then compare to stopword

bigram_token = []
trigram_token = []

def build_bigram_trigram_models(documents, documents_nostop):
    
    ##Building Bigram & Trigram Models
    ##higher threshold fewer phrases.
    bigram = gensim.models.Phrases(documents, min_count=5, threshold=100) 
    ## min_count: Ignore all words and bigrams with total collected count lower than this value.
    ## threshold: Represent a score threshold for forming the phrases (higher means fewer phrases).
    trigram = gensim.models.Phrases(bigram[documents], threshold=100)
        
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
        
   
    for doc in documents_nostop:
        bigram_token.append(bigram_mod[doc])
    
    for doc in bigram_token:
        trigram_token.append(trigram_mod[bigram_mod[doc]])
        
    return trigram_token


build_bigram_trigram_models(data_words, documents_nostop_list)

print(trigram_token[1])

### Lemmetize the Data

In [None]:
## Lemmetize the Data

texts_out = []

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    
    nlp = spacy.load(r'C:\Users\keg827\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.3.1')
    #nlp = spacy.load('C:\Users\keg827\AppData\Local\Continuum\anaconda3\Lib\site-packages\en_core_web_sm\en_core_web_sm-2.3.1', disable=['parser', 'ner'])
    
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    
    return texts_out


lemmatization(trigram_token, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(texts_out[1])

#pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
#https://stackoverflow.com/questions/54334304/spacy-cant-find-model-en-core-web-sm-on-windows-10-and-python-3-5-3-anacon

### Run the gensim topic modeling and return the topics

In [None]:
##Run the gensim topic modeling and return the topics
##Code from: https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA

def get_gensim_corpus_dictionary(data):
    ##If content is not yet a list, make it a list and build the id2word dictionary and the corpus (map the word to id)
    ##texts = text_df['content'].apply(lambda x: x.split(' ')).tolist()
    ##print(texts)

    ##Build the id2word dictionary and the corpus
    ##The dictionary associates each word in the corpus with a unique integer ID
    dictionary = corpora.Dictionary(data)
    print('Number of unique tokens: ', len(dictionary))

    ## Filter out words that appear in less than 2 documents (appear only once),
    dictionary.filter_extremes(no_below = 2)

    ## Filter out words that appears in more than certain % of documents
    ## no_above = 0.5 would remove words that appear in more than 50% of the documents
    # dictionary.filter_extremes(no_above = 0.5)

    # Remove gaps in id sequence after words that were removed
    dictionary.compactify()
    print('Number of unique tokens used 2 or more times: ', len(dictionary))

    ##Use code below to print terms in dictionary with their IDs
    ##This will show you the number of the terms in the dictionary
    #print("Dictionary Tokens with ID: ")
    #pprint.pprint(dictionary.token2id)
    
    ##Map terms in corpus to words in dictionary with ID
    ##This will show you the ID of the term in the dictionary, and the number of times the terms occurs in the corpus
    bow_corpus = [dictionary.doc2bow(text) for text in data]
    #print("Tokens in Corpus with Occurrence: ")
    #pprint.pprint(corpus)
    
    ##Print word count by vector 
    id_words_count = [[(dictionary[id], count) for id, count in line] for line in bow_corpus]
    print("Word Count in each Vector: ")
    pprint(id_words_count[1])
    
     
    return bow_corpus, dictionary




bow_corpus, dictionary = get_gensim_corpus_dictionary(texts_out)

In [None]:
## Run the Gensim Library LDA Model
## See link below if you want to save and load a model
## https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA

def run_gensim_LDA_model(corpus, dictionary):
    ##Directory for storing all lda models
    model_dir = 'lda_checkpoint'

    ##If model_dir directionry is not in the folder, then make the directory
    if not os.path.isdir(model_dir):
        os.mkdir(model_dir)

    ##Load the model if we've already trained it before
   
    path = os.path.join(model_dir, 'gensim_tutorial_topic_model.lda')
    if not os.path.isfile(path):
        ##Training LDA can take some time, we could set eval_every = None to not evaluate the model perplexity
        ##Other parameters for LdaModel, include: random_state=100, update_every=1,chunksize=100,passes=10,alpha='auto',per_word_topics=True
        topic_model = LdaModel(corpus, id2word = dictionary, num_topics = 3, iterations = 200, per_word_topics=True)
        topic_model.save(path)
 
    topic_model = LdaModel.load(path)

    # Each element of the list is a tuple containing the topic and word / probability list
    topics = topic_model.show_topics(num_words = 15, formatted = False)

    print(type(topics))
    
  
    
    return topic_model, topics

topic_model, topics = run_gensim_LDA_model(bow_corpus, dictionary)

In [None]:
# Save topics to CSV

def create_topic_CSV(topics):
    
    ##Create dataframe for topics
    df_topics = pd.DataFrame(topics, columns = ['TopicNum', 'Terms'])
    #df_topics.head()

    ## Save dataframe to csv
    with open(r"gensim_tutorial_topic_modeling.csv", 'w', encoding='utf-8') as file:
        df_topics.to_csv(file)
        file.close()
    
    return df_topics
    
create_topic_CSV(topics)

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool than pyLDAvis package’s interactive chart and is designed to work well with jupyter notebooks.
<https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#17howtofindtheoptimalnumberoftopicsforlda>

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, bow_corpus, dictionary)
vis

In [None]:
## Test Model Perplexity and Coherence

def model_perplexity_coherence(bow_corpus, dictionary, texts_out, topic_model):
    
    ##Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. 
    ##In my experience, topic coherence score, in particular, has been more helpful.
    #https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#:~:text=Topic%20Modeling%20is%20a%20technique,in%20the%20Python's%20Gensim%20package.

    
    ##The LDA model (lda_model) we have created above can be used to compute the model’s perplexity, 
    ##i.e. how good the model is. The lower the score the better the model will be.
    # Compute Perplexity
    perplexity_lda = topic_model.log_perplexity(bow_corpus)
    print('\nPerplexity: ',  perplexity_lda)  # a measure of how good the model is. lower the better.
    
    ## Compute Coherence Score
#     coherence_model_lda = CoherenceModel(model=topic_model, texts=corpus, dictionary=dictionary, coherence='c_v')
#     coherence_lda = coherence_model_lda.get_coherence()
#     print('\nCoherence Score: ', coherence_lda)

    ##The LDA model (lda_model) we have created above can be used to compute the model’s coherence score 
    ##i.e. the average /median of the pairwise word-similarity scores of the words in the topic. 
    
    
    coherence_model_lda = CoherenceModel(model=topic_model, texts=texts_out, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)
    
    return perplexity_lda, coherence_lda

perplexity_lda, coherence_lda = model_perplexity_coherence(bow_corpus, dictionary, texts_out, topic_model)

### Find optimal number of topics using the Coherence Score

My approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value.

Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.

If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.

The compute_coherence_values() (see below) trains multiple LDA models and provides the models and their corresponding coherence scores.

<https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#17howtofindtheoptimalnumberoftopicsforlda>

In [None]:
import os
from gensim.models.wrappers import LdaMallet
os.environ.update({'MALLET_HOME':r'C:\Users\keg827\Tools\mallet-2.0.8'}) 
#You should update this path as per the path of Mallet directory on your system.
mallet_path = r'C:\Users\keg827\Tools\mallet-2.0.8\bin\mallet' 
#You should update this path as per the path of Mallet directory on your system.
print(mallet_path)

In [None]:
ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=bow_corpus, num_topics=20, id2word=dictionary
)
pprint(ldamallet.show_topics(formatted=False))

In [None]:
ldamallet = gensim.models.wrappers.LdaMallet(
   mallet_path, corpus=bow_corpus, num_topics=20, id2word=dictionary
)
pprint(ldamallet.show_topics(formatted=False))

In [None]:
def coherence_values_computation(dictionary, corpus, texts, limit, start, step):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    
    return model_list, coherence_values

model_list, coherence_values = coherence_values_computation (dictionary=dictionary, 
                                                             corpus=bow_corpus, 
                                                             texts=texts_out, 
                                                             start=1, 
                                                             limit=50, 
                                                             step=2)
print(coherence_values)

In [None]:
limit=50 
start=1 
step=2
x = range(start, limit, step)
print(x)
print(coherence_values)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(("coherence_values"), loc='best')
plt.show()


In [None]:
# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))

If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest CV before flattening out. 
<https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#17howtofindtheoptimalnumberoftopicsforlda>

Now, the question arises which model should we pick now? One of the good practices is to pick the model, that is giving highest coherence value before flattering out. So that’s why, we will be choosing the model with 25 topics which is at number 4 in the above list.
<https://www.tutorialspoint.com/gensim/gensim_documents_and_lda_model.htm>

In [None]:
## Add the number in the list of the optimal topic number, then print out that number of topics. 
optimal_model = model_list[1]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))

### Finding dominant topics in sentences

Finding dominant topics in sentences is one of the most useful practical applications of topic modeling. It determines what topic a given document is about. Here, we will find that topic number which has the highest percentage contribution in that particular document. In order to aggregate the information in a table, we will be creating a function named dominant_topics() 
<https://www.tutorialspoint.com/gensim/gensim_documents_and_lda_model.htm>

In [None]:
def format_topics_sentences(ldamodel=topic_model, corpus=bow_corpus, texts=texts_out):
    ## Init output
    sent_topics_df = pd.DataFrame()

    ## Next, we will get the main topics in every document −
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        
        ## Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=bow_corpus, texts=texts_out)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)

### Finding Most Representative Document
In order to understand more about the topic, we can also find the documents, a given topic has contributed to the most. We can infer that topic by reading that particular document(s).

In [None]:
# Group top 5 sentences under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()

### Topic distribution across documents
Finally, we want to understand the volume and distribution of topics in order to judge how widely it was discussed. The below table exposes that information.

In [None]:
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics

### Volume & Distribution of Topics
Sometimes we also want to judge how widely the topic is discussed in documents. For this we need to understand the volume and distribution of topics across the documents.

First calculate the number of documents for every Topic as follows −

In [None]:
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

Next, calculate the percentage of Documents for every Topic as follows −;

In [None]:
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

Now find the topic Number and Keywords as follows −

In [None]:
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

Now, concatenate then Column wise as follows −

In [None]:
df_dominant_topics = pd.concat(
   [topic_num_keywords, topic_counts, topic_contribution], axis=1
)

Next, we will change the Column names as follows −

In [None]:
df_dominant_topics.columns = [
   'Dominant-Topic', 'Topic-Keywords', 'Num_Documents', 'Perc_Documents'
]
df_dominant_topics

In [None]:
## Run the Gensim Library TFIDF Model 
##The words that will occur more frequently in the document will get the smaller weights.
##https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py
##new_list = []

tfidf_frequency = []

def run_gensim_tfidf_model(corpus, dictionary): 
    
    ##Initialize the tf-idf model, training it on our corpus 
    tfidf = models.TfidfModel(corpus)
    
    ##if working with a new document, you can get tfidf from the model
    #new_doc = "abbott bra adolesc".lower().split()
    #print(new_doc)
    #new_list.append(tfidf[dictionary.doc2bow(new_doc)])
    
    corpus_tfidf = tfidf[corpus]
    for doc in corpus_tfidf:
        ##pprint.pprint(doc)
        tfidf_frequency.append(doc)
    
    #Print word frequencies by vector 
    id_words_frequency = [[(dictionary[id], frequency) for id, frequency in line] for line in tfidf_frequency]
    print("Word Frequency by Vector: ")
    pprint.pprint(id_words_frequency[2])
    
run_gensim_tfidf_model(bow_corpus, dictionary)

#pprint.pprint(tfidf_frequency)
    