## Keyword Extraction using Scikit-Learn 

The baseline method in unsupervised approaches is TF.IDF which compares the frequency of a term in a document to its frequency in a large collection. Although quite simple to implement, TF.IDF requires access to a large corpus, which may not always be available.



In [38]:
## Dependencies
import sys, os
import pandas as pd
from tika import parser # pip install tika
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import re
import glob
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer 
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

## Load PDFs and convert to Pandas Dataframe

In [39]:
directory = "News_Industry"
files = list(glob.glob(os.path.join(directory,'*.*')))
print(files)
#https://stackoverflow.com/questions/34000914/how-to-create-a-list-from-filenames-in-a-user-specified-directory-in-python
#https://stackoverflow.com/questions/3207219/how-do-i-list-all-files-of-a-directory
#https://stackoverflow.com/questions/33912773/python-read-txt-files-into-a-dataframe

['News_Industry\\Bibliography.10AGGRESSION AND PHYSICAL HEALTH IN MARRIED WOMEN.pdf', 'News_Industry\\Bibliography.12Impact of Socio-demographic Factors on Awareness of Smoking Effects on Oral Health among Smokers and.pdf', 'News_Industry\\Bibliography.17Health-Promoting Factors related to lifestyle among nursing students in University of Hail.pdf', 'News_Industry\\Bibliography.17Multinomial logit analysis of the effects of five different app-based incentives to encourage cyclin.pdf', 'News_Industry\\Bibliography.1PREVALENCE OF DYSLIPIDEMIA IN YOUNG ADULTS.pdf', 'News_Industry\\Bibliography.20Risk Factors for Atherosclerotic Cardiovascular Disease in the South Asian Population.pdf', 'News_Industry\\Bibliography.29Is the Gay Community the Neo-marginalised of Modern Society_.pdf', 'News_Industry\\Bibliography.33A Biological Effect of Sex Hormone Binding Globulin and Testosterone in Polycystic Ovary Syndrome (P.pdf', 'News_Industry\\Bibliography.34DETERMINANTS OF DEPRESSION ANXIETY STRESS

In [40]:
#https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file
document_list = []
for f in files:
    raw = parser.from_file(f)
    document_list.append(raw)

In [41]:
text_df = pd.DataFrame(document_list)
#text_df.head()
#print(text_df["content"][1])

## Using Scikit-Learn Feature Extraction and TF-IDF

This version of the code runs separately on each document in the dataset. It also relies on external pre-processing and lemmatization, done before the text reaches scikit-learn. 

TF–IDF, term frequency–inverse document frequency, encoding normalizes the frequency of tokens in a document with respect to the rest of the corpus. This encoding approach accentuates terms that are very relevant to a specific instance. TF–IDF is computed on a per-term basis, such that the relevance of a token to a document is measured by the scaled frequency of the appearance of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus.

<https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.X8Jt9NBKiUk>



TF-IDF is both a corpus exploration method and a pre-processing step for many other text-mining measures and models.

<https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf>


In [42]:
# Convert the "content" column in dataframe to a list
# CountVectorizer's fit_transform() in Scikit-Learn expects an iterable 
# (such as a list) of strings or file objects, and builds a dictionary of the vocabulary in the corpus.

def convert_content_to_list(text_df):
    
    global content_list
    
    content_list = text_df['content'].tolist()

    return content_list

convert_content_to_list(text_df)

print(type(content_list))
print(content_list[0])

<class 'list'>

AGGRESSION AND PHYSICAL HEALTH IN MARRIED WOMEN


 

AGGRESSION AND PHYSICAL HEALTH IN MARRIED WOMEN

Journal of Postgraduate Medical Institute

December 31, 2019 Tuesday

Copyright 2019 Postgraduate Medical Institute All Rights Reserved

Section: Vol. 33; No. 4

Length: 3751 words

Byline: Faiza Shafique and Riffat Sadiq

Body

KeyWords: Aggression, Health, Women

INTRODUCTION

Aggression is an instinctive drive of a person and a dark side of human nature1. It includes a variety of range of 
behaviors2. Aggression involves verbal and physical assault3, therefore, its expression results in intense violence 
towards others4. Aggression is an unwanted and maladaptive behavior causing damage and obliteration5. It is 
exhibited in different forms encompassing physical aggression, verbal aggression, anger and hostility6. A person 
with physical aggression causing physical and emotional harm others while harming or hurting someon

#### Pre-processing for dataset

Scikit-Learn does tokenization using CountVectorizer, but not stemming. Stemming, lemmatizing, compound splitting, filtering based on part-of-speech, etc. are not included in the scikit-learn codebase, but can be added by customizing either the tokenizer or the analyzer.

<https://scikit-learn.org/stable/modules/feature_extraction.html>
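
As a minimal sketch of that customization, the example below passes a callable to CountVectorizer's `tokenizer` parameter. The `toy_normalize` helper and its crude plural-stripping rule are hypothetical stand-ins for a real stemmer or lemmatizer (such as the NLTK WordNetLemmatizer used later in this notebook); they are chosen only so the example runs without downloading NLTK data.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def toy_normalize(token):
    """Hypothetical stand-in for a lemmatizer: strip a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def custom_tokenizer(text):
    """Split the text into runs of letters, then normalize each token."""
    return [toy_normalize(tok) for tok in re.findall(r"[a-z]+", text.lower())]

docs = ["Smokers and nursing students", "A student who smokes"]

# token_pattern=None avoids the warning that the default pattern is unused
# when a custom tokenizer is supplied.
cv_demo = CountVectorizer(tokenizer=custom_tokenizer, token_pattern=None)
counts = cv_demo.fit_transform(docs)

# "students" and "student" now share one vocabulary entry
print(sorted(cv_demo.vocabulary_))
```

The notebook itself takes the other route: it lemmatizes the text before it reaches CountVectorizer, which keeps the vectorizer settings simple at the cost of an extra dataframe column.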

In [43]:
## Pre-process the text to lower case, remove special characters, etc. 
## https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn
## Test regex here: https://pythex.org/

def preprocess(text):
    
    ## Lowercase words
    text_lower = text.lower()
    
    ## Remove Emails from text
    ## if you need to match a \, you can precede them with a backslash to remove their special meaning: \\.
    ## \S matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    ## Code below matches any character, then an @ sign, then more characters, end matching when a white space is found.
    text_email = re.sub('\\S*@\\S*\\s?', '', text_lower) 
    
    ## Remove URLS from text
    ## https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105
    ## text_urls = re.sub(r'http\S+', '', text_email)
    ## https://www.geeksforgeeks.org/python-check-url-string/
    text_urls = re.sub(r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’]))",'', text_email)
    
    
    ## Remove tabs and new lines from text
    ## https://stackoverflow.com/questions/16355732/how-to-remove-tabs-and-newlines-with-a-regex
    ## \s Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]
    text_spaces = re.sub(r'\s+',' ',text_urls)
        
    ## Remove \n from text
    text_space_character = text_spaces.replace('\n','')
    
    ## Remove \t from text
    text_tab_character = text_space_character.replace('\t','')
    
    ## Remove special characters and numbers
    ## \W matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_]
    ## \d matches any decimal digit; this is equivalent to the class [0-9]
    text_numbers = re.sub("(\\d|\\W)+"," ",text_tab_character)
    
    ## Remove tags
    ##text_tags = re.sub("","",text_numbers)

    ## Remove special characters and space, but leave in periods and numbers
    ## ^ means any character except. So [^5] will match any character except '5'
    ## [^a-zA-Z0-9_] matches any non-alphanumeric character.
    ## text_special = re.sub('[^A-Za-z0-9.]+|\s',' ',text_tab_character)
   
    return text_numbers

## New column "preprocess" is formed by applying the preprocess function to each item in the "content" column in dataframe
text_df['preprocess'] = text_df['content'].apply(preprocess)

print(text_df['preprocess'][1])

#https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
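
To make the effect of each regular expression easier to see, here is a small standalone sketch that applies the same kinds of steps to one made-up string. The sample text and the simplified URL pattern `https?://\S+` are assumptions for illustration; the function above uses a much longer URL pattern.

```python
import re

sample = "Contact: Jane.Doe@example.org\nSee https://example.org/paper?id=12 for 3751 words!"

step1 = sample.lower()
# Drop e-mail-like tokens: non-whitespace, an @ sign, more non-whitespace
step2 = re.sub(r"\S*@\S*\s?", "", step1)
# Drop URLs (simplified pattern, an assumption for this sketch)
step3 = re.sub(r"https?://\S+", "", step2)
# Replace runs of digits and non-word characters with a single space;
# strip() is added here only to tidy the ends of the demo string
step4 = re.sub(r"(\d|\W)+", " ", step3).strip()

print(step4)  # contact see for words
```

Printing each intermediate step on a real document is a quick way to check that a pattern is not removing more text than intended.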



In [44]:
## Get stopwords
def get_stop_words(stop_file_path):
    """Load stop words from a text file (one word per line) into a frozenset."""
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        return frozenset(line.strip() for line in f)

#load a set of stop words
stopwords=get_stop_words("stop_words.txt")

In [45]:
## Get Parts of Speech for Lemmatization

# def get_wordnet_pos(treebank_tag):
#     if treebank_tag.startswith('J'):
#         return wordnet.ADJ
#     elif treebank_tag.startswith('V'):
#         return wordnet.VERB
#     elif treebank_tag.startswith('N'):
#         return wordnet.NOUN
#     elif treebank_tag.startswith('R'):
#         return wordnet.ADV
#     else:
#         return None

In [46]:
## Lemmatize documents

def lemmatize(text, stopwords):
    """Drop stopwords from a pre-processed string and lemmatize the remaining words."""
    
    # Initiate a lemmatizer (requires the NLTK 'wordnet' corpus: nltk.download('wordnet'))
    lemmatizer = WordNetLemmatizer()
    
    # Split the document into words, skip stopwords, and lemmatize the rest.
    # With no part-of-speech tag, WordNetLemmatizer treats every word as a noun.
    lemmas = [lemmatizer.lemmatize(word) for word in text.split() if word not in stopwords]
    
    return ' '.join(lemmas)

## New column "lemmatize" is formed by applying the lemmatize function to each item in the "preprocess" column in dataframe
text_df['lemmatize'] = text_df['preprocess'].apply(lambda x:lemmatize(x, stopwords))

print(text_df['lemmatize'][1])





In [47]:
## Convert the "lemmatize" column in dataframe to a list

def convert_lemmatized_to_list(text_df):
    
    global lemmatized_list
    lemmatized_list = text_df['lemmatize'].tolist()

    return lemmatized_list

convert_lemmatized_to_list(text_df)
print(type(lemmatized_list))
print(len(lemmatized_list))
print(lemmatized_list[1])

<class 'list'>
103


### CountVectorizer 

In order to start using TfidfTransformer you will first have to create a CountVectorizer to count the number of words (term frequency), limit your vocabulary size, apply stop words, etc. The CountVectorizer transformer from the sklearn.feature_extraction.text module has its own internal tokenization and normalization methods. 

**How it wants the data:** Note the corpus should be a list (which is made of a "list of strings" not a "list of lists"). CountVectorizer considers each element of the list as a different document to vectorize.

**Expected return:** CountVectorizer creates a python dictionary of the tokens and their unique IDs from the corpus.

**Helpful methods:** 
* NOTE: cv is the variable I used when creating my CountVectorizer (see code below)
* Print the ID of one word in the dictionary: print(cv.vocabulary_.get(u'aspirin'))
* Print the list of terms and their unique IDs: print(cv.vocabulary_)
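* Print the first 50 token names (the dictionary keys): print(list(cv.vocabulary_.keys())[:50])
* Print the first 50 token IDs (the dictionary values): print(list(cv.vocabulary_.values())[:50])
* Get a list of the token names: cv.get_feature_names() (renamed to cv.get_feature_names_out() in scikit-learn 1.0; the old name was removed in 1.2)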
* Print the stop list that was used: print(cv.get_stop_words())
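
The following sketch shows what `vocabulary_` holds on a two-document toy corpus. The documents and the `cv_demo` and `id_to_token` names are made up for illustration; they are not part of this notebook's pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["aspirin lowers risk", "risk factors for disease"]
cv_demo = CountVectorizer()
cv_demo.fit(docs)

# vocabulary_ maps token name (key) -> token ID (value); the ID is the
# column index of that token in the matrix returned by transform()
print(cv_demo.vocabulary_.get("aspirin"))
print(sorted(cv_demo.vocabulary_.items(), key=lambda kv: kv[1]))

# Invert the mapping to look a token name up by its ID
id_to_token = {idx: tok for tok, idx in cv_demo.vocabulary_.items()}
print(id_to_token[5])
```

Token IDs are assigned in alphabetical order of the token names, so the ID of a given word changes whenever the vocabulary changes.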


### Fit_Transform 
Then we will use fit_transform to create a document-term matrix, where each column represents a word in the vocabulary and each row represents a document in our dataset; the values in this case are the word counts.

**How it wants the data:** Remember that fit_transform() function for Scikit-Learn expects an iterable or list of strings or file objects.
 
**Expected return:** When fit_transform() is called, each individual document is transformed into a sparse array/matrix whose index tuple is the row (the document ID) and the token ID from the dictionary, and whose value is the count.

**Helpful methods:**
* Check the shape, which should return (number of documents in corpus, number of terms extracted from corpus): print(word_count_vector.shape) 
* Note, the todense() function converts the sparse matrix to a dense numpy matrix, which the DataFrame constructor accepts: word_count_df_all = pd.DataFrame(word_count_vector.todense())
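
A small sketch of the matrix that fit_transform() returns, using three made-up documents. It uses `toarray()`, which returns a plain numpy array, where the text above uses `todense()`; either works as input to the DataFrame constructor.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["health health women", "women smoking", "smoking health risk"]
cv_demo = CountVectorizer()
X = cv_demo.fit_transform(docs)  # scipy sparse matrix, one row per document

print(X.shape)  # (number of documents, number of distinct tokens)

# Label the columns by sorting the token names by their token ID
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
dense_df = pd.DataFrame(X.toarray(), columns=columns)
print(dense_df)
```

Converting to a dense dataframe is convenient for inspection, but for a large corpus it is usually better to keep the sparse matrix, since most entries are zero.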

#### Text from: 
* <https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X70izdBKiUn>
* <https://kavita-ganesan.com/how-to-use-countvectorizer/#Working-With-NGrams>
* <https://www.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html>
* <https://medium.com/@rnbrown/more-nlp-with-sklearns-countvectorizer-add577a0b8c8>

In [48]:
def vectorize_content_list(lemmatized_list): 

    global word_count_vector
    global cv
       
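    ## Set parameters for CountVectorizer
    ## max_df=0.85: ignore tokens that appear in more than 85% of documents
    ## stop_words='english': remove scikit-learn's built-in English stop words
    ## ngram_range=(1, 2): include one-word tokens and two-word phrases
    ## max_features=10000: limit the vocabulary to the 10,000 most frequent tokens
    ## min_df=1: keep tokens that appear in at least 1 document (the default);
    ## use min_df=2 to ignore tokens that appear in only 1 document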
    
    cv=CountVectorizer(max_df=0.85,
                       stop_words='english',
                       ngram_range=(1, 2),
                       max_features=10000, 
                       min_df=1)
                    
    ## preprocessor = preprocess,
    ## Use Count Vectorizer and call fit_transform() to create the vocabulary and return a term-document matrix for each document
    word_count_vector=cv.fit_transform(lemmatized_list)
    
    ##.toarray()
    
    return cv, word_count_vector

cv, word_count = vectorize_content_list(lemmatized_list)

### Do some checking of the output

First, check the shape. We should have the same number of rows as documents in our dataset (103 rows = 103 docs) and the number of columns based on the unique tokens in our dataset, which we limited above to 10,000. 

Second, check the index of one of the words in the dictionary. 
Third, check the keys and the values. 
Fourth, review the tuple that is the output.

In [49]:
## Review the outputs of the vectorize_content_list() function
# print the stop list used
#print(cv.get_stop_words())

# print(type(cv))
print(cv.vocabulary_.get(u'aspirin'))
print(type(cv.vocabulary_))
#print(cv.vocabulary_)

## Review the first 50 token names (dictionary keys)
# print(list(cv.vocabulary_.keys())[:50])

## Review the first 50 token IDs (dictionary values)
# print(list(cv.vocabulary_.values())[:50])

## Review the word_count_vector
## The vector includes (doc, token_id) and a count of term in document

print(type(word_count_vector))
print(word_count_vector.shape)
# print(word_count_vector)




619
<class 'dict'>
<class 'scipy.sparse.csr.csr_matrix'>
(103, 10000)


### Compute IDF 
Now we are going to compute the IDF values by calling tfidf_transformer.fit(word_count_vector) on the word counts we computed earlier. To get a glimpse of how the IDF values look, we are going to print them by placing the IDF values in a pandas DataFrame. 

Inverse Document Frequency (IDF) is computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears. It is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. 

**IDF(t) = log_e(Total number of documents / Number of documents with term t in it)**

Notice that words with the lowest IDF values most likely appear in each and every document in our collection. The **lower the IDF value of a word, the less unique it is** to any particular document. Under the formula above, IDF gives a weight of 0 to a word that occurs in every document, so such a word contributes nothing to the TF-IDF score. Scikit-learn's TfidfTransformer uses a slightly different formula: with smooth_idf=True (the setting used in this notebook) it computes ln((1 + n) / (1 + df(t))) + 1, so the lowest possible IDF is 1 rather than 0. Note, you may consider adding tokens with the lowest IDF values to your stopword list.

In practice, your IDF should be based on a large corpus of text. IDF scores reflect your ENTIRE corpus, not just one document in the corpus. 
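
The sketch below compares the textbook IDF formula quoted above with the smoothed variant that scikit-learn documents for `smooth_idf=True`, which is ln((1 + N) / (1 + df(t))) + 1. The function names are hypothetical, and the document frequencies 2, 8 and 103 are picked for illustration rather than read from the data.

```python
import math

def textbook_idf(n_docs, doc_freq):
    """IDF(t) = ln(N / df(t)), the formula quoted above."""
    return math.log(n_docs / doc_freq)

def smoothed_idf(n_docs, doc_freq):
    """scikit-learn's documented default: ln((1 + N) / (1 + df(t))) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 103  # number of documents in this notebook's corpus
for doc_freq in (2, 8, 103):
    print(doc_freq,
          round(textbook_idf(n_docs, doc_freq), 6),
          round(smoothed_idf(n_docs, doc_freq), 6))
```

With 103 documents, the smoothed formula gives about 4.545779 for a term found in 2 documents and about 3.447166 for a term found in 8. Both values appear in the IDF table printed by this notebook, which suggests that the smoothed variant, not the textbook formula, produced those scores. A term found in all 103 documents gets 0 under the textbook formula and 1 under the smoothed one.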

**How it wants the data:** The TfidfTransformer requires a vector from the CountVectorizer and Fit_Transform() functions above. 

**Expected return:** tfidf_transformer.idf_ is a numpy array of IDF scores, one per token; the position of each score in the array is its token ID.

**Resources:**
* <https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.X8Jt9NBKiUk>
* <http://www.tfidf.com/>
* <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html>

In [50]:
## Compute the IDF values for the ENTIRE dataset

## use_idf = True: Enable inverse-document-frequency reweighting.
## smooth_idf = True: Smooth idf weights by adding one to document frequencies, 
## as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

def compute_IDF(word_count_vector):
    
    tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
    tfidf_transformer.fit(word_count_vector)
    
    ## Create a dataframe from tfidf_transformer.idf_, which holds one IDF score per token (the array position is the token ID)
    ## Use reset_index(inplace=True) to make all the data into columns (and not an index and a column)
    idf_df = pd.DataFrame(data=tfidf_transformer.idf_)
    idf_df.reset_index(inplace=True)
    
    ## Change the column names to something more useful
    ## The "inplace = True" means the original dataframe is changed
    idf_df.rename(columns={'index': 'Token_ID', 0: 'IDF_for_Token'}, inplace=True)
       
   
    ## [OLD CODE BUT DOES WORK] Print the IDF scores of all tokens in the dataset in a dataframe. 
    ## tfidf_transformer.idf_ is an numpy array containing the inverse document frequency (IDF) vector; 
    ## only defined if use_idf is True
    # idf_df = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(),columns=["idf_weights"]) 
    ## Sort IDF values ascending 
    # idf_df.sort_values(by=['idf_weights'])
   
    
    return idf_df, tfidf_transformer, tfidf_transformer.idf_

## Unpack the IDF array into its own name instead of assigning back onto the transformer attribute
idf_df, tfidf_transformer, idf_values = compute_IDF(word_count_vector)

#idf_df.to_csv(r'idf_values.csv')
print(type(tfidf_transformer))
#print(type(idf_df))
print(idf_df.head())
print(type(tfidf_transformer.idf_))
      

<class 'sklearn.feature_extraction.text.TfidfTransformer'>
   Token_ID  IDF_for_Token
0         0       3.447166
1         1       4.545779
2         2       2.811178
3         3       4.258097
4         4       3.852631
<class 'numpy.ndarray'>


### Compute TF-IDF

Once you have the IDF values, you can now compute the TF-IDF scores for any document or set of documents. The more common a word is across documents, the lower its TF-IDF score; the more unique a word is to a document, the higher its score in that document. 

The TF-IDF score is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. 

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)  
MULTIPLIED BY THE  
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).  
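
TF-IDF is the product of the term frequency and the IDF. As a rough sketch of how that product becomes the scores seen in this notebook: with its default settings, scikit-learn's TfidfTransformer uses the raw count as the term frequency and then scales each document's row to unit length (L2 normalization). The `tfidf_row` helper and the counts and IDF values below are hypothetical.

```python
import math

def tfidf_row(counts, idf):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    raw = {tok: counts[tok] * idf[tok] for tok in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}

# Hypothetical counts and IDF values for one three-term document
counts = {"aggression": 3, "health": 1, "women": 1}
idf = {"aggression": 2.0, "health": 1.0, "women": 1.0}

row = tfidf_row(counts, idf)
for tok, score in row.items():
    print(tok, round(score, 4))
```

The filtered output for document 0 later in this notebook is consistent with this scaling: 0.007734 / (1 × 3.447166) and 0.023994 / (3 × 3.564949) both come to about 0.00224, as a shared per-document normalization factor would produce.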

**How it wants the data:** The input for the tfidf_transformer created from the code above is expected to be the vector from the Count Vectorizer Fit_Transform() function. 

**Expected Return:** The tfidf transform() function returns a sparse matrix representation in the form of ((doc, term), tfidf) where each key is a document and term pair and the value is the TF–IDF score.

**Helpful Methods:**
* The todense() function converts the sparse matrix to a dense numpy matrix that the DataFrame constructor accepts: tf_idf_df_all = pd.DataFrame(tf_idf_vector.todense())

**Notes**: You can compute the TF-IDF for document(s) that weren't originally in the dataset that you used to create the IDF scores. For example you can create the CountVectorizer and the tfidf_transformer on a training dataset (to build the model) and then use a different testing dataset (to validate the model that was built). 

If you are using a different dataset (often called a testing dataset), a few extra steps are needed to process the data, along with a small change to the compute_TF_IDF() function, because the CountVectorizer has not yet been applied to the new data. The process is described in the "Training datasets vs. Test datasets" section below.  

**Text from:**
* <https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.X8Jt9NBKiUk>
* <http://www.tfidf.com/>
* <https://learning.oreilly.com/library/view/applied-text-analysis/9781491963036/ch04.html>

In [51]:
# Compute TF-IDF scores for all the documents and for one document in the dataset

def compute_TF_IDF(word_count_vector, tfidf_transformer):
    
    ## transform() returns a sparse matrix of TF-IDF scores (one row per document, one column per token)
    tf_idf_vector=tfidf_transformer.transform(word_count_vector)
    
    #feature_names = cv.get_feature_names() 
    
    ## Get tfidf for each document in the dataset
    # tf_idf_df_all = pd.DataFrame(tf_idf_vector) 
     
    ## Get tfidf vector for one document (index 1 is the second document in the dataset)
    # first_document_vector=tf_idf_vector[1] 
 
    ## [OLD CODE BUT DOES WORK] Print the scores of one document in a dataframe. 
    ## The todense() function converts the sparse matrix to a dense numpy matrix for the dataframe constructor
    ## T is a short form for transposing the data
    # tf_idf_df_one = pd.DataFrame(first_document_vector.T.todense(), index=feature_names, columns=["tfidf"]) 
    # tf_idf_df_one.sort_values(by=["tfidf"],ascending=False)
    
    return tf_idf_vector

    ## Returns not used (add as needed)
    # tf_idf_df_all
    # tf_idf_df_one
    # feature_names

tf_idf_vector = compute_TF_IDF(word_count_vector, tfidf_transformer)

## Returns not used (add above with comma as needed)
# tf_idf_df_all
# tf_idf_df_one
# feature_names

## Note, the return tf_idf_vector is a <class 'scipy.sparse.csr.csr_matrix'>
## The return is: ((doc, term_id), tfidf)
# print(type(tf_idf_vector))
# print(tf_idf_vector)

## Examine the dataframes (note, these dataframes were removed from the return statement)
# tf_idf_df_one.to_csv(r'tf_idf_values_one.csv')
# tf_idf_df_all.to_csv(r'tf_idf_values_all.csv')

In [52]:
## Organize data into dataframes and join to show final results

## The todense() function converts the sparse matrix to a dense numpy matrix for the dataframe constructor
tf_idf_df_all = pd.DataFrame(tf_idf_vector.todense())

## The .stack() function returns a Pandas.Series, which in this case has a multi-level index
## The return is: multi-level index of doc, token_id, and a value of tfidf

tf_idf_sr_all = tf_idf_df_all.stack()

## Review the data by retrieving the first five elements in the Series
#print(tf_idf_sr_all[:5])

## Review the data for documents 3 through 4 using the loc function
## (label slices on the first index level include both end points)
#print(tf_idf_sr_all.loc[3:4])

## Review the data by printing the series index
#print(tf_idf_sr_all.index) 

## Now, turn the Series into dataframe
## Use reset_index(inplace=True) to set the multi-level index as columns

tf_idf_df_final=pd.DataFrame(tf_idf_sr_all)
tf_idf_df_final.reset_index(inplace=True)

## Change the column names to something more useful
## The "inplace = True" means the original dataframe is changed
tf_idf_df_final.rename(columns={'level_0': 'Document_ID', 'level_1': 'Token_ID', 0: 'TF_IDF_for_Doc'}, inplace=True)
tf_idf_df_final.head()

## Resources
## https://www.geeksforgeeks.org/python-pandas-series-index/
## https://discuss.analyticsvidhya.com/t/how-to-convert-the-multi-index-series-into-a-data-frame-in-python/5119/2

                                                      

   Document_ID  Token_ID  TF_IDF_for_Doc
0            0         0             0.0
1            0         1             0.0
2            0         2             0.0
3            0         3             0.0
4            0         4             0.0
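
The wide-to-long reshaping used in the cell above can be seen on a tiny example. This is a sketch with made-up scores for two documents and two tokens; the column names match the ones used in this notebook.

```python
import pandas as pd

# Rows are documents, columns are token IDs (hypothetical scores)
wide = pd.DataFrame([[0.0, 0.5], [0.7, 0.0]])

# stack() moves the column labels into a second index level, giving one
# value per (document, token) pair; reset_index() turns both levels into columns
long = wide.stack().reset_index()
long.columns = ["Document_ID", "Token_ID", "TF_IDF_for_Doc"]
print(long)
```

The long table has one row per document and token pair, including the zeros, which is why the full dataframe in this notebook has 103 × 10,000 rows before the zero counts are filtered out.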


In [53]:
## The todense() function converts the sparse matrix to a dense numpy matrix for the dataframe constructor
word_count_df_all = pd.DataFrame(word_count_vector.todense())

## The .stack() function returns a Pandas.Series, which in this case has a multi-level index
## The return is: multi-level index of doc, token_id, and a count of term in document

word_count_sr_all = word_count_df_all.stack()

## Review the data by retrieving the first five elements in the Series
# print(word_count_sr_all[:5])

## Now, turn the Series into dataframe
## Use reset_index(inplace=True) to set the multi-level index as columns

word_count_df_final=pd.DataFrame(word_count_sr_all)
word_count_df_final.reset_index(inplace=True)

## Change the column names to something more useful
## The "inplace = True" means the original dataframe is changed
word_count_df_final.rename(columns={'level_0': 'Document_ID', 'level_1': 'Token_ID', 0: 'Count_for_Doc'}, inplace=True)
word_count_df_final.head()




   Document_ID  Token_ID  Count_for_Doc
0            0         0              0
1            0         1              0
2            0         2              0
3            0         3              0
4            0         4              0


In [54]:
## Merge the word_count_df_final and the tf_idf_df_final based on the Document_ID and the Token_ID
## The merge() function in Pandas by default performs an inner join. 
## Inner join is the most common type and returns a dataframe with only those rows that have common characteristics.
## It takes both the dataframes as arguments and the name of the column on which the join has to be performed. 
## To change the join type, set the how parameter of merge, e.g. how = 'left'. 

count_tf_idf_df = word_count_df_final.merge(tf_idf_df_final, on = ['Document_ID', 'Token_ID'])
count_tf_idf_df.head()

   Document_ID  Token_ID  Count_for_Doc  TF_IDF_for_Doc
0            0         0              0             0.0
1            0         1              0             0.0
2            0         2              0             0.0
3            0         3              0             0.0
4            0         4              0             0.0


In [55]:
## Merge the count_tf_idf_df and the idf_df based on the Token_ID
## Left join, also known as Left Outer Join, returns a dataframe containing all the rows of the left dataframe.
## It takes both the dataframes as arguments and the name of the column on which the join has to be performed. 
## The join type is set with the how parameter of merge (here how = 'left'). 

count_idf_tf_idf_df = pd.merge(count_tf_idf_df,idf_df,on='Token_ID',how='left')
count_idf_tf_idf_df.head()

   Document_ID  Token_ID  Count_for_Doc  TF_IDF_for_Doc  IDF_for_Token
0            0         0              0             0.0       3.447166
1            0         1              0             0.0       4.545779
2            0         2              0             0.0       2.811178
3            0         3              0             0.0       4.258097
4            0         4              0             0.0       3.852631


In [56]:
## Add the token names based on token ID using the count vectorize vocabulary dictionary called cv.vocabulary_

## Create a dataframe from the cv.vocabulary_ dictionary
## Specify orient='index' to create the DataFrame using dictionary keys as rows
## Use reset_index(inplace=True) to turn index into a column
# print(cv.vocabulary_)

vocabulary_df = pd.DataFrame.from_dict(data = cv.vocabulary_, orient='index')
vocabulary_df.reset_index(inplace=True)

## Change the column names to something more useful
## The "inplace = True" means the original dataframe is changed
vocabulary_df.rename(columns={'index': 'Token_Name', 0: 'Token_ID'}, inplace=True)

vocabulary_df.head()


   Token_Name  Token_ID
0  aggression       297
1    physical      6778
2      health      3875
3     married      5491
4       woman      9815


In [57]:
## Merge the count_idf_tf_idf_df and the vocabulary_df based on the Token_ID
## Left join, also known as Left Outer Join, returns a dataframe containing all the rows of the left dataframe.
## It takes both the dataframes as arguments and the name of the column on which the join has to be performed. 
## The join type is set with the how parameter of merge (here how = 'left'). 

name_count_idf_tf_idf_df = pd.merge(count_idf_tf_idf_df,vocabulary_df,on='Token_ID',how='left')
name_count_idf_tf_idf_df.head(20)

   Document_ID  Token_ID  Count_for_Doc  TF_IDF_for_Doc  IDF_for_Token         Token_Name
0            0         0              0             0.0       3.447166                 aa
1            0         1              0             0.0       4.545779              abbas
2            0         2              0             0.0       2.811178          abdominal
3            0         3              0             0.0       4.258097  abdominal central
4            0         4              0             0.0       3.852631      abdominal fat
5            0         5              0             0.0       3.246496  abdominal obesity
6            0         6              0             0.0       4.034953              abdul
7            0         7              1        0.007734       3.447166            ability
8            0         8              0             0.0       3.447166               able
9            0         9              0             0.0       3.341806           abnormal


The code that created the TF-IDF score will work fine on one document in the dataset (because any words not in the document have a TFIDF score of zero) but this code below is a nice way of creating a list of top terms without printing all 10,000 words and their TFIDF scores. 

You can also use the code below on a new document (i.e. not in the current dataset) that you want to use the current dataset's IDF scores on. 
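
One way to turn a long table of scores into a short list of top terms per document is sketched here. The `top_terms` helper and the sample scores are hypothetical; the column names follow the merged dataframe built above.

```python
import pandas as pd

# Hypothetical scores in the same long format as the merged dataframe
scores = pd.DataFrame({
    "Document_ID": [0, 0, 0, 1, 1, 1],
    "Token_Name": ["ability", "activation", "activity", "fibrosis", "liver", "risk"],
    "TF_IDF_for_Doc": [0.008, 0.024, 0.006, 0.030, 0.021, 0.004],
})

def top_terms(df, n=2):
    """Return the n highest-scoring tokens for each document."""
    ranked = df.sort_values(["Document_ID", "TF_IDF_for_Doc"],
                            ascending=[True, False])
    return ranked.groupby("Document_ID").head(n).reset_index(drop=True)

print(top_terms(scores))
```

Sorting once and taking the head of each group avoids looping over documents, which matters when the table has a row for every document and token pair.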

In [58]:
## Remove the rows in the dataframe where "Count_for_Doc" equals zero. 
## This will dramatically reduce the size of the dataframe

final_df = name_count_idf_tf_idf_df[name_count_idf_tf_idf_df['Count_for_Doc']!= 0]

final_df.head(20)



     Document_ID  Token_ID  Count_for_Doc  TF_IDF_for_Doc  IDF_for_Token    Token_Name
  7            0         7              1        0.007734       3.447166       ability
 49            0        49              1        0.008298       3.698481     accounted
 62            0        62              1        0.008298       3.698481      achieved
 84            0        84              3        0.023994       3.564949    activation
 92            0        92              2        0.008246       1.837728      activity
148            0       148              1        0.005942       2.648659       address
165            0       165              1        0.006179       2.754019      adjusted
170            0       170              2        0.016595       3.698481  administered
176            0       176              2        0.012358       2.754019    adolescent
188            0       188              1        0.003756       1.674099  adult cardia


In [59]:
## Token frequency per document: Count_for_Doc divided by the number of distinct tokens
## (rows with a non-zero count) in the same document.
## Note: transform('count') counts rows; use transform('sum') instead to divide by the
## total number of term occurrences in the document, as in the TF definition given earlier.

count_doc_terms = final_df.groupby(['Document_ID'])['Count_for_Doc'].transform('count')
#print(count_doc_terms)

## assign() returns a new dataframe, which avoids the pandas SettingWithCopyWarning
## raised when a column is added directly to a filtered slice
final_df = final_df.assign(Token_Freq_per_Docs=final_df['Count_for_Doc']/count_doc_terms)
print(final_df.head())


    Document_ID  Token_ID  Count_for_Doc  TF_IDF_for_Doc  IDF_for_Token  \
7             0         7              1        0.007734       3.447166   
49            0        49              1        0.008298       3.698481   
62            0        62              1        0.008298       3.698481   
84            0        84              3        0.023994       3.564949   
92            0        92              2        0.008246       1.837728   

    Token_Name  Token_Freq_per_Docs  
7      ability             0.001684  
49   accounted             0.001684  
62    achieved             0.001684  
84  activation             0.005051  
92    activity             0.003367  


In [60]:
## Re-order columns for ease of review
# setting column's order
new_df =  final_df[['Document_ID','Token_ID','Token_Name','Count_for_Doc','TF_IDF_for_Doc','Token_Freq_per_Docs','IDF_for_Token']] 
print(new_df.head())
new_df.to_csv(r'news_industry_scikitlearn.csv')

    Document_ID  Token_ID  Token_Name  Count_for_Doc  TF_IDF_for_Doc  \
7             0         7     ability              1        0.007734   
49            0        49   accounted              1        0.008298   
62            0        62    achieved              1        0.008298   
84            0        84  activation              3        0.023994   
92            0        92    activity              2        0.008246   

    Token_Freq_per_Docs  IDF_for_Token  
7              0.001684       3.447166  
49             0.001684       3.698481  
62             0.001684       3.698481  
84             0.005051       3.564949  
92             0.003367       1.837728  


### Training datasets vs. Test datasets 

You can compute the TF-IDF for document(s) that weren't originally in the dataset that you used to create the IDF scores. For example you can create the CountVectorizer and the tfidf_transformer on a training dataset (to build the model) and then use a different testing dataset (to validate the model that was built). 

If you are using a different dataset (often called a testing dataset), you still need to pre-process, remove stopwords, and lemmatize the data. You don't need to create and fit a new CountVectorizer, because you'll reuse the one you already fitted: call its transform() method, not fit_transform(), on the new documents. You also do not need to recalculate IDF scores, since these were learned from the training dataset (which is the purpose of the training dataset). Instead, you transform the new dataset with the model you fitted on the training dataset, which requires a minor change to the compute_TF_IDF() function (see below). 

1. Get test docs from their dataframe into a list: 
docs_test=df_test['text'].tolist()

2. Preprocess, remove stopwords, and lemmatize the items in the list (use functions already created). 

3. Modify the compute_TF_IDF function so that the already fitted count vectorizer is applied with cv.transform(docs_test): 

def compute_TF_IDF(docs_test, tfidf_transformer, cv):
    
    tf_idf_vector=tfidf_transformer.transform(cv.transform(docs_test))
        
    return tf_idf_vector

tf_idf_vector = compute_TF_IDF(docs_test, tfidf_transformer, cv)
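
The fit-on-training, transform-on-test pattern described above can be sketched on toy data. This is only an illustration: the two document lists are hypothetical stand-ins for the lemmatized training and test documents, and the transformer parameters are assumed to match the ones used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for the lemmatized training and test documents
train_docs = ["liver disease fibrosis", "alcohol outlet density", "liver alcohol risk"]
test_docs = ["liver fibrosis biomarker"]  # "biomarker" never appears in training

# Fit the vocabulary and the IDF weights on the training documents only
cv = CountVectorizer()
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(cv.fit_transform(train_docs))

# Reuse both fitted objects on the test documents: transform(), never fit_transform()
test_tfidf = tfidf_transformer.transform(cv.transform(test_docs))

print(test_tfidf.shape)                # one row per test doc, one column per training token
print("biomarker" in cv.vocabulary_)   # unseen test tokens have no column
```

Because the test matrix has the same columns as the training matrix, Token_ID values from both datasets refer to the same tokens.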


**Text from:**
* <https://kavita-ganesan.com/tfidftransformer-tfidfvectorizer-usage-differences/#.X8Jt9NBKiUk>

In [25]:
## Load test dataset
directory = "practice_pdfs_2"
files = list(glob.glob(os.path.join(directory,'*.*')))
print(files)

['practice_pdfs_2\\Advanced-fibrosis-associates-with-atherosclerosis-in-subjects-with-nonalcoholic-fatty-liver-diseaseAtherosclerosis.pdf', 'practice_pdfs_2\\Alcohol-outlet-density-and-related-use-in-an-urban-Black-population-in-Philadelphia-public-housing-communitiesHealth-and-Place.pdf', 'practice_pdfs_2\\Big-Data-What-Is-It-and-What-Does-It-Mean-for-Cardiovascular-Research-and-Prevention-PolicyCurrent-Cardiovascular-Risk-Reports.pdf']


In [26]:
#https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file
document_list_2 = []
for f in files:
    raw = parser.from_file(f)
    document_list_2.append(raw)

In [27]:
text_df_2 = pd.DataFrame(document_list_2)
#text_df.head()
#print(text_df["content"][1])

In [28]:
# Convert the "content" column in dataframe to a list

convert_content_to_list(text_df_2)

print(type(content_list))
print(content_list[0])

<class 'list'>

Advanced fibrosis associates with atherosclerosis in subjects with nonalcoholic fatty liver disease


lable at ScienceDirect

Atherosclerosis 241 (2015) 145e150
Contents lists avai
Atherosclerosis

journal homepage: www.elsevier .com/locate/atherosclerosis
Advanced fibrosis associates with atherosclerosis in subjects with
nonalcoholic fatty liver disease

Ying Chen a, b, 1, Min Xu a, b, 1, Tiange Wang a, b, Jichao Sun a, b, Wanwan Sun a, b,
Baihui Xu a, b, Xiaolin Huang a, b, Yu Xu a, b, Jieli Lu a, b, Xiaoying Li a, b, Weiqing Wang a, b,
Yufang Bi a, b, *, Guang Ning a, b

a State Key Laboratory of Medical Genomics, Key Laboratory for Endocrine and Metabolic Diseases of Ministry of Health, National Clinical Research Center
for Metabolic Diseases, Collaborative Innovation Center of Systems Biomedicine, Rui-Jin Hospital, Shanghai Jiao Tong University School of Medi

In [29]:
## Pre-process the text to lower case, remove special characters, etc. 
## https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn
## Test regex here: https://pythex.org/

## New column "preprocess" is formed from applying the preprocess function to each item in the "content" column in dataframe
text_df_2['preprocess'] = text_df_2['content'].apply(lambda x:preprocess(x))

print(text_df_2['preprocess'][1])

#https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

 alcohol outlet density and related use in an urban black population in philadelphia public housing communities alcohol outlet density and related use in an urban black population in philadelphia public housing communities julie a cederbaum a n robin petering a m katherine hutchinson b amy s he a john p wilson c john b jemmott iii d loretta sweet jemmott e a university of southern california school of social work w th street mrf los angeles ca usa b boston college william f connell school of nursing cushing hall commonwealth avenue chestnut hill ma usa c university of southern california spatial sciences institute dornsife college of letters arts and sciences trousdale parkway ahf b los angeles ca usa d university of pennsylvania school of medicine and annenberg school of communication market street suite philadelphia pa usa e university of pennsylvania school of nursing center for health disparities research ci blvd floor l philadelphia pa usa a r t i c l e i n f o article history rec

In [30]:
## Get stopwords

stopwords=get_stop_words("stop_words.txt")

In [31]:
## Lemmatize documents

## New column "lemmatize" is formed from applying the lemmatize function to each item in the "preprocess" column in dataframe
text_df_2['lemmatize'] = text_df_2['preprocess'].apply(lambda x:lemmatize(x, stopwords))

print(text_df_2['lemmatize'][1])


 alcohol outlet density urban black population philadelphia public housing community alcohol outlet density urban black population philadelphia public housing community julie cederbaum robin petering katherine hutchinson amy john wilson john jemmott iii loretta sweet jemmott university southern california school social work th street mrf los angeles ca usa boston college william connell school nursing cushing hall commonwealth avenue chestnut hill ma usa university southern california spatial science institute dornsife college letter art science trousdale parkway ahf los angeles ca usa university pennsylvania school medicine annenberg school communication market street suite philadelphia pa usa university pennsylvania school nursing center health disparity research ci blvd floor philadelphia pa usa article history received june received revised form september accepted october online november keywords alcohol outlet alcohol family environment adolescent alcohol behavior influenced famil

In [32]:
## Convert the "lemmatize" column in dataframe to a list

convert_lemmatized_to_list(text_df_2)
print(type(lemmatized_list))
print(len(lemmatized_list))
print(lemmatized_list[1])

<class 'list'>
3
 alcohol outlet density urban black population philadelphia public housing community alcohol outlet density urban black population philadelphia public housing community julie cederbaum robin petering katherine hutchinson amy john wilson john jemmott iii loretta sweet jemmott university southern california school social work th street mrf los angeles ca usa boston college william connell school nursing cushing hall commonwealth avenue chestnut hill ma usa university southern california spatial science institute dornsife college letter art science trousdale parkway ahf los angeles ca usa university pennsylvania school medicine annenberg school communication market street suite philadelphia pa usa university pennsylvania school nursing center health disparity research ci blvd floor philadelphia pa usa article history received june received revised form september accepted october online november keywords alcohol outlet alcohol family environment adolescent alcohol behavior

In [35]:
def compute_TF_IDF_2(docs_test, tfidf_transformer, cv):

    tf_idf_vector=tfidf_transformer.transform(cv.transform(docs_test))

    return tf_idf_vector

tf_idf_vector = compute_TF_IDF_2(lemmatized_list, tfidf_transformer, cv)

In [37]:
## Organize data into dataframes and join to show final results

## todense() converts the sparse TF-IDF matrix into a dense numpy matrix, which pd.DataFrame() then turns into a dataframe
tf_idf_df_all_2 = pd.DataFrame(tf_idf_vector.todense())

## The .stack() function returns a Pandas.Series, which in this case has a multi-level index
## The return is: multi-level index of doc, token_id, and a value of tfidf

tf_idf_sr_all_2 = tf_idf_df_all_2.stack()

## Review the data by retrieving the first five elements in the Series
#print(tf_idf_sr_all[:5])

## Review the data by selecting the rows for documents 3 through 4 (first index level) using .loc
#print(tf_idf_sr_all.loc[3:4])

## Review the data by printing the series index
#print(tf_idf_sr_all.index) 

## Now, turn the Series into dataframe
## Use reset_index(inplace=True) to set the multi-level index as columns

tf_idf_df_final_2=pd.DataFrame(tf_idf_sr_all_2)
tf_idf_df_final_2.reset_index(inplace=True)

## Change the column names to something more useful
## The "inplace = True" means the original dataframe is changed
tf_idf_df_final_2.rename(columns={'level_0': 'Document_ID', 'level_1': 'Token_ID', 0: 'TF_IDF_for_Doc'}, inplace=True)
tf_idf_df_final_2.head(50)

## Resources
## https://www.geeksforgeeks.org/python-pandas-series-index/#:~:text=index,-Last%20Updated%3A%2028&text=The%20object%20supports%20both%20integer,performing%20operations%20involving%20the%20index.
## https://discuss.analyticsvidhya.com/t/how-to-convert-the-multi-index-series-into-a-data-frame-in-python/5119/2

                                                      

   Document_ID  Token_ID  TF_IDF_for_Doc
0            0         0             0.0
1            0         1             0.0
2            0         2             0.0
3            0         3             0.0
4            0         4             0.0
5            0         5             0.0
6            0         6             0.0
7            0         7             0.0
8            0         8             0.0
9            0         9             0.0
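
The stacked dataframe above contains one row for every (document, token) pair, so most rows are 0.0. A minimal sketch of one way to keep only the tokens that actually occur and attach their names follows. The small matrix and the `feature_names` list are hypothetical stand-ins for `tf_idf_vector.todense()` and the vectorizer's vocabulary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a tiny dense TF-IDF matrix (2 docs x 4 tokens) and its vocabulary
dense = np.array([[0.0, 0.8, 0.0, 0.6],
                  [0.5, 0.0, 0.0, 0.0]])
feature_names = ["ability", "alcohol", "density", "liver"]

# Same stack / reset_index steps as in the cell above
long_df = pd.DataFrame(dense).stack().reset_index()
long_df.columns = ["Document_ID", "Token_ID", "TF_IDF_for_Doc"]

# Keep only tokens that occur in each document, then attach the token text
nonzero_df = long_df[long_df["TF_IDF_for_Doc"] > 0].copy()
nonzero_df["Token_Name"] = nonzero_df["Token_ID"].map(lambda i: feature_names[i])

# Highest-scoring tokens first within each document
nonzero_df = nonzero_df.sort_values(["Document_ID", "TF_IDF_for_Doc"],
                                    ascending=[True, False])
print(nonzero_df)
```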


## THIS IS CODE FROM A TUTORIAL ON SORTING THE FINAL DATA

In [None]:
## Computing TF-IDF and Extracting Keywords
## https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
## https://stackoverflow.com/questions/4319014/iterating-through-a-scipy-sparse-vector-or-matrix

# def sort_coo(coo_matrix):
#     tuples = zip(coo_matrix.col, coo_matrix.data)
    
#     return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

# def extract_topn_from_vector(feature_names, sorted_items, topn=10):
#     """get the feature names and tf-idf score of top n items"""
    
#     #use only topn items from vector
#     sorted_items = sorted_items[:topn]

#     score_vals = []
#     feature_vals = []
    
#     # word index and corresponding tf-idf score
#     for idx, score in sorted_items:
        
#         #keep track of feature name and its corresponding score
#         score_vals.append(round(score, 3))
#         feature_vals.append(feature_names[idx])

#     #create a tuples of feature,score
#     #results = zip(feature_vals,score_vals)
#     results= {}
#     for idx in range(len(feature_vals)):
#         results[feature_vals[idx]]=score_vals[idx]
    
#     return results

In [None]:
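# # you only need to do this once; this is a mapping of index to feature name
# # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
# feature_names=cv.get_feature_names_out()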

# # get the document that we want to extract keywords from
# content=content_list[1]

# #generate tf-idf for the given document
# tf_idf_vector=tfidf_transformer.transform(cv.transform([content]))

# #sort the tf-idf vectors by descending order of scores
# ## tocoo() means to convert this matrix to COOrdinate format.
# sorted_items=sort_coo(tf_idf_vector.tocoo())

# #extract only the top n; n here is 10
# keywords=extract_topn_from_vector(feature_names,sorted_items,10)

# # now print the results
# print("\n=====Doc=====")
# print(content)
# print("\n===Keywords===")
# for k in keywords:
#     print(k,keywords[k])
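
A compact, runnable version of the commented-out tutorial code above may make the sorting step easier to follow. It is a sketch only: the one-row matrix and the `feature_names` list are hypothetical stand-ins for a real TF-IDF vector and the vectorizer's vocabulary.

```python
from scipy.sparse import csr_matrix

def sort_coo(coo_matrix):
    """Return (column index, score) pairs sorted by score, highest first."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Map the top-n (index, score) pairs to a {feature name: rounded score} dict."""
    return {feature_names[idx]: round(float(score), 3)
            for idx, score in sorted_items[:topn]}

# Hypothetical one-document TF-IDF row and vocabulary, for illustration only
feature_names = ["alcohol", "density", "housing", "outlet", "urban"]
tf_idf_vector = csr_matrix([[0.62, 0.31, 0.0, 0.55, 0.12]])

# tocoo() exposes the .col and .data arrays of the non-zero entries
sorted_items = sort_coo(tf_idf_vector.tocoo())
keywords = extract_topn_from_vector(feature_names, sorted_items, topn=3)

for k in keywords:
    print(k, keywords[k])
```

Only non-zero entries are stored in the sparse row, so tokens that do not occur in the document never enter the ranking.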


## Topic Modeling using Gensim LDA Library on Stemmed Tokens

Topic modeling is a technique for taking unstructured text and automatically extracting its common themes. It is a great way to get a bird's-eye view of a large text collection. 

Gensim (short for "Generate Similar") is a popular open-source natural language processing library used for unsupervised topic modeling. 

The Gensim library implements a popular topic modeling algorithm, Latent Dirichlet Allocation (LDA). LDA requires documents to be represented as a bag of words (some of the Gensim API calls shorten this to "bow"). This representation ignores word order in the document but retains how many times each word appears.

The main distinguishing feature of LDA is that it allows for mixed membership, which means that each document can partially belong to several different topics. Note that the word probabilities sum to 1 for every topic, but words with lower weights are often truncated from the output.

Text modified from: 
* <https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA>
* <https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py>
* <https://www.tutorialspoint.com/gensim/index.htm>
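
To make the bag-of-words representation concrete, here is a minimal pure-Python sketch. It only imitates the shape of a Gensim bag-of-words corpus (lists of `(token_id, count)` pairs); the `token2id` mapping and `to_bow` helper are hypothetical and are not Gensim functions.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [["liver", "disease", "liver", "risk"],
        ["alcohol", "risk", "alcohol", "alcohol"]]

# Assign every distinct token an integer ID (sorted here only for reproducibility)
token2id = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

def to_bow(tokens):
    """Return sorted (token_id, count) pairs; word order is discarded."""
    counts = Counter(token2id[t] for t in tokens)
    return sorted(counts.items())

bow_corpus = [to_bow(d) for d in docs]
print(token2id)
print(bow_corpus)
```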


In [None]:
##Pre-process the text by making all terms lower case, remove special characters and numbers
##Code from: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn

def pre_process(text):
    
    ##lowercase
    text=text.lower()
    
    ##remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    ## remove special characters and space, but leave in periods and numbers
    #text=re.sub('[^A-Za-z0-9.]+|\s',' ',text)
    
    ##remove tags
    #text=re.sub("","",text)
    
    ##Remove Emails
    #text=re.sub('\S*@\S*\s?', '', text) 

    ##Remove new line characters
    #text=re.sub('\s+', ' ', text)

    ##Remove distracting single quotes
    #text=re.sub("\'", "", text)

    return text


text_df['preprocess'] = text_df['content'].apply(lambda x:pre_process(x))

# list(text_df.columns)
#show the second 'text' just for fun
# text_df['preprocess'][1]
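
To see what the active regular expression in `pre_process` does, it can be run on a short string. The sample sentence below is made up for illustration.

```python
import re

def pre_process(text):
    """Lowercase, then collapse every run of digits / non-word characters to one space."""
    text = text.lower()
    return re.sub(r"(\d|\W)+", " ", text)

# Hypothetical sample text
sample = "Atherosclerosis 241 (2015): NAFLD & liver_fibrosis!"
print(repr(pre_process(sample)))
```

Digits and punctuation disappear, while the underscore survives because `\W` does not match it. A trailing space can also remain, which the later tokenization step ignores.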


In [None]:
##Then break the document text into tokens, remove the stopwords, and stem the tokens
##Code from: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn
doc_token_list=[]
    
def tokenize_stem(documents):
    
    ##Create PorterStemmer
    ##The better stemmer is the SnowballStemmer for English
    ##https://www.nltk.org/howto/stem.html
    ##p_stemmer = PorterStemmer()
    
    
    ##Create SnowballStemmer
    sb_stemmer = SnowballStemmer("english")
    

    ##Open stop words text file and save to stop_set variable
    ##The with block closes the file automatically
    with open("stop_words.txt", 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)

    ##Stopword list comes from the Terrier package with 733 words and another 86 custom terms: 
    ##https://github.com/kavgan/stop-words/blob/master/terrier-stop.txt
    ##https://github.com/kavgan/stop-words/blob/master/minimal-stop.txt
    
    ##Other stopword list options can be reviewed here:
    ##https://medium.com/towards-artificial-intelligence/stop-the-stopwords-using-different-python-libraries-ffa6df941653


    for doc in documents.dropna():

        # Tokenize documents by splitting into words using NLTK's word_tokenize 
        token_list = nltk.word_tokenize(doc)

        # Remove stop words from token_list (iterating over doc would yield single characters)
        token_nostop_list = [i for i in token_list if i not in stop_set]
        
        # Use Porter Stemmer to stem tokens to create more like-words
        #token_stem_list = [p_stemmer.stem(i) for i in token_nostop_list if len(i) > 3]
            
        # Use Snowball Stemmer to stem tokens to create more like-words
        token_stem_list = [sb_stemmer.stem(i) for i in token_nostop_list if len(i) > 3]
            
        #Append token_stem_list to the doc_token_list
        doc_token_list.append(token_stem_list)


    return doc_token_list

tokenize_stem(text_df['preprocess'])
# print(type(doc_token_list))
# print(doc_token_list[1])
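
One pitfall in this step is worth illustrating: stop words must be filtered from the token list, not from the document string, because iterating over a string yields single characters. The sketch below uses a hypothetical inline stop-word set and a whitespace split as a stand-in for `nltk.word_tokenize`.

```python
# Hypothetical inline stop-word set and document (the notebook reads stop_words.txt)
stop_set = {"and", "in", "an", "a"}
doc = "alcohol outlet density and related use in an urban population"

# Stand-in tokenizer: whitespace split instead of nltk.word_tokenize
token_list = doc.split()

# Filtering the token list removes whole stop words
from_tokens = [t for t in token_list if t not in stop_set]

# Filtering the raw string iterates character by character, so whole words
# are never compared against the stop-word set
from_string = [c for c in doc if c not in stop_set]

# The len(i) > 3 filter applied before stemming then behaves very differently
print([t for t in from_tokens if len(t) > 3])
print([c for c in from_string if len(c) > 3])
```

The second list is empty, since no single character is longer than three characters.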


In [None]:
##Run the gensim topic modeling and return the topics
##Code from: https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA
import pprint
from gensim import corpora

def get_gensim_corpus_dictionary(data):
    ##If content is not yet a list, make it a list and build the id2word dictionary and the corpus (map the word to id)
    ##texts = text_df['content'].apply(lambda x: x.split(' ')).tolist()
    ##print(texts)

    ##Build the id2word dictionary and the corpus
    ##The dictionary associates each word in the corpus with a unique integer ID
    dictionary = corpora.Dictionary(data)
    print('Number of unique tokens: ', len(dictionary))

    ## Filter out words that appear in fewer than 2 documents.
    ## Note: filter_extremes() also applies its defaults no_above=0.5 and keep_n=100000,
    ## so words that appear in more than 50% of the documents are removed here as well.
    dictionary.filter_extremes(no_below = 2)

    ## To keep very frequent words, raise no_above explicitly, e.g.:
    # dictionary.filter_extremes(no_below = 2, no_above = 1.0)

    # Remove gaps in id sequence after words that were removed
    dictionary.compactify()
    print('Number of unique tokens after filtering: ', len(dictionary))

    ##Use code below to print terms in dictionary with their IDs
    ##This will show you the number of the terms in the dictionary
    #print("Dictionary Tokens with ID: ")
    #pprint.pprint(dictionary.token2id)
    
    ##Map terms in corpus to words in dictionary with ID
    ##This will show you the ID of the term in the dictionary, and the number of times the terms occurs in the corpus
    bow_corpus = [dictionary.doc2bow(text) for text in data]
    #print("Tokens in Corpus with Occurrence: ")
    #pprint.pprint(corpus)
    
    ##Print word count by vector 
    id_words_count = [[(dictionary[id], count) for id, count in line] for line in bow_corpus]
    print("Word Count in each Vector: ")
    pprint.pprint(id_words_count[2])
    
     
    return bow_corpus, dictionary




bow_corpus, dictionary = get_gensim_corpus_dictionary(doc_token_list)
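
The effect of the `no_below` and `no_above` thresholds can be sketched without Gensim. The `keep_token` helper below is a hypothetical re-implementation of the filtering rule as I understand it (document-frequency bounds, with the fractional upper bound converted to an absolute count); it is not the library's code.

```python
from collections import Counter

# Hypothetical tokenized documents
docs = [["liver", "disease", "risk"],
        ["liver", "alcohol", "risk"],
        ["liver", "housing", "alcohol"],
        ["liver", "urban"]]

# Document frequency: in how many documents does each token occur at least once?
doc_freq = Counter(tok for d in docs for tok in set(d))

def keep_token(tok, no_below=2, no_above=0.5):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    df = doc_freq[tok]
    return no_below <= df <= int(no_above * len(docs))

kept = sorted(t for t in doc_freq if keep_token(t))
print(dict(doc_freq))
print(kept)
```

In this toy corpus "liver" is dropped because it occurs in every document, and the tokens that occur in a single document are dropped by `no_below`.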



In [None]:
## Run the Gensim Library LDA Model
## See link below if you want to save and load a model
## https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA
from gensim.models import LdaModel, CoherenceModel

def run_gensim_LDA_model(corpus, dictionary):
    ##Directory for storing all lda models
    model_dir = 'lda_checkpoint'

    ##If the model_dir directory does not exist, then make the directory
    if not os.path.isdir(model_dir):
        os.mkdir(model_dir)

    ##Load the model if we've already trained it before
   
    path = os.path.join(model_dir, 'topic_model.lda')
    if not os.path.isfile(path):
        ##Training LDA can take some time, we could set eval_every = None to not evaluate the model perplexity
        ##Other parameters for LdaModel, include: random_state=100, update_every=1,chunksize=100,passes=10,alpha='auto',per_word_topics=True
        topic_model = LdaModel(corpus, id2word = dictionary, num_topics = 3, iterations = 200)
        topic_model.save(path)
 
    topic_model = LdaModel.load(path)

    # Each element of the list is a tuple containing the topic and word / probability list
    topics = topic_model.show_topics(num_words = 10, formatted = False)

    print(type(topics))
    
    ##Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. 
    ##In my experience, topic coherence score, in particular, has been more helpful.
    #https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#:~:text=Topic%20Modeling%20is%20a%20technique,in%20the%20Python's%20Gensim%20package.

    ## Compute the per-word likelihood bound (log scale); higher (closer to 0) is better.
    ## The corresponding perplexity is 2**(-bound), where lower is better.
    print('\nPer-word bound: ', topic_model.log_perplexity(corpus))

    ## Compute Coherence Score
    ## c_v coherence needs the tokenized texts (lists of tokens), not the bag-of-words corpus
    coherence_model_lda = CoherenceModel(model=topic_model, texts=doc_token_list, dictionary=dictionary, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('\nCoherence Score: ', coherence_lda)
    
    
    return topics

topics = run_gensim_LDA_model(bow_corpus, dictionary)
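
The value returned by `log_perplexity()` is a negative per-word bound rather than the perplexity itself. A small arithmetic sketch of the conversion follows, assuming the relationship perplexity = 2^(-bound) that Gensim reports in its training log; the two bound values are made up.

```python
# Hypothetical per-word bounds, as might be returned by log_perplexity()
per_word_bound = -7.25
better_bound = -6.5   # closer to zero

# Perplexity estimate: 2 raised to the negative bound
perplexity = 2 ** (-per_word_bound)
better_perplexity = 2 ** (-better_bound)

print(round(perplexity, 1), round(better_perplexity, 1))
```

A bound closer to zero corresponds to a lower perplexity, so when comparing models the bound should go up while the perplexity goes down.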

In [None]:
# Save topics to CSV

def create_topic_CSV(topics):
    
    ##Create dataframe for topics
    df_topics = pd.DataFrame(topics, columns = ['TopicNum', 'Terms'])
    #df_topics.head()

    ## Save dataframe to csv
    df_topics.to_csv(r"topic_modeling.csv", encoding='utf-8')
    
create_topic_CSV(topics)
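
With `formatted = False`, each topic is a tuple of a topic number and a list of (term, weight) pairs, so the "Terms" column above holds a whole list per cell. If one row per term is easier to work with in a spreadsheet, the topics can be flattened first. The sketch below uses hypothetical topic data in that shape.

```python
import pandas as pd

# Hypothetical topics in the shape returned by show_topics(formatted=False):
# a list of (topic number, [(term, weight), ...]) tuples
topics = [(0, [("liver", 0.031), ("fibrosis", 0.024)]),
          (1, [("alcohol", 0.040), ("outlet", 0.022)])]

# One row per (topic, term) pair instead of one list-valued cell per topic
rows = [(topic_num, term, weight)
        for topic_num, terms in topics
        for term, weight in terms]
df_long = pd.DataFrame(rows, columns=["TopicNum", "Term", "Weight"])
print(df_long)
```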

In [None]:
## Run the Gensim Library TFIDF Model 
##Words that occur in more of the documents in the corpus get smaller weights.
##https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#sphx-glr-auto-examples-core-run-core-concepts-py
from gensim import models
import pprint
##new_list = []

tfidf_frequency = []

def run_gensim_tfidf_model(corpus, dictionary): 
    
    ##Initialize the tf-idf model, training it on our corpus 
    tfidf = models.TfidfModel(corpus)
    
    ##if working with a new document, you can get tfidf from the model
    #new_doc = "abbott bra adolesc".lower().split()
    #print(new_doc)
    #new_list.append(tfidf[dictionary.doc2bow(new_doc)])
    
    corpus_tfidf = tfidf[corpus]
    for doc in corpus_tfidf:
        ##pprint.pprint(doc)
        tfidf_frequency.append(doc)
    
    #Print word frequencies by vector 
    id_words_frequency = [[(dictionary[id], frequency) for id, frequency in line] for line in tfidf_frequency]
    print("Word Frequency by Vector: ")
    pprint.pprint(id_words_frequency[2])
    
run_gensim_tfidf_model(bow_corpus, dictionary)

#pprint.pprint(tfidf_frequency)
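
The weighting can be worked through by hand on a tiny corpus. The sketch below follows my understanding of Gensim's default scheme (raw count times log2(N / document frequency), then scaling each document vector to unit length and dropping zero weights); treat the formula as an assumption, and note that the `tfidf` helper is hypothetical rather than the library's code.

```python
import math
from collections import Counter

# Hypothetical bag-of-words corpus: each document is a list of (token_id, count) pairs
corpus = [[(0, 2), (1, 1)],
          [(0, 1), (2, 3)],
          [(0, 1), (1, 1), (2, 1)]]

# Document frequency of each token ID
doc_freq = Counter(tok_id for doc in corpus for tok_id, _ in doc)
n_docs = len(corpus)

def tfidf(doc):
    """Weight = count * log2(N / df), then scale the vector to unit length."""
    weights = [(tok_id, count * math.log2(n_docs / doc_freq[tok_id]))
               for tok_id, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return []
    # Tokens present in every document have weight 0 and are dropped
    return [(tok_id, w / norm) for tok_id, w in weights if w > 0]

for doc in corpus:
    print(tfidf(doc))
```

Token 0 occurs in all three documents, so its weight is zero everywhere, which is the "smaller weights for widespread words" behavior taken to its limit.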
    

## Topic Modeling using Gensim LDA Library on Lemmatized Tokens

In [None]:
##Pre-process the text by making all terms lower case, remove special characters and numbers
##Code from: https://kavita-ganesan.com/extracting-keywords-from-text-tfidf/#.X7RHltBKiUn

data_pre_process = []

def preprocess_data(data): 
    
    for email in data:
        
        ##lowercase
        text_lower=email.lower()
        
        ##Remove Emails
        text_email=re.sub('\\S*@\\S*\\s?', '', text_lower) 
        
        ##remove special characters and digits
        text_special=re.sub("(\\d|\\W)+"," ",text_email)
        
  
        data_pre_process.append(text_special)
    
    return data_pre_process

preprocess_data(data)
print(data_pre_process[1])
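
The order of the two substitutions in `preprocess_data` matters: the email pattern relies on the "@" character, which the special-character pattern would remove. A short sketch with a made-up address shows the difference.

```python
import re

# Hypothetical sample text with a made-up address
text = "Contact julie.cederbaum@example.org for data 2015"

# Email removal first: the whole address disappears
step1 = re.sub(r"\S*@\S*\s?", "", text.lower())
clean = re.sub(r"(\d|\W)+", " ", step1)

# Reversed order: "@" and "." become spaces first, so the address is split into
# ordinary-looking words that the email pattern can no longer match
step1_rev = re.sub(r"(\d|\W)+", " ", text.lower())
leaky = re.sub(r"\S*@\S*\s?", "", step1_rev)

print(repr(clean))
print(repr(leaky))
```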

In [None]:
import gensim

data_words = []

def tokenize(documents):
    for doc in documents:
        token_list = gensim.utils.simple_preprocess(str(doc), deacc=True)  # deacc=True removes punctuations
        data_words.append(token_list)
    return data_words


tokenize(data_pre_process)
# print(type(data_words))
print(data_words[1])

In [None]:
def built_bigram_trigram_models(documents):
    
   
    ##Building Bigram & Trigram Models
    ##higher threshold fewer phrases.
    bigram = gensim.models.Phrases(documents, min_count=5, threshold=100)
    trigram = gensim.models.Phrases(bigram[documents], threshold=100)
        
    ##Faster way to get a sentence clubbed as a trigram/bigram
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)
          
    ##See trigram example
    print(trigram_mod[bigram_mod[documents[0]]])
        
    return bigram_mod, trigram_mod
        
bigram_mod, trigram_mod = built_bigram_trigram_models(data_words)
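
The comment "higher threshold fewer phrases" can be illustrated with the collocation score itself. The sketch below assumes the Mikolov-style formula that I believe Gensim's default scorer uses, (bigram_count - min_count) / count_a / count_b * vocabulary size. The `phrase_score` helper and all counts are hypothetical.

```python
def phrase_score(worda_count, wordb_count, bigram_count, len_vocab, min_count):
    """Collocation score: high when the pair co-occurs more often than chance."""
    return (bigram_count - min_count) / worda_count / wordb_count * len_vocab

# Hypothetical counts from a corpus with a 5,000-word vocabulary
scores = {
    "fatty_liver": phrase_score(worda_count=40, wordb_count=60, bigram_count=35,
                                len_vocab=5000, min_count=5),
    "the_study": phrase_score(worda_count=900, wordb_count=300, bigram_count=30,
                              len_vocab=5000, min_count=5),
}

# A pair is merged into a phrase only if its score exceeds the threshold
for threshold in (10, 100):
    kept = [name for name, s in scores.items() if s > threshold]
    print(threshold, kept)
```

With these made-up counts, a threshold of 10 keeps one phrase and a threshold of 100 keeps none, so a very high threshold on a small corpus may produce no bigrams at all.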
 

In [None]:
doc_no_stop_list = []

def remove_stop_words(documents):

    ##Open stop words text file and save to stop_set variable
    ##The with block closes the file automatically
    with open("stop_words.txt", 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)

    ##Stopword list comes from the Terrier package with 733 words and another 86 custom terms: 
    ##https://github.com/kavgan/stop-words/blob/master/terrier-stop.txt
    ##https://github.com/kavgan/stop-words/blob/master/minimal-stop.txt
    
    ##Other stopword list options can be reviewed here:
    ##https://medium.com/towards-artificial-intelligence/stop-the-stopwords-using-different-python-libraries-ffa6df941653

    for doc in documents:
        
        # Remove stop words from doc in documents
        token_no_stop_list = [i for i in doc if not i in stop_set]
        
        #Append token_stem_list to the doc_token_list
        doc_no_stop_list.append(token_no_stop_list)
    
    return  doc_no_stop_list 
            
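##Use the tokens from this section (data_words), not the stemmed doc_token_list from the previous section
remove_stop_words(data_words)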


In [None]:
def make_bigrams(documents, bigram_mod):
    
    ##Apply the bigram model to every document
    return [bigram_mod[doc] for doc in documents]

##Keep the result; otherwise the returned list is discarded
bigram_list = make_bigrams(doc_no_stop_list, bigram_mod)

In [None]:
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

### Unused Code

In [None]:
## Get Stop Words

# def get_stop_words(stop_file_path):
# #     """load stop words """
    
#     with open(stop_file_path, 'r', encoding="utf-8") as f:
#         stopwords = f.readlines()
#         stop_set = set(m.strip() for m in stopwords)
#         return stop_set

# #load a set of stop words
# my_stopwords=get_stop_words("stop_words.txt")

In [None]:
## Convert the "content" column in dataframe to a list
## The Count Vectorize and the fit transform() for Scikit-Learn expects an iterable 
## or list of strings or file objects, and creates a dictionary of the vocabulary on the corpus.

# def convert_content_to_list(text_df):
    
#     global content_list
    
#     content_list = text_df['preprocess'].tolist()

#     return content_list

# convert_content_to_list(text_df)

# print(type(content_list))
# print(content_list[0])

In [None]:
## https://medium.com/@cristhianboujon/how-to-list-the-most-common-words-from-text-corpus-using-scikit-learn-dad4d0cab41d

# tfidf_transformer = TfidfTransformer(smooth_idf=True,use_idf=True)
# X_train_tfidf = tfidf_transformer.fit_transform(word_count_vector)
# X_train_tfidf.shape
# feature_names = cv.get_feature_names() 

# # count matrix 
# count_vector=cv.transform(docs) 
 
# # tf-idf scores 
# tf_idf_vector=tfidf_transformer.transform(count_vector)


# # create DataFrame using data 
# tf_idf_df_test = pd.DataFrame(X_train_tfidf) 
 
# sum_words = X_train_tfidf.sum(axis=0)
# print(sum_words)
# words_freq = [(word, sum_words[0, idx]) for word, idx in cv.vocabulary_.items()]
# print(type(words_freq))
# words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
# #print(words_freq)

# for word, freq in words_freq:
#     print(word, freq)
# #print the scores 
# #tf_idf_df_test = pd.DataFrame.from_items(X_train_tfidf)
# #tf_idf_df_test.sort_values(by=["tfidf"],ascending=False)

# tf_idf_df_test.head()

# tf_idf_df_test.to_csv(r'tfidf_values_test.csv')

In [None]:
## Unused code 

# loop each item in transformed_documents_as_array, using enumerate to keep track of the current position

## Turn the tf_idf_vector (which is a matrix) into a coo matrix.
# test = tf_idf_vector.tocoo()
#print(test)

# tf_idf_df_for_all = {}

# for counter, doc in enumerate(tf_idf_vector):
#     # construct a dataframe
#     tf_idf_tuples = list(zip(cv.get_feature_names(), doc))
#     print("This is counter: ", counter)
# #     print("This is doc: ", doc)
# #     print("THis is doc type: ", type(doc))
#     print("This is tf_df tuple: ", tf_idf_tuples)
#     print("This is tuple type: ", type(tf_idf_tuples))
# #     recs = tf_idf_tuples
# #     r = recs.fromrecords(recs, names='term, score')
# #     one_dict = [dict(zip(r.dtype.names,x)) for x  in r]
# #     print(one_dict)
    
#     one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, index= doc, columns=['index', 'term', 'score']).sort_values(by='score', ascending=False)
    ## from_records creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame
               # adding column name to the respective columns 
#test_df_stack.columns = ['Index', 'Token_ID', 'TF-DF_inDocument'] 
# test_df_stack.head()
# iterating the columns 
# for col in test_df_stack.columns: 
#     print(col) 

# print(test_df.columns)
# print(test_df.index)           

In [None]:
# transformed_documents_as_array = transformed_documents.toarray()
# # use this line of code to verify that the numpy array represents the same number of documents that we have in the file list
# len(transformed_documents_as_array)

# ##Each item in transformed_documents_as_array is an array of its own representing one document from our corpus. 

In [None]:
# import pandas as pd

# # loop each item in transformed_documents_as_array, using enumerate to keep track of the current position
# for counter, doc in enumerate(transformed_documents_as_array):
#     # construct a dataframe
#     tf_idf_tuples = list(zip(vectorizer.get_feature_names(), doc))
#     one_doc_as_df = pd.DataFrame.from_records(tf_idf_tuples, columns=['term', 'score']).sort_values(by='score', ascending=False).reset_index(drop=True)

#     # output to a csv using the enumerated value for the filename
#     one_doc_as_df.to_csv(r'tf_idf_values_all.csv')