## Create a Bag of Words using Gensim

Prior to running this code, complete these notebooks:
* NLP_Data_Loading
* NLP_Data_Preprocessing

In [1]:
## General Dependencies
import re
import numpy as np
import pandas as pd
from pprint import pprint
import sys, os
import glob
from tika import parser # pip install tika
import inspect
import datetime
import pickle5 as pickle

## Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim import models
#from gensim.models.coherencemodel import CoherenceModel
from gensim.models import CoherenceModel
from gensim.models import LdaModel
from gensim.models.wrappers import LdaMallet
from gensim.models import ldaseqmodel


## Preprocessing
import spacy
import nltk as nltk
from nltk.stem import WordNetLemmatizer 
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

## Plotting
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import ast

## Other Libraries
from operator import itemgetter

## ScikitLearn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


In [2]:
## Load necessary data
## Open texts_out_2 pickle file

file_name = "output/processing/texts_out_2.pkl"

with open(file_name, "rb") as open_file:
    texts_out_2 = pickle.load(open_file)

### Create a bag of words from the corpus

Using the gensim Dictionary, we generate a bag-of-words representation for each document in the corpus. The resulting bag-of-words corpus is saved to disk at the end of this notebook. In the following code, "bag-of-words" is abbreviated as bow.

Text from: <https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb>


#### What is required by get_gensim_corpus_dictionary()?

* **texts_out_2:** Tokenized, lemmatized words for each document in your corpus, in the form of a "list of lists". 
    * Example: [['identify', 'nonprofit', 'member', 'health'], ['member', 'health', 'service'], ['identify', 'heart', 'health', 'diet']]
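
Before building the dictionary, it can help to confirm that the loaded object really has this list-of-lists shape. The check below is a minimal sketch; `is_tokenized_corpus` is a hypothetical helper name, not part of gensim or of the earlier notebooks.

```python
def is_tokenized_corpus(data):
    """Return True if data is a list of documents, each a list of string tokens."""
    if not isinstance(data, list):
        return False
    return all(
        isinstance(doc, list) and all(isinstance(tok, str) for tok in doc)
        for doc in data
    )

example = [
    ['identify', 'nonprofit', 'member', 'health'],
    ['member', 'health', 'service'],
    ['identify', 'heart', 'health', 'diet'],
]
print(is_tokenized_corpus(example))            # True
print(is_tokenized_corpus(['member health']))  # False: the document is a plain string
```

In the notebook, the same check could be run as `is_tokenized_corpus(texts_out_2)` right after loading the pickle file.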

#### What are the outputs of get_gensim_corpus_dictionary()?

* **bow_corpus:** a list of lists (one inner list per document, in the same order as the data you provided) in which each word is represented by a tuple of (dictionary ID, count of the word in that document). The IDs and counts in the examples below are illustrative.
    * Example: [ **[** (0,3), (1,1), (2,1), (3,1) **]** , **[**(2,1), (3,2) **]** ,  **[** (0,1), (1,2), (2,1) **]** ] 
    * In this example, each document is represented by a single list (so there are three documents in the entire corpus) 

* **dictionary:** a gensim.corpora.dictionary.Dictionary, which maps every unique word (including bigrams and trigrams, if you created them) in the entire corpus to an integer ID.
    * Example: ['diet', 'health', 'heart', 'identify', ...]

* **id_words_count:** a list of lists (one inner list per document) of tuples. Each tuple extends the matching bow_corpus entry with the word itself: (ID from the dictionary, word, count of the word in that document).
    * Example: [ **[**(0, 'diet', 3), (1, 'health', 1), (2, 'heart', 1), (3, 'identify', 1)**]** , **[**(2, 'heart', 1), (3, 'identify', 2)**]** , **[**(0, 'diet', 1), (1, 'health', 2), (2, 'heart', 1)**]** ]
    * In this example, each document is represented by a single list (so there are three documents in the entire corpus) 
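
To make these three outputs concrete, the sketch below rebuilds them for the three-document input example in plain Python. It only mimics the idea behind `Dictionary` and `doc2bow()`. The helper `build_toy_bow` is hypothetical, and its ID assignment (alphabetical over the whole vocabulary) is an assumption that need not match the IDs gensim assigns. No frequency filtering is applied.

```python
from collections import Counter

def build_toy_bow(docs):
    """Toy bag-of-words: returns (token2id, bow_corpus, id_words_count)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    token2id = {tok: i for i, tok in enumerate(vocab)}
    id2token = {i: tok for tok, i in token2id.items()}

    # One list of (token ID, count) tuples per document, sorted by token ID
    bow = []
    for doc in docs:
        counts = Counter(token2id[tok] for tok in doc)
        bow.append(sorted(counts.items()))

    # Same structure with the word added to every tuple
    with_words = [
        [(tid, id2token[tid], n) for tid, n in doc_bow] for doc_bow in bow
    ]
    return token2id, bow, with_words

toy_docs = [
    ['identify', 'nonprofit', 'member', 'health'],
    ['member', 'health', 'service'],
    ['identify', 'heart', 'health', 'diet'],
]
toy_token2id, toy_bow, toy_id_words_count = build_toy_bow(toy_docs)
print(toy_token2id)
print(toy_bow[1])             # [(1, 1), (4, 1), (6, 1)]
print(toy_id_words_count[1])  # [(1, 'health', 1), (4, 'member', 1), (6, 'service', 1)]
```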


In [3]:
## Build the gensim dictionary and the bag-of-words corpus
## Code from: https://notebook.community/ethen8181/machine-learning/clustering/topic_model/LDA
vector_words=[]

def get_gensim_corpus_dictionary(data):

    ## Build the id2word dictionary and the corpus
    ## The dictionary associates each word in the corpus with a unique integer ID
    dictionary = corpora.Dictionary(data)
    print('Number of unique tokens prior to filtering: ', len(dictionary))

    ## Filter out words that appear in fewer than 2 documents
    dictionary.filter_extremes(no_below = 2)

    ## Filter out words that appear in more than a certain % of documents
    ## no_above = 0.5 would remove words that appear in more than 50% of the documents
    ## Note: filter_extremes() uses its defaults (no_below = 5, no_above = 0.5) for any
    ## argument that is not passed, so this second call also removes words that appear
    ## in fewer than 5 documents. To keep the no_below = 2 threshold, pass both
    ## arguments in a single call: dictionary.filter_extremes(no_below = 2, no_above = 0.5)
    dictionary.filter_extremes(no_above = 0.5)

    ## Remove gaps in id sequence after words that were removed
    # dictionary.compactify()
    # print('Number of unique tokens after filtering: ', len(dictionary))

    ##Use code below to print terms in dictionary with their IDs
    ##This will show you the number of the terms in the dictionary
    #print("Dictionary Tokens with ID: ")
    #pprint(dictionary.token2id)
    
    ##Map terms in corpus to words in dictionary with ID
    ##This will show you the ID of the term in the dictionary, and the number of times the term occurs in each document
    bow_corpus = [dictionary.doc2bow(text) for text in data]
    #print("Tokens in Corpus with Occurrence: ")
    #pprint(bow_corpus)
    
    ##Build (ID, word, count) tuples for each document
    id_words_count = [[(token_id, dictionary[token_id], count) for token_id, count in doc] for doc in bow_corpus]
    
     
    ## Print the outputs and inspect as needed
    #print(type(id_words_count))
    #print(id_words_count)
    #print(type(id_words_count[1][1]))
    #pprint(id_words_count[1])
    
     
    return bow_corpus, dictionary, id_words_count




bow_corpus, dictionary, id_words_count = get_gensim_corpus_dictionary(texts_out_2)

Number of unique tokens prior to filtering:  4349
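
The effect of `no_below` and `no_above` can be illustrated with a few lines of plain Python. This is a sketch of the idea only, not gensim's implementation. `filter_by_doc_freq` is a hypothetical helper, and it assumes the thresholds mean: keep a token if it appears in at least `no_below` documents and in no more than the fraction `no_above` of all documents.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below=2, no_above=0.5):
    """Keep tokens found in >= no_below docs and in <= no_above (fraction) of docs."""
    n_docs = len(docs)
    # set(doc) ensures a token is counted once per document (document frequency)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {
        tok for tok, df in doc_freq.items()
        if df >= no_below and df <= no_above * n_docs
    }

docs = [
    ['health', 'member', 'diet'],
    ['health', 'member'],
    ['health', 'heart', 'diet'],
    ['health', 'service'],
]
# 'health' is in all 4 documents (too common); 'heart' and 'service' are in 1 (too rare)
print(sorted(filter_by_doc_freq(docs)))  # ['diet', 'member']
```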


In [4]:
print(type(bow_corpus))
print(bow_corpus)
print(type(dictionary))
print(type(id_words_count))

<class 'list'>
[[(0, 1), (1, 1), (2, 2), (3, 1), (4, 2), (5, 2), (6, 5), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 2), (13, 3), (14, 1), (15, 2), (16, 1), (17, 2), (18, 1), (19, 1), (20, 1), (21, 5), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 2), (28, 2), (29, 1), (30, 1), (31, 2), (32, 2), (33, 2), (34, 1), (35, 2), (36, 1), (37, 1), (38, 1), (39, 3), (40, 3), (41, 1), (42, 1), (43, 1), (44, 3), (45, 2), (46, 4), (47, 4), (48, 2), (49, 1), (50, 1), (51, 1), (52, 3), (53, 4), (54, 1), (55, 2), (56, 5), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 1), (70, 2), (71, 2), (72, 1), (73, 2), (74, 1), (75, 1), (76, 2), (77, 1), (78, 2), (79, 1), (80, 2), (81, 1), (82, 1), (83, 2), (84, 2), (85, 2), (86, 2), (87, 1), (88, 1), (89, 2), (90, 4), (91, 1), (92, 1), (93, 1), (94, 4), (95, 2), (96, 3), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 2), (105, 2), (106, 1), (107, 2), (108, 5), (10

In [5]:
## Create a pandas dataframe for the Dictionary
dictionary_df = pd.DataFrame(id_words_count) 

## Stack the columns on the index which returns a Series and make into a dataframe
stack_df = dictionary_df.stack().to_frame()
## https://www.w3resource.com/pandas/dataframe/dataframe-stack.php#:~:text=The%20stack()%20function%20is,compared%20to%20the%20current%20DataFrame.

## Reset the index to remove the multi-level index
stack_df.reset_index(inplace=True)
## https://stackoverflow.com/questions/20490274/how-to-reset-index-in-a-pandas-dataframe

## Change the column names to something more useful
## The "inplace = True" means the original dataframe is changed
stack_df.rename(columns={'level_0': 'Document_ID', 'level_1': 'Token_ID_in_Doc', 0: 'Token'}, inplace=True)

## Separate the tuple into multiple columns
## https://stackoverflow.com/questions/25559202/from-tuples-to-multiple-columns-in-pandas
new_col_list = ['Dictionary_ID','Term','Count_in_Doc']
for n,col in enumerate(new_col_list):
    stack_df[col] = stack_df['Token'].apply(lambda Token: Token[n])
stack_df = stack_df.drop('Token',axis=1)

## Review the final dataframe
stack_df.head()

Unnamed: 0,Document_ID,Token_ID_in_Doc,Dictionary_ID,Term,Count_in_Doc
0,0,0,0,accelerator,1
1,0,1,1,account,1
2,0,2,2,actual,2
3,0,3,3,administration,1
4,0,4,4,alone,2
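
The same long-format rows can also be produced directly with a nested comprehension, which avoids the stack and apply steps. The sketch below is an alternative, not the method used in this notebook. It uses plain dictionaries so it runs without pandas, and the resulting list could be passed to `pd.DataFrame(...)`. The function name `flatten_id_words_count` is hypothetical.

```python
def flatten_id_words_count(id_words_count):
    """One row per (document, token) pair, using the same column names as stack_df."""
    return [
        {
            'Document_ID': doc_id,
            'Token_ID_in_Doc': pos,
            'Dictionary_ID': token_id,
            'Term': term,
            'Count_in_Doc': count,
        }
        for doc_id, doc in enumerate(id_words_count)
        for pos, (token_id, term, count) in enumerate(doc)
    ]

sample = [
    [(0, 'diet', 3), (1, 'health', 1)],
    [(1, 'health', 2)],
]
rows = flatten_id_words_count(sample)
for row in rows:
    print(row)
```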


In [7]:
## Save dataframe to csv
stack_df.to_csv("output/bow/term_counts_for_doc.csv", index=False, encoding='utf-8')

In [8]:
## Create an overall dictionary for the entire corpus by summing each term's per-document counts
overall_dictionary_df = stack_df.groupby(['Dictionary_ID', 'Term'], as_index=False)['Count_in_Doc'].sum()

## Inspect the output as needed
overall_dictionary_df.head()

Unnamed: 0,Dictionary_ID,Term,Count_in_Doc
0,0,accelerator,7
1,1,account,5
2,2,actual,16
3,3,administration,11
4,4,alone,11


In [9]:
## Save dataframe to csv
overall_dictionary_df.to_csv("output/bow/dictionary_document_counts.csv", index=False, encoding='utf-8')

In [13]:
## Count the rows for each document/term pair (each pair occurs once per document, so size is always 1)
count_dictionary_df = stack_df.groupby(['Document_ID', 'Term'], as_index=False).size().reset_index()

## Inspect the output as needed
count_dictionary_df.head()

Unnamed: 0,index,Document_ID,Term,size
0,0,0,accelerator,1
1,1,0,account,1
2,2,0,actual,1
3,3,0,administration,1
4,4,0,alone,1


In [27]:
## Count the number of documents in which each term appears
## count_dictionary_df has one row per document/term pair, so the group size is the document count
count_dictionary_final_df = count_dictionary_df.groupby('Term').size().reset_index(name='Count_Occurrence_Corpus')

## Inspect the output as needed
print(count_dictionary_final_df.columns)
count_dictionary_final_df.head()

Index(['Term', 'Count_Occurrence_Corpus'], dtype='object')


Unnamed: 0,Term,Count_Occurrence_Corpus
0,accelerator,7
1,account,5
2,active,12
3,actual,12
4,addition,12


In [28]:
## Save dataframe to csv
count_dictionary_final_df.to_csv("output/bow/dictionary_corpus_counts.csv", index=False, encoding='utf-8')

In [29]:
## Get counts of how many terms exist at each frequency
## The index is the number of documents a term appears in; the value is the number of terms with that document count
frequency_df = count_dictionary_final_df['Count_Occurrence_Corpus'].value_counts().to_frame()
print(frequency_df)

## Rename columns
## inplace=True means no copy will be made of the dataframe
count_dictionary_final_df.rename(columns={'Count_Occurrence_Corpus': 'frequency_of_occurrence_corpus'}, inplace=True)

## Inspect the output as needed
frequency_df.head()


In [30]:
## Save dataframe to csv
frequency_df.to_csv("output/bow/dictionary_frequency_corpus.csv", index=True, encoding='utf-8')
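
The summary tables built in the cells above (total count per term, number of documents per term, and number of terms at each document count) can be cross-checked on a small corpus with `collections.Counter`. This is a sketch on made-up data, independent of pandas, and the variable names are illustrative. It assumes each term appears at most once per inner list, as in a bag-of-words document.

```python
from collections import Counter

# Toy corpus in id_words_count form: (dictionary ID, term, count in document)
sample = [
    [(0, 'diet', 3), (1, 'health', 1)],
    [(1, 'health', 2), (2, 'heart', 1)],
    [(0, 'diet', 1), (1, 'health', 1)],
]

total_count = Counter()  # total occurrences of each term in the corpus
doc_count = Counter()    # number of documents containing each term
for doc in sample:
    for _token_id, term, count in doc:
        total_count[term] += count
        doc_count[term] += 1

# How many terms appear in exactly k documents
terms_per_doc_count = Counter(doc_count.values())

print(dict(total_count))          # {'diet': 4, 'health': 4, 'heart': 1}
print(dict(doc_count))            # {'diet': 2, 'health': 3, 'heart': 1}
print(dict(terms_per_doc_count))  # {2: 1, 3: 1, 1: 1}
```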

In [31]:
## Save bow_corpus
## Save the list as a .pkl file

file_name = "output/bow/bow_corpus.pkl"

with open(file_name, "wb") as open_file:
    pickle.dump(bow_corpus, open_file, protocol=4)

## Resources
## https://www.kite.com/python/answers/how-to-save-and-read-a-list-in-python
## https://stackoverflow.com/questions/25843698/valueerror-unsupported-pickle-protocol-3-python2-pickle-can-not-load-the-file
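
A round trip is a quick way to confirm that a saved object loads back unchanged. The sketch below uses the standard-library `pickle` module and a temporary directory rather than the notebook's `output/bow/` folder. Both choices are assumptions made so the example runs anywhere, and `toy_bow_corpus` is made-up data.

```python
import os
import pickle
import tempfile

toy_bow_corpus = [[(0, 3), (1, 1)], [(1, 2), (2, 1)]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bow_corpus.pkl")

    # The with-statement closes the file even if dump or load raises
    with open(path, "wb") as f:
        pickle.dump(toy_bow_corpus, f, protocol=4)

    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == toy_bow_corpus)  # True
```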

In [32]:
## Save dictionary
## Save using the gensim native save function, which is a protocol 2 .pkl file

file_name = "output/bow/dictionary.pkl"

# gensim.corpora.dictionary.Dictionary.save(file_name)
dictionary.save(file_name)


## Resources
## https://stackoverflow.com/questions/58961983/how-do-you-save-a-model-dictionary-and-corpus-to-disk-in-gensim-and-then-load
## https://www.tutorialspoint.com/gensim/gensim_creating_a_dictionary.htm
## https://tedboy.github.io/nlps/generated/generated/gensim.corpora.Dictionary.save.html
## https://stackoverflow.com/questions/25843698/valueerror-unsupported-pickle-protocol-3-python2-pickle-can-not-load-the-file

In [33]:
## Save the id_words_count
## Save the list as a .pkl file

file_name = "output/bow/id_words_count.pkl"

with open(file_name, "wb") as open_file:
    pickle.dump(id_words_count, open_file, protocol=4)

## Resources
## https://www.kite.com/python/answers/how-to-save-and-read-a-list-in-python
## https://stackoverflow.com/questions/25843698/valueerror-unsupported-pickle-protocol-3-python2-pickle-can-not-load-the-file