# Text Preparation for LDA Topic Modeling

#### Introduction:
Text pre-processing prepares raw text for further analysis. The steps involved depend on the type of data and the business problem. In our case, text preparation is broken into three steps: pre-processing, lemmatization, and noun extraction. In the pre-processing step, all text in the corpus is first lowercased so that the algorithm does not treat 'Environment' and 'environment' as two different words. Regular expressions then collapse duplicate whitespace and remove special characters, numbers, and words shorter than three characters, after which stopwords are removed. In the lemmatization step, each word is replaced with its lemma (base form) using the spaCy lemmatizer. For example, the lemma of 'plants' is 'plant', and 'classifying' becomes 'classify'. In the noun extraction step, only nouns are kept and other parts of speech are discarded, because nouns are generally more indicative of a document's topic.

#### Objective: 
Prepare the text for topic modeling by keeping only the words of interest.

#### Input: 
The input is the corpus of documents of interest, i.e. .txt files converted from their PDF versions (e.g. ESA documents in our analysis) and saved in a local directory.

#### Output:
The output is a set of .txt files, one per PDF document, written to Python's working directory. Each file contains the nouns extracted from that document in their lemmatized form.

#### Packages Used:
1. NLTK (for removing stopwords)
2. spaCy (for lemmatization and noun extraction)

#### Import required modules

In [None]:
import re
import os
import codecs
import nltk
from nltk.corpus import stopwords 
nltk.download('stopwords') #stopword list used by clean()
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.max_length = 6000000 #spaCy raises an error for texts longer than nlp.max_length (1,000,000 characters by default), so the limit has to be raised to cover the longest text fed into the spaCy library functions
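
Raising `nlp.max_length` is one way to handle long documents. Another is to split the text into pieces below the limit and process each piece separately. The sketch below only illustrates that idea and is not part of the pipeline used here: `split_text` and its `max_chars` parameter are hypothetical names, and it assumes that splitting on whitespace is acceptable for the downstream step.

```python
def split_text(text, max_chars=1_000_000):
    """Split text into whitespace-delimited chunks of at most max_chars characters.

    Hypothetical helper. A single word longer than max_chars is kept whole,
    so that one chunk can exceed the limit.
    """
    chunks, current, current_len = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        extra = len(word) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            chunks.append(" ".join(current))
            current, current_len = [word], len(word)
        else:
            current.append(word)
            current_len += extra
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_text("aa bb cc dd", max_chars=5))  # ['aa bb', 'cc dd']
```

Each chunk could then be passed to the spaCy pipeline on its own and the results joined afterwards.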

#### Pre-processing Function

In [113]:
def clean(mystr):
    my_new_str = re.sub(r"(\W| +)", " ", mystr) #replace every non-word character with a space
    my_new_str = re.sub(r"\s+", " ", my_new_str) #eliminate duplicate whitespaces
    my_new_str = re.sub(r"\d+", "", my_new_str) #remove numbers from a string
    my_new_str = my_new_str.replace('é', 'e') #map accented e to plain e before non-ASCII characters are dropped
    my_new_str = re.sub(r"[^a-zA-Z0-9]+", ' ', my_new_str) #remove special characters
    my_new_str = re.sub(r'\b\w{1,2}\b', '', my_new_str) #remove words of length less than 3 from string
    stop_pattern = r'\b(?:' + '|'.join(re.escape(w) for w in stopwords.words('english')) + r')\b\s*'
    my_new_str = re.sub(stop_pattern, '', my_new_str) #remove stopwords
    my_new_str = my_new_str.strip()
    my_new_str = re.sub(' +', ' ', my_new_str) #collapse the spaces left behind by the removals
    return my_new_str
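
To see what the regex steps do, the following standalone sketch applies the same kinds of substitutions to a short sample string. It is a simplified illustration rather than the function used in the pipeline: `clean_sketch` is a hypothetical name, `SAMPLE_STOPWORDS` is a tiny hard-coded stand-in for NLTK's English stopword list so the example runs without downloads, and the accent handling is omitted.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
SAMPLE_STOPWORDS = {"the", "and", "were", "near", "this"}
STOPWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, sorted(SAMPLE_STOPWORDS))) + r")\b"
)


def clean_sketch(text):
    text = text.lower()                          # 'Plants' and 'plants' become one token
    text = re.sub(r"[^a-z]+", " ", text)         # drop digits, punctuation, special characters
    text = re.sub(r"\b[a-z]{1,2}\b", " ", text)  # drop words shorter than three letters
    text = STOPWORD_RE.sub(" ", text)            # drop stopwords
    return re.sub(r"\s+", " ", text).strip()     # collapse leftover whitespace


sample = "The 12 Plants were classified near the River (2019), as per EPA."
print(clean_sketch(sample))  # plants classified river per epa
```

Compiling the stopword pattern once, outside the function, avoids rebuilding it for every document.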

#### spaCy Lemmatization Function

In [116]:
def lemmaspacy(my_new_str):
    doc = nlp(my_new_str) #reuse the pipeline loaded in the import cell instead of reloading the model on every call
    return " ".join([token.lemma_ for token in doc]) #joining all the word tokens after lemmatizer implementation

#### Reading Files, Calling clean() and lemmaspacy(), and Extracting Nouns

In [119]:
list_documents = [] #creating empty list to append entire text of the corpus in this list
file_names = [] #creating empty list to append name of ESA .txt files 
input_dir = "X:/xxxx/xxxx/xxxxx/" #directory path where ESAs as text files are saved
for file in os.listdir(input_dir):
    with open(os.path.join(input_dir, file), 'r', encoding='utf-8') as corpus:
        input_str = corpus.read().lower() #lowercasing all text in the corpus
    file_names.append(file) #appending file names
    input_str = clean(input_str) #calling function clean()
    input_str = lemmaspacy(input_str) #calling function lemmaspacy()
    doc = nlp(input_str) #re-parsing the lemmatized text so that noun chunks can be extracted
    list_element = " ".join(chunk.text for chunk in doc.noun_chunks) #joining spaCy noun chunks (noun phrases) separated by a space
    list_documents.append(list_element) #appending the extracted noun chunks to the list called list_documents
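
spaCy's `noun_chunks` yields whole noun phrases, so a chunk can carry modifiers along with its head noun. If only single noun tokens are wanted, one alternative is to filter on each token's part-of-speech tag. The sketch below is an illustration under stated assumptions, not the approach used above: `extract_nouns` is a hypothetical helper, and the stand-in tokens mimic only the `pos_` and `lemma_` attributes of spaCy tokens so that the example runs without a model.

```python
from types import SimpleNamespace

NOUN_TAGS = {"NOUN", "PROPN"}  # Universal POS tags for common and proper nouns


def extract_nouns(tokens):
    """Return the lemmas of tokens tagged as nouns, joined by spaces."""
    return " ".join(tok.lemma_ for tok in tokens if tok.pos_ in NOUN_TAGS)


# Stand-in tokens for illustration; with spaCy, a parsed Doc would be
# passed instead, e.g. extract_nouns(nlp(text)).
sample = [
    SimpleNamespace(lemma_="rare", pos_="ADJ"),
    SimpleNamespace(lemma_="plant", pos_="NOUN"),
    SimpleNamespace(lemma_="grow", pos_="VERB"),
    SimpleNamespace(lemma_="wetland", pos_="NOUN"),
]
print(extract_nouns(sample))  # plant wetland
```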

#### Writing the Pre-processed, Lemmatized and Noun-extracted .txt Files to Python's Working Directory

In [None]:
for file_name, document in zip(file_names, list_documents):
    with open(file_name, 'w', encoding='utf-8') as f: #explicit encoding so the output matches the UTF-8 input
        f.write(document)
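
Writing results under the original file names can overwrite the inputs if the working directory happens to be the input directory. A minimal sketch of writing to a dedicated output folder instead is shown below. `write_outputs` and the folder name `preprocessed` are hypothetical, and the function assumes `file_names` and `list_documents` are parallel lists as built above.

```python
from pathlib import Path


def write_outputs(file_names, documents, out_dir="preprocessed"):
    """Write each document to out_dir under its original file name.

    Hypothetical helper; returns the list of paths that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    written = []
    for name, text in zip(file_names, documents):
        target = out_path / Path(name).name  # keep only the file name part
        target.write_text(text.strip(), encoding="utf-8")
        written.append(target)
    return written
```

It could be called as `write_outputs(file_names, list_documents)` once the loop over the corpus has finished.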

# Appendix (other lemmatization approaches that were tried)

#### Code for WordNet Lemmatization

In [None]:
'''from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
def lemmawordnet(my_new_str):
    sentence_words = nltk.word_tokenize(my_new_str)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(word, pos='n') for word in sentence_words]) 
    return lemmatized_output
    '''

#### Code for TextBlob Lemmatization

In [None]:
'''from textblob import TextBlob, Word
def lemmatextblob(my_new_str):
    sentence = my_new_str
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)
    '''