# Text Preparation for LDA Topic Modeling

**Introduction:**
<br>Text pre-processing makes raw text ready for further analysis. The steps involved depend on the type of data and the business problem. In our case, text pre-processing is broken down into three steps:
1. Pre-processing: All text in the corpus is first lowercased so that the algorithm does not treat **'Environment'** and **'environment'** as two different words. Regular expressions are then used to collapse duplicate whitespace and to remove special characters, numbers, and words shorter than three characters. Finally, stopwords are removed.
2. Lemmatization: Words are replaced with their root form using the [spaCy](https://spacy.io/api/lemmatizer) lemmatizer. For example, the lemma of **'plants'** is **'plant'**, and likewise **'classifying'** -> **'classify'**. Lemmatization functions based on [NLTK wordnet](https://www.nltk.org/api/nltk.stem.wordnet.html) and [TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html) are also provided in the appendix of this notebook, in case you would like to use them. For this project, we used the [spaCy](https://spacy.io/api/lemmatizer) lemmatizer.
3. Noun extraction: Only nouns are kept and other parts of speech are ignored, because nouns are more indicative of a document's topic. spaCy's [part-of-speech tagging](https://spacy.io/usage/linguistic-features#pos-tagging) linguistic features are used for noun extraction (the code in this notebook iterates over `doc.noun_chunks`, spaCy's base noun phrases).

**Objective:**
<br>Prepare the text for topic modeling by keeping only the words of interest.

**Data Input:** 
<br>The input is the corpus of documents of interest, i.e. `.txt` files converted from their PDF versions (e.g. ESA documents in our analysis) and saved in a local directory.

**Output:**
<br>The output is a folder (`data/interim/Noun_text_files` under the project root in the code of this notebook) containing one `.txt` file per PDF document, holding the extracted nouns in their lemmatized root form.
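
As a rough sketch of this input/output layout, the snippet below pairs each `.txt` file in an input folder with a same-named file in an output folder. The helper name `plan_outputs` is an illustrative assumption and is not part of the notebook.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir):
    """Map each .txt file in input_dir to a same-named path in output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    # sorted() makes the order deterministic; only .txt files are considered
    return {src: output_dir / src.name for src in sorted(input_dir.glob("*.txt"))}
```

Filtering on the `.txt` suffix is a design choice of this sketch: it keeps stray files in the input folder from being fed to the pipeline.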

**Python Libraries Used:**
1. [NLTK](https://www.nltk.org/) for removing stopwords
2. [spaCy](https://spacy.io/) for Lemmatization and noun extraction


### Import required modules

In [None]:
import re
import os
import sys
import codecs
import operator
import csv
import tokenize
import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords') # required by stopwords.words('english') in clean()
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.max_length = 6000000 # spaCy rejects texts longer than nlp.max_length (1,000,000 characters by default), so raise the limit to fit the longest text fed to the pipeline

In [None]:
from pathlib import Path
ROOT_PATH = Path('.').resolve().parents[0]
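
`ROOT_PATH` is the parent of the current working directory, so the cell above assumes the notebook is run from a folder one level below the project root. The sketch below illustrates this with a made-up directory layout; the path `/home/user/project/notebooks` is purely hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical location of the notebook's working directory
cwd = PurePosixPath("/home/user/project/notebooks")

# parents[0] is the immediate parent, i.e. the project root in this layout
root_path = cwd.parents[0]
text_files_dir = root_path / "data" / "processed" / "text_files"

print(root_path)       # /home/user/project
print(text_files_dir)  # /home/user/project/data/processed/text_files
```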

### Pre-processing Function

In [None]:
def clean(mystr):
    my_new_str = re.sub(r"(\W| +)", " ", mystr) # replace non-word characters (anything but letters, digits and underscore) with a space
    my_new_str = re.sub(r'\s+', ' ', my_new_str) # eliminate duplicate whitespaces
    my_new_str = re.sub(r"\b\d+\b", "", my_new_str) # remove standalone numbers
    my_new_str = re.sub(r"\d+", "", my_new_str) # remove any remaining digits
    my_new_str = my_new_str.replace('é', 'e') # keep accented 'é' as a plain 'e' before non-ASCII characters are dropped
    my_new_str = re.sub(r"[^a-zA-Z0-9]+", ' ', my_new_str) # remove special characters
    my_new_str = re.sub(r'\b\w{1,2}\b', '', my_new_str) # remove words of length less than 3 from string
    my_new_str = re.sub(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*', '', my_new_str) # remove stopwords
    my_new_str = my_new_str.strip()
    my_new_str = re.sub(' +', ' ', my_new_str) # collapse the spaces left behind by the removals
    return my_new_str
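
To make the effect of these regular expressions concrete, here is a condensed, self-contained sketch of the same idea applied to one sentence. It is not the notebook's `clean()` function: `clean_demo` and the five-word `DEMO_STOPWORDS` set are illustrative assumptions that stand in for NLTK's English stopword list so the example runs without any downloads.

```python
import re

# Small illustrative stopword set; the notebook uses NLTK's English list instead
DEMO_STOPWORDS = {"the", "and", "was", "for", "are"}
# Compile the stopword pattern once instead of rebuilding it on every call
STOPWORD_RE = re.compile(r"\b(" + "|".join(sorted(DEMO_STOPWORDS)) + r")\b\s*")


def clean_demo(text):
    text = re.sub(r"[^a-zA-Z]+", " ", text)   # drop digits, punctuation and other symbols
    text = re.sub(r"\b\w{1,2}\b", "", text)   # drop words shorter than three characters
    text = STOPWORD_RE.sub("", text)          # drop stopwords
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


sample = "The Environment was assessed in 2019, and 12 plants are at risk!"
print(clean_demo(sample.lower()))  # environment assessed plants risk
```

Compiling the stopword pattern once at module level is a small design choice that avoids rebuilding the same alternation for every document in a large corpus.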

### spaCy Lemmatization Function

In [None]:
def lemmaspacy(my_new_str):
    # reuse the module-level nlp pipeline instead of reloading the model on every call;
    # its max_length also bounds the later nlp() call used for noun extraction
    doc = nlp(my_new_str)
    return " ".join(token.lemma_ for token in doc) # join all the word tokens after lemmatization

### Reading Files and Initializing Functions clean(), lemmaspacy() and Extracting Nouns
**Note:** The code below can take a long time to run, depending on the number of files in the corpus. You could either run it on a GPU machine or parallelize it with `multiprocessing`.
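
One possible way to follow the `multiprocessing` suggestion is sketched here with the standard library's `concurrent.futures.ProcessPoolExecutor`. The names `preprocess_file` and `preprocess_corpus` are hypothetical, and the worker only lowercases the text so that the sketch stays self-contained; in practice it would apply `clean()`, `lemmaspacy()` and the noun extraction.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def preprocess_file(path):
    """Stand-in worker: read one file and return (file name, processed text).

    In the notebook this is where clean(), lemmaspacy() and the noun
    extraction would be applied; lowercasing keeps the sketch self-contained.
    """
    text = Path(path).read_text(encoding="utf-8")
    return Path(path).name, text.lower()


def preprocess_corpus(paths, max_workers=None):
    """Process files in parallel worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(preprocess_file, paths))
```

On platforms that start worker processes with `spawn` (Windows and macOS by default), the worker function generally has to live in an importable module rather than in a notebook cell, and scripts should call `preprocess_corpus` under an `if __name__ == "__main__":` guard. spaCy's own `nlp.pipe(texts, n_process=...)` is another option worth considering.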

In [None]:
list_documents = []
file_names = []
text_files_dir = ROOT_PATH / "data" / "processed" / "text_files" # directory where the PDFs converted to text files are saved
for file in os.listdir(text_files_dir):
    with open(text_files_dir / file, 'r', encoding='utf-8') as corpus:
        input_str = corpus.read().lower()
    file_names.append(file)
    input_str = clean(input_str)
    input_str = lemmaspacy(input_str)
    doc = nlp(input_str) # parse the lemmatized text so that noun chunks are available
    # join the text of every noun chunk (base noun phrase), separated by a space
    list_documents.append(" ".join(chunk.text for chunk in doc.noun_chunks))
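
`doc.noun_chunks` yields base noun phrases, which can carry modifiers such as adjectives along with the head noun. If only noun tokens are wanted, as the introduction describes, one alternative is to filter on the coarse part-of-speech tag. The sketch below works on `(lemma, pos)` pairs so that it runs without a spaCy model; with spaCy, the pairs would come from `[(t.lemma_, t.pos_) for t in doc]`. The helper name `keep_nouns` and the choice to keep both `NOUN` and `PROPN` are assumptions, not the notebook's method.

```python
NOUN_TAGS = {"NOUN", "PROPN"}  # Universal POS tags for common and proper nouns


def keep_nouns(tagged_tokens, noun_tags=NOUN_TAGS):
    """Join the lemmas whose coarse POS tag is a noun tag."""
    return " ".join(lemma for lemma, pos in tagged_tokens if pos in noun_tags)


# Hand-written pairs standing in for [(t.lemma_, t.pos_) for t in nlp(text)]
tagged = [("project", "NOUN"), ("affect", "VERB"), ("local", "ADJ"), ("wetland", "NOUN")]
print(keep_nouns(tagged))  # project wetland
```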

### Writing all the pre-processed, lemmatized and noun extracted .txt files in Python's Working Directory

In [None]:
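output_dir = ROOT_PATH / "data" / "interim" / "Noun_text_files"
output_dir.mkdir(parents=True, exist_ok=True) # create the output folder if it does not exist yet
for file_name, document in zip(file_names, list_documents):
    with open(output_dir / file_name, 'w', encoding='utf-8') as f:
        f.write(document)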

### Appendix (other lemmatization approaches that were tried)
Code for WordNet Lemmatization

In [None]:
'''from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
def lemmawordnet(my_new_str):
    sentence_words = nltk.word_tokenize(my_new_str)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(word, pos='n') for word in sentence_words]) 
    return lemmatized_output
    '''

Code for TextBlob Lemmatization

In [None]:
'''from textblob import TextBlob, Word
def lemmatextblob(my_new_str):
    sentence = my_new_str
    sent = TextBlob(sentence)
    tag_dict = {"J": 'a', 
                "N": 'n', 
                "V": 'v', 
                "R": 'r'}
    words_and_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]    
    lemmatized_list = [wd.lemmatize(tag) for wd, tag in words_and_tags]
    return " ".join(lemmatized_list)
    '''