<h1> This notebook demonstrates how to preprocess a text corpus using NLTK, spaCy, Stanford CoreNLP, and Spark NLP.</h1>

For a breakdown of features for the libraries showcased in this notebook, check out: https://blog.dominodatalab.com/comparing-the-functionality-of-open-source-natural-language-processing-libraries/ 

<h4>The first thing to do is load a corpus of text documents from a local directory. In this example, the corpus is a collection of 11 speeches by Dr. Martin Luther King.</h4>

In [16]:
import glob
docs = []
for filename in glob.glob('./King/*.txt'): 
    with open(filename, 'r', encoding='utf-8') as f:     
        docs.append(f.read().replace('\n', ' '))
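The loading loop above is repeated before each library is run, so it can be convenient to wrap it in a function. The sketch below does that under one assumption that is not in the original: the helper name `load_corpus` is hypothetical. It also sorts the paths, because `glob.glob` returns matches in arbitrary order, and later cells refer to documents by position (first, second, and so on).

```python
import glob

def load_corpus(pattern):
    """Read every file matching `pattern` into a list of single-line strings."""
    docs = []
    # sorted() makes the document order reproducible; glob.glob alone
    # returns paths in an arbitrary, filesystem-dependent order.
    for filename in sorted(glob.glob(pattern)):
        with open(filename, 'r', encoding='utf-8') as f:
            # Flatten each document onto one line, as the notebook does.
            docs.append(f.read().replace('\n', ' '))
    return docs

# Usage, assuming the same directory layout as the cell above:
# docs = load_corpus('./King/*.txt')
```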

<h4>Now preprocess with NLTK, an NLP library commonly used in research and academia.</h4>


In [17]:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 

cleanDocs = []

# the tag_switch function allows one to switch the 
# pos tags from the format used by nltk.pos_tag 
# to one that is recognizable by WordNetLemmatizer 
# so all word tokens can be lemmatized. 

def tag_switch(word):
    
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tags = {'J': wordnet.ADJ, 'N': wordnet.NOUN,
            'V': wordnet.VERB, 'R': wordnet.ADV}
    return tags.get(tag, wordnet.NOUN)

# the code for tag_switch was taken from the nltk examples at the url below
# https://www.machinelearningplus.com/nlp/lemmatization-examples-python/

def nltk_preprocess(corpus):
    # Appends one list of lemmas per document to the global cleanDocs.
    for doc in corpus:
        cleanDocs.append([lemmatizer.lemmatize(word.lower(), tag_switch(word))
                          for word in nltk.word_tokenize(doc)
                          if word.isalpha() and word.lower() not in stop_words])
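To make the mapping inside `tag_switch` easier to see, here is a minimal sketch of just the tag conversion, written without NLTK so it runs on its own. The function name `penn_to_wordnet` and the `WORDNET_*` constants are hypothetical; the single-letter strings are the values that `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB`, and `wordnet.ADV` stand for. Note that `tag_switch` passes a one-word list to `nltk.pos_tag`, so the tagger sees each word without its sentence context.

```python
# WordNet part-of-speech values, written out as plain strings
# (the same values nltk.corpus.wordnet exposes as ADJ, NOUN, VERB, ADV).
WORDNET_ADJ, WORDNET_NOUN, WORDNET_VERB, WORDNET_ADV = 'a', 'n', 'v', 'r'

def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag such as 'VBD' to a WordNet POS string."""
    tags = {'J': WORDNET_ADJ, 'N': WORDNET_NOUN,
            'V': WORDNET_VERB, 'R': WORDNET_ADV}
    # Only the first letter matters: JJ/JJR/JJS are all adjectives, etc.
    # Anything else (determiners, prepositions, ...) falls back to noun.
    return tags.get(penn_tag[:1].upper(), WORDNET_NOUN)

for tag in ['JJ', 'NNS', 'VBD', 'RB', 'DT']:
    print(tag, '->', penn_to_wordnet(tag))
```

The noun fallback matches `tag_switch`, and it is also the default part of speech that `WordNetLemmatizer.lemmatize` uses when none is given.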

In [18]:
nltk_preprocess(docs)

<h4>To get a feel for how the corpus has been transformed, let's take a look at the first 100 tokens for the first text in cleanDocs.</h4>

In [20]:
print(cleanDocs[0][:100])

['happy', 'join', 'today', 'go', 'history', 'great', 'demonstration', 'freedom', 'history', 'nation', 'five', 'score', 'year', 'ago', 'great', 'american', 'whose', 'symbolic', 'shadow', 'stand', 'today', 'sign', 'emancipation', 'proclamation', 'momentous', 'decree', 'come', 'great', 'beacon', 'light', 'hope', 'million', 'negro', 'slave', 'sear', 'flame', 'wither', 'injustice', 'come', 'joyous', 'daybreak', 'end', 'long', 'night', 'captivity', 'one', 'hundred', 'year', 'later', 'negro', 'still', 'free', 'one', 'hundred', 'year', 'later', 'life', 'negro', 'still', 'sadly', 'cripple', 'manacle', 'segregation', 'chain', 'discrimination', 'one', 'hundred', 'year', 'later', 'negro', 'life', 'lonely', 'island', 'poverty', 'midst', 'vast', 'ocean', 'material', 'prosperity', 'one', 'hundred', 'year', 'later', 'negro', 'still', 'languish', 'corner', 'american', 'society', 'find', 'exile', 'land', 'come', 'today', 'dramatize', 'shameful', 'condition', 'sense', 'come', 'nation']


<h4>Now reload the raw corpus.</h4>

In [21]:
import glob
docs = []
for filename in glob.glob('./King/*.txt'):  
    with open(filename, 'r', encoding='utf-8') as f:     
        docs.append(f.read().replace('\n', ' '))

<h4>Next, preprocess the docs with spaCy, a widely used library that builds a Doc object for each text and provides easy access to its linguistic annotations.</h4>

In [24]:
import spacy
nlp = spacy.load('en_core_web_sm')

cleanDocs = []

def spacy_preprocess(corpus):
    for doc in range(len(corpus)):
        # Replace each raw string with its parsed spaCy Doc (modifies corpus in place).
        corpus[doc] = nlp(corpus[doc])
        cleanDocs.append([word.lemma_.lower() for word in corpus[doc]
                          if not word.is_stop and word.lemma_.isalpha()])

In [25]:
spacy_preprocess(docs)
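Because `spacy_preprocess` assigns back into `corpus`, the caller's `docs` list now holds parsed documents rather than raw strings, which is why the corpus is reloaded before each library. The sketch below illustrates the difference between that in-place pattern and one that returns a new list. Everything in it is a toy: `fake_nlp` is a hypothetical stand-in for the spaCy pipeline, and the two function names and sample strings are made up for illustration.

```python
def fake_nlp(text):
    """Hypothetical stand-in for a parsing pipeline: splits on whitespace."""
    return text.split()

def preprocess_in_place(corpus):
    # Mirrors the notebook's pattern: each string in the caller's list
    # is overwritten by its parsed form.
    for i in range(len(corpus)):
        corpus[i] = fake_nlp(corpus[i])

def preprocess_copy(corpus):
    # Leaves the input untouched and returns the parsed documents instead.
    return [fake_nlp(text) for text in corpus]

raw = ['I have a dream', 'Let freedom ring']

parsed = preprocess_copy(raw)
print(type(raw[0]).__name__)   # the input still holds strings

preprocess_in_place(raw)
print(type(raw[0]).__name__)   # the input now holds token lists
```

Either style works; the in-place version keeps the parsed documents available under the original name, at the cost of needing a fresh copy of the raw text for the next library.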

<h4>It is always good practice to take a look at the processed data to make sure the results are as expected before going further in a project or experiment. A glance at the first 100 tokens in the second doc shows that spaCy didn't lemmatize everything properly.</h4>


In [26]:
print(cleanDocs[1][:100])

['order', 'answer', 'question', 'theme', 'honestly', 'recognize', 'constitution', 'write', 'strange', 'formula', 'determine', 'taxis', 'representation', 'declare', 'negro', 'percent', 'person', 'today', 'curious', 'formula', 'declare', 'percent', 'person', 'good', 'thing', 'life', 'negro', 'approximately', 'half', 'white', 'bad', 'thing', 'life', 'twice', 'white', 'half', 'negroes', 'live', 'substandard', 'housing', 'negroes', 'half', 'income', 'white', 'view', 'negative', 'experience', 'life', 'negro', 'double', 'share', 'twice', 'unemployed', 'rate', 'infant', 'mortality', 'negroes', 'double', 'white', 'twice', 'negroes', 'die', 'vietnam', 'white', 'proportion', 'size', 'population', 'sphere', 'figure', 'equally', 'alarming', 'elementary', 'school', 'negroes', 'lag', 'year', 'white', 'segregated', 'school', 'receive', 'substantially', 'money', 'student', 'white', 'school', 'twentieth', 'negroes', 'white', 'attend', 'college', 'employed', 'negroes', 'percent', 'hold', 'menial', 'job',

<h4>Let's take a look at tokens which end in 's' in the second document to figure out why all plural nouns haven't been lemmatized with spaCy.</h4>

In [27]:
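# print the POS tag, the lowercased word, and the lemma for each token
# in the second doc that ends in 's'. docs[1] is a spaCy Doc at this point
# because spacy_preprocess replaced the raw strings in place.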
for word in docs[1]:
    if word.text.endswith('s'):
        print(word.pos_, word.lower_, word.lemma_) 

VERB is be
VERB was be
NOUN taxes taxis
VERB was be
ADJ curious curious
VERB seems seem
VERB is be
NOUN things thing
VERB has have
NOUN whites white
NOUN things thing
VERB has have
NOUN whites white
ADV thus thus
PROPN negroes Negroes
PROPN negroes Negroes
NOUN whites white
NOUN experiences experience
VERB has have
ADV as as
PROPN negroes Negroes
VERB is be
NOUN whites white
ADV as as
PROPN negroes Negroes
ADP as as
NOUN whites white
NOUN spheres sphere
NOUN figures figure
NOUN schools school
PROPN negroes Negroes
NOUN years year
NOUN whites white
NOUN schools school
ADJ less less
NOUN schools school
ADP as as
PROPN negroes Negroes
ADP as as
NOUN whites white
PROPN negroes Negroes
NOUN jobs job
DET this this
VERB is be
VERB oppresses oppress
PRON us -PRON-
NOUN values value
NOUN centuries century
VERB is be
PROPN blackness Blackness
PROPN contributions Contributions
NOUN semantics semantic
VERB is be
PART 's 's
PROPN thesaurus Thesaurus
NOUN synonyms synonym
NOUN blackness blackness
AD

<h4>There are many different NLP toolkits and no library is perfect. It looks like spaCy tagged some capitalized plural nouns as proper nouns (PROPN), for example 'Negroes' and 'Contributions', and the lemmatizer left those tokens unchanged.</h4>
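One way to check this kind of observation is to count, per POS tag, how many tokens came out of the lemmatizer unchanged. The sketch below does that on a handful of rows copied from the output above; the function name `unchanged_by_pos` is hypothetical, and on the real data the rows would come from the `pos_`, `lower_`, and `lemma_` attributes printed in the previous cell.

```python
from collections import Counter

# A few (POS tag, lowercased word, lemma) rows copied from the output above.
rows = [
    ('NOUN', 'taxes', 'taxis'),
    ('NOUN', 'whites', 'white'),
    ('PROPN', 'negroes', 'Negroes'),
    ('PROPN', 'negroes', 'Negroes'),
    ('NOUN', 'schools', 'school'),
    ('PROPN', 'contributions', 'Contributions'),
    ('ADJ', 'curious', 'curious'),
]

def unchanged_by_pos(rows):
    """Count, per POS tag, tokens whose lemma equals the word ignoring case."""
    counts = Counter()
    for pos, lower, lemma in rows:
        if lemma.lower() == lower:
            counts[pos] += 1
    return counts

print(unchanged_by_pos(rows))
```

On this sample the unchanged tokens are the three PROPN rows plus 'curious', which is already in its base form. The 'taxes' row shows a different kind of error: the lemma did change, but to 'taxis' rather than 'tax'.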

For performance benchmarks on how spaCy compares to other libraries, check out: https://spacy.io/usage/facts-figures

<h4>Now we are going to reload the corpus again.</h4>

In [29]:
import glob
docs = []
for filename in glob.glob('./King/*.txt'):  
    with open(filename, 'r', encoding='utf-8') as f:     
        docs.append(f.read().replace('\n', ' '))

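<h4>Next we are going to preprocess with StanfordNLP, the Stanford NLP Group's Python package. The Pipeline used here is a native Python neural pipeline with models for many languages; the package also includes a wrapper for the Java-based Stanford CoreNLP.</h4>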

In [31]:
import stanfordnlp
nlp = stanfordnlp.Pipeline(processors ="tokenize,lemma,pos") 

# stanfordnlp doesn't come with its own stop words list
# so we are going to have to import one from NLTK again. 

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 

cleanDocs = []

def stanford_preprocess(corpus):
    for doc in range(len(corpus)):
        # Replace each raw string with its annotated document (modifies corpus in place).
        corpus[doc] = nlp(corpus[doc])
        cleanDocs.append([word.lemma.lower()
                          for sent in corpus[doc].sentences for word in sent.words
                          if word.lemma.lower() not in stop_words
                          and word.lemma.isalpha()])

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/Users/ppchsdbib/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/Users/ppchsdbib/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: pos
With settings: 
{'model_path': '/Users/ppchsdbib/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/Users/ppchsdbib/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---


In [32]:
stanford_preprocess(docs)

<h4>Let's take a look at the third document.</h4>

In [33]:
print(cleanDocs[2][:100])

['want', 'use', 'subject', 'preach', 'three', 'dimension', 'complete', 'life', 'right', 'know', 'use', 'tell', 'hollywood', 'order', 'movie', 'complete', 'three', 'dimensional', 'well', 'morning', 'want', 'seek', 'get', 'life', 'complete', 'yes', 'must', 'three', 'dimensional', 'many', 'many', 'century', 'ago', 'man', 'name', 'john', 'find', 'prison', 'lonely', 'obscure', 'island', 'call', 'patmos', 'right', 'right', 'prison', 'enough', 'know', 'lonely', 'experience', 'right', 'incarcerate', 'situation', 'deprive', 'almost', 'every', 'freedom', 'freedom', 'think', 'freedom', 'pray', 'freedom', 'reflect', 'meditate', 'john', 'lonely', 'island', 'prison', 'right', 'lift', 'vision', 'high', 'heaven', 'right', 'see', 'descend', 'heaven', 'new', 'heaven', 'right', 'new', 'earth', 'right', 'twenty', 'first', 'chapter', 'book', 'revelation', 'open', 'say', 'see', 'new', 'heaven', 'new', 'earth', 'right', 'john', 'see', 'holy', 'city']


<h4>Now reload the corpus one more time.</h4>


In [38]:
import glob
docs = []
for filename in glob.glob('./King/*.txt'):  
    with open(filename, 'r', encoding='utf-8') as f:     
        docs.append(f.read().replace('\n', ' '))

<h4>Now we are going to preprocess with Spark NLP, a library built on Apache Spark and designed for processing text at scale.</h4>

In [37]:
# getting started with spark and sparknlp 
import sparknlp 
from sparknlp.pretrained import PretrainedPipeline
spark = sparknlp.start()
pipeline = PretrainedPipeline('explain_document_dl', lang='en') 

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) 

cleanDocs = []

def sparknlp_preprocess(corpus):
    for doc in corpus:
        # Annotate the raw string; the 'lemma' entry of the result holds the lemmas.
        doc = pipeline.annotate(doc)
        cleanDocs.append([word.lower() for word in doc['lemma']
                          if word.lower() not in stop_words and word.isalpha()])

explain_document_dl download started this may take some time.
Approx size to download 167.3 MB
[OK!]


In [39]:
sparknlp_preprocess(docs)
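The filtering step is the same in the StanfordNLP and Spark NLP versions: lowercase each lemma, drop stop words, and drop anything that is not purely alphabetic. A minimal sketch of that shared step follows. The helper name `filter_lemmas` is hypothetical, and the tiny stop-word set and lemma list are toy stand-ins for NLTK's English list and a library's real output.

```python
def filter_lemmas(lemmas, stop_words):
    """Lowercase lemmas and drop stop words and non-alphabetic tokens."""
    return [lemma.lower() for lemma in lemmas
            if lemma.lower() not in stop_words and lemma.isalpha()]

# Toy inputs for illustration only.
toy_stop_words = {'the', 'of', 'a'}
toy_lemmas = ['The', 'promise', 'of', 'democracy', '1963', "n't", ',']

print(filter_lemmas(toy_lemmas, toy_stop_words))
```

Here `str.isalpha` is what removes numbers, punctuation, and contraction fragments such as "n't", since it is true only when every character is a letter.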

<h4>As the last step, let's get a snapshot of the contents of the fourth document.</h4>

In [41]:
print(cleanDocs[3][:100])

['force', 'preach', 'something', 'handicap', 'morning', 'fact', 'doctor', 'come', 'church', 'say', 'would', 'good', 'stay', 'bed', 'morning', 'insist', 'would', 'come', 'preach', 'allow', 'come', 'one', 'stipulation', 'would', 'come', 'pulpit', 'time', 'preach', 'would', 'immediately', 'go', 'back', 'home', 'get', 'bed', 'go', 'try', 'follow', 'instruction', 'point', 'want', 'use', 'subject', 'preach', 'morning', 'familiar', 'subject', 'familiar', 'preach', 'subject', 'twice', 'know', 'pulpit', 'try', 'make', 'something', 'custom', 'tradition', 'preach', 'passage', 'scripture', 'least', 'year', 'add', 'new', 'insight', 'develop', 'along', 'way', 'new', 'experience', 'give', 'message', 'although', 'content', 'basic', 'content', 'new', 'insight', 'new', 'experience', 'naturally', 'make', 'new', 'illustration', 'want', 'turn', 'attention', 'subject', 'loving', 'enemies', 'itys', 'basic', 'part', 'basic', 'philosophical', 'theological', 'orientation', 'whole', 'idea']
