# Building an NLP Pipeline, Step-by-Step



### Note: These are the steps in a typical NLP pipeline, but you may skip or reorder steps depending on what you want to do and how your NLP library is implemented.

Let’s look at a piece of text from Wikipedia:

London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.

(Source: Wikipedia article “London”)

This paragraph contains several useful facts. It would be great if a computer could read this text and understand that London is a city, London is located in England, London was settled by Romans and so on. But to get there, we have to first teach our computer the most basic concepts of written language and then move up from there.

In [1]:
import nltk
from nltk.stem import WordNetLemmatizer
import sys
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.parse.malt import MaltParser
import pandas as pd
import seaborn as sns
import numpy as np

In [2]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [3]:
mp = MaltParser("maltparser-1.8.1", "engmalt.linear-1.7.mco")

In [4]:
df = pd.read_csv("200Reviews.csv")
df

Unnamed: 0.1,Unnamed: 0,id,sentiment,review
0,0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
...,...,...,...,...
195,195,"""8807_9""",1,"""This is a collection of documentaries that la..."
196,196,"""12148_10""",1,"""This movie has a lot of comedy, not dark and ..."
197,197,"""10771_2""",0,"""Have not watched kids films for some years, s..."
198,198,"""6766_3""",0,"""You probably heard this phrase when it come t..."


In [11]:
def process_text(paragraph):
    sentences = nltk.sent_tokenize(paragraph)

    grammar = """NP: {<DT>?<JJ>*<NN.*>+}
    RELATION: {<V.*>}
    {<DT>?<JJ>*<NN.*>+}
    ENTITY: {<NN.*>}"""
    cp = nltk.RegexpParser(grammar)

    wordnet_lemmatizer = WordNetLemmatizer()
    result_list=[]
    noun_phrases_list=[]
    for k in range(len(sentences)):
        words = nltk.word_tokenize(sentences[k])
        tagged_words = nltk.pos_tag(words)
        lemmatized_wordlist=[]
        print("Noun phrases in sentence "+repr(k+1)+" are: ")
        for w in tagged_words:
            wordnettag=get_wordnet_pos(w[1])
            if wordnettag == '':
                lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower())
            else:
                lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower(),pos=wordnettag)
            if w[0].istitle():
                lemmatizedword = lemmatizedword.capitalize()
            elif w[0].upper()==w[0]:
                lemmatizedword = lemmatizedword.upper()
            else:
                lemmatizedword = lemmatizedword
            lemmatized_wordlist.append((lemmatizedword,w[1]))
        result = cp.parse(lemmatized_wordlist)
        result_list.append(result)
        # Collect the text of every NP subtree found in this sentence
        noun_phrases = [' '.join(leaf[0] for leaf in tree.leaves()) for tree in result.subtrees() if tree.label()=='NP']
        noun_phrases_list.append(noun_phrases)
    
    return result_list, noun_phrases_list

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

## Step 1: Sentence Segmentation
The first step in the pipeline is to break the text apart into separate sentences. 
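
As a rough illustration of what sentence segmentation involves, here is a minimal sketch that uses only a regular expression. It is not how `nltk.sent_tokenize` works internally; the function name `naive_sentence_split` is made up for this example, and the rule (split after `.`, `!` or `?` when an uppercase letter follows) will break on abbreviations such as "Dr. Smith".

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' when whitespace and an uppercase letter follow."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [part for part in parts if part]

# The Wikipedia paragraph quoted earlier in this notebook
paragraph = (
    "London is the capital and most populous city of England and the United Kingdom. "
    "Standing on the River Thames in the south east of the island of Great Britain, "
    "London has been a major settlement for two millennia. "
    "It was founded by the Romans, who named it Londinium."
)

sentences = naive_sentence_split(paragraph)
for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```

On this paragraph the heuristic yields three sentences. A trained segmenter is the safer choice for real text because punctuation is ambiguous far more often than this example suggests.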

## Step 4: Text Lemmatization
In English (and most languages), words appear in different forms. Look at these two sentences:

I had a pony.

I had two ponies.

Both sentences talk about the noun pony, but they are using different inflections. When working with text in a computer, it is helpful to know the base form of each word so that you know that both sentences are talking about the same concept. Otherwise the strings “pony” and “ponies” look like two totally different words to a computer.

In NLP, we call this process lemmatization: figuring out the most basic form, or lemma, of each word in the sentence.
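
To make the idea concrete, the sketch below maps the two pony sentences onto the same lemmas with a couple of hand-written suffix rules and a tiny exception table. This is a toy, not the dictionary-based lookup that `WordNetLemmatizer` performs; `toy_lemma`, `lemmatize_sentence` and the `IRREGULAR` table are hypothetical names invented for this example.

```python
import re

# Tiny exception table for irregular forms (illustrative only, far from complete)
IRREGULAR = {"had": "have", "was": "be", "mice": "mouse"}

def toy_lemma(word):
    """Return a crude base form using an exception table and two plural rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # ponies -> pony
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # cats -> cat
    return w

def lemmatize_sentence(sentence):
    """Lowercase, drop punctuation, and lemmatize each word."""
    return [toy_lemma(token) for token in re.findall(r"[A-Za-z]+", sentence)]

first = lemmatize_sentence("I had a pony.")
second = lemmatize_sentence("I had two ponies.")
print(first)
print(second)
```

Both sentences now end in the same string, `pony`, which is the property the text describes. Real lemmatizers also need the part of speech, which is why `process_text` passes a WordNet tag to the lemmatizer.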

## Step 5: Identifying Stop Words
Next, we want to consider the importance of each word in the sentence. English has a lot of filler words that appear very frequently, like “and”, “the”, and “a”. When doing statistics on text, these words introduce a lot of noise because they appear far more often than other words. Some NLP pipelines flag them as stop words, that is, words that you might want to filter out before doing any statistical analysis.
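
The effect is easy to see on the first sentence of the London paragraph. The sketch below counts word frequencies and then filters against a small, hand-picked stop-word set; the `STOP_WORDS` set here is an assumption made for illustration and is much shorter than the list NLTK ships.

```python
from collections import Counter

# Hand-picked stop words for illustration; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "on", "it"}

sentence = ("London is the capital and most populous city "
            "of England and the United Kingdom")
tokens = sentence.split()

# Filler words dominate the raw frequency counts
counts = Counter(token.lower() for token in tokens)
print(counts.most_common(2))

# Compare in lowercase so that "The" and "the" are treated alike
content_tokens = [token for token in tokens if token.lower() not in STOP_WORDS]
print(content_tokens)
```

There is no universal stop-word list; which words count as noise depends on the application.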

## Step 6a: Dependency Parsing
The next step is to figure out how all the words in our sentence relate to each other. This is called dependency parsing.

The goal is to build a tree that assigns a single parent word to each word in the sentence. The root of the tree will be the main verb in the sentence. Here’s what the beginning of the parse tree will look like for our sentence:
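
As a rough illustration of the single-parent structure, the sketch below stores a dependency parse of the fragment "London is the capital" as a list of (token, head index, relation) triples and prints it as an indented tree. The parse is hand-annotated for this example, not produced by a parser, and the relation labels are assumptions chosen for readability.

```python
from collections import defaultdict

# (token, index of its head token, relation label); a head of -1 marks the root.
# Hand-annotated for illustration, not parser output.
PARSE = [
    ("London", 1, "nsubj"),
    ("is", -1, "root"),
    ("the", 3, "det"),
    ("capital", 1, "attr"),
]

def children_by_head(parse):
    """Group token indices under the index of their head."""
    children = defaultdict(list)
    for index, (_, head, _) in enumerate(parse):
        children[head].append(index)
    return children

def render(parse, children, index, depth=0):
    """Return the subtree rooted at `index` as indented lines."""
    token, _, relation = parse[index]
    lines = ["  " * depth + f"{token} ({relation})"]
    for child in children[index]:
        lines.extend(render(parse, children, child, depth + 1))
    return lines

children = children_by_head(PARSE)
root_index = children[-1][0]
tree_lines = render(PARSE, children, root_index)
print("\n".join(tree_lines))
```

Because every token stores exactly one head, the structure is guaranteed to be a tree as long as there is a single root and no cycles.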

## Step 6b: Finding Noun Phrases
Sometimes it makes more sense to group together the words that represent a single idea or thing. We can use the information from the dependency parse tree to automatically group together words that are all talking about the same thing.
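
One simple way to approximate this grouping works on part-of-speech tags rather than on the dependency tree, which is also what the `NP` rule in `process_text` does with the pattern `<DT>?<JJ>*<NN.*>+`. The sketch below applies that same pattern in plain Python: an optional determiner, any number of adjectives, then one or more nouns. The function name `chunk_noun_phrases` is hypothetical, and the tags in `tagged` were assigned by hand for this example.

```python
def chunk_noun_phrases(tagged):
    """Scan left to right for: optional DT, any number of JJ, one or more NN*."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                      # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":         # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                                     # found at least one noun
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases

# Hand-tagged tokens (an assumption for this example, not tagger output)
tagged = [
    ("London", "NNP"), ("is", "VBZ"), ("the", "DT"), ("capital", "NN"),
    ("and", "CC"), ("most", "RBS"), ("populous", "JJ"), ("city", "NN"),
    ("of", "IN"), ("England", "NNP"),
]

noun_phrases = chunk_noun_phrases(tagged)
print(noun_phrases)
```

Tag-pattern chunking is cheap but shallow: it misses "most" in "most populous city" because the pattern has no slot for adverbs, whereas a dependency-based approach could attach it.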

In [12]:
df["processed"] = df['review'].map(process_text)
df['processed']

(Printed output truncated: one "Noun phrases in sentence N are:" line is printed for each sentence of each review.)

0      [[(``, ``), (With, IN), (all, DT), (this, DT),...
1      [[(``, ``), (\, NN), ('', ''), (The, DT), (Cla...
2      [[(``, ``), (The, DT), (film, NN), (start, VBZ...
3      [[(``, ``), (It, PRP), (must, MD), (be, VB), (...
4      [[(``, ``), (Superbly, RB), (trashy, JJ), (and...
                             ...                        
195    [[(``, ``), (This, DT), (be, VBZ), (a, DT), (c...
196    [[(``, ``), (This, DT), (movie, NN), (have, VB...
197    [[(``, ``), (Have, VBP), (not, RB), (watch, VB...
198    [[(``, ``), (You, PRP), (probably, RB), (hear,...
199    [[(``, ``), (I, PRP), (be, VBD), (about, RB), ...
Name: processed, Length: 200, dtype: object

In [9]:
#Step 3: Predicting parts of speech for each token
# Run nltk.help.upenn_tagset() to print a description of every POS tag.
def tag_words(sentence_list):
    tag_list=[]
    for tokenized_sentence in sentence_list:        
        tagged_words = nltk.pos_tag(tokenized_sentence)
        tag_list.append(tagged_words)
    return tag_list

In [10]:
df['tagged']=df['tokenized'].map(tag_words)
df['tagged']

0      [[(``, ``), (With, IN), (all, DT), (this, DT),...
1      [[(``, ``), (\, NN), ('', ''), (The, DT), (Cla...
2      [[(``, ``), (The, DT), (film, NN), (starts, VB...
3      [[(``, ``), (It, PRP), (must, MD), (be, VB), (...
4      [[(``, ``), (Superbly, RB), (trashy, JJ), (and...
                             ...                        
195    [[(``, ``), (This, DT), (is, VBZ), (a, DT), (c...
196    [[(``, ``), (This, DT), (movie, NN), (has, VBZ...
197    [[(``, ``), (Have, VBP), (not, RB), (watched, ...
198    [[(``, ``), (You, PRP), (probably, RB), (heard...
199    [[(``, ``), (I, PRP), (was, VBD), (about, RB),...
Name: tagged, Length: 200, dtype: object

In [None]:
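#Step 5: Identifying stop words
# Assumption: `sentences` holds the sentences of the Wikipedia paragraph quoted above.
paragraph = ("London is the capital and most populous city of England and the United Kingdom. "
             "Standing on the River Thames in the south east of the island of Great Britain, "
             "London has been a major settlement for two millennia. "
             "It was founded by the Romans, who named it Londinium.")
sentences = nltk.sent_tokenize(paragraph)
stopWords = set(stopwords.words('english'))
for k, sentence in enumerate(sentences, start=1):
    words = nltk.word_tokenize(sentence)
    print("Words in sentence "+repr(k)+" without stop words are: ")
    # NLTK's stop-word list is lowercase, so compare case-insensitively
    wordlist_wo_stopwords = [w for w in words if w.lower() not in stopWords]
    print(wordlist_wo_stopwords)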


In [None]:
#!wget http://www.maltparser.org/mco/english_parser/engmalt.linear-1.7.mco
#Step 6a: Dependency parsing
# Requires the maltparser-1.8.1 directory and the engmalt.linear-1.7.mco model in the working directory.
# Reuses `mp` (the MaltParser created in an earlier cell) and `sentences` (a list of sentence strings).
#mp = MaltParser('maltparser-1.8.1', 'engmalt.linear-1.7.mco')
# parse_one() takes one sentence as a list of tokens; NLTK's wrapper POS-tags them before the parser runs.
trees = [mp.parse_one(sentence.split()).tree() for sentence in sentences]
for k, tree in enumerate(trees, start=1):
    print("Dependency tree for sentence "+repr(k))
    print(tree)
    #Uncomment the following to visualize the tree in a pop-up window
    #tree.draw()


## Step 7. Named Entity Recognition (NER)
The goal of Named Entity Recognition is to detect and label nouns with the real-world concepts they represent. But Named Entity Recognition systems aren’t just doing a simple dictionary lookup. Instead, they use the context in which a word appears in the sentence, together with a statistical model, to guess which type of noun the word represents.


In [None]:
#Step 7. Named Entity Recognition (NER)
nltk.download('words')
nltk.download('maxent_ne_chunker')
# Legend for a common set of entity-tag abbreviations.
# Note: nltk.ne_chunk prints its own labels, such as PERSON, ORGANIZATION and GPE.
"""
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon

"""
for k, sentence in enumerate(sentences, start=1):
    words = nltk.word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    ne_tagged_words = nltk.ne_chunk(tagged_words)
    #print(ne_tagged_words)
    print("Named entities in sentence "+repr(k))
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    for chunk in ne_tagged_words:
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))



Notice that the model makes some mistakes. This is probably because nothing similar appeared in its training data, so it made a best guess.

## Step 8: Coreference Resolution
At this point, we already have a useful representation of our sentence. We know the parts of speech for each word, how the words relate to each other and which words are talking about named entities. However, we still have one big problem. English is full of pronouns — words like he, she, and it. These are shortcuts that we use instead of writing out names over and over in each sentence. Humans can keep track of what these words represent based on context. But our NLP model doesn’t know what pronouns mean because it only examines one sentence at a time.

Let’s look at the third sentence in our paragraph: “It was founded by the Romans, who named it Londinium.”

As a human reading this sentence, you can easily figure out that “it” means “London”. The goal of coreference resolution is to figure out this same mapping by tracking pronouns across sentences. We want to figure out all the words that are referring to the same entity.
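
To show what "tracking pronouns across sentences" means in the simplest possible terms, the sketch below links each pronoun to the most recently mentioned entity from a hand-supplied list. This most-recent-mention heuristic is an assumption made for illustration and is far cruder than real coreference systems; `resolve_pronouns`, `PRONOUNS` and the `known_entities` argument are hypothetical names.

```python
import re

PRONOUNS = {"it", "he", "she", "they"}

def resolve_pronouns(sentences, known_entities):
    """Link each pronoun to the most recently seen known entity."""
    last_entity = None
    links = []
    for sentence_index, sentence in enumerate(sentences):
        for token in re.findall(r"[A-Za-z]+", sentence):
            if token in known_entities:
                last_entity = token          # remember the latest mention
            elif token.lower() in PRONOUNS and last_entity is not None:
                links.append((sentence_index, token, last_entity))
    return links

sentences = [
    "London is the capital and most populous city of England and the United Kingdom.",
    "Standing on the River Thames in the south east of the island of Great Britain, "
    "London has been a major settlement for two millennia.",
    "It was founded by the Romans, who named it Londinium.",
]

# Hand-supplied entity list (an assumption; a real pipeline would take this from NER)
links = resolve_pronouns(sentences, known_entities={"London"})
print(links)
```

The heuristic only works here because the entity list contains nothing but "London". Adding "Romans" to `known_entities` would make the second "it" resolve to "Romans", which is exactly the kind of error that real coreference models use context to avoid.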


In [None]:
#Step 8: Coreference resolution
#Check out this great coreference resolution demo from Hugging Face. https://huggingface.co/coref/ 