# Building an NLP Pipeline, Step-by-Step



### Note: These are the steps in a typical NLP pipeline, but you may skip or re-order steps depending on what you want to do and how your NLP library is implemented.

Let’s look at a piece of text from Wikipedia:

London is the capital and most populous city of England and the United Kingdom. Standing on the River Thames in the south east of the island of Great Britain, London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium.

(Source: Wikipedia article “London”)

This paragraph contains several useful facts. It would be great if a computer could read this text and understand that London is a city, London is located in England, London was settled by Romans and so on. But to get there, we have to first teach our computer the most basic concepts of written language and then move up from there.

In [5]:
import nltk
from nltk.stem import WordNetLemmatizer
import sys
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.parse.malt import MaltParser
import pandas as pd
import seaborn as sns
import numpy as np

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

In [4]:
mp = MaltParser("maltparser-1.8.1", "engmalt.linear-1.7.mco")

In [6]:
df = pd.read_csv("200Reviews.csv")
df

Unnamed: 0.1,Unnamed: 0,id,sentiment,review
0,0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."
...,...,...,...,...
195,195,"""8807_9""",1,"""This is a collection of documentaries that la..."
196,196,"""12148_10""",1,"""This movie has a lot of comedy, not dark and ..."
197,197,"""10771_2""",0,"""Have not watched kids films for some years, s..."
198,198,"""6766_3""",0,"""You probably heard this phrase when it come t..."


In [None]:
# Example paragraph for the per-sentence cells below, which iterate over `sentences`.
paragraph = "London is the capital and most populous city of England and the United Kingdom."\
             " Standing on the River Thames in the south east of the island of Great Britain,"\
             " London has been a major settlement for two millennia. It was founded by the Romans, who named it Londinium."
sentences = nltk.sent_tokenize(paragraph)
print("Paragraph considered is:")
print(paragraph)

## Step 1: Sentence Segmentation
The first step in the pipeline is to break the text apart into separate sentences. 

In [9]:
df["tokenized"] = df['review'].map(nltk.sent_tokenize)
df['tokenized']

0      ["With all this stuff going down at the moment...
1      ["\"The Classic War of the Worlds\" by Timothy...
2      ["The film starts with a manager (Nicholas Bel...
3      ["It must be assumed that those who praised th...
4      ["Superbly trashy and wondrously unpretentious...
                             ...                        
195    ["This is a collection of documentaries that l...
196    ["This movie has a lot of comedy, not dark and...
197    ["Have not watched kids films for some years, ...
198    ["You probably heard this phrase when it come ...
199    ["I was about thirteen when this movie came ou...
Name: tokenized, Length: 200, dtype: object
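
To make the idea of sentence segmentation concrete, here is a minimal sketch of a rule-based splitter written with the standard library only. It is a deliberate simplification and not how `nltk.sent_tokenize` works internally; the function name `naive_sent_split` and the variable `demo_sentences` are made up for this illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

demo_paragraph = (
    "London is the capital and most populous city of England and the United Kingdom. "
    "Standing on the River Thames in the south east of the island of Great Britain, "
    "London has been a major settlement for two millennia. "
    "It was founded by the Romans, who named it Londinium."
)

demo_sentences = naive_sent_split(demo_paragraph)
for number, sentence in enumerate(demo_sentences, 1):
    print(number, sentence)

# The rule is too simple for abbreviations: "Dr." is treated as a sentence end.
print(naive_sent_split("Dr. Smith arrived."))
```

The last call shows why a trained tokenizer is usually preferable: a plain punctuation rule cannot tell an abbreviation from the end of a sentence.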

## Step 2: Word Tokenization
Now that we’ve split our document into sentences, we can process them one at a time. The next step in our pipeline is to break each sentence into separate words or tokens. This is called tokenization.

In [12]:
#Step 2: Word tokenization

# Each entry of df['tokenized'] is a list of sentences, so nltk.word_tokenize has to be
# applied to every sentence in turn. Mapping it straight over the column raises
# "TypeError: expected string or bytes-like object".
df["word_list"] = df['tokenized'].map(lambda sents: [nltk.word_tokenize(s) for s in sents])
df["word_list"]

# Equivalent loop over the sentences of the example paragraph:
# for k in range(len(sentences)):
#     # word tokenizer will keep the punctuations. To get rid of punctuations, use nltk.RegexpTokenizer(r'\w+').tokenize(sentences[k]) 
#     words = nltk.word_tokenize(sentences[k])
#     print("Words in sentence "+repr(k+1)+" are: ")
#     wordlist=[]
#     for w in words:
#         wordlist.append(w)
#     print(wordlist)
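
As a rough illustration of what a word tokenizer does, the sketch below splits a sentence with a regular expression from the standard library. It is only an approximation of the behaviour described above (NLTK's tokenizer handles contractions and many other cases); `simple_word_tokenize` and its `keep_punct` parameter are names invented for this example.

```python
import re

def simple_word_tokenize(sentence, keep_punct=True):
    """Return word tokens; punctuation marks become their own tokens if kept."""
    pattern = r"\w+|[^\w\s]" if keep_punct else r"\w+"
    return re.findall(pattern, sentence)

example = "It was founded by the Romans, who named it Londinium."

with_punct = simple_word_tokenize(example)
without_punct = simple_word_tokenize(example, keep_punct=False)

print(with_punct)
print(without_punct)
```

Dropping punctuation with the `\w+` pattern mirrors the `RegexpTokenizer(r'\w+')` hint in the cell above.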

## Step 3: Predicting Parts of Speech for Each Token
Next, we’ll look at each token and try to guess its part of speech — whether it is a noun, a verb, an adjective and so on. Knowing the role of each word in the sentence will help us start to figure out what the sentence is talking about.

We can do this by feeding each word (and some extra words around it for context) into a pre-trained part-of-speech classification model.

The part-of-speech model was originally trained by feeding it a large corpus of English sentences with each word’s part of speech already tagged and having it learn to replicate that behavior.

In [None]:
#Step 3: Predicting parts of speech for each token
# You can use nltk.help.upenn_tagset() to get a description of each POS tag.
#Uncomment the following lines to print the list of all tags
#nltk.download('tagsets')
#nltk.help.upenn_tagset()
for k in range(len(sentences)):
    words = nltk.word_tokenize(sentences[k])
    tagged_words = nltk.pos_tag(words)
    print("Tagged Words in sentence "+repr(k+1)+" are: ")
    print(tagged_words)
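
The idea of learning tags from already-tagged sentences can be sketched with a toy baseline that simply remembers the most frequent tag for each word. This is not the model behind `nltk.pos_tag`, and unlike a real tagger it ignores the surrounding words; the three training sentences and their Penn Treebank style tags are hand-written for this illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set, invented for illustration only.
TRAIN = [
    [("London", "NNP"), ("is", "VBZ"), ("a", "DT"), ("city", "NN")],
    [("The", "DT"), ("Romans", "NNPS"), ("founded", "VBD"), ("a", "DT"), ("city", "NN")],
    [("Rome", "NNP"), ("is", "VBZ"), ("the", "DT"), ("capital", "NN")],
]

def train_unigram_tagger(tagged_sentences):
    """Map each lowercased word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag known words from the model; unknown words fall back to a default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_unigram_tagger(TRAIN)
tagged = tag_words(["London", "is", "the", "capital"], model)
print(tagged)
```

A real tagger improves on this baseline by also looking at neighbouring words and word shape, which is what lets it tag words it has never seen.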
   

## Step 4: Text Lemmatization
In English (and most languages), words appear in different forms. Look at these two sentences:

I had a pony.

I had two ponies.

Both sentences talk about the noun pony, but they are using different inflections. When working with text in a computer, it is helpful to know the base form of each word so that you know that both sentences are talking about the same concept. Otherwise the strings “pony” and “ponies” look like two totally different words to a computer.

In NLP, we call this process lemmatization: figuring out the most basic form, or lemma, of each word in the sentence.
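
A toy version of the pony/ponies example can be written with a few suffix rules. This is only a sketch to show the idea of mapping inflected forms to a base form; it is not how the WordNet lemmatizer works, and the function name `toy_noun_lemma` is hypothetical.

```python
def toy_noun_lemma(word):
    """Reduce a few regular English plural endings to the singular form."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # ponies -> pony
    if w.endswith(("xes", "ches", "shes")):
        return w[:-2]                # boxes -> box
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # cats -> cat
    return w

for form in ["pony", "ponies", "boxes", "glass", "mice"]:
    print(form, "->", toy_noun_lemma(form))
```

Irregular forms such as "mice" pass through unchanged, which is one reason practical lemmatizers rely on a dictionary and the part of speech rather than suffix rules alone.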

In [None]:
#Step 4: Text Lemmatization
#As we are using the WordNet lemmatizer and the standard NLTK POS tags are Treebank tags, we need to convert the Treebank tags
#to WordNet tags.
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''
    
wordnet_lemmatizer = WordNetLemmatizer()
for k in range(len(sentences)):
    words = nltk.word_tokenize(sentences[k])
    tagged_words = nltk.pos_tag(words)
    lemmatized_wordlist=[]
    print("Word:Lemmatized Word in sentence "+repr(k+1)+" are: ")
    for w in tagged_words:
        wordnettag=get_wordnet_pos(w[1])
        if wordnettag == '':
            lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower())
        else:
            lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower(),pos=wordnettag)
        if w[0].istitle():
            lemmatizedword = lemmatizedword.capitalize()
        elif w[0].upper()==w[0]:
            lemmatizedword = lemmatizedword.upper()
        else:
            lemmatizedword = lemmatizedword
        lemmatized_wordlist.append((w[0],lemmatizedword))
            
    print(lemmatized_wordlist)
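
The casing logic inside the loop above can be pulled out into a small helper, which makes it easier to read and to reuse in later steps. The sketch below uses the same checks as the cell; the name `restore_case` is an assumption introduced here, not part of NLTK.

```python
def restore_case(original, lemma):
    """Give the lowercased lemma the same casing pattern as the original token."""
    if original.istitle():
        return lemma.capitalize()
    if original.upper() == original:
        return lemma.upper()
    return lemma

print(restore_case("Romans", "roman"))
print(restore_case("NASA", "nasa"))
print(restore_case("founded", "found"))
```

With such a helper, the loop body reduces to computing the lemma and calling `restore_case(w[0], lemmatizedword)`.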


## Step 5: Identifying Stop Words
Next, we want to consider the importance of each word in the sentence. English has a lot of filler words that appear very frequently, like “and”, “the”, and “a”. When doing statistics on text, these words introduce a lot of noise since they appear far more frequently than other words. Some NLP pipelines flag them as stop words, that is, words that you might want to filter out before doing any statistical analysis.

In [None]:
#Step 5: Identifying stop words
stopWords = set(stopwords.words('english'))
for k in range(len(sentences)):
    words = nltk.word_tokenize(sentences[k])
    wordlist_wo_stopwords=[]
    print("Words in sentence "+repr(k+1)+" without stop words are: ")
    for w in words:
        if w.lower() not in stopWords:  # the NLTK stop word list is lowercase
            wordlist_wo_stopwords.append(w)
    print(wordlist_wo_stopwords)
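
To see the noise that stop words add to simple statistics, the sketch below counts word frequencies in the London paragraph before and after filtering. The small `STOP_WORDS` set is hand-picked for this illustration and is much shorter than NLTK's English list.

```python
import re
from collections import Counter

# Hand-picked stop list for illustration; not the NLTK list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "it",
              "by", "for", "was", "has", "been", "who"}

text = (
    "London is the capital and most populous city of England and the United Kingdom. "
    "Standing on the River Thames in the south east of the island of Great Britain, "
    "London has been a major settlement for two millennia. "
    "It was founded by the Romans, who named it Londinium."
)

tokens = re.findall(r"\w+", text.lower())
all_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print("Most frequent token overall:", all_counts.most_common(1))
print("Most frequent token without stop words:", content_counts.most_common(1))
```

Before filtering, the most frequent token is a filler word; after filtering, the most frequent token is the actual topic of the paragraph.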


## Step 6a: Dependency Parsing
The next step is to figure out how all the words in our sentence relate to each other. This is called dependency parsing.

The goal is to build a tree that assigns a single parent word to each word in the sentence. The root of the tree will be the main verb in the sentence. The cell below builds and prints this tree for each sentence in our paragraph.

In [None]:
#!wget http://www.maltparser.org/mco/english_parser/engmalt.linear-1.7.mco
#Step 6a: Dependency parsing
from nltk.parse.malt import MaltParser
from nltk.tree import Tree
from nltk.draw.tree import TreeView
import sys
sys.path.insert(0,'./maltparser-1.8.1')
sys.path.insert(0,'.')
#MaltParser takes one tokenized sentence (a list of words) and POS-tags it internally.
#mp = MaltParser('maltparser-1.8.1', 'engmalt.linear-1.7.mco')
trees=[]
for k in range(len(sentences)):
    trees.append(mp.parse_one(sentences[k].split()).tree())
for k in range(len(trees)):
    print("Dependency tree for sentence "+repr(k+1))
    print(trees[k])
    #Uncomment the following to visualize the tree (draw() opens a window and returns None)
    #trees[k].draw()
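
The "single parent per word" structure can be shown without a parser by storing, for each word, the index of its parent. The parse below is hand-annotated for the short sentence "The Romans founded London" and is not MaltParser output; the relation names and helper functions are assumptions made for this sketch.

```python
# Each entry: (word, index of its parent word, relation). The root has parent -1.
parse = [
    ("The", 1, "det"),
    ("Romans", 2, "nsubj"),
    ("founded", -1, "root"),
    ("London", 2, "dobj"),
]

def find_root(parse):
    """Return the index of the word that has no parent."""
    return next(i for i, (_, head, _) in enumerate(parse) if head == -1)

def children(parse, head_index):
    """Return the indices of all words whose parent is head_index."""
    return [i for i, (_, head, _) in enumerate(parse) if head == head_index]

def print_tree(parse, index, depth=0):
    """Print the word at index, then its children indented one level deeper."""
    word, _, relation = parse[index]
    print("  " * depth + f"{word} ({relation})")
    for child in children(parse, index):
        print_tree(parse, child, depth + 1)

root = find_root(parse)
print_tree(parse, root)
```

Because every word has exactly one parent, the whole sentence fits in a flat list, and the tree can be rebuilt from it whenever needed.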


## Step 6b: Finding Noun Phrases
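Sometimes it makes more sense to group together the words that represent a single idea or thing. The information from the dependency parse tree can be used to group together words that are all talking about the same thing; the cell below takes a simpler route and groups words with a regular-expression grammar over the part-of-speech tags.

In [None]:
#Step 6b: Finding noun phrases
#In the above step we used the malt parser; here a regular-expression chunk grammar over the POS tags is used instead.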
grammar = """NP: {<DT>?<JJ>*<NN.*>+}
       RELATION: {<V.*>}
                 {<DT>?<JJ>*<NN.*>+}
       ENTITY: {<NN.*>}"""

cp = nltk.RegexpParser(grammar)
for k in range(len(sentences)):
    words = nltk.word_tokenize(sentences[k])
    tagged_words = nltk.pos_tag(words)
    lemmatized_wordlist=[]
    print("Noun phrases in sentence "+repr(k+1)+" are: ")
    for w in tagged_words:
        wordnettag=get_wordnet_pos(w[1])
        if wordnettag == '':
            lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower())
        else:
            lemmatizedword = wordnet_lemmatizer.lemmatize(w[0].lower(),pos=wordnettag)
        if w[0].istitle():
            lemmatizedword = lemmatizedword.capitalize()
        elif w[0].upper()==w[0]:
            lemmatizedword = lemmatizedword.upper()
        else:
            lemmatizedword = lemmatizedword
        lemmatized_wordlist.append((lemmatizedword,w[1]))
            
   # print(lemmatized_wordlist)

    # Parse once and reuse the resulting tree
    result = cp.parse(lemmatized_wordlist)
    noun_phrases_list = [' '.join(leaf[0] for leaf in tree.leaves())
                      for tree in result.subtrees()
                      if tree.label()=='NP']
    #print(result)
    #print(type(result))

    print(noun_phrases_list)
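
The NP rule in the grammar above, `{<DT>?<JJ>*<NN.*>+}`, reads as "an optional determiner, any number of adjectives, then one or more nouns". The sketch below walks a tagged sentence by hand to make that pattern explicit. It is a simplified stand-in for `nltk.RegexpParser`, not its implementation, and the POS tags in `tagged_example` are assigned by hand for illustration.

```python
def find_noun_phrases(tagged):
    """Collect spans matching: optional DT, any number of JJ, one or more NN* tags."""
    phrases = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # zero or more adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                                # at least one noun was found
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases

# Hand-tagged tokens (assumed tags, not tagger output).
tagged_example = [
    ("London", "NNP"), ("is", "VBZ"), ("the", "DT"), ("capital", "NN"),
    ("and", "CC"), ("most", "RBS"), ("populous", "JJ"), ("city", "NN"),
    ("of", "IN"), ("England", "NNP"),
]

noun_phrases = find_noun_phrases(tagged_example)
print(noun_phrases)
```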

## Step 7. Named Entity Recognition (NER)
The goal of Named Entity Recognition is to detect and label nouns with the real-world concepts they represent. But Named Entity Recognition systems aren’t just doing a simple dictionary lookup. Instead, they use the context of how a word appears in the sentence and a statistical model to guess which type of noun a word represents.


In [None]:
#Step 7. Named Entity Recognition (NER)
nltk.download('words')
nltk.download('maxent_ne_chunker')
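# Note: nltk.ne_chunk labels its chunks with uppercase names such as PERSON,
# ORGANIZATION and GPE; the abbreviations below are a different tag set, kept for reference.
"""
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon

"""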
for k in range(len(sentences)):
    words = nltk.word_tokenize(sentences[k])
    tagged_words = nltk.pos_tag(words)
    ne_tagged_words = nltk.ne_chunk(tagged_words)
    #print(ne_tagged_words)
    print("Named entities in  sentence "+repr(k+1))
    for chunk in ne_tagged_words:
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))



Notice that it makes some mistakes. This is probably because nothing similar appeared in the model’s training data, so it made a best guess.
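
Named entities often span several tokens, so NER output is commonly stored as one tag per token in IOB form (B- begins an entity, I- continues it, O is outside any entity). The sketch below groups such tags back into entity strings, much like the loop above does with chunk subtrees. The IOB tags are hand-written for illustration rather than produced by a model, and `group_entities` is a hypothetical helper.

```python
def group_entities(iob_tagged):
    """Merge consecutive B-/I- tagged tokens into (label, text) pairs."""
    entities = []
    words, label = [], None
    for word, tag in iob_tagged:
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new or tag == "O":
            if words:                      # close the entity collected so far
                entities.append((label, " ".join(words)))
            words, label = [], None
        if starts_new:
            words, label = [word], tag[2:]
        elif tag.startswith("I-"):
            words.append(word)
    if words:                              # entity running to the end of the sentence
        entities.append((label, " ".join(words)))
    return entities

# Hand-labelled example (assumed tags).
iob_example = [
    ("London", "B-GPE"), ("is", "O"), ("the", "O"), ("capital", "O"),
    ("of", "O"), ("the", "O"), ("United", "B-GPE"), ("Kingdom", "I-GPE"), (".", "O"),
]

entities = group_entities(iob_example)
print(entities)
```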

## Step 8: Coreference Resolution
At this point, we already have a useful representation of our sentence. We know the parts of speech for each word, how the words relate to each other and which words are talking about named entities. However, we still have one big problem. English is full of pronouns — words like he, she, and it. These are shortcuts that we use instead of writing out names over and over in each sentence. Humans can keep track of what these words represent based on context. But our NLP model doesn’t know what pronouns mean because it only examines one sentence at a time.

Let’s look at the third sentence in our paragraph: “It was founded by the Romans, who named it Londinium.”

As a human reading this sentence, you can easily figure out that “it” means “London”. The goal of coreference resolution is to figure out this same mapping by tracking pronouns across sentences. We want to figure out all the words that are referring to the same entity.
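
To show what "tracking pronouns across sentences" means in code, here is a deliberately naive baseline that links each pronoun to the most recently mentioned entity. It is a sketch of the problem, not a real coreference system; the token list, the entity positions, and the function name `resolve_pronouns` are all assumptions made for this illustration.

```python
PRONOUNS = {"it", "he", "she", "they"}

def resolve_pronouns(tokens, entity_positions):
    """Map each pronoun index to the closest entity mentioned before it."""
    resolved = {}
    last_entity = None
    for i, token in enumerate(tokens):
        if i in entity_positions:
            last_entity = entity_positions[i]
        elif token.lower() in PRONOUNS and last_entity is not None:
            resolved[i] = last_entity
    return resolved

tokens = ("London has been a major settlement . "
          "It was founded by the Romans , who named it Londinium .").split()

# Positions of entity mentions, marked by hand (index -> entity name).
entity_positions = {0: "London", 12: "Romans", 17: "Londinium"}

resolved = resolve_pronouns(tokens, entity_positions)
for index, entity in resolved.items():
    print(f"{tokens[index]!r} at position {index} -> {entity}")
```

The first "It" is linked to London, but the second "it" is wrongly linked to Romans because that is simply the nearest earlier mention. Mistakes like this are why practical coreference systems use grammatical and semantic cues rather than recency alone.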


In [None]:
#Step 8: Coreference resolution
#Check out this great coreference resolution demo from Hugging Face. https://huggingface.co/coref/ 