# Chapter 2 - NLP Pipeline

![NLP Pipeline](Images/NLPPipeline.jpg)

1. Data acquisition : Collect data relevant to the given task
2. Text cleaning
3. Pre-processing : Data is converted into a canonical form
4. Feature engineering : We carve out the indicators that are most suitable for the task at hand and convert them into a format that modeling algorithms can consume
5. Modeling
6. Evaluation
7. Deployment
8. Monitoring and model updating

Note that, in the real world, the process may not always be linear as it’s shown in the pipeline in Figure 2-1; it often involves going back and forth between individual steps.

## 1. Data acquisition

Strategies for gathering relevant data for an NLP project in a less-than-ideal scenario. The running example is a system that separates incoming sales queries from support queries:

1. If we have **little or no data**, we can start by looking for patterns that indicate whether an incoming message is a sales or a support query. We can then use regular expressions and other heuristics to match these patterns and separate the two kinds of query. We evaluate this solution by collecting a set of queries from both categories and calculating what percentage of the messages our system identified correctly.


2. To move on to NLP techniques, we need labeled data. There are several ways to obtain it:

    * Use public dataset

    * Scrape data

    * Product intervention (better instrumentation in the product to collect data)

    * Data augmentation : NLP has a number of techniques for taking a small dataset and using some tricks to create more data.

        - Synonym replacement (Synsets or Wordnet)

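        - Back translation : We use a translation tool to translate a sentence into another language and then back into the original language, which yields a differently worded sentence.

        - TF-IDF-based word replacement : Back translation can lose certain words that are crucial to the sentence. Here, TF-IDF (a concept we’ll introduce in Chapter 3) is used to choose which words to replace, so that uninformative words are swapped and the crucial ones are kept.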

        - Bigram flipping : Divide the sentence into bigrams. Take one bigram at random and flip it. For example: “I am going to the supermarket.” Here, we take the bigram “going to” and replace it with the flipped one: “to going.”

        - Replacing entities : Replace entities like person name, location, organization, etc., with other entities in the same category.

        - Adding noise to data : In many NLP applications, the incoming data contains spelling mistakes, so the training data can be made noisy in the same way. For example, randomly choose a word in a sentence and replace it with another word that is close to it in spelling. For the "fat-finger" problem, simulate keyboard errors by replacing a few characters with their neighboring characters on a QWERTY keyboard.

        - Snorkel : This is a system for building training data automatically, without manual labeling. With Snorkel, a large training dataset can be “created” from heuristics and from synthetic samples obtained by transforming existing data.

        - Easy Data Augmentation (EDA) and NLPAug : These two libraries are used to create synthetic samples for NLP. They provide implementation of various data augmentation techniques, including some techniques that we discussed previously.

        - Active learning : Used when unlabeled data is abundant but manual labeling is expensive. The question then becomes: for which data points should we ask for labels to maximize learning while keeping the labeling cost low?
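
Two of the augmentation techniques listed above, bigram flipping and keyboard noise, can be sketched with the standard library alone. This is a minimal illustration, not the implementation used by EDA or NLPAug; the function names and the small `QWERTY_NEIGHBORS` table are assumptions made for the example.

```python
import random

# Assumption: a small hand-picked subset of QWERTY neighbours, for illustration only.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "t": "rfgy", "s": "awedxz", "r": "edft",
}

def flip_random_bigram(sentence, rng):
    """Swap the two words of one randomly chosen bigram."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def add_keyboard_noise(sentence, rng, rate=0.1):
    """Replace some characters with a neighbouring key to mimic fat-finger typos."""
    noisy = []
    for ch in sentence:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < rate:
            noisy.append(rng.choice(neighbors))
        else:
            noisy.append(ch)
    return "".join(noisy)

rng = random.Random(0)  # fixed seed so that runs are repeatable
sentence = "I am going to the supermarket"
print(flip_random_bigram(sentence, rng))
print(add_keyboard_noise(sentence, rng))
```

Each call produces one new variant of the input sentence; calling the functions repeatedly with different seeds yields as many augmented samples as needed.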


In order for most of the techniques we discussed in this section to work well, one key requirement is a clean dataset to start with.
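
Returning to the first strategy (no data yet), the regular-expression heuristic and its evaluation could look roughly like the sketch below. The keyword lists and the four labeled queries are invented for illustration; a real system would derive the patterns from messages actually observed.

```python
import re

# Hypothetical keyword patterns for each category.
SALES_PATTERN = re.compile(r"\b(price|pricing|quote|buy|purchase|discount|demo)\b", re.I)
SUPPORT_PATTERN = re.compile(r"\b(error|bug|crash|broken|not working|reset|help)\b", re.I)

def classify_query(message):
    """Label a message by whichever pattern matches more often."""
    sales_hits = len(SALES_PATTERN.findall(message))
    support_hits = len(SUPPORT_PATTERN.findall(message))
    if sales_hits > support_hits:
        return "sales"
    if support_hits > sales_hits:
        return "support"
    return "unknown"

# A tiny hand-labeled set used to measure the heuristic.
labeled_queries = [
    ("Can I get a quote for 50 licenses?", "sales"),
    ("The app shows an error when I log in", "support"),
    ("Is there a discount for students?", "sales"),
    ("My password reset link is broken", "support"),
]
correct = sum(classify_query(text) == label for text, label in labeled_queries)
accuracy = correct / len(labeled_queries)
print(f"Correctly identified: {accuracy:.0%}")
```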

## 2. Text Extraction and Cleanup

**Definition**: the process of extracting raw text from the input data (PDF, HTML, plain text, or a data stream) by removing all non-textual information, such as markup and metadata, and converting the text to the required encoding format.

### HTML Parsing and Cleanup

In [5]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl


In [6]:
url = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"

# Skip SSL certificate verification (acceptable for a quick demo only)
context = ssl._create_unverified_context()
html = urlopen(url, context=context).read()

In [8]:
soup = BeautifulSoup(html, "html.parser")

In [26]:
question = soup.find("div", {"class": "question"}) \
            .find("div", {"class": "s-prose js-post-body"}).get_text().strip()
print("Question : ", question)

answer = soup.find("div", {"class": "answer js-answer accepted-answer"})\
        .find("div", {"class":"s-prose js-post-body"}).get_text().strip()
print("Answer : ", answer)

Question :  What is the module/method used to get the current time?
Answer :  Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.


### Unicode Normalization

As we develop code for cleaning up HTML tags, we may also encounter various Unicode characters, including symbols, emojis, and other graphic characters.

To handle such symbols and special characters, we use Unicode normalization. As a first step, the text is encoded into a binary representation (UTF-8 below) so that it can be stored and processed consistently.

*Example*:

    text = 'I love Pizza 🍕! Shall we book a cab 🚕 to get pizza?'
    Text = text.encode("utf-8")
    print(Text)

which outputs:

    b'I love Pizza \xf0\x9f\x8d\x95! Shall we book a cab \xf0\x9f\x9a\x95 to get pizza?'
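
Encoding to UTF-8, as in the example above, fixes the byte representation but does not by itself make visually identical strings compare equal. A minimal sketch of normalization in the narrower sense, using the standard-library `unicodedata` module (the sample strings are made up for illustration):

```python
import unicodedata

composed = "caf\u00e9"      # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The two strings render identically but are different code point sequences.
print(composed == decomposed)

# NFC composes characters, so both spellings end up identical.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)

# NFKC additionally folds compatibility characters such as ligatures.
ligature = "\ufb01le"       # "file" written with the single-character "fi" ligature
print(unicodedata.normalize("NFKC", ligature))
```

Applying one normalization form to all incoming text before tokenization keeps later string comparisons and dictionary lookups consistent.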

### Spelling correction

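Shorthand typing and the fat-finger problem introduce spelling errors that can hurt downstream processing.

One option is to call a spell-check REST API, such as Microsoft's Bing Spell Check API used below.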

In [53]:
import json
import os

import requests

# API endpoint and key
endpoint = "https://api.bing.microsoft.com/v7.0/SpellCheck"
# Read the key from the environment instead of hard-coding it.
# BING_SPELLCHECK_KEY is a hypothetical variable name; use whatever your setup defines.
api_key = os.environ["BING_SPELLCHECK_KEY"]

# Request parameters
example_text = "Bnjour à tos"  # the text to be spell-checked
data = {'text': example_text}

params = {
    'mkt': 'en-us',
    'mode': 'proof',
}

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
}

# Send the request and pretty-print the JSON response
response = requests.post(endpoint, headers=headers, params=params, data=data)
response.raise_for_status()  # raise an exception on HTTP errors
print(json.dumps(response.json(), indent=4))

{
    "_type": "SpellCheck",
    "flaggedTokens": [
        {
            "offset": 0,
            "token": "Bnjour",
            "type": "UnknownToken",
            "suggestions": [
                {
                    "suggestion": "Bonjour",
                    "score": 0.7710960127696465
                }
            ]
        }
    ]
}


Going beyond APIs, we can build our own spell checker using a huge dictionary of words from a specific language.

*Library*: pyenchant
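
To illustrate the dictionary-based idea without any external library, the sketch below uses `difflib.get_close_matches` from the standard library. It ranks candidates by string similarity and is not how pyenchant works internally; the six-word `VOCABULARY` and the `suggest` helper are assumptions standing in for a full dictionary.

```python
from difflib import get_close_matches

# Assumption: a tiny hand-written vocabulary stands in for a full dictionary.
VOCABULARY = ["hello", "everyone", "good", "morning", "language", "processing"]

def suggest(word, vocabulary=VOCABULARY, max_suggestions=3):
    """Return the word itself if it is known, else the closest vocabulary entries."""
    if word.lower() in vocabulary:
        return [word]
    return get_close_matches(word.lower(), vocabulary, n=max_suggestions, cutoff=0.7)

for token in "helo evryone good mornng".split():
    print(token, "->", suggest(token))
```

An empty list means no vocabulary entry is similar enough, in which case the token is best left unchanged.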

### System-Specific Error Correction

* **PDF Documents** : The pipeline in this case starts with the extraction of plain text from PDF documents. However, different PDF documents are encoded differently, and sometimes we may not be able to extract the full text, or the structure of the text may get garbled.

    *Several libraries*: PyPDF, PDFMiner


* **Scanned documents** : Text is extracted from the scanned image with optical character recognition (OCR).

    *Library*: Tesseract

![Snippet OCR](Images/SnippetOCR.jpg)

In [5]:
from PIL import Image
from pytesseract import image_to_string, pytesseract

filename = r"Images/SnippetOCR.jpg"

pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# You need to install tesseract

text = image_to_string(Image.open(filename))
print(text)

in the nineteenth century the only Kind of linguistics considered
seriously was this comparative and historical study of words in languages
known ot believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen will remember how firmly he
declares that linguistic science is histotical. And those who have noticed



## Pre-Processing

Most NLP software works at the sentence level and expects, at a minimum, text that is separated into words.
So, we need some way to:
* split a text into words and sentences before proceeding further in the processing pipeline,
* sometimes remove special characters and digits,
* convert the text to lowercase or uppercase.

**Here are some common preprocessing steps used in NLP software:**

* Preliminaries: Sentence segmentation and word tokenization.
* Frequent steps : Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.
* Other steps: Normalization, language detection, code mixing, transliteration, etc.
* Advanced processing: POS tagging, parsing, coreference resolution, etc.

### Preliminaries

* **Sentence segmentation** : Splitting the text into sentences.
    A simple rule is to break at full stops and question marks, but abbreviations, forms of address (Dr., Mr., etc.), and ellipses (...) can break that rule.

In [12]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = """In the previous chapter, we saw examples of some common NLP
applications that we might encounter in everyday life. If we were asked to
build such an application, think about how we would approach doing so at our
organization. We would normally walk through the requirements and break the
problem down into several sub-problems, then try to develop a step-by-step
procedure to solve them. Since language processing is involved, we would also
list all the forms of text processing needed at each step. This step-by-step
processing of text is known as pipeline. It is the series of steps involved in
building any NLP model. These steps are common in every NLP project, so it
makes sense to study them in this chapter. Understanding some common procedures
in any NLP pipeline will enable us to get started on any NLP problem encountered
in the workplace. Laying out and developing a text-processing pipeline is seen
as a starting point for any NLP application development process. In this
chapter, we will learn about the various steps involved and how they play
important roles in solving the NLP problem and we’ll see a few guidelines
about when and how to use which step. In later chapters, we’ll discuss
specific pipelines for various NLP tasks (e.g., Chapters 4–7)."""

In [14]:
print(sent_tokenize(text))

['In the previous chapter, we saw examples of some common NLP\napplications that we might encounter in everyday life.', 'If we were asked to\nbuild such an application, think about how we would approach doing so at our\norganization.', 'We would normally walk through the requirements and break the\nproblem down into several sub-problems, then try to develop a step-by-step\nprocedure to solve them.', 'Since language processing is involved, we would also\nlist all the forms of text processing needed at each step.', 'This step-by-step\nprocessing of text is known as pipeline.', 'It is the series of steps involved in\nbuilding any NLP model.', 'These steps are common in every NLP project, so it\nmakes sense to study them in this chapter.', 'Understanding some common procedures\nin any NLP pipeline will enable us to get started on any NLP problem encountered\nin the workplace.', 'Laying out and developing a text-processing pipeline is seen\nas a starting point for any NLP application develo

* **Word tokenization**


Similar to sentence tokenization, to tokenize a sentence into words, we can start with a simple rule to split text into words based on the presence of punctuation marks.

In [16]:
print(word_tokenize(text))

['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.', 'If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.', 'We', 'would', 'normally', 'walk', 'through', 'the', 'requirements', 'and', 'break', 'the', 'problem', 'down', 'into', 'several', 'sub-problems', ',', 'then', 'try', 'to', 'develop', 'a', 'step-by-step', 'procedure', 'to', 'solve', 'them', '.', 'Since', 'language', 'processing', 'is', 'involved', ',', 'we', 'would', 'also', 'list', 'all', 'the', 'forms', 'of', 'text', 'processing', 'needed', 'at', 'each', 'step', '.', 'This', 'step-by-step', 'processing', 'of', 'text', 'is', 'known', 'as', 'pipeline', '.', 'It', 'is', 'the', 'series', 'of', 'steps', 'involved', 'in', 'building', 'any', 'NLP', 'model', '.', 'These', 'steps', 'are', '

Problems:

* "Mr. Jack O’Neil works at Melitas Marg, located at 245 Yonge Avenue, Austin, 70272." If we run this through the NLTK tokenizer, O, ’, and Neil are identified as three separate tokens.

* "There are \$10,000 and €1000 which are there just for testing a tokenizer." With this tokenizer, \$ and 10,000 are identified as separate tokens, while €1000 is identified as a single token.

* If we want to tokenize tweets, this tokenizer will separate a hashtag into two tokens: a “#” sign and the string that follows it. (Note that NLTK also has a dedicated tweet tokenizer.)

* “N.Y.!” contains three punctuation marks. But in English, N.Y. stands for New York, so “N.Y.” should be treated as a single word.

> Such language-specific exceptions can be specified in the tokenizer provided by spaCy

![Language-Specific exceptions during tokenization](Images/tokenization.jpg)
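
The problem cases above can be made concrete with a small rule-based tokenizer. The sketch below is not spaCy's exception mechanism; it is a plain `re` illustration of the same idea, in which special cases are tried before the general rules. The pattern and the sample sentence are assumptions for the example.

```python
import re

# Alternatives are tried in order, so special cases come before general rules.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Z]\.)+            # abbreviations such as N.Y. or U.S.
  | [#@]\w+                 # hashtags and mentions
  | [$€£]?\d+(?:[.,]\d+)*   # numbers, optionally with a currency sign
  | \w+(?:['’-]\w+)*        # words, including O’Neil and step-by-step
  | [^\w\s]                 # any other single non-space symbol
""", re.VERBOSE)

def tokenize(text):
    """Return the list of tokens found by the pattern above."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mr. O’Neil paid $10,000 and €1000 in N.Y.! #nlp"))
```

With these rules, “O’Neil”, “$10,000”, “N.Y.”, and “#nlp” each stay in one piece. The pattern is deliberately small and will misfire on other inputs, for instance a sentence that ends in a single capital letter followed by a full stop.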

### Frequent Steps

*Say we’re designing software that identifies the category of a news article as one of politics, sports, business, and other.*

1. Assume we have a good sentence segmenter and word tokenizer in place.
2. **Removing stop words**: Next, we have to think about what kind of information is useful for a categorization tool. Some frequently used English words, such as a, an, the, of, and in, are not particularly useful for this task, as they carry no content of their own that helps separate the four categories. Such words are called stop words and are typically (though not always) removed from further analysis in such scenarios.

3. **Removing punctuation and/or numbers**
4. **Lowercasing all text**

In [27]:
from nltk import download
download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cokev\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [32]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

def preprocess_corpus(texts):
    """Tokenize each text, drop stop words, digits and punctuation, then lowercase."""
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        return [token.lower() for token in tokens if token not in mystopwords
                and not token.isdigit() and token not in punctuation]
    # One list of cleaned tokens per input text
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

In [31]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

* Stemming : the process of removing suffixes and reducing a word to some base form such that all different variants of that word can be represented by the same form (e.g., “car” and “cars” are both reduced to “car”).

*The Porter stemmer is a popular stemming algorithm; it is used here through NLTK.*

In [35]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
word1, word2 = "cars", "revolution"
print(stemmer.stem(word1), stemmer.stem(word2))

car revolut


* Lemmatization : the process of mapping all the different forms of a word to its base word, or lemma.

    For example, the adjective “better,” when stemmed, remains the same. However, upon lemmatization, this should become "good".
    

In [39]:
from nltk import download
download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\cokev\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [40]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))

good


In [2]:
import spacy
sp = spacy.load("en_core_web_sm")
token = sp(u"better")
for word in token: print(word.text, word.lemma_)

better well


* We typically lowercase the text before stemming.

* In contrast, we don’t remove tokens or lowercase the text before lemmatization, because we have to know the part of speech of a word to get its lemma, and that requires all tokens in the sentence to be intact.

### Other Pre-Processing Steps

While we haven’t explicitly stated the nature of the texts, we have assumed that we’re dealing with regular English text. What’s different if that’s not the case?

* Text normalization : Consider a scenario where we’re working with a collection of social media posts to detect news events. In such text, the same word may appear with different spellings, in shortened forms, or in different cases. It’s useful to reach a canonical representation that maps all these variations onto one form.

    * Convert text to lowercase or uppercase
    * Convert digits to text (9 to nine)
    * Expand abbreviations
    * Use a dictionary that maps the different spellings of a preset collection of words to a single spelling

* Language detection : Not all content is in English (product reviews, for example, may be written in other languages). In such cases, language detection is performed as the first step in the NLP pipeline.

    *Libraries like Polyglot allow language detection.* 
    Once this step is done, the next steps could follow a language-specific pipeline.

* Code mixing and transliteration : A single piece of text can contain more than one language.

    * Code mixing : the phenomenon of switching between languages within the same text.

    * Transliteration : when people use multiple languages in their write-ups, they often type words from these languages in Roman script, with English spelling.
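
The text normalization steps listed above (lowercasing, digits to text, abbreviation expansion, and a spelling dictionary) can be sketched as a simple lookup. All three tables below are hypothetical and would in practice be built from the data at hand.

```python
import re

# Hypothetical lookup tables for illustration.
SPELLING_VARIANTS = {"u": "you", "gr8": "great", "pls": "please", "plz": "please"}
ABBREVIATIONS = {"asap": "as soon as possible", "imo": "in my opinion"}
DIGIT_WORDS = {str(i): word for i, word in enumerate(
    ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"])}

# One combined table: every known variant maps to its canonical form.
CANONICAL = {**DIGIT_WORDS, **ABBREVIATIONS, **SPELLING_VARIANTS}

def normalize(text):
    """Lowercase, split into word and symbol tokens, and map known variants."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(CANONICAL.get(token, token) for token in tokens)

print(normalize("Pls send 9 copies ASAP"))
```

Tokens that are not in any table pass through unchanged, so the tables can grow incrementally as new variants are observed.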

## Advanced Processing

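Imagine we’re asked to develop a system to identify person and organization names in our company’s collection of one million documents. The common pre-processing steps discussed earlier are not enough here: identifying names requires **POS tagging**, because knowing which tokens are proper nouns helps in recognizing person and organization names.

*Libraries like NLTK, spaCy, and Parsey McParseface Tagger* allow us to do that.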

In [5]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Charles Spencer Chaplin was born on 16 April 1889 to Hannah Chaplin \
            (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.shape_, token.is_alpha, token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
to to ADP xx True True
Hannah Hannah PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
                          SPACE      False False
( ( PUNCT ( False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
) ) PUNCT ) False False
and and CCONJ xxx True True
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False
' ' PUNCT ' False False
) ) PUNCT ) False False


Relation extraction: along with identifying person and organization names in our company’s collection of one million documents, we’re also asked to identify if a given person and organization are related to each other in some way. Think about what kind of pre-processing we need for this case:

1. POS tagging

2. A way of identifying person and organization names, which is a separate information extraction task known as named entity recognition (NER)

3. Identify patterns indicating “relation” between two entities in a sentence (using syntactic representation of the sentence, such as parsing)

4. Coreference resolution, to identify and link multiple mentions of the same entity

![Output of different stages of NLP pipeline processing](Images/OutputPipeline.jpg)


![Pre-processing steps](Images/preprocessingsteps.jpg)

## Feature Engineering

**Definition** (feature engineering, feature extraction, or text representation) : the set of methods used to feed the pre-processed text into an ML algorithm.

**2 approaches**:

1. Classical NLP and traditional ML pipeline :

    Convert the raw data into a format that a machine can consume, for example by counting the number of positive and negative words in a text.
    Features are heavily inspired by the task at hand as well as domain knowledge.

    > One of the advantages of handcrafted features is that the model remains interpretable.

    The main drawback of classical ML models is the feature engineering. Handcrafted feature engineering becomes a bottleneck for both model performance and the model development cycle.

2. DL Pipeline : 

    In the DL pipeline, the raw data (after preprocessing) is directly fed to a model. The model is capable of “learning” features from the data.

    > The model loses interpretability.

![Feature engineering](Images/FeatureEngineering.jpg)

## Modeling

1. At the start, when we have limited data, we can use simpler methods and rules. 
2. Over time, with more data and a better understanding of the problem, we can add more complexity and improve performance

### Start with Simple Heuristics

* For instance, in email spam-classification tasks, we may have a blacklist of domains that are used exclusively to send spam.

* In an e-commerce setting, we may use a heuristic based on the number of purchases for ordering search results and show products belonging to the same category as recommendations.

* Using regular expressions:

    * Normal regular expressions : for some information, such as email IDs, dates, and telephone numbers.
    
    * Stanford NLP’s TokensRegex and spaCy’s rule-based matching: The following image shows a pattern that looks for text containing the lemma “match,” appearing as a noun, optionally preceded by an adjective, and followed by any word form of lemma “be.”

![spaCy's rule-based matcher](Images/rulebasedmatcher.jpg)

Even when we’re building ML-based models, we can use such heuristics to handle special cases—for example, cases where the model has failed to learn well.
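
As a sketch of the plain regular-expression approach mentioned above, the patterns below pull email addresses, ISO-style dates, and one US-style phone format out of a sentence. They are deliberately simple assumptions for illustration; real-world formats vary far more than this.

```python
import re

# Deliberately simple patterns covering one common format each.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")              # e.g. 2021-03-15
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")    # e.g. 555-123-4567

text = "Contact jane.doe@example.com or call 555-123-4567 before 2021-03-15."
emails = EMAIL.findall(text)
dates = DATE.findall(text)
phones = PHONE.findall(text)
print(emails, dates, phones)
```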

### Building Your Model

Further, as we collect more data, our ML model starts beating pure heuristics. At that point, a common practice is to combine heuristics directly or indirectly with the ML model. There are two broad ways of doing that:

1. **Create a feature from the heuristic for your ML model** : When there are many heuristics where the behavior of a single heuristic is deterministic but their combined behavior is fuzzy in terms of how they predict, it’s best to use these heuristics as features to train your ML model.

    **For instance, in the email spam-classification example, we can add features, such as the number of words from the blacklist in a given email or the email bounce rate, to the ML model.**

2. **Pre-process your input to the ML model** : If the heuristic predicts a particular class with very high confidence, then it’s best to apply it before feeding the data to your ML model.

    **For instance, if for certain words in an email, there’s a 99% chance that it’s spam, then it’s best to classify that email as spam instead of sending it to an ML model.**
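
Both ways of combining heuristics with a model can be sketched on the spam example. Everything named here is hypothetical: the blacklist terms, the high-confidence phrase, and `toy_model`, which merely stands in for a trained classifier.

```python
import re

SPAM_TERMS = {"lottery", "winner", "viagra", "prize"}        # hypothetical blacklist
NEAR_CERTAIN_SPAM = re.compile(r"wire transfer fee", re.I)   # hypothetical high-confidence rule

def blacklist_count(email):
    """Way 1: a heuristic turned into a numeric feature for the model."""
    words = re.findall(r"[a-z]+", email.lower())
    return sum(word in SPAM_TERMS for word in words)

def toy_model(features):
    """Stand-in for a trained classifier; returns a spam probability."""
    return min(1.0, 0.2 + 0.3 * features["blacklist_count"])

def classify(email):
    # Way 2: a high-confidence heuristic decides before the model is called.
    if NEAR_CERTAIN_SPAM.search(email):
        return "spam"
    features = {"blacklist_count": blacklist_count(email)}
    return "spam" if toy_model(features) >= 0.5 else "ham"

print(classify("Pay the wire transfer fee today"))
print(classify("You are a lottery winner, claim your prize"))
print(classify("Meeting moved to 3pm"))
```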

Several providers offer off-the-shelf APIs for various NLP tasks. Once you’re comfortable that the task is feasible and conclude that the off-the-shelf models give reasonable results, you can move toward building custom ML models and improving them.

### Building THE Model

We start with a baseline approach and work toward improving it.

1. Ensemble and stacking : A common practice is not to rely on a single model but to use a collection of ML models.

    a) Model stacking : we feed one model’s output as input to another model.
    
    b) Model ensembling : we pool predictions from multiple models to make a final prediction.

![Model ensemble and stacking](Images/modelensemble_stacking.jpg)

**For example, in the email spam-classification case, we can assume that we run three different models: a heuristic-based score, Naive Bayes, and an LSTM. The output of these three models is then fed into a meta-model based on logistic regression, which gives the probability of the email being spam.**

2. Better feature engineering : feature selection, ...

3. Transfer learning : Transfer learning tries to transfer preexisting knowledge from a big, well-trained model to a newer model at its initial phase.

    As an example, for email spam classification, we can fine-tune BERT on the email dataset.

4. Reapplying heuristics : No ML model is perfect, so at the end of the modeling pipeline it’s possible to revisit the cases where the model still makes mistakes, find any common patterns in the errors, and use heuristics to correct them. We can also apply domain-specific knowledge.
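
The stacking example from item 1 (three base models feeding a logistic-regression meta-model) and a simple voting ensemble can be sketched as follows. The three scoring functions are placeholders for trained models, and the weights and bias are invented; in practice the meta-model would learn them from held-out predictions.

```python
import math

def heuristic_score(email):
    return 1.0 if "free money" in email.lower() else 0.0

def naive_bayes_score(email):   # placeholder for a trained Naive Bayes model
    return 0.9 if "winner" in email.lower() else 0.1

def lstm_score(email):          # placeholder for a trained LSTM
    return 0.8 if "!!!" in email else 0.2

BASE_MODELS = (heuristic_score, naive_bayes_score, lstm_score)
WEIGHTS = (2.0, 1.5, 1.0)       # hypothetical meta-model weights
BIAS = -2.0                     # hypothetical meta-model bias

def stacked_spam_probability(email):
    """Stacking: base-model outputs become inputs to a logistic meta-model."""
    scores = [model(email) for model in BASE_MODELS]
    z = BIAS + sum(w * s for w, s in zip(WEIGHTS, scores))
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_vote(email):
    """Ensembling: majority vote over thresholded base-model scores."""
    votes = [model(email) >= 0.5 for model in BASE_MODELS]
    return sum(votes) >= 2

for email in ["Free money for the winner!!!", "See you at lunch"]:
    print(email, round(stacked_spam_probability(email), 3), ensemble_vote(email))
```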

## Evaluation

Success in this phase depends on two factors:

1. Using the right metric for evaluation

    * The right metrics vary with the task and also with the phase: model building, deployment, and production.

        In the first two phases, we typically use ML metrics; in the final phase, we also include business metrics to measure business impact.

    * Also, evaluations are of two types: intrinsic and extrinsic.
    
        * Intrinsic evaluation focuses on intermediary objectives,
        * Extrinsic evaluation focuses on performance on the final objective.
        
        For example, consider a spam-classification system.
        
        * The ML metrics will be precision and recall,
        * The business metric will be “the amount of time users spend on a spam email.”

2. Following the right evaluation process


### Intrinsic Evaluation

The output of the NLP model on a data point is compared against the corresponding label for that data point, and metrics are calculated based on the match.

![Metrics](Images/metrics.jpg)
![Metrics](Images/metrics2.jpg)



**Confusion matrix**: It allows us to inspect the actual and predicted output for different classes in the dataset.

**Recall at various ranks**: For example, for information retrieval, a common metric is “Recall at rank K”; it looks for the presence of ground truth in top K retrieved results.
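
A minimal sketch of how such intrinsic metrics are computed from labels and predictions. The data is made up, and `recall_at_k` uses one common definition (the fraction of relevant items found in the top K results); other variants only check whether any ground-truth item appears in the top K.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for one class, from paired labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def recall_at_k(relevant, retrieved, k):
    """Fraction of relevant items that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(precision_recall(y_true, y_pred))
print(recall_at_k(relevant=["d1", "d4"], retrieved=["d3", "d1", "d2", "d4"], k=2))
```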

In tasks such as translation, automated evaluation may not work.

**Extrinsic Evaluation**: measures the model's performance on the final objective, that is, its business impact.

Why it matters:

* Consider a scenario where the model does well on the ML metrics but doesn’t really save email service users much time.

* Or consider a question-answering model that does very well on intrinsic metrics but still fails to address a large number of questions.

> The way to carry out extrinsic evaluation is to set up the business metrics and the process to measure them correctly at the start of the project

Intrinsic evaluation can mostly be done by the AI team itself, whereas extrinsic evaluation often involves stakeholders outside the AI team, sometimes even end users. This makes extrinsic evaluation a much more expensive process than intrinsic evaluation.

Bad results in intrinsic evaluation often imply bad results in extrinsic evaluation. However, the converse may not be true.

## Post-Modeling Phases

### Deployment

Deployment entails plugging the NLP module into the broader system. It involves **making sure input and output data pipelines are in order**, as well as making sure our **NLP module is scalable under heavy load**.

### Monitoring
After deployment, the model's performance has to be monitored continuously: we need to ensure that the outputs our models produce day to day still make sense.

### Model Updating

* More training data is generated post-deployment : Once deployed, signals extracted from usage can be used to improve the model automatically.

    We can also try online learning to retrain the model automatically on a daily basis.

* Training data is not generated post-deployment : Manual labeling can be done to improve evaluation and the models.

* Low model latency is required, or the model has to be online with near-real-time response :

    * Use models with fast inference
    * Use memoization strategies such as caching
    * Use more computing power

* Low model latency is not required, or the model can run offline : we can use more advanced and slower models
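
The caching strategy mentioned for low-latency settings can be sketched with `functools.lru_cache`. `slow_model_predict` is a hypothetical stand-in for an expensive model call, and the call counter exists only to show that repeated inputs never reach the model.

```python
from functools import lru_cache

call_count = 0

def slow_model_predict(text):
    """Hypothetical stand-in for an expensive model call."""
    global call_count
    call_count += 1
    return "spam" if "prize" in text.lower() else "ham"

@lru_cache(maxsize=10_000)
def cached_predict(text):
    # Identical inputs are answered from the cache instead of the model.
    return slow_model_predict(text)

for message in ["Claim your prize", "Lunch at noon?", "Claim your prize"]:
    print(message, "->", cached_predict(message))
print("model calls:", call_count)
```

Caching helps only when identical inputs recur and the model's answer for a given input does not change; after a model update the cache should be cleared, for example with `cached_predict.cache_clear()`.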

