# 1. Data acquisition
- Use a public dataset
- Scrape data
- Product intervention
- Data augmentation
## Data augmentation
- Synonym replacement
- Back translation
- TF-IDF–based word replacement
- Bigram flipping
- Replacing entities
- Adding noise to data
- Advanced techniques
 * Snorkel
 * Easy Data Augmentation (EDA) and NLPAug
 * Active learning
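
Two of the simpler techniques in the list above, bigram flipping and adding noise, can be sketched in plain Python. This is only a minimal illustration: the function names `bigram_flip` and `add_noise`, the tiny keyboard-neighbor map, and the fixed random seed are all assumptions made for this example, not part of any library.

```python
import random

# Hypothetical, tiny map of keyboard neighbours used to simulate typos.
KEYBOARD_NEIGHBORS = {"a": "s", "e": "r", "i": "o", "m": "n", "o": "p"}

def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    flipped = list(tokens)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped

def add_noise(text, rng, rate=0.1):
    """Replace some characters with a keyboard neighbour to mimic typos."""
    noisy = []
    for ch in text:
        if ch in KEYBOARD_NEIGHBORS and rng.random() < rate:
            noisy.append(KEYBOARD_NEIGHBORS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)

rng = random.Random(0)  # fixed seed so that runs are repeatable
sentence = "I am going to the supermarket"
tokens = sentence.split()
flipped = bigram_flip(tokens, rng)
noisy = add_noise(sentence, rng, rate=0.3)
print(flipped)
print(noisy)
```

Both functions keep the label of the original example valid while changing its surface form, which is the point of augmentation.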

# 2. Text Extraction and Cleanup
## HTML Parsing and Cleanup

In [6]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"
html = urlopen(myurl).read()

soupified = BeautifulSoup(html, "html.parser")
question = soupified.find("div", {"class": "question"})
questiontext = question.find("div", {"class": "s-prose js-post-body"})

print("Question: \n", questiontext.get_text())

Question: 
 What is the module/method used to get the current time?


In [7]:
answer = soupified.find("div", {"class": "answer"})
answertext = answer.find("div", {"class": "s-prose js-post-body"})

print("Best answer: \n", answertext.get_text().strip())

Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.


## Unicode Normalization
b'I love Pizza \xf0\x9f\x8d\x95! Shall we book a cab \xf0\x9f\x9a\x95 to get pizza?'
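
The byte string above is what a sentence containing emoji looks like once it is encoded. A minimal sketch, assuming UTF-8 as the encoding (the byte sequences shown match the UTF-8 encodings of the pizza and taxi emoji):

```python
text = "I love Pizza 🍕! Shall we book a cab 🚕 to get pizza?"

# Encode the string into bytes so that emoji and other symbols
# are stored in a binary representation.
encoded = text.encode("utf-8")
print(encoded)

# Decoding with the same encoding recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == text)
```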

## Spelling Correction
Shorthand typing: *Hllo world! I am back!* \
Fat finger problem [20]: *I pronise that I will not bresk the silence
again!*

In [10]:
import requests
import json

api_key = "<ENTER-KEY-HERE>"
example_text = "Hollo, wrld" # the text to be spell-checked
endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/SpellCheck"

data = {'text': example_text}
params = {
    'mkt':'en-us',
    'mode':'proof'
    }
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
    }

response = requests.post(endpoint, headers=headers, params=params, data=data)

json_response = response.json()
print(json.dumps(json_response, indent=4))

{
    "error": {
        "code": "401",
        "message": "Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
    }
}


Output with a valid subscription key (partially shown here):
```
"suggestions": [
    {
    "suggestion": "Hello",
    "score": 0.9115257530801
    },
    {
    "suggestion": "Hollow",
    "score": 0.858039839213461
    },
    {
    "suggestion": "Hallo",
    "score": 0.597385084464481
    }
    ]
```

## System-Specific Error Correction
Consider another scenario where our dataset is in the
form of a collection of PDF documents. Several libraries, such as **PyPDF** and
**PDFMiner**, can extract text from PDF documents. \
Another common source of textual data is scanned documents. Text
extraction from scanned documents is typically done through **optical
character recognition (OCR)**, using libraries such as **Tesseract**.

# 3. Pre-Processing
- **Preliminaries**: Sentence segmentation and word tokenization.
- **Frequent steps**: Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing, etc.
- **Other steps**: Normalization, language detection, code mixing, transliteration, etc.
- **Advanced processing**: POS tagging, parsing, coreference resolution, etc.

## Preliminaries
### SENTENCE SEGMENTATION
As a simple rule, we can do sentence segmentation by breaking up text
into sentences at the appearance of full stops and question marks.
A commonly used library is the Natural Language
Toolkit (NLTK).

In [4]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')

mytext = "In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life. If we were asked to build such an application, think about how we would approach doing so at our organization. We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them. Since language processing is involved, we would also list all the forms of text processing needed at each step. This step-by-step processing of text is known as pipeline. It is the series of steps involved in building any NLP model. These steps are common in every NLP project, so it makes sense to study them in this chapter. Understanding some common procedures in any NLP pipeline will enable us to get started on any NLP problem encountered in the workplace. Laying out and developing a text-processing pipeline is seen as a starting point for any NLP application development process. In this chapter, we will learn about the various steps involved and how they play important roles in solving the NLP problem and we’ll see a few guidelines about when and how to use which step. In later chapters, we’ll discuss specific pipelines for various NLP tasks (e.g., Chapters 4–7)."

my_sentences = sent_tokenize(mytext)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\minhh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
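
For comparison, the simple rule mentioned at the start of this section (split at full stops and question marks) can be sketched with a regular expression. This is only an illustration of the naive rule; the function name `naive_sent_split` is an assumption for this example. The second call shows one reason a trained tokenizer such as NLTK's is usually preferred: abbreviations break the rule.

```python
import re

def naive_sent_split(text):
    """Split text after every full stop or question mark followed by whitespace."""
    parts = re.split(r"(?<=[.?])\s+", text.strip())
    return [p for p in parts if p]

print(naive_sent_split("Is this a pipeline? Yes. It has several steps."))
# The rule wrongly splits after the abbreviation "Dr.":
print(naive_sent_split("Dr. Smith built the model."))
```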


### WORD TOKENIZATION

In [5]:
for sentence in my_sentences:
    print(sentence)
    print(word_tokenize(sentence))

In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.
['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.']
If we were asked to build such an application, think about how we would approach doing so at our organization.
['If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.']
We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.
['We', 'would', 'normally', 'walk', 'through', 'the', 'requirements', 'and', 'break', 'the', 'problem', 'down', 'into', 'several', 'sub-problems', ',', 'then', 'try', 'to', 'develop', 'a', 'step-by-step', 'procedure', 'to', 'solve', 'them', '.']
Since

![img](images/tokenization.svg)
<center>Language-specific (English here) exceptions during tokenization</center>

## Frequent Steps
The code example below shows how to remove stop words, digits,
and punctuation and lowercase a given collection of texts:


In [7]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation

def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    
    def remove_stops_digits(tokens):
        # Lowercase before filtering so capitalized stop words (e.g., "The") are removed too
        lowered = [token.lower() for token in tokens]
        return [token for token in lowered if token not in mystopwords
                and not token.isdigit() and token not in punctuation]
    
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

### STEMMING AND LEMMATIZATION
Stemming refers to the process of removing suffixes and reducing a
word to some base form such that all different variants of that word
can be represented by the same form (e.g., “car” and “cars” are both
reduced to “car”). \
The following code snippet shows how to use a popular stemming
algorithm called **Porter Stemmer** using NLTK:

In [8]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
word1, word2 = "cars", "revolution"
print(stemmer.stem(word1), stemmer.stem(word2))

car revolut


Lemmatization is the process of mapping all the different forms of a
word to its base word, or lemma. While this seems close to the
definition of stemming, they are, in fact, different. For example, the
adjective “better,” when stemmed, remains the same. However, upon
lemmatization, this should become “good”.

![](images/stemming-lemmatization.png)
<center>Difference between stemming and lemmatization</center>

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a")) #a is for adjective

In [None]:
import spacy
sp = spacy.load('en_core_web_sm')
token = sp(u'better')
for word in token:
    print(word.text, word.lemma_)

NLTK prints the output as “good,” whereas spaCy prints “well”; both
are correct.

## Other Pre-Processing Steps
### TEXT NORMALIZATION
Consider a scenario where we’re working with a collection of social
media posts to detect news events. Social media text is very different
from the language we’d see in, say, newspapers. A word can be
spelled in different ways, including in shortened forms, a phone
number can be written in different formats (e.g., with and without
hyphens), names are sometimes in lowercase, and so on. When we’re
working on developing NLP tools to work with such data, it’s useful to
reach a canonical representation of text that captures all these
variations into one representation. This is known as text
normalization.
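
A minimal sketch of what such a canonical representation could look like, assuming three simple rules: lowercasing, rewriting 10-digit phone numbers into one format, and expanding a few shortened spellings. The mapping `CANONICAL_FORMS` and both function names are hypothetical and exist only for this example; real normalization rules depend on the data.

```python
import re

# Hypothetical mapping from shortened spellings to a canonical form.
CANONICAL_FORMS = {"u": "you", "gr8": "great", "pls": "please"}

def normalize_phone_numbers(text):
    """Rewrite 10-digit numbers written with or without separators as one form."""
    pattern = r"\b(\d{3})[-. ]?(\d{3})[-. ]?(\d{4})\b"
    return re.sub(pattern, r"\1-\2-\3", text)

def normalize(text):
    """Lowercase, unify phone numbers, and expand shortened word forms."""
    text = normalize_phone_numbers(text.lower())
    tokens = [CANONICAL_FORMS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

# Two surface variants of the same message map to one representation.
print(normalize("Pls call 555 123 4567"))
print(normalize("please call 5551234567"))
```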

### LANGUAGE DETECTION
We can use libraries like **Polyglot** for
language detection.

### CODE MIXING AND TRANSLITERATION
There’s another scenario where a
single piece of content is in more than one language.

## Advanced Processing
Identifying names requires us to be
able to do POS tagging, as identifying proper nouns can be useful in
identifying person and organization names. \
Pre-trained and readily usable POS
taggers are implemented in NLP libraries such as NLTK, spaCy,
and Parsey McParseface Tagger.

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 toHannah Chaplin (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')
for token in doc:
    print(token.text, token.lemma_, token.pos_,
    token.shape_, token.is_alpha, token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
toHannah toHannah PROPN xxXxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
( ( PUNCT ( False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
) ) PUNCT ) False False
and and CCONJ xxx True True
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False


![](images/output.png)
<center>Output from different stages of NLP pipeline processing</center>

![](images/An-example-of-preprocessing-steps-of-text.pbm)
<center>Advanced pre-processing steps on a blob of text</center>

# 4. Feature Engineering
The
goal of feature engineering is to capture the characteristics of the text
into a numeric vector that can be understood by the ML algorithms.
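
As one concrete illustration of turning text into a numeric vector, the sketch below builds simple token-count (bag-of-words) vectors. Bag-of-words is chosen here only because it is easy to show; the notes above do not prescribe a particular representation, and the function names are assumptions for this example.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an index to every distinct token, in sorted order."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_count_vector(tokens, vocab):
    """Represent one document as a vector of token counts."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

docs = [["dog", "bites", "man"], ["man", "bites", "dog", "dog"]]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of numbers, one position per vocabulary entry, which is the kind of input a classical ML algorithm expects.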

![](images/classical-NLP.png)
![](images/deep-learning-NLP.png)

## Classical NLP/ML Pipeline
Feature
engineering steps convert the raw data into a format that can be
consumed by a machine. These transformation functions are usually
handcrafted in the classical ML pipeline, aligning to the task at hand.

## DL Pipeline
Handcrafted feature engineering becomes a bottleneck for both model
performance and the model development cycle. A noisy or unrelated
feature can potentially harm the model’s performance by adding more
randomness to the data. \
In the DL pipeline, the raw data (after preprocessing) is directly fed to a model. The model is capable of
“learning” features from the data. Hence, these features are more in
line with the task at hand, so they generally give improved
performance.

# 5. Modeling
## Start with Simple Heuristics
At the very start of building a model, ML may not play a major role by
itself. Part of that could be due to a lack of data, but human-built
heuristics can also provide a great start in some ways. Heuristics may
already be part of your system, either implicitly or explicitly.

## Building Your Model
- Create a feature from the heuristic for your ML model
- Pre-process your input to the ML model
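
The two options above can be sketched as follows. The scenario (flagging spam with a word blacklist), the `BLACKLIST` contents, the threshold, and `dummy_model` are all hypothetical stand-ins chosen for illustration; in practice the model would be a trained classifier.

```python
# Hypothetical heuristic: count words from a hand-made blacklist.
BLACKLIST = {"lottery", "winner", "free"}

def blacklist_count(text):
    """Heuristic signal: number of blacklisted words in the text."""
    return sum(1 for tok in text.lower().split() if tok.strip(".,!?") in BLACKLIST)

def extract_features(text):
    """Option 1: expose the heuristic as one feature among others."""
    return {"num_tokens": len(text.split()), "blacklist_count": blacklist_count(text)}

def classify(text, model, threshold=3):
    """Option 2: let a confident heuristic decide before calling the model."""
    if blacklist_count(text) >= threshold:
        return "spam"
    return model(extract_features(text))

# Stand-in for a trained classifier; any callable taking features would do.
def dummy_model(features):
    return "spam" if features["blacklist_count"] > 0 else "not spam"

print(extract_features("Free lottery! You are a winner!"))
print(classify("Free lottery! You are a winner!", dummy_model))
print(classify("See you at lunch", dummy_model))
```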

Additionally, we have NLP service providers, such as Google Cloud
Natural Language, Amazon Comprehend, Microsoft Azure
Cognitive Services, and IBM Watson Natural Language
Understanding, which provide off-the-shelf APIs to solve various
NLP tasks. 

## Building THE Model
- Ensemble and stacking
- Better feature engineering
- Transfer learning
- Reapplying heuristics

| Data attribute | Decision path | Examples |
|---|---|---|
| Large data volume | Can use techniques that require more data, like DL. Can use a richer set of features as well. If the data is sufficiently large but unlabeled, we can also apply unsupervised techniques. | If we have a lot of reviews and metadata associated with them, we can build a sentiment-analysis tool from scratch. |
| Small data volume | Need to start with rule-based or traditional ML solutions that are less data hungry. Can also adapt cloud APIs and generate more data with weak supervision. We can also use transfer learning if there’s a similar task that has large data. | This often happens at the start of a completely new project. |
| Data quality is poor and the data is heterogeneous in nature | More data cleaning and preprocessing might be required. | This entails issues like code mixing (different languages being mixed in the same sentence), unconventional language, transliteration, or noise (like social media text). |
| Data quality is good | Can directly apply off-the-shelf algorithms or cloud APIs more easily. | Legal text or newspapers. |
| Data consists of full-length documents | Choose the right strategy for breaking the document into lower levels, like paragraphs, sentences, or phrases, depending on the problem. | Document classification, review analysis, etc. |


# 6. Evaluation
Evaluations are of two types: intrinsic and extrinsic. Intrinsic
focuses on intermediary objectives, while extrinsic focuses on
evaluating performance on the final objective.
## Intrinsic Evaluation
| Metric | Description  | Applications |
|---|---|---|
| Accuracy | Used when the output variable is categorical or discrete. It denotes the fraction of times the model makes correct predictions as compared to the total predictions it makes. | Mainly used in classification tasks, such as sentiment classification (multiclass), natural language inference (binary), paraphrase detection (binary), etc. |
| Precision | Shows how precise or exact the model’s predictions are, i.e., given all the positive predictions it makes, how many of them are indeed positive? | Used in various classification tasks, especially in cases where mistakes in a positive class are more costly than mistakes in a negative class, e.g., disease predictions in healthcare. |
| Recall | Recall is complementary to precision. It captures how well the model can recall the positive class, i.e., given all the positive (the class we care about) cases, how many can the model classify correctly? | Used in classification tasks, especially where retrieving positive results is more important, e.g., e-commerce search and other information retrieval tasks. |
| F1 score | Combines precision and recall to give a single metric, which also captures the trade-off between precision and recall, i.e., completeness and exactness. F1 is defined as (2 × Precision × Recall) / (Precision + Recall). | Used simultaneously with accuracy in most of the classification tasks. It is also used in sequence-labeling tasks, such as entity extraction, retrieval-based question answering, etc. |
| AUC | Captures the trade-off between the true positive rate and the false positive rate as we vary the threshold for prediction. | Used to measure the quality of a model independent of the prediction threshold. It is used to find the optimal prediction threshold for a classification task. |
| MRR (mean reciprocal rank) | Used to evaluate the responses retrieved given their probability of correctness. It is the mean of the reciprocal of the ranks of the retrieved results. | Used heavily in all information-retrieval tasks, including article search, e-commerce search, etc. |
| MAP (mean average precision) | Used in ranked retrieval results, like MRR. It calculates the mean precision across each retrieved result. | Used in information-retrieval tasks. |
| RMSE (root mean squared error) | Captures a model’s performance in a real-value prediction task. Calculates the square root of the mean of the squared errors for each data point. | Used in conjunction with MAPE in the case of regression problems, from temperature prediction to stock market price prediction. |
| MAPE (mean absolute percentage error) | Used when the output variable is a continuous variable. It is the average of absolute percentage error for each data point. | Used to test the performance of a regression model. It is often used in conjunction with RMSE. |
| BLEU (bilingual evaluation understudy) | Captures the amount of n-gram overlap between the output sentence and the reference ground truth sentence. It has many variants. | Mainly used in machine translation tasks. Recently adapted to other text generation tasks, such as paraphrase generation and text summarization. |
| METEOR | A precision-based metric to measure the quality of text generated. It fixes some of the drawbacks of BLEU, such as exact word matching while calculating precision. METEOR allows synonyms and stemmed words to be matched with the reference word. | Mainly used in machine translation. |
| ROUGE | Another metric to compare quality of generated text with respect to a reference text. As opposed to BLEU, it measures recall. | Since it measures recall, it’s mainly used for summarization tasks where it’s important to evaluate how many words a model can recall. |
| Perplexity | A probabilistic measure that captures how confused an NLP model is. It’s derived from the cross-entropy in a next word prediction task. | Used to evaluate language models. It can also be used in language-generation tasks, such as dialog generation. |
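
A few of the metrics in the table can be computed directly from their definitions. The sketch below is a minimal illustration, not a replacement for a tested metrics library. The function names are assumptions, and `mean_reciprocal_rank` assumes the common formulation where each query contributes the reciprocal rank of its first relevant result.

```python
import math

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

def mean_reciprocal_rank(first_relevant_ranks):
    """Mean of 1/rank of the first relevant result per query (ranks start at 1)."""
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def rmse(y_true, y_pred):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
print(mean_reciprocal_rank([1, 2, 4]))
print(rmse([3.0, 5.0], [2.0, 7.0]))
```

In the example, two of the three positive predictions are correct (precision 2/3), while only two of the four actual positives are found (recall 1/2), which shows how the two metrics can diverge on the same predictions.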





## Extrinsic Evaluation
Extrinsic evaluation focuses on evaluating the
model performance on the final objective.

# 7. Post-Modeling Phases
Once the model is built and evaluated, we move on to the post-modeling phase: deploying, monitoring, and updating the model. We’ll
cover these briefly in this section.
- Deployment
- Monitoring
- Model Updating

# 8. Working with Other Languages
The
pipeline for some languages may be very similar to English, whereas
some languages and scenarios may require us to rethink how we
approach the problem.

# 9. Case Study
Let’s see a case study, using Uber’s tool to improve customer care:
Customer Obsession Ticketing Assistant (COTA).
![](images/Uber-pipeline.png)
<center>NLP pipeline for ranking tickets in a ticketing system by Uber</center>