# Natural Language Processing (NLP) - Pipeline

An NLP pipeline consists of the following steps:
1. Data Acquisition
2. Text Preparation
    * Text Cleanup
    * Basic Preprocessing
    * Advanced Preprocessing
3. Feature Engineering
4. Modelling
    * Model Building
    * Evaluation
5. Deployment
    * Deployment
    * Monitoring
    * Model Update

## 1. Data Acquisition

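Three scenarios are possible for the data needed by the task at hand:
- Available: the data is already at hand
- Available to others: the data exists but is held by someone else
- Does not exist: the data has to be created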

![Data Acquisition](./figures/data_acquisition.png)

## 2. Preprocessing

![Data Preprocessing](./figures/data_preprocessing.png)

### Text Cleanup

#### HTML tag removal

In [6]:
sample_text = """<a href="/wiki/Wikipedia:Purpose" title="Wikipedia:Purpose">Wikipedia's purpose</a> is to benefit readers by presenting information on all branches of <a href="/wiki/Knowledge" title="Knowledge">knowledge</a>. 
Hosted by the <a href="/wiki/Wikipedia:Wikimedia_Foundation" title="Wikipedia:Wikimedia Foundation">Wikimedia Foundation</a>, 
it consists of <a href="/wiki/Help:Editing" title="Help:Editing">freely editable</a> content, whose articles also have 
numerous links to guide readers towards more information."""


In [7]:
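import re

def striphtml(data):
    # Remove anything that looks like a tag. [^>] also matches newlines,
    # so tags that span several lines are removed as well.
    return re.sub(r'<[^>]+>', '', data)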

In [8]:
striphtml(sample_text)

"Wikipedia's purpose is to benefit readers by presenting information on all branches of knowledge. \nHosted by the Wikimedia Foundation, \nit consists of freely editable content, whose articles also have \nnumerous links to guide readers towards more information."

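A regular expression is enough for the sample above, but it does not decode HTML entities such as `&amp;`. The following is a minimal sketch of an alternative that uses the standard library's `html.parser`; the names `TextExtractor` and `strip_html` are hypothetical and chosen only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&".
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.parts.append(data)


def strip_html(markup):
    """Return the text content of `markup` with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any text still buffered by the parser
    return "".join(extractor.parts)


print(strip_html('<a href="/wiki/Knowledge" title="Knowledge">knowledge</a>.'))
print(strip_html("<p>Tom &amp; Jerry</p>"))
```

For scraped pages with scripts, styles, or badly broken markup, a dedicated HTML library is usually the safer choice.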
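#### Emoji removal - Unicode Normalization

Emoji are ordinary Unicode characters. Encoding the text as UTF-8 shows the multi-byte sequences behind each of them.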

In [9]:
emoji_text = "Gratitude is key. 🙏 Appreciate what you have: home 🏠, family 👨‍👩‍👧, work 💻, food 🍲, nature 🌳, friends 🤝. Count your blessings daily. Stay thankful. 💖"

In [10]:
emoji_text.encode('utf-8')

b'Gratitude is key. \xf0\x9f\x99\x8f Appreciate what you have: home \xf0\x9f\x8f\xa0, family \xf0\x9f\x91\xa8\xe2\x80\x8d\xf0\x9f\x91\xa9\xe2\x80\x8d\xf0\x9f\x91\xa7, work \xf0\x9f\x92\xbb, food \xf0\x9f\x8d\xb2, nature \xf0\x9f\x8c\xb3, friends \xf0\x9f\xa4\x9d. Count your blessings daily. Stay thankful. \xf0\x9f\x92\x96'
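The cell above only shows how emoji are encoded. One possible way to actually remove them, sketched here with the standard library's `unicodedata` module, is to drop every character in the Unicode category `So` (Symbol, other) together with the joiner characters that glue emoji sequences together. This is an approximation and an assumption of this sketch, not a complete emoji definition: it also removes other symbols such as `©`. The helper name `strip_emoji` is hypothetical.

```python
import unicodedata

# Characters that glue emoji sequences together but are not symbols themselves.
EMOJI_JOINERS = {"\u200d", "\ufe0f"}  # zero-width joiner, variation selector-16


def is_skin_tone_modifier(ch):
    # U+1F3FB..U+1F3FF are the emoji skin tone modifiers.
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def strip_emoji(text):
    """Drop emoji-like characters (category 'So'), joiners and skin tones."""
    return "".join(
        ch
        for ch in text
        if ch not in EMOJI_JOINERS
        and not is_skin_tone_modifier(ch)
        and unicodedata.category(ch) != "So"
    )


sample = "Stay thankful. \U0001F496"
print(repr(strip_emoji(sample)))

# Unicode normalization: NFKC folds compatibility characters,
# for example the single-character ligature U+FB01 into "fi".
print(unicodedata.normalize("NFKC", "\ufb01nd"))
```

Removing a character leaves the surrounding whitespace in place, so a whitespace cleanup step usually follows.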

#### Spell Checking

In [13]:
pip install textblob

In [7]:
incorrect_text = 'ceartain aonditiona during seveerel ggenrations are modifiedd in the saame manner' 

In [8]:
from textblob import TextBlob

textBlob = TextBlob(incorrect_text)

textBlob.correct()

TextBlob("certain condition during several generations are modified in the same manner")

### Basic Preprocessing

#### Tokenisation

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/op/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [9]:
sample_text = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed bibendum turpis a enim imperdiet, sit amet efficitur ipsum pharetra. In eu ipsum non nisi tincidunt consectetur. Donec vitae nisi vel metus eleifend tempor. Curabitur fringilla, nibh non ultrices volutpat, purus turpis lacinia nunc, vel aliquet dolor ipsum et nisi. Donec eget nibh vel nisl consectetur sodales."""

In [15]:
from nltk.tokenize import sent_tokenize, word_tokenize

sentences = sent_tokenize(sample_text)
sentences

['Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
 'Sed bibendum turpis a enim imperdiet, sit amet efficitur ipsum pharetra.',
 'In eu ipsum non nisi tincidunt consectetur.',
 'Donec vitae nisi vel metus eleifend tempor.',
 'Curabitur fringilla, nibh non ultrices volutpat, purus turpis lacinia nunc, vel aliquet dolor ipsum et nisi.',
 'Donec eget nibh vel nisl consectetur sodales.']

In [16]:
for sent in sentences:
    print(word_tokenize(sent))

['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', '.']
['Sed', 'bibendum', 'turpis', 'a', 'enim', 'imperdiet', ',', 'sit', 'amet', 'efficitur', 'ipsum', 'pharetra', '.']
['In', 'eu', 'ipsum', 'non', 'nisi', 'tincidunt', 'consectetur', '.']
['Donec', 'vitae', 'nisi', 'vel', 'metus', 'eleifend', 'tempor', '.']
['Curabitur', 'fringilla', ',', 'nibh', 'non', 'ultrices', 'volutpat', ',', 'purus', 'turpis', 'lacinia', 'nunc', ',', 'vel', 'aliquet', 'dolor', 'ipsum', 'et', 'nisi', '.']
['Donec', 'eget', 'nibh', 'vel', 'nisl', 'consectetur', 'sodales', '.']
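To make the idea behind tokenisation concrete, here is a rough regex-based sketch that produces the same tokens as above for simple sentences. It is not how NLTK works internally: NLTK's tokenizers also handle abbreviations, contractions, and other edge cases that this sketch gets wrong (for example, it would split after "Dr."). The function names are hypothetical.

```python
import re


def simple_sent_tokenize(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(sentence):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed bibendum turpis."
for sentence in simple_sent_tokenize(text):
    print(simple_word_tokenize(sentence))
```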


_Other techniques we will code later_

## 3. Feature Engineering

Feature engineering converts text into numbers so that it can be fed to a model for training and testing.

There are two high-level approaches, depending on the model-building algorithm:
- ML algorithms: features have to be designed manually, based on domain knowledge
- DL algorithms: features are generated by the algorithm itself

Each approach has its own advantages and disadvantages:
- ML algorithms
    - Interpretable results
    - Domain knowledge required
    - Hand-crafted features may not be robust
- DL algorithms
    - Features are generated by the algorithm, so no manual feature design is needed
    - Less interpretable
## 4. Modelling 

![Modelling](./figures/modelling.png)
<center>Modelling Phase of NLP Pipeline</center>

- Based on data availability and the nature of the problem, we choose among different approaches to build a model.
- `Intrinsic Evaluation` measures model performance with technical metrics such as accuracy and perplexity.
- `Extrinsic Evaluation` measures model performance in a business setting, for example how often users use the model's suggestions.
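The two intrinsic metrics named above can be sketched in a few lines. Accuracy is the fraction of correct predictions; perplexity is the exponential of the average negative log-probability the model assigned to each token, so lower is better. The function names and the toy inputs are assumptions made for this illustration.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Toy classification labels: three of four predictions are correct.
print(accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"]))

# A model that gives every observed token probability 0.25 is as uncertain
# as a uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```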

## 5. Deployment

The three phases of deployment are:
- **Deploying**: The approach depends on the model's application use case.
  Ex: API (microservices), chatbot, app integration
- **Monitoring**: Tracking model performance through dashboards.
- **Model Update**: The model is updated when new data becomes available or to increase robustness. With online learning, the model trains on the server itself as new data arrives.
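To illustrate the online learning idea, the sketch below updates a tiny binary text classifier one example at a time, as it might happen when labelled data arrives on a server. A perceptron is used only because it is short; the class name `OnlinePerceptron` and the toy data stream are assumptions of this sketch, not a recommended production setup.

```python
from collections import defaultdict


class OnlinePerceptron:
    """Binary text classifier that is updated one example at a time."""

    def __init__(self):
        self.weights = defaultdict(float)  # one weight per token

    def predict(self, tokens):
        # .get avoids creating entries for unseen tokens.
        score = sum(self.weights.get(token, 0.0) for token in tokens)
        return 1 if score > 0 else 0

    def update(self, tokens, label):
        """Adjust weights only when the current prediction is wrong."""
        error = label - self.predict(tokens)  # -1, 0 or 1
        for token in tokens:
            self.weights[token] += error


model = OnlinePerceptron()

# Simulated stream of (tokens, label) pairs; 1 = positive, 0 = negative.
stream = [
    (["great", "movie"], 1),
    (["bad", "movie"], 0),
    (["great", "plot"], 1),
    (["bad", "plot"], 0),
]
for tokens, label in stream:
    model.update(tokens, label)

print(model.predict(["great", "film"]))
print(model.predict(["bad", "film"]))
```

Because each update touches only the weights of the tokens in the incoming example, the model never needs to be retrained from scratch on the full data set.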