# Natural Language Processing
NLP is a subfield of AI that draws on linguistics and computer science. The goal is for machines to understand natural language so that they can help people solve a problem or guide them through a task, for instance in spoken form. An example is guiding people through the interface at an ATM booth. Some common NLP tasks are:

* Text Classification
* Sentiment Analysis
* Information Retrieval
* Parts of Speech Tagging
* Language Detection and Machine Translation
* Conversational Agents
* Knowledge Graph and QA System
* Text Summarization
* Topic Modelling
* Text Generation
* Text Parsing
* Speech to Text

The main approaches to NLP are:

* Heuristic Method
* Machine Learning Method
* Deep Learning Method
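
To make the heuristic approach concrete, here is a minimal sketch of a rule-based sentiment labeller. The keyword sets and the function name `heuristic_sentiment` are made up for illustration; they are assumptions, not part of any library.

```python
# Hypothetical keyword lists; a real system would need far larger lexicons.
POSITIVE_WORDS = {"wonderful", "great", "stunning", "love"}
NEGATIVE_WORDS = {"boring", "terrible", "awful", "worst"}

def heuristic_sentiment(text):
    """Label text by counting hand-picked positive and negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(heuristic_sentiment("A wonderful little production"))
print(heuristic_sentiment("The worst and most boring film"))
```

Rules like these are quick to write but break easily on sarcasm, negation and unseen words, which is why the machine learning and deep learning methods exist.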

Challenges in NLP:

* Ambiguity
* Contextual Words
* Colloquialisms and Slang
* Synonyms
* Irony, Sarcasm and Tonal Difference
* Spelling Errors
* Creativity
* Diversity

# NLP Pipeline
It is the set of steps followed to build end-to-end NLP software. These steps are:

* Data Acquisition
* Text Preparation
  * Text Cleanup
  * Basic Preprocessing
  * Advanced Preprocessing
* Feature Engineering
* Modelling
  * Building
  * Evaluation
* Deployment
  * Deployment
  * Monitoring
  * Model Update

Notes:

* This is not universal
* DL pipelines are slightly different
* Pipeline is non-linear

### Text Preparation

In [1]:
# Importing Libraries
from textblob import TextBlob  # spelling correction with TextBlob took too long (see below)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import pandas as pd
import numpy as np
import re
import string

In [2]:
nltk.download("stopwords")
nltk.download("punkt_tab")
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/anish/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/anish/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/anish/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

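The NLTK resources above (the stop word list, the Punkt tokenizer data and WordNet) have to be downloaded once before they can be used.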

In [3]:
df = pd.read_csv("imdb_dataset.csv")

In [4]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


The first step is to lowercase the text. String comparison in Python is case-sensitive, so `Movie` and `movie` would otherwise be treated as different tokens even though they are the same word.

In [5]:
df["review"] = df["review"].str.lower()

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. <br /><br />the...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive


Next we use a regular expression to remove HTML tags such as `<br />`, since they carry no information about the review.

In [7]:
def remove_html(data):
    # Non-greedy match of everything between "<" and the next ">"
    pattern = re.compile(r"<.*?>")
    return pattern.sub("", data)

In [8]:
df.review = df.review.apply(remove_html)

In [9]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
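
One side effect is worth noting: deleting a tag outright glues the words on either side of it together, which is where tokens such as `methe` and `wordit` in the later outputs come from once punctuation is removed as well. A minimal sketch of an alternative, assuming a made-up sample string and a hypothetical helper `remove_html_keep_gap` that substitutes a space instead:

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")

def remove_html_keep_gap(data):
    """Replace each tag with a space, then collapse runs of whitespace."""
    no_tags = TAG_PATTERN.sub(" ", data)
    return re.sub(r"\s+", " ", no_tags).strip()

# Made-up sample with a line-break tag between two sentences
sample = "exactly what happened with me.<br /><br />the first thing"

print(TAG_PATTERN.sub("", sample))   # tags deleted: "me.the" is fused
print(remove_html_keep_gap(sample))  # tags replaced by a space: "me. the"
```

The rest of this notebook keeps the original `remove_html`, so its outputs still contain the fused tokens.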


The next step is to remove URLs from the text.

In [10]:
def remove_url(data):
    pattern = re.compile(r"https?://\S+|www\.\S+")
    return pattern.sub(r"", data)

In [11]:
df.review = df.review.apply(remove_url)

In [12]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive
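
The preview looks the same as before, since the visible part of these rows shows no URL. To see what the pattern actually matches, here is a small check on a made-up string (the example addresses are placeholders):

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Made-up sample containing one http(s) link and one bare www link
sample = "full review at https://example.com/oz?id=1 or www.example.org today"

print(URL_PATTERN.findall(sample))  # the two matched URLs
print(URL_PATTERN.sub("", sample))  # the text with both URLs deleted
```

Because `\S+` runs up to the next whitespace, punctuation that directly follows a URL (such as a closing full stop) is removed along with it, and the deletion leaves double spaces behind. The later `split()` and `join()` in stop word removal tidies up those spaces.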


The next step is to remove punctuation.

In [13]:
punctuations = string.punctuation

In [14]:
def remove_punc(data):
    for char in punctuations:
        data = data.replace(char, "")
    return data

In [15]:
df_temp = df.copy()  # work on a copy; plain assignment would only alias df and modify it too
df_temp.review = df_temp.review.apply(remove_punc)

In [16]:
df_temp.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


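The loop above works, but it makes one pass over the string for every punctuation character, which becomes a problem on a huge dataset. `str.translate` removes all of them in a single pass.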

In [17]:
def remove_punctuation(data):
    return data.translate(str.maketrans("", "", punctuations))

In [18]:
df.review = df.review.apply(remove_punctuation)

In [19]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
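
The speed claim can be checked with a rough measurement. This is a minimal sketch on a made-up sample string; it first confirms that both functions return the same result and then times them with `timeit`. The constant `PUNCT_TABLE` is an addition that builds the translation table once rather than on every call.

```python
import string
import timeit

punctuations = string.punctuation

def remove_punc(data):
    # One full pass over the string per punctuation character
    for char in punctuations:
        data = data.replace(char, "")
    return data

# Build the translation table once instead of on every call
PUNCT_TABLE = str.maketrans("", "", punctuations)

def remove_punctuation(data):
    # Single pass that drops every character listed in the table
    return data.translate(PUNCT_TABLE)

# Made-up sample, repeated to give the timer something to measure
sample = "basically there's a family, where a little boy (jake) thinks... " * 200

assert remove_punc(sample) == remove_punctuation(sample)

loop_time = timeit.timeit(lambda: remove_punc(sample), number=200)
table_time = timeit.timeit(lambda: remove_punctuation(sample), number=200)
print(f"replace loop: {loop_time:.4f}s, translate: {table_time:.4f}s")
```

The exact numbers depend on the machine, the Python version and the text, so treat them as an indication rather than a benchmark.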


The next step would be chat word treatment, which expands the short forms people use in place of full phrases. It needs a dictionary whose keys are the short forms and whose values are their meanings. This step matters for any application that interacts with real users, since they generally write with such short forms. For now, though, we move on to spelling correction.
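
Although this notebook skips the step, a minimal sketch shows the idea. The dictionary `CHAT_WORDS` and the function `expand_chat_words` are hypothetical and only cover three entries; the sketch also assumes the text has already been lowercased.

```python
# Hypothetical lookup table; real applications need a much larger one.
CHAT_WORDS = {
    "imo": "in my opinion",
    "btw": "by the way",
    "asap": "as soon as possible",
}

def expand_chat_words(data):
    """Replace each known short form with its expansion, word by word."""
    return " ".join(CHAT_WORDS.get(word, word) for word in data.split())

print(expand_chat_words("imo this show is great btw"))
```

It could be applied to the whole column in the same way as the other steps, with `df.review.apply(expand_chat_words)`.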

In [20]:
# Create the spell checker once and reuse it for every review
spell = Speller(lang="en")

def correct_spell(data):
    return spell(data)

In [21]:
# df.review = df.review.apply(correct_spell)

The above step takes too long with both TextBlob and autocorrect, so let's skip it for now.

In [22]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive


The next step is to remove stop words, which are very common words such as "the" and "of" that carry little meaning on their own.

In [23]:
en_stopwords = set(stopwords.words("english"))  # a set makes the membership test in the loop fast

In [24]:
def remove_stopwords(data):
    end_string = []
    for word in data.split():
        if word not in en_stopwords:
            end_string.append(word)
    return " ".join(end_string)

In [25]:
df.review = df.review.apply(remove_stopwords)

In [26]:
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
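
Words such as `youll` and `wouldnt` survive in the outputs because the punctuation was stripped first: NLTK's English list spells contractions with an apostrophe (for example `you'll`), so the stripped forms no longer match. A minimal sketch of the effect of step order, using a tiny hypothetical stop word set and a made-up sentence:

```python
import string

# Tiny hypothetical stop word set for illustration; NLTK's English list is much longer.
STOPWORDS = {"you'll", "wouldn't", "be", "a", "the"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def drop_stopwords(data):
    return " ".join(w for w in data.split() if w not in STOPWORDS)

sample = "you'll be hooked, it wouldn't dare"

# Order used in this notebook: punctuation first, then stop words
punct_first = drop_stopwords(sample.translate(PUNCT_TABLE))
# Alternative order: stop words first, then punctuation
stop_first = drop_stopwords(sample).translate(PUNCT_TABLE)

print(punct_first)
print(stop_first)
```

Neither order is perfect: removing stop words first misses words that still have punctuation attached (a stop word followed by a comma, for instance), so which order fits best depends on the data.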


The next step is tokenization, which splits text into words or sentences.

In [27]:
df_sample = df.iloc[0, 0]

In [28]:
word_tokenize(df_sample)

['one',
 'reviewers',
 'mentioned',
 'watching',
 '1',
 'oz',
 'episode',
 'youll',
 'hooked',
 'right',
 'exactly',
 'happened',
 'methe',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'wordit',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'manyaryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'moreso',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'awayi',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'f

In [29]:
sent_tokenize(df_sample)

['one reviewers mentioned watching 1 oz episode youll hooked right exactly happened methe first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use wordit called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home manyaryans muslims gangstas latinos christians italians irish moreso scuffles death stares dodgy dealings shady agreements never far awayi would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romanceoz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle

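The outputs above use a single review as a sample, since processing the whole dataset takes a long time. Because punctuation was removed earlier, `sent_tokenize` finds no sentence boundaries and returns the whole review as one sentence. spaCy is another library that can help with tokenization. The next step is stemming, which strips suffixes to reduce each word to a root form (its stem).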

In [30]:
ps = PorterStemmer()

In [31]:
def stem_words(data):
    return " ".join([ps.stem(word) for word in data.split()])

In [32]:
stem_words(df_sample)

'one review mention watch 1 oz episod youll hook right exactli happen meth first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti surreal couldnt say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom c

As we can see, every word has been reduced to a stem, but a stem is not always a valid word in the language (for example `episod` and `violenc`). When real words are needed, we can use lemmatization instead.

In [33]:
wordnet = WordNetLemmatizer()

In [34]:
word_token = word_tokenize(df_sample)

In [35]:
for word in word_token:
    print(f"{word}, {wordnet.lemmatize(word, pos='v')}")  # single quotes inside the f-string work on Python versions before 3.12 too

one, one
reviewers, reviewers
mentioned, mention
watching, watch
1, 1
oz, oz
episode, episode
youll, youll
hooked, hook
right, right
exactly, exactly
happened, happen
methe, methe
first, first
thing, thing
struck, strike
oz, oz
brutality, brutality
unflinching, unflinching
scenes, scenes
violence, violence
set, set
right, right
word, word
go, go
trust, trust
show, show
faint, faint
hearted, hearted
timid, timid
show, show
pulls, pull
punches, punch
regards, regard
drugs, drug
sex, sex
violence, violence
hardcore, hardcore
classic, classic
use, use
wordit, wordit
called, call
oz, oz
nickname, nickname
given, give
oswald, oswald
maximum, maximum
security, security
state, state
penitentary, penitentary
focuses, focus
mainly, mainly
emerald, emerald
city, city
experimental, experimental
section, section
prison, prison
cells, cells
glass, glass
fronts, front
face, face
inwards, inwards
privacy, privacy
high, high
agenda, agenda
em, em
city, city
home, home
manyaryans, manyaryans
muslims, mu

Text preparation is now done.
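
The cleanup cells above can be collected into one function so that the same steps run in the same order every time. This is a minimal sketch: the name `prepare_text` and the small `STOPWORDS` set are assumptions for illustration (the notebook itself uses NLTK's English list), and stemming or lemmatization are left out.

```python
import re
import string

HTML_PATTERN = re.compile(r"<.*?>")
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Small hypothetical stop word set; the notebook uses NLTK's English list.
STOPWORDS = {"a", "the", "is", "of", "this", "at"}

def prepare_text(data, stopwords=STOPWORDS):
    """Apply the cleanup steps in the same order as the cells above."""
    data = data.lower()                  # lowercase
    data = HTML_PATTERN.sub("", data)    # remove HTML tags
    data = URL_PATTERN.sub("", data)     # remove URLs
    data = data.translate(PUNCT_TABLE)   # remove punctuation
    # remove stop words; split/join also collapses leftover whitespace
    return " ".join(word for word in data.split() if word not in stopwords)

print(prepare_text("A wonderful little production. <br /><br />See www.example.com now"))
```

On the DataFrame it would be applied with `df.review.apply(prepare_text)`.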

### Feature Engineering

In [36]:
cv = CountVectorizer()

In [37]:
bow = cv.fit_transform(df.iloc[:2, 0])

In [38]:
print(cv.vocabulary_)

{'one': 121, 'reviewers': 148, 'mentioned': 108, 'watching': 200, 'oz': 125, 'episode': 47, 'youll': 211, 'hooked': 86, 'right': 149, 'exactly': 50, 'happened': 81, 'methe': 110, 'first': 61, 'thing': 184, 'struck': 176, 'brutality': 11, 'unflinching': 194, 'scenes': 153, 'violence': 197, 'set': 161, 'word': 205, 'go': 71, 'trust': 190, 'show': 166, 'faint': 56, 'hearted': 83, 'timid': 186, 'pulls': 139, 'punches': 140, 'regards': 146, 'drugs': 40, 'sex': 163, 'hardcore': 82, 'classic': 19, 'use': 195, 'wordit': 206, 'called': 12, 'nickname': 119, 'given': 68, 'oswald': 124, 'maximum': 106, 'security': 157, 'state': 174, 'penitentary': 129, 'focuses': 63, 'mainly': 100, 'emerald': 44, 'city': 17, 'experimental': 52, 'section': 156, 'prison': 136, 'cells': 13, 'glass': 70, 'fronts': 65, 'face': 54, 'inwards': 89, 'privacy': 137, 'high': 84, 'agenda': 2, 'em': 43, 'home': 85, 'manyaryans': 103, 'muslims': 115, 'gangstas': 66, 'latinos': 95, 'christians': 16, 'italians': 91, 'irish': 90, 

In [39]:
print(bow[0].toarray())

[[1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 2 1 1 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0
  1 1 0 0 1 2 0 1 1 0 0 2 1 0 1 1 1 0 1 1 1 0 1 0 0 2 0 1 3 1 1 2 1 0 1 1
  1 1 1 0 0 1 0 0 0 1 1 1 2 1 1 1 2 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 1
  1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 0 1 5 1 0 0 1 0 1 0 0 0 1 3 1 0 1 1 0 1 0
  0 0 1 0 1 2 1 1 2 1 1 0 1 1 0 0 0 1 0 1 1 0 3 1 1 1 1 0 0 1 1 1 2 0 1 1
  0 0 0 0 1 0 1 1 0 0 1 1 0 1 1 1 1 4 0 1 2 1 2 0 0 1 1 0 1 1 0 1]]


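Above we can see bag of words in action on the first two reviews. The first output, `cv.vocabulary_`, maps each word in the vocabulary to its column index, and the second output gives the count of every vocabulary word in the first review. We will now look at n-grams.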
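
Before moving on, the counting behind bag of words can be reproduced by hand. This is a minimal sketch on two made-up documents, not how `CountVectorizer` is implemented; unlike `CountVectorizer`, whose default token pattern ignores single-character tokens (which is why `1` is missing from the vocabulary above), it simply splits on whitespace.

```python
from collections import Counter

# Two made-up documents
docs = ["wonderful little production", "wonderful way spend time wonderful"]

# Sorted vocabulary with column indices, mirroring the idea behind `vocabulary_`
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(all_words)}

def to_counts(doc):
    """Return one count per vocabulary word, in column order."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

print(vocabulary)
for doc in docs:
    print(to_counts(doc))
```

Each document becomes a vector as long as the vocabulary, and word order within the document is lost.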

In [40]:
cv_ngram = CountVectorizer(ngram_range=(2, 2))

In [41]:
bow_ng = cv_ngram.fit_transform(df.iloc[:2, 0])

In [42]:
print(cv_ngram.vocabulary_)

{'one reviewers': 133, 'reviewers mentioned': 170, 'mentioned watching': 119, 'watching oz': 231, 'oz episode': 138, 'episode youll': 49, 'youll hooked': 247, 'hooked right': 95, 'right exactly': 171, 'exactly happened': 52, 'happened methe': 88, 'methe first': 121, 'first thing': 64, 'thing struck': 211, 'struck oz': 203, 'oz brutality': 137, 'brutality unflinching': 11, 'unflinching scenes': 221, 'scenes violence': 178, 'violence set': 227, 'set right': 186, 'right word': 172, 'word go': 241, 'go trust': 77, 'trust show': 217, 'show faint': 192, 'faint hearted': 58, 'hearted timid': 90, 'timid show': 213, 'show pulls': 193, 'pulls punches': 160, 'punches regards': 161, 'regards drugs': 168, 'drugs sex': 40, 'sex violence': 188, 'violence hardcore': 225, 'hardcore classic': 89, 'classic use': 20, 'use wordit': 223, 'wordit called': 242, 'called oz': 12, 'oz nickname': 141, 'nickname given': 130, 'given oswald': 74, 'oswald maximum': 136, 'maximum security': 117, 'security state': 182,

In [43]:
print(bow_ng[0].toarray())

[[1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0
  0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 1
  1 1 1 0 1 1 1 1 0 1 0 0 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0
  0 0 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0
  0 0 1 0 1 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 0 1 1
  0 1 1 0 0 0 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 0 0 0 0 1 0 1 1 0
  0 1 1 0 1 1 0 1 1 1 1 1 1 0 1 2 0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 1]]
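
How the bigram keys in the vocabulary are formed can be shown with a few lines of plain Python. This is a minimal sketch with a hypothetical helper `word_ngrams`; the sample is the start of the first review without the single-character token `1`, which the vectorizer drops.

```python
def word_ngrams(text, n):
    """Return the list of n-word sequences found in whitespace-split text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "one reviewers mentioned watching oz"

print(word_ngrams(sample, 2))  # bigrams
print(word_ngrams(sample, 3))  # trigrams
```

The bigrams printed here are the same as the first keys of `cv_ngram.vocabulary_`.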


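The above was a bigram representation; other n-gram sizes can be selected through the `ngram_range` parameter. These count-based representations still have problems, such as treating every term as equally important, so we will now look at TF-IDF. It multiplies term frequency (how often a term occurs in a document) by inverse document frequency (how rare the term is across all documents), so terms that are frequent in one document but rare in the others get the highest weights.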

In [44]:
tfidf = TfidfVectorizer()

In [45]:
tfidf.fit_transform(df.iloc[0: 2, 0]).toarray()

array([[0.06508585, 0.        , 0.06508585, 0.06508585, 0.06508585,
        0.06508585, 0.06508585, 0.06508585, 0.06508585, 0.06508585,
        0.06508585, 0.06508585, 0.06508585, 0.06508585, 0.06508585,
        0.        , 0.06508585, 0.13017169, 0.06508585, 0.06508585,
        0.        , 0.        , 0.06508585, 0.        , 0.        ,
        0.06508585, 0.06508585, 0.06508585, 0.06508585, 0.06508585,
        0.06508585, 0.        , 0.06508585, 0.        , 0.        ,
        0.        , 0.06508585, 0.06508585, 0.        , 0.        ,
        0.06508585, 0.13017169, 0.        , 0.06508585, 0.06508585,
        0.        , 0.        , 0.13017169, 0.06508585, 0.        ,
        0.06508585, 0.06508585, 0.06508585, 0.        , 0.06508585,
        0.06508585, 0.06508585, 0.        , 0.06508585, 0.        ,
        0.        , 0.13017169, 0.        , 0.06508585, 0.19525754,
        0.06508585, 0.06508585, 0.13017169, 0.06508585, 0.        ,
        0.06508585, 0.06508585, 0.06508585, 0.04

The above is the TF-IDF matrix, with one row per review and one column per vocabulary term.

In [46]:
print(tfidf.idf_)
print(tfidf.get_feature_names_out())

[1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.         1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.         1.40546511 1.40546511 1.40546511 1.4054

The above shows the IDF value of each term; `get_feature_names_out()` returns the terms in the same column order.
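
The two distinct IDF values in the output can be checked by hand. This is a minimal sketch, assuming the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents (2 here) and df is the number of documents containing the term. The helper name `smooth_idf` is made up for this example.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(round(smooth_idf(2, 1), 8))  # term that occurs in one of the two reviews
print(round(smooth_idf(2, 2), 8))  # term that occurs in both reviews
```

Terms found in only one review get the higher value and terms shared by both reviews get 1, which matches the pattern in `tfidf.idf_`.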