# Day 1 - Content

Natural Language Processing Fundamentals

1. NLP Introduction
2. NLP Stop Words

Text Processing Pipeline

3. POS Tagging
4. Named Entity Recognition
5. NLP Statistical Methods - Bag Of Words and TF-IDF

Feature Engineering in NLP

6. Text Normalization and Tokenization
7. Embedding and Word2Vec

# 1. NLP Introduction

<img src='n4.png' />

In [None]:
NLP
- NLU (Natural Language Understanding) - Semantic Analysis (context and intent)
- NLG (Natural Language Generation)

In [None]:
Statistical Modeling

- generative modeling -> predicting the next item (e.g. the next word) from the probability of what came before it
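
The idea of predicting from the probability of the previous item can be sketched with a tiny bigram model. This is only an illustration: the toy corpus and the helper name `next_word_probabilities` are made up for this example, and real generative models are far larger.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count how often each word follows each previous word.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        next_word_counts[prev][nxt] += 1


def next_word_probabilities(prev):
    """Estimate P(next word | previous word) from the bigram counts."""
    counts = next_word_counts[prev]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


print(next_word_probabilities("sat"))  # {'on': 1.0}
print(next_word_probabilities("the"))  # 'cat' is the most likely next word
```

In this corpus "sat" is always followed by "on", so the model is certain; after "the" it spreads the probability over several candidates.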

In [None]:
context

"your t-shirt is killer"

In [None]:
intent

"my mom gave me money to buy 1kg tomatoes otherwise she will be angry"

In [None]:
give instructions to a machine
- python
- java
- c

ChatGPT
- english


In [None]:
applications of NLP

- sentiment analysis
- toxicity classification - threats, insults, hatred
- machine translation
- NER - Named Entity Recognition - name, organization, location or quantities
- email spam detection
- text generation - autocomplete, chatbots
- information retrieval
- summarization
- question-answering bots

# 2. NLP Stop Words

<img src='v41.png' />

# Text Preprocessing Pipeline

In [None]:
ord("A")  # 65, the code point of "A" in ASCII and UTF-8

In [None]:
# Raw Text

"<SUBJECT LINE> Employees details.\
<END><BODY TEXT>Attached are 2 files 1st, one is pairoll 2nd is healtcare !"


In [None]:
# remove encoding / markup tags

"Employees details. Attached are 2 files 1st, one is pairoll 2nd is healtcare !"

In [None]:
# lower casing

"employees details. attached are 2 files 1st, one is pairoll 2nd is healtcare !"

In [None]:
# digits to words

"employees details. attached are two files first, one is pairoll second is healtcare !"

In [None]:
# remove special characters - @!#$%^

"employees details attached are two files first one is pairoll second is healtcare"

In [None]:
# spelling corrections

"employees details attached are two files first one is payroll second is healthcare"

In [None]:
# remove stop words

"employees details attached two files first one payroll second healthcare"

In [None]:
# stemming

"employe detail attach two file first one payroll second healthcare"

In [None]:
# lemmatizing - ran -> run, jumped -> jump (returns dictionary words, unlike stemming)

"employee detail attach two file first one payroll second healthcare"

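The first few pipeline steps above can be sketched with the standard library alone. This is a rough illustration, not a production pipeline: `STOP_WORDS` and `NUMBER_WORDS` are tiny hand-made tables assumed for this example, and spelling correction, stemming and lemmatizing are left out because they need a dictionary or a library such as NLTK.

```python
import re

# Tiny hand-made tables (assumptions for this example only).
STOP_WORDS = {"are", "is"}
NUMBER_WORDS = {"2": "two", "1st": "first", "2nd": "second"}

raw = ("<SUBJECT LINE> Employees details."
       "<END><BODY TEXT>Attached are 2 files 1st, one is pairoll 2nd is healtcare !")


def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)       # remove markup tags
    text = text.lower()                        # lower casing
    text = re.sub(r"[a-z0-9]+",                # digits to words
                  lambda m: NUMBER_WORDS.get(m.group(), m.group()),
                  text)
    text = re.sub(r"[^a-z0-9\s]", "", text)    # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # remove stop words
    return " ".join(tokens)


print(preprocess(raw))
# employees details attached two files first one pairoll second healtcare
```

The misspellings survive because this sketch skips the spelling-correction step.
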
# 3. POS Tagging

<img src='n5.png' />

In [13]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

# 4. Named Entity Recognition

<img src='v45.jpg' />

# 5. NLP Statistical Methods - Bag Of Words and TF-IDF

In [None]:
text to numerical embeddings


1 - statistical methods
        - Bag of Words
        - TF-IDF

2 - ML/DL based Methods
        - embeddings(lookup) method
        - Word2Vec -> Continuous Bag Of Words
        - transformer based architectures

### Bag Of Words

<img src='n1.png' />

In [None]:
the cat sat on the mat
the dog sat on the cat
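
A minimal Bag of Words sketch for two short sentences like those above, using only the standard library. The names `vocab` and `bag_of_words` are chosen for this example. Each document becomes a vector of word counts over a shared vocabulary, and word order is discarded.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the cat"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]


vectors = [bag_of_words(doc) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [1, 1, 0, 1, 1, 2]]
```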

### Sequential Representation

<img src='n2.png' />

### TF-IDF

- weight each word by its importance

In [None]:
Term Frequency (TF) - How often does the word occur in this document?

Inverse Document Frequency (IDF) - How rare is the word across the whole corpus? Words that appear in almost every document get a low weight.
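
A small sketch of one common textbook form of TF-IDF, `tf * log(N / df)`, on a made-up three-document corpus. Library implementations often use smoothed variants, so their numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (hypothetical).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the cat".split(),
    "the bird sang".split(),
]


def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term).

    Assumes the term occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf_idf("the", docs[0], docs))            # 0.0 ('the' is in every document)
print(round(tf_idf("mat", docs[0], docs), 3))  # 0.183 ('mat' is in only one)
```

"the" is frequent in the first document but carries no weight because it appears everywhere, while the rarer "mat" scores higher.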

# Feature Engineering in NLP

# 6. Text Normalization and Tokenization

Tokenization Example - https://platform.openai.com/tokenizer

<img src='v42.png' />

<img src='v43.jpg' />

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

In [7]:
text = 'this is a single sentence.'

tokens = word_tokenize(text)

print(tokens)

['this', 'is', 'a', 'single', 'sentence', '.']


In [8]:
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
no_punctuation

['this', 'is', 'a', 'single', 'sentence']

In [9]:
text = 'this is the first sentence. this is the second sentence. this is the document.'

print(sent_tokenize(text))

['this is the first sentence.', 'this is the second sentence.', 'this is the document.']


In [10]:
print([word_tokenize(sentence) for sentence in sent_tokenize(text)])

[['this', 'is', 'the', 'first', 'sentence', '.'], ['this', 'is', 'the', 'second', 'sentence', '.'], ['this', 'is', 'the', 'document', '.']]


In [11]:
stop_words = stopwords.words('english')

print(stop_words[:20])

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [12]:
text = 'this is the first sentence. this is the second sentence. this is the document.'

tokens = [token for token in word_tokenize(text) if token not in stop_words]

print(tokens)

['first', 'sentence', '.', 'second', 'sentence', '.', 'document', '.']


### Tokenization

1. Word Tokenization
2. Sentence Tokenization
3. Regular Expression Tokenization

In [None]:
import re
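
A short sketch of regular expression tokenization with the standard `re` module. The patterns are simple illustrative choices, not the rules NLTK uses internally.

```python
import re

text = 'this is the first sentence. this is the second sentence.'

# Words only: runs of letters, digits or underscores.
word_tokens = re.findall(r"\w+", text)

# Words plus each punctuation mark as its own token.
tokens_with_punct = re.findall(r"\w+|[^\w\s]", text)

# Sentences: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(tokens_with_punct)
print(sentences)
# ['this is the first sentence.', 'this is the second sentence.']
```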

# 7. Embedding and Word2Vec


GloVe (pre-trained word embeddings, used in the examples below) - https://nlp.stanford.edu/projects/glove/

In [21]:
import numpy as np

In [22]:
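def loadGlove(path):
    """Load GloVe vectors from a text file into a {word: vector} dict."""
    model = {}

    # 'with' closes the file even if parsing fails part-way through
    with open(path, 'r', encoding='utf8') as file:
        for l in file:
            line = l.split()
            word = line[0]
            value = np.array([float(val) for val in line[1:]])
            model[word] = value

    return model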

glove = loadGlove('glove.6B.50d.txt')

In [23]:
glove['python']   # vector embedding for the word Python

array([ 0.5897  , -0.55043 , -1.0106  ,  0.41226 ,  0.57348 ,  0.23464 ,
       -0.35773 , -1.78    ,  0.10745 ,  0.74913 ,  0.45013 ,  1.0351  ,
        0.48348 ,  0.47954 ,  0.51908 , -0.15053 ,  0.32474 ,  1.0789  ,
       -0.90894 ,  0.42943 , -0.56388 ,  0.69961 ,  0.13501 ,  0.16557 ,
       -0.063592,  0.35435 ,  0.42819 ,  0.1536  , -0.47018 , -1.0935  ,
        1.361   , -0.80821 , -0.674   ,  1.2606  ,  0.29554 ,  1.0835  ,
        0.2444  , -1.1877  , -0.60203 , -0.068315,  0.66256 ,  0.45336 ,
       -1.0178  ,  0.68267 , -0.20788 , -0.73393 ,  1.2597  ,  0.15425 ,
       -0.93256 , -0.15025 ])

In [24]:
glove['neural']

array([ 0.92803 ,  0.29096 ,  0.67837 ,  1.0444  , -0.72551 ,  2.1995  ,
        0.88767 , -0.94782 ,  0.67426 ,  0.24908 ,  0.95722 ,  0.18122 ,
        0.064263,  0.64323 , -1.6301  ,  0.94972 , -0.7367  ,  0.17345 ,
        0.67638 ,  0.10026 , -0.033782, -0.76971 ,  0.40519 , -0.099516,
        0.79654 ,  0.1103  , -0.076053, -0.090434,  0.015021, -1.137   ,
        1.6803  , -0.34424 ,  0.77538 , -1.8718  , -0.17148 ,  0.31956 ,
        0.093062,  0.004996,  0.25716 ,  0.52207 , -0.52548 , -0.93144 ,
       -1.0553  ,  1.4401  ,  0.30807 , -0.84872 ,  1.9986  ,  0.10788 ,
       -0.23633 , -0.17978 ])

### How does the system know that these words are similar?

- Cosine Similarity

In [25]:
from sklearn.metrics.pairwise import cosine_similarity

In [26]:
cosine_similarity(glove['cat'].reshape(1,-1), glove['dog'].reshape(1,-1))

array([[0.92180053]])

In [27]:
cosine_similarity(glove['cat'].reshape(1,-1), glove['piano'].reshape(1,-1))

array([[0.19825255]])

In [28]:
cosine_similarity(glove['king'].reshape(1,-1), glove['queen'].reshape(1,-1))

array([[0.7839043]])
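
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures the angle between them and ignores magnitude. A minimal pure-Python sketch on made-up 2D vectors (the helper name `cosine` is chosen for this example):

```python
import math


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine([3.0, 4.0], [6.0, 8.0]))    # 1.0  (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))    # 0.0  (orthogonal)
print(cosine([3.0, 4.0], [-3.0, -4.0]))  # -1.0 (opposite direction)
```

Values close to 1 mean the word vectors point the same way, which is why 'cat' and 'dog' score much higher than 'cat' and 'piano'.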

## Words in 2D Embedding Space

<img src='v44.png' />

# Pipeline for text data


In [None]:
1. Text Normalization
2. Tokenization
3. Tokens to IDs

In [None]:
"Celebration         of Worldcup winning !!"

Text Normalization
- lowercasing
- punctuation removal (?, !)
- trim whitespaces
- strip accents
- stemming
- lemmatization
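
The first four normalization steps can be sketched with the standard library. Stemming and lemmatization are left out here because they need a library such as NLTK, so the output keeps the full word forms. The function name `normalize` is chosen for this example.

```python
import string
import unicodedata


def normalize(text):
    text = text.lower()                                   # lowercasing
    # strip accents: decompose characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # trim and collapse whitespace
    return " ".join(text.split())


print(normalize("Celebration         of Worldcup winning !!"))
# celebration of worldcup winning
print(normalize("  Café   déjà vu?  "))
# cafe deja vu
```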

In [None]:
"celebrate of worldcup win"

In [None]:
Tokenization
- word tokenization
- subword tokenization
- character tokenization
- sentence tokenization
- regular expression tokenization


["celebrate", 'of', 'worldcup', 'win']

In [None]:
Tokens to IDs
- words to numerical values

- lookup table -> animal - 18, car - 128
- hashing - a hash function maps each word to an ID; the IDs look random, but the same word always gets the same ID

[35, 21, 4, 18]
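
Both approaches can be sketched in a few lines. The IDs below come from this toy example and will not match the illustrative numbers above. The bucket count of 1000 and the helper name `hash_id` are assumptions made for the sketch.

```python
import hashlib

tokens = ["celebrate", "of", "worldcup", "win"]

# Lookup table: give each new token the next free integer ID.
vocab = {}
for token in tokens:
    vocab.setdefault(token, len(vocab))
lookup_ids = [vocab[token] for token in tokens]
print(lookup_ids)  # [0, 1, 2, 3]


# Hashing: no table to store, but two different words can collide.
# hashlib is used because Python's built-in hash() of a string
# changes between runs unless PYTHONHASHSEED is fixed.
def hash_id(token, num_buckets=1000):
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


hashed_ids = [hash_id(token) for token in tokens]
print(hashed_ids)
```

A lookup table needs the vocabulary to be known in advance, while hashing handles unseen words at the cost of occasional collisions.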

In [None]:
ML

- feature engineering

In [None]:
DL

- no need to do feature engineering manually

In [None]:
NLP

close the gap between human language and machine language