<i>
Modified from NLP lecture series by [Dr. Milaan Parmar](https://www.linkedin.com/in/milaanparmar/)
</i>

# 01 Tokenization in Python

- **Tokenization is the process of splitting a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords.**
- **Depending on the task, a token can also be a keyword, a symbol, a phrase, or even a whole sentence. Some characters, such as punctuation marks, may be discarded during tokenization.**
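
To make the three granularities concrete, the sketch below splits one short string at the word level and the character level using built-in string operations, and at a subword level using a tiny hand-written vocabulary. The `toy_vocab` set and the greedy longest-match helper `subword_split` are assumptions made for illustration, not how any particular library tokenizes.

```python
# Word-level tokens: split on whitespace
text = "unhappy cats"
word_tokens = text.split()

# Character-level tokens: every non-space character is a token
char_tokens = [ch for ch in text if not ch.isspace()]

# Subword-level tokens: greedy longest-match against a toy vocabulary.
# toy_vocab is a made-up example, not a real model vocabulary.
toy_vocab = {"un", "happy", "cat", "s"}

def subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_split(w, toy_vocab)]

print(word_tokens)     # word level
print(char_tokens)     # character level
print(subword_tokens)  # subword level
```

The same text yields 2 word tokens, 4 subword tokens, and 11 character tokens, which shows how the choice of unit changes the sequence length a downstream model has to handle.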

In [None]:
# Split on runs of non-word characters
# \W matches any character that is not a letter, digit, or underscore

import re
text = "I\'ll always be there with you forever in your heart.!"
words = re.split(r'\W+', text)
print(words[:100])

['I', 'll', 'always', 'be', 'there', 'with', 'you', 'forever', 'in', 'your', 'heart', '']


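Here `\W+` treats the apostrophe and the trailing **`.!`** as separators, so `I'll` is split into `I` and `ll`, and an empty string is left at the end of the list.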
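
An alternative that avoids the split contraction and the trailing empty string visible in the output above is to match the tokens themselves rather than split on separators. The pattern below is one possible choice, not the only one: it treats a run of word characters, optionally followed by an apostrophe and more word characters, as a single token.

```python
import re

text = "I'll always be there with you forever in your heart.!"

# Match tokens directly instead of splitting on delimiters:
#   \w+         one or more word characters
#   (?:'\w+)?   optionally an apostrophe followed by more word characters
tokens = re.findall(r"\w+(?:'\w+)?", text)
print(tokens)
```

Because `re.findall` only returns what the pattern matches, punctuation between or after words never produces empty entries.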

**Remove punctuation and split into words**

In [None]:
import string
import re
# string.punctuation holds the ASCII punctuation characters

# split into words by white space
words = text.split()
print(words)

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
print(re_punc)

# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

["I'll", 'always', 'be', 'there', 'with', 'you', 'forever', 'in', 'your', 'heart.!']
re.compile('[!"\\#\\$%\\&\'\\(\\)\\*\\+,\\-\\./:;<=>\\?@\\[\\\\\\]\\^_`\\{\\|\\}\\~]')
['Ill', 'always', 'be', 'there', 'with', 'you', 'forever', 'in', 'your', 'heart']


Python’s `re.compile()` function compiles a regular expression pattern, given as a string, into a reusable regex pattern object.
https://pynative.com/python-regex-compile/
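
As a small illustration of why compiling is useful, the sketch below compiles the punctuation pattern once and reuses the resulting pattern object for several calls. The sample strings are made up for the example.

```python
import re
import string

# Compile once: the pattern object can then be reused many times
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

samples = ["Hello, world!", "heart.!", "no-punctuation-here?"]

# Pattern objects expose the same operations as the module-level functions
cleaned = [re_punc.sub('', s) for s in samples]        # remove every match
has_punc = [bool(re_punc.search(s)) for s in samples]  # test for any match

print(cleaned)
print(has_punc)
```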

In [None]:
# Normalizing Case

# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

["i'll", 'always', 'be', 'there', 'with', 'you', 'forever', 'in', 'your', 'heart.!']


## Tokenization with nltk

In [None]:
!pip install nltk



In [None]:
# Tokenization of paragraphs/sentences
import nltk
# nltk.download("popular") # use this to download all popular libraries in nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""


In [None]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)

In [None]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over\n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture,\n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my\n               first vision is that of freedom.',
 'I believe that India got its first vision of\n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s development.'

In [None]:
words

['I',
 'have',
 'three',
 'visions',
 'for',
 'India',
 '.',
 'In',
 '3000',
 'years',
 'of',
 'our',
 'history',
 ',',
 'people',
 'from',
 'all',
 'over',
 'the',
 'world',
 'have',
 'come',
 'and',
 'invaded',
 'us',
 ',',
 'captured',
 'our',
 'lands',
 ',',
 'conquered',
 'our',
 'minds',
 '.',
 'From',
 'Alexander',
 'onwards',
 ',',
 'the',
 'Greeks',
 ',',
 'the',
 'Turks',
 ',',
 'the',
 'Moguls',
 ',',
 'the',
 'Portuguese',
 ',',
 'the',
 'British',
 ',',
 'the',
 'French',
 ',',
 'the',
 'Dutch',
 ',',
 'all',
 'of',
 'them',
 'came',
 'and',
 'looted',
 'us',
 ',',
 'took',
 'over',
 'what',
 'was',
 'ours',
 '.',
 'Yet',
 'we',
 'have',
 'not',
 'done',
 'this',
 'to',
 'any',
 'other',
 'nation',
 '.',
 'We',
 'have',
 'not',
 'conquered',
 'anyone',
 '.',
 'We',
 'have',
 'not',
 'grabbed',
 'their',
 'land',
 ',',
 'their',
 'culture',
 ',',
 'their',
 'history',
 'and',
 'tried',
 'to',
 'enforce',
 'our',
 'way',
 'of',
 'life',
 'on',
 'them',
 '.',
 'Why',
 '?',
 '

# 02 Stemming and Lemmatization

**Stemming** is the process of reducing a word to its stem by stripping affixes (prefixes and suffixes) with heuristic rules. The resulting stem is not necessarily a dictionary word. For example, “likes”, “liked”, “likely” and “liking” are all reduced to “like” after stemming.

1. PorterStemmer
2. SnowballStemmer

**Lemmatization** uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the **lemma**.
- Stemming is faster because it cuts words with fixed rules and ignores context, while lemmatization is slower because it consults a vocabulary and, ideally, the word's part of speech.
- For example, the Porter stemmer converts "history" to "histori", while lemmatization keeps "history".
- A stem may or may not be a meaningful word, while a lemma is always a valid dictionary form.
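
The contrast can be illustrated without any library. The sketch below is a deliberately naive toy: `toy_stem` strips a few hard-coded suffixes and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Both functions, the suffix list, and the dictionary are assumptions made for illustration and are far simpler than the Porter algorithm or WordNet.

```python
# Toy suffix list and lemma dictionary (illustrative only)
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_DICT = {"ran": "run", "running": "run", "runs": "run",
              "better": "good", "mice": "mouse"}

def toy_stem(word):
    """Strip the first matching suffix; no vocabulary is consulted."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["running", "ran", "mice", "better"]:
    print(w, "-->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function turns "running" into "runn", which is not a word, and cannot relate "ran" to "run". The lookup handles both, but only because the vocabulary already contains them.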

## **Stemming**

In [None]:
import nltk

In [None]:
# PorterStemmer
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()
words = ['run','runner','running','ran','runs','easily','fairly']

In [None]:
for word in words:
    print(word+' --> '+p_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fairli


In [None]:
# SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [None]:
words = ['run','runner','running','ran','runs','easily','fairly']
# words = ['generous','generation','generously','generate']

In [None]:
for word in words:
    print(word+' --> '+s_stemmer.stem(word))

run --> run
runner --> runner
running --> run
ran --> ran
runs --> run
easily --> easili
fairly --> fair


## **Lemmatization**

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Stop words are very common words such as "of", "in", "the", "a"
# (Thai examples: ครับ, ค่ะ)

In [None]:
words = ['run','runner','running','ran','runs','easily','fairly']

lemmatizer = WordNetLemmatizer()
for word in words:
    print(word+' --> '+lemmatizer.lemmatize(word))

run --> run
runner --> runner
running --> running
ran --> ran
runs --> run
easily --> easily
fairly --> fairly


In [None]:
print(lemmatizer.lemmatize("run","v"))
print(lemmatizer.lemmatize("runner","n"))
print(lemmatizer.lemmatize("running","v"))
print(lemmatizer.lemmatize("ran","v"))
print(lemmatizer.lemmatize("runs","v"))
print(lemmatizer.lemmatize("easily","r"))
print(lemmatizer.lemmatize("fairly","r"))

run
runner
run
run
run
easily
fairly


In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))

rocks : rock
corpora : corpus
better : good


In [None]:
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
wnl = WordNetLemmatizer()
sent = 'These sentences involves some horsing around'

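for word, tag in pos_tag(word_tokenize(sent)):
    # Map the first letter of the Penn Treebank tag to a WordNet POS
    # (J* -> adjective 'a', R* -> adverb 'r', N* -> noun 'n', V* -> verb 'v')
    wntag = {'j': 'a', 'r': 'r', 'n': 'n', 'v': 'v'}.get(tag[0].lower())
    lemma = wnl.lemmatize(word, wntag) if wntag else word
    print(lemma)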

These
sentence
involve
some
horsing
around


# 03 spaCy Library

spaCy is a free, open-source library for Natural Language Processing in Python. It features named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, word vectors and more.

https://spacy.io/

In [None]:
!pip install -U spacy
# Download the small English model that is loaded below
!python -m spacy download en_core_web_sm



In [None]:
import spacy

# Load the spaCy English model
nlp = spacy.load('en_core_web_sm')

# Define a sample text
text = "I am meeting him tomorrow at the meeting."

# Process the text using spaCy
doc = nlp(text)

# Extract lemmatized tokens
lemmatized_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence
lemmatized_text = ' '.join(lemmatized_tokens)

# Print the original and lemmatized text
print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)

Original Text: I am meeting him tomorrow at the meeting.
Lemmatized Text: I be meet he tomorrow at the meeting .


In [None]:
var1 = nlp(u"John Adam is one the researcher who invent the direction of way towards success!")

# token.lemma is the integer hash of the lemma; token.lemma_ is its string form
for token in var1:
    print(token.text, '\t', token.pos_, '\t', token.lemma, '\t', token.lemma_)

John 	 PROPN 	 11174346320140919546 	 John
Adam 	 PROPN 	 14264057329400597350 	 Adam
is 	 AUX 	 10382539506755952630 	 be
one 	 NUM 	 17454115351911680600 	 one
the 	 DET 	 7425985699627899538 	 the
researcher 	 NOUN 	 1317581537614213870 	 researcher
who 	 PRON 	 3876862883474502309 	 who
invent 	 VERB 	 5373681334090504585 	 invent
the 	 DET 	 7425985699627899538 	 the
direction 	 NOUN 	 895834437038626927 	 direction
of 	 ADP 	 886050111519832510 	 of
way 	 NOUN 	 6878210874361030284 	 way
towards 	 ADP 	 9315050841437086371 	 towards
success 	 NOUN 	 16089821935113899987 	 success
! 	 PUNCT 	 17494803046312582752 	 !


In [None]:
paragraph = """Thank you all so very much. Thank you to the Academy.
               Thank you to all of you in this room. I have to congratulate
               the other incredible nominees this year. The Revenant was
               the product of the tireless efforts of an unbelievable cast
               support leaders around the world who do not speak for the
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and
               billions of underprivileged people out there who would be
               most affected by this. For our children’s children, and
               for those people out there whose voices have been drowned
               out by the politics of greed. I thank you all for this
               amazing award tonight. Let us not take this planet for
               granted. I do not take tonight for granted. Thank you so very much."""

In [None]:
# Sentence segmentation
doc = nlp(paragraph)
for sent in doc.sents:
    print(sent)

Thank you all so very much.
Thank you to the Academy.
               
Thank you to all of you in this room.
I have to congratulate
               the other incredible nominees this year.
The Revenant was
               the product of the tireless efforts of an unbelievable cast
               support leaders around the world who do not speak for the
               big polluters, but who speak for all of humanity, for the
               indigenous people of the world, for the billions and
               billions of underprivileged people out there who would be
               most affected by this.
For our children’s children, and
               for those people out there whose voices have been drowned
               out by the politics of greed.
I thank you all for this
               amazing award tonight.
Let us not take this planet for
               granted.
I do not take tonight for granted.
Thank you so very much.


In [None]:
# Remove stop words and punctuation
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]

# Print the text excluding stop words
print(filtered_tokens)

['Thank', 'Thank', 'Academy', '\n               ', 'Thank', 'room', 'congratulate', '\n               ', 'incredible', 'nominees', 'year', 'Revenant', '\n               ', 'product', 'tireless', 'efforts', 'unbelievable', 'cast', '\n               ', 'support', 'leaders', 'world', 'speak', '\n               ', 'big', 'polluters', 'speak', 'humanity', '\n               ', 'indigenous', 'people', 'world', 'billions', '\n               ', 'billions', 'underprivileged', 'people', '\n               ', 'affected', 'children', 'children', '\n               ', 'people', 'voices', 'drowned', '\n               ', 'politics', 'greed', 'thank', '\n               ', 'amazing', 'award', 'tonight', 'Let', 'planet', '\n               ', 'granted', 'tonight', 'granted', 'Thank']


Testing spaCy on the paragraph used in the NLTK section

In [None]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""


In [None]:
# Sentence segmentation
doc = nlp(paragraph)
for sent in doc.sents:
    print(sent)

I have three visions for India.
In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               
From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               
Yet we have not done this to any other nation.
We have not conquered anyone.
               
We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               
Why?
Because we respect the freedom of others.
That is why my
               first vision is that of freedom.
I believe that India got its first vision of
               this in 1857, when we started the War of Independence.
It is this freedom that
               we must protect and nurture and build on.
If we are not free, no one will respect us.
               
My second vis

In [None]:
# Remove stop words and punctuation
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]

# Print the text excluding stop words
print(filtered_tokens)

['visions', 'India', '3000', 'years', 'history', 'people', '\n               ', 'world', 'come', 'invaded', 'captured', 'lands', 'conquered', 'minds', '\n               ', 'Alexander', 'onwards', 'Greeks', 'Turks', 'Moguls', 'Portuguese', 'British', '\n               ', 'French', 'Dutch', 'came', 'looted', 'took', '\n               ', 'nation', 'conquered', '\n               ', 'grabbed', 'land', 'culture', '\n               ', 'history', 'tried', 'enforce', 'way', 'life', '\n               ', 'respect', 'freedom', '\n               ', 'vision', 'freedom', 'believe', 'India', 'got', 'vision', '\n               ', '1857', 'started', 'War', 'Independence', 'freedom', '\n               ', 'protect', 'nurture', 'build', 'free', 'respect', '\n               ', 'second', 'vision', 'India', 'development', 'years', 'developing', 'nation', '\n               ', 'time', 'developed', 'nation', '5', 'nations', 'world', '\n               ', 'terms', 'GDP', '10', 'percent', 'growth', 'rate', 'areas',