> # `Tokenization`

> ## `Using NLTK`

In [5]:
import nltk 
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Download the 'punkt_tab' resource used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
corpus = """Welcome to Mohammed Zahran's NLP lessons.
Please contact me via LinkedIn if you have any questions.
"""

In [7]:
sent_tokenize(corpus)

["Welcome to Mohammed Zahran's NLP lessons.",
 'Please contact me via LinkedIn if you have any questions.']

In [8]:
word_tokenize(corpus)

['Welcome',
 'to',
 'Mohammed',
 'Zahran',
 "'s",
 'NLP',
 'lessons',
 '.',
 'Please',
 'contact',
 'me',
 'via',
 'LinkedIn',
 'if',
 'you',
 'have',
 'any',
 'questions',
 '.']
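
> For contrast, here is a minimal sketch of what plain whitespace splitting does with the same text. It uses only the built-in `str.split`, and the variable name `naive_tokens` is just for illustration. Punctuation and the possessive stay attached to the neighbouring words, which is the gap `word_tokenize` closes.

```python
corpus = """Welcome to Mohammed Zahran's NLP lessons.
Please contact me via LinkedIn if you have any questions.
"""

# Plain whitespace splitting: no tokenizer library involved.
naive_tokens = corpus.split()

# Punctuation and "'s" remain glued to the words next to them.
print(naive_tokens[3:6])
print(len(naive_tokens))
```

> The split yields 16 items, while `word_tokenize` above returns 19 because it separates `'s` and the two full stops.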

> ## `Using spaCy`

In [9]:
import spacy

In [11]:
nlp = spacy.load("en_core_web_md")

In [12]:
corpus = """Welcome to Mohammed Zahran's NLP lessons.
Please contact me via LinkedIn if you have any questions.
"""

In [13]:
doc = nlp(corpus)

sentences = [sent for sent in doc.sents]
sentences

[Welcome to Mohammed Zahran's NLP lessons.,
 Please contact me via LinkedIn if you have any questions.]

In [14]:
doc = nlp(corpus)

words = [token for token in doc]
words

[Welcome,
 to,
 Mohammed,
 Zahran,
 's,
 NLP,
 lessons,
 .,
 ,
 Please,
 contact,
 me,
 via,
 LinkedIn,
 if,
 you,
 have,
 any,
 questions,
 .,
 ]

-------------

> ## `Stemming`

In [15]:
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [16]:
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
lancaster_stemmer = LancasterStemmer()

words = ['cared','university','fairly','easily','singing','rocks',
       'sings','sung','singer','sportingly','congratulations','living']

porter_stems = [porter_stemmer.stem(word) for word in words]

snowball_stems = [snowball_stemmer.stem(word) for word in words]

lancaster_stems = [lancaster_stemmer.stem(word) for word in words]


print("Original words:", words)
print("Porter Stemmer:", porter_stems)
print("Snowball Stemmer:", snowball_stems)
print("Lancaster Stemmer:", lancaster_stems)


Original words: ['cared', 'university', 'fairly', 'easily', 'singing', 'rocks', 'sings', 'sung', 'singer', 'sportingly', 'congratulations', 'living']
Porter Stemmer: ['care', 'univers', 'fairli', 'easili', 'sing', 'rock', 'sing', 'sung', 'singer', 'sportingli', 'congratul', 'live']
Snowball Stemmer: ['care', 'univers', 'fair', 'easili', 'sing', 'rock', 'sing', 'sung', 'singer', 'sport', 'congratul', 'live']
Lancaster Stemmer: ['car', 'univers', 'fair', 'easy', 'sing', 'rock', 'sing', 'sung', 'sing', 'sport', 'congrat', 'liv']
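
> One way to read the output above is to group words by the stem they share. The sketch below does that for the `sing` family with the Porter stemmer only; the `groups` dictionary is an illustrative name, not part of NLTK.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['singing', 'sings', 'sung', 'singer']

# Collect the words that collapse to the same stem.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

print(dict(groups))
```

> Regular inflections (`singing`, `sings`) end up together, while the irregular form `sung` and the derived noun `singer` keep their own stems. In the output above, the more aggressive Lancaster stemmer also reduces `singer` to `sing`.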


------------

> ## `Lemmatizer`
> Lemmatization is similar to stemming, but its output, the lemma, is a valid dictionary word (a root word) rather than a truncated root stem. The lemma carries the same meaning as the original word.

> NLTK provides the `WordNetLemmatizer` class, a thin wrapper around the WordNet corpus. Its `lemmatize` method treats every word as a noun unless a `pos` argument is given, which is why `running` and `better` come back unchanged below.

In [None]:
from nltk.stem import WordNetLemmatizer

In [18]:
lemmatizer = WordNetLemmatizer()
words = ['cats', 'running', 'better', 'flies','congratulations','loves']

for word in words:
    print(f"{word}: {lemmatizer.lemmatize(word)}")

cats: cat
running: running
better: better
flies: fly
congratulations: congratulation
loves: love


------------

> ## `Stop words`

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from pprint import pprint

In [20]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [23]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
                the world have come and invaded us, captured our lands, conquered our minds.
                From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
                the French, the Dutch, all of them came and looted us, took over what was ours.
                Yet we have not done this to any other nation. We have not conquered anyone.
                We have not grabbed their land, their culture,
                their history and tried to enforce our way of life on them.
                Why? Because we respect the freedom of others.That is why my
                first vision is that of freedom. I believe that India got its first vision of
                this in 1857, when we started the War of Independence. It is this freedom that
                we must protect and nurture and build on. If we are not free, no one will respect us.
                My second vision for India’s development. For fifty years we have been a developing nation.
                It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
                in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
                Our achievements are being globally recognised today. Yet we lack the self-confidence to
                see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
                I have a third vision. India must stand up to the world. Because I believe that unless India
                stands up to the world, no one will respect us. Only strength respects strength. We must be
                strong not only as a military power but also as an economic power. Both must go hand-in-hand.
                My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
                space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
                I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
                I see four milestones in my career"""

# Normalize case, then split the text into word tokens
paragraph = paragraph.lower()
words = word_tokenize(paragraph)

# Keep alphanumeric tokens that are not English stop words
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

# Lemmatize each word (as a noun, the default) and rebuild the text
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
processed_paragraph = ' '.join(lemmatized_words)

print(processed_paragraph)


three vision india 3000 year history people world come invaded u captured land conquered mind alexander onwards greek turk mogul portuguese british french dutch came looted u took yet done nation conquered anyone grabbed land culture history tried enforce way life respect freedom first vision freedom believe india got first vision 1857 started war independence freedom must protect nurture build free one respect u second vision india development fifty year developing nation time see developed nation among top 5 nation world term gdp 10 percent growth rate area poverty level falling achievement globally recognised today yet lack see developed nation incorrect third vision india must stand world believe unless india stand world one respect u strength respect strength must strong military power also economic power must go good fortune worked three great mind vikram sarabhai dept space professor satish dhawan succeeded brahm prakash father nuclear material lucky worked three closely conside
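
> The output above has lost `self-confidence`, `self-reliant`, `self-assured` and `hand-in-hand`, because `str.isalnum()` returns `False` for any token containing a hyphen. The sketch below compares that filter with a looser one on a small hand-written token list. The `loose` variant is a hypothetical alternative, not what the cell above does.

```python
# Hand-written sample tokens, similar to what word_tokenize returns.
tokens = ["self-confidence", "hand-in-hand", "gdp", "3000", ",", "."]

# The filter used above: keeps only purely alphanumeric tokens.
strict = [t for t in tokens if t.isalnum()]

# Looser alternative: keep any token that contains a letter or digit.
loose = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(strict)
print(loose)
```

> Which filter is appropriate depends on whether hyphenated compounds matter for the downstream task.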

---------------

> ## `POS (Part of Speech)`

>    1- `token.pos_`: the coarse-grained part of speech (verb, noun, ...).

>    2- `spacy.explain(token.pos_)`: takes the part-of-speech tag as an argument and returns a brief explanation of what it represents.

>    3- `token.tag_`: the fine-grained tag, which also encodes details such as verb tense.

>    4- `spacy.explain(token.tag_)`: returns a brief explanation of what the tag represents.

In [25]:
import spacy
nlp = spacy.load("en_core_web_md")

In [26]:
doc = nlp("Elon flew to mars yesterday. He carried biryani masala with him")

for token in doc:
    print(token, "||", token.pos_, "||", spacy.explain(token.pos_))

Elon || PROPN || proper noun
flew || VERB || verb
to || ADP || adposition
mars || NOUN || noun
yesterday || NOUN || noun
. || PUNCT || punctuation
He || PRON || pronoun
carried || VERB || verb
biryani || ADJ || adjective
masala || NOUN || noun
with || ADP || adposition
him || PRON || pronoun


In [27]:
doc = nlp("Wow! Dr. Strange made 265 million $ on the very first day")

for token in doc:
    print(token," | ", token.pos_, " | ", spacy.explain(token.pos_), " | ", token.tag_, " | ", spacy.explain(token.tag_))

Wow  |  INTJ  |  interjection  |  UH  |  interjection
!  |  PUNCT  |  punctuation  |  .  |  punctuation mark, sentence closer
Dr.  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
Strange  |  PROPN  |  proper noun  |  NNP  |  noun, proper singular
made  |  VERB  |  verb  |  VBD  |  verb, past tense
265  |  NUM  |  numeral  |  CD  |  cardinal number
million  |  NUM  |  numeral  |  CD  |  cardinal number
$  |  NOUN  |  noun  |  NN  |  noun, singular or mass
on  |  ADP  |  adposition  |  IN  |  conjunction, subordinating or preposition
the  |  DET  |  determiner  |  DT  |  determiner
very  |  ADV  |  adverb  |  RB  |  adverb
first  |  ADJ  |  adjective  |  JJ  |  adjective (English), other noun-modifier (Chinese)
day  |  NOUN  |  noun  |  NN  |  noun, singular or mass


> `Using POS as a filter to remove all SPACE and PUNCT tokens from the text`

In [28]:
nlp = spacy.load("en_core_web_md")

In [29]:
earnings_text = """Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)
filtered_tokens = []
for token in doc:
    if token.pos_ not in ["SPACE", "PUNCT"]:
        filtered_tokens.append(token)

In [30]:
pprint(" ".join(str(token) for token in filtered_tokens))

('Microsoft Corp. today announced the following results for the quarter ended '
 'December 31 2021 as compared to the corresponding period of last fiscal year '
 'Revenue was $ 51.7 billion and increased 20 % Operating income was $ 22.2 '
 'billion and increased 24 % Net income was $ 18.8 billion and increased 21 % '
 'Diluted earnings per share was $ 2.48 and increased 22 % Digital technology '
 'is the most malleable resource at the world ’s disposal to overcome '
 'constraints and reimagine everyday work and life said Satya Nadella chairman '
 'and chief executive officer of Microsoft As tech as a percentage of global '
 'GDP continues to increase we are innovating and investing across diverse and '
 'growing markets with a common underlying technology stack and an operating '
 'model that reinforces a common strategy culture and sense of purpose Solid '
 'commercial execution represented by strong bookings growth driven by long '
 'term Azure commitments increased Microsoft Cloud r

> `Get the count of each POS tag`

In [31]:
spacy.attrs.POS

74

In [32]:
earnings_text = """Microsoft Corp. today announced the following results for the quarter ended December 31, 2021, as compared to the corresponding period of last fiscal year:

·         Revenue was $51.7 billion and increased 20%
·         Operating income was $22.2 billion and increased 24%
·         Net income was $18.8 billion and increased 21%
·         Diluted earnings per share was $2.48 and increased 22%
“Digital technology is the most malleable resource at the world’s disposal to overcome constraints and reimagine everyday work and life,” said Satya Nadella, chairman and chief executive officer of Microsoft. “As tech as a percentage of global GDP continues to increase, we are innovating and investing across diverse and growing markets, with a common underlying technology stack and an operating model that reinforces a common strategy, culture, and sense of purpose.”
“Solid commercial execution, represented by strong bookings growth driven by long-term Azure commitments, increased Microsoft Cloud revenue to $22.1 billion, up 32% year over year” said Amy Hood, executive vice president and chief financial officer of Microsoft."""

doc = nlp(earnings_text)
count = doc.count_by(spacy.attrs.POS)
count

{96: 14,
 92: 46,
 100: 22,
 90: 9,
 85: 17,
 93: 16,
 97: 27,
 98: 1,
 84: 21,
 103: 10,
 87: 6,
 99: 5,
 89: 12,
 86: 2,
 94: 3,
 95: 2}

In [33]:
doc.vocab[92].text

'NOUN'

In [34]:
pos_counts = {doc.vocab[pos].text: count for pos, count in count.items()}
pos_counts

{'PROPN': 14,
 'NOUN': 46,
 'VERB': 22,
 'DET': 9,
 'ADP': 17,
 'NUM': 16,
 'PUNCT': 27,
 'SCONJ': 1,
 'ADJ': 21,
 'SPACE': 10,
 'AUX': 6,
 'SYM': 5,
 'CCONJ': 12,
 'ADV': 2,
 'PART': 3,
 'PRON': 2}
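
> The same tally can also be sketched with `collections.Counter` over the string labels, which avoids the integer IDs. The helper `count_pos` and the stand-in tokens are hypothetical: the stand-ins only mimic the `.pos_` attribute so that the snippet runs without a spaCy model, and their labels are copied from the "Elon flew to mars" output earlier.

```python
from collections import Counter
from types import SimpleNamespace

def count_pos(tokens):
    """Tally coarse POS labels for any iterable of objects exposing `.pos_`."""
    return Counter(token.pos_ for token in tokens)

# Stand-in tokens (hypothetical), labels taken from the earlier output.
sample = [
    SimpleNamespace(text=text, pos_=pos)
    for text, pos in [
        ("Elon", "PROPN"), ("flew", "VERB"), ("to", "ADP"),
        ("mars", "NOUN"), ("yesterday", "NOUN"), (".", "PUNCT"),
    ]
]

sample_counts = count_pos(sample)
print(sample_counts.most_common(1))
```

> With a real document, `count_pos(doc)` should work the same way, since spaCy tokens expose `pos_`.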

--------

> ## `POS tagging with NLTK instead of spaCy`

In [35]:
# CC coordinating conjunction
# CD cardinal number
# DT determiner
# EX existential there (like: “there is” … think of it like “there exists”)
# FW foreign word
# IN preposition/subordinating conjunction
# JJ adjective – ‘big’
# JJR adjective, comparative – ‘bigger’
# JJS adjective, superlative – ‘biggest’
# LS list marker – 1)
# MD modal – could, will
# NN noun, singular – ‘desk’
# NNS noun plural – ‘desks’
# NNP proper noun, singular – ‘Harrison’
# NNPS proper noun, plural – ‘Americans’
# PDT predeterminer – ‘all the kids’
# POS possessive ending – parent’s
# PRP personal pronoun –  I, he, she
# PRP$ possessive pronoun – my, his, hers
# RB adverb – very, silently,
# RBR adverb, comparative – better
# RBS adverb, superlative – best
# RP particle – give up
# TO to – go ‘to’ the store
# UH interjection – errrrrrrrm
# VB verb, base form – take
# VBD verb, past tense – took
# VBG verb, gerund/present participle – taking
# VBN verb, past participle – taken
# VBP verb, non-3rd person sing. present – take
# VBZ verb, 3rd person sing. present – takes
# WDT wh-determiner – which
# WP wh-pronoun – who, what
# WP$ possessive wh-pronoun, eg- whose
# WRB wh-adverb, eg- where, when

In [38]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Digital\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.


True

In [39]:
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text into individual words
words = nltk.word_tokenize(text)

# Perform POS tagging on the tokenized words
pos_tags = nltk.pos_tag(words)

# Print the POS tags for each word
for word, tag in pos_tags:
    print(word, "->", tag)

The -> DT
quick -> JJ
brown -> NN
fox -> NN
jumps -> VBZ
over -> IN
the -> DT
lazy -> JJ
dog -> NN
. -> .
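
> These Penn Treebank tags can feed the lemmatizer from the earlier section, whose `lemmatize` method accepts a `pos` argument such as `'n'`, `'v'`, `'a'` or `'r'`. The sketch below shows one common way to map between the two tag sets. The helper name `penn_to_wordnet` and the noun fallback are assumptions made for illustration, and the tagged pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # fall back to noun, the lemmatizer's default

# (word, tag) pairs taken from the nltk.pos_tag output above.
tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ")]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

> A lemmatization call would then look like `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.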


-------

> ## `Named Entity Recognition`

In [40]:
sentence = "The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel, whose company specialized in building metal frameworks and structures. Meta is a social media and social networking service owned by American technology conglomerate Meta"

In [42]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_md")

In [43]:
doc1 = nlp(sentence)
for ent in doc1.ents:
    print(ent , "->", ent.label_, "->", spacy.explain(ent.label_))

The Eiffel Tower -> FAC -> Buildings, airports, highways, bridges, etc.
1887 to 1889 -> DATE -> Absolute or relative dates or periods
Gustave Eiffel -> PERSON -> People, including fictional
Meta -> ORG -> Companies, agencies, institutions, etc.
American -> NORP -> Nationalities or religious or political groups
Meta -> ORG -> Companies, agencies, institutions, etc.
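
> Entities are often easier to work with once grouped by label. The sketch below shows one possible grouping. The helper `group_entities` is hypothetical, and the stand-in entities only mimic the `.text` and `.label_` attributes so that the snippet runs without a spaCy model. Their values are copied from the output above.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_entities(ents):
    """Group entity texts by label for any iterable exposing `.text` and `.label_`."""
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Stand-in entities (hypothetical), values taken from the output above.
sample_ents = [
    SimpleNamespace(text=text, label_=label)
    for text, label in [
        ("The Eiffel Tower", "FAC"), ("1887 to 1889", "DATE"),
        ("Gustave Eiffel", "PERSON"), ("Meta", "ORG"),
        ("American", "NORP"), ("Meta", "ORG"),
    ]
]

by_label = group_entities(sample_ents)
print(by_label["ORG"])
```

> With a real document, `group_entities(doc1.ents)` should behave the same way.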


In [44]:
# Render the entities as highlighted spans (shown inline in a notebook)
displacy.render(doc1, style="ent")

-----------------

> ## `Great Job`

-----------