### Understanding NLP using NLTK
#### 2 parts of NLP
    - NLU (Natural language understanding)
    - NLG (Natural language generation)
#### Why is NLU hard?
    - Ambiguity. There are 3 types of ambiguity:
        - lexical ambiguity: The same word can have different meanings depending on the context.
            - eg: "The lady is looking for a match." (match could mean a matchstick, a game, or a partner)
        - syntactic ambiguity: The structure of a sentence can allow more than one reading.
            - eg: "I saw the man with the telescope." (Did I use a telescope to see the man, or did I see a man who was holding a telescope?)
        - referential ambiguity: When a sentence mentions two nouns, a later pronoun (words like he, she, it) can refer to either of them, and the listener has to guess which one the speaker meant.
            - eg: "Father met his friend at a shop. He asked for a loan of 10,000." (who asked for the loan?)

#### Installation and setup of NLTK
- install NLTK with `py -m pip install nltk`. This installs the base nltk package, but not the data (corpora, models) that most NLP tasks need.
- That data is downloaded separately, using NLTK's downloader:

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

- Executing the cell above opens a window listing the nltk data packages, each of which can be downloaded individually.
- To keep things simple, let us select the "all" packages option and click download.


#### Tokenization
- 3 basic steps involved in tokenization:
    - strip: separate the individual words from a sentence.
    - understand: understand the importance of each word with respect to the sentence.
    - describe: produce a structural description of the input sentence.
    


- Analyzing the corpus
    - A corpus is a collection of texts.
    - NLTK ships with many corpora (the plural of corpus), and as developers we can pick whichever corpus suits the NLP task at hand.


In [2]:
import os
import nltk.corpus

#all the corpora available to nltk
print(os.listdir(nltk.data.find("corpora")))

#the words contained in a corpus called brown
print(nltk.corpus.brown.words())

#some corpora are split into several files
print(nltk.corpus.gutenberg.fileids())

['abc', 'abc.zip', 'alpino', 'alpino.zip', 'bcp47.zip', 'biocreative_ppi', 'biocreative_ppi.zip', 'brown', 'brown.zip', 'brown_tei', 'brown_tei.zip', 'cess_cat', 'cess_cat.zip', 'cess_esp', 'cess_esp.zip', 'chat80', 'chat80.zip', 'city_database', 'city_database.zip', 'cmudict', 'cmudict.zip', 'comparative_sentences', 'comparative_sentences.zip', 'comtrans.zip', 'conll2000', 'conll2000.zip', 'conll2002', 'conll2002.zip', 'conll2007.zip', 'crubadan', 'crubadan.zip', 'dependency_treebank', 'dependency_treebank.zip', 'dolch', 'dolch.zip', 'europarl_raw', 'europarl_raw.zip', 'extended_omw.zip', 'floresta', 'floresta.zip', 'framenet_v15', 'framenet_v15.zip', 'framenet_v17', 'framenet_v17.zip', 'gazetteers', 'gazetteers.zip', 'genesis', 'genesis.zip', 'gutenberg', 'gutenberg.zip', 'ieer', 'ieer.zip', 'inaugural', 'inaugural.zip', 'indian', 'indian.zip', 'jeita.zip', 'kimmo', 'kimmo.zip', 'knbc.zip', 'lin_thesaurus', 'lin_thesaurus.zip', 'machado.zip', 'mac_morpho', 'mac_morpho.zip', 'masc_tag

- Tokenizing our own string
 - using a word tokenizer

In [3]:
from nltk.tokenize import word_tokenize

our_str = "Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text."

our_words = word_tokenize(our_str)
print(our_words) #note that punctuation such as ',' is also split out as a separate token

print(len(our_words))

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', ',', 'such', 'as', 'words', 'or', 'sentences', '.', 'These', 'smaller', 'units', 'are', 'called', 'tokens', '.', 'Tokenization', 'is', 'a', 'fundamental', 'step', 'in', 'natural', 'language', 'processing', '(', 'NLP', ')', 'and', 'text', 'analysis', '.', 'It', 'helps', 'in', 'understanding', 'the', 'structure', 'and', 'meaning', 'of', 'the', 'text', '.']
53
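
To see why punctuation ends up as separate tokens, here is a minimal sketch of a word tokenizer built on a plain regular expression. It is not how `word_tokenize` works internally (NLTK's tokenizer also handles contractions, abbreviations and more); the function name `simple_word_tokenize` and the pattern are assumptions made for illustration.

```python
import re

def simple_word_tokenize(text: str) -> list:
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (i.e. a punctuation mark)
    return re.findall(r"\w+|[^\w\s]", text)

sample = "These smaller units are called tokens."
sample_tokens = simple_word_tokenize(sample)
print(sample_tokens)
```

Because the full stop matches the second branch of the pattern, it becomes a token of its own, just like the ',' and '.' entries in the output above.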



- looking at the frequency distribution of words using nltk's FreqDist class

In [4]:
from nltk.probability import FreqDist

fd = FreqDist()

for word in our_words:
    fd[word.lower()] +=1

#frequency distribution of each word
display(fd)

#we can get the top 10 frequently used words
display(fd.most_common(10))

#frequency of the word "and"
display(fd["and"])

FreqDist({'.': 4, 'the': 3, 'text': 3, 'tokenization': 2, 'is': 2, 'of': 2, 'smaller': 2, 'units': 2, 'in': 2, 'and': 2, ...})

[('.', 4),
 ('the', 3),
 ('text', 3),
 ('tokenization', 2),
 ('is', 2),
 ('of', 2),
 ('smaller', 2),
 ('units', 2),
 ('in', 2),
 ('and', 2)]

2
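
`FreqDist` behaves much like Python's built-in `collections.Counter`, so the same counting can be sketched without NLTK. The token list below is a short hand-written sample (an assumption for illustration), not the full token list used above.

```python
from collections import Counter

# Hand-written sample tokens; lowercasing merges "The" and "the"
sample_tokens = ["The", "text", "of", "the", "text", "is", "the", "input", "."]
counts = Counter(token.lower() for token in sample_tokens)

top_two = counts.most_common(2)
print(top_two)
print(counts["is"])
print(counts["missing"])  # absent keys count as 0 instead of raising KeyError
```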

- splitting paras using a blank line tokenizer
    - nltk has several tokenizers, each of which splits text by a different rule.
    - the blank line tokenizer treats any sequence of blank lines as a delimiter, so each token is a para.

In [5]:

from nltk import blankline_tokenize

our_paras = """
Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text.

Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

Text analysis involves various techniques to extract information and insights from textual data. These techniques include tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, and sentiment analysis. By applying these techniques, we can gain a deeper understanding of the content and context of the text.
"""
our_blanks = blankline_tokenize(our_paras)
print(len(our_blanks)) #number of paras (chunks of text separated by blank lines)
print(our_blanks[1]) #second para (indexing starts at 0)



3
Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
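
The idea behind splitting on blank lines can be sketched with the standard `re` module. This is an illustration of the technique, not NLTK's implementation; `split_paragraphs` and the sample text are hypothetical.

```python
import re

def split_paragraphs(text: str) -> list:
    """Split text wherever one or more blank lines occur."""
    # a blank line shows up as two newlines with only whitespace between them
    parts = re.split(r"\s*\n\s*\n\s*", text.strip())
    return [part for part in parts if part]

sample_text = """
First paragraph, line one.
First paragraph, line two.

Second paragraph.


Third paragraph.
"""
paragraphs = split_paragraphs(sample_text)
print(len(paragraphs))
print(paragraphs[1])
```

A single newline inside a paragraph does not split it; only a run of blank lines does, however many there are.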


- Tokenization via Bigram, Trigram, Ngram
 - bigrams: tokens created with 2 consecutive words in a sentence
 - trigrams: tokens with 3 consecutive words in a sentence
 - ngrams: tokens with n consecutive words in a sentence
 - They are useful in tasks like transcription and translation, where it is easier to process text in chunks of n words

In [11]:
from nltk.util import bigrams, trigrams, ngrams

our_str = "Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text."
tokens = word_tokenize(our_str)
print(tokens)
tokens_bigrams = bigrams(tokens)
tokens_trigrams = trigrams(tokens)
tokens_ngrams = ngrams(tokens, 5)
print(list(tokens_bigrams))
print(list(tokens_trigrams))
print(list(tokens_ngrams))

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', ',', 'such', 'as', 'words', 'or', 'sentences', '.', 'These', 'smaller', 'units', 'are', 'called', 'tokens', '.', 'Tokenization', 'is', 'a', 'fundamental', 'step', 'in', 'natural', 'language', 'processing', '(', 'NLP', ')', 'and', 'text', 'analysis', '.', 'It', 'helps', 'in', 'understanding', 'the', 'structure', 'and', 'meaning', 'of', 'the', 'text', '.']
[('Tokenization', 'is'), ('is', 'the'), ('the', 'process'), ('process', 'of'), ('of', 'breaking'), ('breaking', 'down'), ('down', 'text'), ('text', 'into'), ('into', 'smaller'), ('smaller', 'units'), ('units', ','), (',', 'such'), ('such', 'as'), ('as', 'words'), ('words', 'or'), ('or', 'sentences'), ('sentences', '.'), ('.', 'These'), ('These', 'smaller'), ('smaller', 'units'), ('units', 'are'), ('are', 'called'), ('called', 'tokens'), ('tokens', '.'), ('.', 'Tokenization'), ('Tokenization', 'is'), ('is', 'a'), ('a', 'fundamental'), 
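
An n-gram is just a sliding window over the token list. The sketch below reproduces that idea in plain Python; `sliding_ngrams` is a hypothetical helper written for illustration, not an NLTK function.

```python
def sliding_ngrams(tokens: list, n: int) -> list:
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, which yields len(tokens) - n + 1 windows
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ["These", "smaller", "units", "are", "called", "tokens"]
sample_bigrams = sliding_ngrams(sample_tokens, 2)
sample_trigrams = sliding_ngrams(sample_tokens, 3)
print(sample_bigrams)
print(len(sample_trigrams))
```

For 6 tokens this gives 5 bigrams and 4 trigrams; a window longer than the token list gives no n-grams at all.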

#### Stemming
- Stemming is the process of reducing a word to its root form (its stem).
- This is done by, for example, removing suffixes like 's', 'ing' etc.
- Words like 'affected' and 'affects' are stripped to the stem 'affect'.
- There are various stemming algorithms, such as PorterStemmer, LancasterStemmer and SnowballStemmer. Let us try stemming with each of them.

In [17]:
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
pst, lst, sst = PorterStemmer(), LancasterStemmer(), SnowballStemmer("english")
words_to_stem = ['affected', 'affecting', 'affects', 'affection', 'gave', 'given', 'giving', 'gives']
for word in words_to_stem:
    print(f"{word} | porter: {pst.stem(word)} | lancaster: {lst.stem(word)} | snowball: {sst.stem(word)}")

affected | porter: affect | lancaster: affect | snowball: affect
affecting | porter: affect | lancaster: affect | snowball: affect
affects | porter: affect | lancaster: affect | snowball: affect
affection | porter: affect | lancaster: affect | snowball: affect
gave | porter: gave | lancaster: gav | snowball: gave
given | porter: given | lancaster: giv | snowball: given
giving | porter: give | lancaster: giv | snowball: give
gives | porter: give | lancaster: giv | snowball: give


- In the output above, you can see that a stemmer sometimes produces an invalid word like 'gav'. This is the main drawback of stemming: it is fast but can be imprecise.
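
To make the drawback concrete, here is a deliberately naive suffix-stripping stemmer. It is a toy written for illustration, far simpler than the Porter, Lancaster or Snowball algorithms; the suffix list and the minimum stem length of 3 are assumptions.

```python
# Suffixes are tried in this order; the first one that fits is removed
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for word in ["affected", "affecting", "affects", "gave", "giving", "gives"]:
    print(f"{word} | naive: {naive_stem(word)}")
```

Blind suffix removal turns 'giving' into 'giv', which is not a word, and leaves the irregular form 'gave' untouched. This is the same kind of imprecision seen in the stemmer output above.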

#### Lemmatization
- What does lemmatization do in addition to stemming?
    - it takes the context of the word in a sentence into account, typically its part of speech.
    - it ensures that the resulting word is a valid word (i.e. a word found in a dictionary of the language)
- wordnet lemmatization
    - wordnet is a popular lexical database of English, and nltk provides a lemmatizer built on top of it. Let us try it out.

In [20]:
from nltk.stem import wordnet, WordNetLemmatizer
words_to_stem = ['affected', 'affecting', 'affects', 'affection', 'gave', 'given', 'giving', 'gives']
wordnet_lem = WordNetLemmatizer()
for word in words_to_stem:
    print(f"{word} | {wordnet_lem.lemmatize(word)}")

affected | affected
affecting | affecting
affects | affect
affection | affection
gave | gave
given | given
giving | giving
gives | give


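- We can see that a lemmatizer treats words quite differently from a stemmer: it is typically slower, but every result is a valid word. Note that `lemmatize` treats each word as a noun unless told otherwise, which is why verb forms like 'gave' and 'giving' come back unchanged above; passing `pos="v"` asks for the verb lemma instead.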
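
The role of the part of speech can be sketched with a tiny dictionary lookup. This is a toy model of the idea only: the table below is hand-written for illustration and has nothing to do with how WordNet stores its data, and `toy_lemmatize` is a hypothetical helper.

```python
# Tiny hand-written lemma table keyed by (word, part of speech); purely illustrative
LEMMA_TABLE = {
    ("gave", "v"): "give",
    ("given", "v"): "give",
    ("giving", "v"): "give",
    ("gives", "v"): "give",
    ("gives", "n"): "give",
    ("affects", "n"): "affect",
    ("affects", "v"): "affect",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Look the word up for the given part of speech; return it unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

for word in ["gave", "giving", "gives"]:
    print(f"{word} | as noun: {toy_lemmatize(word)} | as verb: {toy_lemmatize(word, 'v')}")
```

Looked up as a noun (the default here), 'gave' has no entry and is returned as-is; looked up as a verb, it maps to 'give'.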

#### Stopwords
- Stop words are very common words of a language (like 'the', 'is', 'and') that usually add little meaning in the context of natural language processing.
- NLTK has a corpus of stop words. We can use it to filter such words out and reduce our token count.
- Let us take a look at the stop words in NLTK.

In [28]:
from nltk.corpus import stopwords
# Get the list of English stopwords
stop_words = set(stopwords.words('english'))

# Print the stopwords
print(stop_words)
print(len(stop_words))

{'such', 't', "should've", 'further', 'me', 'here', 'through', 'before', 'nor', 'are', "don't", 'does', 'when', "aren't", 'there', 'to', 'am', 'into', 'as', 'himself', 'with', 'a', 'ourselves', 'of', 'why', 'any', 'what', 'an', "you've", 'under', "doesn't", 'our', 'the', 're', 'isn', "it's", 'once', 'we', 'hers', 'where', "mustn't", 'but', 'd', "wasn't", 'for', "couldn't", 'only', 'i', 'these', "she's", 'so', 'too', 'during', 'have', 'do', 'has', 'some', 'than', "didn't", 'because', 'he', 'against', 'them', 'own', 'her', 'then', 'yours', 'being', 'they', 'all', "needn't", 'whom', 'weren', 'their', 'be', 'out', 'can', 'yourselves', "you'd", 'ain', 'those', 'same', 'down', 'his', "shouldn't", 'who', 'from', "hasn't", 'which', 'at', 'on', 'until', 'been', 'just', 'doesn', 'more', 'didn', 'itself', 'most', 'wouldn', 'each', 'my', "hadn't", 'it', 'in', 'both', 'after', 's', 'over', "mightn't", 'and', 'not', 'theirs', 'm', 'now', 'll', "wouldn't", 'y', 'yourself', 'ours', "haven't", 'how', '
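
Once the stop words are available as a set, reducing the token count is a simple filter. The sketch below uses a small hand-picked set as a stand-in for `stopwords.words('english')` so that it runs without any downloads; the set contents and the helper name `remove_stop_words` are assumptions.

```python
# Small hand-picked stand-in for the NLTK stop word list; the real list is much longer
SAMPLE_STOP_WORDS = {"is", "the", "of", "a", "in", "and", "into", "as", "or", "such", "down"}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

sample_tokens = ["Tokenization", "is", "the", "process", "of", "breaking", "down", "text"]
content_tokens = remove_stop_words(sample_tokens, SAMPLE_STOP_WORDS)
print(content_tokens)
print(len(sample_tokens), "->", len(content_tokens))
```

Comparing in lowercase matters because the stop word list is lowercase while tokens at the start of a sentence are capitalized.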

#### Parts of speech (POS)
- Remember the exercises from English class where we had to identify the noun, verb, adverb etc. in a sentence?
- Part-of-speech (POS) tagging is the same activity, done as part of the NLP process.
- Here we process each word, or sometimes a group of words, and label it with a tag that identifies its part of speech.
- These are the tags used in POS tagging (the Penn Treebank tag set):

    | Tag  | Description                                      | Example       |
    |------|--------------------------------------------------|---------------|
    | CC   | Coordinating conjunction                         | and           |
    | CD   | Cardinal number                                  | 1, two        |
    | DT   | Determiner                                       | the           |
    | EX   | Existential there                                | there         |
    | FW   | Foreign word                                     | d'autre       |
    | IN   | Preposition or subordinating conjunction         | in, of        |
    | JJ   | Adjective                                        | quick         |
    | JJR  | Adjective, comparative                           | quicker       |
    | JJS  | Adjective, superlative                           | quickest      |
    | LS   | List item marker                                 | 1., 2.        |
    | MD   | Modal                                            | could, will   |
    | NN   | Noun, singular or mass                           | dog           |
    | NNS  | Noun, plural                                     | dogs          |
    | NNP  | Proper noun, singular                            | John          |
    | NNPS | Proper noun, plural                              | Smiths        |
    | PDT  | Predeterminer                                    | all the       |
    | POS  | Possessive ending                                | 's            |
    | PRP  | Personal pronoun                                 | I, he         |
    | PRP$ | Possessive pronoun                               | my, his       |
    | RB   | Adverb                                           | quickly       |
    | RBR  | Adverb, comparative                              | faster        |
    | RBS  | Adverb, superlative                              | fastest       |
    | RP   | Particle                                         | up, off       |
    | SYM  | Symbol                                           | +, %, &       |
    | TO   | to                                               | to            |
    | UH   | Interjection                                     | ah, hey       |
    | VB   | Verb, base form                                  | run           |
    | VBD  | Verb, past tense                                 | ran           |
    | VBG  | Verb, gerund or present participle               | running       |
    | VBN  | Verb, past participle                            | run           |
    | VBP  | Verb, non-3rd person singular present            | run           |
    | VBZ  | Verb, 3rd person singular present                | runs          |
    | WDT  | Wh-determiner                                    | which         |
    | WP   | Wh-pronoun                                       | who, what     |
    | WP$  | Possessive wh-pronoun                            | whose         |
    | WRB  | Wh-adverb                                        | where, when   |





- Now let us try POS tagging on a word-tokenized sentence.

In [31]:
from nltk import pos_tag
our_str = "Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text."
words = word_tokenize(our_str)
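#Note: each word is tagged on its own here; passing the whole list, pos_tag(words), lets the tagger use the surrounding words as context
for i in range(0,10):
    word = words[i]
    print(f"{word} | pos tag -> {pos_tag([word])}")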

Tokenization | pos tag -> [('Tokenization', 'NN')]
is | pos tag -> [('is', 'VBZ')]
the | pos tag -> [('the', 'DT')]
process | pos tag -> [('process', 'NN')]
of | pos tag -> [('of', 'IN')]
breaking | pos tag -> [('breaking', 'VBG')]
down | pos tag -> [('down', 'RB')]
text | pos tag -> [('text', 'NN')]
into | pos tag -> [('into', 'IN')]
smaller | pos tag -> [('smaller', 'JJR')]
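
Once tokens carry tags, picking out a part of speech is a matter of filtering on the tag. The sketch below copies the (word, tag) pairs from the output above into a list so that it runs without the tagger; `words_with_tag_prefix` is a hypothetical helper.

```python
# Tagged tokens copied from the output above
tagged_sample = [
    ("Tokenization", "NN"), ("is", "VBZ"), ("the", "DT"), ("process", "NN"),
    ("of", "IN"), ("breaking", "VBG"), ("down", "RB"), ("text", "NN"),
    ("into", "IN"), ("smaller", "JJR"),
]

def words_with_tag_prefix(tagged, prefix):
    """Return the words whose tag starts with prefix (e.g. 'NN' covers NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(tagged_sample, "NN")
verbs = words_with_tag_prefix(tagged_sample, "VB")
print(nouns)
print(verbs)
```

Matching on a prefix works because related tags in the table share their first letters, for example all verb tags start with VB.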


#### Named entity recognition
- what is it?
    - Named entity recognition (NER) is the process of identifying entities in text and classifying them into predefined categories.
    - Entities are typically proper nouns or noun-like expressions (such as amounts in Rupees or Dollars) in a sentence.
    - Some of the common predefined categories are: 

        | Category       | Description                                  | Example          |
        |----------------|----------------------------------------------|------------------|
        | PERSON         | Names of people                              | Barack Obama     |
        | ORGANIZATION   | Names of companies, government organizations | Microsoft        |
        | LOCATION       | Geographical names like cities, countries    | Tokyo            |
        | DATE           | Dates in various formats                     | January 23, 2025 |
        | TIME           | Times of the day                             | 12:28 PM         |
        | MONEY          | Monetary values                              | $100             |


- 3 steps of NER
    - word tokenization
    - POS tagging of the tokens
    - named entity chunking of the POS-tagged tokens


- Let us perform NER on a string.
 - `ne_chunk` expects POS-tagged tokens, so before calling it we perform
    - word tokenization and
    - POS tagging on the tokens.

In [None]:
from nltk import ne_chunk #ne_chunk is the function that does NER in NLTK

# Sample sentence
sentence = "Barack Obama was born in Honolulu, Hawaii."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Perform POS tagging
tagged_tokens = pos_tag(tokens)

# Perform Named Entity Recognition
named_entities = ne_chunk(tagged_tokens)

# Print the named entities
print(named_entities) #entities are the words wrapped in parentheses together with their category label
#Barack and Obama are each identified as PERSON, and Honolulu and Hawaii as GPE (geopolitical entity)

(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Honolulu/NNP)
  ,/,
  (GPE Hawaii/NNP)
  ./.)
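
To pull the entities out as plain (category, text) pairs, the printed tree can be parsed with a regular expression. This is a rough sketch that works on the text output above, pasted in as a string; `extract_entities` is a hypothetical helper, and when the tree object itself is available, walking its subtrees is more robust than parsing text.

```python
import re

# The printed tree from the output above, pasted in as plain text
ner_output = """(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Honolulu/NNP)
  ,/,
  (GPE Hawaii/NNP)
  ./.)"""

def extract_entities(tree_text):
    """Return (label, text) pairs for every innermost parenthesised chunk."""
    entities = []
    # [^()]+ forbids nested parentheses, so the outer (S ...) is never matched
    for label, body in re.findall(r"\((\w+)\s+([^()]+)\)", tree_text):
        # each item in the body looks like word/TAG; keep only the word
        words = [item.rsplit("/", 1)[0] for item in body.split()]
        entities.append((label, " ".join(words)))
    return entities

entities = extract_entities(ner_output)
print(entities)
```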


#### Syntax
- Each language has its own set of rules about which part of a sentence appears where.
- By establishing the syntax of a language, we can clearly identify the different parts of a sentence during NLP.
- Classifying the different parts of a sentence makes it easier to understand what the sentence is trying to communicate.
- Syntax analysis builds on the classification done at the POS stage by identifying larger parts of a sentence, like verb phrases and noun phrases.
- Each part has a tag associated with it, just like in POS.
- Using these tags and the tags derived during the POS stage, we can create a tree-like structure from a sentence. This is called a syntax tree.
- Let's say we have a sentence: “The quick brown fox jumps over the lazy dog”
    - post POS, the sentence will have been tagged like this: `The (DT) quick (JJ) brown (JJ) fox (NN) jumps (VBZ) over (IN) the (DT) lazy (JJ) dog (NN)`.
    - Now, after applying the syntax of English language, we can derive a syntax tree like this:
        ```
        S
        +-- NP
        |   +-- DT   The
        |   +-- JJ   quick
        |   +-- JJ   brown
        |   +-- NN   fox
        +-- VP
            +-- VBZ  jumps
            +-- PP
                +-- IN   over
                +-- NP
                    +-- DT   the
                    +-- JJ   lazy
                    +-- NN   dog
        ```
    - Some of the top level tags we see are the phrase tags:

    | **Tag** | **Description**                                 |
    |---------|-------------------------------------------------|
    | NP      | Noun Phrase                                     |
    | VP      | Verb Phrase                                     |
    | ADJP    | Adjective Phrase                                |
    | ADVP    | Adverb Phrase                                   |
    | PP      | Prepositional Phrase                            |
    | SBAR    | Clause introduced by a subordinating conjunction |
    | WHNP    | Wh-Noun Phrase (e.g., what, who)                |
    | WHADVP  | Wh-Adverb Phrase (e.g., when, where)            |
    | WHPP    | Wh-Prepositional Phrase                         |
 
- Using syntax trees, a more sophisticated system can be built that answers questions like "Who jumped over the lazy dog?"
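
As a small illustration of that idea, the sketch below stores the syntax tree of the example sentence as nested tuples and reads the subject off it. The tuple layout and the helpers `leaves` and `subject_of` are assumptions made for this example; treating the first NP under S as the subject only holds for simple sentences like this one.

```python
# Each node is (label, children...); a leaf is (POS tag, word)
syntax_tree = (
    "S",
    ("NP", ("DT", "The"), ("JJ", "quick"), ("JJ", "brown"), ("NN", "fox")),
    ("VP",
        ("VBZ", "jumps"),
        ("PP",
            ("IN", "over"),
            ("NP", ("DT", "the"), ("JJ", "lazy"), ("NN", "dog")))),
)

def leaves(node):
    """Collect the words under a node, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]  # leaf: (tag, word)
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

def subject_of(sentence):
    """Return the words of the first NP directly under S."""
    for child in sentence[1:]:
        if child[0] == "NP":
            return " ".join(leaves(child))
    return None

subject = subject_of(syntax_tree)
print(subject)
```

The answer to "Who jumped over the lazy dog?" is then the subject noun phrase, while the NP nested inside the PP gives the thing that was jumped over.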


#### Chunking
- Chunking can be thought of as the opposite of tokenization.
- Here the tokenized words are grouped back into chunks (phrases) based on syntax rules.
- Let us try chunking with a simple rule defined as a regex over POS tags.

In [34]:
from nltk import RegexpParser

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Get part of speech tags
pos_tags = pos_tag(tokens)

# Define the chunking rule
# This rule chunks zero or more adjectives followed by a noun (e.g., "lazy dog")
chunk_rule = "NP: {<JJ>*<NN>}"

# Create the chunk parser
chunk_parser = RegexpParser(chunk_rule)

# Parse the sentence
chunked_sentence = chunk_parser.parse(pos_tags)

# Print the chunked sentence
print(chunked_sentence)
# We can see that the phrases are chunked based on the POS tags and the syntax (regex rule) we have applied.
# The chunks are "quick brown", "fox" and "lazy dog": the tagger labelled "brown" as a noun here,
# so "quick brown" matches the rule (zero or more adjectives followed by a noun) and "fox" forms a chunk of its own

(S
  The/DT
  (NP quick/JJ brown/NN)
  (NP fox/NN)
  jumps/VBZ
  over/IN
  the/DT
  (NP lazy/JJ dog/NN))
