### Understanding NLP using NLTK
#### 2 parts of NLP
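    - NLU (Natural Language Understanding)
    - NLG (Natural Language Generation)
#### Why is NLU hard?
    - Ambiguity. There are 3 types of ambiguity:
        - lexical ambiguity: a single word can have more than one meaning
        - syntactic ambiguity: a sentence can be parsed in more than one way
        - referential ambiguity: it is unclear what a pronoun or reference points to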

#### Installation and setup of NLTK
- Install NLTK with `py -m pip install nltk`. This downloads the base nltk package, but it does not include the data (corpora, models) that we need to perform NLP.
- For that, we need to download the additional NLTK data packages.

In [2]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

- Executing the function above opens a window that lists the nltk packages, each of which can be downloaded individually.
- To keep things simple, select the "all packages" option and click download.


#### Tokenization
- 3 basic steps involved in tokenization:
    - strip: split a sentence into its individual words.
    - understand: determine the importance of each word with respect to the sentence.
    - describe: produce a structural description of the input sentence.
    


- Analyzing the corpus
    - A corpus is a collection of texts.
    - NLTK provides access to many corpora (the plural of corpus), and we as developers can pick the corpus we need and use it to perform NLP on our input.


In [3]:
import os
import nltk.corpus

# list all the corpora installed in the nltk data directory
print(os.listdir(nltk.data.find("corpora")))

# the words contained in the corpus called brown
print(nltk.corpus.brown.words())

# a corpus can consist of several files; list the file ids of the gutenberg corpus
print(nltk.corpus.gutenberg.fileids())

['abc', 'abc.zip', 'alpino', 'alpino.zip', 'bcp47.zip', 'biocreative_ppi', 'biocreative_ppi.zip', 'brown', 'brown.zip', 'brown_tei', 'brown_tei.zip', 'cess_cat', 'cess_cat.zip', 'cess_esp', 'cess_esp.zip', 'chat80', 'chat80.zip', 'city_database', 'city_database.zip', 'cmudict', 'cmudict.zip', 'comparative_sentences', 'comparative_sentences.zip', 'comtrans.zip', 'conll2000', 'conll2000.zip', 'conll2002', 'conll2002.zip', 'conll2007.zip', 'crubadan', 'crubadan.zip', 'dependency_treebank', 'dependency_treebank.zip', 'dolch', 'dolch.zip', 'europarl_raw', 'europarl_raw.zip', 'extended_omw.zip', 'floresta', 'floresta.zip', 'framenet_v15', 'framenet_v15.zip', 'framenet_v17', 'framenet_v17.zip', 'gazetteers', 'gazetteers.zip', 'genesis', 'genesis.zip', 'gutenberg', 'gutenberg.zip', 'ieer', 'ieer.zip', 'inaugural', 'inaugural.zip', 'indian', 'indian.zip', 'jeita.zip', 'kimmo', 'kimmo.zip', 'knbc.zip', 'lin_thesaurus', 'lin_thesaurus.zip', 'machado.zip', 'mac_morpho', 'mac_morpho.zip', 'masc_tag

- Tokenizing our own string
 - using a word tokenizer

In [4]:
from nltk.tokenize import word_tokenize

our_str = "Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text."

our_words = word_tokenize(our_str)
print(our_words)  # note that punctuation such as ',' is also split out as a separate token

print(len(our_words))

['Tokenization', 'is', 'the', 'process', 'of', 'breaking', 'down', 'text', 'into', 'smaller', 'units', ',', 'such', 'as', 'words', 'or', 'sentences', '.', 'These', 'smaller', 'units', 'are', 'called', 'tokens', '.', 'Tokenization', 'is', 'a', 'fundamental', 'step', 'in', 'natural', 'language', 'processing', '(', 'NLP', ')', 'and', 'text', 'analysis', '.', 'It', 'helps', 'in', 'understanding', 'the', 'structure', 'and', 'meaning', 'of', 'the', 'text', '.']
53
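If punctuation tokens are not wanted, one option is NLTK's `RegexpTokenizer`, which keeps only the text matched by a pattern. The following is a minimal sketch, assuming that a "word" is any run of letters, digits, or underscores (`\w+`); the sample sentence is made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters; punctuation is dropped entirely.
word_only_tokenizer = RegexpTokenizer(r"\w+")

sample = "Tokens, such as words or sentences."
word_tokens = word_only_tokenizer.tokenize(sample)
print(word_tokens)
```

This tokenizer is purely pattern based, so it will also split a contraction such as "don't" into two tokens. Whether that is acceptable depends on the task.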



- Looking at the frequency distribution of words using nltk's `FreqDist` class

In [5]:
from nltk.probability import FreqDist

fd = FreqDist()

for word in our_words:
    fd[word.lower()] +=1

# frequency distribution of all words (display() is available in Jupyter/IPython)
display(fd)

# the 10 most frequent words
display(fd.most_common(10))

# frequency of the word "and"
display(fd["and"])

FreqDist({'.': 4, 'the': 3, 'text': 3, 'tokenization': 2, 'is': 2, 'of': 2, 'smaller': 2, 'units': 2, 'in': 2, 'and': 2, ...})

[('.', 4),
 ('the', 3),
 ('text', 3),
 ('tokenization', 2),
 ('is', 2),
 ('of', 2),
 ('smaller', 2),
 ('units', 2),
 ('in', 2),
 ('and', 2)]

2
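In the output above, the most frequent "word" is the full stop. A possible variation, sketched below, builds the `FreqDist` directly from an iterable and skips punctuation-only tokens. The token list is a hand-made sample, and using `str.isalpha()` as the filter is an assumption (it would also drop tokens containing digits or hyphens).

```python
from nltk.probability import FreqDist

# Hand-made token list (hypothetical sample), shaped like word_tokenize output.
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

# FreqDist subclasses collections.Counter, so it can be built from any iterable.
# str.isalpha() filters out punctuation-only tokens such as ".".
word_fd = FreqDist(tok.lower() for tok in tokens if tok.isalpha())

print(word_fd.most_common(2))  # the two most frequent words
print(word_fd.N())             # total number of counted tokens
print(word_fd.freq("the"))     # relative frequency of "the"
```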

- Splitting text into paragraphs using a blank line tokenizer
    - nltk has different tokenizers that split text into tokens based on different logic.
    - The blank line tokenizer treats any sequence of blank lines as a delimiter.

In [9]:

from nltk import blankline_tokenize

our_paras = """
Tokenization is the process of breaking down text into smaller units, such as words or sentences. These smaller units are called tokens. Tokenization is a fundamental step in natural language processing (NLP) and text analysis. It helps in understanding the structure and meaning of the text.

Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

Text analysis involves various techniques to extract information and insights from textual data. These techniques include tokenization, stemming, lemmatization, part-of-speech tagging, named entity recognition, and sentiment analysis. By applying these techniques, we can gain a deeper understanding of the content and context of the text.
"""
our_blanks = blankline_tokenize(our_paras)
print(len(our_blanks)) # number of paragraphs (chunks separated by blank lines)
print(our_blanks[1]) # second para (index 1)



3
Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.
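To see the delimiter rule in isolation, here is a small sketch on a made-up string. It assumes the behaviour described above: a run of several blank lines counts as one delimiter, while a single newline inside a paragraph does not split it.

```python
from nltk.tokenize import blankline_tokenize

# Four newlines after the first paragraph, one newline inside the second,
# and two newlines before the third.
text = "First para.\n\n\n\nSecond para,\nstill second.\n\nThird para."

paras = blankline_tokenize(text)
print(len(paras))
for para in paras:
    print(repr(para))  # repr() makes the embedded newline visible
```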


- Tokenization via bigrams, trigrams, and n-grams (sequences of 2, 3, or n consecutive tokens)
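A minimal sketch of n-gram extraction using the helpers in `nltk.util`. The sample phrase is made up, and it is split with `str.split()` only to keep the example free of any downloaded data; tokens from `word_tokenize` could be passed in the same way.

```python
from nltk.util import bigrams, trigrams, ngrams

tokens = "natural language processing with nltk".split()

# Each helper returns an iterator of tuples, so wrap it in list() to inspect it.
bigram_list = list(bigrams(tokens))       # pairs of consecutive tokens
trigram_list = list(trigrams(tokens))     # triples of consecutive tokens
fourgram_list = list(ngrams(tokens, 4))   # general form: n consecutive tokens

print(bigram_list)
print(trigram_list)
print(fourgram_list)
```

A list of k tokens yields k - n + 1 n-grams, so the five tokens here give four bigrams, three trigrams, and two 4-grams.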

#### Stemming
- `PorterStemmer`, `LancasterStemmer`, `SnowballStemmer`
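The three stemmers can be compared side by side. The sketch below runs each one over a small, made-up word list; the choice of words and of the "english" Snowball language are assumptions for illustration. Stemmers apply suffix-stripping rules, so the results are not always dictionary words, and the three algorithms may disagree.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
}
words = ["running", "cats", "studies", "generously"]

# Stem every word with every stemmer so the outputs can be compared.
stems = {name: [stemmer.stem(word) for word in words]
         for name, stemmer in stemmers.items()}

for name, result in stems.items():
    print(f"{name:10} {result}")
```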

#### Lemmatization
- WordNet lemmatization (`WordNetLemmatizer`)

#### Stopwords
- import and list stopwords
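A sketch of how the stopword list might be imported, listed, and used as a filter. It assumes the `stopwords` corpus has already been downloaded; if it has not, NLTK raises `LookupError`, which the example catches. `remove_stopwords` is a hypothetical helper written for this note, not part of NLTK.

```python
from nltk.corpus import stopwords


def remove_stopwords(tokens, stop_set):
    """Return the tokens whose lowercase form is not in stop_set."""
    return [tok for tok in tokens if tok.lower() not in stop_set]


tokens = ["Tokenization", "is", "a", "fundamental", "step", "in", "NLP"]

try:
    english_stopwords = set(stopwords.words("english"))
    print(len(english_stopwords))          # size of the English list
    print(sorted(english_stopwords)[:10])  # a few entries
    print(remove_stopwords(tokens, english_stopwords))
except LookupError:
    # Raised when the stopwords corpus has not been downloaded yet.
    print('stopwords corpus not found; run nltk.download("stopwords") first')
```

Converting the list to a `set` keeps the membership test fast when filtering long token lists.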

#### Parts of speech
- Chart of tags and their descriptions
- show the tags of a sample string using nltk.pos_tag
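A sketch of tagging a short, pre-split sample with `nltk.pos_tag`, together with a partial chart of common Penn Treebank tags. `COMMON_TAGS` and `describe_tags` are hypothetical helpers for this note, and the chart covers only a handful of tags. The example assumes the tagger model has been downloaded and catches the `LookupError` that NLTK raises otherwise.

```python
import nltk

# A few common Penn Treebank tags (the tag set nltk.pos_tag uses by default).
COMMON_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBZ": "verb, 3rd person singular present",
}


def describe_tags(tagged_tokens):
    """Attach a description to each (word, tag) pair; unknown tags map to '?'."""
    return [(word, tag, COMMON_TAGS.get(tag, "?")) for word, tag in tagged_tokens]


tokens = ["The", "little", "dog", "barked", "loudly"]

try:
    tagged = nltk.pos_tag(tokens)  # list of (word, tag) tuples
    for row in describe_tags(tagged):
        print(row)
except LookupError:
    # Raised when the tagger model has not been downloaded yet.
    print("POS tagger model not found; download it with nltk.download() first")
```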

#### Named entity recognition
- what is it?
- 3 steps of NER
- Performing NER on a tokenized, stemmed, and pos-tagged string


#### Syntax
- syntax tree
- chunking
    - simulation of chunking using a simple regex
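Chunking with a simple regex can be sketched with NLTK's `RegexpParser`, which matches patterns over POS tags rather than over characters. The sentence below is tagged by hand so that no tagger model is needed, and the noun-phrase grammar is one common illustrative choice, not the only possible one.

```python
from nltk.chunk import RegexpParser

# Pre-tagged tokens (hand-written here so no tagger model is needed).
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

# NP chunk = optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = RegexpParser(grammar)

tree = chunk_parser.parse(tagged_sentence)  # returns an nltk Tree
print(tree)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The returned tree is also the simplest form of a syntax tree in these notes: the root `S` holds the NP chunks plus the tokens that did not match the grammar.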