# NLP
Natural Language Processing (NLP) is a field of AI concerned with the interaction between computers and humans through natural language. Its goal is to enable computers to understand, interpret, and generate human language in ways that are meaningful and useful.

## Applications of NLP
- Search Engines
- Chatbots
- Language Translation

# Regular Expressions
- `.` - matches any character except a newline
- `\w` - matches any word character (alphanumeric or underscore; equivalent to `[a-zA-Z0-9_]` for ASCII text)
- `\d` - matches any digit (`[0-9]`)
- `\s` - matches any whitespace character (space, tab, newline)
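The character classes above can be tried out with Python's built-in `re` module. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "Order 66 shipped\non 2024-05-01"

# `.` matches any character except a newline, so each match stops at "\n".
lines = re.findall(r".+", text)
print(lines)    # ['Order 66 shipped', 'on 2024-05-01']

# `\w` matches letters, digits and underscore; `\w+` grabs whole runs.
words = re.findall(r"\w+", text)
print(words)    # ['Order', '66', 'shipped', 'on', '2024', '05', '01']

# `\d` matches a single digit; `\d+` grabs runs of digits.
digits = re.findall(r"\d+", text)
print(digits)   # ['66', '2024', '05', '01']

# `\s` matches whitespace, including spaces and newlines.
chunks = re.split(r"\s+", text)
print(chunks)   # ['Order', '66', 'shipped', 'on', '2024-05-01']
```

Note that the hyphen in the date is not a word character, so `\w+` splits the date into three pieces while splitting on `\s+` keeps it whole.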

# Installation

In [48]:
!pip install nltk






In [49]:
import nltk

# Tokenization
Tokenization splits text into smaller units, known as tokens. Tokens can be words, phrases, sentences, or other meaningful units, depending on the granularity of the tokenization.
## Types
- Word Tokenization
- Sentence Tokenization
- Subword Tokenization
- Character Tokenization
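To make the granularity differences concrete, here is a rough sketch of word, sentence, and character tokenization using only the standard `re` module. The regular expressions are simplifying assumptions, not what NLTK does internally, and they will mishandle abbreviations, contractions, and similar cases. Subword tokenization is omitted because it normally depends on a vocabulary learned from data.

```python
import re

sentence = "The quick brown fox jumps over the lazy dog. It was a sunny day"

# Word tokenization: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", sentence)

# Character tokenization: every character (spaces included) is a token.
char_tokens = list("fox jumps")

print(word_tokens)
print(sentence_tokens)
print(char_tokens)
```

For this particular sentence the word and sentence tokens happen to match what `word_tokenize` and `sent_tokenize` return, but that should not be expected in general.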

In [50]:
from nltk.tokenize import word_tokenize, sent_tokenize

# Tokenizer models used by word_tokenize and sent_tokenize
nltk.download('punkt_tab')

sentence = "The quick brown fox jumps over the lazy dog. It was a sunny day"

words = word_tokenize(sentence)      # split into words and punctuation
sentences = sent_tokenize(sentence)  # split into sentences

print("Word Tokens:", words)
print("Sentence Tokens:", sentences)

Word Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'It', 'was', 'a', 'sunny', 'day']
Sentence Tokens: ['The quick brown fox jumps over the lazy dog.', 'It was a sunny day']




# Stemming
Stemming is a text normalization technique that reduces words to a base/root form. It simplifies text data by mapping derived words to a common stem so that they can be analyzed as a single item.

Stemming algorithms typically strip common word suffixes (`ing`, `ly`, `ed`) to transform a word into its stem. Because the rules are purely mechanical, a stem is not always a dictionary word.

__Example:__ `running` -> `run`, `lazy` -> `lazi`
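The suffix-stripping idea can be sketched in a few lines. The function below is a deliberately naive illustration, not the Porter, Snowball, or Lancaster algorithm; the name `naive_stem`, the suffix list, and the `min_stem` threshold are all assumptions made for this example.

```python
def naive_stem(word, suffixes=("ing", "ly", "ed", "s"), min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["jumps", "jumped", "jumping", "quickly", "running", "was"]:
    print(f"{w} -> {naive_stem(w)}")
# jumps -> jump
# jumped -> jump
# jumping -> jump
# quickly -> quick
# running -> runn
# was -> was
```

The `running` -> `runn` result shows why real stemmers carry extra rules (for doubled consonants, among other things) on top of plain suffix removal.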

## PorterStemmer

In [51]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [52]:
porter=PorterStemmer()
for word in words:
    print(f"{word} -> {porter.stem(word)}")

The -> the
quick -> quick
brown -> brown
fox -> fox
jumps -> jump
over -> over
the -> the
lazy -> lazi
dog -> dog
. -> .
It -> it
was -> wa
a -> a
sunny -> sunni
day -> day


## SnowballStemmer

In [53]:
snowball=SnowballStemmer(language='english')
for word in words:
    print(f"{word}->{snowball.stem(word)}")

The->the
quick->quick
brown->brown
fox->fox
jumps->jump
over->over
the->the
lazy->lazi
dog->dog
.->.
It->it
was->was
a->a
sunny->sunni
day->day


## LancasterStemmer

In [54]:
lancaster=LancasterStemmer()
for word in words:
    print(f"{word}->{lancaster.stem(word)}")

The->the
quick->quick
brown->brown
fox->fox
jumps->jump
over->ov
the->the
lazy->lazy
dog->dog
.->.
It->it
was->was
a->a
sunny->sunny
day->day


# Lemmatization
Lemmatization is a text normalization technique that also reduces words to a base form. Unlike stemming, it uses vocabulary and morphological analysis, together with the word's part of speech, to return a meaningful dictionary form (the lemma).

__Example:__ `running` -> `run`, `better` -> `good`
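The key difference from stemming is that the result depends on a dictionary and on the part of speech. The toy lookup below illustrates only that idea; `TOY_LEXICON` and `toy_lemmatize` are hypothetical names, and the handful of entries is invented for the example rather than taken from WordNet.

```python
# Hypothetical miniature lexicon: (surface form, part of speech) -> lemma.
# POS codes: "n" noun, "v" verb, "a" adjective, "r" adverb.
TOY_LEXICON = {
    ("running", "v"): "run",
    ("was", "v"): "be",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("jumps", "v"): "jump",
    ("jumps", "n"): "jump",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the word unchanged."""
    return TOY_LEXICON.get((word.lower(), pos), word)


print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better", "n"))  # better (no noun entry, word returned as-is)
print(toy_lemmatize("was", "v"))     # be
print(toy_lemmatize("was"))          # was (default POS is noun)
```

The last two calls show why supplying the right part of speech matters: the same surface form yields a different lemma when it is looked up under a different tag.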

In [55]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

In [56]:
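nltk.download('wordnet')
nltk.download('omw-1.4')
# nltk.pos_tag (used below) needs the tagger model as well
nltk.download('averaged_perceptron_tagger_eng')

def get_wordnet_pos(word):
    """Map the Penn Treebank tag of a single word to a WordNet POS constant."""
    # First letter of the tag, e.g. 'VBD' -> 'V'
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Anything else (determiners, pronouns, ...) is treated as a noun
    return tag_dict.get(tag, wordnet.NOUN)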



In [57]:
for word in words:
    pos = get_wordnet_pos(word)
    lemmatized_word = lemmatizer.lemmatize(word, pos)
    # Print the lemma itself (not a stem of it)
    print(f"{word}->{lemmatized_word}")

The->The
quick->quick
brown->brown
fox->fox
jumps->jump
over->over
the->the
lazy->lazy
dog->dog
.->.
It->It
was->be
a->a
sunny->sunny
day->day


# POS Tagging
Part-of-speech (POS) tagging assigns a part of speech to each word in a sentence or text.
## Tags
- `NN` - Noun
- `VB` - Verb
- `JJ` - Adjective
- `RB` - Adverb
- `PRP` - Personal pronoun
- `IN` - Preposition or subordinating conjunction
- `CC` - Coordinating conjunction
- `DT` - Determiner
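The tags listed above are base tags; a tagger also emits finer-grained variants such as `NNS` (plural noun) or `VBD` (past-tense verb). As a rough sketch, the helper below collapses such tags into coarse categories by prefix. The names `COARSE` and `coarse_pos` and the sample `tagged` list are assumptions for illustration and are not part of NLTK.

```python
# Coarse categories keyed by Penn Treebank tag prefix.
COARSE = {
    "NN": "Noun",
    "VB": "Verb",
    "JJ": "Adjective",
    "RB": "Adverb",
    "PRP": "Pronoun",
    "IN": "Preposition",
    "CC": "Conjunction",
    "DT": "Determiner",
}


def coarse_pos(tag):
    """Return the coarse category for a fine-grained tag, or 'Other'."""
    # Check longer prefixes first so that e.g. 'PRP$' is matched via 'PRP'.
    for prefix in sorted(COARSE, key=len, reverse=True):
        if tag.startswith(prefix):
            return COARSE[prefix]
    return "Other"


# Sample (word, tag) pairs in the shape returned by a POS tagger.
tagged = [("The", "DT"), ("jumps", "NNS"), ("was", "VBD"), ("It", "PRP"), (".", ".")]
for word, tag in tagged:
    print(f"{word}: {tag} -> {coarse_pos(tag)}")
# The: DT -> Determiner
# jumps: NNS -> Noun
# was: VBD -> Verb
# It: PRP -> Pronoun
# .: . -> Other
```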

In [58]:
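from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
# Note: each word is tagged on its own here, so the tagger sees no
# sentence context. Passing the full token list would give it context.
for word in words:
    print(f"{word} -> {pos_tag([word])}")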



The -> [('The', 'DT')]
quick -> [('quick', 'NN')]
brown -> [('brown', 'NN')]
fox -> [('fox', 'NN')]
jumps -> [('jumps', 'NNS')]
over -> [('over', 'IN')]
the -> [('the', 'DT')]
lazy -> [('lazy', 'NN')]
dog -> [('dog', 'NN')]
. -> [('.', '.')]
It -> [('It', 'PRP')]
was -> [('was', 'VBD')]
a -> [('a', 'DT')]
sunny -> [('sunny', 'NN')]
day -> [('day', 'NN')]
