# NLP
Natural Language Processing (NLP) is a field of AI that focuses on the interaction between computers and humans through natural language. Its ultimate goal is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

## Applications of NLP
- Search Engines
- Chatbots
- Language Translation
- Text Classification

# Regular Expressions
- `.` - matches any character except a newline
- `\w` - matches any word character (alphanumeric or underscore; equivalent to `[a-zA-Z0-9_]` for ASCII text)
- `\d` - matches any digit (`[0-9]`)
- `\s` - matches any whitespace character
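
As a quick illustration of the patterns above, here is a minimal sketch using Python's built-in `re` module. The sample string and the variable names are made up for this example.

```python
import re

sample = "Order 66 shipped on 2024-05-17\nto Unit_9."

# `.` matches any character except a newline, so the match stops at the line break.
first_line = re.match(r".+", sample).group()

# `\d+` matches runs of one or more digits.
numbers = re.findall(r"\d+", sample)

# `\w+` matches runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", sample)

# `\s` matches a single whitespace character (space, newline, tab, ...).
whitespace_count = len(re.findall(r"\s", sample))

print(first_line)        # Order 66 shipped on 2024-05-17
print(numbers)           # ['66', '2024', '05', '17', '9']
print(word_runs)
print(whitespace_count)  # 6
```

Note that `Unit_9` comes back as a single `\w+` match because the underscore counts as a word character.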

# Installation

In [None]:
!pip install nltk

In [None]:
import nltk

# Tokenization
Tokenization splits text into smaller units, known as tokens. These tokens can be characters, words, phrases, sentences, or other meaningful units, depending on the granularity of the tokenization.
## Types
- Word Tokenization
- Sentence Tokenization
- Subword Tokenization
- Character Tokenization
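
To make the four granularities concrete, here is a rough sketch that uses only the standard library. The regex rules are deliberately naive, and `subword_tokenize` with its `toy_vocab` is a hypothetical helper for illustration. Real subword tokenizers learn their vocabulary from data.

```python
import re

text = "Tokenizers split text. They are useful!"

# Sentence tokenization: split on whitespace that follows ., ! or ? (naive rule).
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character becomes its own token.
char_tokens = list("split")

def subword_tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

toy_vocab = {"token", "izer", "s"}  # hypothetical vocabulary, for illustration only
subword_tokens = subword_tokenize("tokenizers", toy_vocab)

print(sentence_tokens)  # ['Tokenizers split text.', 'They are useful!']
print(word_tokens)
print(char_tokens)      # ['s', 'p', 'l', 'i', 't']
print(subword_tokens)   # ['token', 'izer', 's']
```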

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt_tab')

sentence="The quick brown fox jumps over the lazy dog. It was a sunny day"

words=word_tokenize(sentence)
sentences=sent_tokenize(sentence)
print("Word Tokens: ",end="")
print(words)
print("Sentence Tokens: ",end="")
print(sentences)

# Stemming
A text normalization technique used to reduce words to their base/root form. It simplifies text data by reducing derived words to a common base form so that they can be analyzed as a single item.

Stemming algorithms typically remove common word suffixes (`ing`, `ly`, `ed`) to transform a word into its root form. Because they apply rules without a dictionary, the result is not always a valid word.

__Example:__ `running` -> `run`, `better` -> `bet`
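
The suffix-removal idea can be sketched in a few lines. This is not the Porter, Snowball, or Lancaster algorithm, only a simplified illustration. `naive_stem` and its minimum-length rule are assumptions made for this example.

```python
def naive_stem(word, suffixes=("ing", "ly", "ed")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumping", "quickly", "jumped", "running", "red"]:
    print(f"{w} -> {naive_stem(w)}")
# jumping -> jump
# quickly -> quick
# jumped -> jump
# running -> runn
# red -> red
```

The `running -> runn` result shows why real stemmers add extra rules, for instance for doubled consonants.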

## PorterStemmer

In [None]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [None]:
porter=PorterStemmer()
for word in words:
    print(f"{word} -> {porter.stem(word)}")

## SnowballStemmer

In [None]:
snowball=SnowballStemmer(language='english')
for word in words:
    print(f"{word}->{snowball.stem(word)}")

## LancasterStemmer

In [None]:
lancaster=LancasterStemmer()
for word in words:
    print(f"{word}->{lancaster.stem(word)}")

# Lemmatization
A text normalization technique that also reduces words to their base form. Unlike stemming, it uses a vocabulary and morphological analysis, together with the word's part of speech, so the result (the lemma) is a valid dictionary word.

__Example:__ `running` -> `run`, `better` -> `good`
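
One way to picture the difference from stemming is a dictionary lookup keyed by the word and its part of speech. The sketch below is a toy. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer consults a full lexicon such as WordNet instead of a four-entry table.

```python
# Hypothetical lookup table keyed by (word, part of speech).
TOY_LEMMAS = {
    ("running", "v"): "run",   # verb
    ("better", "a"): "good",   # adjective
    ("better", "r"): "well",   # adverb
    ("mice", "n"): "mouse",    # noun
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word unchanged if it is unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", "a"))   # good
print(toy_lemmatize("better", "r"))   # well
print(toy_lemmatize("running", "v"))  # run
print(toy_lemmatize("running", "n"))  # running (no entry for the noun reading)
```

The same surface form maps to different lemmas depending on the part of speech, which is why the lemmatizer needs a POS hint.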

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

In [None]:
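nltk.download('wordnet')
nltk.download('omw-1.4')
# get_wordnet_pos() below calls nltk.pos_tag, which needs the tagger model
nltk.download('averaged_perceptron_tagger_eng')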

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
for word in words:
    pos = get_wordnet_pos(word)
    lemmatized_word = lemmatizer.lemmatize(word, pos)
    print(f"{word}->{lemmatized_word}")

# Parts Of Speech Tagging
It involves assigning parts of speech to each word in a sentence or text.
## Tags
- `NN` - Noun
- `VB` - Verb
- `JJ` - Adjective
- `RB` - Adverb
- `PRP` - Pronoun
- `IN` - Preposition
- `CC` - Conjunction
- `DT` - Determiner

In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
# Tag the whole token list at once so the tagger can use the surrounding words as context
for word, tag in pos_tag(words):
    print(f"{word} -> {tag}")

# Named Entity Recognition
It involves identifying named entities in text and classifying them into predefined categories such as persons, organizations, locations, dates and more.
## Categories of Named Entities
- `PER` - Person
- `ORG` - Organization
- `LOC` - Location
- `DATE/TIME` - Date/Time
- `MONEY` - Monetary Values
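- `PERCENT` - Percentage

Label names vary by library and model. spaCy's `en_core_web_sm`, used below, reports `PERSON` rather than `PER` and uses `GPE` for countries, cities and states.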

In [None]:
!pip install spacy

In [25]:
# Run once to download the model (prefix with ! inside a notebook):
# python -m spacy download en_core_web_sm

In [21]:
import spacy
nlp=spacy.load("en_core_web_sm")

In [27]:
# doc=nlp(sentence)
doc=nlp("Apple is looking at buying U.K. startup for $1 billion. Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
Barack Obama PERSON
August 4, 1961 DATE
Honolulu GPE
Hawaii GPE
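
A common next step is to group the recognized entities by label. The sketch below hard-codes the `(text, label)` pairs from the output above so that it runs without spaCy. With a loaded model, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
entities = [
    ("Apple", "ORG"), ("U.K.", "GPE"), ("$1 billion", "MONEY"),
    ("Barack Obama", "PERSON"), ("August 4, 1961", "DATE"),
    ("Honolulu", "GPE"), ("Hawaii", "GPE"),
]

# Collect entity texts under their label, preserving the order of appearance.
entities_by_label = defaultdict(list)
for ent_text, ent_label in entities:
    entities_by_label[ent_label].append(ent_text)

for ent_label, ent_texts in entities_by_label.items():
    print(f"{ent_label}: {ent_texts}")
# ORG: ['Apple']
# GPE: ['U.K.', 'Honolulu', 'Hawaii']
# MONEY: ['$1 billion']
# PERSON: ['Barack Obama']
# DATE: ['August 4, 1961']
```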
