# Contents
- NLP
- Regular Expressions
- Tokenization
- Stemming
    - PorterStemmer
    - SnowballStemmer
    - LancasterStemmer
- Lemmatization
- Parts Of Speech Tagging
- Named Entity Recognition
- Text Representation
    - Bag of Words
    - Label Encoding
    - One-Hot Encoding
- Stop words
- Word vectors

# NLP
A field of AI that focuses on the interaction between computers and humans through natural language. The ultimate goal of NLP is to enable computers to understand, interpret and generate human languages in a way that is both meaningful and useful.

## Applications of NLP
- Search Engines
- Chatbots
- Language Translation
- Text Classification

# Regular Expressions
- `.` - matches any character except a newline
- `\w` - matches any word character (alphanumeric or underscore; equivalent to `[a-zA-Z0-9_]` for ASCII text)
- `\d` - matches any digit (`[0-9]`)
- `\s` - matches any whitespace character
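
The character classes above can be tried out with Python's built-in `re` module. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

text = "Order 66 shipped\non 2024-05-01"

# `.` matches any character except a newline, so the match stops before "\n".
first_line = re.match(r".+", text).group()

# `\w+` matches runs of word characters (letters, digits, underscore).
word_runs = re.findall(r"\w+", text)

# `\d+` matches runs of digits.
digit_runs = re.findall(r"\d+", text)

# `\s` matches a single whitespace character (space, tab, newline, ...).
whitespace = re.findall(r"\s", text)

print(first_line)
print(word_runs)
print(digit_runs)
print(whitespace)
```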

# Tokenization
It involves splitting text into smaller units, known as tokens. These tokens can be words, sentences, subwords or characters, depending on the granularity of the tokenization.
## Types
- Word Tokenization
- Sentence Tokenization
- Subword Tokenization
- Character Tokenization
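
To make the four granularities concrete, here is a rough sketch using only the standard library. The regular expressions and the fixed-size `chunk` helper are illustrative assumptions, not how NLTK or a trained subword tokenizer works; real subword tokenizers learn their pieces from data.

```python
import re

text = "Tokenizers split text. They vary in granularity."

# Sentence tokens: split on whitespace that follows sentence-ending punctuation.
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Word tokens: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokens: every character of a word becomes its own token.
char_tokens = list("text")

def chunk(word, size=3):
    """Toy subword split into fixed-size pieces (hypothetical helper)."""
    return [word[i:i + size] for i in range(0, len(word), size)]

subword_tokens = chunk("granularity")

print(sentence_tokens)
print(word_tokens)
print(char_tokens)
print(subword_tokens)
```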

In [6]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt_tab')  # tokenizer models used by word_tokenize and sent_tokenize

sentence="The quick brown fox jumps over the lazy dog. It was a sunny day"

words=word_tokenize(sentence)
sentences=sent_tokenize(sentence)
print("Word Tokens: ",end="")
print(words)
print("Sentence Tokens: ",end="")
print(sentences)

Word Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'It', 'was', 'a', 'sunny', 'day']
Sentence Tokens: ['The quick brown fox jumps over the lazy dog.', 'It was a sunny day']


[nltk_data] Error loading punkt_tab: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


# Stemming
A text normalization technique used to reduce words to their base/root form. It simplifies text data by reducing derived words to a common base form so that they can be analyzed as a single item.

Stemming algorithms typically remove common word suffixes (ing, ly, ed) to transform a word into its root form.

__Example:__ `running` -> `run`, `better` -> `bet`
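
The suffix-removal idea can be sketched in a few lines. The `naive_stem` function below is a hypothetical toy, not the Porter, Snowball or Lancaster algorithm; those apply many more rules (for example, handling doubled consonants).

```python
def naive_stem(word, suffixes=("ing", "ly", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered

for word in ["jumping", "quickly", "jumped", "jumps", "dog"]:
    print(f"{word} -> {naive_stem(word)}")
```

With this toy rule set `running` becomes `runn`, which shows why real stemmers need extra clean-up steps beyond plain suffix stripping.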

## PorterStemmer

In [7]:
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [8]:
porter=PorterStemmer()
for word in words:
    print(f"{word} -> {porter.stem(word)}")

The -> the
quick -> quick
brown -> brown
fox -> fox
jumps -> jump
over -> over
the -> the
lazy -> lazi
dog -> dog
. -> .
It -> it
was -> wa
a -> a
sunny -> sunni
day -> day


## SnowballStemmer

In [None]:
snowball=SnowballStemmer(language='english')
for word in words:
    print(f"{word}->{snowball.stem(word)}")

## LancasterStemmer

In [None]:
lancaster=LancasterStemmer()
for word in words:
    print(f"{word}->{lancaster.stem(word)}")

# Lemmatization
A text normalization technique used to reduce words to their base form (the lemma). Unlike stemming, it considers the context and morphological analysis of words, so the result is a meaningful dictionary word rather than a truncated stem.

__Example:__ `running` -> `run`, `better` -> `good`
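
Conceptually, a lemmatizer looks words up in a dictionary, and the part of speech decides which entry applies. The sketch below uses a tiny hand-made table (`LEMMA_TABLE` and `toy_lemmatize` are hypothetical names) purely to illustrate that lookup; a real lemmatizer relies on a full lexicon such as WordNet.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "verb"): "run",
    ("better", "adj"): "good",
    ("better", "adv"): "well",
    ("was", "verb"): "be",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(toy_lemmatize("running", "verb"))
print(toy_lemmatize("better", "adj"))
print(toy_lemmatize("better", "adv"))
print(toy_lemmatize("Dog", "noun"))
```

The same surface form (`better`) maps to different lemmas depending on its part of speech, which is the context a stemmer ignores.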

In [None]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()

In [None]:
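nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')  # needed by nltk.pos_tag below

def get_wordnet_pos(word):
    """Map the Penn Treebank tag of a single word to a WordNet POS constant."""
    # First letter of the tag: J (adjective), N (noun), V (verb), R (adverb)
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # Fall back to noun for any other tag
    return tag_dict.get(tag, wordnet.NOUN)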

In [None]:
for word in words:
    pos = get_wordnet_pos(word)
    lemmatized_word = lemmatizer.lemmatize(word, pos)
    print(f"{word}->{lemmatized_word}")

# Parts Of Speech Tagging
It involves assigning parts of speech to each word in a sentence or text.
## Tags
- `NN` - Noun
- `VB` - Verb
- `JJ` - Adjective
- `RB` - Adverb
- `PRP` - Pronoun
- `IN` - Preposition
- `CC` - Conjunction
- `DT` - Determiner
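
Taggers usually emit finer-grained variants of these tags (for example `NNS` for plural nouns or `VBD` for past-tense verbs). As a rough sketch, the prefixes listed above can be used to collapse them into coarse categories. The `hand_tagged` pairs are written by hand for illustration, not produced by a tagger.

```python
# Coarse category for each tag prefix listed above.
COARSE_TAGS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "PRP": "pronoun",
    "IN": "preposition",
    "CC": "conjunction",
    "DT": "determiner",
}

def coarse_category(tag):
    """Return the coarse category whose prefix matches the tag."""
    for prefix, category in COARSE_TAGS.items():
        if tag.startswith(prefix):
            return category
    return "other"

# Hand-tagged example tokens (assumed tags, for illustration only).
hand_tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
coarse = [(word, coarse_category(tag)) for word, tag in hand_tagged]
print(coarse)
```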

In [None]:
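import nltk
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger_eng')
# Tag the whole token list at once so the tagger can use neighbouring words as context
for word, tag in pos_tag(words):
    print(f"{word} -> {tag}")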

# Named Entity Recognition
It involves identifying and `classifying` named entities in text into `predefined categories` such as persons, organizations, locations, dates and more.
## Categories of Named Entities
- `PER` - Person
- `ORG` - Organization
- `LOC` - Location
- `DATE/TIME` - Date/Time
- `MONEY` - Monetary Values
- `PERCENT` - Percentage
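
Some categories, such as `MONEY` and `PERCENT`, follow regular surface patterns, so a rule-based sketch can illustrate the "identify, then classify" idea. The patterns and the `toy_ner` function below are simplified assumptions; statistical models such as spaCy's are needed for categories like persons or organizations.

```python
import re

# Hypothetical, deliberately simple patterns for two pattern-like categories.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?: (?:million|billion))?"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?%"),
}

def toy_ner(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(span, label) for _, span, label in sorted(found)]

entities = toy_ner("The startup was bought for $1 billion, a 5% premium.")
print(entities)
```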

In [None]:
!pip install spacy

In [25]:
# python -m spacy download en_core_web_sm

In [21]:
import spacy
nlp=spacy.load("en_core_web_sm")

In [27]:
# doc=nlp(sentence)
doc=nlp("Apple is looking at buying U.K. startup for $1 billion. Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple ORG
U.K. GPE
$1 billion MONEY
Barack Obama PERSON
August 4, 1961 DATE
Honolulu GPE
Hawaii GPE


# Text Representation
It converts textual data into a numerical format that machine learning algorithms can process.

In [4]:
documents=[
    "Natural Language Processing is fascinating.",
    "Text represntation is crucial in NLP.",
    "Word embeddings are a powerful tool in NLP"
]

## Bag of Words
It represents text as a collection of words without considering their order. It counts the frequency of each word in a document. The result is a vector where each dimension corresponds to a unique word in the corpus.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [45]:
vectorizer=CountVectorizer()
bow_matrix=vectorizer.fit_transform(documents)

### Vocabulary
It refers to the set of all unique words found across the entire corpus.

In [46]:
print(vectorizer.get_feature_names_out())

['are' 'crucial' 'embeddings' 'fascinating' 'in' 'is' 'language' 'natural'
 'nlp' 'powerful' 'processing' 'represntation' 'text' 'tool' 'word']


### Vectorization
Each document's vector has the same length as the number of unique words in the vocabulary.

__First Sentence:__ Natural Language Processing is fascinating.

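- `are`           - not present - 0
- `crucial`       - not present - 0
- `embeddings`    - not present - 0
- `fascinating`   - present     - 1
- `in`            - not present - 0
- `is`            - present     - 1
- `language`      - present     - 1
- `natural`       - present     - 1
- `nlp`           - not present - 0
- `powerful`      - not present - 0
- `processing`    - present     - 1
- `represntation` - not present - 0
- `text`          - not present - 0
- `tool`          - not present - 0
- `word`          - not present - 0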

In [47]:
print(bow_matrix.toarray())

[[0 0 0 1 0 1 1 1 0 0 1 0 0 0 0]
 [0 1 0 0 1 1 0 0 1 0 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 1 0 0 0 1 1]]


Each word appears at most __once__ in its sentence, which is why the array contains only 0s and 1s. If a word appeared multiple times in a sentence, its count would be shown in the matrix instead.
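
To see counts above 1, here is a from-scratch sketch of the same idea using `collections.Counter`. The two short documents are made up, and the `\w+` tokenizer is a simplification (the default `CountVectorizer` tokenizer, for instance, ignores single-character tokens).

```python
from collections import Counter
import re

docs = ["the cat chased the mouse", "the mouse escaped"]

def tokenize(doc):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", doc.lower())

# Vocabulary: every unique word in the corpus, sorted alphabetically.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})

# One count vector per document, aligned with the vocabulary order.
bow = []
for doc in docs:
    counts = Counter(tokenize(doc))
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

The word `the` occurs twice in the first document, so its column holds 2 in the first row.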

In [48]:
print(bow_matrix)

  (0, 7)	1
  (0, 6)	1
  (0, 10)	1
  (0, 5)	1
  (0, 3)	1
  (1, 5)	1
  (1, 12)	1
  (1, 11)	1
  (1, 1)	1
  (1, 4)	1
  (1, 8)	1
  (2, 4)	1
  (2, 8)	1
  (2, 14)	1
  (2, 2)	1
  (2, 0)	1
  (2, 9)	1
  (2, 13)	1


#### `(0, 7)	1`
- `0` refers to the document (row) index, here the first sentence
- `7` refers to the word at index 7 in the vocabulary (`natural`)
- `1` means the word appears __1__ time in that sentence
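
The relationship between this sparse listing and the dense array can be sketched in plain Python. The triples below are the first sentence's entries copied from the output above; the variable names are illustrative.

```python
# (row, column, count) triples for the first sentence, taken from the sparse output.
entries = [(0, 7, 1), (0, 6, 1), (0, 10, 1), (0, 5, 1), (0, 3, 1)]
n_rows, n_cols = 1, 15  # one document, 15 vocabulary words

# Start from all zeros and fill in only the stored entries.
dense = [[0] * n_cols for _ in range(n_rows)]
for row, col, count in entries:
    dense[row][col] = count

print(dense[0])
```

Only the non-zero cells are stored in the sparse form, which saves memory when the vocabulary is large and most counts are zero.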

## Label Encoding
It converts categorical data into numerical data by assigning a unique integer value to each category.

In [16]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [28]:
categorical_data = {
    "Color": ["Red", "Blue", "Green", "Blue", "Red"],
    "Fruit": ["Apple", "Banana", "Cherry", "Date", "Elderberry"],
    "Brand": ["Apple", "Samsung", "Google", "OnePlus", "Huawei"],
    "Education": ["High School", "Associate's Degree", "Bachelor's Degree", "Master's Degree", "PhD"],
    "Country": ["USA", "Canada", "India", "Germany", "Australia"]
}
df=pd.DataFrame(categorical_data)

In [34]:
label_encoder=LabelEncoder()
label_encoded_data={}
for column in df.columns:
    label_encoded_data[column]=label_encoder.fit_transform(df[column])

In [37]:
encoded_df = pd.DataFrame(label_encoded_data)
print(encoded_df)

   Color  Fruit  Brand  Education  Country
0      2      0      0          2        4
1      0      1      4          0        1
2      1      2      1          1        3
3      0      3      3          3        2
4      2      4      2          4        0


Let's take a feature called `Color` with the following values: `['Red', 'Blue', 'Green', 'Blue', 'Red']`.

Each unique category in the feature is assigned an integer label. scikit-learn's `LabelEncoder` sorts the categories first, so string labels are numbered in alphabetical order rather than in order of appearance. For the Color feature:
- Blue → 0
- Green → 1
- Red → 2
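
The mapping can be reproduced by hand, which makes the sorted-order assignment explicit. This is a minimal sketch of the idea, not the library's implementation.

```python
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Sort the unique categories, then number them from 0.
classes = sorted(set(colors))
label_of = {category: index for index, category in enumerate(classes)}

encoded = [label_of[color] for color in colors]

print(label_of)
print(encoded)
```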

## One-Hot Encoding
It converts categorical data into a numerical format.

Unlike label encoding, which assigns an integer to each category, one-hot encoding creates a binary column for each category.

For each unique category in the feature, one-hot encoding creates a new binary column. Each column corresponds to one category and has a binary value:
- If the instance belongs to that category, the column value is 1.
- If the instance does not belong to that category, the column value is 0.

__For Example:__
- `Blue`  -> [0, 1, 0, 1, 0]
- `Green` -> [0, 0, 1, 0, 0]
- `Red`   -> [1, 0, 0, 0, 1]
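
The columns in this example can be built with a short comprehension. This is a plain-Python sketch of the idea, assuming the same `Color` values as above.

```python
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# One binary column per category: 1 where the row has that category, else 0.
classes = sorted(set(colors))
one_hot = {
    category: [1 if color == category else 0 for color in colors]
    for category in classes
}

for category, column in one_hot.items():
    print(category, column)
```

Each row has exactly one 1 across the three columns, which is where the name "one-hot" comes from.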

In [40]:
one_hot_encoded_df = pd.get_dummies(df, columns=['Color'])
print(one_hot_encoded_df)

        Fruit    Brand           Education    Country  Color_Blue  \
0       Apple    Apple         High School        USA       False   
1      Banana  Samsung  Associate's Degree     Canada        True   
2      Cherry   Google   Bachelor's Degree      India       False   
3        Date  OnePlus     Master's Degree    Germany        True   
4  Elderberry   Huawei                 PhD  Australia       False   

   Color_Green  Color_Red  
0        False       True  
1        False      False  
2         True      False  
3        False      False  
4        False       True  


# Stop words
Stop words are common words in a language that are often filtered out before or after processing natural language data. These words are typically the most frequent and carry little semantic weight, meaning they don't contribute much to the meaning of a sentence. 

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

In [50]:
sentence = "This is a sample sentence, showing off the stop words filtration."
words = word_tokenize(sentence)
stop_words = set(stopwords.words('english'))
filtered_sentence = [word for word in words if word.lower() not in stop_words]

print("Original Sentence:", sentence)
print("Filtered Sentence:", " ".join(filtered_sentence))

Original Sentence: This is a sample sentence, showing off the stop words filtration.
Filtered Sentence: sample sentence , showing stop words filtration .
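
Note that the punctuation tokens `,` and `.` survive the filter above because they are not in the stop word list. A possible follow-up step, sketched here with a hand-written token list and a small assumed subset of stop words, is to drop tokens that contain no letters or digits.

```python
# Tokens as produced for the sample sentence above (written out by hand here).
tokens = ["This", "is", "a", "sample", "sentence", ",", "showing", "off",
          "the", "stop", "words", "filtration", "."]

# Assumed subset of English stop words, enough for this sentence.
stop_subset = {"this", "is", "a", "off", "the"}

filtered = [
    token for token in tokens
    if token.lower() not in stop_subset          # drop stop words
    and any(ch.isalnum() for ch in token)        # drop pure punctuation
]

print(" ".join(filtered))
```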


# Word vectors
Word vectors are mathematical representations of words as continuous vectors in a high-dimensional space. These vectors capture the semantic meaning of words by positioning them in such a way that words with similar meanings are close to each other in this space.
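
"Close to each other" is commonly measured with cosine similarity. The sketch below uses three hand-picked 3-dimensional vectors (`toy_vectors` is invented for illustration; trained vectors have far more dimensions and are learned from text).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: "cat" and "dog" point in similar directions, "car" does not.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

A value near 1 means the vectors point in almost the same direction; lower values indicate less similar words.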

In [None]:
!pip install gensim

In [56]:
from gensim.models import Word2Vec

In [61]:
tokenized_data=[]
for document in documents:
    tokenized_data.append(word_tokenize(document))

In [62]:
# Train a Word2Vec model
model = Word2Vec(tokenized_data, vector_size=100, window=5, min_count=1, sg=0)

In [63]:
word="NLP"
word_vector = model.wv[word]

print(word_vector)

[-8.6205509e-03  3.6664880e-03  5.1899054e-03  5.7425159e-03
  7.4683712e-03 -6.1678356e-03  1.1054418e-03  6.0482449e-03
 -2.8410859e-03 -6.1737327e-03 -4.1060109e-04 -8.3700540e-03
 -5.5999835e-03  7.1053654e-03  3.3535317e-03  7.2259018e-03
  6.8013361e-03  7.5322888e-03 -3.7902421e-03 -5.6239933e-04
  2.3480400e-03 -4.5200605e-03  8.3894283e-03 -9.8599615e-03
  6.7658601e-03  2.9142674e-03 -4.9328064e-03  4.3991935e-03
 -1.7402744e-03  6.7124371e-03  9.9663734e-03 -4.3635760e-03
 -5.9996103e-04 -5.6965039e-03  3.8506347e-03  2.7867861e-03
  6.8917973e-03  6.1020418e-03  9.5395558e-03  9.2743067e-03
  7.8984145e-03 -6.9901976e-03 -9.1569778e-03 -3.5512337e-04
 -3.0998648e-03  7.8954669e-03  5.9394771e-03 -1.5460390e-03
  1.5107916e-03  1.7902239e-03  7.8186011e-03 -9.5114801e-03
 -2.0595833e-04  3.4690076e-03 -9.3987427e-04  8.3821919e-03
  9.0115890e-03  6.5369126e-03 -7.1222446e-04  7.7108950e-03
 -8.5348953e-03  3.2071467e-03 -4.6378332e-03 -5.0894269e-03
  3.5893617e-03  5.37174