<a href="https://colab.research.google.com/github/lathadevi158/Learning_AI_Sandbox/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

----------------------**NLP**-------------------

NLP stands for Natural Language Processing. In simple terms, it’s a field of Artificial Intelligence (AI) that focuses on making computers understand, interpret, and generate human language.

Examples you use every day:

ChatGPT → answering questions

Google Translate → translating languages

Spam filters → detecting unwanted emails


NLP Pipeline (Simplified)

Text Input → raw text from user/documents

Preprocessing → clean and structure the text

Feature Extraction → convert text to numbers (vectors)

Modeling / Understanding → ML or LLM predicts/generates output

Output Generation → human-readable answer/text
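
The five stages above can be sketched as plain Python functions. This is a toy illustration only: the stopword list, the spam keyword set, and all function names are assumptions invented for this example, and the "model" is a hand-written rule rather than a trained one.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
STOPWORDS = {"a", "the", "is", "to", "you", "now"}
SPAM_WORDS = {"free", "win", "prize"}

def preprocess(text):
    """Stage 2: lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def extract_features(tokens):
    """Stage 3: turn tokens into a number (count of spam-like words)."""
    return sum(1 for t in tokens if t in SPAM_WORDS)

def predict(spam_word_count):
    """Stage 4: a hand-written rule standing in for a trained model."""
    return spam_word_count >= 2

def run_pipeline(text):
    """Stages 1-5: raw text in, human-readable label out."""
    tokens = preprocess(text)
    score = extract_features(tokens)
    return "spam" if predict(score) else "not spam"

print(run_pipeline("Win a FREE prize now!"))
print(run_pipeline("The meeting is moved to Friday."))
```

In a real system, stage 3 would produce a full vector and stage 4 would be a model trained on labelled examples.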


3️⃣ Text Preprocessing Techniques

Before feeding text into a model, we clean it:

| Technique | Purpose | Example |
|---|---|---|
| Tokenization | Split text into words or subwords | "I love AI" → ["I", "love", "AI"] |
| Lowercasing | Standardize text | "AI is Cool" → "ai is cool" |
| Stopword Removal | Remove common words that add little meaning | "I love AI" → ["love", "AI"] |
| Stemming | Strip suffixes with crude rules; the result may not be a real word | "running" → "run" |
| Lemmatization | Reduce words to their dictionary form (more accurate, but needs the part of speech) | "running" → "run" |

4️⃣ Representing Text as Numbers

Models can’t understand raw text, so we convert it into vectors:

Bag of Words (BoW)

Count how many times each word appears

Example: "I love AI", "AI is fun" → vector representation

TF-IDF (Term Frequency – Inverse Document Frequency)

Gives more weight to words that are frequent in one document but rare across the whole collection

Example: common words like “is” get less weight

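Classical NLP models work from count-based vectors like these. LLMs instead learn dense embeddings for subword tokens, but the principle is the same: text must become numbers before a model can use it.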
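
Bag of Words is simple enough to build by hand. The sketch below uses only the standard library; the `tokenize` and `bag_of_words` helpers are hypothetical names for this illustration, and the whitespace tokenizer is a deliberate simplification.

```python
from collections import Counter

documents = ["I love AI", "AI is fun"]

def tokenize(text):
    # Lowercase and split on whitespace; real tokenizers also handle punctuation.
    return text.lower().split()

# Vocabulary: every distinct token across all documents, in sorted order.
vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})

def bag_of_words(text):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

print(vocabulary)
for doc in documents:
    print(bag_of_words(doc))
```

This version keeps the one-letter token "i" in the vocabulary. scikit-learn's `CountVectorizer` drops tokens shorter than two characters by default, so its vocabulary for the same documents has four entries instead of five.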


In [None]:
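import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads: tokenizer models, stopword list, and WordNet data.
nltk.download('punkt')
nltk.download('punkt_tab')  # required by word_tokenize in newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')

text = "I love exploring Generative AI with ChatGPT!"

# 1. Tokenization
tokens = word_tokenize(text)
print("Tokens:", tokens)

# 2. Stopword removal (build the set once instead of reloading the list per token)
stop_words = set(stopwords.words('english'))
filtered_tokens = [t for t in tokens if t.lower() not in stop_words]
print("Without stopwords:", filtered_tokens)

# 3. Stemming (the Porter stemmer also lowercases its output)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in filtered_tokens]
print("Stemmed:", stemmed)

# 4. Lemmatization (every word is treated as a noun unless pos is given,
#    so "exploring" is left unchanged here; pos="v" would give "explore")
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in filtered_tokens]
print("Lemmatized:", lemmatized)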



Tokens: ['I', 'love', 'exploring', 'Generative', 'AI', 'with', 'ChatGPT', '!']
Without stopwords: ['love', 'exploring', 'Generative', 'AI', 'ChatGPT', '!']
Stemmed: ['love', 'explor', 'gener', 'ai', 'chatgpt', '!']
Lemmatized: ['love', 'exploring', 'Generative', 'AI', 'ChatGPT', '!']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# Import necessary libraries for vectorization
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assuming 'documents' is a list of strings (your text data)
# Replace this with your actual data if needed
documents = ["I love AI", "AI is fun"]

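# -----------------------------
# 1️⃣ Bag of Words (BoW)
# -----------------------------
# By default CountVectorizer lowercases the text and drops one-character
# tokens, which is why "I" does not appear in the vocabulary.
bow_vectorizer = CountVectorizer()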
bow = bow_vectorizer.fit_transform(documents)

print("Bag of Words Vocabulary:")
print(bow_vectorizer.get_feature_names_out())

print("\nBag of Words Vectors:")
print(bow.toarray())

# -----------------------------
# 2️⃣ TF-IDF
# -----------------------------
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(documents)

print("\nTF-IDF Vocabulary:")
print(tfidf_vectorizer.get_feature_names_out())

print("\nTF-IDF Vectors:")
print(tfidf.toarray())

Bag of Words Vocabulary:
['ai' 'fun' 'is' 'love']

Bag of Words Vectors:
[[1 0 0 1]
 [1 1 1 0]]

TF-IDF Vocabulary:
['ai' 'fun' 'is' 'love']

TF-IDF Vectors:
[[0.57973867 0.         0.         0.81480247]
 [0.44943642 0.6316672  0.6316672  0.        ]]
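
The TF-IDF numbers above can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts as term frequency, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The variable and function names are just for this illustration.

```python
import math

# Term counts per document over the vocabulary ['ai', 'fun', 'is', 'love'].
vocabulary = ["ai", "fun", "is", "love"]
counts = [
    [1, 0, 0, 1],  # "I love AI"
    [1, 1, 1, 0],  # "AI is fun"
]
n_docs = len(counts)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df = documents containing the term.
idf = []
for j in range(len(vocabulary)):
    df = sum(1 for row in counts if row[j] > 0)
    idf.append(math.log((1 + n_docs) / (1 + df)) + 1)

def tfidf_row(row):
    """Weight each count by its IDF, then scale the row to unit length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))  # L2 norm
    return [v / norm for v in weighted]

tfidf_by_hand = [tfidf_row(row) for row in counts]
for row in tfidf_by_hand:
    print([round(v, 4) for v in row])
```

"ai" appears in both documents, so its IDF is 1.0, the minimum. "love", "fun", and "is" each appear in only one document and get a higher IDF of about 1.405, which is why they outweigh "ai" in each row.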


🍕 Stemming vs Lemmatization – The Food Analogy

Imagine words are like pizzas 🍕

| Technique | Analogy | Example |
|---|---|---|
| Stemming | You take a pizza and just bite off a big chunk. Doesn’t matter if it’s messy, you just want the main part. | "running" → "run"; "flies" → "fli" (oops, some weird bite!) |
| Lemmatization | You carefully cut the pizza into neat slices so it looks perfect and edible. | "running" → "run"; "flies" → "fly" (looks exactly like it should!) |

Stemming = messy but fast snack bite 🍽️

Lemmatization = neat, proper slice 🍕✨
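
To make the difference concrete, here is a toy version of each idea. Neither function is the real Porter stemmer or WordNet lemmatizer: the suffix list and the tiny dictionary are assumptions made up for this sketch, which only shows rule-based chopping versus dictionary lookup.

```python
# Toy rules and a toy dictionary, invented for this illustration only.
SUFFIXES = ["ing", "es", "s"]
LEMMA_DICT = {"running": "run", "flies": "fly", "better": "good"}

def toy_stem(word):
    """Chop the first matching suffix, with no check that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_DICT.get(word, word)

for w in ["running", "flies", "better"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The toy stemmer turns "running" into "runn" and "flies" into "fli". The real Porter stemmer has extra rules that clean up "runn" to "run", but it still produces "fli". Only the dictionary approach can map "better" to "good", because no suffix rule connects the two.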

1️⃣ Part-of-Speech (POS) Tagging

What it is:

Every word in a sentence has a role: noun, verb, adjective, etc.

POS tagging tells the computer what role each word plays.

Kid/Food Analogy:

Imagine a sentence is a kitchen:

Nouns = ingredients (apple, sugar)

Verbs = actions (chop, mix)

Adjectives = descriptions (sweet, fresh)

Example Sentence:
"I love eating chocolate"

I → Pronoun

love → Verb

eating → Verb

chocolate → Noun

In [None]:
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
text = "I love eating chocolate"

doc = nlp(text)
for token in doc:
    print(token.text, token.pos_)


I PRON
love VERB
eating VERB
chocolate NOUN


2️⃣ Named Entity Recognition (NER)

What it is:

Finds special entities in text: names, places, dates, organizations.

Kid/Food Analogy:

Imagine you’re picking special ingredients from a big bowl:

“Elon Musk founded SpaceX in 2002” →

Elon Musk = PERSON

SpaceX = ORG

2002 = DATE

In [None]:
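# Process the example sentence first; the doc from the POS cell holds a different sentence.
doc = nlp("Elon Musk founded SpaceX in 2002")
for ent in doc.ents:
    print(ent.text, ent.label_)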
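
For intuition, entity recognition can also be approximated with rules. The sketch below is not how spaCy works (spaCy uses a trained statistical model); it combines a hand-made lookup table, often called a gazetteer, with a regular expression for years. The `GAZETTEER` contents and the `toy_ner` function are hypothetical.

```python
import re

# Hypothetical gazetteer: a hand-made lookup of known names and their labels.
GAZETTEER = {"Elon Musk": "PERSON", "SpaceX": "ORG"}
# Four-digit years from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(?:1[0-9]{3}|20[0-9]{2})\b")

def toy_ner(text):
    """Return (entity text, label) pairs found by lookup and regex, in text order."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), name, label))
    for match in YEAR_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return [(ent, label) for _, ent, label in sorted(found)]

print(toy_ner("Elon Musk founded SpaceX in 2002"))
```

A rule-based approach like this only finds names it has been told about, which is the main reason trained models are used in practice.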


3️⃣ Sentiment Analysis

What it is:

Detects emotion or opinion in text: positive, negative, neutral.

Kid/Food Analogy:

Taste test your food:

"I love pizza!" → 😊 Positive

"This soup is terrible." → 😞 Negative

In [None]:
from textblob import TextBlob

text = "I love pizza but hate onions"
blob = TextBlob(text)
print(blob.sentiment)  # polarity in [-1, 1], subjectivity in [0, 1]

# Polarity > 0 → positive, < 0 → negative, 0 → neutral

Sentiment(polarity=-0.15000000000000002, subjectivity=0.75)
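
The slightly negative score comes from averaging a positive word and a more strongly negative one. TextBlob's default analyzer is lexicon-based, and the sketch below imitates that idea with a two-word toy lexicon. The scores for "love" and "hate" are assumptions chosen to be consistent with the output above; they are not read from TextBlob, and the real analyzer also handles negation and intensifiers.

```python
# Toy polarity lexicon. The two scores are assumptions chosen to match the
# output above; they are not read from TextBlob at runtime.
POLARITY = {"love": 0.5, "hate": -0.8}

def toy_polarity(text):
    """Average the polarity of every word found in the lexicon (0.0 if none)."""
    words = text.lower().split()
    scores = [POLARITY[w] for w in words if w in POLARITY]
    return sum(scores) / len(scores) if scores else 0.0

score = toy_polarity("I love pizza but hate onions")
label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
print(round(score, 2), label)
```

Because "hate" is weighted more strongly than "love" in this lexicon, the mixed sentence ends up just below zero.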


4️⃣ Word Embeddings (Word2Vec / GloVe)

What it is:

Converts words into numbers (vectors) so models can see similarity and meaning.

Words with similar meaning have vectors close to each other.

Kid/Food Analogy:

Imagine a fridge map:

“apple” is near “banana” (both fruits)

“carrot” is a bit farther (vegetable)

So the computer knows which words are “similar” in meaning.
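
"Close to each other" is usually measured with cosine similarity: the cosine of the angle between two vectors. The sketch below uses made-up 3-dimensional vectors for the fridge example; real embeddings are learned from data and have many more dimensions.

```python
import math

# Made-up vectors for illustration only; real embeddings are learned from text.
vectors = {
    "apple":  [0.9, 0.8, 0.1],
    "banana": [0.8, 0.9, 0.2],
    "carrot": [0.2, 0.3, 0.9],
}

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fruit_sim = cosine_similarity(vectors["apple"], vectors["banana"])
veg_sim = cosine_similarity(vectors["apple"], vectors["carrot"])
print(round(fruit_sim, 3), round(veg_sim, 3))
```

With these toy vectors, "apple" scores much higher against "banana" than against "carrot". gensim's `similarity` method computes the same cosine measure on learned vectors.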

In [None]:
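from gensim.models import Word2Vec

# Toy corpus: far too small to learn meaningful vectors, so the similarity
# score printed below is essentially noise. Note that "Pizza" and "pizza"
# are different tokens because the vocabulary is case-sensitive.
sentences = [["I", "love", "pizza"], ["Pizza", "is", "delicious"], ["I", "hate", "onions"]]
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, workers=1)

print(model.wv['pizza'])  # vector representation of "pizza"
print(model.wv.similarity('pizza', 'delicious'))  # cosine similarity in [-1, 1]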


[-0.01577653  0.00321372 -0.0414063  -0.07682689 -0.01508008  0.02469795
 -0.00888027  0.05533662 -0.02742977  0.02260065]
-0.11387499
