
## 🧠 Natural Language Processing (NLP) - Comprehensive Notes 🚀

## **1️⃣ Introduction to NLP**
### 🔹 What is NLP?
- NLP (Natural Language Processing) is a field of AI that enables computers to understand, interpret, and respond to human language.
- It combines **linguistics, computer science, and machine learning**.
- Applications include **chatbots, translation, sentiment analysis, spam detection, and voice assistants**.

### 🔹 Key Challenges in NLP:
- **Ambiguity**: "I saw a man with a telescope" (Who has the telescope?).
- **Synonyms & Homonyms**: Different words can share a meaning (happy, joyful), and one word can have several meanings (a river bank vs. a savings bank).
- **Context Understanding**: "Apple is a company" vs. "I ate an apple."
- **Sarcasm & Idioms**: "Oh, great! Another bug in the code."

---

## **2️⃣ Text Preprocessing**
### 🔹 Why is Preprocessing Needed?
- Raw text contains **punctuation, special characters, stopwords**, etc., which usually need to be cleaned or normalized.
- Preprocessing improves **feature extraction & model performance**.

### **🔹 Steps in Text Preprocessing**
#### ✅ **1. Tokenization**
- Breaking text into smaller units (**words or sentences**).
```python
import nltk
nltk.download('punkt')  # one-time download; newer NLTK releases may need 'punkt_tab' instead
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLP is amazing! Let's learn it."
print(word_tokenize(text))  # ['NLP', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'it', '.']
print(sent_tokenize(text))  # ['NLP is amazing!', "Let's learn it."]
```
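
To see roughly what a tokenizer does, here is a minimal regex-based sketch using only the standard library. It is a simplification, not NLTK's algorithm: `simple_word_tokenize` and `simple_sent_tokenize` are hypothetical helpers, and unlike `word_tokenize` they keep contractions such as "Let's" in one piece.

```python
import re

def simple_word_tokenize(text):
    # A word is a run of letters, digits, or apostrophes;
    # any other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def simple_sent_tokenize(text):
    # Split on whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is amazing! Let's learn it."
words = simple_word_tokenize(text)
sents = simple_sent_tokenize(text)
print(words)  # ['NLP', 'is', 'amazing', '!', "Let's", 'learn', 'it', '.']
print(sents)  # ['NLP is amazing!', "Let's learn it."]
```

A rule this simple breaks on abbreviations like "Dr. Smith", which is one reason to prefer a trained tokenizer in practice.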

#### ✅ **2. Stopword Removal**
- Removing very common words that carry little meaning on their own (**"is", "the", "and"**).
```python
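import nltk
nltk.download('stopwords')  # one-time download of the stopword corpus
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # a set gives fast membership tests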
```
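
The filtering step itself is a one-line list comprehension. The sketch below assumes a small hand-picked `STOPWORDS` set (hypothetical, far shorter than NLTK's English list) so that it runs without any downloads.

```python
# Hypothetical, hand-picked stopword set; NLTK's English list is much longer.
STOPWORDS = {"is", "the", "and", "a", "an", "of", "to", "it"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so "The" and "the" are treated alike.
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["The", "model", "is", "fast", "and", "accurate"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['model', 'fast', 'accurate']
```

To use NLTK's list instead, pass `set(stopwords.words('english'))` as the second argument.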

#### ✅ **3. Stemming**
- Reducing words to a stem by stripping suffixes (**running → run**); the result is not always a dictionary word (**studies → studi**).
```python
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("running"))  # 'run'
```

#### ✅ **4. Lemmatization**
- Converts words to their dictionary base form (lemma) using vocabulary and part of speech, so the result is a real word (**better → good**).
```python
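import nltk
nltk.download('wordnet')  # one-time download of the WordNet data
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'  (pos="v": treat as verb)
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (pos="a": treat as adjective)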
```
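
The difference between the two approaches can be sketched with toy versions: a stemmer applies suffix rules blindly, while a lemmatizer looks words up. Both helpers and the `LEMMAS` table below are hypothetical stand-ins, not the Porter algorithm or WordNet.

```python
def toy_stem(word):
    # Crude suffix stripping with no dictionary, so the result may not be a real word.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for WordNet.
LEMMAS = {"running": "run", "better": "good", "studies": "study", "mice": "mouse"}

def toy_lemmatize(word):
    # Dictionary lookup: returns a real base form, or the word itself if unknown.
    return LEMMAS.get(word, word)

for w in ["running", "studies", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# running runn run
# studies studi study
# better better good
```

The toy stemmer leaves "runn" because it has no rule for doubled consonants; the real Porter stemmer handles that case but still produces non-words such as "studi".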

#### ✅ **5. Removing Special Characters & Punctuation**
```python
import re
text = "Hello!! This is NLP 101."
clean_text = re.sub(r'[^a-zA-Z ]', '', text)  # Keep only letters and spaces (this also drops digits)
print(clean_text)  # 'Hello This is NLP '
```

---

## **3️⃣ N-Grams**
### 🔹 What are N-Grams?
- **N-Grams** are contiguous sequences of *n* tokens (here, words) taken from a text:
  - **Unigram** → ["I", "love", "NLP"]
  - **Bigram** → ["I love", "love NLP"]
  - **Trigram** → ["I love NLP"]
```python
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
tokens = word_tokenize("I love NLP")
bigrams = list(ngrams(tokens, 2))
print(bigrams)  # [('I', 'love'), ('love', 'NLP')]
```
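
N-grams are simple enough to build by hand. This sketch uses only the standard library; `make_ngrams` is a hypothetical helper, and the sentence is split on whitespace instead of being tokenized with NLTK.

```python
from collections import Counter

def make_ngrams(tokens, n):
    # Zipping n staggered views of the token list yields every window of n tokens.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "I love NLP and I love Python".split()
bigrams = make_ngrams(tokens, 2)
counts = Counter(bigrams)
print(bigrams[:3])            # [('I', 'love'), ('love', 'NLP'), ('NLP', 'and')]
print(counts.most_common(1))  # [(('I', 'love'), 2)]
```

Counting n-grams like this is the starting point for simple language models and for n-gram features in classifiers.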

---

## **4️⃣ Feature Extraction**
### **🔹 Bag of Words (BoW)**
- Represents each document as a vector of word counts, ignoring word order; stacking the vectors gives a document-by-word matrix.
```python
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["NLP is great", "Machine learning is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['fun', 'great', 'is', 'learning', 'machine', 'nlp']
```
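
The count matrix behind BoW can be built by hand to see what each row and column means. This is a sketch, assuming plain lowercase whitespace splitting, which is simpler than `CountVectorizer`'s own tokenizer.

```python
corpus = ["NLP is great", "Machine learning is fun"]

# Lowercase and split on whitespace (a simplification).
docs = [doc.lower().split() for doc in corpus]

# Sorted vocabulary: one column per unique word.
vocab = sorted({word for doc in docs for word in doc})

# One row per document, one count per vocabulary word.
matrix = [[doc.count(word) for word in vocab] for doc in docs]

print(vocab)  # ['fun', 'great', 'is', 'learning', 'machine', 'nlp']
for row in matrix:
    print(row)
# [0, 1, 1, 0, 0, 1]
# [1, 0, 1, 1, 1, 0]
```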

### **🔹 TF-IDF (Term Frequency-Inverse Document Frequency)**
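- Weights each word by how often it appears in a document, scaled down by how many documents contain it, so distinctive words score high and common ones score low.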
```python
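from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["NLP is great", "Machine learning is fun"]  # same corpus as the BoW example
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # one row of TF-IDF weights per document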
```
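
The weighting can be worked out by hand with the textbook formula `tf * log(N / df)`. This is a sketch of that plain variant only; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The helper names are hypothetical.

```python
import math

corpus = ["NLP is great", "Machine learning is fun"]
docs = [doc.lower().split() for doc in corpus]
n_docs = len(docs)

def tf(term, doc):
    # Term frequency: share of the document's tokens that are this term.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log(N / number of documents containing the term).
    # Assumes the term occurs in at least one document.
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("is", docs[0]))             # 0.0, because "is" appears in every document
print(round(tfidf("nlp", docs[0]), 3))  # 0.231
```

Note how "is" gets a weight of zero here: a word found in every document tells you nothing about which document you are looking at.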

---

## **5️⃣ Word Embeddings**
### 🔹 Word2Vec (Google) & GloVe (Stanford)
- Both map words to dense numerical vectors, so that words used in similar contexts get similar vectors.
```python
from gensim.models import Word2Vec
sentences = [["NLP", "is", "fun"], ["Deep", "learning", "is", "powerful"]]
model = Word2Vec(sentences, min_count=1)
print(model.wv.most_similar("NLP"))  # on a corpus this small, the neighbors are not meaningful
```
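
"Similar" here means a high cosine similarity between vectors. The sketch below computes it by hand; the three-dimensional vectors are made up for illustration, whereas real embeddings are learned from data and typically have tens to hundreds of dimensions.

```python
import math

# Hypothetical toy vectors, invented for this example.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim_kq = cosine_similarity(vectors["king"], vectors["queen"])
sim_ka = cosine_similarity(vectors["king"], vectors["apple"])
print(sim_kq > sim_ka)  # True
```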

---

## **6️⃣ Text Classification**
- Assigns a label to a piece of text; used in **spam detection, sentiment analysis, etc.**
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Buy now", "Limited offer", "Hello friend", "See you soon"]
labels = [1, 1, 0, 0]  # 1=Spam, 0=Not Spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB()
clf.fit(X, labels)

new_text = vectorizer.transform(["Limited time deal"])
print(clf.predict(new_text))  # [1] (Spam)
```
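
To see why "Limited time deal" comes out as spam, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on the same four texts. It is a minimal illustration, not scikit-learn's implementation; `predict` is a hypothetical helper, and words never seen in training are skipped.

```python
import math
from collections import Counter, defaultdict

texts = ["Buy now", "Limited offer", "Hello friend", "See you soon"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Count words per class and documents per class.
word_counts = defaultdict(Counter)
class_counts = Counter(labels)
for text, label in zip(texts, labels):
    word_counts[label].update(text.lower().split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text, alpha=1.0):
    scores = {}
    for label in class_counts:
        # Start from log P(class).
        score = math.log(class_counts[label] / len(labels))
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            # Add log P(word | class) with add-alpha smoothing.
            score += math.log((word_counts[label][word] + alpha)
                              / (total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("Limited time deal"))  # 1
print(predict("Hello friend"))       # 0
```

Only "limited" is in the training vocabulary, and it was seen in a spam text, so the spam class ends up with the higher score.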

---

## **7️⃣ Named Entity Recognition (NER)**
- Identifies and labels named entities such as people, organizations, locations, dates, and monetary amounts.
```python
import spacy
# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a UK startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Apple ORG, then UK GPE, then $1 billion MONEY
```

---

## **8️⃣ Sentiment Analysis**
- Determines if text is **positive, negative, or neutral**.
```python
from textblob import TextBlob
text = "I love NLP!"
blob = TextBlob(text)
print(blob.sentiment.polarity)  # polarity ranges from -1.0 to 1.0; above 0 means positive
```
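
TextBlob's default analyzer is lexicon-based: it looks words up in a table of polarity scores and combines them. The sketch below shows that idea in its simplest form. The `LEXICON` scores are made up for illustration, and the averaging rule is an assumption, so the numbers will not match TextBlob's.

```python
# Hypothetical mini-lexicon; the scores are invented for this example.
LEXICON = {"love": 0.5, "great": 0.75, "awesome": 1.0, "hate": -1.0, "bad": -0.5}

def polarity(text):
    # Strip surrounding punctuation, look each word up, and average the scores found.
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("I love NLP!"))       # 0.5
print(polarity("I hate bad code."))  # -0.75
print(polarity("The sky is blue."))  # 0.0
```

A plain lookup like this misses negation ("not great") and sarcasm, which is where trained models tend to do better.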

---

## **9️⃣ Transformer Models (BERT, GPT)**
### **🔹 BERT (Bidirectional Encoder Representations from Transformers)**
- Pre-trained bidirectional encoder that is fine-tuned for tasks such as **question answering, sentiment analysis**.
```python
from transformers import pipeline
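# With no model argument, the pipeline downloads a default checkpoint
# (a DistilBERT model fine-tuned on SST-2 at the time of writing).
classifier = pipeline("sentiment-analysis")
print(classifier("NLP is awesome!"))  # e.g. [{'label': 'POSITIVE', 'score': 0.999...}]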
```

### **🔹 GPT (Generative Pre-trained Transformer)**
- Generates text one token at a time, each conditioned on the tokens before it; used for **text generation (ChatGPT, AI Chatbots)**.
```python
from transformers import pipeline
generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a time", max_length=30)
print(result[0]["generated_text"])  # generated text may differ between runs
```

---

## **🔟 NLP Applications**
✅ **Chatbots (Siri, Alexa, ChatGPT)**  
✅ **Machine Translation (Google Translate)**  
✅ **Speech Recognition (Voice Assistants)**  
✅ **Fake News Detection**  
✅ **Text Summarization**  
✅ **Autocorrect & Spell Checking**  

---

## 🎯 **Final Tips**
- Practice with **real-world datasets** (Twitter sentiment analysis, spam detection, etc.).
- Use pre-trained models like **BERT & GPT**.
- Deploy NLP models using **Flask / FastAPI / Streamlit**.

---

🔥 **Master NLP & Build AI-powered Applications! 🚀**
