# Natural Language Processing (NLP)

---

## 1. Introduction to NLP

### What is NLP?

Natural Language Processing (NLP) is the field of computing concerned with getting machines to read, interpret, and generate human language. Think of it as teaching machines to understand us: how we talk, text, and write.

**Example:**

* You ask Siri, “What’s the weather today?” Siri understands your sentence and gives a response - that's NLP in action.

---

## 2. Word Vectorization

Machines don’t understand words; they understand **numbers**. So we need to **convert text to numbers**.

### Techniques:

#### a. Bag of Words (BoW)

We count how often each word appears in each document. Word order is ignored; only the counts are kept.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is fun"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['fun', 'is', 'love', 'nlp']
print(X.toarray())                        # [[0 0 1 1], [1 1 0 1]] 

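* `I love NLP` → has 'love' and 'nlp'. The one-letter word `I` is dropped because `CountVectorizer` ignores single-character tokens by default.
* `NLP is fun` → has 'fun', 'is', 'nlp'

Each row is a sentence. Each column is a word from the vocabulary, lowercased and sorted alphabetically.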
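
To make the counting concrete, here is a minimal sketch of the same idea in plain Python. It is not how scikit-learn is implemented internally; the `tokenize` helper and the variable names are assumptions chosen for illustration, and the regular expression only imitates the default "two or more characters" rule.

```python
import re
from collections import Counter

corpus = ["I love NLP", "NLP is fun"]

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters, imitating
    # CountVectorizer's default behavior (which is why "I" is dropped).
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document, one count per vocabulary word.
bow_matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['fun', 'is', 'love', 'nlp']
print(bow_matrix)  # [[0, 0, 1, 1], [1, 1, 0, 1]]
```

The result matches the `CountVectorizer` output for this small corpus.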

#### b. TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF refines BoW by down-weighting words that appear in many documents (like "the" or "is") and up-weighting words that are distinctive to a few documents.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray())
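
The weighting can be sketched by hand. The example below assumes the smoothed IDF formula that scikit-learn documents as its default (`smooth_idf=True`), `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The token lists are written out manually and the helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Tokens of "I love NLP" and "NLP is fun" after lowercasing ("I" is dropped).
docs = [["love", "nlp"], ["nlp", "is", "fun"]]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_row(doc):
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))  # L2 norm of the row
    return [v / norm for v in raw]

tfidf_matrix = [tfidf_row(doc) for doc in docs]
for row in tfidf_matrix:
    print([round(v, 3) for v in row])
```

In the first row, 'love' ends up with a higher weight than 'nlp', because 'nlp' appears in both documents and is therefore less distinctive.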

---

## 3. Introduction to NLP with NLTK

NLTK (Natural Language Toolkit) is a widely used Python library for basic NLP tasks such as tokenization, stopword removal, and tagging.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead

text = "Natural language processing is amazing!"
tokens = word_tokenize(text)
print(tokens)

**Output**: `['Natural', 'language', 'processing', 'is', 'amazing', '!']`

This is called **tokenization**: breaking text into smaller units (tokens), here words and punctuation marks.
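
For intuition, a rough tokenizer can be sketched with the standard library alone. This is only an approximation and not NLTK's algorithm; `simple_tokenize` is a hypothetical helper, and it will differ from `word_tokenize` on harder cases such as contractions.

```python
import re

def simple_tokenize(text):
    # \w+     matches a run of letters, digits, or underscores (a word)
    # [^\w\s] matches a single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Natural language processing is amazing!")
print(tokens)  # ['Natural', 'language', 'processing', 'is', 'amazing', '!']
```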

---

## 4. Introduction to Regular Expressions (Regex)

Regular expressions are used to **search for patterns** in text.

In [None]:
import re

text = "My number is 0723-456-789"
pattern = r"\d{4}-\d{3}-\d{3}"
match = re.search(pattern, text)
if match:  # re.search returns None when nothing matches
    print(match.group())  # 0723-456-789

**Explanation**:

* `\d` → any single digit
* `{n}` → match the preceding item exactly `n` times
* `-` → a literal hyphen
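
Building on the same pattern, the sketch below shows two other common operations: finding every match in a string and reading parts of a match through capture groups. The sample text and variable names are made up for illustration.

```python
import re

text = "Call 0723-456-789 or 0711-222-333. Ticket 12-34 is not a phone number."

# Parentheses create capture groups, so each part can be read separately.
pattern = re.compile(r"(\d{4})-(\d{3})-(\d{3})")

# finditer() yields one match object per occurrence.
numbers = [m.group() for m in pattern.finditer(text)]
print(numbers)  # ['0723-456-789', '0711-222-333']

first = pattern.search(text)
if first is not None:  # search() returns None when nothing matches
    prefix, middle, last = first.groups()
    print(prefix, middle, last)  # 0723 456 789

print(pattern.search("no digits here"))  # None
```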

---

## 5. Feature Engineering for Text Data

Before feeding text into models, we **clean and extract** useful features.

Steps:

1. Remove punctuation
2. Lowercase
3. Remove stopwords (like "is", "the", "and")
4. Lemmatize (reduce each word to its base form, e.g. "cats" → "cat")

In [None]:
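import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')

text = "Cats are running faster than dogs"
words = word_tokenize(text.lower())  # lowercase, then tokenize

# Build the stopword set once; a set lookup is faster than rescanning the list for every word
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]

# lemmatize() treats each word as a noun by default, so 'running' and 'faster' stay unchanged
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in words]

print(words)  # ['cat', 'running', 'faster', 'dog']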
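
Steps 1 to 3 can also be sketched without any external library, which makes each step easy to inspect. The stopword set below is a small hand-picked assumption for illustration (NLTK's English list is much longer), and `clean` is a hypothetical helper. Lemmatization (step 4) is left out because it needs a dictionary such as WordNet.

```python
import string

# Small illustrative stopword list; not NLTK's full list.
STOPWORDS = {"is", "the", "and", "are", "than", "a", "an"}

def clean(text):
    # 1. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Lowercase
    text = text.lower()
    # 3. Remove stopwords
    return [word for word in text.split() if word not in STOPWORDS]

print(clean("The cats, and the dogs, are running!"))  # ['cats', 'dogs', 'running']
```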

---

## 6. Context-Free Grammars and POS Tagging

### POS = Part of Speech

We label each word with its role: noun, verb, adjective, etc.

In [None]:
from nltk import pos_tag
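nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # newer NLTK releases look for this resource instead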

text = word_tokenize("The dog barks loudly.")
tags = pos_tag(text)
print(tags)

**Output**: `[('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'RB'), ('.', '.')]`

* 'DT' = Determiner
* 'NN' = Noun (singular)
* 'VBZ' = Verb (third person singular, present tense)
* 'RB' = Adverb
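
A context-free grammar (CFG) describes which sequences of these tags form a valid sentence, using rules such as "a sentence is a noun phrase followed by a verb phrase". The sketch below is a toy recognizer in plain Python: the grammar, the function names, and the hard-coded tag list are all assumptions for illustration, and the final punctuation tag is left out to keep the grammar small. NLTK has its own grammar and parser classes for real use.

```python
# Toy grammar over POS tags (hypothetical, for illustration only).
# Each key is a non-terminal; each value lists its possible expansions.
GRAMMAR = {
    "S":  [["NP", "VP"]],          # sentence -> noun phrase + verb phrase
    "NP": [["DT", "NN"], ["NN"]],  # noun phrase -> optional determiner + noun
    "VP": [["VBZ", "RB"], ["VBZ"]],  # verb phrase -> verb + optional adverb
}

def ends(symbol, tags, start):
    """Return every position where `symbol` can finish if it begins at `start`."""
    if symbol not in GRAMMAR:  # terminal: must equal the tag at `start`
        if start < len(tags) and tags[start] == symbol:
            return {start + 1}
        return set()
    results = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            positions = {e for p in positions for e in ends(part, tags, p)}
        results |= positions
    return results

def is_sentence(tags):
    # Valid only if "S" can cover the tag sequence from start to end.
    return len(tags) in ends("S", tags, 0)

tagged = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]
tags = [tag for _, tag in tagged]

print(is_sentence(tags))           # True
print(is_sentence(["VBZ", "DT"]))  # False
```

This top-down approach works here because the toy grammar has no left-recursive rules; a grammar with rules like `NP → NP PP` would need a chart parser instead.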

---

## 7. Text Classification

### Task: Is this email spam?

Steps:

* Convert text to vectors (e.g., TF-IDF)
* Train a classifier

In [None]:
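from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

emails = ["Free money", "Meeting today", "Win a car", "Project update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Split first so the vectorizer never sees the test emails.
# stratify keeps one example of each class on both sides of this tiny dataset.
train_texts, test_texts, y_train, y_test = train_test_split(
    emails, labels, test_size=0.5, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary from training data only
X_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test data

model = MultinomialNB()
model.fit(X_train, y_train)

print(model.predict(X_test))  # sparse input works directly; no .toarray() needed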
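
To see what a Naive Bayes classifier computes, here is a from-scratch sketch on the same four emails. It is a simplification under stated assumptions: it uses raw word counts rather than TF-IDF weights, applies add-one (Laplace) smoothing, and skips words it has never seen. The `predict` helper and the test sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

emails = ["Free money", "Meeting today", "Win a car", "Project update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Count words per class and documents per class.
word_counts = defaultdict(Counter)
class_counts = Counter(labels)
for text, label in zip(emails, labels):
    word_counts[label].update(text.lower().split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # Start from log P(class).
        score = math.log(class_counts[label] / len(labels))
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add log P(word | class) with add-one smoothing.
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocabulary))
            )
        scores[label] = score
    # The class with the highest log-probability wins.
    return max(scores, key=scores.get)

print(predict("Free car"))        # 1
print(predict("Meeting update"))  # 0
```

With only four training emails the predictions are not meaningful in practice; the point is the mechanics of combining a class prior with per-word likelihoods.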

---

## 8. Data Ethics in NLP

### NLP Biases

* If your dataset is biased (e.g., favoring one gender), your model will likely learn and reproduce that bias.
* Offensive or harmful outputs can emerge from poor data curation.

### Best Practices:

* Use balanced and diverse datasets.
* Audit model outputs for harmful language.
* Always anonymize sensitive data.
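
As one small illustration of anonymization, the sketch below masks phone numbers written in the `dddd-ddd-ddd` format used in the regex section. It is only a starting point under that assumption: real anonymization also has to handle names, addresses, and other identifiers, and `redact_phones` is a hypothetical helper.

```python
import re

# Same phone-number format as in the regex section.
PHONE = re.compile(r"\d{4}-\d{3}-\d{3}")

def redact_phones(text):
    # Replace every matching phone number with a placeholder.
    return PHONE.sub("[PHONE]", text)

print(redact_phones("My number is 0723-456-789"))  # My number is [PHONE]
```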

---

## Exercises

1. Tokenize the sentence: `"Learning NLP is exciting!"`
2. Use TF-IDF on: `["AI is the future", "The future is now"]`
3. Create a regular expression to find email addresses.
4. Classify the sentence "Win a free trip!" using Naive Bayes.
5. Try POS tagging with NLTK on your own sentence.

---

## Summary

* NLP helps computers understand human language.
* We use tools like TF-IDF, tokenization, and classification to analyze text.
* Libraries like NLTK and scikit-learn make this easier.
* Ethical awareness is just as important as accuracy.
