# 🚀 Introduction to the Course

Hello and welcome to this **Introduction to Natural Language Processing (NLP)** course.  
We will be delving into the exciting field of NLP and exploring techniques that enable computers to **understand, generate, and classify human language**.

---

## 📚 What You Need
- ✅ Basic Python skills  
- ✅ Some familiarity with machine learning  
- ❌ No prior NLP knowledge required  

---

## 🗂️ Course Roadmap

1. **Introduction to NLP**  
   - What NLP means  
   - Everyday applications of NLP  

2. **🔧 Text Pre-processing**  
   - Fundamental step in NLP  
   - Practical exercises included  

3. **🧩 Key Component Extraction**  
   - Parts of Speech (POS) Tagging  
   - Named Entity Recognition (NER)  

4. **💬 Sentiment Analysis**  
   - Understanding emotions in text  

5. **🔢 Text Vectorization**  
   - Preparing text data for machine learning  

6. **📊 Advanced Topics**  
   - Topic modeling  
   - Building a custom text classifier  

7. **🏢 Case Study (Section A)**  
   - End-to-end business problem solution  
   - Portfolio-ready Jupyter Notebook  

8. **🔮 Future of NLP**  
   - Deep Learning in NLP  
   - Large Language Models (LLMs)  
   - Trends & future directions  

---

## 👩‍🏫 Instructor Introduction

**Lauren Newbold**  
- Data Scientist with experience in **large organizations and startups**  
- Built a **custom NLP text classifier** for interview data in developing communities  
- Speaker at **international conferences** on NLP in developing countries  
- Passionate about **teaching and making data exciting**  

---

## ✨ Key Takeaways
- 🤖 NLP enables computers to **understand, generate, and classify** human language.  
- 🐍 Requires **basic Python skills** and some ML knowledge — no prior NLP experience.  
- 📘 Covers: preprocessing, POS tagging, NER, sentiment analysis, vectorization, topic modeling, and custom classifiers.  
- 🏆 Includes a **real-world case study** and wraps up with **deep learning & LLMs** in NLP.  

---

👉 Now that we’ve set the stage, let’s dive into our first lesson:  
**What do we mean by Natural Language Processing?**


# 🧠 Introduction to NLP

## 📖 What is Natural Language Processing (NLP)?
Natural Language Processing (**NLP**) is a field of **artificial intelligence** that enables computers to **understand, interpret, and generate human language**, similar to how humans communicate with one another.

---

## ⚙️ Techniques Used in NLP
- 📊 **Statistics**
- 🤖 **Machine Learning**
- 🧮 **Deep Learning**

These approaches power modern NLP systems to analyze and process large volumes of text data efficiently.

---

## 🏛️ Origins of NLP
- 🕰️ **1950s** → Early NLP methods focused on **rule-based systems**.  
- 📜 These systems relied on **grammatical rules** to process text.  
- ⚠️ But just like in real life, grammar rules alone are **not enough** for true language understanding.  

---

## 🚀 Modern Advancements
- 💾 Availability of **large datasets**  
- ⚡ **Massive improvements in computing power**  
- 🤖 Breakthroughs in **deep learning architectures**  

👉 These advancements enabled the rise of **transformative systems like ChatGPT**, capable of human-like language understanding and generation.

---

## 📊 Why NLP Matters for Data Scientists
- 🔍 Gain insights from large collections of text data  
- ⏳ Save **significant time** by automating manual analysis  
- 💡 Discover insights that were previously **inaccessible or invisible**  

---

## 📝 Key Takeaways
- 🧠 NLP enables computers to **understand, interpret, and generate human language data**.  
- ⚙️ It leverages **statistics, machine learning, and deep learning**.  
- 📜 Early **rule-based systems** were limited in understanding language.  
- 🚀 **Modern NLP models** (like ChatGPT) are powerful tools for extracting insights from text at scale.  

---

👉 **Next Lesson:** Real-world examples of how NLP systems interact with our everyday lives.


# 🌍 NLP in Everyday Life

Let’s illustrate how far **Natural Language Processing (NLP)** has come and how much it interacts with our everyday lives through a few examples.

---

## 🔎 NLP in Search Engines
- Search engines use NLP to **understand a user's query** and provide relevant search results.  
- NLP techniques help:  
  - 📝 Extract keywords  
  - 🎯 Comprehend the **intent** behind the query  
  - 🌐 Return matching web pages  

---

## 📧 Spam Detection in Email
- NLP powers **automatic spam filtering** in email systems.  
- 🧮 **Classification algorithms** analyze patterns to:  
  - 🚫 Identify unwanted messages  
  - ✅ Distinguish legitimate emails  

---

## 🤖 Customer Support & Chatbots
- Chatbots and support systems rely on NLP to **understand customer queries**.  
- Data scientists design **conversational agents** that can:  
  - 💬 Interpret customer intent  
  - 📌 Provide accurate, context-aware responses  

---

## 💡 Explore More
There are many more examples of NLP applications around you — from **voice assistants** to **language translation tools**.  
By the end of this course, you’ll have the **foundational skills** to begin creating your own NLP solutions. 🚀  

---

## 📝 Key Takeaways
- 🌍 NLP is integral to everyday technologies like **search engines, spam filters, and chatbots**.  
- 🔎 Search engines leverage NLP to **extract keywords, understand intent, and serve relevant results**.  
- 📧 **Classification algorithms** separate spam from legitimate emails.  
- 🤖 Conversational agents use NLP to **interpret queries and provide responses**.  


# 🧠 Supervised vs Unsupervised NLP

## 📖 Introduction
Supervised and unsupervised learning are two **fundamental approaches** in tackling NLP problems.  
The method you choose depends on:  
- 📊 The **data available**  
- 🎯 The **questions you want to answer**

This course will cover **both approaches**, so let’s break them down.

---

## 🎓 Supervised Learning
Supervised learning = **learning with guidance (labels)**.  
- You provide both:  
  - 📝 Input → the **text data**  
  - 🎯 Output → the **labels (scores, categories, etc.)**

📌 **Example:**  
Imagine you have a dataset of product reviews.  
- Each review has:  
  - 💬 Review text  
  - ⭐ A review score (e.g., 7/10)  
- A supervised ML model can **learn the relationship** between text and score.  
- Later, it can **predict the score** of a new, unseen review.
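
To make the idea concrete, here is a minimal sketch of "learning from labels" using only the standard library. The reviews, scores, and the helper names `train` and `predict` are made up for illustration, and averaging per-word scores is a toy stand-in for a real ML model, not the method used later in the course.

```python
from collections import defaultdict

# Hypothetical labeled data: (review text, score out of 10)
labeled_reviews = [
    ("great product works perfectly", 9),
    ("terrible quality broke quickly", 2),
    ("great value", 8),
    ("terrible support", 3),
]

def train(examples):
    """Learn the average score seen alongside each word."""
    totals, counts = defaultdict(float), defaultdict(int)
    for text, score in examples:
        for word in text.lower().split():
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}

def predict(word_scores, text, default=5.0):
    """Score unseen text as the mean of its known words' averages."""
    known = [word_scores[w] for w in text.lower().split() if w in word_scores]
    return sum(known) / len(known) if known else default

word_scores = train(labeled_reviews)
print(predict(word_scores, "great value works"))  # 8.5
print(predict(word_scores, "terrible quality"))   # 2.25
```

The labels are what make this supervised: without the scores, `train` would have nothing to learn from.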

---

## 🔍 Unsupervised Learning
Unsupervised learning = **no labels required**.  
- The algorithm finds **patterns and structures** in the text by itself.  

📌 **Example:**  
- **Clustering** reviews into groups that may turn out to reflect, for example:  
  - 📦 Positive tone  
  - 📦 Neutral tone  
  - 📦 Negative tone  

Even without knowing the “right” answer beforehand, the model can still uncover **hidden insights**.
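
As a rough illustration of grouping without labels, the sketch below puts reviews together when they share enough words. The reviews, the `threshold` value, and the helper names are assumptions for this example, and the greedy grouping is a toy stand-in for real clustering algorithms.

```python
def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def group_by_overlap(texts, threshold=0.3):
    """Greedy grouping: join the first group whose seed text is similar enough."""
    groups = []  # each group is a list of (text, token_set) pairs
    for text in texts:
        tokens = set(text.lower().split())
        for group in groups:
            if jaccard(tokens, group[0][1]) >= threshold:
                group.append((text, tokens))
                break
        else:
            # No existing group was close enough, so start a new one
            groups.append([(text, tokens)])
    return [[text for text, _ in group] for group in groups]

reviews = [
    "fast delivery great service",
    "slow delivery poor packaging",
    "great service fast shipping",
    "poor packaging arrived damaged",
]
for group in group_by_overlap(reviews):
    print(group)
```

Nothing told the code which reviews belong together; the two groups fall out of the word overlap alone.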

---

## ⚖️ Choosing Between Them
- ✅ **Use Supervised** when you have **labeled data** and want to **predict or classify**.  
- ✅ **Use Unsupervised** when you have **unlabeled data** and want to **find patterns**.  

---

## 📝 Key Takeaways
- 🎓 **Supervised learning** = requires labeled data to map inputs → outputs.  
- 🔍 **Unsupervised learning** = works without labels, finds natural patterns.  
- ⚖️ The choice depends on **data availability** and the **insight you need**.  
- 📦 **Clustering** is a classic unsupervised method in NLP.  


# 📘 The Importance of Data Preparation

In **Natural Language Processing (NLP)**, the accuracy of any **machine learning model** or **insight** heavily depends on the **quality of the data** provided and how well that data has been **cleaned and prepared**.

---

## ⚠️ Garbage In, Garbage Out

- The phrase **“garbage in, garbage out” (GIGO)** means:
  - If we feed an algorithm **dirty, unstructured, noisy data**,  
  - …then the **results will also be poor**, regardless of the model used.  

👉 **Conclusion:** *Bad data = bad predictions*.

---

## 🛠️ Steps in Text Data Preprocessing

### 1. **General Cleaning**
- Organize the dataset  
- Tidy up text  
- Remove problematic elements that may cause errors  

---

### 2. **Noise Removal**
- Eliminate unnecessary data that adds no value  
- Examples:
  - HTML tags  
  - Special symbols  
  - Extra white spaces  
  - Stop words (like “is”, “the”, “of”)  
- Benefits:
  - Saves **memory space**  
  - Produces a **smaller, cleaner dataset**  

---

### 3. **Formatting for ML Algorithms**
- Ensure the data is in a **consistent format** suitable for ML models  
- Examples:
  - Lowercasing text  
  - Tokenization  
  - Lemmatization or stemming  

---

## 🔄 Transformation Flow

Before preprocessing:
```
"Hello!!!   This   is   an <b>Example</b>....."
```

After preprocessing:
```
"hello example"
```
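
The before/after pair above can be reproduced with a short sketch. The tiny `STOPWORDS` set and the `preprocess` helper are assumptions for illustration; a real stopword list is much longer.

```python
import re
import string

# Tiny hand-picked stopword set for illustration only (an assumption)
STOPWORDS = {"this", "is", "an", "the", "of", "a"}

def preprocess(text):
    text = re.sub(r"<.*?>", "", text)                                 # noise: HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # noise: punctuation
    tokens = text.lower().split()                                     # format: lowercase and split
    return " ".join(t for t in tokens if t not in STOPWORDS)          # noise: stopwords

print(preprocess("Hello!!!   This   is   an <b>Example</b>....."))  # hello example
```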

---

## 📊 Diagram: Data Preparation Flow

```plaintext
Raw Text  ──► General Cleaning ──► Noise Removal ──► Formatting ──► ML-Ready Data
```

---

## 🐍 Python Example: Text Cleaning

```python
import re
import string

text = "Hello!!!   This   is   an <b>Example</b>....."

# Step 1: Remove HTML tags
cleaned = re.sub(r"<.*?>", "", text)

# Step 2: Remove punctuation
cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))

# Step 3: Normalize spaces & lowercase
cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

print(cleaned)
```

```output
hello this is an example
```

---

## ✅ Key Takeaways

- **Quality and cleanliness of data** are crucial for accuracy in NLP.  
- **Garbage in, garbage out (GIGO):** Poor data quality = poor performance.  
- **Preprocessing steps:**
  1. General Cleaning  
  2. Noise Removal  
  3. Formatting for algorithms  
- **Outcome:** Transforms raw, messy text into a **structured, ML-ready format**.  

# 🔡 Lowercasing Text in NLP

## 📖 Why Lowercase?
An important first step in working with text data is **converting it into lowercase**.  

### ✨ Benefits:
- 🔄 Maintains **consistency** in data  
- 🧮 Ensures words are **counted the same** (`Apple` = `apple`)  
- 🤖 Helps ML models treat words uniformly  
- 🧹 Makes **further cleaning easier** (no need to handle cases separately)  

⚠️ **Caution:**  
Lowercasing can sometimes **change meaning**:  
- `"US"` 🇺🇸 → a country  
- `"us"` 🙋 → pronoun  
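
One possible way to handle this caution is to lowercase everything except an allow-list of tokens. The `KEEP_CASE` set and the `lowercase_except` helper below are hypothetical names for this sketch, and the approach assumes tokens are separated by whitespace with no punctuation attached.

```python
# Hypothetical allow-list of tokens whose casing carries meaning
KEEP_CASE = {"US", "UK", "IT"}

def lowercase_except(text, keep=KEEP_CASE):
    """Lowercase every whitespace-separated token unless it is in `keep`."""
    return " ".join(tok if tok in keep else tok.lower() for tok in text.split())

print(lowercase_except("The US team joined us in London"))
# the US team joined us in london
```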

---

## 🐍 Python Example

### ✅ Lowercasing a Single Sentence
```python
sentence = "Her cat's name is Luna."
lower_sentence = sentence.lower()
print(lower_sentence)
```

**Output:**
```
her cat's name is luna.
```

---

### ✅ Lowercasing a List of Sentences
```python
sentence_list = ["Her cat's name is Luna.", "This is a Test.", "Python is Fun!"]
lower_sentence_list = [x.lower() for x in sentence_list]
print(lower_sentence_list)
```

**Output:**
```
["her cat's name is luna.", "this is a test.", "python is fun!"]
```

---

## 📝 Key Takeaways
- 🔡 **Lowercasing ensures consistency** in text data.  
- 🤖 Words with different cases (`Dog` vs `dog`) are treated **as the same**.  
- 🧹 Simplifies further **data cleaning steps**.  
- ⚠️ Be careful with acronyms & proper nouns (`US` ≠ `us`).  

# 📝 Removing Stopwords with NLTK

## 🚀 Introduction to Removing Stopwords
In this lesson, we will use the **NLTK** package to remove stopwords from our text.

👉 Stopwords are common words in the language that do not carry much meaning.  
Examples: **"and," "of," "a," "to"**  

Removing stopwords helps by:
- ⚡ Reducing complexity in the dataset  
- 🎯 Often improving machine learning accuracy by removing noise  
- ⏩ Speeding up processing time  

---

## 📥 Importing NLTK and Downloading Stopwords

```python
import nltk
nltk.download('stopwords')
```

If you have not already downloaded these, it may take a few minutes.

```python
from nltk.corpus import stopwords
```

---

## 📌 Assigning and Printing Stopwords

```python
n_stopwords = stopwords.words('english')
print(n_stopwords)
```

✅ This gives you a list of common English stopwords.  

---

## ✂️ Removing Stopwords from a Sentence

Sentence to process:
```python
sentence = 'It was too far to go to the shop and he did not want her to walk.'
```

Remove stopwords:
```python
sentence_no_stopwords = ' '.join([word for word in sentence.split() if word.lower() not in n_stopwords])
print(sentence_no_stopwords)
```

📌 Output:
```
far go shop want walk.
```

---

## ⚙️ Customizing the Stopwords List

You can modify the list:  

```python
n_stopwords.remove('did')
n_stopwords.remove('not')
n_stopwords.append('go')

sentence_no_stopwords_custom = ' '.join([word for word in sentence.split() if word.lower() not in n_stopwords])
print(sentence_no_stopwords_custom)
```

📌 Output:
```
far shop did not want walk.
```
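
Because `sentence.split()` leaves punctuation attached, a token such as `to.` would slip past the stopword check. The sketch below strips punctuation before comparing. It uses a small hand-written set (`demo_stopwords`) in place of NLTK's list and a variant of the sentence that ends in a stopword, both assumptions for illustration.

```python
import string

# Small hand-written stopword set standing in for NLTK's list (an assumption)
demo_stopwords = {"it", "was", "too", "to", "the", "and", "he", "her", "did", "not"}

def remove_stopwords(sentence, stop_set):
    kept = []
    for word in sentence.split():
        bare = word.strip(string.punctuation).lower()  # "to." -> "to"
        if bare and bare not in stop_set:
            kept.append(bare)
    return " ".join(kept)

sentence_variant = "It was too far to go to the shop and he did not want her to."
print(remove_stopwords(sentence_variant, demo_stopwords))  # far go shop want
```

Storing the stopwords in a `set` also makes each membership check faster than searching a list.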

---

## ✅ Key Takeaways

- 🛠️ **NLTK** provides built-in stopwords lists for many languages.  
- 🧹 Removing stopwords simplifies text data and speeds up processing.  
- 🎯 Cleaner datasets often improve ML model performance.  
- ✍️ You can customize stopword lists by adding or removing words.  


# Regular Expressions in NLP

## 📖 Introduction to Regular Expressions
Regular expressions, or **regex**, are a special syntax for searching strings that match specified patterns.  
They are a powerful tool to filter and sort through text when you want to **match patterns** instead of exact strings.  

---

## ⚙️ Importing the `re` Package
```python
import re
```

---

## 📝 Raw Strings in Python
Python treats backslash sequences such as `\n` as special characters (here, a newline).  
To avoid misinterpretation, prefix strings with `r` to indicate a **raw string**.

```python
my_folder = "c\desktop\notes"   # "\n" is read as a newline ❌
print(my_folder)

my_folder_raw = r"c\desktop\notes"  # Treated literally ✅
print(my_folder_raw)
```
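
A quick way to see the difference is to compare string lengths. This is a small sketch with made-up variable names:

```python
normal = "a\nb"   # \n is a single newline character
raw = r"a\nb"     # backslash and "n" are two separate characters

print(len(normal), len(raw))  # 3 4
print(raw == "a\\nb")         # True
```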

---

## 🔍 Regex Functions Overview

- **`re.search(pattern, string)`** → returns a match object for the first match, else `None`  
- **`re.sub(pattern, replacement, string)`** → replaces matched text  

```python
text = "Sara was able to help me quickly."
new_text = re.sub("Sara", "Sarah", text)
print(new_text)  # Sarah was able to help me quickly.
```
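
The value returned by `re.search` is worth a closer look. The sketch below reuses the same sentence; the search words are chosen only for illustration.

```python
import re

text = "Sara was able to help me quickly."

match = re.search(r"help", text)
if match:  # re.search returns None when nothing matches
    print(match.group(), match.span())  # help (17, 21)

print(re.search(r"refund", text))  # None
```

Checking the result before calling `.group()` avoids an `AttributeError` on `None`.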

---

## 🎯 Regex Syntax Examples

### ✅ Matching Optional Characters with `?`
```python
reviews = ["Sara helped a lot.", "Sarah was kind."]
pattern = r"Sarah?"  # "?" makes the preceding "h" optional
for review in reviews:
    if re.search(pattern, review):
        print(review)
# Matches both Sara and Sarah
```

### ✅ Start of String `^`
```python
re.search(r"^A", "Amazing work!")  # Matches
```

### ✅ End of String `$`
```python
re.search(r"y$", "Great delivery")  # Matches 'y' at end
```

### ✅ Alternation `|`
```python
pattern = r"(need|want)ed"
text = "I needed help and she wanted answers."
print(re.findall(pattern, text))  # ['need', 'want']
```
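
Note that `re.findall` returns only the text captured by the parentheses. If you want the full words, a non-capturing group `(?:...)` is one option, sketched here with the same sentence:

```python
import re

text = "I needed help and she wanted answers."

# Capturing group: findall returns only the group's text
print(re.findall(r"(need|want)ed", text))    # ['need', 'want']

# Non-capturing group (?:...): findall returns the whole match
print(re.findall(r"(?:need|want)ed", text))  # ['needed', 'wanted']
```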

---

## ✂️ Removing Punctuation
```python
reviews = ["Amazing work!", "Really helpful :)"]
cleaned = [re.sub(r"[^\w\s]", "", review) for review in reviews]
print(cleaned)  # ['Amazing work', 'Really helpful ']
```
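
Removing punctuation can leave stray spaces behind, as in `'Really helpful '` above. A possible follow-up step is sketched below; `strip_punctuation` is a hypothetical helper name, and the third review is added only to show that `\w` keeps underscores.

```python
import re

def strip_punctuation(text):
    no_punct = re.sub(r"[^\w\s]", "", text)       # drop anything that is not a word character or whitespace
    return re.sub(r"\s+", " ", no_punct).strip()  # collapse runs of whitespace, trim the ends

reviews = ["Amazing work!", "Really helpful :)", "snake_case -- stays"]
print([strip_punctuation(r) for r in reviews])
# ['Amazing work', 'Really helpful', 'snake_case stays']
```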

---

## 🎨 Regex Cheat Sheet (Visual)

```mermaid
flowchart LR
    A("^ = Start of String") --> B("$ = End of String")
    B --> C("? = Optional Character")
    C --> D("| = OR / Alternation")
```

---

## 🚀 Key Takeaways
- Regex provides **pattern-based searching** instead of exact string matching.  
- Use `r""` raw strings to avoid escape issues.  
- Common functions: `re.search`, `re.sub`, `re.findall`.  
- Syntax:  
  - `?` → optional character  
  - `^` → start of string  
  - `$` → end of string  
  - `|` → alternation  
- Regex is essential for **text preprocessing** (e.g., punctuation removal).  

# 📝 Tokenization

## 🔹 Introduction
A fundamental step in **Natural Language Processing (NLP)** involves converting our text into smaller units through a process known as **tokenization**.  
These smaller units are called **tokens**.

- **Word Tokenization** → Breaks text into words.  
- **Sentence Tokenization** → Breaks text into sentences.  
- Tokens can also be **subwords** or **characters**, depending on the use case.

👉 We perform tokenization because the meaning of text is better understood if we can analyze the **individual parts** as well as the **whole**.  
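
To show what different token granularities look like, here is a minimal sketch using only the standard library. The regex is an assumption chosen for illustration and will not split text exactly the way a dedicated tokenizer does (for example, it keeps `cat's` as one token).

```python
import re

text = "Her cat's name is Luna."

# Word-level: runs of word characters/apostrophes, or single punctuation marks
word_tokens = re.findall(r"[\w']+|[^\w\s]", text)
print(word_tokens)  # ['Her', "cat's", 'name', 'is', 'Luna', '.']

# Character-level: every character becomes a token
char_tokens = list("Luna.")
print(char_tokens)  # ['L', 'u', 'n', 'a', '.']
```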

---

## ✂️ Sentence Tokenization Example
```python
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

from nltk.tokenize import sent_tokenize

# Example text
text = "Her cat's name is Luna. Her dog's name is Max."

# Sentence tokenization
sentences = sent_tokenize(text)
print(sentences)
```

### ✅ Expected Output
```python
["Her cat's name is Luna.", "Her dog's name is Max."]
```

---

## 🔤 Word Tokenization Example
```python
from nltk.tokenize import word_tokenize

# Word tokenization on a single sentence
sentence = "Her cat's name is Luna."
words = word_tokenize(sentence)
print(words)
```

### ✅ Expected Output
```python
['Her', 'cat', "'s", 'name', 'is', 'Luna', '.']
```

---

## 🧩 Combined Tokenization Example
```python
# Word tokenization on the full text
words_full = word_tokenize(text)
print(words_full)

# Convert all tokens to lowercase for consistency
words_lower = [w.lower() for w in words_full]
print(words_lower)
```

### ✅ Expected Output
```python
['Her', 'cat', "'s", 'name', 'is', 'Luna', '.', 'Her', 'dog', "'s", 'name', 'is', 'Max', '.']

['her', 'cat', "'s", 'name', 'is', 'luna', '.', 'her', 'dog', "'s", 'name', 'is', 'max', '.']
```

---

## 📌 Key Takeaways
- Tokenization = splitting text into smaller units (**tokens**).  
- Sentence tokenization → breaks text into **sentences**.  
- Word tokenization → breaks text into **words**.  
- Lowercasing tokens ensures consistency when analyzing frequencies.  

# 🔄 Stemming in NLP

Stemming is the process of reducing words to their **base or root form**.  
It is part of **text standardization** during preprocessing.

For example:  
- **connecting → connect**  
- **connected → connect**  
- **connectivity → connect**  

---

## 🌟 Why Use Stemming?
- Reduces the **number of unique words** in the dataset.  
- Simplifies and lowers the **complexity** of data.  
- Makes models focus on the **core meaning** instead of variations.  

> [!IMPORTANT]  
> Sometimes stemming produces non-words (e.g., *worse → wors*).  
> Always balance simplicity vs. semantic accuracy.

---

## 📦 Using NLTK’s Porter Stemmer

### Example 1: Words around **connect**
```python
from nltk.stem import PorterStemmer

PPS = PorterStemmer()

tokens_connect = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
for token in tokens_connect:
    print(token, "→", PPS.stem(token))
```

✅ Output:
```
connecting → connect
connected → connect
connectivity → connect
connect → connect
connects → connect
```

---

### Example 2: Words around **learn**
```python
tokens_learn = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
for token in tokens_learn:
    print(token, "→", PPS.stem(token))
```

✅ Output:
```
learned → learn
learning → learn
learn → learn
learns → learn
learner → learner
learners → learner
```

---

### Example 3: Irregular cases
```python
tokens_misc = ['likes', 'better', 'worse']
for token in tokens_misc:
    print(token, "→", PPS.stem(token))
```

✅ Output:
```
likes → like
better → better
worse → wors
```

> [!NOTE]  
> Notice how *worse* is stemmed to *wors*. This shows that stemming is **rule-based**, not meaning-based.
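
To see what "rule-based" means in practice, here is a deliberately naive suffix stripper. It is a toy sketch with made-up rules (`SUFFIXES`, `min_stem`) and is not the Porter algorithm, which applies many more rules and conditions.

```python
# Toy suffix rules, checked in order; this is NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for word in ["connecting", "connected", "connects", "sing", "news"]:
    print(word, "→", toy_stem(word))
```

The rules get *connecting*, *connected*, and *connects* right and leave *sing* alone, but they also turn *news* into *new*, because they look only at spelling and never at meaning.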

---

## 🧩 Stemming Process Diagram

```mermaid
flowchart LR
    A[Original Words] --> B[Stemming Rules]
    B --> C[Base Form Tokens]
    
    A:::aStyle
    B:::bStyle
    C:::cStyle

    classDef aStyle fill:#FFDDC1,stroke:#333,stroke-width:2px;
    classDef bStyle fill:#C1E1FF,stroke:#333,stroke-width:2px;
    classDef cStyle fill:#C1FFD7,stroke:#333,stroke-width:2px;
```

---

## ✅ Key Takeaways
- 🔡 **Stemming** reduces words to their base/root form by chopping suffixes.  
- 📉 Reduces dataset size and complexity.  
- ⚡ **Porter Stemmer** is widely used in Python via NLTK.  
- ⚠️ Not always linguistically correct (produces non-words).  

---

# 📝 Lemmatization in NLP

## 📖 Introduction
Where **stemming** removes the last few characters of a word, **lemmatization** reduces words to a more meaningful base form and ensures they do not lose their meaning.

🔑 **Key Difference:**  
- **Stemming**: Cuts off word endings (may result in non-meaningful roots).  
- **Lemmatization**: Looks words up in a predefined **dictionary (WordNet)** and, when given the word's **part of speech**, uses that context to return a meaningful base form.

---

## ⚙️ How Lemmatization Works
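- Lemmatization works more intelligently by looking each word up in a **predefined dictionary** rather than chopping off suffixes.  
- Words are reduced only if a meaningful lemma exists for them. NLTK's `WordNetLemmatizer` treats every word as a noun unless you pass a `pos` argument, so verb forms such as *connecting* can come back unchanged.  
- ✅ More accurate than stemming, but it merges fewer word forms, so more unique tokens remain in the dataset.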

---

## 🐍 Python Example – Using NLTK

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Download WordNet if not already
nltk.download('wordnet')
nltk.download('omw-1.4')

# Initialize Lemmatizer
lemmatizer = WordNetLemmatizer()

# Example tokens
tokens_connect = ['connecting', 'connected', 'connectivity', 'connect', 'connects']
tokens_learn = ['learned', 'learning', 'learn', 'learns', 'learner', 'learners']
tokens_likes = ['likes', 'better', 'worse']

print("📌 Connect tokens:")
for token in tokens_connect:
    print(f"{token} ➝ {lemmatizer.lemmatize(token)}")

print("\n📌 Learn tokens:")
for token in tokens_learn:
    print(f"{token} ➝ {lemmatizer.lemmatize(token)}")

print("\n📌 Likes tokens:")
for token in tokens_likes:
    print(f"{token} ➝ {lemmatizer.lemmatize(token)}")
```

---

## 📊 Example Output

```
📌 Connect tokens:
connecting ➝ connecting
connected ➝ connected
connectivity ➝ connectivity
connect ➝ connect
connects ➝ connects

📌 Learn tokens:
learned ➝ learned
learning ➝ learning
learn ➝ learn
learns ➝ learns
learner ➝ learner
learners ➝ learner

📌 Likes tokens:
likes ➝ like
better ➝ better
worse ➝ worse
```
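
Many words in the output above are unchanged because `lemmatize` treats each token as a noun by default; it accepts a `pos` argument to say otherwise. The sketch below imitates that dictionary-plus-part-of-speech lookup with a tiny hand-written table. `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the table is a stand-in for a real lexical database, not WordNet itself.

```python
# Tiny hand-written lookup keyed by (word, part of speech); illustrative only
LEMMA_TABLE = {
    ("connecting", "v"): "connect",
    ("connected", "v"): "connect",
    ("learners", "n"): "learner",
    ("better", "a"): "good",
    ("worse", "a"): "bad",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary lemma for (word, pos), or the word unchanged."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatize("connecting"))           # connecting (looked up as a noun)
print(toy_lemmatize("connecting", pos="v"))  # connect
print(toy_lemmatize("worse", pos="a"))       # bad
```

The same word can map to different lemmas depending on its part of speech, which is why lemmatization is usually paired with POS tagging.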

---

## 🔍 Stemming vs Lemmatization

| Feature               | Stemming | Lemmatization |
|-----------------------|----------|---------------|
| Uses dictionary?      | ❌ No    | ✅ Yes |
| Speed                 | ⚡ Faster | 🐢 Slower |
| Accuracy              | ❌ May produce non-words (e.g., *worse → wors*) | ✅ Produces meaningful words (e.g., *worse → worse*) |
| Data size reduction   | ✅ Strong | ❌ Weak |
| Preserves meaning     | ❌ Not always | ✅ Yes |

---

## 🧩 Simple Diagram

```
   "worse" → [Stemming] → "wors"
          ↘ [Lemmatization] → "worse"
```

---

## 🏁 Key Takeaways
- ✂️ **Stemming** removes suffixes, sometimes breaking word meaning.  
- 📚 **Lemmatization** reduces words to valid base forms using a dictionary.  
- 📊 Lemmatization preserves **semantic meaning** but leaves more unique tokens in the dataset than stemming does.  
- ⚡ Use **stemming** when speed matters, **lemmatization** when accuracy matters.  


# 📚 N-grams in NLP
N-grams help us **inspect preprocessing**, **explore data**, and **engineer features** for ML.  
An **n-gram** is a sequence of **n neighboring tokens** (words, subwords, or chars).

> [!TIP]
> Use n-grams after basic cleaning (lowercase, tokenization, stopword handling) to get clearer signal.

---

## 🔧 Requirements
```python
# Core packages
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from nltk.util import ngrams
```
✅ Output:
```
# (no output on import)
```

---

## 🧩 Example Tokens (preprocessed-lite)
These are sample tokens (lowercased, light cleaning) for demonstration:

```python
tokens = [
    'two','of','the','cats','that','were','here',
    'two','of','the','dogs','that','were','barking',
    'two','of','the','birds','that','were','singing',
    'two','of','them','were','happy'
]
print(tokens[:12])
```
✅ Output:
```
['two', 'of', 'the', 'cats', 'that', 'were', 'here', 'two', 'of', 'the', 'dogs', 'that']
```

---

## 1️⃣ Unigrams (n = 1)

```python
# Count unigrams
uni_counts = Counter(tokens)
# Or via pandas for convenience
uni_series = pd.Series(tokens).value_counts()

print("Top unigrams (Counter):", uni_counts.most_common(5))
print("\nTop unigrams (pandas):\n", uni_series.head(10))
```
✅ Output:
```
Top unigrams (Counter): [('two', 4), ('of', 4), ('were', 4), ('the', 3), ('that', 3)]

Top unigrams (pandas):
two      4
of       4
were     4
the      3
that     3
cats     1
here     1
dogs     1
barking  1
birds    1
dtype: int64
```

### 📊 Visualize Top-10 Unigrams
> [!NOTE]
> Plotting uses **matplotlib** (single chart, default colors).

```python
top10 = uni_series.head(10).sort_values()
plt.figure(figsize=(8, 4))
top10.plot.barh()
plt.title("Top 10 Unigrams")
plt.tight_layout()
plt.show()
```
✅ Output:
```
# A horizontal bar chart is displayed.
```

---

## 2️⃣ Bigrams (n = 2)
```python
bigrams = list(ngrams(tokens, 2))
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(5))
```
✅ Output:
```
[(('two', 'of'), 4), (('of', 'the'), 3), (('that', 'were'), 3), (('the', 'cats'), 1), (('cats', 'that'), 1)]
```

---

## 3️⃣ Trigrams (n = 3)
```python
trigrams = list(ngrams(tokens, 3))
trigram_counts = Counter(trigrams)
print(trigram_counts.most_common(5))
```
✅ Output:
```
[(('two', 'of', 'the'), 3), (('of', 'the', 'cats'), 1), (('the', 'cats', 'that'), 1), (('cats', 'that', 'were'), 1), (('that', 'were', 'here'), 1)]
```

---

## 🧠 Quick Helper: N-gram Frequencies
```python
def ngram_freq(tokens, n=2, top_k=10):
    counts = Counter(ngrams(tokens, n))
    return counts.most_common(top_k)

print("Top-10 bigrams:", ngram_freq(tokens, n=2, top_k=10))
print("Top-10 trigrams:", ngram_freq(tokens, n=3, top_k=10))
```
✅ Output:
```
Top-10 bigrams: [(('two', 'of'), 4), (('of', 'the'), 3), (('that', 'were'), 3), (('the', 'cats'), 1), (('cats', 'that'), 1), (('were', 'here'), 1), (('here', 'two'), 1), (('the', 'dogs'), 1), (('dogs', 'that'), 1), (('were', 'barking'), 1)]
Top-10 trigrams: [(('two', 'of', 'the'), 3), (('of', 'the', 'cats'), 1), (('the', 'cats', 'that'), 1), (('cats', 'that', 'were'), 1), (('that', 'were', 'here'), 1), (('were', 'here', 'two'), 1), (('here', 'two', 'of'), 1), (('of', 'the', 'dogs'), 1), (('the', 'dogs', 'that'), 1), (('dogs', 'that', 'were'), 1)]
```

---

## 🧭 How N-grams Slide (Diagram)

```mermaid
flowchart LR
    A1["Tokens: two → of → the → cats → that → were → ..."]
    A2["Window (n=2): [two of] → [of the] → [the cats] → ..."]
    A3["Window (n=3): [two of the] → [of the cats] → [the cats that] → ..."]
    A1 --> A2 --> A3
```
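
The sliding window in the diagram can be written in a few lines of plain Python. `sliding_ngrams` is a hypothetical helper for this sketch, not part of NLTK.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))

sample = ["two", "of", "the", "cats", "that", "were", "here"]
print(sliding_ngrams(sample, 2)[:3])   # [('two', 'of'), ('of', 'the'), ('the', 'cats')]
print(len(sliding_ngrams(sample, 3)))  # 5
```

A list of `L` tokens yields `L - n + 1` n-grams, which is why seven tokens give five trigrams.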

---

## ✅ Key Takeaways
- An **n-gram** is a sequence of **n neighboring tokens**.  
- **Unigrams, bigrams, trigrams** correspond to **n = 1, 2, 3**.  
- With **NLTK + pandas**, you can **count** and **visualize** n-grams quickly.  
- N-gram analysis becomes powerful after **thorough preprocessing** and on **larger corpora**.

> [!IMPORTANT]
> For production features, consider:
> - filtering stopwords
> - normalizing (lowercase, stemming/lemmatization)
> - min-frequency thresholds to reduce noise
> - character n-grams for languages without whitespace
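
Two of the suggestions above, character n-grams and a minimum-frequency threshold, can be sketched as follows. The helper names `char_ngrams` and `frequent` and the sample word are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Slide a window of n characters across the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def frequent(items, min_count=2):
    """Keep only items seen at least min_count times."""
    return {item: c for item, c in Counter(items).items() if c >= min_count}

print(char_ngrams("banana", 3))            # ['ban', 'ana', 'nan', 'ana']
print(frequent(char_ngrams("banana", 3)))  # {'ana': 2}
```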