# 📘 Introduction to Natural Language Processing (NLP)

Welcome to your first class on **Natural Language Processing**! 🚀

In this 2-hour session, we will explore how computers understand human language. We will cover the linguistic basics, the history of NLP, and get hands-on with Python to clean and prepare text data.

### 🎯 Learning Objectives:
1.  Understand what NLP is and its goals.
2.  Learn the 4 pillars of linguistics: Syntax, Semantics, Pragmatics, and Discourse.
3.  Explore the evolution of NLP (The 3 Curves).
4.  Perform practical data pre-processing using Python libraries **NLTK** and **SpaCy**.

Let's get started! 🧠

---

## 1. What is Natural Language Processing? 🤖💬

**Natural Language Processing (NLP)** is a field of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a meaningful way.

It bridges the gap between how humans communicate and how computers understand data.

### The Goal
The ultimate goal is to read, decipher, and make sense of human language to extract valuable insights from unstructured data like emails, social media, and news articles.

### 📈 Common Applications
You use NLP every day! Here are some examples:
*   **Search Engines:** Google uses NLP to understand what you are looking for.
*   **Virtual Assistants:** Siri and Alexa process your voice commands.
*   **Machine Translation:** Google Translate converts text between languages.
*   **Chatbots:** Automated customer support agents.
*   **Sentiment Analysis:** Determining if a review is positive or negative.

### 🧠 Practice Task 1

Think of one technology you used today that might use NLP. Double-click this cell and write it down below:

**My Answer:** [Type your answer here]

---

## 2. The Pillars of Language: How Computers Analyze Text 🏛️

To process language like a human, a machine needs to understand it at different levels. In linguistics, we categorize these into four key areas:

1.  **Syntax (Grammar):** The rules governing the structure of sentences.
    *   *Example:* "The cat sat on the mat" is grammatically correct. "Sat the mat on cat the" is not.

2.  **Semantics (Literal Meaning):** The meaning of words and sentences independent of context.
    *   *Example:* "The cat sat on the mat" and "On the mat sat the cat" have different structures (syntax) but the same literal meaning (semantics).

3.  **Pragmatics (Context & Intent):** How context influences meaning. It looks beyond the literal definition to understand sarcasm, irony, or requests.

4.  **Discourse (The Bigger Picture):** How sentences connect to form a coherent conversation or text.

### 💡 Example in Action

Consider this conversation:
> **Person A:** "Can you pass the salt?"
> **Person B:** "Yeah, sure." *(Passes the salt)*

Let's analyze this:
*   **Syntax:** Both sentences follow English grammar rules.
*   **Semantics:** Literally, Person A is asking if Person B has the physical ability to pass the salt.
*   **Pragmatics:** Person B understands that this isn't a question about ability, but a **request** to perform an action.
*   **Discourse:** Person B's response is directly related to Person A's question, creating a coherent exchange.
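
The gap between the literal reading and the intended one can be made concrete with a toy rule. This is only a sketch for illustration, not how real systems resolve intent: the names `REQUEST_OPENERS` and `interpret` are made up for this example, and the rule covers just a handful of English phrasings.

```python
# Toy illustration: literal (semantic) reading vs. intended (pragmatic) reading.
# REQUEST_OPENERS and interpret() are hypothetical names for this sketch.
REQUEST_OPENERS = ("can you", "could you", "would you", "will you")

def interpret(utterance):
    """Return a (literal_reading, pragmatic_reading) pair for an utterance."""
    normalized = utterance.lower().strip()
    is_question = normalized.endswith("?")
    literal = "question" if is_question else "statement"
    # Polite requests are often phrased as questions about ability.
    if is_question and normalized.startswith(REQUEST_OPENERS):
        pragmatic = "request for action"
    else:
        pragmatic = literal
    return literal, pragmatic

print(interpret("Can you pass the salt?"))
print(interpret("Is the salt on the table?"))
```

Both sentences are literally questions, but only the first is treated as a request. A fixed phrase list like this breaks quickly on real conversations, which is why pragmatics is hard for machines.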

### 🧠 Practice Task 2 (Quick Quiz)

Which of the following best describes the role of **Pragmatics** in NLP? (Delete the wrong answers below)

a) Analyzing the grammatical structure of sentences.
b) Understanding the literal meaning of words.
c) Interpreting meaning based on context and intent.
d) Identifying the root form of words.

---

## 3. The Evolution of NLP: The Three Curves 📉📈

NLP has evolved through three overlapping "curves" or eras:

1.  **The Syntactics Curve (The Past):** Early systems worked at the level of word forms and grammar, using hand-written rules and simple word-count representations such as "bag-of-words" models. They were rigid and struggled with the ambiguity of real-world language.

2.  **The Semantics Curve (The Present):** The era of Machine Learning and Deep Learning. Models use statistics to learn the meaning of words from data. A hallmark of this is **Word Embeddings**, which represent words as numerical vectors so that words with similar meanings end up close together.

3.  **The Pragmatics Curve (The Future):** Focuses on context and intent. Large Language Models (LLMs) like GPT are pushing boundaries here, aiming for systems that can reason and engage in natural conversation.
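
Two ideas from these eras are easy to try in plain Python. The sketch below first shows that a bag-of-words model ignores word order, then compares hand-made word vectors with cosine similarity. The vectors in `toy_vectors` are invented values for illustration only; real embeddings are learned from large amounts of text and have many more dimensions.

```python
from collections import Counter
import math

def bag_of_words(sentence):
    """Count word occurrences, ignoring word order."""
    return Counter(sentence.lower().split())

# Word order is lost, so these two sentences look identical to the model.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True

# Hand-made 3-dimensional "embeddings" (invented values, for illustration only).
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print("cat vs dog:", cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print("cat vs car:", cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

With these made-up numbers, "cat" scores closer to "dog" than to "car", which is the behavior real embeddings aim for.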

---

## 4. Practical NLP: Data Pre-processing 🧹

Raw text data is often "noisy" and unstructured. Before an AI model can understand it, we must clean it. This is called **Data Pre-processing**.

We will use two popular Python libraries:
1.  **NLTK (Natural Language Toolkit):** Great for teaching, research, and understanding the basics.
2.  **SpaCy:** Modern, fast, and designed for production use.

### ⚙️ Setup (Run this cell once)
First, let's prepare our environment by downloading the data NLTK needs.

In [None]:
# Import the NLTK library
import nltk

# Download necessary NLTK data (only needs to be done once)
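# 'punkt' is for splitting sentences and words
# 'punkt_tab' is the same tokenizer data in the format newer NLTK releases look for
# 'stopwords' is a list of common words like 'the', 'is', 'a'
print("⬇️ Downloading NLTK data...")
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)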
print("✅ Download complete!")

---

## 5. Noise Removal with NLTK 🧼

**Noise** in text includes:
*   **Stopwords:** Common words (the, a, is) that carry little semantic meaning.
*   **Punctuation:** Marks like commas and periods.

Removing noise helps the AI focus on the important words.
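
A quick word count shows why. In the sketch below, `TINY_STOPWORDS` is a small hand-picked set used only for illustration (NLTK's English list is much longer), and the sample sentence is made up.

```python
from collections import Counter

# A tiny hand-picked stopword set for illustration only.
TINY_STOPWORDS = {"the", "a", "is", "on", "and", "of"}

text = "the cat sat on the mat and the dog sat on the rug"
words = text.split()

# Count every word, then count only the words that are not stopwords.
raw_counts = Counter(words)
clean_counts = Counter(w for w in words if w not in TINY_STOPWORDS)

print("Most common (raw):    ", raw_counts.most_common(2))
print("Most common (cleaned):", clean_counts.most_common(2))
```

Before cleaning, the most frequent word is "the", which says nothing about the topic. After cleaning, the content words rise to the top.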

Let's look at the step-by-step process using NLTK:
1.  **Tokenization:** Splitting text into individual words (tokens).
2.  **Lowercasing:** Making everything lowercase so "The" and "the" are treated the same.
3.  **Stopword & Punctuation Removal.**

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Our raw text
text = "This is an example sentence, showing off the stop words filtration!"
print("ORIGINAL:", text)

# 1. Tokenization (splitting into words)
tokens = word_tokenize(text)

# 2. Lowercasing
tokens = [word.lower() for word in tokens]

# 3. Prepare Stopwords and Punctuation lists
stop_words = set(stopwords.words('english'))
punct = set(string.punctuation)

# 4. Filtering (keeping words NOT in stop_words AND NOT in punctuation)
filtered_tokens = []
for word in tokens:
    if word not in stop_words and word not in punct:
        filtered_tokens.append(word)

print("CLEANED: ", filtered_tokens)
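
The order of the steps matters: stopword lists are lowercase, so filtering before lowercasing would let "This" or "The" slip through. The plain-Python sketch below makes that visible without NLTK. `simple_tokenize`, `clean`, and `TINY_STOPWORDS` are hypothetical stand-ins for this example, and the regular expression is far cruder than NLTK's tokenizer.

```python
import re
import string

# Tiny illustrative stopword set (all lowercase, like NLTK's list).
TINY_STOPWORDS = {"this", "is", "an", "the", "off"}
PUNCT = set(string.punctuation)

def simple_tokenize(text):
    """Rough stand-in for a tokenizer: runs of word characters, or single symbols."""
    return re.findall(r"\w+|[^\w\s]", text)

def clean(text, lowercase=True):
    """Tokenize, optionally lowercase, then drop stopwords and punctuation."""
    tokens = simple_tokenize(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in TINY_STOPWORDS and t not in PUNCT]

sentence = "This is an example sentence, showing off the stop words filtration!"
print(clean(sentence))                   # lowercased before filtering
print(clean(sentence, lowercase=False))  # "This" slips through the filter
```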

### 🧠 Practice Task 3

In the cell below, change the `my_text` variable to a sentence of your choice, then run the cell to see how NLTK cleans it.

In [None]:
# 🧪 Try changing this sentence!
my_text = "NLP is amazing, but it requires a lot of data processing."

# --- Processing logic (same as above) ---
tokens = word_tokenize(my_text)
tokens = [word.lower() for word in tokens]
stop_words = set(stopwords.words('english'))
punct = set(string.punctuation)

# Using a 'list comprehension' for cleaner code
final_clean_words = [word for word in tokens if word not in stop_words and word not in punct]

print(f"Original: {my_text}")
print(f"Result:   {final_clean_words}")
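
One thing to watch for when you try your own sentences: `set(string.punctuation)` contains only single characters, so a token made of several punctuation marks (for example `...` or `--`, which tokenizers may emit as one token) is not removed by `word not in punct`. A possible fix, sketched with a hypothetical helper `is_punctuation`, is to check every character of the token:

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Example tokens written by hand; a real tokenizer's output may differ.
tokens = ["hello", "!", "...", "nlp", "--", "fun"]

single_char_filter = [t for t in tokens if t not in PUNCT_CHARS]
whole_token_filter = [t for t in tokens if not is_punctuation(t)]

print(single_char_filter)  # "..." and "--" are still present
print(whole_token_filter)
```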

---

## 6. Noise Removal with SpaCy

In [4]:
# Run this once to install SpaCy and download its small English model
%pip install spacy
!python -m spacy download en_core_web_sm


Now, let's see how SpaCy handles noise removal. Notice how much shorter the code is!

In [5]:
import spacy

# 1. Load the English language model
nlp = spacy.load("en_core_web_sm")

text = "This is an example sentence, showing off the stop words filtration!"
print("ORIGINAL:", text)

# 2. Process the text with SpaCy (it tokenizes automatically)
doc = nlp(text)

# 3. Noise Removal in one line using SpaCy's attributes
# We check if a token is NOT a stopword AND is NOT punctuation
filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]

print("CLEANED: ", filtered_tokens)
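
If `spacy.load` fails with an error such as `[E050] Can't find model 'en_core_web_sm'`, the model (or SpaCy itself) is probably not installed in the environment your notebook is using. The check below is a minimal sketch that uses only the standard library. It assumes the model was installed the usual way, as an importable Python package, and `spacy_model_available` is a hypothetical helper name.

```python
import importlib.util

def spacy_model_available(model_name="en_core_web_sm"):
    """Return True if both SpaCy and the named model package can be found."""
    spacy_installed = importlib.util.find_spec("spacy") is not None
    model_installed = importlib.util.find_spec(model_name) is not None
    return spacy_installed and model_installed

if spacy_model_available():
    print("SpaCy and en_core_web_sm are installed.")
else:
    print("Missing SpaCy or the model. Install with:")
    print("  pip install spacy")
    print("  python -m spacy download en_core_web_sm")
```

After installing, restart the notebook kernel and re-run the cells from the top so that `nlp` is defined before later cells use it.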


### 🧠 Practice Task 4

Create a sentence with **lots** of punctuation and run it through SpaCy in the cell below.

In [6]:
# Write a sentence with many symbols: ! @ # $ , .
messy_text = "Hello!!! Note: NLP is fun... #AI @Python."

# Process it
doc = nlp(messy_text)

# Clean it
clean_list = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(clean_list)


---

## 🎓 Final Revision Assignment

Great job today! To wrap up the last 20 minutes, complete these tasks to reinforce what you've learned.

### Task 1: Concept Check
Explain the difference between **Stemming** and **Lemmatization**. How would each process the word "better"?
> *Write your answer here:*

### Task 2: Case Study
Imagine you are building a customer service chatbot. Why is understanding **Pragmatics** crucial? Give an example of a customer query where literal meaning is different from intended meaning.
> *Write your answer here:*

### Task 3: NLTK Coding Challenge
Write code below to process the sentence: `"The quick brown fox jumps over the lazy dog."`
1. Tokenize it.
2. Convert to lowercase.
3. Remove stopwords and punctuation.

In [None]:
# Your NLTK solution here
challenge_text = "The quick brown fox jumps over the lazy dog."

# Hint: Use the code from Section 5 as a reference!


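### Task 4: SpaCy Entity Recognition
Run the cell below, then change the text to see which names, organizations, and places SpaCy recognizes.

In [None]: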
import spacy
nlp = spacy.load("en_core_web_sm")

# 🧪 Change this text to include different names and places
text = "Google was founded by Larry Page and Sergey Brin in California."

doc = nlp(text)

print(f"Analyzing: '{text}'\n")

# Loop through recognized entities
for entity in doc.ents:
    print(f"Entity: {entity.text} | Label: {entity.label_}")

---
## 🎉 Congratulations!

You have completed the **Introduction to NLP** class. You now understand the linguistic pillars, the history of the field, and how to use Python to clean text data.

Keep practicing, and happy coding! 🐍