# 🚀 Introduction to Natural Language Processing (NLP) for Beginners

Welcome to your first adventure in Natural Language Processing! In this 2-hour session, we'll explore the fundamental techniques used to help computers understand and process human language.

### 📘 Learning Objectives

By the end of this session, you will be able to:

1.  **Understand Tokenization**: Break down text into words and sentences.
2.  **Grasp Word Segmentation**: Learn how to identify words in languages without spaces.
3.  **Apply Stemming**: Reduce words to their root form.
4.  **Perform Text Normalization**: Clean and standardize raw text.
5.  **Use Regular Expressions**: Find patterns and extract information from text.

Let's get started! 💻

## Topic 1: Word and Sentence Tokenization

📄 **Explanation**

Tokenization is the very first step in most NLP tasks. It's like chopping vegetables before you cook! We break down a large text into smaller, manageable pieces called **tokens**.

- **Word Tokenization**: Splits a sentence into individual words. For example, `"NLP is fun"` becomes `['NLP', 'is', 'fun']`.
- **Sentence Tokenization**: Splits a paragraph into individual sentences.

This helps the computer see the text as a list of items it can work with, rather than just one big block of text.
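To make the idea concrete before reaching for a library, here is a minimal sketch of both kinds of tokenization using only Python's built-in `re` module. The function names `simple_word_tokenize` and `simple_sent_tokenize` are made up for this illustration, and the patterns are a rough approximation, not how NLTK works internally.

```python
import re

def simple_word_tokenize(text):
    # A word is a run of word characters, optionally joined by hyphens or
    # apostrophes (easy-to-use, It's); any other non-space character
    # (such as '.' or ',') becomes its own token.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

def simple_sent_tokenize(text):
    # Split at whitespace that follows '.', '!' or '?'.
    # Assumption: the text contains no abbreviations such as "Dr."
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "NLP is fun. It's useful, too!"
word_result = simple_word_tokenize(sample)
sentence_result = simple_sent_tokenize(sample)
print(word_result)
print(sentence_result)
```

Unlike NLTK's `word_tokenize`, this sketch keeps a contraction such as `It's` as a single token, so expect small differences between the two.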

In [12]:
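# First, we need a library called NLTK (Natural Language Toolkit).
import nltk

# This line downloads the 'punkt' package, which contains pre-trained models for tokenization.
# Note: newer NLTK releases look for 'punkt_tab' instead. If word_tokenize raises a
# LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')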

# Now we import the specific functions we need.
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLTK is a powerful library for Natural Language Processing. It provides easy-to-use interfaces."

# Let's tokenize the text into words
word_tokens = word_tokenize(text)
print("✅ Word Tokens:", word_tokens)

# Now, let's tokenize the same text into sentences
sentence_tokens = sent_tokenize(text)
print("✅ Sentence Tokens:", sentence_tokens)

✅ Word Tokens: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'Natural', 'Language', 'Processing', '.', 'It', 'provides', 'easy-to-use', 'interfaces', '.']
✅ Sentence Tokens: ['NLTK is a powerful library for Natural Language Processing.', 'It provides easy-to-use interfaces.']


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 🎯 Practice Task: Your Turn to Tokenize!

Given the sentence `"I'm learning NLP, aren't you?"`, perform word tokenization. See how the tokenizer handles contractions like `I'm` and `aren't`.

In [None]:
my_sentence = "I'm learning NLP, aren't you?"

# Your code here: Use the word_tokenize function on my_sentence
my_word_tokens = word_tokenize(my_sentence)

# Print your results
print("My Tokenized Words:", my_word_tokens)

## Topic 2: Word Segmentation

📄 **Explanation**

**Word Segmentation** is the process of breaking down text into individual words or tokens. Different languages require different approaches:

### Languages Without Spaces (Chinese, Japanese, Thai)
Imagine reading a sentence withnospacesbetweenwords. That's a challenge languages like Chinese, Japanese, and Thai present! These languages need special algorithms to identify word boundaries.

**Example in Chinese:**
- Original: `我喜欢自然语言处理`
- Segmented: `我` (I) / `喜欢` (like) / `自然语言处理` (Natural Language Processing)
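One classic way to find word boundaries without spaces is dictionary-based **forward maximum matching**: at each position, take the longest dictionary word that fits. The sketch below is only an illustration under stated assumptions: the function name `forward_max_match` and the tiny `toy_vocabulary` are hypothetical, and real segmenters rely on much larger dictionaries and statistical models.

```python
def forward_max_match(text, vocabulary):
    """Greedy left-to-right segmentation using the longest dictionary match."""
    max_word_len = max(len(word) for word in vocabulary)
    tokens = []
    position = 0
    while position < len(text):
        longest_possible = min(max_word_len, len(text) - position)
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(longest_possible, 0, -1):
            candidate = text[position:position + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                position += size
                break
    return tokens

# Toy dictionary invented for this example
toy_vocabulary = {"我", "喜欢", "自然", "语言", "处理", "自然语言处理"}
segmented = forward_max_match("我喜欢自然语言处理", toy_vocabulary)
print(" / ".join(segmented))
```

Because the longest match wins, `自然语言处理` stays together as one token even though `自然`, `语言`, and `处理` are also in the dictionary.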

### Languages With Spaces (Urdu, English, Arabic)
Languages like Urdu, English, and Arabic put spaces between words, which makes tokenization simpler. Splitting on spaces gives a reasonable first approximation, although punctuation stays attached to words and multi-word names get split apart.

**Example in Urdu:**
- Original: `میں پاکستان سے محبت کرتا ہوں`
- Segmented: `میں` (I) / `پاکستان` (Pakistan) / `سے` (from) / `محبت` (love) / `کرتا` (do) / `ہوں` (am)

**Example in English:**
- Original: `I love natural language processing`
- Segmented: `I` / `love` / `natural` / `language` / `processing`

### Why is this important?
Without proper word segmentation, computers can't understand individual words and their meanings. This is the foundation for all other NLP tasks like translation, sentiment analysis, and information extraction.

**Key Tools:**
- **Chinese:** jieba library
- **Urdu/English/Arabic:** Simple split() method or specialized libraries
- **Japanese:** MeCab or Janome
- **Thai:** PyThaiNLP

In [14]:
# Whitespace splitting: a simple baseline for Urdu tokenization
# No installation needed; it uses only Python's built-in str.split()
# Note: multi-word names such as اسلام آباد come out as two separate tokens

text = "میں پاکستان سے محبت کرتا ہوں"  # Example Urdu text
seg_list = text.split()
print("Demonstration with Urdu text: " + "/ ".join(seg_list))

# Additional examples
print("\n--- More Examples ---")
examples = [
    "اسلام آباد پاکستان کا دارالحکومت ہے",
    "کیا حال ہے آپ کا",
    "یہ ایک خوبصورت دن ہے"
]

for ex in examples:
    tokens = ex.split()
    print(f"{ex}")
    print(f"Tokens: {' / '.join(tokens)}\n")

Demonstration with Urdu text: میں/ پاکستان/ سے/ محبت/ کرتا/ ہوں

--- More Examples ---
اسلام آباد پاکستان کا دارالحکومت ہے
Tokens: اسلام / آباد / پاکستان / کا / دارالحکومت / ہے

کیا حال ہے آپ کا
Tokens: کیا / حال / ہے / آپ / کا

یہ ایک خوبصورت دن ہے
Tokens: یہ / ایک / خوبصورت / دن / ہے
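Notice that `اسلام آباد` (Islamabad) was split into two tokens above, because `split()` only looks at spaces. One possible fix is to re-join known multi-word names after splitting. The sketch below assumes a small hand-made lookup set (`KNOWN_COMPOUNDS`) and a hypothetical helper `merge_compounds`; specialized Urdu tools handle this far more thoroughly.

```python
# Hypothetical lookup of two-word names that should stay together
KNOWN_COMPOUNDS = {("اسلام", "آباد")}

def merge_compounds(tokens, compounds):
    """Re-join adjacent token pairs that appear in the compounds set."""
    merged = []
    index = 0
    while index < len(tokens):
        pair = tuple(tokens[index:index + 2])
        if pair in compounds:
            merged.append(" ".join(pair))
            index += 2  # skip both halves of the compound
        else:
            merged.append(tokens[index])
            index += 1
    return merged

sentence = "اسلام آباد پاکستان کا دارالحکومت ہے"
raw_tokens = sentence.split()
merged_tokens = merge_compounds(raw_tokens, KNOWN_COMPOUNDS)
print(len(raw_tokens), "tokens before merging,", len(merged_tokens), "after")
```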



## Topic 3: Stemming

📄 **Explanation**

Computers are very literal. They see "run", "runs", and "running" as three completely different words. **Stemming** is a technique that chops off the ends of words to get to the basic root or **stem**. It works on spelling alone, so an irregular form like "ran" is left unchanged.

- `studies` -> `studi`
- `studying` -> `studi`

This helps us group similar words together. It's not always perfect (notice `studi` isn't a real word), but it's fast and simple! The most famous algorithm for this is the **Porter Stemmer**.
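To see what "chopping off the ends" means in practice, here is a deliberately naive suffix stripper. It is **not** the Porter algorithm: the suffix list and the name `naive_stem` are assumptions made up for this sketch, and Porter applies many more carefully ordered rules.

```python
# Checked in order, so longer suffixes come before shorter ones
SUFFIXES = ("ions", "ing", "ion", "ed", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["connection", "connected", "connecting", "connections", "running"]:
    print(word, "->", naive_stem(word))
```

The `connect` family collapses nicely, but `running` becomes `runn`. Handling doubled consonants like this is one of the extra rules a real stemmer adds.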

In [None]:
# We'll use the PorterStemmer from our NLTK library
from nltk.stem import PorterStemmer

# First, create a stemmer object
stemmer = PorterStemmer()

words_to_stem = ["running", "runner", "runs", "easily", "fairly"]

# Let's loop through the words and stem each one
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("✅ Stemmed words:", stemmed_words)

**Fun Fact**: Stemming can sometimes be too aggressive. For example, the Porter Stemmer reduces both `"university"` and `"universe"` to the same stem `"univers"`, which can be confusing!

### 🎯 Practice Task: Stem Your Own Words

Create a list of words: `['connection', 'connected', 'connecting', 'connections']`. Apply the Porter Stemmer to see their common root.

In [10]:
# The stemmer object is already created from the previous cell
my_words = ['connection', 'connected', 'connecting', 'connections']

# Your code here: Create a new list with the stemmed versions of my_words
my_stemmed_words = [stemmer.stem(w) for w in my_words]

# Print the results
print("My stemmed list:", my_stemmed_words)

My stemmed list: ['connect', 'connect', 'connect', 'connect']


## Topic 4: Text Normalization (Putting It All Together)

📄 **Explanation**

Raw text from the real world is messy! It has capital letters, punctuation, numbers, and common but unimportant words (like "a", "the", "is"). **Text Normalization** is the process of cleaning and standardizing text to make it easier for a computer to analyze.

A typical normalization pipeline includes:
1.  **Case Folding**: Converting all text to lowercase.
2.  **Punctuation Removal**: Getting rid of characters like `!`, `.`, and `?`.
3.  **Stop Word Removal**: Removing common words that don't add much meaning (e.g., 'the', 'a', 'in').
4.  **Tokenization & Stemming**: We've already learned these!
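The first three steps can be wrapped into one small reusable function. This is a minimal sketch using only the standard library: the name `normalize` and the tiny `SMALL_STOP_WORDS` set are assumptions for illustration (NLTK's English list is much longer), tokenization is a plain whitespace split, and stemming is left out.

```python
import re

# Hypothetical mini stop word list, just for this example
SMALL_STOP_WORDS = {"the", "a", "an", "is", "are", "in", "over"}

def normalize(text, stop_words=SMALL_STOP_WORDS):
    lowered = text.lower()                            # 1. case folding
    letters_only = re.sub(r"[^a-z\s]", "", lowered)   # 2. drop punctuation and digits
    tokens = letters_only.split()                     # whitespace tokenization
    return [t for t in tokens if t not in stop_words] # 3. stop word removal

normalized = normalize("The quick brown FOXES are JUMPING over 10 lazy dogs!")
print(normalized)
```

Keeping each step on its own line makes it easy to reorder or swap steps, for example replacing `split()` with a proper tokenizer.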

In [13]:
import re # This library is for regular expressions, great for finding patterns!
from nltk.corpus import stopwords

# Download the list of stopwords from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

raw_text = "The quick brown FOXES are JUMPING over 10 lazy dogs!"
print("Original Text:", raw_text)

# 1. Lowercasing
text = raw_text.lower()
print("\nStep 1 (Lowercase):", text)

# 2. Removing punctuation and numbers (using a simple regex)
text = re.sub(r'[^a-z\s]', '', text) # Keep only letters and spaces
print("Step 2 (Punctuation/Number Removal):", text)

# 3. Tokenization
tokens = word_tokenize(text)
print("Step 3 (Tokenization):", tokens)

# 4. Stop Word Removal
filtered_tokens = [word for word in tokens if word not in stop_words]
print("Step 4 (Stop Word Removal):", filtered_tokens)

# 5. Stemming
final_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("\n✅ Final Normalized Tokens:", final_tokens)

Original Text: The quick brown FOXES are JUMPING over 10 lazy dogs!

Step 1 (Lowercase): the quick brown foxes are jumping over 10 lazy dogs!
Step 2 (Punctuation/Number Removal): the quick brown foxes are jumping over  lazy dogs
Step 3 (Tokenization): ['the', 'quick', 'brown', 'foxes', 'are', 'jumping', 'over', 'lazy', 'dogs']
Step 4 (Stop Word Removal): ['quick', 'brown', 'foxes', 'jumping', 'lazy', 'dogs']

✅ Final Normalized Tokens: ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Qasim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 🎯 Practice Task: Normalize a Review

You have a customer review: `"This product is AMAZING!!! I bought 2 and I will be buying more."`

Normalize this review by:
1.  Converting it to lowercase.
2.  Tokenizing it.
3.  Removing stop words.

In [None]:
review = "This product is AMAZING!!! I bought 2 and I will be buying more."

# 1. Convert to lowercase
lower_review = review.lower()

# 2. Tokenize the lowercase review
review_tokens = word_tokenize(lower_review)

# 3. Remove stop words
# Your code here: Create a new list containing only the tokens that are NOT in stop_words
# (token.isalpha() additionally drops punctuation and numbers such as '!' and '2')
filtered_review = [token for token in review_tokens if token not in stop_words and token.isalpha()]

print("Cleaned Review Tokens:", filtered_review)

## Topic 5: Regular Expressions (Regex)

📄 **Explanation**

A **Regular Expression** (or regex) is a powerful tool for finding patterns in text. Think of it as a super-powered search command. You can use it to find things like email addresses, phone numbers, or any specific sequence of characters you can imagine.

Some basic patterns:
- `\d` matches any digit (0-9).
- `\s` matches any whitespace character (space, tab, newline).
- `\w` matches any word character (letters, numbers, and underscore).
- `+` means "one or more" of the preceding character.
- `*` means "zero or more" of the preceding character.
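Each of these building blocks can be tried on its own with `re.findall`. The sample strings below are made up for illustration.

```python
import re

digits = re.findall(r"\d", "a1b22")                 # every single digit
digit_runs = re.findall(r"\d+", "a1b22")            # runs of one or more digits
whitespace = re.findall(r"\s", "a b\tc\nd")         # space, tab, newline
word_runs = re.findall(r"\w+", "snake_case 42!")    # letters, digits, underscore
one_or_more = re.findall(r"ab+", "a ab abbb")       # 'a' followed by at least one 'b'
zero_or_more = re.findall(r"ab*", "a ab abbb")      # 'a' followed by any number of 'b'

print(digits, digit_runs, whitespace, word_runs, one_or_more, zero_or_more, sep="\n")
```

Comparing the last two lines shows the difference between `+` and `*`: only `*` accepts the lone `a`.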

In [1]:
import re

text = "The price of the product is $49.99. The event is on 10/20/2025. Contact support@example.com for help."

# Let's find a price that looks like $XX.XX
# \$ matches the dollar sign, \d+ matches one or more digits, \. matches the dot,
# and \d{2} matches exactly two digits (the cents)
prices = re.findall(r'\$\d+\.\d{2}', text)
print(f"✅ Prices found: {prices}")

✅ Prices found: ['$49.99']


In [3]:
# Now let's find a date in the format XX/XX/XXXX
dates = re.findall(r'\d{2}/\d{2}/\d{4}', text)
print(f"✅ Dates found: {dates}")

✅ Dates found: ['10/20/2025']


In [4]:
# And finally, let's find the email address
# \S+ matches one or more non-whitespace characters
# (simple, but it would also grab any punctuation that touches the address)
emails = re.findall(r'\S+@\S+', text)
print(f"✅ Emails found: {emails}")

✅ Emails found: ['support@example.com']


🧪 **Experiment!** Try changing the text string to include other prices or dates and see if the regex can find them.

### 🎯 Practice Task: Find the Phone Numbers

Write a regular expression to find all phone numbers in the format `XXX-XXX-XXXX` from the text below.

In [5]:
phone_text = "You can reach me at 123-456-7890 or my colleague at 987-654-3210. Do not call 555-1111."

# Your regex pattern here. Hint: \d{3} matches exactly three digits.
pattern = r'\d{3}-\d{3}-\d{4}'

# Use re.findall() with your pattern and the phone_text
found_numbers = re.findall(pattern, phone_text)

print("Found phone numbers:", found_numbers)

Found phone numbers: ['123-456-7890', '987-654-3210']


## Information Extraction with Regular Expressions

### 📊 Common Patterns for Data Extraction

| **Pattern Type** | **Regex Pattern** | **Example Match** | **Use Case** |
|------------------|-------------------|-------------------|--------------|
| **Email Address** | `\S+@\S+\.\S+` | support@example.com | Contact information extraction |
| **Price (USD)** | `\$\d+\.\d{2}` | $49.99 | E-commerce, financial documents |
| **Phone (US)** | `\d{3}-\d{3}-\d{4}` | 555-123-4567 | Contact details |
| **Date (MM/DD/YYYY)** | `\d{2}/\d{2}/\d{4}` | 10/20/2025 | Event scheduling, records |
| **URL** | `https?://\S+` | https://example.com | Web scraping, link extraction |
| **Hashtag** | `#\w+` | #NLP #AI | Social media analysis |
| **Mention** | `@\w+` | @username | Social media monitoring |
| **Credit Card** | `\d{4}-\d{4}-\d{4}-\d{4}` | 1234-5678-9012-3456 | Payment processing |
| **ZIP Code (US)** | `\b\d{5}\b` | 12345 | Address parsing |
| **IP Address** | `\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}` | 192.168.1.1 | Network logs, security |
| **Time (HH:MM)** | `\d{2}:\d{2}` | 14:30 | Scheduling, timestamps |
| **Percentage** | `\d+\.?\d*%` | 75.5% | Reports, analytics |
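These patterns check the *shape* of the text, not whether the value makes sense. The IP address pattern, for instance, also accepts `999.300.1.1`. A common approach is to let the regex find candidates and then validate them in plain Python. The helper name `find_valid_ipv4` below is hypothetical, and the `\b` word boundaries are an addition to the table's pattern.

```python
import re

# Same shape as the table's IP pattern, with word boundaries added
IP_CANDIDATE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def find_valid_ipv4(text):
    """Return candidates whose four parts are all in the range 0-255."""
    valid = []
    for candidate in IP_CANDIDATE.findall(text):
        octets = candidate.split(".")
        if all(0 <= int(octet) <= 255 for octet in octets):
            valid.append(candidate)
    return valid

log_line = "Login from 192.168.1.1 failed; 999.300.1.1 is not a real address."
valid_ips = find_valid_ipv4(log_line)
print(valid_ips)
```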

### 🇵🇰 Pakistan-Specific Patterns

| **Pattern Type** | **Regex Pattern** | **Example Match** | **Use Case** |
|------------------|-------------------|-------------------|--------------|
| **CNIC** | `\d{5}-\d{7}-\d{1}` | 12345-1234567-1 | Identity verification |
| **Mobile Number** | `\+92-?3\d{9}` | +92-3001234567 | Contact information |
| **Landline** | `\d{2,4}-\d{7,8}` | 051-1234567 | Business contacts |
| **Postal Code** | `\b\d{5}\b` | 46000 | Address validation |

In [10]:
import re

text = """
Contact us at support@example.com or call 555-123-4567.
Product price: $49.99. Visit https://example.com
Event date: 10/20/2025 at 14:30
Follow us @CompanyName #GreatDeals
Pakistani CNIC: 12345-1234567-1
Mobile: +92-3001234567
"""

# Extract different patterns
emails = re.findall(r'\S+@\S+\.\S+', text)
phones = re.findall(r'\d{3}-\d{3}-\d{4}', text)
prices = re.findall(r'\$\d+\.\d{2}', text)
dates = re.findall(r'\d{2}/\d{2}/\d{4}', text)
urls = re.findall(r'https?://\S+', text)
hashtags = re.findall(r'#\w+', text)
mentions = re.findall(r'@\w+', text)
cnic = re.findall(r'\d{5}-\d{7}-\d{1}', text)
pk_mobile = re.findall(r'\+92-?3\d{9}', text)

print(f"📧 Emails: {emails}")
print(f"📞 Phones: {phones}")
print(f"💰 Prices: {prices}")
print(f"📅 Dates: {dates}")
print(f"🔗 URLs: {urls}")
print(f"#️⃣ Hashtags: {hashtags}")
print(f"@ Mentions: {mentions}")
print(f"🆔 CNIC: {cnic}")
print(f"📱 PK Mobile: {pk_mobile}")

📧 Emails: ['support@example.com']
📞 Phones: ['555-123-4567']
💰 Prices: ['$49.99']
📅 Dates: ['10/20/2025']
🔗 URLs: ['https://example.com']
#️⃣ Hashtags: ['#GreatDeals']
@ Mentions: ['@example', '@CompanyName']
🆔 CNIC: ['12345-1234567-1']
📱 PK Mobile: ['+92-3001234567']
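Look closely at the mentions: `@example` is really part of the email address, so `@\w+` produced a false positive. One way to avoid this, shown here as a sketch rather than the only fix, is a negative lookbehind that rejects any `@` sitting directly after a word character. The sample string is a shortened version of the text above.

```python
import re

sample = "Contact support@example.com or follow us @CompanyName #GreatDeals"

# Loose pattern: any '@' followed by word characters
loose_mentions = re.findall(r"@\w+", sample)

# Stricter pattern: (?<!\w) means "not preceded by a word character",
# so the '@' inside support@example.com is skipped
strict_mentions = re.findall(r"(?<!\w)@\w+", sample)

print("Loose: ", loose_mentions)
print("Strict:", strict_mentions)
```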


## 🎉 Final Revision Assignment 🎉

Congratulations on making it through the fundamentals of NLP! It's time to combine everything you've learned. These tasks are for you to practice at home to solidify your knowledge.

---

### Task 1: Clean Up a Messy Sentence

Given the sentence below, use a regular expression to remove all the numbers and special characters, leaving only letters and spaces.

In [None]:
messy_sentence = "*** HELLO!! 123 This is a TEST 456 sentence... please clean me! 789 ***"

# Your code here
cleaned_sentence = re.sub(r'[^a-zA-Z\s]', '', messy_sentence)
print(cleaned_sentence)

### Task 2: Full Normalization Pipeline

Take your `cleaned_sentence` from Task 1 and perform the following steps:
1.  Convert it to lowercase.
2.  Tokenize it into words.
3.  Remove all English stop words.

In [None]:
# Your code here (you can reuse the cleaned_sentence from the cell above)
lower_sentence = cleaned_sentence.lower()
tokens = word_tokenize(lower_sentence)
final_words = [word for word in tokens if word not in stop_words]

print(final_words)

### Task 3: Stem the Final Words

Now, take the list of `final_words` you created in Task 2 and apply the Porter Stemmer to each word.

In [None]:
# Your code here
stemmed_final_words = [stemmer.stem(word) for word in final_words]
print(stemmed_final_words)

### Task 4: Extract Information from a Bio

You have a short biography. Your goal is to extract the person's email and the year they were born using regular expressions.

In [None]:
bio = "John Doe, born in 1995, is a data scientist. You can contact him at john.doe@email.com for work inquiries. His old email was j.doe@university.edu."

# Find the year (4 digits)
year_pattern = r'\b\d{4}\b'  # \b keeps the match from landing inside a longer number
year = re.search(year_pattern, bio)
print("Year of birth:", year.group(0) if year else "Not found")

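# Find all email addresses
# A character class is used instead of \S+ so the sentence-ending period after ".edu" is not captured
email_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'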
emails = re.findall(email_pattern, bio)
print("Emails found:", emails)

### Task 5: Sentence Boundary Detection

The following text has a tricky abbreviation. Use NLTK's `sent_tokenize` to see if it can correctly identify the two sentences.

In [None]:
tricky_text = "Dr. Smith lives in New York. He is a doctor."

# Your code here
sentences = sent_tokenize(tricky_text)
print(f"Found {len(sentences)} sentences:")
print(sentences)
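To see why `Dr.` is tricky, compare a naive rule ("split after every period") with a version that knows a few abbreviations. This is a minimal sketch using only the standard library. The names `naive_split` and `abbreviation_aware_split` and the short `ABBREVIATIONS` set are assumptions for illustration, not how NLTK's tokenizer is implemented.

```python
import re

# Hypothetical, deliberately short abbreviation list
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}

def naive_split(text):
    # Split at whitespace that follows '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text.strip())

def abbreviation_aware_split(text, abbreviations=ABBREVIATIONS):
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        last_word = buffer.split()[-1].lower()
        if last_word in abbreviations:
            continue  # the period belongs to an abbreviation, keep collecting
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

tricky = "Dr. Smith lives in New York. He is a doctor."
naive_result = naive_split(tricky)
aware_result = abbreviation_aware_split(tricky)
print(len(naive_result), naive_result)
print(len(aware_result), aware_result)
```

The naive rule reports three "sentences" because it stops after `Dr.`, while the abbreviation-aware version finds the expected two.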

## ✅ Well Done!

You've successfully covered the core building blocks of Natural Language Processing. These pre-processing steps are crucial for almost any advanced AI task involving text, from building chatbots to analyzing customer sentiment. Keep practicing and exploring!