# Day 1: NLP Introduction + Tokenization & Stop Words
**The AI Engineer Course 2026 - Sections 20 & 21a**

**Student:** Natruja

**Date:** Thursday, February 12, 2026

---

## Learning Objectives
1. Understand what Natural Language Processing (NLP) is
2. Learn about tokenization (word and sentence level)
3. Understand stop words and why they matter
4. Apply these concepts to real text

## Setup: Install and Import Required Libraries

In [None]:
# Install NLTK (Natural Language Toolkit)
import subprocess
import sys

# Install NLTK
subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk", "-q"])

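# Download required NLTK data
import nltk

# nltk.download() returns False on failure, so check before reporting success.
# (Later cells still run if the data already exists in a local nltk_data folder.)
# Recent NLTK releases may also need 'punkt_tab' for word_tokenize.
packages = ["punkt", "stopwords"]
failed = [pkg for pkg in packages if not nltk.download(pkg, quiet=True)]

if failed:
    print(f"✗ NLTK installed, but these downloads failed: {failed}")
else:
    print("✓ NLTK installed and data downloaded successfully!")

✗ NLTK installed, but these downloads failed: ['punkt', 'stopwords']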


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>


In [2]:
# Import necessary libraries
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import nltk

print("✓ All imports successful!")

✓ All imports successful!


## What is Natural Language Processing (NLP)?

Natural Language Processing is a branch of artificial intelligence that focuses on the interaction between computers and human language. It enables machines to:
- Understand text and spoken words
- Extract meaning from language
- Generate human-like responses

### Real-World Applications:
- Sentiment Analysis (Twitter mood tracking)
- Machine Translation (Google Translate)
- Chatbots (Customer service)
- Text Classification (Spam detection)
- Named Entity Recognition (Finding people, places, companies in text)

## Tokenization: Breaking Text into Pieces

**Tokenization** is the process of breaking text into smaller units called **tokens**.

### Types of Tokenization:
1. **Word Tokenization**: Split text into individual words
2. **Sentence Tokenization**: Split text into sentences
3. **Character Tokenization**: Split text into characters (less common)
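
The three granularities can be sketched with the standard library alone. The regular expressions below are simplified stand-ins chosen for illustration; they are not the rules NLTK's `word_tokenize` and `sent_tokenize` use.

```python
import re

text = "NLP is fun. Tokens matter!"

# Word level: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence level: split on whitespace that follows '.', '!' or '?'
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character level: every character, spaces included
char_tokens = list(text)

print(word_tokens)      # ['NLP', 'is', 'fun', '.', 'Tokens', 'matter', '!']
print(sentence_tokens)  # ['NLP is fun.', 'Tokens matter!']
print(len(char_tokens)) # 26
```

The same text yields 7, 2, or 26 tokens depending on the level chosen.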

### Why is Tokenization Important?
- Models operate on discrete units rather than raw strings, so text must first be split into tokens that can be counted or mapped to numbers
- Tokenization is the first step in almost all NLP tasks
- Different tokenization methods suit different problems
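
To make the "structured data" point concrete, here is a minimal sketch of mapping tokens to integer ids, the kind of input many models ultimately consume. The `build_vocab` helper and the whitespace tokenizer are hypothetical simplifications, not part of NLTK.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

# Whitespace splitting stands in for a real tokenizer here
tokens = "the cat sat on the mat".split()
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(vocab)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(ids)    # [0, 1, 2, 3, 0, 4]
```

Both occurrences of "the" map to the same id, which is only possible once the text has been split into tokens.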

## EXAMPLE: Word Tokenization

In [4]:
# Sample text
text = "Hello! How are you doing today? Natural Language Processing is fascinating!"

print("Original text:")
print(text)
print("\n" + "="*60)

# Word tokenization
tokens = word_tokenize(text)

print("\nTokenized words:")
print(tokens)
print(f"\nTotal number of tokens: {len(tokens)}")

Original text:
Hello! How are you doing today? Natural Language Processing is fascinating!

============================================================

Tokenized words:
['Hello', '!', 'How', 'are', 'you', 'doing', 'today', '?', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']

Total number of tokens: 14
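
For comparison, a plain `str.split()` on the same text shows why a dedicated tokenizer is useful. This sketch uses only built-in string methods.

```python
text = "Hello! How are you doing today? Natural Language Processing is fascinating!"

# Whitespace splitting keeps punctuation glued to the neighbouring word
naive_tokens = text.split()
print(naive_tokens)
print(len(naive_tokens))  # 11, versus 14 from word_tokenize

# Tokens that still carry punctuation
stuck = [t for t in naive_tokens if not t.isalnum()]
print(stuck)  # ['Hello!', 'today?', 'fascinating!']
```

With whitespace splitting, "Hello!" and "Hello" would count as different words, whereas `word_tokenize` separates the punctuation into its own tokens.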


## EXAMPLE: Sentence Tokenization

In [None]:
# Sample text with multiple sentences
text = "Hello! How are you? I'm learning NLP. It's amazing. What do you think?"

print("Original text:")
print(text)
print("\n" + "="*60)

# Sentence tokenization
sentences = sent_tokenize(text)

print("\nTokenized sentences:")
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

print(f"\nTotal number of sentences: {len(sentences)}")
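
Sentence splitting is harder than it looks. The sketch below uses a naive regular expression (an assumption for illustration, not what `sent_tokenize` does) and shows how abbreviations break it.

```python
import re

text = "Dr. Lee joined at 5 p.m. today. She presented the results."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for i, sentence in enumerate(naive_sentences, 1):
    print(f"{i}. {sentence}")

print(len(naive_sentences))  # 4 pieces, although the text has 2 sentences
```

`sent_tokenize` relies on the Punkt model, which is designed to cope with abbreviations like these, though its results can still vary with the input.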

## Stop Words: Common Words to Ignore

**Stop words** are common words that appear frequently in text but often don't carry much meaning.

### Examples of Stop Words:
- Articles: "a", "an", "the"
- Pronouns: "he", "she", "it", "I", "you"
- Prepositions: "in", "at", "on", "by"
- Conjunctions: "and", "but", "or"
- Auxiliary verbs: "is", "am", "are"
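
A quick frequency count illustrates how much of a text these words can occupy. The sentence and the three-word stop set below are made up for illustration; they are not NLTK's list.

```python
from collections import Counter

text = "the cat and the dog sat on the mat and the rug"
tokens = text.split()

# Count how often each token occurs
counts = Counter(tokens)
print(counts.most_common(2))  # [('the', 4), ('and', 2)]

# Share of tokens covered by a tiny illustrative stop set
stop_subset = {"the", "and", "on"}
stop_count = sum(1 for t in tokens if t in stop_subset)
print(f"{stop_count} of {len(tokens)} tokens are stop words")  # 7 of 12
```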

### Why Remove Stop Words?
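- Removing them reduces noise in text analysis
- Fewer tokens means less computation
- The remaining tokens focus on meaningful content
- Bag-of-words models (for example, TF-IDF classifiers) often work better without them, while models that depend on word order or negation may need them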
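
Removal is not always safe. Stop word lists often contain negations (NLTK's English list includes "not"), so filtering can change what a sentence says. The sketch below uses a tiny hand-written stop set as a stand-in and shows one possible workaround.

```python
# Tiny illustrative subset, not NLTK's full list
stop_words = {"the", "was", "is", "a", "not"}

review = "the movie was not good"
tokens = review.split()

# Blind filtering drops the negation
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['movie', 'good']

# One option: remove negations from the stop set before filtering
negations = {"not", "no", "nor"}
custom_stop_words = stop_words - negations
kept = [t for t in tokens if t not in custom_stop_words]
print(kept)  # ['movie', 'not', 'good']
```

Whether to keep negations depends on the task; for sentiment analysis they usually matter.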

## EXAMPLE: Identifying and Removing Stop Words

In [5]:
# Get English stop words
stop_words = set(stopwords.words('english'))

print(f"Number of English stop words: {len(stop_words)}")
print(f"\nFirst 20 stop words: {sorted(stop_words)[:20]}")

# Sample text
text = "The quick brown fox jumps over the lazy dog"
tokens = word_tokenize(text.lower())

print("\n" + "="*60)
print(f"Original tokens: {tokens}")
print(f"Total tokens: {len(tokens)}")

# Remove stop words
filtered_tokens = [word for word in tokens if word not in stop_words]

print(f"\nFiltered tokens (stop words removed): {filtered_tokens}")
print(f"Total tokens after filtering: {len(filtered_tokens)}")

Number of English stop words: 198

First 20 stop words: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

============================================================
Original tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Total tokens: 9

Filtered tokens (stop words removed): ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
Total tokens after filtering: 6
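
The stop word entries shown above are all lowercase, which is why the example calls `text.lower()` before filtering. A minimal sketch of what happens without it, using a two-word stand-in stop set and whitespace splitting:

```python
stop_words = {"the", "over"}  # illustrative subset
text = "The quick brown fox jumps over the lazy dog"

# Without lowercasing, the capitalized "The" slips through
without_lower = [t for t in text.split() if t not in stop_words]
print(without_lower)  # ['The', 'quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

# With lowercasing, both occurrences are removed
with_lower = [t for t in text.lower().split() if t not in stop_words]
print(with_lower)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```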


## YOUR TURN: Exercise 1 - Word Tokenization

Tokenize the following text into words. Count how many tokens you get.

In [9]:
# Exercise 1: Tokenize this text
exercise_text = "Python is great for data science. Machine learning and AI are the future!"

# TODO: Use word_tokenize() to tokenize exercise_text
tokens = word_tokenize(exercise_text.lower())

print(f"Tokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")

Tokens: ['python', 'is', 'great', 'for', 'data', 'science', '.', 'machine', 'learning', 'and', 'ai', 'are', 'the', 'future', '!']
Number of tokens: 15


## YOUR TURN: Exercise 2 - Sentence Tokenization

Break the following paragraph into sentences.

In [10]:
# Exercise 2: Tokenize into sentences
paragraph = "NLP is fascinating. It helps machines understand human language. You should learn it! Don't you agree?"

# TODO: Use sent_tokenize() to split into sentences
sentences = sent_tokenize(paragraph)

print(f"Number of sentences: {len(sentences)}")
print("\nEach sentence:")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent}")

Number of sentences: 4

Each sentence:
1. NLP is fascinating.
2. It helps machines understand human language.
3. You should learn it!
4. Don't you agree?


## YOUR TURN: Exercise 3 - Stop Words Removal

Remove stop words from the tokenized text.

In [12]:
# Exercise 3: Remove stop words
text = "The quick brown fox jumps over the lazy dog on a sunny day"
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words('english'))

print(f"Original tokens: {tokens}")
print(f"Original count: {len(tokens)}")

# TODO: Filter out stop words using list comprehension
filtered_tokens = [token for token in tokens if token not in stop_words]

print(f"\nFiltered tokens: {filtered_tokens}")
print(f"Filtered count: {len(filtered_tokens)}")
print(f"\nStop words removed: {len(tokens) - len(filtered_tokens)}")

Original tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'on', 'a', 'sunny', 'day']
Original count: 13

Filtered tokens: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'sunny', 'day']
Filtered count: 8

Stop words removed: 5


## YOUR TURN: Exercise 4 (Harder) - Complete Pipeline

Create a complete text processing pipeline: tokenize words, convert to lowercase, and remove stop words.

In [21]:
# Exercise 4: Complete preprocessing pipeline
raw_text = "Artificial Intelligence and Machine Learning are transforming industries. These technologies are shaping the future!"

# Step 1: Convert to lowercase
# Step 2: Tokenize
# Step 3: Remove stop words
# Step 4: Print results

# TODO: Fill in the steps
text_lower = raw_text.lower()
tokens = word_tokenize(text_lower)
stop_words = set(stopwords.words('english'))  # set gives fast membership checks
cleaned_tokens = [token for token in tokens if token not in stop_words]

print(f"Original: {raw_text}")
print(f"\nCleaned tokens: {cleaned_tokens}")
print(f"Reduction: {len(tokens)} → {len(cleaned_tokens)} tokens")

Original: Artificial Intelligence and Machine Learning are transforming industries. These technologies are shaping the future!

Cleaned tokens: ['artificial', 'intelligence', 'machine', 'learning', 'transforming', 'industries', '.', 'technologies', 'shaping', 'future', '!']
Reduction: 16 → 11 tokens
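
Note that `.` and `!` survive in the cleaned tokens because punctuation is not in the stop word list. One possible extension is to wrap the steps in a function with an optional punctuation filter. The `preprocess` helper, the regex tokenizer, and the small stop set below are assumptions for illustration, standing in for `word_tokenize` and NLTK's list.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def preprocess(text, stop_words, drop_punctuation=True):
    """Lowercase, tokenize, and filter a text (hypothetical helper)."""
    # Regex stand-in for word_tokenize: words, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    if drop_punctuation:
        tokens = [t for t in tokens if t not in PUNCTUATION]
    return [t for t in tokens if t not in stop_words]

stop_words = {"and", "are", "these", "the"}  # illustrative subset
raw_text = ("Artificial Intelligence and Machine Learning are transforming "
            "industries. These technologies are shaping the future!")

cleaned = preprocess(raw_text, stop_words)
with_punct = preprocess(raw_text, stop_words, drop_punctuation=False)

print(cleaned)
print(len(with_punct), len(cleaned))  # 11 9
```

With `drop_punctuation=False` the result matches the 11 tokens from the exercise; dropping punctuation leaves 9.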


## CHALLENGE PROJECT: Text Analysis Tool

Create a simple text analysis tool that:
1. Takes any input text
2. Tokenizes it into words and sentences
3. Shows word statistics (with and without stop words)
4. Displays the most important words (non-stop words)

In [22]:
def analyze_text(text):
    """Analyze text and show tokenization statistics."""
    
    # Tokenize into sentences and words
    sentences = sent_tokenize(text)
    words = word_tokenize(text.lower())
    
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    meaningful_words = [w for w in words if w.isalnum() and w not in stop_words]
    
    # Print analysis
    print("="*60)
    print("TEXT ANALYSIS REPORT")
    print("="*60)
    print(f"\nOriginal Text: {text}")
    print(f"\nStatistics:")
    print(f"  • Sentences: {len(sentences)}")
    print(f"  • Total tokens (incl. punctuation): {len(words)}")
    print(f"  • Meaningful words (non-stop): {len(meaningful_words)}")
    print(f"  • Tokens removed (stop words + punctuation): {len(words) - len(meaningful_words)}")
    print(f"\nMeaningful words: {meaningful_words}")
    print("="*60)

# Test with sample text
sample = "Machine learning is a subset of artificial intelligence. It enables computers to learn from data and improve over time."
analyze_text(sample)

============================================================
TEXT ANALYSIS REPORT
============================================================

Original Text: Machine learning is a subset of artificial intelligence. It enables computers to learn from data and improve over time.

Statistics:
  • Sentences: 2
  • Total tokens (incl. punctuation): 21
  • Meaningful words (non-stop): 11
  • Tokens removed (stop words + punctuation): 10

Meaningful words: ['machine', 'learning', 'subset', 'artificial', 'intelligence', 'enables', 'computers', 'learn', 'data', 'improve', 'time']
============================================================
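
The report lists the meaningful words but does not rank them. One simple way to approximate "most important" is raw frequency, sketched here with `collections.Counter`. The `top_words` helper and the sample list are hypothetical, and frequency is only one possible notion of importance.

```python
from collections import Counter

def top_words(words, n=3):
    """Return the n most frequent words as (word, count) pairs (hypothetical helper)."""
    return Counter(words).most_common(n)

# Stand-in for the meaningful_words list built inside analyze_text
meaningful_words = ["data", "model", "data", "training", "model", "data"]

print(top_words(meaningful_words, n=2))  # [('data', 3), ('model', 2)]
```

In the sample report above every meaningful word occurs once, so a ranking only becomes informative on longer texts.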


## Summary

### Key Takeaways:
- **Tokenization** is breaking text into smaller units (words or sentences)
- **Word tokenization** helps analyze text at the word level
- **Sentence tokenization** helps understand document structure
- **Stop words** are common words that don't add much meaning
- Removing stop words can improve NLP model performance

### What's Next:
Tomorrow we'll learn about **Stemming and Lemmatization** - techniques to normalize words to their root forms!

---

*Created for Natruja's NLP study plan*