# 🧪 Module-02: Challenge Lab

🧪 Challenge Lab: Advanced Stop Word Removal & Word Frequencies

🎯 Learning Goals

By the end of this lab, you will be able to:
* Clean text (lowercase + remove punctuation)
* Tokenize text into words
* Remove stop words using NLTK and a custom list
* Wrap logic into simple Python functions
* Compute basic word frequency counts before and after stop word removal

🧱 Students must complete the missing sections marked as TODO.

✅ When they complete and run the code, they should see:
* A cleaned version of the text (lowercase, no punctuation)
* A list of all tokens (including "the", "is", etc.)
* A list of filtered tokens with stop words and some custom words removed
* The top 5 most frequent words before and after stop word removal
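
The four results listed above can be previewed on a single short sentence. The following is a minimal sketch using only the standard library; the sentence and the tiny `MINI_STOP_WORDS` set are hypothetical stand-ins chosen for illustration, not the NLTK stop word list used in the lab itself.

```python
import string
from collections import Counter

# Hypothetical mini stop word set, standing in for NLTK's English list
MINI_STOP_WORDS = {"the", "is", "on", "a"}

# Hypothetical sample sentence, not the lab text
sentence = "The cat is on the mat. The mat is red!"

# Lowercase, then delete every character listed in string.punctuation
cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))

# Split on whitespace to get word tokens
tokens = cleaned.split()

# Keep only the tokens that are not stop words
filtered = [t for t in tokens if t not in MINI_STOP_WORDS]

print(cleaned)
print(tokens)
print(filtered)
print(Counter(tokens).most_common(1))    # most frequent word before filtering
print(Counter(filtered).most_common(1))  # most frequent word after filtering
```

Before filtering, the most frequent token is the stop word "the"; after filtering, a content word ("mat") takes its place. The lab code produces the same kind of shift on the longer sample text.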

In [None]:
import nltk
from nltk.corpus import stopwords
import string
from collections import Counter

# ------------------------------
# 1. DOWNLOAD REQUIRED RESOURCES
# ------------------------------

# TODO: Download the NLTK stopwords once on a new computer
nltk.download('stopwords')

# ------------------------------
# 2. SAMPLE TEXT
# ------------------------------

text = """
Natural Language Processing enables computers to understand human language.
It is used in chatbots, search engines, translation tools, and many other applications.
Language processing can be challenging because human language is complex and full of nuance.
"""

# ------------------------------
# 3. TEXT CLEANING FUNCTION
# ------------------------------

def clean_text(raw_text):
    """
    Convert text to lowercase and remove punctuation.
    Returns a cleaned string.
    """
    # TODO: Convert to lowercase
    lower_text = raw_text.lower()

    # TODO: Remove punctuation using str.translate and string.punctuation
    cleaned_text = lower_text.translate(str.maketrans('', '', string.punctuation))

    return cleaned_text

# ------------------------------
# 4. TOKENIZATION FUNCTION
# ------------------------------

def tokenize(text):
    """
    Split text into a list of word tokens on whitespace (spaces, tabs, newlines).
    """
    # TODO: Split the text on whitespace
    tokens = text.split()
    return tokens

# ------------------------------
# 5. STOP WORD REMOVAL FUNCTION
# ------------------------------

def remove_stop_words(tokens):
    """
    Remove NLTK English stop words AND some custom stop words.
    Returns a new list of filtered tokens.
    """
    # Get standard English stop words from NLTK
    stop_words = set(stopwords.words('english'))

    # TODO: Add your own custom stop words here (e.g., common domain words)
    custom_stop_words = {"language", "processing"}

    # Combine both sets
    all_stop_words = stop_words.union(custom_stop_words)

    # TODO: Keep only the tokens that are NOT in all_stop_words
    filtered = [word for word in tokens if word.lower() not in all_stop_words]

    return filtered

# ------------------------------
# 6. FREQUENCY COUNT FUNCTION
# ------------------------------

def count_frequencies(tokens):
    """
    Return a Counter (a dict subclass) mapping each word to its count.
    """
    # TODO: Use Counter to count how many times each word appears
    freq = Counter(tokens)
    return freq

# ------------------------------
# 7. MAIN LOGIC
# ------------------------------

# Step 1: Clean the raw text
cleaned_text = clean_text(text)

# Step 2: Tokenize before stop word removal
original_tokens = tokenize(cleaned_text)

# Step 3: Remove stop words (standard + custom)
filtered_tokens = remove_stop_words(original_tokens)

# Step 4: Count frequencies before and after
original_freq = count_frequencies(original_tokens)
filtered_freq = count_frequencies(filtered_tokens)

# ------------------------------
# 8. PRINT RESULTS
# ------------------------------

print("=== CLEANED TEXT ===")
print(cleaned_text)

print("\n=== ORIGINAL TOKENS (with stop words) ===")
print(original_tokens)

print("\n=== FILTERED TOKENS (stop words removed) ===")
print(filtered_tokens)

print("\n=== TOP 5 WORDS BEFORE STOP WORD REMOVAL ===")
print(original_freq.most_common(5))

print("\n=== TOP 5 WORDS AFTER STOP WORD REMOVAL ===")
print(filtered_freq.most_common(5))



=== CLEANED TEXT ===

natural language processing enables computers to understand human language
it is used in chatbots search engines translation tools and many other applications
language processing can be challenging because human language is complex and full of nuance


=== ORIGINAL TOKENS (with stop words) ===
['natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'human', 'language', 'it', 'is', 'used', 'in', 'chatbots', 'search', 'engines', 'translation', 'tools', 'and', 'many', 'other', 'applications', 'language', 'processing', 'can', 'be', 'challenging', 'because', 'human', 'language', 'is', 'complex', 'and', 'full', 'of', 'nuance']

=== FILTERED TOKENS (stop words removed) ===
['natural', 'enables', 'computers', 'understand', 'human', 'used', 'chatbots', 'search', 'engines', 'translation', 'tools', 'many', 'applications', 'challenging', 'human', 'complex', 'full', 'nuance']

=== TOP 5 WORDS BEFORE STOP WORD REMOVAL ===
[('language', 4), ('processing

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
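
One way to check what the filter actually dropped is to subtract the two frequency counters. The sketch below assumes the token lists printed in the output above and copies them in directly so that it runs without NLTK; `removed_freq` is a hypothetical name introduced here for illustration.

```python
from collections import Counter

# Token lists copied from the printed output above
original_tokens = [
    'natural', 'language', 'processing', 'enables', 'computers', 'to',
    'understand', 'human', 'language', 'it', 'is', 'used', 'in', 'chatbots',
    'search', 'engines', 'translation', 'tools', 'and', 'many', 'other',
    'applications', 'language', 'processing', 'can', 'be', 'challenging',
    'because', 'human', 'language', 'is', 'complex', 'and', 'full', 'of',
    'nuance',
]
filtered_tokens = [
    'natural', 'enables', 'computers', 'understand', 'human', 'used',
    'chatbots', 'search', 'engines', 'translation', 'tools', 'many',
    'applications', 'challenging', 'human', 'complex', 'full', 'nuance',
]

original_freq = Counter(original_tokens)
filtered_freq = Counter(filtered_tokens)

# Counter subtraction keeps only positive counts, so the result holds
# exactly the occurrences that stop word removal dropped
removed_freq = original_freq - filtered_freq

print(sum(removed_freq.values()))  # total number of tokens removed
print(sorted(removed_freq))        # distinct removed words, alphabetically
```

Words that survive the filter, such as "human", do not appear in `removed_freq` at all, while the custom stop words "language" and "processing" appear with their full original counts.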
