# Module 1, Week 2: Assignment - Stemming, Lemmatization, and Advanced Tokenization

### Objective
This assignment is designed to:
1. Introduce you to stemming and lemmatization.
2. Teach you advanced tokenization techniques using NLTK and SpaCy.
3. Help you compare different preprocessing techniques and their impact on text data.

---

### Instructions:
- Use the provided text or your own sample text for analysis.
- Perform stemming, lemmatization, and advanced tokenization on the text.
- Analyze the results and reflect on the differences between these techniques.

---

## Step 1: Import Required Libraries

In [None]:
# Import Required Libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy

# Download Required NLTK Data Files
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases load the tokenizer from 'punkt_tab'
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load SpaCy Model (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

## Step 2: Define the Input Text
You can use the sample text below or replace it with your own text.

In [None]:
# Sample Raw Text
raw_text = """
Text preprocessing is a critical step in natural language processing tasks. 
It involves converting text into a format that is easy for machines to understand. 
Key steps include stemming, lemmatization, and tokenization.
"""

# Print Raw Text
print("Raw Text:\n")
print(raw_text)

## Step 3: Tokenization with NLTK
Tokenize the text into words using NLTK.

In [None]:
# Tokenize Text Using NLTK
nltk_tokens = word_tokenize(raw_text)
print("\nNLTK Tokenized Words:\n")
print(nltk_tokens)
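
To judge what `word_tokenize` adds, it can help to compare it against a naive baseline. The sketch below is an assumption for illustration only: `simple_tokenize` is a hypothetical helper built on a regular expression, not part of NLTK or SpaCy, and it does not reproduce how either library tokenizes.

```python
import re


def simple_tokenize(text):
    """Naive baseline: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Key steps include stemming, lemmatization, and tokenization."
print(simple_tokenize(sample))

# Contractions show the limits of the baseline: the apostrophe is split off
# as its own token instead of staying attached to a meaningful unit.
print(simple_tokenize("don't"))
```

On plain sentences like the sample, the baseline and `word_tokenize` tend to agree. They diverge on contractions: the baseline yields `don`, `'`, `t`, while `word_tokenize` splits "don't" into `do` and `n't`.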

## Step 4: Stemming
Perform stemming using NLTK's PorterStemmer.

In [None]:
# Initialize PorterStemmer
stemmer = PorterStemmer()

# Apply Stemming
stemmed_words = [stemmer.stem(word) for word in nltk_tokens]
print("\nStemmed Words:\n")
print(stemmed_words)
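
A quick way to see what stemming does is to run the stemmer on a few related word forms. The following is a minimal sketch; the word list is an assumption chosen for illustration and is not part of the assignment text. Note that `PorterStemmer` needs no downloaded data files and lowercases its input by default.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word forms (not from the assignment text)
words = ["running", "runs", "studies", "studying"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
for word, stem in stems.items():
    print(f"{word:>10} -> {stem}")
```

Related forms collapse to a shared stem, but the stem is not always a dictionary word: "studies" and "studying" both reduce to "studi". This is the main contrast to look for when you compare against lemmatization in the next step.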

## Step 5: Lemmatization
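Perform lemmatization using NLTK's WordNetLemmatizer and SpaCy. Note that `WordNetLemmatizer.lemmatize` treats each word as a noun unless you pass a `pos` argument, so verb forms such as "involves" may come back unchanged, whereas SpaCy assigns a part of speech from context before lemmatizing.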

In [None]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Apply Lemmatization with NLTK
nltk_lemmatized = [lemmatizer.lemmatize(word) for word in nltk_tokens]
print("\nNLTK Lemmatized Words:\n")
print(nltk_lemmatized)

# Apply Lemmatization with SpaCy
doc = nlp(raw_text)
spacy_lemmatized = [token.lemma_ for token in doc]
print("\nSpaCy Lemmatized Words:\n")
print(spacy_lemmatized)

## Step 6: Advanced Tokenization with SpaCy
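Perform advanced tokenization using SpaCy, which separates punctuation and special characters into their own tokens. This step reuses the `doc` object created in Step 5. Whitespace such as the newlines in the sample text may also appear as tokens.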

In [None]:
# Advanced Tokenization with SpaCy
spacy_tokens = [token.text for token in doc]
print("\nAdvanced Tokenization with SpaCy:\n")
print(spacy_tokens)

## Step 7: Compare Results
Analyze and compare the output of stemming, lemmatization, and tokenization techniques. Discuss the differences and their implications.

In [None]:
# Compare Results
print("\nComparison of Techniques:\n")
print("Raw Text:", raw_text)
print("\nStemmed Words:", stemmed_words)
print("\nNLTK Lemmatized Words:", nltk_lemmatized)
print("\nSpaCy Lemmatized Words:", spacy_lemmatized)
print("\nNLTK Tokenized Words:", nltk_tokens)
print("\nSpaCy Tokenized Words:", spacy_tokens)
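
Printing whole lists makes the differences hard to spot. One option is to pair each token with its processed form and keep only the pairs that changed. The sketch below assumes a hypothetical helper, `changed_pairs`, and uses hand-written sample lists. In the notebook you would pass `nltk_tokens` with `stemmed_words` or `nltk_lemmatized`, which are aligned one-to-one. The SpaCy lists may differ in length from the NLTK ones, so compare `spacy_tokens` with `spacy_lemmatized` instead.

```python
def changed_pairs(tokens, processed):
    """Return (original, processed) pairs where processing altered the token."""
    if len(tokens) != len(processed):
        raise ValueError("token lists must be the same length to align them")
    return [(t, p) for t, p in zip(tokens, processed) if t != p]


# Hypothetical sample values for illustration; replace with the notebook's lists.
tokens = ["Key", "steps", "include", "stemming"]
stems = ["key", "step", "includ", "stem"]

for original, stem in changed_pairs(tokens, stems):
    print(f"{original:>10} -> {stem}")
```

Running the helper once for stems and once for lemmas gives two short lists that can be read side by side when answering the reflection questions.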

## Reflection Questions
1. What differences did you observe between stemming and lemmatization?
2. Which tokenization technique (NLTK vs. SpaCy) provided better results for your text, and by what criterion did you judge them?
3. How might the choice of preprocessing technique impact downstream NLP tasks like classification or summarization?
4. Experiment with different texts. How do the results vary for complex sentences or domain-specific text?