# Module 1, Week 1, Assignment 1: Tokenization and Preprocessing

In this assignment, you will explore the fundamentals of text preprocessing, a critical step in any NLP workflow. Follow the instructions provided in the notebook and complete the activities to solidify your understanding.

## Objectives
- Understand the importance of text preprocessing in NLP.
- Learn to tokenize text into sentences and words.
- Remove punctuation and special characters from text.
- Perform basic normalization by converting text to lowercase.
- Use NLTK to remove stopwords from the text.

---

## Instructions
1. Run the provided code cells to see examples of preprocessing techniques.
2. Complete the **TODO** sections to practice your skills.
3. Analyze and compare the raw vs. preprocessed text in the final activity.

## Step 1: Import Required Libraries
First, we will import the libraries needed for tokenization and text preprocessing. NLTK will be the primary library for this assignment.

In [None]:
# Import Libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

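# Download required NLTK data files
nltk.download('punkt')  # Tokenizer models used by sent_tokenize and word_tokenize
nltk.download('punkt_tab')  # Recent NLTK releases load this resource instead of 'punkt'
nltk.download('stopwords')  # Stopword lists for stopword removal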

## Step 2: Tokenization
Tokenization is the process of breaking down text into smaller units, such as words or sentences. Let's practice tokenizing a sample paragraph.

In [None]:
# Example Text
text = "Natural Language Processing (NLP) is a fascinating field of artificial intelligence. It focuses on the interaction between computers and human language. Tokenization is a crucial step in NLP pipelines!"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:\n", sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWord Tokenization:\n", words)

# TODO: Tokenize another sample paragraph of your choice
# Add your own text below and apply sentence and word tokenization
# text_custom = "Your text here"
# sentences_custom = sent_tokenize(text_custom)
# words_custom = word_tokenize(text_custom)
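
To make the idea of "breaking text into units" concrete, here is a minimal regex-based sketch that uses only the standard library. It is not how NLTK tokenizes: `simple_sent_tokenize` and `simple_word_tokenize` are hypothetical helper names, and the splitting rules are deliberately naive assumptions (for instance, an abbreviation such as "Dr." would be treated as a sentence end, a case that NLTK's Punkt tokenizer is designed to handle).

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Tokenization is crucial. It splits text!"
demo_sentences = simple_sent_tokenize(sample)
demo_words = simple_word_tokenize(sample)

print(demo_sentences)
print(demo_words)
```

Comparing this output with `sent_tokenize` and `word_tokenize` on the same text is a quick way to see what a trained tokenizer handles beyond simple pattern matching.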

## Step 3: Removing Punctuation and Special Characters
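Punctuation and special characters often carry little information for tasks that treat text as a bag of words, so removing them simplifies the text. Keep two limitations in mind: `string.punctuation` contains only ASCII punctuation, and deleting punctuation joins the parts of contractions and hyphenated words (for example, "don't" becomes "dont").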

In [None]:
# Remove Punctuation
text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
print("Text without punctuation:\n", text_no_punct)

# TODO: Remove punctuation from your custom text (from Step 2)
# text_custom_no_punct = text_custom.translate(str.maketrans('', '', string.punctuation))
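
Because `string.punctuation` is ASCII-only, curly quotes, ellipses, and similar characters survive the `translate` call above. The sketch below shows one possible alternative based on Unicode character categories from the standard `unicodedata` module. The helper name `strip_punct_unicode` is hypothetical, and treating every category starting with "P" (punctuation) or "S" (symbol) as removable is an assumption that may be too aggressive for some tasks.

```python
import string
import unicodedata

def strip_punct_unicode(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*),
    then collapse any repeated whitespace left behind."""
    kept = (ch for ch in text if not unicodedata.category(ch).startswith(("P", "S")))
    return " ".join("".join(kept).split())

# Escapes: \u201c and \u201d are curly double quotes, \u2026 is an ellipsis,
# \u2019 is a curly apostrophe.
sample_punct = "\u201cTokenization\u201d is crucial\u2026 isn\u2019t it?"

ascii_only = sample_punct.translate(str.maketrans("", "", string.punctuation))
unicode_aware = strip_punct_unicode(sample_punct)

print(ascii_only)      # only the ASCII '?' is removed
print(unicode_aware)   # the non-ASCII marks are removed as well
```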

## Step 4: Lowercasing and Normalization
Normalization converts text to a consistent form so that variants of the same word are treated as one token. Lowercasing is the simplest case: "Tokenization" and "tokenization" become the same string.

In [None]:
# Convert Text to Lowercase
text_lower = text_no_punct.lower()
print("Lowercased Text:\n", text_lower)

# TODO: Apply lowercasing to your custom text (from Step 3)
# text_custom_lower = text_custom_no_punct.lower()
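
Lowercasing is enough for the English sample in this assignment, but it does not cover every kind of variation. As a hedged sketch of what fuller normalization might look like, the example below combines Unicode normalization (`unicodedata.normalize`) with `str.casefold`, both from the standard library. The helper name `normalize_text` and the choice of the NFKC form are assumptions made for illustration, not part of the assignment.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKC Unicode normalization, then case folding."""
    return unicodedata.normalize("NFKC", text).casefold()

# casefold() is more aggressive than lower(): the German sharp s becomes "ss".
folded = normalize_text("Stra\u00dfe")

# "e" followed by a combining acute accent and the precomposed "\u00e9"
# look identical on screen but are different strings until normalized.
same_after_normalizing = normalize_text("Cafe\u0301") == normalize_text("Caf\u00e9")

print(folded)
print(same_after_normalizing)
```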

## Step 5: Removing Stopwords
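Stopwords are common words (e.g., "and", "the", "is") that often do not add much meaning to a sentence. Removing them can help focus on the more meaningful words in the text. NLTK's stopword list is lowercase, so apply this step to lowercased text; otherwise a capitalized "The" would not match.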

In [None]:
# Define Stopword List
stop_words = set(stopwords.words('english'))

# Remove Stopwords
words_no_stopwords = [word for word in word_tokenize(text_lower) if word not in stop_words]
print("Text without stopwords:\n", words_no_stopwords)

# TODO: Remove stopwords from your custom text (from Step 4)
# words_custom_no_stopwords = [word for word in word_tokenize(text_custom_lower) if word not in stop_words]
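
Stopword removal is not always harmless. NLTK's English list includes negations such as "not" and "no", so filtering with the full list can flip the apparent meaning of a sentence. The sketch below illustrates this with a small hand-written stopword set standing in for `stopwords.words('english')`, so that it runs without any NLTK data. The names `demo_stop_words`, `negations`, and `remove_stopwords` are hypothetical.

```python
# Hand-written stand-in for the NLTK list (an assumption made for this sketch).
demo_stop_words = {"the", "is", "a", "this", "and", "not", "no"}

# Words we choose to keep because they carry the negation.
negations = {"not", "no"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "this movie is not good".split()

all_removed = remove_stopwords(tokens, demo_stop_words)
negation_kept = remove_stopwords(tokens, demo_stop_words - negations)

print(all_removed)     # the negation is lost
print(negation_kept)   # the negation survives
```

The same set subtraction works on the `stop_words` set defined above if you want to keep specific words for a task such as sentiment analysis.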

## Step 6: Analyze Raw vs. Preprocessed Text
Compare the original text with the preprocessed version to understand the impact of these techniques.

In [None]:
print("Original Text:\n", text)
print("\nPreprocessed Text:\n", ' '.join(words_no_stopwords))

# TODO: Print the original and preprocessed versions of your custom text
# print("Custom Original Text:\n", text_custom)
# print("\nCustom Preprocessed Text:\n", ' '.join(words_custom_no_stopwords))
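
Printing both versions side by side shows the change qualitatively. To put numbers on it, a small helper can compare token counts and vocabulary sizes. The sketch below is one way to do that; `compare_tokens` is a hypothetical helper, and the toy token lists are made up for illustration rather than taken from the sample text above.

```python
def compare_tokens(raw_tokens, processed_tokens):
    """Summarize how preprocessing changed the token list."""
    return {
        "raw_count": len(raw_tokens),
        "processed_count": len(processed_tokens),
        "raw_vocab": len(set(raw_tokens)),
        "processed_vocab": len(set(processed_tokens)),
        "removed": len(raw_tokens) - len(processed_tokens),
    }

raw_demo = "The cat sat on the mat .".split()
processed_demo = ["cat", "sat", "mat"]

stats = compare_tokens(raw_demo, processed_demo)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In the notebook, passing `word_tokenize(text)` and `words_no_stopwords` to the same helper would give the corresponding figures for the sample paragraph.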

### Congratulations! 🎉
You have completed the tokenization and preprocessing assignment. These preprocessing steps are foundational in NLP workflows and will help you tackle more advanced topics in the future.

---

## Reflection:
- How did removing punctuation and stopwords change the text?
- Were there any cases where removing stopwords might not have been ideal?

Feel free to experiment with more text samples and explore additional preprocessing techniques!