# Module 1, Week 1, Assignment 1: Tokenization and Preprocessing

In this assignment, you will explore the fundamentals of text preprocessing, a critical step in any NLP workflow. Follow the instructions provided in the notebook and complete the activities to solidify your understanding.

## Objectives
- Understand the importance of text preprocessing in NLP.
- Learn to tokenize text into sentences and words.
- Remove punctuation and special characters from text.
- Perform basic normalization by converting text to lowercase.
- Use NLTK to remove stopwords from the text.

---

## Instructions:
1. Run the provided code cells to see examples of preprocessing techniques.
2. Complete the **TODO** sections to practice your skills.
3. Analyze and compare the raw vs. preprocessed text in the final activity.

## Step 1: Import Required Libraries
First, we will import the libraries needed for tokenization and text preprocessing. NLTK will be the primary library for this assignment.

In [1]:
# Import Libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import string

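# Download required NLTK data files
nltk.download('punkt')  # Punkt models used by sent_tokenize and word_tokenize
nltk.download('stopwords')  # Stopword lists, used in Step 5
# Note: recent NLTK releases may look for 'punkt_tab' instead of 'punkt';
# if tokenization raises a LookupError, run nltk.download('punkt_tab') as well.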

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kaleem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kaleem\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 2: Tokenization
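Tokenization is the process of breaking text into smaller units, such as sentences or words. Note that `word_tokenize` keeps punctuation marks such as "(" and "." as separate tokens rather than discarding them. Let's practice tokenizing a sample paragraph.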

In [2]:
# Example Text
text = "Natural Language Processing (NLP) is a fascinating field of artificial intelligence. It focuses on the interaction between computers and human language. Tokenization is a crucial step in NLP pipelines!"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:\n", sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWord Tokenization:\n", words)

Sentence Tokenization:
 ['Natural Language Processing (NLP) is a fascinating field of artificial intelligence.', 'It focuses on the interaction between computers and human language.', 'Tokenization is a crucial step in NLP pipelines!']

Word Tokenization:
 ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field', 'of', 'artificial', 'intelligence', '.', 'It', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'language', '.', 'Tokenization', 'is', 'a', 'crucial', 'step', 'in', 'NLP', 'pipelines', '!']
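
To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's `re` module. It is a simplified stand-in, not NLTK's implementation: the function names `naive_word_tokenize` and `naive_sent_tokenize` and the sample sentence are assumptions made for illustration.

```python
import re

def naive_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Hypothetical sample containing an abbreviation
sample = "Dr. Smith studies NLP. Tokenization matters!"

word_tokens = naive_word_tokenize(sample)
sentence_pieces = naive_sent_tokenize(sample)

print(word_tokens)
print(sentence_pieces)  # the naive rule wrongly splits after "Dr."
```

The naive sentence splitter breaks the sample after "Dr.", which is the kind of case a trained sentence tokenizer such as the one behind `sent_tokenize` is designed to handle better.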


##### Exercise

In [3]:
# TODO: Tokenize another sample paragraph of your choice
# Add your own text below and apply sentence and word tokenization
text_custom = "Machine learning is a subset of artificial intelligence. It allows computers to learn from data without explicit programming. Techniques such as supervised learning and reinforcement learning are used in various applications."
sentences_custom = sent_tokenize(text_custom)
print("Sentence Tokenization:\n", sentences_custom)
words_custom = word_tokenize(text_custom)
print("\nWord Tokenization:\n", words_custom)

Sentence Tokenization:
 ['Machine learning is a subset of artificial intelligence.', 'It allows computers to learn from data without explicit programming.', 'Techniques such as supervised learning and reinforcement learning are used in various applications.']

Word Tokenization:
 ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence', '.', 'It', 'allows', 'computers', 'to', 'learn', 'from', 'data', 'without', 'explicit', 'programming', '.', 'Techniques', 'such', 'as', 'supervised', 'learning', 'and', 'reinforcement', 'learning', 'are', 'used', 'in', 'various', 'applications', '.']


## Step 3: Removing Punctuation and Special Characters
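Punctuation and special characters often carry little information for tasks that treat text as a collection of words, so removing them can simplify the text. Keep in mind that `string.punctuation` covers only ASCII punctuation, and that some tasks (for example, sentiment analysis, where "!" or "?" can matter) may benefit from keeping it.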

In [4]:
# Remove Punctuation
text_no_punct = text.translate(str.maketrans('', '', string.punctuation))
print("Text without punctuation:\n", text_no_punct)

Text without punctuation:
 Natural Language Processing NLP is a fascinating field of artificial intelligence It focuses on the interaction between computers and human language Tokenization is a crucial step in NLP pipelines
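
Because `string.punctuation` contains only ASCII characters, text with curly quotes or an ellipsis character keeps those marks after the step above. The following is a minimal sketch of one alternative, using the standard `unicodedata` module to drop every character whose Unicode category starts with `P`. The helper names and the sample string are hypothetical.

```python
import string
import unicodedata

def strip_ascii_punct(text):
    """Same approach as the cell above: remove ASCII punctuation only."""
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_unicode_punct(text):
    """Remove every character in a Unicode punctuation category (P*).
    Symbols such as '$' or '+' are category S*, so they are NOT removed here,
    even though string.punctuation includes them."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

# Hypothetical sample with curly quotes, an ellipsis character, and a curly apostrophe
sample = "\u201cHello,\u201d she said\u2026 it\u2019s fine!"

ascii_result = strip_ascii_punct(sample)
unicode_result = strip_unicode_punct(sample)

print(ascii_result)    # curly quotes and the ellipsis are still present
print(unicode_result)
```

Which variant is appropriate depends on the data: for plain ASCII text like the example paragraph, both give the same result.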


##### Exercise

In [5]:
# TODO: Remove punctuation from your custom text (from Step 2)
text_custom_no_punct = text_custom.translate(str.maketrans('', '', string.punctuation))
print("Text without punctuation:\n", text_custom_no_punct)

Text without punctuation:
 Machine learning is a subset of artificial intelligence It allows computers to learn from data without explicit programming Techniques such as supervised learning and reinforcement learning are used in various applications


## Step 4: Lowercasing and Normalization
Normalization converts text to a consistent form so that variants of the same word, such as "NLP" and "nlp", are treated as one token. Lowercasing is the simplest example.

In [6]:
# Convert Text to Lowercase
text_lower = text_no_punct.lower()
print("Lowercased Text:\n", text_lower)

Lowercased Text:
 natural language processing nlp is a fascinating field of artificial intelligence it focuses on the interaction between computers and human language tokenization is a crucial step in nlp pipelines
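
For English text like the example, `lower()` is usually enough. As a hedged sketch of what normalization can look like for other inputs, the snippet below compares `lower()` with `str.casefold()` and applies Unicode NFKC normalization through the standard `unicodedata` module. The `normalize` helper is a hypothetical name, not part of the assignment.

```python
import unicodedata

def normalize(text):
    """Apply NFKC normalization, then casefold for caseless comparison."""
    return unicodedata.normalize("NFKC", text).casefold()

lowered = "Straße".lower()       # lower() keeps the 'ß'
folded = "Straße".casefold()     # casefold() maps 'ß' to 'ss'

# Full-width letters (U+FF2E, U+FF2C, U+FF30) become plain ASCII under NFKC
fullwidth = "\uff2e\uff2c\uff30"
normalized = normalize(fullwidth)

print(lowered, folded, normalized)
```

`casefold()` is intended for caseless matching, so it is more aggressive than `lower()`; whether that is desirable depends on the task.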


##### Exercise

In [7]:
# TODO: Apply lowercasing to your custom text (from Step 3)
text_custom_lower = text_custom_no_punct.lower()
print("Lowercased Text:\n", text_custom_lower)

Lowercased Text:
 machine learning is a subset of artificial intelligence it allows computers to learn from data without explicit programming techniques such as supervised learning and reinforcement learning are used in various applications


## Step 5: Removing Stopwords
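Stopwords are common words (e.g., "and", "the", "is") that often do not add much meaning to a sentence. Removing them can help focus on the more meaningful words in the text. NLTK's English stopword list is lowercase, so the filter below is applied to the lowercased text from Step 4; otherwise capitalized words such as "It" would not be matched.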

In [9]:
# Define Stopword List
stop_words = set(stopwords.words('english'))

# Remove Stopwords
words_no_stopwords = [word for word in word_tokenize(text_lower) if word not in stop_words]
print("Text without stopwords:\n", words_no_stopwords)

Text without stopwords:
 ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'artificial', 'intelligence', 'focuses', 'interaction', 'computers', 'human', 'language', 'tokenization', 'crucial', 'step', 'nlp', 'pipelines']
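
Stopword removal is not always harmless. NLTK's English list includes negations such as "not", so filtering can change the meaning of a sentence. The sketch below illustrates this with a small hand-written stopword set (a hypothetical stand-in for NLTK's much longer list) and a hypothetical `remove_stopwords` helper that accepts a set of words to keep.

```python
# Hypothetical mini stopword list for illustration only
demo_stop_words = {"this", "is", "not", "a", "the", "was"}

def remove_stopwords(tokens, stop_words, keep=frozenset()):
    """Drop tokens found in stop_words unless they are explicitly kept."""
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = "this movie was not good".split()

filtered = remove_stopwords(tokens, demo_stop_words)
filtered_keep_negation = remove_stopwords(tokens, demo_stop_words, keep={"not"})

print(filtered)                # the negation is lost
print(filtered_keep_negation)  # the negation is preserved
```

With the real list from this step, a similar effect can be had by filtering against `stop_words - {"not"}`.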


##### Exercise

In [10]:
# TODO: Remove stopwords from your custom text (from Step 4)
words_custom_no_stopwords = [word for word in word_tokenize(text_custom_lower) if word not in stop_words]
print("Text without stopwords:\n", words_custom_no_stopwords)

Text without stopwords:
 ['machine', 'learning', 'subset', 'artificial', 'intelligence', 'allows', 'computers', 'learn', 'data', 'without', 'explicit', 'programming', 'techniques', 'supervised', 'learning', 'reinforcement', 'learning', 'used', 'various', 'applications']


## Step 6: Analyze Raw vs. Preprocessed Text
Compare the original text with the preprocessed version to understand the impact of these techniques.

In [11]:
print("Original Text:\n", text)
print("\nPreprocessed Text:\n", ' '.join(words_no_stopwords))

Original Text:
 Natural Language Processing (NLP) is a fascinating field of artificial intelligence. It focuses on the interaction between computers and human language. Tokenization is a crucial step in NLP pipelines!

Preprocessed Text:
 natural language processing nlp fascinating field artificial intelligence focuses interaction computers human language tokenization crucial step nlp pipelines
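
One way to quantify the comparison is to count tokens and distinct token types before and after preprocessing. The sketch below does this with the two strings printed above. It uses a simple regular expression in place of `word_tokenize` so that it runs without NLTK, which is an assumption that happens to yield the same tokens for this particular text.

```python
import re
from collections import Counter

original = ("Natural Language Processing (NLP) is a fascinating field of "
            "artificial intelligence. It focuses on the interaction between "
            "computers and human language. Tokenization is a crucial step in "
            "NLP pipelines!")
preprocessed = ("natural language processing nlp fascinating field artificial "
                "intelligence focuses interaction computers human language "
                "tokenization crucial step nlp pipelines")

# Words and punctuation marks as separate tokens (regex stand-in for word_tokenize)
raw_tokens = re.findall(r"\w+|[^\w\s]", original)
clean_tokens = preprocessed.split()

raw_counts = Counter(raw_tokens)
clean_counts = Counter(clean_tokens)

print("Tokens:", len(raw_tokens), "->", len(clean_tokens))
print("Distinct types:", len(raw_counts), "->", len(clean_counts))
print("Repeated after preprocessing:",
      [w for w, n in clean_counts.items() if n > 1])
```

Note that "Language" and "language" count as two types in the raw text but merge into one after lowercasing.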


##### Exercise

In [12]:
# TODO: Print the original and preprocessed versions of your custom text
print("Custom Original Text:\n", text_custom)
print("\nCustom Preprocessed Text:\n", ' '.join(words_custom_no_stopwords))

Custom Original Text:
 Machine learning is a subset of artificial intelligence. It allows computers to learn from data without explicit programming. Techniques such as supervised learning and reinforcement learning are used in various applications.

Custom Preprocessed Text:
 machine learning subset artificial intelligence allows computers learn data without explicit programming techniques supervised learning reinforcement learning used various applications


### Congratulations! 🎉
You have completed the tokenization and preprocessing assignment. These preprocessing steps are foundational in NLP workflows and will help you tackle more advanced topics in the future.

---

## Reflection:
- How did removing punctuation and stopwords change the text?
- Were there any cases where removing stopwords might not have been ideal?

Feel free to experiment with more text samples and explore additional preprocessing techniques!