# Module 1, Week 1: NLP Pipeline Implementation Project

In this project, you will implement a complete text preprocessing pipeline for Natural Language Processing (NLP). The goal is to clean, preprocess, and prepare raw text data for further analysis or modeling. By the end of this project, you will:

1. Understand the flow of an NLP pipeline.
2. Perform tokenization, lowercasing, punctuation removal, and stopword removal.
3. Compare raw and processed text, and visualize word frequencies, to see what preprocessing changes.

---

### Problem Statement:

Given a raw text dataset, implement a preprocessing pipeline that:
- Cleans the text data.
- Tokenizes sentences and words.
- Removes unnecessary characters and stopwords.
- Outputs a cleaned version of the text for further use.

---

### Dataset:
You will use a raw text paragraph for this project. Feel free to replace the sample text with your own data or with text from an NLTK corpus such as Gutenberg or Reuters.

---

## Step 1: Import Required Libraries

In [None]:
# Import Libraries
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download Necessary NLTK Data Files
nltk.download('punkt')
nltk.download('punkt_tab')  # Tokenizer data expected by newer NLTK releases
nltk.download('stopwords')

# Set Stopwords
stop_words = set(stopwords.words('english'))

## Step 2: Define the Input Text
You can use the sample text below or replace it with your own dataset.

In [None]:
# Sample Raw Text
raw_text = """
Natural Language Processing (NLP) is a fascinating field of artificial intelligence. 
It helps computers understand, interpret, and generate human language. The applications are vast: 
chatbots, translation, text summarization, and many more. Preprocessing is the first step in any NLP pipeline.
"""

# Print Raw Text
print("Raw Text:\n")
print(raw_text)

## Step 3: Tokenization
Tokenize the text into sentences and words.

In [None]:
# Sentence Tokenization
sentences = sent_tokenize(raw_text)
print("\nSentence Tokenization:\n")
print(sentences)

# Word Tokenization
words = word_tokenize(raw_text)
print("\nWord Tokenization:\n")
print(words)
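
To see why a dedicated tokenizer is used instead of splitting on whitespace, the sketch below compares `str.split()` with a simple regular expression. The regex is an illustrative simplification chosen for this example; it is not how `word_tokenize` works internally, and the variable names are hypothetical.

```python
import re

text = "Preprocessing is the first step in any NLP pipeline."

# Naive approach: split on whitespace. Punctuation stays glued to words.
naive = text.split()

# Simplified regex tokenizer (an assumption for illustration):
# runs of word characters, or any single non-space, non-word character.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(naive[-1])         # pipeline.
print(regex_tokens[-2:])  # ['pipeline', '.']
```

Because the whitespace split keeps `"pipeline."` as one token, later steps such as punctuation and stopword removal would not match it cleanly.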

## Step 4: Lowercasing and Removing Punctuation
Normalize the text by converting it to lowercase and removing punctuation.

In [None]:
# Convert to Lowercase
words_lower = [word.lower() for word in words]
print("\nLowercased Words:\n")
print(words_lower)

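# Remove Punctuation
# string.punctuation is a single string, so `word in string.punctuation` is a substring test.
# Checking every character also drops multi-character tokens such as "..." or "``".
words_no_punct = [word for word in words_lower if not all(char in string.punctuation for char in word)]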
print("\nWords Without Punctuation:\n")
print(words_no_punct)
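
One subtlety in punctuation filtering: `string.punctuation` is a plain string, so testing a whole token against it is a substring check. A character-level check is more predictable. The sketch below uses a hypothetical helper `is_punctuation` and a made-up token list to show the idea.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_punctuation(token: str) -> bool:
    """Return True when every character of the token is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Hypothetical tokens, similar to what a word tokenizer may emit.
sample_tokens = ["nlp", "(", ")", "...", "``", "state-of-the-art", ","]

kept = [tok for tok in sample_tokens if not is_punctuation(tok)]
print(kept)  # ['nlp', 'state-of-the-art']
```

Tokens that merely contain punctuation, such as hyphenated words, are kept; only tokens made up entirely of punctuation are removed.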

## Step 5: Stopword Removal
Remove common function words (stopwords) such as "the", "is", and "and", which occur frequently but carry little meaning on their own.

In [None]:
# Remove Stopwords
filtered_words = [word for word in words_no_punct if word not in stop_words]
print("\nWords After Stopword Removal:\n")
print(filtered_words)
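
Stopword lists often include negations (NLTK's English list contains "not" and "no"), and removing them can change the meaning of a sentence. The sketch below uses a small hypothetical stopword set, standing in for `stop_words`, to show one way to keep negations when the task calls for it.

```python
# Hypothetical miniature stopword set; in the notebook this role is played by stop_words.
base_stop_words = {"the", "is", "a", "this", "not", "no", "and"}

# Words we may want to keep for tasks where negation matters.
negations = {"not", "no", "nor", "never"}
custom_stop_words = base_stop_words - negations

tokens = ["this", "movie", "is", "not", "good"]

with_default = [tok for tok in tokens if tok not in base_stop_words]
with_custom = [tok for tok in tokens if tok not in custom_stop_words]

print(with_default)  # ['movie', 'good']
print(with_custom)   # ['movie', 'not', 'good']
```

With the default set, "not good" collapses to "good", which is the opposite of the original meaning.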

## Step 6: Compare Raw vs. Processed Text
Combine the cleaned words into a single string to compare the raw and processed text.

In [None]:
# Combine Words to Form Processed Text
processed_text = ' '.join(filtered_words)
print("\nProcessed Text:\n")
print(processed_text)

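# Compare Raw and Processed Text
# Concatenate instead of passing two arguments, which would insert a stray space after the newline.
print("\nComparison:\n")
print("Raw Text:\n" + raw_text)
print("\nProcessed Text:\n" + processed_text)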
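
A side-by-side print shows the difference qualitatively; a few counts make it measurable. The helper below, `summarize_reduction`, is a hypothetical name introduced for this sketch, and the token lists are made-up stand-ins for `words` and `filtered_words`.

```python
def summarize_reduction(raw_tokens, processed_tokens):
    """Return simple counts describing how much preprocessing removed."""
    raw_count = len(raw_tokens)
    processed_count = len(processed_tokens)
    removed = raw_count - processed_count
    return {
        "raw_tokens": raw_count,
        "processed_tokens": processed_count,
        "removed": removed,
        "removed_pct": round(100 * removed / raw_count, 1) if raw_count else 0.0,
        "raw_vocab": len(set(raw_tokens)),
        "processed_vocab": len(set(processed_tokens)),
    }

# Hypothetical example data.
raw_tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
processed_tokens = ["cat", "sat", "mat"]

stats = summarize_reduction(raw_tokens, processed_tokens)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In the notebook, the same function could be called as `summarize_reduction(words, filtered_words)`.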

## Step 7: Visualize Word Frequency Distribution
Plot the frequency of the most common words in the processed text to understand key themes.

In [None]:
# Calculate Word Frequencies
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

fdist = FreqDist(filtered_words)

# Plot Most Common Words
print("\nWord Frequency Distribution:")
fdist.plot(10, title="Top 10 Words in Processed Text")
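
When a plot is not available, for example in a plain terminal, the same counts can be printed as text. `FreqDist` is a subclass of `collections.Counter`, so the sketch below uses `Counter` directly with a hypothetical token list standing in for `filtered_words`.

```python
from collections import Counter

# Hypothetical tokens standing in for filtered_words.
tokens = ["nlp", "pipeline", "text", "nlp", "language", "text", "nlp"]

counts = Counter(tokens)

# Print the most frequent tokens as a small text table.
for word, count in counts.most_common(2):
    print(f"{word:<10}{count}")
```

On a short paragraph like the sample text, most words occur only once, so the distribution becomes more informative on longer inputs.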

### Congratulations! 🎉
You have successfully implemented an end-to-end NLP preprocessing pipeline. Here's a quick recap of what you've achieved:

- Tokenized text into sentences and words.
- Lowercased the text and removed punctuation.
- Filtered out stopwords to retain meaningful words.
- Visualized word frequencies to understand key themes in the text.
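
The steps in this recap can be bundled into one reusable function. The following is a minimal sketch, not a definitive implementation: `preprocess`, `simple_tokenize`, and `demo_stop_words` are hypothetical names, and the regex tokenizer is only a fallback so that the example runs without NLTK data.

```python
import re
import string
from typing import Callable, Iterable, List

def simple_tokenize(text: str) -> List[str]:
    """Fallback tokenizer (an assumption): word runs or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(
    text: str,
    stop_words: Iterable[str],
    tokenizer: Callable[[str], List[str]] = simple_tokenize,
) -> List[str]:
    """Tokenize, lowercase, drop punctuation-only tokens, then drop stopwords."""
    stop_set = set(stop_words)
    tokens = [tok.lower() for tok in tokenizer(text)]
    tokens = [tok for tok in tokens if not all(ch in string.punctuation for ch in tok)]
    return [tok for tok in tokens if tok not in stop_set]

# Hypothetical miniature stopword set for the demo.
demo_stop_words = {"is", "the", "in", "any", "a", "of"}

result = preprocess("Preprocessing is the first step in any NLP pipeline.", demo_stop_words)
print(result)  # ['preprocessing', 'first', 'step', 'nlp', 'pipeline']
```

In the notebook, this could be called as `preprocess(raw_text, stop_words, tokenizer=word_tokenize)` to reuse the NLTK tokenizer and stopword list from Step 1.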

---

### Reflection:
- What are the limitations of this preprocessing pipeline?
- How might preprocessing differ for specific NLP tasks like sentiment analysis vs. text summarization?
- Experiment with different texts. How does the processed output vary?

Feel free to expand this pipeline with additional techniques like stemming, lemmatization, or domain-specific cleaning!
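
As a starting point for the stemming extension, the sketch below applies NLTK's `PorterStemmer`, which is rule-based and needs no extra data download. The token list is a hypothetical stand-in for `filtered_words`.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens standing in for filtered_words from Step 5.
tokens = ["running", "studies", "processing"]

stems = [stemmer.stem(tok) for tok in tokens]
for tok, stem in zip(tokens, stems):
    print(f"{tok} -> {stem}")
```

Stems are not always dictionary words ("studies" becomes "studi"), which is one reason lemmatization is sometimes preferred when readable output matters.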