# Topic 05 - Problem 08: Removing Stop Words and Punctuation

---

## 1. About the Problem

In many datasets, text contains **stop words** (very common words such as "the", "is", "and") and **punctuation**, which usually carry little meaning for analysis or machine learning models.  
Removing these elements can:
- Clean the text
- Reduce dimensionality (a smaller vocabulary means fewer features)
- Improve model performance on some tasks

In this problem, I will remove:
1. **Stop words**
2. **Punctuation**

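This is a common step in **text preprocessing** for **sentiment analysis**, **text classification**, and other NLP tasks. For sentiment analysis, check the stop list first: NLTK's English list includes negations such as "not" and "no", and dropping them can change the meaning of a sentence.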
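To make the dimensionality point concrete, here is a minimal sketch that counts distinct tokens before and after stop word removal. The `STOP_WORDS` set and the `vocabulary` helper are assumptions made up for this illustration; NLTK's English list is much longer.

```python
# Hypothetical mini stop list for illustration only (not NLTK's list).
STOP_WORDS = {"this", "is", "a", "with", "for", "the", "and"}

docs = [
    "This is a premium quality product",
    "Budget friendly option available",
    "Premium design with advanced features",
    "Standard model for basic usage",
]


def vocabulary(texts, stop_words=frozenset()):
    """Return the set of distinct lowercase tokens, skipping stop words."""
    return {
        token
        for text in texts
        for token in text.lower().split()
        if token not in stop_words
    }


vocab_before = vocabulary(docs)
vocab_after = vocabulary(docs, STOP_WORDS)

print(len(vocab_before), len(vocab_after))  # 19 14
```

On four short sentences the vocabulary shrinks from 19 to 14 distinct tokens; on a real corpus the absolute saving is small compared with the full vocabulary, but the removed words are usually the most frequent ones.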

---


## 2. Solution Code

In [26]:
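import string

import nltk
import pandas as pd
from nltk.corpus import stopwords

# Download the stop word corpus once (does nothing if it is already present)
nltk.download('stopwords', quiet=True)

# Sample dataset
data = {
    "product_description": [
        "This is a premium quality product",
        "Budget friendly option available",
        "Premium design with advanced features",
        "Standard model for basic usage"
    ]
}

df = pd.DataFrame(data)

# Remove stopwords and punctuation
# A set makes the membership test fast
stop_words = set(stopwords.words('english'))

# Strip punctuation from both ends of each word (so "product." becomes "product"),
# then drop empty tokens and stop words (compared in lowercase)
df['cleaned_description'] = df['product_description'].apply(
    lambda x: ' '.join(
        token
        for token in (word.strip(string.punctuation) for word in x.split())
        if token and token.lower() not in stop_words
    )
)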

print(df)


                     product_description               cleaned_description
0      This is a premium quality product           premium quality product
1       Budget friendly option available  Budget friendly option available
2  Premium design with advanced features  Premium design advanced features
3         Standard model for basic usage        Standard model basic usage
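
The sample descriptions contain no punctuation, so the output above does not exercise the punctuation handling. One detail worth knowing: `string.punctuation` is a single `str`, so `x in string.punctuation` is a substring test that matches a standalone token such as `","` but not a word with punctuation attached. The sketch below shows this and one possible way to delete punctuation characters wherever they occur, using `str.translate`. The input sentence and the helper name `strip_punctuation` are made up for illustration.

```python
import string

# string.punctuation is a str, so `in` performs a substring test.
print("," in string.punctuation)         # True: a standalone comma matches
print("product." in string.punctuation)  # False: attached punctuation is not caught

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Hypothetical helper: delete all ASCII punctuation characters from text."""
    return text.translate(PUNCT_TABLE)


sample = "Premium design, with advanced features!"  # made-up input
print(strip_punctuation(sample))  # Premium design with advanced features
```

Note that this approach also deletes apostrophes and hyphens inside words ("don't" becomes "dont"), so whether it is appropriate depends on the data.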


---

## 3. Explanation (What is happening)

- **stopwords.words('english')**  
  → Returns a list of common English stop words, all in lowercase (e.g., "the", "a", "is")

- **string.punctuation**  
  → A string containing the ASCII punctuation characters (e.g., ".", ",", ";")

- **apply(lambda x: ...)**  
  → Applies a function to each description that:
    - Splits the text on whitespace
    - Filters out stop words (compared in lowercase) and punctuation
    - Joins the remaining words back into a single string

For example:
- `"This is a premium quality product"`  
  becomes `"premium quality product"`
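
The split, filter, and join steps can be traced one at a time for this sentence. This is a minimal sketch: the three-word `stop_words` set is a stand-in chosen for the example, not NLTK's full list.

```python
# Hypothetical mini stop list for illustration; the real one comes from NLTK.
stop_words = {"this", "is", "a"}

text = "This is a premium quality product"

# 1. Split the text on whitespace
tokens = text.split()
# 2. Keep only tokens whose lowercase form is not a stop word
kept = [t for t in tokens if t.lower() not in stop_words]
# 3. Join the remaining tokens with single spaces
cleaned = " ".join(kept)

print(tokens)   # ['This', 'is', 'a', 'premium', 'quality', 'product']
print(kept)     # ['premium', 'quality', 'product']
print(cleaned)  # premium quality product
```

The `.lower()` call matters here: the stop list is lowercase, so without it the capitalized "This" would not match and would stay in the output.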

---

## 4. Summary / Takeaways

By solving this problem, I learned:

1. How to remove stop words and punctuation using **NLTK** and Python's built-in **string** module
2. The importance of text cleaning in NLP
3. How cleaned text helps in creating meaningful features for ML models
4. Why removing common words and symbols reduces noise in text data

This problem demonstrates **NLP text cleaning** skills, which are useful in most **text-based ML projects**.

---

Next, I’ll move toward:
- Lemmatization & stemming
- Advanced feature extraction



In [2]:
!pip install nltk
