# Topic 05 - Problem 07: Tokenizing Text into Words

---

## 1. About the Problem

Tokenization is the process of splitting text into smaller units (tokens), typically words, subwords, or sentences.  
It is usually one of the first steps in a Natural Language Processing (NLP) pipeline.

In this problem, I **tokenize** each product description in a small pandas DataFrame into **individual words**.  
Word-level tokens are the starting point for:
- Text analysis
- Sentiment analysis
- Text classification

---


## 2. Solution Code

```python
import pandas as pd

# Sample dataset
data = {
    "product_description": [
        "This is a premium quality product",
        "Budget friendly option",
        "Premium design with advanced features",
        "Standard model"
    ]
}

df = pd.DataFrame(data)

# Tokenize each description into a list of words (split on whitespace)
df["tokens"] = df["product_description"].str.split()
print(df)
```


```text
                     product_description  \
0      This is a premium quality product   
1                 Budget friendly option   
2  Premium design with advanced features   
3                         Standard model   

                                        tokens  
0     [This, is, a, premium, quality, product]  
1                   [Budget, friendly, option]  
2  [Premium, design, with, advanced, features]  
3                            [Standard, model]  
```


---

## 3. Explanation (What is happening)

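- **`str.split()`**  
  → With no arguments, splits each string in the column on runs of whitespace and returns a list of tokens per row. Case is preserved, and any punctuation stays attached to the adjacent word.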

For example:
- `"This is a premium quality product"` → `["This", "is", "a", "premium", "quality", "product"]`
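The sample descriptions contain no punctuation, so a whitespace split is enough here. As a minimal sketch of what happens otherwise, the snippet below uses a made-up description (not part of the dataset) and compares `str.split()` with `str.findall(r"\w+")`, one possible regex-based alternative rather than the only way to handle punctuation.

```python
import pandas as pd

# Hypothetical description containing punctuation (not from the dataset above)
s = pd.Series(["Premium design, with advanced features!"])

# Whitespace split: punctuation stays attached to the neighbouring word
whitespace_tokens = s.str.split()
print(whitespace_tokens[0])
# ['Premium', 'design,', 'with', 'advanced', 'features!']

# Regex alternative: keep only runs of word characters (letters, digits, underscore)
regex_tokens = s.str.findall(r"\w+")
print(regex_tokens[0])
# ['Premium', 'design', 'with', 'advanced', 'features']
```

Note that `\w+` also splits contractions and hyphenated words (for example, `don't` becomes `don` and `t`), so the pattern may need adjusting for real text.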

---

## 4. Summary / Takeaways

By solving this problem, I learned:

1. How to split text into individual tokens (words)
2. The importance of tokenization for NLP tasks
3. How tokenized words can be used as features in models
4. Why tokenization is the first step in most NLP pipelines

Tokenization is a key **NLP preprocessing step** for preparing text data for machine learning.
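As a rough illustration of point 3 above, tokens can be turned into simple count features. The sketch below rebuilds the same sample DataFrame, lowercases the text first (an added assumption, so that `Premium` and `premium` count as one token), and tallies how often each token appears across all descriptions.

```python
import pandas as pd

df = pd.DataFrame({
    "product_description": [
        "This is a premium quality product",
        "Budget friendly option",
        "Premium design with advanced features",
        "Standard model",
    ]
})

# Lowercase first (assumption) so "Premium" and "premium" are the same token
df["tokens"] = df["product_description"].str.lower().str.split()

# One row per token, then count occurrences across all descriptions
token_counts = df["tokens"].explode().value_counts()

print(token_counts["premium"])  # 2
print(token_counts.head())
```

Counts like these are the basis of a bag-of-words representation, where each distinct token becomes one feature column.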

---

Next, I’ll move toward:
- Tokenizing text into sentences
- Removing stop words and punctuation

