# Topic 05 - Problem 01: Cleaning Text Data (Lowercase & Remove Extra Spaces)

---

## 1. About the Problem

In real datasets, text data is often **messy**:
- Inconsistent capitalization  
- Extra spaces at the beginning or end  
- Mixed formatting  

Before doing any analysis, feature engineering, or ML modeling, I must **standardize text**.

In this problem, I will:
1. Convert all text to **lowercase**
2. Remove **leading and trailing spaces**

This is one of the **first preprocessing steps** in NLP and categorical data handling.

---


## 2. Solution Code

```python
import pandas as pd

# Sample dataset with messy text
data = {
    "city": ["  New York", "london ", "PARIS", "  Berlin  ", "ToKYo"]
}

df = pd.DataFrame(data)

# Strip leading/trailing whitespace, then lowercase every value
df["cleaned_city"] = df["city"].str.strip().str.lower()

print(df)
```


```text
         city cleaned_city
0    New York     new york
1     london        london
2       PARIS        paris
3    Berlin         berlin
4       ToKYo        tokyo
```


---

## 3. Explanation (What is happening)

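- **str.strip()**  
  → Removes leading and trailing whitespace (spaces, tabs, newlines); spaces inside the text are left unchanged

- **str.lower()**  
  → Converts all characters to lowercase

- **.str**  
  → Exposes vectorized string methods on a pandas column, so the operation is applied to every value without an explicit loop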
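To see what these methods do and do not touch, here is a small sketch. The sample values are hypothetical (not from the dataset above), and the final whitespace-collapsing step is an optional extra that goes beyond the solution code; it is only one possible way to handle repeated spaces inside a string.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset above
s = pd.Series(["\tNew   York\n", "  PARIS ", None])

# strip() with no arguments removes leading/trailing whitespace,
# including tabs and newlines; internal spaces are left alone.
# Missing values stay missing instead of raising an error.
stripped = s.str.strip().str.lower()

# Optional extra step (an assumption, not part of the solution above):
# collapse internal runs of whitespace to a single space
collapsed = stripped.str.replace(r"\s+", " ", regex=True)

print(stripped.tolist())
print(collapsed.tolist())
```

After stripping, the first value still contains three spaces between the words; only the optional `str.replace` step reduces them to one.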

This ensures:
- " New York", "new york", "NEW YORK"  
  all become → **"new york"**
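A quick way to check this is to count distinct values before and after cleaning. The snippet below is a minimal sketch using the three variants listed above; the variable names `raw` and `cleaned` are my own.

```python
import pandas as pd

# The three variants of the same city name from the list above
raw = pd.Series([" New York", "new york", "NEW YORK"])

# Same cleaning chain as in the solution code
cleaned = raw.str.strip().str.lower()

print(raw.nunique())      # 3 distinct raw values
print(cleaned.nunique())  # 1 distinct value after cleaning
print(cleaned.unique())
```

This matters for grouping and counting: without cleaning, operations such as `value_counts()` would treat the three spellings as three different cities.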

---

## 4. Summary / Takeaways

By solving this problem, I learned:

1. Why raw text in datasets cannot be assumed to be consistent
2. How to standardize string data using pandas
3. The importance of the `.str` accessor
4. Why text should be cleaned before ML or analysis, so that equivalent values are not treated as different categories

This problem reinforces **good data preprocessing habits**.

---

Next, I will work on:
- Splitting text
- Extracting meaningful information
- Creating new features from strings

