# String cleaning
Pandas string operations are accessed through the `.str` accessor
because each column is a Series, not a single string. The accessor applies the string method to every element of the column.
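
As a minimal sketch of why the accessor is needed, the snippet below uses a small hypothetical Series (the values are illustrative, not read from the CSV). It checks that the Series itself has no `strip` method and then applies the method element by element through `.str`.

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
names = pd.Series([" Amit ", "Riya ", " John"])

# A Series has no strip() of its own; names.strip() would raise AttributeError
has_strip = hasattr(names, "strip")

# The .str accessor applies the string method to each element
stripped = names.str.strip()

print(has_strip)
print(stripped.tolist())
```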

## Dataset Summary
- **Whitespace Noise**: Columns like `name` and `city` have hidden leading/trailing spaces (e.g., `" Amit "`, `"Pune "`).
- **Case Inconsistency**: The same city appears as `"Kolkata"`, `"KOLKATA"`, and `"pune"`.
- **Numerical "Dirt"**: The `salary` column is stored as `object` (string) dtype because it contains symbols like `$`, commas `,`, and suffixes like `/-`.
- **Prefixes/Suffixes**: Some names include titles like `"Mr."` or degrees like `"PhD"`.
- **Spelling Variations**: The city "Kolkata" is sometimes spelled as `"Kolkatta"`.
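
Issues like these are easy to miss in a rendered table, so it can help to check for them before cleaning. The sketch below assumes a small hypothetical DataFrame named `sample` that mimics the problems listed above; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows that mimic the issues listed above
sample = pd.DataFrame({
    "name": [" Amit ", "Mr. John", "Neha PhD"],
    "city": ["Pune ", "KOLKATA", "Kolkatta"],
    "salary": ["$40,769", "99735", "42433/-"],
})

# A value that changes when stripped has leading or trailing whitespace
has_padding = sample["name"] != sample["name"].str.strip()

# Salaries that contain anything other than digits
non_numeric = ~sample["salary"].str.fullmatch(r"\d+")

# repr() makes hidden spaces visible
print([repr(v) for v in sample["city"]])
print(has_padding.sum(), non_numeric.sum())
```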

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("09_String_Cleaning.csv")
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Amit,Male,28,"$40,769",Kolkata,30-10-2021
1,Riya,Female,41,99735,Pune,09-02-2018
2,John,Male,36,96101,Pune,02-06-2019
3,Neha,Female,32,42433,KOLKATA,04-03-2020
4,Siddharth,Male,29,45311,pune,15-10-2022
5,Zoe,Female,42,77819,Bangalore,28-11-2023
6,Ken,Male,28,79188,Chennai,12-05-2018
7,Anjali,Female,47,57568,Kolkatta,25-10-2020
8,Vijay,Male,40,93707,Chennai,06-09-2022
9,Priya,Female,44,59769,Delhi,03-09-2022


## Basic Trimming
Trim leading and trailing whitespace from the `name` and `city` columns.

In [3]:
df['name'] = df['name'].str.strip()
df['city'] = df['city'].str.strip()
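
To trim every string column instead of naming each one, one possible approach is to loop over the columns and strip only those that hold strings. This is a sketch on a hypothetical `sample` DataFrame; `is_string_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Hypothetical data with padded strings and one numeric column
sample = pd.DataFrame({
    "name": [" Amit ", "Riya "],
    "city": ["Pune ", " Delhi"],
    "age": [28, 41],
})

# Strip only the columns that contain strings; numeric columns are left alone
for col in sample.columns:
    if is_string_dtype(sample[col]):
        sample[col] = sample[col].str.strip()

print(sample)
```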

## Case Normalization
Convert `city` to title case and capitalize the first letter of `gender`.

In [4]:
df['city'] = df['city'].str.title()
df['gender'] = df['gender'].str.capitalize()
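
The two methods differ on multi-word values: `str.title()` capitalizes every word, while `str.capitalize()` capitalizes only the first character and lowercases the rest. A small sketch with hypothetical city values (`"new delhi"` is not in the dataset):

```python
import pandas as pd

# Hypothetical values chosen to show the difference
cities = pd.Series(["KOLKATA", "pune", "new delhi"])

titled = cities.str.title()
capitalized = cities.str.capitalize()

print(titled.tolist())       # ['Kolkata', 'Pune', 'New Delhi']
print(capitalized.tolist())  # ['Kolkata', 'Pune', 'New delhi']
```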

## Cleaning Numerical Strings (Regex)
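Remove `$`, commas, and `/-`, then convert the result to a numeric type.

In [5]:
# Strip currency symbols, thousands separators, and the "/-" suffix
df['salary'] = df['salary'].str.replace(r'[\$,/-]', '', regex=True)
# Convert the cleaned strings to numbers (integers here; use .astype(float) to force floats)
df['salary'] = pd.to_numeric(df['salary'])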
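
If some values cannot be parsed even after the symbols are removed, `pd.to_numeric` with `errors="coerce"` turns them into NaN instead of raising. The sketch below uses hypothetical salary strings; the `"unknown"` entry is an assumption added to show the behavior.

```python
import pandas as pd

# Hypothetical salary strings, including one that cannot be parsed
salary = pd.Series(["$40,769", "99735", "42433/-", "unknown"])

# Remove $, commas, slashes, and hyphens
cleaned = salary.str.replace(r"[$,/-]", "", regex=True)

# Unparseable values become NaN, so the result is a float column
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```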

## Pattern Removal & Replacement

### Remove `Mr.` and `PhD` from names

In [6]:
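# Drop the " PhD" degree suffix (literal, non-regex match)
df['name'] = df['name'].str.replace(' PhD', '', regex=False)
# Drop a leading "Mr" or "Mrs" title, with or without a period
df['name'] = df['name'].str.replace(r'^(Mr|Mrs)\.?\s+', '', regex=True)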
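
A quick way to see what these patterns do is to run them on a few hypothetical names. In this sketch the degree pattern is anchored to the end of the string with `$`, which is a slightly stricter variant of the literal replacement above.

```python
import pandas as pd

# Hypothetical names with titles and degrees
names = pd.Series(["Mr. Amit", "Mrs Riya", "John PhD", "Neha"])

# Remove a trailing " PhD" (anchored to the end of the string)
names = names.str.replace(r"\s+PhD$", "", regex=True)

# Remove a leading "Mr" or "Mrs", with an optional period
names = names.str.replace(r"^(?:Mr|Mrs)\.?\s+", "", regex=True)

print(names.tolist())  # ['Amit', 'Riya', 'John', 'Neha']
```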

### Standardize spelling (Kolkatta -> Kolkata)

In [7]:
df['city'] = df['city'].str.replace('Kolkatta', 'Kolkata', regex=False)
df

Unnamed: 0,name,gender,age,salary,city,date of joining
0,Amit,Male,28,40769,Kolkata,30-10-2021
1,Riya,Female,41,99735,Pune,09-02-2018
2,John,Male,36,96101,Pune,02-06-2019
3,Neha,Female,32,42433,Kolkata,04-03-2020
4,Siddharth,Male,29,45311,Pune,15-10-2022
5,Zoe,Female,42,77819,Bangalore,28-11-2023
6,Ken,Male,28,79188,Chennai,12-05-2018
7,Anjali,Female,47,57568,Kolkata,25-10-2020
8,Vijay,Male,40,93707,Chennai,06-09-2022
9,Priya,Female,44,59769,Delhi,03-09-2022
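
When a column has several misspellings, one alternative to repeated `str.replace` calls is `Series.replace` with a dictionary, which swaps whole values that match a key exactly. The mapping below is hypothetical; `"Calcutta"` does not appear in the dataset and is included only to show a second entry.

```python
import pandas as pd

# Hypothetical city values with spelling variants
cities = pd.Series(["Kolkata", "Kolkatta", "Pune", "Calcutta"])

# Hypothetical mapping from variant spelling to the standard one
corrections = {"Kolkatta": "Kolkata", "Calcutta": "Kolkata"}

# Series.replace matches entire values, not substrings
fixed = cities.replace(corrections)

print(fixed.tolist())  # ['Kolkata', 'Kolkata', 'Pune', 'Kolkata']
```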


**Note**: `.str` methods propagate missing values (NaN) instead of raising errors, but it is still important to inspect the data before cleaning.
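
A minimal sketch of this behavior, using a hypothetical Series with one missing value: the chained `.str` calls clean the strings and leave the NaN in place.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
cities = pd.Series([" pune ", np.nan, "KOLKATA"])

# NaN passes through both string methods unchanged
cleaned = cities.str.strip().str.title()

print(cleaned.isna().tolist())  # [False, True, False]
```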

## Summary
- Real-world text data is messy
- Pandas provides vectorized string operations
- Clean text improves analysis and modeling