# **Data Cleaning**

## **5. Cleaning Strings (String Normalization & Fixes)**

In [1]:
import numpy as np
import pandas as pd 

## 🔍 Why Clean Strings?

String data is prone to:

* Inconsistent casing: `"HR"`, `"hr"`, `"Hr"`
* Unwanted whitespace: `" Alice "` vs `"Alice"`
* Placeholder values and noise: `"n/a"`, `"null"`, `"???"`
* Several fields packed into one string: `"First Last"` needs splitting into `First` and `Last`

Such issues can break grouping, joining, filtering, and modeling.
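
As a quick illustration of how this plays out, the sketch below groups a small made-up frame (the `demo` data and its column names are assumptions, not part of the dataset used later) before and after normalizing the key.

```python
import pandas as pd

# Hypothetical data: three spellings of the same department.
demo = pd.DataFrame({
    "dept": ["HR", "hr ", " Hr"],
    "headcount": [3, 2, 1],
})

# Raw grouping treats every spelling as its own group.
raw_counts = demo.groupby("dept")["headcount"].sum()

# Normalizing the key first collapses them into a single group.
clean_key = demo["dept"].str.strip().str.lower()
clean_counts = demo.groupby(clean_key)["headcount"].sum()

print(len(raw_counts))   # number of groups before cleaning
print(clean_counts)      # totals after cleaning
```

The same effect shows up in joins and filters: a key that looks identical on screen may not compare equal.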


In [2]:
df = pd.DataFrame({
    'Name': [' Alice ', 'BOB', 'charlie', 'David', 'EVA'],
    'Department': [' HR', 'hr ', 'Hr', 'finance', 'FINANCE'],
    'Email': ['alice@example.com', 'bob@example.com', None, '', 'eva@example.com'],
    'Location': ['New York', ' new york ', 'NEW YORK', 'London', 'London'],
})

df

Unnamed: 0,Name,Department,Email,Location
0,Alice,HR,alice@example.com,New York
1,BOB,hr,bob@example.com,new york
2,charlie,Hr,,NEW YORK
3,David,finance,,London
4,EVA,FINANCE,eva@example.com,London


In [3]:
df.dtypes

Name          object
Department    object
Email         object
Location      object
dtype: object

## 🛠️ String Cleaning Techniques

### 🔹 **1. Trimming Whitespace**

#### ▶️ Method: `str.strip()`, `str.lstrip()`, `str.rstrip()`

In [34]:
df['Name'] = df['Name'].str.strip()
df['Department'] = df['Department'].str.strip()
df['Location'] = df['Location'].str.strip()
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,HR,alice@example.com,New York
1,BOB,hr,bob@example.com,new york
2,charlie,Hr,,NEW YORK
3,David,finance,,London
4,EVA,FINANCE,eva@example.com,London


#### ✅ Use Case:

User-entered data often has **accidental leading/trailing spaces**, which break filters and joins.

🔹 *Why this method?*
It fixes basic formatting issues quickly and only touches leading and trailing whitespace; spaces inside the string are left alone.
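
The three variants differ only in which side they trim. A minimal sketch on a throwaway Series (the sample values are made up for illustration):

```python
import pandas as pd

s = pd.Series(["  Alice  ", "\tBob\n", "xxCarolxx"])

both = s.str.strip()        # default: remove whitespace on both sides
left = s.str.lstrip()       # leading side only
right = s.str.rstrip()      # trailing side only
custom = s.str.strip("x")   # remove a given set of characters instead of whitespace

print(both.tolist())
print(custom.tolist())
```

Passing a string to `strip()` treats it as a set of characters to remove, not as a prefix or suffix.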


### 🔹 **2. Consistent Casing (Lower / Upper / Title)**

#### ▶️ Method: `str.lower()`, `str.upper()`, `str.title()`

In [5]:
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,HR,alice@example.com,New York
1,BOB,hr,bob@example.com,new york
2,charlie,Hr,,NEW YORK
3,David,finance,,London
4,EVA,FINANCE,eva@example.com,London


In [8]:
df['Department'] = df['Department'].str.lower()
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice@example.com,New York
1,BOB,hr,bob@example.com,new york
2,charlie,hr,,NEW YORK
3,David,finance,,London
4,EVA,finance,eva@example.com,London


In [32]:
df['Location'] = df['Location'].str.lower()
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice@example.com,new york
1,BOB,hr,bob@example.com,new york
2,charlie,hr,,new york
3,David,finance,,london
4,EVA,finance,eva@example.com,london


#### ✅ Real-World Use Case:

You want to **group or filter** departments. `"HR"`, `"hr"`, `"Hr"` should all be treated as `"hr"`.

🔹 *Why this method?*
Ensures string **uniformity for comparison** or grouping.
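
For reference, here is a small sketch comparing the casing methods on made-up values. It also includes `str.casefold()`, which is not used in this notebook but is a more aggressive form of lower-casing intended for comparisons.

```python
import pandas as pd

s = pd.Series(["new york", "NEW YORK", "New York", "STRASSE", "straße"])

lower = s.str.lower()        # good default for comparison keys
upper = s.str.upper()
title = s.str.title()        # capitalizes the first letter of each word
folded = s.str.casefold()    # like lower(), but also folds cases such as "ß" to "ss"

print(lower.iloc[:3].nunique())   # the three New York spellings collapse to one
print(title.iloc[1])
```

For plain ASCII data `lower()` and `casefold()` give the same result; the difference only matters for some non-English text.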



### 🔹 **3. Replacing or Removing Substrings**

#### ▶️ Method: `str.replace()`

In [9]:
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice@example.com,new york
1,BOB,hr,bob@example.com,new york
2,charlie,hr,,new york
3,David,finance,,london
4,EVA,finance,eva@example.com,london


In [12]:
df['Email'] = df['Email'].str.replace('@example.com', '', regex=False)
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice,new york
1,BOB,hr,bob,new york
2,charlie,hr,,new york
3,David,finance,,london
4,EVA,finance,eva,london


#### ✅ Use Case:

Remove **email domains** to isolate user names.

🔹 *Why this method?*
Handles both literal replacements (`regex=False`) and pattern-based ones (`regex=True`).
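
To make the literal-versus-regex distinction concrete, here is a sketch on a made-up Series with two different domains:

```python
import pandas as pd

emails = pd.Series(["alice@example.com", "bob@example.org", None])

# Literal replacement: only the exact substring is removed.
literal = emails.str.replace("@example.com", "", regex=False)

# Regex replacement: drop everything from the "@" onward, whatever the domain.
pattern = emails.str.replace(r"@.*$", "", regex=True)

print(literal.tolist())
print(pattern.tolist())
```

Missing values pass through both calls unchanged.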

### 🔹 **4. Detecting and Handling Missing or Empty Strings**

In [13]:
df['Email'] = df['Email'].replace(['', ' ', None, 'n/a', 'null'], pd.NA)
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice,new york
1,BOB,hr,bob,new york
2,charlie,hr,,new york
3,David,finance,,london
4,EVA,finance,eva,london


In [14]:
df['Email'].isnull()

0    False
1    False
2     True
3     True
4    False
Name: Email, dtype: bool

#### ✅ Real-World Use Case:

Once the data is in a DataFrame, `isnull()` does **not** treat strings such as `""` or `'null'` as missing; you must convert them first.

🔹 *Why this method?*
Ensures consistent missing value representation.
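
An exact-match `replace()` misses variants such as `"N/A"` or a string of several spaces. One possible way to catch those, sketched here with a hypothetical `MISSING_TOKENS` set that you would adapt to your own data, is to compare on a trimmed, lower-cased copy:

```python
import pandas as pd

raw = pd.Series(["alice", "", "  ", "N/A", "null", None, "bob"])

# Hypothetical list of placeholder tokens; adjust it to your data.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "???"}

# Compare on a normalized copy so "N/A" and "  " are caught too.
key = raw.str.strip().str.lower()
cleaned = raw.mask(key.isin(MISSING_TOKENS))

print(cleaned.isna().tolist())
```

`mask()` blanks out the rows where the condition is true and leaves the original text of the other rows untouched.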

### 🔹 **5. Pattern Matching with `str.contains()` or `str.match()`**

In [20]:
df[df['Email'].str.contains('alice', na=False)]

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice,new york


#### ✅ Use Case:

Find users with names/emails matching a pattern or keyword.

🔹 *Why this method?*
Effective for **filtering rows using partial strings**. Passing `na=False` counts missing values as non-matches, so the result is a clean boolean mask.
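
The matching methods differ in where the pattern is allowed to match. A sketch on made-up values (`str.fullmatch()` is not used elsewhere in this notebook and needs a reasonably recent pandas):

```python
import pandas as pd

s = pd.Series(["alice@example.com", "malice@example.com", "ALICE", None])

anywhere = s.str.contains("alice", na=False)              # substring anywhere
no_case = s.str.contains("alice", case=False, na=False)   # ignore case
at_start = s.str.match(r"alice", na=False)                # must match at the start
whole = s.str.fullmatch(r"alice@\w+\.com", na=False)      # entire string must match

print(anywhere.tolist())
print(at_start.tolist())
```

Note that `contains()` also hits `"malice"`, which is why anchoring with `match()` or `fullmatch()` is sometimes the safer choice.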

### 🔹 **6. Extracting Substrings with `str.extract()`**

In [21]:
df

Unnamed: 0,Name,Department,Email,Location
0,Alice,hr,alice,new york
1,BOB,hr,bob,new york
2,charlie,hr,,new york
3,David,finance,,london
4,EVA,finance,eva,london


In [22]:
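# Capture everything before the first "@" (or the whole value if there is none);
# expand=False returns a Series instead of a one-column DataFrame.
df['Username'] = df['Email'].str.extract(r'^([^@]+)', expand=False)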
df

Unnamed: 0,Name,Department,Email,Location,Username
0,Alice,hr,alice,new york,alice
1,BOB,hr,bob,new york,bob
2,charlie,hr,,new york,
3,David,finance,,london,
4,EVA,finance,eva,london,eva


#### ✅ Use Case:

From `alice@example.com`, extract just the **username** before `@`.

🔹 *Why this method?*
Works well with **regex patterns** for custom string extraction. Rows that are missing or do not match come back as `NaN`.
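
When more than one piece is needed, named groups turn each capture into its own column. A sketch on made-up addresses (the group names `user` and `domain` are just illustrative choices):

```python
import pandas as pd

emails = pd.Series([
    "alice@example.com",
    "bob.smith@mail.example.org",
    "not-an-email",
    None,
])

# Each named group becomes a column; non-matching rows are filled with NaN.
parts = emails.str.extract(r"^(?P<user>[^@]+)@(?P<domain>[^@]+)$")

print(parts)
```

Using `[^@]+` rather than `\w+` keeps dots and hyphens in the username.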

### 🔹 **7. Splitting Strings**

#### ▶️ Method: `str.split()`

In [41]:
# n=1 splits on the first space only, so the result never has more than two columns
df[['Location 1', 'Location 2']] = df['Location'].str.split(' ', n=1, expand=True)
df

Unnamed: 0,Name,Department,Email,Location,Username,Location 1,Location 2
0,Alice,hr,alice,new york,alice,new,york
1,BOB,hr,bob,new york,bob,new,york
2,charlie,hr,,new york,,new,york
3,David,finance,,london,,london,
4,EVA,finance,eva,london,eva,london,


#### ✅ Use Case:

You import names as `"Alice Smith"` and need to **split into separate columns**.

🔹 *Why this method?*
Turns a packed string into separate **columns** that can be filtered, grouped, or joined on.
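
Names do not always have exactly two parts. The sketch below, on made-up names, shows how `n=1` and `str.rsplit()` control which side absorbs the extra words:

```python
import pandas as pd

names = pd.Series(["Alice Smith", "Bob", "Mary Ann Jones"])

# Split on the first space: everything after it lands in the second column.
first_rest = names.str.split(" ", n=1, expand=True)
first_rest.columns = ["first", "rest"]

# Split on the last space instead: useful when the surname is the final word.
rest_last = names.str.rsplit(" ", n=1, expand=True)
rest_last.columns = ["rest", "last"]

print(first_rest)
print(rest_last)
```

Rows with nothing to split get a missing value in the second column.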

### 🔹 **8. Joining Strings**

#### ▶️ Method: `str.cat()`

In [44]:
df['EmailDomain'] = df['Name'].str.lower().str.strip().str.cat(['@example.com']*len(df))
df

Unnamed: 0,Name,Department,Email,Location,Username,Location 1,Location 2,EmailDomain
0,Alice,hr,alice,new york,alice,new,york,alice@example.com
1,BOB,hr,bob,new york,bob,new,york,bob@example.com
2,charlie,hr,,new york,,new,york,charlie@example.com
3,David,finance,,london,,london,,david@example.com
4,EVA,finance,eva,london,eva,london,,eva@example.com


#### ✅ Use Case:

Construct **email addresses** from usernames.

🔹 *Why this method?*
Combines strings across columns or with static suffixes.
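
`str.cat()` is most useful for column-to-column joins, where its `sep` and `na_rep` arguments come into play. A sketch on a made-up frame (the column names and the `"unknown"` placeholder are assumptions):

```python
import pandas as pd

people = pd.DataFrame({
    "first": ["alice", "bob", None],
    "last": ["smith", None, "jones"],
})

# Column-to-column join; a missing value on either side yields NaN.
strict = people["first"].str.cat(people["last"], sep=".")

# na_rep substitutes a placeholder for missing pieces instead.
lenient = people["first"].str.cat(people["last"], sep=".", na_rep="unknown")

# For a constant suffix, plain "+" is a simpler alternative to str.cat.
emails = lenient + "@example.com"

print(strict.tolist())
print(emails.tolist())
```

Whether a missing piece should produce `NaN` or a placeholder depends on how the result is used downstream.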

### 🔹 **9. Removing Non-Alphabetic Characters**

In [45]:
df['Name'] = df['Name'].str.replace('[^a-zA-Z ]', '', regex=True)
df

Unnamed: 0,Name,Department,Email,Location,Username,Location 1,Location 2,EmailDomain
0,Alice,hr,alice,new york,alice,new,york,alice@example.com
1,BOB,hr,bob,new york,bob,new,york,bob@example.com
2,charlie,hr,,new york,,new,york,charlie@example.com
3,David,finance,,london,,london,,david@example.com
4,EVA,finance,eva,london,eva,london,,eva@example.com


#### ✅ Use Case:

In text analysis or name fields, remove **symbols, digits, or emojis**.

🔹 *Why this method?*
Helpful for preprocessing text data for ML/NLP. Note that `[^a-zA-Z ]` also removes accented letters such as `é` and hyphens inside names.
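
If the data may contain non-English names, an ASCII-only character class can be too aggressive. The sketch below contrasts it with one possible Unicode-aware pattern (the sample names and the second pattern are illustrative choices, not the only option):

```python
import pandas as pd

names = pd.Series(["Alice123", "B@ob!", "Jos\u00e9", "Anne-Marie"])

# ASCII-only: keeps a-z, A-Z and spaces, so the accented letter is lost.
ascii_only = names.str.replace(r"[^a-zA-Z ]", "", regex=True)

# Unicode-aware: drop punctuation/symbols, then digits and underscores,
# while keeping letters from any alphabet.
unicode_letters = names.str.replace(r"[^\w\s]|[\d_]", "", regex=True)

print(ascii_only.tolist())
print(unicode_letters.tolist())
```

Both patterns remove the hyphen in `"Anne-Marie"`; add `-` to the allowed set if hyphenated names should survive.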

### 🔹 **10. Mapping or Normalizing Categories**

In [46]:
df['Department'] = df['Department'].map({
    'hr': 'HR',
    'finance': 'Finance'
})

df

Unnamed: 0,Name,Department,Email,Location,Username,Location 1,Location 2,EmailDomain
0,Alice,HR,alice,new york,alice,new,york,alice@example.com
1,BOB,HR,bob,new york,bob,new,york,bob@example.com
2,charlie,HR,,new york,,new,york,charlie@example.com
3,David,Finance,,london,,london,,david@example.com
4,EVA,Finance,eva,london,eva,london,,eva@example.com


#### ✅ Real-World Use Case:

Standardizing department names before grouping or pivoting.

🔹 *Why this method?*
You can **normalize similar values** to a single, clean label. Keep in mind that `map()` returns `NaN` for any value that is missing from the dictionary.
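
Because unmapped values turn into `NaN`, it can help to keep a fallback. A sketch with a made-up `"legal"` department that is absent from the mapping:

```python
import pandas as pd

dept = pd.Series(["hr", "finance", "legal", None])
labels = {"hr": "HR", "finance": "Finance"}

mapped = dept.map(labels)              # unmapped "legal" becomes NaN
with_fallback = mapped.fillna(dept)    # keep the original where no label exists
replaced = dept.replace(labels)        # replace() leaves unmapped values untouched

print(mapped.tolist())
print(with_fallback.tolist())
```

Comparing `mapped.isna()` with `dept.isna()` is a quick way to spot categories the dictionary does not cover yet.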

### 🔹 **11. Removing Duplicates After Cleaning**

In [47]:
df['Location'] = df['Location'].str.strip().str.lower()
df['Location'].unique()

array(['new york', 'london'], dtype=object)

#### ✅ Use Case:

Data shows `'new york'`, `'NEW YORK'`, and `' new york '`; all need to be recognized as the same.

🔹 *Why this method?*
Fix inconsistencies **before aggregation or deduplication**.
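
The same idea applies to whole rows. A sketch on a made-up frame, showing that `drop_duplicates()` only finds the repeated record once the text columns are normalized:

```python
import pandas as pd

rows = pd.DataFrame({
    "name": ["Alice", "alice ", "Bob"],
    "city": ["New York", " new york", "London"],
})

# Before cleaning, all three rows look distinct.
before = len(rows.drop_duplicates())

# Normalize every (string) column, then deduplicate.
cleaned = rows.apply(lambda col: col.str.strip().str.lower())
deduped = cleaned.drop_duplicates().reset_index(drop=True)

print(before, len(deduped))
```

This assumes every column holds strings; in a mixed frame you would apply the normalization to the text columns only.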

## 📌 Summary Table

| Task                    | Method                           | Example / Use Case                           |
| ----------------------- | -------------------------------- | -------------------------------------------- |
| Trim whitespace         | `str.strip()`                    | Clean user-entered names                     |
| Standardize casing      | `str.lower()` / `str.upper()`    | Group values like "HR", "hr", "Hr"           |
| Replace substrings      | `str.replace()`                  | Remove email domain or unwanted words        |
| Handle empty strings    | `replace()`                      | Treat `''`, `'null'` as missing              |
| Pattern detection       | `str.contains()` / `str.match()` | Filter emails with specific pattern          |
| Substring extraction    | `str.extract()`                  | Pull username from email                     |
| Split into columns      | `str.split()`                    | Separate full name into first and last names |
| Combine strings         | `str.cat()`                      | Create email IDs from names                  |
| Remove noise characters | `str.replace(r'[^a-zA-Z ]', '', regex=True)` | Strip out digits/symbols from name |
| Normalize categories    | `map()`                          | Convert various "finance" spellings into one |


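### 🧠 Best Practices

* Apply the `str.strip().str.lower()` combination to **categorical string columns** before grouping or joining.
* Convert empty strings and placeholder values to real missing values early; `.str` methods propagate `NaN`, so pass `na=False` to `str.contains()` when you need a boolean mask.
* Pass `regex=` explicitly to `str.replace()`: the default was `regex=True` before pandas 2.0 and is `regex=False` from 2.0 onward.
* Use `.unique()` or `.value_counts()` to explore inconsistent values before cleaning.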
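
For the exploration step, stray whitespace is easy to miss in normal output. One small trick, sketched here on made-up values, is to count the `repr()` of each value so that the quotes make leading and trailing spaces visible:

```python
import pandas as pd

raw = pd.Series([" HR", "hr ", "Hr", "finance", "FINANCE", "hr"])

# repr() wraps each value in quotes, exposing hidden spaces.
before = raw.map(repr).value_counts()

# The same count after normalization shows how many real categories exist.
after = raw.str.strip().str.lower().value_counts()

print(before)
print(after)
```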

