# **Data Cleaning**

# **7. Renaming and Replacing Columns, Labels**

In [47]:
import numpy as np
import pandas as pd 

Renaming and replacing labels and values is essential for:

* Making data **more readable and standardized**,
* Cleaning **messy imports** (e.g., from Excel/CSV),
* Preparing for **visualization or modeling**.


## 🎯 What Are We Renaming or Replacing?

| Element                 | Description                                |
| ----------------------- | ------------------------------------------ |
| **Columns**             | Headers of the DataFrame                   |
| **Index Labels**        | Row labels (default: 0, 1, 2...) or custom |
| **Values inside cells** | Actual data values (e.g., 'Y' → 'Yes')     |


In [48]:
df = pd.DataFrame({
    'First Name ': ['Alice', 'Bob'],
    ' last-name': ['Smith', 'Jones'],
    'AGE': [25, 30],
    'country of residence': ['USA', 'Canada']
})

df

Unnamed: 0,First Name,last-name,AGE,country of residence
0,Alice,Smith,25,USA
1,Bob,Jones,30,Canada


## ✅ 1. `df.rename()` — Rename Specific Columns or Index

In [49]:
df.columns

Index(['First Name ', ' last-name', 'AGE', 'country of residence'], dtype='object')

In [50]:
df.rename(columns={
    'First Name ': 'first_name',
    ' last-name': 'last_name'
}, inplace=True)

df

Unnamed: 0,first_name,last_name,AGE,country of residence
0,Alice,Smith,25,USA
1,Bob,Jones,30,Canada


In [51]:
df.columns

Index(['first_name', 'last_name', 'AGE', 'country of residence'], dtype='object')

You can also rename row index labels:

In [52]:
df.rename(index={0: 'row_1', 1: 'row_2'}, inplace=True)
df

Unnamed: 0,first_name,last_name,AGE,country of residence
row_1,Alice,Smith,25,USA
row_2,Bob,Jones,30,Canada


### 🔹 Use Case:

* Rename a few selected columns or rows.
* Keeps others **untouched**.

🔹 **Why this?**
Targeted, safe changes. Avoids accidentally renaming everything.
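
One thing to watch with `rename()` is that a key matching no column is skipped without warning, which is easy to hit when a header carries stray whitespace. The sketch below uses a cut-down copy of the example frame (the variable names are illustrative only) to show the default behaviour, the non-inplace return value, and the `errors='raise'` option.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name ': ['Alice', 'Bob'],   # note the trailing space
    'AGE': [25, 30],
})

# Without inplace=True, rename() returns a new DataFrame and leaves df untouched.
renamed = df.rename(columns={'First Name ': 'first_name'})

# A key that matches no column (the trailing space is missing) is ignored by default.
unchanged = df.rename(columns={'First Name': 'first_name'})

# errors='raise' turns that silent miss into a KeyError.
try:
    df.rename(columns={'First Name': 'first_name'}, errors='raise')
    missed_key_raised = False
except KeyError:
    missed_key_raised = True
```

Assigning the result instead of using `inplace=True` keeps the original frame available, which can help when debugging a cleaning step.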

## ✅ 2. `df.columns = [...]` — Rename All Columns at Once

In [53]:
df.columns = ['FirstName', 'LastName', 'Age', 'Country']
df

Unnamed: 0,FirstName,LastName,Age,Country
row_1,Alice,Smith,25,USA
row_2,Bob,Jones,30,Canada


### 🔹 Use Case:

* Renaming **all columns at once**, like after CSV import.

🔹 **Why this?**
Quick and clean when full control of headers is needed.
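
Direct assignment requires exactly one name per column. As a minimal sketch on a made-up three-column frame, the snippet below shows the error raised on a length mismatch and `set_axis()` as an alternative that returns a new DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Assigning too few (or too many) names raises ValueError and leaves df as it was.
try:
    df.columns = ['x', 'y']
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# set_axis() performs the same full replacement but returns a new DataFrame,
# so it can be used inside a method chain.
relabeled = df.set_axis(['x', 'y', 'z'], axis=1)
```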

## ✅ 3. `str.strip()` / `str.lower()` / `str.replace()` for Cleaning Column Names

In [54]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df

Unnamed: 0,firstname,lastname,age,country
row_1,Alice,Smith,25,USA
row_2,Bob,Jones,30,Canada


### 🔹 Use Case:

* Cleaning **messy or inconsistent column names** (common with Excel/CSV).

🔹 **Why this?**
Great for automatic cleanup — especially with auto-generated columns.
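
To see the effect on names that really are messy, the same chain can be run against the original headers from the start of this section. This is a sketch; the regex variant of `str.replace()` is a small extension that also handles hyphens, not something the cell above does.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name ': ['Alice', 'Bob'],
    ' last-name': ['Smith', 'Jones'],
    'AGE': [25, 30],
    'country of residence': ['USA', 'Canada'],
})

df.columns = (
    df.columns
    .str.strip()                               # drop leading/trailing whitespace
    .str.lower()                               # normalize case
    .str.replace(r'[\s\-]+', '_', regex=True)  # runs of spaces or hyphens -> one underscore
)
```

After this, the columns are `first_name`, `last_name`, `age`, and `country_of_residence`.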

## ✅ 4. `df.set_index()` and `df.reset_index()`

In [55]:
df.set_index('firstname', inplace=True)
df

firstname,lastname,age,country
Alice,Smith,25,USA
Bob,Jones,30,Canada


To move the index back into a regular column and restore the default integer index:

In [56]:
df.reset_index(inplace=True)
df

Unnamed: 0,firstname,lastname,age,country
0,Alice,Smith,25,USA
1,Bob,Jones,30,Canada


### 🔹 Use Case:

* Setting a **label column** as the index (e.g., customer ID).
* Resetting index before saving or transforming data.

🔹 **Why this?**
Indexing is vital for merges, joins, plotting, and better row identification.
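
The customer ID use case might look like the sketch below. The `customer_id` column and its values are hypothetical. It also shows two options not used above: `verify_integrity=True` on `set_index()` and `drop=True` on `reset_index()`.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': ['C001', 'C002'],   # hypothetical ID column
    'name': ['Alice', 'Bob'],
})

# verify_integrity=True raises if the new index would contain duplicates.
indexed = df.set_index('customer_id', verify_integrity=True)

# Label-based lookup now goes through the index.
alice = indexed.loc['C001', 'name']

# reset_index() moves the index back into a column ...
restored = indexed.reset_index()

# ... while drop=True discards it and keeps only a fresh 0..n-1 index.
dropped = indexed.reset_index(drop=True)
```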


## ✅ 5. Renaming Column Labels in a MultiIndex


In [57]:
df.columns = pd.MultiIndex.from_tuples([
    ('demographics', 'first_name'),
    ('demographics', 'last_name'),
    ('details', 'age'),
    ('details', 'country')
])

df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,first_name,last_name,age,country
0,Alice,Smith,25,USA
1,Bob,Jones,30,Canada


In [58]:
df.rename(columns={'first_name': 'fname'}, level=1, inplace=True)
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,USA
1,Bob,Jones,30,Canada


### 🔹 Use Case:

* Multi-level DataFrames like in **grouped**, **pivoted**, or **joined** data.

🔹 **Why this?**
Maintains **hierarchical context**, useful in time-series or grouped data.
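
The `level` argument works for the outer level too, and a hierarchy that is no longer needed can be flattened. The sketch below uses a two-column version of the frame. Joining the tuple parts with `_` is just one possible flattening convention.

```python
import pandas as pd

df = pd.DataFrame(
    [['Alice', 25], ['Bob', 30]],
    columns=pd.MultiIndex.from_tuples([
        ('demographics', 'first_name'),
        ('details', 'age'),
    ]),
)

# level=0 targets the outer labels; the inner labels are left alone.
renamed = df.rename(columns={'demographics': 'person'}, level=0)

# Flatten to single-level names by joining each (outer, inner) tuple.
flat = renamed.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
```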

## ✅ 6. Replacing Values in Columns (`replace()`)

In [59]:
df[('details', 'country')] = df[('details', 'country')].replace({'USA': 'United States', 'Canada': 'CA'})
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,United States
1,Bob,Jones,30,CA


Or use `regex=True` for patterns. Here the call changes nothing, because `'USA'` was already replaced in the previous cell:

In [60]:
df[('details', 'country')] = df[('details', 'country')].replace(r'\bUSA\b', 'United States', regex=True)
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,United States
1,Bob,Jones,30,CA


### 🔹 Use Case:

* Fixing **abbreviations**, **typos**, **standardizing categories**.

🔹 **Why this?**
Keeps **categorical data** consistent, which is critical in ML or aggregations.
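
`map()` is often used for the same job, but it treats values missing from the mapping differently. A minimal sketch, with `'Mexico'` added as a value that the mapping does not cover:

```python
import pandas as pd

country = pd.Series(['USA', 'Canada', 'Mexico'])
mapping = {'USA': 'United States', 'Canada': 'CA'}

# replace() changes only the values it finds in the mapping.
replaced = country.replace(mapping)

# map() looks every value up; anything missing from the mapping becomes NaN.
mapped = country.map(mapping)
```

`replace()` is the safer choice when the mapping is partial. `map()` is useful when an unmapped value should show up as missing.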

## ✅ 7. Replacing Based on Condition (`loc`)

In [61]:
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,United States
1,Bob,Jones,30,CA


In [None]:
df.loc[df[('details', 'age')] > 28, ('details', 'age')] = 28 # cap age at 28

In [63]:
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,United States
1,Bob,Jones,28,CA


### 🔹 Use Case:

* Rule-based transformation of specific rows/values.

🔹 **Why this?**
More control than `replace()` when logic depends on other columns.
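
A case where the condition really does depend on another column could look like the sketch below, on a flat three-row frame invented for illustration. For a plain upper bound on a single column, `Series.clip()` is shown as a shorter alternative.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 41],
    'country': ['USA', 'Canada', 'USA'],
})

# Condition uses two columns; only USA rows older than 28 are capped.
capped_usa = df.copy()
is_usa_over_28 = (capped_usa['country'] == 'USA') & (capped_usa['age'] > 28)
capped_usa.loc[is_usa_over_28, 'age'] = 28

# For a simple cap on one column, clip() gives the same result as the loc pattern.
capped_all = df.assign(age=df['age'].clip(upper=28))
```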

## ✅ 8. Rename Columns with a Function

In [64]:
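# On MultiIndex columns the function runs on the labels of every level.
# These names are already clean, so this call changes nothing here.
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)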

In [65]:
df

Unnamed: 0_level_0,demographics,demographics,details,details
Unnamed: 0_level_1,fname,last_name,age,country
0,Alice,Smith,25,United States
1,Bob,Jones,28,CA


### 🔹 Use Case:

* You want to **standardize column names** programmatically.

🔹 **Why this?**
Automates renaming — great for data pipelines and repeat use.
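
For pipelines, a named function is usually easier to test and reuse than a lambda. The helper below, `clean_name`, is a hypothetical example and not part of pandas. It assumes that any run of characters other than lowercase letters and digits should become a single underscore.

```python
import re
import pandas as pd

def clean_name(name):
    """Hypothetical helper: trim, lowercase, and snake_case one column label."""
    name = str(name).strip().lower()
    # Collapse every run of non-alphanumeric characters into one underscore,
    # then drop underscores left at either end.
    return re.sub(r'[^0-9a-z]+', '_', name).strip('_')

df = pd.DataFrame({
    'First Name ': ['Alice'],
    ' last-name': ['Smith'],
    'Country (of residence)': ['USA'],
})

# rename() calls clean_name once per column label and returns a new DataFrame.
cleaned = df.rename(columns=clean_name)
```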

## 🔍 Summary Comparison

| Method                          | Use Case                            | Why Use It                                |
| ------------------------------- | ----------------------------------- | ----------------------------------------- |
| `rename(columns={...})`         | Rename selected columns or rows     | Safe and precise                          |
| `df.columns = [...]`            | Rename all columns                  | Quick full renaming                       |
| `str.strip()`, `str.lower()`    | Clean messy names from source files | Normalization & consistency               |
| `set_index()` / `reset_index()` | Change or restore row labels        | Better row identity or a default index    |
| `replace({...})`                | Standardize values in a column      | Fix typos, unify categories               |
| `loc[condition, column] = ...`  | Apply value changes based on logic  | More flexible than `replace()`            |
| `rename(columns=lambda...)`     | Rename with a function              | Dynamic renaming logic for automation     |


## 🧪 Real-World Scenarios

| Scenario                             | Suggested Approach                              |
| ------------------------------------ | ----------------------------------------------- |
| Cleaning Excel exports with spaces   | `str.strip().str.lower().str.replace(' ', '_')` |
| Updating labels before visualization | `rename(columns={...})` or `replace()`          |
| Mapping country codes to names       | `replace()` or `map()`                          |
| Setting Customer ID as index         | `set_index('customer_id')`                      |
| Resetting index before export        | `reset_index()`                                 |
| Column cleanup in a data pipeline    | `rename(columns=lambda x: ...)`                 |

