# **Data Cleaning**

## **8. Index Cleaning**

In [2]:
import numpy as np
import pandas as pd 

## ✅ What is Index Cleaning?

In pandas, the **index** holds the row labels of a DataFrame or Series. Labels often act as unique identifiers, although pandas does not require them to be unique. A clean, well-structured index is important for:

* Easy row access and slicing
* Avoiding data mismatches or misalignments
* Maintaining data integrity during merges, joins, or reshaping

Index cleaning involves resetting, setting, renaming, validating, and removing redundant or poorly structured index values.
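To illustrate the point about misalignment, here is a minimal sketch (the two Series `a` and `b` are made-up data) showing that pandas aligns arithmetic on index labels rather than on row position:

```python
import pandas as pd

# Two Series of equal length whose labels only partly overlap (hypothetical data).
a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[1, 2, 3])

# Addition matches rows by label; labels present on only one side yield NaN.
total = a + b
print(total)

# Resetting b to the default 0..n-1 labels makes the rows line up by position.
aligned = a + b.reset_index(drop=True)
print(aligned)
```

Labels `0` and `3` each exist on only one side, so those rows come back as `NaN` in `total`, which is the kind of silent mismatch a clean index helps avoid.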

In [6]:
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=3),
    'sales': [200, 300, 400]
})

df

|   | date       | sales |
| - | ---------- | ----- |
| 0 | 2023-01-01 | 200   |
| 1 | 2023-01-02 | 300   |
| 2 | 2023-01-03 | 400   |


### 🔹 1. `set_index()` – Set a Column as Index

#### ➤ **Purpose**:

Use one or more columns as a meaningful index (e.g., a date or an ID).

#### ➤ **Syntax**:

```python
df.set_index('column_name', inplace=True)
```

#### ➤ **Use Case**:

Time-series analysis, uniquely identifying rows, logical grouping.
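When the new index is meant to identify rows uniquely, `set_index()` can check that for you. The sketch below uses a made-up `records` table with a duplicated `id`; `verify_integrity=True` raises a `ValueError` instead of silently creating a duplicated index:

```python
import pandas as pd

# Hypothetical data: the id 2 appears twice.
records = pd.DataFrame({'id': [1, 2, 2], 'value': [10, 20, 30]})

try:
    records.set_index('id', verify_integrity=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# One possible fix: keep the first row per id, then set the index.
indexed = records.drop_duplicates('id').set_index('id')
print(indexed)
```

Whether keeping the first row is the right deduplication rule depends on the data; it is used here only to keep the example short.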

In [7]:
df

|   | date       | sales |
| - | ---------- | ----- |
| 0 | 2023-01-01 | 200   |
| 1 | 2023-01-02 | 300   |
| 2 | 2023-01-03 | 400   |


In [9]:
df.set_index('date', inplace=True)
df

| date       | sales |
| ---------- | ----- |
| 2023-01-01 | 200   |
| 2023-01-02 | 300   |
| 2023-01-03 | 400   |


#### ✅ **Why**:

A `DatetimeIndex` supports partial-string indexing, so a single label such as a month selects every row that falls inside it:

In [11]:
df.loc['2023-01']

| date       | sales |
| ---------- | ----- |
| 2023-01-01 | 200   |
| 2023-01-02 | 300   |
| 2023-01-03 | 400   |
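With only three rows, the month filter returns everything. A slightly larger sketch (the 60-day `ts` frame is made-up data) makes the effect of time-based slicing easier to see:

```python
import pandas as pd

# Hypothetical daily sales for 60 consecutive days starting 2023-01-01.
ts = pd.DataFrame(
    {'sales': list(range(100, 160))},
    index=pd.date_range('2023-01-01', periods=60, name='date'),
)

# Partial-string indexing: every row in January 2023.
january = ts.loc['2023-01']

# Label-based slice: both endpoints are included.
first_week = ts.loc['2023-01-01':'2023-01-07']

print(len(january), len(first_week))
```

Unlike positional slicing, a label slice on a `DatetimeIndex` includes the end label, so `first_week` holds seven rows.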


### 🔹 2. `reset_index()` – Reset to Default Integer Index

#### ➤ **Purpose**:

Remove a custom index and revert to the default integer index (`0, 1, 2, ...`).

#### ➤ **Syntax**:

```python
df.reset_index(drop=False, inplace=True)
```

* `drop=True`: Discards the old index entirely instead of keeping it as a column.
* `drop=False` (default): Moves the old index back into the DataFrame as a regular column.

#### ➤ **Use Case**:

After grouping or when a custom index becomes unnecessary.

In [12]:
df.reset_index(inplace=True)
df

|   | date       | sales |
| - | ---------- | ----- |
| 0 | 2023-01-01 | 200   |
| 1 | 2023-01-02 | 300   |
| 2 | 2023-01-03 | 400   |


#### ✅ **Why**:

Useful when exporting data or when the index is not meaningful anymore.
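The "after grouping" use case can be sketched as follows, using a made-up `orders` table. The group keys land in the index of the result, and `reset_index()` turns them back into an ordinary column:

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'sales': [200, 300, 400, 100],
})

# The result is a Series indexed by 'region'.
grouped = orders.groupby('region')['sales'].sum()

# reset_index() moves 'region' back into a column and restores the 0..n-1 index.
flat = grouped.reset_index()
print(flat)
```

An alternative with the same effect is to pass `as_index=False` to `groupby()`, which avoids creating the custom index in the first place.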

### 🔹 3. `rename_axis()` – Rename Index/Column Axis Name

#### ➤ **Purpose**:

Give a name (label) to the index or column axis.

#### ➤ **Syntax**:

```python
df.rename_axis('IndexName', axis='index', inplace=True)
```

#### ➤ **Use Case**:

When the index represents something meaningful (like 'City' or 'Date') and we want to label it.

In [13]:
df = pd.DataFrame({'value': [10, 20, 30]}, index=['A', 'B', 'C'])

df

|   | value |
| - | ----- |
| A | 10    |
| B | 20    |
| C | 30    |


In [14]:
df.rename_axis('Categories')

| Categories | value |
| ---------- | ----- |
| A          | 10    |
| B          | 20    |
| C          | 30    |


#### ✅ **Why**:

Helpful when exporting data (CSV/Excel) for clarity.
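A small sketch of the export case (the name `df_named` is only for illustration): with no index name, the first header cell in the CSV is empty; after `rename_axis()`, the header labels the index column.

```python
import pandas as pd

df_named = pd.DataFrame({'value': [10, 20, 30]}, index=['A', 'B', 'C'])

# to_csv() with no path returns the CSV text as a string.
unnamed_csv = df_named.to_csv()
print(unnamed_csv)

# rename_axis() returns a new DataFrame whose index carries the name 'Categories'.
csv_text = df_named.rename_axis('Categories').to_csv()
print(csv_text)
```

Note that `rename_axis()` returns a new object unless `inplace=True` is passed, so `df_named` itself is left unchanged here.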

### 🔹 4. Removing Redundant Index Columns

#### ➤ Problem:

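An extra "Unnamed: 0" column usually appears when a CSV was written with `to_csv()` using its default `index=True` and then read back with `read_csv()`. The saved index has no header, so pandas loads it as an ordinary column and gives it a placeholder name.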

#### ✅ **Fix**:

```python
# Option 1: read the first column back in as the index
df = pd.read_csv('file.csv', index_col=0)
# Option 2: if the file is already loaded, drop the stray column
df.drop(columns='Unnamed: 0', inplace=True)
```

#### ✅ **Why**:

Prevents unnecessary confusion and clutter in datasets.
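The round trip can be reproduced without touching the disk. This sketch uses `io.StringIO` as a stand-in for a file (the variable names are illustrative) and also shows a third option, writing the CSV without the index in the first place:

```python
import io
import pandas as pd

original = pd.DataFrame({'value': [10, 20, 30]})

# to_csv() writes the index by default, as a first column with an empty header.
csv_with_index = original.to_csv()
reread = pd.read_csv(io.StringIO(csv_with_index))
print(reread.columns.tolist())

# Fix on read: treat the first column as the index.
fixed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Fix on write: leave the index out of the file.
clean = pd.read_csv(io.StringIO(original.to_csv(index=False)))
print(fixed.columns.tolist(), clean.columns.tolist())
```

Fixing the problem on write is usually the simplest choice when you control the export and the index carries no information.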


### 🔹 5. `sort_index()` – Sort the Index

#### ➤ **Purpose**:

Sort rows based on the index values.


In [17]:
df

|   | value |
| - | ----- |
| A | 10    |
| B | 20    |
| C | 30    |


In [20]:
df.sort_index(ascending=False, inplace=True)
df

|   | value |
| - | ----- |
| C | 30    |
| B | 20    |
| A | 10    |



#### ➤ Use Case:

Useful in time-series data, reporting, and consistent row order.
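For the time-series case, a minimal sketch (the `unsorted` frame is made-up data) shows how to check whether a date index is already in order and how to sort it when it is not:

```python
import pandas as pd

# Hypothetical sales rows that arrived out of chronological order.
unsorted = pd.DataFrame(
    {'sales': [400, 200, 300]},
    index=pd.to_datetime(['2023-01-03', '2023-01-01', '2023-01-02']),
)

print(unsorted.index.is_monotonic_increasing)

# sort_index() returns a new DataFrame sorted by label (ascending by default).
ordered = unsorted.sort_index()
print(ordered)
```

Checking `is_monotonic_increasing` first is optional; it is shown here as a cheap way to confirm that the data is in the expected order.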

### 🔹 6. Validate Index Uniqueness

In [22]:
df.index.is_unique # Returns True or False

True

#### ➤ Use Case:

Before setting an index or merging/joining based on index, ensure it’s unique to avoid unexpected behaviors.
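When `is_unique` comes back `False`, the next step is usually to find the offending labels. A minimal sketch with a made-up `dupes` frame:

```python
import pandas as pd

# Hypothetical data: the label 'A' appears twice.
dupes = pd.DataFrame({'value': [10, 20, 30]}, index=['A', 'B', 'A'])

print(dupes.index.is_unique)

# duplicated() flags each repeat of a label after its first occurrence.
mask = dupes.index.duplicated(keep='first')
print(dupes[mask])

# Keep only the first row for each label.
deduped = dupes[~mask]
print(deduped.index.is_unique)
```

Keeping the first occurrence is only one policy; `keep='last'` or `keep=False` (flag all repeats) may suit other datasets better.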

### 🔹 7. Flattening MultiIndex Columns (Hierarchical Index)

```python
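# Join the levels of each column tuple with '_' and trim underscores left by empty levels
df.columns = ['_'.join(map(str, col)).strip('_') if isinstance(col, tuple) else col for col in df.columns]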
```

#### ➤ Use Case:

After using `groupby().agg()` or pivoting, you often get MultiIndex columns. Flattening helps for export or analysis.
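As an end-to-end sketch of that use case (the `sales` table and the `_` separator are illustrative choices), aggregating one column with two functions produces two-level column labels, which can then be joined into plain strings:

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    'region': ['East', 'West', 'East', 'West'],
    'sales': [200, 300, 400, 100],
})

# A dict of lists in agg() yields MultiIndex columns such as ('sales', 'sum').
summary = sales.groupby('region').agg({'sales': ['sum', 'mean']})
print(summary.columns.tolist())

# Join each tuple of levels into a single string label.
summary.columns = ['_'.join(str(level) for level in col) for col in summary.columns]

# Move the group key out of the index so the result is fully flat.
summary = summary.reset_index()
print(summary)
```

Flattening the columns before calling `reset_index()` avoids labels like `('region', '')`, which would otherwise need extra cleanup.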


## ✅ Real-World Scenarios

| Use Case                                    | Method                       | Reason                                   |
| ------------------------------------------- | ---------------------------- | ---------------------------------------- |
| After `groupby()` you want a flat DataFrame | `reset_index()`              | Bring grouped keys back as columns       |
| Time-series forecasting with dates          | `set_index('date')`          | Makes time slicing and plotting easy     |
| Cleaning CSV import with unwanted index     | `drop(columns='Unnamed: 0')` | Avoid unnecessary columns                |
| Assign readable index name                  | `rename_axis()`              | Helpful for documentation and export     |
| Ensuring proper join keys                   | `df.index.is_unique`         | Avoids data duplication or loss in joins |


## 🔄 Summary Table

| Task                         | Method                               |
| ---------------------------- | ------------------------------------ |
| Remove custom index          | `reset_index()`                      |
| Set column as index          | `set_index()`                        |
| Rename index label           | `rename_axis()`                      |
| Drop auto-added index column | `drop(columns=...)`                  |
| Sort rows by index           | `sort_index()`                       |
| Check index uniqueness       | `df.index.is_unique`                 |
| Flatten MultiIndex           | List comprehension over `df.columns` |

