# **Data Selection & Indexing**

In [1]:
import pandas as pd

## **7. Advanced Indexing Concepts**

Advanced indexing gives you **greater flexibility and precision** when working with complex datasets. This is crucial for handling **hierarchical data, time series, and high-dimensional data** often used in data science workflows.


## 📋 Topics Covered Under Advanced Indexing

| Subsection | Concept                                   |
| ---------- | ----------------------------------------- |
| 1          | MultiIndex (Hierarchical Indexing)        |
| 2          | Accessing MultiIndex Data                 |
| 3          | Cross-section (`.xs()`)                   |
| 4          | Slicing with `pd.IndexSlice`              |
| 5          | Fancy Indexing                            |
| 6          | Chained Indexing Warning                  |
| 7          | Time-based Indexing (DatetimeIndex)       |
| 8          | Fast Scalar Access with `.at[]`, `.iat[]` |



## 1. MultiIndex (Hierarchical Indexing)

### ✅ What is it?

MultiIndex allows multiple levels of indexing on rows (or columns), enabling you to represent **higher-dimensional data** in 2D DataFrames.


### ✅ Creating MultiIndex

In [2]:
# From tuples
index = pd.MultiIndex.from_tuples([
    ('USA', 'New York'),
    ('USA', 'Los Angeles'),
    ('India', 'Delhi'),
    ('India', 'Mumbai')
], names=('Country', 'City'))
df = pd.DataFrame({
    'Population': [8.6, 4.0, 19.8, 20.7]},
    index=index
)

df

| Country | City        | Population |
| ------- | ----------- | ---------- |
| USA     | New York    | 8.6        |
| USA     | Los Angeles | 4.0        |
| India   | Delhi       | 19.8       |
| India   | Mumbai      | 20.7       |
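Tuples are only one way to build the index. As a minimal sketch, the same structure can also be assembled from parallel arrays or by promoting existing columns with `set_index()`. The variable names (`flat`, `df_from_arrays`, `df_from_columns`) are illustrative only.

```python
import pandas as pd

countries = ['USA', 'USA', 'India', 'India']
cities = ['New York', 'Los Angeles', 'Delhi', 'Mumbai']
population = [8.6, 4.0, 19.8, 20.7]

# Option 1: parallel arrays, one array per index level
index = pd.MultiIndex.from_arrays([countries, cities], names=['Country', 'City'])
df_from_arrays = pd.DataFrame({'Population': population}, index=index)

# Option 2: start from a flat table and promote two columns to index levels
flat = pd.DataFrame({'Country': countries, 'City': cities, 'Population': population})
df_from_columns = flat.set_index(['Country', 'City'])

print(df_from_arrays.index.nlevels)
print(df_from_columns.index.names)
```

`set_index()` is often the more convenient route when the data arrives as a flat table, for example from a CSV file.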


### ✅ Real-World Use Cases

* **E-commerce**: Index by region and product
* **Finance**: Index by sector and company
* **Healthcare**: Index by hospital and department
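To make the first use case concrete, here is a small sketch of sales data indexed by region and product. The figures and the names `sales`, `units_by_region`, and `north` are made up for illustration.

```python
import pandas as pd

# Made-up sales figures, for illustration only
sales = pd.DataFrame(
    {'Units': [120, 80, 95, 60]},
    index=pd.MultiIndex.from_product(
        [['North', 'South'], ['Laptop', 'Phone']],
        names=['Region', 'Product'],
    ),
)

# Total units per region (aggregating over the Product level)
units_by_region = sales.groupby(level='Region')['Units'].sum()

# All products sold in one region
north = sales.loc['North']

print(units_by_region)
print(north)
```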


## 2. Accessing MultiIndex Data

### ✅ By level

In [3]:
df.loc['India']

| City   | Population |
| ------ | ---------- |
| Delhi  | 19.8       |
| Mumbai | 20.7       |


### ✅ By multiple keys

In [4]:
df.loc[('India', 'Delhi')]

Population    19.8
Name: (India, Delhi), dtype: float64
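A full key returns a Series with one entry per column, as the output above shows. As a sketch, adding a column label returns the scalar directly, and wrapping the tuple in a list keeps the result as a DataFrame. The names `delhi_pop` and `delhi_row` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('USA', 'New York'), ('USA', 'Los Angeles'), ('India', 'Delhi'), ('India', 'Mumbai')],
    names=('Country', 'City'),
)
df = pd.DataFrame({'Population': [8.6, 4.0, 19.8, 20.7]}, index=index)

# Row key as a tuple plus a column label returns a single value
delhi_pop = df.loc[('India', 'Delhi'), 'Population']

# A list containing the tuple keeps a DataFrame with both index levels
delhi_row = df.loc[[('India', 'Delhi')]]

print(delhi_pop)
print(delhi_row)
```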

## 3. Cross-section with `.xs()`

### ✅ Syntax:

```python
df.xs(key, axis=0, level=None, drop_level=True)
```

In [5]:
# Get all rows for 'Delhi' across all countries
df.xs('Delhi', level='City')

| Country | Population |
| ------- | ---------- |
| India   | 19.8       |
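By default `.xs()` drops the level that was selected on, which is why `City` no longer appears in the result above. A short sketch of keeping it with `drop_level=False` (the names `dropped` and `kept` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('USA', 'New York'), ('USA', 'Los Angeles'), ('India', 'Delhi'), ('India', 'Mumbai')],
    names=('Country', 'City'),
)
df = pd.DataFrame({'Population': [8.6, 4.0, 19.8, 20.7]}, index=index)

# Default: the 'City' level is removed from the result
dropped = df.xs('Delhi', level='City')

# drop_level=False keeps the full (Country, City) index
kept = df.xs('Delhi', level='City', drop_level=False)

print(dropped.index.tolist())
print(kept.index.tolist())
```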


## 4. Slicing with `pd.IndexSlice`

`pd.IndexSlice` lets you write slices for several index levels at once inside `.loc[]`, using ordinary `:` slice syntax for each level.

In [6]:
idx = pd.IndexSlice
# Get rows from India and all its cities
df.loc[idx['India', :]]

| City   | Population |
| ------ | ---------- |
| Delhi  | 19.8       |
| Mumbai | 20.7       |


In [7]:
df.loc[idx[:, ['Delhi', 'Mumbai']], :]

| Country | City   | Population |
| ------- | ------ | ---------- |
| India   | Delhi  | 19.8       |
| India   | Mumbai | 20.7       |


In [None]:
try:
    df.loc[idx[:, 'Delhi':'Mumbai'], :]
except Exception as e:
    print(e)

'MultiIndex slicing requires the index to be lexsorted: slicing on levels [1], lexsort depth 0'
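The error above occurs because label slices on a MultiIndex need a lexicographically sorted index. A minimal sketch of the usual fix is to call `sort_index()` before slicing. The name `df_sorted` is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('USA', 'New York'), ('USA', 'Los Angeles'), ('India', 'Delhi'), ('India', 'Mumbai')],
    names=('Country', 'City'),
)
df = pd.DataFrame({'Population': [8.6, 4.0, 19.8, 20.7]}, index=index)

idx = pd.IndexSlice

# Sort all index levels so that label slicing is allowed
df_sorted = df.sort_index()

# Label slices are inclusive and follow sort order, so every city that
# sorts between 'Delhi' and 'Mumbai' is selected, whatever its country
result = df_sorted.loc[idx[:, 'Delhi':'Mumbai'], :]
print(result)
```

Because the slice follows alphabetical order rather than the original row order, passing an explicit list such as `['Delhi', 'Mumbai']` is the safer choice when only specific cities are wanted.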


## 5. Fancy Indexing

Fancy indexing means passing **a list or array of labels, or a boolean mask,** to select specific rows or columns in a single call.

In [10]:
df.loc[[('USA', 'New York'), ('India', 'Mumbai')]]

| Country | City     | Population |
| ------- | -------- | ---------- |
| USA     | New York | 8.6        |
| India   | Mumbai   | 20.7       |
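The example above uses a list of labels. A sketch of the condition-based form follows, using a boolean mask built from a column and another built from an index level. The names `large` and `usa_only` are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('USA', 'New York'), ('USA', 'Los Angeles'), ('India', 'Delhi'), ('India', 'Mumbai')],
    names=('Country', 'City'),
)
df = pd.DataFrame({'Population': [8.6, 4.0, 19.8, 20.7]}, index=index)

# Boolean mask from a column: rows where Population exceeds 10
large = df.loc[df['Population'] > 10]

# Boolean mask from an index level: rows whose Country is 'USA'
usa_only = df.loc[df.index.get_level_values('Country') == 'USA']

print(large)
print(usa_only)
```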


## 6. Chained Indexing Warning ⚠️

Using chained indexing can lead to unpredictable behavior:
```python
df['column'][0] = 5  # ⚠ Bad Practice
```

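The first `[]` may return either a **view** or a **copy**, so the assignment can land on a temporary object and leave `df` unchanged. Pandas signals this with `SettingWithCopyWarning`, and with Copy-on-Write enabled a chained assignment never updates the original. Use a single `.loc[]` call instead: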
```python
df.loc[0, 'column'] = 5  # ✅ Recommended
```
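The same rule applies to filtered assignments. As a sketch with a made-up DataFrame, the row condition and the column label go into one `.loc[]` call, and a subset that should be modified independently is copied explicitly.

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Pune'], 'score': [1, 2, 3]})

# Chained form (avoid): df[df['score'] > 1]['score'] = 0
# The filter returns a new object, so the assignment may not reach df.

# Single .loc call: row selector and column label in one indexing operation
df.loc[df['score'] > 1, 'score'] = 0

# When an independent subset is wanted, make the copy explicit
subset = df[df['city'] == 'Delhi'].copy()
subset['score'] = 99  # changes subset only, df is untouched

print(df)
print(subset)
```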

## 7. Datetime Indexing (Time Series)

Pandas supports **time-aware indexing** using `DatetimeIndex`.

In [13]:
date_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 20, 30, 40, 50], index=date_range)

ts

2023-01-01    10
2023-01-02    20
2023-01-03    30
2023-01-04    40
2023-01-05    50
Freq: D, dtype: int64

In [14]:
# Access by date
print(ts['2023-01-03'])  # 30

30


### ✅ Time slicing:

In [15]:
ts['2023-01-02':'2023-01-04']

2023-01-02    20
2023-01-03    30
2023-01-04    40
Freq: D, dtype: int64
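A `DatetimeIndex` also accepts partial date strings, so a whole month can be selected without spelling out its endpoints. A small sketch, using a made-up series that spans two months (the names `ts2`, `january`, and `february` are illustrative):

```python
import pandas as pd

# Five daily values crossing the January/February boundary
dates = pd.date_range(start='2023-01-30', periods=5, freq='D')
ts2 = pd.Series([10, 20, 30, 40, 50], index=dates)

# Partial string: every entry that falls in the given month
january = ts2.loc['2023-01']
february = ts2.loc['2023-02']

print(january)
print(february)
```

Note that date-string slices include both endpoints, unlike integer position slices.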

## 8. Fast Scalar Access with `.at[]` and `.iat[]`

| Method   | Use Case                  | Example         |
| -------- | ------------------------- | --------------- |
| `.at[]`  | Label-based fast access   | `df.at[0, 'A']` |
| `.iat[]` | Integer-based fast access | `df.iat[0, 1]`  |

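Both accessors take exactly one row and one column and return (or set) a single value. They are faster than `.loc[]` and `.iloc[]` for scalar access because they skip the general handling of slices, lists, and masks.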
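A short sketch of both accessors on a made-up DataFrame, covering reads and a scalar write (the variable names are illustrative):

```python
import pandas as pd

# Made-up data, for illustration only
df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

value_by_label = df.at[0, 'A']      # row label 0, column label 'A'
value_by_position = df.iat[0, 1]    # row position 0, column position 1 ('B')

# Both accessors also support assigning a single value
df.at[2, 'B'] = 99

print(value_by_label, value_by_position)
print(df)
```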


## ✅ Real-World Use Cases

| Scenario                            | Technique                      |
| ----------------------------------- | ------------------------------ |
| Monthly sales by region and product | `MultiIndex` with `.xs()`      |
| Filter temperature by day/hour      | `DatetimeIndex` slicing        |
| Access precise value for ML feature | `.at[]` / `.iat[]`             |
| Large data pipeline indexing        | `pd.IndexSlice`, fancy indexing |
| Avoid silent assignment bugs        | Single `.loc[]` call instead of chained indexing |


## ✅ Summary Table

| Concept            | Function/Method                      | Example                                          |
| ------------------ | ------------------------------------ | ------------------------------------------------ |
| MultiIndex         | `pd.MultiIndex.from_tuples()`        | `pd.MultiIndex.from_tuples([('India', 'Delhi')])` |
| Cross-section      | `.xs()`                              | `df.xs('USA', level='Country')`                  |
| Index slicing      | `pd.IndexSlice`                      | `df.loc[idx['India', :]]`                        |
| Time slicing       | DatetimeIndex                        | `ts['2023-01-01':'2023-01-04']`                  |
| Fast scalar access | `.at[]`, `.iat[]`                    | `df.at[0, 'col']`                                |
| Fancy Indexing     | List/Array of labels                 | `df.loc[[('India', 'Delhi')]]`                   |
| Safe access        | Use `.loc[]`, avoid chained indexing | `df.loc[0, 'col']`                               |

