# **6. Aggregation & Grouping**

## **3. Multi-level Grouping (Hierarchical Grouping)**

In [1]:
import pandas as pd

## 1. **What it does and when to use it**

### 📌 What is Multi-level (Hierarchical) Grouping?

Multi-level grouping refers to using **more than one key/column** in the `.groupby()` method. After aggregation, this produces a **hierarchical (MultiIndex) result**: a Series when a single value column is aggregated, or a DataFrame when several are. Each level of the index represents one grouping criterion.

It allows for more **granular summaries and analysis**, especially useful when your data is nested (e.g., **region → city → product**).

---

### 🧠 When to use it?

Use multi-level grouping when:

* You need to group data **across multiple dimensions** (e.g., year + month, region + product).
* You want **subgroup analysis** within broader groups.
* You want to produce **multi-index tables** for further transformation or pivoting.
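
As a minimal sketch of the points above, assuming a small made-up sales table (the `sales` frame, its column names, and its values are illustrative only), grouping by two keys yields a result with a two-level `MultiIndex`, which `.unstack()` can then pivot into a wide table:

```python
import pandas as pd

# Illustrative data only; not taken from a real dataset.
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 300, 250],
})

# Grouping by two keys gives a Series with a two-level MultiIndex.
by_region_product = sales.groupby(['Region', 'Product'])['Sales'].sum()
print(by_region_product.index.nlevels)  # number of index levels

# unstack() moves the inner level (Product) into columns, like a pivot.
wide = by_region_product.unstack()
print(wide)
```

Here `Region` stays as the row index while each `Product` becomes its own column.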


## 2. **Syntax and Core Parameters**

### 🔤 Basic Syntax:

```python
df.groupby(['col1', 'col2'])['value_col'].agg('sum')
```

Here, `col1` is the **outer group**, and `col2` is the **inner group**.

---

### 🧾 Important Parameters:

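| Parameter       | Purpose                                                                                     |
| --------------- | ------------------------------------------------------------------------------------------- |
| `as_index=True` | Makes the group keys the index of the result (default); `as_index=False` keeps them as columns |
| `sort=False`    | Skips sorting of group keys (default is `sort=True`); can speed up grouping on large data   |
| `dropna=False`  | Keeps rows whose key is NaN as their own group (default `dropna=True` drops them)           |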
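
The effect of each parameter is easier to see side by side. The sketch below assumes a small illustrative `orders` frame (hypothetical names and values) that contains one missing `Region`:

```python
import numpy as np
import pandas as pd

# Illustrative data with a missing Region value.
orders = pd.DataFrame({
    'Region': ['West', 'East', np.nan, 'East'],
    'City': ['LA', 'NY', 'NY', 'NY'],
    'Sales': [300, 100, 50, 150],
})

# as_index=False returns the keys as ordinary columns instead of a MultiIndex.
flat = orders.groupby(['Region', 'City'], as_index=False)['Sales'].sum()

# sort=False keeps groups in order of first appearance (West before East).
unsorted = orders.groupby(['Region', 'City'], sort=False)['Sales'].sum()

# dropna=False keeps the row whose Region is NaN as its own group.
with_nan = orders.groupby(['Region', 'City'], dropna=False)['Sales'].sum()

print(flat)
print(unsorted)
print(with_nan)
```

With the defaults, the row with the missing `Region` is silently excluded, so `dropna=False` is worth considering whenever keys may be incomplete.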

## 3. **Different Methods and Techniques**

| Technique                    | Description                              | Example                             |
| ---------------------------- | ---------------------------------------- | ----------------------------------- |
| Group by two or more columns | Create hierarchical groups               | `df.groupby(['Region', 'City'])`    |
| Aggregate per level          | Apply `.sum()`, `.mean()` etc. on groups | `grouped.sum()`                     |
| Flatten index                | Reset hierarchy into columns             | `.reset_index()`                    |
| Rearrange levels             | Change index hierarchy                   | `.swaplevel()`, `.reorder_levels()` |
| Slice data from multi-index  | Use `.loc[]` or `.xs()` on the result    | `grouped.sum().loc[('East', 'NY')]` |
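
Rearranging levels is the one technique in the table that is not demonstrated elsewhere in this section. A minimal sketch, assuming an illustrative `sales` frame (hypothetical names and values):

```python
import pandas as pd

# Illustrative data only.
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'City': ['NY', 'Boston', 'LA'],
    'Sales': [250, 200, 300],
})
grouped = sales.groupby(['Region', 'City'])['Sales'].sum()

# swaplevel() exchanges the two index levels: (Region, City) -> (City, Region).
swapped = grouped.swaplevel()

# reorder_levels() takes the full desired order by level name;
# sort_index() then re-sorts the rows by the new outer level.
reordered = grouped.reorder_levels(['City', 'Region']).sort_index()
print(reordered)
```

Neither method re-sorts the rows on its own, which is why the sketch calls `.sort_index()` afterwards.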

---

### 🧪 Grouping Approaches

#### a. **Group by multiple columns**

```python
df.groupby(['Region', 'City'])['Sales'].sum()
```

#### b. **Accessing inner group**

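Use `.loc[]` on the aggregated result (the snippets below assume `grouped` holds that result):

```python
# grouped is the aggregated Series, indexed by (Region, City)
grouped = df.groupby(['Region', 'City'])['Sales'].sum()
grouped.loc[('East', 'NY')]
```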

#### c. **Flattening index**

```python
grouped.reset_index()
```

#### d. **Slicing via `.xs()`**

```python
grouped.xs('East')
```
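
`.xs()` can also select on an inner level through its `level` argument. A minimal sketch, assuming an illustrative `sales` frame (hypothetical names and values):

```python
import pandas as pd

# Illustrative data only.
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'City': ['NY', 'Boston', 'LA'],
    'Sales': [250, 200, 300],
})
grouped = sales.groupby(['Region', 'City'])['Sales'].sum()

# Outer level: all cities in the East region (the Region level is dropped).
east = grouped.xs('East')

# Inner level: select by City using the level argument.
ny = grouped.xs('NY', level='City')
print(east)
print(ny)
```

In both cases the selected level is dropped from the result, leaving the remaining level as the index.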


## 4. **Common Pitfalls and Best Practices**

| ❌ Pitfall                      | ✅ Best Practice                                  |
| ------------------------------ | ------------------------------------------------ |
| Confused by hierarchical index | Use `.reset_index()` to flatten                  |
| Hard to access inner groups    | Use `.loc[]` or `.xs()` properly                 |
| Unintended sorting             | Use `sort=False` for large datasets              |
| Too many levels to manage      | Limit groupby to necessary keys                  |
| Index mismatch during merge    | Reset index before joining with other DataFrames |
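
The last row of the table can be sketched as follows, assuming a hypothetical `targets` lookup table keyed by ordinary columns (all names and values are illustrative):

```python
import pandas as pd

# Illustrative data only.
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'City': ['NY', 'Boston', 'LA'],
    'Sales': [250, 200, 300],
})
# Hypothetical lookup table keyed by ordinary columns.
targets = pd.DataFrame({
    'Region': ['East', 'East', 'West'],
    'City': ['NY', 'Boston', 'LA'],
    'Target': [200, 250, 300],
})

totals = sales.groupby(['Region', 'City'])['Sales'].sum()

# reset_index() turns the group keys back into columns so both frames
# can be joined on the same column names.
merged = totals.reset_index().merge(targets, on=['Region', 'City'])
merged['MetTarget'] = merged['Sales'] >= merged['Target']
print(merged)
```

Flattening first keeps the join keys explicit, rather than relying on how index levels line up between the two frames.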


## 5. **Examples on Real/Pseudo Data**

In [2]:
data = {
    'Region': ['East', 'East', 'East', 'West', 'West', 'North'],
    'City': ['NY', 'NY', 'Boston', 'LA', 'SF', 'Chicago'],
    'Product': ['A', 'B', 'A', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 300, 250, 400],
    'Profit': [10, 20, 30, 50, 45, 60]
}

df = pd.DataFrame(data)
df

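|   | Region | City    | Product | Sales | Profit |
| - | ------ | ------- | ------- | ----- | ------ |
| 0 | East   | NY      | A       | 100   | 10     |
| 1 | East   | NY      | B       | 150   | 20     |
| 2 | East   | Boston  | A       | 200   | 30     |
| 3 | West   | LA      | A       | 300   | 50     |
| 4 | West   | SF      | B       | 250   | 45     |
| 5 | North  | Chicago | A       | 400   | 60     |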


In [3]:
### ✅ a. Total sales by Region and City
df.groupby(['Region', 'City'])['Sales'].sum()

Region  City   
East    Boston     200
        NY         250
North   Chicago    400
West    LA         300
        SF         250
Name: Sales, dtype: int64

In [6]:
### ✅ b. Multiple aggregations (Sales & Profit)
aggregation = df.groupby(['Region', 'City']).agg({'Sales': 'sum', 'Profit': 'mean'})
aggregation

| Region | City    | Sales | Profit |
| ------ | ------- | ----- | ------ |
| East   | Boston  | 200   | 30.0   |
| East   | NY      | 250   | 15.0   |
| North  | Chicago | 400   | 60.0   |
| West   | LA      | 300   | 50.0   |
| West   | SF      | 250   | 45.0   |


In [7]:
### ✅ c. Flatten multi-index
aggregation.reset_index()

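|   | Region | City    | Sales | Profit |
| - | ------ | ------- | ----- | ------ |
| 0 | East   | Boston  | 200   | 30.0   |
| 1 | East   | NY      | 250   | 15.0   |
| 2 | North  | Chicago | 400   | 60.0   |
| 3 | West   | LA      | 300   | 50.0   |
| 4 | West   | SF      | 250   | 45.0   |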


In [8]:
### ✅ d. Access a group using `.loc[]`
aggregation.loc[('East', 'Boston')]

Sales     200.0
Profit     30.0
Name: (East, Boston), dtype: float64

In [10]:
### ✅ e. Slice using `.xs()`
aggregation.xs('East')

| City   | Sales | Profit |
| ------ | ----- | ------ |
| Boston | 200   | 30.0   |
| NY     | 250   | 15.0   |

## 6. **Real World Use Cases**

### 🏪 Retail Analytics

**Task**: Summarize sales by region and store.

```python
df.groupby(['Region', 'Store'])['Sales'].agg(['sum', 'mean'])
```
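
A runnable version of this pattern, assuming a hypothetical `retail` frame with a `Store` column (names and values are illustrative):

```python
import pandas as pd

# Hypothetical store-level transactions; names and values are illustrative.
retail = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West'],
    'Store': ['S1', 'S1', 'S2', 'S3'],
    'Sales': [100, 300, 250, 400],
})

# A list of functions yields one output column per function.
summary = retail.groupby(['Region', 'Store'])['Sales'].agg(['sum', 'mean'])
print(summary)
```

The result is a DataFrame indexed by `(Region, Store)` with `sum` and `mean` columns.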

---

### 💼 HR Analytics

**Task**: Average salary by department and job title.

```python
df.groupby(['Department', 'JobTitle'])['Salary'].mean()
```

---

### 🎓 Education

**Task**: Pass count per school and subject.

```python
df[df['Result'] == 'Pass'].groupby(['School', 'Subject'])['Student'].count()
```
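
A runnable sketch of the same filter-then-group pattern, assuming a hypothetical `results` frame (names and values are illustrative):

```python
import pandas as pd

# Hypothetical exam results; names and values are illustrative.
results = pd.DataFrame({
    'School': ['North High', 'North High', 'North High', 'South High'],
    'Subject': ['Math', 'Math', 'Physics', 'Math'],
    'Student': ['Ana', 'Ben', 'Ana', 'Cy'],
    'Result': ['Pass', 'Fail', 'Pass', 'Pass'],
})

# Filter to passing rows first, then count students per (School, Subject).
passed = results[results['Result'] == 'Pass']
pass_counts = passed.groupby(['School', 'Subject'])['Student'].count()
print(pass_counts)
```

Because the filter runs before the grouping, a school and subject pair with no passes does not appear in the result at all, rather than appearing with a count of zero.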

---

### 🏥 Healthcare

**Task**: Average stay by hospital and disease type.

```python
df.groupby(['Hospital', 'Diagnosis'])['StayDuration'].mean()
```

---

### 🛒 E-commerce

**Task**: Average order value by country and category.

```python
df.groupby(['Country', 'Category'])['OrderValue'].mean()
```

## ✅ Summary

| 🧠 Concept           | ✅ Summary                                        |
| -------------------- | ------------------------------------------------ |
| Multi-level Grouping | Use multiple columns in `groupby()`              |
| Output               | MultiIndex (hierarchical) Series or DataFrame    |
| Accessing groups     | Use `.loc[]`, `.xs()`                            |
| Flattening           | Use `.reset_index()`                             |
| Best for             | Nested data summaries across multiple dimensions |

