# **6. Aggregation & Grouping**

## **7. Common Pitfalls & Best Practices on Groups**

In [1]:
import pandas as pd

## 🧠 1. What it does and when to use it

This section is not about a single function but about **understanding frequent mistakes (pitfalls)** developers make when using grouping techniques like `groupby()`, `agg()`, `apply()`, `transform()`, and how to **avoid them using best practices**.

**When to focus on these**:

* When you encounter errors or unexpected results with group-wise operations.
* When performance or readability suffers due to improper function use.
* When your grouped results don’t align correctly with the original DataFrame.


## 🧾 2. Syntax and Core Parameters Recap

Grouping typically involves:

```python
df.groupby(by, as_index=True).agg(func)
```

### Common grouping methods:

* `agg()` – Aggregation (sum, mean, count, etc.)
* `transform()` – Broadcast results back to original shape
* `apply()` – Apply custom function per group
* `filter()` – Keep/remove groups based on condition
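
As a rough illustration of how these methods differ in output shape, here is a minimal sketch on a small made-up frame (the data and variable names are illustrative, not taken from a real dataset):

```python
import pandas as pd

# Illustrative data: two departments of different sizes
df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})
grouped = df.groupby('Dept')['Salary']

per_group = grouped.agg('mean')        # one value per group, indexed by Dept
per_row = grouped.transform('mean')    # one value per original row, same index as df

# filter() keeps or drops whole groups; here only groups with 3+ rows survive
big_depts = df.groupby('Dept').filter(lambda g: len(g) >= 3)

print(per_group.shape, per_row.shape, big_depts.shape)
```

`agg()` shrinks the data to one row per group, `transform()` preserves the original length, and `filter()` returns a subset of the original rows.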

**Parameters:**

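* `by`: column name(s), a mapping, or a function used to determine the groups
* `as_index`: Whether the group key(s) become the index of the aggregated result (default `True`)
* `group_keys`: Whether `apply()` adds the group labels to the index of its result (default `True`)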
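
The effect of `as_index` and `group_keys` can be sketched as follows. This is a small example on made-up data; the exact index produced by `apply()` with the default `group_keys=True` depends on the pandas version, so only the `group_keys=False` case is shown:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})

# Default as_index=True: Dept is the index of the result
as_index_default = df.groupby('Dept')['Salary'].sum()

# as_index=False: Dept stays a regular column in a DataFrame
as_columns = df.groupby('Dept', as_index=False)['Salary'].sum()

# group_keys=False: apply() does not prepend the group labels to the result index
ratios = df.groupby('Dept', group_keys=False)['Salary'].apply(lambda s: s / s.mean())

print(as_index_default.index.name)   # name of the index on the aggregated Series
print(list(as_columns.columns))      # column labels when keys stay as columns
```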


## 🧪 3. Different Methods and Techniques (used incorrectly or effectively)

| Method                                    | Common Issue            | Fix / Best Practice                                      |
| ----------------------------------------- | ----------------------- | -------------------------------------------------------- |
| `groupby().apply()`                       | Slow and memory-heavy   | Use `agg()` or `transform()` if possible                 |
| `groupby().agg()` with multiple functions | Wrong column structure  | Use named aggregations for clarity                       |
| `transform()` vs `apply()`                | Shape mismatch errors   | Use `transform()` if you want output same shape as input |
| `groupby().mean()`                        | Group keys become index | Use `as_index=False` to keep them as columns             |


## ⚠️ 4. Common Pitfalls and Best Practices

### ❌ Pitfall 1: Misunderstanding `apply()` vs `transform()`

```python
df['NormSalary'] = df.groupby('Dept')['Salary'].apply(lambda x: x / x.mean())
```

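In pandas 2.0 and later, `apply()` prepends the group key to the result index by default (`group_keys=True`), so the result is indexed by `(Dept, original index)` and the assignment typically fails with an index-incompatibility error. Older versions returned the original index, so the same line behaves differently across versions.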

✅ **Best Practice**:

```python
df['NormSalary'] = df['Salary'] / df.groupby('Dept')['Salary'].transform('mean')
```
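
To see why `transform()` is the safer choice here, the sketch below uses a deliberately unsorted frame with a non-default index (both are illustrative assumptions) and checks that the result lines up with the original rows:

```python
import pandas as pd

# Rows are interleaved across departments and the index does not start at 0
df = pd.DataFrame(
    {
        'Dept': ['IT', 'HR', 'IT', 'HR', 'IT'],
        'Salary': [70000, 50000, 72000, 52000, 71000],
    },
    index=[10, 11, 12, 13, 14],
)

# transform() returns one value per row, carrying the same index as df
group_mean = df.groupby('Dept')['Salary'].transform('mean')
df['NormSalary'] = df['Salary'] / group_mean

print(group_mean.index.equals(df.index))
print(df)
```

Because the result shares the frame's index, plain column arithmetic and assignment work without any reindexing.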

---

### ❌ Pitfall 2: Group keys become index unexpectedly

```python
df.groupby('Dept').agg('sum')  # Dept becomes index
```

✅ **Best Practice**:

```python
df.groupby('Dept', as_index=False).agg('sum')  # Dept stays a regular column
```

---

### ❌ Pitfall 3: Using `apply()` unnecessarily

```python
# Inefficient
df.groupby('Dept').apply(lambda x: x['Salary'].mean())
```

✅ **Best Practice**:

```python
df.groupby('Dept')['Salary'].mean()  # More readable and efficient
```
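
A quick way to convince yourself that the two forms agree is to compare them directly. In this sketch the `[['Salary']]` selection before `apply()` is an added choice, so that only the salary column is passed to the lambda:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})

# Generic path: a Python lambda is called once per group
via_apply = df.groupby('Dept')[['Salary']].apply(lambda g: g['Salary'].mean())

# Built-in path: the named aggregation is handled by pandas itself
via_builtin = df.groupby('Dept')['Salary'].mean()

print(via_apply.tolist() == via_builtin.tolist())
```

The values match; the built-in form is shorter and generally faster on large frames because it avoids one Python-level call per group.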

---

### ❌ Pitfall 4: Modifying groups inside `apply()` without returning properly

```python
def faulty_func(g):
    g['Adj'] = g['Salary'] - g['Salary'].mean()
    # Bug: no return statement, so the function yields None for every group
```

✅ **Best Practice**:

```python
def correct_func(g):
    g['Adj'] = g['Salary'] - g['Salary'].mean()
    return g
```
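
When the goal is just a per-row adjustment, the missing-return problem can often be avoided altogether by not using `apply()`. The following is a sketch of that alternative, not the only valid approach; `faulty_func` repeats the pitfall above for comparison:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})

def faulty_func(g):
    g['Adj'] = g['Salary'] - g['Salary'].mean()
    # no return statement: the caller receives None

# Calling it directly on a copy shows what apply() would get back per group
result = faulty_func(df.copy())
print(result is None)

# Alternative: compute the group mean per row and subtract it, no custom function needed
df['Adj'] = df['Salary'] - df.groupby('Dept')['Salary'].transform('mean')
print(df)
```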

---

### ❌ Pitfall 5: Using `agg()` with multiple functions and getting messy column names

```python
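# Columns come out as generic 'mean' and 'sum'; on a DataFrame selection
# they become a two-level MultiIndex such as ('Salary', 'mean')
df.groupby('Dept')['Salary'].agg(['mean', 'sum'])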
```

✅ **Best Practice** – Use named aggregations:

```python
df.groupby('Dept').agg(
    MeanSalary=('Salary', 'mean'),
    TotalSalary=('Salary', 'sum')
)
```
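
The difference in column structure can be inspected directly. This sketch compares a list of functions applied to a DataFrame selection with named aggregation, using made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})

# List of functions on a DataFrame selection: two-level column MultiIndex
multi = df.groupby('Dept')[['Salary']].agg(['mean', 'sum'])
print(list(multi.columns))

# Named aggregation: flat, self-describing column names
named = df.groupby('Dept').agg(
    MeanSalary=('Salary', 'mean'),
    TotalSalary=('Salary', 'sum'),
)
print(list(named.columns))
```

Flat names are easier to reference later (`named['TotalSalary']`) than tuple keys such as `multi[('Salary', 'sum')]`.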


## 🧩 5. Examples on Real/Pseudo Data

In [2]:
df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 52000, 70000, 72000, 71000]
})

df

|   | Dept | Name    | Salary |
| - | ---- | ------- | ------ |
| 0 | HR   | Alice   | 50000  |
| 1 | HR   | Bob     | 52000  |
| 2 | IT   | Charlie | 70000  |
| 3 | IT   | David   | 72000  |
| 4 | IT   | Eve     | 71000  |


### ❌ Bad Example: Using apply instead of transform

```python
df['Norm'] = df.groupby('Dept')['Salary'].apply(lambda x: x / x.mean())  # Might misalign
```

### ✅ Good Example:


In [5]:
df['Norm'] = df['Salary'] / df.groupby('Dept')['Salary'].transform('mean')
df

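|   | Dept | Name    | Salary | Norm2    | Norm     |
| - | ---- | ------- | ------ | -------- | -------- |
| 0 | HR   | Alice   | 50000  | 0.980392 | 0.980392 |
| 1 | HR   | Bob     | 52000  | 1.019608 | 1.019608 |
| 2 | IT   | Charlie | 70000  | 0.985915 | 0.985915 |
| 3 | IT   | David   | 72000  | 1.014085 | 1.014085 |
| 4 | IT   | Eve     | 71000  | 1.0      | 1.0      |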


### ❌ Bad Example: Group keys become index

```python
df.groupby('Dept').agg('sum')
```

### ✅ Good Example:

In [6]:
df.groupby('Dept', as_index=False).agg('sum')

|   | Dept | Name            | Salary | Norm2 | Norm |
| - | ---- | --------------- | ------ | ----- | ---- |
| 0 | HR   | AliceBob        | 102000 | 2.0   | 2.0  |
| 1 | IT   | CharlieDavidEve | 213000 | 3.0   | 3.0  |
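
Note that `'sum'` also concatenated the strings in `Name` (`AliceBob`), which is rarely what you want. One possible way to avoid this, sketched here on the same sample data, is to restrict the aggregation to numeric columns or to select the columns explicitly:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Salary': [50000, 52000, 70000, 72000, 71000],
})

# Option 1: skip non-numeric columns such as Name
totals = df.groupby('Dept', as_index=False).sum(numeric_only=True)

# Option 2: name the column(s) to aggregate explicitly
totals_explicit = df.groupby('Dept', as_index=False)['Salary'].sum()

print(totals)
```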


### ✅ Named Aggregation for better readability

In [7]:
df.groupby('Dept').agg(
    MaxSalary=('Salary', 'max'),
    Count=('Name', 'count')
)

| Dept | MaxSalary | Count |
| ---- | --------- | ----- |
| HR   | 52000     | 2     |
| IT   | 72000     | 3     |


## 🌍 6. Real World Use Cases

| Domain       | Scenario                                | Pitfall                     | Fix                            |
| ------------ | --------------------------------------- | --------------------------- | ------------------------------ |
| HR Analytics | Normalize salary within each department | Used `apply()` wrongly      | Use `transform()`              |
| Retail       | Total sales per category                | Group key becomes index     | Use `as_index=False`           |
| Banking      | Rank clients within region              | Used `apply()` + sort       | Use `rank()` with `groupby`    |
| Healthcare   | Compare patient count per hospital      | Misaligned group values     | Use `groupby(...).transform()` |
| Education    | Assign grade rank in class              | Wrong aggregation structure | Use named aggregations         |
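
Two of the fixes in the table, ranking within a group and broadcasting a per-group count, can be sketched as follows. The `clients` frame, its column names, and the choice of `method='dense'` are hypothetical and only serve the illustration:

```python
import pandas as pd

# Hypothetical banking data
clients = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South', 'South'],
    'Client': ['A', 'B', 'C', 'D', 'E'],
    'Balance': [1200, 3400, 560, 7800, 2100],
})

# Rank within each region, highest balance first; no apply() or sorting needed
clients['RankInRegion'] = (
    clients.groupby('Region')['Balance']
    .rank(ascending=False, method='dense')
    .astype(int)
)

# Broadcast the per-region client count back to every row
clients['ClientsInRegion'] = clients.groupby('Region')['Client'].transform('count')

print(clients)
```

Both results keep the original row order and index, so they can be assigned as new columns directly.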


## ✅ Summary Table

| Issue                | Pitfall                        | Best Practice                         |
| -------------------- | ------------------------------ | ------------------------------------- |
| Shape mismatch       | `apply()` returns wrong shape  | Use `transform()`                     |
| Index problems       | Group keys in index            | Use `as_index=False`                  |
| Inefficiency         | `apply()` on simple operations | Prefer `agg()`, `transform()`         |
| Unclear column names | Multiple unnamed aggregations  | Use named aggregation in `agg()`      |
| Data not returned    | Modifying inside `apply()`     | Always return the modified group      |

