# **6. Aggregation & Grouping**

## **6. Filtering & Custom Operations on Groups**

In [16]:
import pandas as pd

### 1. 🧠 **What it does and when to use it**

Filtering and applying custom operations on groups allow you to:

* **Select specific groups** based on conditions.
* **Apply complex logic** within groups that can't be handled by standard aggregation (`mean`, `sum`, etc.).
* Perform **group-wise transformations** with tailored computations.

🟢 **When to use**:

* When you want to include only those groups meeting specific criteria (e.g., total sales > 1000).
* When built-in functions (`sum()`, `mean()`) are not sufficient and you need custom logic.
* To apply **row-level logic** within each group.


### 2. 🧾 **Syntax and core parameters**

#### A. `filter()`

Keeps or drops entire groups based on a condition evaluated once per group.

```python
df.groupby('column').filter(function)
```

**Parameters:**

* `function`: A function that receives each group's data and returns a single **boolean** (True = keep every row of the group, False = discard the whole group)
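
As a hedged illustration of that boolean contract, the sketch below uses a small invented frame (`df_demo`) and a hypothetical helper `has_high_total`; neither comes from the original notebook. It also shows the `dropna=False` option of `filter()`, which keeps the original shape and fills rejected groups with `NaN` instead of dropping them.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df_demo = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Salary": [55000, 60000, 40000, 42000],
})

def has_high_total(group: pd.DataFrame) -> bool:
    """Return one boolean for the whole group (True keeps every row in it)."""
    return bool(group["Salary"].sum() > 100_000)

# Default behaviour: rows of rejected groups are dropped
kept = df_demo.groupby("Department").filter(has_high_total)

# dropna=False: same number of rows as the input, rejected groups become NaN
padded = df_demo.groupby("Department").filter(has_high_total, dropna=False)

print(kept)
print(padded)
```

Surviving rows keep their original index labels, so the result can be aligned back to the source frame if needed.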

---

#### B. `apply()`

Allows application of **any custom function** to each group.

```python
df.groupby('column').apply(custom_function)
```

**Parameters:**

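* `custom_function`: A function that receives each group as a `DataFrame` (or a `Series` when a single column is selected) and returns a scalar, `Series`, or `DataFrame`. pandas combines the per-group results, so the output shape depends on what the function returns.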
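
A minimal sketch of how the return type drives the result shape, using an invented frame (`df_demo`) that is not part of the original notebook. The first call returns one scalar per group; the second returns a one-row `Series` per group, so the combined result carries the group key plus the original row label.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df_demo = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Salary": [55000, 60000, 40000, 42000],
})
salaries = df_demo.groupby("Department")["Salary"]

# Scalar per group -> one value per group, indexed by the group key
spread = salaries.apply(lambda s: s.max() - s.min())

# Series per group -> results are stacked under a two-level index
# (group key, original row label)
top = salaries.apply(lambda s: s.nlargest(1))

print(spread)
print(top)
```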


### 3. 🧪 **Different Methods and Techniques**

#### ✅ Group Filtering using `filter()`

```python
# Keep only departments with total salary > 100,000
df.groupby('Department').filter(lambda x: x['Salary'].sum() > 100000)
```
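
For a simple aggregate condition like this one, the same rows can usually be selected with a boolean mask built from `transform()`, which tends to be faster on large frames because it avoids calling a Python function once per group. The sketch below checks the equivalence on invented data (`df_demo` is hypothetical, not the notebook's frame).

```python
import pandas as pd

# Hypothetical sample data with interleaved departments
df_demo = pd.DataFrame({
    "Department": ["Sales", "HR", "Sales", "HR", "IT"],
    "Salary": [55000, 40000, 60000, 42000, 70000],
})

# Approach 1: filter() with a per-group boolean
by_filter = df_demo.groupby("Department").filter(
    lambda g: g["Salary"].sum() > 100_000
)

# Approach 2: broadcast the group total to every row, then mask rows
dept_total = df_demo.groupby("Department")["Salary"].transform("sum")
by_mask = df_demo[dept_total > 100_000]

print(by_filter.equals(by_mask))
```

Both approaches preserve the original row order and index labels.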

---

#### ✅ Group-wise Custom Logic using `apply()`

```python
# Add a column for salary normalized by department mean.
# group_keys=False keeps the original index so the result aligns with df.
df['NormalizedSalary'] = df.groupby('Department', group_keys=False)['Salary'].apply(lambda x: x / x.mean())
```

---

#### ✅ Group-wise Ranking

```python
# Rank employees within departments
df['DeptRank'] = df.groupby('Department')['Salary'].rank(ascending=False)
```
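
When salaries tie, the result of `rank()` depends on its `method` argument (the default is `"average"`). A small sketch on invented data (`df_demo` is hypothetical) compares three common choices:

```python
import pandas as pd

# Hypothetical sample data containing a tie inside the IT department
df_demo = pd.DataFrame({
    "Department": ["IT", "IT", "IT", "HR"],
    "Salary": [72000, 72000, 70000, 40000],
})
grouped = df_demo.groupby("Department")["Salary"]

# Tied rows share the mean of the ranks they span
df_demo["RankAverage"] = grouped.rank(ascending=False)
# Tied rows share a rank and the next rank is not skipped
df_demo["RankDense"] = grouped.rank(ascending=False, method="dense")
# Tied rows share the lowest rank and the following ranks are skipped
df_demo["RankMin"] = grouped.rank(ascending=False, method="min")

print(df_demo)
```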

---

#### ✅ Using `transform()` for broadcasting custom values

```python
# Subtract department mean from each salary (broadcasts to original shape)
df['MeanDiff'] = df['Salary'] - df.groupby('Department')['Salary'].transform('mean')
```
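
`transform()` accepts either the name of a built-in aggregation or a custom function, as long as the result can be broadcast back to the group's rows. A minimal sketch on invented data (`df_demo` and the column names `DeptMean` and `ShareOfDept` are hypothetical):

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df_demo = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Salary": [55000, 60000, 40000, 42000],
})
grouped = df_demo.groupby("Department")["Salary"]

# Built-in aggregation by name: the group mean is repeated on every row
df_demo["DeptMean"] = grouped.transform("mean")

# Custom function: each salary as a share of its department total
df_demo["ShareOfDept"] = grouped.transform(lambda s: s / s.sum())

print(df_demo)
```

Because the output has the same index as the input, it can be assigned directly as a new column.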


### 4. ⚠️ **Common Pitfalls and Best Practices**

| Pitfall                                                           | Recommendation                                                          |
| ----------------------------------------------------------------- | ----------------------------------------------------------------------- |
| ❌ `apply()` returns a combined DataFrame, which may change shape  | ✅ Use `transform()` when you want the same shape as original            |
| ❌ Expecting `filter()` to drop individual rows within a group     | ✅ Use `filter()` only to retain/discard **entire groups**; use a boolean mask for rows |
| ❌ Complex functions in `apply()` are slow on large data           | ✅ Try vectorized operations or `transform()` where possible             |
| ❌ `apply()` results gain the group keys as an extra index level   | ✅ Pass `group_keys=False` to `groupby()` to keep the original index     |
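
The index behaviour described in the table can be checked directly. This sketch assumes pandas 1.5 or newer, where `group_keys` is respected for same-shape results; `df_demo` and `normalize` are hypothetical names introduced only for the example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df_demo = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR"],
    "Salary": [55000, 60000, 40000, 42000],
})

def normalize(s: pd.Series) -> pd.Series:
    """Divide each value by its group mean (output has the group's shape)."""
    return s / s.mean()

# group_keys=True: the department is prepended as an extra index level
with_keys = df_demo.groupby("Department", group_keys=True)["Salary"].apply(normalize)

# group_keys=False: the original row index is preserved
without_keys = df_demo.groupby("Department", group_keys=False)["Salary"].apply(normalize)

# transform() always returns the original index
via_transform = df_demo.groupby("Department")["Salary"].transform(normalize)

print(with_keys.index.nlevels, without_keys.index.nlevels)
```

Only the last two results can be assigned straight back to a column of `df_demo` without realigning the index.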


### 5. 🧩 **Examples on Real/Pseudo Data**

In [17]:
data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [55000, 60000, 40000, 42000, 70000, 72000, 71000]
}

df = pd.DataFrame(data)

df

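|   | Department | Employee | Salary |
| - | ---------- | -------- | ------ |
| 0 | Sales      | Alice    | 55000  |
| 1 | Sales      | Bob      | 60000  |
| 2 | HR         | Carol    | 40000  |
| 3 | HR         | David    | 42000  |
| 4 | IT         | Eve      | 70000  |
| 5 | IT         | Frank    | 72000  |
| 6 | IT         | Grace    | 71000  |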


In [18]:
#### Example 1: Filtering groups where average salary > 50,000
df.groupby('Department').filter(lambda x: x['Salary'].mean() > 50000)

|   | Department | Employee | Salary |
| - | ---------- | -------- | ------ |
| 0 | Sales      | Alice    | 55000  |
| 1 | Sales      | Bob      | 60000  |
| 4 | IT         | Eve      | 70000  |
| 5 | IT         | Frank    | 72000  |
| 6 | IT         | Grace    | 71000  |


In [19]:
#### Example 2: Subtract the department max from each employee's salary (0 = top earner)
df['Salary_diff_from_max'] = df['Salary'] - df.groupby('Department')['Salary'].transform('max')
df

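|   | Department | Employee | Salary | Salary_diff_from_max |
| - | ---------- | -------- | ------ | -------------------- |
| 0 | Sales      | Alice    | 55000  | -5000                |
| 1 | Sales      | Bob      | 60000  | 0                    |
| 2 | HR         | Carol    | 40000  | -2000                |
| 3 | HR         | David    | 42000  | 0                    |
| 4 | IT         | Eve      | 70000  | -2000                |
| 5 | IT         | Frank    | 72000  | 0                    |
| 6 | IT         | Grace    | 71000  | -1000                |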


In [20]:
#### Example 3: Flag employee with highest salary in each department
df['is_top_earner'] = df.groupby('Department')['Salary'].transform('max') == df['Salary']
df

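|   | Department | Employee | Salary | Salary_diff_from_max | is_top_earner |
| - | ---------- | -------- | ------ | -------------------- | ------------- |
| 0 | Sales      | Alice    | 55000  | -5000                | False         |
| 1 | Sales      | Bob      | 60000  | 0                    | True          |
| 2 | HR         | Carol    | 40000  | -2000                | False         |
| 3 | HR         | David    | 42000  | 0                    | True          |
| 4 | IT         | Eve      | 70000  | -2000                | False         |
| 5 | IT         | Frank    | 72000  | 0                    | True          |
| 6 | IT         | Grace    | 71000  | -1000                | False         |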


In [21]:
#### Example 4: Apply custom logic to calculate salary grade

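def salary_grade(group):
    group = group.copy()  # work on a copy rather than mutating the group pandas passes in
    mean_salary = group['Salary'].mean()  # compute the group mean once
    group['Grade'] = ["A" if x > mean_salary else "B" for x in group['Salary']]
    return group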

df1 = df.groupby('Department').apply(salary_grade)
df1


| Department (index level) | Original index | Department | Employee | Salary | Salary_diff_from_max | is_top_earner | Grade |
| ------------------------ | -------------- | ---------- | -------- | ------ | -------------------- | ------------- | ----- |
| HR                       | 2              | HR         | Carol    | 40000  | -2000                | False         | B     |
| HR                       | 3              | HR         | David    | 42000  | 0                    | True          | A     |
| IT                       | 4              | IT         | Eve      | 70000  | -2000                | False         | B     |
| IT                       | 5              | IT         | Frank    | 72000  | 0                    | True          | A     |
| IT                       | 6              | IT         | Grace    | 71000  | -1000                | False         | B     |
| Sales                    | 0              | Sales      | Alice    | 55000  | -5000                | False         | B     |
| Sales                    | 1              | Sales      | Bob      | 60000  | 0                    | True          | A     |


In [22]:
#### Example 4 (variant): Same salary-grade logic with include_groups=False (pandas 2.2+), so the grouping column is not passed to the function

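def salary_grade(group):
    group = group.copy()  # work on a copy rather than mutating the group pandas passes in
    mean_salary = group['Salary'].mean()  # compute the group mean once
    group['Grade'] = ["A" if x > mean_salary else "B" for x in group['Salary']]
    return group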

df2 = df.groupby('Department').apply(salary_grade, include_groups=False)
df2

| Department (index level) | Original index | Employee | Salary | Salary_diff_from_max | is_top_earner | Grade |
| ------------------------ | -------------- | -------- | ------ | -------------------- | ------------- | ----- |
| HR                       | 2              | Carol    | 40000  | -2000                | False         | B     |
| HR                       | 3              | David    | 42000  | 0                    | True          | A     |
| IT                       | 4              | Eve      | 70000  | -2000                | False         | B     |
| IT                       | 5              | Frank    | 72000  | 0                    | True          | A     |
| IT                       | 6              | Grace    | 71000  | -1000                | False         | B     |
| Sales                    | 0              | Alice    | 55000  | -5000                | False         | B     |
| Sales                    | 1              | Bob      | 60000  | 0                    | True          | A     |


### 6. 🌍 **Real World Use Cases**

| Domain              | Use Case                                                                   |
| ------------------- | -------------------------------------------------------------------------- |
| **HR Analytics**    | Filter departments with average salary below threshold to flag for review. |
| **Sales Analytics** | Compare individual sales to team average or total.                         |
| **Healthcare**      | Identify hospitals with above-average recovery rates.                      |
| **Finance**         | Apply custom scoring logic per portfolio/branch.                           |
| **Education**       | Compute student ranking within departments or classes.                     |


### ✅ Summary

| Technique     | Use When                                  | Output Shape        |
| ------------- | ----------------------------------------- | ------------------- |
| `filter()`    | You want to keep/remove entire groups     | Filtered rows only  |
| `apply()`     | Custom logic for entire group             | Depends on function |
| `transform()` | Group-wise transformation with same shape | Same as original    |
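
The output shapes in the table can be verified on the example data from section 5. This is a quick sanity-check sketch rather than a benchmark; the variable names `filtered`, `per_group`, and `broadcast` are introduced here for illustration.

```python
import pandas as pd

# Same data as the examples in section 5
df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [55000, 60000, 40000, 42000, 70000, 72000, 71000],
})
grouped = df.groupby('Department')

# filter(): only the rows of groups that pass the test
filtered = grouped.filter(lambda g: g['Salary'].mean() > 50000)

# apply() with a scalar-returning function: one value per group
per_group = grouped['Salary'].apply(lambda s: s.max() - s.min())

# transform(): one value per original row
broadcast = grouped['Salary'].transform('mean')

print(len(filtered), len(per_group), len(broadcast))
```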


<center><b>Thanks</b></center>