# **6. Aggregation & Grouping**

## **4. Transformation vs Aggregation**

In [1]:
import pandas as pd

## 1. **What it does and when to use it**

### 📌 What is Aggregation?

**Aggregation** reduces the size of the data by **summarizing** it. It produces **one output value per group**.

Example:
Group employees by department and compute **average salary per department**.

```python
df.groupby('Department')['Salary'].mean()
```

The result: one row per department.

---

### 📌 What is Transformation?

**Transformation** returns a result with the **same length and index as the original data**, with each **group-level computation broadcast** back to the rows of its group.
It’s useful when you want to **retain the original data but add group-level context**.

Example:
Add a column showing the **average salary of the employee’s department** next to each employee.

```python
df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean')
```

---

### 🔍 Key Difference:

| Feature       | Aggregation                              | Transformation                                               |
| ------------- | ---------------------------------------- | ------------------------------------------------------------ |
| Output shape  | Smaller (one row per group)              | Same length and index as input                               |
| Purpose       | Summarize                                | Contextualize                                                |
| Function type | Reducing functions (one value per group) | Reducing functions (result is broadcast) or element-wise logic |
| Use case      | Reporting/KPIs                           | Feature engineering, context                                 |
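
The shape difference in the table can be checked directly. The following is a minimal sketch on made-up data; the names `demo`, `agg_result` and `transform_result` are illustrative and do not appear elsewhere in this section.

```python
import pandas as pd

# Hypothetical data, used only to illustrate the shape difference
demo = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'IT'],
    'Salary': [50000, 52000, 60000, 62000, 64000],
})

grouped = demo.groupby('Department')['Salary']

agg_result = grouped.agg('mean')              # one value per department
transform_result = grouped.transform('mean')  # one value per original row

# Compare lengths: input rows, aggregated rows, transformed rows
print(len(demo), len(agg_result), len(transform_result))

# The transformed result keeps the original row index
print(transform_result.index.equals(demo.index))
```

Because the transformed result shares the input's index, it can be assigned straight back as a new column, while the aggregated result is indexed by the group keys.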


## 2. **Syntax and Core Parameters**

### 🧾 Aggregation Syntax:

```python
df.groupby('key')['col'].agg('func')
```

or:

```python
df.groupby('key').agg({'col1': 'sum', 'col2': 'mean'})
```
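
As a rough illustration of the dict form, the sketch below runs it on a small made-up frame (`sales` is a hypothetical name). It also shows pandas' named-aggregation form, which lets you choose the output column names; the names `col1_total` and `col2_avg` are arbitrary.

```python
import pandas as pd

# Hypothetical data matching the placeholder column names used above
sales = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b'],
    'col1': [1, 2, 3, 4],
    'col2': [10.0, 20.0, 30.0, 50.0],
})

# Dict form: one function per column, output keeps the input column names
by_dict = sales.groupby('key').agg({'col1': 'sum', 'col2': 'mean'})

# Named aggregation: output_name=(input_column, function)
by_name = sales.groupby('key').agg(
    col1_total=('col1', 'sum'),
    col2_avg=('col2', 'mean'),
)

print(by_dict)
print(by_name)
```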

---

### 🧾 Transformation Syntax:

```python
df.groupby('key')['col'].transform('func')
```

or:

```python
df['new_col'] = df.groupby('key')['col'].transform(lambda x: x / x.sum())
```
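
A lambda passed to `.transform()` can use any mix of per-row values and group statistics. As another sketch of the same pattern, the block below standardizes each value within its group (a group-wise z-score). The frame `measurements` and the column `col_z` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups with different scales
measurements = pd.DataFrame({
    'key': ['a', 'a', 'a', 'b', 'b', 'b'],
    'col': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Subtract the group mean and divide by the group standard deviation.
# x is the Series of values for one group, so the result has one value per row.
measurements['col_z'] = (
    measurements.groupby('key')['col']
    .transform(lambda x: (x - x.mean()) / x.std())
)

print(measurements)
```

After the transformation both groups are on the same scale, even though their raw values differ by a factor of ten.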


## 3. **Different Methods and Techniques**

| Technique                      | Description                                         | Example                                                         |
| ------------------------------ | --------------------------------------------------- | --------------------------------------------------------------- |
| `.agg()`                       | Reduces group to a summary value                    | `df.groupby('Team')['Score'].agg('mean')`                       |
| `.transform()`                 | Retains original shape                              | `df['TeamAvg'] = df.groupby('Team')['Score'].transform('mean')` |
| Custom logic in `.transform()` | Complex group-wise metrics                          | `transform(lambda x: x / x.max())`                              |
| Combine both                   | Add transformed values next to aggregated summaries | `agg()` for report + `transform()` for features                 |

---

### Supported Functions:

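Both `.agg()` and `.transform()` accept:

* Strings: `'sum'`, `'mean'`, `'std'`, etc.
* NumPy or custom functions that reduce a group to a single value, e.g. `lambda x: x.max() - x.min()`

`.transform()` additionally accepts element-wise functions such as `lambda x: x / x.mean()`, which return one value per row. `.agg()` expects a reducing function and typically raises an error for these.

⚠️ A function passed to `.transform()` **must return either a result of the same length as the input group or a single value**, which pandas then broadcasts to every row of the group.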
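
To make the distinction concrete, here is a small sketch with made-up data. The helper `spread` and the frame `scores` are hypothetical names; the point is that a reducing function works with both methods, while an element-wise function belongs in `.transform()`.

```python
import pandas as pd

# Hypothetical data
scores = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Score': [10, 30, 5, 15],
})
grouped = scores.groupby('Team')['Score']


def spread(x):
    """Reducing function: returns one value (max minus min) per group."""
    return x.max() - x.min()


per_team = grouped.agg(spread)       # one row per team
per_row = grouped.transform(spread)  # the single value is broadcast to each row

# Element-wise function: returns one value per row of the group
relative = grouped.transform(lambda x: x / x.mean())

print(per_team)
print(per_row)
print(relative)
```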


## 4. **Common Pitfalls and Best Practices**

| ❌ Pitfall                                                         | ✅ Best Practice                                                                 |
| ----------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Using `.agg()` when shape needs to be preserved                   | Use `.transform()` when the output should align with the original DataFrame     |
| Passing several functions to a single groupby `.transform()` call | Pass one function per call; use one `.transform()` per new column               |
| Misunderstanding shape changes                                    | Inspect the result's shape and index before assigning it                        |
| Using `.apply()` where `.transform()` would do                    | Prefer `.transform()` with built-in function names for performance and clarity  |
| Forgetting to assign transformed output                           | Use `df['new_col'] = ...` to store results                                      |
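
The first pitfall is easy to reproduce. In this sketch (the frame `staff` and the column names `wrong` and `right` are made up), assigning an aggregated Series to a column yields only missing values, because pandas aligns on the index and the aggregated result is indexed by department name rather than by row number.

```python
import pandas as pd

# Hypothetical data
staff = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT'],
    'Salary': [50000, 52000, 60000],
})

# Pitfall: the aggregated Series is indexed by 'HR' and 'IT', so it does not
# align with the row index 0..2 and every assigned value becomes NaN.
staff['wrong'] = staff.groupby('Department')['Salary'].agg('mean')

# Fix: transform returns a Series aligned with the original rows.
staff['right'] = staff.groupby('Department')['Salary'].transform('mean')

print(staff)
```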


## 5. **Examples on Real/Pseudo Data**

In [2]:
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 61000]
}

df = pd.DataFrame(data)

df

|   | Department | Employee | Salary |
| - | ---------- | -------- | ------ |
| 0 | HR         | Alice    | 50000  |
| 1 | HR         | Bob      | 52000  |
| 2 | IT         | Charlie  | 60000  |
| 3 | IT         | David    | 62000  |
| 4 | Finance    | Eve      | 58000  |
| 5 | Finance    | Frank    | 61000  |


In [3]:
### ✅ a. Aggregation — average salary by department
df.groupby('Department')['Salary'].agg('mean')

Department
Finance    59500.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64

In [5]:
### ✅ b. Transformation — assign department average salary to each row
df['dept_avg_salary'] = df.groupby('Department')['Salary'].transform('mean')
df

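|   | Department | Employee | Salary | dept_avg_salary |
| - | ---------- | -------- | ------ | --------------- |
| 0 | HR         | Alice    | 50000  | 51000.0         |
| 1 | HR         | Bob      | 52000  | 51000.0         |
| 2 | IT         | Charlie  | 60000  | 61000.0         |
| 3 | IT         | David    | 62000  | 61000.0         |
| 4 | Finance    | Eve      | 58000  | 59500.0         |
| 5 | Finance    | Frank    | 61000  | 59500.0         |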
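
For comparison, the same column can be built in two steps: aggregate first, then map the per-department result back onto the rows. The sketch below rebuilds the salary data under a different name (`emp`) so it runs on its own; `dept_means`, `via_map` and `via_transform` are illustrative names.

```python
import pandas as pd

# Same departments and salaries as the example data above
emp = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 61000],
})

# Two-step route: aggregate, then look up each row's department in the result
dept_means = emp.groupby('Department')['Salary'].mean()
via_map = emp['Department'].map(dept_means)

# One-step route: transform does the broadcast for you
via_transform = emp.groupby('Department')['Salary'].transform('mean')

# Both routes give the same value for every row
print((via_map == via_transform).all())
```

`.transform()` is the shorter spelling here; the two-step route is mainly useful when the aggregated table is also needed on its own.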


In [6]:
### ✅ c. Salary share within department (relative proportion)
df['salary_share'] = df.groupby('Department')['Salary'].transform(lambda x: x / x.sum())
df

|   | Department | Employee | Salary | dept_avg_salary | salary_share |
| - | ---------- | -------- | ------ | --------------- | ------------ |
| 0 | HR         | Alice    | 50000  | 51000.0         | 0.490196     |
| 1 | HR         | Bob      | 52000  | 51000.0         | 0.509804     |
| 2 | IT         | Charlie  | 60000  | 61000.0         | 0.491803     |
| 3 | IT         | David    | 62000  | 61000.0         | 0.508197     |
| 4 | Finance    | Eve      | 58000  | 59500.0         | 0.487395     |
| 5 | Finance    | Frank    | 61000  | 59500.0         | 0.512605     |


In [8]:
### ✅ d. Department-wise salary rank (custom transformation)
df['Dept_Salary_Rank'] = df.groupby('Department')['Salary'].transform(lambda x: x.rank(ascending=False))
df

|   | Department | Employee | Salary | dept_avg_salary | salary_share | Dept_Salary_Rank |
| - | ---------- | -------- | ------ | --------------- | ------------ | ---------------- |
| 0 | HR         | Alice    | 50000  | 51000.0         | 0.490196     | 2.0              |
| 1 | HR         | Bob      | 52000  | 51000.0         | 0.509804     | 1.0              |
| 2 | IT         | Charlie  | 60000  | 61000.0         | 0.491803     | 2.0              |
| 3 | IT         | David    | 62000  | 61000.0         | 0.508197     | 1.0              |
| 4 | Finance    | Eve      | 58000  | 59500.0         | 0.487395     | 2.0              |
| 5 | Finance    | Frank    | 61000  | 59500.0         | 0.512605     | 1.0              |


## 6. **Real World Use Cases**

### 🏪 Retail Sales

**Goal**: Add a column showing each product’s sales as a fraction of its category’s total (multiply by 100 for a percentage).

```python
df['Category_Share'] = df.groupby('Category')['Sales'].transform(lambda x: x / x.sum())
```
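
A runnable sketch of this use case on invented data follows. The frame `retail`, the product names and the extra `Category_Share_Pct` column are assumptions made for illustration. It uses the built-in `'sum'` with a vectorized division instead of a lambda, which gives the same result and usually avoids calling a Python function once per group.

```python
import pandas as pd

# Hypothetical retail data
retail = pd.DataFrame({
    'Category': ['Toys', 'Toys', 'Books', 'Books'],
    'Product': ['Robot', 'Puzzle', 'Novel', 'Atlas'],
    'Sales': [300.0, 100.0, 50.0, 150.0],
})

# Broadcast each category's total to its rows, then divide row by row
category_total = retail.groupby('Category')['Sales'].transform('sum')
retail['Category_Share'] = retail['Sales'] / category_total

# Optional: express the share as a percentage
retail['Category_Share_Pct'] = (retail['Category_Share'] * 100).round(1)

print(retail)
```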

---

### 💼 HR Analytics

**Goal**: Add a column showing each employee’s salary vs. their department average.

```python
df['Salary_vs_DeptAvg'] = df['Salary'] - df.groupby('Department')['Salary'].transform('mean')
```

---

### 🏥 Healthcare

**Goal**: Show how each patient’s length of stay deviates from the average stay for their diagnosis.

```python
df['Stay_Deviation'] = df['Stay_Days'] - df.groupby('Diagnosis')['Stay_Days'].transform('mean')
```
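
This and the HR example share one pattern: subtract a broadcast group mean from each row. A small sketch on invented patient data (the frame `stays` and its values are hypothetical) shows the result, including the fact that deviations within each group sum to zero.

```python
import pandas as pd

# Hypothetical patient data
stays = pd.DataFrame({
    'Patient': ['P1', 'P2', 'P3', 'P4'],
    'Diagnosis': ['Flu', 'Flu', 'Fracture', 'Fracture'],
    'Stay_Days': [2, 4, 7, 11],
})

# Positive values mean a longer-than-average stay for that diagnosis
stays['Stay_Deviation'] = (
    stays['Stay_Days']
    - stays.groupby('Diagnosis')['Stay_Days'].transform('mean')
)

# Sanity check: deviations from a group mean cancel out within each group
deviation_sums = stays.groupby('Diagnosis')['Stay_Deviation'].sum()

print(stays)
print(deviation_sums)
```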

---

### 🛒 E-commerce

**Goal**: Rank each customer’s orders by amount, largest first.

```python
df['Order_Rank'] = df.groupby('CustomerID')['OrderAmount'].transform(lambda x: x.rank(ascending=False))
```
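
Grouped objects also expose `.rank()` directly, so the lambda is optional. The sketch below, on invented order data (`purchases`, `Rank_Average` and `Rank_Dense` are hypothetical names), also shows how the `method` argument changes the treatment of ties.

```python
import pandas as pd

# Hypothetical order data; customer 1 has two orders of the same amount
purchases = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2],
    'OrderAmount': [20.0, 50.0, 50.0, 10.0, 30.0],
})

by_customer = purchases.groupby('CustomerID')['OrderAmount']

# Default method='average': tied orders share the mean of their ranks
purchases['Rank_Average'] = by_customer.rank(ascending=False)

# method='dense': tied orders share a rank and no rank numbers are skipped
purchases['Rank_Dense'] = by_customer.rank(ascending=False, method='dense')

print(purchases)
```

Which tie rule is appropriate depends on how the rank is used downstream, so it is worth choosing `method` explicitly.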


## ✅ Summary

| 🧠 Concept | Aggregation               | Transformation                   |
| ---------- | ------------------------- | -------------------------------- |
| Output     | Collapsed (1 value/group) | Same shape as input              |
| Use case   | Summarization             | Contextual calculation           |
| Method     | `.agg()`                  | `.transform()`                   |
| Examples   | Total sales by region     | Each product’s % share in region |

