# **6. Aggregation & Grouping**

## **2. Aggregation Techniques with `.agg()`**

In [1]:
import pandas as pd

## 1. **What it does and when to use it**

### 📌 What is `.agg()`?

`.agg()` is short for **"aggregate"** (it is an alias of `.aggregate()`). It applies **one or more aggregation functions** in a **single call** to a DataFrame or Series, most often after `groupby()`.

Calling `.sum()`, `.mean()`, etc. one by one returns one statistic at a time. A single `.agg()` call can return several statistics together and can use a different function for each column.

---

### 🧠 When to use `.agg()`?

Use `.agg()` when you:

* Want to apply **multiple aggregation functions** to one or more columns.
* Need **different aggregations per column**.
* Want to apply **custom or user-defined functions**.
* Need readable, labeled summaries of group-level statistics.
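
As a minimal sketch of the first point, the snippet below compares computing statistics one call at a time with a single `.agg()` call. The `sales` table and its values are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample data, assumed for illustration only.
sales = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West'],
    'Sales': [300, 400, 500, 600],
})

grouped = sales.groupby('Region')['Sales']

# One statistic per call: two separate Series that must be combined by hand.
total = grouped.sum()
average = grouped.mean()

# The same two statistics from a single .agg() call, returned as one DataFrame.
combined = grouped.agg(['sum', 'mean'])
print(combined)
```

Both routes give the same numbers; the `.agg()` form simply keeps them in one labeled table.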


## 2. **Syntax and Core Parameters**

### 🔤 General Syntax:

```python
df.groupby('column').agg(func)
```

Where `func` can be:

* A string: `'sum'`, `'mean'`, etc.
* A list of functions: `['sum', 'mean']`
* A dictionary: `{'col1': 'sum', 'col2': ['mean', 'max']}`
* A lambda or custom function
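
The form of `func` also determines the shape of the result. The sketch below, which uses a tiny hypothetical frame named `tiny`, shows what each form returns when applied after `groupby()`.

```python
import pandas as pd

# Hypothetical data, assumed for illustration only.
tiny = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 3]})
by_key = tiny.groupby('key')['val']

# A string on a single column returns a Series.
as_string = by_key.agg('sum')

# A list of functions returns a DataFrame with one column per function.
as_list = by_key.agg(['sum', 'mean'])

# A callable on a single column also returns a Series.
as_callable = by_key.agg(lambda s: s.max() - s.min())

# A dict with list values returns a DataFrame with two-level column labels.
as_dict = tiny.groupby('key').agg({'val': ['sum', 'max']})

print(type(as_string).__name__, type(as_list).__name__)
print(as_dict.columns.tolist())
```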

### 🧾 Example Structures:

#### a. **Single function on single column:**

```python
df.groupby('Region')['Sales'].agg('sum')
```

#### b. **Multiple functions on one column:**

```python
df.groupby('Region')['Sales'].agg(['sum', 'mean', 'max'])
```

#### c. **Different functions on different columns:**

```python
df.groupby('Region').agg({'Sales': 'sum', 'Profit': 'mean'})
```

#### d. **Custom lambda functions:**

```python
df.groupby('Region')['Sales'].agg(lambda x: x.max() - x.min())
```

#### e. **Named aggregations (cleaner labels):**

```python
df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean')
)
```
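
Named aggregation can be written in a couple of other ways. The sketch below assumes pandas 0.25 or later and reuses the same sample data as the examples in this section. It shows the explicit `pd.NamedAgg` spelling and the keyword form for a single selected column.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'North', 'North'],
    'Sales': [300, 400, 500, 600, 200, 250],
    'Profit': [30, 40, 50, 60, 20, 25],
})

# pd.NamedAgg is a more explicit spelling of the (column, function) tuple.
explicit = df.groupby('Region').agg(
    total_sales=pd.NamedAgg(column='Sales', aggfunc='sum'),
    sales_range=pd.NamedAgg(column='Sales', aggfunc=lambda x: x.max() - x.min()),
)

# With one column already selected, each keyword maps a new name to a function.
single_column = df.groupby('Region')['Sales'].agg(total='sum', average='mean')

print(explicit)
print(single_column)
```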


## 3. **Different Methods and Techniques**

| Technique                                | Description               | Example                                |
| ---------------------------------------- | ------------------------- | -------------------------------------- |
| Single function                          | Apply one aggregation     | `.agg('sum')`                          |
| Multiple functions on one column         | Create multiple summaries | `.agg(['mean', 'max'])`                |
| Different functions on different columns | Full customization        | `.agg({'col1': 'sum', 'col2': 'std'})` |
| Lambda/UDFs                              | Custom aggregation logic  | `.agg(lambda x: x.max() - x.min())`    |
| Named Aggregations                       | Rename results inline     | `.agg(avg_salary=('Salary', 'mean'))`  |


## 4. **Common Pitfalls and Best Practices**

| ❌ Pitfall                                 | ✅ Best Practice                                        |
| ----------------------------------------- | ------------------------------------------------------ |
| Applying a string function such as `'mean'` to every column, including non-numeric ones | Select the columns first, or use the dict form with explicit column names |
| Unclear column names after aggregation    | Use **named aggregations** for clean labeling          |
| Using a lambda where a built-in exists    | Prefer built-in names such as `'sum'` or `'mean'`, which are usually faster on large datasets |
| Forgetting `.reset_index()`               | Use it if you want group keys as regular columns       |
| Reaching for `.apply()` for simple reductions | Prefer `.agg()` over `.apply()` for performance    |
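
The `.reset_index()` row can be illustrated with a short sketch. It uses the same sample data as the examples in this section and also shows `as_index=False`, an alternative that keeps the group key as a regular column from the start.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'North', 'North'],
    'Sales': [300, 400, 500, 600, 200, 250],
})

# By default the group key becomes the index of the result.
indexed = df.groupby('Region').agg(total_sales=('Sales', 'sum'))

# reset_index() moves the key back into a regular column.
flat = indexed.reset_index()

# as_index=False produces the same flat layout in one step.
flat_direct = df.groupby('Region', as_index=False).agg(total_sales=('Sales', 'sum'))

print(flat)
```

A flat layout is usually easier to merge, plot, or export than one with the key in the index.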


## 5. **Examples on Real/Pseudo Data**

In [2]:
data = {
    'Region': ['East', 'East', 'West', 'West', 'North', 'North'],
    'Sales': [300, 400, 500, 600, 200, 250],
    'Profit': [30, 40, 50, 60, 20, 25]
}

df = pd.DataFrame(data)

df

|     | Region | Sales | Profit |
| --- | ------ | ----- | ------ |
| 0   | East   | 300   | 30     |
| 1   | East   | 400   | 40     |
| 2   | West   | 500   | 50     |
| 3   | West   | 600   | 60     |
| 4   | North  | 200   | 20     |
| 5   | North  | 250   | 25     |


In [3]:
### ✅ a. Apply single aggregation
df.groupby('Region')['Sales'].agg('sum')

Region
East      700
North     450
West     1100
Name: Sales, dtype: int64

In [4]:
### ✅ b. Multiple aggregations on one column
df.groupby('Region')['Sales'].agg(['sum', 'mean', 'max'])

| Region | sum  | mean  | max |
| ------ | ---- | ----- | --- |
| East   | 700  | 350.0 | 400 |
| North  | 450  | 225.0 | 250 |
| West   | 1100 | 550.0 | 600 |


In [5]:
### ✅ c. Aggregation per column
df.groupby('Region').agg({'Sales': 'sum', 'Profit': 'mean'})

| Region | Sales | Profit |
| ------ | ----- | ------ |
| East   | 700   | 35.0   |
| North  | 450   | 22.5   |
| West   | 1100  | 55.0   |


In [6]:
### ✅ d. Named Aggregations
df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_profit=('Profit', 'mean')
)

| Region | total_sales | avg_profit |
| ------ | ----------- | ---------- |
| East   | 700         | 35.0       |
| North  | 450         | 22.5       |
| West   | 1100        | 55.0       |


In [7]:
### ✅ e. Custom lambda
df.groupby('Region')['Sales'].agg(lambda x: x.max() - x.min())

Region
East     100
North     50
West     100
Name: Sales, dtype: int64
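
When a lambda is passed inside a list of functions, its result column gets an auto-generated label rather than a descriptive one. One possible workaround, sketched below, is a small named function: pandas uses the function's name as the column label. The helper `sales_range` is a hypothetical name introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'North', 'North'],
    'Sales': [300, 400, 500, 600, 200, 250],
})

def sales_range(x):
    """Spread between the largest and smallest value in a group."""
    return x.max() - x.min()

# The function's name becomes the column label in the result.
labeled = df.groupby('Region')['Sales'].agg(['sum', sales_range])
print(labeled)
```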

## 6. **Real World Use Cases**

### 🏪 Retail Analytics

**Task:** Show total and average sales and profits by region.

```python
df.groupby('Region').agg({
    'Sales': ['sum', 'mean'],
    'Profit': ['sum', 'mean']
})
```
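
A dict whose values are lists returns two-level column labels such as `('Sales', 'sum')`. The sketch below shows one common way to flatten them into single strings; the underscore-joined naming is an assumption made for this example, not a pandas convention.

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['East', 'East', 'West', 'West', 'North', 'North'],
    'Sales': [300, 400, 500, 600, 200, 250],
    'Profit': [30, 40, 50, 60, 20, 25],
})

summary = df.groupby('Region').agg({
    'Sales': ['sum', 'mean'],
    'Profit': ['sum', 'mean'],
})

# Join each (column, function) pair into a single label, e.g. 'Sales_sum'.
summary.columns = ['_'.join(pair) for pair in summary.columns]
print(summary)
```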

---

### 💼 HR Analytics

**Task:** Show highest and lowest salaries per department.

```python
# hr_df is assumed to have 'Department' and 'Salary' columns
hr_df.groupby('Department')['Salary'].agg(['min', 'max'])
```
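
To make this runnable, the sketch below builds a small hypothetical `hr_df` (the departments and salaries are invented for illustration) and derives the salary spread from the aggregated columns.

```python
import pandas as pd

# Hypothetical HR data, assumed for illustration only.
hr_df = pd.DataFrame({
    'Department': ['IT', 'IT', 'HR', 'HR', 'HR'],
    'Salary': [70000, 85000, 50000, 62000, 58000],
})

salary_span = hr_df.groupby('Department')['Salary'].agg(['min', 'max'])

# The aggregated columns are ordinary columns, so further arithmetic is straightforward.
salary_span['range'] = salary_span['max'] - salary_span['min']
print(salary_span)
```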

---

### 🎓 Education

**Task:** Calculate the pass rate per subject (the fraction of `'Pass'` results; multiply by 100 for a percentage).

```python
exam_df.groupby('Subject').agg(
    pass_rate=('Result', lambda x: (x == 'Pass').mean())
)
```
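
The pass-rate calculation works because comparing a column with `'Pass'` yields booleans, and the mean of booleans is the fraction that are `True`. The sketch below runs it on a hypothetical `exam_df` (invented rows, for illustration only) and converts the fraction to a percentage.

```python
import pandas as pd

# Hypothetical exam results, assumed for illustration only.
exam_df = pd.DataFrame({
    'Subject': ['Math', 'Math', 'Math', 'Math', 'Physics', 'Physics'],
    'Result': ['Pass', 'Fail', 'Pass', 'Pass', 'Pass', 'Fail'],
})

pass_summary = exam_df.groupby('Subject').agg(
    # Mean of a boolean mask = share of rows where the condition holds.
    pass_rate=('Result', lambda x: (x == 'Pass').mean()),
    students=('Result', 'count'),
)

# Convert the 0-1 fraction into a percentage.
pass_summary['pass_pct'] = pass_summary['pass_rate'] * 100
print(pass_summary)
```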

---

### 🏥 Healthcare

**Task:** Average hospital stay duration per disease.

```python
hospital_df.groupby('Diagnosis')['Stay'].agg('mean')
```

---

### 📦 Supply Chain

**Task:** Count orders and average delivery time per vendor.

```python
# orders_df is assumed to have 'Vendor', 'OrderID', and 'DeliveryDays' columns
orders_df.groupby('Vendor').agg(
    order_count=('OrderID', 'count'),
    avg_delivery_time=('DeliveryDays', 'mean')
)
```
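
One detail worth checking in this pattern is missing data: `'count'` counts only non-null values, and `'mean'` skips nulls. The sketch below uses a hypothetical `orders_df` (invented rows, for illustration only) in which one order has no recorded delivery time.

```python
import pandas as pd

# Hypothetical order data; one DeliveryDays value is missing on purpose.
orders_df = pd.DataFrame({
    'Vendor': ['A', 'A', 'A', 'B', 'B'],
    'OrderID': [101, 102, 103, 201, 202],
    'DeliveryDays': [3.0, 5.0, None, 2.0, 4.0],
})

vendor_summary = orders_df.groupby('Vendor').agg(
    order_count=('OrderID', 'count'),           # all orders have an ID
    delivered_count=('DeliveryDays', 'count'),  # excludes the missing value
    avg_delivery_time=('DeliveryDays', 'mean'), # averages the non-missing values
)
print(vendor_summary)
```

Vendor A has three orders but only two recorded delivery times, so its average is taken over two values.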


## ✅ Summary

| 🔍 Feature | 📌 Use                                |
| ---------- | ------------------------------------- |
| `.agg()`   | Flexible grouping aggregations        |
| Works with | Strings, lists, dictionaries, lambdas |
| Best for   | Multiple stats per column/group       |
| Tip        | Use named aggregations for clarity    |

