# **6. Aggregation & Grouping**

## **1. Core `groupby()` Operations**

In [1]:
import pandas as pd

## 1. **What it does and when to use it**

### 📌 What is `groupby()`?

The `.groupby()` method in pandas **splits the data into groups based on one or more keys**, **applies an aggregation or transformation**, and then **combines** the results back into a new DataFrame or Series.
This is known as the **Split–Apply–Combine** strategy.
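
The three steps can be spelled out by hand to see what `.groupby()` does in a single call. The following is a minimal sketch on a hypothetical `toy` DataFrame (the column names and values are invented for illustration), not a description of how pandas implements it internally:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
toy = pd.DataFrame({
    "Region": ["East", "West", "East", "West"],
    "Sales": [100, 200, 150, 50],
})

# Split: iterating a GroupBy yields (key, sub-DataFrame) pairs.
# Apply: reduce each sub-DataFrame to a single number.
pieces = {key: group["Sales"].sum() for key, group in toy.groupby("Region")}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="Sales")

# The usual one-liner gives the same totals per region.
direct = toy.groupby("Region")["Sales"].sum()

print(manual)
print(direct)
```

In practice the one-liner is preferable; the manual loop is only useful for understanding or debugging what each group contains.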

### 🧠 When to use it?

Use `.groupby()` when:

* You want to **summarize data** by categories (e.g., sales by region).
* You need to **aggregate**, **filter**, or **transform** subsets of your data **based on a key**.
* You're doing **descriptive analytics**, **KPI calculations**, or **report generation**.


## 2. **Syntax and Core Parameters**

### 🧾 Basic Syntax:

```python
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, dropna=True)
```

### 📌 Most Common Parameters:

| Parameter  | Description                                                              |
| ---------- | ------------------------------------------------------------------------ |
| `by`       | Column label(s), Series, mapping, or function that defines the groups    |
| `axis`     | 0 = rows (default), 1 = columns; deprecated since pandas 2.1             |
| `level`    | Used with MultiIndex to group by a specific level                        |
| `as_index` | If True (default), group keys become the index of the result             |
| `sort`     | Sort group keys. Default is True                                         |
| `dropna`   | If True (default), rows whose group key is NA are excluded               |
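
To make the effect of `as_index`, `sort`, and `dropna` concrete, here is a small sketch on a hypothetical `toy` DataFrame with one missing key (the data is invented for illustration):

```python
import pandas as pd

# Hypothetical data: the third row has a missing City.
toy = pd.DataFrame({
    "City": ["NY", "LA", None, "NY"],
    "Sales": [10, 20, 30, 40],
})

# Defaults: keys become the index, keys are sorted, the NA key is dropped.
default = toy.groupby("City")["Sales"].sum()

# as_index=False: the key stays a regular column in the result.
flat = toy.groupby("City", as_index=False)["Sales"].sum()

# sort=False: groups appear in order of first occurrence (NY before LA).
unsorted = toy.groupby("City", sort=False)["Sales"].sum()

# dropna=False: rows with a missing key form their own group.
with_na = toy.groupby("City", dropna=False)["Sales"].sum()

for name, result in [("default", default), ("flat", flat),
                     ("unsorted", unsorted), ("with_na", with_na)]:
    print(name, result, sep="\n", end="\n\n")
```

Note that the default result silently omits the 30 units of sales attached to the missing city, which is why `dropna` is worth checking whenever totals look too low.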

---

### 📍 Basic Example:

```python
df.groupby('Department')
```

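This groups the data by the `'Department'` column and returns a `DataFrameGroupBy` object. Nothing is computed until an aggregation or transformation is applied to it.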


## 3. **Different Methods and Techniques**

Here are common **aggregation methods** used after `groupby()`:

| Method                 | Description                             |
| ---------------------- | --------------------------------------- |
| `.sum()`               | Total sum of values per group           |
| `.mean()`              | Average per group                       |
| `.count()`             | Non-null entries per group              |
| `.min()` / `.max()`    | Min/max per group                       |
| `.size()`              | Number of rows per group                |
| `.median()`            | Median of each group                    |
| `.std()`               | Standard deviation                      |
| `.first()` / `.last()` | First/last non-null value in each group |
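
The difference between `.size()`, `.count()`, and `.first()` only shows up when a group contains missing values. A minimal sketch, assuming a hypothetical `toy` DataFrame with one `NaN` score:

```python
import numpy as np
import pandas as pd

# Hypothetical data: team A has one missing score.
toy = pd.DataFrame({
    "Team": ["A", "A", "B", "B"],
    "Score": [np.nan, 5.0, 3.0, 4.0],
})

grouped = toy.groupby("Team")["Score"]

size = grouped.size()    # rows per group, NaN included
count = grouped.count()  # non-null values per group
first = grouped.first()  # first non-null value per group
mean = grouped.mean()    # NaN values are skipped

print(pd.DataFrame({"size": size, "count": count, "first": first, "mean": mean}))
```

Here team A has two rows but only one non-null score, so `.size()` and `.count()` disagree for that group.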

### 🧪 Grouping Techniques:

#### a. **Group by a single column**

```python
df.groupby('City')['Sales'].sum()
```

#### b. **Group by multiple columns**

```python
df.groupby(['Region', 'City'])['Sales'].mean()
```

#### c. **Group by a computed key (e.g., year extracted from a date column)**

```python
df.groupby(df['Date'].dt.year)['Revenue'].sum()
```
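
Besides a precomputed key such as `df['Date'].dt.year`, `by` also accepts a callable, which pandas applies to the index labels. A short sketch, assuming a hypothetical `toy` DataFrame whose index holds the dates:

```python
import pandas as pd

# Hypothetical data: the dates live in the index, not in a column.
toy = pd.DataFrame(
    {"Revenue": [100, 200, 300]},
    index=pd.to_datetime(["2023-03-01", "2023-11-15", "2024-01-10"]),
)

# The callable is evaluated on the index labels to produce the group key.
by_year = toy.groupby(lambda ts: ts.year)["Revenue"].sum()

print(by_year)
```

If the dates are in a regular column instead, grouping by the derived Series as shown above is the more direct option.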

#### d. **Group without setting index**

```python
df.groupby('City', as_index=False)['Sales'].sum()
```

#### e. **Group by dictionary or series**

```python
group_map = {'NY': 'East', 'LA': 'West'}
df.groupby(df['City'].map(group_map))['Profit'].mean()
```
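
One thing to watch with a mapping: labels that are missing from the dictionary map to `NaN` and are dropped by default. The sketch below uses a hypothetical `toy` DataFrame with an unmapped city (`'SF'`) to show this, and also shows that a dictionary passed directly to `groupby()` is matched against the index labels:

```python
import pandas as pd

# Hypothetical data: 'SF' has no entry in group_map.
toy = pd.DataFrame({
    "City": ["NY", "LA", "NY", "SF"],
    "Profit": [10.0, 20.0, 30.0, 40.0],
})
group_map = {"NY": "East", "LA": "West"}

# Mapping the column first: 'SF' becomes NaN and its row is left out.
region = toy["City"].map(group_map)
by_region = toy.groupby(region)["Profit"].mean()

# Filling unmapped keys keeps those rows in an explicit bucket.
by_region_all = toy.groupby(region.fillna("Other"))["Profit"].mean()

# Passing the dict itself groups by index labels, so City must be the index.
by_index = toy.set_index("City").groupby(group_map)["Profit"].mean()

print(by_region, by_region_all, by_index, sep="\n\n")
```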


## 4. **Common Pitfalls and Best Practices**

| ❌ Pitfall                              | ✅ Solution                                                                                    |
| -------------------------------------- | --------------------------------------------------------------------------------------------- |
| `groupby()` returns unexpected indexes | Use `as_index=False` (or `.reset_index()`) to keep group keys as columns                      |
| Losing data due to `NaN` in group key  | Set `dropna=False` to include NaN groups                                                      |
| Forgetting to aggregate after groupby  | `groupby()` alone returns a GroupBy object; follow it with `.sum()`, `.agg()`, `.transform()`, etc. |
| Using `.apply()` unnecessarily         | Prefer `.agg()` or `.transform()` with built-in functions when possible for better performance |
| Sorting group keys adds overhead       | Set `sort=False` if sorted output is not required                                             |
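
Two of these points are easier to see in code: a bare `groupby()` computes nothing, and `.agg()` and `.transform()` cover most cases where `.apply()` is tempting. A minimal sketch, assuming a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical salary data.
toy = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT"],
    "Salary": [50000, 52000, 60000, 62000],
})

# No aggregation yet: this is only a SeriesGroupBy object, not a result.
grouped = toy.groupby("Department")["Salary"]
print(type(grouped).__name__)

# .agg reduces each group to one row per key.
summary = grouped.agg(["mean", "max"])

# .transform returns one value per original row, aligned with toy's index,
# so it can be assigned straight back as a new column.
toy["DeptMean"] = grouped.transform("mean")
toy["DiffFromMean"] = toy["Salary"] - toy["DeptMean"]

print(summary)
print(toy)
```

Passing function names as strings (`"mean"`, `"max"`) lets pandas use its built-in implementations rather than calling a Python function once per group.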


## 5. **Examples on Real/Pseudo Data**

In [2]:
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 61000],
    'Bonus': [2000, 2100, 2500, 2700, 2300, 2600]
}

df = pd.DataFrame(data)

df

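|   | Department | Employee | Salary | Bonus |
| - | ---------- | -------- | ------ | ----- |
| 0 | HR         | Alice    | 50000  | 2000  |
| 1 | HR         | Bob      | 52000  | 2100  |
| 2 | IT         | Charlie  | 60000  | 2500  |
| 3 | IT         | David    | 62000  | 2700  |
| 4 | Finance    | Eve      | 58000  | 2300  |
| 5 | Finance    | Frank    | 61000  | 2600  |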


In [3]:
### ✅ Total salary by department
df.groupby('Department')['Salary'].sum()

Department
Finance    119000
HR         102000
IT         122000
Name: Salary, dtype: int64

In [4]:
### ✅ Average bonus by department
df.groupby('Department')['Bonus'].mean()

Department
Finance    2450.0
HR         2050.0
IT         2600.0
Name: Bonus, dtype: float64

In [6]:
### ✅ Count of employees in each department
df.groupby('Department').size()

Department
Finance    2
HR         2
IT         2
dtype: int64

In [7]:
### ✅ Same aggregation on multiple columns
df.groupby('Department')[['Salary', 'Bonus']].mean()

| Department | Salary  | Bonus  |
| ---------- | ------- | ------ |
| Finance    | 59500.0 | 2450.0 |
| HR         | 51000.0 | 2050.0 |
| IT         | 61000.0 | 2600.0 |
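
To apply different aggregations to different columns in one pass, `.agg()` with named aggregation is one option. The sketch below rebuilds the same employee data so it runs on its own; the output column names (`total_salary`, `avg_salary`, `max_bonus`) are arbitrary choices for illustration:

```python
import pandas as pd

# Same employee data as in the example above.
df = pd.DataFrame({
    "Department": ["HR", "HR", "IT", "IT", "Finance", "Finance"],
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank"],
    "Salary": [50000, 52000, 60000, 62000, 58000, 61000],
    "Bonus": [2000, 2100, 2500, 2700, 2300, 2600],
})

# Named aggregation: new_column=(source_column, aggregation).
summary = df.groupby("Department").agg(
    total_salary=("Salary", "sum"),
    avg_salary=("Salary", "mean"),
    max_bonus=("Bonus", "max"),
)

print(summary)
```

Named aggregation yields flat, readable column names, whereas passing a list of functions per column produces a MultiIndex on the columns.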


In [8]:
### ✅ Group without setting index
df.groupby('Department', as_index=False)['Salary'].sum()

|   | Department | Salary |
| - | ---------- | ------ |
| 0 | Finance    | 119000 |
| 1 | HR         | 102000 |
| 2 | IT         | 122000 |


## 6. **Real-World Use Cases**

### 🏬 Sales Analysis

**Task**: Calculate total sales per region per product.

```python
sales_df.groupby(['Region', 'Product'])['Sales'].sum()
```

---

### 🏢 HR Analytics

**Task**: Find average salary and bonus per department.

```python
hr_df.groupby('Department')[['Salary', 'Bonus']].mean()
```

---

### 🎓 Education

**Task**: Count students passed per subject.

```python
exam_df[exam_df['Result'] == 'Pass'].groupby('Subject')['Student'].count()
```

---

### 🧪 Healthcare

**Task**: Find average hospital stay duration per diagnosis.

```python
hospital_df.groupby('Diagnosis')['Stay_Duration'].mean()
```

---

### 🛒 Retail

**Task**: Find number of unique customers per store.

```python
retail_df.groupby('Store')['CustomerID'].nunique()
```


## ✅ Summary

| Key Concept         | Description                                        |
| ------------------- | -------------------------------------------------- |
| `.groupby()`        | Used to split data into groups                     |
| Follow-up methods   | `.sum()`, `.mean()`, `.count()`, `.size()`, etc.   |
| Parameters to know  | `as_index`, `dropna`, `sort`                       |
| Avoid common issues | Use `as_index=False`, handle NaNs, prefer `.agg()` |

