# Topic 04 - Problem 01: Data Aggregation & Grouping

---

## 1. About the Problem

Data aggregation and grouping are used to **summarize data** and **extract insights** from large datasets.

Instead of looking at individual rows, we:
- Group data based on a category
- Apply aggregate functions like **mean, sum, count, min, max**

This is extremely common in:
- Business analytics
- Sales analysis
- User behavior analysis
- Machine learning preprocessing

In this problem, I will group employee data by department and calculate useful statistics.
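
Before the full solution, here is a minimal sketch of the two steps listed above (group by a category, then apply aggregate functions). The `orders` table, its `region` and `amount` columns, and the values in it are made up for illustration and are not part of the problem data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two steps.
orders = pd.DataFrame({
    "region": ["North", "South", "North", "South", "North"],
    "amount": [100, 250, 300, 150, 200],
})

# Step 1: group the rows by a category (region).
# Step 2: apply several aggregate functions to one column.
summary = orders.groupby("region")["amount"].agg(
    ["mean", "sum", "count", "min", "max"]
)

print(summary)
```

Each region becomes one row in `summary`, and each aggregate function becomes one column.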

---


## 2. Solution Code

import pandas as pd

data = {
    "department": ["IT", "HR", "IT", "Finance", "HR", "Finance", "IT"],
    "salary": [60000, 45000, 70000, 80000, 50000, 75000, 65000],
    "experience": [3, 2, 5, 8, 4, 7, 4]
}

df = pd.DataFrame(data)

grouped_data = df.groupby("department").agg({
    "salary": ["mean", "max", "min"],
    "experience": "mean"
})

print(grouped_data)


# # Commented-out variations; `grp` is the grouped object
# grp = df.groupby("department")

# # Get mean of each group
# print(grp.mean())

# # Get multiple statistics
# print(grp.agg(['mean', 'sum', 'count']))

# # Different functions for different columns
# print(grp.agg({
#     'salary': ['mean', 'max'],
#     'experience': 'mean'
# }))

             salary               experience
               mean    max    min       mean
department                                  
Finance     77500.0  80000  75000        7.5
HR          47500.0  50000  45000        3.0
IT          65000.0  70000  60000        4.0
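
The result above has two header levels (`salary` over `mean`, `max`, `min`), which can be awkward to index later. One possible way to get flat column names is pandas named aggregation. The output names `avg_salary`, `max_salary`, `min_salary`, and `avg_experience` are illustrative choices, not part of the original solution.

```python
import pandas as pd

# Same employee data as in the solution above.
df = pd.DataFrame({
    "department": ["IT", "HR", "IT", "Finance", "HR", "Finance", "IT"],
    "salary": [60000, 45000, 70000, 80000, 50000, 75000, 65000],
    "experience": [3, 2, 5, 8, 4, 7, 4],
})

# Named aggregation: new_column_name=(source_column, function).
# The keyword names on the left are hypothetical labels chosen here.
flat = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    min_salary=("salary", "min"),
    avg_experience=("experience", "mean"),
)

print(flat)
```

The numbers are the same as in the table above. Only the column labels change, so a value can be read with a single name such as `flat["avg_salary"]`.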


---

## 3. Explanation (What is happening)

- **groupby("department")**  
  → Groups employees based on department

- **mean salary**  
  → Average salary per department

- **max & min salary**  
  → Highest and lowest salary in each department

- **mean experience**  
  → Average experience level per department

This transforms raw employee-level data into **department-level insights**.
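
To make the steps above concrete, here is a rough sketch that reproduces the same statistics by hand: filter the rows of one department at a time, compute each statistic, then combine the results. It is meant only to show what `groupby()` and `agg()` do conceptually, not how pandas implements them. The column labels `salary_mean`, `salary_max`, `salary_min`, and `experience_mean` are hypothetical names chosen for this sketch.

```python
import pandas as pd

# Same employee data as in the solution.
df = pd.DataFrame({
    "department": ["IT", "HR", "IT", "Finance", "HR", "Finance", "IT"],
    "salary": [60000, 45000, 70000, 80000, 50000, 75000, 65000],
    "experience": [3, 2, 5, 8, 4, 7, 4],
})

rows = {}
for dept in sorted(df["department"].unique()):
    # Split: keep only the rows that belong to this department.
    subset = df[df["department"] == dept]
    # Apply: compute each statistic on that subset.
    rows[dept] = {
        "salary_mean": subset["salary"].mean(),
        "salary_max": subset["salary"].max(),
        "salary_min": subset["salary"].min(),
        "experience_mean": subset["experience"].mean(),
    }

# Combine: one row per department.
manual = pd.DataFrame.from_dict(rows, orient="index")
print(manual)

# Cross-check one column against groupby.
grouped_mean = df.groupby("department")["salary"].mean()
print(manual["salary_mean"].tolist() == grouped_mean.tolist())
```

The loop does in several lines what `groupby("department").agg(...)` expresses in one call.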

---

## 4. Summary / Takeaways

By solving this problem, I learned:

1. Grouping helps analyze data at a higher level
2. Aggregation summarizes large datasets efficiently
3. `groupby()` + `agg()` is one of the most useful combinations in pandas, because it applies different aggregate functions to different columns in a single call
4. This technique is widely used in **EDA, reporting, and ML pipelines**

Data aggregation is a **core skill** for any Data Scientist or ML Engineer because real-world decisions are usually based on **grouped insights**, not raw rows.

Next, I will continue with more advanced grouping and multi-column aggregation problems.
