# Data Exploration and Summary

## 7. Group Based Summary

In [1]:
import pandas as pd 
import numpy as np 

Group-based summary techniques help you **analyze subsets of your data**: computing aggregate statistics per category, pivoting a table to see multi-dimensional trends, and applying custom aggregations.

These methods are **essential in real-world analysis** such as:

* Aggregating sales by region
* Calculating average scores by class
* Analyzing customer behavior by gender

---

We’ll cover:

1. `df.groupby()`
2. `.agg()` on grouped data
3. `.transform()` on grouped data
4. `df.pivot_table()`

In [2]:
data = {
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
    'Experience': [3, 4, 2, 3, 5, 6]
}

df = pd.DataFrame(data)
df

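|     | Department | Employee | Salary | Experience |
| --- | ---------- | -------- | ------ | ---------- |
| 0   | Sales      | Alice    | 50000  | 3          |
| 1   | Sales      | Bob      | 60000  | 4          |
| 2   | HR         | Charlie  | 45000  | 2          |
| 3   | HR         | David    | 47000  | 3          |
| 4   | IT         | Eve      | 70000  | 5          |
| 5   | IT         | Frank    | 72000  | 6          |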


### 🔹 1. `df.groupby()` – Grouping and Aggregation

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
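To make the split, apply, and combine steps concrete, here is a minimal sketch that performs them by hand and compares the result with the built-in call. It uses a reduced copy of the example data; the variable names `manual` and `built_in` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

# Split: iterating a GroupBy yields (group key, sub-DataFrame) pairs
# Apply: compute the mean salary of each sub-DataFrame
# Combine: collect the per-group results into a single Series
manual = pd.Series(
    {dept: group['Salary'].mean() for dept, group in df.groupby('Department')},
    name='Salary',
)

# The built-in aggregation does all three steps in one call
built_in = df.groupby('Department')['Salary'].mean()

print(manual)
print(built_in)
```

Both Series hold the same per-department means; in practice the built-in form is shorter and usually faster than looping over groups.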

#### Group by one column:

In [5]:
df.groupby('Department')['Salary'].mean()

Department
HR       46000.0
IT       71000.0
Sales    55000.0
Name: Salary, dtype: float64

#### Group by multiple columns:

In [9]:
df.groupby(['Department', 'Experience'])['Salary'].max()

Department  Experience
HR          2             45000
            3             47000
IT          5             70000
            6             72000
Sales       3             50000
            4             60000
Name: Salary, dtype: int64

Returns a Series with a **MultiIndex** holding the maximum salary for each (Department, Experience) group.
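A MultiIndex result is often easier to work with once you know how to look up a group or flatten it. The sketch below rebuilds a reduced copy of the example data and shows three common options; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
    'Experience': [3, 4, 2, 3, 5, 6],
})

max_salary = df.groupby(['Department', 'Experience'])['Salary'].max()

# Look up a single group with a tuple of index values
it_5 = max_salary.loc[('IT', 5)]

# Turn the index levels back into ordinary columns
flat = max_salary.reset_index()

# Alternative: keep the grouping keys as columns from the start
flat_direct = df.groupby(['Department', 'Experience'], as_index=False)['Salary'].max()

print(it_5)
print(flat)
```

Whether to flatten depends on the next step: a MultiIndex suits lookups by key, while plain columns are usually easier to merge or export.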

### 2. `.agg()` – Aggregate with Custom Functions

Apply one or more functions to grouped data.

#### Single column, multiple functions:

In [12]:
df.groupby('Department')['Salary'].agg(['min', 'max', 'mean'])

| Department | min   | max   | mean    |
| ---------- | ----- | ----- | ------- |
| HR         | 45000 | 47000 | 46000.0 |
| IT         | 70000 | 72000 | 71000.0 |
| Sales      | 50000 | 60000 | 55000.0 |


#### Different functions for different columns:

In [18]:
df.groupby('Department').agg({
    'Salary': ['min', 'mean', 'max'],
    'Experience': 'sum'
    }
)

| Department | Salary (min) | Salary (mean) | Salary (max) | Experience (sum) |
| ---------- | ------------ | ------------- | ------------ | ---------------- |
| HR         | 45000        | 46000.0       | 47000        | 5                |
| IT         | 70000        | 71000.0       | 72000        | 11               |
| Sales      | 50000        | 55000.0       | 60000        | 7                |


You can also use **named aggregation**, where each keyword becomes an output column name and maps to a `(column, function)` pair:

In [13]:
df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    total_exp=('Experience', 'sum')
)

| Department | avg_salary | total_exp |
| ---------- | ---------- | --------- |
| HR         | 46000.0    | 5         |
| IT         | 71000.0    | 11        |
| Sales      | 55000.0    | 7         |
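The `(column, function)` pairs also accept a callable, which is one way to apply a custom function per group. In this sketch `salary_range` is a hypothetical statistic (maximum minus minimum) chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

summary = df.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    # Hypothetical custom statistic: spread between highest and lowest salary
    salary_range=('Salary', lambda s: s.max() - s.min()),
)

print(summary)
```

Built-in names such as `'mean'` are generally faster than Python callables, so a lambda is best kept for statistics that have no built-in equivalent.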


### 3. `.transform()` – Transform but Keep Original Shape

Use `.transform()` when you want to **broadcast group-level values** back to the original rows.

Example: Assign each employee the **average salary of their department**.

In [21]:
df['Average_salary_by_dept'] = df.groupby('Department')['Salary'].transform('mean')
df

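|     | Department | Employee | Salary | Experience | Average_salary_by_dept |
| --- | ---------- | -------- | ------ | ---------- | ---------------------- |
| 0   | Sales      | Alice    | 50000  | 3          | 55000.0                |
| 1   | Sales      | Bob      | 60000  | 4          | 55000.0                |
| 2   | HR         | Charlie  | 45000  | 2          | 46000.0                |
| 3   | HR         | David    | 47000  | 3          | 46000.0                |
| 4   | IT         | Eve      | 70000  | 5          | 71000.0                |
| 5   | IT         | Frank    | 72000  | 6          | 71000.0                |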


This **does not reduce** the number of rows.
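Because the transformed result lines up row by row with the original frame, it can be used directly in arithmetic. A minimal sketch, assuming you want each employee's distance from their department average (the column name `Diff_from_dept_mean` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

# One value per original row, repeated within each department
dept_mean = df.groupby('Department')['Salary'].transform('mean')

# Same length and index as df, so element-wise subtraction just works
df['Diff_from_dept_mean'] = df['Salary'] - dept_mean

print(df)
```

The same pattern covers other per-group normalizations, such as dividing by the group total to get each row's share.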

### 4. `pivot_table()` – Multi-Dimensional Aggregation

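`pivot_table()` builds a spreadsheet-style **summary table**. It is often more convenient than `groupby()` when you want one key on the rows, another on the columns, and an aggregate in each cell.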

In [25]:
df.pivot_table(
    index='Department',
    values='Salary',
    aggfunc='mean'
)

| Department | Salary  |
| ---------- | ------- |
| HR         | 46000.0 |
| IT         | 71000.0 |
| Sales      | 55000.0 |


#### With multiple values and functions:

In [26]:
df.pivot_table(
    index='Department',
    values=['Salary', 'Experience'],
    aggfunc={'Salary': 'mean', 'Experience': 'sum'}
)

| Department | Experience | Salary  |
| ---------- | ---------- | ------- |
| HR         | 5          | 46000.0 |
| IT         | 11         | 71000.0 |
| Sales      | 7          | 55000.0 |


#### Add a `columns` dimension to the pivot:

In [28]:
df['Gender'] = ['F', 'M', 'M', 'M', 'F', 'M']

df

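|     | Department | Employee | Salary | Experience | Average_salary_by_dept | Gender |
| --- | ---------- | -------- | ------ | ---------- | ---------------------- | ------ |
| 0   | Sales      | Alice    | 50000  | 3          | 55000.0                | F      |
| 1   | Sales      | Bob      | 60000  | 4          | 55000.0                | M      |
| 2   | HR         | Charlie  | 45000  | 2          | 46000.0                | M      |
| 3   | HR         | David    | 47000  | 3          | 46000.0                | M      |
| 4   | IT         | Eve      | 70000  | 5          | 71000.0                | F      |
| 5   | IT         | Frank    | 72000  | 6          | 71000.0                | M      |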


In [33]:
df.pivot_table(
    index='Department',
    columns='Gender',
    values='Salary',
    aggfunc='mean',
    fill_value=0
)

| Department / Gender | F       | M       |
| ------------------- | ------- | ------- |
| HR                  | 0.0     | 46000.0 |
| IT                  | 70000.0 | 72000.0 |
| Sales               | 50000.0 | 60000.0 |
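A pivot like this can also be expressed as a `groupby()` followed by `unstack()`, which may help show what `pivot_table()` is doing. The sketch below rebuilds a reduced copy of the data and produces the same table both ways; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

pivoted = df.pivot_table(
    index='Department',
    columns='Gender',
    values='Salary',
    aggfunc='mean',
    fill_value=0,
)

# Group by both keys, then move the Gender level from the rows to the columns
unstacked = (
    df.groupby(['Department', 'Gender'])['Salary']
      .mean()
      .unstack('Gender', fill_value=0)
)

print(pivoted)
print(unstacked)
```

Note that the 0 in the HR / F cell comes from `fill_value=0`: it marks a combination with no rows, not an average salary of zero. Leaving `fill_value` unset keeps that cell as `NaN`, which is harder to misread.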


### ✅ Comparison Table

| Method          | Shape After Grouping | Supports Multiple Aggregations | Keeps Original Shape |
| --------------- | -------------------- | ------------------------------ | -------------------- |
| `groupby()`     | Collapsed            | Yes                            | No                   |
| `agg()`         | Collapsed            | Yes                            | No                   |
| `transform()`   | Same shape           | Yes (per column)               | Yes                  |
| `pivot_table()` | Flexible layout      | Yes                            | No                   |
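The shape differences in the table can be checked directly. This is a small sketch on a reduced copy of the example data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

grouped = df.groupby('Department')['Salary']

# agg collapses to one row per group, one column per function
aggregated = grouped.agg(['min', 'max'])

# transform returns one value per original row
transformed = grouped.transform('mean')

# pivot_table lays out one row per index key and one column per columns key
pivoted = df.pivot_table(index='Department', columns='Gender',
                         values='Salary', aggfunc='mean')

print(len(df), aggregated.shape, transformed.shape, pivoted.shape)
```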

---

### 🔍 Real-Life Use Cases

| Task                                        | Method                                   |
| ------------------------------------------- | ---------------------------------------- |
| Avg salary by department                    | `groupby('Department')['Salary'].mean()` |
| Add avg dept salary column to each employee | `groupby().transform()`                  |
| Create summary table of sales by region     | `pivot_table()`                          |
| Apply custom aggregations                   | `.agg()`                                 |


### 🔸 Bonus: Group Filtering

You can filter groups too!

In [35]:
df.groupby('Department').filter(lambda x: x['Salary'].mean() > 50000)

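|     | Department | Employee | Salary | Experience | Average_salary_by_dept | Gender |
| --- | ---------- | -------- | ------ | ---------- | ---------------------- | ------ |
| 0   | Sales      | Alice    | 50000  | 3          | 55000.0                | F      |
| 1   | Sales      | Bob      | 60000  | 4          | 55000.0                | M      |
| 4   | IT         | Eve      | 70000  | 5          | 71000.0                | F      |
| 5   | IT         | Frank    | 72000  | 6          | 71000.0                | M      |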


This keeps only the rows from departments whose average salary exceeds 50000, so HR (average 46000) is dropped.
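The same selection can be written as a boolean mask built with `transform()`, which avoids calling a Python lambda once per group. A minimal sketch on a reduced copy of the data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

# Row-aligned mask: True where the row's department average exceeds 50000
mask = df.groupby('Department')['Salary'].transform('mean') > 50000
via_mask = df[mask]

# Equivalent selection with filter()
via_filter = df.groupby('Department').filter(lambda g: g['Salary'].mean() > 50000)

print(via_mask)
print(via_filter)
```

On small data the two are interchangeable; on frames with many groups the mask version is often noticeably faster.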

### 🔸 Bonus: Group Ranking

Assign rank within groups:

In [36]:
df['Salary_Rank'] = df.groupby('Department')['Salary'].rank(ascending=False)
df

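|     | Department | Employee | Salary | Experience | Average_salary_by_dept | Gender | Salary_Rank |
| --- | ---------- | -------- | ------ | ---------- | ---------------------- | ------ | ----------- |
| 0   | Sales      | Alice    | 50000  | 3          | 55000.0                | F      | 2.0         |
| 1   | Sales      | Bob      | 60000  | 4          | 55000.0                | M      | 1.0         |
| 2   | HR         | Charlie  | 45000  | 2          | 46000.0                | M      | 2.0         |
| 3   | HR         | David    | 47000  | 3          | 46000.0                | M      | 1.0         |
| 4   | IT         | Eve      | 70000  | 5          | 71000.0                | F      | 2.0         |
| 5   | IT         | Frank    | 72000  | 6          | 71000.0                | M      | 1.0         |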
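The ranks above are floats because the default `method='average'` can produce fractional ranks when values tie. If whole-number ranks are preferred, one option is `method='dense'` followed by an integer cast, as in this sketch (the `top_earners` selection is an illustrative follow-up, not part of the original notebook):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'HR', 'HR', 'IT', 'IT'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 60000, 45000, 47000, 70000, 72000],
})

# Dense ranking: tied salaries share a rank and no rank numbers are skipped
df['Salary_Rank'] = (
    df.groupby('Department')['Salary']
      .rank(method='dense', ascending=False)
      .astype(int)  # safe here because there are no missing salaries
)

# Pick the highest-paid employee(s) in each department
top_earners = df[df['Salary_Rank'] == 1]

print(df)
print(top_earners)
```

The integer cast would fail if `Salary` contained missing values, since their rank is `NaN`; in that case keep the float ranks or handle the gaps first.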



That completes the **Data Exploration & Summary** module in pandas.

---

## ✅ Summary of All Sections

| Section                  | Key Methods                                             |
| ------------------------ | ------------------------------------------------------- |
| Data Overview            | `head`, `shape`, `info`, `describe`, `sample`, `dtypes` |
| Summary Statistics       | `mean`, `median`, `mode`, `std`, `quantile`, `cumsum`   |
| Frequency Analysis       | `value_counts`, `unique`, `nunique`                     |
| Data Type Summary        | `dtypes`, `select_dtypes`, `astype`, `to_numeric`, etc. |
| Correlation & Covariance | `corr`, `cov`, `corrwith`, correlation heatmaps         |
| Missing Data Summary     | `isnull`, `fillna`, `dropna`, `interpolate`, `info`     |
| Group-Based Summary      | `groupby`, `agg`, `transform`, `pivot_table`            |
