# GroupBy & Aggregation

After cleaning and transforming data, we summarize it.

GroupBy allows us to answer questions like:
- Total sales per city
- Average price per category
- Orders per product

This step converts raw data → insights.

In [1]:
import pandas as pd

# Load transformed dataset created in previous notebook
sales = pd.read_csv("../data/processed/transformed_sales.csv")

sales.head()

|   | order_id | customer_id | product | category | price | quantity | city | date | total_amount | price_level | city_code |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | C101 | laptop | Tech | 55000 | 1 | Delhi | 2024-01-05 | 55000 | High | D |
| 1 | 1002 | C102 | phone | Tech | 20000 | 2 | Mumbai | 2024-01-06 | 40000 | Medium | M |
| 2 | 1003 | C103 | shoes | Fashion | 3000 | 1 | Pune | 2024-01-07 | 3000 | Low | P |
| 3 | 1004 | C101 | headphones | Tech | 2000 | 3 | Delhi | 2024-01-07 | 6000 | Low | D |
| 4 | 1005 | C104 | tshirt | Fashion | 800 | 2 | Bangalore | 2024-01-08 | 1600 | Low | B |


## Grouping Data

We group rows that share common values.

In [2]:
# Group data by city
sales.groupby("city")

<pandas.api.typing.DataFrameGroupBy object at 0x000001E42BB8D010>
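The output above is a GroupBy object rather than a table: pandas only records how the rows are split and computes nothing until an aggregation is requested. A minimal sketch of inspecting that split follows. It assumes a small hand-built frame named `sample` containing only the five rows shown by `sales.head()`, so its numbers will not match the full dataset.

```python
import pandas as pd

# Illustrative sample: only the five rows displayed by sales.head() above
sample = pd.DataFrame({
    "order_id": [1001, 1002, 1003, 1004, 1005],
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore"],
    "total_amount": [55000, 40000, 3000, 6000, 1600],
})

grouped = sample.groupby("city")

# Number of distinct groups (one per city)
n_groups = grouped.ngroups

# Iterating yields (group key, sub-DataFrame) pairs
group_sizes = {city: len(rows) for city, rows in grouped}

# Pull out the rows that belong to a single group
delhi_rows = grouped.get_group("Delhi")

print(n_groups)
print(group_sizes)
print(delhi_rows)
```

Looking at individual groups this way is mainly useful for checking that the grouping key splits the data as expected before aggregating.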

## Counting Records per City

In [3]:
# Number of orders per city
sales.groupby("city")["order_id"].count()

city
Bangalore    2
Chennai      1
Delhi        3
Mumbai       2
Pune         2
Name: order_id, dtype: int64
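Note that `count()` tallies the non-null values in the selected column, whereas `size()` counts rows regardless of missing values. When `order_id` has no gaps the two agree. The sketch below uses a hypothetical frame named `demo` (not part of the sales data) with one missing ID to show where they differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second Delhi row has no order_id
demo = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune"],
    "order_id": [1001, np.nan, 1003],
})

# count() ignores the missing order_id
non_null_counts = demo.groupby("city")["order_id"].count()

# size() counts every row in each group
row_counts = demo.groupby("city").size()

print(non_null_counts)
print(row_counts)
```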

## Total Sales Amount per City

In [4]:
# Sum of total_amount for each city
sales.groupby("city")["total_amount"].sum()

city
Bangalore      5100
Chennai        2500
Delhi         64000
Mumbai       100000
Pune          21000
Name: total_amount, dtype: int64

## Average Price per Category

In [6]:
# Mean price within each category
sales.groupby("category")["price"].mean()

category
Accessories     2000.000000
Fashion         2433.333333
Tech           31000.000000
Name: price, dtype: float64

## Multiple Aggregations

In [7]:
# Total, average, and largest order amount per city in one call
sales.groupby("city")["total_amount"].agg(["sum", "mean", "max"])

| city | sum | mean | max |
|---|---|---|---|
| Bangalore | 5100 | 2550.0 | 3500 |
| Chennai | 2500 | 2500.0 | 2500 |
| Delhi | 64000 | 21333.333333 | 55000 |
| Mumbai | 100000 | 50000.0 | 60000 |
| Pune | 21000 | 10500.0 | 18000 |
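Passing a list to `agg` names the result columns after the functions themselves. When more descriptive column names are wanted, pandas also supports named aggregation. A minimal sketch, assuming a frame named `sample` built from only the five rows shown by `sales.head()`; the result names (`total_sales`, `average_order`, `largest_order`) are chosen purely for illustration.

```python
import pandas as pd

# Illustrative sample: only the five rows displayed by sales.head()
sample = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore"],
    "total_amount": [55000, 40000, 3000, 6000, 1600],
})

# Named aggregation: result_column=(source_column, function)
summary = sample.groupby("city").agg(
    total_sales=("total_amount", "sum"),
    average_order=("total_amount", "mean"),
    largest_order=("total_amount", "max"),
)

print(summary)
```

Because the sample holds five rows only, its figures differ from the full-dataset results above.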


## Grouping by Multiple Columns

In [8]:
# Total sales for every (city, category) combination present in the data
sales.groupby(["city", "category"])["total_amount"].sum()

city       category   
Bangalore  Fashion          5100
Chennai    Accessories      2500
Delhi      Accessories      3000
           Tech            61000
Mumbai     Tech           100000
Pune       Fashion          3000
           Tech            18000
Name: total_amount, dtype: int64
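Grouping by two columns returns a Series with a two-level index (city, then category). One way to read it as a city-by-category table is to unstack the inner level. The sketch below assumes a frame named `sample` holding only the five rows shown by `sales.head()`, so the totals differ from the full output above.

```python
import pandas as pd

# Illustrative sample: only the five rows displayed by sales.head()
sample = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore"],
    "category": ["Tech", "Tech", "Fashion", "Tech", "Fashion"],
    "total_amount": [55000, 40000, 3000, 6000, 1600],
})

by_city_category = sample.groupby(["city", "category"])["total_amount"].sum()

# Look up one combination with a (city, category) tuple
delhi_tech = by_city_category.loc[("Delhi", "Tech")]

# Move the category level into columns; missing combinations become 0
wide = by_city_category.unstack(fill_value=0)

print(delhi_tech)
print(wide)
```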

## Reset Index (important)

In [9]:
# Aggregate, then turn the city index back into a regular column
city_sales = sales.groupby("city")["total_amount"].sum().reset_index()
city_sales

|   | city | total_amount |
|---|---|---|
| 0 | Bangalore | 5100 |
| 1 | Chennai | 2500 |
| 2 | Delhi | 64000 |
| 3 | Mumbai | 100000 |
| 4 | Pune | 21000 |
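Resetting the index matters because an aggregation stores the group keys in the index, not in a column. If the result were saved with `to_csv(..., index=False)` without this step, the city names would be dropped from the file. An equivalent alternative is to pass `as_index=False` to `groupby`. A minimal sketch comparing the two, assuming a frame named `sample` built from only the five rows shown by `sales.head()`:

```python
import pandas as pd

# Illustrative sample: only the five rows displayed by sales.head()
sample = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Pune", "Delhi", "Bangalore"],
    "total_amount": [55000, 40000, 3000, 6000, 1600],
})

# Option 1: aggregate, then move the index back into a column
with_reset = sample.groupby("city")["total_amount"].sum().reset_index()

# Option 2: keep the group key as a column from the start
flat = sample.groupby("city", as_index=False)["total_amount"].sum()

print(with_reset)
print(flat)
```

Either form gives a flat table with `city` and `total_amount` columns, which is convenient for merging, plotting, or exporting.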


## Save Aggregated Report

In [10]:
# Write the per-city totals to disk; index=False omits the 0..n row labels
city_sales.to_csv("../data/processed/city_sales_report.csv", index=False)

## Conclusion

GroupBy condenses row-level data into per-group summaries (counts, totals, averages) that answer business questions such as which city sells the most.