# **Exploring Pandas - Part 3**

# **Module 5: Aggregations & Grouping**
* 1. groupby() Basics
* 2. Aggregation Functions
* 3. Multiple Aggregations
* 4. Transformations & Filters



In [None]:
import pandas as pd

mydf = pd.read_csv("sales_dataset.csv")

mydf.head()


## **1. groupby() Basics**
* The groupby() method splits data into groups based on one or more keys (columns).
* After splitting, you can apply an aggregation, a transformation, or a filter to each group.

* **groupby() + aggregation pattern = Split → Apply → Combine**
  * **Split** → divide data by one or more keys
  * **Apply** → apply aggregation function(s)
  * **Combine** → merge results into a DataFrame/Series
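
The three steps can be spelled out by hand to see what groupby() does for you. This is only a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions, not rows from sales_dataset.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for the sales dataset
toy = pd.DataFrame({
    "Region": ["North", "South", "North", "East", "South"],
    "Sales": [100.0, 250.0, 50.0, 75.0, 25.0],
})

# Split: one sub-DataFrame per Region value
pieces = {region: part for region, part in toy.groupby("Region")}

# Apply: reduce each piece to a single number
totals = {region: part["Sales"].sum() for region, part in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(totals, name="Sales").rename_axis("Region")

# The usual one-liner runs the same three steps internally
shortcut = toy.groupby("Region")["Sales"].sum()

print(manual)
print(shortcut)
```

Both printed Series should hold the same totals per Region; the one-liner is what the rest of this module uses.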




**A) Create and Access Group Details**
* groupby() → Creates the group object
* groups → Dictionary of group labels and row indices
* ngroups → Number of groups
* size() → Size of each group


In [None]:
# Create group object
mygroups = mydf.groupby("Region")

# Type of object
print(type(mygroups))

# No. of groups
print(mygroups.ngroups)

# Size of each group
print(mygroups.size())

# See all groups (as a dictionary)
print(mygroups.groups)

# See all group names
print(list(mygroups.groups.keys()))

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
4
Region
East      4
North     4
South    12
West      5
dtype: int64
{'East': [8, 9, 11, 17], 'North': [0, 2, 3, 13], 'South': [1, 4, 5, 7, 14, 15, 16, 18, 19, 20, 22, 23], 'West': [6, 10, 12, 21, 24]}
['East', 'North', 'South', 'West']


**B) Access Required Group**


In [None]:
# Create group object
mygroups = mydf.groupby("Region")

# Get the data for one specific group, e.g. North
north_data = mygroups.get_group("North")

north_data


,OrderID,Region,Category,Product,Quantity,Price,Discount,Sales,OrderDate
0,ORD001,North,Electronics,Fridge,7,1559,0,10913.0,2024-01-15
2,ORD003,North,Electronics,Rice,3,1338,10,3612.6,2024-01-11
3,ORD004,North,Electronics,Mobile,8,1496,15,10172.8,2024-02-09
13,ORD014,North,Home,TV,2,487,0,974.0,2024-02-29


In [None]:
# Get the data for one specific group, e.g. South
south_data = mygroups.get_group("South")

south_data

**C) Iterate Over Groups**


In [None]:
# Create group object
mygroups = mydf.groupby("Region")

for name, mygroup in mygroups:
    print(f"Region: {name}")
    print(mygroup.iloc[:,0:6], "\n")

Region: East
   OrderID Region     Category Product  Quantity  Price
8   ORD009   East     Clothing    Sofa         6    352
9   ORD010   East  Electronics  Mobile         1    574
11  ORD012   East         Home  Laptop         9   1254
17  ORD018   East      Grocery      TV         3    305 

Region: North
   OrderID Region     Category Product  Quantity  Price
0   ORD001  North  Electronics  Fridge         7   1559
2   ORD003  North  Electronics    Rice         3   1338
3   ORD004  North  Electronics  Mobile         8   1496
13  ORD014  North         Home      TV         2    487 

Region: South
   OrderID Region     Category Product  Quantity  Price
1   ORD002  South     Clothing   Shirt         8   1824
4   ORD005  South     Clothing  Laptop         8    230
5   ORD006  South      Grocery    Sofa         2    443
7   ORD008  South      Grocery      TV         5   1284
14  ORD015  South  Electronics  Laptop         2    876
15  ORD016  South         Home  Mobile         2   1463
16 

## **2. Aggregation Functions**



**A. Built-in Aggregations**
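* Pandas has many built-in functions for grouped data. Most of them reduce each group to a single value; the cumulative ones (cumsum(), cumprod(), cummin(), cummax()) return one value per original row instead.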
  * sum() – Sum of values
  * min() – Minimum value
  * max() – Maximum value
  * count() – Count of non-NA values
  * nunique() – Number of unique values
  * mean() – Arithmetic average
  * median() – Median (50th percentile)
  * mode() – Most frequent value(s); this is a Series method, so apply it per group through agg() or apply()
  * std() – Standard deviation
  * var() – Variance
  * skew() – Skewness of distribution
  * first() – First value in group
  * last() – Last value in group
  * nth(n) – n-th value in group (n is zero-based)
  * idxmin() – Index of minimum value
  * idxmax() – Index of maximum value
  * cumsum() – Cumulative sum
  * cumprod() – Cumulative product
  * cummin() – Cumulative minimum
  * cummax() – Cumulative maximum
  * quantile(q) – Value at quantile q (e.g., 0.25 = Q1)
  * describe() – Summary stats (count, mean, std, min, quartiles, max)
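
The functions in this list do not all return the same shape. As a rough sketch on made-up data (the `toy` frame is an assumption for illustration), a reduction such as sum() yields one value per group, while a cumulative function such as cumsum() yields one value per original row:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Sales": [10, 20, 30, 40],
})
grouped = toy.groupby("Region")["Sales"]

per_group = grouped.sum()    # reduction: indexed by Region, one row per group
running = grouped.cumsum()   # cumulative: keeps the original row index

print(per_group)
print(running)
```

Because the cumulative result keeps the original index, it can be assigned straight back as a new column of the source DataFrame.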

### **1. Total Sum with sum()**
**Use Cases:**
* Find Total sales per Region
* Find Total sales per Category




In [None]:
# Find Total sales per Region

mydf.groupby("Region")["Sales"].sum()

Region,Sales
East,14512.8
North,25672.4
South,78753.85
West,11238.7


In [None]:
# Find Total sales per Category

mydf.groupby("Category")["Sales"].sum()

Category,Sales
Clothing,26775.2
Electronics,57934.95
Grocery,7697.0
Home,37770.6


### **2. Minimum with min()**
**Use Case:**
* Find lowest sales per Region.
* Find lowest sales per Category

In [None]:
# Minimum sales per Region
mydf.groupby("Region")["Sales"].min()

Region,Sales
East,516.6
North,974.0
South,753.1
West,571.9


In [None]:

# Minimum sales per Category
mydf.groupby("Category")["Sales"].min()

Category,Sales
Clothing,1748.0
Electronics,516.6
Grocery,571.9
Home,974.0


### **3. Maximum with max()**
**Use Case:**
* Find highest sales per Region
* Find highest sales per Category

In [None]:
# Maximum sales per Region
mydf.groupby("Region")["Sales"].max()


Region,Sales
East,11286.0
North,10913.0
South,14592.0
West,5397.3


In [None]:

# Maximum sales per Category
mydf.groupby("Category")["Sales"].max()

Category,Sales
Clothing,14592.0
Electronics,13961.25
Grocery,5457.0
Home,11412.0


### **4. Count with count()**
**Use Case:**
* Count orders per Region.
* Count orders per Category

In [None]:
# Order count per Region.
mydf.groupby("Region")["OrderID"].count()

Region,OrderID
East,4
North,4
South,12
West,5


In [None]:
# Order count per Category.
mydf.groupby("Category")["OrderID"].count()

Category,OrderID
Clothing,4
Electronics,9
Grocery,4
Home,8


### **5. Unique Counts with nunique()**
**Use Case:**
* How many different products are sold in each Region?
* How many different products are sold in each Category?

In [None]:
# Unique products per Region
mydf.groupby("Region")["Product"].nunique()

Region,Product
East,4
North,4
South,6
West,5


In [None]:
# Unique products per Category
mydf.groupby("Category")["Product"].nunique()

Category,Product
Clothing,3
Electronics,5
Grocery,3
Home,3


### **6. Average with mean()**
**Use Case:**
* Find average sales per Region
* Find average sales per Category

In [None]:
# Average sales per Region
(mydf.groupby("Region")["Sales"].mean()).round(2)


Region,Sales
East,3628.2
North,6418.1
South,6562.82
West,2247.74


In [None]:
# Average sales per Category
mydf.groupby("Category")["Sales"].mean()

Category,Sales
Clothing,6693.8
Electronics,6437.216667
Grocery,1924.25
Home,4721.325


### **7. Median with median()**
**Use Case:**
* Find median sales per Region.
* Find median sales per Category

In [None]:
# Median sales per Region
mydf.groupby("Region")["Sales"].median()


Region,Sales
East,1355.1
North,6892.7
South,4623.75
West,1956.0


In [None]:
# Median sales per Category
mydf.groupby("Category")["Sales"].median()

Category,Sales
Clothing,5217.6
Electronics,5397.3
Grocery,834.05
Home,2939.3


### **8. Most Frequent Value with mode()**
**Use Case:**
* Find the most frequently ordered product per Region
* Find the most frequently ordered product per Category

In [None]:
# Most frequent product per Region (ties return every tied product)
mydf.groupby("Region")["Product"].agg(lambda x: x.mode())


Region,Product
East,"[Laptop, Mobile, Sofa, TV]"
North,"[Fridge, Mobile, Rice, TV]"
South,"[Laptop, TV]"
West,"[Fridge, Laptop, Mobile, Rice, TV]"


In [None]:
# Most frequent product per Category (keeps only the first value when there is a tie)
mydf.groupby("Category")["Product"].agg(lambda x: x.mode()[0])


In [None]:
mydata = {
    "Category":["Books","Books","Books","Books","Books","Books","Books","Pen","Pen","Pen","Pen","Pen","Pen"],
    "Sales":[100,100,90,80,80,80,70,50,50,40,30,30,20]
  }

df = pd.DataFrame(mydata)
# print(df)

# Mean (sum of all values / number of values)

mean1 = df["Sales"].mean()
print(mean1)

mean2 = df.groupby("Category")["Sales"].mean()
print(mean2)

print("-"*25)
# Median (50th percentile / middle value)

median1 = df["Sales"].median()
print(median1)

median2 = df.groupby("Category")["Sales"].median()
print(median2)

print("-"*25)
# Mode (most frequent value; there can be more than one)

mode1 = df["Sales"].mode()
print(mode1)

mode2 = df.groupby("Category")["Sales"].apply(lambda x: x.mode())
print(mode2)



63.07692307692308
Category
Books    85.714286
Pen      36.666667
Name: Sales, dtype: float64
-------------------------
70.0
Category
Books    80.0
Pen      35.0
Name: Sales, dtype: float64
-------------------------
0    80
Name: Sales, dtype: int64
Category   
Books     0    80
Pen       0    30
          1    50
Name: Sales, dtype: int64
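
The Pen group above has two modes (30 and 50), which is why the result gains an extra index level. One possible way to keep a single row per group is sketched here on made-up data (the `toy` frame is hypothetical): either collect all modes into a list, or pick one representative value.

```python
import pandas as pd

# Hypothetical toy data with a tie in the Pen group
toy = pd.DataFrame({
    "Category": ["Books", "Books", "Books", "Pen", "Pen", "Pen", "Pen"],
    "Sales": [80, 80, 70, 50, 50, 30, 30],
})

# All modes per group, collected into a list so each group yields one cell
all_modes = toy.groupby("Category")["Sales"].apply(lambda s: s.mode().tolist())

# One mode per group: Series.mode() returns sorted values,
# so iloc[0] picks the smallest of the tied values
first_mode = toy.groupby("Category")["Sales"].agg(lambda s: s.mode().iloc[0])

print(all_modes)
print(first_mode)
```

Which variant fits depends on whether ties matter for the analysis; picking the first value silently hides them.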


### **9. Standard Deviation with std()**
**Use Case:**
* Find sales volatility per region.
* Identify Category with high sales fluctuations.





In [None]:
# Sales standard deviation
mydf.groupby("Region")["Sales"].std()

Region,Sales
East,5133.07246
North,4892.543754
South,5063.065187
West,1926.524047


In [None]:

# Sales standard deviation
mydf.groupby("Category")["Sales"].std()

Category,Sales
Clothing,6181.328802
Electronics,5103.368773
Grocery,2359.332601
Home,4178.668707


### **10. Variance with var()**
**Use Case:**
* Find sales variance per Region.
* Understand spread of sales data per Category

In [None]:
# Sales variance per Region.
mydf.groupby("Region")["Sales"].var()

Region,Sales
East,26348430.0
North,23936980.0
South,25634630.0
West,3711495.0


In [None]:
# Sales variance per Category
mydf.groupby("Category")["Sales"].var()

Category,Sales
Clothing,38208830.0
Electronics,26044370.0
Grocery,5566450.0
Home,17461270.0


### **11. Skewness with skew()**
**Use Case:**
* Find skewness in sales distribution.
* Check if sales distribution is skewed.





In [None]:
# Sales skewness
mydf.groupby("Region")["Sales"].skew()

Region,Sales
East,1.936237
North,-0.225917
South,0.485194
West,1.382998


In [None]:
# Sales skewness
mydf.groupby("Category")["Sales"].skew()

Category,Sales
Clothing,0.738265
Electronics,0.167824
Grocery,1.978829
Home,1.26082


### **12. First Value with first()**
**Use Case:**
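* Find the first order (in row order) per Region
* Find the first order (in row order) per Category; this is the earliest order only if the data is sorted by OrderDate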

In [None]:
# First order date per Region
mydf.groupby("Region")["OrderDate"].first()


Region,OrderDate
East,2024-01-25
North,2024-01-15
South,2024-02-08
West,2024-02-29


In [None]:
# First order date per Category
mydf.groupby("Category")["OrderDate"].first()


Category,OrderDate
Clothing,2024-02-08
Electronics,2024-01-15
Grocery,2024-01-30
Home,2024-02-29


### **13. Last Value with last()**
**Use Case:**
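* Find the last order (in row order) per Region
* Find the last order (in row order) per Category; this is the most recent order only if the data is sorted by OrderDate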

In [None]:
# Last order date per Region
mydf.groupby("Region")["OrderDate"].last()

Region,OrderDate
East,2024-02-05
North,2024-02-29
South,2024-01-21
West,2024-01-24


In [None]:
# Last order date per Category
mydf.groupby("Category")["OrderDate"].last()

Category,OrderDate
Clothing,2024-01-18
Electronics,2024-01-21
Grocery,2024-02-05
Home,2024-01-24
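
Note that first() and last() follow the row order of the DataFrame, not the calendar. In the outputs above, South's "first" date (2024-02-08) is later than its "last" date (2024-01-21). A minimal sketch, on made-up data with real datetime values (the `toy` frame is an assumption), of how to get the earliest and latest dates instead:

```python
import pandas as pd

# Hypothetical toy data, deliberately not sorted by date
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "OrderDate": pd.to_datetime(
        ["2024-01-15", "2024-01-11", "2024-02-08", "2024-01-30"]
    ),
})

by_position = toy.groupby("Region")["OrderDate"].first()  # first row as stored
earliest = toy.groupby("Region")["OrderDate"].min()       # earliest calendar date
latest = toy.groupby("Region")["OrderDate"].max()         # latest calendar date

# Sorting by date first makes first() agree with min()
sorted_first = toy.sort_values("OrderDate").groupby("Region")["OrderDate"].first()

print(by_position)
print(earliest)
print(latest)
```

Using min() and max() is usually the simpler choice when only the dates are needed; sorting first is useful when other columns of the earliest or latest row are wanted too.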


### **14. nth Value with nth()**
**Use Case:**
* Find second order per Region
* Get a specific nth order date.

In [None]:
# Second order per Region
mydf.groupby("Region")["OrderDate"].nth(1)

,OrderDate
2,2024-01-11
4,2024-02-22
9,2024-02-11
10,2024-02-21


In [None]:

# Third order per Category
mydf.groupby("Category")["OrderDate"].nth(2)

,OrderDate
3,2024-02-09
8,2024-01-25
13,2024-02-29
17,2024-02-05


### **15. Index of Min with idxmin()**
**Use Case:**
* Find index of minimum sales per region.
* Find row with lowest sales in each Category


In [None]:
# Index of min sales
mydf.groupby("Region")["Sales"].idxmin()

Region,Sales
East,9
North,13
South,5
West,21


In [None]:
mydf.groupby("Region").get_group("East")

,OrderID,Region,Category,Product,Quantity,Price,Discount,Sales,OrderDate
8,ORD009,East,Clothing,Sofa,6,352,15,1795.2,2024-01-25
9,ORD010,East,Electronics,Mobile,1,574,10,516.6,2024-02-11
11,ORD012,East,Home,Laptop,9,1254,0,11286.0,2024-02-20
17,ORD018,East,Grocery,TV,3,305,0,915.0,2024-02-05


In [None]:
# Index of min sales
mydf.groupby("Category")["Sales"].idxmin()

Category,Sales
Clothing,4
Electronics,9
Grocery,21
Home,13
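
idxmin() only returns index labels. To see the full rows with the lowest sales, those labels can be passed to loc. A small sketch on made-up data (the `toy` frame and its values are assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Category": ["Home", "Home", "Grocery", "Grocery"],
    "Product": ["TV", "Sofa", "Rice", "Tea"],
    "Sales": [974.0, 2473.5, 571.9, 834.0],
})

# Index label of the lowest-sales row in each Category
low_idx = toy.groupby("Category")["Sales"].idxmin()

# Look up the complete rows behind those labels
lowest_rows = toy.loc[low_idx]

print(lowest_rows)
```

The same pattern works with idxmax() to retrieve the highest-sales row per group.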


### **16. Index of Max with idxmax()**
**Use Case:**
* Find index of maximum sales per region.
* Find row with highest sales in each Category

In [None]:
# Index of max sales
mydf.groupby("Region")["Sales"].idxmax()

Region,Sales
East,11
North,0
South,1
West,10


In [None]:
# Index of max sales
mydf.groupby("Category")["Sales"].idxmax()

### **17. Cumulative Sum with cumsum()**
**Use Case:**
* Running total sales per Region.
* Track cumulative sales per Category

**Key Notes:**
* cumsum() → Running totals
* Importance: Tracks progressive totals.
* Useful for: Monitoring revenue growth, cumulative sales, customer counts over time.

In [None]:
mydf

In [None]:
# Cumulative sales per Region
mydf["Cum_Sales"] = mydf.groupby("Region")["Sales"].cumsum()

mydf.groupby("Region").get_group("North")

,OrderID,Region,Category,Product,Quantity,Price,Discount,Sales,OrderDate,Cum_Sales
0,ORD001,North,Electronics,Fridge,7,1559,0,10913.0,2024-01-15,10913.0
2,ORD003,North,Electronics,Rice,3,1338,10,3612.6,2024-01-11,14525.6
3,ORD004,North,Electronics,Mobile,8,1496,15,10172.8,2024-02-09,24698.4
13,ORD014,North,Home,TV,2,487,0,974.0,2024-02-29,25672.4


In [None]:
# Cumulative sales per Category
mydf["Cat_Cum_Sales"] = mydf.groupby("Category")["Sales"].cumsum()

mydf.groupby("Category").get_group("Electronics")
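
cumsum() accumulates in the order the rows appear, so the North output above adds the 2024-01-15 order before the 2024-01-11 one. If the running total is meant to follow time, one option is to sort first. A sketch on made-up data (the `toy` frame and the `Running_Sales` column name are assumptions):

```python
import pandas as pd

# Hypothetical toy data, not in date order
toy = pd.DataFrame({
    "Region": ["North", "North", "North"],
    "OrderDate": pd.to_datetime(["2024-01-15", "2024-01-11", "2024-02-09"]),
    "Sales": [100.0, 40.0, 60.0],
})

# Sort by group and date, then accumulate within each Region
ordered = toy.sort_values(["Region", "OrderDate"])
ordered["Running_Sales"] = ordered.groupby("Region")["Sales"].cumsum()

print(ordered)
```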

### **18. Cumulative Product with cumprod()**
**Use Case:**
* Running product of sales quantities per Region
* Running product of sales quantities per Category

**Key Notes:**
* cumprod() → Compounding growth
* Importance: Captures multiplicative effects.
* Useful for: Investment returns, interest compounding, growth rates.
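
The compounding use mentioned in the notes can be sketched as follows. The data is invented for illustration (`Fund`, `Return`, and `Growth` are hypothetical names): each period's return is turned into a growth factor, and cumprod() multiplies the factors within each group.

```python
import pandas as pd

# Hypothetical periodic returns for two funds
returns = pd.DataFrame({
    "Fund": ["A", "A", "A", "B", "B"],
    "Return": [0.10, -0.05, 0.02, 0.01, 0.03],
})

# Growth of 1 unit invested: running product of (1 + return) per fund
returns["Growth"] = (1 + returns["Return"]).groupby(returns["Fund"]).cumprod()

print(returns)
```

For fund A the final factor works out to 1.10 × 0.95 × 1.02 ≈ 1.0659, i.e. roughly a 6.6% cumulative gain.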

In [None]:
# Cumulative product of quantities
mydf.groupby("Region")["Quantity"].cumprod()

,Quantity
0,7
1,8
2,21
3,168
4,64
5,128
6,6
7,640
8,6
9,6


In [None]:
# Cumulative product of quantities
mydf.groupby("Category")["Quantity"].cumprod()

### **19. Cumulative Max with cummax()**
**Use Case:**
* Running maximum sales per region.
* Track peak sales achieved per Category

**Key Notes:**
* cummax() → All-time highs
* Importance: Shows record-breaking peaks.
* Useful for: Sales performance tracking, stock price highs

In [None]:
# Cumulative max sales
mydf.groupby("Region")["Sales"].cummax()


In [None]:
# Cumulative max sales
mydf.groupby("Category")["Sales"].cummax()

### **20. Cumulative Min with cummin()**
**Use Case:**
* Running minimum sales per region.
* Track lowest sales recorded per Category

**Key Notes:**
* cummin() → All-time lows
* Importance: Identifies lowest points in history.
* Useful for: Risk analysis, performance dips, worst-case monitoring.

In [None]:
# Cumulative min sales
mydf.groupby("Region")["Sales"].cummin()

In [None]:
# Cumulative min sales
mydf.groupby("Category")["Sales"].cummin()

### **21. Quantile with quantile(q)**
**Use Case:**
* Identify high-value orders cutoff per region.
* Identify low-value orders cutoff per Category

**Key Notes**
* quantile(0.5) = Median
* quantile(0.25) = 25th percentile (low-value cutoff)
* quantile(0.75) = 75th percentile (high-value cutoff)
* Very useful in outlier detection, customer segmentation, and defining thresholds.
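
As one illustration of the outlier-detection use, the quartiles can feed the common 1.5 × IQR rule, applied separately within each group. This is a sketch on made-up data; the `toy` frame, the `Is_Outlier` column, and the 1.5 multiplier are assumptions, not something prescribed by the dataset:

```python
import pandas as pd

# Hypothetical toy data: East contains one unusually large order
toy = pd.DataFrame({
    "Region": ["East"] * 5 + ["West"] * 5,
    "Sales": [10.0, 12.0, 11.0, 13.0, 100.0,
              500.0, 520.0, 510.0, 530.0, 505.0],
})
grouped = toy.groupby("Region")["Sales"]

# Broadcast each group's quartiles back onto its rows
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Flag values outside the per-group fences
toy["Is_Outlier"] = (toy["Sales"] < q1 - 1.5 * iqr) | (toy["Sales"] > q3 + 1.5 * iqr)

print(toy)
```

Computing the fences per group matters here: West's orders are all far larger than East's, yet none of them is unusual within West.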

In [None]:
# 75th percentile sales
mydf.groupby("Region")["Sales"].quantile(0.75)

Region,Sales
East,4167.9
North,10357.85
South,10995.75
West,2473.5


In [None]:
# 25th percentile sales
mydf.groupby("Region")["Sales"].quantile(0.25)

Region,Sales
East,815.4
North,2952.95
South,2302.325
West,840.0


In [None]:

# 50th percentile sales(median)
mydf.groupby("Region")["Sales"].quantile(0.50)

Region,Sales
East,1355.1
North,6892.7
South,4623.75
West,1956.0


In [None]:
mydf.groupby("Region")["Sales"].median()

Region,Sales
East,1355.1
North,6892.7
South,4623.75
West,1956.0


### **22. Descriptive Stats with describe()**
**Use Case:**
* Full statistical summary per region.
* Quick distribution stats for sales.

In [None]:
mydf.describe()

,Quantity,Price,Discount,Sales,Cum_Sales,Cat_Cum_Sales
count,25.0,25.0,25.0,25.0,25.0,25.0
mean,4.96,1063.56,7.6,5207.11,23570.846,21474.266
std,2.7,618.862268,6.311365,4696.478033,19778.98808,13653.587217
min,1.0,120.0,0.0,516.6,1795.2,753.1
25%,2.0,485.0,0.0,1664.4,10913.0,13759.5
50%,6.0,1254.0,10.0,3391.5,16340.0,18135.2
75%,7.0,1559.0,15.0,10172.8,26701.6,30612.3
max,9.0,1999.0,15.0,14592.0,78753.85,57934.95


In [None]:
# Descriptive stats
mydf.groupby("Region")["Sales"].describe()

Region,count,mean,std,min,25%,50%,75%,max
East,4.0,3628.2,5133.07246,516.6,815.4,1355.1,4167.9,11286.0
North,4.0,6418.1,4892.543754,974.0,2952.95,6892.7,10357.85,10913.0
South,12.0,6562.820833,5063.065187,753.1,2302.325,4623.75,10995.75,14592.0
West,5.0,2247.74,1926.524047,571.9,840.0,1956.0,2473.5,5397.3


In [None]:
# Descriptive stats
mydf.groupby("Category")["Sales"].describe()

Category,count,mean,std,min,25%,50%,75%,max
Clothing,4.0,6693.8,6181.328802,1748.0,1783.4,5217.6,10128.0,14592.0
Electronics,9.0,6437.216667,5103.368773,516.6,1664.4,5397.3,10857.0,13961.25
Grocery,4.0,1924.25,2359.332601,571.9,707.8,834.05,2050.5,5457.0
Home,8.0,4721.325,4178.668707,974.0,2344.125,2939.3,5664.375,11412.0


## **3. Multiple Aggregations**
* Using agg() for multiple aggregations on the same column
* Applying different aggregations on different columns
* Renaming results for better readability
* Custom aggregation using lambda
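
For the last point, a lambda can sit in a named aggregation next to the built-in function names. A minimal sketch on made-up data (the `toy` frame and the `Sales_Range` column name are assumptions):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Sales": [100.0, 300.0, 50.0, 80.0],
})

summary = toy.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),                        # built-in, by name
    Sales_Range=("Sales", lambda s: s.max() - s.min()),  # custom: max minus min
)

print(summary)
```

Built-in names such as "sum" are generally faster than lambdas, so a custom function is best kept for logic that has no built-in equivalent.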




## **A) GroupBy on a Single Column**

### **1. Average Sales, Price & Quantity per Region**
**Use Case:**
* Compute the average Sales, Price, and Quantity per Region.

In [None]:
# Average Sales, Price & Quantity per Region
mydf.groupby("Region")[["Sales", "Price", "Quantity"]].mean()


Region,Sales,Price,Quantity
East,3628.2,621.25,4.75
North,6418.1,1220.0,5.0
South,6562.820833,1278.416667,5.416667
West,2247.74,776.6,4.0


In [None]:
# Using aggregation with custom column names
mydf.groupby("Region").agg(
    Avg_Sales=("Sales", "mean"),
    Avg_Price=("Price", "mean"),
    Avg_Quantity=("Quantity", "mean")
)

Region,Avg_Sales,Avg_Price,Avg_Quantity
East,3628.2,621.25,4.75
North,6418.1,1220.0,5.0
South,6562.820833,1278.416667,5.416667
West,2247.74,776.6,4.0


### **2. Average Sales, Price & Quantity per Category**
**Use Case:**
* Compute the average Sales, Price, and Quantity per Category.

In [None]:
# Average Sales, Price & Quantity per Category
mydf.groupby("Category")[["Sales", "Price", "Quantity"]].mean()


Category,Sales,Price,Quantity
Clothing,6693.8,1001.5,7.0
Electronics,6437.216667,1259.777778,5.222222
Grocery,1924.25,583.25,3.0
Home,4721.325,1114.0,4.625


In [None]:
# Using aggregation with custom column names
mydf.groupby("Category").agg(
    Avg_Sales=("Sales", "mean"),
    Avg_Price=("Price", "mean"),
    Avg_Quantity=("Quantity", "mean")
)

Category,Avg_Sales,Avg_Price,Avg_Quantity
Clothing,6693.8,1001.5,7.0
Electronics,6437.216667,1259.777778,5.222222
Grocery,1924.25,583.25,3.0
Home,4721.325,1114.0,4.625


### **3. Total Sales & Order Count per Region**
**Use Case:**
* Calculate the total Sales and number of Orders for each Region.

In [None]:
# Total Sales & Order Count per Region
mydf.groupby("Region")["Sales"].agg(["sum", "count"])

Region,sum,count
East,14512.8,4
North,25672.4,4
South,78753.85,12
West,11238.7,5


In [None]:
# Using aggregation with custom column names
mydf.groupby("Region")["Sales"].agg(
    Total_Sales="sum",
    Order_Count="count",
    Sales_max="max"
)

Region,Total_Sales,Order_Count,Sales_max
East,14512.8,4,11286.0
North,25672.4,4,10913.0
South,78753.85,12,14592.0
West,11238.7,5,5397.3


In [None]:

# Using aggregation with custom column names
mydf.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),
    Order_Count=("Sales", "count"),
    Sales_max=("Sales", "max")
)

Region,Total_Sales,Order_Count,Sales_max
East,14512.8,4,11286.0
North,25672.4,4,10913.0
South,78753.85,12,14592.0
West,11238.7,5,5397.3


### **4. First & Last Order Dates per Region**
**Use Case:**
* Get the first and last Order Dates for each Region.


In [None]:
# First & Last Order Dates per Region
mydf.groupby("Region")["OrderDate"].agg(["first", "last"])

In [None]:
# Using aggregation with custom column names
mydf.groupby("Region")["OrderDate"].agg(
    First_Order="first",
    Last_Order="last"
)

Region,First_Order,Last_Order
East,2024-01-25,2024-02-05
North,2024-01-15,2024-02-29
South,2024-02-08,2024-01-21
West,2024-02-29,2024-01-24


### **5. Min & Max Price per Product**
**Use Case:**
* Find the minimum and maximum Price for each Product.
* Identify price extremes (lowest and highest values).

In [None]:
# Min & Max Price per Product
mydf.groupby("Product")["Price"].agg(["min", "max"])

Product,min,max
Fridge,301,1559
Laptop,230,1585
Mobile,574,1496
Rice,120,1551
Shirt,1600,1824
Sofa,352,443
TV,305,1999


In [None]:

# Using aggregation with custom column names
mydf.groupby("Product")["Price"].agg(
    Min_Price="min",
    Max_Price="max"
)

Product,Min_Price,Max_Price
Fridge,301,1559
Laptop,230,1585
Mobile,574,1496
Rice,120,1551
Shirt,1600,1824
Sofa,352,443
TV,305,1999


### **6. Total Sales, Quantity & Average Price per Region**
**Use Case:**
* Combine total revenue, total units, and avg price per Region.
* Useful for regional performance dashboards.


In [None]:

# Total Sales, Quantity & Average Price per Region
mydf.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),
    Total_Quantity=("Quantity", "sum"),
    Avg_Price=("Price", "mean")
)

Region,Total_Sales,Total_Quantity,Avg_Price
East,14512.8,19,621.25
North,25672.4,20,1220.0
South,78753.85,65,1278.416667
West,11238.7,20,776.6


### **7. Multiple Metrics on Sales per Region**
**Use Cases:**
* Calculate total, median, first, last, and count of Sales for each Region.

In [None]:
# Multiple metrics on Sales per Region
mydf.groupby("Region")["Sales"].agg(
    Total_Sales="sum",
    Median_Sales="median",
    First_Sale="first",
    Last_Sale="last",
    Order_Count="count"
)

Region,Total_Sales,Median_Sales,First_Sale,Last_Sale,Order_Count
East,14512.8,1355.1,1795.2,915.0,4
North,25672.4,6892.7,10913.0,974.0,4
South,78753.85,4623.75,14592.0,13961.25,12
West,11238.7,1956.0,2473.5,1956.0,5


### **8. Multiple Metrics on Sales, Quantity & Price per Region**
**Use Cases:**
* Calculate total, average, and count of Sales, along with min/max Quantity and min/max Price for each Region.

In [None]:
# Multiple metrics on Sales, Quantity & Price per Region
mydf.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),       # Total sales
    Avg_Sales=("Sales", "mean"),        # Average sales
    Sales_Count=("Sales", "count"),     # Count of sales
    Max_Quantity=("Quantity", "max"),   # Max quantity ordered
    Min_Quantity=("Quantity", "min"),   # Min quantity ordered
    Min_Price=("Price", "min"),         # Minimum price
    Max_Price=("Price", "max")          # Maximum price
)

## **B) GroupBy on Multiple Columns (Region–Category)**

### **9. Min & Max Sales per Region–Category pair**
**Use Case:**
* Find the minimum and maximum Sales for each Region–Category combination.
* Identify sales extremes (lowest and highest values).

In [None]:
# Min & Max Sales per Region–Category pair
mydf.groupby(["Region", "Category"])["Sales"].agg(["min", "max"])


In [None]:
# Using aggregation with custom column names
mydf.groupby(["Region", "Category"])["Sales"].agg(
    Min_Sales="min",
    Max_Sales="max"
)

### **10. Average Sales & Quantity per Region–Category pair**
**Use Case:**
* Calculate the average Sales and Quantity for each Region–Category combination.
* Measure the typical order value and average units sold.


In [None]:
# Average Sales & Quantity per Region–Category pair
mydf.groupby(["Region", "Category"])[["Sales", "Quantity"]].mean()

In [None]:

# Using aggregation with custom column names
mydf.groupby(["Region", "Category"]).agg(
    Avg_Sales=("Sales", "mean"),
    Avg_Quantity=("Quantity", "mean")
)

### **11. Total Sales & Quantity per Region–Category pair**
**Use Case:**
* Calculate the total Sales and total Quantity for each Region–Category combination.
* Measure total revenue and total items sold.

In [None]:
# Total Sales & Quantity per Region–Category pair
mydf.groupby(["Region", "Category"])[["Sales", "Quantity"]].sum()


In [None]:

# Using aggregation with custom column names
mydf.groupby(["Region", "Category"]).agg(
    Total_Sales=("Sales", "sum"),
    Total_Quantity=("Quantity", "sum")
)

### **12. Min & Max Quantity per Region–Category pair**
**Use Case:**
* Get the minimum and maximum Quantity sold for each Region–Category combination.
* Spot regions/categories with extreme demand.


In [None]:
# Min & Max Quantity per Region–Category pair
mydf.groupby(["Region", "Category"])["Quantity"].agg(["min", "max"])


In [None]:
# Using custom names
mydf.groupby(["Region", "Category"])["Quantity"].agg(
    Min_Quantity="min",
    Max_Quantity="max"
)

### **13. Total & Average Sales & Quantity per Region–Category pair**
**Use Cases:**
* Calculate total and average for Sales and Quantity in one go.

In [None]:
# Combined Total & Average Sales & Quantity per Region–Category pair
mydf.groupby(["Region", "Category"]).agg(
    Total_Sales=("Sales", "sum"),
    Avg_Sales=("Sales", "mean"),
    Total_Quantity=("Quantity", "sum"),
    Avg_Quantity=("Quantity", "mean")
)

## **C) Sorting the Results after groupBy**

In [None]:
# Multiple metrics on Sales, Quantity & Price per Region
multi_metrics = mydf.groupby("Region").agg(
    Total_Sales=("Sales", "sum"),       # Total sales
    Avg_Sales=("Sales", "mean"),        # Average sales
    Sales_Count=("Sales", "count"),     # Count of sales
    Max_Quantity=("Quantity", "max"),   # Max quantity ordered
    Min_Quantity=("Quantity", "min"),   # Min quantity ordered
    Min_Price=("Price", "min"),         # Minimum price
    Max_Price=("Price", "max")          # Maximum price
)

multi_metrics

Region,Total_Sales,Avg_Sales,Sales_Count,Max_Quantity,Min_Quantity,Min_Price,Max_Price
East,14512.8,3628.2,4,9,1,305,1254
North,25672.4,6418.1,4,8,2,487,1559
South,78753.85,6562.820833,12,9,2,230,1995
West,11238.7,2247.74,5,7,2,120,1999


### **14. Sort by Total Sales (Descending)**
**Use Cases:**
* Sort the summary table by Total Sales in descending order.
* Find the highest revenue regions at the top of the table.

In [None]:
# Sort by Total Sales (Descending)
multi_metrics.sort_values(by="Total_Sales", ascending=False)

Region,Total_Sales,Avg_Sales,Sales_Count,Max_Quantity,Min_Quantity,Min_Price,Max_Price
South,78753.85,6562.820833,12,9,2,230,1995
North,25672.4,6418.1,4,8,2,487,1559
East,14512.8,3628.2,4,9,1,305,1254
West,11238.7,2247.74,5,7,2,120,1999


### **15. Sort by Average Sales (Ascending)**
**Use Cases:**
* Sort the summary table by Average Sales in ascending order.
* Identify regions with the lowest average sales first.

In [None]:
# Sort by Average Sales (Ascending)
multi_metrics.sort_values(by="Avg_Sales", ascending=True)

Region,Total_Sales,Avg_Sales,Sales_Count,Max_Quantity,Min_Quantity,Min_Price,Max_Price
West,11238.7,2247.74,5,7,2,120,1999
East,14512.8,3628.2,4,9,1,305,1254
North,25672.4,6418.1,4,8,2,487,1559
South,78753.85,6562.820833,12,9,2,230,1995


### **16. Sort by Multiple Columns**
**Use Cases:**
* Sort first by Total Sales (descending), then by Average Sales (descending).


In [None]:
# Sort by Multiple Columns
multi_metrics.sort_values(
    by=["Total_Sales", "Avg_Sales"],
    ascending=[False, False]
)

Region,Total_Sales,Avg_Sales,Sales_Count,Max_Quantity,Min_Quantity,Min_Price,Max_Price
South,78753.85,6562.820833,12,9,2,230,1995
North,25672.4,6418.1,4,8,2,487,1559
East,14512.8,3628.2,4,9,1,305,1254
West,11238.7,2247.74,5,7,2,120,1999


### **17. Sort by Index (Region Name)**
**Use Cases:**
* Sort the summary table by Region Name in alphabetical order.


In [None]:
# Sort by Index (Region Name)
multi_metrics.sort_index(ascending=True)

Region,Total_Sales,Avg_Sales,Sales_Count,Max_Quantity,Min_Quantity,Min_Price,Max_Price
East,14512.8,3628.2,4,9,1,305,1254
North,25672.4,6418.1,4,8,2,487,1559
South,78753.85,6562.820833,12,9,2,230,1995
West,11238.7,2247.74,5,7,2,120,1999


## **D) Time-Based Aggregations**

### **18. Count of Orders per Region–Month pair**
**Use Cases:**
* Count the total number of Orders for each Region–Month combination.

In [None]:
mydf

,OrderID,Region,Category,Product,Quantity,Price,Discount,Sales,OrderDate,Cum_Sales,Cat_Cum_Sales
0,ORD001,North,Electronics,Fridge,7,1559,0,10913.0,2024-01-15,10913.0,10913.0
1,ORD002,South,Clothing,Shirt,8,1824,0,14592.0,2024-02-08,14592.0,14592.0
2,ORD003,North,Electronics,Rice,3,1338,10,3612.6,2024-01-11,14525.6,14525.6
3,ORD004,North,Electronics,Mobile,8,1496,15,10172.8,2024-02-09,24698.4,24698.4
4,ORD005,South,Clothing,Laptop,8,230,5,1748.0,2024-02-22,16340.0,16340.0
5,ORD006,South,Grocery,Sofa,2,443,15,753.1,2024-01-30,17093.1,753.1
6,ORD007,West,Home,Laptop,6,485,15,2473.5,2024-02-29,2473.5,2473.5
7,ORD008,South,Grocery,TV,5,1284,15,5457.0,2024-02-27,22550.1,6210.1
8,ORD009,East,Clothing,Sofa,6,352,15,1795.2,2024-01-25,1795.2,18135.2
9,ORD010,East,Electronics,Mobile,1,574,10,516.6,2024-02-11,2311.8,25215.0


In [None]:
# Convert OrderDate to datetime
mydf["OrderDate"] = pd.to_datetime(mydf["OrderDate"])

# Extract Month from OrderDate
mydf["Month"] = mydf["OrderDate"].dt.month

# mydf.groupby(["Month"])["Sales"].sum()

# Count of Orders per Region–Month pair
mydf.groupby(["Region", "Month"])["OrderID"].count()


Region,Month,OrderID
East,1,1
East,2,3
North,1,2
North,2,2
South,1,6
South,2,6
West,1,1
West,2,4
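
One caveat: `.dt.month` returns only the month number (1 to 12). That is fine here because the orders shown all fall in early 2024, but if a dataset spans several years, January of one year and January of the next end up in the same group. A minimal sketch with made-up dates (the `toy` frame is hypothetical) compares the month number with a year-month period from `.dt.to_period("M")`:

```python
import pandas as pd

# Hypothetical orders spanning two years (not from sales_dataset.csv)
toy = pd.DataFrame({
    "OrderID": ["T1", "T2", "T3"],
    "OrderDate": pd.to_datetime(["2024-01-15", "2025-01-20", "2024-02-03"]),
})

# Month number only: January 2024 and January 2025 share the key 1
by_month_number = toy.groupby(toy["OrderDate"].dt.month)["OrderID"].count()

# Year-month period: the two Januaries stay in separate groups
by_year_month = toy.groupby(toy["OrderDate"].dt.to_period("M"))["OrderID"].count()

print(len(by_month_number))  # 2 groups
print(len(by_year_month))    # 3 groups
```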


### **19. Total Sales per Category–Month pair**
**Use Cases:**
* Calculate total Sales for each Category–Month combination.
* Understand monthly sales trends across product categories.

In [None]:
# Convert OrderDate to datetime
mydf["OrderDate"] = pd.to_datetime(mydf["OrderDate"])

# Extract Month from OrderDate
mydf["Month"] = mydf["OrderDate"].dt.month

# Total Sales per Category–Month pair
mydf.groupby(["Category", "Month"])["Sales"].sum()

Category,Month,Sales
Clothing,1,10435.2
Clothing,2,16340.0
Electronics,1,39343.85
Electronics,2,18591.1
Grocery,1,753.1
Grocery,2,6943.9
Home,1,9138.0
Home,2,28632.6


### **20. Orders & Sales Metrics per Region–Month pair**
**Use Cases:**
* Calculate multiple metrics: total orders, total sales, average sales, min/max sales, and unique products for each Region–Month combination.
* Get a comprehensive monthly view of orders and sales performance per region.

In [None]:
# Orders & Sales Metrics per Region–Month pair
mydf.groupby(["Region", "Month"]).agg(
    Total_Orders=("OrderID", "count"),
    Total_Sales=("Sales", "sum"),
    Avg_Sales=("Sales", "mean"),
    Min_Sales=("Sales", "min"),
    Max_Sales=("Sales", "max"),
    Unique_Products=("Product", "nunique")
)

Region,Month,Total_Orders,Total_Sales,Avg_Sales,Min_Sales,Max_Sales,Unique_Products
East,1,1,1795.2,1795.2,1795.2,1795.2,1
East,2,3,12717.6,4239.2,516.6,11286.0,3
North,1,2,14525.6,7262.8,3612.6,10913.0,2
North,2,2,11146.8,5573.4,974.0,10172.8,2
South,1,6,41393.35,6898.891667,753.1,13961.25,5
South,2,6,37360.5,6226.75,1664.4,14592.0,4
West,1,1,1956.0,1956.0,1956.0,1956.0,1
West,2,4,9282.7,2320.675,571.9,5397.3,4


### **21. Sales & Quantity Metrics per Category–Month pair**
**Use Cases:**
* Calculate multiple metrics: total sales, average sales, total quantity, min/max quantity, and unique products for each Category–Month combination.
* Analyze monthly sales and quantity trends across product categories.


In [None]:
# Sales & Quantity Metrics per Category–Month pair
mydf.groupby(["Category", "Month"]).agg(
    Total_Sales=("Sales", "sum"),
    Avg_Sales=("Sales", "mean"),
    Total_Quantity=("Quantity", "sum"),
    Min_Quantity=("Quantity", "min"),
    Max_Quantity=("Quantity", "max"),
    Unique_Products=("Product", "nunique")
)

Category,Month,Total_Sales,Avg_Sales,Total_Quantity,Min_Quantity,Max_Quantity,Unique_Products
Clothing,1,10435.2,5217.6,12,6,6,2
Clothing,2,16340.0,8170.0,16,8,8,2
Electronics,1,39343.85,9835.9625,26,3,9,3
Electronics,2,18591.1,3718.22,21,1,8,4
Grocery,1,753.1,753.1,2,2,2,1
Grocery,2,6943.9,2314.633333,10,2,5,2
Home,1,9138.0,3046.0,10,2,6,2
Home,2,28632.6,5726.52,27,2,9,3


## **E) Custom Aggregations with Lambda**

### **22. Double the Average Sales per Region**
**Use Cases:**
* Multiply the average Sales by 2 for each Region.
* Useful for scaling or projection analysis.

In [None]:
# Double the Average Sales per Region
mydf.groupby("Region")["Sales"].agg(
    Avg_Sales1="mean",
    Avg_Sales2=lambda x: x.mean(),
    Double_Avg=lambda x: x.mean() * 2
)



Region,Avg_Sales1,Avg_Sales2,Double_Avg
East,3628.2,3628.2,7256.4
North,6418.1,6418.1,12836.2
South,6562.820833,6562.820833,13125.641667
West,2247.74,2247.74,4495.48


### **23. Count of Large Orders per Region**
**Use Cases:**
* Count how many Orders have Sales > 3000 in each Region.
* Identify premium/high-value orders distribution.

In [None]:
# Count of Orders per Region
mydf.groupby("Region")["Sales"].agg(
    Orders1="count",
    Orders2=lambda x: x.count()
)

Region,Orders1,Orders2
East,4,4
North,4,4
South,12,12
West,5,5


In [None]:
# Count of Large Orders per Region
mydf.groupby("Region")["Sales"].agg(
    Large_Orders=lambda x: (x > 3000).sum()
)

Region,Large_Orders
East,1
North,3
South,8
West,1
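
Why `.sum()` counts here: the comparison `x > 3000` produces a boolean Series, and summing booleans treats True as 1 and False as 0. Taking `.mean()` of the same mask gives the share of large orders. The sketch below uses made-up values (the `toy` frame and the `Large_Share` column are illustrative assumptions, not part of the dataset).

```python
import pandas as pd

# Hypothetical orders (not from sales_dataset.csv)
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Sales": [3500.0, 1200.0, 3000.0, 4100.0, 5200.0],
})

large_orders = toy.groupby("Region")["Sales"].agg(
    # True counts as 1, so the sum is the number of orders above 3000
    Large_Orders=lambda x: (x > 3000).sum(),
    # The mean of the same mask is the fraction of orders above 3000
    Large_Share=lambda x: (x > 3000).mean(),
)

print(large_orders["Large_Orders"].tolist())  # [1, 2]
```

The West order of exactly 3000 is not counted, because the comparison is strict (`>`).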


### **24. Total Sales after 10% Discount per Region**
**Use Cases:**
* Calculate total Sales if a 10% discount is applied on each order.
* Simulate revenue after discounts.


In [None]:
# Total Sales after 10% discount per Region
mydf.groupby("Region")["Sales"].agg(
    Total_Sales=lambda x: x.sum(),
    Total_Sales_10_Disc=lambda x: (x * 0.9).sum()
)

Region,Total_Sales,Total_Sales_10_Disc
East,14512.8,13061.52
North,25672.4,23105.16
South,78753.85,70878.465
West,11238.7,10114.83


### **25. Total Sales after Adding 5% Tax per Region**
**Use Cases:**
* Add 5% tax to all Sales before summing.
* Project total revenue including tax per Region.

In [None]:
# Total Sales after adding 5% tax per Region
mydf.groupby("Region")["Sales"].agg(
    Total_Sales=lambda x: x.sum(),
    Total_Sales_5Tax=lambda x: (x * 1.05).sum()
)

Region,Total_Sales,Total_Sales_5Tax
East,14512.8,15238.44
North,25672.4,26956.02
South,78753.85,82691.5425
West,11238.7,11800.635


### **26. Average Price after 20% Discount per Region**
**Use Cases:**
* Apply 20% discount on Price and then calculate average Price per Region.
* Useful for discounted pricing analysis.

In [None]:
# Average Price after 20% discount per Region
mydf.groupby("Region")["Price"].agg(
    Avg_Price=lambda x: x.mean(),
    Avg_Price_20Disc=lambda x: (x * 0.8).mean()
)

Region,Avg_Price,Avg_Price_20Disc
East,621.25,497.0
North,1220.0,976.0
South,1278.416667,1022.733333
West,776.6,621.28


### **27. Orders Above 3000 after 15% Discount per Region**
**Use Cases:**
* Count Orders that remain above 3000 even after 15% discount.
* Evaluate high-value orders post-discount scenario.

In [None]:
# Count of Orders > 3000 after 15% discount per Region
mydf.groupby("Region")["Sales"].agg(
    Orders_High_15Disc=lambda x: ((x * 0.85) > 3000).sum()
)

Region,Orders_High_15Disc
East,1
North,3
South,7
West,1


### **28. Net Sales after Discount & Tax per Region**
**Use Cases:**
* Apply 10% discount and then add 5% tax on discounted Sales.
* Simulate realistic billing with discount + tax combined.

In [None]:
# Net Sales after 10% discount and 5% tax per Region
mydf.groupby("Region")["Sales"].agg(
    Net_Sales=lambda x: (x * 0.9 * 1.05).sum()
)

Region,Net_Sales
East,13714.596
North,24260.418
South,74422.38825
West,10620.5715
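
Because the 10% discount and the 5% tax are the same constants for every order, multiplying each row and then summing gives the same result as summing first and multiplying once: sum(0.9 × 1.05 × x) = 0.945 × sum(x). The short check below applies that factor to the regional Total_Sales figures reported earlier and compares the result with the Net_Sales table above. It is only a sanity check on the printed numbers, not a replacement for the groupby code.

```python
import math

# Regional Total_Sales as reported in the earlier output tables
total_sales = {"East": 14512.8, "North": 25672.4, "South": 78753.85, "West": 11238.7}

# Net_Sales as reported in the table above
reported_net = {"East": 13714.596, "North": 24260.418, "South": 74422.38825, "West": 10620.5715}

factor = 0.9 * 1.05  # 10% discount, then 5% tax

for region, total in total_sales.items():
    net = total * factor
    # Floating-point arithmetic, so compare with a tolerance instead of ==
    assert math.isclose(net, reported_net[region], rel_tol=1e-9)
    print(region, round(net, 5))
```

This shortcut holds only for constant multipliers. A discount that varies per order (such as the Discount column) has to be applied row by row before summing.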


## **4. Transformations & Filters**
* transform() for element-wise transformations on groups
* filter() to include/exclude groups based on conditions

**What’s the difference?**

* **agg()** → returns one value per group (collapses rows)
* **transform()** → returns one value per row (same length as original), computed within each group
* **filter()** → keeps or drops whole groups based on a condition
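
The difference is easiest to see in the shape of each result. A minimal sketch on a made-up five-row table (the `toy` frame and the threshold of 200 are hypothetical):

```python
import pandas as pd

# Hypothetical data: 2 East rows, 3 West rows
toy = pd.DataFrame({
    "Region": ["East", "East", "West", "West", "West"],
    "Sales": [100, 300, 50, 20, 30],
})
grouped = toy.groupby("Region")["Sales"]

# agg: one value per group -> 2 rows (East=400, West=100)
totals = grouped.agg("sum")

# transform: one value per original row -> 5 rows, group total repeated
row_totals = grouped.transform("sum")

# filter: keep all rows of groups whose total is at least 200 -> the 2 East rows
kept = toy.groupby("Region").filter(lambda g: g["Sales"].sum() >= 200)

print(len(totals), len(row_totals), len(kept))  # 2 5 2
```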




## **A) transform() — Element-wise, Per-Group**

### **1. Sales as % of Region Total**
**UseCase:**
* Compute each Order’s Sales divided by its Region’s total Sales.

In [None]:
# Percent of regional total
mydf["Sales_Pct_of_Region"] = mydf.groupby("Region")["Sales"].transform(
    lambda x: (x / x.sum())*100
)

mydf[["OrderID","Region","Category","Product","Sales_Pct_of_Region"]].head()

Unnamed: 0,OrderID,Region,Category,Product,Sales_Pct_of_Region
0,ORD001,North,Electronics,Fridge,42.508686
1,ORD002,South,Clothing,Shirt,18.528618
2,ORD003,North,Electronics,Rice,14.071922
3,ORD004,North,Electronics,Mobile,39.625434
4,ORD005,South,Clothing,Laptop,2.219574
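
The same percentage can also be computed without a lambda: `transform("sum")` broadcasts each Region's total back onto its rows, and the division then runs on whole columns. A sketch on made-up values (the `toy` frame is hypothetical); a useful sanity check is that the percentages add up to 100 within every group.

```python
import pandas as pd

# Hypothetical orders (not from sales_dataset.csv)
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Sales": [300.0, 100.0, 50.0, 150.0],
})

# Region total repeated on every row of that Region
region_total = toy.groupby("Region")["Sales"].transform("sum")

# Column-wise division: each order as a percentage of its Region's total
toy["Sales_Pct_of_Region"] = toy["Sales"] / region_total * 100

print(toy["Sales_Pct_of_Region"].tolist())  # [75.0, 25.0, 25.0, 75.0]

# Sanity check: percentages sum to 100 in every Region
pct_check = toy.groupby("Region")["Sales_Pct_of_Region"].sum()
```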


### **2. De-meaned Sales per Category**
**UseCase:**
* Subtract Category mean from each row’s Sales.


In [None]:
# Demeaned Sales per Category
mydf["Sales_Demeaned_Category"] = mydf.groupby("Category")["Sales"].transform(
    lambda x: x - x.mean()
)

mydf[["OrderID","Region","Category","Product","Sales_Pct_of_Region","Sales_Demeaned_Category"]].head()

Unnamed: 0,OrderID,Region,Category,Product,Sales_Pct_of_Region,Sales_Demeaned_Category
0,ORD001,North,Electronics,Fridge,42.508686,4475.783333
1,ORD002,South,Clothing,Shirt,18.528618,7898.2
2,ORD003,North,Electronics,Rice,14.071922,-2824.616667
3,ORD004,North,Electronics,Mobile,39.625434,3735.583333
4,ORD005,South,Clothing,Laptop,2.219574,-4945.8


### **3. Fill NaN Price with Region Median**
**UseCase:**
* Impute missing Price using the median Price within the same Region.
* Ensures region-level imputation instead of global fill.

In [None]:
# Impute Price by regional median
mydf["Price"] = mydf.groupby("Region")["Price"].transform(
    lambda x: x.fillna(x.median())
)

mydf[["OrderID","Region","Category","Product","Price"]].head()

Unnamed: 0,OrderID,Region,Category,Product,Price
0,ORD001,North,Electronics,Fridge,1559
1,ORD002,South,Clothing,Shirt,1824
2,ORD003,North,Electronics,Rice,1338
3,ORD004,North,Electronics,Mobile,1496
4,ORD005,South,Clothing,Laptop,230
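
The first rows above show no visible change, since none of them has a missing Price. To see the imputation at work, the sketch below runs the same pattern on a made-up table that does contain NaN values (the `toy` frame and the `Price_Filled` column are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical prices with gaps (not from sales_dataset.csv)
toy = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South"],
    "Price": [100.0, np.nan, 300.0, 50.0, np.nan],
})

# median() skips NaN by default, so each gap gets the median
# of the known prices in the same Region
toy["Price_Filled"] = toy.groupby("Region")["Price"].transform(
    lambda x: x.fillna(x.median())
)

print(toy["Price_Filled"].tolist())  # [100.0, 200.0, 300.0, 50.0, 50.0]
```

If a Region has no known Price at all, its median is NaN and the gaps in that Region stay unfilled.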


### **4. Rank of Quantity within Category (Dense Ranking)**
**UseCase:**
* Compute each row's rank by Quantity inside its Category.
* Dense ranking: consecutive ranks without gaps (largest = 1).

In [None]:
# Dense rank of Quantity within Category
mydf["Quantity_Rank_in_Category_Dense"] = mydf.groupby("Category")["Quantity"].transform(
    lambda x: x.rank(method="dense", ascending=False)
)

results = mydf[["OrderID","Region","Category","Product","Quantity","Quantity_Rank_in_Category_Dense"]]

results.sort_values(by="Quantity_Rank_in_Category_Dense", ascending=True)

### **5. Rank of Quantity within Category (Standard Ranking)**
**UseCase:**
* Compute each row's rank by Quantity inside its Category.
* Standard ranking: ranks have gaps when there are ties (largest = 1).

In [None]:
# Standard (min) rank of Quantity within Category
mydf["Quantity_Rank_in_Category"] = mydf.groupby("Category")["Quantity"].transform(
    lambda x: x.rank(method="min", ascending=False)
)

results = mydf[["OrderID","Region","Category","Product","Quantity","Quantity_Rank_in_Category"]]

results.sort_values(by="Quantity_Rank_in_Category", ascending=True)



### **6. Cumulative Sales % within Region**
**UseCase:**
* Compute each row’s cumulative Sales percentage within its Region.

In [None]:
# Cumulative Sales % within Region
mydf["Cumulative_Sales_Pct_Region"] = mydf.groupby("Region")["Sales"].transform(
    lambda x: ((x.cumsum() / x.sum())*100).round(2)
)

mydf[["OrderID","Region","Category","Product","Quantity","Cumulative_Sales_Pct_Region"]]



Unnamed: 0,OrderID,Region,Category,Product,Quantity,Cumulative_Sales_Pct_Region
0,ORD001,North,Electronics,Fridge,7,42.51
1,ORD002,South,Clothing,Shirt,8,18.53
2,ORD003,North,Electronics,Rice,3,56.58
3,ORD004,North,Electronics,Mobile,8,96.21
4,ORD005,South,Clothing,Laptop,8,20.75
5,ORD006,South,Grocery,Sofa,2,21.7
6,ORD007,West,Home,Laptop,6,22.01
7,ORD008,South,Grocery,TV,5,28.63
8,ORD009,East,Clothing,Sofa,6,12.37
9,ORD010,East,Electronics,Mobile,1,15.93
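
Two points about this column. The running total follows the current row order of the DataFrame, so sorting (for example by OrderDate) before grouping changes the values. And the last row of each Region always reaches 100. The sketch below shows both on made-up data (the `toy` frame is hypothetical), using the built-in `cumsum()` and `transform("sum")` as one possible alternative to the lambda.

```python
import pandas as pd

# Hypothetical orders, with the two Regions interleaved
toy = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Sales": [100.0, 50.0, 300.0, 150.0],
})

grouped = toy.groupby("Region")["Sales"]

# Running total within each Region divided by that Region's total
toy["Cumulative_Sales_Pct_Region"] = (
    grouped.cumsum() / grouped.transform("sum") * 100
).round(2)

print(toy["Cumulative_Sales_Pct_Region"].tolist())  # [25.0, 25.0, 100.0, 100.0]
```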


## **B) filter() — keep/discard entire groups**

### **1) Keep Regions with Total Sales ≥ 25k**
**UseCase:**
* Focus analysis on material regions (keeps South and North with this dataset).

In [None]:
# Regions with strong revenue
big_regions = mydf.groupby("Region").filter(
    lambda x: x["Sales"].sum() >= 25000
)


big_regions[["OrderID","Region","Category","Product","Quantity","Sales"]]


Unnamed: 0,OrderID,Region,Category,Product,Quantity,Sales
0,ORD001,North,Electronics,Fridge,7,10913.0
1,ORD002,South,Clothing,Shirt,8,14592.0
2,ORD003,North,Electronics,Rice,3,3612.6
3,ORD004,North,Electronics,Mobile,8,10172.8
4,ORD005,South,Clothing,Laptop,8,1748.0
5,ORD006,South,Grocery,Sofa,2,753.1
7,ORD008,South,Grocery,TV,5,5457.0
13,ORD014,North,Home,TV,2,974.0
14,ORD015,South,Electronics,Laptop,2,1664.4
15,ORD016,South,Home,Mobile,2,2487.1


### **2) Keep Category groups with ≥ 5 Unique Products**
**UseCase:**
* Ensure enough product diversity per Category (only Electronics meets this threshold in this dataset).

In [None]:
#  Category groups with ≥ 5 Unique Products
rich_categories = mydf.groupby("Category").filter(
    lambda x: x["Product"].nunique() >= 5
)

rich_categories[["OrderID","Region","Category","Product"]]


Unnamed: 0,OrderID,Region,Category,Product
0,ORD001,North,Electronics,Fridge
2,ORD003,North,Electronics,Rice
3,ORD004,North,Electronics,Mobile
9,ORD010,East,Electronics,Mobile
10,ORD011,West,Electronics,TV
12,ORD013,West,Electronics,Rice
14,ORD015,South,Electronics,Laptop
16,ORD017,South,Electronics,Rice
23,ORD024,South,Electronics,TV


### **3) Keep Region–Month groups with Median Sales ≥ 2000**
**UseCase:**
* Identify Region–Month combinations whose median order Sales is at least 2000.

In [None]:
# Region–Month combos with healthy median
# Month as a year-month period (this overwrites the integer Month column created earlier)
mydf["Month"] = pd.to_datetime(mydf["OrderDate"]).dt.to_period("M")

healthy_rm = mydf.groupby(["Region", "Month"]).filter(
    lambda g: g["Sales"].median() >= 2000
)

healthy_rm[["OrderID","Region","Category","Product","Month","Sales"]]


### **4) Keep Categories with Order Count ≥ 8**
**UseCase:**
* Enforce a minimum sample size so that per-Category statistics are more reliable.

In [None]:
# Order count per Category (to see which categories reach 8 orders)
category_order_counts = mydf.groupby("Category")["OrderID"].count()

category_order_counts

Category,OrderID
Clothing,4
Electronics,9
Grocery,4
Home,8


In [None]:
# Categories with enough orders
stable_categories = mydf.groupby("Category").filter(
    lambda g: g["OrderID"].count() >= 8
)

stable_categories[["OrderID","Region","Category","Product","Month","Sales"]]



### **5) Keep Regions that have any high-ticket item (Price > 1800)**
**UseCase:**
* Target regions that have at least one premium-priced order.



In [None]:
# Regions with high-ticket items
regions_high_ticket = mydf.groupby("Region").filter(
    lambda g: (g["Price"] > 1800).any()
)

regions_high_ticket[["OrderID","Region","Category","Product","Price"]]


### **Exploring Rank() function**

#### **1) Build the DataFrame**

In [None]:
import pandas as pd

data = {
"Category": ["A","A","A","A","A","A","A", "B","B","B","B","B","B"],
"Value": [100,100, 90, 80, 80,80,70, 50, 50, 40, 30,30,20],
"Note":["A1","A2","A3","A4","A5","A6","A7","B1","B2","B3","B4","B5","B6"],
}

mydf = pd.DataFrame(data)
mydf


#### **2) Ranking inside each Category**

**Case 1: DENSE — ties share the same rank; next distinct value gets next integer (no gaps)**

In [None]:
mydf["rank_dense"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="dense", ascending=False))

mydf

In [None]:
mydf["rank_dense"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="dense", ascending=True))

mydf

**Case 2: MIN — ties get the minimum positional rank; next rank skips ahead (gaps)**

In [None]:
mydf["rank_min"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="min", ascending=False))

mydf



In [None]:
mydf["rank_min"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="min", ascending=True))

mydf



**Case 3: FIRST — unique ranks; ties broken by original row order within each group**

In [None]:
mydf["rank_first"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="first", ascending=False))

mydf

In [None]:
mydf["rank_first"] = mydf.groupby("Category")["Value"].transform(lambda x: x.rank(method="first", ascending=True))

mydf
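
As a side-by-side summary of the three cases, the sketch below ranks the Category A values from the DataFrame above with all three methods in descending order. It calls `rank()` directly on the grouped column, which is a shorter alternative to `transform()` with a lambda. The `values` and `ranks` names are only for this illustration.

```python
import pandas as pd

# Category A values from the example DataFrame above
values = pd.DataFrame({
    "Category": ["A"] * 7,
    "Value": [100, 100, 90, 80, 80, 80, 70],
})

grouped = values.groupby("Category")["Value"]

# Collect the three rankings next to the values for comparison
ranks = pd.DataFrame({
    "Value": values["Value"],
    # dense: ties share a rank, no gaps -> 1, 1, 2, 3, 3, 3, 4
    "dense": grouped.rank(method="dense", ascending=False),
    # min: ties share the lowest position, gaps follow -> 1, 1, 3, 4, 4, 4, 7
    "min": grouped.rank(method="min", ascending=False),
    # first: every row gets its own rank from 1 to 7
    "first": grouped.rank(method="first", ascending=False),
})

print(ranks)
```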

#### **3) Ranking among all categories**

In [11]:
import pandas as pd

data = {
"Category": ["A","A","A","A","A","A","A", "B","B","B","B","B","B"],
"Value": [100,100, 90, 80, 80,80,70, 50, 50, 40, 30,30,20],
"Note":["A1","A2","A3","A4","A5","A6","A7","B1","B2","B3","B4","B5","B6"],
}

mydf = pd.DataFrame(data)
mydf


Unnamed: 0,Category,Value,Note
0,A,100,A1
1,A,100,A2
2,A,90,A3
3,A,80,A4
4,A,80,A5
5,A,80,A6
6,A,70,A7
7,B,50,B1
8,B,50,B2
9,B,40,B3


In [12]:
# No grouping here: rank the whole column directly
mydf["rank_dense1"] = mydf["Value"].rank(method="dense", ascending=False)

mydf

Unnamed: 0,Category,Value,Note,rank_dense1
0,A,100,A1,1.0
1,A,100,A2,1.0
2,A,90,A3,2.0
3,A,80,A4,3.0
4,A,80,A5,3.0
5,A,80,A6,3.0
6,A,70,A7,4.0
7,B,50,B1,5.0
8,B,50,B2,5.0
9,B,40,B3,6.0


In [13]:
# No grouping here: rank the whole column directly
mydf["rank_min1"] = mydf["Value"].rank(method="min", ascending=False)

mydf

Unnamed: 0,Category,Value,Note,rank_dense1,rank_min1
0,A,100,A1,1.0,1.0
1,A,100,A2,1.0,1.0
2,A,90,A3,2.0,3.0
3,A,80,A4,3.0,4.0
4,A,80,A5,3.0,4.0
5,A,80,A6,3.0,4.0
6,A,70,A7,4.0,7.0
7,B,50,B1,5.0,8.0
8,B,50,B2,5.0,8.0
9,B,40,B3,6.0,10.0


In [14]:
# No grouping here: rank the whole column directly
mydf["rank_first1"] = mydf["Value"].rank(method="first", ascending=False)

mydf

Unnamed: 0,Category,Value,Note,rank_dense1,rank_min1,rank_first1
0,A,100,A1,1.0,1.0,1.0
1,A,100,A2,1.0,1.0,2.0
2,A,90,A3,2.0,3.0,3.0
3,A,80,A4,3.0,4.0,4.0
4,A,80,A5,3.0,4.0,5.0
5,A,80,A6,3.0,4.0,6.0
6,A,70,A7,4.0,7.0,7.0
7,B,50,B1,5.0,8.0,8.0
8,B,50,B2,5.0,8.0,9.0
9,B,40,B3,6.0,10.0,10.0
