### Module 9 - Activity Scenario: The Global Retail Analyzer

You are a data analyst for a global retail chain. You have been given a dataset of transaction records containing information about regions, product categories, sales amounts, and customer ratings. Your goal is to derive insights about regional performance, product popularity, and data quality.

---

### Phase 0: Data Generation

Run the following code block to generate the synthetic dataset for this exercise.

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)

# Generate synthetic data
n_rows = 100
data = {
    'Region': np.random.choice(['North', 'South', 'East', 'West'], n_rows),
    'Category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Toys'], n_rows),
    'Sales': np.random.randint(100, 1000, n_rows),
    'Quantity': np.random.randint(1, 20, n_rows),
    'Rating': np.random.choice([1, 2, 3, 4, 5, np.nan], n_rows, p=[0.1, 0.1, 0.3, 0.3, 0.1, 0.1])
}

df = pd.DataFrame(data)

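# Intended to set up the missing-value scenario for Exercise 7.
# Note: fillna(np.nan) replaces NaN with NaN, so the line below leaves the data unchanged;
# the missing ratings come from the np.nan option in the Rating sampling above.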
df.loc[df['Category'] == 'Electronics', 'Rating'] = df.loc[df['Category'] == 'Electronics', 'Rating'].fillna(np.nan)

print(df.head())

  Region  Category  Sales  Quantity  Rating
0   East      Home    674         4     4.0
1   West  Clothing    963        11     NaN
2  North  Clothing    842        17     3.0
3   East      Toys    340         6     4.0
4   East  Clothing    663         5     4.0


### Phase 1: Grouping and Hierarchical Reshaping

**Exercise 1: Regional Sales Analysis**
Group the data by `Region` and calculate the **total** (sum) `Sales` for each region. Store this in a variable called `regional_sales`.

In [None]:
regional_sales = df.groupby("Region")["Sales"].sum()
regional_sales

Region,Sales
East,13291
North,13406
South,15237
West,16518


**Exercise 2: Multi-level Categorization**

1. Group the data by **both** `Region` and `Category`.
2. Calculate the **mean** `Sales`.
3. The result is a MultiIndex Series. Use `unstack()` to transform the `Category` level into columns so you have a DataFrame where indices are Regions and columns are Categories.

In [None]:
region_category_mean_sales = (
    df.groupby(["Region", "Category"])["Sales"]
      .mean()
      .unstack()
)

region_category_mean_sales

Region,Clothing,Electronics,Home,Toys
East,556.0,492.363636,787.6,428.25
North,889.25,506.5,700.0,617.571429
South,686.5,626.428571,491.4,483.833333
West,548.25,612.25,526.466667,568.428571
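
The group-then-`unstack()` pattern above can also be written with `pivot_table`. The following is a minimal sketch on a small hypothetical frame (`toy` is an illustrative name, not the exercise data), assuming the same `Region`, `Category`, and `Sales` columns:

```python
import pandas as pd

# Small hypothetical dataset for illustration only (not the exercise data)
toy = pd.DataFrame({
    "Region":   ["East", "East", "East", "West", "West"],
    "Category": ["Home", "Home", "Toys", "Home", "Toys"],
    "Sales":    [100, 300, 50, 400, 250],
})

# Approach used in the exercise: group on two keys, then move Category to columns
via_unstack = toy.groupby(["Region", "Category"])["Sales"].mean().unstack()

# Same reshaping expressed as a pivot table
via_pivot = toy.pivot_table(
    index="Region", columns="Category", values="Sales", aggfunc="mean"
)

print(via_unstack)
print(via_pivot)
```

Both results have one row per region and one column per category, so either form should work for this exercise.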


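**Exercise 3: Custom Aggregation**
Define a function that returns the difference between a group's maximum and minimum, then pass it to `.agg()` to compute the `Sales` spread for each `Category`. Store the result in `sales_volatility`.

In [None]: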
def spread(arr):
    return arr.max() - arr.min()

sales_volatility = df.groupby("Category")["Sales"].agg(spread)
sales_volatility

Category,Sales
Clothing,847
Electronics,825
Home,774
Toys,811
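
A custom function such as `spread` is the general tool, but for a simple max-minus-min range the built-in aggregations can be combined directly. A minimal sketch on hypothetical data (`toy` and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B"],
    "Sales":    [10, 50, 30, 5, 5],
})

grouped = toy.groupby("Category")["Sales"]

# Custom aggregation, equivalent to passing a named function to .agg()
spread_custom = grouped.agg(lambda s: s.max() - s.min())

# Same result from two built-in aggregations, which pandas can run without
# calling a Python function once per group
spread_builtin = grouped.max() - grouped.min()

print(spread_custom)
print(spread_builtin)
```

On a dataset this small the difference is negligible; the built-in form mainly matters when there are many groups.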


**Exercise 4: The "Manager's Report" (Multiple Aggregations)**
Management wants a summary table grouped by `Region`. Write a single command that uses `.agg()` with a dictionary to calculate:

* **Sales:** the total (`sum`).
* **Quantity:** the largest quantity sold in a single transaction (`max`).
* **Rating:** the average rating (`mean`).

In [None]:
managers_report = (
    df.groupby("Region")
      .agg({
          "Sales": "sum",
          "Quantity": "max",
          "Rating": "mean"
      })
)

managers_report

Region,Sales,Quantity,Rating
East,13291,19,3.173913
North,13406,17,2.722222
South,15237,17,2.68
West,16518,19,3.08
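
The dictionary form keeps the original column names, so a reader cannot tell from the header which statistic each column holds. Named aggregation is one way to make that explicit. A minimal sketch on hypothetical data (the output column names `total_sales`, `max_quantity`, and `mean_rating` are illustrative choices):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Region":   ["East", "East", "West"],
    "Sales":    [100, 200, 50],
    "Quantity": [3, 7, 2],
    "Rating":   [4.0, np.nan, 2.0],
})

# Named aggregation: output_name=(source_column, function)
report = toy.groupby("Region").agg(
    total_sales=("Sales", "sum"),
    max_quantity=("Quantity", "max"),
    mean_rating=("Rating", "mean"),  # mean skips NaN by default
)

print(report)
```

Note that the `mean` for East is computed from the single non-missing rating, which is also why the regional averages in the report above ignore missing ratings.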


### Phase 3: Advanced `apply` and Bucket Analysis

**Exercise 5: Top Performers (Split-Apply-Combine)**

Use the `apply` method to find the **top 2 transactions** (highest Sales) for **each** `Region`.

* *Hint:* You will need to define a function (or use a lambda) that sorts values by Sales and takes the top 2, then apply it to the group.


In [None]:
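top2_sales_per_region = (
    # Select the output columns before apply so the grouping column is not
    # passed to the lambda (doing so raises a DeprecationWarning in pandas 2.2+)
    df.groupby("Region")[["Category", "Sales"]]
      .apply(lambda g: g.sort_values("Sales", ascending=False).head(2))
)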

top2_sales_per_region


Region,(row index),Category,Sales
East,61,Electronics,926
East,11,Home,848
North,6,Clothing,999
North,57,Clothing,918
South,69,Clothing,929
South,75,Clothing,906
West,1,Clothing,963
West,88,Home,954
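
The exercise asks for `apply`, but the same top-N-per-group result can be reached without it by sorting once and taking the first rows of each group. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Region": ["East", "East", "East", "West", "West"],
    "Sales":  [10, 40, 30, 20, 5],
})

# Sort the whole frame by Sales, then keep the first 2 rows of every Region.
# GroupBy.head keeps the original row labels and a flat index.
top2 = (
    toy.sort_values("Sales", ascending=False)
       .groupby("Region")
       .head(2)
       .sort_values(["Region", "Sales"], ascending=[True, False])
)

print(top2)
```

Unlike the `apply` version, this returns a flat index rather than a `(Region, row index)` MultiIndex, which may or may not be what a report needs.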



**Exercise 6: Sales Bucketing**

1. Use `pd.qcut` to divide the `Sales` column into 3 buckets: "Low", "Medium", and "High".
2. Group the dataframe by these new buckets.
3. Calculate the **count** of transactions and the **mean** `Rating` for each bucket.


In [None]:
sales_bucket = pd.qcut(df["Sales"], q=3, labels=["Low", "Medium", "High"])

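# observed=False keeps every bucket in the result and states the current default
# explicitly, which avoids the FutureWarning that pandas 2.1+ raises for categorical groupers
bucket_analysis = df.groupby(sales_bucket, observed=False).agg(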
    Sales=("Sales", "count"),
    Rating=("Rating", "mean")
)

bucket_analysis


Sales bucket,Sales (count),Rating (mean)
Low,34,3.0
Medium,33,2.724138
High,33,3.03125
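
The near-equal counts above (34, 33, 33) follow from how `pd.qcut` works: it places bin edges at quantiles, so each bucket holds roughly the same number of rows. `pd.cut` instead splits the value range into equal-width intervals. A minimal sketch contrasting the two on a hypothetical skewed series:

```python
import pandas as pd

# Hypothetical skewed data: eight small values and one large outlier
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["Low", "Medium", "High"]

# Quantile-based: edges chosen so each bucket gets about the same number of rows
equal_count = pd.qcut(s, q=3, labels=labels)

# Width-based: the range 1..100 is split into three equally wide intervals
equal_width = pd.cut(s, bins=3, labels=labels)

print(equal_count.value_counts().sort_index())
print(equal_width.value_counts().sort_index())
```

With the outlier present, the width-based buckets put almost every row in "Low", while the quantile-based buckets stay balanced.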


### Phase 4: Handling Missing Data by Group

**Exercise 7: Context-Aware Imputation**
Your `Rating` column has missing values (`NaN`).

1. Check the mean rating for "Clothing" vs "Electronics" using `groupby`. You will notice they are different.
2. Replacing every `NaN` with the global mean would ignore these category-level differences. Instead, fill the missing `Rating` values with the **mean rating of that specific Category**.

* *Hint:* Group by `Category` and use `apply` with a lambda function that employs `fillna` using the group's mean.

In [None]:
df["Rating"] = (
    df.groupby("Category")["Rating"]
      .apply(lambda s: s.fillna(s.mean()))
      .reset_index(level=0, drop=True)
)

df["Rating"].isna().sum()

np.int64(0)
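
The hint asks for `apply`, which requires dropping the extra group level afterwards. An alternative that keeps the original row index is `transform`, which returns one value per input row. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
toy = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B"],
    "Rating":   [4.0, np.nan, 2.0, 5.0, np.nan],
})

# transform("mean") broadcasts each category's mean back onto its rows,
# so the result aligns with toy's index and can be passed straight to fillna
category_mean = toy.groupby("Category")["Rating"].transform("mean")
toy["Rating"] = toy["Rating"].fillna(category_mean)

print(toy)
```

Here the missing rating in category A becomes 3.0 (the mean of 4.0 and 2.0) and the one in category B becomes 5.0.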

**Exercise 9: Grouping by Series Mapping**

You are analyzing monthly sales data. The columns in your dataset represent individual months. You want to aggregate these months into Quarters (Q1 and Q2) to see broader trends. Instead of typing out a dictionary manually, you have a pandas Series that defines which month belongs to which quarter.

In [None]:
import pandas as pd
import numpy as np

# 1. The Main Dataframe: Monthly Sales per Store
data = np.random.randint(1000, 5000, size=(4, 6))
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
stores = ['Store A', 'Store B', 'Store C', 'Store D']

df_monthly = pd.DataFrame(data, columns=months, index=stores)

print("Monthly Data:\n")
display(df_monthly)

Monthly Data:



Store,Jan,Feb,Mar,Apr,May,Jun
Store A,1356,3070,2785,3569,2781,4908
Store B,4171,3849,4251,4294,4977,3809
Store C,4730,3489,2631,3816,2015,2348
Store D,1515,4087,3839,1335,2782,3305


**Task:**

1. Create a pandas Series called `quarter_map` with these index-to-value pairs: `'Jan': 'Q1', 'Feb': 'Q1', 'Mar': 'Q1', 'Apr': 'Q2', 'May': 'Q2', 'Jun': 'Q2'`
2. Pass the `quarter_map` Series to `.groupby()`.
3. Group along the columns (months) rather than the rows. `axis=1` does this but is deprecated since pandas 2.1; transposing with `.T`, grouping, and transposing back gives the same result.
4. Calculate the sum to get total sales per quarter for each store.

In [None]:
quarter_map = pd.Series(
    {'Jan': 'Q1', 'Feb': 'Q1', 'Mar': 'Q1', 'Apr': 'Q2', 'May': 'Q2', 'Jun': 'Q2'}
)

print("\nMapping Series:\n")
display(quarter_map)

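# groupby(..., axis=1) is deprecated since pandas 2.1, so transpose instead:
# months become the row index, are grouped via quarter_map, and the result is transposed back
quarterly_sales = df_monthly.T.groupby(quarter_map).sum().T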
quarterly_sales


Mapping Series:



Month,Quarter
Jan,Q1
Feb,Q1
Mar,Q1
Apr,Q2
May,Q2
Jun,Q2



Store,Q1,Q2
Store A,7211,11258
Store B,12271,13080
Store C,10850,8179
Store D,9441,7422
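
A plain dictionary can serve as the mapping just as well as a Series, since `groupby` looks up each index label in it. The following is a minimal sketch on hypothetical data (`monthly` and `mapping` are illustrative names), using the transpose approach so that it does not rely on the deprecated `axis=1` argument:

```python
import pandas as pd

# Hypothetical monthly figures for illustration only
monthly = pd.DataFrame(
    {"Jan": [1, 10], "Feb": [2, 20], "Mar": [3, 30], "Apr": [4, 40]},
    index=["Store A", "Store B"],
)

# Dictionary mapping each column label (month) to a group label (quarter)
mapping = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1", "Apr": "Q2"}

# Transpose so months are rows, group those rows by quarter, sum, transpose back
quarterly = monthly.T.groupby(mapping).sum().T

print(quarterly)
```

A Series is convenient when the mapping already exists as data (for example, loaded from a calendar table); a dictionary is enough when it is typed by hand.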
