### Practice Activity: E-Commerce Data Wrangling

**Scenario:** You are a Data Analyst for a small electronics retailer. Your data on customers, products, January and February sales transactions, and regional performance is scattered across separate files. Your goal is to organize, combine, and analyze this data using the techniques covered in Module 8.

-----

### Part 0: Data Setup

Copy and run the following code block to generate the sample datasets for this activity.

In [None]:
import pandas as pd
import numpy as np

# 1. Customers Data
customers = pd.DataFrame({
    'Cust_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Dana White', 'Evan Lee'],
    'Region': ['East', 'West', 'East', 'North', 'South']
})

# 2. Products Data
products = pd.DataFrame({
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Product_Name': ['Laptop', 'Headphones', 'Monitor', 'Keyboard'],
    'Price': [1200, 50, 300, 80]
})

# 3. January Sales Data
sales_jan = pd.DataFrame({
    'Trans_ID': [1, 2, 3, 4],
    'Cust_ID': [101, 102, 101, 103],
    'Product_ID': ['P001', 'P002', 'P003', 'P002'],
    'Quantity': [1, 2, 1, 5]
})

# 4. February Sales Data
sales_feb = pd.DataFrame({
    'Trans_ID': [5, 6, 7],
    'Cust_ID': [104, 105, 102],
    'Product_ID': ['P004', 'P001', 'P004'],
    'Quantity': [3, 1, 2]
})

# 5. Regional Targets (Multi-Index Data)
arrays = [
    ['East', 'East', 'West', 'West', 'North', 'North'],
    ['New York', 'Chicago', 'Los Angeles', 'San Francisco', 'Chicago', 'Detroit']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Region', 'City'))
regional_targets = pd.DataFrame({
    'Target_Sales': [50000, 30000, 55000, 40000, 35000, 25000],
    'Actual_Sales': [52000, 28000, 51000, 42000, 36000, 24000]
}, index=index)

In [None]:
sales_feb

Unnamed: 0,Trans_ID,Cust_ID,Product_ID,Quantity
0,5,104,P004,3
1,6,105,P001,1
2,7,102,P004,2


-----

### Part 1: Indexing Practice

1.  **Set Index:** Display the `products` DataFrame. Notice the default 0-3 index. Create a new DataFrame called `products_indexed` by setting the `'Product_ID'` column as the index.

In [None]:
products

Unnamed: 0,Product_ID,Product_Name,Price
0,P001,Laptop,1200
1,P002,Headphones,50
2,P003,Monitor,300
3,P004,Keyboard,80


In [None]:
products_indexed = products.set_index("Product_ID")
products_indexed

Unnamed: 0_level_0,Product_Name,Price
Product_ID,Unnamed: 1_level_1,Unnamed: 2_level_1
P001,Laptop,1200
P002,Headphones,50
P003,Monitor,300
P004,Keyboard,80


In [None]:
# A single reset_index() is enough; calling it twice would add a redundant 'index' column.
products_indexed.reset_index()

Unnamed: 0,Product_ID,Product_Name,Price
0,P001,Laptop,1200
1,P002,Headphones,50
2,P003,Monitor,300
3,P004,Keyboard,80


2.  **Reset Index:** Take your new `products_indexed` DataFrame and reset the index so that `'Product_ID'` becomes a regular column again.
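
As a rough illustration of why swapping the index back and forth is useful, the sketch below looks up a price by `Product_ID` label and then compares `reset_index()` with `reset_index(drop=True)`. The variable names `by_id`, `restored`, and `dropped` are illustrative only and are not part of the activity.

```python
import pandas as pd

# Same product table as in Part 0.
products = pd.DataFrame({
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Product_Name': ['Laptop', 'Headphones', 'Monitor', 'Keyboard'],
    'Price': [1200, 50, 300, 80]
})

# With Product_ID as the index, a row can be looked up by its label.
by_id = products.set_index('Product_ID')
monitor_price = by_id.loc['P003', 'Price']

# reset_index() moves Product_ID back into the columns.
restored = by_id.reset_index()

# drop=True discards the index values instead of keeping them as a column.
dropped = by_id.reset_index(drop=True)

print(monitor_price)
print(list(restored.columns))
print(list(dropped.columns))
```

Note that `drop=True` would lose the product IDs entirely, so it is only appropriate when the index carries no information worth keeping.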

### Part 2: Hierarchical Indexing


3.  **Selection:** Look at the `regional_targets` DataFrame (which already has a MultiIndex: Region and City).
      * Select all data for the `'West'` Region.
      * Select the specific `Actual_Sales` for `'Chicago'` in the `'North'` Region.

In [None]:
regional_targets

Unnamed: 0_level_0,Unnamed: 1_level_0,Target_Sales,Actual_Sales
Region,City,Unnamed: 2_level_1,Unnamed: 3_level_1
East,New York,50000,52000
East,Chicago,30000,28000
West,Los Angeles,55000,51000
West,San Francisco,40000,42000
North,Chicago,35000,36000
North,Detroit,25000,24000


In [None]:
regional_targets.loc["West"]

Unnamed: 0_level_0,Target_Sales,Actual_Sales
City,Unnamed: 1_level_1,Unnamed: 2_level_1
Los Angeles,55000,51000
San Francisco,40000,42000


In [None]:
regional_targets.loc["North", "Actual_Sales"]  # Selects every city in the 'North' region and returns only the Actual_Sales column.

Unnamed: 0_level_0,Actual_Sales
City,Unnamed: 1_level_1
Chicago,36000
Detroit,24000


In [None]:
regional_targets.loc[("North", "Chicago"), "Actual_Sales"]

np.int64(36000)

In [None]:
regional_targets.loc[(slice(None), "Chicago"), "Actual_Sales"]

Unnamed: 0_level_0,Unnamed: 1_level_0,Actual_Sales
Region,City,Unnamed: 2_level_1
East,Chicago,28000
North,Chicago,36000
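
The `slice(None)` selection above can also be written with `DataFrame.xs`, which selects on an inner level by name. The following is a minimal sketch that rebuilds `regional_targets` so it runs on its own; the names `chicago` and `chicago_slice` are illustrative only.

```python
import pandas as pd

# Rebuild the regional_targets table from Part 0.
index = pd.MultiIndex.from_arrays(
    [['East', 'East', 'West', 'West', 'North', 'North'],
     ['New York', 'Chicago', 'Los Angeles', 'San Francisco', 'Chicago', 'Detroit']],
    names=('Region', 'City'))
regional_targets = pd.DataFrame({
    'Target_Sales': [50000, 30000, 55000, 40000, 35000, 25000],
    'Actual_Sales': [52000, 28000, 51000, 42000, 36000, 24000]
}, index=index)

# xs() takes a cross-section on the named level.
# drop_level=False keeps both index levels in the result.
chicago = regional_targets.xs('Chicago', level='City', drop_level=False)

# The slice(None) form: "any Region, City equal to Chicago".
chicago_slice = regional_targets.loc[(slice(None), 'Chicago'), :]

print(chicago)
print(chicago_slice)
```

Both forms return the East and North rows for Chicago; `xs` tends to read more clearly when only one inner level is being filtered.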


4.  **Swapping Levels:** You want to analyze data by City first, rather than Region. Use `swaplevel()` to switch `Region` and `City`, and assign this to a new variable `city_first`.

In [None]:
city_first = regional_targets.swaplevel()
city_first

Unnamed: 0_level_0,Unnamed: 1_level_0,Target_Sales,Actual_Sales
City,Region,Unnamed: 2_level_1,Unnamed: 3_level_1
New York,East,50000,52000
Chicago,East,30000,28000
Los Angeles,West,55000,51000
San Francisco,West,40000,42000
Chicago,North,35000,36000
Detroit,North,25000,24000


5.  **Sorting:** Sort the `city_first` DataFrame by its new outer index (City) to ensure the data is organized alphabetically.

In [None]:
city_first.sort_index(level=0)  # level 0 is City, the new outer level

Unnamed: 0_level_0,Unnamed: 1_level_0,Target_Sales,Actual_Sales
City,Region,Unnamed: 2_level_1,Unnamed: 3_level_1
Chicago,East,30000,28000
Chicago,North,35000,36000
Detroit,North,25000,24000
Los Angeles,West,55000,51000
New York,East,50000,52000
San Francisco,West,40000,42000


In [None]:
city_first.sort_index(level=0, axis=1)  # axis=1 sorts the column labels alphabetically; the row order is unchanged

Unnamed: 0_level_0,Unnamed: 1_level_0,Actual_Sales,Target_Sales
City,Region,Unnamed: 2_level_1,Unnamed: 3_level_1
New York,East,52000,50000
Chicago,East,28000,30000
Los Angeles,West,51000,55000
San Francisco,West,42000,40000
Chicago,North,36000,35000
Detroit,North,24000,25000
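
One reason to sort after swapping levels is that label-range slicing on a MultiIndex expects a sorted index. The sketch below checks sortedness with `is_monotonic_increasing` before and after `sort_index()`, then slices a range of cities. It rebuilds the data so it runs on its own, and `city_sorted` is an illustrative name, not part of the activity.

```python
import pandas as pd

# Rebuild the regional_targets table from Part 0.
index = pd.MultiIndex.from_arrays(
    [['East', 'East', 'West', 'West', 'North', 'North'],
     ['New York', 'Chicago', 'Los Angeles', 'San Francisco', 'Chicago', 'Detroit']],
    names=('Region', 'City'))
regional_targets = pd.DataFrame({
    'Target_Sales': [50000, 30000, 55000, 40000, 35000, 25000],
    'Actual_Sales': [52000, 28000, 51000, 42000, 36000, 24000]
}, index=index)

# swaplevel() reorders the levels but leaves the rows where they were.
city_first = regional_targets.swaplevel()
sorted_before = city_first.index.is_monotonic_increasing

# sort_index() with no arguments sorts by City, then by Region.
city_sorted = city_first.sort_index()
sorted_after = city_sorted.index.is_monotonic_increasing

# On the sorted index, a label range on the outer level is inclusive on both ends.
chicago_to_detroit = city_sorted.loc['Chicago':'Detroit']

print(sorted_before, sorted_after)
print(chicago_to_detroit)
```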


### Part 3: Concatenating (Stacking Data)


6.  **Basic Concatenation:** You have separate DataFrames for January (`sales_jan`) and February (`sales_feb`). Concatenate them into a single DataFrame called `all_sales` containing all transactions. Reset the index of the combined DataFrame so that it runs sequentially from 0 to 6.


In [None]:
all_sales = pd.concat([sales_jan, sales_feb], ignore_index=True)
all_sales

Unnamed: 0,Trans_ID,Cust_ID,Product_ID,Quantity
0,1,101,P001,1
1,2,102,P002,2
2,3,101,P003,1
3,4,103,P002,5
4,5,104,P004,3
5,6,105,P001,1
6,7,102,P004,2


7.  **Concatenation with Keys:** Concatenate the two months again, but this time use the `keys` parameter to label the rows as `'Jan'` and `'Feb'`. Store this in `sales_with_keys` and display it.


In [None]:
sales_with_keys = pd.concat([sales_jan, sales_feb], keys=["Jan", "Feb"])
sales_with_keys

Unnamed: 0,Unnamed: 1,Trans_ID,Cust_ID,Product_ID,Quantity
Jan,0,1,101,P001,1
Jan,1,2,102,P002,2
Jan,2,3,101,P003,1
Jan,3,4,103,P002,5
Feb,0,5,104,P004,3
Feb,1,6,105,P001,1
Feb,2,7,102,P004,2
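
The keys become the outer level of a MultiIndex, so each month can be pulled back out by label. As a sketch, the example below also names that level with the `names` parameter and then turns it into an ordinary column. The level name `'Month'` and the variables `feb_only` and `flat` are illustrative choices, not part of the activity.

```python
import pandas as pd

# Monthly sales tables from Part 0.
sales_jan = pd.DataFrame({
    'Trans_ID': [1, 2, 3, 4],
    'Cust_ID': [101, 102, 101, 103],
    'Product_ID': ['P001', 'P002', 'P003', 'P002'],
    'Quantity': [1, 2, 1, 5]
})
sales_feb = pd.DataFrame({
    'Trans_ID': [5, 6, 7],
    'Cust_ID': [104, 105, 102],
    'Product_ID': ['P004', 'P001', 'P004'],
    'Quantity': [3, 1, 2]
})

# names=['Month'] labels the outer index level created by keys.
sales_with_keys = pd.concat([sales_jan, sales_feb],
                            keys=['Jan', 'Feb'], names=['Month'])

# Select one month's rows by its key.
feb_only = sales_with_keys.loc['Feb']

# Move the Month level into a column, then renumber the rows 0..6.
flat = sales_with_keys.reset_index(level='Month').reset_index(drop=True)

print(feb_only)
print(flat)
```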


### Part 4: Merging (Database-Style Joins)




8.  **Inner Join:** Merge your `all_sales` DataFrame (from Q6) with the `customers` DataFrame.
      * Join them on the `'Cust_ID'` column.
      * *Note: Ensure only records with matching Customer IDs in both tables are kept.*

In [None]:
pd.merge(all_sales, customers, on="Cust_ID", how="inner")

Unnamed: 0,Trans_ID,Cust_ID,Product_ID,Quantity,Name,Region
0,1,101,P001,1,Alice Smith,East
1,2,102,P002,2,Bob Jones,West
2,3,101,P003,1,Alice Smith,East
3,4,103,P002,5,Charlie Brown,East
4,5,104,P004,3,Dana White,North
5,6,105,P001,1,Evan Lee,South
6,7,102,P004,2,Bob Jones,West
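
In this dataset every sale has a matching customer, so the inner join drops nothing. To see what an inner join would discard, the sketch below uses a small sales table with one hypothetical `Cust_ID` (999) that is not in `customers`, and audits the match with `indicator=True` on an outer join. The table `sales` and the ID 999 are assumptions made for illustration.

```python
import pandas as pd

# Customers table from Part 0.
customers = pd.DataFrame({
    'Cust_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Dana White', 'Evan Lee'],
    'Region': ['East', 'West', 'East', 'North', 'South']
})

# Hypothetical sales: Cust_ID 999 does not exist in customers.
sales = pd.DataFrame({
    'Trans_ID': [1, 2, 8],
    'Cust_ID': [101, 102, 999],
    'Product_ID': ['P001', 'P002', 'P003'],
    'Quantity': [1, 2, 1]
})

# The inner join silently drops the unmatched sale.
inner = pd.merge(sales, customers, on='Cust_ID', how='inner')

# An outer join with indicator=True adds a '_merge' column showing
# whether each row came from the left table, the right table, or both.
audit = pd.merge(sales, customers, on='Cust_ID', how='outer', indicator=True)
unmatched_sales = audit[audit['_merge'] == 'left_only']

print(len(inner))
print(unmatched_sales[['Trans_ID', 'Cust_ID']])
```

Rows marked `right_only` in `audit` are customers with no sales in the sample table.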


9.  **Left Join (Enriching Data):** Merge `all_sales` with `products` to get the Product Names and Prices for every transaction.
      * Use a `left` join with `all_sales` as the left table so that every sales record is kept, even if its product ID is missing from the products table (in this clean data, every ID matches).

In [None]:
# all_sales must be the left table so that every transaction is kept.
full_data = pd.merge(all_sales, products, how="left", on="Product_ID")
full_data

Unnamed: 0,Trans_ID,Cust_ID,Product_ID,Quantity,Product_Name,Price
0,1,101,P001,1,Laptop,1200
1,2,102,P002,2,Headphones,50
2,3,101,P003,1,Monitor,300
3,4,103,P002,5,Headphones,50
4,5,104,P004,3,Keyboard,80
5,6,105,P001,1,Laptop,1200
6,7,102,P004,2,Keyboard,80


10. **Analysis:** Calculate the **Total Revenue** for each transaction (hint: multiply `Quantity` by `Price` after your merge in Q9).

In [None]:
full_data["Revenue"] = full_data["Quantity"] * full_data["Price"]
full_data

Unnamed: 0,Trans_ID,Cust_ID,Product_ID,Quantity,Product_Name,Price,Revenue
0,1,101,P001,1,Laptop,1200,1200
1,2,102,P002,2,Headphones,50,100
2,3,101,P003,1,Monitor,300,300
3,4,103,P002,5,Headphones,50,250
4,5,104,P004,3,Keyboard,80,240
5,6,105,P001,1,Laptop,1200,1200
6,7,102,P004,2,Keyboard,80,160
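
To illustrate what the left join protects against, the sketch below adds one sale with a hypothetical product ID (`'P999'`) that is missing from `products`. The sale is kept, but its `Price` and `Revenue` come back as `NaN`. The `sales` table and the ID `'P999'` are assumptions made for illustration; `validate='many_to_one'` is an optional check that each product ID appears only once in the right table.

```python
import pandas as pd

# Products table from Part 0.
products = pd.DataFrame({
    'Product_ID': ['P001', 'P002', 'P003', 'P004'],
    'Product_Name': ['Laptop', 'Headphones', 'Monitor', 'Keyboard'],
    'Price': [1200, 50, 300, 80]
})

# Hypothetical sales: 'P999' does not exist in products.
sales = pd.DataFrame({
    'Trans_ID': [1, 2, 9],
    'Product_ID': ['P001', 'P002', 'P999'],
    'Quantity': [1, 2, 4]
})

# Sales is the left table, so all three transactions survive the merge.
enriched = pd.merge(sales, products, on='Product_ID', how='left',
                    validate='many_to_one')

# Revenue is NaN wherever the price could not be looked up.
enriched['Revenue'] = enriched['Quantity'] * enriched['Price']
missing = enriched[enriched['Price'].isna()]

print(enriched)
print(missing['Product_ID'].tolist())
```

Checking for `NaN` prices after a left join is a quick way to spot product IDs that need to be added to the products table.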
