
# Lab 5
**Duration:** ~2 hours

**Topics:**  
- `groupby()` with `mean/sum/count`; `agg()` multi-aggregation; multi-index results  
- `pivot_table()`; merging/joining DataFrames  
- Load & explore; clean missing/inconsistent data; derive new columns  
- Use `groupby`, filtering, sorting; visualize with Pandas `.plot()`; save outputs  
- Plots: line/bar/hist/pie via `.plot()`; customize labels/titles; plot from groupby  
- Intro to `matplotlib.pyplot` (`plot`, `bar`, `scatter`); styling & basic subplots

**Instructions:** For each task, read the markdown cell and implement your solution in the empty code cell directly below it.


# Case Study: Myntra Pants Dataset — Load & Explore

**Exercise 1:** Load the dataset from `myntra_dataset_ByScraping.csv` into `df`. Show `df.shape`, `df.head(3)`, and list column names.

**Exercise 2:** Run `df.info()` and `df.describe(include=['number'])`. Briefly inspect the likely numeric columns (`price`, `MRP`, `discount_percent`, `ratings`, `number_of_ratings`).

# Cleaning: Missing & Inconsistent Data

**Exercise 1:** Display total missing values per column with `df.isnull().sum()`. Identify columns with missing values.

**Exercise 2:** Ensure numeric columns are numeric: cast `price`, `MRP`, `discount_percent`, `ratings`, `number_of_ratings` to numeric (coerce errors).
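As a hedged illustration of the coercion step, the sketch below applies `pd.to_numeric(..., errors="coerce")` to a tiny made-up frame. The sample values are assumptions for demonstration only and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped data.
df = pd.DataFrame({
    "price": ["799", "1,299", "n/a"],
    "MRP": ["1999", "2599", "1499"],
    "ratings": ["4.1", "", "3.8"],
})

numeric_cols = ["price", "MRP", "ratings"]
for col in numeric_cols:
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
print(df)
```

Note that a value such as `"1,299"` becomes NaN here. If the real columns contain thousands separators or currency symbols, strip them before converting.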

**Exercise 3:** Handle missing numeric data: fill `ratings` with its median and `number_of_ratings` with 0. Drop rows where `price` or `MRP` is missing.
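A minimal sketch of the fill-and-drop pattern described above, run on a small made-up frame (the values are assumptions, not real data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in every column.
df = pd.DataFrame({
    "price": [799.0, np.nan, 1299.0, 999.0],
    "MRP": [1999.0, 2599.0, np.nan, 1499.0],
    "ratings": [4.0, np.nan, 3.0, 5.0],
    "number_of_ratings": [120.0, np.nan, 15.0, np.nan],
})

# Median is computed from the non-missing ratings only.
df["ratings"] = df["ratings"].fillna(df["ratings"].median())
# A product with no recorded rating count is treated as having zero ratings.
df["number_of_ratings"] = df["number_of_ratings"].fillna(0)
# Rows without a usable price or MRP cannot be analysed, so drop them.
df = df.dropna(subset=["price", "MRP"]).reset_index(drop=True)

print(df)
```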

**Exercise 4:** Fix inconsistencies: ensure `price` and `MRP` are positive by replacing negative values with NaN and then dropping those rows. Clip `discount_percent` to the range 0 to 1, then recompute a clean `net_discount = (MRP - price) / MRP`.
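One possible way to approach this consistency check is sketched below on made-up values. Treating zero as invalid (not only negatives) is an assumption made here so that the division by `MRP` is always defined.

```python
import pandas as pd

# Hypothetical sample: one negative price, one out-of-range discount.
df = pd.DataFrame({
    "price": [800.0, -50.0, 1200.0],
    "MRP": [2000.0, 1500.0, 1600.0],
    "discount_percent": [0.6, 0.2, 1.4],
})

# Keep values that are > 0; everything else becomes NaN.
for col in ["price", "MRP"]:
    df[col] = df[col].where(df[col] > 0)
df = df.dropna(subset=["price", "MRP"]).reset_index(drop=True)

# Force the stored discount into [0, 1], then recompute it from the prices.
df["discount_percent"] = df["discount_percent"].clip(lower=0, upper=1)
df["net_discount"] = (df["MRP"] - df["price"]) / df["MRP"]

print(df)
```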

# Deriving New Columns

**Exercise 1:** Create `discount_amount = MRP - price`.

**Exercise 2:** Bucket `ratings` into categories with `pd.cut`: ['low', 'mid', 'high'] using bins [0,3.5,4.2,5]. Store in `rating_band`.
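The binning can be sketched as follows on a few made-up ratings. With the default `right=True`, each bin includes its right edge, so 3.5 falls in 'low' and 4.2 in 'mid'.

```python
import pandas as pd

# Hypothetical ratings, including values sitting exactly on bin edges.
df = pd.DataFrame({"ratings": [2.9, 3.5, 3.9, 4.2, 4.6, 5.0]})

# Bins are (0, 3.5], (3.5, 4.2], (4.2, 5].
df["rating_band"] = pd.cut(
    df["ratings"],
    bins=[0, 3.5, 4.2, 5],
    labels=["low", "mid", "high"],
)

print(df)
```

A rating of exactly 0 would fall outside the first bin and become NaN; passing `include_lowest=True` is one way to keep it.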

**Exercise 3:** Create `review_band` from `number_of_ratings` using `pd.qcut` into 4 quantiles (Q1..Q4).
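A short sketch of quantile bucketing with `pd.qcut`, using made-up counts with no ties so that each quartile receives the same number of rows:

```python
import pandas as pd

# Hypothetical rating counts; all values are distinct.
df = pd.DataFrame({"number_of_ratings": [5, 12, 30, 48, 75, 120, 300, 950]})

# q=4 splits the column at its 25th, 50th and 75th percentiles.
df["review_band"] = pd.qcut(
    df["number_of_ratings"],
    q=4,
    labels=["Q1", "Q2", "Q3", "Q4"],
)

print(df)
print(df["review_band"].value_counts().sort_index())
```

If the real column has many repeated values (for instance, many zeros after filling missing counts), the quantile edges may coincide and `pd.qcut` raises a `ValueError`. Ranking the column first with `rank(method="first")` is one common workaround.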

**Exercise 4:** Create a simple `value_score = ratings * (1 + net_discount)`.

# GroupBy, Aggregations, Multi-index

**Exercise 1:** Group by `brand_name` and compute `mean price`, `mean ratings`, and `count` of products.
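A minimal sketch of this per-brand summary using named aggregation on made-up rows. The result name `gb_brand` and the columns `mean_price` and `count_items` follow the names used in the plotting section; `mean_ratings` is an assumed name.

```python
import pandas as pd

# Hypothetical sample: three brands with different product counts.
df = pd.DataFrame({
    "brand_name": ["A", "A", "B", "B", "B", "C"],
    "price": [800.0, 1200.0, 500.0, 700.0, 900.0, 1500.0],
    "ratings": [4.0, 4.4, 3.5, 3.9, 4.0, 4.8],
})

# Named aggregation: output_column=(input_column, function).
gb_brand = df.groupby("brand_name").agg(
    mean_price=("price", "mean"),
    mean_ratings=("ratings", "mean"),
    count_items=("price", "count"),
)

print(gb_brand)
```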

**Exercise 2:** Use `agg()` for multi-aggregation on `price` and `ratings` grouped by `brand_name` (mean, median, std).

**Exercise 3:** Create a **multi-index** groupby by `brand_name` and `rating_band`, aggregating `price` mean and product count. Display the first 10 rows.
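Grouping by two keys yields a two-level row index. The sketch below shows this on made-up rows; `gb_multi` is an assumed name, and `rating_band` is stored as plain strings here for simplicity.

```python
import pandas as pd

# Hypothetical sample with a precomputed rating_band column.
df = pd.DataFrame({
    "brand_name": ["A", "A", "A", "B", "B"],
    "rating_band": ["mid", "mid", "high", "low", "high"],
    "price": [800.0, 1000.0, 1500.0, 400.0, 1200.0],
})

gb_multi = df.groupby(["brand_name", "rating_band"]).agg(
    mean_price=("price", "mean"),
    count_items=("price", "count"),
)
print(gb_multi.head(10))

# Select every band for one brand via the outer index level.
print(gb_multi.loc["A"])

# reset_index() turns both index levels back into ordinary columns.
flat = gb_multi.reset_index()
print(flat)
```

When `rating_band` is a categorical column (as produced by `pd.cut`), passing `observed=True` to `groupby` keeps only the combinations that actually occur.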

# Pivot Tables

**Exercise 1:** Build a pivot table with index=`brand_name`, columns=`rating_band`, values=`price`, `aggfunc='mean'`.
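The pivot described above can be sketched on made-up rows as follows; `pivot_price` is an assumed name. Brand and band combinations with no products show up as NaN.

```python
import pandas as pd

# Hypothetical sample with a precomputed rating_band column.
df = pd.DataFrame({
    "brand_name": ["A", "A", "A", "B", "B"],
    "rating_band": ["mid", "mid", "high", "low", "high"],
    "price": [800.0, 1000.0, 1500.0, 400.0, 1200.0],
})

# Rows are brands, columns are rating bands, cells hold the mean price.
pivot_price = df.pivot_table(
    index="brand_name",
    columns="rating_band",
    values="price",
    aggfunc="mean",
)

print(pivot_price)
```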

**Exercise 2:** Create a pivot table of product counts by `brand_name` (rows) and `review_band` (columns).

# Merging / Joining DataFrames

**Exercise 1:** Create a small mapping DataFrame `brand_segment` with two segments: mark the **top 5 brands by product count** as 'Top', others as 'Other'. Merge it back to `df` on `brand_name`.
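One way to build the mapping table and merge it back is sketched below on made-up brands. The `price` values are placeholders, and `validate="many_to_one"` is an optional safety check assumed here to confirm that each brand appears only once in the mapping.

```python
import pandas as pd

# Hypothetical sample: seven brands with product counts 4, 3, 3, 2, 2, 1, 1.
brands = ["A"] * 4 + ["B"] * 3 + ["C"] * 3 + ["D"] * 2 + ["E"] * 2 + ["F", "G"]
df = pd.DataFrame({"brand_name": brands, "price": range(len(brands))})

# value_counts() sorts descending, so the first five entries are the top five.
counts = df["brand_name"].value_counts()
top_brands = counts.head(5).index

# One row per brand, labelled 'Top' or 'Other'.
brand_segment = pd.DataFrame({"brand_name": counts.index})
brand_segment["segment"] = (
    brand_segment["brand_name"].isin(top_brands).map({True: "Top", False: "Other"})
)

# A left merge keeps every product row and attaches its brand's segment.
df = df.merge(brand_segment, on="brand_name", how="left", validate="many_to_one")

print(df["segment"].value_counts())
```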

**Exercise 2:** Compute a summary DataFrame `seg_summary` = mean `price` and mean `ratings` by `segment`, and sort by mean price desc.

# Filtering & Sorting

**Exercise 1:** Filter products with `value_score >= 4.0` and `net_discount >= 0.3`. Sort by `value_score` desc; show top 10.
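A minimal sketch of combining two conditions in a boolean mask and then sorting. The `product` column and the values are made up for illustration, and `top_value` is an assumed name.

```python
import pandas as pd

# Hypothetical sample with precomputed value_score and net_discount.
df = pd.DataFrame({
    "product": ["p1", "p2", "p3", "p4"],
    "value_score": [5.2, 3.9, 4.0, 6.0],
    "net_discount": [0.40, 0.50, 0.30, 0.10],
})

# Each condition needs its own parentheses when combined with &.
mask = (df["value_score"] >= 4.0) & (df["net_discount"] >= 0.3)
top_value = df[mask].sort_values("value_score", ascending=False).head(10)

print(top_value)
```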

# Visualizing with Pandas .plot() (line, bar, hist, pie); Plotting from groupby

**Exercise 1:** From `gb_brand` (the per-brand groupby summary, with columns such as `mean_price` and `count_items`), plot a **bar chart** of the top 10 brands by `count_items`. Add axis labels and a title.

**Exercise 2:** Plot a **histogram** of `net_discount` (bins=20). Label axes and add a title.

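**Exercise 3:** Draw a **pie chart** of product counts by `segment` with percentage labels. If `seg_summary` holds only the means, compute the per-segment count first (for example, with `value_counts()` on `segment`).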

**Exercise 4:** Build a **line plot** showing mean `price` per `rating_band` (order low→mid→high).

**Exercise 5:** Plot directly from groupby: group by `brand_name` and plot mean `ratings` (top 10 brands by count). Use a horizontal bar chart.
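Plotting straight from a groupby result might look like the sketch below, which assumes matplotlib is installed and uses made-up rows. The labels and title text are placeholders.

```python
import pandas as pd

# Hypothetical sample: three brands with different product counts.
df = pd.DataFrame({
    "brand_name": ["A", "A", "B", "B", "B", "C"],
    "ratings": [4.0, 4.4, 3.5, 3.9, 4.0, 4.8],
})

# Brands ranked by number of products; keep at most ten.
top_brands = df["brand_name"].value_counts().head(10).index

mean_ratings = (
    df[df["brand_name"].isin(top_brands)]
    .groupby("brand_name")["ratings"]
    .mean()
    .sort_values()
)

# Series.plot returns a matplotlib Axes that can be customised further.
ax = mean_ratings.plot(kind="barh", title="Mean rating, top brands by product count")
ax.set_xlabel("Mean rating")
ax.set_ylabel("Brand")
```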

# matplotlib.pyplot Basics: plot(), bar(), scatter(); Styling & Subplots

**Exercise 1:** Using matplotlib directly, create a **scatter** of `price` vs `ratings` for 500 random samples (if available). Add labels/title and alpha for visibility.
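The "500 random samples (if available)" condition can be handled by capping the sample size at the number of rows, as in this sketch on randomly generated placeholder data (not the real dataset):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 120 rows, fewer than the 500 requested.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "price": rng.uniform(400, 3000, size=120),
    "ratings": rng.uniform(3.0, 5.0, size=120),
})

# Sample 500 rows, or every row when the frame is smaller than that.
n = min(500, len(df))
sample = df.sample(n=n, random_state=42)

fig, ax = plt.subplots(figsize=(6, 4))
# alpha < 1 makes overlapping points easier to see.
ax.scatter(sample["price"], sample["ratings"], alpha=0.4)
ax.set_xlabel("Price")
ax.set_ylabel("Rating")
ax.set_title("Price vs rating (random sample)")
```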

**Exercise 2:** Create a **bar chart** with matplotlib: show counts of products per `segment`.

**Exercise 3:** Create a 1x2 **subplot** figure: (left) histogram of `ratings`; (right) histogram of `price` (use 20 bins). Add overall suptitle.
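A sketch of the 1x2 layout using `plt.subplots`, again on randomly generated placeholder data. Using 20 bins for both panels is an assumption, and the title strings are placeholders.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data standing in for the cleaned dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ratings": rng.uniform(3.0, 5.0, size=200),
    "price": rng.uniform(400, 3000, size=200),
})

# One row, two columns; each Axes is drawn on independently.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

ax_left.hist(df["ratings"], bins=20)
ax_left.set_title("Ratings")
ax_left.set_xlabel("Rating")
ax_left.set_ylabel("Number of products")

ax_right.hist(df["price"], bins=20)
ax_right.set_title("Price")
ax_right.set_xlabel("Price")

# suptitle sits above both panels; tight_layout reduces overlap.
fig.suptitle("Distributions of ratings and price")
fig.tight_layout()
```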

# Save Cleaned Data & Figures

**Exercise 1:** Save the cleaned DataFrame to a CSV file on your local system or drive.
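A minimal sketch of writing a frame to CSV and reading it back to confirm the round trip. The file name `myntra_cleaned.csv` and the sample rows are assumptions; substitute your own path.

```python
import pandas as pd

# Hypothetical cleaned frame.
df = pd.DataFrame({
    "brand_name": ["A", "B"],
    "price": [800.0, 1200.0],
    "ratings": [4.0, 4.4],
})

out_path = "myntra_cleaned.csv"  # hypothetical file name
# index=False avoids writing the row index as an extra unnamed column.
df.to_csv(out_path, index=False)

# Read the file back as a quick sanity check.
check = pd.read_csv(out_path)
print(check.shape)
```

A matplotlib figure can be saved in a similar way by calling `savefig` on the figure object with a file name of your choice.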