## 1. Introduction

---

### Q1. What is the general area of study (the topics of the data)?

**Answer:**
The general area of study revolves around consumer purchase behavior during the Black Friday sales period. The dataset provides detailed transaction records and demographic information of customers, allowing us to explore spending patterns, demographic correlations, and co-purchase relationships. This is particularly relevant for domains such as retail analytics, consumer behavior modeling, and marketing strategy optimization.

---

### Q2. What specific data sources are used in this project?

**Answer:**
The project utilizes a single dataset known as the **Black Friday Sales Dataset**. It is a publicly available dataset often used for practice in exploratory data analysis and visualization tasks. The dataset includes detailed records of purchases made during Black Friday by customers, including demographic features and product categories.

---

### Q3. What is the dataset about, and how was it gathered?

**Answer:**
The dataset is about consumer transactions during Black Friday from a retail store. It includes approximately 550,000 rows and covers the following details for each transaction:
- **User demographics:** Gender, Age, Occupation, City Category, Stay in Current City Years, Marital Status
- **Product details:** Product IDs and 3 levels of product category grouping
- **Transaction details:** Purchase amount for each product

Though the exact source of the dataset is not explicitly documented, it is widely circulated across open data platforms and data science learning portals (like Kaggle). It is assumed to be a synthetic or anonymized dataset designed to reflect realistic customer behavior in a retail context.

---

### Q4. How can the data source be accessed?

**Answer:**
The dataset is publicly available and can be accessed from multiple open platforms. In this project, I have used a pre-cleaned version titled **train.csv**, which was directly uploaded into the Jupyter Notebook environment. Since the dataset is less than 2MB, it is stored and shared locally within the project files. A common reference link to the dataset is below:
- [Kaggle Black Friday Dataset](https://www.kaggle.com/datasets/sdolezel/black-friday)

---

### Q5. Who is the client and what is their interest in the data?

**Answer:**
The **hypothetical client** for this project is a **retail data analytics team** within a mid-to-large scale e-commerce company. Their interest lies in uncovering **insights about consumer purchasing behavior during high-volume sale events** like Black Friday. Their key objectives are to:
- Identify high-value customer segments
- Understand product co-purchase relationships
- Optimize marketing and targeting strategies
- Gain visual clarity over demographic spending patterns

This client values intuitive, data-driven visuals that can inform decisions related to **inventory management**, **personalized recommendations**, and **campaign planning**.


## 2. Dataset Details

---

### Q1. Was any preliminary examination or cleaning performed on the data?

**Answer:**
Yes, a preliminary examination was conducted to assess data quality and ensure compatibility with the planned visualizations. The dataset was generally clean, but the following steps were performed:

- Missing values were observed in `Product_Category_2` and `Product_Category_3`. These columns were left unimputed and simply ignored in visualizations that did not require them.
- Columns were renamed and reformatted where necessary for clarity (e.g., converting categorical codes to readable labels).
- Some new features were synthesized (e.g., Age groups) to support specific visual idioms like grouped heatmaps.
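
The steps above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the column names follow the Kaggle file, while the helper `clean`, the label mappings, and the coarser age grouping are assumptions made for the example.

```python
import pandas as pd

# Hypothetical label mappings; the project's actual labels may differ.
CITY_LABELS = {"A": "City A", "B": "City B", "C": "City C"}
MARITAL_LABELS = {0: "Single", 1: "Married"}
# Assumed coarser grouping derived from the dataset's Age ranges.
AGE_GROUPS = {
    "0-17": "Under 26", "18-25": "Under 26",
    "26-35": "26-45", "36-45": "26-45",
    "46-50": "46+", "51-55": "46+", "55+": "46+",
}

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with readable labels and a derived Age_Group column."""
    out = df.copy()
    out["City_Label"] = out["City_Category"].map(CITY_LABELS)
    out["Marital_Label"] = out["Marital_Status"].map(MARITAL_LABELS)
    out["Age_Group"] = out["Age"].map(AGE_GROUPS)
    # Product_Category_2/3 keep their NaN values: charts that do not
    # need these columns ignore them instead of imputing anything.
    return out

# Tiny made-up sample, only to exercise the helper.
sample = pd.DataFrame({
    "User_ID": [1, 2],
    "Age": ["26-35", "55+"],
    "City_Category": ["A", "C"],
    "Marital_Status": [0, 1],
    "Product_Category_1": [1, 5],
    "Product_Category_2": [None, 8.0],
    "Purchase": [8370, 15200],
})
cleaned = clean(sample)
missing = cleaned[["Product_Category_2"]].isna().sum()
```

Working on a copy keeps the raw frame intact, so the same source data can feed every visualization.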

---

### Q2. Is further description of the data necessary?

**Answer:**
Yes, for context, the dataset contains over **550,000** purchase records made by customers during a Black Friday sale event. Each row represents a single product purchase and includes:
- **User-level attributes** (demographic and location)
- **Product-level attributes** (category and ID)
- **Purchase amount**

This allows multi-dimensional analysis, such as how spending varies by age, occupation, city category, and product groupings.

---

### Q3. What is the complete Munzner WHAT analysis for all data items used in the project?

| **Attribute**           | **Attribute Type**          | **Semantics**                                                                 | **Additional Notes**                                                                 |
|------------------------|-----------------------------|-------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|
| `User_ID`              | Categorical (Nominal)        | Anonymized unique user ID                                                    | Used only to count unique users for metrics like "Avg Purchase per User"            |
| `Gender`               | Categorical (Binary)         | Male or Female                                                               | Used in filtering for scatter plot visualization                                    |
| `Age`                  | Categorical (Ordered)        | User age grouped into ranges (e.g., 18–25, 26–35)                             | Used to define X-axis in heatmaps and bar charts                                    |
| `Occupation`           | Categorical (Nominal)        | Encoded integer representing job category                                    | Used in bubble chart to measure average purchase by occupation        |
| `City_Category`        | Categorical (Nominal)        | City classification A/B/C                                                    | Used as a Y-axis grouping in heatmaps                    |
| `Stay_In_Current_City_Years` | Categorical (Ordered) | How long the customer has lived in the current city                          | Not directly visualized but could be used for filtering or time-based trends        |
| `Marital_Status`       | Categorical (Binary)         | 0 = Single, 1 = Married                                                       | Could be used in scatter plots to compare spending patterns between marital groups           |
| `Product_ID`           | Categorical (Nominal)        | Unique product identifier                                                     | Used only in raw aggregations; not visualized directly                              |
| `Product_Category_1`   | Categorical (Nominal)        | First-level category of the product                                           | Used in co-purchase network visualizations                              |
| `Purchase`             | Quantitative (Ordered, Ratio) | Monetary value of the purchase (in currency units)                           | Main quantitative variable for all visualizations                                   |
| `Synthetic_Date` (created) | Quantitative (Ordered, Temporal) | Artificially generated date values to simulate timeline-based analysis | Could be used for time-series visualization to demonstrate purchase trends over time         |

---

### Q4. Were there any additional curation steps or techniques used?

**Answer:**
Yes, several data curation steps were performed:

- **Synthetic Dates:** I experimented with a new `Synthetic_Date` column, generated by random sampling with temporal constraints, to simulate daily purchase trends. This would have allowed a time-series visualization even though the original dataset lacks a temporal dimension, but the column was not used in the final visualizations.
- **User-Level Aggregation:** For average purchase analysis, purchases were grouped by user and combined with demographic data to compute per-user statistics.
- **Co-Purchase Network Preparation:** A JSON network file was created to show frequency of product category co-purchases using node-link and chord diagram formats. This required building an adjacency matrix from transactions with multiple products.
- **Normalization:** Where needed (e.g., average purchase by occupation), metrics were normalized by dividing the total purchase of a group by the number of users in that group.
- **Dropdown and Interactivity:** In some visualizations (e.g., heatmap and chord), dropdown filters and hover-based tooltips were implemented for interactivity and clarity.
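
Two of the curation steps above, synthetic dates and per-user normalization, can be sketched as below. This is an illustration under stated assumptions: the function names, the one-week date window, and the fixed seed are hypothetical, and the report does not specify which temporal constraints were actually used.

```python
import numpy as np
import pandas as pd

def add_synthetic_dates(df, start="2023-11-20", n_days=7, seed=0):
    """Attach a random date within [start, start + n_days) to each row.

    The window and seed are assumptions made for this sketch.
    """
    rng = np.random.default_rng(seed)
    offsets = rng.integers(0, n_days, size=len(df))  # whole-day offsets
    out = df.copy()
    out["Synthetic_Date"] = pd.Timestamp(start) + pd.to_timedelta(offsets, unit="D")
    return out

def avg_purchase_per_user(df, group_col):
    """Total purchase in each group divided by its number of unique users."""
    grouped = df.groupby(group_col).agg(
        total=("Purchase", "sum"),
        users=("User_ID", "nunique"),
    )
    return grouped["total"] / grouped["users"]

# Made-up rows: users 1 and 2 share occupation 4, user 3 has occupation 7.
sample = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Occupation": [4, 4, 4, 7],
    "Purchase": [100, 300, 200, 500],
})
dated = add_synthetic_dates(sample)
avg = avg_purchase_per_user(sample, "Occupation")
```

Dividing by unique users rather than by row count keeps a single heavy shopper from inflating the average for their group.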

---



## 3. Project Goals and Objectives

---

### Q1. What are the key questions posed on behalf of the client?

**Answer:**
To support the goals of a retail data analyst (the imagined client), I formulated the following three core questions. Each question addresses a specific decision-making need and directly connects to one of the final visualizations created for the project.

---

#### 🔹 Question 1:
**Which customer segments (based on age and city) contribute the most to sales performance?**

- *Client motivation:* Identify high-value demographic combinations to optimize marketing strategies and tailor regional promotions.
- *Visualization Link:* **Interactive Heatmap** (Age × City) showing Total Purchase, Avg Purchase/User, and User Count

---

#### 🔹 Question 2:
**Which product categories are both popular and profitable based on user-level spending behavior?**

- *Client motivation:* Determine which categories generate the highest return and understand their user engagement. This is important to inform stock decisions, bundling, or promotions.
- *Visualization Link:* **Bubble Chart** — encodes Avg Purchase/User (Y-axis) and User Count (bubble size) per Product Category

---

#### 🔹 Question 3:
**Which product categories are frequently bought together, and can this be used to inform bundling strategies?**

- *Client motivation:* Identify co-purchase patterns that can support combo offers, cross-promotions, or upsell opportunities.
- *Visualization Link:* **Chord Diagram** — shows co-purchase frequency between categories through ribbon thickness

---



### Q2. What is the Munzner-style WHY analysis for each question?

---

####  For Question 1:
 *Which customer segments (based on age and city) contribute the most to sales performance?*

| **Action**   | **Target**           | **Level**     | **Explanation**                                                                 |
|--------------|----------------------|----------------|----------------------------------------------------------------------------------|
| **Compare**  | Attribute values     | Group          | Compare city categories and age groups in terms of total/average spending       |
| **Discover** | Trends               | Grouped        | Reveal which demographic combinations are most valuable                         |
| **Query**    | Specific segment     | Cell           | Use dropdown and hover to explore exact numbers by metric and subgroup          |

---

####  For Question 2:
 *Which product categories are both popular and profitable based on user-level spending behavior?*

| **Action**    | **Target**          | **Level**       | **Explanation**                                                                  |
|---------------|---------------------|------------------|-----------------------------------------------------------------------------------|
| **Compare**   | Product categories   | Individual/Group | View how avg purchase differs across product categories                          |
| **Identify**  | High-value outliers  | Individual       | Spot categories with high avg spend but low user count                           |
| **Summarize** | Engagement level     | Aggregate        | Bubble size encodes popularity (user count) for product-level comparison         |

---

####  For Question 3:
 *Which product categories are frequently bought together, and can this be used to inform bundling strategies?*

| **Action**    | **Target**             | **Level**         | **Explanation**                                                              |
|---------------|------------------------|--------------------|-------------------------------------------------------------------------------|
| **Discover**  | Relationships          | Pairwise           | Show which category pairs are most frequently co-purchased                    |
| **Compare**   | Co-purchase strength   | Pairwise           | Use ribbon thickness to compare link frequency                               |
| **Query**     | Category focus filter  | Filtered pairs     | Dropdown filter enables focusing on one category’s relationships             |
| **Present**   | Network structure      | All pair groups    | Circular layout maps the full product-category co-purchase ecosystem         |

---


###  Q3. How should the evaluator understand these action/target pairs?

**Answer:**
Each action/target pair in the WHY analysis is designed to support **real-world analytical behavior**, directly tied to the client’s goals.

- **Compare** enables side-by-side evaluation of customer groups and product categories — this is clearly seen in the heatmap (age vs. city) and bubble chart (category vs. avg spend).
- **Identify** allows the detection of outlier behaviors — for example, categories with **high avg spend but low engagement** (bubble chart), or standout city-age combinations (heatmap).
- **Query** represents interactive exploration — users can change the displayed metric in the heatmap or focus on a specific category in the chord diagram using dropdown filters.
- **Summarize** and **Present** are used to distill larger patterns — such as overall product category strength or visible connections in the chord diagram.

These actions and targets are **intentionally layered** across the visualizations to offer a **multi-level analysis** approach — allowing the client to both **zoom in** on specific behaviors and **zoom out** to see trends.

---

### Q4. Additional insights into project objectives?

**Answer:**
The three questions also map onto three levels of analytics:

- **Descriptive** (What happened?)  
   e.g., “Which city-age combinations have the highest spending?”  
   Visualized through the annotated heatmap

- **Diagnostic** (Why is it happening?)  
   e.g., “Which categories are profitable but under-purchased?”  
   Visualized through the bubble chart that compares avg spend vs. engagement

- **Prescriptive** (What to do next?)  
   e.g., “Which categories should be bundled together?”  
   Visualized through the chord diagram highlighting strong co-purchase relationships

This structure ensures the project delivers insights that are not only visually clear but also **strategically actionable**, helping the client move from understanding behavior to making better business decisions.

---



## 4. Initial Data Analysis

---

###  Overview

Before designing final visualizations, a detailed **exploratory data analysis (EDA)** was performed to examine the structure, semantics, and distribution of the Black Friday dataset. This step was crucial for understanding which attributes were most meaningful for visualization, how they relate to the client's goals, and which visual idioms would best support clear, actionable insights.

The findings shaped three selected visualizations:
- A **bubble chart** comparing average purchase per user across product categories  
- An **interactive heatmap** comparing demographic segments (age × city)  
- A **chord diagram** visualizing co-purchase frequency between product categories  

---

###  Basic Statistical Insights

| Attribute             | Observation                       |
|----------------------|------------------------------------|
| `User_ID`            | ~5,891 unique users               |
| `Product_Category_1` | ~20 categories used across products |
| `Age`                | 7 ordered categories (e.g., 18–25, 26–35, etc.) |
| `City_Category`      | 3 classes: A (Metro), B, C        |
| `Purchase`           | Skewed distribution, high variance (outliers exist) |

These statistics helped guide aggregation techniques such as calculating **average per user** and **user counts**, rather than relying solely on raw totals which could be biased by high-frequency shoppers.
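
A summary like the one in the table could be computed roughly as follows. The helper `basic_stats` and the sample rows are hypothetical, and using `Series.skew()` as the skewness measure is an assumption; the report does not state how skew was assessed.

```python
import pandas as pd

def basic_stats(df):
    """Summarize the attributes that drive the aggregation choices."""
    purchases_per_user = df.groupby("User_ID").size()
    return {
        "unique_users": df["User_ID"].nunique(),
        "categories": df["Product_Category_1"].nunique(),
        "purchase_skew": df["Purchase"].skew(),  # > 0 means a long right tail
        "max_purchases_per_user": int(purchases_per_user.max()),
    }

# Made-up rows with one very large purchase to mimic a skewed distribution.
sample = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3],
    "Product_Category_1": [1, 5, 8, 1, 10],
    "Purchase": [100, 120, 110, 130, 2000],
})
stats = basic_stats(sample)
```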

---

###  Exploratory Aggregations and What They Revealed

####  1. Avg Purchase Per User per Product Category
- **Grouped by `User_ID` and `Product_Category_1`**
- Calculated total purchase per user, then averaged across users
-  *Result:* Identified categories with high-spend, low-volume users (e.g., Category 10)
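
A minimal pandas sketch of this two-step aggregation, assuming the Kaggle column names; the function name `category_metrics` and the sample rows are illustrative only.

```python
import pandas as pd

def category_metrics(df):
    """Average per-user spend and user count for each product category."""
    # Step 1: total spend of each user within each category.
    per_user = df.groupby(
        ["Product_Category_1", "User_ID"], as_index=False
    )["Purchase"].sum()
    # Step 2: average those per-user totals and count users per category.
    return per_user.groupby("Product_Category_1").agg(
        avg_purchase_per_user=("Purchase", "mean"),
        user_count=("User_ID", "nunique"),
    )

# Made-up rows: user 1 buys twice in category 1.
sample = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "Product_Category_1": [1, 1, 1, 10],
    "Purchase": [100, 300, 200, 900],
})
metrics = category_metrics(sample)
```

In the sample, user 1's two purchases are summed before averaging, so category 1 averages 300 per user rather than 200 per transaction.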

####  2. Purchase Metrics by Age × City
- Used a pivot table to compute:
  - **Total purchase**
  - **Avg purchase per user**
  - **User count**
-  *Result:* Noticed demographic imbalances and trends (e.g., Age 26–35 dominates in City B)
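
The three pivot tables might be built along these lines. This is a sketch with assumed names (`demographic_pivots` and the dictionary keys are hypothetical); the average is derived from the other two tables so all three stay consistent.

```python
import pandas as pd

def demographic_pivots(df):
    """Return City x Age grids for total purchase, user count, and avg per user."""
    total = pd.pivot_table(
        df, index="City_Category", columns="Age",
        values="Purchase", aggfunc="sum",
    )
    users = pd.pivot_table(
        df, index="City_Category", columns="Age",
        values="User_ID", aggfunc="nunique",
    )
    # Cells with no customers stay NaN in all three grids.
    return {"total": total, "users": users, "avg": total / users}

# Made-up rows covering two of the four possible City x Age cells.
sample = pd.DataFrame({
    "User_ID": [1, 1, 2, 3],
    "City_Category": ["A", "A", "A", "B"],
    "Age": ["26-35", "26-35", "26-35", "18-25"],
    "Purchase": [100, 300, 200, 500],
})
pivots = demographic_pivots(sample)
```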

####  3. Co-Purchase Frequency Between Product Categories
- Collected all product categories purchased per user (`Product_Category_1/2/3`)
- Created **co-occurrence pairs** by user and counted their frequency
- Filtered to keep only pairs with strong relationships
-  *Result:* Identified strong connections between certain categories (e.g., 1 & 5, 5 & 8), which led to selecting a **chord diagram** for network-style visualization
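
The co-occurrence counting can be sketched with the standard library alone. The row layout, the function name `co_purchase_counts`, and the `min_count` threshold are assumptions for illustration; the report does not state which cutoff defined a "strong" relationship.

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_purchase_counts(rows, min_count=1):
    """Count category pairs bought by the same user.

    rows: iterable of (user_id, cat1, cat2, cat3); cat2/cat3 may be None.
    """
    cats_by_user = defaultdict(set)
    for user_id, *cats in rows:
        cats_by_user[user_id].update(c for c in cats if c is not None)
    pairs = Counter()
    for cats in cats_by_user.values():
        # Each unordered category pair counts once per user.
        pairs.update(combinations(sorted(cats), 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Made-up transactions for three users.
rows = [
    (1, 1, 5, None),
    (1, 8, None, None),
    (2, 1, 5, 8),
    (3, 5, 8, None),
]
all_pairs = co_purchase_counts(rows)
strong_pairs = co_purchase_counts(rows, min_count=3)
```

Sorting each user's categories before pairing means `(5, 8)` and `(8, 5)` are counted as the same link.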

---

###  How This Supports Munzner’s Methodology

---

####  WHAT — Understanding the Data Attributes

| Attribute                            | Type                  | How It Was Used |
|--------------------------------------|------------------------|------------------|
| `Product_Category_1`                | Nominal               | Used in bubble chart (X-axis), heatmap (co-occurrence), chord diagram (nodes) |
| `User_ID`                           | Nominal               | Used for user-level grouping to calculate averages and user counts |
| `Purchase`                          | Quantitative (Ratio)  | Main metric in all visuals (as avg, total, color, position) |
| `Age`                               | Ordered Categorical   | Column axis in heatmap |
| `City_Category`                     | Nominal               | Row axis in heatmap |
| Co-purchase Matrix (derived)        | Quantitative (Pairwise frequency) | Input to chord diagram (links/ribbons) |
| Derived Metrics (Avg Purchase/User, User Count) | Quantitative | Encoded via bubble size and color |

---

####  WHY — Clarifying Client-Centered Tasks

| Task        | Applied Where        | Purpose |
|-------------|----------------------|---------|
| **Compare** | Bubble Chart, Heatmap | Compare avg spend, user count, and totals |
| **Discover**| Bubble & Chord        | Spot standout products and relationships |
| **Query**   | Heatmap & Chord       | Allow user-driven filtering via dropdown |
| **Summarize** | Heatmap              | Show general demographic-level insights |

 *Example:* Bubble chart highlighted Category 10 as high-value but underused. The chord diagram revealed strong hidden ties between Category 5 and 8 — supporting bundling strategies.

---

####  HOW — Visual Encodings Informed by Data Shape

| Finding from EDA                   | Visual Design Choice                                  |
|------------------------------------|--------------------------------------------------------|
| Skewed purchase values             | Used **averages** over totals for fair comparisons     |
| Many products with unequal volume  | Bubble chart with size = user count                   |
| Category co-purchase patterns      | Chord diagram to map pairwise relationships visually  |
| Multidimensional demographic grid  | Heatmap with dropdown + color + annotations           |

 *Example:* Chord diagrams were chosen after discovering meaningful co-occurrence trends across products, which required a pairwise relational visual idiom.

---

###  Role of EDA in Project Methodology

This phase was both **technical preparation** and a **strategic design driver**. EDA enabled:

- Informed selection of visual attributes and transformations
- Clarification of the client’s underlying goals
- Effective mapping of raw data into the WHAT–WHY–HOW framework
- Discovery of hidden relationships that wouldn't emerge from basic summaries

By grounding visualization choices in EDA, the final deliverables became not just charts but **tools for insight**.

---


## 5. Visualization Design Choices

---

###  Visualization 1: Interactive Heatmap with Dropdown and Annotations  
**Title**: City vs Age — Purchase Trends by Demographic  
**Type**: Annotated Heatmap (Interactive Plotly)

---

### a. HOW Analysis 

This visualization was designed based on earlier WHAT and WHY analysis. It uses a **matrix idiom** to display categorical intersections (City × Age) and encodes a third metric (purchase behavior) using both **color** and **text annotations**.

#### MARKS:
- **Rectangles**: Represent combinations of `City_Category` (rows) and `Age` (columns)
- Each mark encodes a value (metric) in a distinct visual space

#### CHANNELS:
| Channel      | Encoding                        | Why It Was Used                            |
|--------------|----------------------------------|---------------------------------------------|
| Position     | City (Y-axis), Age (X-axis)     | Accurate for categorical comparison         |
| Color (sequential scale) | Metric values (e.g., Purchase)  | Intuitive way to show magnitude             |
| Text         | Annotated values on each cell   | Adds precision without sacrificing overview |
| Dropdown     | User interaction                | Supports dynamic comparison of 3 metrics    |

This chart allows viewers to clearly and quickly compare demographic segments and switch between:
- Total purchase
- Avg purchase per user
- Number of users

This aligns with Munzner’s HOW principle of **applying multiple channels for redundancy and clarity**.

---

### b. Addressed Actions/Targets 

Focused on key action/target pairs from the WHY analysis:
- **Compare**: Compare spending across Age × City cells
- **Query**: The user can switch between metrics using the dropdown, which acts as a dynamic query

Other WHY-level actions like “Identify Outliers” are possible but secondary to the primary design.

---

### c. HOW Methods Applied

| Design Feature           | Method                                                                 |
|--------------------------|------------------------------------------------------------------------|
| Heatmap layout           | Used pivot table to convert raw data into grid form                   |
| Color encoding           | Used a continuous color scale to show low-high gradient    |
| Text annotations         | Overlaid precise numbers for interpretability                         |
| Dropdown interactivity   | Used Plotly’s update menus to let users toggle between 3 metric views |
| Tooltip on hover         | Added customized hovertemplate for full data visibility               |
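
One way to wire the dropdown is to create one heatmap trace per metric and let each button toggle trace visibility. The sketch below builds the figure as a plain dictionary in Plotly's figure schema, so it runs without Plotly installed. The function name, colorscale, and titles are assumptions, not the project's actual code.

```python
def heatmap_with_dropdown(metrics, x_labels, y_labels):
    """Build a Plotly figure spec with one heatmap trace per metric.

    metrics: dict mapping metric name -> 2D list (rows follow y_labels).
    """
    names = list(metrics)
    traces = [
        {
            "type": "heatmap",
            "z": metrics[name],
            "x": x_labels,
            "y": y_labels,
            "colorscale": "Blues",
            "visible": i == 0,  # only the first metric is shown initially
            "hovertemplate": (
                "City %{y}<br>Age %{x}<br>" + name + ": %{z:,.0f}<extra></extra>"
            ),
        }
        for i, name in enumerate(names)
    ]
    buttons = [
        {
            "label": name,
            "method": "update",  # first dict updates traces, second the layout
            "args": [
                {"visible": [j == i for j in range(len(names))]},
                {"title.text": f"City vs Age: {name}"},
            ],
        }
        for i, name in enumerate(names)
    ]
    layout = {
        "title": {"text": f"City vs Age: {names[0]}"},
        "updatemenus": [{"type": "dropdown", "buttons": buttons}],
        "xaxis": {"title": {"text": "Age"}},
        "yaxis": {"title": {"text": "City Category"}},
    }
    return {"data": traces, "layout": layout}

# Made-up 2 x 2 grids, only to exercise the function.
spec = heatmap_with_dropdown(
    {
        "Total Purchase": [[600, 100], [500, 900]],
        "Avg Purchase/User": [[300, 100], [500, 450]],
        "User Count": [[2, 1], [1, 2]],
    },
    x_labels=["18-25", "26-35"],
    y_labels=["A", "B"],
)
```

With Plotly installed, passing the result to `plotly.graph_objects.Figure(spec)` should produce the interactive figure.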

---

### d. Idioms and Channel Justification

| Element       | Justification (WHY + WHAT)                                          |
|---------------|----------------------------------------------------------------------|
| Heatmap idiom | Ideal for showing intersections between two categorical attributes  |
| Color         | Used for encoding ratio-scale metric (`Purchase`) — intuitive & fast |
| Text overlay  | Reinforces exact value, satisfying user need for precision           |
| Dropdown      | Adds interactivity → aligns with WHY: Query task                     |

This visual design uses **well-established idioms** and **channels known for their effectiveness** per Munzner’s guidelines (e.g., a sequential color scale for magnitude, position for category).

---

### e. Additional Design Inspiration and Personal Contribution

This visualization was **inspired by public heatmap dashboards** but significantly enhanced:

- **Dropdown feature** was custom-built to toggle between three different pivot tables
- **Annotation overlay** was added manually (not a Plotly default behavior)
- Custom **hovertemplate** was created to avoid label truncation and show readable tooltips
- **Semantic labeling** was used in axis titles and colorbars to make it user-friendly

Although the heatmap structure is a known visual pattern, **the enhancements (metric switching, overlay labels, interactivity)** were fully authored and tailored to the dataset and client goals.
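
The manual annotation overlay can be approximated by generating one layout annotation per cell. This is a hedged sketch: `cell_annotations` and its number format are hypothetical, and the output is a list of plain dictionaries in Plotly's annotation schema.

```python
def cell_annotations(z, x_labels, y_labels, fmt="{:,.0f}"):
    """Build one Plotly layout annotation per non-empty heatmap cell."""
    annotations = []
    for row, y in zip(z, y_labels):
        for value, x in zip(row, x_labels):
            if value is None:
                continue  # skip cells with no data
            annotations.append({
                "x": x,
                "y": y,
                "text": fmt.format(value),
                "showarrow": False,
            })
    return annotations

# Made-up grid with one empty cell.
notes = cell_annotations(
    [[1200.0, None], [3400.7, 10]],
    x_labels=["18-25", "26-35"],
    y_labels=["A", "B"],
)
```

When the dropdown switches metrics, the matching annotation list would also need to be swapped in, for example through the layout part of each button's `args`.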

---

### f. Originality and Extension

While heatmaps are a common visual form, I extended the idiom by:
- Computing **three versions of the data metrics**
- Writing a **custom dropdown toggle**
- Enhancing the UX with annotations, better tooltips, and responsive layout

These additions show both **technical creativity** and **design intent**, aligned with Munzner's idea of building on idioms.

---

### g. Comments

#### (1)  Goal Alignment:
- Addresses the client’s first question: **"Which customer segments (based on age and city) contribute the most to sales performance?"**
- Enables both high-level demographic comparison and low-level detail exploration through switching metrics
- Supports the WHY-level actions of **Compare**, **Discover**, and **Query**


#### (2)  Pros and Cons:

| Pros                                | Cons                                      |
|-------------------------------------|-------------------------------------------|
| Very easy to interpret              | Can become cluttered with too many metrics |
| Offers 3 data perspectives in one   | Not mobile-friendly (Plotly size limits)   |
| Visual and numerical clarity        | Dropdown needs explanation if unfamiliar   |

#### (3)  Suggested Improvements:
- Add radio buttons for quicker switching between metrics
- Allow user to highlight specific cell on click
- Add summary below chart: “Top segment: Age 26–35 in City B”


---




###  Visualization 2: Bubble Chart — Avg Purchase per User vs Product Category  
**Title**: Average Purchase per User by Product Category  
**Type**: Scatter Bubble Chart (Interactive via Plotly)

---

### a. HOW Analysis

This visualization uses the **scatterplot idiom**, enhanced by **bubble size** to encode an additional metric: user count per category. It was designed to allow the client to compare **how much users spend** per product category and also **how many users** are purchasing within that category.

#### MARKS:
- **Point (bubble)**: One per product category

#### CHANNELS:
| Channel         | Encoding                        | Why It Was Used                              |
|-----------------|----------------------------------|-----------------------------------------------|
| X-axis          | Product Category                | Categorical comparison across products        |
| Y-axis          | Avg Purchase per User           | Core value for comparison                     |
| Bubble Size     | User Count                      | Indicates how many users shop in each category|
| Bubble Color    | Avg Purchase per User (again)   | Visual reinforcement through gradient scale   |
| Tooltip         | All metrics shown on hover      | Provides interpretability without clutter     |

This combination allows multiple metrics to be viewed simultaneously — which directly aligns with Munzner’s principle of **channel redundancy + multi-value encoding**.

---

### b. Addressed Actions/Targets

This visualization addresses **Question 2** from the WHY analysis:
 *Which product categories are both popular and profitable based on user-level spending behavior?*

| WHY Action     | Target           | Used? | How it's Represented |
|----------------|------------------|-------|------------------------|
| **Summarize**  | Product categories | yes    | Average purchase per user shown for each category |
| **Compare**    | Category patterns | yes    | Y-axis shows avg purchase to enable comparison |
| **Identify**   | High/low performers | yes  | Easily spot categories with high avg spend but low user count |
| **Query**      | Specific categories | yes | Hover tooltip gives full detail per bubble |

---

### c. HOW Methods Applied in the Design

| Feature                 | Implementation |
|-------------------------|----------------|
| Data Preparation        | Grouped by User + Category to compute per-user average |
| Size Metric             | Count of unique users per product category |
| Color Metric            | Same as Y-axis to reinforce through color |
| Bubble Chart Logic      | Scatter plot with scaled size and colored points |
| Tooltip                 | Custom labels to show Avg Purchase and User Count |

 These decisions enable the chart to support **discovery, comparison, and ranking**, all in one view.
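
The encoding in the table above can be sketched as a single Plotly scatter trace, written here as a plain dictionary. The `sizeref` formula follows the scaling recommended in Plotly's bubble chart documentation for `sizemode="area"`. The function name, the 40-pixel maximum, the colorscale, and the sample numbers are assumptions.

```python
def bubble_trace(categories, avg_purchase, user_counts, max_px=40):
    """Build a Plotly scatter trace with area-scaled bubbles."""
    # Scale so the largest bubble has a diameter of about max_px pixels.
    sizeref = 2.0 * max(user_counts) / (max_px ** 2)
    return {
        "type": "scatter",
        "mode": "markers",
        "x": [str(c) for c in categories],  # strings keep the axis categorical
        "y": avg_purchase,
        "customdata": user_counts,
        "marker": {
            "size": user_counts,
            "sizemode": "area",
            "sizeref": sizeref,
            "color": avg_purchase,  # same metric as the Y-axis, for reinforcement
            "colorscale": "Viridis",
            "showscale": True,
        },
        "hovertemplate": (
            "Category %{x}<br>Avg purchase/user: %{y:,.0f}"
            "<br>Users: %{customdata:,}<extra></extra>"
        ),
    }

# Made-up values; not taken from the dataset.
trace = bubble_trace(
    categories=[1, 5, 10],
    avg_purchase=[9000.0, 6000.0, 19000.0],
    user_counts=[4000, 5000, 800],
)
```

Scaling by area rather than diameter avoids visually exaggerating the difference between small and large user counts.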

---

### d. Idiom and Channel Justification

| Idiom         | Why It Fits |
|---------------|-------------|
| **Bubble Chart (2D Scatter)** | Best for showing 2+ continuous metrics per category |
| **X-axis (Product Category)** | Lets us compare nominal groupings |
| **Y-axis (Avg Purchase/User)** | Reveals profitability potential |
| **Size = User Count** | Helps identify high-engagement categories |
| **Color = Same as Y** | Enhances value interpretation visually |

The idiom as used here, to **compare popularity vs profitability**, is targeted directly at the client's business questions.

---

### e. Design Inspirations & Customization

This design was **inspired by ecommerce dashboard concepts**, but all logic, code, and structure were developed from scratch:

- Bubble size = user count was added manually from grouped aggregation
- Hover tooltips were customized to show both metrics without relying on default labels
- The combination of **Y-axis + size + color** is tuned to answer a specific business scenario:  
   “Is this category popular, profitable, or both?”

This moves beyond basic scatterplots into a more **multivariate comparison tool** — fully aligned with real-world decision-making.

---

### f. Originality and Extension

This is **not a standard or copied bubble chart**:
- Metrics were thoughtfully selected and constructed from raw data
- X-axis is categorical — a less common but useful adaptation of the scatter idiom
- Multi-channel encoding allows one visual to replace multiple bar plots


---

### g. Commentary

#### (1)  Goal Alignment:
- Directly addresses the client’s second question: **"Which product categories are both popular and profitable based on user-level spending behavior?"**
- Supports multi-metric discovery by combining average purchase per user (Y-axis), user count (bubble size), and visual emphasis (color)
- Allows the client to **compare, rank, and identify** product categories for potential focus or reallocation of marketing resources
- Maps to the WHY-level actions of **Compare**, **Summarize**, **Identify**, and **Query**


#### (2)  Pros and Cons:

| Pros                                      | Cons                             |
|-------------------------------------------|----------------------------------|
| Shows 2 metrics in one view (avg spend, user count), reinforced by color | Bubbles may overlap in dense areas |
| Visual comparison is easy                  | X-axis is categorical → spacing is artificial |
| Tooltips give full story                   | Static screenshots lose interactivity |

#### (3)  Suggested Improvements:
- Add **hover labels** with extra context (e.g., Category names if available)
- Group low-frequency categories into "Others" to reduce clutter
- Create **filter** for bubble size or avg threshold

---

**Conclusion:**  
This bubble chart allows the client to make **strategic, data-backed decisions** on which product categories to promote, bundle, or optimize. Its original design and clear encoding of average spend and user count exemplify thoughtful visualization practice grounded in Munzner’s principles.

---



###  Visualization 3: Chord Diagram — Product Category Co-Purchase Relationships  
**Title**: Product Category Co-Purchase Chord Diagram  
**Type**: Interactive D3.js Chord Visualization (HTML-based)

---

### a. HOW Analysis 

This visualization uses a **circular layout idiom** (Chord Diagram) to represent relationships between categorical variables — in this case, **product categories**. The primary goal is to reveal **frequently co-purchased category pairs**, which supports business insights like bundling and cross-selling.

#### MARKS:
- **Arcs**: Represent individual product categories (nodes)
- **Ribbons**: Represent co-purchase links (edges between categories)

#### CHANNELS:
| Channel     | Encoding                     | Why It Was Used                                  |
|-------------|------------------------------|--------------------------------------------------|
| Arc Length  | Category Total Connections   | Visual weight of a category                      |
| Ribbon Width| Co-purchase Frequency        | Shows strength of relationship between pairs     |
| Ribbon Color| Source/Target category        | Reinforces node-link identity                    |
| Position    | Circular layout for balance  | Allows equal visual opportunity for all nodes    |
| Tooltip     | Hover to show counts         | Allows querying individual links without clutter |

---

### b. WHY Action/Target Coverage

This visualization directly addresses **Question 3** from the WHY analysis:

*Are there any strong co-purchase patterns among product categories that can inform bundling strategies?*

| WHY Action   | Target                | Used? | Explanation |
|--------------|------------------------|--------|-------------|
| **Discover** | Relationships          | yes     | Viewer can explore surprising connections |
| **Compare**  | Pairwise strength      | yes     | Ribbon thickness supports direct comparison |
| **Query**    | Category focus (filter)| yes     | Dropdown highlights links by selected category |

---

### c. HOW Methods Applied in the Design

| Design Feature           | Method |
|--------------------------|--------|
| **Chord Layout**         | Used D3’s chord + ribbon layout with inner/outer radius |
| **Dropdown Filter**      | Added `<select>` element with JavaScript event listener |
| **Tooltip**              | Implemented with D3 `.on('mouseover')` and `.transition()` |
| **Top 3 Highlights**     | Thickened and outlined the strongest 3 connections |
| **Custom Label Positioning** | Rotated text anchors dynamically based on arc angle |
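
The label-rotation rule can be illustrated with a small calculation. This sketch assumes the convention commonly used in D3 chord examples (angles in radians, measured clockwise from 12 o'clock, with labels on the left half flipped so they stay upright); whether `chords.html` uses exactly this rule is an assumption, and the function name is hypothetical.

```python
import math

def label_placement(start_angle, end_angle):
    """Return (rotation_degrees, flipped, text_anchor) for an arc label."""
    mid = (start_angle + end_angle) / 2
    rotation = math.degrees(mid) - 90  # align text with the arc's radial direction
    flipped = mid > math.pi            # left half of the circle: rotate text by 180
    return rotation, flipped, ("end" if flipped else "start")

right_side = label_placement(0.0, math.pi / 2)
left_side = label_placement(math.pi, 1.5 * math.pi)
```

Flipping labels past the half-circle mark keeps them from rendering upside down, and the switched anchor keeps them growing outward from the arc.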

---

### d. Idiom and Channel Justification

| Element     | Justification |
|-------------|---------------|
| **Chord Diagram** | Well suited to visualizing pairwise symmetric relationships between a small set of categories |
| **Ribbons**       | Show link strength through ribbon width, with color tying each ribbon to its categories |
| **Arcs**          | Maintain circular symmetry, maximize comparability |
| **Color**         | Used `d3.schemeCategory10` for distinct visual identity |

This visual idiom was chosen because **pairwise relationships** between product categories were a key interest for the client. **Bar charts or treemaps** show per-category magnitudes but cannot show which categories are linked to each other.

---

### e. Design Inspiration and Additional Methodology

While chord diagrams are a known idiom in D3, this version includes several original and thoughtful enhancements:
-  Custom **data pipeline** built using Python to extract co-purchase frequencies
-  **Thresholding**: Only included links where `value > 10` to reduce noise
-  Interactive dropdown to filter and highlight **focus category**
-  Top 3 connections bolded with **stroke-width and color outline**
-  Arc label placement tuned for readability

The final design was inspired by academic visualizations of network flows but was built **from scratch** using raw CSV data and D3 v7.

---

### f. Originality and Author Contribution

This visualization demonstrates originality through:
- Custom data processing (`generate_network_data.py`)
- Filter interaction and top-3 highlighting authored in JavaScript
- Precision placement of labels and interactive hover effects
- Dynamic matrix generation from adjacency data

Both the data structure (`product_category_network.json`) and the full visual logic (`chords.html`) were designed from the ground up to fit the dataset and client question.

---

### g. Commentary

#### (1)  Goal Alignment:
- Directly addresses the client’s third question: **"Which product categories are frequently bought together, and can this be used to inform bundling strategies?"**
- Supports pairwise pattern discovery by visually highlighting **strong co-purchase links**
- Enables the client to **query**, **compare**, and **analyze** relationships between categories to uncover bundling opportunities
- Aligns with WHY actions: **Discover**, **Compare**, **Query**, and **Present**


#### (2)  Pros and Cons:

| Pros                                 | Cons                                       |
|--------------------------------------|--------------------------------------------|
| Shows relationship strength visually | Complex if viewer is unfamiliar with chords |
| Compact representation of all pairs  | Hard to read when too many ribbons overlap |
| Interactive and insightful           | Needs explanation if presented live        |

#### (3)  Suggested Improvements:
- Add **category descriptions** as tooltips (if available)
- Integrate slider to change **minimum frequency threshold**
- Export high-res PNG/SVG for report presentation

---

###  File & Data Pipeline Summary

**`generate_network_data.py`:**
- Loaded and cleaned product categories
- Created co-occurrence pairs using `combinations()`
- Ignored missing values (-1), counted frequency of pairs
- Exported to `product_category_network.json`

**`product_category_network.json`:**
- Nodes: One for each product category
- Links: Frequency count of co-purchased category pairs
- Only links with `value > 10` included to focus on meaningful relationships

**`chords.html`:**
- Loads JSON into D3
- Constructs chord matrix
- Implements interactivity:
  - Category focus dropdown
  - Top 3 ribbon highlight
  - Tooltip with category names and link values
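
The matrix construction and top-3 selection are implemented in JavaScript in `chords.html`; the Python sketch below only illustrates the underlying logic. The function names and the `source`/`target`/`value` link fields are assumptions, and the sample values are made up.

```python
def build_chord_matrix(nodes, links):
    """Return a symmetric square matrix of link values, ordered like `nodes`."""
    index = {node_id: i for i, node_id in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for link in links:
        i, j = index[link["source"]], index[link["target"]]
        # Co-purchase is undirected, so mirror the value across the diagonal
        matrix[i][j] = link["value"]
        matrix[j][i] = link["value"]
    return matrix

def top_links(links, n=3):
    """Return the n links with the largest value, strongest first."""
    return sorted(links, key=lambda link: link["value"], reverse=True)[:n]

sample_nodes = [1, 5, 8]
sample_links = [
    {"source": 1, "target": 5, "value": 40},
    {"source": 1, "target": 8, "value": 25},
    {"source": 5, "target": 8, "value": 60},
]
matrix = build_chord_matrix(sample_nodes, sample_links)
strongest = top_links(sample_links, n=2)
```

A chord layout expects a square matrix whose row and column order matches the node order, which is why the node-to-index lookup is built first.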

---

**Conclusion**:  
This chord diagram offers an insightful view into customer behavior, uncovering relationships between product categories that can inform **product bundling strategies**. Within this project, it combines technical depth and perceptual effectiveness with practical value.

---


## 6. Conclusions

---

### a. Success: How well did the project address its goals?

The project was moderately successful in achieving its three main goals, each of which was connected to a specific visualization:

- **Goal 1:** Identify top-performing customer segments  
   Addressed through the **interactive heatmap**, which revealed city-age segment patterns. The visualization provided clarity, but its simplicity may have missed interactions with other variables like marital status or occupation.

- **Goal 2:** Understand popularity vs profitability of product categories  
   Captured by the **bubble chart**, which was highly effective at visualizing multiple metrics simultaneously. However, limited labeling and category identifiers made it harder to interpret without supplemental data.

- **Goal 3:** Reveal co-purchase relationships for bundling  
   Explored with the **chord diagram**, which visualized connections well but posed usability challenges for viewers unfamiliar with network idioms.

**Overall**, the visualizations were functionally correct and visually coherent. However, in terms of **client usability**, more context (e.g., axis guides, legends, and summary annotations) could improve communication, especially for non-technical users.

---

### b. Methodology: Reflection on Munzner’s WHAT–WHY–HOW

The Munzner methodology was a helpful framework for structuring design decisions, though its **rigor introduced both strengths and blind spots**:

#### WHAT: 
- Extremely useful during data cleaning and attribute classification.
- It forced deliberate thinking about how each field (e.g., `Age`, `Product_Category_1`) was encoded and interpreted.
- However, it did not account well for **data quality issues**: for example, handling multiple product categories per transaction required extra logic that the WHAT analysis does not cover.

#### WHY: 
- The action/target taxonomy helped refine task design (e.g., “Compare vs Discover”), especially when justifying interactions like dropdowns.
- It became challenging when one visualization attempted to address **multiple WHY tasks** — clarity sometimes suffered (e.g., the heatmap's triple metric switch felt over-ambitious).
- Not all WHY categories felt natural to my dataset. “Present” and “Query” were clearly applicable, but "Identify" and "Annotate" weren’t as straightforward to implement.

#### HOW: 
- Most valuable in **visual encoding decisions** — channel choice (position, color, text) directly improved the quality of each visualization.
- Helped prevent common design mistakes (e.g., avoiding hue for ordered data).
- One limitation: HOW does not always address **chart literacy**. For example, the chord diagram was a good idiomatic fit, but not easily readable by a general audience — a concern not addressed by Munzner’s framework.

If repeating the project, I would still use Munzner’s model, but:
- Place more emphasis on **audience literacy and interaction patterns**
- Limit multi-target visualizations to reduce interpretative overload
- Consider integrating **perceptual and UX heuristics** alongside HOW analysis

---

### c. Improvements: What would I do differently?

#### Data and Preprocessing:
- **Normalize metrics** (e.g., per capita values or Z-scores) in visuals to adjust for volume effects
- Derive and use **richer features** (e.g., recency/frequency metrics for users)
- Integrate more product hierarchy data if available (subcategory or brand-level)
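
As one example of the normalization mentioned above, Z-scores could be computed like this. It is a minimal sketch using the population standard deviation; the function name and sample values are hypothetical, and a sample standard deviation would be an equally reasonable choice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no spread to scale by
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up segment totals, for illustration only
sample_totals = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_scores(sample_totals)
```

Standardized values make segments of very different sizes comparable on one color scale, since each value is expressed as a distance from the mean in standard deviations.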

#### Visual Design:
- Add **labels and legends** more consistently (especially in the bubble chart)
- Use **icons or tooltips** in the heatmap to guide interaction (many users may not understand the dropdown)
- Simplify the chord diagram by clustering low-frequency nodes or using hover-on-demand instead of showing all links

#### Interactivity and Usability:
- Provide **narrative explanations** within each chart (e.g., top 3 takeaways auto-highlighted)
- Include **filters or toggles** to reduce cognitive load (e.g., slider to hide categories with <50 users)
- Add a **comparison dashboard** to allow side-by-side filtering of segments

---

### Final Thought

The project successfully applied Munzner’s framework to a real dataset and produced three reasonably strong visualizations. However, the greatest area for growth lies in **usability and refinement** — not in the data logic or visual correctness, but in making sure **the right viewer sees the right story at the right time**.

The design process was guided well by WHAT–WHY–HOW, but a future iteration should balance theory with **audience empathy**, **simplicity**, and **narrative focus** to ensure that insight doesn’t just exist — but is seen.

---
