# CSCE 676 — Project Checkpoint 1  
## Dataset Comparison, Selection, and EDA (Checkpoint 1)

**Student:** Jaehoon Lee  
**Date:** 2026-02-11  

This notebook identifies three candidate datasets, compares them across required dimensions, selects one dataset, performs exploratory data analysis (EDA), and proposes initial research directions.

> **Reproducibility note:** Designed for **Google Colab**: the notebook installs its own packages and streams the data rather than downloading it in full.


---

## 0. Collaboration Declaration (Required)

On my honor, I declare the following resources were used for this notebook:

1. **Collaborators:** None.  
2. **Web / Material Sources:**  
   - Amazon Reviews'23 website: https://amazon-reviews-2023.github.io/  
   - Hugging Face dataset card (Amazon Reviews 2023): https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023  
   - AmazonReviews2023 processing scripts (GitHub): https://github.com/hyp1231/AmazonReviews2023  
   - OSHA Accident Search: https://www.osha.gov/ords/imis/accidentsearch.html  
   - U.S. DOL Enforcement Data (OSHA): https://enforcedata.dol.gov/  
   - NHTSA FARS overview: https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars  
   - Data.gov FARS catalog entry: https://catalog.data.gov/dataset/fatality-analysis-reporting-system-fars  
3. **AI Tools:** ChatGPT (OpenAI) — used to help draft the comparison write-up, EDA plan, and to structure clean, documented code.  
4. **Papers / Citations used:**  
   - Hou et al. (2024) *Bridging Language and Items for Retrieval and Recommendation* (linked from the dataset website / HF card).


---

## 1. Checkpoint Requirements Map

- **(A)** 3 candidate datasets (name/source, course alignment, beyond-course technique, size/structure, types, targets, licensing)  
- **(B)** Comparative table (tasks, quality, feasibility, bias, ethics)  
- **(C)** Dataset selection + justification + trade-offs  
- **(D)** EDA for selected dataset only (basics, cleaning, bias notes)  
- **(E)** Initial insights / hypotheses / potential RQs  
- **(F)** GitHub repo link (replace placeholder)

All algorithmic decisions include a short **WHY** comment.


---

## 2. (A) Identification of Candidate Datasets (3 candidates)


### Dataset 1 — OSHA Accident / Investigation Narratives (Text + coded fields)

- **Dataset name & source:** OSHA Accident Search (incident investigation summaries)  
  https://www.osha.gov/ords/imis/accidentsearch.html  
  Related portal: https://enforcedata.dol.gov/  
- **Course topic alignment:** Text mining; clustering; anomaly detection  
- **Beyond-course techniques:** Transformer-based text modeling; topic modeling; information extraction (e.g., causal relation extraction)  
- **Size & structure:** Large incident-level records; narratives + structured fields (varies by export)  
- **Data types:** Narrative text; dates; industry codes; (possible) injury/severity indicators  
- **Target variable(s):** Often unsupervised; possible supervised targets if available (severity/fatality)  
- **Licensing/constraints:** Public-facing U.S. government data; handle sensitive narratives carefully and avoid re-identification.


### Dataset 2 — Amazon Reviews’23 (McAuley Lab) (Text + User–Item Graph)

- **Dataset name & source:** Amazon Reviews 2023  
  Website: https://amazon-reviews-2023.github.io/  
  Hugging Face: https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2023  
  Processing scripts: https://github.com/hyp1231/AmazonReviews2023  
- **Course topic alignment:** Text mining; graph mining; clustering/embeddings  
- **Beyond-course techniques:** Graph Neural Networks (GNN); transformer-based embeddings/retrieval; topic modeling  
- **Size & structure:** Extremely large → requires **streaming/sampling** for feasibility  
- **Data types:** user_id, (parent_)asin, rating, text/title, timestamps, helpful votes, verified purchase, metadata fields  
- **Target variable(s):** rating/helpfulness prediction; retrieval/recommendation; unsupervised topics/clusters  
- **Licensing/constraints:** Follow dataset card / repository terms; be mindful of platform/recommendation bias.


### Dataset 3 — NHTSA FARS (Fatality Analysis Reporting System)

- **Dataset name & source:** NHTSA FARS  
  https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars  
  Data.gov: https://catalog.data.gov/dataset/fatality-analysis-reporting-system-fars  
- **Course topic alignment:** Clustering; anomaly detection  
- **Beyond-course techniques:** Spatiotemporal modeling; change-point detection; carefully-scoped causal analysis  
- **Size & structure:** Large, multi-table relational structure (crash/vehicle/person), yearly releases  
- **Data types:** Many coded categorical/ordinal variables + time/location; joins required  
- **Target variable(s):** Context class prediction; risk factor modeling (severity variation may be limited in a fatality-only dataset)  
- **Licensing/constraints:** Listed as public domain in the Data.gov catalog metadata; still handle fatality outcomes responsibly.


---

## 3. (B) Comparative Analysis (Required Table)


| Dimension | OSHA Narratives | Amazon Reviews’23 | NHTSA FARS |
|---|---|---|---|
| **Supported tasks (course vs beyond)** | Course: text mining, clustering, anomaly. Beyond: transformers/topic modeling/IE | Course: text + graph + clustering. Beyond: GNN/transformers/topic modeling | Course: clustering/anomaly. Beyond: spatiotemporal modeling |
| **Data quality issues** | Noisy narratives; missing fields | Noisy text/spam; category imbalance; huge scale | Multi-table joins; year-to-year schema differences |
| **Algorithmic feasibility** | Feasible, but acquiring/exporting the records takes extra work | Feasible via streaming/sampling; manageable subgraphs | Feasible with year/variable restrictions |
| **Bias considerations** | Reporting/selection bias by industry/time | Recommendation/exposure bias; reviewer self-selection | Geographic/policy/reporting effects |
| **Ethical considerations** | Sensitive incident content; re-identification risk | Lower direct harm; still avoid overclaiming | Sensitive fatality outcomes; avoid stigma |


---

## 4. (C) Dataset Selection

### **Selected dataset: Amazon Reviews’23 (McAuley Lab)**

**Reasons**
1. Supports multiple course techniques (text mining, graph-based analysis, clustering/embeddings).  
2. Clear beyond-course extension: **GNN embeddings / transformer retrieval** on user–item graphs.  
3. Existing tooling (Hugging Face loader + processing scripts) enables a **fully runnable** checkpoint notebook.

**Trade-offs**
- Not specific to the safety domain, unlike the OSHA and FARS candidates.  
- The scale means every sampling/streaming choice has to be justified.


---

## 5. (D) Exploratory Data Analysis (Selected Dataset Only)

### EDA plan (what + WHY)
1. **Stream + sample** (WHY: full dataset too large for RAM).  
2. Optionally filter to **one category** (WHY: interpretability).  
3. Compute distributions: ratings, text length, helpful votes, verified purchase.  
4. Minimal cleaning: remove empty text, validate rating bounds (WHY: avoid brittle assumptions early).  
5. Graph proxy stats: unique users/items + sparsity estimate (WHY: anticipate feasibility for graph/GNN methods).
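
Step 1 can be done in more than one way. Taking the first N rows of a stream is the cheapest option, but it inherits whatever ordering the source files have. As a point of comparison, here is a minimal sketch of reservoir sampling (Algorithm R), which draws a uniform random sample from a stream of unknown length in a single pass. The function name `reservoir_sample` and the toy stream are illustrative assumptions, not part of the notebook's pipeline, and a full pass over the real dataset may be too slow for Colab.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Draw up to k items uniformly at random from a stream (Algorithm R)."""
    rng = random.Random(seed)  # local RNG so the sample is reproducible
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy stream standing in for the review iterator (hypothetical data).
toy_stream = ({"rating": 1 + (i % 5), "text": f"review {i}"} for i in range(1000))
sample = reservoir_sample(toy_stream, k=50, seed=42)
print("Sampled rows:", len(sample))
```

The trade-off is runtime: the head sample stops after N rows, while reservoir sampling must read the whole stream to be uniform.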


In [None]:
# If running in Colab, install dependencies.
# WHY: keep the notebook runnable from scratch.

!pip -q install datasets pandas numpy matplotlib

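import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset
from IPython.display import display  # WHY: explicit import so display() also works outside notebook globals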


In [None]:
# -----------------------------
# 5.1 Load via streaming
# -----------------------------
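# WHY streaming: Amazon Reviews'23 is extremely large.
# Per the dataset card, reviews are published as one config per category
# ("raw_review_<Category>") with a single "full" split, loaded via a dataset script.
# NOTE: if the installed `datasets` release no longer supports loading scripts,
# pin an older release that still accepts trust_remote_code.
DATASET_ID = "McAuley-Lab/Amazon-Reviews-2023"
CONFIG = "raw_review_All_Beauty"  # assumption: a small example category; swap in another config as needed
SPLIT = "full"

ds_stream = load_dataset(
    DATASET_ID, CONFIG, split=SPLIT, streaming=True, trust_remote_code=True
)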
ds_stream


In [None]:
# -----------------------------
# 5.2 Sample N rows into a DataFrame
# -----------------------------
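# WHY: EDA needs enough rows for distributions but must be feasible in Colab.
# NOTE: this keeps the first N_SAMPLE rows of the stream (a head sample), not a random draw,
# so any ordering in the source files carries over into the sample.

N_SAMPLE = 20000  # trade-off between distribution stability and runtime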
rows = []
for i, ex in enumerate(ds_stream):
    rows.append(ex)
    if i + 1 >= N_SAMPLE:
        break

df = pd.DataFrame(rows)
print("Sampled rows:", len(df))
df.head()


In [None]:
# 5.3 Schema inspection
print("Columns:", df.columns.tolist())
df.describe(include="all").T.head(40)


In [None]:
# 5.4 Minimal cleaning (WHY: required for reliable EDA/text analysis)

# Choose a text column robustly
text_col_candidates = [c for c in ["text", "review_text", "reviewText"] if c in df.columns]
assert len(text_col_candidates) >= 1, "No recognizable text column found!"
TEXT_COL = text_col_candidates[0]
print("Using text column:", TEXT_COL)

df[TEXT_COL] = df[TEXT_COL].fillna("").astype(str)
before = len(df)
df = df[df[TEXT_COL].str.strip().str.len() > 0].copy()
print("Dropped empty-text rows:", before - len(df))

# Rating validity if present
if "rating" in df.columns:
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    before = len(df)
    df = df[df["rating"].between(1.0, 5.0, inclusive="both")].copy()
    print("Dropped invalid-rating rows:", before - len(df))

# Derived features
df["text_len_chars"] = df[TEXT_COL].str.len()
df["text_len_words"] = df[TEXT_COL].str.split().map(len)

df.head()


In [None]:
# 5.5 Optional: filter to one category for interpretability (if field exists)

category_col_candidates = [c for c in ["main_category", "category", "product_category"] if c in df.columns]
CATEGORY_COL = category_col_candidates[0] if category_col_candidates else None
print("Category column:", CATEGORY_COL)

if CATEGORY_COL is not None:
    top_cat = df[CATEGORY_COL].value_counts().index[0]
    print("Selected category (most frequent in sample):", top_cat)
    df_cat = df[df[CATEGORY_COL] == top_cat].copy()
    print("Rows in selected category:", len(df_cat))
else:
    df_cat = df.copy()
    print("No category field found; using full sample.")


In [None]:
# 5.6 EDA: distributions

# Rating distribution
if "rating" in df_cat.columns:
    ax = df_cat["rating"].value_counts().sort_index().plot(kind="bar")
    ax.set_title("Rating Distribution (sample)")
    ax.set_xlabel("Rating")
    ax.set_ylabel("Count")
    plt.show()

# Review length (words)
ax = df_cat["text_len_words"].clip(upper=200).plot(kind="hist", bins=50)
ax.set_title("Review Length (words) — clipped at 200")
ax.set_xlabel("Words")
plt.show()

# Verified purchase
if "verified_purchase" in df_cat.columns:
    vp = df_cat["verified_purchase"].value_counts(dropna=False)
    print("Verified purchase counts:")
    display(vp)
    print("Verified purchase rate:", (df_cat["verified_purchase"] == True).mean())

# Helpful votes
if "helpful_vote" in df_cat.columns:
    hv = pd.to_numeric(df_cat["helpful_vote"], errors="coerce").fillna(0)
    print("Helpful vote summary:")
    display(hv.describe())
    ax = hv.clip(upper=50).plot(kind="hist", bins=50)
    ax.set_title("Helpful Votes — clipped at 50")
    ax.set_xlabel("Votes")
    plt.show()


In [None]:
# 5.7 Graph feasibility proxy: user–item sparsity estimate (sample)

item_col_candidates = [c for c in ["parent_asin", "asin", "item_id"] if c in df_cat.columns]
assert len(item_col_candidates) >= 1, "No recognizable item ID column found!"
ITEM_COL = item_col_candidates[0]
print("Using item ID column:", ITEM_COL)

user_col_candidates = [c for c in ["user_id", "reviewerID", "user"] if c in df_cat.columns]
assert len(user_col_candidates) >= 1, "No recognizable user ID column found!"
USER_COL = user_col_candidates[0]
print("Using user ID column:", USER_COL)

n_users = df_cat[USER_COL].nunique()
n_items = df_cat[ITEM_COL].nunique()
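# WHY dedupe: a user may review the same item more than once; count each user-item pair once.
n_edges = df_cat[[USER_COL, ITEM_COL]].drop_duplicates().shape[0]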

density_est = n_edges / (n_users * n_items) if (n_users * n_items) > 0 else np.nan

print(f"Unique users: {n_users:,}")
print(f"Unique items: {n_items:,}")
print(f"Interactions (edges): {n_edges:,}")
print(f"User–item density estimate (sample): {density_est:.6e}")


---

## 6. Tests / Sanity Checks (Required)

Simple assertion-based checks to validate:
- expected columns exist (text/user/item)
- cleaning invariants (no empty text; ratings in [1,5])
- derived features are valid
- sparsity estimate is finite


In [None]:
import math

print("Running EDA sanity checks...")

assert (df_cat[TEXT_COL].str.strip().str.len() > 0).all(), "Empty text found after cleaning!"
assert (df_cat["text_len_words"] >= 0).all(), "Negative word length detected!"
assert (df_cat["text_len_chars"] >= 0).all(), "Negative char length detected!"

if "rating" in df_cat.columns:
    assert df_cat["rating"].between(1.0, 5.0, inclusive="both").all(), "Rating outside [1,5] found!"

assert df_cat[USER_COL].notna().any(), "No user IDs found!"
assert df_cat[ITEM_COL].notna().any(), "No item IDs found!"
assert df_cat[USER_COL].nunique() > 10, "Too few unique users; sampling may be broken."
assert df_cat[ITEM_COL].nunique() > 10, "Too few unique items; sampling may be broken."

assert math.isfinite(float(density_est)), "Density estimate is not finite!"

print("All EDA sanity checks passed.")


---

## 7. Bias & Ethical Considerations (Selected Dataset)

**Potential biases**
- **Reviewer self-selection:** reviewers are not representative of all buyers.  
- **Exposure/recommendation bias:** visible products get more reviews.  
- **Category imbalance:** patterns may not generalize.

**Ethics**
- Avoid releasing any identifying user information; report aggregate stats.
- Avoid overclaiming causal effects from observational review data.
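
One way to act on the first point when showing example rows is to replace user IDs with keyed hashes before anything is printed or exported. The sketch below is an assumption about how that could look, not something the dataset or the course requires; `pseudonymize` and `SECRET_KEY` are hypothetical names, and the rows are made up.

```python
import hashlib
import hmac
import secrets


def pseudonymize(user_id: str, key: bytes, length: int = 12) -> str:
    """Return a short keyed hash (HMAC-SHA256) that stands in for a user ID."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Hypothetical key: generate once per analysis and never commit it to the repo.
SECRET_KEY = secrets.token_bytes(16)

# Made-up rows for illustration only.
rows = [
    {"user_id": "USER_A", "rating": 5.0},
    {"user_id": "USER_B", "rating": 2.0},
    {"user_id": "USER_A", "rating": 4.0},
]
public_rows = [{**r, "user_id": pseudonymize(r["user_id"], SECRET_KEY)} for r in rows]
for r in public_rows:
    print(r)
```

The same user still maps to the same token, so per-user aggregates remain computable, but the original ID cannot be read off the output.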


---

## 8. (E) Initial Insights and Direction

The items below are expectations to confirm against the plots and statistics produced in Section 5; they are not yet measured results.

### Likely observations
- Ratings are often skewed toward high values (positivity bias).  
- Review lengths are long-tailed (many short reviews, few very long).  
- User–item interactions are sparse, suggesting graph-based models must handle sparsity.
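
The first two observations can be turned into numbers rather than read off a plot. The sketch below shows one possible pair of summary statistics on toy lists: the share of ratings at or above 4, and the ratio of the 95th percentile to the median review length. The helper names and toy values are illustrative assumptions; in the notebook the inputs would come from `df_cat["rating"]` and `df_cat["text_len_words"]`.

```python
from statistics import median, quantiles
from typing import Sequence


def high_rating_share(ratings: Sequence[float], threshold: float = 4.0) -> float:
    """Fraction of ratings at or above the threshold (positivity skew)."""
    return sum(r >= threshold for r in ratings) / len(ratings)


def tail_ratio(lengths: Sequence[float]) -> float:
    """95th percentile divided by the median; values far above 1 suggest a long right tail."""
    p95 = quantiles(lengths, n=20)[-1]  # 19 cut points; the last one is the 95th percentile
    return p95 / median(lengths)


# Toy values (hypothetical), not results from the dataset.
toy_ratings = [5, 5, 4, 5, 3, 1, 5, 4]
toy_lengths = [5, 8, 10, 12, 15, 20, 30, 400]

print("Share of ratings >= 4:", high_rating_share(toy_ratings))
print("p95 / median length:", round(tail_ratio(toy_lengths), 2))
```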

### Hypotheses
1. Text-derived topics may explain variance beyond rating alone (e.g., shipping vs durability).  
2. GNN item embeddings may capture similarity beyond text-only embeddings by leveraging multi-hop connectivity.
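
To make "multi-hop connectivity" in hypothesis 2 concrete, the sketch below counts item-to-item links that pass through a shared user (item → user → item). This is plain co-occurrence counting, not a GNN; it only illustrates the kind of neighborhood signal a GNN layer would aggregate. The function name and toy interactions are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def two_hop_item_neighbors(
    interactions: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, int]]:
    """Map each item to the items reachable via a shared user, with shared-user counts."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    neighbors: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for items in items_by_user.values():
        # Every pair of items reviewed by the same user is two hops apart.
        for a in items:
            for b in items:
                if a != b:
                    neighbors[a][b] += 1
    return {a: dict(nb) for a, nb in neighbors.items()}


# Toy (user, item) pairs, hypothetical.
toy_interactions = [
    ("u1", "i1"), ("u1", "i2"),
    ("u2", "i2"), ("u2", "i3"),
    ("u3", "i1"), ("u3", "i2"),
]
item_neighbors = two_hop_item_neighbors(toy_interactions)
print(item_neighbors["i2"])
```

In the toy data, `i1` and `i3` share no user, so they are linked only through `i2`; capturing that kind of longer path is what stacked GNN layers are expected to add over direct co-occurrence. The inner loops are quadratic in the number of items per user, so very active users may need to be capped on real data.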

### Potential research questions (RQs)
1. How do **topic clusters** (from review text) relate to rating distributions within a category?  
2. Do **GNN-based item embeddings** improve retrieval/similarity vs text-only embeddings?  
3. Can we detect **anomalous/spam-like review patterns** using text length, helpfulness, and temporal features?
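
For RQ 3, a simple baseline to start from is a robust outlier flag on a single feature such as review length. The sketch below uses the modified z-score based on the median absolute deviation (MAD), with the commonly used constant 0.6745 and cutoff 3.5. It is one possible baseline under those assumptions, not the planned method; the function name and toy lengths are illustrative.

```python
from statistics import median
from typing import List, Sequence


def mad_outlier_flags(values: Sequence[float], threshold: float = 3.5) -> List[bool]:
    """Flag values whose modified z-score (MAD-based) exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median, so the score is undefined; flag nothing.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Toy review lengths in words (hypothetical); the last one is extreme.
toy_word_counts = [12, 15, 9, 14, 11, 13, 10, 900]
flags = mad_outlier_flags(toy_word_counts)
print(flags)
```

Because it relies on medians, the flag is not distorted by the extreme values it is trying to find, which matters for long-tailed features such as text length and helpful votes.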

### Beyond-course technique target
- **Graph Neural Networks (GNN)** for user–item embedding learning (beyond course).


---

## 9. (F) GitHub Portfolio Link (Required)

**Public GitHub repository (replace placeholder):**  
- https://github.com/<YOUR_USERNAME>/csce676-project

Include in the repo:
- This notebook (fully run)
- A short `README.md` describing dataset, course + beyond-course techniques, and how to run.
