
# 02 · Cleaning & Preprocessing

**Purpose**  
Transform raw datasets in `../data/raw/` into **clean, canonical** files in `../data/clean/` using the project’s cleaning utilities.  
This notebook is a thin wrapper around the module logic so the pipeline remains testable and re-runnable.

**What happens here**  
- Convert formats when needed (e.g., JSON → JSONL, or Parquet → CSV if Parquet files were created in an earlier step)
- Apply dataset-specific normalization:
  - FEVER: evidence structure normalization
  - HotpotQA: `supporting_facts` and `context` flattening
  - SQuAD v2: robust answer field parsing and text normalization
- Copy passthrough files to `clean/` when no custom cleaning is required
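
As an illustration of the format-conversion step, here is a minimal sketch of a JSON → JSONL conversion. It assumes the raw file holds a single top-level JSON array; the function name `json_to_jsonl` is hypothetical and is not taken from the project's cleaning module.

```python
import json
from pathlib import Path


def json_to_jsonl(src: Path, dst: Path) -> int:
    """Write each element of a top-level JSON array as one JSONL line.

    Returns the number of records written. (Hypothetical helper.)
    """
    with src.open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{src} does not contain a top-level JSON array")
    with dst.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the output
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

This version loads the whole array into memory, which is fine for small files; very large inputs would probably need a streaming parser instead.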

**Non-goals**  
No plotting, statistics, or modeling. EDA lives in `03_eda.ipynb` and feature creation in `04_feature_engineering.ipynb`.


In [6]:
# --- Imports & Path Setup ---
from pathlib import Path
import sys
import pandas as pd

In [None]:
# Ensure project root on sys.path
PROJECT_ROOT = Path.cwd().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

RAW_DIR = PROJECT_ROOT / "data" / "raw"
CLEAN_DIR = PROJECT_ROOT / "data" / "clean"
RAW_DIR.mkdir(parents=True, exist_ok=True)
CLEAN_DIR.mkdir(parents=True, exist_ok=True)

print("Project root:", PROJECT_ROOT)
print("Raw dir:", RAW_DIR)
print("Clean dir:", CLEAN_DIR)

# Import the cleaner entry point from the project package

from data_acquisition.data_cleaner import main as run_cleaner
print("Imported: data_acquisition.data_cleaner.main")



Project root: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS
Raw dir: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\raw
Clean dir: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\clean
Imported: data_acquisition.data_cleaner.main


## Run the Dataset Cleaning Script

The cleaning script (the `data_acquisition.data_cleaner` module imported above) processes raw QA datasets from the `/data/raw/` directory and transforms them into clean, BigQuery-compatible files (CSV or JSONL) stored in `/data/clean/`.

This script uses the `DataCleaner` class, which includes dataset-specific parsing and normalization logic to:
- **Standardize nested answer formats** (e.g., from arrays or dictionaries),
- **Escape problematic characters** (e.g., rogue quotes or newline characters),
- **Validate presence of required fields** (`id`, `title`, `context`, `question`, `answers`),
- **Log and isolate failures** in a separate `*_failed.csv` file for inspection.
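
To make the first and third bullets concrete, here is a rough sketch of answer normalization and required-field validation. The names `normalize_answers` and `missing_fields` are hypothetical, and the real `DataCleaner` methods may differ; the sketch only assumes that answers can arrive as a dict with a `text` key, a list, or a serialized string of either.

```python
import ast
import json

REQUIRED_FIELDS = ("id", "title", "context", "question", "answers")


def normalize_answers(raw):
    """Coerce an answers value into a flat list of answer strings."""
    if isinstance(raw, str):
        # The field may be a serialized dict or list: try JSON first, then a
        # Python literal, and fall back to treating it as plain text.
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            try:
                raw = ast.literal_eval(raw)
            except (ValueError, SyntaxError, TypeError):
                return [raw.strip()] if raw.strip() else []
    if isinstance(raw, dict):
        raw = raw.get("text", [])
    if isinstance(raw, str):
        raw = [raw]
    if not isinstance(raw, (list, tuple)):
        return []
    # Collapse whitespace inside each answer and drop empty entries
    return [" ".join(str(a).split()) for a in raw if str(a).strip()]


def missing_fields(row: dict) -> list:
    """Return required fields that are absent or None in a row."""
    # An empty answers list is allowed: SQuAD v2 has unanswerable questions.
    return [k for k in REQUIRED_FIELDS if row.get(k) is None]
```

For example, `normalize_answers({"text": ["Paris "], "answer_start": [0]})` yields `["Paris"]`, and a row with no `answers` key is reported by `missing_fields`.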

Key features:
- Handles schema inconsistencies across datasets (e.g., FEVER, HotpotQA, SQuAD).
- Keeps a separate inline cleaning function per dataset, so the preprocessing logic stays modular and extensible.
- Writes all successfully cleaned rows to `/data/clean/` and any malformed or incomplete rows to `/data/raw/*_failed.csv`.
    - During implementation, this failure path was exercised only by **SQuAD v2.0**, which was the most problematic dataset to convert from raw to clean form.
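
The clean/failed split described above could be sketched as follows. This is an illustrative outline using only the standard library, not the project's actual implementation; `partition_rows` and `write_csv` are hypothetical names.

```python
import csv
from pathlib import Path

REQUIRED_FIELDS = ("id", "title", "context", "question", "answers")


def partition_rows(rows):
    """Split rows into (clean, failed) by presence of the required fields."""
    clean, failed = [], []
    for row in rows:
        ok = all(row.get(k) is not None for k in REQUIRED_FIELDS)
        (clean if ok else failed).append(row)
    return clean, failed


def write_csv(rows, path: Path) -> None:
    """Write rows to CSV; missing or None values become empty cells."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=REQUIRED_FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Keeping failed rows in their own file means a bad record never blocks the clean output, and the failures can be inspected or reprocessed later.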

> **Note:** This step is essential before loading data into BigQuery, as unescaped quotes and inconsistent schemas will cause ingestion to fail.
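
As a small illustration of the quoting issue, the sketch below uses Python's standard `csv` module to quote every field, which doubles embedded quotes and keeps newlines inside a quoted cell. The sample rows are made up, and whether a loader accepts quoted newlines depends on its own load options, which are not covered here.

```python
import csv
import io

# Made-up sample rows containing a double quote and an embedded newline
rows = [
    ["id", "question"],
    ["1", 'Who said "hello"?'],
    ["2", "Line one\nline two"],
]

buf = io.StringIO()
# QUOTE_ALL wraps every field in quotes; embedded quotes are doubled ("")
writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()

# Reading the text back recovers the original values unchanged
parsed = list(csv.reader(io.StringIO(text)))
```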


In [None]:
# --- Run Cleaning ---

run_cleaner()


In [4]:
# --- Post-clean Sanity Checks ---

clean_files = sorted(CLEAN_DIR.glob("*"))
print(f"Files in {CLEAN_DIR} ({len(clean_files)}):")
for f in clean_files:
    sz = f.stat().st_size / (1024 * 1024)
    print(f" - {f.name}  ({sz:.2f} MB)")


Files in C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\clean (10):
 - fever_dev_train.jsonl  (6.47 MB)
 - hotpot_dev_distractor.jsonl  (61.01 MB)
 - hotpot_dev_fullwiki.jsonl  (62.63 MB)
 - hotpot_train.jsonl  (737.53 MB)
 - nq_open_train.jsonl  (8.21 MB)
 - squad_v2_train.csv  (115.07 MB)
 - squad_v2_validation.csv  (11.50 MB)
 - truthful_qa_train.csv  (0.48 MB)
 - truthful_qa_with_source_text.csv  (9.88 MB)
 - truthfulqa_missing_urls.csv  (0.09 MB)


In [None]:
# Apply additional cleaning for TruthfulQA
# TODO: Move this logic into the data_cleaner module and remove this section

def clean_text(text):
    """Trim the value and collapse internal whitespace runs to single spaces."""
    # Note: str() turns missing values (NaN) into the literal string "nan".
    return ' '.join(str(text).strip().split())

df_clean = pd.read_csv(CLEAN_DIR / "truthful_qa_train.csv")

# Normalize whitespace in the single-valued text columns
for col in ['Question', 'Best Answer', 'Best Incorrect Answer']:
    df_clean[col] = df_clean[col].apply(clean_text)

In [8]:
# Split semicolon-delimited answer strings into lists of trimmed items

df_clean['Correct Answers'] = df_clean['Correct Answers'].apply(lambda x: [s.strip() for s in x.split(';')])
df_clean['Incorrect Answers'] = df_clean['Incorrect Answers'].apply(lambda x: [s.strip() for s in x.split(';')])

In [11]:
# Write back to CSV (overwrites the input file; list columns are stored as their Python string representation)
df_clean.to_csv(CLEAN_DIR / "truthful_qa_train.csv", index=False)
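
Because CSV cells are plain text, a list column does not come back as a list when the file is read again. One possible approach, sketched here with hypothetical helper names, is to encode each list as a JSON string before writing and decode it after reading. The sketch also tolerates missing values, which the plain `x.split(';')` call would not.

```python
import json


def split_answers(cell):
    """Split a semicolon-delimited cell into trimmed, non-empty answers."""
    if not isinstance(cell, str):
        return []  # missing values (e.g. NaN) become an empty list
    return [part.strip() for part in cell.split(";") if part.strip()]


def encode_list(items):
    """Serialize a list as a JSON string so it survives a CSV round trip."""
    return json.dumps(items, ensure_ascii=False)


def decode_list(cell):
    """Inverse of encode_list: parse the JSON string back into a list."""
    return json.loads(cell)
```

With pandas, this could be applied as `df_clean[col].apply(split_answers).apply(encode_list)` before `to_csv`, and `df[col].apply(decode_list)` after `read_csv`.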
