# 01 · Data Acquisition

**Purpose**  
This notebook downloads all raw datasets required for the project into `../data/raw/`.  
It acts as a wrapper for [`loader.py`](../data_acquisition/loader.py), which:

- Retrieves datasets from predefined URLs
- Downloads from Hugging Face repositories
- Handles Parquet → CSV conversions
- Saves all outputs to the `data/raw/` directory

**Usage Notes**  
- Run this notebook when setting up the project for the first time or when refreshing datasets.  
- Cleaning and preprocessing of these datasets happen later, in `02_cleaning_preprocessing.ipynb`.  
- To overwrite existing files, set `FORCE_DOWNLOAD = True` in the settings cell below.  
- For interactive overwrite confirmation, set `PROMPT_USER = True`.




In [1]:
from pathlib import Path
import sys

# --- Ensure project modules are on path ---
# Assumes the notebook's working directory is one level below the project root.
PROJECT_ROOT = Path.cwd().resolve().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

# --- Ensure raw data directory exists ---
RAW_DIR = PROJECT_ROOT / "data" / "raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)
print(f"Raw data directory: {RAW_DIR.resolve()}")

# --- Import loader main ---
from data_acquisition.loader import main as loader_main

Raw data directory: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\raw


In [2]:
# --- Download Settings ---
# Set to True to overwrite existing files
FORCE_DOWNLOAD = False

# Set to True to prompt interactively (forces user input in cell output)
PROMPT_USER = False


## Run the Data Loading Script

The `loader.py` script downloads a core set of fact-verification and QA datasets into the local project environment in their **original file formats** (e.g., `.json`, `.jsonl`, `.parquet`, or `.csv`). Where the loader converts a file (for example, Parquet → CSV), the converted copy is stored alongside the original.

Currently supported datasets include:
- **FEVER 2.0**
- **HotpotQA**
- **Natural Questions (Lite)**
- **SQuAD v2.0**
- **TruthfulQA**

The script is built around a modular `DataDownloader` class, which encapsulates:
- dataset-specific retrieval logic,
- support for both **Hugging Face Hub** and **direct download URLs**,
- dynamic filetype handling for JSON, JSONL, CSV, and Parquet,
- customizable storage paths.
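
As an illustration of what extension-based filetype handling can look like, the sketch below dispatches on a file's suffix to count its records. It is not taken from `loader.py`: the function name `count_records` is hypothetical, and the Parquet branch assumes `pyarrow` is installed.

```python
import csv
import json
from pathlib import Path


def count_records(path: Path) -> int:
    """Count records in a data file, dispatching on its extension.

    Hypothetical helper for illustration; not part of loader.py.
    """
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        with path.open(encoding="utf-8") as fh:
            return sum(1 for line in fh if line.strip())
    if suffix == ".json":
        # Assumption: a top-level list holds the records; anything else counts as one.
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        return len(data) if isinstance(data, list) else 1
    if suffix == ".csv":
        # DictReader consumes the header row and handles quoted newlines.
        with path.open(newline="", encoding="utf-8") as fh:
            return sum(1 for _ in csv.DictReader(fh))
    if suffix == ".parquet":
        # Imported lazily so the other formats work without pyarrow.
        import pyarrow.parquet as pq

        return pq.ParquetFile(str(path)).metadata.num_rows
    raise ValueError(f"Unsupported file type: {path.name}")
```

Keeping the dispatch in one function means a new format only needs one extra branch, which matches the extensibility goal described in this section.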

This design makes the loader easy to extend: add the new dataset to the Hugging Face or URL mappings in `loader.py` and rerun the script. Each dataset is downloaded only once unless the `overwrite` flag is enabled (in this notebook, by setting `FORCE_DOWNLOAD = True`).

> **Note:** All files are saved into the `data/raw/` folder using consistent and identifiable filenames to support reproducibility and transparent data lineage in downstream processing.
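
The download-once behavior can be pictured with a small decision function. This is a minimal sketch under stated assumptions, not the logic in `loader.py`: the name `should_download`, the injected `ask` callable, and the precedence of `force` over `prompt_user` are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def should_download(
    target: Path,
    force: bool = False,
    prompt_user: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Decide whether a file should be (re)downloaded to `target`.

    Hypothetical helper; the real loader may order these checks differently.
    """
    if not target.exists():
        return True  # nothing on disk yet, so always download
    if force:
        return True  # assumption: force wins over prompting
    if prompt_user:
        answer = ask(f"{target.name} already exists. Overwrite? [y/N] ")
        return answer.strip().lower() in {"y", "yes"}
    print(f"Skipping {target.stem}: file already exists.")
    return False
```

Passing `ask` as a parameter keeps the function testable without real keyboard input; with both flags left at `False`, an existing file is simply skipped, which mirrors the "Skipping ..." messages in the output of the next cell.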


In [3]:
# --- Run the Loader ---
loader_main(force=FORCE_DOWNLOAD, prompt_user=PROMPT_USER)


URL Downloads: 100%|██████████| 7/7 [00:00<00:00, 699.67it/s]


Skipping hotpot_train: cleaned or downloaded file already exists.
Skipping hotpot_dev_distractor: cleaned or downloaded file already exists.
Skipping hotpot_dev_fullwiki: cleaned or downloaded file already exists.
Skipping fever_dev_train: cleaned or downloaded file already exists.
Skipping truthful_qa_train: cleaned or downloaded file already exists.
Skipping squad_v2_train: cleaned or downloaded file already exists.
Skipping squad_v2_validation: cleaned or downloaded file already exists.

Downloading datasets from Hugging Face...



Hugging Face Downloads: 100%|██████████| 1/1 [00:00<00:00, 1000.55it/s]

Skipping HuggingFace download for nq_open_train: file already exists.





In [4]:
# --- Post-download check ---

print("\nDownloaded files in ../data/raw/:")
for f in sorted(RAW_DIR.glob("*")):
    if not f.is_file():
        continue  # skip subdirectories; only report files
    size_mb = f.stat().st_size / (1024 * 1024)
    print(f"{f.name}  ({size_mb:.2f} MB)")



Downloaded files in ../data/raw/:
fever_dev_train.jsonl  (4.15 MB)
hotpot_dev_distractor.json  (44.17 MB)
hotpot_dev_distractor.jsonl  (44.65 MB)
hotpot_dev_fullwiki.json  (45.26 MB)
hotpot_dev_fullwiki.jsonl  (45.74 MB)
hotpot_train.json  (540.19 MB)
hotpot_train.jsonl  (540.19 MB)
nq_open_train.json  (7.86 MB)
nq_open_train.jsonl  (7.86 MB)
squad_v2_train.csv  (117.54 MB)
squad_v2_train.parquet  (15.61 MB)
squad_v2_train_failed.csv  (0.12 MB)
squad_v2_validation.csv  (11.76 MB)
squad_v2_validation.parquet  (1.29 MB)
squad_v2_validation_failed.csv  (0.01 MB)
truthful_qa_train.csv  (0.48 MB)
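
Several datasets in the listing above appear in more than one format (for example, `.json` next to `.jsonl`, or `.parquet` next to `.csv`). As a possible follow-up check, the sketch below groups files by name to show which formats exist for each. The helper `formats_by_dataset` is hypothetical and not part of `loader.py`.

```python
from collections import defaultdict
from pathlib import Path


def formats_by_dataset(raw_dir: Path) -> dict[str, list[str]]:
    """Map each file stem in `raw_dir` to the sorted list of its extensions."""
    groups: dict[str, list[str]] = defaultdict(list)
    for f in raw_dir.iterdir():
        if f.is_file():  # ignore subdirectories
            groups[f.stem].append(f.suffix)
    # Return a plain dict with stems in alphabetical order.
    return {stem: sorted(exts) for stem, exts in sorted(groups.items())}
```

In this notebook it could be called as `formats_by_dataset(RAW_DIR)`. Files such as `squad_v2_train_failed.csv` have their own stem, so they show up as separate entries.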
