# Environment Setup, Authentication and Acquisition

Welcome to the project onboarding notebook. This notebook helps you configure your local environment and validate access to required services such as Google Cloud (BigQuery, Cloud Storage).

One key step in this setup is authenticating with Google Cloud using a **Service Account**. Each teammate needs a JSON key file for the project's service account, which authenticates their access to shared cloud resources.

Place your service account key file in the `credentials/` folder. This keeps sensitive files organized and makes it easier to manage your environment setup across machines or users.

> **Important:** Never commit your service account JSON file to version control. The `.gitignore` covers every file in the `credentials/` directory, so `.env` files and JSON key files stored there will not be pushed to the public repo.

The next section will create and validate a `.env` file that stores the path to your service account credentials and confirms successful authentication.


## GCP Authentication & `.env` Setup

This code block does the following:

1. Checks whether `credentials/secrets.env` exists.
2. If missing, it creates a **template** with a placeholder for your service account key.
3. It then attempts to load the environment variable `GOOGLE_APPLICATION_CREDENTIALS` from the file.
4. If a valid path is found and the file exists, it initializes your GCP clients (BigQuery, Cloud Storage) and prints your authenticated service account email.

> If `credentials/secrets.env` is missing, the script creates it and **halts execution** so you can add your credentials before continuing. Once the file exists, set `GOOGLE_APPLICATION_CREDENTIALS` to the full path of your JSON key file, which should also be stored in the `credentials/` directory.

Once authenticated, you can begin querying BigQuery or interacting with GCS buckets programmatically.
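
Steps 3 and 4 can also be checked offline, before any GCP call is made. The sketch below is a stdlib-only stand-in for what `load_dotenv` does and is not the notebook's actual logic: `read_env_file` and `describe_key_file` are hypothetical helper names, and it assumes a standard service account key, which is a JSON file containing `type` and `client_email` fields.

```python
import json
from pathlib import Path


def read_env_file(env_path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(env_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def describe_key_file(env_path):
    """Return the service account email from the key file, or None if unusable."""
    cred_path = read_env_file(env_path).get("GOOGLE_APPLICATION_CREDENTIALS")
    if not cred_path or not Path(cred_path).is_file():
        return None  # placeholder path or missing file
    key = json.loads(Path(cred_path).read_text())
    if key.get("type") != "service_account":
        return None  # not a service account key
    return key.get("client_email")
```

Calling `describe_key_file("../credentials/secrets.env")` on the untouched template should return `None`, since the placeholder path does not point to a real file.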


In [1]:
import sys
from dotenv import load_dotenv, find_dotenv
import os
from pathlib import Path
sys.path.append(str(Path("..").resolve()))

from google.cloud import storage, bigquery
from google.auth import default
from data_pipeline.uploader import DataUploader
from data_acquisition.loader import main as run_loader
from data_acquisition.data_cleaner import main as run_cleaner


In [None]:
# GCP Authentication & `.env` Setup
# This script sets up Google Cloud authentication and checks for the necessary environment variables.

# Ensure credentials directory exists
credentials_dir = Path("../credentials")

if not credentials_dir.exists():
    print("Credentials directory not found. Creating...")
    credentials_dir.mkdir(parents=True, exist_ok=True)
else:
    print(f"Credentials directory found at: {credentials_dir.resolve()}")
    
# Define secrets file path
secrets_path = Path("../credentials/secrets.env")

# Create file if it doesn't exist
if not secrets_path.exists():
    print("'secrets.env' not found. Creating a template...")
    secrets_path.parent.mkdir(parents=True, exist_ok=True)
    secrets_path.write_text("GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service_account.json\n")
    print(f"Created template at: {secrets_path.resolve()}")
    print("Please update this file with the full path to your GCP JSON key file.")
    print("Store the JSON key in the 'credentials' directory to prevent upload to GitHub.")
    # Stop here so the template can be filled in before the cell is re-run
    sys.exit(1)
else:
    print(f"Found existing secrets file at: {secrets_path.resolve()}")
    
# Load variables directly from the secrets file located above
load_dotenv(secrets_path)

cred_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

if not cred_path or not os.path.exists(cred_path):
    print(
        "GOOGLE_APPLICATION_CREDENTIALS is not set or the file does not exist.\n"
        "Please ensure secrets.env contains a valid path to your service account JSON file."
    )
    # No valid credentials: leave clients and project unset so later cells can check for None
    storage_client = None
    bq_client = None
    project_id = None
else:
    print("GOOGLE_APPLICATION_CREDENTIALS loaded from .env")

    # Resolve credentials and project via Application Default Credentials (ADC)
    creds, project_id = default()
    # Not every credential type exposes service_account_email, so fall back gracefully
    member_email = getattr(creds, "service_account_email", "unknown")
    print(f"Authenticated as: {member_email}")
    print(f"GCP Project ID: {project_id}")

    # Initialize GCP clients, pinning the project ID when ADC reports one
    storage_client = storage.Client(project=project_id)
    bq_client = bigquery.Client(project=project_id)

# GCP configuration
REGION = "us-east1"
print(f"GCP region set to: {REGION}")


Found existing secrets file at: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials\secrets.env
GOOGLE_APPLICATION_CREDENTIALS loaded from .env
Authenticated as: 13742792432-compute@developer.gserviceaccount.com
GCP Project ID: dsci-591-capstone
GCP region set to: us-east1


# Data Acquisition

This section outlines the process used to gather, standardize, and prepare a diverse set of question-answering (QA) datasets for downstream machine learning tasks.

Our goal is to build a robust and scalable pipeline for retrieving raw datasets, performing initial validation and cleaning, and outputting schema-consistent files ready for preprocessing, exploration, feature extraction, and modeling.

## Approach Overview

The pipeline is composed of three main stages:

1. **Data Loading** (`loader.py`):  
   Downloads raw datasets from either direct URLs or the Hugging Face Hub. All files are stored in `/data/raw/` in their original formats (e.g., JSONL, CSV, Parquet).

2. **Data Cleaning** (`data_cleaner.py`):  
   Transforms raw files into clean, flat CSV or JSONL files with the standardized fields required for QA tasks: `id`, `title`, `context`, `question`, and `answers`. Rows with formatting or structural issues are logged separately for inspection.

3. **Data Upload** (`uploader.py`):  
   The final upload step pushes cleaned datasets to BigQuery for centralized cloud storage, enabling streamlined access to modeling workflows using Google's ML tools, including Vertex AI.

The pipeline is modular by design. New datasets can be added by extending the loader configuration and creating a dataset-specific cleaner as needed.
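
One way to picture this modularity is a small registry that maps a filename prefix to its cleaning function and falls back to copying the file unchanged. This is only an illustrative sketch: `CLEANERS`, `register_cleaner`, `clean_squad`, `copy_as_is`, and `pick_cleaner` are hypothetical names, and the project's cleaner may be organized differently.

```python
from pathlib import Path

# Hypothetical registry mapping a filename prefix to its cleaning function
CLEANERS = {}


def register_cleaner(prefix):
    """Decorator that registers a cleaner for files starting with `prefix`."""
    def decorator(func):
        CLEANERS[prefix] = func
        return func
    return decorator


@register_cleaner("squad")
def clean_squad(path):
    # Placeholder body: a real cleaner would parse and rewrite the file
    return f"cleaned {path.name} with SQuAD rules"


def copy_as_is(path):
    # Default for datasets that need no dataset-specific handling
    return f"copied {path.name} unchanged"


def pick_cleaner(path):
    """Return the registered cleaner for this file, or the pass-through default."""
    for prefix, func in CLEANERS.items():
        if path.name.startswith(prefix):
            return func
    return copy_as_is


for name in ["squad_v2_train.csv", "truthful_qa_train.csv"]:
    path = Path(name)
    print(pick_cleaner(path)(path))
```

With this layout, supporting a new dataset means adding one decorated function; files without a registered cleaner still flow through the default.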


## Run the Data Loading Script

The `loader.py` script is responsible for downloading and storing a core set of fact verification and QA datasets into the local project environment in their **original file formats** (e.g., `.json`, `.jsonl`, `.parquet`, or `.csv`).

Currently supported datasets include:
- **FEVER 2.0**
- **HotpotQA**
- **Natural Questions (Lite)**
- **SQuAD v2.0**
- **TruthfulQA**

The script is built around a modular `DataDownloader` class, which encapsulates:
- dataset-specific retrieval logic,
- support for both **Hugging Face Hub** and **direct download URLs**,
- dynamic filetype handling for JSON, JSONL, CSV, and Parquet,
- customizable storage paths.

This design makes it easy to extend with new datasets: simply update the Hugging Face or URL mappings in `loader.py`, and rerun the script. Each dataset is downloaded only once unless the `overwrite` flag is enabled.

> **Note:** All files are saved into the `/data/raw/` folder using consistent and identifiable filenames to support reproducibility and transparent data lineage in downstream processing.
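
The download-once behaviour can be sketched with the standard library alone. The function below is an assumption-laden illustration, not the `DataDownloader` implementation: `download_once` is a hypothetical name, and its `overwrite` parameter only mirrors the flag described above.

```python
import shutil
import urllib.request
from pathlib import Path


def download_once(url, dest, overwrite=False):
    """Fetch `url` to `dest`; return True if a download happened, False if skipped."""
    dest = Path(dest)
    if dest.exists() and not overwrite:
        return False  # keep the existing raw file untouched
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk in its original format
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return True
```

Returning a boolean makes it easy for a caller to log which datasets were fetched and which were already present.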


In [None]:
# Download raw data files from URLs and Hugging Face

run_loader(force=False, prompt_user=False)

## Run the Dataset Cleaning Script

The `data_cleaner.py` script processes raw QA datasets from the `/data/raw/` directory and transforms them into clean, BigQuery-compatible CSV or JSONL files stored in `/data/clean/`.

This script uses the `DataCleaner` class, which includes dataset-specific parsing and normalization logic to:
- **Standardize nested answer formats** (e.g., from arrays or dictionaries),
- **Escape problematic characters** (e.g., rogue quotes or newline characters),
- **Validate presence of required fields** (`id`, `title`, `context`, `question`, `answers`),
- **Log and isolate failures** in a separate `*_failed.csv` file for inspection.
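
The validation and failure-isolation steps above can be sketched as follows. This is a minimal illustration rather than the `DataCleaner` logic: `flatten_answers` and `split_rows` are hypothetical helpers, and the assumed nested shape (a dict with a `text` list) follows the common SQuAD-style layout.

```python
import json

REQUIRED_FIELDS = ("id", "title", "context", "question", "answers")


def flatten_answers(answers):
    """Reduce nested answer shapes (dict, list, or string) to one JSON string."""
    if isinstance(answers, dict):
        answers = answers.get("text", [])
    if isinstance(answers, str):
        answers = [answers]
    return json.dumps(list(answers))


def split_rows(rows):
    """Return (clean, failed): rows missing any required field go to `failed`."""
    clean, failed = [], []
    for row in rows:
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            failed.append(row)
            continue
        clean.append({**row, "answers": flatten_answers(row["answers"])})
    return clean, failed
```

Keeping failed rows in their original form, instead of dropping them, is what makes later inspection possible.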

Key features:
- Handles inconsistencies across datasets with diverse schemas (e.g., FEVER, HotpotQA, SQuAD).
- Uses dataset-specific cleaning functions, which keeps the preprocessing logic modular and extensible.
- Writes all successfully cleaned rows to `/data/clean/` and any rows with malformed or incomplete data to `/data/raw/*_failed.csv`.
    - Failure logging is currently applied only to **SQuAD v2.0**, which was the most problematic dataset to convert from raw to cleaned form.

> **Note:** This step is essential before loading data into BigQuery, as unescaped quotes and inconsistent schemas will cause ingestion to fail.
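
To illustrate the quoting concern, the sketch below writes a row containing embedded quotes and a newline with Python's `csv` module and reads it back. It is a generic example, not the cleaner's code; the sample row is made up. On the BigQuery side, CSV load jobs also need quoted newlines to be allowed (the `allow_quoted_newlines` load option) for such rows to ingest.

```python
import csv
import io

# Made-up row with the characters that typically break naive CSV output
row = {"id": "1", "context": 'He said "hi"\nand left.', "question": "Who?"}

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(row), quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerow(row)  # inner quotes are doubled, the newline stays inside the quoted field

# Reading the text back recovers the original values
buffer.seek(0)
restored = next(csv.DictReader(buffer))
print(restored == row)
```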


In [4]:
# Convert JSON to JSONL
# Clean JSONL data structure for BigQuery upload
# Clean CSV data structure for BigQuery upload

run_cleaner()


Cleaning datasets: 0it [00:00, ?it/s]

Processing JSONL file: fever_dev_train.jsonl


Cleaning fever_dev_train.jsonl: 19998it [00:00, 62420.79it/s]
Cleaning datasets: 1it [00:00,  3.06it/s]

Converting hotpot_dev_distractor.json to JSONL format


Cleaning datasets: 1it [00:01,  3.06it/s]

Converted ..\data\raw\hotpot_dev_distractor.json to ..\data\raw\hotpot_dev_distractor.jsonl


Cleaning hotpot_dev_distractor.jsonl: 7405it [00:01, 6438.10it/s]
Cleaning datasets: 2it [00:02,  1.33s/it]

Processing JSONL file: hotpot_dev_distractor.jsonl


Cleaning hotpot_dev_distractor.jsonl: 7405it [00:01, 6382.12it/s]
Cleaning datasets: 3it [00:03,  1.25s/it]

Converting hotpot_dev_fullwiki.json to JSONL format


Cleaning datasets: 3it [00:04,  1.25s/it]

Converted ..\data\raw\hotpot_dev_fullwiki.json to ..\data\raw\hotpot_dev_fullwiki.jsonl


Cleaning hotpot_dev_fullwiki.jsonl: 7405it [00:01, 6187.41it/s]
Cleaning datasets: 4it [00:05,  1.57s/it]

Processing JSONL file: hotpot_dev_fullwiki.jsonl


Cleaning hotpot_dev_fullwiki.jsonl: 7405it [00:01, 6265.23it/s]
Cleaning datasets: 5it [00:06,  1.43s/it]

Converting hotpot_train.json to JSONL format


Cleaning datasets: 5it [00:15,  1.43s/it]

Converted ..\data\raw\hotpot_train.json to ..\data\raw\hotpot_train.jsonl


Cleaning hotpot_train.jsonl: 90447it [00:14, 6291.31it/s]
Cleaning datasets: 6it [00:30,  9.13s/it]

Processing JSONL file: hotpot_train.jsonl


Cleaning hotpot_train.jsonl: 90447it [00:14, 6319.75it/s]
Cleaning datasets: 7it [00:45, 10.84s/it]

Converting nq_open_train.json to JSONL format
JSON decode failed with: Extra data: line 2 column 1 (char 90). Assuming file is already JSONL.


Cleaning nq_open_train.jsonl: 87925it [00:00, 125304.03it/s]
Cleaning datasets: 8it [00:46,  7.65s/it]

Processing JSONL file: nq_open_train.jsonl


Cleaning nq_open_train.jsonl: 87925it [00:00, 127958.67it/s]
Cleaning datasets: 9it [00:46,  5.48s/it]

Processing CSV file: squad_v2_train.csv
Cleaning SQuAD CSV: squad_v2_train.csv


Cleaning datasets: 10it [01:23, 15.02s/it]

Saved cleaned CSV to: ..\data\clean\squad_v2_train.csv
Saved 109 failed rows to: ..\data\raw\squad_v2_train_failed.csv
Processing CSV file: squad_v2_validation.csv
Cleaning SQuAD CSV: squad_v2_validation.csv


Cleaning datasets: 14it [01:26,  6.18s/it]

Saved cleaned CSV to: ..\data\clean\squad_v2_validation.csv
Saved 11 failed rows to: ..\data\raw\squad_v2_validation_failed.csv
Processing CSV file: truthful_qa_train.csv
No specific cleaning function for truthful_qa_train.csv, copying as is.
Data cleaning completed successfully.





## Upload Cleaned Data to BigQuery

Once datasets have been cleaned and standardized, the final step is to upload them to a centralized BigQuery dataset for easy access in cloud-based analysis and modeling workflows.

The `DataUploader` class manages this process, handling table creation and data ingestion. Each file in the `/data/clean/` directory (CSV or JSONL) is read and pushed to a BigQuery table under the specified dataset (`data_clean` by default), with the table name matching the file name without its extension.

This enables a seamless transition from local data wrangling to scalable, cloud-native machine learning development.
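
The file-to-table naming rule can be previewed without touching BigQuery. The sketch below is an assumption, not the `DataUploader` code: `plan_uploads` and `SOURCE_FORMATS` are hypothetical names, and the format strings correspond to BigQuery's `CSV` and `NEWLINE_DELIMITED_JSON` source formats.

```python
from pathlib import Path

# Assumed mapping from file extension to BigQuery source format names
SOURCE_FORMATS = {".csv": "CSV", ".jsonl": "NEWLINE_DELIMITED_JSON"}


def plan_uploads(clean_dir, project_id, dataset_name="data_clean"):
    """Yield (file_path, table_id, source_format) for each supported file."""
    for path in sorted(Path(clean_dir).glob("*.*")):
        source_format = SOURCE_FORMATS.get(path.suffix.lower())
        if source_format is None:
            continue  # skip files this sketch does not know how to load
        table_id = f"{project_id}.{dataset_name}.{path.stem}"
        yield path, table_id, source_format
```

Because the table name comes from the file stem, a CSV and a JSONL file sharing the same stem would map to the same table, so a dry run like this can help spot collisions before uploading.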


In [5]:
# Initialize DataUploader with BigQuery client and project ID
uploader = DataUploader(
    bq_client=bq_client,
    project_id=project_id,
    dataset_name="data_clean"
)

# Upload files to BigQuery
for path in Path("../data/clean").glob("*.*"):
    name = path.stem
    uploader.upload_to_bigquery(
        file_path=path,
        table_name=name
    )


Table dsci-591-capstone.data_clean.fever_dev_train already exists. Using existing table.
Successfully uploaded ..\data\clean\fever_dev_train.jsonl to BigQuery table dsci-591-capstone.data_clean.fever_dev_train.
Table dsci-591-capstone.data_clean.hotpot_dev_distractor already exists. Using existing table.
Successfully uploaded ..\data\clean\hotpot_dev_distractor.jsonl to BigQuery table dsci-591-capstone.data_clean.hotpot_dev_distractor.
Table dsci-591-capstone.data_clean.hotpot_dev_fullwiki already exists. Using existing table.
Successfully uploaded ..\data\clean\hotpot_dev_fullwiki.jsonl to BigQuery table dsci-591-capstone.data_clean.hotpot_dev_fullwiki.
Table dsci-591-capstone.data_clean.hotpot_train already exists. Using existing table.
Successfully uploaded ..\data\clean\hotpot_train.jsonl to BigQuery table dsci-591-capstone.data_clean.hotpot_train.
Table dsci-591-capstone.data_clean.nq_open_train already exists. Using existing table.
Successfully uploaded ..\data\clean\nq_open_trai