# Environment Setup & Authentication

Welcome to the project onboarding notebook. It helps you configure your local environment and validate access to the required Google Cloud services (BigQuery and Cloud Storage).

One key step in this setup is authenticating with Google Cloud using a **Service Account**. Each teammate needs a JSON key file for the project's service account, which authenticates their access to shared cloud resources.

Place your service account key file in the `credentials/` directory. This keeps sensitive files organized and makes your environment setup easier to manage across machines and users.

> **Important:** Never commit your service account JSON file to version control. The `.gitignore` excludes every file in the `credentials/` directory, so `.env` files and JSON key files stored there are not pushed to the public repo.

The next section creates and validates a `.env` file (`credentials/secrets.env`) that stores the path to your service account credentials, then confirms successful authentication.


## GCP Authentication & `.env` Setup

This code block does the following:

1. Checks whether `credentials/secrets.env` exists.
2. If missing, it creates a **template** with a placeholder for your service account key.
3. It then attempts to load the environment variable `GOOGLE_APPLICATION_CREDENTIALS` from the file.
4. If a valid path is found and the file exists, it initializes your GCP clients (BigQuery, Cloud Storage) and prints your authenticated service account email.

> If the `.env` file is missing, the script creates it and **halts execution** so that you can add your credentials before continuing. Once the file exists, set `GOOGLE_APPLICATION_CREDENTIALS` in it to the full path of your JSON key file, which should also be stored in the `credentials/` directory.

Once authenticated, you can begin querying BigQuery or interacting with GCS buckets programmatically.
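
Before pointing `GOOGLE_APPLICATION_CREDENTIALS` at a key file, it can help to confirm that the file is a service account key and to see which account it belongs to, without contacting Google Cloud. The sketch below assumes the standard key-file layout, in which `type`, `client_email`, and `project_id` are top-level fields; `describe_key_file` is a hypothetical helper, not part of this project.

```python
import json
from pathlib import Path


def describe_key_file(key_path):
    """Return (client_email, project_id) read from a service account key file.

    Hypothetical helper: it only parses the local JSON file and never
    contacts Google Cloud. Raises ValueError for other credential types.
    """
    data = json.loads(Path(key_path).read_text(encoding="utf-8"))
    if data.get("type") != "service_account":
        raise ValueError(f"{key_path} is not a service account key file")
    return data.get("client_email"), data.get("project_id")
```

Calling it on the path you intend to store in `secrets.env` should return the same email that the authentication cell prints after a successful login.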


In [1]:
import sys
from dotenv import load_dotenv, find_dotenv
import os
from pathlib import Path
sys.path.append(str(Path("..").resolve()))

from google.cloud import storage, bigquery
from google.auth import default
from data_pipeline.uploader import DataUploader
from data_acquisition.loader import main as run_loader


In [2]:
# GCP Authentication & `.env` Setup
# This script sets up Google Cloud authentication and checks for the necessary environment variables.

# Define secrets file path
secrets_path = Path("../credentials/secrets.env")

# Create file if it doesn't exist
if not secrets_path.exists():
    print("'secrets.env' not found. Creating a template...")
    secrets_path.parent.mkdir(parents=True, exist_ok=True)
    secrets_path.write_text("GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service_account.json\n")
    print(f"Created template at: {secrets_path.resolve()}")
    print("Please update this file with the full path to your GCP JSON key file.")
    print("Store the JSON key in the 'credentials' directory to prevent upload to GitHub.")
    sys.exit(1)
else:
    print(f"Found existing secrets file at: {secrets_path.resolve()}")
    
# Load the variables from the secrets file located above, rather than searching parent directories for it
load_dotenv(secrets_path)

cred_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")

if not cred_path or not os.path.exists(cred_path):
    print(
        "GOOGLE_APPLICATION_CREDENTIALS is not set or the file does not exist.\n"
        "Please ensure secrets.env contains a valid path to your service account JSON file."
    )
    storage_client = None
    bq_client = None
    project_id = None  # Defined so later cells can reference it without a NameError
else:
    print("GOOGLE_APPLICATION_CREDENTIALS loaded from .env")

    # Initialize GCP clients using ADC
    storage_client = storage.Client()
    bq_client = bigquery.Client()

    # Confirm authentication
    creds, project_id = default()
    member_email = creds.service_account_email
    print(f"Authenticated as: {member_email}")
    print(f"GCP Project ID: {project_id}")

# GCP configuration
REGION = "us-east1"
print(f"GCP region set to: {REGION}")


Found existing secrets file at: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials\secrets.env
GOOGLE_APPLICATION_CREDENTIALS loaded from .env
Authenticated as: 13742792432-compute@developer.gserviceaccount.com
GCP Project ID: dsci-591-capstone
GCP region set to: us-east1


In [3]:

# Re-create the GCP clients pinned to the detected project ID (None lets the library infer it)
storage_client = storage.Client(project=project_id or None)
bq_client = bigquery.Client(project=project_id or None)

In [5]:
# Make the project root importable (the first cell already does this; repeating it is harmless)
sys.path.append(str(Path("..")))

run_loader(force=False, prompt_user=False)

Downloading datasets from URLs...



URL Downloads:  20%|██        | 1/5 [00:46<03:07, 46.87s/it]

Data written to ..\data\raw\hotpot_train.json


URL Downloads:  40%|████      | 2/5 [00:51<01:05, 21.75s/it]

Data written to ..\data\raw\hotpot_dev_distractor.json


URL Downloads:  60%|██████    | 3/5 [00:55<00:27, 13.63s/it]

Data written to ..\data\raw\hotpot_dev_fullwiki.json


URL Downloads:  80%|████████  | 4/5 [00:55<00:08,  8.42s/it]

Data written to ..\data\raw\fever_dev_train.jsonl


URL Downloads: 100%|██████████| 5/5 [00:55<00:00, 11.13s/it]


Data written to ..\data\raw\truthful_qa_train.csv

Downloading datasets from Hugging Face...



Hugging Face Downloads:   0%|          | 0/2 [00:00<?, ?it/s]

Creating CSV from Arrow format:   0%|          | 0/131 [00:00<?, ?ba/s]

Hugging Face Downloads:  50%|█████     | 1/2 [00:06<00:06,  6.53s/it]

Saved train split to ..\data\raw\squad_v2_train.csv


Creating json from Arrow format:   0%|          | 0/88 [00:00<?, ?ba/s]

Hugging Face Downloads: 100%|██████████| 2/2 [00:10<00:00,  5.36s/it]


Saved train split to ..\data\raw\nq_open_train.json

Converting JSON files to JSONL format...



Converting JSON to JSONL:   0%|          | 0/4 [00:00<?, ?it/s]

Converting hotpot_dev_distractor.json to JSONL format


Converting JSON to JSONL:  25%|██▌       | 1/4 [00:00<00:02,  1.18it/s]

Converted ..\data\raw\hotpot_dev_distractor.json to ..\data\raw\hotpot_dev_distractor.jsonl
Converting hotpot_dev_fullwiki.json to JSONL format


Converting JSON to JSONL:  50%|█████     | 2/4 [00:01<00:01,  1.18it/s]

Converted ..\data\raw\hotpot_dev_fullwiki.json to ..\data\raw\hotpot_dev_fullwiki.jsonl
Converting hotpot_train.json to JSONL format


Converting JSON to JSONL:  50%|█████     | 2/4 [00:10<00:01,  1.18it/s]

Converted ..\data\raw\hotpot_train.json to ..\data\raw\hotpot_train.jsonl


Converting JSON to JSONL: 100%|██████████| 4/4 [00:10<00:00,  2.70s/it]

Converting nq_open_train.json to JSONL format
Failed to convert ..\data\raw\nq_open_train.json to JSONL: Extra data: line 2 column 1 (char 90)
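
The `Extra data: line 2 column 1` message is what Python's `json` module reports when a second JSON document follows the first one. That usually means the file is already newline-delimited, although the output above does not confirm how `nq_open_train.json` was written. A minimal sketch for checking, where `looks_like_jsonl` is a hypothetical helper and not part of the project's loader:

```python
import json
from pathlib import Path


def looks_like_jsonl(path, sample_lines=5):
    """Return True if the first few non-empty lines each parse as JSON on their own."""
    checked = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
            checked += 1
            if checked >= sample_lines:
                break
    # Require at least two records so that a compact single-document
    # JSON file is not mistaken for JSON Lines.
    return checked >= 2
```

If `looks_like_jsonl(Path("../data/raw/nq_open_train.json"))` returns `True`, the file could likely be used as JSONL directly and the conversion step skipped for it. A JSONL file holding a single record is reported as `False` by this sketch, which is a deliberate trade-off.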





In [6]:
import json
from pathlib import Path
from tqdm import tqdm

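def normalize_fever_evidence(record: dict) -> list[dict]:
    """Flatten the nested FEVER evidence groups into a single list of dicts.

    Each evidence item is expected to be a 4-element list:
    [annotation_id, evidence_id, wikipedia_title, sentence_id].
    Items with any other shape are dropped.
    """
    normalized = []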
    for evidence_group in record.get("evidence", []):
        for item in evidence_group:
            if isinstance(item, list) and len(item) == 4:
                normalized.append({
                    "annotation_id": item[0],
                    "evidence_id": item[1],
                    "wikipedia_title": item[2],
                    "sentence_id": item[3]
                })
    return normalized

# Paths
input_path = Path("../data/raw/fever_dev_train.jsonl")
output_path = Path("../data/raw/fever_dev_train_cleaned.jsonl")
output_path.parent.mkdir(parents=True, exist_ok=True)

# Convert
with open(input_path, "r", encoding="utf-8") as f_in, open(output_path, "w", encoding="utf-8") as f_out:
    for i, line in enumerate(tqdm(f_in, desc="Normalizing evidence")):
        try:
            record = json.loads(line)
            if "evidence" in record:
                record["evidence"] = normalize_fever_evidence(record)
            json.dump(record, f_out)
            f_out.write("\n")
        except Exception as e:
            tqdm.write(f"⚠️ Skipped line {i+1}: {e}")

print(f"\n✅ Saved cleaned JSONL to: {output_path}")

Normalizing evidence: 19998it [00:00, 60983.29it/s]


✅ Saved cleaned JSONL to: ..\data\raw\fever_dev_train_cleaned.jsonl





In [7]:
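# Preview the first three cleaned records (read with the same encoding used to write them)
with open(output_path, "r", encoding="utf-8") as f: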
    for _ in range(3):
        print(json.loads(f.readline()))

{'id': 91198, 'verifiable': 'NOT VERIFIABLE', 'label': 'NOT ENOUGH INFO', 'claim': 'Colin Kaepernick became a starting quarterback during the 49ers 63rd season in the National Football League.', 'evidence': [{'annotation_id': 108548, 'evidence_id': None, 'wikipedia_title': None, 'sentence_id': None}]}
{'id': 194462, 'verifiable': 'NOT VERIFIABLE', 'label': 'NOT ENOUGH INFO', 'claim': 'Tilda Swinton is a vegan.', 'evidence': [{'annotation_id': 227768, 'evidence_id': None, 'wikipedia_title': None, 'sentence_id': None}]}
{'id': 137334, 'verifiable': 'VERIFIABLE', 'label': 'SUPPORTS', 'claim': 'Fox 2000 Pictures released the film Soul Food.', 'evidence': [{'annotation_id': 289914, 'evidence_id': 283015, 'wikipedia_title': 'Soul_Food_-LRB-film-RRB-', 'sentence_id': 0}, {'annotation_id': 291259, 'evidence_id': 284217, 'wikipedia_title': 'Soul_Food_-LRB-film-RRB-', 'sentence_id': 0}, {'annotation_id': 293412, 'evidence_id': 285960, 'wikipedia_title': 'Soul_Food_-LRB-film-RRB-', 'sentence_id':

In [None]:
import json
from pathlib import Path

path = Path("../data/raw/fever_dev_train.jsonl")

with open(path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        try:
            record = json.loads(line)
            print(f"\n--- Record {i+1} ---\n")
            for key, val in record.items():
                print(f"{key}: {type(val)} → {val}")
        except Exception as e:
            print(f"❌ JSON parsing error on line {i+1}: {e}")
        if i == 2:
            break  # Just show first 3 lines


In [None]:
# Initialize DataUploader with BigQuery client and project ID
uploader = DataUploader(
    bq_client=bq_client,
    project_id=project_id,
    dataset_name="data_raw"
)

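# Upload every file in data/raw to BigQuery.
# Each table is named after the file stem, so files that differ only by extension share a table name.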
for path in Path("../data/raw").glob("*.*"):
    name = path.stem
    uploader.upload_to_bigquery(
        file_path=path,
        table_name=name
    )
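
Because the loop above derives each table name from `path.stem`, files that differ only by extension (for example `hotpot_train.json` and `hotpot_train.jsonl`, both written earlier in this notebook) target the same table. How `DataUploader` handles a repeated table name is not shown here, so it may be worth listing such collisions before uploading. A minimal sketch, with `find_stem_collisions` as a hypothetical helper:

```python
from collections import defaultdict
from pathlib import Path


def find_stem_collisions(paths):
    """Map each shared stem to the file names that would target the same table."""
    by_stem = defaultdict(list)
    for p in paths:
        p = Path(p)
        by_stem[p.stem].append(p.name)
    # Keep only stems that are used by more than one file.
    return {stem: sorted(names) for stem, names in by_stem.items() if len(names) > 1}
```

Running `find_stem_collisions(Path("../data/raw").glob("*.*"))` before the upload loop shows which files would need to be filtered out or given distinct table names.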
