
# 00 · Environment Setup

**Purpose.** One-time checks so teammates can run the project end-to-end without surprises.  
This notebook:
- Confirms Python & package versions
- Creates the expected `data/` folders
- Generates a `.env` template (if missing) and loads env vars
- Optionally verifies Google Cloud credentials (if used)
- Performs a simple read/write sanity test in the repo
- Saves a machine-readable environment report to `../data/meta/env_report.json`

A key step in this setup is authenticating with Google Cloud through a **service account**. Each teammate needs a copy of the JSON key file for the project's service account, which grants access to the shared cloud resources.

Place your service account key file in the `credentials/` folder at the repo root. Keeping sensitive files in one place makes the setup easier to reproduce across machines and users.

> **Important:** Never commit your service account JSON file to version control. The `.gitignore` excludes every file in the `credentials/` directory, so `.env` files and JSON key files are not pushed to the public repo.

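As a quick guard for the rule above, a minimal sketch of a check that `.gitignore` contains a plain rule covering `credentials/`. The helper names and the pattern set are assumptions for illustration, and the match is deliberately naive: it does not implement full gitignore semantics such as negation or nested ignore files. For an authoritative answer, `git check-ignore -v <path>` reports which rule matches a given path.

```python
from pathlib import Path

# Hypothetical pattern set: common ways to ignore a top-level credentials folder.
IGNORE_PATTERNS = {
    "credentials", "credentials/", "/credentials", "/credentials/",
    "credentials/*", "credentials/**",
}


def credentials_ignored(gitignore_text: str) -> bool:
    """Return True if a plain rule covering credentials/ appears in the text."""
    for raw in gitignore_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line in IGNORE_PATTERNS:
            return True
    return False


def repo_ignores_credentials(repo_root: Path) -> bool:
    """Read <repo_root>/.gitignore, if present, and apply the naive check."""
    gitignore = repo_root / ".gitignore"
    if not gitignore.is_file():
        return False
    return credentials_ignored(gitignore.read_text(encoding="utf-8"))
```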
In [8]:

# --- Core ---
import os, sys, platform, json
from pathlib import Path

# --- Common data stack ---
import pandas as pd
import numpy as np

# --- Utils ---
from dotenv import load_dotenv, find_dotenv

import google.cloud
from google.cloud import storage, bigquery
from google.auth import default as google_auth_default

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("Pandas:", pd.__version__)
print("NumPy:", np.__version__)
print("dotenv available:", True)


Python: 3.12.4
Platform: Windows-11-10.0.26100-SP0
Pandas: 2.2.2
NumPy: 1.26.4
dotenv available: True

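Printing versions shows what is installed but not whether it is recent enough. A minimal sketch of a minimum-version check follows, using the standard-library `importlib.metadata`. The `MINIMUMS` table and the function names are assumptions for illustration; the project's `requirements.txt` remains the source of truth.

```python
import re
from importlib import metadata

# Hypothetical minimums for illustration only.
MINIMUMS = {"pandas": (2, 0), "numpy": (1, 24)}


def parse_major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.2.2'."""
    nums = re.findall(r"\d+", version)
    major = int(nums[0]) if nums else 0
    minor = int(nums[1]) if len(nums) > 1 else 0
    return major, minor


def check_versions(minimums: dict[str, tuple[int, int]]) -> dict[str, str]:
    """Map each package name to 'missing' or '<version> (ok|too old)'."""
    results = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            results[name] = "missing"
            continue
        ok = parse_major_minor(installed) >= minimum
        results[name] = f"{installed} ({'ok' if ok else 'too old'})"
    return results


print(check_versions(MINIMUMS))
```

Comparing only major and minor numbers keeps the sketch dependency-free; a stricter check would need a full version parser.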

## Paths & Folders

In [3]:

# Resolve project-relative paths. PROJECT_ROOT is the current working directory,
# which is assumed to be notebooks/; the repo root is its parent (PROJECT_ROOT / "..").
PROJECT_ROOT = Path.cwd().resolve()
DATA_DIR = (PROJECT_ROOT / ".." / "data").resolve()
RAW_DIR = DATA_DIR / "raw"
CLEAN_DIR = DATA_DIR / "clean"
PROCESSED_DIR = DATA_DIR / "processed"
FEATURES_DIR = DATA_DIR / "features"
META_DIR = DATA_DIR / "meta"

for p in [DATA_DIR, RAW_DIR, CLEAN_DIR, PROCESSED_DIR, FEATURES_DIR, META_DIR]:
    p.mkdir(parents=True, exist_ok=True)
    print("Ensured:", p)

# Optional: list the top-level contents of the repo root (not recursive)
print("\nTop-level project contents:")
for item in sorted((PROJECT_ROOT / "..").resolve().iterdir()):
    if item.is_dir():
        print("DIR ", item.name)
    else:
        print("FILE", item.name)


Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data
Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\raw
Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\clean
Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\processed
Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\features
Ensured: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\meta

Top-level project contents:
DIR  .git
FILE .gitignore
DIR  credentials
DIR  data
DIR  data_acquisition
DIR  data_pipeline
FILE LICENSE
DIR  notebooks
DIR  preprocessing
FILE README.md
FILE requirements.txt
DIR  shared_docs

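The cell above assumes the notebook is run from `notebooks/`. A minimal sketch of a more defensive alternative is to walk upward until a marker file or folder is found. The function name `find_repo_root` is an assumption; the markers `.git` and `requirements.txt` are taken from the top-level listing above.

```python
from pathlib import Path


def find_repo_root(start: Path, markers=(".git", "requirements.txt")) -> Path:
    """Return the nearest directory at or above `start` that contains a marker."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {markers} found above {start}")
```

Calling `find_repo_root(Path.cwd())` would then give the same repo root whether the kernel starts in `notebooks/` or at the top level.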

## `.env` Template & Load

In [6]:
# Path to credentials folder
CREDENTIALS_DIR = (PROJECT_ROOT / ".." / "credentials").resolve()
CREDENTIALS_DIR.mkdir(parents=True, exist_ok=True)

# Path to .env inside credentials
ENV_PATH = CREDENTIALS_DIR / ".env"

# Template for .env
ENV_TEMPLATE = """# --- Project Environment Variables ---
GCP_PROJECT_ID=
BQ_DATASET=
GCS_BUCKET=
"""

if not ENV_PATH.exists():
    with open(ENV_PATH, "w", encoding="utf-8") as f:
        f.write(ENV_TEMPLATE)
    print(f"Created .env template at: {ENV_PATH}")
else:
    print(f".env already present at: {ENV_PATH}")

# Load env (pass the path directly; find_dotenv() is for searching by file name)
load_dotenv(ENV_PATH, override=False)

GCP_PROJECT_ID = os.getenv("GCP_PROJECT_ID")
BQ_DATASET = os.getenv("BQ_DATASET")
GCS_BUCKET = os.getenv("GCS_BUCKET")

print("GCP_PROJECT_ID:", GCP_PROJECT_ID)
print("BQ_DATASET:", BQ_DATASET)
print("GCS_BUCKET:", GCS_BUCKET)

Created .env template at: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials\.env
GCP_PROJECT_ID: 
BQ_DATASET: 
GCS_BUCKET: 

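A freshly created template leaves all three variables empty, as the output above shows. A minimal sketch of a check that reports which ones still need values follows. The helper name `missing_env_vars` is an assumption; the variable names come from the template.

```python
import os

REQUIRED_VARS = ("GCP_PROJECT_ID", "BQ_DATASET", "GCS_BUCKET")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or blank in the given mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not (environ.get(name) or "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Fill in these variables in credentials/.env:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Treating whitespace-only values as missing catches the case where a teammate leaves a trailing space after the `=` sign.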


## GCP Authentication (replicates project standard)

The cell below follows the project's standard pattern:

- Keep a `credentials/` folder at the repo root
- Store a `secrets.env` file in that folder with a single line pointing to your service account JSON:
  ```
  GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service_account.json
  ```
- Load that path, confirm the file exists, then initialize Storage and BigQuery clients through **Application Default Credentials (ADC)**.

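Before any client is created, the "validate the file exists" step can be extended to a light structural check of the key file. The following is a minimal sketch that makes no network calls and never prints key material. The helper name `inspect_key_file` and the choice of fields to check are assumptions for illustration.

```python
import json
from pathlib import Path

# Fields a service account key file is expected to carry (a subset).
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def inspect_key_file(path) -> list[str]:
    """Return a list of problems found; an empty list means the file looks plausible."""
    path = Path(path)
    if not path.is_file():
        return [f"file not found: {path}"]
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    problems = [f"missing field: {field}" for field in EXPECTED_FIELDS if not data.get(field)]
    if data.get("type") and data["type"] != "service_account":
        problems.append(f"unexpected type: {data['type']!r}")
    return problems
```

An empty result only means the file has the expected shape; whether the key is still active can only be confirmed by an authenticated call.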

In [9]:
# --- GCP Authentication & `.env` Setup ---

# Ensure credentials directory exists
CREDENTIALS_DIR = Path("../credentials").resolve()
CREDENTIALS_DIR.mkdir(parents=True, exist_ok=True)
print(f"Credentials directory ensured at: {CREDENTIALS_DIR}")

env_path = CREDENTIALS_DIR / ".env"
if env_path.exists():
    # Pass the path directly; find_dotenv() is for searching by file name
    load_dotenv(env_path, override=False)
    print(f"Loaded project env vars from: {env_path}")
else:
    print(f"No .env found at {env_path}. Create one to store GCP_PROJECT_ID, BQ_DATASET, etc.")

# Load from ../credentials/secrets.env (private key path)
secrets_path = CREDENTIALS_DIR / "secrets.env"
if not secrets_path.exists():
    print("'secrets.env' not found. Creating a template...")
    secrets_path.write_text(
        "GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service_account.json\n",
        encoding="utf-8"
    )
    print(f"Created template at: {secrets_path}")
    print("Please update with the full path to your GCP JSON key.")
else:
    print(f"Found existing secrets file at: {secrets_path}")
# Load the key path (values already set in the environment are not overridden)
load_dotenv(secrets_path, override=False)

# Retrieve credentials path
cred_path = os.getenv("GOOGLE_APPLICATION_CREDENTIALS")
storage_client = None
bq_client = None
project_id = None

if not cred_path or not os.path.exists(cred_path):
    print(
        "GOOGLE_APPLICATION_CREDENTIALS is not set or file does not exist.\n"
        "Update credentials/secrets.env with a valid path to your service account JSON."
    )
else:
    print("GOOGLE_APPLICATION_CREDENTIALS loaded.")
    try:
        storage_client = storage.Client()
        bq_client = bigquery.Client()
        creds, project_id = google_auth_default()
        member_email = getattr(creds, "service_account_email", "unknown")
        print(f"Authenticated as: {member_email}")
        print(f"GCP Project ID: {project_id}")
    except Exception as e:
        print("Failed to initialize GCP clients:", e)

# Region setting
REGION = "us-east1"
print(f"GCP region set to: {REGION}")


Credentials directory ensured at: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials
Loaded project env vars from: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials\.env
Found existing secrets file at: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\credentials\secrets.env
GOOGLE_APPLICATION_CREDENTIALS loaded.
Authenticated as: 13742792432-compute@developer.gserviceaccount.com
GCP Project ID: dsci-591-capstone
GCP region set to: us-east1


## Read/Write Sanity Test

In [10]:

TEST_FILE = META_DIR / "rw_test.txt"
try:
    with open(TEST_FILE, "w", encoding="utf-8") as f:
        f.write("ok")
    with open(TEST_FILE, "r", encoding="utf-8") as f:
        print("Read back:", f.read())
    print("Read/write test ✓")
except Exception as e:
    print("Read/write test failed:", e)


Read back: ok
Read/write test ✓

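The test above leaves `rw_test.txt` behind in `data/meta/`. A minimal sketch of a variant that cleans up after itself follows, using a temporary file created inside the target directory. The function name `can_read_write` is an assumption.

```python
import tempfile
from pathlib import Path


def can_read_write(directory: Path) -> bool:
    """Write and read back a temporary file in `directory`; it is deleted on close."""
    try:
        with tempfile.NamedTemporaryFile(
            "w+", dir=directory, suffix=".txt", encoding="utf-8"
        ) as tmp:
            tmp.write("ok")
            tmp.flush()
            tmp.seek(0)
            return tmp.read() == "ok"
    except OSError:
        # Covers a missing directory and permission errors alike.
        return False
```

With the notebook's variables in scope, `can_read_write(META_DIR)` would perform the same check without leaving a file in the repo.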

## Save Machine-Readable Environment Report

In [12]:

report = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "paths": {
        "project_root": str(PROJECT_ROOT),
        "data_dir": str(DATA_DIR),
        "raw_dir": str(RAW_DIR),
        "clean_dir": str(CLEAN_DIR),
        "processed_dir": str(PROCESSED_DIR),
        "features_dir": str(FEATURES_DIR),
        "meta_dir": str(META_DIR),
    },
    "env": {
        "GCP_PROJECT_ID_set": bool(GCP_PROJECT_ID),
        "BQ_DATASET_set": bool(BQ_DATASET),
        "GCS_BUCKET_set": bool(GCS_BUCKET),
    }
}

out_path = META_DIR / "env_report.json"
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(report, f, indent=2)
print("Wrote:", out_path)
report


Wrote: C:\Users\iauge\Documents\Drexel MSDS\DSCI 591\DSCI591-FACTS\data\meta\env_report.json


{'python': '3.12.4',
 'platform': 'Windows-11-10.0.26100-SP0',
 'pandas': '2.2.2',
 'numpy': '1.26.4',
 'paths': {'project_root': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\notebooks',
  'data_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data',
  'raw_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data\\raw',
  'clean_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data\\clean',
  'processed_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data\\processed',
  'features_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data\\features',
  'meta_dir': 'C:\\Users\\iauge\\Documents\\Drexel MSDS\\DSCI 591\\DSCI591-FACTS\\data\\meta'},
 'env': {'GCP_PROJECT_ID_set': False,
  'BQ_DATASET_set': False,
  'GCS_BUCKET_set': False}}
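
Because the report is machine-readable, later notebooks can read it back instead of repeating these checks. A minimal sketch that lists the environment variables the report marks as unset follows. The function name `unset_env_flags` is an assumption; the `env` key and the `_set` suffix come from the report above.

```python
import json
from pathlib import Path


def unset_env_flags(report_path) -> list[str]:
    """Return the variable names whose '<NAME>_set' flag is false in the report."""
    report = json.loads(Path(report_path).read_text(encoding="utf-8"))
    flags = report.get("env", {})
    return sorted(key.removesuffix("_set") for key, is_set in flags.items() if not is_set)
```

For the report shown above, all three flags are `False`, so the function would return all three variable names.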