<a href="https://colab.research.google.com/github/jennastengel/csc786-ethics-demo/blob/main/CSC786_Ethics_Demo_ST_Edited.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


## The Big Picture
Think of your Colab notebook as the entry point to your research repo.
The notebook does the work (collects data, logs metadata), while the repo (on GitHub) stores the evidence — code, data samples, metadata logs, and ethical documentation.

As a prerequisite, you need to create the GitHub repo first (empty). See the next cell for details.



## Create an empty GitHub repo (UI steps)
1. Sign in to GitHub.
2. Click the + (top-right) → New repository.
3. Repository name: e.g., csc786-ethics-demo.
4. Owner: your account.
5. Visibility: Public (recommended for this class) or Private.
6. Important: Do NOT check “Add a README”, “Add .gitignore”, or “Choose a license”. Leaving these unchecked keeps the repo truly empty, which makes the first push from Colab simplest.
7. Click Create repository.
8. On the next page, copy the HTTPS URL. You will use it later in the notebook.

# Create (or confirm) a GitHub Personal Access Token (PAT) for Colab pushes
You’ll push from Colab using HTTPS + a token (safer/simpler than SSH during class).
1. Go to https://github.com/settings → Developer settings → Personal access tokens. Choose “Fine-grained tokens” (preferred).
2. Click Generate new token and fill in:
   - Token name (e.g., colab-demo)
   - Only select repositories → choose the course repository
   - Permissions → Add permissions → Contents → Access: Read and write
3. Generate the token and copy it once (you won’t see it again).

Tip: keep this token handy just for the class; you can revoke it afterward.

# Setup Cell
Run once per session

In [None]:
import os
from getpass import getpass

# Paste the token at the prompt; never hard-code it in a notebook that gets committed.
os.environ["GITHUB_TOKEN"] = getpass("GitHub PAT: ")

!git config --global user.name "Jenna Stengel" ## Display name, not necessarily your username
!git config --global user.email "jenna.stengel@trojans.dsu.edu"


# One time only: Connect the empty repo from Colab (first push)

In [None]:
!git init
!git add .
!git commit -m "Initial reproducibility demo"
!git branch -M main

# Replace the username and repo name with your own; the token is read from the GITHUB_TOKEN environment variable.

!git remote add origin https://jennastengel:$GITHUB_TOKEN@github.com/jennastengel/csc786-ethics-demo.git

!git push -u origin main

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/
[master (root-commit) 0f577db] Initial reproducibility demo
 21 files changed, 51025 insertions(+)
 create mode 100644 .config/.last_opt_in_prompt.yaml
 create mode 100644 .config/.last_survey_prompt.yaml
 create mode 100644 .config/.last_update_check.json
 create mode 100644 .config/active_config
 create mode 100644 .config/config_sentinel
 create mode 100644 .config/configurations/config_default
 create mode 100644 .config/default_configs.db
 create mode 100

If everything is correct, you’ll see the push succeed and your files appear in the GitHub repo (refresh the repo page).
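
Because the remote URL used for the push embeds the token, it is worth being careful about where that URL gets printed or logged. The following is a minimal sketch, not part of git or Colab: the helper names `build_remote_url` and `redact_remote_url` and the placeholder values are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def build_remote_url(user: str, repo: str, token: str) -> str:
    """Return an HTTPS remote shaped like the one above: https://<user>:<token>@github.com/<user>/<repo>.git"""
    return f"https://{user}:{token}@github.com/{user}/{repo}.git"


def redact_remote_url(url: str) -> str:
    """Replace any password or token in the URL with '***' so it can be printed safely."""
    parts = urlsplit(url)
    if parts.password is None:
        return url  # nothing to hide
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Placeholder values for illustration only; never hard-code a real token.
url = build_remote_url("your-username", "csc786-ethics-demo", "EXAMPLE_TOKEN")
print(redact_remote_url(url))
```

Printing only the redacted form keeps the token out of saved cell output, which would otherwise be committed along with the notebook.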

In [None]:
%%bash
# --- Create and push .gitignore for clean, ethical repo ---

cat > .gitignore << 'EOF'
.ipynb_checkpoints/
__pycache__/
data/*
.env
*.env
EOF

git add .gitignore
git commit -m "Add .gitignore for data, cache, and secrets"
git push


[main 6809012] Add .gitignore for data, cache, and secrets
 1 file changed, 5 insertions(+)
 create mode 100644 .gitignore


remote: This repository moved. Please use the new location:        
remote:   https://github.com/jennastengel/CSC786-Ethics-demo.git        
To https://github.com/jennastengel/csc786-ethics-demo.git
   0f577db..6809012  main -> main
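
The heredoc above overwrites `.gitignore` every time the cell runs. As an alternative sketch, assuming a hypothetical helper named `ensure_gitignore`, the same patterns can be added from Python so that re-running the cell never duplicates or discards entries. The demo writes to a temporary directory rather than the real repo.

```python
import tempfile
from pathlib import Path

# Same patterns as the .gitignore cell above.
PATTERNS = [".ipynb_checkpoints/", "__pycache__/", "data/*", ".env", "*.env"]


def ensure_gitignore(path: Path, patterns: list[str]) -> list[str]:
    """Append any patterns missing from the file at `path`; return the ones that were added."""
    text = path.read_text() if path.exists() else ""
    present = {line.strip() for line in text.splitlines()}
    missing = [p for p in patterns if p not in present]
    if missing:
        # Start on a fresh line if the existing file does not end with a newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        path.write_text(text + prefix + "\n".join(missing) + "\n")
    return missing


with tempfile.TemporaryDirectory() as tmp:
    gitignore = Path(tmp) / ".gitignore"
    first = ensure_gitignore(gitignore, PATTERNS)   # file is created, all patterns added
    second = ensure_gitignore(gitignore, PATTERNS)  # nothing left to add
    print("added on first run:", first)
    print("added on second run:", second)
```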


# When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content, instead of re-initializing a new one.

So, the reconnect workflow will look like this:

In [None]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/jennastengel/csc786-ethics-demo.git # todo update url
%cd csc786-ethics-demo


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
!git remote set-url origin https://jennastengel:$GITHUB_TOKEN@github.com/jennastengel/csc786-ethics-demo.git # todo update url

!git add .
!git commit -m "Update from Colab session"
!git push


Cloning into 'csc786-ethics-demo'...
remote: Enumerating objects: 53, done.
remote: Counting objects:  41% (22/53)

In [None]:
# You can always check what's currently configured by:

!git config --global --list


user.name=Jenna Stengel
user.email=jenna.stengel@trojans.dsu.edu


## Colab-specific access details
Note: While we work in Colab, everything inside /content/ is a temporary mini-repo.
As you run the notebook:
1. It creates the folder /content/csc786-ethics-demo/data/ for your CSVs.
2. It appends provenance info into /content/csc786-ethics-demo/DATA_README.md.
3. You can add extra markdown files manually.

## Step 1 – Setup Environment

In [None]:
!pip install python-dotenv --quiet
import os, pandas as pd, requests, hashlib, json, sys, time
from datetime import datetime, timezone
from pathlib import Path

ROOT = Path("/content/csc786-ethics-demo") ## todo: may update repo name if needed
DATA = ROOT / "data"
DATA.mkdir(parents=True, exist_ok=True)  # also creates ROOT if the repo folder is missing
print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/CSC786-Ethics-demo/data



# Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  
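
One way to act on the last reminder is to scan the working folder for secret-looking strings before every commit. The sketch below is a heuristic only: the two patterns (the `github_pat_` prefix of fine-grained GitHub tokens and any standalone 32-character hex string) and the helper names `find_secrets` and `scan_tree` are assumptions for illustration, not an exhaustive detector.

```python
import re
from pathlib import Path

# Heuristic patterns (assumptions for this sketch, not an exhaustive list).
SECRET_PATTERNS = {
    "github_fine_grained_pat": re.compile(r"github_pat_[A-Za-z0-9_]{20,}"),
    "hex_key_32": re.compile(r"\b[0-9a-f]{32}\b"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, pattern_name) for every line that looks like it holds a secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


def scan_tree(root: Path) -> dict[str, list[tuple[int, str]]]:
    """Scan every file under `root` (skipping .git) and report files with suspicious lines."""
    report = {}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or ".git" in rel.parts:
            continue
        hits = find_secrets(path.read_text(errors="ignore"))
        if hits:
            report[str(rel)] = hits
    return report


# Obviously fake values, built at runtime so this cell itself does not trip the scanner.
sample = "x = 1\nGITHUB_TOKEN=github_pat_" + "A" * 30 + "\nkey=" + "0123456789abcdef" * 2
print(find_secrets(sample))
```

In the notebook, `scan_tree(ROOT)` could be run just before `git add`. A 64-character SHA-256 digest does not match the 32-character pattern, so the hashes in the provenance log are not flagged.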


## Step 2 – Create Reproducibility Documentation Files

In [None]:
from pathlib import Path
ROOT = Path("/content/csc786-ethics-demo")

# 1 - README.md  (general project overview)
readme_text = """# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (update as necessary)
This project uses open-source NFL datasets (for example, player statistics, play-by-play data, and contextual game data).
It collects and preprocesses open NFL data, integrates contextual variables (weather, opponent rank, location),
builds baseline and enhanced prediction models, compares their accuracy, and documents every step for transparency.
All data collection parameters and metadata are stored in a version-controlled repository.

## Files
| File | Purpose |
|------|----------|
| `README.md` | Project overview and usage instructions |
| `ETHICS.md` | Ethical statement for transparency |
| `DATA_README.md` | Auto-logged metadata for every data collection event |


"""
(ROOT / "README.md").write_text(readme_text)


# 2 - ETHICS.md  → ethical statement / responsible data use
ethics_text = """## Ethical Statement

- Data sources are open and public.
- No personally identifiable information (PII) is collected.
- All API usage complies with provider Terms of Service and rate limits.
- API keys (if required) are stored securely using environment variables.
- Every dataset generated is logged with parameters, timestamps, and hashes in `DATA_README.md`.
- This workflow aligns with academic integrity and reproducibility standards at Dakota State University.

- Potential risks (bias, privacy, security)
  My project will use only publicly available, anonymized datasets to protect the privacy of players.
  My project will include all player positions so the dataset is balanced across roles, which supports model fairness.
  My project will clearly document data sources, processing steps, and model evaluation to ensure transparency.
- Mitigations (data handling, bias checks)
  To mitigate these potential risks, I will use only publicly available, anonymized datasets. I will run bias
  checks to confirm that there is representation across player positions and teams.
- Limitations (known constraints)
  The study relies heavily on publicly available data, which may not include certain contextual variables.
  Models trained on older data may capture outdated trends in team performance or play-calling strategies.

---

"""
(ROOT / "ETHICS.md").write_text(ethics_text)


# 3 - DATA_README.md  → provenance log (append-only)
data_readme_path = ROOT / "DATA_README.md"
if not data_readme_path.exists():
    data_readme_path.write_text("""# Data Provenance Log
Each entry below documents a data-collection event.
Auto-generated by the notebook.

Example entry format:
- {"timestamp_utc": "<time>", "endpoint": "nflfastR API", "params": {"season": year, "type": "regular"}, "output": "data/nfl_year_regular.csv", "sha256": "sha256", "notes": "Initial dataset collection for player performance"}
- {"timestamp_utc": "<time>", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": lat, "longitude": long, "hourly": ["temperature_2m", "precipitation", "relative_humidity_2m", "wind_speed_10m"], "wind_speed_unit": "mph", "temperature_unit": "fahrenheit", "precipitation_unit": "inch"}, "output": "data/weather.csv", "sha256": "sha256", "notes": "Get the weather for each game"}
---
""")

print("Created reproducibility files:")
!ls -lh /content/csc786-ethics-demo | grep -F .md

Created reproducibility files:
-rw-r--r-- 1 root root 3.3K Oct 24 18:09 DATA_README.md
-rw-r--r-- 1 root root 1.4K Oct 24 18:42 ETHICS.md
-rw-r--r-- 1 root root  871 Oct 24 18:42 README.md


## Step 3 – Managing Secrets (Key-based API Example)

In [None]:

# Example using OpenWeatherMap (requires free key)
# Register: https://home.openweathermap.org/users/sign_up

# Store the key in this Colab session only; never hard-code it in a notebook that gets committed
from getpass import getpass
os.environ["OPENWEATHER_API_KEY"] = getpass("OpenWeather API key: ")

API_KEY = os.getenv("OPENWEATHER_API_KEY")
print("Key loaded:", "****" if API_KEY else "No key found")


Key loaded: ****


### Example: Fetch Data Using OpenWeather API

In [None]:
url = "https://api.openweathermap.org/data/2.5/forecast"
params = {"q": "Arlington", "appid": API_KEY, "units": "imperial"}

r = requests.get(url, params=params, timeout=10)
r.raise_for_status()
data = r.json()
target_date = "2025-10-26"  # "YYYY-MM-DD"
weather = None  # stays None if the forecast has no entry for target_date
for entry in data["list"]:
    if entry["dt_txt"].startswith(target_date):
        # Later matches overwrite earlier ones, so this keeps the last forecast slot of the day
        weather = {
            "city": data["city"]["name"],
            "temperature": entry["main"]["temp"],
            "humidity": entry["main"]["humidity"],
            "condition": entry["weather"][0]["description"],
        }
weather


{'city': 'Arlington',
 'temperature': 79.63,
 'humidity': 45,
 'condition': 'clear sky'}
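
The loop above keeps a single dictionary, so only the last forecast slot for the target date survives. If every slot for that day is wanted (for example, to pick the one closest to kickoff), the matching entries can be collected into a list instead. This is a sketch under assumptions: the helper name `entries_for_date` is hypothetical, and the payload is a made-up sample that only mirrors the fields the cell above reads (`city.name`, `list[].dt_txt`, `main.temp`, `main.humidity`, `weather[0].description`).

```python
def entries_for_date(payload: dict, target_date: str) -> list[dict]:
    """Return one summary dict per forecast entry whose dt_txt starts with target_date (YYYY-MM-DD)."""
    city = payload["city"]["name"]
    return [
        {
            "city": city,
            "time": entry["dt_txt"],
            "temperature": entry["main"]["temp"],
            "humidity": entry["main"]["humidity"],
            "condition": entry["weather"][0]["description"],
        }
        for entry in payload["list"]
        if entry["dt_txt"].startswith(target_date)
    ]


# Made-up sample payload for illustration; real values come from r.json().
sample_payload = {
    "city": {"name": "Arlington"},
    "list": [
        {"dt_txt": "2025-10-25 21:00:00", "main": {"temp": 70.0, "humidity": 50},
         "weather": [{"description": "few clouds"}]},
        {"dt_txt": "2025-10-26 12:00:00", "main": {"temp": 75.0, "humidity": 48},
         "weather": [{"description": "clear sky"}]},
        {"dt_txt": "2025-10-26 15:00:00", "main": {"temp": 79.0, "humidity": 45},
         "weather": [{"description": "clear sky"}]},
    ],
}

rows = entries_for_date(sample_payload, "2025-10-26")
for row in rows:
    print(row["time"], row["temperature"], row["condition"])
```

An empty list then signals that the forecast window does not cover the requested date.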

## Step 4 – Public API Example (Open-Meteo)

Open-Meteo is a public API that needs no key. For your own project, you will work with your own key-based API.

In [None]:
date = "2025-10-19"
ENDPOINT = "https://api.open-meteo.com/v1/forecast"
PARAMS = {
    "latitude": 32.7473,
    "longitude": -97.0945,
    "hourly": ["temperature_2m", "precipitation", "relative_humidity_2m", "wind_speed_10m"],
    "wind_speed_unit": "mph",
    "temperature_unit": "fahrenheit",
    "precipitation_unit": "inch",
}

# Try up to 3 times, backing off 1s then 2s between attempts
for attempt in range(3):
    try:
        r = requests.get(ENDPOINT, params=PARAMS, timeout=10)
        r.raise_for_status()
        break
    except requests.exceptions.RequestException as e:
        if attempt == 2:
            raise  # all attempts failed; do not continue with a missing or bad response
        wait = 2 ** attempt
        print(f"Retrying in {wait}s due to: {e}")
        time.sleep(wait)

data = r.json()

df = pd.DataFrame({
    "time": data["hourly"]["time"],
    "temperature_2m": data["hourly"]["temperature_2m"],
    "Precipitation": data["hourly"]["precipitation"],
    "Humidity": data["hourly"]["relative_humidity_2m"],
    "Wind": data["hourly"]["wind_speed_10m"]
})
df.head()


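|   | time | temperature_2m | Precipitation | Humidity | Wind |
|---|------|----------------|---------------|----------|------|
| 0 | 2025-10-24T00:00 | 77.4 | 0.0 | 64 | 7.5 |
| 1 | 2025-10-24T01:00 | 74.8 | 0.0 | 69 | 8.6 |
| 2 | 2025-10-24T02:00 | 74.4 | 0.0 | 65 | 7.5 |
| 3 | 2025-10-24T03:00 | 75.7 | 0.0 | 51 | 8.1 |
| 4 | 2025-10-24T04:00 | 75.3 | 0.0 | 47 | 7.7 |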
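
The retry loop in this step can be factored into a reusable helper so that every API call in a project backs off the same way. The sketch below is one possible shape, not a library API: the name `retry_with_backoff` and its parameters are assumptions. The `sleep` argument is injectable so the demo runs instantly, and a deliberately flaky function stands in for a network call.

```python
import time


def retry_with_backoff(func, attempts=3, base_delay=1.0, sleep=time.sleep, retry_on=(Exception,)):
    """Call func() up to `attempts` times, waiting base_delay * 2**n between failures."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            wait = base_delay * 2 ** attempt
            print(f"Retrying in {wait}s due to: {exc}")
            sleep(wait)


# Demo: a stand-in for a network call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

With `requests`, the call could look like `retry_with_backoff(lambda: requests.get(ENDPOINT, params=PARAMS, timeout=10), retry_on=(requests.exceptions.RequestException,))`, followed by `raise_for_status()` on the result.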


## Step 5 – Save Data and Log Provenance

In [None]:
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
out_csv = DATA / f"hourly_temps_{timestamp}.csv"
df.to_csv(out_csv, index=False)

file_hash = hashlib.sha256(out_csv.read_bytes()).hexdigest()

meta = {
    "timestamp_utc": timestamp,
    "endpoint": ENDPOINT,
    "params": PARAMS,
    "output": out_csv.name,
    "sha256": file_hash,
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "requests": requests.__version__,
}

with open(ROOT / "DATA_README.md", "a") as f:
    f.write(f"\n- {json.dumps(meta)}")

print(f"Saved {out_csv.name}, hash={file_hash[:10]}…")
!tail -n 3 /content/csc786-ethics-demo/DATA_README.md


Saved hourly_temps_2025-10-24T210700Z.csv, hash=736b94bb13…
---

- {"timestamp_utc": "2025-10-24T210700Z", "endpoint": "https://api.open-meteo.com/v1/forecast", "params": {"latitude": 32.7473, "longitude": -97.0945, "hourly": ["temperature_2m", "precipitation", "relative_humidity_2m", "wind_speed_10m"], "wind_speed_unit": "mph", "temperature_unit": "fahrenheit", "precipitation_unit": "inch"}, "output": "hourly_temps_2025-10-24T210700Z.csv", "sha256": "736b94bb13f7c87a4beb4b52002b927a4b8a3e7a75b2645a7cae0494bc01f487", "python": "3.12.12", "pandas": "2.2.2", "requests": "2.32.4"}
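
Recording a hash is only useful if someone later checks it. As a sketch of that check, assuming hypothetical helpers `read_log_entries` and `verify_entry`, the log can be parsed line by line and each listed file re-hashed. The demo builds a throwaway log and CSV in a temporary directory, so it does not touch the real repo.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def read_log_entries(log_path: Path) -> list[dict]:
    """Parse every '- {json}' line of the provenance log; skip lines that are not valid JSON."""
    entries = []
    for line in log_path.read_text().splitlines():
        if not line.startswith("- {"):
            continue
        try:
            entries.append(json.loads(line[2:]))
        except json.JSONDecodeError:
            continue  # e.g. the placeholder lines under "Example entry format"
    return entries


def verify_entry(entry: dict, data_dir: Path) -> bool:
    """True if the file named in entry['output'] exists in data_dir and matches entry['sha256']."""
    path = data_dir / Path(entry["output"]).name
    if not path.is_file():
        return False
    return hashlib.sha256(path.read_bytes()).hexdigest() == entry["sha256"]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data_dir = root / "data"
    data_dir.mkdir()

    # Write a tiny CSV and log it the same way Step 5 does.
    csv_path = data_dir / "demo.csv"
    csv_path.write_text("time,temperature_2m\n2025-10-24T00:00,77.4\n")
    entry = {"output": csv_path.name, "sha256": hashlib.sha256(csv_path.read_bytes()).hexdigest()}
    log_path = root / "DATA_README.md"
    log_path.write_text('# Data Provenance Log\n- {"season": year}\n- ' + json.dumps(entry) + "\n")

    entries = read_log_entries(log_path)
    intact = verify_entry(entries[0], data_dir)    # file unchanged since logging
    csv_path.write_text("tampered\n")
    tampered = verify_entry(entries[0], data_dir)  # file changed after logging

print(len(entries), intact, tampered)
```

In the notebook, the equivalent call would be `[verify_entry(e, DATA) for e in read_log_entries(ROOT / "DATA_README.md")]`.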

## Step 6 – Verify Before Pushing

You can verify everything before pushing.

In [None]:
!ls -lh /content
!ls -lh /content/csc786-ethics-demo
!head -n 5 /content/csc786-ethics-demo/README.md
!tail -n 5 /content/csc786-ethics-demo/DATA_README.md

total 12K
drwxr-xr-x 7 root root 4.0K Oct 24 21:21 csc786-ethics-demo
drwxr-xr-x 8 root root 4.0K Oct 24 18:40 CSC786-Ethics-demo
drwxr-xr-x 1 root root 4.0K Oct 22 13:39 sample_data
total 24K
drwxr-xr-x 2 root root 4.0K Oct 24 21:21 csc786-ethics-demo
drwxr-xr-x 2 root root 4.0K Oct 24 21:21 CSC786-Ethics-demo
-rw-r--r-- 1 root root 1.3K Oct 24 21:21 DATA_README.md
-rw-r--r-- 1 root root 1.4K Oct 24 21:21 ETHICS.md
-rw-r--r-- 1 root root  871 Oct 24 21:21 README.md
drwxr-xr-x 2 root root 4.0K Oct 24 21:21 sample_data
# Reproducibility Demo – CSC 786

This repository demonstrates an ethical, reproducible data-collection workflow used in the CSC 786 course.

## Overview (udpate as necessary)
- {"timestamp_utc": "<time>", "endpoint": "nflfastR API", "params": {"season": year, "type": "regular"}, "output": "data/nfl_year_regular.csv", "sha256": "sha256",  "notes": "Initial dataset collection for player performance"}
- {"timestamp_utc": "<time>", "endpoint": "https://api.open-meteo.com/v1/

## Step 7 – Push to GitHub

In [None]:
!git remote set-url origin https://jennastengel:$GITHUB_TOKEN@github.com/jennastengel/csc786-ethics-demo.git

!git add .
!git commit -m "Update from Colab session"
!git push

[main 5807a8c] Update from Colab session
 1 file changed, 1 deletion(-)
 delete mode 160000 CSC786-Ethics-demo
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 2 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (2/2), 243 bytes | 243.00 KiB/s, done.
Total 2 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/jennastengel/csc786-ethics-demo.git
   d0c7300..5807a8c  main -> main


## Step 8 – Wrap-Up & Reflection


### In this demo we:
- Accessed both key-based and open APIs ethically.  
- Created transparency files: README.md, ETHICS.md, DATA_README.md.  
- Logged complete metadata (endpoint, params, hash, timestamp).  
- Pushed the entire reproducible workflow to GitHub.  

### Now think:
- How could you adapt this structure for your own project?  
- What extra metadata might your discipline require (license, consent, citation)?  