
# CSC 786 – Data Ethics & Reproducibility Workshop  

This notebook demonstrates a complete ethical, reproducible data-collection workflow:

- Ethical handling of APIs and environment variables  
- Data collection using both key-based and public APIs  
- Provenance logging and metadata documentation  
- Responsible data storage and reproducible version control  
- Pushing results to a GitHub repository  

All steps run directly in Google Colab.


## The Big Picture
Think of your Colab notebook as the entry point to your research repo.
The notebook does the work (collects data, logs metadata), while the repo (on GitHub) stores the evidence — code, data samples, metadata logs, and ethical documentation.

As a prerequisite, you need to create the GitHub repo first (empty). See the next cell for details.



## Create an empty GitHub repo (UI steps)
1. Sign in to GitHub.
2. Click the + (top-right) → New repository.
3. Repository name: e.g., csc786-ethics-demo.
4. Owner: your account.
5. Visibility: Public (recommended for this class) or Private.
6. Important: Do NOT check “Add a README”, “Add .gitignore”, or “Choose a license”. Leaving these unchecked keeps the repo truly empty, which makes the first push from Colab simplest.
7. Click Create repository.
8. On the next page, copy the HTTPS URL. You will it use it later in notebook.

# Create (or confirm) a GitHub Personal Access Token (PAT) for Colab pushes
You’ll push from Colab using HTTPS + a token (safer/simpler than SSH during class).
1. Go to https://github.com/settings → Developer settings → Personal access tokens. Choose “Fine-grained tokens” (preferred).
2. Generate new token
- Token name (e.g. colab-demo)
- Only select respositories -> choose course repository
- Permissions -> Add permissions -> Contents -> Access: Read and write  
3. Generate the token and copy it once (you won’t see it again).

Tip: keep this token handy just for the class; you can revoke it afterward.

# Setup Cell
Run once per session

In [13]:
%env GITHUB_TOKEN=
!git config --global user.name "Michael.Mendoza" ## Display name not necessarily your username
!git config --global user.email "michael.mendoza@trojans.dsu.edu"

env: GITHUB_TOKEN=


# One time only: Connect the empty repo from Colab (first push)

In [2]:
!git init
!git add .
!git commit -m "Initial reproducibility demo"
!git branch -M main

# Replace <username> and <PAT> and repo name.

!git remote add origin https://liIBits:$GITHUB_TOKEN@github.com/liIBits/csc786-ethics-demo.git

!git push -u origin main

[33mhint: Using 'master' as the name for the initial branch. This default branch name[m
[33mhint: is subject to change. To configure the initial branch name to use in all[m
[33mhint: [m
[33mhint: 	git config --global init.defaultBranch <name>[m
[33mhint: [m
[33mhint: Names commonly chosen instead of 'master' are 'main', 'trunk' and[m
[33mhint: 'development'. The just-created branch can be renamed via this command:[m
[33mhint: [m
[33mhint: 	git branch -m <name>[m
Initialized empty Git repository in /content/.git/
[master (root-commit) ee11c76] Initial reproducibility demo
 21 files changed, 51025 insertions(+)
 create mode 100644 .config/.last_opt_in_prompt.yaml
 create mode 100644 .config/.last_survey_prompt.yaml
 create mode 100644 .config/.last_update_check.json
 create mode 100644 .config/active_config
 create mode 100644 .config/config_sentinel
 create mode 100644 .config/configurations/config_default
 create mode 100644 .config/default_configs.db
 create mode 100

If everything is correct, you’ll see the push succeed and your files appear in the GitHub repo (refresh the repo page).

In [3]:
%%bash
# --- Create and push .gitignore for clean, ethical repo ---

cat > .gitignore << 'EOF'
.ipynb_checkpoints/
__pycache__/
data/*
.env
*.env
EOF

git add .gitignore
git commit -m "Add .gitignore for data, cache, and secrets"
git push


[main 604f6e9] Add .gitignore for data, cache, and secrets
 1 file changed, 5 insertions(+)
 create mode 100644 .gitignore


To https://github.com/liIBits/csc786-ethics-demo.git
   ee11c76..604f6e9  main -> main


# When you reopen Colab next time
You’ll simply clone your GitHub repo back into /content, instead of re-initializing a new one.

So, the reconnect workflow will look like this:

In [5]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/liIBits/csc786-ethics-demo # todo update url
%cd csc786-ethics-demo


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
!git remote set-url origin https://github.com/liIBits/csc786-ethics-demo # todo update url

!git add .
!git commit -m "Update from Colab session"
!git push


Cloning into 'csc786-ethics-demo'...
remote: Enumerating objects: 31, done.[K
remote: Counting objects: 100% (31/31), done.[K
remote: Compressing objects: 100% (19/19), done.[K
remote: Total 31 (delta 5), reused 31 (delta 5), pack-reused 0 (from 0)[K
Receiving objects: 100% (31/31), 8.42 MiB | 13.70 MiB/s, done.
Resolving deltas: 100% (5/5), done.
/content/csc786-ethics-demo
origin	https://github.com/liIBits/csc786-ethics-demo (fetch)
origin	https://github.com/liIBits/csc786-ethics-demo (push)
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
fatal: could not read Username for 'https://github.com': No such device or address


In [4]:
# You can always check what's currently configured by:

!git config --global --list

user.name=Michael.Mendoza
user.email=michael.mendoza@trojans.dsu.edu


## Colab-specific access details
Note: While we work in Colab, everything inside /content/ is a temporary mini-repo.
As you run the notebook:
1. It creates the folder /content/data/ for your CSVs.
2. It appends provenance info into /content/DATA_README.md.
3. You can add extra markdown files manually.

## Step 1 – Setup Environment

In [8]:
# --- Environment Setup (single source of truth) ---
!pip install python-dotenv --quiet

import os, sys, json, hashlib, pandas as pd
from datetime import datetime, timezone
from pathlib import Path, PurePosixPath

ROOT = Path("/content/csc786-ethics-demo")  # repo root that will be pushed
ROOT.mkdir(parents=True, exist_ok=True)

DATA = ROOT / "data"
DATA.mkdir(exist_ok=True)

# Reuse these everywhere
OUTPUT_CSV = DATA / "syscalls_summary.csv"

print("Environment ready")
print("ROOT:", ROOT)
print("DATA:", DATA)
print("OUTPUT_CSV:", OUTPUT_CSV)

Environment ready
ROOT: /content/csc786-ethics-demo
DATA: /content/csc786-ethics-demo/data
OUTPUT_CSV: /content/csc786-ethics-demo/data/syscalls_summary.csv



# Ethical Reminder

Before collecting any data:

- Check Terms of Service and rate limits.  
- Avoid collecting or storing personally identifiable information (PII).  
- Document every endpoint, parameter, and date of collection.  
- Keep secrets (API keys) out of public repositories.  


### Synthetic data cell

In [9]:
from datetime import datetime, timezone
import random

random.seed(786)
now_iso = datetime.now(timezone.utc).isoformat()

rows = [
    {"timestamp": now_iso, "syscall": "open",   "process_name": "test_program",  "alert_flag": 0},
    {"timestamp": now_iso, "syscall": "execve", "process_name": "malicious_sim", "alert_flag": 1},
    {"timestamp": now_iso, "syscall": "connect","process_name": "net_test",      "alert_flag": 0},
    {"timestamp": now_iso, "syscall": "write",  "process_name": "io_bench",      "alert_flag": 0},
    {"timestamp": now_iso, "syscall": "unlink", "process_name": "cleanup_tool",  "alert_flag": 0},
]
df = pd.DataFrame(rows)
df.to_csv(OUTPUT_CSV, index=False)
print("Saved:", OUTPUT_CSV)
df.head()


Saved: /content/csc786-ethics-demo/data/syscalls_summary.csv


Unnamed: 0,timestamp,syscall,process_name,alert_flag
0,2025-10-28T21:14:14.523959+00:00,open,test_program,0
1,2025-10-28T21:14:14.523959+00:00,execve,malicious_sim,1
2,2025-10-28T21:14:14.523959+00:00,connect,net_test,0
3,2025-10-28T21:14:14.523959+00:00,write,io_bench,0
4,2025-10-28T21:14:14.523959+00:00,unlink,cleanup_tool,0


## Step 2 – Create Reproducibility Documentation Files

In [10]:
# --- Reproducibility docs (uses ROOT, DATA, OUTPUT_CSV from setup) ---

# 1) README.md
(ROOT / "README.md").write_text("""# CSC786 Data Collection & Ethics Activity
### Evaluating Linux EDR Syscall Monitoring and Evasion Techniques
*Author: Michael Mendoza*
*Date: October 2025*

This notebook demonstrates a **reproducible and ethical** data collection workflow for analyzing Linux EDR telemetry.
Data are **synthetic** representations of syscall activity collected from a simulated Wazuh endpoint on RHEL.
All code and data are safe for public sharing and do not include real system logs or personally identifiable information (PII).

## Files
| File | Purpose |
|------|----------|
| `README.md` | Project overview and usage instructions |
| `ETHICS.md` | Ethical statement for transparency |
| `DATA_README.md` | Auto-logged metadata for each data collection event |
""")

# 2) ETHICS.md (unchanged content from your version)
(ROOT / "ETHICS.md").write_text("""## Ethical Statement
- No PII or sensitive data are collected or shared.
- All data are synthetic, generated in a lab/Colab environment.
- No live external systems are accessed.
- Every dataset generated is logged with parameters, timestamps, and hashes in `DATA_README.md`.
- This notebook documents what is simulated and what is omitted for ethical reasons.

### Risks
Bias (synthetic ≠ real), privacy risk if misused with real logs, and security misuse outside isolated labs.

### Mitigations
Use synthetic data only here; isolate real tests; document parameters; don’t store identifiers or real audit logs.

### Limitations
Synthetic data cannot capture full kernel complexity; this demonstrates reproducibility, not EDR performance.
""")

# 3) DATA_README.md (append-only log)
data_readme = ROOT / "DATA_README.md"
if not data_readme.exists():
    data_readme.write_text("# Data Provenance Log\nAuto-generated by the notebook.\n---\n")

if OUTPUT_CSV.exists():
    sha = hashlib.sha256(OUTPUT_CSV.read_bytes()).hexdigest()
    event = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "data_type": "synthetic_syscall_dataset",
        "params": {
            "records": 5,
            "fields": ["timestamp", "syscall", "process_name", "alert_flag"],
            "environment": f"Python {sys.version.split()[0]}, pandas {pd.__version__}"
        },
        # repo-relative path for portability
        "output": str(PurePosixPath("data/processed/syscalls_summary.csv")),
        "sha256": sha
    }
    with open(data_readme, "a") as f:
        f.write(f"- {json.dumps(event)}\n")
    print("✅ Docs updated:", data_readme)
else:
    print("⚠️ OUTPUT_CSV not found; run the synthetic data cell first.")

# quick sanity check
!ls -lh {ROOT}/*.md


✅ Docs updated: /content/csc786-ethics-demo/DATA_README.md
-rw-r--r-- 1 root root 422 Oct 28 21:14 /content/csc786-ethics-demo/DATA_README.md
-rw-r--r-- 1 root root 742 Oct 28 21:14 /content/csc786-ethics-demo/ETHICS.md
-rw-r--r-- 1 root root 759 Oct 28 21:14 /content/csc786-ethics-demo/README.md


Data Simulation

You can veryify everything before pushing.

In [11]:
!ls -lh /content
!ls -lh /content/data
!head -n 5 README.md
!tail -n 5 DATA_README.md

total 8.0K
drwxr-xr-x 6 root root 4.0K Oct 28 21:14 csc786-ethics-demo
drwxr-xr-x 1 root root 4.0K Oct 27 13:37 sample_data
ls: cannot access '/content/data': No such file or directory
# CSC786 Data Collection & Ethics Activity  
### Evaluating Linux EDR Syscall Monitoring and Evasion Techniques  
*Author: Michael Mendoza*  
*Date: October 2025*

# Data Provenance Log
Auto-generated by the notebook.
---
- {"timestamp_utc": "2025-10-28T21:14:18.168905+00:00", "data_type": "synthetic_syscall_dataset", "params": {"records": 5, "fields": ["timestamp", "syscall", "process_name", "alert_flag"], "environment": "Python 3.12.12, pandas 2.2.2"}, "output": "data/processed/syscalls_summary.csv", "sha256": "c0de36a6e65a0bed65f5fa36f0d232e49fefcb6aee3a5c0bd89a2c2d6fdd64da"}


## Step 7 – Push to GitHub

In [12]:
!git remote set-url origin https://liIBits:$GITHUB_TOKEN@github.com/liIBits/csc786-ethics-demo.git

!git add .
!git commit -m "Update from Colab session"
!git push

[main 123f986] Update from Colab session
 3 files changed, 34 insertions(+)
 create mode 100644 DATA_README.md
 create mode 100644 ETHICS.md
 create mode 100644 README.md
Enumerating objects: 6, done.
Counting objects: 100% (6/6), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 1.60 KiB | 1.60 MiB/s, done.
Total 5 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/liIBits/csc786-ethics-demo.git
   604f6e9..123f986  main -> main


### Reflection

During this activity, I adapted the provided reproducibility framework to fit my research on **Linux EDR syscall monitoring and evasion techniques**. Instead of collecting data from an external API, I generated **synthetic telemetry** to simulate auditd and Wazuh-style syscall activity. This ensured the workflow remained ethical and shareable without involving any real host or sensitive information.

One challenge was aligning my folder paths and `.gitignore` configuration with the original template. Initially, my data was stored outside the repository, which meant it was lost whenever the Colab session reset. I fixed this by restructuring the directories so both `data/raw/` and `data/processed/` live inside the repository while maintaining an ethical `.gitignore` to prevent sensitive or unnecessary files from being pushed. This change made the workflow fully reproducible for peers cloning the repo.

Overall, this process helped me understand how **ethical data handling**, **version control**, and **reproducibility documentation** (README, ETHICS, and DATA_README files) work together. My final setup now mirrors the professor’s reproducibility model — clean, traceable, and ready for peer verification.
