<a href="https://colab.research.google.com/github/michalrylko/decision-latency/blob/main/00_data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Decision Latency Analytics - Project Overview
## Apache Airflow Pull Requests Case Study

This project analyzes **decision-making latency** using real-world data from
GitHub Pull Requests (PRs).

A Pull Request is treated as a proxy for a **decision process**:
- a proposal is submitted
- reviewers evaluate the change
- a final decision is made (merge or close)

This notebook covers **Stage 0: Data Extraction**.
Subsequent stages will focus on:
- exploratory data analysis (EDA)
- feature engineering
- predictive modeling
- decision design recommendations


# 1. Environment Setup & GitHub API Access

This section:
- installs required libraries
- sets up the Python environment
- configures access to the GitHub API

Authentication is handled via an environment variable (`GITHUB_TOKEN`)
configured outside the notebook (e.g. Colab Secrets).

In [9]:
!pip -q install pandas requests

import os
import time
import requests
import pandas as pd

# The GitHub token is read from the environment, with a Colab Secrets fallback

def load_github_token():
    # 1) Standard: environment variable
    token = os.getenv("GITHUB_TOKEN")
    if token:
        return token

    # 2) Colab Secrets fallback
    try:
        from google.colab import userdata  # type: ignore
        return userdata.get("GITHUB_TOKEN")
    except Exception:
        return None

GITHUB_TOKEN = load_github_token()
print("GitHub token loaded:", bool(GITHUB_TOKEN))



GitHub token loaded: True


If the token is correctly configured, authenticated requests get a much higher
rate limit: about 5,000 requests per hour, versus 60 per hour for unauthenticated requests.
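
The remaining quota can be checked before starting a long extraction run. The sketch below assumes the JSON payload returned by GitHub's `GET /rate_limit` endpoint, which reports the core REST quota under `resources.core`. The helper name `summarize_rate_limit` and the sample payload values are illustrative and not part of this notebook.

```python
from datetime import datetime, timezone


def summarize_rate_limit(payload):
    """Summarize the core REST quota from a GET /rate_limit JSON payload.

    `payload` is assumed to look like:
    {"resources": {"core": {"limit": ..., "remaining": ..., "reset": <epoch seconds>}}}
    """
    core = payload["resources"]["core"]
    reset_at = datetime.fromtimestamp(core["reset"], tz=timezone.utc)
    return {
        "limit": core["limit"],
        "remaining": core["remaining"],
        # Heuristic (assumption): a limit above 60/hour suggests the token was accepted
        "looks_authenticated": core["limit"] > 60,
        "resets_at_utc": reset_at.isoformat(),
    }


# Illustrative payload; the numbers are made up for the example
sample = {"resources": {"core": {"limit": 5000, "remaining": 4990, "reset": 1700000000}}}
summary = summarize_rate_limit(sample)
print(summary)
```

In the notebook, the payload could be obtained with `fetch_json("https://api.github.com/rate_limit")` once the session and the helper from the later sections are defined.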

# 2. Repository Configuration

We extract data from the public GitHub repository:

- Repository: `apache/airflow`
- Data source: GitHub Pull Requests API
- Scope: closed Pull Requests only

In [10]:
REPO_OWNER = "apache"
REPO_NAME = "airflow"

PER_PAGE = 100
MAX_PAGES = 100      # up to 10,000 PRs (100 pages x 100 per page)
REQUEST_DELAY = 0.2  # seconds between API calls

headers = {"Accept": "application/vnd.github+json"}
if GITHUB_TOKEN:
    headers["Authorization"] = f"Bearer {GITHUB_TOKEN}"

session = requests.Session()
session.headers.update(headers)


# 3. GitHub API Helper

A helper function is used to:
- handle temporary API failures
- retry requests affected by rate limits
- keep the extraction pipeline robust


In [11]:
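def fetch_json(url, params=None, retries=3, timeout=30):
    """GET a GitHub API URL and return the parsed JSON, or None on failure."""
    for attempt in range(retries):
        try:
            response = session.get(url, params=params, timeout=timeout)
        except requests.RequestException as exc:
            # Network-level failure (timeout, connection reset): back off and retry
            print(f"Request failed ({exc}); retrying")
            time.sleep(2 ** attempt)
            continue

        if response.status_code == 200:
            return response.json()

        # Rate limiting (403, 429) and transient server errors (502-504):
        # back off exponentially (1 s, 2 s, 4 s, ...) and retry
        if response.status_code in (403, 429, 502, 503, 504):
            time.sleep(2 ** attempt)
            continue

        # Any other status is treated as a permanent failure
        print(f"API error {response.status_code}: {response.text[:200]}")
        return None

    # All retries exhausted
    print(f"Giving up on {url} after {retries} attempts")
    return None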
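
An exponential backoff of a few seconds may be shorter than the time GitHub asks a client to wait. A minimal sketch of a header-aware wait, assuming the response carries GitHub's `Retry-After`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers. The helper `compute_wait_seconds` and its `max_wait` cap are hypothetical and not part of the notebook.

```python
import time


def compute_wait_seconds(headers, attempt, now=None, max_wait=3600):
    """Pick a wait time (in seconds) before retrying a rate-limited request.

    `headers` is a mapping of response headers. A plain dict is case-sensitive,
    whereas `requests` exposes a case-insensitive mapping.
    """
    now = time.time() if now is None else now

    # 1) The server states the wait explicitly
    retry_after = headers.get("Retry-After")
    if retry_after is not None and str(retry_after).isdigit():
        return min(int(retry_after), max_wait)

    # 2) Quota exhausted: wait until the rate-limit window resets (plus 1 s margin)
    if headers.get("X-RateLimit-Remaining") == "0":
        reset_epoch = headers.get("X-RateLimit-Reset")
        if reset_epoch is not None and str(reset_epoch).isdigit():
            return min(max(int(reset_epoch) - now, 0) + 1, max_wait)

    # 3) Otherwise fall back to exponential backoff: 1 s, 2 s, 4 s, ...
    return min(2 ** attempt, max_wait)


print(compute_wait_seconds({"Retry-After": "30"}, attempt=0))
print(compute_wait_seconds(
    {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}, attempt=0, now=1000
))
print(compute_wait_seconds({}, attempt=2))
```

Inside `fetch_json`, the result could replace the fixed `2 ** attempt` wait by passing `response.headers` and the current attempt number.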


# 4. Pull Request Collection

In this step we:
1. Fetch a lightweight list of closed Pull Requests
2. Fetch detailed metadata for each Pull Request

The detailed metadata includes:
- number of comments and review comments
- number of commits
- files changed
- lines added and deleted


In [None]:
pull_requests_light = []

# Step 1: lightweight PR list
for page in range(1, MAX_PAGES + 1):
    url = f"https://api.github.com/repos/{REPO_OWNER}/{REPO_NAME}/pulls"
    params = {
        "state": "closed",
        "per_page": PER_PAGE,
        "page": page
    }

    data = fetch_json(url, params=params)
    if not data:
        break

    for pr in data:
        pull_requests_light.append({
            "pr_number": pr.get("number"),
            "pr_id": pr.get("id"),
            "repository": f"{REPO_OWNER}/{REPO_NAME}",
            "created_at": pr.get("created_at"),
            "closed_at": pr.get("closed_at"),
            "merged_at": pr.get("merged_at"),
            "author": (pr.get("user") or {}).get("login"),
            "labels": [label.get("name") for label in pr.get("labels", [])]
        })

    time.sleep(REQUEST_DELAY)

# Step 2: detailed metadata
pull_requests_full = []

for pr in pull_requests_light:
    pr_number = pr["pr_number"]
    if pr_number is None:
        continue

    detail_url = (
        f"https://api.github.com/repos/"
        f"{REPO_OWNER}/{REPO_NAME}/pulls/{pr_number}"
    )

    details = fetch_json(detail_url)
    if details is None:
        continue

    pull_requests_full.append({
        **pr,
        "merged": details.get("merged_at") is not None,
        "comments_count": details.get("comments"),
        "review_comments_count": details.get("review_comments"),
        "commit_count": details.get("commits"),
        "changed_files_count": details.get("changed_files"),
        "additions": details.get("additions"),
        "deletions": details.get("deletions")
    })

    time.sleep(REQUEST_DELAY)
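
The list loop stops at a fixed `MAX_PAGES`. GitHub also advertises the next page in the `Link` response header, which can be followed until no `rel="next"` entry remains. The sketch below parses such a header with the standard library. `next_page_url` is a hypothetical helper, and the sample header is illustrative rather than captured from a real response.

```python
import re

# Matches entries of the form: <URL>; rel="name"
_LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if absent."""
    if not link_header:
        return None
    for url, rel in _LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None


# Illustrative header value (not a recorded API response)
sample_header = (
    '<https://api.github.com/repos/apache/airflow/pulls?state=closed&per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repos/apache/airflow/pulls?state=closed&per_page=100&page=50>; rel="last"'
)
print(next_page_url(sample_header))
print(next_page_url('<https://api.github.com/repos/apache/airflow/pulls?page=1>; rel="first"'))
```

With `requests`, the same information is available as `response.links.get("next", {}).get("url")`, though `fetch_json` would have to return the response object (or its headers) for that to be usable here.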


# 5. Dataset Construction and Decision Latency Metric

Decision latency is defined as the number of whole days between:
- Pull Request creation (`created_at`)
- Pull Request closure, by merge or close (`closed_at`)

In [None]:

df = pd.DataFrame(pull_requests_full)

df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
df["closed_at"] = pd.to_datetime(df["closed_at"], errors="coerce")

df["decision_latency_days"] = (
    df["closed_at"] - df["created_at"]
).dt.days

df.head()
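
Because `.dt.days` keeps only the whole-day part of each duration, a PR closed after 23 hours gets a latency of 0. The sketch below works through that arithmetic with the standard library, using made-up timestamps in the ISO 8601 format the GitHub API returns. `parse_github_timestamp` and `latency_days` are illustrative helpers, not part of the notebook.

```python
from datetime import datetime


def parse_github_timestamp(value):
    """Parse a UTC timestamp such as '2024-01-01T10:00:00Z'."""
    # Swap the trailing 'Z' for an explicit offset so that
    # datetime.fromisoformat also accepts it on older Python versions
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latency_days(created_at, closed_at):
    """Return (whole_days, fractional_days) between creation and closure."""
    delta = parse_github_timestamp(closed_at) - parse_github_timestamp(created_at)
    return delta.days, delta.total_seconds() / 86400


# 23 hours elapse between these two made-up timestamps
whole, fractional = latency_days("2024-01-01T10:00:00Z", "2024-01-02T09:00:00Z")
print(whole, round(fractional, 3))
```

If sub-day resolution matters in later stages, a fractional column could be derived in pandas with `(df["closed_at"] - df["created_at"]).dt.total_seconds() / 86400`.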


# 6. Data Export & Validation

The dataset is saved locally to avoid repeated API calls
in subsequent analysis stages.

In [None]:
output_file = "apache_airflow_pull_requests_raw.csv"
df.to_csv(output_file, index=False)

print("Dataset saved:", output_file)
df["decision_latency_days"].describe()
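
One caveat with the CSV export: the `labels` column holds Python lists, which a CSV file stores as plain strings. A possible approach, sketched below with the standard library, is to encode list-valued fields as JSON before writing and decode them after reading. The sample row and label names are made up for the example.

```python
import csv
import io
import json

# Made-up row with a list-valued field, mirroring the `labels` column
rows = [{"pr_number": 1, "labels": ["label-a", "label-b"]}]

# Write: encode the list as a JSON string so it survives the CSV round trip
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pr_number", "labels"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "labels": json.dumps(row["labels"])})

# Read: decode the JSON string back into a list
buffer.seek(0)
restored = [
    {"pr_number": int(r["pr_number"]), "labels": json.loads(r["labels"])}
    for r in csv.DictReader(buffer)
]
print(restored)
```

With pandas, the same idea would be `df["labels"].apply(json.dumps)` before `to_csv` and `.apply(json.loads)` on the column after `read_csv`.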


# 7. Summary & Next Steps

This notebook completed **Stage 0: Data Extraction**.

Next stages:
- Exploratory Data Analysis (EDA)
- Feature engineering
- Predictive modeling
- Decision design recommendations
