<a href="https://colab.research.google.com/github/michalrylko/decision-latency/blob/main/01_eda_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploratory Data Analysis (EDA)

This notebook explores decision latency patterns in GitHub Pull Requests
based on data extracted in `00_data_extraction.ipynb`.

The goal is to understand:
- how decision latency is distributed,
- how it differs between merged and non-merged PRs,
- which observable factors are associated with delays.


## 1. Load Dataset and Basic Validation


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

DATA_PATH = "../data/apache_airflow_pull_requests_raw.csv"
df = pd.read_csv(DATA_PATH)

df.shape


In [None]:
df.head()


In [None]:
df.info()


The dataset contains metadata for Pull Requests, including timestamps,
review activity, and final decision outcomes.
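
The cells above inspect the data through `shape`, `head()`, and `info()` only. A minimal sketch of a few explicit validation checks follows. It runs on a small made-up frame so that it is self-contained. The helper name `validate_pr_frame` and the choice of checks are assumptions; only the column names `created_at`, `merged`, and `decision_latency_days` come from this notebook.

```python
import pandas as pd

REQUIRED_COLUMNS = ["created_at", "merged", "decision_latency_days"]


def validate_pr_frame(frame: pd.DataFrame) -> dict:
    """Return simple data-quality counts for a PR table (hypothetical helper)."""
    missing_cols = [c for c in REQUIRED_COLUMNS if c not in frame.columns]
    if missing_cols:
        raise KeyError(f"missing required columns: {missing_cols}")
    latency = frame["decision_latency_days"]
    return {
        "rows": len(frame),
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_latency": int(latency.isna().sum()),
        # A negative latency would mean a decision before the PR was created.
        "negative_latency": int((latency < 0).sum()),
    }


# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "created_at": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "merged": [True, False, False, True, False],
    "decision_latency_days": [0.5, 3.0, 3.0, -1.0, None],
})
report = validate_pr_frame(toy)
print(report)
```

On the real data the same helper could be called as `validate_pr_frame(df)`. Non-zero counts would be worth investigating before interpreting the plots that follow.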


## 2. Decision Latency Distribution


In [None]:
df["decision_latency_days"].describe()


In [None]:
plt.figure(figsize=(8, 4))
sns.histplot(df["decision_latency_days"], bins=30, kde=True)
plt.title("Distribution of Decision Latency (days)")
plt.xlabel("Decision latency (days)")
plt.ylabel("Count")
plt.show()


Decision latency is highly right-skewed, with a large proportion of
very fast decisions and a long tail of delayed cases.
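
One way to put numbers on the skew is to compare the mean with the median, look at upper quantiles, and check how a `log1p` transform changes the sample skewness. The sketch below uses synthetic log-normal data as a stand-in for `decision_latency_days`. The distribution parameters and the seed are arbitrary assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df["decision_latency_days"]; parameters are arbitrary.
rng = np.random.default_rng(seed=0)
latency = pd.Series(
    rng.lognormal(mean=0.0, sigma=1.5, size=5_000),
    name="decision_latency_days",
)

summary = {
    "mean": latency.mean(),
    "median": latency.median(),
    "p90": latency.quantile(0.90),
    "p99": latency.quantile(0.99),
    "skew_raw": latency.skew(),
    # log1p keeps zero-latency rows finite, unlike a plain log.
    "skew_log1p": np.log1p(latency).skew(),
}
for name, value in summary.items():
    print(f"{name:>10}: {value:.2f}")
```

For a right-skewed variable the mean sits above the median and the skewness drops after the transform, which is one reason a log-scaled axis can make the histogram easier to read.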


## 3. Decision Outcome vs Latency


In [None]:
df.groupby("merged")["decision_latency_days"].describe()


In [None]:
plt.figure(figsize=(6, 4))
sns.boxplot(
    data=df,
    x="merged",
    y="decision_latency_days"
)
plt.title("Decision Latency by Outcome")
plt.xlabel("Merged")
plt.ylabel("Decision latency (days)")
plt.show()


Non-merged Pull Requests tend to stay open longer than merged ones,
which suggests that the two outcomes follow different decision dynamics.
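
Because latency is heavily skewed, a difference between the two groups is easier to judge on medians than on means. The sketch below is one possible check, not the method used in this notebook: a permutation test on the difference in medians, run on synthetic data. The helper `median_diff_permutation_test`, the number of permutations, and the synthetic distributions are all assumptions.

```python
import numpy as np
import pandas as pd


def median_diff_permutation_test(frame, group_col, value_col, n_perm=2_000, seed=0):
    """Two-sided permutation test for a difference in group medians.

    Hypothetical helper: expects a boolean `group_col` and a numeric `value_col`.
    Returns (median of False group - median of True group, estimated p-value).
    """
    data = frame[[group_col, value_col]].dropna()
    values = data[value_col].to_numpy(dtype=float)
    is_true = data[group_col].to_numpy(dtype=bool)

    observed = np.median(values[~is_true]) - np.median(values[is_true])

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any link between group and latency.
        shuffled = rng.permutation(is_true)
        diff = np.median(values[~shuffled]) - np.median(values[shuffled])
        hits += int(abs(diff) >= abs(observed))
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Synthetic data: non-merged PRs are drawn with a longer typical latency.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "merged": [True] * 300 + [False] * 200,
    "decision_latency_days": np.concatenate([
        rng.lognormal(0.0, 1.0, 300),  # merged
        rng.lognormal(1.0, 1.0, 200),  # not merged
    ]),
})
observed, p_value = median_diff_permutation_test(toy, "merged", "decision_latency_days")
print(f"median(non-merged) - median(merged) = {observed:.2f} days, p ~ {p_value:.4f}")
```

The same call with `df` in place of `toy` would apply the check to the real data, assuming `merged` is stored as a boolean column.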


## 4. Review Activity and Decision Latency


In [None]:
cols = ["comments", "review_comments", "commits"]
df[cols].describe()


In [None]:
sns.pairplot(
    df[cols + ["decision_latency_days"]],
    diag_kind="kde"
)
plt.show()


Higher review activity appears to correlate with longer decision latency,
potentially reflecting increased complexity or coordination cost.
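
A pair plot shows the shape of each relationship but not its strength. Since the variables are skewed counts, a rank-based (Spearman) correlation is a reasonable way to quantify what the plot suggests. The sketch below uses synthetic data in which latency is constructed to grow with comment counts. The generating formula and seed are assumptions made only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic activity counts; the Poisson rates are arbitrary.
comments = rng.poisson(3, n)
review_comments = rng.poisson(5, n)
commits = 1 + rng.poisson(2, n)

# Synthetic latency that grows with discussion, multiplied by heavy-tailed noise.
latency = (0.5 + 0.3 * comments + 0.2 * review_comments) * rng.lognormal(0.0, 1.0, n)

toy = pd.DataFrame({
    "comments": comments,
    "review_comments": review_comments,
    "commits": commits,
    "decision_latency_days": latency,
})

cols = ["comments", "review_comments", "commits"]
spearman = (
    toy[cols + ["decision_latency_days"]]
    .corr(method="spearman")["decision_latency_days"]
    .drop("decision_latency_days")
)
print(spearman.sort_values(ascending=False))
```

Spearman correlation depends only on ranks, so a handful of extremely slow PRs cannot dominate the coefficient the way they can with Pearson correlation.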


## 5. Temporal Patterns


In [None]:
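df["created_at"] = pd.to_datetime(df["created_at"])
df["created_weekday"] = df["created_at"].dt.day_name()

# Reindex so the medians appear in calendar order, not alphabetical order.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
df.groupby("created_weekday")["decision_latency_days"].median().reindex(weekday_order)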


In [None]:
plt.figure(figsize=(8, 4))
sns.barplot(
    data=df,
    x="created_weekday",
    y="decision_latency_days",
    estimator="median",  # string estimators need seaborn 0.12 or later
    order=[
        "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday", "Sunday"
    ]
)
plt.title("Median Decision Latency by PR Creation Day")
plt.xlabel("PR creation day")
plt.ylabel("Median decision latency (days)")
plt.xticks(rotation=45)
plt.show()
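
A per-weekday median is easier to interpret next to the number of PRs behind it and a measure of spread, since days with few PRs can produce unstable medians. The sketch below builds such a table on synthetic data. The timestamps and latencies are random placeholders, and the column names `n`, `q25`, and `q75` are assumptions.

```python
import numpy as np
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday", "Sunday"]

# Synthetic PRs, one created every 7 hours; latencies are random placeholders.
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "created_at": pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(700) * 7, unit="h"),
    "decision_latency_days": rng.lognormal(0.0, 1.0, 700),
})

# An ordered categorical keeps the weekdays in calendar order after grouping.
toy["created_weekday"] = pd.Categorical(
    toy["created_at"].dt.day_name(), categories=WEEKDAYS, ordered=True
)

by_day = (
    toy.groupby("created_weekday", observed=False)["decision_latency_days"]
    .agg(
        n="count",
        median="median",
        q25=lambda s: s.quantile(0.25),
        q75=lambda s: s.quantile(0.75),
    )
)
print(by_day.round(2))
```

If the interquartile ranges of two weekdays overlap heavily, a gap between their medians is weak evidence of a real weekday effect.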


## 6. Key Insights and Hypotheses

Based on exploratory analysis, we observe that:

1. Decision latency is heavily right-skewed: most decisions happen quickly, with a long tail of delayed cases.
2. Non-merged PRs tend to remain open longer than merged ones.
3. Increased review activity is associated with longer decision times.
4. Median latency varies with the weekday on which a PR is created, suggesting operational rhythms in decision-making.

These findings motivate feature engineering focused on:
- review intensity,
- collaboration complexity,
- early signals of stalled decisions.
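
As a starting point for the feature engineering listed above, the sketch below derives a few candidate columns from the fields already explored in this notebook. Every new name (`add_candidate_features`, `discussion_volume`, `review_comments_per_commit`, `is_slow_decision`) and the 90th-percentile cut-off are hypothetical choices for illustration, not definitions used elsewhere in this project.

```python
import pandas as pd


def add_candidate_features(frame: pd.DataFrame, slow_quantile: float = 0.9) -> pd.DataFrame:
    """Return a copy of `frame` with illustrative derived columns (hypothetical)."""
    out = frame.copy()

    # Review intensity: total discussion attached to the PR.
    out["discussion_volume"] = out["comments"] + out["review_comments"]

    # Collaboration complexity proxy: review comments per commit.
    # clip(lower=1) avoids division by zero for PRs recorded with 0 commits.
    out["review_comments_per_commit"] = (
        out["review_comments"] / out["commits"].clip(lower=1)
    )

    # Candidate label for slow decisions: latency above a chosen quantile.
    threshold = out["decision_latency_days"].quantile(slow_quantile)
    out["is_slow_decision"] = out["decision_latency_days"] > threshold
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "comments": [0, 2, 5, 1, 9],
    "review_comments": [0, 4, 10, 0, 30],
    "commits": [1, 2, 5, 0, 10],
    "decision_latency_days": [0.2, 1.0, 6.0, 0.1, 40.0],
})
features = add_candidate_features(toy)
print(features)
```

Note that `is_slow_decision` is derived from the final latency, so it can serve as a target label but not as an early signal. Early signals would need fields observed shortly after a PR is opened, which this dataset may or may not contain.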
