# Rust Unstable Feature Analysis (v3)

This notebook summarizes the sampling pipeline and visualizes the CSV outputs generated by `analyze_features.py` for `features_head_v3.db`.

What you need:
- Python 3.9+
- `pandas`, `matplotlib`, `seaborn` (install with `pip install pandas matplotlib seaborn` if missing)
- CSVs in `analysis_outputs/`: `feature_head_summary.csv`, `feature_history_summary.csv`, `feature_lifetimes.csv`, `category_summary.csv`
- Optional: raw tables in `features_head_v3.db` for deeper joins.

Sampling background (v3):
- Two-layer sampling from crates.io dump:
  - **Core stratum**: top N by reverse dependencies (N=300) + optional core list.
  - **Non-core stratum**: filter by downloads >=100 and latest year >=2015, then stratify by (top_category, popularity_band) using p50/p90 downloads.
- Only GitHub repos are kept; duplicate owner/repo collapsed (pick higher revdeps/downloads).
- Target size: 1500 repos; final unique GitHub repos: 1500.
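
The non-core stratification described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the pipeline's actual code: the `downloads` column name, the `low`/`mid`/`high` band labels, and the choice to compute p50/p90 over the whole filtered pool are assumptions.

```python
import pandas as pd

# Toy stand-in for the filtered non-core pool (hypothetical column names and values).
pool = pd.DataFrame({
    "crate": ["a", "b", "c", "d", "e", "f"],
    "top_category": ["web", "web", "web", "cli", "cli", "cli"],
    "downloads": [120, 500, 9000, 150, 2000, 80000],
})

# Band edges: the 50th and 90th percentiles of downloads.
p50, p90 = pool["downloads"].quantile([0.5, 0.9])

def popularity_band(downloads):
    """Assign low / mid / high by comparing against p50 and p90."""
    if downloads >= p90:
        return "high"
    if downloads >= p50:
        return "mid"
    return "low"

pool["popularity_band"] = pool["downloads"].map(popularity_band)

# Each (top_category, popularity_band) pair is one stratum to sample from.
strata_sizes = pool.groupby(["top_category", "popularity_band"]).size()
print(strata_sizes)
```

How many repos to draw from each stratum is a separate decision that this sketch does not cover.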

This notebook first re-summarizes sampling outputs (ratios by core/non-core, categories), then visualizes nightly feature usage.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sqlite3
from datetime import datetime

# Optional: load sampled CSV to recompute sampling stats
SAMPLE_CSV = Path("sampled_crates_v3.csv")
DB_PATH = Path("features_head_v3.db")

sns.set_theme(style="whitegrid")
BASE = Path("analysis_outputs")

head_df = pd.read_csv(BASE / "feature_head_summary.csv")
hist_df = pd.read_csv(BASE / "feature_history_summary.csv")
life_df = pd.read_csv(BASE / "feature_lifetimes.csv")
cat_df = pd.read_csv(BASE / "category_summary.csv")

sample_df = None
if SAMPLE_CSV.exists():
    sample_df = pd.read_csv(SAMPLE_CSV)

conn = None
if DB_PATH.exists():
    conn = sqlite3.connect(DB_PATH)


head_df.head(), hist_df.head(), life_df.head(), cat_df.head()

## 1) Sampling recap
- Core vs non-core composition
- Top categories by sample size
- HEAD vs EVER nightly ratios for core/non-core
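
For the core/non-core ratios, one possible computation is sketched below. It assumes a per-repo table with 0/1 flags; the names `uses_nightly_head` and `uses_nightly_ever` are hypothetical and may not match the actual CSV or DB columns.

```python
import pandas as pd

# Toy per-repo table; the two flag columns are hypothetical names.
repos = pd.DataFrame({
    "is_core": [1, 1, 0, 0, 0, 0],
    "uses_nightly_head": [1, 0, 0, 0, 1, 0],
    "uses_nightly_ever": [1, 1, 0, 1, 1, 0],
})

# The mean of a 0/1 flag is the share of repos with that flag set.
ratios = (
    repos.groupby("is_core")[["uses_nightly_head", "uses_nightly_ever"]]
    .mean()
    .rename(columns={
        "uses_nightly_head": "head_nightly_ratio",
        "uses_nightly_ever": "ever_nightly_ratio",
    })
)
print(ratios)
```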

In [None]:
if sample_df is not None:
    # Basic counts
    total = len(sample_df)
    core = sample_df[sample_df["is_core"] == 1]
    noncore = sample_df[sample_df["is_core"] == 0]
    print(f"Sample size: {total} (core={len(core)}, non-core={len(noncore)})")

    # Category sizes
    top_cats = sample_df["top_category"].value_counts().head(15).reset_index()
    top_cats.columns = ["top_category", "count"]

    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    sns.barplot(data=top_cats, x="count", y="top_category", ax=axes[0], palette="viridis")
    axes[0].set_title("Top categories in sample")

    # HEAD nightly ratio per category, taken from the pre-aggregated cat_df (coarse: no core/non-core split here)
    cat_df_sorted = cat_df.sort_values("total_repos", ascending=False).head(15)
    sns.barplot(data=cat_df_sorted, x="head_nightly_ratio", y="category", ax=axes[1], palette="rocket")
    axes[1].set_title("HEAD nightly ratio by category (top 15)")
    axes[1].set_xlim(0, 1)

    plt.tight_layout()
    plt.show()
else:
    print("sampled_crates_v3.csv not found; skip sampling recap.")

## 2) Feature usage overview
- Distribution of HEAD vs EVER usage counts per feature
- Identify dominant gates

In [None]:
life_df["avg_lifetime_days"] = pd.to_numeric(life_df["avg_lifetime_days"], errors="coerce")
life_df["median_lifetime_days"] = pd.to_numeric(life_df["median_lifetime_days"], errors="coerce")
life_df["num_repos"] = pd.to_numeric(life_df["num_repos"], errors="coerce")

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
sns.histplot(life_df["median_lifetime_days"].dropna(), bins=40, ax=axes[0], color="steelblue")
axes[0].set_title("Distribution of feature median lifetimes (days)")

top_long = life_df.dropna(subset=["avg_lifetime_days"]).nlargest(15, "avg_lifetime_days")
sns.barplot(data=top_long, x="avg_lifetime_days", y="feature_name", ax=axes[1], palette="magma")
axes[1].set_title("Longest average lifetimes")
plt.tight_layout()
plt.show()

## 4) Retention vs adoption
- Still-present ratio = num_still_present / num_repos
- Compare with coverage to spot "sticky" gates
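
As a small worked example of the still-present ratio, the sketch below uses toy numbers with the same column names as `feature_lifetimes.csv`. The gate names are made up, and mapping a zero denominator to NaN is an assumption about how such rows should be treated.

```python
import numpy as np
import pandas as pd

# Toy lifetimes table (made-up gates and counts).
toy = pd.DataFrame({
    "feature_name": ["gate_a", "gate_b", "gate_c"],
    "num_repos": [10, 4, 0],
    "num_still_present": [9, 1, 0],
})

# Replace a zero denominator with NaN so the ratio is missing rather than inf.
toy["still_ratio"] = toy["num_still_present"] / toy["num_repos"].replace(0, np.nan)
print(toy)
```

A gate with a ratio near 1 across many repos is one that adopters rarely drop.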


In [None]:
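# Coerce to numeric first; a zero repo count yields NaN instead of inf
life_df["num_still_present"] = pd.to_numeric(life_df["num_still_present"], errors="coerce")
life_df["still_ratio"] = life_df["num_still_present"] / life_df["num_repos"].replace(0, float("nan"))
top_ratio = life_df[life_df["num_repos"] >= 5].nlargest(20, "still_ratio")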

plt.figure(figsize=(10, 8))
sns.barplot(data=top_ratio, x="still_ratio", y="feature_name", palette="crest")
plt.title("Top 20 features by still-present ratio (>=5 repos)")
plt.xlim(0, 1.05)
plt.show()

## 5) Category perspective
- Compare per-category nightly reliance on HEAD vs EVER


In [None]:
cat_df["head_nightly_ratio"] = pd.to_numeric(cat_df["head_nightly_ratio"], errors="coerce")
cat_df["ever_nightly_ratio"] = pd.to_numeric(cat_df["ever_nightly_ratio"], errors="coerce")
cat_top = cat_df.sort_values("total_repos", ascending=False).head(20)

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
sns.barplot(data=cat_top, x="head_nightly_ratio", y="category", ax=axes[0], palette="rocket")
axes[0].set_title("HEAD nightly ratio by category (top 20 categories)")
sns.barplot(data=cat_top, x="ever_nightly_ratio", y="category", ax=axes[1], palette="mako")
axes[1].set_title("EVER nightly ratio by category (top 20 categories)")
plt.tight_layout()
plt.show()

## 6) Next steps (fill in as you iterate)
- Join with a feature→stable_date table to plot adoption curves vs stabilization.
- Compute time-to-stable intervals: (first substantial adoption) → (stable date).
- Segment by core/non-core, popularity bands, downloads/revdeps quantiles.
- Highlight currently unstable gates with high still-present ratios and long lifetimes.
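
For the quantile segmentation, `pd.qcut` is one option. The sketch below bins a toy `downloads_sum` column into quartiles; the repo keys, the values, and the `q1`..`q4` labels are illustrative only.

```python
import pandas as pd

# Toy repos table (made-up keys and download totals).
repos = pd.DataFrame({
    "key": ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"],
    "downloads_sum": [10, 50, 200, 900, 4000, 15000, 60000, 250000],
})

# Four equal-sized bins by downloads.
repos["downloads_quartile"] = pd.qcut(
    repos["downloads_sum"], q=4, labels=["q1", "q2", "q3", "q4"]
)
print(repos["downloads_quartile"].value_counts().sort_index())
```

With heavily tied values `pd.qcut` can fail on duplicate bin edges, so real data may need `duplicates="drop"` or rank-based binning.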

## 7) Stable timeline for selected gates
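Hard-code known stabilization dates (approximate; check them against the Rust release notes and the Unstable Book). Each value is `version|date` with the date in ISO 8601 format; `None` means the gate is still unstable.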

In [None]:
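# Known stabilization dates (rough; keep in sync with Rust release notes).
# Format: "version|date"; None means the gate is still unstable.
stable_map = {
    "doc_cfg": "1.54.0|2021-07-29",       # TODO: verify against release notes
    "doc_auto_cfg": "1.75.0|2024-01-18",  # TODO: verify against release notes
    "never_type": None,      # still unstable (the stabilization planned for 1.41.0 was reverted)
    "allocator_api": None,   # still unstable
    "specialization": None,  # still unstable
    "plugin": "1.29.0|2018-07-19",  # removed/retired, was unstable; TODO: verify date
    "test": "1.0.0|2015-05-15",   # public via libtest but feature gate retired
}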

def parse_stable_date(s):
    if not s:
        return None
    if "|" in s:
        _, date = s.split("|", 1)
    else:
        date = s
    try:
        return datetime.fromisoformat(date)
    except Exception:
        return None

stable_dates = {k: parse_stable_date(v) for k, v in stable_map.items()}
stable_dates

## 8) Adoption curves for selected gates
- Pull from `repo_feature_history` in DB.
- Count repos by the quarter of their `first_seen_date`, then accumulate the counts to see adoption over time.
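
The quarter bucketing and running total can be illustrated on toy data before touching the DB. The rows below are made up; only the column names `feature_name` and `first_seen_date` follow the history table.

```python
import pandas as pd

# Toy history rows: one row per (repo, feature) with the first-seen date.
toy_hist = pd.DataFrame({
    "feature_name": ["gate_a"] * 4,
    "first_seen_date": pd.to_datetime(
        ["2020-01-15", "2020-02-20", "2020-07-01", "2021-03-31"]
    ),
})

# Bucket each first-seen date into its calendar quarter.
toy_hist["first_quarter"] = toy_hist["first_seen_date"].dt.to_period("Q")

# New adopters per quarter, then a running total per feature.
per_quarter = (
    toy_hist.groupby(["feature_name", "first_quarter"])
    .size()
    .reset_index(name="count")
    .sort_values("first_quarter")
)
per_quarter["cum"] = per_quarter.groupby("feature_name")["count"].cumsum()
print(per_quarter)
```

Quarters with no new adopters (2020Q2 here) do not appear as rows, so a plotted line simply connects the quarters that do.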

In [None]:
import pandas as pd

def load_history(conn, features):
    qmarks = ",".join(["?"]*len(features))
    df = pd.read_sql_query(
        f"""
        SELECT feature_name, first_seen_date, last_seen_date, still_present, key
        FROM repo_feature_history
        WHERE feature_name IN ({qmarks})
        """,
        conn,
        params=features,
    )
    df["first_seen_date"] = pd.to_datetime(df["first_seen_date"], errors="coerce")
    df["last_seen_date"] = pd.to_datetime(df["last_seen_date"], errors="coerce")
    return df

selected = ["doc_cfg", "doc_auto_cfg", "specialization", "allocator_api", "never_type", "plugin"]

if conn is not None:
    hist_sel = load_history(conn, selected)
    if not hist_sel.empty:
        hist_sel["first_quarter"] = hist_sel["first_seen_date"].dt.to_period("Q")
        adoption = (hist_sel.groupby(["feature_name", "first_quarter"])  # count repos first adopting per quarter
                             .size()
                             .reset_index(name="count"))
        adoption = adoption.sort_values("first_quarter")
        all_quarters = adoption["first_quarter"].dropna().sort_values().unique()
        plt.figure(figsize=(12,6))
        for feat in selected:
            sub = adoption[adoption["feature_name"] == feat]
            if sub.empty:
                continue
            # cumulative adoption over time
            sub = sub.sort_values("first_quarter")
            sub["cum"] = sub["count"].cumsum()
            x = sub["first_quarter"].dt.to_timestamp()
            plt.plot(x, sub["cum"], marker="o", label=feat)
        if len(all_quarters) > 0:
            xticks = pd.PeriodIndex(all_quarters).to_timestamp()
            plt.gca().set_xticks(xticks)
            plt.gca().set_xticklabels([str(q) for q in all_quarters], rotation=75)
        plt.title("Cumulative adoption over time (selected features)")
        plt.ylabel("Repos (cumulative)")
        plt.legend()
        plt.tight_layout()
        plt.show()
    else:
        print("No history rows for selected features.")
else:
    print("DB not available; skip adoption curves.")

## 9) Time-to-stable vs adoption threshold
- For stable gates: pick an adoption threshold (e.g., the first quarter in which cumulative adopters reach 5 or 20 repos).
- Compute days from that threshold to stable date.
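
A worked example of the interval, using made-up cumulative counts and a hypothetical stabilization date, might look like this. The helper name `days_to_stable` is illustrative, and measuring from the quarter start is an approximation.

```python
import pandas as pd

# Toy cumulative adoption for one gate (hypothetical numbers).
cum = pd.DataFrame({
    "first_quarter": pd.PeriodIndex(["2019Q1", "2019Q3", "2020Q2"], freq="Q"),
    "cum": [2, 6, 21],
})
stable_date = pd.Timestamp("2021-01-01")  # hypothetical stabilization date

def days_to_stable(cum_df, threshold, stable):
    """Days from the start of the first quarter reaching `threshold` to `stable`."""
    hit = cum_df[cum_df["cum"] >= threshold]
    if hit.empty:
        return None  # threshold never reached
    start = hit.iloc[0]["first_quarter"].to_timestamp()  # quarter start
    return (stable - start).days

print(days_to_stable(cum, 5, stable_date))   # measured from 2019-07-01
print(days_to_stable(cum, 20, stable_date))  # measured from 2020-04-01
print(days_to_stable(cum, 50, stable_date))  # threshold never reached
```

A negative result would mean the threshold was only reached after stabilization.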

In [None]:
def threshold_date(adoption_df, feat, threshold):
    sub = adoption_df[adoption_df["feature_name"] == feat].sort_values("first_quarter")
    if sub.empty:
        return None
    sub = sub.copy()
    sub["cum"] = sub["count"].cumsum()
    hit = sub[sub["cum"] >= threshold]
    if hit.empty:
        return None
    q = hit.iloc[0]["first_quarter"]
    # approximate to quarter start
    return q.to_timestamp()

if conn is not None and 'hist_sel' in locals() and not hist_sel.empty:
    thresholds = [5, 20]
    results = []
    adoption = (hist_sel.groupby(["feature_name", "first_quarter"]).size().reset_index(name="count"))
    for feat in selected:
        stab = stable_dates.get(feat)
        for th in thresholds:
            tdate = threshold_date(adoption, feat, th)
            delta_days = None
            if tdate is not None and stab is not None:
                delta_days = (stab - tdate).days
            results.append({"feature": feat, "threshold": th, "threshold_date": tdate, "stable_date": stab, "days_to_stable": delta_days})
    res_df = pd.DataFrame(results)
    display(res_df)
else:
    print("Skip time-to-stable; missing data.")

## 10) Core vs non-core, categories, and size signals
- Join `repos` with history to see where adoption comes from.
- Compare adoption counts by core/non-core and top_category.
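
Before comparing counts, it can help to check how many history rows actually find a matching repo. The sketch below uses pandas' `indicator=True` option on toy frames; the `owner/repo` key format is an assumption about what `key` contains.

```python
import pandas as pd

# Toy frames; keys are made up.
hist = pd.DataFrame({"key": ["o/a", "o/b", "o/c"], "feature_name": ["gate_a"] * 3})
repos = pd.DataFrame({"key": ["o/a", "o/b"], "is_core": [1, 0]})

# indicator=True adds a `_merge` column recording where each row came from.
merged = hist.merge(repos, on="key", how="left", indicator=True)

# Rows marked left_only have no repo metadata, so is_core is NaN for them.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["key", "feature_name"]])
```

Unmatched rows would otherwise show up as a third, unlabeled group in a plot that uses `is_core` as the hue.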

In [None]:
def load_repos(conn):
    r = pd.read_sql_query("SELECT key, is_core, top_categories, downloads_sum, revdeps_sum FROM repos", conn)
    r["is_core"] = r["is_core"].fillna(0).astype(int)
    return r

if conn is not None:
    repos_df = load_repos(conn)
    hist_sel = load_history(conn, selected)
    if not hist_sel.empty:
        merged = hist_sel.merge(repos_df, on="key", how="left")
        merged["top_cat_primary"] = merged["top_categories"].fillna("").apply(lambda s: s.split(",")[0] if s else "uncategorized")
        # core vs non-core counts per feature
        core_counts = merged.groupby(["feature_name", "is_core"]).size().reset_index(name="count")
        plt.figure(figsize=(10,6))
        sns.barplot(data=core_counts, x="feature_name", y="count", hue="is_core", palette="Set2")
        plt.title("Adoption counts by core/non-core (selected features)")
        plt.xticks(rotation=45)
        plt.show()

        # category view (top 8 per feature)
        for feat in selected:
            sub = merged[merged["feature_name"] == feat]
            top_cat = sub["top_cat_primary"].value_counts().head(8).reset_index()
            top_cat.columns = ["top_category", "count"]
            plt.figure(figsize=(8,4))
            sns.barplot(data=top_cat, x="count", y="top_category", palette="Blues_r")
            plt.title(f"Top categories for {feat}")
            plt.tight_layout()
            plt.show()
    else:
        print("No history data for selected features.")
else:
    print("DB not available; skip core/category analysis.")

## 11) Pressure indicators for unstable gates
- For unstable gates (no stable date): look at adoption level, growth, and retention.
- Use current `still_present` counts and cumulative adoption.
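
One way to combine the three signals is sketched below on toy history rows. The gate names, the `cutoff` date, and the `recent_share` growth proxy are all assumptions made for illustration; they are not metrics defined by the pipeline.

```python
import pandas as pd

# Toy history rows for two unstable gates (hypothetical data).
rows = pd.DataFrame({
    "feature_name": ["gate_a"] * 4 + ["gate_b"] * 2,
    "first_seen_date": pd.to_datetime([
        "2019-05-01", "2021-02-01", "2023-06-01", "2023-09-01",
        "2018-01-01", "2018-03-01",
    ]),
    "still_present": [1, 1, 1, 0, 0, 1],
})

cutoff = pd.Timestamp("2023-01-01")  # start of the "recent" window; arbitrary choice
rows["is_recent"] = rows["first_seen_date"] >= cutoff

pressure = rows.groupby("feature_name").agg(
    adoption=("still_present", "size"),   # repos that ever used the gate
    retention=("still_present", "mean"),  # share of those still using it
    recent_share=("is_recent", "mean"),   # share of adopters first seen since cutoff
)
print(pressure)
```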

In [None]:
# This cell only uses the CSV-derived frames, so it does not need the DB connection.
if not hist_df.empty:
    unstable = [f for f, d in stable_dates.items() if d is None]
    pressure = hist_df[hist_df["feature_name"].isin(unstable)].copy()
    pressure = pressure.rename(columns={"ever_repo_count": "adoption"})
    # add still_present from lifetimes if available
    life_sub = life_df.set_index("feature_name")["num_still_present"]
    pressure["still_present"] = pressure["feature_name"].map(life_sub)
    display(pressure.sort_values("adoption", ascending=False))
else:
    print("Skip pressure analysis; missing data.")