Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 50 additions & 0 deletions docs/import-provider-detection-latency.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
# Import provider (PR/MR) detection latency

Tracking: [#232](https://github.com/kurrent-io/kcap-cli/issues/232) (AI-1122).

## Question

`ImportCommand`'s bulk loop calls `RepositoryDetection.DetectRepositoryAsync` per session/cwd,
which can spawn a live `gh pr view` / `glab api` per unique repo (bounded by a ~2 s best-effort
cap each). The per-host memo in `GitProviderRouter` stops the `gh auth status` **probe** from
multiplying, but the **detection call** still runs per cwd — and GitLab's `glab api` is a network
round-trip rather than gh's local resolution. Does this hurt large multi-repo imports, and if so,
should we cache per `(repo, branch)` or skip detection during import?

## Finding: the PR/MR detection is *dead work* on the import paths

The decisive observation is not a latency number — it's that **every bulk-import path discards the
PR fields**, so the `gh`/`glab` round-trip produces a result that is never used:

- **Scope-picker resolution** and the **"Scanning repositories"** step use only `Owner`/`RepoName`.
- **Per-session ingest** (`ImportCommand`) builds its `repository` node manually with
`user_name`/`user_email`/`remote_url`/`owner`/`repo_name`/`branch` — **no `pr_*` fields**.
- **Classification** (`TranscriptFileClassification`) uses the detected repo only for the
owner/repo exclusion key.
- The **Copilot / Pi / Kiro** import sources use their repo detector only for the same exclusion
key.

The **live/hook path** (`EnrichWithRepositoryInfo` → `BuildRepositoryNode`) *does* emit
`pr_number`/`pr_title`/`pr_url`/`pr_head_ref`, so PR detection genuinely matters there — just not
during import.

## Decision

Add a `detectPullRequest` flag (default `true`, preserving every live/hook caller) to
`DetectRepositoryAsync` and pass `detectPullRequest: false` on the bulk-import paths. This removes
the provably-unused `gh pr view` / `glab api` round-trip per cwd while still resolving base repo
info from local git. It is **behaviour-preserving**: imported payloads never carried PR fields on
those paths, so nothing observable changes — only the discarded network calls are dropped.

Because the result is discarded, this is a stronger justification than a wall-clock measurement:
the work is unnecessary regardless of its per-call cost. A representative "large multi-repo import"
benchmark was **not** run — it needs live `gh`/`glab` authentication, many real repositories with
open PRs/MRs, and network access that isn't reproducible in the dev/CI sandbox — and would not
change the conclusion.

### Exception: Cursor import keeps PR detection

`CursorImportSource` is the one import source whose synthetic `session-start/cursor` payload carries
a `repository` node built via `RepositoryDetection.BuildRepositoryNode`, which **does** emit `pr_*`
when populated (AI-1152). It therefore keeps live PR detection (its default detector is *not*
flipped) so imported Cursor sessions retain their PR tagging.
2 changes: 1 addition & 1 deletion src/Capacitor.Cli/Commands/CopilotImportSource.cs
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ public CopilotImportSource(
) {
_sessionStateDir = sessionStateDirOverride ?? CopilotPaths.SessionStateDir();
_legacySessionStateDir = legacyDirOverride ?? CopilotPaths.LegacySessionStateDir();
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd));
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false));
}

static StringComparison PathComparison =>
Expand Down
4 changes: 4 additions & 0 deletions src/Capacitor.Cli/Commands/CursorImportSource.cs
Original file line number Diff line number Diff line change
Expand Up @@ -42,6 +42,10 @@ public CursorImportSource(
_projectsDir = projectsDirOverride ?? CursorPaths.ProjectsDir();
_workspaceStorageDir = workspaceStorageDirOverride ?? CursorPaths.Resolve().WorkspaceStorageDir;
_sanitizedToFolder = new Lazy<IReadOnlyDictionary<string, string?>>(BuildSanitizedToFolderMap);
// Cursor is the one import source that emits a `repository` node in its synthetic
// session-start (AI-1152) via BuildRepositoryNode, which includes pr_* when populated —
// so it keeps live PR detection (unlike the owner/repo-only detectors in the other
// sources). Dropping it here would silently strip PR tagging from imported Cursor sessions.
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd));
}

Expand Down
10 changes: 7 additions & 3 deletions src/Capacitor.Cli/Commands/ImportCommand.cs
Original file line number Diff line number Diff line change
Expand Up @@ -589,7 +589,8 @@ await Parallel.ForEachAsync(
opts,
async (cwd, _) => {
try {
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd);
// Import only needs owner/repo here — skip the PR/MR provider round-trip.
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false);
Comment thread
realtonyyoung marked this conversation as resolved.
repoByCwd[cwd] = repo is { Owner: { } o, RepoName: { } n } ? (o, n) : null;
} catch {
repoByCwd[cwd] = null;
Expand Down Expand Up @@ -2003,7 +2004,8 @@ string sessionId
var total = uniqueCwds.Count;

async ValueTask DetectOne(string cwd) {
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd);
// Import only needs owner/repo here — skip the PR/MR provider round-trip.
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false);
repoByCwd[cwd] = repo is { Owner: { } o, RepoName: { } n } ? (o, n) : null;
}

Expand Down Expand Up @@ -2358,7 +2360,9 @@ CancellationToken ct
var codexRepo = session.Vendor == "codex" ? ExtractCodexGitInfo(session.FilePath) : null;

if (cwd is not null) {
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd);
// The imported session-start payload carries no PR fields (only owner/repo/branch/user),
// so skip the PR/MR provider round-trip.
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false);

if (repo is not null || codexRepo is not null) {
var repoNode = new JsonObject();
Expand Down
2 changes: 1 addition & 1 deletion src/Capacitor.Cli/Commands/KiroImportSource.cs
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ public KiroImportSource(
Func<string, Task<RepositoryPayload?>>? repoDetector = null
) {
_sessionsDir = sessionsDirOverride ?? KiroPaths.SessionsDir();
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd));
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false));
}

static StringComparison PathComparison =>
Expand Down
2 changes: 1 addition & 1 deletion src/Capacitor.Cli/Commands/PiImportSource.cs
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ public PiImportSource(
Func<string, Task<RepositoryPayload?>>? repoDetector = null
) {
_sessionsDir = sessionsDirOverride ?? PiPaths.SessionsDir();
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd));
_repoDetector = repoDetector ?? (cwd => RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false));
}

static StringComparison PathComparison =>
Expand Down
3 changes: 2 additions & 1 deletion src/Capacitor.Cli/Commands/TranscriptFileClassification.cs
Original file line number Diff line number Diff line change
Expand Up @@ -193,7 +193,8 @@ CancellationToken ct

if (cwd is not null) {
if (excludedRepos is { Length: > 0 }) {
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd);
// Classification only needs owner/repo for the exclusion key — skip PR detection.
var repo = await RepositoryDetection.DetectRepositoryAsync(cwd, detectPullRequest: false);

if (repo?.Owner is not null && repo.RepoName is not null) {
var key = $"{repo.Owner}/{repo.RepoName}";
Expand Down
24 changes: 17 additions & 7 deletions src/Capacitor.Cli/RepositoryDetection.cs
Original file line number Diff line number Diff line change
Expand Up @@ -115,8 +115,15 @@ static bool RepoPayloadEquals(RepositoryPayload a, RepositoryPayload b) =>
&& a.UserName == b.UserName
&& a.UserEmail == b.UserEmail;

public static async Task<RepositoryPayload?> DetectRepositoryAsync(string cwd, TimeSpan? budget = null) {
// detectPullRequest=false skips the live PR/MR provider detection (the `gh pr view` / `glab api`
// round-trip) while still resolving base repo info (owner/repo/user/branch/host). Bulk import
// passes false: it never emits PR fields, so that per-cwd round-trip is pure wasted latency
// (AI-1122). `run` is an injectable command runner (defaults to the real process spawner) so the
// git/provider spawns are unit-testable.
public static async Task<RepositoryPayload?> DetectRepositoryAsync(
string cwd, TimeSpan? budget = null, bool detectPullRequest = true, CommandRunner? run = null) {
if (budget is { } b0 && b0 <= TimeSpan.Zero) return null;
run ??= DefaultRunner;
try {
// Bound the git/gh probes by the remaining hook budget so a slow repo never
// overruns the deadline. When no budget is set, keep the historical 5s/2s caps.
Expand All @@ -131,7 +138,7 @@ static bool RepoPayloadEquals(RepositoryPayload a, RepositoryPayload b) =>
string? userName, userEmail, remoteUrl, owner, repoName, branch, host;

// Always detect branch fresh — it changes frequently during a session
var branchTask = RunCommandAsync("git", "branch --show-current", cwd, gitCap);
var branchTask = run("git", "branch --show-current", cwd, gitCap);

if (cache is not null) {
userName = cache.UserName;
Expand All @@ -143,9 +150,9 @@ static bool RepoPayloadEquals(RepositoryPayload a, RepositoryPayload b) =>
branch = await branchTask;
} else {
// Run git commands in parallel
var userNameTask = RunCommandAsync("git", "config user.name", cwd, gitCap);
var userEmailTask = RunCommandAsync("git", "config user.email", cwd, gitCap);
var remoteUrlTask = RunCommandAsync("git", "remote get-url origin", cwd, gitCap);
var userNameTask = run("git", "config user.name", cwd, gitCap);
var userEmailTask = run("git", "config user.email", cwd, gitCap);
var remoteUrlTask = run("git", "remote get-url origin", cwd, gitCap);

await Task.WhenAll(userNameTask, userEmailTask, remoteUrlTask, branchTask);

Expand Down Expand Up @@ -193,8 +200,11 @@ static bool RepoPayloadEquals(RepositoryPayload a, RepositoryPayload b) =>
// (only ClaudeHookCommand does), else the historical 2s cap. Covers probe + detector.
var providerCap = ghCap;

if (providerCap > TimeSpan.Zero && host is not null) {
var pr = await ResolveAndDetectPrAsync(host, owner, repoName, branch, cwd, providerCap, DefaultRunner);
// Import passes detectPullRequest:false to skip the provider round-trip (it discards
// PR fields). ResolveAndDetectPrAsync (from #261) owns the probe+detector budget split;
// thread the injectable `run` through so the git/provider spawns stay unit-testable.
if (detectPullRequest && providerCap > TimeSpan.Zero && host is not null) {
var pr = await ResolveAndDetectPrAsync(host, owner, repoName, branch, cwd, providerCap, run);

if (pr is not null) {
prNumber = pr.Number;
Expand Down
21 changes: 21 additions & 0 deletions test/Capacitor.Cli.Tests.Unit/Import/CursorImportSourceTests.cs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,27 @@ public async Task does_not_support_title_generation() {
await Assert.That(src.SupportsTitleGeneration).IsFalse();
}

[Test]
public async Task session_start_payload_carries_pr_fields_when_repo_has_a_pr() {
// Cursor is the one import source whose synthetic session-start carries a `repository`
// node (AI-1152) via BuildRepositoryNode — including pr_* when a PR is detected. Guard
// that the payload propagates those fields (a regression dropped them when Cursor's repo
// detector was switched to skip PR detection during the import-latency work — AI-1122).
var repo = new RepositoryPayload {
Owner = "o", RepoName = "r", Host = "github.com", Branch = "main",
PrNumber = 7, PrTitle = "t", PrUrl = "https://github.com/o/r/pull/7", PrHeadRef = "main"
};
var repoNode = RepositoryDetection.BuildRepositoryNode(repo);

var payload = CursorImportSource.BuildSessionStartPayload(
"sid", "/ws", "/t.jsonl", DateTimeOffset.UtcNow, repoNode);

var repository = payload["repository"] as JsonObject;
await Assert.That(repository).IsNotNull();
await Assert.That(repository!["pr_number"]!.GetValue<int>()).IsEqualTo(7);
await Assert.That(repository["pr_url"]!.GetValue<string>()).IsEqualTo("https://github.com/o/r/pull/7");
}

[Test]
public async Task is_available_when_projects_dir_exists() {
using var fx = new ProjectsDirFixture();
Expand Down
51 changes: 51 additions & 0 deletions test/Capacitor.Cli.Tests.Unit/ImportProviderDetectionTests.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
using Capacitor.Cli;
using Capacitor.Cli.PrDetection;

namespace Capacitor.Cli.Tests.Unit;

/// <summary>
/// Import discards PR/MR fields on every path, so running the live `gh pr view` / `glab api`
/// provider round-trip per cwd during a bulk import is wasted latency (AI-1122). These tests pin
/// the `detectPullRequest` switch on RepositoryDetection.DetectRepositoryAsync via an injected
/// command runner: import (false) resolves owner/repo with git only and never spawns a provider
/// CLI, while the default (live) path still does.
/// </summary>
public class ImportProviderDetectionTests {
// Records every command the detector spawns; answers git probes with a GitHub remote so
// owner/repo/host resolve, and answers `gh pr view` with a PR so the live path can populate it.
static CommandRunner Recording(List<string> log) => (cmd, args, _, _) => {
log.Add($"{cmd} {args}");
if (cmd == "git" && args.StartsWith("remote get-url")) return Task.FromResult<string?>("https://github.com/foo/bar");
if (cmd == "git" && args.StartsWith("branch")) return Task.FromResult<string?>("main");
if (cmd == "git") return Task.FromResult<string?>("");
if (cmd == "gh") return Task.FromResult<string?>("""{"number":7,"title":"t","url":"https://github.com/foo/bar/pull/7","headRefName":"main"}""");
return Task.FromResult<string?>(null);
};

[Test]
public async Task Import_detection_resolves_repo_without_spawning_a_provider_cli() {
var log = new List<string>();

var repo = await RepositoryDetection.DetectRepositoryAsync(
"/fake/import/skip-pr", budget: null, detectPullRequest: false, run: Recording(log));

// Base repo info still resolves from git alone…
await Assert.That(repo!.Owner).IsEqualTo("foo");
await Assert.That(repo.RepoName).IsEqualTo("bar");
await Assert.That(repo.PrNumber).IsNull();
// …but no `gh` / `glab` round-trip was made.
await Assert.That(log.Any(c => c.StartsWith("gh") || c.StartsWith("glab"))).IsFalse();
await Assert.That(log.Any(c => c.StartsWith("git"))).IsTrue();
}

[Test]
public async Task Live_detection_still_runs_provider_detection() {
var log = new List<string>();

var repo = await RepositoryDetection.DetectRepositoryAsync(
"/fake/live/do-pr", budget: null, detectPullRequest: true, run: Recording(log));

await Assert.That(log.Any(c => c.StartsWith("gh pr view"))).IsTrue(); // provider detection ran
await Assert.That(repo!.PrNumber).IsEqualTo(7);
}
}
Loading