Add basic github lexical search #312228

Merged
mjbvz merged 3 commits into microsoft:main from mjbvz:dev/mjbvz/boring-python
Apr 23, 2026

Conversation

Collaborator

@mjbvz mjbvz commented Apr 23, 2026

Fixes #312210

This uses the old search endpoint, which is limited. We'll start switching over to the new one once it's ready. We should be able to support this for external ingest workspaces too.

mjbvz and others added 2 commits April 23, 2026 14:16
Fixes microsoft#312210

This uses the old search endpoint. We'll start switching over to the new one once it's ready.

Co-authored-by: Copilot <copilot@github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 21:24
@mjbvz mjbvz enabled auto-merge April 23, 2026 21:24
Contributor

Copilot AI left a comment

Pull request overview

Adds a first-party GitHub lexical (keyword) code search path to the Copilot extension, alongside the existing semantic repo search, to enable out-of-the-box text search over GitHub repos/orgs.

Changes:

  • Introduces lexicalSearch to IGithubCodeSearchService and implements it via GitHub’s REST search/code endpoint.
  • Splits remote search results into SemanticCodeSearchResult vs LexicalCodeSearchResult, and threads the semantic type through existing semantic callers.
  • Adds a new github_text_search tool (and updates/renames the semantic repo tool wiring + localization).
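
To make the result-type split in the second bullet concrete, here is a minimal sketch. All field names below (`kind`, `chunks`, `score`, `outOfSync`) are assumptions for illustration; the actual definitions in remoteCodeSearch.ts may look quite different.

```typescript
// Hypothetical shapes, NOT the PR's real types.
interface FileChunk {
    file: string;        // file URI, kept as a string for simplicity
    text: string;        // excerpt text; may be empty for file-only lexical hits
    isFullFile: boolean;
}

interface SemanticCodeSearchResult {
    kind: 'semantic';
    chunks: { chunk: FileChunk; score: number }[];
    outOfSync: boolean;  // assumption: the semantic index can lag behind the repo
}

interface LexicalCodeSearchResult {
    kind: 'lexical';
    chunks: { chunk: FileChunk }[];
}

type RemoteCodeSearchResult = SemanticCodeSearchResult | LexicalCodeSearchResult;

// Callers that only understand semantic results accept the narrow type,
// so passing a lexical result is a compile-time error.
function bestSemanticHit(result: SemanticCodeSearchResult): FileChunk | undefined {
    let best: { chunk: FileChunk; score: number } | undefined;
    for (const entry of result.chunks) {
        if (!best || entry.score > best.score) {
            best = entry;
        }
    }
    return best?.chunk;
}

// Code that handles both kinds switches on the discriminant.
function describeResult(result: RemoteCodeSearchResult): string {
    switch (result.kind) {
        case 'semantic':
            return `${result.chunks.length} semantic chunk(s)`;
        case 'lexical':
            return `${result.chunks.length} lexical match(es)`;
    }
}
```

Narrowing the parameter type, as in `bestSemanticHit`, is one way existing semantic callers can stay unchanged while the lexical path is added beside them.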
| File | Description |
| --- | --- |
| extensions/copilot/test/base/simuliationWorkspaceChunkSearch.ts | Updates simulation code search service to new semantic/lexical API surface. |
| extensions/copilot/src/platform/workspaceChunkSearch/node/codeSearch/codeSearchRepo.ts | Narrows repo search result type to semantic results. |
| extensions/copilot/src/platform/workspaceChunkSearch/node/codeSearch/codeSearchChunkSearch.ts | Propagates semantic result typing through chunk search flow. |
| extensions/copilot/src/platform/remoteCodeSearch/common/remoteCodeSearch.ts | Splits result types into semantic vs lexical and adds optional repo/ref metadata. |
| extensions/copilot/src/platform/remoteCodeSearch/common/githubCodeSearchService.ts | Adds lexicalSearch implementation using GitHub REST search; extends semantic parsing with remote URL + ref extraction. |
| extensions/copilot/src/platform/remoteCodeSearch/common/adoCodeSearchService.ts | Updates ADO code search typing to semantic result type. |
| extensions/copilot/src/platform/github/common/githubAPI.ts | Extends REST request helper to support custom Accept and extra headers. |
| extensions/copilot/src/extension/tools/node/githubTextSearchTool.tsx | New tool that exposes GitHub lexical search to prompts and renders results. |
| extensions/copilot/src/extension/tools/node/githubRepoSemanticSearchTool.tsx | Renames/updates semantic repo tool and uses returned remote URL/ref to build links. |
| extensions/copilot/src/extension/tools/node/allTools.ts | Registers new tool modules for runtime discovery. |
| extensions/copilot/src/extension/tools/common/toolNames.ts | Adds tool names for GitHub text search and renames semantic repo tool enum entries. |
| extensions/copilot/package.nls.json | Updates localized tool names/descriptions; adds text-search strings. |
| extensions/copilot/package.json | Contributes the new copilot_githubTextSearch tool and exposes it in the web tool set. |

Copilot's findings

Comments suppressed due to low confidence (1)

extensions/copilot/src/extension/tools/node/githubRepoSemanticSearchTool.tsx:72

  • URI.joinPath(…, entry.chunk.file.path) is unsafe here because entry.chunk.file.path is an absolute path (starts with /). URI.joinPath will treat that as an absolute segment and discard the preceding tree/{ref} segments, producing an incorrect GitHub URL. Strip the leading slash (or join using path segments) before calling URI.joinPath.
  • Files reviewed: 13/13 changed files
  • Comments generated: 8

Comment on lines +135 to +159
const renderedChunks = this.props.chunks
    .filter(x => x.chunk.text)
    .map(chunk => {
        const fileKey = chunk.chunk.file.toString();
        if (!seenFiles.has(fileKey)) {
            seenFiles.add(fileKey);
            references.push(new PromptReference(chunk.chunk.file));
        }

        const githubInfo = parseGithubFileUrl(chunk.chunk.file);
        const displayPath = githubInfo?.path ?? chunk.chunk.file.toString();
        const nwoLabel = githubInfo?.nwo;

        const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;

        const headerText = nwoLabel
            ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
            : `Text match excerpt in \`${displayPath}\`${lineInfo}:`;

        return <TextChunk>
            {headerText}<br />
            {createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
        </TextChunk>;
    });

Copilot AI Apr 23, 2026

references (and therefore toolResultMessage) is derived only from chunks with non-empty chunk.text because of .filter(x => x.chunk.text). But parseLexicalSearchResponse can produce isFullFile: true entries with text: '' when the API omits text_matches, which would make the tool report “no results” even though files matched. Consider rendering (or at least referencing) those file-only matches so the tool result accurately reflects hits.

Suggested change
- const renderedChunks = this.props.chunks
-     .filter(x => x.chunk.text)
-     .map(chunk => {
-         const fileKey = chunk.chunk.file.toString();
-         if (!seenFiles.has(fileKey)) {
-             seenFiles.add(fileKey);
-             references.push(new PromptReference(chunk.chunk.file));
-         }
-         const githubInfo = parseGithubFileUrl(chunk.chunk.file);
-         const displayPath = githubInfo?.path ?? chunk.chunk.file.toString();
-         const nwoLabel = githubInfo?.nwo;
-         const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;
-         const headerText = nwoLabel
-             ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
-             : `Text match excerpt in \`${displayPath}\`${lineInfo}:`;
-         return <TextChunk>
-             {headerText}<br />
-             {createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
-         </TextChunk>;
-     });
+ const renderedChunks = this.props.chunks.map(chunk => {
+     const fileKey = chunk.chunk.file.toString();
+     if (!seenFiles.has(fileKey)) {
+         seenFiles.add(fileKey);
+         references.push(new PromptReference(chunk.chunk.file));
+     }
+     const githubInfo = parseGithubFileUrl(chunk.chunk.file);
+     const displayPath = githubInfo?.path ?? chunk.chunk.file.toString();
+     const nwoLabel = githubInfo?.nwo;
+     if (!chunk.chunk.text) {
+         if (!chunk.chunk.isFullFile) {
+             return undefined;
+         }
+         const headerText = nwoLabel
+             ? l10n.t("Match found in `{0}` in `{1}`. GitHub did not provide a text excerpt.", nwoLabel, displayPath)
+             : l10n.t("Match found in `{0}`. GitHub did not provide a text excerpt.", displayPath);
+         return <TextChunk>
+             {headerText}<br /><br />
+         </TextChunk>;
+     }
+     const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;
+     const headerText = nwoLabel
+         ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
+         : `Text match excerpt in \`${displayPath}\`${lineInfo}:`;
+     return <TextChunk>
+         {headerText}<br />
+         {createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
+     </TextChunk>;
+ });

Copilot uses AI. Check for mistakes.
Comment on lines +277 to +278
"github.copilot.tools.githubRepo.name": "Semantic Search GitHub Repository",
"github.copilot.tools.githubRepo.userDescription": "Semantic Search a GitHub repository for relevant source code snippets. You can specify a repository using `owner/repo`",

Copilot AI Apr 23, 2026

User-facing text: “Semantic Search a GitHub repository …” is grammatically incorrect. Consider changing it to “Semantically search a GitHub repository …” (same applies to the userDescription).

Suggested change
- "github.copilot.tools.githubRepo.name": "Semantic Search GitHub Repository",
- "github.copilot.tools.githubRepo.userDescription": "Semantic Search a GitHub repository for relevant source code snippets. You can specify a repository using `owner/repo`",
+ "github.copilot.tools.githubRepo.name": "Semantically Search GitHub Repository",
+ "github.copilot.tools.githubRepo.userDescription": "Semantically search a GitHub repository for relevant source code snippets. You can specify a repository using `owner/repo`",

import './getNotebookCellOutputTool';
import './getSearchViewResultsTool';
import './githubRepoTool';
import './githubRepoSemanticSearchTool.tsx';

Copilot AI Apr 23, 2026

import './githubRepoSemanticSearchTool.tsx' will fail to compile under the extension's tsconfig because allowImportingTsExtensions is not enabled (TypeScript disallows importing .ts/.tsx extensions by default). Drop the explicit .tsx extension and import the module the same way as the other tool imports in this file.

Suggested change
- import './githubRepoSemanticSearchTool.tsx';
+ import './githubRepoSemanticSearchTool';

if (parts.length >= 3) {
const nwo = `${parts[0]}/${parts[1]}`;
const relativePath = parts.slice(2).join('/');
file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', '/' + relativePath).with({

Copilot AI Apr 23, 2026

URI.joinPath uses path.posix.join, so passing a segment that starts with / resets the path. Using '/' + relativePath here will drop the owner/repo/tree/main segments and produce an incorrect GitHub URL. Pass relativePath without a leading slash (or split it into path segments) when calling URI.joinPath.

Suggested change
- file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', '/' + relativePath).with({
+ file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', relativePath).with({
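
A segment-based join sidesteps the leading-slash question altogether. The sketch below is a minimal, hypothetical helper (`buildGithubTreePath` is not part of the PR) that splits each input on `/`, drops empty segments, and percent-encodes the rest. For context, Node's `path.posix.join` concatenates and normalizes its arguments, while `path.posix.resolve` is the function that restarts at an absolute segment, so the reset behaviour described above is worth confirming against the `URI.joinPath` implementation before relying on it.

```typescript
// Hypothetical helper, not code from this PR: builds the path portion of a
// github.com "tree" URL from individual segments.
function buildGithubTreePath(nwo: string, ref: string, relativePath: string): string {
    const segments = [...nwo.split('/'), 'tree', ...ref.split('/'), ...relativePath.split('/')]
        // Empty segments come from leading, trailing, or doubled slashes.
        .filter(segment => segment.length > 0)
        // Encode each segment on its own so the separators stay literal.
        // Skip this step if the result is handed to a URI class that encodes itself.
        .map(segment => encodeURIComponent(segment));
    return '/' + segments.join('/');
}
```

With this shape, `relativePath` values with and without a leading slash produce the same path, so the caller no longer has to know which form it received.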

Comment on lines +64 to +69
const relativePath = parts.slice(2).join('/');
file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', '/' + relativePath).with({
    fragment: entry.range.startLineNumber > 0
        ? `L${entry.range.startLineNumber}-L${entry.range.endLineNumber}`
        : undefined,
});

Copilot AI Apr 23, 2026

This tool hard-codes the branch name to main when building GitHub URLs. That will generate broken links for repos whose default branch is not main (common for master, release branches, or protected default branches). Consider using the html_url returned by GitHub's search API (or fetching the repo default branch once) instead of hardcoding main.
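
One way to avoid the hard-coded branch, sketched under assumptions: items returned by GitHub's REST `search/code` endpoint carry an `html_url` field, which can serve as the link target directly. `CodeSearchItem` below lists only the fields the sketch reads, and `linkForItem` is a hypothetical helper, not code from this PR. It also leaves out the line fragment when the start line is unknown (0 or less).

```typescript
// Subset of a GitHub REST `search/code` result item; other fields are omitted.
interface CodeSearchItem {
    path: string;
    html_url: string;
    repository: { full_name: string };
}

// Hypothetical helper: prefer the API-provided URL over one assembled from a
// guessed branch name, and append a line fragment only when the range is known.
function linkForItem(item: CodeSearchItem, startLine = 0, endLine = startLine): string {
    const base = item.html_url.split('#')[0]; // drop any fragment already present
    if (startLine <= 0) {
        return base; // unknown range: link to the file as a whole
    }
    const end = Math.max(endLine, startLine);
    return end > startLine ? `${base}#L${startLine}-L${end}` : `${base}#L${startLine}`;
}
```

Fetching the repository's default branch once, the other option mentioned above, costs an extra request per repo, whereas `html_url` arrives with the search response.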

Comment on lines +147 to +156

const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;

const headerText = nwoLabel
    ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
    : `Text match excerpt in \`${displayPath}\`${lineInfo}:`;

return <TextChunk>
    {headerText}<br />
    {createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />

Copilot AI Apr 23, 2026

lineInfo is always rendered as starting at line ${chunk.chunk.range.startLineNumber}. For lexical search results the range is currently new Range(0, 0, 0, 0), which will display “starting at line 0” (and also prevents fragments from being added). Consider omitting the line info (and fragment) when the range is unknown/0 so the output is not misleading.

Suggested change
- const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;
- const headerText = nwoLabel
-     ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
-     : `Text match excerpt in \`${displayPath}\`${lineInfo}:`;
- return <TextChunk>
-     {headerText}<br />
-     {createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
+ const hasKnownRange = chunk.chunk.range.startLineNumber > 0;
+ const lineInfo = hasKnownRange ? ` starting at line ${chunk.chunk.range.startLineNumber}` : '';
+ const headerText = nwoLabel
+     ? hasKnownRange
+         ? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
+         : `Text match from \`${nwoLabel}\` in \`${displayPath}\`:`
+     : hasKnownRange
+         ? `Text match excerpt in \`${displayPath}\`${lineInfo}:`
+         : `Text match in \`${displayPath}\`:`;
+ return <TextChunk>
+     {headerText}<br />
+     {hasKnownRange ? <>{createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /></> : undefined}
+     <br />


const scopeQualifier = scope.kind === 'org' ? `org:${scope.org}` : `repo:${toGithubNwo(scope.githubRepoId)}`;
const searchQuery = `${query} ${scopeQualifier}`;
const routeSlug = `search/code?q=${encodeURIComponent(searchQuery)}&per_page=${maxResults}`;

Copilot AI Apr 23, 2026

GitHub’s REST code search API only supports per_page up to 100. maxResults is passed through directly, so callers can accidentally generate invalid requests by providing values > 100. Consider clamping maxResults to 100 (and/or paging) before building the routeSlug.

Suggested change
- const routeSlug = `search/code?q=${encodeURIComponent(searchQuery)}&per_page=${maxResults}`;
+ const perPage = Math.min(maxResults, 100);
+ const routeSlug = `search/code?q=${encodeURIComponent(searchQuery)}&per_page=${perPage}`;
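
The clamp can also be combined with simple paging, and `Math.min` alone does not guard against zero, negative, or non-finite input. The following is a minimal sketch, assuming the `per_page` ceiling of 100 mentioned above and the search API's overall cap of 1,000 results per query; `buildSearchRoutes` and both constants are hypothetical names, not part of the PR.

```typescript
const MAX_PER_PAGE = 100;        // per_page ceiling for GitHub REST search
const MAX_TOTAL_RESULTS = 1000;  // the search API returns at most 1,000 results per query

// Hypothetical helper: turns a requested result count into one route per page.
// The caller is expected to trim the last page down to the requested total.
function buildSearchRoutes(query: string, scopeQualifier: string, maxResults: number): string[] {
    // NaN and Infinity are treated as "no results requested".
    const requested = isFinite(maxResults) ? Math.floor(maxResults) : 0;
    const wanted = Math.min(Math.max(requested, 0), MAX_TOTAL_RESULTS);
    const perPage = Math.min(wanted, MAX_PER_PAGE);
    const q = encodeURIComponent(`${query} ${scopeQualifier}`);

    const routes: string[] = [];
    for (let page = 1; (page - 1) * perPage < wanted; page++) {
        routes.push(`search/code?q=${q}&per_page=${perPage}&page=${page}`);
    }
    return routes;
}
```

A request for 250 results yields three routes of 100 each, while a request for 30 yields a single route with `per_page=30`.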


async searchRepo(authOptions: { silent: boolean }, embeddingType: EmbeddingType, repo: GithubCodeSearchRepoInfo, query: string, maxResults: number, options: WorkspaceChunkSearchOptions, _telemetryInfo: TelemetryCorrelationId, token: CancellationToken): Promise<CodeSearchResult> {
async lexicalSearch(_authOptions: { silent: boolean }, _scope: GithubCodeSearchScope, _query: string, _maxResults: number, _options: WorkspaceChunkSearchOptions, _telemetryInfo: TelemetryCorrelationId, _token: CancellationToken): Promise<LexicalCodeSearchResult> {
throw new Error('Method not implemented.');

Copilot AI Apr 23, 2026

The simulation implementation of IGithubCodeSearchService.lexicalSearch throws Method not implemented. With lexical search now used by a first-party tool, this makes the simulation workspace brittle (any future test/tool usage will crash at runtime). Consider returning an empty LexicalCodeSearchResult (or a simple simulated response) instead of throwing.

Suggested change
- throw new Error('Method not implemented.');
+ this._logService.trace('SimulationGithubCodeSearchService::lexicalSearch returning empty simulated result');
+ return {} as LexicalCodeSearchResult;
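
Note that `{} as LexicalCodeSearchResult` silences the compiler about any required fields, so a caller that reads a list off the result could still fail at runtime. A typed factory is one alternative. The sketch below assumes, purely for illustration, that the result carries a `chunks` array; the PR's real type may have a different or larger shape, and `emptyLexicalResult` is a hypothetical name.

```typescript
// Hypothetical result shape for illustration only.
interface LexicalCodeSearchResult {
    chunks: { file: string; text: string }[];
}

// Because the return value is checked against the interface (no cast),
// the compiler reports an error if a required field is left out.
function emptyLexicalResult(): LexicalCodeSearchResult {
    return { chunks: [] };
}

// Stand-in for the simulation service method: resolves with a well-formed
// empty result instead of throwing.
async function simulatedLexicalSearch(_query: string): Promise<LexicalCodeSearchResult> {
    return emptyLexicalResult();
}
```

Callers can then iterate `result.chunks` without checking whether the simulation or the real service produced it.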

@mjbvz mjbvz merged commit fe56e84 into microsoft:main Apr 23, 2026
26 checks passed
@vs-code-engineering vs-code-engineering Bot added this to the 1.118.0 milestone Apr 23, 2026
Development

Successfully merging this pull request may close these issues.

Enable text search tool for GitHub repos and orgs

3 participants