Add basic github lexical search #312228
Conversation
Fixes microsoft#312210. This is using the old search endpoint; we'll start switching over to the new one once it's ready.

Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Adds a first-party GitHub lexical (keyword) code search path to the Copilot extension, alongside the existing semantic repo search, to enable out-of-the-box text search over GitHub repos/orgs.
Changes:
- Introduces `lexicalSearch` to `IGithubCodeSearchService` and implements it via GitHub's REST `search/code` endpoint.
- Splits remote search results into `SemanticCodeSearchResult` vs `LexicalCodeSearchResult`, and threads the semantic type through existing semantic callers.
- Adds a new `github_text_search` tool (and updates/renames the semantic repo tool wiring + localization).
Summary per file:
| File | Description |
|---|---|
| extensions/copilot/test/base/simuliationWorkspaceChunkSearch.ts | Updates simulation code search service to new semantic/lexical API surface. |
| extensions/copilot/src/platform/workspaceChunkSearch/node/codeSearch/codeSearchRepo.ts | Narrows repo search result type to semantic results. |
| extensions/copilot/src/platform/workspaceChunkSearch/node/codeSearch/codeSearchChunkSearch.ts | Propagates semantic result typing through chunk search flow. |
| extensions/copilot/src/platform/remoteCodeSearch/common/remoteCodeSearch.ts | Splits result types into semantic vs lexical and adds optional repo/ref metadata. |
| extensions/copilot/src/platform/remoteCodeSearch/common/githubCodeSearchService.ts | Adds lexicalSearch implementation using GitHub REST search; extends semantic parsing with remote URL + ref extraction. |
| extensions/copilot/src/platform/remoteCodeSearch/common/adoCodeSearchService.ts | Updates ADO code search typing to semantic result type. |
| extensions/copilot/src/platform/github/common/githubAPI.ts | Extends REST request helper to support custom Accept and extra headers. |
| extensions/copilot/src/extension/tools/node/githubTextSearchTool.tsx | New tool that exposes GitHub lexical search to prompts and renders results. |
| extensions/copilot/src/extension/tools/node/githubRepoSemanticSearchTool.tsx | Renames/updates semantic repo tool and uses returned remote URL/ref to build links. |
| extensions/copilot/src/extension/tools/node/allTools.ts | Registers new tool modules for runtime discovery. |
| extensions/copilot/src/extension/tools/common/toolNames.ts | Adds tool names for GitHub text search and renames semantic repo tool enum entries. |
| extensions/copilot/package.nls.json | Updates localized tool names/descriptions; adds text-search strings. |
| extensions/copilot/package.json | Contributes the new copilot_githubTextSearch tool and exposes it in the web tool set. |
Copilot's findings
Comments suppressed due to low confidence (1)
extensions/copilot/src/extension/tools/node/githubRepoSemanticSearchTool.tsx:72
`URI.joinPath(…, entry.chunk.file.path)` is unsafe here because `entry.chunk.file.path` is an absolute path (starts with `/`). `URI.joinPath` will treat that as an absolute segment and discard the preceding `tree/{ref}` segments, producing an incorrect GitHub URL. Strip the leading slash (or join using path segments) before calling `URI.joinPath`.
- Files reviewed: 13/13 changed files
- Comments generated: 8
```tsx
const renderedChunks = this.props.chunks
	.filter(x => x.chunk.text)
	.map(chunk => {
		const fileKey = chunk.chunk.file.toString();
		if (!seenFiles.has(fileKey)) {
			seenFiles.add(fileKey);
			references.push(new PromptReference(chunk.chunk.file));
		}

		const githubInfo = parseGithubFileUrl(chunk.chunk.file);
		const displayPath = githubInfo?.path ?? chunk.chunk.file.toString();
		const nwoLabel = githubInfo?.nwo;

		const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;

		const headerText = nwoLabel
			? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
			: `Text match excerpt in \`${displayPath}\`${lineInfo}:`;

		return <TextChunk>
			{headerText}<br />
			{createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
		</TextChunk>;
	});
```
`references` (and therefore `toolResultMessage`) is derived only from chunks with non-empty `chunk.text` because of `.filter(x => x.chunk.text)`. But `parseLexicalSearchResponse` can produce `isFullFile: true` entries with `text: ''` when the API omits `text_matches`, which would make the tool report “no results” even though files matched. Consider rendering (or at least referencing) those file-only matches so the tool result accurately reflects hits.
Suggested change:

```tsx
const renderedChunks = this.props.chunks.map(chunk => {
	const fileKey = chunk.chunk.file.toString();
	if (!seenFiles.has(fileKey)) {
		seenFiles.add(fileKey);
		references.push(new PromptReference(chunk.chunk.file));
	}
	const githubInfo = parseGithubFileUrl(chunk.chunk.file);
	const displayPath = githubInfo?.path ?? chunk.chunk.file.toString();
	const nwoLabel = githubInfo?.nwo;
	if (!chunk.chunk.text) {
		if (!chunk.chunk.isFullFile) {
			return undefined;
		}
		const headerText = nwoLabel
			? l10n.t("Match found in `{0}` in `{1}`. GitHub did not provide a text excerpt.", nwoLabel, displayPath)
			: l10n.t("Match found in `{0}`. GitHub did not provide a text excerpt.", displayPath);
		return <TextChunk>
			{headerText}<br /><br />
		</TextChunk>;
	}
	const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;
	const headerText = nwoLabel
		? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
		: `Text match excerpt in \`${displayPath}\`${lineInfo}:`;
	return <TextChunk>
		{headerText}<br />
		{createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
	</TextChunk>;
});
```
```json
"github.copilot.tools.githubRepo.name": "Semantic Search GitHub Repository",
"github.copilot.tools.githubRepo.userDescription": "Semantic Search a GitHub repository for relevant source code snippets. You can specify a repository using `owner/repo`",
```
User-facing text: “Semantic Search a GitHub repository …” is grammatically incorrect. Consider changing it to “Semantically search a GitHub repository …” (the same applies to the tool `name` string).
Suggested change:

```json
"github.copilot.tools.githubRepo.name": "Semantically Search GitHub Repository",
"github.copilot.tools.githubRepo.userDescription": "Semantically search a GitHub repository for relevant source code snippets. You can specify a repository using `owner/repo`",
```
```ts
import './getNotebookCellOutputTool';
import './getSearchViewResultsTool';
import './githubRepoTool';
import './githubRepoSemanticSearchTool.tsx';
```
`import './githubRepoSemanticSearchTool.tsx'` will fail to compile under the extension's tsconfig because `allowImportingTsExtensions` is not enabled (TypeScript disallows importing `.ts`/`.tsx` extensions by default). Drop the explicit `.tsx` extension and import the module the same way as the other tool imports in this file.
Suggested change:

```ts
import './githubRepoSemanticSearchTool';
```
```ts
if (parts.length >= 3) {
	const nwo = `${parts[0]}/${parts[1]}`;
	const relativePath = parts.slice(2).join('/');
	file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', '/' + relativePath).with({
```
`URI.joinPath` uses `path.posix.join`, so passing a segment that starts with `/` resets the path. Using `'/' + relativePath` here will drop the `owner/repo/tree/main` segments and produce an incorrect GitHub URL. Pass `relativePath` without a leading slash (or split it into path segments) when calling `URI.joinPath`.
Suggested change:

```ts
file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', relativePath).with({
```
```ts
const relativePath = parts.slice(2).join('/');
file = URI.joinPath(URI.parse('https://github.com'), nwo, 'tree', 'main', '/' + relativePath).with({
	fragment: entry.range.startLineNumber > 0
		? `L${entry.range.startLineNumber}-L${entry.range.endLineNumber}`
		: undefined,
});
```
This tool hard-codes the branch name to `main` when building GitHub URLs. That will generate broken links for repos whose default branch is not `main` (commonly `master`, release branches, or protected default branches). Consider using the `html_url` returned by GitHub's search API (or fetching the repo default branch once) instead of hardcoding `main`.
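One way to apply this advice is to thread the ref in as a parameter so the URL builder never assumes a branch. A small sketch (function and parameter names are assumptions, not from the PR; the ref would come from e.g. the `default_branch` field of `GET /repos/{owner}/{repo}` or be parsed out of the search result's `html_url`):

```typescript
// Hypothetical link builder: the caller supplies the ref instead of the
// builder hardcoding 'main'. Also only attaches a #L…-L… fragment when a
// real (1-based) line range is known, matching the fragment logic above.
function buildFileLink(nwo: string, ref: string, relativePath: string, startLine?: number, endLine?: number): string {
	const base = `https://github.com/${nwo}/tree/${ref}/${relativePath}`;
	if (startLine !== undefined && startLine > 0) {
		return `${base}#L${startLine}-L${endLine ?? startLine}`;
	}
	return base;
}
```

Usage: `buildFileLink('octo/repo', 'release-1.2', 'src/a.ts', 10, 20)` links to the correct branch and line range, while omitting the line arguments yields a plain file link.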
```tsx
const lineInfo = ` starting at line ${chunk.chunk.range.startLineNumber}`;

const headerText = nwoLabel
	? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
	: `Text match excerpt in \`${displayPath}\`${lineInfo}:`;

return <TextChunk>
	{headerText}<br />
	{createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /><br />
```
`lineInfo` is always rendered as `starting at line ${chunk.chunk.range.startLineNumber}`. For lexical search results the range is currently `new Range(0, 0, 0, 0)`, which will display “starting at line 0” (and also prevents fragments from being added). Consider omitting the line info (and fragment) when the range is unknown/0 so the output is not misleading.
Suggested change:

```tsx
const hasKnownRange = chunk.chunk.range.startLineNumber > 0;
const lineInfo = hasKnownRange ? ` starting at line ${chunk.chunk.range.startLineNumber}` : '';
const headerText = nwoLabel
	? hasKnownRange
		? `Text match excerpt from \`${nwoLabel}\` in \`${displayPath}\`${lineInfo}:`
		: `Text match from \`${nwoLabel}\` in \`${displayPath}\`:`
	: hasKnownRange
		? `Text match excerpt in \`${displayPath}\`${lineInfo}:`
		: `Text match in \`${displayPath}\`:`;
return <TextChunk>
	{headerText}<br />
	{hasKnownRange ? <>{createFencedCodeBlock(getLanguageId(chunk.chunk.file), chunk.chunk.text)}<br /></> : undefined}
	<br />
```
```ts
const scopeQualifier = scope.kind === 'org' ? `org:${scope.org}` : `repo:${toGithubNwo(scope.githubRepoId)}`;
const searchQuery = `${query} ${scopeQualifier}`;
const routeSlug = `search/code?q=${encodeURIComponent(searchQuery)}&per_page=${maxResults}`;
```
GitHub’s REST code search API only supports `per_page` up to 100. `maxResults` is passed through directly, so callers can accidentally generate invalid requests by providing values > 100. Consider clamping `maxResults` to 100 (and/or paging) before building the `routeSlug`.
Suggested change:

```ts
const perPage = Math.min(maxResults, 100);
const routeSlug = `search/code?q=${encodeURIComponent(searchQuery)}&per_page=${perPage}`;
```
```ts
async searchRepo(authOptions: { silent: boolean }, embeddingType: EmbeddingType, repo: GithubCodeSearchRepoInfo, query: string, maxResults: number, options: WorkspaceChunkSearchOptions, _telemetryInfo: TelemetryCorrelationId, token: CancellationToken): Promise<CodeSearchResult> {
async lexicalSearch(_authOptions: { silent: boolean }, _scope: GithubCodeSearchScope, _query: string, _maxResults: number, _options: WorkspaceChunkSearchOptions, _telemetryInfo: TelemetryCorrelationId, _token: CancellationToken): Promise<LexicalCodeSearchResult> {
	throw new Error('Method not implemented.');
```
The simulation implementation of `IGithubCodeSearchService.lexicalSearch` throws `Method not implemented`. With lexical search now used by a first-party tool, this makes the simulation workspace brittle (any future test/tool usage will crash at runtime). Consider returning an empty `LexicalCodeSearchResult` (or a simple simulated response) instead of throwing.
Suggested change:

```ts
this._logService.trace('SimulationGithubCodeSearchService::lexicalSearch returning empty simulated result');
return {} as LexicalCodeSearchResult;
```
Fixes #312210

This is using the old search endpoint, which is limited. We'll start switching over to the new one once it's ready. We should be able to support this for external ingest workspaces too.