
Parallelize text search #16286

Merged: 28 commits, Dec 1, 2016

Conversation

roblourens (Member)

Fixes #15384 (comment); the comment is updated with final timings.

This PR rearranges the existing text search code to run using multiple processes. This doesn't include work to show results sooner - users still need to wait for a batch of 512 file matches to be filled. That work is tracked in #16284. Here's a brain dump for posterity -

RawSearchService creates an instance of textSearch.Engine and passes it a FileWalker which is producing paths in the workspace. This is the same.

textSearch.Engine creates some number of Search Worker processes. The number is determined by require('os').cpus().length, which returns 8 on a MacBook Pro because it has 4 physical cores with hyperthreading.

The fileWalker produces file paths along with the sizes in bytes of those files. They are grouped into batches of roughly equal cumulative size. The number of files in each batch may vary a lot, but each batch should represent files that add up to about 1MB. Hopefully this makes the batches likely to be processed in a similar amount of time, with a similar number of results, so that no process runs much longer than the others. This might sound like a large batch - smaller workspaces won't benefit much from parallelization, but they don't need it either. I did some testing on a few workspaces of different sizes with different batch sizes, and 1MB seemed like the sweet spot. I also experimented with a batch size that starts small and grows exponentially, thinking it might help on medium-sized workspaces, but it didn't seem to make a difference.
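The byte-based batching described above can be sketched roughly like this. This is a simplified illustration, not the actual vscode code; `makeBatches` and `FileItem` are made-up names, and the real implementation streams walker results rather than taking an array:

```typescript
interface FileItem {
  path: string;
  size: number; // bytes, as reported by the file walker
}

const BATCH_FLUSH_BYTES = 2 ** 20; // ~1MB per batch

// Group walker results into batches of roughly equal cumulative size.
// A batch may hold many small files or a single large one.
function makeBatches(files: FileItem[]): string[][] {
  const batches: string[][] = [];
  let batch: string[] = [];
  let batchBytes = 0;
  for (const file of files) {
    batch.push(file.path);
    batchBytes += file.size || 1; // count empty files as at least 1 byte
    if (batchBytes >= BATCH_FLUSH_BYTES) {
      batches.push(batch); // flush once we cross the ~1MB threshold
      batch = [];
      batchBytes = 0;
    }
  }
  if (batch.length) {
    batches.push(batch); // flush the final partial batch
  }
  return batches;
}
```

The point of flushing on cumulative bytes rather than file count is that search time scales with how much content a worker has to scan, not with how many files it opens.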

Batches are sent to the workers as soon as they're ready, in a round-robin fashion. I experimented with sending one batch at a time to each process, with another when it finished, but that was slower because the worker isn't doing anything while it waits for a new batch. I also tried sending a couple of batches up front, and more as needed, but it was no better than sending them ASAP. Looking at timings, I think this is because the workers are slow to handle the first couple of batches they receive - the fileWalker is often done by the time the workers send back their first batch of results. I need to investigate more to figure out why this is - maybe because the new processes are still loading code, JITting, etc. They process the rest of the batches much faster after those first few. It may be faster to keep the worker processes around between searches, but I don't know that we want to persist 8 (or more) processes per vscode window.
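The round-robin dispatch can be sketched as below. This is a toy version, not the actual vscode code; the `Worker` interface and `RoundRobinDispatcher` name are illustrative, and in the real code the worker count comes from `require('os').cpus().length`:

```typescript
// A worker that accepts a batch of absolute file paths to search.
interface Worker {
  search(batch: string[]): void;
}

// Hand each batch to the next worker in the list as soon as it is ready,
// cycling through the workers so none sits idle waiting for a batch.
class RoundRobinDispatcher {
  private next = 0;

  constructor(private workers: Worker[]) {}

  dispatch(batch: string[]): void {
    this.workers[this.next].search(batch);
    this.next = (this.next + 1) % this.workers.length;
  }
}
```

Pushing batches eagerly like this trades queueing fairness for keeping every worker busy, which matches the timing observation above that demand-driven dispatch left workers idle.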

The worker searches the files using the same code that we already had; this hasn't changed much. It does include the change to eliminate Buffer.slice calls that is already checked in. The only new change is that I limited the number of files the worker will open at a time, because without it I kept running into the OS open file limit.
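Capping the number of concurrently open files is a standard concurrency-limit pattern. Here is a minimal sketch, assuming a promise-returning per-file search function; `mapWithLimit` is a made-up helper, not the vscode implementation:

```typescript
// Run `fn` over `items` with at most `limit` invocations in flight at once.
// Used here to keep the number of simultaneously open files under the
// OS file-descriptor limit.
async function mapWithLimit<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let nextIndex = 0;

  // Each runner repeatedly claims the next unprocessed item.
  async function runner(): Promise<void> {
    while (nextIndex < items.length) {
      const i = nextIndex++; // claimed synchronously, so no two runners share an index
      results[i] = await fn(items[i]);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Because JavaScript is single-threaded, the `nextIndex++` claim is race-free; the concurrency being limited is the number of pending file operations, not threads.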

The batch includes the current maxResults, i.e. the initial max minus the number of results received so far. The main process might receive more than the max number of results from the workers, so it has to trim the result set. This is kind of annoying to get exactly right, because the result set is not a flat list but a tree of files with matches, lines in those files with matches, and the matches on each line. So it just takes batches until the total is greater than the max. It can go over the max, but this is much easier.
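The "take whole units until we pass the max" trimming can be illustrated like this. A simplified sketch, not the actual vscode code: it trims at file-match granularity, whereas the description above works at batch granularity, but the shape of the logic is the same:

```typescript
interface FileMatch {
  path: string;
  numMatches: number; // total line matches within this file
}

// Accept whole file matches until the running total reaches maxResults.
// The last accepted item may push the total past the max - deliberately,
// because splitting a file's match tree exactly at the limit is fiddly.
function takeUntilMax(matches: FileMatch[], maxResults: number): FileMatch[] {
  const taken: FileMatch[] = [];
  let total = 0;
  for (const match of matches) {
    if (total >= maxResults) {
      break; // limit reached; drop the rest
    }
    taken.push(match);
    total += match.numMatches;
  }
  return taken;
}
```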

@roblourens roblourens added the search Search widget and operation issues label Nov 30, 2016
@roblourens roblourens added this to the November 2016 milestone Nov 30, 2016
@@ -114,7 +118,7 @@ suite('TextSearch performance', () => {
  const finishedEvents = [];
  return runSearch() // Warm-up first
    .then(() => {
-     if (testWorkspaceArg) { // Don't measure by default
+     if (testWorkspacePath) { // Don't measure by default
Contributor:

This is always set. I'd revert this change so unit tests run without additional output.

Contributor:

Btw, are there any unit tests that would make sense to add, or are the existing ones still sufficient?

Contributor:

We have some unit tests in search.test.ts.


// Emit progress() unless we got canceled or hit the limit
if (processed && !this.isDone && !this.isCanceled && !this.limitReached) {
progress();
}

// Emit done()
- if (this.worked === this.total && this.walkerIsDone && !this.isDone) {
+ if (!this.isDone && this.processedBytes === this.totalBytes && this.walkerIsDone) {
Contributor:

I wonder if this.processedBytes === this.totalBytes is a good condition to rely on. If a file changes its size between the file traversal and the text search it would never hold, would it?

Member Author:

No, we only check the file size once. This is based on the size that the fileWalker provides. I could change it to >= though.

Contributor:

Got it. Should be fine as-is then.

let nextBatchBytes = 0;
const batchFlushBytes = 2 ** 20; // 1MB
this.walker.walk(this.config.rootFolders, this.config.extraFiles, result => {
let bytes = result.size || 1;
Contributor:

Any idea why we || 1? Would an empty file make the condition this.processedBytes === this.totalBytes further up never hold?

Member Author:

Maybe so we continue to report progress while searching empty files. It won't break that check though.
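The reason the `|| 1` doesn't break the completion check can be shown with a toy example (not the actual vscode code): as long as the same `size || 1` accounting is applied when accumulating both the total and the processed byte counts, empty files contribute 1 byte to each side, so the equality still holds.

```typescript
// File sizes as reported by the walker; two of the files are empty.
const sizes = [0, 1024, 0, 2048];

let totalBytes = 0;
let processedBytes = 0;

// Accumulate the expected total, counting empty files as 1 byte.
for (const size of sizes) {
  totalBytes += size || 1;
}

// The processing side uses the identical accounting, so the two
// counters converge even with empty files in the workspace.
for (const size of sizes) {
  processedBytes += size || 1;
}

const done = processedBytes === totalBytes;
```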

env: {
AMD_ENTRYPOINT: 'vs/workbench/services/search/node/worker/searchWorkerApp',
PIPE_LOGGING: 'true',
VERBOSE_LOGGING: 'true'
Contributor:

Others seem to be using environmentService.verbose or !environmentService.isBuilt || environmentService.verbose. Not sure if it matters in this case, but we should avoid being noisy.


public search(onResult: (match: ISerializedFileMatch) => void, onProgress: (progress: IProgress) => void, done: (error: Error, complete: ISerializedSearchComplete) => void): void {
let resultCounter = 0;
this.workers.forEach(w => w.cancel());
Contributor:

Add error handler to returned promises?

this.isDone = true;
this.disposeWorkers();
Contributor:

Why is this different from cancel() above? Should the walker also be canceled/disposed?

Contributor:

To answer my question: The walker already stopped, either because it finished, errored or was canceled. If I understand correctly?

Member Author:

At that point, all the work has been completed, so the walker and workers are already done, and we just need to kill the processes.

if (this.limitReached || this.isCanceled) {
    return; // return early if canceled or limit reached
}

const maxResults = this.config.maxResults - this.numResults;
worker.search({ absolutePaths: batch, maxResults }).then(result => {
Contributor:

Add error handler to arguments of .then()?

Member Author:

Previously we didn't surface errors from the text search, just from the file walker. I wasn't sure whether to change that. I guess I should do something with the error.

Contributor:

I agree, let's propagate and log it. We can still put a limit on it, if there are cases where too many are reported.
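The shape of the fix the review settles on - attach a rejection handler that logs and lets the search continue - might look something like this. A hedged sketch only: `searchBatch`, the `SearchWorker` interface, and the numeric result are illustrative stand-ins, not the vscode API:

```typescript
// Stand-in for the worker IPC proxy; the real one returns a richer result.
interface SearchWorker {
  search(args: { absolutePaths: string[]; maxResults: number }): Promise<number>;
}

// Search one batch, propagating errors to a logger instead of leaving
// the rejection unhandled. A failed batch contributes zero results but
// does not abort the overall search.
function searchBatch(
  worker: SearchWorker,
  batch: string[],
  maxResults: number,
  onError: (err: Error) => void
): Promise<number> {
  return worker.search({ absolutePaths: batch, maxResults }).then(
    (numResults) => numResults,
    (err) => {
      onError(err); // log/propagate rather than swallow
      return 0;
    }
  );
}
```

Passing the rejection handler as the second argument of `.then()` (rather than chaining `.catch()`) keeps errors thrown by the success handler itself from being caught here, which is usually the intent for IPC failures.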

const channel = ipc.getNextTickChannel(client.getChannel<ISearchWorkerChannel>('searchWorker'));
const channelClient = new SearchWorkerChannelClient(channel);
const config: ISearchWorkerConfig = { pattern: this.config.contentPattern, id, fileEncoding: this.config.fileEncoding };
channelClient.initialize(config);
Contributor:

Add error handler.


call(command: string, arg?: any): TPromise<any> {
switch(command) {
case 'initialize': return TPromise.wrap(this.worker.initialize(arg));
Contributor:

TPromise.wrap is already used in initialize().

@bpasero (Member) commented Dec 1, 2016

Unassigning myself, I have full trust in @chrmarti's review. If there are specific questions, ping me.

@roblourens (Member Author):

Thanks, merging

@roblourens roblourens merged commit c751c0c into microsoft:master Dec 1, 2016
@roblourens roblourens deleted the roblou/parallelSearch branch December 1, 2016 05:48
@sallar commented Dec 14, 2016

Hi @roblourens, just wanted to say that this code reads like poetry to me. Thank you.

@roblourens (Member Author):

Thanks @sallar :)

@github-actions github-actions bot locked and limited conversation to collaborators Mar 27, 2020
Linked issue: File Search - Improve performance of file search