This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories. #136

rmmh · 2021-06-04T19:43:13Z

No description provided.

Change-Id: Ic36ab798aaa097414b1c5b3a67a19893664a314a

Change-Id: I99766314e1f551a7ae900dd5ffe578ac1d694c4d

Change-Id: I060ef3c4f55fe91b4d45979685475749027765a4

It is only used by the iterators. So we can instead use VisitMatchTree to update the stats. Change-Id: I1be5e3c57fc8ea5e3baa20de6f7747c161423365

Change-Id: I34a3601f5686bf3cb79e76fe1e06f86b4b7d0329

Change-Id: I41503e95a41512f8445eb8109069f37b7cea0314

Change-Id: I58af12bc32735c52cdd2ef080d1fada328caba48

Change-Id: I89f211e77af9b9893e3e016f5cf704ddab2115ec

Change-Id: Iba7868c1935477d88f9dd285ef010fe0ebe37617

This reverts commit dacbbfd.

This reverts commit 23a3bdb.

This reverts commit 4381e0a.

This reverts commit d8c7986.

This reverts commit bf3e7a5.

…ewMatchTree" This reverts commit bfcde41.

This reverts commit 3d4a710.

…xData" This reverts commit 2d60dcf.

This reverts commit 3011f8c.

This reverts commit b50b7f5.

Change-Id: Ibe3634e9a541e068cc0c5e90bfa37cd7e1a9e894

Change-Id: I70f6c53edc355ba2cc0b02ffda34567ecce8dbe7

This is to prevent log spam due to it failing at startup. Now it will only log after 4 attemps with exp backoff ~= 15s. Change-Id: Id93c56a24ecada87f3d30d66e0c460450a1e9068

* fetch options from server and pass them through to index command Change-Id: I49c69efe18a21ed6efe7936300f0ce91af7aae4b * non pointer Change-Id: Idcfc1d17d9c086f6fe6d31a320e520c4bfc5e4d2 * add test for get options func Change-Id: I4df5ffc9aac538989cd08443a258a8daf393ec52 * correct argument building and add it to test Change-Id: I26985085c091122da47529e02ed78942e8d1ea5f

Change-Id: Idea480bef9f74884647e2b2785aea0754522115a

Change-Id: I70e51197558e760432e820743284696a4c469132

Save the options that affect the index that is created in order to know if we should reindex based on option changes. Change-Id: I676503e2c32934edae7c594a8ca9ac1b0f72e226

Change-Id: If2b44ab7f822be5c97dd0460023cad3a2a573b0e

Change-Id: I3e15b31d6fcdc06279d95f33b03a77ebfeffa591

We did not call `tr.Finish` in `shardedSeacher.Search` which caused the associated spans to dangle in Jaeger UI.

We add a simply buffering logic to the stream server: events containing no file matches but only stats are aggregated and sent out with the first event containing file matches. This approach should reduce the number of events sent, but still give us the per-shard granularity that we want.

For `traceAwareSearcher`, both `Search` and `StreamingSearch` had the same operation name which was confusing to read in the traces.

We want to enforce this in our next release of Sourcegraph to ensure all indexes have the Sourcegraph repository ID set. In the last release we would only have the repoid set if we re-indexed. This would only occur if the repository was updated.

If we never dial a client and we try close it, it will lead to a nil panic. Change-Id: Iea0fbd5add7cda8c8ef922838f190c5dd0a6b40f

This changes how we handle context errors returned from the server: Instead of translating context errors, which were serialized as string, back to context errors, we treat them the same way as any other error. To not confuse `fmt.Errorf("context canceled")` with `context.Canceled` during debugging, we add the prefix "error received from zoekt".

This will avoid filling disks when a repo persistently fails to index midway for some reason-- OOMs, crashes, or otherwise.

This commit introduces a scheduler abstraction for searches to use. Every search now starts in the "interactive" queue. But after 5s will move to the "batch" queue. The idea here is to have a simple approach of allowing long running queries to run without impacting interactive queries. We use WeightedSemaphore as the "queue" (internally has a FIFO queue). Before this commit we just had one WeightedSemaphore with size #CPUs. In this commit that semaphore becomes the "interactive" queue. The "batch" queue is a new semaphore with size 1/4 #CPUs. This doesn't limit the amount of CPU a search can consume, but rather we use number of CPUs as a guide for the number of concurrent searches that can be running. This commit changes a few things, so view the PR for it so you can see it commit by commit.

Change-Id: I36b9c4bd6f13013806d3e51dafd6289f1527fe60

If master is not your default branch, but you are indexing it then we would not index the default branch. This happened because in the git repo with synthesize we run `update-ref HEAD foo` then `update-ref master bar`. Because `HEAD` pointed to `refs/heads/master` we would have a final value of `bar` for HEAD. This commit changes the default branch to a random string which shouldn't be indexed. Now `HEAD` will point to the random string, preventing this issue.

This adds the option `--debug-shard` to zoekt-sourcegraph-indexserver to print a csv of trigrams and the length of their postings list for a particular shard.

This will allow docker to self report its version. Change-Id: I4afd0ce6b07a5dd9985c65c546431db15a224959

We want to start collecting continuous profiles of zoekt serving user requests. This adds integration for google cloud profiler. We already use this for frontend and gitserver. This is opt-in, so we will add the GOOGLE_CLOUD_PROFILER_ENABLED environment variable to the sourcegraph.com deployment. Change-Id: Ia005213d27d47bb8ae7a520b8dc530c1c251168b

…t by them zoekt-sourcegraph-reposerver receives floating-point repo priorities as part of the sync loop running getIndexOptions, and persists them to priority.json in the index dir. zoekt-webserver initializes shardedSearcher which loads priority.json and watches for changes, and sorts by priority first and then by name.

In upstream zoekt we always call searchOneShard, even when the ctx is canceled. This ensures we have correct statistics calculated. Previously when we returned early, we would not count the remaining shard in the ShardsSkipped statistic. Co-authored-by: Stefan Hengl <stefan@sourcegraph.com> Change-Id: I7cb44065cfce5363722dba5d32b39fe67de78599

Our heap profiles are dominated by readIndexData. By moving some of the work into its own functions, the heap profiles will tell us where in particular data is allocated from. We suspect most data is from readNgrams, but want to confirm emperically. Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>

We know the number of entries we will add to the ngram maps. By specifying the go runtime can avoid over allocating as it grows. Locally on a small corpus this reduced the map byte size by 15%. Co-authored-by: Stefan Hengl <stefan@sourcegraph.com> Change-Id: I060e7aa12db726b093e7971ec6bc54ce193ec405

We are logging when we update or remove a shard to disk. This is a tab separated values (TSV) file. We will be using this data as input to a simulator for shard merging. Additionally this data can be used as input to interesting graphs. We use two different files to avoid concurrent processes writing to the same file. Co-authored-by: Stefan Hengl <stefan@sourcegraph.com> Change-Id: I9c907bc687612db616fb1f896930362608b198c3

Change-Id: I6e96aecb96ed4b8e0b3cfb5f9ce38bfe8794515e

Same change we did for zoekt-webserver in f669d63. Change-Id: I91715ac16226a221f3af95b85b84b8a942806041

There's no reason to assume that ctags can complete within 5s for large repositories.

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories.

rmmh · 2021-06-04T19:43:37Z

Sorry, I opened this against the wrong repository.

google-cla · 2021-06-04T19:48:05Z

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

keegancsmith and others added 30 commits November 13, 2018 15:59

matchtree: Remove unused param stats from newMatchTree

3d4a710

Change-Id: Ic36ab798aaa097414b1c5b3a67a19893664a314a

matchtree: Factor out indexData specific functionality from NewMatchTree

bfcde41

Change-Id: I99766314e1f551a7ae900dd5ffe578ac1d694c4d

matchtree: Remove unused visit*Matches

bf3e7a5

Change-Id: I060ef3c4f55fe91b4d45979685475749027765a4

matchtree: Remove UpdateStats from MatchTree interface

d8c7986

It is only used by the iterators. So we can instead use VisitMatchTree to update the stats. Change-Id: I1be5e3c57fc8ea5e3baa20de6f7747c161423365

matchtree: Move general matchtree code to matchtree pkg

4381e0a

Change-Id: I34a3601f5686bf3cb79e76fe1e06f86b4b7d0329

matchtree: Rename exported MatchTrees to avoid stutter

23a3bdb

Change-Id: I41503e95a41512f8445eb8109069f37b7cea0314

matchtree: Rename unexported MatchTrees to avoid stutter

dacbbfd

Change-Id: I58af12bc32735c52cdd2ef080d1fada328caba48

Merge remote-tracking branch 'gerrit/master'

bec45ad

Change-Id: I89f211e77af9b9893e3e016f5cf704ddab2115ec

update go.mod

7e8e89c

Change-Id: Iba7868c1935477d88f9dd285ef010fe0ebe37617

Revert "matchtree: Rename unexported MatchTrees to avoid stutter"

1c631f1

This reverts commit dacbbfd.

Revert "matchtree: Rename exported MatchTrees to avoid stutter"

a66d938

This reverts commit 23a3bdb.

Revert "matchtree: Move general matchtree code to matchtree pkg"

ca644ee

This reverts commit 4381e0a.

Revert "matchtree: Remove UpdateStats from MatchTree interface"

d9df75c

This reverts commit d8c7986.

Revert "matchtree: Remove unused visit*Matches"

d8178d6

This reverts commit bf3e7a5.

Revert "matchtree: Factor out indexData specific functionality from N…

0dfb7c2

…ewMatchTree" This reverts commit bfcde41.

Revert "matchtree: Remove unused param stats from newMatchTree"

3987685

This reverts commit 3d4a710.

Revert "matchtree: Export And and NoVisit since they are used by inde…

77c12ee

…xData" This reverts commit 2d60dcf.

Revert "matchtree: Export a minimal ContentProvider interface"

33dcaec

This reverts commit 3011f8c.

Revert "matchtree: Export interfaces and functions"

d54ddf6

This reverts commit b50b7f5.

Merge remote-tracking branch 'gerrit/master'

4c17330

Change-Id: Ibe3634e9a541e068cc0c5e90bfa37cd7e1a9e894

go mod vendor

a1d7437

Change-Id: I70f6c53edc355ba2cc0b02ffda34567ecce8dbe7

zoekt-sourcegraph-indexserver: Retry List request

8aa57bd

This is to prevent log spam due to it failing at startup. Now it will only log after 4 attemps with exp backoff ~= 15s. Change-Id: Id93c56a24ecada87f3d30d66e0c460450a1e9068

search: add comment saying what zoekt-webserver does

408f932

search: add comment saying what zoekt-webserver does

e07b49d

cmd: centralize flag handling for indexers

54c18d9

Change-Id: Idea480bef9f74884647e2b2785aea0754522115a

add build option for large file whitelist

499906b

Change-Id: I70e51197558e760432e820743284696a4c469132

build: save options used for building index

b45557a

Save the options that affect the index that is created in order to know if we should reindex based on option changes. Change-Id: I676503e2c32934edae7c594a8ca9ac1b0f72e226

cmd: use large files flag in build options

c742ce8

Change-Id: If2b44ab7f822be5c97dd0460023cad3a2a573b0e

Merge remote-tracking branch 'gerrit/master'

8479bbc

Change-Id: I3e15b31d6fcdc06279d95f33b03a77ebfeffa591

stefanhengl and others added 26 commits February 25, 2021 10:35

fix: finish trace

da52e7c

We did not call `tr.Finish` in `shardedSeacher.Search` which caused the associated spans to dangle in Jaeger UI.

web: fix operation name for traceAwareSearcher (#73)

1e934d0

For `traceAwareSearcher`, both `Search` and `StreamingSearch` had the same operation name which was confusing to read in the traces.

rpc: close underlying client only if dialed

dd9bc3f

If we never dial a client and we try close it, it will lead to a nil panic. Change-Id: Iea0fbd5add7cda8c8ef922838f190c5dd0a6b40f

Merge https://gerrit.googlesource.com/zoekt

9cd98bf

sourcegraph-indexserver: clean up old .tmp files

a205a2a

This will avoid filling disks when a repo persistently fails to index midway for some reason-- OOMs, crashes, or otherwise.

shards: default new scheduler to on

e23cac1

Change-Id: I36b9c4bd6f13013806d3e51dafd6289f1527fe60

Update ctags (#81)

978e89d

indexserver: add option to print csv of trigrams (#83)

cc7c454

This adds the option `--debug-shard` to zoekt-sourcegraph-indexserver to print a csv of trigrams and the length of their postings list for a particular shard.

ci: set version in binaries to image tag

ac237b3

This will allow docker to self report its version. Change-Id: I4afd0ce6b07a5dd9985c65c546431db15a224959

address review comments

d720674

Add prometheus metrics for priority.json.

965d233

gomod: use gopkg.in import for lumberjack

8b83027

Change-Id: I6e96aecb96ed4b8e0b3cfb5f9ce38bfe8794515e

indexserver: add google cloud profiler integration

a3b04c1

Same change we did for zoekt-webserver in f669d63. Change-Id: I91715ac16226a221f3af95b85b84b8a942806041

bump ctags timeout from 5s to 60s

800377b

There's no reason to assume that ctags can complete within 5s for large repositories.

zoekt-git-index: ignore UTF-8 BOM in .gitmodules

d96f332

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories.

rmmh closed this Jun 4, 2021

rmmh deleted the bom branch June 7, 2021 13:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories. #136

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories. #136

rmmh commented Jun 4, 2021

rmmh commented Jun 4, 2021

google-cla bot commented Jun 4, 2021

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories. #136

This is causing most of our persistent indexing errors, and fixing it here is a better solution than explicitly blocking the strange repositories. #136

Conversation

rmmh commented Jun 4, 2021

rmmh commented Jun 4, 2021

google-cla bot commented Jun 4, 2021