Frequently asked questions
Software engineering is more about reading than writing code, and part of this process is finding the code that you should read. If you are working on a large project, then finding source code through navigation quickly becomes inefficient.
Search engines let you find interesting code much faster than browsing code, in much the same way that search engines speed up finding things on the internet.
Can you give an example?
I had to implement SSH hashed hostkey checking on a whim recently, and here is how I quickly zoomed into the relevant code using our public zoekt instance:
hash host ssh: more than 20k results in 750 files, in 3 seconds
hash host r:openssh: 6k results in 114 files, in 20ms
hash host r:openssh known_host: 4k results in 42 files, in 13ms
The last query still yielded a substantial number of results, but the hash_host that I was looking for was the 3rd result from the first file.
What features make a code search engine great?
Often, you don't know exactly what you are looking for until you have found it. Code search is effective because you can formulate an approximate query, and then refine it based on the results you get. For this to work, you need the following features:
Coverage: the code that interests you should be available for searching
Speed: search should return useful results quickly (sub-second), so you can iterate on queries
Approximate queries: matching should be done case insensitively, on arbitrary substrings, so we don't have to know what we are looking for in advance.
Filtering: we can winnow down results by composing more specific queries
Ranking: interesting results (eg. function definitions, whole word matches) should be at the top.
How does zoekt provide for these?
Coverage: zoekt comes with tools to mirror parts of common Git hosting sites. cs.bazel.build uses this to index most of the Google authored open source software on github.com and googlesource.com.
Speed: zoekt uses an index based on positional trigrams. For rare strings, eg. nienhuys, this typically yields results in ~10ms if the operating system caches are warm.
Approximate queries: zoekt supports substring patterns and regular expressions, and can do case-insensitive matching on UTF-8 text.
Filtering: you can filter queries by adding extra atoms (eg. f:\.go$ limits to Go source code), and filter out terms with a - prefix, so \blinus\b -torvalds finds the Linuses other than Linus Torvalds.
Ranking: zoekt uses ctags to find declarations, and these are boosted in the search ranking.
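To make the filtering semantics concrete, here is a minimal Python sketch of how the two example queries behave on a handful of in-memory files. It is not how zoekt evaluates queries (zoekt uses its trigram index and its own query parser); the Doc type, the search helper and the sample files are hypothetical, and the exclusion is treated as a per-file filter, which is an assumption of this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class Doc:
    name: str     # file name, the target of an f: atom
    content: str  # file content, the target of plain atoms


def search(docs, pattern, file_pattern=None, exclude=None):
    """Return names of docs whose content matches `pattern` case-insensitively.

    `file_pattern` mimics an f: atom (regex on the file name).
    `exclude` mimics a negated atom: docs containing it are dropped.
    """
    want = re.compile(pattern, re.IGNORECASE)
    want_file = re.compile(file_pattern) if file_pattern else None
    unwanted = re.compile(re.escape(exclude), re.IGNORECASE) if exclude else None
    hits = []
    for doc in docs:
        if want_file and not want_file.search(doc.name):
            continue
        if not want.search(doc.content):
            continue
        if unwanted and unwanted.search(doc.content):
            continue
        hits.append(doc.name)
    return hits


# Hypothetical sample corpus.
docs = [
    Doc("kernel/sched.c", "Written by Linus Torvalds"),
    Doc("peanuts/main.go", "// linus carries a blanket"),
    Doc("chem/pauling.py", "# Linus Pauling, chemist"),
    Doc("misc/notes.go", "Linuses is not a whole-word match"),
]

# Analogue of the query: \blinus\b -torvalds
linuses = search(docs, r"\blinus\b", exclude="torvalds")
# Analogue of the query: \blinus\b f:\.go$
go_only = search(docs, r"\blinus\b", file_pattern=r"\.go$")
```

Each extra atom can only shrink the result set, which is what makes iterative refinement of a query practical.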
How does this compare to grep?
Grep lets you find arbitrary substrings, but it doesn't scale to large corpuses, and lacks filtering and ranking.
What about my IDE?
If your project fits into your IDE, then that is great. Unfortunately, loading projects into IDEs is slow, cumbersome, and not supported by all projects.
What about the search on GitHub?
GitHub's search has great coverage, but unfortunately, its search functionality doesn't support arbitrary substrings. For example, a query for part of my surname does not turn up anything (except this document), while my complete name does.
What about Etsy/Hound?
Etsy/hound is a code search engine which supports regular expressions over large corpuses, but it is about 10x slower than zoekt. There is only rudimentary support for filtering, and there is no symbol ranking.
What about livegrep?
livegrep is a code search engine which supports regular expressions over large corpuses. However, due to its indexing technique, it requires a lot of RAM and CPU. There is only rudimentary support for filtering, and there is no symbol ranking.
How many resources does zoekt require?
The search server should have local SSD to store the index file (which is 3.5x the corpus size), and have at least 20% more RAM than the corpus size.
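As a rough worked example of these rules of thumb, the following Python sketch estimates disk and RAM needs from a corpus size. The estimate_resources helper and the 10 GiB corpus are hypothetical; only the 3.5x and 20% factors come from the text above.

```python
GiB = 1024 ** 3


def estimate_resources(corpus_bytes):
    """Apply the rules of thumb: index ~3.5x corpus, RAM at least 1.2x corpus."""
    return {
        "index_bytes": corpus_bytes * 3.5,    # SSD space for the index file
        "min_ram_bytes": corpus_bytes * 1.2,  # corpus size plus 20%
    }


# Hypothetical 10 GiB corpus.
est = estimate_resources(10 * GiB)
summary = (
    f"index: {est['index_bytes'] / GiB:.1f} GiB, "
    f"RAM: {est['min_ram_bytes'] / GiB:.1f} GiB"
)
print(summary)
```

For a 10 GiB corpus this comes out at about 35 GiB of SSD for the index and at least 12 GiB of RAM.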
Can I index multiple branches?
Yes. You can index 64 branches (see also https://github.com/google/zoekt/issues/32). Files that are identical across branches take up space just once in the index.
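One way to picture how identical files can be stored once across branches is a table keyed by path and content, with a bitmask recording which branches contain that version. The Python sketch below is purely illustrative: the BranchIndex class is hypothetical, and using one bit per branch is an assumption of the sketch, not a description of zoekt's index format.

```python
import hashlib

MAX_BRANCHES = 64  # the limit stated above


class BranchIndex:
    """Toy store: one entry per distinct (path, content), plus a branch mask."""

    def __init__(self, branches):
        if len(branches) > MAX_BRANCHES:
            raise ValueError("at most 64 branches can be indexed")
        # Assign each branch one bit of the mask.
        self.bit = {name: 1 << i for i, name in enumerate(branches)}
        self.docs = {}  # (path, content digest) -> [content, branch mask]

    def _key(self, path, content):
        return (path, hashlib.sha256(content).hexdigest())

    def add(self, branch, path, content):
        entry = self.docs.setdefault(self._key(path, content), [content, 0])
        entry[1] |= self.bit[branch]  # mark the branch; content is not copied

    def branches_of(self, path, content):
        mask = self.docs[self._key(path, content)][1]
        return [name for name, bit in self.bit.items() if mask & bit]


idx = BranchIndex(["master", "release"])
idx.add("master", "README", b"hello\n")
idx.add("release", "README", b"hello\n")  # identical file: no new entry
idx.add("master", "main.go", b"package main\n")
```

Here the README is held once even though it appears in both branches, and a search hit can still report every branch it belongs to.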
How fast is the search?
Rare strings are extremely fast to retrieve; for example,
r:torvalds crazy (search "crazy" in the linux kernel) typically takes about
The speed for common strings is dominated by how many results you want to see. For example [r:torvalds license] can give some results quickly, but producing all 86k results takes between 100ms and 1 second. Then, streaming the results to your browser, and rendering the HTML takes several seconds.
How fast is the indexer?
The Linux kernel (55K files, 545M data) takes about 160s to index on my x250 laptop using a single thread. The process can be parallelized for speedup.
What does cs.bazel.build run on?
Currently, it runs on a single Google Cloud VM with 16 vCPUs, 60G RAM and an attached physical SSD.
How does zoekt work?
In short, it splits up the file into trigrams (groups of 3 unicode characters), and stores the offset of each occurrence. Substrings are found by searching for different trigrams from the query at the correct distance apart.
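The idea can be sketched in a few lines of Python. This is a simplified, case-sensitive illustration over a single string, not zoekt's implementation: build_index and find are hypothetical names, and always using the first and last trigram of the query is an assumption made to keep the sketch short.

```python
from collections import defaultdict


def build_index(text):
    """Map each trigram to the list of offsets where it occurs."""
    index = defaultdict(list)
    for i in range(len(text) - 2):
        index[text[i:i + 3]].append(i)
    return index


def find(index, text, query):
    """Return offsets of `query` using two trigrams at a known distance apart."""
    if len(query) < 3:
        raise ValueError("query must be at least 3 characters")
    first, last = query[:3], query[-3:]
    distance = len(query) - 3  # how far apart the two trigrams must start
    last_offsets = set(index.get(last, ()))
    candidates = [o for o in index.get(first, ()) if o + distance in last_offsets]
    # The trigrams only prove that both ends match; verify the full substring.
    return [o for o in candidates if text[o:o + len(query)] == query]


text = "hash_host checks known_hosts for a hashed host"
index = build_index(text)
offsets = find(index, text, "host")
```

Only positions where both trigrams occur at the right distance need to be checked against the text, which is why rare strings are cheap to look up.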
I want to know more
Some further background documentation