# Golang Packages Database

Andromeda analyzes the complete graph of the known Go Universe.
Requirements:

- Golang 1.7 or newer
- Git v2.3 or newer (to avoid interactive prompts interrupting the crawler)
- go-bindata
- stringer: `go get -u -a golang.org/x/tools/cmd/stringer`
- OpenSSL (for automatic retrieval of the SSL/TLS server public key to feed gRPC by remote-crawler)
- xz (for downloading daily snapshots from godoc.org)
Note: command to obtain cert.pem via openssl:

```
openssl s_client -showcerts -servername andromeda.gigawatt.io -connect andromeda.gigawatt.io:443 < /dev/null 2>/dev/null | openssl x509 -outform PEM 2>/dev/null
```
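The tail of that pipeline just normalizes whatever `s_client` prints into a bare PEM certificate. As a local sanity check with no network involved, the same `openssl x509 -outform PEM` step can be exercised against a throwaway self-signed certificate (the temp-dir paths here are illustrative only):

```shell
# Generate a throwaway self-signed certificate, then run it through the
# same `openssl x509 -outform PEM` normalization step used above.
set -e
tmp="$(mktemp -d)"
openssl req -x509 -newkey rsa:2048 -nodes -subj '/CN=localhost' \
  -keyout "$tmp/key.pem" -out "$tmp/selfsigned.pem" -days 1 2>/dev/null
openssl x509 -in "$tmp/selfsigned.pem" -outform PEM > "$tmp/cert.pem"
head -1 "$tmp/cert.pem"
```

The first line of the output file should be the PEM header, which is what the gRPC client expects to find in cert.pem.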
```
go get jaytaylor.com/andromeda/...
```
Additional setup may be required depending on which DB backend you want to use.

BoltDB (the default backend): no additional packages or work necessary.
- Install RocksDB (see the RocksDB installation instructions).
- Install the RocksDB golang package (see the gorocksdb package installation instructions).
- Build andromeda with the RocksDB backend enabled:

  ```
  go get jaytaylor.com/andromeda/...
  cd "${GOPATH}/src/jaytaylor.com/andromeda"
  go build -o andromeda -tags rocks
  ```
- Install postgresql:

  ```
  apt-get install \
      postgresql \
      postgresql-client \
      postgresql-contrib \
      postgresql-10-prefix
  ```

- Enable the prefix module (see the "prefix" module enablement instructions).
```
go get -u github.com/golang/protobuf/...
go get -u github.com/gogo/protobuf/...
go get -u github.com/gogo/gateway/...
go get -u github.com/gogo/googleapis/...
go get -u github.com/grpc-ecosystem/go-grpc-middleware/...
go get -u github.com/grpc-ecosystem/grpc-gateway/...
go get -u github.com/mwitkow/go-proto-validators/...
```
protoc-gen-gorm compatibility is tightly coupled to certain versions of various packages, so it's necessary to use dep to fetch all vendored dependencies:

```
go get github.com/infobloxopen/protoc-gen-gorm
cd "${GOPATH}/src/github.com/infobloxopen/protoc-gen-gorm"
dep ensure
go get .
```
- Regenerating the domain package models:

  ```
  go generate ./...
  ```
Grab the latest seed list from godoc.org:

```
./download-godoc-packages.sh
```

Locate the downloaded file and extract it with `xz -k -d <filename>`.
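If it's unclear what `-k` buys you: it keeps the compressed archive around after decompression, so the snapshot can be re-extracted later. A tiny self-contained illustration (the file name is made up):

```shell
# xz removes its input by default; -k (--keep) preserves the .xz file.
set -e
tmp="$(mktemp -d)"
printf 'github.com/example/pkg\n' > "$tmp/packages.txt"
xz "$tmp/packages.txt"              # leaves only packages.txt.xz
xz -k -d "$tmp/packages.txt.xz"     # restores packages.txt, keeps packages.txt.xz
ls "$tmp"
```

After this runs, both `packages.txt` and `packages.txt.xz` exist side by side.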
Then clean up the input and seed it into andromeda:

```
./scripts/input-cleaner.sh archive.godoc.org/packages.20180706 \
    | andromeda bootstrap -g - -f text
```
Instructions generally live alongside the code within the header of the relevant program, so always check the top of the scripts and source code for installation instructions and per-script documentation. The exception to this rule is the andromeda binary, whose usage instructions are available by running `andromeda --help` or `andromeda <sub-command> --help`.
Ensure the target user account has the "Log on as a service" right.

Perform the following to edit the Local Security Policy of the computer where you want to grant the "Log on as a service" permission:

1. Log on to the computer with administrative privileges.
2. Open 'Administrative Tools' and open the 'Local Security Policy'.
3. Expand 'Local Policy' and click on 'User Rights Assignment'.
4. In the right pane, right-click 'Log on as a service' and select Properties.
5. Click the 'Add User or Group' button to add the new user.
6. In the 'Select Users or Groups' dialogue, find the user you wish to enter and click 'OK'.
7. Click 'OK' in the 'Log on as a service Properties' to save the changes.

Note: ensure that the user added above is not listed in the 'Deny log on as a service' policy in the Local Security Policy.

```
andromeda service crawler install -v --delete-after -s /tmp/src -a <host.name>:443 -c <path-to-letsencrypt-cert.pem> -u .\<windows-username> -p <windows-password>
```
A ramdisk partition mount can be used on Windows. The only configuration change required is to set `core.symlinks = false` in .gitconfig. See Stack Overflow question 52830545 ("git clone not works with some ramdisk and ntfs") for an explanation of why.
ncat variant:

```
Host github.com gitlab.com bitbucket.com bitbucket.org code.cloudfoundry.org launchpad.net git.code.sf.net
    ProxyCommand ncat --proxy proxy.example.com:80 %h %p
    Compression yes
```

nc variant:

```
Host github.com gitlab.com bitbucket.com bitbucket.org code.cloudfoundry.org launchpad.net git.code.sf.net
    ProxyCommand nc -X connect -x proxy.example.com:80 %h %p
    Compression yes
```
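To confirm which hosts actually pick up the proxy settings, `ssh -G` prints the resolved configuration for a given host without connecting. A quick check against a standalone config file (the proxy host is a placeholder, and `-F` points ssh at the test file instead of ~/.ssh/config):

```shell
# Write a minimal config and ask ssh which ProxyCommand it resolves to.
set -e
tmp="$(mktemp -d)"
cat > "$tmp/config" <<'EOF'
Host github.com gitlab.com bitbucket.org
    ProxyCommand nc -X connect -x proxy.example.com:80 %h %p
    Compression yes
EOF
# Matching host: the ProxyCommand line appears with %h/%p expanded.
ssh -G -F "$tmp/config" github.com | grep -i '^proxycommand'
```

A host not covered by the `Host` pattern (e.g. `example.net`) resolves with no proxy at all.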
```
cp -a andromeda.bolt a
andromeda -b a -v stats mru -n 1000 | jq -r '.[] | .path' | xargs -n10 andromeda remote enqueue -a 127.0.0.1:8001 -v -f
rm a
```
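The middle of that pipeline is generic JSON-to-batches plumbing: `jq` extracts one import path per line, and `xargs -n10` groups them ten at a time per invocation. A standalone sketch with `echo` standing in for `andromeda remote enqueue` (the sample paths are made up):

```shell
# Extract .path from each array element, then batch via xargs -n10.
printf '[{"path":"github.com/a/x"},{"path":"github.com/b/y"}]' \
  | jq -r '.[] | .path' \
  | xargs -n10 echo enqueue
```

With only two inputs they land in a single batch, so this prints one `enqueue` line carrying both paths.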
```
andromeda util rebuild-db \
    -v \
    --driver postgres \
    --db "dbname=andromeda host=/var/run/postgresql" \
    --rebuild-db-driver bolt \
    --rebuild-db-file new.bolt
```
```
andromeda util rebuild-db \
    -v \
    --driver bolt \
    --db no-history.bolt \
    --rebuild-db-driver postgres \
    --rebuild-db-file "dbname=andromeda host=/var/run/postgresql" \
    --rebuild-db-filters clearHistories
```
```
5 */6 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/download-godoc-packages.sh >/dev/null 2>&1
15 */6 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/enqueue-godoc.sh >/dev/null 2>&1
45 */12 * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/trends.now.sh >/dev/null 2>&1
*/5 * * * * /home/ppx/go/src/jaytaylor.com/andromeda/scripts/cron/github.sh >/dev/null 2>&1
```
Default application values can be overridden by a `~/.andromeda.toml` or `~/.config/andromeda.toml` configuration file (searched in that order; the first one found to exist will be used).

It's helpful to have these settings defined in a configuration file if you use the command-line client much; specifying long flags like `--driver` and `--db <connection string>` over and over gets tiresome!
For available configuration variables, see the example andromeda.toml config file. To get started, copy it to your home directory:

```
cp andromeda.toml ~/.andromeda.toml
```

A non-default configuration file path may be specified with the `-config` flag.
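As a rough sketch of what such a file might contain (the key names below simply mirror the CLI flags and are assumptions; consult the bundled andromeda.toml for the authoritative names):

```toml
# Hypothetical ~/.andromeda.toml; key names are illustrative, not authoritative.
driver = "bolt"
db = "andromeda.bolt"
```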
(C) All Rights Reserved, Jay Taylor, 2018-2019.
- Elasticsearch backend?
- Show nested packages listing in sub-packages template (subs don't imply a terminal!).
- Find a way to include `main` packages.
- Re-enable kubernetes (hardcoded as disabled in master.go).
Note: Instead of "repo" or whatever, think about calling a reporoot a "tree". This terminology is used here.
- Add attribute "CanGoGet" to indicate if a package is buildable via `go get`. Then provide a search filter to only include such packages.
- Add a `deprecated` attribute, set based on detection of phrases like "is deprecated" and "superceded by" in the README.
- Attempt topic extraction from READMEs.
- Add usage popularity comparison between 2 pkgs.
- Generic fork and hierarchy detection: find packages with the same pkg name, then use the `lastCommitedAt` field to derive possibly related packages. Take this list and inspect commit histories to determine if there is commit overlap between the two. Could implement github scraping hacks to verify accuracy.
- Handle relative imports (x2).
- Detect and persist whether each import is vendored or not in the reverse-imports mapping data.
- Add analysis of RepoRoot sub-package paths and import names.
- Add counts for total number of packages tracked, globally (currently repos are tracked and called "packages" everywhere, ugh..).
- 1/2 Distinguish between pkg and repo by refactoring what is currently called a "package" in andromeda into a "repo".
- 2/2 Add alias tracking table.
- 1/4 Add a `--id` flag for remote crawlers to uniquely identify them (x2).
- 2/4 Remote crawlers should track and store their own statistics in a local bolt db file, per crawler-id. For example, keep track of the number of crawls done per day, total size of crawled content, and number of successful and failed crawls.
- 3/4 Server-side: Track crawlers by ID, and track when they were last seen, IP addresses, number of packages crawled, number of successful crawls vs errors.
- 4/4 Provide live-query mechanism for server to ping all crawlers to get an accurate count of actives. Would also be interesting to have the crawlers include their version (git hash) and crawl stats in the response.
- Add queue monitor, when it is empty add N least recently updated packages to crawl.
- Add errors counter to ToCrawlEntry and throw away when error count exceeds N.
- Add process-level concurrency support for remote crawlers (to increase throughput without resorting to trying to manage multiple crawler processes per host).
- To avoid dropping items across restarts, implement some kind of a WAL and resume functionality (x2, see next item below).
- Protect against losing queue items from process restarts / interruptions; Add in-flight TCE's to an intermediate table, and at startup move items from said table back into the to-crawl queue.
- Remote-crawler: store the crawl result on disk when sending failed, then when the remote starts, check for failed transmits and send them. Possible complexity due to the server not expecting that crawl result. May need to expose via a different gRPC API endpoint than `Attach`.
- Fix `-s` strangeness; it should only specify the base path and auto-append "/src".
- Consider refactoring "Package" to "Repo", since a go repo contains an arbitrary number of packages (I know, "yuck", but..).
- Implement Postgres backend.
- Implement Postgres queue.
- Implement CockroachDB backend.
- Implement CockroachDB queue.
- Add git commit hash to builds, and have the gRPC client send it with requests.
- Implement pure-postgres native db.Client interface and see if or how much better we can do compared to K/V approach.
- Implement pending-references updates as a batch job (currently disabled due to low performance). Another way to solve it would be to only save pending references sometimes, by adding an extra parameter on the internal save method (went with this; it was very simple to add a single param to the save functions to avoid merging pending references for recursively-triggered saves).
- Implement a ~/.andromeda/config configuration file to avoid having to pass `--driver`/`--db` all the time.
- Review the github cron: verify it is well-behaved and doesn't submit duplicates every run.
- Add a git version check to the crawler (because it's easy to forget to upgrade git!). Note: this is part of the `check` command, which also verifies availability of the openssl binary.
- Make it work for repo roots without go files, e.g. github.com/deferpanic/virgo.
- Add a monitor and require that the disk where the DB is stored always has at least X GB free, where X is based on a multiple of the Bolt database file size. This is a safety measure to ensure things don't get into a state where data cannot be written to the DB, or worse, the DB gets corrupted. Remember that DB size may grow non-linearly (need to double-check this, but this is what I recall observing).
- Move failed to-crawls to different table instead of dropping them outright.
- 1/2 Expose queue contents over rest API.
- 2/2 Frontend viewer for queue head and tail contents.
- Migrate table names to be singular.
- Add "view on sourcegraph.com" link.
- Handle `"transport: authentication handshake failed: x509: certificate signed by unknown authority" source="crawler/remote.go:150"` errors by fetching the latest cert.pem.
To locate additional TODOs, run `find . -name '*.go' -exec grep TODO {} +`. Some of them are only noted in the relevant code region :)
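For example, against a throwaway tree (the file names are made up, and `-l` is used here so only the names of matching files are printed):

```shell
# Only files actually containing TODO are listed by grep -l.
set -e
tmp="$(mktemp -d)"
printf '// TODO: fix me\npackage a\n' > "$tmp/a.go"
printf 'package b\n' > "$tmp/b.go"
find "$tmp" -name '*.go' -exec grep -l TODO {} +
```

This prints only the path of `a.go`, since `b.go` contains no TODO marker.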