indx

indx is an experimental search engine written out of curiosity. The goal is mostly for me to have fun working with massive amounts of data while still building something somewhat useful.

Structure

indx is split into several parts that build on each other:

cmd: User-facing services.
crawler: HTTP crawler.
db: The SQLite3 side of things; deals with document metadata.
index: Magic happens here. Implements the forward/inverted index, lexicon, and BM25 search algorithm. Nearly everything builds off this in some way.
ingest: Adds documents into the index.
pagerank: The PageRank algorithm.
porter: The Porter stemming algorithm.
_example: Example application for interfacing with everything else here. Contains broken and messy code and is used exclusively for testing.

See each part's directory for more information on what it does.
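
To make the index description above a little more concrete, here is a minimal sketch of BM25 ranking over an in-memory inverted index. It is not indx's actual code: the Index, Posting, and Result types, the whitespace tokenizer, the IDF variant, and the k1 = 1.2, b = 0.75 parameters are all assumptions picked for illustration, and it skips stemming and the on-disk format entirely.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Posting records how often a term occurs in one document.
type Posting struct {
	DocID int
	TF    int
}

// Index is a toy in-memory inverted index (hypothetical; not indx's format).
type Index struct {
	postings map[string][]Posting // term -> documents containing it
	docLen   []int                // token count per document
	totalLen int                  // sum of all document lengths
}

// Result is one scored document.
type Result struct {
	DocID int
	Score float64
}

func NewIndex() *Index { return &Index{postings: make(map[string][]Posting)} }

// Add lowercases text, splits it on whitespace, and records postings.
// It returns the ID assigned to the new document.
func (ix *Index) Add(text string) int {
	id := len(ix.docLen)
	terms := strings.Fields(strings.ToLower(text))
	counts := make(map[string]int)
	for _, t := range terms {
		counts[t]++
	}
	for t, c := range counts {
		ix.postings[t] = append(ix.postings[t], Posting{DocID: id, TF: c})
	}
	ix.docLen = append(ix.docLen, len(terms))
	ix.totalLen += len(terms)
	return id
}

// Search scores every document containing at least one query term with BM25
// and returns the results sorted by descending score.
func (ix *Index) Search(query string) []Result {
	const k1, b = 1.2, 0.75 // assumed, commonly used defaults
	n := float64(len(ix.docLen))
	if n == 0 {
		return nil
	}
	avgLen := float64(ix.totalLen) / n
	scores := make(map[int]float64)
	for _, t := range strings.Fields(strings.ToLower(query)) {
		plist := ix.postings[t]
		df := float64(len(plist))
		idf := math.Log(1 + (n-df+0.5)/(df+0.5)) // one common IDF variant
		for _, p := range plist {
			tf := float64(p.TF)
			norm := 1 - b + b*float64(ix.docLen[p.DocID])/avgLen
			scores[p.DocID] += idf * tf * (k1 + 1) / (tf + k1*norm)
		}
	}
	results := make([]Result, 0, len(scores))
	for id, s := range scores {
		results = append(results, Result{DocID: id, Score: s})
	}
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].DocID < results[j].DocID // stable tie-break
	})
	return results
}

func main() {
	ix := NewIndex()
	ix.Add("the quick brown fox")
	ix.Add("the lazy dog")
	ix.Add("quick quick fox jumps")
	for _, r := range ix.Search("quick fox") {
		fmt.Printf("doc %d: %.4f\n", r.DocID, r.Score)
	}
}
```

In this toy corpus the third document ranks first for "quick fox" because it is the same length as the first document but contains "quick" twice.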

Running

indx is made up of a handful of different services which all work independently of each other. The main trio is:

cmd/crawler, which will crawl web pages for you
cmd/ingester, which adds pages into the index
cmd/search, which is the actual search engine

Each of these programs can be run concurrently provided that the index that cmd/search is using is not modified.

Please see each service's code for setup instructions. A quick setup guide will be written eventually, but the primary focus of this project is currently experimentation, so it probably won't appear any time soon.

Additionally, the index formats are very much in flux, so expect to need to recreate the index occasionally. This can be done without using the crawler at all, and I'll write down the procedure for it eventually.

Performance

Performance is unknown, but indx tries to work well on hardware you probably already have at your disposal. Because the index files are mapped into memory, performance will probably improve with more RAM: the more memory is available, the more data your OS can cache.
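
For readers unfamiliar with memory mapping, the sketch below shows the general idea in Go using the standard syscall package on Unix-like systems. The mapFile and demo helpers are hypothetical names made up for this example, and nothing here is taken from indx's own code, which may map its files differently.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapFile maps a whole file read-only and returns its bytes together with a
// function that releases the mapping. Unix-only: it relies on syscall.Mmap.
func mapFile(path string) ([]byte, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	defer f.Close() // the mapping remains valid after the descriptor is closed

	st, err := f.Stat()
	if err != nil {
		return nil, nil, err
	}
	if st.Size() == 0 {
		// Zero-length mappings are rejected, so return an empty slice instead.
		return nil, func() error { return nil }, nil
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, int(st.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, nil, err
	}
	return data, func() error { return syscall.Munmap(data) }, nil
}

// demo writes a small temporary file, maps it, and returns what it read.
func demo() (string, error) {
	tmp, err := os.CreateTemp("", "mmap-demo-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name())
	if _, err := tmp.WriteString("hello"); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	data, unmap, err := mapFile(tmp.Name())
	if err != nil {
		return "", err
	}
	content := string(data) // copy out before unmapping
	return content, unmap()
}

func main() {
	content, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(content)
}
```

Reads from the returned slice are served from the OS page cache, so pages that have been touched recently stay in RAM for as long as memory allows. That is the reason more memory tends to help.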

Most algorithms in indx are currently naive implementations just to get something working decently enough.
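
As an illustration of what a naive implementation can look like, here is a textbook power-iteration PageRank over an adjacency list. This is a sketch and not the code in the pagerank directory: the PageRank function signature, the fixed iteration count, and the even redistribution of rank from pages with no outgoing links are assumptions made for this example.

```go
package main

import "fmt"

// PageRank runs a fixed number of power iterations.
// links[i] lists the pages that page i links to.
func PageRank(links [][]int, damping float64, iterations int) []float64 {
	n := len(links)
	if n == 0 {
		return nil
	}
	rank := make([]float64, n)
	for i := range rank {
		rank[i] = 1 / float64(n) // start from a uniform distribution
	}
	for it := 0; it < iterations; it++ {
		next := make([]float64, n)
		dangling := 0.0 // rank held by pages with no outgoing links
		for i, out := range links {
			if len(out) == 0 {
				dangling += rank[i]
				continue
			}
			share := rank[i] / float64(len(out))
			for _, j := range out {
				next[j] += share
			}
		}
		// Random-jump term plus dangling rank spread evenly over all pages.
		base := (1-damping)/float64(n) + damping*dangling/float64(n)
		for i := range next {
			next[i] = base + damping*next[i]
		}
		rank = next
	}
	return rank
}

func main() {
	// 0 -> 1, 2   1 -> 2   2 -> 0   3 -> 2
	links := [][]int{{1, 2}, {2}, {0}, {2}}
	for page, r := range PageRank(links, 0.85, 50) {
		fmt.Printf("page %d: %.4f\n", page, r)
	}
}
```

Each iteration visits every link once, which is fine for small graphs but is exactly the kind of straightforward approach that leaves room for optimization later.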

