mca3/indx

indx

indx is an experimental search engine written out of curiosity. The goal is mostly for me to have fun working with massive amounts of data while still building something somewhat useful.

Structure

indx is split into several parts that build on each other:

  • cmd: User-facing services.
  • crawler: HTTP crawler.
  • db: The SQLite3 side of things; deals with document metadata.
  • index: Magic happens here. Implements the forward/inverted index, lexicon, and BM25 search algorithm. Nearly everything builds off this in some way.
  • ingest: Adds documents into the index.
  • pagerank: The PageRank algorithm.
  • porter: The Porter stemming algorithm.
  • _example: Example application for interfacing with everything else here. Contains broken and messy code and is used exclusively for testing.

See each of their respective directories for more information on what each part does.
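
To make the role of the index package more concrete, here is a minimal sketch of an inverted index with BM25 scoring. It is not indx's actual code: the language (Go, guessed from the cmd and _example layout), the names Index, Posting, Add, and Search, the whitespace tokenizer, the in-memory maps, and the parameters k1 = 1.2 and b = 0.75 are all assumptions made for illustration. It also skips stemming and uses one common BM25 IDF variant.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Posting records how often a term occurs in one document.
type Posting struct {
	DocID int
	TF    int
}

// Index is a toy in-memory inverted index (hypothetical; not indx's on-disk format).
type Index struct {
	postings map[string][]Posting // term -> documents containing it
	docLen   []int                // token count per document
	totalLen int                  // sum of all document lengths
}

func NewIndex() *Index { return &Index{postings: make(map[string][]Posting)} }

// Add lowercases text, splits it on whitespace, and returns the new document ID.
func (ix *Index) Add(text string) int {
	id := len(ix.docLen)
	terms := strings.Fields(strings.ToLower(text))
	counts := make(map[string]int)
	for _, t := range terms {
		counts[t]++
	}
	for t, tf := range counts {
		ix.postings[t] = append(ix.postings[t], Posting{DocID: id, TF: tf})
	}
	ix.docLen = append(ix.docLen, len(terms))
	ix.totalLen += len(terms)
	return id
}

// Assumed BM25 parameters; these are commonly used defaults.
const (
	k1 = 1.2
	b  = 0.75
)

// Result pairs a document with its BM25 score.
type Result struct {
	DocID int
	Score float64
}

// Search scores every document matching at least one query term, best first.
func (ix *Index) Search(query string) []Result {
	n := float64(len(ix.docLen))
	if n == 0 {
		return nil
	}
	avgLen := float64(ix.totalLen) / n
	scores := make(map[int]float64)
	for _, t := range strings.Fields(strings.ToLower(query)) {
		plist := ix.postings[t]
		if len(plist) == 0 {
			continue
		}
		df := float64(len(plist))
		// IDF variant with +1 inside the log so it never goes negative.
		idf := math.Log(1 + (n-df+0.5)/(df+0.5))
		for _, p := range plist {
			tf := float64(p.TF)
			norm := tf + k1*(1-b+b*float64(ix.docLen[p.DocID])/avgLen)
			scores[p.DocID] += idf * tf * (k1 + 1) / norm
		}
	}
	results := make([]Result, 0, len(scores))
	for id, s := range scores {
		results = append(results, Result{DocID: id, Score: s})
	}
	// Sort by descending score, breaking ties by document ID.
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].DocID < results[j].DocID
	})
	return results
}

func main() {
	ix := NewIndex()
	ix.Add("indx is an experimental search engine")
	ix.Add("the crawler fetches web pages")
	ix.Add("a search engine ranks pages with bm25 search")
	for _, r := range ix.Search("search engine") {
		fmt.Printf("doc %d score %.3f\n", r.DocID, r.Score)
	}
}
```

A real index would presumably stem terms first (the porter package) and read postings from disk rather than from a Go map, but the scoring loop would have the same overall shape.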

Running

indx is made up of a handful of different services which all work independently of each other. The main trio is:

  • cmd/crawler, which will crawl web pages for you
  • cmd/ingester, which adds pages into the index
  • cmd/search, which is the actual search engine

These programs can be run concurrently, provided that the index cmd/search is using is not modified while it runs.

Please see each service's code for setup instructions. A quick setup guide will be written eventually, but the primary focus of this project is currently experimentation, so it will probably not appear any time soon.

Additionally, the index formats are very much in flux, so expect to need to recreate the index occasionally. This can be done without using the crawler at all, and I'll write down the procedure for it eventually.

Performance

Unknown, but the aim is to work well on hardware you probably already have at your disposal. Because index files are mapped into memory, performance will probably improve with more RAM: the more memory is available, the more of the index your OS can keep cached.

Most algorithms in indx are currently naive implementations just to get something working decently enough.

About

Experimental from-scratch search engine
