
SpaghettiSearch: A Concurrent Search Engine

A fully-functioning search engine built to satisfy the HKUST COMP4321 course requirements. The backend is written in Golang and the frontend in React.

Live Demo

http://spaghetti-search.herokuapp.com/

Features

  • Implements Topic-Sensitive PageRank (T. H. Haveliwala, 2003), using the query as the sole given context and weighting the user's interest equally across every topic
  • Combines PageRank and the Vector-Space Model to rank results
  • Uses anchor text and meta tags, as suggested in Google's paper, to increase precision and to index many more webpages
  • Makes use of the generator, future, and fan-in/fan-out concurrency patterns in Golang to speed up retrieval (a minimal sketch follows this list)
  • Dynamic document summary retrieval
  • Uses BadgerDB, a key-value store optimised for SSDs, as the database
  • Supports keyword-list search and phrase search (use double quotes for phrase search)
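
The fan-in/fan-out retrieval mentioned above can be pictured with a short, self-contained Go sketch. The document IDs, worker count, and fetchScore function below are hypothetical placeholders for illustration only, not the actual SpaghettiSearch code:

package main

import (
	"fmt"
	"sync"
)

// fetchScore stands in for an expensive per-document operation,
// e.g. looking up term weights in the inverted index.
func fetchScore(docID int) float64 {
	return 1.0 / float64(docID+1)
}

// generator emits document IDs on a channel (the "generator" pattern).
func generator(docIDs []int) <-chan int {
	out := make(chan int)
	go func() {
		defer close(out)
		for _, id := range docIDs {
			out <- id
		}
	}()
	return out
}

// worker is the fan-out stage: each worker consumes IDs and emits scores.
func worker(in <-chan int) <-chan float64 {
	out := make(chan float64)
	go func() {
		defer close(out)
		for id := range in {
			out <- fetchScore(id)
		}
	}()
	return out
}

// merge is the fan-in stage: results from all workers are combined into one channel.
func merge(channels ...<-chan float64) <-chan float64 {
	out := make(chan float64)
	var wg sync.WaitGroup
	wg.Add(len(channels))
	for _, c := range channels {
		go func(c <-chan float64) {
			defer wg.Done()
			for v := range c {
				out <- v
			}
		}(c)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	ids := generator([]int{0, 1, 2, 3, 4, 5, 6, 7})

	// Fan out to several workers, then fan the results back in.
	const numWorkers = 4
	workers := make([]<-chan float64, numWorkers)
	for i := range workers {
		workers[i] = worker(ids)
	}

	for score := range merge(workers...) {
		fmt.Println(score)
	}
}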

Setup & Installation

Backend

  • Install golang from here
$ sudo tar -C /usr/local -xzf go$VERSION.$OS-$ARCH.tar.gz
$ export PATH=$PATH:/usr/local/go/bin
  • Download this repo using go get
$ go get github.com/nwihardjo/SpaghettiSearch

Frontend

  • Install node and npm from here
  • The frontend build has already been uploaded, so installing node is not required to get this running.

Dependencies

dep is used as the package manager to ensure the installed dependencies are the correct version from the correct vendor. Run dep ensure in the project root to install the required packages, or run go get ./... to do the same thing.
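
For example, assuming the default GOPATH layout used by go get above, the dependencies can be installed with:

$ cd $GOPATH/src/github.com/nwihardjo/SpaghettiSearch
$ dep ensure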

Building

  • Run make in the project root directory. It installs the necessary binaries to the bin/ directory and installs the dependencies.
  • Run the crawler with the arguments described below, then spin up the server (a worked example follows this list). The backend and the React build have been integrated, so only the single Golang server needs to be started.
$ ./bin/start_crawl [-numPages=<number of pages to be crawled>] [-startURL=<starting entry point for the crawler>] [-domainOnly=<whether to crawl only webpages in the domain of the given starting URL>]
$ ./bin/server
  • Open your browser and go to localhost:8080. The server is hosted on port 8080; check the output of your terminal to confirm.
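
For example, a crawl of a limited number of pages starting from a hypothetical entry point (the URL and values below are placeholders, not tool defaults):

$ ./bin/start_crawl -numPages=1000 -startURL=https://www.cse.ust.hk -domainOnly=true
$ ./bin/server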

Contributor