No longer maintained. Please read our shutdown message.
This branch is 8910 commits ahead of and 234 commits behind gigablast:master.
antiword-dir
doxygen
html
misc
sto
test
third-party
tokenizer
tools
ucdata
unicode
word_variations
.clang-format
.gitignore
.gitmodules
.travis.yml
.valgrindrc
Abbreviations.cpp
Abbreviations.h
BaseScoringParameters.cpp
BaseScoringParameters.h
BigFile.cpp
BigFile.h
BitOperations.h
Bits.cpp
Bits.h
ByteOrderMark.cpp
ByteOrderMark.h
Clusterdb.cpp
Clusterdb.h
Collectiondb.cpp
Collectiondb.h
Conf.cpp
Conf.h
ContentMatchList.cpp
ContentMatchList.h
ContentTypeBlockList.cpp
ContentTypeBlockList.h
ConvertSpiderdb.cpp
ConvertSpiderdb.h
CountryCode.cpp
CountryCode.h
CountryLanguage.cpp
CountryLanguage.h
DailyMerge.cpp
DailyMerge.h
Dir.cpp
Dir.h
Dns.cpp
Dns.h
DnsBlockList.cpp
DnsBlockList.h
DnsProtocol.h
Dns_internals.h
DocDelete.cpp
DocDelete.h
DocProcess.cpp
DocProcess.h
DocRebuild.cpp
DocRebuild.h
DocReindex.cpp
DocReindex.h
Docid.cpp
Docid.h
Docid2Siteflags.cpp
Docid2Siteflags.h
DocumentIndexChecker.h
Doledb.cpp
Doledb.h
Domains.cpp
Domains.h
DumpSpiderdbSqlite.cpp
DumpSpiderdbSqlite.h
EGStack.cpp
EGStack.h
Entities.cpp
Entities.h
Errno.cpp
Errno.h
File.cpp
File.h
FxBlobCache.cpp
FxBlobCache.h
FxBlobCacheInstantiation.cpp
FxCheckAdult.cpp
FxCheckAdult.h
FxCheckSpam.cpp
FxCheckSpam.h
FxClient.cpp
FxClient.h
FxExplicitKeywords.cpp
FxExplicitKeywords.h
FxLanguage.cpp
FxLanguage.h
FxTermCheckList.cpp
FxTermCheckList.h
GbCache.h
GbCompress.cpp
GbCompress.h
GbCopyFile.cpp
GbCopyFile.h
GbDns.cpp
GbDns.h
GbEncoding.cpp
GbEncoding.h
GbFormat.h
GbMakePath.cpp
GbMakePath.h
GbMoveFile.cpp
GbMoveFile.h
GbMoveFile2.cpp
GbMoveFile2.h
GbMutex.cpp
GbMutex.h
GbRegex.cpp
GbRegex.h
GbSignature.cpp
GbSignature.h
GbThreadQueue.cpp
GbThreadQueue.h
GbUtil.cpp
GbUtil.h
GigablastRequest.h
HashTable.cpp
HashTable.h
HashTableT.cpp
HashTableT.h
HashTableX.cpp
HashTableX.h
HighFrequencyTermShortcuts.cpp
HighFrequencyTermShortcuts.h
Highlight.cpp
Highlight.h
HostFlags.cpp
HostFlags.h
Hostdb.cpp
Hostdb.h
HttpMime.cpp
HttpMime.h
HttpRequest.cpp
HttpRequest.h
HttpServer.cpp
HttpServer.h
IOBuffer.h
IPAddressChecks.cpp
IPAddressChecks.h
Images.cpp
Images.h
InstanceInfoExchange.cpp
InstanceInfoExchange.h
IpBlockList.cpp
IpBlockList.h
Jenkinsfile
JobScheduler.cpp
JobScheduler.h
Json.cpp
Json.h
LICENSE
Lang.cpp
Lang.h
LanguageResultOverride.cpp
LanguageResultOverride.h
Lemma.cpp
Lemma.h
Lexicons.cpp
Lexicons.h
Linkdb.cpp
Linkdb.h
Log.cpp
Log.h
Loop.cpp
Loop.h
Makefile
MatchList.cpp
MatchList.h
Matches.cpp
Matches.h
Mem.cpp
Mem.h
MemoryMappedFile.cpp
MemoryMappedFile.h
MergeSpaceCoordinator.cpp
MergeSpaceCoordinator.h
Msg0.cpp
Msg0.h
Msg13.cpp
Msg13.h
Msg2.cpp
Msg2.h
Msg20.cpp
Msg20.h
Msg22.cpp
Msg22.h
Msg25.cpp
Msg25.h
Msg3.cpp
Msg3.h
Msg39.cpp
Msg39.h
Msg3a.cpp
Msg3a.h
Msg40.cpp
Msg40.h
Msg4In.cpp
Msg4In.h
Msg4Out.cpp
Msg4Out.h
Msg5.cpp
Msg5.h
Msg51.cpp
Msg51.h
MsgC.cpp
MsgC.h
Msge0.cpp
Msge0.h
Msge1.cpp
Msge1.h
Multicast.cpp
Multicast.h
PageAddColl.cpp
PageAddUrl.cpp
PageBasic.cpp
PageCrawlBot.cpp
PageCrawlBot.h
PageDocProcess.cpp
PageDoledbIPTable.cpp
PageGet.cpp
PageHealthCheck.cpp
PageHosts.cpp
PageInject.cpp
PageInject.h
PageLinkdbLookup.cpp
PageParser.cpp
PageParser.h
PagePerf.cpp
PageReindex.cpp
PageReindex.h
PageResults.cpp
PageResults.h
PageRoot.cpp
PageRoot.h
PageSockets.cpp
PageSpider.cpp
PageSpiderdbLookup.cpp
PageStats.cpp
PageTemperatureRegistry.cpp
PageTemperatureRegistry.h
PageThreads.cpp
PageTitledb.cpp
Pages.cpp
Pages.h
Parms.cpp
Parms.h
Phrases.cpp
Phrases.h
Pops.cpp
Pops.h
Pos.cpp
Pos.h
Posdb.cpp
Posdb.h
PosdbTable.cpp
PosdbTable.h
Process.cpp
Process.h
Profiler.cpp
Profiler.h
Proxy.cpp
Proxy.h
Punycode.cpp
Punycode.h
Query.cpp
Query.h
QueryLanguage.cpp
QueryLanguage.h
README.md
Rdb.cpp
Rdb.h
RdbBase.cpp
RdbBase.h
RdbBuckets.cpp
RdbBuckets.h
RdbCache.cpp
RdbCache.h
RdbDump.cpp
RdbDump.h
RdbIndex.cpp
RdbIndex.h
RdbIndexQuery.cpp
RdbIndexQuery.h
RdbList.cpp
RdbList.h
RdbMap.cpp
RdbMap.h
RdbMem.cpp
RdbMem.h
RdbMerge.cpp
RdbMerge.h
RdbScan.cpp
RdbScan.h
RdbTree.cpp
RdbTree.h
Rebalance.cpp
Rebalance.h
Repair.cpp
Repair.h
ResultOverride.cpp
ResultOverride.h
RobotRule.cpp
RobotRule.h
Robots.cpp
Robots.h
RobotsBlockedResultOverride.cpp
RobotsBlockedResultOverride.h
RobotsCheckList.cpp
RobotsCheckList.h
S99gb
SafeBuf.cpp
SafeBuf.h
Sanity.cpp
Sanity.h
ScalingFunctions.cpp
ScalingFunctions.h
ScopedLock.h
ScoringWeights.cpp
ScoringWeights.h
SearchInput.cpp
SearchInput.h
Sections.cpp
Sections.h
Serialize.cpp
Serialize.h
SiteGetter.cpp
SiteGetter.h
SiteMedianPageTemperature.cpp
SiteMedianPageTemperature.h
SiteMedianPageTemperatureRegistry.cpp
SiteMedianPageTemperatureRegistry.h
SiteNumInlinks.cpp
SiteNumInlinks.h
Speller.cpp
Speller.h
Spider.cpp
Spider.h
SpiderCache.cpp
SpiderCache.h
SpiderColl.cpp
SpiderColl.h
SpiderLoop.cpp
SpiderLoop.h
SpiderProxy.cpp
SpiderProxy.h
SpiderdbRdbSqliteBridge.cpp
SpiderdbRdbSqliteBridge.h
SpiderdbSqlite.cpp
SpiderdbSqlite.h
SpiderdbUtil.cpp
SpiderdbUtil.h
Statistics.cpp
Statistics.h
Stats.cpp
Stats.h
StopWords.cpp
StopWords.h
Summary.cpp
Summary.h
SummaryCache.cpp
SummaryCache.h
Synonyms.cpp
Synonyms.h
Tagdb.cpp
Tagdb.h
TcpServer.cpp
TcpServer.h
TcpSocket.h
Title.cpp
Title.h
TitleRecVersion.h
TitleSummaryCodepointFilter.h
Titledb.cpp
Titledb.h
TopTree.cpp
TopTree.h
UdpProtocol.h
UdpServer.cpp
UdpServer.h
UdpSlot.cpp
UdpSlot.h
UdpStatistic.cpp
UdpStatistic.h
Url.cpp
Url.h
UrlBlockCheck.cpp
UrlBlockCheck.h
UrlComponent.cpp
UrlComponent.h
UrlMatch.cpp
UrlMatch.h
UrlMatchList.cpp
UrlMatchList.h
UrlParser.cpp
UrlParser.h
UrlRealtimeClassification.cpp
UrlRealtimeClassification.h
UrlResultOverride.cpp
UrlResultOverride.h
Version.cpp
Version.h
WantedCheckExampleLib.cpp
WantedChecker.cpp
WantedChecker.h
WantedCheckerApi.h
Wiki.cpp
Wiki.h
Wiktionary.cpp
Wiktionary.h
WordVariationsConfig.h
Xml.cpp
Xml.h
XmlDoc.cpp
XmlDoc.h
XmlDoc_Indexing.cpp
XmlNode.cpp
XmlNode.h
adultphrases.txt.example
adultwords.txt.example
antiword
bmptopnm
browser.py
cmpversiongte
collnum_t.h
control.deb
copyright.head
copyright.tail
default.css
entities.json
fctypes.cpp
fctypes.h
g_hashtab.inc
gb-1.0.spec
gb.deb.rules
gb.pem
gbclean.sh
gbconvert.sh
gbmemcpy.h
gbstart.sh
generate_entities.py
generate_query_stop_word_languages.sh
generate_query_stop_words.sh
generate_tld_list.sh
giftopnm
hash.cpp
hash.h
iana_charset.cpp
iana_charset.h
init.gb.conf
ip.cpp
ip.h
jpegtopnm
libiconv.a
libiconv.la
libiconv64.a
libjpeg.so.62
libnetpbm.so.10
libpng12.so.0
libtiff.so.4
linkspam.cpp
linkspam.h
main.cpp
matches2.cpp
matches2.h
max_coll_len.h
max_hosts.h
max_niceness.h
max_url_len.h
max_words.h
msgtype_t.h
mysynonyms.txt
nodeid_t.h
pngtopnm
pnmscale
ppmtojpeg
pstotext
query_stop_words.da.txt
query_stop_words.de.txt
query_stop_words.en.txt
query_stop_words.xx.txt
rdbid_t.h
repair_mode.h
robotsblockedresultoverride.txt
runCoverityAnalysis.sh
runSonarQubeAnalysis.sh
sitelinks.txt
sonar-project.properties
sort.cpp
sort.h
spider_status_t.h
termid_mask.h
tifftopnm
tlds-additional-2nd-level-domains.txt
tlds-alpha-by-domain.txt
tlds-official-2nd-level-domains.txt
types.h
unifiedDict.txt
urlmatchlist.txt.example
urlresultoverride.txt.example
utf8.cpp
utf8.h
utf8_convert.cpp
utf8_convert.h
utf8_fast.cpp
utf8_fast.h
valgrind.cfg
wikititles.txt.part1
wikititles.txt.part2
wiktionary-buf.txt
wiktionary-lang.txt
wiktionary-syns.dat
zconf.h

README.md

Warning: Do not use this code.

Findx is shutting down. Please read https://privacore.github.io/

Gigablast - an open source search engine

An open source web and enterprise search engine and spider/crawler.

This is a fork of the original Gigablast project available at https://github.com/gigablast/open-source-search-engine/. This version is heavily modified by Privacore, and tailored for our use. It is not a drop-in replacement for the original Gigablast.

Modifications by Privacore

Our aim is not to maintain backwards compatibility with the original Gigablast data files.

| Feature | Description |
| --- | --- |
| Multi-threading | Many improvements to multi-threading, plus general optimizations. |
| Stability | Numerous general bugfixes and major improvements in thread safety. |
| Data formats | Posdb is being changed to store the entries for a page in a single Posdb file, rather than spreading the entries across multiple files, merging the data in memory, and handling delete keys at query time. A new index file will point to the file containing the newest version of a document. Spiderdb is modified to use an sqlite3 database instead of the RDB format. |
| Data file merging | Our version uses a dedicated drive for merging, instead of merging and deleting part files on the fly on the same data drive. We create a completely merged file on the merge drive, temporarily make GB use that file for queries, delete the original files, copy the newly merged file back to the 'production drive', switch query handling back to that drive, and delete the temporary file. The merge drive must be big enough to hold at least one instance's posdb data. |
| Alerting | Start script improved to send alerts if GB crashes (and to avoid successive coredumps by staying down for analysis). |
| Trace log | Many options for adding very detailed trace logging to different parts of the code. |
| Summaries | Improvements in search result summary generation. |
| Language detection | Google's CLD2 library integrated to improve language detection. |
| Code removed | About half of the original source has been removed, e.g. diffbot/eventguru/buzzlogic/seo specific integrations. |
| Disk space | Lots of 'junk' removed from the Posdb data files, reducing space usage significantly. This means that if you use our version with old Gigablast data files, data will not be cleaned up correctly when re-indexing a page. You will need to rebuild the Posdb data files. |
| Ranking | Ranking weights made configurable. |

... and much more...
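
The merge sequence described under "Data file merging" can be sketched as a small shell walk-through. This is a toy illustration under stated assumptions, not how GB implements it: the directory layout, the file names, the `active` pointer file, and the use of `sort -m` as a stand-in for the real Rdb merge are all hypothetical.

```shell
# Toy walk-through of the dedicated-merge-drive sequence described above.
# Everything here is hypothetical: the directories, the file names, the
# "active" pointer file, and `sort -m` standing in for GB's real merge.
set -eu

work=$(mktemp -d)
prod="$work/production"   # stands in for the production data drive
merge="$work/merge"       # stands in for the dedicated merge drive
mkdir -p "$prod" "$merge"

# Two already-sorted part files on the production drive.
printf 'a\nc\n' > "$prod/part1.dat"
printf 'b\nd\n' > "$prod/part2.dat"

# 1. Create a completely merged file on the merge drive.
sort -m "$prod/part1.dat" "$prod/part2.dat" > "$merge/merged.dat"

# 2. Temporarily point query handling at the merged file on the merge drive.
echo "$merge/merged.dat" > "$work/active"

# 3. Delete the original part files.
rm "$prod/part1.dat" "$prod/part2.dat"

# 4. Copy the merged file back to the production drive.
cp "$merge/merged.dat" "$prod/merged.dat"

# 5. Switch query handling back to the production drive.
echo "$prod/merged.dat" > "$work/active"

# 6. Delete the temporary file on the merge drive.
rm "$merge/merged.dat"
```

Between steps 3 and 4 the only complete copy of the data lives on the merge drive, which is why that drive has to be large enough to hold a full instance's posdb data.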

Migrating Gigablast to our fork

| Step | Description |
| --- | --- |
| Backup! | There, you have been warned. |
| Build | Run, in order: `git clone https://github.com/privacore/open-source-search-engine.git`, `cd open-source-search-engine`, `git submodule init`, `git submodule update`, `make -j4`, `make dist`. |
| Copy | Stop your running GB instances. Copy the files contained in the new gb-[date]-[rev].tar.gz file to your GB instance 0. |
| Install | Go to your GB instance 0 and run `./gb install` to copy the binary and needed files to all instances. |
| Remove files | Remove the posdb files from your collections. |
| Convert files | Convert the spiderdb files to sqlite3 format by running `./gb convertspiderdb`. |
| Start | Run `./gb start` from your instance 0 and you should be on your way. |
| Rebuild | Rebuild the posdb data files through the web UI. This is needed because we store less data in posdb than the original version, and GB cannot clean up this 'junk' data when re-indexing pages. |

SUPPORTED PLATFORMS

Primary:

  • Ubuntu 16.04, g++ 5.4.0, Python 2.7.6

Secondary:

  • OpenSuSE 13.2, GCC 4.8.3
  • OpenSuSE 42.2, GCC 6.2.1
  • Fedora 25, GCC 6.3.1

DEPENDENCIES

Compilation

Ubuntu

  • g++
  • make
  • cmake
  • python
  • libpcre3-dev
  • libssl-dev
  • libprotobuf-dev
  • protobuf-compiler
  • libsqlite3-dev

OpenSuSE

  • g++
  • make
  • cmake
  • python
  • pcre-devel
  • libssl-dev
  • protobuf-devel
  • libprotobuf13

Fedora

  • g++
  • make
  • cmake
  • python
  • pcre-devel
  • openssl-devel
  • protobuf-devel
  • protobuf-compiler
  • sqlite-devel
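
Before running `make`, it can help to confirm that the build tools named in the lists above are on the `PATH`. The following is a minimal sketch; `missing_tools` is a hypothetical helper, not part of this repository, and it only checks for executables, not for the development libraries.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Build tools that appear in all three distribution lists above.
missing=$(missing_tools g++ make cmake python)
if [ -n "$missing" ]; then
    echo "Missing build tools:" $missing
else
    echo "All listed build tools found."
fi
```

Any tool reported as missing can be installed with the distribution's package manager, using the package names listed above.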

Runtime

  • Multi-instance installations require Vagus for keeping track of which instances are dead and alive.

Ubuntu

  • libssl1.0.0
  • libpcre3
  • libprotobuf9v5

RUNNING GIGABLAST

See html/faq.html for all administrative documentation including the quick start instructions.

Alternatively, visit http://www.gigablast.com/faq.html

CODE ARCHITECTURE

See html/developer.html for all code documentation.

Alternatively, visit http://www.gigablast.com/developer.html

SUPPORT

Privacore does not provide paid support for Gigablast. We refer you to the original project at https://github.com/gigablast/open-source-search-engine/ and its owner, Matt Wells. He has a Pro version you can buy, which includes support options.

We provide limited support for our fork, primarily for active contributors.
