
Search engine #8

Open
davidar opened this Issue Aug 29, 2015 · 18 comments

8 participants

@davidar
IPFS member
davidar commented Aug 29, 2015

Getting documents archived on IPFS is one thing, but we also need to be able to search through them. Given that these archives are eventually going to become too large to fit on a single machine, it's likely that the search index will need to also be distributed over the IPFS network (e.g. with each node providing an index of the contents of their local blockstore). Some possible ways this could be achieved:

  • the static way: each node stores their index in a trie-like structure distributed across multiple IPFS objects, in such a way that clients only need to download a small subset of the index for any given query
  • the dynamic way: queries are broadcast to the network, nodes perform the search on their local index, and provide the results back over IPFS
  • the magic way: somehow storing the index in the IPFS DHT?
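The "static way" can be sketched concretely. Below is a minimal, hypothetical Python sketch (not any actual IPFS API): a dict stands in for the blockstore, trie nodes are content-addressed JSON objects, and a lookup fetches only the objects along the query term's path rather than the whole index.

```python
import hashlib
import json

store = {}  # stand-in for the IPFS blockstore: hash -> serialized node

def put(node):
    data = json.dumps(node, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def get(h):
    return json.loads(store[h])

def build(terms, prefix=""):
    """Build the trie bottom-up from (term, document) pairs; return the root hash."""
    docs = sorted(d for t, d in terms if t == prefix)
    nexts = sorted({t[len(prefix)] for t, d in terms
                    if len(t) > len(prefix) and t.startswith(prefix)})
    return put({"docs": docs,
                "children": {c: build(terms, prefix + c) for c in nexts}})

def lookup(root, term):
    """Walk from the root, fetching one object per character of the term."""
    h, fetched = root, 0
    for c in term:
        node, fetched = get(h), fetched + 1
        if c not in node["children"]:
            return [], fetched
        h = node["children"][c]
    return get(h)["docs"], fetched + 1

root = build([("cat", "QmDoc1"), ("car", "QmDoc2"), ("dog", "QmDoc3")])
docs, fetched = lookup(root, "cat")  # fetches only the 4 objects on the path
```

A client resolving a query therefore downloads a number of objects proportional to the term length, not to the index size.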

Looking through the IRC logs, I've seen several people express interest in an IPFS search engine (@whyrusleeping @rht @jbenet @rschulman @zignig @kragen), but haven't been able to find any specific proposals. Perhaps we could coordinate here?

@davidar davidar added the help wanted label Aug 29, 2015
@rht
rht commented Aug 30, 2015

+1. Static indexes (and any objects computed from the archives) should be static and versioned in the merkledag, just like the archives themselves.

dependency: file metadata ipfs/ipfs#36

@davidar
IPFS member
davidar commented Aug 30, 2015

@rht I'm leaning towards the first one too. I added the latter two mainly because I believe YaCy works that way.

The two main problems that (I think) need to be solved for static indexes are: discovering which nodes provide them (ipfs/notes#15), and how to aggregate them all. Index aggregation could either be done a) entirely client-side (easy to implement but inefficient, especially once the network gets large), or b) through cooperation between index nodes (more difficult to implement, but also more efficient for clients).

@jbenet suggested that CRDTs could be used to achieve (b), but I'm not familiar enough with them to make any intelligent comments :). I do know that trie merging is commutative (unlike merging for some other search trees or hash tables), so it at least fits within a CRDT's assumptions. The main thing to keep in mind is that these indexes are likely to become quite large, so you'd want to coordinate the index merging without nodes having to replicate much of each other's indexes (I'm not sure how this would work within a CRDT). One possible approach is to partition prefixes across the nodes (since you only need to look at shared prefixes when performing the merge), but I'm not sure how this would work within IPFS.
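For what it's worth, the commutativity claim is easy to demonstrate. A minimal sketch (all names illustrative): a recursive trie merge is both commutative and idempotent, two of the properties a CRDT merge function needs.

```python
def empty():
    return {"docs": set(), "children": {}}

def insert(trie, term, doc):
    node = trie
    for c in term:
        node = node["children"].setdefault(c, empty())
    node["docs"].add(doc)

def merge(a, b):
    """Union the postings at each node and recursively merge children."""
    return {
        "docs": a["docs"] | b["docs"],
        "children": {c: merge(a["children"].get(c, empty()),
                              b["children"].get(c, empty()))
                     for c in set(a["children"]) | set(b["children"])},
    }

a, b = empty(), empty()
insert(a, "cat", "Qm1"); insert(a, "car", "Qm2")
insert(b, "cat", "Qm3"); insert(b, "dog", "Qm4")

assert merge(a, b) == merge(b, a)  # commutative: merge order doesn't matter
assert merge(a, a) == a            # idempotent: re-merging changes nothing
```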

@davidar
IPFS member
davidar commented Sep 1, 2015

Looks like @krl is working on search functionality for starlog: https://github.com/krl/aolog (IRC)

@nbingham1

If you go with one of the first two options, you lose flexibility -- mainly because the IPFS executable is determining what the index should look like. If you go with option 3, then you can have any number of indices with different structures storing different things. I don't think the search functionality should be integrated into the IPFS structure itself, but should instead be built on top as a web service. The real question you should be asking is not what an index should look like, but what a database built on top of the IPFS system should look like.

ipfs/ipfs#82

@davidar
IPFS member
davidar commented Sep 4, 2015

@nbingham1 None of them necessarily need to be integrated into ipfs itself, I'm just assuming nodes have some kind of indexing service running, which could be an entirely separate executable (and you could have multiple different ones running if you wanted). In terms of what a database on top of ipfs would look like, I'm in favor of one big json object transparently distributed over ipfs. You could then easily build radix trees and such on top of that.
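As an illustration of the "one big json object" idea — a hypothetical sketch, with a dict standing in for the blockstore: nested objects are stored as separate content-addressed blocks, and resolving a path fetches one block per segment.

```python
import hashlib
import json

store = {}  # stand-in for the IPFS blockstore

def put(obj):
    data = json.dumps(obj, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def split(obj):
    """Store a nested dict as linked blocks, one block per object."""
    if not isinstance(obj, dict):
        return put({"value": obj})
    return put({"links": {k: split(v) for k, v in obj.items()}})

def resolve(root, path):
    """Resolve a path like 'wiki/cat', fetching one block per segment."""
    h = root
    for seg in path.split("/"):
        h = json.loads(store[h])["links"][seg]
    return json.loads(store[h])["value"]

root = split({"wiki": {"cat": "feline", "dog": "canine"}})
assert resolve(root, "wiki/cat") == "feline"
```

Tree-shaped structures like radix trees then come for free: they are just nested objects whose subtrees deduplicate by hash.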

@jbenet
IPFS member
jbenet commented Sep 4, 2015

@nbingham1 @davidar indeed. very much indeed.

@nbingham1

Ah, OK, that makes much more sense. So each node runs the indexing service and updates the index whenever it adds a file, but the index is still stored in a database format on the IPFS network. I was misinterpreting the options (sorry! I just found out about IPFS yesterday!).

@davidar
IPFS member
davidar commented Sep 4, 2015

@nbingham1 No need to apologise, clarifying questions are good :)

@davidar davidar referenced this issue in ipfs/notes Sep 9, 2015

Aggregation --> CRDTs discussion #40

@BrendanBenshoof

Merging the tries via gossip is really viable: essentially, each node just shares the "head node" of the trie it stores on IPFS with its peers. Each peer performs a recursive merge of the index (which normally costs on the order of the smaller index, since unaltered entries can be reused) and then propagates the new index.

Then we can implement search in JS by pointing at that IPFS node's search index via /ipns/local/searchindexroot or some such.

Then anybody who wants can implement a search engine on client side with whatever UI or features they want.

This falls into the classic Byzantine Generals problem of people messing with the index. A blockchain based on proof of accuracy actually looks like a decent solution to this problem.
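The reuse of unaltered entries mentioned above falls out of content addressing. A hypothetical sketch (the `store`/`put`/`get` helpers stand in for the IPFS blockstore): when both sides of a merge point at the same subtrie hash, the merge returns that hash without descending into it.

```python
import hashlib
import json

store = {}

def put(node):
    data = json.dumps(node, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def get(h):
    return json.loads(store[h])

def build(terms, prefix=""):
    docs = sorted(d for t, d in terms if t == prefix)
    nexts = sorted({t[len(prefix)] for t, d in terms
                    if len(t) > len(prefix) and t.startswith(prefix)})
    return put({"docs": docs,
                "children": {c: build(terms, prefix + c) for c in nexts}})

def merge(ha, hb, stats):
    if ha == hb:
        stats["reused"] += 1  # identical subtrie: reuse the entry as-is
        return ha
    a, b = get(ha), get(hb)
    children = {}
    for c in sorted(set(a["children"]) | set(b["children"])):
        if c in a["children"] and c in b["children"]:
            children[c] = merge(a["children"][c], b["children"][c], stats)
        else:
            # only one side has this branch: copy its hash, no descent
            children[c] = a["children"].get(c, b["children"].get(c))
    return put({"docs": sorted(set(a["docs"]) | set(b["docs"])),
                "children": children})

a = build([("cat", "Qm1"), ("dog", "Qm2")])
b = build([("cat", "Qm1"), ("eel", "Qm3")])
stats = {"reused": 0}
root = merge(a, b, stats)  # the shared "cat" subtrie is merged by hash alone
```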

@davidar
IPFS member
davidar commented Sep 13, 2015

@BrendanBenshoof cool, sounds good

Do we actually need a blockchain though? I was thinking all of the (non-malicious) nodes would agree on the merged root node after the gossiping finishes, after which you should just be able to do a simple popular vote?

@BrendanBenshoof

We could move to a web of trust, but straight democracy is highly vulnerable to Sybil attacks.

We just wandered into the big open problem in distributed systems: how to reach consensus in the presence of attackers. I'm not a fan of proof-of-work blockchains as a general solution to this problem. I like the idea of a blockchain here because there is a clear "proof of truthiness": nodes can sanity-check a "sub-index" against the contents of the files it indexes to ensure it is at least accurate. Once all the added "sub-indexes" have been inspected, you can also confirm the deterministic merge of them into the "primary index" and agree on the new "root" hash.

My only worry is that it is too wasteful in processing power and too complex. It might be more efficient just to deal with crap in the database.

A block would look like:

  address of last block
  new source id | index_head
  new source id | index_head
  ...
  new root id

Basic dissemination would look like:

  • I get a new block:
    • I sanity-check the new indexes
    • I merge the indexes into the "master index" and confirm the new head node
    • I propagate the block if it passes muster
  • I listen for newly advertised indexes:
    • I sanity-check that index
    • I propagate the index
    • If I have more than k indexes, I publish a new block

This also has the side effect of aggressively caching newly indexed files by all the search engine participants. It also feeds into my preference that indexing be an active choice by a content producer rather than a spider.
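The dissemination steps above could be sketched as follows. Everything here is illustrative rather than an actual IPFS interface: indexes are maps from terms to document ids, and `fetch_content` is a hypothetical helper that retrieves a document's text.

```python
import hashlib
import json

def index_hash(index):
    return hashlib.sha256(json.dumps(index, sort_keys=True).encode()).hexdigest()

def sanity_check(index, fetch_content):
    """Proof of truthiness: every indexed term must occur in the document it points to."""
    return all(term in fetch_content(doc).split()
               for term, docs in index.items() for doc in docs)

def merge(a, b):
    return {t: sorted(set(a.get(t, [])) | set(b.get(t, [])))
            for t in set(a) | set(b)}

def handle_block(block, master, fetch_content):
    """Verify a block's sub-indexes, fold them into the master index, and
    confirm the advertised root; return the new master, or None to reject."""
    for sub in block["indexes"]:
        if not sanity_check(sub, fetch_content):
            return None                      # inaccurate sub-index: reject
        master = merge(master, sub)
    if index_hash(master) != block["new_root"]:
        return None                          # merge result disagrees: reject
    return master                            # passes muster: propagate

contents = {"QmA": "the cat sat", "QmB": "a dog ran"}
sub = {"cat": ["QmA"], "dog": ["QmB"]}
block = {"indexes": [sub], "new_root": index_hash(merge({}, sub))}
master = handle_block(block, {}, contents.get)
```

Note that verifying a sub-index forces each participant to fetch the files it points at, which is the aggressive-caching side effect mentioned above.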

@davidar
IPFS member
davidar commented Sep 14, 2015

@BrendanBenshoof OK, I think I see what you mean, but distributed systems and associated attacks isn't really my area, and it sounds like you're more familiar with it than I am.

Just to clarify what my thoughts were (but as you say, I haven't thought about possible attacks on this). Suppose you have two nodes, and they swap information in order to agree on the merge of both their indexes. They then both sign this hash, so that everyone else can verify that they're satisfied this merged index contains their own subindexes. You then do a similar thing with merged indexes, etc, so that you end up with a global root signed by everyone with a stake in it.
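A minimal sketch of that pairwise scheme, with HMAC standing in for real public-key signatures (all keys and indexes here are invented for illustration):

```python
import hashlib
import hmac
import json

def root_hash(index):
    return hashlib.sha256(json.dumps(index, sort_keys=True).encode()).hexdigest()

def sign(key, root):
    return hmac.new(key, root.encode(), hashlib.sha256).hexdigest()

def merge(a, b):
    return {t: sorted(set(a.get(t, [])) | set(b.get(t, [])))
            for t in set(a) | set(b)}

def contains(merged, sub):
    """A node only signs a merged root if its own subindex survived the merge."""
    return merge(merged, sub) == merged

key_a, key_b = b"node-a-secret", b"node-b-secret"
idx_a, idx_b = {"cat": ["Qm1"]}, {"dog": ["Qm2"]}

merged = merge(idx_a, idx_b)
root = root_hash(merged)
signatures = {}
if contains(merged, idx_a):
    signatures["a"] = sign(key_a, root)
if contains(merged, idx_b):
    signatures["b"] = sign(key_b, root)
```

Repeating the same check one level up (merged indexes of merged indexes) would yield the global root signed by everyone with a stake in it.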

Thinking about it some more, I agree popular vote wouldn't work in general, I guess what I meant was a vote from nodes that you trust (eg the bootstrap nodes), which you assume aren't malicious, but some of them might make mistakes.

But like I said, I have no idea how well these things will work in practice ☺

@jbenet @whyrusleeping thoughts?

@cryptix
IPFS member
cryptix commented Nov 5, 2015

Maybe cothorities are relevant here? Warning: I only came across this recently and I'm still trying to make sense of it.

To summarize from the slides: collective authorities want to make tamper-evident public logs while solving the forking and freshness problems, without falling back on a proof-of-work scheme, the risk of temporary forks, or a 51% attack vulnerability.

I imagine IPFS could offer this and individual projects can use it to sign their latest versions and indexes?

@rht
rht commented Nov 7, 2015

@cryptix I haven't read the paper fully. Regardless, from its scope and aims, it could be used in two ways:
1. Indirectly, only to keep track of whitelisted index providers. Providers are certified the same way HTTP servers are. One issue: what if one requests content from a provider (which returns a hash), along with that content's search index, but the provider is not in the whitelist log? It looks like the log has to be updated manually and witnessed by the cosigners. In that case, maybe the content credential (since one already has to trust the provider to return the correct hash) can be reused for the index.
2. Directly (provider-independent), where every new index is verified by all cosigners, provided there is a cheap way to verify an index without recreating it from scratch. One way is to index a few sample search terms and check them against the full index (much like a Rabin fingerprint). However, this is not as complete as PKI or validating the proof of a theorem.
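The sampling check in (2) might look something like this hypothetical sketch: recompute the postings for a few sampled terms directly from the raw content and compare them with the submitted index.

```python
import random

def postings_for(term, contents):
    """Recompute from the raw documents which ones contain the term."""
    return sorted(doc for doc, text in contents.items() if term in text.split())

def spot_check(index, contents, samples=3, rng=random):
    """Verify a few randomly sampled terms instead of rebuilding the index."""
    for term in rng.sample(list(index), min(samples, len(index))):
        if sorted(index[term]) != postings_for(term, contents):
            return False
    return True

contents = {"QmA": "the cat sat", "QmB": "a dog ran"}
honest = {"cat": ["QmA"], "dog": ["QmB"]}
forged = {"cat": ["QmB"], "dog": ["QmB"]}
```

As with any sampling scheme, a forger who corrupts only a few entries may slip past a small sample, which is the incompleteness caveat noted above.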

Also, since it is not crucial to get the timestamps of indexes (of immutable content) in order, I don't think a full-blown blockchain (proof-of-work or not) is necessary. There should be a much faster way than method 2 above.

An alternative to fingerprinting the index is to verify the code used to build it, provided that code can be verified to run in a sandbox with known parameters.

(worth looking: SCP, which uses quorum slice, which enables more organic growth of signing nodes)

@rht
rht commented Nov 7, 2015

(OK, the fingerprint check is basically @BrendanBenshoof's "proof of truthiness", whose drawbacks have been mentioned. I favor the weak ordering method described in the CRDT discussion, if a protocol for it has been fleshed out.)

@davidar
IPFS member
davidar commented Nov 7, 2015

I think it would make sense for index providers to sign the aggregated index once they've verified it contains their own index. That way "verification" is simply checking that it's been signed by the provider of the part of the index you're interested in.

@alexpear alexpear referenced this issue in ipfs/faq Dec 13, 2015

Search #81

@doesntgolf

I made a prototype search tool here (also available here if it's unavailable on ipfs). The search index is built from scraping json search results from a public searx instance (which itself aggregates results from major search providers). This means the only things being searched are page title, URL, and a snippet of the page's text. It also means the search index only has about 1100 items. The client-side search is done with lunr.js. (search source, scraper source)

It's nothing impressive but was fun to hack together!

@vinctux
vinctux commented Dec 13, 2015

@doesntgolf Very nice to see progress here ;)
