Provide full-text search of documents described in CDXJ #205
Comments
I think I mentioned
@ibnesayeed It's outside the scope of this ticket, but I think it would be worthwhile to identify, in a separate issue, what is needed to make the indexer script more scalable. If you recall or come up with any approaches for accomplishing the goals of this ticket, please document them here. It would be good to get this right the first time instead of coming up with a working implementation only to have to rewrite it in favor of a better approach.
Our current indexer is tailored to work well with smaller collections while minimizing friction for users, which comes at the cost of not being very scalable. I think there are some common pieces that can be extracted, and then there can be more than one wrapper script to handle different situations. However, we can discuss that in a separate ticket as you suggested. Getting back to this topic, I think we can broadly see two approaches to indexing archival content for full-text searching:
I prefer the second approach, particularly adding the search ability as an add-on. Incremental async indexing for search is also an interesting approach, and it seems like it would be a more scalable solution than consulting the WARCs.
machawk1 commented Jun 29, 2017
At one point, https://github.com/ipfs-search/ipfs-search allowed searching the content of dereferenced payloads described by a set of IPFS hashes. Because an IPFS hash describes content that never changes, each payload needs to be indexed only once to remain searchable.
One initial target audience for ipwb is for smaller collections of archival content, identified through our conventional CDXJ indexing and replay procedure.
Provide a mechanism to allow a user to search the contents of the payloads identified in the currently loaded CDXJ. @ibnesayeed Please provide insights on this.
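As a starting point for discussion, here is a rough sketch of what such a mechanism might look like: a small in-memory inverted index built from the loaded CDXJ, indexing each payload hash only once. It is not how ipwb does anything today. `PayloadSearchIndex`, `parse_cdxj_line`, and the `fetch_payload` callable are hypothetical names, and the sketch assumes each CDXJ record carries a `locator` field whose last path segment is the payload hash, plus an `original_uri` field.

```python
import json
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def parse_cdxj_line(line):
    """Split a CDXJ record into (key, timestamp, JSON attributes)."""
    key, timestamp, raw_json = line.strip().split(" ", 2)
    return key, timestamp, json.loads(raw_json)


class PayloadSearchIndex:
    """Hypothetical add-on index; not part of ipwb."""

    def __init__(self, fetch_payload):
        # fetch_payload(hash) -> str is an assumed hook that dereferences
        # a payload hash (e.g. via IPFS) and returns its text.
        self.fetch_payload = fetch_payload
        self.postings = defaultdict(set)    # token -> payload hashes
        self.captures = defaultdict(list)   # payload hash -> [(uri, timestamp)]
        self.indexed = set()                # hashes already tokenized

    def add_cdxj(self, lines):
        for line in lines:
            # Skip blank lines and metadata lines (assumed to start with "!").
            if not line.strip() or line.startswith("!"):
                continue
            _, timestamp, attrs = parse_cdxj_line(line)
            # Assumption: the payload hash is the last segment of the locator.
            payload_hash = attrs["locator"].rsplit("/", 1)[-1]
            uri = attrs.get("original_uri", "")
            self.captures[payload_hash].append((uri, timestamp))
            if payload_hash in self.indexed:
                continue  # content-addressed, so one pass is enough
            text = self.fetch_payload(payload_hash)
            for token in set(TOKEN_RE.findall(text.lower())):
                self.postings[token].add(payload_hash)
            self.indexed.add(payload_hash)

    def search(self, query):
        """Return (uri, timestamp) captures whose payload has every query term."""
        tokens = TOKEN_RE.findall(query.lower())
        if not tokens:
            return []
        hashes = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return sorted(c for h in hashes for c in self.captures[h])
```

Keying the postings on the payload hash rather than on the capture means that repeated captures of identical content cost only one fetch and one tokenization pass. A real implementation would likely need persistence and MIME-type filtering, which this sketch leaves out.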