Allow distribution of gowarcservers with a "parent->child" relationship #4

Avokadoen · 2021-02-09T12:19:33Z

Based on a meeting with @maeb, who had an idea for a potential direction to improve gowarcserver.

Is your feature request related to a problem? Please describe.
This will solve two problems:

1. In the Loke GUI you can see all the different collections on the main page. If you have many collections, finding a given WARC record can be cumbersome, because you either have to know which collection holds the record or search each collection manually.
2. Optimizing gowarcserver by distributing indexing and searching.

Describe the solution you'd like

We can structure gowarcservers like a tree. Each node in the tree can hold records and N child nodes. Arguments or the config should let you point the gowarcserver being started at its child nodes. When a server receives a query, it should process the query itself while also asking all of its children to do the same. How results are collected is left undefined for now: for example, the server could return the first item found and discard the outstanding requests to its children, or wait for all children to answer before aggregating the result. It's important to note that, based on the diagram, the only difference between a parent node and a leaf node is that the leaf node has no registered children. Programmatically they should be identical.

Problem 1 will be solved by introducing the concept of a parent-child relation. It will allow us to set up a network of servers where a root instance aggregates queries throughout the gowarcserver network. Loke will only have to know about the root. As a result, the end user does not have to care about which collection contains the target record.

Problem 2 will be solved by dispatching queries to the children and to the node itself using goroutines and aggregating the results, which should let queries scale as the amount of data grows. Indexing of records will also be distributed without being locked to a topic or area (e.g. all indexing of newspapers having to be central).

It's worth noting that this will introduce greater complexity to the codebase, and abusing the tree structure might lead to slower results, as requests will be chained based on tree depth.

This will also open up future optimizations, for example caching common queries where no changes have been made in the db, or skipping nodes when we already know the target node for a query.

Additional context
Google's talk about Go servers (mainly from slide 33 onward)
Potential API http://timetravel.mementoweb.org/guide/api/

Avokadoen · 2021-04-04T10:29:42Z

Implementation idea:
To avoid implementations that lock containers into a single pod, the containers should communicate using http(s) requests. The reason for this is that it delivers security: containers do not need to know more than their children's URLs, which allows for sandboxing if required.
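A minimal sketch of what "a node only knows its children's URLs" could look like in Go, using only the standard library. The function names `childRequestURL` and `fetchChild`, and the idea of forwarding the incoming path and query string unchanged, are assumptions for illustration and not the project's actual API.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// childRequestURL combines a child's base URL with the path and raw query
// string of the incoming request, so the child sees the same query.
func childRequestURL(childBase, path, rawQuery string) (string, error) {
	u, err := url.Parse(childBase)
	if err != nil {
		return "", fmt.Errorf("invalid child URL %q: %w", childBase, err)
	}
	// Join the two paths with exactly one slash between them.
	u.Path = strings.TrimSuffix(u.Path, "/") + "/" + strings.TrimPrefix(path, "/")
	u.RawQuery = rawQuery
	return u.String(), nil
}

// fetchChild performs the GET request against one child and returns its body.
// Cancelling ctx aborts the request, which a parent can use to drop
// outstanding child requests once it has what it needs.
func fetchChild(ctx context.Context, client *http.Client, target string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= http.StatusBadRequest {
		return nil, fmt.Errorf("child %s answered %s", target, resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```

For example, `childRequestURL("http://child-a:8080/warc/", "/search", "url=example.com")` yields `http://child-a:8080/warc/search?url=example.com`; the host name and path here are placeholders.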

I'm not very familiar with Kubernetes, so my initial idea for implementing this is to have a static URL that points to an API that describes the whole deployed hierarchy. Kubernetes would have to initialize by spinning up this service first. The nodes should handle being spawned before the service, e.g. by polling or by using some mechanism in Kubernetes. When a node is spawned, it requests its children from the service, and the service then notifies the node's parent about the new child (the service knows the API of the parent node and so can just send the child URL to the parent node). Authentication is important here to avoid hijacking attacks. For our use the risk of this is low, but it should be accounted for anyway. This would be a single-point-of-failure design, though ...

As stated above, my knowledge of Kubernetes is limited, so Kubernetes might already provide all or some of this functionality.

Resources to learn more:

Avokadoen · 2021-04-06T08:29:58Z

Standup MVP:
The node network only has to be configured through Kubernetes configs. The simplest solution would then be to use an environment variable with child URLs.
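One way the environment-variable approach could look is sketched below. The variable name `GOWARCSERVER_CHILD_URLS` and the comma-separated format are assumptions chosen for this example; the issue does not specify either.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// childURLsEnv is a hypothetical variable name, not one defined by gowarcserver.
const childURLsEnv = "GOWARCSERVER_CHILD_URLS"

// parseChildURLs splits a comma-separated list of child URLs and validates
// each entry. An empty value yields no children, i.e. a leaf node.
func parseChildURLs(raw string) ([]*url.URL, error) {
	var children []*url.URL
	for _, field := range strings.Split(raw, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue // tolerate empty entries and trailing commas
		}
		u, err := url.Parse(field)
		if err != nil {
			return nil, fmt.Errorf("invalid child URL %q: %w", field, err)
		}
		if (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
			return nil, fmt.Errorf("child URL %q must be an absolute http(s) URL", field)
		}
		children = append(children, u)
	}
	return children, nil
}

// childURLsFromEnv reads the child list from the process environment.
func childURLsFromEnv() ([]*url.URL, error) {
	return parseChildURLs(os.Getenv(childURLsEnv))
}
```

In a Kubernetes deployment the value would be set in the container's `env` section, for instance to `http://child-a:8080,http://child-b:8080` (placeholder host names). A node started without the variable simply has no children.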

Also:
veidemann-cache has similar behavior to what is needed

Avokadoen · 2021-04-08T14:44:38Z

Another resource: https://matthewpalmer.net/kubernetes-app-developer/articles/kubernetes-networking-guide-beginners.html

you can now construct a network of gowarcserver processes and ask a single gowarcserver process which will aggregate child processes query results.

Also expanded variable documentation in README

Also:
* reused behaviour related to child queries between each handler
* resourcehandler will only serve the first resource that has any bytes and a non-error status
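The "first resource that has any bytes and a non-error status" rule can be sketched as follows. This is not the code from the commit; `childResponse`, `usable` and `firstUsable` are hypothetical names, and treating any status below 400 as "non-error" is an assumption.

```go
package main

import "net/http"

// childResponse is a hypothetical container for one answer, whether it
// comes from the local node or from a child.
type childResponse struct {
	Status int
	Body   []byte
}

// usable reports whether a response has a non-error status and any bytes.
func usable(r childResponse) bool {
	return r.Status >= http.StatusOK && r.Status < http.StatusBadRequest && len(r.Body) > 0
}

// firstUsable reads responses in arrival order and returns the first usable
// one. It returns false if the channel is closed without a usable response.
func firstUsable(responses <-chan childResponse) (childResponse, bool) {
	for r := range responses {
		if usable(r) {
			return r, true
		}
	}
	return childResponse{}, false
}
```

A real handler would presumably also cancel the requests that are still in flight once a usable response has been found, for example through a shared `context.Context`.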

this commit implements queries to child nodes as described in issue nlnwa#4. To achieve this there is a new package 'localhttp' that creates an abstraction for, among other things, writing a response from either the local node or its children

Avokadoen added the enhancement and question labels Feb 9, 2021

Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue May 26, 2021

nlnwa#4: support for aggregated queries

bb2376e

you can now construct a network of gowarcserver processes and ask a single gowarcserver process which will aggregate child processes query results

Avokadoen mentioned this issue May 26, 2021

Enable distributed queries of gowarcserver processes #21

Closed

Avokadoen mentioned this issue May 28, 2021

Feature: Support running gowarcserver without badger #22

Closed

Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021

nlnwa#4: support aggregated queries (indexhandler)

ab158fc

Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021

nlnwa#4: support aggregated queries (resourcehandler)

7a61966

Also reused behaviour related to child queries between each handler

Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021

.nlnwa#4: support aggregated queries (resourcehandler)

de4ea3f

Also reused behaviour related to child queries between each handler

Avokadoen mentioned this issue Jun 9, 2021

Support aggregated queries #24

Merged

Avokadoen linked a pull request Jun 9, 2021 that will close this issue

Support aggregated queries #24

Merged

maeb closed this as completed in #24 Jun 18, 2021
