Allow distribution of gowarcservers with a "parent->child" relationship #4

Closed
Avokadoen opened this issue Feb 9, 2021 · 3 comments · Fixed by #24
Labels
enhancement New feature or request question Further information is requested

Comments

Avokadoen commented Feb 9, 2021

Based on meeting with @maeb. He had an idea of a potential direction to improve gowarcserver.

Is your feature request related to a problem? Please describe.
This will solve two problems.

  1. The Loke GUI lists every collection on its main page. With many collections, finding a given WARC record is cumbersome: you either have to know which collection holds the record or search each collection manually.
  2. Indexing and searching in gowarcserver could be optimized by distributing them.

Describe the solution you'd like

Gowarcserver network diagram

We can structure gowarcservers as a tree. Each node in the tree can hold records and N child nodes. Command-line arguments or the config file should let you point a gowarcserver at its child nodes when it starts up. When a server receives a query, it should process the query itself and also ask all of its children to do the same. How it should handle results is left undefined for now, e.g. discarding the outstanding requests to children and returning the first item found, or waiting for all children to answer before aggregating the results. Note that, based on the diagram, the only difference between a parent node and a leaf node is that the leaf has no registered children. Programmatically they should be identical.
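The tree idea can be illustrated with a small in-memory sketch. Everything here is hypothetical (the `Node` type, its fields, the `exampleTree` helper and the record IDs are made up for illustration, and real nodes would be separate server processes), but it shows that a leaf is just a node with an empty child list and runs the same code as a parent:

```go
package main

import "fmt"

// Node is a hypothetical in-memory stand-in for one gowarcserver instance.
// A leaf is simply a Node whose Children slice is empty.
type Node struct {
	Name     string
	Records  map[string]string // record ID -> payload (illustrative only)
	Children []*Node
}

// Query looks up id locally, then asks every child to do the same and
// collects all hits. Parent and leaf nodes run exactly the same code.
func (n *Node) Query(id string) []string {
	var hits []string
	if rec, ok := n.Records[id]; ok {
		hits = append(hits, n.Name+": "+rec)
	}
	for _, child := range n.Children {
		hits = append(hits, child.Query(id)...)
	}
	return hits
}

// exampleTree builds a root with two leaves (made-up data).
func exampleTree() *Node {
	leafA := &Node{Name: "leaf-a", Records: map[string]string{"rec-1": "payload-1"}}
	leafB := &Node{Name: "leaf-b", Records: map[string]string{"rec-2": "payload-2"}}
	return &Node{
		Name:     "root",
		Records:  map[string]string{},
		Children: []*Node{leafA, leafB},
	}
}

func main() {
	root := exampleTree()
	// The caller only talks to the root, regardless of where the record lives.
	fmt.Println(root.Query("rec-2"))
}
```

This variant waits for every child before returning, which is only one of the result-handling strategies left open above.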

Problem 1 is solved by introducing the parent-child relation. It lets us set up a network of servers in which a root instance aggregates queries across the whole gowarcserver network. Loke only has to know about the root, so the end user no longer has to care which collection contains the target record.

Problem 2 is solved because a node can use goroutines to query its children and itself concurrently, which should let queries scale as the data grows. Indexing of records is also distributed without tying it to a topic or area (e.g. all newspaper indexing having to be centralized).
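A minimal sketch of the concurrent part, assuming each index (the local one and each child) can be represented by a function. The `SearchFunc` type and the `fanOut` and `fixed` helpers are hypothetical names, not part of gowarcserver:

```go
package main

import (
	"fmt"
	"sync"
)

// SearchFunc is a hypothetical signature for "search one index".
type SearchFunc func(query string) []string

// fanOut runs the local search and every child search in its own goroutine,
// waits for all of them, and merges the results in a stable order.
func fanOut(query string, local SearchFunc, children []SearchFunc) []string {
	targets := append([]SearchFunc{local}, children...)
	results := make([][]string, len(targets))

	var wg sync.WaitGroup
	for i, search := range targets {
		wg.Add(1)
		go func(i int, search SearchFunc) {
			defer wg.Done()
			results[i] = search(query) // each goroutine writes only its own slot
		}(i, search)
	}
	wg.Wait()

	var merged []string
	for _, r := range results {
		merged = append(merged, r...)
	}
	return merged
}

// fixed returns a SearchFunc that always yields the given hits (demo only).
func fixed(hits ...string) SearchFunc {
	return func(string) []string { return hits }
}

func main() {
	merged := fanOut("example", fixed("local-hit"), []SearchFunc{
		fixed("child-1-hit"),
		fixed(), // a child with no matching records
	})
	fmt.Println(merged)
}
```

Because every goroutine writes to a distinct slice element, no mutex is needed, and the total wait is roughly that of the slowest target rather than the sum of all of them.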

Note that this introduces more complexity into the codebase, and that abusing the tree structure might lead to slower results, since requests are chained through each level of the tree.

This also opens up future optimizations, for example caching common queries when no changes have been made to the database, or skipping nodes when we already know which node holds the target of a query.

Additional context
Google's talk about Go servers (mainly from slide 33 onward)
Potential API http://timetravel.mementoweb.org/guide/api/

@Avokadoen Avokadoen added enhancement New feature or request question Further information is requested labels Feb 9, 2021

Avokadoen commented Apr 4, 2021

Implementation idea:
Implementations should avoid locking the containers into a single pod, so the containers should communicate using HTTP(S) requests. This also helps security: a container does not need to know more than its children's URLs, which allows for sandboxing if required.
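As a rough sketch of a parent asking one child over HTTP while knowing nothing but the child's URL: the `/search?q=` path, the `queryChild` function and the fake child server are assumptions made for this example and do not reflect gowarcserver's actual API.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// queryChild sends the query to one child over HTTP and returns the raw body.
// The "/search?q=" path is an assumption, not gowarcserver's real API.
func queryChild(ctx context.Context, client *http.Client, childURL, query string) (string, error) {
	target := childURL + "/search?q=" + url.QueryEscape(query)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		return "", err
	}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("child %s: unexpected status %s", childURL, resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// startFakeChild starts a throwaway HTTP server standing in for a child node.
func startFakeChild(name string) (string, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%s saw %q", name, r.URL.Query().Get("q"))
	}))
	return srv.URL, srv.Close
}

// demo queries a single fake child with a timeout so a slow child cannot
// block the parent indefinitely.
func demo() (string, error) {
	childURL, stop := startFakeChild("child-1")
	defer stop()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return queryChild(ctx, http.DefaultClient, childURL, "example.com")
}

func main() {
	body, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(body)
}
```

The context timeout matters here because requests are chained through each level of the tree, so one unresponsive child would otherwise stall every ancestor.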

I'm not very familiar with Kubernetes, so my initial idea is to have a static URL that points to an API describing the whole deployed hierarchy. Kubernetes would have to start this service first. Nodes should still cope with being spawned before the service, e.g. by polling or by using some mechanism in Kubernetes. When a node is spawned, it requests its children from the service, and the service then notifies the node's parent about the new child (the service knows the parent node's API, so it can simply send the child URL to the parent). Authentication is important here to avoid hijacking attacks. For our use the risk is low, but it should be accounted for anyway. This would be a single-point-of-failure design though ...

As stated above, my knowledge of Kubernetes is limited, so Kubernetes might already provide all or some of this functionality.

Resources to learn more:

Avokadoen commented

Standup MVP:
The node network only has to be configured through Kubernetes configs. The simplest solution would then be to use an environment variable containing the child URLs.
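A sketch of how such an environment variable might be read at startup. The variable name `GOWARCSERVER_CHILD_URLS` and the comma-separated format are assumptions for illustration; the thread does not specify either.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strings"
)

// childURLsEnv is a hypothetical variable name chosen for this example.
const childURLsEnv = "GOWARCSERVER_CHILD_URLS"

// parseChildURLs splits a comma-separated list of child URLs and validates
// each entry. An empty string yields no children, i.e. a leaf node.
func parseChildURLs(raw string) ([]*url.URL, error) {
	var children []*url.URL
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue // tolerate trailing commas and an unset variable
		}
		u, err := url.Parse(part)
		if err != nil {
			return nil, fmt.Errorf("invalid child URL %q: %w", part, err)
		}
		if u.Scheme != "http" && u.Scheme != "https" {
			return nil, fmt.Errorf("child URL %q must use http or https", part)
		}
		children = append(children, u)
	}
	return children, nil
}

func main() {
	children, err := parseChildURLs(os.Getenv(childURLsEnv))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d child node(s) configured\n", len(children))
}
```

In a Kubernetes deployment the variable would be set in the container spec, so the topology lives entirely in configuration and no discovery service is needed for the MVP.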

Also:
veidemann-cache has similar behavior to what is needed


Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue May 26, 2021
you can now construct a network of gowarcserver processes and ask a single gowarcserver process which will aggregate child processes query results
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue May 28, 2021
you can now construct a network of gowarcserver processes and ask a single gowarcserver process which will aggregate child processes query results

Also expanded variable documentation in README
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021
you can now construct a network of gowarcserver processes and ask a single gowarcserver process which will aggregate child processes query results

Also expanded variable documentation in README
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021
Also reused behaviour related to child queries between each handler
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 2, 2021
Also reused behaviour related to child queries between each handler
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 6, 2021
Also:
* reused behaviour related to child queries between each handler
* resourcehandler will only serve the first resource that has any bytes
  and a non-error status
Avokadoen added a commit to Avokadoen/gowarcserver that referenced this issue Jun 9, 2021
this commit implements queries to child nodes as described in issue nlnwa#4
To achieve this there is a new package 'localhttp' that creates an abstraction for, among other things,
writing a response from either the local node or its children
@Avokadoen Avokadoen linked a pull request Jun 9, 2021 that will close this issue
@maeb maeb closed this as completed in #24 Jun 18, 2021