GIP-0065: Subgraph Availability Manager #38

Maikol · 2024-03-11T20:53:44Z

Adding GIP for new contract SubgraphAvailabilityManager

hopeyen

looks great, just have some minor questions while reading through!

hopeyen · 2024-03-15T16:05:57Z

gips/0065-subgraph-availability-manager.md

+
+Given that all oracle operators are mandated to run the same software version, uniformity in voting outcomes is expected. Discrepancies in oracle votes are therefore an area of concern and warrant investigation. In instances where voting differences occur or concerns are raised to the foundation or the core developers, the Technical Advisory Board and the Subgraph Availability Oracles operators will appoint an investigator to look into the issue. This investigation aims to determine whether the discrepancy was due to an honest failure or a result of malicious intent. The investigators decision will be final.
+
+If an oracle operator is found to be acting maliciously, the Graph Council will nominate a new operator to replace them. In scenarios where an oracle consistently deviates from the expected voting pattern due to honest errors, The Graph Council will evaluate its performance. Persistent non-compliance with the charter's guidelines, even if unintentional, may lead to the replacement of the oracle operator to maintain the integrity and reliability of the network.


Should there be any contingency plans outlined for coordinated malicious behavior from 3 or more operators, of if 3 operators experienced the same honest mistake and a transaction has gone through?

With 5 operators in our system, it can still function correctly even if up to 2 operators are malicious or make honest mistakes, allowing time for corrective actions, such as replacing operators. If a transaction is processed due to an honest mistake, operators must rectify the error, and the oracles will automatically correct the transaction.

thanks for clarifying! is this something you would include in the GIP, such as how do we expect the operators to realize there was an erroneous transaction, and the process to rectify the error?

Erroneous transactions should be because of differences in configuration which in my opinion are covered in the GIP already in Subgraph Availability Oracle Software and Versioning section. Once configuration is fix there's nothing else the operator needs to do.

hopeyen · 2024-03-15T16:07:03Z

gips/0065-subgraph-availability-manager.md

+
+## 2. Appointment and Governance of Subgraph Availability Oracles Operators
+
+Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.


Is there any incentives for these operators to maintain integrity and high availability?

Oracles will be run initially by core devs so there's already incentives I believe 🙂

sounds reasonable, I surely hope so:)
would you include this assumption in the GIP and perhaps also alternative considerations when oracles aren't run by core devs, if we ever expect this to happen? I can also understand if you consider this to be out of scope

I'd upvote @hopeyen's proposal to include this info in the GIP. I had the exact same question reading through.

Sounds good, I'll include in the GIP that on this first iteration oracles will be run by core developer teams.

hopeyen · 2024-03-15T16:10:51Z

gips/0065-subgraph-availability-manager.md

+
+```
+
+It is strongly recommended that Oracle operators diversify their sources for `IPFS_NODE_URL` and `RPC_URL` configurations. Utilizing varied sources for these critical components not only enhances the robustness and reliability of the network but also mitigates potential risks associated with single points of failure.


Should there be any descriptions on operator's version upgrade process? especially when new versions become available, how to reach consensus among the oracle operators

The intention is so that operators are responsible to coordinate between themselves to reach consensus for versioning and configurations. I believe this process will be clearer once we start working together but I'm happy to hear suggestions on how that might look like and if we want to include that process on the GIP. In my opinion it could be something that changes from time to time while we get more experience so I like having the GIP be flexible.

Cool, I agree it would be good to keep the GIP flexible.

Perhaps some options could be provided within the GIP? Like if the coordination process should be documented somewhere (github or notion or onchain?), if it should be internal or public, how long should it take to migrate between versions, ... general considerations like so.

I'll update the GIP specifying that a group chat will be used as a communication channel and specify how long oracles have to update versions.

hopeyen · 2024-03-15T16:13:53Z

gips/0065-subgraph-availability-manager.md

+
+The initial mapping key corresponds to a public variable `currentNonce`. This nonce serves as a mechanism to invalidate all votes whenever a configuration is altered—be it updating the oracle operators list or modifying the `voteTimeLimit` parameter. The inner mapping stores the `subgraphDeploymentId` against the timestamp of each oracle's vote, using an array denoting the oracle index.
+
+The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.


this might be nuanced for the implementer: what happens if the votes are not unanimous? for example, if there's 2 positive and 3 negative votes, should the transaction still be triggered in favor of 3 negative votes without any additional actions?

Transaction will still be triggered when reaching the threshold of 3. All oracles should be running the same configs and same code so they should be voting equally. If oracles vote differently, the appropriate action is to investigate the discrepancy and respond accordingly based on the findings of the investigation.

anirudh2 · 2024-03-21T20:22:21Z

gips/0065-subgraph-availability-manager.md

+
+Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.
+
+The number of oracle operators is strictly limited to five, ensuring a balance between decentralization and manageability.


Why five? This seems pretty arbitrary. Could you explain your reasoning behind this number?

ensuring a balance between decentralization and manageability.

This is just sorta asserted as axiomatic.

For this first iteration operators will be core developer teams and five is the number of teams that are able to participate by running an oracle. This number could change in the future.

ensuring a balance between decentralization and manageability.

This is just sorta asserted as axiomatic.

Fair enough, I can remove this statement.

anirudh2 · 2024-03-21T20:31:05Z

gips/0065-subgraph-availability-manager.md

+
+# Motivation
+
+The Subgraph Availability Oracle (SAO) plays a crucial role in The Graph's ecosystem. Its main function is to verify the availability and validity of subgraph files. Currently there is a single instance of the SAO in operation executing open-source code accessible at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This SAO instance runs at five-minute intervals, performing a series of checks to assess the state of subgraph files. In instances where the SAO identifies unavailability or invalidity in a subgraph's files, it sends a transaction to the `RewardsManager` contract to deny indexing rewards for that subgraph.


Will this catch non-deterministic failures? If not, this seems like only half of the puzzle for enabling indexing rewards.

Subgraph deployments have rewards enabled by default. SAO then runs and performs these checks and denies rewards when needed. This is a service we have running today, the proposal is to have 5 operators instead of 1. Changes to how the SAO works are out of scope for this GIP.

anirudh2 · 2024-03-21T20:39:36Z

gips/0065-subgraph-availability-manager.md

+
+```
+
+It is strongly recommended that Oracle operators diversify their sources for `IPFS_NODE_URL` and `RPC_URL` configurations. Utilizing varied sources for these critical components not only enhances the robustness and reliability of the network but also mitigates potential risks associated with single points of failure.


Could you explain IPFS_NODE_URL and RPC_URL? Specifically, I'm curious about how varying these helps ensure that the "robustness and reliability of the network."

This are current single points of failure. If IPFS node is down or RPC is down the current system fails to perform its tasks. If we have 5 new oracles running this same configuration the single point of failure still exists so if possible operators should communicate with each other and use different components.

facuspagnuolo · 2024-03-21T22:15:53Z

gips/0065-subgraph-availability-manager.md

+
+# Detailed Specification
+
+The number of oracle operators will be strictly enforced as five. Upon contract initialization, five valid accounts must be provided, and each oracle must maintain their index position, crucial for passing it as a parameter when casting votes.


Any reason why enforcing only five operators? Would it be possible to make it at least five with a cap limit?

We're enforcing five operators because that's what we're launching with. We decided having fixed values for number of oracles and threshold was safer than having maximum and updating. This doesn't mean that in the future we can't have more than five oracles though, we can deploy a new contract with new configurations.

facuspagnuolo · 2024-03-21T22:18:54Z

gips/0065-subgraph-availability-manager.md

+
+## 3. Responsibility and Integrity
+
+Operators must ensure their oracles remain operational, acknowledging that while occasional outages are part of normal operations, efforts to minimize these are crucial. Operators should communicate their infrastructure choices with one another to prevent risks associated with common dependencies, thereby enhancing the network's overall resilience. Active participation in a group chat to discuss operational challenges and coordinate on mitigating risks is required. Operators who are unresponsive or consistently underperform may be recommended for replacement by the technical advisory board or fellow operators to the Graph Council.


It might be an overhead for now but we may want to think of any contingency process that could be triggered in case any of the operators starts behaving unexpectedly.

It's stated in the charter that operators not complying with the Charter will be subject to removal. Operators themselves are responsible for monitoring and investigating oracles behaving unexpectedly. In case an oracle is detected to be intentionally or consistently not following the Charter, operators will suggest to the Council that it should be replaced. The robustness of this system comes from the fact that it has 5 oracles and requires at least 3 to be running honestly, so even if 2 oracles are behaving dishonestly or unexpectedly, it's still operational.

madumas · 2024-03-22T13:51:15Z

gips/0065-subgraph-availability-manager.md

+
+# High-Level Description
+
+The new smart contract `SubgraphAvailabilityManager` will be exclusively deployed to the Arbitrum One network and will function as a mediator between five distinct oracle operators and the `RewardsManager` contract. It will feature a vote aggregation system and enforce an execution threshold requiring three votes to alter the state of a subgraph deployment within the `RewardsManager` contract. Additionally, it will incorporate a `voteTimeLimit` timeframe to monitor current votes, after which they expire and become invalid.


It might be worth pointing out what is the default availability status for subgraphs (presumably "available"), and that no initial action or vote is typically necessary when a new subgraph is added to the network

Sounds good, I can include that in Motivation together with the brief explanation of how the SAO works.

madumas · 2024-03-22T13:55:37Z

gips/0065-subgraph-availability-manager.md

+
+## 4. Subgraph Availability Oracle Software and Versioning
+
+Subgraph Availability Oracles operators must use the software available at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This ensures uniformity and accuracy in their assessments across the network. A specific commit or release version from this repository will be designated as the required version for all oracles. This version is subject to change over time as updates and improvements are made.


Is there a specification of the availability criteria, or the latest subgraph-oracle software release is considered authoritative.

It's up to the operators to reach consensus on which commit to use. In general I'd say the latest version would probably be used but it's up to them to agree.

madumas · 2024-03-22T14:05:47Z

gips/0065-subgraph-availability-manager.md

+
+The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.
+
+A `voteTimeLimit` parameter will be used to invalidate old votes, initially set to ten minutes. This duration is slightly longer than the current subgraph availability oracle configuration, which runs every five minutes, allowing time for oracles to cast their votes and accounting for potential delays in execution. This minimizes the need for frequent transactions. When comparing oracle votes, the `SubgraphAvailabilityManager` will use the `voteTimeLimit` to ensure the last recorded vote occurred within this timeframe. Votes older than the limit will not be considered valid, necessitating a new vote from the oracle if required. It's crucial to note that any adjustments to `voteTimeLimit` must always exceed the oracle's execution period to prevent the premature invalidation of votes before all oracles have participated. However, this timeframe should be kept as brief as possible, since an excessively long `voteTimeLimit` undermines its intended purpose. The council can modify this timeframe if desired but it must be set with consideration to the oracle’s iteration period to avoid repeated voting or expired votes.


If voteTimeLimit expires without a consensus vote, are oracles are expected to start a new voting round?

There is a scenario where a large spike in onchain gas price could deplete some of the oracle account balances, which will put them offline, and in turn force the other oracles to spend their remaining funds in doing repeated voting rounds, depleting their own funds in the process.

A backoff mechanism could help, but maybe the detailed software mechanism are outside the scope of this GIP?

Indeed, if voteTimeLimit expires oracles will start a new voting round. This should be covered in Responsibility and Integrity section of the Charter where operators are required to have oracles running at all times. Part of having the oracle running is having the oracle wallet funded. Changes to the oracle's code are out of scope for this GIP but I'm not sure if we want to backoff since fix should come from offline oracles. However, I agree that depleting funds with no purpose isn't ideal. This situation should be minimized by having 5 oracles and requiring 3 votes, so even with 2 oracles offline, the system should remain operational.

This could be avoided by having continuous voting. Instead of 10 minutes rounds, oracles can vote at any time. As soon as 3 vote "Deny", the subgraph is denied, and renabled as soon as 1 of the 3 vote "Allow". Non-voting oracles are considered a "Allow". That would remove the need for repeated identical votes.

What's the reasoning behind voting in rounds?

Sorry, not sure where I explained incorrectly but this is how it currently works, with the exception that it also requires 3 votes for an "Allow" transaction. Oracles run every 5 minutes, each time they run they'll vote if they see that a subgraph state needs to change. The 10 minutes voteTimeLimit is a safety mechanism so as to not leave the system where one malicious vote could trigger a change in RewardsManager. For example if two oracles incorrectly vote to deny a subgraph and after that they correct themselves if we don't expire those old votes a new oracle with just one vote could change the state for that subgraph.

I'm not sure I understand the reason behind the voting rounds. In this example, if two oracles incorrectly vote to deny, and they correct themselves minutes or hours later, couldn't they simply change their vote? I don't understand why we'd need the concept of a time limit to invalidate these initial votes.

Oracles only cast votes when a change in rewards is needed. If two oracles wrongly vote to deny a subgraph and after that they correct themselves they wouldn't be changing their vote in the smart contract, they would just stop sending transactions.

Data-Nexus · 2024-03-27T14:37:22Z

gips/0065-subgraph-availability-manager.md

+
+# Motivation
+
+The Subgraph Availability Oracle (SAO) plays a crucial role in The Graph's ecosystem. Its main function is to verify the availability and validity of subgraph files. Currently there is a single instance of the SAO in operation executing open-source code accessible at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This SAO instance runs at five-minute intervals, performing a series of checks to assess the state of subgraph files. In instances where the SAO identifies unavailability or invalidity in a subgraph's files, it sends a transaction to the `RewardsManager` contract to deny indexing rewards for that subgraph. By default, subgraphs are presumed to be available, and typically, no initial action or vote by the SAO is necessary when a new subgraph is added to the network.


With the current amount of subgraphs (1,000 on arb), how long does it take to process these checks? Are there any concerns with how this would operate with 100,000 subgraphs?

Also, are all checks related to items contained in the manifest (features/networks). Are there any circumstances where features are used in the mappings but not included in the manifest? ipfs.cat() as an example

You can find all checks it performs here. Most checks are for items contained in manifest, there's no check for features used in mappings but not included in manifest but we do restrict certain host function calls from ipfs.cat(). Also note that changes to SAO's code are out of scope for this GIP but the repository is open source so feel free to add comments there.

With the current amount of subgraphs (1,000 on arb), how long does it take to process these checks? Are there any concerns with how this would operate with 100,000 subgraphs?

Currently it takes 60 seconds to check all subgraphs. We might need a different approach for those numbers.

PedroMD · 2024-04-29T10:06:38Z

@tmigone , given that all the feedback was already addressed and that the GIP was already approved by the Council with GGP-0038, feel free to add me as a reviewer so I can merge it right away.

PedroMD

This all looks good to me. Furthermore:

The TAB has officially reviewed and voted on this GIP, signaling consensus on the design and operational details found in this GIP.
The Council approved GGP-0038

Maikol added 3 commits March 11, 2024 17:53

GIP-0065: Subgraph Availability Manager

0cf97b6

chore: add forum and audit links

1abf3eb

fix: remove URLs from config message

9c0cc30

hopeyen reviewed Mar 15, 2024

View reviewed changes

fix: update responsability and integrity section

d78cb30

anirudh2 reviewed Mar 21, 2024

View reviewed changes

facuspagnuolo reviewed Mar 21, 2024

View reviewed changes

madumas reviewed Mar 22, 2024

View reviewed changes

Maikol added 4 commits March 22, 2024 14:17

fix: add that operators will be core developers for now

18a92aa

fix: add communication channel and time limit for updates

793cb66

fix: remove statement

f3d372b

fix: add context about initial state of subgraphs files

cf99862

Data-Nexus reviewed Mar 27, 2024

View reviewed changes

pcarranzav requested a review from PedroMD April 29, 2024 14:00

PedroMD approved these changes May 7, 2024

View reviewed changes

PedroMD merged commit 286703f into graphprotocol:main May 7, 2024
1 check passed


		Given that all oracle operators are mandated to run the same software version, uniformity in voting outcomes is expected. Discrepancies in oracle votes are therefore an area of concern and warrant investigation. In instances where voting differences occur or concerns are raised to the foundation or the core developers, the Technical Advisory Board and the Subgraph Availability Oracles operators will appoint an investigator to look into the issue. This investigation aims to determine whether the discrepancy was due to an honest failure or a result of malicious intent. The investigators decision will be final.

		If an oracle operator is found to be acting maliciously, the Graph Council will nominate a new operator to replace them. In scenarios where an oracle consistently deviates from the expected voting pattern due to honest errors, The Graph Council will evaluate its performance. Persistent non-compliance with the charter's guidelines, even if unintentional, may lead to the replacement of the oracle operator to maintain the integrity and reliability of the network.


		## 2. Appointment and Governance of Subgraph Availability Oracles Operators

		Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.


		```

		It is strongly recommended that Oracle operators diversify their sources for `IPFS_NODE_URL` and `RPC_URL` configurations. Utilizing varied sources for these critical components not only enhances the robustness and reliability of the network but also mitigates potential risks associated with single points of failure.


		The initial mapping key corresponds to a public variable `currentNonce`. This nonce serves as a mechanism to invalidate all votes whenever a configuration is altered—be it updating the oracle operators list or modifying the `voteTimeLimit` parameter. The inner mapping stores the `subgraphDeploymentId` against the timestamp of each oracle's vote, using an array denoting the oracle index.

		The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.


		Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.

		The number of oracle operators is strictly limited to five, ensuring a balance between decentralization and manageability.


		# Motivation

		The Subgraph Availability Oracle (SAO) plays a crucial role in The Graph's ecosystem. Its main function is to verify the availability and validity of subgraph files. Currently there is a single instance of the SAO in operation executing open-source code accessible at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This SAO instance runs at five-minute intervals, performing a series of checks to assess the state of subgraph files. In instances where the SAO identifies unavailability or invalidity in a subgraph's files, it sends a transaction to the `RewardsManager` contract to deny indexing rewards for that subgraph.


		# Detailed Specification

		The number of oracle operators will be strictly enforced as five. Upon contract initialization, five valid accounts must be provided, and each oracle must maintain their index position, crucial for passing it as a parameter when casting votes.


		## 3. Responsibility and Integrity

		Operators must ensure their oracles remain operational, acknowledging that while occasional outages are part of normal operations, efforts to minimize these are crucial. Operators should communicate their infrastructure choices with one another to prevent risks associated with common dependencies, thereby enhancing the network's overall resilience. Active participation in a group chat to discuss operational challenges and coordinate on mitigating risks is required. Operators who are unresponsive or consistently underperform may be recommended for replacement by the technical advisory board or fellow operators to the Graph Council.


		# High-Level Description

		The new smart contract `SubgraphAvailabilityManager` will be exclusively deployed to the Arbitrum One network and will function as a mediator between five distinct oracle operators and the `RewardsManager` contract. It will feature a vote aggregation system and enforce an execution threshold requiring three votes to alter the state of a subgraph deployment within the `RewardsManager` contract. Additionally, it will incorporate a `voteTimeLimit` timeframe to monitor current votes, after which they expire and become invalid.


		## 4. Subgraph Availability Oracle Software and Versioning

		Subgraph Availability Oracles operators must use the software available at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This ensures uniformity and accuracy in their assessments across the network. A specific commit or release version from this repository will be designated as the required version for all oracles. This version is subject to change over time as updates and improvements are made.


		The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.

		A `voteTimeLimit` parameter will be used to invalidate old votes, initially set to ten minutes. This duration is slightly longer than the current subgraph availability oracle configuration, which runs every five minutes, allowing time for oracles to cast their votes and accounting for potential delays in execution. This minimizes the need for frequent transactions. When comparing oracle votes, the `SubgraphAvailabilityManager` will use the `voteTimeLimit` to ensure the last recorded vote occurred within this timeframe. Votes older than the limit will not be considered valid, necessitating a new vote from the oracle if required. It's crucial to note that any adjustments to `voteTimeLimit` must always exceed the oracle's execution period to prevent the premature invalidation of votes before all oracles have participated. However, this timeframe should be kept as brief as possible, since an excessively long `voteTimeLimit` undermines its intended purpose. The council can modify this timeframe if desired but it must be set with consideration to the oracle’s iteration period to avoid repeated voting or expired votes.

GIP-0065: Subgraph Availability Manager #38

GIP-0065: Subgraph Availability Manager #38

Conversation

Maikol commented Mar 11, 2024

hopeyen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PedroMD commented Apr 29, 2024

PedroMD left a comment

Choose a reason for hiding this comment