Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GIP-0065: Subgraph Availability Manager #38

Merged

Conversation

Maikol
Copy link
Member

@Maikol Maikol commented Mar 11, 2024

Adding GIP for new contract SubgraphAvailabilityManager

Copy link

@hopeyen hopeyen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, just have some minor questions while reading through!


Given that all oracle operators are mandated to run the same software version, uniformity in voting outcomes is expected. Discrepancies in oracle votes are therefore an area of concern and warrant investigation. In instances where voting differences occur or concerns are raised to the foundation or the core developers, the Technical Advisory Board and the Subgraph Availability Oracles operators will appoint an investigator to look into the issue. This investigation aims to determine whether the discrepancy was due to an honest failure or a result of malicious intent. The investigators decision will be final.

If an oracle operator is found to be acting maliciously, the Graph Council will nominate a new operator to replace them. In scenarios where an oracle consistently deviates from the expected voting pattern due to honest errors, The Graph Council will evaluate its performance. Persistent non-compliance with the charter's guidelines, even if unintentional, may lead to the replacement of the oracle operator to maintain the integrity and reliability of the network.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be any contingency plans outlined for coordinated malicious behavior from 3 or more operators, of if 3 operators experienced the same honest mistake and a transaction has gone through?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 5 operators in our system, it can still function correctly even if up to 2 operators are malicious or make honest mistakes, allowing time for corrective actions, such as replacing operators. If a transaction is processed due to an honest mistake, operators must rectify the error, and the oracles will automatically correct the transaction.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clarifying! is this something you would include in the GIP, such as how do we expect the operators to realize there was an erroneous transaction, and the process to rectify the error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erroneous transactions should be because of differences in configuration which in my opinion are covered in the GIP already in Subgraph Availability Oracle Software and Versioning section. Once configuration is fix there's nothing else the operator needs to do.


## 2. Appointment and Governance of Subgraph Availability Oracles Operators

Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any incentives for these operators to maintain integrity and high availability?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oracles will be run initially by core devs so there's already incentives I believe 🙂

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable, I surely hope so:)
would you include this assumption in the GIP and perhaps also alternative considerations when oracles aren't run by core devs, if we ever expect this to happen? I can also understand if you consider this to be out of scope

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd upvote @hopeyen's proposal to include this info in the GIP. I had the exact same question reading through.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll include in the GIP that on this first iteration oracles will be run by core developer teams.


```

It is strongly recommended that Oracle operators diversify their sources for `IPFS_NODE_URL` and `RPC_URL` configurations. Utilizing varied sources for these critical components not only enhances the robustness and reliability of the network but also mitigates potential risks associated with single points of failure.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be any descriptions on operator's version upgrade process? especially when new versions become available, how to reach consensus among the oracle operators

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is so that operators are responsible to coordinate between themselves to reach consensus for versioning and configurations. I believe this process will be clearer once we start working together but I'm happy to hear suggestions on how that might look like and if we want to include that process on the GIP. In my opinion it could be something that changes from time to time while we get more experience so I like having the GIP be flexible.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I agree it would be good to keep the GIP flexible.

Perhaps some options could be provided within the GIP? Like if the coordination process should be documented somewhere (github or notion or onchain?), if it should be internal or public, how long should it take to migrate between versions, ... general considerations like so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the GIP specifying that a group chat will be used as a communication channel and specify how long oracles have to update versions.


The initial mapping key corresponds to a public variable `currentNonce`. This nonce serves as a mechanism to invalidate all votes whenever a configuration is altered—be it updating the oracle operators list or modifying the `voteTimeLimit` parameter. The inner mapping stores the `subgraphDeploymentId` against the timestamp of each oracle's vote, using an array denoting the oracle index.

The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might be nuanced for the implementer: what happens if the votes are not unanimous? for example, if there's 2 positive and 3 negative votes, should the transaction still be triggered in favor of 3 negative votes without any additional actions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transaction will still be triggered when reaching the threshold of 3. All oracles should be running the same configs and same code so they should be voting equally. If oracles vote differently, the appropriate action is to investigate the discrepancy and respond accordingly based on the findings of the investigation.


Oracle operators are appointed by The Graph Council, which retains the authority to add or remove oracle operators, ensuring compliance with the guidelines and responsibilities outlined in this charter.

The number of oracle operators is strictly limited to five, ensuring a balance between decentralization and manageability.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why five? This seems pretty arbitrary. Could you explain your reasoning behind this number?

ensuring a balance between decentralization and manageability.

This is just sorta asserted as axiomatic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this first iteration operators will be core developer teams and five is the number of teams that are able to participate by running an oracle. This number could change in the future.

ensuring a balance between decentralization and manageability.

This is just sorta asserted as axiomatic.

Fair enough, I can remove this statement.


# Motivation

The Subgraph Availability Oracle (SAO) plays a crucial role in The Graph's ecosystem. Its main function is to verify the availability and validity of subgraph files. Currently there is a single instance of the SAO in operation executing open-source code accessible at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This SAO instance runs at five-minute intervals, performing a series of checks to assess the state of subgraph files. In instances where the SAO identifies unavailability or invalidity in a subgraph's files, it sends a transaction to the `RewardsManager` contract to deny indexing rewards for that subgraph.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this catch non-deterministic failures? If not, this seems like only half of the puzzle for enabling indexing rewards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subgraph deployments have rewards enabled by default. SAO then runs and performs these checks and denies rewards when needed. This is a service we have running today, the proposal is to have 5 operators instead of 1. Changes to how the SAO works are out of scope for this GIP.


```

It is strongly recommended that Oracle operators diversify their sources for `IPFS_NODE_URL` and `RPC_URL` configurations. Utilizing varied sources for these critical components not only enhances the robustness and reliability of the network but also mitigates potential risks associated with single points of failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain IPFS_NODE_URL and RPC_URL? Specifically, I'm curious about how varying these helps ensure that the "robustness and reliability of the network."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This are current single points of failure. If IPFS node is down or RPC is down the current system fails to perform its tasks. If we have 5 new oracles running this same configuration the single point of failure still exists so if possible operators should communicate with each other and use different components.


# Detailed Specification

The number of oracle operators will be strictly enforced as five. Upon contract initialization, five valid accounts must be provided, and each oracle must maintain their index position, crucial for passing it as a parameter when casting votes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why enforcing only five operators? Would it be possible to make it at least five with a cap limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're enforcing five operators because that's what we're launching with. We decided having fixed values for number of oracles and threshold was safer than having maximum and updating. This doesn't mean that in the future we can't have more than five oracles though, we can deploy a new contract with new configurations.


## 3. Responsibility and Integrity

Operators must ensure their oracles remain operational, acknowledging that while occasional outages are part of normal operations, efforts to minimize these are crucial. Operators should communicate their infrastructure choices with one another to prevent risks associated with common dependencies, thereby enhancing the network's overall resilience. Active participation in a group chat to discuss operational challenges and coordinate on mitigating risks is required. Operators who are unresponsive or consistently underperform may be recommended for replacement by the technical advisory board or fellow operators to the Graph Council.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be an overhead for now but we may want to think of any contingency process that could be triggered in case any of the operators starts behaving unexpectedly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's stated in the charter that operators not complying with the Charter will be subject to removal. Operators themselves are responsible for monitoring and investigating oracles behaving unexpectedly. In case an oracle is detected to be intentionally or consistently not following the Charter, operators will suggest to the Council that it should be replaced. The robustness of this system comes from the fact that it has 5 oracles and requires at least 3 to be running honestly, so even if 2 oracles are behaving dishonestly or unexpectedly, it's still operational.


# High-Level Description

The new smart contract `SubgraphAvailabilityManager` will be exclusively deployed to the Arbitrum One network and will function as a mediator between five distinct oracle operators and the `RewardsManager` contract. It will feature a vote aggregation system and enforce an execution threshold requiring three votes to alter the state of a subgraph deployment within the `RewardsManager` contract. Additionally, it will incorporate a `voteTimeLimit` timeframe to monitor current votes, after which they expire and become invalid.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth pointing out what is the default availability status for subgraphs (presumably "available"), and that no initial action or vote is typically necessary when a new subgraph is added to the network

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I can include that in Motivation together with the brief explanation of how the SAO works.


## 4. Subgraph Availability Oracle Software and Versioning

Subgraph Availability Oracles operators must use the software available at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This ensures uniformity and accuracy in their assessments across the network. A specific commit or release version from this repository will be designated as the required version for all oracles. This version is subject to change over time as updates and improvements are made.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specification of the availability criteria, or the latest subgraph-oracle software release is considered authoritative.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's up to the operators to reach consensus on which commit to use. In general I'd say the latest version would probably be used but it's up to them to agree.


The `executionThreshold` will be set to three during contract construction. Whenever an oracle casts a vote, the `SubgraphAvailabilityManager` contract will query the corresponding vote aggregation storage to verify if enough votes have been registered. Upon the third vote a transaction to `RewardsManager` is triggered to make the subgraph rewards state change.

A `voteTimeLimit` parameter will be used to invalidate old votes, initially set to ten minutes. This duration is slightly longer than the current subgraph availability oracle configuration, which runs every five minutes, allowing time for oracles to cast their votes and accounting for potential delays in execution. This minimizes the need for frequent transactions. When comparing oracle votes, the `SubgraphAvailabilityManager` will use the `voteTimeLimit` to ensure the last recorded vote occurred within this timeframe. Votes older than the limit will not be considered valid, necessitating a new vote from the oracle if required. It's crucial to note that any adjustments to `voteTimeLimit` must always exceed the oracle's execution period to prevent the premature invalidation of votes before all oracles have participated. However, this timeframe should be kept as brief as possible, since an excessively long `voteTimeLimit` undermines its intended purpose. The council can modify this timeframe if desired but it must be set with consideration to the oracle’s iteration period to avoid repeated voting or expired votes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If voteTimeLimit expires without a consensus vote, are oracles are expected to start a new voting round?

There is a scenario where a large spike in onchain gas price could deplete some of the oracle account balances, which will put them offline, and in turn force the other oracles to spend their remaining funds in doing repeated voting rounds, depleting their own funds in the process.

A backoff mechanism could help, but maybe the detailed software mechanism are outside the scope of this GIP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, if voteTimeLimit expires oracles will start a new voting round. This should be covered in Responsibility and Integrity section of the Charter where operators are required to have oracles running at all times. Part of having the oracle running is having the oracle wallet funded. Changes to the oracle's code are out of scope for this GIP but I'm not sure if we want to backoff since fix should come from offline oracles. However, I agree that depleting funds with no purpose isn't ideal. This situation should be minimized by having 5 oracles and requiring 3 votes, so even with 2 oracles offline, the system should remain operational.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be avoided by having continuous voting. Instead of 10 minutes rounds, oracles can vote at any time. As soon as 3 vote "Deny", the subgraph is denied, and renabled as soon as 1 of the 3 vote "Allow". Non-voting oracles are considered a "Allow". That would remove the need for repeated identical votes.

What's the reasoning behind voting in rounds?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure where I explained incorrectly but this is how it currently works, with the exception that it also requires 3 votes for an "Allow" transaction. Oracles run every 5 minutes, each time they run they'll vote if they see that a subgraph state needs to change. The 10 minutes voteTimeLimit is a safety mechanism so as to not leave the system where one malicious vote could trigger a change in RewardsManager. For example if two oracles incorrectly vote to deny a subgraph and after that they correct themselves if we don't expire those old votes a new oracle with just one vote could change the state for that subgraph.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the reason behind the voting rounds. In this example, if two oracles incorrectly vote to deny, and they correct themselves minutes or hours later, couldn't they simply change their vote? I don't understand why we'd need the concept of a time limit to invalidate these initial votes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oracles only cast votes when a change in rewards is needed. If two oracles wrongly vote to deny a subgraph and after that they correct themselves they wouldn't be changing their vote in the smart contract, they would just stop sending transactions.


# Motivation

The Subgraph Availability Oracle (SAO) plays a crucial role in The Graph's ecosystem. Its main function is to verify the availability and validity of subgraph files. Currently there is a single instance of the SAO in operation executing open-source code accessible at [The Graph's Subgraph Oracle repository](https://github.com/graphprotocol/subgraph-oracle). This SAO instance runs at five-minute intervals, performing a series of checks to assess the state of subgraph files. In instances where the SAO identifies unavailability or invalidity in a subgraph's files, it sends a transaction to the `RewardsManager` contract to deny indexing rewards for that subgraph. By default, subgraphs are presumed to be available, and typically, no initial action or vote by the SAO is necessary when a new subgraph is added to the network.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current amount of subgraphs (1,000 on arb), how long does it take to process these checks? Are there any concerns with how this would operate with 100,000 subgraphs?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, are all checks related to items contained in the manifest (features/networks). Are there any circumstances where features are used in the mappings but not included in the manifest? ipfs.cat() as an example

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can find all checks it performs here. Most checks are for items contained in manifest, there's no check for features used in mappings but not included in manifest but we do restrict certain host function calls from ipfs.cat(). Also note that changes to SAO's code are out of scope for this GIP but the repository is open source so feel free to add comments there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current amount of subgraphs (1,000 on arb), how long does it take to process these checks? Are there any concerns with how this would operate with 100,000 subgraphs?

Currently it takes 60 seconds to check all subgraphs. We might need a different approach for those numbers.

@PedroMD
Copy link
Contributor

PedroMD commented Apr 29, 2024

@tmigone , given that all the feedback was already addressed and that the GIP was already approved by the Council with GGP-0038, feel free to add me as a reviewer so I can merge it right away.

@pcarranzav pcarranzav requested a review from PedroMD April 29, 2024 14:00
Copy link
Contributor

@PedroMD PedroMD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks good to me. Furthermore:

@PedroMD PedroMD merged commit 286703f into graphprotocol:main May 7, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants