- An instance, further referred to as the deployer instance, from which these scripts will be run.
- Terraform. To install Terraform, proceed to the Install Terraform section.
- (Optional) AWS CLI. To install the AWS CLI, follow the instructions.
You will also need a set of keys: a `NODE KEY`, a `STASH` key, a `CONTROLLER` key, and `SESSION KEYS`. As of this release, in the Kusama and Westend networks there are 5 keys inside the `SESSION KEYS` object: GRANDPA, BABE, ImOnline, Parachains, and AuthorityDiscovery. You will have to generate all of them. You can do this either with the Subkey tool or on the PolkadotJS website.
| Key name | Key short name | Key type |
|---|---|---|
| NODE KEY | - | ed25519 |
| STASH | - | sr25519 |
| CONTROLLER | - | ed25519 |
| GRANDPA | gran | ed25519 |
| BABE | babe | sr25519 |
| I'M ONLINE | imon | sr25519 |
| PARACHAINS | para | sr25519 |
| AUTHORITY DISCOVERY | audi | sr25519 |
- Download Terraform.
- Unpack Terraform using the `unzip terraform*` command.
- Move the `terraform` binary to one of the folders listed in the `PATH` variable. For example: `sudo mv terraform /usr/local/bin/`
Either clone this repo using the `git clone` command or simply download it from the web and unpack it on the deployer node.
- Open the `aws` folder of the cloned (downloaded) repo.
- Create a `terraform.tfvars` file inside the `aws` folder of the cloned repo, where `terraform.tfvars.example` is located.
- Fill it with the appropriate variables. You can check a minimal example in the example file and the full list of supported variables (and their types) in the variables file. Fill the `validator_keys` variable with your SESSION KEYS. For key types, use the short names from the Keys reference table above.
- Set the `AWS_ACCESS_KEY` and `AWS_SECRET_KEY` environment variables.
- (Optional) You can place the Terraform state file either in an S3 bucket or on your local machine. To keep it on the local machine, rename the `remote-state.tf` file to `remote-state.tf.stop`. To place it in S3, create an S3 bucket and proceed to the next step; you will be asked interactively to provide S3 configuration details.
- Run `terraform init`.
- Run `terraform plan -out terraform.tfplan` and check the set of resources to be created on your cloud account.
- If you are okay with the proposed plan, run `terraform apply terraform.tfplan` to apply the deployment.
- After the deployment is complete, you can open your EC2 console to check that the instances were deployed successfully.
- (Optional) Subscribe to notifications. As of now, Terraform does not support automatic email alert creation due to an AWS API limitation. Thus, these scripts create an SNS topic that you should subscribe to manually to start receiving alert messages.
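As an illustration, a minimal `terraform.tfvars` might look like the sketch below. Apart from `validator_keys` and `failover_mode`, which are mentioned in this guide, the variable names and the map structure are hypothetical; always check `terraform.tfvars.example` and the variables file for the names and types your version actually expects.

```hcl
# Hypothetical sketch only; see terraform.tfvars.example for the real variable names.
prefix        = "plkdt"        # short alphanumeric system prefix (assumed name)
failover_mode = "distributed"  # or "single" for standalone mode

# Session keys by short name; the structure shown here is illustrative,
# check the variables file for the actual shape of validator_keys.
validator_keys = {
  # gran = "0x...", babe = "0x...", imon = "0x...", para = "0x...", audi = "0x..."
}
```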
- Into standalone mode: run `terraform plan -var failover_mode=single`, then `terraform apply -auto-approve -var delete_vms_with_api_in_single_mode=true -var failover_mode=single`
- Into distributed mode: run `terraform plan`, then `terraform apply -auto-approve`
- Apply with the following variable: `terraform apply -auto-approve -var expose_prometheus=true`
- Get the Terraform output: `terraform output prometheus_target`
- Adjust your Prometheus configuration file and import the Grafana dashboard.
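For example, a Prometheus scrape job for the exposed target might look roughly like the fragment below. The job name is arbitrary and the target is a placeholder that must be replaced with the value printed by `terraform output prometheus_target`; adapt it to your existing `prometheus.yml`.

```yaml
# Sketch of a prometheus.yml fragment; replace the target with the
# value returned by `terraform output prometheus_target`.
scrape_configs:
  - job_name: "polkadot-failover"        # arbitrary name
    static_configs:
      - targets:
          - "<prometheus_target output>" # placeholder, not a literal value
```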
- Watch Polkadot Telemetry for your node to synchronize with the network.
- Make sure you have funds on your STASH account. Bond your funds to the CONTROLLER account. For this and the following steps you can either perform a transaction on your node or use the PolkadotJS website. For this operation use a `staking.bond` transaction.
- Set your session keys on the network by performing a `session.setKeys` transaction. As the argument, pass all your session keys in hex format, concatenated one by one in the order specified here.
For example if you have the following keys:
GRAN - 0xbeaa0ec217371a8559f0d1acfcc4705b48082b7a02fd6cb2e76714380576151e
BABE - 0xec648f4ad1693cc61e340aa122c7142d7603e26e04a47a5f0811c31a60c07b49
IMON - 0x9633780f889f0fc6280adba40695139f77c00e53168544492c6fa2399b693e3c
PARA - 0xee383120ff7b87409e105de2b0150432a95153d0a1edd5bea0af669001b80f1d
AUDI - 0x701ed6b86f109a6d59d7933df3311c5b6edc3862657179259cb983149bfc404c
The argument for sessions.setKeys will be 0xbeaa0ec217371a8559f0d1acfcc4705b48082b7a02fd6cb2e76714380576151eec648f4ad1693cc61e340aa122c7142d7603e26e04a47a5f0811c31a60c07b499633780f889f0fc6280adba40695139f77c00e53168544492c6fa2399b693e3cee383120ff7b87409e105de2b0150432a95153d0a1edd5bea0af669001b80f1d701ed6b86f109a6d59d7933df3311c5b6edc3862657179259cb983149bfc404c
Note that only the leading `0x` of the first key is kept; the `0x` prefixes of all the other keys are omitted.
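The concatenation above can be scripted. This is a minimal shell sketch using the example keys from this section; substitute your own keys in practice:

```shell
# Example session keys (from this section); replace them with your own.
GRAN=0xbeaa0ec217371a8559f0d1acfcc4705b48082b7a02fd6cb2e76714380576151e
BABE=0xec648f4ad1693cc61e340aa122c7142d7603e26e04a47a5f0811c31a60c07b49
IMON=0x9633780f889f0fc6280adba40695139f77c00e53168544492c6fa2399b693e3c
PARA=0xee383120ff7b87409e105de2b0150432a95153d0a1edd5bea0af669001b80f1d
AUDI=0x701ed6b86f109a6d59d7933df3311c5b6edc3862657179259cb983149bfc404c

# Keep the 0x prefix of the first key only; strip it from the rest
# (${VAR#0x}) and concatenate in order: gran, babe, imon, para, audi.
SESSION_KEYS="${GRAN}${BABE#0x}${IMON#0x}${PARA#0x}${AUDI#0x}"
echo "$SESSION_KEYS"
```

The resulting string is what you pass as the argument to `session.setKeys`.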
- Start validating: perform a `staking.validate` transaction.
You will receive alerts in the following cases:
- No validator nodes are currently running
- More than 1 validator is currently running
- Node reports unhealthy status
Basically, there are two ways to find out which node is currently taking the lead. The first is to open the CloudWatch dashboard, select the alarm created by the Terraform script that monitors the number of validators, and check which of the nodes sends a metric value equal to 1. That is the node currently running the validator.

The other way is to SSH into each node in turn and run the `sudo docker ps -a --no-trunc` command. It shows the Docker containers running on that machine. Check the command used to launch each container: only one container on one instance will have the `--validator` argument in its launch command; all the other containers will have `--pruning=archive`.
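Instead of reading the launch commands by eye, you can filter them with `grep`. The snippet below runs against a hypothetical sample of launch commands so it is self-contained; on a real node you would pipe the actual `sudo docker ps -a --no-trunc` output instead:

```shell
# Hypothetical sample of container launch commands; on a real node,
# pipe the output of `sudo docker ps -a --no-trunc` instead.
sample_commands='polkadot --validator --name failover-node-1
polkadot --pruning=archive --name failover-node-2
polkadot --pruning=archive --name failover-node-3'

# Count containers started with the --validator flag; `--` stops grep from
# treating --validator as an option. Exactly one match across all instances
# means failover is in a healthy state.
validator_count=$(printf '%s\n' "$sample_commands" | grep -c -- '--validator')
echo "validators running: $validator_count"
```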
The prefix is used in the names of most resources, so they can be easily identified among others. This imposes a limitation, because not all of the deployed resources support long names or names with non-alphanumeric symbols. Optimally, use around 5 alphanumeric characters as the system prefix.
The current version of the failover scripts supports only Nitro instances as the node type. This is related to the way AWS manages disks: Nitro instances attach most disks as `/dev/xvd*`, whose numbering does not match what the EC2 console shows. This issue has to be fixed in the future, but for now we are open to any improvement PRs.
As of now, the implemented failover mechanism won't work if 2 of the 3 chosen regions go offline. Make sure to use geographically distributed regions to improve node stability.
Set the `delete_on_terminate` variable to `true` to override this behavior.