# Node Managing with CLAP

This notebook introduces CLAP's features for creating, managing, and destroying nodes. It walks through the `NodeManager` class, used to manage computing nodes, and the `ConfigurationDatabase` class, used to load CLAP's instance configurations from CLAP's configuration files. The search location is derived from the `CLAP_PATH` environment variable; by default it is `~/.clap/configs`.

Make sure you are executing this notebook inside CLAP's environment (`clap-env`) or inside a Docker container.

As this notebook lives in CLAP's `examples/api` directory, let's add `../..` to Python's module search path so that the interpreter can locate the CLAP package.

In [1]:
import sys
sys.path.append('../..')
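
A relative entry such as `'../..'` is resolved against the current working directory, so it stops working if the kernel later changes directory. The following is a minimal sketch of a more defensive variant, not part of CLAP; the helper name `add_to_sys_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_to_sys_path(relative_dir: str) -> str:
    """Resolve relative_dir to an absolute path and append it to sys.path once."""
    absolute = str(Path(relative_dir).resolve())
    # Avoid duplicate entries when the cell is executed more than once
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute


repo_root = add_to_sys_path('../..')
print(repo_root)
```

Because the path is made absolute at the time the cell runs, later `os.chdir` calls do not affect the import location.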

You may execute shell commands by prefixing them with an exclamation mark (`!`). For example, you may list all Python packages installed with pip.

In [2]:
!pip list

Package               Version
--------------------- -----------
ansible               5.8.0
ansible-core          2.12.6
ansible-runner        2.2.0
anyio                 3.6.1
argcomplete           2.0.0
argon2-cffi           21.3.0
argon2-cffi-bindings  21.2.0
asttokens             2.0.5
attrs                 21.4.0
Babel                 2.10.1
backcall              0.2.0
bcrypt                3.2.2
beautifulsoup4        4.11.1
bleach                5.0.0
boto                  2.49.0
boto3                 1.23.8
botocore              1.26.8
certifi               2022.5.18.1
cffi                  1.15.0
charset-normalizer    2.0.12
click                 8.1.3
coloredlogs           15.0.1
contextlib2           21.6.0
cryptography          37.0.2
cycler                0.11.0
dacite                1.6.0
debugpy               1.6.0
decorator             5.1.1
defusedxml            0.7.1
docutils              0.18.1
entrypoints           0.4
executing             0.8.3
fastjsonschema      

Let's perform some imports. To simplify creating the `NodeManager` and `ConfigurationDatabase` objects, let's use the defaults defined in `app.cli.modules.node`, which search for configurations at `$CLAP_PATH/configs` and use `$CLAP_PATH/storage/nodes.db` as the default node repository. 
- Note: by default, CLAP looks for information in the `~/.clap/` directory. However, if the `CLAP_PATH` environment variable is defined, CLAP looks for configurations and other information at `CLAP_PATH` instead.
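
The lookup rule in the note can be sketched in a few lines. This is an illustration of the described behaviour, not CLAP's actual implementation; the function name `resolve_clap_path` is hypothetical, and treating an empty `CLAP_PATH` like an unset one is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_clap_path(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return CLAP_PATH when it is set, otherwise fall back to ~/.clap."""
    environ = os.environ if environ is None else environ
    # Assumption: an empty CLAP_PATH is treated like an unset one
    configured = environ.get('CLAP_PATH')
    return Path(configured) if configured else Path.home() / '.clap'


base = resolve_clap_path()
configs_dir = base / 'configs'             # where the configuration files are searched
nodes_db = base / 'storage' / 'nodes.db'   # default node repository
print(configs_dir)
print(nodes_db)
```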

In [3]:
# Print the CLAP_PATH environment variable
!echo CLAP_PATH: "$CLAP_PATH"

CLAP_PATH: /home/lopani/tmp/eborin/mo833-atividade9/clap_config


In [4]:
# Import some useful modules
import time
import yaml
from dataclasses import asdict
from app.cli.modules.node import get_config_db, get_node_manager
from clap.utils import float_time_to_string

In [5]:
# Creating configuration database and node manager objects
configuration_db = get_config_db()
node_manager = get_node_manager()

`configuration_db` loads all instance configurations from `$CLAP_PATH/configs/instances.yaml` (`~/.clap/configs/instances.yaml` by default) and stores them in the `instance_descriptors` member. 
`instance_descriptors` is a dictionary whose keys are the names of the instance configurations in the instances file and whose values are dataclasses of type `InstanceInfo`.

Let's check the contents of `$CLAP_PATH/configs/instances.yaml`

In [6]:
!cat $CLAP_PATH/configs/instances.yaml

type-t2.small:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.small
    image_id: ami-0c4f7023847b90238
    security_group: open-security-group
    boot_disk_size: 16

type-t2.medium:
    provider: aws-config-us-east-1
    login: login-ubuntu
    flavor: t2.medium
    image_id: ami-0c4f7023847b90238
    security_group: open-security-group
    boot_disk_size: 16


In [7]:
all_instances_ids = list(configuration_db.instance_descriptors.keys())
print(f"All instance ids presented in my system: {', '.join(all_instances_ids)}")

All instance ids presented in my system: type-t2.small, type-t2.medium


In [8]:
# Let's pick the type-t2.medium instance info and inspect it
type_t2_instance_info = configuration_db.instance_descriptors['type-t2.medium']
print(f"Instance config: {type_t2_instance_info}")
# Instance infos are dataclasses, so you can access members using Python's member access syntax (via '.'). For instance:
flavor = type_t2_instance_info.instance.flavor
print(f"Instance flavor: {flavor}")

Instance config: InstanceInfo(provider=ProviderConfigAWS(provider_config_id='aws-config-us-east-1', region='us-east-1', access_keyfile='ec2_access_key.pub', secret_access_keyfile='ec2_access_key.pem', vpc=None, url=None, provider='aws'), login=LoginConfig(login_config_id='login-ubuntu', user='ubuntu', keypair_name='otavio_aws_2022_key', keypair_public_file='key.pub', keypair_private_file='key.pem', ssh_port=22, sudo=True, sudo_user='root'), instance=InstanceConfigAWS(instance_config_id='type-t2.medium', provider='aws-config-us-east-1', login='login-ubuntu', flavor='t2.medium', image_id='ami-0c4f7023847b90238', security_group='open-security-group', boot_disk_size=16, boot_disk_device=None, boot_disk_type=None, boot_disk_iops=None, boot_disk_snapshot=None, placement_group=None, price=None, timeout=None, network_ids=[]))
Instance flavor: t2.medium


In [9]:
# Dataclasses can be easily converted to a dict using the asdict function
type_t2_instance_info_dict = asdict(type_t2_instance_info)
# Let's print the dict in YAML syntax
print(yaml.dump(type_t2_instance_info_dict, indent=4))

instance:
    boot_disk_device: null
    boot_disk_iops: null
    boot_disk_size: 16
    boot_disk_snapshot: null
    boot_disk_type: null
    flavor: t2.medium
    image_id: ami-0c4f7023847b90238
    instance_config_id: type-t2.medium
    login: login-ubuntu
    network_ids: []
    placement_group: null
    price: null
    provider: aws-config-us-east-1
    security_group: open-security-group
    timeout: null
login:
    keypair_name: otavio_aws_2022_key
    keypair_private_file: key.pem
    keypair_public_file: key.pub
    login_config_id: login-ubuntu
    ssh_port: 22
    sudo: true
    sudo_user: root
    user: ubuntu
provider:
    access_keyfile: ec2_access_key.pub
    provider: aws
    provider_config_id: aws-config-us-east-1
    region: us-east-1
    secret_access_keyfile: ec2_access_key.pem
    url: null
    vpc: null
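
The descriptor layout can also be explored without cloud credentials. The sketch below uses stand-in dataclasses (`FakeInstanceConfig` and `FakeInstanceInfo` are hypothetical and carry only a few of the fields printed in this notebook) to show how a dictionary of nested dataclasses can be filtered by attribute and converted with `asdict`.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class FakeInstanceConfig:
    # Stand-in with a small subset of the fields printed in this notebook
    instance_config_id: str
    flavor: str
    boot_disk_size: int


@dataclass
class FakeInstanceInfo:
    # Stand-in for the descriptor; the real one also holds provider and login
    instance: FakeInstanceConfig


descriptors: Dict[str, FakeInstanceInfo] = {
    'type-t2.small': FakeInstanceInfo(FakeInstanceConfig('type-t2.small', 't2.small', 16)),
    'type-t2.medium': FakeInstanceInfo(FakeInstanceConfig('type-t2.medium', 't2.medium', 16)),
}


def ids_with_flavor(all_descriptors: Dict[str, FakeInstanceInfo], flavor: str) -> List[str]:
    """Return the configuration names whose instance uses the given flavor."""
    return [name for name, info in all_descriptors.items()
            if info.instance.flavor == flavor]


medium_ids = ids_with_flavor(descriptors, 't2.medium')
# asdict converts nested dataclasses recursively into nested dicts
as_plain_dict = asdict(descriptors['type-t2.medium'])
print(medium_ids)
print(as_plain_dict)
```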



VM nodes can be started using the `start_node` function. 
This function starts `N` CLAP nodes from a given `InstanceInfo`.
Use `start_nodes` to start different numbers of nodes from different `InstanceInfo` objects.

Let's start 2 nodes of `type_t2_instance_info`. The function returns the node IDs of the nodes that were successfully started.

In [10]:
started_node_ids = node_manager.start_node(type_t2_instance_info, count=2)

the implicit localhost does not match 'all'
based upon a deprecated version of the AWS SDKs and is deprecated in favor of
the ec2_instance module. Please update your tasks. This feature will be removed

PLAY [localhost] ***************************************************************

TASK [Starting 2 type-t2.medium instances (timeout 600 seconds)] ***************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Tagging instances] *******************************************************
changed: [localhost] => (item={'id': 'i-05ff0a138e58f73bf', 'name': 'FranklinClark-4559b6ba'})
cha

In [11]:
print(f"{len(started_node_ids)} nodes started: {started_node_ids}")

2 nodes started: ['4559b6baec444ee0a8435bb390bf096b', '1ee703179f944bcfbe3aee92578d19be']


Functions in CLAP's node manager usually operate on node IDs. 
The `get_*_nodes` functions (*e.g.*, `get_all_nodes`, `get_nodes_by_id`) return a list of `NodeDescriptor` objects. 
`NodeDescriptor` is a dataclass that describes all node information. 
As `NodeDescriptor` is a dataclass, it can be easily converted to a dict using the `asdict` function.

Let's get all nodes in CLAP and print them in YAML format. 

In [12]:
for node in node_manager.get_all_nodes():
    # Can be accessed with '.' operator
    node_id = node.node_id
    print('---------')
    print(f"Node Id: {node_id}, created at {float_time_to_string(node.creation_time)}; Status: {node.status}")
    print('---------')
    # Or can be converted to a dict
    node_dict = asdict(node)
    # Printing dict in YAML format
    print(yaml.dump(node_dict, indent=4))
    print('**********')

---------
Node Id: 4559b6baec444ee0a8435bb390bf096b, created at 30-05-22 10:50:14; Status: started
---------
cloud_instance_id: i-05ff0a138e58f73bf
cloud_lifecycle: normal
configuration:
    instance:
        boot_disk_device: null
        boot_disk_iops: null
        boot_disk_size: 16
        boot_disk_snapshot: null
        boot_disk_type: null
        flavor: t2.medium
        image_id: ami-0c4f7023847b90238
        instance_config_id: type-t2.medium
        login: login-ubuntu
        network_ids: []
        placement_group: null
        price: null
        provider: aws-config-us-east-1
        security_group: open-security-group
        timeout: null
    login:
        keypair_name: otavio_aws_2022_key
        keypair_private_file: key.pem
        keypair_public_file: key.pub
        login_config_id: login-ubuntu
        ssh_port: 22
        sudo: true
        sudo_user: root
        user: ubuntu
    provider:
        access_keyfile: ec2_access_key.pub
        provider: aws
    

Nodes with status `started` were started, but no SSH login has been performed yet. 
Once a login succeeds, the node's status changes to `reachable`. 
If the SSH login fails, the status becomes `unreachable`.

The `is_alive` function checks whether each node is alive and updates its information in the local database, including fields such as IP and status.
It returns a dictionary that maps each node ID to a boolean indicating whether the node is alive (*i.e.*, whether an SSH connection succeeded).

Note:
* This function may print "Unable to connect to port 22 on XXX.XXX.XXX.XXX" when a login attempt fails. In that case, it waits `wait_timeout` seconds and tries again, up to `retries` times.
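
The retry behaviour described in the note can be illustrated with a small, self-contained sketch. It does not call CLAP: `check_with_retries` and `try_login` are hypothetical names, and only `retries` and `wait_timeout` are taken from the note above.

```python
import time
from typing import Callable


def check_with_retries(try_login: Callable[[], bool],
                       retries: int = 3,
                       wait_timeout: float = 1.0) -> bool:
    """Call try_login up to `retries` times, sleeping between failed attempts."""
    for attempt in range(1, retries + 1):
        if try_login():
            return True
        # Do not sleep after the last attempt
        if attempt < retries:
            time.sleep(wait_timeout)
    return False


# Simulated login that fails twice (e.g. SSH daemon still booting) and then succeeds
outcomes = iter([False, False, True])
result = check_with_retries(lambda: next(outcomes), retries=5, wait_timeout=0.0)
print(f"alive: {result}")
```

This also explains why the "Unable to connect" messages can appear in the output even though the node is finally reported as alive: early attempts fail while the machine is still booting.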

In [13]:
alive_nodes = node_manager.is_alive(started_node_ids)
for node_id, alive_flag in alive_nodes.items():
    alive_str = 'alive' if alive_flag else 'not alive'
    print(f"{node_id} --> {alive_str}.")

Error executing command in node 4559b6ba: [Errno None] Unable to connect to port 22 on 3.84.185.3
Error executing command in node 1ee70317: [Errno None] Unable to connect to port 22 on 54.160.254.244
Error executing command in 4559b6ba: [Errno None] Unable to connect to port 22 on 3.84.185.3.
Error executing command in 1ee70317: [Errno None] Unable to connect to port 22 on 54.160.254.244.


4559b6baec444ee0a8435bb390bf096b --> alive.
1ee703179f944bcfbe3aee92578d19be --> alive.
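
Since the result is a plain dictionary from node ID to boolean, it can be partitioned with ordinary Python. A minimal sketch follows; the helper `split_by_liveness` is hypothetical, and the `False` entry is made up for illustration (both nodes were alive in the run above).

```python
from typing import Dict, List, Tuple


def split_by_liveness(alive_map: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Split an {node_id: alive} mapping into (alive_ids, not_alive_ids)."""
    alive = [node_id for node_id, ok in alive_map.items() if ok]
    not_alive = [node_id for node_id, ok in alive_map.items() if not ok]
    return alive, not_alive


# Illustrative data only: the second value is set to False on purpose
example = {
    '4559b6baec444ee0a8435bb390bf096b': True,
    '1ee703179f944bcfbe3aee92578d19be': False,
}
alive_ids, not_alive_ids = split_by_liveness(example)
print(f"alive: {alive_ids}")
print(f"not alive: {not_alive_ids}")
```

The list of non-alive IDs could then be passed to another liveness check or to a cleanup step.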


Once a node is detected to be alive, its status changes to `reachable`.

CLAP does not check node status periodically. Consider calling `is_alive` occasionally to refresh node information. 

The `get_nodes_by_id` function returns the full node information (`NodeDescriptor` objects) for the specified node IDs.

In [14]:
nodes = node_manager.get_nodes_by_id(started_node_ids)
for node in nodes:
    print(f"{node.node_id}: Status: {node.status}; IP: {node.ip}")

4559b6baec444ee0a8435bb390bf096b: Status: reachable; IP: 3.84.185.3
1ee703179f944bcfbe3aee92578d19be: Status: reachable; IP: 54.160.254.244


## Tags

Nodes can be annotated with tags, and node IDs can be retrieved based on these tags.  
Tags make it easier to interact with nodes. For example, the user may add tags to identify worker and manager nodes. 
Tags are a dict whose keys and values are strings; they can be added to nodes with the `add_tags` function and removed with the `remove_tags` function. 
Each function takes a list of node IDs and a dictionary with the tags to add or remove.

* Note: CLAP tags are not the same as the tags used by cloud providers; hence, cloud provider tags (*e.g.*, AWS VM node tags) are not updated or retrieved when using the CLAP tags interface.

Let's add some fictitious tags to the two created nodes.

In [15]:
# Let's add these tags to started_node_ids[0] only
tags = {
    'example-notebook': 'cool',
    'uuid': 'worker-0'
}
node_ids_with_tags_added = node_manager.add_tags([started_node_ids[0]], tags)
print(f'Tags {tags} were added to nodes: {node_ids_with_tags_added}')

# Let's add these tags to started_node_ids[1] only
tags = {
    'example-notebook': 'cool',
    'uuid': 'worker-1'
}
node_ids_with_tags_added = node_manager.add_tags([started_node_ids[1]], tags)
print(f'Tags {tags} were added to nodes: {node_ids_with_tags_added}')

Tags {'example-notebook': 'cool', 'uuid': 'worker-0'} were added to nodes: ['4559b6baec444ee0a8435bb390bf096b']
Tags {'example-notebook': 'cool', 'uuid': 'worker-1'} were added to nodes: ['1ee703179f944bcfbe3aee92578d19be']


We can easily get full node information based on node tags, using `get_nodes_with_tag` or `get_nodes_with_tag_value`. The former returns all nodes that contain the specified tag key; the latter returns the nodes whose tag has the specified value.

In [16]:
# Get nodes that have tag 'example-notebook'
node_descriptors = node_manager.get_nodes_with_tag('example-notebook')
node_ids = [node.node_id for node in node_descriptors]
print(f"{len(node_ids)} nodes with example-notebook tag:  {node_ids}")

# Get nodes that have tag 'uuid' with value 'worker-0'
node_descriptors = node_manager.get_nodes_with_tag_value('uuid', 'worker-0')
node_ids = [node.node_id for node in node_descriptors]
print(f"{len(node_ids)} node with work-0 tag:  {node_ids}")


2 nodes with example-notebook tag:  ['4559b6baec444ee0a8435bb390bf096b', '1ee703179f944bcfbe3aee92578d19be']
1 node with work-0 tag:  ['4559b6baec444ee0a8435bb390bf096b']
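
The difference between the two queries can be reproduced with an in-memory stand-in. This is a sketch of the semantics described in this section, not CLAP's code: `FakeNode`, `nodes_with_tag`, and `nodes_with_tag_value` are hypothetical, and tags are modelled as a dict of string keys to string values, as stated above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeNode:
    # Stand-in holding only the fields needed for the tag queries
    node_id: str
    tags: Dict[str, str] = field(default_factory=dict)


def nodes_with_tag(nodes: List[FakeNode], key: str) -> List[FakeNode]:
    """Nodes that have the tag key, whatever its value."""
    return [node for node in nodes if key in node.tags]


def nodes_with_tag_value(nodes: List[FakeNode], key: str, value: str) -> List[FakeNode]:
    """Nodes that have the tag key set to exactly the given value."""
    return [node for node in nodes if node.tags.get(key) == value]


nodes = [
    FakeNode('node-a', {'example-notebook': 'cool', 'uuid': 'worker-0'}),
    FakeNode('node-b', {'example-notebook': 'cool', 'uuid': 'worker-1'}),
    FakeNode('node-c'),  # untagged node
]
by_key = [n.node_id for n in nodes_with_tag(nodes, 'example-notebook')]
by_value = [n.node_id for n in nodes_with_tag_value(nodes, 'uuid', 'worker-0')]
print(by_key)
print(by_value)
```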


## Pausing and resuming nodes

CLAP supports pausing and resuming nodes. 
Pausing does not terminate or reboot nodes; it only suspends them. 
Once a node is paused, its status changes to `paused` and its IP to `None`. 
Once the node is resumed, its status changes to `started`. 
The `resume_nodes` function also tries to log into the node again: if the login succeeds, the status changes to `reachable`; otherwise it changes to `unreachable`.

The `pause_nodes` and `resume_nodes` methods take node IDs as input and return a list of the node IDs for which the operation succeeded.

Notes:
* Pausing nodes that are already paused, or resuming nodes that are already running, has no effect.
* If a paused node is resumed outside CLAP (*e.g.*, on the web console or by another tool), you can use the `is_alive` function to update the node's status and other information.
* If the resume operation is called before the instance is effectively paused, an error is raised. With the AWS provider, the error is something like: `The instance 'i-09XXXXX' is not in a state from which it can be started`
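
To avoid resuming too early, a fixed sleep can be replaced by polling until the provider reports that the pause has completed. The sketch below is generic and does not use CLAP: `wait_until` is a hypothetical helper, and the predicate here is simulated; in practice it would be whatever provider-side state check is available to you.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 120.0,
               interval: float = 5.0) -> bool:
    """Poll predicate until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated provider-side state: the instance needs two polls to finish pausing
states = iter(['pausing', 'pausing', 'paused'])
finished = wait_until(lambda: next(states) == 'paused', timeout=5.0, interval=0.0)
print(f"pause completed: {finished}")
```

Polling with a deadline returns as soon as the state change is observed and fails explicitly when it never happens, whereas a fixed sleep may be either too short or needlessly long.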

In [17]:
# Pausing nodes will pause nodes based on their node ids. It will return a list of successfully paused node ids
paused_node_ids = node_manager.pause_nodes(started_node_ids)
print(f"Paused {len(paused_node_ids)} nodes: {paused_node_ids}")

# Printing node status
nodes = node_manager.get_nodes_by_id(paused_node_ids)
for node in nodes:
    print(f'Node {node.node_id} (nickname: {node.nickname}), status: {node.status}, IP: {node.ip}')

the implicit localhost does not match 'all'
based upon a deprecated version of the AWS SDKs and is deprecated in favor of
the ec2_instance module. Please update your tasks. This feature will be removed

PLAY [localhost] ***************************************************************

TASK [Pausing nodes `CindyCarrico, FranklinClark`] *****************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
Paused 2 nodes: ['4559b6baec444ee0a8435bb390bf096b', '1ee703179f944bcfbe3aee92578d19be']
Node 4559b6baec444ee0a8435bb390bf096b (nickname: FranklinClark), status: paused, IP: None
Node 1ee703179f944bcfbe3aee92578d19be (nickname: CindyCarrico), status: paused, IP: None


In [18]:
# Let's wait ~1 minute for the cloud provider to complete the state change (from running to paused). Then we will try to resume
time.sleep(60)

In [19]:
# Resuming nodes will resume nodes based on their node ids. It will return a list of successfully resumed node ids
resumed_node_ids = node_manager.resume_nodes(paused_node_ids)
print(f"Resumed {len(resumed_node_ids)} nodes: {resumed_node_ids}")

# Printing node status
nodes = node_manager.get_nodes_by_id(resumed_node_ids)
for node in nodes:
    print(f'Node {node.node_id} (nickname: {node.nickname}), status: {node.status}, IP: {node.ip}')

the implicit localhost does not match 'all'
based upon a deprecated version of the AWS SDKs and is deprecated in favor of
the ec2_instance module. Please update your tasks. This feature will be removed

PLAY [localhost] ***************************************************************

TASK [Resuming nodes `CindyCarrico, FranklinClark`] ****************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0


Error executing command in node 1ee70317: [Errno None] Unable to connect to port 22 on 34.224.99.201
Error executing command in node 4559b6ba: [Errno None] Unable to connect to port 22 on 18.212.248.79
Error executing command in 4559b6ba: [Errno None] Unable to connect to port 22 on 18.212.248.79.
Error executing command in 1ee70317: [Errno None] Unable to connect to port 22 on 34.224.99.201.


Resumed 2 nodes: ['4559b6baec444ee0a8435bb390bf096b', '1ee703179f944bcfbe3aee92578d19be']
Node 4559b6baec444ee0a8435bb390bf096b (nickname: FranklinClark), status: reachable, IP: 18.212.248.79
Node 1ee703179f944bcfbe3aee92578d19be (nickname: CindyCarrico), status: reachable, IP: 34.224.99.201


## Stopping (Terminating) nodes

Finally, nodes can be stopped (terminated at the cloud provider) using the `stop_nodes` method. 
The method takes node IDs as input and returns a list of the node IDs that were successfully stopped. 
Stopped nodes are automatically removed from the node repository unless the `remove_nodes` parameter is passed to the method as `False`.
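
Because the method returns only the IDs that succeeded, the IDs that failed can be derived by comparing the request with the result. A minimal sketch follows; the helper `find_failed` is hypothetical and the IDs are placeholders.

```python
from typing import Iterable, List


def find_failed(requested: Iterable[str], succeeded: Iterable[str]) -> List[str]:
    """Return the requested IDs that are missing from the succeeded IDs, in request order."""
    done = set(succeeded)
    return [node_id for node_id in requested if node_id not in done]


# Placeholder IDs for illustration
requested_ids = ['node-a', 'node-b', 'node-c']
reported_ok = ['node-c', 'node-a']
failed_ids = find_failed(requested_ids, reported_ok)
print(f"failed: {failed_ids}")
```

The same comparison applies to any operation in this notebook that returns the list of node IDs it handled successfully.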

In [20]:
stopped_node_ids = node_manager.stop_nodes(resumed_node_ids)
print(f"Stopped {len(stopped_node_ids)} nodes: {stopped_node_ids}")

the implicit localhost does not match 'all'
based upon a deprecated version of the AWS SDKs and is deprecated in favor of
the ec2_instance module. Please update your tasks. This feature will be removed

PLAY [localhost] ***************************************************************

TASK [Stopping nodes CindyCarrico, FranklinClark] ******************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0
Stopped 2 nodes: ['4559b6baec444ee0a8435bb390bf096b', '1ee703179f944bcfbe3aee92578d19be']


In [21]:
# No more nodes in repository...
nodes = node_manager.get_all_nodes()
print(f"Got {len(nodes)} nodes: {nodes}")

Got 0 nodes: []
