OpenStack StarlingX

StarlingX is packaged Kubernetes and (optionally) OpenStack with few additions (Ceph FS, multiple Helm application Manager - AirshipArmada or in future FluxCD) for so called "Edge Cloud".

To quote https://docs.starlingx.io/

StarlingX is a fully integrated edge cloud software stack that provides everything needed to deploy an edge cloud on one, two, or up to 100 servers.

Edge means that every Location (for example company branch office) has 1 or few local servers that has installed fully autonomous OpenStack (and few other components) for reliable and low latency computing (meaning: be independent of Internet connection to external Cloud).

one important feature is, for example, that all required repositories (YUM for OS and Docker Registry) are locally mirrored. So you should be able to reinstall your applications even when Internet connection is broken.
another important feature is sysinv (System Inventory) - all resources (including CPus, Memory, Disk Partitions LVM (LVs, VGs, PVs), network interfaces, Ceph,...) are managed using system command (which is just client to API server) and it enforces known system state using Puppet provisioning tool.

Many parts of StarlingX are developed and supported by WindRiver. Please see https://www.windriver.com/studio/operator/starlingx for more information.

Project homepage is on:

https://www.starlingx.io

Please note that such Production server(s) should have at least 32GB RAM and 500GB SSD disk space as can be found on:

https://docs.starlingx.io/planning/openstack/hardware-requirements.html

However for test we may try nested VM using Libvirt.

Setup

We will use nested VM using Libvirt. There is official installation guide on: https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex.html that we will mostly follow.

I will use Ubuntu 20.04 LTS VM in Azure as "Host PC".

Our Azure VM must meet 3 requirements:

must support nested virtualization: see https://azure.microsoft.com/en-us/blog/introducing-the-new-dv3-and-ev3-vm-sizes/ for list os supported VMs
must have at least around ~20GB of RAM (nested VM requires around 18 GB of RAM)
must have at least 8 cores (nested VM requires 6 cores)
I have selected Standard_D8s_v3
WARNING! As of 2022-07-16 such VM costs around $300/month (!)
I strongly recommend to monitor your spending and add Auto Shutdown feature (omitted in script!)

You have to update at least these variables before running the script below:

subnet=xxxxx to point to your Subnet in your Virtual Net (vNet)
ssh_key_path=pwd/hp_vm2.pub to point to your SSH public key that you will use to connect to VM

Here is my create_vm_ubuntu_for_stx.sh script to setup VM openstack-stx. Run it in Azure Bash in Azure portal: with public IP:

#!/bin/bash

set -ue -o pipefail
# Your SubNet ID
subnet=/subscriptions/xxx/resourceGroups/VpnGatewayRG101/providers/Microsoft.Network/virtualNetworks/VNet101/subnets/FrontEnd 
ssh_key_path=`pwd`/hp_vm2.pub 

rg=OsStxRG
loc=germanywestcentral
vm=openstack-stx
IP=$vm-ip
opts="-o table"
# URN from command:
# az vm image list --all -l germanywestcentral -f 0001-com-ubuntu-server-focal -p canonical -s 20_04-lts-gen2 -o table 
image=Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest

set -x
az group create -l $loc -n $rg $opts
az network public-ip create -g $rg -l $loc --name $IP --sku Basic $opts
az vm create -g $rg -l $loc \
    --image $image  \
    --nsg-rule NONE \
    --subnet $subnet \
    --public-ip-address "$IP" \
    --storage-sku Premium_LRS \
    --size Standard_D8s_v3 \
    --os-disk-size-gb 128 \
    --ssh-key-values $ssh_key_path \
    --admin-username azureuser \
    -n $vm $opts
set +x
cat <<EOF
You may access this VM in 2 ways:
1. using Azure VPN Gateway 
2. Using Public IP - in such case you need to add appropriate
   SSH allow in rule to NSG rules of this created VM
EOF
exit 0

Follow above instructions and login to above VM to continue.

Note: in text below I will call:

Host - parent Azure VM (openstack-stx)
Libvirt VM - nested VM running StarlingX controller, with libvirt machine name (called domain - from Xen times) simplex-controller-0. Once this nested VM is insatalled it wil have hostname controller-0

Verify that your Azure VM supports nested virtualization:

$ ls -l /dev/kvm

crw-rw---- 1 root kvm 10, 232 Jul 16 06:42 /dev/kvm

If above device does not exist you need to use different type (called Size in Azure) of VM.

NOTE!

Just recently found great article Deploy a virtual StarlingX Simplex node on

https://ericho.github.io/2019-09-12-deploy-virtual-starlingx/

https://opendev.org/starlingx/test/src/branch/master/automated-robot-suite/README.rst where author use this StarlingX test project:
git clone https://opendev.org/starlingx/test.git
cd automated-robot-suite
It seems to be even even more comfortable (just specifying which python test suite to run).

However I did not try it yet.

Inside VM prepare system:

sudo apt-get update
sudo apt-get dist-upgrade
# reboot recommended if kernel or critical system components (libc)
# were updated.

Please ensure that your shell is Bash:

echo $SHELL
/bin/bash

Now we will get source and install required packages:

sudo apt-get install git
cd
git clone https://opendev.org/starlingx/tools.git stx-tools
cd stx-tools/
git describe --always --long
# my version is: vr/stx.6.0-404-g591be74
cd deployment/libvirt/
sudo ./install_packages.sh 

Package libvirt-bin is not available, but is referred to by another package.

Ubuntu 20.04 LTS no longer contains above package so we have to fix it manually:

sudo apt-get install virt-manager

Optional: if you want to run virsh command without sudo you can add yourself to libvirt group using:

sudo /usr/sbin/usermod -G libvirt -a $USER

Logout and login to Host (Azure VM) so this change take in effect.

Run again:

sudo ./install_packages.sh

Now you can safely ignore all libvirt-bin related errors. Manually restart right service:

sudo systemctl restart libvirtd

And again follow guide:

sudo apt install -y apparmor-profiles
sudo apt-get install -y ufw
sudo ufw disable
sudo ufw status
# should output:
# Status: inactive

Now run setup_network.sh:

./setup_network.sh

Verify that this script really setup network:

it should created 4 bridges:

$ ip -br l | fgrep stxbr

stxbr1           UNKNOWN        76:7f:f1:6e:f0:37 <BROADCAST,MULTICAST,UP,LOWER_UP>
stxbr2           UNKNOWN        e6:f2:94:07:73:9e <BROADCAST,MULTICAST,UP,LOWER_UP>
stxbr3           UNKNOWN        e2:fa:74:8c:ed:95 <BROADCAST,MULTICAST,UP,LOWER_UP>
stxbr4           UNKNOWN        7a:cc:b7:d0:aa:87 <BROADCAST,MULTICAST,UP,LOWER_UP>

first bridge stxbr1 should have assigned hardcoded IP address:

$ ip -br -4 a | fgrep stxbr

stxbr1           UNKNOWN        10.10.10.1/24

there should be created this NAT rule that allows Internet access from above stxbr1:

$ sudo /sbin/iptables -t nat -L POSTROUTING

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
LIBVIRT_PRT  all  --  anywhere             anywhere            
MASQUERADE  all  --  10.10.10.0/24        anywhere

last rule IP 10.10.10.0/24 allows Internet access from stxbr1

If all above requirements are met you can continue:

we have to follow guide and download ISO
in your browser go to: http://mirror.starlingx.cengn.ca/mirror/starlingx/
and look for latest ISO
in my case this link is right:
- http://mirror.starlingx.cengn.ca/mirror/starlingx/release/6.0.0/centos/flock/outputs/iso/bootimage.iso

So back In VM download above ISO using:

cd
curl -fLO http://mirror.starlingx.cengn.ca/mirror/starlingx/release/6.0.0/centos/flock/outputs/iso/bootimage.iso
# optional for Azure - copy ISO to SSD for better speed:
sudo cp ~/bootimage.iso /mnt

Before creating VM I recommend to up memory and CPUs. In my case I made these changes

diff --git a/deployment/libvirt/controller_allinone.xml b/deployment/libvirt/controller_allinone.xml
index 6f7272e..ec209a1 100644
--- a/deployment/libvirt/controller_allinone.xml
+++ b/deployment/libvirt/controller_allinone.xml
@@ -1,8 +1,8 @@
 <domain type='kvm' id='164'>
   <name>NAME</name>
-  <memory unit='GiB'>18</memory>
-  <currentMemory unit='GiB'>18</currentMemory>
-  <vcpu placement='static'>6</vcpu>
+  <memory unit='GiB'>26</memory>
+  <currentMemory unit='GiB'>26</currentMemory>
+  <vcpu placement='static'>7</vcpu>
   <resource>
     <partition>/machine</partition>
   </resource>
@@ -16,7 +16,7 @@
   </features>
   <cpu match='exact'>
     <model fallback='forbid'>Nehalem</model>
-    <topology sockets='1' cores='6' threads='1'/>
+    <topology sockets='1' cores='7' threads='1'/>
     <feature policy='optional' name='vmx'/>

(My Azure VM has 32GB RAM and 8 vCPUs - so it should be safe).

Now create and start VM controller-0 using:

cd ~/stx-tools/deployment/libvirt/
./setup_configuration.sh -c simplex -i /mnt/bootimage.iso

You can safely ignore cannot open display: message.

Now connect to serial console using:

$ sudo virsh console simplex-controller-0

Do NOT press ENTER yet!!! - Because it would select Wrong type of installation. If you already pressed ENTER accidentaly - press ESC to return to main menu.
to redraw menu press Ctrl-L
select All-in-one Controller Configuration -> Serial Console
- as described on https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html#id1
now there will proceed complete KickStart (Anaconda installation)
in my case it installs 1206 packages
NOTE: you can any time disconnect from serial console using Ctrl-]
and later reconnect with same virsh console command (sometimes --force is needed if your connection was canceled abruptly...

After this nested VM simplex-controller-0 reboots, we can follow

https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html#id2
there will be several errors, because there no network configured yet
login as sysadmin/sysadmin
you will be forced to change pasword. Unfortunately there are many strict rules (including dictionary check) that you must adher to change password succesfully...

Now we have to temporarily config network in this libvirt VM. As sysadmin look what network interface to use:

$ ip -br l

lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP> 
eth1000          UP             52:54:00:5a:eb:79 <BROADCAST,MULTICAST,UP,LOWER_UP>
eth1001          UP             52:54:00:5b:68:3a <BROADCAST,MULTICAST,UP,LOWER_UP>
enp2s1           UP             52:54:00:28:24:a3 <BROADCAST,MULTICAST,UP,LOWER_UP>
enp2s2           UP             52:54:00:e1:c9:34 <BROADCAST,MULTICAST,UP,LOWER_UP>

In our case the right device is enp2s1. If you are not sure you can dump network interface assignment

# run on Host

$ sudo virsh domiflist simplex-controller-0

 Interface   Type     Source   Model    MAC
-----------------------------------------------------------
 vnet0       bridge   stxbr1   e1000    52:54:00:28:24:a3
 vnet1       bridge   stxbr2   e1000    52:54:00:e1:c9:34
 vnet2       bridge   stxbr3   virtio   52:54:00:5a:eb:79
 vnet3       bridge   stxbr4   virtio   52:54:00:5b:68:3a

So stxbr1 has MAC 52:54:00:28:24:a3

looking inside VM:

ip -br l | fgrep 52:54:00:28:24:a3
enp2s1           UP             52:54:00:28:24:a3 <BROADCAST,MULTICAST,UP,LOWER_UP>

NOTE: these mac addresses are hardoced in setup scripts so they should be same

now inside Libvirt VM create network setup script net_setup.sh with contents:

MY_DEV=enp2s1
export CONTROLLER0_OAM_CIDR=10.10.10.3/24
export DEFAULT_OAM_GATEWAY=10.10.10.1
sudo ip address add $CONTROLLER0_OAM_CIDR dev $MY_DEV
sudo ip link set up dev $MY_DEV
sudo ip route add default via $DEFAULT_OAM_GATEWAY dev $MY_DEV

and SOURCE it:

. ./net_setup.sh
# can be asked by sudo for password
# - enter sysadmin's password to proceed

if network is correctly setup than Internet access must work (however without DNS because there is empty /etc/resolv.conf.

so try in Libvirt VM

host www.cnn.com 8.8.8.8
# should return addresses...

NOTE: Default Ansible configuraion contains all necessary information. You can just copy it to HOME for reference using:
```
cp /usr/share/ansible/stx-ansible/playbooks/host_vars/bootstrap/default.yml \
   ~/
```
now cross your fingers and run (we run playbook using sudo because it sometimes wants password in the middle of installation and breaks ansible):
```
sudo ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap.yml
```

if you are bored while ansible is running you can connect from your Host (Azure VM) to this libvirt VM using command:

ssh sysadmin@10.10.10.3
# WARNING! After another reboot shi address will change
# to 10.10.10.2 !!!

peek into /etc/os-release: PRETTY_NAME="CentOS Linux 7 (Core)"

find more details about magic system command (without d suffix):

$ rpm -qf /usr/bin/system

cgts-client-1.0-276.tis.x86_64

$ rpm -qi cgts-client | egrep '^(Summary|Packager)'

Packager    : Wind River <info@windriver.com>
Summary     : System Client and CLI

Ansible installation should end with messages like:

bootstrap/bringup-bootstrap-applications : Check if application already exists -- 16.76s
common/armada-helm : Launch Armada with Helm v3 ------------------------ 16.11s
bootstrap/bringup-bootstrap-applications : Upload application ---------- 14.14s

After ansible finished we must configure OAM (Operations, Administration and Management) network (network, where all StarlingX APIs and services are exposed) to enable controller at all:

https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html#id1
scroll down to section Configure controller-0 on above web page
enter these commands in Libvirt VM (user sysadmin):

find OAM Interface:

$ ip -br -4 a | fgrep 10.10.10.

enp2s1           UP             10.10.10.3/24

configure enp2s1 as OAM Interface

$ source /etc/platform/openrc

$ OAM_IF=enp2s1

$ system host-list

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

$ system host-if-list 1

+--------------------------------------+------+----------+---------+------+-------+------+------+------------+
| uuid                                 | name | class    | type    | vlan | ports | uses | used | attributes |
|                                      |      |          |         | id   |       | i/f  | by   |            |
|                                      |      |          |         |      |       |      | i/f  |            |
+--------------------------------------+------+----------+---------+------+-------+------+------+------------+
| 8d66dac8-ba2c-499d-98aa-c76ca540af5c | lo   | platform | virtual | None | []    | []   | []   | MTU=1500   |
+--------------------------------------+------+----------+---------+------+-------+------+------+------------+

# we have to replace loopback with our Admin interface (OAM):

$ system host-if-modify controller-0 $OAM_IF -c platform

$ system interface-network-assign controller-0 $OAM_IF oam

# verify setting:

$ system host-if-list 1

+--------------------------------------+--------+----------+----------+---------+-------------+----------+-------------+------------+
| uuid                                 | name   | class    | type     | vlan id | ports       | uses i/f | used by i/f | attributes |
+--------------------------------------+--------+----------+----------+---------+-------------+----------+-------------+------------+
| 43adf65d-1579-4770-afb1-923f095be6a2 | enp2s1 | platform | ethernet | None    | [u'enp2s1'] | []       | []          | MTU=1500   |
| 8d66dac8-ba2c-499d-98aa-c76ca540af5c | lo     | platform | virtual  | None    | []          | []       | []          | MTU=1500   |
+--------------------------------------+--------+----------+----------+---------+-------------+----------+-------------+------------+

Unfortunately we need still lot of thing to configure. Try again this command:

$ system host-list

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | locked         | disabled    | online       |
+----+--------------+-------------+----------------+-------------+--------------+

Notice locked and disabled. It means that our controller is not yet able to service. Many Kubernetes applicatins require persistent storage. It is called:

PV - Persistent Volume - configured by administrator
PVC - Persistent Volume Claim - ks8 application request presistent storaget using these request

So we must configure Ceph

https://docs.starlingx.io/deploy_install_guides/r6_release/virtual/aio_simplex_install_kubernetes.html#configure-controller-0
chapter Optionally, initialize a Ceph-based Persistent Storage Backend There are two options

Host base Ceph
K8s based Ceph (Rook)

I always prefer Host over Containers so try it:

$ system storage-backend-list

Empty output...

$ system storage-backend-add ceph --confirmed

ystem configuration has changed.
Please follow the administrator guide to complete configuring the system.

+--------------------------------------+------------+---------+------------+------+----------+----------------+
| uuid                                 | name       | backend | state      | task | services | capabilities   |
+--------------------------------------+------------+---------+------------+------+----------+----------------+
| 960f7afc-c309-47d6-bc0f-78afe530c5b1 | ceph-store | ceph    | configured | None | None     | min_replicatio |
|                                      |            |         |            |      |          | n: 1           |
|                                      |            |         |            |      |          | replication: 1 |
|                                      |            |         |            |      |          |                |
+--------------------------------------+------------+---------+------------+------+----------+----------------+

$ system host-disk-list 1                 
+--------------------------------------+-------------+------------+-------------+----------+---------------+-...
| uuid                                 | device_node | device_num | device_type | size_gib | available_gib | ...
+--------------------------------------+-------------+------------+-------------+----------+---------------+-...
| a205e8e0-c5aa-41f8-92bb-daec6381fab1 | /dev/sda    | 2048       | HDD         | 600.0    | 371.679       | ...
| 4880a1d2-e629-4513-baf2-399dfb064410 | /dev/sdb    | 2064       | HDD         | 200.0    | 199.997       | ...
| 1fdeb451-6d6f-4950-9d99-d8215c10ed47 | /dev/sdc    | 2080       | HDD         | 200.0    | 199.997       | ...
+--------------------------------------+-------------+------------+-------------+----------+---------------+ ...

# note UUID for /dev/sdb:
# 4880a1d2-e629-4513-baf2-399dfb064410 

$ system host-stor-add 1 4880a1d2-e629-4513-baf2-399dfb064410

$ system host-stor-list 1           

+--------------------------------------+----------+-------+-----------------------+-...
| uuid                                 | function | osdid | state                 | ...
+--------------------------------------+----------+-------+-----------------------+-...
| 292ae029-f652-4e6c-b046-18daacd80a76 | osd      | 0     | configuring-on-unlock | ...
+--------------------------------------+----------+-------+-----------------------+-...

Notice configuring-on-unlock in column state

We want to use openstack so we have to follow For OpenStack only: section in guide. Create script setup_os.sh with contents:

#!/bin/bash
set -xeuo pipefail
DATA0IF=eth1000
DATA1IF=eth1001
export NODE=controller-0
PHYSNET0='physnet0'
PHYSNET1='physnet1'
SPL=/tmp/tmp-system-port-list
SPIL=/tmp/tmp-system-host-if-list
system host-port-list ${NODE} --nowrap > ${SPL}
system host-if-list -a ${NODE} --nowrap > ${SPIL}
DATA0PCIADDR=$(cat $SPL | grep $DATA0IF |awk '{print $8}')
DATA1PCIADDR=$(cat $SPL | grep $DATA1IF |awk '{print $8}')
DATA0PORTUUID=$(cat $SPL | grep ${DATA0PCIADDR} | awk '{print $2}')
DATA1PORTUUID=$(cat $SPL | grep ${DATA1PCIADDR} | awk '{print $2}')
DATA0PORTNAME=$(cat $SPL | grep ${DATA0PCIADDR} | awk '{print $4}')
DATA1PORTNAME=$(cat  $SPL | grep ${DATA1PCIADDR} | awk '{print $4}')
DATA0IFUUID=$(cat $SPIL | awk -v DATA0PORTNAME=$DATA0PORTNAME '($12 ~ DATA0PORTNAME) {print $2}')
DATA1IFUUID=$(cat $SPIL | awk -v DATA1PORTNAME=$DATA1PORTNAME '($12 ~ DATA1PORTNAME) {print $2}')
system datanetwork-add ${PHYSNET0} vlan
system datanetwork-add ${PHYSNET1} vlan
system host-if-modify -m 1500 -n data0 -c data ${NODE} ${DATA0IFUUID}
system host-if-modify -m 1500 -n data1 -c data ${NODE} ${DATA1IFUUID}
system interface-datanetwork-assign ${NODE} ${DATA0IFUUID} ${PHYSNET0}
system interface-datanetwork-assign ${NODE} ${DATA1IFUUID} ${PHYSNET1}
exit 0

And run it. Now we have to follow OpenStack-specific host configuration:

system host-label-assign controller-0 openstack-control-plane=enabled
system host-label-assign controller-0 openstack-compute-node=enabled
system host-label-assign controller-0 openvswitch=enabled

Now we have to folow For OpenStack Only: Set up disk partition for nova-local volume group, which is needed for stx-openstack nova ephemeral disks.: Create script setup_os_storage.sh

#!/bin/bash
set -xeuo pipefail
export NODE=controller-0
echo ">>> Getting root disk info"
ROOT_DISK=$(system host-show ${NODE} | grep rootfs | awk '{print $4}')
ROOT_DISK_UUID=$(system host-disk-list ${NODE} --nowrap | grep ${ROOT_DISK} | awk '{print $2}')
echo "Root disk: $ROOT_DISK, UUID: $ROOT_DISK_UUID"

echo ">>>> Configuring nova-local"
NOVA_SIZE=34
NOVA_PARTITION=$(system host-disk-partition-add -t lvm_phys_vol ${NODE} ${ROOT_DISK_UUID} ${NOVA_SIZE})
NOVA_PARTITION_UUID=$(echo ${NOVA_PARTITION} | grep -ow "| uuid | [a-z0-9\-]* |" | awk '{print $4}')
system host-lvg-add ${NODE} nova-local
sleep 60
system host-pv-add ${NODE} nova-local ${NOVA_PARTITION_UUID}
sleep 60
exit 0

And run it.

Now moment of truth - Unlocking controller - this time we shall see if all components really work:

$ system host-unlock controller-0

# WARNING! Restart will follow....

NOTE: In my case network startup took around 3 minutes. Don't know why...

After reboot the main IP of Libvirt Vm changed, you now have to use:

ssh sysadmin@10.10.10.2

To connect to Libvirt VM.

After reboot, login as sysadmin and verify host status:

$ source /etc/platform/openrc
$ system host-list

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

It must be in state unlocked and enabled and available Also try:

$ system application-list | sed -r 's/(.{75}).*/\1.../'

+--------------------------+---------+-----------------------------------+-...
| application              | version | manifest name                     | ...
+--------------------------+---------+-----------------------------------+-...
| cert-manager             | 1.0-26  | cert-manager-manifest             | ...
| nginx-ingress-controller | 1.1-18  | nginx-ingress-controller-manifest | ...
| oidc-auth-apps           | 1.0-61  | oidc-auth-manifest                | ...
| platform-integ-apps      | 1.0-44  | platform-integration-manifest     | ...
| rook-ceph-apps           | 1.0-14  | rook-ceph-manifest                | ...
+--------------------------+---------+-----------------------------------+-...

$ ceph -s

  cluster:
    id:     d993e564-bd99-4a4d-946c-a3aa090da4f9
    health: HEALTH_OK
 
  services:
    mon: 1 daemons, quorum controller-0 (age 23m)
    mgr: controller-0(active, since 21m)
    mds: kube-cephfs:1 {0=controller-0=up:active}
    osd: 1 osds: 1 up (since 20m), 1 in (since 20m)
 
  data:
    pools:   3 pools, 192 pgs
    objects: 22 objects, 2.2 KiB
    usage:   107 MiB used, 199 GiB / 199 GiB avail
    pgs:     192 active+clean

$ kubectl get ns

NAME              STATUS   AGE
armada            Active   108m
cert-manager      Active   102m
default           Active   109m
deployment        Active   102m
kube-node-lease   Active   109m
kube-public       Active   109m
kube-system       Active   109m

Avoding Lost write to Ceph

At least when using Host based Ceph (my case) there was observerd improper shutdown - where kernel RBD client reported lost-write to Ceph (which was already shut).

Deploying K8s application

TODO:

Optional: Installing OpenStack

Even after lot of work we have not installed OpenStack yet(!). We have to follow this guide

https://docs.starlingx.io/deploy_install_guides/r6_release/openstack/install.html
login to Libvirt VM as sysadmin

First ve have to increase LV for Docker from 30GB to 60GB.

verify current assignments:

$ system host-fs-list 1 | sed -r 's/.{39}//'

+---------+-------------+----------------+
| FS Name | Size in GiB | Logical Volume |
+---------+-------------+----------------+
| backup  | 25          | backup-lv      |
| docker  | 30          | docker-lv      |
| kubelet | 10          | kubelet-lv     |
| scratch | 16          | scratch-lv     |
+---------+-------------+----------------+

theoretically it should be easy:

$ system host-fs-modify controller-0 docker=60

HostFs update failed: Not enough free space on cgts-vg.
Current free space 16 GiB, requested total increase 30 GiB

it can be confirmed with this command:

sudo vgs

VG         #PV #LV #SN Attr   VSize    VFree  
cgts-vg      1  12   0 wz--n- <178.97g <16.16g
nova-local   1   1   0 wz--n-  <34.00g      0

Now we clone script that created VG nova-local and reuse it to create new disk partition and add it to VG cgts-vg Create script resize_os_docker.sh with these contents:

!/bin/bash
set -xeuo pipefail
export NODE=controller-0
echo ">>> Getting root disk info"
ROOT_DISK=$(system host-show ${NODE} | grep rootfs | awk '{print $4}')
ROOT_DISK_UUID=$(system host-disk-list ${NODE} --nowrap | grep ${ROOT_DISK} | awk '{print $2}')
echo "Root disk: $ROOT_DISK, UUID: $ROOT_DISK_UUID"

echo ">>>> Extending VG cgts-vg +50GB"
NOVA_SIZE=50
NOVA_PARTITION=$(system host-disk-partition-add -t lvm_phys_vol ${NODE} ${ROOT_DISK_UUID} ${NOVA_SIZE})
NOVA_PARTITION_UUID=$(echo ${NOVA_PARTITION} | grep -ow "| uuid | [a-z0-9\-]* |" | awk '{print $4}')
sleep 60  # it takes time before PV is created
system host-pv-add ${NODE} cgts-vg ${NOVA_PARTITION_UUID}
sleep 60  # it takes time before PV is added to VG !!!
exit 0

And run it. Unfortunately there are few timing races - so sometimes it is needed to recreate it manually. If above script was succesfull you can verify it with:

sudo pvs
  PV         VG         Fmt  Attr PSize    PFree  
  /dev/sda5  cgts-vg    lvm2 a--  <178.97g <16.16g
  /dev/sda6  nova-local lvm2 a--   <34.00g      0 
  /dev/sda8  cgts-vg    lvm2 a--   <49.97g <49.97g # << NEW PV

NOTE: You will likely have /dev/sda7 as partition (I made some experiments before runing it). Aand finally VG should after a while see new free space:

sudo vgs
  VG         #PV #LV #SN Attr   VSize    VFree 
  cgts-vg      2  12   0 wz--n- <228.94g 66.12g
  nova-local   1   1   0 wz--n-  <34.00g     0

Notice VFree 66GB - should be enough for docker.

Now you can finally resume guide on

https://docs.starlingx.io/deploy_install_guides/r6_release/openstack/install.html

and run:

$ system host-fs-modify controller-0 docker=60
...+---------+-------------+----------------+
...| FS Name | Size in GiB | Logical Volume |
...+---------+-------------+----------------+
...| backup  | 25          | backup-lv      |
...| docker  | 60          | docker-lv      |
...| kubelet | 10          | kubelet-lv     |
...| scratch | 16          | scratch-lv     |
...+---------+-------------+----------------+

Now we have to find latest OpenStack application

go to http://mirror.starlingx.cengn.ca/mirror/starlingx/release/6.0.0/centos/flock/outputs/helm-charts/

and download to your Libvirt vm (as sysadmin) suitable package, for example:

[sysadmin@controller-0 ~(keystone_admin)]$ cd
curl -fLO http://mirror.starlingx.cengn.ca/mirror/starlingx/release/6.0.0/centos/flock/outputs/helm-charts/stx-openstack-1.0-140-centos-stable-latest.tgz
ls -l stx-openstack-1.0-140-centos-stable-latest.tgz 

  -rw-r--r-- 1 sysadmin sys_protected 1804273 Jul 16 15:43 stx-openstack-1.0-140-centos-stable-latest.tgz

if you are brave upload Application:

$ system application-upload stx-openstack-1.0-140-centos-stable-latest.tgz
# now poll using command 
$ system application-show stx-openstack
# it must report progress: completed, status: uploaded

and install it:
```
system application-apply stx-openstack
```

again poll with:

watch -n 60 system application-show stx-openstack

If you were lucky you can access OpenStack by following:

https://docs.starlingx.io/deploy_install_guides/r6_release/openstack/access.html

WARNING!

When I rebooted machine with installed OpenStack, I saw these grave messages on console

rbd: lost write It means that there were active RBD (Remote Block Device) - networked disk device connected to ceph.

I suspect that systemd terminates services in wrong order, thus risking data loss to any containers using PV from Ceph (!!!)

Kernel messages:

EXT4-fs error (device rbd0): __ext4_find_entry:1536: inode #2: comm start.py: reading directory lblock 0
libceph: connect (1)192.168.204.2:6789 error -101
libceph: connect (1)192.168.204.2:6789 error -101

To access OpenStack at all we have to create different context file, by following docs run:

sed '/export OS_AUTH_URL/c\export OS_AUTH_URL=http://keystone.openstack.svc.cluster.local/v3' /etc/platform/openrc > ~/openrc.os

Now you have to remember:

to access OpenStack run:
```
source ~/openrc.os
```
to access Starlingx (system command , etc.), run
```
source /etc/platform/openrc
```

These commands should work:

source ~/openrc.os
openstack flavor list
openstack image list

WARNING! In my case OpenStack is excssively hungry:

$ uptime

 07:36:40 up 46 min,  2 users,  load average: 18.09, 24.35, 19.93

$ vmstat 1

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
15  0      0 3326296 242220 4103136    0    0   332   158 2115 1180 42 15 40  4  0
36  1      0 3295668 242304 4103532    0    0    72   668 11102 18076 52 12 32  4  0
34  1      0 3295984 242372 4103664    0    0     0  1320 9557 14930 54  8 36  2  0
46  0      0 3274756 242496 4103976    0    0    96  1032 13529 22478 58 13 28  0  0

You can see that 1st column (number of processes in Run Queue - processed that need CPU but has to wait for it) is sometimes up to 50...

It is also confirmed by alarms:

source /etc/platform/openrc
fm alarm-list
# if you want to see details run
fm alarm-list --uuid
# and pass uuid to: fm alarm-show

So I'm not sure - how can be that used for low latency edge computing...

Another problem:

But - after reboot I had no luck:

openstack flavor list
Failed to discover available identity versions when contacting http://keystone.openstack.svc.cluster.local/v3. Attempting to parse version from URL.
Service Unavailable (HTTP 503)

To see all openstack components in Kubernetes we can use this command

kubectl get pod -n openstack

There are 111 pods in my case(!).

To quickly find problematic pods we can try:

$ kubectl get pod -n openstack | fgrep Running | fgrep '0/'
mariadb-ingress-847cdb5dfb-4zgd6 0/1 Running 1  15h
mariadb-server-0                 0/1 Running 2  15h

Running 0 out of 1 is definitely problem... Trying:

kubectl logs mariadb-server-0 -n openstack
# nothing suspicious

But:

kubectl describe pod/mariadb-server-0 -n openstack
# Hmm
  Normal   Killing                 11m                kubelet                  Container mariadb failed startup probe, will be restarted
Warning  Unhealthy  8m46s (x12 over 20m)  kubelet  Startup probe failed:

To get terminal in container:

kubectl exec -it mariadb-server-0 -n openstack sh
# use exit to exit container

Also:

source /etc/platform/openrc 
fm alarm-list

+-------+----------------------------+---------------+----------+-------------+
| Alarm | Reason Text                | Entity ID     | Severity | Time Stamp  |
| ID    |                            |               |          |             |
+-------+----------------------------+---------------+----------+-------------+
| 270.  | Host controller-0 compute  | host=         | critical | 2022-07-17T |
| 001   | services failure           | controller-0. |          | 07:26:01.   |
|       |                            | services=     |          | 637858      |
|       |                            | compute       |          |             |
|       |                            |               |          |             |
...

However you need fm alarm-list --uuid to get UUID for fm alarm-show (ooohhhh).

NOTE: After around 30 minutes this error resolved....

So I can now see flavours:

$ openstack flavor list | sed -r 's/.{39}/.../;s/.{12}$/.../'

...+-----------+-------+------+-----------+-------+...
...| Name      |   RAM | Disk | Ephemeral | VCPUs |...
...+-----------+-------+------+-----------+-------+...
...| m1.small  |  2048 |   20 |         0 |     1 |...
...| m1.large  |  8192 |   80 |         0 |     4 |...
...| m1.medium |  4096 |   40 |         0 |     2 |...
...| m1.xlarge | 16384 |  160 |         0 |     8 |...
...| m1.tiny   |   512 |    1 |         0 |     1 |...
...+-----------+-------+------+-----------+-------+...

Hmm:

openstack image list
# no image...

Fortunately I alredy wrote guide on OpenStack-from-Scratch. So try:

source ~/openrc.os
cd
curl -OLf http://download.cirros-cloud.net/0.5.1/cirros-0.5.1-x86_64-disk.img
openstack image create --public --container-format bare \
   --disk-format qcow2 --file cirros-0.5.1-x86_64-disk.img cirros
openstack image list

+--------------------------------------+--------+--------+
| ID                                   | Name   | Status |
+--------------------------------------+--------+--------+
| 22958ef4-05f1-4a9f-b1e9-1ca9eb1f5ebf | cirros | active |
+--------------------------------------+--------+--------+

Now we can follow another my guide: OpenStack AIO in Azure. We need some network:

openstack network list
# Hmm, empty output...

We have to follow (sort of) guide from:

https://wiki.openstack.org/wiki/StarlingX/Networking

First we must know data network type using this command:

$ system datanetwork-list  

+--------------------------------------+----------+--------------+------+
| uuid                                 | name     | network_type | mtu  |
+--------------------------------------+----------+--------------+------+
| a4b7cc2e-68fe-47ad-b0ff-67dd147d85b0 | physnet0 | vlan         | 1500 |
| 2bba400c-4a2d-4111-8247-4db251b6ad31 | physnet1 | vlan         | 1500 |
+--------------------------------------+----------+--------------+------+

So we know which snippet from above wiki to use - this one: Create script setup_os_net.sh with contents

#!/bin/bash
set -euo pipefail
set -x
ADMINID=$(openstack project show -f value -c id admin)
[[ $ADMINID =~ ^[a-f0-9]{32}$ ]] || {
	echo "Unable to get ID for project 'admin'" >&2
	exit 1
}

PHYSNET0='physnet0'
PHYSNET1='physnet1'
PUBLICNET0='public-net0'
PUBLICNET1='public-net1'
PUBLICSUBNET0='public-subnet0'
PUBLICSUBNET1='public-subnet1'
openstack network segment range create ${PHYSNET0}-a \
  --network-type vlan --physical-network ${PHYSNET0} \
  --minimum 400 --maximum 499 --private --project ${ADMINID}
openstack network segment range create ${PHYSNET1}-a \
  --network-type vlan --physical-network ${PHYSNET1} \
  --minimum 500 --maximum 599 --private --project ${ADMINID}
openstack network create --project ${ADMINID} \
  --provider-network-type=vlan --provider-physical-network=${PHYSNET0} \
  --provider-segment=400 ${PUBLICNET0}
openstack network create --project ${ADMINID} \
  --provider-network-type=vlan --provider-physical-network=${PHYSNET1} \
  --provider-segment=500 ${PUBLICNET1}
openstack subnet create --project ${ADMINID} ${PUBLICSUBNET0} \
  --network ${PUBLICNET0} --subnet-range 192.168.101.0/24
openstack subnet create --project ${ADMINID} ${PUBLICSUBNET1} \
  --network ${PUBLICNET1} --subnet-range 192.168.102.0/24
exit 0

And run it that way

source ~/openrc.os
chmod +x ~/setup_os_net.sh
~/setup_os_net.sh

Now list available networks using:

$ openstack network list

+--------------------------------------+-------------+--------------------------------------+
| ID                                   | Name        | Subnets                              |
+--------------------------------------+-------------+--------------------------------------+
| c9160a9d-4f4f-424f-b205-f17e2fbfadc6 | public-net0 | 46a911af-5a12-428e-82bc-10b26a344a81 |
| fa016989-93ac-41f6-9d70-dd3c690a433f | public-net1 | 6f8eeffd-78c4-4501-977e-7b8d23a96521 |
+--------------------------------------+-------------+--------------------------------------+

And finally try to run VM:

openstack server create --flavor m1.tiny --image cirros \
    --nic net-id=c9160a9d-4f4f-424f-b205-f17e2fbfadc6 \
    test-cirros

Hmm, nearly done:

Unknown Error (HTTP 504)

But it seems that it just needs a bit of time:

$ openstack server list

+--------------------------------------+-------------+--------+----------+--------+---------+
| ID                                   | Name        | Status | Networks | Image  | Flavor  |
+--------------------------------------+-------------+--------+----------+--------+---------+
| 5490d592-f594-496c-9573-6b0922041f29 | test-cirros | BUILD  |          | cirros | m1.tiny |
+--------------------------------------+-------------+--------+----------+--------+---------+

Few minutes later

$ openstack server list

+--------------------------------------+-------------+--------+----------------------------+--------+---------+
| ID                                   | Name        | Status | Networks                   | Image  | Flavor  |
+--------------------------------------+-------------+--------+----------------------------+--------+---------+
| 5490d592-f594-496c-9573-6b0922041f29 | test-cirros | ACTIVE | public-net0=192.168.101.70 | cirros | m1.tiny |
+--------------------------------------+-------------+--------+----------------------------+--------+---------+
# ACTIVE - umm in my case the system was critically overloaded...
# if you are lucky
openstack console log show test-cirros
openstack console url show test-cirros

WARNING! That VLAN network is not much usefull (no DHCP etc...) Not sure if FLAT is still supported - was in older docs:

https://wiki.openstack.org/wiki/StarlingX/Networking

Shutdown and reboot of Azure VM

To shutdown:

login as sysadmin to Libvirt VM and run
```
 sudo init 0
```
now Stop Azure VM from portal. Remember that if you just shutdown Azure VM internally (using sudo init 0 it will NOT stop billing!

Power Up:

start Azure VM openstack-stx
login to Azure VM

run

cd ~/stx-tools/deployment/libvirt
./setup_network.sh

if your user is member of libvirt you can omit sudo in these two commands below:
run VM sudo virsh start simplex-controller-0

in my case got error:

 error: Failed to start domain simplex-controller-0
 error: Cannot access storage file '/mnt/bootimage.iso':
     No such file or directory

In such case - find offending device:

$ virsh domblklist simplex-controller-0

 Target   Source
--------------------------------------------------------------
 sda      /var/lib/libvirt/images/simplex-controller-0-0.img
 sdb      /var/lib/libvirt/images/simplex-controller-0-1.img
 sdc      /var/lib/libvirt/images/simplex-controller-0-2.img
 sdd      /mnt/bootimage.iso

$ virsh change-media simplex-controller-0 /mnt/bootimage.iso --eject

Successfully ejected media.

$ virsh domblklist simplex-controller-0

Target   Source

sda /var/lib/libvirt/images/simplex-controller-0-0.img sdb /var/lib/libvirt/images/simplex-controller-0-1.img sdc /var/lib/libvirt/images/simplex-controller-0-2.img sdd -

- problem fixed, start this VM again

run console to see progress sudo virsh console simplex-controller-0
once simplex-controller-0shows login prompt you can login on serial console or viassh sysadmin@10.10.10.2` (I prefer this, because serial console has some quirks)
verify that all k8s applications are X/X Running or Completed
```
kubectl get pods -A
```
important - wait until Ceph cluster is in HEALTH_OK state using:
```
ceph -s
```

Source

There is an official guide how to rebuild StarlingX by yourself:

https://docs.starlingx.io/developer_resources/build_guide.html I did not try it yet (at least 32GB RAM and 500GB disk required), but there are few noteworthy things:
https://opendev.org/starlingx/manifest/src/branch/master/default.xml list of git repositories
- processed by Android's repo tool (see below)
https://opendev.org/starlingx/tools (you already know this project)

Interesting is Dockerfile:

https://opendev.org/starlingx/tools/src/branch/master/Dockerfile You can see there
how hard it is today to pin to exact CentOS 7 and EPEL 7 versions.

how to get go version of Android repo tool (to fetch repos from /manifest/default.xml

curl https://storage.googleapis.com/git-repo-downloads/repo > /usr/local/bin/repo && \
chmod a+x /usr/local/bin/repo

Glossary

STX - StarlingX
DC - Distributed Cloud
WR - WindRiver (key contributor to StarlingX project)
CGCS - original project name Wind River Carrier Grade Communications Server, renamed to StarlingX, see https://opendev.org/starlingx/config/commit/d9f2aea0fb228ed69eb9c9262e29041eedabc15d You can see cgcs-X as prefix in various RPM packacges
CGTS - original project name Wind River Carrier Grade Telecom (or Titanium???) Server (???), renamed to StarlingX You can see cgts-X in prefix in various RPM packacges
TIS - original project name Wind River Titanium Server platform, renamed to StarlingX, see https://opendev.org/starlingx/config/commit/d9f2aea0fb228ed69eb9c9262e29041eedabc15d You can still see tis as Vendor tag in StarlingX RPM packages.
sysinv - System Inventory - accessed with famous system command in StarlingX