Concept for a geo distributed Bot on Azure with region failover capabilities

h2floh/GeoDistributedAzureBot

Azure Bot Framework based Geo Distributed Bot with failover capability

This repo contains deployment scripts and a sample bot to spin up a geo-distributed bot with geo-failover capability, which can be accessed like any other Azure Bot Framework Service based bot via the Bot Framework Service channels (Directline/WebChat and many more).

The idea of this repo is to give you a fully working starting point for an Azure cloud-native architecture pattern, which you can customize for your bot and your needs.

This repo focuses on the surrounding architectural aspects rather than on providing a good conversational AI experience.

Things you can learn about (most are Bot unrelated):

  • Centralized and secure configuration of services
  • Simple but powerful Auto-Failover capabilities by cloud architecture design
  • Advanced Terraform functionalities like for_each and dynamic
  • Fully automated LUIS deployment for CI/CD pipelines
  • Automated Let's Encrypt certificate issuing

Why should I care?

Most bot projects focus on the user experience, natural language processing and other AI capabilities of the bot. But as the bot becomes a core service of your organization, it becomes critical to provide a globally spanning bot service with an integrated failover/always-on architecture or strategy, even for the (unlikely) downtime of a whole Azure region.

Since there is limited guidance on how to create such an architecture, we started this repo as an idea book.

The big picture

We don't claim that this is the only valid architecture. There are many ways to build this, and to customize and improve on this base architecture.

The used architecture

Architecture explanation

Your client apps or channels (Teams, Slack, Facebook Messenger) connect to your bot via the Azure Bot Framework Service (BFS), an already globally distributed service. We configure BFS to use a Traffic Manager endpoint in 'Performance' mode, so Traffic Manager looks up the bot nearest to the currently connected BFS instance, which should also be the one nearest to the client/user. Traffic Manager can also run health checks against the registered endpoints, which enables regional failover by design.
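The routing decision described above can be modelled in a few lines. This is only a conceptual sketch: the real Traffic Manager works at the DNS level and does not expose such a function, and the `Endpoint` class, `pick_endpoint` and the latency figures are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    region: str        # Azure region hosting a bot instance
    latency_ms: float  # hypothetical latency as seen from the calling BFS instance
    healthy: bool      # result of the most recent health probe


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the lowest-latency healthy endpoint, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.latency_ms)


endpoints = [
    Endpoint("koreacentral", latency_ms=20.0, healthy=True),
    Endpoint("southeastasia", latency_ms=75.0, healthy=True),
]
print(pick_endpoint(endpoints).region)  # koreacentral

# Simulate a regional outage: the probe for koreacentral starts failing.
endpoints[0].healthy = False
print(pick_endpoint(endpoints).region)  # southeastasia
```

The second call shows the failover: once the nearest endpoint is marked unhealthy, traffic goes to the next-nearest region without any change on the client side.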

This comes at a price: BFS by design requires the bot endpoint to present an SSL certificate with a valid trust chain, so we cannot just create a self-signed SSL certificate for a test run or demo.

To deploy the bot easily (and to be able to create a lean CI/CD pipeline), the bot(s) reference a central configuration store, KeyVault, to retrieve all configuration. The bot is deployed as a .NET Core application into WebApps (on App Service).

For each region where we deploy a bot, we also use the LUIS service in the same region to keep latency low. To store the conversation state we use CosmosDB in MultiMaster mode, so that even after a failover the conversation can continue from where it left off.

To avoid additional complexity, we introduced a healthcheck API within the bot which checks the availability of LUIS and CosmosDB. If any one component fails, the whole region is failed over, which is maybe a bit harsh. You can extend each individual service to provide more regional failover and high availability, but we have to draw a line somewhere.
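The "one failure fails the region" rule can be sketched as follows. The sample bot is a .NET Core application, so this Python version only illustrates the aggregation logic; `check_health`, the probe names and the 200/503 status codes are assumptions, not the repo's actual implementation.

```python
from typing import Callable, Dict, Tuple


def check_health(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe and return (HTTP status code, per-dependency report)."""
    report = {}
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises is treated the same as one that reports failure.
            ok = False
        report[name] = "ok" if ok else "failed"
    # Any single failure marks the whole region as unhealthy.
    status = 200 if all(v == "ok" for v in report.values()) else 503
    return status, report


def luis_down() -> bool:
    raise ConnectionError("LUIS endpoint unreachable")


print(check_health({"luis": lambda: True, "cosmosdb": lambda: True}))
# (200, {'luis': 'ok', 'cosmosdb': 'ok'})
print(check_health({"luis": luis_down, "cosmosdb": lambda: True}))
# (503, {'luis': 'failed', 'cosmosdb': 'ok'})
```

Returning a non-200 status is what lets the Traffic Manager probe mark the endpoint as degraded; the per-dependency report is only there to make the cause visible when you inspect the endpoint yourself.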

We are not using any other global or regional services, but they are displayed in the architecture picture for better visualization.

Design decisions

  • Using TrafficManager with Performance profile in order to "dispatch" to the nearest available region/bot.
  • Using KeyVault as the central configuration store even for non-secret config (App Configuration could also be used, but it is still in preview and we wanted to keep the number of services low)
  • Using MultiMaster CosmosDB as the state store. You may not need a global MultiMaster CosmosDB; for a bigger geographical region like North America or Europe, separate CosmosDB MultiMaster clusters might be just fine.
  • Placing all global services (management plane - Traffic Manager, Bot Registration) into a separate Azure region
  • Putting the Healthcheck API within the bot (to reduce complexity/additional code and services)

Try it yourself

Please report any problems you face under issues!

Prerequisites for all tasks

ℹ️ Scripts are tested with PowerShell Core under Windows 10, Ubuntu 18.04.3 LTS and Azure Shell

  • PowerShell Core >=6.2.3
  • Terraform >=0.12.18
  • Azure CLI >=2.0.71
  • .NET Core SDK >=2.2
  • Node.js & npm >= 8.5
  • LUIS CLI >=2.6.2
  • Be logged into Azure CLI and have subscription-level Owner or Contributor rights on the subscription marked "isDefault"
  • You can also run the deployment in a Docker container by following the steps below:
    1. Clone the repo to your local drive;

    2. Pull the image of the 'h2floh/geobotagent' which has the prerequisite software installed:

      docker pull h2floh/geobotagent

    3. In your Docker settings, ensure that the drive containing the repo is shared and available to the container created in the next step.

    4. Execute the following command in your terminal:

      docker run -it --name <containername> --mount type=bind,source="$(pwd)"/GeoDistributedAzureBot,target=/<targetfoldername> h2floh/geobotagent:latest usr/bin/pwsh

      The command mounts the existing folder into the Docker container and starts the container with a PowerShell terminal. The mounted files will be located in the target folder specified by <targetfoldername>.

    5. Run the deployment steps.

Summary of steps

  1. Deploying the Infrastructure & Sample Bot (includes import or creation of an SSL certificate)
  2. Testing Bot and Failover
  3. Destroying the Infrastructure (and saving your SSL certificate for reuse)
  4. Deploy it again

1. Deploying the Infrastructure & Sample Bot

You can use the OneClickDeploy.ps1 script; several options are available.

ℹ️ Azure Front Door mode: It is now possible to use Azure Front Door instead of TrafficManager. This setup does not need a custom domain name or a Let's Encrypt SSL certificate and will therefore function reliably. Add the -AZUREFRONTDOOR $True parameter when executing the OneClickDeploy script. ℹ️

⚠️ TrafficManager mode: For testing, the provided automatic issuing of a Let's Encrypt certificate is a good way to obtain the trusted SSL certificate that BFS requires, but it is rate limited (50 per week per top level domain, more info here). There is also currently no mechanism in place to renew the certificate automatically every 3 months. So use it wisely and try to reuse the SSL certificate. Even though this architecture can handle production environments and can easily be scaled out for them, we strongly recommend a Custom Domain Name and certificate issuing via AppServices or your preferred CA (Certificate Authority). ⚠️

⚠️ Known issues/drawbacks:

  • The Bot Name parameter has to be unique since several Azure services use it as a prefix. Stick to lowercase characters, no dashes or special characters, and fewer than 20 characters, e.g. myfirstname1234
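A quick local check of these naming rules can save a failed deployment run. The sketch below is not part of the repo's scripts; `is_valid_bot_name` is a hypothetical helper, and allowing digits is an assumption based on the example name above. It checks the format only, not whether the name is still unique across Azure.

```python
import re

# Lowercase letters and digits only, 1 to 19 characters (fewer than 20).
BOT_NAME_PATTERN = re.compile(r"[a-z0-9]{1,19}")


def is_valid_bot_name(name: str) -> bool:
    """Return True if the name satisfies the format rules listed above."""
    return BOT_NAME_PATTERN.fullmatch(name) is not None


for candidate in ["myfirstname1234", "My-Bot", "a" * 20]:
    print(candidate, is_valid_bot_name(candidate))
```

The first candidate passes; the second fails on the uppercase letters and the dash, and the third fails on length.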

ℹ️ You can change to -AUTOAPPROVE $False to review and accept the changes Terraform will make to your subscription. There are 3 to 5 executions, so be prepared to enter yes in between. ℹ️

ℹ️ Without changing the parameters the bot will deploy to three Azure regions:

  • Global/central artifacts: japaneast
  • Bot: koreacentral and southeastasia

ℹ️ To use a custom domain name you only have to set a CNAME entry in your DNS server pointing to the TrafficManager domain name (default <botname>.trafficmanager.net). See here on how to do it if you use Azure DNS.

# Example 1: Azure Front Door Version (Let's Encrypt SSL certificate not needed)
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> -AZUREFRONTDOOR $True -AUTOAPPROVE $True

# Example 2: Issues a SSL certificate from Let's Encrypt for the TrafficManager Endpoint Domain
# [HINT: Export your Certificate (see ExportSSL.ps1) for reuse in subsequent runs]
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> -YOUR_CERTIFICATE_EMAIL <yourmailaddressforletsencrypt> -AUTOAPPROVE $True

# Example 3: Issues a SSL certificate from Let's Encrypt for your custom domain
# [HINT: Export your Certificate (see ExportSSL.ps1) for reuse in subsequent runs]
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -YOUR_CERTIFICATE_EMAIL <yourmailaddressforletsencrypt> -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True

# Example 4: Imports an existing SSL certificate (PFX File) for the TrafficManager Endpoint Domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> -AUTOAPPROVE $True

# Example 5: Imports an existing SSL certificate (PFX File) for your custom domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> `
 -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True

2. Testing Bot and Failover

If the deployment script runs without any failures, it will output generated links for accessing the WebChat locally or from within this repo's GitHub Pages site.

Here are some hints on how to use the bot.

ℹ️ Alternatively you can grab your Directline key from the Bot Channel Registration pane. Use the provided Test Webchat static index.html and append the following query arguments: ?bot=<BOT_NAME>&key=<DIRECT_LINE_KEY>
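If you want to build that link by hand, a small helper keeps the query string correctly encoded. This is a sketch only: `webchat_url` is a hypothetical helper, `index.html` stands for wherever you serve the test page from, and the key shown is a made-up placeholder.

```python
from urllib.parse import urlencode


def webchat_url(page: str, bot_name: str, direct_line_key: str) -> str:
    """Append the bot and key query arguments to the test page address."""
    query = urlencode({"bot": bot_name, "key": direct_line_key})
    return f"{page}?{query}"


print(webchat_url("index.html", "myfirstname1234", "abc.123-XYZ"))
# index.html?bot=myfirstname1234&key=abc.123-XYZ
```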

Last but not least, break something (remove the LUIS Endpoint Key in luis.ai, stop the WebApp your bot responds from - TODO create sample scripts to do that).

3. Destroying the Infrastructure (and saving your SSL certificate for reuse)

Running the script below saves your SSL certificate and then deletes all generated infrastructure:

# Example 1: Exports the SSL certificate as PFX File and destroys the infrastructure
.\Deploy\OneClickDestroy.ps1 -BOT_NAME <yourbotname>

4. Deploy it again

If you used the integrated Let's Encrypt certificate issuing, please reuse the saved certificate (it is valid for 3 months) for redeployments (provided you use the same Bot Name or Custom Domain for the redeploy).

# Example 1: Imports an existing SSL certificate (PFX File) for the TrafficManager Endpoint Domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> -AUTOAPPROVE $True

# Example 2: Imports an existing SSL certificate (PFX File) for your custom domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> `
 -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True
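Because the saved Let's Encrypt certificate is only valid for about 3 months (90 days), it helps to check how much lifetime is left before reusing it. The sketch below assumes you noted the issue date yourself; `days_left`, `should_reuse` and the 7-day safety margin are hypothetical and not part of the repo's scripts.

```python
from datetime import date, timedelta

# Let's Encrypt certificates are valid for 90 days.
CERT_LIFETIME = timedelta(days=90)


def days_left(issued_on: date, today: date) -> int:
    """Days until a certificate issued on `issued_on` expires (negative if expired)."""
    return (issued_on + CERT_LIFETIME - today).days


def should_reuse(issued_on: date, today: date, min_days_left: int = 7) -> bool:
    """Reuse the saved PFX only if it stays valid for at least `min_days_left` days."""
    return days_left(issued_on, today) >= min_days_left


issued = date(2020, 1, 1)
print(days_left(issued, date(2020, 2, 1)))      # 59
print(should_reuse(issued, date(2020, 2, 1)))   # True
print(should_reuse(issued, date(2020, 3, 30)))  # False
```

When `should_reuse` returns False, issuing a fresh certificate is the safer choice, keeping the weekly rate limit in mind.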

Learnings

There is no one-size-fits-all Infrastructure as Code tool

  • While Terraform is good at looping over each region, it is not very good at multi-step scenarios that include waiting for a resource/artifact to be created
  • Terraform is also less than optimal if you want to introduce architecture choices
  • For waiting I used script loops together with Azure CLI commands
  • Terraform AzureRM provider still lacks some update features. E.g. there is a need to update only the Bot's endpoint in a subsequent Terraform execution, but this is not possible because there is no data source for Bot, so we would have to keep track of all parameters. In such cases we used Azure CLI for updating.
  • Terraform is very convenient if you want to destroy the environment again (demos, infrequently recurring tasks)
  • For real cross-platform usage of PowerShell Core scripts you have to stick to Unix file name/path conventions
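The "script loops together with Azure CLI commands" approach boils down to polling with a timeout. The repo does this in PowerShell; the sketch below shows the same pattern in Python, with `wait_until` as a hypothetical helper and `resource_exists` as a stand-in for a real Azure CLI query.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout_s: float, interval_s: float = 5.0) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        # Give up if the next poll would start after the deadline.
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


# Stand-in for an Azure CLI query: reports "created" on the third poll.
attempts = {"count": 0}


def resource_exists() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(resource_exists, timeout_s=1.0, interval_s=0.01))  # True
print(attempts["count"])  # 3
```

Keeping an explicit timeout matters in a pipeline: a resource that never appears should fail the run rather than block it indefinitely.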

Open points and next steps

A list of open points from different domains/viewpoints:

  • Include prerequisite validation check
  • Change from LUIS CLI to API calls in order to overcome Azure Shell restriction on npm executable packages
  • Create additional documentation for all scripts and their options / deployment flow
  • Update scripts and Terraform to use remote state store based on Blob Storage
  • Extend Bot with Geo distributed Speech service
  • Include scripts to simulate different types of failures
  • Create a containerized version where AppService will be replaced with Azure Kubernetes Service or Azure Container Instances
  • Create a version where LUIS and Speech service run on the same AKS as the bot

Related Work and references

  • Active/Passive Failover approach for Azure Bots - Blog by Sowmyan Soman Chullikkattil