Azure Bot Framework based Geo Distributed Bot with failover capability

This repo contains deployment scripts and a sample bot to spin up a geo-distributed, geo-failover-capable bot, which can be accessed like any other Azure Bot Framework Service based bot via the Bot Framework Service channels (Directline/WebChat and many more).

The idea of this repo is to give you a fully working starting point for an Azure cloud-native architecture pattern, which you can then customize for your bot and your needs.

This repo focuses on the surrounding architectural aspects rather than on providing a good conversational AI experience.

Things you can learn about (most are Bot unrelated):

Centralized and secure configuration of services
Simple but powerful Auto-Failover capabilities by cloud architecture design
Advanced Terraform functionalities like for_each and dynamic
Fully automated LUIS deployment for CI/CD pipelines
Automated Let's Encrypt certificate issuing

Why should I care?

Most bot projects focus on the user experience, natural language processing and other AI capabilities of the bot. But as the bot becomes a core service of your organization, it becomes critical to provide a globally spanning bot service with an integrated failover/always-on architecture or strategy, even in the (unlikely) event that a whole Azure region goes down.

Since there is limited guidance on how to create such an architecture, we started this repo as an idea book.

The big picture

We don't claim that this is the only valid architecture. There are many ways to build this, and to customize and improve on this base architecture.

Architecture explanation

Your client apps or channels (Teams, Slack, Facebook Messenger) connect to your bot via the Azure Bot Framework Service (BFS), an already globally distributed service. We configure BFS to use a Traffic Manager endpoint in 'Performance' mode, so Traffic Manager looks up the bot nearest to the currently connected BFS instance, which should also be the nearest to the client/user. Traffic Manager can also run health checks on the registered endpoints, which enables regional failover by design.
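The routing idea described above can be illustrated with a small sketch. This is a simplified model written in Python for illustration, not how Traffic Manager is implemented: the Endpoint class, the pick_endpoint function and the latency values are all hypothetical, and the real service works at the DNS level with its own latency measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    region: str
    latency_ms: float  # hypothetical latency as seen from the caller's network
    healthy: bool      # result of the most recent health probe


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    return min(candidates, key=lambda e: e.latency_ms, default=None)


endpoints = [
    Endpoint("koreacentral", 12.0, True),
    Endpoint("southeastasia", 70.0, True),
]
print(pick_endpoint(endpoints).region)  # koreacentral

# Simulate a regional outage: the nearest region fails its health probe.
endpoints[0].healthy = False
print(pick_endpoint(endpoints).region)  # southeastasia
```

The point of the sketch is that failover needs no extra logic: once a region is marked unhealthy, the same "nearest healthy endpoint" rule sends traffic to the next region.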

This comes at a price: by design, BFS requires the bot endpoint to have an SSL certificate with a valid trust chain, so we cannot just create a self-signed certificate for a test run or demo.

To deploy the bot easily (and to be able to create a lean CI/CD pipeline), the bot(s) reference a central configuration store, KeyVault, to retrieve all configuration. The bot is deployed as a .NET Core application into WebApps (on App Service).

For each region where we deploy a bot, we also use the LUIS service in the same region to keep latency low. To store the state of the conversations we use CosmosDB in MultiMaster mode, so that even after a failover the conversation can continue from where it left off.

To avoid additional complexity, we introduced a healthcheck API within the bot that checks the availability of LUIS and CosmosDB. If any one component fails, the whole region is failed over, which is perhaps a bit harsh. You can extend each individual service to provide more regional failover and high availability, but, as stated in the introduction, we have to draw a line somewhere.
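A minimal sketch of such an all-or-nothing health check follows. It is written in Python for illustration and is not the repo's .NET implementation: the check_health function and the probe callables are hypothetical. It assumes the health probe treats HTTP 200 as healthy; returning 503 otherwise is a choice made for this sketch.

```python
from typing import Callable, Dict, Tuple


def check_health(probes: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every dependency probe and derive one status code for the whole region."""
    report = {}
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises is treated the same as an unavailable dependency.
            ok = False
        report[name] = "ok" if ok else "unavailable"
    # One failed dependency marks the whole region unhealthy (the "harsh" rule).
    status = 200 if all(state == "ok" for state in report.values()) else 503
    return status, report


status, report = check_health({
    "luis": lambda: True,       # hypothetical probe: LUIS endpoint reachable
    "cosmosdb": lambda: False,  # hypothetical probe: CosmosDB not reachable
})
print(status, report)  # 503 {'luis': 'ok', 'cosmosdb': 'unavailable'}
```

Returning the per-dependency report alongside the status makes it easier to see which component caused a failover.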

We are not using any other global or regional services, but for better visualization they are displayed in the architecture picture.

Design decisions

Using TrafficManager with Performance profile in order to "dispatch" to the nearest available region/bot.
Using KeyVault as the central configuration store, even for non-secret config (App Configuration could also be used, but it is still in preview, and we want to keep the number of services low)
Using MultiMaster CosmosDB as the state store. You may not need a global MultiMaster CosmosDB; for a larger geographical region like North America or Europe, separate CosmosDB MultiMaster clusters may be just fine.
Placing all global services (management pane - Traffic Manager, Bot Registration) into a separate Azure region
Putting the Healthcheck API within the bot (to reduce complexity/additional code and services)

Try it yourself

Please report any problems you face under issues!

Prerequisites for all tasks

ℹ️ Scripts are tested with PowerShell Core under Windows 10, Ubuntu 18.04.3 LTS ~~and Azure Shell~~

PowerShell Core >=6.2.3
Terraform >=0.12.18
Azure CLI >=2.0.71
.NET Core SDK >=2.2
Node.js & npm >= 8.5
LUIS CLI >=2.6.2
Be logged in to Azure CLI with subscription-level Owner or Contributor rights on the subscription marked "isDefault"
You can also run the deployment in a Docker container by following the steps below:
1. Clone the repo to your local drive;
2. Pull the image of the 'h2floh/geobotagent' which has the prerequisite software installed:
  
  docker pull h2floh/geobotagent
3. In your Docker settings, ensure that the drive containing the repo is shared, so that it is available to the container created in the next step.
4. Execute the following command in your terminal:
  
  docker run -it --name <containername> --mount type=bind,source="$(pwd)"/GeoDistributedAzureBot,target=/<targetfoldername> h2floh/geobotagent:latest usr/bin/pwsh
  
  The command mounts the existing folder into the Docker container and starts the container with a PowerShell terminal. The mounted files will be located at the path specified by targetfoldername.
5. Run the deployment steps.

Summary of steps

Deploying the Infrastructure & Sample Bot (includes import or creation of an SSL certificate)
Testing Bot and Failover
Destroying the Infrastructure (and saving your SSL certificate for reuse)
Deploy it again

1. Deploying the Infrastructure & Sample Bot

You can use the OneClickDeploy.ps1 script; several options are available.

ℹ️ Azure Front Door mode: It is now possible to use Azure Front Door instead of TrafficManager. This setup does not need a custom domain name or a Let's Encrypt SSL certificate and is therefore more stable. Add the -AZUREFRONTDOOR $True parameter when executing the OneClickDeploy script. ℹ️

⚠️ TrafficManager mode: For testing, the provided automatic issuing of a Let's Encrypt certificate is a good way to meet the valid-certificate requirement, but it is rate limited (50 certificates per week per top-level domain). Also, there is currently no automated way to renew the certificate every 3 months. So use it wisely and try to reuse the SSL certificate. Even though this architecture can handle production environments and be easily scaled out for them, we strongly recommend a custom domain name and certificate issuing via App Service or your preferred CA (Certificate Authority). ⚠️

⚠️ Known issues/drawbacks:

The Bot Name parameter has to be unique, since several Azure services use it as a prefix. Stick to lowercase, no dashes or special characters, and fewer than 20 characters, e.g. myfirstname1234
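The naming rule above can be checked before starting a deployment. The following is a minimal sketch in Python; the is_valid_bot_name function is hypothetical and not part of the repo's scripts. It assumes that digits are allowed (as in the example name) and reads "fewer than 20 characters" as at most 19. It cannot check global uniqueness.

```python
import re

# Lowercase letters and digits only, 1 to 19 characters (assumed reading of the rule).
_BOT_NAME_PATTERN = re.compile(r"[a-z0-9]{1,19}")


def is_valid_bot_name(name: str) -> bool:
    """Return True if the name follows the README's naming advice."""
    return _BOT_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_bot_name("myfirstname1234"))  # True
print(is_valid_bot_name("My-First-Bot"))     # False
```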

ℹ️ You can set -AUTOAPPROVE $False to review and accept the changes Terraform will make to your subscription. There are 3 to 5 executions, so be prepared to enter yes in between. ℹ️

ℹ️ Without changing the parameters the bot will deploy to three Azure regions:

Global/central artifacts: japaneast

Bot: koreacentral and southeastasia

ℹ️ To use a custom domain name, you just have to set a CNAME entry in your DNS server pointing to the TrafficManager domain name (default <botname>.trafficmanager.net). If you use Azure DNS, see its documentation on how to do this.

# Example 1: Azure Front Door Version (Let's Encrypt SSL certificate not needed)
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> -AZUREFRONTDOOR $True -AUTOAPPROVE $True

# Example 2: Issues an SSL certificate from Let's Encrypt for the TrafficManager Endpoint Domain
# [HINT: Export your Certificate (see ExportSSL.ps1) for reuse in subsequent runs]
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> -YOUR_CERTIFICATE_EMAIL <yourmailaddressforletsencrypt> -AUTOAPPROVE $True

# Example 3: Issues an SSL certificate from Let's Encrypt for your custom domain
# [HINT: Export your Certificate (see ExportSSL.ps1) for reuse in subsequent runs]
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -YOUR_CERTIFICATE_EMAIL <yourmailaddressforletsencrypt> -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True

# Example 4: Imports an existing SSL certificate (PFX File) for the TrafficManager Endpoint Domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> -AUTOAPPROVE $True

# Example 5: Imports an existing SSL certificate (PFX File) for your custom domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> `
 -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True

2. Testing Bot and Failover

If the deployment script runs without any failures, it will output generated links for accessing the WebChat locally or from this repo's GitHub Pages site.

Here are some hints on how to use the bot.

ℹ️ Alternatively, you can grab your Directline key from the Bot Channel Registration pane. Use the provided Test Webchat static index.html and append the following query arguments: ?bot=<BOT_NAME>&key=<DIRECT_LINE_KEY>
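If you build this link in a script, the key should be URL-encoded, since keys can contain characters that are not safe in a query string. A minimal sketch in Python using only the standard library follows; the build_webchat_url function, the page path and the key value are placeholders for illustration.

```python
from urllib.parse import urlencode


def build_webchat_url(page: str, bot_name: str, direct_line_key: str) -> str:
    """Append the bot and key query arguments to the test WebChat page."""
    query = urlencode({"bot": bot_name, "key": direct_line_key})
    return f"{page}?{query}"


# Placeholder values; use your own bot name and Directline key.
print(build_webchat_url("index.html", "mybot", "abc+/="))
# index.html?bot=mybot&key=abc%2B%2F%3D
```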

Last but not least, break something (remove the LUIS endpoint key in luis.ai, or stop the WebApp your bot responds from; TODO: create sample scripts to do that)

3. Destroying the Infrastructure (and saving your SSL certificate for reuse)

By executing the script below, you can save your SSL certificate and then delete all generated infrastructure:

# Example 1: Exports the SSL certificate as PFX File and destroys the infrastructure
.\Deploy\OneClickDestroy.ps1 -BOT_NAME <yourbotname>

4. Deploy it again

If you used the integrated Let's Encrypt certificate issuing, please use the saved certificate (it is valid for 3 months) for redeployments, provided you redeploy with the same Bot Name or custom domain.

# Example 1: Imports an existing SSL certificate (PFX File) for the TrafficManager Endpoint Domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> -AUTOAPPROVE $True

# Example 2: Imports an existing SSL certificate (PFX File) for your custom domain
.\Deploy\OneClickDeploy.ps1 -BOT_NAME <yourbotname> `
 -PFX_FILE_LOCATION <path to pfx file> -PFX_FILE_PASSWORD <password of pfx file> `
 -YOUR_DOMAIN <yourdomain> -AUTOAPPROVE $True

Learnings

There is no one-size-fits-all Infrastructure as Code tool

While Terraform is good at looping over each region, it is not very good at multi-step scenarios that involve waiting for a resource/artifact to be created
Terraform is also less than optimal if you want to introduce architecture choices
For waiting I used script loops together with Azure CLI commands
Terraform AzureRM provider still lacks some update features. E.g. there is a need to update only the Bot's endpoint in a subsequent Terraform execution, but this is not possible because there is no data source for Bot, so we would have to keep track of all parameters. In such cases we used Azure CLI for updating.
Terraform is very convenient if you want to destroy the environment again (demos, infrequently recurring tasks)
For real cross-platform usage of PowerShell Core scripts you have to stick to Unix file name/path conventions
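The "script loop" approach to waiting mentioned above boils down to polling with a timeout. The repo does this in PowerShell with Azure CLI commands; the following is only a generic sketch of the pattern in Python. The wait_until function and its parameters are hypothetical, and the check callable stands in for whatever command reports that the resource exists.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example with a stand-in check that succeeds on the third poll.
attempts = {"count": 0}


def resource_exists() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(resource_exists, timeout_s=10.0, interval_s=0.0))  # True
```

Passing sleep and clock as parameters keeps the loop easy to test without real delays.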

Open points and next steps

A list of various items from different domains and points of view:

Include prerequisite validation check
Change from LUIS CLI to API calls in order to overcome Azure Shell restriction on npm executable packages
Create additional documentation for all scripts and their options / deployment flow
~~Update scripts and Terraform to use remote state store based on Blob Storage~~
~~Extend Bot with Geo distributed Speech service~~
Include scripts to simulate different types of failures
Create a containerized version where AppService is replaced with Azure Kubernetes Service or Azure Container Instances
Create a version where LUIS and Speech service run on the same AKS as the bot

Related Work and references

Active/Passive Failover approach for Azure Bots - Blog by Sowmyan Soman Chullikkattil

