Site Partial Outage Summary (~5 hours) #385

Closed
orta opened this issue Mar 17, 2020 · 5 comments

@orta
Contributor

orta commented Mar 17, 2020

Re: #378

A deploy of the TypeScript v2 website, which had a somewhat novel redirect structure, caused subsequent builds of the site to deploy incorrectly.

Effectively, some files on the site came from the beta version of the website and others came from the older version.

This meant a lot of beta pages couldn't link to each other, and that some links didn't work (like the playground, which relies on a lot of other files).

Timeline

  • 3/14/2020 - v2 deployed for testing over the weekend
  • 3/14/2020 - PR adds redirects for missing URLs to TypeScript v2 site
  • 3/15/2020 - Deploy switching back to TypeScript v1
  • 3/16/2020 🔒 - Noticed that the site was still v2 but working fine; previous deploys had passed local build tests but had not deployed to the servers
  • 3/17/2020
    • 12:00pm EST Tried deploying v2 from scratch to see if there was an issue with v1 code specifically
    • 12:00 EST first reports of pages with 404s start coming in
    • 1:00 EST Started internal thread that deploys weren't behaving as expected
    • 1:10 EST Tried deploying an older build of v1 to see if there was an issue with new merged PRs
    • 1:30 EST Tried deploying last week's working v2
    • 2:00 EST Tried deploying the v2 from the week before that
    • 2:30 EST Tried deploying v1 without the v2 subfolder, on the chance there was something specific to v2
    • 3:30 EST Tested a deploy of v1 to staging, which worked
    • 4:00 EST Started asking within our org about forcing access to the Azure portal so we could read the deploy logs
    • 4:40 EST Got access to logs, diagnosed problem, started fix
    • 5:20 EST Full v1 deploy succeeded

Causes

We want to make sure that no links break in transitioning from v1 to v2. There are a lot of v1 links which aren't in active use but which redirect to existing pages, so the v2 site has a section which looks like this:

export const veryOldRedirects = {
  Playground: "/play",
  Tutorial: "docs/home",
  Handbook: "docs/home",
  samples: "docs/home",
  "/docs/home.html": "/docs/home",
  "/playground": "/play",
}

export const handbookRedirects = {
  "/docs/handbook/writing-declaration-files": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/writing-declaration-files.html": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/writing-definition-files": "/docs/handbook/declaration-files/introduction.html",
  "/docs/handbook/typings-for-npm-packages": "/docs/handbook/declaration-files/publishing.html",
  "/docs/handbook/release-notes": "/docs/handbook/release-notes/overview",
  "/docs/tutorial.html": "/docs/handbook/release-notes/overview",
}

export const setupRedirects = addRedirects => {
  addRedirects(veryOldRedirects)
  addRedirects(handbookRedirects)
}

setupRedirects loops through the objects above it and tells Gatsby that these redirects exist. Each redirect is then emitted to the file system as a page, for example /Playground/index.html, which forwards you to /play via client-side JavaScript using the plugin gatsby-plugin-client-side-redirect.
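
As a rough sketch of that loop (not the site's actual code): the addRedirects callback could be built on top of Gatsby's createRedirect action. The names makeAddRedirects, Redirects and CreateRedirect are made up for illustration, and the CreateRedirect type is only a stand-in for the fields of the action used here.

```ts
// Sketch only: `makeAddRedirects` and these types are hypothetical names.
type Redirects = Record<string, string>

// Stand-in for the argument shape of Gatsby's `createRedirect` action.
type CreateRedirect = (redirect: {
  fromPath: string
  toPath: string
  redirectInBrowser?: boolean
}) => void

// Builds the `addRedirects` callback that setupRedirects expects.
const makeAddRedirects = (createRedirect: CreateRedirect) => (redirects: Redirects): void => {
  Object.keys(redirects).forEach(fromPath => {
    const toPath = redirects[fromPath]
    if (toPath === undefined) return
    // redirectInBrowser asks Gatsby to also handle the redirect client-side
    createRedirect({ fromPath, toPath, redirectInBrowser: true })
  })
}
```

In a gatsby-node file this would presumably be wired up as setupRedirects(makeAddRedirects(actions.createRedirect)), though where exactly the site does that is an assumption.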

During a deploy, CI pushes to either the SITE-PRODUCTION or the SITE-STAGING branch, and then sends a webhook for Azure to pick up and deploy the static HTML from that branch (similar to how GitHub Pages works).

The deployment script in Azure failed when a path like /docs/handbook/writing-declaration-files.html changed between being a file and being a folder with an index.html (/docs/handbook/writing-declaration-files.html/index.html).

This meant that the deploy migrated files in alphabetical order and succeeded until it hit the path above, leaving half the site on v1 and half the site on v2.
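
A check along these lines could flag that kind of collision before a deploy goes out. This is a minimal sketch under the assumption that both the currently deployed file list and the incoming one are available as arrays of paths; directoriesOf and fileDirectoryConflicts are hypothetical helpers, not part of the site or of Azure's tooling.

```ts
// All ancestor directories of the given file paths, as a lookup table.
const directoriesOf = (files: string[]): Record<string, boolean> => {
  const dirs: Record<string, boolean> = {}
  files.forEach(file => {
    const parts = file.split("/")
    for (let i = 1; i < parts.length; i++) {
      const dir = parts.slice(0, i).join("/")
      if (dir !== "") dirs[dir] = true
    }
  })
  return dirs
}

// Paths that are a file in one deploy and a directory in the other.
const fileDirectoryConflicts = (deployed: string[], incoming: string[]): string[] => {
  const deployedDirs = directoriesOf(deployed)
  const incomingDirs = directoriesOf(incoming)
  const fileNowDir = deployed.filter(f => incomingDirs[f] === true)
  const dirNowFile = incoming.filter(f => deployedDirs[f] === true)
  return fileNowDir.concat(dirNowFile).sort()
}

// The v2 output on the server vs. the v1 output being deployed over it
const conflicts = fileDirectoryConflicts(
  ["/docs/handbook/writing-declaration-files.html/index.html", "/index.html"],
  ["/docs/handbook/writing-declaration-files.html", "/index.html"]
)
// conflicts is ["/docs/handbook/writing-declaration-files.html"]
```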

Resolution

Effectively we had been slow with setting up access to the Azure portal, which is where we would have been able to see build logs for deploys.

Deploys to the TypeScript site have been unpredictable on the Azure side for quite a while. Normally you can send another build down the pipeline and it fixes itself on the next run. This meant that when a bad deploy happened, the first few answers were simply "let's send another build across", which is roughly a 30 minute process (~15m in CI, then ~5-30m in Azure) to verify.

After a few rounds of "send a deploy" didn't work, and gave baffling results like the v1 index page alongside the v2 playground, it started to look like getting access to the build logs was going to be the only answer.

@DanielRosenwasser asked for some help from someone who had been helping the TS team set up our Azure portal access (Thanks Antoni) to see if we could speed it up.

Once we had access to the build logs, it became quite obvious what the issue was:

Error: The target file "D:\home\site\wwwroot\docs\handbook\writing-declaration-files.html" is a directory, not a file.

KuduSync.NET from: 'D:\home\site\repository' to: 'D:\home\site\wwwroot'
Copying file: 'docs\handbook\writing-declaration-files.html'
Failed exitCode=1, command="kudusync" -v 50  -f "D:\home\site\repository" -t "D:\home\site\wwwroot" -n "D:\home\site\deployments\d05a2ce67eeb43eb7b1efb61d5add7bca7afa673\manifest" -p "D:\home\site\deployments\871b18365dbe5f2e572dbb5043fdf3de61c3af69\manifest" -i ".git;.hg;.deployment;deploy.cmd"
An error has occurred during web site deployment.
Error: The target file "D:\home\site\wwwroot\docs\handbook\writing-declaration-files.html" is a directory, not a file.\r\nD:\Program Files (x86)\SiteExtensions\Kudu\85.11226.4297\bin\Scripts\starter.cmd "D:\home\site\deployments\tools\deploy.cmd"

From there the files were deleted via the Azure console, and a triggered redeploy got through successfully, returning the site to v1.

Post-Mortem

It took us 9 months to get other members of the team access to the Azure portal, which, ironically, was supposed to happen earlier this morning, but we had to re-schedule (given how wild everything is with COVID-19).

I didn't push the deadlines hard enough because we weren't seeing any problems. Having portal access to change settings and see build logs is a "nice to have" when you think you're working with static site hosting. It turns out that the site isn't running on cloud storage, but is an Azure App Service app, which meant we own more of the hosting responsibilities than I had anticipated.

To my knowledge, this has been the first downtime since I've started working on the site - that sucks. Sorry folks.

Mitigation

I have a few direct TODOs to stop this happening again:

  • Update the redirect plugin in v2 to not create the /index.html when the file is already a *.html
  • Set up to deploy directly to Azure, instead of using the git integration
  • Add alerts to a Teams chat room when a deploy fails
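
For the first TODO, the rule could look roughly like this. It is a sketch of the idea only, not the plugin's real code: redirectFileFor is a hypothetical helper that decides where the redirect page for a given source path gets written.

```ts
// Hypothetical helper: the file a redirect page should be written to.
const redirectFileFor = (from: string): string => {
  const path = from.charAt(0) === "/" ? from : "/" + from
  // Already names an HTML file: write the redirect page there directly,
  // instead of turning `file.html` into a folder with an index.html.
  if (/\.html$/.test(path)) return path
  return path.replace(/\/$/, "") + "/index.html"
}

redirectFileFor("Playground") // "/Playground/index.html"
redirectFileFor("/docs/handbook/writing-declaration-files.html") // unchanged, stays a file
```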

For the long term:

  • Look into moving the TypeScript website to a static site CDN (which should be a perf boost, I hope too)
@orta orta changed the title Site Partial Outage Summary (~6 hours) Site Partial Outage Summary (~4 hours) Mar 17, 2020
@orta orta changed the title Site Partial Outage Summary (~4 hours) Site Partial Outage Summary (~5 hours) Mar 17, 2020
@DanielRosenwasser
Member

DanielRosenwasser commented Mar 17, 2020

> To my knowledge, this has been the first downtime since I've started working on the site - that sucks. Sorry folks.

We all make mistakes, and I've certainly been on the other side of deploys where I couldn't figure out what went wrong at all. Thanks so much for putting this post-mortem together so we can discuss this.


> set up to deploy directly to Azure, instead of using the git integration

Can you elaborate on the "deploy directly" bit?

  • What's the difference between this and what we have now? Is one a webhook with Azure and one done by GitHub actions? Is the idea there to give us a more transparent build process to fix the "let's try deploying again" problem?
  • It looks like one benefit of that would be to provide our outputs part of a single folder's output via package? Would that actually help the directory naming issue?

> Look into moving the TypeScript website to a static site CDN (which should be a perf boost, I hope too)

One of the things here that you mentioned is that the site is an app service which for our cases (a static site with no back-end) is more complicated and less efficient. We've wanted to change this, but never been able to because...well, the authentications/permissions issues you hit!! 😫😄

Generally the guidance I've heard from folks in Azure is to use the storage APIs, but the workflow sounds confusing there. We should figure out what an ideal workflow and hosting situation is.

@orta
Contributor Author

orta commented Mar 18, 2020

> Is one a webhook with Azure and one done by GitHub actions?

Today, the deploy process is:

  • A GitHub Action builds the site as a static set of HTML, then pushes it to this git repo
  • The push to the git repo triggers a webhook, which sends Azure a notice that it should update the repo

This would switch it so that it's only one thing:

  • The same GitHub Action builds the static site, then deploys a zip file to Azure

This is much closer to how we do PR builds right now basically.


> Is the idea there to give us a more transparent build process to fix the "let's try deploying again" problem?

The idea here is that we can get logs in the CI, as well as being able to see pass/fail without going into Azure. I'm hoping that the CI action will just zip it up itself and overwrite the current running app.

Though it's possible we could hit that same bug, the thing we would definitely know is whether we've got our code to Azure correctly. When the webhook was unreliable, you weren't sure without logging in.


> One of the things here that you mentioned is that the site is an app service which for our cases (a static site with no back-end) is more complicated and less efficient.

Ace, yeah, I thought the site was this setup all along! Was kinda surprised myself. I've set up a CDN on Azure with edge support before (which is what playground v2 uses) and it's likely that this should be workable for the TS site too.

I'll need to do a bit of work (it's not quite netlify/now) but it should be achievable.

@orta
Contributor Author

orta commented Mar 18, 2020

I've sent a PR to the Gatsby plugin which would stop the potential file path/to/a/file.html -> path/to/a/file.html/index.html conversion which triggered the issue

@orta
Contributor Author

orta commented Mar 23, 2020

OK, closing this out

@orta orta closed this as completed Mar 23, 2020
@orta
Contributor Author

orta commented Apr 2, 2020

I have the staging deploys now directly deploying to Azure's blob storage as a static website

@orta orta mentioned this issue Apr 17, 2020