Agent CLI should be less-brittle when starting#207
Agent CLI should be less-brittle when starting#207acunniffe merged 7 commits intoopticdev:developfrom
Conversation
| //@ts-ignore | ||
| import jwtDecode from 'jwt-decode'; | ||
| //@ts-ignore | ||
| import niceTry from 'nice-try'; |
There was a problem hiding this comment.
🚨 swallowing errors, pretty much never the thing we want to do. I totally agree that the UX should be nice and that Optic service being down shouldn't stop services. Swallowing errors, however, is the total enemy to visibility and complete ignores that thrown errors could have put the process in an undefined state.
I don't see it used anywhere, but it shouldn't even be imported. We have to do the work of actually hardening this, not just hide the issues!
Totally agree here. Expected errors in operating of a distributed system should never bring the process down. Network partitions, downtime of other systems, all of that is expected. However, misconfiguration should always be as loud as possible and fail a deployment. There's nothing as frustration when running multiple instances on servers, with log outputs all collated, to have no idea why things aren't working, only to find out there is 1 tiny log line amongst millions indicating a failed parse somewhere, causing an entire branch to go ignored. When users choose to deploy Optic in a place, that's a direct intent, so if they have misconfigured things, it's our jobs to let them know. |
|
Good points @JaapRood -- I'll find a better balance between preventing Optic from being the thing that takes down your production infra and providing better visibility. Pending :) |
|
Past-Jaap who spent hours sifting through aggregated logs looking for similar issues thanks you! |
I made some changes to the Agent CLI to make it less brittle in production.
The
runcommand would crash on any invalid input to--config. Invalid tokens, strings, json of a different shape, etc. I wrapped that destabilization in a try/catch to we can branch the code.Branch 1 is followed when your start config and monitoring config is valid and starts the agent as usual
Branch 2 is followed when the start config is valid and the monitoring config is not. It starts the API and proxy as usual and swaps out the agent persistence manager for the
DoNothingCapture-- just a proxy with no recording, batching, etc :)Optic CI API having downtime, or some accidental clearing of the config environment variable should never affect your ability to deploy your code to prod.
--masqueradeto--listento match other proxies/http CLIs. Also, the context around the wordmasqueradeis not welcome cognitive load for security teams.