Helm charts enhancement #147
Comments
Related issue: #129
This is kind of done (but undocumented) w/ the Prometheus setup.
Do we ship basic alerts and related runbooks as well? This could have caught a few of the issues I've seen in the last few days.
Here's what we ship by default around alerting: https://github.com/PostHog/charts-clickhouse/blob/main/charts/posthog/values.yaml#L592-L671
Runbooks, I think, would live alongside our documentation in the handbook w/ troubleshooting sections.
Priorities, in my view atm:
high:
mid:
low:
In the last year we have implemented the majority of the improvements above. I'm going to close this issue as we are now tracking the remaining tasks individually.
👋 Hi! I’m going to list here a few random ideas on how we could improve our Helm charts, divided by topic:
📈 Scaling
we should support vertical and horizontal scaling of all our dependencies: Kafka, ClickHouse and PostgreSQL
vertical service scale: this is usually the first mitigation in case of resource contention. It usually involves adding more CPU/memory/storage to a pod.
horizontal service scale: this operation can take some time (depending on the dataset) and usually requires dataset partitioning/sharding and a cluster rebalance operation.
related to ☝️ we should make sure we mount service data dir on top of resizable storage
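To make the scaling points above a bit more concrete, here is a rough sketch of what they could look like at the Kubernetes level. It is not taken from the charts: the helper names, the `clickhouse` container name, and the `ebs.csi.aws.com` provisioner are assumptions for illustration only. It builds the patch bodies for a vertical scale (more CPU/memory), a horizontal scale (more replicas), and the storage pieces needed for resizable volumes.

```python
import json


def resources_patch(container: str, cpu: str, memory: str) -> dict:
    """Vertical scale: strategic-merge patch raising CPU/memory for one container."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Containers are matched by name in a strategic-merge patch.
                            "name": container,
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory},
                                "limits": {"cpu": cpu, "memory": memory},
                            },
                        }
                    ]
                }
            }
        }
    }


def replicas_patch(replicas: int) -> dict:
    """Horizontal scale: only adds pods; sharding/rebalancing is a separate step."""
    return {"spec": {"replicas": replicas}}


def expandable_storage_class(name: str, provisioner: str) -> dict:
    """StorageClass whose volumes can be grown after creation."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,  # assumption: must be a driver that supports expansion
        "allowVolumeExpansion": True,
    }


def pvc_resize_patch(size: str) -> dict:
    """Patch that requests a larger size for an existing PersistentVolumeClaim."""
    return {"spec": {"resources": {"requests": {"storage": size}}}}


if __name__ == "__main__":
    # Hypothetical values, printed as JSON so they can be passed to `kubectl patch -p`.
    print(json.dumps(resources_patch("clickhouse", "4", "16Gi")))
    print(json.dumps(replicas_patch(3)))
    print(json.dumps(expandable_storage_class("resizable", "ebs.csi.aws.com")))
    print(json.dumps(pvc_resize_patch("200Gi")))
```

Note that a PVC can only be grown this way if its StorageClass has `allowVolumeExpansion: true`, which is why the data dirs need to sit on that kind of storage from the start.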
🚨 Monitoring & Alerting
As part of the Helm charts, we should ship a basic monitoring/alerting stack. I know we have some debugging information already built into PostHog and we could probably extend that, but I don’t think it will cover most of the cases we might need (e.g. how can we troubleshoot a problem when a PostHog installation is down?)
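As an illustration of the kind of basic alert this could mean, here is a minimal sketch that builds a Prometheus rule group firing when a scrape target stays down. The `posthog-web` job name, the runbook URL, and the function name are hypothetical, and this is not what the chart ships. The built-in `up` metric is 0 whenever Prometheus fails to scrape a target, so this alert works even when the application itself cannot report anything.

```python
import json


def instance_down_rule(job: str, runbook_url: str, for_: str = "5m") -> dict:
    """Build a Prometheus rule group that alerts when targets of `job` are unreachable."""
    return {
        "groups": [
            {
                "name": f"{job}-availability",
                "rules": [
                    {
                        "alert": "InstanceDown",
                        # `up` is set by Prometheus itself: 1 if the scrape worked, 0 if not.
                        "expr": f'up{{job="{job}"}} == 0',
                        # Only fire once the condition has held for this long.
                        "for": for_,
                        "labels": {"severity": "critical"},
                        "annotations": {
                            "summary": "Target {{ $labels.instance }} is unreachable",
                            # Links the alert to its runbook entry.
                            "runbook_url": runbook_url,
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical job name and URL, for illustration only.
    rule = instance_down_rule("posthog-web", "https://example.com/runbooks/instance-down")
    print(json.dumps(rule, indent=2))
```

Carrying a `runbook_url` annotation on every alert is one way to tie the alerting and documentation points together.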
📑 Documentation
We should document all the maintenance operations & alerts in a runbook.
Please share your ideas and I'll add them to this post. Thank you!