uuid url categories company product
76f27cf3-b204-40e4-942e-19657614f658
postmortem
Amazon

An inability to contact a data collection server triggered a latent memory leak bug in the reporting agent running on the EBS storage servers. The agent had no graceful degradation handling, so it kept trying to contact the collection server in a way that slowly consumed system memory. The monitoring system also failed to alarm on the memory leak, and because EBS servers generally make very dynamic use of all of their memory, the leak was hard to distinguish from normal usage. By Monday morning, the rate of memory loss had become quite high and had consumed enough memory on the affected storage servers that they could not keep up with request handling. The problem was made more severe by the inability to fail over, which resulted in the outage.
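To make "graceful degradation" concrete, the following is a minimal sketch of how a reporting agent could bound its memory use while a collection server is unreachable. It is not Amazon's implementation: the `ReportingAgent` class, its `send` callable, and every parameter name and default are hypothetical, chosen only for illustration. The sketch assumes two common techniques, a fixed-size buffer that drops the oldest reports when full and a capped exponential backoff between retries.

```python
import collections


class ReportingAgent:
    """Hypothetical agent: bounded buffer plus capped exponential backoff."""

    def __init__(self, send, max_buffered=1000, base_delay=1.0, max_delay=300.0):
        # send is an assumed callable(report) that raises ConnectionError
        # when the collection server cannot be reached.
        self._send = send
        # A deque with maxlen discards the oldest entry once full, so memory
        # use stays bounded however long the server is unreachable.
        self._buffer = collections.deque(maxlen=max_buffered)
        self._base_delay = base_delay
        self._max_delay = max_delay
        self._failures = 0  # consecutive failed send attempts

    def record(self, report):
        """Queue a report for delivery, evicting the oldest one if full."""
        self._buffer.append(report)

    def next_delay(self):
        """Seconds the caller should wait before the next flush attempt."""
        if self._failures == 0:
            return 0.0
        # Cap the exponent so the power never grows without bound.
        exponent = min(self._failures - 1, 30)
        return min(self._max_delay, self._base_delay * 2 ** exponent)

    def flush(self):
        """Send buffered reports in order; return how many were delivered."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                # Stop at the first failure and back off instead of retrying
                # in a tight loop.
                self._failures += 1
                return sent
            self._buffer.popleft()
            self._failures = 0
            sent += 1
        return sent
```

A caller would sleep for `next_delay()` seconds between calls to `flush()`. The design choice here is to lose old reports rather than risk the host: reporting data is treated as less important than the storage server's ability to serve requests.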