Diagnostic scripts built to accelerate incident triage and reduce Mean Time to Resolution (MTTR).
The Symptom: Manual verification of production application services was slowing down initial ticket triage and increasing MTTR. Support teams needed a faster way to confirm if a reported outage was a localized user issue or a broader service degradation.
The Investigation: During incident response, engineers were manually running ad-hoc curl commands, grepping unstructured logs, and SSHing into multiple hosts to check resource states — all before even confirming whether the reported service was actually degraded.
The Resolution: Developed lightweight, zero-dependency diagnostic scripts to automatically ping endpoints, parse health status, profile system resources, and scan application logs for recurring error patterns. These tools reduced manual verification time from ~15 minutes to under 60 seconds and accelerated engineering escalations with structured, evidence-backed output.
| Script | Purpose |
|---|---|
check_api_health.sh |
Hit a target endpoint and extract HTTP status, total response time, DNS lookup, TLS handshake, and payload size — structured as JSON for ticket evidence |
ssl_cert_check.sh |
Verify SSL/TLS certificate validity and calculate days-to-expiration with color-coded severity (OK / WARNING / CRITICAL) |
| Script | Purpose |
|---|---|
log_analyzer.py |
Parse structured (JSON) or standard Nginx access logs to aggregate HTTP status codes, identify high-frequency error endpoints, and profile top client IPs |
| Script | Purpose |
|---|---|
sys_health_dump.sh |
Capture a point-in-time snapshot of CPU, memory, disk, top processes, network connections, and kernel messages — designed to preserve transient state during active incidents |
| Script | Purpose |
|---|---|
db_latency_tester.py |
Measure database connection time and simple query latency across multiple iterations with min/max/avg/median statistics — isolates DB load from application logic |
# API Health Check — is the service down or just slow?
./network/check_api_health.sh https://api.example.com/v1/health 5
# SSL Certificate — rule out expired certs during outage escalation
./network/ssl_cert_check.sh example.com 443
# Log Analysis — surface error patterns for ticket evidence
python3 logs/log_analyzer.py /var/log/nginx/access.log --format nginx --limit 15
# System Snapshot — capture transient state during an active incident
./system/sys_health_dump.sh
# DB Latency — baseline query performance from the app tier
python3 db/db_latency_tester.py \
--dsn "postgres://user:pass@host:5432/db" --iterations 10- OS: Linux / macOS
- Bash: 4.0+
- Python: 3.8+ (standard library only —
psycopg2-binaryrequired for DB module) - Optional:
jqfor pretty-printed JSON output in health checks
All scripts are read-only diagnostic operations. They do not modify system state, write to production paths, or require elevated privileges (except dmesg in the system dump). Always review script contents before executing in a live customer environment.
| Repository | Description |
|---|---|
| log-rotation-maintenance | Automated Bash scripts for log rotation, compression, and storage cleanup on Linux servers |
| cloud-operations-runbook | 15+ standardized SOPs and runbooks covering compute, networking, identity, and application troubleshooting |
Maintained by a Technical Support Engineer focused on operational reliability and incident response.