Disclaimer - Trust no one, use your brain! (Work continuously in progress)
Content:
- Blueprint/template for a new service
- Documentation, standards, guides (how-to, know-how documents)
- Team support
- Understanding of the whole process by each member
- Pro-active development and support
- Accepted responsibilities and duties for each stage of a service
- The information on how to contribute MUST explain the contribution process (pull requests/issue)
- Plan for service live circle
- Pre-production development
- Launching
- Rollout backward compatible version of a service
- Hotfixing
- Rollout backward incompatible version of a service
- Data migration
- Switchover
- Service rollback
- Pre-commit/push hooks.
- Short test.
- Fast linters.
- Continuous development
- Tests
- Automated
- Unit tests
- Functional tests
- Code style (golangci-lint)
- Code quality monitoring (Sonar, Scrutinizer)
- Code coverage checks (gitlab example)
- Manual
- Feature acceptance/Business acceptance
- A/B tests
- Automated
- Conditions of integration
- Code style checks
- Test results
- Code coverage percentage
- Conditions of disintegration a feature
- Error rate after deploy live
- Healthchecks
- Storing a new tested snapshots/artefact of a service
- Artefact storage (Docker registry)
- Cleanup policy (Delete old tags with timeout)
- Artefact storage (Docker registry)
- Tests
- Continuous delivery of stable artefacts
- Images builder
- Services provisioning(Ansible)
- System layers
- Hardware: Servers and networks
- Scaling (adding new nodes) should not affect consistency of other layers
- Degradation (removing nodes) should not affect consistency of other layers
- Monitoring
- Hardware
- Network
- Resources and load
- Alerting policy
- Cluster: Services management system (Kubernetes, alternatives: OpenShift,Apache Mesos/Apache Karaf)
- Monitoring
- Availability of each node in the cluster
- All services up and running
- Connectivity between different pods and services
- Public endpoints accessibility
- Alerting policy
- Restart (full or partial) should bring cluster and systems up without destruction
- Log aggregation system - collect all logs from all containers
- Execution environment
- Meta-project with topology of the system
- Showroom + Staging
- Separate namespace for each showroom
- Fixed showroom for the staging (last stable pre-release)
- Production
- Configuration
- Secrets
- Configs should be a part of the meta-project
- Configuration
- Showroom + Staging
- Meta-project with topology of the system
- Monitoring
- Service: Application and any service
- Service itself (Docker image)
- Backward compatibility for a few generations
- Cleanup policy for deprecated/unused:
- Logic branches
- Data structures (RDBMS/NoSql)
- Cleanup policy for deprecated/unused:
- One container - one process
- Segregated commands even in one image (management layer can pick any to run)
- Built in commands
- Test service/source code (docker compose to setup required test ENV)
- DEV/DEBUG mode
- Logging
- Writing in stdout (without using containers’ file system) will enforce cluster layer to keep all logs
- Structured logging for better search.
- Access and application logs must be archived for a minimum of 90 days
- Monitoring
- Application and business checks (New Relic: throughput, metrics)
- Self health checks (metrics+Prometeus+Grafana)
- Queues content (amount of messages)
- Db content (custom checks)
- Cache utilization check
- Alerting policies (Prometeus, NewRelic)
- Tracing system agent (zipkin, Jaeger)
- Self-sufficiency
- Interfaces documentation
- Restful API
- Port and service description (README.md files)
- Service should be able to set itself up
- Interfaces documentation
- Backward compatibility for a few generations
- Replication, balancing and scaling on service level
- Failover and self-reorganisation in case of:
- Service crashed
- Physical node out of cluster
- Resources problems on specific node
- Logs system
- Service to collect and access logs grabbed from Cluster layer
- ELK stack/Gray Log/etc
- Service to collect and access logs grabbed from Cluster layer
- Persistent volumes to keep data
- EBS AWS
- Ceph
- NFS
- Service itself (Docker image)
- Hardware: Servers and networks
- Common services
- Tracing system (zipkin, Jaeger)
- Single sign-on service
- Authentication service (JWT)
- Authorization requests from all services
- Detached processing (CQRS)
- Request-Queue-Processor schema
- Stream data addressing and processing (Reactor)
- Real Time data requests processing
- Reliable data provider/API gateway (sync data retrieving)
- Request-Manager-Service solution
- Reliable data provider/API gateway (sync data retrieving)
- Reliable data-bus for events
- Event-Broker-Subscriber solution (Apache Camel)
- Http/TCP API endpoint to accept events
- Event fulfilment (Earn required information for subscribers)
- Event delivery
- Event delivery policies
- Retry
- Reque
- Giveup
- Event-Broker-Subscriber solution (Apache Camel)
- RDBMS: Postgres cluster
- DB backups: PG backupper
- Key-value + Queue: Redis cluster
- Messages system: Rabbit MQ cluster
- Healthcheck system
- Alerting system
- Disaster recovery plan (how to recover service/database/whole cluster/whole region)
- Release strategy process (canary, blue/green)
- Capacity planning (resources & limits)
- Alerts at least for basic metrics (CPU, memory, IOPS, disk space, restarts, servuces up)
- Access and application logs must be archived for a minimum of 90 days
- Set HSTS to 31536000 (1 year)
strict-transport-security: max-age=31536000
- Correctly set client IP
- Confirm client ip is in the proper location in X-Forwarded-For, modifying what is sent from the client if needed. AWS and GCP's load balancers will do this automatically.
- Make sure the web server and the application get the true client IP by configuring trusted IP's within Nginx or Apache
- Nginx: ngx_http_realip_module
- Apache: mod_remoteip
- If you have a service-oriented architecture, you must always be able to find the IP of the client that sent the initial request. Recommend passing along the
X-Forwarded-For
to all back-end services.
- for Cookies:
- Set the Secure and HTTPOnly flags
- Use a sensible Expiration
- Use the prefix
__Host-
for the cookie name
- All SQL queries must be parameterized, not concatenated
- Applications must use accounts with limited GRANTS when connecting to databases
- In particular, applications must not use admin or owner accounts, to decrease the impact of a sql injection vulnerability.
- User data must be escaped for the right context prior to reflecting it
- Apply sensible limits to user inputs, see input validation
- POST body size should be small (<500kB) unless explicitly needed