diff --git a/docs/LuciaTrainingPlatform/blog/2026-04-30-release-1-6.md b/docs/LuciaTrainingPlatform/blog/2026-04-30-release-1-6.md
new file mode 100644
index 00000000..089a552e
--- /dev/null
+++ b/docs/LuciaTrainingPlatform/blog/2026-04-30-release-1-6.md
@@ -0,0 +1,58 @@
+---
+slug: release-ltp-v1.6
+title: Releasing Lucia Training Platform v1.6
+author: Lucia Training Platform Team
+tags: [ltp, announcement, release]
+---
+
+We are pleased to announce the official release of **Lucia Training Platform v1.6.0**!
+
+## Lucia Training Platform v1.6.0 Release Notes
+
+This release focuses on security hardening, Docker image optimization, infrastructure upgrades, and bug fixes across the platform.
+
+## Platform Features & Bug Fixes
+- Upgraded webportal to Node.js 24 and removed the separate webportal-dind service; webportal now runs directly without Docker-in-Docker, simplifying deployment and reducing image size
+- Fixed job-detail page error handling for permission-denied errors; the page now shows a clear message instead of loading indefinitely
+- Fixed job YAML and output log display issues on the webportal
+- Added support for tagging different types of GPUs
+- Skipped validation job submission for CPU nodes
+- Made Prometheus retention size configurable per service to prevent disk-full issues
+- Added a tool to preserve application tokens when revoking all tokens
+- Removed the abnormal-detector cronjob when stopping the service
+- Fixed an exception when no name exists in the filter
+
+## Docker Image Optimization
+- Reduced Docker image sizes for cluster-local-storage, copilot-chat, dashboard-data-backup, utilization-reporter, abnormal-detector, cert-expiration-checker, cluster-utilization, reverse-proxy, and model-proxy
+- Upgraded metrics-cleaner base image from Python 3.7 to 3.12-slim
+- Cleaned up job-exporter Docker image
+
+## Infrastructure & Networking
+- Updated Cilium from 1.18.6 to 1.18.9
+- Updated Go version to 1.25 across all Go-based components
+- Homebrew build for kube-scheduler and Grafana container images
+- Downgraded kube-scheduler version to match service Kubernetes version
+- Added IPoIB subnet route in init.sh to fix InfiniBand TCP connectivity on NetworkManager-managed nodes
+- Fixed DNS problem for cluster-local-storage
+- Fixed the missing zlib 1.3.1 issue for pylon
+- Added Managed Identity support for build scripts
+- Made imagePullSecrets conditional to eliminate FailedToRetrieveImagePullSecret warnings
+- Removed secret deployment for image pull in favor of ACR credentials
+
+## Alert Manager & Node Management
+- Fixed KeyError when alert-parser processes validating nodes with no alerts
+- Downgraded hardware issues without Azure FaultCode to triaged_unknown to avoid breaking the OFR pipeline
+- Prevented node-recycler from submitting duplicate OFR tickets for the same node
+- Skipped classification for cordoned nodes with empty NodeId to prevent OFR pipeline stalling
+
+## Security
+- Updated Go toolchain and packages across all Go-based services
+- Updated Node.js packages for rest-server, alert-handler, job-status-change-notification, database-controller, and webportal
+- Updated Python packages for copilot-chat
+- Fixed S360 vulnerabilities across 13 container images including openssl, axios, follow-redirects, lodash, nodemailer, and minimatch
+- Updated go-ntlmssp to 0.1.1 for reverse proxy
+- Updated k8s-rdma-shared-dev-plugin to adapt to the latest gRPC package
+
+## CI/CD
+- Updated CI workflow to filter dev-box from changed-services detection
+- Removed all existing statefulsets in the system during cleanup instead of only config-defined ones