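## Dynamic Service Scaling? Kubernetes Elastic Scaling + Load Balancing for Automated Operations

Do you face any of these scenarios every day?

1. During peak hours all 8 GPU nodes are maxed out and users queue for inference results.
2. A traffic spike hits at 3 a.m., manual scaling can't keep up, and the service goes down.
3. GPU spend reached $120,000 in a single month!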
4. ...

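## Core Architecture: A Three-Layer Elastic Scaling System

### Layer 1: Application-Level Elasticity (HPA) - GPU-Aware Pod Autoscaling
- Monitors GPU utilization, GPU memory usage, and inference queue length
- Scales out automatically when GPU utilization exceeds 70% and scales in when it drops below 30%
- Solves the "users queueing for inference" problem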
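
The scale-out and scale-in behavior above follows from the HPA's documented replica formula, which works from a single target value rather than two separate thresholds. The sketch below is a minimal illustration of that formula, assuming a 70% GPU utilization target and the default 10% tolerance; the function names are illustrative only.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """HPA rule: ceil(current * metric / target), skipped when close to target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance: leave the replica count alone
    return math.ceil(current_replicas * ratio)


def clamp(replicas: int, min_replicas: int, max_replicas: int) -> int:
    """Keep the result inside the configured min/max replica bounds."""
    return max(min_replicas, min(max_replicas, replicas))


if __name__ == "__main__":
    # 4 Pods averaging 95% GPU utilization against a 70% target
    print(clamp(desired_replicas(4, 95.0, 70.0), 2, 10))
```

With a 70% target, a reading of 30% shrinks the Deployment to well under half its size, which is why a separate "scale-in threshold" is not configured explicitly.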
  
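### Layer 2: Resource-Level Elasticity (VPA) - Dynamically Adjusting Pod Resources
- Model sizes differ widely: a 7B model fits in about 16GB of GPU memory, while a 70B model needs 80GB or more depending on quantization
- Recommends resource settings based on actual usage (VPA covers CPU and memory; GPU counts still have to be sized by hand)
- Solves the "resource configuration doesn't match the workload" problem

### Layer 3: Cluster-Level Elasticity (CA) - Adding and Removing GPU Nodes Automatically
- Adds nodes automatically when Pods can't be scheduled for lack of resources
- Starts evaluating scale-in once a node has been idle for more than 30 seconds
- Solves the "node count can't keep up with the business" problem

These three layers work together like a "smart brain" for your GPU cluster.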

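## Step 1: HPA Configuration - GPU-Aware Pod Autoscaling

The stock HPA is nearsighted: it sees only CPU and memory and is completely blind to the GPU, the real "power hog". I once ran into this awkward situation: CPU usage was only 30%, so the HPA saw no reason to scale out, yet GPU utilization was already at 95% and every user request was queueing!

**Solution: give the HPA "GPU eyes"**

We use KEDA + Prometheus to give the HPA "GPU eyes" so it can read the real GPU load.

Four steps:

1. Install the GPU monitoring components
2. Configure GPU metrics collection
3. Deploy the LLM Agent service
4. Configure the GPU-based HPA

Here are the details:

### Step 1: Install the GPU Monitoring Components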

```bash
#!/bin/bash
# install-gpu-monitoring.sh
set -euo pipefail

# Install the NVIDIA GPU Operator (official Helm chart).
# Its bundled DCGM exporter is disabled because the standalone DCGM-Exporter
# chart below collects the GPU metrics; running both would duplicate them.
echo "Installing NVIDIA GPU Operator..."
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install --wait gpu-operator \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=false \
  --set gfd.enabled=true

# Install DCGM-Exporter for GPU metrics collection
echo "Installing DCGM-Exporter..."
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update

helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator \
  --set serviceMonitor.enabled=true \
  --set serviceMonitor.additionalLabels.release=prometheus

# Install KEDA (event-driven autoscaling)
echo "Installing KEDA..."
helm repo add kedacore https://kedacore.github.io/charts
helm repo update

helm install keda kedacore/keda \
  --namespace keda-system \
  --create-namespace \
  --set prometheus.metricServer.enabled=true \
  --set prometheus.operator.enabled=true

# Verify the installation
echo "Verifying installation status..."
kubectl get pods -n gpu-operator
kubectl get pods -n keda-system
```

### Step 2: Configure GPU Metrics Collection

```yaml
# gpu-metrics-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-metrics-config
  namespace: gpu-operator
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - "gpu_rules.yml"

    scrape_configs:
    - job_name: 'dcgm-exporter'
      kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
          - gpu-operator
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: dcgm-exporter
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
      metrics_path: /metrics
      scrape_interval: 10s
      scrape_timeout: 10s

    - job_name: 'gpu-node-exporter'
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - source_labels: [__meta_kubernetes_node_label_accelerator]
        action: keep
        regex: nvidia.*
      - source_labels: [__address__]
        regex: '(.*):10250'
        target_label: __address__
        replacement: '${1}:9100'
      metrics_path: /metrics
      scrape_interval: 15s

  gpu_rules.yml: |
    groups:
    - name: gpu.rules
      rules:
      - alert: GPUHighUtilization
        expr: DCGM_FI_DEV_GPU_UTIL > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU utilization is high"
          description: "GPU {{ $labels.gpu }} on {{ $labels.instance }} has been above 90% for more than 5 minutes"

      # Total memory is computed as used + free because DCGM_FI_DEV_FB_TOTAL
      # is not part of DCGM-Exporter's default counter set.
      - alert: GPUHighMemoryUsage
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU memory usage is high"
          description: "GPU {{ $labels.gpu }} memory usage on {{ $labels.instance }} is above 85%"

      - alert: GPUTemperatureHigh
        expr: DCGM_FI_DEV_GPU_TEMP > 80
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "GPU temperature is high"
          description: "GPU {{ $labels.gpu }} temperature on {{ $labels.instance }} is above 80°C"

---
apiVersion: v1
kind: Service
metadata:
  name: gpu-metrics-service
  namespace: gpu-operator
  labels:
    app: gpu-metrics
spec:
  # Adjust to the labels your DCGM-Exporter Pods actually carry
  # (check with: kubectl get pods -n gpu-operator --show-labels).
  selector:
    app: dcgm-exporter
  ports:
  - name: metrics
    port: 9400
    targetPort: 9400
    protocol: TCP
  type: ClusterIP

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-metrics-monitor
  namespace: gpu-operator
  labels:
    app: gpu-metrics
    release: prometheus
spec:
  # A ServiceMonitor selects Services by label, so this must match the
  # labels of gpu-metrics-service above.
  selector:
    matchLabels:
      app: gpu-metrics
  endpoints:
  - port: metrics
    interval: 10s
    path: /metrics
```

### Step 3: Deploy the LLM Agent Service

```yaml
# llm-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent
  namespace: default
  labels:
    app: llm-agent
    version: v1
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: llm-agent
  template:
    metadata:
      labels:
        app: llm-agent
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      nodeSelector:
        accelerator: nvidia-tesla-gpu
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - llm-agent
              topologyKey: kubernetes.io/hostname
      containers:
      - name: llm-agent
        image: your-registry/llm-agent:latest
        imagePullPolicy: Always
        resources:
          requests:
            nvidia.com/gpu: 1
            cpu: 2000m
            memory: 8Gi
            ephemeral-storage: 10Gi
          limits:
            nvidia.com/gpu: 1
            cpu: 4000m
            memory: 16Gi
            ephemeral-storage: 20Gi
        env:
        - name: MODEL_PATH
          value: "/models/your-model"
        - name: MAX_BATCH_SIZE
          value: "8"
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        # NVIDIA_VISIBLE_DEVICES is intentionally not set: the device plugin
        # injects it for the allocated GPU, and "all" would override that.
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: MAX_CONCURRENT_REQUESTS
          value: "32"
        - name: MODEL_CACHE_SIZE
          value: "4096"
        - name: PROMETHEUS_PORT
          value: "9090"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        - containerPort: 9090
          name: metrics
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 120
          periodSeconds: 30
          timeoutSeconds: 10
          successThreshold: 1
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8000
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 30
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: cache-storage
          mountPath: /cache
        - name: tmp-storage
          mountPath: /tmp
        securityContext:
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 1000
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: cache-storage
        emptyDir:
          sizeLimit: 5Gi
      - name: tmp-storage
        emptyDir:
          sizeLimit: 2Gi
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/instance-type
        operator: Equal
        value: gpu-instance
        effect: NoSchedule
      terminationGracePeriodSeconds: 60

---
apiVersion: v1
kind: Service
metadata:
  name: llm-agent-service
  namespace: default
  labels:
    app: llm-agent
spec:
  selector:
    app: llm-agent
  ports:
  - name: http
    port: 80
    targetPort: 8000
    protocol: TCP
  - name: metrics
    port: 9090
    targetPort: 9090
    protocol: TCP
  type: ClusterIP
  sessionAffinity: None

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: fast-ssd

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-agent-monitor
  namespace: default
  labels:
    app: llm-agent
    release: prometheus
spec:
  # Take the Prometheus "job" label from the Service's "app" label so that
  # queries using job="llm-agent" match (the default would be the Service name).
  jobLabel: app
  selector:
    matchLabels:
      app: llm-agent
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
    honorLabels: true
```

### Step 4: Configure the GPU-Based HPA

```yaml
# llm-agent-hpa.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-agent-scaler
  namespace: default
  labels:
    app: llm-agent
spec:
  scaleTargetRef:
    name: llm-agent
  minReplicaCount: 2
  maxReplicaCount: 10
  pollingInterval: 30
  cooldownPeriod: 300
  # idleReplicaCount is omitted: KEDA requires it to be lower than
  # minReplicaCount, and in practice only 0 is supported.
  fallback:
    failureThreshold: 3
    replicas: 3
  advanced:
    restoreToOriginalReplicaCount: true
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 30
          - type: Pods
            value: 2
            periodSeconds: 60
          selectPolicy: Max
  # KEDA's default metricType is AverageValue: the query result is divided by
  # the replica count before it is compared with the threshold. Per-Pod
  # quantities are therefore summed below, and thresholds are per Pod.
  # Label names (job, kubernetes_pod_name) depend on how Prometheus relabels
  # targets; the Prometheus Operator attaches "pod" instead. Adjust to match.
  triggers:
  # GPU utilization trigger
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: gpu_utilization_avg
      threshold: '70'
      query: |
        sum(
          DCGM_FI_DEV_GPU_UTIL{
            job="dcgm-exporter",
            kubernetes_pod_name=~"llm-agent-.*"
          }
        )
    authenticationRef:
      name: prometheus-auth

  # GPU memory usage trigger (total = used + free)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: gpu_memory_utilization_avg
      threshold: '80'
      query: |
        sum(
          DCGM_FI_DEV_FB_USED{
            job="dcgm-exporter",
            kubernetes_pod_name=~"llm-agent-.*"
          } / (DCGM_FI_DEV_FB_USED{
            job="dcgm-exporter",
            kubernetes_pod_name=~"llm-agent-.*"
          } + DCGM_FI_DEV_FB_FREE{
            job="dcgm-exporter",
            kubernetes_pod_name=~"llm-agent-.*"
          }) * 100
        )
    authenticationRef:
      name: prometheus-auth

  # Request queue length trigger
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: request_queue_length_avg
      threshold: '10'
      query: |
        sum(
          llm_request_queue_length{
            job="llm-agent",
            kubernetes_pod_name=~"llm-agent-.*"
          }
        )
    authenticationRef:
      name: prometheus-auth

  # Request latency trigger. A p95 is not additive across Pods, so it is
  # compared directly with the threshold (metricType: Value). The buckets are
  # aggregated by "le" so the query returns a single series, as KEDA expects.
  - type: prometheus
    metricType: Value
    metadata:
      serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
      metricName: request_duration_p95
      threshold: '5000'  # 5 seconds, in milliseconds
      query: |
        histogram_quantile(0.95,
          sum by (le) (rate(llm_request_duration_seconds_bucket{
            job="llm-agent",
            kubernetes_pod_name=~"llm-agent-.*"
          }[5m]))
        ) * 1000
    authenticationRef:
      name: prometheus-auth

---
apiVersion: v1
kind: Secret
metadata:
  name: prometheus-auth
  namespace: default
type: Opaque
data:
  # Configure these if Prometheus requires authentication. If it does not,
  # remove the authenticationRef entries above along with this Secret and
  # the TriggerAuthentication below.
  # username: <base64-encoded-username>
  # password: <base64-encoded-password>

---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: prometheus-auth
  namespace: default
spec:
  secretTargetRef:
  - parameter: username
    name: prometheus-auth
    key: username
  - parameter: password
    name: prometheus-auth
    key: password

---
# Fallback HPA (only for clusters where KEDA is unavailable).
# Do not apply it together with the ScaledObject above: KEDA creates its own
# HPA, and two HPAs targeting the same Deployment will fight each other.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-agent-hpa-backup
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  minReplicas: 2
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
```
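
To make the interaction between several triggers and the `scaleUp` policies concrete, here is a simplified sketch. It assumes each trigger is evaluated as a ratio against its threshold, that the largest proposal wins (as the HPA does with multiple metrics), and that `selectPolicy: Max` picks the more permissive of the Percent and Pods policies. It ignores tolerance and the different policy periods, and the function names are illustrative.

```python
import math


def combine_triggers(current: int, readings: dict[str, tuple[float, float]]) -> int:
    """readings maps trigger name -> (current value, threshold); the largest proposal wins."""
    proposals = [
        math.ceil(current * value / threshold)
        for value, threshold in readings.values()
    ]
    return max(proposals)


def cap_scale_up(current: int, desired: int, percent: int = 100, pods: int = 2) -> int:
    """Apply selectPolicy: Max, i.e. the more permissive of the two scale-up policies."""
    max_by_percent = current + math.ceil(current * percent / 100)
    max_by_pods = current + pods
    return min(desired, max(max_by_percent, max_by_pods))


if __name__ == "__main__":
    readings = {
        "gpu_utilization": (95.0, 70.0),
        "queue_length": (45.0, 10.0),
        "p95_latency_ms": (3000.0, 5000.0),
    }
    wanted = combine_triggers(2, readings)
    print(wanted, cap_scale_up(2, wanted))
```

In this example the queue length asks for far more replicas than the policies allow in one step, so the Deployment grows in stages rather than jumping straight to the proposal.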

```bash
#!/bin/bash
# Deployment script

echo "Starting deployment of GPU monitoring and auto-scaling system..."

# Step 1: Install GPU monitoring components
echo "Step 1: Installing GPU monitoring components"
bash install-gpu-monitoring.sh

# Wait for GPU Operator to be ready
echo "Waiting for GPU Operator to become ready..."
kubectl wait --for=condition=ready pod \
  -l app=nvidia-operator-validator \
  -n gpu-operator \
  --timeout=300s

# Step 2: Configure GPU metrics collection
echo "Step 2: Configuring GPU metrics collection"
kubectl apply -f gpu-metrics-config.yaml

# Step 3: Deploy the LLM Agent service
echo "Step 3: Deploying the Large Language Model (LLM) Agent service"
kubectl apply -f llm-agent-deployment.yaml

# Wait for the service to be ready
echo "Waiting for the LLM Agent service to become ready..."
kubectl wait --for=condition=ready pod \
  -l app=llm-agent \
  --timeout=600s

# Step 4: Configure GPU-based Horizontal Pod Autoscaling (HPA)
echo "Step 4: Configuring GPU-based HPA"
kubectl apply -f llm-agent-hpa.yaml

# Verify deployment
echo "Verifying deployment status..."
kubectl get pods -n gpu-operator
kubectl get pods -l app=llm-agent
kubectl get scaledobjects
kubectl get hpa

echo "Deployment complete!"
echo "You can view GPU metrics using the following command:"
echo "kubectl port-forward -n gpu-operator svc/dcgm-exporter 9400:9400"
echo "Then visit http://localhost:9400/metrics"
```

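## Step 2: VPA Configuration - Dynamic Resource Adjustment

Core problem: model sizes vary enormously, and so do their resource needs

**Solution: VPA as a "smart tailor"**

VPA works like a smart tailor: it recommends the best-fitting "clothing size" (CPU and memory requests) based on actual usage.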
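
As a rough intuition for what the "tailor" does, the sketch below takes a high percentile of observed usage, adds a safety margin, and clamps the result to `minAllowed`/`maxAllowed`. This is a simplification, not the VPA recommender's actual algorithm (which works on decaying histograms); the 90th percentile and 15% margin are assumptions chosen for illustration, and values are in millicores.

```python
def percentile(samples: list[int], q: int) -> int:
    """Nearest-rank percentile for q in 1..100, using integer arithmetic only."""
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # ceil(q * n / 100)
    return ordered[rank - 1]


def recommend(samples: list[int], min_allowed: int, max_allowed: int,
              q: int = 90, margin_percent: int = 15) -> int:
    """Percentile of usage plus a safety margin, clamped to the allowed range."""
    base = percentile(samples, q)
    target = (base * (100 + margin_percent) + 99) // 100  # round up
    return max(min_allowed, min(max_allowed, target))


if __name__ == "__main__":
    cpu_millicores = [800, 900, 1000, 1200, 1500, 1800, 2000, 2500, 3000, 5000]
    print(recommend(cpu_millicores, min_allowed=500, max_allowed=4000))
```

The clamp is what `minAllowed` and `maxAllowed` in the VPA resource policy provide: however spiky the usage, the recommendation never leaves the range you set.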

### 1. Install the VPA Components

```bash
#!/bin/bash

# Install Vertical Pod Autoscaler (revised version)
echo "Installing Vertical Pod Autoscaler (VPA) components..."

# Method 1: Use the officially recommended installation method
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler/

# Check Kubernetes version compatibility.
# "kubectl version --short" was removed in recent kubectl releases, so the
# server version is read from the JSON output instead (requires jq).
KUBE_VERSION=$(kubectl version -o json | jq -r '.serverVersion.gitVersion' | sed 's/^v//')
echo "Detected Kubernetes version: $KUBE_VERSION"

# Install VPA components
./hack/vpa-up.sh

# Verify installation
echo "Verifying VPA component installation..."
kubectl get pods -n kube-system | grep vpa
kubectl get crd | grep verticalpodautoscaler

# Method 2: Install using Helm (recommended for production)
# helm repo add cowboysysop https://cowboysysop.github.io/charts/
# helm install vpa cowboysysop/vertical-pod-autoscaler \
#   --namespace kube-system \
#   --set recommender.enabled=true \
#   --set updater.enabled=false \
#   --set admissionController.enabled=false

echo "VPA installation completed!"
```


### 2. Configure VPA Policies (Best-Practice Configurations)

```yaml
# llm-agent-vpa.yaml
# Best Practice 1: Environment-specific VPA config (Dev)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-agent-vpa-dev
  namespace: development
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  updatePolicy:
    updateMode: "Auto"  # In dev, you can allow automatic updates
    minReplicas: 1      # Dev can allow a single replica
  resourcePolicy:
    containerPolicies:
    - containerName: llm-agent
      minAllowed:
        cpu: 500m
        memory: 2Gi
      maxAllowed:
        cpu: 4000m
        memory: 16Gi
      controlledResources: ["cpu", "memory"]

---
# Best Practice 2: Conservative production config (Prod)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-agent-vpa-prod
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  updatePolicy:
    updateMode: "Initial"  # In prod, apply only at pod creation time
  resourcePolicy:
    containerPolicies:
    - containerName: llm-agent
      minAllowed:
        cpu: 2000m     # Higher minimums in production
        memory: 8Gi
      maxAllowed:
        cpu: 6000m     # Conservative maximums
        memory: 24Gi
      controlledResources: ["cpu", "memory"]
      controlledValues: "RequestsOnly"  # Only adjust requests

---
# Best Practice 3: VPA with a PodDisruptionBudget (PDB)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: llm-agent-pdb
  namespace: default
spec:
  minAvailable: 1  # Ensure at least 1 pod is always running
  selector:
    matchLabels:
      app: llm-agent

---
# Best Practice 4: VPA + HPA working together
# VPA and HPA must not act on the same resource metric (CPU or memory).
# Here the HPA scales on GPU and queue metrics, and the VPA only recommends.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: llm-agent-vpa-with-hpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  updatePolicy:
    updateMode: "Off"  # Recommended to disable auto-updates when used with HPA
  resourcePolicy:
    containerPolicies:
    - containerName: llm-agent
      minAllowed:
        cpu: 2000m
        memory: 8Gi
      maxAllowed:
        cpu: 8000m
        memory: 32Gi
      controlledResources: ["cpu", "memory"]
      # Recommendations are still computed for this container; with
      # updateMode "Off" they are never applied automatically.
      mode: Auto
```


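### 3. VPA Monitoring and Alerting

```yaml
# Requires a Service labeled app=vpa-recommender with a port named "metrics";
# create one if your VPA installation does not provide it.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vpa-recommender-monitor
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: vpa-recommender
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vpa-alerts
  namespace: kube-system
spec:
  groups:
  - name: vpa.rules
    rules:
    # The recommendation metric below is the name historically exposed by
    # kube-state-metrics; newer releases need it enabled explicitly, so check
    # what your setup exports. Series are matched on namespace and container.
    - alert: VPARecommendationDeviation
      expr: |
        (
          kube_pod_container_resource_requests{resource="cpu"}
          - on(namespace, container) group_left()
          kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource="cpu"}
        )
        / on(namespace, container) group_left()
        kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource="cpu"}
        * 100 > 50
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "VPA recommended value deviates too much from the current configuration"
        description: "Pod {{ $labels.pod }} CPU requests differ from the VPA recommendation by more than 50%"

    - alert: VPARecommenderDown
      expr: up{job="vpa-recommender"} == 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "VPA Recommender service is unavailable"
        description: "VPA Recommender has been down for more than 5 minutes"
```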


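## Step 3: CA Configuration - Cluster Node Autoscaling

Core problem: GPU nodes are painfully expensive, so every dollar has to count

Solution: Karpenter - a "node steward" considerably smarter than the traditional CA

Karpenter is an open-source, next-generation cluster autoscaler originally built by AWS, and it is well suited to "precious" resources like GPUs.

- Fast decisions: the traditional CA works through node groups and can take minutes to add capacity, while Karpenter reacts to unschedulable Pods within seconds
- Cost first: automatically picks the cheapest combination of instance types that fits
- Multiple instance types: can mix different GPU models for flexible scheduling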
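
The "cost first" idea can be sketched as picking the cheapest offering that satisfies the pending Pod's GPU request. This is only an illustration: Karpenter's real selection also weighs Spot availability and other scheduling constraints, and the prices below are hypothetical placeholders, not real quotes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offering:
    instance_type: str
    capacity_type: str   # "spot" or "on-demand"
    gpus: int
    hourly_usd: float    # hypothetical price for illustration only


def cheapest_fit(offerings: list[Offering], gpus_needed: int) -> Optional[Offering]:
    """Return the lowest-priced offering with enough GPUs, or None if nothing fits."""
    candidates = [o for o in offerings if o.gpus >= gpus_needed]
    return min(candidates, key=lambda o: o.hourly_usd, default=None)


if __name__ == "__main__":
    catalog = [
        Offering("g4dn.xlarge", "on-demand", 1, 0.50),
        Offering("g4dn.xlarge", "spot", 1, 0.20),
        Offering("g4dn.2xlarge", "spot", 1, 0.30),
        Offering("p3.2xlarge", "on-demand", 1, 3.00),
    ]
    print(cheapest_fit(catalog, gpus_needed=1))
```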


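Key configuration examples:

### 1. Smart Instance Selection - the Core of Cost Optimization

```yaml
# The most important cost-optimization settings
spec:
  template:
    spec:
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["g4dn.xlarge", "g4dn.2xlarge", "p3.2xlarge"]  # Allowed types; list order does not set priority
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]  # Spot is preferred when both are allowed; savings of up to about 70%
```
Key advantages:

- Automatically picks the lowest-cost GPU instance type that fits
- Spot instances first, cutting costs by up to about 70%
- Multiple instance types improve the odds of getting capacity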

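### 2. Aggressive Scale-In - Preventing Wasted Resources

```yaml
# Aggressive scale-in settings - the heart of the savings
spec:
  template:
    spec:
      expireAfter: 2160h             # Force replacement after 90 days (under template.spec in karpenter.sh/v1)
  disruption:
    consolidationPolicy: WhenEmpty   # Only completely empty nodes are removed
    consolidateAfter: 30s            # Remove a node 30 seconds after it becomes empty (key!)
```
Key advantages:

- 30-second scale-in: saves a lot compared with the Cluster Autoscaler's default 10-minute wait before removing an unneeded node
- WhenEmpty policy: a node is reclaimed as soon as it is completely idle
- Automatic node replacement: avoids problems with long-lived nodes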


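### 3. Dedicated Taints for GPU Nodes - Resource Isolation

```yaml
# GPU node isolation
taints:
- key: nvidia.com/gpu
  value: "true"
  effect: NoSchedule  # Only workloads that tolerate this taint can land on GPU nodes
```
Key advantages:

- Keeps non-GPU workloads off expensive GPU nodes
- Dedicates GPU resources to GPU workloads, improving utilization
- Avoids waste and cost leakage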
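
The matching rule behind taints and tolerations can be sketched as follows. This mirrors the documented Kubernetes semantics (an empty `effect` matches any effect, `Exists` ignores the value, an empty key with `Exists` matches every taint), but it is a simplified model rather than the scheduler's code; the helper names are illustrative.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        key = toleration.get("key")
        return not key or key == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(tolerations: list[dict], taints: list[dict]) -> bool:
    """A Pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in tolerations) for t in blocking)


if __name__ == "__main__":
    gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
    gpu_pod = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
    print(schedulable(gpu_pod, [gpu_taint]), schedulable([], [gpu_taint]))
```

One practical consequence: every additional taint on a node pool needs its own matching toleration on the Pod, otherwise the Pod stays Pending even though GPUs are free.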

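### 4. Multi-AZ Fault Tolerance
```yaml
# High-availability configuration
subnetSelectorTerms:
- tags:
    karpenter.sh/discovery: ${CLUSTER_NAME}  # Discovers every tagged subnet, and with them their availability zones
```

Key advantages:

- Nodes can be spread across availability zones automatically
- Avoids a single point of failure
- Improves service availability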


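### 5. Automatic GPU Driver Setup
```yaml
# Automated GPU environment setup.
# With an AL2 amiFamily, Karpenter merges this script with its own bootstrap
# step, so /etc/eks/bootstrap.sh is not called here.
userData: |
  #!/bin/bash
  # Install the NVIDIA driver only if the AMI does not already ship it
  # (the EKS-optimized accelerated AMIs do)
  command -v nvidia-smi >/dev/null || yum install -y nvidia-driver-latest-dkms
  # Start GPU monitoring if DCGM is installed on the AMI
  systemctl enable --now nvidia-dcgm || true
```
Key advantages:

- GPU environment setup with no manual intervention
- Installs the driver automatically when the AMI lacks one
- GPU monitoring starts with the node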

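### Complete Cost-Optimized Configuration
```yaml
# gpu-nodepool.yaml - cost-optimized Karpenter configuration
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cost-optimized-gpu-pool
spec:
  template:
    metadata:
      labels:
        node-type: gpu-optimized
        cost-tier: spot-first
        accelerator: nvidia-tesla-gpu  # Matches the nodeSelector of the llm-agent Deployment
    spec:
      requirements:
      # Allowed instance types (Karpenter chooses by price and availability; list order is irrelevant)
      - key: node.kubernetes.io/instance-type
        operator: In
        values: [
          "g4dn.xlarge",    # Cheapest GPU instance
          "g4dn.2xlarge",   # Next-best value
          "p3.2xlarge"      # High-performance fallback
        ]
        # g4ad.xlarge is left out: it has AMD GPUs and cannot serve nvidia.com/gpu requests
      # Spot-first strategy
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      # Spread across availability zones
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-west-2a", "us-west-2b", "us-west-2c"]

      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: optimized-gpu-nodeclass

      # Force replacement after 60 days (more aggressive).
      # In karpenter.sh/v1 this field lives under template.spec, not disruption.
      expireAfter: 1440h

      # Dedicated GPU taints. Pods need a matching toleration for every taint listed here.
      taints:
      - key: nvidia.com/gpu
        value: "dedicated"
        effect: NoSchedule
      - key: workload-type
        value: "ml-inference"
        effect: NoSchedule

  # Strict resource limits
  limits:
    cpu: 500      # Cap on total CPUs
    memory: 500Gi # Cap on total memory
    nvidia.com/gpu: 20  # Cap on total GPUs

  # Aggressive cost-optimization policy
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s        # Scale in 30 seconds after a node becomes empty

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: optimized-gpu-nodeclass
spec:
  role: "KarpenterGPURole-${CLUSTER_NAME}"

  # Subnet selection - discovers the tagged subnets
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${CLUSTER_NAME}
      tier: "gpu-optimized"

  # Security group selection
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: ${CLUSTER_NAME}

  # The v1 API requires amiSelectorTerms. Karpenter selects the accelerated
  # (GPU) AMI variant for GPU instance types; pin a version in production.
  amiFamily: AL2
  amiSelectorTerms:
  - alias: al2@latest

  # Instance store optimization
  instanceStorePolicy: RAID0

  # Detailed tags for cost tracking
  tags:
    Environment: "production"
    CostCenter: "ml-inference"
    ManagedBy: "karpenter"
    AutoScaling: "enabled"

  # User data script. Karpenter merges it with its own bootstrap step for the
  # AL2 family, so /etc/eks/bootstrap.sh is not called here; node labels come
  # from the NodePool template above.
  userData: |
    #!/bin/bash
    set -e

    # GPU driver: install only if the AMI does not already ship one
    command -v nvidia-smi >/dev/null || yum install -y nvidia-driver-latest-dkms

    # GPU monitoring, if DCGM is installed on the AMI
    systemctl enable --now nvidia-dcgm || true

    # Performance tuning
    echo 'net.core.somaxconn = 65535' >> /etc/sysctl.conf
    sysctl -p

    # Cost monitoring script
    cat > /usr/local/bin/cost-monitor.sh << 'EOF'
    #!/bin/bash
    # Record instance usage for cost analysis (uses an IMDSv2 session token)
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
    INSTANCE_TYPE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)
    echo "$(date): Instance $INSTANCE_ID ($INSTANCE_TYPE) started" >> /var/log/cost-tracking.log
    EOF
    chmod +x /usr/local/bin/cost-monitor.sh
    /usr/local/bin/cost-monitor.sh
```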



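## Step 4: Load Balancer Configuration - Smart Traffic Distribution

Core problem: the "special needs" of LLM services

An LLM service isn't like an ordinary web service. It has two quirks:

1. Session affinity: the user asks "What's your name?" and the model answers "I'm ChatGLM". If the next question, "How old are you?", lands on a different Pod and the conversation context lives in Pod memory, the model has no idea what was said before, and the experience suffers.

2. Warm-up: a newly started Pod has to load the model into GPU memory, which takes 2-5 minutes. If it receives traffic right away, users are left waiting.


**Solution: multi-layer load balancing + smart warm-up**

General approach: Ingress + Service layer + Pod warm-up mechanism

1. Ingress layer configuration
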
```yaml
# llm-agent-ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-agent-ingress
  annotations:
    # Cookie-based affinity; upstream-hash-by is not set alongside it because
    # the two mechanisms would compete.
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
    nginx.ingress.kubernetes.io/session-cookie-name: "llm-session"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "3600"
    nginx.ingress.kubernetes.io/session-cookie-secure: "true"  # Recommended in production
    # Lightweight endpoint answered by NGINX itself. It does not warm up Pods,
    # and snippet annotations must be allowed by the ingress controller.
    nginx.ingress.kubernetes.io/server-snippet: |
      location /warmup {
        access_log off;
        return 200 "OK";
      }
spec:
  # Use ingressClassName rather than the deprecated kubernetes.io/ingress.class
  # annotation; setting both is rejected by the API server.
  ingressClassName: nginx
  rules:
  - host: llm-api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llm-agent-service
            port:
              number: 80

```
2. Service layer configuration

```yaml
# llm-agent-service.yaml
# Note: this replaces the ClusterIP Service defined with the Deployment.
# If all traffic enters through the Ingress, ClusterIP is sufficient.
apiVersion: v1
kind: Service
metadata:
  name: llm-agent-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb  # NLB works at the TCP layer
spec:
  type: LoadBalancer
  sessionAffinity: ClientIP  # Session affinity based on client IP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600   # Keep sessions for 1 hour
  selector:
    app: llm-agent
  ports:
  - port: 80
    targetPort: 8000
    protocol: TCP
```

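3. Pod warm-up mechanism
```yaml
# Add a warm-up initContainer to the Deployment.
# Note: an initContainer exits before the main container starts, so nothing it
# loads stays in GPU memory. It validates the model files and warms the node's
# page cache; the readiness probe still gates traffic until the main container
# has loaded the model.
spec:
  template:
    spec:
      initContainers:
      - name: model-warmup
        image: your-registry/llm-agent:latest
        command: ["/bin/sh"]
        args:
        - -c
        - |
          echo "Starting model warm-up..."
          python warmup.py --model-path /models/your-model --warmup-requests 5
          echo "Warm-up complete"
        resources:
          limits:
            nvidia.com/gpu: 1  # GPUs must be declared under limits
        volumeMounts:
        - name: model-storage
          mountPath: /models
```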
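
Inside the serving container, the same idea can be expressed as a readiness gate: the readiness endpoint reports "not ready" until a few warm-up inferences have completed. The sketch below is a minimal illustration using only the standard library; `infer` stands in for the real model call, and the class and function names are hypothetical rather than taken from the service's code.

```python
import threading
from typing import Callable, Iterable


class WarmupGate:
    """Tracks completed warm-up requests; thread-safe."""

    def __init__(self, required: int) -> None:
        self._required = required
        self._done = 0
        self._lock = threading.Lock()

    def record_success(self) -> None:
        with self._lock:
            self._done += 1

    @property
    def ready(self) -> bool:
        with self._lock:
            return self._done >= self._required


def warm_up(gate: WarmupGate, infer: Callable[[str], str], prompts: Iterable[str]) -> None:
    """Run each warm-up prompt through the model and count the successes."""
    for prompt in prompts:
        infer(prompt)
        gate.record_success()


def readiness_status(gate: WarmupGate) -> int:
    """HTTP status a readiness endpoint would return: 200 once warm, 503 before."""
    return 200 if gate.ready else 503


if __name__ == "__main__":
    gate = WarmupGate(required=5)
    print(readiness_status(gate))
    warm_up(gate, infer=lambda p: p.upper(), prompts=["hello"] * 5)
    print(readiness_status(gate))
```

Because the readiness probe only sees 200 after warm-up finishes, the Service never routes user traffic to a Pod that is still loading the model.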

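### Load Balancing Best Practices: Different Strategies for Different Scenarios

Session affinity strategy (choose by business scenario):

- Web applications: cookie-based affinity with a 1-hour lifetime (held until the user closes the browser)
- API services: ClientIP affinity with a 30-minute lifetime (API calls are relatively short-lived)
- Long-lived connections: connection-level affinity (WebSocket inference services)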
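
Session affinity matters most while the autoscaler is adding Pods, because a naive mapping would reshuffle existing sessions. The sketch below uses rendezvous (highest-random-weight) hashing to show one way a session can stay on its Pod when a new Pod joins. It illustrates the concept only and is not how ingress-nginx or kube-proxy implement affinity; the Pod names are made up.

```python
import hashlib


def pick_backend(session_id: str, backends: list[str]) -> str:
    """Rendezvous hashing: choose the backend with the highest hash score."""
    def score(backend: str) -> bytes:
        return hashlib.sha256(f"{session_id}|{backend}".encode("utf-8")).digest()
    return max(backends, key=score)


if __name__ == "__main__":
    pods = ["llm-agent-a", "llm-agent-b", "llm-agent-c"]
    before = pick_backend("session-42", pods)
    after = pick_backend("session-42", pods + ["llm-agent-d"])
    # After scale-out, the session either stays put or moves to the new Pod only
    print(before, after)
```

The useful property is that adding a backend never moves a session between two of the old backends, so most conversations keep their context through a scale-out.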

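Health check configuration (don't let a "half-baked" Pod receive traffic):
```yaml
readinessProbe:
  httpGet:
    path: /health/ready       # Same endpoint as in the Deployment above
    port: 8000
  initialDelaySeconds: 60   # Leave enough time for the model to load
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 2       # Ready only after 2 consecutive successes
  failureThreshold: 3       # Unready only after 3 consecutive failures
```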


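## Case Study: A ChatGLM-6B Inference Service at a Medical AI Company

A medical AI company deployed a ChatGLM-6B inference service that handles about 100,000 medical Q&A requests per day.

### Business Background: the High Bar of Medical Q&A

- Latency: doctors can't wait during a consultation, so time to first token must be under 2 seconds
- Accuracy: the medical field has very little tolerance for error, and the service availability requirement is 99.9%+
- Cost pressure: margins in medical AI are thin, so GPU costs must be tightly controlled
- Traffic pattern: peaks on weekdays from 8:00 to 18:00, troughs at night and on weekends

### Key Improvements:
- Average GPU utilization rose from 50% to 85%
- Monthly cost fell from $120,000 to $72,000, a 40% saving
- Service availability rose from 95.5% to 99.8%
- No more being woken in the middle of the night to scale out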
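
The figures above can be sanity-checked with a little arithmetic. The sketch below recomputes the saving from the two monthly costs and converts the availability numbers into hours of downtime, assuming a 30-day (720-hour) month; the function names are illustrative.

```python
def savings_percent(before_usd: int, after_usd: int) -> float:
    """Relative saving between two monthly costs, in percent."""
    return (before_usd - after_usd) * 100 / before_usd


def downtime_hours(availability_percent: float, hours_in_month: int = 720) -> float:
    """Hours of downtime per month implied by an availability percentage."""
    return (100 - availability_percent) / 100 * hours_in_month


if __name__ == "__main__":
    print(savings_percent(120_000, 72_000))
    print(round(downtime_hours(95.5), 2), round(downtime_hours(99.8), 2))
```

Under that assumption, going from 95.5% to 99.8% availability cuts monthly downtime from roughly 32 hours to about an hour and a half.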

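### Deployment Steps: a Hands-On Walkthrough

#### 1. Create the Namespace and Resource Quota
```shell
# Create a dedicated namespace
kubectl create namespace llm-production

# Create a GPU resource quota (prevents runaway usage)
kubectl apply -f - <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: llm-production
spec:
  hard:
    # At most 10 GPUs; adjust to your budget. Extended resources such as
    # nvidia.com/gpu only support the "requests." quota prefix.
    requests.nvidia.com/gpu: "10"
EOF
```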

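#### 2. Deploy All Components
```shell
# Apply all manifests in order.
# The manifests shown earlier hardcode metadata.namespace (gpu-operator, default).
# Change those fields to llm-production first: kubectl rejects -n when it
# conflicts with the namespace written inside a manifest.
kubectl apply -f gpu-metrics-config.yaml   # Lives in the gpu-operator namespace
kubectl apply -f llm-agent-deployment.yaml -n llm-production
kubectl apply -f llm-agent-service.yaml -n llm-production
kubectl apply -f llm-agent-hpa.yaml -n llm-production
kubectl apply -f llm-agent-vpa.yaml -n llm-production
kubectl apply -f llm-agent-ingress.yaml -n llm-production
kubectl apply -f gpu-nodepool.yaml  # Cluster-scoped, no namespace needed
```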

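#### 3. Verify the Deployment (Important!)
```shell
# Check Pod status (all should be Running)
kubectl get pods -n llm-production -l app=llm-agent

# Check HPA status (GPU metrics should be visible)
kubectl get hpa -n llm-production

# Check GPU resources on the nodes (confirm the GPUs are detected);
# the label matches the one set in the Karpenter NodePool
kubectl describe nodes -l node-type=gpu-optimized

# Test service availability (the most important step)
curl -X POST https://llm-api.yourdomain.com/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello, please introduce yourself"}' \
  -w "Response time: %{time_total}s\n"
```

If the curl test returns a normal response in under 2 seconds, congratulations, the deployment succeeded!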

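#### Performance Monitoring Dashboard

Create a monitoring dashboard in Grafana. Key metrics include:
```yaml
# Key monitoring metrics
# (kube_node_labels exposes node-type only if kube-state-metrics is configured to export that label)
- GPU utilization: avg(DCGM_FI_DEV_GPU_UTIL)
- GPU memory usage (%): avg(DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100
- Pod replica count: kube_deployment_status_replicas
- Request latency (p95): histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
- Error rate: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
- GPU node count: count(kube_node_labels{label_node_type="gpu-optimized"})
```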
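
Any of these expressions can also be checked outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses an instant-query response, so it runs without a live server; the Prometheus address is the one used in the KEDA triggers, and the sample payload is made up for illustration.

```python
import json
from typing import Optional
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_value(response_body: str) -> Optional[float]:
    """Extract the first sample value from an instant-query response, if any."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        return None
    result = payload["data"]["result"]
    return float(result[0]["value"][1]) if result else None


if __name__ == "__main__":
    url = build_query_url(
        "http://prometheus.monitoring.svc.cluster.local:9090",
        "avg(DCGM_FI_DEV_GPU_UTIL)",
    )
    print(url)
    # Illustrative response body in the shape Prometheus returns for a vector result
    sample = '{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1700000000,"82.5"]}]}}'
    print(first_value(sample))
```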

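## Summary: the "Secret Manual" for LLM Elastic Scaling


### Core Principle: the Layers Work Together, and None Can Be Skipped

**HPA layer**: GPU-aware autoscaling that gives the system "eyes" for the real load

**VPA layer**: smart resource recommendations so that every Pod wears "clothes that fit"

**CA layer**: precise control of node count, turning the cluster into a "money saver"

**Load balancing**: smart traffic distribution for a "silky smooth" user experience


### Best-Practice Highlights: Hard-Won Lessons

1. Scale-out threshold: 70% GPU utilization. Don't set it too low (wastes money) or too high (hurts the experience)
2. Cooldown period: 5-10 minutes, to prevent "jittery" scaling up and down
3. Minimum replicas: keep at least 2, because restarting from zero is too slow
4. Health checks: LLMs start slowly, so allow plenty of time (5-10 minutes)
5. Monitoring and alerting: GPU temperature, utilization, latency, and error rate; none of them is optional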
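
The cooldown in item 2 corresponds to the HPA's scale-down stabilization window: within the window, the controller acts on the highest replica count it recently recommended, so a brief dip in load does not remove Pods. The sketch below models that rule with a 300-second window, as in the configuration earlier; it is a simplified model and the class name is illustrative.

```python
from collections import deque


class ScaleDownStabilizer:
    """Keeps recent recommendations and returns the highest one inside the window."""

    def __init__(self, window_seconds: int) -> None:
        self.window = window_seconds
        self.history: deque[tuple[float, int]] = deque()  # (timestamp, desired replicas)

    def recommend(self, now: float, desired: int) -> int:
        self.history.append((now, desired))
        # Drop recommendations that have aged out of the window
        while self.history and self.history[0][0] < now - self.window:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer(window_seconds=300)
    for t, desired in [(0, 6), (60, 3), (299, 2), (301, 2), (400, 2)]:
        print(t, stabilizer.recommend(t, desired))
```

In the example the load drops at t=60, but the replica count only follows once the earlier, higher recommendations have left the window.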