
---

## ✅ 2.6 Infrastructure as Code (IaC)

Provision, scale, and manage LLM infrastructure **declaratively** using code, so that environments are reproducible, capacity can be scaled on demand, and idle resources can be torn down to control cost.

---

### 🧱 **2.6.1 Provisioning Compute**

Declarative infrastructure setup for LLM training/inference:

| Tool        | Purpose                                   |
| ----------- | ----------------------------------------- |
| `Terraform` | Provision cloud GPUs, storage, networking |
| `Pulumi`    | IaC with Python/TypeScript support        |

Common use: automate the launch of GPU nodes (e.g., A100) on AWS, GCP, or Azure, or of TPU nodes on GCP, the only one of the three that offers TPUs.
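
As a minimal sketch of what "declarative" means here, the snippet below builds a Terraform configuration in Terraform's JSON syntax (a `*.tf.json` file) for a small group of GPU instances. The function name `gpu_node_config`, the resource name `llm_trainer`, the tags, and the AMI value are all hypothetical placeholders. Provider and networking configuration are omitted, so treat this as an illustration and not a complete setup.

```python
import json


def gpu_node_config(name: str, instance_type: str, ami: str, count: int = 1) -> dict:
    """Build a Terraform JSON document declaring `count` identical instances.

    Hypothetical helper for illustration: it describes the desired state only;
    Terraform decides what to create, change, or destroy.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "resource": {
            "aws_instance": {
                name: {
                    "ami": ami,
                    "instance_type": instance_type,
                    "count": count,
                    "tags": {"Name": name, "workload": "llm-training"},
                }
            }
        }
    }


if __name__ == "__main__":
    config = gpu_node_config(
        name="llm_trainer",
        instance_type="p4d.24xlarge",  # AWS instance type with 8x A100 GPUs
        ami="ami-PLACEHOLDER",  # assumption: replace with a real GPU-enabled AMI
        count=2,
    )
    # Save the output as main.tf.json, then run `terraform init` and `terraform plan`.
    print(json.dumps(config, indent=2))
```

Because the file states the desired end state, changing `count` from 2 to 4 and re-applying should add two nodes without touching the existing ones.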

---

### ☁️ **2.6.2 Cloud Platforms**

Pre-built MLOps services for scalable deployments:

| Platform        | Capabilities                                    |
| --------------- | ----------------------------------------------- |
| `AWS SageMaker` | Model training, inference endpoints, monitoring |
| `GCP Vertex AI` | Pipelines, tuning, managed notebooks            |
| `Azure ML`      | End-to-end MLOps with security & cost tracking  |

All three platforms support fine-tuning, monitoring, and deployment out of the box.

---

### 📅 **2.6.3 Autoscaling & Scheduling**

Manage resource scaling for training & inference jobs:

| Tool         | Use Case                                    |
| ------------ | ------------------------------------------- |
| `Kubernetes` | Pod autoscaling, GPU orchestration          |
| `Ray`        | Parallel model serving, dynamic autoscaling |
| `Slurm`      | HPC job scheduling in research clusters     |

Useful for batch training jobs or scaling real-time endpoints.
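
To make the scaling decision concrete, here is a simplified sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: `desired = ceil(current * currentMetric / targetMetric)`, with no change while the ratio stays within a tolerance (0.1 by default). The function name and the min/max defaults are assumptions for illustration. The real controller also accounts for unready pods, missing metrics, and stabilization windows, none of which are modelled here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA-style replica calculation (illustrative only)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 inference pods at 90% GPU utilization with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

The same proportional rule applies whether the metric is GPU utilization, queue depth, or requests per second, as long as the metric falls when replicas are added.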

---

### 🧪 **2.6.4 Reproducible Environments**

Ensure experiments run identically across environments:

| Tool     | Purpose                                   |
| -------- | ----------------------------------------- |
| `Docker` | Containerize environments for portability |
| `Conda`  | Manage Python & CUDA toolkit dependencies |

Pin exact dependency versions in an `environment.yaml` or `Dockerfile`, and record the model version alongside them, so that a run can be rebuilt later.
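
One way to tie a run to its environment is to hash the environment spec together with the model version and store the result with the experiment's metadata. The sketch below assumes this approach; `environment_fingerprint`, the sample spec contents, and the version string `my-model-v1.2` are hypothetical and not part of any tool named above.

```python
import hashlib


def environment_fingerprint(spec_text: str, model_version: str) -> str:
    """Return a short, stable ID for an environment spec plus a model version."""
    # Normalise line endings and trailing whitespace so cosmetic edits
    # do not change the fingerprint.
    normalised = "\n".join(line.rstrip() for line in spec_text.splitlines())
    digest = hashlib.sha256()
    digest.update(normalised.encode("utf-8"))
    digest.update(b"\0")  # separator between the spec and the model version
    digest.update(model_version.encode("utf-8"))
    return digest.hexdigest()[:16]


if __name__ == "__main__":
    # Hypothetical spec; in practice, read environment.yaml or the Dockerfile.
    spec = "name: llm-env\ndependencies:\n  - python=3.11\n  - numpy=1.26.4\n"
    print(environment_fingerprint(spec, model_version="my-model-v1.2"))
```

Two runs with the same fingerprint were launched from the same pinned spec and model version. A differing fingerprint signals that one of them changed.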

---
