### Set your OpenShift login token

This cell is a simple placeholder for your **OpenShift login token**.

A **token** is like a temporary password that proves to the cluster who you are.  
You’ll paste your real token between the quotes so the next cell can use it to log in.

In [None]:
TOKEN="<redact>"
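
A token pasted into a cell stays in the saved notebook. As an alternative, here is a minimal sketch that reads the token from an environment variable and otherwise prompts for it with hidden input. The helper `read_token` and the variable name `OPENSHIFT_TOKEN` are hypothetical and not part of the workshop setup.

```python
import os
from getpass import getpass


def read_token(env_var="OPENSHIFT_TOKEN", prompt=getpass):
    """Return the login token from the environment, or prompt for it.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # getpass hides the typed value, so it never appears in cell output.
        token = prompt("OpenShift token: ").strip()
    if not token:
        raise ValueError("No token provided")
    return token
```

With this helper defined, `TOKEN = read_token()` could stand in for the assignment above, and the login cell would work unchanged.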

### Log in to the OpenShift cluster

This cell logs you in to the shared **OpenShift** cluster using the token from above.

- `oc login` is the command-line way to sign in to OpenShift.
- The **token** acts as your secure “key”.
- The **server URL** points to the specific cluster used for this workshop.

Once this runs successfully, later `oc` commands run against that cluster, in your current project (namespace).


In [None]:
!oc login \
    --token="{TOKEN}" \
    --server=https://api.service-cbe-3.bkhx.p1.openshiftapps.com:6443
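
To confirm the login worked, one option is to ask `oc` who you are and which project is active. The sketch below wraps `oc whoami` and `oc project -q` with `subprocess`. The helper names `oc_output` and `login_summary` are hypothetical, and the `runner` parameter exists only so the functions can be exercised without a cluster.

```python
import subprocess


def oc_output(*args, runner=subprocess.run):
    """Run `oc <args>` and return its trimmed standard output."""
    result = runner(
        ["oc", *args],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if oc exits non-zero
    )
    return result.stdout.strip()


def login_summary(runner=subprocess.run):
    """Report the logged-in user and the current project."""
    user = oc_output("whoami", runner=runner)
    project = oc_output("project", "-q", runner=runner)
    return f"Logged in as {user} in project {project}"
```

Calling `print(login_summary())` after the login cell should show your user name and project. If the login failed, the call raises `subprocess.CalledProcessError` instead.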

### Create a secret with storage credentials

This cell creates a **Kubernetes Secret** called `raft-workshop-secrets`.

A **Secret** is a safe place in the cluster to store sensitive information, such as:
- **S3 endpoint** – the URL of the object storage service (MinIO/S3).
- **S3 bucket** – the “top-level folder” where your dataset lives.
- **Access key / Secret key** – similar to a username and password for that storage.

We pull these values from environment variables (set earlier in the workshop) and store them in the Secret.  
The finetuning job will read this Secret so it can download the dataset and upload results without exposing passwords in the notebook.


In [None]:
!oc create secret generic raft-workshop-secrets \
  --from-literal=S3_ENDPOINT="$AWS_S3_ENDPOINT" \
  --from-literal=S3_BUCKET="$AWS_S3_BUCKET" \
  --from-literal=S3_ACCESS_KEY="$AWS_ACCESS_KEY_ID" \
  --from-literal=S3_SECRET_KEY="$AWS_SECRET_ACCESS_KEY"
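
If one of the environment variables is unset, the Secret is still created, but with an empty value, and the problem only shows up later when the job fails to reach storage. A small pre-flight check, sketched below, can catch this before the Secret is created. The helper `missing_vars` is hypothetical; the variable names are the ones used in the command above.

```python
import os

# Environment variables read by the `oc create secret` command above.
REQUIRED_VARS = (
    "AWS_S3_ENDPOINT",
    "AWS_S3_BUCKET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running `missing_vars()` before the cell above returns an empty list when everything is set. Any names it returns need to be set first.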

### Submit the finetuning job to run on a GPU

This cell creates a **Kubernetes Job** that runs the actual model finetuning on the cluster.

Key ideas in plain language:

- A **Job** is a one-time task that runs until it finishes (or fails), like a “batch job”.
- The YAML you see describes:
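  - A **name prefix and labels** so the platform can track this job. `generateName` makes the cluster append a random suffix, so each submission gets a unique name (this is also why the cell uses `oc create` rather than `oc apply`).
  - A **template** for a Pod (the unit that runs one or more containers) that does the work.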
  - A single **container**:
    - Uses a pre-built workshop image: `docker.io/cengleby86/peft-training-workshop:latest`.
    - Runs `python train.py` inside that image to finetune the Granite model using your dataset.
    - Loads credentials as environment variables from the `raft-workshop-secrets` Secret created above.
    - Sets `S3_NAMESPACE` to the namespace the Pod runs in.
    - Sets `DATASET_NAME="surfing"` so the training script knows which dataset to use.
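  - **Resource settings**:
    - Limits the container to **1 GPU** (`nvidia.com/gpu: "1"`). For GPUs, Kubernetes treats the limit as the request too, so the Pod is only scheduled where a GPU is free.
    - Requests 1 CPU and 8Gi of memory up front, and allows the container to use up to 4 CPUs and 20Gi.
  - A **temporary disk** (`emptyDir`, capped at 40Gi) mounted at `/tmp` for scratch space during training.
  - A **node selector** (`nvidia.com/gpu.present: "true"`) so the Pod only lands on nodes that have a GPU.
  - `restartPolicy: Never` together with `backoffLimit: 0`, so a failed run is not retried.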

In [None]:
%%bash
oc create -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  generateName: granite-ft-surfing-
  labels:
    app: granite-ft-surfing
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: granite-ft
    spec:
      restartPolicy: Never
      serviceAccountName: default
      containers:
        - name: granite-ft-trainer
          image: docker.io/cengleby86/peft-training-workshop:latest
          imagePullPolicy: Always
          command: ["python"]
          args: ["train.py"]
          envFrom:
            - secretRef:
                name: raft-workshop-secrets
          env:
            - name: S3_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: DATASET_NAME
              value: "surfing"
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "4"
              memory: "20Gi"
            requests:
              cpu: "1"
              memory: "8Gi"
          volumeMounts:
            - name: tmp-volume
              mountPath: /tmp
      volumes:
        - name: tmp-volume
          emptyDir:
            sizeLimit: "40Gi"
      nodeSelector:
        nvidia.com/gpu.present: "true"
EOF
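
Because `generateName` gives the Job a random suffix, its exact name is not known ahead of time. One way to find it is through the `app: granite-ft-surfing` label set in the manifest. The sketch below lists matching Jobs and summarizes each one's state from the `succeeded`, `failed`, and `active` counters in its status. The helpers `job_phase` and `list_job_phases` and the phase names they return are hypothetical choices for this sketch.

```python
import json
import subprocess


def job_phase(job):
    """Summarize a Job object (parsed JSON) as a single word.

    With backoffLimit: 0, as in the manifest above, a single failed Pod
    means the whole Job has failed.
    """
    status = job.get("status", {})
    if status.get("succeeded", 0) > 0:
        return "Succeeded"
    if status.get("failed", 0) > 0:
        return "Failed"
    if status.get("active", 0) > 0:
        return "Running"
    return "Pending"


def list_job_phases(selector="app=granite-ft-surfing"):
    """Return {job name: phase} for all Jobs matching the label selector."""
    result = subprocess.run(
        ["oc", "get", "jobs", "-l", selector, "-o", "json"],
        capture_output=True,
        text=True,
        check=True,  # raise if oc exits non-zero (for example, not logged in)
    )
    items = json.loads(result.stdout)["items"]
    return {item["metadata"]["name"]: job_phase(item) for item in items}
```

Calling `list_job_phases()` a few minutes after submitting should show the generated Job name and whether it is still running. The name can then be passed to `oc logs -f job/<name>` to follow the training output.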