# 06 – Llama 3 + LoRA on OpenShift AI (CommsCom Churn)

In this module you will:

1. Build a **prompt dataset** for Llama 3 using the CommsCom churn model.
2. Save it as a JSONL file that can be used for **Llama 3.1 + LoRA**
   fine-tuning on OpenShift AI (using the Red Hat `distributed-workloads`
   example with Ray).
3. Understand the **next steps** to:
   - run the fine-tune on a GPU workbench,
   - serve the fine-tuned Llama model,
   - and call it to generate churn **explanations** and **retention actions**.

> This notebook runs on a **CPU workbench**; building the dataset needs no GPU.  
> The Llama fine-tuning itself happens later on a **GPU workbench**, using
> the Red Hat example notebook.


In [11]:
from pathlib import Path
import os
import sys

# Locate the MLforEng project root by walking up until we see "mlforeng"
NOTEBOOK_DIR = Path.cwd()
PROJECT_ROOT = None
for p in [NOTEBOOK_DIR, *NOTEBOOK_DIR.parents]:
    if (p / "mlforeng").exists():
        PROJECT_ROOT = p
        break

if PROJECT_ROOT is None:
    raise RuntimeError(
        "Could not locate the mlforeng package in the current directory "
        "or any of its parent directories."
    )

print("NOTEBOOK_DIR:", NOTEBOOK_DIR)
print("PROJECT_ROOT:", PROJECT_ROOT)

# Ensure project root is on sys.path
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

os.chdir(PROJECT_ROOT)
print("CWD:", os.getcwd())

from mlforeng.data_churn import train_test_churn, load_churn_raw
from mlforeng.predict import load_trained_model, predict_dataframe


NOTEBOOK_DIR: /Users/vgrover/Downloads/software/AIWorkshops/MLforEng
PROJECT_ROOT: /Users/vgrover/Downloads/software/AIWorkshops/MLforEng
CWD: /Users/vgrover/Downloads/software/AIWorkshops/MLforEng


In [12]:
model_name = "commscom_rf_tuned"
loaded = load_trained_model(model_name)

print("Loaded model:")
print("  path   :", loaded.path)
print("  dataset:", loaded.dataset)


Loaded model:
  path   : /Users/vgrover/Downloads/software/AIWorkshops/MLforEng/artifacts/pretrained/commscom_rf_tuned
  dataset: commscom_churn


In [13]:
df_raw = load_churn_raw()
df_raw.head()


| | Customer ID | Gender | Age | Married | Number of Dependents | City | Zip Code | Latitude | Longitude | Number of Referrals | ... | Payment Method | Monthly Charge | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | Customer Status | Churn Category | Churn Reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002-ORFBO | Female | 37 | Yes | 0 | Frazier Park | 93225 | 34.827662 | -118.999073 | 2 | ... | Credit Card | 65.6 | 593.3 | 0.0 | 0 | 381.51 | 974.81 | Stayed | | |
| 1 | 0003-MKNFE | Male | 46 | No | 0 | Glendale | 91206 | 34.162515 | -118.203869 | 0 | ... | Credit Card | -4.0 | 542.4 | 38.33 | 10 | 96.21 | 610.28 | Stayed | | |
| 2 | 0004-TLHLJ | Male | 50 | No | 0 | Costa Mesa | 92627 | 33.645672 | -117.922613 | 0 | ... | Bank Withdrawal | 73.9 | 280.85 | 0.0 | 0 | 134.6 | 415.45 | Churned | Competitor | Competitor had better devices |
| 3 | 0011-IGKFF | Male | 78 | Yes | 0 | Martinez | 94553 | 38.014457 | -122.115432 | 1 | ... | Bank Withdrawal | 98.0 | 1237.85 | 0.0 | 0 | 361.66 | 1599.51 | Churned | Dissatisfaction | Product dissatisfaction |
| 4 | 0013-EXCHZ | Female | 75 | Yes | 0 | Camarillo | 93010 | 34.227846 | -119.079903 | 3 | ... | Credit Card | 83.9 | 267.4 | 0.0 | 0 | 22.14 | 289.54 | Churned | Dissatisfaction | Network reliability |


In [14]:
import pandas as pd  # if not already imported

splits = train_test_churn(test_size=0.2, stratify=True)

X_train, X_test = splits.X_train, splits.X_test
y_train, y_test = splits.y_train, splits.y_test

# pandas 2.0+ compatible: use concat instead of append
X_full = pd.concat([X_train, X_test], axis=0)
y_full = pd.concat([y_train, y_test], axis=0)

X_full.shape, y_full.shape


((6589, 34), (6589,))

In [15]:
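# Use the loaded model to get churn probabilities for the full set.
# Note: X_full may include rows the model was trained on, so the
# probabilities for those rows can look more confident than on unseen data.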
proba = loaded.model.predict_proba(X_full)[:, 1]

X_scored = X_full.copy()
X_scored["churn_proba"] = proba
X_scored["churn_label"] = y_full.values

X_scored.head()


| | Gender | Age | Married | Number of Dependents | City | Zip Code | Latitude | Longitude | Number of Referrals | Tenure in Months | ... | Paperless Billing | Payment Method | Monthly Charge | Total Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Total Revenue | churn_proba | churn_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2370 | Male | 64 | No | 0 | Mount Hamilton | 95140 | 37.382909 | -121.634151 | 0 | 43 | ... | Yes | Bank Withdrawal | 60.0 | 2548.55 | 0.0 | 0 | 930.52 | 3479.07 | 0.0325 | 0 |
| 576 | Male | 44 | No | 2 | North Palm Springs | 92258 | 33.906496 | -116.569499 | 0 | 62 | ... | No | Bank Withdrawal | 20.0 | 1250.1 | 0.0 | 0 | 1517.76 | 2767.86 | 0.01 | 0 |
| 5559 | Female | 41 | No | 0 | San Diego | 92121 | 32.898613 | -117.202937 | 0 | 1 | ... | No | Bank Withdrawal | -9.0 | 45.8 | 0.0 | 10 | 22.11 | 77.91 | 0.985 | 1 |
| 3564 | Male | 50 | No | 0 | Echo Lake | 95721 | 38.851842 | -120.076204 | 0 | 3 | ... | Yes | Bank Withdrawal | 105.35 | 323.25 | 0.0 | 0 | 41.49 | 364.74 | 0.975 | 1 |
| 444 | Female | 58 | Yes | 2 | Pescadero | 94060 | 37.22565 | -122.297533 | 5 | 62 | ... | Yes | Bank Withdrawal | 70.75 | 4263.45 | 0.0 | 0 | 2607.1 | 6870.55 | 0.01 | 0 |


In [16]:
# Threshold for "high risk" – tweak as you like
THRESH = 0.7
high_risk = X_scored[X_scored["churn_proba"] >= THRESH].copy()

len(high_risk), high_risk["churn_proba"].describe()


(1524,
 count    1524.000000
 mean        0.872572
 std         0.088272
 min         0.700000
 25%         0.800000
 50%         0.885000
 75%         0.955000
 max         1.000000
 Name: churn_proba, dtype: float64)
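
Since the threshold is meant to be tuned, it can help to see how many customers each candidate value would select before settling on one. The following is a minimal sketch using only the standard library; `count_at_or_above` is a hypothetical helper (not part of `mlforeng`), and the probabilities are made-up illustrative values, not CommsCom data.

```python
def count_at_or_above(probabilities, thresholds):
    """Return {threshold: number of probabilities >= threshold}."""
    return {t: sum(1 for p in probabilities if p >= t) for t in thresholds}


# Illustrative values only; these are NOT taken from the CommsCom dataset.
example_proba = [0.05, 0.35, 0.62, 0.70, 0.81, 0.93, 0.99]

counts = count_at_or_above(example_proba, thresholds=[0.5, 0.7, 0.9])

for threshold, n in counts.items():
    print(f"threshold {threshold:.1f}: {n} customers")
```

In the notebook, the same helper should accept `X_scored["churn_proba"]` directly, since iterating a pandas Series yields its values.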

In [17]:
def make_prompt(row) -> str:
    """
    Turn a customer row + churn probability into a text prompt for Llama.
    """
    r = row

    return (
        "You are an AI assistant helping a telecom company called CommsCom "
        "reduce customer churn.\n\n"
        "Here is a customer profile:\n"
        f"- Tenure in months: {r.get('Tenure in Months')}\n"
        f"- Contract type: {r.get('Contract')}\n"
        f"- Monthly charges: {r.get('Monthly Charge')}\n"
        f"- Internet type: {r.get('Internet Type')}\n"
        f"- Has phone service: {r.get('Phone Service')}\n"
        f"- Has streaming TV: {r.get('Streaming TV')}\n"
        f"- Has streaming movies: {r.get('Streaming Movies')}\n"
        f"- Payment method: {r.get('Payment Method')}\n"
        f"- Churn probability from our ML model: {r['churn_proba']:.2f}\n\n"
        "Explain in simple business language why this customer might churn and "
        "suggest exactly ONE specific retention action CommsCom should take."
    )


def make_completion(row) -> str:
    """
    Construct a simple, template-style completion.
    In a real setting, this could be replaced by human-authored examples.
    """
    p = row["churn_proba"]
    if p >= 0.9:
        severity = "very high"
    elif p >= 0.8:
        severity = "high"
    else:
        severity = "elevated"

    tenure = row.get("Tenure in Months")
    contract = row.get("Contract")
    charges = row.get("Monthly Charge")

    return (
        f"Churn risk is {severity} because the customer has a tenure of "
        f"{tenure} months and is on a {contract} contract. Monthly charges of "
        f"{charges} may also be perceived as high compared to the value they "
        "feel they are receiving.\n\n"
        "Recommended action: Offer a targeted retention incentive such as a "
        "temporary discount or an upgraded plan at the same price, and have "
        "a service agent proactively confirm that their internet and phone "
        "service is working reliably."
    )


In [18]:
import pandas as pd

records = []
max_examples = 500  # cap for workshop; adjust as needed

for _, row in high_risk.head(max_examples).iterrows():
    prompt = make_prompt(row)
    completion = make_completion(row)
    records.append({"prompt": prompt, "completion": completion})

len(records)


500
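
Note that `high_risk.head(max_examples)` keeps the first rows in whatever order they already have; it does not rank by risk or sample at random. If a random subset is preferred, one option is to build records for all high-risk rows and then draw a reproducible sample. The sketch below assumes that approach; `sample_records` and the seed value are hypothetical, and the demo records are placeholders.

```python
import random


def sample_records(all_records, k, seed=42):
    """Return up to k records chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)        # local RNG, leaves global random state alone
    k = min(k, len(all_records))     # never ask for more than we have
    return rng.sample(all_records, k)


# Placeholder records standing in for the real prompt/completion pairs.
demo_records = [
    {"prompt": f"prompt {i}", "completion": f"completion {i}"} for i in range(10)
]

subset = sample_records(demo_records, k=3)
print(len(subset))
```

Using a dedicated `random.Random(seed)` instance keeps the selection repeatable across notebook runs without affecting other code that uses the global random generator.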

In [19]:
# Peek at the first example
records[0]


{'prompt': 'You are an AI assistant helping a telecom company called CommsCom reduce customer churn.\n\nHere is a customer profile:\n- Tenure in months: 1\n- Contract type: Month-to-Month\n- Monthly charges: -9.0\n- Internet type: DSL\n- Has phone service: Yes\n- Has streaming TV: No\n- Has streaming movies: No\n- Payment method: Bank Withdrawal\n- Churn probability from our ML model: 0.98\n\nExplain in simple business language why this customer might churn and suggest exactly ONE specific retention action CommsCom should take.',
 'completion': 'Churn risk is very high because the customer has a tenure of 1 months and is on a Month-to-Month contract. Monthly charges of -9.0 may also be perceived as high compared to the value they feel they are receiving.\n\nRecommended action: Offer a targeted retention incentive such as a temporary discount or an upgraded plan at the same price, and have a service agent proactively confirm that their internet and phone service is working reliably.'}
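
The first record exposes two weaknesses of the template: a monthly charge of `-9.0` is described as possibly "high", and the tenure reads "1 months". A minimal sketch of two guards follows. Both helpers are hypothetical, and treating negative charges as implausible is an assumption; they could be credits or data-entry errors, so check the dataset's documentation before filtering on them.

```python
def plausible_monthly_charge(value):
    """True if value is a non-negative number (assumption: negatives are suspect)."""
    try:
        return float(value) >= 0     # NaN compares False, so it is rejected too
    except (TypeError, ValueError):
        return False                 # None or non-numeric text


def format_months(n):
    """Render a tenure with the correct singular or plural unit."""
    unit = "month" if n == 1 else "months"
    return f"{n} {unit}"


print(plausible_monthly_charge(-9.0))   # the value seen in the first record
print(format_months(1))
print(format_months(43))
```

A guard like `plausible_monthly_charge` could be used either to skip such rows when building `records` or to drop the charge sentence from the completion for them.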

In [20]:
import json

datasets_dir = PROJECT_ROOT / "artifacts" / "datasets"
datasets_dir.mkdir(parents=True, exist_ok=True)

jsonl_path = datasets_dir / "commscom_llama_prompts.jsonl"

with jsonl_path.open("w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

jsonl_path, jsonl_path.stat().st_size


(PosixPath('/Users/vgrover/Downloads/software/AIWorkshops/MLforEng/artifacts/datasets/commscom_llama_prompts.jsonl'),
 501574)
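
Before moving the file to another workbench, it is worth reading it back to confirm that every line parses and carries both fields. The sketch below shows one way to do that with the standard library; `validate_jsonl` is a hypothetical helper, and the demo writes a small temporary file rather than touching the real dataset.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl(path):
    """Return the number of valid records; raise ValueError on the first bad line."""
    count = 0
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {line_no}: invalid JSON") from exc
            if not isinstance(record, dict):
                raise ValueError(f"line {line_no}: expected a JSON object")
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            count += 1
    return count


# Demo on a throwaway file with two placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        json.dumps({"prompt": "p1", "completion": "c1"}) + "\n"
        + json.dumps({"prompt": "p2", "completion": "c2"}) + "\n",
        encoding="utf-8",
    )
    n_valid = validate_jsonl(demo_path)

print(n_valid)
```

In the notebook, calling `validate_jsonl(jsonl_path)` should return the same number as `len(records)`.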

## Step 2 – Make the dataset available to your GPU workbench

The file you just created is:

```text
artifacts/datasets/commscom_llama_prompts.jsonl
