# Support Ticket Clustering Pipeline

This notebook runs the full clustering pipeline on Vertex AI Workbench.
Everything runs from within this notebook; a terminal is only needed for the optional disconnect-safe alternative described before step 4.

**Data stays within GCP** — the input CSV is pulled from a GCS bucket,
and results are pushed back to GCS. Raw transcripts never leave the cloud.

**Steps:**
1. Configure settings (project, bucket, input path)
2. Install dependencies
3. Upload code files + pull input CSV from GCS
4. Run disaggregation (Gemini API)
5. Run embedding
6. Run clustering
7. Preview results
8. Push results to GCS

**Estimated time:** ~30-45 min for 5K transcripts

## 1. Configuration

**Edit these values before running:**

In [None]:
# =============================================================================
# CONFIGURATION - EDIT THESE VALUES
# =============================================================================

# Your GCP project ID
PROJECT_ID = "your-project-id"  # <- Change this!

# GCS bucket name (without gs://)
BUCKET = "your-bucket-name"  # <- Change this!

# Input CSV path in GCS — raw data stays within GCP, never on a local machine
# (see step 3 below for how to upload your CSV to this location)
GCS_INPUT = "gs://your-bucket-name/inputs/tickets.csv"  # <- Change this!

# Local filename once pulled to the Workbench VM (no need to change)
INPUT_FILE = "input.csv"

# Processing options
LIMIT = 5000  # Number of transcripts to process (0 = all)
WORKERS = 20  # Concurrent API calls

# Clustering options
MIN_CLUSTER_SIZE = 15  # Minimum items to form a cluster
PROB_THRESHOLD = 0.0   # Minimum HDBSCAN membership probability (0.0 = keep all, try 0.05 to push uncertain points to noise)

# Vertex AI region for Gemini
REGION = "us-central1"  # Gemini is widely available across regions

# =============================================================================
# DON'T EDIT BELOW THIS LINE
# =============================================================================

GCS_OUTPUT = f"gs://{BUCKET}/outputs"

print("Configuration:")
print(f"  Project:          {PROJECT_ID}")
print(f"  Input (GCS):      {GCS_INPUT}")
print(f"  Output (GCS):     {GCS_OUTPUT}")
print(f"  Limit:            {LIMIT} transcripts")
print(f"  Workers:          {WORKERS} concurrent calls")
print(f"  Min cluster size: {MIN_CLUSTER_SIZE}")
print(f"  Prob threshold:   {PROB_THRESHOLD}")
print(f"  Region:           {REGION}")
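
Placeholder values left unchanged tend to surface later as confusing `gsutil` or API errors. As a minimal sketch, the settings above could be sanity-checked before moving on. The helper name `check_config` is hypothetical and not part of the pipeline code; it only encodes the rules stated in the comments of the configuration cell.

```python
def check_config(project_id, bucket, gcs_input, limit, workers):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of the pipeline code.
    """
    problems = []
    if project_id == "your-project-id":
        problems.append("PROJECT_ID is still the placeholder value")
    if bucket == "your-bucket-name" or "your-bucket-name" in gcs_input:
        problems.append("BUCKET or GCS_INPUT still uses the placeholder bucket")
    if bucket.startswith("gs://"):
        problems.append("BUCKET should be the bare name, without gs://")
    if not gcs_input.startswith("gs://"):
        problems.append("GCS_INPUT should be a full gs:// path")
    if limit < 0:
        problems.append("LIMIT must be 0 (all) or a positive number")
    if workers < 1:
        problems.append("WORKERS must be at least 1")
    return problems
```

In the notebook this would be called as `check_config(PROJECT_ID, BUCKET, GCS_INPUT, LIMIT, WORKERS)`, printing each returned problem before continuing to step 2.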

## 2. Install Dependencies

Run this once per Workbench session:

In [None]:
!pip install -q google-cloud-aiplatform sentence-transformers umap-learn hdbscan pyarrow
print("Dependencies installed")
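
The install cell prints its success message even if `pip` reported an error, so it can be worth confirming that the packages are actually importable. The sketch below is one way to do that, assuming the notebook kernel and the `python3` used by the later `!python3` cells share the same environment. The helper `missing_modules` is a hypothetical name; the module names listed are the import names of the packages installed above.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises if a parent package (e.g. "google") is absent
            found = False
        if not found:
            missing.append(name)
    return missing

# Import names for the packages installed in the cell above
REQUIRED = [
    "google.cloud.aiplatform",
    "sentence_transformers",
    "umap",
    "hdbscan",
    "pyarrow",
]
print("Missing modules:", missing_modules(REQUIRED) or "none")
```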

## 3. Set Up Files

### 3a. Upload pipeline code via JupyterLab file browser

Upload the following **code files only** using the file browser (left sidebar, upload button):
- `disaggregate.py`
- `embed.py`
- `cluster.py`
- `pipeline_utils.py`
- `system_prompt.txt`
- `examples.txt`

These are small, non-sensitive code files.

### 3b. Upload your input CSV to GCS

Your raw support transcript CSV contains sensitive data and should **never be downloaded to a local machine**. Upload it directly to GCS from wherever it lives:

**Option 1 — GCP Console (browser):**
1. Go to **GCP Console -> Cloud Storage -> Buckets**
2. Open your bucket (or create one)
3. Create an `inputs/` folder
4. Click **Upload Files** and select your CSV

**Option 2 — From a GCP VM or Cloud Shell that already has the data:**
```bash
gsutil cp /path/to/your/tickets.csv gs://your-bucket-name/inputs/tickets.csv
```

**Option 3 — From this Workbench notebook (if the CSV is already in another GCS bucket):**
Run this in a cell, replacing both paths with your own:
```python
!gsutil cp gs://source-bucket/path/to/tickets.csv gs://your-bucket-name/inputs/tickets.csv
```

Then run the cell below to pull the CSV to the Workbench VM and verify that all required files are present.

In [None]:
import os
import subprocess

# --- Pull input CSV from GCS to the Workbench VM ---
print(f"Pulling input CSV from GCS...")
print(f"  Source:      {GCS_INPUT}")
print(f"  Destination: {INPUT_FILE}")
result = subprocess.run(
    ["gsutil", "cp", GCS_INPUT, INPUT_FILE],
    capture_output=True, text=True
)
if result.returncode != 0:
    print(f"\nERROR pulling from GCS:\n{result.stderr}")
    print("\nCheck that:")
    print(f"  1. The file exists at {GCS_INPUT}")
    print(f"  2. The Workbench service account has read access to the bucket")
else:
    size = os.path.getsize(INPUT_FILE)
    print(f"  Downloaded: {size:,} bytes")

# --- Verify all required files are present ---
print("\n--- File check ---")
code_files = [
    "disaggregate.py",
    "embed.py",
    "cluster.py",
    "pipeline_utils.py",
    "system_prompt.txt",
    "examples.txt",
]
all_files = code_files + [INPUT_FILE]
missing = [f for f in all_files if not os.path.exists(f)]

if missing:
    print("MISSING FILES:")
    for f in missing:
        if f == INPUT_FILE:
            print(f"  - {f}  (failed to pull from GCS — see error above)")
        else:
            print(f"  - {f}  (upload via JupyterLab file browser)")
else:
    print("All files present:")
    for f in all_files:
        size = os.path.getsize(f)
        label = "(from GCS)" if f == INPUT_FILE else "(uploaded)"
        print(f"  {f} ({size:,} bytes) {label}")
    print("\nReady to run the pipeline!")
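
The check above only confirms that the files exist. A quick look at the CSV's shape can catch an empty or truncated download before any API calls are made. The following is a sketch under assumptions: the file is UTF-8 encoded with a header row, and `peek_csv` is a hypothetical helper, not part of the pipeline code. It reads only the header and counts rows, so no transcript text is printed.

```python
import csv

def peek_csv(path):
    """Return (column_names, data_row_count) without loading the file into memory."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)  # handles quoted fields that span several lines
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `header, n_rows = peek_csv(INPUT_FILE)` followed by printing both values shows whether the row count is in the range you expect for `LIMIT`.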

## Alternative: Run in Terminal (Disconnect-Safe)

**Only use this if your job will take >30 minutes** (e.g. processing 20K+ transcripts).

The notebook cells in steps 4-6 below work fine for smaller jobs. But if your browser disconnects
during a long run, the job stops. The terminal approach uses `nohup` so the job
survives disconnects.

**This is an ALTERNATIVE to running the cells in steps 4-6 below.** Choose one:
- **Notebook cells:** Simpler, see output live, but requires browser to stay connected
- **Terminal:** Disconnect-safe, check logs later, better for long jobs

**Terminal commands** — open **File -> New -> Terminal** in JupyterLab, then copy-paste:

```bash
# Run full pipeline with nohup (disconnect-safe)
# Replace YOUR_PROJECT_ID with your project ID. input.csv is the file pulled
# from GCS in step 3; change it if your CSV has a different local name.
nohup bash -c '
    python3 disaggregate.py --input input.csv --output disaggregated.parquet \
        --project YOUR_PROJECT_ID --region us-central1 --workers 20 --limit 0 && \
    python3 embed.py disaggregated.parquet embedded.parquet && \
    python3 cluster.py embedded.parquet clustered.csv --min-cluster-size 15 && \
    echo "DONE at $(date)"
' > pipeline.log 2>&1 &

# Check if it's running
pgrep -af "disaggregate.py|embed.py|cluster.py"

# Watch progress (Ctrl+C to stop watching, job continues)
tail -f pipeline.log
```

**To check on the job later:**
```bash
tail -50 pipeline.log          # see recent output
pgrep -af "disaggregate.py|embed.py|cluster.py"   # check if still running (no output = not running)
grep "DONE" pipeline.log        # check if finished
```
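
The same check can be done from a notebook cell after reconnecting. Because the terminal command chains the steps with `&&`, the `DONE at ...` line is only written when all three steps succeed, so a log without it means the job is either still running or has failed. The sketch below relies on that; `pipeline_status` is a hypothetical helper name, and the log path matches the `pipeline.log` used in the command above.

```python
from pathlib import Path

def pipeline_status(log_path="pipeline.log", tail_lines=5):
    """Summarize the nohup log: overall state plus the last few lines."""
    path = Path(log_path)
    if not path.exists():
        return {"state": "not started", "tail": []}
    lines = path.read_text(errors="replace").splitlines()
    # The echo at the end of the && chain only runs if every step succeeded
    finished = any(line.startswith("DONE at ") for line in lines)
    state = "finished" if finished else "running or failed"
    return {"state": state, "tail": lines[-tail_lines:]}
```

If the state is "running or failed", the returned tail usually shows which it is: progress output for a running job, or a Python traceback for a failed one.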

---

## 4. Run Disaggregation (Gemini API)

This calls Gemini to analyze each transcript. Takes ~15-30 min for 5K.

**Features:**
- JSON mode (`response_mime_type`) for reliable structured output
- A checkpoint is saved after every batch (50 rows)
- If the job fails, run the resume cell in step 4b to continue from the last checkpoint

**Cost:** ~$2.50 for 5K transcripts
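
For a different `LIMIT`, a rough estimate can be scaled from the 5K reference figures above. This sketch assumes cost and time grow linearly with the number of transcripts, which is an assumption rather than a measured result: transcript length, `WORKERS`, and API rate limits can all shift the real numbers. The function name is hypothetical.

```python
def estimate_disaggregation(n_transcripts):
    """Scale the 5K reference figures (~$2.50, ~15-30 min) linearly."""
    scale = n_transcripts / 5000
    return {
        "cost_usd": round(2.50 * scale, 2),
        "minutes_low": round(15 * scale),
        "minutes_high": round(30 * scale),
    }

print(estimate_disaggregation(20000))
# {'cost_usd': 10.0, 'minutes_low': 60, 'minutes_high': 120}
```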

In [None]:
!python3 disaggregate.py \
    --input {INPUT_FILE} \
    --output disaggregated.parquet \
    --project {PROJECT_ID} \
    --region {REGION} \
    --workers {WORKERS} \
    --limit {LIMIT}

### 4b. Resume from Checkpoint (if job failed)

Only run this if the cell above failed partway through:

In [None]:
# Resume from last checkpoint (only run if step 4 failed)
!python3 disaggregate.py \
    --input {INPUT_FILE} \
    --output disaggregated.parquet \
    --project {PROJECT_ID} \
    --region {REGION} \
    --workers {WORKERS} \
    --limit {LIMIT} \
    --resume
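
To make the checkpoint-and-resume behaviour concrete, here is a minimal sketch of the general pattern: process rows in batches, persist everything completed so far after each batch, and on resume skip the rows already saved. It is illustrative only. The real `disaggregate.py` may store checkpoints in a different format and location, and the names `process_with_checkpoints`, `process_row`, and `checkpoint_path` are hypothetical.

```python
import json
import os

def process_with_checkpoints(rows, process_row, checkpoint_path,
                             batch_size=50, resume=False):
    """Process rows in batches, saving all results so far after each batch."""
    results = []
    if resume and os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            results = json.load(f)
    start = len(results)  # rows already completed in a previous run
    for batch_start in range(start, len(rows), batch_size):
        batch = rows[batch_start:batch_start + batch_size]
        results.extend(process_row(row) for row in batch)
        # Write to a temp file first so a crash mid-write
        # cannot corrupt the existing checkpoint
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

With this pattern a failure loses at most one batch of work: a run that dies in the third batch restarts from row 100 rather than row 0.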

## 5. Run Embedding

Converts text to vectors for clustering. Takes ~10 min for 5K.

In [None]:
!python3 embed.py disaggregated.parquet embedded.parquet

## 6. Run Clustering

Groups similar problems together. Takes ~5 min for 5K.

In [None]:
!python3 cluster.py embedded.parquet clustered.csv --min-cluster-size {MIN_CLUSTER_SIZE} --prob-threshold {PROB_THRESHOLD}
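
The effect of `PROB_THRESHOLD` can be illustrated with a small sketch. HDBSCAN assigns each point a cluster label and a membership probability (exposed as `probabilities_` in the `hdbscan` library), with noise labeled `-1`. The sketch assumes that points strictly below the threshold are relabeled as noise; how `cluster.py` applies the threshold internally is not shown here, and `apply_prob_threshold` is a hypothetical name.

```python
def apply_prob_threshold(labels, probabilities, threshold):
    """Relabel points whose membership probability is below threshold as noise (-1)."""
    return [
        -1 if prob < threshold else label
        for label, prob in zip(labels, probabilities)
    ]

labels = [0, 0, 1, 1, -1]
probs = [0.90, 0.03, 1.00, 0.04, 0.00]

print(apply_prob_threshold(labels, probs, 0.0))   # [0, 0, 1, 1, -1]  (keeps all)
print(apply_prob_threshold(labels, probs, 0.05))  # [0, -1, 1, -1, -1]
```

A threshold of `0.0` changes nothing because no probability is below zero, while `0.05` moves only the weakly attached points into the noise group.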

## 7. Preview Results

In [None]:
import pandas as pd

df = pd.read_csv('clustered.csv')

# HDBSCAN labels noise/outliers as -1; all other labels are real clusters
clustered = df[df['cluster'] != -1]
cluster_sizes = clustered['cluster'].value_counts()  # sorted largest first

print(f"Total rows: {len(df)}")
print(f"Clusters found: {clustered['cluster'].nunique()}")
print(f"Noise/outliers: {(df['cluster'] == -1).sum()}")

print("\nFidelity breakdown:")
print(df['fidelity'].value_counts())

print("\nJourney breakdown (clustered rows only):")
print(clustered['journey'].value_counts().head(10))

print("\nTeam breakdown (clustered rows only):")
print(clustered['team'].value_counts())

print("\nCluster sizes (top 10):")
print(cluster_sizes.head(10))

print("\nSample from largest cluster:")
if cluster_sizes.empty:
    # Avoid an IndexError when every row was labeled as noise
    print("  (no clusters found)")
else:
    largest_cluster = cluster_sizes.index[0]
    sample_df = df[df['cluster'] == largest_cluster][['summarised_problem', 'fidelity', 'journey', 'team']].head(5)
    print(sample_df.to_string(index=False))

## 8. Push Results to GCS

Results go back to the same bucket. The raw data and outputs never leave GCP.

In [None]:
import os
import subprocess

files_to_push = {
    "clustered.csv": f"{GCS_OUTPUT}/clustered.csv",
    "disaggregated.parquet": f"{GCS_OUTPUT}/disaggregated.parquet",
    "embedded.parquet": f"{GCS_OUTPUT}/embedded.parquet",
}

for local_file, gcs_path in files_to_push.items():
    if os.path.exists(local_file):
        print(f"Pushing {local_file} -> {gcs_path}")
        result = subprocess.run(
            ["gsutil", "cp", local_file, gcs_path],
            capture_output=True, text=True
        )
        if result.returncode != 0:
            print(f"  ERROR: {result.stderr}")
        else:
            size_mb = os.path.getsize(local_file) / (1024 * 1024)
            print(f"  Done ({size_mb:.1f} MB)")
    else:
        print(f"Skipping {local_file} (not found)")

print(f"\nResults in GCS:")
!gsutil ls -l {GCS_OUTPUT}/

---

## IMPORTANT: Stop Workbench When Done!

Go to **GCP Console -> Vertex AI -> Workbench** and click **STOP** on your instance.

Otherwise the running instance keeps charging ~$8/day.