# RNA Folding Pipeline: YAML Conversion, Readable Data Loader, and Model Ensembling

This notebook demonstrates three reusable components extracted and refactored from my Kaggle pipeline work:

1. **YAML conversion** – turn raw RNA sequences into per-target YAML configs for downstream inference (Boltz-1 / Protenix).
2. **Readable data loader** – standardize raw CSV into a clean, schema-checked DataFrame for batch processing.
3. **Two-model ensemble** – wrap *Boltz-1* and *Protenix* predictors with a unified interface and combine outputs (mean / rank-average).

> Note: The actual model inference steps are mocked with placeholder functions so the notebook is self-contained. 


## 1. YAML Conversion Utility

Given a raw table of targets and sequences (e.g., from `train.csv` or `submission_template.csv`), we **emit one YAML file per target**.

**Output schema** (minimal for inference):
```yaml
id: <sequence_id>
sequence: <ACGU...>
constraints: []
```
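The minimal schema writes the ID as a bare scalar, so an ID such as `1e3` or `no` may be read back as a number or boolean by some YAML parsers. The sketch below shows one possible way to guard against that, assuming IDs should always load as strings. The helpers `quote_yaml_scalar` and `to_yaml_block_quoted` are hypothetical names and not part of the pipeline. `json.dumps` is used because a JSON string is also a valid YAML double-quoted scalar.

```python
import json
import re

def quote_yaml_scalar(value: str) -> str:
    """Return value as a double-quoted scalar (hypothetical helper)."""
    return json.dumps(str(value))

def to_yaml_block_quoted(seq_id: str, seq: str) -> str:
    """Variant of the minimal schema with a quoted id (illustrative only)."""
    # Same clean-up rule as the notebook: upper-case, then drop non-ACGU characters
    clean_seq = re.sub(r"[^ACGU]", "", seq.upper())
    return f"id: {quote_yaml_scalar(seq_id)}\nsequence: {clean_seq}\nconstraints: []\n"

print(to_yaml_block_quoted("1e3", "acgu-u"))
```

Whether quoting is needed depends on how the downstream loader parses the file. If target IDs are always alphanumeric strings such as `T001`, the unquoted form is sufficient.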

In [None]:

from pathlib import Path
import pandas as pd
import re

def to_yaml_block(seq_id: str, seq: str) -> str:
    """
    Minimal YAML config for model inference (per target).
    """
    # Defensive clean-up: upper-case, then drop every character outside ACGU.
    # Dropped characters shorten the sequence; adjust the pattern if other tokens are expected.
    clean_seq = re.sub(r'[^ACGU]', '', seq.upper())
    return f"id: {seq_id}\nsequence: {clean_seq}\nconstraints: []\n"

def write_yaml_files(df: pd.DataFrame, id_col: str, seq_col: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    bad_rows = []
    for i, row in df.iterrows():
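        # Treat NaN/None as missing; str(NaN) would otherwise yield the literal "nan"
        sid = "" if pd.isna(row[id_col]) else str(row[id_col]).strip()
        seq = "" if pd.isna(row[seq_col]) else str(row[seq_col]).strip()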
        if not sid or not seq:
            bad_rows.append(i)
            continue
        yaml_text = to_yaml_block(sid, seq)
        (out / f"{sid}.yaml").write_text(yaml_text)
    if bad_rows:
        print(f"[WARN] Skipped {len(bad_rows)} rows with missing id/sequence: {bad_rows[:5]}{'...' if len(bad_rows)>5 else ''}")
    print(f"[OK] Wrote YAML files to: {out.resolve()}")

# --- demo input (replace with your actual CSV) ---
demo = pd.DataFrame({
    "target_id": ["T001", "T002", "T003"],
    "sequence":  ["ACGUACGU", "ACGU-UU", "NNACGU"]
})

write_yaml_files(demo, id_col="target_id", seq_col="sequence", out_dir="/mnt/data/inputs_prediction")
!ls -la /mnt/data/inputs_prediction | head -n 10


## 2. Readable Data Loader

Standardize the input CSV into a **validated, tidy** DataFrame to minimize surprises downstream.

- Required columns: `target_id`, `sequence`
- Optional columns: anything else (kept and passed through)
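- Validation: non-empty `target_id`/`sequence`; sequences are upper-cased and any character outside A/C/G/U is removed (auto-clean); duplicate `target_id` rows are dropped, keeping the first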
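Because the auto-clean removes characters silently, it can help to see what would be dropped before trusting the cleaned sequence. The following is a minimal sketch under that assumption. `illegal_token_report` and `clean_sequence` are hypothetical helpers, not functions from the loader, and the sample strings are made up for illustration.

```python
import re
from collections import Counter

LEGAL = set("ACGU")  # alphabet taken from the validation rule above

def illegal_token_report(seq: str) -> Counter:
    """Count characters that the auto-clean step would remove (hypothetical helper)."""
    return Counter(ch for ch in seq.upper() if ch not in LEGAL)

def clean_sequence(seq: str) -> str:
    """Same rule as the loader: upper-case, then drop non-ACGU characters."""
    return re.sub(r"[^ACGU]", "", seq.upper())

for raw in ["ACGU-ACGU", "acguNN", "ACGT"]:
    report = illegal_token_report(raw)
    print(raw, "->", clean_sequence(raw), dict(report))
```

Note that a DNA-style `T` is deleted rather than mapped to `U` under this rule, so `ACGT` becomes `ACG`. If the input may contain DNA-alphabet sequences, a substitution step before cleaning is probably what you want.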


In [None]:

import pandas as pd

REQUIRED = ["target_id", "sequence"]

def load_readable_csv(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
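    # normalize; fillna("") first so missing values do not become the string "nan",
    # which would upper-case to "NAN" and survive the clean-up as the sequence "A"
    df["target_id"] = df["target_id"].fillna("").astype(str).str.strip()
    df["sequence"] = df["sequence"].fillna("").astype(str).str.upper().str.replace(r"[^ACGU]", "", regex=True)
    # drop rows with an empty id/sequence, then keep the first row per target_id
    df = df[(df["target_id"]!="") & (df["sequence"]!="")]
    df = df.drop_duplicates(subset=["target_id"]).reset_index(drop=True)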
    return df

# --- demo: write a quick CSV and reload ---
demo_csv = "/mnt/data/demo_sequences.csv"
pd.DataFrame({
    "target_id": ["T001","T002","T003","T003"],
    "sequence":  ["ACGU-ACGU","acguNN","", "ACGU"]
}).to_csv(demo_csv, index=False)

clean_df = load_readable_csv(demo_csv)
display(clean_df)


## 3. Two-Model Inference Wrappers and Ensembling

We define a light abstraction for predictors:

```python
class Predictor:
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:  # returns columns: target_id, <model>_score
        ...
```

Replace the mock implementations with your actual **Boltz-1** and **Protenix** calls. Two ensemble strategies are then available:
- **Mean ensemble:** arithmetic mean of model scores
- **Rank-average ensemble:** average of normalized ranks (robust when score scales differ)
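To make the difference between the two strategies concrete, here is a small worked example in plain Python so the arithmetic is visible. It is a sketch, not the notebook's implementation: `average_ranks` and `normalized_ranks` are hypothetical helpers that mimic average-rank tie handling followed by min-max scaling, the scores are made-up values, and a constant column is mapped to zeros explicitly instead of using a small epsilon in the denominator.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def normalized_ranks(values):
    """Min-max scale ranks to [0, 1]; a constant column maps to all zeros."""
    ranks = average_ranks(values)
    lo, hi = min(ranks), max(ranks)
    span = hi - lo
    return [(r - lo) / span if span else 0.0 for r in ranks]

# Made-up scores on very different scales (illustrative values only)
model_a = [0.90, 0.10, 0.50]      # roughly 0..1
model_b = [120.0, 300.0, 200.0]   # roughly 100..300

mean_scores = [(a + b) / 2 for a, b in zip(model_a, model_b)]
rank_scores = [
    (a + b) / 2
    for a, b in zip(normalized_ranks(model_a), normalized_ranks(model_b))
]

print("mean:    ", mean_scores)
print("rank-avg:", rank_scores)
```

In this example the two models rank the targets in exactly opposite order. The arithmetic mean simply follows `model_b` because its values are two orders of magnitude larger, while the rank average gives each model equal weight and the three targets end up tied.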


In [None]:

import numpy as np
import pandas as pd

class Predictor:
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

class Boltz1Predictor(Predictor):
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # TODO: replace with real Boltz-1 inference reading from YAML/inputs_prediction
        # Mock: score = length * 0.7 + random noise
        rng = np.random.default_rng(42)
        scores = df["sequence"].str.len() * 0.7 + rng.normal(0, 0.3, size=len(df))
        return pd.DataFrame({"target_id": df["target_id"].values, "boltz1_score": scores})

class ProtenixPredictor(Predictor):
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # TODO: replace with real Protenix inference
        rng = np.random.default_rng(123)
        scores = df["sequence"].str.len() * 0.5 + rng.normal(0, 0.4, size=len(df))
        return pd.DataFrame({"target_id": df["target_id"].values, "protenix_score": scores})

def mean_ensemble(df_scores: pd.DataFrame) -> pd.Series:
    score_cols = [c for c in df_scores.columns if c.endswith("_score")]
    return df_scores[score_cols].mean(axis=1)

def rank_average_ensemble(df_scores: pd.DataFrame) -> pd.Series:
    score_cols = [c for c in df_scores.columns if c.endswith("_score")]
    ranks = df_scores[score_cols].rank(method="average")
    norm = (ranks - ranks.min()) / (ranks.max() - ranks.min() + 1e-9)
    return norm.mean(axis=1)

# --- demo using clean_df from previous cell ---
boltz = Boltz1Predictor().predict(clean_df)
prot  = ProtenixPredictor().predict(clean_df)

joined = clean_df[["target_id"]].merge(boltz, on="target_id").merge(prot, on="target_id")
joined["score_mean"] = mean_ensemble(joined[["boltz1_score","protenix_score"]])
joined["score_rankavg"] = rank_average_ensemble(joined[["boltz1_score","protenix_score"]])
display(joined)
