# 🧬 Research / Production-Grade LLM_Variant_Pipeline

**Copyright (c) 2026 Ka-Kyung Kim**  
License: MIT  

GitHub: [kakyungkim](https://github.com/kakyungkim)



### Purpose

* Variant analysis based on a public TCGA somatic VCF sample
* Structure the variant data and perform LLM-based clinical interpretation

### Overall flow

* Look up the TCGA sample files on GitHub
* Download and organize the somatic VCFs
* Convert VCF → DataFrame
* Link cancer-type information to each sample
* Select key variants
* Generate a structured, LLM-based interpretation

---


In [None]:
# 🧬 LLM_Variant_Pipeline
# Copyright (c) 2026 Ka-Kyung Kim
# License: MIT
# GitHub: https://github.com/kakyungkim
#
# This notebook demonstrates LLM-assisted somatic variant interpretation
# for research/demo purposes only. Not intended for clinical use.

## Environment & Setup

### Execution environment
- Google Colab
- Public TCGA data only

### Package installation

In [1]:
# 1. Package installation
# Remove any existing install first (optional, commented out below), then install the required packages; no versions are pinned here
# !pip uninstall -y langchain langchain-core langchain-openai
!pip install -q \
  langchain-openai \
  pydantic \
  pandas

In [2]:
# 2. Library imports & environment check

# --- Standard library ---
import os
import sys
import glob
import gzip
import re
import tarfile
from io import BytesIO
from pathlib import Path
from getpass import getpass

# --- Third-party ---
import requests
import pandas as pd

import pydantic
from pydantic import BaseModel, Field
from typing import List, Optional

import langchain
import langchain_core
import langchain_openai
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Environment version logging (reproducibility)

from importlib.metadata import version, PackageNotFoundError

def get_pkg_version(pkg_name):
    try:
        return version(pkg_name)
    except PackageNotFoundError:
        return "not installed"

print("📦 Environment versions")
print("-----------------------")
print(f"Python           : {sys.version.split()[0]}")
print(f"langchain        : {get_pkg_version('langchain')}")
print(f"langchain-core   : {get_pkg_version('langchain-core')}")
print(f"langchain-openai : {get_pkg_version('langchain-openai')}")
print(f"pydantic         : {get_pkg_version('pydantic')}")
print(f"pandas           : {get_pkg_version('pandas')}")

📦 Environment versions
-----------------------
Python           : 3.12.12
langchain        : 0.2.0
langchain-core   : 0.2.11
langchain-openai : 0.1.14
pydantic         : 2.9.2
pandas           : 2.2.2


In [3]:
# 3. API key setup (hidden input in Colab)
# ⚠️ In an actual notebook, entering the key interactively or using a secret manager is recommended
API_KEY = getpass("Enter your OpenAI API Key: ")  # <--- user input
os.environ["OPENAI_API_KEY"] = API_KEY
print("✅ API Key set!")

# 4. Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 5. Set project folder paths
PROJECT_DIR = Path("/content/drive/MyDrive/ToyProjects/LLM_Variant_Pipeline")
BASE_DIR = PROJECT_DIR / "data"
OUTPUT_DIR = PROJECT_DIR / "reports"

BASE_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"✅ Project directories ready!\nBASE_DIR: {BASE_DIR}\nOUTPUT_DIR: {OUTPUT_DIR}")


Enter your OpenAI API Key: ··········
✅ API Key set!
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Project directories ready!
BASE_DIR: /content/drive/MyDrive/ToyProjects/LLM_Variant_Pipeline/data
OUTPUT_DIR: /content/drive/MyDrive/ToyProjects/LLM_Variant_Pipeline/reports


---

# Somatic Variant Interpretation Pipeline

## Single-Sample TCGA Variant Interpretation (LLM-Assisted)

> This notebook demonstrates a pipeline for interpreting somatic variants from a single tumor sample.
> Only a **single public TCGA sample** is used for testing and demonstration.
> Tier assignments are **candidate only** and require additional validation.



---

## 1. Introduction

### Purpose

* Use a public TCGA somatic VCF sample as input
* Structure the variants within a single tumor sample
* Use an LLM to generate a **conditional clinical interpretation (candidate tier)**
* Design the pipeline so that it can later be extended to cohort-based analysis

### Limitations of single-sample analysis

* No cohort frequency information
* Statistical driver identification is not possible
* Clinical tiers cannot be finalized

→ Every tier in this notebook is **candidate-level, not definitive**

---

## 2. Data acquisition (public TCGA sample)

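* Use the public `sample_files` of the TCGA Pan-Cancer project (the `ICGC-TCGA-PanCancer/vcf-uploader` repository)
* Download files through the GitHub API: every entry whose name contains `somatic` is fetched (VCFs, index files, checksums, and the SNV/MNV tar.gz)
* List the `sample_files` directory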
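
Because the download also fetches `.md5` sidecar files, each VCF could be checked against its checksum before parsing. The following is a minimal sketch, assuming the first whitespace-separated token of a `.md5` file is the hex digest (the layout of these sidecar files is not confirmed here). `file_md5` and `md5_matches` are hypothetical helpers, not part of the notebook.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(data_path):
    """Compare a file with its '<name>.md5' sidecar (hypothetical helper).

    Assumption: the sidecar's first whitespace-separated token is the digest.
    Returns None when no sidecar exists or the sidecar is empty.
    """
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".md5")
    if not sidecar.exists():
        return None
    tokens = sidecar.read_text().split()
    if not tokens:
        return None
    return tokens[0].lower() == file_md5(data_path)
```

A call such as `md5_matches(BASE_DIR / fname)` after each download would flag a truncated or corrupted file before it reaches the parser.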


In [4]:
import requests
from pathlib import Path

OWNER = "ICGC-TCGA-PanCancer"
REPO = "vcf-uploader"
PATH = "sample_files"
BRANCH = "develop"

# List the sample_files directory via the GitHub API
api_url = f"https://api.github.com/repos/{OWNER}/{REPO}/contents/{PATH}?ref={BRANCH}"
resp = requests.get(api_url)
resp.raise_for_status()
files = resp.json()

# Keep the entries whose filename contains 'somatic'
somatic_urls = [
    f["download_url"]
    for f in files
    if "somatic" in f["name"].lower()
]

print(f"Number of somatic-related files to download: {len(somatic_urls)}")
for u in somatic_urls:
    print(" -", u.split("/")[-1])

# Download
for url in somatic_urls:
    fname = url.split("/")[-1]
    out_path = Path(BASE_DIR) / fname
    print(f"Downloading {fname}...")
    r = requests.get(url)
    r.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(r.content)

print("Download complete at:", BASE_DIR)


Number of somatic-related files to download: 10
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.idx
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.idx.md5
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.md5
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.tar.gz
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.tar.gz.md5
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.idx
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.idx.md5
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.md5
Downloading a4

In [5]:
sorted(p.name for p in BASE_DIR.iterdir())

['a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.idx',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.idx.md5',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz.md5',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.tar.gz',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.tar.gz.md5',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.idx',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.idx.md5',
 'a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz.md5']

## 3. VCF parsing (SNV/MNV + INDEL)

### Criteria for files to analyze

* Used:

  * `.vcf.gz`

* Excluded:

  * `.idx`, `.md5`, `.tar.gz`

* SNV/MNV and INDEL files are separated by filename
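
Labelling by filename trusts each file to contain only one class of variant. A per-record cross-check on REF/ALT lengths is one way to confirm that. This is a minimal sketch: `classify_allele` and `classify_record` are hypothetical helpers, and the label set (`snv`, `mnv`, `indel`, `other`) is an assumption rather than the notebook's scheme.

```python
def classify_allele(ref, alt):
    """Classify one REF/ALT pair by allele length (hypothetical helper).

    - both length 1        -> "snv"
    - equal length, > 1    -> "mnv"
    - unequal length       -> "indel"
    Missing, spanning-deletion, symbolic, or breakend ALTs
    (".", "*", "<DEL>", "N[1:100[") -> "other".
    """
    if not ref or not alt or alt in {"*", "."}:
        return "other"
    if any(c in alt for c in "<>[]"):
        return "other"
    if len(ref) == len(alt):
        return "snv" if len(ref) == 1 else "mnv"
    return "indel"


def classify_record(ref, alt_field):
    """Classify every allele of a comma-separated (multi-allelic) ALT column."""
    return [classify_allele(ref, alt) for alt in alt_field.split(",")]
```

Comparing these labels with the filename-derived `variant_type` would show whether a file labelled as INDEL actually holds single-base substitutions.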

In [6]:
vcf_files = list(BASE_DIR.glob("*.vcf.gz"))

snv_mnv_files = [f for f in vcf_files if "snv" in f.name.lower()]
indel_files   = [f for f in vcf_files if "indel" in f.name.lower()]

print("SNV/MNV files:")
for f in snv_mnv_files:
    print(" -", f.name)

print("\nINDEL files:")
for f in indel_files:
    print(" -", f.name)


SNV/MNV files:
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.snv_mnv.vcf.gz

INDEL files:
 - a4beedc3-0e96-4e1c-90b4-3674dfc01786.TestWorkflow_1-0-0.20141009.somatic.indel.vcf.gz


## 4. Variant structuring

### Design principles

* Full VCF parsing is not the goal
* Keep only the minimum information needed for LLM interpretation
* The INFO field is preserved as a string (it is used in the annotation step)
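
Since INFO is kept as a raw string, later steps need some way to read individual keys from it. A small sketch of one option follows; `parse_info` is a hypothetical helper, and it deliberately does no type conversion or comma splitting of values.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict.

    'KEY=VALUE' entries map to the value as a string; bare flags map to True.
    An empty string or '.' (missing INFO) gives an empty dict.
    """
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate stray or trailing semicolons
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields
```

For example, `parse_info(info).get("SNPEFF_GENE_NAME")` reads the same key that the annotation step looks for, without a regular expression.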


In [7]:
import gzip
import pandas as pd

def parse_vcf_to_df(vcf_path, variant_type):
    rows = []

    with gzip.open(vcf_path, "rt") as f:
        for line in f:
            if line.startswith("#"):
                continue

            chrom, pos, _, ref, alt, _, _, info = line.strip().split("\t")[:8]

            rows.append({
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt,
                "variant_type": variant_type,
                "info": info
            })

    return pd.DataFrame(rows)


In [8]:
dfs = []

for f in snv_mnv_files:
    dfs.append(parse_vcf_to_df(f, "snv_mnv"))

for f in indel_files:
    dfs.append(parse_vcf_to_df(f, "indel"))

df_variants = pd.concat(dfs, ignore_index=True)
df_variants.head()


Unnamed: 0,chrom,pos,ref,alt,variant_type,info
0,1,14610,T,C,snv_mnv,FS=22.026;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...
1,1,14653,C,T,snv_mnv,FS=173.386;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMP...
2,1,14907,A,G,snv_mnv,FS=55.93;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...
3,1,14930,A,G,snv_mnv,FS=31.945;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...
4,1,15118,A,G,snv_mnv,FS=3.583;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...


## 5. Variant annotation (lightweight)

### Purpose

* Do not attach external databases directly
* Extract a minimal set of meaningful information from the INFO field
* Make gene-centric interpretation possible

### Example: extracting GENE / SYMBOL from INFO
* Extract the gene name from the INFO field; its key depends on the annotation tool
* The current sample is annotated with SnpEff
* If the annotation tool changes, only this parser needs to be replaced
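
One way to keep the parser swappable is a small registry of extractor functions keyed by annotation style. This is a sketch under assumptions: the `ANN=` extractor assumes the gene name is the fourth pipe-separated sub-field of the first annotation, and the registry keys (`snpeff_legacy`, `snpeff_ann`) are hypothetical labels that do not appear in the notebook.

```python
import re

_SNPEFF_LEGACY = re.compile(r"(?:^|;)SNPEFF_GENE_NAME=([^;]+)")
_ANN = re.compile(r"(?:^|;)ANN=([^;]+)")


def gene_from_snpeff_legacy(info):
    """Gene name from the SNPEFF_GENE_NAME key used by the current sample."""
    match = _SNPEFF_LEGACY.search(info)
    return match.group(1) if match else None


def gene_from_ann(info):
    """Gene name from an 'ANN=' field.

    Assumption: within the first comma-separated annotation, the fourth
    pipe-separated sub-field is the gene name.
    """
    match = _ANN.search(info)
    if not match:
        return None
    parts = match.group(1).split(",")[0].split("|")
    return parts[3] if len(parts) > 3 and parts[3] else None


# Hypothetical registry: annotation style -> extractor function.
GENE_EXTRACTORS = {
    "snpeff_legacy": gene_from_snpeff_legacy,
    "snpeff_ann": gene_from_ann,
}


def extract_gene_with(style, info):
    """Dispatch to the extractor registered for `style`."""
    return GENE_EXTRACTORS[style](info)
```

Supporting another tool then means adding one function and one registry entry, while the DataFrame code that applies the extractor stays unchanged.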

In [9]:
def extract_gene(info):
    match = re.search(r"SNPEFF_GENE_NAME=([^;]+)", info)
    return match.group(1) if match else None

df_variants["gene"] = df_variants["info"].apply(extract_gene)
df_variants.head()


Unnamed: 0,chrom,pos,ref,alt,variant_type,info,gene
0,1,14610,T,C,snv_mnv,FS=22.026;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
1,1,14653,C,T,snv_mnv,FS=173.386;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMP...,DDX11L1
2,1,14907,A,G,snv_mnv,FS=55.93;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...,DDX11L1
3,1,14930,A,G,snv_mnv,FS=31.945;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
4,1,15118,A,G,snv_mnv,FS=3.583;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...,DDX11L1


> For a real extension, the structure is kept so that this step can be swapped for tools and databases such as:
>
> * VEP / ANNOVAR
> * COSMIC / ClinVar
> * OncoKB

---

## 6. Candidate selection

### Purpose

* Do not pass every variant to the LLM
* Select only a small number of variants suitable for the interpretation example

### Simple criteria (demo only)

* Gene information is present, or
* the variant representation is unambiguous
* Only the top N variants are used
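
The criteria above do not say what "top N" is ranked by. One option is to rank on the `SNPEFF_IMPACT` key that already appears in INFO. The sketch below assumes the impact values are `HIGH`, `MODERATE`, `LOW`, and `MODIFIER`; the truncated INFO strings in this notebook do not confirm which values occur in the sample. `select_candidates` is a hypothetical helper that works on a list of dicts mirroring the DataFrame columns.

```python
import re

# Assumed impact categories, most to least severe.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}
_IMPACT = re.compile(r"(?:^|;)SNPEFF_IMPACT=([^;]+)")


def impact_rank(info):
    """Rank of the SNPEFF_IMPACT value; unknown or missing values sort last."""
    worst = len(IMPACT_RANK)
    match = _IMPACT.search(info)
    return IMPACT_RANK.get(match.group(1), worst) if match else worst


def select_candidates(records, max_variants=10):
    """Pick up to `max_variants` records (hypothetical helper).

    `records` is a list of dicts with the keys chrom, pos, ref, alt,
    info, gene. Records without a gene are dropped, repeats of the same
    chrom/pos/ref/alt are collapsed, and the rest are ordered by impact.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        if rec.get("gene") is None or key in seen:
            continue
        seen.add(key)
        unique.append(rec)
    # sorted() is stable, so ties keep their original order.
    return sorted(unique, key=lambda rec: impact_rank(rec["info"]))[:max_variants]
```

With the DataFrame from this notebook, the input could be built with `df_variants.to_dict("records")`.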


In [10]:
MAX_VARIANTS = 10

df_candidates = (
    df_variants
    .sort_values(by=["gene"], na_position="last")
    .head(MAX_VARIANTS)
    .copy()
)

df_candidates

Unnamed: 0,chrom,pos,ref,alt,variant_type,info,gene
0,1,14610,T,C,snv_mnv,FS=22.026;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
17,1,16378,T,C,indel,FS=58.995;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
16,1,15274,A,T,indel,FS=0.0;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=...,DDX11L1
15,1,15211,T,G,indel,FS=0.0;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPACT=...,DDX11L1
14,1,15118,A,G,indel,FS=3.583;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...,DDX11L1
13,1,14930,A,G,indel,FS=31.945;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
12,1,14907,A,G,indel,FS=55.93;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...,DDX11L1
11,1,14653,C,T,indel,FS=173.386;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMP...,DDX11L1
10,1,14610,T,C,indel,FS=22.026;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPA...,DDX11L1
9,1,16571,G,A,snv_mnv,FS=4.075;SNPEFF_GENE_NAME=DDX11L1;SNPEFF_IMPAC...,DDX11L1


### Converting variants to text for LLM input

In [11]:
def variants_to_text(df):
    lines = []
    for _, r in df.iterrows():
        # Use bracket access: r.info resolves to the pandas Series.info
        # method rather than the "info" column, which would put the
        # method's repr into the prompt text.
        lines.append(
            f"{r['chrom']}:{r['pos']} {r['ref']}>{r['alt']} "
            f"[{r['variant_type']}] "
            f"GENE={r['gene']} INFO={r['info']}"
        )
    return "\n".join(lines)

variant_text = variants_to_text(df_candidates)
print(variant_text)

## 7. LLM-based interpretation (dual-mode tiering)
**Mode A: LLM-driven candidate tiering**
*   The LLM decides the candidate tier for each variant directly
*   Tiers are stated conditionally, together with the evidence
*   Suited to exploratory analysis and signal discovery

**Mode B: Rule-based pre-tiering + LLM explanation**
*   The tier is fixed in advance by rule-based logic
*   The LLM does not change the tier and only generates the explanation
*   Suited to reproducibility, review safety, and sharing

### Structuring the tier criteria

Tiers are used only as **conditional judgments, not definitive calls**:

* `candidate_Tier_I`

  * well-known cancer gene
  * hotspot or strong oncogenic evidence
* `candidate_Tier_II`

  * possible oncogenicity
  * association with a cancer-related pathway
* `uncertain`

  * insufficient evidence or limited information
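
The three labels above are plain strings, so a misspelled or invented tier in a model response would pass unnoticed. A minimal sketch of a closed label set follows; `CandidateTier` and `coerce_tier` are hypothetical names, and falling back to `uncertain` for unknown labels is a design assumption (raising an error would be the stricter alternative).

```python
from enum import Enum


class CandidateTier(str, Enum):
    """The three candidate labels used in this notebook."""
    TIER_I = "candidate_Tier_I"
    TIER_II = "candidate_Tier_II"
    UNCERTAIN = "uncertain"


def coerce_tier(label):
    """Map a free-text label to a CandidateTier (hypothetical helper).

    Matching ignores case and surrounding whitespace. Anything that is
    not one of the three known labels falls back to UNCERTAIN.
    """
    if isinstance(label, str):
        cleaned = label.strip().lower()
        for tier in CandidateTier:
            if tier.value.lower() == cleaned:
                return tier
    return CandidateTier.UNCERTAIN
```

Falling back to `uncertain` is the conservative choice here, since it never promotes a variant on the basis of an unrecognized label.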

---

### (1) Output schema definition (Pydantic)


In [12]:
from pydantic import BaseModel
from typing import List, Optional

class VariantInterpretation(BaseModel):
    gene: Optional[str]
    variant: str
    variant_type: str

    oncogenic_evidence: str
    potential_tier: str
    tier_basis: str

    requires_additional_validation: bool
    notes: Optional[str]


class SampleInterpretation(BaseModel):
    cancer_context: Optional[str]

    variants: List[VariantInterpretation]

    overall_summary: str
    limitations: str



---
### (2) LLM + parser setup

In [13]:
# from langchain_openai import ChatOpenAI
# from langchain_core.output_parsers import JsonOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
parser = JsonOutputParser(pydantic_object=SampleInterpretation)

---

### (3) Prompt template definition

In [14]:
# Setting that selects the mode
TIER_MODE = "llm"      # or "rule"

In [15]:
from langchain_core.prompts import ChatPromptTemplate

def build_prompt_template(tier_mode: str) -> ChatPromptTemplate:
    if tier_mode == "llm":
        tier_instruction = """
You may assign a candidate tier based on available evidence.
Use conditional language only.
"""
    elif tier_mode == "rule":
        tier_instruction = """
Candidate tier has been pre-assigned using rule-based logic.
Do NOT change the tier. Only provide explanation.
"""
    else:
        raise ValueError("tier_mode must be 'llm' or 'rule'")

    return ChatPromptTemplate.from_template(f"""
You are interpreting somatic variants from a single tumor sample.

Constraints:
- Single-sample analysis only
- No cohort-level frequency data
- No definitive clinical decisions

{tier_instruction}

Variants:
{{variant_text}}

Cancer context:
{{cancer_context}}

Tier definitions:
- candidate_Tier_I: strong known oncogenic relevance
- candidate_Tier_II: moderate or emerging evidence
- uncertain: insufficient evidence

Output only valid JSON matching this schema:
{{format_instructions}}
""")

---

### (4) Template → messages → execution

In [16]:
prompt_tmpl = build_prompt_template(TIER_MODE)

messages = prompt_tmpl.format_messages(
    variant_text=variant_text,
    cancer_context="pancreatic adenocarcinoma",
    format_instructions=parser.get_format_instructions()
)

# 1) Pipeline metadata
# NOTE:
# TCGA sample identifier is inferred from VCF filename (the UUID before the first dot).
# Path.stem would only strip the trailing ".gz" and keep the rest of the filename.
# In multi-sample or production settings, explicit sample metadata should be used.
SAMPLE_ID = snv_mnv_files[0].name.split(".")[0]

# 2) Run the LLM
result = llm.invoke(messages)
parsed = parser.parse(result.content)

# 3) Combine the LLM result with sample_id
final_result = {
    "sample_id": SAMPLE_ID,
    "cancer_context": "pancreatic adenocarcinoma",
    "variants": parsed["variants"],
    "overall_summary": parsed["overall_summary"],
    "limitations": parsed["limitations"]
}
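
The lookups above index `parsed` directly, so a response that omits a field would surface as a `KeyError`. A structural check before building `final_result` is one way to fail with a clearer message. This is a minimal sketch: the key sets are copied from the `SampleInterpretation` and `VariantInterpretation` models, `find_missing_keys` is a hypothetical helper, and only key presence is checked, not value types.

```python
# Field names copied from SampleInterpretation / VariantInterpretation.
SAMPLE_KEYS = {"cancer_context", "variants", "overall_summary", "limitations"}
VARIANT_KEYS = {
    "gene", "variant", "variant_type", "oncogenic_evidence",
    "potential_tier", "tier_basis", "requires_additional_validation", "notes",
}


def find_missing_keys(parsed):
    """Return a list of problems found in a parsed LLM result.

    An empty list means every expected key is present.
    """
    if not isinstance(parsed, dict):
        return ["top level is not a JSON object"]
    problems = [
        f"missing top-level key: {key}"
        for key in sorted(SAMPLE_KEYS - set(parsed))
    ]
    variants = parsed.get("variants")
    if not isinstance(variants, list):
        return problems
    for i, variant in enumerate(variants):
        if not isinstance(variant, dict):
            problems.append(f"variants[{i}] is not an object")
            continue
        for key in sorted(VARIANT_KEYS - set(variant)):
            problems.append(f"variants[{i}] missing key: {key}")
    return problems
```

A non-empty result from `find_missing_keys(parsed)` could be used to retry the call or to stop before any report is written.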

## 7.1 Dual-mode comparison (optional analysis)

> Purpose
>
> * For the **same set of variants**:
>
>   * Mode A: the LLM also decides the tier
>   * Mode B: the rule-based tier is fixed and the LLM only explains it
> * Compare the results **side by side**
> * Use the data to show "why it is acceptable for the LLM to judge the tier"


```
VCF (SNV / INDEL)
        │
        ▼
Lightweight Parsing
        │
        ▼
Candidate Variant Selection
        │
        ├──────────────┐
        ▼              ▼
 Mode A (LLM)      Mode B (Rule)
 Tier + Evidence   Tier fixed
        │              │
        └───────┬──────┘
                ▼
        Dual-mode Comparison
                │
                ▼
        Structured JSON / CSV
```


---

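## 7.1-A. Common setup (reusing what was built in section 7)

In [17]:
# Shared metadata
# The sample ID is the UUID before the first dot of the filename;
# Path.stem would only strip the trailing ".gz".
SAMPLE_ID = snv_mnv_files[0].name.split(".")[0]
CANCER_CONTEXT = "pancreatic adenocarcinoma"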

---

## 7.1-B. Mode A: LLM-driven candidate tiering

### (1) Prompt creation

In [18]:
prompt_llm = build_prompt_template("llm")

messages_llm = prompt_llm.format_messages(
    variant_text=variant_text,
    cancer_context=CANCER_CONTEXT,
    format_instructions=parser.get_format_instructions()
)

### (2) Execution

In [19]:
result_llm = llm.invoke(messages_llm)
parsed_llm = parser.parse(result_llm.content)

---

## 7.1-C. Mode B: Rule-based pre-tiering + LLM explanation

### (1) Rule-based tier definition (demo only)

In [20]:
# def assign_candidate_tier(row):
#     """
#     Demo-only rule-based pre-tiering.
#     Not intended for clinical or guideline-compliant use.
#     """
#     if row["gene"] in {"KRAS", "EGFR", "BRAF"}:
#         return "candidate_Tier_I"
#     elif row["gene"] is not None:
#         return "candidate_Tier_II"
#     else:
#         return "uncertain"

# df_candidates["rule_based_tier"] = df_candidates.apply(
#     assign_candidate_tier, axis=1
# )

In [21]:
# Minimal cancer-relevant gene set (demo)
KNOWN_ONCOGENES = {
    "KRAS", "NRAS", "HRAS",
    "EGFR", "BRAF", "PIK3CA",
    "TP53", "CDKN2A", "SMAD4"
}

def assign_candidate_tier(row):
    """
    Rule-based candidate tiering (INTENTIONALLY MINIMAL).

    Conceptual mapping to AMP/ASCO/CAP guidelines:
    - candidate_Tier_I   ~ AMP Tier I/II (strong clinical or biological evidence)
    - candidate_Tier_II  ~ AMP Tier III (biological relevance, limited evidence)
    - uncertain          ~ AMP Tier IV (unknown significance)

    This rule is NOT guideline-compliant and is used only
    to provide a stable comparison baseline for LLM behavior.
    """

    gene = row["gene"]

    # Well-established oncogenes / tumor suppressors
    if gene in KNOWN_ONCOGENES:
        return "candidate_Tier_I"

    # Missing or undefined gene
    if gene is None:
        return "uncertain"

    # Pseudogenes / uncharacterized loci (very coarse heuristic)
    if gene.endswith("L") or gene.startswith("LOC"):
        return "uncertain"

    # Fallback: gene-level relevance but no strong evidence
    return "candidate_Tier_II"


df_candidates = df_candidates.copy()
df_candidates["rule_based_tier"] = df_candidates.apply(
    assign_candidate_tier, axis=1
)

# (Optional) sanity check
df_candidates[["gene", "rule_based_tier"]].head()

Unnamed: 0,gene,rule_based_tier
0,DDX11L1,candidate_Tier_II
17,DDX11L1,candidate_Tier_II
16,DDX11L1,candidate_Tier_II
15,DDX11L1,candidate_Tier_II
14,DDX11L1,candidate_Tier_II


### (2) Build the variant text including the rule tier

In [22]:
def variants_to_text_with_rule(df):
    lines = []
    for _, r in df.iterrows():
        lines.append(
            f"{r['chrom']}:{r['pos']} {r['ref']}>{r['alt']} "
            f"[{r['variant_type']}] "
            f"GENE={r['gene']} PRE_TIER={r['rule_based_tier']}"
        )
    return "\n".join(lines)

variant_text_rule = variants_to_text_with_rule(df_candidates)
print(variant_text_rule)


1:14610 T>C [snv_mnv] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:16378 T>C [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:15274 A>T [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:15211 T>G [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:15118 A>G [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:14930 A>G [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:14907 A>G [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:14653 C>T [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:14610 T>C [indel] GENE=DDX11L1 PRE_TIER=candidate_Tier_II
1:16571 G>A [snv_mnv] GENE=DDX11L1 PRE_TIER=candidate_Tier_II


### (3) Prompt creation

In [23]:
prompt_rule = build_prompt_template("rule")

messages_rule = prompt_rule.format_messages(
    variant_text=variant_text_rule,
    cancer_context=CANCER_CONTEXT,
    format_instructions=parser.get_format_instructions()
)

### (4) Execution

In [24]:
result_rule = llm.invoke(messages_rule)
parsed_rule = parser.parse(result_rule.content)
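
In rule mode the prompt tells the model not to change the tier, yet the tier is later read back from the model's response rather than from `df_candidates`. A check that the returned tiers still equal the pre-assigned ones makes that assumption explicit. This is a sketch: `find_tier_overrides` is a hypothetical helper, and it assumes the model echoes each variant in the same `chrom:pos ref>alt` form it was given.

```python
def find_tier_overrides(pre_tiers, llm_variants):
    """List variants whose returned tier differs from the pre-assigned one.

    pre_tiers:    dict mapping "variant|variant_type" -> pre-assigned tier
    llm_variants: list of variant dicts from the rule-mode response
    Variants that cannot be matched to a pre-assigned tier are reported
    with expected=None.
    """
    overrides = []
    for v in llm_variants:
        key = f"{v.get('variant')}|{v.get('variant_type')}"
        expected = pre_tiers.get(key)
        returned = v.get("potential_tier")
        if expected != returned:
            overrides.append(
                {"key": key, "expected": expected, "returned": returned}
            )
    return overrides
```

The `pre_tiers` mapping could be built from `df_candidates` using the same string format as `variants_to_text_with_rule`, and the check would then be `find_tier_overrides(pre_tiers, parsed_rule["variants"])`. An empty list means the model left every tier untouched.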

---

## 7.1-D. Merging the results (including sample_id)

In [25]:
dual_mode_result = {
    "sample_id": SAMPLE_ID,
    "cancer_context": CANCER_CONTEXT,
    "llm_mode": parsed_llm,
    "rule_mode": parsed_rule
}

---

## 7.1-E. Building the dual-mode comparison table

> Key point: **how the tier judgment differs for each variant**

In [26]:
# --- 1) Function that defines the variant key ---
def make_variant_key(v):
    return f"{v['variant']}|{v['variant_type']}"

# --- 2) Convert the LLM / rule results to dicts ---
llm_map = {
    make_variant_key(v): v
    for v in parsed_llm["variants"]
}

rule_map = {
    make_variant_key(v): v
    for v in parsed_rule["variants"]
}

# --- 3) Compare on the keys shared by both modes ---
rows = []

for key in llm_map.keys() & rule_map.keys():
    v_llm = llm_map[key]
    v_rule = rule_map[key]

    rows.append({
        "gene": v_llm.get("gene"),
        "variant": v_llm["variant"],
        "variant_type": v_llm["variant_type"],
        "LLM_candidate_tier": v_llm["potential_tier"],
        "Rule_candidate_tier": v_rule["potential_tier"],
        "LLM_tier_basis": v_llm["tier_basis"],
        "Rule_tier_basis": v_rule["tier_basis"],
        "LLM_requires_validation": v_llm["requires_additional_validation"]
    })

comparison_df = pd.DataFrame(rows)
comparison_df["tier_match"] = (
    comparison_df["LLM_candidate_tier"]
    == comparison_df["Rule_candidate_tier"]
)

In [27]:
concordant_df = comparison_df.loc[comparison_df["tier_match"]]
concordant_df

Unnamed: 0,gene,variant,variant_type,LLM_candidate_tier,Rule_candidate_tier,LLM_tier_basis,Rule_tier_basis,LLM_requires_validation,tier_match


In [28]:
discordant_df = comparison_df.loc[~comparison_df["tier_match"]]
discordant_df

Unnamed: 0,gene,variant,variant_type,LLM_candidate_tier,Rule_candidate_tier,LLM_tier_basis,Rule_tier_basis,LLM_requires_validation,tier_match
0,DDX11L1,1:14907 A>G,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
1,DDX11L1,1:14610 T>C,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
2,DDX11L1,1:15118 A>G,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
3,DDX11L1,1:14610 T>C,snv_mnv,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
4,DDX11L1,1:15274 A>T,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
5,DDX11L1,1:15211 T>G,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
6,DDX11L1,1:16571 G>A,snv_mnv,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
7,DDX11L1,1:14653 C>T,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
8,DDX11L1,1:16378 T>C,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
9,DDX11L1,1:14930 A>G,indel,uncertain,candidate_Tier_II,Insufficient evidence for oncogenic relevance.,Pre-assigned based on rule-based logic.,True,False
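
The concordant and discordant tables can be condensed into an agreement rate and a count of each (LLM tier, rule tier) pair. This is a minimal sketch using only the standard library; `summarize_concordance` is a hypothetical helper that takes the comparison rows as a list of dicts.

```python
from collections import Counter


def summarize_concordance(rows):
    """Summarise tier agreement between the two modes.

    `rows` is a list of dicts with the comparison-table columns
    'LLM_candidate_tier' and 'Rule_candidate_tier'.
    Returns (agreement_rate, counts), where counts maps
    (llm_tier, rule_tier) pairs to how often they occur.
    agreement_rate is None for an empty input.
    """
    counts = Counter(
        (row["LLM_candidate_tier"], row["Rule_candidate_tier"]) for row in rows
    )
    total = sum(counts.values())
    if total == 0:
        return None, counts
    agree = sum(n for (llm, rule), n in counts.items() if llm == rule)
    return agree / total, counts
```

With this notebook's table the input would be `comparison_df.to_dict("records")`. For the ten discordant rows shown above, every pair is (`uncertain`, `candidate_Tier_II`), so the agreement rate is 0.0.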


### Saving to CSV

In [29]:
comparison_df.to_csv(
    OUTPUT_DIR / f"{SAMPLE_ID}_dual_mode_comparison.csv",
    index=False
)

## 8. Output & limitations

### Output characteristics

* Per variant:

  * explanation of the oncogenic evidence
  * explicit tier basis
  * whether additional validation is required
* Includes a sample-level summary
* Limitations are stated explicitly

This structure:

* prevents overstated clinical claims
* is safe for review and sharing

---

## 9. Future extensions

* Multiple samples → cohort analysis
* VAF-based filtering
* COSMIC / ClinVar / OncoKB integration
* Formal mapping to the AMP/ASCO/CAP guidelines
* LLM output → automatic report generation

---