Real-world pipeline example (precision-oriented)

### NER-augmented RAG

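Loading the model through accelerate (`device_map="auto"`): do not also pass `device=0` to the pipeline (it raises an error). This loading path also makes larger models usable.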

In [1]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model
model_name = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# NER pipeline
ner_pipe = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = "Lapatinib and trastuzumab are used in breast cancer treatment."
result = ner_pipe(text)
print(result)


Device set to use cuda:0


[{'entity_group': 'Medication', 'score': 0.91673374, 'word': 'lapatinib', 'start': 0, 'end': 9}, {'entity_group': 'Medication', 'score': 0.99988055, 'word': 'tr', 'start': 14, 'end': 16}, {'entity_group': 'Medication', 'score': 0.9977405, 'word': '##ast', 'start': 16, 'end': 19}, {'entity_group': 'Medication', 'score': 0.89958596, 'word': '##uzumab', 'start': 19, 'end': 25}, {'entity_group': 'Biological_structure', 'score': 0.9155983, 'word': 'breast', 'start': 38, 'end': 44}]


WordPiece post-processing example (for BioNER + RAG integration)

In [8]:
def merge_wordpieces(ner_results):
    drugs = []
    current_word = ""
    for token in ner_results:
        if token['entity_group'] == 'Medication':
            if token['word'].startswith("##"):
                current_word += token['word'][2:]
            else:
                if current_word:
                    drugs.append(current_word)
                current_word = token['word']
    if current_word:
        drugs.append(current_word)
    return drugs

merged_drugs = merge_wordpieces(result)
print(merged_drugs)  # ['lapatinib', 'trastuzumab']


['lapatinib', 'trastuzumab']
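
The merge above relies on the `##` continuation prefix and returns the tokenizer's lowercased form. As an alternative sketch, fragments can be merged whenever their character offsets touch, with the surface form sliced from the original text. `merge_by_offsets` is a hypothetical helper, not part of transformers; the sample list is the pipeline output shown above with the scores omitted.

```python
def merge_by_offsets(text, ner_results, label="Medication"):
    """Merge entity fragments of one label whose character spans are contiguous."""
    spans = []  # list of [start, end] spans for the requested label
    for ent in ner_results:
        if ent["entity_group"] != label:
            continue
        if spans and ent["start"] == spans[-1][1]:
            # Fragment starts exactly where the previous one ended: same word
            spans[-1][1] = ent["end"]
        else:
            spans.append([ent["start"], ent["end"]])
    # Slice the original text so casing and spelling are preserved
    return [text[start:end] for start, end in spans]


text = "Lapatinib and trastuzumab are used in breast cancer treatment."
sample = [
    {"entity_group": "Medication", "word": "lapatinib", "start": 0, "end": 9},
    {"entity_group": "Medication", "word": "tr", "start": 14, "end": 16},
    {"entity_group": "Medication", "word": "##ast", "start": 16, "end": 19},
    {"entity_group": "Medication", "word": "##uzumab", "start": 19, "end": 25},
    {"entity_group": "Biological_structure", "word": "breast", "start": 38, "end": 44},
]
print(merge_by_offsets(text, sample))  # ['Lapatinib', 'trastuzumab']
```

Because the result is taken from the source text, it does not depend on how a particular tokenizer marks continuation pieces.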


Pipeline construction example (for BioNER + RAG integration)

In [1]:
# ==========================
# Step 1: Load libraries
# ==========================
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from langchain.chains import RetrievalQA
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

print("Step 1: libraries loaded")


Step 1: libraries loaded


In [3]:
# ==========================
# Step 2: Load the PDF document
# ==========================
pdf_path = "/data1/workspace/pdfs/4.pdf"
loader = PyMuPDFLoader(pdf_path)
docs = loader.load()  # PyMuPDFLoader returns one Document per page
print(f"Step 2: PDF loaded, number of documents: {len(docs)}")
print("Preview of the document content:\n", docs[0].page_content[:500])


Step 2: PDF loaded, number of documents: 9
Preview of the document content:
 ORIGINAL ARTICLE
HER2 exon 20 insertions in non-small-cell lung cancer
are sensitive to the irreversible pan-HER receptor
tyrosine kinase inhibitor pyrotinib
Y. Wang1†, T. Jiang1†, Z. Qin2†, J. Jiang3, Q. Wang3, S. Yang1, C. Rivard4, G. Gao1, T. L. Ng4, M. M. Tu5,6,
H. Yu4, H. Ji1,2‡, C. Zhou1‡, S. Ren1,4*‡, J. Zhang7, P. Bunn4, R. C. Doebele4, D. R. Camidge4 & F. R. Hirsch4
1Department of Medical Oncology, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai; 2Institute o


In [5]:
# ==========================
# Step 3: Split the text into chunks
# ==========================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)
print(f"Step 3: documents split, number of chunks: {len(split_documents)}")
print("Content of the first chunk:\n", split_documents[0].page_content)


Step 3: documents split, number of chunks: 90
Content of the first chunk:
 ORIGINAL ARTICLE
HER2 exon 20 insertions in non-small-cell lung cancer
are sensitive to the irreversible pan-HER receptor
tyrosine kinase inhibitor pyrotinib
Y. Wang1†, T. Jiang1†, Z. Qin2†, J. Jiang3, Q. Wang3, S. Yang1, C. Rivard4, G. Gao1, T. L. Ng4, M. M. Tu5,6,
H. Yu4, H. Ji1,2‡, C. Zhou1‡, S. Ren1,4*‡, J. Zhang7, P. Bunn4, R. C. Doebele4, D. R. Camidge4 & F. R. Hirsch4
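
Because NER is later run chunk by chunk, a drug name that straddles a chunk boundary could be cut in two. The sketch below uses a fixed-width sliding window, a simplified stand-in for `RecursiveCharacterTextSplitter` (which additionally splits on separators such as newlines), to illustrate what `chunk_overlap` is for. `sliding_chunks` and the toy text are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-width windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not fully contained in the previous one
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Toy text in which "trastuzumab" starts at offset 496, i.e. across the 500-character boundary
toy = "x " * 248 + "trastuzumab" + " y" * 100

no_overlap = sliding_chunks(toy, chunk_overlap=0)
with_overlap = sliding_chunks(toy, chunk_overlap=50)

print(any("trastuzumab" in c for c in no_overlap))    # False: the name is split in two
print(any("trastuzumab" in c for c in with_overlap))  # True: the second window holds it whole
```

Overlap only protects names shorter than the overlap width, so 50 characters is ample for single drug names but not for long phrases.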


In [6]:
# ==========================
# Step 4: Create embeddings and the FAISS vector store
# ==========================
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
retriever = vectorstore.as_retriever()
print("Step 4: vector store and retriever created")


Step 4: vector store and retriever created


In [7]:
# ==========================
# Step 5: Load the BioNER model
# ==========================
ner_model_name = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
model = AutoModelForTokenClassification.from_pretrained(
    ner_model_name,
    device_map="auto"  # automatic device placement (GPU)
)

ner_pipe = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)
print("Step 5: BioNER pipeline created")


Device set to use cuda:0


Step 5: BioNER pipeline created


In [8]:
# ==========================
# Step 6: NER test
# ==========================
test_text = "Lapatinib and trastuzumab are used in breast cancer treatment."
ner_result = ner_pipe(test_text)
print("Step 6: NER result (raw)\n", ner_result)


Step 6: NER result (raw)
 [{'entity_group': 'Medication', 'score': 0.91673374, 'word': 'lapatinib', 'start': 0, 'end': 9}, {'entity_group': 'Medication', 'score': 0.99988055, 'word': 'tr', 'start': 14, 'end': 16}, {'entity_group': 'Medication', 'score': 0.9977405, 'word': '##ast', 'start': 16, 'end': 19}, {'entity_group': 'Medication', 'score': 0.89958596, 'word': '##uzumab', 'start': 19, 'end': 25}, {'entity_group': 'Biological_structure', 'score': 0.9155983, 'word': 'breast', 'start': 38, 'end': 44}]


In [9]:
# ==========================
# Step 7: WordPiece post-processing
# ==========================
def merge_wordpieces(ner_results):
    drugs = []
    current_word = ""
    for token in ner_results:
        if token['entity_group'] == 'Medication':
            if token['word'].startswith("##"):
                current_word += token['word'][2:]
            else:
                if current_word:
                    drugs.append(current_word)
                current_word = token['word']
    if current_word:
        drugs.append(current_word)
    return drugs

merged_drugs = merge_wordpieces(ner_result)
print("Step 7: final drug names\n", merged_drugs)


Step 7: final drug names
 ['lapatinib', 'trastuzumab']
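
The raw output above carries a confidence score per fragment, and the `extract_drugs` function further down adds a `min_score=0.9` filter. How that filter is combined with WordPiece merging matters. The sketch below runs on the sample output shown above: thresholding each fragment separately drops `##uzumab` (score 0.8996) and leaves a truncated name, whereas merging first and thresholding the mean score keeps the full name. Both function names are hypothetical, and using the mean (rather than, say, the minimum) is an assumption, not a recommendation from the model's authors.

```python
sample = [
    {"entity_group": "Medication", "score": 0.91673374, "word": "lapatinib"},
    {"entity_group": "Medication", "score": 0.99988055, "word": "tr"},
    {"entity_group": "Medication", "score": 0.9977405, "word": "##ast"},
    {"entity_group": "Medication", "score": 0.89958596, "word": "##uzumab"},
    {"entity_group": "Biological_structure", "score": 0.9155983, "word": "breast"},
]


def filter_then_merge(ner_results, min_score=0.9):
    """Threshold each fragment first, then merge: weak fragments are silently lost."""
    words = []
    for tok in ner_results:
        if tok["entity_group"] != "Medication" or tok["score"] < min_score:
            continue
        if tok["word"].startswith("##") and words:
            words[-1] += tok["word"][2:]
        else:
            words.append(tok["word"])
    return words


def merge_then_filter(ner_results, min_score=0.9):
    """Merge fragments first, then threshold the mean score of each merged name."""
    merged = []  # list of [word, [fragment scores]]
    for tok in ner_results:
        if tok["entity_group"] != "Medication":
            continue
        if tok["word"].startswith("##") and merged:
            merged[-1][0] += tok["word"][2:]
            merged[-1][1].append(tok["score"])
        else:
            merged.append([tok["word"], [tok["score"]]])
    return [word for word, scores in merged if sum(scores) / len(scores) >= min_score]


print(filter_then_merge(sample))  # ['lapatinib', 'trast']
print(merge_then_filter(sample))  # ['lapatinib', 'trastuzumab']
```

A truncated name such as `trast` counts as a false positive against any gold-standard list, so the order of the two operations directly affects precision.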


In [10]:
# ==========================
# Step 0.5: Define OllamaLLM
# ==========================
import requests
from langchain_core.language_models.llms import LLM
from typing import Optional, List, Any

class OllamaLLM(LLM):
    model_name: str = "gpt-oss"
    base_url: str = "http://localhost:11434"
    timeout: int = 300  # generous timeout because the model is large
    temperature: float = 0.0  # declared as a field so that temperature=0 is actually sent

    def _call(
        self,
        prompt: str,
        stop: Optional[List[str]] = None,
        run_manager: Optional[Any] = None,
        **kwargs: Any,
    ) -> str:
        url = f"{self.base_url}/api/generate"
        # Sampling settings go into Ollama's "options" object
        options = {"temperature": self.temperature}
        if stop:
            options["stop"] = stop
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False,  # return the whole answer in a single JSON response
            "options": options,
        }

        try:
            response = requests.post(url, json=payload, timeout=self.timeout)
            response.raise_for_status()
            return response.json()['response']
        except requests.Timeout as e:
            raise RuntimeError(f"Ollama timed out after {self.timeout}s") from e
        except Exception as e:
            raise RuntimeError(f"Ollama API error: {e}") from e

    @property
    def _llm_type(self) -> str:
        return "ollama"

print("Step 0.5: OllamaLLM class defined")


Step 0.5: OllamaLLM class defined


In [11]:
# ==========================
# Step 8 (optional): RAG + LLM integration example
# ==========================
from langchain.prompts import PromptTemplate

prompt_template = """
You are a biomedical text analysis assistant.
List all unique drug names found in the document chunk below:

==== Document Excerpt Start ====
{context}
==== Document Excerpt End ====

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context"]
)

# Example: build the prompt the LLM would receive for the first chunk
llm = OllamaLLM(model_name="gpt-oss", temperature=0)
context = split_documents[0].page_content
query_prompt = PROMPT.format(context=context)
print("Step 8: example LLM prompt\n", query_prompt)


Step 8: example LLM prompt
 
You are a biomedical text analysis assistant.
List all unique drug names found in the document chunk below:

==== Document Excerpt Start ====
ORIGINAL ARTICLE
HER2 exon 20 insertions in non-small-cell lung cancer
are sensitive to the irreversible pan-HER receptor
tyrosine kinase inhibitor pyrotinib
Y. Wang1†, T. Jiang1†, Z. Qin2†, J. Jiang3, Q. Wang3, S. Yang1, C. Rivard4, G. Gao1, T. L. Ng4, M. M. Tu5,6,
H. Yu4, H. Ji1,2‡, C. Zhou1‡, S. Ren1,4*‡, J. Zhang7, P. Bunn4, R. C. Doebele4, D. R. Camidge4 & F. R. Hirsch4
==== Document Excerpt End ====

Answer:



PDF → RAG → BioNER → post-processing → final drug list → precision computed against a gold-standard set
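
The last stage of this flow, precision against a gold-standard set, is not implemented in the cells that follow. A minimal sketch of that computation, assuming the gold standard is a plain list of drug names and that matching is case-insensitive on exact strings. `normalize`, `precision`, and the toy lists are hypothetical and are not results from the PDF.

```python
def normalize(name):
    """Lowercase and trim a drug name so that 'Lapatinib ' and 'lapatinib' compare equal."""
    return name.strip().lower()


def precision(predicted, gold):
    """Fraction of predicted names that appear in the gold set (0.0 if nothing was predicted)."""
    pred = {normalize(p) for p in predicted if p.strip()}
    gold_set = {normalize(g) for g in gold if g.strip()}
    if not pred:
        return 0.0
    return len(pred & gold_set) / len(pred)


# Toy data for illustration only
predicted = ["lapatinib", "trastuzumab", "breast"]
gold = ["Lapatinib", "Trastuzumab", "Pertuzumab"]
print(round(precision(predicted, gold), 3))  # 0.667
```

Precision ignores names that were missed (here `Pertuzumab`), which is why a precision-oriented pipeline can afford strict score thresholds.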


In [13]:
# ==========================
# Step 1: Load libraries
# ==========================
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# ==========================
# Step 2: Load the PDF
# ==========================
# The path must point to an existing file; otherwise PyMuPDFLoader raises
# "ValueError: File path ... is not a valid file or url".
pdf_path = "/data1/workspace/pdfs/4.pdf"  # PDF path (same file as in the walkthrough above)
loader = PyMuPDFLoader(pdf_path)
docs = loader.load()

# ==========================
# Step 3: Split the text into chunks
# ==========================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# ==========================
# Step 4: Embeddings + FAISS vector store
# ==========================
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)
retriever = vectorstore.as_retriever()

In [14]:
# ==========================
# Step 5: Load the BioNER model
# ==========================
ner_model_name = "d4data/biomedical-ner-all"
tokenizer = AutoTokenizer.from_pretrained(ner_model_name)
model = AutoModelForTokenClassification.from_pretrained(
    ner_model_name,
    device_map="auto"
)

ner_pipe = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# ==========================
# Step 6: NER extraction + WordPiece post-processing function
# ==========================
def extract_drugs(text, ner_pipe, min_score=0.9):
    ner_results = ner_pipe(text)
    # Merge WordPiece fragments first, keeping the score of every fragment
    merged = []  # list of [word, [fragment scores]]
    for token in ner_results:
        if token['entity_group'] != 'Medication':
            continue
        if token['word'].startswith("##") and merged:
            merged[-1][0] += token['word'][2:]
            merged[-1][1].append(token['score'])
        else:
            merged.append([token['word'], [token['score']]])
    # Filter on the mean score of the merged name; filtering fragment by fragment
    # would truncate names (e.g. "trastuzumab" -> "trast" when "##uzumab" scores 0.8996)
    drugs = {word for word, scores in merged
             if sum(scores) / len(scores) >= min_score}
    # Remove duplicates and return a stable order
    return sorted(drugs)

# Test
test_text = split_documents[0].page_content
drugs_in_chunk = extract_drugs(test_text, ner_pipe)
# print("Step 6: drug names extracted from the first chunk:", drugs_in_chunk)

Device set to use cuda:0
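
When extraction is run over every chunk of the PDF, calling the pipeline once per chunk leaves the GPU underused. A transformers pipeline can also be called with a list of texts and a `batch_size` argument, in which case it returns one entity list per input. The sketch below wraps that call; `extract_drugs_batched` is a hypothetical helper that merges fragments by character offset and applies no score filter, and `fake_ner_pipe` is a stand-in so that the example runs without a model.

```python
def extract_drugs_batched(texts, ner_pipe, batch_size=16, label="Medication"):
    """Run NER over all texts in one pipeline call and merge contiguous fragments."""
    texts = list(texts)
    # Called with a list, the pipeline returns one entity list per input text
    per_text = ner_pipe(texts, batch_size=batch_size)
    found = set()
    for text, entities in zip(texts, per_text):
        span = None  # [start, end] of the name currently being assembled
        for ent in entities:
            if ent["entity_group"] != label:
                continue
            if span and ent["start"] == span[1]:
                span[1] = ent["end"]  # fragment continues the current name
            else:
                if span:
                    found.add(text[span[0]:span[1]].lower())
                span = [ent["start"], ent["end"]]
        if span:
            found.add(text[span[0]:span[1]].lower())
    return sorted(found)


def fake_ner_pipe(texts, batch_size=1):
    """Stand-in for the real pipeline so the sketch runs without a model or GPU."""
    canned = {
        "trastuzumab was given.": [
            {"entity_group": "Medication", "start": 0, "end": 2},
            {"entity_group": "Medication", "start": 2, "end": 11},
        ],
    }
    return [canned.get(t, []) for t in texts]


print(extract_drugs_batched(["trastuzumab was given.", "No drugs here."], fake_ner_pipe))
# ['trastuzumab']
```

With the real pipeline, the call would be `extract_drugs_batched([c.page_content for c in split_documents], ner_pipe)`; a suitable `batch_size` depends on GPU memory and chunk length.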


In [16]:
# ==========================
# Step 7: Extract drugs from the whole PDF
# ==========================
all_drugs = []
for chunk in split_documents:
    chunk_drugs = extract_drugs(chunk.page_content, ner_pipe)
    all_drugs.extend(chunk_drugs)
all_drugs = sorted(set(all_drugs))  # de-duplicate across chunks

# ==========================
# Step 8: RAG + OllamaLLM integration (consolidated)
# ==========================
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

# Create the LLM object
llm = OllamaLLM(model_name="gpt-oss", temperature=0)
print("Step 8: OllamaLLM object created")

# Define the prompt
prompt_template = """
You are a biomedical text analysis assistant.

Extract and clean all unique drug names from the following document chunks:

==== Document Excerpt Start ====
{context}
==== Document Excerpt End ====

- Merge broken tokens (WordPiece) into full drug names
- Remove duplicates
- Output as a semicolon-separated list

Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context"]
)

# Build the RetrievalQA chain
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,
    verbose=True
)
print("Step 8: RetrievalQA chain created")

# Concatenate the text of the whole PDF
all_text = " ".join([chunk.page_content for chunk in split_documents])
print("Step 8: all chunks concatenated, length:", len(all_text))

# Run RAG + LLM. The query is used only to retrieve the top-k chunks
# (the FAISS retriever defaults to k=4); the prompt has no {question} slot,
# so the LLM sees just those retrieved chunks, not the full text.
response = chain.invoke({"query": all_text})
rag_drugs = response['result']
print("Step 8: drug names extracted by RAG + LLM:\n", rag_drugs)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Error in StdOutCallbackHandler.on_chain_start callback: AttributeError("'NoneType' object has no attribute 'get'")


Step 8: OllamaLLM object created
Step 8: RetrievalQA chain created
Step 8: all chunks concatenated, length: 40178

> Finished chain.
Step 8: drug names extracted by RAG + LLM:
 AP32788; TAK-788; neratinib; afatinib; pyrotinib; PF299804; HKI-272
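
The LLM returns its answer as a single semicolon-separated string. A minimal sketch of turning that string into a list and keeping only the names that the NER pass also found, which trades recall for precision. `parse_drug_list` is a hypothetical helper, and `ner_names` is made-up illustration data, not the actual content of `all_drugs`.

```python
def parse_drug_list(answer):
    """Split a semicolon-separated answer into unique names, keeping first-seen order."""
    seen = set()
    names = []
    for part in answer.split(";"):
        name = part.strip()
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            names.append(name)
    return names


rag_answer = "AP32788; TAK-788; neratinib; afatinib; pyrotinib; PF299804; HKI-272"
rag_names = parse_drug_list(rag_answer)
print(rag_names)

# Hypothetical NER output, for illustration only
ner_names = ["pyrotinib", "afatinib", "her2"]
ner_keys = {name.lower() for name in ner_names}
agreed = [name for name in rag_names if name.lower() in ner_keys]
print(agreed)  # ['afatinib', 'pyrotinib']
```

Requiring agreement between the two extractors removes names that only one of them proposes, so it is best suited to cases where false positives cost more than misses.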
