RAG-LangChain Multi-modal LLM 

[STEP 1] Load PDF & extract images

In [1]:
import os
import fitz  # PyMuPDF

pdf_path = "/data1/workspace/pdfs/1.pdf"
img_dir = "/data1/workspace/pdf_images"
os.makedirs(img_dir, exist_ok=True)

print(f"[1] Loading PDF: {pdf_path}")
doc = fitz.open(pdf_path)
print(f"  → Total pages: {len(doc)}")

# Walk every page and save each embedded image in its native format.
for page_index in range(len(doc)):
    page = doc.load_page(page_index)
    images = page.get_images(full=True)
    print(f"  - Page {page_index+1}: {len(images)} images")
    for i, img in enumerate(images):
        xref = img[0]  # cross-reference number of the image object
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]
        img_ext = base_image["ext"]
        img_path = os.path.join(img_dir, f"page{page_index+1}_{i+1}.{img_ext}")
        with open(img_path, "wb") as f:
            f.write(img_bytes)
        print(f"    → Saved: {img_path}")

print("✅ Image extraction complete")


[1] Loading PDF: /data1/workspace/pdfs/1.pdf
  → Total pages: 16
  - Page 1: 4 images
    → Saved: /data1/workspace/pdf_images/page1_1.jpeg
    → Saved: /data1/workspace/pdf_images/page1_2.jpeg
    → Saved: /data1/workspace/pdf_images/page1_3.jpeg
    → Saved: /data1/workspace/pdf_images/page1_4.png
  - Page 2: 0 images
  - Page 3: 0 images
  - Page 4: 0 images
  - Page 5: 1 images
    → Saved: /data1/workspace/pdf_images/page5_1.jpeg
  - Page 6: 9 images
    → Saved: /data1/workspace/pdf_images/page6_1.jpeg
    → Saved: /data1/workspace/pdf_images/page6_2.jpeg
    → Saved: /data1/workspace/pdf_images/page6_3.jpeg
    → Saved: /data1/workspace/pdf_images/page6_4.jpeg
    → Saved: /data1/workspace/pdf_images/page6_5.jpeg
    → Saved: /data1/workspace/pdf_images/page6_6.jpeg
    → Saved: /data1/workspace/pdf_images/page6_7.jpeg
    → Saved: /data1/workspace/pdf_images/page6_8.jpeg
    → Saved: /data1/workspace/pdf_images/page6_9.jpeg
  - Page 7: 0 images
  - Page 8: 1 images
    → Saved: /data1
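
The extraction loop saves every entry returned by get_images(), so an image object that is reused on several pages (a logo, for example) would be written once per page. A minimal sketch of skipping repeated xrefs follows. It assumes, as in the loop above, that the first field of each entry is the xref; the helper name unique_image_jobs and the sample tuples are hypothetical and stand in for real get_images() results.

```python
def unique_image_jobs(pages):
    """Return (page_number, index_on_page, xref) for each xref seen for the first time.

    `pages` is a list with one entry per page; each entry is a list of tuples
    whose first field is the image xref (assumed layout, mirroring img[0] above).
    """
    seen = set()
    jobs = []
    for page_number, images in enumerate(pages, start=1):
        for index, img in enumerate(images, start=1):
            xref = img[0]
            if xref in seen:
                continue  # same image object already scheduled from an earlier page
            seen.add(xref)
            jobs.append((page_number, index, xref))
    return jobs

# Hypothetical data: xref 12 appears on page 1 and again on page 3.
sample_pages = [[(12, 0), (15, 0)], [], [(12, 0), (20, 0)]]
print(unique_image_jobs(sample_pages))
```

Each returned tuple carries enough information to rebuild the page{N}_{i} file name used above while writing every image only once.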

[STEP 2] Vision model test (LLaVA or Llama3.2-Vision)

In [2]:
import base64
import os
import requests

VISION_MODEL = "llava"  # or "llama3.2-vision"
OLLAMA_URL = "http://localhost:11434/api/generate"

def analyze_image_with_ollama(image_path, prompt="Describe this figure."):
    """Send one image to the Ollama vision model; return its text, or None on failure."""
    print(f"[2] Analyzing image with vision model: {VISION_MODEL}")
    try:
        # Ollama expects images as base64-encoded strings
        with open(image_path, "rb") as img:
            image_b64 = base64.b64encode(img.read()).decode("utf-8")

        payload = {
            "model": VISION_MODEL,
            "prompt": prompt,
            "images": [image_b64],
            "stream": False
        }

        response = requests.post(OLLAMA_URL, json=payload, timeout=300)
        response.raise_for_status()

        result = response.json().get("response", "")
        print(f"  → Result summary (first 300 characters):\n{result[:300]}...")
        return result

    except requests.exceptions.HTTPError as e:
        print(f"⚠️ HTTP error: {e.response.text}")
    except Exception as e:
        print(f"⚠️ Vision model error: {e}")
    return None

# Test again with the first image. sorted() orders names lexicographically,
# so "page10_1.jpeg" comes before "page1_1.jpeg".
images = sorted([f for f in os.listdir(img_dir) if f.lower().endswith(("png","jpg","jpeg"))])
if images:
    sample = os.path.join(img_dir, images[0])
    print("Test image:", sample)
    analyze_image_with_ollama(sample)
else:
    print("⚠️ No extracted images found.")


Test image: /data1/workspace/pdf_images/page10_1.jpeg
[2] Analyzing image with vision model: llava
  → Result summary (first 300 characters):
 The image is a composite of several photos and text, which appears to be an informational or educational collage. The central part shows a microscope slide with immunohistochemistry-stained sections that are likely from tissue samples. These sections have antibodies directed against specific protei...
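
The test image selected above is page10_1.jpeg rather than page1_1.jpeg because sorted() compares file names character by character, and "0" sorts before "_". If page order matters, a natural-sort key is one option. The sketch below is an illustration only; natural_key is a hypothetical helper and the file names are examples in the naming scheme used by the extraction step.

```python
import re

def natural_key(name):
    """Split a name into text and integer runs so that 'page2' sorts before 'page10'."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]

names = ["page10_1.jpeg", "page1_1.jpeg", "page6_2.jpeg", "page1_4.png"]

lexicographic = sorted(names)
natural = sorted(names, key=natural_key)

print(lexicographic[0])  # page10_1.jpeg
print(natural[0])        # page1_1.jpeg
```

Passing key=natural_key to the sorted() call in the test cell would make images[0] the first image of page 1.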


[STEP 3] Load and split text

In [3]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

print("[3] Starting text extraction...")
loader = PyMuPDFLoader(pdf_path)
docs = loader.load()  # one Document per PDF page
print(f"  → Documents loaded: {len(docs)}")

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = splitter.split_documents(docs)
print(f"  → Chunks after splitting: {len(split_docs)}")

[3] Starting text extraction...
  → Documents loaded: 16
  → Chunks after splitting: 186
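
To make the chunk_size=500 and chunk_overlap=50 settings concrete, here is a simplified fixed-window sketch of overlapping chunks. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers separator boundaries such as paragraphs and lines, so its chunks are usually shorter than the limit); sliding_chunks is a hypothetical helper that only illustrates how consecutive chunks share text.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 1200-character text so the overlap is easy to verify.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])            # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])   # True
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.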


[STEP 4] Embeddings & vector store creation

In [4]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import OllamaEmbeddings

print("[4] Building vector store...")
embeddings = OllamaEmbeddings(
    model="nomic-embed-text",
    base_url="http://localhost:11434"
)

# Embed every chunk and index the vectors in an in-memory FAISS store
vectorstore = FAISS.from_documents(split_docs, embeddings)
retriever = vectorstore.as_retriever()
print("✅ Vector store built")


[4] Building vector store...
✅ Vector store built
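
Conceptually, the retriever embeds the query and returns the chunks whose vectors lie closest to it. The sketch below shows that idea with toy two-dimensional vectors and plain Python. The L2 metric and k=4 are assumptions that should be checked against the installed LangChain and FAISS versions, and the chunk texts and vectors are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k=4):
    """Return the texts of the k indexed entries nearest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, embedding) pairs; real embeddings have hundreds of dimensions.
indexed = [
    ("chunk about dosing", [0.9, 0.1]),
    ("chunk about methods", [0.7, 0.3]),
    ("chunk about references", [0.0, 1.0]),
]

print(top_k([1.0, 0.0], indexed, k=2))
```

A real FAISS index avoids the full sort shown here, but the ranking it produces for an exact (flat) index follows the same nearest-first rule.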


[STEP 5] Connect the LLM (Runnable-compatible version)

In [5]:
from langchain_core.runnables import Runnable
import requests

class OllamaRunnable(Runnable):
    def __init__(self, model="gpt-oss", base_url="http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def invoke(self, input, *args, **kwargs):
        """
        The input passed by LangChain may be a StringPromptValue, a dict, a str, etc.
        Convert it safely to a string, then send the request to Ollama.
        """
        # 1. Normalize the LangChain input to plain text
        try:
            if hasattr(input, "to_string"):
                prompt_text = input.to_string()
            elif isinstance(input, dict) and "prompt" in input:
                prompt_text = input["prompt"]
            elif isinstance(input, str):
                prompt_text = input
            else:
                prompt_text = str(input)
        except Exception:
            prompt_text = str(input)

        # 2. Call the Ollama generate API (non-streaming)
        try:
            response = requests.post(
                f"{self.base_url}/api/generate",
                json={"model": self.model, "prompt": prompt_text, "stream": False},
                timeout=300
            )
            response.raise_for_status()
            result = response.json().get("response", "")
            print(f"[LLM response preview]\n{result[:200]}...\n")
            return result
        except Exception as e:
            # Note: the error text is returned as if it were a model answer
            print(f"⚠️ LLM error: {e}")
            return f"[LLM error]: {e}"

llm = OllamaRunnable(model="gpt-oss")
print("✅ LLM Runnable (final version) registered")


✅ LLM Runnable (final version) registered
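
The input handling inside invoke() can be exercised without a running Ollama server. The sketch below pulls the same branching into a standalone function. FakePromptValue is a hypothetical stand-in that models only the to_string() method checked by the wrapper; it is not a LangChain class.

```python
class FakePromptValue:
    """Hypothetical stand-in for a prompt value object; only to_string() is modeled."""
    def __init__(self, text):
        self.text = text

    def to_string(self):
        return self.text

def to_prompt_text(value):
    """Mirror the wrapper's branching: prompt value, dict with 'prompt', str, or fallback."""
    if hasattr(value, "to_string"):
        return value.to_string()
    if isinstance(value, dict) and "prompt" in value:
        return value["prompt"]
    if isinstance(value, str):
        return value
    return str(value)

print(to_prompt_text(FakePromptValue("from prompt value")))
print(to_prompt_text({"prompt": "from dict"}))
print(to_prompt_text("plain string"))
print(to_prompt_text(42))
```

Keeping the normalization in a small pure function like this makes it easy to unit-test separately from the HTTP call.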


[STEP 6] Build and test the RetrievalQA chain

In [7]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

prompt_template = """
You are a biomedical text analysis assistant.

Your task is to extract **only the drugs that were actually tested, administered, or part of the experiments** in the provided document chunks.
Do not include drugs that are mentioned only in the background, discussion, references, or comparison sections.

==== Document Excerpt Start ====
{context}
==== Document Excerpt End ====

Guidelines:
- Extract drugs mentioned in the 'Results' or 'Methods' sections preferably.
- Include drugs mentioned in Figures and Tables only if they were used in the main experiments.
- Exclude drugs mentioned only as examples, background information, or in cited papers.
- Exclude gene names, proteins, signaling pathways, or assay reagents (e.g., DMSO, PBS, MTT).
- If the text describes a combination therapy, include all drugs that were co-administered.
- Merge WordPiece fragments into complete drug names.
- Remove duplicates.
- List up to **3 most relevant drugs** directly used in experiments.
- Verify that each extracted term is an **actual drug name or formulation**, not a biological target or herbal component.
- Output as a **semicolon-separated list**, without any extra words or explanations.

Answer:
"""

# The template has no {question} placeholder, so only "context" is declared.
# The query below is therefore used for retrieval only, not shown to the LLM.
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context"]
)

print("[6] Building RAG chain...")
chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,
    verbose=True
)
print("✅ Chain created")

query = "List all drugs mentioned in the results section."
response = chain.invoke({"query": query})
print("\n[Result]")
print(response["result"])


Error in StdOutCallbackHandler.on_chain_start callback: AttributeError("'NoneType' object has no attribute 'get'")


[6] Building RAG chain...
✅ Chain created
[LLM response preview]
sotorasib; adavosertib; osimertinib...


> Finished chain.

[Result]
sotorasib; adavosertib; osimertinib
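
Because the prompt asks for a semicolon-separated list, the raw answer can be turned into a Python list for downstream use. The following is a minimal sketch; parse_drug_list is a hypothetical helper, and the case-insensitive de-duplication and the limit of 3 simply mirror the prompt guidelines as a safeguard in case the model does not follow them.

```python
def parse_drug_list(raw, limit=3):
    """Split a semicolon-separated answer into unique, trimmed names (at most `limit`)."""
    seen = set()
    drugs = []
    for part in raw.split(";"):
        name = part.strip()
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            drugs.append(name)
    return drugs[:limit]

print(parse_drug_list("sotorasib; adavosertib; osimertinib"))
# ['sotorasib', 'adavosertib', 'osimertinib']
```

In the notebook this would be called as parse_drug_list(response["result"]).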
