# UpstageLayoutAnalysisLoader
UpstageLayoutAnalysisLoader is a document analysis tool provided by Upstage AI. It is a document loader that integrates with the LangChain framework.

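Key features:

- Performs layout analysis on documents in various formats, such as PDFs and images
- Automatically recognizes and extracts structural elements (headings, paragraphs, tables, images, and so on)
- Supports OCR (optional)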

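UpstageLayoutAnalysisLoader goes beyond plain text extraction: it captures the structure of the document and the relationships between its elements, which enables more accurate document analysis.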

Installation

Install the langchain-upstage package before use.
```
pip install -U langchain-upstage
```

In [7]:
UPSTAGE_API_KEY = ""
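
The cell above leaves the key as an empty string to be filled in by hand. As a minimal sketch of one way to avoid pasting a key into the notebook, the helper below reads it from an environment variable and falls back to a hidden prompt. The helper name `resolve_upstage_api_key` is hypothetical, and the variable name `UPSTAGE_API_KEY` in the environment is an assumption, not something this notebook sets.

```python
import os
from getpass import getpass


def resolve_upstage_api_key(environ=os.environ, prompt=getpass):
    """Return an API key from the environment, or ask for it interactively.

    `environ` and `prompt` are parameters only so the helper can be tested
    without touching the real environment or blocking on input.
    """
    # Assumption: the key is stored under the name UPSTAGE_API_KEY.
    key = environ.get("UPSTAGE_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in cell output.
        key = prompt("Upstage API key: ").strip()
    if not key:
        raise ValueError("No Upstage API key provided.")
    return key
```

In the notebook this would be used as `UPSTAGE_API_KEY = resolve_upstage_api_key()`, and the result passed to the loader through `api_key=` as in the later cell.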

In [2]:
! pip install -U langchain-upstage

Collecting langchain-upstage
  Downloading langchain_upstage-0.4.0-py3-none-any.whl.metadata (3.3 kB)
Collecting langchain-openai<0.3,>=0.2 (from langchain-upstage)
  Downloading langchain_openai-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Collecting pypdf<5.0.0,>=4.2.0 (from langchain-upstage)
  Downloading pypdf-4.3.1-py3-none-any.whl.metadata (7.4 kB)
Collecting tokenizers<0.20.0,>=0.19.1 (from langchain-upstage)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting langchain-core<0.4,>=0.3.0 (from langchain-upstage)
  Downloading langchain_core-0.3.29-py3-none-any.whl.metadata (6.3 kB)
Collecting openai<2.0.0,>=1.58.1 (from langchain-openai<0.3,>=0.2->langchain-upstage)
  Downloading openai-1.59.3-py3-none-any.whl.metadata (27 kB)
Collecting tiktoken<1,>=0.7 (from langchain-openai<0.3,>=0.2->langchain-upstage)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Down

In [3]:
! pip install -qU langchain-teddynote



In [4]:
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["LANGCHAIN_API_KEY"] = ""
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"] = ""

[UpstageLayoutAnalysisLoader]
Key parameters

- file_path: path to the document to analyze
- output_type: output format ['html' (default), 'text']
- split: how the document is split ['none', 'element', 'page']
- use_ocr=True: enable OCR
- exclude=["header", "footer"]: exclude headers and footers
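
Because every call to the loader goes to a remote API, it can help to check the option values locally first. The sketch below collects the options listed above into a keyword dictionary and rejects values outside the listed choices. The helper `make_loader_kwargs` and the two constants are hypothetical names introduced for illustration; the allowed values are taken only from the list above, not from the library source.

```python
# Allowed values as listed in the parameter overview above.
ALLOWED_OUTPUT_TYPES = ("html", "text")
ALLOWED_SPLITS = ("none", "element", "page")


def make_loader_kwargs(output_type, split, use_ocr, exclude):
    """Validate loader options and return them as a keyword dictionary."""
    if output_type not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {ALLOWED_OUTPUT_TYPES}, got {output_type!r}")
    if split not in ALLOWED_SPLITS:
        raise ValueError(f"split must be one of {ALLOWED_SPLITS}, got {split!r}")
    return {
        "output_type": output_type,
        "split": split,
        "use_ocr": bool(use_ocr),
        "exclude": list(exclude),  # copy so the caller's list is not shared
    }


kwargs = make_loader_kwargs("html", "page", True, ["header", "footer"])
# The result can then be unpacked into the loader:
# loader = UpstageLayoutAnalysisLoader(file_path, api_key=UPSTAGE_API_KEY, **kwargs)
```

A typo such as `split="pages"` then fails immediately with a readable message instead of after a network round trip.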

In [6]:
from langchain_upstage import UpstageLayoutAnalysisLoader

# File path
file_path = "/content/TransUNet_10p.pdf"

# Configure the document loader
loader = UpstageLayoutAnalysisLoader(
    file_path,
    output_type="html",
    split="page",
    use_ocr=True,
    exclude=["header", "footer"],
    api_key=UPSTAGE_API_KEY
)

# Load the document
docs = loader.load()

# Print the results
for doc in docs[:3]:
    print(doc)

page_content='<h1 id='0' style='font-size:22px'>TransUNet: Transformers Make Strong<br>Encoders for Medical Image Segmentation</h1> <p id='1' data-category='paragraph' style='font-size:20px'>Jieneng Chen1, Yongyi Lu1, Qihang Yu1, Xiangde Luo2,<br>Ehsan Adeli3, Yan Wang4, Le Lu5, Alan L. Yuille1, and Yuyin Zhou3</p> <p id='2' data-category='equation'>$$^{2}\mathrm{University~of~Electronice~and~Dniversity}}\\ {{\mathrm{~^{2}~E l e c t r o n i c~S c i e n t e~E n a d~D r a t y e c h n o l o g y~G l i n a}}}\\ {{\mathrm{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$$</p> <p id='3' data-category='paragraph' style='font-size:14px'>Abstract. Medi
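
With `output_type="html"`, each element in `page_content` arrives as a tag with an `id` and, for non-heading elements, a `data-category` attribute, as the output above shows. The following is a minimal sketch of pulling those elements apart with Python's standard `html.parser`. The class `ElementCollector` and function `extract_elements` are hypothetical helpers, and the sketch assumes elements with an `id` do not contain another element of the same tag (true for the `h1` and `p` elements seen here, possibly not for nested tables).

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect (tag, category, text) for every element that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.elements = []    # finished (tag, category, text) tuples
        self._current = None  # element currently being read, or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None and "id" in attrs:
            # Headings in the sample carry no data-category, so fall back to the tag name.
            category = attrs.get("data-category", tag)
            self._current = {"tag": tag, "category": category, "text": []}
        elif self._current is not None and tag == "br":
            # Treat a line break inside an element as a space.
            self._current["text"].append(" ")

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)

    def handle_endtag(self, tag):
        if self._current is not None and tag == self._current["tag"]:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._current["text"]).split())
            self.elements.append((self._current["tag"], self._current["category"], text))
            self._current = None


def extract_elements(html):
    """Parse one page of loader HTML into a list of (tag, category, text)."""
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


# Shortened sample in the same shape as the output above.
sample = (
    "<h1 id='0' style='font-size:22px'>TransUNet: Transformers Make Strong<br>"
    "Encoders for Medical Image Segmentation</h1> "
    "<p id='1' data-category='paragraph' style='font-size:20px'>Jieneng Chen1, Yongyi Lu1</p>"
)
elements = extract_elements(sample)
```

On real output the same call would be `extract_elements(docs[0].page_content)`. Filtering the result by category, for example dropping `equation` entries, is one way to skip elements that OCR has mangled, such as the affiliation line in the output above.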