---

* Source: LangChain official documentation or the corresponding textbook title
* Original URL: https://smith.langchain.com/hub/teddynote/summary-stuff-documents

---

## **`Arxiv`**

* [**`arXiv`**](https://arxiv.org/)

  * An open-access archive of 2 million scholarly articles in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics

  * To use the `Arxiv` document loader, install the `arxiv`, `PyMuPDF`, and `langchain-community` integration packages

* **`PyMuPDF`**: converts `PDF files` downloaded from `arxiv.org` into **`text format`**


* [Official documentation](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.arxiv.ArxivLoader.html#langchain_community.document_loaders.arxiv.ArxivLoader)

In [None]:
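# Load API keys from a .env file so they are managed as environment variables
import os
from dotenv import load_dotenv

# Load the API key information
load_dotenv()               # returns True when the .env file was loaded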

---

* Install these beforehand in the `VS Code` terminal

```bash
pip install -qU langchain-community arxiv pymupdf
```
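
Before running the cells below, it may help to confirm that the packages are importable. A minimal sketch, assuming the import names `arxiv`, `fitz` (PyMuPDF), and `langchain_community`; the helper name `missing_modules` is hypothetical and not part of any library:

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Import names assumed for the packages installed above
required = ["arxiv", "fitz", "langchain_community"]
missing = missing_modules(required)
if missing:
    print(f"Missing modules: {missing}")
else:
    print("All required modules are importable.")
```

`find_spec` only locates a module without importing it, so this check is cheap and has no side effects from the packages themselves.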

---

### **`Creating the object`**

* You can now instantiate the `loader` object and load documents

In [None]:
from langchain_community.document_loaders import ArxivLoader     # import ArxivLoader
from langchain_core.documents import Document
import fitz                                                      # import the fitz module (PyMuPDF)

# Use a try-except block to handle any errors
try:
    # Create an ArxivLoader instance
    # Pass the topic of the papers to search for as the query
    loader = ArxivLoader(
        query="Chain of thought",           # topic to search for
        load_max_docs=2,                    # maximum number of documents
        load_all_available_meta=True,       # whether to load all metadata
    )

    # Load the documents
    docs = loader.load()

    # Print the list of loaded documents
    if docs:
        print(f"총 {len(docs)}개의 논문을 로드했습니다.\n")
        
        for i, doc in enumerate(docs):
            if isinstance(doc, Document):
                print(f"--- 논문 {i+1} ---")
                print(f"제목: {doc.metadata.get('Title', '제목 없음')}")
                print(f"저자: {doc.metadata.get('Authors', '저자 정보 없음')}")
                print(f"요약 (일부):\n{doc.page_content[:300]}...\n")
            else:
                print(f"--- 논문 {i+1} (로드 실패) ---")
                print(doc)
    else:
        print("조건에 맞는 논문을 찾지 못했습니다.")

except Exception as e:
    print(f"논문 로드 중 오류가 발생했습니다: {e}")
    print("\n[해결 제안]")
    print("1. 검색어가 정확한지 확인해 주세요.")
    print("2. 네트워크 상태를 확인하고, 방화벽이나 프록시 설정이 논문 다운로드를 방해하지 않는지 확인해 주세요.")
    print("3. 'pip install -U PyMuPDF' 명령어를 실행하여 PyMuPDF를 최신 버전으로 업데이트해 보세요.")
    print("4. 그래도 문제가 해결되지 않으면, 'ArxivLoader' 대신 'ArxivAPIWrapper'를 사용하여 논문 정보를 먼저 확인해 볼 수 있습니다.")

<small>

* Cell output (1.2s)

    ```markdown
    논문 로드 중 오류가 발생했습니다: module 'fitz' has no attribute 'fitz'

    [해결 제안]
    1. 검색어가 정확한지 확인해 주세요.
    2. 네트워크 상태를 확인하고, 방화벽이나 프록시 설정이 논문 다운로드를 방해하지 않는지 확인해 주세요.
    3. 'pip install -U PyMuPDF' 명령어를 실행하여 PyMuPDF를 최신 버전으로 업데이트해 보세요.
    4. 그래도 문제가 해결되지 않으면, 'ArxivLoader' 대신 'ArxivAPIWrapper'를 사용하여 논문 정보를 먼저 확인해 볼 수 있습니다.
    ```

---

* Cause of the error and workaround

    ```markdown
    - Cause: **most likely a version incompatibility between the `langchain_community.document_loaders.arxiv` module and the `PyMuPDF` library**
    - The `ArxivLoader` code in `langchain_community` was written against an older version of `PyMuPDF` (module name: `fitz`)
    - `ArxivLoader`
    - internally uses code such as `try ... except fitz.fitz.FileDataError`
    - the `fitz` module no longer has an attribute named `fitz`, so that lookup raises the error shown above

    - Fix: the **`ArxivLoader`** code needs to be changed
    - Editing the `ArxivLoader` source directly is impractical
    - Until `langchain-community` fixes this, work around it by not using `ArxivLoader` directly and fetching papers with `ArxivAPIWrapper` instead
    - `ArxivAPIWrapper`
        - **returns the paper search results themselves**
        - **the logic to `download` the `PDF` file and `parse` it has to be implemented by hand**
    ```
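
The cause described above can be checked directly. A minimal sketch using only the standard library; the helper name `has_attr_path` is hypothetical. That `FileDataError` is reachable on `fitz` itself is taken from the `dir(fitz)` listing later in this document, not verified here for every PyMuPDF version:

```python
import importlib


def has_attr_path(module_name, attr_path):
    """Return True if module_name imports and every attribute in the dotted attr_path resolves."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True


# The lookup the error message complains about: fitz.fitz.FileDataError
print("fitz.fitz.FileDataError:", has_attr_path("fitz", "fitz.FileDataError"))
# The same exception looked up directly on the module: fitz.FileDataError
print("fitz.FileDataError:", has_attr_path("fitz", "FileDataError"))
```

In an environment that reproduces the error above, the first lookup would be expected to fail while the second succeeds.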

### **Trying with `ArxivAPIWrapper`**

In [None]:
from langchain_community.utilities import ArxivAPIWrapper
import requests
import fitz

# Topic of the papers to search for
query_topic = "Chain of thought"

# Maximum number of documents
max_docs_to_load = 2

# Create an ArxivAPIWrapper instance
arxiv_wrapper = ArxivAPIWrapper()

try:
    # 1. Search arXiv for papers
    print(f"'{query_topic}' 주제로 논문을 검색 중입니다...")
    search_results = arxiv_wrapper.run(query=query_topic)
    
    # ArxivAPIWrapper.run() returns a single string, which makes it hard to extract specific metadata,
    # so use the arxiv package directly
    import arxiv
    search = arxiv.Search(
        query=query_topic,
        max_results=max_docs_to_load,
        sort_by=arxiv.SortCriterion.Relevance
    )
    
    docs = []
    
    # 2. Iterate over the search results, downloading and parsing each PDF
    for i, result in enumerate(arxiv.Client().results(search)):
        try:
            # PDF download URL
            pdf_url = result.pdf_url
            print(f"\n{i+1}/{max_docs_to_load} 논문 다운로드 중: {pdf_url}")
            
            # Download the paper PDF
            response = requests.get(pdf_url)
            response.raise_for_status()                 # raise an exception on HTTP errors
            
            # Parse the downloaded PDF data in memory
            with fitz.open(stream=response.content, filetype="pdf") as doc_file:
                # Extract the text of every page
                text_content = "".join(page.get_text() for page in doc_file)
                
                # Store the extracted information as a Document object
                from langchain_core.documents import Document
                doc = Document(
                    page_content=text_content,
                    metadata={
                        "Title": result.title,
                        "Authors": ", ".join(author.name for author in result.authors),
                        "Published": result.published,
                        "pdf_url": pdf_url,
                    }
                )
                docs.append(doc)
                
            print(f"성공적으로 로드된 논문: '{result.title}'")

        except requests.exceptions.HTTPError as errh:
            print(f"HTTP 오류: {errh}")
            print(f"'{result.title}' 논문 다운로드 실패. 다른 논문을 시도합니다.")
            continue
        except Exception as e:
            print(f"논문 처리 중 오류 발생: {e}")
            print(f"'{result.title}' 논문 처리 실패. 다른 논문을 시도합니다.")
            continue

    # 3. Print the list of loaded documents
    if docs:
        print("\n=== 최종 로드된 문서 목록 ===")
        for i, doc in enumerate(docs):
            print(f"--- 논문 {i+1} ---")
            print(f"제목: {doc.metadata.get('Title', '제목 없음')}")
            print(f"저자: {doc.metadata.get('Authors', '저자 정보 없음')}")
            print(f"요약 (일부):\n{doc.page_content[:300]}...\n")
    else:
        print("\n조건에 맞는 논문을 찾지 못했습니다.")

except Exception as e:
    print(f"전체 프로세스 중 오류가 발생했습니다: {e}")
    print("\n[해결 제안]")
    print("1. 검색어가 정확한지 확인해 주세요.")
    print("2. 네트워크 상태를 확인하고, 방화벽이나 프록시 설정이 논문 다운로드를 방해하지 않는지 확인해 주세요.")
    print("3. 필요한 라이브러리(requests, PyMuPDF, arxiv)가 모두 설치되었는지 확인해 주세요.")
    print("   'pip install requests PyMuPDF arxiv'")

<small>

* Cell output (2.3s)

    ```markdown
    'Chain of thought' 주제로 논문을 검색 중입니다...

    1/2 논문 다운로드 중: http://arxiv.org/pdf/2311.09277v1
    성공적으로 로드된 논문: 'Contrastive Chain-of-Thought Prompting'

    2/2 논문 다운로드 중: http://arxiv.org/pdf/2305.16582v2
    성공적으로 로드된 논문: 'Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models'

    === 최종 로드된 문서 목록 ===
    --- 논문 1 ---
    제목: Contrastive Chain-of-Thought Prompting
    저자: Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, Lidong Bing
    요약 (일부):
    Contrastive Chain-of-Thought Prompting
    Yew Ken Chia∗1,
    Guizhen Chen∗1, 2
    Luu Anh Tuan2
    Soujanya Poria
    Lidong Bing† 1
    1DAMO Academy, Alibaba Group, Singapore
    Singapore University of Technology and Design
    2Nanyang Technological University, Singapore
    {yewken_chia, sporia}@sutd.edu.sg
    {guizhen001, anhtu...

    --- 논문 2 ---
    제목: Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models
    저자: Yao Yao, Zuchao Li, Hai Zhao
    요약 (일부):
    Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in
    Language Models
    Yao Yao1,2, Zuchao Li3,∗and Hai Zhao1,2,∗
    1Department of Computer Science and Engineering, Shanghai Jiao Tong University
    2MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University
    3National Eng...
    ```


---

### **Metadata**: full output, partial output

* Printing the full metadata

```python

        # Relevant code
        # ...omitted...
                # Store all metadata
                full_metadata = {
                    "Title": result.title,
                    "Authors": ", ".join(author.name for author in result.authors),
                    "Published": result.published,
                    "Updated": result.updated,
                    "Summary": result.summary,
                    "Categories": ", ".join(result.categories),
                    "Entry ID": result.entry_id,
                    "DOI": result.doi,
                    "pdf_url": pdf_url,
                }
                
                doc = Document(
                    page_content=text_content,
                    metadata=full_metadata
                )
                docs.append(doc)
            print(f"성공적으로 로드된 논문: '{result.title}'")

```

---

* The equivalent setting with `ArxivLoader`

```python
        loader = ArxivLoader(
            query="ChatGPT",
            load_max_docs=2,                    # maximum number of documents
            load_all_available_meta=True,       # load all metadata
        )
```

In [None]:
# Print all metadata (1)
# A different topic: ChatGPT

from langchain_community.utilities import ArxivAPIWrapper
import requests
import fitz
from langchain_core.documents import Document
import arxiv

# Topic of the papers to search for
query_topic = "ChatGPT"

# Maximum number of documents
max_docs_to_load = 2

# Create an ArxivAPIWrapper instance (not used below; the arxiv package is queried directly)
arxiv_wrapper = ArxivAPIWrapper()

try:
    print(f"'{query_topic}' 주제로 논문을 검색 중입니다...")
    
    search = arxiv.Search(
        query=query_topic,
        max_results=max_docs_to_load,
        sort_by=arxiv.SortCriterion.Relevance
    )
    
    docs = []
    
    for i, result in enumerate(arxiv.Client().results(search)):
        try:
            pdf_url = result.pdf_url
            response = requests.get(pdf_url)
            response.raise_for_status()
            
            with fitz.open(stream=response.content, filetype="pdf") as doc_file:
                text_content = "".join(page.get_text() for page in doc_file)
                
                # Store all metadata
                full_metadata = {
                    "Title": result.title,
                    "Authors": ", ".join(author.name for author in result.authors),
                    "Published": result.published,
                    "Updated": result.updated,
                    "Summary": result.summary,
                    "Categories": ", ".join(result.categories),
                    "Entry ID": result.entry_id,
                    "DOI": result.doi,
                    "pdf_url": pdf_url,
                }
                
                doc = Document(
                    page_content=text_content,
                    metadata=full_metadata
                )
                docs.append(doc)
            print(f"성공적으로 로드된 논문: '{result.title}'")

        except requests.exceptions.HTTPError as errh:
            print(f"HTTP 오류: {errh}")
            continue
        except Exception as e:
            print(f"논문 처리 중 오류 발생: {e}")
            continue

    if docs:
        print("\n=== 로드된 문서 목록 (전체 메타데이터) ===")
        for i, doc in enumerate(docs):
            print(f"--- 논문 {i+1} ---")
            for key, value in doc.metadata.items():
                print(f"{key}: {value}")
            print("-" * 20)
    else:
        print("\n조건에 맞는 논문을 찾지 못했습니다.")

except Exception as e:
    print(f"전체 프로세스 중 오류가 발생했습니다: {e}")


<small>

* Cell output (5.4s)

    ```markdown
    'ChatGPT' 주제로 논문을 검색 중입니다...
    성공적으로 로드된 논문: 'In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT'
    성공적으로 로드된 논문: 'Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text'

    === 로드된 문서 목록 (전체 메타데이터) ===
    --- 논문 1 ---
    Title: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
    Authors: Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang
    Published: 2023-04-18 13:20:45+00:00
    Updated: 2023-10-05 13:27:12+00:00
    Summary: The way users acquire information is undergoing a paradigm shift with the
    advent of ChatGPT. Unlike conventional search engines, ChatGPT retrieves
    knowledge from the model itself and generates answers for users. ChatGPT's
    impressive question-answering (QA) capability has attracted more than 100
    million users within a short period of time but has also raised concerns
    regarding its reliability. In this paper, we perform the first large-scale
    measurement of ChatGPT's reliability in the generic QA scenario with a
    carefully curated set of 5,695 questions across ten datasets and eight domains.
    We find that ChatGPT's reliability varies across different domains, especially
    underperforming in law and science questions. We also demonstrate that system
    roles, originally designed by OpenAI to allow users to steer ChatGPT's
    behavior, can impact ChatGPT's reliability in an imperceptible way. We further
    show that ChatGPT is vulnerable to adversarial examples, and even a single
    character change can negatively affect its reliability in certain cases. We
    believe that our study provides valuable insights into ChatGPT's reliability
    and underscores the need for strengthening the reliability and security of
    large language models (LLMs).
    Categories: cs.CR, cs.LG
    Entry ID: http://arxiv.org/abs/2304.08979v2
    DOI: None
    pdf_url: http://arxiv.org/pdf/2304.08979v2
    --------------------
    --- 논문 2 ---
    Title: Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text
    Authors: Lingyi Yang, Feng Jiang, Haizhou Li
    Published: 2023-07-21 06:38:37+00:00
    Updated: 2023-12-30 13:17:52+00:00
    Summary: The remarkable capabilities of large-scale language models, such as ChatGPT,
    in text generation have impressed readers and spurred researchers to devise
    detectors to mitigate potential risks, including misinformation, phishing, and
    academic dishonesty. Despite this, most previous studies have been
    predominantly geared towards creating detectors that differentiate between
    purely ChatGPT-generated texts and human-authored texts. This approach,
    however, fails to work on discerning texts generated through human-machine
    collaboration, such as ChatGPT-polished texts. Addressing this gap, we
    introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts),
    facilitating the construction of more robust detectors. It diverges from extant
    corpora by comprising pairs of human-written and ChatGPT-polished abstracts
    instead of purely ChatGPT-generated texts. Additionally, we propose the "Polish
    Ratio" method, an innovative measure of the degree of modification made by
    ChatGPT compared to the original human-written text. It provides a mechanism to
    measure the degree of ChatGPT influence in the resulting text. Our experimental
    results show our proposed model has better robustness on the HPPT dataset and
    two existing datasets (HC3 and CDB). Furthermore, the "Polish Ratio" we
    proposed offers a more comprehensive explanation by quantifying the degree of
    ChatGPT involvement.
    Categories: cs.CL
    Entry ID: http://arxiv.org/abs/2307.11380v2
    DOI: None
    pdf_url: http://arxiv.org/pdf/2307.11380v2
    --------------------
    ```

---

* Printing partial metadata

```python

        # Relevant code
        # ...omitted...
                # Store only part of the metadata
                doc = Document(
                    page_content=text_content,
                    metadata={
                        "Title": result.title,
                        "Authors": ", ".join(author.name for author in result.authors),
                        "Published": result.published,
                        "pdf_url": pdf_url,
                    }
                )
                docs.append(doc)

```

---

* The equivalent setting with `ArxivLoader`

```python
        loader = ArxivLoader(
            query="ChatGPT",
            load_max_docs=2,                     # maximum number of documents
            load_all_available_meta=False,       # load only the default metadata, not all of it
        )
```
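
The full-metadata and partial-metadata excerpts in this section differ only in which fields go into `metadata`. A minimal sketch of a single helper covering both cases; the function name `build_metadata` and its `full` flag are hypothetical, and `SimpleNamespace` stands in for an `arxiv` search result so the example runs offline:

```python
from types import SimpleNamespace


def build_metadata(result, full=False):
    """Build a metadata dict from an arxiv-style result object.

    With full=False only the short field set is kept; with full=True the
    extra fields from the full-metadata excerpt are added.
    """
    metadata = {
        "Title": result.title,
        "Authors": ", ".join(author.name for author in result.authors),
        "Published": result.published,
        "pdf_url": result.pdf_url,
    }
    if full:
        metadata.update(
            {
                "Updated": result.updated,
                "Summary": result.summary,
                "Categories": ", ".join(result.categories),
                "Entry ID": result.entry_id,
                "DOI": result.doi,
            }
        )
    return metadata


# Stand-in object with made-up values (not a real arXiv record)
fake_result = SimpleNamespace(
    title="Example Title",
    authors=[SimpleNamespace(name="A. Author"), SimpleNamespace(name="B. Author")],
    published="2024-01-01",
    updated="2024-02-01",
    summary="Example abstract.",
    categories=["cs.CL", "cs.LG"],
    entry_id="http://example.org/abs/0000.00000v1",
    doi=None,
    pdf_url="http://example.org/pdf/0000.00000v1",
)

print(sorted(build_metadata(fake_result)))
print(sorted(build_metadata(fake_result, full=True)))
```

Keeping the field selection in one function means the download-and-parse loop stays identical for both variants; only the `full` argument changes.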

In [None]:
# Print partial metadata (2)
# A different topic: ChatGPT

from langchain_community.utilities import ArxivAPIWrapper
import requests
import fitz
from langchain_core.documents import Document
import arxiv

# Topic of the papers to search for
query_topic = "ChatGPT"

# Maximum number of documents
max_docs_to_load = 2

# Create an ArxivAPIWrapper instance (not used below; the arxiv package is queried directly)
arxiv_wrapper = ArxivAPIWrapper()

try:
    print(f"'{query_topic}' 주제로 논문을 검색 중입니다...")
    
    search = arxiv.Search(
        query=query_topic,
        max_results=max_docs_to_load,
        sort_by=arxiv.SortCriterion.Relevance
    )
    
    docs = []
    
    for i, result in enumerate(arxiv.Client().results(search)):
        try:
            pdf_url = result.pdf_url
            response = requests.get(pdf_url)
            response.raise_for_status()
            
            with fitz.open(stream=response.content, filetype="pdf") as doc_file:
                text_content = "".join(page.get_text() for page in doc_file)
                
                doc = Document(
                    page_content=text_content,
                    metadata={
                        "Title": result.title,
                        "Authors": ", ".join(author.name for author in result.authors),
                        "Published": result.published,
                        "pdf_url": pdf_url,
                    }
                )
                docs.append(doc)
            print(f"성공적으로 로드된 논문: '{result.title}'")

        except requests.exceptions.HTTPError as errh:
            print(f"HTTP 오류: {errh}")
            continue
        except Exception as e:
            print(f"논문 처리 중 오류 발생: {e}")
            continue

    if docs:
        print("\n=== 로드된 문서 목록 (간결한 메타데이터) ===")
        for i, doc in enumerate(docs):
            print(f"--- 논문 {i+1} ---")
            print(f"제목: {doc.metadata.get('Title', '제목 없음')}")
            print(f"저자: {doc.metadata.get('Authors', '저자 정보 없음')}")
            print(f"URL: {doc.metadata.get('pdf_url', 'URL 없음')}")
            print("-" * 20)
    else:
        print("\n조건에 맞는 논문을 찾지 못했습니다.")

except Exception as e:
    print(f"전체 프로세스 중 오류가 발생했습니다: {e}")

<small>

* Cell output (1.1s)

    ```markdown
    'ChatGPT' 주제로 논문을 검색 중입니다...
    성공적으로 로드된 논문: 'In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT'
    성공적으로 로드된 논문: 'Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text'

    === 로드된 문서 목록 (간결한 메타데이터) ===
    --- 논문 1 ---
    제목: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
    저자: Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang
    URL: http://arxiv.org/pdf/2304.08979v2
    --------------------
    --- 논문 2 ---
    제목: Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text
    저자: Lingyi Yang, Feng Jiang, Haizhou Li
    URL: http://arxiv.org/pdf/2307.11380v2
    --------------------
    ```

---

### **`Summary`**

* To output a paper's summary rather than its full text → call the `get_summaries_as_docs()` method

In [None]:
from langchain_community.document_loaders import ArxivLoader
from langchain_community.utilities import ArxivAPIWrapper
from langchain_core.documents import Document

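# Topic of the papers to search for
query_topic = "ChatGPT"

# Maximum number of documents
# Note: this value is never passed to the wrapper below, so the wrapper's default
# top_k_results (3) applies, which is why three summaries appear in the output
max_docs_to_load = 2

# Create an ArxivAPIWrapper instance
arxiv_wrapper = ArxivAPIWrapper()

try:
    print(f"'{query_topic}' 주제로 논문을 검색 중입니다...")

    # Load the summaries as documents with ArxivAPIWrapper
    # get_summaries_as_docs() fetches each paper's abstract and returns it as a Document object
    docs = arxiv_wrapper.get_summaries_as_docs(query=query_topic)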

    if docs:
        print("\n=== 요약된 문서 목록 ===")
        for i, doc in enumerate(docs):
            if isinstance(doc, Document):
                print(f"--- 문서 {i+1} ---")
                print(f"제목: {doc.metadata.get('Title', '제목 없음')}")
                print(f"저자: {doc.metadata.get('Authors', '저자 정보 없음')}")
                print(f"요약:\n{doc.page_content}\n")
                print("-" * 20)
            else:
                print(f"--- 문서 {i+1} (로드 실패) ---")
                print(doc)
    else:
        print("\n조건에 맞는 논문을 찾지 못했습니다.")

except Exception as e:
    print(f"논문 요약 중 오류가 발생했습니다: {e}")
    print("\n[해결 제안]")
    print("1. 검색어가 정확한지 확인해 주세요.")
    print("2. 네트워크 상태를 확인하고, 방화벽이나 프록시 설정이 논문 다운로드를 방해하지 않는지 확인해 주세요.")

<small>

* Cell output (0.1s)

    ```markdown
    'ChatGPT' 주제로 논문을 검색 중입니다...

    === 요약된 문서 목록 ===
    --- 문서 1 ---
    제목: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
    저자: Xinyue Shen, Zeyuan Chen, Michael Backes, Yang Zhang
    요약:
    The way users acquire information is undergoing a paradigm shift with the
    advent of ChatGPT. Unlike conventional search engines, ChatGPT retrieves
    knowledge from the model itself and generates answers for users. ChatGPT's
    impressive question-answering (QA) capability has attracted more than 100
    million users within a short period of time but has also raised concerns
    regarding its reliability. In this paper, we perform the first large-scale
    measurement of ChatGPT's reliability in the generic QA scenario with a
    carefully curated set of 5,695 questions across ten datasets and eight domains.
    We find that ChatGPT's reliability varies across different domains, especially
    underperforming in law and science questions. We also demonstrate that system
    roles, originally designed by OpenAI to allow users to steer ChatGPT's
    behavior, can impact ChatGPT's reliability in an imperceptible way. We further
    show that ChatGPT is vulnerable to adversarial examples, and even a single
    character change can negatively affect its reliability in certain cases. We
    believe that our study provides valuable insights into ChatGPT's reliability
    and underscores the need for strengthening the reliability and security of
    large language models (LLMs).

    --------------------
    --- 문서 2 ---
    제목: Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text
    저자: Lingyi Yang, Feng Jiang, Haizhou Li
    요약:
    The remarkable capabilities of large-scale language models, such as ChatGPT,
    in text generation have impressed readers and spurred researchers to devise
    detectors to mitigate potential risks, including misinformation, phishing, and
    academic dishonesty. Despite this, most previous studies have been
    predominantly geared towards creating detectors that differentiate between
    purely ChatGPT-generated texts and human-authored texts. This approach,
    however, fails to work on discerning texts generated through human-machine
    collaboration, such as ChatGPT-polished texts. Addressing this gap, we
    introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts),
    facilitating the construction of more robust detectors. It diverges from extant
    corpora by comprising pairs of human-written and ChatGPT-polished abstracts
    instead of purely ChatGPT-generated texts. Additionally, we propose the "Polish
    Ratio" method, an innovative measure of the degree of modification made by
    ChatGPT compared to the original human-written text. It provides a mechanism to
    measure the degree of ChatGPT influence in the resulting text. Our experimental
    results show our proposed model has better robustness on the HPPT dataset and
    two existing datasets (HC3 and CDB). Furthermore, the "Polish Ratio" we
    proposed offers a more comprehensive explanation by quantifying the degree of
    ChatGPT involvement.

    --------------------
    --- 문서 3 ---
    제목: When ChatGPT is gone: Creativity reverts and homogeneity persists
    저자: Qinghan Liu, Yiyong Zhou, Jihao Huang, Guiquan Li
    요약:
    ChatGPT has been evidenced to enhance human performance in creative tasks.
    Yet, it is still unclear if this boosting effect sustains with and without
    ChatGPT. In a pre-registered seven-day lab experiment and a follow-up survey
    after 30 days of experiment completion, we examined the impacts of ChatGPT
    presence and absence on sustained creativity using a text dataset of 3302
    creative ideas and 427 creative solutions from 61 college students.
    Participants in the treatment group used ChatGPT in creative tasks, while those
    in the control group completed the tasks by themselves. The findings show that
    although the boosting effect of ChatGPT was consistently observed over a
    five-day creative journey, human creative performance reverted to baseline when
    ChatGPT was down on the 7th and the 30th day. More critically, the use of
    ChatGPT in creative tasks resulted in increasingly homogenized contents, and
    this homogenization effect persisted even when ChatGPT was absence. These
    findings pose a challenge to the prevailing argument that ChatGPT can enhance
    human creativity. In fact, generative AI like ChatGPT lends to human with a
    temporary rise in creative performance but boxes human creative capability in
    the long run, highlighting the imperative for cautious generative AI
    integration in creative endeavors.

    --------------------
    ```

---

### **`lazy_load()`**

* When loading a large number of documents
  * If the downstream work can be done on a subset of the loaded documents, you can **lazily load the documents `one at a time` to `minimize memory usage`**
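
A minimal sketch of the idea, assuming a loader that exposes `lazy_load()` as a generator. `DemoLoader` is a hypothetical stand-in so the example runs offline; with a working `ArxivLoader`, iterating over `loader.lazy_load()` would play the same role:

```python
from itertools import islice


class DemoLoader:
    """Hypothetical stand-in for a document loader with a lazy_load() generator."""

    def __init__(self, n_docs):
        self.n_docs = n_docs
        self.produced = 0  # counts how many documents were actually built

    def lazy_load(self):
        for i in range(self.n_docs):
            self.produced += 1
            yield {"page_content": f"document {i}", "metadata": {"index": i}}


loader = DemoLoader(n_docs=1000)

# Process only the first three documents; the remaining ones are never created
first_three = [doc["metadata"]["index"] for doc in islice(loader.lazy_load(), 3)]
print(first_three)
print(loader.produced)
```

Because the generator builds each document only when it is requested, stopping early leaves the rest unbuilt, which is where the memory saving comes from.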

In [None]:
import fitz
print(dir(fitz))
print(type(dir(fitz)))                          # <class 'list'>

<small>

* Cell output

```python
    ['ASSERT_PDF', 'Annot', 'AnyType', 'Archive', 'Base14_fontdict', 'Base14_fontnames', 'ByteString', 'CS_CMYK', 'CS_GRAY', 'CS_RGB', 'CheckColor', 'CheckFont', 'CheckFontInfo', 'CheckMarkerArg', 'CheckMorph', 'CheckParent', 'CheckQuad', 'CheckRect', 'ColorCode', 'Colorspace', 'ConversionHeader', 'ConversionTrailer', 'DeviceWrapper', 'DisplayList', 'Document', 'DocumentWriter', 'EMPTY_IRECT', 'EMPTY_QUAD', 'EMPTY_RECT', 'ENSURE_OPERATION', 'EPSILON', 'ElementPosition', 'EmptyFileError', 'FLT_EPSILON', 'FZ_MAX_INF_RECT', 'FZ_MIN_INF_RECT', 'FZ_RECOMPRESS_FAX', 'FZ_RECOMPRESS_J2K', 'FZ_RECOMPRESS_JPEG', 'FZ_RECOMPRESS_LOSSLESS', 'FZ_RECOMPRESS_NEVER', 'FZ_RECOMPRESS_SAME', 'FZ_SUBSAMPLE_AVERAGE', 'FZ_SUBSAMPLE_BICUBIC', 'FileDataError', 'FileNotFoundError', 'FitzDeprecation', 'Font', 'Graftmap', 'INFINITE_IRECT', 'INFINITE_QUAD', 'INFINITE_RECT', 'INVALID_NAME_CHARS', 'IRect', 'Identity', 'IdentityMatrix', 'JM_BinFromBuffer', 'JM_BufferFromBytes', 'JM_EscapeStrFromBuffer', 'JM_EscapeStrFromStr', 'JM_Exc_FileDataError', 'JM_FLOAT_ITEM', 'JM_INT_ITEM', 'JM_MEMORY', 'JM_StrAsChar', 'JM_TUPLE', 'JM_TUPLE3', 'JM_UnicodeFromBuffer', 'JM_UnicodeFromStr', 'JM_add_annot_id', 'JM_add_layer_config', 'JM_add_oc_object', 'JM_annot_border', 'JM_annot_colors', 'JM_annot_id_stem', 'JM_annot_set_border', 'JM_append_rune', 'JM_append_word', 'JM_char_bbox', 'JM_char_font_flags', 'JM_char_quad', 'JM_choice_options', 'JM_clear_pixmap_rect_with_value', 'JM_color_FromSequence', 'JM_color_count', 'JM_compress_buffer', 'JM_convert_to_pdf', 'JM_copy_rectangle', 'JM_create_widget', 'JM_cropbox', 'JM_cropbox_size', 'JM_derotate_page_matrix', 'JM_embed_file', 'JM_embedded_clean', 'JM_ensure_identity', 'JM_ensure_ocproperties', 'JM_expand_fname', 'JM_field_type_text', 'JM_fill_pixmap_rect_with_color', 'JM_find_annot_irt', 'JM_fitz_config', 'JM_font_ascender', 'JM_font_descender', 'JM_font_name', 'JM_gather_fonts', 'JM_gather_forms', 'JM_gather_images', 'JM_get_annot_by_name', 
'JM_get_annot_by_xref', 'JM_get_annot_id_list', 'JM_get_annot_xref_list', 'JM_get_annot_xref_list2', 'JM_get_border_style', 'JM_get_font', 'JM_get_fontbuffer', 'JM_get_fontextension', 'JM_get_ocg_arrays', 'JM_get_ocg_arrays_imp', 'JM_get_page_labels', 'JM_get_resource_properties', 'JM_get_script', 'JM_get_widget_by_xref', 'JM_get_widget_properties', 'JM_have_operation', 'JM_image_extension', 'JM_image_filter', 'JM_image_profile', 'JM_image_reporter', 'JM_image_reporter_Filter', 'JM_insert_contents', 'JM_insert_font', 'JM_irect_from_py', 'JM_is_rtl_char', 'JM_is_word_delimiter', 'JM_listbox_value', 'JM_make_annot_DA', 'JM_make_image_block', 'JM_make_spanlist', 'JM_make_text_block', 'JM_make_textpage_dict', 'JM_matrix_from_py', 'JM_mediabox', 'JM_merge_range', 'JM_merge_resources', 'JM_mupdf_error', 'JM_mupdf_show_errors', 'JM_mupdf_show_warnings', 'JM_mupdf_warning', 'JM_mupdf_warnings_store', 'JM_new_bbox_device', 'JM_new_bbox_device_Device', 'JM_new_buffer_from_stext_page', 'JM_new_javascript', 'JM_new_lineart_device_Device', 'JM_new_output_fileptr', 'JM_new_output_fileptr_Output', 'JM_new_texttrace_device', 'JM_norm_rotation', 'JM_object_to_buffer', 'JM_outline_xrefs', 'JM_page_rotation', 'JM_pdf_obj_from_str', 'JM_pixmap_from_display_list', 'JM_pixmap_from_page', 'JM_point_from_py', 'JM_print_stext_page_as_text', 'JM_put_script', 'JM_py_from_irect', 'JM_py_from_matrix', 'JM_py_from_point', 'JM_py_from_quad', 'JM_py_from_rect', 'JM_quad_from_py', 'JM_read_contents', 'JM_rect_from_py', 'JM_rects_overlap', 'JM_refresh_links', 'JM_rotate_page_matrix', 'JM_scan_resources', 'JM_search_stext_page', 'JM_set_choice_options', 'JM_set_field_type', 'JM_set_object_value', 'JM_set_ocg_arrays', 'JM_set_ocg_arrays_imp', 'JM_set_resource_property', 'JM_set_widget_properties', 'JM_show_string_cs', 'JM_update_stream', 'JM_xobject_from_page', 'LINK_FLAG_B_VALID', 'LINK_FLAG_FIT_H', 'LINK_FLAG_FIT_V', 'LINK_FLAG_L_VALID', 'LINK_FLAG_R_IS_ZOOM', 'LINK_FLAG_R_VALID', 
'LINK_FLAG_T_VALID', 'LINK_GOTO', 'LINK_GOTOR', 'LINK_LAUNCH', 'LINK_NAMED', 'LINK_NONE', 'LINK_URI', 'Link', 'MSG_BAD_ANNOT_TYPE', 'MSG_BAD_APN', 'MSG_BAD_ARG_INK_ANNOT', 'MSG_BAD_ARG_POINTS', 'MSG_BAD_BUFFER', 'MSG_BAD_COLOR_SEQ', 'MSG_BAD_DOCUMENT', 'MSG_BAD_FILETYPE', 'MSG_BAD_LOCATION', 'MSG_BAD_OC_CONFIG', 'MSG_BAD_OC_LAYER', 'MSG_BAD_OC_REF', 'MSG_BAD_PAGEID', 'MSG_BAD_PAGENO', 'MSG_BAD_PDFROOT', 'MSG_BAD_RECT', 'MSG_BAD_TEXT', 'MSG_BAD_XREF', 'MSG_COLOR_COUNT_FAILED', 'MSG_FILE_OR_BUFFER', 'MSG_FONT_FAILED', 'MSG_IS_NO_ANNOT', 'MSG_IS_NO_DICT', 'MSG_IS_NO_IMAGE', 'MSG_IS_NO_PDF', 'MSG_PIXEL_OUTSIDE', 'MSG_PIX_NOALPHA', 'Matrix', 'OptBytes', 'OptDict', 'OptFloat', 'OptInt', 'OptSeq', 'OptStr', 'Outline', 'PDF_ALERT_BUTTON_CANCEL', 'PDF_ALERT_BUTTON_GROUP_OK', 'PDF_ALERT_BUTTON_GROUP_OK_CANCEL', 'PDF_ALERT_BUTTON_GROUP_YES_NO', 'PDF_ALERT_BUTTON_GROUP_YES_NO_CANCEL', 'PDF_ALERT_BUTTON_NO', 'PDF_ALERT_BUTTON_NONE', 'PDF_ALERT_BUTTON_OK', 'PDF_ALERT_BUTTON_YES', 'PDF_ALERT_ICON_ERROR', 'PDF_ALERT_ICON_QUESTION', 'PDF_ALERT_ICON_STATUS', 'PDF_ALERT_ICON_WARNING', 'PDF_ANNOT_3D', 'PDF_ANNOT_CARET', 'PDF_ANNOT_CIRCLE', 'PDF_ANNOT_FILE_ATTACHMENT', 'PDF_ANNOT_FREE_TEXT', 'PDF_ANNOT_HIGHLIGHT', 'PDF_ANNOT_INK', 'PDF_ANNOT_IS_HIDDEN', 'PDF_ANNOT_IS_INVISIBLE', 'PDF_ANNOT_IS_LOCKED', 'PDF_ANNOT_IS_LOCKED_CONTENTS', 'PDF_ANNOT_IS_NO_ROTATE', 'PDF_ANNOT_IS_NO_VIEW', 'PDF_ANNOT_IS_NO_ZOOM', 'PDF_ANNOT_IS_PRINT', 'PDF_ANNOT_IS_READ_ONLY', 'PDF_ANNOT_IS_TOGGLE_NO_VIEW', 'PDF_ANNOT_IT_DEFAULT', 'PDF_ANNOT_IT_FREETEXT_CALLOUT', 'PDF_ANNOT_IT_FREETEXT_TYPEWRITER', 'PDF_ANNOT_IT_LINE_ARROW', 'PDF_ANNOT_IT_LINE_DIMENSION', 'PDF_ANNOT_IT_POLYGON_CLOUD', 'PDF_ANNOT_IT_POLYGON_DIMENSION', 'PDF_ANNOT_IT_POLYLINE_DIMENSION', 'PDF_ANNOT_IT_STAMP_IMAGE', 'PDF_ANNOT_IT_STAMP_SNAPSHOT', 'PDF_ANNOT_IT_UNKNOWN', 'PDF_ANNOT_LE_BUTT', 'PDF_ANNOT_LE_CIRCLE', 'PDF_ANNOT_LE_CLOSED_ARROW', 'PDF_ANNOT_LE_DIAMOND', 'PDF_ANNOT_LE_NONE', 'PDF_ANNOT_LE_OPEN_ARROW', 'PDF_ANNOT_LE_R_CLOSED_ARROW', 
'PDF_ANNOT_LE_R_OPEN_ARROW', 'PDF_ANNOT_LE_SLASH', 'PDF_ANNOT_LE_SQUARE', 'PDF_ANNOT_LINE', 'PDF_ANNOT_LINK', 'PDF_ANNOT_MOVIE', 'PDF_ANNOT_POLYGON', 'PDF_ANNOT_POLY_LINE', 'PDF_ANNOT_POPUP', 'PDF_ANNOT_PRINTER_MARK', 'PDF_ANNOT_PROJECTION', 'PDF_ANNOT_Q_CENTER', 'PDF_ANNOT_Q_LEFT', 'PDF_ANNOT_Q_RIGHT', 'PDF_ANNOT_REDACT', 'PDF_ANNOT_RICH_MEDIA', 'PDF_ANNOT_SCREEN', 'PDF_ANNOT_SOUND', 'PDF_ANNOT_SQUARE', 'PDF_ANNOT_SQUIGGLY', 'PDF_ANNOT_STAMP', 'PDF_ANNOT_STRIKE_OUT', 'PDF_ANNOT_TEXT', 'PDF_ANNOT_TRAP_NET', 'PDF_ANNOT_UNDERLINE', 'PDF_ANNOT_UNKNOWN', 'PDF_ANNOT_WATERMARK', 'PDF_ANNOT_WIDGET', 'PDF_BM_Color', 'PDF_BM_ColorBurn', 'PDF_BM_ColorDodge', 'PDF_BM_Darken', 'PDF_BM_Difference', 'PDF_BM_Exclusion', 'PDF_BM_HardLight', 'PDF_BM_Hue', 'PDF_BM_Lighten', 'PDF_BM_Luminosity', 'PDF_BM_Multiply', 'PDF_BM_Normal', 'PDF_BM_Overlay', 'PDF_BM_Saturation', 'PDF_BM_Screen', 'PDF_BM_SoftLight', 'PDF_BORDER_EFFECT_CLOUDY', 'PDF_BORDER_EFFECT_NONE', 'PDF_BORDER_STYLE_BEVELED', 'PDF_BORDER_STYLE_DASHED', 'PDF_BORDER_STYLE_INSET', 'PDF_BORDER_STYLE_SOLID', 'PDF_BORDER_STYLE_UNDERLINE', 'PDF_BTN_FIELD_IS_NO_TOGGLE_TO_OFF', 'PDF_BTN_FIELD_IS_PUSHBUTTON', 'PDF_BTN_FIELD_IS_RADIO', 'PDF_BTN_FIELD_IS_RADIOS_IN_UNISON', 'PDF_CH_FIELD_IS_COMBO', 'PDF_CH_FIELD_IS_COMMIT_ON_SEL_CHANGE', 'PDF_CH_FIELD_IS_DO_NOT_SPELL_CHECK', 'PDF_CH_FIELD_IS_EDIT', 'PDF_CH_FIELD_IS_MULTI_SELECT', 'PDF_CH_FIELD_IS_SORT', 'PDF_CID_FONT_RESOURCE', 'PDF_CJK_FONT_RESOURCE', 'PDF_CLEAN_STRUCTURE_DROP', 'PDF_CLEAN_STRUCTURE_KEEP', 'PDF_DOCUMENT_EVENT_ALERT', 'PDF_DOCUMENT_EVENT_EXEC_MENU_ITEM', 'PDF_DOCUMENT_EVENT_LAUNCH_URL', 'PDF_DOCUMENT_EVENT_MAIL_DOC', 'PDF_DOCUMENT_EVENT_PRINT', 'PDF_DOCUMENT_EVENT_SUBMIT', 'PDF_ENCRYPT_AES_128', 'PDF_ENCRYPT_AES_256', 'PDF_ENCRYPT_KEEP', 'PDF_ENCRYPT_NONE', 'PDF_ENCRYPT_RC4_128', 'PDF_ENCRYPT_RC4_40', 'PDF_ENCRYPT_UNKNOWN', 'PDF_ENUM_FALSE', 'PDF_ENUM_LIMIT', 'PDF_ENUM_NULL', 'PDF_ENUM_TRUE', 'PDF_FALSE', 'PDF_FD_ALL_CAP', 'PDF_FD_FIXED_PITCH', 'PDF_FD_FORCE_BOLD', 
'PDF_FD_ITALIC', 'PDF_FD_NONSYMBOLIC', 'PDF_FD_SCRIPT', 'PDF_FD_SERIF', 'PDF_FD_SMALL_CAP', 'PDF_FD_SYMBOLIC', 'PDF_FIELD_IS_NO_EXPORT', 'PDF_FIELD_IS_READ_ONLY', 'PDF_FIELD_IS_REQUIRED', 'PDF_LAYER_UI_CHECKBOX', 'PDF_LAYER_UI_LABEL', 'PDF_LAYER_UI_RADIOBOX', 'PDF_LEXBUF_LARGE', 'PDF_LEXBUF_SMALL', 'PDF_MAX_GEN_NUMBER', 'PDF_MAX_OBJECT_NUMBER', 'PDF_MRANGE_CAP', 'PDF_NAME', 'PDF_NOT_ZUGFERD', 'PDF_NULL', 'PDF_NUM_TOKENS', 'PDF_OC_OFF', 'PDF_OC_ON', 'PDF_OC_TOGGLE', 'PDF_PAGE_LABEL_ALPHA_LC', 'PDF_PAGE_LABEL_ALPHA_UC', 'PDF_PAGE_LABEL_DECIMAL', 'PDF_PAGE_LABEL_NONE', 'PDF_PAGE_LABEL_ROMAN_LC', 'PDF_PAGE_LABEL_ROMAN_UC', 'PDF_PERM_ACCESSIBILITY', 'PDF_PERM_ANNOTATE', 'PDF_PERM_ASSEMBLE', 'PDF_PERM_COPY', 'PDF_PERM_FORM', 'PDF_PERM_MODIFY', 'PDF_PERM_PRINT', 'PDF_PERM_PRINT_HQ', 'PDF_PROCESSOR_REQUIRES_DECODED_IMAGES', 'PDF_REDACT_IMAGE_NONE', 'PDF_REDACT_IMAGE_PIXELS', 'PDF_REDACT_IMAGE_REMOVE', 'PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE', 'PDF_REDACT_LINE_ART_NONE', 'PDF_REDACT_LINE_ART_REMOVE_IF_COVERED', 'PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED', 'PDF_REDACT_TEXT_NONE', 'PDF_REDACT_TEXT_REMOVE', 'PDF_SIGNATURE_DEFAULT_APPEARANCE', 'PDF_SIGNATURE_ERROR_DIGEST_FAILURE', 'PDF_SIGNATURE_ERROR_NOT_SIGNED', 'PDF_SIGNATURE_ERROR_NOT_TRUSTED', 'PDF_SIGNATURE_ERROR_NO_CERTIFICATE', 'PDF_SIGNATURE_ERROR_NO_SIGNATURES', 'PDF_SIGNATURE_ERROR_OKAY', 'PDF_SIGNATURE_ERROR_SELF_SIGNED', 'PDF_SIGNATURE_ERROR_SELF_SIGNED_IN_CHAIN', 'PDF_SIGNATURE_ERROR_UNKNOWN', 'PDF_SIGNATURE_SHOW_DATE', 'PDF_SIGNATURE_SHOW_DN', 'PDF_SIGNATURE_SHOW_GRAPHIC_NAME', 'PDF_SIGNATURE_SHOW_LABELS', 'PDF_SIGNATURE_SHOW_LOGO', 'PDF_SIGNATURE_SHOW_TEXT_NAME', 'PDF_SIMPLE_ENCODING_CYRILLIC', 'PDF_SIMPLE_ENCODING_GREEK', 'PDF_SIMPLE_ENCODING_LATIN', 'PDF_SIMPLE_FONT_RESOURCE', 'PDF_TOK_CLOSE_ARRAY', 'PDF_TOK_CLOSE_BRACE', 'PDF_TOK_CLOSE_DICT', 'PDF_TOK_ENDOBJ', 'PDF_TOK_ENDSTREAM', 'PDF_TOK_EOF', 'PDF_TOK_ERROR', 'PDF_TOK_FALSE', 'PDF_TOK_INT', 'PDF_TOK_KEYWORD', 'PDF_TOK_NAME', 'PDF_TOK_NEWOBJ', 'PDF_TOK_NULL', 
'PDF_TOK_OBJ', 'PDF_TOK_OPEN_ARRAY', 'PDF_TOK_OPEN_BRACE', 'PDF_TOK_OPEN_DICT', 'PDF_TOK_R', 'PDF_TOK_REAL', 'PDF_TOK_STARTXREF', 'PDF_TOK_STREAM', 'PDF_TOK_STRING', 'PDF_TOK_TRAILER', 'PDF_TOK_TRUE', 'PDF_TOK_XREF', 'PDF_TRUE', 'PDF_TX_FIELD_IS_COMB', 'PDF_TX_FIELD_IS_DO_NOT_SCROLL', 'PDF_TX_FIELD_IS_DO_NOT_SPELL_CHECK', 'PDF_TX_FIELD_IS_FILE_SELECT', 'PDF_TX_FIELD_IS_MULTILINE', 'PDF_TX_FIELD_IS_PASSWORD', 'PDF_TX_FIELD_IS_RICH_TEXT', 'PDF_WIDGET_TX_FORMAT_DATE', 'PDF_WIDGET_TX_FORMAT_NONE', 'PDF_WIDGET_TX_FORMAT_NUMBER', 'PDF_WIDGET_TX_FORMAT_SPECIAL', 'PDF_WIDGET_TX_FORMAT_TIME', 'PDF_WIDGET_TYPE_BUTTON', 'PDF_WIDGET_TYPE_CHECKBOX', 'PDF_WIDGET_TYPE_COMBOBOX', 'PDF_WIDGET_TYPE_LISTBOX', 'PDF_WIDGET_TYPE_RADIOBUTTON', 'PDF_WIDGET_TYPE_SIGNATURE', 'PDF_WIDGET_TYPE_TEXT', 'PDF_WIDGET_TYPE_UNKNOWN', 'PDF_ZUGFERD_BASIC', 'PDF_ZUGFERD_BASIC_WL', 'PDF_ZUGFERD_COMFORT', 'PDF_ZUGFERD_EXTENDED', 'PDF_ZUGFERD_MINIMUM', 'PDF_ZUGFERD_UNKNOWN', 'PDF_ZUGFERD_XRECHNUNG', 'Page', 'Page__add_text_marker', 'Pixmap', 'Point', 'PyExc_ValueError', 'PySequence_Check', 'PySequence_Size', 'PyUnicode_DecodeRawUnicodeEscape', 'Quad', 'RAISEPY', 'Rect', 'STAMP_Approved', 'STAMP_AsIs', 'STAMP_Confidential', 'STAMP_Departmental', 'STAMP_Draft', 'STAMP_Experimental', 'STAMP_Expired', 'STAMP_Final', 'STAMP_ForComment', 'STAMP_ForPublicRelease', 'STAMP_NotApproved', 'STAMP_NotForPublicRelease', 'STAMP_Sold', 'STAMP_TopSecret', 'Shape', 'SigFlag_AppendOnly', 'SigFlag_SignaturesExist', 'Story', 'TEXTFLAGS_BLOCKS', 'TEXTFLAGS_DICT', 'TEXTFLAGS_HTML', 'TEXTFLAGS_RAWDICT', 'TEXTFLAGS_SEARCH', 'TEXTFLAGS_TEXT', 'TEXTFLAGS_WORDS', 'TEXTFLAGS_XHTML', 'TEXTFLAGS_XML', 'TEXT_ACCURATE_ASCENDERS', 'TEXT_ACCURATE_BBOXES', 'TEXT_ACCURATE_SIDE_BEARINGS', 'TEXT_ALIGN_CENTER', 'TEXT_ALIGN_JUSTIFY', 'TEXT_ALIGN_LEFT', 'TEXT_ALIGN_RIGHT', 'TEXT_CID_FOR_UNKNOWN_UNICODE', 'TEXT_CLIP_RECT', 'TEXT_COLLECT_STRUCTURE', 'TEXT_COLLECT_STYLES', 'TEXT_COLLECT_VECTORS', 'TEXT_DEHYPHENATE', 'TEXT_ENCODING_CYRILLIC', 
'TEXT_ENCODING_GREEK', 'TEXT_ENCODING_LATIN', 'TEXT_FONT_BOLD', 'TEXT_FONT_ITALIC', 'TEXT_FONT_MONOSPACED', 'TEXT_FONT_SERIFED', 'TEXT_FONT_SUPERSCRIPT', 'TEXT_IGNORE_ACTUALTEXT', 'TEXT_INHIBIT_SPACES', 'TEXT_MEDIABOX_CLIP', 'TEXT_OUTPUT_HTML', 'TEXT_OUTPUT_JSON', 'TEXT_OUTPUT_TEXT', 'TEXT_OUTPUT_XHTML', 'TEXT_OUTPUT_XML', 'TEXT_PARAGRAPH_BREAK', 'TEXT_PRESERVE_IMAGES', 'TEXT_PRESERVE_LIGATURES', 'TEXT_PRESERVE_SPANS', 'TEXT_PRESERVE_WHITESPACE', 'TEXT_SEGMENT', 'TEXT_STEXT_SEGMENT', 'TEXT_TABLE_HUNT', 'TEXT_USE_CID_FOR_UNKNOWN_UNICODE', 'TEXT_USE_GID_FOR_UNKNOWN_UNICODE', 'TOOLS', 'TOOLS_JM_UNIQUE_ID', 'TextPage', 'TextWriter', 'UCDN_SCRIPT_ADLAM', 'UCDN_SCRIPT_AHOM', 'UCDN_SCRIPT_ANATOLIAN_HIEROGLYPHS', 'UCDN_SCRIPT_ARABIC', 'UCDN_SCRIPT_ARMENIAN', 'UCDN_SCRIPT_AVESTAN', 'UCDN_SCRIPT_BALINESE', 'UCDN_SCRIPT_BAMUM', 'UCDN_SCRIPT_BASSA_VAH', 'UCDN_SCRIPT_BATAK', 'UCDN_SCRIPT_BENGALI', 'UCDN_SCRIPT_BHAIKSUKI', 'UCDN_SCRIPT_BOPOMOFO', 'UCDN_SCRIPT_BRAHMI', 'UCDN_SCRIPT_BRAILLE', 'UCDN_SCRIPT_BUGINESE', 'UCDN_SCRIPT_BUHID', 'UCDN_SCRIPT_CANADIAN_ABORIGINAL', 'UCDN_SCRIPT_CARIAN', 'UCDN_SCRIPT_CAUCASIAN_ALBANIAN', 'UCDN_SCRIPT_CHAKMA', 'UCDN_SCRIPT_CHAM', 'UCDN_SCRIPT_CHEROKEE', 'UCDN_SCRIPT_CHORASMIAN', 'UCDN_SCRIPT_COMMON', 'UCDN_SCRIPT_COPTIC', 'UCDN_SCRIPT_CUNEIFORM', 'UCDN_SCRIPT_CYPRIOT', 'UCDN_SCRIPT_CYPRO_MINOAN', 'UCDN_SCRIPT_CYRILLIC', 'UCDN_SCRIPT_DESERET', 'UCDN_SCRIPT_DEVANAGARI', 'UCDN_SCRIPT_DIVES_AKURU', 'UCDN_SCRIPT_DOGRA', 'UCDN_SCRIPT_DUPLOYAN', 'UCDN_SCRIPT_EGYPTIAN_HIEROGLYPHS', 'UCDN_SCRIPT_ELBASAN', 'UCDN_SCRIPT_ELYMAIC', 'UCDN_SCRIPT_ETHIOPIC', 'UCDN_SCRIPT_GARAY', 'UCDN_SCRIPT_GEORGIAN', 'UCDN_SCRIPT_GLAGOLITIC', 'UCDN_SCRIPT_GOTHIC', 'UCDN_SCRIPT_GRANTHA', 'UCDN_SCRIPT_GREEK', 'UCDN_SCRIPT_GUJARATI', 'UCDN_SCRIPT_GUNJALA_GONDI', 'UCDN_SCRIPT_GURMUKHI', 'UCDN_SCRIPT_GURUNG_KHEMA', 'UCDN_SCRIPT_HAN', 'UCDN_SCRIPT_HANGUL', 'UCDN_SCRIPT_HANIFI_ROHINGYA', 'UCDN_SCRIPT_HANUNOO', 'UCDN_SCRIPT_HATRAN', 'UCDN_SCRIPT_HEBREW', 'UCDN_SCRIPT_HIRAGANA', 
'UCDN_SCRIPT_IMPERIAL_ARAMAIC', 'UCDN_SCRIPT_INHERITED', 'UCDN_SCRIPT_INSCRIPTIONAL_PAHLAVI', 'UCDN_SCRIPT_INSCRIPTIONAL_PARTHIAN', 'UCDN_SCRIPT_JAVANESE', 'UCDN_SCRIPT_KAITHI', 'UCDN_SCRIPT_KANNADA', 'UCDN_SCRIPT_KATAKANA', 'UCDN_SCRIPT_KAWI', 'UCDN_SCRIPT_KAYAH_LI', 'UCDN_SCRIPT_KHAROSHTHI', 'UCDN_SCRIPT_KHITAN_SMALL_SCRIPT', 'UCDN_SCRIPT_KHMER', 'UCDN_SCRIPT_KHOJKI', 'UCDN_SCRIPT_KHUDAWADI', 'UCDN_SCRIPT_KIRAT_RAI', 'UCDN_SCRIPT_LAO', 'UCDN_SCRIPT_LATIN', 'UCDN_SCRIPT_LEPCHA', 'UCDN_SCRIPT_LIMBU', 'UCDN_SCRIPT_LINEAR_A', 'UCDN_SCRIPT_LINEAR_B', 'UCDN_SCRIPT_LISU', 'UCDN_SCRIPT_LYCIAN', 'UCDN_SCRIPT_LYDIAN', 'UCDN_SCRIPT_MAHAJANI', 'UCDN_SCRIPT_MAKASAR', 'UCDN_SCRIPT_MALAYALAM', 'UCDN_SCRIPT_MANDAIC', 'UCDN_SCRIPT_MANICHAEAN', 'UCDN_SCRIPT_MARCHEN', 'UCDN_SCRIPT_MASARAM_GONDI', 'UCDN_SCRIPT_MEDEFAIDRIN', 'UCDN_SCRIPT_MEETEI_MAYEK', 'UCDN_SCRIPT_MENDE_KIKAKUI', 'UCDN_SCRIPT_MEROITIC_CURSIVE', 'UCDN_SCRIPT_MEROITIC_HIEROGLYPHS', 'UCDN_SCRIPT_MIAO', 'UCDN_SCRIPT_MODI', 'UCDN_SCRIPT_MONGOLIAN', 'UCDN_SCRIPT_MRO', 'UCDN_SCRIPT_MULTANI', 'UCDN_SCRIPT_MYANMAR', 'UCDN_SCRIPT_NABATAEAN', 'UCDN_SCRIPT_NAG_MUNDARI', 'UCDN_SCRIPT_NANDINAGARI', 'UCDN_SCRIPT_NEWA', 'UCDN_SCRIPT_NEW_TAI_LUE', 'UCDN_SCRIPT_NKO', 'UCDN_SCRIPT_NUSHU', 'UCDN_SCRIPT_NYIAKENG_PUACHUE_HMONG', 'UCDN_SCRIPT_OGHAM', 'UCDN_SCRIPT_OLD_HUNGARIAN', 'UCDN_SCRIPT_OLD_ITALIC', 'UCDN_SCRIPT_OLD_NORTH_ARABIAN', 'UCDN_SCRIPT_OLD_PERMIC', 'UCDN_SCRIPT_OLD_PERSIAN', 'UCDN_SCRIPT_OLD_SOGDIAN', 'UCDN_SCRIPT_OLD_SOUTH_ARABIAN', 'UCDN_SCRIPT_OLD_TURKIC', 'UCDN_SCRIPT_OLD_UYGHUR', 'UCDN_SCRIPT_OL_CHIKI', 'UCDN_SCRIPT_OL_ONAL', 'UCDN_SCRIPT_ORIYA', 'UCDN_SCRIPT_OSAGE', 'UCDN_SCRIPT_OSMANYA', 'UCDN_SCRIPT_PAHAWH_HMONG', 'UCDN_SCRIPT_PALMYRENE', 'UCDN_SCRIPT_PAU_CIN_HAU', 'UCDN_SCRIPT_PHAGS_PA', 'UCDN_SCRIPT_PHOENICIAN', 'UCDN_SCRIPT_PSALTER_PAHLAVI', 'UCDN_SCRIPT_REJANG', 'UCDN_SCRIPT_RUNIC', 'UCDN_SCRIPT_SAMARITAN', 'UCDN_SCRIPT_SAURASHTRA', 'UCDN_SCRIPT_SHARADA', 'UCDN_SCRIPT_SHAVIAN', 'UCDN_SCRIPT_SIDDHAM', 
'UCDN_SCRIPT_SIGNWRITING', 'UCDN_SCRIPT_SINHALA', 'UCDN_SCRIPT_SOGDIAN', 'UCDN_SCRIPT_SORA_SOMPENG', 'UCDN_SCRIPT_SOYOMBO', 'UCDN_SCRIPT_SUNDANESE', 'UCDN_SCRIPT_SUNUWAR', 'UCDN_SCRIPT_SYLOTI_NAGRI', 'UCDN_SCRIPT_SYRIAC', 'UCDN_SCRIPT_TAGALOG', 'UCDN_SCRIPT_TAGBANWA', 'UCDN_SCRIPT_TAI_LE', 'UCDN_SCRIPT_TAI_THAM', 'UCDN_SCRIPT_TAI_VIET', 'UCDN_SCRIPT_TAKRI', 'UCDN_SCRIPT_TAMIL', 'UCDN_SCRIPT_TANGSA', 'UCDN_SCRIPT_TANGUT', 'UCDN_SCRIPT_TELUGU', 'UCDN_SCRIPT_THAANA', 'UCDN_SCRIPT_THAI', 'UCDN_SCRIPT_TIBETAN', 'UCDN_SCRIPT_TIFINAGH', 'UCDN_SCRIPT_TIRHUTA', 'UCDN_SCRIPT_TODHRI', 'UCDN_SCRIPT_TOTO', 'UCDN_SCRIPT_TULU_TIGALARI', 'UCDN_SCRIPT_UGARITIC', 'UCDN_SCRIPT_UNKNOWN', 'UCDN_SCRIPT_VAI', 'UCDN_SCRIPT_VITHKUQI', 'UCDN_SCRIPT_WANCHO', 'UCDN_SCRIPT_WARANG_CITI', 'UCDN_SCRIPT_YEZIDI', 'UCDN_SCRIPT_YI', 'UCDN_SCRIPT_ZANABAZAR_SQUARE', 'UpdateFontInfo', 'VersionBind', 'VersionDate', 'VersionFitz', 'Walker', 'Widget', 'Xml', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_as_fz_document', '_as_fz_page', '_as_pdf_document', '_as_pdf_page', '_g_out_message', '_globals', '_log_items', '_log_items_active', '_log_items_clear', 'annot_postprocess', 'annot_preprocess', 'annot_skel', 'apply_pages', 'args_match', 'atexit', 'b', 'binascii', 'calc_image_matrix', 'canon', 'chartocanon', 'collections', 'colors_pdf_dict', 'colors_wx_list', 'compute_scissor', 'csCMYK', 'csGRAY', 'csRGB', 'css_for_pymupdf_font', 'dest_is_valid', 'dest_is_valid_page', 'detect_super_script', 'dictkey_align', 'dictkey_asc', 'dictkey_bbox', 'dictkey_bidi', 'dictkey_blocks', 'dictkey_bpc', 'dictkey_c', 'dictkey_char_flags', 'dictkey_chars', 'dictkey_color', 'dictkey_colorspace', 'dictkey_content', 'dictkey_creationDate', 'dictkey_cs_name', 'dictkey_da', 'dictkey_dashes', 'dictkey_desc', 'dictkey_descr', 'dictkey_dir', 'dictkey_effect', 'dictkey_ext', 'dictkey_filename', 'dictkey_fill', 'dictkey_flags', 'dictkey_font', 
'dictkey_glyph', 'dictkey_height', 'dictkey_id', 'dictkey_image', 'dictkey_items', 'dictkey_length', 'dictkey_lines', 'dictkey_matrix', 'dictkey_modDate', 'dictkey_name', 'dictkey_number', 'dictkey_origin', 'dictkey_rect', 'dictkey_size', 'dictkey_smask', 'dictkey_spans', 'dictkey_stroke', 'dictkey_style', 'dictkey_subject', 'dictkey_text', 'dictkey_title', 'dictkey_type', 'dictkey_ufilename', 'dictkey_width', 'dictkey_wmode', 'dictkey_xref', 'dictkey_xres', 'dictkey_yres', 'dir_str', 'exception_info', 'extra', 'extra_FzDocument_insert_pdf', 'f', 'find_string', 'find_tables', 'fitz_fontdescriptors', 'format_g', 'g', 'g_exceptions_verbose', 'g_img_info', 'g_use_extra', 'getTJstr', 'get_env_bool', 'get_env_int', 'get_highlight_selection', 'get_pdf_now', 'get_pdf_str', 'get_tessdata', 'get_text', 'get_text_length', 'glob', 'glyph_name_to_unicode', 'hdist', 'image_profile', 'inspect', 'io', 'jm_append_merge', 'jm_bbox_add_rect', 'jm_bbox_fill_image', 'jm_bbox_fill_image_mask', 'jm_bbox_fill_path', 'jm_bbox_fill_shade', 'jm_bbox_fill_text', 'jm_bbox_ignore_text', 'jm_bbox_stroke_path', 'jm_bbox_stroke_text', 'jm_checkquad', 'jm_checkrect', 'jm_dev_linewidth', 'jm_increase_seqno', 'jm_lineart_begin_group', 'jm_lineart_begin_layer', 'jm_lineart_clip_image_mask', 'jm_lineart_clip_path', 'jm_lineart_clip_stroke_path', 'jm_lineart_clip_stroke_text', 'jm_lineart_clip_text', 'jm_lineart_color', 'jm_lineart_drop_device', 'jm_lineart_end_group', 'jm_lineart_end_layer', 'jm_lineart_fill_path', 'jm_lineart_fill_text', 'jm_lineart_ignore_text', 'jm_lineart_path', 'jm_lineart_pop_clip', 'jm_lineart_stroke_path', 'jm_lineart_stroke_text', 'jm_trace_text', 'jm_trace_text_span', 'linkDest', 'log', 'make_escape', 'make_story_elpos', 'make_table', 'match_string', 'math', 'matrix_like', 'message', 'message_warning', 'mupdf', 'mupdf_cppyy', 'mupdf_location', 'mupdf_version', 'mupdf_version_tuple', 'name', 'on_highlight_char', 'open', 'os', 'page_merge', 'paper_rect', 'paper_size', 
'paper_sizes', 'pathlib', 'pdf_lookup_page_loc', 'pdfcolor', 'pdfobj_string', 'planish_line', 'point_like', 'pymupdf', 'pymupdf_date', 'pymupdf_git_branch', 'pymupdf_git_diff', 'pymupdf_git_sha', 'pymupdf_version', 'pymupdf_version_tuple', 'quad_like', 'r', 're', 'recover_bbox_quad', 'recover_char_quad', 'recover_line_quad', 'recover_quad', 'recover_span_quad', 'rect_like', 'repair_mono_font', 'restore_aliases', 'sRGB_to_pdf', 'sRGB_to_rgb', 'set_log', 'set_messages', 'string', 'string_in_names_list', 'strip_outline', 'strip_outlines', 'swig_version', 'swig_version_tuple', 'symbol_glyphs', 'sys', 'table', 'tarfile', 'time', 'trace_device_CLIP_PATH', 'trace_device_CLIP_STROKE_PATH', 'trace_device_FILL_PATH', 'trace_device_STROKE_PATH', 'typing', 'unicode_to_glyph_name', 'util_concat_matrix', 'util_ensure_widget_calc', 'util_hor_matrix', 'util_include_point_in_rect', 'util_intersect_rect', 'util_invert_matrix', 'util_is_point_in_rect', 'util_make_irect', 'util_make_rect', 'util_measure_string', 'util_point_in_quad', 'util_round_rect', 'util_sine_between', 'util_transform_point', 'util_transform_rect', 'util_union_rect', 'utils', 'vdist', 'version', 'warnings', 'weakref', 'zapf_glyphs', 'zipfile']
```

In [None]:
import arxiv
import fitz                                                 # PyMuPDF
import requests
from langchain_core.documents import Document

# Topic to search for
query_topic = "ChatGPT"

# Maximum number of papers to load
max_docs_to_load = 5                                        # fetch up to 5 papers per run

# arxiv search settings
search = arxiv.Search(
    query=query_topic,
    max_results=max_docs_to_load,
    sort_by=arxiv.SortCriterion.Relevance
)

def lazy_load_arxiv_docs():
    """
    Load papers one at a time via arxiv.Client().results() and yield each
    one as a Document.
    """
    print(f"'{query_topic}' 주제로 논문을 검색 중입니다...")

    for i, result in enumerate(arxiv.Client().results(search)):
        try:
            pdf_url = result.pdf_url
            print(f"\n{i+1}/{max_docs_to_load} 논문 다운로드 및 파싱 중: {result.title}")

            # Download the paper PDF (the timeout keeps a stalled request from hanging the loop)
            response = requests.get(pdf_url, timeout=30)
            response.raise_for_status()

            # Parse the downloaded PDF bytes in memory
            with fitz.open(stream=response.content, filetype="pdf") as doc_file:
                text_content = "".join(page.get_text() for page in doc_file)

                # Store the extracted text and metadata in a Document object
                doc = Document(
                    page_content=text_content,
                    metadata={
                        "Title": result.title,
                        "Authors": ", ".join(author.name for author in result.authors),
                        "Published": result.published,
                        "pdf_url": pdf_url,
                    }
                )

                # Yield documents one at a time (this function is a generator)
                yield doc

        except requests.exceptions.HTTPError as errh:
            # Skip this paper and move on when the download fails
            print(f"HTTP 오류: {errh}")
            continue
        except Exception as e:
            # Skip this paper on any other error (e.g. a PDF that cannot be parsed)
            print(f"논문 처리 중 오류 발생: {e}")
            continue

# Append documents to the 'docs' list one at a time (similar to lazy_load())
docs = []
for doc in lazy_load_arxiv_docs():
    docs.append(doc)

# Print the final list of loaded documents
if docs:
    print("\n=== 최종 로드된 문서 목록 ===")
    for i, doc in enumerate(docs):
        print(f"--- 논문 {i+1} ---")
        print(f"제목: {doc.metadata.get('Title', '제목 없음')}")
        print(f"요약 (일부):\n{doc.page_content[:300]}...\n")
        print("-" * 20)

else:
    print("\n조건에 맞는 논문을 찾지 못했습니다.")

<small>

* Cell output (7.7s)

* The 5 papers are fetched sequentially, one at a time

    ```markdown
    'ChatGPT' 주제로 논문을 검색 중입니다...

    1/5 논문 다운로드 및 파싱 중: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT

    2/5 논문 다운로드 및 파싱 중: Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text

    3/5 논문 다운로드 및 파싱 중: When ChatGPT is gone: Creativity reverts and homogeneity persists

    4/5 논문 다운로드 및 파싱 중: Pros and Cons! Evaluating ChatGPT on Software Vulnerability

    5/5 논문 다운로드 및 파싱 중: Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub

    === 최종 로드된 문서 목록 ===
    --- 논문 1 ---
    제목: In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT
    요약 (일부):
    In ChatGPT We Trust? Measuring and Characterizing
    the Reliability of ChatGPT
    Xinyue Shen1 Zeyuan Chen2 Michael Backes1 Yang Zhang1
    1CISPA Helmholtz Center for Information Security
    2Individual Researcher
    Abstract
    The way users acquire information is undergoing a paradigm
    shift with the advent of Chat...

    --------------------
    --- 논문 2 ---
    제목: Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect ChatGPT-Generated Text
    요약 (일부):
    IS CHATGPT INVOLVED IN TEXTS? MEASURE THE POLISH
    RATIO TO DETECT CHATGPT-GENERATED TEXT
    Lingyi Yang1, Feng Jiang 1 2 3∗, Haizhou Li 1 2
    1 School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
    2 Shenzhen Research Institute of Big Data, China
    3 School of Information Science and ...

    --------------------
    --- 논문 3 ---
    제목: When ChatGPT is gone: Creativity reverts and homogeneity persists
    요약 (일부):
    
    
    
    
    
    1 
    When ChatGPT is gone: Creativity reverts and homogeneity persists 
    Qinghan Liu 
    Yiyong Zhou 
    Jihao Huang 
    Guiquan Li* 
    Peking U., School of 
    Psychological and 
    Cognitive Sciences 
    Peking U., School of 
    Psychological and 
    Cognitive Sciences 
    Beijing Yuxin 
    Technology Company 
    Peking U., ...

    --------------------
    --- 논문 4 ---
    제목: Pros and Cons! Evaluating ChatGPT on Software Vulnerability
    요약 (일부):
    Pros and Cons! Evaluating ChatGPT on Software
    Vulnerability
    XIN YIN, Zhejiang University, China
    This paper proposes a pipeline for quantitatively evaluating interactive LLMs such as ChatGPT using publicly
    available dataset. We carry out an extensive technical evaluation of ChatGPT using Big-Vul cove...

    --------------------
    --- 논문 5 ---
    제목: Unveiling the Role of ChatGPT in Software Development: Insights from Developer-ChatGPT Interactions on GitHub
    요약 (일부):
    Unveiling the Role of ChatGPT in Software Development:
    Insights from Developer-ChatGPT Interactions on GitHub
    RUIYIN LI, School of Computer Science, Wuhan University, China
    PENG LIANG, School of Computer Science, Wuhan University, China
    YIFEI WANG, School of Computer Science, Wuhan University, China...

    --------------------
    ```
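
The one-at-a-time behavior above comes from the generator: nothing is downloaded until the caller asks for the next item. The following is a minimal sketch of that idea, not the notebook's loader. `lazy_titles`, the `fetched` log, and the `paper-N` names are hypothetical placeholders, and the `log.append` call stands in for the download-and-parse step so the example runs without network access.

```python
from itertools import islice
from typing import Iterator


def lazy_titles(titles: list[str], log: list[str]) -> Iterator[str]:
    """Yield titles one at a time, recording each simulated fetch in `log`."""
    for title in titles:
        log.append(title)  # stands in for the download-and-parse step
        yield title


# Hypothetical placeholder names, not real arXiv results
papers = ["paper-1", "paper-2", "paper-3", "paper-4", "paper-5"]
fetched: list[str] = []

# Consume only the first two items; the remaining three are never "fetched"
first_two = list(islice(lazy_titles(papers, fetched), 2))

print(first_two)
print(fetched)
```

Because the consumer stops after two items, the generator body never runs for the other three. With a real loader this would mean those PDFs are not downloaded, which is the main reason to prefer a lazy pattern when only some of the results are needed.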

---

* *next: `DirectoryLoader`*

---