## Assignment summary
### [1] Use a library that loads PDFs

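### 2. Derive results with a customized prompt
#### i) rlm/rag-prompt limits the answer to three sentences by default
#### ii) custom prompt
##### - Modified to include all award-winning entries
##### - Summarize each award-winning entry in one sentence
##### - Raised the limit to a maximum of 8 sentences
##### - Write as one continuous paragraph
##### - Summarize in Korean, but keep project names in English when they are English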

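### [2] custom_prompt settings
#### 1. Set the role to an expert in summarizing papers
#### 2. Content to include in the summary
##### 1) Significance of the paper
##### 2) Method or approach of the paper
##### 3) Key findings and results
##### 4) Insights of the paper
#### 3. Do not make up facts that are not in the context
#### 4. Do not include references or citations
#### 5. Write in clear, academic language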

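### [3] Summary result
#### 1) Describes a model that introduces RAG
#### 2) method: generates answers using a retrieval mechanism
#### 3) result: RAG performs strongly on reading comprehension benchmarks
#### 4) insight: suggests that natural language processing can be done well without a large increase in complexity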

The paper presents a novel approach to multi-paragraph reading comprehension by introducing the RAG (Retrieval-Augmented Generation) model, which effectively combines parametric and non-parametric memory components to improve performance in tasks requiring contextual understanding. The proposed methodology utilizes retrieval mechanisms to enhance the generation of responses based on external knowledge sources, allowing for better contextual relevance and accuracy. The key findings demonstrate that RAG achieves competitive results in reading comprehension benchmarks, outperforming traditional models without needing complex pipeline architectures or extensive retrieval supervision. Notably, the work highlights the potential of integrating retrieval methods into generative models, which can lead to advances in various natural language processing applications by leveraging external knowledge efficiently without significant increases in complexity.


## Import the required libraries

In [3]:
import os
from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langsmith import Client

## Load environment variables and set the API keys

In [5]:
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
langsmith_api_key = os.getenv("LANGSMITH_API_KEY")
langsmith_endpoint = os.getenv("LANGSMITH_ENDPOINT")

os.environ["USER_AGENT"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
os.environ["LANGCHAIN_API_KEY"] = langsmith_api_key
os.environ["LANGCHAIN_ENDPOINT"] = langsmith_endpoint
os.environ["LANGSMITH_TRACING"] = "true"

## Model setup: gpt-4o-mini

In [7]:
llm = ChatOpenAI(model="gpt-4o-mini", api_key=api_key)
client = Client(api_key=langsmith_api_key)

## Load the paper file

### PDFPlumberLoader
 #### file_path
  - Path of the file to load

In [19]:
loader = PDFPlumberLoader(
    file_path="./2005.11401.pdf",
)
docs = loader.load()

CropBox missing from /Page, defaulting to MediaBox


In [21]:
docs

[Document(metadata={'source': './2005.11401.pdf', 'file_path': './2005.11401.pdf', 'page': 0, 'total_pages': 19, 'Author': '', 'CreationDate': 'D:20210413004838Z', 'Creator': 'LaTeX with hyperref', 'Keywords': '', 'ModDate': 'D:20210413004838Z', 'PTEX.Fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'Producer': 'pdfTeX-1.40.21', 'Subject': '', 'Title': '', 'Trapped': 'False'}, page_content='Retrieval-Augmented Generation for\nKnowledge-Intensive NLP Tasks\nPatrickLewis†‡,EthanPerez(cid:63),\nAleksandraPiktus†,FabioPetroni†,VladimirKarpukhin†,NamanGoyal†,HeinrichKüttler†,\nMikeLewis†,Wen-tauYih†,TimRocktäschel†‡,SebastianRiedel†‡,DouweKiela†\n†FacebookAIResearch;‡UniversityCollegeLondon;(cid:63)NewYorkUniversity;\nplewis@fb.com\nAbstract\nLargepre-trainedlanguagemodelshavebeenshowntostorefactualknowledge\nintheirparameters,andachievestate-of-the-artresultswhenfine-tunedondown-\nstreamNLPtasks. However,theirabilitytoaccessandpreciselym

## Extract text from the knowledge source

### Set chunk size to 1000 and overlap to 200
#### chunk_overlap
  - Adjacent chunks overlap so that context is preserved across chunk boundaries
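
To make the overlap concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample input, small enough to inspect by eye.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.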

### Vector store
#### - Create the vector store with Chroma
#### - The vector store holds pairs of text and its embedding
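
The "pair of text and embedding" idea can be sketched without Chroma or the OpenAI API. The block below is only an illustration: `toy_embed` is a hypothetical stand-in that counts words from a tiny made-up vocabulary, whereas `OpenAIEmbeddings` returns dense learned vectors, and the plain list `store` stands in for the real vector store.

```python
import math

# Hypothetical toy vocabulary; a real embedding model has no such fixed word list.
VOCAB = ["retrieval", "generation", "memory", "benchmark"]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: count how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


texts = [
    "retrieval augments generation",
    "parametric memory and non-parametric memory",
    "benchmark results",
]

# The store is a list of (text, embedding) pairs.
store = [(text, toy_embed(text)) for text in texts]

# A query is embedded the same way and compared against every stored vector.
query_vec = toy_embed("how does retrieval help generation")
best_text, _ = max(store, key=lambda pair: cosine(query_vec, pair[1]))
print(best_text)
```

The text is kept next to its vector because the similarity search runs on the vectors, while the text is what gets passed on to the prompt.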

In [23]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # adjust to a suitable size as needed
    chunk_overlap=200
)
split_docs = text_splitter.split_documents(docs)

In [35]:
vectorstore = Chroma.from_documents(
    documents=split_docs, 
    embedding=OpenAIEmbeddings(api_key=api_key),
)

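## Code that retrieves text whose embedding is similar to the question

### Create a retriever from the vector store
#### Setting top k
  - Determines how many similar documents (chunks) the search returns
  - The default is 4

##### What increasing k means
  - With a small k, when a document is split into many chunks, only the earlier part may be retrieved; a larger k helps prevent this
  - With a small k, the model may see only part of the document and give a biased answer; a larger k makes this less likely
  - If k is too large, irrelevant information may be included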
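
The trade-off can be sketched with a hand-written ranking. The chunk labels and similarity scores below are invented for illustration only and were not measured from this notebook; `top_k` is a hypothetical helper, not a LangChain function.

```python
# Hypothetical (chunk label, similarity score) pairs, as a retriever might rank them.
scored_chunks = [
    ("abstract", 0.91),
    ("method", 0.88),
    ("results", 0.84),
    ("discussion", 0.80),
    ("references page 1", 0.42),
    ("references page 2", 0.40),
]


def top_k(scored: list[tuple[str, float]], k: int) -> list[str]:
    """Return the labels of the k highest-scoring chunks."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in ranked[:k]]


small = top_k(scored_chunks, k=2)  # misses "results" and "discussion"
large = top_k(scored_chunks, k=6)  # also pulls in the low-scoring reference pages
print(small)
print(large)
```

In this toy ranking, `k=2` leaves out sections a summary would need, while `k=6` also includes weakly related chunks, which matches the two failure modes listed above.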

In [37]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})

In [41]:
from langchain import hub

user_msg = "이 논문 요약해줘."  # Korean for "Summarize this paper."
retrieved_docs = retriever.invoke(user_msg)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

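### custom_prompt settings
#### 1. Set the role to an expert in summarizing papers
#### 2. Content to include in the summary
##### 1) Significance of the paper
##### 2) Method or approach of the paper
##### 3) Key findings and results
##### 4) Insights of the paper
#### 3. Do not make up facts that are not in the context
#### 4. Do not include references or citations
#### 5. Write in clear, academic language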

In [43]:
from langchain.prompts import PromptTemplate

custom_prompt = PromptTemplate.from_template("""
You are a research assistant specialized in summarizing academic papers.

Given the following excerpts retrieved from a research paper, please generate a concise and comprehensive summary that captures:
- The main contribution of the paper
- The methodology or proposed approach
- The key findings or results
- Any unique insights or implications

Do not make up facts that are not present in the context. Do not include references or citations.

Use clear and formal academic language. Write in one paragraph. The summary should be understandable to someone familiar with NLP and machine learning.

Context:
{context}

Answer:
""")

In [45]:
# The template above declares only {context}; the "question" key is not referenced by it.
user_prompt = custom_prompt.invoke({"context": format_docs(retrieved_docs), "question": user_msg})
print(user_prompt)

text='\nYou are a research assistant specialized in summarizing academic papers.\n\nGiven the following excerpts retrieved from a research paper, please generate a concise and comprehensive summary that captures:\n- The main contribution of the paper\n- The methodology or proposed approach\n- The key findings or results\n- Any unique insights or implications\n\nDo not make up facts that are not present in the context. Do not include references or citations.\n\nUse clear and formal academic language. Write in one paragraph. The summary should be understandable to someone familiar with NLP and machine learning.\n\nContext:\n[7] ChristopherClarkandMattGardner. SimpleandEffectiveMulti-ParagraphReadingCompre-\nhension. arXiv:1710.10723[cs],October2017. URLhttp://arxiv.org/abs/1710.10723.\narXiv: 1710.10723.\n[8] JacobDevlin,Ming-WeiChang,KentonLee,andKristinaToutanova. BERT:Pre-trainingof\nDeepBidirectionalTransformersforLanguageUnderstanding. InProceedingsofthe2019Con-\nferenceoftheNorthAm
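
The call above supplies both `context` and `question`, while the template text contains only a `{context}` placeholder. A minimal sketch of checking this with Python's standard library is shown below. It uses plain `str.format` rather than LangChain's `PromptTemplate`, whose handling of extra keys may differ between versions, and the shortened `template` string and the helper `placeholder_names` are stand-ins made up for illustration.

```python
from string import Formatter


def placeholder_names(template: str) -> set[str]:
    """Collect the named placeholders that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


# Shortened stand-in for the notebook's prompt; only the placeholder matters here.
template = (
    "Summarize the excerpts below in one paragraph.\n\n"
    "Context:\n{context}\n\n"
    "Answer:\n"
)

# Hypothetical inputs mirroring the keys passed in the notebook.
supplied = {"context": "chunk A\n\nchunk B", "question": "Summarize this paper."}

names = placeholder_names(template)
unused = set(supplied) - names  # keys that never reach the prompt text
filled = template.format(**supplied)  # str.format ignores the extra keyword

print("placeholders:", sorted(names))
print("unused keys:", sorted(unused))
```

If the question should influence the answer, the template needs a matching placeholder; otherwise the question only affects which chunks are retrieved.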

In [47]:
response = llm.invoke(user_prompt)
print(response.content)

The paper presents a novel approach to multi-paragraph reading comprehension by introducing the RAG (Retrieval-Augmented Generation) model, which effectively combines parametric and non-parametric memory components to improve performance in tasks requiring contextual understanding. The proposed methodology utilizes retrieval mechanisms to enhance the generation of responses based on external knowledge sources, allowing for better contextual relevance and accuracy. The key findings demonstrate that RAG achieves competitive results in reading comprehension benchmarks, outperforming traditional models without needing complex pipeline architectures or extensive retrieval supervision. Notably, the work highlights the potential of integrating retrieval methods into generative models, which can lead to advances in various natural language processing applications by leveraging external knowledge efficiently without significant increases in complexity.
