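# Getting Started with RAG

Large language models (LLMs) offer powerful capabilities that support advanced use cases, but they struggle with issues such as factual inconsistency and hallucination. Retrieval-augmented generation (RAG) is a powerful approach for strengthening LLM capabilities and improving their reliability. RAG combines an LLM with external knowledge by enriching the prompt context with relevant information that helps the model perform the task.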
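
To make the idea concrete, here is a minimal, library-free sketch of the retrieve-then-augment pattern. It is only an illustration under simplifying assumptions: word-overlap scoring stands in for a real vector search, and `tokenize`, `retrieve`, and `build_augmented_prompt` are hypothetical helper names, not part of any library used in this tutorial.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=1):
    """Return up to k documents that share the most words with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    # Highest overlap first; sorted() is stable, so ties keep their input order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def build_augmented_prompt(question, context_docs):
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy knowledge base, written for this example only.
knowledge_base = [
    "Chroma is an open-source vector store.",
    "Mistral 7B is an open-source language model.",
    "Pandas is a data analysis library.",
]

question = "What is a vector store?"
prompt = build_augmented_prompt(question, retrieve(question, knowledge_base, k=1))
print(prompt)
```

The rest of the tutorial follows the same two steps, with an embedding-based vector store doing the retrieval and a hosted LLM consuming the augmented prompt.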

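This tutorial shows how to get started with RAG using a vector store and open-source LLMs. To demonstrate what RAG can do, the use case covers building a RAG system that suggests short, readable ML paper titles from the original ML paper titles. Paper titles can be too technical for a general audience, so using RAG to generate short titles based on previously created short titles can make research paper titles more accessible and useful for science communication such as newsletters and blogs.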

Before we start, let's install the libraries we will use:

In [15]:
%%capture
!pip install chromadb tqdm fireworks-ai python-dotenv pandas
!pip install sentence-transformers

Before continuing, you need to obtain a Fireworks API key to use the Mistral 7B model.

Check out this quick guide to get your Fireworks API key: https://readme.fireworks.ai/docs

In [16]:
import fireworks.client
import os
import dotenv
import chromadb
import json
from tqdm.auto import tqdm
import pandas as pd
import random

# you can set envs using Colab secrets
dotenv.load_dotenv()

fireworks.client.api_key = os.getenv("FIREWORKS_API_KEY")
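
`os.getenv` returns `None` when the variable is missing, which would only surface later as an authentication error. As a small optional safeguard, a helper like the one below fails early with a clearer message. `require_env` and `EXAMPLE_API_KEY` are hypothetical names used for illustration; they are not part of the Fireworks client.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file or export it before running."
        )
    return value


# Placeholder variable so the snippet runs anywhere; use FIREWORKS_API_KEY in practice.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```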

## Getting Started

Let's define a function that fetches completions from the Fireworks inference platform.

In [17]:
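def get_completion(prompt, model=None, max_tokens=50):
    """Return the text completion for `prompt` from a Fireworks-hosted model."""

    fw_model_dir = "accounts/fireworks/models/"

    # fall back to Llama 2 7B when no model name is given
    if model is None:
        model = fw_model_dir + "llama-v2-7b"
    else:
        model = fw_model_dir + model

    completion = fireworks.client.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=0  # greedy decoding for more reproducible outputs
    )

    return completion.choices[0].text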

Let's first try the function with a simple prompt:

In [18]:
get_completion("Hello, my name is")

' Katie and I am a 20 year old student at the University of Leeds. I am currently studying a BA in English Literature and Creative Writing. I have been working as a tutor for over 3 years now and I'

Now let's test with Mistral-7B-Instruct:

In [19]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("Hello, my name is", model=mistral_llm)

' [Your Name]. I am a [Your Profession/Occupation]. I am writing to [Purpose of Writing].\n\nI am writing to [Purpose of Writing] because [Reason for Writing]. I believe that ['

The Mistral 7B Instruct model must be prompted with the special instruction tokens `[INST] <instruction> [/INST]` to behave correctly. For more details on how to instruct Mistral 7B Instruct, see https://docs.mistral.ai/llm/mistral-instruct-v0.1

In [20]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("Tell me 2 jokes", model=mistral_llm)

".\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Did you hear about the mathematician who’s afraid of negative numbers? He will stop at nothing to avoid them."

In [21]:
mistral_llm = "mistral-7b-instruct-4k"

get_completion("[INST]Tell me 2 jokes[/INST]", model=mistral_llm)

" Sure, here are two jokes for you:\n\n1. Why don't scientists trust atoms? Because they make up everything!\n2. Why did the tomato turn red? Because it saw the salad dressing!"

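Since every instruction sent to the model needs the same wrapping, it can be convenient to factor it into a small helper. This is a minimal sketch; `format_instruction` is a hypothetical name introduced here, and it mirrors the `[INST]...[/INST]` form used in the cell above.

```python
def format_instruction(instruction):
    """Wrap a plain instruction in the [INST] ... [/INST] tokens."""
    return f"[INST]{instruction.strip()}[/INST]"


print(format_instruction("Tell me 2 jokes"))
# prints: [INST]Tell me 2 jokes[/INST]
```

With this helper, the previous call could be written as `get_completion(format_instruction("Tell me 2 jokes"), model=mistral_llm)`.
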
Now let's try a slightly more complex prompt that includes instructions:

In [22]:
prompt = """[INST]
Given the following wedding guest data, write a very short 3-sentences thank you letter:

{
  "name": "John Doe",
  "relationship": "Bride's cousin",
  "hometown": "New York, NY",
  "fun_fact": "Climbed Mount Everest in 2020",
  "attending_with": "Sophia Smith",
  "bride_groom_name": "Tom and Mary"
}

Use only the data provided in the JSON object above.

The senders of the letter is the bride and groom, Tom and Mary.
[/INST]"""

get_completion(prompt, model=mistral_llm, max_tokens=150)

" Dear John Doe,\n\nWe, Tom and Mary, would like to extend our heartfelt gratitude for your attendance at our wedding. It was a pleasure to have you there, and we truly appreciate the effort you made to be a part of our special day.\n\nWe were thrilled to learn about your fun fact - climbing Mount Everest is an incredible accomplishment! We hope you had a safe and memorable journey.\n\nThank you again for joining us on this special occasion. We hope to stay in touch and catch up on all the amazing things you've been up to.\n\nWith love,\n\nTom and Mary"

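When the structured data lives in a Python dict rather than in a hand-written string, the prompt can be assembled with `json.dumps`. The sketch below is one possible way to do this; `build_prompt_with_json` is a hypothetical helper, and the instruction wording is illustrative rather than the exact text of the cell above.

```python
import json


def build_prompt_with_json(task, data, extra_instructions=()):
    """Combine a task, a JSON-serialized record, and extra rules into one prompt."""
    parts = [task, json.dumps(data, indent=2), *extra_instructions]
    body = "\n\n".join(parts)
    return f"[INST]\n{body}\n[/INST]"


guest = {
    "name": "John Doe",
    "relationship": "Bride's cousin",
    "bride_groom_name": "Tom and Mary",
}

prompt = build_prompt_with_json(
    "Given the following wedding guest data, write a very short thank-you letter:",
    guest,
    extra_instructions=["Use only the data provided in the JSON object above."],
)
print(prompt)
```

Serializing the record programmatically keeps the JSON valid and makes it easy to loop over many records with the same instructions.
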
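## RAG Use Case: Generating Short Paper Titles

For the RAG use case, we will use [a dataset](https://github.com/dair-ai/ML-Papers-of-the-Week/tree/main/research) that contains a list of the top ML papers of each week.

The user provides an original paper title. We then take that input and use the dataset to build a context of short, catchy paper titles, which helps generate catchy titles for the original input title.

### Step 1: Load the Dataset

Let's first load the dataset we will use: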

In [23]:
# load dataset from data/ folder to pandas dataframe
# dataset contains column names

ml_papers = pd.read_csv("../data/ml-potw-10232023.csv", header=0)

# remove rows with empty titles or descriptions
ml_papers = ml_papers.dropna(subset=["Title", "Description"])

In [24]:
ml_papers.head()

|   | Title | Description | PaperURL | TweetURL | Abstract |
|---|---|---|---|---|---|
| 0 | Llemma | an LLM for mathematics which is based on conti... | https://arxiv.org/abs/2310.10631 | https://x.com/zhangir_azerbay/status/171409802... | We present Llemma, a large language model for ... |
| 1 | LLMs for Software Engineering | a comprehensive survey of LLMs for software en... | https://arxiv.org/abs/2310.03533 | https://x.com/omarsar0/status/1713940983199506... | This paper provides a survey of the emerging a... |
| 2 | Self-RAG | presents a new retrieval-augmented framework t... | https://arxiv.org/abs/2310.11511 | https://x.com/AkariAsai/status/171511027707796... | Despite their remarkable capabilities, large l... |
| 3 | Retrieval-Augmentation for Long-form Question ... | explores retrieval-augmented language models o... | https://arxiv.org/abs/2310.12150 | https://x.com/omarsar0/status/1714986431859282... | We present a study of retrieval-augmented lang... |
| 4 | GenBench | presents a framework for characterizing and un... | https://www.nature.com/articles/s42256-023-007... | https://x.com/AIatMeta/status/1715041427283902... |  |


In [25]:
# convert dataframe to a list of dicts, one per row (all columns are kept)

ml_papers_dict = ml_papers.to_dict(orient="records")

In [26]:
ml_papers_dict[0]

{'Title': 'Llemma',
 'Description': 'an LLM for mathematics which is based on continued pretraining from Code Llama on the Proof-Pile-2 dataset; the dataset involves scientific paper, web data containing mathematics, and mathematical code; Llemma outperforms open base models and the unreleased Minerva on the MATH benchmark; the model is released, including dataset and code to replicate experiments.',
 'PaperURL': 'https://arxiv.org/abs/2310.10631',
 'TweetURL': 'https://x.com/zhangir_azerbay/status/1714098025956864031?s=20',
 'Abstract': 'We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finet

We will use SentenceTransformer to generate the embeddings that will be stored in the Chroma document store.

In [27]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        batch_embeddings = embedding_model.encode(input)
        return batch_embeddings.tolist()

embed_fn = MyEmbeddingFunction()

# Initialize the chromadb directory, and client.
client = chromadb.PersistentClient(path="./chromadb")

# create the collection (or reuse it) with the custom embedding function
collection = client.get_or_create_collection(
    name="ml-papers-nov-2023",
    embedding_function=embed_fn
)
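
The embedding model maps each title to a vector, and retrieval then comes down to comparing vectors. The sketch below illustrates this with cosine similarity, which is one common choice; the metric Chroma actually uses depends on how the collection is configured. The 3-dimensional vectors are made up for illustration (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `cosine_similarity` is a hand-written helper, not a library call.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only.
toy_embeddings = {
    "LLMs for Software Engineering": [0.9, 0.1, 0.0],
    "Communicative Agents for Software Development": [0.8, 0.3, 0.1],
    "Llemma": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

# Rank titles from most to least similar to the query vector.
ranked = sorted(
    toy_embeddings,
    key=lambda title: cosine_similarity(query_vector, toy_embeddings[title]),
    reverse=True,
)
print(ranked)
```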


Now let's generate the embeddings in batches:

In [28]:
# Generate embeddings, and index titles in batches
batch_size = 50

# loop through the batches, then generate and store the embeddings
for i in tqdm(range(0, len(ml_papers_dict), batch_size)):

    i_end = min(i + batch_size, len(ml_papers_dict))
    batch = ml_papers_dict[i:i_end]

    # Replace title with "No Title" if empty string
    batch_titles = [str(paper["Title"]) if str(paper["Title"]) != "" else "No Title" for paper in batch]
    # Use each paper's position in the list as its ID: unique, and stable across reruns,
    # so upsert overwrites existing entries instead of adding duplicates
    batch_ids = [str(idx) for idx in range(i, i_end)]
    batch_metadata = [dict(url=paper["PaperURL"],
                           abstract=paper['Abstract'])
                           for paper in batch]

    # generate embeddings
    batch_embeddings = embedding_model.encode(batch_titles)

    # upsert to chromadb
    collection.upsert(
        ids=batch_ids,
        metadatas=batch_metadata,
        documents=batch_titles,
        embeddings=batch_embeddings.tolist(),
    )

100%|██████████| 9/9 [00:01<00:00,  7.62it/s]
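
The batching pattern in this loop can be isolated into a small generator. A related point is the choice of IDs: `upsert` only overwrites an existing entry when the ID matches, so IDs that are reproducible across runs keep a persistent collection free of duplicates. The sketch below shows one option, hashing the text; `iter_batches` and `stable_id` are hypothetical helpers, and truncating the SHA-256 digest to 16 characters is an arbitrary choice made for readability.

```python
import hashlib


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def stable_id(text):
    """Derive a deterministic ID from the text (first 16 hex chars of its SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


# Placeholder titles, for illustration only.
titles = [f"Paper {n}" for n in range(7)]

for batch in iter_batches(titles, batch_size=3):
    batch_ids = [stable_id(title) for title in batch]
    print(len(batch), batch_ids)
```

Hash-based IDs assume the hashed text is unique; two papers with the same title would map to the same ID.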


Now we can test the retriever:

In [29]:
collection = client.get_or_create_collection(
    name="ml-papers-nov-2023",
    embedding_function=embed_fn
)

retriever_results = collection.query(
    query_texts=["Software Engineering"],
    n_results=2,
)

print(retriever_results["documents"])

[['LLMs for Software Engineering', 'Communicative Agents for Software Development']]
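
Note the nested list in the output: the result holds one inner list per query text. The sketch below pairs each retrieved document with its distance for a single query. It assumes the result also carries a `distances` entry with the same nesting, which should be checked against the installed Chroma version; the dict here is a hand-written stand-in with invented IDs and distances, and `flatten_query_results` is a hypothetical helper.

```python
def flatten_query_results(results, query_index=0):
    """Pair each retrieved document with its distance for one query."""
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return list(zip(documents, distances))


# Hand-written stand-in for a query result; IDs and distances are invented.
example_results = {
    "ids": [["1", "57"]],
    "documents": [[
        "LLMs for Software Engineering",
        "Communicative Agents for Software Development",
    ]],
    "distances": [[0.42, 0.77]],
}

for title, distance in flatten_query_results(example_results):
    print(f"{distance:.2f}  {title}")
```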


Now let's put together the final prompt:

In [30]:
# user query
user_query = "S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models"

# query for user query
results = collection.query(
    query_texts=[user_query],
    n_results=10,
)

# concatenate titles into a single string
short_titles = '\n'.join(results['documents'][0])

prompt_template = f'''[INST]

Your main task is to generate 5 SUGGESTED_TITLES based on the PAPER_TITLE

You should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TITLE.

PAPER_TITLE: {user_query}

SHORT_TITLES: {short_titles}

SUGGESTED_TITLES:

[/INST]
'''

# get_completion returns the generated text as a single string
suggested_titles = get_completion(prompt_template, model=mistral_llm, max_tokens=2000)

# Print the suggestions.
print("Model Suggestions:")
print(suggested_titles)
print("\n\n\nPrompt Template:")
print(prompt_template)

Model Suggestions:

1. S3Eval: A Comprehensive Evaluation Suite for Large Language Models
2. Synthetic and Scalable Evaluation for Large Language Models
3. Systematic Evaluation of Large Language Models with S3Eval
4. S3Eval: A Synthetic and Scalable Approach to Language Model Evaluation
5. S3Eval: A Synthetic and Scalable Evaluation Suite for Large Language Models



Prompt Template:
[INST]

Your main task is to generate 5 SUGGESTED_TITLES based on the PAPER_TITLE

You should mimic a similar style and length as SHORT_TITLES but PLEASE DO NOT include titles from SHORT_TITLES in the SUGGESTED_TITLES, only generate versions of the PAPER_TITLE.

PAPER_TITLE: S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models

SHORT_TITLES: Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
ChemCrow: Augmenting large-language models with chemistry tools
A Survey of Large Language Models
LLaMA: Open and Efficient Foundation Language Models
Spars
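
The model returns its suggestions as one block of text, so a small post-processing step is useful before the titles are used elsewhere. The sketch below parses a numbered list into Python strings and drops any suggestion that repeats a retrieved title, which enforces the "do not include titles from SHORT_TITLES" rule in code. Both helpers are hypothetical, and the third line of the sample output is invented to show the filter at work.

```python
import re


def parse_numbered_titles(text):
    """Extract titles from lines of the form '1. Some title'."""
    pattern = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")
    titles = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            titles.append(match.group(1))
    return titles


def drop_known_titles(titles, known_titles):
    """Remove suggestions that repeat a retrieved title (case-insensitive)."""
    known = {title.strip().lower() for title in known_titles}
    return [title for title in titles if title.lower() not in known]


# Sample text; the third line is invented to demonstrate the filter.
model_output = """
1. S3Eval: A Comprehensive Evaluation Suite for Large Language Models
2. Synthetic and Scalable Evaluation for Large Language Models
3. A Survey of Large Language Models
"""
retrieved_titles = ["A Survey of Large Language Models"]

clean_titles = drop_known_titles(parse_numbered_titles(model_output), retrieved_titles)
print(clean_titles)
```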

As you can see, the short titles generated by the LLM are reasonably good. This use case still needs more work and could benefit from fine-tuning. In this tutorial, we showed a simple way to apply RAG using the fast open-source models hosted on Fireworks.

You can try other open-source models here: https://app.fireworks.ai/models

For more details on the Fireworks API, see: https://readme.fireworks.ai/reference/createchatcompletion
