https://docs.llamaindex.ai/en/stable/examples/low_level/ingestion.html

Setup  
We create an empty Pinecone index and define the LlamaIndex wrappers needed to load data into it.

In [13]:
from pinecone import Pinecone, PodSpec

In [39]:
import configparser

config = configparser.ConfigParser()
config.read('env/pinecone.conf')

api_key = config["DEFAULT"]["PINECONE_API_KEY"]
environment = config["DEFAULT"]["PINECONE_ENVIRONMENT"]
openai_api_key = config["DEFAULT"]["OPENAI_API_KEY"]
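
The contents of `env/pinecone.conf` are not shown. The following is a minimal sketch of the layout this cell appears to expect, assuming a standard INI file with all three keys under `[DEFAULT]`. The values are placeholders, the `load_settings` helper is hypothetical, and the text is parsed from a string only so the example runs without touching disk.

```python
import configparser

# Placeholder values; the real file is assumed to live at env/pinecone.conf.
SAMPLE_CONF = """
[DEFAULT]
PINECONE_API_KEY = your-pinecone-api-key
PINECONE_ENVIRONMENT = your-pinecone-environment
OPENAI_API_KEY = your-openai-api-key
"""

REQUIRED_KEYS = ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "OPENAI_API_KEY")


def load_settings(conf_text: str) -> dict:
    """Parse INI text and return the required keys, failing early if one is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    missing = [key for key in REQUIRED_KEYS if key not in parser["DEFAULT"]]
    if missing:
        raise KeyError(f"missing config keys: {missing}")
    return {key: parser["DEFAULT"][key] for key in REQUIRED_KEYS}


settings = load_settings(SAMPLE_CONF)
```

Note that `config.read` silently skips a file it cannot find, so a wrong path only surfaces as a `KeyError` on the first lookup.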

In [8]:
pc = Pinecone(api_key=api_key)

In [46]:
index_name = "llamaindex-rag-fs"

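# Delete any existing index with this name so we start from an empty one.
# The membership check avoids an error on the first run, when the index does not exist yet.
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)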

pc.create_index(
    name=index_name, 
    dimension=1536, 
    metric="cosine", 
    spec=PodSpec(environment=environment, pod_type="p1.x1")
)
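
The index is created with `metric="cosine"`, and its `dimension` has to equal the length of the vectors produced by the embedding model used later. As a rough illustration of what a cosine metric ranks by, here is a brute-force sketch in plain Python. It is not how Pinecone implements search; the helper names and the toy 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]


# Toy 3-dimensional vectors; the index above uses 1536 dimensions.
stored = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
ranking = top_k([1.0, 0.0, 0.0], stored, k=2)
```

Because cosine similarity ignores vector length, `[1, 2]` and `[2, 4]` score as identical; only direction matters.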


In [47]:
pinecone_index = pc.Index(index_name)

### Build an Ingestion Pipeline from Scratch

1. Load Data

In [48]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

mkdir: data: File exists
--2024-02-20 21:39:12--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: `data/llama2.pdf'


2024-02-20 21:39:14 (6.47 MB/s) - `data/llama2.pdf' saved [13661300/13661300]



In [49]:
import fitz     # PyMuPDF, a library for extracting content from PDFs

In [50]:
file_path = "./data/llama2.pdf"
doc = fitz.open(file_path)

2. Use a Text Splitter to Split Documents

In [51]:
from llama_index.core.node_parser import SentenceSplitter

In [53]:
text_parser = SentenceSplitter(
    chunk_size=1536,
    # separator=" ",
)

In [54]:
text_chunks = []
# maintain relationship with source doc index, to help inject doc metadata in (3)
doc_idxs = []
for doc_idx, page in enumerate(doc):
    page_text = page.get_text("text")
    cur_text_chunks = text_parser.split_text(page_text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
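
The loop above keeps two parallel lists: `text_chunks[i]` is a chunk and `doc_idxs[i]` is the page it came from. A self-contained sketch of the same bookkeeping follows, using a list of strings in place of the PDF and a hypothetical fixed-size word splitter (`split_words`) in place of `SentenceSplitter`, whose actual algorithm is different.

```python
def split_words(text, max_words=5):
    """Hypothetical splitter: fixed-size word windows (not SentenceSplitter's algorithm)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Stand-in for the PDF: one string per page.
pages = [
    "Llama 2 is a collection of pretrained and fine-tuned language models.",
    "",  # a page with no extractable text yields no chunks
    "Safety fine-tuning is covered in a later section.",
]

text_chunks = []
doc_idxs = []  # doc_idxs[i] is the page index that text_chunks[i] came from
for page_idx, page_text in enumerate(pages):
    cur_chunks = split_words(page_text)
    text_chunks.extend(cur_chunks)
    doc_idxs.extend([page_idx] * len(cur_chunks))
```

The two lists always have the same length, and a page that produces no chunks simply never appears in `doc_idxs`.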

3. Manually Construct Nodes from Text Chunks

In [55]:
from llama_index.core.schema import TextNode

In [56]:
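nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(
        text=text_chunk,
    )
    # Look up the source page for this chunk. Nothing from the page is attached
    # to the node here, so node.metadata stays empty.
    src_doc_idx = doc_idxs[idx]
    src_page = doc[src_doc_idx]
    nodes.append(node)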

In [57]:
print(nodes[0].metadata)
# print a sample node
print(nodes[0].get_content(metadata_mode="all"))

{}
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗
Louis Martin†
Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edu
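
The metadata printed above is `{}` because nothing was attached when the nodes were built. One way the page index tracked in `doc_idxs` could be turned into per-chunk metadata is sketched below. `ChunkRecord`, `build_records`, and the `page_number` and `file_path` keys are hypothetical names chosen for this example, not part of LlamaIndex.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """Hypothetical stand-in for a node: chunk text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def build_records(text_chunks, doc_idxs, file_path):
    """Pair each chunk with the 1-based page number and file it came from."""
    if len(text_chunks) != len(doc_idxs):
        raise ValueError("text_chunks and doc_idxs must be the same length")
    return [
        ChunkRecord(text=chunk, metadata={"page_number": page_idx + 1, "file_path": file_path})
        for chunk, page_idx in zip(text_chunks, doc_idxs)
    ]


records = build_records(
    ["first chunk", "second chunk", "third chunk"],
    [0, 0, 1],
    "./data/llama2.pdf",
)
```

With LlamaIndex, the same dict could be passed as `TextNode(text=..., metadata=...)`. Keep in mind that `get_content(metadata_mode="all")` includes metadata in the returned text, so anything attached here would also be embedded in step 5.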

In [58]:
# [Optional] 4. Extract Metadata from each Node
'''
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    TitleExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")

extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
]
pipeline = IngestionPipeline(
    transformations=extractors,
)
nodes = await pipeline.arun(nodes=nodes, in_place=False)
print(nodes[0].metadata)
'''


5. Generate Embeddings for each Node

In [59]:
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(api_key=openai_api_key)

In [60]:
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
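
The loop above issues one embedding request per node. Grouping texts into batches reduces the number of round trips. The sketch below shows only the batching logic: `fake_embed_batch` is a hypothetical stand-in for a real embedding call so the example runs offline. LlamaIndex embedding models also expose a `get_text_embedding_batch` method, which is worth checking against the installed version before relying on it.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed_batch(texts):
    """Hypothetical stand-in for an embedding API call: one tiny vector per text."""
    return [[float(len(text)), float(text.count(" "))] for text in texts]


texts = ["alpha beta", "gamma", "delta epsilon zeta", "eta", "theta iota"]
embeddings = []
calls = 0
for batch in batched(texts, batch_size=2):
    embeddings.extend(fake_embed_batch(batch))  # one "request" per batch
    calls += 1
```

Results come back in input order, so `embeddings[i]` can be assigned to `nodes[i].embedding` afterwards.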

6. Load Nodes into a Vector Store

In [61]:
from llama_index.vector_stores.pinecone import PineconeVectorStore
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

In [62]:
# We now insert these nodes into our PineconeVectorStore.
vector_store.add(nodes)

Upserted vectors: 100%|██████████| 80/80 [00:01<00:00, 60.04it/s]


['59b4f167-424f-4970-aa11-726d0dff6c45',
 '54499387-c80b-4ed1-a6c8-573233e8a544',
 'd8c3f86d-a2b6-45ec-8223-8a2e120263bd',
 '81153352-1397-4697-84a4-e7bf9719f89b',
 '2aa56b9b-bb5c-4e39-91e6-3f685bbf50f5',
 'd7204148-ce5d-4bf5-983d-b1bb8bd9ebd2',
 '892b9f94-396a-47f8-84e7-935e79f3121b',
 '88e80ade-3445-45c7-963f-af11c1d18150',
 '5da05f09-4817-49e6-8c67-6a9b250b4989',
 'bb505547-7268-49b3-a626-962990a98959',
 'f074068a-5b5a-4632-9e6b-ce71f34d2aec',
 '079b9d17-7af4-41e2-ad92-b8da83cba50e',
 '1a5a3510-91e2-44b9-b599-c62b5a9935ec',
 'e85d1463-a79f-4b7b-8c7d-d126e03578a0',
 '7d324b3d-e9ec-4af8-afaa-a2ce2e8e090c',
 'b2cb59e5-1078-442d-9ca4-a4b26c30d67c',
 'a87ea862-8b4e-4447-9de5-fb9f173bc409',
 '39bce592-6a7f-4735-a531-0a42cba85ded',
 'b6e74359-88fa-44bd-be0b-91434c73446e',
 '70ed1c78-03d5-4b44-a6af-a5643e2f35e3',
 'a4ef88eb-3e70-4a31-8e3a-3b4059408e91',
 '9b11e808-35a2-4090-8520-eb75d1021d1b',
 '31fb3ce2-a80c-4a5a-8f73-cebf67662714',
 'ef7ad5b8-dcd6-4e10-aa33-787286d4e1e6',
 'f21afd6b-9d29-

### Retrieve and Query from the Vector Store
Now that our ingestion is complete, we can retrieve/query this vector store.

In [68]:
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext

import os
os.environ["OPENAI_API_KEY"] = openai_api_key

In [69]:
index = VectorStoreIndex.from_vector_store(vector_store)

In [70]:
query_engine = index.as_query_engine()

In [71]:
query_str = "Can you tell me about the key concepts for safety finetuning"

In [72]:
response = query_engine.query(query_str)

In [73]:
print(str(response))

The key concepts for safety fine-tuning include supervised safety fine-tuning, safety RLHF (Rejection Learning from Human Feedback), and safety context distillation. Supervised safety fine-tuning involves gathering adversarial prompts and safe demonstrations to align the model with safety guidelines. Safety RLHF integrates safety into the RLHF pipeline by training a safety-specific reward model and using challenging adversarial prompts for optimization. Safety context distillation refines the RLHF pipeline by generating safer model responses with safety preprompts and fine-tuning the model on these responses to distill the safety context into the model.
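
Conceptually, a query engine like the one above performs two steps: it retrieves the chunks most similar to the query, then asks the LLM to answer using those chunks as context. The second step can be sketched as plain string assembly. The template and the `build_prompt` helper are hypothetical; LlamaIndex ships its own prompts, which may differ, and the two chunks are placeholders rather than real retrieval results.

```python
# Hypothetical template; LlamaIndex ships its own prompts, which may differ.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Using only the context above, answer the question.\n"
    "Question: {question}\n"
    "Answer: "
)


def build_prompt(question, retrieved_chunks, max_chars=2000):
    """Join retrieved chunks, in rank order, until the character budget is spent."""
    selected = []
    used = 0
    for chunk in retrieved_chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_prompt(
    "Can you tell me about the key concepts for safety finetuning",
    ["Chunk about supervised safety fine-tuning.", "Chunk about safety RLHF."],
)
```

The character budget is a crude proxy for a token limit; a real pipeline would count tokens with the model's tokenizer.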
