# Pinecone

- Author: [Teddy](https://github.com/teddylee777)
- Design: [Teddy](https://github.com/teddylee777)
- Peer Review: [Teddy](https://github.com/teddylee777)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview
[Pinecone](https://docs.pinecone.io/docs/overview) is a managed vector database for storing embeddings and querying them by similarity.

This notebook shows how to use functionality related to the `Pinecone` vector database.

### Table of Contents
- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Credentials](#credentials-to-do)
- [Initialization](#initialization)
- [Manage vector store](#manage-vector-store)
- [Query vector store](#query-vector-store)

## References
- [LangChain: Pinecone](https://python.langchain.com/v0.2/docs/integrations/vectorstores/pinecone/)
- [Pinecone]()


## Environment Setup
Set up the environment. You may refer to Environment Setup for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out `langchain-opentutorial` for more details.

In [None]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [8]:
# Install required packages
from langchain_opentutorial import package

package.install(
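    [
        "langchain-pinecone",
        "langchain-openai",  # used below for OpenAIEmbeddings
        "langchain-text-splitters",  # used below for RecursiveCharacterTextSplitter
        "pinecone-client",
        "nltk",
        "langchain_community",
    ],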
    verbose=False,
    upgrade=False,
)


In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "PINECONE_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Pinecone",
    },
)

In [None]:

from typing import Any, Dict, List

from langchain_core.documents import Document
from pinecone import Pinecone, ServerlessSpec

# NOTE: `VectorDBInterface` is assumed to be the tutorial's abstract base class
# for vector stores. It is not defined in this notebook.


class PineconeDB(VectorDBInterface):
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.pc = None  # set by connect()

    def connect(self, **kwargs) -> None:
        """Connect to the Pinecone API."""
        self.pc = Pinecone(api_key=self.api_key)
        print("[Pinecone] Connected")

    def create_index(self, index_name: str, dimension: int, metric: str = "dotproduct", **kwargs) -> Any:
        """Create a Pinecone index if it does not already exist."""
        # The keyword keeps its original name, although the default is a serverless spec.
        pod_spec = kwargs.get("pod_spec", ServerlessSpec(cloud="aws", region="us-east-1"))

        if index_name not in self.pc.list_indexes().names():
            self.pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=pod_spec)
            print(f"[Pinecone] Index {index_name} created")

        return self.pc.Index(index_name)

    def get_index(self, index_name: str) -> Any:
        """Return a handle to an existing Pinecone index."""
        return self.pc.Index(index_name)

    def delete_index(self, index_name: str) -> None:
        """Delete a Pinecone index."""
        self.pc.delete_index(index_name)
        print(f"[Pinecone] Index {index_name} deleted")

    def upsert_documents(self, index_name: str, documents: List[Dict], **kwargs) -> None:
        """Upsert documents into Pinecone."""
        index = self.pc.Index(index_name)
        index.upsert(vectors=documents, namespace=kwargs.get("namespace"))
        print(f"[Pinecone] Upserted {len(documents)} documents")

    def upsert_documents_parallel(self, index_name: str, documents: List[Dict], batch_size: int = 32, max_workers: int = 10, **kwargs) -> None:
        """Parallel upsert (not implemented yet)."""
        pass

    def query(self, index_name: str, query_vector: List[float], top_k: int = 10, **kwargs) -> List[Document]:
        """Run a query against Pinecone."""
        index = self.pc.Index(index_name)
        # include_metadata=True is required; otherwise matches carry no metadata.
        results = index.query(
            vector=query_vector,
            top_k=top_k,
            namespace=kwargs.get("namespace"),
            include_metadata=True,
        )
        return [Document(page_content=r.metadata["context"], metadata=r.metadata) for r in results["matches"]]

    def delete_by_filter(self, index_name: str, filters: Dict, **kwargs) -> None:
        """Delete documents that match a metadata filter."""
        # Support for filter-based deletes may depend on the index type; check the Pinecone docs.
        index = self.pc.Index(index_name)
        index.delete(filter=filters, namespace=kwargs.get("namespace"))

    def preprocess_documents(self, documents: List[Document], **kwargs) -> List[Dict]:
        """Convert LangChain documents into the record format Pinecone expects."""
        return [
            {
                "id": doc.metadata.get("id", ""),
                "values": doc.metadata["vector"],  # Pinecone reads the embedding from "values"
                "metadata": {k: v for k, v in doc.metadata.items() if k != "vector"},
            }
            for doc in documents
        ]

    def get_api_key(self) -> str:
        """Return the Pinecone API key."""
        return self.api_key


[Note] If you are using a .env file, proceed as follows.

In [None]:
from dotenv import load_dotenv

# Load API key information
load_dotenv()

## Credentials (to-do)

Create a new Pinecone account, or sign into your existing one, and create an API key to use in this notebook.

'''
######################################################
(TO-DO) Explain how to issue an API key
######################################################
'''

(Explanation of how to issue an API key goes here.)

Add the following to your `.env` file.

```
PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
```

In [None]:
import getpass
import os
import time

from pinecone import Pinecone, ServerlessSpec

'''
######################################################
(TO-DO) Add DB connection function
: connect
######################################################
'''

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)

In [None]:
'''
######################################################
(TO-DO) Add API key lookup function
: get_api_key
######################################################
'''
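
As a starting point for the `get_api_key` TODO above, here is a minimal sketch that reads the key from an environment variable and fails loudly when it is missing. The `mask` helper and the `env_var` parameter are illustrative assumptions, not part of the Pinecone client.

```python
import os


def get_api_key(env_var: str = "PINECONE_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    api_key = os.environ.get(env_var, "")
    if not api_key:
        raise RuntimeError(f"{env_var} is not set")
    return api_key


def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters so the key can be logged safely."""
    keep = min(max(visible, 0), len(secret))
    return "*" * (len(secret) - keep) + secret[len(secret) - keep:]
```

Printing `mask(get_api_key())` confirms that a key is loaded without exposing it in the notebook output.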


## Initialization
Before initializing our vector store, let's connect to a Pinecone index. If an index named `index_name` doesn't exist, it will be created.

**Note**
- `metric` specifies how similarity is measured. If you are considering hybrid search, set `metric` to `dotproduct`.
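
To make the difference between the two metrics concrete, the following pure-Python sketch compares the dot product with cosine similarity on toy vectors. It does not call Pinecone; the helper names are illustrative.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: the dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(a: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(a, a))
    return [x / norm for x in a]


query = [3.0, 4.0]
doc = [6.0, 8.0]  # same direction as the query, twice the length

print(dot(query, doc))     # 50.0 (grows with vector length)
print(cosine(query, doc))  # 1.0 (depends on direction only)
```

For unit-length vectors the two scores coincide, so `dot(normalize(query), normalize(doc))` matches `cosine(query, doc)` up to floating-point error.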

### List Indexes (to-do)

In [None]:
'''
######################################################
(TO-DO) Add index listing function
: list_indexs
######################################################
'''

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]
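
One possible shape for the listing helper named in the TODO is sketched below. It wraps the same `pc.list_indexes()` call used in the cell above; the function name `list_indexes` and the client being passed in as an argument are assumptions made for illustration.

```python
from typing import Any, List


def list_indexes(pc: Any) -> List[str]:
    """Return the names of all indexes visible to the given Pinecone client."""
    return [index_info["name"] for index_info in pc.list_indexes()]
```

With the client created earlier, `existing_indexes = list_indexes(pc)` produces the same list as the comprehension above.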

### Create Index (to-do)

Create a new Pinecone index.

![pinecone-01.png](https://github.com/teddylee777/langchain-kr/blob/main/09-VectorStore/images/pinecone-01.png?raw=1)

In [None]:
import time

'''
######################################################
(TO-DO) Add index creation function
: create_index
######################################################
'''

index_name = "langchain-test-index"  # change if desired

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=3072,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
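
The cell above can be folded into a reusable function. The sketch below uses the same client calls (`list_indexes`, `create_index`, `describe_index`, `Index`) and adds a timeout so the readiness loop cannot spin forever. The `timeout` parameter and the choice to pass the client and spec as arguments are assumptions.

```python
import time
from typing import Any


def create_index(
    pc: Any,
    index_name: str,
    dimension: int,
    spec: Any,
    metric: str = "cosine",
    timeout: float = 60.0,
) -> Any:
    """Create `index_name` if it is missing, wait until it is ready, and return a handle."""
    existing = [info["name"] for info in pc.list_indexes()]
    if index_name not in existing:
        pc.create_index(name=index_name, dimension=dimension, metric=metric, spec=spec)

    # Poll the index status, giving up once the deadline has passed.
    deadline = time.monotonic() + timeout
    while not pc.describe_index(index_name).status["ready"]:
        if time.monotonic() > deadline:
            raise TimeoutError(f"index {index_name!r} not ready after {timeout}s")
        time.sleep(1)
    return pc.Index(index_name)
```

A call such as `create_index(pc, "langchain-test-index", 3072, ServerlessSpec(cloud="aws", region="us-east-1"))` would mirror the cell above.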

### Delete Index (to-do)

In [None]:
'''
######################################################
(TO-DO) Add index deletion function
: delete_index
######################################################
'''
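
A minimal sketch of the deletion helper follows. It checks that the index exists before calling the client's `delete_index`, so that deleting a missing index is a no-op rather than an error. The boolean return value is an illustrative choice.

```python
from typing import Any


def delete_index(pc: Any, index_name: str) -> bool:
    """Delete `index_name` if it exists; return True when something was deleted."""
    existing = [info["name"] for info in pc.list_indexes()]
    if index_name not in existing:
        return False
    pc.delete_index(index_name)
    return True
```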

### Select Embeddings model

In [None]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [None]:
from langchain_pinecone import PineconeVectorStore
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

### Data Preprocessing (to-do)

The following steps preprocess general documents.

- Extract the required `metadata` fields.
- Keep only the data that meets a minimum length.
- Specify whether to use the document's `basename`. The default is `False`.
  - Here, `basename` means the last component of the file path.
  - For example, `/Users/teddy/data/document.pdf` becomes `document.pdf`.
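
The steps above can be sketched in plain Python. This is not the implementation of any library's `preprocess_documents`; the `Doc` dataclass is a hypothetical stand-in for a LangChain `Document`, and the return shape (a list of texts plus a list of metadata dicts) is an assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def preprocess_sketch(
    docs: Sequence[Doc],
    metadata_keys: Sequence[str] = ("source", "page", "author"),
    min_length: int = 5,
    use_basename: bool = False,
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Filter short documents and keep only the requested metadata keys."""
    contents: List[str] = []
    metadatas: List[Dict[str, object]] = []
    for doc in docs:
        text = doc.page_content.strip()
        if len(text) < min_length:
            continue  # drop chunks shorter than min_length
        meta = {k: doc.metadata[k] for k in metadata_keys if k in doc.metadata}
        if use_basename and "source" in meta:
            # Keep only the last path component, e.g. "document.pdf".
            meta["source"] = os.path.basename(str(meta["source"]))
        contents.append(text)
        metadatas.append(meta)
    return contents, metadatas
```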

In [10]:
# This is a long document we can split up.
with open("./data/the_little_prince.txt") as f:
    raw_text = f.read()

In [25]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

split_docs = text_splitter.create_documents([raw_text])

print(split_docs)

[Document(metadata={}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'), Document(metadata={}, page_content='[ Antoine de Saiot-Exupery ]'), Document(metadata={}, page_content='Over the past century, the thrill of flying has inspired some to perform remarkable feats of'), Document(metadata={}, page_content='remarkable feats of daring. For others, their desire to soar into the skies led to dramatic leaps'), Document(metadata={}, page_content='to dramatic leaps in technology. For Antoine de Saint-Exupéry, his love of aviation inspired'), Document(metadata={}, page_content='aviation inspired stories, which have touched the hearts of millions around the world.'), Document(metadata={}, page_content='Born in 1900 in Lyons, France, young Antoine was filled with a passion for adventure. When he'), Document(metadata={}, page_content='adventure. When he failed an entrance exam for the Naval Academy, his interest in aviation took'), Document(metadata={}, page_con
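
The repeated phrases at chunk boundaries in the output above come from `chunk_overlap`. The sketch below shows the idea with a naive fixed-width splitter. It is a deliberate simplification and not how `RecursiveCharacterTextSplitter` works internally, since that splitter also tries to break on separators.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

Each chunk begins with the last `chunk_overlap` characters of the previous one, which keeps sentences that straddle a boundary retrievable from either side.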

In [None]:
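'''
######################################################
(TO-DO) Add a document preprocessing function for each DB format
: preprocess_documents

This function converts documents from the LangChain format into the format Pinecone expects.
######################################################
'''
# Assumption: `preprocess_documents` comes from the same `langchain_teddynote`
# module as the upsert helpers imported later in this notebook.
from langchain_teddynote.community.pinecone import preprocess_documents

contents, metadatas = preprocess_documents(
    split_docs=split_docs,
    metadata_keys=["source", "page", "author"],
    min_length=5,
    use_basename=True,
)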

## Manage vector store
Once you have created your vector store, you can interact with it by adding and deleting items.


### Add items to vector store

We can add items to our vector store by using the `add_documents` method.

In [None]:
from uuid import uuid4
from langchain_core.documents import Document

uuids = [str(uuid4()) for _ in range(len(split_docs))]
print(uuids)
vector_store.add_documents(documents=split_docs, ids=uuids)

### Delete items from vector store

In [None]:
vector_store.delete(ids=[uuids[-1]])

### Upsert items to vector store (to-do)

In [None]:
%%time
'''
######################################################
(TO-DO) Add document insert+update function
: upsert_documents
######################################################
'''
from langchain_teddynote.community.pinecone import upsert_documents

upsert_documents(
    index=index,  # Pinecone Index created above
    namespace="langchain-open-tutorial-01",  # Pinecone namespace
    contents=contents,  # Previously preprocessed document content
    metadatas=metadatas,  # Previously preprocessed document metadata
    embedder=embeddings,  # Embeddings model selected above
    batch_size=32,
)
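
For the `upsert_documents` TODO, a batched upsert can be sketched directly against the index's `upsert` method. This is an illustration under assumptions, not the `langchain_teddynote` implementation: it expects records that are already embedded, in the form `{"id": ..., "values": [...], "metadata": {...}}`.

```python
from typing import Any, Dict, Iterator, Optional, Sequence


def batches(items: Sequence[Dict[str, Any]], batch_size: int) -> Iterator[Sequence[Dict[str, Any]]]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(
    index: Any,
    vectors: Sequence[Dict[str, Any]],
    namespace: Optional[str] = None,
    batch_size: int = 32,
) -> int:
    """Upsert records batch by batch and return how many were sent."""
    total = 0
    for batch in batches(vectors, batch_size):
        index.upsert(vectors=list(batch), namespace=namespace)
        total += len(batch)
    return total
```

Sending fixed-size batches keeps each request small, which matters because a single request carrying every vector can exceed the service's payload limits.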

### Upsert items to vector store (parallel) (to-do)

In [None]:
%%time
'''
######################################################
(TO-DO) Add upsert_documents_parallel function
: upsert_documents_parallel
######################################################
'''
from langchain_teddynote.community.pinecone import upsert_documents_parallel

# NOTE: `sparse_encoder` is not defined in this notebook; it must be created before this call.
upsert_documents_parallel(
    index=index,  # Pinecone Index created above
    namespace="langchain-open-tutorial-01",  # Pinecone namespace
    contents=contents,  # Previously preprocessed document content
    metadatas=metadatas,  # Previously preprocessed document metadata
    sparse_encoder=sparse_encoder,  # Sparse encoder
    embedder=embeddings,  # Embeddings model selected above
    batch_size=32,
)
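
One way to fill in the parallel upsert TODO is to send batches from a thread pool. The sketch below is an assumption about how such a helper could look, not the `langchain_teddynote` implementation. It reuses the `batch_size=32` and `max_workers=10` defaults from the `PineconeDB` stub and expects records that are already embedded.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Optional, Sequence


def upsert_parallel(
    index: Any,
    vectors: Sequence[Dict[str, Any]],
    namespace: Optional[str] = None,
    batch_size: int = 32,
    max_workers: int = 10,
) -> int:
    """Upsert batches concurrently and return the number of records sent."""
    chunks = [vectors[i:i + batch_size] for i in range(0, len(vectors), batch_size)]

    def send(chunk: Sequence[Dict[str, Any]]) -> int:
        index.upsert(vectors=list(chunk), namespace=namespace)
        return len(chunk)

    # pool.map re-raises the first worker exception when its results are consumed.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(send, chunks))
```

Threads suit this workload because each upsert spends most of its time waiting on the network rather than on the CPU.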

## Query vector store

Once your vector store has been created and the relevant documents have been added, you will most likely want to query it while running your chain or agent.

### Query directly

Performing a simple similarity search can be done as follows:

In [None]:
results = vector_store.similarity_search(
    "LangChain provides abstractions to make working with LLMs easy",
    k=2,
    filter={"source": "tweet"},
)
for res in results:
    print(f"* {res.page_content} [{res.metadata}]")

### Similarity search with score
You can also search with score:

In [None]:
results = vector_store.similarity_search_with_score(
    "Will it be hot tomorrow?", k=1, filter={"source": "news"}
)
for res, score in results:
    print(f"* [SIM={score:.3f}] {res.page_content} [{res.metadata}]")

### Query by turning into retriever
You can also transform the vector store into a retriever for easier usage in your chains.

In [None]:
retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 1, "score_threshold": 0.5},
)
retriever.invoke("Stealing from the bank is a crime", filter={"source": "news"})
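
To illustrate what the `k` and `score_threshold` settings do to a result list, here is a small standalone sketch. It only models the selection logic on `(item, score)` pairs and makes no claim about the retriever's internals.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def filter_by_score(
    scored: Sequence[Tuple[T, float]],
    k: int,
    score_threshold: float,
) -> List[Tuple[T, float]]:
    """Keep at most `k` results whose score is at least `score_threshold`, best first."""
    kept = [(item, score) for item, score in scored if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

A query with no sufficiently similar document returns an empty list under this scheme, so downstream chain steps should handle that case.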

### delete_by_filter (to-do)

In [None]:
'''
######################################################
(TO-DO) Add function to delete specific documents by filter
: delete_by_filter
######################################################
'''
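
A possible starting point for `delete_by_filter` is sketched below. It forwards a metadata filter to the index's `delete` method, as the `PineconeDB` class earlier in this notebook does, and refuses an empty filter as a safety check. Whether filter-based deletes are available may depend on the index type, so treat this as an assumption to verify against the Pinecone documentation.

```python
from typing import Any, Dict, Optional


def delete_by_filter(index: Any, filters: Dict[str, Any], namespace: Optional[str] = None) -> None:
    """Delete every record in `namespace` whose metadata matches `filters`."""
    if not filters:
        # Guard against a call that carries no condition at all.
        raise ValueError("refusing to delete with an empty filter")
    index.delete(filter=filters, namespace=namespace)
```

A filter such as `{"source": {"$eq": "news"}}` would target only records whose `source` metadata equals `"news"`.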