-----

# **Pinecone Vector Database**

----

## **1. Install Required Libraries**

In [86]:
!pip install pinecone openai langchain-openai langchain-community langchain_pinecone pypdf tiktoken



## **2. Import Required Libraries**

In [87]:
import os
from langchain_pinecone import PineconeVectorStore
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_openai import OpenAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

## **3. Load Data**

In [88]:
# Create directory to store data
!mkdir -p PDF_DATA

In [89]:
# Download the YOLOv7 research paper from Google Drive using gdown
# The file is saved as 'yolov7paper.pdf' in the 'PDF_DATA' directory
!gdown 1hPQlXrX8FbaYaLypxTmeVOFNitbBMlEE -O PDF_DATA/yolov7paper.pdf

# Download Rachel Green's CV from Google Drive using gdown
# The file is saved as 'rachelgreecv.pdf' in the 'PDF_DATA' directory
!gdown 1vILwiv6nS2wI3chxNabMgry3qnV67TxM -O PDF_DATA/rachelgreecv.pdf

Downloading...
From: https://drive.google.com/uc?id=1hPQlXrX8FbaYaLypxTmeVOFNitbBMlEE
To: /content/PDF_DATA/yolov7paper.pdf
100% 2.27M/2.27M [00:00<00:00, 137MB/s]
Downloading...
From: https://drive.google.com/uc?id=1vILwiv6nS2wI3chxNabMgry3qnV67TxM
To: /content/PDF_DATA/rachelgreecv.pdf
100% 271k/271k [00:00<00:00, 91.5MB/s]


## **4. Extract Text From PDF_DATA**

In [90]:
# Initialize a PyPDFDirectoryLoader instance to load PDF files from a specified directory
# "PDF_DATA/" is the directory containing the PDF files to be loaded
loader = PyPDFDirectoryLoader("PDF_DATA/")

# Load the PDF documents from the specified directory using the loader instance
# This will read and parse the PDF files, returning them as a list of document objects
documents = loader.load()

In [91]:
# documents  # Uncomment to see the extracted data

## **5. Convert Extracted Data Into Text Chunks**

In [92]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
text_chunks = text_splitter.split_documents(documents)
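
To build some intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-window character splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and newlines); the `chunk_text` helper is hypothetical and only illustrates how consecutive chunks share a few characters.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary still has some context on both sides.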

In [93]:
# Let's check the first chunk
text_chunks[0]

Document(metadata={'source': 'PDF_DATA/yolov7paper.pdf', 'page': 0}, page_content='YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object\ndetectors\nChien-Yao Wang1, Alexey Bochkovskiy, and Hong-Yuan Mark Liao1\n1Institute of Information Science, Academia Sinica, Taiwan\nkinyiu@iis.sinica.edu.tw, alexeyab84@gmail.com, and liao@iis.sinica.edu.tw\nAbstract\nYOLOv7 surpasses all known object detectors in both\nspeed and accuracy in the range from 5 FPS to 160 FPS\nand has the highest accuracy 56.8% AP among all known')

## **6. Set Up OpenAI Key and Embeddings**

In [94]:
# Set up OpenAI API Key

os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
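
Pasting a real key into a notebook cell makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to set the variable outside the notebook and only verify it here. The `require_env` helper is hypothetical, and treating the literal `"YOUR_API_KEY"` as unset is an assumption based on the placeholder used in this notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat the tutorial placeholder as "not set" (assumption for this notebook)
    if not value or value == "YOUR_API_KEY":
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value


# Usage (not executed here, since it needs a real key):
# openai_key = require_env("OPENAI_API_KEY")
```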

In [95]:
# Set up OpenAI Embeddings

embeddings = OpenAIEmbeddings()

In [96]:
# Let's try to embed a text

res = embeddings.embed_query("Hello Adil!")

In [97]:
# With the default OpenAI embedding model, each vector has 1536 dimensions

len(res)

1536

## **6. Set Up Pinecone API**

In [98]:
# API key to initialize your client

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")


In [102]:
# Set the API key as an environment variable
os.environ["PINECONE_API_KEY"] = "YOUR_API_KEY"

In [99]:
# Create a serverless index

# - An index defines the dimension of vectors to be stored and the similarity metric to be used when querying them.

# - Create a serverless index with a dimension and similarity metric based on the embedding model you’ll use to create the vector embeddings:

index_name = "pineconepractice"

pc.create_index(
    name=index_name,
    dimension=1536, # Replace with your model dimensions
    metric="cosine", # Replace with your model metric
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)
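
Running the cell above a second time will most likely fail, because an index with that name already exists. A minimal sketch of a guard follows. It assumes a Pinecone client version in which `list_indexes()` returns an object with a `names()` method; the `ensure_index` helper is hypothetical.

```python
def ensure_index(client, name, dimension, metric, spec):
    """Create the index only when no index with this name exists yet.

    Returns True if an index was created, False if it was already there.
    """
    existing_names = client.list_indexes().names()
    if name in existing_names:
        return False
    client.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Usage with the objects defined in the cell above (not executed here):
# ensure_index(pc, index_name, 1536, "cosine",
#              ServerlessSpec(cloud="aws", region="us-east-1"))
```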

## **7. Create Embeddings From Chunks**

In [103]:
# Create a Pinecone vector store instance from a list of text chunks
docsearch = PineconeVectorStore.from_texts(
    # Extract the page content from each text chunk and create a list
    [t.page_content for t in text_chunks],

    # Use OpenAI's embedding model to convert text into vector representations
    embedding=embeddings,

    # Specify the name of the index in Pinecone where the vectors will be stored
    index_name=index_name,
)
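
Because only `page_content` is passed to `from_texts`, the stored vectors carry no source or page information, which is why the search results further down show `metadata={}`. The sketch below collects texts and metadata side by side. `FakeChunk` is a hypothetical stand-in for a LangChain `Document`, and passing the result through a `metadatas` argument assumes `from_texts` accepts one, as LangChain vector stores generally do.

```python
from dataclasses import dataclass, field


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a LangChain Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def texts_and_metadatas(chunks):
    """Return parallel lists: one of chunk texts, one of copied metadata dicts."""
    texts = [c.page_content for c in chunks]
    metadatas = [dict(c.metadata) for c in chunks]
    return texts, metadatas


sample_chunks = [
    FakeChunk("first chunk", {"source": "PDF_DATA/yolov7paper.pdf", "page": 0}),
    FakeChunk("second chunk", {"source": "PDF_DATA/yolov7paper.pdf", "page": 1}),
]
texts, metadatas = texts_and_metadatas(sample_chunks)
print(metadatas[0])

# In this notebook that would become (assumption, not executed here):
# texts, metadatas = texts_and_metadatas(text_chunks)
# PineconeVectorStore.from_texts(texts, embedding=embeddings,
#                                metadatas=metadatas, index_name=index_name)
```

Passing the chunk objects directly to `PineconeVectorStore.from_documents` should have the same effect with less code.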

In [104]:
docsearch

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7b08afe2cb80>

#### **Load the Existing Pinecone Index**

In [105]:
docsearch = PineconeVectorStore.from_existing_index(index_name, embeddings)
docsearch

<langchain_pinecone.vectorstores.PineconeVectorStore at 0x7b08e0788220>

## **8. Let's Perform Similarity Search**

In [117]:
query = "YOLOv7 outperforms which models"

In [108]:
# Perform a similarity search in the Pinecone vector store using the specified query
# This searches for the top 'k' most relevant documents based on the query
docs = docsearch.similarity_search(query, k=3)

# Output the retrieved documents that match the similarity criteria
docs

[Document(id='69ef3d73-31c3-4d1c-be2c-cc053cbb60b6', metadata={}, page_content='YOLOv7-tiny 6.2 3.5 320 30.8% 47.3% 32.2% 10.0% 31.9% 52.2%\nimprovement -39% -49% - = = = -0.9 = +0.7\nYOLOR-E6 [81] 115.8M 683.2G 1280 55.7% 73.2% 60.7% 40.1% 60.4% 69.2%\nYOLOv7-E6 97.2M 515.2G 1280 55.9% 73.5% 61.1% 40.6% 60.3% 70.0%\nimprovement -19% -33% - +0.2 +0.3 +0.4 +0.5 -0.1 +0.8\nYOLOR-D6 [81] 151.7M 935.6G 1280 56.1% 73.9% 61.2% 42.4% 60.5% 69.9%\nYOLOv7-D6 154.7M 806.8G 1280 56.3% 73.8% 61.4% 41.3% 60.6% 70.1%\nYOLOv7-E6E 151.7M 843.2G 1280 56.8% 74.4% 62.1% 40.8% 62.1% 70.6%'),
 Document(id='eb6b8f9b-ff28-4621-9f9d-e830932b0b34', metadata={}, page_content='YOLOv5-L6 (r6.1) [23] 76.8M 445.6G 1280 63 - / 53.7% - -\nYOLOX-X [21] 99.1M 281.9G 640 58 51.5% / 51.1% - -\nYOLOv7-E6 97.2M 515.2G 1280 56 56.0% /55.9% 73.5% 61.2%\nYOLOR-E6 [81] 115.8M 683.2G 1280 45 55.8% / 55.7% 73.4% 61.1%\nPPYOLOE-X [85] 98.4M 206.6G 640 45 52.2% / 51.9% 69.9% 56.5%\nYOLOv7-D6 154.7M 806.8G 1280 44 56.6% /56.3% 74.0
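
Conceptually, a similarity search embeds the query and returns the `k` stored vectors that score highest under the index metric (cosine here). The toy sketch below does this by brute force over three-dimensional vectors. The vectors and ids are made up, and a managed index such as Pinecone's does not necessarily scan every vector this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings" keyed by a hypothetical chunk id
toy_store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
    "chunk-d": [-1.0, 0.0, 0.0],
}
results = top_k([1.0, 0.0, 0.0], toy_store, k=3)
print([doc_id for doc_id, _ in results])
# ['chunk-a', 'chunk-b', 'chunk-c']
```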

## **9. Query the Vector Database**

In [109]:
# Initialize an instance of the OpenAI language model (LLM)
llm = OpenAI()

In [110]:
# Create a RetrievalQA instance that combines a language model with a retriever
qa = RetrievalQA.from_chain_type(
    # Specify the language model to be used for question answering
    llm=llm,

    # Define how retrieved documents are combined; "stuff" places all of them into a single prompt for the LLM
    chain_type="stuff",

    # Use the document searcher as the retriever to fetch relevant documents for the QA process
    retriever=docsearch.as_retriever()
)
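
The "stuff" strategy can be pictured as plain string assembly: every retrieved chunk is pasted into one prompt together with the question. The sketch below illustrates the idea only. The prompt wording and the `build_stuff_prompt` helper are hypothetical and are not the template LangChain uses internally.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved texts into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short strings standing in for retrieved chunks
retrieved = ["YOLOv7-E6 97.2M 515.2G 1280", "YOLOR-E6 115.8M 683.2G 1280"]
prompt = build_stuff_prompt("YOLOv7 outperforms which models", retrieved)
print(prompt)
```

Since everything lands in a single prompt, this approach only works while the retrieved chunks together fit within the model's context window.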

In [112]:
# Invoke the question-answering system with the specified query
# This processes the query and retrieves an answer using the language model and the document retriever
qa.invoke(query)

{'query': 'YOLOv7 outperforms which models',
 'result': ' YOLOv7 outperforms all known object detectors including YOLOR-D6, YOLOv7-E6, YOLOX-X, YOLOv5-X6, PPYOLOE-X, YOLOv5-X, YOLOR-CSP, YOLOR-CSP-X, YOLOv7-tiny-SiLU, YOLOv7, and YOLOv7-X.'}

In [121]:
# Define a query string to search for specific information related to Rachel Green's experience
query2 = "Rachel Green Experience"

# Invoke the question-answering system with the defined query to get a response
response = qa.invoke(query2)

# Extract just the content of the 'result' field from the response dictionary
result_content = response.get('result')

# Print the extracted result content to the console
print(result_content)

 Rachel Green has a PhD in English from the University of Illinois at Urbana-Champaign and has also completed an MA in English. She has received various grants and awards for her academic achievements, including a Summer Research Grant and a Graduate College Conference Travel Grant from the University of Illinois. She has also presented at conferences and has published in academic journals and books. She currently works as an Associate Professor of English at Butler University in Indianapolis, IN.


In [124]:
# Start a loop that keeps accepting user input until the user types 'exit'
while True:
    try:
        # Prompt the user for input and store it in the 'user_input' variable
        user_input = input("Input Prompt: ")

        # Check if the user input is 'exit'
        if user_input.strip().lower() == 'exit':
            print('Exiting')
            # Leave the loop; calling sys.exit() in a notebook raises SystemExit and prints a traceback
            break

        # Check if the user input is an empty string
        if user_input.strip() == '':
            # If input is empty, skip to the next iteration of the loop
            continue

        # Call the question-answering system using the invoke method
        result = qa.invoke({'query': user_input})

        # Print the answer retrieved from the result, specifically the 'result' field
        print(f"Answer: {result['result']}")

    except Exception as e:
        # Handle any exceptions that occur and print an error message
        print(f"An error occurred: {e}")

Input Prompt: Tell me about yolo v7
Answer:  YOLOv7 is an object detection model that has surpassed all known object detectors in terms of speed and accuracy. It has a 56.8% average precision (AP) and can process images at a rate of 5 to 160 frames per second (FPS). It was trained only on the MS COCO dataset, without using any other datasets or pre-trained weights. Comparing YOLOv7 with YOLOR, it has a faster inference speed and a higher detection rate. It also has improvements in AP and parameter and computation reduction compared to YOLOv5. YOLOv7-D6 has a similar inference speed to YOLOR-E6 but with a higher AP by 0.8%.
Input Prompt: exit
Exiting
