<a href="https://colab.research.google.com/github/phaethonp/chat-pdf/blob/master/KnowledgeAssistant2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



> Notebook for Chromadb + Instructor-XL + OPENAI
https://gpt-index.readthedocs.io/en/latest/how_to/integrations/chatgpt_plugins.html

**The OpenAI ChatGPT Retrieval Plugin offers a centralized API specification for any document storage system to interact with ChatGPT. Because the plugin can be deployed on any service, more and more document retrieval services are likely to implement this spec, which lets them interact not only with ChatGPT but also with any LLM toolkit that uses a retrieval service.**

## Let's think together: using the OpenAI ChatGPT Retrieval Plugin API as middleware

The ChatGPT Plugin Integrations encompass several aspects, all aiming to improve the interaction and data exchange between the ChatGPT model and various data sources or storage systems. These integrations allow developers to manipulate and integrate a vast range of data with the model, enhancing its conversational abilities and usefulness. Below, I'll provide a breakdown of each aspect:

### ChatGPT Retrieval Plugin Integrations: 
The retrieval plugin integration provides an API that any document storage system can implement to interact with ChatGPT. By following this spec, document retrieval services can interact not only with ChatGPT but also with any large language model (LLM) toolkit that uses a retrieval service.

### Loading Data from LlamaHub into the ChatGPT Retrieval Plugin: 
LlamaHub offers more than 65 data loaders for various APIs and document formats, which can load data into the ChatGPT Retrieval Plugin. The plugin's /upsert endpoint can be used to load documents, converting them into the JSON format it expects.
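As a rough sketch of that conversion step, the snippet below wraps plain text strings in the JSON body that `/upsert` is understood to accept: a top-level `documents` list whose items carry `id`, `text`, and `metadata`. The helper name `build_upsert_payload`, the id scheme, and the `source_id` value are illustrative assumptions, not part of the plugin; the field names should be checked against the plugin's OpenAPI schema before relying on them.

```python
import json
import uuid


def build_upsert_payload(texts, source_id):
    """Wrap raw text strings in the JSON body assumed for POST /upsert."""
    documents = []
    for text in texts:
        documents.append(
            {
                # Hypothetical id scheme: one random UUID per document
                "id": str(uuid.uuid4()),
                "text": text,
                # Assumed metadata field; the plugin schema defines the full set
                "metadata": {"source_id": source_id},
            }
        )
    return {"documents": documents}


payload = build_upsert_payload(["First page text", "Second page text"], "MICA.pdf")
body = json.dumps(payload)  # this string would be sent as the request body
```

No request is sent here; the point is only the shape of the data that a loader would have to produce.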

### ChatGPT Retrieval Plugin Data Loader: 
This allows data to be loaded from any document store that implements the plugin API, into a LlamaIndex data structure, further enhancing the ease and scope of data utilization in the model.

### ChatGPT Retrieval Plugin Index: 
This enables the building of a vector index over any documents stored in a document store implementing the ChatGPT endpoint. The index is a vector index, allowing for top-k retrieval, thus facilitating the efficient searching and retrieval of documents based on their content.
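To make "top-k retrieval" concrete, here is a minimal, library-free sketch: every document has an embedding vector, the query is embedded the same way, and the k documents with the highest cosine similarity are returned. The vectors and document ids are made up for illustration, and a real vector store would use an approximate index rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), doc_id)
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy 3-dimensional embeddings (made up for illustration)
doc_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
best = top_k([1.0, 0.0, 0.0], doc_vectors, k=2)
```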

Illustration of the role of ChatGPT Plugin Retrieval API 

                      


                           +-------------------+
                           |                   |
                           |   LLM Toolkit     |
                           |                   |
                           +---------+---------+
                                     |
                                     |
                           +---------v---------+
                           |                   |
                           |  ChatGPT Plugin   |
                           |  Retrieval API    |
                           |                   |
                           +---------+---------+
                                     |
                +--------------------+--------------------+
                |                    |                    |
    +-----------v---------+  +-------v-------+  +---------v--------+
    |                     |  |               |  |                  |
    |   Document Store    |  |  LlamaHub     |  |  Other DocStores |
    |                     |  |               |  |                  |
    +-----------+---------+  +-------+-------+  +---------+--------+
                |                    |                    |
                |                    |                    |
    +-----------v---------+  +-------v-------+  +---------v--------+
    |                     |  |               |  |                  |
    |  ChatGPT            |  |  ChatGPT      |  |  ChatGPT         |
    |  Retrieval          |  |  Retrieval    |  |  Retrieval       |
    |  Plugin Index       |  |  Plugin Index |  |  Plugin Index    |
    |                     |  |               |  |                  |
    +---------------------+  +---------------+  +------------------+

                       
In this diagram:

1. The "LLM Toolkit" represents any language model toolkit, which interacts with the "ChatGPT Plugin Retrieval API."

2. The "ChatGPT Plugin Retrieval API" acts as a middleman, enabling interaction between the "LLM Toolkit" and various data sources (i.e., "Document Store," "LlamaHub," and "Other DocStores").

3. The data sources at the bottom represent various document storage and retrieval systems that implement the ChatGPT API, allowing them to exchange data with the model.

4. A "ChatGPT Retrieval Plugin Index" sits under each data source ("Document Store," "LlamaHub," and "Other DocStores"). Each one represents the vector index built over the documents stored in that source.

This high-level architecture allows any LLM to access data from various sources seamlessly, provided these sources implement the API. The API essentially standardizes how these systems interact, making it easier for developers to integrate and manipulate a wide range of data.

An app I found that uses exactly the same concept:
https://twitter.com/s_jobs6/status/1618346125697875968?s=20&t=RJhQu2mD0-zZNGfq65xodA

## **API GPT-RETRIEVAL IS SIMILAR TO WHAT WE HAVE BUILT**
1. services/file.py = Get text from PDF

2. chatgpt-retrieval-plugin/datastore/datastore.py = Abstract methods for datastore providers

3. chatgpt-retrieval-plugin/datastore/providers/chroma_datastore.py = CRUD methods for ChromaDB
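The split between an abstract datastore and concrete providers can be pictured with the simplified sketch below. It is not the plugin's actual class: the method names, the synchronous signatures, and the dict-backed `InMemoryDataStore` are assumptions chosen to keep the example self-contained. A ChromaDB provider would implement the same interface against a collection instead of a dict.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical, simplified interface that each provider would implement."""

    @abstractmethod
    def upsert(self, doc_id, text):
        """Insert the document, or overwrite it if the id already exists."""

    @abstractmethod
    def get(self, doc_id):
        """Return the stored text, or None if the id is unknown."""

    @abstractmethod
    def delete(self, doc_id):
        """Remove the document; return True if something was deleted."""


class InMemoryDataStore(DataStore):
    """Toy provider backed by a dict, standing in for a ChromaDB provider."""

    def __init__(self):
        self._docs = {}

    def upsert(self, doc_id, text):
        self._docs[doc_id] = text

    def get(self, doc_id):
        return self._docs.get(doc_id)

    def delete(self, doc_id):
        return self._docs.pop(doc_id, None) is not None


store = InMemoryDataStore()
store.upsert("mica-1", "first chunk")
store.upsert("mica-1", "first chunk, revised")  # same id: overwrite, not duplicate
```

Because callers only depend on `DataStore`, swapping the backing store does not change the code that loads or queries documents.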


# Let's finalize what we have done so far and test the results
1. Establish a streamlined workflow in Colab that incorporates ChromaDB, Instructor-XL, and ChatGPT, leveraging our existing work.
2. Introduce `unstructured` PDF-to-image conversion using PyTesseract, and contrast the outcomes.

3. Use the upsert function from chatgpt-retrieval-plugin/datastore/providers/chroma_datastore.py. There are two distinct methods for uploading data: one for files and another for URLs.

4. Incorporate chat mode functionality.
5. Store all requests and responses in ChromaDB.

Reference:
https://gpt-index.readthedocs.io/en/latest/how_to/integrations/chatgpt_plugins.html

"Loading Data from LlamaHub into the ChatGPT Retrieval Plugin
The ChatGPT Retrieval Plugin defines an /upsert endpoint for users to load documents. This offers a natural integration point with LlamaHub, which offers over 65 data loaders from various API’s and document formats."

6. Create the Python script. (We can discuss the available options: build our own, or fork the ChatGPT plugin and build our own from it?)

7. Perform a comparison analysis. We're experiencing an issue with the current responses, and I'll present some results generated using a Chat2PDF plugin by another team as a reference for comparison.
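For the plan item about storing requests and responses, one possible shape for the stored data is sketched below. Each exchange becomes two records (question and answer) that share an exchange id. The function name `build_chat_records`, the metadata keys, and the sample strings are hypothetical; the commented-out call shows how the lists might be handed to a ChromaDB collection, assuming the collection has an embedding function available.

```python
import uuid
from datetime import datetime, timezone


def build_chat_records(question, answer):
    """Turn one question/answer exchange into two storable records."""
    exchange_id = str(uuid.uuid4())  # hypothetical key linking the two records
    timestamp = datetime.now(timezone.utc).isoformat()
    ids, documents, metadatas = [], [], []
    for role, text in (("user", question), ("assistant", answer)):
        ids.append(f"{exchange_id}-{role}")
        documents.append(text)
        metadatas.append(
            {"role": role, "exchange_id": exchange_id, "timestamp": timestamp}
        )
    return ids, documents, metadatas


ids, documents, metadatas = build_chat_records(
    "What is MICA?", "An example answer."
)
# With a live collection this could be stored via (not executed here):
# chroma_collection.add(ids=ids, documents=documents, metadatas=metadatas)
```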









# 1. Install dependencies

In [1]:
!pip install chromadb llama-index openai PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting llama-index
  Downloading llama_index-0.6.21.post1-py3-none-any.whl (476 kB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting requests>=2.28 (from chromadb)
  Downloading requests-2.31.0-py3-none-any.whl (62 

In [None]:
!pip install InstructorEmbedding sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import libraries
from llama_index import Document, StorageContext
from llama_index import load_index_from_storage
from llama_index import GPTVectorStoreIndex, GPTListIndex, GPTKeywordTableIndex, GPTTreeIndex
from llama_index.indices.knowledge_graph.base import GPTKnowledgeGraphIndex
from llama_index.indices.keyword_table import GPTSimpleKeywordTableIndex
from llama_index.indices.composability import ComposableGraph
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
#import PyPDF2
import os

In [None]:
# Document, StorageContext, and GPTListIndex are already imported in the cell above
from llama_index import SimpleDirectoryReader
from llama_index.vector_stores import ChromaVectorStore
import chromadb
import PyPDF2

In [None]:
!pip install unstructured pdf2image pytesseract

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unstructured
  Downloading unstructured-0.7.1-py3-none-any.whl (1.3 MB)
Collecting pdf2image
  Downloading pdf2image-1.16.3-py3-none-any.whl (11 kB)
Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting argilla (from unstructured)
  Downloading argilla-1.8.0-py3-none-any.whl (2.4 MB)
Collecting msg-parser (from unstructured)
  Downloading msg_parser-1.2.0-py2.py3-none-any.whl (101 kB)
Collecting pdfminer.six (from unstructured)
  Downloading pdfminer.six-20221105-py3-none-any.whl (5

In [None]:
#install pdf view functionality 
!pip install pdf2image
!apt-get install -y poppler-utils  # necessary for pdf2image


# 2. Authentication

In [None]:
import os
from getpass import getpass

import openai


def require_chatgptkey():
    """Return the OpenAI API key, prompting the user if it is not already set."""
    # Never hardcode keys in the notebook: read the environment or prompt instead
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = getpass("Enter your OpenAI API key: ")
    return key


key = require_chatgptkey()
os.environ["OPENAI_API_KEY"] = key
openai.api_key = key

# Connect to Chromadb

In [None]:
import chromadb
from chromadb.config import Settings


def connect_to_chromadb():
    """Connect to the remote ChromaDB server over its REST API."""
    settings = Settings(
        chroma_api_impl="rest",
        chroma_server_host="3.8.31.17",
        chroma_server_http_port="8000",
    )
    return chromadb.Client(settings)


chroma_client = connect_to_chromadb()

In [None]:
chroma_collection = chroma_client.get_or_create_collection("mica")


In [None]:
chroma_client.list_collections()

[Collection(name=wiki_test),
 Collection(name=5d334f69-ecef-48f5-afb8-e61da7c21c23),
 Collection(name=gpt-retrieval),
 Collection(name=instructionXL),
 Collection(name=phaethontest),
 Collection(name=mica)]

In [None]:
chroma_collection.count()

185

In [None]:
chroma_collection.peek()

{'ids': ['f2cc9f86-d05e-4acc-9595-71910df4f7ad',
  '1cf36d4b-f6eb-4433-ae60-2975c477ab4b',
  '74cbd204-cbee-4bd9-9bdf-ac2fd28e6504',
  'd5a0382f-8825-4d7a-ac25-7c05ea8f7442',
  'fc956555-331e-4090-8a61-4556748178b2',
  '2cd7d829-23ab-4a8a-8b73-1d9a20a9bfa8',
  '9cb098a8-e054-4c01-b6c2-b7463445de8b',
  '8de41171-34fc-4363-8697-f1f9c60341f8',
  '8aaf87d8-8c5e-4caf-8bde-776e6e3c08d9',
  '37181752-7a16-4768-901a-d69e23c8376a'],
 'embeddings': [[-0.0013718354748561978,
   -0.017751986160874367,
   -0.010306310839951038,
   -0.025759093463420868,
   -0.013808585703372955,
   0.0013960640644654632,
   -0.009657989256083965,
   0.0004983555991202593,
   -0.004932592622935772,
   -0.026614611968398094,
   0.03625255078077316,
   0.00904308632016182,
   -0.014904716983437538,
   -0.0018781280377879739,
   -0.0013075046008452773,
   0.014784409664571285,
   0.010627130046486855,
   -0.02214987948536873,
   0.013253835961222649,
   0.004100468009710312,
   -0.002544830087572336,
   0.0008120731799

# 3. Specify file path

In [None]:
pdf_file_path = "/content/drive/MyDrive/drive_gpt/MICA_PDF/MICA.pdf"

# Render PDF to compare/contrast with query results

In [None]:
from pdf2image import convert_from_path
from IPython.display import display

# Convert each page of the PDF to a PIL image
pages = convert_from_path(pdf_file_path)

# PIL images render directly in the notebook via display()
for page in pages:
    display(page)


# 4. Create Data Service

Extract data from PDF

In [None]:
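def extract_pdf(file_path):
    """Extract the text of every page in the PDF and return it as one string."""
    reader = PyPDF2.PdfReader(file_path)
    list_text = [page.extract_text() for page in reader.pages]
    # Join pages with a newline so words at page boundaries do not run together
    text = "\n".join(list_text)
    return text


text = extract_pdf(pdf_file_path)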

# 5. Create embeddings with Hugging Face Instructor / save embeddings to ChromaDB

In [None]:
# use instructor model
from llama_index import ServiceContext, LLMPredictor
from llama_index import LangchainEmbedding
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings


instructor_model = LangchainEmbedding(HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl"))
gpt_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))

service_context = ServiceContext.from_defaults(embed_model=instructor_model, llm_predictor=gpt_predictor)

load INSTRUCTOR_Transformer


# 6. Create Index Service

vector index:  
- llama embedding:
    - index constructing:\
    text-embedding-ada-002-v2, 2 requests\
    10,812 prompt + 0 completion = 10,812 tokens
    - querying:\
    gpt-3.5-turbo-0301, 1 request\
    1,739 prompt + 47 completion = 1,786 tokens\
    text-embedding-ada-002-v2, 1 request\
    8 prompt + 0 completion = 8 tokens
- instructor embedding:
    - querying:
    gpt-3.5-turbo-0301, 1 request\
    1,722 prompt + 29 completion = 1,751 tokens
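The totals implied by these usage notes can be checked with a few lines of arithmetic. The numbers below are copied from the notes; the dictionary layout and the helper `total_tokens` are only for illustration. The Instructor run lists no embedding tokens, presumably because those embeddings are computed locally rather than through the OpenAI API.

```python
# Token counts copied from the usage notes: (prompt, completion) per request group
usage = {
    "llama embedding": {
        "index construction (text-embedding-ada-002-v2)": (10812, 0),
        "query (gpt-3.5-turbo-0301)": (1739, 47),
        "query (text-embedding-ada-002-v2)": (8, 0),
    },
    "instructor embedding": {
        "query (gpt-3.5-turbo-0301)": (1722, 29),
    },
}


def total_tokens(entries):
    """Sum prompt and completion tokens over all request groups."""
    return sum(prompt + completion for prompt, completion in entries.values())


totals = {name: total_tokens(entries) for name, entries in usage.items()}
for name, total in totals.items():
    print(f"{name}: {total} tokens")
```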


In [None]:
def create_vector_index_from_text(text, service_context):
    documents = [Document(text)]
    index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
    return index
index = create_vector_index_from_text(text, service_context)

KeyboardInterrupt: ignored

# 7. Save index data to disk

In [None]:
def save_index(index, path='/content/'):
    index.storage_context.persist(persist_dir=path)
save_index(index, "/content/drive/MyDrive/drive_gpt/MICA_PDF")

# 8. Load data to llm query

In [None]:
def load_index(service_context, path='/content/'):
    storage_context = StorageContext.from_defaults(persist_dir=path)
    index = load_index_from_storage(storage_context, service_context=service_context)
    return index
index = load_index(service_context, "/content/drive/MyDrive/drive_gpt/MICA_PDF")

# 9. Create Question

In [None]:
question = "    "

# 10. Query Services Single Document

In [None]:
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, "Tell me what love is")
print(response)


Love is a complex emotion that is difficult to define. It is often described as a strong feeling of affection and care for another person. It can also be used to describe a strong feeling of fondness and attachment to a place, thing, or activity. Love is often seen as a powerful force that can bring people together and create strong bonds.


In [None]:
import pandas as pd
eval_df = pd.DataFrame(
    {
        "Question": question,
        "Response": str(response),
        "Source1": response.source_nodes[0].node.get_text()[:1000] + "...",
        # "Source2": response.source_nodes[1].node.get_text()[:1000] + "...",
        # "Score": response.source_nodes[0].score
     
    },
    index=[0]
)
eval_df = eval_df.style.set_properties(
    **{
        'inline-size': '400px',
        'overflow-wrap': 'break-word',
        'text-align': 'left'
    },
    subset=["Response",
            "Source1", 
            # "Source2",
            # "Score"
            ]
)
display(eval_df)

NameError: ignored

In [None]:
# embedding model
str(index.service_context.embed_model._langchain_embedding.model_name)

'hkunlp/instructor-xl'

In [None]:
# predictor model
str(index.service_context.llm_predictor._llm.model_name)

'gpt-3.5-turbo'

# Query results

## Use vector store index

In [None]:
# use llama embedding
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)


Toronto became a destination for immigrants in the 19th century, particularly during the wave of Irish immigrants fleeing the Great Irish Famine. By 1851, the Irish-born population had become the largest single ethnic group in the city.


In [None]:
# use instructor embedding
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)

Following the elimination of racially based immigration policies by the late 1960s, Toronto became a destination for immigrants from all parts of the world.


In [None]:
import json

from llama_index import LLMPredictor
from llama_index.callbacks.schema import CBEventType


def predict(self, prompt, **prompt_args):
    """Patched LLMPredictor.predict that also dumps the prompt arguments to disk."""
    llm_payload = {**prompt_args}
    llm_payload["template"] = prompt
    event_id = self.callback_manager.on_event_start(
        CBEventType.LLM,
        payload=llm_payload,
    )

    # Debugging aid: save the prompt arguments to a file for inspection
    with open("cac.json", "w") as outfile:
        outfile.write(json.dumps(prompt_args))

    formatted_prompt = prompt.format(llm=self._llm, **prompt_args)
    llm_prediction = self._predict(prompt, **prompt_args)
    # logging.debug(llm_prediction)

    # We assume that the value of formatted_prompt is exactly the thing
    # eventually sent to OpenAI, or whatever LLM downstream
    prompt_tokens_count = self._count_tokens(formatted_prompt)
    prediction_tokens_count = self._count_tokens(llm_prediction)
    self._total_tokens_used += prompt_tokens_count + prediction_tokens_count
    self.callback_manager.on_event_end(
        CBEventType.LLM,
        payload={"response": llm_prediction, "formatted_prompt": formatted_prompt},
        event_id=event_id,
    )
    return llm_prediction, formatted_prompt


# Monkey-patch the predictor so every LLM call goes through the version above
LLMPredictor.predict = predict

## Use tree index

In [None]:
# use llama embedding
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)

Following the elimination of racially based immigration policies by the late 1960s, Toronto became a destination for immigrants from all parts of the world.


In [None]:
# use instructor embedding
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)

Following the elimination of racially based immigration policies by the late 1960s, Toronto became a destination for immigrants from all parts of the world.


## Use keyword table index

In [None]:
# use instructor embedding
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)



The provided context is not relevant to the question of when Toronto became a destination for immigrants. Therefore, the original answer remains the same:

Toronto became a destination for immigrants in the late 19th century into the early 20th century, particularly Germans, French, Italians, and Jews. They were soon followed by Russians, Poles, and other Eastern European nations, in addition to the Chinese entering from the West. Following the elimination of racially based immigration policies by the late 1960s, Toronto became a destination for immigrants from all parts of the world. By the 1980s, Toronto had surpassed Montreal as Canada's most populous city and chief economic hub.


In [None]:
def create_tree_index_from_text(text, service_context):
    documents = [Document(text)]
    index = GPTTreeIndex.from_documents(documents, service_context=service_context)
    return index
index = create_tree_index_from_text(text, service_context)

In [None]:
def create_list_index_from_text(text, service_context):
    documents = [Document(text)]
    # Build a list index here (the original cell built a tree index by mistake)
    index = GPTListIndex.from_documents(documents, service_context=service_context)
    return index
index = create_list_index_from_text(text, service_context)

In [None]:
def create_kwtable_index_from_text(text, service_context):
    documents = [Document(text)]

    # GPTKeywordTableIndex uses llama embed
    from llama_index import GPTKeywordTableIndex
    index = GPTKeywordTableIndex.from_documents(documents, service_context=service_context)
    return index
index = create_kwtable_index_from_text(text, service_context)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def create_kg_index_from_text(text, service_context):
    documents = [Document(text)]

    # GPTKnowledgeGraphIndex uses llama embed
    from llama_index.indices.knowledge_graph import GPTKnowledgeGraphIndex
    index = GPTKnowledgeGraphIndex.from_documents(documents, service_context=service_context)
    return index
index = create_kg_index_from_text(text, service_context)



# Not in use

In [None]:
# use the default (text-embedding-ada-002) embedding model
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

NameError: ignored