<a href="https://colab.research.google.com/github/phaethonp/we-ai/blob/main/KnowledgeAssistant2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



> Notebook for ChromaDB + Instructor-XL + OpenAI
https://gpt-index.readthedocs.io/en/latest/how_to/integrations/chatgpt_plugins.html

**We are developing a centralized API specification that lets any document storage system interact with large language models such as ChatGPT. The ChatGPT Retrieval Plugin provides a solid base for developing our API. Since the plugin can be deployed on any service, more and more document retrieval services are likely to implement this spec, which allows them to interact not only with ChatGPT but also with any LLM toolkit that may use a retrieval service.**

## Build a central API as middleware

The ChatGPT Plugin Integrations encompass several aspects, all aiming to improve the interaction and data exchange between the ChatGPT model and various data sources or storage systems. These integrations allow developers to manipulate and integrate a vast range of data with the model, enhancing its conversational abilities and usefulness. Below, I'll provide a breakdown of each aspect:

### ChatGPT Retrieval Plugin Integrations:
The retrieval plugin integration provides an API that any document storage system can implement to interact with ChatGPT. By following this spec, document retrieval services can interact not only with ChatGPT but also with any large language model (LLM) toolkit that may use a retrieval service.

### Loading Data from LlamaHub into the ChatGPT Retrieval Plugin:
LlamaHub offers more than 65 data loaders for various APIs and document formats, which can load data into the ChatGPT Retrieval Plugin. The plugin's /upsert endpoint can be used to load documents, converting them into the JSON format it expects.
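As a rough illustration of the conversion step, the sketch below wraps plain text strings in the request body that, to my understanding of the plugin's published schema, `/upsert` accepts: a `documents` list whose items carry `id`, `text`, and optional `metadata`. The helper name `build_upsert_payload` and the sample values are hypothetical, and no request is actually sent.

```python
import json
import uuid


def build_upsert_payload(texts, source="file", author=None):
    """Wrap raw text strings in an /upsert-style request body (assumed schema)."""
    documents = []
    for text in texts:
        metadata = {"source": source}
        if author is not None:
            metadata["author"] = author
        documents.append(
            {
                "id": str(uuid.uuid4()),  # client-side id; a hypothetical choice
                "text": text,
                "metadata": metadata,
            }
        )
    return {"documents": documents}


payload = build_upsert_payload(["First document.", "Second document."], author="demo")
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be posted to a running plugin instance with an HTTP client and a bearer token; that part is left out because it depends on the deployment.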

### ChatGPT Retrieval Plugin Data Loader:
The data loader pulls data from any document store that implements the plugin API into a LlamaIndex data structure, which widens the range of data the model can draw on.

### ChatGPT Retrieval Plugin Index:
The index builds a vector index over any documents stored in a document store that implements the ChatGPT endpoint. Because it is a vector index, it supports top-k retrieval, so documents can be searched and retrieved efficiently based on their content.
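To make top-k retrieval concrete, here is a minimal brute-force sketch: it scores every stored vector against the query by cosine similarity and keeps the best `k`. The three-dimensional vectors and document ids are made up for illustration; a real vector store would use model-generated embeddings and usually an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings" (hypothetical values).
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
result = top_k([1.0, 0.05, 0.0], doc_vecs, k=2)
print(result)
```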

Illustration of the role of ChatGPT Plugin Retrieval API




                           +-------------------+
                           |                   |
                           |    LLM Toolkit    |
                           |                   |
                           +---------+---------+
                                     |
                              | MODELS ROUTER |
                                     |
                           +---------v---------+
                           |                   |
                           |  ChatGPT Plugin   |
                           |  Retrieval API    |
                           |                   |
                           +---------+---------+
                                     |
                +--------------------+--------------------+
                |                    |                    |
    +-----------v---------+  +-------v-------+  +---------v--------+
    |                     |  |               |  |                  |
    |   Document Store    |  | Data Loaders  |  | Other DocStores  |
    |                     |  |               |  |                  |
    +-----------+---------+  +-------+-------+  +------------------+
                |                    |
                |                    |
    +-----------v---------+  +-------v-------+
    |  Meta Index /       |  | Services APIs |
    |  Question Router    |  |               |
    +---------------------+  +---------------+


In this diagram:

1. The "LLM Toolkit" represents any language model toolkit, which interacts with the "ChatGPT Plugin Retrieval API."

2. The "ChatGPT Plugin Retrieval API" acts as middleware, enabling interaction between the "LLM Toolkit" and various data sources (i.e., "Document Store," "Data Loaders" such as LlamaHub, and "Other DocStores").

3. The data sources at the bottom represent various document storage and retrieval systems that implement the ChatGPT API, allowing them to exchange data with the model.

4. A "ChatGPT Retrieval Plugin Index" sits under each data source ("Document Store," "Data Loaders," and "Other DocStores"). Each one represents the vector index built over the documents held in that store.

This high-level architecture allows any LLM to access data from various sources seamlessly, provided these sources implement the API. The API essentially standardizes how these systems interact, making it easier for developers to integrate and manipulate a wide range of data.

An app I found that implements exactly the same concept:
https://twitter.com/s_jobs6/status/1618346125697875968?s=20&t=RJhQu2mD0-zZNGfq65xodA

## **API GPT-Retrieval Architecture**
1. services/file.py = Get text from PDF

2. chatgpt-retrieval-plugin/datastore/datastore.py = Abstract methods for datastore providers

3. chatgpt-retrieval-plugin/datastore/providers/chroma_datastore.py = CRUD methods for ChromaDB
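The split between an abstract datastore and concrete providers can be sketched as follows. This is a hypothetical, simplified stand-in and not the plugin's actual code: the real `datastore.py` may differ, for instance in method names and in being asynchronous, and the toy `InMemoryDataStore` ranks by shared words instead of embeddings.

```python
from abc import ABC, abstractmethod


class DataStore(ABC):
    """Hypothetical, simplified provider interface (synchronous)."""

    @abstractmethod
    def upsert(self, documents):
        """Store documents given as {id: text}; return the list of ids written."""

    @abstractmethod
    def query(self, text, top_k=3):
        """Return up to top_k (id, text) pairs relevant to the query text."""

    @abstractmethod
    def delete(self, ids):
        """Remove the given ids; return how many were deleted."""


class InMemoryDataStore(DataStore):
    """Toy provider that ranks documents by shared lowercase words."""

    def __init__(self):
        self._docs = {}

    def upsert(self, documents):
        self._docs.update(documents)
        return list(documents)

    def query(self, text, top_k=3):
        words = set(text.lower().split())
        scored = [
            (len(words & set(body.lower().split())), doc_id)
            for doc_id, body in self._docs.items()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep only documents that share at least one word with the query.
        return [(doc_id, self._docs[doc_id]) for score, doc_id in scored[:top_k] if score > 0]

    def delete(self, ids):
        removed = [self._docs.pop(doc_id) for doc_id in ids if doc_id in self._docs]
        return len(removed)


store = InMemoryDataStore()
store.upsert({"a": "crypto asset regulation summary", "b": "history of Toronto"})
hits = store.query("what is the regulation about", top_k=1)
print(hits)
```

A Chroma-backed provider would implement the same three methods against a Chroma collection, which is what lets the API layer stay unchanged when the storage backend is swapped.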


# Finalizing flows and testing results
### **1. Flow 1 --> Create text from PDF --> Create embeddings and index --> Save to Chroma**
### **2. Retrieve index from Chroma --> Make query**
llama_index reference: https://gpt-index.readthedocs.io/en/latest/how_to/index/vector_store_guide.html#vector-store-index



3. Introduce unstructured PDF-to-image conversion using PyTesseract, then contrast the outcomes.

4. Use the upsert function from chatgpt-retrieval-plugin/datastore/providers/chroma_datastore.py. There are two distinct methods for uploading data: one for files and another for URLs.

5. Incorporate chat mode functionality, storing all requests and responses in ChromaDB.

Reference
https://gpt-index.readthedocs.io/en/latest/how_to/integrations/chatgpt_plugins.html

"Loading Data from LlamaHub into the ChatGPT Retrieval Plugin
The ChatGPT Retrieval Plugin defines an /upsert endpoint for users to load documents. This offers a natural integration point with LlamaHub, which offers over 65 data loaders from various API’s and document formats."

6. Create the Python script. (We can discuss the available options: build our own, or fork the ChatGPT plugin and build on top of it?)

7. Perform a comparison analysis. We're experiencing an issue with the current responses, and I'll present some results generated using a Chat2PDF plugin by another team as a reference for comparison.









# 1. Install dependencies

In [None]:
!pip install chromadb llama-index openai PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chromadb
  Downloading chromadb-0.3.26-py3-none-any.whl (123 kB)
Collecting llama-index
  Downloading llama_index-0.6.22-py3-none-any.whl (493 kB)
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting requests>=2.28 (from chromadb)
  Downloading requests-2.31.0-py3-none-any.whl (62 kB)

In [None]:
!pip install InstructorEmbedding sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting InstructorEmbedding
  Downloading InstructorEmbedding-1.0.1-py2.py3-none-any.whl (19 kB)
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.30.1-py3-none-any.whl (7.2 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import libraries
from llama_index import Document, StorageContext
from llama_index import load_index_from_storage
from llama_index import GPTVectorStoreIndex, GPTListIndex, GPTKeywordTableIndex, GPTTreeIndex
from llama_index.indices.knowledge_graph.base import GPTKnowledgeGraphIndex
from llama_index.indices.keyword_table import GPTSimpleKeywordTableIndex
from llama_index.indices.composability import ComposableGraph
from llama_index.indices.query.query_transform.base import DecomposeQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine
#import PyPDF2
import os

In [None]:
from llama_index import SimpleDirectoryReader, GPTListIndex
from llama_index.vector_stores import ChromaVectorStore
from llama_index.storage.storage_context import StorageContext
from llama_index.readers.schema.base import Document
import chromadb
import PyPDF2

In [None]:
!pip install unstructured pdf2image pytesseract

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting unstructured
  Downloading unstructured-0.7.2-py3-none-any.whl (1.3 MB)
Collecting pdf2image
  Downloading pdf2image-1.16.3-py3-none-any.whl (11 kB)
Collecting pytesseract
  Downloading pytesseract-0.3.10-py3-none-any.whl (14 kB)
Collecting argilla (from unstructured)
  Downloading argilla-1.8.0-py3-none-any.whl (2.4 MB)
Collecting msg-parser (from unstructured)
  Downloading msg_parser-1.2.0-py2.py3-none-any.whl (101 kB)
Collecting pdfminer.six (from unstructured)
  Downloading pdfminer.six-20221105-py3-none-any.whl (5

In [None]:
#install pdf view functionality
!pip install pdf2image
!apt-get install -y poppler-utils  # necessary for pdf2image


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 38 not upgraded.
Need to get 174 kB of archives.
After this operation, 754 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 poppler-utils amd64 0.86.1-0ubuntu1.1 [174 kB]
Fetched 174 kB in 0s (633 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 122541 files and directories currently installed.)
Preparing to unpack .../poppler-utils_0.86.1-0ubuntu1.1_amd64.deb ...
Unpacking poppler-utils (0.86.1-0ubuntu1.1) ...
Setting up poppler-utils (0.86.1-0ubuntu1.1) ...
Processing triggers for man-db (2.9.1-1) ...


# 2. Authentication

In [None]:
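import os
from getpass import getpass

import openai


def require_chatgptkey():
    """Return the OpenAI API key from the environment, prompting the user if it is unset."""
    # Avoid hardcoding the key in the notebook.
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = getpass("OpenAI API key: ")
    return key


key = require_chatgptkey()
os.environ["OPENAI_API_KEY"] = key
openai.api_key = key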

# Connect to Chromadb

In [None]:
from chromadb.config import Settings
def connect_to_chromadb():
    chroma_client = chromadb.Client(Settings(chroma_api_impl="rest",
                                        chroma_server_host="3.8.31.17",
                                        chroma_server_http_port="8000"
                                    ))
    return chroma_client
chroma_client = connect_to_chromadb()

In [None]:
chroma_collection = chroma_client.get_or_create_collection("localgpt")



# Info about chroma DB
Schema of the `embeddings` table in the ClickHouse backend:

7de01789766e :) show create table embeddings

SHOW CREATE TABLE embeddings

Query id: 37b582d4-3e43-41d3-95be-a2acf4e1f970

CREATE TABLE default.embeddings
(
    `collection_uuid` UUID,
    `uuid` UUID,
    `embedding` Array(Float64),
    `document` Nullable(String),
    `id` Nullable(String),
    `metadata` Nullable(String)
)
ENGINE = MergeTree
ORDER BY collection_uuid
SETTINGS index_granularity = 8192

1 row in set. Elapsed: 0.003 sec.

# Chromadb Query Tools

In [None]:
chroma_collection.get(
    include=["embeddings"]
)

In [None]:
chroma_collection.get(
    include=["documents"]
)

In [None]:
chroma_client.list_collections()

[Collection(name=wiki_test),
 Collection(name=mica_summary),
 Collection(name=gpt-retrieval),
 Collection(name=instructionXL),
 Collection(name=phaethontest),
 Collection(name=mica)]

In [None]:
chroma_collection.count()

11

In [None]:
chroma_collection.peek()

{'ids': ['72cb4c83-bdcc-4cf1-9985-086a70416575',
  '7d596577-9dd5-44bf-ab4f-75b3d8ff2cdd',
  'c2f51664-6b76-4423-aac3-e6067645477c',
  'fa253ddb-637f-4b51-85b5-330975a73683',
  '1c0ac445-8615-4b8a-802c-dee6e479c413',
  '31719da8-5120-4988-a5d9-cc6ed6e85c7c',
  'e613cbbf-71a3-4d32-b2da-f9e74e7e916b',
  'cc5207e3-0bef-488d-b92c-94b7f47290f9',
  'f92939d9-3c26-4a37-b0fe-0a7f6452a166',
  'e54d3738-a4da-4218-8698-24b835f80132'],
 'embeddings': [[0.0007445769151672721,
   -0.012351022101938725,
   -0.010301362723112106,
   -0.05916816368699074,
   -0.01656973920762539,
   0.010732521302998066,
   -0.014991036616265774,
   -0.0025355415418744087,
   -0.0028489602264016867,
   -0.020204734057188034,
   0.03417425602674484,
   -0.004318214487284422,
   -0.01988634094595909,
   -0.007429186254739761,
   0.010964683257043362,
   0.008556830696761608,
   0.01798924431204796,
   -0.0014244801132008433,
   0.0008847033604979515,
   0.007170491386204958,
   -0.003674793988466263,
   -0.00807924009859

# 3. Enable users to load files from Google Drive: specify the file path

In [None]:
pdf_file_path = "/content/drive/MyDrive/drive_gpt/Mica_summary/MarketsCryptoSummary.pdf"

# Render PDF to compare/contrast with query results

In [None]:
from pdf2image import convert_from_path
from IPython.display import display

# convert the PDF to a list of PIL images, one per page
pages = convert_from_path(pdf_file_path)

# display the images; PIL images render directly in the notebook,
# so they should not be wrapped in IPython's Image (which expects a file path or raw bytes)
for page in pages:
    display(page)

# 4. Extract & Load Data Service

Extract data from PDF

In [None]:
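def extract_pdf(file_path):
    """Return the text of every page in the PDF, separated by newlines."""
    reader = PyPDF2.PdfReader(file_path)
    # Guard against pages with no extractable text.
    list_text = [page.extract_text() or "" for page in reader.pages]
    # Join with a newline so the last word of one page does not fuse with the first word of the next.
    text = "\n".join(list_text)
    return text
text = extract_pdf(pdf_file_path)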

TO DO: introduce higher resolution in data extraction
https://github.com/phaethonp/chat-pdf/blob/master/src/ingest.py

# 5. Use the Hugging Face Instructor model to create embeddings (faster, cheaper, and open source)

In [None]:
# use instructor model
from llama_index import ServiceContext, LLMPredictor
from llama_index import LangchainEmbedding
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.huggingface import HuggingFaceInstructEmbeddings


instructor_model = LangchainEmbedding(HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl"))
gpt_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))

service_context = ServiceContext.from_defaults(embed_model=instructor_model, llm_predictor=gpt_predictor)

load INSTRUCTOR_Transformer
max_seq_length  512


# 6. Create Index Service

vector index:
- llama embedding:
    - index constructing:\
    text-embedding-ada-002-v2, 2 requests\
    10,812 prompt + 0 completion = 10,812 tokens
    - querying:\
    gpt-3.5-turbo-0301, 1 request\
    1,739 prompt + 47 completion = 1,786 tokens\
    text-embedding-ada-002-v2, 1 request\
    8 prompt + 0 completion = 8 tokens
- instructor embedding:
    - querying:
    gpt-3.5-turbo-0301, 1 request\
    1,722 prompt + 29 completion = 1,751 tokens
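
The figures above can be totalled to compare how many tokens are sent to OpenAI in each setup. The sketch assumes that the Instructor run sends nothing to OpenAI for index construction or query embedding because those embeddings are computed locally; that is my reading of the notes, not something they state explicitly. The dictionary labels are my own.

```python
# Token counts copied from the usage notes above.
default_embedding = {
    "index construction (text-embedding-ada-002-v2)": 10_812,
    "query (gpt-3.5-turbo-0301)": 1_786,
    "query embedding (text-embedding-ada-002-v2)": 8,
}
instructor_embedding = {
    "query (gpt-3.5-turbo-0301)": 1_751,
}

default_total = sum(default_embedding.values())
instructor_total = sum(instructor_embedding.values())
saved = default_total - instructor_total

print(f"default embedding:    {default_total:,} OpenAI tokens")
print(f"instructor embedding: {instructor_total:,} OpenAI tokens")
print(f"difference:           {saved:,} tokens")
```

Most of the difference comes from index construction, which is a one-off cost per document, so the saving grows with the number of documents indexed.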


In [None]:
def create_vector_index_from_text(text, service_context):
    documents = [Document(text)]
    index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
    return index
index = create_vector_index_from_text(text, service_context)

# 7. Save index data to disk

In [None]:
def save_index(index, path='/content/'):
    index.storage_context.persist(persist_dir=path)
save_index(index, "/content/drive/MyDrive/drive_gpt/MICA_PDF")

# 8. Load data to llm query

In [None]:
def load_index(service_context, path='/content/'):
    storage_context = StorageContext.from_defaults(persist_dir=path)
    index = load_index_from_storage(storage_context, service_context=service_context)
    return index
index = load_index(service_context, "/content/drive/MyDrive/drive_gpt/MICA_PDF")

# 9. Create Question

In [None]:
question = "What is this document about? Create a summary of the main points."

# 10. Query Services Single Document

In [None]:
def get_response(index, question):
    query_engine = index.as_query_engine()
    response = query_engine.query(question)
    return response
response = get_response(index, question)
print(response)

In [None]:
import pandas as pd
eval_df = pd.DataFrame(
    {
        "Question": question,
        "Response": str(response),
        "Source1": response.source_nodes[0].node.get_text()[:1000] + "...",
        # "Source2": response.source_nodes[1].node.get_text()[:1000] + "...",
        # "Score": response.source_nodes[0].score

    },
    index=[0]
)
eval_df = eval_df.style.set_properties(
    **{
        'inline-size': '400px',
        'overflow-wrap': 'break-word',
        'text-align': 'left'
    },
    subset=["Response",
            "Source1",
            # "Source2",
            # "Score"
            ]
)
display(eval_df)

In [None]:
# embedding model
str(index.service_context.embed_model._langchain_embedding.model_name)

'hkunlp/instructor-xl'

In [None]:
# predictor model
str(index.service_context.llm_predictor._llm.model_name)

'gpt-3.5-turbo'

# Query results

## Use vector store index

In [None]:
# use instructor embedding
# get_response is defined in section 10
response = get_response(index, question)
print(response)

In [None]:
# use llama embedding
# get_response is defined in section 10
response = get_response(index, question)
print(response)


In [None]:
import json

from llama_index import LLMPredictor
from llama_index.callbacks.schema import CBEventType


def predict(self, prompt, **prompt_args):
    """Patched LLMPredictor.predict that also dumps the prompt arguments to cac.json."""
    llm_payload = {**prompt_args}
    llm_payload["template"] = prompt
    event_id = self.callback_manager.on_event_start(
        CBEventType.LLM,
        payload=llm_payload,
    )

    # Dump the prompt arguments to disk so they can be inspected later.
    with open("cac.json", "w") as outfile:
        json.dump(prompt_args, outfile)

    formatted_prompt = prompt.format(llm=self._llm, **prompt_args)
    llm_prediction = self._predict(prompt, **prompt_args)
    # logging.debug(llm_prediction)

    # We assume that the value of formatted_prompt is exactly the thing
    # eventually sent to OpenAI, or whatever LLM downstream
    prompt_tokens_count = self._count_tokens(formatted_prompt)
    prediction_tokens_count = self._count_tokens(llm_prediction)
    self._total_tokens_used += prompt_tokens_count + prediction_tokens_count
    self.callback_manager.on_event_end(
        CBEventType.LLM,
        payload={"response": llm_prediction, "formatted_prompt": formatted_prompt},
        event_id=event_id,
    )
    return llm_prediction, formatted_prompt


# Monkey-patch the predictor so every LLM call goes through the version above.
LLMPredictor.predict = predict

## use tree index

In [None]:
# use llama embedding
# get_response is defined in section 10
response = get_response(index, question)
print(response)

Following the elimination of racially based immigration policies by the late 1960s, Toronto became a destination for immigrants from all parts of the world.


In [None]:
# use instructor embedding
# get_response is defined in section 10
response = get_response(index, question)
print(response)

## use keyword table

In [None]:
# use instructor embedding
# get_response is defined in section 10
response = get_response(index, question)
print(response)

In [None]:
def create_tree_index_from_text(text, service_context):
    documents = [Document(text)]
    index = GPTTreeIndex.from_documents(documents, service_context=service_context)
    return index
index = create_tree_index_from_text(text, service_context)

In [None]:
def create_list_index_from_text(text, service_context):
    documents = [Document(text)]
    # Build a list index (the original cell built a tree index here by mistake)
    index = GPTListIndex.from_documents(documents, service_context=service_context)
    return index
index = create_list_index_from_text(text, service_context)

In [None]:
def create_kwtable_index_from_text(text, service_context):
    documents = [Document(text)]

    # GPTKeywordTableIndex uses llama embed
    from llama_index import GPTKeywordTableIndex
    index = GPTKeywordTableIndex.from_documents(documents, service_context=service_context)
    return index
index = create_kwtable_index_from_text(text, service_context)

In [None]:
def create_kg_index_from_text(text, service_context):
    documents = [Document(text)]

    # GPTKnowledgeGraphIndex uses llama embed
    from llama_index.indices.knowledge_graph import GPTKnowledgeGraphIndex
    index = GPTKnowledgeGraphIndex.from_documents(documents, service_context=service_context)
    return index
index = create_kg_index_from_text(text, service_context)

# Not in use

In [None]:
# use the default (text-embedding-ada-002) embedding model
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0))
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)