In [2]:
! pip install cohere -q

In [2]:
! pip install cohere hnswlib unstructured -q

In [1]:
!pip install pdfplumber

In [19]:
!pip install PyMuPDF

Collecting PyMuPDF
  Obtaining dependency information for PyMuPDF from https://files.pythonhosted.org/packages/30/3f/356a70c105d4410c29529f1ca8c53b5d176b448a4409238b4dcd133507a4/PyMuPDF-1.24.10-cp311-none-win_amd64.whl.metadata
  Downloading PyMuPDF-1.24.10-cp311-none-win_amd64.whl.metadata (3.4 kB)
Collecting PyMuPDFb==1.24.10 (from PyMuPDF)
  Obtaining dependency information for PyMuPDFb==1.24.10 from https://files.pythonhosted.org/packages/70/cb/8459d6c179befd7c6eee555334f054e9a6dcdd9f8671891e1da19e0ce526/PyMuPDFb-1.24.10-py3-none-win_amd64.whl.metadata
  Downloading PyMuPDFb-1.24.10-py3-none-win_amd64.whl.metadata (1.4 kB)
Downloading PyMuPDF-1.24.10-cp311-none-win_amd64.whl (3.2 MB)
   ---------------------------------------- 0.0/3.2 MB ? eta -:--:--
   ---------------------------------------- 0.0/3.2 MB ? eta -:--:--
   ---------------------------------------- 0.0/3.2 MB 320.0 kB/s eta 0:00:10
   ---------------------------------------- 0.0/3.2 MB 325.1 kB/s eta 0:00:10
    -----

In [12]:
import cohere
co = cohere.Client("API Key")

In [29]:
import pdfplumber

# open the PDF
with pdfplumber.open('EU AI Act.pdf') as pdf:
    text = ''
    # iterate through each page
    for page in pdf.pages:
        text += page.extract_text() + '\n'

print(text)

BRIEFING
EU Legislation in Progress
Artificial intelligence act
OVERVIEW
European Union lawmakers signed the artificial intelligence (AI) act in June 2024. The AI act, the first
binding worldwide horizontal regulation on AI, sets a common framework for the use and supply of
AI systems in the EU.
The new act offers a classification for AI systems with different requirements and obligations tailored
to a 'risk-based approach'. Some AI systems presenting 'unacceptable' risks are prohibited. A wide
range of 'high-risk' AI systems that can have a detrimental impact on people's health, safety or on
their fundamental rights are authorised, but subject to a set of requirements and obligations to gain
access to the EU market. AI systems posing limited risks because of their lack of transparency will
be subject to information and transparency requirements, while AI systems presenting only minimal
risk for people will not be subject to further obligations.
The regulation also lays down specific r

In [30]:
lines = text.splitlines()

title = lines[:3]
text = lines[3:]

In [31]:
documents = [
    {
        "title": title,
        "text": text
    }
]

In [32]:
import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

In [41]:
class Vectorstore:
    """
    A class representing a collection of documents indexed into a vectorstore.

    Parameters:
    raw_documents (list): A list of dictionaries representing the sources of the raw documents. Each dictionary should have 'title' and 'url' keys.

    Attributes:
    raw_documents (list): A list of dictionaries representing the raw documents.
    docs (list): A list of dictionaries representing the chunked documents, with 'title', 'text', and 'url' keys.
    docs_embs (list): A list of the associated embeddings for the document chunks.
    docs_len (int): The number of document chunks in the collection.
    idx (hnswlib.Index): The index used for document retrieval.

    Methods:
    load_and_chunk(): Loads the data from the sources and partitions the HTML content into chunks.
    embed(): Embeds the document chunks using the Cohere API.
    index(): Indexes the document chunks for efficient retrieval.
    retrieve(): Retrieves document chunks based on the given query.
    """

    def __init__(self, documents: List[Dict[str, str]]):
        self.documents = documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()


    def load_and_chunk(self) -> None:
        """
        Loads the pre-extracted text documents and stores them as chunks.
        """
        print("Loading documents...")

        for raw_document in self.documents:
            title = document["title"]
            text = document["text"]

            # Assuming you want to chunk by paragraphs or a similar method
            chunks = self.chunk_text(text)
            
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": title,
                        "text": str(chunk)
                    }
                )
        print(f"Loaded {len(self.docs)} document chunks.")

    def chunk_text(self, text: str, max_chunk_size: int = 500) -> list:
        """
        Splits the text into chunks of a maximum size.
        """
        # You can implement a more sophisticated chunking logic here
        return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]


    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings
        
        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]
        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                }
            )

        return docs_retrieved

In [42]:
# Create an instance of the Vectorstore class with the given sources
vectorstore = Vectorstore(documents)

Loading documents...
Loaded 96 document chunks.
Embedding document chunks...
Indexing document chunks...
Indexing complete with 96 document chunks.


In [47]:
vectorstore.retrieve("Prompting by giving examples")

[{'title': 'BRIEFING\nEU Legislation in Progress\nArtificial intelligence act',
  'text': 'ion of\nconformity with the draft AI act requirements.\nThird, AI systems presenting limited risk, such as systems that interact with humans (i.e. chatbots),\nemotion recognition systems, biometric categorisation systems, and AI systems that generate or\nmanipulate image, audio or video content (i.e. deepfakes), would be subject to a limited set of\ntransparency obligations.\nFinally, all other AI systems presenting only low or minimal risk could be developed and used in the\nEU without conforming to a'},
 {'title': 'BRIEFING\nEU Legislation in Progress\nArtificial intelligence act',
  'text': ' for instance, to ensure the robustness of high-risk AI systems and the\nwatermarking of AI-generated content while, in the meantime, the EU is fostering the adoption of\nvoluntary codes of conduct and of an AI pact to mitigate the potential downsides of generative AI.\nSome academics warn that that the st

In [44]:
def run_chatbot(message, chat_history=None):
    if chat_history is None:
        chat_history = []
    
    # Generate search queries, if any        
    response = co.chat(message=message,
                        model="command-r-plus",
                        search_queries_only=True,
                        chat_history=chat_history)
    
    search_queries = []
    for query in response.search_queries:
        search_queries.append(query.text)

    # If there are search queries, retrieve the documents
    if search_queries:
        print("Retrieving information...", end="")

        # Retrieve document chunks for each query
        documents = []
        for query in search_queries:
            documents.extend(vectorstore.retrieve(query))

        # Use document chunks to respond
        response = co.chat_stream(
            message=message,
            model="command-r-plus",
            documents=documents,
            chat_history=chat_history,
        )

    else:
        response = co.chat_stream(
            message=message,
            model="command-r-plus",
            chat_history=chat_history,
        )
        
    # Print the chatbot response, citations, and documents
    chatbot_response = ""
    print("\nChatbot:")

    for event in response:
        if event.event_type == "text-generation":
            print(event.text, end="")
            chatbot_response += event.text
        if event.event_type == "stream-end":
            if event.response.citations:
                print("\n\nCITATIONS:")
                for citation in event.response.citations:
                    print(citation)
            if event.response.documents:
                print("\nCITED DOCUMENTS:")
                for document in event.response.documents:
                    print(document)
            # Update the chat history for the next turn
            chat_history = event.response.chat_history

    return chat_history

In [None]:
# provide context to the LLM about our role
context = "We are a technology consulting company with global clients. It is important for our company and the clients of our company to stay within regulations of the strictest AI act, the EU AI Act. "

# pass case scenario to the LLM with context
message = context + "A client in the healthcare industry has approached our tech consulting company with a proposal for an AI doctor that can use a patient's information to detect health risk and diagnoses. How much risk does this project have according to the EU AI Act? Please provide quotes and citations from the document."

In [45]:
# Turn # 1
chat_history = run_chatbot("Hello, I have a question")


Chatbot:
Of course! I am here to help. Please go ahead with your question and I will do my best to assist you.

In [48]:
# Turn # 2
chat_history = run_chatbot(message, chat_history)

Retrieving information...
Chatbot:
The EU AI Act, signed in June 2024, adopts a risk-based approach to the use and supply of AI systems in the EU. It identifies several use cases where AI systems are considered high-risk because they can potentially adversely affect people's health, safety, or fundamental rights. 

The proposal for an AI doctor that can use a patient's information to detect health risks and make diagnoses would likely be considered high-risk according to the EU AI Act. This is because the Act specifically mentions that high-risk AI systems can include medical devices. 

> High-risk AI systems can be safety components of products covered by sectoral EU law (e.g. medical devices) or AI systems that, as a matter of principle, are considered to be high risk when they are used in specific areas listed in an annex.

Furthermore, the Act states that AI systems that perform profiling of natural persons will always be considered high-risk:

> an AI system will always be conside

In [46]:
# provide context to the LLM about our role
context = "We are a technology consulting company with global clients. It is important for our company and the clients of our company to stay within regulations of the strictest AI act, the EU AI Act. "

# pass case scenario to the LLM with context
message = context + "A client in the healthcare industry has approached our tech consulting company with a proposal for an AI doctor that can use a patient's information to detect health risk and diagnoses. How much risk does this project have according to the EU AI Act? Please provide quotes and citations from the document."

# generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents=documents)

# display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
      cited_documents = event.response.documents

# display the citations and source documents
if citations:
  print("\n\nCITATIONS:")
  for citation in citations:
    print(citation)

  print("\nDOCUMENTS:")
  for document in cited_documents:
    print(document)

UnprocessableEntityError: status_code: 422, body: data=None message='invalid request: field title must be a string. For proper usage, please refer to https://docs.cohere.com/reference/chat'