# LlamaIndex
LlamaIndex is a framework for building context-augmented generative AI applications with LLMs, including agents and workflows.

https://docs.llamaindex.ai/en/stable/


# PyMuPDF4LLM
Using PyMuPDF as Data Feeder in LLM / RAG Applications
https://pypi.org/project/pymupdf4llm/

In [2]:
# %pip install -q "llama-index==0.11.15" llama-index-readers-file
# %pip install docx2txt
# %pip install python-dotenv
# %pip install llama-index-multi-modal-llms-gemini
# %pip install llama-index-vector-stores-qdrant
# %pip install llama-index-embeddings-gemini
# %pip install llama-index-llms-gemini
# %pip install 'google-generativeai>=0.3.0' matplotlib qdrant_client
# %pip install pymupdf4llm

In [1]:
from dotenv import dotenv_values
import os

In [2]:
# read env file
ROOT_DIR = os.getcwd()
config = dotenv_values(os.path.join(ROOT_DIR, "keys", ".env"))


In [59]:
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("data/UM HCE116NX1.pdf", write_images=True, image_path="./data/images")

Processing data/UM HCE116NX1.pdf...


In [60]:
import pprint
import pathlib
pathlib.Path("data/output.md").write_bytes(md_text.encode())

44854

In [9]:
pprint.pprint(md_text[:300])

('## �������������������������������\n'
 '\n'
 ' �������������\n'
 '\n'
 ' �����������������Cooker Hood\n'
 '\n'
 ' ������������������Instruction Manual\n'
 '\n'
 '1\n'
 '\n'
 '\n'
 '-----\n'
 '\n'
 '##### Content\n'
 '\n'
 '1…………………………………..………………………………Safety instructions\n'
 '\n'
 '2…………………………………..………………………………Installation\n'
 '\n'
 '3…………………………………..………………………………Start using your cooker hood\n'
 '\n'
 '4……')
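
The replacement characters (U+FFFD) in the heading above suggest that some text in this PDF could not be mapped to Unicode during extraction. A minimal sketch, using only the standard library, of measuring how much of the extracted Markdown is affected. The helper name `replacement_ratio` and the sample string are illustrative and not part of pymupdf4llm:

```python
def replacement_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count("\ufffd") / len(text)


# Hypothetical sample standing in for md_text
sample = "\ufffd\ufffd\ufffd\ufffdCooker Hood"
ratio = replacement_ratio(sample)
print(f"{ratio:.2%} of characters are unreadable")
```

In the notebook, `replacement_ratio(md_text)` would give a rough signal of whether the file needs OCR or a different extraction path before indexing.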


In [11]:
import pymupdf4llm

md_read = pymupdf4llm.LlamaMarkdownReader()
data = md_read.load_data("data/Parser Source 2.pdf")

Successfully imported LlamaIndex
Processing data/Parser Source 2.pdf...

In [61]:
len(data)

143

In [62]:
data[1].to_dict().keys()

dict_keys(['id_', 'embedding', 'metadata', 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys', 'relationships', 'text', 'mimetype', 'start_char_idx', 'end_char_idx', 'text_template', 'metadata_template', 'metadata_seperator', 'class_name'])
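
Each loaded document exposes `text` and `metadata` among the keys above. Before indexing, it may be worth dropping documents whose text is blank, since they add nothing useful to retrieval. A minimal sketch operating on plain dictionaries; `drop_blank` and the sample records are hypothetical stand-ins, not LlamaIndex API:

```python
from typing import Any


def drop_blank(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only records whose 'text' field contains non-whitespace characters."""
    return [r for r in records if r.get("text", "").strip()]


# Hypothetical records shaped like a subset of Document.to_dict()
records = [
    {"id_": "a", "text": "Company Number: 123456", "metadata": {}},
    {"id_": "b", "text": "   \n", "metadata": {}},
]
kept = drop_blank(records)
print([r["id_"] for r in kept])
```

With the real data, the same helper could be applied to `[d.to_dict() for d in data]`.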

In [15]:
GOOGLE_API_KEY = config.get("GEMINI-API-KEY")
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
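
If the key is absent from the `.env` file, `config.get(...)` returns `None` and the assignment to `os.environ` fails with a `TypeError` that does not name the missing key. A minimal sketch of a stricter lookup; `require_key` and `demo_config` are hypothetical names introduced for illustration:

```python
from typing import Mapping, Optional


def require_key(config: Mapping[str, Optional[str]], name: str) -> str:
    """Return config[name], raising a descriptive error if it is missing or empty."""
    value = config.get(name)
    if not value:
        raise KeyError(f"{name} is missing or empty in the .env file")
    return value


# Hypothetical config standing in for the dotenv_values(...) result
demo_config = {"GEMINI-API-KEY": "dummy-key", "EMPTY": None}
key = require_key(demo_config, "GEMINI-API-KEY")
print(key)
```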


In [49]:
import qdrant_client
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.multi_modal_llms.gemini import GeminiMultiModal
from llama_index.vector_stores.qdrant import QdrantVectorStore


# Create a local Qdrant vector store
clientg = qdrant_client.QdrantClient(path="companies_gemini_3")

vector_storeg = QdrantVectorStore(client=clientg, collection_name="collection")

# Use Gemini for both the embedding model and the LLM
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=GOOGLE_API_KEY
)
Settings.llm = Gemini(api_key=GOOGLE_API_KEY, model="models/gemini-1.5-flash-002")

In [50]:
storage_contextg = StorageContext.from_defaults(vector_store=vector_storeg)

indexg = VectorStoreIndex.from_documents(
    documents=data,
    storage_context=storage_contextg,
)



In [51]:

# creating a query engine from the index (retrieves the 5 most similar chunks)
query_engineg = indexg.as_query_engine(similarity_top_k=5)

# generating query response
response = query_engineg.query("Appointments Board Positions list with names and other details of Aardvark Constructions Limited")
print(response)

Aardvark Constructions Limited's board appointments include:  Tim Haines (HAINES-T) as Director on 13/11/2022; ABNB AltBauNeu Baugesellschaft mbH (ABNBGERGMB) as Company Secretary on 19/12/2022; Trustme (TRUSTME) as Managing Member on 30/01/2023; Mohammed Malek (MALEK-M) as Chair on 31/01/2023; Nicole Adams (ADAMS-N) as Alternate Director on 15/03/2023; Gordon Tatun (TATUN-G) as Director on 12/10/2023; Brian Jenkins (JENKINS-B) as Director on 29/11/2023; Brian Stafford (STAFFORD-B) as Chief Executive on 28/11/2023; Neil Barlow (BARLOW-N) as Director on 11/12/2023 (two entries); Willem Director (WILDIR) as Director on 10/01/2024; Paloma Plews (PLEWS-P) as Director on 07/02/2024; and Roman Arkadyevich Abramovich (ABRAMOV-RA) as Director on 07/02/2024; Susan Boyie (BOYIE-S) as Director on 15/02/2024.  Nicole Adams (PRIMARYNM) was appointed as Director (DXM Pending Appt) on 10/09/2020 (two entries).
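
The `similarity_top_k` setting controls how many of the best-matching chunks are passed to the LLM. A toy sketch of the idea in pure Python, assuming ranking by cosine similarity (the actual distance metric depends on how the Qdrant collection is configured). The function names and the 2-D vectors are illustrative only:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:k]


# Toy 2-D embeddings; real embeddings have hundreds of dimensions
chunks = {
    "board": [0.9, 0.1],
    "address": [0.1, 0.9],
    "names": [0.7, 0.3],
}
best = top_k([1.0, 0.0], chunks, k=2)
print(best)
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and more room for unrelated chunks to leak into the answer.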



In [63]:
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.chat_engine import CondensePlusContextChatEngine

# creating chat memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=4500)

# creating chat engine
chat_engine = CondensePlusContextChatEngine.from_defaults(
    indexg.as_retriever(),
    memory=memory,
    # temperature is an LLM setting, so it is passed to Gemini rather than to from_defaults
    llm=Gemini(api_key=GOOGLE_API_KEY, model="models/gemini-1.5-flash-002", temperature=0.8),
)
prompt = """
Provide Main Details of the company Aardvark Constructions Limited. Including following details:
Name:
Country:
Company Number:
Incorporated:
Company Type:
Company Status:
Primary Addresses Registered Office:
Accounting Dates:
Confirmation Statement:
"""
# generating chat response
response = chat_engine.chat(prompt)
print(str(response))

Here's a summary of the main details for Aardvark Constructions Limited, based on the provided documents:

**Name:** Aardvark Constructions Limited

**Country:** United Kingdom

**Company Number:** 123456

**Incorporated:** January 13, 2022 (Date Company Created)

**Company Type:** Holding (Company Type - LSL)

**Company Status:**  The provided documents don't explicitly state the current company status.  More information would be needed to determine if it is active, dissolved, etc.

**Primary Address (Registered Office):** 123, Oakwood Lane, London

**Accounting Dates:** The documents show completed annual returns for the periods 30/01/2022 to 30/01/2023 and 30/01/2023 to 30/01/2024.  However, the exact accounting dates (fiscal year end) are not explicitly stated.  The "Fiscal Year End" field in the CSC Information section is blank.

**Confirmation Statement:** The provided text mentions "Annual Return" which is completed, but doesn't explicitly refer to a "Confirmation Statement".  T
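
The `token_limit=4500` passed to the memory buffer caps how much chat history is carried into each turn. A rough sketch of that idea, keeping the newest messages that fit within a budget. It counts whitespace-separated words instead of real tokens, and `trim_history` is a hypothetical helper, not the LlamaIndex implementation:

```python
def trim_history(messages: list[str], token_limit: int) -> list[str]:
    """Keep the most recent messages whose combined word count fits the limit."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = len(message.split())
        if used + cost > token_limit:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "first question about the company",
    "a long answer " * 3,
    "follow-up question",
]
trimmed = trim_history(history, token_limit=12)
print(trimmed)
```

Because older turns fall out of the window first, follow-up prompts that rely on early context (such as the company name) may need to restate it in long sessions.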

In [64]:
prompt = """
From Management Details extract:
Managed By:
Managed By Email:
"""
response = chat_engine.chat(prompt)
print(str(response))

I'm sorry, but the provided text does not contain the email address or the name of the person who manages Aardvark Constructions Limited.  The documents list board members and their positions, but not a designated manager or their contact information.



In [65]:
prompt = """
Past Names of the Company with their period 
"""
response = chat_engine.chat(prompt)
print(str(response))

Here's a list of the past names of Aardvark Constructions Limited, along with their effective periods, as shown in the provided document:

* **Aardvark Construction:** From 20/10/2020 to 20/10/2021
* **Aardvark and Son Ltd:** From 20/10/2021 to 20/10/2022



In [66]:
prompt = """
Appointments Board Positions list with names and other details
"""
response = chat_engine.chat(prompt)
print(str(response))

Based on the provided document, here's a list of board appointments for Aardvark Constructions Limited:

| Name             | QuickRef    | Position             | Appointed       | Job Title      |
|-----------------|-------------|----------------------|-----------------|-----------------|
| Abbles, James    | ABBLES-J    | Director             | 19/04/2023      | Trainer         |
| Abdreatta, Leopoldo | ABDREATT-L | Director             | 18/10/2023      | Secretary       |
| Adam, Nicole     | ADAMS-N     | Alternate Director | 04/04/2023      | CFO             |
|                  |             | Non Executive Director | 10/04/2024      | CFO             |
| Alberts, Stoffel | ALBERTS-S   | Company Secretary    | 16/12/2022      | Accountant      |
| Rutter, Gus      | RUTTER-G    | Director             | 07/03/2024      | Director        |


**Note:**  Nicole Adams's appointment as CFO is listed twice, once as Alternate Director and once as Non-Executive Director.  The dates indic
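
When the model answers with a Markdown table like the one above, the rows can be turned into dictionaries for checking against the source PDF. A minimal sketch using only the standard library (Python 3.9+ for `removeprefix`). `parse_markdown_table` is a hypothetical helper, and the sample reuses a few columns from the response:

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in table.strip().splitlines() if line.strip()]

    def cells(line: str) -> list[str]:
        inner = line.removeprefix("|").removesuffix("|")
        return [cell.strip() for cell in inner.split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2
    return [dict(zip(header, cells(line))) for line in lines[2:]]


table = """
| Name | QuickRef | Position |
|------|----------|----------|
| Abbles, James | ABBLES-J | Director |
| Rutter, Gus | RUTTER-G | Director |
"""
rows = parse_markdown_table(table)
print(rows[0]["QuickRef"])
```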

In [29]:
# Nvidia
from llama_index.llms.nvidia import NVIDIA
from llama_index.embeddings.nvidia import NVIDIAEmbedding
os.environ['NVIDIA_API_KEY'] = config.get('NVIDIA_API_KEY')


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [30]:
# Here we are using meta/llama-3.2-3b-instruct model from API Catalog
Settings.llm = NVIDIA(model="meta/llama-3.2-3b-instruct", temperature=0.7)
Settings.embed_model = NVIDIAEmbedding(model="NV-Embed-QA", truncate="END")



In [31]:
# Create a local Qdrant vector store
client = qdrant_client.QdrantClient(path="companies_nvidia")

vector_store = QdrantVectorStore(client=client, collection_name="collection2")

In [32]:
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents=data,
    storage_context=storage_context,
)



In [33]:
# creating a query engine from the index (retrieves the 3 most similar chunks)
query_engine = index.as_query_engine(similarity_top_k=3)

# generating query response
response = query_engine.query("Appointments Board Positions list with names and other details of Aardvark Constructions Limited")
print(response)

Abbles, James ABBLES-J 
Director 
19/04/2023 
Trainer 

Abdreatta, Leopoldo ABDREATT-L 
Director 
18/10/2023 
Secretary 

Adam, Nicole ADAMS-N 
Alternate Director 
04/04/2023 
CFO 

Non Executive 
10/04/2024 
CFO 
Director 

Alberts, Stoffel ALBERTS-S 
Company Secretary 
16/12/2022 
Accountant 

Rutter, Gus RUTTER-G 
Director 
07/03/2024 
Director


In [34]:

llm = NVIDIA(model="meta/llama-3.2-3b-instruct", temperature=0.7)
# creating chat memory buffer
memory = ChatMemoryBuffer.from_defaults(token_limit=4500)

# creating chat engine
chat_engine = CondensePlusContextChatEngine.from_defaults(
    index.as_retriever(),
    memory=memory,
    llm=llm,
)
prompt = """
Provide Main Details of the company Aardvark Constructions Limited. Including following details:
Name:
Country:
Company Number:
Incorporated:
Company Type:
Company Status:
Primary Addresses Registered Office:
Accounting Dates:
Confirmation Statement:
"""
# generating chat response
response = chat_engine.chat(prompt)
print(str(response))

Based on the provided document, the Main Details of Aardvark Constructions Limited are as follows:

1. Name: Aardvark Constructions Limited
2. Country: United Kingdom
3. Company Number: 123456
4. Incorporated: 20/10/2020
5. Company Type: Limited by Shares
6. Company Status: Active
7. Primary Addresses:
   - Registered Office: 6 Chancery Road, London, WC2A 5DP, United Kingdom
8. Accounting Dates:
   - Last Period End: 16/11/2022
   - Current Period End: 16/11/2024
   - Last Extended: 16/11/2022
9. Confirmation Statement: Filed on 03/03/2023
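
The Gemini and NVIDIA runs return different values for some of the same fields, so it can help to list the disagreements before checking them against the source PDF. A minimal sketch; the two dictionaries are hand-copied from the responses in this notebook, and `differing_fields` is a hypothetical helper:

```python
def differing_fields(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {field: (value_in_a, value_in_b)} for shared fields that disagree."""
    shared = a.keys() & b.keys()
    return {k: (a[k], b[k]) for k in sorted(shared) if a[k] != b[k]}


# Values copied by hand from the two model responses
gemini = {
    "Company Number": "123456",
    "Incorporated": "January 13, 2022",
    "Company Type": "Holding",
}
nvidia = {
    "Company Number": "123456",
    "Incorporated": "20/10/2020",
    "Company Type": "Limited by Shares",
}
diff = differing_fields(gemini, nvidia)
for field, (g, n) in diff.items():
    print(f"{field}: Gemini={g!r} NVIDIA={n!r}")
```

Neither answer is confirmed by this comparison alone. It only narrows down which fields need to be verified in the document.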


In [35]:
prompt = """
From Management Details extract:
Managed By:
Managed By Email:
"""
response = chat_engine.chat(prompt)
print(str(response))

From the Management Details of Aardvark Constructions Limited, the extracted information is:

1. Managed By: Caroline McPartland
2. Managed By Email: cmcpartland@diligent.com


In [36]:
prompt = """
Past Names of the Company with their period 
"""
response = chat_engine.chat(prompt)
print(str(response))

From the document, the Past Names of the Company (Aardvark Constructions Limited) along with their periods are:

 None of the provided document details the Past Names of Aardvark Constructions Limited.


In [37]:
prompt = """
Appointments Board Positions list with names and other details
"""
response = chat_engine.chat(prompt)
print(str(response))

From the document, the Appointments: Board Positions list of Aardvark Constructions Limited, along with names and other details, is:

1. **Name** **QuickRef** **Position** **Appointed** **Resigned**
   - Abbles, James ABBLES-J Director 19/04/2023 Trainer
   - Abdreatta, Leopoldo ABDREATT-L Director 18/10/2023 Secretary
   - Adam, Nicole ADAMS-N Alternate Director 04/04/2023 CFO
   - Non Executive 10/04/2024 CFO

Note: There is a slight discrepancy in the position of Adam, Nicole ADAMS-N, as it is initially listed as CFO but later listed as Non Executive 10/04/2024 CFO.
