# EyeLevel - A Comprehensive Guide
*Updated 2024-8-22*

This notebook is designed to help you use and fully understand EyeLevel's tools. For support feel free to either contact daniel.warfield@eyelevel.ai (the author) or support@eyelevel.ai (general technical support).

---

## What is EyeLevel?

EyeLevel is a set of unified technologies which are designed to allow you to parse and search through documents. This can be used in many applications, including:
- Finding references within a large document set
- Performing retrieval augmented generation (RAG)
- Reformatting visually dense documents into a useful textual representation

EyeLevel has two core technologies: GroundX and X-Ray.

X-Ray is a modern take on document parsing: it uses a variety of computer vision and natural language processing techniques to turn documents (even those with complex formatting) into a textual representation we call a "semantic object". Semantic objects contain key information about the document, sections of the document, and elements within the document in order to provide highly contextualized and useful representations of the source material.

GroundX uses the semantic objects created by X-Ray to perform search. You can put a natural language question into GroundX and you'll get back a list of semantic objects, generated by X-Ray from your documents, which are relevant to your question.

## How does X-Ray work?

We created a fine-tuned vision model which is specifically designed to identify key elements within documents. We observe a variety of element types, but predominantly concern ourselves with text, tables, and graphical figures. Once these elements have been identified, they are extracted from the document and sent to a pipeline that depends on the type of element. Text is simply extracted, while tables and images are grounded in a textual representation via fine-tuned multimodal LLMs.

Once all elements within a document are identified, extracted, and grounded textually, X-Ray constructs a summarization on a document and section level based on the extracted textual representations. This allows X-Ray to create a representation of the greater document context, which is used to build semantic objects.

We use extracted text from the elements, as well as summary level information, to identify key ideas within the document. These ideas might encompass one or many extracted elements. We use the extracted data to construct a template of key information which needs to be filled in order to fully describe the identified ideas within the document.

Once the document has been divided into ideas, and templates describing what is needed to capture those ideas have been generated, the templates are filled out via yet another fine-tuned LLM. This, ultimately, is what becomes a semantic object. All models used in this process exist within EyeLevel's cloud, and can be deployed to a VPC as an atomic unit.

## How does GroundX work?

Because X-Ray creates highly queryable semantic objects, GroundX search does not use the traditional cosine-similarity flavor of vector search common in many similar retrieval systems. Rather, GroundX employs a customized textual search strategy built on top of Apache Lucene, a library designed for indexing and searching textual data. Our Lucene configuration is specifically designed to be maximally compatible with the semantic objects output by X-Ray.

The validity of this approach is supported [in the literature](https://arxiv.org/pdf/2308.14963) and also has a variety of practical benefits which allow for the optimization of GroundX on a case-by-case basis with minimal overhead.

A lot of nitty-gritty engineering goes into this, reflecting EyeLevel's cumulative experience working with numerous companies across a diverse spread of documents. What we settled on is a complex multi-field filter and search which prioritizes certain elements in the semantic object, and certain tokens within those elements.
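
To make the idea of a multi-field search concrete, here is a toy sketch of field-weighted keyword scoring. It is not GroundX's actual implementation (which is built on Apache Lucene): the weights, the `score_object` function, and the sample objects are all made up for illustration. Only the field names `text`, `suggestedText`, and `sectionSummary` are borrowed from the semantic objects explored later in this notebook.

```python
import re
from collections import Counter

# Hypothetical weights: a match in one field can count more than in another.
FIELD_WEIGHTS = {"suggestedText": 2.0, "text": 1.0, "sectionSummary": 0.5}

def tokenize(s):
    """Lowercase the string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", s.lower())

def score_object(query, semantic_object, weights=FIELD_WEIGHTS):
    """Sum the weighted term frequencies of the query tokens over each field."""
    query_tokens = set(tokenize(query))
    score = 0.0
    for field, weight in weights.items():
        counts = Counter(tokenize(semantic_object.get(field, "")))
        score += weight * sum(counts[t] for t in query_tokens)
    return score

# Made-up semantic objects, reduced to two textual fields.
objects = [
    {"text": "The use phase is mostly due to energy use.",
     "sectionSummary": "Life cycle assessment of AI services."},
    {"text": "Table of contents",
     "sectionSummary": "Front matter."},
]

ranked = sorted(objects, key=lambda o: score_object("energy use", o), reverse=True)
print(ranked[0]["text"])
```

A real Lucene-based system would add analyzers, inverse document frequency, and filtering on top of this, but the core intuition of weighting fields differently is the same.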

## The Workflow

All of EyeLevel's technologies (including X-Ray) can be accessed via the GroundX SDK, which is essentially a collection of language-specific implementations and cURL-accessible API endpoints. The documentation for the API can be found here:

https://documentation.eyelevel.ai/reference/

The most fundamental component of organization within GroundX is the bucket, which is used to store documents. When you upload a document to a bucket it will trigger X-Ray parsing, and the result will be stored in the bucket for later querying. The semantic objects which X-Ray creates are ultimately what is stored in a bucket.

Projects are collections of buckets, allowing you to search between multiple buckets. This can allow you to organize information across buckets, and aggregate that information for specific use cases.

Buckets and Projects can be searched with a natural language query. GroundX will search for the semantic objects most relevant to your query and return them. GroundX will also construct a recommended text block which aggregates information from the most relevant retrieved semantic objects. This is designed to be injected into a language model, enabling RAG-style workflows. We'll explore the results of search in depth later in the notebook.

## Optimizing for Your Documents

While EyeLevel's products are designed to work out of the box on arbitrary human documents, in reality it's impossible to make a single unified system that is perfect in every use case. One of the core ideas of both X-Ray and GroundX is configurability: we can fine-tune computer vision models on your documents, we can adjust templating to match your needs, and we can modify our search system based on your specific requirements. We also have a depth of experience in analyzing and testing the performance of RAG systems in real-world use cases. The takeaway is that X-Ray + GroundX allows you to achieve state-of-the-art performance out of the box on a wide range of common document types, and can be tailored to perform excellently on your documents if necessary.

---



# Creating an EyeLevel Account And Registering an API Key

You can create an account here:

https://dashboard.eyelevel.ai/auth/register

Once you have an account set up, you can navigate here to create an API key:

https://dashboard.eyelevel.ai/apikey


In [None]:
"""Enter your API key here
"""
is_google_secret = True

if is_google_secret:
    # if your API key is stored in the Colab secret manager
    from google.colab import userdata
    api_key = userdata.get('GroundXAPIKey_daniel.warfield')  # <- your secret name here
else:
    # for hard coding
    api_key = "xxxxx"
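
If you are running outside of Colab, one common alternative to hard coding is to read the key from an environment variable. The sketch below assumes a variable named `GROUNDX_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not something the SDK requires.

```python
import os

def load_api_key(env_var="GROUNDX_API_KEY", fallback=None):
    """Return the key stored in the environment, or the fallback if it is unset."""
    key = os.environ.get(env_var, fallback)
    if not key:
        raise RuntimeError(f"No API key found; set the {env_var} environment variable")
    return key

# Example usage (commented out so the cell does not fail when the variable is unset):
# api_key = load_api_key()
```

Failing loudly when the key is missing tends to be easier to debug than an authentication error from a later API call.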

# The Python SDK
Interfacing with both X-Ray and GroundX can be done via [The Python GroundX SDK](https://pypi.org/project/groundx-python-sdk/). There's also a [node package](https://www.jsdelivr.com/package/npm/groundx-typescript-sdk) which exposes equivalent JavaScript functionality.

Currently GroundX and X-Ray exist as a series of endpoints which we'll be exploring in this notebook. Documentation around those endpoints can be found here:

https://documentation.eyelevel.ai/reference

In the near future these endpoints will be abstracted into language-specific implementations of core functionality. For now we'll be working directly with the exposed API endpoints.

In [None]:
!pip install groundx-python-sdk

Collecting groundx-python-sdk
  Downloading groundx_python_sdk-1.3.23-py3-none-any.whl.metadata (33 kB)
Downloading groundx_python_sdk-1.3.23-py3-none-any.whl (349 kB)
Installing collected packages: groundx-python-sdk
Successfully installed groundx-python-sdk-1.3.23


# Authenticating
To get set up in Python, simply import the module and create an instance of the GroundX client with your API key.

In [None]:
from groundx import Groundx
groundx = Groundx(
    api_key=api_key,
)

# Creating a Bucket
Buckets can be created either through [the dashboard](https://dashboard.eyelevel.ai/home) by selecting the `+ New Bucket` button, or through the API via the [create_bucket endpoint](https://documentation.eyelevel.ai/reference/Buckets/Bucket_create).

Here we'll create a bucket called demo_bucket.

Buckets are uniquely identified by a bucket_id which is returned upon completion of the endpoint. We'll use this bucket_id to upload documents and search against our bucket.

The body of the response is formatted thus:
```
body={'bucket': {'bucketId': ____, 'name': ____}}
```

In [None]:
response = groundx.buckets.create(
    name="demo_bucket"
)
bucket_id = response.body['bucket']['bucketId']
print(f'Created bucket {bucket_id}')

Created bucket 11009
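
If you'd rather fail with a readable message when the response doesn't have the expected shape, the lookup can be wrapped in a small helper. This is only a sketch based on the body format shown above; `extract_bucket_id` is a hypothetical helper, not part of the SDK.

```python
def extract_bucket_id(body):
    """Pull the bucket ID out of a body shaped like {'bucket': {'bucketId': ..., 'name': ...}}."""
    try:
        return body["bucket"]["bucketId"]
    except (KeyError, TypeError) as err:
        raise ValueError(f"Unexpected create-bucket response: {body!r}") from err

# A body with the same shape as the response above.
example_body = {"bucket": {"bucketId": 11009, "name": "demo_bucket"}}
print(extract_bucket_id(example_body))  # prints 11009
```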


# Uploading Documents
Now that we have a bucket we can upload documents to that bucket. Recall that this triggers X-Ray to parse the documents so our bucket will be populated with a bunch of semantic objects.

X-Ray supports the following document types:
```
txt, docx, pptx, xlsx, pdf, png, jpg, csv, tsv, json
```
The primary use case of EyeLevel products is understanding complex human documents, so we'll use a PDF for this example. Specifically, we'll use [this document](https://arxiv.org/pdf/2110.11822), which is an academic paper with complex textual formatting and graphical figures.

When using the `ingest_local` endpoint (the one called in the cell below) you'll get a response with the flavor of

```
{'ingest': {'processId': ____, 'status': 'queued'}}
```

Uploading a set of documents triggers X-Ray to begin processing them. Progress can be tracked using the `processId`.

In [None]:
doc_path = '2110.11822v2.pdf'

#uploading document
response = groundx.documents.ingest_local([{
    "blob": open(doc_path, "rb"),
    "metadata": {
        "bucketId": bucket_id,
        "fileName": doc_path,
        "fileType": "pdf",

        # You can provide custom keywords to GroundX to influence search.
        # This is useful if there is organizational information which does
        # not exist in the document itself, but might exist in the folder
        # structure the document resides in, for instance. This is an optional parameter.
        "searchData": {
            "topic": "risk",
            "year": 2023
        }
    }
}])

processId = response.body['ingest']['processId']
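
When uploading many files, it can help to build the `metadata` dictionary programmatically and reject unsupported file types before calling the API. The sketch below infers `fileType` from the file extension and checks it against the list of supported types given above; `build_metadata` is a hypothetical helper, not an SDK function.

```python
from pathlib import Path

# File types listed as supported by X-Ray earlier in this section.
SUPPORTED_TYPES = {"txt", "docx", "pptx", "xlsx", "pdf", "png", "jpg", "csv", "tsv", "json"}

def build_metadata(path, bucket_id, search_data=None):
    """Build the metadata dict for one document, inferring fileType from the extension."""
    file_type = Path(path).suffix.lower().lstrip(".")
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    metadata = {
        "bucketId": bucket_id,
        "fileName": Path(path).name,
        "fileType": file_type,
    }
    if search_data:
        # Optional custom keywords, as in the upload cell above.
        metadata["searchData"] = search_data
    return metadata

print(build_metadata("2110.11822v2.pdf", 11009, {"topic": "risk", "year": 2023}))
```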

# Tracking X-Ray Progress

We can poll the processId via the [get_processing_status_by_id](https://documentation.eyelevel.ai/reference/Documents/Document_getProcessingStatusById) endpoint, which will tell us if our documents are in one of four states.
```
cancelled, complete, errors, processing
```

The structure of the response obeys the following schema:
```
{
  "ingest": {
    "processId": "9e0ad09b-5150-48c0-aded-707587048fd9",
    "progress": {
      "cancelled": {
        "documents": [
          {
            <document info>
          }
        ],
        "total": 0
      },
      "complete": {
        "documents": [
          {
            <document info>
          }
        ],
        "total": 0
      },
      "errors": {
        "documents": [
          {
            <document info>
          }
        ],
        "total": 0
      },
      "processing": {
        "documents": [
          {
            <document info>
          }
        ],
        "total": 0
      }
    },
    "status": "queued",
```

where `<document info>` might look like the following:
```
{"bucketId": 0,
"documentId": "4704590c-004e-410d-adf7-acb7ca0a7052",
"fileName": "string",
"fileSize": "1.4MB",
"fileType": "txt",
"processId": "9e0ad09b-5150-48c0-aded-707587048fd9",
"searchData": {},
"sourceUrl": "http://example.com",
"status": "queued",
"statusMessage": "string",
"xrayUrl": "http://example.com"}
```
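
As a small illustration of how this structure can be consumed, the sketch below collapses the `progress` section into a count per state. It assumes the body follows the schema above; `summarize_progress` and the example body are made up for demonstration.

```python
def summarize_progress(body):
    """Return {state: total} for each state listed under ingest.progress."""
    progress = (body.get("ingest") or {}).get("progress") or {}
    return {state: info.get("total", 0) for state, info in progress.items()}

# A made-up body following the schema above, with one finished document
# and one that is still being processed.
example_body = {
    "ingest": {
        "processId": "example-process-id",
        "progress": {
            "cancelled": {"documents": [], "total": 0},
            "complete": {"documents": [{"documentId": "doc-a"}], "total": 1},
            "errors": {"documents": [], "total": 0},
            "processing": {"documents": [{"documentId": "doc-b"}], "total": 1},
        },
    }
}
print(summarize_progress(example_body))
```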

This can be used for a variety of status checks depending on the application. For now, because we're only uploading a single document for testing purposes, I'll just poll this endpoint every 10 seconds and check the status of the overall process to see whether our document is done.

In [None]:
import time
while True:

    response = groundx.documents.get_processing_status_by_id(
        process_id=processId
    )
    if response.body['ingest']['status'] == 'complete':
        print('done!')
        break

    print('still processing...')
    time.sleep(10)

#getting the document id for the next section.
doc_id = response.body['ingest']['progress']['complete']['documents'][0]['documentId']

still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
still processing...
done!
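
The loop above runs until the status becomes `complete`, so it would spin forever if processing failed. A slightly more defensive variant is sketched below: it stops on failure states and gives up after a timeout. The status strings `"error"` and `"cancelled"` are assumptions on my part and should be checked against the API documentation; `poll_until_done` is a hypothetical helper.

```python
import time

def poll_until_done(fetch_status, interval=10, timeout=600,
                    done=("complete",), failed=("error", "cancelled"),
                    sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in done or status in failed:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        sleep(interval)

# Example usage with the client from this notebook (commented out):
# status = poll_until_done(
#     lambda: groundx.documents.get_processing_status_by_id(
#         process_id=processId
#     ).body['ingest']['status']
# )
```

Passing the status lookup in as a function keeps the polling logic independent of the SDK, which also makes it easy to test with a fake.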


# Viewing X-Ray parse results
In the previous section we got the document_id from the upload process. We can use that to get the URL where the X-Ray data is exposed (as a JSON object) then explore it.

In [None]:
from groundx import Groundx

#getting the URL of x-ray parsing
response = groundx.documents.get(
    document_id=doc_id
)
x_ray_url = response.body['document']['xrayUrl']

#getting x-ray data
import urllib.request, json
with urllib.request.urlopen(x_ray_url) as url:
    x_ray_data = json.loads(url.read().decode())

In [None]:
"""X-Ray summarization of the entire document
"""
x_ray_data['fileSummary']

'This document entitled "Unraveling the Hidden Environmental Impacts of AI" describes the environmental consequences of AI technologies, particularly focusing on deep learning methods, and involves researchers Anne-Laure Ligozat, Aurélie Bugeau, Julien Lefèvre, and Jacques Combaz. The core topics include a review of the different types of environmental impacts caused by AI, methodologies for assessing these impacts, and the application of life cycle assessment (LCA) to AI services. The document highlights the significant energy consumption and greenhouse gas emissions associated with training AI models and discusses the broader environmental implications beyond just energy use. It also proposes a framework for evaluating the environmental benefits and costs of AI solutions designed for environmental purposes, emphasizing the need for comprehensive impact assessments that include production, use, and end-of-life phases of AI equipment.\n\nKeywords: AI environmental impact, deep learning

In [None]:
"""X-Ray provides some generally useful data about the document
"""
print('File Type:',x_ray_data['fileType'])
print('Language: ',x_ray_data['language'])
print('Keywords: ',x_ray_data['fileKeywords'])

File Type: pdf
Language:  English
Keywords:  2110.11822v2.pdf,AI environmental impact, deep learning environmental effects, AI energy consumption, AI greenhouse gas emissions, life cycle assessment AI, LCA AI services, AI carbon footprint, AI sustainability assessment, environmental cost of AI, AI energy use, AI model training emissions, AI environmental consequences, AI life cycle, AI production impact, AI end-of-life impact, AI equipment environmental assessment, AI for environmental solutions, AI ecological impact, AI environmental methodologies, AI environmental benefits, AI environmental costs, AI impact evaluation, AI environmental review, AI energy footprint, AI sustainability framework, AI environmental implications, AI environmental research, AI environmental data, AI environmental analysis.


In [None]:
"""Semantic object Exploration
Semantic objects can exist on one or multiple pages. In this object you can see
the following:
 - A list of bounding boxes from items on the page(s) that contribute to the object
 - The type of content, in this case a paragraph (as opposed to a figure or table)
 - The page number(s) the semantic object exists within.
 - sectionSummary: summarizes the greater section the semantic object is within.
 This is designed to provide additional context to the semantic object.
 - suggestedText: is LLM rewritten text which is based on the extracted text and
 other section level and document level information.
 - text: The raw extracted textual data
"""
x_ray_data['chunks'][0]

{'boundingBoxes': [{'bottomRightX': 363,
   'bottomRightY': 810,
   'pageNumber': 2,
   'topLeftX': 111,
   'topLeftY': 792},
  {'bottomRightX': 616,
   'bottomRightY': 760,
   'pageNumber': 2,
   'topLeftX': 111,
   'topLeftY': 645},
  {'bottomRightX': 294,
   'bottomRightY': 629,
   'pageNumber': 2,
   'topLeftX': 111,
   'topLeftY': 613},
  {'bottomRightX': 615,
   'bottomRightY': 579,
   'pageNumber': 2,
   'topLeftX': 144,
   'topLeftY': 488},
  {'bottomRightX': 616,
   'bottomRightY': 482,
   'pageNumber': 2,
   'topLeftX': 144,
   'topLeftY': 414},
  {'bottomRightX': 615,
   'bottomRightY': 407,
   'pageNumber': 2,
   'topLeftX': 144,
   'topLeftY': 337},
  {'bottomRightX': 617,
   'bottomRightY': 331,
   'pageNumber': 2,
   'topLeftX': 143,
   'topLeftY': 263},
  {'bottomRightX': 432,
   'bottomRightY': 249,
   'pageNumber': 2,
   'topLeftX': 133,
   'topLeftY': 227},
  {'bottomRightX': 615,
   'bottomRightY': 221,
   'pageNumber': 2,
   'topLeftX': 111,
   'topLeftY': 152},
  

In [None]:
"""Semantic object Exploration
The previous example of a semantic object only contained textual information.
Let's explore a table.

As can be seen the content of this semantic object is similar to the one used
to represent paragraph information, but with some key differences:
- There's a json description of the data
- There's a url to the image used to extract data
- There is a narrative representation of the data

Click the multimodal URL to render the image!

We've found that things like figures and tables often benefit from having
both a JSON description of what content exists, as well as a narrative description
to describe key elements.
"""
x_ray_data['chunks'][6]

{'boundingBoxes': [{'bottomRightX': 1157,
   'bottomRightY': 1253,
   'pageNumber': 4,
   'topLeftX': 665,
   'topLeftY': 859}],
 'chunk': 'n2c6uu-1',
 'contentType': ['table'],
 'json': [{'summary': 'The following table contains the life cycle stages and unit processes for evaluating the environmental impact of ICT equipment and AI services. It includes phases such as raw material acquisition, production, use, and end of life, with each phase detailing specific activities and whether they are mandatory or recommended.'},
  {'id': 'A', 'phase': 'Raw material acquisition', 'requirement': 'Mandatory'},
  {'activity': 'Devices production and assembly',
   'id': 'B',
   'phase': 'Production',
   'requirement': 'Mandatory'},
  {'activity': 'Manufacturer support activities',
   'id': 'B',
   'phase': 'Production',
   'requirement': 'Recommended'},
  {'activity': 'Production of support equipment',
   'id': 'B',
   'phase': 'Production',
   'requirement': 'Mandatory'},
  {'activity': 'ICT-spec

In [None]:
"""Semantic object Exploration
Here's an example of a figure

It has the same general structure as a table, but a different underlying pipeline
in X-Ray was used to create this object. Because figures are visually more complex
than tables, the narrative representation is arguably more impactful.
"""
x_ray_data['chunks'][9]

{'boundingBoxes': [{'bottomRightX': 950.20856,
   'bottomRightY': 1083.8008,
   'pageNumber': 5,
   'topLeftX': 311.6436,
   'topLeftY': 640.8914}],
 'chunk': '8dbn0r-0',
 'contentType': ['figure'],
 'json': [{'color_coding': 'Red arrows represent emissions (pollution, abiotic resources depletion), black arrows represent economic flows (bold for material, dashed for energy), colored boxes correspond to unit processes.',
   'description': 'This image is a diagram representing the life cycle phases of a device used by an AI service, including production, use, and end-of-life phases. It shows the production of electricity and resources (metals, etc.) feeding into the production of the device, the use of the device, and the end-of-life of the device. Emissions (pollution, abiotic resources depletion) are shown as red arrows, economic flows are shown as black arrows (bold for material, dashed for energy), and colored boxes correspond to unit processes.',
   'title': 'Different tasks involve
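
To get a quick overview of what X-Ray found in a document, you can tally the `contentType` values across all chunks. The sketch below runs on a small hand-written list so it is self-contained; the chunk IDs `a` and `b` are invented, and the exact label string `'paragraph'` is an assumption (only `'table'` and `'figure'` appear verbatim in the outputs above).

```python
from collections import Counter

def count_content_types(chunks):
    """Count how many chunks carry each contentType label."""
    counts = Counter()
    for chunk in chunks:
        # contentType is a list, so one chunk can contribute several labels.
        counts.update(chunk.get("contentType", []))
    return counts

# Hand-written stand-in for x_ray_data['chunks'].
sample_chunks = [
    {"chunk": "a", "contentType": ["paragraph"]},
    {"chunk": "n2c6uu-1", "contentType": ["table"]},
    {"chunk": "8dbn0r-0", "contentType": ["figure"]},
    {"chunk": "b", "contentType": ["paragraph"]},
]
print(count_content_types(sample_chunks))
```

With real data, the call would be `count_content_types(x_ray_data['chunks'])`.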

# Searching
Ok, we have a bunch of these semantic objects thanks to X-Ray, and they exist within GroundX buckets. Now we can run search via GroundX. Search will get us a list of semantic objects as well as some additional aggregate information.

Within the search object you'll find:
 - count: the number of relevant semantic objects
 - query: the query used in search
 - results: a list of semantic objects. These are normal semantic objects, but they each have an additional "score" attribute which describes how well they align with the user's query.
 - score: how relevant the top-scoring semantic object is.
 - text: a formatted block of text which contains information from relevant chunks. This can be used as context in a RAG application.

The semantic objects in the list are just like those previously discussed, except that each has a "score".
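
Because every result carries a "score", it is straightforward to keep only the strongest matches. The sketch below works on a mock response body shaped like the description above; the score values are invented (the real scale may differ), and `top_results` is a hypothetical helper.

```python
def top_results(search_body, k=3, min_score=0.0):
    """Return up to k results with score >= min_score, highest score first."""
    results = search_body.get("search", {}).get("results", [])
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r.get("score", 0.0), reverse=True)[:k]

# Mock body with made-up scores.
mock_body = {
    "search": {
        "count": 3,
        "query": "I need a diagram of the AI lifecycle",
        "results": [
            {"text": "first chunk", "score": 10.0},
            {"text": "second chunk", "score": 42.5},
            {"text": "third chunk", "score": 1.0},
        ],
    }
}
for result in top_results(mock_body, k=2):
    print(result["score"], result["text"])
```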

In [None]:
search_query = 'I need a diagram of the AI lifecycle'

response = groundx.search.content(
    id=bucket_id,
    query=search_query
)

In [None]:
# Exploring a retrieved chunk
response.body['search']['results'][0]

{'boundingBoxes': [{'bottomRightX': 616,
   'bottomRightY': 586,
   'pageNumber': 7,
   'topLeftX': 144,
   'topLeftY': 393},
  {'bottomRightX': 612,
   'bottomRightY': 338,
   'pageNumber': 7,
   'topLeftX': 588,
   'topLeftY': 321},
  {'bottomRightX': 492,
   'bottomRightY': 338,
   'pageNumber': 7,
   'topLeftX': 232,
   'topLeftY': 320},
  {'bottomRightX': 153,
   'bottomRightY': 377,
   'pageNumber': 7,
   'topLeftX': 111,
   'topLeftY': 361},
  {'bottomRightX': 616,
   'bottomRightY': 295,
   'pageNumber': 7,
   'topLeftX': 110,
   'topLeftY': 155},
  {'bottomRightX': 1188,
   'bottomRightY': 1494,
   'pageNumber': 6,
   'topLeftX': 662,
   'topLeftY': 1319},
  {'bottomRightX': 1166,
   'bottomRightY': 1311,
   'pageNumber': 6,
   'topLeftX': 693,
   'topLeftY': 1271},
  {'bottomRightX': 1162,
   'bottomRightY': 1228,
   'pageNumber': 6,
   'topLeftX': 740,
   'topLeftY': 1204},
  {'bottomRightX': 704,
   'bottomRightY': 1257,
   'pageNumber': 6,
   'topLeftX': 661,
   'topLeftY'

In [None]:
# exploring the formatted context to provide to the language model
response.body['search']['text']

'The following text excerpt is from a section of a document named \'2110.11822v2.pdf\':\n\nMetadata:\ntopic: risk\nyear: 2023.000000\n\nText Excerpt:\nThe following text:\n\nThe use phase is mostly due to the energy use, so the impacts of this part are highly dependent on the server/facility efficiency and the carbon intensity of the energy sources.\n\nThe end-of-life phase is difficult to assess in Information and Communication Technology (ICT) in general because of the lack of data concerning this phase of equipment. In particular, the end of life of many ICT equipment is poorly documented: globally, about 80% of electronic and electrical equipment is not formally collected [3].\n\n4. Assessing the usefulness of an AI for Green service\n\nNow that we have presented how the general framework of life cycle assessment can be adapted to AI solutions, we propose to use it for evaluating the complete benefits of an AI for Green service.\n\nIn this section, we will consider the following se

# RAG
Now that we understand EyeLevel's X-Ray and GroundX more thoroughly, we can explore their application. In this example we'll be using [search.content](https://documentation.eyelevel.ai/reference/Search/Search_content) to search for relevant information, and pass GroundX's formatted aggregation to the language model.

GroundX's formatted aggregation is designed to put the most important things at the beginning. To use it with language models of different sizes, you can simply keep the first `n` characters in the sequence. We've found that `n = 3*token_limit` typically works well, but more sophisticated token counting techniques can also be employed.
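
The character-based rule of thumb can be written as a one-line helper. This is a minimal sketch of the `n = 3*token_limit` heuristic described above; the name `truncate_context` and the `chars_per_token` parameter are illustrative.

```python
def truncate_context(context, token_limit, chars_per_token=3):
    """Keep the first chars_per_token * token_limit characters of the context."""
    return context[: chars_per_token * token_limit]

context = "x" * 20000
print(len(truncate_context(context, token_limit=4000)))  # prints 12000
```

Slicing is safe even when the context is already shorter than the budget, so no length check is needed.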

In [None]:
"""using OpenAI for generation
"""
!pip install OpenAI

Collecting OpenAI
  Downloading openai-1.42.0-py3-none-any.whl.metadata (22 kB)
Collecting httpx<1,>=0.23.0 (from OpenAI)
  Downloading httpx-0.27.0-py3-none-any.whl.metadata (7.2 kB)
Collecting jiter<1,>=0.4.0 (from OpenAI)
  Downloading jiter-0.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->OpenAI)
  Downloading httpcore-1.0.5-py3-none-any.whl.metadata (20 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->OpenAI)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading openai-1.42.0-py3-none-any.whl (362 kB)
Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)

In [None]:
import os
from openai import OpenAI
from google.colab import userdata

# Getting API Key for OpenAI from the Colab secret manager
OPENAI_API_KEY = userdata.get('OpenAIAPIKey')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
"""Defining RAG
using GroundX Search to retrieve information, constructing an
augmented prompt based on GX's recommended textual representation,
and using OpenAI to generate a response.
"""

# ==== Retrieval ====
def gx_search(query):
    response = groundx.search.content(
        id=bucket_id,
        query=query
    )
    return response.body['search']['text']

# ==== Augmentation ====
def gx_retrieve_and_augment(query):

    # getting context, truncated to roughly 3 characters per token
    # for a 4000 token budget
    context = gx_search(query)
    context = context[:4000 * 3]

    # defining a high level prompt so the LLM knows what to do
    system_prompt = 'you are a helpful AI agent tasked with helping users extract information from the context below'

    # based on OpenAI's chat message formatting
    augmented_prompt = [
        {"role": "system", "content": system_prompt + '\n\n===\n' + context + '\n==='},
        {"role": "user", "content": query},
    ]

    return augmented_prompt

# ==== Generation ====
def gxrag(query):

    # retrieving and augmenting
    augmented_prompt = gx_retrieve_and_augment(query)

    # generating
    client = OpenAI()
    return client.chat.completions.create(model="gpt-4", messages=augmented_prompt).choices[0].message.content

res = gxrag('What are the major parts of the AI lifecycle?')
print('response:')
print(res)

response:
The major parts of the AI lifecycle as discussed in the text are production, use, and end of life, all of which are considered in the Life Cycle Assessment (LCA) methodology.


# Image Reporting
Sometimes you don't only want a language model to answer the question for you. While RAG is useful, sometimes text simply isn't the appropriate response. In this example we'll use the same search approach as before, but provide a rich report of pages and figures, along with generation, which might answer the question. This will allow a human to quickly evaluate the truthfulness of the generation, and come to their own conclusions as necessary.

Because X-Ray is multimodal by nature, the resulting semantic objects contain a variety of visual information which can be referenced. GroundX is useful in searching, but it's important to note that the GroundX ranking is designed for textual rather than visual search. As a result, the most relevant diagram may not be the first search result from GroundX.

This can be easily alleviated by using a CLIP-style model as a re-ranker, allowing the most visually relevant information to be prioritized. It's unlikely that a small CLIP-style model can understand image content in detail, but it can likely separate broadly correct types of images from incorrect ones.
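
To see why a few "distractor" labels help with re-ranking, it is useful to look at how zero-shot scores are typically produced: the model computes a similarity logit between the image and each candidate label, and those logits are normalized with a softmax so they sum to 1. The sketch below reproduces only that normalization step in plain Python. The logits are invented numbers, not real CLIP outputs, and I'm assuming the softmax-over-labels behavior described here.

```python
import math

def softmax(logits):
    """Convert raw similarity logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["A diagram of the AI lifecycle", "miscellaneous figure",
          "miscellaneous page", "miscellaneous table"]

# Made-up image-text similarity logits, one per label, for two images.
diagram_logits = [25.0, 20.0, 15.0, 14.0]
page_logits = [18.0, 17.0, 22.0, 16.0]

for name, logits in [("diagram image", diagram_logits), ("page image", page_logits)]:
    probs = softmax(logits)
    print(name, "-> query label score:", round(probs[0], 3))
```

An image that matches a distractor label better than the query loses probability mass on the query label, which is what pushes it down the ranking.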

In [None]:
# Getting a CLIP style model to use as a reranker from Huggingface
from transformers import pipeline

#https://huggingface.co/openai/clip-vit-large-patch14
classifier = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14")

In [None]:
def GX_get_image_urls(query, rerank_filter):
    response = groundx.search.content(
        id=bucket_id,
        query=query
    )

    image_urls = set()

    for semantic_object in response.body['search']['results']:
        if 'multimodalUrl' in semantic_object.keys():
            image_urls.add(semantic_object['multimodalUrl'])
        image_urls.update(semantic_object['pageImages'])

    image_urls = list(image_urls)

    ranked_results = classifier(
        image_urls,
        candidate_labels=rerank_filter,
    )

    #reformatting rank
    for i in range(len(ranked_results)):
        for j in range(len(rerank_filter)):
            ranked_results[i][j]['url'] = image_urls[i]

    #flattening
    ranked_results = [x for xs in ranked_results for x in xs]

    return ranked_results

query = 'A diagram of the AI lifecycle'
rerank_filters = [query, 'miscellaneous figure', 'miscellaneous page', 'miscellaneous table']

ranked_results = GX_get_image_urls(query, rerank_filters)

In [None]:
import pandas as pd
df = pd.DataFrame(ranked_results)
df = df[df['label'] == query].sort_values('score',ascending=False)

def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)

df.style.format({'url': make_clickable})

Unnamed: 0,score,label,url
12,0.999999,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/figure-5-1.jpg
8,0.999997,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/figure-5-0.jpg
36,0.999929,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/figure-4-0.jpg
0,0.999839,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/figure-6-0.jpg
60,0.99933,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/1.jpg
64,0.9917,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/5.jpg
40,0.982985,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/table-4-0.jpg
24,0.974595,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/figure-8-0.jpg
4,0.965495,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/2.jpg
20,0.942037,A diagram of the AI lifecycle,https://upload.eyelevel.ai/layout/raw/prod/d8606fc9-8be7-4bb1-b546-c8f2ba51b5e4/c4106c80-7a9d-4f3c-8b27-0b10aa9dd79c/table-4-1.jpg


As can be seen, the top responses are the figures most relevant to the query.