# Search PDF text, images and tables with Python and CLIP

## Introduction

Have you ever been searching through a stack of files and just can't find the right keywords to get what you're looking for? Staring at a screen and wracking your brain at 3am for the right word ain't fun, take it from me.

How about trying to search through a stack of PDFs? That gets even harder since all that nice plain text is wrapped up in [a gnarly format](https://forum.quartertothree.com/t/is-pdf-an-evil-format/58598). Good luck grepping those!

And what if you want to search **tables and images** as well as text? 

In this notebook we're going to kill those three birds with one stone.

We'll harness the power of AI to find things *similar* to the search query you input, and we'll show you how to deploy that search engine in real life for anyone to use.

We're going to do this with open-source tools from the Jina ecosystem.

### Why Jina and [neural search](https://docs.jina.ai/get-started/neural-search?utm_source=pdf-notebook)? What's wrong with good old symbolic search?

#### Semantics semantics semantics!

Instead of just matching patterns, our search engine will match *meanings*. So if we were to search [`arthropod`](https://examples.yourdictionary.com/examples-of-arthropods.html), our top results would be related directly to arthropods, but we'd also get results for spiders, scorpions, horseshoe crabs and lots of other cute related critters. This is because we're using deep neural nets (DNNs) to embed words in a vector space so that words with similar meanings have similar [embeddings](https://docarray.jina.ai/fundamentals/document/embedding?utm_source=pdf-notebook).

*An example of an arthropod, specifically a Trilobite:*

![](https://upload.wikimedia.org/wikipedia/commons/thumb/a/a6/Estonian_Museum_of_Natural_History_-_trilobite_-_Hydrocephalus.png/1280px-Estonian_Museum_of_Natural_History_-_trilobite_-_Hydrocephalus.png)
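
To make "similar embeddings" a little more concrete, here is a minimal sketch of how closeness in a vector space is commonly measured, using cosine similarity. The three-dimensional vectors are made up for illustration; real embeddings come from a trained model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (NOT produced by a real model)
toy_embeddings = {
    "arthropod": [0.9, 0.8, 0.1],
    "spider": [0.8, 0.9, 0.2],
    "spreadsheet": [0.1, 0.2, 0.9],
}

# Rank the other words by how close they sit to the query word
query = toy_embeddings["arthropod"]
ranked = sorted(
    (word for word in toy_embeddings if word != "arthropod"),
    key=lambda word: cosine_similarity(query, toy_embeddings[word]),
    reverse=True,
)
print(ranked)
```

In this toy setup "spider" ranks above "spreadsheet" because its vector points in nearly the same direction as the query's.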

#### Less code to write

Using Jina Hub, we reduce the amount of code we need to write. Instead of integrating [Transformers](https://hub.jina.ai/executor/u9pqs8eb) with our search engine, we can simply use a couple of lines of code to download it from [Jina Hub](https://hub.jina.ai), run it in Docker, or run it in a [sandbox](https://docs.jina.ai/how-to/sandbox?utm_source=pdf-notebook) on the cloud. And if we wanted to swap it out for something like [spaCy](https://hub.jina.ai/executor/u7h7cuh2)? Again, just a matter of changing a couple of lines of code.

#### Deployment made easy

Also, tools like Jina take a lot of hassle out of the orchestration and scaling. We can easily add [sharding, replicas](https://docs.jina.ai/how-to/scale-out/?highlight=sharding), [Kubernetes integration](https://docs.jina.ai/how-to/kubernetes?utm_source=pdf-notebook), and so on. 

### Meet our ingredients

#### **[DocArray](https://docarray.jina.ai?utm_source=pdf-notebook)**

DocArray is a library for nested, unstructured data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the multi-modal data with a Pythonic API. ([star the repo]())

#### **[Jina](https://docs.jina.ai)**
 
Jina is a framework that empowers anyone to build cross-modal and multi-modal applications on the cloud. It uplifts a PoC into a production-ready service. Jina handles the infrastructure complexity, making advanced solution engineering and cloud-native technologies accessible to every developer. ([star the repo]())

#### **[Jina Hub](https://hub.jina.ai)**

Download pre-built building blocks for neural search.

## Setup

In [None]:
!pip install -q docarray[full]

In [None]:
!pip install -q ipywidgets jina

In [1]:
import os
import warnings

In [2]:
warnings.filterwarnings('ignore')  # ignore all those pesky warnings

## Downloading our data

We're using a couple of PDFs downloaded from arxiv.org. Of course, this is just a toy dataset. PDFs can differ in many ways, and depending on your use case you may need to process them very differently (e.g. OCR, image processing). Ours are fairly simple, with text we can extract directly, so we can skip those extra steps. The steps we do cover should apply to most PDF search engines you may wish to build.

I selected these PDFs because they included images, text and tables, and extracting/processing those is a key part of this notebook.

---

#### ⚙️ Want to use your own data?

In that case:

* Ignore the cell below
* Create a `data` directory in the "Files" sidebar
* Copy your own PDFs into that

In [9]:
if not os.path.isdir("data"):
  !wget -q -N --output-document data.zip https://github.com/jina-ai/workshops/blob/main/notebooks/pdf_search/part_2_images_and_text/data.zip?raw=true
  !unzip -n data.zip
  !rm -f data.zip

Archive:  data.zip
   creating: data/
  inflating: data/0809.0899.pdf      
  inflating: data/1704.04553.pdf     


## Loading our PDF files

We'll use a [DocumentArray](https://docarray.jina.ai/fundamentals/documentarray/) from the [DocArray](https://docarray.jina.ai/) package to collect all of our PDFs, then [load them as binary blob data](https://docarray.jina.ai/fundamentals/document/fluent-interface/#blobdata/) into [Document](https://docarray.jina.ai/fundamentals/document/) instances.

In [3]:
from docarray import DocumentArray, Document

In [4]:
docs = DocumentArray.from_files("data/*.pdf")

In [5]:
for doc in docs:
  doc.load_uri_to_blob()
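
Conceptually, loading a file "as a blob" just means reading its raw bytes into memory. The sketch below mimics that idea with the standard library only. `load_blobs` is a hypothetical helper, not part of DocArray, and a throwaway temporary directory stands in for `data/`.

```python
import tempfile
from pathlib import Path

def load_blobs(directory, pattern="*.pdf"):
    """Map each matching file path to its raw bytes (hypothetical helper)."""
    return {
        str(path): path.read_bytes()
        for path in sorted(Path(directory).glob(pattern))
    }

# Demo with a stand-in file rather than a real PDF
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.pdf").write_bytes(b"%PDF-1.4 stand-in content")
    (Path(tmp) / "notes.txt").write_text("ignored: not a PDF")
    blobs = load_blobs(tmp)

for uri, blob in blobs.items():
    print(uri, len(blob), blob[:4])
```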

## Creating a Flow

We'll use Jina to generate [Flows](https://docs.jina.ai/fundamentals/flow?utm_source=pdf-notebook) for indexing and searching. Our Documents will pass through these when we're indexing or searching.

A Flow is built out of [Executors](https://docs.jina.ai/fundamentals/executor?utm_source=pdf-notebook), each of which performs a single processing task on each Document. We'll use [Jina Hub](https://hub.jina.ai) to provide pre-made Executors, meaning we don't have to write so much code.

Compared to our [previous PDF search engine](https://colab.research.google.com/github/jina-ai/workshops/blob/main/pdf_search/pdf_search.ipynb), this Flow has a lot more Executors. You can read about them in our blog post.

### Why just one Flow?

In a later notebook we'll deploy and host our Flow on [JCloud](https://docs.jina.ai/fundamentals/jcloud/) for free. This requires us to use just one Flow for both indexing and searching.

### Why is this Flow so complex?

Using one Flow to handle both indexing and searching means:

- When we submit a search term it's merely a text string or an image, both of which we wrap in a Document. We don't need to extract them from any other kind of data, so we can skip a lot of Executors (anything with the name prefix of `index_`)
- Our search Document is a "root-level" Document - i.e. the content is right at the "top". Our indexed Documents are at chunk-level (the sentences, images, and tables extracted from the top-level PDF). So we need to use Executors with different `traversal_paths`. This means duplicating a few Executors, with one to run during indexing (prefixed `index_`) and one during searching (`search_`).
- Some Executors (like [AnnLiteIndexer](https://hub.jina.ai/executor/7yypg8qk)) are used both for indexing and searching, so are prefixed `all_`.
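
The prefixes matter because each request names the Executors it should pass through with a regular expression (the `target_executor` argument used in the indexing and search cells later on). Here is a rough sketch of that selection using Python's `re` module. It assumes the pattern is matched against the start of each Executor name, which is how `re.match` behaves; Jina's internal matching may differ in detail.

```python
import re

# A subset of the Executor names used in this notebook's Flow
executor_names = [
    "index_segmenter",
    "index_encoder",
    "search_image_processor",
    "search_encoder",
    "all_indexer",
]

def select_executors(pattern, names):
    """Return the names whose beginning matches the regular expression."""
    return [name for name in names if re.match(pattern, name)]

index_route = select_executors("(index_*|all_*)", executor_names)
search_route = select_executors("(search_*|all_*)", executor_names)

print(index_route)
print(search_route)
```

Under that assumption, the `all_` Executor appears in both routes while the `index_` and `search_` Executors appear in only one.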

In [6]:
from jina import Flow, Client

In [7]:
flow = (
    Flow()
    .add(
        uses="jinahub://PDFTableExtractor/latest", # Extract tables
        install_requirements=True,
        name="index_table_extractor"
    )
    .add(
        uses="jinahub://PDFSegmenter", # Extract images/text
        install_requirements=True,
        name="index_segmenter"
    )
    .add(
        uses="jinahub://ElementTypeTagger", # Tag Documents based on modality (image/text/table)
        uses_with={"traversal_paths": "@c"},
        name="index_tagger",
    )
    .add(
        uses="jinahub://SpacySentencizer", # Sentencize long text into sentences
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="index_sentencizer",
    )
    .add(
        uses="jinahub://TagsCopier", # Recursively copy tags
        uses_with={"traversal_paths": "@c"},
        name="index_tags_copier"
    )
    .add(
        uses="jinahub://ChunkFlattener", # Flatten all chunks to doc.chunks
        name="index_flattener"
    )
    .add(
        uses="jinahub://ImagePreprocessor-skip-non-images", # Process images in PDF chunks
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="index_image_processor"
    )
    .add(
        uses="jinahub://ImagePreprocessor-skip-non-images", # Process search query image
        uses_with={"traversal_paths": "@r"},
        install_requirements=True,
        name="search_image_processor"
    )
    .add(
        uses="jinahub://CLIPEncoder/latest-gpu", # Encode using CLIP - chunk level
        uses_with={"traversal_paths": "@c"},
        install_requirements=True,
        name="index_encoder"
    )
    .add(
        uses="jinahub://CLIPEncoder/latest-gpu", # Encode using CLIP - root level
        install_requirements=True,
        name="search_encoder"
    )
    .add(
        uses="jinahub://AnnLiteIndexer", # Store vectors and metadata on disk
        uses_with={
            "index_traversal_paths": "@c",
            "search_traversal_paths": "@r",  # the search query is a root-level Document
            "columns": [("element_type", "str")],
            "n_dim": 512
            },
        install_requirements=True,
        name="all_indexer"
    )
)
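
The `traversal_paths` values above decide which level of a nested Document an Executor works on: `@r` selects the root Documents and `@c` selects their chunks. Below is a toy model of that idea. It uses a hypothetical `ToyDoc` dataclass rather than DocArray's real `Document`, and it handles only these two paths.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document: some content plus nested chunks."""
    content: str
    chunks: List["ToyDoc"] = field(default_factory=list)

def traverse(docs, path):
    """Select root Documents ("@r") or their direct chunks ("@c")."""
    if path == "@r":
        return list(docs)
    if path == "@c":
        return [chunk for doc in docs for chunk in doc.chunks]
    raise ValueError(f"unsupported path in this sketch: {path}")

# An indexed PDF: the root holds the file, the chunks hold extracted pieces
pdf = ToyDoc("paper.pdf", chunks=[ToyDoc("a sentence"), ToyDoc("a table")])
# A search query: the content sits at the root and there are no chunks
query = ToyDoc("trilobite diagram")

print([d.content for d in traverse([pdf], "@c")])
print([d.content for d in traverse([query], "@r")])
print([d.content for d in traverse([query], "@c")])
```

The last line prints an empty list: a query has no chunks, which is why search-time Executors need to work at root level.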

In [None]:
flow.plot()

## Indexing our Documents

Now it's time to run the Flow.

First we'll remove any old index data that may be lying around to ensure nothing carried over from a prior run:

In [None]:
!rm -rf workspace

🚨 **Note:** if the below cell fails, restart the runtime (*Runtime* > *Restart runtime*) and run all the cells again. This seems to be an issue with Colab.

In [1]:
with flow:
  client = Client(port=flow.port)
  docs = client.post("/index", docs, request_size=1, show_progress=True, target_executor="(index_*|all_*)")


### Examining our Documents

Now that we've done all that processing, what do our Documents look like?

Let's look at the indexed DocumentArray to start:

In [13]:
docs.summary()

And now the first Document:

In [14]:
docs[0]

In [15]:
docs[0].chunks

Here's a chunk with its embedding and tags:

In [16]:
docs[0].chunks[0]


## Searching our data

To perform a search, we need to:

- Create a Document containing our search query (either image or text)
- If the search query is an image, convert to a tensor so CLIPEncoder can read it
- Encode the search query with CLIPEncoder
- Search through the already indexed data with the search query

You can also specify filters for `element_type` (either `text`, `table`, or `image`).

In [None]:
search_format = "text" # text or image

### Using a text search term

In [None]:
if search_format == "text":
  search_term = "trilobite diagram"
  query_doc = Document(text=search_term)

### Using an image search term

In [None]:
if search_format == "image":
  # Download image
  image_url = "http://paleonet.org/TTP/files/stacks-image-f0024aa.jpg"
  !wget -q --output-document image.png $image_url

  query_doc = Document(uri="image.png")

### Applying search filter

[AnnLiteIndexer](https://hub.jina.ai/executor/7yypg8qk?utm_source=notebook-pdf-search-tables) allows you to apply MongoDB-style filters. Check the [Executor's README](https://hub.jina.ai/executor/7yypg8qk) to learn more.

In [None]:
# you can use any combination of text/table/image

element_type = [
    "text",
    "image",
    "table",
]

In [None]:
filter = {
    "element_type": {
        "$in": element_type,
    }
}
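
To show what the `$in` operator means, here is a small pure-Python sketch that applies the same kind of filter to a list of tag dictionaries. It only illustrates the semantics: `matches_filter` is a hypothetical helper, and AnnLite evaluates filters internally in its own way.

```python
def matches_filter(tags, query_filter):
    """Check tags against a filter that uses only the `$in` operator."""
    for field_name, condition in query_filter.items():
        allowed = condition["$in"]
        if tags.get(field_name) not in allowed:
            return False
    return True

# Made-up tags for three extracted chunks
chunk_tags = [
    {"element_type": "text"},
    {"element_type": "image"},
    {"element_type": "table"},
]

# Keep only text and table chunks
example_filter = {"element_type": {"$in": ["text", "table"]}}
kept = [tags for tags in chunk_tags if matches_filter(tags, example_filter)]
print(kept)
```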

### Performing the search

In [None]:
with flow:
  client = Client(port=flow.port)

  results = client.post(
      "/search",
      query_doc, 
      request_size=1,
      parameters={
          "filter": filter
      },
      show_progress=True, 
      target_executor="(search_*|all_*)"
      )

### Show results

If a result is text or a table, we just print it out. Otherwise, we plot the image match in the notebook.

Note: Due to the content of the PDFs, *most* results will be text results. You can change the `filter` above to select instead for tables and/or images.

The `render()` function below renders the search results in a notebook. It's quick, hacky code tailored for notebooks; in the real world you'd probably want to do something different, but it will serve for now.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def render(docarray):
  """Display each match according to its element type."""
  for idx, doc in enumerate(docarray):
    if doc.tags["element_type"] == "image":
      # Undo the preprocessing normalization, save the image, then plot it
      os.makedirs("images", exist_ok=True)
      filename = f"images/{idx}-{doc.id}.png"
      doc.set_image_tensor_inv_normalization(channel_axis=0)
      doc.save_image_tensor_to_file(filename, channel_axis=0)
      image = plt.imread(filename)
      plt.figure()  # one figure per image
      plt.axis('off')
      plt.imshow(image)

    elif doc.tags["element_type"] == "table":
      # The table is kept as CSV text in the tags; write it out and load it with pandas
      os.makedirs("csvs", exist_ok=True)
      filename = f"csvs/{idx}-{doc.id}.csv"
      with open(filename, "w") as file:
        file.write(doc.tags["table_content"])
      df = pd.read_csv(filename)
      print(df)

    else:
      # Anything else is treated as text
      print(doc.text)

In [None]:
render(results[0].matches)
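
If writing temporary CSV files is undesirable, a table match could also be parsed in memory. This sketch assumes, as `render()` does, that the table arrives as a CSV string (in the notebook, `doc.tags["table_content"]`). The sample string and the `parse_table` helper are invented for illustration.

```python
import csv
import io

def parse_table(csv_text):
    """Parse a CSV string into a header row and a list of data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], []
    return rows[0], rows[1:]

# Invented stand-in for doc.tags["table_content"]
sample_table = "name,count\nalpha,1\nbeta,2\n"

header, body = parse_table(sample_table)
print(header)
for row in body:
    print(row)
```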

## Putting it into production

Colab notebooks have a number of restrictions that make real-world stuff quite difficult. If we were building this outside of a notebook, we could:

* Set up a [RESTful or gRPC gateway](https://docs.jina.ai/fundamentals/gateway?utm_source=pdf-notebook) and keep the Flow open to requests using `flow.block()`
* Use [sharding and replicas](https://docs.jina.ai/how-to/scale-out?utm_source=pdf-notebook) to improve performance and reliability.
* [Monitor our Flow with Grafana](https://docs.jina.ai/fundamentals/flow/monitoring-flow?utm_source=pdf-notebook)
* Better yet, host our Flow on [JCloud](https://docs.jina.ai/fundamentals/jcloud?utm_source=pdf-notebook), so we don't have to use any of our own compute for encoding, indexing, hosting, etc (encoding is especially hungry on the hardware)
* Finetune our results using [Finetuner](https://finetuner.jina.ai) to provide better matches
* Use a more specialized model (rather than just general purpose)

## Troubleshooting

### No text is being extracted from my PDF

It might be that your PDF is full of *pictures of text* rather than text itself. This is quite common. In a future notebook we'll integrate an OCR Executor like [PaddlePaddleOCR](https://hub.jina.ai/executor/78yp7etm) to get around this.

### I'm getting bad search results in my language

The CLIP model we're using is trained primarily on English. Multilingual CLIP models do exist, however. You can define which model you want to use with the `pretrained_model_name_or_path` argument in [CLIPEncoder](https://hub.jina.ai/executor/29r2b26t).

### My tables aren't being extracted

The docs2info table extraction service is still being tested. While it has provided good results in my experience, it's still under heavy development.

### The notebook fails when I do anything involving images

Try restarting the runtime (there should be an option for that near the top, under the `!pip install docarray[full]` cell). This seems to be a notebook limitation.

### It's too slow!

Have you enabled Colab's GPU under *Runtime* > *Change runtime type*?

### Something else?

Join our [Slack](https://slack.jina.ai) and ask us there in the #projects-pdf channel!

## Learn more

Want to dig more into the Jina ecosystem? Here are some resources:

- [Developer portal](https://learn.jina.ai) - tutorials, courses, videos on using Jina
- [Fashion search notebook](https://colab.research.google.com/github/alexcg1/neural-search-notebooks/blob/main/fashion-search/1_build_basic_search/basic_search.ipynb) - build an image-to-image fashion search engine
- [DALL-E Flow](https://colab.research.google.com/github/jina-ai/dalle-flow/blob/main/client.ipynb#scrollTo=NeWDy9viOCAP)/[Disco Art](https://colab.research.google.com/github/jina-ai/discoart/blob/main/discoart.ipynb#scrollTo=47428f37) - create AI-generated art in your browser