# **📚 Semantic Document Search**

💻 **Installation**: To install the plugin and its dependencies (Sentence Transformers library, `Qdrant` client Python library), use the provided `fiftyone` CLI commands.

🚀 **Run Qdrant**: You also need a running `Qdrant` instance, which you can start with the Docker command given in the workflow steps.

💡 **Note**: This semantic search plugin is similar to the [keyword search plugin](https://github.com/jacobmarks/keyword-search-plugin) and is designed to be used with the PyTesseract OCR plugin.
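Before building an index, it can help to confirm that the `Qdrant` instance is reachable. The check below is a minimal sketch, assuming Qdrant listens on `localhost` at port `6333` (the REST port mapped by the Docker command in this notebook). `qdrant_is_reachable` is a hypothetical helper name and is not part of the plugin; it only tests that the TCP port accepts connections, not that the service is healthy.

```python
import socket


def qdrant_is_reachable(host: str = "localhost", port: int = 6333, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection raises OSError (refused, timed out, unresolved host) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumption: default host and port; adjust if your container maps different ports
print("Qdrant reachable:", qdrant_is_reachable())
```

If this prints `False`, start the container before running the index operator.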

This notebook uses a plugin that enables semantic search over text blocks (produced by OCR) in your dataset, using a `Qdrant` index and the [General Text Embeddings (GTE)-base model](https://huggingface.co/thenlper/gte-base) from the Sentence Transformers library. 🤔

### **💡 Overview**

To use this plugin, follow these steps:

✍️ **Step 1: Get Text Blocks**: First, obtain text blocks in your dataset using the [`PyTesseract OCR`](https://github.com/jacobmarks/pytesseract-ocr-plugin/) plugin. 

🔍 **Step 2: Create Index**: The `create_semantic_document_index` operator creates a vector index for your text blocks, stored in Qdrant with the collection name `<dataset_name>_sds_<field_name>`.

🎯 **Step 3: Search Documents**: The `semantically_search_documents` operator lets you search through your text blocks, specifying which index to use if you have multiple detections with text blocks.
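The collection naming pattern from Step 2 can be sketched as a small helper, which is handy when you want to look up the collection in Qdrant yourself. This is only an illustration of the `<dataset_name>_sds_<field_name>` pattern: `sds_collection_name` is a hypothetical function, `ocr_blocks` is a made-up field name, and the plugin may normalize names in ways not shown here.

```python
def sds_collection_name(dataset_name: str, field_name: str) -> str:
    """Build a name following the <dataset_name>_sds_<field_name> pattern."""
    return f"{dataset_name}_sds_{field_name}"


# "ocr_blocks" is a hypothetical field name used only for illustration
name = sds_collection_name("cvpr_papers_250_samples", "ocr_blocks")
print(name)
```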


In [None]:
!fiftyone plugins download https://github.com/jacobmarks/semantic-document-search-plugin

!fiftyone plugins requirements @jacobmarks/semantic_document_search --install

!fiftyone plugins download https://github.com/jacobmarks/pytesseract-ocr-plugin

!fiftyone plugins requirements @jacobmarks/pytesseract_ocr --install

In [None]:
import os

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

You can download the dataset from Hugging Face using the code below. Note that we only download 250 samples for this example.

In [None]:
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub(
    "Voxel51/CVPR_2024_Papers",
    max_samples=250,
    name="cvpr_papers_250_samples",
    overwrite=True,
)

## **Workflow Steps:**

Once you start the session, press the backtick key (`` ` ``) to open the list of plugin operators:

1. **Run OCR Engine (`run_ocr_engine`)**: Extract text from documents in the dataset using PyTesseract OCR, convert results to FiftyOne labels, and store individual word predictions and block-level predictions.

2. **Create Semantic Document Index (`create_semantic_document_index`)**: After OCR has run, build a Qdrant index for a specified text field within a detections label field. Before running this operator, start a Qdrant Docker container: `docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant`

3. **Semantically Search Documents (`semantically_search_documents`)**: Search for text in your dataset using semantic querying. Filter results to show only labels that match your query. Customize the search by specifying the number of results to return and the similarity score threshold.
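The two search settings in step 3, the number of results and the similarity score threshold, can be illustrated with a toy ranking function. This is a sketch under stated assumptions, not the plugin's implementation: it assumes cosine similarity (a common choice for sentence embeddings; the metric the plugin uses is not stated here), uses made-up 3-dimensional vectors in place of real GTE embeddings, and `search` and `cosine` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, block_vecs, k=3, threshold=0.5):
    """Return up to k (block_id, score) pairs with score >= threshold, best first."""
    scored = [(block_id, cosine(query_vec, vec)) for block_id, vec in block_vecs.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy vectors standing in for text-block embeddings (illustrative only)
blocks = {
    "abstract": [1.0, 0.0, 0.0],
    "methods": [0.8, 0.6, 0.0],
    "references": [0.0, 0.0, 1.0],
}
results = search([1.0, 0.0, 0.0], blocks, k=2, threshold=0.5)
for block_id, score in results:
    print(block_id, round(score, 3))
```

Raising the threshold drops weaker matches before the top-`k` cut is applied, so a high threshold can return fewer than `k` results.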

In [None]:
session = fo.launch_app(dataset)