# Jina Embeddings v4 for FiftyOne Tutorial

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/jina_embeddings_v4/blob/main/jinav4_embeddings_fiftyone_tutorial.ipynb)

This notebook demonstrates how to use Jina Embeddings v4 with FiftyOne for visual document retrieval.

## Overview

Jina Embeddings v4 is a vision-language model that embeds both images and text into a shared vector space. It is built on a parameter-efficient architecture using PEFT (Parameter-Efficient Fine-Tuning) and supports several tasks, including document retrieval, multilingual text matching, and code understanding. This integration adapts Jina v4 for use with FiftyOne's embedding and similarity infrastructure.
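
To make the idea of a shared vector space concrete, here is a minimal sketch of how a text embedding can be compared against page embeddings with cosine similarity. The three-dimensional vectors are made-up stand-ins, not real model output, and the helper `cosine_similarity` is written here for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real embeddings
text_query = [1.0, 0.0, 1.0]
page_a = [2.0, 0.0, 2.0]  # points in the same direction as the query
page_b = [0.0, 1.0, 0.0]  # orthogonal to the query

score_a = cosine_similarity(text_query, page_a)
score_b = cosine_similarity(text_query, page_b)
print(score_a, score_b)
```

Because text and image vectors live in the same space, a higher score for `page_a` than for `page_b` would mean that page is the better match for the text query.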

## Setup

Install required packages:


In [None]:
%pip install fiftyone transformers torch huggingface-hub umap-learn

## Register the Zoo Model

Register this repository as a FiftyOne zoo model source:


In [None]:
import fiftyone.zoo as foz

# Register this repository as a remote zoo model source
foz.register_zoo_model_source(
    "https://github.com/harpreetsahota204/jina_embeddings_v4",
    overwrite=True
)

## Load Dataset

Load a document dataset from Hugging Face:


In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

# Load document dataset from Hugging Face
dataset = load_from_hub(
    "Voxel51/document-haystack-10pages",
    overwrite=True
)

## Basic Workflow: Document Retrieval

### Load Model and Compute Embeddings


In [None]:
import fiftyone.zoo as foz

model = foz.load_zoo_model(
    "jinaai/jina-embeddings-v4",
    task="retrieval",  # or "text-matching", "code"; "retrieval" works best for visualizing embeddings
)

In [None]:
# Compute embeddings for all documents
dataset.compute_embeddings(
    model=model,
    embeddings_field="jina_embeddings",
)

# Check embedding dimensions
print(dataset.first()['jina_embeddings'].shape) 


### Build Similarity Index


In [None]:
import fiftyone.brain as fob

# Build similarity index
text_img_index = fob.compute_similarity(
    dataset,
    model="jinaai/jina-embeddings-v4",
    embeddings="jina_embeddings",  # reuse the embeddings computed above
    brain_key="jina_sim",
    model_kwargs={
        "task":"retrieval",
    }
)


### Query for Specific Content


In [None]:
# Query for specific content
sims = text_img_index.sort_by_similarity(
    "the secret office supply is pencil"
)

# Launch the FiftyOne App on the ranked view
session = fo.launch_app(sims, auto=False)
print(session.url)
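
Conceptually, a text query like the one above is embedded and the samples are ranked by how similar their embeddings are to it. The following is a rough sketch of that ranking step on toy data, assuming unit-length vectors so that the dot product acts as the similarity score. The page ids, vectors, and the `top_k` helper are hypothetical and do not reproduce FiftyOne's internals.

```python
def top_k(query, embeddings, k=2):
    """Return the k ids whose embeddings have the largest dot product with the query."""
    scores = {
        sample_id: sum(q * e for q, e in zip(query, vec))
        for sample_id, vec in embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical unit-length embeddings keyed by made-up page ids
pages = {
    "page_1": [1.0, 0.0],
    "page_2": [0.6, 0.8],
    "page_3": [0.0, 1.0],
}
query = [0.8, 0.6]  # stand-in for an embedded text prompt

ranked = top_k(query, pages, k=2)
print(ranked)
```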


## Advanced Embedding Workflows

### 1. Embedding Visualization with UMAP

Create 2D visualizations of your document embeddings:


In [None]:
import fiftyone.brain as fob

# Create UMAP visualization
results = fob.compute_visualization(
    dataset,
    method="umap",  # Also supports "tsne", "pca"
    brain_key="jina_viz",
    embeddings="jina_embeddings"
)

# Explore in the App
session = fo.launch_app(dataset)


### 2. Similarity Search

Build an embedding-based similarity index with the Jina v4 embeddings:


In [None]:
import fiftyone.brain as fob

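# Build a similarity index from the precomputed embeddings.
# A separate brain key keeps it from clashing with the text-queryable index built earlier.
results = fob.compute_similarity(
    dataset,
    backend="sklearn",  # Fast sklearn backend
    brain_key="jina_sklearn_sim",
    embeddings="jina_embeddings"
)

# Find the documents most similar to a reference sample
sample_id = dataset.first().id
similar_samples = dataset.sort_by_similarity(
    sample_id,
    brain_key="jina_sklearn_sim",
    k=10  # Top 10 most similar
)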

# View results
session = fo.launch_app(similar_samples)


### 3. Dataset Representativeness

Score how representative each sample is of your dataset:


In [None]:
import fiftyone.brain as fob

# Compute representativeness scores
fob.compute_representativeness(
    dataset,
    representativeness_field="jina_represent",
    method="cluster-center",
    embeddings="jina_embeddings"
)

# Find most representative samples
representative_view = dataset.sort_by("jina_represent", reverse=True)
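
As a rough intuition for what a center-based score measures, the sketch below rates each vector by how close it sits to the centroid of all vectors. This is a simplified illustration on toy data and an assumption about the general idea only; it is not the scoring that `compute_representativeness` actually performs, and the helper names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def closeness_to_center(vectors):
    """Score each vector as 1 / (1 + distance to the centroid); higher means more central."""
    center = centroid(vectors)
    return [1.0 / (1.0 + math.dist(v, center)) for v in vectors]

# Hypothetical 2D embeddings: three points near each other and one outlier
points = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [1.0, 3.0]]
scores = closeness_to_center(points)

most_central = scores.index(max(scores))
least_central = scores.index(min(scores))
print(most_central, least_central)
```

Sorting by such a score in descending order, as `representative_view` does with the real field, puts the most typical samples first.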


### 4. Duplicate Detection

Score each document's uniqueness to surface near-duplicates:


In [None]:
import fiftyone.brain as fob

# Score uniqueness from the embeddings; scores are stored in the "uniqueness" field
fob.compute_uniqueness(
    dataset,
    embeddings="jina_embeddings"
)

# Sort so the most unique samples come first; near-duplicates fall to the end
unique_view = dataset.sort_by("uniqueness", reverse=True)
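
One way to think about low uniqueness is that a sample has a very close neighbor in embedding space. The sketch below flags such samples on toy data using the distance to the nearest other vector. The vectors, the `threshold` value, and the helper name are hypothetical, and this is not the exact computation behind `compute_uniqueness`.

```python
import math

def nearest_neighbor_distances(vectors):
    """For each vector, the Euclidean distance to its closest other vector."""
    return [
        min(math.dist(v, other) for j, other in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]

# Hypothetical 2D embeddings: the first two are nearly identical
vectors = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [9.0, 0.0]]
distances = nearest_neighbor_distances(vectors)

threshold = 0.5  # hypothetical cutoff; tune it for real embeddings
near_duplicates = [i for i, d in enumerate(distances) if d < threshold]
print(near_duplicates)
```

On a real dataset, the samples at the bottom of `unique_view` are the natural candidates to review before removing anything.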
