Skip to content

Alembic DDL mismatch for HALFVEC index #118

@jleguina

Description

@jleguina

Hi,

I'm facing a few issues with using Alembic and SQLAlchemy when creating halfvec indices.

1. The DLL generated by alembic given the index is invalid

My index:

Index(
    "ix_chunks_vector",
    func.cast(Chunk.vector, HALFVEC(settings.EMBEDDING_DIMS)).label("vector"),
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"vector": "halfvec_cosine_ops"},
)

Which translates to Alembic as:

    op.create_index(
        "ix_chunks_vector",
        "chunks",
        [sa.text("CAST(vector AS HALFVEC(3072))")],
        unique=False,
        postgresql_using="hnsw",
        postgresql_with={"m": 16, "ef_construction": 64},
        postgresql_ops={"vector": "halfvec_cosine_ops"},
    )

Which in turn becomes:

CREATE INDEX ix_chunks_vector 
ON chunks
USING hnsw (CAST(vector AS HALFVEC(3072))) 
WITH (m = 16, ef_construction = 64)

As you can see, the operations are not inserted correctly into the query: hnsw (CAST(vector AS HALFVEC(3072))) should be hnsw (CAST(vector AS HALFVEC(3072)) halfvec_cosine_ops)

2. Manually altering the DLL results in a mismatch next time I generate a migration

If I manually correct the query to be:

CREATE INDEX ix_chunks_vector 
ON chunks
USING hnsw (CAST(vector AS HALFVEC(3072)) halfvec_cosine_ops) 
WITH (m = 16, ef_construction = 64)

The next time I run alembic autogenerate it will try to remove this index, and create one with the wrong DDL again:

    op.drop_index(
        "ix_chunks_vector",
        table_name="chunks",
        postgresql_with={"m": "16", "ef_construction": "64"},
        postgresql_using="hnsw",
    )
    op.create_index(
        "ix_chunks_vector",
        "chunks",
        [sa.text("CAST(vector AS HALFVEC(3072))")],
        unique=False,
        postgresql_using="hnsw",
        postgresql_with={"m": 16, "ef_construction": 64},
        postgresql_ops={"vector": "halfvec_cosine_ops"},
    )

I'm struggling to find a way to define the index appropriately to avoid this.

Thanks.


For context, here is my ORM table definition:

class Chunk(Base):
    __tablename__ = "chunks"

    # I've tried setting index to True and False, no change
    vector: Mapped[list[float]] = mapped_column(
        Vector(settings.EMBEDDING_DIMS), unique=False, nullable=False
    )

Index(
    "ix_chunks_vector",
    func.cast(Chunk.vector, HALFVEC(settings.EMBEDDING_DIMS)).label("vector"),
    postgresql_using="hnsw",
    postgresql_with={"m": 16, "ef_construction": 64},
    postgresql_ops={"vector": "halfvec_cosine_ops"},
)

I am using:

[[package]]
name = "pgvector"
version = "0.3.6"

[[package]]
name = "alembic"
version = "1.14.1"

[[package]]
name = "sqlalchemy"
version = "2.0.36"

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions