Skip to content

Pipeline creates Chunks with duplicate ids when executed multiple times #221

@risafj

Description

@risafj

When I run the Pipeline() on a loop with multiple documents, a Chunk node with an id property of ":1" and index of 1 is created for each run. This causes problems, since the ids are no longer unique.

For example, when the lexical graph gets created, a Chunk node with an id of ":1" has a NEXT_NODE relation to every Chunk node that has an id of ":2".

After running the pipeline with 4 documents, it looks like this:

Neo4j_Aura

The same issue is occuring with FROM_CHUNK, where an entity that's supposed to have a relation like (n:Entity)-[:FROM_CHUNK]->(c:Chunk {id: ":1", index: "1"}) actually has that relation to all documents' chunks with an index of 1.

Is there any workaround for this?
I'm guessing this issue would be solved if I could somehow pass document-specific id_prefix so each chunk gets a unique id?

def chunk_id(self, chunk_index: int) -> str:
return f"{self.config.id_prefix}:{chunk_index}"

Additional info:
I use v1.2.0.
I have a standard pipeline setup that has these components.

    pipe = Pipeline()
    # skipping the config code
    pipe.add_component(text_splitter, "splitter")
    pipe.add_component(embedder, "chunk_embedder")
    pipe.add_component(schema_builder, "schema")
    pipe.add_component(extractor, "extractor")
    pipe.add_component(writer, "writer")
    pipe.add_component(resolver, "resolver")

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions