
[BUG] StorageContext breaks when document contains embeddings. #5830

Closed
Olamyy opened this issue Jun 5, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@Olamyy

Olamyy commented Jun 5, 2023

Description

When an embedding is passed while creating a document, storage_context.persist fails with TypeError: Object of type float32 is not JSON serializable.

Reproduction

from llama_index import Document, GPTVectorStoreIndex, StorageContext
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("")  # model name omitted in the original report
text = "This is a content"
embedding = model.encode(text)  # returns a numpy array of float32 values
document = Document(text, embedding=embedding)
storage_context = StorageContext.from_defaults()
doc_index = GPTVectorStoreIndex.from_documents([document], storage_context=storage_context)
storage_context.persist(persist_dir=".storage/")

Expected Behaviour

Storage context persist should work fine even with embeddings.
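A possible workaround, assuming the embedding comes back as a numpy array of float32 (the default for sentence-transformers): convert it to plain Python floats before constructing the Document, so the JSON serializer never sees numpy scalars. A minimal sketch, reusing model and text from the reproduction above:

# Workaround sketch: cast the numpy float32 embedding to built-in Python
# floats before constructing the Document.
embedding = model.encode(text)              # numpy.ndarray of float32
embedding = [float(x) for x in embedding]   # or embedding.tolist()
document = Document(text, embedding=embedding)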

@logan-markewich logan-markewich added the bug Something isn't working label Jun 5, 2023
@Samshive
Contributor

I get a similar error when trying to persist a graph: TypeError: Object of type PosixPath is not JSON serializable
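The same class of workaround may apply here, assuming the PosixPath ends up in node or graph metadata (the field name below is hypothetical): stringify Path values before the graph is persisted.

# Hypothetical sketch: convert pathlib.Path values in metadata to strings
# so json.dumps can serialize them.
from pathlib import Path

metadata = {"file_path": Path("docs/report.pdf")}
metadata = {key: str(val) if isinstance(val, Path) else val for key, val in metadata.items()}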

@dosubot

dosubot bot commented Oct 24, 2023

Hi, @Olamyy! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you reported is related to the StorageContext.persist function failing with a TypeError when a document contains embeddings. This problem occurs when using the llama_index library and can be reproduced by passing an embedding while creating a document. It seems that another user named Samshive also encountered a similar error when trying to persist a graph. Additionally, msharara1998 and ChengYen-Tang have given a thumbs up reaction to your comment.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to LlamaIndex, and we appreciate your understanding as we work to manage our backlog effectively. If you have any further questions or concerns, please don't hesitate to reach out.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 24, 2023
@Neco777

Neco777 commented Oct 30, 2023

I have a similar issue, but with InstructorEmbedding.

Reproduction:

# this illustrates that the combination of VectorStoreIndex with InstructorEmbedding does not work

from llama_index.embeddings import InstructorEmbedding
#from embedding_models.instructor_embedding_model import InstructorEmbedding

CACHE_FOLDER = "./instructor-cache"

instructor_embeddings = InstructorEmbedding(embed_batch_size=1, cache_folder=CACHE_FOLDER)

import openai
openai.log = "debug"

from llama_index import ServiceContext, set_global_service_context
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm, embed_model=instructor_embeddings, chunk_size=512)
set_global_service_context(service_context)


# ===================


from llama_index import VectorStoreIndex
from llama_index.schema import Document, TextNode, NodeRelationship, RelatedNodeInfo


node1 = TextNode(text="Hello")
node2 = TextNode(text="Hello")

node2.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=node1.id_,)

index = VectorStoreIndex([])
index.insert_nodes([node1, node2])

index.storage_context.persist(persist_dir="./cached_data/test")

It looks like the root cause is that INSTRUCTOR produces results as numpy.float32, so the fix I found is to change a few lines in the InstructorEmbedding class:

...
    def _embed(self, instruct_sentence_pairs: List[List[str]]) -> List[List[float]]:
        """Embed sentences."""
        return self._convert_to_float(self._model.encode(instruct_sentence_pairs))
    
    def _convert_to_float(self, nplist) -> List[List[float]]:
        return [[npf.item() for npf in arr] for arr in nplist]
...

I assume there should be a better place to do this that will work for all the embedding models.
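One more general option (just a sketch, not what llama_index currently does): make the JSON serialization step itself coerce numpy scalars and arrays, so every embedding model benefits regardless of the dtype it returns.

import json
import numpy as np

def np_safe(obj):
    """Hypothetical json.dumps `default` hook: coerce numpy types to built-ins."""
    if isinstance(obj, np.floating):
        return float(obj)
    if isinstance(obj, np.integer):
        return int(obj)
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

# Succeeds where plain json.dumps would raise the error from this issue.
json.dumps({"embedding": [np.float32(0.5), np.float32(0.25)]}, default=np_safe)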

I'm using llama_index version 0.8.55 (the main page says it was released 16 hours ago).

@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 30, 2023
@dosubot

dosubot bot commented Oct 30, 2023

@logan-markewich Could you please help @Neco777 with this issue? They are experiencing a similar problem with the InstructorEmbedding class in the llama_index library. They have provided reproduction steps and a potential fix. Thank you!


dosubot bot commented Jan 30, 2024

Hi, @Olamyy,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The reported issue is related to the storage_context.persist function failing with a TypeError when a document containing embeddings is passed. Another user, Samshive, encountered a similar error when trying to persist a graph. Additionally, Neco777 reported a similar issue with the InstructorEmbedding class and provided reproduction steps and a potential fix, prompting a request for assistance from logan-markewich.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation. If you have any further questions or need assistance, feel free to reach out.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jan 30, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 6, 2024