fallback to {} for None metadata from Chroma #1714

jeffchuber · 2023-03-16T17:40:13Z

The basic vector store example started breaking because `Document` requires `metadata` to be non-None, but Chroma stores metadata as None if none is provided. This PR adds a fallback to `{}`, which fixes the basic tutorial: https://langchain.readthedocs.io/en/latest/modules/indexes/examples/vectorstores.html

Here is the error that was generated:

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
Traceback (most recent call last):
  File "/Users/jeff/src/temp/langchainchroma/test.py", line 17, in <module>
    docs = docsearch.similarity_search(query)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 133, in similarity_search
    docs_and_scores = self.similarity_search_with_score(query, k)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 182, in similarity_search_with_score
    return _results_to_docs_and_scores(results)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 24, in _results_to_docs_and_scores
    return [
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 27, in <listcomp>
    (Document(page_content=result[0], metadata=result[1]), result[2])
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Document
metadata
  none is not an allowed value (type=type_error.none.not_allowed)
Exiting: Cleaning up .chroma directory
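
For readers who want to see the shape of the fallback, here is a minimal sketch. It does not import langchain or chromadb: the `Document` class below is a hypothetical stand-in that only mimics the "metadata must not be None" check from the traceback, and the result layout (`documents`, `metadatas`, `distances`, each a list of lists) is an assumption about what a Chroma query returns. The only part taken from this PR is the idea of falling back to `{}`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical stand-in for langchain's Document (not the real class)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Mimic the validation failure from the traceback above.
        if self.metadata is None:
            raise ValueError("metadata: none is not an allowed value")


def results_to_docs_and_scores(
    results: Dict[str, List[list]],
) -> List[Tuple[Document, float]]:
    """Convert an assumed Chroma-style query result into (Document, score) pairs."""
    return [
        # `meta or {}` replaces a None metadata entry with an empty dict.
        (Document(page_content=text, metadata=meta or {}), score)
        for text, meta, score in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


# Assumed result layout: one inner list per query.
results = {
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[None, {"source": "a.txt"}]],
    "distances": [[0.12, 0.34]],
}
pairs = results_to_docs_and_scores(results)
print(pairs[0][0].metadata)  # {}
```

Without the `or {}`, the first entry would raise the same kind of validation error shown in the traceback.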


apremjee8 · 2023-04-23T09:09:55Z

Hey there - I still seem to be getting this error when I'm using Chroma with a pandas DataFrame and the DataFrame loader.

My exact error happens when I do:

docsearch = Chroma.from_documents(texts, embeddings)

I get this error: Expected metadata value to be a str, int, or float, got None.

If I use a DataFrame with just one column it works, but then I don't have any metadata. If I have other columns and specify one as the content column and the others as metadata, my documents show the metadata, but running the above command still gives me the error. Any ideas?

sunlin-xiaonai · 2023-05-18T09:48:35Z

@apremjee8 I have the same problem as you. Have you solved it yet?

sunlin-xiaonai · 2023-05-19T05:59:19Z

Same.

I have solved it. I read the source code: every value in your metadata must be non-None. If any None values exist, you can handle them in the documents before passing them on.

pccross · 2023-07-10T06:38:10Z

> Same.
>
> I have solved it. I read the source code: every value in your metadata must be non-None. If any None values exist, you can handle them in the documents before passing them on.

Can you clarify how you fixed this? Did you change the metadata manually? I've been trying to figure out how to fix it for the last couple of days, and just haven't had any luck.

Merdaneth · 2023-07-15T13:48:37Z

I'm still getting this error as well when I use the example given in the langchain documentation on a website that doesn't provide a language attribute:

    from langchain.document_loaders import WebBaseLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Load the page; `data` is the list of Documents that gets split below.
    loader = WebBaseLoader("https://www.dim-sum.nl")
    data = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    all_splits = text_splitter.split_documents(data)

The first element of the resulting list is:

"Document(page_content='Dimsum Reizen, Bijzondere Reizen Azië, Midden-Oosten, Balkan\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n Home \r\n \n\n\nReisinfo\n\n\n\n\nReisvoorwaarden\n\n\nVerzekeringen\n\n\nCorona Virus\n\n\nSelfdrives\n\n\nFamiliereizen\n\n\nTreinreizen\n\n\nOverland Reizen\n\n\nHuwelijksreizen\n\n\nReisblogs\n\n\nMeet a Local\n\n\n\n\n\nOver ons\n\n\n\n\nWie zijn wij?\n\n\nSpeciale groepen\n\n\nBeurzen en zo\n\n\nTrees for All\n\n\nVacatures reisbranche\n\n\nDuurzaam Toerisme', metadata={'source': 'https://www.dim-sum.nl/', 'title': 'Dimsum Reizen, Bijzondere Reizen Azië, Midden-Oosten, Balkan', 'description': 'Dimsum Reizen organiseert individuele reizen op maat naar Azië, het Midden-Oosten en Europa', 'language': None})"

The None value under the language key, which is returned after splitting, generates the same error:

  File "\Lib\site-packages\chromadb\api\types.py", line 138, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, or float, got None which is a <class 'NoneType'>
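
To find out which keys trip this check before calling Chroma, a small diagnostic can help. This is only a sketch based on the wording of the error message (str, int, or float are accepted); it is not Chroma's own `validate_metadata`, and the function name `find_invalid_metadata` is made up for illustration.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Types named in the error message above (an assumption about the full rule).
ALLOWED_TYPES = (str, int, float)


def find_invalid_metadata(
    metadatas: Iterable[Dict[str, Any]],
) -> List[Tuple[int, str, Any]]:
    """Return (index, key, value) for every value that is not str, int, or float."""
    problems = []
    for index, metadata in enumerate(metadatas):
        for key, value in metadata.items():
            if not isinstance(value, ALLOWED_TYPES):
                problems.append((index, key, value))
    return problems


# Metadata shaped like the Document shown above, plus one clean entry.
metadatas = [
    {"source": "https://www.dim-sum.nl/", "language": None},
    {"source": "https://www.dim-sum.nl/", "page": 3},
]
problems = find_invalid_metadata(metadatas)
print(problems)  # [(0, 'language', None)]
```

With langchain documents, the input would be something like `[doc.metadata for doc in all_splits]`.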

jeffchuber · 2023-07-16T05:17:32Z

@Merdaneth None is not valid JSON - can you sanitize it?

Merdaneth · 2023-07-16T08:18:27Z

@jeffchuber sure I can. But I shouldn't get back a data structure that produces invalid JSON in the metadata from the document/content loader functionality of langchain in the first place.

For this use case I solved it like this:

    from langchain.text_splitter import RecursiveCharacterTextSplitter    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
    all_splits = text_splitter.split_documents(data)
    #st.write(all_splits)        

    # fix because ChromaDB doesn't accept None as a value in the metadata array
    for item in all_splits:
        if item.metadata.get('language') is None:
            item.metadata['language'] = 'nl'
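
A more general variant of the same workaround is to drop every None-valued key rather than hard-coding a language. The sketch below works on plain dicts so it runs without langchain; the helper name `drop_none_values` is hypothetical, and whether dropping a key or substituting a default is the right choice depends on how the metadata is used later.

```python
from typing import Any, Dict


def drop_none_values(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the metadata without keys whose value is None."""
    return {key: value for key, value in metadata.items() if value is not None}


metadata = {
    "source": "https://www.dim-sum.nl/",
    "title": "Dimsum Reizen, Bijzondere Reizen Azië, Midden-Oosten, Balkan",
    "language": None,
}
cleaned = drop_none_values(metadata)
print(sorted(cleaned))  # ['source', 'title']
```

Applied to the split documents, this would look like `item.metadata = drop_none_values(item.metadata)` inside the loop above, assuming the document objects allow their `metadata` attribute to be reassigned.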

fallback to {} for None metadata from Chroma

09e53a1

jeffchuber mentioned this pull request Mar 16, 2023

ChromaDB validation error for Document metadata. #1287

Closed

hwchase17 approved these changes Mar 16, 2023

hwchase17 merged commit f93c011 into langchain-ai:master Mar 16, 2023

jeffchuber deleted the fixNoneMetadataHandling branch March 16, 2023 19:11

jeffchuber mentioned this pull request Mar 29, 2023

Validation error- Metadata should not be empty or None #1825

Closed

jeffchuber mentioned this pull request May 22, 2023

Langchain love chroma-core/chroma#560

Closed

Satyam-79 mentioned this pull request May 29, 2023

Chroma integration improvement #5415

Closed

