fallback to {} for None metadata from Chroma #1714

Merged
hwchase17 merged 1 commit into langchain-ai:master on Mar 16, 2023

jeffchuber
Contributor

The basic vector store example started breaking because Document requires metadata to be non-None, but Chroma stores metadata as None when none is provided. This PR adds a fallback to {}, which fixes the basic tutorial: https://langchain.readthedocs.io/en/latest/modules/indexes/examples/vectorstores.html

Here is the error that was generated:

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.
Traceback (most recent call last):
  File "/Users/jeff/src/temp/langchainchroma/test.py", line 17, in <module>
    docs = docsearch.similarity_search(query)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 133, in similarity_search
    docs_and_scores = self.similarity_search_with_score(query, k)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 182, in similarity_search_with_score
    return _results_to_docs_and_scores(results)
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 24, in _results_to_docs_and_scores
    return [
  File "/Users/jeff/src/langchain/langchain/vectorstores/chroma.py", line 27, in <listcomp>
    (Document(page_content=result[0], metadata=result[1]), result[2])
  File "pydantic/main.py", line 331, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for Document
metadata
  none is not an allowed value (type=type_error.none.not_allowed)
Exiting: Cleaning up .chroma directory
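
For illustration, the fallback described above can be sketched as follows. This is a minimal sketch, not the merged diff: the `Document` class here is a hypothetical stand-in for langchain's pydantic model, and the shape of `results` (one inner list per query under the `documents`, `metadatas` and `distances` keys) is an assumption about what Chroma returns.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class Document:
    # Hypothetical stand-in for langchain's Document, which rejects metadata=None.
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.metadata is None:
            raise ValueError("metadata: none is not an allowed value")


def results_to_docs_and_scores(results: Dict[str, Any]) -> List[Tuple[Document, float]]:
    # Assumed result shape: one inner list per query, so index 0 is the first query.
    rows = zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    )
    # `meta or {}` is the fallback: None metadata becomes an empty dict.
    return [
        (Document(page_content=text, metadata=meta or {}), score)
        for text, meta, score in rows
    ]


# A result row whose metadata was stored as None no longer raises.
results = {"documents": [["hello"]], "metadatas": [[None]], "distances": [[0.1]]}
docs_and_scores = results_to_docs_and_scores(results)
```

Note that this only covers the read path, where the whole metadata object is None. It does not touch individual None values inside a metadata dict.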

@hwchase17 hwchase17 merged commit f93c011 into langchain-ai:master Mar 16, 2023
@jeffchuber jeffchuber deleted the fixNoneMetadataHandling branch March 16, 2023 19:11
bdonkey added a commit to bdonkey/langchain that referenced this pull request Mar 23, 2023
* master: (68 commits)
  hotfix (langchain-ai#1742)
  Harrison/move docs (langchain-ai#1741)
  move docs (langchain-ai#1740)
  bump version to 114 (langchain-ai#1739)
  Harrison/latex splitter (langchain-ai#1738)
  Harrison/blackboard loader (langchain-ai#1737)
  docs: add docs link to agent toolkits (langchain-ai#1735)
  fix: agent json parser fails with text in suffix (langchain-ai#1734)
  Harrison/official method (langchain-ai#1728)
  Sagemaker Endpoint LLM (langchain-ai#1686)
  adding new agent types in comments (langchain-ai#1711)
  (OpenAI) Add model_name to LLMResult.llm_output (langchain-ai#1713)
  Fix all the bug in init Tool in docs (langchain-ai#1725)
  Bump duckdb-engine to 0.7.0 (langchain-ai#1726)
  Add HTML document_loader that includes page title metadata (langchain-ai#1720)
  fix async in agent (langchain-ai#1723)
  pydantic/json parsing (langchain-ai#1722)
  Loosen PyYAML dependency (langchain-ai#1698)
  Adding ability to `return_pl_id` to all PromptLayer Models in LangChain (langchain-ai#1699)
  fallback to {} for None metadata from Chroma (langchain-ai#1714)
  ...
@apremjee8

Hey there - I still seem to be getting this error when I use Chroma with a pandas DataFrame and the DataFrame loader.

My exact error happens when I do:

docsearch = Chroma.from_documents(texts, embeddings)

I get this error: Expected metadata value to be a str, int, or float, got None.

If I use a dataframe with just one column it works, but then I don't have any metadata. If I have other columns and specify one as the content column and the others as metadata, my documents show the metadata, but running the above command still gives me the error. Any ideas?

@sunlin-xiaonai

@apremjee8 I have the same problem. Have you solved it yet?

@sunlin-xiaonai

same

I have solved it. I read the source code: none of your metadata values may be None. If any value in a document's metadata is None, you need to clean up the document first.

@pccross

pccross commented Jul 10, 2023

> same
>
> I have solved it. I read the source code: none of your metadata values may be None. If any value in a document's metadata is None, you need to clean up the document first.

Can you clarify how you fixed this? Did you change metadata manually? I've been trying to figure out how to fix for last couple of days, and just haven't had any luck.
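
One way to do the cleanup described in the quoted comment is to drop None-valued keys before the documents reach Chroma. This is a minimal sketch on plain dicts; `drop_none_values` is a hypothetical helper name, not part of langchain or chromadb, and the sample keys are made up.

```python
from typing import Any, Dict


def drop_none_values(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Return a copy of the metadata without any key whose value is None.
    return {key: value for key, value in metadata.items() if value is not None}


cleaned = drop_none_values({"source": "row-1", "author": None, "year": 2023})
```

Applied to the thread's example, this would mean running something like `doc.metadata = drop_none_values(doc.metadata)` over each entry in `texts` before calling `Chroma.from_documents(texts, embeddings)`, assuming the Document objects allow attribute assignment. Dropping the key loses the fact that the field existed; replacing None with a default string is the alternative if that matters.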

@Merdaneth

I'm still getting this error as well when I use the example from the LangChain documentation on a website that doesn't provide a language attribute:

    from langchain.document_loaders import WebBaseLoader
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    loader = WebBaseLoader("https://www.dim-sum.nl")
    data = loader.load()  # load the page into a list of Document objects

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
    all_splits = text_splitter.split_documents(data)

The first element of the resulting list is:

"Document(page_content='Dimsum Reizen, Bijzondere Reizen Azië, Midden-Oosten, Balkan\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n Home \r\n \n\n\nReisinfo\n\n\n\n\nReisvoorwaarden\n\n\nVerzekeringen\n\n\nCorona Virus\n\n\nSelfdrives\n\n\nFamiliereizen\n\n\nTreinreizen\n\n\nOverland Reizen\n\n\nHuwelijksreizen\n\n\nReisblogs\n\n\nMeet a Local\n\n\n\n\n\nOver ons\n\n\n\n\nWie zijn wij?\n\n\nSpeciale groepen\n\n\nBeurzen en zo\n\n\nTrees for All\n\n\nVacatures reisbranche\n\n\nDuurzaam Toerisme', metadata={'source': 'https://www.dim-sum.nl/', 'title': 'Dimsum Reizen, Bijzondere Reizen Azië, Midden-Oosten, Balkan', 'description': 'Dimsum Reizen organiseert individuele reizen op maat naar Azië, het Midden-Oosten en Europa', 'language': None})"

And the None value in the language key that is returned after splitting generates the same error:

  File "\Lib\site-packages\chromadb\api\types.py", line 138, in validate_metadata
    raise ValueError(
ValueError: Expected metadata value to be a str, int, or float, got None which is a <class 'NoneType'>
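
To see which documents and keys would trip this check before handing anything to Chroma, a pre-check along these lines can help. It is a minimal sketch: `find_invalid_metadata` is a hypothetical helper, and the allowed types are taken from the error message above rather than from chromadb's source, so they may differ between versions.

```python
from typing import Any, Dict, List, Tuple

# Allowed value types, as listed in the error message (an assumption, may vary by version).
ALLOWED_TYPES = (str, int, float)


def find_invalid_metadata(metadatas: List[Dict[str, Any]]) -> List[Tuple[int, str, Any]]:
    # Collect (document index, key, value) for every value of a disallowed type.
    problems = []
    for index, metadata in enumerate(metadatas):
        for key, value in metadata.items():
            if not isinstance(value, ALLOWED_TYPES):
                problems.append((index, key, value))
    return problems


problems = find_invalid_metadata([
    {"source": "https://www.dim-sum.nl/", "language": None},
    {"source": "https://www.dim-sum.nl/", "language": "nl"},
])
```

With the splits from the example, the input would be `[doc.metadata for doc in all_splits]`.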

@jeffchuber
Contributor Author

@Merdaneth None is not valid JSON - can you sanitize it?

@Merdaneth

@jeffchuber sure I can. But I shouldn't get back a data structure that produces invalid JSON in the metadata from the document/content loader functionality of langchain in the first place.

For this use case I solved it like this:

    from langchain.text_splitter import RecursiveCharacterTextSplitter    
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 0)
    all_splits = text_splitter.split_documents(data)
    #st.write(all_splits)        

    # fix because ChromaDB doesn't accept None as a value in the metadata dict
    for item in all_splits:
        if item.metadata.get('language') is None:
            item.metadata['language'] = 'nl'
