langchain: adds recursive json splitter #17144

joelsprunger · 2024-02-07T00:48:32Z

Description: This adds a recursive JSON splitter class to the existing text_splitters, along with unit tests.
Issue: Splitting text from structured data can cause problems. If you have a large nested JSON object and split it as regular text, you may lose the structure of the JSON. To mitigate this you can split the nested JSON into large chunks and overlap them, but this causes unnecessary text processing, and there will still be times when the nested JSON is so big that chunks get separated from their parent keys.

As an example you wouldn't want the following to be split in half:

{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf',
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}

Any LLM processing the second chunk of text may not have the context of val1 and val16, which reduces accuracy. Embeddings will also lack this context, which makes retrieval less accurate.
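To make the problem concrete, here is a minimal sketch using only the standard library. The reduced `nested` object, the cut position, and the helper `is_valid_json` are illustrative assumptions, not part of this PR; the cut simply imitates a character-based split landing inside val16.

```python
import json

# A reduced version of the example object above (illustrative data only).
nested = {
    "val0": "DFWeNdWhapbR",
    "val1": {
        "val10": "QdJo",
        "val16": {
            "val160": "qLuLKusFw",
            "val165": "bXzhcrWLmBFp",
            "val169": "gSpMbJwA",
        },
        "val19": "glqNSvoyxdg",
    },
}

text = json.dumps(nested, indent=1)

# Imitate a plain-text split that happens to land in the middle of val16.
cut = text.index('"val165"')
first_half, second_half = text[:cut], text[cut:]


def is_valid_json(fragment: str) -> bool:
    """Return True if the fragment parses as JSON on its own."""
    try:
        json.loads(fragment)
    except json.JSONDecodeError:
        return False
    return True


# Neither fragment is valid JSON, and the second one has lost its parent keys.
print(is_valid_json(first_half), is_valid_json(second_half))
print('"val1"' in second_half, '"val16"' in second_half)
```

With this cut, neither fragment parses on its own, and the second fragment no longer mentions val1 or val16 at all.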

Instead, you want it split into chunks that retain the JSON structure:

{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf'}}}

and

{'val1':{'val16':{
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}

This recursive JSON text splitter does this. Values that contain a list can be converted to a dict first by using split(... convert_lists=True); otherwise long lists will not be split, and you may end up with chunks larger than the max chunk size.
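The idea can be sketched roughly as follows. This is not the code from this PR: the names `lists_to_dicts`, `set_nested`, `json_size`, and `split_nested` are hypothetical, chunk size is assumed to be the length of the `json.dumps` output, and the size check is approximate because it does not count the parent keys repeated in a new chunk.

```python
import json
from typing import Any, Dict, List


def lists_to_dicts(data: Any) -> Any:
    """Recursively replace lists with dicts keyed by the element index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    return data


def set_nested(target: Dict[str, Any], path: List[str], value: Any) -> None:
    """Write value at the nested key path, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def json_size(data: Any) -> int:
    """Size of the data, measured as the length of its JSON serialization."""
    return len(json.dumps(data))


def split_nested(data: Dict[str, Any], max_chunk_size: int) -> List[Dict[str, Any]]:
    """Split a nested dict into chunks that each repeat their parent keys."""
    chunks: List[Dict[str, Any]] = [{}]

    def walk(node: Dict[str, Any], path: List[str]) -> None:
        for key, value in node.items():
            new_path = path + [key]
            if isinstance(value, dict) and value:
                # Non-empty dicts are descended into, never stored whole.
                walk(value, new_path)
                continue
            # Start a new chunk when adding this leaf would exceed the limit.
            candidate_size = json_size(chunks[-1]) + json_size({key: value})
            if chunks[-1] and candidate_size > max_chunk_size:
                chunks.append({})
            # The full key path is written, so parent keys travel with the leaf.
            set_nested(chunks[-1], new_path, value)

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


example = {"val1": {"val16": {"val160": "qLuLKusFw", "val165": "bXzhcrWLmBFp"}}}
for chunk in split_nested(lists_to_dicts(example), max_chunk_size=50):
    print(chunk)
```

Because every leaf is written under its full key path, each chunk is a valid dict that still names its parents. Lists are treated as leaves here unless they are first passed through `lists_to_dicts`, which mirrors the caveat about long lists above.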

In my testing, large JSON objects could be split into small chunks, with these results:
✅ Increased question-answering accuracy
✅ Smaller chunks, so retrieval queries can use fewer tokens

Dependencies: json import added to text_splitter.py, and random added to the unit test
Twitter handle: @joelsprunger

joelsprunger · 2024-02-07T05:12:46Z

I've added an ipynb notebook and I'm trying to get the documentation built locally.

chore: documentation ipynb

joelsprunger · 2024-02-07T06:35:56Z

I think we are good here. I was able to run the unit tests, linting, and see the documentation.

Let me know if I need to do anything else. :-)

hwchase17

really really like this - one large-ish comment/refactor to make more consistent with other classes

hwchase17 · 2024-02-07T16:11:05Z

libs/langchain/langchain/text_splitter.py

+            if min_chunk_size is not None
+            else max(max_chunk_size - 200, 50)
+        )
+        self._chunks = JsonChunks()


i dont think this class should be stateful - pretty inconsistent with other classes

I get that. I initially coded it up for my own use, and since the chunks were passed to three of the functions in the class, I thought it made more sense to make them a class member. I can change that.

I'll try to have it inherit from TextSplitter, and change split to split_text so that create_documents() can work.

The issue I am seeing is with create_documents().

The superclass expects List[str].
I wanted to keep the loaded JSON as a Dict rather than expecting str, because the recursion for preprocessing and splitting operates on Dict objects, and anyone with JSON in Python probably has it as a Dict to start with. I think I will stop inheriting from TextSplitter and instead just shadow its functionality with split_text and create_documents. I'll also add split_json, which returns List[dict], so that people can get the output as JSON if they want. Then split_text and create_documents can be wrappers around split_json.
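A rough sketch of that layering, under stated assumptions: `JsonSplitterSketch` and the stand-in `Document` dataclass are hypothetical names, and the body of `split_json` is a placeholder (one chunk per top-level key) rather than the recursive algorithm. The point is only that the class holds no chunk state between calls and that the other two methods wrap `split_json`.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for a document type: text plus metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class JsonSplitterSketch:
    """Stateless sketch: nothing is stored on the instance between calls."""

    def split_json(self, json_data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Return chunks as dicts (placeholder: one chunk per top-level key)."""
        return [{key: value} for key, value in json_data.items()]

    def split_text(self, json_data: Dict[str, Any]) -> List[str]:
        """Wrapper: serialize each dict chunk to a JSON string."""
        return [json.dumps(chunk) for chunk in self.split_json(json_data)]

    def create_documents(
        self,
        texts: List[Dict[str, Any]],
        metadatas: Optional[List[Dict[str, Any]]] = None,
    ) -> List[Document]:
        """Wrapper: turn every chunk of every input dict into a Document."""
        metadatas = metadatas or [{} for _ in texts]
        documents: List[Document] = []
        for json_data, metadata in zip(texts, metadatas):
            for chunk in self.split_text(json_data):
                # Copy the metadata so documents do not share one dict.
                documents.append(Document(page_content=chunk, metadata=dict(metadata)))
        return documents


splitter = JsonSplitterSketch()
for doc in splitter.create_documents([{"a": 1, "b": {"c": 2}}], [{"source": "demo"}]):
    print(doc.page_content, doc.metadata)
```

Keeping `split_json` as the single source of chunks means callers who want dicts and callers who want strings or documents all go through the same splitting logic.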

@hwchase17 I refactored this and updated the docs and unit tests.

…uDoThis

hwchase17 · 2024-02-08T21:45:29Z

thanks @joelsprunger ! will highlight this when it's out in the next release. really like it!

joelsprunger · 2024-02-09T00:53:56Z

@hwchase17 great! If you want me to demo some of the benefits with example json using langsmith I can make a screen-cast or something.

funkymonkeymonk · 2024-02-09T01:07:33Z

So I just found this in the documentation and it took me a bit of spelunking to find that this was not yet released because it's already in the documentation. Thanks for making this just in time for me to need it :-)

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [@types/node](https://togithub.com/DefinitelyTyped/DefinitelyTyped/tree/master/types/node) ([source](https://togithub.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node)) | [`20.11.16` -> `20.11.17`](https://renovatebot.com/diffs/npm/@types%2fnode/20.11.16/20.11.17) | [![age](https://developer.mend.io/api/mc/badges/age/npm/@types%2fnode/20.11.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/@types%2fnode/20.11.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/@types%2fnode/20.11.16/20.11.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/@types%2fnode/20.11.16/20.11.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [ai](https://sdk.vercel.ai/docs) ([source](https://togithub.com/vercel/ai)) | [`2.2.33` -> `2.2.35`](https://renovatebot.com/diffs/npm/ai/2.2.33/2.2.35) | [![age](https://developer.mend.io/api/mc/badges/age/npm/ai/2.2.35?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/ai/2.2.35?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/ai/2.2.33/2.2.35?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/ai/2.2.33/2.2.35?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [langchain](https://togithub.com/langchain-ai/langchain) | `0.1.5` -> `0.1.6` | 
[![age](https://developer.mend.io/api/mc/badges/age/pypi/langchain/0.1.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/langchain/0.1.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/langchain/0.1.5/0.1.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/langchain/0.1.5/0.1.6?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [langchain](https://togithub.com/langchain-ai/langchainjs/tree/main/langchain/) ([source](https://togithub.com/langchain-ai/langchainjs)) | [`0.1.16` -> `0.1.17`](https://renovatebot.com/diffs/npm/langchain/0.1.16/0.1.17) | [![age](https://developer.mend.io/api/mc/badges/age/npm/langchain/0.1.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/langchain/0.1.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/langchain/0.1.16/0.1.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/langchain/0.1.16/0.1.17?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [novel](https://novel.sh) ([source](https://togithub.com/steven-tey/novel)) | [`^0.1.19` -> `^0.2.0`](https://renovatebot.com/diffs/npm/novel/0.1.22/0.2.0) | [![age](https://developer.mend.io/api/mc/badges/age/npm/novel/0.2.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/novel/0.2.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/novel/0.1.22/0.2.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | 
[![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/novel/0.1.22/0.2.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [openai](https://togithub.com/openai/openai-python) | `1.11.1` -> `1.12.0` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/openai/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/openai/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/openai/1.11.1/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/openai/1.11.1/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [openai](https://togithub.com/openai/openai-python) | `1.9.0` -> `1.12.0` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/openai/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/openai/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/openai/1.9.0/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/openai/1.9.0/1.12.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [openai](https://togithub.com/openai/openai-node) | [`4.26.1` -> `4.27.1`](https://renovatebot.com/diffs/npm/openai/4.26.1/4.27.1) | [![age](https://developer.mend.io/api/mc/badges/age/npm/openai/4.27.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/openai/4.27.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | 
[![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/openai/4.26.1/4.27.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/openai/4.26.1/4.27.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [pydantic](https://togithub.com/pydantic/pydantic) ([changelog](https://docs.pydantic.dev/latest/changelog/)) | `2.5.3` -> `2.6.1` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/pydantic/2.6.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/pydantic/2.6.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/pydantic/2.5.3/2.6.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/pydantic/2.5.3/2.6.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [python-dotenv](https://togithub.com/theskumar/python-dotenv) | `1.0.0` -> `1.0.1` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/python-dotenv/1.0.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/python-dotenv/1.0.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/python-dotenv/1.0.0/1.0.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/python-dotenv/1.0.0/1.0.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [tsx](https://togithub.com/privatenumber/tsx) | [`4.7.0` -> `4.7.1`](https://renovatebot.com/diffs/npm/tsx/4.7.0/4.7.1) | [![age](https://developer.mend.io/api/mc/badges/age/npm/tsx/4.7.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | 
[![adoption](https://developer.mend.io/api/mc/badges/adoption/npm/tsx/4.7.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/npm/tsx/4.7.0/4.7.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/npm/tsx/4.7.0/4.7.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>vercel/ai (ai)</summary> ### [`v2.2.35`](https://togithub.com/vercel/ai/releases/tag/ai%402.2.35) [Compare Source](https://togithub.com/vercel/ai/compare/ai@2.2.34...ai@2.2.35) ##### Patch Changes - [`b717dad`](https://togithub.com/vercel/ai/commit/b717dad): Adding Inkeep as a stream provider ### [`v2.2.34`](https://togithub.com/vercel/ai/releases/tag/ai%402.2.34) [Compare Source](https://togithub.com/vercel/ai/compare/ai@2.2.33...ai@2.2.34) ##### Patch Changes - [`2c8ffdb`](https://togithub.com/vercel/ai/commit/2c8ffdb): cohere-stream: support AsyncIterable - [`ed1e278`](https://togithub.com/vercel/ai/commit/ed1e278): Message annotations handling for all Message types </details> <details> <summary>langchain-ai/langchain (langchain)</summary> ### [`v0.1.6`](https://togithub.com/langchain-ai/langchain/releases/tag/v0.1.6) [Compare Source](https://togithub.com/langchain-ai/langchain/compare/v0.1.5...v0.1.6) ##### What's Changed - experimental\[patch]: Release 0.0.50 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16883](https://togithub.com/langchain-ai/langchain/pull/16883) - infra: bump exp min test reqs by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16884](https://togithub.com/langchain-ai/langchain/pull/16884) - docs: fix docstring examples by [@baskaryan](https://togithub.com/baskaryan) in 
[https://github.com/langchain-ai/langchain/pull/16889](https://togithub.com/langchain-ai/langchain/pull/16889) - langchain\[patch]: Add async methods to MultiVectorRetriever by [@cbornet](https://togithub.com/cbornet) in [https://github.com/langchain-ai/langchain/pull/16878](https://togithub.com/langchain-ai/langchain/pull/16878) - docs: Indicated Guardrails for Amazon Bedrock preview status by [@harelix](https://togithub.com/harelix) in [https://github.com/langchain-ai/langchain/pull/16769](https://togithub.com/langchain-ai/langchain/pull/16769) - Factorize AstraDB components constructors by [@cbornet](https://togithub.com/cbornet) in [https://github.com/langchain-ai/langchain/pull/16779](https://togithub.com/langchain-ai/langchain/pull/16779) - support LIKE comparator (full text match) in Qdrant by [@xieqihui](https://togithub.com/xieqihui) in [https://github.com/langchain-ai/langchain/pull/12769](https://togithub.com/langchain-ai/langchain/pull/12769) - infra: ci naming by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16890](https://togithub.com/langchain-ai/langchain/pull/16890) - Docs: Fixed grammatical mistake by [@ShorthillsAI](https://togithub.com/ShorthillsAI) in [https://github.com/langchain-ai/langchain/pull/16858](https://togithub.com/langchain-ai/langchain/pull/16858) - Minor update to Nomic cookbook by [@rlancemartin](https://togithub.com/rlancemartin) in [https://github.com/langchain-ai/langchain/pull/16886](https://togithub.com/langchain-ai/langchain/pull/16886) - infra: ci naming 2 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16893](https://togithub.com/langchain-ai/langchain/pull/16893) - refactor `langchain.prompts.example_selector` by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/15369](https://togithub.com/langchain-ai/langchain/pull/15369) - doc: fix typo in message_history.ipynb by 
[@akirawuc](https://togithub.com/akirawuc) in [https://github.com/langchain-ai/langchain/pull/16877](https://togithub.com/langchain-ai/langchain/pull/16877) - community: revert SQL Stores by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16912](https://togithub.com/langchain-ai/langchain/pull/16912) - langchain_openai\[patch]: Invoke callback prior to yielding token by [@eyurtsev](https://togithub.com/eyurtsev) in [https://github.com/langchain-ai/langchain/pull/16909](https://togithub.com/langchain-ai/langchain/pull/16909) - docs: fix broken links by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16855](https://togithub.com/langchain-ai/langchain/pull/16855) - Fix loading of ImagePromptTemplate by [@hinthornw](https://togithub.com/hinthornw) in [https://github.com/langchain-ai/langchain/pull/16868](https://togithub.com/langchain-ai/langchain/pull/16868) - core\[patch]: Hide aliases when serializing by [@hinthornw](https://togithub.com/hinthornw) in [https://github.com/langchain-ai/langchain/pull/16888](https://togithub.com/langchain-ai/langchain/pull/16888) - core\[patch]: Remove deep copying of run prior to submitting it to LangChain Tracing by [@hinthornw](https://togithub.com/hinthornw) in [https://github.com/langchain-ai/langchain/pull/16904](https://togithub.com/langchain-ai/langchain/pull/16904) - core\[minor]: add validation error handler to `BaseTool` by [@hmasdev](https://togithub.com/hmasdev) in [https://github.com/langchain-ai/langchain/pull/14007](https://togithub.com/langchain-ai/langchain/pull/14007) - Updated integration doc for aleph alpha by [@rocky1405](https://togithub.com/rocky1405) in [https://github.com/langchain-ai/langchain/pull/16844](https://togithub.com/langchain-ai/langchain/pull/16844) - core\[patch]: fix chat prompt partial messages placeholder var by [@baskaryan](https://togithub.com/baskaryan) in 
[https://github.com/langchain-ai/langchain/pull/16918](https://togithub.com/langchain-ai/langchain/pull/16918) - core\[patch]: Message content as positional arg by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16921](https://togithub.com/langchain-ai/langchain/pull/16921) - core\[patch]: doc init positional args by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16854](https://togithub.com/langchain-ai/langchain/pull/16854) - community\[docs]: add quantization to vllm and update API by [@mspronesti](https://togithub.com/mspronesti) in [https://github.com/langchain-ai/langchain/pull/16950](https://togithub.com/langchain-ai/langchain/pull/16950) - docs: BigQuery Vector Search went public review and updated docs by [@ashleyxuu](https://togithub.com/ashleyxuu) in [https://github.com/langchain-ai/langchain/pull/16896](https://togithub.com/langchain-ai/langchain/pull/16896) - core\[patch]: Add doc-string to RunnableEach by [@keenborder786](https://togithub.com/keenborder786) in [https://github.com/langchain-ai/langchain/pull/16892](https://togithub.com/langchain-ai/langchain/pull/16892) - core\[patch]: handle some optional cases in tools by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16954](https://togithub.com/langchain-ai/langchain/pull/16954) - docs: partner packages by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16960](https://togithub.com/langchain-ai/langchain/pull/16960) - infra: install integration deps for test linting by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16963](https://togithub.com/langchain-ai/langchain/pull/16963) - Update README.md by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16966](https://togithub.com/langchain-ai/langchain/pull/16966) - 
langchain_mistralai\[patch]: Invoke callback prior to yielding token by [@ccurme](https://togithub.com/ccurme) in [https://github.com/langchain-ai/langchain/pull/16986](https://togithub.com/langchain-ai/langchain/pull/16986) - openai\[patch]: rm tiktoken model warning by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16964](https://togithub.com/langchain-ai/langchain/pull/16964) - google-genai\[patch]: fix new core typing by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16988](https://togithub.com/langchain-ai/langchain/pull/16988) - community\[patch]: Correct the calling to collection_name in qdrant by [@killinsun](https://togithub.com/killinsun) in [https://github.com/langchain-ai/langchain/pull/16920](https://togithub.com/langchain-ai/langchain/pull/16920) - docs: Update ollama examples with new community libraries by [@picsoung](https://togithub.com/picsoung) in [https://github.com/langchain-ai/langchain/pull/17007](https://togithub.com/langchain-ai/langchain/pull/17007) - langchain_core: Fixed bug in dict to message conversion. 
by [@rmkraus](https://togithub.com/rmkraus) in [https://github.com/langchain-ai/langchain/pull/17023](https://togithub.com/langchain-ai/langchain/pull/17023) - Add async methods to BaseChatMessageHistory and BaseMemory by [@cbornet](https://togithub.com/cbornet) in [https://github.com/langchain-ai/langchain/pull/16728](https://togithub.com/langchain-ai/langchain/pull/16728) - Nvidia trt model name for stop_stream() by [@mkhludnev](https://togithub.com/mkhludnev) in [https://github.com/langchain-ai/langchain/pull/16997](https://togithub.com/langchain-ai/langchain/pull/16997) - core\[patch]: Add langsmith to printed sys information by [@eyurtsev](https://togithub.com/eyurtsev) in [https://github.com/langchain-ai/langchain/pull/16899](https://togithub.com/langchain-ai/langchain/pull/16899) - docs: exa contents by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16555](https://togithub.com/langchain-ai/langchain/pull/16555) - add -p to mkdir in lint steps by [@hwchase17](https://togithub.com/hwchase17) in [https://github.com/langchain-ai/langchain/pull/17013](https://togithub.com/langchain-ai/langchain/pull/17013) - template: tool-retrieval-fireworks by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17052](https://togithub.com/langchain-ai/langchain/pull/17052) - pinecone: init pkg by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/16556](https://togithub.com/langchain-ai/langchain/pull/16556) - community\[patch]: fix agent_toolkits mypy by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17050](https://togithub.com/langchain-ai/langchain/pull/17050) - Shield callback methods from cancellation: Fix interrupted runs marked as pending forever by [@nfcampos](https://togithub.com/nfcampos) in [https://github.com/langchain-ai/langchain/pull/17010](https://togithub.com/langchain-ai/langchain/pull/17010) - 
Fix condition on custom root type in runnable history by [@nfcampos](https://togithub.com/nfcampos) in [https://github.com/langchain-ai/langchain/pull/17017](https://togithub.com/langchain-ai/langchain/pull/17017) - partners: \[NVIDIA AI Endpoints] Support User-Agent metadata and minor fixes. by [@VKudlay](https://togithub.com/VKudlay) in [https://github.com/langchain-ai/langchain/pull/16942](https://togithub.com/langchain-ai/langchain/pull/16942) - community\[patch]: callbacks mypy fixes by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17058](https://togithub.com/langchain-ai/langchain/pull/17058) - community\[patch]: chat message history mypy fixes by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17059](https://togithub.com/langchain-ai/langchain/pull/17059) - community\[patch]: chat model mypy fixes by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17061](https://togithub.com/langchain-ai/langchain/pull/17061) - Langchain: `json_chat` don't need stop sequenes by [@calvinweb](https://togithub.com/calvinweb) in [https://github.com/langchain-ai/langchain/pull/16335](https://togithub.com/langchain-ai/langchain/pull/16335) - langchain: add partial parsing support to JsonOutputToolsParser by [@Mercurrent](https://togithub.com/Mercurrent) in [https://github.com/langchain-ai/langchain/pull/17035](https://togithub.com/langchain-ai/langchain/pull/17035) - Community: Allow adding ARNs as model_id to support Amazon Bedrock custom models by [@supreetkt](https://togithub.com/supreetkt) in [https://github.com/langchain-ai/langchain/pull/16800](https://togithub.com/langchain-ai/langchain/pull/16800) - Community: Add Progress bar to HuggingFaceEmbeddings by [@tylertitsworth](https://togithub.com/tylertitsworth) in [https://github.com/langchain-ai/langchain/pull/16758](https://togithub.com/langchain-ai/langchain/pull/16758) - 
Langchain Community: Fix the \_call of HuggingFaceHub by [@keenborder786](https://togithub.com/keenborder786) in [https://github.com/langchain-ai/langchain/pull/16891](https://togithub.com/langchain-ai/langchain/pull/16891) - Community: MLflow callback update by [@serena-ruan](https://togithub.com/serena-ruan) in [https://github.com/langchain-ai/langchain/pull/16687](https://togithub.com/langchain-ai/langchain/pull/16687) - docs: add 2 more tutorials to the list in youtube.mdx by [@strongSoda](https://togithub.com/strongSoda) in [https://github.com/langchain-ai/langchain/pull/16998](https://togithub.com/langchain-ai/langchain/pull/16998) - Docs: Fix Copilot name by [@bmuskalla](https://togithub.com/bmuskalla) in [https://github.com/langchain-ai/langchain/pull/16956](https://togithub.com/langchain-ai/langchain/pull/16956) - docs:Updating documentation for Konko provider by [@shivanimodi16](https://togithub.com/shivanimodi16) in [https://github.com/langchain-ai/langchain/pull/16953](https://togithub.com/langchain-ai/langchain/pull/16953) - fixing a minor grammatical mistake by [@ShorthillsAI](https://togithub.com/ShorthillsAI) in [https://github.com/langchain-ai/langchain/pull/16931](https://togithub.com/langchain-ai/langchain/pull/16931) - docs: Fix typo in quickstart.ipynb by [@n0vad3v](https://togithub.com/n0vad3v) in [https://github.com/langchain-ai/langchain/pull/16859](https://togithub.com/langchain-ai/langchain/pull/16859) - community:Breebs docs retriever by [@Poissecaille](https://togithub.com/Poissecaille) in [https://github.com/langchain-ai/langchain/pull/16578](https://togithub.com/langchain-ai/langchain/pull/16578) - add structured tools by [@hwchase17](https://togithub.com/hwchase17) in [https://github.com/langchain-ai/langchain/pull/15772](https://togithub.com/langchain-ai/langchain/pull/15772) - docs: update parse_partial_json source info by [@Mercurrent](https://togithub.com/Mercurrent) in 
[https://github.com/langchain-ai/langchain/pull/17036](https://togithub.com/langchain-ai/langchain/pull/17036) - infra: fix breebs test lint by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17075](https://togithub.com/langchain-ai/langchain/pull/17075) - docs: add youtube link by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17065](https://togithub.com/langchain-ai/langchain/pull/17065) - Add prompt metadata + tags by [@hinthornw](https://togithub.com/hinthornw) in [https://github.com/langchain-ai/langchain/pull/17054](https://togithub.com/langchain-ai/langchain/pull/17054) - core\[patch]: fix \_sql_record_manager mypy for [#17048](https://togithub.com/langchain-ai/langchain/issues/17048) by [@moorej-oci](https://togithub.com/moorej-oci) in [https://github.com/langchain-ai/langchain/pull/17073](https://togithub.com/langchain-ai/langchain/pull/17073) - langchain_experimental: Fixes issue [#17060](https://togithub.com/langchain-ai/langchain/issues/17060) by [@SalamanderXing](https://togithub.com/SalamanderXing) in [https://github.com/langchain-ai/langchain/pull/17062](https://togithub.com/langchain-ai/langchain/pull/17062) - community: add integration_tests and coverage to MAKEFILE by [@scottnath](https://togithub.com/scottnath) in [https://github.com/langchain-ai/langchain/pull/17053](https://togithub.com/langchain-ai/langchain/pull/17053) - templates: bump by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17074](https://togithub.com/langchain-ai/langchain/pull/17074) - docs\[patch]: Update streaming documentation by [@eyurtsev](https://togithub.com/eyurtsev) in [https://github.com/langchain-ai/langchain/pull/17066](https://togithub.com/langchain-ai/langchain/pull/17066) - core\[patch]: Add astream events config test by [@eyurtsev](https://togithub.com/eyurtsev) in 
[https://github.com/langchain-ai/langchain/pull/17055](https://togithub.com/langchain-ai/langchain/pull/17055) - docs: fix typo in dspy.ipynb by [@eltociear](https://togithub.com/eltociear) in [https://github.com/langchain-ai/langchain/pull/16996](https://togithub.com/langchain-ai/langchain/pull/16996) - fixed import in `experimental` by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17078](https://togithub.com/langchain-ai/langchain/pull/17078) - community: Fix error in `LlamaCpp` community LLM with Configurable Fields, 'grammar' custom type not available by [@fpaupier](https://togithub.com/fpaupier) in [https://github.com/langchain-ai/langchain/pull/16995](https://togithub.com/langchain-ai/langchain/pull/16995) - docs/docs/integrations/chat/mistralai.ipynb: update for version 0.1+ by [@mtmahe](https://togithub.com/mtmahe) in [https://github.com/langchain-ai/langchain/pull/17011](https://togithub.com/langchain-ai/langchain/pull/17011) - docs: update StreamlitCallbackHandler example by [@os1ma](https://togithub.com/os1ma) in [https://github.com/langchain-ai/langchain/pull/16970](https://togithub.com/langchain-ai/langchain/pull/16970) - docs: Link to Brave Website added by [@Janldeboer](https://togithub.com/Janldeboer) in [https://github.com/langchain-ai/langchain/pull/16958](https://togithub.com/langchain-ai/langchain/pull/16958) - community: Added new Utility runnables for NVIDIA Riva. 
by [@rmkraus](https://togithub.com/rmkraus) in [https://github.com/langchain-ai/langchain/pull/15966](https://togithub.com/langchain-ai/langchain/pull/15966) - langchain: `output_parser.py` in conversation_chat is customizable by [@hdnh2006](https://togithub.com/hdnh2006) in [https://github.com/langchain-ai/langchain/pull/16945](https://togithub.com/langchain-ai/langchain/pull/16945) - docs: Fix typo in amadeus.ipynb by [@laoazhang](https://togithub.com/laoazhang) in [https://github.com/langchain-ai/langchain/pull/16916](https://togithub.com/langchain-ai/langchain/pull/16916) - new feature: add github file loader to load any github file content b… by [@shufanhao](https://togithub.com/shufanhao) in [https://github.com/langchain-ai/langchain/pull/15305](https://togithub.com/langchain-ai/langchain/pull/15305) - core\[patch]: Release 0.1.19 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17117](https://togithub.com/langchain-ai/langchain/pull/17117) - Add SelfQueryRetriever support to PGVector by [@Swalloow](https://togithub.com/Swalloow) in [https://github.com/langchain-ai/langchain/pull/16991](https://togithub.com/langchain-ai/langchain/pull/16991) - infra: add pinecone secret by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17120](https://togithub.com/langchain-ai/langchain/pull/17120) - nvidia-trt: propagate InferenceClientException to the caller. 
by [@mkhludnev](https://togithub.com/mkhludnev) in [https://github.com/langchain-ai/langchain/pull/16936](https://togithub.com/langchain-ai/langchain/pull/16936) - infra: add integration deps to partner lint by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17122](https://togithub.com/langchain-ai/langchain/pull/17122) - pinecone\[patch]: integration test new namespace by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17121](https://togithub.com/langchain-ai/langchain/pull/17121) - nvidia-ai-endpoints\[patch]: release 0.0.2 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17125](https://togithub.com/langchain-ai/langchain/pull/17125) - infra: update to cache v4 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17126](https://togithub.com/langchain-ai/langchain/pull/17126) - community\[patch]: Release 0.0.18 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17129](https://togithub.com/langchain-ai/langchain/pull/17129) - API References sorted `Partner libs` menu by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17130](https://togithub.com/langchain-ai/langchain/pull/17130) - docs: fix typo in ollama notebook by [@arnoschutijzer](https://togithub.com/arnoschutijzer) in [https://github.com/langchain-ai/langchain/pull/17127](https://togithub.com/langchain-ai/langchain/pull/17127) - mistralai\[patch]: 16k token batching logic embed by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17136](https://togithub.com/langchain-ai/langchain/pull/17136) - infra: read min versions by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17135](https://togithub.com/langchain-ai/langchain/pull/17135) - mistralai\[patch]: release 0.0.4 by 
[@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17139](https://togithub.com/langchain-ai/langchain/pull/17139) - infra: fix release by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17142](https://togithub.com/langchain-ai/langchain/pull/17142) - docs: format by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17143](https://togithub.com/langchain-ai/langchain/pull/17143) - infra: poetry run min versions by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17146](https://togithub.com/langchain-ai/langchain/pull/17146) - infra: poetry run min versions 2 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17149](https://togithub.com/langchain-ai/langchain/pull/17149) - infra: release min version debugging by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17150](https://togithub.com/langchain-ai/langchain/pull/17150) - infra: release min version debugging 2 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17152](https://togithub.com/langchain-ai/langchain/pull/17152) - docs: tutorials update by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17132](https://togithub.com/langchain-ai/langchain/pull/17132) - docs `integraions/providers` nav fix by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17148](https://togithub.com/langchain-ai/langchain/pull/17148) - docs `Integraions/Components` menu reordered by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17151](https://togithub.com/langchain-ai/langchain/pull/17151) - Add trace_as_chain_group metadata by [@hinthornw](https://togithub.com/hinthornw) in 
[https://github.com/langchain-ai/langchain/pull/17187](https://togithub.com/langchain-ai/langchain/pull/17187) - allow optional newline in the action responses of JSON Agent parser by [@tomasonjo](https://togithub.com/tomasonjo) in [https://github.com/langchain-ai/langchain/pull/17186](https://togithub.com/langchain-ai/langchain/pull/17186) - Feat: support functions call for google-genai by [@chyroc](https://togithub.com/chyroc) in [https://github.com/langchain-ai/langchain/pull/15146](https://togithub.com/langchain-ai/langchain/pull/15146) - Use batched tracing in sdk by [@nfcampos](https://togithub.com/nfcampos) in [https://github.com/langchain-ai/langchain/pull/16305](https://togithub.com/langchain-ai/langchain/pull/16305) - core\[patch]: Release 0.1.20 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17194](https://togithub.com/langchain-ai/langchain/pull/17194) - infra: fix core release by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17195](https://togithub.com/langchain-ai/langchain/pull/17195) - infra: better conditional by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17197](https://togithub.com/langchain-ai/langchain/pull/17197) - Add neo4j semantic layer with ollama template by [@tomasonjo](https://togithub.com/tomasonjo) in [https://github.com/langchain-ai/langchain/pull/17192](https://togithub.com/langchain-ai/langchain/pull/17192) - remove pg_essay.txt by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17198](https://togithub.com/langchain-ai/langchain/pull/17198) - langchain: Standardize `output_parser.py` across all agent types for custom `FORMAT_INSTRUCTIONS` by [@hdnh2006](https://togithub.com/hdnh2006) in [https://github.com/langchain-ai/langchain/pull/17168](https://togithub.com/langchain-ai/langchain/pull/17168) - core\[patch], community\[patch]: link extraction 
continue on failure by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17200](https://togithub.com/langchain-ai/langchain/pull/17200) - core\[patch]: Release 0.1.21 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17202](https://togithub.com/langchain-ai/langchain/pull/17202) - cli\[patch]: copyright 2024 default by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17204](https://togithub.com/langchain-ai/langchain/pull/17204) - community\[patch]: Release 0.0.19 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17207](https://togithub.com/langchain-ai/langchain/pull/17207) - Fix stream events/log with some kinds of non addable output by [@nfcampos](https://togithub.com/nfcampos) in [https://github.com/langchain-ai/langchain/pull/17205](https://togithub.com/langchain-ai/langchain/pull/17205) - google-vertexai\[patch]: serializable citation metadata, release 0.0.4 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17145](https://togithub.com/langchain-ai/langchain/pull/17145) - google-vertexai\[patch]: function calling integration test by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17209](https://togithub.com/langchain-ai/langchain/pull/17209) - google-genai\[patch]: match function call interface by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17213](https://togithub.com/langchain-ai/langchain/pull/17213) - google-genai\[patch]: no error for FunctionMessage by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17215](https://togithub.com/langchain-ai/langchain/pull/17215) - google-genai\[patch]: release 0.0.7 by [@efriis](https://togithub.com/efriis) in 
[https://github.com/langchain-ai/langchain/pull/17193](https://togithub.com/langchain-ai/langchain/pull/17193) - docs: cleanup fleet integration by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17214](https://togithub.com/langchain-ai/langchain/pull/17214) - templates: add gemini functions agent by [@hwchase17](https://togithub.com/hwchase17) in [https://github.com/langchain-ai/langchain/pull/17141](https://togithub.com/langchain-ai/langchain/pull/17141) - langchain\[minor], community\[minor], core\[minor]: Async Cache support and AsyncRedisCache by [@dzmitry-kankalovich](https://togithub.com/dzmitry-kankalovich) in [https://github.com/langchain-ai/langchain/pull/15817](https://togithub.com/langchain-ai/langchain/pull/15817) - community\[patch]: Fix chat openai unit test by [@LuizFrra](https://togithub.com/LuizFrra) in [https://github.com/langchain-ai/langchain/pull/17124](https://togithub.com/langchain-ai/langchain/pull/17124) - docs: titles fix by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/17206](https://togithub.com/langchain-ai/langchain/pull/17206) - community\[patch]: Better error propagation for neo4jgraph by [@tomasonjo](https://togithub.com/tomasonjo) in [https://github.com/langchain-ai/langchain/pull/17190](https://togithub.com/langchain-ai/langchain/pull/17190) - community\[minor]: SQLDatabase Add fetch mode `cursor`, query parameters, query by selectable, expose execution options, and documentation by [@eyurtsev](https://togithub.com/eyurtsev) in [https://github.com/langchain-ai/langchain/pull/17191](https://togithub.com/langchain-ai/langchain/pull/17191) - community\[patch]: octoai embeddings bug fix by [@AI-Bassem](https://togithub.com/AI-Bassem) in [https://github.com/langchain-ai/langchain/pull/17216](https://togithub.com/langchain-ai/langchain/pull/17216) - docs: add missing link to Quickstart by [@sana-google](https://togithub.com/sana-google) in 
[https://github.com/langchain-ai/langchain/pull/17085](https://togithub.com/langchain-ai/langchain/pull/17085) - docs: use PromptTemplate.from_template by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17218](https://togithub.com/langchain-ai/langchain/pull/17218) - langchain_google_vertexai : added logic to override get_num_tokens_from_messages() for ChatVertexAI by [@Adi8885](https://togithub.com/Adi8885) in [https://github.com/langchain-ai/langchain/pull/16784](https://togithub.com/langchain-ai/langchain/pull/16784) - google-vertexai\[patch]: integration test fix, release 0.0.5 by [@efriis](https://togithub.com/efriis) in [https://github.com/langchain-ai/langchain/pull/17258](https://togithub.com/langchain-ai/langchain/pull/17258) - partners/google-vertexai:fix \_parse_response_candidate issue by [@hsuyuming](https://togithub.com/hsuyuming) in [https://github.com/langchain-ai/langchain/pull/16647](https://togithub.com/langchain-ai/langchain/pull/16647) - langchain\[minor], core\[minor]: add openai-json structured output runnable by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16914](https://togithub.com/langchain-ai/langchain/pull/16914) - Documentation: Fix typo in github.ipynb by [@jorge-campo](https://togithub.com/jorge-campo) in [https://github.com/langchain-ai/langchain/pull/17259](https://togithub.com/langchain-ai/langchain/pull/17259) - Implement Unique ID Enforcement in FAISS by [@ByeongUkChoi](https://togithub.com/ByeongUkChoi) in [https://github.com/langchain-ai/langchain/pull/17244](https://togithub.com/langchain-ai/langchain/pull/17244) - langchain, community: Fixes in the Ontotext GraphDB Graph and QA Chain by [@nelly-hateva](https://togithub.com/nelly-hateva) in [https://github.com/langchain-ai/langchain/pull/17239](https://togithub.com/langchain-ai/langchain/pull/17239) - community: Fix KeyError 'embedding' (MongoDBAtlasVectorSearch) by 
[@cjpark-data](https://togithub.com/cjpark-data) in [https://github.com/langchain-ai/langchain/pull/17178](https://togithub.com/langchain-ai/langchain/pull/17178) - community: Support SerDe transform functions in Databricks LLM by [@liangz1](https://togithub.com/liangz1) in [https://github.com/langchain-ai/langchain/pull/16752](https://togithub.com/langchain-ai/langchain/pull/16752) - langchain_google-genai\[patch]: Invoke callback prior to yielding token by [@dudesparsh](https://togithub.com/dudesparsh) in [https://github.com/langchain-ai/langchain/pull/17092](https://togithub.com/langchain-ai/langchain/pull/17092) - Added LCEL for alibabacloud and anyscale by [@kartheekyakkala](https://togithub.com/kartheekyakkala) in [https://github.com/langchain-ai/langchain/pull/17252](https://togithub.com/langchain-ai/langchain/pull/17252) - langchain: Fix create_retriever_tool missing on_retriever_end Document content by [@wangcailin](https://togithub.com/wangcailin) in [https://github.com/langchain-ai/langchain/pull/16933](https://togithub.com/langchain-ai/langchain/pull/16933) - added parsing of function call / response by [@lkuligin](https://togithub.com/lkuligin) in [https://github.com/langchain-ai/langchain/pull/17245](https://togithub.com/langchain-ai/langchain/pull/17245) - langchain: Update quickstart.mdx - Fix 422 error in example with LangServe client code by [@schalkje](https://togithub.com/schalkje) in [https://github.com/langchain-ai/langchain/pull/17163](https://togithub.com/langchain-ai/langchain/pull/17163) - langchain: adds recursive json splitter by [@joelsprunger](https://togithub.com/joelsprunger) in [https://github.com/langchain-ai/langchain/pull/17144](https://togithub.com/langchain-ai/langchain/pull/17144) - community: Add you.com utility, update you retriever integration docs by [@scottnath](https://togithub.com/scottnath) in [https://github.com/langchain-ai/langchain/pull/17014](https://togithub.com/langchain-ai/langchain/pull/17014) - community: add 
runtime kwargs to HuggingFacePipeline by [@ab-10](https://togithub.com/ab-10) in [https://github.com/langchain-ai/langchain/pull/17005](https://togithub.com/langchain-ai/langchain/pull/17005) - \[Langchain_core]: Added Docstring for RunnableConfigurableAlternatives by [@keenborder786](https://togithub.com/keenborder786) in [https://github.com/langchain-ai/langchain/pull/17263](https://togithub.com/langchain-ai/langchain/pull/17263) - community: updated openai prices in mapping by [@Sssanek](https://togithub.com/Sssanek) in [https://github.com/langchain-ai/langchain/pull/17009](https://togithub.com/langchain-ai/langchain/pull/17009) - docs: `Toolkits` menu by [@leo-gan](https://togithub.com/leo-gan) in [https://github.com/langchain-ai/langchain/pull/16217](https://togithub.com/langchain-ai/langchain/pull/16217) - infra: rm boto3, gcaip from pyproject by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17270](https://togithub.com/langchain-ai/langchain/pull/17270) - langchain\[patch]: expose cohere rerank score, add parent doc param by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/16887](https://togithub.com/langchain-ai/langchain/pull/16887) - core\[patch]: Release 0.1.22 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17274](https://togithub.com/langchain-ai/langchain/pull/17274) - langchain\[patch]: Release 0.1.6 by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17133](https://togithub.com/langchain-ai/langchain/pull/17133) - langchain\[patch]: undo redis cache import by [@baskaryan](https://togithub.com/baskaryan) in [https://github.com/langchain-ai/langchain/pull/17275](https://togithub.com/langchain-ai/langchain/pull/17275) - infra: mv SQLDatabase tests to community by [@baskaryan](https://togithub.com/baskaryan) in 
[https://github.com/langchain-ai/langchain/pull/17276](https://togithub.com/langchain-ai/langchain/pull/17276) ##### New Contributors - [@akirawuc](https://togithub.com/akirawuc) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16877](https://togithub.com/langchain-ai/langchain/pull/16877) - [@rocky1405](https://togithub.com/rocky1405) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16844](https://togithub.com/langchain-ai/langchain/pull/16844) - [@picsoung](https://togithub.com/picsoung) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17007](https://togithub.com/langchain-ai/langchain/pull/17007) - [@rmkraus](https://togithub.com/rmkraus) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17023](https://togithub.com/langchain-ai/langchain/pull/17023) - [@mkhludnev](https://togithub.com/mkhludnev) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16997](https://togithub.com/langchain-ai/langchain/pull/16997) - [@calvinweb](https://togithub.com/calvinweb) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16335](https://togithub.com/langchain-ai/langchain/pull/16335) - [@Mercurrent](https://togithub.com/Mercurrent) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17035](https://togithub.com/langchain-ai/langchain/pull/17035) - [@supreetkt](https://togithub.com/supreetkt) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16800](https://togithub.com/langchain-ai/langchain/pull/16800) - [@strongSoda](https://togithub.com/strongSoda) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16998](https://togithub.com/langchain-ai/langchain/pull/16998) - [@bmuskalla](https://togithub.com/bmuskalla) made their first contribution in 
[https://github.com/langchain-ai/langchain/pull/16956](https://togithub.com/langchain-ai/langchain/pull/16956) - [@n0vad3v](https://togithub.com/n0vad3v) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16859](https://togithub.com/langchain-ai/langchain/pull/16859) - [@Poissecaille](https://togithub.com/Poissecaille) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16578](https://togithub.com/langchain-ai/langchain/pull/16578) - [@moorej-oci](https://togithub.com/moorej-oci) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17073](https://togithub.com/langchain-ai/langchain/pull/17073) - [@SalamanderXing](https://togithub.com/SalamanderXing) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17062](https://togithub.com/langchain-ai/langchain/pull/17062) - [@scottnath](https://togithub.com/scottnath) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17053](https://togithub.com/langchain-ai/langchain/pull/17053) - [@fpaupier](https://togithub.com/fpaupier) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16995](https://togithub.com/langchain-ai/langchain/pull/16995) - [@mtmahe](https://togithub.com/mtmahe) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17011](https://togithub.com/langchain-ai/langchain/pull/17011) - [@hdnh2006](https://togithub.com/hdnh2006) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16945](https://togithub.com/langchain-ai/langchain/pull/16945) - [@laoazhang](https://togithub.com/laoazhang) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16916](https://togithub.com/langchain-ai/langchain/pull/16916) - [@Swalloow](https://togithub.com/Swalloow) made their first contribution in 
[https://github.com/langchain-ai/langchain/pull/16991](https://togithub.com/langchain-ai/langchain/pull/16991) - [@arnoschutijzer](https://togithub.com/arnoschutijzer) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17127](https://togithub.com/langchain-ai/langchain/pull/17127) - [@dzmitry-kankalovich](https://togithub.com/dzmitry-kankalovich) made their first contribution in [https://github.com/langchain-ai/langchain/pull/15817](https://togithub.com/langchain-ai/langchain/pull/15817) - [@LuizFrra](https://togithub.com/LuizFrra) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17124](https://togithub.com/langchain-ai/langchain/pull/17124) - [@sana-google](https://togithub.com/sana-google) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17085](https://togithub.com/langchain-ai/langchain/pull/17085) - [@jorge-campo](https://togithub.com/jorge-campo) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17259](https://togithub.com/langchain-ai/langchain/pull/17259) - [@ByeongUkChoi](https://togithub.com/ByeongUkChoi) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17244](https://togithub.com/langchain-ai/langchain/pull/17244) - [@cjpark-data](https://togithub.com/cjpark-data) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17178](https://togithub.com/langchain-ai/langchain/pull/17178) - [@kartheekyakkala](https://togithub.com/kartheekyakkala) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17252](https://togithub.com/langchain-ai/langchain/pull/17252) - [@wangcailin](https://togithub.com/wangcailin) made their first contribution in [https://github.com/langchain-ai/langchain/pull/16933](https://togithub.com/langchain-ai/langchain/pull/16933) - [@schalkje](https://togithub.com/schalkje) made their first contribution in 
[https://github.com/langchain-ai/langchain/pull/17163](https://togithub.com/langchain-ai/langchain/pull/17163) - [@joelsprunger](https://togithub.com/joelsprunger) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17144](https://togithub.com/langchain-ai/langchain/pull/17144) - [@Sssanek](https://togithub.com/Sssanek) made their first contribution in [https://github.com/langchain-ai/langchain/pull/17009](https://togithub.com/langchain-ai/langchain/pull/17009) **Full Changelog**: https://github.com/langchain-ai/langchain/compare/v0.1.5...v0.1.6 </details> <details> <summary>langchain-ai/langchainjs (langchain)</summary> ### [`v0.1.17`](https://togithub.com/langchain-ai/langchainjs/releases/tag/0.1.17) [Compare Source](https://togithub.com/langchain-ai/langchainjs/compare/0.1.16...0.1.17) #### What's Changed - langchain\[patch]: Release 0.1.16 by [@jacoblee93](https://togithub.com/jacoblee93) in [https://github.com/langchain-ai/langchainjs/pull/4334](https://togithub.com/langchain-ai/langchainjs/pull/4334) - Correct waitlist instruction in README by [@eknuth](https://togithub.com/eknuth) in [https://github.com/langchain-ai/langchainjs/pull/4335](https://togithub.com/langchain-ai/langchainjs/pull/4335) - docs\[patch]: Fix broken link by [@jacoblee93](https://togithub.com/jacoblee93) in [https://github.com/langchain-ai/langchainjs/pull/4336](https://togithub.com/langchain-ai/langchainjs/pull/4336) - langchain\[patch]: Export helper functions from indexing api by [@bracesproul](https://togithub.com/bracesproul) in [https://github.com/langchain-ai/langchainjs/pull/4344](https://togithub.com/langchain-ai/langchainjs/pull/4344) - docs\[minor]: Add Human-in-the-loop to tools use case by [@bracesproul](https://togithub.com/bracesproul) in [https://github.com/langchain-ai/langchainjs/pull/4314](https://togithub.com/langchain-ai/langchainjs/pull/4314) - langchain\[minor],docs\[minor]: Add `SitemapLoader` by 
[@bracesproul](https://togithub.com/bracesproul) in [https://github.com/langchain-ai/langchainjs/pull/4331](https://togithub.com/langchain-ai/langchainjs/pull/4331) - langchain\[patch]: Rm unwanted build artifacts by [@bracesproul](https://togithub.com/bracesproul) in [https://github.com/langchain-ai/langchainjs/pull/4345](https://togithub.com/langchain-ai/langchainjs/pull/4345) #### New Contributors - [@eknuth](https://togithub.com/eknuth) made their first contribution in [https://github.com/langchain-ai/langchainjs/pull/4335](https://togithub.com/langchain-ai/langchainjs/pull/4335) **Full Changelog**: https://github.com/langchain-ai/langchainjs/compare/0.1.16...0.1.17 </details> <details> <summary>steven-tey/novel (novel)</summary> ### [`v0.2.0`](https://togithub.com/steven-tey/novel/releases/tag/0.2.0) [Compare Source](https://togithub.com/steven-tey/novel/compare/0.1.22...0.2.0) WIP Novel docs here [Docs](https://novel.sh/docs/introduction) #### What's Changed - RFC: Headless core components & imperative support by [@andrewdoro](https://togithub.com/andrewdoro) in [https://github.com/steven-tey/novel/pull/136](https://togithub.com/steven-tey/novel/pull/136) - feat: add docs app by [@andrewdoro](https://togithub.com/andrewdoro) in [https://github.com/steven-tey/novel/pull/284](https://togithub.com/steven-tey/novel/pull/284) - fix: update dark mode class to drag handler component by [@brunocroh](https://togithub.com/brunocroh) in [https://github.com/steven-tey/novel/pull/286](https://togithub.com/steven-tey/novel/pull/286) - \[Fix] - Correct License Link in README.md by [@justinjunodev](https://togithub.com/justinjunodev) in [https://github.com/steven-tey/novel/pull/274](https://togithub.com/steven-tey/novel/pull/274) - fix: image move when dragged by [@brunocroh](https://togithub.com/brunocroh) in [https://github.com/steven-tey/novel/pull/287](https://togithub.com/steven-tey/novel/pull/287) #### New Contributors - [@andrewdoro](https://togithub.com/andrewdoro) 
made their first contribution in [https://github.com/steven-tey/novel/pull/136](https://togithub.com/steven-tey/novel/pull/136) - [@brunocroh](https://togithub.com/brunocroh) made their first contribution in [https://github.com/steven-tey/novel/pull/286](https://togithub.com/steven-tey/novel/pull/286) - [@justinjunodev](https://togithub.com/justinjunodev) made their first contribution in [https://github.com/steven-tey/novel/pull/274](https://togithub.com/steven-tey/novel/pull/274) **Full Changelog**: https://github.com/steven-tey/novel/compare/0.1.22...0.2.0 </details> <details> <summary>openai/openai-python (openai)</summary> ### [`v1.12.0`](https://togithub.com/openai/openai-python/blob/HEAD/CHANGELOG.md#1120-2024-02-08) [Compare Source](https://togithub.com/openai/openai-python/compare/v1.11.1...v1.12.0) Full Changelog: [v1.11.1...v1.12.0](https://togithub.com/openai/openai-python/compare/v1.11.1...v1.12.0) ##### Features - **api:** add `timestamp_granularities`, add `gpt-3.5-turbo-0125` model ([#1125](https://togithub.com/openai/openai-python/issues/1125)) ([1ecf8f6](https://togithub.com/openai/openai-python/commit/1ecf8f6b12323ed09fb6a2815c85b9533ee52a50)) - **cli/images:** add support for `--model` arg ([#1132](https://togithub.com/openai/openai-python/issues/1132)) ([0d53866](https://togithub.com/openai/openai-python/commit/0d5386615cda7cd50d5db90de2119b84dba29519)) ##### Bug Fixes - remove double brackets from timestamp_granularities param ([#1140](https://togithub.com/openai/openai-python/issues/1140)) ([3db0222](https://togithub.com/openai/openai-python/commit/3db022216a81fa86470b53ec1246669bc7b17897)) - **types:** loosen most List params types to Iterable ([#1129](https://togithub.com/openai/openai-python/issues/1129)) ([bdb31a3](https://togithub.com/openai/openai-python/commit/bdb31a3b1db6ede4e02b3c951c4fd23f70260038)) ##### Chores - **internal:** add lint command ([#1128](https://togithub.com/openai/openai-python/issues/1128)) 
([4c021c0](https://togithub.com/openai/openai-python/commit/4c021c0ab0151c2ec092d860c9b60e22e658cd03)) - **internal:** support serialising iterable types ([#1127](https://togithub.com/openai/openai-python/issues/1127)) ([98d4e59](https://togithub.com/openai/openai-python/commit/98d4e59afcf2d65d4e660d91eb9462240ef5cd63)) ##### Documentation - add CONTRIBUTING.md ([#1138](https://togithub.com/openai/openai-python/issues/1138)) ([79c8f0e](https://togithub.com/openai/openai-python/commit/79c8f0e8bf5470e2e31e781e8d279331e89ddfbe)) </details> <details> <summary>openai/openai-node (openai)</summary> ### [`v4.27.1`](https://togithub.com/openai/openai-node/blob/HEAD/CHANGELOG.md#4271-2024-02-12) [Compare Source](https://togithub.com/openai/openai-node/compare/v4.27.0...v4.27.1) Full Changelog: [v4.27.0...v4.27.1](https://togithub.com/openai/openai-node/compare/v4.27.0...v4.27.1) ### [`v4.27.0`](https://togithub.com/openai/openai-node/blob/HEAD/CHANGELOG.md#4270-2024-02-08) [Compare Source](https://togithub.com/openai/openai-node/compare/v4.26.1...v4.27.0) Full Changelog: [v4.26.1...v4.27.0](https://togithub.com/openai/openai-node/compare/v4.26.1...v4.27.0) ##### Features - **api:** add `timestamp_granularities`, add `gpt-3.5-turbo-0125` model ([#661](https://togithub.com/openai/openai-node/issues/661)) ([5016806](https://togithub.com/openai/openai-node/commit/50168066862f66b529bae29f4564741300303246)) ##### Chores - **internal:** fix retry mechanism for ecosystem-test ([#663](https://togithub.com/openai/openai-node/issues/663)) ([0eb7ed5](https://togithub.com/openai/openai-node/commit/0eb7ed5ca3f7c7b29c316fc7d725d834cee73989)) - respect `application/vnd.api+json` content-type header ([#664](https://togithub.com/openai/openai-node/issues/664)) ([f4fad54](https://togithub.com/openai/openai-node/commit/f4fad549c5c366d8dd8b936b7699639b895e82a1)) </details> <details> <summary>pydantic/pydantic (pydantic)</summary> ### [`v2.6.1`](https://togithub.com/pydantic/py </details> --- 
### Configuration 📅 **Schedule**: Branch creation - "before 4am on Monday" in timezone America/Chicago, Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://togithub.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/autoblocksai/autoblocks-examples).  Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

@joelsprunger

- **Description:** This adds a recursive JSON splitter class to the existing text_splitters, along with unit tests.
- **Issue:** Splitting text from structured data can cause issues. If you have a large nested JSON object and split it as regular text, you may lose the structure of the JSON. To mitigate this you can split the nested JSON into large chunks and overlap them, but this causes unnecessary text processing, and there will still be times when the nested JSON is so big that chunks get separated from their parent keys.

As an example, you wouldn't want the following to be split in half:

```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
-----------------------------------------------------------------------split-----
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf',
                    'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```

An LLM processing the second chunk may not have the context of val1 and val16, which reduces accuracy. Embeddings will also lack this context, which makes retrieval less accurate.

Instead, you want it to be split into chunks that retain the JSON structure:

```shell
{'val0': 'DFWeNdWhapbR',
 'val1': {'val10': 'QdJo',
          'val11': 'FWSDVFHClW',
          'val12': 'bkVnXMMlTiQh',
          'val13': 'tdDMKRrOY',
          'val14': 'zybPALvL',
          'val15': 'JMzGMNH',
          'val16': {'val160': 'qLuLKusFw',
                    'val161': 'DGuotLh',
                    'val162': 'KztlcSBropT',
                    'val163': 'YlHHDrN',
                    'val164': 'CtzsxlGBZKf'}}}
```

and

```shell
{'val1': {'val16': {'val165': 'bXzhcrWLmBFp',
                    'val166': 'zZAqC',
                    'val167': 'ZtyWno',
                    'val168': 'nQQZRsLnaBhb',
                    'val169': 'gSpMbJwA'},
          'val17': 'JhgiyF',
          'val18': 'aJaqjUSFFrI',
          'val19': 'glqNSvoyxdg'}}
```

This recursive JSON text splitter does this. Values that contain a list can be converted to a dict first by using `split(... convert_lists=True)`; otherwise long lists will not be split and you may end up with chunks larger than the maximum chunk size.

In my testing, large JSON objects could be split into small chunks with:

✅ Increased question answering accuracy
✅ Smaller chunks, which meant retrieval queries could use fewer tokens

- **Dependencies:** json import added to text_splitter.py, and random added to the unit test
- **Twitter handle:** @joelsprunger

---------

Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
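The splitting idea described above can be sketched roughly as follows. This is an illustrative standalone sketch, not the code merged in this PR: the function names `json_size`, `set_nested`, `lists_to_dicts`, and `split_json`, and the `max_chunk_size` parameter, are assumptions made for the example. It measures chunk size as the length of the serialized JSON and copies the parent keys into every chunk that receives part of a nested object.

```python
import json
from typing import Any, Dict, List, Optional


def json_size(data: Dict[str, Any]) -> int:
    """Length of the serialized JSON, used here as the chunk-size measure."""
    return len(json.dumps(data))


def set_nested(target: Dict[str, Any], path: List[str], value: Any) -> None:
    """Set target[path[0]][path[1]]... = value, creating parent dicts as needed."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def lists_to_dicts(data: Any) -> Any:
    """Replace every list with an index-keyed dict so long lists can be split too."""
    if isinstance(data, dict):
        return {k: lists_to_dicts(v) for k, v in data.items()}
    if isinstance(data, list):
        return {str(i): lists_to_dicts(v) for i, v in enumerate(data)}
    return data


def split_json(
    data: Dict[str, Any],
    max_chunk_size: int,
    path: Optional[List[str]] = None,
    chunks: Optional[List[Dict[str, Any]]] = None,
) -> List[Dict[str, Any]]:
    """Split nested JSON into chunks that each keep their parent keys."""
    path = path or []
    if chunks is None:
        chunks = [{}]
    for key, value in data.items():
        new_path = path + [key]
        # Measure the item together with the parent keys it has to carry.
        item: Dict[str, Any] = {}
        set_nested(item, new_path, value)
        if json_size(chunks[-1]) + json_size(item) <= max_chunk_size:
            # Fits (by a conservative estimate): add it to the current chunk.
            set_nested(chunks[-1], new_path, value)
        elif isinstance(value, dict) and value:
            # Too big as a whole: descend and place its children one by one.
            split_json(value, max_chunk_size, new_path, chunks)
        else:
            # A leaf that does not fit: start a new chunk unless it is empty.
            if chunks[-1]:
                chunks.append({})
            set_nested(chunks[-1], new_path, value)
    return chunks


if __name__ == "__main__":
    doc = {"val0": "DFWeNdWhapbR", "val1": {"val16": {"val160": "qLuLKusFw"}}}
    for chunk in split_json(lists_to_dicts(doc), max_chunk_size=40):
        print(json.dumps(chunk))
```

The size check in this sketch adds the two serialized lengths, which overestimates the merged size when parent keys are shared, so chunks may come out smaller than strictly necessary. A single leaf whose own path and value exceed `max_chunk_size` still ends up in an oversized chunk, which is the same limitation the description notes for unsplit long lists.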

joelsprunger · 2024-02-22T16:58:24Z

@hwchase17 were you planning to highlight this feature in some way?

joelsprunger added 2 commits February 6, 2024 14:51

feat: adds recursive json splitter and unit test

11a87a4

chore: update to pass linting

2cbf2b2

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Feb 7, 2024

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Feb 7, 2024

joelsprunger closed this Feb 7, 2024

joelsprunger reopened this Feb 7, 2024

hwchase17 self-assigned this Feb 7, 2024

fix: repeated calls to split() will clear chunks before splitting

4df70ed

chore: documentation ipynb

vercel bot deployed to Preview February 7, 2024 06:40 View deployment

hwchase17 reviewed Feb 7, 2024

chore: refactor json splitter to be stateless and update docs

2a77ccf

vercel bot deployed to Preview February 7, 2024 20:26 View deployment

chore: brings json size function into class

278698c

vercel bot deployed to Preview February 7, 2024 20:35 View deployment

joelsprunger changed the title ~~langchain: adds recursive json splitter and unit test~~ langchain: adds recursive json splitter Feb 7, 2024

cr

9832196

hwchase17 approved these changes Feb 8, 2024

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Feb 8, 2024

Merge branch 'master' into joelsprunger-notWitchTenThousandMenCouldYo…

2cf19e8

…uDoThis

vercel bot deployed to Preview February 8, 2024 21:44 View deployment

hwchase17 merged commit 3984f66 into langchain-ai:master Feb 8, 2024
43 checks passed


joelsprunger commented Feb 7, 2024 • edited

vercel bot commented Feb 7, 2024 • edited

joelsprunger commented Feb 7, 2024

joelsprunger commented Feb 7, 2024 • edited

hwchase17 left a comment

hwchase17 Feb 7, 2024

joelsprunger Feb 7, 2024

joelsprunger Feb 7, 2024

joelsprunger Feb 7, 2024

joelsprunger Feb 7, 2024

hwchase17 commented Feb 8, 2024

joelsprunger commented Feb 9, 2024

funkymonkeymonk commented Feb 9, 2024

joelsprunger commented Feb 22, 2024
