## How to split JSON data

##### This JSON splitter splits JSON data while giving you control over chunk sizes. It traverses the data depth first and builds smaller JSON chunks. It tries to keep nested JSON objects whole, but splits them when needed to keep chunks between min_chunk_size and max_chunk_size.
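The traversal described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: `json_size`, `set_nested`, and `split_json_sketch` are hypothetical names, `min_chunk_size` is ignored, the root is assumed to be a dict, and size is assumed to be the length of the serialized JSON.

```python
import json


def json_size(data):
    """Size of a value, measured as the length of its JSON serialization."""
    return len(json.dumps(data))


def set_nested(chunk, path, value):
    """Store value in chunk under the given key path, creating parent dicts."""
    for key in path[:-1]:
        chunk = chunk.setdefault(key, {})
    chunk[path[-1]] = value


def split_json_sketch(data, max_chunk_size):
    """Depth-first split of a dict into chunks of roughly max_chunk_size characters."""
    chunks = [{}]

    def walk(node, path):
        if not isinstance(node, dict):
            # Leaf value (string, number, list, ...): kept whole, never split.
            set_nested(chunks[-1], path, node)
            return
        for key, value in node.items():
            remaining = max_chunk_size - json_size(chunks[-1])
            if json_size({key: value}) < remaining:
                # The whole subtree fits: keep the nested object intact.
                set_nested(chunks[-1], path + [key], value)
            else:
                # Too big: start a fresh chunk and descend into the subtree.
                if chunks[-1]:
                    chunks.append({})
                walk(value, path + [key])

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


data = {"a": {"x": "1" * 30, "y": "2" * 30}, "b": {"z": "3" * 30}}
chunks = split_json_sketch(data, max_chunk_size=60)
for chunk in chunks:
    print(json_size(chunk), chunk)
```

Each chunk repeats the key path down to the values it holds, so a nested object that had to be split still shows where its pieces came from. The size check is approximate because it does not account for the parent keys that `set_nested` adds.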

##### If a value is not a nested JSON object but a very large string, the string is not split. If you need a hard cap on chunk size, consider composing this splitter with a Recursive Text splitter applied to those chunks. There's an optional pre-processing step for splitting lists: they are first converted to JSON objects (dicts) and then split.
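The list pre-processing step can be pictured as follows. This is a minimal sketch, assuming each list is replaced by a dict keyed by the stringified item index; `lists_to_dicts` is a hypothetical helper, not a library function.

```python
def lists_to_dicts(data):
    """Recursively replace every list with a dict keyed by item index."""
    if isinstance(data, dict):
        return {key: lists_to_dicts(value) for key, value in data.items()}
    if isinstance(data, list):
        return {str(i): lists_to_dicts(item) for i, item in enumerate(data)}
    # Scalars (strings, numbers, booleans, None) are returned unchanged.
    return data


converted = lists_to_dicts({"todos": [{"id": 1}, {"id": 2}]})
print(converted)
```

Once the lists are dicts, a depth-first splitter can descend into them and place individual items in separate chunks instead of treating the whole list as one unsplittable value.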

- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.
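To make "number of characters" concrete, here is a minimal sketch that assumes the size of a chunk is the length of its serialized JSON string. It also shows a naive fixed-width fallback for oversized strings; `chunk_size` and `hard_cap` are hypothetical helpers, and the slicing is a simplified stand-in for a real text splitter, not the Recursive Text splitter itself.

```python
import json


def chunk_size(chunk):
    """Character count of the chunk once serialized to a JSON string."""
    return len(json.dumps(chunk))


def hard_cap(text, limit):
    """Naive fixed-width slicing; a stand-in for a real text splitter."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


# A chunk holding one long string: the JSON splitter would keep it whole.
chunk = {"note": "x" * 50}
serialized = json.dumps(chunk)
print(chunk_size(chunk))

# Enforce a hard cap of 20 characters on the serialized text.
pieces = hard_cap(serialized, 20)
print([len(piece) for piece in pieces])
```

The pieces produced by the fallback are plain text fragments and are no longer valid JSON on their own, which is the trade-off of enforcing a hard cap.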


In [3]:
import json
import requests

json_data = requests.get("https://jsonplaceholder.typicode.com/users").json()

In [4]:
json_data

[{'id': 1,
  'name': 'Leanne Graham',
  'username': 'Bret',
  'email': 'Sincere@april.biz',
  'address': {'street': 'Kulas Light',
   'suite': 'Apt. 556',
   'city': 'Gwenborough',
   'zipcode': '92998-3874',
   'geo': {'lat': '-37.3159', 'lng': '81.1496'}},
  'phone': '1-770-736-8031 x56442',
  'website': 'hildegard.org',
  'company': {'name': 'Romaguera-Crona',
   'catchPhrase': 'Multi-layered client-server neural-net',
   'bs': 'harness real-time e-markets'}},
 {'id': 2,
  'name': 'Ervin Howell',
  'username': 'Antonette',
  'email': 'Shanna@melissa.tv',
  'address': {'street': 'Victor Plains',
   'suite': 'Suite 879',
   'city': 'Wisokyburgh',
   'zipcode': '90566-7771',
   'geo': {'lat': '-43.9509', 'lng': '-34.4618'}},
  'phone': '010-692-6593 x09125',
  'website': 'anastasia.net',
  'company': {'name': 'Deckow-Crist',
   'catchPhrase': 'Proactive didactic contingency',
   'bs': 'synergize scalable supply-chains'}},
 {'id': 3,
  'name': 'Clementine Bauch',
  'username': 'Samantha

In [10]:
from langchain_text_splitters import RecursiveJsonSplitter
import requests

# Example JSON
json_data = requests.get("https://raw.githubusercontent.com/typicode/demo/master/db.json").json()

# Create splitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=200)

# Split JSON into chunks
json_chunks = json_splitter.split_json(json_data)

json_chunks[:3]  # show first 3 chunks


[{'posts': [{'id': 1, 'title': 'Post 1'},
   {'id': 2, 'title': 'Post 2'},
   {'id': 3, 'title': 'Post 3'}]},
 {'comments': [{'id': 1, 'body': 'some comment', 'postId': 1},
   {'id': 2, 'body': 'some comment', 'postId': 1}],
  'profile': {'name': 'typicode'}}]

In [11]:
for chunk in json_chunks[:3]:
    print(chunk)

{'posts': [{'id': 1, 'title': 'Post 1'}, {'id': 2, 'title': 'Post 2'}, {'id': 3, 'title': 'Post 3'}]}
{'comments': [{'id': 1, 'body': 'some comment', 'postId': 1}, {'id': 2, 'body': 'some comment', 'postId': 1}], 'profile': {'name': 'typicode'}}


In [17]:
### The splitter can also output documents

from langchain_core.documents import Document

# Built-in alternative: docs = json_splitter.create_documents(texts=[json_data])

# 1. Get some JSON
url = "https://jsonplaceholder.typicode.com/todos"
json_data = requests.get(url).json()   # this is a LIST

# 2. Wrap in a dict so the splitter has a clear root
wrapped = {"todos": json_data}

# 3. Create splitter
json_splitter = RecursiveJsonSplitter(max_chunk_size=500)

# 4. Split into JSON chunks (Python objects).
# Lists are kept whole by default; pass convert_lists=True to split_json to split them too.
json_chunks = json_splitter.split_json(wrapped)

print("Number of chunks:", len(json_chunks))

# 5. Turn chunks into LangChain Documents
docs = [
    Document(page_content=json.dumps(chunk, ensure_ascii=False))
    for chunk in json_chunks
]

# 6. Inspect a few
for i, d in enumerate(docs[:3]):
    print(f"\n--- DOC {i} ---")
    print(d.page_content)

Number of chunks: 1

--- DOC 0 ---
{"todos": [{"userId": 1, "id": 1, "title": "delectus aut autem", "completed": false}, {"userId": 1, "id": 2, "title": "quis ut nam facilis et officia qui", "completed": false}, {"userId": 1, "id": 3, "title": "fugiat veniam minus", "completed": false}, {"userId": 1, "id": 4, "title": "et porro tempora", "completed": true}, {"userId": 1, "id": 5, "title": "laboriosam mollitia et enim quasi adipisci quia provident illum", "completed": false}, {"userId": 1, "id": 6, "title": "qui ullam ratione quibusdam voluptatem quia omnis", "completed": false}, {"userId": 1, "id": 7, "title": "illo expedita consequatur quia in", "completed": false}, {"userId": 1, "id": 8, "title": "quo adipisci enim quam ut ab", "completed": true}, {"userId": 1, "id": 9, "title": "molestiae perspiciatis ipsa", "completed": false}, {"userId": 1, "id": 10, "title": "illo est ratione doloremque quia maiores aut", "completed": true}, {"userId": 1, "id": 11, "title": "vero rerum temporibus