### 📄 JSON & JSONL Processing: LangChain + Custom Parsing

🎯 Purpose
- Work with nested JSON and JSONL files
- Extract structured entities (members / employees)
- Support both loader-based and custom full-control processing

#### ✅ Method 1: JSONLoader with jq_schema (Simple & Quick)

Use when:
- The JSON structure is consistent
- You just want to extract matching objects
- Minimal formatting is fine
- Good for quick testing / demos

What happens:
- jq_schema navigates nested fields
- Each member becomes one document
- content_key="name" → page text = employee name
- Other fields are not copied automatically: by default metadata holds only source and seq_num (pass a metadata_func to carry more)

🧠 Think of it as: “Fast extract, minimal control”
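
To make the jq path concrete, here is a rough pure-Python equivalent of what an expression like `.company.departments[].teams[].members[]` selects. The helper name `iter_members` and the small `sample` dict are illustrative assumptions, not part of LangChain; the real loader hands traversal to the `jq` library.

```python
from typing import Any, Dict, Iterator


def iter_members(data: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield each member dict, mirroring .company.departments[].teams[].members[]."""
    for dept in data["company"]["departments"]:   # .company.departments[]
        for team in dept["teams"]:                # .teams[]
            for member in team["members"]:        # .members[]
                yield member


# Small illustrative sample with the same nesting as the company JSON.
sample = {
    "company": {
        "departments": [
            {"teams": [{"members": [{"id": 1, "name": "Aisha"}]}]},
            {"teams": [{"members": [{"id": 3, "name": "Sara"}]}]},
        ]
    }
}

names = [m["name"] for m in iter_members(sample)]
print(names)
```

Each yielded dict corresponds to one document in the loader-based approach, which is why the number of documents equals the number of members.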

#### ✅ Method 2: Custom JSON Processing (Full Control)

Use when:
- You want custom page_content formatting
- Need rich metadata
- Want to clean / normalize fields
- Need grouping / logic (department → team → member)
- Suitable for RAG pipelines / production

What happens:
- Load JSON manually
- Loop through departments → teams → members
- Build readable text
- Attach structured metadata

🧠 Think of it as: “You control content, metadata, and structure”

🧾 JSONL Support (Line-wise Records)

Use when:
- Working with streaming / large datasets
- Each line = one record (employee)
- Useful for scalable ingestion

What happens:
- Records are written line-by-line
- Can be loaded later into docs or pandas
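
As a minimal sketch of the "load later" step, the generator below reads a JSONL file lazily, one record per line, so a large file never has to fit in memory at once. The function name `read_jsonl` and the temporary demo file are assumptions made for illustration. If pandas is available, `pd.read_json(path, lines=True)` is another option, and `JSONLoader` accepts `json_lines=True` for the loader-based route.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict[str, Any]]:
    """Lazily yield one parsed record per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


# Demo: write two records to a temporary file, then read them back.
records = [{"id": 1, "name": "Aisha"}, {"id": 2, "name": "Rahul"}]
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

loaded = list(read_jsonl(demo_path))
print(loaded)
```

Because the function is a generator, callers can also process records in a plain `for` loop and stop early without reading the rest of the file.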

🚀 Key Takeaways
- JSONLoader = simple, schema-based extraction
- Custom JSON = full formatting & metadata control
- JSONL = best for large / streaming datasets
- Choose based on simplicity vs control.

## JSON parsing and processing

In [1]:
import json
import os
os.makedirs('data/json_files', exist_ok=True)

In [15]:
json_data = {
  "company": {
    "name": "TechNova Solutions",
    "location": "Bangalore",
    "departments": [
      {
        "dept_name": "Engineering",
        "manager": "Rahul Mehta",
        "teams": [
          {
            "team_name": "Backend",
            "members": [
              {
                "id": 1,
                "name": "Aisha",
                "role": "Backend Developer",
                "skills": ["Python", "Django", "PostgreSQL"],
                "projects": [
                  { "project_id": "P101", "title": "Order Processing API", "status": "In Progress" }
                ]
              }
            ]
          },
          {
            "team_name": "Data",
            "members": [
              {
                "id": 2,
                "name": "Karan",
                "role": "Data Analyst",
                "skills": ["SQL", "Pandas", "PowerBI"],
                "projects": [
                  { "project_id": "P202", "title": "Sales Insights Dashboard", "status": "Completed" }
                ]
              }
            ]
          }
        ]
      },
      {
        "dept_name": "HR",
        "manager": "Meera Kapoor",
        "teams": [
          {
            "team_name": "HR Team",
            "members": [
              {
                "id": 3,
                "name": "Sara",
                "role": "HR Executive",
                "skills": ["Recruitment", "Payroll", "Training"]
              }
            ]
          }
        ]
      }
    ]
  }
}


In [16]:
json_data

{'company': {'name': 'TechNova Solutions',
  'location': 'Bangalore',
  'departments': [{'dept_name': 'Engineering',
    'manager': 'Rahul Mehta',
    'teams': [{'team_name': 'Backend',
      'members': [{'id': 1,
        'name': 'Aisha',
        'role': 'Backend Developer',
        'skills': ['Python', 'Django', 'PostgreSQL'],
        'projects': [{'project_id': 'P101',
          'title': 'Order Processing API',
          'status': 'In Progress'}]}]},
     {'team_name': 'Data',
      'members': [{'id': 2,
        'name': 'Karan',
        'role': 'Data Analyst',
        'skills': ['SQL', 'Pandas', 'PowerBI'],
        'projects': [{'project_id': 'P202',
          'title': 'Sales Insights Dashboard',
          'status': 'Completed'}]}]}]},
   {'dept_name': 'HR',
    'manager': 'Meera Kapoor',
    'teams': [{'team_name': 'HR Team',
      'members': [{'id': 3,
        'name': 'Sara',
        'role': 'HR Executive',
        'skills': ['Recruitment', 'Payroll', 'Training']}]}]}]}}

In [17]:
with open('data/json_files/company_data.json', 'w', encoding='utf-8') as json_file:
    json.dump(json_data, json_file, indent=4)

In [18]:
data = [
    {"id": 1, "name": "Aisha", "role": "Data Analyst", "skills": ["Python", "SQL"], "location": "Bangalore"},
    {"id": 2, "name": "Rahul", "role": "ML Engineer", "skills": ["PyTorch", "NLP"], "location": "Hyderabad"},
    {"id": 3, "name": "Meera", "role": "Product Manager", "skills": ["Roadmapping", "User Research"], "location": "Remote"},
    {"id": 4, "name": "Karan", "role": "Backend Developer", "skills": ["Django", "PostgreSQL"], "location": "Pune"},
    {"id": 5, "name": "Sara", "role": "Intern", "skills": ["Excel", "Reporting"], "location": "Delhi"}
]

file_path = "data/json_files/employees.jsonl"

with open(file_path, "w", encoding="utf-8") as f:
    for record in data:
        f.write(json.dumps(record) + "\n")

print("JSONL file created successfully!")


JSONL file created successfully!


## JSON Processing Strategies

In [22]:
from langchain_community.document_loaders import JSONLoader

company_loader = JSONLoader(
    file_path="data/json_files/company_data.json",
    jq_schema=".company.departments[].teams[].members[]",
    content_key="name",
    text_content=False
)

company_docs = company_loader.load()

print(f"Loaded {len(company_docs)} employees")
print(company_docs[0].page_content[:200])
print(company_docs[0].metadata)
print(company_docs)


Loaded 3 employees
Aisha
{'source': 'C:\\Users\\Ahmed\\OneDrive\\Desktop\\ExploringRAGs\\0-DataIngestParsing\\data\\json_files\\company_data.json', 'seq_num': 1}
[Document(metadata={'source': 'C:\\Users\\Ahmed\\OneDrive\\Desktop\\ExploringRAGs\\0-DataIngestParsing\\data\\json_files\\company_data.json', 'seq_num': 1}, page_content='Aisha'), Document(metadata={'source': 'C:\\Users\\Ahmed\\OneDrive\\Desktop\\ExploringRAGs\\0-DataIngestParsing\\data\\json_files\\company_data.json', 'seq_num': 2}, page_content='Karan'), Document(metadata={'source': 'C:\\Users\\Ahmed\\OneDrive\\Desktop\\ExploringRAGs\\0-DataIngestParsing\\data\\json_files\\company_data.json', 'seq_num': 3}, page_content='Sara')]
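
The metadata printed above contains only `source` and `seq_num`. One way to carry more fields is `JSONLoader`'s `metadata_func` parameter, which receives each selected record together with the default metadata dict. The sketch below is an assumption-laden example rather than the notebook's own method: `member_metadata` and `build_member_loader` are hypothetical names, and the choice of fields is arbitrary.

```python
from typing import Any, Dict


def member_metadata(record: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Copy selected member fields into the loader's default metadata."""
    metadata["id"] = record.get("id")
    metadata["role"] = record.get("role")
    # Join the list into a string; some vector stores only accept scalar metadata values.
    metadata["skills"] = ", ".join(record.get("skills", []))
    return metadata


def build_member_loader(path: str):
    """Hypothetical helper: the same loader as above, plus metadata_func."""
    # Imported lazily so member_metadata can be tried without LangChain installed.
    from langchain_community.document_loaders import JSONLoader

    return JSONLoader(
        file_path=path,
        jq_schema=".company.departments[].teams[].members[]",
        content_key="name",
        metadata_func=member_metadata,
    )


# Quick check of the metadata function on one record, without touching the loader.
example = member_metadata(
    {"id": 1, "name": "Aisha", "role": "Backend Developer", "skills": ["Python", "Django"]},
    {"source": "company_data.json", "seq_num": 1},
)
print(example)
```

Calling `build_member_loader("data/json_files/company_data.json").load()` should then return documents whose metadata includes the extra keys, assuming `langchain_community` and `jq` are installed.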


In [23]:
## ✅ Method 2: Custom JSON Processing (Full Control)

import json
from langchain_core.documents import Document

with open("data/json_files/company_data.json", "r", encoding="utf-8") as f:
    data = json.load(f)

documents = []

for dept in data["company"]["departments"]:
    for team in dept["teams"]:
        for member in team["members"]:
            
            # Build custom text content (you control format)
            content = f"""
            Name: {member['name']}
            Role: {member['role']}
            Department: {dept['dept_name']}
            Team: {team['team_name']}
            """

            # Custom metadata (structured, searchable)
            metadata = {
                "id": member["id"],
                "department": dept["dept_name"],
                "team": team["team_name"],
                "source": "company_data.json"
            }

            documents.append(
                Document(
                    page_content=content.strip(),
                    metadata=metadata
                )
            )

print(f"Created {len(documents)} custom JSON documents")
print(documents[0])


Created 3 custom JSON documents
page_content='Name: Aisha
            Role: Backend Developer
            Department: Engineering
            Team: Backend' metadata={'id': 1, 'department': 'Engineering', 'team': 'Backend', 'source': 'company_data.json'}
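
The output above keeps the leading spaces from the indented triple-quoted f-string, and the loop does not use `skills` or `projects`. A possible variation, sketched here under the assumption that those fields are worth including, builds the text from a list of lines and tolerates members without projects. The helper name `build_member_record` and the `num_projects` metadata key are illustrative choices.

```python
from typing import Any, Dict, Tuple


def build_member_record(
    dept: Dict[str, Any], team: Dict[str, Any], member: Dict[str, Any]
) -> Tuple[str, Dict[str, Any]]:
    """Return (page_content, metadata) for one member, tolerating missing fields."""
    skills = member.get("skills", [])
    projects = member.get("projects", [])  # e.g. the HR member has no projects

    lines = [
        f"Name: {member['name']}",
        f"Role: {member['role']}",
        f"Department: {dept['dept_name']}",
        f"Team: {team['team_name']}",
    ]
    if skills:
        lines.append("Skills: " + ", ".join(skills))
    for project in projects:
        lines.append(f"Project: {project['title']} ({project['status']})")

    metadata = {
        "id": member["id"],
        "department": dept["dept_name"],
        "team": team["team_name"],
        "num_projects": len(projects),
    }
    # Joining lines avoids the leading spaces an indented triple-quoted f-string keeps.
    return "\n".join(lines), metadata


content, metadata = build_member_record(
    {"dept_name": "HR"},
    {"team_name": "HR Team"},
    {"id": 3, "name": "Sara", "role": "HR Executive",
     "skills": ["Recruitment", "Payroll", "Training"]},
)
print(content)
print(metadata)
```

The returned pair can be passed straight to `Document(page_content=content, metadata=metadata)` inside the same nested loop.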
