## Task Summary

Create a Jupyter notebook that merges two JSONL (JSON Lines) files based on a common 'id' key field.



# JSONL File Merger

This notebook merges two JSONL files based on the 'id' key. It will:
1. Load both JSONL files
2. Parse JSON objects from each line
3. Create dictionaries indexed by 'id'
4. Merge the data based on matching IDs
5. Save the merged result to a new JSONL file
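
Steps 3 and 4 can be sketched in memory before touching any files. This is only an illustration under assumed data: `records_a` and `records_b` are made-up sample records, not taken from the benchmark files, and the field-level merge shown here corresponds to the 'combine' behaviour.

```python
# Hypothetical sample records standing in for the parsed lines of two JSONL files
records_a = [{"id": "a1", "title": "Alpha"}, {"id": "a2", "title": "Beta"}]
records_b = [{"id": "a2", "score": 0.9}, {"id": "a3", "title": "Gamma"}]

# Step 3: index each list by 'id'
by_id_a = {record["id"]: record for record in records_a}
by_id_b = {record["id"]: record for record in records_b}

# Step 4: merge field by field over the union of ids;
# values from the second list win when both define the same field
merged = {}
for key in by_id_a.keys() | by_id_b.keys():
    merged[key] = {**by_id_a.get(key, {}), **by_id_b.get(key, {})}

print(len(merged), merged["a2"])
```

Only the record with id `a2` appears in both lists, so it is the only one that picks up fields from both sides.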

In [1]:
import json
import os

In [2]:
def load_jsonl(file_path):
    """Load a JSONL file and return a list of parsed objects, skipping blank lines"""
    data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                data.append(json.loads(line))
    return data
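
If one of the input files contains a malformed line, `json.loads` raises an error that does not say which file or line was at fault. A possible variant that reports both is sketched below; `load_jsonl_checked` is a hypothetical name, and the temporary files exist only to demonstrate the behaviour.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl_checked(file_path):
    """Like load_jsonl, but names the file and line number of a malformed line."""
    data = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines, as load_jsonl does
            try:
                data.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{file_path}: invalid JSON on line {line_no}: {exc.msg}"
                ) from exc
    return data


# Demonstration on throwaway files (hypothetical contents)
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.jsonl"
    good.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    loaded = load_jsonl_checked(good)

    broken = Path(tmp) / "broken.jsonl"
    broken.write_text('{"id": 1}\n{"id": \n', encoding="utf-8")
    try:
        load_jsonl_checked(broken)
        error_message = None
    except ValueError as err:
        error_message = str(err)

print(loaded)
print(error_message)
```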

In [3]:
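def merge_jsonl_files(file1_path, file2_path, output_path, merge_strategy='update'):
    """
    Merge two JSONL files on the 'id' key and write the result to output_path.

    Records without an 'id' are dropped. If an id repeats within one file,
    the last record with that id wins.

    merge_strategy options:
    - 'update': for an id present in both files, the file2 record replaces
      the file1 record as a whole
    - 'combine': for an id present in both files, fields are merged;
      file2 values win where both records define the same field
    """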

    # Load both files
    data1 = load_jsonl(file1_path)
    data2 = load_jsonl(file2_path)

    # Create dictionaries indexed by id
    dict1 = {item['id']: item for item in data1 if 'id' in item}
    dict2 = {item['id']: item for item in data2 if 'id' in item}

    print(f"File 1: {len(dict1)} records with 'id'")
    print(f"File 2: {len(dict2)} records with 'id'")

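    # Merge based on strategy
    merged_data = {}

    if merge_strategy == 'update':
        # Whole-record replacement: file2 records win for shared ids
        merged_data = dict1.copy()
        merged_data.update(dict2)
    elif merge_strategy == 'combine':
        # Field-level merge: file2 values win for shared fields
        all_ids = set(dict1.keys()) | set(dict2.keys())
        for id_key in all_ids:
            merged_obj = {}
            if id_key in dict1:
                merged_obj.update(dict1[id_key])
            if id_key in dict2:
                merged_obj.update(dict2[id_key])
            merged_data[id_key] = merged_obj
    else:
        # Fail before writing, so a typo cannot produce an empty output file
        raise ValueError(f"Unknown merge_strategy: {merge_strategy!r}")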

    print(f"Merged: {len(merged_data)} records")

    # Save merged data
    with open(output_path, 'w', encoding='utf-8') as f:
        for record in merged_data.values():
            f.write(json.dumps(record, ensure_ascii=False) + '\n')

    print(f"Merged data saved to: {output_path}")
    return merged_data
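
The two strategies differ only for ids that occur in both files. The sketch below replays the same dictionary operations on a single made-up record pair (the field names `text` and `source` are illustrative assumptions) to show what each strategy keeps.

```python
# Hypothetical id-indexed records: the same id appears in both "files"
dict1 = {"x": {"id": "x", "text": "old", "source": "train"}}
dict2 = {"x": {"id": "x", "text": "new"}}

# 'update': the file2 record replaces the file1 record as a whole,
# so the 'source' field from file1 is lost
updated = dict1.copy()
updated.update(dict2)

# 'combine': fields are merged, file2 wins on the shared 'text' field,
# and 'source' from file1 survives
combined = {"x": {**dict1["x"], **dict2["x"]}}

print("update: ", updated["x"])
print("combine:", combined["x"])
```

When the two files carry different fields for the same id, 'combine' is the strategy that preserves both sets of fields.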

In [9]:
# Example usage - update these paths with your actual file paths
file1_path = "../benchmark/nq_new/corpus.train.top3.jsonl"  # Replace with your first JSONL file path
file2_path = "../benchmark/nq_new/corpus.dev.top3.jsonl"  # Replace with your second JSONL file path
output_path = "../benchmark/nq_new/corpus.top3.jsonl"

# Only run the merge when both input files exist
if os.path.exists(file1_path) and os.path.exists(file2_path):
    merged_data = merge_jsonl_files(file1_path, file2_path, output_path, merge_strategy='combine')
else:
    print("Input files not found; update file1_path and file2_path to point to your JSONL files")

File 1: 221675 records with 'id'
File 2: 9058 records with 'id'
Merged: 225669 records
Merged data saved to: ../benchmark/nq_new/corpus.top3.jsonl
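
As a quick consistency check on the printed counts: an id that occurs in both files is counted once in the merged result, so the size of the overlap can be recovered from the three numbers above. The variable names below are illustrative.

```python
# Counts copied from the output above
file1_ids = 221675
file2_ids = 9058
merged_ids = 225669

# Inclusion-exclusion: size of the union = size of A + size of B - size of the overlap
shared_ids = file1_ids + file2_ids - merged_ids
print(f"IDs present in both files: {shared_ids}")
```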


In [10]:
def analyze_merge_results(file1_path, file2_path, merged_data):
    """Analyze the merge results and show statistics"""

    data1 = load_jsonl(file1_path) if os.path.exists(file1_path) else []
    data2 = load_jsonl(file2_path) if os.path.exists(file2_path) else []

    dict1 = {item['id']: item for item in data1 if 'id' in item}
    dict2 = {item['id']: item for item in data2 if 'id' in item}

    # dict key views support set operations directly
    common_ids = dict1.keys() & dict2.keys()
    only_in_file1 = dict1.keys() - dict2.keys()
    only_in_file2 = dict2.keys() - dict1.keys()

    print("Merge Analysis:")
    print(f"Common IDs (in both files): {len(common_ids)}")
    print(f"IDs only in file 1: {len(only_in_file1)}")
    print(f"IDs only in file 2: {len(only_in_file2)}")
    print(f"Total unique IDs: {len(merged_data)}")

    if len(common_ids) > 0:
        print(f"\nExample common IDs: {list(common_ids)[:5]}")


In [11]:
analyze_merge_results(file1_path, file2_path, merged_data)

Merge Analysis:
Common IDs (in both files): 5064
IDs only in file 1: 216611
IDs only in file 2: 3994
Total unique IDs: 225669

Example common IDs: ['838179391_5114-5371', '865846375_381-1468', '819463567_845-1452', '811129920_458-1708', '836158297_3788-4958']
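
With the 'combine' strategy, the common IDs are the only place where file 2 can overwrite fields from file 1. A minimal sketch for listing the affected fields follows, assuming two id-indexed dictionaries shaped like `dict1` and `dict2` above; `find_conflicts` and the sample records are hypothetical.

```python
def find_conflicts(dict1, dict2):
    """Return {id: [field, ...]} for shared ids whose shared fields differ."""
    conflicts = {}
    for id_key in dict1.keys() & dict2.keys():
        rec1, rec2 = dict1[id_key], dict2[id_key]
        # Only fields present in both records can be overwritten
        differing = sorted(
            field for field in rec1.keys() & rec2.keys()
            if rec1[field] != rec2[field]
        )
        if differing:
            conflicts[id_key] = differing
    return conflicts


# Hypothetical sample data
sample1 = {
    "a": {"id": "a", "text": "first", "rank": 1},
    "b": {"id": "b", "text": "same"},
}
sample2 = {
    "a": {"id": "a", "text": "second", "rank": 1},
    "b": {"id": "b", "text": "same", "extra": True},
    "c": {"id": "c", "text": "only in file 2"},
}

conflicts = find_conflicts(sample1, sample2)
print(conflicts)
```

An empty result would mean that the shared records agree on every field they both define, in which case 'combine' loses no information.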
