Performance Tweak: BatchDocumentBuilder: Unnecessary memory allocations in size estimation #34

@schenksj

BatchDocumentBuilder: Unnecessary memory allocations in size estimation

Summary

estimateDocumentSize() allocates byte arrays for every field name and text value just to measure UTF-8 length. These allocations immediately become garbage, causing GC pressure during high-throughput indexing.

Problem

In BatchDocumentBuilder.java, the size estimation methods use getBytes(UTF_8).length:

// Line 446 - for every field name
size += 2 + fieldName.getBytes(StandardCharsets.UTF_8).length + 1 + 2;

// Line 466 - for every text/json value
return 4 + value.toString().getBytes(StandardCharsets.UTF_8).length;

Each .getBytes(UTF_8) call:

  1. Allocates a new byte array
  2. Encodes the entire string into UTF-8
  3. Returns the array (which immediately becomes garbage)

Impact

For a batch of 10,000 documents with 10 fields each (avg 100 chars per text value, 20 chars per field name), assuming mostly ASCII text at one byte per char and not counting per-array object headers:

  • Text values: 10,000 × 10 × 100 = 10,000,000 bytes allocated
  • Field names: 10,000 × 10 × 20 = 2,000,000 bytes allocated
  • ~12 MB of garbage per batch just for size estimation
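
The arithmetic above can be reproduced with a short sketch. The class and method names here are illustrative only and are not part of BatchDocumentBuilder; the calculation assumes one byte per char and ignores array headers, so real allocation would be somewhat higher.

```java
// Illustrative helper (hypothetical name), not part of the library.
class AllocationEstimate {
    /**
     * Payload bytes allocated by getBytes(UTF_8) across one batch,
     * assuming one byte per char (ASCII) and ignoring array headers.
     */
    static long garbageBytes(long docs, long fieldsPerDoc,
                             long avgValueChars, long avgNameChars) {
        long valueBytes = docs * fieldsPerDoc * avgValueChars; // text values
        long nameBytes = docs * fieldsPerDoc * avgNameChars;   // field names
        return valueBytes + nameBytes;
    }

    public static void main(String[] args) {
        long total = garbageBytes(10_000, 10, 100, 20);
        System.out.println(total + " bytes"); // prints "12000000 bytes"
    }
}
```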

The size estimation runs inside addDocument(), so these allocations occur for every document indexed.

Proposed Fix

Replace getBytes(UTF_8).length with a zero-allocation UTF-8 length calculation:

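// Counts the UTF-8 encoded length of s without allocating a byte array.
private static int utf8Length(String s) {
    int count = 0;
    for (int i = 0, n = s.length(); i < n; i++) {
        char c = s.charAt(i);
        if (c <= 0x7F) {
            count++;          // ASCII: 1 byte
        } else if (c <= 0x7FF) {
            count += 2;       // 2-byte sequence
        } else if (Character.isHighSurrogate(c)
                && i + 1 < n
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            count += 4;       // valid surrogate pair: one 4-byte code point
            i++;              // skip the low surrogate
        } else if (Character.isSurrogate(c)) {
            count++;          // unpaired surrogate: getBytes(UTF_8) emits '?'
        } else {
            count += 3;       // rest of the BMP: 3 bytes
        }
    }
    return count;
}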
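
One way to gain confidence in a counter like this is to compare it against getBytes(UTF_8).length on representative strings. The following is a minimal, self-contained check, not code from the repository: the class name Utf8LengthCheck and the sample strings are illustrative, and the copy of utf8Length included here counts an unpaired surrogate as one byte, which matches the '?' replacement that String.getBytes substitutes for malformed input.

```java
import java.nio.charset.StandardCharsets;

// Illustrative check (hypothetical class name), not part of the library.
class Utf8LengthCheck {
    // Allocation-free UTF-8 length; unpaired surrogates count as 1 byte ('?').
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0, n = s.length(); i < n; i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count++;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < n
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4;
                i++; // skip the low surrogate of the pair
            } else if (Character.isSurrogate(c)) {
                count++;
            } else {
                count += 3;
            }
        }
        return count;
    }

    // True when the counter agrees with the allocating reference implementation.
    static boolean matchesGetBytes(String s) {
        return utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "",                      // empty
            "title",                 // ASCII
            "caf\u00E9",             // 2-byte char
            "\u65E5\u672C\u8A9E",    // 3-byte chars
            "\uD83D\uDE00",          // surrogate pair (4 bytes)
            "\uD83D",                // unpaired high surrogate
            "a\uDE00b"               // unpaired low surrogate
        };
        for (String s : samples) {
            if (!matchesGetBytes(s)) {
                throw new AssertionError("mismatch for sample of length " + s.length());
            }
        }
        System.out.println("all samples match");
    }
}
```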

Then update:

// estimateDocumentSize()
size += 2 + utf8Length(fieldName) + 1 + 2;

// estimateValueSize()
return 4 + utf8Length(value.toString());

Additional Issue

getEstimatedSize() is also missing the offset table overhead:

// Current - missing offset table
return estimatedSize + 16;

// Should be
return estimatedSize + 16 + (documents.size() * 4);

This causes underestimation by docCount × 4 bytes (40,000 bytes, about 40 KB, for 10,000 docs).
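
The size of the gap can be illustrated with a small sketch. The class, constants, and method names below are hypothetical; only the 16-byte constant and the 4 bytes per document come from the snippets above, and the payload figure is arbitrary.

```java
// Illustrative comparison (hypothetical names), not the library's code.
class OffsetTableOverhead {
    static final int FIXED_OVERHEAD_BYTES = 16; // constant already in the current code
    static final int OFFSET_ENTRY_BYTES = 4;    // per-document offset table entry

    // Mirrors the current calculation: no offset table.
    static int currentEstimate(int estimatedSize) {
        return estimatedSize + FIXED_OVERHEAD_BYTES;
    }

    // Mirrors the proposed calculation: adds 4 bytes per document.
    static int correctedEstimate(int estimatedSize, int docCount) {
        return estimatedSize + FIXED_OVERHEAD_BYTES + docCount * OFFSET_ENTRY_BYTES;
    }

    public static void main(String[] args) {
        int estimatedSize = 1_000_000; // arbitrary payload estimate for illustration
        int docCount = 10_000;
        int gap = correctedEstimate(estimatedSize, docCount)
                - currentEstimate(estimatedSize);
        System.out.println(gap + " bytes underestimated"); // prints "40000 bytes underestimated"
    }
}
```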

Files Affected

  • src/main/java/io/indextables/tantivy4java/batch/BatchDocumentBuilder.java

Labels

  • enhancement
