Open
Labels
enhancement (New feature or request)
Description
BatchDocumentBuilder: Unnecessary memory allocations in size estimation
Summary
estimateDocumentSize() allocates byte arrays for every field name and text value just to measure UTF-8 length. These allocations immediately become garbage, causing GC pressure during high-throughput indexing.
Problem
In BatchDocumentBuilder.java, the size estimation methods use getBytes(UTF_8).length:
// Line 446 - for every field name
size += 2 + fieldName.getBytes(StandardCharsets.UTF_8).length + 1 + 2;
// Line 466 - for every text/json value
return 4 + value.toString().getBytes(StandardCharsets.UTF_8).length;
Each .getBytes(UTF_8) call:
- Allocates a new byte array
- Encodes the entire string into UTF-8
- Returns the array (which immediately becomes garbage)
Impact
For a batch of 10,000 documents with 10 fields each (avg 100 chars per text value, 20 chars per field name), assuming mostly ASCII text (1 byte per char):
- Text values: 10,000 × 10 × 100 = 10,000,000 bytes allocated
- Field names: 10,000 × 10 × 20 = 2,000,000 bytes allocated
- ~12 MB of garbage per batch, spread over 200,000 short-lived arrays, just for size estimation
The estimation runs from addDocument(), so this cost is paid for every document indexed.
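The arithmetic above can be reproduced with a small sketch. The class and method names here are hypothetical and not part of BatchDocumentBuilder; the figures assume ASCII-only text and count only array payload, not per-array object headers.

```java
public class EstimationGarbage {
    // Payload bytes allocated by getBytes(UTF_8) across a batch,
    // assuming ASCII text (1 byte per char). Hypothetical helper.
    static long garbageBytes(long docs, long fieldsPerDoc,
                             long avgNameChars, long avgValueChars) {
        return docs * fieldsPerDoc * (avgNameChars + avgValueChars);
    }

    // Temporary byte arrays created: one per field name plus one per text value.
    static long temporaryArrays(long docs, long fieldsPerDoc) {
        return docs * fieldsPerDoc * 2;
    }

    public static void main(String[] args) {
        System.out.println(garbageBytes(10_000, 10, 20, 100)); // 12000000
        System.out.println(temporaryArrays(10_000, 10));       // 200000
    }
}
```

Non-ASCII text would push the byte total higher, since such characters encode to 2 to 4 bytes each.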
Proposed Fix
Replace getBytes(UTF_8).length with a zero-allocation UTF-8 length calculation:
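private static int utf8Length(String s) {
    int count = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c <= 0x7F) {
            count++;        // ASCII: 1 byte
        } else if (c <= 0x7FF) {
            count += 2;     // U+0080..U+07FF: 2 bytes
        } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            count += 4;     // valid surrogate pair: one 4-byte code point
            i++;            // skip the low surrogate
        } else if (Character.isSurrogate(c)) {
            count++;        // unpaired surrogate: getBytes(UTF_8) emits '?' (1 byte)
        } else {
            count += 3;     // rest of the BMP: 3 bytes
        }
    }
    return count;
}
Then update: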
// estimateDocumentSize()
size += 2 + utf8Length(fieldName) + 1 + 2;
// estimateValueSize()
return 4 + utf8Length(value.toString());
Additional Issue
getEstimatedSize() is also missing the offset table overhead:
// Current - missing offset table
return estimatedSize + 16;
// Should be
return estimatedSize + 16 + (documents.size() * 4);
This causes underestimation by docCount × 4 bytes (40 KB for 10,000 docs).
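Before swapping a loop-based count in for getBytes(UTF_8).length, it may be worth checking that the two agree. The harness below is a minimal sketch: the class name Utf8LengthCheck and the sample strings are illustrative assumptions, and the helper is a variant of the proposed utf8Length() that counts an unpaired surrogate as 1 byte, since String.getBytes(UTF_8) replaces malformed input with '?'.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthCheck {
    // Allocation-free UTF-8 length, intended to match getBytes(UTF_8).length.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count++;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // surrogate pair encodes as one 4-byte sequence
                i++;
            } else if (Character.isSurrogate(c)) {
                count++;    // unpaired surrogate is replaced by '?'
            } else {
                count += 3;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only.
        String[] samples = {
            "", "title", "caf\u00e9", "\u65e5\u672c\u8a9e",
            "a\uD83D\uDE00b", "\uD83D", "x\uDE00"
        };
        for (String s : samples) {
            int expected = s.getBytes(StandardCharsets.UTF_8).length;
            int actual = utf8Length(s);
            if (expected != actual) {
                throw new AssertionError("mismatch: expected " + expected + ", got " + actual);
            }
        }
        System.out.println("all samples match");
    }
}
```

A randomized comparison over arbitrary char sequences would give broader coverage than these fixed samples.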
Files Affected
src/main/java/io/indextables/tantivy4java/batch/BatchDocumentBuilder.java