Performance Tweak: BatchDocumentBuilder: Unnecessary memory allocations in size estimation #34

# BatchDocumentBuilder: Unnecessary memory allocations in size estimation

## Summary

`estimateDocumentSize()` allocates byte arrays for every field name and text value just to measure UTF-8 length. These allocations immediately become garbage, causing GC pressure during high-throughput indexing.

## Problem

In `BatchDocumentBuilder.java`, the size estimation methods use `getBytes(UTF_8).length`:

```java
// Line 446 - for every field name
size += 2 + fieldName.getBytes(StandardCharsets.UTF_8).length + 1 + 2;

// Line 466 - for every text/json value
return 4 + value.toString().getBytes(StandardCharsets.UTF_8).length;
```

Each `.getBytes(UTF_8)` call:
1. Allocates a new byte array
2. Encodes the entire string into UTF-8
3. Returns the array (which immediately becomes garbage)

## Impact

For a batch of 10,000 documents with 10 fields each (avg 100 chars per text value, 20 chars per field name), assuming mostly ASCII text at 1 byte per char:
- Text values: 10,000 × 10 × 100 = 10,000,000 bytes allocated
- Field names: 10,000 × 10 × 20 = 2,000,000 bytes allocated
- **~12 MB of garbage per batch** just for size estimation, spread across 200,000 short-lived arrays and not counting per-array object headers

The estimation is called from `addDocument()`, so this cost is paid for every document indexed.
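The arithmetic above can be reproduced with a short calculation. This is a rough sketch: the class name `GarbageEstimate` and its methods are hypothetical, and the 16-byte array header is an assumption (typical of a 64-bit HotSpot JVM with compressed references) rather than something measured on this code.

```java
public class GarbageEstimate {

    // Assumed per-array header size; varies by JVM and flags.
    public static final long ARRAY_HEADER_BYTES = 16;

    // Bytes of encoded payload, assuming 1 byte per char (ASCII text).
    public static long payloadBytes(long docs, long fieldsPerDoc,
                                    long avgValueChars, long avgNameChars) {
        return docs * fieldsPerDoc * (avgValueChars + avgNameChars);
    }

    // One temporary array for the field name and one for the value.
    public static long arrayCount(long docs, long fieldsPerDoc) {
        return docs * fieldsPerDoc * 2;
    }

    public static void main(String[] args) {
        long payload = payloadBytes(10_000, 10, 100, 20);
        long headers = arrayCount(10_000, 10) * ARRAY_HEADER_BYTES;
        System.out.println("payload bytes: " + payload);
        System.out.println("header bytes (assumed): " + headers);
    }
}
```

Under these assumptions the header overhead adds a few more megabytes on top of the payload, so the ~12 MB figure is best read as a lower bound.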

## Proposed Fix

Replace `getBytes(UTF_8).length` with a zero-allocation UTF-8 length calculation:

```java
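private static int utf8Length(String s) {
    int count = 0;
    final int len = s.length();
    for (int i = 0; i < len; i++) {
        char c = s.charAt(i);
        if (c <= 0x7F) {
            count++;            // ASCII: 1 byte
        } else if (c <= 0x7FF) {
            count += 2;         // U+0080..U+07FF: 2 bytes
        } else if (Character.isHighSurrogate(c)
                && i + 1 < len
                && Character.isLowSurrogate(s.charAt(i + 1))) {
            count += 4;         // valid surrogate pair: one 4-byte code point
            i++;                // skip the low surrogate
        } else if (Character.isSurrogate(c)) {
            count++;            // unpaired surrogate: getBytes(UTF_8) emits '?'
        } else {
            count += 3;         // rest of the BMP: 3 bytes
        }
    }
    return count;
}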
```
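One way to gain confidence in a counter like this is to compare it against `getBytes(UTF_8).length` on inputs that exercise each branch. The following is a minimal sketch, not part of the proposed patch: the class name `Utf8LengthCheck` and the helper `matches` are hypothetical, and the copy of the loop here counts an unpaired surrogate as 1 byte, because `String.getBytes` replaces malformed input with `?`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthCheck {

    // Allocation-free UTF-8 length; unpaired surrogates count as 1 byte.
    public static int utf8Length(String s) {
        int count = 0;
        int len = s.length();
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count++;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c) && i + 1 < len
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4;
                i++;
            } else if (Character.isSurrogate(c)) {
                count++;
            } else {
                count += 3;
            }
        }
        return count;
    }

    // True when the counter agrees with the JDK encoder for this input.
    public static boolean matches(String s) {
        return utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "",               // empty string
            "title",          // ASCII only
            "caf\u00e9",      // contains a 2-byte character
            "\u20ac100",      // contains a 3-byte character
            "\uD83D\uDE00",   // one surrogate pair, 4 bytes
            "bad\uD83D"       // trailing unpaired high surrogate
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, match=" + matches(s));
        }
    }
}
```

A check along these lines could be turned into a unit test so that the estimate and the actual serialized size cannot drift apart unnoticed.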

Then update the two call sites:
```java
// estimateDocumentSize()
size += 2 + utf8Length(fieldName) + 1 + 2;

// estimateValueSize()
return 4 + utf8Length(value.toString());
```

## Additional Issue

`getEstimatedSize()` is also missing the offset table overhead:

```java
// Current - missing offset table
return estimatedSize + 16;

// Should be
return estimatedSize + 16 + (documents.size() * 4);
```

As a result, the estimate is low by `docCount × 4` bytes (40 KB for 10,000 docs).
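The size of the gap can be illustrated with a small stand-in for the builder. This is only a sketch: `SizeEstimateSketch`, its fields, and its method names are hypothetical, and only the constants 16 and 4 are taken from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeEstimateSketch {

    private static final int FIXED_OVERHEAD_BYTES = 16; // constant from the issue
    private static final int OFFSET_ENTRY_BYTES = 4;    // one offset table entry per document

    // Stand-in for the real document list; stores only per-document estimates.
    private final List<Integer> documents = new ArrayList<>();
    private int estimatedSize = 0;

    public void addDocument(int estimatedDocBytes) {
        documents.add(estimatedDocBytes);
        estimatedSize += estimatedDocBytes;
    }

    // Mirrors the current behaviour: no offset table term.
    public int currentEstimate() {
        return estimatedSize + FIXED_OVERHEAD_BYTES;
    }

    // Mirrors the proposed behaviour: adds 4 bytes per document.
    public int correctedEstimate() {
        return estimatedSize + FIXED_OVERHEAD_BYTES
                + documents.size() * OFFSET_ENTRY_BYTES;
    }

    public static void main(String[] args) {
        SizeEstimateSketch sketch = new SizeEstimateSketch();
        for (int i = 0; i < 10_000; i++) {
            sketch.addDocument(100);
        }
        System.out.println("gap in bytes: "
                + (sketch.correctedEstimate() - sketch.currentEstimate()));
    }
}
```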

## Files Affected

- `src/main/java/io/indextables/tantivy4java/batch/BatchDocumentBuilder.java`

