Skip to content

[BUG] Pivot cache triggers Excel repair #42

@chilltemp

Description

@chilltemp

(Disclaimer, I used AI to figure this out and generate the bug report.)

Bug report: Pivot cache generated by @protobi/exceljs triggers Excel repair

Summary

When generating a workbook with a Pivot Table using @protobi/exceljs, Excel opens the file with:

Repaired Records: PivotTable report from /xl/pivotCache/pivotCacheDefinition1.xml part (PivotTable cache)

The same script and data generated with exceljs opens cleanly.

Environment

  • macOS
  • Node/TypeScript script
  • Workbook contains:
    • a data worksheet
    • a pivot worksheet

Repro

  1. Generate workbook using exceljs and unzip output to temp/excel-orig.
  2. Generate workbook using @protobi/exceljs and unzip output to temp/excel-protobi.
  3. Open both .xlsx files in desktop Excel.

Expected

Both files open without repair prompts.

Actual

Only the @protobi/exceljs output triggers a repair for pivotCacheDefinition1.xml.

Findings from XML comparison

The pivot table definition is essentially the same between outputs (xl/pivotTables/pivotTable1.xml).

The key difference is pivot cache serialization:

  • exceljs writes many non-pivot cache fields in a lightweight form in pivotCacheDefinition1.xml, e.g.:
    • <sharedItems containsSemiMixedTypes="0" containsString="0" containsNumber="1" containsInteger="1" />
  • @protobi/exceljs writes fully enumerated <sharedItems count="..."> ... </sharedItems> for many fields, including very large text fields (Description / File Names / Dependency Roots).

Also:

  • exceljs cache records are mostly inline strings (<s v="..."/>) in pivotCacheRecords1.xml.
  • @protobi/exceljs cache records are index-based (<x v="..."/>) and rely on huge shared item tables.

This expanded shared-items representation appears to be what causes Excel to repair the pivot cache definition.

Why this seems to be a library issue

  • Input data and pivot configuration are the same.
  • pivotTable1.xml is comparable across both outputs.
  • Repair points specifically to pivot cache definition.
  • Reducing non-used fields in sharedItems to lightweight forms avoids repair in downstream normalization.

Minimal standalone repro script

This is a minimal script pattern (no project-specific data required):

import * as ExcelJS from '@protobi/exceljs';

async function main() {
  const workbook = new ExcelJS.Workbook();

  const dataSheet = workbook.addWorksheet('Data');
  dataSheet.columns = [
    { header: 'Repo', key: 'repo', width: 20 },
    { header: 'Group', key: 'group', width: 20 },
    { header: 'Severity', key: 'severity', width: 20 },
    { header: 'Description', key: 'description', width: 80 },
    { header: 'Counter', key: 'count', width: 10 },
  ];

  for (let index = 0; index < 1500; index++) {
    dataSheet.addRow({
      repo: `repo-${index % 50}`,
      group: `group-${index % 100}`,
      severity: ['CRITICAL', 'HIGH', 'MEDIUM', 'LOW', 'INFO'][index % 5],
      description:
        `long-text-${index}-` +
        'x'.repeat(500 + (index % 200)),
      count: 1,
    });
  }

  const pivotSheet = workbook.addWorksheet('Summary');
  // eslint-disable-next-line @typescript-eslint/no-explicit-any
  (pivotSheet as any).addPivotTable({
    sourceSheet: dataSheet,
    rows: ['Repo', 'Group'],
    columns: ['Severity'],
    values: ['Counter'],
    metric: 'sum',
  });

  await workbook.xlsx.writeFile('repro-protobi.xlsx');
}

main().catch((error) => {
  console.error(error);
  process.exit(1);
});

Open repro-protobi.xlsx in desktop Excel and inspect whether a repair prompt references pivotCacheDefinition.

Patch currently used as workaround

After writing the workbook to disk, we patch the generated .xlsx zip and normalize pivot cache definitions.

import AdmZip from 'adm-zip';

export const normalizeProtobiPivotCache = (filePath: string) => {
  const zip = new AdmZip(filePath);
  const entries = zip.getEntries();

  const usedFieldIndexes = new Set<number>();

  for (const entry of entries) {
    if (!/^xl\/pivotTables\/pivotTable\d+\.xml$/.test(entry.entryName)) {
      continue;
    }

    const pivotXml = entry.getData().toString('utf8');
    const fieldRefRx = /<(?:field\s+x|dataField\s+fld)="(\d+)"/g;
    let fieldRefMatch;
    while ((fieldRefMatch = fieldRefRx.exec(pivotXml))) {
      usedFieldIndexes.add(Number(fieldRefMatch[1]));
    }
  }

  let modified = false;

  for (const entry of entries) {
    if (
      !/^xl\/pivotCache\/pivotCacheDefinition\d+\.xml$/.test(entry.entryName)
    ) {
      continue;
    }

    let cacheXml = entry.getData().toString('utf8');

    let fieldIndex = -1;
    const updated = cacheXml.replace(
      /<cacheField([^>]*)>([\s\S]*?)<\/cacheField>/g,
      (full, attrs, body) => {
        fieldIndex += 1;

        if (usedFieldIndexes.has(fieldIndex)) {
          const normalizedBody = body.replace(
            /<sharedItems([^>]*)>([\s\S]*?)<\/sharedItems>/g,
            (_siFull: string, siAttrs: string, siBody: string): string => {
              const itemCount =
                (siBody.match(/<(?:s|n|b|d|e|m)(?:\s|\/>)/g) || []).length ||
                (siBody.match(/<(?:s|n|b|d|e|m)\b/g) || []).length;
              const attrsWithoutCount = siAttrs
                .replace(/\s*count="[^"]*"/g, '')
                .trim();
              const nextAttrs = attrsWithoutCount
                ? ` ${attrsWithoutCount} count="${itemCount}"`
                : ` count="${itemCount}"`;
              return `<sharedItems${nextAttrs}>${siBody}</sharedItems>`;
            }
          );

          return `<cacheField${attrs}>${normalizedBody}</cacheField>`;
        }

        const sharedItemsMatch = body.match(
          /<sharedItems([^>]*)>([\s\S]*?)<\/sharedItems>/
        );

        if (!sharedItemsMatch) {
          return full;
        }

        const sharedItemsBody = sharedItemsMatch[2] || '';
        const hasBlank =
          sharedItemsBody.includes('<s v="" />') ||
          sharedItemsBody.includes('<s v=""/>');
        const hasLongText = sharedItemsBody.length > 5000;

        let newSharedItems = '<sharedItems/>';
        if (hasLongText) {
          newSharedItems = '<sharedItems longText="1"/>';
        } else if (hasBlank) {
          newSharedItems = '<sharedItems containsBlank="1"/>';
        }

        const normalizedBody = body.replace(
          /<sharedItems[^>]*>[\s\S]*?<\/sharedItems>/,
          newSharedItems
        );
        return `<cacheField${attrs}>${normalizedBody}</cacheField>`;
      }
    );

    if (updated !== cacheXml) {
      modified = true;
      cacheXml = updated;
      zip.updateFile(entry, Buffer.from(cacheXml, 'utf8'));
    }
  }

  if (modified) {
    zip.writeZip(filePath);
  }
};

Suggested fix direction

When writing pivot cache definitions:

  • Prefer lightweight <sharedItems .../> for non-pivoted / high-cardinality / long-text fields.
  • Avoid fully materializing massive shared item lists when not needed for row/column/value axes.
  • Ensure generated cache definition is compatible with strict Excel parser behavior.

Notes on attachments

I cannot include sample workbook files in the report. If needed, I can share redacted XML snippets from the relevant pivot cache parts.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions