fix(#25764): Implement UTF-8 encoding standardization for CSV import/export #27409

Open
Darshan3690 wants to merge 6 commits into open-metadata:main from
Darshan3690:fix/25764-utf8-csv-import-export

Conversation

@Darshan3690 (Contributor) commented Apr 16, 2026

PR Summary: Fix Chinese Character Garbling in CSV Import/Export (#25764)

Issue

Chinese and other non-ASCII characters were getting garbled during CSV import and export flows.

Root Cause

Encoding was not consistently enforced across the full pipeline:

  • UI upload and API request headers
  • Backend CSV endpoint media types
  • CSV import parsing for BOM content
  • CSV download behavior for Excel compatibility

What Changed

Backend updates

  • Added UTF-8 BOM helper support in CsvUtil
  • Added BOM stripping in shared import flow in EntityResource so all CSV imports normalize input safely
  • Standardized CSV import/export resource endpoints to explicitly use UTF-8 charset on text/plain CSV payloads
  • Added backend unit tests for:
    • Chinese character preservation in generated CSV
    • BOM stripping behavior for BOM, non-BOM, empty, and null inputs
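The backend helper described above lives in CsvUtil.java; as an illustration only, its BOM-stripping logic can be sketched in TypeScript like this (the function name mirrors the Java helper, but this is not the actual implementation):

```typescript
// Illustrative sketch of the CsvUtil.stripUtf8Bom logic (real code is Java).
const UTF8_BOM = '\uFEFF';

function stripUtf8Bom(value: string | null): string | null {
  // Null and empty inputs pass through unchanged.
  if (!value) {
    return value;
  }
  // Remove a single leading BOM character, if present.
  return value.startsWith(UTF8_BOM) ? value.slice(1) : value;
}
```

This covers the same four cases the unit tests exercise: BOM, non-BOM, empty, and null inputs.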

Frontend updates

  • Standardized CSV import request headers to include charset UTF-8
  • Updated file upload reading to explicitly decode as UTF-8
  • Updated CSV download logic to:
    • prepend BOM for CSV exports
    • avoid duplicate BOM when content already includes BOM
    • preserve non-CSV behavior unchanged
  • Added and updated Jest coverage for:
    • UTF-8 request header assertions
    • CSV BOM prepend behavior
    • duplicate BOM prevention
    • non-CSV behavior
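The three download rules above (prepend BOM, avoid a duplicate BOM, leave non-CSV content alone) can be sketched as follows; the function and variable names are hypothetical, and the real logic lives in ExportUtils.ts:

```typescript
const UTF8_BOM = '\uFEFF';

// Hypothetical sketch of the CSV download rules: prepend a BOM for CSV files,
// skip it when the content already starts with one, pass non-CSV through.
function prepareDownloadContent(content: string, fileName: string): string {
  const isCsv = fileName.toLowerCase().endsWith('.csv');
  if (!isCsv) {
    return content; // non-CSV behavior unchanged
  }
  return content.startsWith(UTF8_BOM) ? content : UTF8_BOM + content;
}
```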

Playwright updates

  • Added Chinese data in glossary import/export E2E flow
  • Added assertion to verify Chinese term visibility after import
  • Updated reference URL in test data to URI-safe encoded form to avoid URI validation flakiness
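As an illustration of the URI-safe encoding mentioned in the last bullet, the standard encodeURI function percent-encodes non-ASCII path characters while leaving RFC 3986 delimiters such as `://` intact:

```typescript
// Percent-encode the non-ASCII path segment so `format: uri` validators accept it.
const rawUrl = 'https://example.com/中文';
const safeUrl = encodeURI(rawUrl);
// safeUrl === 'https://example.com/%E4%B8%AD%E6%96%87'
```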

Cleanup updates

  • Removed accidental debug artifact file from PR
  • Added debug artifact path to gitignore to prevent future re-commit

Additional review follow-up fixes

Based on review comments, this PR also includes:

  • URI-safe encoded URL in Playwright glossary references field
  • Proper Blob global restoration in ExportUtils Jest tests to prevent mock leakage
  • Added Unicode round-trip import/export integration coverage in BaseEntityIT for entities that support CSV import/export

Validation status

Passed locally

  • openmetadata-service CsvUtilTest
  • UI importExportAPI test suite
  • UI ExportUtils test suite including BOM regression scenarios

Added coverage

  • Backend unit coverage for UTF-8/BOM behavior
  • UI unit coverage for header and download behavior
  • Playwright scenario with Chinese content
  • Integration-level Unicode CSV round-trip test added in BaseEntityIT

Environment note

Compiling and running the integration-test module requires local snapshot artifacts in this environment. The code updates are complete, but executing the full integration module depends on those snapshot dependencies being available in CI or in a fully bootstrapped local build.


Compatibility and risk

  • Backward compatible for existing clients
  • Plain text CSV clients continue to work
  • UTF-8 handling is now explicit and consistent
  • BOM handling is defensive and avoids double-BOM corruption

Impact

CSV import/export now reliably preserves Chinese and other Unicode characters across backend and frontend workflows, including Excel-friendly CSV download behavior.


fix(#25764): Implement UTF-8 encoding standardization for CSV import/export

## Overview
Resolve Chinese character garbling in CSV import/export workflows by implementing
end-to-end UTF-8 encoding standardization across backend REST endpoints and
frontend file handling.

## Root Causes Fixed
1. Missing charset=UTF-8 declarations on CSV transport layer (HTTP headers)
2. No UTF-8 BOM handling for Windows Excel compatibility
3. Inconsistent encoding across 9+ independent resource classes
4. Browser FileReader lacking explicit encoding specification
5. No UTF-8 BOM prepending in CSV downloads

## Changes Implemented

### Backend (11 files)
**CSV Utility (CsvUtil.java)**
- Added UTF8_BOM constant (\uFEFF)
- Added stripUtf8Bom(String value) utility method for safe BOM removal
- Handles null, empty string, and multi-byte character scenarios

**Shared Import Flow (EntityResource.java)**
- Import CsvUtil dependency
- Normalize CSV input by stripping BOM before repository parsing
- Applied to all entity types (Table, Glossary, Team, User, TestCase, etc.)

**REST Endpoints (9 resource files)**
- ColumnResource.java: Updated 3 @Produces/@Consumes annotations
- TableResource.java: Updated 4 annotations (export, async export, import, async import)
- UserResource.java: Updated 3 annotations
- TeamResource.java: Updated 4 annotations
- TestCaseResource.java: Updated 3 annotations
- GlossaryResource.java: Updated 4 annotations
- GlossaryTermResource.java: Updated 4 annotations
- LineageResource.java: Updated 1 annotation (export)
- All changed from TEXT_PLAIN → TEXT_PLAIN + "; charset=UTF-8"
Copilot AI review requested due to automatic review settings April 16, 2026 03:52
@Darshan3690 Darshan3690 requested a review from a team as a code owner April 16, 2026 03:52
@github-actions (Contributor)

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@Darshan3690 (Contributor, Author)

Hi @harshach, please add the safe to test label.

Copilot AI (Contributor) left a comment

Pull request overview

Implements end-to-end UTF-8 handling for CSV import/export to prevent non-ASCII (e.g., Chinese) character corruption by standardizing charset usage across UI requests, backend endpoints, and CSV parsing/downloading.

Changes:

  • Standardize CSV import request encoding (UI sends text/plain; charset=UTF-8; backend consumes/produces UTF-8 explicitly).
  • Add UTF-8 BOM handling (backend strips BOM on import; UI prepends BOM for CSV downloads for Excel compatibility).
  • Extend automated coverage (Java unit tests, Jest tests, and a Playwright E2E scenario with Chinese content).

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.

Summary per file:

  • openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.ts: Prepends BOM and enforces CSV MIME type for downloads to improve Excel UTF-8 handling.
  • openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.test.tsx: Updates/adds tests for BOM behavior on CSV vs non-CSV downloads.
  • openmetadata-ui/src/main/resources/ui/src/rest/teamsAPI.ts: Adds UTF-8 charset to CSV import request headers for team/user imports.
  • openmetadata-ui/src/main/resources/ui/src/rest/tableAPI.ts: Adds UTF-8 charset to CSV import request headers for table import.
  • openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.ts: Adds UTF-8 charset to CSV import request headers for multiple entity import APIs.
  • openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.test.ts: Updates assertions to validate UTF-8 charset headers in import requests.
  • openmetadata-ui/src/main/resources/ui/src/rest/databaseAPI.ts: Adds UTF-8 charset to CSV import request headers for database/schema imports.
  • openmetadata-ui/src/main/resources/ui/src/rest/columnAPI.ts: Adds UTF-8 charset to CSV import request headers for column CSV import APIs.
  • openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx: Forces FileReader.readAsText(..., 'utf-8') for CSV uploads.
  • openmetadata-ui/src/main/resources/ui/playwright/e2e/Pages/GlossaryImportExport.spec.ts: Adds Chinese glossary term data to validate E2E import/export behavior.
  • openmetadata-service/src/test/java/org/openmetadata/csv/CsvUtilTest.java: Adds unit tests for BOM stripping and Chinese character preservation in CSV formatting.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java: Adds UTF-8 charset to CSV import/export endpoint annotations for users.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java: Adds UTF-8 charset to CSV import/export endpoint annotations for teams.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java: Adds UTF-8 charset to lineage CSV export endpoint annotation.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java: Adds UTF-8 charset to glossary term CSV import/export endpoint annotations.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java: Adds UTF-8 charset to glossary CSV import/export endpoint annotations.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/dqtests/TestCaseResource.java: Adds UTF-8 charset to test case CSV import/export endpoint annotations.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java: Adds UTF-8 charset to table CSV import/export endpoint annotations.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/columns/ColumnResource.java: Adds UTF-8 charset to column CSV import endpoint annotations.
  • openmetadata-service/src/main/java/org/openmetadata/service/resources/EntityResource.java: Centralizes BOM stripping for entity CSV imports via CsvUtil.stripUtf8Bom(...).
  • openmetadata-service/src/main/java/org/openmetadata/csv/CsvUtil.java: Introduces UTF-8 BOM constant and helper to strip BOM from imported CSV strings.
Comments suppressed due to low confidence (6)

openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:49

  • setUploading(false) runs in the finally block immediately after readAsText(...) is initiated, but FileReader completes asynchronously. This means the loader state will be cleared before onload/onerror fires (and errors thrown inside reader.onerror won't be caught by this try/catch). Move the setUploading(false) into reader.onloadend (or onload/onerror) and surface errors via the callback/rejection rather than throwing a string in an async handler.
      setUploading(true);
      try {
        const reader = new FileReader();
        reader.onload = onCSVUploaded;
        reader.onerror = () => {
          throw t('server.unexpected-error');
        };
        reader.readAsText(options.file as Blob, 'utf-8');
      } catch (error) {
        showErrorToast(error as AxiosError);
      } finally {
        setUploading(false);
      }

openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java:755

  • The sync exportCsv(...) endpoint produces plain text CSV, but the @ApiResponse content is still declared as application/json. This makes the generated OpenAPI spec incorrect for clients. Update the response @Content(mediaType=...) to text/plain (or text/csv) to match what is actually returned.
  @GET
  @Path("/name/{name}/export")
  @Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
  @Valid
  @Operation(
      operationId = "exportTeams",
      summary = "Export teams in CSV format",
      responses = {
        @ApiResponse(
            responseCode = "200",
            description = "Exported csv with teams information",
            content =
                @Content(
                    mediaType = "application/json",
                    schema = @Schema(implementation = String.class)))

openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java:575

  • The sync exportCsv(...) endpoint returns CSV (String) and is annotated as @Produces(text/plain; charset=UTF-8), but the @ApiResponse still declares application/json. Adjust the documented response media type to text/plain (or text/csv) so generated clients don’t try to parse JSON.
  @GET
  @Path("/name/{name}/export")
  @Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
  @Valid
  @Operation(
      operationId = "exportGlossary",
      summary = "Export glossary in CSV format",
      responses = {
        @ApiResponse(
            responseCode = "200",
            description = "Exported csv with glossary terms",
            content =
                @Content(
                    mediaType = "application/json",
                    schema = @Schema(implementation = String.class)))

openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java:622

  • The sync exportCsv(...) endpoint returns plain-text CSV but its @ApiResponse still advertises application/json. This makes the OpenAPI spec inaccurate for CSV consumers. Update the documented response @Content(mediaType=...) to text/plain (or text/csv).
  @GET
  @Path("/name/{name}/export")
  @Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
  @Valid
  @Operation(
      operationId = "exportTable",
      summary = "Export table in CSV format",
      responses = {
        @ApiResponse(
            responseCode = "200",
            description = "Exported csv with columns from the table",
            content =
                @Content(
                    mediaType = "application/json",
                    schema = @Schema(implementation = String.class)))
      })

openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java:416

  • exportLineage(...) is annotated to produce plain text, but the OpenAPI @ApiResponse is documented as returning a SearchResponse JSON payload. Since the method returns a CSV String, update the documented response content/media type to text/plain (or text/csv) to avoid generating incorrect clients.
  @GET
  @Path("/export")
  @Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
  @Operation(
      operationId = "exportLineage",
      summary = "Export lineage",
      responses = {
        @ApiResponse(
            responseCode = "200",
            description = "search response",
            content =
                @Content(
                    mediaType = "application/json",
                    schema = @Schema(implementation = SearchResponse.class)))
      })

openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java:1701

  • exportUsersCsv(...) is annotated as producing plain text, but the OpenAPI @ApiResponse content is still declared as application/json. This makes the generated spec misleading for CSV consumers. Update the documented response @Content(mediaType=...) to text/plain (or text/csv) to match the actual response body.
  @GET
  @Path("/export")
  @Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
  @Valid
  @Operation(
      operationId = "exportUsers",
      summary = "Export users in a team in CSV format",
      responses = {
        @ApiResponse(
            responseCode = "200",
            description = "Exported csv with user information",
            content =
                @Content(
                    mediaType = "application/json",
                    schema = @Schema(implementation = String.class)))
      })


Copilot AI review requested due to automatic review settings April 16, 2026 15:57

@Darshan3690 (Contributor, Author)

Hi @harshach @PubChimps @pmbrull, please add the safe to test label.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:48

  • setUploading(false) runs in the finally block immediately after calling FileReader.readAsText(...), but FileReader is asynchronous. This makes the loader state inaccurate (it will flip back to false before onload/onerror fires). Move setUploading(false) into the onload and onerror handlers (and call options.onSuccess/onError if needed) so the UI reflects the actual read lifecycle.
        reader.readAsText(options.file as Blob, 'utf-8');
      } catch (error) {
        showErrorToast(error as AxiosError);
      } finally {
        setUploading(false);

displayName: '中文术语展示名',
description: '这是用于验证导入导出编码的中文描述。',
synonyms: '中文同义词;测试',
references: '参考;https://example.com/中文',
Copilot AI commented Apr 16, 2026

references includes a URL with raw non-ASCII characters (https://example.com/中文). In the GlossaryTerm schema, termReference.endpoint is format: uri, so validators may reject IRIs that are not RFC3986-encoded. To avoid a flaky/invalid test while still exercising Chinese text, keep Chinese in the reference name and percent-encode the URL path (or use an ASCII-only URL).

Suggested change
references: '参考;https://example.com/中文',
references: '参考;https://example.com/%E4%B8%AD%E6%96%87',

Copilot uses AI. Check for mistakes.
Comment on lines 95 to +104
it('uses the provided mimeType when creating the Blob', () => {
const mockBlob = {};
const MockBlob = jest.fn().mockReturnValue(mockBlob);
global.Blob = MockBlob as unknown as typeof Blob;

downloadFile('content', 'file.csv', 'text/csv;charset=utf-8;');

expect(MockBlob).toHaveBeenCalledWith(['content'], {
expect(MockBlob).toHaveBeenCalledWith(['\uFEFFcontent'], {
type: 'text/csv;charset=utf-8;',
});
Copilot AI commented Apr 16, 2026

These tests overwrite global.Blob but never restore it. jest.restoreAllMocks() won’t revert direct assignments, so the mocked Blob can leak into later tests/files and cause hard-to-debug failures. Capture the original global.Blob and restore it in afterEach (or use a spy/mocking approach that is automatically restored).

@harshach (Collaborator)

@Darshan3690 this requires a lot of test coverage.

  1. You need to add unit tests in openmetadata-service.
  2. You need to add integration tests that exercise import/export with Unicode characters; for import/export CSV you can check BaseEntityIT, which has common tests across the different entities that support import/export.
  3. You also need to add unit-test coverage for the UI, plus Playwright tests that simulate the import/export. cc @PubChimps

@Darshan3690 (Contributor, Author) commented Apr 16, 2026

Okay, I will update the PR; please add the safe to test label.


Copilot AI review requested due to automatic review settings April 16, 2026 17:50

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 2 comments.

Comment on lines +5674 to +5676
} catch (Exception e) {
fail("Unicode import/export round-trip failed: " + e.getMessage());
}
Copilot AI commented Apr 16, 2026

This test catches Exception and then calls fail(...) with only e.getMessage(), which drops the stack trace and makes failures harder to debug. Prefer either not catching here (let JUnit report the exception), catching the specific exception type(s) expected, or passing the exception as the cause (e.g., fail(message, e)).

const blob = new Blob([content], { type: mimeType });
const isCsvFile = fileName.toLowerCase().endsWith('.csv');
const isCsvMime = mimeType.toLowerCase().includes('text/csv');
const csvMimeType = 'text/csv;charset=utf-8;';
Copilot AI commented Apr 16, 2026

csvMimeType includes a trailing ; (text/csv;charset=utf-8;), which is not a valid media type per the HTTP Content-Type grammar and may be parsed inconsistently by browsers/tools. Consider removing the trailing semicolon (and optionally adding a space after the ;) so the Blob type is a well-formed text/csv; charset=utf-8.

Suggested change
const csvMimeType = 'text/csv;charset=utf-8;';
const csvMimeType = 'text/csv; charset=utf-8';


@gitar-bot

gitar-bot bot commented Apr 17, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Standardizes CSV import UTF-8 encoding by resolving BOM double-prepending issues and removing the mistakenly committed 101K-line debug artifact. No issues found.

✅ 2 resolved
Quality: Accidentally committed 101K-line debug.json test artifact

📄 openmetadata-ui/src/main/resources/ui/debug.json:1-15
The file openmetadata-ui/src/main/resources/ui/debug.json is a 101,730-line Jest test output file that was accidentally committed. It contains local Windows file paths (e.g., C:\open source con\OpenMetadata\...) and full test failure details. This bloats the repository significantly and exposes local development environment information. It is not referenced by any build config or .gitignore entry.

Bug: BOM may be double-prepended if content already contains one

📄 openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.ts:26-32
In ExportUtils.ts, the downloadFile function unconditionally prepends a UTF-8 BOM (\uFEFF) to all CSV content. However, there's no check whether the content already starts with a BOM. On the import side, the backend strips the BOM via CsvUtil.stripUtf8Bom(), but if a future code path or the backend export ever includes a BOM in the response, the download will contain \uFEFF\uFEFF — a double BOM. The first BOM is consumed as expected, but the second appears as an invisible zero-width no-break space character at the start of the first header, which can cause subtle CSV parsing failures on re-import.


@Darshan3690 (Contributor, Author)

Hi @harshach @PubChimps, please add the safe to test label.
