fix(#25764): Implement UTF-8 encoding standardization for CSV import/… #27409
Darshan3690 wants to merge 6 commits into open-metadata:main
Conversation
## Overview

Resolve Chinese character garbling in CSV import/export workflows by implementing end-to-end UTF-8 encoding standardization across backend REST endpoints and frontend file handling.

## Root Causes Fixed

1. Missing `charset=UTF-8` declarations on the CSV transport layer (HTTP headers)
2. No UTF-8 BOM handling for Windows Excel compatibility
3. Inconsistent encoding across 9+ independent resource classes
4. Browser `FileReader` lacking an explicit encoding specification
5. No UTF-8 BOM prepending in CSV downloads

## Changes Implemented

### Backend (11 files)

**CSV Utility (CsvUtil.java)**
- Added `UTF8_BOM` constant (`\uFEFF`)
- Added `stripUtf8Bom(String value)` utility method for safe BOM removal
- Handles null, empty string, and multi-byte character scenarios

**Shared Import Flow (EntityResource.java)**
- Import the CsvUtil dependency
- Normalize CSV input by stripping the BOM before repository parsing
- Applied to all entity types (Table, Glossary, Team, User, TestCase, etc.)

**REST Endpoints (9 resource files)**
- ColumnResource.java: Updated 3 `@Produces`/`@Consumes` annotations
- TableResource.java: Updated 4 annotations (export, async export, import, async import)
- UserResource.java: Updated 3 annotations
- TeamResource.java: Updated 4 annotations
- TestCaseResource.java: Updated 3 annotations
- GlossaryResource.java: Updated 4 annotations
- GlossaryTermResource.java: Updated 4 annotations
- LineageResource.java: Updated 1 annotation (export)
- All changed from `TEXT_PLAIN` → `TEXT_PLAIN + "; charset=UTF-8"`
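The description above mentions a `stripUtf8Bom(String value)` helper in CsvUtil. As a rough illustration of what such a helper might look like (the class name `CsvBomUtil` here is hypothetical; the PR's actual implementation lives in `CsvUtil` and may differ):

```java
// Hypothetical stand-in for the CsvUtil BOM helper described above.
public final class CsvBomUtil {
  // U+FEFF: the byte-order mark as it appears once decoded into a String
  public static final String UTF8_BOM = "\uFEFF";

  private CsvBomUtil() {}

  // Strips a single leading BOM; safe for null, empty, and BOM-free input.
  public static String stripUtf8Bom(String value) {
    if (value == null || value.isEmpty()) {
      return value;
    }
    return value.startsWith(UTF8_BOM) ? value.substring(1) : value;
  }
}
```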
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as `safe to test`… Let us know if you need any help!
Hi @harshach, please add the `safe to test` label.
Pull request overview
Implements end-to-end UTF-8 handling for CSV import/export to prevent non-ASCII (e.g., Chinese) character corruption by standardizing charset usage across UI requests, backend endpoints, and CSV parsing/downloading.
Changes:
- Standardize CSV import request encoding (the UI sends `text/plain; charset=UTF-8`; the backend consumes/produces UTF-8 explicitly).
- Add UTF-8 BOM handling (the backend strips the BOM on import; the UI prepends a BOM for CSV downloads for Excel compatibility).
- Extend automated coverage (Java unit tests, Jest tests, and a Playwright E2E scenario with Chinese content).
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.ts | Prepends BOM and enforces CSV MIME type for downloads to improve Excel UTF-8 handling. |
| openmetadata-ui/src/main/resources/ui/src/utils/Export/ExportUtils.test.tsx | Updates/adds tests for BOM behavior on CSV vs non-CSV downloads. |
| openmetadata-ui/src/main/resources/ui/src/rest/teamsAPI.ts | Adds UTF-8 charset to CSV import request headers for team/user imports. |
| openmetadata-ui/src/main/resources/ui/src/rest/tableAPI.ts | Adds UTF-8 charset to CSV import request headers for table import. |
| openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.ts | Adds UTF-8 charset to CSV import request headers for multiple entity import APIs. |
| openmetadata-ui/src/main/resources/ui/src/rest/importExportAPI.test.ts | Updates assertions to validate UTF-8 charset headers in import requests. |
| openmetadata-ui/src/main/resources/ui/src/rest/databaseAPI.ts | Adds UTF-8 charset to CSV import request headers for database/schema imports. |
| openmetadata-ui/src/main/resources/ui/src/rest/columnAPI.ts | Adds UTF-8 charset to CSV import request headers for column CSV import APIs. |
| openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx | Forces FileReader.readAsText(..., 'utf-8') for CSV uploads. |
| openmetadata-ui/src/main/resources/ui/playwright/e2e/Pages/GlossaryImportExport.spec.ts | Adds Chinese glossary term data to validate E2E import/export behavior. |
| openmetadata-service/src/test/java/org/openmetadata/csv/CsvUtilTest.java | Adds unit tests for BOM stripping and Chinese character preservation in CSV formatting. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java | Adds UTF-8 charset to CSV import/export endpoint annotations for users. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java | Adds UTF-8 charset to CSV import/export endpoint annotations for teams. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java | Adds UTF-8 charset to lineage CSV export endpoint annotation. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryTermResource.java | Adds UTF-8 charset to glossary term CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java | Adds UTF-8 charset to glossary CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/dqtests/TestCaseResource.java | Adds UTF-8 charset to test case CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java | Adds UTF-8 charset to table CSV import/export endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/columns/ColumnResource.java | Adds UTF-8 charset to column CSV import endpoint annotations. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/EntityResource.java | Centralizes BOM stripping for entity CSV imports via CsvUtil.stripUtf8Bom(...). |
| openmetadata-service/src/main/java/org/openmetadata/csv/CsvUtil.java | Introduces UTF-8 BOM constant and helper to strip BOM from imported CSV strings. |
Comments suppressed due to low confidence (6)
openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:49
`setUploading(false)` runs in the `finally` block immediately after `readAsText(...)` is initiated, but `FileReader` completes asynchronously. This means the loader state will be cleared before `onload`/`onerror` fires (and errors thrown inside `reader.onerror` won't be caught by this try/catch). Move the `setUploading(false)` into `reader.onloadend` (or `onload`/`onerror`) and surface errors via the callback/rejection rather than throwing a string in an async handler.
setUploading(true);
try {
const reader = new FileReader();
reader.onload = onCSVUploaded;
reader.onerror = () => {
throw t('server.unexpected-error');
};
reader.readAsText(options.file as Blob, 'utf-8');
} catch (error) {
showErrorToast(error as AxiosError);
} finally {
setUploading(false);
}
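A minimal, framework-free sketch of the restructuring this comment suggests. The injected `read` function stands in for the `FileReader` flow, and the flag is cleared only when the read actually settles; all names here are illustrative, not the component's real API:

```typescript
// Illustrative sketch: clear the loading flag in the completion callbacks
// (the analogue of onloadend), not in a synchronous finally block.
type ReadFn = (onDone: (text: string) => void, onErr: (e: Error) => void) => void;

export function readWithLoadingFlag(
  read: ReadFn,
  setUploading: (v: boolean) => void
): Promise<string> {
  setUploading(true);
  return new Promise((resolve, reject) => {
    read(
      (text) => {
        setUploading(false); // cleared on success, like onloadend
        resolve(text);
      },
      (err) => {
        setUploading(false); // cleared on failure too
        reject(err); // surface the error via rejection, not a thrown string
      }
    );
  });
}
```

With the synchronous `finally` approach, the flag flips back before either callback runs; here the flag lifecycle matches the read lifecycle.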
openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/TeamResource.java:755
The sync `exportCsv(...)` endpoint produces plain-text CSV, but the `@ApiResponse` content is still declared as `application/json`. This makes the generated OpenAPI spec incorrect for clients. Update the response `@Content(mediaType=...)` to `text/plain` (or `text/csv`) to match what is actually returned.
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
operationId = "exportTeams",
summary = "Export teams in CSV format",
responses = {
@ApiResponse(
responseCode = "200",
description = "Exported csv with teams information",
content =
@Content(
mediaType = "application/json",
schema = @Schema(implementation = String.class)))
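The fix the reviewer is asking for would align the documented media type with the `@Produces` annotation. A sketch of the corrected fragment (annotation shape assumed from the snippet above; the same change applies to the other export endpoints flagged below):

```java
@ApiResponse(
    responseCode = "200",
    description = "Exported csv with teams information",
    content =
        @Content(
            mediaType = "text/plain",
            schema = @Schema(implementation = String.class)))
```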
openmetadata-service/src/main/java/org/openmetadata/service/resources/glossary/GlossaryResource.java:575
The sync `exportCsv(...)` endpoint returns CSV (`String`) and is annotated as `@Produces(text/plain; charset=UTF-8)`, but the `@ApiResponse` still declares `application/json`. Adjust the documented response media type to `text/plain` (or `text/csv`) so generated clients don't try to parse JSON.
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
operationId = "exportGlossary",
summary = "Export glossary in CSV format",
responses = {
@ApiResponse(
responseCode = "200",
description = "Exported csv with glossary terms",
content =
@Content(
mediaType = "application/json",
schema = @Schema(implementation = String.class)))
openmetadata-service/src/main/java/org/openmetadata/service/resources/databases/TableResource.java:622
The sync `exportCsv(...)` endpoint returns plain-text CSV but its `@ApiResponse` still advertises `application/json`. This makes the OpenAPI spec inaccurate for CSV consumers. Update the documented response `@Content(mediaType=...)` to `text/plain` (or `text/csv`).
@GET
@Path("/name/{name}/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
operationId = "exportTable",
summary = "Export table in CSV format",
responses = {
@ApiResponse(
responseCode = "200",
description = "Exported csv with columns from the table",
content =
@Content(
mediaType = "application/json",
schema = @Schema(implementation = String.class)))
})
openmetadata-service/src/main/java/org/openmetadata/service/resources/lineage/LineageResource.java:416
`exportLineage(...)` is annotated to produce plain text, but the OpenAPI `@ApiResponse` is documented as returning a `SearchResponse` JSON payload. Since the method returns a CSV `String`, update the documented response content/media type to `text/plain` (or `text/csv`) to avoid generating incorrect clients.
@GET
@Path("/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Operation(
operationId = "exportLineage",
summary = "Export lineage",
responses = {
@ApiResponse(
responseCode = "200",
description = "search response",
content =
@Content(
mediaType = "application/json",
schema = @Schema(implementation = SearchResponse.class)))
})
openmetadata-service/src/main/java/org/openmetadata/service/resources/teams/UserResource.java:1701
`exportUsersCsv(...)` is annotated as producing plain text, but the OpenAPI `@ApiResponse` content is still declared as `application/json`. This makes the generated spec misleading for CSV consumers. Update the documented response `@Content(mediaType=...)` to `text/plain` (or `text/csv`) to match the actual response body.
@GET
@Path("/export")
@Produces({MediaType.TEXT_PLAIN + "; charset=UTF-8"})
@Valid
@Operation(
operationId = "exportUsers",
summary = "Export users in a team in CSV format",
responses = {
@ApiResponse(
responseCode = "200",
description = "Exported csv with user information",
content =
@Content(
mediaType = "application/json",
schema = @Schema(implementation = String.class)))
})
Hi @harshach @PubChimps @pmbrull, please add the `safe to test` label.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
openmetadata-ui/src/main/resources/ui/src/components/UploadFile/UploadFile.tsx:48
`setUploading(false)` runs in the `finally` block immediately after calling `FileReader.readAsText(...)`, but `FileReader` is asynchronous. This makes the loader state inaccurate (it will flip back to false before `onload`/`onerror` fires). Move `setUploading(false)` into the `onload` and `onerror` handlers (and call `options.onSuccess`/`onError` if needed) so the UI reflects the actual read lifecycle.
reader.readAsText(options.file as Blob, 'utf-8');
} catch (error) {
showErrorToast(error as AxiosError);
} finally {
setUploading(false);
displayName: '中文术语展示名',
description: '这是用于验证导入导出编码的中文描述。',
synonyms: '中文同义词;测试',
references: '参考;https://example.com/中文',
references includes a URL with raw non-ASCII characters (`https://example.com/中文`). In the GlossaryTerm schema, `termReference.endpoint` is `format: uri`, so validators may reject IRIs that are not RFC 3986-encoded. To avoid a flaky/invalid test while still exercising Chinese text, keep Chinese in the reference name and percent-encode the URL path (or use an ASCII-only URL).
Suggested change:
- references: '参考;https://example.com/中文',
+ references: '参考;https://example.com/%E4%B8%AD%E6%96%87',
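The percent-encoded form in the suggestion can be produced directly with the standard `encodeURI`, which encodes non-ASCII path characters while leaving reserved URL delimiters intact:

```typescript
// encodeURI percent-encodes non-ASCII characters but preserves
// reserved URL delimiters such as ':' and '/'.
const encoded = encodeURI('https://example.com/中文');
// → 'https://example.com/%E4%B8%AD%E6%96%87'
```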
it('uses the provided mimeType when creating the Blob', () => {
  const mockBlob = {};
  const MockBlob = jest.fn().mockReturnValue(mockBlob);
  global.Blob = MockBlob as unknown as typeof Blob;

  downloadFile('content', 'file.csv', 'text/csv;charset=utf-8;');

  expect(MockBlob).toHaveBeenCalledWith(['\uFEFFcontent'], {
    type: 'text/csv;charset=utf-8;',
  });
});
These tests overwrite global.Blob but never restore it. jest.restoreAllMocks() won’t revert direct assignments, so the mocked Blob can leak into later tests/files and cause hard-to-debug failures. Capture the original global.Blob and restore it in afterEach (or use a spy/mocking approach that is automatically restored).
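A jest-free illustration of the capture-and-restore pattern this comment suggests (all names are illustrative): because `global.Blob = ...` is a direct assignment, mock-restoration helpers cannot undo it, so the original must be saved up front and reassigned in a `finally`/`afterEach` that runs even when the test fails.

```typescript
// Capture-and-restore for a directly assigned global (illustrative names).
const g = globalThis as Record<string, unknown>;
const originalBlob = g.Blob; // capture BEFORE installing the mock

export function runWithMockedBlob(testBody: () => void): void {
  g.Blob = function MockBlob() {
    // a real test would record constructor arguments here
  };
  try {
    testBody();
  } finally {
    g.Blob = originalBlob; // restore even if testBody throws
  }
}
```

In a Jest suite the same idea maps onto `beforeEach`/`afterEach`: capture the original once, assign the mock in `beforeEach`, reassign the original in `afterEach`.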
@Darshan3690 this requires a lot of test coverage.
Okay sir, I will update the PR. Please add the `safe to test` label.
} catch (Exception e) {
  fail("Unicode import/export round-trip failed: " + e.getMessage());
}
This test catches Exception and then calls fail(...) with only e.getMessage(), which drops the stack trace and makes failures harder to debug. Prefer either not catching here (let JUnit report the exception), catching the specific exception type(s) expected, or passing the exception as the cause (e.g., fail(message, e)).
const blob = new Blob([content], { type: mimeType });
const isCsvFile = fileName.toLowerCase().endsWith('.csv');
const isCsvMime = mimeType.toLowerCase().includes('text/csv');
const csvMimeType = 'text/csv;charset=utf-8;';
csvMimeType includes a trailing ; (text/csv;charset=utf-8;), which is not a valid media type per the HTTP Content-Type grammar and may be parsed inconsistently by browsers/tools. Consider removing the trailing semicolon (and optionally adding a space after the ;) so the Blob type is a well-formed text/csv; charset=utf-8.
Suggested change:
- const csvMimeType = 'text/csv;charset=utf-8;';
+ const csvMimeType = 'text/csv; charset=utf-8';
**Code Review ✅ Approved — 2 resolved / 2 findings**

Standardizes CSV import UTF-8 encoding by resolving BOM double-prepending issues and removing the mistakenly committed 101K-line debug artifact. No issues found.

- ✅ Quality: Accidentally committed 101K-line debug.json test artifact
- ✅ Bug: BOM may be double-prepended if content already contains one

— Gitar
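The double-prepend finding above suggests guarding the BOM prepend. A minimal sketch of such a guard (the helper name is hypothetical, not the PR's actual code):

```typescript
// Prepend a UTF-8 BOM for Excel-friendly CSV downloads, but only when the
// content does not already start with one (avoids the double-BOM bug).
const UTF8_BOM = '\uFEFF';

export const withUtf8Bom = (content: string): string =>
  content.startsWith(UTF8_BOM) ? content : UTF8_BOM + content;
```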
Hi @harshach @PubChimps, please add the `safe to test` label.
PR Summary: Fix Chinese Character Garbling in CSV Import/Export (#25764)
Issue
Chinese and other non-ASCII characters were getting garbled during CSV import and export flows.
Root Cause
Encoding was not consistently enforced across the full pipeline:
What Changed
Backend updates
Frontend updates
Playwright updates
Cleanup updates
Additional review follow-up fixes
Based on review comments, this PR also includes:
Validation status
Passed locally
Added coverage
Environment note
Integration-test module compile/run requires local snapshot artifacts in this environment. Code updates are complete, but full integration module execution depends on those snapshot dependencies being available in CI or a fully bootstrapped local build.
Compatibility and risk
Impact
CSV import/export now reliably preserves Chinese and other Unicode characters across backend and frontend workflows, including Excel-friendly CSV download behavior.