Skip to content

overdoignism/HtmTxTrans

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

HtmTxTrans

繁體中文版本請見 README_zhtw.md

1. Purpose & Concept

HtmTxTrans is a powerful Command-Line Interface (CLI) tool designed for batch, full-text translation of HTML files using Large Language Models (LLMs).

Translating raw HTML with LLMs often leads to broken tags, lost attributes, or hallucinated structures. HtmTxTrans solves this by utilizing a "Disassembly-Translate-Reassembly" pipeline. It meticulously extracts text nodes and inline tags from the HTML, translates the plain text with sequential context awareness, and seamlessly aligns and reconstructs the translated text back into a perfectly valid HTML structure.

2. Tech Stack & License

  • Framework: .NET 8

  • Core Libraries:

    • OpenAI (2.10.0, Official .NET SDK for standardized LLM API interactions, MIT)

    • YamlDotNet (17.0.1, For highly readable configuration, prompts, and strictly-typed state management, MIT)

    • AngleSharp (1.4.0, For robust HTML DOM parsing and manipulation, MIT)

  • License: Released under the MIT License.

3. Workflow (The 6 Passes)

HtmTxTrans processes files through a highly resilient 6-pass workflow. Each pass saves its state as human-readable YAML files, allowing for easy debugging, resumption, and independent retries.

  • Pass 1: HTML Extraction & Chunking
    Extracts text and target attributes using AngleSharp. Strips inline tags and splits the text into optimal chunks based on a sliding window and LLM sentence-boundary detection.

  • Pass 2: Proper Noun Extraction (Glossary Generation)
    Scans the text chunks to identify proper nouns suitable for transliteration, generating a rolling bilingual glossary to ensure terminology consistency.

  • Pass 3: Context-Aware Translation
    Translates the plain text chunks using the generated glossary, passing sequential preceding and succeeding context to the LLM for highly accurate localization.

  • Pass 4: Node Alignment
    Performs an in-place YAML update, mapping the translated plain text fragments back into their original Node ID structures.

  • Pass 5: Tag Alignment
    Re-inserts the stripped HTML inline tags (e.g., , ) back into the translated nodes at their correct semantic positions.

  • Pass 6: HTML Restoration
    Reconstructs the final HTML file by merging the fully translated and tagged nodes back into the original HTML skeleton.

4. Acknowledgments

This project was co-developed with the assistance of GeminiChatGPT, and Grok.
The principal code author and architect for this iteration is Gemini 3.1 pro.


5. Quick Start

For detailed CLI usage, advanced configuration, and prompt engineering instructions, please refer to the Detailed Documentation.

About

HtmTxTrans is a powerful CLI tool designed for batch, full-text translation of HTML files using LLMs.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages