Note
This package is currently in the development stage.
A Python package for parsing and generating CF_HTML clipboard data, the format Windows applications use to store HTML content on the clipboard together with metadata such as fragment boundaries and selection information.
This package requires Python 3.10 or higher.
Note: This package is not yet available on PyPI. Install directly from GitHub if you want to use it:
pip install git+https://github.com/privet-kitty/python-cf-html.git
from cf_html import CfHtml
cf_html_str = (
"Version:1.0\r\n"
"StartHTML:0000000105\r\n"
"EndHTML:0000000197\r\n"
"StartFragment:0000000141\r\n"
"EndFragment:0000000161\r\n"
"<html>\r\n"
"<body>\r\n"
"<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->\r\n"
"</body>\r\n"
"</html>"
)
cf_html = CfHtml.loads(cf_html_str)
# Access the fragment content
fragment = cf_html.fragment
print(fragment) # <p>Hello, World!</p>
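For readers who want to see what the header offsets mean, the same fragment can be recovered without the package by slicing the raw bytes. This is only a minimal sketch, not this package's implementation: `read_offsets` and `extract_fragment` are hypothetical helper names, and the sketch assumes UTF-8 data whose header lines end in "\n" or "\r\n".

```python
import re

# One "Keyword:digits" header line, terminated by "\n" or "\r\n" (assumption:
# a lone "\r" terminator is not handled in this sketch).
HEADER_LINE = re.compile(
    rb"(StartHTML|EndHTML|StartFragment|EndFragment|StartSelection|EndSelection)"
    rb":(-?\d+)\r?\n"
)


def read_offsets(data: bytes) -> dict[str, int]:
    """Read the offset lines that follow the "Version:" line (hypothetical helper)."""
    offsets: dict[str, int] = {}
    pos = data.index(b"\n") + 1  # skip the "Version:..." line
    while (match := HEADER_LINE.match(data, pos)) is not None:
        offsets[match.group(1).decode("ascii")] = int(match.group(2))
        pos = match.end()
    return offsets


def extract_fragment(data: bytes) -> str:
    """Slice the bytes between StartFragment (inclusive) and EndFragment (exclusive)."""
    offsets = read_offsets(data)
    return data[offsets["StartFragment"]:offsets["EndFragment"]].decode("utf-8")


sample = (
    "Version:1.0\r\n"
    "StartHTML:0000000105\r\n"
    "EndHTML:0000000197\r\n"
    "StartFragment:0000000141\r\n"
    "EndFragment:0000000161\r\n"
    "<html>\r\n"
    "<body>\r\n"
    "<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->\r\n"
    "</body>\r\n"
    "</html>"
).encode("utf-8")

print(extract_fragment(sample))  # <p>Hello, World!</p>
```

The offsets are byte positions, so the slice is taken on the encoded bytes rather than on the Python string.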
from cf_html import CfHtml
html_context = (
"<html>\r\n"
"<body>\r\n"
"<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->\r\n"
"</body>\r\n"
"</html>"
)
cf_html = CfHtml.load_contexts(html_context)
print(str(cf_html)) # This outputs the following CF_HTML
# Version:1.0
# StartHTML:0000000105
# EndHTML:0000000197
# StartFragment:0000000141
# EndFragment:0000000161
# <html>
# <body>
# <!--StartFragment--><p>Hello, World!</p><!--EndFragment-->
# </body>
# </html>
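The offsets in the output above can be reproduced by hand. The following is a rough sketch of that arithmetic, not this package's implementation: `build_cf_html` is a hypothetical helper, and it assumes UTF-8 encoding, a version 1.0 header without selection offsets, and exactly one pair of comment markers in the context.

```python
START_MARKER = "<!--StartFragment-->"
END_MARKER = "<!--EndFragment-->"

# Every offset is zero-padded to 10 digits, so the header length is fixed.
HEADER_TEMPLATE = (
    "Version:1.0\r\n"
    "StartHTML:{start_html:010d}\r\n"
    "EndHTML:{end_html:010d}\r\n"
    "StartFragment:{start_fragment:010d}\r\n"
    "EndFragment:{end_fragment:010d}\r\n"
)


def build_cf_html(html_context: str) -> str:
    """Prepend a description header to an HTML context (hypothetical helper)."""
    html_bytes = html_context.encode("utf-8")
    # Measure the header once with placeholder values; its length never changes.
    header_len = len(
        HEADER_TEMPLATE.format(
            start_html=0, end_html=0, start_fragment=0, end_fragment=0
        ).encode("utf-8")
    )
    # The fragment starts right after the start marker and ends right before
    # the end marker, so the markers themselves are excluded.
    marker_pos = html_bytes.index(START_MARKER.encode("utf-8"))
    fragment_start = marker_pos + len(START_MARKER.encode("utf-8"))
    fragment_end = html_bytes.index(END_MARKER.encode("utf-8"), fragment_start)
    header = HEADER_TEMPLATE.format(
        start_html=header_len,
        end_html=header_len + len(html_bytes),
        start_fragment=header_len + fragment_start,
        end_fragment=header_len + fragment_end,
    )
    return header + html_context


context = (
    "<html>\r\n"
    "<body>\r\n"
    "<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->\r\n"
    "</body>\r\n"
    "</html>"
)
print(build_cf_html(context))
```

Padding the offsets to a fixed width is what makes the calculation a single pass: the header length is known before any offset value is.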
In the CF_HTML data actually produced by official Windows applications such as Microsoft Teams and Microsoft Edge, the fragment boundaries defined by StartFragment and EndFragment do not include the <!--StartFragment--> and <!--EndFragment--> comment markers themselves; they point to the HTML content between these markers. This package follows this real-world behavior.
However, the official HTML Clipboard Format specification states that StartFragment stores the "offset (in bytes) from the beginning of the clipboard to the start of the fragment" (emphasis added). According to the following BNF syntax provided in the specification, "fragment" seems to refer to content that includes both the <!--StartFragment--> and <!--EndFragment--> comment markers:
<cf-html> ::= <description-header> <context>
<context> ::= [<preceding-context>] <fragment> [<trailing-context>]
<description-header> ::= "Version:" <version> <br> ( <header-offset-keyword> ":" <header-offset-value> <br> )*
<header-offset-keyword> ::= "StartHTML" | "EndHTML" | "StartFragment" | "EndFragment" | "StartSelection" | "EndSelection"
<header-offset-value> ::= { Base 10 (decimal) integer string with optional *multiple* leading zero digits (see "Offset syntax" below) }
<version> ::= "0.9" | "1.0"
<fragment> ::= <fragment-start-comment> <fragment-text> <fragment-end-comment>
<fragment-start-comment> ::= "<!--StartFragment -->"
<fragment-end-comment> ::= "<!--EndFragment -->"
<preceding-context> ::= { Arbitrary HTML }
<trailing-context> ::= { Arbitrary HTML }
<fragment-text> ::= { Arbitrary HTML }
<br> ::= "\r" | "\n" | "\r\n"
The following example demonstrates this package's behavior:
Version:1.0
StartHTML:0000000105
EndHTML:0000000193
StartFragment:0000000139 ← Points to (inclusive) start of "<p>Hello, World!</p>"
EndFragment:0000000159 ← Points to (exclusive) end of "<p>Hello, World!</p>"
<html>
<body>
<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->
</body>
</html>
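To make the difference between the two readings concrete, the sketch below computes the fragment offsets for the example above under both interpretations. It is illustrative only: `fragment_offsets` is a hypothetical helper, the HTML is assumed to use "\n" line endings (one combination consistent with the offsets shown), and the markers are matched without the space that appears in the specification's BNF.

```python
START_MARKER = b"<!--StartFragment-->"
END_MARKER = b"<!--EndFragment-->"


def fragment_offsets(
    html: bytes, start_html: int, include_markers: bool
) -> tuple[int, int]:
    """Return (StartFragment, EndFragment) under one of the two readings."""
    marker_start = html.index(START_MARKER)
    content_start = marker_start + len(START_MARKER)
    content_end = html.index(END_MARKER, content_start)
    marker_end = content_end + len(END_MARKER)
    if include_markers:
        # Literal reading of the BNF: the fragment includes both comments.
        return start_html + marker_start, start_html + marker_end
    # Real-world behavior followed by this package: content between the comments.
    return start_html + content_start, start_html + content_end


html = (
    b"<html>\n"
    b"<body>\n"
    b"<!--StartFragment--><p>Hello, World!</p><!--EndFragment-->\n"
    b"</body>\n"
    b"</html>"
)

print(fragment_offsets(html, 105, include_markers=False))  # (139, 159)
print(fragment_offsets(html, 105, include_markers=True))   # (119, 177)
```

The first result matches the StartFragment and EndFragment values in the example; the second is what a literal reading of the BNF would give for the same data.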
Copyright (c) 2025 Hugo Sansaqua.