This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define as set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text,
However, we are not actually using a new XML for the representation; instead embed the representation in
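Some text is missing in the first paragraph: the sentence ending "since the content we are representing is formatted text," breaks off before the next "However,".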
define as set => define a set
instead embed => instead we embed