tex4ht tutorial

M. Senthilkumar edited this page Jan 6, 2017 · 3 revisions

What is tex4ht?

tex4ht is a system which converts LaTeX to various output formats, including html, xhtml, odt, docbook or tei. html and odt are the most common and best-supported conversion targets.

tex4ht allows authors to use LaTeX input--widely employed for high-quality typography, especially mathematical typography--to produce output in other formats, especially html (for web pages) and xhtml (for ebooks and other applications).

System Description

tex4ht consists of three basic building blocks and various scripts which tie these blocks together.

  1. tex4ht.sty is a TeX package which inserts configured output codes (i.e., html tags) into TeX's .dvi output file. Many documents can be translated to html without users needing to supply tags explicitly, but there are macros to insert html directly into the output if the need arises.

  2. tex4ht is an executable (program) which extracts the information stored in the .dvi file, including text and output codes, and prepares auxiliary files for image conversion and other tasks. Note that although the whole system is named tex4ht, this command cannot be run on a .tex file; it works only with a .dvi file.

  3. t4ht is a program which converts images, generates the css file, and runs various commands requested in the .tex file.

A number of helper shell scripts (commands) exist, so that users do not need to invoke these commands manually. The best known of these is htlatex, which by default converts LaTeX to html. Using different options, you can convert to any output format supported by the tex4ht system.

In fact, you can convert to almost any format using tex4ht, even to formats not based on xml, but to do so involves providing extensive configuration files.

Basic usage

The basic usage of the htlatex command (script) is as follows:

 htlatex filename "options for tex4ht.sty" "options for tex4ht" "options for t4ht" "LaTeX options"

As you can see, htlatex has five parameters; only the first one, the filename, is mandatory. Also note that options must generally be enclosed in quotes so that they are passed literally to the underlying commands.

The calling command "driver" is mk4ht, which is similar to htlatex, but slips in a new first parameter indicating the system to be used. Values for this parameter include htlatex which produces the same results as the htlatex script, oolatex for Open Document Format conversion, dblatex for docbook, or teilatex for TEI. The mk4ht command is quite general, allowing user-generated configuration files. For further information, see calling commands on the tex4ht website.

As an example, to compile to Open Document Format, you would type this at the command prompt:

mk4ht oolatex sample.tex

A more recent option is to use Michal Hoftich's make4ht build system for tex4ht. It allows the user to call various commands during compilation, such as bibtex, biber, or xindy; to postprocess output files with Lua scripts or commands such as tidy or xslt processors; and to specify the command to be used for image conversion.

In this tutorial, we will show usage of both htlatex and make4ht.

Simple example

Let's start with the conversion of a simple LaTeX file to html. Let's say we have the following multilingual LaTeX file:

\documentclass{article}
\usepackage[english,czech]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english}
	Some text in English
\end{otherlanguage}
\end{document}

Things to notice: the document uses two languages. Czech is the main document language and English is secondary. Note the use of the otherlanguage environment. It is provided by the babel package and switches the document language locally, so that correct hyphenation and other language-dependent settings are used. We could use the \selectlanguage command instead, but I would like to discourage the use of switching commands such as this one, or font-switching commands like \bfseries, for one reason: they have no explicit end, so it is impossible to configure them to insert a closing element correctly. For font-switching commands the situation is saved by the tex4ht command, which inserts formatting instructions for each font change. In general, though, such commands don't play nicely with xml-based formats, where every opened element must be closed at the same hierarchical level; that is, the opening and closing tags must have the same parent element. Using the otherlanguage environment allows us to configure it properly and insert opening and closing tags at the correct places.
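
As a small illustration of the difference, the sketch below sets the same English text twice, once with switching commands and once with forms that have an explicit end. It assumes the same babel setup as the sample above; nothing in it is specific to tex4ht.

\documentclass{article}
\usepackage[english,czech]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
% Switching commands: the change lasts until the next switch or the end
% of the group, so there is no explicit end where a closing tag could go.
\selectlanguage{english}
Some {\bfseries bold} text in English.
\selectlanguage{czech}

% Environment and command with an argument: both have an explicit
% beginning and end, which can be configured to emit matching tags.
\begin{otherlanguage}{english}
Some \textbf{bold} text in English.
\end{otherlanguage}
\end{document}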

But beware of the following situation:

Hello world.
\begin{someenv}
Just start some environment.

But run it through several paragraphs
\end{someenv}

Say that we insert <div class="someenv"> and </div> tags around the someenv environment. By default this may produce the following structure:

<p>Hello world.
<div class="someenv">Just start some environment.
</p>

<p>But run it through several paragraphs
</div></p>

As you can see, the generated html code is incorrect, because the opening and closing div tags have different parent elements. someenv can be configured to close the current paragraph, but that may not be what you want.

The best way to prevent the tag mismatch may be something like:
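
A configuration that closes the current paragraph before each tag might look like the sketch below. It belongs in a custom config file, which is covered later in this tutorial. \ConfigureEnv, \IgnorePar and \EndP are tex4ht commands; someenv and its class name are just the placeholders from the example above, and the paragraph handling you actually need may differ.

\Preamble{xhtml}
% \ConfigureEnv{name}{before}{after}{before list}{after list}
% \ifvmode\IgnorePar\fi asks tex4ht to ignore a paragraph that is about to start,
% \EndP closes a paragraph that is still open.
% The last two arguments apply to list-based environments and are left empty here.
\ConfigureEnv{someenv}
  {\ifvmode\IgnorePar\fi\EndP\HCode{<div class="someenv">}}
  {\ifvmode\IgnorePar\fi\EndP\HCode{</div>}}
  {}{}
\begin{document}
\EndPreamble

With a configuration along these lines the <div> should no longer be opened inside the paragraph that contains "Hello world.", at the price of splitting that paragraph at the start of the environment.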

Hello world.
\begin{someenv}
Just start some environment.
\end{someenv}

\begin{someenv}
But run it through several paragraphs
\end{someenv}

But let's stop talking about the traps you may fall into and compile our example! To start, both htlatex and make4ht will be shown; we will focus on make4ht later.

With htlatex, we may use

 htlatex sample1

and with make4ht

 make4ht sample1

Let's look at the text part generated by htlatex:

<!--l. 6--><p class="noindent" >P&#x0159;íli&#353; &#382;lu&#x0165;ou&#x010D;k&#x00FD; k&#x016F;&#x0148; úp&#x011B;l <span 
class="ecti-1000">&#x010F;</span><span 
class="ecti-1000">ábelsk</span><span 
class="ecti-1000">é </span>ódy. Some text in English

and by make4ht:

<!--l. 6--><p class="noindent" >P&#x0159;íli&#353; &#382;lu&#x0165;ou&#x010D;k&#x00FD; k&#x016F;&#x0148; úp&#x011B;l <span 
class="ecti-1000">&#x010F;</span><span 
class="ecti-1000">ábelsk</span><span 
class="ecti-1000">é </span>ódy. Some text in English
</p> 

The only difference is the missing </p> tag in the htlatex output, because htlatex produces html 4.01 by default, where the closing tag is optional. make4ht, on the other hand, produces xhtml by default, so the closing tag must be present.

To get xhtml output from htlatex, use the tex4ht.sty option xhtml. It must be the first option in the option list passed to tex4ht.sty; the first option must always be either html, xhtml, or the name of a custom config file. We will cover these config files later, as they are the key component in customizing tex4ht output.

So in order to get the same output as from make4ht, we must use the following command:

 htlatex sample1 xhtml

Now we should get rid of the ugly entities which encode accented letters. This is somewhat awkward with htlatex:

 htlatex sample1 "xhtml,charset=utf-8" " -cunihtf -utf8"

The charset=utf-8 option produces a meta element which declares the document to be in utf-8 encoding. The important part is the two options for the tex4ht command, -cunihtf and -utf8.

ToDo: add description of process of conversion from htf fonts to utf8 using unicode.4hf. It is directed from tex4ht.env file.

With make4ht the situation is easier, as all we need to do is add the -u option:

 make4ht -u sample1.tex

The resulting file:

<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span 
class="ecti-1000">ď</span><span 
class="ecti-1000">ábelsk</span><span 
class="ecti-1000">é </span>ódy. Some text in English
</p> 

The entities are gone, but another problem persists. What we see is caused by a bug in the tex4ht command. It wraps text set in a non-default font in <span> elements. Unfortunately, this doesn't play well with accented letters, as we can see. Fortunately, this has an easy solution. We just need to dive into tex4ht configuration. Yay!

Configurations

We have already seen that we can use command line options to configure the output. For a full list of options for tex4ht.sty, see an article on CVR's blog. These options mainly influence the appearance of math, footnotes, tables, etc. Note that they aren't a fixed set: anyone can add new options, and not every option is supported in every output format supported by tex4ht. Generally, these options work with html (and xhtml) output.

The other option is to use a custom config file (.cfg). This is a TeX file with the following basic structure:

 optional stuff like requiring LaTeX packages etc
 ...
 \Preamble{xhtml,tex4ht.sty options}
 ...
 tex4ht configurations
 ...
 \begin{document} 
 ...
 more tex4ht configurations
 ...
 \EndPreamble
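
To make the skeleton concrete, here is a minimal sketch of a complete config file. The file name myconfig.cfg and the css rule are made up for illustration; \Css is used only because its effect is easy to check in the generated css file.

% myconfig.cfg (hypothetical file name)
% usage with htlatex:  htlatex sample1 "myconfig"
% usage with make4ht:  make4ht -c myconfig.cfg sample1.tex
\Preamble{xhtml}
% add one rule to the generated css file
\Css{body{max-width:40em;}}
\begin{document}
\EndPreamble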

The most important configuration command is \Configure. It takes a variable number of arguments; in its simplest form it has two: \Configure{configname}{insert for a first hook}.

At this point we should talk about hooks. In order to insert html tags, LaTeX macros are redefined, and special hooks are inserted into the new definitions. These hooks are declared with \NewConfigure{configname}{number of hooks} in a special file named after the redefined package, with the suffix .4ht. The hooks are then given content in the configuration files for particular output formats, or in the .cfg file.

To illustrate this, let's look at a simple example. Let's say we have a simple package, hello.sty:

\ProvidesPackage{hello} 
\newcommand\hello{\textbf{hello world}}
\endinput

We can provide hooks in a file named hello.4ht. Say we just want to insert tags at the beginning and at the end of the \hello command:

% provide configure for \hello command. we can choose any name
% but most convenient is to name hooks after redefined command
% we declare two hooks, to be inserted before and after the command
\NewConfigure{hello}{2}
% now we need to redefine \hello. save it to tmp command
\let\tmp:hello\hello
% note that `:` can be part of command name in `.4ht` files. 
% now insert the hooks. they are named as \a:hook, \b:hook, ..., \h:hook
% depending on how many hooks were declared
\renewcommand\hello{\a:hello\tmp:hello\b:hello} 

Because we want to surround the content produced by \hello with tags, we need to declare two hooks. This is the most common case for ordinary commands which just produce some text. The old definition of the macro is saved in a temporary macro, and the command is then redefined to insert the hooks around the original definition stored in the temporary macro.

Now we can change our sample to use the hello package:

\documentclass{article}
\usepackage[english,czech]{babel} 
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} 
\usepackage{hello}
\begin{document} Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english} Some text in English, \hello
\end{otherlanguage} 
\end{document}

We haven't provided any configuration for hello yet, but you can see that the text hello world is in bold anyway. This is the same as with \textit, which is converted to italic. Basic font styles are inserted by the tex4ht command when it extracts text from the dvi file to the output format. So it is finally the right time to show how to configure both textit and hello to produce better tags than they do by default.

The basic structure of a config file has been shown before, so now we will just add basic configurations for \textit and \hello:

\Preamble{xhtml}
\Configure{textit}{\HCode{<span class="textit">}}{\HCode{</span>}}
\Configure{hello}{\HCode{<span class="hello">}}{\HCode{</span>}}
\Css{.textit{font-style:italic;}}
\Css{.hello{font-weight:bold;}}
\begin{document}
\EndPreamble

For documentation of the default configurations, see tex4ht info; the LaTeX and tex4ht sections are the most useful. Documentation for basic font commands such as \textit or \textbf is provided in the LaTeX section. There we can see that the textit configuration takes two parameters, inserted before and after the content. The same holds for the hello configuration we defined earlier: its hooks are inserted before and after the content.

To insert html tags we need to use \HCode commands; otherwise special characters such as <, > or & are escaped. In our example we insert span elements with a class attribute to distinguish them. Because these classes don't have any visual appearance by default, we use \Css commands to add some styling. Yes, you need to know both html and css to configure tex4ht effectively!

If we look at the html output now, we can see that things don't look much better than they did initially:
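
As an aside, \HCode is not limited to config files. The sketch below uses it directly in a document, guarded by \ifdefined\HCode so that the file still compiles with plain LaTeX. The \note command and its class name are hypothetical, and the guard relies on the assumption that \HCode is already defined when the preamble is read under htlatex or make4ht.

\documentclass{article}
% \HCode is assumed to be defined only when compiling through tex4ht
\ifdefined\HCode
  % wrap the argument in an inline element; < and > are written unescaped
  \newcommand\note[1]{\HCode{<span class="note">}#1\HCode{</span>}}
\else
  % ordinary LaTeX run: just typeset the argument
  \newcommand\note[1]{#1}
\fi
\begin{document}
This sentence contains a \note{marked} word.
\end{document}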

<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span class="textit"><span 
class="ecti-1000">ď</span><span 
class="ecti-1000">ábelsk</span><span 
class="ecti-1000">é</span></span> ódy. Some text in English, <span class="hello"><span 
class="ecbx-1000">hello world</span></span>
</p> 

Our new tags were inserted, but the unnecessary elements inserted by the tex4ht processor are still present. Fortunately, we can suppress the insertion of these elements with the \NoFonts command and enable it again later with \EndNoFonts. We can also use the tex4ht.sty option NoFonts, which suppresses font processing in the whole document, but you should use it with caution, as it may have side effects.

Let's take a look at how our configurations would look with the \NoFonts command:
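
For comparison, the document-wide variant mentioned above only needs the option in the \Preamble line. This is a sketch of that variant; since it switches font processing off everywhere, formatting that relied on the font-based <span> elements may have to be configured by other means.

\Preamble{xhtml,NoFonts}
\begin{document}
\EndPreamble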

\Preamble{xhtml}
\Configure{textit}{\HCode{<span class="textit">}\NoFonts}
{\EndNoFonts\HCode{</span>}}
\Configure{hello}{\HCode{<span class="hello">}\NoFonts}
{\EndNoFonts\HCode{</span>}}
\Css{.textit{font-style:italic;}}
\Css{.hello{font-weight:bold;}}
\begin{document}
\EndPreamble

The output now looks much better:

<!--l. 6--><p class="noindent" >Příliš žluťoučký kůň úpěl <span class="textit">ďábelské</span> ódy. Some text in English, <span class="hello">hello world</span>
</p> 

It may seem that we can be happy at this point, but things aren't as easy as we might hope, because there is one thing we haven't talked about:

Paragraphs

What if we add some more paragraphs in English to our sample file?

\documentclass{article}
\usepackage[english,czech]{babel} 
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} 
\usepackage{hello}
\begin{document} Příliš žluťoučký kůň úpěl \textit{ďábelské} ódy.
\begin{otherlanguage}{english} Some text in English, \hello
\end{otherlanguage} 

\begin{otherlanguage}{english} 

\textit{What will do} \verb|\textit| at the beginning of paragraph?

And also, what about configuration for \verb|otherlanguage| environment?

\end{otherlanguage}

\end{document}

What if we want to insert elements with a lang attribute to specify the language of the text in the html? This may be useful from a semantic point of view. We can also enable hyphenation in the css, which works only when the correct languages are marked in the source.

This exercise will be a little bit more difficult.
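
The css half of this idea can be sketched already. Assuming the language of each element ends up marked in the html (the part that still has to be configured), a rule like the following asks the browser to hyphenate paragraphs. hyphens is a standard css property, but browser support varies by language, so treat this as a starting point only.

\Preamble{xhtml}
% ask the browser to hyphenate paragraph text automatically;
% this has an effect only for text whose language is declared in the html
\Css{p{hyphens:auto;}}
\begin{document}
\EndPreamble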
