
\usepackage{lmodern,url,hologo}

\providecommand\acro[1]{\textsc{#1}}
\providecommand\meta[1]{$\langle$\textit{#1}$\rangle$}


\publicationmonth{April}
\publicationyear{2018}


\setlength\rightskip{0pt plus 3em}

\section{A new home for \LaTeXe{} sources}

In the past the development version of the \LaTeXe{} source files has
been managed in a Subversion source control system with read access
\section{Bug reports for core \LaTeXe{}}
\end{quote}
and with further details also discussed in~\cite{Mittelbach:TB39-1}.

\section{UTF-8: the new default input encoding}

The first \TeX{} implementations supported reading only 7-bit
\acro{ascii} files---any accented or otherwise ``special'' character
had to be entered using commands, if it could be represented at
all. For example, to obtain an ``\"a'' one would enter \verb=\"a=, and to
typeset a ``\ss'' the command \verb=\ss=. Furthermore, fonts at that
time contained only 128 glyphs, holding the \acro{ascii} characters, some
accents used to build composite glyphs from a letter and an accent, and a
few special symbols such as parentheses.

With 8-bit \TeX{} engines such as \hologo{pdfTeX} this situation changed
somewhat: it became possible to process 8-bit files, i.e., files that
could encode 256 different characters. However, 256 is still a fairly
small number, so with this limitation it is only possible to encode a
few languages; for other languages one would need to change the
encoding (i.e., interpret the character positions 0--255 in a
different way). The first code points (0--127) were essentially
standardized (corresponding to \acro{ascii}), while the second half
(128--255) would vary, holding different accented characters to support
a certain set of languages.

Each computer used one of these encodings when storing or interpreting
files and as long as two computers used the same encoding it was
(easily) possible to exchange files between them and have them
interpreted and processed correctly.

But different computers may have used different encodings, and given
that a computer file is simply a sequence of bytes with no indication of
the encoding for which it was intended, chaos could easily happen and
did happen. For example, the German word ``Gr\"o\ss e'' (height) entered on a
German keyboard could show up as ``Gr\v T\`ae'' on a different
computer using a different default encoding.

So, in summary, the situation was far from ideal, and it was clear in
the early nineties that \LaTeXe{} (which was being developed to provide
a \LaTeX{} version usable across the world) had to provide a solution
to this issue.

The \LaTeXe{} answer was the introduction of the \package{inputenc}
package~\cite{Mittelbach:Brno95}, through which it is possible to
provide support for multiple encodings. It also makes it possible to
correctly process a file written in one encoding on a computer using a
different encoding, and it even supports documents in which the
encoding changes midway.

Since the first release of \LaTeXe{} in 1994, \LaTeX{} documents that
used any characters outside \acro{ascii} in the source (i.e., any
characters in the range 128--255) were supposed to load
\package{inputenc} and specify in which file encoding they were
written and stored.
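As a minimal sketch (the \texttt{latin1} option here is only an
example; the option given must match the encoding in which the file is
actually stored), such a declaration looks as follows:
\begin{verbatim}
\documentclass{article}
\usepackage[latin1]{inputenc}
\begin{document}
... 8-bit characters in Latin-1 ...
\end{document}
\end{verbatim}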
%
If the \package{inputenc} package was not loaded then \LaTeX{} used a
``raw'' encoding which essentially took each byte from the input file
and typeset the glyph that happened to be in that position in the
current font---something that sometimes produces the right result but
often enough will not.

In 1992 Ken Thompson and Rob Pike developed the UTF-8 encoding scheme,
which encodes all Unicode characters as sequences of 8-bit bytes,
and over time this encoding has gradually taken over the world,
replacing the legacy 8-bit encodings used before. These days all major
computer operating systems use UTF-8 to store their files, and it
requires some effort to explicitly store files in one of the legacy
encodings.

As a result, whenever \LaTeX{} users want to use any accented
characters from their keyboard (instead of resorting to \verb=\"a= and
the like) they have always had to write
\begin{verbatim}
\usepackage[utf8]{inputenc}
\end{verbatim}
in the preamble of their documents, as otherwise \LaTeX{} would produce
gibberish.

\subsection*{The new default}

With this release, the default encoding for \LaTeX\ files has been
changed from the ``fall through raw'' encoding to UTF-8 if used with
classic \TeX\ or \hologo{pdfTeX}. The implementation is essentially
the same as the existing UTF-8 support from
\verb|\usepackage[utf8]{inputenc}|.

The \hologo{LuaTeX} and \hologo{XeTeX} engines always supported the
UTF-8 encoding as their native (and only) input encoding, so with
these engines \package{inputenc} was always a no-op.

This means that with new documents one can assume UTF-8 input, and it
is no longer necessary to specify
\verb|\usepackage[utf8]{inputenc}|. If this line is present it
will do no harm, however.
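A minimal sketch of a document that now works out of the box with
\hologo{pdfTeX} when the file is stored as UTF-8 (no
\package{inputenc} line required):
\begin{verbatim}
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
Größe  % typed directly as UTF-8 characters
\end{document}
\end{verbatim}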


\subsection*{Compatibility}

For most existing documents this change will be transparent:
\begin{itemize}
\item documents using only \acro{ascii} in the input file and
accessing accented characters via commands;
\item documents that specified the encoding of their file via an
option to the \package{inputenc} package and then used 8-bit
characters in that encoding;
\item documents that already had been stored in UTF-8 (whether or not
specifying this via \package{inputenc}).
\end{itemize}
Only documents that have been stored in a legacy encoding and used
accented letters from the keyboard \emph{without} loading
\package{inputenc} (relying on the similarities between the input used
and the T1 font encoding) are affected.

These documents will now generate an error saying that they contain
invalid UTF-8 sequences. However, such documents may easily be
processed by adding the new command \verb|\UseRawInputEncoding| as the
first line of the file. This reinstates the previous ``raw'' encoding
default.
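As a sketch, such a file would start like this (the rest of the
document stays unchanged):
\begin{verbatim}
\UseRawInputEncoding
\documentclass{article}
...
\end{verbatim}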

\verb|\UseRawInputEncoding| may also be used on the command line to
process existing files without requiring the file to be edited:
\begin{verbatim}
pdflatex '\UseRawInputEncoding \input' file
\end{verbatim}
will process the file using the previous default encoding.

Possible alternatives are reencoding the file to UTF-8 using a tool
(such as \texttt{recode} or \texttt{iconv}, or an editor) or adding the line
\begin{flushleft}
\verb= \usepackage[=\meta{encoding}\verb=]{inputenc}=
\end{flushleft}
to the preamble, specifying the \meta{encoding} that matches the file
encoding. In many cases this will be \texttt{latin1} or
\texttt{cp1252}. For other encoding names and their meanings see the
\package{inputenc} documentation.

As usual, this change may also be reverted via the more general
\package{latexrelease} package mechanism, by specifying a release date
earlier than this release.
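A sketch of such a rollback via \package{latexrelease} (the specific
date is only an example; any date earlier than this release has the
same effect):
\begin{verbatim}
\RequirePackage[2018-01-01]{latexrelease}
\documentclass{article}
...
\end{verbatim}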

\section[A general rollback concept]
{A general rollback concept for packages and classes}

In 2015 a rollback concept for the \LaTeX{} kernel was introduced.
Providing this feature allowed us to make corrections to the
\section{Integration of \pkg{remreset} and \pkg{chngcntr} packages}



\section{Changes to packages in the tools category}

\subsection{\LaTeX{} table columns with fixed widths}
\subsection{Obscure overprinting with \pkg{multicol} fixed}

\begin{thebibliography}{9}

\bibitem{Mittelbach:TB39-1} Frank Mittelbach:
\emph{New rules for reporting bugs in the \LaTeX{} core software}.
Submitted to TUGBoat.
\url{https://www.latex-project.org/publications/}

\bibitem{Mittelbach:Brno95} Frank Mittelbach:
\emph{\LaTeXe{} Encoding Interface --- Purpose, concepts, and
Open Problems}.
Talk given in Brno June 1995.
\url{https://www.latex-project.org/publications/}

\bibitem{Mittelbach:TB39-2} Frank Mittelbach:
\emph{A rollback concept for packages and classes}.
Submitted to TUGBoat.
\url{https://www.latex-project.org/publications/}

\bibitem{Mittelbach:TB38-2-213} Frank Mittelbach:
\emph{\LaTeX{} table columns with fixed widths}.
In: TUGBoat, 38\#2, 2017.
\url{https://www.latex-project.org/publications/}

\end{thebibliography}

\end{document}
