diff --git a/doc/ltnews28.tex b/doc/ltnews28.tex index 07df086b9..59d5eebbb 100644 --- a/doc/ltnews28.tex +++ b/doc/ltnews28.tex @@ -35,6 +35,10 @@ \usepackage{lmodern,url,hologo} +\providecommand\acro[1]{\textsc{#1}} +\providecommand\meta[1]{$\langle$\textit{#1}$\rangle$} + + \publicationmonth{April} \publicationyear{2018} @@ -47,7 +51,7 @@ \setlength\rightskip{0pt plus 3em} -\section{New home for \LaTeXe{} sources} +\section{A new home for \LaTeXe{} sources} In the past the development version of the \LaTeXe{} source files has been managed in a Subversion source control system with read access @@ -81,43 +85,146 @@ \section{Bug reports for core \LaTeXe{}} \end{quote} and with further details also discussed in~\cite{Mittelbach:TB39-1}. -\section{Default input encoding} -Since the release of \LaTeXe, \LaTeX\ has supported multiple file encodings -via the \package{inputenc} package. It used to be necessary to support several -different input encodings to support different languages. These days Unicode -and in particular the UTF-8 file encoding can support multiple languages -in a single encoding. UTF-8 is the default encoding in most current operating -systems and editors, and is the only encoding natively supported by -\hologo{LuaTeX} and \hologo{XeTeX}. - -With this release, the default encoding for \LaTeX\ files has been -changed to UTF-8 if used with classic \TeX\ or PDF\TeX. The -implementation is essentially the same as the existing UTF-8 support -from \verb|\usepackage[utf8]{inputenc}|. -Documents using non ASCII characters should already be specifying the -encoding used via an option to the \package{inputenc} package. Such -documents should not be affected by this change in default. +\section{UTF-8: the new default input encoding} + +The first \TeX{} implementations only supported reading 7-bit +\acro{ascii} files---any accented or otherwise ``special'' character +had to be entered using commands, if it could be represented at +all. 
For example, to obtain an ``\"a'' one would enter \verb=\"a=, and to +typeset a ``\ss'' the command \verb=\ss=. Furthermore, fonts at that +time had 128 glyphs inside, holding the \acro{ascii} characters, some +accents to build composite glyphs from a letter and an accent, and a +few special symbols such as parentheses, etc. + +With 8-bit \TeX{} engines such as \hologo{pdfTeX} this situation changed +somewhat: it was now possible to process 8-bit files, i.e., files that +could encode 256 different characters. However, 256 is still a fairly +small number and with this limitation it is only possible to encode a +few languages and for other languages one would need to change the +encoding (i.e., interpret the character positions 0--255 in a +different way). The first code points 0--127 were essentially standardized +(corresponding to \acro{ascii}) while the second half 128--255 would +vary by holding different accented characters to support a certain set +of languages. + +Each computer used one of these encodings when storing or interpreting +files and as long as two computers used the same encoding it was +(easily) possible to exchange files between them and have them +interpreted and processed correctly. + +But different computers may have used different encodings and given +that a computer file is simply a sequence of bytes with no indication for +which encoding it was destined, chaos could easily happen and +did. For example, the German word ``Gr\"o\ss e'' (height) entered on a +German keyboard could show up as ``Gr\v T\`ae'' on a different +computer using a different encoding by default. + +So in summary the situation was far from satisfactory and it was clear in +the early nineties that \LaTeXe{} (that was being developed to provide +a \LaTeX{} version usable across the world) had to provide a solution +to this issue.
+ +The \LaTeXe{} answer was the introduction of the \package{inputenc} +package~\cite{Mittelbach:Brno95} through which it is possible to +provide support for multiple encodings. It also makes it possible to correctly +process a file written in one encoding on a computer using a different +encoding and even supports documents where the encoding changes +midway. + +Since the first release of \LaTeXe{} in 1994, \LaTeX{} documents that +used any characters outside \acro{ascii} in the source (i.e., any +characters in the range of 128--255) were supposed to load +\package{inputenc} and specify in which file encoding they were +written and stored. +% +If the \package{inputenc} package was not loaded then \LaTeX{} used a +``raw'' encoding which essentially took each byte from the input file +and typeset the glyph that happened to be in that position in the +current font---something that sometimes produces the right result but +often enough will not. + +In 1992 Ken Thompson and Rob Pike developed the UTF-8 encoding scheme +which makes it possible to encode all Unicode characters as sequences of 8-bit bytes +and over time this encoding has gradually taken over the world, +replacing the legacy 8-bit encodings used before. These days all major +computer operating systems use UTF-8 to store their files and it +requires some effort to explicitly store files in one of the legacy +encodings. + +As a result, whenever \LaTeX{} users want to use any accented +characters from their keyboard (instead of resorting to \verb=\"a= and +the like) they always have to use +\begin{verbatim} + \usepackage[utf8]{inputenc} +\end{verbatim} +in the preamble of their documents as otherwise \LaTeX{} will produce +gibberish. + +\subsection*{The new default} -Some documents would have been using accemted letters \emph{without} -loading \package{inputenc}, relying on the similarities between the -input used and the T1 font encoding.
These documents will generate an -error that they are not valid UTF-8, however the documents may be -easily processed by specifying the encoding used by adding a line such -as \verb|\usepackage[utf8]{inputenc}|, or adding the new command -\verb|\UseRawInputEncoding| as the first line of the file. This will -re-instate the previous default. +With this release, the default encoding for \LaTeX\ files has been +changed from the ``fall through raw'' encoding to UTF-8 if used with +classic \TeX\ or \hologo{pdfTeX}. The implementation is essentially +the same as the existing UTF-8 support from +\verb|\usepackage[utf8]{inputenc}|. + +The \hologo{LuaTeX} and \hologo{XeTeX} engines always supported the +UTF-8 encoding as their native (and only) input encoding, so with +these engines \package{inputenc} was always a no-op. + +This means that with new documents one can assume UTF-8 input and it +is no longer required to always specify +\verb|\usepackage[utf8]{inputenc}|. But if this line is present it +will not hurt either. + + +\subsection*{Compatibility} + +For most existing documents this change will be transparent: +\begin{itemize} +\item documents using only \acro{ascii} in the input file and + accessing accented characters via commands; +\item documents that specified the encoding of their file via an + option to the \package{inputenc} package and then used 8-bit + characters in that encoding; +\item documents that already had been stored in UTF-8 (whether or not + specifying this via \package{inputenc}). +\end{itemize} +Only documents that have been stored in a legacy encoding and used +accented letters from the keyboard \emph{without} loading +\package{inputenc} (relying on the similarities between the input used +and the T1 font encoding) are affected. + +These documents will now generate an error that they contain invalid +UTF-8 sequences.
However, such documents may be easily processed by +adding the new command \verb|\UseRawInputEncoding| as the first line +of the file. This will re-instate the previous ``raw'' encoding +default. \verb|\UseRawInputEncoding| may also be used on the commandline to -process existing files without requiring the file to be edited\\ - \verb|pdflatex '\UseRawInputEncoding \input' file|\\ +process existing files without requiring the file to be edited +\begin{verbatim} + pdflatex '\UseRawInputEncoding \input' file +\end{verbatim} will process the file using the previous default encoding. +Possible alternatives are reencoding the file to UTF-8 using a tool +(such as recode or iconv or an editor) or adding the line +\begin{flushleft} +\verb= \usepackage[=\meta{encoding}\verb=]{inputenc}= +\end{flushleft} +to the preamble specifying the \meta{encoding} that fits the file +encoding. In many cases this will be \texttt{latin1} or +\texttt{cp1252}. For other encoding names and their meaning see the +\package{inputenc} documentation. + As usual, this change may also be reverted via the more general \package{latexrelease} package mechanism, by specifying a release date earlier than this release. -\section{General rollback concept for packages and classes} +\section[A general rollback concept] + {A general rollback concept for packages and classes} In 2015 a rollback concept for the \LaTeX{} kernel was introduced. Providing this feature allowed us to make corrections to the @@ -156,10 +263,6 @@ \section{Integration of \pkg{remreset} and \pkg{chngcntr} packages -\section{Further TU encoding improvements} - -Anything here? - \section{Changes to packages in the tools category} \subsection{\LaTeX{} table columns with fixed widths} @@ -189,21 +292,57 @@ \subsection{Obscure overprinting with \pkg{multicol} fixed} \begin{thebibliography}{9} -\bibitem{Mittelbach:TB38-2-213} Frank Mittelbach: - \emph{\LaTeX{} table columns with fixed widths}. - In: TUGBoat, 38\#2, 2017.
- \url{https://www.latex-project.org/publications/} - \bibitem{Mittelbach:TB39-1} Frank Mittelbach: \emph{New rules for reporting bugs in the \LaTeX{} core software}. Submitted to TUGBoat. \url{https://www.latex-project.org/publications/} +\bibitem{Mittelbach:Brno95} Frank Mittelbach: + \emph{\LaTeXe{} Encoding Interface --- Purpose, concepts, and + Open Problems}. + Talk given in Brno June 1995. + \url{https://www.latex-project.org/publications/} + \bibitem{Mittelbach:TB39-2} Frank Mittelbach: \emph{A rollback concept for packages and classes}. Submitted to TUGBoat. \url{https://www.latex-project.org/publications/} +\bibitem{Mittelbach:TB38-2-213} Frank Mittelbach: + \emph{\LaTeX{} table columns with fixed widths}. + In: TUGBoat, 38\#2, 2017. + \url{https://www.latex-project.org/publications/} + \end{thebibliography} \end{document} + + + +Since the release of \LaTeXe, \LaTeX\ has supported multiple file encodings +via the \package{inputenc} package. It used to be necessary to support several +different input encodings to support different languages. These days Unicode +and in particular the UTF-8 file encoding can support multiple languages +in a single encoding. UTF-8 is the default encoding in most current operating +systems and editors, and is the only encoding natively supported by +\hologo{LuaTeX} and \hologo{XeTeX}. + +Documents using non-ASCII characters should already be specifying the +encoding used via an option to the \package{inputenc} package. Such +documents should not be affected by this change in default. + + +Some documents would have been using accented letters \emph{without} +loading \package{inputenc}, relying on the similarities between the +input used and the T1 font encoding.
These documents will generate an +error that they are not valid UTF-8, however the documents may be +easily processed by specifying the encoding used by adding a line such +as \verb|\usepackage[utf8]{inputenc}|, or adding the new command +\verb|\UseRawInputEncoding| as the first line of the file. This will +re-instate the previous default. + +\verb|\UseRawInputEncoding| may also be used on the commandline to +process existing files without requiring the file to be edited\\ + \verb|pdflatex '\UseRawInputEncoding \input' file|\\ +will process the file using the previous default encoding. +