//! Ligature and kern data
//!
//! The TFM file format can provide information about ligatures and kerns.
//! A [ligature](https://en.wikipedia.org/wiki/Ligature_(writing))
//! is a special character that can replace two or more adjacent characters.
//! For example, the pair of characters ae can be replaced by the æ ligature which is a single character.
//! A [kern](https://en.wikipedia.org/wiki/Kerning) is a special space inserted between
//! two adjacent characters to align them better.
//! For example, a kern can be inserted between A and V to compensate for the large
//! amount of space created by the specific combination of these two characters.
//!
//! ## The lig/kern programming language
//!
//! TFM provides ligature and kern data in the form of
//! "instructions in a simple programming language that explains what to do for special letter pairs"
//! (quoting TFtoPL.2014.13).
//! This lig/kern programming language can be used to specify instructions like
//! "replace the pair (a,e) by æ" and
//! "insert a kern of width -0.1pt between the pair (A,V)".
//! But it can also specify more complex behaviors.
//! For example, a lig/kern program can specify "replace the pair (x,y) by the pair (z,y)".
//!
//! In general, for any pair of characters (x,y), the program specifies zero or one lig/kern instructions.
//! After this instruction is executed, there may be a new
//! pair of characters remaining, as in the (x,y) to (z,y) instruction.
//! The lig/kern instruction for this pair is then executed, if it exists.
//! This process continues until there are no more instructions left to run.
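/*!
As a rough illustration of this loop, here is a minimal sketch that assumes a toy
instruction set with only two kinds of ligature and uses plain `char` values.
The names `ToyInstruction` and `run_pair` are hypothetical and are not part of this module.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified instruction set for illustration only.
#[derive(Clone, Copy)]
enum ToyInstruction {
    /// Replace the pair (x, y) by the single character z.
    ReplaceBoth(char),
    /// Replace the pair (x, y) by the pair (z, y).
    ReplaceLeft(char),
}

/// Runs the toy program on one pair, looking up at most `max_steps` instructions.
/// Returns the characters that remain, or `None` if the step budget ran out.
fn run_pair(
    program: &HashMap<(char, char), ToyInstruction>,
    mut left: char,
    right: char,
    max_steps: usize,
) -> Option<Vec<char>> {
    for _ in 0..max_steps {
        match program.get(&(left, right)) {
            // No instruction for this pair: execution stops here.
            None => return Some(vec![left, right]),
            // The ligature consumes both characters, so no pair remains.
            Some(ToyInstruction::ReplaceBoth(z)) => return Some(vec![*z]),
            // A new pair (z, right) remains; loop to look up its instruction.
            Some(ToyInstruction::ReplaceLeft(z)) => left = *z,
        }
    }
    None
}
```

With the two rules "replace (a,e) by æ" and "replace (x,y) by (z,y)", and no rule for (z,y),
`run_pair` leaves a single æ for the first pair and the pair (z,y) for the second.
*/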
//!
//! Lig/kern instructions are represented in this module by the [`lang::Instruction`] type.
//!
//! ## Related code by Knuth
//!
//! The TFtoPL and PLtoTF programs don't contain any code for running lig/kern programs.
//! They only contain logic for translating between the `.tfm` and `.pl`
//! formats for lig/kern programs, and for doing some validation as described below.
//! Lig/kern programs are actually executed in TeX; see KnuthTeX.2021.1032-1040.
//!
//! One of the challenges with lig/kern programs is that they can contain infinite loops.
//! Here is a simple example of a lig/kern program with two instructions and an infinite loop:
//!
//! - Replace (x,y) with (z,y) (in property list format, `(LABEL C x)(LIG/ C y C z)`)
//! - Replace (z,y) with (x,y) (in property list format, `(LABEL C z)(LIG/ C y C x)`)
//!
//! When this program runs, (x,y) will be swapped with (z,y) ad infinitum.
//! See TFtoPL.2014.88 for more examples.
//!
//! Both TFtoPL and PLtoTF contain code that checks that a lig/kern program
//! does not contain infinite loops (TFtoPL.2014.88-95 and PLtoTF.2014.116-125).
//! The algorithm for detecting infinite loops is a topological sorting algorithm
//! over a graph where each node is a pair of characters.
//! However it's a bit complicated because the full graph cannot be constructed without
//! running the lig/kern program.
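/*!
To make the two-instruction loop above concrete, the following sketch detects it by
remembering which pairs have already been visited.
This is deliberately not the topological sorting algorithm used by TFtoPL and PLtoTF;
it is a simpler technique that only works for a toy model in which every rule has the
form "replace (x,y) with (z,y)". The names `ToyProgram` and `follow` are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Toy program: maps a pair (x, y) to the character z that replaces x,
/// i.e. the rule "replace (x, y) with (z, y)". Hypothetical representation.
type ToyProgram = HashMap<(char, char), char>;

/// Follows the chain of rewrites starting at `start`.
/// Returns `Ok(final_pair)` if the chain terminates, or `Err(path)` listing the
/// pairs visited up to the moment a pair repeats.
fn follow(
    program: &ToyProgram,
    start: (char, char),
) -> Result<(char, char), Vec<(char, char)>> {
    let mut seen = HashSet::new();
    let mut path = Vec::new();
    let mut pair = start;
    loop {
        // `insert` returns false if the pair was already visited: a loop.
        if !seen.insert(pair) {
            return Err(path);
        }
        path.push(pair);
        match program.get(&pair) {
            None => return Ok(pair),
            Some(&z) => pair = (z, pair.1),
        }
    }
}
```

For the program above, following (x,y) visits (x,y) and (z,y) and then reports the loop.
*/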
//!
//! TeX does not check for infinite loops, presumably under the assumption that any `.tfm` file will have
//! been generated by PLtoTF and thus already validated.
//! However TeX does check for interrupts when executing lig/kern programs so that
//! at least a user can terminate TeX if an infinite loop is hit.
//! (See the `check_interrupt` line in KnuthTeX.2021.1040.)
//!
//! ## Functionality in this module
//!
//! This module handles lig/kern programs in a different way,
//! inspired by the ["parse don't validate"](https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/)
//! philosophy.
//! This module is able to represent raw lig/kern programs as a vector of [`lang::Instruction`] values.
//! But it can also _compile_ lig/kern programs (into a [`CompiledProgram`]).
//! This compilation process essentially executes the lig/kern program for every possible character pair.
//! The result is a map from each character pair to the full list of
//! replacement characters and kerns for that pair.
//! If there is an infinite loop in the program, this compilation fails with a [`CompilationError`].
//! The compiled program is thus a "parsed" version of the lig/kern program
//! and it is impossible for infinite loops to appear in it.
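/*!
The idea of compiling by executing every pair up front can be sketched on a toy model.
This is only an illustration under simplifying assumptions: characters are `char`,
kern widths are `i32`, and the only instructions are a kern and "replace (x,y) with (z,y)".
The names `Toy` and `compile` below are hypothetical and unrelated to the real compiler in this module.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical toy instruction; kern widths are plain integers here.
#[derive(Clone, Copy)]
enum Toy {
    /// Insert a kern of this width between the two characters.
    Kern(i32),
    /// Replace (x, y) with (z, y) and keep going.
    ReplaceLeft(char),
}

/// For every pair with an instruction, precomputes the final list of
/// (character, kern after it). Returns `Err(pair)` if running `pair` never ends.
fn compile(
    program: &HashMap<(char, char), Toy>,
) -> Result<HashMap<(char, char), Vec<(char, i32)>>, (char, char)> {
    let mut compiled = HashMap::new();
    for &start in program.keys() {
        // The right character never changes in this toy model,
        // so it is enough to remember the left characters seen so far.
        let mut seen = HashSet::new();
        let (mut left, right) = start;
        let result = loop {
            if !seen.insert(left) {
                return Err(start);
            }
            match program.get(&(left, right)) {
                None => break vec![(left, 0), (right, 0)],
                Some(Toy::Kern(k)) => break vec![(left, *k), (right, 0)],
                Some(Toy::ReplaceLeft(z)) => left = *z,
            }
        };
        compiled.insert(start, result);
    }
    Ok(compiled)
}
```

Once compilation succeeds, handling a pair at typesetting time is a single map lookup,
and a looping program is rejected before any text is processed.
*/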
//!
//! An advantage of this model is that the lig/kern program does not need to be repeatedly
//! executed in the main hot loop of TeX.
//! This may make TeX faster.
//! However the compiled lig/kern program does have a larger memory footprint than the raw program,
//! and so it may be slower if TeX is memory bound.
mod compiler;
use crate::Char;
use crate::Number;
use std::collections::HashMap;
/// Types corresponding to the "lig/kern programming language".
///
/// See the documentation on the [`super`] module for information about this programming language.
///
/// The types here are put in a separate module because users of this crate are generally not expected to use them.
/// Instead, users will work with compiled lig/kern programs.
pub mod lang {
use crate::Char;
use crate::Number;
/// A single instruction in a lig/kern program.
///
/// In TFM files, instructions are serialized to 32-bit integers.
///
/// In property list files, instructions are specified using a `(LIG _ _)` or `(KERN _ _)` element,
/// and optionally a `(STOP)` or `(SKIP _)` element directly after.
#[derive(Debug, PartialEq, Eq)]
pub struct Instruction {
/// Specifies the next instruction to consider if this instruction is not applicable,
/// e.g., if the right character of the pair is not `right_char`.
/// If the payload is present, it is the number of lig/kern instructions to skip
/// in the list of all instructions in order to find the next instruction.
/// Otherwise this is the final instruction and there are no more instructions to consider.
pub next_instruction: Option<u8>,
/// This instruction is run if the right character in the pair is this character.
/// Otherwise the next lig/kern instruction for the current character is considered,
/// using the `next_instruction` field.
///
/// After this operation is performed,
/// no more operations need to be performed on this pair.
/// However, the operation may create a new pair,
/// and the lig/kern program will then run for that pair.
/// See the documentation on [`PostLigOperation`] for information on that.
pub right_char: Char,
/// The operation to perform for this instruction.
pub operation: Operation,
}
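/*
The way `next_instruction` chains instructions together can be sketched as follows.
This is a simplified illustration, not the lookup code used by this crate: it assumes that
skipping `n` instructions means the next candidate sits at index `i + n + 1`, and it uses
a hypothetical `ToyInstruction` type with `char` in place of `Char`.

```rust
/// Hypothetical, simplified instruction: only the fields needed for lookup.
struct ToyInstruction {
    /// `Some(n)`: on mismatch, skip `n` instructions; `None`: this is the last one.
    next_instruction: Option<u8>,
    right_char: char,
}

/// Finds the index of the instruction that applies to `right_char`,
/// starting the search at `entry_point`.
fn find_instruction(
    instructions: &[ToyInstruction],
    entry_point: usize,
    right_char: char,
) -> Option<usize> {
    let mut i = entry_point;
    loop {
        // An out-of-range index ends the search.
        let instruction = instructions.get(i)?;
        if instruction.right_char == right_char {
            return Some(i);
        }
        // Assumption: skipping n instructions makes i + n + 1 the next candidate.
        // A `None` payload means there is nothing left to try.
        i += 1 + instruction.next_instruction? as usize;
    }
}
```

Because the index strictly increases on every mismatch, the search always terminates.
*/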
/// A lig/kern operation to perform.
#[derive(Debug, PartialEq, Eq, Copy, Clone)]
pub enum Operation {
/// Insert a kern between the current character and the next character.
/// The variant payload is the size of the kern.
Kern(Number),
/// Perform a ligature step.
/// This inserts `char_to_insert` between the left and right characters,
/// and then performs the post-lig operation.
Ligature {
/// Character to insert.
char_to_insert: Char,
/// What to do after inserting the character.
post_lig_operation: PostLigOperation,
},
}
/// A post-lig operation to perform after performing a ligature operation ([`Operation::Ligature`]).
///
/// A lig operation starts with a pair of characters (x,y) and a "cursor" on x.
/// The operation then inserts another character to get, say, (x,z,y).
/// At this point the cursor is still on x.
/// The post-lig operation does two things:
///
/// 1. First, it potentially deletes x or y or both.
/// 1. Second, it potentially moves the cursor forward.
///
/// After this, if the cursor is not at the end of the list of characters,
/// the lig/kern program is run for the new pair starting at the cursor.
///
/// For example, the post-lig operation [`PostLigOperation::RetainLeftMoveNowhere`] retains
/// x and deletes y, leaving (x,z).
/// It then moves the cursor nowhere, leaving it on x.
/// As a result, the lig/kern program for the pair (x,z) will run.
///
/// On the other hand, if the post-lig operation [`PostLigOperation::RetainLeftMoveToInserted`]
/// runs, y is still deleted but the cursor moves to z.
/// This is the last character in the list, and there are no more pairs of characters to consider.
/// The lig/kern program thus terminates.
///
/// In general all of the post-lig operations are of the form `RetainXMoveY` where `X`
/// specifies the characters to retain and `Y` specifies where the cursor should move.
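/**
The effect of each operation on the triple (x,z,y) can be tabulated in a short sketch.
The cursor positions below are read off the variant names, and the `PostLig` enum and the
`apply` and `next_pair` functions are hypothetical stand-ins that use `char` instead of `Char`.

```rust
/// Hypothetical mirror of the eight post-lig operations, for illustration.
#[derive(Clone, Copy)]
enum PostLig {
    RetainBothMoveNowhere,
    RetainBothMoveToInserted,
    RetainBothMoveToRight,
    RetainRightMoveToInserted,
    RetainRightMoveToRight,
    RetainLeftMoveNowhere,
    RetainLeftMoveToInserted,
    RetainNeitherMoveToInserted,
}

/// Applies `op` to the triple (x, z, y), where z was just inserted.
/// Returns the remaining characters and the index of the cursor in them.
fn apply(op: PostLig, x: char, z: char, y: char) -> (Vec<char>, usize) {
    use PostLig::*;
    match op {
        RetainBothMoveNowhere => (vec![x, z, y], 0),
        RetainBothMoveToInserted => (vec![x, z, y], 1),
        RetainBothMoveToRight => (vec![x, z, y], 2),
        RetainRightMoveToInserted => (vec![z, y], 0),
        RetainRightMoveToRight => (vec![z, y], 1),
        RetainLeftMoveNowhere => (vec![x, z], 0),
        RetainLeftMoveToInserted => (vec![x, z], 1),
        RetainNeitherMoveToInserted => (vec![z], 0),
    }
}

/// The pair the program runs on next, or `None` if the cursor is on the last character.
fn next_pair(chars: &[char], cursor: usize) -> Option<(char, char)> {
    Some((*chars.get(cursor)?, *chars.get(cursor + 1)?))
}
```

In this sketch, `RetainLeftMoveNowhere` leaves the pair (x,z) to run next, while
`RetainLeftMoveToInserted` leaves no pair, matching the two examples above.
*/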
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum PostLigOperation {
/// Corresponds to the `/LIG/` property list element.
RetainBothMoveNowhere,
/// Corresponds to the `/LIG/>` property list element.
RetainBothMoveToInserted,
/// Corresponds to the `/LIG/>>` property list element.
RetainBothMoveToRight,
/// Corresponds to the `LIG/` property list element.
RetainRightMoveToInserted,
/// Corresponds to the `LIG/>` property list element.
RetainRightMoveToRight,
/// Corresponds to the `/LIG` property list element.
RetainLeftMoveNowhere,
/// Corresponds to the `/LIG>` property list element.
RetainLeftMoveToInserted,
/// Corresponds to the `LIG` property list element.
RetainNeitherMoveToInserted,
}
}
/// A compiled lig/kern program.
#[derive(Debug)]
pub struct CompiledProgram {
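/// Maps each left character to the range `[lower, upper)` of its entries in `pairs`.
left_to_pairs: HashMap<Char, (u16, u16)>,
/// Right characters and their replacements, grouped by left character.
pairs: Vec<(Char, RawReplacement)>,
/// Backing storage for the middle characters and kerns of all replacements.
middle_chars: Vec<(Char, Number)>,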
}
#[derive(Debug, Clone)]
struct RawReplacement {
left_char_operation: LeftCharOperation,
middle_char_bounds: std::ops::Range<u16>,
last_char: Char,
}
impl CompiledProgram {
/// Compile a lig/kern program.
pub fn compile(
instructions: &[lang::Instruction],
entry_points: HashMap<Char, usize>,
) -> Result<(CompiledProgram, Vec<CompilationWarning>), CompilationError> {
compiler::compile(instructions, &entry_points)
}
/// Get an iterator over the full lig/kern replacement for a pair of characters.
pub fn get_replacement_iter(&self, left_char: Char, right_char: Char) -> ReplacementIter {
self.get_replacement(left_char, right_char)
.into_iter(left_char)
}
/// Get the full lig/kern replacement for a pair of characters.
pub fn get_replacement(&self, left_char: Char, right_char: Char) -> Replacement {
if let Some((lower, upper)) = self.left_to_pairs.get(&left_char) {
for (candidate_right_char, replacement) in
&self.pairs[(*lower as usize)..(*upper as usize)]
{
if *candidate_right_char != right_char {
continue;
}
return if replacement.middle_char_bounds.end == 0 {
Replacement {
left_char_operation: replacement.left_char_operation,
middle_chars: &[],
last_char: replacement.last_char,
}
} else {
Replacement {
left_char_operation: replacement.left_char_operation,
middle_chars: &self.middle_chars[replacement.middle_char_bounds.start
as usize
..replacement.middle_char_bounds.end as usize],
last_char: replacement.last_char,
}
};
}
}
Replacement::no_op(right_char)
}
/// Returns an iterator over all pairs `(char,char)` that have a replacement
/// specified in the lig/kern program.
pub fn all_pairs_having_replacement(&self) -> impl '_ + Iterator<Item = (Char, Char)> {
PairsIter {
current_left: Char(0),
left_iter: self.left_to_pairs.iter(),
right_chars: vec![],
program: self,
}
}
}
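/*
The flattened layout used by `CompiledProgram` (a map from the left character to a range
in one shared vector) can be illustrated with a small stand-alone sketch.
The `Table` type below is hypothetical: it uses `char` for characters and a string label
in place of a real replacement, and it only shows the shape of the lookup.

```rust
use std::collections::HashMap;

/// Hypothetical flattened table: for each left character, a range into `pairs`.
struct Table {
    left_to_pairs: HashMap<char, (u16, u16)>,
    pairs: Vec<(char, &'static str)>,
}

impl Table {
    /// Looks up the entry for (left, right) by scanning only the left character's range.
    fn lookup(&self, left: char, right: char) -> Option<&'static str> {
        let &(lower, upper) = self.left_to_pairs.get(&left)?;
        self.pairs[lower as usize..upper as usize]
            .iter()
            .find(|(candidate, _)| *candidate == right)
            .map(|(_, replacement)| *replacement)
    }
}
```

Storing all pairs in one vector keeps the data contiguous; the linear scan is over the
right characters of a single left character only.
*/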
/// An error returned from lig/kern compilation.
#[derive(Debug, PartialEq, Eq)]
pub struct CompilationError {
/// The pair of characters that starts the infinite loop.
pub starting_pair: (Char, Char),
/// A sequence of steps forming the infinite loop.
///
/// At the end of these steps, the next pair to replace will be the `starting_pair` again.
pub infinite_loop: Vec<InfiniteLoopStep>,
}
/// One step in a lig/kern infinite loop.
///
/// A vector of these steps is returned in a [`CompilationError`].
#[derive(Debug, PartialEq, Eq)]
pub struct InfiniteLoopStep {
/// The index of the instruction to apply in this step.
pub instruction_index: usize,
/// The replacement text after applying this step.
pub post_replacement: Vec<Char>,
/// The position of the cursor after applying this step.
pub post_cursor_position: usize,
}
/// A warning returned from lig/kern compilation.
#[derive(Debug)]
pub enum CompilationWarning {
InvalidNextInstruction,
DuplicateRule,
OrphanRule,
}
/// Data structure describing the replacement of a character pair in a lig/kern program.
pub struct Replacement<'a> {
/// Operation to perform on the left character.
pub left_char_operation: LeftCharOperation,
/// Slice of characters and kerns to insert after the left character.
pub middle_chars: &'a [(Char, Number)],
/// Last character to insert.
pub last_char: Char,
}
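impl<'a> Replacement<'a> {
fn no_op(right_char: Char) -> Replacement<'a> {
Replacement {
left_char_operation: LeftCharOperation::Retain,
middle_chars: &[],
last_char: right_char,
}
}
/// Convert this replacement into an iterator over the resulting characters,
/// each paired with the kern that follows it.
///
/// The left character is not stored in the replacement, so it must be provided here.
pub fn into_iter(self, left_char: Char) -> ReplacementIter<'a> {
ReplacementIter {
left_char,
full_operation: self,
state: IterState::LeftChar,
}
}
}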
/// Operation to perform on the left character of a lig/kern pair.
#[derive(PartialEq, Eq, Debug, Copy, Clone)]
pub enum LeftCharOperation {
/// Retain the left character and do not add a kern.
Retain,
/// Delete the left character.
Delete,
/// Retain the left character and append the specified kern.
AppendKern(Number),
}
/// Iterator over the replacement of a character pair in a lig/kern program.
pub struct ReplacementIter<'a> {
left_char: Char,
full_operation: Replacement<'a>,
state: IterState,
}
enum IterState {
LeftChar,
MiddleChar(usize),
LastChar,
Exhausted,
}
impl<'a> ReplacementIter<'a> {
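/// Computes the state that follows the current one, together with the item
/// produced by the current state, if any.
fn i(&self) -> (IterState, Option<(Char, Number)>) {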
match self.state {
IterState::LeftChar => (
IterState::MiddleChar(0),
match self.full_operation.left_char_operation {
LeftCharOperation::Retain => Some((self.left_char, Number::ZERO)),
LeftCharOperation::Delete => None,
LeftCharOperation::AppendKern(kern) => Some((self.left_char, kern)),
},
),
IterState::MiddleChar(i) => match self.full_operation.middle_chars.get(i).copied() {
None => (IterState::LastChar, None),
Some(t) => (IterState::MiddleChar(i + 1), Some(t)),
},
IterState::LastChar => (
IterState::Exhausted,
Some((self.full_operation.last_char, Number::ZERO)),
),
IterState::Exhausted => (IterState::Exhausted, None),
}
}
}
impl<'a> Iterator for ReplacementIter<'a> {
type Item = (Char, Number);
fn next(&mut self) -> Option<Self::Item> {
loop {
let (state, r) = self.i();
self.state = state;
match (&self.state, r) {
(_, Some(t)) => return Some(t),
(IterState::Exhausted, _) => return None,
(_, _) => {}
}
}
}
}
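/*
The order in which `ReplacementIter` yields items (the left character unless it is deleted,
then the middle characters, then the last character) can be sketched without the state machine.
The `LeftOp` enum and `expand` function are hypothetical stand-ins that use `char` for
characters and `i32` for kern widths.

```rust
/// Hypothetical stand-in for the operation on the left character.
#[derive(Clone, Copy)]
enum LeftOp {
    Retain,
    Delete,
    AppendKern(i32),
}

/// Expands a replacement into the sequence of (character, kern after it).
fn expand(left: char, op: LeftOp, middle: &[(char, i32)], last: char) -> Vec<(char, i32)> {
    // The left character contributes zero or one item.
    let first = match op {
        LeftOp::Retain => Some((left, 0)),
        LeftOp::Delete => None,
        LeftOp::AppendKern(kern) => Some((left, kern)),
    };
    first
        .into_iter()
        .chain(middle.iter().copied())
        // The last character is always present and carries no kern.
        .chain(std::iter::once((last, 0)))
        .collect()
}
```

A plain kern between A and V thus expands to A followed by the kern and then V,
while a ligature that deletes the left character expands to the ligature character alone.
*/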
/// An iterator over all pairs of characters that have a lig/kern replacement in a program.
struct PairsIter<'a, L> {
current_left: Char,
left_iter: L,
right_chars: Vec<Char>,
program: &'a CompiledProgram,
}
impl<'a, L: 'a + Iterator<Item = (&'a Char, &'a (u16, u16))>> Iterator for PairsIter<'a, L> {
type Item = (Char, Char);
fn next(&mut self) -> Option<Self::Item> {
loop {
if let Some(right_char) = self.right_chars.pop() {
return Some((self.current_left, right_char));
}
// Move to the next left character. Looping (rather than unwrapping a pop)
// avoids a panic if a left character has an empty range of pairs.
let (&new_left, (lower, upper)) = self.left_iter.next()?;
self.current_left = new_left;
self.right_chars = self.program.pairs[*lower as usize..*upper as usize]
.iter()
.map(|t| t.0)
.collect();
}
}
}