verbatim delimiter #421

Alex-Jordan · 2019-09-10T05:30:41Z

Once upon a time (actually still in 2.14), in lib/Value/String.pm, there was this code:

#
#  Mark a string to be display verbatim
#
sub verb {shift; return "\\verb".chr(0x85).(shift).chr(0x85)}

The idea is the \verb LaTeX command is going to be used on a string answer, and it needs a delimiter character. Character 0x85, ASCII 133, was chosen because it would be crazy for a student to have that as part of a string answer they "typed".

Then in 539406c, the character changed to 0x1F, ASCII 31, the "unit separator character". This brought it down into 7 bit ASCII, and Geoff's comment in the commit suggests this has something to do with the utf8 conversion.

So now we have string answers that use character 0x1F in their display. This is causing an issue with PreTeXt. When WW processes a problem with "PTX" display mode, it makes XML. For each answer of the problem, it makes a single XML element, with lots of attributes and values that correspond to the Perl answer hash's keys and values. For an example, see:

https://webwork-ptx.aimath.org/webwork2/html2xml?courseID=anonymous&userID=anonymous&password=anonymous&course_password=anonymous&answersSubmitted=0&displayMode=PTX&outputformat=ptx&problemSeed=8435&sourceFilePath=Library/PCC/BasicAlgebra/Geometry/CylinderVolume10.pg

and view source, since your web browser is likely to try to read the XML as HTML.

So you can see how a string like \verb<0x1F>foo<0x1F> could end up inside a value for an attribute of one of these XML elements. The problem is that XML does not allow this character in an attribute value. (Well, there are varying standards for what is allowed, but even when this one is allowed, its use is discouraged, and anyway, its presence causes the python validator we use to declare this to be invalid XML.)

So. We want a character that a student will not be able to type with normal use of the keyboard. So nothing in ASCII 32--127. And we want a character that is valid for XML in an attribute value. So nothing in ASCII 00--31. So we have to leave 7-bit ASCII to meet both conditions. Is it possible to do this? Can a character be chosen somewhere else in utf8 and that be compatible with the utf8 conversion happening now?
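For reference, the two conditions above can be checked mechanically. This is a minimal sketch, assuming the XML 1.0 Char production and the first two "discouraged" ranges listed in that specification. The helper names is_xml10_char, is_discouraged, and is_typeable are made up for illustration and are not part of WeBWorK or PreTeXt.

```perl
use strict;
use warnings;

# True if code point $n matches the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
sub is_xml10_char {
  my $n = shift;
  return 1 if $n == 0x9 || $n == 0xA || $n == 0xD;
  return 1 if $n >= 0x20    && $n <= 0xD7FF;
  return 1 if $n >= 0xE000  && $n <= 0xFFFD;
  return 1 if $n >= 0x10000 && $n <= 0x10FFFF;
  return 0;
}

# True if $n is legal but discouraged (only the ranges near ASCII are checked here).
sub is_discouraged {
  my $n = shift;
  return 1 if $n >= 0x7F && $n <= 0x84;
  return 1 if $n >= 0x86 && $n <= 0x9F;
  return 0;
}

# True if $n is printable ASCII, i.e. something a student can type directly.
sub is_typeable {
  my $n = shift;
  return ($n >= 0x20 && $n <= 0x7E) ? 1 : 0;
}

for my $n (0x0D, 0x1F, 0x7F, 0x85) {
  printf "0x%02X  xml=%d  discouraged=%d  typeable=%d\n",
    $n, is_xml10_char($n), is_discouraged($n), is_typeable($n);
}
```

Under these rules 0x1F fails the Char test, 0x7F passes but is discouraged, and 0x0D and 0x85 pass without being typeable. Other XML versions and validators may draw the lines differently.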


dpvc · 2019-09-10T14:33:03Z

How about ASCII 127 (U+007F, DELETE), since the DELETE character is hard to get into a student's answer string (pressing delete causes an action rather than inserting the character)?

Alternatively, lib/Value/String.pm could be modified to select a delimiter character based on the content of the string it is typesetting. E.g., find the smallest n > 32 where chr(n) is not in the string to be delimited, and use that. One could split the string into an array of characters, sort it (discarding duplicates and anything at or below the space character), and find the first index i where the i-th character is not chr(i + 33), then use chr(i + 33) as the delimiter.

For example

sub verb {
  my $self = shift;
  my $string = shift;
  my $i = 33;                                         # starting ASCII character to look for
  my %has; @has{split(//, $string)} = ();             # hash with keys equal to the characters in the string
  my @c = sort {$a <=> $b} map {ord($_)} keys %has;   # sorted list of (unique) character numbers
  shift(@c) while @c && $c[0] < $i;                   # remove control characters and space
  while (@c && $c[0] == $i) {shift(@c); $i++}         # find first unused character number
  return "\\verb" . chr($i) . $string . chr($i);
}

should do it. This will usually end up with ! as the delimiter, followed by " if ! is used.

Alex-Jordan · 2019-09-10T16:57:11Z

I gather from this reference that ASCII 127 is also not OK to use in XML (or more specifically in an attribute which has the most restrictions in general.)

We do something similar in PreTeXt to what you describe for choosing a delimiter, with the aim of replacing the 0x85 with something more friendly to more output forms. (That's after the XML validation takes place, so it's not the case that we can repeat this with 0x1F.) There are a few complications, like * can't be used as the delimiter for \verb, and we do not want to use XML control characters. There is the unlikely scenario to think about where every usable character is in the string to be typeset.

I'd be inclined to go that way, except I want to check about using some other unicode character first. I don't understand the character encoding issues much more than surface level. Is a time coming when we could just do like the following?

#
#  Mark a string to be display verbatim
#
sub verb {shift; return "\\verb🕸️".(shift)."🕸️"}

I guess WeBWorK hardcopy would need to switch to use xelatex, but should it be moving that direction anyway to support more characters in PG problems?

dpvc · 2019-09-10T23:03:47Z

I gather ... that ASCII 127 is also not OK to use in XML.

OK, the ranges seemed to indicate that U+007F was OK, but the non-restricted list seems to indicate not. Too bad.

There are a few complications, like * can't be used as the delimiter

True. I suppose you could use

while (shift(@c) == $i || $i == 42) {$i++}

to avoid the star.

and we do not want to use XML control characters

I had an earlier version that used foreach $i (33..126) {...}, but changed it. That would have limited to ASCII characters, but you are right, this doesn't. A little more work could fix that.

There is the unlikely scenario to think about where every usable character is in the string to be typeset.

Yes, I thought of that, but no matter what you end up doing along these lines, that will be a possibility, so you are going to crash one way or another.
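Putting those adjustments together, here is a rough sketch of the foreach variant: it only considers printable ASCII (33..126), skips the star, and fails explicitly when nothing is free. The helper name pick_delimiter is hypothetical, and this is an illustration of the idea rather than code from the PG repository.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first printable ASCII character (33..126)
# that does not occur in $string and is not '*', or undef if none is free.
sub pick_delimiter {
  my $string = shift;
  my %used = map { $_ => 1 } split(//, $string);
  foreach my $n (33 .. 126) {
    my $c = chr($n);
    next if $c eq '*';          # \verb* is the starred form, so '*' cannot delimit
    return $c unless $used{$c};
  }
  return undef;                 # every candidate occurs in the string
}

sub verb {
  my ($self, $string) = @_;
  my $d = pick_delimiter($string);
  die "no free \\verb delimiter for this string" unless defined $d;
  return "\\verb" . $d . $string . $d;
}

print verb(undef, 'foo'), "\n";
print verb(undef, 'a!b"c'), "\n";
```

The explicit die makes the "every usable character is taken" case visible instead of silently producing a bad delimiter.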

It looks like U+0085 is allowed, so chr(0x85) would have been a good choice for XML. But I guess Geoff's concern is that this would be a two-byte sequence in UTF-8, though I'm not sure what the problem is with that. It seems the change to chr(0x1F) was to keep it to a single byte.
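To make the size difference concrete, here is a small sketch using Perl's core Encode module. It prints how many bytes each candidate delimiter occupies once encoded as UTF-8. The list of candidates is just the ones mentioned in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

for my $n (0x0D, 0x1F, 0x7F, 0x85) {
  my $bytes = encode('UTF-8', chr($n));   # byte string holding the UTF-8 encoding
  printf "U+%04X -> %d byte(s): %s\n", $n, length($bytes), unpack('H*', $bytes);
}
```

Everything up to U+007F stays a single byte, while U+0085 becomes the two bytes C2 85.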

Another possibility would be to use U+000D (RETURN), which LaTeX will allow as a delimiter for \verb, and remove any \r characters in the string (or handle them differently). The \verb macro can't actually handle arbitrary strings; in fact, the string can't contain newlines, tabs are treated as spaces, and returns are ignored, so to really handle arbitrary strings, assuming you want \n (and \r) to be treated as line breaks, you would have to process them specially anyway.

The ArbitraryString context actually does handle \n (but not \r), via

sub quoteTeX {
  my $self = shift; my $s = shift;
  return $self->verb($s) unless $s =~ m/\n/;
  my @tex = split(/\n/,$s);
  foreach (@tex) {$_ = $self->verb($_) if $_ =~ m/\S/}
  "\\begin{array}{l}".join("\\\\ ",@tex)."\\end{array}";
}

so that \n produces line breaks (and can be displayed by MathJax). You could do something similar that handles both \n and \r as line breaks, or just removes \r and handles \n as above, then use \r as the delimiter in the verb() method. This would guarantee that it was a proper delimiter, and still gets you line breaks.
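A sketch of that variant, written as standalone subs rather than methods so it can be run on its own. It assumes that \r\n, \n, and a lone \r should all count as one line break, and that verb() only ever sees a single line. It is untested against the real Value::String class.

```perl
use strict;
use warnings;

# Delimit one line with CR; assumes $s contains no \r or \n.
sub verb {
  my ($self, $s) = @_;
  return "\\verb\r$s\r";
}

# Treat \r\n, \n, and a lone \r as line breaks, and typeset each
# non-blank line with verb() inside a one-column array.
sub quoteTeX {
  my ($self, $s) = @_;
  my @tex = split(/\r\n|\n|\r/, $s, -1);
  return verb($self, $s) if @tex <= 1;
  foreach (@tex) { $_ = verb($self, $_) if m/\S/ }
  return "\\begin{array}{l}" . join("\\\\ ", @tex) . "\\end{array}";
}
```

Because the split consumes every \r in the input, the only returns left in the output are the delimiters that verb() adds.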

If using literal returns in the attributes is problematic (though it seems to work in my hand testing), then you could perhaps encode it as &#xD; in the attributes:

<tag attr="\verb&#xD;abc&#xD;">

This might be a possible solution.
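For illustration, a minimal sketch of writing the attribute that way. The helper name escape_attr is hypothetical. As I understand the XML rules, a literal CR inside an attribute value is normalized by the parser (end-of-line handling, then attribute-value normalization to a space), while a character reference such as &#xD; survives as a real CR, which is why the reference form matters here.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use inside a double-quoted
# XML attribute value, writing CR as a character reference.
sub escape_attr {
  my $s = shift;
  $s =~ s/&/&amp;/g;      # must come first so later entities are not re-escaped
  $s =~ s/</&lt;/g;
  $s =~ s/"/&quot;/g;
  $s =~ s/\r/&#xD;/g;
  return $s;
}

my $tex = "\\verb\rabc\r";
print '<tag attr="', escape_attr($tex), '"/>', "\n";
```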

Alex-Jordan · 2019-09-11T00:02:41Z

One quick note. While an XML file can have \n and \r in the file, they are not allowed in attribute values. So it would be the same validation issue to use them as to use 0x1F.

dpvc · 2019-09-11T00:32:07Z

I couldn't find the specification for what's allowed in an attribute. Can you provide a link?

The best I can find is the definition for AttValue, which seems to indicate that there are no restrictions other than no literal & or <, but any other valid XML character. If you track down the meaning of reference, these do seem to include the three special control characters, #x9, #xA, and #xD, the latter being the one I suggested. So I'm not sure where you are getting that they are not allowed in the attribute values.

If attributes are limited in what they can support, are you considering changing to using a container whose contents is the value instead? Essay answers and ones for ArbitraryString certainly can include newlines as part of the student and correct answers. How are these handled in your attributes?

Alex-Jordan · 2019-09-11T03:40:49Z

I think you are right. Sorry, I was mixing up memories. In July, Sean Fitzpatrick and I spent some time thinking about this, and characters \t, \n, and \r were singled out as the only characters in range 0--31 that were legal in XML. But we ruled them out as delimiters. I was mis-remembering why we ruled them out, attributing that (no pun intended) to illegal attribute values. But now I'm remembering it's because of the behavior of \verb. Testing on a simple .tex file, \t just plain doesn't work for a delimiter (it causes a compilation error). With the other two, it works in the sense that it doesn't throw an error. But it gobbles up space. I'm finding that This is \verb\nverbatim\n text comes out as "This is verbatimtext" losing the space that follows the second delimiter. That's with pdflatex and also xelatex, not sure about MathJax.

dpvc · 2019-09-11T03:45:41Z

I'm finding that This is \verb\nverbatim\n text comes out as This is verbatimtext losing the space that follows the second delimiter.

Try This is {\verb\nverbatim\n} text; that should preserve the space following it.

Alex-Jordan · 2019-09-11T16:51:27Z

Wonderful. That passes my testing too.

Of the proposals, using \r as the delimiter (and removing any \r from the string being typeset) sounds nicest to me.

Should I poll people for red flags? @mgage , @goehle , @taniwallach ? I think you can skip reading this whole thread. The proposal is for this code in lib/Value/String.pm:

#
#  Mark a string to be display verbatim
#
sub verb {shift; return "\\verb".chr(0x1F).(shift).chr(0x1F)}

to become (where 0x0D = CR = carriage return = \r)

#
#  Mark a string to be display verbatim
#
sub verb {shift; return "\\verb".chr(0x0D).(shift).chr(0x0D)}

except it would also process the input string to strip any 0x0D that somehow ended up in the string. My understanding is that \r is never used alone as a line break (not since very old Mac OS's); Windows uses \r\n but stripping the \r will leave \n, which Windows etc. will still understand.

If no one sees a red flag, I will open a PR for this.

dpvc · 2019-09-11T17:41:41Z

You could include the braces in the verb() method as well; this won't hurt the output on line or in hard copy, and that way, you don't need any special processing in PreTeXt.

Also, note that \n is not valid within \verb, so you will still want to do something about that if you intend to support newlines within String objects (the current String doesn't, but ArbitraryString does). If so, then the code I gave above that splits the string on \n and uses an array environment to get multi-line output could be used to make String completely general.
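Something along these lines, as a sketch only and not necessarily what an eventual PR would contain: CR as the delimiter, any CR stripped from the string first, and the whole thing wrapped in braces. Newlines are deliberately left alone here, since the current String does not support them.

```perl
use strict;
use warnings;

# Sketch only: CR-delimited \verb wrapped in braces.
sub verb {
  my ($self, $s) = @_;
  $s =~ s/\r//g;                 # the delimiter must not occur inside the string
  return "{\\verb\r$s\r}";
}

# Show the result with each CR made visible as <CR>.
(my $shown = verb(undef, "x^2\r")) =~ s/\r/<CR>/g;
print "$shown\n";
```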

taniwallach · 2019-09-14T20:22:11Z

I read the thread.

I would avoid using any multi-byte characters yet in LaTeX generated by WeBWorK by default unless it comes in from a UTF-8 encoded problem. Simply put, not everyone is using xelatex or something else which expects UTF-8 encoded tex files, and such an approach is likely to cause trouble on many sites that are in no rush to support UTF-8.

I strongly support the proposal to change to \r as the delimiter, as that is certainly an option which should cause no or minimal issues on the TeX side of things, and should not cause any UTF-8 issues.

I do think it might be advisable to preprocess the string being put inside the verbatim \verb command to make sure there are no occurrences of \r (maybe just replace them by space characters) and to add the extra braces around the \verb block to prevent any unusual surprises.

I cannot speak to the history behind the choice of ASCII 31 as the new \verb delimiter beyond the minimal comment that Geoff left in 539406c and the fact that 0x85 falls in the range UTF-8 reserves for continuation bytes: https://en.wikipedia.org/wiki/UTF-8 but I am guessing that it was assumed that this character, which is valid in UTF-8, was not likely to ever appear in anything passed into the verbatim code.

BTW I needed to add \catcode^_=12 into the hardcopyPreamble.tex files I made for XeLaTeX support with Hebrew, as the choice of 0x1F = ASCII 31 would otherwise trigger compilation problems. Using \r would avoid that.

Alex-Jordan · 2019-09-14T21:34:09Z

Writing up a PR for this.

Davide, the ArbitraryString snippet breaks the input string into an array split at instances of \n, and joins it back together using \\\\. For the current purpose, is there a reason not to do something like:

sub verb {
  shift;
  my $verb = shift;
  $verb =~ s/\r/ /g;
  $verb =~ s/\n/\\\\/g;
  return "{\\verb\r$verb\r}";
}

where regex does the replacement of \n instead?

dpvc · 2019-09-14T22:17:26Z

is there a reason not to do something like ...

Yes. Because the string will be printed verbatim, the \\ will be shown as \\, not interpreted as a line break. That is why the ArbitraryString context goes through the work that it does to get a multi-line display. Also, while \\ will cause a line break anywhere in MathJax's math mode, that is not the case in actual LaTeX. So a multi-line structure is needed.

Alex-Jordan · 2019-09-14T22:24:11Z

Oh I see. I lost track of how ArbitraryString re-applies verb() to the split pieces.

Alex-Jordan · 2019-09-14T22:48:42Z

I opened #422.

taniwallach · 2019-09-17T18:34:53Z

@Alex-Jordan - I merged in #422 and unless you want to leave the issue open to track the multi-line case for future attention - I think this issue can be closed.

taniwallach · 2019-09-25T07:11:10Z

#424 patched the changes in #422 to use different delimiters in different contexts.

I'm closing the issue. We can address the special case of multi-line strings in the future, should the need arise.

Alex-Jordan mentioned this issue Sep 14, 2019

refactor verb to use CR for delimiter #422

Merged

taniwallach closed this as completed Sep 25, 2019
