Combine emphasis opcodes #99

Closed
bertfrees opened this issue Jun 18, 2015 · 17 comments
Labels
enhancement An enhancement in the functionality (not a bug fix or a table improvement)

@bertfrees
Member

A possible enhancement to the new opcodes introduced in issue #50 could be to combine certain groups of opcodes into a single opcode. E.g. { firstwordital, firstwordbold, firstwordunder, ... } are combined into a single opcode named firstwordemph which is followed by a class argument { ital, bold, under, ... }. This could improve the readability and maintainability of both tables and code. Also it could make it less UEB specific.

So we're talking about these opcodes:

firstword*
lastwordbefore*
lastwordafter*
len*phrase
firstletter*
lastletter*
singleletter*
*word
*wordstop

* = {ital, bold, under, caps, script, trans1, trans2, trans3, trans4, trans5}

There are 9 × 10 = 90 possible combinations and each of them has its own opcode. The idea is to replace this with:

firstwordemph *
lastwordbeforeemph *
lastwordafteremph *
lenemphphrase *
firstletteremph *
lastletteremph *
singleletteremph *
emphword *
emphwordstop *

* = {ital, bold, under, ...}

* must be an "emphasis class" previously defined with a rule of the form emphclass <name>. Possibly ital, bold and under could be predefined emphasis classes, but the possible classes are not limited to those values, nor are they limited to the list of 10 values that we have now. caps and script may have to be treated separately.

The difference with the current situation is that there are only 9 opcodes, that the names can be freely chosen, and that the possible classes are not limited.
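
To make the counting concrete, here is a small Python sketch (not liblouis code) that expands the nine opcode templates against the ten current emphasis names and then derives the proposed combined names. The helper names `current_opcodes` and `combined_opcode` are made up for this illustration.

```python
# Opcode templates from the list above; '*' marks where the emphasis name goes.
TEMPLATES = [
    "firstword*", "lastwordbefore*", "lastwordafter*", "len*phrase",
    "firstletter*", "lastletter*", "singleletter*", "*word", "*wordstop",
]
CLASSES = ["ital", "bold", "under", "caps", "script",
           "trans1", "trans2", "trans3", "trans4", "trans5"]


def current_opcodes():
    """Every per-class opcode name in the current scheme."""
    return [t.replace("*", c) for t in TEMPLATES for c in CLASSES]


def combined_opcode(template):
    """Name of the single proposed opcode that replaces one template."""
    return template.replace("*", "emph")


print(len(current_opcodes()))                   # 90
print([combined_opcode(t) for t in TEMPLATES])  # 9 names, starting with 'firstwordemph'
```

In the proposed scheme the class name moves out of the opcode name and becomes an argument, so adding an eleventh class would not add any opcodes.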

The order of class definitions determines how typeform bits are mapped to the classes. The class order could also be documented in the table header.
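
A rough sketch of what "order determines the bits" could mean, written in Python rather than in the liblouis compiler. It assumes the first defined class gets bit 0x1, the second 0x2, and so on; the thread does not fix the exact numbering, and `bits_from_order` is a hypothetical helper.

```python
def bits_from_order(table_lines):
    """Assign bits 0x1, 0x2, 0x4, ... to emphasis classes in definition order.

    Assumption: one bit per class, handed out in the order the
    `emphclass <name>` rules appear in the table.
    """
    mapping = {}
    for line in table_lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "emphclass" and parts[1] not in mapping:
            mapping[parts[1]] = 1 << len(mapping)
    return mapping


table = ["emphclass ital", "emphclass bold", "emphclass under",
         "firstletteremph ital 46"]
print(bits_from_order(table))  # {'ital': 1, 'bold': 2, 'under': 4}
```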

@bertfrees bertfrees added the enhancement An enhancement in the functionality (not a bug fix or a table improvement) label Jun 18, 2015
@egli
Member

egli commented Jun 18, 2015

Sounds like a brilliant idea, at least from the perspective of table and documentation maintenance. How would it work code-wise? How do you define different behaviour for firstletteremph ital and firstletteremph under? Or is there none?

@bertfrees
Member Author

Cool. Well yeah, there should be different behavior of course otherwise it's not very useful. I haven't looked into the code yet so I don't know how easy this change would be. Maybe @MikeGray-APH can give us a clue?

@egli
Member

egli commented Jun 18, 2015

Yes OK, different behaviour. But then how do you define the behaviour of the following:

emphclass mySpecialCaps
lenemphphrase mySpecialCaps 3

@bertfrees
Member Author

Oh I see what you mean, I think. The behavior would still be defined with the opcodes. In that sense nothing changes with the way things are now. The only difference is the way we define it in the tables.

Does that answer your question?

@MikeGray-APH
Contributor

In my code, capitals are treated the same as the other emphases, except that it will process word resets.

@bertfrees bertfrees changed the title Combine emphasis/capital opcodes Combine emphasis opcodes Jun 18, 2015
@bertfrees
Member Author

@MikeGray-APH Yes I know. That's why it's probably best to have a predefined class "caps" with slightly different behavior.

@dkager
Contributor

dkager commented Jun 19, 2015

@bertfrees wrote:

The order of class definitions determines how typeform bits are mapped to the classes.

I think this should be enclosed in the class definitions themselves, e.g. emphclass italic 1 for bit 1. This is to avoid the ordering problem such as we're currently dealing with using the class opcode.

@MikeGray-APH wrote:

In my code, capitals are treated the same as the other emphases, except that it will process word resets.

Ideally there should be an opcode for this, maybe emphmodechars that works similar to numericmodechars. This is probably the only opcode missing to make the Dutch tables work.

@bertfrees
Member Author

The order of class definitions determines how typeform bits are mapped to the classes.

I think this should be enclosed in the class definitions themselves, e.g. emphclass italic 1 for bit 1. This is to avoid the ordering problem such as we're currently dealing with using the class opcode.

OK I see where you're coming from, but the two cases are fundamentally different.

The problem with using the "class" opcode and $w, $x, etc. in multipass rules is that such tables can't just be included in any table because the number of "class" rules that are defined before the include must be known.

Tables with "emphclass" definitions, the way I propose it, can be included in other tables without a problem.

A possible issue with my proposal, you might say, is that you cannot guarantee that the included tables, and therefore the behavior of your table, will not change. But you could make the same argument for any rule, including your version of the "emphclass" rule. The only possible solution to this problem is to say: the behavior of a table is 100% the responsibility of the table author, and this includes any tables that he wishes to include. Carefully testing your table is what you need.

One thing that your proposal has that mine doesn't is that you could override the order of classes, but I don't immediately see any need for that.
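
For readers comparing the two proposals, the following Python sketch contrasts them under stated assumptions. In the order-based variant the bit follows the position of the `emphclass <name>` rule; in the explicit variant the number in `emphclass <name> <n>` is read as a 1-based bit position. Both readings are guesses at the intended semantics, and neither function is liblouis code.

```python
def order_based(lines):
    """Bits follow definition order (the original proposal, as assumed here)."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0] == "emphclass" and parts[1] not in mapping:
            mapping[parts[1]] = 1 << len(mapping)
    return mapping


def explicit(lines):
    """Bits are given in the rule itself, e.g. `emphclass italic 1` (assumed 1-based)."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "emphclass":
            mapping[parts[1]] = 1 << (int(parts[2]) - 1)
    return mapping


included = ["emphclass trans1"]                  # rules pulled in by an include
own_order = ["emphclass italic", "emphclass bold"]
own_explicit = ["emphclass italic 1", "emphclass bold 2"]

print(order_based(own_order))             # {'italic': 1, 'bold': 2}
print(order_based(included + own_order))  # {'trans1': 1, 'italic': 2, 'bold': 4}
print(explicit(own_explicit))             # {'italic': 1, 'bold': 2}
```

In this sketch the rules that refer to classes by name stay valid after the include either way; what shifts in the order-based variant is only the numeric value an application would see.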

@dkager
Contributor

dkager commented Jun 19, 2015

OK, but either solution will completely change typeform handling for external applications. These applications will at least want to know which classes a table has defined and how to use them. Using them can either be done numerically, as with the current typeform implementation, or using a class name. The latter causes a lot of overhead for longer strings. And using numbers requires there to be a lookup function. Whichever approach you choose, this is going to break every application using this feature of liblouis. Therefore I'm thinking it might be useful to predefine {italic, bold, under} so they are guaranteed to keep their current bits.

Another question: how does the computer_braille typeform fit in with this? I believe there are opcodes to signal where computer braille begins and ends. Should this be considered emphasis? If so the term "emphasis" is a bit of a misnomer. Or to generalize the question: are all emphasis options typeforms and vice versa?

@bertfrees
Member Author

The usage will stay the same, i.e. numeric. The difference is that now the emphasis classes are defined and documented per table. A look up function is not strictly necessary but could be useful, yes.

You have a good point regarding the possibility of breaking how applications currently use liblouis. Yes, I do want typeform handling by external application to change in the long run. And even though things won't break immediately (because we'll make sure the behavior doesn't change initially by running the existing tables through our own conversion tool, and we'll give applications some time to adapt to the new approach), still of course there is the risk that applications are lazy and will break eventually.

We could anticipate that by reserving some bits for bold, italic and underlined (or by having a look up function) and by requiring tables to support at least those 3 classes. I have an idea for handling this in a way that doesn't force table authors to think in the "old" pattern.

But first let me explain the new approach and why we need it (in case not everybody is convinced yet).

Let's start by saying that "italic", "bold", "underline" etc. are print artifacts, i.e. properties of a font. During transcription these are mapped to braille artifacts (indicators). How that mapping is done depends on language and possibly context (e.g. depending on what types of emphasis appear in a text). Sometimes the braille artifacts have the same name as a print artifact, sometimes not.

Up till now liblouis has handled the problem by providing, through a liblouis table, a mapping between the 3 most common emphasis types and a set of indicators. This simple model is limiting in several ways:

  1. limited number of different braille indicators (max. 3)
  2. limited number of emphasis types (3)
  3. fixed mapping

For applications that don't do any special handling per language ("braille code agnostic"), this is an acceptable generic solution, provided that the liblouis tables implement the mapping as well as possible. For emphasis beyond bold, italic and underlined, the best an application can do is to either map it to the type it is most similar to, or ignore it.

It is clear that this is not an optimal solution for all braille codes and all input. But trying to handle everything is not in the scope of liblouis either:

  • In order to handle all possible emphasis types liblouis would need a notion of CSS for example.
  • Solving the problem of context dependent mapping (e.g. UEB) is not possible since the context that liblouis gets is typically only a single paragraph.

This means that applications that use liblouis have the responsibility of doing language specific handling anyway, and therefore it's acceptable that the liblouis interface differs between tables.

What I like so much about this idea is that it doesn't force the table author into a certain pattern. He can freely choose the interface and how much of the mapping he implements in the table. The interface can be a list of distinct indicators (i.e. braille artifacts, e.g. "ind1", "ind2", etc.), or it can be a list of print artifacts, some of which may map to the same indicators. Or it can be a mixture.

To better support multiple emphasis types mapping to the same indicators, without having to duplicate a lot, I had this idea of emphasis "aliases". It could look something like this:

emphclass bold ind1
emphclass ital ind2
emphclass under ind2
firstletteremph ind1 46
...

The exact syntax is not so important. What matters is that tables can easily provide a mapping for bold, italic and underlined, ensuring backwards compatibility, while not being stuck with the old approach.
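
As a rough illustration of the alias idea, the Python sketch below parses the fragment above and resolves a print artifact to the dots of its indicator. The two-argument `emphclass <print> <indicator>` form is the tentative syntax from this comment, and `parse_aliases` and `dots_for` are hypothetical helpers, not part of liblouis.

```python
def parse_aliases(lines):
    """Split a table fragment into an alias map and indicator rules."""
    aliases, rules = {}, {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skips blanks and the '...' placeholder
        if parts[0] == "emphclass":
            aliases[parts[1]] = parts[2]            # print artifact -> indicator
        elif parts[0].endswith("emph"):
            rules[(parts[0], parts[1])] = parts[2]  # (opcode, indicator) -> dots
    return aliases, rules


def dots_for(opcode, emphasis, aliases, rules):
    """Resolve an emphasis name through its alias, then look up the dots."""
    indicator = aliases.get(emphasis, emphasis)
    return rules.get((opcode, indicator))


fragment = [
    "emphclass bold ind1",
    "emphclass ital ind2",
    "emphclass under ind2",
    "firstletteremph ind1 46",
    "...",
]
aliases, rules = parse_aliases(fragment)
print(dots_for("firstletteremph", "bold", aliases, rules))  # 46
print(dots_for("firstletteremph", "ital", aliases, rules))  # None, no ind2 rule in this fragment
```

Here ital and under share ind2, so a single set of ind2 rules would serve both without duplication.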

@dkager
Contributor

dkager commented Jun 23, 2015

A look up function is not strictly necessary but could be useful, yes.

The alternative, unless I misunderstand the concept, is that application developers look at the classes a table defines and then hard-code them. This will break if a table is later updated with different numbers assigned to these classes. While table authors should avoid such backwards-incompatible changes, I think we should anticipate this by providing a lookup function. This is also required for applications that allow the user to load arbitrary "custom tables".
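
A minimal sketch of the lookup idea, in Python for brevity. The `Table` class and its `lookup` method are stand-ins invented for this example and do not mirror the real liblouis API; the bit numbering is likewise assumed. The point is only that the application asks for the bit by name at run time instead of hard-coding it, then fills the per-character typeform values numerically as before.

```python
class Table:
    """Stand-in for a compiled table; not the liblouis API."""

    def __init__(self, class_names):
        # Assumed numbering: first class gets 0x1, second 0x2, and so on.
        self._bits = {name: 1 << i for i, name in enumerate(class_names)}

    def lookup(self, name):
        """Return the typeform bit for a class name, or 0 if undefined."""
        return self._bits.get(name, 0)


def mark(typeform, start, end, bit):
    """OR a bit into the per-character typeform values in [start, end)."""
    for i in range(start, end):
        typeform[i] |= bit


text = "a bold word"
table = Table(["ital", "bold", "under"])
typeform = [0] * len(text)
mark(typeform, 2, 6, table.lookup("bold"))
print(typeform)  # [0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 0]
```

If a later table revision reorders its classes, only the value returned by the lookup changes; the application code stays the same.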

Even if the table behavior isn't changed there's already one incompatibility: the change from char to unsigned short which requires applications to be updated. This is of course a minor problem, but it does mean you can't just drop in a >2.6.3 DLL into an existing application.

Question: what are ind1, ind2, etc? E.g. why not write firstletteremph ital 46? Is the idea to make ind1 an alias of ital?

Another idea I had for preventing duplication was something like this:
emphdots ind1 46
firstletteremph ital ind1
lastletteremph ital ind1

I.e. the ability to define virtual dot patterns. The same could probably be achieved by assigning a virtual dot, say a, and then replacing it with the desired dots using multi-pass rules. But this has some disadvantages:

  1. This limits you to the number of virtual dots, which I believe is 8?
  2. Multi-pass rules aren't as transparent.
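
To show what such a rule would amount to, here is a Python sketch that expands named dot patterns in a table fragment. The `emphdots` opcode is only the suggestion made in this comment, not an existing liblouis opcode, and `expand_emphdots` is a hypothetical helper.

```python
def expand_emphdots(lines):
    """Replace names defined by the suggested `emphdots` rule with their dots."""
    patterns = {}
    expanded = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "emphdots":
            patterns[parts[1]] = parts[2]  # remember the name, emit nothing
            continue
        if len(parts) == 3 and parts[2] in patterns:
            parts[2] = patterns[parts[2]]  # substitute the real dot pattern
        expanded.append(" ".join(parts))
    return expanded


fragment = [
    "emphdots ind1 46",
    "firstletteremph ital ind1",
    "lastletteremph ital ind1",
]
print(expand_emphdots(fragment))
# ['firstletteremph ital 46', 'lastletteremph ital 46']
```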

@bertfrees
Member Author

Yes, that was the idea. Applications would look at the "table API". For every change to a table a note is made in the changelog, so applications that do language specific handling (i.e. use more than ital, bold, under) can adapt themselves with each update. Of course table authors should try to make as few backward-incompatible changes as possible, just like with any other software component. A look up function can make this more robust indeed, although it's only really helpful when the order of classes changes (and why would you need to do that?). For applications that allow the user to load custom tables it may be best to rely on ital, bold and under only. If they want more, the custom tables should probably follow a well-defined standard anyway, which could possibly include a fixed order of classes. But again, a look up function could be convenient here. So it's an idea worth considering.

Because of the change from char to unsigned short we'll change the version to 3.0.

ind1 etc. were just examples of how indicators could be named in a braille code. E.g. UEB has "first transcriber-defined typeform", "second transcriber-defined typeform", etc. In this particular example bold is an alias of ind1, and ital and under are aliases of ind2. Why not write firstletteremph ital 46? This is the whole essence of what I've been trying to explain. A table author can still write that if he wishes, but he can also use words that more closely match the braille code and use aliases to do the mapping from print artifacts to braille artifacts.

@bertfrees
Member Author

Your idea about "dot pattern aliases" could remove some extra duplication, yes. I need to think about it. I guess I would make it something more general than emphdots. But we have to be careful about inventing things nobody will use. This is indeed something that could be solved already, quite elegantly actually, with multipass opcodes.

There are 6 virtual dots by the way (9, a, b, c, d and e), so the number of virtual dot patterns is (2^6 - 1) * 2^8 = 16128.
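
The arithmetic can be checked by brute force. The Python sketch below takes the figures from the comment as given (8 real dots, 6 virtual dots) and counts every dot combination that contains at least one virtual dot; the bit layout is an assumption made only for counting.

```python
REAL_DOTS = 8
VIRTUAL_DOTS = 6  # the count stated in the comment above

# Assumed layout for counting: real dots in the low 8 bits, virtual dots above.
virtual_mask = ((1 << VIRTUAL_DOTS) - 1) << REAL_DOTS

count = sum(
    1
    for cell in range(1 << (REAL_DOTS + VIRTUAL_DOTS))
    if cell & virtual_mask  # at least one virtual dot is set
)
print(count)  # 16128
```

This matches the closed form: 2^14 total combinations minus the 2^8 that use no virtual dot, which equals (2^6 - 1) * 2^8.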

If you want to work out this idea some more, please make a new issue for it (as it's not directly related to the opcode unification).

@dkager
Contributor

dkager commented Jun 23, 2015

For applications that allow the user to load custom tables it may be best to rely on ital, bold and under only.

I agree, but how are applications supposed to know which bit corresponds to which typeform? A custom table could define ital=1 and bold=2, but it could just as easily define ital=32 and bold=1. The only way around this is to hard-code these three classes with their current values. But this more or less defeats the purpose of dynamic classes.

@bertfrees
Member Author

Yep, either reserve some bits, or have a look up function. Or what I said earlier about custom tables following a well-defined standard. The contract could simply say that it is illegal to define ital as 32 and bold as 1.

The first option, reserving some bits, doesn't necessarily conflict with dynamic classes IMO. We need dynamic classes, not dynamic bits. Besides we'll need to reserve the bit for computer_braille anyway. What matters is that there is a whole range of bits available (maybe starting at bit 5) that can be filled in freely.
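
A sketch of how reserved bits and dynamic classes could coexist, again in Python and under assumptions: the concrete values in `RESERVED`, the 16-bit width (from the move to unsigned short), and reading "bit 5" as a 1-based position are all illustrative choices, and `allocate` is a hypothetical helper.

```python
# Assumed reserved assignments; the thread only says some bits would be reserved.
RESERVED = {"ital": 0x1, "under": 0x2, "bold": 0x4, "computer_braille": 0x8}
FIRST_FREE_BIT = 5  # "maybe starting at bit 5", counted from 1
WIDTH = 16          # assumed width of the typeform value


def allocate(class_names):
    """Give each non-reserved class the next free bit, starting at bit 5."""
    mapping = dict(RESERVED)
    next_bit = FIRST_FREE_BIT
    for name in class_names:
        if name in mapping:
            continue  # reserved or already allocated
        if next_bit > WIDTH:
            raise ValueError("no free typeform bits left")
        mapping[name] = 1 << (next_bit - 1)
        next_bit += 1
    return mapping


print(allocate(["ital", "trans1", "trans2"]))
# {'ital': 1, 'under': 2, 'bold': 4, 'computer_braille': 8, 'trans1': 16, 'trans2': 32}
```

Class names stay free-form here; only the first four bit positions are fixed, which is the distinction between dynamic classes and dynamic bits.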

bertfrees added a commit that referenced this issue Dec 22, 2015
as explained in #99

Note that the unification has been done completely on the level of
compilation. The result of the compilation step is exactly the same as
before and nothing has changed in the steps following compilation.

To do:
- Rename "ital", "bold", etc. to the generic "emph_1", "emph_2",
  etc. everywhere in the code and API. Emphasis classes can be named
  freely in the translation tables and the code and API should reflect
  that.
- Remove an indirection (see comment in compileTranslationTable.c#L4144)
@egli egli modified the milestone: 3.0 alpha Jun 8, 2016
@egli
Member

egli commented Jun 17, 2016

I think we can safely close this issue as this has been implemented

@egli egli closed this as completed Jun 17, 2016
@bertfrees
Member Author

The lookup function has been added in 511d91e.
