Feature: Parse Ink data #4

Open
msiemens opened this issue Nov 9, 2020 · 8 comments
Labels
enhancement New feature or request

Comments

@msiemens
Owner

msiemens commented Nov 9, 2020

No description provided.

@msiemens msiemens added the enhancement New feature or request label Nov 9, 2020
@blu-base

blu-base commented Feb 16, 2021

Hi Markus,
this is a feature I'm also investigating. I've uncovered some information in the ObjectSpacePropSets, such as the Ink's tool settings and meta information, as you seem to have found in your WIP project.
This information appears to follow the ISF specification.
However, there is this serialized stream of point coordinates which I couldn't find much information about yet.
Somewhere, I read/heard that it is a stream of differential coordinates. Without attempting to brute-force it, it looks to me like a UTF-8-style stream: there are high bits indicating the continuation of a segment, or its end.
There is some stuff in the .NET reference source which might be worth investigating further, though I did not have time to dig deeper there.
Best, Sebastian

@msiemens
Owner Author

Hey Sebastian!

First, thanks for your work with libmson! It's been a great help for understanding the OneNote file format while working on onenote.rs!

Parsing Ink data is an interesting challenge as in my experience this is a widely used feature but Microsoft doesn't see the need to publish the specification for it:

The shapes and ink JCIDs and Properties are considered out-of-scope of the OneNote file format documentation.

Too bad…

Anyway, I have a somewhat working version of Ink parsing locally but haven't had the time to polish it up for publishing. From my investigation I was able to determine that the Ink paths are stored as multi-byte data according to the ISF specification, specifically the section on Multi-byte Encoding of Signed Numbers. The algorithm is as follows:

  1. Read the data as an array of bytes (u8 in Rust)
  2. Decode the multi-byte data according to Sizes of Tags and Numbers. Basically, read bytes until encountering a byte whose most significant bit is clear. This indicates a word boundary. Then concatenate all the bytes without their most significant bits (which are all set except for the last one) into an unsigned int/u64
  3. Decode the signs of the unsigned ints by interpreting the lowest bit as the sign (0 = positive, 1 = negative)
  4. Interpret the data as an SVG path: split the array in half to get the X and Y coordinates, then apply differential encoding (using the first derivatives only). Basically, one can split the array in half and print it as M x0 y0 l x1 y1 x2 y2… with M specifying the initial coordinates and l indicating the dx and dy coordinates for the remaining entries
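
Steps 2 and 3 can be sketched roughly as follows. This is a minimal sketch, not the code from onenote.rs: the function names are hypothetical, and it assumes that the 7-bit groups are stored least-significant group first and that the magnitude is what remains after shifting the sign bit out. The steps above do not spell out either detail.

```rust
/// Decode a byte stream into unsigned integers (step 2).
/// ASSUMPTION: 7-bit groups are stored least-significant group first.
fn decode_multi_byte(input: &[u8]) -> Vec<u64> {
    let mut output = Vec::new();
    let mut current: u64 = 0;
    let mut shift: u32 = 0;

    for &byte in input {
        // Drop the continuation flag and keep the 7 payload bits.
        let payload = u64::from(byte & 0x7F);
        // Bits that would not fit into a u64 are silently dropped.
        current |= payload.checked_shl(shift).unwrap_or(0);
        shift = shift.saturating_add(7);

        // A clear most significant bit marks the last byte of a number.
        if byte & 0x80 == 0 {
            output.push(current);
            current = 0;
            shift = 0;
        }
    }
    // A trailing number without a terminating byte is discarded.
    output
}

/// Turn the unsigned values into signed ones (step 3).
/// ASSUMPTION: the magnitude is the value shifted right by one bit.
fn decode_signed(input: &[u8]) -> Vec<i64> {
    decode_multi_byte(input)
        .into_iter()
        .map(|value| {
            let magnitude = (value >> 1) as i64;
            // Lowest bit: 0 = positive, 1 = negative.
            if value & 1 == 1 { -magnitude } else { magnitude }
        })
        .collect()
}
```

Under these assumptions, the bytes 0xAC 0x02 decode to the unsigned value 300, and the unsigned values 4 and 5 become 2 and -2 after sign decoding.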

This should allow parsing Ink paths into SVG data. I'll try to push the source code for this soon.
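
Step 4 could look roughly like this. It is a sketch under assumptions rather than the published implementation: to_svg_path is a hypothetical name, and the input is expected to hold only coordinate values (any leading non-coordinate value already removed), with the X values in the first half and the Y values in the second.

```rust
/// Build an SVG path string from decoded values (step 4).
/// The first half of `values` holds X entries, the second half Y entries.
/// The first pair is the absolute start point, the rest are deltas.
fn to_svg_path(values: &[i64]) -> Option<String> {
    // An empty or odd-length input cannot be split into X/Y halves.
    if values.is_empty() || values.len() % 2 != 0 {
        return None;
    }
    let (xs, ys) = values.split_at(values.len() / 2);

    // "M" sets the absolute start point.
    let mut path = format!("M {} {}", xs[0], ys[0]);

    // "l" (lowercase) draws lines using relative dx/dy pairs.
    if xs.len() > 1 {
        path.push_str(" l");
        for (dx, dy) in xs[1..].iter().zip(&ys[1..]) {
            path.push_str(&format!(" {} {}", dx, dy));
        }
    }
    Some(path)
}
```

For the values 10, 1, -2, 20, 3, 4 this yields `M 10 20 l 1 3 -2 4`. Using the relative `l` command means the deltas can be written out unchanged, without summing them first.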

During implementation I found that there's a second type of Ink path, embedded in rich text objects. This seems to be used for ink paths that OneNote detects to be handwriting. It uses the same data format, but it's really tricky to get the alignment right (inside of one2html) so that the resulting image doesn't get messed up with lines and handwriting objects all over the place. This last bit has kept me from finishing this up for GitHub 🙈

@blu-base

First, thanks for your work with libmson! It's been a great help for understanding the OneNote file format while working on onenote.rs!

I am glad it was of some help. It's mostly a playground right now where I challenge myself to learn more about software development, since I didn't study computer science. I hope to improve its structure as I gain experience.

Anyway, I have a somewhat working version of Ink parsing locally but haven't had the time to polish it up for publishing. From my investigation I was able to determine that the Ink paths are stored as multi-byte data according to the ISF specification, specifically the section on Multi-byte Encoding of Signed Numbers. The algorithm is as follows:

This is good news to me. Thank you for sharing this information. I should have read that spec until the end! =) It just didn't occur to me that it would contain the information I needed.

Regarding:

During implementation I found that there's a second type of Ink path, embedded in rich text objects. This seems to be used for ink paths that OneNote detects to be handwriting.

Could you check the data of PropertyID 0x1c00340a?
Initially I labeled it as undoc_Undetermined64byteBlock in the PropertyID enum on line 359. This block is actually the Metric Table according to the ISF. Does this PropertyID in your case also contain only the GUIDs for X and Y, or in other words, is it only 64 bytes wide?

I'm not sure how this is included in ISF, but the Metric Table is an array of 32-byte structures.
Each structure has this byte format:

| Tag GUID (16 bytes)                                                                |
| logic min (int32) | logic max (int32) | unit enum (32 bits) | resolution (float32) |

On my machine, I only ever have the tag_x and tag_y, although I tested input from a Wacom tablet.
Let me know if you need more detail.
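
To illustrate the layout above, here is a minimal parsing sketch. MetricEntry and parse_metric_table are hypothetical names (they are not taken from libmson or onenote.rs), and the sketch assumes that all numeric fields are little-endian, which is not stated above. The GUID is kept as raw bytes and the unit enum as its raw 32-bit value.

```rust
/// One 32-byte Metric Table entry, following the layout given above.
#[derive(Debug, Clone, PartialEq)]
struct MetricEntry {
    guid: [u8; 16], // raw tag GUID bytes, left uninterpreted
    logical_min: i32,
    logical_max: i32,
    unit: u32, // unit enum, kept as its raw 32-bit value
    resolution: f32,
}

/// Split a property blob into 32-byte entries.
/// ASSUMPTION: all numeric fields are little-endian.
fn parse_metric_table(data: &[u8]) -> Option<Vec<MetricEntry>> {
    const ENTRY_SIZE: usize = 32;
    // Reject blobs that are not a whole number of entries.
    if data.len() % ENTRY_SIZE != 0 {
        return None;
    }

    let entries: Vec<MetricEntry> = data
        .chunks_exact(ENTRY_SIZE)
        .map(|chunk| {
            // Copy four bytes starting at `offset` into a fixed-size array.
            let word = |offset: usize| -> [u8; 4] {
                [chunk[offset], chunk[offset + 1], chunk[offset + 2], chunk[offset + 3]]
            };
            let mut guid = [0u8; 16];
            guid.copy_from_slice(&chunk[..16]);

            MetricEntry {
                guid,
                logical_min: i32::from_le_bytes(word(16)),
                logical_max: i32::from_le_bytes(word(20)),
                unit: u32::from_le_bytes(word(24)),
                resolution: f32::from_le_bytes(word(28)),
            }
        })
        .collect();
    Some(entries)
}
```

A 64-byte block, as in the X/Y-only case described above, would produce exactly two entries.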

@msiemens
Owner Author

I should have read that spec until the end!

I wish it were so easy 🙈 The spec contains multiple types of multi-byte encodings, and for the first few tries I applied the more complicated techniques described in the Compression Algorithms for Packet Data section. This would have made the most sense to me: according to the ISF spec, coordinate streams are packet data which may be encoded using bit packing or Huffman tables. But it turns out that OneNote doesn't use most of the ISF spec except for the simplest form of multi-byte encoding described there…

Does this PropertyID in your case also contain only the GUID for x and y, or in other words is it only 64 bytes wide?

Yes, that looks right to me. As far as I can tell, 598a6a8f-52c0-4ba0-93af-af357411a561 is the X dimension, b53f9f75-04e0-4498-a7ee-c30dbb5a9011 is the Y dimension.

By the way, I've now pushed my ink parsing implementation in a407e1e

@blu-base

In src/one/property_set/ink_stroke_node.rs you parsed the InkPath from the multi_byte signed int vector. I am about to finish implementing the multi-byte part myself as well.

I noticed that the first value in InkPath is actually the number of data elements in the vector. For example, a multi_byte structure with 113 elements would contain 112 in the first element. So the actual vector (Vec<i64>, or std::vector<uint64_t> respectively) contains only 112 integers.
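
A check for this could be sketched as below. The helper name strip_length_prefix is hypothetical, and the sketch assumes that the count is read as a plain unsigned multi-byte value (not sign-decoded) and equals the number of elements that follow it.

```rust
/// Check the leading count and return the values that follow it.
/// ASSUMPTION: the count is a plain unsigned value equal to the number
/// of elements after it.
fn strip_length_prefix(values: &[u64]) -> Option<&[u64]> {
    // An empty input has no count to read.
    let (&count, rest) = values.split_first()?;
    if count == rest.len() as u64 {
        Some(rest)
    } else {
        // The stored count does not match what was actually decoded.
        None
    }
}
```

With this, a decoded structure of 113 elements whose first element is 112 would pass the check and yield the remaining 112 values.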

Although I haven't attempted to learn Rust yet, I tried to follow that structure in your library. As far as I understand, the fn decode(input: &[u8]) -> Vec<u64> in src/shared/multi_byte.rs neither removes the first element from the list nor validates the length of the vector it read.

Is my understanding correct? Or am I missing some language feature which ignores the first element in a multi-byte vector? Might this be a source of the "weirdness" you described in the embedded inks?

Best regards,
Sebastian

@msiemens
Owner Author

I noticed that the first value in InkPath is actually the number of data elements in the vector. For example, a multi_byte structure with 113 elements would contain 112 in the first element.

Very interesting! Didn't notice that 🙂

As far as I understand, the fn decode(input: &[u8]) -> Vec<u64> in src/shared/multi_byte.rs neither removes the first element from the list nor validates the length of the vector it read.

Yes, this is correct. multi_byte::decode just decodes the data according to the ISF spec. I discard the first value at a later point (as I hadn't figured out that it was the data length):

fn parse_ink_path(
    data: Vec<i64>,
    props: &stroke_properties_node::Data,
    scale_x: Option<f32>,
    scale_y: Option<f32>,
) -> Result<Vec<InkPoint>> {
    // Ignore the first value, we don't know what it means
    let data = &data[1..];
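
After the first value is dropped, the remaining values could be turned into absolute points along these lines. This is only a sketch and not the rest of parse_ink_path: Point and accumulate_points are hypothetical names, and multiplying the running sums by plain f32 scale factors is an assumption about how the scaling is applied.

```rust
/// An absolute, scaled point (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f32,
    y: f32,
}

/// Undo the differential encoding by summing the deltas.
/// The first half of `values` holds X entries, the second half Y entries.
/// ASSUMPTION: scaling is a plain multiplication of the running sums.
fn accumulate_points(values: &[i64], scale_x: f32, scale_y: f32) -> Option<Vec<Point>> {
    // An odd-length input cannot be split into X/Y halves.
    if values.len() % 2 != 0 {
        return None;
    }
    let (xs, ys) = values.split_at(values.len() / 2);

    let mut x: i64 = 0;
    let mut y: i64 = 0;
    let mut points = Vec::with_capacity(xs.len());

    for (dx, dy) in xs.iter().zip(ys) {
        // The first pair acts as the start point, later pairs as deltas.
        // Give up instead of overflowing on malformed input.
        x = x.checked_add(*dx)?;
        y = y.checked_add(*dy)?;
        points.push(Point {
            x: x as f32 * scale_x,
            y: y as f32 * scale_y,
        });
    }
    Some(points)
}
```

For the values 10, 1, -2, 20, 3, 4 and scale factors 1.0 and 0.5, the running sums are (10, 20), (11, 23), (9, 27), which scale to (10, 10), (11, 11.5), (9, 13.5).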

Might this be a source of the "weirdness" you described in the embedded inks?

I took another shot at embedded ink data and figured out that my two main issues were an incorrect usage of ID mapping tables (global ID table in regular OneNote files) and an incorrect indentation of paragraphs (I didn't properly use the RgOutlineIndentDistance indentation width array) along with some minor issues regarding paragraph spacing (msiemens/one2html@46cd6c0).

Although I haven't attempted to learn Rust yet, I tried to follow that structure in your library.

I have to admit that the code for this project isn't really clean at the moment (and probably not at all a good representation of clean Rust code). In some places I went overboard with trying to use a functional style even when it only complicates everything. Cleaning up the code base is on my list but has low priority as getting ink parsing to work was more important to me 🙂

@blu-base

Yes, this is correct. multi_byte::decode just decodes the data according to the ISF spec. I discard the first value at a later point (as I hadn't figured out that it was the data length):

Great, I didn't notice that part before. I spent an hour or two trying to figure out why my parser would always read an odd length, believing I had made a mistake somewhere... Only when I printed the actual size next to the vector did I notice the correlation, haha.

I have to admit that the code for this project isn't really clean at the moment (and probably not at all a good representation of clean Rust code).

I wouldn't be able to tell. Besides, I believe I'm in the same boat.

getting ink parsing to work was more important to me 🙂

Same here. When I got your code the first time I was psyched how well it worked already. You obviously have a good understanding of the top-level document graph. That's something where I have to catch up.

@msiemens
Owner Author

I spent an hour or two trying to figure out why my parser would always read an odd length, believing I had made a mistake somewhere...

Same for me :) In the end I decided to accept that it doesn't make sense to me and to not be bothered by it

You obviously have a good understanding of the top-level document graph.

What really helped me was the overview in the OneNote JS API documentation:

(from https://docs.microsoft.com/en-gb/office/dev/add-ins/onenote/onenote-add-ins-programming-overview#onenote-object-model-diagram).

It skips over some details (e.g. ink words being contained in ink lines and ink paragraphs) but helped me get a sense of the overall object graph. Playing with the Add-In JS API for OneNote was also helpful, as it contains more details and allows inspecting how OneNote itself structures its data (see also the API docs).


2 participants