Raw reader #910

Open
konsumer opened this issue Sep 14, 2017 · 7 comments

@konsumer

konsumer commented Sep 14, 2017

I am attempting to make a binary-to-info raw parser, so I can get output similar to protoc --decode_raw from an existing binary message, and eventually generate a vague-but-won't-error proto file as a basis for parsing. I think I need a little help with sub-messages, and I'm not sure I'm grabbing types correctly.

I've got a message I built from a proto like this:

syntax = "proto3";

message Test {
  repeated int32 nums = 1;
  int64 num = 2;
  string str = 3;
  repeated Child children = 4;
}

message Child {
  int64 num = 1;
  string str = 2;
  repeated Child children = 3;
}

and I get a binary message that looks like this:

00000000: 0a05 0102 0304 0510 011a 0568 656c 6c6f  ...........hello
00000010: 220c 0801 1204 636f 6f6c 1a02 0801 220f  ".....cool....".
00000020: 0802 1207 6177 6573 6f6d 651a 0208 0222  ....awesome...."
00000030: 0c08 0312 046e 6561 741a 0208 03         .....neat....

To parse it, I've been looking at Google's docs and your nice article, as well as the linked issues #55 and #736.

From that, I've made this function:

const { Reader } = require('protobufjs')

const getProto = buffer => {
  const reader = Reader.create(buffer)
  const out = []
  while (reader.pos < reader.len) {
    // a tag always fits in 32 bits; uint64() may return a Long object instead of a number
    const tag = reader.uint32()
    const id = tag >>> 3
    const wireType = tag & 7
    switch (wireType) {
      case 0: // int32, int64, uint32, bool, enum, etc
        // note: uint32() keeps only the low 32 bits of a wider varint
        out.push({id, wireType, type: 'int', value: reader.uint32()})
        break
      case 1: // fixed64, sfixed64, double
        out.push({id, wireType, type: 'long', value: reader.fixed64()})
        break
      case 2: // string, bytes, sub-message, packed repeated
        out.push({id, wireType, type: 'string', value: reader.bytes()})
        break
      // IGNORE start_group
      // IGNORE end_group
      case 5: // fixed32, sfixed32, float
        out.push({id, wireType, type: 'float', value: reader.float()})
        break
      default: reader.skipType(wireType)
    }
  }
  return out
}

Now, I know that there are different types, encoded differently, which can't be guessed without the proto file, but I figure that I can mark the types in the output so people can tweak their own generated proto files. This will at least get them started (much like protoc --decode_raw.)

My question is: how do I tell whether a length-delimited field (wire type 2) holds a string, a sub-message, or a packed repeated field, and am I using appropriate decoders for the other wire types?
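
For reference, here is a small worked example of the tag arithmetic used in the function above, applied to the four top-level tag bytes in the hex dump. It is only a sketch: `splitTag` is a made-up helper name, not part of protobufjs.

```javascript
// Split a tag into its field number and wire type.
// A tag is (fieldNumber << 3) | wireType, and field numbers stay below 2^29,
// so the whole tag fits in an unsigned 32-bit integer.
const splitTag = tag => ({ id: tag >>> 3, wireType: tag & 7 })

// Top-level tag bytes copied from the hex dump of the test message.
for (const tag of [0x0a, 0x10, 0x1a, 0x22]) {
  const { id, wireType } = splitTag(tag)
  console.log(`0x${tag.toString(16).padStart(2, '0')} -> field ${id}, wire type ${wireType}`)
}
// 0x0a -> field 1, wire type 2
// 0x10 -> field 2, wire type 0
// 0x1a -> field 3, wire type 2
// 0x22 -> field 4, wire type 2
```

Fields 1, 3 and 4 all share wire type 2, which is why the tag alone cannot distinguish the packed ints, the string, and the sub-messages.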

@konsumer
Author

konsumer commented Sep 14, 2017

As a sidenote, the protoc --decode_raw output is less than ideal, but still sort of readable:

1: "\001\002\003\004\005"
2: 1
3: "hello"
4 {
  1: 1
  2: "cool"
  3 {
    1: 1
  }
}
4 {
  1: 2
  2: "awesome"
  3 {
    1: 2
  }
}
4 {
  1: 3
  2: "neat"
  3 {
    1: 3
  }
}

I'm hoping I can do better, specifically with field 1's repeated ints.
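
One possible way to surface field 1's repeated ints is to try reading the length-delimited payload as packed varints. The sketch below is plain JavaScript with no protobufjs dependency; `readVarint` and `readPackedVarints` are hypothetical helper names, and it assumes values small enough to fit a JavaScript Number.

```javascript
// Read one base-128 varint starting at `pos`.
// Returns the decoded value and the position of the next unread byte.
// Sketch only: Number arithmetic loses precision above 2^53 - 1.
const readVarint = (bytes, pos) => {
  let value = 0
  let scale = 1
  for (let i = 0; i < 10 && pos < bytes.length; i++) {
    const byte = bytes[pos++]
    value += (byte & 0x7f) * scale
    if ((byte & 0x80) === 0) return { value, pos }
    scale *= 128
  }
  throw new RangeError('truncated or overlong varint')
}

// Interpret a length-delimited payload as packed varints.
// Returns null when the payload does not end on a varint boundary.
const readPackedVarints = bytes => {
  const values = []
  let pos = 0
  try {
    while (pos < bytes.length) {
      const next = readVarint(bytes, pos)
      values.push(next.value)
      pos = next.pos
    }
  } catch (err) {
    return null
  }
  return values
}

console.log(readPackedVarints(Uint8Array.from([1, 2, 3, 4, 5]))) // [ 1, 2, 3, 4, 5 ]
```

This cannot settle the question by itself: any payload whose last byte has the high bit clear also reads as packed varints, so the bytes of "hello" come back as five numbers too. It is one candidate interpretation to show next to the string and sub-message guesses.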

@konsumer
Author

konsumer commented Sep 14, 2017

Also, I'd be happy to put this all in a PR as a built-in method so others can very easily reverse protobufs, once I get it figured out.

@konsumer
Author

The current method I've got (which I am positive is wrong) actually gets pretty close to protoc's output:

const { Reader } = require('protobufjs')

const getData = buffer => {
  const reader = Reader.create(buffer)
  const out = []
  while (reader.pos < reader.len) {
    // a tag always fits in 32 bits; uint64() may return a Long object instead of a number
    const tag = reader.uint32()
    const id = tag >>> 3
    const wireType = tag & 7
    switch (wireType) {
      case 0: // int32, int64, uint32, bool, enum, etc
        out.push({[id]: reader.uint32()})
        break
      case 1: // fixed64, sfixed64, double
        out.push({[id]: reader.fixed64()})
        break
      case 2: { // string, bytes, sub-message, packed repeated
        // assumes a Node Buffer input, so that bytes.toString() decodes UTF-8
        const bytes = reader.bytes()
        // TODO: this isn't the right way to do this at all, I'm sure
        if (bytes[0] === 8) {
          out.push({[id]: getData(bytes)})
        } else {
          out.push({[id]: bytes.toString()})
        }
        break
      }
      // IGNORE start_group
      // IGNORE end_group
      case 5: // fixed32, sfixed32, float
        out.push({[id]: reader.float()})
        break
      default: reader.skipType(wireType)
    }
  }
  return out
}

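The if (bytes[0] === 8) test works on the dumb demo-data, but nothing else. I assumed 8 was a message-type marker, but it is really just the tag byte for field 1 with wire type 0, which happens to open every sub-message here. Anyone have any strategies for working out the test that should go here?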
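
One strategy that might replace the first-byte test is to attempt a full parse of the payload and accept it as a sub-message only if every tag is valid and the bytes are consumed exactly. The sketch below is plain JavaScript rather than protobufjs code, and `readVarint`, `tryParseMessage` and `hexToBytes` are hypothetical names. It is a heuristic: a string or byte blob can happen to parse cleanly and be misreported as a sub-message.

```javascript
// Read one base-128 varint; throws if the input ends mid-varint.
const readVarint = (bytes, pos) => {
  let value = 0
  let scale = 1
  for (let i = 0; i < 10 && pos < bytes.length; i++) {
    const byte = bytes[pos++]
    value += (byte & 0x7f) * scale
    if ((byte & 0x80) === 0) return { value, pos }
    scale *= 128
  }
  throw new RangeError('truncated or overlong varint')
}

// Try to read `bytes` as a message. Returns a list of {id, wireType, value}
// entries, or null when the bytes are not a well-formed message.
const tryParseMessage = bytes => {
  const fields = []
  let pos = 0
  try {
    while (pos < bytes.length) {
      const tag = readVarint(bytes, pos)
      pos = tag.pos
      const id = Math.floor(tag.value / 8)
      const wireType = tag.value % 8
      if (id === 0 || id > 0x1fffffff) return null // outside the valid field-number range
      if (wireType === 0) {
        const varint = readVarint(bytes, pos)
        pos = varint.pos
        fields.push({ id, wireType, value: varint.value })
      } else if (wireType === 1 || wireType === 5) {
        const size = wireType === 1 ? 8 : 4
        if (pos + size > bytes.length) return null
        fields.push({ id, wireType, value: bytes.subarray(pos, pos + size) })
        pos += size
      } else if (wireType === 2) {
        const length = readVarint(bytes, pos)
        pos = length.pos
        if (pos + length.value > bytes.length) return null
        const payload = bytes.subarray(pos, pos + length.value)
        pos += length.value
        // Recurse: keep the nested field list if it parses, else the raw bytes.
        const nested = payload.length > 0 ? tryParseMessage(payload) : null
        fields.push({ id, wireType, value: nested || payload })
      } else {
        return null // groups (3, 4) are not handled; 6 and 7 are undefined
      }
    }
  } catch (err) {
    return null
  }
  return fields
}

const hexToBytes = hex => Uint8Array.from(hex.match(/../g), pair => parseInt(pair, 16))

// The test message from the hex dump.
const message = hexToBytes(
  '0a05010203040510011a0568656c6c6f' +
  '220c08011204636f6f6c1a020801220f' +
  '08021207617765736f6d651a02080222' +
  '0c080312046e6561741a020803'
)

for (const field of tryParseMessage(message)) {
  console.log(field.id, Array.isArray(field.value) ? 'sub-message' : field.value)
}
```

On this message the packed ints and the strings all fail the parse (field number 0, an end-group wire type, a short fixed64, and so on), so only field 4 and its nested field 3 come back as sub-messages.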

This is my current test message:

00000000: 0a05 0102 0304 0510 011a 0568 656c 6c6f  ...........hello
00000010: 220c 0801 1204 636f 6f6c 1a02 0801 220f  ".....cool....".
00000020: 0802 1207 6177 6573 6f6d 651a 0208 0222  ....awesome...."
00000030: 0c08 0312 046e 6561 741a 0208 03         .....neat....

And this is what the above function outputs:

[
  {
    "1": "\u0001\u0002\u0003\u0004\u0005"
  },
  {
    "2": 1
  },
  {
    "3": "hello"
  },
  {
    "4": [
      {
        "1": 1
      },
      {
        "2": "cool"
      },
      {
        "3": [
          {
            "1": 1
          }
        ]
      }
    ]
  },
  {
    "4": [
      {
        "1": 2
      },
      {
        "2": "awesome"
      },
      {
        "3": [
          {
            "1": 2
          }
        ]
      }
    ]
  },
  {
    "4": [
      {
        "1": 3
      },
      {
        "2": "neat"
      },
      {
        "3": [
          {
            "1": 3
          }
        ]
      }
    ]
  }
]

@konsumer
Author

Ok, I think I have the basics worked out at rawproto. Happy to make a PR to this project, if it's desired, and would love any suggestions (I'm not totally confident I'm doing it right).

Here is example output:

[
  {
    "1": {
      "type": "Buffer",
      "data": [
        1,
        2,
        3,
        4,
        5
      ]
    }
  },
  {
    "2": 1
  },
  {
    "3": "hello"
  },
  {
    "4": [
      {
        "1": 1
      },
      {
        "2": "cool"
      },
      {
        "3": [
          {
            "1": 1
          }
        ]
      }
    ]
  },
  {
    "4": [
      {
        "1": 2
      },
      {
        "2": "awesome"
      },
      {
        "3": [
          {
            "1": 2
          }
        ]
      }
    ]
  },
  {
    "4": [
      {
        "1": 3
      },
      {
        "2": "neat"
      },
      {
        "3": [
          {
            "1": 3
          }
        ]
      }
    ]
  }
]

@konsumer
Author

konsumer commented Sep 27, 2017

Just wanted to leave this idea here:

Looking through the protoc source, it appears there is no "raw-reader", it's just that the regular reader is ok with extra fields not defined in the proto (and adds them as numeric-named fields.)

Raw parsing is basically just "use an empty proto message": every field is then an extra field, so it gets added under its numeric name. If protobufjs had an option to do this (don't throw on extra fields, just add them as numeric fields with guessed types), we'd have a raw parser, plus the other thing that protoc can't do with this stuff: "I have this proto which defines some of the fields, but there are some extra fields I don't know about, so just add those as numeric fields for further analysis."

Here is an example of this:

I made a proto binary message, like above, but added an extra string field that's not in the proto.

I used this proto:

syntax = "proto3";

message Test {
  repeated int32 nums = 1;
  int64 num = 2;
  string str = 3;
  repeated Child children = 4;
}

message Child {
  int64 num = 1;
  string str = 2;
  repeated Child children = 3;
  string extra = 4;
}

That produced this binary message:

00000000: 0a05 0102 0304 0510 011a 0568 656c 6c6f  ...........hello
00000010: 221c 0801 1204 636f 6f6c 1a12 0801 220e  ".....cool....".
00000020: 7468 6973 2069 7320 6578 7472 612e 221f  this.is.extra.".
00000030: 0802 1207 6177 6573 6f6d 651a 1208 0222  ....awesome...."
00000040: 0e74 6869 7320 6973 2065 7874 7261 2e22  .this.is.extra."
00000050: 1c08 0312 046e 6561 741a 1208 0322 0e74  .....neat....".t
00000060: 6869 7320 6973 2065 7874 7261 2e         his.is.extra.

I removed Child.extra, then ran protoc on it, with the definition:

nums: 1
nums: 2
nums: 3
nums: 4
nums: 5
num: 1
str: "hello"
children {
  num: 1
  str: "cool"
  children {
    num: 1
  }
}
children {
  num: 2
  str: "awesome"
  children {
    num: 2
  }
}
children {
  num: 3
  str: "neat"
  children {
    num: 3
  }
}

and protoc, using --decode_raw:

1: "\001\002\003\004\005"
2: 1
3: "hello"
4 {
  1: 1
  2: "cool"
  3 {
    1: 1
    4: "this is extra."
  }
}
4 {
  1: 2
  2: "awesome"
  3 {
    1: 2
    4: "this is extra."
  }
}
4 {
  1: 3
  2: "neat"
  3 {
    1: 3
    4: "this is extra."
  }
}

If I had the equivalent combination of the 2, in protobufjs, it would be easier to reverse-engineer it:

nums: 1
nums: 2
nums: 3
nums: 4
nums: 5
num: 1
str: "hello"
children {
  num: 1
  str: "cool"
  children {
    num: 1
    4: "this is extra."
  }
}
children {
  num: 2
  str: "awesome"
  children {
    num: 2
    4: "this is extra."
  }
}
children {
  num: 3
  str: "neat"
  children {
    num: 3
    4: "this is extra."
  }
}

So, with this, I could loop through the fields and test for numeric names to find the ones I should take a closer look at. Since I would have all the other awesome context info that protobufjs has, it would be very easy to figure out which definitions have an extra field (in this case, Child has a string in field 4). Does this seem like a thing that I should PR for?

@konsumer
Author

So basically, output of Test.decode(encoded) would be:

{
  "nums": [
    1,
    2,
    3,
    4,
    5
  ],
  "num": "1",
  "str": "hello",
  "children": [
    {
      "num": "1",
      "str": "cool",
      "children": [
        {
          "num": "1",
          "4": "this is extra."
        }
      ]
    },
    {
      "num": "2",
      "str": "awesome",
      "children": [
        {
          "num": "2",
          "4": "this is extra."
        }
      ]
    },
    {
      "num": "3",
      "str": "neat",
      "children": [
        {
          "num": "3",
          "4": "this is extra."
        }
      ]
    }
  ]
}
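
Until something like that exists in protobufjs, a merged view of this kind could be approximated by post-processing raw-parser output with a partial field map. This is only a sketch: `schema` and `applyNames` are hypothetical, the input uses the `[{id: value}, ...]` shape that the raw parser earlier in this thread produces, and none of it is protobufjs API.

```javascript
// Partial definitions, keyed by field number. Child.extra (field 4) is
// deliberately missing, as in the example above.
const schema = {
  Test: {
    1: { name: 'nums' },
    2: { name: 'num' },
    3: { name: 'str' },
    4: { name: 'children', type: 'Child', repeated: true }
  },
  Child: {
    1: { name: 'num' },
    2: { name: 'str' },
    3: { name: 'children', type: 'Child', repeated: true }
  }
}

// Rename known fields and keep unknown ones under their numeric key.
const applyNames = (rawFields, typeName) => {
  const known = schema[typeName] || {}
  const out = {}
  for (const entry of rawFields) {
    for (const [id, value] of Object.entries(entry)) {
      const def = known[id]
      const key = def ? def.name : id
      // Recurse only when the definition says this field is a message.
      const named = def && def.type && Array.isArray(value)
        ? applyNames(value, def.type)
        : value
      if (def && def.repeated) {
        if (!out[key]) out[key] = []
        out[key].push(named)
      } else {
        out[key] = named // unknown fields seen twice keep the last value
      }
    }
  }
  return out
}

// Raw output for a cut-down version of the test message.
const raw = [
  { 2: 1 },
  { 3: 'hello' },
  { 4: [{ 1: 1 }, { 2: 'cool' }, { 3: [{ 1: 1 }, { 4: 'this is extra.' }] }] }
]

const decoded = applyNames(raw, 'Test')
console.log(JSON.stringify(decoded, null, 2))
```

Looping over the result and testing keys with something like `/^\d+$/` would then list the fields that still need a definition.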

@konsumer
Author

konsumer commented Dec 6, 2020

Coming back to this years later, as I need to decode another protobuf without fully having the proto def. I see no one has commented. Is this a subject anyone else has interest in a PR about?

In my latest adventures in reversing protobufs, I discovered protoc does this now: Fills in the fields it knows from the proto definitions, and leaves the rest numeric, so if I use this proto to create it:

syntax = "proto3";

message Test {
  repeated int32 nums = 1;
  int64 num = 2;
  string str = 3;
  repeated Child children = 4;
}

message Child {
  int64 num = 1;
  string str = 2;
  repeated Child children = 3;
  string extra = 4;
}

Then I comment out extra and I get this:

cat demo.pb | protoc --decode Test demo.proto
nums: 1
nums: 2
nums: 3
nums: 4
nums: 5
num: 1
str: "hello"
children {
  num: 1
  str: "cool"
  children {
    num: 1
  }
}
children {
  num: 2
  str: "awesome"
  children {
    num: 2
  }
}
children {
  num: 3
  str: "neat"
  children {
    num: 3
  }
  4: "this is extra."
}

I'd still like to PR it to this lib, if there is interest. It would mean I can deprecate my old raw parser, and it would make reverse-engineering proto definitions (from partial definitions) even easier. In addition, I made a separate function to infer a basic proto from the raw binary, sort of like this:

syntax = "proto3";

message Message3 {
  int32 field1 = 1; // could be a int32, int64, uint32, bool, enum, etc, or even a float of some kind
}

message Message4 {
  int32 field1 = 1; // could be a int32, int64, uint32, bool, enum, etc, or even a float of some kind
  bytes field2 = 2; // could be a repeated-value, string, bytes, or malformed sub-message
  Message3 subMessage3 = 3;
}

message MessageRoot {
  bytes field1 = 1; // could be a repeated-value, string, bytes, or malformed sub-message
  int32 field2 = 2; // could be a int32, int64, uint32, bool, enum, etc, or even a float of some kind
  bytes field3 = 3; // could be a repeated-value, string, bytes, or malformed sub-message
  repeated Message4 subMessage4 = 4;
}

It's not perfect, but a similar idea could be used to generate a working proto that does not error on unknown formats. It would be cool to integrate these ideas and get partial inference: use the proto where it applies, and fill in the rest with generated names. Then you could look through the data and find better names for the things you can figure out. It would also let you keep editing your proto as you figure fields out, and the next time it parses a message, it would pick up the new field definitions.
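
To make the inference idea concrete, here is a rough sketch of how a proto like the one above could be generated from raw-parsed fields. It is not the actual rawproto code: `WIRE_GUESS` and `inferProto` are hypothetical names, the input is assumed to be a list of `{id, wireType, value}` entries where a sub-message value is a nested list, and message names are qualified by their parent (`MessageRoot_4`) to avoid name clashes, which differs from the naming shown above.

```javascript
// Default guess per wire type, with a comment listing the alternatives.
const WIRE_GUESS = {
  0: { type: 'int32', note: 'could be int32, int64, uint32, bool, enum, etc' },
  1: { type: 'fixed64', note: 'could be fixed64, sfixed64, or double' },
  2: { type: 'bytes', note: 'could be a packed repeated value, string, or bytes' },
  5: { type: 'fixed32', note: 'could be fixed32, sfixed32, or float' }
}

// Build one message definition per nesting level. Nested messages are pushed
// first, so the root message ends up last in the returned list.
const inferProto = (fields, name = 'MessageRoot', messages = []) => {
  // Group by field number; seeing a number twice is the only evidence of `repeated`.
  const byId = new Map()
  for (const field of fields) {
    const seen = byId.get(field.id)
    if (seen) seen.count++
    else byId.set(field.id, { count: 1, sample: field })
  }

  const lines = []
  for (const [id, { count, sample }] of byId) {
    const label = count > 1 ? 'repeated ' : ''
    if (Array.isArray(sample.value)) {
      // Only the first occurrence is used to infer the nested layout.
      const child = `${name}_${id}`
      inferProto(sample.value, child, messages)
      lines.push(`  ${label}${child} subMessage${id} = ${id};`)
    } else if (WIRE_GUESS[sample.wireType]) {
      const { type, note } = WIRE_GUESS[sample.wireType]
      lines.push(`  ${label}${type} field${id} = ${id}; // ${note}`)
    }
  }
  messages.push(`message ${name} {\n${lines.join('\n')}\n}`)
  return messages
}

// Hand-written sample input in the assumed shape.
const parsed = [
  { id: 1, wireType: 2, value: Uint8Array.from([1, 2, 3, 4, 5]) },
  { id: 2, wireType: 0, value: 1 },
  { id: 4, wireType: 2, value: [{ id: 1, wireType: 0, value: 1 }] },
  { id: 4, wireType: 2, value: [{ id: 1, wireType: 0, value: 2 }] }
]

console.log(`syntax = "proto3";\n\n${inferProto(parsed).join('\n\n')}`)
```

Combining this with a partial proto would mean generating entries only for field numbers the existing definitions don't cover.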
