Too many Code duplicates #89

Closed
fluency03 opened this issue Nov 15, 2018 · 14 comments · Fixed by #90

@fluency03
Contributor

fluency03 commented Nov 15, 2018

Such as:

raw 0x55 - base64urlpad 'U', which is 0x55
bencode 0x63 - base32pad 'c', which is 0x63
dbl-sha2-256 0x56 - base32hex-upper 'V', which is 0x56

And,

multihash 0x31 - base1 '1', which is 0x31
multicodec 0x30 - base2 '0', which is 0x30
dns6 0x37 - base8 '7', which is 0x37
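The overlaps listed above can be checked mechanically. This is only a sketch: the pairs are the ones from this comment, and `pairs`, `collides` and `collisions` are names made up for illustration.

```javascript
// Pairs taken from the lists above:
// [codec name, codec code, base name, base prefix character].
const pairs = [
  ['raw', 0x55, 'base64urlpad', 'U'],
  ['bencode', 0x63, 'base32pad', 'c'],
  ['dbl-sha2-256', 0x56, 'base32hex-upper', 'V'],
  ['multihash', 0x31, 'base1', '1'],
  ['multicodec', 0x30, 'base2', '0'],
  ['dns6', 0x37, 'base8', '7'],
];

// A pair collides when the ASCII code of the base prefix equals the codec code.
function collides([, code, , prefix]) {
  return prefix.charCodeAt(0) === code;
}

const collisions = pairs
  .filter(collides)
  .map(([codec, , base]) => `${codec} / ${base}`);
console.log(collisions);
```

Every pair in the list is reported, which is the duplication this issue is about.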

@fluency03 fluency03 changed the title Code duplicates Too many Code duplicates Nov 15, 2018
@Stebalien
Member

See: #59 and the followup #68.

Basically, 'U' happens to encode to 0x55 in ASCII/UTF-8 but 'U', itself, is a symbol. Multibase only really makes sense in a text context where we have a string of character symbols.

Note: after some followup discussions, we realized that these really don't belong in the same table. Technically, bytes are also symbols but I'm not aware of any text encoding that allows for both character symbols and byte symbols. The current setup causes more confusion than it's worth.

@fluency03
Contributor Author

fluency03 commented Nov 15, 2018

@Stebalien

The terminology "symbol" is really confusing, because it does not belong to any data type in, as far as I know, any programming language. At least most languages have a Byte data type.

From an implementation point of view, I still don't understand how to represent a symbol when the other codes have a Byte field in the same place.

@fluency03
Contributor Author

fluency03 commented Nov 15, 2018

According to this implementation, https://github.com/multiformats/js-multicodec/blob/master/src/base-table.js, only the following bases are implemented:

// bases encodings
exports['base1'] = Buffer.from('01', 'hex')
exports['base2'] = Buffer.from('00', 'hex')
exports['base8'] = Buffer.from('07', 'hex')
exports['base10'] = Buffer.from('09', 'hex')

And here, the so-called symbols are actually treated as hex-encoded bytes.

Can I do this if I want to implement it in another language?

@fluency03
Contributor Author

Update from #76:

Both js-multicodec and py-multicodec are wrong.

@Stebalien
Member

The terminology "symbol" is really confusing, because it does not belong to any data type in, as far as I know, any programming language.

Copied from #76 (comment) to keep everything in this thread:

For example, binary is composed of two symbols 0 and 1 (or true and false). Bytes are defined to each be a string of 8 binary symbols but are also, themselves, symbols (there are 256 of them).

Every character is also a symbol. On a computer, these symbols may be encoded into bits/bytes, but there are often several ways to encode a single symbol into bits/bytes, and the symbol exists apart from these encodings (a '1' on paper is a '1', not 0x31).
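To make that concrete, here is a small sketch, assuming Node.js (for `Buffer`), that encodes the single symbol '1' in two ways. The variable names are illustrative only.

```javascript
// The symbol '1' is one character; its byte representation depends on the encoding.
const one = '1';

const utf8 = Buffer.from(one, 'utf8');       // one byte: 31
const utf16le = Buffer.from(one, 'utf16le'); // two bytes: 31 00

console.log(utf8.toString('hex'), utf16le.toString('hex'));
```

The symbol is the same in both cases; only the bytes differ.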

When I say symbol, I'm talking about these: https://en.wikipedia.org/wiki/Turing_machine

(I agree this is confusing. It's "technically" correct but I can't think of a better explanation that's still correct.)

@fluency03
Contributor Author

fluency03 commented Nov 15, 2018

So that means, in an actual implementation, a symbol has to be implemented as a special data structure, maybe called Symbol. Some properties of this data type Symbol could be something like this:

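// Sketch: a symbol is either a raw byte or a text character.
case class Symbol(
  isByte: Boolean,    // true for a byte symbol, false for a character symbol
  value: Array[Byte]  // the bytes that represent the symbol
)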

Stebalien added a commit that referenced this issue Nov 15, 2018
Resolution from a discussion with Juan and the discussion on the following
issues:

fixes #89
fixes #76
@ghost ghost assigned Stebalien Nov 15, 2018
@ghost ghost added the in progress label Nov 15, 2018
@Stebalien
Member

It's probably best to just have two tables:

  1. Multicodecs: these use varint byte sequences.
  2. Multibases: use text symbols.

Combining them under a single abstraction probably isn't worth it.
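As a rough illustration of the two-table idea (not the actual layout of any multiformats repository), the split could look like this. The entries are the ones quoted earlier in this thread; `multicodecTable` and `multibaseTable` are hypothetical names.

```javascript
// Table 1 (binary context): multicodec codes are integers,
// written as varints inside byte strings.
const multicodecTable = new Map([
  [0x55, 'raw'],
  [0x63, 'bencode'],
  [0x56, 'dbl-sha2-256'],
]);

// Table 2 (text context): multibase prefixes are characters, not bytes.
const multibaseTable = new Map([
  ['U', 'base64urlpad'],
  ['c', 'base32pad'],
  ['V', 'base32hex-upper'],
]);

// The keys have different types, so 0x55 and 'U' are never compared.
console.log(multicodecTable.get(0x55), multibaseTable.get('U'));
```

With separate tables, a lookup never has to decide whether a key is a byte or a character.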

For multibase, you'd just use whatever encoding your language supports. For example, the symbol 👍 has one encoding in UTF-8, another in UTF-16, and another in UTF-32. At the end of the day, that doesn't really matter. The important part is whether or not some string starts with the symbol 👍 (regardless of encoding).
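A short sketch of that point, assuming Node.js; the string `'👍payload'` is a made-up example, not a real multibase value.

```javascript
const text = '👍payload'; // hypothetical string whose prefix symbol is 👍

// The same symbol has different byte encodings...
const utf8Hex = Buffer.from('👍', 'utf8').toString('hex');     // f09f918d
const utf16Hex = Buffer.from('👍', 'utf16le').toString('hex'); // 3dd84ddc

// ...but the prefix check works on the string itself,
// so the underlying encoding never comes up.
const hasPrefix = text.startsWith('👍');
console.log(utf8Hex, utf16Hex, hasPrefix);
```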

@fluency03
Contributor Author

However, a new question is: if we treat so many different things (such as protobuf, md4, murmur3, even ip4, udp and http) as different types of codec, why should we exclude BaseN from the codecs?

@Stebalien
Member

Those all occur in a binary context. That is, they all answer the question "what does this series of bytes mean". However, multibase occurs in a text context. It answers the question "how do I convert this sequence of characters to a sequence of bytes".
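For the binary side, a minimal sketch of reading a varint code from the front of a byte array might look like the following. `readVarint` is a hypothetical helper written for this example, and the input bytes are invented; 0x55 is the code for raw mentioned at the top of this issue.

```javascript
// Read an unsigned varint (7 bits per byte, least significant group first,
// high bit set on every byte except the last) from a byte array.
// Returns the decoded value and the number of bytes consumed.
function readVarint(bytes, offset = 0) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    const byte = bytes[i];
    value += (byte & 0x7f) * 2 ** shift; // multiply instead of << to avoid 32-bit overflow
    if ((byte & 0x80) === 0) {
      return { value, length: i - offset + 1 };
    }
    shift += 7;
  }
  throw new Error('truncated varint');
}

// Hypothetical input: the code 0x55 followed by two payload bytes.
const chunk = Uint8Array.from([0x55, 0xde, 0xad]);
const { value: code, length: consumed } = readVarint(chunk);
const payload = chunk.slice(consumed);
console.log(code.toString(16), payload.length);
```

The input here is bytes from start to finish; no text encoding is involved at any point.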

@fluency03
Contributor Author

import com.github.fluency03.multibase.Multibase
import com.github.fluency03.multibase.Base._

val str = "Multibase is awesome! \\o/"

Multibase.encodeString(Base32Upper, str)              // BJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP
Multibase.encodeString(Base32Pad, str)                // cjv2wy5djmjqxgzjanfzsaylxmvzw63lfeeqfy3zp
Multibase.encodeString(Base32PadUpper, str)           // CJV2WY5DJMJQXGZJANFZSAYLXMVZW63LFEEQFY3ZP

Multibase.encodeString(Base32Z, str)                  // hji4sa7djcjozg3jypf31yamzci3s65mfrrofa53x
Multibase.encodeString(Base58Flickr, str)             // ZxaJjNnAzU5jHQLhoLrXxcVM66Ca1VkLWAT
Multibase.encodeString(Base58BTC, str)                // zYAjKoNbau5KiqmHPmSxYCvn66dA1vLmwbt

Multibase.encodeString(Base64, str)                   // mTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw
Multibase.encodeString(Base64Pad, str)                // MTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==
Multibase.encodeString(Base64URL, str)                // uTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw
Multibase.encodeString(Base64URLPad, str)             // UTXVsdGliYXNlIGlzIGF3ZXNvbWUhIFxvLw==


val encodedStr: String = Multibase.encode(Base16, str.getBytes)
// encodedStr: String = f4d756c74696261736520697320617765736f6d6521205c6f2f

val decodedBytes: Array[Byte] = Multibase.decode(encodedStr)
// decodedBytes: Array[Byte] = Array(77, 117, 108, 116, 105, 98, 97, 115, 101, 32, 105, 115, 32, 97, 119, 101, 115, 111, 109, 101, 33, 32, 92, 111, 47)

val decodedStr = new String(decodedBytes)
// decodedStr: String = Multibase is awesome! \o/

If you take this as an example, you can also say: what does this series of bytes mean?

That is, for this f4d756c74696261736520697320617765736f6d6521205c6f2f:

  • f indicates that the codec is base16
  • based on the codec base16, we can answer the question "what does this series of bytes mean?": it means "Multibase is awesome! \o/".
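The two steps above can be sketched as follows, assuming Node.js. `decodeBase16Multibase` is a hypothetical helper that handles only the f prefix; it is not part of any multibase library.

```javascript
// Decode a multibase string, handling only the base16 prefix 'f'
// used in the example above.
function decodeBase16Multibase(text) {
  const prefix = text[0];        // the leading prefix character
  const encoded = text.slice(1); // everything after the prefix
  if (prefix !== 'f') {
    throw new Error(`unsupported multibase prefix: ${prefix}`);
  }
  return Buffer.from(encoded, 'hex');
}

const decoded = decodeBase16Multibase(
  'f4d756c74696261736520697320617765736f6d6521205c6f2f'
);
const message = decoded.toString('utf8');
console.log(message);
```

Note that the input to this function is a string of characters, and the output is bytes.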

@Stebalien
Member

That's a series of characters, which could encode to entirely different sequences of bytes depending on the underlying text encoding. For example, "f4d756c74696261736520697320617765736f6d6521205c6f2f" encoded in UTF-32 is [255, 254, 0, 0, 102, 0, 0, 0, 52, 0, 0, 0, 100, 0, 0, 0, 55, 0, 0, 0, 53, 0, 0, 0, 54, 0, 0, 0, 99, 0, 0, 0, 55, 0, 0, 0, 52, 0, 0, 0, 54, 0, 0, 0, 57, 0, 0, 0, 54, 0, 0, 0, 50, 0, 0, 0, 54, 0, 0, 0, 49, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 54, 0, 0, 0, 57, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 54, 0, 0, 0, 49, 0, 0, 0, 55, 0, 0, 0, 55, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 55, 0, 0, 0, 51, 0, 0, 0, 54, 0, 0, 0, 102, 0, 0, 0, 54, 0, 0, 0, 100, 0, 0, 0, 54, 0, 0, 0, 53, 0, 0, 0, 50, 0, 0, 0, 49, 0, 0, 0, 50, 0, 0, 0, 48, 0, 0, 0, 53, 0, 0, 0, 99, 0, 0, 0, 54, 0, 0, 0, 102, 0, 0, 0, 50, 0, 0, 0, 102, 0, 0, 0] (bytes/binary).
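The byte sequence above can be reproduced with a small sketch, assuming Node.js. Node has no built-in UTF-32 `Buffer` encoding, so `encodeUtf32LeWithBom` is a hand-written helper for this example only.

```javascript
// Encode a string as UTF-32 little-endian with a leading byte order mark (BOM).
function encodeUtf32LeWithBom(text) {
  // Array.from iterates by code point, not by UTF-16 code unit.
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  const out = Buffer.alloc(4 + codePoints.length * 4);
  out.writeUInt32LE(0xfeff, 0); // BOM: bytes 255, 254, 0, 0
  codePoints.forEach((cp, i) => out.writeUInt32LE(cp, 4 + i * 4));
  return out;
}

const encodedText = 'f4d756c74696261736520697320617765736f6d6521205c6f2f';
const utf32 = encodeUtf32LeWithBom(encodedText);
const utf8 = Buffer.from(encodedText, 'utf8');

// Same characters, very different byte sequences.
console.log(utf32.length, utf8.length);
```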

@fluency03
Contributor Author

fluency03 commented Nov 16, 2018

According to this Protocol Description - How does the protocol work?:

multicodec is a self-describing multiformat, it wraps other formats with a tiny bit of self-description. A multicodec identifier may either be a varint (in a byte string) or a symbol (in a text string).

A chunk of data identified by multicodec will look like this:
<multicodec><encoded-data>
# To reduce the cognitive load, we sometimes might write the same line as:
<mc><data>

So, in this example, f4d756c74696261736520697320617765736f6d6521205c6f2f (following the format <multicodec><encoded-data>) describes itself with the leading <multicodec>, f, followed by the <encoded-data>, 4d756c74696261736520697320617765736f6d6521205c6f2f.

Because it starts with f, it describes itself as base16, so all of the following <encoded-data> should be treated as base16.

@Stebalien
Member

I added the "symbols" concept in #68 in an attempt to address this exact issue. I'm now proposing that we remove it in #90 because it's clear that it's still confusing.

Really, multibase is a multicodec (of sorts). However, our other multicodecs all show up in a binary context, while multibase shows up in a text context. This distinction, and why it matters, is confusing.

Nit: other multicodecs usually use <mc><length><value> where multibase is always <mc><value>.

@fluency03
Contributor Author

fluency03 commented Nov 16, 2018

"Nit: other multicodecs usually use <mc><length><value> where multibase is always <mc><value>."

I think this is also inaccurate.

  • other multicodecs usually use - How I understand this is:
    • multicodecs are just a set of codecs with names and codes.
    • the format you mentioned, <mc><length><value>, is only for multihash, not for all codecs in multicodec, right?
  • multibase is always <mc><value>

Therefore, the difference <mc><length><value> vs <mc><value> is

multihash vs multibase

instead of

multicodec vs multibase
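To show what the <mc><length><value> layout looks like in practice, here is a sketch of a multihash parser. It only illustrates the layout and does not settle which name the distinction belongs under. `readVarint` and `parseMultihash` are hypothetical helpers, and the sample digest is all zeros rather than a real hash; 0x12 is the multihash code for sha2-256.

```javascript
// Read an unsigned varint; returns the value and the number of bytes consumed.
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    value += (bytes[i] & 0x7f) * 2 ** shift;
    if ((bytes[i] & 0x80) === 0) {
      return { value, length: i - offset + 1 };
    }
    shift += 7;
  }
  throw new Error('truncated varint');
}

// Parse the layout <code><length><digest>; code and length are both varints.
function parseMultihash(bytes) {
  const code = readVarint(bytes, 0);
  const len = readVarint(bytes, code.length);
  const start = code.length + len.length;
  const digest = bytes.slice(start, start + len.value);
  if (digest.length !== len.value) {
    throw new Error('digest shorter than declared length');
  }
  return { code: code.value, length: len.value, digest };
}

// Hypothetical input: sha2-256 code, length 0x20 (32), then 32 zero bytes.
const sample = Uint8Array.from([0x12, 0x20, ...new Array(32).fill(0)]);
const parsed = parseMultihash(sample);
console.log(parsed.code.toString(16), parsed.length);
```

A multibase string has no such length field: everything after the one-character prefix is the value.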

@vmx vmx closed this as completed in #90 Dec 18, 2018
vmx pushed a commit that referenced this issue Dec 18, 2018
Resolution from a discussion with Juan and the discussion on the following
issues:

fixes #89
fixes #76
@ghost ghost removed the in progress label Dec 18, 2018