Merge pull request #135 from kevincox/patch-1

Create proposal for DAG encryption.
ipld · Oct 30, 2021 · f69101f
2 parents f059543 + 8266344
commit f69101f
Showing 1 changed file with 202 additions and 0 deletions.
diff --git a/notebook/exploration-reports/2021.09-dag-pb-encryption.md b/notebook/exploration-reports/2021.09-dag-pb-encryption.md
@@ -0,0 +1,202 @@
+# DAG-PB Encryption
+
+This is a proposed update to the DAG-PB spec that adds the possibility of transparent encryption. All discussion of the data model and constraints in the previous specification applies except where specifically changed by this specification. It aims to be a complete superset of the previous spec so that eventually DAG-PB can be retained only for reading old blocks. For now it is expected that DAG-PB will continue to be used for unencrypted data until support for DAG-PB-Encrypted is widespread.
+
+Warning: I am not a cryptographer but tried to stick to simple patterns that have been encouraged by real cryptographers. If you are a cryptographer I would love feedback. (If desired I can keep your name private.)
+
+## Objective
+
+Provide a method of encrypting content on IPFS that allows storing sensitive content with a reasonable guarantee that it cannot be read by others. Goals include:
+
+- Adding a single block to IPFS with never-before-seen contents should provide no information about the contents of that block.
+- Users should be able to interact with encrypted files and directories with a UX similar to that of unencrypted files and directories. The only exception is that whenever they provide a CID they will also need to provide a key.
+- Subtrees for encrypted directories should be usable without access to parent directories.
+
+## Summary
+
+This takes the simple approach of encrypting content block-by-block. References to blocks need to include the encryption key; these references store the keys encrypted, creating a tree of encrypted blocks. The major complication is that we allow multiple ways of deriving a key, which allows trading off privacy for deduplication.
+
+Note: This document does not discuss integrity, as integrity is already provided by IPFS. In effect this is an "encrypt then sign" system: the content is encrypted without any integrity protection, then the CID with its hash is calculated and verified by IPFS.
+
+### Advantages
+
+- Requires minimal changes to the core IPFS protocol. Only the clients that add or retrieve encrypted content need to know how or even that it is encrypted. This means that encrypted content could immediately be pinned and served by old nodes without any changes ([assuming that they don't reject the blocks](#backwards-compatibility-with-dag-pb)).
+
+- Allows block-level deduplication on par with non-encrypted content if desired. (However, encrypted and unencrypted versions of the same content will be stored separately.)
+- Allows transparent operation such that it could conceivably become the default some day.
+
+### Disadvantages
+
+Some structure about files and directory trees may be revealed.
+
+The chunker may reveal some structure of the underlying files based on where it makes the splits.
+
+- This can be fixed by using a fixed chunk size at the expense of some deduplication.
+
+Directory structures will be preserved, which may reveal some information about the directory structure (although it is obscured by inline leaves and directory sharding).
+
+- If revealing directory structure is undesirable then the directory should be archived into a single file before adding it to IPFS (for example, enclosed in a ZIP file).
+
+These disadvantages are considered acceptable as they only occur in certain cases. These cases should be protected via obvious controls so that naive users don't stumble into them by default. Furthermore for truly sensitive data the current strategy of encrypting first into a single file can continue to be used. Lastly the most sensitive of data should not be exposed to the internet no matter what encryption you use.
+
+## Configuration
+
+Each entry below gives a recommended value; the specific choices are not important to the protocol except for the noted requirements.
+
+SymmetricEncryption: An algorithm that provides [symmetric encryption](https://en.wikipedia.org/wiki/Symmetric-key_algorithm). The recommended algorithm is AES-256 in CBC mode.
+
+- Note: CTR (aka Counter) mode is technically sufficient here, but its primary benefit is that it can be parallelized. Since encryption is only applied to blocks with a relatively small maximum size, parallelizing within a block has minimal benefit; it would be better to encrypt multiple blocks in parallel.
+
+Hash: A [cryptographic hash function](https://en.wikipedia.org/wiki/Cryptographic_hash_function). The recommended algorithm is SHA-256.
+
+## Block Key Derivation
+
+Each block will be encrypted with a unique key. This ensures that "sharing" one block (or file) does not share other files in the same directory. There are three proposed methods for generating the key, depending on the privacy and deduplication tradeoff desired. Note that the choice of method is only relevant at creation time. After creation the block carries no indication of which technique was used (although it may be possible to deduce it), and readers have no need to know.
+
+### Random
+
+The Random key derivation method should be the default method used when a user indicates that they want encryption. It provides strong privacy but will allow no deduplication.
+
+In this method the required number of bits for the key are acquired from a cryptographically secure source of randomness for each block that needs to be encrypted.
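+
+As a minimal sketch of the Random method, assuming Python and a 32-byte key to match the recommended AES-256 (the names `KEY_BYTES` and `random_block_key` are hypothetical and not part of this proposal):
+
+```python
+import secrets
+
+# Assumption: 32 bytes (256 bits), matching the recommended AES-256.
+KEY_BYTES = 32
+
+
+def random_block_key() -> bytes:
+    """Return a fresh key drawn from the OS cryptographically secure RNG."""
+    return secrets.token_bytes(KEY_BYTES)
+
+
+# A new key is drawn for every block, so identical blocks get different keys.
+key_for_block_a = random_block_key()
+key_for_block_b = random_block_key()
+```
+
+Since every call returns independent random bytes, two blocks with identical contents encrypt to different ciphertexts, which is why this method gives no deduplication.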
+
+### Keyed
+
+The Keyed key derivation method is a compromise between privacy and deduplication. All content encrypted with the same "passphrase" will be deduplicated together, but not with other content on the network. This means that adding well-known content will not reveal the presence of that content to the network. However, if multiple things are encrypted with the same passphrase, including the same thing over time, then others will be able to learn which data has changed and which data is the same. This also opens up the option of clever cryptanalysis attacks based on which data is changing over time. Furthermore, sharing a subset of the content with someone does not reveal the passphrase, which means that they cannot decrypt other content or even identify the presence of known data encrypted with the same passphrase. This option should be used carefully and should come with a serious warning in the UX.
+
+In this method the block key will be generated as a prefix of the Hash of:
+
+1.  The length of the user-supplied key (the "passphrase" above) in [varint encoding](https://github.com/multiformats/unsigned-varint).
+2.  The user-supplied key itself.
+3.  The length of the data in varint encoding.
+4.  The data.
+
+Once computed the hash should be truncated to the required key length.
+
+Note: The lengths should not be required for a strong hash, but many hashes have been known to be prone to prefix and suffix attacks, so adding the lengths is an extra mitigation. Furthermore, adding the length of the key avoids possible collisions where a prefix of one key is used to encrypt data that starts with the remaining characters of another key.
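+
+The Keyed derivation could be sketched as follows. This is an illustration only, assuming SHA-256 as the Hash, a 32-byte block key, and the passphrase and data supplied as bytes; `encode_varint` and `derive_keyed` are hypothetical names, not part of this proposal.
+
+```python
+import hashlib
+
+
+def encode_varint(n: int) -> bytes:
+    """Encode a non-negative integer as an unsigned varint (7 bits per byte, low bits first)."""
+    if n < 0:
+        raise ValueError("varint values must be non-negative")
+    out = bytearray()
+    while True:
+        byte = n & 0x7F
+        n >>= 7
+        if n:
+            out.append(byte | 0x80)  # more bytes follow
+        else:
+            out.append(byte)  # final byte has the high bit clear
+            return bytes(out)
+
+
+def derive_keyed(passphrase: bytes, data: bytes, key_len: int = 32) -> bytes:
+    """Hash len(passphrase) || passphrase || len(data) || data, then truncate."""
+    h = hashlib.sha256()
+    h.update(encode_varint(len(passphrase)))
+    h.update(passphrase)
+    h.update(encode_varint(len(data)))
+    h.update(data)
+    return h.digest()[:key_len]
+
+
+# Without the length prefixes these two inputs would hash the same bytes ("abc").
+key_1 = derive_keyed(b"ab", b"c")
+key_2 = derive_keyed(b"a", b"bc")
+```
+
+With the length prefixes in place, `key_1` and `key_2` are derived from different byte strings, which illustrates the collision the note above is guarding against.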
+
+### Fixed
+
+This is for cases where it is acceptable to identify known content and where deduplication is valued. In this case everyone adding the content will get the exact same CID and key and will be able to deduplicate blocks. This mode is suitable to be made the default in the future (as it preserves the deduplication) but should probably not be strongly advertised as "encrypted" except where the caveats are pointed out.
+
+This method is identical to Keyed with an empty string as the key. Specifically the block key will be generated as a prefix of the Hash of:
+
+1.  A zero byte (the varint-encoded length of the empty key).
+2.  The length of the data in varint encoding.
+3.  The data.
+
+Note: This is technically the same as Keyed so it doesn't need to be described separately. However I chose to do so to highlight that deduplication can be preserved with encryption (at the cost of some privacy) and because implementations may choose to impose key quality restrictions for Keyed. In that case Fixed would bypass these checks.
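+
+To make the equivalence concrete, the sketch below builds the hash input for both methods and shows that Fixed is Keyed with an empty key. It assumes SHA-256 and a 32-byte block key; `encode_varint`, `keyed_preimage`, `fixed_preimage`, and `derive_fixed` are hypothetical names used only for illustration.
+
+```python
+import hashlib
+
+
+def encode_varint(n: int) -> bytes:
+    """Unsigned varint: 7 bits per byte, low bits first, high bit marks continuation."""
+    out = bytearray()
+    while True:
+        byte = n & 0x7F
+        n >>= 7
+        if n:
+            out.append(byte | 0x80)
+        else:
+            out.append(byte)
+            return bytes(out)
+
+
+def keyed_preimage(passphrase: bytes, data: bytes) -> bytes:
+    """Bytes fed to the Hash by the Keyed method."""
+    return encode_varint(len(passphrase)) + passphrase + encode_varint(len(data)) + data
+
+
+def fixed_preimage(data: bytes) -> bytes:
+    """Bytes fed to the Hash by the Fixed method: zero byte, data length, data."""
+    return b"\x00" + encode_varint(len(data)) + data
+
+
+def derive_fixed(data: bytes, key_len: int = 32) -> bytes:
+    """Anyone holding the same data derives the same block key."""
+    return hashlib.sha256(fixed_preimage(data)).digest()[:key_len]
+
+
+same_input = fixed_preimage(b"hello") == keyed_preimage(b"", b"hello")
+```
+
+Here `same_input` is true because the varint encoding of length zero is the single byte `0x00`.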
+
+## Encryption
+
+Whenever this document specifies that a value will be encrypted it shall be encrypted using [SymmetricEncryption](#configuration) with a [key derived using the above algorithm](#block-key-derivation) and an IV of 0.
+
+Note: An IV of 0 is acceptable in this scenario because we use unique keys for every string of data. Another approach would be to use extra bits from the derivation hash (or extra random bits) but that is unnecessary.
+
+## DAG Encryption
+
+An alternative to the DAG-PB format will be created that enables encryption. Note that we cannot just add fields to the existing message because [the spec forbids additional fields](https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-pb.md#protobuf-strictness). Otherwise this is mostly an addition to the existing fields. If that wording in the spec is updated we could preserve the old Name and Data fields, which would allow backwards compatibility for pinning.
+
+DAG-PB-ENC is the new DAG format that supports optional encryption.
+
+```proto
+message Link {
+  // binary CID (with no multibase prefix) of the target object
+  optional bytes Hash = 1;
+
+  // The Name field has been moved to the EncryptedLink.
+  reserved "Name";
+  reserved 2;
+
+  // The TSize field has been moved to the EncryptedNode as CumulativeSize.
+  reserved "TSize";
+  reserved 3;
+}
+
+// Information that is not required for pinning which may be encrypted.
+message EncryptedLink {
+  optional string Name = 2;
+  optional bytes key = 4;
+}
+
+message EncryptedNode {
+  repeated EncryptedLink Links = 2;
+
+  // Opaque user data
+  optional bytes Data = 1;
+
+  // The cumulative size of the target and all of its children.
+  optional uint64 CumulativeSize = 3;
+}
+
+message Node {
+  // refs to other objects
+  repeated Link Links = 2;
+
+  enum EncryptionMode {
+    None = 0;
+    AES256CBC = 1;
+  }
+  optional EncryptionMode Encryption = 3;
+
+  // A serialized EncryptedNode proto encrypted using the method specified by EncryptionMode.
+  optional bytes EncryptedNode = 4;
+
+  // The Data field has been moved to the EncryptedNode.
+  reserved "Data";
+  reserved 1;
+}
+```
+
+This proto will be used the same way as the [DAG-PB](https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-pb.md) proto with the following differences.
+
+- When Encryption != None and the encryption key is not available, only the Hash of each link is available. Implementations should handle this gracefully and cannot allow traversal by name.
+- While not required by the protocol itself, implementations MUST sort the `Links` elements by CID. This ensures that information about the order of links is not leaked.
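+
+The sorting requirement could look like the sketch below. It assumes that "sort by CID" means a lexicographic comparison of the binary CID bytes, and it keeps each link's name and key next to its CID so that they move together; how the plain and encrypted link lists are matched up is not specified in this document, so that pairing is an assumption. `LinkEntry` and `sort_links` are hypothetical names.
+
+```python
+from typing import List, NamedTuple, Optional
+
+
+class LinkEntry(NamedTuple):
+    cid: bytes  # binary CID, stored in the unencrypted Link.Hash
+    name: Optional[str]  # stored in the EncryptedLink
+    key: Optional[bytes]  # block key of the target, stored in the EncryptedLink
+
+
+def sort_links(entries: List[LinkEntry]) -> List[LinkEntry]:
+    """Order links by CID bytes so the original (e.g. by-name) order is not leaked."""
+    return sorted(entries, key=lambda entry: entry.cid)
+
+
+links = sort_links([
+    LinkEntry(b"\x02b", "zebra.txt", None),
+    LinkEntry(b"\x01z", "apple.txt", b"k"),
+    LinkEntry(b"\x01a", None, None),
+])
+```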
+
+## References
+
+### For Pinning
+
+If you want to pin your content without giving the pinner access to the data, you can pin the base CID as usual.
+
+### For Sharing
+
+A text format reference is formed by appending information about the encryption to the CID. This takes the form of a hyphen "-", followed by the multibase encoding of the key. The algorithm is not specified in the key but in the DAG node.
+
+#### Example
+
+QmVhiZNnvhqrbTLRsh6JnryJ5eTUwygTjTg5hUBKAfeP1H-z7Z9ZajGKvb6C6LaiB7fnsWQZNwq8roEKCFdt3Fbb1qt9
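+
+A sketch of building and parsing such a reference follows. It assumes the key is encoded with the multibase base58btc encoding (prefix `z`, as in the example above) and that the CID text itself contains no hyphen, so the reference can be split at the first hyphen. All function names are hypothetical, and the small base58 codec is included only to keep the example self-contained.
+
+```python
+from typing import Tuple
+
+B58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
+
+
+def b58encode(data: bytes) -> str:
+    """Base58btc encode; each leading zero byte becomes a leading '1'."""
+    n = int.from_bytes(data, "big")
+    out = ""
+    while n > 0:
+        n, rem = divmod(n, 58)
+        out = B58[rem] + out
+    pad = len(data) - len(data.lstrip(b"\x00"))
+    return "1" * pad + out
+
+
+def b58decode(text: str) -> bytes:
+    """Inverse of b58encode; raises ValueError on characters outside the alphabet."""
+    n = 0
+    for ch in text:
+        n = n * 58 + B58.index(ch)
+    pad = len(text) - len(text.lstrip("1"))
+    return b"\x00" * pad + n.to_bytes((n.bit_length() + 7) // 8, "big")
+
+
+def make_reference(cid: str, key: bytes) -> str:
+    """CID, a hyphen, then the multibase (base58btc, 'z') encoding of the key."""
+    return cid + "-" + "z" + b58encode(key)
+
+
+def parse_reference(ref: str) -> Tuple[str, bytes]:
+    """Split at the first hyphen and decode the key."""
+    cid, sep, encoded = ref.partition("-")
+    if not sep:
+        raise ValueError("reference has no key part")
+    if not encoded.startswith("z"):
+        raise ValueError("only base58btc ('z') keys are handled in this sketch")
+    return cid, b58decode(encoded[1:])
+
+
+cid, key = parse_reference(
+    "QmVhiZNnvhqrbTLRsh6JnryJ5eTUwygTjTg5hUBKAfeP1H-z7Z9ZajGKvb6C6LaiB7fnsWQZNwq8roEKCFdt3Fbb1qt9"
+)
+```
+
+Splitting at the first hyphen rather than the last matters if a key is ever encoded with a multibase alphabet that itself contains hyphens, such as base64url.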
+
+## Alternatives Considered
+
+### Backwards Compatibility with DAG-PB
+
+This would be possible to do by preserving the Name and Data fields.  These could be used when encryption is not used and the Name fields could be filled with dummy values when encryption is enabled.  Unfortunately [the spec forbids additional fields](https://github.com/ipld/specs/blob/master/block-layer/codecs/dag-pb.md#protobuf-strictness) so a breaking change is required.
+
+For forwards-compatibility the new proposal should ensure that additional fields are allowed.
+
+### Encrypt Links (aka full-block encryption)
+
+This is possible but breaks recursive pinning. Instead of recursive pinning all individual blocks would need to be sent to an untrusted pinner. This also makes systems such as an untrusted pinner following an IPNS publication impossible to implement without updating IPNS to publish a complete list of subblocks.
+
+### Encrypt On Top of IPFS
+
+Separating the encryption from IPFS has a number of downsides.
+
+- Worse UX. For example, gateways can't handle decryption and browsing.
+- Difficult to get good deduplication. Most encryption algorithms will change entire files on any change. Even algorithms (such as rsyncrypto) that aim to produce deduplicatable outputs resynchronize on their own schedule, which will not perfectly match IPFS blocks without careful integration. By integrating encryption with IPFS blocks we can achieve optimal deduplication.
+- Unable to share or pin subsets of an encrypted tree.
+
+By integrating the encryption into the IPFS data model all of these concerns can be addressed.
+
+### Fixed Top-Level Key
+
+The system as proposed generates a new encryption key every time it is used; it makes no provision for supplying a key to use. This could be added to this scheme but has been deferred for now for the following reasons.
+
+1. Requires using non-zero IVs.
+    1. It may be possible to use a hash of the block but leaking the hash of the contents isn't good. The IV could be treated as part of the key but that defeats the point of having a fixed key.
+    2. If random IVs are used the result is non-deterministic.
+2. It likely only makes sense to use this at "entry points" (to improve determinism) and use the proposed scheme for children. However this means that children won't have a fixed key for partial sharing.
+
+So for now this is not implemented. This could likely be implemented at the top level by adding a new EncryptionMode that uses a random IV and including the IV in the message. The second concern (partial sharing) would be addressed by not supporting subtree sharing with a fixed key (or by re-encrypting the new "entry" node with a fixed key).