IPIP: format for denylists for IPFS Nodes and Gateways #299
Thank you for submitting this @foreseaz.
As noted in #298 (comment), the IPFS project has a clear need to standardize allow/deny lists, and discussion around this IPIP is welcome.
I will be pinging some stakeholders for additional review, but my initial feedback/asks:
- To be aligned with the rest of the IPFS stack, we need an escape hatch for hash functions other than `sha2-256` (details inline).
- Make it clear which fields are optional (e.g. `description` and `status_code`) and what the implicit defaults are (empty `description` and `status_code` 410).
- Add a top-level optional `description` field where denylist maintainers can include additional context (purpose, policy, link to more details), which could be displayed in list management UIs, and perhaps on Gateway HTTP error pages?
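To make the second ask concrete, here is a minimal sketch of how a consumer might apply the implicit defaults when reading an entry. The field names `type`, `content`, `description` and `status_code` come from the draft; the `DenylistEntry` class and `parse_entry` helper are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class DenylistEntry:
    """One denylist entry; optional fields fall back to the implicit defaults."""
    type: str
    content: str
    description: str = ""   # optional, defaults to empty
    status_code: int = 410  # optional, defaults to 410 Gone


def parse_entry(raw: dict) -> DenylistEntry:
    """Build an entry from a decoded JSON object (hypothetical helper)."""
    return DenylistEntry(
        type=raw["type"],        # required
        content=raw["content"],  # required
        description=raw.get("description", ""),
        status_code=raw.get("status_code", 410),
    )


minimal = parse_entry({"type": "cid", "content": "bafy-example"})
```

With this shape, a list author can omit both optional fields and every consumer still arrives at the same `status_code`.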
| The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | ||
| Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. |
| Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency. | |
| Before the hashing, all CIDs in both `cid` field and `content_path` fields MUST be converted to CIDv1 in Base32 for consistency. |
I don't support base32. Whoever is hashing this is already dealing with IPFS-specific details, so I don't think there is value in using a text-based format.
Hashing raw binary CIDs is faster and uses less memory.
This is a non-forward-compatible change, please be careful.
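For reference, the two candidate hash inputs discussed here (raw binary CIDv1 versus its Base32 text form) can be sketched with the standard library alone. This is an illustration under assumptions, not the spec's algorithm: the helper names are hypothetical, and the conversion relies on CIDv0 being a Base58btc sha2-256 multihash that maps to a CIDv1 with the `dag-pb` codec (`0x70`).

```python
import base64

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text: str) -> bytes:
    """Decode a Base58btc string into bytes."""
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    # Each leading '1' encodes one leading zero byte.
    pad = len(text) - len(text.lstrip("1"))
    return b"\x00" * pad + body


def cidv0_to_cidv1_bytes(cid_v0: str) -> bytes:
    """Binary CIDv1: version 0x01 + dag-pb codec 0x70 + the original multihash."""
    multihash = b58decode(cid_v0)
    if len(multihash) != 34 or multihash[:2] != b"\x12\x20":
        raise ValueError("not a CIDv0 (expected a sha2-256 multihash)")
    return b"\x01\x70" + multihash


def cidv1_to_base32(cid_bytes: bytes) -> str:
    """Text CIDv1: multibase prefix 'b' + lowercase, unpadded Base32."""
    encoded = base64.b32encode(cid_bytes).decode("ascii")
    return "b" + encoded.lower().rstrip("=")


raw_cid = cidv0_to_cidv1_bytes("QmUNLLsPACCz1vLxQVkXqqLX5R1X345qqfHbsf67hvA3Nn")
text_cid = cidv1_to_base32(raw_cid)
```

Hashing `raw_cid` processes 36 bytes, while hashing `text_cid` processes 59 ASCII characters, which is the trade-off being debated in this thread.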
| ### Compatibility | ||
| No existing implementations yet. |
| No existing implementations yet. | |
| No existing implementations yet. | |
| JSON format is used to maximize interoperability. The intent is for IPFS implementations and services to standardize content filtering around this format for exchanging and storing allow and deny lists. |
| **Side notes on `hashed_cids` & `hashed_content_paths` types** | ||
| The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. |
Using sha256 is sensible, but hard-coding a specific hash function in the spec is against the spirit of Multiformats, which we aim to use in the IPFS stack.
Perhaps we could future-proof this at the low cost of prepending `1220` (the hex Multihash prefix for sha256) to each digest.
This will keep the digest string intact, but turn the field into a valid Multihash, allowing list creators to switch the hash function in the future. An alternative is to have hash function type in a separate field, but this seems less expensive.
| The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. | |
| The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is hashed (by default with `sha256()`) and stored as a [Multihash](https://docs.ipfs.io/concepts/glossary/#multihash) encoded as hex for easier interop with existing tools. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content. |
We could even make it a valid Multibase string by adding f (Base16) at the front, but not sure how useful alternatives to hex would be.
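A minimal sketch of the proposal, assuming the hash input is whatever byte string the spec finally settles on (the UTF-8 bytes of a content path are used here purely as a placeholder). The `hashed_entry` helper and its `multibase` flag are hypothetical names for illustration.

```python
import hashlib

# Multihash header for sha2-256: function code 0x12, digest length 0x20 (32 bytes).
SHA2_256_PREFIX = bytes([0x12, 0x20])


def hashed_entry(value: bytes, multibase: bool = False) -> str:
    """Return the sha2-256 Multihash of `value` as hex, optionally as Multibase."""
    digest = hashlib.sha256(value).digest()
    hex_multihash = (SHA2_256_PREFIX + digest).hex()
    # 'f' is the Multibase prefix for lowercase Base16.
    return "f" + hex_multihash if multibase else hex_multihash


# Placeholder input: the exact bytes to hash are still under discussion.
plain = hashed_entry(b"/ipns/bad-website.com")
prefixed = hashed_entry(b"/ipns/bad-website.com", multibase=True)
```

The existing digest string is unchanged after the first four hex characters, so current sha256-only lists could be migrated by prepending `1220` to each entry.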
I don't see why we wouldn't use the multibase prefix: it's cheap to compare, and it allows people to use a more compact base64 or base2048 in the future.
We don't need the full cid, so I do like @lidel's idea of just using the multihash.
Why add content paths? Everything could be done with CIDs: you just include the intermediary CIDs. It seems we will have to write code checking intermediary CIDs anyway, in case the path is reachable in multiple ways.
Content paths allow us to block a specific domain. So if /ipns/bad-website.com keeps updating its content, we'll continue to block requests to that domain.
| ```js= | ||
| { | ||
| action: "block", | ||
| entries: [ |
Another question raised during triage today: what happens when the block list grows to megabytes?
This is a real concern, as https://badbits.dwebops.pub/ alone is getting close to 900KiB, and history shows that even efficient pattern-matching things like adblock lists are multiple megabytes in size (example).
This spec should provide a way to represent and handle very large blocklists.
One option is an entry with `type: "import"` and either `cid` or `content_path` pointing at some other list. This provides a solution for sharding and maintaining big lists AND allows composing denylists using existing ones.
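As a rough sketch of how such imports could be resolved, the snippet below flattens a list and everything it imports. It assumes lists are already fetched into an in-memory `lists` mapping keyed by a placeholder identifier standing in for a real CID; the `flatten` helper and the sample data are hypothetical.

```python
def flatten(list_id, lists, seen=None):
    """Return all non-import entries reachable from `list_id`, following imports."""
    seen = set() if seen is None else seen
    if list_id in seen:
        return []  # already visited: guards against import cycles
    seen.add(list_id)
    entries = []
    for entry in lists[list_id]["entries"]:
        if entry["type"] == "import":
            entries.extend(flatten(entry["cid"], lists, seen))
        else:
            entries.append(entry)
    return entries


# Placeholder identifiers stand in for real CIDs; the two lists import each other.
lists = {
    "list-a": {"entries": [
        {"type": "cid", "content": "cid-1"},
        {"type": "import", "cid": "list-b"},
    ]},
    "list-b": {"entries": [
        {"type": "cid", "content": "cid-2"},
        {"type": "import", "cid": "list-a"},
    ]},
}
merged = flatten("list-a", lists)
```

Tracking visited lists matters because composed lists maintained by different parties can end up importing each other.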
For sharded lists, we may need to specify a prefix or something to indicate which list is for which shard
@mathew-cf do you want to spec a JSON-based HAMT?
now that you mention it @Jorropo i think composability is probably enough for now lol
Hey folks,
We have been operating an internal denylist that is synced with badbits in {nft/web3}.storage. We need to align well on this direction with content paths. We have found at least one limitation:
- If we block a CID with a content path and a user fetches the resource at that path via its own CID (present in the gateway's response ETag), the request won't be flagged unless we add a follow-up check against the denylist.
- The same applies the other way around.
A workaround for the first case (content path blocked, bypassed via CID) could be to resolve the content path and then block that CID as well.
For the second case (CID blocked, bypassed with content path), we're planning on using x-ipfs-roots to ensure that none of the resolved CIDs are blocked
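The second workaround could look roughly like the sketch below. It assumes the gateway exposes the resolved CIDs as a comma-separated `X-Ipfs-Roots` header value and that blocked content paths match by prefix on path segments; `is_blocked` and its parameters are hypothetical names.

```python
def is_blocked(content_path, ipfs_roots_header, blocked_paths, blocked_cids):
    """Check a request against both blocked content paths and blocked CIDs."""
    # Case 1: the requested path is a blocked path or sits underneath one.
    for blocked in blocked_paths:
        prefix = blocked.rstrip("/")
        if content_path == prefix or content_path.startswith(prefix + "/"):
            return True
    # Case 2: any CID resolved while walking the path is itself blocked.
    roots = [cid.strip() for cid in ipfs_roots_header.split(",") if cid.strip()]
    return any(cid in blocked_cids for cid in roots)


# Placeholder CIDs for illustration only.
by_path = is_blocked("/ipns/bad-website.com/page", "", {"/ipns/bad-website.com"}, set())
by_cid = is_blocked("/ipfs/root/dir/file", "root, dir, file", set(), {"file"})
```

Matching on whole path segments avoids blocking unrelated names that merely share a textual prefix with a blocked domain.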
I'll throw in my two cents:
| - `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency. | ||
| - `content_path`: the content path needs to be blocked). | ||
| - `description`: description of the CIDs or content paths. | ||
| - `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry. |
This should be authoritative.
I think anything other than 451, 410 or 200 has no place on a denylist. I don't want people to start using 3xx or whatever.
Secondly, I don't fully understand why we are giving out HTTP codes. Assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific.
I would like a machine-readable enum, containing things like DoS, Legal (maybe with a reason like Legal Copyright, Legal Hate Speech, Legal Other, ...), ...
Some gateway operators might prefer faking 500s or just flat out ignore certain reasons.
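One way to picture this suggestion: the list carries a reason, and each operator maps reasons to responses locally. The enum members and the policy mapping below are hypothetical and only mirror the examples given in this comment; nothing here is part of the draft.

```python
from enum import Enum
from typing import Optional


class BlockReason(Enum):
    """Hypothetical machine-readable reasons a list entry could carry."""
    DOS = "dos"
    LEGAL_COPYRIGHT = "legal-copyright"
    LEGAL_HATE_SPEECH = "legal-hate-speech"
    LEGAL_OTHER = "legal-other"


# Example operator policy: reason -> HTTP status. A missing reason means "ignore".
EXAMPLE_POLICY = {
    BlockReason.LEGAL_COPYRIGHT: 451,
    BlockReason.LEGAL_HATE_SPEECH: 451,
    BlockReason.LEGAL_OTHER: 451,
}


def status_for(reason: BlockReason, policy=EXAMPLE_POLICY) -> Optional[int]:
    """Return the status this operator serves, or None if it ignores the reason."""
    return policy.get(reason)


ignored = status_for(BlockReason.DOS)
legal = status_for(BlockReason.LEGAL_COPYRIGHT)
```

Keeping the status code out of the shared list leaves each operator free to return 451, fake a 500, or skip a category entirely.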
| - `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency. | ||
| - `content_path`: the content path needs to be blocked). | ||
| - `description`: description of the CIDs or content paths. | ||
| - `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry. |
A status code of 200 is weird: you already have the `action` field saying whether the content is blocked or allowed. What about just saying that the usage of `status_code` is undefined if the content is allowed?
This IPIP adds spec for (deny|allow)lists for IPFS Nodes and Gateways.
Supported rules:
- `cid`
- `content_path`
- `hashed_cid`
- `hashed_content_path`

Closes #298