IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340

thibmeu · 2022-11-10T07:59:06Z

Spec for the IPIP-298.

It adds a standard reference for content list definition, should these
be allow lists or deny lists. It supports CIDv1, sha256 hashed CIDv1,
and absolute and partial path matching.
It is aimed at offering a common schema for node operators to interact
with content lists.

It should be noted this IPIP was first written by @foreseaz, and this PR
is an iteration over their work in #299.

cc @Jorropo @lidel

@foreseaz

Spec for the IPIP-298. It adds a standard reference for content list definition, should these be allow lists or deny lists. It supports CIDv1, sha256 hashed CIDv1, and absolute and partial path matching. It is amied at offering a common schema for node operators to interact with content lists. It should be noted this IPIP was first written by @foreseaz, and this PR is an iteration over their work in #299.

dokterbob

Super super excited to move this forward. Navigating the space between moderation and censorship is a tricky one and having a well thought-out spec to allow for collaboration is essential to keeping the IPFS/IPLD/Filecoin space accessible to small parties and/or ideologically motivated initiatives.

I gave it a good read and left some comments/questions. Main observations are:

We might consider using CID's to do the double hashing (which is apparently standardised now), and gain various other benefits (e.g. CID upgrades, not worrying about codecs).
Formalising reasons on a per-item basis might increase transparency and granularity in terms of filtering.

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

dokterbob · 2022-11-10T10:08:08Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+>
+> The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
+>
+> Before the hashing, all CIDv0 in `CID_V1` field is converted to CIDv1 for consistency.


Note that the current proposal would have future CID's come in at 4-255 (e.g. CIDv2 could be 4). That could be a bit messy. Perhaps we could split the space 0-255 into various sections for semantic reasons?

Something like:

0-63: CIDv and possibly other content-addressed identifiers.

64-127: Hashed variants of aforementioned.

128-192: Various path matching algorithms.

193-155: 🤷🏼

Also: the encoding for SHA256 has not been specified, and the current mechanism doesn't allow for changing hashing algorithm later.

Is there any reason not to use CIDv1 (e.g. CID of CID) which deals with encoding and hashing algo out of the box?

As a bonus, we might use multicodec to describe what it is we're hashing (e.g. CIDv1 or something else entirely, think blocking pubkeys or subnets in the future).

In addition, as discussed, it might be good to be aware of/keep track of libp2p's double hashing efforts: ChainSafe/go-libp2p-kad-dht#1

And their implementation.

Notably, we might consider to salt our hashes to reduce rainbow table attacks, like in this comment: https://github.com/ChainSafe/go-libp2p-kad-dht/pull/1/files#r986978182

From there, I found that there's already DBL_SHA2_256 defined in go-multihash as well as in multihash.

Using multihash like this might even allow us to do away with the need for distinct types for CID versions or double hashed types.

multihash makes sense to be incencitive to CID

for the hashing, it's expected to be fixed. If one wants to introduce a new hash method, it can add a new type.

I feel like adding a separation of the number space at this time is an overkill. What other types do you want the implementation to support?

I'm super late to the party, and this is maybe fixed elsewhere, but I wanted to +1 @dokterbob's suggestion about salting the double hashes (and making sure that the salt doesn't match the one in the double-hashed DHT). Otherwise you're handing a treasure map to bad actors.

There's perhaps a broader discussion about the privacy properties that we need from such a system. These are all some form of private set intersection (PSI), where double hashing is the "naive" case — there are much more sophisticated techniques. (Fancier does not nessesarily mean better, but may be worth exploring)

dokterbob · 2022-11-10T10:58:21Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+
+```ipldsch
+type List struct {
+  Action Int # Suggested action to be taken


I feel that some kind of formalisation of the reasons for blocking (either on item or on list level) could be very useful. This would ease the efforts for node maintainers to implement sensible defaults (e.g. copyrighted content we do not serve nor store, but strong violent material does not even enter the DHT). It could also facilitate granularity, transparency and accountability, as the Bad Bits precedent suggests that various kind of blocked content risks ending up in just one list.

This all or nothing type of blocking has a severe creep from going to moderation towards censorship. I imagine a scenario where 1 or 2 parties provide lists which then the majority of node maintainers follow, applying the strongest possible blocking behaviour for everything just to be on the save side.

To bring this down to earth; for ipfs-search.com we want to tell people when content is filtered because of DCMA-complaints (e.g. chilling effects) but in cases of actually illegal content we want to leave it out altogether. Similarly, I can imagine that node maintainers might not want to publicly serve copyrighted content but they might be happy to have it in their DHT and perhaps even their store, whereas they might strongly reject even relaying information about somebody POV'ing a massacre.

mathew-cf · 2022-11-10T18:59:57Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+}
+
+type Entry struct {
+  Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1


using multihash instead of CID here could help prevent content blocks from being circumvented by changing the base or codec. Additionally, it could remove the need of explicitly specifying SHA256_CID_V1. Relates to this comment https://github.com/ipfs/specs/pull/340/files#r1018935560

Agree with the multihash.
I feel having the sha256_multihash still make sense, as the hash query is specific to how IPFS handles block query at the moment. There is no reason to tie the list specification to this, or add complexity downstream about the hashing method for the hash.

Jorropo · 2022-11-10T20:30:37Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+```ipldsch
+type List struct {
+  Action Int # Suggested action to be taken
+  StatusCode optional Int # HTTP status code to be returned by IPFS gateways by default


Same comment as in #299 (comment) this should be authoritatively limited.
I think anything not 451, 410 or 200 has no place on a badlist list. I don't want people start to use 3xx or whatever.

Secondly, I don't fully understand why giving out HTTP codes, assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific,
I would like a machine readable version enum, containing things like Dos, Legal (maybe with a reason like Legal Copyright, Legal Hatespeech, Legal other, ...), ...
Some gateway operators might prefer faking 500s or just flat out ignore certain reasons.

I would like a machine readable version enum, containing things like Dos, Legal (maybe with a reason like Legal Copyright, Legal Hatespeech, Legal other, ...), ...

❤️

Would remove the word "Legal" since not really a legal opinion or necessarily illegal depending on the jurisdiction.

I'll update StatusCode to Categories Int[].
Categories values would be subdivided in ranges. 1xxx for allow-like, 2xxx for block-like.
Specifying the categories is going to be similar to Actions. Some codes being assigned, some being for internal use, and some for future use.

If you have a list of categories you think should be included, you can list them in the comments.

hsanjuan · 2022-11-18T14:33:38Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+}
+
+type Entry struct {
+  Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1


There should be a type that has has exactly the same meaning as the current badbits anchor.

could you add the type, as well as its specification in the comment?

hsanjuan · 2022-11-18T14:41:17Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+    {
+      Type: 3, // Type.SHA256_CID_V1
+      Content: "9056e0f9948c942c16af3564af56d4bb96b6203ad9ccd3425ec628bcd843cc39",
+      Description: "sensitive cid that needs to be blocked"
+    }
+  ]


Note that it is very difficult to append anything to this list. You cannot have one process writing to it and another one reading from it at the same time. You need to open, read, deserialize, serialize, write, close.

That means that if you have an external process generating items to be blocked, every update needs to edit the file and then trigger the IPFS implementation to re-read it (possibly in full). Or the IPFS implementation needs to provide API to make the updates itself (and then write the file (again, possibly in full).

I have the feeling this format is not well suited to very large lists, or those updated frequently.

Would defining a max list size as part of this spec help?
Together with Link field, it allows for composing big lists from smaller ones, and updates would only change part of a bigger DAG.

That doesn't improve anything if instead of 1 item of 100GB we have 100 items of GB, somehow linked.

I also don't think the list format should include a Link field. Links to multiple lists etc should be a meta-format above that can be implemented later if we need some glue to handle multi-subscriptions (right now not needed).

I personally think lists should be line-based files, as small as possible, editable by hand for simple things like appending a CID. See lists used by UBlock etc.

Inefficiencies of CBOR lists vs prior art from adblockers

Ack. A compact, efficient, line-based syntax similar to static filter lists used by adblockers (example) will be more efficient and easier to consume than having CBOR.

Main problem: duplicated "Type" "Content" and "Description" strings for every entry – having these strings over and over again in CBOR gets very wasteful very fast if the list is big.

Manifest vs Compact list

So.. this PR feels like the meta format for aggregating lists you describe. In the other comment @hsanjuan suggested adding a type for already existing double hashed format from https://badbits.dwebops.pub, we could do the same here and specify a separate type for lists in the "compact syntax".

I think we should decouple and work on these things in parallel (meta-manifest and the compact list formats).
Will make discussion more productive, and allow us to ship things faster (e.g., in Kubo, we need compact format sooner than the meta-manifest one: users should be able to self-service, will take time).

@thibmeu thoughts on limiting traversable DAG-CBOR/JSON to DAG-like meta-lists, and having a compact syntax for "leaves"? Or supporting lists in both, and seeing which approach gets more traction?

Next steps

@hsanjuan i'd like to explore your proposal – would you have time to propose a syntax for compact list?
(gist / hackmd is fine – I can turn it into a separate IPIP if you have limited bandwidth – this way we could flesh it out in separate IPIP PR without blocking this one, and even reference that IPIP from here, if needed)

Entries structure could be transformed to a CSV. That's also the reason why multihash is serialised to a string via multibase, so it's easier to export.

type,content,description,categories 1,/path/to/content,questionable content,1001 1002 1005 ...

I agree this would be easier to update. Metadata around it could be passed as a frontmatter possibly

action: 1 categories: 1000 1001 description: an example content list --- type,content,description,categories 1,/path/to/content,questionable content,1001 1002 1005

The front matter still allows for compostability with CBOR, but makes the list self contained in a text format if need be.

I would not worry about the existing format for bad bits list. It's a single list, and I would avoid requiring the spec and implementation to be constrained by one historical list instance (similar to Cloudflare safemode in a way).

I have no need for CBOR specifically, it's just a format supported by IPFS and to which an IPLD schema can serialise to.

i'd like to explore your proposal – would you have time to propose a syntax for compact list?

yes, next week at some point? I have ideas but I also need to explain them (because they link to other aspects that are not related to the list itself), which takes more time. Format would be in the direction @thibmeu proposes.

thibmeu

I added replies to comments. New feedback still appreciated.
I'll update the document on next week, with the change of StatusCode to Categories.

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

thibmeu · 2022-11-18T15:19:04Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+```ipldsch
+type List struct {
+  Action Int # Suggested action to be taken
+  StatusCode optional Int # HTTP status code to be returned by IPFS gateways by default


I'll update StatusCode to Categories Int[].
Categories values would be subdivided in ranges. 1xxx for allow-like, 2xxx for block-like.
Specifying the categories is going to be similar to Actions. Some codes being assigned, some being for internal use, and some for future use.

If you have a list of categories you think should be included, you can list them in the comments.

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

thibmeu · 2022-11-18T15:22:19Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+}
+
+type Entry struct {
+  Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1


Agree with the multihash.
I feel having the sha256_multihash still make sense, as the hash query is specific to how IPFS handles block query at the moment. There is no reason to tie the list specification to this, or add complexity downstream about the hashing method for the hash.

thibmeu · 2022-11-18T15:23:03Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+}
+
+type Entry struct {
+  Type Int # Type of the content string. Refer to CID_V1, ABSOLUTE_PATH, PARTIAL_PATH, SHA256_CID_V1


could you add the type, as well as its specification in the comment?

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

thibmeu · 2022-11-18T15:26:08Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+>
+> The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
+>
+> Before the hashing, all CIDv0 in `CID_V1` field is converted to CIDv1 for consistency.


multihash makes sense to be incencitive to CID

for the hashing, it's expected to be fixed. If one wants to introduce a new hash method, it can add a new type.

I feel like adding a separation of the number space at this time is an overkill. What other types do you want the implementation to support?

Updates structure of the content list: * Use `Categories` instead of `StatusCode`. This allows for a granular definition of content, without enforcing HTTP actions. * Move `Entries` to a linked list by default, named `[EntryChunk]`. This allows to keep the inlining that was mentioned before, but does not add overhead for chaining multiple entries if needed. Content type has been updated to mention multihash instead of CIDv1. To keep encoding simple, the multihash is encoded using multibase. Minor edits: * Remove all references to status code as a possible response. * Reference multicodec dbl-sha2-256. Reviewers should note that the `Categories` definition is left empty, and is to be updated in a future commit.

lidel · 2022-11-29T18:13:14Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+### User
+
+User is an actor that can make request to the IPFS Node. It's considered authenticated already.
+Access to the node can be made via an HTTP gateway, Bitswap, or another transport method exposed by the IPFS Node.


nit: covering more ground

Suggested change

Access to the node can be made via an HTTP gateway, Bitswap, or another transport method exposed by the IPFS Node.

Access to the node can be made via an HTTP gateway, command-line, RPC API, Bitswap, or another transport method exposed by the IPFS Node.

lidel · 2022-11-29T18:19:47Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+
+type EntryChunk struct {
+  Entries [Entry] # List of Entries objects
+  Next optional Link # CID for the next EntryChunk object. Entries referenced this way are not inlined, and possibly require an additional request for discovery.


Having a single Link makes it:

impossible to compose a list from two preexisting ones (e.g., having "anti-phishing" meta-list composed of multiple third-part lists provided by gateways that received reports and/or run active detection)

expensive to maintain a big list (instead of having a DAG, and only recomputing changed branch, one needs to change all lists along the single chain).

Would it be ok to change this field into array of links?

Suggested change

Next optional Link # CID for the next EntryChunk object. Entries referenced this way are not inlined, and possibly require an additional request for discovery.

Next optional [Link] # List of CIDs for the next EntryChunk objects. Entries referenced this way are not inlined, and possibly require an additional request for discovery.

What about having multiple Entries [EntryChunk] instead of multiple Next [Link]? I feel it makes it:

Easier to read when lists are composed: it's simply a new EntryChunk.

Does not modify a specific EntryChunk upon the addition of a new list.

On the topic of having a max size (I'm unable to answer inline to your comment), how do you envision its usage?
I feel this is something that can be done at consumption time, and I'm not sure how it helps to have a max size for a series of linked lists.

Yes, [EntryChunk] also works.

Remaining problem around Link is that there is no place to provide description for it– If i am aggregating third-party lists into a single meta-lists, would be nice to note why each link was included. Maybe add Description optional String to EntryChunk ?

re: max size, i only meant max size per document, encouraging people to "shard" at the same cut-off and making it easier for implementers to fail fast when consuming a single document. (total across lists is a separate problem).

lidel · 2022-11-29T18:37:00Z

IPIP/0298-denylists-for-ipfs-nodes-and-gateways.md

+    {
+      Type: 3, // Type.SHA256_CID_V1
+      Content: "9056e0f9948c942c16af3564af56d4bb96b6203ad9ccd3425ec628bcd843cc39",
+      Description: "sensitive cid that needs to be blocked"
+    }
+  ]


Would defining a max list size as part of this spec help?
Together with Link field, it allows for composing big lists from smaller ones, and updates would only change part of a bigger DAG.

BigLep · 2023-03-30T17:38:55Z

@thibmeu @mathew-cf and others: I wanted to make sure you saw the expansion of this topic that is occurring in #383 . Your feedback on whether that is meets Cloudflare's needs is welcome.

foreseaz and others added 3 commits November 10, 2022 08:49

ipip: add IPIP0002 Denylists for IPFS Nodes and Gateways

05ecec1

ipip(denylists): why JSON

8d8bd0c

thibmeu requested a review from a team as a code owner November 10, 2022 07:59

thibmeu mentioned this pull request Nov 10, 2022

IPIP: format for denylists for IPFS Nodes and Gateways #299

Closed

dokterbob reviewed Nov 10, 2022

View reviewed changes

lidel mentioned this pull request Nov 10, 2022

Revert "feat: add ipfs.pinksheep.whizzzkid.dev" ipfs/public-gateway-checker#331

Merged

mathew-cf reviewed Nov 10, 2022

View reviewed changes

Jorropo reviewed Nov 10, 2022

View reviewed changes

hsanjuan reviewed Nov 18, 2022

View reviewed changes

thibmeu commented Nov 18, 2022

View reviewed changes

lidel reviewed Nov 29, 2022

View reviewed changes

BigLep mentioned this pull request Jan 9, 2023

IPFS filtering to allow node operators to decide on content they are willing to serve ipfs/kubo#8492

Closed

3 tasks

scott-dn mentioned this pull request Jan 17, 2023

chore: gateway and cache node bad bits fleek-network/ursa#295

Open

3 tasks

BigLep mentioned this pull request Mar 16, 2023

IPIP-0383: Compact Denylist Format #383

Merged

whizzzkid mentioned this pull request Mar 24, 2023

feat: automate removal of domains called out as dangerous ipfs/public-gateway-checker#395

Open

whizzzkid mentioned this pull request May 1, 2023

Update gateways.json ipfs/public-gateway-checker#397

Merged

thibmeu closed this by deleting the head repository Aug 31, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340

IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340

thibmeu commented Nov 10, 2022

dokterbob left a comment

dokterbob Nov 10, 2022

dokterbob Nov 10, 2022

dokterbob Nov 10, 2022

thibmeu Nov 18, 2022

expede Jul 8, 2023

dokterbob Nov 10, 2022

mathew-cf Nov 10, 2022

thibmeu Nov 18, 2022

Jorropo Nov 10, 2022

dokterbob Nov 17, 2022

LaurenSpiegel Nov 17, 2022

thibmeu Nov 18, 2022

hsanjuan Nov 18, 2022

thibmeu Nov 18, 2022

hsanjuan Nov 18, 2022

lidel Nov 29, 2022

hsanjuan Nov 30, 2022

lidel Dec 1, 2022 •

edited

Loading

thibmeu Dec 1, 2022

hsanjuan Dec 1, 2022

thibmeu left a comment

thibmeu Nov 18, 2022

thibmeu Nov 18, 2022

thibmeu Nov 18, 2022

thibmeu Nov 18, 2022

lidel Nov 29, 2022

lidel Nov 29, 2022 •

edited

Loading

thibmeu Nov 30, 2022

thibmeu Nov 30, 2022

lidel Dec 1, 2022

lidel Nov 29, 2022

BigLep commented Mar 30, 2023

	Access to the node can be made via an HTTP gateway, Bitswap, or another transport method exposed by the IPFS Node.
	Access to the node can be made via an HTTP gateway, command-line, RPC API, Bitswap, or another transport method exposed by the IPFS Node.

	Next optional Link # CID for the next EntryChunk object. Entries referenced this way are not inlined, and possibly require an additional request for discovery.
	Next optional [Link] # List of CIDs for the next EntryChunk objects. Entries referenced this way are not inlined, and possibly require an additional request for discovery.

IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340

IPIP 298: (allow|deny)lists for IPFS Nodes and Gateways #340

Conversation

thibmeu commented Nov 10, 2022

dokterbob left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lidel Dec 1, 2022 • edited Loading

Choose a reason for hiding this comment

Inefficiencies of CBOR lists vs prior art from adblockers

Manifest vs Compact list

Next steps

Choose a reason for hiding this comment

Choose a reason for hiding this comment

thibmeu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lidel Nov 29, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BigLep commented Mar 30, 2023

lidel Dec 1, 2022 •

edited

Loading

lidel Nov 29, 2022 •

edited

Loading