Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP: format for denylists for IPFS Nodes and Gateways #299

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

foreseaz
Copy link

@foreseaz foreseaz commented Jul 15, 2022

This IPIP adds spec for (deny|allow)lists for IPFS Nodes and Gateways.

Supported rules:

  • plain
    • cid
    • content_path
  • hashed (for trully bad bits)
    • hashed_cid
    • hashed_content_path

Closes #298

@lidel lidel changed the title ipip: add IPIP0002 Denylists for IPFS Nodes and Gateways IPIP: denylists for IPFS Nodes and Gateways Jul 21, 2022
Copy link
Member

@lidel lidel left a comment

Thank you for submitting this @foreseaz.

As noted in #298 (comment) IPFS project has a clear need for standardizing allow/deny lists, and discussion around this IPIP is welcome.

I will be pinging some stakeholders for additional review, but my initial feedback/asks:

  • To be aligned with the rest of IPFS stack, we need an escape hatch for other hash functions than sha2-256 – details inline
  • Make it clear which fields are optional (e.g. description and status_code) and what is the implicit default (empty description and status_code 410)
  • Add top level optional description field where denylist maintainers can include additional context (purpose, policy, link to more details), which could be displayed in list management UIs, and perhaps on Gateway HTTP error pages?


The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Before the hashing, all CIDs in both `cid` field and `content_path` fields MUST be converted to CIDv1 in Base32 for consistency.

Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.

This is a non forward compatible change, pls be carefull.


### Compatibility

No existing implementations yet.
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
No existing implementations yet.
No existing implementations yet.
JSON format is used to maximize interoperability. The intent is for IPFS implementations and services to standardize content filtering around this format for exchanging and storing allow and deny lists.


**Side notes on `hashed_cids` & `hashed_content_paths` types**

The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using sha256 is sensible, but hard-coding a specific hash function in the spec is against the spirit Multiformats, which we aim to use in IPFS stack.

Perhaps we could future-proof this at the low cost of adding 1220 (hex for Multihash prefix for sha256).

This will keep the digest string intact, but turn the field into a valid Multihash, allowing list creators to switch the hash function in the future. An alternative is to have hash function type in a separate field, but this seems less expensive.

Suggested change
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is hashed (by default with `sha256()`) and stored as a [Multihash](https://docs.ipfs.io/concepts/glossary/#multihash) encoded as hex for easier interop with existing tools. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

We could even make it a valid Multibase string by adding f (Base16) at the front, but not sure how useful alternatives to hex would be.

Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.

We don't need the full cid, so I do like @lidel's idea of just using the multihash.

@lidel lidel changed the title IPIP: denylists for IPFS Nodes and Gateways IPIP: format for denylists for IPFS Nodes and Gateways Jul 22, 2022
@Jorropo
Copy link
Contributor

Jorropo commented Jul 22, 2022

Why adding content path ? Everything could be done with cids, you just include intermediaries cids.

It seems we will have to write code checking intermediaries cids anyway in case the path is reachable multiple ways.

@lidel lidel requested a review from a team Jul 22, 2022
@mathew-cf
Copy link

mathew-cf commented Jul 22, 2022

Why adding content path ? Everything could be done with cids, you just include intermediaries cids.

It seems we will have to write code checking intermediaries cids anyway in case the path is reachable multiple ways.

Content paths allow us to block a specific domain. So if /ipns/bad-website.com keeps updating their content, we'll continue to block requests to that domain.

```js=
{
action: "block",
entries: [
Copy link
Member

@lidel lidel Jul 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question raised during triage today: what happens when the block list grows to megabytes?
This is a real concern, as https://badbits.dwebops.pub/ alone is getting close to 900KiB, and history shows that even efficient pattern-matching things like adblock lists are multiple megabytes in size (example).

This spec should provide a way for representing handle big, big blocklists.

💡 One idea is to introduce a special entry with type: "import" and either cid or content_path pointing at some other list. This provides a solution for sharding and maintaining big lists AND allows composing denylists using existing ones.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sharded lists, we may need to specify a prefix or something to indicate which list is for which shard

Copy link
Contributor

@Jorropo Jorropo Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mathew-cf you want to spec a JSON based HAMT ? 🙂

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that you mention it @Jorropo i think composability is probably enough for now lol

Copy link
Member

@vasco-santos vasco-santos Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey folks,
We have been operating an internal denylist that is synced with badbits in {nft/web3}.storage. We need to align well on this direction with content path. There is at least a limitation around this that we have found:

  • if we block a CID with content path, if user just tries to fetch the resource in the path via its own CID (present in response etag from gateway), it won't be flagged or we will need a follow up check in the denylist
  • same as before, but on the other way around

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A workaround for the first case (content path blocked, bypassed via CID) could be to resolve the content path and then block that CID as well.
For the second case (CID blocked, bypassed with content path), we're planning on using x-ipfs-roots to ensure that none of the resolved CIDs are blocked

@hsanjuan
Copy link
Contributor

hsanjuan commented Jul 31, 2022

I'll throw my two cents:

  • We don't need ONE list-format spec. If anything we need several.
  • This one is particularly bad for processing (parsing an object, with a huge array inside, with objects inside), when all I need are CIDs or paths. Go cannot read each CID in this file without parsing it all into memory for example.
  • Whatever the list format, it is probably going to be quite easy to parse anyways. The question is what is the best format for this? But for a start let's just stick to the badbits format which is no good, but at least we don't have multiple formats that suck in the same ways (like being JSON objects).
  • Where is the badbits format documented? I always forget that.
  • We need a discussion on what formats are good and for what:
    • Easy to read and edit and append: a text file with 1 cid per line, or one path per line, or one double-hash per line (plus modifiers)
    • Fast to move around (gzipped stuff)
    • Easy to share over IPFS (an IPLD linked lists of things)
    • Store-the-hash database files
    • CAR files.
    • What about multiple files?
    • How does ublock &co does it?
    • What about blocking by regexp'ed path?
  • Another discussion is what to exactly to doublehash:
    • Cid + path (like badbits)
    • Resolved paths?
    • Multihash?
  • Block lists should be have as little information as possible.
  • Any double hashing should happen using the fastest hash function available (and no need to use very long hashes here)
  • I think everything should be double hashed. Why not.
  • Details, response codes, descriptions, original hashes etc etc do not belong in the denylists. If you need that keep it out of the denylist, which is derived from it. If anything, put is as comments and not as part of the schema, so that the parser can ignore it.

- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency.
- `content_path`: the content path needs to be blocked).
- `description`: description of the CIDs or content paths.
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry.
Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be authoritative.
I think anything not 451, 410 or 200 has no place on a badlist list. I don't want people start to use 3xx or whatever.

Secondly, I don't fully understand why giving out HTTP codes, assuming the goal is to join forces and let gateway operators share ban reasons, this is really unspecific,
I would like a machine readable version enum, containing things like Dos, Legal (maybe with a reason like Legal Copyright, Legal Hatespeech, Legal other, ...), ...
Some gateway operators might prefer faking 500s or just flat out ignore certain reasons.


The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.

Before the hashing, all CIDv0 in both `cid` field and `content_path` fields are converted to CIDv1 for consistency.
Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't support base32, whoever is hashing this is already IPFS specific details so I don't think there is value in using a text based format.
Hashing raw binary CIDs is faster and use less memory.

This is a non forward compatible change, pls be carefull.

- `content`: stores the content that should be blocked according to the type. It's suggested that all CIDv0 needs to be converted into CIDv1 to keep the consistency.
- `content_path`: the content path needs to be blocked).
- `description`: description of the CIDs or content paths.
- `status_code`: status code to be responded for the blocked content. E.g. [410 Gone](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#410-gone); [451 Unavailable For Legal Reasons](https://github.com/ipfs/specs/blob/main/http-gateways/PATH_GATEWAY.md#451-unavailable-for-legal-reasons) or `200 OK` for allowed entry.
Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A status code of 200 is weird, you already have the action field saying if it's blocked are allowed, what about just saying the usage of status_code is undefined if the content is allowed ?


**Side notes on `hashed_cids` & `hashed_content_paths` types**

The main difference between non-hashed entries and hashed ones is that the CIDs or content paths in the entry will be hashed and no plaintext is shown in the list. Following the [bad bits](https://badbits.dwebops.pub/), each CID or content path is `sha256()` hashed, so it's easy to determine one way but not the other. The hashed entries are designed to store sensitive blocking items and prevent creating an easily accessible list of sensitive content.
Copy link
Contributor

@Jorropo Jorropo Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why not using the multibase prefix, it's cheap to compare, and allows people to use a more compact base64 or base2048 in the future.

We don't need the full cid, so I do like @lidel's idea of just using the multihash.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🔎 In Review
Development

Successfully merging this pull request may close these issues.

6 participants