Data added via ipfs add --to-files=/example --pin=false gets garbage collected. #9553

Open
3 tasks done
kevincox opened this issue Jan 15, 2023 · 10 comments
Labels
kind/bug A bug in existing code (including security flaws) need/maintainer-input Needs input from the current maintainer(s) P2 Medium: Good to have, but can wait until someone steps up

Comments

@kevincox

Checklist

Installation method

third-party binary

Version

Kubo version: 0.17.0
Repo version: 12
System version: amd64/linux
Golang version: go1.19.4

Config

{
  "API": {
    "HTTPHeaders": {
      "Access-Control-Allow-Origin": [
        "https://ipfs.example"
      ]
    }
  },
  "Addresses": {
    "API": "/unix/run/ipfs-api.sock",
    "Announce": [],
    "AppendAnnounce": null,
    "Gateway": "/unix/run/ipfs-gateway.sock",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip6/::/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip6/::/udp/4001/quic"
    ]
  },
  "AutoNAT": {},
  "Bootstrap": [
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt"
  ],
  "DNS": {
    "Resolvers": null
  },
  "Datastore": {
    "BloomFilterSize": 0,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/2",
            "sync": true,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "10GB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false,
      "Interval": 10
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": false,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": false,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "REDACTED"
  },
  "Internal": {},
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": null,
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": null
  },
  "Pinning": {
    "RemoteServices": {
      // REDACTED
    }
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Methods": null,
    "Routers": null,
    "Type": "dhtclient"
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 900,
      "LowWater": 600,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": false,
    "DisableNatPortMap": true,
    "RelayClient": {},
    "RelayService": {},
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

Problem

  1. Add content to your node with ipfs add --to-files=/example --pin=false /some/new/data
  2. Run a GC or wait a while.
  3. Run ipfs dag stat --offline $(ipfs --offline files stat --hash /example)

Expected result: files in MFS are expected to be kept on the node, so this command should succeed.

Actual result: "Error: error traversing DAG: block was not found locally (offline): ipld: could not find QmatWhxBmsvpPCAuzryVeQygiWntsi4bZHm1Gy7PnakPEr"
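
Consolidated into a single shell sequence (an explicit ipfs repo gc stands in for "wait a while"):

ipfs add --to-files=/example --pin=false /some/new/data
ipfs repo gc
ipfs dag stat --offline $(ipfs --offline files stat --hash /example)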

Workaround

As a workaround you can create and then remove a pin. This largely defeats the point of --to-files, but it avoids the issue.

Caveats:

  • If you already have this content pinned it will become unpinned.
  • If your script/process/whatever crashes before removing the pin it will leak.
ipfs add --to-files=/example --pin=true /some/new/data
ipfs pin rm $(ipfs --offline files stat --hash /example)

Even though this leaves you in the same end state from the user's perspective (save for the caveats mentioned above), it does seem to fix the internal state such that the MFS "implicit pinning" keeps the content from being collected.

Context

cc @lidel who implemented this feature.

Other questions

Do you need to pass --pin=false when using --to-files? The ipfs help add docs seem to say that pinning is still enabled by default. However, when adding to MFS you shouldn't need the pin: with the implicit MFS reference, --to-files appears to be just a shortcut for ipfs files cp /ipfs/$(ipfs add /some/new/data) /example. Maybe the docs could be clarified here.
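
For illustration, the manual sequence that --to-files appears to shortcut would look roughly like this (-Q makes add print only the final CID; the paths are the placeholders from above):

cid=$(ipfs add -Q --pin=false /some/new/data)
ipfs files cp /ipfs/$cid /example

Whether the MFS reference alone is enough to protect those blocks from GC is, of course, exactly what this issue is about.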

@kevincox kevincox added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Jan 15, 2023
@lidel
Member

lidel commented Jan 16, 2023

Unable to reproduce, user data is not lost

Thank you for filing this @kevincox,
I tried to follow the repro you provided, but was unable to reproduce it in 0.18.0-rc2 and 0.17.0.

  1. Fresh repository (ipfs init)
  2. Run ipfs repo gc to ensure there are no ephemeral blocks around.
  3. ipfs add --to-files=/guardian.jpg --pin=false ~/guardian.jpg
  4. ipfs repo gc (you will see ephemeral blocks being gc'd)
  5. but ipfs files stat --with-local /guardian.jpg shows that all blocks behind the imported data (in my case a JPG, QmbWqxBEKC3P8tqsKc98xmWNzrzDtRLMiMPL8wBuTGsMnR) are still in the local datastore: Local: 120 kB of 120 kB (100.00%)
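
The same steps as a one-shot script against a throwaway repo (the mktemp-based IPFS_PATH is only there to keep it isolated; ~/guardian.jpg is the test file from above):

export IPFS_PATH=$(mktemp -d)
ipfs init
ipfs repo gc
ipfs add --to-files=/guardian.jpg --pin=false ~/guardian.jpg
ipfs repo gc
ipfs files stat --with-local /guardian.jpg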

But there is something unexpected!

I suspect you've also seen some CIDs getting garbage-collected during step (4), ipfs repo gc; I saw that too. There was bafkreiakaltuk5nrwxeonlmztefw3qcjj7s6mxlilyvzfedp7zh6q5fc2a which got GC'd.

Investigating the ephemeral block that got GC'd

When represented as CIDv0 (or CIDv1 with dag-pb):

$ ipfs cid format -v 0 --mc dag-pb bafkreiakaltuk5nrwxeonlmztefw3qcjj7s6mxlilyvzfedp7zh6q5fc2a
QmP1jMW7thNHXB2hiUPmKUqa8mJvBYofuVTd5NVQzsH4e3

QmP1jMW7thNHXB2hiUPmKUqa8mJvBYofuVTd5NVQzsH4e3.car.zip

Exploring the CID, we see it is a wrapper directory that included the imported file, but is no longer used.
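
For anyone following along, one way to inspect that block yourself (a sketch; it assumes the attached archive unzips to a CAR of the same name):

unzip QmP1jMW7thNHXB2hiUPmKUqa8mJvBYofuVTd5NVQzsH4e3.car.zip
ipfs dag import QmP1jMW7thNHXB2hiUPmKUqa8mJvBYofuVTd5NVQzsH4e3.car
# dag-json view of the block; its Links array should show the imported
# file as the wrapper directory's only entry.
ipfs dag get QmP1jMW7thNHXB2hiUPmKUqa8mJvBYofuVTd5NVQzsH4e3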

👉 So it seems there is no user data loss, but I think you've found that --to-files sometimes produces intermediate, ephemeral blocks that are no longer useful once the reference has been copied to MFS.

@kevincox are you able to check and confirm this is what you are seeing?

@lidel lidel added the need/author-input Needs input from the original author label Jan 16, 2023
@lidel lidel self-assigned this Jan 16, 2023
@kevincox
Author

kevincox commented Jan 16, 2023

That is not the case. What is definitely happening is that the directory added to MFS is no longer complete, so user data is lost. Unfortunately my previous reproductions were the "wait a while" case. In my case it was a directory roughly 75 MiB in size.

I'll see if I can reproduce with an explicit GC.

@kevincox
Author

I can't seem to reproduce with an explicit GC.

I'll share more of the context in case it helps. I have a backup script that snapshots a directory to IPFS. It is basically this.

id=$(date -u '+%FT%TZ')

# Add
ipfs --offline add --to-files=/backups/something/$id --pin=false -rHQ -s rabin-2048-65536-131072 -- /some/directory

# Update latest pointer.
ipfs --offline files rm -r /backups/something/latest || true
ipfs --offline files cp /backups/something/$id /backups/something/latest

# Maybe cleanup old backups. Actual logic is smarter and **does not** always delete the oldest first.
ipfs --offline files rm -r -- $(ipfs files ls /backups/something | head -n1)

# Calculate total space usage. THIS STEP FINDS MISSING BLOCKS
ipfs --offline files stat  /backups/something

# Pin remotely.
ipfs pin remote add ...

# IPNS publish.
ipfs name publish --allow-offline --ipns-base b58mh --key something --lifetime 8784h --quieter $cid

# Save published CID to avoid GC.
ipfs --offline files rm -r /backups/.last-published-something
ipfs --offline files cp /backups/something /backups/.last-published-something

I've used this script for a long time. It originally did an ipfs add && ipfs files cp /ipfs/$cid /backups/something/$id && ipfs pin rm $cid. When --to-files was added I eagerly switched, since occasionally an interrupted backup run would leak a pin and waste space. (I don't use pins for anything else, so I'm not worried about accidentally removing an existing pin, but avoiding that is an added plus.)

However, after a handful of runs I started seeing ipfs --offline files stat fail because the blocks were incorrectly GCed. Switching to --pin=true followed by a pin rm solved the issue. I went back and forth between this and --to-files --pin=false, and every time I removed the explicit pin management I started seeing blocks go missing, even though nothing else in the system changed. So I am quite confident that something about --to-files --pin=false triggers a GC bug. Both ipfs add && ipfs files cp && ipfs pin rm and ipfs add --to-files --pin=true && ipfs pin rm do not trigger it.

When I investigated which subdirectory of MFS had the missing blocks, it was never the latest one. It was always a directory from a few backups ago, one that had previously been verified as complete and was completely unchanged. The path was never removed or even moved after that validation succeeded.

@lidel
Member

lidel commented Jan 17, 2023

I poked at this a bit in that direction: I ran ipfs daemon --enable-gc with Datastore.StorageMax=5M and Datastore.GCPeriod=5m, but was unable to reproduce missing blocks.

If there is a problem, it is something more nuanced – we need to find a way to reproduce it first.

@kevincox any chance you can create a minimal repro.sh script with exact commands that demonstrate how to trigger the problem? (Set a custom IPFS_PATH and create fake data from /dev/urandom.)
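
For reference, a skeleton of what such a script might look like (paths, sizes, and the loop count are placeholders; the add flags mirror the backup commands shared earlier in this thread, and it runs directly against the repo with an explicit ipfs repo gc rather than a GC-enabled daemon):

#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo so the host node is untouched.
export IPFS_PATH=$(mktemp -d)
ipfs init

ipfs files mkdir -p /backups

for i in $(seq 1 20); do
	dir=$(mktemp -d)
	head -c 5M /dev/urandom > "$dir/data.bin"
	ipfs add --to-files=/backups/$i --pin=false -rHQ -s rabin-2048-65536-131072 -- "$dir"
	ipfs repo gc
	# Every earlier snapshot should still be fully retrievable offline;
	# set -e aborts the script on the first missing block.
	for j in $(seq 1 $i); do
		ipfs dag stat --offline "$(ipfs --offline files stat --hash /backups/$j)"
	done
done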

@kevincox
Author

kevincox commented Jan 17, 2023

Unfortunately I'm not really sure that I can. I would need to understand more about what is happening to reproduce it in a more minimal setting. I tried simulating my workload with random data and didn't succeed. I tried sharing some blocks between "backups" and also couldn't reproduce. It seems that anything I strip down removes the magic condition, so I think it is very likely something nuanced.

I've included a cleaned-up version of my backup script below in case it triggers any ideas, but I'm afraid we may be unable to debug this until someone else hits this issue and we can try to spot the similarities.

root="/backups/something"
latest=$root/latest
new=$root/$(date -u '+%FT%TZ')

ipfs --offline files mkdir -p $root

### BROKEN
ipfs --offline add --to-files=$new --pin=false '-rHQ' '-s' 'rabin-2048-65536-131072' '--' '/var/lib/something/'
read latest_cid latest_flat_size < <(ipfs --offline files stat --format '<hash> <cumulsize>' $new)
### BROKEN END

### WORKING
ipfs --offline add --to-files=$new --pin=true '-rHQ' '-s' 'rabin-2048-65536-131072' '--' '/var/lib/something/'
read latest_cid latest_flat_size < <(ipfs --offline files stat --format '<hash> <cumulsize>' $new)
ipfs --offline pin rm $latest_cid
### WORKING END

# Decide whether the "latest" pointer needs to be created or updated.
last=$(ipfs --offline files stat --hash $latest || true)
if [[ $last == "" ]]; then
	diff=new
elif [[ $last != $latest_cid ]]; then
	diff=updated
else
	diff=unchanged
fi

if [[ $diff != unchanged ]]; then
	ipfs --offline files rm -r $latest || true
	ipfs --offline files cp /ipfs/$latest_cid $latest
fi

# Prune old backups; pruner is a local helper that picks which entries to drop.
ipfs --offline files ls $root \
	| grep -v '^latest$' \
	| pruner 'P1D=14' 'P1M=6' '10' \
	| awk -v root=$root/ '{print root $0}' \
	| xargs --verbose -L1 -r \
		ipfs --offline files rm -r --

read root_cid root_flat_size < <(ipfs --offline files stat --format '<hash> <cumulsize>' $root)

# Publish the root under IPNS (the key is created on the first run).
key="backups-something"
ipfs key gen $key || true
ipns=$(ipfs name publish \
	--allow-offline \
	--ipns-base b58mh \
	--key "$key" \
	--lifetime $((366*24))h \
	--quieter \
	$root_cid)

# Keep a reference to the last published root so its blocks aren't GC'd.
ipfs --offline files rm -r "/backups/.last-published-something" || true
ipfs --offline files cp "$root" "/backups/.last-published-something"

# Total space usage of the root and of the latest snapshot. The nested
# ${${...}...} expansions are zsh syntax; they strip the leading "Size: "
# and the trailing ", ..." from the ipfs dag stat output.
root_usage=$(ipfs --offline dag stat --progress=false $root_cid)
eval 'root_usage=${${root_usage##Size: }%,*}'
latest_usage=$(ipfs --offline dag stat --progress=false $latest_cid)
eval 'latest_usage=${${latest_usage##Size: }%,*}'

@kevincox
Author

Ok, I switched back to no pinning and this happened again.

  • 2023-01-17T01:58:16Z is the first run with --pin=false
  • 2023-01-18T08:01:24Z was added with ipfs add and then successfully stat'ed offline.
  • 2023-01-19T14:11:59Z a block could not be found in the 2023-01-18T08:01:24Z backup.
  • There were no IPFS daemon restarts (or any log messages at all) during this time. This run of the IPFS daemon ran from 2023-01-15 02:15 to 2023-01-20 22:38.
  • The backups occur every 4h.

The detection was probably quite late in this case: I had some problems with my pinning services, so no full stat was done between the one right after the add and the one that detected the loss. I'll keep running the script (after removing the missing directory) to see when this next occurs and try to get a tighter timeline.

@kevincox
Author

kevincox commented Jan 22, 2023

Had this again very quickly. This time two paths were missing.

  • 2023-01-21T16:58:14Z was added with ipfs add and then stat'ed. (This was one of the missing paths.)
  • 2023-01-21T19:58:13Z was added with ipfs add and the whole "backup archive" was stat'ed. (This was the other missing path.)
  • 2023-01-21T20:00:52Z was added with the same contents as the previous.
  • 2023-01-21T20:02:06Z was added with the same contents again.
  • 2023-01-21T22:58:23Z a new path was added. Files were not stat'ed.
  • 2023-01-21T22:59:50Z a new path was added. Files were not stat'ed.
  • 2023-01-21T23:00:23Z a new path was added. Files were not stat'ed.
  • 2023-01-22T01:58:23Z (a new path was added and) the missing paths were detected.

@kevincox
Author

I've seen this a few times, but because the paths got removed before I investigated, it was hard to get a good trace.

The most recent occurrence was very fast: about an hour after the data was added, it was lost.

  • 2023-01-27T10:58:54Z was ipfs added. (this is the path that will lose data)
  • 2023-01-27T12:13:29Z block was found missing.

This was actually within the same backup run (the stat took a long time with a cold cache), so very little happened between the add and the loss.

@guseggert guseggert added need/maintainer-input Needs input from the current maintainer(s) and removed need/triage Needs initial labeling and prioritization labels Jan 30, 2023
@lidel lidel removed the need/author-input Needs input from the original author label Feb 27, 2023
@lidel lidel removed their assignment Feb 27, 2023
@guseggert guseggert added the P2 Medium: Good to have, but can wait until someone steps up label Apr 3, 2023
@kevincox
Author

kevincox commented Jun 6, 2023

I just tried this again with kubo-0.20.0 and the issue recurred quickly.

@BigLep
Contributor

BigLep commented Aug 3, 2023

2023-08-03 maintainer conversation: we're not prioritizing this currently because we don't know how to reproduce the problem, and if there is a problem, it's likely in MFS (and we need to do product work around data onboarding and MFS).
