Add some way to get the actual storage size of an object #5910

Closed
da2x opened this issue Jan 10, 2019 · 5 comments
Labels
kind/enhancement A net-new feature or improvement to an existing feature

Comments

@da2x
Contributor

da2x commented Jan 10, 2019

Version information:

go-ipfs version: 0.4.18-
Repo version: 7
System version: amd64/linux
Golang version: go1.11.1

Type:

enhancement

Description:

ipfs object stat QmPS6VssQGyBYjGQSK8ordvXaU1yUoaUmTfmrV7daLeRPH outputs CumulativeSize: 36709305. However, when an object consists of mostly duplicated data (like the example hash does), the actual block size required to store it in a repository is only 15729480 bytes (or 42.8% of the CumulativeSize).

It would be extremely useful to have another field in the output of ipfs object stat for the actual recursive, deduplicated storage block size of an object (the amount of space required in the repository to store the object).

The following command is the only way I’ve found to get this number:

ipfs refs --unique --recursive QmPS6VssQGyBYjGQSK8ordvXaU1yUoaUmTfmrV7daLeRPH | \
    xargs -I% sh -c "echo -e '\n%I' && ipfs object stat %" | \
    grep DataSize\: | \
    egrep -o -e '[0-9]+' | \
    awk '{x += $1} END{print x}'

(The above method may not be accurate, although it appears to work at least for synthetic test objects.)
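For comparison, here is a variation that sums the raw block sizes rather than DataSize, which may land closer to the on-disk number. This is only a sketch: it assumes the Size reported by ipfs block stat reflects the stored size of each block, and, like the pipeline above, it counts only the refs reachable from the root, not the root block itself.

ipfs refs --unique --recursive QmPS6VssQGyBYjGQSK8ordvXaU1yUoaUmTfmrV7daLeRPH | \
    xargs -n1 ipfs block stat | \
    awk '/^Size:/ {x += $2} END {print x}'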

This would be very useful to pinning services, all of which currently overcharge their customers by billing for the CumulativeSize of pinned objects rather than the actual storage space used. The IPFS community would benefit from more accurate storage accounting and cheaper pin storage services. Ideally, pinning services would create per-customer objects and charge for the deduplicated storage space across all of a customer's pins.
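As a rough sketch of that per-customer accounting, assuming a hypothetical pins.txt listing one pinned CID per line, the refs of every pin can be pooled and deduplicated before summing (with the same caveats as the pipeline above):

xargs -n1 ipfs refs --unique --recursive < pins.txt | \
    sort -u | \
    xargs -n1 ipfs object stat | \
    awk '/^DataSize:/ {x += $2} END {print x}'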

What follows are instructions for recreating the test object in case it gets garbage collected or otherwise lost in time. It's not directly relevant to the issue itself, beyond explaining how the test object was constructed.

#!/bin/bash
cd "$(mktemp -d)"
mkdir dir/
mkdir dir/1
head -c 5M </dev/random > dir/1/file1
head -c 5M </dev/random > dir/1/file2
mkdir dir/2
head -c 5M </dev/random > dir/2/file3
# dir/2/file4 is the concatenation of file1 and file2, so its blocks duplicate theirs
cat dir/1/file1 > dir/2/file4
cat dir/1/file2 >> dir/2/file4
ipfs add -r dir
@da2x da2x changed the title Add deduplicated storage size field to ipfs object stat Add some way to get the actual storage size of an object Jan 10, 2019
@bonedaddy
Contributor

bonedaddy commented Jan 10, 2019

@da2x I believe I found a way to replicate this using already-existing stats from ipfs object stat:

I took your example hash and added up the block sizes of the unique recursive references, which resulted in:

calculated ref size 15734625
expected   ref size 15729480

Adding up the DataSize of the unique recursive references landed at:

calculated ref size 15729672
expected   ref size 15729480

The "expected" ref size is the number you indicated in this issue.
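For the record, both sums can be reproduced with a pipeline along the lines of the one above, just switching which ipfs object stat field gets summed (assuming the "block sizes" here refer to the BlockSize field):

ipfs refs --unique --recursive QmPS6VssQGyBYjGQSK8ordvXaU1yUoaUmTfmrV7daLeRPH | \
    xargs -n1 ipfs object stat | \
    awk '/^BlockSize:/ {b += $2} /^DataSize:/ {d += $2} END {print "BlockSize sum:", b; print "DataSize sum:", d}'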

Here's the somewhat less hacky solution I'm using in our package that interfaces with IPFS. Unfortunately, I couldn't get the built-in go-ipfs-api to fetch refs properly (it constantly returned a single ref), so I resorted to running a raw command:

// DedupAndCalculatePinSize is used to remove duplicate refs to objects for a more accurate pin size cost
func (im *IpfsManager) DedupAndCalculatePinSize(hash string) (int64, error) {
	// format a multiaddr api to connect to
	parsedIP := strings.Split(im.nodeAPIAddr, ":")
	multiAddrIP := fmt.Sprintf("/ip4/%s/tcp/%s", parsedIP[0], parsedIP[1])
	outBytes, err := exec.Command("ipfs", fmt.Sprintf("--api=%s", multiAddrIP), "refs", "--recursive", "--unique", hash).Output()
	if err != nil {
		return 0, err
	}
	scanner := bufio.NewScanner(strings.NewReader(string(outBytes)))
	var refsArray []string
	for scanner.Scan() {
		refsArray = append(refsArray, scanner.Text())
	}
	var calculatedRefSize int
	for _, ref := range refsArray {
		refStats, err := im.Stat(ref)
		if err != nil {
			return 0, err
		}
		calculatedRefSize += refStats.DataSize
	}
	return int64(calculatedRefSize), nil
}

@da2x
Contributor Author

da2x commented Jan 10, 2019

My example command was slightly off. I updated it above and its behaviour now matches your Go implementation.

@bonedaddy
Contributor

Woot! I believe my Go implementation will work regardless of chunk size.

@Stebalien Stebalien added the kind/enhancement A net-new feature or improvement to an existing feature label Apr 30, 2019
@Stebalien
Member

So, "CumulativeSize" is calculated based on metadata recorded in the object itself. We'd probably want something like:

  • ipfs repo stat <object>
  • ipfs dag stat <object>

I'd vote for the latter, actually. See: #3955.

@aschmahmann
Contributor

There is an ipfs dag stat command that does this now; see #7553.
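For anyone landing here later, the invocation is simply the following (assuming a go-ipfs release that ships the command; the exact output format may vary between versions):

ipfs dag stat QmPS6VssQGyBYjGQSK8ordvXaU1yUoaUmTfmrV7daLeRPH
# reports the deduplicated size and block count of the DAG,
# along the lines of: Size: <bytes>, NumBlocks: <count>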
