(draft) Common Bytes - standard for data deduplication #444

Closed
danimesq opened this issue Dec 5, 2019 · 3 comments

Comments

danimesq commented Dec 5, 2019

This is a proposed standard for deduplicating the bytes that different versions of the same kind of file have in common.
It takes inspiration from git objects, but adds further approaches to ensure the right parts of the content are organized. It is not only for deduplicating data, but also for linking the same data when it is represented in different kinds of files.
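
A minimal sketch of the git-objects idea, for illustration only: content is addressed by a hash of its bytes, so identical content is stored once no matter how many files contain it. The `git_blob_id` function follows git's blob hashing (SHA-1 over a `blob <size>\0` header plus the content); the `ObjectStore` class and its method names are hypothetical and not part of any proposed API.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 ID that git gives a blob holding exactly these bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store that keeps one entry per distinct content."""

    def __init__(self):
        self.objects = {}  # object ID -> raw bytes

    def put(self, data: bytes) -> str:
        oid = git_blob_id(data)
        # setdefault leaves an existing entry untouched, so duplicates cost nothing.
        self.objects.setdefault(oid, data)
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]


store = ObjectStore()
first = store.put(b"shared stylesheet bytes")
second = store.put(b"shared stylesheet bytes")  # same content saved a second time
```

Both calls return the same ID and the store holds a single object, which is the property the rest of the proposal builds on.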

  • consider jpg, gif, png, svg and base64 as different variations of the same file
  • take as an example the single-file HTML saves from dev.to, which easily reached the limit on Pinata: only their common bytes should be stored instead of being duplicated, and common bytes should also be considered when mirroring and for the read modes from the 2read extension
  • store each line of a text and know the text format of each part (no duplication when comparing text from a 2read saved page with its full HTML site mirror); the same applies to PDF/ePub/MOBI
  • use references, for example to record which lines of a single-page HTML file hold the same content as a JS/CSS file; this also works for SVG and base64 images
  • Windows and other screenshots: keep the same bytes in objects, for example parts of the taskbar and window frame
  • different qualities of the same video; version inside video files, detecting when frames are similar and then diffing them
  • all kinds of compressed files (and partition/disk images), including those supported by 7Zip and the Linux archiver
  • deb, aur, rpm and other Linux/BSD packages
  • MIDI and 8-bit sounds
  • other wave-based audio/music
  • .exe, .msi, .appimage and other executables
  • git packs: consider their content, the same as git objects; http://web.archive.org/web/20191205203745/https://github.com/radicle-dev/radicle/issues/689
    new for dedup/plugz/download.json:
    instead of downloading lots of duplicated AppImage, DEB and RPM files, get their internal files and deduplicate them. Make these files internally symlink the common files. Browser downloads should also be supported, with an API to get the hashes of the internal files and verify whether the local device already has them.
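
As a rough illustration of the line-level storage and the "does the local device already have it" check from the list above, here is a small sketch. It assumes SHA-256 as the hash and a plain dictionary as the local store; `store_lines`, `missing_chunks` and the sample documents are hypothetical names made up for this example.

```python
import hashlib


def chunk_id(data: bytes) -> str:
    """Hash used as the chunk's address (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


def store_lines(text: str, store: dict) -> list:
    """Store each line once and return the ordered list of line IDs."""
    recipe = []
    for line in text.splitlines(keepends=True):
        raw = line.encode("utf-8")
        cid = chunk_id(raw)
        store.setdefault(cid, raw)  # a line already present is not stored again
        recipe.append(cid)
    return recipe


def missing_chunks(recipe: list, store: dict) -> list:
    """Return the IDs from a recipe that the local store does not hold yet."""
    missing = []
    for cid in recipe:
        if cid not in store and cid not in missing:
            missing.append(cid)
    return missing


local_store = {}
full_page = "<h1>Title</h1>\n<p>Body text</p>\n<footer>site links</footer>\n"
read_mode = "<h1>Title</h1>\n<p>Body text</p>\n"

full_recipe = store_lines(full_page, local_store)
# Recipe for the read-mode copy, computed without touching the local store.
read_recipe = [chunk_id(l.encode("utf-8")) for l in read_mode.splitlines(keepends=True)]
to_download = missing_chunks(read_recipe, local_store)
```

Because every line of the read-mode copy already exists in the store from the full page, `to_download` comes back empty and nothing would need to be fetched or stored twice.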

danimesq commented Dec 5, 2019

danimesq commented Dec 24, 2019

It could also do I/O deduplication, generating the different versions of the same file by applying their common bytes.
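
One possible reading of this, sketched under assumptions: each version is kept only as an ordered list of chunk IDs, and the full file is regenerated on read from a shared chunk store. Fixed-size chunks, SHA-256 and the names `split_into_chunks`, `save_version` and `rebuild` are all illustrative choices, not something the proposal specifies.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose, so the example data stays readable


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Cut the data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def save_version(data: bytes, chunks: dict) -> list:
    """Add unseen chunks to the shared store and return this version's recipe."""
    recipe = []
    for piece in split_into_chunks(data):
        cid = hashlib.sha256(piece).hexdigest()
        chunks.setdefault(cid, piece)
        recipe.append(cid)
    return recipe


def rebuild(recipe: list, chunks: dict) -> bytes:
    """Regenerate a version by concatenating its chunks in recipe order."""
    return b"".join(chunks[cid] for cid in recipe)


shared_chunks = {}
version_1 = b"AAAABBBBCCCC"
version_2 = b"AAAAXXXXCCCC"  # differs from version_1 only in the middle chunk

recipe_1 = save_version(version_1, shared_chunks)
recipe_2 = save_version(version_2, shared_chunks)
```

Here six chunks are referenced across the two versions but only four are stored, and `rebuild` returns each version byte for byte from its recipe.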

hsanjuan commented Mar 21, 2020

Thanks for posting to discuss! I'll close this.
