json-fileops: Utilities for recording file info

I need these utilities for some work I'm doing.
I remember doing similar things before, with ad-hoc scripts and arbitrary file formats.
Wish I had thought of using http://jsonlines.org / http://ndjson.org last time!
I'm sure I would use such utilities again if I were more careful about making them reusable, and so might you.

This starts as a specification, i.e. pure vapourware.

`*.fileops.jsonl`: Record of operations upon a set of files

Inputs
- Web server logs
- strace logs
- Random read operations, taken from a set of files listed in the .fileprops.jsonl style
Outputs
- Summary statistics for access speeds and concurrency
- Re-do the operations
  - If the underlying filesystem were going to go bang! as a result, this should be a nice way to reproduce the effect.
  - Aiming to start each at the same relative time
  - Emit updated timing info
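
The "same relative time" replay could be sketched roughly as below. This is not the project's code: the record format isn't fixed yet, so the field names `t` (start time in seconds) and `elapsed`, and the `run` callback that would perform the real operation, are all assumptions made for illustration. The sketch is also serial; reproducing concurrency would need threads or processes.

```python
import json
import time


def replay_schedule(lines, now=time.monotonic, sleep=time.sleep, run=None):
    """Replay JSONL records at the same offsets from the first record.

    Assumed (hypothetical) schema: each record has a numeric "t" field
    giving its original start time in seconds.
    Returns copies of the records with updated "t" and "elapsed" values.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    records.sort(key=lambda rec: rec["t"])
    if not records:
        return []

    first = records[0]["t"]
    began = now()
    results = []
    for rec in records:
        # Wait until this record's original offset from the first record.
        delay = began + (rec["t"] - first) - now()
        if delay > 0:
            sleep(delay)
        started = now()
        if run is not None:
            run(rec)  # hypothetical callback that performs the operation
        finished = now()
        # Emit updated timing info, relative to the start of the replay.
        results.append({**rec, "t": started - began, "elapsed": finished - started})
    return results
```

Passing `now` and `sleep` as parameters keeps the scheduling logic testable without real waiting.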

There are two interchangeable ways to record this; I'm not yet sure whether I want to deal with both.

Per-filehandle record style

In which we do a group of operations on a (virtual) file descriptor.

Example: eg.vfd-grouped.fileops.jsonl; and pretty-printed: eg.vfd-grouped.fileops.json.

Per-function record style

An strace-like format, where events are logged in chronological order and so you need a (virtual) file descriptor number to tie the later operations back to the initial open.

This is probably more useful after conversion to per-filehandle format, since the operations on any one file handle are intrinsically serial.

Example: eg.per-func.fileops.jsonl; and pretty-printed: eg.per-func.fileops.json.
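
Converting from the per-function style to the per-filehandle style amounts to grouping events by their virtual file descriptor. A minimal Python sketch, assuming hypothetical field names (`vfd` for the virtual descriptor, `ops` for the grouped list) and assuming a virtual descriptor number is never reused:

```python
import json


def group_by_vfd(lines):
    """Group chronological per-function records into per-filehandle records.

    Assumed (hypothetical) schema: every input record carries a "vfd" key.
    Groups come out in order of first appearance; the operations inside
    each group keep their chronological order.
    """
    groups = {}
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        vfd = rec.pop("vfd")
        groups.setdefault(vfd, []).append(rec)
    return [{"vfd": vfd, "ops": ops} for vfd, ops in groups.items()]
```

Writing each returned dict with `json.dumps` on its own line would give the JSON Lines form.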

`*.fileprops.jsonl`: Record of file checksums

The initial application here is a large POSIX filesystem, for which we currently have no checksums or integrity checks. Files may get updated from time to time, and (at this time) I don't want to hook into everything that touches them.

Properties to store
- File pathname
  - Ignoring any implicit root, to which names may be relative.
  - Ignoring the implicit hostname or filesystem mappings.
- Whole file checksums, of any type (md5, sha1, sha2-512 ...)
- Metadata
  - Size
  - mtime, ctime, atime
  - dev, inode, nlinks
- Segmental file checksums?
  - As a reversible translation of hashdeep output.
  - The advantage of segmental sums is that you don't have to read the entire file when making a random validation.
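
To illustrate the point about segmental sums, here is a rough Python sketch. The segment size, the record keys (`offset`, `length`) and the function names are assumptions for illustration; this is not hashdeep's output format.

```python
import hashlib


def segment_checksums(path, segment_size=1 << 20, algorithm="sha1"):
    """Checksum a file in fixed-size segments (the last one may be shorter)."""
    sums = []
    offset = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(segment_size)
            if not chunk:
                break
            digest = hashlib.new(algorithm, chunk).hexdigest()
            sums.append({"offset": offset, "length": len(chunk), algorithm: digest})
            offset += len(chunk)
    return sums


def verify_segment(path, segment, algorithm="sha1"):
    """Re-read one segment only and compare it with the stored checksum."""
    with open(path, "rb") as fh:
        fh.seek(segment["offset"])
        chunk = fh.read(segment["length"])
    return hashlib.new(algorithm, chunk).hexdigest() == segment[algorithm]
```

A random validation would pick one entry from the stored list and call `verify_segment` on it, reading only that part of the file.
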
Operations possible
- Scan a directory tree and collect fileprops (metadata only), because actually reading the files to get the checksums may take weeks, and we will want to run that in parallel.
- Read fileprops and scan the directory tree.
  - check and update metadata, reporting changes and dropping stale checksums
  - fill in missing checksums
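
The two operations above could be sketched in Python as follows. The key names follow the Summain style used under "Sample data", but storing `Mtime` as a nanosecond count, the `CHECKSUM_KEYS` list and both function names are assumptions made for this sketch.

```python
import os
import stat

# Hypothetical list of checksum keys that may appear in a record.
CHECKSUM_KEYS = ("MD5", "SHA1", "SHA256")


def collect_props(root):
    """Scan a tree and collect metadata only; no file contents are read."""
    props = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            st = os.lstat(full)
            if not stat.S_ISREG(st.st_mode):
                continue  # regular files only, like `find -type f`
            name = os.path.relpath(full, root)  # relative to the implicit root
            props[name] = {
                "Name": name,
                "Size": str(st.st_size),
                "Mtime": str(st.st_mtime_ns),  # assumption: nanoseconds since epoch
                "Ino": str(st.st_ino),
                "Nlink": str(st.st_nlink),
            }
    return props


def refresh_props(old, new):
    """Merge a fresh scan with stored fileprops.

    Checksums are carried over only when Size and Mtime are unchanged;
    otherwise they are dropped as stale and the name is reported.
    """
    merged, changed = {}, []
    for name, rec in new.items():
        prev = old.get(name)
        if prev and all(prev.get(k) == rec[k] for k in ("Size", "Mtime")):
            kept = {k: prev[k] for k in CHECKSUM_KEYS if k in prev}
            merged[name] = {**rec, **kept}
        else:
            if prev:
                changed.append(name)
            merged[name] = rec
    return merged, changed
```

Records in `merged` that lack a checksum key are the ones a later, slower pass would need to fill in.
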
Tools that solve part of the problem
- GNU findutils, GNU coreutils
  - `find /stuff -type f -print0 | xargs -r0 -n5 -P4 sha1sum`
  - No way to detect intentional changes.
  - No way to keep several types of checksum, except in multiple files (scans.md5, scans.sha1, etc.).
  - There are other line-based plaintext formats which represent similar results in different ways.
- Summain does file scanning and checksumming, and has --output-format json.
  - You have to choose what to scan, and then decide whether the results are still fresh.
- Rsync does file change detection, based on metadata comparison; we have to be aware that
  - "dev" and/or "inode" may not be stable on some filesystem types. I'm using an NFS mount, which makes the Dev field useless.
  - timestamp granularity varies with filesystem type
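
A metadata comparison that allows for both caveats might look like the Python sketch below. It is similar in spirit to rsync's `--modify-window` option but is not rsync's algorithm; the function name, its parameters and the nanosecond `Mtime` representation are assumptions for illustration.

```python
def probably_unchanged(old, new, mtime_tolerance_ns=0, trust_inode=False):
    """Guess from metadata alone whether a file is unchanged.

    Assumes records with string-valued "Size" and "Mtime" (nanoseconds
    since the epoch) and optionally "Ino". Dev is never compared, and Ino
    only when the caller says it is stable on this filesystem.
    """
    if int(old["Size"]) != int(new["Size"]):
        return False
    # Tolerate coarse timestamps, e.g. 1-second or 2-second granularity.
    if abs(int(old["Mtime"]) - int(new["Mtime"])) > mtime_tolerance_ns:
        return False
    if trust_inode and old.get("Ino") != new.get("Ino"):
        return False
    return True
```

Choosing the tolerance is a trade-off: a wider window hides granularity differences between filesystems but can also hide a quick rewrite that keeps the same size.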

Sample data

I took the Summain output style as my standard, and made some changes to the details.

$ summain -I --exclude={Username,Group,Atime,Dev} -c nil          -f jsonl todo.jq | tee sum0
{"Ctime":"2015-05-04 19:39:55.198461000 +0000","Gid":"808","Ino":"34399918","Mode":"100640","Mtime":"2015-05-04 19:39:55.198461000 +0000","Name":"todo.jq","Nlink":"1","Size":"357","Uid":"11179"}
$ summain -I --exclude={Username,Group,Atime,Dev} -c md5 -c sha256 -f jsonl todo.jq | tee sum2
{"Ctime":"2015-05-04 19:39:55.198461000 +0000","Gid":"808","Ino":"34399918","MD5":"efe45c3ac876bf4ae963069c18f17a0c","Mode":"100640","Mtime":"2015-05-04 19:39:55.198461000 +0000","Name":"todo.jq","Nlink":"1","SHA256":"b98236d0b92a8e4d069ac62352e58c8ae66d553560449547f8f518d42819df2c","Size":"357","Uid":"11179"}
$ jq . sum0 sum2

{
  "Uid": "11179",
  "Ctime": "2015-05-04 19:39:55.198461000 +0000",
  "Gid": "808",
  "Ino": "34399918",
  "Mode": "100640",
  "Mtime": "2015-05-04 19:39:55.198461000 +0000",
  "Name": "todo.jq",
  "Nlink": "1",
  "Size": "357"
}
{
  "Uid": "11179",
  "Size": "357",
  "SHA256": "b98236d0b92a8e4d069ac62352e58c8ae66d553560449547f8f518d42819df2c",
  "Ctime": "2015-05-04 19:39:55.198461000 +0000",
  "Gid": "808",
  "Ino": "34399918",
  "MD5": "efe45c3ac876bf4ae963069c18f17a0c",
  "Mode": "100640",
  "Mtime": "2015-05-04 19:39:55.198461000 +0000",
  "Name": "todo.jq",
  "Nlink": "1"
}

I chose MD5 and SHA2-256 because they are what iRODS needs for v3 and v4.
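
Every value in the records above is a JSON string, including the numbers and timestamps. A reader might convert them as in this Python sketch; the function names and the choice of nanoseconds since the epoch are assumptions, and `Mode` is left as the string Summain printed.

```python
import json
from datetime import datetime


def parse_summain_time(text):
    """Convert e.g. '2015-05-04 19:39:55.198461000 +0000' to epoch nanoseconds.

    The fraction is handled by hand because strptime's %f accepts at most
    six digits, while these timestamps carry nine.
    """
    date, clock, zone = text.split()
    whole, _, frac = clock.partition(".")
    moment = datetime.strptime(f"{date} {whole} {zone}", "%Y-%m-%d %H:%M:%S %z")
    nanos = int(frac.ljust(9, "0")[:9]) if frac else 0
    return int(moment.timestamp()) * 1_000_000_000 + nanos


def load_fileprops(lines):
    """Parse JSONL fileprops records, converting numeric and time fields."""
    records = []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        for key in ("Size", "Nlink", "Ino", "Uid", "Gid"):
            if key in rec:
                rec[key] = int(rec[key])
        for key in ("Mtime", "Ctime", "Atime"):
            if key in rec:
                rec[key] = parse_summain_time(rec[key])
        records.append(rec)
    return records
```

Usage would be along the lines of `load_fileprops(open("sum2"))`, with `sum2` being the file written by the `tee` command above.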
