Pachyderm File System - a version controlled file system for big data.
PFS has 4 basic primitives:
- File
- Commit
- Repo
- Block
Each of these is simple, and understanding all of them provides a good tour of PFS.
Pachyderm File System (PFS) allows you to store arbitrary data in files. These files can be as large as you'd like, and store any kind of information.
We wanted to use an interface to data that is familiar to everyone. Reading/writing data to a file is as familiar as you get.
Doing this on big data sets gets interesting, but having a simple underlying interface makes interacting with the data more intuitive, and more easily accessible to developers no matter what their language of choices.
Interactions with PFS are very Git-like. Your data, which is made up of one or more Files
, is versioned in a data repository, or Repo
.
With PFS, you version your data by making Commits
of data into Repos
. By versioning your data, you can:
- reproduce any input or output for your processing, which in turn enables ...
- collaborating with your peers on a data set
Reproducibility and Collaboration are things we care a lot about.
We store each commit as only the data that changed from the prior commit, which is where PFS differs from Git. Storing your data this way allows us to enable Incrementality and keeps PFS space efficient.
Under the hood, we store your files in sets of Blocks
. These are smaller (usually ~8MB) chunks of your file. By storing your data in smaller chunks, we can more efficiently read and write your data in parallel.
Blocks
also determine the smallest indivisible chunk of your data. When performing a map
job, each File
is seen by multiple containers. Each container sees one or more Blocks
of a file.
This is important because this also determines the granularity of how the data is exposed as an input. Specifically, during a map
job, each container will see a slice of your data file. That slice will be one or more Blocks.
For certain data types (binary blobs or JSON objects), making sure that your data is divided correctly into indivisible chunks is important. Doing this with PFS is straightforward.
By default, data is line delimited and a single Block
consists of 8MBs worth of lines.
By default, the data is line delimited and stored internally as a block of no more than ~8MBs. This means that your data will never be broken up within any line.
For JSON data, you might have input like this:
{
"foo": "bar",
"bax": "baz"
}
{
"foo": "cat",
"bax": "dog"
}
You can see quickly how line delimiting will not work. If a block happens to terminate not at the end of a JSON object, the result during a map
job will be a partial / invalid JSON object.
To make sure your JSON data is delimited correctly, just make sure the file in question has a .json
suffix. This tells PFS that the data being stored is JSON, and Pachyderm will make sure each Block
consists of whole JSON objects.
Since binary data doesn't always have a static size, and can be quite large, delimiting binary data works a bit different.
We enable this by treating every single write to that file as a separate block, no matter what the size. E.g. if you open /pfs/out/foo.bin
and within your code write to it several times, each time you write the data will be treated as a separate block. This guarantees that a map
job consuming your data will always see it at least at the granularity you have provided by your writes.
To require PFS to delimit blocks in this fashion, make sure your file as the .bin
suffix.
What happens when you read or write to PFS?
PFS is backed by an object store of your choosing (usually S3 or GCS). This allows for highly redundant consistent storage.
Each block of each of your files is content-addressed and uploaded to your object store. This gains us de-duplication of the data.
Additionally, because each commit only contains diffs
of blocks that were written, all data stored by PFS is immutable.
You can never write to a Pachyderm Repo
without making it part of a Commit
. That means you have to start the Commit
, write your data, then finish the Commit
.
Here's an example:
$pachctl create-repo foo
$pachctl start-commit foo master
master/0
$echo 'hai' | pachctl put-file foo master/0 test.txt
$pachctl finish-commit foo master/0
# And writing in a new commit
$pachctl start-commit foo master
master/1
$echo 'bai' | pachctl put-file foo master/1 test.txt
$pachctl finish-commit foo master/1
In this example, we've written two words to the same file across two commits.
You'll see that writing requires a CommitID. If the Commit
has been finished, you will only be able to read.
Let's try reading the file we wrote above. That would look like this:
$pachctl get-file foo master/1 test.txt
hai
bai
Notice how the output is the cumulative result of the commits.
Pachyderm uses FUSE to mount PFS. You can think of it simplistically as a network mount of PFS. While the files are truly served from object storage, you can see them locally by mounting PFS.
Mounting PFS locally is a great way to debug an issue, or poke around PFS to understand how it works.
To mount locally, run:
$ mkdir ~/pfs
$ pachctl mount ~/pfs &
(If ~/pfs
already exists, you may need to umount
it first)
Now you can look around the local mount using ls
or just point your browser at the local files:
# This is equivalent to `pachctl list-repo`
$ls ~/pfs
foo
# This is equiavelent to `pachctl list-commit foo`
$ls ~/pfs/foo
master/0 master/1
# This is equiavelent to a call to `pachctl get-file ...`
$cat ~/pfs/foo/master/0/test.txt
# And this is similar to `pachctl list-file ...`. It allows you to see all files in a commit:
$ls ~/pfs/foo/master/0/
test.txt
Using this interface, you can grep, touch, ls, etc the files as you normally would. The exceptions are that you cannot write data to a commit that is finished.