Name - pezl

split/cat for Google Cloud Storage objects.

Project status

pezl is an MVP built by field personnel and is updated on an as-needed basis.

Description

pezl is a command-line Golang application that splits Google Cloud Storage (GCS) objects. It is useful for preprocessing data destined for systems that consume many files in parallel, e.g. Hadoop/Spark, rather than downloading one file on a single thread.

For example, suppose the data to be consumed by Spark arrives as a single 100GB file. Spark will use a single worker to download the file and read it into memory, despite having a cluster of workers. A traditional approach to this problem is to spin up a GCE VM instance, download the file via gsutil, split it locally, and then upload the split output files to GCS for Spark to consume in parallel.

pezl's interface is designed to resemble that of the Linux split tool.

Performance

In a benchmark with a 100GB file on an n2-standard-16 VM instance, the 'Download-Split-Upload' process described above took ~40 minutes to complete. Pezling the same file took less than 4 minutes, roughly a 10x performance improvement.

Definitions

pezl - to split an object into smaller objects, called pezlings

unpezl - to concatenate pezlings into a single, larger object

Installation

Install Golang 1.17 (current at the time of writing) on a GCE VM instance.
[Set up a pezl project](https://golang.org/doc/code)
Build the pezl executable: go build -o pezl

Usage

./pezl --help
Usage of pezl:
  -a int
    	use suffixes of length N (default 2 which provides 676 output files) (default 2)
  -b int
    	put exactly SIZE bytes per output file, default is 1M (default 1048576)
  -d	use numeric suffixes instead of alphabetic
  -h	display this help and exit
  -l	put approximately SIZE bytes of lines per output file (ends pezlings on the first EOL after nSIZE bytes
  -s int
    	specify a single pezling to be created. pezl -s 1 would only create a single pezling from byte 1 to byte 1M.  This is often called in a distributed fashion.
  -t int
    	Specify the number of concurrent goroutines ('threads') to use. (default 100)
  -u	Unpezl (concatenate) files created by pezl
  -v	prints diagnostics while pezling
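
The -a and -d flags determine how pezlings are named. The following is a minimal sketch of how such suffixes could be generated, not pezl's actual code: the function name `suffix` and its signature are hypothetical. Alphabetic suffixes count in base 26, which is why a length of 2 allows 26 x 26 = 676 output files.

```go
package main

import "fmt"

// suffix returns the name suffix for the pezling at zero-based position index,
// using width characters. Alphabetic suffixes count in base 26 (aa, ab, ...),
// numeric suffixes in base 10 (00, 01, ...). It returns an error when the
// index does not fit in the requested width.
// Hypothetical helper for illustration; not taken from the pezl source.
func suffix(index, width int, numeric bool) (string, error) {
	base, first := 26, byte('a')
	if numeric {
		base, first = 10, byte('0')
	}
	if index < 0 || width < 1 {
		return "", fmt.Errorf("invalid index %d or width %d", index, width)
	}
	digits := make([]byte, width)
	n := index
	// Fill the digits from right to left, least significant first.
	for i := width - 1; i >= 0; i-- {
		digits[i] = first + byte(n%base)
		n /= base
	}
	if n != 0 {
		return "", fmt.Errorf("index %d does not fit in %d characters", index, width)
	}
	return string(digits), nil
}

func main() {
	for _, i := range []int{0, 1, 101, 675, 676} {
		s, err := suffix(i, 2, false)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("segment_" + s)
	}
}
```

With a suffix length of 2, index 676 no longer fits, so a larger -a value would be needed for more output files.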

pezl by bytes - for binary files

pezl gs://[bucket/key/to/very_large_file] gs://[bucket/key/to/pezling_file_prefix]

Example on n2-standard-16 (64GB memory):

./pezl -b 1048576000 -t 16 gs://pezl-testing/test/very_large_file gs://pezl-testing/test/segment
very_large_file is 99.1GB in size
pezling_file_prefix is 'segment'
each pezling will be roughly 1GB (1048576000 bytes)
16 pezl threads are used to avoid exhausting memory
pezl will produce 102 pezlings of roughly 1GB each, named segment_aa, segment_ab, ..., segment_dx, in roughly 3 minutes
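
The arithmetic behind this example can be sketched as follows. This is an illustration under stated assumptions, not pezl's implementation: `byteRange`, `planRanges`, and `processAll` are hypothetical names, the object size is an assumed byte count of roughly 99.1 GiB, and the per-range work is a stand-in for copying one byte range of the source object into its own pezling. The semaphore channel shows one common way to cap the number of goroutines, as -t does.

```go
package main

import (
	"fmt"
	"sync"
)

// byteRange is a half-open range [Start, End) within the source object.
type byteRange struct{ Start, End int64 }

// planRanges cuts an object of objectSize bytes into consecutive ranges of at
// most pezlingSize bytes. The last range may be shorter than the others.
func planRanges(objectSize, pezlingSize int64) []byteRange {
	if objectSize <= 0 || pezlingSize <= 0 {
		return nil
	}
	var ranges []byteRange
	for start := int64(0); start < objectSize; start += pezlingSize {
		end := start + pezlingSize
		if end > objectSize {
			end = objectSize
		}
		ranges = append(ranges, byteRange{start, end})
	}
	return ranges
}

// processAll calls work once per range, running at most limit calls at a time.
func processAll(ranges []byteRange, limit int, work func(i int, r byteRange)) {
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i, r := range ranges {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are busy
		go func(i int, r byteRange) {
			defer wg.Done()
			defer func() { <-sem }()
			work(i, r)
		}(i, r)
	}
	wg.Wait()
}

func main() {
	// Assumed object size of roughly 99.1 GiB; pezling size as in -b 1048576000.
	ranges := planRanges(106407814758, 1048576000)
	sizes := make([]int64, len(ranges))
	processAll(ranges, 16, func(i int, r byteRange) {
		// Stand-in for copying bytes [r.Start, r.End) into pezling number i.
		sizes[i] = r.End - r.Start
	})
	fmt.Println(len(ranges), "ranges; the last one holds", sizes[len(sizes)-1], "bytes")
}
```

Because each range is independent, the same plan could also be divided among several machines, each handling a subset of the ranges.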

pezl by line-bytes - for text files

./pezl -b 10485760 -l gs://pezl-testing/test/very_large_file.txt gs://pezl-testing/test/segment
very_large_file.txt is 41.1MB in size
pezling_file_prefix is 'segment'
each pezling will be approximately 10MB (10485760 bytes), ending on a line boundary
threads are not specified, so pezl uses only as many as needed, up to the default of 100
pezl will produce 4 files of approximately 10MB named segment_aa, ..., segment_ad and 1 file of approximately 1.1MB named segment_ae
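
The line-aligned behaviour of -l (ending each pezling on the first EOL after SIZE bytes) can be illustrated on an in-memory buffer. This is a minimal sketch, assuming the whole input fits in memory and that EOL means a single '\n'; the function name `splitAtLines` is hypothetical and how pezl locates line ends in GCS objects is not shown here.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitAtLines cuts data into chunks of at least size bytes, extending each
// chunk to the first newline found at or after the size boundary so that no
// line is split across chunks. The final chunk holds whatever remains.
// Hypothetical helper for illustration; not taken from the pezl source.
func splitAtLines(data []byte, size int) [][]byte {
	if size < 1 {
		size = 1
	}
	var chunks [][]byte
	for len(data) > 0 {
		if len(data) <= size {
			chunks = append(chunks, data)
			break
		}
		// Search from the last byte of the nominal chunk onwards.
		nl := bytes.IndexByte(data[size-1:], '\n')
		if nl < 0 {
			// No further newline: the rest is one final chunk.
			chunks = append(chunks, data)
			break
		}
		end := size - 1 + nl + 1 // include the newline itself
		chunks = append(chunks, data[:end])
		data = data[end:]
	}
	return chunks
}

func main() {
	text := []byte("aa\nbbbb\ncc\ndddd\n")
	for i, c := range splitAtLines(text, 4) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Chunks produced this way are usually slightly larger than the requested size, which is why the pezling sizes in the example above are approximate.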

unpezl - all pezlings with specified prefix concatenated into a single file

./pezl -u gs://pezl-testing/test/segment
segment is the prefix of the pezlings that are to be concatenated into a single file
pezl will produce a single file named 'segment', which is the result of concatenating all files under the same path that appear to be pezlings, e.g. segment_aa, segment_ab. Pezlings are concatenated in lexicographic order, i.e. .._aa, then .._ab, then .._ac.
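
The ordering rule can be sketched with in-memory data. This is an illustration only: the names `pezlingNames` and `unpezl` are hypothetical, a map stands in for the bucket, and the test for what "appears to be a pezling" (the prefix, an underscore, and a non-empty suffix) is an assumption based on the example names above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// pezlingNames returns the names that look like pezlings of prefix, meaning
// prefix + "_" + a non-empty suffix, sorted lexicographically.
// The matching rule is an assumption made for this sketch.
func pezlingNames(names []string, prefix string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, prefix+"_") && len(n) > len(prefix)+1 {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

// unpezl concatenates the contents of all matching pezlings in sorted order.
// The map stands in for a bucket: object name -> object contents.
func unpezl(objects map[string][]byte, prefix string) []byte {
	names := make([]string, 0, len(objects))
	for n := range objects {
		names = append(names, n)
	}
	var joined []byte
	for _, n := range pezlingNames(names, prefix) {
		joined = append(joined, objects[n]...)
	}
	return joined
}

func main() {
	objects := map[string][]byte{
		"segment_ab": []byte("two"),
		"segment_aa": []byte("one"),
		"segment_ac": []byte("three"),
		"other.txt":  []byte("ignored"),
	}
	fmt.Printf("%s\n", unpezl(objects, "segment"))
}
```

Sorting by name only restores the original byte order when every suffix has the same length, which holds for pezlings created with a single -a value.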

Authors and acknowledgment

Tim Meyer (baldtim@google.com) and John Stamper (clowndaddy@google.com)

License

Apache-2.0

Disclaimer

This is not an officially supported Google product.
