maxcountryman/warc-parquet

warc-parquet

πŸ—„οΈ A utility for converting WARC to Parquet.

📦 Install

The binary may be installed via cargo:

$ cargo install warc-parquet

To use the crate in your project, add the following to your Cargo.toml file:

[dependencies]
warc-parquet = "0.6.1"

🤸 Usage

The Binary

Once installed, the warc-parquet utility can be used to transform WARC into Parquet:

$ wget --warc-file example 'https://example.com'
$ cat example.warc.gz | warc-parquet --gzipped > example.zstd.parquet

warc-parquet is meant to fit organically into the UNIX ecosystem. As such, processing multiple WARCs at once is straightforward:

$ wget --warc-file github 'https://github.com'
$ cat example.warc.gz github.warc.gz | warc-parquet --gzipped > combined.zstd.parquet
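The concatenation above relies on a property of the gzip format itself: several gzip members written back to back still form a valid gzip stream. The sketch below illustrates that property with plain gzip only, using two throwaway files as stand-ins for example.warc.gz and github.warc.gz; it does not invoke warc-parquet, and how warc-parquet consumes such a stream internally is not shown here.

```shell
# Build two tiny gzip files, standing in for the two WARC archives.
tmp=$(mktemp -d)
printf 'first\n'  | gzip > "$tmp/a.gz"
printf 'second\n' | gzip > "$tmp/b.gz"

# Concatenated gzip members form a valid gzip stream; decompressing it
# returns the original payloads back to back.
combined=$(cat "$tmp/a.gz" "$tmp/b.gz" | gzip -d)
printf '%s\n' "$combined"

# Clean up the temporary files.
rm -rf "$tmp"
```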

It's also simple to preprocess input with standard UNIX pipes, for example by decompressing it before conversion:

$ cat example.warc.gz | gzip -d | warc-parquet > example.zstd.parquet

Various compression options, including the option to forgo compression altogether, are also available:

$ cat example.warc.gz | warc-parquet --gzipped --compression gzip > example.gz.parquet

💡 warc-parquet --help displays complete options and usage information.
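When one Parquet file per WARC is preferable to a single combined file, a small loop is one way to do it. This is a minimal sketch that uses only the --gzipped flag shown above; the parquet_name helper is hypothetical, and the .zstd.parquet suffix assumes the default output compression is zstd, as the file names in the earlier examples suggest.

```shell
# Hypothetical helper: derive the output name from a gzipped WARC file name,
# e.g. example.warc.gz -> example.zstd.parquet.
parquet_name() {
  printf '%s\n' "${1%.warc.gz}.zstd.parquet"
}

# Convert every gzipped WARC in the current directory, one output per input.
# The loop is skipped entirely when warc-parquet is not installed.
if command -v warc-parquet >/dev/null 2>&1; then
  for f in *.warc.gz; do
    [ -e "$f" ] || continue   # the glob matched nothing
    warc-parquet --gzipped < "$f" > "$(parquet_name "$f")"
  done
fi
```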

The Crate

Refer to the docs for more details about how to use the Reader within your own programs.

DuckDB

There are any number of ways to consume Parquet once you have it. However, a natural fit might be DuckDB:

$ duckdb
v0.3.3 fe9ba8003
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D select type, id from 'example.zstd.parquet';
┌──────────┬─────────────────────────────────────────────────┐
│   type   │                       id                        │
├──────────┼─────────────────────────────────────────────────┤
│ warcinfo │ <urn:uuid:A8063499-7675-4D8D-A736-A1D7DAE84C84> │
│ request  │ <urn:uuid:3EB20966-D74F-4949-AACB-23DB3A0733A7> │
│ response │ <urn:uuid:8B92CADC-F770-45BE-8B72-E13A61CD6D1C> │
│ metadata │ <urn:uuid:4C0E9E17-E21B-49E0-859A-D1016FBDE636> │
│ resource │ <urn:uuid:14F502A5-3BDE-4D0B-8A43-95F4BB8398C6> │
│ resource │ <urn:uuid:6B6D6ADD-52FF-4760-AA00-FB9E739CABBE> │
└──────────┴─────────────────────────────────────────────────┘

D describe select * from 'example.zstd.parquet';
┌─────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│       column_name       │ column_type │ null │ key │ default │ extra │
├─────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ id                      │ VARCHAR     │ YES  │     │         │       │
│ content_length          │ UINTEGER    │ YES  │     │         │       │
│ date                    │ TIMESTAMP   │ YES  │     │         │       │
│ type                    │ VARCHAR     │ YES  │     │         │       │
│ content_type            │ VARCHAR     │ YES  │     │         │       │
│ concurrent_to           │ VARCHAR     │ YES  │     │         │       │
│ block_digest            │ VARCHAR     │ YES  │     │         │       │
│ payload_digest          │ VARCHAR     │ YES  │     │         │       │
│ ip_address              │ VARCHAR     │ YES  │     │         │       │
│ refers_to               │ VARCHAR     │ YES  │     │         │       │
│ target_uri              │ VARCHAR     │ YES  │     │         │       │
│ truncated               │ VARCHAR     │ YES  │     │         │       │
│ warc_info_id            │ VARCHAR     │ YES  │     │         │       │
│ filename                │ VARCHAR     │ YES  │     │         │       │
│ profile                 │ VARCHAR     │ YES  │     │         │       │
│ identified_payload_type │ VARCHAR     │ YES  │     │         │       │
│ segment_number          │ UINTEGER    │ YES  │     │         │       │
│ segment_origin_id       │ VARCHAR     │ YES  │     │         │       │
│ segment_total_length    │ UINTEGER    │ YES  │     │         │       │
│ body                    │ BLOB        │ YES  │     │         │       │
└─────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
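The same kind of query can also be run non-interactively, which may be handy in scripts. The sketch below counts records per WARC record type by feeding SQL to the duckdb CLI on standard input. It assumes the duckdb binary is on the PATH and that example.zstd.parquet from the earlier example exists in the current directory; the query itself and the records alias are illustrative, using only the type column listed in the schema above.

```shell
# Count records per WARC record type in the file produced earlier.
query="select type, count(*) as records from 'example.zstd.parquet' group by type order by records desc;"

# Run only when the duckdb CLI and the Parquet file are both present.
if command -v duckdb >/dev/null 2>&1 && [ -f example.zstd.parquet ]; then
  printf '%s\n' "$query" | duckdb
fi
```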

🦺 Safety

This crate uses #![forbid(unsafe_code)] to ensure everything is implemented in 100% safe Rust.

👯 Contributing

We appreciate all kinds of contributions. Thank you!
