Skip to content

join CSV data on command line

License

Unlicense, MIT licenses found

Licenses found

Unlicense
UNLICENSE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

milancio42/rjoin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Rjoin

rjoin is a new command line utility for joining records of two files on common fields.

Dual-licensed under MIT or unlicense

Documentation

https://docs.rs/rjoin

Installation

The binary name for rjoin is rj.

$ cargo --version
cargo 0.25.0-nightly (a88fbace4 2017-12-29) # requires nightly channel
$ RUSTFLAGS="-C target-cpu=native" cargo install rjoin

(don't forget to add $HOME/.cargo/bin to your path).

Why should you use rjoin?

  • it can perform the join on multiple fields
  • it has higher flexibilty on specifying the field separators and record terminators compared to GNU join
  • it has (subjectively) cleaner CLI.

Why you should not use rjoin?

  • you need a specific output format. GNU join is more flexible on this, but it can be mitigated by piping the output to awk.
  • you need a case insensitive join. This can be mitigated by preprocessing data with tr utility.
  • you don't have an AVX2 capable CPU.

Quick Example

Let's suppose we have the following data:

$ cat left
color,blue
color,green
color,red
shape,circle
shape,square

$ cat right
altitude,low                                        
altitude,high                                       
color,orange                                          
color,purple                                          

To get the lines with the common key:

$ rj left right
color,blue,orange
color,blue,purple
color,green,orange
color,green,purple
color,red,orange
color,red,purple

Some comments:

  • by default, the first field is the key. If you wish to use another field, you can specify it using --key/-k option (even per file). rj supports multiple fields as the key, but the number of key fields in both files must be equal.
  • by default, only the lines with the common key are printed. If you wish to print also unmached lines from the left or right file, use any combination of these: --show-left/-l, --show-right/-r or --show-both/-b. Note however, if you use any of these options, the default behavior is reset (e.g. if you want to see the unmatched lines in the left file along with the matched lines, use -lb. With -l you will not see the matched lines.)
  • there are multiple lines with the same key in both files, resulting in Cartesian product.

To get the lines with the unmatched key in both files:

$ rj -lr left right
altitude,low                                        
altitude,high                                       
shape,circle
shape,square

Check the tutorial for the detailed walkthrough.

Contributing

Any kind of contribution (e.g. comment, suggestion, question, bug report and pull request) is welcome.

Why Rust?

Because C eats a bloody lot of mental resources only to avoid shooting my leg, or worse.

Acknowledgments

The CSV parser used in Rjoin is based on the work of Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. In VLDB, 2017.

The SIMD part was shamelessly copied from pikkr

And finally a big thanks to BurntSushi for his excellent work.

About

join CSV data on command line

Resources

License

Unlicense, MIT licenses found

Licenses found

Unlicense
UNLICENSE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages