tl;dr This command-line program compares text files and finds which rows are the same or differ, according to the set command you choose. It can handle big files, but everything is held in RAM. Link to webpage zetci with tutorials, examples, videos and more. Code: github
...............................................
&%%@( /@%%%. ..............%%%@/..........%@%%&..................
%%@ ,%%.............%%.........................@%%.............
%@ ......./%/.....,%/................................&%..........
&% ...............@%.%&......................................%%.......
&% ......................%@%.........................................&%.....
% ...........................(%...%#........................................,%....
/% .................................&&.....@%.........................................%&..
.% ........................................@&.......@&.........................................%*.
..%..........................................%.........%.........................................,%.
.%@.........................................&/.........&%.........................................@&
.&&.........................................%...........%.........................................#%
.&@.........................................%...........%.........................................%%
./%.........................................@@.........@&.........................................&%
%. ......%.........%.........................................(%.
% (% %# %
% /% %# %
%& % .% %%
% &%@ *%,
%& %# /% &%.
&%# %% &% &%&
%%& %%%. %%* %%%
#%%%&&@/,**/@&%%%& &%%%&@(**,/@@&%%%%
/$$$$$ /$$$$$$ /$$$$$$ /$$ /$$ /$$$$$$$$ /$$$$$$$$ /$$$$$$$$ /$$$$$$ /$$$$$$
|__ $$ /$$__ $$|_ $$_/| $$$ | $$ |_____ $$ | $$_____/|__ $$__/ /$$__ $$|_ $$_/
| $$| $$ \ $$ | $$ | $$$$| $$ /$$/ | $$ | $$ | $$ \__/ | $$
| $$| $$ | $$ | $$ | $$ $$ $$ /$$/ | $$$$$ | $$ | $$ | $$
/$$ | $$| $$ | $$ | $$ | $$ $$$$ /$$/ | $$__/ | $$ | $$ | $$
| $$ | $$| $$ | $$ | $$ | $$\ $$$ /$$/ | $$ | $$ | $$ $$ | $$
| $$$$$$/| $$$$$$/ /$$$$$$| $$ \ $$ /$$$$$$$$| $$$$$$$$ | $$ | $$$$$$/ /$$$$$$
\______/ \______/ |______/|__/ \__/ |________/|________/ |__/ \______/ |______/
** This is a work in progress (WIP); it is not done yet **
Download the latest release from releases and unzip it to a folder and add the folder to the path.
zetci union --files fileA.csv,fileB.csv,fileC.csv --output ~/temp/what-is-both.csv
zetci intersect --files fileA.csv,fileB.csv,fileC.csv --output ~/temp/whats-the-same.csv
zetci diffa --files fileA.csv,fileB.csv --output ~/temp/whats-diff.csv
zetci except --files fileA.csv,fileB.csv --output ~/temp/whats-except.csv
zetci xor --files fileA.csv,fileB.csv --output ~/temp/whats-only-in-one.csv
Use it when you have data in two or more files and want to extract parts of it, in a similar way to how you would from a SQL database. It could be only the unique rows across several files, the differences between two files, or what has been added to a file compared to a previous version.
You should be somewhat familiar with running programs from the command line. It also helps to have a basic knowledge of set theory. Users could be software developers, data analysts, data scientists, data engineers, or anyone who has to work with data in text files and either does not have access to a SQL database or is doing a one-time analysis.
You have two or more files with data and you want to extract parts of it. Some examples could be:
- You have a file with all your customers and a file with all your orders, and you want to see which customers have not made any orders.
- Someone sent you a file with all the customers that have made an order, and you want to see which customers are not in your system.
- Someone has sent you a file with all the members of a club, and you want to see which members have been added since last time.
- You have two files that each list the members of a club, and you want to see which members are in both files.
- You have two files that each list the members of a club, and you want to see which members are in one file but not in the other.
- You have a really big file and you want to see which rows are unique in that file.
- You want to import data from a file into a SQL database, but only the new rows that have been added since last time.
Absolutely, if you have a SQL database, use that. But if you do not have access to a database, or this is something you only have to do once, zetci could be a simpler solution. It can also be part of an ETL process if you only want to import new rows and you periodically get 50 GB data dumps with 10 million rows containing all the data from the source system. You could then run except between the previous data dump and the new one to get the precious 500 new rows that have been added since last time, and import only those. For this to work you need plenty of RAM, since zetci does everything in RAM.
zetci fileA.csv fileB.csv fileC.csv
This prints a union of all unique rows to standard output. If two rows are the
same, only one of them will be in the output. Union is the default subcommand.
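The union behaviour described above can be sketched in Rust as follows. This is a minimal illustration, not zetci's actual code: the function name `union_rows` is hypothetical, rows are assumed to be already read into memory as strings, and keeping first-seen order is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical helper: union of rows across files.
/// Every distinct row is kept once, in the order it was first seen.
fn union_rows(files: &[Vec<String>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for rows in files {
        for row in rows {
            // `insert` returns false when the row was already present.
            if seen.insert(row.as_str()) {
                out.push(row.clone());
            }
        }
    }
    out
}
```

Because every distinct row sits in the set until the end, memory use grows with the number of unique rows, which matches the note that everything happens in RAM.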
zetci intersect --files fileA.csv fileB.csv fileC.csv --output ~/temp/whats-the-same.csv
Only the rows that are in all of the files are written to the output text file.
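A rough Rust sketch of that intersection, assuming whole rows are compared as plain strings. The name `intersect_rows` is made up for illustration and is not taken from the zetci source.

```rust
use std::collections::HashSet;

/// Hypothetical helper: rows of the first file that also occur in every other file.
/// Each matching row is emitted once, in the order of the first file.
fn intersect_rows(files: &[Vec<String>]) -> Vec<String> {
    let Some((first, rest)) = files.split_first() else {
        return Vec::new();
    };
    // One lookup set per remaining file.
    let rest_sets: Vec<HashSet<&str>> = rest
        .iter()
        .map(|rows| rows.iter().map(String::as_str).collect())
        .collect();
    let mut emitted: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for row in first {
        let in_all = rest_sets.iter().all(|set| set.contains(row.as_str()));
        if in_all && emitted.insert(row.as_str()) {
            out.push(row.clone());
        }
    }
    out
}
```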
zetci intersect --files fileA.csv key=1,2,3 ft=csv fileB.csv key=7,8,9 ft=csv fileC.csv key=1,2,3 ft=csv --output ~/temp/whats-the-same.csv
Only the rows that are in all of the files are output. Here we also define which
fields hold the unique key for each row, and how the fields are defined and separated.
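To make the idea of a per-file key concrete, here is a small Rust sketch of pulling key fields out of one row. It assumes that `key=1,2,3` means 1-based column numbers and that fields are split naively on the separator (no quoted-field handling). Both are assumptions, and `row_key` is a hypothetical name.

```rust
/// Hypothetical helper: extract the key fields of one row.
/// `key_cols` are assumed to be 1-based column numbers.
/// Returns None if a requested column is 0 or missing from the row.
fn row_key(row: &str, sep: char, key_cols: &[usize]) -> Option<Vec<String>> {
    // Naive split: quoted separators inside fields are not handled here.
    let fields: Vec<&str> = row.split(sep).collect();
    key_cols
        .iter()
        .map(|&col| {
            let index = col.checked_sub(1)?;
            fields.get(index).map(|field| field.trim().to_string())
        })
        .collect()
}
```

With keys extracted this way, fileA.csv with key columns 1,2,3 and fileB.csv with key columns 7,8,9 can be compared on the key values even though the columns sit in different positions.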
zetci except --files fileA.csv --exceptfile fileB.csv --output ~/temp/whats-new.csv
This shows what is in the except file fileB.csv but not in any of the other
files, which in this case is just fileA.csv.
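The except operation as described here could look roughly like this in Rust. It is a sketch under the assumption that rows are compared as whole strings; `except_rows` is an illustrative name, not the tool's API.

```rust
use std::collections::HashSet;

/// Hypothetical helper: rows of the except file that occur in none of the other files.
/// Each row is emitted once, in the order of the except file.
fn except_rows(except_file: &[String], others: &[Vec<String>]) -> Vec<String> {
    // All rows from all the other files, flattened into one lookup set.
    let known: HashSet<&str> = others.iter().flatten().map(String::as_str).collect();
    let mut emitted: HashSet<&str> = HashSet::new();
    let mut out = Vec::new();
    for row in except_file {
        if !known.contains(row.as_str()) && emitted.insert(row.as_str()) {
            out.push(row.clone());
        }
    }
    out
}
```

This is the shape of the data dump use case: pass the new dump as the except file and the previous dump as the other file, and only the newly added rows come out.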
- files Sets the input file to use
- union Performs union operation on csv files
- intersect Performs intersection operation on csv files
- diffa Performs difference operation on csv files
- xor Performs symmetric difference operation on csv files
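For xor, a minimal Rust sketch for two files, assuming it means "rows that are in exactly one of the files". It uses the standard library's `HashSet::symmetric_difference`. The function name `xor_rows` and the sorted output are assumptions made for the example.

```rust
use std::collections::HashSet;

/// Hypothetical helper: rows that occur in exactly one of the two files.
/// The result is sorted because hash set iteration order is not stable.
fn xor_rows(a: &[String], b: &[String]) -> Vec<String> {
    let set_a: HashSet<&str> = a.iter().map(String::as_str).collect();
    let set_b: HashSet<&str> = b.iter().map(String::as_str).collect();
    let mut out: Vec<String> = set_a
        .symmetric_difference(&set_b)
        .map(|row| row.to_string())
        .collect();
    out.sort();
    out
}
```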
We want, for example, the following set operations:
- not A and B => DiffFile (see Explanation of expression, or more easily the Venn diagram)
- A and B => IntersectionFile (see Explanation of expression, or more easily the Venn diagram)
- Nota bene, combined: (not A and B) or (A and B) (see Explanation of expression, or more easily the Venn diagram), so DiffFile + IntersectionFile => FileB
Make an intersection between files A and B. The key is in columns 4, 6 and 7, the separator in CSV files A and B is a semicolon (;), and the output is verbose.
zetci -v -i -a"s:\Darkcompare\A_TestFile.cs" -b"s:\Darkcompare\B_TestFile.cs" -k4 6 7 -s;
Shown in a Venn diagram it would be:
In pseudo SQL it would be something like:
SELECT a.* from A_TestFile as a INNER JOIN B_TestFile as b on a.k4=b.k4 and a.k6=b.k6 and a.k7=b.k7
On the two sets A_TestFile and B_TestFile, defined by the key in columns 1 and 2, where the column separator is a semicolon (;), produce the sets DiffB, DiffA and the intersection of the two. Describe everything in a verbose style.
zetci -v -a".\A_TestFile.csv" -b".\B_TestFile.csv" -k1 2 -s; -r -d -i
In a Venn diagram it would be:
In pseudo SQL it would be something like:
DiffB
SELECT b.* from B_TestFile as b WHERE b.key_1_2 not in (Select a.key_1_2 from A_TestFile as a)
DiffA
SELECT a.* from A_TestFile as a WHERE a.key_1_2 not in (Select b.key_1_2 from B_TestFile as b)
Intersection
SELECT a.* from A_TestFile as a INNER JOIN B_TestFile as b ON (a.key_1_2 = b.key_1_2 )
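The three pseudo SQL queries above can be mirrored in Rust roughly as below. This is only a sketch: `key_of`, `Compared` and `compare` are hypothetical names, key columns are assumed to be 1-based, fields are split naively on the separator, and key parts are joined with the ASCII unit separator on the assumption that it never occurs in the data.

```rust
use std::collections::HashSet;

/// Hypothetical helper: build a composite key from 1-based columns.
/// Missing columns contribute an empty string.
fn key_of(row: &str, sep: char, cols: &[usize]) -> String {
    let fields: Vec<&str> = row.split(sep).collect();
    cols.iter()
        .map(|&c| c.checked_sub(1).and_then(|i| fields.get(i)).copied().unwrap_or(""))
        .collect::<Vec<&str>>()
        .join("\u{1f}")
}

/// The three result sets from the example: DiffA, DiffB and the intersection.
struct Compared {
    diff_a: Vec<String>,
    diff_b: Vec<String>,
    intersection: Vec<String>,
}

fn compare(a: &[String], b: &[String], sep: char, cols: &[usize]) -> Compared {
    let keys_a: HashSet<String> = a.iter().map(|r| key_of(r, sep, cols)).collect();
    let keys_b: HashSet<String> = b.iter().map(|r| key_of(r, sep, cols)).collect();
    Compared {
        // Rows of A whose key does not occur in B.
        diff_a: a.iter().filter(|r| !keys_b.contains(&key_of(r, sep, cols))).cloned().collect(),
        // Rows of B whose key does not occur in A.
        diff_b: b.iter().filter(|r| !keys_a.contains(&key_of(r, sep, cols))).cloned().collect(),
        // Rows of A whose key also occurs in B.
        intersection: a.iter().filter(|r| keys_b.contains(&key_of(r, sep, cols))).cloned().collect(),
    }
}
```

Unlike a real INNER JOIN, this sketch emits each row of A at most once even if the key appears several times in B.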
Use the issue tracker. Check whether the bug has already been reported; if it has not, create a new issue.
- What command and parameter you used
- What you expected to happen
- What happened
- Testdata if possible
- What operating system you are using
- What version of zetci you are using
- What command and parameter you used
- What you expected to happen
- What happened
- Testdata if possible
- What you want to do
- Why you want to do it
- How you think it should be done
Fast, simple, small footprint, easy to use, easy to understand.
Compete with sql databases.
Check out the right hash algorithm. The default one for HashMap is not great for small or large data.
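As a sketch of how a different hash algorithm could be plugged in, Rust's `HashSet` and `HashMap` accept a custom hasher through `BuildHasherDefault`. The example below hand-writes FNV-1a purely for illustration. It is not a recommendation, and which algorithm suits zetci's data is still the open question noted above. The names `Fnv1a`, `FnvSet` and `count_unique` are hypothetical.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Illustrative 64-bit FNV-1a hasher (an example choice, not the project's decision).
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// A HashSet that uses the hasher above instead of the standard default.
type FnvSet<T> = HashSet<T, BuildHasherDefault<Fnv1a>>;

/// Count distinct rows using the custom hasher.
fn count_unique(rows: &[&str]) -> usize {
    let mut set: FnvSet<&str> = FnvSet::default();
    for row in rows {
        set.insert(*row);
    }
    set.len()
}
```

Swapping the algorithm then only means changing the type alias, so candidates can be benchmarked against each other without touching the set operations.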
Clone the repository and make a pull request. The code and the tests are written in Rust. To clone the repository and run the tests, do the following:
git clone
cd zetci
cargo test
Make sure all tests pass before you make a pull request.
- What you have done
- Why you have done it
- How you have done it
- Example of how to use the new feature
- Test data if possible
- Banter and jokes are fine, but keep it zetci.
- Do not be a jerk.
- Do not make Pull requests that break the build.
- Do not make Pull requests that do not have tests.
- Do not make Pull requests that do not have documentation.
- Do not make Pull requests just to practise git.
- Do not make Pull requests just to practise Pull requests.
- If you are allowed to wear a gun in your country, you can wear a gun while making a pull request to this project. (That is if you are in your country while making the pull request)
- Other than that, go wild.
To test params:
cargo run -- --files ./testdata/fee.csv,./testdata/foo.csv
./target/debug/zetci union --files './testdata/fee.csv','./testdata/foo.csv'
// TODO: Add a function that takes an array of hashmaps with data from the files
// and performs the union operation on them. Move the function to a library later.
// TODO: The array passed to the function should perhaps be a struct with metadata
// about the data, like which one is the biggest, cardinality and perhaps others,
// so the rudimentary query optimizer gets relevant info.
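One possible shape for the struct mentioned in the TODO, sketched in Rust. Everything here is hypothetical: `LoadedFile`, `cardinality` and `intersect_smallest_first` are invented names, and sorting by size is just one example of how metadata could feed a rudimentary optimizer.

```rust
use std::collections::HashSet;

/// Hypothetical wrapper around the rows of one input file.
struct LoadedFile {
    rows: HashSet<String>,
}

impl LoadedFile {
    /// Number of distinct rows: the metadata a simple optimizer could use.
    fn cardinality(&self) -> usize {
        self.rows.len()
    }
}

/// Intersect starting from the smallest file, so the working set only shrinks.
fn intersect_smallest_first(mut files: Vec<LoadedFile>) -> HashSet<String> {
    files.sort_by_key(|file| file.cardinality());
    let mut iter = files.into_iter();
    let Some(first) = iter.next() else {
        return HashSet::new();
    };
    let mut acc = first.rows;
    for file in iter {
        // Keep only rows that are also present in this file.
        acc.retain(|row| file.rows.contains(row));
    }
    acc
}
```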