CSV exploration and manipulation tools.
An overview of the help screens follows. If you need an RFC-4180 compliant
alternative to the cut unix command, scroll down to the extract_fields utility.
The tools rely on the Text::CSV module, which can be installed from CPAN:

   perl -MCPAN -e "install Text::CSV"
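Text::CSV exists precisely because naive splitting breaks on quoted separators. As a language-neutral illustration (Python here, purely for demonstration; not part of this toolset), compare a plain split with an RFC-4180 aware parser on a value that contains the separator:

```python
import csv
import io

line = 'Jane,"Acme, Inc.",100\n'

# naive split: the comma inside the quoted field wrongly yields four fields
naive = line.rstrip("\n").split(",")

# an RFC-4180 aware parser keeps the quoted field intact
proper = next(csv.reader(io.StringIO(line)))

print(naive)   # ['Jane', '"Acme', ' Inc."', '100']
print(proper)  # ['Jane', 'Acme, Inc.', '100']
```

This is also why the tools below offer a --no-escape switch: when values are guaranteed not to contain the separator, the cheaper split is safe.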
$ ./aggregate --help
Usage:
   aggregate [options] < file.csv > aggregated_file.csv

   #
   # The examples below assume a test/test.csv file with the following content:
   # User,Browser,Date,Hour,DataTransfer
   # Jane,Firefox,2010-02-01,01:00,100
   # John,IE,2010-02-01,01:00,100
   # Jane,Firefox,2010-02-01,02:00,100
   # John,Firefox,2010-02-01,02:00,50
   # Kim,Chrome,2010-02-01,02:00,100
   #

   # Aggregate the DataTransfer fact in the input CSV by the Date and Hour
   # columns.
   # Expected output:
   # Date,Hour,DataTransfer,Count
   # 2010-02-01,01:00,200,2
   # 2010-02-01,02:00,250,3
   aggregate --by=Date,Hour --fact=DataTransfer < test/test.csv > transfer_by_time.csv

   # If no fact is declared, it is possible to group by any combination of
   # fields; only the Count aggregation is created.
   # This example aggregates only the count of lines in the input CSV by all
   # fields except Date and Hour.
   # Expected output:
   # User,Browser,DataTransfer,Count
   # Jane,Firefox,100,2
   # John,IE,100,1
   # John,Firefox,50,1
   # Kim,Chrome,100,1
   aggregate --by-all-except=Date,Hour < test/test.csv > aggregated.csv

Options:
   --help, -h            short help screen
   --man                 slightly more detailed documentation
   --input-file=path     input file (STDIN by default)
   --fact=list           list of fields to be treated as facts. The sum of each
                         of these fields is computed. If not provided, only the
                         count of rows is reported.
   --by=list             comma separated list of fields to aggregate by. Either
                         the --by or the --by-all-except option must be
                         specified.
   --by-all-except=list  facts are aggregated by all fields except those
                         specified via this or the --fact option
   --no-escape           signifies that values in the input CSV do not contain
                         the separator, so a simple 'split' can be used instead
                         of a complete CSV implementation
   --debug               prints some debug information and reports progress
                         every ten thousand processed input lines
   --in-separator=char   field separator in the source file ("," by default)
   --out-separator=char  output field separator ("," by default)
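For readers who want the group-by semantics spelled out, here is a rough Python sketch of the first example above (the aggregate function below is a hypothetical re-implementation for illustration, not the actual script):

```python
import csv
import io
from collections import OrderedDict

# the sample test/test.csv content from the help screen above
data = """User,Browser,Date,Hour,DataTransfer
Jane,Firefox,2010-02-01,01:00,100
John,IE,2010-02-01,01:00,100
Jane,Firefox,2010-02-01,02:00,100
John,Firefox,2010-02-01,02:00,50
Kim,Chrome,2010-02-01,02:00,100
"""

def aggregate(text, by, fact):
    """Group rows by the `by` columns, summing `fact` and counting rows."""
    groups = OrderedDict()  # preserves first-seen group order
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row[col] for col in by)
        total, count = groups.get(key, (0, 0))
        groups[key] = (total + int(row[fact]), count + 1)
    return [key + (total, count) for key, (total, count) in groups.items()]

print(aggregate(data, ["Date", "Hour"], "DataTransfer"))
# [('2010-02-01', '01:00', 200, 2), ('2010-02-01', '02:00', 250, 3)]
```

This matches the expected output shown in the help text: 200 bytes over 2 rows at 01:00, 250 over 3 rows at 02:00.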
$ ./check_mapping --help
Usage:
   check_mapping [options] pk_column_name referenced_csv_file \
                 fk_column_name processed_csv_file
   check_mapping [options] pk_column_name referenced_csv_file \
                 fk_column_name < processed_csv_file

Example:
   #
   # list 'Employee ID' values in salaries.csv that don't match a value of
   # 'ID' in employees.csv
   #
   check_mapping 'ID' employees.csv 'Employee ID' salaries.csv

Options:
   --help, -h            short help screen
   --man                 slightly more detailed documentation
   --in-separator=char   field separator in the source file ("," by default)
   --out-separator=char  output field separator ("," by default)
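Conceptually, check_mapping collects the primary-key column of the referenced file into a set and reports foreign-key values missing from it. A minimal Python sketch with made-up sample data (not the script itself):

```python
import csv
import io

# hypothetical sample data: employee 3 has a salary row but no employee row
employees = """ID,Name
1,Jane
2,John
"""
salaries = """Employee ID,Salary
1,1000
3,1200
"""

def unmatched_fks(referenced_text, pk_col, processed_text, fk_col):
    """Return foreign-key values with no matching primary key."""
    pks = {row[pk_col] for row in csv.DictReader(io.StringIO(referenced_text))}
    return [row[fk_col]
            for row in csv.DictReader(io.StringIO(processed_text))
            if row[fk_col] not in pks]

print(unmatched_fks(employees, "ID", salaries, "Employee ID"))  # ['3']
```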
$ ./clean_mapping --help
Usage:
   clean_mapping [options] pk_column_name referenced_csv_file \
                 fk_column_name processed_csv_file
   clean_mapping [options] pk_column_name referenced_csv_file \
                 fk_column_name < processed_csv_file

Example:
   #
   # filter out rows of salaries.csv that contain an 'Employee ID' value with
   # no matching value of 'ID' in employees.csv
   #
   clean_mapping 'ID' employees.csv 'Employee ID' salaries.csv

Options:
   --help, -h            short help screen
   --man                 slightly more detailed documentation
   --keep                keeps the violating lines but replaces offending
                         foreign keys with blanks
   --join=list           comma separated list of fields from the referenced CSV
                         file to be added to the output stream. An equivalent of
                         performing an inner join (or an outer join if the
                         --keep switch is used)
   --in-separator=char   field separator in the source file ("," by default)
   --out-separator=char  output field separator ("," by default)
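The difference between dropping violating rows and the --keep behaviour can be sketched as follows (an illustrative Python model with made-up data, not the script; the --join option is omitted):

```python
import csv
import io

# hypothetical sample data: salary row for employee 3 violates the mapping
employees = """ID,Name
1,Jane
2,John
"""
salaries = """Employee ID,Salary
1,1000
3,1200
"""

def clean_mapping(referenced_text, pk_col, processed_text, fk_col, keep=False):
    """Drop rows whose FK has no matching PK, or blank the FK if keep=True."""
    pks = {row[pk_col] for row in csv.DictReader(io.StringIO(referenced_text))}
    out = []
    for row in csv.DictReader(io.StringIO(processed_text)):
        if row[fk_col] in pks:
            out.append(dict(row))
        elif keep:
            # --keep: retain the line but blank the offending foreign key
            row[fk_col] = ""
            out.append(dict(row))
    return out

print(clean_mapping(employees, "ID", salaries, "Employee ID"))
print(clean_mapping(employees, "ID", salaries, "Employee ID", keep=True))
```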
$ ./extract_fields --help
$ ./list_headers --help
Usage:
   list_headers [options] file.csv
   list_headers [options] < file.csv

Options:
   --help, -h            short help screen
   --man                 slightly more detailed documentation
   --max-lenth           print the maximal length of each column
   --in-separator=char   field separator in the source file ("," by default)
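The maximal-length report can be approximated as below. Whether the real script counts the header row itself is not stated above, so this sketch (illustrative Python, not the script) measures data rows only:

```python
import csv
import io

# hypothetical sample input
data = """User,Browser
Jane,Firefox
Kim,Chrome
"""

rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]

# longest value observed in each column, keyed by header name
max_len = {col: max(len(row[i]) for row in body)
           for i, col in enumerate(header)}

print(max_len)  # {'User': 4, 'Browser': 7}
```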
$ ./reshuffle --help
Usage:
   reshuffle [options] file [field ...]

   # drop the second field and swap the third and fourth using field names
   reshuffle data.csv column_name1 column_name4 column_name3

   # the same using indexes; read pipe separated and dump hash separated fields
   reshuffle --numbers --in-separator='|' --out-separator='#' data.csv 0 3 2

   # insert 'N/A' before the first output field and after the sixth field
   reshuffle --insert-field=N/A --insert-before=0 --insert-after=5 data.csv

Options:
   --help, -h               short help screen
   --man                    slightly more detailed documentation
   --numbers                specify output fields by numeric indexes (starting
                            from 0) rather than by names
   --insert-field=string    string to be used by --insert-before and
                            --insert-after ("N/A" by default)
   --insert-after=indexes   comma separated list of indexes after which the
                            extra string should be inserted
   --insert-before=indexes  comma separated list of indexes before which the
                            extra string should be inserted
   --skip-first             drop the first line
   --in-separator=char      field separator in the source file ("," by default)
   --out-separator=char     output field separator ("," by default)

Arguments:
   First argument
      The following cases are handled specially for the sake of backward
      compatibility unless the --input-file option is provided:
      1) If a file of that name exists, the argument is treated as the input
         file name.
      2) If equal to '-', the input is expected on STDIN.
   Following arguments
      Identify the fields present in the output stream. By default, field names
      as defined in the CSV header are expected. If the --numbers option is
      used, numeric indexes are expected instead. Specifying no fields is a
      shortcut for enumerating all fields.
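The core of reshuffle is resolving the requested field names to header indexes and re-emitting each row in that order. A simplified Python sketch (hypothetical; the insert/skip options are omitted):

```python
import csv
import io

# hypothetical four-column input
data = """A,B,C,D
a1,b1,c1,d1
"""

def reshuffle(text, fields):
    """Output only the named fields, in the given order."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    idx = [header.index(f) for f in fields]  # name -> position lookup
    out = [list(fields)]
    for row in rows:
        out.append([row[i] for i in idx])
    return out

# keep the first field, then the fourth, then the third (drops B)
print(reshuffle(data, ["A", "D", "C"]))
# [['A', 'D', 'C'], ['a1', 'd1', 'c1']]
```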
$ ./surrogate --help
Usage:
   surrogate [options] lookups

   # Generate keys for the first two fields and store [generated key, original
   # value] pairs in the out/dim_name.csv and out/dim_industry.csv lookup
   # files. All fields except the first two are dumped unchanged.
   surrogate --input-file=data.csv \
             --output-dir=out dim_name dim_industry \
             > out/data.csv

   # copy previously generated lookup files into an extra folder
   cp out/dim_*.csv lookups_cache/

   # Preload lookups from the lookups_cache folder before processing. Don't
   # generate keys for values that have their keys already stored in the
   # corresponding files within the lookups_cache/ folder.
   surrogate --input-file=data.csv --input-lookups-dir=lookups_cache \
             --output-dir=out dim_name dim_industry \
             > out/data.csv

   # The same as above except that the processing exits if an unresolved value
   # is found in the second field
   surrogate --input-file=data.csv --input-lookups-dir=lookups_cache \
             --read-only=dim_industry \
             --output-dir=out dim_name dim_industry \
             > out/data.csv

   # Process a file with hierarchical attributes (a GEO dimension is used in
   # this example). The --attr-group option tells the script to distinguish
   # e.g. Albany, NY, USA from Albany, OR, USA or Albany, WA, Australia.
   surrogate --input-file=data.csv --input-lookups-dir=my_geo_dir --output-dir=out \
             --attr-group=dim_city,dim_state,dim_country \
             dim_city dim_state dim_country > out/data.csv

Options:
   --help, -h               short help screen
   --man                    slightly more detailed documentation
   --ignore-dups            by default, a column marked as a primary key using
                            the --primary-key option is required to hold unique
                            values. The --ignore-dups switch removes this
                            constraint.
   --input-file=path        input file with values to be replaced with keys
                            (STDIN is used by default)
   --input-lookups-dir=path folder containing already existing lookup files
                            (i.e. files holding key-value pairs)
   --output-dir=path        folder where the key-value pairs accumulated during
                            processing will be dumped. Each file's name is
                            based on the given lookup name (see ARGUMENTS
                            below) with the 'csv' suffix. If some lookups have
                            been preloaded from the folder specified via
                            --input-lookups-dir, the dumped lookup files
                            contain the preloaded key-value pairs as well as
                            those generated during this processing.
   --primary-key=name       the field marked as 'primary-key' is not replaced;
                            the generated key is inserted before the primary
                            key instead. Only the first field may be marked as
                            the primary key.
   --primary-key-name=name  the column name of the original primary key
                            ('Original Id' by default). This option is ignored
                            unless --primary-key is specified.
   --skip-first             silently discard the first row
   --keep-header            keep the fields of the first row that correspond
                            to fields to be dumped
   --read-only=list         comma separated list of fields that are not
                            expected to contain values other than those
                            preloaded from --input-lookups-dir. The processing
                            fails if such an unknown value occurs.
   --rename=list            comma separated list of colon separated pairs of
                            old and new column names. Defines how the original
                            names should be renamed in the output file.
   --skip-new               skip records with unresolved lookups (TODO: doesn't
                            work properly yet)
   --default=pairs          list of lookup-specific default values; for a new
                            value, the key corresponding to the given default
                            value is used rather than generating a new key.
                            Useful for garbage workarounds.
                            Example: --default=company:1,industry:1
   --debug                  print some possibly interesting processing info
   --attr-group=list        values of the first field of the group list are
                            replaced with keys assigned to whole value groups
                            rather than to the first field's value only. Useful
                            for processing hierarchical attributes.
   --in-separator=char      field separator in the source file ("," by default)
   --out-separator=char     output field separator ("," by default)

Arguments:
   The script arguments represent lookup files corresponding to each column of
   the input file. The number of arguments must be less than or equal to the
   number of columns.

   Special lookup names:
   - prefixed with 'UNUSED' - these columns are discarded silently
   - prefixed with 'KEEP'   - the original value is printed to the output
                              without any processing