Skip to content

@johnkerl johnkerl released this Sep 22, 2019 · 16 commits to master since this release

Bug fixes:

  • #271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.

Documentation:

Assets 7

@johnkerl johnkerl released this Sep 17, 2019 · 40 commits to master since this release

The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly.

All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0

Before

Before

After

After

Assets 2

Features:

  • The new system DSL function allows you to run arbitrary shell commands and store them in field values. Some example usages are documented here. This is in response to issues #246 and #209.

  • There is now support for ASV and USV file formats. This is in response to issue #245.

  • The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.

Documentation:

Bugfixes:

Note:

Thanks to @aborruso @davidselassie @joelparkerhenderson for the bug reports and feature requests!! :)

Assets 7

@johnkerl johnkerl released this Sep 1, 2019 · 95 commits to master since this release

Features:

  • The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value by $[[[3]]]. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields progrmatically within the DSL.

  • There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.

  • There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.

  • Another pair is #181 and #256. While Miller uses mmap internally (and invisibily) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues with reading either large files, or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.

  • There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.

  • There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.

  • There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.

  • There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.

Documentation:

Bugfixes:

  • There was a SEGV using nest within then-chains, fixed in response to #220.

  • Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.

An extra thank-you:

I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (#212). Mille grazie al mio amico @aborruso!

Assets 7

Features:

Builds:

Documentation:

Bugfixes:

  • There was a memory leak for TSV-format files only as reported by @treynr on #181.

  • Dollar sign in regular expressions were not being escaped properly as reported by @dohse on #171.

Assets 8
Jan 6, 2018
restore autoreconf -fiv to travis build; aclocal version 13/15 conflicts

Features:

  • Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment-string X. Comments are only supported at start of data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.

  • The count-similar verb lets you compute cluster sizes by cluster labels.

  • While Miller DSL arithmetic gracefully overflows from 64-integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.

  • There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.

  • Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.

  • The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of us to justify calling this release Miller 6.0.0.

Documentation:

  • As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information the interplay between Miller and SQL.

  • Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.

  • To my surprise, csvlite format options weren’t listed in mlr --help or the manpage. This has been fixed.

  • Documentation for auxiliary commands has been expanded, including within the manpage.

Bugfixes:

  • Issue 159 fixes regex-match of literal dot.

  • Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)

  • Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.)

  • Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.

  • The Miller JSON parser used to error with Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value on empty input, or input with trailing whitespace; this has been fixed.

There is no prebuilt Windows executable for this release; my apologies.

Assets 5

@johnkerl johnkerl released this Jul 20, 2017 · 359 commits to master since this release

This bugfix release delivers a fix for #147 where a memory allocation failed beyond 4GB.

Documents are the same as for 5.2.0.

Assets 5

@johnkerl johnkerl released this Jun 20, 2017 · 419 commits to master since this release

This bugfix release addresses #142.

I'm not attaching prebuilt binaries beyond those already in https://github.com/johnkerl/miller/releases/tag/v5.2.0 since the binaries there are fine for their respective architectures.

This unblocks Miller on openSUSE.

Assets 2

This release contains mostly feature requests.

Features:

  • The stats1 verb now lets you use regular expressions to specify which field names to compute statistics on, and/or which to group by. Full details are here.

  • The min and max DSL functions, and the min/max/percentile aggregators for the stats1 and merge-fields verbs, now support numeric as well as string field values. (For mixed string/numeric fields, numbers compare before strings.) This means in particular that order statistics -- min, max, and non-interpolated percentiles -- as well as mode, antimode, and count are now possible on string-only (or mixed) fields. (Of course, any operations requiring arithmetic on values, such as computing sums, averages, or interpolated percentiles, yield an error on string-valued input.)

  • There is a new DSL function mapexcept which returns a copy of the argument with specified key(s), if any, unset. The motivating use-case is to split records to multiple filenames depending on particular field value, which is omitted from the output: mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")' Likewise, mapselect returns a copy of the argument with only specified key(s), if any, set. This resolves #137.

  • A new -u option for count-distinct allows unlashed counts for multiple field names. For example, with -f a,b and without -u, count-distinct computes counts for distinct pairs of a and b field values. With -f a,b and with -u, it computes counts for distinct a field values and counts for distinct b field values separately.

  • If you build from source, you can now do ./configure without first doing autoreconf -fiv. This resolves #131.

  • The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically ignored from the start of CSV files. (The same is already done for JSON files.) This resolves #138.

  • For put and filter with -S, program literals such as the 6 in $x = 6 were being parsed as strings. This is not sensible, since the -S option for put and filter is intended to suppress numeric conversion of record data, not program literals. To get string 6 one may use $x = "6".

Documentation:

Bugfixes:

  • CRLF line-endings were not being correctly autodetected when I/O formats were specified using --c2j et al.

  • Integer division by zero was causing a fatal runtime exception, rather than computing inf or nan as in the floating-point case.

Binaries:

As below. Additionally, the MacOSX version is available in Homebrew. For Windows, you need the .exe file along with both .dll files, with instructions as in https://github.com/johnkerl/miller/releases/tag/v5.1.0w.

Assets 9
You can’t perform that action at this time.