johnkerl
released this
Bug fixes:
- #271 fixes a corner-case bug with more than 100 CSV/TSV files with headers of varying lengths.
Documentation:
- The new http://johnkerl.org/miller/doc/whyc-details.html elaborates on http://johnkerl.org/miller/doc/whyc.html. It answers a question posed by @BurntSushi on Reddit a couple of years ago, which I did not address in detail at the time.
johnkerl
released this
The only change is that http://johnkerl.org/miller/doc is now more mobile-friendly.
All build artifacts are the same as at https://github.com/johnkerl/miller/releases/tag/v5.6.0
johnkerl
released this
Features:
- The new system DSL function allows you to run arbitrary shell commands and store their output in field values. Some example usages are documented here. This is in response to issues #246 and #209.
- There is now support for ASV and USV file formats. This is in response to issue #245.
- The new format-values verb allows you to apply numerical formatting across all record values. This is in response to issue #252.
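For readers unfamiliar with ASV and USV, the following is a minimal Python sketch of the idea, not Miller's implementation. It assumes the conventional separators, which are not spelled out in these notes: the ASCII unit separator 0x1f between fields and record separator 0x1e between records for ASV, and the Unicode symbols U+241F and U+241E for USV. The function name split_records is hypothetical.

```python
# Assumed separator values; these release notes do not state them.
ASV_FS, ASV_RS = "\x1f", "\x1e"      # ASCII unit / record separators
USV_FS, USV_RS = "\u241f", "\u241e"  # Unicode symbols for the same roles

def split_records(text, fs, rs):
    """Split text into records on rs, then split each record into fields on fs."""
    records = [r for r in text.split(rs) if r != ""]
    return [r.split(fs) for r in records]

asv_text = "a\x1fb\x1fc\x1e1\x1f2\x1f3\x1e"
usv_text = "a\u241fb\u241e1\u241f2\u241e"

asv_rows = split_records(asv_text, ASV_FS, ASV_RS)
usv_rows = split_records(usv_text, USV_FS, USV_RS)
print(asv_rows)
print(usv_rows)
```

Because the separators are characters that rarely occur in real data, fields may contain commas, tabs, or newlines without any quoting rules.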
Documentation:
- The new DKVP I/O in Python sample code now works for Python 2 as well as Python 3.
- There is a new cookbook entry on doing multiple joins. This is in response to issue #235.
Bugfixes:
- The toupper, tolower, and capitalize DSL functions are now UTF-8 aware, thanks to @sheredom's marvelous https://github.com/sheredom/utf8.h. The internationalization page has also been expanded. This is in response to issue #254.
- #250 fixes a bug using in-place mode in conjunction with verbs (such as rename or sort) which take field-name lists as arguments.
- #253 fixes a bug in the label verb when one or more names are common between old and new.
- #251 fixes a corner-case bug when (a) input is CSV; (b) the last field ends with a comma and no newline; (c) input is from standard input and/or --no-mmap is supplied.
Note:
Thanks to @aborruso @davidselassie @joelparkerhenderson for the bug reports and feature requests!! :)
johnkerl
released this
Features:
- The new positional-indexing feature resolves #236 from @aborruso. You can now get the name of the 3rd field of each record via $[[3]], and its value by $[[[3]]]. These are both usable on either the left-hand or right-hand side of assignment statements, so you can more easily do things like renaming fields programmatically within the DSL.
- There is a new capitalize DSL function, complementing the already-existing toupper. This stems from #236.
- There is a new skip-trivial-records verb, resolving #197. Similarly, there is a new remove-empty-columns verb, resolving #206. Both are useful for data-cleaning use-cases.
- Another pair is #181 and #256. While Miller uses mmap internally (and invisibly) to get approximately a 20% performance boost over not using it, this can cause out-of-memory issues when reading either large files or too many small ones. Now, Miller automatically avoids mmap in these cases. You can still use --mmap or --no-mmap if you want manual control of this.
- There is a new --ivar option for the nest verb which complements the already-existing --evar. This is from #260 thanks to @jgreely.
- There is a new keystroke-saving urandrange DSL function: urandrange(low, high) is the same as low + (high - low) * urand(). This arose from #243.
- There is a new -v option for the cat verb which writes a low-level record-structure dump to standard error.
- There is a new -N option for mlr which is a keystroke-saver for --implicit-csv-header --headerless-csv-output.
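The urandrange identity quoted above can be written out as a small Python sketch. This only illustrates the formula and is not Miller's code; the urand parameter is a hypothetical hook added here so the behavior can be checked deterministically.

```python
import random

def urandrange(low, high, urand=random.random):
    """Return low + (high - low) * urand(), where urand() is uniform on [0, 1)."""
    return low + (high - low) * urand()

# With a fixed "random" value of 0.5 the result is the midpoint of the range.
midpoint = urandrange(10, 20, urand=lambda: 0.5)
print(midpoint)

# With the real generator the result stays within [low, high).
sample = urandrange(10, 20)
print(10 <= sample < 20)
```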
Documentation:
- The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_to_escape_'%3F'_in_regexes%3F resolves #203.
- The new FAQ entry http://johnkerl.org/miller/doc/faq.html#How_can_I_filter_by_date%3F resolves #208.
- #244 fixes a documentation issue while highlighting the need for #241.
Bugfixes:
- There was a SEGV using nest within then-chains, fixed in response to #220.
- Quotes and backslashes weren't being escaped in JSON output with --jvquoteall; reported on #222.
An extra thank-you:
I've never code-named releases but if I were to code-name 5.5.0 I would call it "aborruso". Andrea has contributed many fantastic feature requests, as well as driving a huge volume of Miller-related discussions in StackExchange (#212). Mille grazie al mio amico @aborruso!
johnkerl
released this
Features:
- The new clean-whitespace verb resolves #190 from @aborruso. Along with the new functions strip, lstrip, rstrip, collapse_whitespace, and clean_whitespace, there is now both coarse-grained and fine-grained control over whitespace within field names and/or values. See the linked-to documentation for examples.
- The new altkv verb resolves #184 which was originally opened via an email request. This supports mapping value-lists such as a,b,c,d to alternating key-value pairs such as a=b,c=d.
- The new fill-down verb resolves #189 by @aborruso. See the linked-to documentation for examples.
- The uniq verb now has a uniq -a which resolves #168 from @sjackman.
- The new regextract and regextract_or_else functions resolve #183 by @aborruso.
- The new ssub function arises from #171 by @dohse, as a simplified way to avoid escaping characters which are special to regular-expression parsers.
- There are new localtime functions in response to #170 by @sitaramc. However note that as discussed on #170 these do not undo one another in all circumstances. This is a non-issue for timezones which do not do DST. Otherwise, please use with disclaimers: localdate, localtime2sec, sec2localdate, sec2localtime, strftime_local, and strptime_local.
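The altkv mapping described above (a,b,c,d becoming a=b,c=d) can be sketched in Python as follows. This is an illustration of the pairing rule only, not the verb's implementation. It handles even-length lists; how the real verb treats an odd-length list is not covered by these notes, so this sketch simply rejects that case.

```python
def altkv(values):
    """Pair up alternating items: [k1, v1, k2, v2, ...] -> {k1: v1, k2: v2, ...}."""
    if len(values) % 2 != 0:
        # Assumption of this sketch: odd-length input is out of scope.
        raise ValueError("this sketch only handles an even number of values")
    # Even positions are keys, odd positions are values.
    return dict(zip(values[0::2], values[1::2]))

pairs = altkv("a,b,c,d".split(","))
print(",".join(f"{k}={v}" for k, v in pairs.items()))
```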
Builds:
- Windows build-artifacts are now available in Appveyor at https://ci.appveyor.com/project/johnkerl/miller/build/artifacts, and will be attached to this and future releases. This resolves #167, #148, and #109.
- Travis builds at https://travis-ci.org/johnkerl/miller/builds now run on OSX as well as Linux.
Documentation:
- put/filter documentation was confusing, as reported by @NikosAlexandris on #169.
- The new FAQ entry http://johnkerl.org/miller-releases/miller-head/doc/faq.html#How_to_rectangularize_after_joins_with_unpaired? resolves #193 by @aborruso.
- The new cookbook entry http://johnkerl.org/miller/doc/cookbook.html#Options_for_dealing_with_duplicate_rows arises from #168 from @sjackman.
- The unsparsify documentation had some words missing, as reported by @tst2005 on #194.
- There was a typo in the cookbook page http://johnkerl.org/miller/doc/cookbook.html#Full_field_renames_and_reassigns as fixed by @tst2005 in #192.
Bugfixes:
johnkerl
released this
Features:
- Comment strings in data files: mlr --skip-comments allows you to filter out input lines starting with #, for all file formats. Likewise, mlr --skip-comments-with X lets you specify the comment-string X. Comments are only supported at start of data line. mlr --pass-comments and mlr --pass-comments-with X allow you to forward comments to program output as they are read.
- The count-similar verb lets you compute cluster sizes by cluster labels.
- While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also here), there are now the integer-preserving arithmetic operators .+ .- .* ./ .// for those times when you want integer overflow.
- There is a new bitcount function: for example, echo x=0xf0000206 | mlr put '$y=bitcount($x)' produces x=0xf0000206,y=7.
- Issue 158: mlr -T is an alias for --nidx --fs tab, and mlr -t is an alias for mlr --tsvlite.
- The mathematical constants π and e have been renamed from PI and E to M_PI and M_E, respectively. (It's annoying to get a syntax error when you try to define a variable named E in the DSL, when A through D work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
Documentation:
- As noted here, while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page Sharing data with other languages shows how to seamlessly share data back and forth between Miller, Ruby, and Python. SQL-input examples and SQL-output examples contain detailed information on the interplay between Miller and SQL.
- Issue 150 raised a question about suppressing numeric conversion. This resulted in a new FAQ entry How do I suppress numeric conversion?, as well as the longer-term follow-on issue 151 which will make numeric conversion happen on a just-in-time basis.
- To my surprise, csvlite format options weren’t listed in mlr --help or the manpage. This has been fixed.
- Documentation for auxiliary commands has been expanded, including within the manpage.
Bugfixes:
- Issue 159 fixes regex-match of literal dot.
- Issue 160 fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using mmap) over stdio since mmap is fractionally faster. Yet as any processing (even mlr cat) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with madvise.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer stdio over mmap for files over 4GB in size. (This 4GB threshold is tunable via the --mmap-below flag as described in the manpage.)
- Issue 161 fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence 0xef 0xbb 0xbf and the header line has double-quoted fields. (Release 5.2.0 introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.)
- Issue 162 fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.
- The Miller JSON parser used to error with "Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value" on empty input, or input with trailing whitespace; this has been fixed.
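The size-based choice described under Issue 160 can be sketched in Python as below. This is only a model of the decision rule, not Miller's C code. The function name choose_reader is hypothetical, and reading "4GB" as 4 * 1024**3 bytes, as well as sending a file of exactly that size to stdio, are assumptions of this sketch.

```python
# Assumed interpretation of the "4GB" default threshold.
DEFAULT_MMAP_BELOW = 4 * 1024**3

def choose_reader(file_size_bytes, mmap_below=DEFAULT_MMAP_BELOW):
    """Return "mmap" for files smaller than the threshold, otherwise "stdio"."""
    return "mmap" if file_size_bytes < mmap_below else "stdio"

small = choose_reader(10 * 1024**2)   # a 10 MiB file
huge = choose_reader(6 * 1024**3)     # a 6 GiB file
tuned = choose_reader(10 * 1024**2, mmap_below=1024**2)  # lowered threshold
print(small, huge, tuned)
```

The trade-off is the one given above: mmap is slightly faster, but stdio keeps memory use bounded on very large inputs.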
There is no prebuilt Windows executable for this release; my apologies.
johnkerl
released this
This bugfix release delivers a fix for #147 where a memory allocation failed beyond 4GB.
Documents are the same as for 5.2.0.
johnkerl
released this
This bugfix release addresses #142.
I'm not attaching prebuilt binaries beyond those already in https://github.com/johnkerl/miller/releases/tag/v5.2.0 since the binaries there are fine for their respective architectures.
This unblocks Miller on openSUSE.
johnkerl
released this
This release contains mostly feature requests.
Features:
- The stats1 verb now lets you use regular expressions to specify which field names to compute statistics on, and/or which to group by. Full details are here.
- The min and max DSL functions, and the min/max/percentile aggregators for the stats1 and merge-fields verbs, now support numeric as well as string field values. (For mixed string/numeric fields, numbers compare before strings.) This means in particular that order statistics -- min, max, and non-interpolated percentiles -- as well as mode, antimode, and count are now possible on string-only (or mixed) fields. (Of course, any operations requiring arithmetic on values, such as computing sums, averages, or interpolated percentiles, yield an error on string-valued input.)
- There is a new DSL function mapexcept which returns a copy of the argument with specified key(s), if any, unset. The motivating use-case is to split records to multiple filenames depending on particular field value, which is omitted from the output:
  mlr --from f.dat put 'tee > "/tmp/data-".$a, mapexcept($*, "a")'
  Likewise, mapselect returns a copy of the argument with only specified key(s), if any, set. This resolves #137.
- A new -u option for count-distinct allows unlashed counts for multiple field names. For example, with -f a,b and without -u, count-distinct computes counts for distinct pairs of a and b field values. With -f a,b and with -u, it computes counts for distinct a field values and counts for distinct b field values separately.
- If you build from source, you can now do ./configure without first doing autoreconf -fiv. This resolves #131.
- The UTF-8 BOM sequence 0xef 0xbb 0xbf is now automatically ignored from the start of CSV files. (The same is already done for JSON files.) This resolves #138.
- For put and filter with -S, program literals such as the 6 in $x = 6 were being parsed as strings. This is not sensible, since the -S option for put and filter is intended to suppress numeric conversion of record data, not program literals. To get string 6 one may use $x = "6".
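The difference between lashed and unlashed counting for count-distinct, described in the list above, can be illustrated with a short Python sketch. This models the counting semantics only and is not Miller's implementation; the function name, the sample records, and the choice to skip records lacking a requested field are assumptions made here for illustration.

```python
from collections import Counter

def count_distinct(records, fields, unlashed=False):
    """Count distinct value tuples (lashed) or per-field values (unlashed)."""
    if unlashed:
        # One independent counter per field name.
        return {f: Counter(r[f] for r in records if f in r) for f in fields}
    # One counter keyed by the tuple of all requested field values.
    return Counter(
        tuple(r[f] for f in fields)
        for r in records
        if all(f in r for f in fields)
    )

records = [
    {"a": 1, "b": 2},
    {"a": 1, "b": 3},
    {"a": 4, "b": 2},
]

lashed = count_distinct(records, ["a", "b"])
unlashed = count_distinct(records, ["a", "b"], unlashed=True)
print(dict(lashed))
print({f: dict(c) for f, c in unlashed.items()})
```

In the lashed case each (a, b) pair here is unique, while the unlashed case shows that a=1 and b=2 each occur twice.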
Documentation:
- A new cookbook example shows how to compute differences between successive queries, e.g. to find out what changed in time-varying data when you run and rerun a SQL query.
- Another new cookbook example shows how to compute interquartile ranges.
- A third new cookbook example shows how to compute weighted means.
Bugfixes:
- CRLF line-endings were not being correctly autodetected when I/O formats were specified using --c2j et al.
- Integer division by zero was causing a fatal runtime exception, rather than computing inf or nan as in the floating-point case.
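As a rough illustration of the division-by-zero fix, the Python sketch below returns inf or nan instead of raising an exception. It follows the usual IEEE 754 floating-point convention (nonzero over zero gives a signed infinity, zero over zero gives nan); whether Miller matches this in every case is an assumption, and the function name safe_divide is hypothetical.

```python
import math

def safe_divide(x, y):
    """Divide x by y, yielding inf or nan on a zero divisor instead of raising."""
    if y == 0:
        if x == 0:
            return math.nan
        return math.inf if x > 0 else -math.inf
    return x / y

print(safe_divide(6, 3))
print(safe_divide(6, 0))
print(safe_divide(-6, 0))
print(safe_divide(0, 0))
```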
Binaries:
As below. Additionally, the MacOSX version is available in Homebrew. For Windows, you need the .exe file along with both .dll files, with instructions as in https://github.com/johnkerl/miller/releases/tag/v5.1.0w.