Commit

mlr uniq -a
johnkerl committed Feb 6, 2018
1 parent 301db6a commit 6942f63
Showing 6 changed files with 65 additions and 44 deletions.
32 changes: 1 addition & 31 deletions c/draft-release-notes.md
@@ -1,37 +1,7 @@
## Features:

* [**Comment strings in data files:**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/file-formats.html#Comments_in_data) `mlr --skip-comments` allows you to filter out input lines starting with `#`, for all file formats. Likewise, `mlr --skip-comments-with X` lets you specify the comment-string `X`. Comments are supported only at the start of a data line. `mlr --pass-comments` and `mlr --pass-comments-with X` allow you to forward comments to program output as they are read.

* The [**count-similar**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-verbs.html#count-similar) verb lets you compute cluster sizes by cluster labels.

* While Miller DSL arithmetic gracefully overflows from 64-bit integer to double-precision float (see also [**here**](http://johnkerl.org/miller/doc/reference.html#Arithmetic)), there are now the **integer-preserving arithmetic operators** [**`.+`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.+) [**`.-`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.-) [**`.*`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.*) [**`./`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#./) [**`.//`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.//) for those times when you want integer overflow.

* There is a new [**bitcount**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#bitcount) function: for example, `echo x=0xf0000206 | mlr put '$y=bitcount($x)'` produces `x=0xf0000206,y=7`.

* [**Issue 158**](https://github.com/johnkerl/miller/issues/158): `mlr -T` is an alias for `--nidx --fs tab`, and `mlr -t` is an alias for `mlr --tsvlite`.

* The mathematical constants **&pi; and <i>e</i> have been renamed from `PI` and `E` to `M_PI` and `M_E`, respectively**. (It's annoying to get a syntax error when you try to define a variable named `E` in the DSL, when `A` through `D` work just fine.) This is a backward incompatibility, but not enough of one to justify calling this release Miller 6.0.0.
* xxx mlr uniq -a, w/ issue number
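
The `bitcount` example in the list above (`0xf0000206` has 7 set bits) can be checked with a small standalone sketch. This is not Miller's implementation; the helper name `bitcount64` is hypothetical and the loop is the standard clear-lowest-set-bit technique.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical helper, not taken from the Miller source.
// Counts set bits by clearing the lowest set bit on each pass.
static int bitcount64(uint64_t x) {
	int n = 0;
	while (x != 0) {
		x &= x - 1; // clears the lowest set bit
		n++;
	}
	assert(n <= 64); // a 64-bit value cannot have more than 64 set bits
	return n;
}
```

Calling `bitcount64(0xf0000206ULL)` gives 7: four bits from `f`, one from `2`, and two from `6`.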

## Documentation:

* As noted [**here**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#A_note_on_the_complexity_of_Miller’s_expression_language), while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page [**Sharing data with other languages**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/data-sharing.html) shows how to seamlessly share data back and forth between **Miller, Ruby, and Python**. [**SQL-input examples**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/10-min.html#SQL-input_examples) and [**SQL-output examples**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/10-min.html#SQL-output_examples) contain detailed information on the interplay between **Miller and SQL**.

* [**Issue 150**](https://github.com/johnkerl/miller/issues/150) raised a question about suppressing numeric conversion. This resulted in a new FAQ entry [**How do I suppress numeric conversion?**](http://johnkerl.org/miller/doc/faq.html#How_do_I_suppress_numeric_conversion?), as well as the longer-term follow-on [**issue 151**](https://github.com/johnkerl/miller/issues/151) which will make numeric conversion happen on a just-in-time basis.

* To my surprise, **csvlite format options** weren&rsquo;t listed in `mlr --help` or the manpage. This has been fixed.

* Documentation for [**auxiliary commands**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference.html#Auxiliary_commands) has been expanded, including within the [**manpage**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/manpage.html).

## Bugfixes:

* [**Issue 159**](https://github.com/johnkerl/miller/issues/159) fixes regex-match of literal dot.

* [**Issue 160**](https://github.com/johnkerl/miller/issues/160) fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using `mmap`) over `stdio` since `mmap` is fractionally faster. Yet as any processing (even `mlr cat`) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with `madvise`.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer `stdio` over `mmap` for files over 4GB in size. (This 4GB threshold is tunable via the `--mmap-below` flag as described in the [manpage](http://johnkerl.org/miller-releases/miller-5.3.0/doc/manpage.html).)

* [**Issue 161**](https://github.com/johnkerl/miller/issues/161) fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence `0xef` `0xbb` `0xbf` and the header line has double-quoted fields. ([Release 5.2.0](https://github.com/johnkerl/miller/releases/tag/v5.2.0) introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.)

* [**Issue 162**](https://github.com/johnkerl/miller/issues/162) fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.

* The Miller JSON parser used to error with `Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value` on empty input, or input with trailing whitespace; this has been fixed.
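
The Issue 160 fix described above comes down to a size check: use `mmap` below a threshold and `stdio` at or above it. Here is a minimal sketch of that decision, assuming the threshold is exclusive (suggested by the flag name `--mmap-below`, but not confirmed here). The names `read_method_t` and `choose_read_method` are hypothetical and do not come from the Miller source.

```c
#include <assert.h>

// Hypothetical names for illustration only.
typedef enum { READ_VIA_MMAP, READ_VIA_STDIO } read_method_t;

// Picks the read method from the file size and the tunable threshold.
// Assumption: files strictly smaller than the threshold use mmap.
static read_method_t choose_read_method(long long file_size, long long mmap_below) {
	assert(file_size >= 0);
	return (file_size < mmap_below) ? READ_VIA_MMAP : READ_VIA_STDIO;
}
```

With a threshold of `4LL << 30` (4 GiB), a 1 KiB file would be memory-mapped and a 5 GiB file would be read through `stdio`.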

There is no prebuilt Windows executable for this release; my apologies.
65 changes: 56 additions & 9 deletions c/mapping/mapper_uniq.c
@@ -4,6 +4,7 @@
#include <math.h>
#include "lib/mlrutil.h"
#include "containers/sllv.h"
#include "containers/lhmsi.h"
#include "containers/lhmslv.h"
#include "containers/lhmsv.h"
#include "containers/lhmsll.h"
@@ -18,6 +19,7 @@ typedef struct _mapper_uniq_state_t {
slls_t* pgroup_by_field_names;
int show_counts;
int show_num_distinct_only;
lhmsi_t* puniqified_records; // lrec_sprint string -> 1 (presence marker only)
lhmslv_t* pcounts_by_group;
lhmsv_t* pcounts_unlashed; // string field name -> string field value -> long long count
char* output_field_name;
@@ -30,9 +32,10 @@ static void mapper_count_distinct_usage(FILE* o, char* argv0, char* verb);
static mapper_t* mapper_count_distinct_parse_cli(int* pargi, int argc, char** argv,
cli_reader_opts_t* _, cli_writer_opts_t* __);
static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_names, int do_lashed,
int show_counts, int show_num_distinct_only, char* output_field_name);
int show_counts, int show_num_distinct_only, char* output_field_name, int uniqify_entire_records);
static void mapper_uniq_free(mapper_t* pmapper, context_t* _);

static sllv_t* mapper_uniq_process_uniqify_entire_records(lrec_t* pinrec, context_t* pctx, void* pvstate);
static sllv_t* mapper_uniq_process_unlashed(lrec_t* pinrec, context_t* pctx, void* pvstate);
static sllv_t* mapper_uniq_process_num_distinct_only(lrec_t* pinrec, context_t* pctx, void* pvstate);
static sllv_t* mapper_uniq_process_with_counts(lrec_t* pinrec, context_t* pctx, void* pvstate);
@@ -101,7 +104,7 @@ static mapper_t* mapper_count_distinct_parse_cli(int* pargi, int argc, char** ar
}

return mapper_uniq_alloc(pstate, pfield_names, do_lashed, TRUE, show_num_distinct_only,
output_field_name);
output_field_name, FALSE);
}

// ----------------------------------------------------------------
@@ -111,6 +114,7 @@ static void mapper_uniq_usage(FILE* o, char* argv0, char* verb) {
fprintf(o, "-c Show repeat counts in addition to unique values.\n");
fprintf(o, "-n Show only the number of distinct values.\n");
fprintf(o, "-o {name} Field name for output count. Default \"%s\".\n", DEFAULT_OUTPUT_FIELD_NAME);
fprintf(o, "-a Output each unique record only once. Incompatible with -g, -c, -n, -o.\n");
fprintf(o, "Prints distinct values for specified field names. With -c, same as\n");
fprintf(o, "count-distinct. For uniq, -f is a synonym for -g.\n");
}
@@ -123,6 +127,7 @@ static mapper_t* mapper_uniq_parse_cli(int* pargi, int argc, char** argv,
int show_num_distinct_only = FALSE;
char* output_field_name = DEFAULT_OUTPUT_FIELD_NAME;
int do_lashed = TRUE;
int uniqify_entire_records = FALSE;

char* verb = argv[(*pargi)++];

@@ -132,24 +137,36 @@ static mapper_t* mapper_uniq_parse_cli(int* pargi, int argc, char** argv,
ap_define_true_flag(pstate, "-c", &show_counts);
ap_define_true_flag(pstate, "-n", &show_num_distinct_only);
ap_define_string_flag(pstate, "-o", &output_field_name);
ap_define_true_flag(pstate, "-a", &uniqify_entire_records);

if (!ap_parse(pstate, verb, pargi, argc, argv)) {
mapper_uniq_usage(stderr, argv[0], verb);
return NULL;
}

if (pgroup_by_field_names == NULL) {
mapper_uniq_usage(stderr, argv[0], verb);
return NULL;
if (uniqify_entire_records) {
if ((pgroup_by_field_names != NULL) || show_counts || show_num_distinct_only) {
mapper_uniq_usage(stderr, argv[0], verb);
return NULL;
}
if (!streq(output_field_name, DEFAULT_OUTPUT_FIELD_NAME)) {
mapper_uniq_usage(stderr, argv[0], verb);
return NULL;
}
} else {
if (pgroup_by_field_names == NULL) {
mapper_uniq_usage(stderr, argv[0], verb);
return NULL;
}
}

return mapper_uniq_alloc(pstate, pgroup_by_field_names, do_lashed, show_counts, show_num_distinct_only,
output_field_name);
output_field_name, uniqify_entire_records);
}
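
The option check in `mapper_uniq_parse_cli` above can be read as a single predicate: with `-a`, none of `-g`, `-c`, `-n` may be given and the output field name must still be the default; without `-a`, `-g` is required. Below is a standalone sketch of that rule. The function name `uniq_flags_ok` and its plain-string parameters are hypothetical stand-ins for the Miller types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical restatement of the flag check; not the Miller function itself.
// Returns 1 if the flag combination is acceptable, else 0.
static int uniq_flags_ok(int uniqify_entire_records, const char* group_by_fields,
	int show_counts, int show_num_distinct_only,
	const char* output_field_name, const char* default_output_field_name)
{
	assert(output_field_name != NULL && default_output_field_name != NULL);
	if (uniqify_entire_records) {
		// -a excludes -g, -c, and -n.
		if (group_by_fields != NULL || show_counts || show_num_distinct_only)
			return 0;
		// -o is detected only by comparing against the default name.
		return strcmp(output_field_name, default_output_field_name) == 0;
	}
	// Without -a, a group-by field list is mandatory.
	return group_by_fields != NULL;
}
```

One consequence of comparing against the default name is that `-a -o count` is indistinguishable from plain `-a` and would be accepted.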

// ----------------------------------------------------------------
static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_names, int do_lashed,
int show_counts, int show_num_distinct_only, char* output_field_name)
int show_counts, int show_num_distinct_only, char* output_field_name, int uniqify_entire_records)
{
mapper_t* pmapper = mlr_malloc_or_die(sizeof(mapper_t));

@@ -159,12 +176,15 @@ static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_na
pstate->pgroup_by_field_names = pgroup_by_field_names;
pstate->show_counts = show_counts;
pstate->show_num_distinct_only = show_num_distinct_only;
pstate->puniqified_records = lhmsi_alloc();
pstate->pcounts_by_group = lhmslv_alloc();
pstate->pcounts_unlashed = lhmsv_alloc();
pstate->output_field_name = output_field_name;

pmapper->pvstate = pstate;
if (!do_lashed)
if (uniqify_entire_records)
pmapper->pprocess_func = mapper_uniq_process_uniqify_entire_records;
else if (!do_lashed)
pmapper->pprocess_func = mapper_uniq_process_unlashed;
else if (show_num_distinct_only)
pmapper->pprocess_func = mapper_uniq_process_num_distinct_only;
@@ -179,27 +199,54 @@ static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_na

static void mapper_uniq_free(mapper_t* pmapper, context_t* _) {
mapper_uniq_state_t* pstate = pmapper->pvstate;

slls_free(pstate->pgroup_by_field_names);

lhmsi_free(pstate->puniqified_records);
pstate->puniqified_records = NULL;

// lhmslv_free will free the keys: we only need to free the void-star values.
for (lhmslve_t* pa = pstate->pcounts_by_group->phead; pa != NULL; pa = pa->pnext) {
unsigned long long* pcount = pa->pvvalue;
free(pcount);
}
lhmslv_free(pstate->pcounts_by_group);
pstate->pcounts_by_group = NULL;

for (lhmsve_t* pb = pstate->pcounts_unlashed->phead; pb != NULL; pb = pb->pnext) {
lhmsll_t* pmap = pb->pvvalue;
lhmsll_free(pmap);
}
lhmsv_free(pstate->pcounts_unlashed);
pstate->pcounts_unlashed = NULL;

pstate->pgroup_by_field_names = NULL;
pstate->pcounts_by_group = NULL;
pstate->pcounts_unlashed = NULL;

ap_free(pstate->pargp);
free(pstate);
free(pmapper);
}

// ----------------------------------------------------------------
static sllv_t* mapper_uniq_process_uniqify_entire_records(lrec_t* pinrec, context_t* pctx, void* pvstate) {
mapper_uniq_state_t* pstate = pvstate;
if (pinrec != NULL) {
// Serialize the whole record to one string, which serves as the set key.
char* lrec_as_string = lrec_sprint(pinrec, "\xfc", "\xfd", "\xfe");
if (lhmsi_has_key(pstate->puniqified_records, lrec_as_string)) {
// Already seen: drop the duplicate and emit nothing for this record.
free(lrec_as_string);
lrec_free(pinrec);
return NULL;
} else {
// First occurrence: the map takes ownership of the key string; the int value is only a marker.
lhmsi_put(pstate->puniqified_records, lrec_as_string, 1, FREE_ENTRY_KEY);
return sllv_single(pinrec);
}
} else { // end of record stream
return sllv_single(NULL);
}
}
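
The whole-record path above serializes each record to a string and uses that string as a set key, so a record is emitted only the first time its serialization appears. The sketch below shows the same idea without any Miller containers. Every name in it (`seen_set_t`, `record_is_first_occurrence`, and so on) is hypothetical, and the fixed-size chained hash table stands in for `lhmsi_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SEEN_SET_NUM_BUCKETS 1024

// Hypothetical minimal string set; stands in for Miller's lhmsi_t.
typedef struct seen_node { char* key; struct seen_node* pnext; } seen_node_t;
typedef struct { seen_node_t* buckets[SEEN_SET_NUM_BUCKETS]; } seen_set_t;

static unsigned long hash_string(const char* s) {
	unsigned long h = 5381;
	for (; *s; s++)
		h = h * 33 + (unsigned char)*s;
	return h;
}

// Joins fields into one heap-allocated string, ending each with a 0xfe byte
// so that {"ab","c"} and {"a","bc"} serialize differently.
static char* serialize_fields(const char** fields, int n) {
	size_t len = 1;
	for (int i = 0; i < n; i++)
		len += strlen(fields[i]) + 1;
	char* out = malloc(len);
	assert(out != NULL);
	char* p = out;
	for (int i = 0; i < n; i++) {
		size_t flen = strlen(fields[i]);
		memcpy(p, fields[i], flen);
		p += flen;
		*p++ = '\xfe';
	}
	*p = '\0';
	return out;
}

// Returns 1 and stores the record's key if it is new; returns 0 if already seen.
static int record_is_first_occurrence(seen_set_t* pset, const char** fields, int n) {
	char* key = serialize_fields(fields, n);
	seen_node_t** pbucket = &pset->buckets[hash_string(key) % SEEN_SET_NUM_BUCKETS];
	for (seen_node_t* pn = *pbucket; pn != NULL; pn = pn->pnext) {
		if (strcmp(pn->key, key) == 0) {
			free(key); // duplicate: the temporary key is not kept
			return 0;
		}
	}
	seen_node_t* pnew = malloc(sizeof(seen_node_t));
	assert(pnew != NULL);
	pnew->key = key; // the set now owns the key
	pnew->pnext = *pbucket;
	*pbucket = pnew;
	return 1;
}

// Frees every node and the key it owns.
static void seen_set_free(seen_set_t* pset) {
	for (int i = 0; i < SEEN_SET_NUM_BUCKETS; i++) {
		for (seen_node_t* pn = pset->buckets[i]; pn != NULL; ) {
			seen_node_t* pnext = pn->pnext;
			free(pn->key);
			free(pn);
			pn = pnext;
		}
		pset->buckets[i] = NULL;
	}
}
```

A zero-initialized `seen_set_t` is an empty set. Memory use grows with the number of distinct records, since each unique serialization is retained until `seen_set_free` is called.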

static sllv_t* mapper_uniq_process_unlashed(lrec_t* pinrec, context_t* pctx, void* pvstate) {
mapper_uniq_state_t* pstate = pvstate;
if (pinrec != NULL) {
3 changes: 2 additions & 1 deletion doc/manpage.html
@@ -1432,6 +1432,7 @@
-c Show repeat counts in addition to unique values.
-n Show only the number of distinct values.
-o {name} Field name for output count. Default "count".
-a Output each unique record only once. Incompatible with -g, -c, -n, -o.
Prints distinct values for specified field names. With -c, same as
count-distinct. For uniq, -f is a synonym for -g.

@@ -2333,7 +2334,7 @@



2018-01-10 MILLER(1)
2018-02-06 MILLER(1)
</pre>
</div>
<p/>
3 changes: 2 additions & 1 deletion doc/manpage.txt
@@ -1238,6 +1238,7 @@ VERBS
-c Show repeat counts in addition to unique values.
-n Show only the number of distinct values.
-o {name} Field name for output count. Default "count".
-a Output each unique record only once. Incompatible with -g, -c, -n, -o.
Prints distinct values for specified field names. With -c, same as
count-distinct. For uniq, -f is a synonym for -g.

@@ -2139,4 +2140,4 @@ SEE ALSO



2018-01-10 MILLER(1)
2018-02-06 MILLER(1)
5 changes: 3 additions & 2 deletions doc/mlr.1
@@ -2,12 +2,12 @@
.\" Title: mlr
.\" Author: [see the "AUTHOR" section]
.\" Generator: ./mkman.rb
.\" Date: 2018-01-10
.\" Date: 2018-02-06
.\" Manual: \ \&
.\" Source: \ \&
.\" Language: English
.\"
.TH "MILLER" "1" "2018-01-10" "\ \&" "\ \&"
.TH "MILLER" "1" "2018-02-06" "\ \&" "\ \&"
.\" -----------------------------------------------------------------
.\" * Portability definitions
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1613,6 +1613,7 @@ Usage: mlr uniq [options]
-c Show repeat counts in addition to unique values.
-n Show only the number of distinct values.
-o {name} Field name for output count. Default "count".
-a Output each unique record only once. Incompatible with -g, -c, -n, -o.
Prints distinct values for specified field names. With -c, same as
count-distinct. For uniq, -f is a synonym for -g.
.fi
1 change: 1 addition & 0 deletions doc/reference-verbs.html
@@ -3707,6 +3707,7 @@
-c Show repeat counts in addition to unique values.
-n Show only the number of distinct values.
-o {name} Field name for output count. Default "count".
-a Output each unique record only once. Incompatible with -g, -c, -n, -o.
Prints distinct values for specified field names. With -c, same as
count-distinct. For uniq, -f is a synonym for -g.
</pre>