mlr uniq -a

johnkerl · Feb 6, 2018 · 6942f63
1 parent 301db6a
commit 6942f63
Showing 6 changed files with 65 additions and 44 deletions.
diff --git a/c/draft-release-notes.md b/c/draft-release-notes.md
@@ -1,37 +1,7 @@
 ## Features:
 
-* [**Comment strings in data files:**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/file-formats.html#Comments_in_data) `mlr --skip-comments` allows you to filter out input lines starting with `#`, for all file formats.  Likewise, `mlr --skip-comments-with X` lets you specify the comment-string `X`.  Comments are only supported at start of data line.  `mlr --pass-comments` and `mlr --pass-comments-with X` allow you to forward comments to program output as they are read.  
-
-* The [**count-similar**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-verbs.html#count-similar) verb lets you compute cluster sizes by cluster labels.
-
-* While Miller DSL arithmetic gracefully overflows from 64-integer to double-precision float (see also [**here**](http://johnkerl.org/miller/doc/reference.html#Arithmetic)), there are now the **integer-preserving arithmetic operators** [**`.+`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.+) [**`.-`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.-) [**`.*`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.*) [**`./`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#./) [**`.//`**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#.//) for those times when you want integer overflow.
-
-* There is a new [**bitcount**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#bitcount) function: for example, `echo x=0xf0000206 | mlr put '$y=bitcount($x)'` produces `x=0xf0000206,y=7`.
-
-* [**Issue 158**](https://github.com/johnkerl/miller/issues/158): `mlr -T` is an alias for `--nidx --fs tab`, and `mlr -t` is an alias for `mlr --tsvlite`. 
-
-* The mathematical constants **&pi; and <i>e</i> have been renamed from `PI` and `E` to `M_PI` and `M_E`, respectively**. (It's annoying to get a syntax error when you try to define a variable named `E` in the DSL, when `A` through `D` work just fine.) This is a backward incompatibility, but not enough of us to justify calling this release Miller 6.0.0. 
+* xxx mlr uniq -a, w/ issue number
 
 ## Documentation:
 
-* As noted [**here**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference-dsl.html#A_note_on_the_complexity_of_Miller’s_expression_language), while Miller has its own DSL there will always be things better expressible in a general-purpose language. The new page [**Sharing data with other languages**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/data-sharing.html) shows how to seamlessly share data back and forth between **Miller, Ruby, and Python**.  [**SQL-input examples**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/10-min.html#SQL-input_examples) and [**SQL-output examples**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/10-min.html#SQL-output_examples) contain detailed information the interplay between **Miller and SQL**. 
-
-* [**Issue 150**](https://github.com/johnkerl/miller/issues/150) raised a question about suppressing numeric conversion. This resulted in a new FAQ entry [**How do I suppress numeric conversion?**](http://johnkerl.org/miller/doc/faq.html#How_do_I_suppress_numeric_conversion?), as well as the longer-term follow-on [**issue 151**](https://github.com/johnkerl/miller/issues/151) which will make numeric conversion happen on a just-in-time basis. 
-
-* To my surprise, **csvlite format options** weren&rsquo;t listed in `mlr --help` or the manpage. This has been fixed. 
-
-* Documentation for [**auxiliary commands**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/reference.html#Auxiliary_commands) has been expanded, including within the [**manpage**](http://johnkerl.org/miller-releases/miller-5.3.0/doc/manpage.html). 
-
 ## Bugfixes: 
-
-* [**Issue 159**](https://github.com/johnkerl/miller/issues/159) fixes regex-match of literal dot. 
-
-* [**Issue 160**](https://github.com/johnkerl/miller/issues/160) fixes out-of-memory cases for huge files. This is an old bug, as old as Miller, and is due to inadequate testing of huge-file cases. The problem is simple: Miller prefers memory-mapped I/O (using `mmap`) over `stdio` since `mmap` is fractionally faster. Yet as any processing (even `mlr cat`) steps through an input file, more and more pages are faulted in -- and, unfortunately, previous pages are not paged out once memory pressure increases. (This despite gallant attempts with `madvise`.) Once all processing is done, the memory is released; there is no leak per se. But the Miller process can crash before the entire file is read. The solution is equally simple: to prefer `stdio` over `mmap` for files over 4GB in size. (This 4GB threshold is tunable via the `--mmap-below` flag as described in the [manpage](http://johnkerl.org/miller-releases/miller-5.3.0/doc/manpage.html).) 
-
-* [**Issue 161**](https://github.com/johnkerl/miller/issues/161) fixes a CSV-parse error (with error message "unwrapped double quote at line 0") when a CSV file starts with the UTF-8 byte-order-mark ("BOM") sequence `0xef` `0xbb` `0xbf` and the header line has double-quoted fields. ([Release 5.2.0](https://github.com/johnkerl/miller/releases/tag/v5.2.0) introduced handling for UTF-8 BOMs, but missed the case of double-quoted header line.) 
-
-* [**Issue 162**](https://github.com/johnkerl/miller/issues/162) fixes a corner case doing multi-emit of aggregate variables when the first variable name is a typo.
-
-* The Miller JSON parser used to error with `Unable to parse JSON data: Line 1 column 0: Unexpected 0x00 when seeking value` on empty input, or input with trailing whitespace; this has been fixed.
-
-There is no prebuilt Windows executable for this release; my apologies.
diff --git a/c/mapping/mapper_uniq.c b/c/mapping/mapper_uniq.c
@@ -4,6 +4,7 @@
 #include <math.h>
 #include "lib/mlrutil.h"
 #include "containers/sllv.h"
+#include "containers/lhmsi.h"
 #include "containers/lhmslv.h"
 #include "containers/lhmsv.h"
 #include "containers/lhmsll.h"
@@ -18,6 +19,7 @@ typedef struct _mapper_uniq_state_t {
 	slls_t* pgroup_by_field_names;
 	int show_counts;
 	int show_num_distinct_only;
+	lhmsi_t* puniqified_records; // lrec_sprintf -> full lrec
 	lhmslv_t* pcounts_by_group;
 	lhmsv_t* pcounts_unlashed; // string field name -> string field value -> long long count
 	char* output_field_name;
@@ -30,9 +32,10 @@ static void      mapper_count_distinct_usage(FILE* o, char* argv0, char* verb);
 static mapper_t* mapper_count_distinct_parse_cli(int* pargi, int argc, char** argv,
 	cli_reader_opts_t* _, cli_writer_opts_t* __);
 static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_names, int do_lashed,
-	int show_counts, int show_num_distinct_only, char* output_field_name);
+	int show_counts, int show_num_distinct_only, char* output_field_name, int uniqify_entire_records);
 static void      mapper_uniq_free(mapper_t* pmapper, context_t* _);
 
+static sllv_t* mapper_uniq_process_uniqify_entire_records(lrec_t* pinrec, context_t* pctx, void* pvstate);
 static sllv_t* mapper_uniq_process_unlashed(lrec_t* pinrec, context_t* pctx, void* pvstate);
 static sllv_t* mapper_uniq_process_num_distinct_only(lrec_t* pinrec, context_t* pctx, void* pvstate);
 static sllv_t* mapper_uniq_process_with_counts(lrec_t* pinrec, context_t* pctx, void* pvstate);
@@ -101,7 +104,7 @@ static mapper_t* mapper_count_distinct_parse_cli(int* pargi, int argc, char** ar
 	}
 
 	return mapper_uniq_alloc(pstate, pfield_names, do_lashed, TRUE, show_num_distinct_only,
-		output_field_name);
+		output_field_name, FALSE);
 }
 
 // ----------------------------------------------------------------
@@ -111,6 +114,7 @@ static void mapper_uniq_usage(FILE* o, char* argv0, char* verb) {
 	fprintf(o, "-c            Show repeat counts in addition to unique values.\n");
 	fprintf(o, "-n            Show only the number of distinct values.\n");
 	fprintf(o, "-o {name}     Field name for output count. Default \"%s\".\n", DEFAULT_OUTPUT_FIELD_NAME);
+	fprintf(o, "-a            Output each unique record only once. Incompatible with -g, -c, -n, -o.\n");
 	fprintf(o, "Prints distinct values for specified field names. With -c, same as\n");
 	fprintf(o, "count-distinct. For uniq, -f is a synonym for -g.\n");
 }
@@ -123,6 +127,7 @@ static mapper_t* mapper_uniq_parse_cli(int* pargi, int argc, char** argv,
 	int     show_num_distinct_only = FALSE;
 	char*   output_field_name = DEFAULT_OUTPUT_FIELD_NAME;
 	int     do_lashed = TRUE;
+	int     uniqify_entire_records = FALSE;
 
 	char* verb = argv[(*pargi)++];
 
@@ -132,24 +137,36 @@ static mapper_t* mapper_uniq_parse_cli(int* pargi, int argc, char** argv,
 	ap_define_true_flag(pstate,        "-c", &show_counts);
 	ap_define_true_flag(pstate,        "-n", &show_num_distinct_only);
 	ap_define_string_flag(pstate,      "-o", &output_field_name);
+	ap_define_true_flag(pstate,        "-a", &uniqify_entire_records);
 
 	if (!ap_parse(pstate, verb, pargi, argc, argv)) {
 		mapper_uniq_usage(stderr, argv[0], verb);
 		return NULL;
 	}
 
-	if (pgroup_by_field_names == NULL) {
-		mapper_uniq_usage(stderr, argv[0], verb);
-		return NULL;
+	if (uniqify_entire_records) {
+		if ((pgroup_by_field_names != NULL) || show_counts || show_num_distinct_only) {
+			mapper_uniq_usage(stderr, argv[0], verb);
+			return NULL;
+		}
+		if (!streq(output_field_name, DEFAULT_OUTPUT_FIELD_NAME)) {
+			mapper_uniq_usage(stderr, argv[0], verb);
+			return NULL;
+		}
+	} else {
+		if (pgroup_by_field_names == NULL) {
+			mapper_uniq_usage(stderr, argv[0], verb);
+			return NULL;
+		}
 	}
 
 	return mapper_uniq_alloc(pstate, pgroup_by_field_names, do_lashed, show_counts, show_num_distinct_only,
-		output_field_name);
+		output_field_name, uniqify_entire_records);
 }
 
 // ----------------------------------------------------------------
 static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_names, int do_lashed,
-	int show_counts, int show_num_distinct_only, char* output_field_name)
+	int show_counts, int show_num_distinct_only, char* output_field_name, int uniqify_entire_records)
 {
 	mapper_t* pmapper = mlr_malloc_or_die(sizeof(mapper_t));
 
@@ -159,12 +176,15 @@ static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_na
 	pstate->pgroup_by_field_names  = pgroup_by_field_names;
 	pstate->show_counts            = show_counts;
 	pstate->show_num_distinct_only = show_num_distinct_only;
+	pstate->puniqified_records     = lhmsi_alloc();
 	pstate->pcounts_by_group       = lhmslv_alloc();
 	pstate->pcounts_unlashed       = lhmsv_alloc();
 	pstate->output_field_name      = output_field_name;
 
 	pmapper->pvstate = pstate;
-	if (!do_lashed)
+	if (uniqify_entire_records)
+		pmapper->pprocess_func = mapper_uniq_process_uniqify_entire_records;
+	else if (!do_lashed)
 		pmapper->pprocess_func = mapper_uniq_process_unlashed;
 	else if (show_num_distinct_only)
 		pmapper->pprocess_func = mapper_uniq_process_num_distinct_only;
@@ -179,27 +199,54 @@ static mapper_t* mapper_uniq_alloc(ap_state_t* pargp, slls_t* pgroup_by_field_na
 
 static void mapper_uniq_free(mapper_t* pmapper, context_t* _) {
 	mapper_uniq_state_t* pstate = pmapper->pvstate;
+
 	slls_free(pstate->pgroup_by_field_names);
+
+	lhmsi_free(pstate->puniqified_records);
+	pstate->puniqified_records = NULL;
+
 	// lhmslv_free will free the keys: we only need to free the void-star values.
 	for (lhmslve_t* pa = pstate->pcounts_by_group->phead; pa != NULL; pa = pa->pnext) {
 		unsigned long long* pcount = pa->pvvalue;
 		free(pcount);
 	}
 	lhmslv_free(pstate->pcounts_by_group);
+	pstate->pcounts_by_group = NULL;
+
 	for (lhmsve_t* pb = pstate->pcounts_unlashed->phead; pb != NULL; pb = pb->pnext) {
 		lhmsll_t* pmap = pb->pvvalue;
 		lhmsll_free(pmap);
 	}
 	lhmsv_free(pstate->pcounts_unlashed);
+	pstate->pcounts_unlashed = NULL;
+
 	pstate->pgroup_by_field_names = NULL;
 	pstate->pcounts_by_group = NULL;
-	pstate->pcounts_unlashed = NULL;
+
 	ap_free(pstate->pargp);
 	free(pstate);
 	free(pmapper);
 }
 
 // ----------------------------------------------------------------
+static sllv_t* mapper_uniq_process_uniqify_entire_records(lrec_t* pinrec, context_t* pctx, void* pvstate) {
+	mapper_uniq_state_t* pstate = pvstate;
+	if (pinrec != NULL) {
+		char* lrec_as_string = lrec_sprint(pinrec, "\xfc", "\xfd", "\xfe");
+		if (lhmsi_has_key(pstate->puniqified_records, lrec_as_string)) {
+			// have seen
+			free(lrec_as_string);
+			lrec_free(pinrec);
+			return sllv_single(NULL);
+		} else {
+			lhmsi_put(pstate->puniqified_records, lrec_as_string, 1, FREE_ENTRY_VALUE);
+			return sllv_single(pinrec);
+		}
+	} else { // end of record stream
+		return sllv_single(NULL);
+	}
+}
+
 static sllv_t* mapper_uniq_process_unlashed(lrec_t* pinrec, context_t* pctx, void* pvstate) {
 	mapper_uniq_state_t* pstate = pvstate;
 	if (pinrec != NULL) {
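
The whole-record path added in this file comes down to two steps: serialize each record to one string, then keep a set of the strings already seen. Here is a minimal standalone sketch of that idea; it is not Miller's code. The names `record_t`, `seen_node_t`, `record_to_key`, `seen_set_add`, and `seen_set_free` are hypothetical, and a linked list with linear search stands in for the `lhmsi_t` hash map the commit uses. The separator bytes 0xfd and 0xfe are borrowed from the `lrec_sprint` call above. Those byte values never occur in well-formed UTF-8, so for UTF-8 input the key for one record should not be mistaken for the key of a record with different field boundaries.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for a Miller record: parallel arrays of keys and values.
typedef struct {
	int          nfields;
	const char** keys;
	const char** values;
} record_t;

// Hypothetical stand-in for the set of seen records: a linked list of owned strings.
typedef struct seen_node {
	char*             key;
	struct seen_node* next;
} seen_node_t;

// Serializes a record as key<0xfd>value<0xfe>... The caller owns the result.
char* record_to_key(const record_t* rec) {
	size_t len = 1; // room for the trailing NUL
	for (int i = 0; i < rec->nfields; i++)
		len += strlen(rec->keys[i]) + strlen(rec->values[i]) + 2;
	char* out = malloc(len);
	assert(out != NULL);
	char* p = out;
	*p = '\0';
	for (int i = 0; i < rec->nfields; i++)
		p += sprintf(p, "%s\xfd%s\xfe", rec->keys[i], rec->values[i]);
	return out;
}

// Returns 1 if the record is new (and remembers it), 0 if it was seen before.
int seen_set_add(seen_node_t** phead, const record_t* rec) {
	char* key = record_to_key(rec);
	for (seen_node_t* n = *phead; n != NULL; n = n->next) {
		if (strcmp(n->key, key) == 0) {
			free(key); // duplicate: the temporary key is not stored
			return 0;
		}
	}
	seen_node_t* node = malloc(sizeof *node);
	assert(node != NULL);
	node->key  = key; // the set takes ownership of the key
	node->next = *phead;
	*phead     = node;
	return 1;
}

// Frees every node and the key string each node owns.
void seen_set_free(seen_node_t* head) {
	while (head != NULL) {
		seen_node_t* next = head->next;
		free(head->key);
		free(head);
		head = next;
	}
}
```

A caller would pass each incoming record to `seen_set_add` and emit the record only when it returns 1. Memory grows with the number of distinct records, since every distinct serialized record is retained until `seen_set_free`.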

diff --git a/doc/manpage.html b/doc/manpage.html
@@ -1432,6 +1432,7 @@
        -c	     Show repeat counts in addition to unique values.
        -n	     Show only the number of distinct values.
        -o {name}     Field name for output count. Default "count".
+       -a	     Output each unique record only once. Incompatible with -g, -c, -n, -o.
        Prints distinct values for specified field names. With -c, same as
        count-distinct. For uniq, -f is a synonym for -g.
 
@@ -2333,7 +2334,7 @@
 
 
 
-				  2018-01-10			     MILLER(1)
+				  2018-02-06			     MILLER(1)
 </pre>
 </div>
 <p/>

diff --git a/doc/manpage.txt b/doc/manpage.txt
@@ -1238,6 +1238,7 @@ VERBS
        -c	     Show repeat counts in addition to unique values.
        -n	     Show only the number of distinct values.
        -o {name}     Field name for output count. Default "count".
+       -a	     Output each unique record only once. Incompatible with -g, -c, -n, -o.
        Prints distinct values for specified field names. With -c, same as
        count-distinct. For uniq, -f is a synonym for -g.
 
@@ -2139,4 +2140,4 @@ SEE ALSO
 
 
 
-				  2018-01-10			     MILLER(1)
+				  2018-02-06			     MILLER(1)
diff --git a/doc/mlr.1 b/doc/mlr.1
@@ -2,12 +2,12 @@
 .\"     Title: mlr
 .\"    Author: [see the "AUTHOR" section]
 .\" Generator: ./mkman.rb
-.\"      Date: 2018-01-10
+.\"      Date: 2018-02-06
 .\"    Manual: \ \&
 .\"    Source: \ \&
 .\"  Language: English
 .\"
-.TH "MILLER" "1" "2018-01-10" "\ \&" "\ \&"
+.TH "MILLER" "1" "2018-02-06" "\ \&" "\ \&"
 .\" -----------------------------------------------------------------
 .\" * Portability definitions
 .\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1613,6 +1613,7 @@ Usage: mlr uniq [options]
 -c            Show repeat counts in addition to unique values.
 -n            Show only the number of distinct values.
 -o {name}     Field name for output count. Default "count".
+-a            Output each unique record only once. Incompatible with -g, -c, -n, -o.
 Prints distinct values for specified field names. With -c, same as
 count-distinct. For uniq, -f is a synonym for -g.
 .fi

diff --git a/doc/reference-verbs.html b/doc/reference-verbs.html
@@ -3707,6 +3707,7 @@
 -c            Show repeat counts in addition to unique values.
 -n            Show only the number of distinct values.
 -o {name}     Field name for output count. Default "count".
+-a            Output each unique record only once. Incompatible with -g, -c, -n, -o.
 Prints distinct values for specified field names. With -c, same as
 count-distinct. For uniq, -f is a synonym for -g.
 </pre>
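
The usage text says `-a` is incompatible with `-g`, `-c`, `-n`, and `-o`. The check in `mapper_uniq_parse_cli` can be restated as a small predicate, sketched below under assumptions: `uniq_opts_t` and `uniq_opts_valid` are hypothetical names that do not appear in Miller, and `have_group_by` stands in for `pgroup_by_field_names != NULL`. The default field name `"count"` is taken from the manpage text in this diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEFAULT_OUTPUT_FIELD_NAME "count"

// Hypothetical bundle of the parsed uniq flags.
typedef struct {
	int         have_group_by;          // -g or -f was given
	int         show_counts;            // -c
	int         show_num_distinct_only; // -n
	const char* output_field_name;      // -o, or the default
	int         uniqify_entire_records; // -a
} uniq_opts_t;

// Returns 1 if the flag combination is accepted, 0 if usage should be printed.
int uniq_opts_valid(const uniq_opts_t* o) {
	assert(o != NULL);
	if (o->uniqify_entire_records) {
		// -a excludes -g, -c, and -n outright.
		if (o->have_group_by || o->show_counts || o->show_num_distinct_only)
			return 0;
		// -o is detected only by comparing against the default value.
		if (strcmp(o->output_field_name, DEFAULT_OUTPUT_FIELD_NAME) != 0)
			return 0;
		return 1;
	}
	// Without -a, a group-by field list is still required.
	return o->have_group_by ? 1 : 0;
}
```

One consequence of comparing values rather than tracking whether `-o` was given: in this sketch, as in the diff, `-a -o count` is accepted, because an explicit `count` cannot be told apart from the default.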