the UID thing #42

mdsumner · 2018-01-06T00:35:26Z

Some types come in with names of things at every level (OSM) and some things come in with names at no levels (sf), and others with names only at the first object-level (sp, and sf sometimes).

The generic sc_uid should handle these by either preserving existing ones and creating new ones as needed.

It would be nice to be able to override these and nominate UIDs from a column or direct input.

The question of how to maintain a global pool of ever-unique UIDs remains,

The text was updated successfully, but these errors were encountered:

mpadge · 2018-03-12T22:38:13Z

See this anglr issue and replicated unjoin issue for some general thoughts in addition to which comes this one:

Using 5-byte uid values is fine and good, but if silicate is one day to be the super efficient beast it is destined to be, then these will likely be better replaced by simple incremental integers, so that adjacent entities can be compactly represented as varint class objects in delta form, meaning most of them will be simply stored as 1, requiring 1 bit, rather than 5 bytes. My current sphier-type vision is to pre-label vertices, edges, paths, arcs, and objects with respective first letters ("v", "e", ...), but otherwise to use this incremental integer scheme.

Further co-issue for self: Work out where the ordering of the key_col integers returned from unjoin::unjoin comes from? These appear to be pretty randomly ordered, and thus ultimately not amenable to varint::delta representation. Fix?

mdsumner · 2018-03-12T22:53:05Z

That ordering comes from factor and smashing values together as character:

unjoin(mtcars, drat, wt)

as.integer(factor(paste(mtcars$drat, mtcars$wt)))
 [1] 21 22 20  8 11  1 12 16 23 24 24  7  5  6  3  4 13 26 31 28 17  2 10 18  9 25 30 19 29 15 14 27

mdsumner · 2018-07-13T10:33:25Z

Watch this space tidyverse/dplyr#1792

mdsumner · 2018-09-02T08:11:14Z

I'm finding my stance on this has moved somewhat, partly because I just really need this package on CRAN so I can get on with stuff.

Default to structural IDs, basically the silicore model. Anything that needs to maintain relational IDs can add them. It simplifies this package somewhat and removes the need to include subsetting methods.

mdsumner · 2018-09-25T13:35:56Z

I still think this is right, I tried to go again on the SC0 model but got stuck - a full example of normalizing vertices and then normalizing segments is here:

https://github.com/hypertidy/contourPolys#other-attempts

In SC0 (silicore) I cheat by building edges from grouped paths in structural indexing, but we really need edges from normalized indexing as in that readme because juggling back and forth is v hard to understand.

Key is de-ordering segments so that they are identifiable, and I use pmin/pmax but surely there's a faster way to do this, and even a vector type that records the edge as given but can provide undirected identity as needed? (is complex a useful model?)

mpadge · 2018-09-26T14:01:58Z

I think I've got s fair insight into how to do this with C++ std::unordered_map objects, but the polygon nesting / holes thing is maybe not so simple. I'll give it a go in context of contourPolys/issues/4

mdsumner · 2018-10-02T12:12:30Z

I've been messing around with removing heavy indexes, and it's only just occurred to me that a simple binary type can have two tables, object and vertex with a list-col for edge:

compact_indexes_properly <- function(x) {
  x <- compact_indexes(x)
  ind <- split(x$object_link_edge$edge_, 
                                           x$object_link_edge$object_)
  x$object$edge <- lapply(ind, function(xx) x$edge[xx, ])
  x$object_link_edge <- NULL
  x$edge <- NULL
  
  x
}

## still using heavy and inefficient SC to demonstrate the difference
library(silicate)
compact_indexes_properly(SC(minimal_mesh))

$object
# A tibble: 2 x 2
      a edge             
* <int> <list>           
1     1 <tibble [12 × 2]>
2     2 <tibble [4 × 2]> 

$vertex
# A tibble: 14 x 2
      x_    y_
   <dbl> <dbl>
 1  0     0   
 2  0     1   
 3  0.75  1   
 4  1     0.8 
 5  0.5   0.7 
 6  0.8   0.6 
 7  0.69  0

So, that's the right core format for SC afaics, and with a edge-recyclerizer we could take it a long way. One thing it loses over current SC is that there's no explicit mapping between instances of edges and their unordered counterparts, but I think edge normalization and ordering is something we need to be able to do on the fly anyway.

For all the stuff I did with PATH that relied on UIDs I think that can be done on the fly, especially just done between pairs of tables rather than always in the whole model.

(I think were telling me about this two table model but I wasn't ready to go the whole way).

mdsumner · 2018-10-02T12:27:03Z

It's going to take a fair bit to refactor silicate around this, but it also seems like it gives a potential invalidation capability - because the edges are ordered around the paths originally, and a simple index could record their grouping within paths. If you do something really "edge-based", then you simply remove that path-validity. So maybe PATH can just totally die, and become an invalidate-able subclass of SC, and re-cyclerizing can restore those pieces, and ARC is another special type of PATH.

mpadge · 2018-10-02T13:19:31Z

That looks really promising! I've been a bit hammered here, but will defo take it fer a spin by end of next week

mpadge · 2018-10-25T16:19:37Z

So during my inaction I have actually had some insightful thoughts here, also following on from #61. I reckon this only needs an extra argument for SC, inherited by all functions called therein. Something like this should suffice:

SC <- function (x, id = NULL, ...) {
    if (!is.null (id))
        uids <- x ["id"]
    else {
        idcol <- grep ("id", names (x), ignore.case = TRUE)
        if (length (idcol) == 1)
            uids <- x [idcol]
        else {
            if (length (idcol) > 1)
                message ("Can not unambiguously determine ID column; going to just make some IDs up")
            uids <- [... present code via `ids`]
    }
}

Whaddayareckon?

mdsumner · 2018-10-26T04:40:57Z

This seems fine in general, everything is currently modelled off not having existing names so it's definitely open to improvement.

I guess OSM in sf is a funny case, because the inherent names aren't meaningful in the SF context. and we'd be better going direct to OSM sources as you say in the other issue.

There's still the problem of whether to normalize vertices, I guess OSM is saying "these are unique", and so then we should maintain that, not do our own uniqueifying based on scary floating point comparisons.

mpadge · 2018-10-26T15:38:43Z

One complicating thought against what I just wrote, and in favour of the idea prior to that. It'll likely be necessary to allow ID values to be user-defined because it'll otherwise be impossible to easily write round-trip tests like in geojsonsf. And that will require user ability to impose IDs at every level, including PATH construction, bringing us back to where we were. Having said that, I still intend to do osmdata_sc, because that will be heaps faster than osmdata_sf() %>% SC()

mdsumner · 2018-10-27T21:11:56Z

I'm pondering that, I just pushed the purely structural model "BINARY" (with only untested stubs for the worker verbs) - but a plot method for plot(BINARY(x)) with colouring on object level.

Obviously for sf, sp it's fine because they have no internal labels, we can subset the object table and there's no problem - if we subset the vertex table we have to re-index the object$edge_ - possibly dropping edges, and possibly dropping objects.

If you give it a labelled model, you'd want a more SC-like type - keeping all the vertex labels and path/way labels. So, do we change SC to use this binary form and keep labels if they are present (with an extra subclass?), or default to always having labels (creating them if needed), or just have different functions for label mode vs structure mode?

I've been grappling with this kind of idea from the beginning!

mdsumner · 2018-10-28T01:39:58Z

In binary branch I've put updates to SC, critically it doesn't now use PATH but derives from BINARY.

BINARY is two tables, with object$edge_ a nested structural index into vertex

SC is now three tables, object, edge, vertex. I think I'll drop the segment/edge distinction, and always keep edge with the object_ ID. When edges are really unordered we can make that explicit by remapping, nesting the object ID (if needed, and possibly with a direction tag).

So, BINARY is the primary workhorse for unlabelled models, paths are still implicit because of the edge indexes. SC is all edges, as a link between vertex and object - and it's arbitrary so we can subset whereever and joins handle the rest.

I'm still not sure how to go about this for labelled models (like OSM), but maybe a test upfront for "has_labels", that checks if the input model is internally labelled and if it doesn't - BINARY is the starting point. Otherwise we make sure the verbs return IDs is the model has them, so it means we need an "osm-doc" class or something like that and write the sc_verbs against that. If sc_object/sc_vertex and so on have IDs then we avoid using BINARY at all.

mdsumner · 2018-10-28T05:27:18Z

If rgl had names explaining all this would be easier

mdsumner · 2018-10-30T22:42:51Z

I'm happy with this compromise now, SC0 is dense, compact. SC is "fully labelled". TRI and PATH and ARC remain fully labelled.

We could improve things with careful ID management using integers, but I'll leave that for future work.

(BINARY is removed).

mdsumner · 2018-10-31T00:05:56Z

Everything is now in master branch.

mdsumner · 2018-11-02T04:09:31Z

I am terrified of this, but I put an integer-ID system into integers branch. Technically it can be set by global option to use old big IDs, but that will require more care. Do you see any dangers? ( Obviously we might have a table bigger than the limit, but that can probably be handled when it arises and better done by DB anyway)

mdsumner · 2018-11-02T05:18:35Z

One thing is keeping IDs that come in, but that's fine for osmdata_sc and I'll use rownames if present for the sc_uid.data.frame method

mpadge · 2018-11-02T10:16:19Z

One another thought here: The ids::random_ids() function pretends to use a "bytes" parameter, but actually just returns UTF-8 encoded characters which can optionally be used for hex-level bit control. In most cases, including here, that's neither used nor relevant, and the effect of this is that each block of two (64-bit) encodings actually contains just 16^2 = 256 possibilities, and so the chance of repeating one of currently-implemented bytes = 5 IDs is 16 ^ -10.

Current osmdata_sc code uses this function:

// Function to generate IDs for the edges in each way
std::string random_id (size_t len) {
    auto randchar = []() -> char
    {
        const char charset[] = \
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        const size_t max_index = (sizeof(charset) - 1);
        //return charset [ rand() % max_index ];
        size_t i = static_cast <size_t> (floor (Rcpp::runif (1) [0] * max_index));
        return charset [i];
    };
    std::string str (len, 0);
    std::generate_n (str.begin(), len, randchar);
    return str;
}

(currently residing here.) Chance of generating identical IDs in this case is 62 ^ -10, which is approximately infinitely lower. Preferable? Easy R implementation like this:

rhash <- function(n = 20) {
  paste (c ("jst", sample(c (LETTERS, letters, 0:9), n, TRUE)), collapse = "")
}

mdsumner · 2018-11-03T03:52:32Z

Cool thanks! I've been realizing that the way I'm doing it is not very sensible or efficient, and I think integers is fine - but the verbs should pick out existing IDs if they are there.

What's the "jst" for?

For vectorizing, I'd generate every value and split/collapse:

nhash <- 10
n <- 20
chars <- c(LETTERS, letters, 0:9)
unlist(lapply(split(sample(chars, nhash * n, replace = TRUE), rep(1:nhash, each = n)), paste, collapse = ""))

But, I'm sure your c++ will burn that to the ground

mdsumner · 2018-11-03T04:01:23Z

What do you think of using negative as well as positive integers for the key? Is that crazy?

mpadge · 2018-11-03T07:45:22Z

Oh, the prefix is arbitrary but allows them to be used as file names, coz some OSs don't allow numeric starts. Negative integers would also need a minus at the front, which is also likely not sensible for file names.

mdsumner · 2018-11-03T07:48:49Z

Why would these ever be file names? (But good principles at any rate)

mpadge · 2018-11-03T07:52:47Z

Yeah, quite possibly if for example iterating over object table and saving/caching each one separately. Ya never know, would be my motto, so good to presume it might be desired

mdsumner closed this as completed Oct 30, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

the UID thing #42

the UID thing #42

mdsumner commented Jan 6, 2018

mpadge commented Mar 12, 2018

mdsumner commented Mar 12, 2018

mdsumner commented Jul 13, 2018

mdsumner commented Sep 2, 2018 •

edited

Loading

mdsumner commented Sep 25, 2018 •

edited

Loading

mpadge commented Sep 26, 2018

mdsumner commented Oct 2, 2018 •

edited

Loading

mdsumner commented Oct 2, 2018 •

edited

Loading

mpadge commented Oct 2, 2018

mpadge commented Oct 25, 2018 •

edited

Loading

mdsumner commented Oct 26, 2018

mpadge commented Oct 26, 2018

mdsumner commented Oct 27, 2018 •

edited

Loading

mdsumner commented Oct 28, 2018 •

edited

Loading

mdsumner commented Oct 28, 2018

mdsumner commented Oct 30, 2018

mdsumner commented Oct 31, 2018

mdsumner commented Nov 2, 2018

mdsumner commented Nov 2, 2018

mpadge commented Nov 2, 2018

mdsumner commented Nov 3, 2018

mdsumner commented Nov 3, 2018

mpadge commented Nov 3, 2018

mdsumner commented Nov 3, 2018

mpadge commented Nov 3, 2018

the UID thing #42

the UID thing #42

Comments

mdsumner commented Jan 6, 2018

mpadge commented Mar 12, 2018

mdsumner commented Mar 12, 2018

mdsumner commented Jul 13, 2018

mdsumner commented Sep 2, 2018 • edited Loading

mdsumner commented Sep 25, 2018 • edited Loading

mpadge commented Sep 26, 2018

mdsumner commented Oct 2, 2018 • edited Loading

mdsumner commented Oct 2, 2018 • edited Loading

mpadge commented Oct 2, 2018

mpadge commented Oct 25, 2018 • edited Loading

mdsumner commented Oct 26, 2018

mpadge commented Oct 26, 2018

mdsumner commented Oct 27, 2018 • edited Loading

mdsumner commented Oct 28, 2018 • edited Loading

mdsumner commented Oct 28, 2018

mdsumner commented Oct 30, 2018

mdsumner commented Oct 31, 2018

mdsumner commented Nov 2, 2018

mdsumner commented Nov 2, 2018

mpadge commented Nov 2, 2018

mdsumner commented Nov 3, 2018

mdsumner commented Nov 3, 2018

mpadge commented Nov 3, 2018

mdsumner commented Nov 3, 2018

mpadge commented Nov 3, 2018

mdsumner commented Sep 2, 2018 •

edited

Loading

mdsumner commented Sep 25, 2018 •

edited

Loading

mdsumner commented Oct 2, 2018 •

edited

Loading

mdsumner commented Oct 2, 2018 •

edited

Loading

mpadge commented Oct 25, 2018 •

edited

Loading

mdsumner commented Oct 27, 2018 •

edited

Loading

mdsumner commented Oct 28, 2018 •

edited

Loading