Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the UID thing #42

Closed
mdsumner opened this issue Jan 6, 2018 · 25 comments
Closed

the UID thing #42

mdsumner opened this issue Jan 6, 2018 · 25 comments

Comments

@mdsumner
Copy link
Member

mdsumner commented Jan 6, 2018

Some types come in with names of things at every level (OSM) and some things come in with names at no levels (sf), and others with names only at the first object-level (sp, and sf sometimes).

The generic sc_uid should handle these by either preserving existing ones and creating new ones as needed.

It would be nice to be able to override these and nominate UIDs from a column or direct input.

The question of how to maintain a global pool of ever-unique UIDs remains,

@mpadge
Copy link
Member

mpadge commented Mar 12, 2018

See this anglr issue and replicated unjoin issue for some general thoughts in addition to which comes this one:

Using 5-byte uid values is fine and good, but if silicate is one day to be the super efficient beast it is destined to be, then these will likely be better replaced by simple incremental integers, so that adjacent entities can be compactly represented as varint class objects in delta form, meaning most of them will be simply stored as 1, requiring 1 bit, rather than 5 bytes. My current sphier-type vision is to pre-label vertices, edges, paths, arcs, and objects with respective first letters ("v", "e", ...), but otherwise to use this incremental integer scheme.

Further co-issue for self: Work out where the ordering of the key_col integers returned from unjoin::unjoin comes from? These appear to be pretty randomly ordered, and thus ultimately not amenable to varint::delta representation. Fix?

@mdsumner
Copy link
Member Author

That ordering comes from factor and smashing values together as character:

unjoin(mtcars, drat, wt)

as.integer(factor(paste(mtcars$drat, mtcars$wt)))
 [1] 21 22 20  8 11  1 12 16 23 24 24  7  5  6  3  4 13 26 31 28 17  2 10 18  9 25 30 19 29 15 14 27

@mdsumner
Copy link
Member Author

Watch this space tidyverse/dplyr#1792

@mdsumner
Copy link
Member Author

mdsumner commented Sep 2, 2018

I'm finding my stance on this has moved somewhat, partly because I just really need this package on CRAN so I can get on with stuff.

Default to structural IDs, basically the silicore model. Anything that needs to maintain relational IDs can add them. It simplifies this package somewhat and removes the need to include subsetting methods.

@mdsumner
Copy link
Member Author

mdsumner commented Sep 25, 2018

I still think this is right, I tried to go again on the SC0 model but got stuck - a full example of normalizing vertices and then normalizing segments is here:

https://github.com/hypertidy/contourPolys#other-attempts

In SC0 (silicore) I cheat by building edges from grouped paths in structural indexing, but we really need edges from normalized indexing as in that readme because juggling back and forth is v hard to understand.

Key is de-ordering segments so that they are identifiable, and I use pmin/pmax but surely there's a faster way to do this, and even a vector type that records the edge as given but can provide undirected identity as needed? (is complex a useful model?)

@mpadge
Copy link
Member

mpadge commented Sep 26, 2018

I think I've got s fair insight into how to do this with C++ std::unordered_map objects, but the polygon nesting / holes thing is maybe not so simple. I'll give it a go in context of contourPolys/issues/4

@mdsumner
Copy link
Member Author

mdsumner commented Oct 2, 2018

I've been messing around with removing heavy indexes, and it's only just occurred to me that a simple binary type can have two tables, object and vertex with a list-col for edge:

compact_indexes_properly <- function(x) {
  x <- compact_indexes(x)
  ind <- split(x$object_link_edge$edge_, 
                                           x$object_link_edge$object_)
  x$object$edge <- lapply(ind, function(xx) x$edge[xx, ])
  x$object_link_edge <- NULL
  x$edge <- NULL
  
  x
}

## still using heavy and inefficient SC to demonstrate the difference
library(silicate)
compact_indexes_properly(SC(minimal_mesh))

$object
# A tibble: 2 x 2
      a edge             
* <int> <list>           
1     1 <tibble [12 × 2]>
2     2 <tibble [4 × 2]> 

$vertex
# A tibble: 14 x 2
      x_    y_
   <dbl> <dbl>
 1  0     0   
 2  0     1   
 3  0.75  1   
 4  1     0.8 
 5  0.5   0.7 
 6  0.8   0.6 
 7  0.69  0 

So, that's the right core format for SC afaics, and with a edge-recyclerizer we could take it a long way. One thing it loses over current SC is that there's no explicit mapping between instances of edges and their unordered counterparts, but I think edge normalization and ordering is something we need to be able to do on the fly anyway.

For all the stuff I did with PATH that relied on UIDs I think that can be done on the fly, especially just done between pairs of tables rather than always in the whole model.

(I think were telling me about this two table model but I wasn't ready to go the whole way).

@mdsumner
Copy link
Member Author

mdsumner commented Oct 2, 2018

It's going to take a fair bit to refactor silicate around this, but it also seems like it gives a potential invalidation capability - because the edges are ordered around the paths originally, and a simple index could record their grouping within paths. If you do something really "edge-based", then you simply remove that path-validity. So maybe PATH can just totally die, and become an invalidate-able subclass of SC, and re-cyclerizing can restore those pieces, and ARC is another special type of PATH.

@mpadge
Copy link
Member

mpadge commented Oct 2, 2018

That looks really promising! I've been a bit hammered here, but will defo take it fer a spin by end of next week

@mpadge
Copy link
Member

mpadge commented Oct 25, 2018

So during my inaction I have actually had some insightful thoughts here, also following on from #61. I reckon this only needs an extra argument for SC, inherited by all functions called therein. Something like this should suffice:

SC <- function (x, id = NULL, ...) {
    if (!is.null (id))
        uids <- x ["id"]
    else {
        idcol <- grep ("id", names (x), ignore.case = TRUE)
        if (length (idcol) == 1)
            uids <- x [idcol]
        else {
            if (length (idcol) > 1)
                message ("Can not unambiguously determine ID column; going to just make some IDs up")
            uids <- [... present code via `ids`]
    }
}

Whaddayareckon?

@mdsumner
Copy link
Member Author

This seems fine in general, everything is currently modelled off not having existing names so it's definitely open to improvement.

I guess OSM in sf is a funny case, because the inherent names aren't meaningful in the SF context. and we'd be better going direct to OSM sources as you say in the other issue.

There's still the problem of whether to normalize vertices, I guess OSM is saying "these are unique", and so then we should maintain that, not do our own uniqueifying based on scary floating point comparisons.

@mpadge
Copy link
Member

mpadge commented Oct 26, 2018

One complicating thought against what I just wrote, and in favour of the idea prior to that. It'll likely be necessary to allow ID values to be user-defined because it'll otherwise be impossible to easily write round-trip tests like in geojsonsf. And that will require user ability to impose IDs at every level, including PATH construction, bringing us back to where we were. Having said that, I still intend to do osmdata_sc, because that will be heaps faster than osmdata_sf() %>% SC()

@mdsumner
Copy link
Member Author

mdsumner commented Oct 27, 2018

I'm pondering that, I just pushed the purely structural model "BINARY" (with only untested stubs for the worker verbs) - but a plot method for plot(BINARY(x)) with colouring on object level.

Obviously for sf, sp it's fine because they have no internal labels, we can subset the object table and there's no problem - if we subset the vertex table we have to re-index the object$edge_ - possibly dropping edges, and possibly dropping objects.

If you give it a labelled model, you'd want a more SC-like type - keeping all the vertex labels and path/way labels. So, do we change SC to use this binary form and keep labels if they are present (with an extra subclass?), or default to always having labels (creating them if needed), or just have different functions for label mode vs structure mode?

I've been grappling with this kind of idea from the beginning!

@mdsumner
Copy link
Member Author

mdsumner commented Oct 28, 2018

In binary branch I've put updates to SC, critically it doesn't now use PATH but derives from BINARY.

BINARY is two tables, with object$edge_ a nested structural index into vertex

SC is now three tables, object, edge, vertex. I think I'll drop the segment/edge distinction, and always keep edge with the object_ ID. When edges are really unordered we can make that explicit by remapping, nesting the object ID (if needed, and possibly with a direction tag).

So, BINARY is the primary workhorse for unlabelled models, paths are still implicit because of the edge indexes. SC is all edges, as a link between vertex and object - and it's arbitrary so we can subset whereever and joins handle the rest.

I'm still not sure how to go about this for labelled models (like OSM), but maybe a test upfront for "has_labels", that checks if the input model is internally labelled and if it doesn't - BINARY is the starting point. Otherwise we make sure the verbs return IDs is the model has them, so it means we need an "osm-doc" class or something like that and write the sc_verbs against that. If sc_object/sc_vertex and so on have IDs then we avoid using BINARY at all.

@mdsumner
Copy link
Member Author

If rgl had names explaining all this would be easier

@mdsumner
Copy link
Member Author

I'm happy with this compromise now, SC0 is dense, compact. SC is "fully labelled". TRI and PATH and ARC remain fully labelled.

We could improve things with careful ID management using integers, but I'll leave that for future work.

(BINARY is removed).

@mdsumner
Copy link
Member Author

Everything is now in master branch.

@mdsumner
Copy link
Member Author

mdsumner commented Nov 2, 2018

I am terrified of this, but I put an integer-ID system into integers branch. Technically it can be set by global option to use old big IDs, but that will require more care. Do you see any dangers? ( Obviously we might have a table bigger than the limit, but that can probably be handled when it arises and better done by DB anyway)

@mdsumner
Copy link
Member Author

mdsumner commented Nov 2, 2018

One thing is keeping IDs that come in, but that's fine for osmdata_sc and I'll use rownames if present for the sc_uid.data.frame method

@mpadge
Copy link
Member

mpadge commented Nov 2, 2018

One another thought here: The ids::random_ids() function pretends to use a "bytes" parameter, but actually just returns UTF-8 encoded characters which can optionally be used for hex-level bit control. In most cases, including here, that's neither used nor relevant, and the effect of this is that each block of two (64-bit) encodings actually contains just 16^2 = 256 possibilities, and so the chance of repeating one of currently-implemented bytes = 5 IDs is 16 ^ -10.

Current osmdata_sc code uses this function:

// Function to generate IDs for the edges in each way
std::string random_id (size_t len) {
    auto randchar = []() -> char
    {
        const char charset[] = \
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
        const size_t max_index = (sizeof(charset) - 1);
        //return charset [ rand() % max_index ];
        size_t i = static_cast <size_t> (floor (Rcpp::runif (1) [0] * max_index));
        return charset [i];
    };
    std::string str (len, 0);
    std::generate_n (str.begin(), len, randchar);
    return str;
}

(currently residing here.) Chance of generating identical IDs in this case is 62 ^ -10, which is approximately infinitely lower. Preferable? Easy R implementation like this:

rhash <- function(n = 20) {
  paste (c ("jst", sample(c (LETTERS, letters, 0:9), n, TRUE)), collapse = "")
}

@mdsumner
Copy link
Member Author

mdsumner commented Nov 3, 2018

Cool thanks! I've been realizing that the way I'm doing it is not very sensible or efficient, and I think integers is fine - but the verbs should pick out existing IDs if they are there.

What's the "jst" for?

For vectorizing, I'd generate every value and split/collapse:

nhash <- 10
n <- 20
chars <- c(LETTERS, letters, 0:9)
unlist(lapply(split(sample(chars, nhash * n, replace = TRUE), rep(1:nhash, each = n)), paste, collapse = ""))

But, I'm sure your c++ will burn that to the ground

@mdsumner
Copy link
Member Author

mdsumner commented Nov 3, 2018

What do you think of using negative as well as positive integers for the key? Is that crazy?

@mpadge
Copy link
Member

mpadge commented Nov 3, 2018

Oh, the prefix is arbitrary but allows them to be used as file names, coz some OSs don't allow numeric starts. Negative integers would also need a minus at the front, which is also likely not sensible for file names.

@mdsumner
Copy link
Member Author

mdsumner commented Nov 3, 2018

Why would these ever be file names? (But good principles at any rate)

@mpadge
Copy link
Member

mpadge commented Nov 3, 2018

Yeah, quite possibly if for example iterating over object table and saving/caching each one separately. Ya never know, would be my motto, so good to presume it might be desired

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants