Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add seed parameter for ST_GeneratePoints #365

Closed
wants to merge 8 commits into from

Conversation

@mwtoews
Copy link

commented Jan 22, 2019

This PR addresses two tickets related to ST_GeneratePoints:

  • trac #4299 st_generatepoints declared IMMUTABLE
  • trac #4304 Add seed parameter to ST_GeneratePoints

All tests pass on an Ubuntu laptop, however I sense there could be some portability issues on foreign compilers and/or hardware. I've left a mess of debugging functions in lwrandom.c to help pin anything down, and much of this can be stripped off before merging.

Much of the core pseudo random number generation is based on John Burkardt's r8_uni function, which is demonstrated in tests 10-12, with verbatim output in asa183_prb_output.txt.

I'm neither a RNG or C expert, so any feedback/review is most welcome.

@Komzpa

This comment has been minimized.

Copy link
Member

commented Jan 22, 2019

The code fails on Travis with clang and gcc undefined behavior sanitizers. Please check this line at least:

Core was generated by `/src/postgis/liblwgeom/cunit/.libs/cu_tester'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x00007fe9539a093d in lwrandom_set_seed (seed=0) at lwrandom.c:47
47				seed = ((int)time(NULL) * 1996) << 8;
Thread 1 (Thread 0x7fe952125b40 (LWP 6140)):
#0  0x00007fe9539a093d in lwrandom_set_seed (seed=0) at lwrandom.c:47
No locals.
@Algunenano

This comment has been minimized.

Copy link
Member

commented Jan 22, 2019

Some comments from a partial review ( I haven't checked the code itself):

  • Several builds are failing because you are triggering undefined behaviour (check the usan_clang or usan_gcc modes).
  • The PG 9.5 build is failing because you are declaring the function PARALLEL RESTRICTED directly instead of using a macro that only gets used in PG9.6+ (see _PARALLEL).
  • Using ints here is a recipe for conflict with different architectures. If you know (I hope you do) the size of the ints that you need, use int64_t, int32_t, uint32_t, uint64_t instead.
  • The seed shouldn't be numeric if you are going to read just a 32 bit type. I'd recommend you to use smallint, integer or bigint depending on the size you need.

And a final question, why don't you create a new function and make the seed() mandatory? That way you can declare the function PARALLEL SAFE and IMMUTABLE.

@mwtoews

This comment has been minimized.

Copy link
Author

commented Jan 23, 2019

Thanks @Algunenano for the helpful suggestions. I've implemented most of them. However, I'm a bit lost on which PARALLEL is appropriate. The PG docs says:

functions must be marked PARALLEL RESTRICTED if they access temporary tables, client connection state, cursors, prepared statements, or miscellaneous backend-local state which the system cannot synchronize across workers. For example, setseed and random are parallel restricted for this last reason.

I've also modified the shuffle step to use the Fisher–Yates shuffle, which is a bit easier to look at.

One outstanding concern with the existing function signature is that npoints is numeric. Is there any reason why it isn't integer? Should I change it, or leave as-is?

All Travis CI tests pass, so I suppose it's ready for closer review, but perhaps not merge. The tests include some bold deterministic numbers, which I'm certain may fail on some system I don't have access to.

liblwgeom/lwrandom.c Outdated Show resolved Hide resolved
liblwgeom/lwrandom.c Outdated Show resolved Hide resolved
* Communications of the ACM,
* Volume 31, Number 6, June 1988, pages 742-751.
*/
double

This comment has been minimized.

Copy link
@Algunenano

Algunenano Jan 30, 2019

Member

This function comes from an external project. It seems like it's under GPL3, so we'd need to check if it's compatible with Postgis' GPL2 and verify if it needs to be provided along with the source.

This comment has been minimized.

Copy link
@mwtoews

mwtoews Jan 31, 2019

Author

The function was modified from asa183.c (look for double r8_uni ( int *s1, int *s2 )). There are about 12 lines of effective code used here, most of which I've modified to suite this purpose and coding style (e.g. tabs). I'm not sure how different this qualifies to say it's an entirely "different" function with a different name, calling parameters and data types. As for downgrading LGPL code from v. 3 to 2, the answer appears to be no.

The primary source for this algorithm, and I've checked, is Figure 3 "A Portable Generator for 32-bit Computers" from Pierre L'Ecuyer's 1988 manuscript. While it is in Pascal, the algorithm is identical, with about 12 effective statements to make it operate.

If using John Burkardt's 12 lines of C code is an issue, I can remove credit from John Burkardt, and just cite Pierre L'Ecuyer's 1988 work, as there are not too many different ways to rewrite a few lines of Pascal into C. Or, find a different algorithm (there are many PRNGs out there). I've selected this algorithm due to it's simplicity and apparent portability on 32-bit integers.

@mwtoews

This comment has been minimized.

Copy link
Author

commented Jan 31, 2019

I've removed extra outputs for testing portability from this PR. This has been moved to testlwrandom.c with expected output testlwrandom.out. To compile and check using only a C compiler (no other dependencies):

wget -N https://raw.githubusercontent.com/postgis/postgis/ecab4227d5a0f96988850eacb3b14fd42eb4be37/liblwgeom/lwrandom.c
wget -N https://raw.githubusercontent.com/postgis/postgis/ecab4227d5a0f96988850eacb3b14fd42eb4be37/liblwgeom/lwrandom.h
wget -N https://gist.githubusercontent.com/mwtoews/2e2a74281ebf0a4a63c397bf0b39144a/raw/4183ede3b388955a91f0f4bd3c4603fece70c76d/testlwrandom.c
wget -N https://gist.githubusercontent.com/mwtoews/8fdd10e25f60a0aa81c33551b4c2a573/raw/279c8ac0f36c65e0c36d16e75b4f88ce2ec0b492/testlwrandom.out

export CC=gcc
${CC} -c lwrandom.c 
${CC} -o testlwrandom testlwrandom.c lwrandom.c 
./testlwrandom > testlwrandom.check
diff testlwrandom.out testlwrandom.check
@Algunenano

This comment has been minimized.

Copy link
Member

commented Jan 31, 2019

I've removed extra outputs for testing portability from this PR. This has been moved to testlwrandom.c with expected output testlwrandom.out.

Have you considered doing porting it using cunit and place it in liblwgeom/cunit ?

@mwtoews

This comment has been minimized.

Copy link
Author

commented Jan 31, 2019

Have you considered doing porting it using cunit and place it in liblwgeom/cunit ?

I'd rather avoid potential licensing issues by using John Burkardt's LGPL v3 code. The external code is really for me to check on a few different platforms. Also, it's easy to make up tests to check if some random numbers are the same.

There are already some tests in liblwgeom/cunit/cu_algorithm.c (and regress/core/tickets) that do this, however I think I'll split these into two parts: 1) one that makes sure that two large multipoint results are identical (on any platform), and 2) a test to see if (e.g.) the 1001th random point is the same on all platforms. If there are any portability issues then only the second part would need to be addressed/commented-out or whatever other action is needed.

The good news is that I've ran testlwrandom on a range of compilers, platforms, and they have the same results. I've even tested the algorithms with MS Excel's spreadsheet formulas, and I get the same results for John's TEST10:
TEST10
so it's looking good to say it's platform independent, however I'm slightly hesitant to state this is the docs.

@mwtoews

This comment has been minimized.

Copy link
Author

commented Jan 31, 2019

One outstanding question to @pramsey (re d7eccf2) is do you recall why the original SQL function signature was ST_GeneratePoints(area geometry, npoints numeric), where npoints is numeric instead of integer?

Also why npoints = DatumGetInt32(DirectFunctionCall1(numeric_int4, PG_GETARG_DATUM(1))); instead of a more readable npoints= PG_GETARG_INT32(1);?

@pramsey

This comment has been minimized.

Copy link
Member

commented Jan 31, 2019

One outstanding question to @pramsey is do you recall why the original SQL function signature was ST_GeneratePoints(area geometry, npoints numeric), where npoints is numeric instead of integer?

I have no recollection of that, your honour. Mea culpa, seems nonsensical from this vantage point.

@mwtoews mwtoews force-pushed the mwtoews:generatepoints branch from 9a772d8 to acda94c Feb 2, 2019

mwtoews added 6 commits Jan 22, 2019
Address feedback and fix Travis CI tests
 * Create a second function for seed parameter, make it IMMUTABLE
 * Use uint32_t and int32_t to help portability
 * Add a _PARALLEL_RESTRICTED #define for these random-based functions
 * Check to ensure seed is > 0

@mwtoews mwtoews force-pushed the mwtoews:generatepoints branch from acda94c to 19ee783 Feb 2, 2019

@mwtoews

This comment has been minimized.

Copy link
Author

commented Feb 2, 2019

I've modified npoints to integer, so while before it was possible to use npoints=12.5 for 13 points, now that won't work, because it shouldn't have.

Here are 100 random points and connecting lines (via ST_LineFromMultiPoint) on an R, both from seed=1:
R100

Not bad. However, when plotting 20 points with seeds 1 to 100 (each colour), there does appear to be a trend in the lower-left corner as to where each random sequence coincides:
R20xS100

these trends go away with a different sequence of seeds (e.g. 12345, 12395, 12445, etc.), but I suspect this sort of thing isn't too surprising when manually selecting seeds so close to each other near 1. This is a minor thing I can look more into, but it's also not a big deal.

@mwtoews mwtoews force-pushed the mwtoews:generatepoints branch 3 times, most recently from b0be8ea to 3a5b135 Feb 11, 2019

Modify seed algorithm to use process ID, and use a few randomly selec…
…ted prime numbers that sort of look like words.

@mwtoews mwtoews force-pushed the mwtoews:generatepoints branch from 3a5b135 to 2d258c0 Feb 11, 2019

@mwtoews

This comment has been minimized.

Copy link
Author

commented Feb 11, 2019

The peculiar non-random patterns in the previous R image above has been resolved by revising the seeding algorithm.

When a seed is not provided by the user, it is based on Unix time (in seconds) and the process ID. I have no idea how parallel postgres works, but if there are multiple processes running at the same Unix time, they should now each generate different number sequences (assuming they each have different process IDs).

Checking back on the npoints datatype, I see a demonstration by @dbaston in here that relies on a numeric type. With the changes in this PR, this example will not work without an optional round/trunc/floor and casting npoints to integer.

This patch is ready for further review / squashing / merging to svn-trunk.

postgis/lwgeom_geos.c Outdated Show resolved Hide resolved
RETURNS geometry
AS 'MODULE_PATHNAME','ST_GeneratePoints'
LANGUAGE 'c' IMMUTABLE STRICT _PARALLEL
LANGUAGE 'c' VOLATILE STRICT _PARALLEL_RESTRICTED

This comment has been minimized.

Copy link
@Komzpa

Komzpa Feb 12, 2019

Member

does it have to be parallel restricted? if it's volatile and resets seed on start why not start it in parallel worker too?

This comment has been minimized.

Copy link
@Komzpa

Komzpa Feb 12, 2019

Member

Postgres has set seed() as a separate call from random(), that's why they can't migrate random() call into parallel worker. Here it probably not applies.

This comment has been minimized.

Copy link
@mwtoews

mwtoews Feb 12, 2019

Author

As mentioned in trac #4299, the docs for parallel safety say "setseed and random are parallel restricted", so I've applied this to this function too. However, I don't know much about the parallel features of postgres, so am open to modifying this aspect.

@Komzpa
Copy link
Member

left a comment

I like. A couple of tweaks and let's merge this.

postgis/postgis.sql.in Outdated Show resolved Hide resolved
regress/core/tickets.sql Outdated Show resolved Hide resolved
regress/core/tickets.sql Outdated Show resolved Hide resolved
RETURNS geometry
AS 'MODULE_PATHNAME','ST_GeneratePoints'
LANGUAGE 'c' IMMUTABLE STRICT _PARALLEL
LANGUAGE 'c' VOLATILE STRICT _PARALLEL_RESTRICTED

This comment has been minimized.

Copy link
@Komzpa

Komzpa Feb 12, 2019

Member

Postgres has set seed() as a separate call from random(), that's why they can't migrate random() call into parallel worker. Here it probably not applies.

@@ -1787,10 +1802,12 @@ POINT(2 1)

<para>
ST_GeneratePoints generates pseudo-random points until the requested number are
found within the input area.
found within the input area. The seed parameter allows re-generation of
the same set of points.

This comment has been minimized.

Copy link
@Komzpa

Komzpa Feb 12, 2019

Member

I'm hesistant to give guarantees in docs. If we tweak point-in-poly a bit the set may be different between postgis versions, or maybe even platforms, even with same random functions - think JIT and 64/80-bit maths.

Maybe "setting seed allows Postgres to cache results between runs"?

This comment has been minimized.

Copy link
@mwtoews

mwtoews Feb 12, 2019

Author

I'm confident that on one computer that point sequence can be re-generated with a user-supplied seed. I'll reword to not imply absolute re-generation of floating-point values across different hardware.

This comment has been minimized.

Copy link
@mwtoews

mwtoews Feb 12, 2019

Author

Well I did reword this part, but I can't think of a simple way to not over complicate the docs. Pseudorandom number generators are deterministic by design. I'm just not mentioning any guarantees of generating the exact sequence of points across all hardware and software versions.

address most feedback from Darafei's review
further updates to tests to mix generations with and without seeds

also update example in docs to use a deterministic seed

@strk strk closed this in 68d13d2 Feb 12, 2019

@Komzpa

This comment has been minimized.

Copy link
Member

commented Feb 12, 2019

Thanks! Merged this to trunk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.