Modify source time tuple so that it stays within bounds before sampling from distribution #53

pseeth · 2019-04-02T05:34:51Z

Previously, if the sampled source_time + event_duration exceeded source_duration, the new source_time would simply be source_duration - event_duration (i.e. the event would be the last event_duration seconds of the audio file). While this is somewhat okay and unnoticeable for shorter audio files, it works poorly for longer audio files (e.g. music stems).

Additionally, if a user specifies a distribution like uniform, normal, or truncnorm, but sets the max or the mean such that it exceeds the duration of the file, the returned sampled source_time becomes a constant (source_duration - event_duration). This change makes it so that they are still sampling from the requested distribution, but now within safe bounds.

The bulk of this change is in a new function called _modify_source_time in core.py.

coveralls · 2019-04-02T06:04:55Z

Coverage remained the same at 100.0% when pulling eada1c8 on pseeth:source_time_resampling into ec5a5f6 on justinsalamon:master.

pseeth · 2019-04-02T06:10:53Z

source_time tuples should also be modified if the minimum is out of bounds. For lack of a better idea, if the minimum + event_duration > source_duration, then the min is set to 0. So if both the minimum and maximum are out of bounds, then the source_time is sampled between 0 and source_duration - event_duration. This also ensures that legal source_time tuple inputs stay legal after going through _modify_source_time.

Finally for the normal and truncnorm distributions, the result can still be out of bounds if you set mean to source_duration - event_duration. In these cases, the code at the moment just falls back to the old method of simply setting source_time = source_duration - event_duration.

justinsalamon · 2019-04-06T00:35:52Z

I'm more or less following, but it would be good to get some clarifications:

Right now, if you ask for an event_duration such that source_time + event_duration > source_duration, source_time is set to source_duration - event_duration. The logic behind this is as follows:

event_duration always takes top priority. I.e. we always try to accommodate the user specified event duration, even at the cost of modifying source_time.
If source_time is too large to satisfy event_duration, we move it backwards in time under the assumption that we want to keep the modified source_time as close as possible to the specified source_time.
If source_duration - event_duration is negative, it means there is no way to satisfy event_duration. Only in this case do we set source_time to 0 and event_duration to source_duration.

If I understand correctly, the problem you are encountering is that under the current implementation you often end up with the same source_time = source_duration - event_duration, which is undesired.

As such, what you are proposing is to first calculate the valid range for source_time such that event_time is respected, and then sample from this valid range. That is, first compute the valid range and then sample, instead of first sampling and then checking if the result is valid.
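To make the two orderings concrete, here is a minimal sketch for the uniform case. The function names are hypothetical and are not scaper's API, and the handling of an out-of-range lower bound is an assumption made only for illustration.

```python
import random


def sample_then_clamp(low, high, source_duration, event_duration, rng):
    """Current ordering: sample first, then pull the result back in bounds."""
    source_time = rng.uniform(low, high)
    if source_time + event_duration > source_duration:
        source_time = max(0, source_duration - event_duration)
    return source_time


def clamp_then_sample(low, high, source_duration, event_duration, rng):
    """Proposed ordering: shrink the range first, then sample from it."""
    latest_start = max(0, source_duration - event_duration)
    high = min(high, latest_start)
    low = min(low, latest_start)  # assumption: clamp the lower bound too
    if low == high:
        # The range has collapsed to a single value, nothing to sample.
        return low
    return rng.uniform(low, high)


rng = random.Random(0)
# Requested ('uniform', 0, 300) on a 10 second file with a 1 second event.
old = [sample_then_clamp(0, 300, 10, 1, rng) for _ in range(1000)]
new = [clamp_then_sample(0, 300, 10, 1, rng) for _ in range(1000)]
```

With the first ordering most draws collapse onto the same start time, while the second keeps the draws spread over the range that can actually be satisfied.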

Is this correct?

justinsalamon

Don't address this just yet, please see my subsequent comment on this PR.

justinsalamon · 2019-04-06T00:45:16Z

scaper/core.py

+    elif source_time[0] == 'choose':
+        for i, t in enumerate(source_time[1]):
+            if t + event_duration > source_duration:
+                source_time[1][i] = max(0, source_duration - event_duration)


Consider the following case:

source_duration = 5

event_duration = 3

source_time[1] = [0, 1, 2, 3, 4, 5]

The current code will update the array to [0, 1, 2, 2, 2, 2], meaning it's much more likely for the user to get a source_time of 2 as opposed to 0 or 1. Presumably the user wants a value chosen at random from all possible values, i.e. [0, 1, 2], with equal likelihood.

To address this, should we not remove all duplicates from the modified source_time[1] array?
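The example above can be worked through in a few lines. This is only an illustration of the arithmetic, reusing the condition from the diff; the variable names are made up for the sketch.

```python
source_duration = 5
event_duration = 3
times = [0, 1, 2, 3, 4, 5]

latest_start = max(0, source_duration - event_duration)  # 2

# Same condition as in the diff: values that no longer fit are reset.
clamped = [latest_start if t + event_duration > source_duration else t
           for t in times]
# clamped is now [0, 1, 2, 2, 2, 2]

# Chance of drawing 2 when choosing uniformly from the clamped list.
p_two_with_duplicates = clamped.count(2) / len(clamped)  # 4 out of 6

# After removing duplicates every remaining start time is equally likely.
unique_times = sorted(set(clamped))
p_two_without_duplicates = unique_times.count(2) / len(unique_times)  # 1 out of 3
```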

That's a good idea, we could just make it list(set(source_time[1]))?

justinsalamon · 2019-04-06T01:02:25Z

@pseeth I had a quick look at your changes and I think they're fine, but it seems to me like there's a bigger discussion to be had first about the behavior of scaper when faced with a combination of event_duration, source_time and source_duration that can't be satisfied.

Right now the behavior is:

1. Prioritize event_duration, at the cost of source_time.

But, other options are possible, such as:

2. Prioritize source_time, at the cost of event_duration.

My inclination is to stick with (1), but I wanted to run it by you first.

A related issue is how we prioritize event_time versus event_duration. Right now event_duration is prioritized such that if event_time + event_duration > soundscape_duration, event_time gets modified to soundscape_duration - event_duration. But I wonder if that's the right play here? Perhaps it would make more sense to prioritize event_time over event_duration? This would mean the event_time would be sampled from the specified distribution, no issues, and then if the event ends up going over the soundscape duration it would just be trimmed.

It would be good to get your take on these 2 issues prior to moving forward. To summarize:
A. event_duration vs source_time
B. event_duration vs soundscape_duration.

My inclination with (A) is to keep it as is right now, i.e. prioritize event_duration. I'm less clear about (B).

pseeth · 2019-04-06T02:35:52Z

Hey @justinsalamon, yeah your initial comment was correct and that's how I've implemented it. The reasoning is that I don't know (or particularly care about) the lengths of all of the audio files in my dataset. The more important thing (at least for my purposes) is that what is generated by Scaper is sufficiently random while still sampling from all the possible data. I agree that event_duration should be prioritized over source_time. I think prioritizing event_time over event_duration could result in clipped or cut-off audio towards the end of soundscapes.

pseeth · 2019-04-07T21:29:49Z

Just wanted to elaborate slightly.

Say a user requests ('uniform', 0, 300) for source_time. The source audio files are all of varying length, some are around 10 seconds and some are around 300 seconds and some exceed 300 seconds. Say the requested event duration is 1 second. Here's what currently happens:

10 second source file: the source time is sampled uniformly from (0, 300), but any draw in (9, 300) gets reset to 10 - 1 = 9, so a start time of 9 is far more likely than any other. I think it's reasonable to assume that the user values the distribution over source_time (e.g. they want to draw uniformly from each audio file).
300 second source file: the current behavior is unchanged.
Source file longer than 300 seconds: the current behavior is unchanged.

If a user knows the duration of all of their source files, they can fix it on their end when making an event.
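The size of the bias in the 10 second case can be estimated directly. The snippet below is a small worked calculation plus a quick simulation; it only uses the numbers from the example above and does not call scaper.

```python
import random

source_duration, event_duration = 10.0, 1.0
low, high = 0.0, 300.0  # the requested ('uniform', 0, 300)

latest_start = source_duration - event_duration  # 9.0

# Any draw above latest_start is reset to latest_start under the old behavior.
p_reset = (high - latest_start) / (high - low)  # 291 / 300 = 0.97

# Quick simulation of the same quantity.
rng = random.Random(0)
n = 10_000
draws = [rng.uniform(low, high) for _ in range(n)]
simulated = sum(d > latest_start for d in draws) / n

print(f"analytic: {p_reset:.2f}, simulated: {simulated:.2f}")
```

So roughly 97% of events drawn from a 10 second file would start at exactly 9 seconds, regardless of the requested distribution.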

justinsalamon · 2019-04-08T03:31:14Z

I think prioritizing event_time over event_duration could result in clipped or cut-off audio towards the end of soundscapes.

It definitely would.

Basically there's no way to satisfy both the desired distribution of start times and the desired distribution of event durations, because the end of the soundscape forces you to prioritize one over the other: either respect event durations (as it is now) at the expense of the start time distribution, or prioritize the distribution of start times at the cost of event durations, as some events will get clipped if they're toward the end of the soundscape.

If you think it makes sense to keep the current priorities (event duration > event time) I'm fine with that.

justinsalamon · 2019-04-08T03:43:01Z

scaper/core.py

+        if source_time[2] + event_duration > source_duration:
+            source_time[2] = max(0, source_duration - event_duration)
+        if source_time[1] + event_duration > source_duration:
+            source_time[1] = 0


Why set the lower bound to 0? It seems arbitrary? Say my event_duration is 5, the source duration is 10, and I specify source_time as (uniform, 6, 10). 10 (the upper bound) is out of range, so it gets adjusted to 10 - 5 = 5. 6 (the lower bound) is also out of range, but valid values for the lower bound include anything in the range [0, 5] - why set it to zero?

Setting the lower bound to 0 maximizes the range over which you can sample. But... it's also the furthest away from what the user specified. One could argue that a better option would be to set the lower bound to the highest value that's still in range, so as to remain as close as possible to the distribution specified by the user. So for the above example you'd end up with (uniform, 6, 10) --> (uniform, 5, 5).

Note that this formulation doesn't always default to "use the end of the clip". For example, if you have a source duration of 30 and an event_duration of 5, and you specify (uniform, 5, 30), it would be modified to (uniform, 5, 25).

Thoughts?

Sure, this makes sense. I think the reason I changed it is because (uniform, 5, 5) throws an error as the high has to be strictly greater than the low in the call to the uniform random generator. So I just said whatever and set it to 0. But I could set it to like (uniform, 5 - eps, 5) or something?

That feels a little hacky. A better option would be to add a check to see if the lower and upper bounds are identical and if so just return that value, otherwise sample from the distribution.
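A minimal sketch of that suggestion, assuming both bounds are pulled back to the latest start time that still fits. The helper names are hypothetical and this is not the code in the PR.

```python
import random


def adjust_uniform(low, high, source_duration, event_duration):
    """Pull out-of-range bounds back to the latest start time that still fits."""
    latest_start = max(0, source_duration - event_duration)
    if high + event_duration > source_duration:
        high = latest_start
    if low + event_duration > source_duration:
        low = latest_start
    return low, high


def sample_uniform(low, high, rng=random):
    """Return the shared value if the range has collapsed, otherwise sample."""
    if low == high:
        return low
    return rng.uniform(low, high)


# The two examples from this thread, both with event_duration = 5.
collapsed = adjust_uniform(6, 10, source_duration=10, event_duration=5)
trimmed = adjust_uniform(5, 30, source_duration=30, event_duration=5)
```

Here `collapsed` is `(5, 5)`, which `sample_uniform` returns as the constant 5 without calling the random generator, and `trimmed` is `(5, 25)`.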

justinsalamon · 2019-04-08T03:45:16Z

scaper/core.py

+        if source_time[1] + event_duration > source_duration:
+            source_time[1] = max(0, source_duration - event_duration)
+        if source_time[3] + event_duration > source_duration:
+            source_time[3] = 0


Same as before, not sure about defaulting to 0 for the lower bound.

justinsalamon · 2019-04-08T03:58:28Z

scaper/core.py

@@ -358,6 +359,72 @@ def _validate_distribution(dist_tuple):
                'number that is equal to or greater than trunc_min.')


+def _modify_source_time(source_time, source_duration, event_duration):


Hate to be a stickler, but I feel the name _modify_source_time is a little too ambiguous... how about _adjust_source_time_for_file_duration or _ensure_satisfiable_source_time or something along those lines? It's a bit of a mouthful but the function is only called once or twice and I'd rather have a long but clear name than something short but too ambiguous. Cheers!

How about _ensure_valid_source_time_tuple?

not keen on including the term valid because scaper already has a notion of tuple validity (a tuple that satisfies the expected syntax), which is unrelated to whether the tuple can be satisfied given the specific constraints of a source file (I purposely avoided valid in my suggestions for this reason).

How about _ensure_satisfiable_source_time_tuple? It's a mouthful but it's clear.

Yeah that's fine and makes sense. I'll edit!

pseeth · 2019-04-09T04:31:17Z

Made some edits if you can review @justinsalamon. I added a test case for _ensure_valid_source_time_tuple_bounds (a name I just picked, easy to change if you have objections!). I used your review to make the test case, hopefully it's doing what we want now!

justinsalamon

Added some comments to address. Once we're happy with all changes (I'll confirm) I'll ask you to squash the commits in this PR to a minimal set that covers the main changes.

justinsalamon · 2019-04-09T23:02:43Z

scaper/core.py

+    elif source_time[0] == 'choose':
+        for i, t in enumerate(source_time[1]):
+            if t + event_duration > source_duration:
+                source_time[1][i] = max(0, source_duration - event_duration)


@pseeth looks like we're still missing the conversion to list(set(source_time[1])) ?

justinsalamon · 2019-04-09T23:43:29Z

@pseeth let's be sure to discuss "open questions" (e.g. naming functions) before they're implemented to avoid redundant code review cycles :)

pseeth · 2019-04-10T00:10:37Z

Will do. For some reason I can't comment directly, but the list(set(source_times)) fix isn't as straightforward as I thought. Picture a user who actually wants to weight the draw and puts in [0, 0, 2, 2, 2, 10] or something, to give a 3/5 chance of getting 2 and a 2/5 chance of getting 0, where 10 would be out of bounds, say. The set solution would get rid of this weighting and not warn them.
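A quick illustration of the concern, using the list from the comment above. It only shows what deduplication does to the repeated entries and leaves out the bounds adjustment.

```python
from collections import Counter

weighted = [0, 0, 2, 2, 2, 10]

# The repeated entries act as implicit weights when choosing uniformly.
counts = Counter(weighted)
# counts[2] is 3, counts[0] is 2, counts[10] is 1

# Deduplicating keeps the values but silently drops the weighting.
deduped = sorted(set(weighted))
counts_after = Counter(deduped)
# every value in counts_after now has a count of 1
```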

pseeth · 2019-04-10T03:22:47Z

Had a bit of time so I just went ahead and addressed the comments. Instead of using list(set(...)), I wrote something that instead prevents duplicate times being added for out of bounds items in choose. Also renamed the function as requested.

Edit: gah, coverage fell and I didn't notice. Fixing now!

Turns out I named the test wrong and the code wasn't being tested! Fixed now.

justinsalamon · 2019-04-10T19:40:30Z

Instead of using list(set(...)), I wrote something that instead prevents duplicate times being added for out of bounds items in choose.

This workaround has an issue: say the user wants to weight the list as you say, and they provide [0, 1, 2, 3, 3, 3] but the value 3 is out of bounds. Then all the 3s would get adjusted but duplicates would be removed, so now the bias towards 3 is gone. Worse still, you end up with mixed semantics where values below the upper boundary can have duplicates and values above it cannot.

I think it might be cleaner to just specify that any duplicates in "choose" will be automatically removed (and we move the duplicate removal code somewhere upstream) such that all values are sampled uniformly. We could then, if there's need for it, introduce a new distribution tuple (e.g. "discrete") where the user can provide a discrete list of values and corresponding weights (e.g. [("a", 1), ("b", 1), ("c", 10)]).
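As a rough idea of what such a weighted tuple could sample like, here is a sketch built on the standard library's random.choices. The "discrete" tuple does not exist in scaper at this point, and the function name and the (value, weight) pair layout are assumptions taken from the example above.

```python
import random


def sample_discrete(pairs, rng=random):
    """Draw one value from (value, weight) pairs; weights need not sum to 1."""
    values = [value for value, _ in pairs]
    weights = [weight for _, weight in pairs]
    # random.choices returns a list of k items, so take the single element.
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)
pairs = [("a", 1), ("b", 1), ("c", 10)]
draws = [sample_discrete(pairs, rng) for _ in range(1200)]
```

With these weights "c" should come up about ten times as often as "a" or "b", and the weighting is explicit rather than encoded through repeated list entries.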

justinsalamon · 2019-04-10T19:45:18Z

For example we could change https://github.com/justinsalamon/scaper/blob/master/scaper/core.py#L27 from:
"choose": lambda x: random.choice(x),
to:
"choose": lambda x: random.choice(np.unique(x)),

pseeth · 2019-06-01T09:00:28Z

Bump, if you have time, @justinsalamon.

justinsalamon

@pseeth generally looks great, I commented on a bunch of really minor things that should be super quick to fix.

One more thing - we should make a new release with this PR, i.e. bump the version number. Since this PR changes behavior (not just a bug fix), let's bump to version 1.1.0. This means updating version.py

With that I believe we'll be ready to merge!

justinsalamon · 2020-01-29T16:35:50Z

scaper/core.py

+    elif source_time[0] == 'choose':
+        for i, t in enumerate(source_time[1]):
+            if t + event_duration > source_duration:
+                source_time[1][i] = max(0, source_duration - event_duration)


@pseeth coming back to this!

Please add the missing duplicate removal line: list(set(source_time[1])) :)

justinsalamon · 2020-01-29T21:44:02Z

scaper/core.py

+                    '{:s} source time tuple went out of bounds '
+                    'source duration ({:.2f}) and event duration ({:.2f}), modified source time '
+                    'tuple to stay within bounds'.format(
+                        label, source_duration, event_duration),


Let's just be sure to update this warning to also print out the updated source_time tuple.

See my comment on the previous thing about the warning.
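One way the warning could report the adjusted tuple is sketched below. The function name and message wording are illustrative only, and the sketch uses the built-in UserWarning rather than any scaper-specific warning class.

```python
import warnings


def warn_adjusted(label, original, adjusted, source_duration, event_duration):
    """Emit a warning that reports both the original and the adjusted tuple."""
    warnings.warn(
        '{:s} source time tuple {} went out of bounds for source duration '
        '({:.2f}) and event duration ({:.2f}), modified source time tuple '
        'to {}'.format(label, original, source_duration, event_duration,
                       adjusted),
        UserWarning)


# Example call with a hypothetical label and the tuples discussed earlier.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_adjusted('siren', ('uniform', 6, 10), ('uniform', 5, 5), 10, 5)
```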

justinsalamon · 2020-01-29T21:47:17Z

scaper/core.py

+
+    # If it's a uniform distribution, tuple must be of length 3, We change the 3rd
+    # item to source_duration - event_duration so that we stay in bounds. If the min
+    # out of bounds, we change it to be source_duration - event_duration - eps.


I believe there's no use of eps anymore, we can update the comment

justinsalamon · 2020-01-29T21:48:25Z

tests/test_core.py

@@ -425,6 +425,61 @@ def __test_bad_tuple_list(tuple_list):
    __test_bad_tuple_list(badargs)


+def test_ensure_satisfiable_source_time_tuple():
+    # Documenting the expected behavior of _ensure_valid_source_time_tuples


_ensure_satisfiable_source_time_tuple (not valid)

Yep! Fixed. Good catch.

justinsalamon · 2020-01-29T21:48:40Z

tests/test_core.py

+    _test_dist = ('truncnorm', 8, 1, 4, 10)
+    _adjusted, warn = scaper.core._ensure_satisfiable_source_time_tuple(
+        _test_dist, source_duration, event_duration)
+    #rtol = 2e-2 to account for eps = 1e-1 in _ensure_valid_source_time_tuple


no eps so no rtol, right? remove comment?

justinsalamon

@pseeth almost there but not quite, couple of important fixes required.

justinsalamon · 2020-01-30T17:53:49Z

scaper/version.py

@@ -3,4 +3,4 @@
 """Version info"""

 short_version = '1.0'


@pseeth need to bump short version to 1.1 as well

justinsalamon · 2020-01-30T18:00:52Z

scaper/core.py

+    elif source_time[0] == 'choose':
+        for i, t in enumerate(source_time[1]):
+            if t + event_duration > source_duration:
+                source_time[1][i] = max(0, source_duration - event_duration)


Eek, I believe that's not the right place for this. The current location will result in the loop iterating over different list indices compared to those of source_time[1] which I think leads to erroneous behavior.

I meant we should add source_time[1] = list(set(source_time[1])) after the loop which updates each value in the list, i.e. right after line 398 and outside the scope of the for loop (but inside the scope of the elif).
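The placement being asked for can be sketched as follows. The function name is hypothetical, the list is copied rather than modified in place, and sorted() is used instead of list() only so that the output order is deterministic.

```python
def adjust_choose(times, source_duration, event_duration):
    """Reset out-of-bounds start times, then drop duplicates once at the end."""
    latest_start = max(0, source_duration - event_duration)
    adjusted = list(times)  # work on a copy so the caller's list is untouched
    for i, t in enumerate(adjusted):
        if t + event_duration > source_duration:
            adjusted[i] = latest_start
    # Deduplicate only after every value has been updated, outside the loop,
    # so the loop indices always match the list being modified.
    return sorted(set(adjusted))


result = adjust_choose([0, 1, 2, 3, 4, 5], source_duration=5, event_duration=3)
```

For the example from earlier in the thread, `result` is `[0, 1, 2]`.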

…bump (+11 squashed commits) Squashed commits: [f1fa344] tab [3d3ff52] pytest-faulthandler... [650b81f] typo?? [ddf9166] third try [73d3442] second try [776a7ef] first try [097b005] typo [508ec8c] only bumping numpy [a3347fc] bumping numpy [906102a] fixing versions [c1587b4] fixing the test to match remove duplicates behavior

fixing versions bumping numpy only bumping numpy typo first try second try third try typo?? pytest-faulthandler... tab updated changelog

justinsalamon

Reviewed 6 of 9 files at r3.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @justinsalamon and @pseeth)

justinsalamon

Reviewable status: complete! all files reviewed, all discussions resolved

added _modify_source_time, which modifies the source_time tuple in so…

f55fe25

…me cases if the user specifies a max time larger than source_duration - event_duration. passes existing test cases

pseeth mentioned this pull request Apr 2, 2019

Adding default values for event parameters #51

Open

justinsalamon reviewed Apr 6, 2019

This was referenced Apr 6, 2019

Adding saving of sources, take 3 #55

Merged

Output sources as separate files alongside mixture #52

Closed

justinsalamon reviewed Apr 8, 2019

This was referenced Apr 8, 2019

Should be able to seed Scaper generation for reproducibility #54

Merged

Support default values for distribution tuples #41

Open

justinsalamon requested changes Apr 9, 2019

justinsalamon requested changes Jan 29, 2020

pseeth commented Jan 30, 2020

justinsalamon requested changes Jan 30, 2020

justinsalamon approved these changes Jan 30, 2020

pseeth added 2 commits January 30, 2020 15:44

fixing the test to match remove duplicates behavior

540275c

pseeth force-pushed the source_time_resampling branch from feedef1 to 540275c Compare January 30, 2020 21:49

Merge branch 'master' into source_time_resampling

eada1c8

justinsalamon approved these changes Jan 30, 2020

justinsalamon reviewed Jan 30, 2020

justinsalamon approved these changes Jan 30, 2020

justinsalamon merged commit 803a26b into justinsalamon:master Jan 30, 2020

pseeth mentioned this pull request Feb 19, 2020

Generating background from short segments #70

Closed

justinsalamon mentioned this pull request Feb 25, 2020

event_time sampling biased when selected time is adjusted due to soundscape duration #83

Open

pseeth mentioned this pull request Feb 29, 2020

Factor out distribution logic #59

Closed
