Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify source time tuple so that it stays within bounds before sampling from distribution #53

Merged
merged 4 commits into from
Jan 30, 2020

Conversation

pseeth
Copy link
Collaborator

@pseeth pseeth commented Apr 2, 2019

Previously, if the sampled source_time + event_duration exceeded source_duration, the new source_time would simply be source_duration - event_duration (e.g. the last event_duration seconds of the audio file). While this is somewhat okay and unnoticeable for shorter audio files, it works poorly for longer audio files (e.g. music stems).

Additionally if a user specifies a distribution like uniform, normal, or trunc_norm, but sets the max or the mean such that it exceeds the duration of the file, the returned sampled source_time becomes a const (of source_duration - event_time). This change makes it so that they are still sampling from the requested distribution but now within safe bounds.

The bulk of this change is in a new function called _modify_source_time in core.py.


This change is Reviewable

…me cases if the user specifies a max time larger than source_duration - event_duration. passes existing test cases
@coveralls
Copy link

coveralls commented Apr 2, 2019

Coverage Status

Coverage remained the same at 100.0% when pulling eada1c8 on pseeth:source_time_resampling into ec5a5f6 on justinsalamon:master.

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 2, 2019

source_time tuples should also be modified if the minimum is out of bounds. For lack of a better idea, if the minimum + event_duration > source_duration, then the min is set to 0. So if both the minimum and maximum are out of bounds, then the source_time is sampled between 0 and source_duration - event_duration. This also ensures that legal source_time tuple inputs stay legal after going through _modify_source_time.

Finally for the normal and truncnorm distributions, the result can still be out of bounds if you set mean to source_duration - event_duration. In these cases, the code at the moment just falls back to the old method of simply setting source_time = source_duration - event_duration.

@justinsalamon
Copy link
Owner

I'm more or less following, but it would be good to get some clarifications:

Right now, if you ask for an event_duration such that that source_time + event_duration > source_duration, source_time is set to source_duration - event_duration. The logic behind this is as follows:

  1. event_duration always takes top priority. I.e. we always try to accommodate the user specified event duration, even at the cost of modifying source_time.
  2. If source_time is too large to satisfy event_duration, we move it backwards in time under the assumption that we want to keep the modified source_time as close as possible to the specified source_time.
  3. If source_duration - event_duration is negative, it means there is no way to satisfy event_duration. Only in this case do we set source_time to 0 and event_duration to source_duration.

If I understanding correctly, the problem you are encountering is that under the current implementation, you often end up with the same source_time = source_duration - event_duration which is undesired.

As such, what you are proposing is to first calculate the valid range for source_time such that event_time is respected, and then sample from this valid range. That is, first compute the valid range and then sample, instead of first sampling and then checking if the result is valid.

Is this correct?

Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't address this just yet, please see my subsequent comment on this PR.

elif source_time[0] == 'choose':
for i, t in enumerate(source_time[1]):
if t + event_duration > source_duration:
source_time[1][i] = max(0, source_duration - event_duration)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the following case:

  • source_duration = 5
  • event_duration = 3
  • source_time[1] = [0, 1, 2, 3, 4, 5]

This current code will update the array to [0, 1, 2, 2, 2, 2], meaning it's much more likely for the user to get a source_time of 2 as opposed to 0 or 1. Presumably the user wants a value chosen at random from all possible values, i.e. [0, 1, 2], with equal likelihood.

To address this, should we not remove all duplicates from the modified source_time[1] array?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, we could just make it list(set(source_time[1]))?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth looks like we're still missing the conversion to list(set(source_time[1])) ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth coming back to this!

Please add the missing duplicate removal line: list(set(source_time[1])) :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eek, I believe that's not the right place for this. The current location will result in the loop iterating over different list indices compared to those of source_time[1] which I think leads to erroneous behavior.

I meant we should add source_time[1] = list(set(source_time[1])) after the loop which updates each value in the list, i.e. right after line 398 and outside the scope of the for loop (but inside the scope of the elif.

@justinsalamon
Copy link
Owner

justinsalamon commented Apr 6, 2019

@pseeth I had a quick look at your changes and I think they're fine, but it seems to me like there's a bigger discussion to be had first about the behavior on scaper when faced with a combination of event_duration, source_time and source_duration that can't be satisfied.

Right now the behavior is:

  1. Prioritize event_duration, at the cost of source_time

But, other options are possible, such as:

  1. Prioritize source_time, at the cost of event_duration.

My inclination is to stick with (1), but I wanted to run it by you first.

A related issue is how we prioritize event_time versus event_duration. Right now event_duration is prioritized such that if event_time + event_duration > soundscape_duration, event_time get modified to soundscape_duration - event_duration. But I wonder if that's the right play here? Perhaps it would make more sense to prioritize event_time over event_duration? This would mean the event_time would be sampled from the specified distribution, no issues, and then if it ends up going over the soundscape duration it would just be trimmed.

It would be good to get your take on these 2 issues prior to moving forward. To summarize:
A. event_duration vs source_time
B. event_duration vs soundscape_duration.

My inclination with (A) is to keep it as is right now, i.e. prioritize event_duration. I'm less clear about (B).

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 6, 2019

Hey @justinsalamon, yeah your initial comment was correct and that's how I've implemented it. The reasoning is that I don't know (or particularly care) the lengths of all of the audio files in my dataset. The more important thing (at least for my purposes) is that what is generated by Scaper is sufficiently random while still sampling from all the possible data. I agree that event_duration should be prioritized over source_time. I think prioritizing event_time over event_duration could result in clipped or cut-off audio towards the end of soundscapes.

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 7, 2019

Just wanted to elaborate slightly.

Say a user requests ('uniform', 0, 300) for source_time. The source audio files are all of varying length, some are around 10 seconds and some are around 300 seconds and some exceed 300 seconds. Say the requested event duration is 1 second. Here's what currently happens:

  1. 10 second source file. The source time is sampled from (0, 300) uniformly, but the probability of getting 10 - 1 = 9 for the start time outweighs any other start time as when sampling from (0,300), getting a number in (10, 300) results in that happening. I think it's reasonable to assume that the user is valuing the distribution over source_time (e.g. they want to draw uniformly from each audio file).
  2. 300 second source file, the current behavior is unchanged.
  3. > 300 second source file, the current behavior is unchanged.

If a user knows the duration of all of their source files, they can fix it on their end when making an event.

@justinsalamon
Copy link
Owner

I think prioritizing event_time over event_duration could result in clipped or cut-off audio towards the end of soundscapes.

It definitely would.

Basically there's no way to satisfy both the desired distribution of start times and the desired distribution of event durations because the end of the soundscape forces you to prioritize one over the other: either respect event durations (as it is now) at the expense of the start time distribution, or prioritize the distribution of start times at the cost of even durations, as some events will get clipped if they're toward the end of soundscape.

If you think it makes sense to keep the current priorities (event duration > event time) I'm fine with that.

scaper/core.py Outdated
if source_time[2] + event_duration > source_duration:
source_time[2] = max(0, source_duration - event_duration)
if source_time[1] + event_duration > source_duration:
source_time[1] = 0
Copy link
Owner

@justinsalamon justinsalamon Apr 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why set the lower bound to 0? It seems arbitrary? Say my event_duration is 5, the source duration is 10, and I specify source_time as (uniform, 6, 10). 10 (the upper bound) is out of range, so it gets adjusted to 10 - 5 = 5. 6 (the lower bound) is also out of range, but valid values for the lower bound include anything in the range [0, 5] - why set it to zero?

Setting the lower bound to 0 maximizes the range over which you can sample. But... it's also the furthest away from what the user specified. One could argue that a better option would be to set the lower bound to the highest value that's still in range, so as to remain as close as possible to the distribution specified by the user. So for the above example you'd end up with (uniform, 6, 10) --> (uniform, 5, 5).

Note that this formulation doesn't always default to "use the end of the clip". For example if you have a source duration of 30, and you specify (uniform, 5, 30) it would modify to (uniform, 5, 25).

Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, this makes sense. I think the reason I changed it is because (uniform, 5, 5) throws an error as the high has to be strictly greater than the low in the call to the uniform random generator. So I just said whatever and set it to 0. But I could set it to like (uniform, 5 - eps, 5) or something?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That feels a little hacky. A better option would be to add a check to see if the lower and upper bounds are identical and if so just return that value, otherwise sample from the distribution.

scaper/core.py Outdated
if source_time[1] + event_duration > source_duration:
source_time[1] = max(0, source_duration - event_duration)
if source_time[3] + event_duration > source_duration:
source_time[3] = 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, not sure about defaulting to 0 for the lower bound.

scaper/core.py Outdated Show resolved Hide resolved
scaper/core.py Outdated
@@ -358,6 +359,72 @@ def _validate_distribution(dist_tuple):
'number that is equal to or greater than trunc_min.')


def _modify_source_time(source_time, source_duration, event_duration):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hate to be a stickler, but I feel the name _modify_source_time a little too ambiguous... how about _adjust_source_time_for_file_duration or _ensure_satisfiable_source_time or something along those lines? It's a bit of a mouthful but the function is only called once or twice and I'd rather have a long but clear name than something short but too ambiguous. Cheers!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about _ensure_valid_source_time_tuple?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not keen on including the term valid because scaper already has a notion of tuple validity (a tuple that satisfies the expected syntax), which is unrelated to whether the tuple can be satisfied given the specific constraints of a source file (I purposely avoided valid in my suggestions for this reason).

How about _ensure_satisfiable_source_time_tuple? It's a mouthful but it's clear.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fine and makes sense. I'll edit!

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 9, 2019

Made some edits if you can review @justinsalamon. I added a test case for _ensure_valid_source_time_tuple_bounds (a name I just picked, easy to change if you have objections!). I used your review to make the test case, hopefully it's doing what we want now!

Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments to address. Once we're happy with all changes (I'll confirm) I'll ask you squash the commits in this PR to a minimal set that covers the main changes.

scaper/core.py Outdated Show resolved Hide resolved
scaper/core.py Outdated Show resolved Hide resolved
elif source_time[0] == 'choose':
for i, t in enumerate(source_time[1]):
if t + event_duration > source_duration:
source_time[1][i] = max(0, source_duration - event_duration)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth looks like we're still missing the conversion to list(set(source_time[1])) ?

scaper/core.py Outdated Show resolved Hide resolved
scaper/core.py Outdated Show resolved Hide resolved
scaper/core.py Outdated Show resolved Hide resolved
scaper/core.py Show resolved Hide resolved
tests/test_core.py Outdated Show resolved Hide resolved
@justinsalamon
Copy link
Owner

@pseeth let's be sure to discuss "open questions" (e.g. naming functions) before they're implemented to avoid redundant code review cycles :)

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 10, 2019

Will do. For some reason, can't comment directly but the list(set(source_times)) fix isn't as straightforward as I thought. Picture that a user wants to actually weight the draw and puts in [0, 0, 2, 2, 2, 10] or something, to give a 3/5s chance of getting 2 and a 2/5s chance of getting 0, and then 10 would be out of bounds, say. The set solution would get rid of this and not warn them.

@pseeth
Copy link
Collaborator Author

pseeth commented Apr 10, 2019

Had a bit of time so I just went ahead and addressed the comments. Instead of using list(set(...)), I wrote something that instead prevents duplicate times being added for out of bounds items in choose. Also renamed the function as requested.

Edit: gah, coverage fell and I didn't notice. Fixing now!

Turns out I named the test wrong and the code wasn't being tested! Fixed now.

@justinsalamon
Copy link
Owner

Instead of using list(set(...)), I wrote something that instead prevents duplicate times being added for out of bounds items in choose.

This workaround has an issue: say the user wants to weight the list as you say, and they provide [0, 1, 2, 3, 3, 3] but the value 3 is out of bounds. Then all the 3s would get adjusted but duplicates would be removed, so now the bias towards 3 is gone. Worse still, you end up with mixed semantics where values below the upper boundary can have duplicates and values above it cannot.

I think it might be cleaner to just specify that any duplicates in "choose" will be automatically removed (and we move the duplicate removal code somewhere upstream) such that all values are sampled uniformly. We could then, if there's need for it, introduce a new distribution tuple (e.g. "discrete") where the user can provide a discrete list of values and corresponding weights (e.g. [("a", 1), ("b", 1), ("c", 10)]).

@justinsalamon
Copy link
Owner

For example we could change https://github.com/justinsalamon/scaper/blob/master/scaper/core.py#L27 from:
"choose": lambda x: random.choice(x),
to:
"choose": lambda x: random.choice(np.unique(x)),

@pseeth
Copy link
Collaborator Author

pseeth commented Jun 1, 2019

Bump, if you have time, @justinsalamon.

Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth generally looks great, I commented on a bunch of really minor things that should be super quick to fix.

One more thing - we should make a new release with this PR, i.e. bump the version number. Since this PR changes behavior (not just a bug fix), let's bump to version 1.1.0. This means updating version.py

With that I believe we'll be ready to merge!

elif source_time[0] == 'choose':
for i, t in enumerate(source_time[1]):
if t + event_duration > source_duration:
source_time[1][i] = max(0, source_duration - event_duration)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth coming back to this!

Please add the missing duplicate removal line: list(set(source_time[1])) :)

scaper/core.py Show resolved Hide resolved
scaper/core.py Outdated
'{:s} source time tuple went out of bounds '
'source duration ({:.2f}) and event duration ({:.2f}), modified source time '
'tuple to stay within bounds'.format(
label, source_duration, event_duration),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's just be sure to update this warning to also print out the updated source_duration tuple

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment on the previous thing about the warning.

scaper/core.py Outdated

# If it's a uniform distribution, tuple must be of length 3, We change the 3rd
# item to source_duration - event_duration so that we stay in bounds. If the min
# out of bounds, we change it to be source_duration - event_duration - eps.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there's no use of eps anymore, we can update the comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -425,6 +425,61 @@ def __test_bad_tuple_list(tuple_list):
__test_bad_tuple_list(badargs)


def test_ensure_satisfiable_source_time_tuple():
# Documenting the expected behavior of _ensure_valid_source_time_tuples
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ensure_satisfiable_source_time_tuple (not valid)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! Fixed. Good catch.

_test_dist = ('truncnorm', 8, 1, 4, 10)
_adjusted, warn = scaper.core._ensure_satisfiable_source_time_tuple(
_test_dist, source_duration, event_duration)
#rtol = 2e-2 to account for eps = 1e-1 in _ensure_valid_source_time_tuple
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no eps so no rtol, right? remove comment?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

scaper/core.py Show resolved Hide resolved
Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth almost there but not quite, couple of important fixes required.

@@ -3,4 +3,4 @@
"""Version info"""

short_version = '1.0'
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pseeth need to bump short version to 1.1 as well

scaper/core.py Show resolved Hide resolved
scaper/core.py Show resolved Hide resolved
elif source_time[0] == 'choose':
for i, t in enumerate(source_time[1]):
if t + event_duration > source_duration:
source_time[1][i] = max(0, source_duration - event_duration)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eek, I believe that's not the right place for this. The current location will result in the loop iterating over different list indices compared to those of source_time[1] which I think leads to erroneous behavior.

I meant we should add source_time[1] = list(set(source_time[1])) after the loop which updates each value in the list, i.e. right after line 398 and outside the scope of the for loop (but inside the scope of the elif.

…bump (+11 squashed commits)

Squashed commits:
[f1fa344] tab
[3d3ff52] pytest-faulthandler...
[650b81f] typo??
[ddf9166] third try
[73d3442] second try
[776a7ef] first try
[097b005] typo
[508ec8c] only bumping numpy
[a3347fc] bumping numpy
[906102a] fixing versions
[c1587b4] fixing the test to match remove duplicates behavior
fixing versions

bumping numpy

only bumping numpy

typo

first try

second try

third try

typo??

pytest-faulthandler...

tab

updated changelog
Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 9 files at r3.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @justinsalamon and @pseeth)

Copy link
Owner

@justinsalamon justinsalamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants