-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Luigi Stats #123
Luigi Stats #123
Conversation
- Resample subcorpuses will be required by the stats as well as finaltask - Convert to one aggregated task to keep it clean
class ResampleSubcorpuses(WorkTask): | ||
""" | ||
Aggregates resampling of all the splits and sampling rates | ||
into a single task as dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Won't this block and prevent parallelization? What's the benefit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is that
- We don't need to write the list of resampling tasks in the requirements of all other tasks after the
ResampleSubcorpus
and before theFinalizeCorpus
. Since theGenerateAudioStats
was something in the middle, I had to write the resample list in the requirements of both theGenerateAudioStats
andFinalizeCorpus
.
Resample list which is referred above -
https://github.com/neuralaudio/hear-eval-kit/blob/64fe04f24e4cd754495a95c127957c6d6504448a/heareval/tasks/pipeline.py#L579-L589
- This is consistent with how we are doing the
SubsampleSplits
which aggregate theSubsampleSplit
. This will not affect parallelism as it is just a reference to the list of resampling tasks that will run in parallel.
https://github.com/neuralaudio/hear-eval-kit/blob/8365c0d2eac8b94b9b56af6b422087b6e33eab5b/heareval/tasks/pipeline.py#L286-L318
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't block parallelization, it basically just moves all separate requires tasks from finalizeCorpus out into a single task and aggregates the results into a single requires workdir for the GenerateAudioStats which simplifies that task. Is that correct @khumairraj ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
heareval/tasks/pipeline.py
Outdated
for sr in self.sample_rates | ||
for split in splits | ||
], | ||
"stats": GenerateAudioStats( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to do this on the original folders before ANY preprocessing also. I don't think this should be a task. It should just be a function you call within tasks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya agreed - I think we really want the report on the duration percentiles before we do any preprocessing and as a function that we call so we can get insight into a dataset and decide on what duration to set for MonoWavTrimCorpus, etc.. -- so this will be mega helpful prior to building out the full pipeline.
This may be helpful as a sanity check at the end of the pipeline. But we would expect all the durations to be the same at this point.
We did discuss that it would be nice to get a stats report out detailing the number of audio files in each split and the different labels and distribution at the end of a pipeline ya?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, We can remove it from being a task. We will have to track the different statistic JSON in different work folders though, to copy them in the final task(outside _workdir) output. In this task, we are just making one JSON, kind of the summary of what was the input and the output, and then copying this JSON over to the task dir(outside the _workdir
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I get the requirement of setting the durations for the MonoWavTrimCorpus
. We can make it a function and call in the extract metadata to save the stats in the same folder as the metadata. This is actually kind of a metadata of the input files in that case. We can keep the GenerateAudioStats
in case we want to generate any other stats along with the once we are doing. Or we can remove as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya -- I kind of like having this reported in the task dir as a summary of the input vs output -- but I think it's more useful from a debugging perspective, so maybe something we could optionally include as a task in the pipeline to help when building? @turian ?
A function that we can call on a folder of audio files that looks through recursively and reads the audio files and produces a report using the stats you have currently would be awesome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khumairraj Again, we need the for datasets at the BEGINNING (before trim) and at the end. We want it at the beginning so we can actually pick appropriate audio file lengths for each particular task.
heareval/tasks/pipeline.py
Outdated
master_stats["extracted"] = self.get_stats( | ||
self.get_metadata()["relpath"].apply(Path).tolist() | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two parts to the stats
extracted
: This is the summary of the list ofrelpath
of audio files in the metadata file which has not been processed at all.relpath
is a reference to all the extracted files(used this way because it helpes avoid looking out for differentextracted workdir
subfolders in a different dataset. The metadata always contain all the files with correct relative path)sampled
: This is a dict and contains the stats for all the partitions and the sampling rates.
heareval/tasks/util/audio.py
Outdated
@@ -69,3 +71,12 @@ def resample_wav(in_file: str, out_file: str, out_sr: int): | |||
) | |||
# Make sure the return code is 0 and the command was successful. | |||
assert ret == 0 | |||
|
|||
|
|||
def get_audiostats(in_file: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! We should include this in MonoWavTrimCorpus and only resample / convert to wav if the input needs it. If it doesn't then just create a symlink. Can do this in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I was thinking of doing this. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome thanks
class ResampleSubcorpuses(WorkTask): | ||
""" | ||
Aggregates resampling of all the splits and sampling rates | ||
into a single task as dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't block parallelization, it basically just moves all separate requires tasks from finalizeCorpus out into a single task and aggregates the results into a single requires workdir for the GenerateAudioStats which simplifies that task. Is that correct @khumairraj ?
sample_rates=self.sample_rates, | ||
data_config=self.data_config, | ||
), | ||
"resample": ResampleSubcorpuses( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have ResampleSubcorpuses a requirement here as well? It is already in the GenerateAudioStats so these will all necessarily be completed by the time we get here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we still require the work directory of the ResampleSubcorpuses
to copy the final outputs to the output folder.
heareval/tasks/pipeline.py
Outdated
for sr in self.sample_rates | ||
for split in splits | ||
], | ||
"stats": GenerateAudioStats( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya agreed - I think we really want the report on the duration percentiles before we do any preprocessing and as a function that we call so we can get insight into a dataset and decide on what duration to set for MonoWavTrimCorpus, etc.. -- so this will be mega helpful prior to building out the full pipeline.
This may be helpful as a sanity check at the end of the pipeline. But we would expect all the durations to be the same at this point.
We did discuss that it would be nice to get a stats report out detailing the number of audio files in each split and the different labels and distribution at the end of a pipeline ya?
heareval/tasks/pipeline.py
Outdated
subsample_splits (list(SubsampleSplit)): task subsamples each split | ||
""" | ||
|
||
sample_rates = luigi.Parameter() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
luigi.ListParameter()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it really matters -- but it should be a list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Thanks. WIll change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@khumairraj ^^
heareval/tasks/pipeline.py
Outdated
"total_audiofiles": len(durations), | ||
"mean_samplerate": np.mean(sample_rates), | ||
"mean_dur(sec)": np.mean(durations), | ||
"median_dur)sec)": np.median(durations), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mean samplerate isn't super meaningful -- I'd way rather have a list of samplerates to see which ones are present in a dataset. For the data before preprocessing having a list of formats would be cool too. There is def a chance that soundfile might not be able to read some of the formats that are present in the dataset though. (looking at you coughvid)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is a problem. In general, what will be a good way for the audio_stats
function? Currently renamed to audio_stats_wav
though. According to #104 . Maybe ffmpeg
can help, but didnot get the stats directly for a .webm
file. The following line - https://stackoverflow.com/questions/34118013/how-to-determine-webm-duration-using-ffprobe. The first comment talks about repackaging, please let me know if you already know any alternatives. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! Would be awesome to be able to run the function audio_dir_stats_wav from the command line. Something like python3 -m heareval.tasks.audio_dir_stats infile outfile ext
heareval/tasks/util/audio.py
Outdated
"-af", | ||
"aresample=resampler=soxr", | ||
# "-af", | ||
# "aresample=resampler=soxr", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's wrong with sox? Does this not work on your system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Some issue with configuring that in M1 mac. So I just remove the line, do the development and then uncomment.
heareval/tasks/util/audio.py
Outdated
"""Produce summary by recursively searching a directory for wav files""" | ||
|
||
audio_paths = list(Path(in_dir).absolute().rglob("*.wav")) | ||
audio_dir_stats = list(map(audio_stats_wav, audio_paths)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a chance this will take a while to run on larger sets? If so a tqdm progress bar would be great.
heareval/tasks/util/audio.py
Outdated
def audio_dir_stats_wav(in_dir: Union[str, Path], out_file: str): | ||
"""Produce summary by recursively searching a directory for wav files""" | ||
|
||
audio_paths = list(Path(in_dir).absolute().rglob("*.wav")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we pass in the search extension as an arg to support other formats? Will need to be one that soundfile supports, but would be cool to be able to use this for other formats. Also perhaps a case insensitive search on the extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. in the following commit.
heareval/tasks/util/audio.py
Outdated
return { | ||
"samples": len(audio), | ||
"sample_rate": audio.samplerate, | ||
"duration": round(len(audio) / audio.samplerate, 2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we rounding here? Why not keep the precise value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to keep it approximate and good in the json. But will remove it.
Sure. Have added this pushing in the following commit. |
README.md
Outdated
@@ -44,6 +44,15 @@ Options: | |||
default we resample to 16000, 22050, 44100, 48000. | |||
``` | |||
|
|||
Additionally, to check the stats of an audio directory: | |||
``` | |||
python3 -m heareval.tasks.audio_dir_stats {input folder} {output json file} {ext1} {ext2} .. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get rid of this and use .wav, .ogg, and .mp3 always
Does the following
ResampleSubCorpuses
for the resampling of the audio files. This is done as this resample will also be used in the summary stats as well as the final task. This is similar to theSubsampleSplits
aggregation.GenerateStats
for the extract and the sampled files. Statistics script for task embeddings #104