
Luigi Stats #123

Merged
merged 22 commits into from
Jul 28, 2021

Contributor

@khumairraj khumairraj commented Jul 26, 2021

Does the following

  • Adds an aggregate task, ResampleSubcorpuses, for resampling the audio files. The resampled output is used by both the summary stats and the final task, so it is aggregated in one place, similar to the SubsampleSplits aggregation.
  • Adds a task, GenerateStats, for the extracted and the sampled files. See: Statistics script for task embeddings #104

@khumairraj khumairraj requested a review from turian July 26, 2021 21:31
class ResampleSubcorpuses(WorkTask):
    """
    Aggregates resampling of all the splits and sampling rates
    into a single task as dependencies.
Contributor

Why? Won't this block and prevent parallelization? What's the benefit?

Contributor Author

The benefit is that

  • We don't need to write the list of resampling tasks in the requirements of all other tasks after the ResampleSubcorpus and before the FinalizeCorpus. Since the GenerateAudioStats was something in the middle, I had to write the resample list in the requirements of both the GenerateAudioStats and FinalizeCorpus.

The resample list referred to above:
https://github.com/neuralaudio/hear-eval-kit/blob/64fe04f24e4cd754495a95c127957c6d6504448a/heareval/tasks/pipeline.py#L579-L589

Contributor

It won't block parallelization. It just moves all the separate requires tasks from FinalizeCorpus out into a single task and aggregates the results into a single requires workdir for GenerateAudioStats, which simplifies that task. Is that correct @khumairraj?

Contributor Author

Yes.

        for sr in self.sample_rates
        for split in splits
    ],
    "stats": GenerateAudioStats(
Contributor

We want to do this on the original folders before ANY preprocessing also. I don't think this should be a task. It should just be a function you call within tasks.

Contributor

Ya agreed - I think we really want the report on the duration percentiles before we do any preprocessing, and as a function that we call, so we can get insight into a dataset and decide on what duration to set for MonoWavTrimCorpus, etc. So this will be mega helpful prior to building out the full pipeline.

This may be helpful as a sanity check at the end of the pipeline. But we would expect all the durations to be the same at this point.

We did discuss that it would be nice to get a stats report out detailing the number of audio files in each split and the different labels and distribution at the end of a pipeline ya?

Contributor Author

@khumairraj khumairraj Jul 27, 2021

Yes, we can remove it from being a task. We will have to track the different statistics JSON files in the different work folders, though, to copy them into the final task output (outside _workdir). In this task, we are just making one JSON, a summary of the input and the output, and then copying this JSON over to the task dir (outside the _workdir).

Contributor Author

But I get the requirement of setting the durations for MonoWavTrimCorpus. We can make it a function and call it in the extract metadata step to save the stats in the same folder as the metadata. In that case it is effectively metadata about the input files. We can keep GenerateAudioStats in case we want to generate any other stats along with the ones we are already producing, or we can remove it as well.

Contributor

Ya -- I kind of like having this reported in the task dir as a summary of the input vs output -- but I think it's more useful from a debugging perspective, so maybe something we could optionally include as a task in the pipeline to help when building? @turian ?

A function that we can call on a folder of audio files that looks through recursively and reads the audio files and produces a report using the stats you have currently would be awesome.

Contributor

@khumairraj Again, we need this for datasets at the BEGINNING (before trim) and at the end. We want it at the beginning so we can actually pick appropriate audio file lengths for each particular task.
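
A minimal sketch of the pre-trim report discussed in this thread: given per-file durations in seconds, it returns percentiles that could help pick a duration for MonoWavTrimCorpus. The function name `duration_percentiles` and the default percentile list are hypothetical, and the standard-library `statistics.quantiles` is used here rather than the NumPy calls in the PR.

```python
from statistics import quantiles
from typing import Dict, List, Sequence


def duration_percentiles(
    durations: Sequence[float],
    percentiles: Sequence[int] = (10, 25, 50, 75, 90),
) -> Dict[str, float]:
    """Return the requested percentiles (1..99) of durations in seconds."""
    if len(durations) < 2:
        raise ValueError("need at least two durations to compute percentiles")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts: List[float] = quantiles(durations, n=100, method="inclusive")
    return {f"p{p}": cuts[p - 1] for p in percentiles}
```

Running this on the untrimmed files and again on the final output would give the "beginning" and "end" views asked for above.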

Comment on lines 652 to 654
master_stats["extracted"] = self.get_stats(
    self.get_metadata()["relpath"].apply(Path).tolist()
)
Contributor Author

There are two parts to the stats

  • extracted: A summary of the audio files listed under relpath in the metadata file, which have not been processed at all. relpath references all the extracted files. It is used because it avoids having to look for different extracted workdir subfolders in different datasets; the metadata always contains all the files with the correct relative path.
  • sampled: This is a dict and contains the stats for all the partitions and the sampling rates.

@@ -69,3 +71,12 @@ def resample_wav(in_file: str, out_file: str, out_sr: int):
    )
    # Make sure the return code is 0 and the command was successful.
    assert ret == 0


def get_audiostats(in_file: str):
Contributor

Nice! We should include this in MonoWavTrimCorpus and only resample / convert to wav if the input needs it. If it doesn't then just create a symlink. Can do this in a separate PR.

Contributor Author

Yes. I was thinking of doing this. Will do.

Contributor

Awesome thanks
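
A rough sketch of the "symlink unless conversion is needed" idea from this thread. It is not the code from the follow-up PR: the names `is_mono_wav_at` and `link_or_convert` are hypothetical, the header check uses the standard-library `wave` module (PCM WAV only) instead of soundfile, and the converter is passed in as a callable with the same argument order as `resample_wav(in_file, out_file, out_sr)`.

```python
import os
import wave
from pathlib import Path
from typing import Callable


def is_mono_wav_at(in_file: str, target_sr: int) -> bool:
    """True if in_file is a readable mono WAV already at target_sr."""
    try:
        with wave.open(in_file, "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == target_sr
    except (wave.Error, EOFError):
        # Not a WAV the stdlib can parse, so treat it as needing conversion.
        return False


def link_or_convert(
    in_file: str,
    out_file: str,
    target_sr: int,
    convert: Callable[[str, str, int], None],
) -> str:
    """Symlink in_file to out_file when no conversion is needed.

    Returns "symlink" or "converted" so callers can log what happened.
    """
    Path(out_file).parent.mkdir(parents=True, exist_ok=True)
    if is_mono_wav_at(in_file, target_sr):
        os.symlink(os.path.abspath(in_file), out_file)
        return "symlink"
    convert(in_file, out_file, target_sr)
    return "converted"
```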

        sample_rates=self.sample_rates,
        data_config=self.data_config,
    ),
    "resample": ResampleSubcorpuses(
Contributor

Why have ResampleSubcorpuses as a requirement here as well? It is already a requirement of GenerateAudioStats, so these will all necessarily be completed by the time we get here.

Contributor Author

@khumairraj khumairraj Jul 27, 2021

Because we still require the work directory of the ResampleSubcorpuses to copy the final outputs to the output folder.

        subsample_splits (list(SubsampleSplit)): task subsamples each split
    """

    sample_rates = luigi.Parameter()
Contributor

luigi.ListParameter()

Contributor

I don't think it really matters -- but it should be a list

Contributor Author

@khumairraj khumairraj Jul 27, 2021

Sure. Thanks. Will change.

"total_audiofiles": len(durations),
"mean_samplerate": np.mean(sample_rates),
"mean_dur(sec)": np.mean(durations),
"median_dur(sec)": np.median(durations),
Contributor

mean samplerate isn't super meaningful -- I'd way rather have a list of samplerates to see which ones are present in a dataset. For the data before preprocessing having a list of formats would be cool too. There is def a chance that soundfile might not be able to read some of the formats that are present in the dataset though. (looking at you coughvid)
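
One way the suggestion above could look, as a sketch only: count how many files use each sample rate and each file extension instead of averaging. The function name `summarize_rates_and_formats` and the output keys are hypothetical and not taken from the PR.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, Tuple


def summarize_rates_and_formats(
    files: Iterable[Tuple[str, int]],
) -> Dict[str, Dict[str, int]]:
    """Count files per sample rate and per file extension.

    `files` yields (path, sample_rate) pairs.
    """
    rate_counts: Counter = Counter()
    format_counts: Counter = Counter()
    for path, sample_rate in files:
        rate_counts[str(sample_rate)] += 1
        format_counts[Path(path).suffix.lower()] += 1
    # Keys are strings so the result can be dumped to JSON as is.
    return {
        "samplerates": dict(rate_counts),
        "formats": dict(format_counts),
    }
```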

Contributor Author

Yes, this is a problem. In general, what would be a good approach for the audio_stats function? (It is currently renamed to audio_stats_wav, per #104.) Maybe ffmpeg can help, but I did not get the stats directly for a .webm file. See the following link: https://stackoverflow.com/questions/34118013/how-to-determine-webm-duration-using-ffprobe. The first comment there talks about repackaging. Please let me know if you already know of any alternatives. Thanks.
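
A hedged sketch of querying ffprobe for formats that soundfile may not read. It assumes ffprobe is installed and on PATH; the function names are hypothetical. It does not resolve the .webm issue raised above: when the container stores no duration, the field is absent or "N/A", and this sketch reports `None` rather than repackaging the file.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional


def ffprobe_command(in_file: str) -> List[str]:
    """Build an ffprobe call that prints container and audio stream info as JSON."""
    return [
        "ffprobe",
        "-v", "error",
        "-select_streams", "a:0",
        "-show_format",
        "-show_streams",
        "-of", "json",
        in_file,
    ]


def parse_ffprobe_output(raw: str) -> Dict[str, Optional[float]]:
    """Pull duration and sample rate out of ffprobe's JSON, if present."""
    info: Dict[str, Any] = json.loads(raw)
    streams = info.get("streams") or [{}]
    duration = info.get("format", {}).get("duration")
    sample_rate = streams[0].get("sample_rate")
    return {
        # Some containers do not store a duration; keep None rather than guess.
        "duration": float(duration) if duration not in (None, "N/A") else None,
        "sample_rate": float(sample_rate) if sample_rate else None,
    }


def audio_stats_ffprobe(in_file: str) -> Dict[str, Optional[float]]:
    """Run ffprobe on in_file; requires ffprobe to be on PATH."""
    result = subprocess.run(
        ffprobe_command(in_file), capture_output=True, text=True, check=True
    )
    return parse_ffprobe_output(result.stdout)
```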

@khumairraj khumairraj requested review from turian and jorshi July 27, 2021 19:57
Contributor

@jorshi jorshi left a comment

This is looking good! Would be awesome to be able to run the function audio_dir_stats_wav from the command line. Something like python3 -m heareval.tasks.audio_dir_stats infile outfile ext
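
A minimal sketch of what such a command-line entry point could look like with `argparse`. The argument names (`in_dir`, `out_file`, `exts`) and the `wav` default are assumptions for illustration; the module's actual interface may differ.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="audio_dir_stats",
        description="Summarize the audio files found under a directory.",
    )
    parser.add_argument("in_dir", help="directory to search recursively")
    parser.add_argument("out_file", help="path of the JSON report to write")
    parser.add_argument(
        "exts",
        nargs="*",
        default=["wav"],
        help="file extensions to include (default: wav)",
    )
    return parser


def main(argv: Optional[List[str]] = None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # A real module would call its stats function here with
    # args.in_dir, args.out_file and args.exts.
    return args
```

Calling `main()` from an `if __name__ == "__main__":` guard in the module is what would make `python3 -m ...` work.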

"-af",
"aresample=resampler=soxr",
# "-af",
# "aresample=resampler=soxr",
Contributor

What's wrong with sox? Does this not work on your system?

Contributor Author

Yes. There is some issue with configuring that on an M1 Mac, so I just remove the line, do the development, and then uncomment it.

    """Produce summary by recursively searching a directory for wav files"""

    audio_paths = list(Path(in_dir).absolute().rglob("*.wav"))
    audio_dir_stats = list(map(audio_stats_wav, audio_paths))
Contributor

Is there a chance this will take a while to run on larger sets? If so a tqdm progress bar would be great.
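
A small sketch of wrapping the per-file loop in a tqdm progress bar. tqdm is a third-party package; the fallback and the helper name `map_with_progress` are hypothetical additions so the sketch still runs without it.

```python
from typing import Callable, Iterable, List, TypeVar

try:
    from tqdm import tqdm  # third-party; optional in this sketch
except ImportError:
    def tqdm(iterable, **kwargs):  # fallback: iterate without a progress bar
        return iterable

T = TypeVar("T")
R = TypeVar("R")


def map_with_progress(func: Callable[[T], R], items: Iterable[T]) -> List[R]:
    """Apply func to every item, showing progress when tqdm is available."""
    # Materialize first so tqdm knows the total and can show a percentage.
    items = list(items)
    return [func(item) for item in tqdm(items, desc="audio stats")]
```

With this helper, the line above would become something like `map_with_progress(audio_stats_wav, audio_paths)`.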

def audio_dir_stats_wav(in_dir: Union[str, Path], out_file: str):
    """Produce summary by recursively searching a directory for wav files"""

    audio_paths = list(Path(in_dir).absolute().rglob("*.wav"))
Contributor

Can we pass in the search extension as an arg to support other formats? Will need to be one that soundfile supports, but would be cool to be able to use this for other formats. Also perhaps a case insensitive search on the extension?

Contributor Author

Yes, in the following commit.
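
A sketch of a recursive search that takes the extensions as an argument and ignores case, as requested above. The function name `find_audio_files` is hypothetical; the actual change in the following commit may look different.

```python
from pathlib import Path
from typing import Iterable, List, Union


def find_audio_files(
    in_dir: Union[str, Path], exts: Iterable[str] = ("wav",)
) -> List[Path]:
    """Recursively list files whose extension matches exts, ignoring case."""
    # Accept both "wav" and ".wav" and normalize to lowercase with a dot.
    wanted = {"." + ext.lower().lstrip(".") for ext in exts}
    return sorted(
        path
        for path in Path(in_dir).absolute().rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Filtering on `path.suffix.lower()` avoids relying on case-insensitive glob patterns, which vary across platforms.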

    return {
        "samples": len(audio),
        "sample_rate": audio.samplerate,
        "duration": round(len(audio) / audio.samplerate, 2),
Contributor

Why are we rounding here? Why not keep the precise value

Contributor Author

Just to keep it approximate and readable in the JSON. But I will remove it.
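
For illustration, an unrounded version of the per-file stats. This sketch swaps soundfile for the standard-library `wave` module so it runs without extra dependencies, which limits it to PCM WAV files; the function name `audio_stats_wav_stdlib` is hypothetical.

```python
import wave
from typing import Dict, Union


def audio_stats_wav_stdlib(in_file: str) -> Dict[str, Union[int, float]]:
    """Read sample count, sample rate and duration from a PCM WAV header."""
    with wave.open(in_file, "rb") as wav:
        samples = wav.getnframes()
        sample_rate = wav.getframerate()
    return {
        "samples": samples,
        "sample_rate": sample_rate,
        # Keep full precision; round only when formatting a report.
        "duration": samples / sample_rate,
    }
```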

@khumairraj
Contributor Author

This is looking good! Would be awesome to be able to run the function audio_dir_stats_wav from the command line. Something like python3 -m heareval.tasks.audio_dir_stats infile outfile ext

Sure. I have added this and am pushing it in the following commit.

@khumairraj khumairraj requested a review from turian July 28, 2021 08:51
@khumairraj khumairraj requested a review from jorshi July 28, 2021 08:54
README.md Outdated
@@ -44,6 +44,15 @@ Options:
default we resample to 16000, 22050, 44100, 48000.
```

Additionally, to check the stats of an audio directory:
```
python3 -m heareval.tasks.audio_dir_stats {input folder} {output json file} {ext1} {ext2} ..
Contributor

Get rid of this and use .wav, .ogg, and .mp3 always

@turian turian merged commit 531adcc into main Jul 28, 2021
@turian turian deleted the luigistats branch July 28, 2021 16:48