Luigi Stats #123

khumairraj · 2021-07-26T21:14:35Z

This PR does the following:

- Adds an aggregate task, ResampleSubcorpuses, for resampling the audio files. The resampled files are used by the summary stats as well as the final task, so the resampling is aggregated in one place. This is similar to the SubsampleSplits aggregation.
- Adds a task, GenerateAudioStats, for the extracted and the sampled files. Related issue: Statistics script for task embeddings #104

- Resample subcorpuses will be required by the stats as well as finaltask - Convert to one aggregated task to keep it clean

turian · 2021-07-26T22:43:24Z

heareval/tasks/pipeline.py

+class ResampleSubcorpuses(WorkTask):
+    """
+    Aggregates resampling of all the splits and sampling rates
+    into a single task as dependencies.


Why? Won't this block and prevent parallelization? What's the benefit?

The benefit is that we don't need to repeat the list of resampling tasks in the requirements of every task that runs after ResampleSubcorpus and before FinalizeCorpus. Since GenerateAudioStats sits in the middle, I had to write the resample list in the requirements of both GenerateAudioStats and FinalizeCorpus.

The resample list referred to above:
https://github.com/neuralaudio/hear-eval-kit/blob/64fe04f24e4cd754495a95c127957c6d6504448a/heareval/tasks/pipeline.py#L579-L589

This is consistent with how SubsampleSplits aggregates the SubsampleSplit tasks. It will not affect parallelism, because the aggregate task only holds a reference to the list of resampling tasks, which still run in parallel.
https://github.com/neuralaudio/hear-eval-kit/blob/8365c0d2eac8b94b9b56af6b422087b6e33eab5b/heareval/tasks/pipeline.py#L286-L318

It won't block parallelization. It basically just moves all the separate requires tasks out of FinalizeCorpus into a single task and aggregates the results into a single required workdir for GenerateAudioStats, which simplifies that task. Is that correct @khumairraj ?

turian · 2021-07-26T22:44:12Z

heareval/tasks/pipeline.py

-                for sr in self.sample_rates
-                for split in splits
-            ],
+            "stats": GenerateAudioStats(


We want to do this on the original folders before ANY preprocessing also. I don't think this should be a task. It should just be a function you call within tasks.

Ya agreed - I think we really want the report on the duration percentiles before we do any preprocessing and as a function that we call so we can get insight into a dataset and decide on what duration to set for MonoWavTrimCorpus, etc.. -- so this will be mega helpful prior to building out the full pipeline.

This may be helpful as a sanity check at the end of the pipeline. But we would expect all the durations to be the same at this point.

We did discuss that it would be nice to get a stats report out detailing the number of audio files in each split and the different labels and distribution at the end of a pipeline ya?

Yes, we can remove it from being a task. We will have to track the different statistics JSON files in the different work folders though, in order to copy them to the final task output (outside the _workdir). In this task, we are just making one JSON, a kind of summary of what the input and the output were, and then copying this JSON over to the task dir (outside the _workdir).

But I get the requirement of setting the durations for MonoWavTrimCorpus. We can make it a function and call it in the extract metadata step to save the stats in the same folder as the metadata. In that case it is really a kind of metadata of the input files. We can keep GenerateAudioStats in case we want to generate any other stats along with the ones we are doing now, or we can remove it as well.

Ya -- I kind of like having this reported in the task dir as a summary of the input vs output -- but I think it's more useful from a debugging perspective, so maybe something we could optionally include as a task in the pipeline to help when building? @turian ?

A function that we can call on a folder of audio files that looks through recursively and reads the audio files and produces a report using the stats you have currently would be awesome.

@khumairraj Again, we need the stats for datasets at the BEGINNING (before trim) and at the end. We want them at the beginning so we can actually pick appropriate audio file lengths for each particular task.
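For illustration, a minimal sketch of the kind of function requested in this thread: it walks a folder recursively and reports duration percentiles, which is the information needed to pick a trim length. This is not the code from the PR. The names `wav_duration_seconds` and `duration_report` are hypothetical, and the sketch assumes plain PCM WAV input so that it can rely on the standard-library `wave` module rather than soundfile.

```python
import statistics
import wave
from pathlib import Path
from typing import Dict, List, Union


def wav_duration_seconds(path: Union[str, Path]) -> float:
    """Duration of one WAV file in seconds, read from its header only."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_report(in_dir: Union[str, Path]) -> Dict[str, float]:
    """Walk in_dir recursively and summarise the WAV durations (seconds)."""
    durations: List[float] = sorted(
        wav_duration_seconds(p) for p in Path(in_dir).rglob("*.wav")
    )
    if not durations:
        return {"count": 0}
    report: Dict[str, float] = {
        "count": len(durations),
        "min": durations[0],
        "median": statistics.median(durations),
        "max": durations[-1],
    }
    if len(durations) >= 2:
        # quantiles(n=10) returns the 9 cut points between the deciles.
        deciles = statistics.quantiles(durations, n=10, method="inclusive")
        report["p10"] = deciles[0]
        report["p90"] = deciles[8]
    return report
```

Run on the raw extracted folder, a report like this shows whether most files are shorter or longer than a candidate trim duration before anything is preprocessed. Run again on the final output, the min and max should agree if trimming worked.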

khumairraj · 2021-07-27T05:31:44Z

heareval/tasks/pipeline.py

+        master_stats["extracted"] = self.get_stats(
+            self.get_metadata()["relpath"].apply(Path).tolist()
+        )


There are two parts to the stats:

extracted: This is the summary for the list of relpath entries of the audio files in the metadata file, which have not been processed at all. relpath is a reference to all the extracted files (it is used this way because it avoids having to look through the different extracted workdir subfolders of each dataset; the metadata always contains all the files with the correct relative path).

sampled: This is a dict that contains the stats for all the partitions and the sampling rates.

jorshi · 2021-07-27T05:22:58Z

heareval/tasks/util/audio.py

@@ -69,3 +71,12 @@ def resample_wav(in_file: str, out_file: str, out_sr: int):
    )
    # Make sure the return code is 0 and the command was successful.
    assert ret == 0
+
+
+def get_audiostats(in_file: str):


Nice! We should include this in MonoWavTrimCorpus and only resample / convert to wav if the input needs it. If it doesn't then just create a symlink. Can do this in a separate PR.

Yes. I was thinking of doing this. Will do.

Awesome thanks
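
A rough sketch of the "symlink if no conversion is needed" idea from this thread, not the eventual implementation. The names `is_mono_wav_at` and `link_if_ready` are hypothetical. The check uses the standard-library `wave` module, so it only recognises PCM WAV files, and it deliberately ignores trimming, which MonoWavTrimCorpus would still have to handle.

```python
import wave
from pathlib import Path
from typing import Union


def is_mono_wav_at(path: Union[str, Path], sample_rate: int) -> bool:
    """True if path is a readable mono WAV file at the given sample rate."""
    try:
        with wave.open(str(path), "rb") as wav:
            return wav.getnchannels() == 1 and wav.getframerate() == sample_rate
    except (wave.Error, EOFError):
        # Not a WAV file the wave module understands: treat as "needs conversion".
        return False


def link_if_ready(
    in_file: Union[str, Path], out_file: Union[str, Path], sample_rate: int
) -> bool:
    """Symlink in_file to out_file when no conversion is needed.

    Returns True if a symlink was created. False means the caller still
    has to convert or resample the file itself.
    """
    if not is_mono_wav_at(in_file, sample_rate):
        return False
    out_path = Path(out_file)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if out_path.is_symlink() or out_path.exists():
        out_path.unlink()  # replace a stale output from an earlier run
    out_path.symlink_to(Path(in_file).resolve())
    return True
```

A caller would fall back to the existing ffmpeg-based conversion whenever `link_if_ready` returns False.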

jorshi · 2021-07-27T05:30:31Z

heareval/tasks/pipeline.py

+                sample_rates=self.sample_rates,
+                data_config=self.data_config,
+            ),
+            "resample": ResampleSubcorpuses(


Why have ResampleSubcorpuses as a requirement here as well? It is already required by GenerateAudioStats, so these will all necessarily be completed by the time we get here.

Because we still require the work directory of the ResampleSubcorpuses to copy the final outputs to the output folder.

jorshi · 2021-07-27T05:37:26Z

heareval/tasks/pipeline.py

+        subsample_splits (list(SubsampleSplit)): task subsamples each split
+    """
+
+    sample_rates = luigi.Parameter()


luigi.ListParameter()

I don't think it really matters -- but it should be a list

Sure, thanks. Will change.

@khumairraj ^^

jorshi · 2021-07-27T05:50:26Z

heareval/tasks/pipeline.py

+            "total_audiofiles": len(durations),
+            "mean_samplerate": np.mean(sample_rates),
+            "mean_dur(sec)": np.mean(durations),
+            "median_dur(sec)": np.median(durations),


mean samplerate isn't super meaningful -- I'd way rather have a list of samplerates to see which ones are present in a dataset. For the data before preprocessing having a list of formats would be cool too. There is def a chance that soundfile might not be able to read some of the formats that are present in the dataset though. (looking at you coughvid)

Yes, this is a problem. In general, what would be a good approach for the audio_stats function? It is currently renamed to audio_stats_wav, according to #104. Maybe ffmpeg can help, but I did not get the stats directly for a .webm file. See the following link: https://stackoverflow.com/questions/34118013/how-to-determine-webm-duration-using-ffprobe. The first comment talks about repackaging. Please let me know if you already know of any alternatives. Thanks.
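
As a sketch of the suggestion above (report which sample rates and formats are present instead of a mean sample rate), something like the following could work. It is an illustration under assumptions, not the PR's code: `format_and_rate_counts` is a hypothetical name, formats are inferred from the file extension only, and sample rates are read only from WAV headers via the standard-library `wave` module, so other formats are counted but not opened.

```python
import wave
from collections import Counter
from pathlib import Path
from typing import Dict, Union


def format_and_rate_counts(in_dir: Union[str, Path]) -> Dict[str, object]:
    """Count file extensions and WAV sample rates under in_dir."""
    formats: Counter = Counter()
    sample_rates: Counter = Counter()
    unreadable_wav = 0
    for path in Path(in_dir).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()
        formats[ext] += 1
        if ext != ".wav":
            continue  # only WAV headers are inspected in this sketch
        try:
            with wave.open(str(path), "rb") as wav:
                sample_rates[wav.getframerate()] += 1
        except (wave.Error, EOFError):
            # Keep going and report the failure instead of aborting the scan.
            unreadable_wav += 1
    return {
        "formats": dict(formats),
        "sample_rates": dict(sample_rates),
        "unreadable_wav": unreadable_wav,
    }
```

Counting unreadable files separately keeps one bad file from hiding the rest of the report, which matters for datasets with mixed or unusual formats.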

jorshi

This is looking good! Would be awesome to be able to run the function audio_dir_stats_wav from the command line. Something like python3 -m heareval.tasks.audio_dir_stats infile outfile ext

jorshi · 2021-07-27T20:11:52Z

heareval/tasks/util/audio.py

-            "-af",
-            "aresample=resampler=soxr",
+            # "-af",
+            # "aresample=resampler=soxr",


What's wrong with sox? Does this not work on your system?

Yes. There is some issue with configuring it on an M1 Mac, so I just remove the line, do the development, and then uncomment it.

jorshi · 2021-07-27T20:13:08Z

heareval/tasks/util/audio.py

+    """Produce summary by recursively searching a directory for wav files"""
+
+    audio_paths = list(Path(in_dir).absolute().rglob("*.wav"))
+    audio_dir_stats = list(map(audio_stats_wav, audio_paths))


Is there a chance this will take a while to run on larger sets? If so a tqdm progress bar would be great.

jorshi · 2021-07-27T20:21:48Z

heareval/tasks/util/audio.py

+def audio_dir_stats_wav(in_dir: Union[str, Path], out_file: str):
+    """Produce summary by recursively searching a directory for wav files"""
+
+    audio_paths = list(Path(in_dir).absolute().rglob("*.wav"))


Can we pass in the search extension as an arg to support other formats? Will need to be one that soundfile supports, but would be cool to be able to use this for other formats. Also perhaps a case insensitive search on the extension?

Yes, in the following commit.
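
One possible way to do the case-insensitive extension search discussed here, shown only as a sketch. `find_audio_files` is a hypothetical helper, not the function added in the PR. It filters on the lowercased suffix rather than building a glob pattern, so `.wav`, `.WAV`, and `.Wav` are all matched.

```python
from pathlib import Path
from typing import Iterable, List, Union


def find_audio_files(
    in_dir: Union[str, Path],
    exts: Iterable[str] = (".wav", ".ogg", ".mp3"),
) -> List[Path]:
    """Recursively list files whose extension matches exts, ignoring case."""
    # Normalise "wav", ".wav" and ".WAV" to the same form: ".wav".
    wanted = {"." + ext.lower().lstrip(".") for ext in exts}
    return sorted(
        path
        for path in Path(in_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

The caller is still responsible for passing only extensions that the audio reader in use can actually open.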

turian · 2021-07-27T20:31:02Z

heareval/tasks/util/audio.py

+    return {
+        "samples": len(audio),
+        "sample_rate": audio.samplerate,
+        "duration": round(len(audio) / audio.samplerate, 2),


Why are we rounding here? Why not keep the precise value

Just to keep it approximate and readable in the JSON. But I will remove it.

khumairraj · 2021-07-28T08:40:45Z

This is looking good! Would be awesome to be able to run the function audio_dir_stats_wav from the command line. Something like python3 -m heareval.tasks.audio_dir_stats infile outfile ext

Sure. I have added this and am pushing it in the following commit.
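
For readers who want to see the shape of such a command-line entry point, here is a minimal sketch. The commit list for this PR says the actual endpoint was added with click; this sketch swaps in the standard-library `argparse` instead so that it is self-contained. The argument names `in_dir`, `out_file`, and `exts`, and the functions `build_parser` and `parse_cli`, are assumptions for illustration.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical audio_dir_stats command."""
    parser = argparse.ArgumentParser(
        prog="audio_dir_stats",
        description="Summarise the audio files found under a directory.",
    )
    parser.add_argument("in_dir", help="directory to search recursively")
    parser.add_argument("out_file", help="path of the JSON report to write")
    parser.add_argument(
        "exts",
        nargs="*",
        default=[".wav"],
        help="file extensions to include (default: .wav)",
    )
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse_cli()
    print(f"Would scan {args.in_dir} for {args.exts} and write {args.out_file}")
```

Keeping the parser in its own function makes the argument handling easy to exercise without touching any audio files.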

turian · 2021-07-28T10:58:51Z

README.md

@@ -44,6 +44,15 @@ Options:
                         default we resample to 16000, 22050, 44100, 48000.
 ```

+Additionally, to check the stats of an audio directory:
+```
+python3 -m heareval.tasks.audio_dir_stats {input folder} {output json file} {ext1} {ext2} ..


Get rid of this and use .wav, .ogg, and .mp3 always

khumairraj added 7 commits July 27, 2021 02:34

add function to get audio stats

04743e6

remove coughvid in runner

fec2f28

Add resamplesubcorpuses as task aggregation like subsamplesplits

4ce0a48

- Resample subcorpuses will be required by the stats as well as finaltask - Convert to one aggregated task to keep it clean

add Generateaudiostats

41f23b0

Merge branch 'main' into luigistats

fb37eb7

black

cd745a8

linting

8365c0d

khumairraj requested a review from turian July 26, 2021 21:31

turian reviewed Jul 26, 2021

khumairraj commented Jul 27, 2021

jorshi reviewed Jul 27, 2021

khumairraj added 6 commits July 28, 2021 00:58

Remove stats task

9f25c72

Add audiodirstats to audio utils

0a9016a

Call the function at appropriate places

292ebeb

flake8

b5f03e4

mypy

ff4625a

more mypy

5419b25

khumairraj requested review from turian and jorshi July 27, 2021 19:57

jorshi reviewed Jul 27, 2021

turian reviewed Jul 27, 2021

khumairraj added 4 commits July 28, 2021 14:13

add lambda, pass ext, case senstv ext, no rounding, tqdm

b1be365

fix: case insensitive extension

eb953f9

add click cmd endpoint for audio_dir_stats

674cbd9

add in readme

7dc91f2

khumairraj requested a review from turian July 28, 2021 08:51

add back soxr

396dace

khumairraj requested a review from jorshi July 28, 2021 08:54

turian added 2 commits July 28, 2021 12:10

Merge branch 'main' into luigistats

bd85cb9

fix

3786fc8

turian reviewed Jul 28, 2021

khumairraj added 2 commits July 28, 2021 16:29

make ogg, mp3 and wav as default extension

7a4c226

fixes

d4be795

turian approved these changes Jul 28, 2021

turian merged commit 531adcc into main Jul 28, 2021

turian deleted the luigistats branch July 28, 2021 16:48

jorshi mentioned this pull request Jul 29, 2021

Statistics script for task embeddings #104

Closed
