[WIP] Luigi pipeline update #78

Merged · 29 commits · Jul 22, 2021
Commits
17dd7bd
Adding config for nsynth
jorshi Jul 19, 2021
ec435fe
Copying speech commands -- getting download and extraction running
jorshi Jul 19, 2021
e0bbce0
Metadata creation for nsynth
jorshi Jul 20, 2021
e7799f7
Renaming nsynth to nsynth pitch
jorshi Jul 20, 2021
fa4f90b
Remove monowavtrim
jorshi Jul 20, 2021
04c8604
Named import for luigi utils
jorshi Jul 20, 2021
e9b14e3
Starting to create config classes
jorshi Jul 20, 2021
1feb45b
Adding name partition configs
jorshi Jul 20, 2021
518ebfc
Dynamic creation of the download and extract tasks
jorshi Jul 20, 2021
46c3d8a
Process metadata being built dynamically
jorshi Jul 21, 2021
94ae944
Starting to genericize the audio pipeline
jorshi Jul 21, 2021
bd1a729
Moved all the audio processing out of speech commands
jorshi Jul 21, 2021
c7fc530
Adding some docstrings
jorshi Jul 21, 2021
6d7fb68
Cleaning up config
jorshi Jul 21, 2021
c4c05e6
versioned task name
jorshi Jul 21, 2021
c6591ff
Merge branch 'nsynth' into luigi-pipeline-update
jorshi Jul 21, 2021
56cda4c
Updating nsynth config
jorshi Jul 21, 2021
998acd0
remove string config passing into dataset builder
jorshi Jul 21, 2021
b7a01b2
Formatting
jorshi Jul 21, 2021
fe76e96
sample rate can be passed in as a command line arg
jorshi Jul 21, 2021
b2b3f56
Cleanup
jorshi Jul 21, 2021
8ac1c0e
Move dataset specific config into the same files as tasks
jorshi Jul 21, 2021
d419651
Remove config folder
jorshi Jul 21, 2021
2bd7ffb
Adding dataset preprocessing usage to readme
jorshi Jul 21, 2021
32a1933
A bit of cleanup
jorshi Jul 22, 2021
01ba125
Updating subsample numbers
jorshi Jul 22, 2021
d8f5ef1
Update some docstrings in the dataset builder
jorshi Jul 22, 2021
63b3d29
Removing command-line invocation from individual tasks -- must be fro…
jorshi Jul 22, 2021
eb65831
Adding click requirement
jorshi Jul 22, 2021
2 changes: 1 addition & 1 deletion .flake8
@@ -4,4 +4,4 @@
max-line-length = 88
extend-ignore =
    # See https://github.com/PyCQA/pycodestyle/issues/373
    E203,
    E203,
27 changes: 27 additions & 0 deletions README.md
@@ -10,6 +10,33 @@ See [ROADMAP](ROADMAP.md).
pip install heareval
```

### Evaluation Tasks
To run the preprocessing pipeline for Google Speech Commands:
```
python3 -m heareval.tasks.runner speech_commands
```

For NSynth pitch:
```
python3 -m heareval.tasks.runner nsynth_pitch
```

These commands will download and preprocess the entire dataset. An intermediate
directory called `_workdir` will be created, and a final directory called `tasks`
will contain the completed dataset.
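
The resulting layout looks roughly like this (directory names per the description above; the contents of each directory depend on the task):
```
.
├── _workdir/   # intermediate download and extraction artifacts
└── tasks/      # completed, preprocessed dataset
```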

Options:
```
Options:
  --num-workers INTEGER  Number of CPU workers to use when running. If not
                         provided all CPUs are used.
  --sample-rate INTEGER  Perform resampling only to this sample rate. By
                         default we resample to 16000, 22050, 44100, 48000.
```
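
For example (the flag values here are illustrative):
```
python3 -m heareval.tasks.runner nsynth_pitch --num-workers 4 --sample-rate 16000
```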



[later we will include more details here]

## Development
Expand Down
Empty file removed heareval/tasks/config/__init__.py
Empty file.
33 changes: 0 additions & 33 deletions heareval/tasks/config/coughvid.py

This file was deleted.

14 changes: 0 additions & 14 deletions heareval/tasks/config/dataset_config.py

This file was deleted.

12 changes: 0 additions & 12 deletions heareval/tasks/config/s3.py

This file was deleted.

19 changes: 0 additions & 19 deletions heareval/tasks/config/speech_commands.py

This file was deleted.

69 changes: 69 additions & 0 deletions heareval/tasks/dataset_config.py
@@ -0,0 +1,69 @@
"""
Generic configuration used by all tasks
"""

from typing import Dict, List


class DatasetConfig:
"""
A base class config class for HEAR datasets.

Args:
task_name: Unique name for this task
version: version string for the dataset
download_urls: A dictionary of URLs to download the dataset files from
sample_duration: All samples with be padded / trimmed to this length
"""

def __init__(
self, task_name: str, version: str, download_urls: Dict, sample_duration: float
):
self.task_name = task_name
self.version = version
self.download_urls = download_urls
self.sample_duration = sample_duration

@property
def versioned_task_name(self):
return f"{self.task_name}-{self.version}"


class PartitionConfig:
"""
A configuration class for creating named partitions in a dataset

Args:
name: name of the partition
max_files: an integer number of samples to cap this partition at,
defaults to None for no maximum.
"""

def __init__(self, name: str, max_files: int = None):
self.name = name
self.max_files = max_files


class PartitionedDatasetConfig(DatasetConfig):
"""
A base class config class for HEAR datasets. This config should be used when
there are pre-defined data partitions.

Args:
task_name: Unique name for this task
version: version string for the dataset
download_urls: A dictionary of URLs to download the dataset files from
sample_duration: All samples with be padded / trimmed to this length
partitions: A list of PartitionConfig objects describing the partitions
"""

def __init__(
self,
task_name: str,
version: str,
download_urls: Dict,
sample_duration: float,
partitions: List[PartitionConfig],
):
super().__init__(task_name, version, download_urls, sample_duration)
self.partitions = partitions
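
For reference, a minimal sketch of how these config classes compose (the task name, URL, and partition values below are placeholders, not a real HEAR task):

```python
from heareval.tasks.dataset_config import PartitionConfig, PartitionedDatasetConfig

# All values below are illustrative placeholders
config = PartitionedDatasetConfig(
    task_name="example-task",
    version="v1.0.0",
    download_urls={"train": "https://example.com/train.tar.gz"},
    sample_duration=4.0,
    partitions=[
        PartitionConfig(name="train", max_files=1000),
        PartitionConfig(name="test", max_files=None),
    ],
)
print(config.versioned_task_name)  # example-task-v1.0.0
```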
139 changes: 139 additions & 0 deletions heareval/tasks/nsynth_pitch.py
@@ -0,0 +1,139 @@
#!/usr/bin/env python3
"""
Pre-processing pipeline for NSynth pitch detection
"""

import os
from pathlib import Path
from functools import partial
import logging
from typing import List

import luigi
import pandas as pd
from slugify import slugify

from heareval.tasks.dataset_config import (
    PartitionedDatasetConfig,
    PartitionConfig,
)
from heareval.tasks.util.dataset_builder import DatasetBuilder
import heareval.tasks.util.luigi as luigi_util

logger = logging.getLogger("luigi-interface")


# Dataset configuration
class NSynthPitchConfig(PartitionedDatasetConfig):
    def __init__(self):
        super().__init__(
            task_name="nsynth-pitch",
            version="v2.2.3",
            download_urls={
                "train": "http://download.magenta.tensorflow.org/datasets/nsynth/nsynth-train.jsonwav.tar.gz",  # noqa: E501
                "valid": "http://download.magenta.tensorflow.org/datasets/nsynth/nsynth-valid.jsonwav.tar.gz",  # noqa: E501
                "test": "http://download.magenta.tensorflow.org/datasets/nsynth/nsynth-test.jsonwav.tar.gz",  # noqa: E501
            },
            # All samples will be trimmed / padded to this length
            sample_duration=4.0,
            # Pre-defined partitions in the dataset. Number of files in each split is
            # train: 85,111; valid: 10,102; test: 4,890. These counts will be slightly
            # lower after filtering the pitches to the piano range.
            # To subsample a partition, set max_files to an integer.
            # TODO: Should we subsample NSynth?

Review comment (PR author): Full nsynth is quite large. Should we subsample it?

            partitions=[
                PartitionConfig(name="train", max_files=10000),
                PartitionConfig(name="valid", max_files=1000),
                PartitionConfig(name="test", max_files=None),
            ],
        )
        # We only include pitches that are on a standard 88-key MIDI piano
        self.pitch_range = (21, 108)


config = NSynthPitchConfig()


class ConfigureProcessMetaData(luigi_util.WorkTask):
    """
    Custom metadata pre-processing for the NSynth task. Creates a metadata csv
    file that will be used by downstream luigi tasks to curate the final dataset.
    """

    outfile = luigi.Parameter()

    def requires(self):
        raise NotImplementedError

    @staticmethod
    def get_rel_path(root: Path, item: str) -> Path:
        # Creates the relative path to an audio file given the note_str
        audio_path = root.joinpath("audio")
        filename = f"{item}.wav"
        return audio_path.joinpath(filename)

    @staticmethod
    def slugify_file_name(filename: str) -> str:
        return f"{slugify(filename)}.wav"

    def get_split_metadata(self, split: str) -> pd.DataFrame:
        logger.info(f"Preparing metadata for {split}")

        # Loads and prepares the metadata for a specific split
        split_path = Path(self.requires()[split].workdir).joinpath(f"nsynth-{split}")

        metadata = pd.read_json(split_path.joinpath("examples.json"), orient="index")

        # Filter out pitches that are not within the range
        metadata = metadata[metadata["pitch"] >= config.pitch_range[0]]
        metadata = metadata[metadata["pitch"] <= config.pitch_range[1]]

        metadata = metadata.assign(label=lambda df: df["pitch"])
        metadata = metadata.assign(
            relpath=lambda df: df["note_str"].apply(
                partial(self.get_rel_path, split_path)
            )
        )
        metadata = metadata.assign(
            slug=lambda df: df["note_str"].apply(self.slugify_file_name)
        )
        metadata = metadata.assign(partition=lambda df: split)
        metadata = metadata.assign(
            filename_hash=lambda df: df["slug"].apply(luigi_util.filename_to_int_hash)
        )

        return metadata[luigi_util.PROCESSMETADATACOLS]

    def run(self):
        # Get metadata for each of the data splits
        process_metadata = pd.concat(
            [self.get_split_metadata(split) for split in self.requires()]
        )

        process_metadata.to_csv(
            os.path.join(self.workdir, self.outfile),
            columns=luigi_util.PROCESSMETADATACOLS,
            header=False,
            index=False,
        )

        self.mark_complete()


def main(num_workers: int, sample_rates: List[int]):
    builder = DatasetBuilder(config)

    # Build the dataset pipeline with the custom metadata configuration task
    download_tasks = builder.download_and_extract_tasks()
    configure_metadata = builder.build_task(
        ConfigureProcessMetaData,
        requirements=download_tasks,
        params={"outfile": "process_metadata.csv"},
    )
    audio_tasks = builder.prepare_audio_from_metadata_task(
        configure_metadata, sample_rates
    )

    builder.run(audio_tasks, num_workers=num_workers)
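
To make the metadata step concrete, here is a toy sketch of the filtering and column derivation performed in `get_split_metadata` (the example row and pitch value are made up; `filename_to_int_hash` and `PROCESSMETADATACOLS` come from `heareval.tasks.util.luigi` and are not reproduced here):

```python
import pandas as pd
from slugify import slugify

# One made-up row mimicking nsynth's examples.json after read_json(orient="index")
metadata = pd.DataFrame([{"note_str": "guitar_acoustic_010-060-075", "pitch": 60}])

# Keep only pitches on an 88-key piano (21..108), as in NSynthPitchConfig
metadata = metadata[metadata["pitch"].between(21, 108)]

# Derive the label and slug columns the same way the task does
metadata = metadata.assign(label=lambda df: df["pitch"])
metadata = metadata.assign(
    slug=lambda df: df["note_str"].apply(lambda s: f"{slugify(s)}.wav")
)
print(metadata[["note_str", "pitch", "label", "slug"]])
```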
53 changes: 53 additions & 0 deletions heareval/tasks/runner.py
@@ -0,0 +1,53 @@
#!/usr/bin/env python3
"""
Runs a luigi pipeline to build a dataset
"""

import logging
import multiprocessing
from typing import Optional

import click

import heareval.tasks.speech_commands as speech_commands
import heareval.tasks.nsynth_pitch as nsynth_pitch

logger = logging.getLogger("luigi-interface")

tasks = {"speech_commands": speech_commands, "nsynth_pitch": nsynth_pitch}


@click.command()
@click.argument("task")
@click.option(
    "--num-workers",
    default=None,
    help="Number of CPU workers to use when running. "
    "If not provided all CPUs are used.",
    type=int,
)
@click.option(
    "--sample-rate",
    default=None,
    help="Perform resampling only to this sample rate. "
    "By default we resample to 16000, 22050, 44100, 48000.",
    type=int,
)
def run(
    task: str, num_workers: Optional[int] = None, sample_rate: Optional[int] = None
):
    if num_workers is None:
        num_workers = multiprocessing.cpu_count()
    logger.info(f"Using {num_workers} workers")

    if sample_rate is None:
        sample_rates = [16000, 22050, 44100, 48000]
    else:
        sample_rates = [sample_rate]

    tasks[task].main(num_workers=num_workers, sample_rates=sample_rates)


if __name__ == "__main__":
    run()