
Luigi cleanup #8

Merged · 31 commits · May 28, 2021

Commits:
- `373c730` Added coughvid.py pipeline (turian, May 24, 2021)
- `f2c19a3` Black (turian, May 24, 2021)
- `3f2be27` Convert to mono wav corpus (turian, May 24, 2021)
- `1223598` Added precommit info (turian, May 24, 2021)
- `f8b7f1d` Small bugfix (turian, May 25, 2021)
- `109789d` Another comment (turian, May 25, 2021)
- `8bb8886` Updated TODO (turian, May 25, 2021)
- `f579231` Fix resampling (turian, May 25, 2021)
- `aa2ef62` Starting to refactor (turian, May 25, 2021)
- `1aee316` Refactoring (turian, May 25, 2021)
- `b149d74` Refactor works (turian, May 25, 2021)
- `f0f915f` Tar file (turian, May 25, 2021)
- `a455bf3` S3 code works (turian, May 25, 2021)
- `594bc5f` Luigi pipeline requirements (jorshi, May 25, 2021)
- `3c68890` Starting config file. s3utils (jorshi, May 25, 2021)
- `fc5a24e` Creating subdirectories for config and utils (jorshi, May 25, 2021)
- `14fedbe` Moving all config over to config file (jorshi, May 26, 2021)
- `62dfb78` Move utils over to luigi util (jorshi, May 26, 2021)
- `a575f93` Creating audio utils (jorshi, May 26, 2021)
- `396def3` Moving out s3 code (jorshi, May 26, 2021)
- `2c28afc` Check output on unzip (jorshi, May 26, 2021)
- `b7d0720` Cleaning up extract method (jorshi, May 26, 2021)
- `e9d5472` Progress bar for downloads (jorshi, May 26, 2021)
- `2ffb18a` Wrong slugigy package (jorshi, May 26, 2021)
- `b866faa` gitignore for eval tasks (jorshi, May 26, 2021)
- `92e8a93` S3 caching config (jorshi, May 26, 2021)
- `a871e64` Change S3 config back to defaults (jorshi, May 26, 2021)
- `51625fc` Added a couple todos (jorshi, May 26, 2021)
- `bfb636a` Updating S3 config and pulling out from coughvid (jorshi, May 28, 2021)
- `9055cbc` Removing S3 from coughvid (jorshi, May 28, 2021)
- `f9a6eb1` Moving gitignore and requirements into top-level folder (jorshi, May 28, 2021)
6 changes: 6 additions & 0 deletions .gitignore
@@ -127,3 +127,9 @@ dmypy.json

# Pyre type checker
.pyre/

# Working directory for luigi pipelines
evaluation-tasks/_workdir/

# Completed evaluation tasks
evaluation-tasks/coughvid-*/
17 changes: 17 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,17 @@
repos:
- repo: https://github.com/kynan/nbstripout
rev: 0.3.9
hooks:
- id: nbstripout
- repo: https://github.com/mwouts/jupytext
rev: v1.11.2
hooks:
- id: jupytext
args: [--sync, --pipe, black]
additional_dependencies:
- black==21.5b0 # Matches hook
- repo: https://github.com/psf/black
rev: 21.5b0
hooks:
- id: black
language_version: python3
8 changes: 8 additions & 0 deletions README.md
@@ -1,2 +1,10 @@
# hear2021-eval-kit

Evaluation kit for HEAR 2021 NeurIPS competition


If you are pushing code to this repo, please make sure you have
pre-commit hooks installed:
```
pre-commit install
```
77 changes: 77 additions & 0 deletions evaluation-tasks/README.md
@@ -0,0 +1,77 @@
evaluation-tasks
================

This folder contains Luigi pipelines to download and preprocess
evaluation tasks into a common format. Luigi checkpoints are saved
into the directory .checkpoints so preprocessing can be resumed if
interrupted. After preprocessing, tarred outputs are saved to your
S3 bucket. This avoids hitting dataset providers repeatedly.

For each evaluation task, the directory structure is:
```
taskname/
    task.json
    README
    LICENSE
    train.csv
        [filename],...
    test.csv
        [filename],...
    audio/[sr]/train/[filename]
```

## More details

task.json also specifies the hop_size that we will use for the
evaluation.

If the task involves multiple classes or labels, the maximum number
of classes/labels will be provided. We might have two versions of
the label files: one with string labels and one with labels converted
to ints for convenience.
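Consumers of a preprocessed task will presumably start by reading task.json. A minimal sketch of that step (the `hop_size` field is mentioned above; the helper name and the demo directory are illustrative, not part of the PR):

```python
import json
from pathlib import Path

def load_task_metadata(task_dir):
    """Read a task's metadata (e.g. hop_size) from its task.json."""
    with open(Path(task_dir) / "task.json") as fp:
        return json.load(fp)

# Demo with a hypothetical on-disk task directory:
demo = Path("demo-task")
demo.mkdir(exist_ok=True)
(demo / "task.json").write_text(json.dumps({"hop_size": 4}))
print(load_task_metadata("demo-task")["hop_size"])  # 4
```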

## train.csv and test.csv

For classification/multi-classification of the entire sound:
```
filename, non-negative integer class
```

For tagging (multilabel sound event classification) of the entire sound:
```
filename, list of string labels
```

For frame-based temporal multilabel (e.g. transcription and sound event detection):
```
filename, float timestamp in seconds, list of string labels
```

For ranking tasks:
```
list of filenames in ranked order
```

For JND tasks:
```
filename1, filename2, 0/1 flag indicating whether the two files are perceptually different to human listeners
```

If the dataset provides a validation.csv, that will be included
too. Otherwise, participants may partition train into train/val
however they like.
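As an illustration, a train.csv in the classification format above could be parsed like this (the filenames and labels are made up for the demo):

```python
import csv
import io

# Hypothetical train.csv contents in the
# "filename, non-negative integer class" format described above:
train_csv = "cough_001.wav,0\ncough_002.wav,1\n"

rows = [
    (filename, int(label))
    for filename, label in csv.reader(io.StringIO(train_csv))
]
print(rows)  # [('cough_001.wav', 0), ('cough_002.wav', 1)]
```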

## Caching with S3

1. Download and configure the AWS CLI if you haven't done that already:
* [Installation](https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html)
* [Configuration](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)

2. Update the S3 config file: `config/s3.py`
* `S3_CACHE = True` enables S3 caching. Set this to False if you want to disable
caching for all tasks.
> **Contributor:** What does caching mean here?
>
> **Author:** This basically just turns the S3 caching on and off. We won't need this if we pull the S3 stuff out into a separate script.

* `HANDLE` is a string that is used to create an S3 bucket for all the evaluation
tasks. Every S3 bucket must have a unique name, so you should use this to create
one for yourself. The value of `HANDLE` is appended to `hear2021-`. For example,
if I set `HANDLE=jordie` then all my tasks will be cached in a bucket named
`hear2021-jordie`.
* `S3_REGION_NAME` sets the region for your S3 buckets. You can set this to `None`
to use the default value set during CLI configuration.
Empty file.
33 changes: 33 additions & 0 deletions evaluation-tasks/config/coughvid.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
"""
Configuration for the coughvid task
"""

> **Author:** Some of these could probably be moved to a global / luigi config file (similar to the s3 config stuff that I pulled out).
>
> **Contributor:** #10

# TODO: move some of these to a global config and import that here
# See: https://github.com/neuralaudio/hear2021-eval-kit/issues/10

TASKNAME = "coughvid-v2.0.0"

# Number of CPU workers for Luigi jobs
NUM_WORKERS = 4
# NUM_WORKERS = 1
# If you only use one sample rate, you should have an array with
# one sample rate in it.
# However, if you are evaluating multiple embeddings, you might
# want them all.
SAMPLE_RATES = [48000, 44100, 22050, 16000]
# TODO: Pick the 75th percentile length?
SAMPLE_LENGTH_SECONDS = 8.0
# TODO: Do we want to call this FRAME_RATE or HOP_SIZE
FRAME_RATE = 4
# Set this to None if you want to use ALL the data.
# NOTE: With this cap, expect only 225 test files :\
# NOTE: You can make this smaller during development of this
# preprocessing script, to keep the pipeline fast.
# WARNING: If you change this value, you *must* delete the
# _workdir (working directory).
# Most of the tasks iterate over every audio file present,
# except for the one that downsamples the corpus.
# (This is why we should have one working directory per task)
MAX_FRAMES_PER_CORPUS = 20 * 3600

MAX_FILES_PER_CORPUS = int(MAX_FRAMES_PER_CORPUS / FRAME_RATE / SAMPLE_LENGTH_SECONDS)
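With the default constants above, the file cap works out to 2250; a quick check of the arithmetic:

```python
MAX_FRAMES_PER_CORPUS = 20 * 3600  # 72000 frames
FRAME_RATE = 4                     # frames per second
SAMPLE_LENGTH_SECONDS = 8.0        # seconds per audio file

# 72000 frames / 4 fps = 18000 seconds of audio;
# 18000 s / 8 s per file = 2250 files.
MAX_FILES_PER_CORPUS = int(
    MAX_FRAMES_PER_CORPUS / FRAME_RATE / SAMPLE_LENGTH_SECONDS
)
print(MAX_FILES_PER_CORPUS)  # 2250
```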
12 changes: 12 additions & 0 deletions evaluation-tasks/config/s3.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
"""
Configuration specific to AWS S3
"""

# You should pick a unique handle, since this determines the S3 path
# (which must be globally unique across all S3 users).
HANDLE = "hear"
S3_BUCKET = f"hear2021-{HANDLE}"

# If this is None, boto will use whatever is in your
# ~/.aws/config or AWS_DEFAULT_REGION environment variable
S3_REGION_NAME = "eu-central-1"
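As a sketch of how these settings combine (the `hear2021-{HANDLE}` bucket-name rule comes from the README above; the URL helper is illustrative, not part of the PR):

```python
HANDLE = "hear"
S3_BUCKET = f"hear2021-{HANDLE}"
S3_REGION_NAME = "eu-central-1"

def cached_object_url(key):
    # Virtual-hosted-style S3 URL for a cached tar file.
    return f"https://{S3_BUCKET}.s3.{S3_REGION_NAME}.amazonaws.com/{key}"

print(cached_object_url("coughvid-v2.0.0.tar"))
# https://hear2021-hear.s3.eu-central-1.amazonaws.com/coughvid-v2.0.0.tar
```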