Merge pull request #75 from p-lambda/dev
WILDS v1.2
ssagawa committed Jul 19, 2021
2 parents 50b2677 + 2ee73c4 commit eac4d82
Showing 46 changed files with 3,017 additions and 222 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,4 +1,5 @@
__pycache__
build
dist
venv
wilds.egg-info
68 changes: 56 additions & 12 deletions README.md
@@ -50,7 +50,7 @@ pip install -e .
- torch>=1.7.0
- torch-scatter>=2.0.5
- torch-geometric>=1.6.1
- tqdm>=4.53.0
- tqdm>=4.53.0

Running `pip install wilds` or `pip install -e .` will automatically check for and install all of these requirements
except for the `torch-scatter` and `torch-geometric` packages, which require a [quick manual install](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html#installation-via-binaries).
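
For reference, here is a sketch of that manual install for a torch 1.7.0 + CUDA 10.2 setup; the wheel index URL must match your exact torch and CUDA versions, so treat the versions below as placeholders and see the linked instructions for details:

```bash
# Adjust torch-1.7.0+cu102 to match your installed torch and CUDA versions
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-geometric
```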
@@ -83,6 +83,7 @@ python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data

The scripts are set up to facilitate general-purpose algorithm development: new algorithms can be added to `examples/algorithms` and then run on all of the WILDS datasets using the default models.

### Downloading and training on the WILDS datasets
The first time you run these scripts, you might need to download the datasets. You can do so with the `--download` argument, for example:
```
python examples/run_expt.py --dataset civilcomments --algorithm groupDRO --root_dir data --download
```
@@ -102,25 +103,42 @@ These are the sizes of each of our datasets, as well as their approximate time taken to train and evaluate the default model for a single ERM run using a NVIDIA V100 GPU.

| Dataset         | Modality | Download size (GB) | Size on disk (GB) | Train+eval time (hours) |
|-----------------|----------|--------------------|-------------------|-------------------------|
| iwildcam | Image | 11 | 25 | 7 |
| camelyon17 | Image | 10 | 15 | 2 |
| rxrx1 | Image | 7 | 7 | 11 |
| ogb-molpcba | Graph | 0.04 | 2 | 15 |
| globalwheat | Image | 10 | 10 | 2 |
| civilcomments | Text | 0.1 | 0.3 | 4.5 |
| fmow | Image | 50 | 55 | 6 |
| poverty | Image | 12 | 14 | 5 |
| amazon | Text | 6.6 | 7 | 5 |
| amazon | Text | 7 | 7 | 5 |
| py150 | Text | 0.1 | 0.8 | 9.5 |

While the `camelyon17` dataset is small and fast to train on, we advise against using it as the only dataset to prototype methods on, as the test performance of models trained on this dataset tends to exhibit a large degree of variability over random seeds.

The image datasets (`iwildcam`, `camelyon17`, `fmow`, and `poverty`) tend to have high disk I/O usage. If training time is much slower for you than the approximate times listed above, consider checking if I/O is a bottleneck (e.g., by moving to a local disk if you are using a network drive, or by increasing the number of data loader workers). To speed up training, you could also disable evaluation at each epoch or for all splits by toggling `--evaluate_all_splits` and related arguments.

We have an [executable version](https://wilds.stanford.edu/codalab) of our paper on CodaLab that contains the exact commands, code, and data for the experiments reported in our paper, which rely on these scripts. Trained model weights for all datasets can also be found there.
### Evaluating trained models
We also provide an evaluation script that aggregates prediction CSV files for different replicates and reports on their combined evaluation. To use this, run:

```bash
python examples/evaluate.py <predictions_dir> <output_dir> --root-dir <root_dir>
```

where `<predictions_dir>` is the path to your predictions directory, `<output_dir>` is where the results JSON will be written, and `<root_dir>` is the dataset root directory.
The predictions directory should have a subdirectory for each dataset
(e.g. `iwildcam`) containing prediction CSV files to evaluate; see our [submission guidelines](https://wilds.stanford.edu/submit/) for the format.
The evaluation script will skip over any datasets that have missing prediction files.
Any dataset not in `<root_dir>` will be downloaded to `<root_dir>`.
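
For example, here is a hypothetical invocation; the directory names and CSV filenames below are placeholders, and the exact naming scheme is given in the submission guidelines:

```bash
# my_predictions/ holds one subdirectory per dataset, each with per-replicate CSVs, e.g.:
#   my_predictions/iwildcam/iwildcam_split:test_seed:0_epoch:best_pred.csv
#   my_predictions/iwildcam/iwildcam_split:test_seed:1_epoch:best_pred.csv
python examples/evaluate.py my_predictions my_results --root-dir data
```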

### Reproducibility
We have an [executable version](https://wilds.stanford.edu/codalab) of our paper on CodaLab that contains the exact commands, code, and data for the experiments reported in our paper, which rely on these scripts. Trained model weights for all datasets can also be found there.
All configurations and hyperparameters can also be found in the `examples/configs` folder of this repo, and dataset-specific parameters are in `examples/configs/datasets.py`.

## Using the WILDS package
### Data loading
### Data

The WILDS package provides a simple, standardized interface for all datasets in the benchmark.
This short Python snippet covers all of the steps of getting started with a WILDS dataset, including dataset download and initialization, accessing various splits, and preparing a user-customizable data loader.
We discuss data loading in more detail in [#Data loading](#data-loading).

```py
>>> from wilds import get_dataset
@@ -143,13 +161,13 @@ This short Python snippet covers all of the steps of getting started with a WILDS dataset, including dataset download and initialization, accessing various splits, and preparing a user-customizable data loader.
... ...
```

The `metadata` contains information like the domain identity, e.g., which camera a photo was taken from, or which hospital the patient's data came from, etc.
The `metadata` contains information like the domain identity, e.g., which camera a photo was taken from, or which hospital the patient's data came from, etc., as well as other metadata.
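
For example, here is a minimal sketch of inspecting the metadata schema; this assumes the `iwildcam` dataset has already been downloaded to `data/`:

```py
>>> from wilds import get_dataset

# Load the dataset without re-downloading it
>>> dataset = get_dataset(dataset='iwildcam', root_dir='data')

# The named columns of the metadata tensor, including the domain field 'location'
>>> print(dataset.metadata_fields)
```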

### Domain information
To allow algorithms to leverage domain annotations as well as other
groupings over the available metadata, the WILDS package provides `Grouper` objects.
These `Grouper` objects extract group annotations from metadata, allowing users to
specify the grouping scheme in a flexible fashion.
To allow algorithms to leverage domain annotations as well as other groupings over the available metadata, the WILDS package provides `Grouper` objects.
These `Grouper` objects are helper objects that extract group annotations from metadata, allowing users to specify the grouping scheme in a flexible fashion.
They are used to initialize group-aware data loaders (as discussed in [#Data loading](#data-loading)) and to implement algorithms that rely on domain annotations (e.g., Group DRO).
In the following code snippet, we initialize and use a `Grouper` that extracts the domain annotations on the iWildCam dataset, where the domain is location.

```py
>>> from wilds.common.grouper import CombinatorialGrouper
@@ -164,9 +182,21 @@ specify the grouping scheme in a flexible fashion.
... ...
```
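
A minimal sketch of this pattern, assuming the `dataset` and `train_loader` from the earlier snippet:

```py
>>> from wilds.common.grouper import CombinatorialGrouper

# Group examples by the 'location' metadata field
>>> grouper = CombinatorialGrouper(dataset, ['location'])

# Map each example's metadata row to its group (here, its location)
>>> for x, y_true, metadata in train_loader:
...     z = grouper.metadata_to_group(metadata)
...     ...
```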

The `Grouper` can be used to prepare a group-aware data loader that, for each minibatch, first samples a specified number of groups, then samples examples from those groups.
This allows our data loaders to accommodate a wide array of training algorithms,
some of which require specific data loading schemes.
### Data loading

For training, the WILDS package provides two types of data loaders.
The standard data loader shuffles examples in the training set, and is used for the standard approach of empirical risk minimization (ERM), where we minimize the average loss.
```py
>>> from wilds.common.data_loaders import get_train_loader

# Prepare the standard data loader
>>> train_loader = get_train_loader('standard', train_data, batch_size=16)
```

To support other algorithms that rely on specific data loading schemes, we also provide the group data loader.
In each minibatch, the group loader first samples a specified number of groups, and then samples a fixed number of examples from each of those groups.
(By default, the groups are sampled uniformly at random, which upweights minority groups as a result. This behavior can be toggled with the `uniform_over_groups` parameter.)
We initialize group loaders as follows, using a `Grouper` that specifies the grouping scheme.

```py
# Prepare a group data loader that samples from user-specified groups
@@ -176,6 +206,20 @@ some of which require specific data loading schemes.
... batch_size=16)
```
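
A minimal sketch of that call, assuming the `grouper` from the previous section (`grouper` and `n_groups_per_batch` are keyword arguments of `get_train_loader`):

```py
# Sample 2 groups per minibatch, then 8 examples from each sampled group
>>> train_loader = get_train_loader('group', train_data,
...                                 grouper=grouper,
...                                 n_groups_per_batch=2,
...                                 batch_size=16)
```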

Lastly, we also provide a data loader for evaluation, which loads examples without shuffling (unlike the training loaders).

```py
>>> from wilds.common.data_loaders import get_eval_loader

# Get the test set
>>> test_data = dataset.get_subset('test',
... transform=transforms.Compose([transforms.Resize((224,224)),
... transforms.ToTensor()]))

# Prepare the evaluation data loader
>>> test_loader = get_eval_loader('standard', test_data, batch_size=16)
```

### Evaluators

The WILDS package standardizes and automates evaluation for each dataset.
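
As a minimal sketch, assuming predictions and metadata accumulated over the test loader, each dataset exposes an `eval` method that returns a results dictionary along with a formatted summary string:

```py
# Evaluate accumulated predictions using the dataset's own metrics
>>> results, results_str = dataset.eval(all_y_pred, all_y_true, all_metadata)
>>> print(results_str)
```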
10 changes: 1 addition & 9 deletions dataset_preprocessing/camelyon17/generate_all_patch_coords.py
@@ -109,11 +109,7 @@ def _record_patches(center_size,
slide, slide_map, patch_level,
mask_level, tumor_mask, tissue_mask, normal_mask,
tumor_threshold,
tumor_sel_ratio,
tumor_sel_max,
normal_threshold,
normal_sel_ratio,
normal_sel_max,
normal_threshold,
**args):
"""
Extract all tumor and non-tumor patches from a slide, using the given masks.
@@ -197,11 +193,7 @@ def generate_file(patient, node, xml_path, slide_path, folder_path):
'mask_level' : MASK_LEVEL,
'center_size' : CENTER_SIZE,
'tumor_threshold' : 0,
'tumor_sel_ratio' : 1,
'tumor_sel_max' : 100000,
'normal_threshold' : 0.2,
'normal_sel_ratio' : 1,
'normal_sel_max' : 100000,
'mask_folder_path' : folder_path,
'make_map' : True
}
25 changes: 25 additions & 0 deletions dataset_preprocessing/encode/README.md
@@ -0,0 +1,25 @@
## ENCODE feature generation and preprocessing

#### Requirements
- pyBigWig

#### Instructions to create the CodaLab bundle

Here are instructions to reproduce the CodaLab bundle in a directory `BUNDLE_ROOT_DIRECTORY`; a consolidated command sketch follows the numbered steps.

1. Download the human genome sequence (hg19 assembly) in FASTA format from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz and extract it into `SEQUENCE_PATH`.

2. Run `python prep_sequence.py --seq_path SEQUENCE_PATH --output_dir OUTPUT_DIR` to write the FASTA file found in `SEQUENCE_PATH` to a numpy array archive in `OUTPUT_DIR`. (The dataset loader assumes this archive will be at `<bundle root directory>/sequence.npz`.)

3. Download the DNase accessibility data, which consists of whole-genome DNase files in bigwig format, from https://guanfiles.dcmb.med.umich.edu/Leopard/dnase_bigwig/. Save these as `<bundle root directory>/DNASE.<celltype>.fc.signal.bigwig`.

4. Run `python prep_accessibility.py`. This writes samples of each bigwig file to `<bundle root directory>/qn.<celltype>.npy`. These are used at runtime when the dataset loader is initialized, to perform quantile normalization on the DNase accessibility signals.

5. Download the labels from the challenge into a label directory `<bundle root directory>/labels/` created for this purpose:
- The training chromosome labels for the challenge's training cell types from https://www.synapse.org/#!Synapse:syn7413983 for the relevant transcription factor (https://www.synapse.org/#!Synapse:syn7415202 for the TF MAX, downloaded as MAX.train.labels.tsv.gz).
- The training chromosome labels for the challenge's evaluation cell type (liver) from https://www.synapse.org/#!Synapse:syn8077511 for the relevant transcription factor (https://www.synapse.org/#!Synapse:syn8077648 for the TF MAX, downloaded as MAX.train_wc.labels.tsv.gz).
- The validation chromosome labels for the challenge's training cell types from https://www.synapse.org/#!Synapse:syn8441154 for the relevant transcription factor (https://www.synapse.org/#!Synapse:syn8442103 for the TF MAX, downloaded as MAX.val.labels.tsv.gz).
- The validation chromosome labels for the challenge's evaluation cell type (liver) from https://www.synapse.org/#!Synapse:syn8442975 for the relevant transcription factor (https://www.synapse.org/#!Synapse:syn8443021 for the TF MAX, downloaded as MAX.test.labels.tsv.gz).

6. Run `python prep_metadata_labels.py`.
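
Taken together, a sketch of the full sequence of commands; the paths are placeholders, and the bigwig and label downloads from steps 3 and 5 are abbreviated:

```bash
# Steps 1-2: fetch and preprocess the hg19 genome sequence
wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
gunzip hg19.fa.gz
python prep_sequence.py --seq_path hg19.fa --output_dir BUNDLE_ROOT_DIRECTORY
# Steps 3 and 5: download the DNase bigwigs and challenge labels as described above
# Steps 4 and 6: write quantile-normalization samples, then metadata and labels
python prep_accessibility.py
python prep_metadata_labels.py
```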

54 changes: 54 additions & 0 deletions dataset_preprocessing/encode/prep_accessibility.py
@@ -0,0 +1,54 @@
# Adapted from https://github.com/GuanLab/Leopard/blob/master/data/quantile_normalize_bigwig.py

import argparse, time
import numpy as np
import pyBigWig

# Human chromosomes in hg19, and their sizes in bp
chrom_sizes = {'chr1': 249250621, 'chr10': 135534747, 'chr11': 135006516, 'chr12': 133851895, 'chr13': 115169878, 'chr14': 107349540, 'chr15': 102531392, 'chr16': 90354753, 'chr17': 81195210, 'chr18': 78077248, 'chr19': 59128983, 'chr2': 243199373, 'chr20': 63025520, 'chr21': 48129895, 'chr22': 51304566, 'chr3': 198022430, 'chr4': 191154276, 'chr5': 180915260, 'chr6': 171115067, 'chr7': 159138663, 'chr8': 146364022, 'chr9': 141213431, 'chrX': 155270560}


def qn_sample_to_array(
input_celltypes,
input_chroms=None,
subsampling_ratio=1000,
data_pfx = '/users/abalsubr/wilds/examples/data/encode_v1.0/'
):
"""
Compute and write distribution of DNase bigwigs corresponding to input celltypes.
"""
if input_chroms is None:
input_chroms = chrom_sizes.keys()
qn_chrom_sizes = { k: chrom_sizes[k] for k in input_chroms }
# Initialize chromosome-specific seeds for subsampling
chr_to_seed = {}
i = 0
for the_chr in qn_chrom_sizes:
chr_to_seed[the_chr] = i
i += 1

# subsampling
sample_len = np.ceil(np.array(list(qn_chrom_sizes.values()))/subsampling_ratio).astype(int)
sample = np.zeros(sum(sample_len))
start = 0
j = 0
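    # For each chromosome, seed the RNG deterministically, then subsample signal
    # values at random positions from each celltype's bigwig and average them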
for the_chr in qn_chrom_sizes:
np.random.seed(chr_to_seed[the_chr])
for ct in input_celltypes:
path = data_pfx + 'DNASE.{}.fc.signal.bigwig'.format(ct)
bw = pyBigWig.open(path)
signal = np.nan_to_num(np.array(bw.values(the_chr, 0, qn_chrom_sizes[the_chr])))
index = np.random.randint(0, len(signal), sample_len[j])
sample[start:(start+sample_len[j])] += (1.0/len(input_celltypes))*signal[index]
start += sample_len[j]
j += 1
print(the_chr, ct)
sample.sort()
np.save(data_pfx + "qn.{}.npy".format('.'.join(input_celltypes)), sample)


if __name__ == '__main__':
train_chroms = ['chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr10', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr22', 'chrX']
all_celltypes = ['H1-hESC', 'HCT116', 'HeLa-S3', 'K562', 'A549', 'GM12878', 'MCF-7', 'HepG2', 'liver']
for ct in all_celltypes:
qn_sample_to_array([ct], input_chroms=train_chroms)
