Public repo files add

modestyachts · Feb 12, 2019 · b14e00d · b14e00d
commit b14e00d
Show file tree

Hide file tree

Showing 1,374 changed files with 12,248,721 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1 @@
+*.swp
diff --git a/README.md b/README.md
@@ -0,0 +1,197 @@
+# ImageNetV2
+The ImageNetV2 dataset contains new test data for the [ImageNet](www.image-net.org) benchmark.
+This repository provides associated code for assembling and working with ImageNetV2.
+The actual test sets are stored [in a separate location](http://imagenetv2public.s3-website-us-west-2.amazonaws.com/).
+
+ImageNetV2 contains three test sets with 10,000 new images each.
+Importantly, these test sets were sampled *after* a decade of progress on the original ImageNet dataset.
+This makes the new test data independent of existing models and guarantees that the accuracy scores are not affected by adaptive overfitting.
+We designed the data collection process for ImageNetV2 so that the resulting distribution is as similar as possible to the original ImageNet dataset.
+Our paper ["Do ImageNet Classifiers Generalize to ImageNet?"](https://arxiv.org/abs/1806.00451) describes ImageNetV2 and associated experiments in detail.
+
+In addition to the three test sets, we also release our pool of candidate images from which the test sets were assembled.
+Each image comes with rich metadata such as the corresponding Flickr search queries or the annotations from MTurk workers.
+
+The aforementioned paper also describes CIFAR-10.1, a new test set for CIFAR-10.
+It can be found in the following repository: https://github.com/modestyachts/CIFAR-10.1
+
+
+# Using the Dataset
+Before explaining how the code in this repository was used to assemble ImageNetV2, we first describe how to load our new test sets.
+
+## Test Set Versions
+
+There are currently three test sets in ImageNetV2:
+
+- `Threshold0.7` was built by sampling ten images for each class among the candidates with selection frequency at least 0.7. 
+
+- `MatchedFrequency` was sampled to match the MTurk selection frequency distribution of the original ImageNet validation set for each class. 
+
+- `TopImages` contains the ten images with highest selection frequency in our candidate pool for each class.
+
+In our code, we adopt the following naming convention:
+Each test set is identified with a string of the form
+
+`imagenetv2-<test-set-letter>-<revision-number>`
+
+for instance, `imagenetv2-b-31`. The `Threshold0.7`, `MatchedFrequency`, and `TopImages` have test set letters `a`, `b`, and `c`, respectively.
+The current revision numbers for the test sets are `imagenetv2-a-44`, `imagenetv2-b-33`, `imagenetv2-c-12`.
+We refer to our paper for a detailed description of these test sets and the review process underlying the different test set revisions.
+
+
+## Loading a Test Set
+
+You can download the test sets from the following url: http://imagenetv2public.s3-website-us-west-2.amazonaws.com/
+
+To load the dataset, you can use the [`ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) class in [PyTorch](https://pytorch.org/) on the extracted folder.
+For instance, the following code loads the `MatchedFrequency` dataset:
+
+```python
+from torchvision import datasets
+datasets.ImageFolder(root='imagenetv2-matched-frequency')
+```
+
+
+# Dataset Creation Pipeline
+
+The dataset creation process has several stages outlined below.
+We describe the process here at a high level.
+If you have questions about any individual steps, please contact Rebecca Roelofs (roelofs@cs.berkeley.edu) and Ludwig Schmidt (ludwig@berkeley.edu).
+
+
+## 1. Downloading images from Flickr
+
+In the first stage, we collected candidate images from the Flickr image hosting service.
+This requires a [Flickr API key](https://www.flickr.com/services/api/misc.api_keys.html).
+
+We ran the following command to search Flickr for images for a fixed list of wnids:
+
+```
+python flickr_search.py "../data/flickr_api_keys.json" \
+                        --wnids "{wnid_list.json}" \
+                        --max_images 200 \
+                        --max_date_taken "2013-07-11"\
+                        --max_date_uploaded "2013-07-11"\
+                        --min_date_taken "2012-07-11"\
+                        --min_date_uploaded "2012-07-11" 
+```
+We refer to the paper for more details on which Flickr search parameters we used to complete our candidate pool.
+
+The script outputs search result metadata, including the Flickr URLs returned for each query. 
+This search result metadata is written to `/data/search_results/`. 
+
+We then stored the images to an Amazon S3 bucket using 
+```
+python download_images_from_flickr.py ../data/search_results/{search_result.json} --batch --parallel
+```
+
+## 2. Create HITs
+Similar to the original ImageNet dataset, we used Amazon Mechanical Turk (MTurk) to filter our pool of candidates.
+The main unit of work on MTurk is a HIT (Human Intelligence Tasks), which in our case consists of 48 images with a target class.
+The format of our HITs was derived from the original ImageNet HITs.
+
+To submit a HIT, we performed the following steps.
+They require a configured [MTurk account](https://www.mturk.com/).
+  1. Encrypt all image URLs.  This is necessary so that MTurk workers cannot identify whether an image is from the original validation set or our candidate pool by the source URL. 
+    `python encrypt_copy_objects.py imagenet2candidates_mturk --strip_string ".jpg" --pywren`
+  2. Run the image consistency check.  This checks that all of the new candidate images have been stored to S3 and have encrypted URLs. 
+    `python image_consistency_check.py`
+  3. Generate hit candidates. This outputs a list of candidates to `data/hit_candidates`
+    `python generate_hit_candidates.py  --num_wnids 1000`
+  4. Submit live HITs to MTurk. 
+    `bash make_hits_live.sh sample_args_10.json <username> <latest_hit_candidate_file>`
+  5. Wait for prompt, and check if HTML file in the code/ directory looks correct.
+  6. Type in the word LIVE to confirm submitting the HITs to MTurk (this costs money).
+
+The HIT metadata created by `make_hits_live.sh` is stored in `data/mturk/hit_data_live/`.
+
+After a set of HITs was submitted, you can check their progress using
+`python3 mturk.py show_hit_progress --live --hit_file ../data/mturk/hit_data_live/{hit.json}`
+
+Additionally, we occasionally used the Jupyter notebook `inspect_hit.ipynb` to visually examine the HITs we created.
+The code for this notebook is stored in `inspect_hit_notebook_code.py`.
+
+
+
+## 3. Remove near duplicates
+Next, we removed near-duplicates from our candidate pool.
+We checked for near-duplicates both within our new test set and between our new test set and the original ImageNet dataset.
+
+To find near-duplicates, we computed the 30 nearest neighbors for each candidate image in three different metrics: l2 distance on raw pixels, l2 distance on features extracted from a pre-trained [VGG](https://arxiv.org/abs/1409.1556) model (fc7), and [SSIM](http://www.cns.nyu.edu/pub/lcv/wang03-preprint.pdf) (structural similarity).
+
+The fc7 metric requires that each image is featurized using the same pre-trained VGG model. The scripts `featurize.py`, `feaurize_test.py` and `featurize_candidates.py` were used to perform the fc7 featurization. 
+
+Next, we computed the nearest neighbors for each image.
+Each metric has a different starting script:
+* `run_near_duplicate_checker_dssim.py`
+* `run_near_duplicate_checker_l2.py`
+* `run_near_duplicate_checker_fc7.py`
+
+All three scripts use `near_duplicate_checker.py` for the underlying computation.
+
+The script `test_near_duplicate_checker.sh` was used to run the unit tests for the near duplicate checker contained in `test_near_duplicate_checker.py`. 
+
+Finally, we manually reviewed the nearest neighbor pairs using the notebook `review_near_duplicates.ipynb`. The file `review_near_duplicates_notebook_code.py` contains the code for this notebook. The review output is saved in `data/metadata/nearest_neighbor_reviews_v2.json`.
+All near duplicates that we found are saved in `data/metadata/near_duplicates.json`.
+
+
+## 4. Sample Dataset
+After we created a labeled candidate pool, we sampled the new test sets.
+
+We use a separate bash script to sample each version of the dataset, i.e `sample_dataset_type_{a}.sh`.
+Each script calls `sample_dataset.py` and `initialize_dataset_review.py` with the correct arguments.
+The file `dataset_sampling.py` contains helper functions for the sampling procedure. 
+
+## 5. Review Final Dataset
+For quality control, we added a final reviewing step to our dataset creation pipeline.
+
+* `initialize_dataset_review.py` initializes the metadata needed for each dataset review round.
+* `final_dataset_inspection.ipynb` is used to manually review dataset versions. 
+* `final_dataset_inspection_notebook_code.py` contains the code needed for the `final_dataset_inspection.ipynb` notebook.
+
+* `review_server.py` is the review server used for additional cleaning of the candidate pool.  The review server starts a web UI that allows one to browse all candidate images for a particular class.  In addition, a user can easily flag images that are problematic or near duplicates.
+
+The review server can use local, downloaded images if started with the flag
+`python3 review_server.py --use_local_images`.
+In addition, you also need to launch a separate static file server for serving the images.
+There is a script in `data` for starting the static file server `./start_file_server.sh`.
+
+The local images can be downloaded using 
+* `download_all_candidate_images_to_cache.py`
+* `download_dataset_images.py`
+
+
+# Data classes
+
+Our code base contains a set of data classes for working with various aspects of ImageNetV2.
+
+* `imagenet.py`: This file contains the `ImageNetData` class that provides metadata about ImageNet (a list of classes, etc.) and functionality for loading images in the original ImageNet dataset. The scripts `generate_imagenet_metadata_pickle.py` are used to assemble `generate_class_info_file.py` some of the metadata in the `ImageNetData` class.
+
+* `candidate_data.py` contains the `CandidateData` class that provides easy access to all candidate images in ImageNetV2 (both image data and metadata). The metadata file used in this class comes from `generate_candidate_metadata_pickle.py`.
+
+* `image_loader.py` provides a unified interface to loading image data from either ImageNet or ImageNetV2.
+
+* `mturk_data.py` provides the `MTurkData` class for accessing the results from our MTurk HITs. The data used by this class is assembled via `generate_mturk_data_pickle`.
+
+* `near_duplicate_data.py` loads and processes the information about near-duplicates in ImageNetV2. Some of the metadata is prepared with `generate_review_thresholds_pickle.py`.
+
+* `dataset_cache.py` allows easy loading of our various test set revisions.
+
+* `prediction_data.py` provides functionality for loading the predictions of various classification models on our three test sets.
+
+The functionality provided by each data class is documented via examples in the `notebooks` folder of this repository.
+
+# Evaluation Pipeline
+Finally, we describe our evaluation pipeline for the PyTorch models.
+The main file is `eval.py`, which can be invoked as follows:
+
+`python eval.py --dataset $DATASET --models $MODELS`
+
+where $DATASET is one of
+- `imagenet-validation-original` (the original validation set)
+- `imagenetv2-b-33` (our new `MatchedFrequency` test set)
+- `imagenetv2-a-44` (our new `Threshold.7` test set)
+- `imagenetv2-c-12` (our new `TopImages` test set).
+
+The $MODELS parameter is a comma-separated list of model names in the [torchvision](https://pytorch.org/docs/stable/torchvision/models.html) or [Cadene/pretrained-models.pytorch](https://github.com/Cadene/pretrained-models.pytorch) repositories.
+Alternatively, $MODELS can also be `all`, in which case all models are evaluated.
diff --git a/code/.pywren_config b/code/.pywren_config
@@ -0,0 +1,32 @@
+account:
+    aws_account_id: 372094282631
+    aws_lambda_role: pywren_exec_role_1
+    aws_region: us-west-2
+
+
+lambda:
+    memory : 1536
+    timeout : 300
+    function_name : pywren_2
+
+s3:
+    bucket: imagenet2datav2
+    pywren_prefix: pywren.jobs
+
+runtime:
+    s3_bucket: imagenet2pywren
+    s3_key: pywren.runtime/pywren_runtime-3.6-imagenet2pywren.meta.json
+
+scheduler:
+    map_item_limit: 500000
+
+standalone:
+    ec2_instance_type: c4.8xlarge
+    sqs_queue_name: pywren-jobs-1
+    visibility: 10
+    ec2_ssh_key : imagenet2
+    target_ami : ami-09ff122fe3407eb8c
+    instance_name: pywren-standalone
+    instance_profile_name: pywren-standalone
+    max_idle_time: 50000
+    idle_terminate_granularity: 5000
diff --git a/code/aes.py b/code/aes.py
@@ -0,0 +1,22 @@
+import base64
+
+from Crypto import Random
+from Crypto.Cipher import AES
+from Crypto.Util.Padding import pad, unpad
+
+
+class AESCipher(object):
+    def __init__(self, key, iv):
+        self.key = key
+        self.iv  = iv
+
+    def encrypt(self, raw):
+        raw = pad(raw, AES.block_size)
+        cipher = AES.new(self.key, AES.MODE_CBC, self.iv)
+        return base64.urlsafe_b64encode(cipher.encrypt(raw))
+
+    def decrypt(self, enc):
+        enc = base64.urlsafe_b64decode(enc)
+        cipher = AES.new(self.key, AES.MODE_CBC, self.iv)
+        return unpad(cipher.decrypt(enc), AES.block_size)
+
diff --git a/code/backup_mturk.py b/code/backup_mturk.py
@@ -0,0 +1,87 @@
+import datetime
+import getpass
+import json
+import time
+from timeit import default_timer as timer
+
+import tqdm
+
+import mturk_utils
+
+
+client = mturk_utils.get_mturk_client(live=True)
+max_num_results = 100
+
+api_results = []
+
+all_hit_ids = []
+
+print('Retrieving HITs ...')
+last_hit_count_print = 0
+last_hit_count_print_time = timer()
+
+res = client.list_hits(MaxResults=max_num_results)
+cur_call = {
+    'method': 'list_hits',
+    'arguments': {'MaxResults': max_num_results},
+    'result': res
+}
+api_results.append(cur_call)
+
+for hit in res['HITs']:
+    all_hit_ids.append(hit['HITId'])
+
+next_token = None
+if 'NextToken' in res:
+    next_token = res['NextToken']
+while next_token is not None:
+    res = client.list_hits(MaxResults=max_num_results, NextToken=next_token)
+    cur_call = {
+        'method': 'list_hits',
+        'arguments': {'MaxResults': max_num_results, 'NextToken': next_token},
+        'result': res
+    }
+    api_results.append(cur_call)
+    for hit in res['HITs']:
+        all_hit_ids.append(hit['HITId'])
+    if len(all_hit_ids) - last_hit_count_print > 2000:
+        hits_per_second = (len(all_hit_ids) - last_hit_count_print) / (timer() - last_hit_count_print_time)
+        print(f'    retrieved {len(all_hit_ids)} HITs so far ({hits_per_second:.1f} HITs per second)')
+        last_hit_count_print = len(all_hit_ids)
+        last_hit_count_print_time = timer()
+    next_token = None
+    if 'NextToken' in res:
+        next_token = res['NextToken']
+
+assert len(all_hit_ids) == len(set(all_hit_ids))
+print(f'Retrieved {len(all_hit_ids)} HITs in total')
+print('Retrieving corresponding assignments ...')
+
+for hit_id in tqdm.tqdm(all_hit_ids):
+    cur_result = client.list_assignments_for_hit(HITId=hit_id,
+                                                 MaxResults=max_num_results)
+    assert cur_result['NumResults'] < max_num_results
+    cur_call = {
+        'method': 'list_assignments_for_hit',
+        'arguments': {
+            'HITId': hit_id,
+            'MaxRresults': max_num_results
+        },
+        'result': cur_result
+    }
+    api_results.append(cur_call)
+
+time_string = datetime.datetime.now(datetime.timezone.utc).isoformat()
+
+api_data = {
+    'username': getpass.getuser(),
+    'time': time_string,
+    'results': api_results
+}
+
+def datetime_serializer(dt):
+    assert type(dt) is datetime.datetime
+    return dt.isoformat()
+
+with open(f'api_data_{time_string}.json', 'w') as f:
+    json.dump(api_data, f, sort_keys=True, indent=2, default=datetime_serializer)