Commit
Showing 28 changed files with 4,812 additions and 2 deletions.

@@ -99,3 +99,5 @@ ENV/

# mypy
.mypy_cache/

**.DS_Store

@@ -1,2 +1,117 @@
# Multimodal Variational Autoencoder
A PyTorch implementation of *Multimodal Generative Models for Scalable Weakly-Supervised Learning* (https://arxiv.org/abs/1802.05335).

## Setup/Installation

Create a new conda environment and install the necessary dependencies. See [here](https://www.pyimagesearch.com/2017/03/27/how-to-install-dlib/) for more details on installing `dlib`.
```
conda create -n multimodal python=2.7 anaconda
# activate the environment
source activate multimodal
# install pytorch
conda install pytorch torchvision -c pytorch
pip install tqdm
pip install scikit-image
pip install opencv-python
pip install imutils
# install dlib
brew install cmake
brew install boost
pip install dlib
```
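
As an optional sanity check (this step is not in the original instructions), confirm that the core packages import cleanly in the new environment:

```
from __future__ import print_function
import torch, torchvision, cv2, skimage, dlib, tqdm, imutils
print('PyTorch', torch.__version__)
```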

Some additional setup is needed for the CelebA-related datasets. Download the aligned-and-cropped version [here](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), along with the annotation files. For the computer vision experiment, we precompute several transformed versions of the CelebA images. The dlib model we use to extract landmarks comes from a PyImageSearch tutorial; you can download it [here](https://www.pyimagesearch.com/2017/04/03/facial-landmarks-dlib-opencv-python/). After downloading CelebA, try the following:
```
cd vision
# assuming CelebA images are stored in ./data/images
python setup.py grayscale ./data/images ./data/grayscale
python setup.py edge ./data/images ./data/edge
python setup.py mask ./data/images ./data/mask
```
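
To make the preprocessing concrete, here is a sketch of the kind of per-image transform these commands produce. It uses standard OpenCV calls and is only an illustration, not the repository's `setup.py`:

```
import os
import cv2

def preprocess(src_path, out_dir, mode='edge'):
    """Write a grayscale or edge-map version of one CelebA image."""
    img = cv2.imread(src_path)                    # BGR uint8 image
    out = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # grayscale modality
    if mode == 'edge':
        out = cv2.Canny(out, 100, 200)            # edge-map modality
    cv2.imwrite(os.path.join(out_dir, os.path.basename(src_path)), out)
```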

## Example Experiments
This repository contains a subset of the experiments mentioned in the [paper](https://arxiv.org/abs/1802.05335). In each folder, there are 3 scripts that one can run: `train.py` to fit the MVAE; `sample.py` to (conditionally) reconstruct from samples in the latent space; and `loglike.py` to compute the marginal log likelihood `log p(x)` using `q(z|x,y)` as the inference network.
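
As described in the paper, the joint inference network is a product of experts: the unimodal Gaussian posteriors are multiplied together with the prior. A minimal sketch of that combination (illustrative code, not the repository's exact implementation):

```
import torch

def product_of_experts(mu, logvar, eps=1e-8):
    """Combine Gaussian experts N(mu_i, var_i) with the prior N(0, I).
    mu, logvar: tensors of shape (n_experts, batch_size, n_latents)."""
    # prepend the prior expert (zero mean, unit variance)
    mu = torch.cat([torch.zeros_like(mu[:1]), mu], dim=0)
    logvar = torch.cat([torch.zeros_like(logvar[:1]), logvar], dim=0)
    var = torch.exp(logvar) + eps
    precision = 1.0 / var                                  # T_i = 1 / sigma_i^2
    joint_mu = torch.sum(mu * precision, dim=0) / torch.sum(precision, dim=0)
    joint_var = 1.0 / torch.sum(precision, dim=0)
    return joint_mu, torch.log(joint_var + eps)
```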

By default, we anneal KL from 0 to 1. The user can customize the learning rate (`--lr`), the number of latent dimensions (`--n-latents`), the annealing rate (`--annealing-epochs`), etc. from the command line. Notably, the user can set `lambda_image` and `lambda_text`, which balance the reconstruction terms; this tends to be important in practice. Training the model will save weights to the filesystem. Run `python train.py -h` for details.
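
Putting these pieces together, the training objective is a sum of per-modality reconstruction terms weighted by the lambdas, plus a KL term scaled by the annealing factor. A minimal sketch under those assumptions (the loss choices and argument names are illustrative, not the repository's exact code):

```
import torch
import torch.nn.functional as F

def elbo_loss(recon_image, image, recon_text, text, mu, logvar,
              lambda_image=1.0, lambda_text=1.0, annealing_factor=1.0):
    # reconstruction terms, one per modality (pick losses matching the decoders)
    image_nll = F.binary_cross_entropy(recon_image, image, reduction='sum')
    text_nll = F.cross_entropy(recon_text, text, reduction='sum')
    # KL(q(z|x,y) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return lambda_image * image_nll + lambda_text * text_nll + annealing_factor * kl
```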

![experiment-reconstructions](./static/reconstructions1.png)

### MNIST
Treat images as one modality and the label (an integer from 0 to 9) as a second.

```
cd mnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 5 --cuda
```

### FashionMNIST
Very similar to MNIST, except the labels correspond to categories of fashion items.

```
cd fashionmnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1 --cuda
```

### MultiMNIST
Again, an MNIST derivative, except each image contains up to 4 digits in fixed locations. The second modality is a string of digits representing the character(s) in the image. We employ an RNN in the label inference network `q(z|y)`, sketched below.
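
An RNN-based label encoder of this kind might look roughly as follows (an illustrative sketch; the module and argument names are assumptions, not the repository's exact code):

```
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embed a digit string, run a GRU, and map the final hidden state
    to Gaussian parameters over the latent variables."""
    def __init__(self, n_latents, n_characters=10, embed_dim=64, hidden_dim=128):
        super(TextEncoder, self).__init__()
        self.embed = nn.Embedding(n_characters, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, n_latents)
        self.logvar = nn.Linear(hidden_dim, n_latents)

    def forward(self, text):
        # text: LongTensor of shape (batch_size, sequence_length)
        embedded = self.embed(text)
        _, hidden = self.gru(embedded)
        hidden = hidden[-1]                       # (batch_size, hidden_dim)
        return self.mu(hidden), self.logvar(hidden)
```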

```
cd multimnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the digits
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1773 --cuda
```

### CelebA
Treat images of celebrity faces as one modality and 18 attributes pertaining to the celebrity (e.g. gender, hair color) as a second modality.

```
cd celeba
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the attribute
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-attrs Male --cuda
```

### CelebA-19
Similar to CelebA, except we treat each attribute as its own expert in the product-of-experts. Here we begin to explore more than 2 modalities. See the code for an example of the MVAE training paradigm (mentioned in the paper) of subsampling multimodal ELBO terms.

```
cd celeba19
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --approx-m 1 --cuda
```

Here `--approx-m` sets the number of ELBO terms to sample beyond the complete and individual terms.
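
One way to realize this subsampling (an illustrative sketch, not the repository's exact code) is to draw `k` extra random subsets of the modalities each training step and add an ELBO term for each, alongside the complete and individual terms:

```
import random

def sample_elbo_subsets(modalities, k):
    """Return the complete set, every singleton, and k random subsets of
    size >= 2 (assumes at least 3 modalities). Illustrative only."""
    subsets = [tuple(modalities)] + [(m,) for m in modalities]
    for _ in range(k):
        size = random.randint(2, len(modalities) - 1)
        subsets.append(tuple(random.sample(modalities, size)))
    return subsets
```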

### Computer Vision Transformations
We learn a series of image processing transformations (i.e. colorization, image completion, edge detection, watermark removal, and facial landmark segmentation) as modalities. We curate a dataset by applying off-the-shelf tools to CelebA. For simplicity, in this implementation we only include the complete ELBO term (using all 6 modalities) and the 6 individual ELBO terms as the objective (in other words, `k = 0`). One can also subsample more ELBO terms to better approximate the true MVAE objective (as in `/celeba19/train.py`).

```
cd vision
CUDA_VISIBLE_DEVICES=0 python train.py --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# this will reconstruct all the modalities from the image
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type image --cuda
# we can also go in the other direction
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type watermark --cuda
```

![vision-reconstructions](./static/reconstructions2.png)

## Questions?
Please report any bugs and I will get to them ASAP. For any additional questions, feel free to email me@mikehwu.com.

@@ -0,0 +1,152 @@
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import sys
import copy
import random
import numpy as np
import numpy.random as npr
from PIL import Image
from random import shuffle
from scipy.misc import imresize

import torch
from torch.utils.data.dataset import Dataset

VALID_PARTITIONS = {'train': 0, 'val': 1, 'test': 2}
# go from label index to interpretable index
ATTR_TO_IX_DICT = {'Sideburns': 30, 'Black_Hair': 8, 'Wavy_Hair': 33, 'Young': 39, 'Heavy_Makeup': 18,
                   'Blond_Hair': 9, 'Attractive': 2, '5_o_Clock_Shadow': 0, 'Wearing_Necktie': 38,
                   'Blurry': 10, 'Double_Chin': 14, 'Brown_Hair': 11, 'Mouth_Slightly_Open': 21,
                   'Goatee': 16, 'Bald': 4, 'Pointy_Nose': 27, 'Gray_Hair': 17, 'Pale_Skin': 26,
                   'Arched_Eyebrows': 1, 'Wearing_Hat': 35, 'Receding_Hairline': 28, 'Straight_Hair': 32,
                   'Big_Nose': 7, 'Rosy_Cheeks': 29, 'Oval_Face': 25, 'Bangs': 5, 'Male': 20, 'Mustache': 22,
                   'High_Cheekbones': 19, 'No_Beard': 24, 'Eyeglasses': 15, 'Bags_Under_Eyes': 3,
                   'Wearing_Necklace': 37, 'Wearing_Lipstick': 36, 'Big_Lips': 6, 'Narrow_Eyes': 23,
                   'Chubby': 13, 'Smiling': 31, 'Bushy_Eyebrows': 12, 'Wearing_Earrings': 34}
# we only keep 18 of the more visually distinctive features
# See [1] Perarnau, Guim, et al. "Invertible conditional GANs for
# image editing." arXiv preprint arXiv:1611.06355 (2016).
ATTR_IX_TO_KEEP = [4, 5, 8, 9, 11, 12, 15, 17, 18, 20, 21, 22, 26, 28, 31, 32, 33, 35]
IX_TO_ATTR_DICT = {v: k for k, v in ATTR_TO_IX_DICT.iteritems()}
N_ATTRS = len(ATTR_IX_TO_KEEP)
ATTR_TO_PLOT = ['Heavy_Makeup', 'Male', 'Mouth_Slightly_Open', 'Smiling', 'Wavy_Hair']


class CelebAttributes(Dataset):
    """Define dataset of images of celebrities and attributes.
    The user needs to have pre-defined the Anno and Eval folder from
    http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    @param partition: string
                      train|val|test [default: train]
                      See VALID_PARTITIONS global variable.
    @param data_dir: string
                     path to root of dataset images [default: ./data]
    @param image_transform: ?torchvision.Transforms
                            optional function to apply to training inputs
    @param attr_transform: ?torchvision.Transforms
                           optional function to apply to training outputs
    """
    def __init__(self, partition='train', data_dir='./data',
                 image_transform=None, attr_transform=None):
        self.partition = partition
        self.image_transform = image_transform
        self.attr_transform = attr_transform
        self.data_dir = data_dir
        assert partition in VALID_PARTITIONS.keys()
        self.image_paths = load_eval_partition(partition, data_dir=data_dir)
        self.attr_data = load_attributes(self.image_paths, partition,
                                         data_dir=data_dir)
        self.size = int(len(self.image_paths))

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, attr) where attr is the attribute vector.
        """
        image_path = os.path.join(self.data_dir, 'img_align_celeba',
                                  self.image_paths[index])
        attr = self.attr_data[index]
        image = Image.open(image_path).convert('RGB')

        if self.image_transform is not None:
            image = self.image_transform(image)

        if self.attr_transform is not None:
            attr = self.attr_transform(attr)

        return image, attr

    def __len__(self):
        return self.size


def load_eval_partition(partition, data_dir='./data'):
    """After downloading the dataset, we can load a subset for
    training or testing.
    @param partition: string
                      which subset to use (train|val|test)
    @param data_dir: string [default: ./data]
                     where the images are saved
    """
    eval_data = []
    with open(os.path.join(data_dir, 'Eval/list_eval_partition.txt')) as fp:
        rows = fp.readlines()
        for row in rows:
            path, label = row.strip().split(' ')
            label = int(label)
            if label == VALID_PARTITIONS[partition]:
                eval_data.append(path)
    return eval_data


def load_attributes(paths, partition, data_dir='./data'):
    """Load the attributes into a torch tensor.
    @param paths: list of strings
                  image paths belonging to the chosen partition
    @param partition: string
                      which subset to use (train|val|test)
    @param data_dir: string [default: ./data]
                     where the images are saved
    """
    if os.path.isfile(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition)):
        attr_data = np.load(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition))
    else:
        attr_data = []
        with open(os.path.join(data_dir, 'Anno/list_attr_celeba.txt')) as fp:
            rows = fp.readlines()
            for ix, row in enumerate(rows[2:]):
                row = row.strip().split()
                path, attrs = row[0], row[1:]
                if path in paths:
                    attrs = np.array(attrs).astype(int)
                    attrs[attrs < 0] = 0
                    attr_data.append(attrs)
        attr_data = np.vstack(attr_data).astype(np.int64)
    attr_data = torch.from_numpy(attr_data).float()
    return attr_data[:, ATTR_IX_TO_KEEP]


def tensor_to_attributes(tensor):
    """Use this for the <attr_transform>.
    @param tensor: PyTorch Tensor
                   D dimensional tensor of attribute probabilities
    @return attributes: list of strings
    """
    attrs = []
    n = tensor.size(0)
    tensor = torch.round(tensor)
    for i in xrange(n):
        if tensor[i] > 0.5:
            attr = IX_TO_ATTR_DICT[ATTR_IX_TO_KEEP[i]]
            attrs.append(attr)
    return attrs
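

# Illustrative usage sketch (added for exposition; not part of the original file).
# Assumes CelebA lives under ./data as described in the README; the torchvision
# transforms below are one reasonable choice, not the repository's exact settings.
if __name__ == '__main__':
    from torch.utils.data import DataLoader
    from torchvision import transforms

    image_transform = transforms.Compose([transforms.CenterCrop(64),
                                          transforms.ToTensor()])
    dataset = CelebAttributes(partition='train', data_dir='./data',
                              image_transform=image_transform)
    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    image, attr = dataset[0]  # image tensor and 18-dim attribute vector
    print('dataset size: %d, attribute dims: %d' % (len(dataset), attr.size(0)))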