Commit

Initial commit.
mhw32 committed May 27, 2018
1 parent 7a28b49 commit 594451f
Showing 28 changed files with 4,812 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -99,3 +99,5 @@ ENV/

# mypy
.mypy_cache/

**.DS_Store
119 changes: 117 additions & 2 deletions README.md
@@ -1,2 +1,117 @@
# Multimodal Variational Autoencoder
A PyTorch implementation of *Multimodal Generative Models for Scalable Weakly-Supervised Learning* (https://arxiv.org/abs/1802.05335).

## Setup/Installation

Create a new conda environment and install the necessary dependencies. See [here](https://www.pyimagesearch.com/2017/03/27/how-to-install-dlib/) for more details on installing `dlib`.
```
conda create -n multimodal python=2.7 anaconda
# activate the environment
source activate multimodal
# install pytorch and torchvision
conda install pytorch torchvision -c pytorch
pip install tqdm
pip install scikit-image
pip install opencv-python
pip install imutils
# install dlib (cmake and boost via Homebrew on macOS)
brew install cmake
brew install boost
pip install dlib
```

Some additional setup is needed for the CelebA-related datasets. Download the aligned-and-cropped version [here](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), along with the annotation files (attribute labels and evaluation partitions). For the computer vision experiment, we need to precompute several transformed versions of the CelebA images. The dlib model we use to extract facial landmarks is from a PyImageSearch tutorial; you can download it [here](https://www.pyimagesearch.com/2017/04/03/facial-landmarks-dlib-opencv-python/). After downloading CelebA, try the following:
```
cd vision
# assuming CelebA images are stored in ./data/images
python setup.py grayscale ./data/images ./data/grayscale
python setup.py edge ./data/images ./data/edge
python setup.py mask ./data/images ./data/mask
```
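
For reference, landmark extraction with dlib generally looks like the short sketch below. This is not the repository's exact code; the predictor file is the standard 68-point model from the PyImageSearch tutorial linked above, and the image path is a placeholder.

```
import cv2
import dlib

# assumes the 68-point shape predictor from the tutorial has been downloaded
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

image = cv2.imread('./data/images/000001.jpg')  # any CelebA image (placeholder path)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):                  # one rectangle per detected face
    shape = predictor(gray, rect)               # 68 (x, y) facial landmarks
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print(points[:5])
```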

## Example Experiments
This repository contains a subset of the experiments mentioned in the [paper](https://arxiv.org/abs/1802.05335). In each folder, there are 3 scripts that one can run: `train.py` to fit the MVAE; `sample.py` to (conditionally) reconstruct from samples in the latent space; and `loglike.py` to compute the marginal log likelihood `log p(x)` using `q(z|x,y)` as the inference network.
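
The marginal log likelihood is intractable, so it is typically estimated by importance sampling with the inference network as the proposal distribution. Below is a minimal single-example sketch of that estimator; `encode` and `decode` are hypothetical stand-ins for the model's inference and generative networks, and the Bernoulli image likelihood is an illustrative choice, not necessarily what `loglike.py` uses.

```
import math
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def estimate_log_px(x, encode, decode, k=100):
    # q(z|x) parameters from the (hypothetical) inference network
    mu, logvar = encode(x)
    q = Normal(mu, torch.exp(0.5 * logvar))
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    log_w = []
    for _ in range(k):
        z = q.rsample()                                   # z ~ q(z|x)
        log_px_z = -F.binary_cross_entropy_with_logits(
            decode(z), x, reduction='sum')                # log p(x|z)
        log_w.append(log_px_z + prior.log_prob(z).sum() - q.log_prob(z).sum())
    # log p(x) ~= logsumexp(log w) - log k
    return torch.logsumexp(torch.stack(log_w), dim=0) - math.log(k)
```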

By default, we anneal the KL weight from 0 to 1. The user can customize the learning rate (`--lr`), the number of latent dimensions (`--n-latents`), the annealing schedule (`--annealing-epochs`), etc. from the command line. Notably, the user can set `lambda_image` and `lambda_text`, which balance the reconstruction terms; this tends to be important in practice. Training the model will save weights to the filesystem. Run `python train.py -h` for details.
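
A common way to implement this annealing is a linear ramp of the KL weight over the first `--annealing-epochs` epochs. The helper below is an illustrative sketch, not the repository's exact code (the actual implementation may step per batch rather than per epoch):

```
def kl_annealing_weight(epoch, annealing_epochs):
    """Linearly ramp the KL weight from 0 to 1 over `annealing_epochs` epochs."""
    return min(1.0, float(epoch) / float(annealing_epochs))

# inside the training loop (names illustrative):
#   beta = kl_annealing_weight(epoch, args.annealing_epochs)
#   loss = lambda_image * image_recon + lambda_text * text_recon + beta * kl_divergence
```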

![experiment-reconstructions](./static/reconstructions1.png)

### MNIST
Treat images as one modality and the label (integer 0 to 9) as a second.

```
cd mnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 5 --cuda
```

### FashionMNIST
Very similar to MNIST, except the labels correspond to categories of fashion items.

```
cd fashionmnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1 --cuda
```

### MultiMNIST
Again an MNIST derivative, except each image contains up to 4 digits in fixed locations. The second modality is a string of digits representing the character(s) in the image. We employ an RNN in the label inference network `q(z|y)`.

```
cd multimnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the digits
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1773 --cuda
```
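
As a rough picture of what the recurrent label encoder could look like, here is a minimal GRU-based `q(z|y)` that maps a digit string to Gaussian latent parameters. Layer sizes and names are illustrative and not taken from the repository.

```
import torch.nn as nn

class TextEncoder(nn.Module):
    """Illustrative RNN inference network q(z|y) for a string of digits."""
    def __init__(self, n_latents=64, n_tokens=11, embed_dim=64, hidden_dim=256):
        super(TextEncoder, self).__init__()
        self.embed = nn.Embedding(n_tokens, embed_dim)   # 10 digits + padding token
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc_mu = nn.Linear(hidden_dim, n_latents)
        self.fc_logvar = nn.Linear(hidden_dim, n_latents)

    def forward(self, tokens):                 # tokens: (batch, seq_len) LongTensor
        embedded = self.embed(tokens)
        _, hidden = self.gru(embedded)         # hidden: (1, batch, hidden_dim)
        hidden = hidden.squeeze(0)
        return self.fc_mu(hidden), self.fc_logvar(hidden)
```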

### CelebA
Treat images of celebrity faces as one modality and 18 attributes describing the celebrity (e.g. gender, hair color) as a second modality.

```
cd celeba
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the attribute
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-attrs Male --cuda
```

### CelebA-19
Similar to CelebA, except we treat each attribute as its own expert in the product-of-experts. Here we begin to explore more than two modalities. See the code for an example of the MVAE training paradigm (mentioned in the paper) that subsamples multimodal ELBO terms.
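
For Gaussian experts, the product-of-experts posterior has a closed form: precisions add and means are precision-weighted. A minimal sketch of that combination is below; a standard-normal prior expert can be included as one of the rows.

```
import torch

def product_of_experts(mu, logvar, eps=1e-8):
    """Combine Gaussian experts with shapes (n_experts, batch, n_latents).
    Illustrative sketch of the PoE used to form the joint posterior."""
    var = torch.exp(logvar) + eps
    precision = 1.0 / var                               # T_i = 1 / sigma_i^2
    pd_var = 1.0 / torch.sum(precision, dim=0)          # combined variance
    pd_mu = pd_var * torch.sum(mu * precision, dim=0)   # precision-weighted mean
    return pd_mu, torch.log(pd_var)
```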

```
cd celeba
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --approx-m 1 --cuda
```

Here `--approx-m` sets the number of extra ELBO terms to sample beyond the complete and individual terms.
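
A rough sketch of this subsampling scheme: in addition to the joint term over all modalities and the individual terms, draw a few random subsets of modalities each step and add their ELBO terms to the objective. The helper below is illustrative, not the repository's code.

```
import random

def sample_elbo_subsets(modalities, approx_m):
    """Return the modality subsets whose ELBO terms enter the objective:
    the complete set, each individual modality, and `approx_m` random subsets.
    Assumes at least three modalities."""
    subsets = [tuple(modalities)]                          # joint ELBO over all modalities
    subsets += [(m,) for m in modalities]                  # one ELBO term per modality
    for _ in range(approx_m):                              # extra randomly chosen subsets
        k = random.randint(2, len(modalities) - 1)
        subsets.append(tuple(random.sample(modalities, k)))
    return subsets

# e.g. sample_elbo_subsets(['image'] + ['attr%d' % i for i in range(18)], approx_m=1)
```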

### Computer Vision Transformations
We learn a series of image processing transformations (i.e. colorization, image completion, edge detection, watermark removal, and facial landmark segmentation) as modalities. We curate a dataset by applying off-the-shelf tools to CelebA. For simplicity, in this implementation, we only include the complete ELBO term (using all 6 modalities) and the 6 individual ELBO terms in the objective (in other words, `k = 0`). One can also subsample more ELBO terms to better approximate the true MVAE objective (as in `/celeba19/train.py`).

```
cd vision
CUDA_VISIBLE_DEVICES=0 python train.py --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# this will reconstruct all the modalities from the image
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type image --cuda
# we can also go in the other direction
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type watermark --cuda
```

![vision-reconstructions](./static/reconstructions2.png)

## Questions?
Please report any bugs and I will get to them ASAP. For any additional questions, feel free to email me@mikehwu.com.
152 changes: 152 additions & 0 deletions celeba/datasets.py
@@ -0,0 +1,152 @@
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import sys
import copy
import random
import numpy as np
import numpy.random as npr
from PIL import Image
from random import shuffle
from scipy.misc import imresize

import torch
from torch.utils.data.dataset import Dataset

VALID_PARTITIONS = {'train': 0, 'val': 1, 'test': 2}
# go from label index to interpretable index
ATTR_TO_IX_DICT = {'Sideburns': 30, 'Black_Hair': 8, 'Wavy_Hair': 33, 'Young': 39, 'Heavy_Makeup': 18,
                   'Blond_Hair': 9, 'Attractive': 2, '5_o_Clock_Shadow': 0, 'Wearing_Necktie': 38,
                   'Blurry': 10, 'Double_Chin': 14, 'Brown_Hair': 11, 'Mouth_Slightly_Open': 21,
                   'Goatee': 16, 'Bald': 4, 'Pointy_Nose': 27, 'Gray_Hair': 17, 'Pale_Skin': 26,
                   'Arched_Eyebrows': 1, 'Wearing_Hat': 35, 'Receding_Hairline': 28, 'Straight_Hair': 32,
                   'Big_Nose': 7, 'Rosy_Cheeks': 29, 'Oval_Face': 25, 'Bangs': 5, 'Male': 20, 'Mustache': 22,
                   'High_Cheekbones': 19, 'No_Beard': 24, 'Eyeglasses': 15, 'Bags_Under_Eyes': 3,
                   'Wearing_Necklace': 37, 'Wearing_Lipstick': 36, 'Big_Lips': 6, 'Narrow_Eyes': 23,
                   'Chubby': 13, 'Smiling': 31, 'Bushy_Eyebrows': 12, 'Wearing_Earrings': 34}
# we only keep 18 of the more visually distinctive features
# See [1] Perarnau, Guim, et al. "Invertible conditional gans for
# image editing." arXiv preprint arXiv:1611.06355 (2016).
ATTR_IX_TO_KEEP = [4, 5, 8, 9, 11, 12, 15, 17, 18, 20, 21, 22, 26, 28, 31, 32, 33, 35]
IX_TO_ATTR_DICT = {v:k for k, v in ATTR_TO_IX_DICT.iteritems()}
N_ATTRS = len(ATTR_IX_TO_KEEP)
ATTR_TO_PLOT = ['Heavy_Makeup', 'Male', 'Mouth_Slightly_Open', 'Smiling', 'Wavy_Hair']


class CelebAttributes(Dataset):
"""Define dataset of images of celebrities and attributes.
The user needs to have pre-defined the Anno and Eval folder from
http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
@param partition: string
train|val|test [default: train]
See VALID_PARTITIONS global variable.
@param data_dir: string
path to root of dataset images [default: ./data]
@param image_transform: ?torchvision.Transforms
optional function to apply to training inputs
@param attr_transform: ?torchvision.Transforms
optional function to apply to training outputs
"""
def __init__(self, partition='train', data_dir='./data',
image_transform=None, attr_transform=None):
self.partition = partition
self.image_transform = image_transform
self.attr_transform = attr_transform
self.data_dir = data_dir
assert partition in VALID_PARTITIONS.keys()
self.image_paths = load_eval_partition(partition, data_dir=data_dir)
self.attr_data = load_attributes(self.image_paths, partition,
data_dir=data_dir)
self.size = int(len(self.image_paths))

def __getitem__(self, index):
"""
Args:
index (int): Index
Returns:
tuple: (image, target) where target is index of the target class.
"""
image_path = os.path.join(self.data_dir, 'img_align_celeba',
self.image_paths[index])
attr = self.attr_data[index]
image = Image.open(image_path).convert('RGB')

if self.image_transform is not None:
image = self.image_transform(image)

if self.attr_transform is not None:
attr = self.attr_transform(attr)

return image, attr

def __len__(self):
return self.size


def load_eval_partition(partition, data_dir='./data'):
"""After downloading the dataset, we can load a subset for
training or testing.
@param partition: string
which subset to use (train|val|test)
@param data_dir: string [default: ./data]
where the images are saved
"""
eval_data = []
with open(os.path.join(data_dir, 'Eval/list_eval_partition.txt')) as fp:
rows = fp.readlines()
for row in rows:
path, label = row.strip().split(' ')
label = int(label)
if label == VALID_PARTITIONS[partition]:
eval_data.append(path)
return eval_data


def load_attributes(paths, partition, data_dir='./data'):
"""Load the attributes into a torch tensor.
@param paths: string
a numpy array of attributes (1 or 0)
@param partition: string
which subset to use (train|val|test)
@param data_dir: string [default: ./data]
where the images are saved
"""
if os.path.isfile(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition)):
attr_data = np.load(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition))
else:
attr_data = []
with open(os.path.join(data_dir, 'Anno/list_attr_celeba.txt')) as fp:
rows = fp.readlines()
for ix, row in enumerate(rows[2:]):
row = row.strip().split()
path, attrs = row[0], row[1:]
if path in paths:
attrs = np.array(attrs).astype(int)
attrs[attrs < 0] = 0
attr_data.append(attrs)
attr_data = np.vstack(attr_data).astype(np.int64)
attr_data = torch.from_numpy(attr_data).float()
return attr_data[:, ATTR_IX_TO_KEEP]


def tensor_to_attributes(tensor):
"""Use this for the <image_transform>.
@param tensor: PyTorch Tensor
D dimensional tensor
@return attributes: list of strings
"""
attrs = []
n = tensor.size(0)
tensor = torch.round(tensor)
for i in xrange(n):
if tensor[i] > 0.5:
attr = IX_TO_ATTR_DICT[ATTR_IX_TO_KEEP[i]]
attrs.append(attr)
return attrs
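
For completeness, here is an illustrative usage sketch for this dataset class (run inside the `celeba` directory under the Python 2.7 environment created above; the transform choices are placeholders, not the repository's training configuration):

```
from __future__ import print_function
import torch
from torchvision import transforms
from datasets import CelebAttributes, tensor_to_attributes

preprocess = transforms.Compose([
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor(),
])
train_set = CelebAttributes(partition='train', data_dir='./data',
                            image_transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
images, attrs = next(iter(loader))
print(images.size(), attrs.size())       # e.g. (64, 3, 64, 64) and (64, 18)
print(tensor_to_attributes(attrs[0]))    # names of the attributes active for one example
```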
