# Data Creator

This notebook reads, modifies, and concatenates raw datasets, then saves each result as a pickle file of image paths and their corresponding targets for use in model training.

### Preliminary Data Analysis

Plot some samples from each dataset, as well as the action histograms. Do the samples look reasonable? Is the histogram too lopsided? 

In [None]:
import os
import sys
mod_path = os.path.abspath(os.path.join('..'))
sys.path.append(mod_path)

from src.dataset import Dataset
from src.config import DATA_DIR

In [None]:
# 12-3-2017

#beach_messy
beach_messy_path = os.path.join(DATA_DIR, '0312beaches_messy')
beach_messy = Dataset.from_path(beach_messy_path)

#forest_messy
forest_messy_path = os.path.join(DATA_DIR, '0312forest_messy')
forest_messy = Dataset.from_path(forest_messy_path)

#clean_mix
clean_mix_path = os.path.join(DATA_DIR, '0312cleaner')
clean_mix = Dataset.from_path(clean_mix_path)

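#test_dataset
# NOTE: this reads the same directory as clean_mix (limited via max_n),
# so the test samples may overlap with the clean training data.
test_path = os.path.join(DATA_DIR, '0312cleaner')
test = Dataset.from_path(test_path, max_n=100)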

# Plot sample images and histograms
beach_messy.analyze()
forest_messy.analyze()
clean_mix.analyze()
test.analyze()
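
To put a number on "too lopsided", one option is to compare label counts directly. The following is a minimal sketch, not part of `src.dataset`: the helper `label_balance` and the example labels (including the label `'W'`) are hypothetical, and it assumes the labels are available as a plain Python list.

```python
from collections import Counter


def label_balance(labels):
    """Return per-label fractions and the ratio of the largest to smallest class."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    total = sum(counts.values())
    fractions = {label: n / total for label, n in counts.items()}
    # A ratio near 1 means the classes are roughly balanced.
    ratio = max(counts.values()) / min(counts.values())
    return fractions, ratio


# Hypothetical label list standing in for a dataset's targets.
example_labels = ['W'] * 80 + ['A'] * 10 + ['D'] * 10
fractions, ratio = label_balance(example_labels)
print(fractions)
print(ratio)
```

A large ratio suggests that the rarer labels may need to be oversampled before training.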

### Mix and Match

Here we combine our raw datasets into custom datasets we will use in training.

In [None]:
# 12-3-2017

# Dataset 1: beach messy + forest messy
d1 = Dataset.from_datasets([beach_messy,
                            forest_messy])

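# Dataset 2: messy, with labels 'A' and 'D' oversampled for better balance
# (each 'A'/'D' sample appears three times: two extra copies plus the full set)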
d2 = Dataset.from_datasets([beach_messy.only_label('A'),
                            beach_messy.only_label('D'),
                            forest_messy.only_label('A'),
                            forest_messy.only_label('D'),
                            beach_messy.only_label('A'),
                            beach_messy.only_label('D'),
                            forest_messy.only_label('A'),
                            forest_messy.only_label('D'),
                            beach_messy,
                            forest_messy])

# Dataset 3: clean only
d3 = Dataset.from_datasets([clean_mix])

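# Dataset 4: clean, with labels 'A' and 'D' oversampled for better balance
# (each 'A'/'D' sample appears three times: two extra copies plus the full set)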
d4 = Dataset.from_datasets([clean_mix.only_label('A'),
                            clean_mix.only_label('D'),
                            clean_mix.only_label('A'),
                            clean_mix.only_label('D'),
                            clean_mix])

# Plot sample images and histograms
d1.analyze()
d2.analyze()
d3.analyze()
d4.analyze()
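
The balancing used for Datasets 2 and 4 is oversampling by repetition: the rare-label subsets are concatenated in extra times alongside the full dataset. The sketch below illustrates the effect on label counts with plain lists. It does not use the `Dataset` class, and the label `'W'` and the counts are hypothetical.

```python
from collections import Counter

# Hypothetical imbalanced label list standing in for one raw dataset.
base = ['W'] * 80 + ['A'] * 10 + ['D'] * 10

# Stand-ins for only_label('A') and only_label('D').
only_a = [label for label in base if label == 'A']
only_d = [label for label in base if label == 'D']

# Same pattern as Dataset 4: two extra copies of each rare label, plus the full set.
combined = only_a + only_d + only_a + only_d + base

before = Counter(base)
after = Counter(combined)
print(before)
print(after)
```

Repetition leaves the majority class untouched and triples the rare classes, which narrows the gap without discarding any data. The duplicated samples are exact copies, so they add no new information.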

### Dump to pickle

For our Keras hyperopt model, let's dump the image paths and labels to pickle files.

In [None]:
import pickle

# Output paths for the pickled datasets
d1_path = os.path.join(DATA_DIR, '1203_d1.pickle')
d2_path = os.path.join(DATA_DIR, '1203_d2.pickle')
d3_path = os.path.join(DATA_DIR, '1203_d3.pickle')
d4_path = os.path.join(DATA_DIR, '1203_d4.pickle')
test_path = os.path.join(DATA_DIR, 'test.pickle')

d1.save_to_pickle(d1_path)
d2.save_to_pickle(d2_path)
d3.save_to_pickle(d3_path)
d4.save_to_pickle(d4_path)
test.save_to_pickle(test_path)
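
After saving, it can be worth reading a file back to confirm it round-trips. The sketch below uses only the standard `pickle` module. The on-disk layout written by `save_to_pickle` is not shown in this notebook, so the dictionary with `paths` and `labels` keys is an assumption made purely for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical payload; the real structure written by save_to_pickle may differ.
records = {
    'paths': ['img_0001.jpg', 'img_0002.jpg'],
    'labels': ['A', 'D'],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, 'example.pickle')

    # Write the payload in binary mode.
    with open(pickle_path, 'wb') as f:
        pickle.dump(records, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Read it back and compare with the original.
    with open(pickle_path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded == records)
print(len(loaded['paths']), len(loaded['labels']))
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.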