# Image Preprocessing Workflow
This notebook demonstrates the preprocessing workflow used to prepare images from the [New York GIS Clearinghouse orthoimagery dataset](https://gis.ny.gov/gateway/mg/index.html) for semantic segmentation analysis.

Begin by importing the libraries we need.

In [1]:
import utils

## Extract Images
We started our labelling process with the High Resolution 2018 Imagery for Queens available from the [Download Page](https://gis.ny.gov/gateway/mg/2018/new_york_city/).

The images have a resolution of 0.5 ft per pixel and contain four bands (RGB plus infrared). They are distributed as JPEG2000, which must be converted for use by [labelme](https://github.com/wkentaro/labelme) and the later parts of the workflow. By default we extract the images as 3-channel RGB, because a 4th channel would otherwise be interpreted as transparency.

In [2]:
zipfn = "c:\\nycdata\\boro_queens_sp18.zip"
png_dir = "c:\\nycdata\\boro_queens_sp18_png"
ir_dir = "c:\\nycdata\\boro_queens_sp18_png_ir"
utils.zip_to_png(zipfn, png_dir, ir_dir)

RGB Output path not empty. Skipping entire operation.
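
To illustrate why the bands are separated, here is a minimal sketch of splitting four-band pixels into an RGB image and an infrared image. It is not the implementation of `utils.zip_to_png`; the function name `split_bands` and the nested-list pixel layout are assumptions made for illustration, and the idea that the infrared band is written on its own is inferred from the `ir_dir` argument.

```python
def split_bands(pixels):
    """Split rows of (R, G, B, IR) pixels into an RGB image and an IR image.

    `pixels` is a list of rows; each row is a list of 4-tuples.
    This layout is hypothetical and chosen only to keep the sketch small.
    """
    # Keep the first three bands as an ordinary colour image.
    rgb = [[(r, g, b) for r, g, b, _ in row] for row in pixels]
    # Keep the fourth band as a separate single-channel image,
    # so that no viewer mistakes it for an alpha channel.
    ir = [[ir_val for _, _, _, ir_val in row] for row in pixels]
    return rgb, ir


four_band = [[(10, 20, 30, 40), (50, 60, 70, 80)]]
rgb, ir = split_bands(four_band)
print(rgb)
print(ir)
```

Storing the infrared band as its own single-channel image keeps it available for later use without affecting how the RGB image is displayed.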


## Label Images
At this point, we labelled the images. We used [labelme](https://github.com/wkentaro/labelme) to draw polygons on each image. Each polygon was given one of three labels:
- `pv` for obvious PV installations.
- `notpv` for empty rings within a larger polygon.
- `maybe` for uncertain spots that required further review.


<img src="../example_label.png" alt="Rooftop" width=400>

The labels were saved as JSON files co-located with the images and sharing their filenames. Three helpers prepare these files for the rest of the workflow:
- `utils.clear_imagedata` strips the embedded image from a JSON file. Labelme saves a copy of the image inside the JSON file by default, and it is easy to forget to turn that off.
- `utils.generate_blank_json_dir` generates blank JSON files for images that do not already have one. Labelme does not create JSON for images with no labelled sections, but we may still care about these images.
- `utils.json_to_binary` converts a JSON file into a binary image mask. It takes a dictionary that maps each label name to a pixel value for the mask; here `pv` maps to 255 and every other label to 0.
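
The first two clean-up steps can be sketched on plain dictionaries as follows. This is an illustrative sketch rather than the code in `utils`: the function names `strip_image_data` and `blank_label` are hypothetical, and the set of keys shown is a minimal subset of what labelme writes (a real file may carry additional fields).

```python
import json


def strip_image_data(label):
    """Return a copy of a labelme-style dict without the embedded image."""
    cleaned = dict(label)
    # labelme stores the base64-encoded image under "imageData".
    cleaned["imageData"] = None
    return cleaned


def blank_label(image_filename, height, width):
    """Build a label dict with no shapes for an unlabelled image."""
    return {
        "flags": {},
        "shapes": [],  # no polygons were drawn on this image
        "imagePath": image_filename,
        "imageData": None,
        "imageHeight": height,
        "imageWidth": width,
    }


labelled = {"shapes": [], "imagePath": "a.png", "imageData": "aGVsbG8="}
cleaned = strip_image_data(labelled)
blank = blank_label("b.png", 5000, 5000)
print(json.dumps(cleaned))
print(json.dumps(blank))
```

Working on a copy leaves the original dictionary untouched, which makes it easy to compare the file before and after stripping.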

In [3]:
# While labelling is still ongoing, override the png directory
# with a manually copied subset.
png_dir = "c:\\nycdata\\sample_subset\\raw"
mask_dir = "c:\\nycdata\\sample_subset\\masks"
# Strip the embedded image data from every JSON file
for f in utils.fileio.files_of_type(png_dir, '*.json'):
    utils.clear_imagedata(f)

# Generate the blank json files
utils.generate_blank_json_dir(png_dir)

# Map each label name to its mask pixel value: only `pv` is foreground
label_vals = {"_background_": 0, "maybe": 0, "notpv": 0, "pv": 255}
for f in utils.fileio.files_of_type(png_dir, '*.json'):
    utils.json_to_binary(f, mask_dir, label_vals)

Mask file exists: skipping.
[same message repeated for the remaining files; output truncated]
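
To show how a label-to-value dictionary turns polygons into a mask, here is a small pure-Python sketch. It is not the implementation of `utils.json_to_binary`: the helpers `point_in_polygon` and `shapes_to_mask` are hypothetical, and the sketch assumes shapes are drawn in list order, so a `notpv` ring listed after its enclosing `pv` polygon cuts a hole in it.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def shapes_to_mask(shapes, height, width, label_vals):
    """Rasterize labelme-style shapes; later shapes overwrite earlier ones."""
    mask = [[label_vals["_background_"]] * width for _ in range(height)]
    for shape in shapes:
        value = label_vals[shape["label"]]
        for row in range(height):
            for col in range(width):
                # Sample each pixel at its centre.
                if point_in_polygon(col + 0.5, row + 0.5, shape["points"]):
                    mask[row][col] = value
    return mask


label_vals = {"_background_": 0, "maybe": 0, "notpv": 0, "pv": 255}
shapes = [
    {"label": "pv", "points": [(1, 1), (7, 1), (7, 7), (1, 7)]},
    {"label": "notpv", "points": [(3, 3), (5, 3), (5, 5), (3, 5)]},
]
mask = shapes_to_mask(shapes, 8, 8, label_vals)
for row in mask:
    print(row)
```

Testing every pixel against every polygon is slow for 5000 x 5000 images; the sketch favours clarity over speed, and a real conversion would use an image library's polygon drawing instead.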

## Splitting Images into Tiles
The images in the dataset are 5000 x 5000 pixels, which is much too large for the neural network to process directly. We therefore split both the images and the masks into 625 x 625 tiles, writing the image tiles and the mask tiles to their own output directories.
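
The tiling arithmetic can be sketched as below. The helper `tile_boxes` is hypothetical and is not part of `utils`; it only computes the pixel boxes that a slicing routine would crop, assuming tiles are laid out row by row from the top-left corner.

```python
def tile_boxes(width, height, tile_w, tile_h):
    """Return (left, upper, right, lower) boxes covering the image."""
    boxes = []
    for upper in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Clamp the far edges in case the tile size does not divide evenly.
            right = min(left + tile_w, width)
            lower = min(upper + tile_h, height)
            boxes.append((left, upper, right, lower))
    return boxes


boxes = tile_boxes(5000, 5000, 625, 625)
print(len(boxes))  # 64
print(boxes[0], boxes[-1])
```

Because 625 divides 5000 evenly, each image yields an 8 x 8 grid of 64 full-size tiles with no partial tiles at the edges.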

In [4]:
png_tile_dir = "c:\\nycdata\\sample_subset\\final\\data"
mask_tile_dir = "c:\\nycdata\\sample_subset\\final\\masks"

tile_size = 625

# Only slice when the output directories are empty, so existing tiles are not regenerated
if utils.fileio.is_dir_empty(png_tile_dir):
    for f in utils.fileio.files_of_type(png_dir, '*.png'):
        utils.slice_image(f, tile_size, tile_size, png_tile_dir)
if utils.fileio.is_dir_empty(mask_tile_dir):
    for f in utils.fileio.files_of_type(mask_dir, '*.png'):
        utils.slice_image(f, tile_size, tile_size, mask_tile_dir)
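
The emptiness guard used above can be approximated with the standard library. The function `dir_is_empty` below is a hypothetical stand-in, not the code behind `utils.fileio.is_dir_empty`; in particular, treating a missing directory as empty is an assumption of this sketch.

```python
import os
import tempfile


def dir_is_empty(path):
    """True if `path` is missing or contains no entries (assumed behaviour)."""
    if not os.path.isdir(path):
        return True
    with os.scandir(path) as entries:
        # Stop at the first entry instead of listing the whole directory.
        return next(entries, None) is None


with tempfile.TemporaryDirectory() as tmp:
    empty_before = dir_is_empty(tmp)
    # Create one placeholder tile file.
    open(os.path.join(tmp, "tile_0.png"), "wb").close()
    empty_after = dir_is_empty(tmp)

print(empty_before, empty_after)
```

Checking only for the first entry keeps the guard cheap even when a tile directory already holds thousands of files.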

## Delete Blank Tiles
Because we are slicing up images, we may end up with a large number of blank tiles, so we remove them from the tile directories. The `maxfrac` argument sets the fraction of the final dataset that blank tiles may make up; it is set to 0 here.

In [5]:
utils.delete_blank_tiles(png_tile_dir, mask_tile_dir, maxfrac=0, seed=0)
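
One way to read `maxfrac` and `seed` is sketched below. This is an interpretation, not the implementation of `utils.delete_blank_tiles`: it assumes `maxfrac` caps the share of blank tiles in the final dataset and that `seed` makes the random choice of retained blanks repeatable. The function name `blanks_to_delete` is hypothetical.

```python
import math
import random


def blanks_to_delete(blank_names, n_nonblank, maxfrac=0.0, seed=0):
    """Pick which blank tiles to delete so blanks are at most `maxfrac` of the result."""
    if not 0 <= maxfrac < 1:
        raise ValueError("maxfrac must be in [0, 1)")
    # Keeping k blanks gives a blank share of k / (n_nonblank + k).
    # Solving k / (n_nonblank + k) <= maxfrac for k gives the bound below.
    max_keep = math.floor(maxfrac * n_nonblank / (1 - maxfrac))
    n_keep = min(max_keep, len(blank_names))
    # A seeded generator makes the selection reproducible between runs.
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(blank_names), n_keep))
    return sorted(name for name in blank_names if name not in keep)


blanks = [f"tile_{i}.png" for i in range(10)]
delete_all = blanks_to_delete(blanks, n_nonblank=3, maxfrac=0.0, seed=0)
delete_some = blanks_to_delete(blanks, n_nonblank=3, maxfrac=0.5, seed=0)
print(len(delete_all), len(delete_some))
```

Under this reading, `maxfrac=0` deletes every blank tile, while `maxfrac=0.5` with three labelled tiles keeps three blanks so that blanks form half of the dataset.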

In [6]:
# Print info about the remaining dataset
import glob
import os

print(f"Total Number of Labels in Dataset: {len(glob.glob(os.path.join(mask_tile_dir, '*.png')))}")

Total Number of Labels in Dataset: 345
