# Preprocessing Data

The dataset provides large, cloud-optimized GeoTIFF files of several gigabytes each. To compute features on individual roofs, each roof is cut out of the regional image using the polygon coordinates in `train-<region>.geojson` and `test-<region>.geojson`.
Every roof image is stored as a separate `.tif` file.

Data is handled __by region__, i.e. every region gets its own folder of training and test images.
Training images are sorted into subfolders according to their material label.

The preprocessing step adds folders `roofs_train` and `roofs_test` to the existing file tree, resulting in a structure as follows:

```
data
└───region1
│   │   train-region1.geojson
│   │   region1_ortho-cog.tif
│   │   ...
│   └───roofs_train
│   │   └───healthy_metal
│   │       │   roof_id_a.tif
│   │       │   roof_id_b.tif
│   │       │   ...
│   │ 
│   │   └───irregular_metal
│   │       │   ...
│   │   └───concrete_cement
│   │       │   ...
│   │   └───incomplete
│   │       │   ...
│   │   └───other
│   │       │   ...
│   │
│   └───roofs_test
│       │   roof_id_m.tif
│       │   roof_id_n.tif
│       │   ...
│
└───region2
│   │   ...
```
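
As a rough illustration of this layout, the helper below builds the target path for a single roof image. The function name `roof_image_path` and its parameters are hypothetical and not part of the project's `preprocessing` module; only the folder names `roofs_train` and `roofs_test` come from the tree above.

```python
from pathlib import Path
from typing import Optional


def roof_image_path(data_dir: str, region: str, roof_id: str,
                    material: Optional[str] = None) -> Path:
    """Return the file path of one cut-out roof (hypothetical helper).

    Training roofs carry a material label and go into
    roofs_train/<material>/; test roofs have no label and go
    directly into roofs_test/.
    """
    region_dir = Path(data_dir) / region
    if material is None:
        return region_dir / "roofs_test" / f"{roof_id}.tif"
    return region_dir / "roofs_train" / material / f"{roof_id}.tif"


train_path = roof_image_path("data", "region1", "roof_id_a", "healthy_metal")
test_path = roof_image_path("data", "region1", "roof_id_m")
print(train_path.as_posix())
print(test_path.as_posix())
```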

In [None]:
import rasterio
from rasterio.plot import show
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import json
from os.path import join

In [None]:
region = 'dennery'
region_fp = join('..', '..', 'data', region)

geojson_fp = join(region_fp, 'train-'+region+'.geojson')
image_fp = join(region_fp, region+'_ortho-cog.tif')
roof_dir = join(region_fp, 'roofs_train')

## 1. Get an overview of the scenery

Show a thumbnail of the image.

In [None]:
with rasterio.open(image_fp) as src:
    profile = src.profile  # reused below for the CRS conversion
    print(profile['crs'])
    print(profile)
    print(src.bounds)

In [None]:
with rasterio.open(image_fp) as src:
    oviews = src.overviews(1)
    oview = oviews[-1]
    print('Decimation factor = {}'.format(oview))
    # Read the first three bands at the resolution of the coarsest overview
    out_shape = (1, src.height // oview, src.width // oview)
    bands = [src.read(k, out_shape=out_shape) for k in (1, 2, 3)]

# Stack the bands into a (height, width, 3) array for plotting
img = np.dstack(bands)

fig = plt.figure(figsize=(10, 5))
plt.imshow(img)
plt.title('Overview')
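
`src.overviews(1)` lists the decimation factors of the overviews stored for band 1, and the last entry is the coarsest one. The sketch below reproduces only the shape arithmetic and the band stacking on synthetic NumPy arrays; the image size and the factors are made-up values, not taken from the dataset.

```python
import numpy as np

# Made-up example values; the real ones come from the opened raster
height, width = 4000, 6000
overview_factors = [2, 4, 8, 16]

factor = overview_factors[-1]  # coarsest overview
thumb_shape = (height // factor, width // factor)

# Three synthetic single-band arrays stand in for src.read(k, out_shape=...)
bands = [np.full(thumb_shape, fill, dtype=np.uint8) for fill in (10, 20, 30)]

# Stack the bands along a new last axis, the layout plt.imshow expects
img = np.dstack(bands)
print(thumb_shape, img.shape)
```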

## 2. Cut out training roofs and store as separate files

In [None]:
import rasterio
from rasterio.mask import mask
from os import makedirs
from os.path import exists

Create subfolders for each material.

In [None]:
materials = ['healthy_metal', 'irregular_metal', 'concrete_cement', 'incomplete', 'other']
for mat in materials:
    # exist_ok=True makes this cell safe to re-run
    makedirs(join(roof_dir, mat), exist_ok=True)

Coordinates in the GeoJSON label file need to be converted to the CRS (coordinate reference system) of the image.

In [None]:
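from pyproj import Proj, transform
# Note: `Proj(init=...)` and `transform` are deprecated in pyproj 2.x;
# `pyproj.Transformer` is the recommended replacement in current versions.
outProj = Proj(init=profile['crs']) # CRS of the image
inProj = Proj(init='epsg:4326') # WGS 84 lon/lat, as used in the GeoJSON file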
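
The per-vertex conversion can also be written without modifying the GeoJSON coordinates in place. The following is a minimal sketch, assuming the conversion is available as a callable that maps one `(x, y)` pair to another; `reproject_ring`, `to_image_crs` and the stand-in `fake_transform` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def reproject_ring(ring: Sequence[Sequence[float]],
                   to_image_crs: Callable[[float, float], Point]) -> List[List[float]]:
    """Return a new ring with every (x, y) vertex passed through to_image_crs.

    Any extra values per vertex (e.g. an elevation) are dropped.
    """
    return [list(to_image_crs(x, y)) for x, y, *_ in ring]


# Stand-in transform for demonstration only: doubles both coordinates
def fake_transform(x: float, y: float) -> Point:
    return x * 2.0, y * 2.0


ring = [[1.0, 2.0], [3.0, 2.0], [3.0, 4.0], [1.0, 2.0]]
converted = reproject_ring(ring, fake_transform)
print(converted)
```

With pyproj installed, `Transformer.from_crs('EPSG:4326', image_crs, always_xy=True).transform` could be passed as `to_image_crs`, where `image_crs` stands for the CRS of the image; `always_xy=True` keeps the longitude/latitude order used by GeoJSON.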

#### Cutting out roofs
1. Extract the roof id and polygon coordinates from the GeoJSON label file
2. Cut the polygon out of the image
3. Save the cut-out roof image to a file named after its id
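
Step 1 can be tried out in isolation on a tiny, made-up FeatureCollection. The sketch below assumes only the fields this notebook reads (`id`, `properties.roof_material`, `geometry`); the helper name `iter_polygon_roofs` and the sample data are hypothetical.

```python
import json

# Tiny made-up FeatureCollection with the fields used in this notebook
geojson_text = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "id": "roof_a",
   "properties": {"roof_material": "healthy_metal"},
   "geometry": {"type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
  {"type": "Feature", "id": "roof_b",
   "properties": {"roof_material": "other"},
   "geometry": {"type": "MultiPolygon", "coordinates": []}}
]}
"""


def iter_polygon_roofs(feature_collection):
    """Yield (id, material, geometry) for every plain Polygon feature."""
    for feature in feature_collection["features"]:
        geometry = feature["geometry"]
        if geometry["type"] != "Polygon":
            continue  # MultiPolygons are skipped, as in this notebook
        yield feature["id"], feature["properties"]["roof_material"], geometry


polygon_roofs = list(iter_polygon_roofs(json.loads(geojson_text)))
print([(roof_id, material) for roof_id, material, _ in polygon_roofs])
```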

In [None]:
with open(geojson_fp) as geojson:
    roofs = json.load(geojson)['features']

# Open the large image once instead of once per roof
with rasterio.open(image_fp) as image:
    for roof in roofs:
        roof_id = roof['id']
        roof_geom = roof['geometry']
        roof_material = roof['properties']['roof_material']

        # There are about 10 MultiPolygons in the whole dataset.
        # I chose to ignore them instead of writing a special function to cut them out.
        if roof_geom['type'] == 'MultiPolygon':
            print('Skipping MultiPolygon roof {}'.format(roof_id))
            continue

        # Convert every ring (exterior and holes) in place from lon/lat to the image CRS
        for ring in roof_geom['coordinates']:
            for c in ring:
                c[0], c[1] = transform(inProj, outProj, c[0], c[1])

        # Cut out the roof from the original image
        roof_image, roof_transform = mask(image, [roof_geom], filled=True, crop=True)
        #show(roof_image)

        # Copy metadata from original image but update important parameters
        roof_meta = image.meta.copy()
        roof_meta.update({"driver": "GTiff",
            "dtype": rasterio.uint8,
            "height": roof_image.shape[1],
            "width": roof_image.shape[2],
            "transform": roof_transform,
            "tiled": True,
            "compress": 'lzw'})

        # Save to file, named after the roof id, in the folder of its material
        roof_image_fp = join(roof_dir, roof_material, str(roof_id)+".tif")
        with rasterio.open(roof_image_fp, "w", **roof_meta) as dest:
            dest.write(roof_image)
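
The MultiPolygon roofs skipped above could in principle be kept. One possible approach, which is not the one used in this notebook, is to split each MultiPolygon into its member Polygons according to the GeoJSON structure. The helper name `split_multipolygon` and the sample geometry are hypothetical.

```python
def split_multipolygon(geometry):
    """Return a list of Polygon geometries for a Polygon or MultiPolygon."""
    if geometry["type"] == "Polygon":
        return [geometry]
    if geometry["type"] == "MultiPolygon":
        # Each entry of a MultiPolygon's coordinates is one polygon's ring list
        return [{"type": "Polygon", "coordinates": rings}
                for rings in geometry["coordinates"]]
    raise ValueError("Unsupported geometry type: {}".format(geometry["type"]))


# Made-up MultiPolygon consisting of two triangles
multi = {"type": "MultiPolygon", "coordinates": [
    [[[0, 0], [1, 0], [1, 1], [0, 0]]],
    [[[5, 5], [6, 5], [6, 6], [5, 5]]],
]}
parts = split_multipolygon(multi)
print(len(parts), parts[0]["type"])
```

Since `mask` accepts a list of shapes, the resulting list could be passed to it directly after coordinate conversion, so that all parts of a roof end up in one cut-out image.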

## 3. Preprocess all regions using module `preprocessing`

The code above only preprocesses the __training__ images. The `preprocessing` module provides a slightly different function for the __test__ images that ignores the material label and puts all test images into one folder.

In [None]:
import preprocessing

In [None]:
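from pathlib import Path
# `regions` was not defined above. Assumption: every subfolder of the data directory is one region.
regions = sorted(p.name for p in Path('..', '..', 'data').iterdir() if p.is_dir())
for region in regions:
    preprocessing.preprocess_region(region)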