# Download and Prepare Training Data

Typically, the most difficult and time-consuming task in any supervised deep learning project is creating quality training data. Thankfully, in this case, a dataset is already available. Still, we have to download and organize this training data before we can get started. This notebook takes care of these tasks, and _only_ needs to be run once.

__NOTE__: Each of these cells can be safely rerun, since each one checks whether its results already exist before repeating a long-running process.

In [1]:
import os
from pathlib import Path
import shutil
import tarfile

import arcpy
import requests

## Set Up Project Data Directories

The project keeps data organized in subdirectories within a data directory. If you just cloned this repo, these directories do not exist yet, so we first ensure this directory structure is set up.

In [2]:
# paths to common data locations - NOTE: to convert any path to a plain string, simply use str(path_instance)
dir_prj = Path.cwd().parent

dir_data = dir_prj/'data'

dir_raw = dir_data/'raw'
dir_ext = dir_data/'external'
dir_int = dir_data/'interim'
dir_out = dir_data/'processed'

dir_models = dir_prj/'models'

gdb_int = dir_int/'interim.gdb'
gdb_out = dir_out/'processed.gdb'

# make sure data directories exist
for dir_dta in (dir_raw, dir_ext, dir_int, dir_out, dir_models):
    dir_dta.mkdir(parents=True, exist_ok=True)

# make sure the geodatabases exist
for gdb in [gdb_int, gdb_out]:
    if not arcpy.Exists(str(gdb)):
        arcpy.management.CreateFileGDB(str(gdb.parent), str(gdb.name))
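
The `Path` objects above are joined with the `/` operator and converted with `str()` wherever a plain string is required, as in the `arcpy` calls. Below is a minimal sketch of the same pattern using a throwaway temporary directory. The `demo_root` and `demo_raw` names are purely illustrative and not part of the project.

```python
import tempfile
from pathlib import Path

# demo_root is a hypothetical stand-in for the project directory
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)

    # the / operator joins path segments, regardless of operating system
    demo_raw = demo_root/'data'/'raw'

    # exist_ok=True makes the call safe to repeat, so a second run does not raise
    demo_raw.mkdir(parents=True, exist_ok=True)
    demo_raw.mkdir(parents=True, exist_ok=True)

    created = demo_raw.is_dir()

    # str() yields a plain string for APIs that do not accept Path objects
    raw_as_str = str(demo_raw)

print(created, raw_as_str.endswith('raw'))
```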

## Download Data

The training data is freely available, so the first step is to download the archive.

In [3]:
%%time

archive_url = 'https://mycityreport.s3-ap-northeast-1.amazonaws.com/02_RoadDamageDataset/RoadDamageDataset.tar.gz'
archive_pth = dir_raw/archive_url.split('/')[-1]

if not archive_pth.exists():
    
    print(f'Downloading {archive_pth} from {archive_url}.\n\nThis is going to take a while. Go for a walk. Watch a sitcom. Do something else for a while.')
    
    with requests.get(archive_url, stream=True) as req:
        
        req.raise_for_status()
        
        with open(archive_pth, 'wb') as out_file:
            
            for chunk in req.iter_content(chunk_size=8192):
                
                out_file.write(chunk)
    
    print('\nDownload finally complete. At least you only have to do that once, right?\n')
                
else:
    
    print(f'{archive_pth} has already been downloaded. Lucky you.\n')

Downloading D:\projects\road-surface-detection\data\raw\RoadDamageDataset.tar.gz from https://mycityreport.s3-ap-northeast-1.amazonaws.com/02_RoadDamageDataset/RoadDamageDataset.tar.gz.

This is going to take a while. Go for a walk. Watch a sitcom. Do something else for a while.

Download finally complete. At least you only have to do that once, right?

Wall time: 5min 25s
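
One caveat with the existence check above: if the download is interrupted, a partial archive is left at `archive_pth`, and the next run will treat it as complete. A possible safeguard, sketched here rather than taken from the notebook, is to write to a temporary `.part` file and rename it only after every chunk has been written. The helper name `write_chunks_atomically` and the `.part` suffix are hypothetical choices, not part of `requests` or this project.

```python
from pathlib import Path


def write_chunks_atomically(chunks, dest_pth):
    """Write an iterable of byte chunks to dest_pth, renaming into place only on success."""
    dest_pth = Path(dest_pth)

    # hypothetical naming convention: the same file name with .part appended
    part_pth = dest_pth.with_name(dest_pth.name + '.part')

    try:
        with open(part_pth, 'wb') as out_file:
            for chunk in chunks:
                out_file.write(chunk)
    except BaseException:
        # remove the incomplete file so it is never mistaken for a finished download
        if part_pth.exists():
            part_pth.unlink()
        raise

    # replace() moves the finished file into place, overwriting any existing file
    part_pth.replace(dest_pth)
    return dest_pth
```

In the download cell, this could be called as `write_chunks_atomically(req.iter_content(chunk_size=8192), archive_pth)` inside the `requests.get` context manager, leaving the `archive_pth.exists()` check unchanged.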


## Unpack the Archive

We now have the data, but it is still wrapped up in an archive, so next we unpack it.

In [4]:
%%time

dir_raw_training = dir_raw/'RoadDamageDataset'

if not dir_raw_training.exists():
    
    print(f'Starting to extract {archive_pth} to {dir_raw_training}.\n')
    
    with tarfile.open(archive_pth, 'r:gz') as gz_fl:
        gz_fl.extractall(dir_raw)
    
    print('Finally finished extracting.\n')

else:
    print('Training data already extracted - lucky you.\n')

Starting to extract D:\projects\road-surface-detection\data\raw\RoadDamageDataset.tar.gz to D:\projects\road-surface-detection\data\raw\RoadDamageDataset.

Finally finished extracting.

Wall time: 1min 37s
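
`extractall` trusts the member paths stored in the archive. For an archive from a source you do not control, one possible precaution is to confirm that every member resolves inside the destination directory before extracting. The following is a minimal sketch under that assumption. `extract_checked` is a hypothetical helper name, the check covers member paths only (it does not inspect link targets), and it costs an extra pass over the archive.

```python
import tarfile
from pathlib import Path


def extract_checked(archive_pth, dest_dir):
    """Extract a gzipped tar archive after checking member paths stay inside dest_dir."""
    dest_dir = Path(dest_dir).resolve()

    with tarfile.open(archive_pth, 'r:gz') as gz_fl:
        for member in gz_fl.getmembers():
            member_pth = (dest_dir/member.name).resolve()

            # reject members such as ../../file that would land outside dest_dir
            if dest_dir != member_pth and dest_dir not in member_pth.parents:
                raise ValueError(f'Unsafe path in archive: {member.name}')

        gz_fl.extractall(dest_dir)

    return dest_dir
```

With the variables from this notebook, the call would be `extract_checked(archive_pth, dir_raw)`.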


## Organize Data

Even with the data extracted, we still need to organize it so the data loader knows where to find the images and labels.

In [5]:
%%time

dir_training = dir_raw/'training_data'

assert dir_raw_training.exists()

dir_images = dir_training/'images'
dir_labels = dir_training/'labels'

dir_images.mkdir(parents=True, exist_ok=True)
dir_labels.mkdir(parents=True, exist_ok=True)

# only copy if the images or labels directory is still empty
if not (any(dir_images.iterdir()) and any(dir_labels.iterdir())):

    # each subdirectory of the extracted dataset has its own Annotations and JPEGImages directories
    for fl in os.listdir(dir_raw_training):
        if not fl.startswith('.'):
            for f in os.listdir(dir_raw_training/fl/'Annotations'):
                if not f.startswith('.'):
                    # the image shares the annotation's file name, but with a .jpg extension
                    img_name = f.split('.')[0] + '.jpg'

                    shutil.copyfile(dir_raw_training/fl/'JPEGImages'/img_name, dir_images/img_name)
                    shutil.copyfile(dir_raw_training/fl/'Annotations'/f, dir_labels/f)

Wall time: 39.6 s
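
After copying, it can be worth confirming that every label has a matching image, since the copy loop pairs them by file name. Below is a minimal sketch, assuming images and labels share the same file stem as in the loop above. `find_unpaired` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def find_unpaired(dir_images, dir_labels):
    """Return the file stems that appear in only one of the two directories."""
    # ignore hidden files, as the copy loop does
    img_stems = {pth.stem for pth in Path(dir_images).iterdir() if not pth.name.startswith('.')}
    lbl_stems = {pth.stem for pth in Path(dir_labels).iterdir() if not pth.name.startswith('.')}

    # symmetric difference: stems missing a counterpart on either side
    return img_stems ^ lbl_stems
```

An empty set from `find_unpaired(dir_images, dir_labels)` means every image has a label and vice versa.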
