# Data Setup

If you just cloned this project from a GitHub repo, start by running this notebook. The data directories are excluded from the repo by design, to keep the project a manageable size and to prevent issues with GitHub. You only need to run this notebook _once_ after first cloning or downloading the project. It sets up the directory structure for the project data and downloads all the training data (_all 9.8 GB of it_) needed to run this project.

In [14]:
import importlib.util
import math
import os
from pathlib import Path
import sys
from zipfile import ZipFile

from arcgis.features import GeoAccessor, GeoSeriesAccessor
from arcgis.gis import GIS
from dotenv import load_dotenv, find_dotenv
import pandas as pd
import requests
from tqdm import tqdm  # progress bar used when downloading the data

# import arcpy if available
if importlib.util.find_spec("arcpy") is not None:
    import arcpy

Next, set up the project paths and import the packages included with the project. Both are useful for accessing project resources.

In [5]:
# paths to common data locations - NOTE: to convert any path to a plain string, simply use str(path_instance)
dir_prj = Path.cwd().parent

# import the project package from the project package path
sys.path.insert(0, str(dir_prj/'src'))
import lndcvr_unet
from ck_tools import paths

# load the "autoreload" extension so that code can change, & always reload modules so that as you change code in src, it gets loaded
%load_ext autoreload
%autoreload 2

# load environment variables from .env
load_dotenv(find_dotenv())

True

## Set Up Data Directories

If you have just cloned this project, start by making sure all the directory structures are in place to store data. These directories are excluded from the GitHub repo, so if you just downloaded the project, this gets everything set up.

NOTE: You can re-run this without any adverse effects on your current project. It does not overwrite anything; it only creates what is not already there.

In [6]:
paths.create_resources()
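
The internals of `paths.create_resources()` live in the project's `ck_tools` package and are not shown here. As a minimal sketch of the same "create only if missing" behavior, assuming hypothetical subdirectory names (`raw`, `interim`, `processed`) that may not match the project's actual layout, the pattern could look like this:

```python
from pathlib import Path
import tempfile


def create_data_dirs(dir_data: Path, names=('raw', 'interim', 'processed')) -> list:
    """Create each data subdirectory if missing; existing ones are left untouched."""
    created = []
    for name in names:  # hypothetical directory names, for illustration only
        pth = dir_data / name
        # exist_ok=True makes re-running safe; parents=True builds missing parents
        pth.mkdir(parents=True, exist_ok=True)
        created.append(pth)
    return created


# demonstrate against a throwaway directory rather than the real project
with tempfile.TemporaryDirectory() as tmp:
    dirs = create_data_dirs(Path(tmp) / 'data')
    dirs_again = create_data_dirs(Path(tmp) / 'data')  # second call changes nothing
    all_exist = all(d.is_dir() for d in dirs)
```

Because `mkdir` is called with `exist_ok=True`, calling the function a second time neither raises an error nor touches any files already in those directories.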

## Download Data

Now, we need to get some data to start working with. The total dataset is just under 10GB, so if you have not already downloaded it, this may take a while depending on your internet connection. It is not a bad idea to kick it off toward the end of the day.

The first step is getting the size of the file, useful for tracking download progress.

In [13]:
zip_url = 'https://cicwebresources.blob.core.windows.net/chesapeakebaylandcover/BAYWIDE/Baywide_13Class_20132014.zip'
download_chunk_size = 1024 * 100  # you can increase this if you have plenty of RAM

# get the file name from the url and create a path where the file will be saved
zip_name = zip_url.split('/')[-1]
zip_pth = paths.dir_raw/zip_name

# interrogate the remote headers to see how big the file is
req_hd = requests.head(zip_url)
zip_size = int(req_hd.headers['Content-Length'])  # size in bytes
zip_size_str = f'{zip_size/1000**3:.4f} GB'

print(f'The remote file, {zip_name}, is {zip_size_str}.')
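
The `Content-Length` header reports the size in bytes, and the reported figure depends on whether you divide by powers of 1000 (GB) or powers of 1024 (GiB). A small sketch of a formatter that shows both, where `format_size` is a hypothetical helper and not part of this project:

```python
def format_size(n_bytes: int) -> str:
    """Format a byte count in decimal (GB) and binary (GiB) units."""
    gb = n_bytes / 1000**3   # decimal gigabytes
    gib = n_bytes / 1024**3  # binary gibibytes
    return f'{gb:.2f} GB ({gib:.2f} GiB)'


# 10 GiB expressed both ways
example = format_size(10 * 1024**3)
print(example)  # 10.74 GB (10.00 GiB)
```

The two figures differ by about seven percent at this scale, which is enough to explain why the same file can be described as both "9.8 GB" and "just under 10GB".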

Now, it is time to download the data. If the archive already exists locally, it is not downloaded again.

In [13]:
if zip_pth.exists():
    print(f'The data archive, {zip_name}, has already been downloaded.')

else:
    print(f'The data archive, {zip_name}, has not yet been downloaded, and will be downloaded to {paths.dir_raw}.')

    # create a streaming get request object to handle the download
    with requests.get(zip_url, stream=True) as req:

        req.raise_for_status()

        # open a file to save the data into
        with open(zip_pth, 'wb') as file_obj:

            # iteratively download and save the file in chunks, using tqdm to report progress
            for part in tqdm(req.iter_content(chunk_size=download_chunk_size), total=math.ceil(zip_size/download_chunk_size)):
                file_obj.write(part)
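
One caveat with checking `zip_pth.exists()`: if a download is interrupted, the partial file still exists, and the next run will treat it as complete. A possible safeguard, sketched here with a hypothetical `save_chunks` helper, is to write to a temporary `.part` file and rename it only after every chunk has been written:

```python
from pathlib import Path
import tempfile


def save_chunks(chunks, dest: Path) -> int:
    """Write an iterable of byte chunks to dest by way of a temporary '.part' file."""
    part_pth = dest.with_name(dest.name + '.part')
    n_written = 0
    with open(part_pth, 'wb') as file_obj:
        for part in chunks:
            n_written += file_obj.write(part)
    # rename only after every chunk is on disk, so dest never holds a partial file
    part_pth.replace(dest)
    return n_written


# demonstrate with in-memory chunks instead of a real download
with tempfile.TemporaryDirectory() as tmp:
    out_pth = Path(tmp) / 'archive.zip'
    n_bytes = save_chunks([b'abc', b'de'], out_pth)
    saved = out_pth.read_bytes()
    part_left = out_pth.with_name('archive.zip.part').exists()
```

In the download cell above, the `tqdm`-wrapped `req.iter_content(chunk_size=download_chunk_size)` iterator would take the place of the in-memory list.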

## Extract the Archive

The data comes as a zipped archive, so extract it. This also takes a bit of time.

In [13]:
# once downloaded, extract the data to the training data directory
with ZipFile(zip_pth, 'r') as zip_ref:
    zip_ref.extractall(paths.dir_raw)
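
Since the next step deletes the archive, it can be worth confirming the archive is intact first. A minimal sketch using the standard library's `ZipFile.testzip()`, which returns the name of the first member that fails its CRC check, or `None` if all members pass. The `first_bad_member` wrapper and the demo archive are illustrative only:

```python
from pathlib import Path
from zipfile import ZipFile
import tempfile


def first_bad_member(zip_pth: Path):
    """Return the name of the first corrupt member, or None if all CRCs check out."""
    with ZipFile(zip_pth, 'r') as zip_ref:
        return zip_ref.testzip()


# build a tiny archive to demonstrate the check
with tempfile.TemporaryDirectory() as tmp:
    demo_pth = Path(tmp) / 'demo.zip'
    with ZipFile(demo_pth, 'w') as zip_ref:
        zip_ref.writestr('hello.txt', 'hello')
    bad_member = first_bad_member(demo_pth)
    with ZipFile(demo_pth, 'r') as zip_ref:
        members = zip_ref.namelist()
```

Note that `testzip()` reads every member in the archive, so on a file of roughly 10GB it adds noticeable time of its own.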

Once the archive has been extracted, there is no real reason to keep the original, even if you have the disk space, so we can remove it.

In [13]:
# remove the original archive to save disk space
zip_pth.unlink()

## IMPORTANT - Set Raster Properties to Thematic

This is an important step. The Export Training Data for Deep Learning tool _requires_ the raster to be a _thematic_ raster. Technically it already is one, an integer raster with each integer corresponding to a land cover classification, but the tool does not necessarily know this. When the data is first downloaded, it is flagged simply as a _general_ raster. Hence, we need to change this to _thematic_ using the Set Raster Properties tool.

In [32]:
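# the extracted raster is expected to share the archive's name, with a .tif extension
tif_pth = paths.dir_raw/f'{zip_pth.stem}.tif'

assert tif_pth.exists(), f'Expected raster not found: {tif_pth}'

# flag the raster as thematic (requires arcpy)
arcpy.management.SetRasterProperties(str(tif_pth), data_type='THEMATIC')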
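
The imports at the top of this notebook load `arcpy` only if it is installed, but the cell above calls it unconditionally and will raise a `NameError` in an environment without it. One way to guard the call is sketched below. The helpers `has_module` and `set_thematic` are hypothetical names, not part of this project:

```python
import importlib.util
from pathlib import Path


def has_module(name: str) -> bool:
    """Report whether a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def set_thematic(tif_pth: Path) -> bool:
    """Mark the raster as thematic if arcpy is available; return whether it ran."""
    if not has_module('arcpy'):
        print('arcpy is not available; skipping Set Raster Properties.')
        return False
    if not tif_pth.exists():
        raise FileNotFoundError(f'Expected raster not found: {tif_pth}')
    import arcpy  # imported lazily so this function can be defined without arcpy
    arcpy.management.SetRasterProperties(str(tif_pth), data_type='THEMATIC')
    return True
```

With this in place, `set_thematic(tif_pth)` returns `False` and prints a message where `arcpy` is missing, instead of stopping the notebook.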