# Part 2: Data Preparation
 ----

Note: this demo is based on the NGC Docker image `nvcr.io/nvidia/pytorch:21.11-py3`.

This notebook walks you through each step needed to train a model using containers from the NGC Catalog. We chose the GPU-optimized PyTorch container as an example. The basics of working with Docker containers apply to all NGC containers.

We will show you how to:

* Download the xView dataset
* Convert labels to COCO format
* Conduct the tiling preprocessing step (i.e. slicing large satellite imagery into chunks)
* Upload the data to an S3 bucket to support distributed training

Let's get started!

---


### Prerequisites: set up the Jupyter notebook environment using an NGC container

# Execute docker run to create NGC environment for Data Prep
make sure to map host directory to docker directory, we will use the host directory again to 
* `docker run   --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -v /home/ubuntu:/home/ubuntu  -p 8008:8888 -it nvcr.io/nvidia/pytorch:21.11-py3  /bin/bash`

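# Run jupyter notebook command within docker container to access it on your local browser
* `cd /home/ubuntu`
* `git clone https://github.com/interactivetech/e2e_blogposts.git`
* `jupyter lab --ip=0.0.0.0 --port=8888 --NotebookApp.token='' --NotebookApp.password=''`

Clone the repository before starting Jupyter, because `jupyter lab` keeps the terminal busy. With the `-p 8008:8888` mapping from the `docker run` command, Jupyter is reachable on port 8008 of the host.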



### Download the xView Dataset
The dataset we will be using is from the DIUx xView 2018 Challenge (https://challenge.xviewdataset.org) by the U.S. National Geospatial-Intelligence Agency (NGA). You will need to create an account at https://challenge.xviewdataset.org/welcome, agree to the terms and conditions, and download the dataset manually.

You can download the dataset at https://challenge.xviewdataset.org/data-download



In [1]:
# run pip install to get the SAHI library
!pip install sahi scikit-image opencv-python-headless==4.5.5.64

Defaulting to user installation because normal site-packages is not writeable
Collecting sahi
  Downloading sahi-0.11.14-py3-none-any.whl (104 kB)
Collecting opencv-python-headless==4.5.5.64
  Downloading opencv_python_headless-4.5.5.64-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.8 MB)
Collecting fire
  Downloading fire-0.5.0.tar.gz (88 kB)
  Preparing metadata (setup.py) ... done
Collecting shapely>=1.8.0
  Downloading shapely-2.0.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB)

In [7]:
# Example command to download the train images with wget. You will need to replace the URL, because the token in this one has expired.
!wget -O train_images.tgz \
  "https://d307kc0mrhucc3.cloudfront.net/train_images.tgz?Expires=1689131216&Signature=PC5FQU8ls3vvBIxeOLcl-GgeYnidq~qolQymEfeENmxvpW~D6eo6O0rtCm4D4O4EBkzMMIUJSeofrHx09GNf2cgPbTW3LTN8fIgN4UaRNCeLWVwHj7wC5DHyoDaCsN7-G3Z7jlslXiPLR8u4DaqlI-h4-vtR1UzxfWjoH3wtOe7GeTcSnINbdtc88MtqbSofh6FOIpRZ9XrhcpQ8fv43cQKLsCZLAR48Jg56ByoXWoXVoCrtcbviX67lyfa0YicvnbS9Ji6EKk8scBeE~LMfZ3KvyXlxlpJankKFlh5pv5P25ocKYOnlKmzMM-cdWL3JQr6-GWp4~pzyPMsMQJRheQ__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ"

--2023-07-11 21:19:10--  https://d307kc0mrhucc3.cloudfront.net/train_images.tgz?Expires=1689131216&Signature=PC5FQU8ls3vvBIxeOLcl-GgeYnidq~qolQymEfeENmxvpW~D6eo6O0rtCm4D4O4EBkzMMIUJSeofrHx09GNf2cgPbTW3LTN8fIgN4UaRNCeLWVwHj7wC5DHyoDaCsN7-G3Z7jlslXiPLR8u4DaqlI-h4-vtR1UzxfWjoH3wtOe7GeTcSnINbdtc88MtqbSofh6FOIpRZ9XrhcpQ8fv43cQKLsCZLAR48Jg56ByoXWoXVoCrtcbviX67lyfa0YicvnbS9Ji6EKk8scBeE~LMfZ3KvyXlxlpJankKFlh5pv5P25ocKYOnlKmzMM-cdWL3JQr6-GWp4~pzyPMsMQJRheQ__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ
Resolving d307kc0mrhucc3.cloudfront.net (d307kc0mrhucc3.cloudfront.net)... 18.161.153.4, 18.161.153.12, 18.161.153.53, ...
Connecting to d307kc0mrhucc3.cloudfront.net (d307kc0mrhucc3.cloudfront.net)|18.161.153.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15413902447 (14G) [application/gzip]
Saving to: ‘train_images.tgz’


2023-07-11 21:24:30 (45.9 MB/s) - ‘train_images.tgz’ saved [15413902447/15413902447]



In [8]:
# Example command to download the train labels with wget. You will need to replace the URL, because the token in this one has expired.
!wget -O train_labels.tgz \
  "https://d307kc0mrhucc3.cloudfront.net/train_labels.tgz?Expires=1689131216&Signature=DuWDhUvne4g9Mp~KbK~9VJdfrUybAKusLwXoGFPZ43D86y2bSV3BY08PNaMooENOFiJFlqVsXPSp512ZxxiITakSQ889YEgHKxDHPiMyO4OCILWZYmpivTrw3AI3gYQXCAMwkz3v~1WrgX2y8Yi5VTCtrNKWXgYFyOULCQCD6gJFJX7Buq0ldwY7nQQXoaqf2vYO7LKCviHt3EK6-CtO3sRB82LLmqLK8x~Sau~HM06v40s8jnBbU8m~W81zqQh5LMziBz7suAYeVNv8hhE5ej6IXJ9JgIatrhE8Ki9ytdWNxFTDokQUqW7DPioeGDMRfeu1xCuojxVbfBLtGhyeaQ__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ"

--2023-07-11 21:24:31--  https://d307kc0mrhucc3.cloudfront.net/train_labels.tgz?Expires=1689131216&Signature=DuWDhUvne4g9Mp~KbK~9VJdfrUybAKusLwXoGFPZ43D86y2bSV3BY08PNaMooENOFiJFlqVsXPSp512ZxxiITakSQ889YEgHKxDHPiMyO4OCILWZYmpivTrw3AI3gYQXCAMwkz3v~1WrgX2y8Yi5VTCtrNKWXgYFyOULCQCD6gJFJX7Buq0ldwY7nQQXoaqf2vYO7LKCviHt3EK6-CtO3sRB82LLmqLK8x~Sau~HM06v40s8jnBbU8m~W81zqQh5LMziBz7suAYeVNv8hhE5ej6IXJ9JgIatrhE8Ki9ytdWNxFTDokQUqW7DPioeGDMRfeu1xCuojxVbfBLtGhyeaQ__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ
Resolving d307kc0mrhucc3.cloudfront.net (d307kc0mrhucc3.cloudfront.net)... 18.161.153.53, 18.161.153.129, 18.161.153.4, ...
Connecting to d307kc0mrhucc3.cloudfront.net (d307kc0mrhucc3.cloudfront.net)|18.161.153.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48950328 (47M) [application/gzip]
Saving to: ‘train_labels.tgz’


2023-07-11 21:24:32 (54.8 MB/s) - ‘train_labels.tgz’ saved [48950328/48950328]
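
The two download URLs above are signed and carry an `Expires` query parameter, which is why they stop working after a while. As a rough sketch, assuming `Expires` is a Unix timestamp in seconds (the usual convention for CloudFront signed URLs), the helper below reports when a URL expires. The function names and the example URL are made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url):
    """Return the expiry time encoded in the URL, or None if there is no Expires field."""
    query = parse_qs(urlparse(url).query)
    expires = query.get("Expires")
    if not expires:
        return None
    # Assumption: Expires holds seconds since the Unix epoch, in UTC.
    return datetime.fromtimestamp(int(expires[0]), tz=timezone.utc)


def is_expired(url, now=None):
    """Return True if the URL carries an Expires field that is already in the past."""
    expiry = signed_url_expiry(url)
    if expiry is None:
        return False
    now = now or datetime.now(timezone.utc)
    return now >= expiry


# Hypothetical URL with the same Expires value as the commands above.
example_url = (
    "https://example.com/train_images.tgz"
    "?Expires=1689131216&Signature=abc&Key-Pair-Id=xyz"
)
print(signed_url_expiry(example_url))
print(is_expired(example_url))
```

If `is_expired` returns `True`, request a fresh link from the download page before running `wget`.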



In [10]:
# extract images from /home/ubuntu/e2e_blogposts/ngc_blog; create the target directory first, since tar -C requires it to exist
!mkdir -p xview_dataset && tar -xf train_images.tgz -C xview_dataset/

In [11]:
# unzip labels from /home/ubuntu/e2e_blogposts/ngc_blog directory 
!tar -xf train_labels.tgz -C xview_dataset/
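
If you prefer to extract from Python rather than with `tar`, the standard-library `tarfile` module can do the same job. This is a minimal sketch, not part of the repository; `extract_archive` is a hypothetical helper that creates the target directory first and rejects member paths that would land outside it.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # unlike `tar -C`, create the target if missing
    root = dest.resolve()
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries such as "../x" that would be written outside dest_dir.
            if target != root and root not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return [member.name for member in members]


# Example usage (paths taken from the cells above):
# extract_archive("train_images.tgz", "xview_dataset")
# extract_archive("train_labels.tgz", "xview_dataset")
```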

# Convert TIF to RGB

In [14]:
# Loop through all the images and convert them to RGB. This is
# important for tiling the images and training with PyTorch.
# The conversion takes about an hour to complete.
!python data_utils/tif_2_rgb.py --input_dir xview_dataset/train_images \
  --out_dir xview_dataset/train_images_rgb/

Created xview_dataset/train_images_rgb/ ...
renaming bad named files...
[PosixPath('xview_dataset/train_images/._109.tif'), PosixPath('xview_dataset/train_images/._102.tif'), PosixPath('xview_dataset/train_images/._100.tif')]
[PosixPath('109.tif'), PosixPath('102.tif'), PosixPath('100.tif')]
100%|█████████████████████████████████████████| 846/846 [55:13<00:00,  3.92s/it]
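
The contents of `data_utils/tif_2_rgb.py` are not shown here, so the following is only a sketch of what such a conversion could look like, assuming Pillow is installed and can read the source TIFFs. The helper names are hypothetical. The log above lists entries such as `._109.tif`; this sketch simply skips names beginning with `._`, which may differ from how the script treats them. It also assumes the output keeps the original file name.

```python
from pathlib import Path


def plan_conversions(input_dir, out_dir):
    """Pair each source .tif with its destination path, skipping '._' entries."""
    out = Path(out_dir)
    pairs = []
    for src in sorted(Path(input_dir).iterdir()):
        if src.suffix.lower() != ".tif" or src.name.startswith("._"):
            continue  # not an image we want to convert
        pairs.append((src, out / src.name))
    return pairs


def convert_to_rgb(src, dst):
    """Open one image and save it with three 8-bit RGB channels."""
    from PIL import Image  # Pillow, imported lazily so planning works without it

    dst.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as image:
        image.convert("RGB").save(dst)


def convert_directory(input_dir, out_dir):
    """Convert every planned image in input_dir and write the result to out_dir."""
    for src, dst in plan_conversions(input_dir, out_dir):
        convert_to_rgb(src, dst)


# Example usage (directories taken from the cell above):
# convert_directory("xview_dataset/train_images", "xview_dataset/train_images_rgb")
```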


# How to convert labels to COCO format
Here we run a script to convert the dataset labels from GeoJSON (`.geojson`) format to COCO format. More details on the COCO format here: 

The result is two files in COCO format: `train.json` and `val.json`.
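
To make the target format concrete, here is a small sketch of turning one GeoJSON feature into one COCO annotation. It assumes that each xView feature stores its box as an `"xmin,ymin,xmax,ymax"` string under `properties["bounds_imcoords"]` and its class under `properties["type_id"]`. The function name and the remapping dictionary are hypothetical and do not reflect the contents of `category_id_mapping.json`. COCO stores boxes as `[x, y, width, height]`, so the corner coordinates have to be converted.

```python
def feature_to_coco_annotation(feature, annotation_id, image_id, category_id):
    """Build a COCO annotation dict from one GeoJSON feature."""
    coords = feature["properties"]["bounds_imcoords"].split(",")
    xmin, ymin, xmax, ymax = (int(value) for value in coords)
    width, height = xmax - xmin, ymax - ymin
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [xmin, ymin, width, height],  # COCO uses top-left corner plus size
        "area": width * height,
        "iscrowd": 0,
    }


# Hypothetical feature and class remapping, for illustration only.
example_feature = {
    "properties": {"bounds_imcoords": "10,20,50,80", "type_id": 18, "image_id": "5.tif"}
}
hypothetical_remap = {18: 0}

annotation = feature_to_coco_annotation(
    example_feature,
    annotation_id=1,
    image_id=1,
    category_id=hypothetical_remap[example_feature["properties"]["type_id"]],
)
print(annotation["bbox"])  # [10, 20, 40, 60]
```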

In [3]:
# make sure train_images_dir is pointing to the .tif images
!python data_utils/convert_geojson_to_coco.py --train_images_dir xview_dataset/train_images/ \
  --train_images_dir_rgb xview_dataset/train_images_rgb/ \
  --train_geojson_path xview_dataset/xView_train.geojson \
  --output_dir xview_dataset/ \
  --train_split_rate 0.75 \
  --category_id_remapping data_utils/category_id_mapping.json \
  --xview_class_labels data_utils/xview_class_labels.txt


Namespace(category_id_remapping='data_utils/category_id_mapping.json', output_dir='xview_dataset/', train_geojson_path='xview_dataset/xView_train.geojson', train_images_dir='xview_dataset/train_images/', train_images_dir_rgb='xview_dataset/train_images_rgb/', train_split_rate=0.75, xview_class_labels='data_utils/xview_class_labels.txt')
5.tif:  True
Parsing xView data: 100%|████████████| 601937/601937 [00:07<00:00, 78172.18it/s]
Converting xView data into COCO format: 100%|█| 846/846 [01:32<00:00,  9.13it/s]
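
The `--train_split_rate 0.75` flag controls how the images are divided between `train.json` and `val.json`. How the script performs the split is not shown, so the following is only one plausible sketch: shuffle the image IDs with a fixed seed and cut the list at 75%. The function name and the seed are assumptions.

```python
import random


def split_image_ids(image_ids, train_split_rate=0.75, seed=0):
    """Shuffle image IDs reproducibly and split them into train and validation lists."""
    ids = sorted(image_ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_split_rate)  # truncate toward zero
    return ids[:n_train], ids[n_train:]


train_ids, val_ids = split_image_ids(range(846), train_split_rate=0.75)
print(len(train_ids), len(val_ids))  # 634 212
```

Splitting by image rather than by annotation keeps all objects from one image in the same subset. With 846 images, truncating 846 × 0.75 gives 634 training and 212 validation images.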


# Slicing/Tiling the Dataset
Here we use the SAHI library to slice our large satellite images. Satellite images can be up to 50,000 × 50,000 pixels in size, which wouldn't fit in GPU memory. We alleviate this problem by slicing each image into smaller tiles.
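
To illustrate the idea of tiling with overlap, the sketch below computes tile coordinates for a slice size of 640 and an overlap ratio of 0.2, the values used in this section. It is a simplified stand-alone illustration, not SAHI's implementation; `tile_boxes` is a hypothetical name. Tiles at the right and bottom edges are shifted back so they stay inside the image.

```python
def tile_boxes(image_width, image_height, slice_size=640, overlap_ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) boxes that cover the image with overlap."""
    overlap = int(slice_size * overlap_ratio)
    step = slice_size - overlap  # distance between the origins of neighbouring tiles
    boxes = []
    y = 0
    while True:
        y_max = min(y + slice_size, image_height)
        y_min = max(0, y_max - slice_size)  # shift the last row back inside the image
        x = 0
        while True:
            x_max = min(x + slice_size, image_width)
            x_min = max(0, x_max - slice_size)  # shift the last column back as well
            boxes.append((x_min, y_min, x_max, y_max))
            if x_max >= image_width:
                break
            x += step
        if y_max >= image_height:
            break
        y += step
    return boxes


print(len(tile_boxes(1000, 1000)))  # 4
print(tile_boxes(1000, 1000)[0])    # (0, 0, 640, 640)
```

With these settings the step between tiles is 512 pixels, so neighbouring tiles share a 128-pixel band. An object cut by one tile border is therefore likely to appear whole in the adjacent tile.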

In [4]:
!python data_utils/slice_coco.py --image_dir xview_dataset/train_images_rgb/ \
  --train_dataset_json_path xview_dataset/train.json \
  --val_dataset_json_path xview_dataset/val.json \
  --slice_size 640 \
  --overlap_ratio 0.2 \
  --ignore_negative_samples True \
  --min_area_ratio 0.1 \
  --output_train_dir xview_dataset/train_images_rgb_no_neg/ \
  --output_val_dir xview_dataset/val_images_rgb_no_neg/

Slicing step is starting...
indexing coco dataset annotations...
Loading coco annotations: 100%|███████████████| 634/634 [00:28<00:00, 22.13it/s]
100%|█████████████████████████████████████████| 634/634 [13:36<00:00,  1.29s/it]
Sliced dataset for 'slice_size: 640' is exported to xview_dataset/train_images_rgb_no_neg/
Slicing step is starting...
indexing coco dataset annotations...
Loading coco annotations: 100%|███████████████| 212/212 [00:07<00:00, 26.97it/s]
100%|█████████████████████████████████████████| 212/212 [04:01<00:00,  1.14s/it]
Sliced dataset for 'slice_size: 640' is exported to xview_dataset/val_images_rgb_no_neg/
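
The `--min_area_ratio` and `--ignore_negative_samples` flags decide which annotations and tiles survive slicing. As a hedged sketch of that logic (not SAHI's code): a box is clipped to the tile and dropped if less than `min_area_ratio` of its original area remains, and a tile with no remaining boxes is skipped when negative samples are ignored. The function names are hypothetical.

```python
def clip_bbox_to_tile(bbox, tile, min_area_ratio=0.1):
    """Clip a COCO [x, y, w, h] box to a tile and return it in tile coordinates, or None."""
    x, y, w, h = bbox
    tx_min, ty_min, tx_max, ty_max = tile
    ix_min, iy_min = max(x, tx_min), max(y, ty_min)
    ix_max, iy_max = min(x + w, tx_max), min(y + h, ty_max)
    if ix_max <= ix_min or iy_max <= iy_min:
        return None  # the box does not touch this tile
    original_area = w * h
    clipped_area = (ix_max - ix_min) * (iy_max - iy_min)
    if original_area <= 0 or clipped_area / original_area < min_area_ratio:
        return None  # too little of the object is visible in this tile
    return [ix_min - tx_min, iy_min - ty_min, ix_max - ix_min, iy_max - iy_min]


def annotations_for_tile(bboxes, tile, min_area_ratio=0.1, ignore_negative_samples=True):
    """Return the clipped boxes for a tile, or None if the tile should be skipped."""
    clipped = [clip_bbox_to_tile(bbox, tile, min_area_ratio) for bbox in bboxes]
    kept = [bbox for bbox in clipped if bbox is not None]
    if not kept and ignore_negative_samples:
        return None  # tile without objects: not exported
    return kept


tile = (0, 0, 640, 640)
print(clip_bbox_to_tile([600, 0, 100, 100], tile))  # [600, 0, 40, 100]
print(clip_bbox_to_tile([635, 0, 100, 100], tile))  # None
```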


# Upload to s3 bucket to support distributed training

We will now upload our exported data to a publicly accessible S3 bucket. This lets a large-scale distributed experiment access the dataset without storing it on each device.
The links below explain how to upload your dataset to an S3 bucket. Also review the `S3Backend` class in `data.py`.
* https://docs.determined.ai/latest/training/load-model-data.html#streaming-from-object-storage
* https://codingsight.com/upload-files-to-aws-s3-with-the-aws-cli/

Once you have created a publicly accessible S3 bucket, here are example commands to upload the preprocessed dataset to S3:
* `aws s3 cp --recursive xview_dataset/train_images_rgb_no_neg/   s3://determined-ai-xview-coco-dataset/train_sliced_no_neg`
* `aws s3 cp --recursive xview_dataset/val_images_rgb_no_neg/   s3://determined-ai-xview-coco-dataset/val_sliced_no_neg`
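
The same upload can be scripted from Python. The sketch below assumes `boto3` is installed and AWS credentials are configured; the helper names are hypothetical. `plan_uploads` only maps local files to object keys, so you can inspect the plan before anything is sent.

```python
from pathlib import Path


def plan_uploads(local_dir, key_prefix):
    """Map every file under local_dir to the S3 key it would be uploaded to."""
    root = Path(local_dir)
    prefix = key_prefix.strip("/")
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Keys always use forward slashes, regardless of the local OS.
            pairs.append((path, f"{prefix}/{path.relative_to(root).as_posix()}"))
    return pairs


def upload_directory(local_dir, bucket, key_prefix):
    """Upload a directory tree to S3, one object per file."""
    import boto3  # third-party dependency, imported lazily

    client = boto3.client("s3")
    for path, key in plan_uploads(local_dir, key_prefix):
        client.upload_file(str(path), bucket, key)


# Example usage (bucket and prefix taken from the commands above):
# upload_directory(
#     "xview_dataset/train_images_rgb_no_neg",
#     "determined-ai-xview-coco-dataset",
#     "train_sliced_no_neg",
# )
```

For a dataset with many small tiles, `aws s3 cp --recursive` or `aws s3 sync` will usually be faster than this sequential loop, since the CLI transfers files in parallel.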