# Part 2: Data Preparation
 ----

Note: this demo is based on https://github.com/pytorch/vision/tree/v0.11.3

This notebook walks you through each step of training a model using containers from the NGC Catalog. We chose the GPU-optimized PyTorch container as an example. The basics of working with Docker containers apply to all NGC containers.

We will show you how to:

* Download the xView dataset
* Convert labels to COCO format
* Preprocess the images by tiling: slicing large satellite imagery into chunks
* Upload the dataset to an S3 bucket to support distributed training

Let's get started!

---


### Download the PyTorch container from the NGC Catalog

Once the Docker Engine is installed on your machine, visit https://ngc.nvidia.com/catalog/containers and search for the PyTorch container. Click on the PyTorch card and copy the pull command.




# 1. Download the xView Dataset
The dataset we will be using is from the DIUx xView 2018 Challenge (https://challenge.xviewdataset.org) by the U.S. National Geospatial-Intelligence Agency (NGA). You will need to create an account at https://challenge.xviewdataset.org/welcome, agree to the terms and conditions, and download the dataset manually.

You can download the dataset at https://challenge.xviewdataset.org/data-download



In [3]:
# Install the SAHI library and scikit-image with pip
!pip install sahi scikit-image

Collecting scikit-image
  Downloading scikit_image-0.20.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
Collecting networkx>=2.8
  Downloading networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting scipy<1.9.2,>=1.8
  Downloading scipy-1.9.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.4 MB)
Collecting imageio>=2.4.1
  Downloading imageio-2.27.0-py3-none-any.whl (3.4 MB)
Collecting PyWavelets>=1.1.1
  Downloading PyWavelets-1.4.1

In [None]:
# Example wget command to download the training images. You will need to update the URL because the signed token in this one has expired.
!wget -O train_images.tgz "https://d307kc0mrhucc3.cloudfront.net/train_images.tgz?Expires=1680826141&Signature=par-WmB2Ffj-BZUoVkApKYADOMnqEQMWo7ZalDo47UqhPvy0nDge6pTUaEYH8F7xSR8nJKb3fFHdfqFea9Jua5LgqTa1sp5Ekaw8FloYIJIFvv-S0OxA-5VRpyYNLiKNjIg4uxykSKMYPj3xTq8YicBZNdnrXzafsRxmeQmcbiSqGR~8Jf1PndguouuNa4TV0D4iBtKqF0G6phgNCCF3ofGXO6YwLjjaVyKVsyLcvuQ2xj4KGKJM0AP2VJA43XjlgGJEuNzh1LubPocN8OCUlbm~jnUxq1N0iWDBFfxdW2JhCFN-iajIsgWEzQd0SMxB7bpyQhfWNEFHwNcA71uwKQ__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ"

In [None]:
# Example wget command to download the training labels. You will need to update the URL because the signed token in this one has expired.
!wget -O train_labels.tgz "https://d307kc0mrhucc3.cloudfront.net/train_labels.tgz?Expires=1680826141&Signature=ToOHVlmZq6tjN0La0wYL9~DEeaf9HK1F0KB8yy4Izk020HJemSDzakYmhCF3CsXJ3ns-KrZ4Vfws6mIlmfkk9l0FvVByQC94MB618CaRBytbCkO69ONUFAt0OzNUR14DB9cCM6Q3VJ9dHcUw-fAr~D2yHeK3mhnDSNbCQAqOaKoYQlfLxbAdTrMfU8KL6z3vAD6hC0ofa6QtlSxbhJgXupfw7nzgNmtrhF2Q6xX7gDSoi6~7OLu7bisGlA8sJuzPCVONWl5zxwK~ZPOgJsF6UAckdKsd2V-4IX4cWQlZYUf7FkKG3ccT8XFmeHY-9dBfX2AgfFH84p5P~nrPA41R-Q__&Key-Pair-Id=APKAIKGDJB5C3XUL2DXQ"

In [None]:
# Extract the images (run from the /run/determined/workdir directory)
!tar -xvf train_images.tgz -C e2e_blogposts/ngc_blog/xview_dataset/

In [None]:
# Extract the labels (run from the /run/determined/workdir directory)
!tar -xvf train_labels.tgz -C e2e_blogposts/ngc_blog/xview_dataset/
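
Before moving on, it can help to confirm that the archives extracted as expected. The following is a minimal sketch, assuming the images carry a `.tif` extension; the helper name `count_files` and the directory passed to it are illustrative and should be adjusted to wherever the archive was extracted.

```python
from pathlib import Path


def count_files(root, suffix=".tif"):
    """Count files under `root` (recursively) whose name ends with `suffix`."""
    root = Path(root)
    if not root.is_dir():
        return 0
    return sum(
        1
        for p in root.rglob("*")
        if p.is_file() and p.name.lower().endswith(suffix)
    )


# Hypothetical location: adjust to your own extraction directory.
print(count_files("xview_dataset/train_images"))
```

The conversion step later in this notebook reports 846 images, so a count close to that number suggests the extraction completed.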

# Convert TIF to RGB

In [21]:
# Loop through all the images and convert them to RGB. This is important for tiling the images and training with PyTorch.
# This step takes about an hour to complete.
!python data_utils/tif_2_rgb.py --input_dir xview_dataset/train_images \
  --out_dir xview_dataset/train_images_rgb/

renaming bad named files...
[PosixPath('100.tif'), PosixPath('109.tif'), PosixPath('102.tif')]
100%|███████████████████████████████████████| 846/846 [1:03:55<00:00,  4.53s/it]
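
The internals of `data_utils/tif_2_rgb.py` are not shown in this notebook. As a rough illustration of what a per-image conversion can look like, here is a minimal sketch assuming Pillow can read the `.tif` files and that the output is written as PNG (the conversion log in the next section lists names such as `2355.png`, which suggests PNG, but that is an inference). The function names are hypothetical.

```python
from pathlib import Path


def rgb_output_path(tif_path, out_dir, ext=".png"):
    """Map an input .tif path to an output path with the same file stem."""
    return Path(out_dir) / (Path(tif_path).stem + ext)


def convert_tif_to_rgb(tif_path, out_dir):
    """Open one image, force 3-channel RGB, and save it under `out_dir`."""
    from PIL import Image  # imported lazily; requires Pillow to be installed

    out_path = rgb_output_path(tif_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(tif_path) as img:
        img.convert("RGB").save(out_path)
    return out_path


# Path mapping only; no image is read here.
print(rgb_output_path("xview_dataset/train_images/100.tif", "xview_dataset/train_images_rgb"))
```

Looping `convert_tif_to_rgb` over every `.tif` in the input directory would reproduce the general shape of the step above, without the file renaming that the real script reports.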


# 2. How to convert labels to COCO format
Here we run a script to convert the dataset labels from .geojson format to COCO format. More details on the COCO format are available in the COCO dataset documentation.

The result will be two generated files in COCO format: `train.json` and `val.json`.
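
To make the target format concrete, the sketch below builds a tiny COCO-style dictionary by hand. It assumes each xView label stores its box as a comma-separated `xmin,ymin,xmax,ymax` string; COCO instead expects `[x, y, width, height]`. The helper names, image size, and category are made up for illustration and are not taken from `convert_geojson_to_coco.py`.

```python
def xview_bounds_to_coco_bbox(bounds_imcoords):
    """Convert an 'xmin,ymin,xmax,ymax' string to COCO [x, y, width, height]."""
    xmin, ymin, xmax, ymax = (float(v) for v in bounds_imcoords.split(","))
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def make_coco_annotation(ann_id, image_id, category_id, bounds_imcoords):
    """Build one COCO annotation entry for a single bounding box."""
    bbox = xview_bounds_to_coco_bbox(bounds_imcoords)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }


# A COCO file has three top-level lists: images, categories, annotations.
coco = {
    "images": [{"id": 1, "file_name": "5.png", "width": 3000, "height": 3000}],  # hypothetical size
    "categories": [{"id": 0, "name": "example_class"}],  # hypothetical category
    "annotations": [make_coco_annotation(1, 1, 0, "10,20,110,70")],
}
print(coco["annotations"][0]["bbox"])  # [10.0, 20.0, 100.0, 50.0]
```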

In [11]:
# make sure train_images_dir is pointing to the .tif images
!python data_utils/convert_geojson_to_coco.py --train_images_dir xview_dataset/train_images/ \
  --train_images_dir_rgb xview_dataset/train_images_rgb/ \
  --train_geojson_path xview_dataset/xView_train.geojson \
  --output_dir xview_dataset/ \
  --train_split_rate 0.75 \
  --category_id_remapping data_utils/category_id_mapping.json \
  --xview_class_labels data_utils/xview_class_labels.txt


Namespace(category_id_remapping='data_utils/category_id_mapping.json', output_dir='xview_dataset/', train_geojson_path='xview_dataset/xView_train.geojson', train_images_dir='xview_dataset/train_images/', train_images_dir_rgb='xview_dataset/train_images_rgb/', train_split_rate=0.75, xview_class_labels='data_utils/xview_class_labels.txt')
5.tif:  True
Parsing xView data: 100%|████████████| 601937/601937 [00:14<00:00, 42033.53it/s]
chips:  ['2355.png' '2355.png' '2355.png' ... '389.png' '389.png' '389.png']
Converting xView data into COCO format: 100%|█| 846/846 [01:12<00:00, 11.64it/s]
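
The `--train_split_rate 0.75` flag controls how the 846 images are divided between `train.json` and `val.json`. How the script selects images is not shown here, so the following is only a sketch of one plausible approach: shuffle the image IDs with a fixed seed and cut the list at the split rate. The function name and the seed are assumptions.

```python
import random


def split_image_ids(image_ids, train_split_rate=0.75, seed=0):
    """Shuffle image IDs deterministically and split them into train and val."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_split_rate)  # truncates toward zero
    return ids[:n_train], ids[n_train:]


train_ids, val_ids = split_image_ids(range(846))
print(len(train_ids), len(val_ids))  # 634 212
```

Truncating 846 × 0.75 = 634.5 gives 634 training images, which matches the 634 images that the slicing step below loads from `train.json`.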


# 3. Slicing/Tiling the Dataset
Here we use the SAHI library to slice our large satellite images. Satellite images can be up to 50k x 50k pixels in size, which would not fit in GPU memory. We alleviate this problem by slicing each image into smaller tiles.
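
To illustrate how a slice size and an overlap ratio translate into tile coordinates, here is a minimal sketch. It is not SAHI's implementation; it assumes tiles advance by `slice_size * (1 - overlap_ratio)` pixels and that a tile which would cross the image border is shifted back so it ends at the border. The function name `slice_windows` is hypothetical.

```python
def slice_windows(image_width, image_height, slice_size=300, overlap_ratio=0.2):
    """Return [xmin, ymin, xmax, ymax] tiles that cover the whole image."""
    step = slice_size - int(slice_size * overlap_ratio)  # 240 for 300 px and 0.2

    def spans(length):
        # Images smaller than one tile get a single, clipped span.
        if length <= slice_size:
            return [(0, length)]
        begins = list(range(0, length - slice_size, step))
        begins.append(length - slice_size)  # last tile ends at the border
        return [(b, b + slice_size) for b in begins]

    return [
        [x0, y0, x1, y1]
        for (y0, y1) in spans(image_height)
        for (x0, x1) in spans(image_width)
    ]


print(slice_windows(700, 300))
# [[0, 0, 300, 300], [240, 0, 540, 300], [400, 0, 700, 300]]
```

With the settings used in this notebook (300-pixel tiles, 0.2 overlap) the step is 240 pixels. Tile names that appear later in the archive listing, such as `600_3013_2640_3313_2940.jpg`, are consistent with this behavior: the tile starts at x = 3013 rather than at a multiple of 240, as expected for a border tile.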

In [12]:
!python data_utils/slice_coco.py --image_dir xview_dataset/train_images_rgb/ \
  --dataset_json_path xview_dataset/train.json \
  --slice_size 300 \
  --overlap_ratio 0.2 \
  --ignore_negative_samples True \
  --min_area_ratio 0.1 \
  --output_dir xview_dataset/train_images_rgb_no_neg/

Slicing step is starting...
indexing coco dataset annotations...
Loading coco annotations: 100%|███████████████| 634/634 [01:01<00:00, 10.31it/s]
100%|█████████████████████████████████████████| 634/634 [47:13<00:00,  4.47s/it]
Sliced dataset for 'slice_size: 300' is exported to xview_dataset/train_images_rgb_no_neg/
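
The `--min_area_ratio 0.1` and `--ignore_negative_samples True` flags decide which annotations and tiles survive slicing. As a sketch of the idea (an assumption about the semantics, not SAHI's code): an annotation is kept for a tile only if the part of its box inside the tile covers at least `min_area_ratio` of the original box, and a tile left with no annotations is dropped. All function names below are hypothetical, and boxes use `[xmin, ymin, xmax, ymax]`.

```python
def clipped_box(box, tile):
    """Intersect a box with a tile; return None if they do not overlap."""
    xmin, ymin = max(box[0], tile[0]), max(box[1], tile[1])
    xmax, ymax = min(box[2], tile[2]), min(box[3], tile[3])
    if xmax <= xmin or ymax <= ymin:
        return None
    return [xmin, ymin, xmax, ymax]


def keep_annotation(box, tile, min_area_ratio=0.1):
    """Keep a box if enough of its original area falls inside the tile."""
    clipped = clipped_box(box, tile)
    if clipped is None:
        return False
    full_area = (box[2] - box[0]) * (box[3] - box[1])
    clipped_area = (clipped[2] - clipped[0]) * (clipped[3] - clipped[1])
    return full_area > 0 and clipped_area / full_area >= min_area_ratio


def keep_tile(boxes, tile, min_area_ratio=0.1):
    """Mimic ignoring negative samples: drop tiles with no surviving boxes."""
    return any(keep_annotation(b, tile, min_area_ratio) for b in boxes)


tile = [0, 0, 300, 300]
print(keep_annotation([250, 100, 350, 200], tile))  # True: half of the box is inside
print(keep_annotation([295, 100, 395, 200], tile))  # False: only 5% is inside
```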


# 4. How to upload to an S3 bucket to support distributed training

In [13]:
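# -z gzip-compresses the archive so its contents match the .tgz extension; -vv prints the detailed listing
!tar -czvvf xview_dataset.tgz xview_dataset/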

drwxr-xr-x root/root         0 2023-04-06 23:25 xview_dataset/
-rw-r--r-- root/root  15447246 2023-04-07 00:14 xview_dataset/val.json
drwxr-xr-x root/root         0 2023-04-06 23:53 xview_dataset/train_images_rgb_no_neg/
drwxr-xr-x root/root         0 2023-04-07 01:03 xview_dataset/train_images_rgb_no_neg/train_images_300_02/
-rw-r--r-- root/root     10137 2023-04-07 00:21 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1922_3600_2160_3900_2460.jpg
-rw-r--r-- root/root      9312 2023-04-07 00:22 xview_dataset/train_images_rgb_no_neg/train_images_300_02/2020_2640_2400_2940_2700.jpg
-rw-r--r-- root/root      9942 2023-04-07 00:56 xview_dataset/train_images_rgb_no_neg/train_images_300_02/2361_480_2160_780_2460.jpg
-rw-r--r-- root/root      9090 2023-04-07 00:34 xview_dataset/train_images_rgb_no_neg/train_images_300_02/888_2880_1680_3180_1980.jpg
-rw-r--r-- root/root     15809 2023-04-07 00:36 xview_dataset/train_images_rgb_no_neg/train_images_300_02/2044_240_960_540_1260.jpg
-rw




-rw-r--r-- root/root      4687 2023-04-07 00:40 xview_dataset/train_images_rgb_no_neg/train_images_300_02/2455_2640_720_2940_1020.jpg
-rw-r--r-- root/root      8304 2023-04-07 00:56 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1450_0_2640_300_2940.jpg
-rw-r--r-- root/root     10981 2023-04-07 00:16 xview_dataset/train_images_rgb_no_neg/train_images_300_02/637_480_1920_780_2220.jpg
-rw-r--r-- root/root     10347 2023-04-07 00:36 xview_dataset/train_images_rgb_no_neg/train_images_300_02/293_960_2400_1260_2700.jpg
-rw-r--r-- root/root      7142 2023-04-07 00:29 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1126_1920_960_2220_1260.jpg
-rw-r--r-- root/root     11951 2023-04-07 00:18 xview_dataset/train_images_rgb_no_neg/train_images_300_02/140_240_1920_540_2220.jpg
-rw-r--r-- root/root     10495 2023-04-07 00:59 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1914_720_1200_1020_1500.jpg
-rw-r--r-- root/root     11286 2023-04-07 00:23 xview_dataset/train_im




-rw-r--r-- root/root      6707 2023-04-07 00:44 xview_dataset/train_images_rgb_no_neg/train_images_300_02/432_0_1920_300_2220.jpg
-rw-r--r-- root/root      6517 2023-04-07 00:41 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1114_1920_0_2220_300.jpg
-rw-r--r-- root/root      6043 2023-04-07 00:46 xview_dataset/train_images_rgb_no_neg/train_images_300_02/433_720_480_1020_780.jpg
-rw-r--r-- root/root      9964 2023-04-07 00:52 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1067_2880_1440_3180_1740.jpg
-rw-r--r-- root/root     22005 2023-04-07 01:02 xview_dataset/train_images_rgb_no_neg/train_images_300_02/600_3013_2640_3313_2940.jpg
-rw-r--r-- root/root      4374 2023-04-07 00:57 xview_dataset/train_images_rgb_no_neg/train_images_300_02/2520_1680_1920_1980_2220.jpg
-rw-r--r-- root/root     16086 2023-04-07 00:58 xview_dataset/train_images_rgb_no_neg/train_images_300_02/1042_1440_1200_1740_1500.jpg
-rw-r--r-- root/root     13414 2023-04-07 00:26 xview_dataset/train_i
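
Once the archive exists, it still has to be copied to S3. The notebook does not show that step, so the following is a minimal sketch using boto3's `upload_file`, assuming boto3 is installed and AWS credentials are already configured. The bucket name, prefix, and helper names are placeholders rather than values from this project.

```python
from pathlib import Path


def s3_key_for(local_path, prefix=""):
    """Build the object key: '<prefix>/<file name>', or just the file name."""
    name = Path(local_path).name
    prefix = prefix.strip("/")
    return f"{prefix}/{name}" if prefix else name


def upload_archive(local_path, bucket, prefix=""):
    """Upload one file to S3 and return its s3:// URI."""
    import boto3  # imported lazily; requires boto3 and configured credentials

    key = s3_key_for(local_path, prefix)
    boto3.client("s3").upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Hypothetical bucket and prefix; uncomment after substituting your own values.
# upload_archive("xview_dataset.tgz", "my-bucket", "datasets/xview")
print(s3_key_for("xview_dataset.tgz", "datasets/xview"))  # datasets/xview/xview_dataset.tgz
```

Keeping the dataset as a single archive in one bucket gives every training worker the same source to download from, which is the reason for this step in a distributed setup.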