# BigEarthNet Preprocessing
## Irrigation Capstone Fall 2020
### TP Goter

This notebook preprocesses the GeoTIFF files containing the Sentinel-2 MSI data that make up the BigEarthNet dataset and converts them into TFRecord files. It is based on the preprocessing scripts from the BigEarthNet repo, updated to work in Colaboratory with Python 3.7+ and TensorFlow 2.3.

This version of the preprocessor specifically isolates the irrigated and non-irrigated examples.

In [1]:
import pandas as pd
import tensorflow as tf
from glob import glob
import os
#from matplotlib import pyplot as plt
#%matplotlib inline
import numpy as np
from tqdm import tqdm
#from google.colab import drive
#import seaborn as sns
#from matplotlib.cm import get_cmap
#import folium
#import gdal
import rasterio
import csv
import json

In [2]:
print(pd.__version__)
print(tf.__version__)


1.1.2
2.3.1


## Mount Google Drive and Set Paths

In [3]:
#from google.colab import drive
#drive.mount('/content/gdrive')

In [4]:
#base_path = '/content/gdrive/My Drive/Capstone Project'
big_earth_path ='./BigEarthNet-v1.0/'

## Create Symbolic Link(s)
Set up a symbolic link so the Python modules in the repo can be imported easily. Then check that the link works (it is a Unix link, so check from the shell).

In [5]:
!ln -s './bigearthnet-models/' bemodels

ln: bemodels/bigearthnet-models: File exists


In [6]:
!ls bemodels

README.md           bigearthnet-models  prep_splits.py      tensorflow_utils.py
__pycache__         label_indices.json  splits


In [7]:
from bemodels import tensorflow_utils
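The `File exists` message above is what `ln -s` typically reports when the cell is re-run after the link has already been created. As a minimal sketch, the link could instead be created idempotently from Python. The helper name `ensure_symlink` is hypothetical and not part of the BigEarthNet repo; the demonstration runs in a temporary directory so it does not touch the real `bemodels` link.

```python
import os
import tempfile


def ensure_symlink(target, link_name):
    """Create link_name -> target unless that exact link already exists.

    Returns True if a link was created, False if nothing had to be done.
    (Hypothetical helper, shown for illustration only.)
    """
    if os.path.islink(link_name):
        if os.readlink(link_name) == target:
            return False  # link is already correct
        os.remove(link_name)  # stale link pointing somewhere else
    elif os.path.exists(link_name):
        raise FileExistsError(f'{link_name} exists and is not a symlink')
    os.symlink(target, link_name)
    return True


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'bigearthnet-models')
    os.mkdir(target)
    link = os.path.join(tmp, 'bemodels')
    first = ensure_symlink(target, link)   # creates the link
    second = ensure_symlink(target, link)  # second call is a no-op
    link_ok = os.path.isdir(link)          # link resolves to the folder
```

Calling the helper twice is safe, which matters in a notebook where cells are often re-executed.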

## Process All of the BigEarthNet data
This script loops over all of the subfolders in the BigEarthNet-v1.0 folder. Currently this folder does not contain the entire BigEarthNet dataset. For that reason, the original script was modified to run through the train, test, and val sets and process only the files that exist. The previous script simply aborted if a patch listed in train.csv was not in the directory.

### Note: This processing takes a very long time.
We need to determine whether there is a better way to get this data ready for ingestion into our models.
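The "process only the files that exist" behavior described above can be sketched as a small filter over the patch names in a split. This is an illustration under assumptions, not the code in `tensorflow_utils`: the function name `partition_existing` and the patch names `patch_a` and `patch_b` are hypothetical.

```python
import os
import tempfile


def partition_existing(root_folder, patch_names):
    """Split patch names into those with a folder on disk and those without."""
    present, missing = [], []
    for name in patch_names:
        if os.path.isdir(os.path.join(root_folder, name)):
            present.append(name)
        else:
            missing.append(name)
    return present, missing


# Demonstration: only one of the two listed patches exists on disk
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'patch_a'))
    present, missing = partition_existing(root, ['patch_a', 'patch_b'])
```

Reporting the `missing` list rather than aborting lets a partial download of the dataset still be processed.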

In [8]:
with open('./bigearthnet-models/label_indices.json', 'rb') as f:
    label_indices = json.load(f)

root_folder = big_earth_path
out_folder = './tfrecords'
splits = glob('./bigearthnet-models/splits/train.csv')

# Check that the root folder containing the patch folders exists
if not os.path.exists(root_folder):
    print('ERROR: folder', root_folder, 'does not exist')

In [9]:
patch_names_list = []
split_names = []
for csv_file in splits:
    patch_names_list.append([])
    split_names.append(os.path.basename(csv_file).split('.')[0])
    with open(csv_file, 'r') as fp:
        csv_reader = csv.reader(fp, delimiter=',')
        for row in csv_reader:
            patch_names_list[-1].append(row[0].strip())    

# tensorflow_utils.prep_tf_record_files(
#     root_folder, out_folder, 
#     split_names, patch_names_list, 
#     label_indices)

In [10]:
len(patch_names_list[0])

269695

In [None]:
irrigated_examples = []
nonirrigated_examples = []
missing_count = 0
for patch_name in tqdm(patch_names_list[0]):
    patch_folder_path = os.path.join(root_folder, patch_name)
    patch_json_path = os.path.join(
        patch_folder_path, patch_name + '_labels_metadata.json')
    try:
        with open(patch_json_path, 'rb') as f:
            patch_json = json.load(f)
    except (OSError, ValueError):
        # The patch is listed in the split, but its label file is
        # missing or unreadable, so count it and move on
        missing_count += 1
        continue

    if 'Permanently irrigated land' in patch_json['labels']:
        irrigated_examples.append(patch_folder_path)
    else:
        nonirrigated_examples.append(patch_folder_path)


  1%|          | 2515/269695 [00:26<45:44, 97.34it/s]  

In [12]:
len(irrigated_examples)

2375

In [13]:
len(nonirrigated_examples)

87739
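To make the class imbalance concrete, the two counts printed above can be turned into a positive fraction and a negative-to-positive ratio. This is simple arithmetic on the numbers shown in the outputs; the variable names are chosen here for illustration.

```python
# Counts taken from the two cell outputs above
n_irrigated = 2375
n_nonirrigated = 87739

n_labeled = n_irrigated + n_nonirrigated
pos_fraction = n_irrigated / n_labeled        # share of irrigated patches
neg_per_pos = n_nonirrigated / n_irrigated    # negatives per positive

print(f'{n_labeled} labeled patches, {pos_fraction:.2%} irrigated, '
      f'about {neg_per_pos:.0f} non-irrigated patches per irrigated patch')
```

With fewer than 3% positives, sampling the negatives down to match the positives is a reasonable way to build balanced fine-tuning sets.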

In [11]:
pos_df = pd.read_csv('./bigearthnet-models/splits/positive_test.csv')
neg_df = pd.read_csv('./bigearthnet-models/splits/negative_test.csv')

In [14]:
pos_df = pd.DataFrame(irrigated_examples,columns=['file'])
neg_df = pd.DataFrame(nonirrigated_examples,columns=['file'])
pos_df.to_csv('./bigearthnet-models/splits/positive_test.csv')
neg_df.to_csv('./bigearthnet-models/splits/negative_test.csv')

## Create Datasets for Fine-Tuning

Make the total dataset size divisible by 32 or 64 for easy batching.

In [12]:
pos_df_1_percent = pos_df.sample(frac=0.0135)
#pos_df_10_percent = pos_df.sample(frac=0.1346)

In [13]:
print(len(pos_df_1_percent))
#print(len(pos_df_10_percent))

32


In [14]:
sample_frac_1p = len(pos_df_1_percent)/len(neg_df)
#sample_frac_10p = len(pos_df_10_percent)/len(neg_df)

In [15]:
subset_neg_df_1p = neg_df.sample(frac=sample_frac_1p)
#subset_neg_df_10p = neg_df.sample(frac=sample_frac_10p)

In [16]:
print(len(subset_neg_df_1p))
#print(len(subset_neg_df_10p))

32


In [35]:
64*2
320*2

640

In [27]:
# start_index = 0
# stop_index = 0
# # for i in range(5):
# #     print(f'Start Index: {start_index}')
# #     stop_index = len(subset_neg_df)*(i+1)//5
# #     print(f'Stop Index: {stop_index}')
# #     balanced_df = pd.concat([pos_df, subset_neg_df[start_index:stop_index]])
# #     start_index = stop_index
# #     # Shuffle the examples
# #     balanced_df = balanced_df.sample(frac=1)
# #     balanced_df.to_csv(f'./bigearthnet-models/splits/balanced_val{i}.csv')

Start Index: 0
Stop Index: 4971
Start Index: 4971
Stop Index: 9942
Start Index: 9942
Stop Index: 14913
Start Index: 14913
Stop Index: 19884
Start Index: 19884
Stop Index: 24855
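The start and stop indices logged above follow from integer division of the row count into five equal slices. A minimal sketch of that arithmetic is shown below; the function name `fold_bounds` is hypothetical, and the row count 24855 is taken from the last stop index in the output.

```python
def fold_bounds(n_rows, n_folds):
    """Return (start, stop) index pairs that cut n_rows into n_folds slices."""
    bounds = []
    start = 0
    for i in range(n_folds):
        # Same formula as the commented-out loop: len * (i + 1) // n_folds
        stop = n_rows * (i + 1) // n_folds
        bounds.append((start, stop))
        start = stop
    return bounds


bounds = fold_bounds(24855, 5)
for start, stop in bounds:
    print(f'Start Index: {start}  Stop Index: {stop}')
```

Because each stop index is computed from the total rather than accumulated, any remainder is spread across folds and the last fold always ends at `n_rows`.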


In [17]:
balanced_df = pd.concat([pos_df_1_percent, subset_neg_df_1p])
# Shuffle the examples
balanced_df = balanced_df.sample(frac=1)
balanced_df.to_csv('./bigearthnet-models/splits/balanced_train_1percent.csv')
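The sampling, concatenation, and shuffling steps above can be collected into one helper. The sketch below is an alternative formulation, not the notebook's own method: it requests the row count directly with `n=` instead of tuning `frac`, and it fixes `random_state` so the subset is reproducible. The function name `balanced_sample` and the demo frames are hypothetical.

```python
import pandas as pd


def balanced_sample(pos_df, neg_df, n_per_class, seed=0):
    """Draw n_per_class rows from each frame and return them shuffled."""
    pos = pos_df.sample(n=n_per_class, random_state=seed)
    neg = neg_df.sample(n=n_per_class, random_state=seed)
    # Concatenate, then shuffle so positives and negatives are interleaved
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed)


# Demonstration with synthetic file names (not real BigEarthNet patches)
pos_demo = pd.DataFrame({'file': [f'pos_{i}' for i in range(100)]})
neg_demo = pd.DataFrame({'file': [f'neg_{i}' for i in range(1000)]})
balanced_demo = balanced_sample(pos_demo, neg_demo, n_per_class=32)
n_pos_demo = int(balanced_demo['file'].str.startswith('pos_').sum())
```

Choosing `n_per_class=32` gives 64 examples in total, which divides evenly into batches of 32 or 64.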

In [18]:
splits = glob('./bigearthnet-models/splits/balanced_train_1percent.*')
patch_names_list = []
split_names = []
for csv_file in splits:
    patch_names_list.append([])
    split_names.append(os.path.basename(csv_file).split('.')[0])
    csv_df = pd.read_csv(csv_file)
    patch_names_list[-1] = list(csv_df.file)
    patch_names_list[-1] = [name.split('/')[-1] for name in patch_names_list[-1]]
    

tensorflow_utils.prep_tf_record_files(
    root_folder, out_folder, 
    split_names, patch_names_list, 
    label_indices)

INFO: creating the split of balanced_train_1percent is started
64it [00:04, 14.17it/s]
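The cell above derives the split name from the CSV file name and the patch name from the last component of each stored folder path. A minimal sketch of the same derivation using `os.path` helpers is shown below. The patch folder name is a made-up placeholder, not a real BigEarthNet patch.

```python
import os

csv_file = './bigearthnet-models/splits/balanced_train_1percent.csv'
# Hypothetical patch folder path, in the form stored in the 'file' column
patch_path = './BigEarthNet-v1.0/example_patch_0_0'

# Split name: file name without its extension
split_name = os.path.splitext(os.path.basename(csv_file))[0]
# Patch name: last path component, equivalent to name.split('/')[-1] on Unix
patch_name = os.path.basename(patch_path)

print(split_name, patch_name)
```

Unlike `split('.')[0]`, `os.path.splitext` removes only the final extension, so a split name that itself contains a dot would be preserved.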
