<a href="https://colab.research.google.com/github/gl7176/CNN_tools/blob/main/training_from_multiple_sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import tiles from one dataset and run a model from another dataset

**Before running this script, enter the Drive folder link of each dataset you would like to use to train the model. Each Drive folder should include (1) tiled images, (2) the `tiling_scheme.json` file, and (3) the training data (a VIA annotations CSV) associated with each tile set.**

##### <center> Be sure to update the hyperlink above if you clone this notebook and want to point to a different GitHub repository </center>
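
The folder requirements listed above can be checked locally once a dataset has been downloaded. The following is a minimal sketch and not part of the original workflow: `missing_dataset_parts` is a hypothetical helper, and it assumes that tiles are `.png` files and that the training data is a `.csv`, matching the file types the download cell pulls.

```python
import os
import tempfile


def missing_dataset_parts(folder):
    """Return a list naming each required part that is absent from `folder`."""
    names = os.listdir(folder)
    missing = []
    # (1) tiled images, assumed here to be .png files
    if not any(n.endswith(".png") for n in names):
        missing.append("tiled images (.png)")
    # (2) the tiling scheme file
    if "tiling_scheme.json" not in names:
        missing.append("tiling_scheme.json")
    # (3) the training data, assumed here to be a .csv export
    if not any(n.endswith(".csv") for n in names):
        missing.append("training data (.csv)")
    return missing


# Usage with a throwaway folder that contains only one (empty) tile file
with tempfile.TemporaryDirectory() as demo:
    open(os.path.join(demo, "tile_0.png"), "w").close()
    print(missing_dataset_parts(demo))
```

An empty list means all three parts were found; the check only looks at file names, not file contents.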

In [30]:
# set variable to the destination google drive folder you want to pull from
drive_folders = ['https://drive.google.com/drive/folders/1DKAp-k2cHWFj9rLNhNL4i6dKoPcR3Gn6',
                'https://drive.google.com/drive/folders/1INuRNVKvKMy8L_Nb6lmoVbyvScWK0-0D']

# manually assign IDs to the dataset, if wanted, for output labeling
dataset_IDs = ['HI2016', 'HI2015']

!pip install -U -q PyDrive
import os, numpy as np
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# 2. Auto-iterate using the query syntax
#    https://developers.google.com/drive/v2/web/search-parameters
dataset_list = np.empty(len(drive_folders), dtype=object) 
for num,entry in enumerate(drive_folders):
  print("new directory: dataset_{c}".format(c=num))
  local_download_path = os.path.expanduser("dataset_{c}".format(c=num))
  # create the local folder; exist_ok avoids an error if it is already there
  os.makedirs(local_download_path, exist_ok=True)
  pointer = str("'" + entry.split("/")[-1] + "'" + " in parents")

  file_list = drive.ListFile(
      {'q': pointer}).GetList()

  # 3. Create & download filetypes of interest by id.
  count = 0
  image_list = []
  for f in file_list:
    fname = os.path.join(local_download_path, f['title'])
    if fname.endswith(".png"):
      image_list.append(fname)
      count += 1
      if count % 10 == 0:
        print("dataset_{e}: {c} tiles pulled".format(e=num, c=count))
      f_ = drive.CreateFile({'id': f['id']})
      f_.GetContentFile(fname)
    elif fname.endswith(".csv") or fname.endswith(".json"):
      f_ = drive.CreateFile({'id': f['id']})
      f_.GetContentFile(fname)
  dataset_list[num] = {"dataset_name": local_download_path,
                  "annotations_file": "annotations_placeholder",
                  "tiling_scheme_file": "tsf_placeholder",
                  "image_list": image_list}
  print("dataset_{e}: {c} tiles pulled".format(e=num, c=count))

new directory: dataset_0
dataset_0: 10 tiles pulled
dataset_0: 20 tiles pulled
dataset_0: 30 tiles pulled
dataset_0: 40 tiles pulled
dataset_0: 50 tiles pulled
dataset_0: 60 tiles pulled
dataset_0: 70 tiles pulled
dataset_0: 80 tiles pulled
dataset_0: 90 tiles pulled
dataset_0: 100 tiles pulled
dataset_0: 110 tiles pulled
dataset_0: 120 tiles pulled
dataset_0: 130 tiles pulled
dataset_0: 134 tiles pulled
new directory: dataset_1
dataset_1: 10 tiles pulled
dataset_1: 20 tiles pulled
dataset_1: 30 tiles pulled
dataset_1: 40 tiles pulled
dataset_1: 50 tiles pulled
dataset_1: 60 tiles pulled
dataset_1: 70 tiles pulled
dataset_1: 80 tiles pulled
dataset_1: 90 tiles pulled
dataset_1: 100 tiles pulled
dataset_1: 110 tiles pulled
dataset_1: 120 tiles pulled
dataset_1: 130 tiles pulled
dataset_1: 140 tiles pulled
dataset_1: 150 tiles pulled
dataset_1: 160 tiles pulled
dataset_1: 170 tiles pulled
dataset_1: 180 tiles pulled
dataset_1: 190 tiles pulled
dataset_1: 200 tiles pulled
dataset_1: 210 t

### Identify necessary files from among files in the input directory

In [31]:
import csv, json
for num,dataset in enumerate(dataset_list):
  for fname in os.listdir(dataset["dataset_name"]):
    if fname.endswith(".csv"): 
      annotations_candidate = "{i}/{f}".format(i=dataset["dataset_name"], f=fname)
      with open(annotations_candidate, "r") as f:
        if next(csv.reader(f, delimiter=","))[0:3] == ['filename', 'file_size', 'file_attributes']:
          dataset_list[num]["annotations_file"] = annotations_candidate
        else: continue

    if fname.endswith(".json"):
      tiling_scheme_candidate = "{i}/{f}".format(i=dataset["dataset_name"], f=fname)
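      with open(tiling_scheme_candidate) as f:
        try:
          # a valid tiling scheme lists its tiles under tile_pointers -> image_locations
          image_list = list(json.load(f)["tile_pointers"]["image_locations"].keys())
          dataset_list[num]["tiling_scheme_file"] = tiling_scheme_candidate
        except (ValueError, KeyError, TypeError, AttributeError): continue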

  # compare against the exact placeholder strings assigned in the download cell
  if dataset_list[num]["annotations_file"] == "annotations_placeholder":
    raise Exception("VIA annotations file not found")
  if dataset_list[num]["tiling_scheme_file"] == "tsf_placeholder":
    raise Exception("tiling scheme file not found")

  print("{d} annotations file identified as {f}".format(d=dataset["dataset_name"], f = dataset_list[num]["annotations_file"]))
  print("{d} tiling scheme file identified as {f}".format(d=dataset["dataset_name"], f = dataset_list[num]["tiling_scheme_file"]))

dataset_0 annotations file identified as dataset_0/via_SealCNN_TrainingData2016.csv
dataset_0 tiling scheme file identified as dataset_0/tiling_scheme.json
dataset_1 annotations file identified as dataset_1/via_SealCNN_TrainingData.csv
dataset_1 tiling scheme file identified as dataset_1/tiling_scheme.json
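
The annotations check above keys on the first three column names of a VIA CSV export. A minimal sketch of that test as a standalone function follows; `looks_like_via_csv` and `VIA_HEADER_PREFIX` are hypothetical names, and the sample header strings are illustrative inputs rather than contents of the real files.

```python
import csv
import io

# The three leading column names the notebook uses to recognise a VIA export
VIA_HEADER_PREFIX = ['filename', 'file_size', 'file_attributes']


def looks_like_via_csv(handle):
    """Return True if the first CSV row starts with the VIA column names."""
    # next(..., []) returns an empty row for an empty file instead of raising
    first_row = next(csv.reader(handle, delimiter=","), [])
    return first_row[0:3] == VIA_HEADER_PREFIX


via_text = ("filename,file_size,file_attributes,region_count,region_id,"
            "region_shape_attributes,region_attributes\n")
other_text = "image,label\n"

print(looks_like_via_csv(io.StringIO(via_text)))
print(looks_like_via_csv(io.StringIO(other_text)))
```

Any open text file handle can be passed in place of the `io.StringIO` objects used here.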


### Shuffle and split images into 2 datasets: Training, Validation

In [40]:
import random
# set pseudo-random values for replicability
random.seed(3)

image_list = []
for value in dataset_list:
  image_list= image_list + value["image_list"]

# shuffle the image list randomly and get total count
random.shuffle(image_list)
total_count = len(image_list)

# set fractions for breaking up the total dataset into training and validation parts
valid_fraction, train_fraction = 0.2, 0.8

# raise an error if the fractions don't add up (the tolerance guards against float rounding)
if abs(valid_fraction + train_fraction - 1.0) > 1e-9:
    raise Exception("fractions should add up to 1")

split_index = int(total_count * train_fraction)

# use the split index to break up the dataset into the two parts
train_dataset, valid_dataset= image_list[:split_index], image_list[split_index:]
print(len(valid_dataset), len(train_dataset))

output_dir = "output_directory"
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# spit out CSV listing the image subsets
subset_list = []
for row in valid_dataset:
        subset_list.append([row, "validation"])
for row in train_dataset:
        subset_list.append([row, "training"])
with open(output_dir + '/subset_list.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(subset_list)

76 302
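
The shuffle-and-split step above can be sketched as one reusable function. This is an illustration under stated assumptions, not the notebook's own code: `split_images` is a hypothetical name, and it uses a private `random.Random` instance rather than the global seed so that calling it does not disturb other random state.

```python
import random


def split_images(paths, train_fraction=0.8, seed=3):
    """Shuffle `paths` reproducibly and return (train, valid) lists."""
    shuffled = list(paths)
    # a dedicated generator keeps the shuffle repeatable for a given seed
    random.Random(seed).shuffle(shuffled)
    # int() truncates, so any remainder goes to the validation set
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]


tiles = ["dataset_0/tile_{}.png".format(i) for i in range(10)]
train, valid = split_images(tiles)
print(len(train), len(valid))
```

Because of the truncation, 378 images with a 0.8 training fraction yield 302 training and 76 validation images, which matches the counts printed above.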


### Reformat annotations from VIA to RetinaNet format
The following loop reads each annotation, line by line, from the VIA-exported CSV, extracts the necessary information, and converts it into the format that RetinaNet requires (https://github.com/fizyr/keras-retinanet#annotations-format). The converted rows are then collected into new CSV files that RetinaNet can read.

In [42]:
# Create blank list for class names
class_list = []
image_annotations_train, image_annotations_valid = [], []

# read each line, parse it, convert it, put it all back together
# then drop it in the appropriate subset
for num,dataset in enumerate(dataset_list):
  with open(dataset["annotations_file"], "r") as f:
      reader = csv.reader(f, delimiter=",")
      for line in reader: 
          # output we want:
          # format: path/to/image.jpg,x1,y1,x2,y2,class_name
          # example: /data/imgs/img_001.jpg,837,346,981,456,cow
          filename = line[0]
          if filename == 'filename':
              # skip the header row of the VIA CSV
              continue
          filename = "{d}/{f}".format(d=dataset["dataset_name"], f=filename)
          if '{}' in line[5]:
              new_row = [filename,"","","","",""]
              # create a blank entry for empty images
          else:  
            # pulling from column named "region_shape_attributes"
            box_entry = json.loads(line[5])
            top_left_x, top_left_y, width, height = box_entry["x"], box_entry["y"], box_entry["width"], box_entry["height"]
    
            if width == 0 or height == 0:
                continue
                # skip tiny/empty boxes
            
            # convert from "top left and width/height" to "x and y values at each corner of the box"
            if top_left_x < 0:
                top_left_x = 1
            if top_left_y < 0:
                top_left_y = 1
            x1, x2, y1, y2 = top_left_x, top_left_x + width, top_left_y, top_left_y + height 
            
            # pulling from column named "region_attributes" to get class names
            name = json.loads(line[6])["Age Class"]

            # skip unknown class, in this case. Might be useful in other applications though,
            # e.g. total object count irrespective of class
            if name == "Unknown":
                continue

            # build list of classes as we encounter new names
            if name not in class_list:
                class_list.append(name)

            # create the annotation row
            new_row = [filename, x1, y1, x2, y2, name]

            # append the row to the correct subset (training or validation)
            if filename in train_dataset:
                image_annotations_train.append(new_row)
            else:
                image_annotations_valid.append(new_row)

tv_ = list(map(len, [image_annotations_train, image_annotations_valid]))
tv = list(map(int, [x/sum(tv_)*100 for x in tv_]))
print("total breakdown of annotations: {n} - {t}% training set, {v}% validation set".format(t=str(tv[0]), v=str(tv[1]), n=tv_))

total breakdown of annotations: [5630, 1374] - 80% training set, 19% validation set
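
The core of the conversion above, for a single rectangle, can be sketched as follows. `via_box_to_retinanet` is a hypothetical helper, the sample JSON strings are made up for illustration, and `"Pup"` is an assumed class value; only the `"Age Class"` key and the `x`, `y`, `width`, `height` fields come from the loop itself. The sketch leaves out the loop's handling of empty images, zero-size boxes, negative coordinates, and the `"Unknown"` class.

```python
import json


def via_box_to_retinanet(filename, shape_json, attributes_json, class_key="Age Class"):
    """Convert one VIA rectangle into [path, x1, y1, x2, y2, class_name]."""
    box = json.loads(shape_json)
    # VIA stores the top-left corner plus width/height;
    # RetinaNet wants the two opposite corners
    x1, y1 = box["x"], box["y"]
    x2, y2 = x1 + box["width"], y1 + box["height"]
    name = json.loads(attributes_json)[class_key]
    return [filename, x1, y1, x2, y2, name]


row = via_box_to_retinanet(
    "dataset_0/tile_0.png",
    '{"name":"rect","x":837,"y":346,"width":144,"height":110}',
    '{"Age Class":"Pup"}',
)
print(row)  # ['dataset_0/tile_0.png', 837, 346, 981, 456, 'Pup']
```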


In [45]:
output_name = 'annotations_train.csv'
for ID in dataset_IDs:
  output_name = "{i}_{o}".format(i=ID, o=output_name)
training_data_file = "{d}/{n}".format(d=output_dir, n=output_name)
with open(training_data_file, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(image_annotations_train)

output_name = 'annotations_valid.csv'
for ID in dataset_IDs:
  output_name = "{i}_{o}".format(i=ID, o=output_name)
validation_data_file = "{d}/{n}".format(d=output_dir, n=output_name)
with open(validation_data_file, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(image_annotations_valid)

detection_classes = []
for i in range(0, len(class_list)):
    detection_classes.append([class_list[i], i])
classes_file = "{d}/{n}".format(d=output_dir, n="classes.csv")
with open(classes_file, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerows(detection_classes)
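
The class mapping file written above holds one `class_name,id` row per class, with ids starting at 0. A minimal sketch of building those rows with `enumerate` follows; `class_mapping_rows` is a hypothetical name and the class names `"Adult"` and `"Pup"` are assumed examples, not values taken from the datasets.

```python
import csv
import io


def class_mapping_rows(class_names):
    """Pair each class name with a zero-based integer id."""
    return [[name, index] for index, name in enumerate(class_names)]


# Write to an in-memory buffer instead of a file to show the CSV text
buffer = io.StringIO()
csv.writer(buffer).writerows(class_mapping_rows(["Adult", "Pup"]))
print(buffer.getvalue())
```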

### Install the Convolutional Neural Network that will do the detections. 

This section sets up the software and pulls code for a CNN model called "RetinaNet" which uses the model "ResNet-50" as a subcomponent. This section then loads data for an existing ResNet-50 model (pre-trained for object detection) which we will further train for our task.

Disregard any errors or prompts to "restart runtime" unless the code stops progressing (then email me at gdl10@duke.edu).

In [None]:
# install the keras package
! pip install keras==2.4

In [None]:
# copy the files for RetinaNet
# note that this build is now deprecated, but we are fine with that
# now pulling from a personal clone that outputs error metrics
! git clone https://github.com/gl7176/keras-retinanet.git

In [None]:
# change directory and install RetinaNet from the copied code
%cd keras-retinanet

! pip install .

In [None]:
! python setup.py build_ext --inplace

In [None]:
%cd ../

# get the pre-trained ResNet-50 model
! wget -P data "https://github.com/fizyr/keras-retinanet/releases/download/0.5.1/resnet50_coco_best_v2.1.0.h5"

In [None]:
import shutil
print(os.getcwd())
for dataset in dataset_list:
  # move each dataset folder into the output directory, alongside the annotation CSVs
  shutil.move(dataset["dataset_name"], "{o}/{d}".format(o=output_dir, d=dataset["dataset_name"]))

In [None]:
image_count = 0
for dataset in dataset_list:
  image_count += len(dataset["image_list"])

import subprocess, glob

epoch_number = 50
batch_size_number = 2
step_number = int(image_count/batch_size_number)
print(str(step_number) + " steps")

# terminal code for troubleshooting
#! keras-retinanet/keras_retinanet/bin/train.py \
#--weights data/resnet50_coco_best_v2.1.0.h5 \
#--epochs 50 --steps 189 --batch-size 2 \
#csv output_directory/HI2015_HI2016_annotations.csv output_directory/classes.csv

# this process takes a while to run, be warned!
# you can monitor progress through the per-epoch files written to the "output" folder

model_run = subprocess.check_output(['keras-retinanet/keras_retinanet/bin/train.py',
                 '--weights', 'data/resnet50_coco_best_v2.1.0.h5',
                 '--epochs', str(epoch_number),  '--steps', str(step_number), '--batch-size',
                 str(batch_size_number), 'csv', training_data_file, classes_file,
                 '--val-annotations', validation_data_file]).decode("utf-8")
print(model_run)
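
The `--steps` value passed to `train.py` above is the image count divided by the batch size, truncated to a whole number. A minimal sketch of that arithmetic follows; `steps_per_epoch` is a hypothetical helper, and the floor of one step and the batch-size check are additions assumed here, not part of the original cell.

```python
def steps_per_epoch(image_count, batch_size):
    """Number of batches needed to pass over the images once (rounded down)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # assumed safeguard: never return zero steps for a tiny dataset
    return max(1, image_count // batch_size)


print(steps_per_epoch(378, 2))  # 189
```

With the 378 images split earlier (302 training plus 76 validation) and a batch size of 2, this gives 189, the same figure that appears in the troubleshooting command.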

In [None]:
list_of_files = glob.glob('snapshots/resnet*.h5')
latest_file = max(list_of_files, key=os.path.getctime)
epoch_final = latest_file[latest_file.index("_csv_")+5:-3]
best_model_training = latest_file.replace("/content/", "")
print(best_model_training)
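
The epoch number above is cut out of the snapshot file name by position. An alternative sketch using a regular expression is shown below; `snapshot_epoch` is a hypothetical helper, and it assumes snapshot names follow the `resnet50_csv_10.h5` pattern seen in this notebook.

```python
import re


def snapshot_epoch(path):
    """Return the epoch digits from a name like 'resnet50_csv_10.h5', or None."""
    match = re.search(r"_csv_(\d+)\.h5$", path)
    return match.group(1) if match else None


print(snapshot_epoch("snapshots/resnet50_csv_10.h5"))  # 10
```

The function returns the digits as a string, so zero padding in the file name is preserved for building the metrics file names.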

This next section converts the model from training mode to inference mode so it can be used to detect our target objects (seals). Until now we've been updating the model based on its performance; now we're fixing the model in a static "snapshot" so we can test it out. This conversion process takes a little time.

In [None]:
# note that we are naming our model "best_model_inference" and locating it in the "snapshots" directory. Customize if wanted
model_name = "best_model_inference"
#! keras-retinanet/keras_retinanet/bin/convert_model.py snapshots/resnet50_csv_10.h5 snapshots/best_model_inference.h5
subprocess.run(["keras-retinanet/keras_retinanet/bin/convert_model.py", best_model_training, "snapshots/{m}.h5".format(m=model_name)])


### Export model and metrics

In [None]:
from google.colab import files

In [None]:
# export metrics (fast)
files.download("/content/output/Epoch-{n}.png".format(n=epoch_final))
files.download("/content/output/Epoch-{n}.csv".format(n=epoch_final))

In [None]:
# export inference model (slow)
# the converted model was saved as snapshots/<model_name>.h5 in the conversion cell
files.download("/content/snapshots/{m}.h5".format(m=model_name))

In [None]:
# export training model (even slower)
files.download("/content/{m}".format(m=best_model_training))