# Generation of Data Sets for Training

## Generating the Proxy and Full tasks

This notebook extracts and downsamples patches from the high-resolution (HR) versions of the DIV2K image data sets, which comprise 800 images for training and 100 for validation. As many 64 x 64 pixel patches as possible are extracted from the 900 images, and data augmentation is then applied to them.
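
As a rough sketch of what "as many patches as possible" means, assuming non-overlapping patches (a step equal to the patch size, as in the extraction cells later in this notebook), the number of patches per image is the product of the integer divisions of its height and width by the patch size. The helper name `count_patches` and the 1404 x 2040 image size are illustrative assumptions, not part of the notebook.

```python
def count_patches(height: int, width: int, size: int = 64) -> int:
    """Number of non-overlapping size x size patches that fit in an image.

    Rows and columns that do not fill a complete patch are discarded.
    """
    return (height // size) * (width // size)


# Hypothetical image of 1404 x 2040 pixels: 21 rows x 31 columns of patches.
print(count_patches(1404, 2040))  # 651
```

An image smaller than the patch size in either dimension yields no patches at all.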

By extracting the aforementioned patches it is possible to define two tasks with a validation and a training set each:

* A proxy task, best suited for architecture search, that presents a x2 task with a reduced number of instances/patches.

* A full training task, which provides data pairs at three different scale factors.


It is worth noting that the total number of instances for each data task is different.

### Proxy task

This task is composed of two datasets. The first has 522,939 pairs of low- and high-resolution image patches for model training. The second contains 66,650 pairs used for validation. These sets only cover the x2 super-resolution problem, and no data augmentation is applied to the patches.

The objective of this proxy is to allow a reduced computational overhead while searching for architectures, in contrast to training over larger data sets and more tasks.

### Full task

This task extends the previous one by adding two more scale factors (x3 and x4) and a data augmentation procedure. In each set, patches are subjected to a horizontal flip, a vertical flip and a 90º rotation. This provides additional pattern information to models, so that richer features can be extracted during training.
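
The three transformations can be illustrated on a tiny patch. This is a minimal sketch in which plain nested lists stand in for image arrays; the function names are hypothetical, and the counter-clockwise rotation direction is an assumption.

```python
def flip_horizontal(patch):
    """Mirror left-right: reverse every row."""
    return [row[::-1] for row in patch]


def flip_vertical(patch):
    """Mirror top-bottom: reverse the order of the rows."""
    return [row[:] for row in patch[::-1]]


def rotate_90(patch):
    """Rotate by 90 degrees counter-clockwise (assumed direction)."""
    return [list(column) for column in zip(*patch)][::-1]


def augment(patch):
    """Return the original patch together with its three transformed copies."""
    return {
        "Original": [row[:] for row in patch],
        "Flipped_Horizontally": flip_horizontal(patch),
        "Flipped_Vertically": flip_vertical(patch),
        "Rotated_90": rotate_90(patch),
    }


patch = [[1, 2],
         [3, 4]]
variants = augment(patch)
print(variants["Flipped_Horizontally"])  # [[2, 1], [4, 3]]
print(variants["Flipped_Vertically"])    # [[3, 4], [1, 2]]
print(variants["Rotated_90"])            # [[2, 4], [1, 3]]
```

Each input patch therefore yields four variants, the original plus three augmented copies.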

Following this procedure results in about 2M image pairs for training and 266,600 pairs for validation, with architectures required to solve three super-resolution problems: x2, x3 and x4.
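
These totals are consistent with the proxy task counts, assuming that every unaugmented patch contributes the original plus three transformed copies. The variable names below are illustrative only.

```python
proxy_train_pairs = 522_939  # training pairs without augmentation
proxy_val_pairs = 66_650     # validation pairs without augmentation
variants_per_patch = 4       # original + horizontal flip + vertical flip + rotation

full_train_pairs = proxy_train_pairs * variants_per_patch
full_val_pairs = proxy_val_pairs * variants_per_patch

print(full_train_pairs)  # 2091756
print(full_val_pairs)    # 266600
```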

This task is directed at evaluating machine-crafted models exhaustively, allowing a proper contrast among different approaches. 


## Structure of this notebook.

To help the reader identify different elements and their application throughout this notebook, we structured it as follows:

* Library import. This section covers the installation and import of the libraries needed to run this notebook successfully.
* Data loading and pre-processing. Here image paths are transformed into image arrays to be used in later sections.
* Creation of directories. Here the directory structure for storing the images is defined. Images are split into subsets because the sheer number of files can cause trouble when importing and exporting the patches.
* Extraction of patches and downsampling. Here every image goes through the same operators, which sample multiple pieces of the image, apply the data augmentation described in the full task, and downsample according to each resolution.
* Data download. While the datasets will be made public, we include a routine for downloading the generated data sets.
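
The split of images into subsets mentioned in the list above can be sketched as a simple mapping from image index to folder label. It assumes subsets of 10 images labelled 10, 20, and so on, as in the directory-creation cells of this notebook; the function name `folder_for_image` is hypothetical.

```python
def folder_for_image(k: int, images_per_folder: int = 10) -> int:
    """Folder label for the zero-based image index k.

    Images 0-9 go to folder 10, images 10-19 to folder 20, and so on.
    """
    return (k // images_per_folder + 1) * images_per_folder


print([folder_for_image(k) for k in (0, 9, 10, 799)])  # [10, 10, 20, 800]
```

A closed-form mapping like this gives the same folders as incrementing a counter each time the image index reaches the current folder label.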


# Library import and install

## Connecting to Google Drive to access local copies of the DIV2K data sets

In [None]:
from google.colab import drive
drive.mount('/gdrive', force_remount=True)
%cd "/gdrive/MyDrive/Data_for_training/"

Mounted at /gdrive
/gdrive/MyDrive/Doctorate_Thesis_Coding/COMPSR-NET/Data_for_training


## Installing the Patchify and Pillow libraries to handle image processing

For this we simply install the necessary libraries with !pip install.

In [None]:
!pip install patchify

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting patchify
  Downloading patchify-0.2.3-py3-none-any.whl (6.6 kB)
Installing collected packages: patchify
Successfully installed patchify-0.2.3


In [None]:
!pip install Pillow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Import Libraries

We then import the necessary libraries used throughout the notebook.

In [None]:
import os
import glob
import copy

import os.path
from os import path

import time

import numpy as np

from math import floor

from matplotlib import pyplot as plt
from tqdm import tqdm

from IPython.display import display


from patchify import patchify, unpatchify
from PIL import Image

from zipfile import ZipFile, ZIP_DEFLATED
import pathlib

# Data Loading

Here we collect the images from the DIV2K data set as image paths.

We separate these paths into training (800) and validation (100) sets before processing.
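
A minimal sketch of this split on placeholder file names, assuming the training paths are listed before the validation paths. The function name `split_paths` and the file names are illustrative; the fraction is the one used in the next cell.

```python
from math import floor


def split_paths(paths, train_fraction=0.888889):
    """Split a list of paths into a leading training part and a trailing validation part."""
    cut = floor(len(paths) * train_fraction)
    return paths[:cut], paths[cut:]


# Placeholder names standing in for the 900 DIV2K image paths.
paths = [f"img_{n:04d}.png" for n in range(900)]
train, val = split_paths(paths)
print(len(train), len(val))  # 800 100
```

The fraction 0.888889 is slightly larger than 8/9, so the product with 900 lands just above 800 and `floor` returns exactly 800.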

In [None]:
train_val_split_perc = 0.888889
val_test_split_perc = 0.5

In [None]:
img_paths = []
val_img_paths = []
for dirname, _, filenames in os.walk('../DIV2K_train_HR'):
    for filename in filenames:
        img_paths.append(os.path.join(dirname, filename))
        
for dirname, _, filenames in os.walk('../DIV2K_valid_HR'):
    for filename in filenames:
        img_paths.append(os.path.join(dirname, filename))
        
print('Dataset dimension: ', len(img_paths))

val_img_paths = img_paths[floor(len(img_paths) * train_val_split_perc):]
img_paths = img_paths[:floor(len(img_paths) * train_val_split_perc)]
print('Training: ', len(img_paths))
print('Validation: ', len(val_img_paths))

Dataset dimension:  900
Training:  800
Validation:  100


# Data pre-processing

This section transforms the paths collected in the previous step into full image arrays, which simplifies working with the data.

## Going from local paths to full images

Here we make the images more easily accessible by loading them directly into image arrays.

In [None]:
# Arrays that cover the training set. Expect a list with 800 image arrays.
img_arrays = []
for img_path in img_paths:  # named img_path to avoid shadowing the imported `path` module
  im = Image.open(img_path)
  img_arrays.append(np.array(im))

In [None]:
print(len(img_arrays))

800


In [None]:
# Arrays that cover the validation set. Expect a list with 100 image arrays.
val_img_arrays = []
for img_path in val_img_paths:  # named img_path to avoid shadowing the imported `path` module
  im = Image.open(img_path)
  val_img_arrays.append(np.array(im))

In [None]:
print(len(val_img_arrays))

100


# Extracting patches from images and performing downsampling

First we determine the size of the patches. By default our protocol uses 64 x 64 patches, but a practitioner could use other sizes if needed.

In [None]:
#@title Patch size
size = 64 #@param {type:"number"}

Then we determine the resolution task we wish to generate. This can be x2, x3 or x4, as our protocol suggests. Nonetheless, other scale factors can be generated.

Note: selecting a value of 1 results in the HR patches being generated. 
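
As a sketch of how the selected value relates to the size of the generated patches, assuming the integer division used by the resize call in the extraction cells. The helper name `lr_patch_size` is hypothetical.

```python
def lr_patch_size(hr_size: int, scale: int) -> int:
    """Side length of the downsampled patch, using integer division."""
    return hr_size // scale


# Side lengths for a 64-pixel HR patch at each scale factor of the protocol.
sizes = {scale: lr_patch_size(64, scale) for scale in (1, 2, 3, 4)}
print(sizes)  # {1: 64, 2: 32, 3: 21, 4: 16}
```

Because 64 is not divisible by 3, the x3 patches come out at 21 x 21 pixels, which corresponds to 63 rather than 64 pixels when scaled back up.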

In [None]:
#@title Selection of resolution.
Task = 1 #@param {type:"slider", min:1, max:8, step:1}

## Remove processed dataset directories (only use if the directories with data must be removed)

The following commands quickly remove the directories in case an issue requires them to be generated again.

In [None]:
!rm -rf /content/Data_set_64/

In [None]:
!rm -rf /content/Data_set_val_64/

## Create the required directories

Here we create the directories needed to store the data according to the resolution and the augmentation technique employed.

Both the training and validation sets follow the same general structure:

```
Data_Set/
├─ Resolution/
│  ├─ First subset of images/
│  │  ├─ Original Patches/
│  │  │  ├─ patch_0_0_0.jpg
│  │  │  ├─ patch_0_0_1.jpg
│  │  │  ├─ ...
│  │  │  ├─ patch_#image_#row_#column.jpg
│  │  ├─ Horizontal Flip Patches/
│  │  ├─ Vertical Flip Patches/
│  │  ├─ Rotated 90º Patches/
│  ├─ Second subset of images/
│  │  ├─ Original patches/
│  │  ├─ .../
│  ├─ .../
│  ├─ N-th subset of images/
```
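
A compact way to build such a tree is sketched below. It runs inside a temporary directory rather than `/content`, and the helper name `build_tree` is hypothetical; the folder names follow the ones used in the cells of this notebook.

```python
import os
import tempfile

AUGMENTATIONS = ("Original", "Flipped_Horizontally", "Flipped_Vertically", "Rotated_90")


def build_tree(root, resolution, n_images, images_per_folder=10):
    """Create root/resolution/<subset>/<augmentation> and return the leaf paths."""
    leaves = []
    for subset in range(images_per_folder, n_images + 1, images_per_folder):
        for kind in AUGMENTATIONS:
            leaf = os.path.join(root, resolution, str(subset), kind)
            # makedirs creates missing parents; exist_ok avoids an error on reruns.
            os.makedirs(leaf, exist_ok=True)
            leaves.append(leaf)
    return leaves


with tempfile.TemporaryDirectory() as tmp:
    leaves = build_tree(tmp, "HR", n_images=100)
    n_leaves = len(leaves)
    all_exist = all(os.path.isdir(leaf) for leaf in leaves)

print(n_leaves, all_exist)  # 40 True
```

With 100 images in subsets of 10 there are 10 subset folders, each holding 4 augmentation folders.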




In [None]:
#Creating the directory for Training set.
import os.path
from os import path

# Resolution folder: 'HR' when Task is 1, otherwise 'x2', 'x3', ...
resolution = f'x{Task}' if Task > 1 else 'HR'
train_root = f'/content/Data_set_{size}/{resolution}'
print(train_root)

# Only build the tree when this resolution has not been created yet.
if not path.exists(train_root):
  # One folder per block of 10 images (10, 20, ..., 800), each with one
  # subfolder per augmentation. os.makedirs also creates missing parents.
  for i in range(10, 801, 10):
    for kind in ('Original', 'Flipped_Horizontally', 'Flipped_Vertically', 'Rotated_90'):
      os.makedirs(f'{train_root}/{i}/{kind}')

In [None]:
#Creating the directory for Validation set.
import os.path
from os import path

# Resolution folder: 'HR' when Task is 1, otherwise 'x2', 'x3', ...
resolution = f'x{Task}' if Task > 1 else 'HR'
val_root = f'/content/Data_set_val_{size}/{resolution}'
print(val_root)

# Only build the tree when this resolution has not been created yet.
if not path.exists(val_root):
  # One folder per block of 10 images (10, 20, ..., 100), each with one
  # subfolder per augmentation. os.makedirs also creates missing parents.
  for i in range(10, 101, 10):
    for kind in ('Original', 'Flipped_Horizontally', 'Flipped_Vertically', 'Rotated_90'):
      os.makedirs(f'{val_root}/{i}/{kind}')

In [None]:
# Folder labels for the training subsets: 10, 20, ..., 800.
folders = list(range(10, 801, 10))

In [None]:
# Folder labels for the validation subsets: 10, 20, ..., 100.
folders_val = list(range(10, 101, 10))

## Extracting patches and downsampling.

The following cells are the core of this notebook. They extract the patches, apply the augmentations and downsample the results, producing the files for each data set task.
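
To illustrate non-overlapping tiling without depending on an image library, here is a minimal sketch on a small nested list. The function `extract_patches` is hypothetical; the notebook itself relies on `patchify` for this step.

```python
def extract_patches(grid, size):
    """Split a 2D list into non-overlapping size x size tiles.

    Returns a dict mapping (row_index, column_index) to a tile. Rows and
    columns that do not fill a complete tile are dropped.
    """
    n_rows = len(grid) // size
    n_cols = len(grid[0]) // size
    patches = {}
    for i in range(n_rows):
        for j in range(n_cols):
            patches[(i, j)] = [
                row[j * size:(j + 1) * size]
                for row in grid[i * size:(i + 1) * size]
            ]
    return patches


# Toy 5 x 7 "image" whose pixel value encodes its row and column.
grid = [[10 * r + c for c in range(7)] for r in range(5)]
patches = extract_patches(grid, size=2)
print(len(patches))     # 6
print(patches[(1, 2)])  # [[24, 25], [34, 35]]
```

The (row, column) keys play the same role as the `i` and `j` indices that end up in the patch file names.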

In [None]:
a = 0
for k in tqdm(range(len(img_arrays))):
  one_image = img_arrays[k]
  #Extracting patches
  patches_img = patchify(one_image, (size,size,3), step = size)   
  for i in range(patches_img.shape[0]):
    for j in range(patches_img.shape[1]):
      single_patch_img = patches_img[i, j, 0, :, :, :]
      
      #From patch array to image
      patch = Image.fromarray(single_patch_img, 'RGB')
      #Patch flipped horizontally
      patch_FH = patch.transpose(Image.FLIP_LEFT_RIGHT)
      #Patch flipped vertically
      patch_FV = patch.transpose(Image.FLIP_TOP_BOTTOM)
      #Patch rotated 90º
      patch_R = patch.transpose(Image.ROTATE_90)


      #Downsampling the original patch to 1/Task of its size (1/2, 1/3 or 1/4 in our protocol).
      p_ds = patch.resize((patch.size[0]//Task, patch.size[1]//Task), Image.BICUBIC)
      #Flipping the downsampled patch horizontally.
      p_ds_fh = p_ds.transpose(Image.FLIP_LEFT_RIGHT)
      #Flipping the downsampled patch vertically.
      p_ds_fv = p_ds.transpose(Image.FLIP_TOP_BOTTOM)
      #Rotating the downsampled patch by 90º.
      p_ds_r = p_ds.transpose(Image.ROTATE_90)

      # Saving the generated images at different directories for each corresponding image kind.

      if Task == 1:
        if k<folders[a]: 
          patch.save(f"/content/Data_set_{size}/HR/{folders[a]}/Original/patch_{k}_{i}_{j}.jpg")
          patch_FH.save(f"/content/Data_set_{size}/HR/{folders[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          patch_FV.save(f"/content/Data_set_{size}/HR/{folders[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          patch_R.save(f"/content/Data_set_{size}/HR/{folders[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")
        
        else:
          #Changing folders.
          a+=1      
          patch.save(f"/content/Data_set_{size}/HR/{folders[a]}/Original/patch_{k}_{i}_{j}.jpg")
          patch_FH.save(f"/content/Data_set_{size}/HR/{folders[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          patch_FV.save(f"/content/Data_set_{size}/HR/{folders[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          patch_R.save(f"/content/Data_set_{size}/HR/{folders[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")
      
      else:
        if k<folders[a]: 
          
          p_ds.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Original/patch_{k}_{i}_{j}.jpg")
          p_ds_fh.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          p_ds_fv.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          p_ds_r.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

        else:

          #Changing folders.
          a+=1       
          
          p_ds.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Original/patch_{k}_{i}_{j}.jpg")
          p_ds_fh.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          p_ds_fv.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          p_ds_r.save(f"/content/Data_set_{size}/x{Task}/{folders[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

# Uncomment to see some samples of the patches forming the image.
#  rows = patches_img.shape[0]
#  columns =patches_img.shape[1]
#  fig = plt.figure()
#  counter = 1
#
#  if (k+1)%1 == 0:
#    for m in range(patches_img.shape[0]):
#      for n in range(patches_img.shape[1]):
#        fig.add_subplot(rows, columns, counter)
#        counter +=1
#        plt.imshow(patches_img[m, n, 0, :, :, :])
#        plt.axis('off')
#        
#    plt.show()
#    plt.close()

100%|██████████| 800/800 [12:19<00:00,  1.08it/s]


In [None]:
a = 0
for k in tqdm(range(len(val_img_arrays))):
  one_image = val_img_arrays[k]
  patches_img = patchify(one_image, (size,size,3), step = size)
  for i in range(patches_img.shape[0]):
    for j in range(patches_img.shape[1]):
      single_patch_img = patches_img[i, j, 0, :, :, :]

      patch = Image.fromarray(single_patch_img, 'RGB')
      patch_FH = patch.transpose(Image.FLIP_LEFT_RIGHT)
      patch_FV = patch.transpose(Image.FLIP_TOP_BOTTOM)
      patch_R = patch.transpose(Image.ROTATE_90)
      
      p_ds = patch.resize((patch.size[0]//Task, patch.size[1]//Task), Image.BICUBIC)
      p_ds_fh = p_ds.transpose(Image.FLIP_LEFT_RIGHT)
      p_ds_fv = p_ds.transpose(Image.FLIP_TOP_BOTTOM)
      p_ds_r = p_ds.transpose(Image.ROTATE_90)
      
      if Task == 1:

        if k<folders_val[a]:
          patch.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Original/patch_{k}_{i}_{j}.jpg")
          patch_FH.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          patch_FV.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          patch_R.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

        else:

          a+=1
          
          patch.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Original/patch_{k}_{i}_{j}.jpg")
          patch_FH.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          patch_FV.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          patch_R.save(f"/content/Data_set_val_{size}/HR/{folders_val[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

      else:
        if k<folders_val[a]:
          p_ds.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Original/patch_{k}_{i}_{j}.jpg")
          p_ds_fh.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          p_ds_fv.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          p_ds_r.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

        else:

          a+=1
          
          p_ds.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Original/patch_{k}_{i}_{j}.jpg")
          p_ds_fh.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Flipped_Horizontally/patch_{k}_{i}_{j}.jpg")
          p_ds_fv.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Flipped_Vertically/patch_{k}_{i}_{j}.jpg")
          p_ds_r.save(f"/content/Data_set_val_{size}/x{Task}/{folders_val[a]}/Rotated_90/patch_{k}_{i}_{j}.jpg")

# Uncomment to see some samples of the patches forming the image.
#  rows = patches_img.shape[0]
#  columns =patches_img.shape[1]
#  fig = plt.figure()
#  counter = 1
#
#  if (k+1)%1 == 0:
#    for m in range(patches_img.shape[0]):
#      for n in range(patches_img.shape[1]):
#        fig.add_subplot(rows, columns, counter)
#        counter +=1
#        plt.imshow(patches_img[m, n, 0, :, :, :])
#        plt.axis('off')
#    
#    plt.show()
#    plt.close()

100%|██████████| 100/100 [01:45<00:00,  1.06s/it]


## Testing for the correct number of instances in each dataset

Here we check that the datasets are complete by counting the HR files, across all four variants, in every subset folder.
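
A more general way to count files is to walk the whole tree instead of listing each folder by name. The sketch below does this on a small temporary tree; `count_files` and the placeholder file names are assumptions for illustration.

```python
import os
import tempfile


def count_files(root):
    """Count the files under root, including all of its subdirectories."""
    return sum(len(filenames) for _, _, filenames in os.walk(root))


with tempfile.TemporaryDirectory() as tmp:
    # Two augmentation folders with three empty placeholder files each.
    for kind in ("Original", "Rotated_90"):
        folder = os.path.join(tmp, "HR", "10", kind)
        os.makedirs(folder)
        for n in range(3):
            open(os.path.join(folder, f"patch_0_0_{n}.jpg"), "w").close()
    n_files = count_files(tmp)

print(n_files)  # 6
```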

In [None]:
import os
total = 0
for f in range(10, 801, 10):
  FOLDER_PATH1 = f'Data_set_{size}/HR/{f}/Original'
  FOLDER_PATH2 = f'Data_set_{size}/HR/{f}/Flipped_Horizontally'
  FOLDER_PATH3 = f'Data_set_{size}/HR/{f}/Flipped_Vertically'
  FOLDER_PATH4 = f'Data_set_{size}/HR/{f}/Rotated_90'
  ROOT_PATH = '/content'
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH1)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH2)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH3)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH4)))
print(total)

2091756


In [None]:
import os
total = 0
for f in range(10, 101, 10):
  FOLDER_PATH1 = f'Data_set_val_{size}/HR/{f}/Original'
  FOLDER_PATH2 = f'Data_set_val_{size}/HR/{f}/Flipped_Horizontally'
  FOLDER_PATH3 = f'Data_set_val_{size}/HR/{f}/Flipped_Vertically'
  FOLDER_PATH4 = f'Data_set_val_{size}/HR/{f}/Rotated_90'
  ROOT_PATH = '/content'
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH1)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH2)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH3)))
  total += len(os.listdir(os.path.join(ROOT_PATH, FOLDER_PATH4)))
print(total)

266600


# Compressing data sets into ZIP files

This is done so it is easier for users to download the resulting files.
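
The archiving step can be sketched on a temporary directory as follows. The helper `zip_directory` is hypothetical, and passing `ZIP_DEFLATED` is an assumption about the intended compression, since `ZipFile` only stores files uncompressed by default.

```python
import pathlib
import tempfile
from zipfile import ZipFile, ZIP_DEFLATED


def zip_directory(directory, archive_path):
    """Write every file under directory into archive_path using relative names."""
    directory = pathlib.Path(directory)
    with ZipFile(archive_path, "w", compression=ZIP_DEFLATED) as archive:
        for file_path in sorted(directory.rglob("*")):
            if file_path.is_file():
                arcname = file_path.relative_to(directory).as_posix()
                archive.write(file_path, arcname=arcname)


with tempfile.TemporaryDirectory() as tmp:
    # Minimal stand-in for one subset folder with a single placeholder file.
    source = pathlib.Path(tmp, "HR", "10", "Original")
    source.mkdir(parents=True)
    (source / "patch_0_0_0.jpg").write_bytes(b"placeholder")

    archive_path = pathlib.Path(tmp, "HR_train_10.zip")
    zip_directory(pathlib.Path(tmp, "HR", "10"), archive_path)
    with ZipFile(archive_path) as archive:
        names = archive.namelist()

print(names)  # ['Original/patch_0_0_0.jpg']
```

Storing names relative to the subset folder keeps the archive free of the absolute `/content/...` prefix.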

In [None]:
if path.exists(f'/content/Data_set_{size}_HR_train') == False:
  os.mkdir(f'/content/Data_set_{size}_HR_train')

In [None]:
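# Write one ZIP archive per subset folder (10, 20, ..., 800).
for i in range(10,801,10):
  directory = pathlib.Path(f"/content/Data_set_{size}/HR/{i}")
  
  # ZIP_DEFLATED enables compression; the default mode only stores the files.
  with ZipFile(f"/content/HR_train_{i}.zip", "w", compression=ZIP_DEFLATED) as archive:
    for file_path in directory.rglob("*"):
      archive.write(file_path, arcname=file_path.relative_to(directory))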


In [None]:
!zip -r /content/Data_set_64/HR_val /content/Data_set_val_64/HR