## Image preparation

The code below resizes every JPEG image in an input directory (Input_Dir)
to a fixed target size (e.g. 32x32 pixels) and writes the result to an output directory (Output_Dir).

### Note
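Duplicates are detected by hashing the decoded pixel data of each image (SHA-256), not the raw file bytes. For each group of identical images, every file except the first is deleted from the input directory.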
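
The duplicate check can be illustrated without any image files. The following is a minimal sketch, not the notebook's own code: the helper name `group_by_digest` and the sample byte strings are hypothetical, and plain bytes stand in for the decoded pixel data.

```python
import hashlib
from collections import defaultdict


def group_by_digest(contents):
    """Group names by the SHA-256 digest of their content.

    `contents` maps a name to a bytes object (a hypothetical stand-in
    for the decoded pixel data of an image).
    """
    groups = defaultdict(list)
    for name, data in contents.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return dict(groups)


# Hypothetical sample data: a.jpg and c.jpg carry identical content
samples = {
    "a.jpg": b"\x00\x01\x02",
    "b.jpg": b"\x03\x04\x05",
    "c.jpg": b"\x00\x01\x02",
}
duplicate_groups = [
    names for names in group_by_digest(samples).values() if len(names) > 1
]
print(duplicate_groups)
```

Only groups with more than one member are treated as duplicates; the first name in each group is the one that would be kept.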

### Basic parameters

In [1]:
# Parameters
Input_Dir = 'data_raw_all'
Output_Dir = 'data_resize_all'

# Target image size
target_size_x = 32
target_size_y = 32


### Load libraries and defaults

In [3]:
import glob
import os
from pathlib import Path

from PIL import Image


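### Empty the output directory

In [4]:
# Create the output directory if it is missing, then delete old JPEGs from it
os.makedirs(Output_Dir, exist_ok=True)
files = glob.glob(os.path.join(Output_Dir, '*.jpg'))
for f in files:
    os.remove(f)
print(str(len(files)) + " files have been deleted.")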

0 files have been deleted.
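
A pathlib-based variant of this cleanup step might look as follows. This is a sketch under assumptions: `clear_jpegs` is a hypothetical helper name, and the demonstration runs in a temporary directory rather than the real output directory.

```python
import tempfile
from pathlib import Path


def clear_jpegs(directory):
    """Create `directory` if needed and delete the *.jpg files directly inside it.

    Returns the number of deleted files. Other files are left untouched.
    """
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    removed = 0
    # Materialize the listing first so files are not deleted while iterating
    for jpg in list(path.glob("*.jpg")):
        jpg.unlink()
        removed += 1
    return removed


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data_resize_all"
    first_run = clear_jpegs(out_dir)        # directory is created, nothing to delete
    (out_dir / "old.jpg").write_bytes(b"")  # placeholder file, not a real image
    (out_dir / "notes.txt").write_text("keep me")
    second_run = clear_jpegs(out_dir)
    remaining = sorted(p.name for p in out_dir.iterdir())

print(first_run, second_run, remaining)
```

Creating the directory up front also avoids a failure in the later save step when the output directory does not exist yet.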


### Load files and resize

In [5]:
import hashlib

files = glob.glob(Input_Dir + '/*.jpg')
hashes = {}  # maps pixel-data hash -> list of input files with that content
for i, aktfile in enumerate(files):
    # Progress report every 500 files
    if i % 500 == 0:
        print(i, aktfile)
    with Image.open(aktfile) as test_image:
        # Hash the decoded pixel data to detect duplicate images
        digest = hashlib.sha256(test_image.tobytes()).hexdigest()
        hashes.setdefault(digest, []).append(aktfile)
        resized = test_image.resize((target_size_x, target_size_y), Image.LANCZOS)
    base = os.path.basename(aktfile)
    save_name = Output_Dir + '/' + base
    resized.save(save_name, "JPEG", quality=100)

0 data_raw_all/5.6_1f48d6bd3fc40354b9253b4352c4c554.jpg
500 data_raw_all/6.6_32f8960168ad68dd0ce3eb6e9365fa1b.jpg
1000 data_raw_all/3.8_1884_zeiger4_2019-06-04T110009.jpg
1500 data_raw_all/0.0_e0941e6e3c5bf5a31d46bc38cb65648f.jpg

### Remove duplicate files

In [6]:
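# Duplicate files are a risk to the metrics: they pollute the validation dataset
for digest, duplicates in hashes.items():
    if len(duplicates) > 1:
        print(duplicates)
        # Keep the first file, remove the rest and their resized copies
        for duplicate in duplicates[1:]:
            os.remove(duplicate)
            resized_copy = os.path.join(Output_Dir, os.path.basename(duplicate))
            if os.path.exists(resized_copy):
                os.remove(resized_copy)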
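
Because this step deletes files irreversibly, it can help to list the candidates first. The sketch below is a dry run under the assumption that `hashes` has the structure built in the previous cell (a hash mapped to a list of file names); `files_to_remove` is a hypothetical helper and the sample entries are made up.

```python
def files_to_remove(hashes):
    """Return every file except the first from each group of duplicates."""
    doomed = []
    for names in hashes.values():
        # names[1:] is empty for groups with a single file
        doomed.extend(names[1:])
    return doomed


# Made-up example with the same shape as `hashes`
example_hashes = {
    "digest_1": ["data_raw_all/a.jpg", "data_raw_all/a_copy.jpg"],
    "digest_2": ["data_raw_all/b.jpg"],
}
candidates = files_to_remove(example_hashes)
print(candidates)
```

Reviewing the printed list before calling `os.remove` gives a chance to catch unexpected matches.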