# Image preparation

The original images are 55x90 pixels in size with 3 color channels (RGB).
The code below resizes the images in an input directory (Input_dir) to the target size (32x32 pixels, set by target_size_x and target_size_y) and saves them to an output directory (Output_dir). The images are read directly from the input directory. Judging by the file names in the output further below, each name starts with its label (for example 0.7_...).
Any other image converter can be used as well.

### Prerequisite
The Pillow library must be installed in Python (package pillow, imported as PIL).

In [1]:
import glob
import os
from PIL import Image 

Input_dir = 'data_raw_all'       # directory with the original images
Output_dir = 'data_resize_all'   # directory for the resized images

# target size in pixels
target_size_x = 32
target_size_y = 32

In [2]:
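# make sure the output directory exists, then delete the .jpg files left over from a previous run
os.makedirs(Output_dir, exist_ok=True)
files = glob.glob(Output_dir + '/*.jpg')
for f in files:
    os.remove(f)
print(str(len(files)) + " files have been deleted.")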

7684 files have been deleted.
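The cleanup step above can also be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `reset_output_dir` and the throwaway directory are assumptions made for illustration. It creates the directory if it is missing and deletes only the files matching the pattern.

```python
import tempfile
from pathlib import Path


def reset_output_dir(directory, pattern="*.jpg"):
    """Create `directory` if needed and delete the files matching `pattern`.

    Returns the number of deleted files. Subdirectories are left untouched.
    (Hypothetical helper, not part of the original notebook.)
    """
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    deleted = 0
    for path in out.glob(pattern):
        if path.is_file():
            path.unlink()
            deleted += 1
    return deleted


# Demonstration on a throwaway directory instead of the real data_resize_all.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data_resize_all"
    print(reset_output_dir(target))            # directory did not exist yet
    (target / "a.jpg").write_bytes(b"x")
    (target / "b.jpg").write_bytes(b"y")
    (target / "notes.txt").write_text("keep me")
    print(reset_output_dir(target))            # only the two .jpg files match
    print(sorted(p.name for p in target.iterdir()))
```

Returning the count keeps the "files have been deleted" message of the cell above easy to reproduce.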


In [3]:
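import hashlib

files = glob.glob(Input_dir + '/*.jpg')
hashes = {}  # maps the SHA-256 digest of the pixel data to the files that share it
for i, aktfile in enumerate(files):
    if i % 500 == 0:
        print(i, aktfile)  # progress output
    test_image = Image.open(aktfile)
    # hash the decoded pixel data so identical images are found even under different file names
    digest = hashlib.sha256(test_image.tobytes()).hexdigest()
    hashes.setdefault(digest, []).append(aktfile)
    # Image.Resampling.NEAREST (Pillow >= 9.1) avoids the deprecation warning
    # that some Pillow 9.x releases emit for Image.NEAREST
    test_image = test_image.resize((target_size_x, target_size_y), Image.Resampling.NEAREST)
    base = os.path.basename(aktfile)
    save_name = os.path.join(Output_dir, base)
    test_image.save(save_name, "JPEG")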

0 data_raw_all\0.0_0.0.jpg
500 data_raw_all\0.7_0273_zeiger1_2020-04-29_12-21-02.jpg
1000 data_raw_all\1.3_0565_zeiger4_2020-04-29_11-44-02.jpg
1500 data_raw_all\1.9_0886_zeiger3_2019-09-14_21-00-12.jpg
2000 data_raw_all\2.5_2f0b6991fad6e5308dd6bf5fffa0bede.jpg
2500 data_raw_all\3.2_1532_zeiger1_2020-04-29_14-02-02.jpg
3000 data_raw_all\3.8_93d5002d7c0c697d8437dc6db7d9b995.jpg
3500 data_raw_all\4.4_pointer_20211008-080205.jpg
4000 data_raw_all\5.1_021a729de1f0b2df4a2f3dc792e0e806.jpg
4500 data_raw_all\5.7_e50495abc5bf227f64168bf61451db77.jpg
5000 data_raw_all\6.4_3110_zeiger1_2019-09-14_21-20-13.jpg
5500 data_raw_all\7.0_3597_zeiger3_2019-11-19_16-52-03.jpg
6000 data_raw_all\7.8_026949f0099cb7a3970423ec929b3916.jpg
6500 data_raw_all\8.5_19ff6680c7934b07ba3230441521a6d3.jpg
7000 data_raw_all\9.1_4532_zeiger2_2019-06-02T183009.jpg
7500 data_raw_all\9.7_4892_zeigeru__2019-06-05T09000.jpg
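The file names in the listing above appear to begin with a numeric label followed by an underscore (for example 0.7_0273_zeiger1_...). The notebook does not document this pattern, so the following is only a sketch under that assumption. The helper name `parse_label` is hypothetical.

```python
import ntpath


def parse_label(path):
    """Return the numeric label encoded at the start of a file name.

    Assumes names of the form '<label>_<rest>.jpg'; this pattern is inferred
    from the listing above, not documented by the notebook.
    """
    # ntpath.basename splits on both '\\' and '/', so Windows-style paths
    # like the ones in the listing are handled on any platform.
    name = ntpath.basename(path)
    prefix = name.split("_", 1)[0]
    return float(prefix)


examples = [
    r"data_raw_all\0.7_0273_zeiger1_2020-04-29_12-21-02.jpg",
    r"data_raw_all\4.4_pointer_20211008-080205.jpg",
    "data_raw_all/9.7_4892_zeigeru__2019-06-05T09000.jpg",
]
for path in examples:
    print(parse_label(path), ntpath.basename(path))
```

A name without a numeric prefix makes `float` raise a `ValueError`, which is a convenient way to spot files that do not follow the assumed pattern.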


# Removing duplicate files

In [4]:
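# duplicate files are a risk to the metrics: the same image can end up in both the training and the validation dataset
for digest, paths in hashes.items():
    if len(paths) > 1:
        print(paths)
        # keep the first file and remove all other copies
        for duplicate in paths[1:]:
            os.remove(duplicate)
            # also remove the resized copy that was already written to the output directory
            resized = os.path.join(Output_dir, os.path.basename(duplicate))
            if os.path.exists(resized):
                os.remove(resized)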
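Because the cell above deletes files irreversibly, it can help to list the candidates first. The following is a minimal dry-run sketch, assuming the same idea of grouping by SHA-256 digest. The helper names `group_by_digest` and `duplicates_to_remove` and the byte strings standing in for pixel data are illustrative, not part of the original notebook.

```python
import hashlib
import os
from collections import defaultdict


def group_by_digest(items):
    """Group names by the SHA-256 digest of their content.

    `items` is an iterable of (name, content_bytes) pairs.
    """
    groups = defaultdict(list)
    for name, content in items:
        groups[hashlib.sha256(content).hexdigest()].append(name)
    return groups


def duplicates_to_remove(groups, output_dir=None):
    """Return every path except the first one of each group.

    If `output_dir` is given, the path of the resized copy with the same
    base name is listed as well.
    """
    doomed = []
    for paths in groups.values():
        for duplicate in paths[1:]:
            doomed.append(duplicate)
            if output_dir is not None:
                doomed.append(os.path.join(output_dir, os.path.basename(duplicate)))
    return doomed


# Stand-in data: two of the three "images" have identical content.
sample = [
    ("data_raw_all/a.jpg", b"\x00\x01\x02"),
    ("data_raw_all/b.jpg", b"\x00\x01\x02"),
    ("data_raw_all/c.jpg", b"\xff\xff\xff"),
]
groups = group_by_digest(sample)
for path in duplicates_to_remove(groups, output_dir="data_resize_all"):
    print("would remove:", path)
```

Once the printed list looks right, the same paths can be passed to os.remove.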