# Download and Preprocess Images
In this notebook, we download the images from an S3 bucket, fix their orientation by removing the EXIF orientation tag, and upload them to a second S3 bucket for training. We also split the training data into a training and a validation set.

**We do NOT need a GPU for this notebook**

In [2]:
# Install exif package
!pip install exif

Collecting exif
  Downloading exif-1.2.2-py3-none-any.whl (29 kB)
Collecting plum-py==0.3.1
  Downloading plum_py-0.3.1-py3-none-any.whl (69 kB)
     |████████████████████████████████| 69 kB 12.7 MB/s            
Installing collected packages: plum-py, exif
Successfully installed exif-1.2.2 plum-py-0.3.1


In [3]:
import numpy as np
import pandas as pd  # Home of the DataFrame construct, _the_ most important object for Data Science
import sys  # Python system library needed to load custom functions
import glob
import exif
import PIL
from PIL import Image

In [4]:
sys.path.append('../src')

In [5]:
from config import DEFAULT_BUCKET, ORIGINAL_BUCKET
from gdsc_util import download_directory, download_file, upload_file, load_sections_df, set_up_logging, PROJECT_DIR
set_up_logging()

In [6]:
download_directory('jpgs/', local_dir='data', bucket=ORIGINAL_BUCKET)  # Download the JPG images into our data folder
download_file('gdsc_train.csv', local_dir='data', bucket=ORIGINAL_BUCKET)  # Download the list of worm sections
download_file('test_files.csv', local_dir='data', bucket=ORIGINAL_BUCKET)  # Download the list of files for which we need to create predictions

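Some of the images were rotated after they were created and labelled. As it turns out, the rotation is recorded only in the EXIF orientation tag; the stored pixel data was not rotated. Deleting the tag therefore makes every image load in the orientation it had when it was labelled.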
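To make the difference between stored pixels and displayed orientation concrete, here is a minimal sketch using Pillow only. It builds a small synthetic JPEG in memory (the image, its size and the chosen orientation value are illustrative assumptions, not taken from the dataset), tags it with EXIF orientation 6, and compares the stored size with the size an EXIF-aware viewer would show. Note that `ImageOps.exif_transpose` is used here only to illustrate what the tag means; this notebook deletes the tag instead of applying it.

```python
import io

from PIL import Image, ImageOps

ORIENTATION_TAG = 0x0112  # Standard EXIF tag id for "Orientation"


def make_tagged_jpeg(orientation: int) -> bytes:
    """Create a 40x20 synthetic JPEG carrying the given EXIF orientation value."""
    img = Image.new("RGB", (40, 20), "white")
    exif_data = Image.Exif()
    exif_data[ORIENTATION_TAG] = orientation
    buf = io.BytesIO()
    img.save(buf, format="JPEG", exif=exif_data.tobytes())
    return buf.getvalue()


raw = make_tagged_jpeg(6)  # 6 means "rotate by 90 degrees for display"
with Image.open(io.BytesIO(raw)) as img:
    tag = img.getexif().get(ORIENTATION_TAG)
    stored_size = img.size  # Pixel layout as stored in the file
    displayed_size = ImageOps.exif_transpose(img).size  # Layout after applying the tag

print(tag, stored_size, displayed_size)
```

The stored size stays at 40x20 while the transposed copy is 20x40, which is the mismatch that labels created before the rotation would run into.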

In [7]:
# Fix the orientation of the images by deleting the EXIF orientation tag
img_paths = glob.glob('../data/jpgs/*.jpg')
for img_path in img_paths:
    with PIL.Image.open(img_path) as img:  # Close the file handle before rewriting the file
        has_exif = bool(img.getexif())
    if not has_exif:  # No EXIF data at all
        continue
    # Load the image together with its EXIF data
    with open(img_path, 'rb') as f:
        img_exif = exif.Image(f)
    # Delete the orientation tag and overwrite the image in place
    if 'orientation' in dir(img_exif):
        print(img_path)
        img_exif.delete('orientation')
        with open(img_path, 'wb') as f:
            f.write(img_exif.get_file())

In [8]:
# Upload the corrected images to a second S3 bucket (DEFAULT_BUCKET), which we will use for training
img_paths = glob.glob('../data/jpgs/*.jpg')
for local_path in img_paths:
    s3_path = 'jpgsnew/' + local_path.split('/')[-1]
    upload_file(local_path, s3_path, DEFAULT_BUCKET)
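The upload loop above derives the S3 key by splitting the local path on `/`, which works for the paths returned by `glob` on Linux. As a hedged alternative, the sketch below extracts the file name with `pathlib` so that both `/` and `\` separators are handled. The helper `to_s3_key` and the file name `img_001.jpg` are hypothetical and only serve as an illustration; they are not part of `gdsc_util` or the dataset.

```python
from pathlib import PureWindowsPath


def to_s3_key(local_path: str, prefix: str = "jpgsnew/") -> str:
    """Build an S3 object key from a local path (hypothetical helper)."""
    # PureWindowsPath treats both '/' and '\\' as separators and never touches the disk.
    # Caveat: a POSIX file name that itself contains a backslash would be cut short.
    file_name = PureWindowsPath(local_path).name
    return prefix + file_name


print(to_s3_key("../data/jpgs/img_001.jpg"))
```

The result could then be passed to `upload_file(local_path, to_s3_key(local_path), DEFAULT_BUCKET)` in place of the manual string split.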

In [9]:
# Upload the train and test csv files to DEFAULT_BUCKET
upload_file('../data/gdsc_train.csv', 'gdsc_train.csv', DEFAULT_BUCKET)
upload_file('../data/test_files.csv', 'test_files.csv', DEFAULT_BUCKET)

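We now load gdsc_train.csv and split it into a training and a validation set, using roughly 90% of the data for training and the remaining 10% for validation. The split is made per stain and per file: within each stain, every file is assigned to the validation set with a probability of 10%, so all sections of a file end up in the same set and the validation share is only approximately 10%. Finally, we save both sets in the src folder.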

In [10]:
data = load_sections_df('../data/gdsc_train.csv')
train = []
val = []
val_percent = 0.1
for stain in data['staining'].unique():
    df = data[data['staining']==stain]
    np.random.seed(0)  # Re-seed for every stain so the split is reproducible
    filenames = df['file_name'].unique()
    # Draw one uniform random number per file; files below the threshold go to validation
    random_list = np.random.rand(len(filenames))
    val_filenames = filenames[random_list<val_percent]
    print(f'Stain : {stain}, Number of files for validation : {len(val_filenames)}')
    train_filenames = filenames[random_list>=val_percent]
    # Select all sections that belong to the chosen files
    val.append(df[df['file_name'].isin(val_filenames)])
    train.append(df[df['file_name'].isin(train_filenames)])
train_df = pd.concat(train, ignore_index=True)
val_df = pd.concat(val, ignore_index=True)
# Store both sets as semicolon-separated CSV files in the src folder
train_df.to_csv(f'../src/gdsc_train_dataset_{int((1-val_percent)*100)}.csv', sep=';', index=False)
val_df.to_csv(f'../src/gdsc_val_dataset_{int(val_percent*100)}.csv', sep=';', index=False)

Stain : D, Number of files for validation : 21
Stain : C, Number of files for validation : 8
Stain : B, Number of files for validation : 24
Stain : A, Number of files for validation : 42
Stain : DD, Number of files for validation : 3
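
Because the split is random per file, it can be worth checking that no file ended up in both sets and what share of files each stain actually contributes to validation. The following is a minimal sketch of such a check. The function `summarize_split` and the toy frames are hypothetical; only the column names `staining` and `file_name` are taken from the cell above.

```python
import pandas as pd


def summarize_split(train_df, val_df, group_col="staining", file_col="file_name"):
    """Return the files present in both sets and the validation share of files per group."""
    leaked = set(train_df[file_col]) & set(val_df[file_col])
    n_train = train_df.groupby(group_col)[file_col].nunique()
    n_val = val_df.groupby(group_col)[file_col].nunique()
    # Align both counts on the union of groups so a group missing from one set counts as 0
    groups = n_train.index.union(n_val.index)
    n_train = n_train.reindex(groups, fill_value=0)
    n_val = n_val.reindex(groups, fill_value=0)
    share = n_val / (n_train + n_val)
    return leaked, share


# Toy data for illustration only (not the real dataset)
toy_train = pd.DataFrame({
    "staining": ["A", "A", "A", "A", "B"],
    "file_name": ["a1", "a1", "a2", "a3", "b1"],
})
toy_val = pd.DataFrame({
    "staining": ["A", "B"],
    "file_name": ["a4", "b2"],
})

leaked, share = summarize_split(toy_train, toy_val)
print(leaked)
print(share)
```

On the real data, the call would be `summarize_split(train_df, val_df)`; an empty `leaked` set indicates that the file-level split held.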
