
# Preprocessing image data for transfer learning

This notebook loads images of 10 object classes, pads and resizes them to 224×224 pixels, and saves the labelled data for use in transfer learning.


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

import random # for shuffling data

# Image tools from scikit image
from skimage import transform, color

# import cv2
import os
import pickle

# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# import image from keras preprocessing
import tensorflow as tf
from tensorflow.keras.preprocessing import image

# # tensorflow keras applications resnet50
# from tensorflow.keras.applications.resnet50 import preprocess_input, ResNet50, decode_predictions

## Reading in data for transfer learning

Data reshaping with adapted code from this [video tutorial](https://www.youtube.com/watch?v=j-3vuBynnOE).
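The loading cell that follows pads each image to a square before resizing it to 224×224, which keeps the aspect ratio of the object intact. As a rough sketch of the padding arithmetic only (the helper name `square_padding` is hypothetical, and the even split with any odd pixel going to the bottom or right edge is an assumption about how the padding is distributed):

```python
def square_padding(height, width):
    """Return (top, bottom, left, right) padding that makes an image square.

    Assumption: padding is split as evenly as possible and any odd pixel
    goes to the bottom or right edge.
    """
    side = max(height, width)
    pad_h = side - height
    pad_w = side - width
    top, left = pad_h // 2, pad_w // 2
    return top, pad_h - top, left, pad_w - left


# A hypothetical 300x500 (height x width) image needs 100 rows above and below.
print(square_padding(300, 500))
# A square image needs no padding at all.
print(square_padding(224, 224))
```

Padding rather than stretching means a wide image is not distorted when it is later shrunk to the square input size.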

In [2]:
# Get the object labels
datadir = '/kaggle/input/object-detection'
categories = [f for f in os.listdir(datadir) if os.path.isdir(os.path.join(datadir, f))]

# print(categories)
# print(len(categories))

# list to store labelled image data
all_data = [] 
# dictionary mapping numerical labels to object classes
label_dict = {}

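def get_data():
    """Load every image, pad it to a square, resize it to 224x224 and label it."""
    for class_num, cat in enumerate(categories):
        # images are stored in <datadir>/<class name>/<class name>
        path = os.path.join(datadir, cat, cat)
        label_dict[class_num] = cat
        for img in os.listdir(path):
            try:
                im = image.load_img(os.path.join(path, img))
                im_shape = np.shape(im)  # (height, width, channels)
                # pad the shorter side so that the image becomes square
                side = max(im_shape)
                im_square = tf.image.resize_with_crop_or_pad(im, target_height=side, target_width=side)
                im_small = transform.resize(im_square, (224, 224), anti_aliasing=False)
                all_data.append([im_small, class_num])
            except Exception:
                # skip files that cannot be loaded or resized as images
                pass
get_data()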

  "Palette images with Transparency expressed in bytes should be "


In [3]:
# Check the number of images in each class
for cat in categories:
    num = sum([1 for im, c in all_data if c== categories.index(cat)])
    print(f"{cat}................{num} images")

USB stick................95 images
satellite dish device................95 images
magnifying glass................95 images
laptop................95 images
computer mouse................95 images
keyboard I................26 images
keys objects................95 images
server rack................95 images
phone................95 images
keyboard II................69 images
router................95 images


Nine classes have 95 images each. The keyboard class is split across two folders, keyboard I (26 images) and keyboard II (69 images), which together contain 95 images of the same target object (keyboard).
<br>
Create a dictionary mapping numerical class labels to class names.

In [4]:
label_dict = dict(enumerate(categories))
label_dict

{0: 'USB stick',
 1: 'satellite dish device',
 2: 'magnifying glass',
 3: 'laptop',
 4: 'computer mouse',
 5: 'keyboard I',
 6: 'keys objects',
 7: 'server rack',
 8: 'phone',
 9: 'keyboard II',
 10: 'router'}

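We'll relabel "keyboard I" and "keyboard II" as "keyboard" in the dictionary, move "router" from label 10 to label 9 so that the labels stay contiguous, and apply the same relabelling to the samples in all_data (stored as data).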

In [5]:
label_dict[5] = 'keyboard'
label_dict[9] = 'router'
del label_dict[10]
label_dict

{0: 'USB stick',
 1: 'satellite dish device',
 2: 'magnifying glass',
 3: 'laptop',
 4: 'computer mouse',
 5: 'keyboard',
 6: 'keys objects',
 7: 'server rack',
 8: 'phone',
 9: 'router'}

In [6]:
# data labelled 9 will get label 5
# data labelled 10 will get label 9
data = [[img,5] if label==9 else [img,label] for [img, label] in all_data]
data = [[img,9] if label==10 else [img, label] for [img, label] in data]
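The hard-coded indices above depend on the order in which `os.listdir` returned the folders, and that order is not guaranteed. A minimal sketch of deriving the same relabelling from class names instead is shown here; `merge_classes` and `merge_map` are hypothetical names, not part of the original notebook:

```python
def merge_classes(categories, merge_map):
    """Map old class indices to new ones after merging classes by name.

    Returns (old_to_new, new_label_dict). Classes keep their original
    relative order; merged classes take the position of their first member.
    """
    merged_names = []
    old_to_new = {}
    for old_index, name in enumerate(categories):
        target = merge_map.get(name, name)  # unmerged classes keep their name
        if target not in merged_names:
            merged_names.append(target)
        old_to_new[old_index] = merged_names.index(target)
    return old_to_new, dict(enumerate(merged_names))


# folder names in the order printed earlier in this notebook
categories = ["USB stick", "satellite dish device", "magnifying glass",
              "laptop", "computer mouse", "keyboard I", "keys objects",
              "server rack", "phone", "keyboard II", "router"]
merge_map = {"keyboard I": "keyboard", "keyboard II": "keyboard"}

old_to_new, new_label_dict = merge_classes(categories, merge_map)
print(old_to_new)
print(new_label_dict)
```

With such a mapping the samples could be relabelled in one step, for example `[[img, old_to_new[label]] for img, label in all_data]`.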

In [7]:
# Check the number of images in each class
for label, cat in label_dict.items():
    num = sum([1 for im, c in data if c== int(label)])
    print(f"{label}..{cat}................{num} images")

0..USB stick................95 images
1..satellite dish device................95 images
2..magnifying glass................95 images
3..laptop................95 images
4..computer mouse................95 images
5..keyboard................95 images
6..keys objects................95 images
7..server rack................95 images
8..phone................95 images
9..router................95 images


Now there are 95 images in each of 10 classes.

Next, we'll shuffle the data so that the classes are interleaved. The split into training and validation sets is not performed in this notebook.

In [8]:
# Shuffle the labelled samples in place
random.shuffle(data)

# for sample in data[:10]:
#     print(sample[1])
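The cell above only shuffles the samples. One possible way to produce training and validation sets is a stratified split, which keeps the class balance the same in both sets. The sketch below is an assumption rather than this notebook's method: the function name `stratified_split`, the 20% validation fraction and the fixed seed are all illustrative choices.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=0):
    """Split [features, label] samples so each class keeps the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_val = int(round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])

    # mix the classes again after the per-class split
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


# toy data: 3 classes with 10 placeholder samples each
toy_data = [[f"img_{label}_{i}", label] for label in range(3) for i in range(10)]
toy_train, toy_val = stratified_split(toy_data)
print(len(toy_train), len(toy_val))
```

Because every class here has the same number of images, a stratified split gives validation sets with an equal number of samples per class.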

In [9]:
# Create the feature array X and the label list y
X = []
y = []
for features, label in data:
    X.append(features)
    y.append(label)
X = np.array(X).reshape(-1,224,224,3)

In [10]:
# Save data
with open("/kaggle/working/X.pickle", "wb") as pickle_out:
    pickle.dump(X, pickle_out)

with open("/kaggle/working/y.pickle", "wb") as pickle_out:
    pickle.dump(y, pickle_out)

In [11]:
# Export the label dictionary
with open("/kaggle/working/label_dict.pickle", "wb") as pickle_out:
    pickle.dump(label_dict, pickle_out)

To load the data in a future notebook, use:

In [12]:
with open("/kaggle/working/X.pickle", "rb") as pickle_in:
    X = pickle.load(pickle_in)

with open("/kaggle/working/y.pickle", "rb") as pickle_in:
    y = pickle.load(pickle_in)

with open("/kaggle/working/label_dict.pickle", "rb") as pickle_in:
    label_dict = pickle.load(pickle_in)
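Before relying on the saved files it can be worth checking that they round-trip. A minimal sketch using a temporary directory instead of `/kaggle/working` (the helper names `save_pickle` and `load_pickle` and the toy dictionary are hypothetical):

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path; the with-block closes the file even on error."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)


toy_label_dict = {0: "USB stick", 5: "keyboard", 9: "router"}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "label_dict.pickle")
    save_pickle(toy_label_dict, path)
    restored = load_pickle(path)

print(restored == toy_label_dict)
```

Note that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.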

## REFERENCES

1. [Loading in your own data](https://www.youtube.com/watch?v=j-3vuBynnOE)
