# Your Second Image Classifier: Using CNN to Classify Images
# Pre-processing

The goal is to classify each image as containing a dog, a cat, or a panda.
With only 3,000 images, the Animals dataset is another **introductory** dataset
on which we can quickly train a CNN model.

Let's take the following steps:

1. Fetch Data (reused from the previous project)
2. Pre-processing
3. Clean data

<center><img width="900" src="https://drive.google.com/uc?export=view&id=1haMB_Zt6Et9q9sPHxfuR4g3FT5QRXlTI"></center>


## Step 01: Setup

Start out by installing the experiment tracking library and setting up your free W&B account:


*   **pip install wandb** – Install the W&B library
*   **import wandb** – Import the wandb library
*   **wandb login** – Log in to your W&B account so you can log all your metrics in one place

In [1]:
!pip install wandb -qU


### Import Packages

In [2]:
# import the necessary packages
from imutils import paths
import logging
import os
import cv2
import numpy as np
import joblib
import tensorflow as tf
import wandb

In [3]:
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
# configure logging
# reference for a logging obj
logger = logging.getLogger()

# set level of logging
logger.setLevel(logging.INFO)

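# create a console handler with a timestamped message format
c_handler = logging.StreamHandler()
c_format = logging.Formatter(fmt="%(asctime)s %(message)s", datefmt="%d-%m-%Y %H:%M:%S")
c_handler.setFormatter(c_format)

# replace the first existing handler (as in Colab) or add one if the
# root logger has none, which would otherwise raise an IndexError
if logger.handlers:
    logger.handlers[0] = c_handler
else:
    logger.addHandler(c_handler)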
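
To see what this format produces without touching the root logger, here is a minimal sketch that writes one record to an in-memory stream. The logger name `format_demo` and the example message are assumptions made only for illustration.

```python
import io
import logging

# hypothetical logger name and in-memory stream, used only for this sketch
stream = io.StringIO()
demo_logger = logging.getLogger("format_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo record out of the root logger

# same format strings as the notebook cell above
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(fmt="%(asctime)s %(message)s", datefmt="%d-%m-%Y %H:%M:%S")
)
demo_logger.addHandler(handler)

demo_logger.info("Path: ./artifacts/example")
line = stream.getvalue().strip()
print(line)  # day-month-year, time, then the message
```

Each record starts with a `dd-mm-yyyy hh:mm:ss` timestamp followed by the message, which is the layout of the log lines shown in the cell outputs of this notebook.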

## Step 02: Fetch Data

In [5]:
# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
  "project_name": "first_image_classifier",
  "artifact_name": "animals_raw_data:latest",
}

In [6]:
# open the W&B project created in the Fetch step
run = wandb.init(entity="ivanovitch-silva",project=args["project_name"], job_type="preprocessing")

# download the raw data from W&B
raw_data = run.use_artifact(args["artifact_name"])
data_dir = raw_data.download()
logger.info("Path: {}".format(data_dir))

[34m[1mwandb[0m: Currently logged in as: [33mivanovitch-silva[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Downloading large artifact animals_raw_data:latest, 187.97MB. 3000 files... 
[34m[1mwandb[0m:   3000 of 3000 files downloaded.  
Done. 0:0:25.1
24-10-2022 23:50:58 Path: ./artifacts/animals_raw_data:v0
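
Before cleaning, it can help to check how many images each class folder holds. The sketch below is one way to do that with the standard library only. The helper `count_images_per_class`, the extension list, and the folder names in the demo are assumptions; the source only shows a `dogs` folder.

```python
import os
import tempfile
from collections import Counter


def count_images_per_class(root_dir, extensions=(".jpg", ".jpeg", ".png")):
    """Count image files under root_dir, keyed by the parent folder name."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if name.lower().endswith(extensions):
                counts[os.path.basename(dirpath)] += 1
    return counts


# demo on a throwaway directory with hypothetical class folders
with tempfile.TemporaryDirectory() as tmp:
    for cls, n in (("cats", 2), ("dogs", 3), ("panda", 1)):
        os.makedirs(os.path.join(tmp, cls))
        for i in range(n):
            # create empty placeholder files; only the names matter here
            open(os.path.join(tmp, cls, "{}_{:05d}.jpg".format(cls, i)), "w").close()
    demo_counts = count_images_per_class(tmp)

print(dict(demo_counts))
```

In the notebook, the same helper could be called as `count_images_per_class(data_dir)` on the downloaded artifact folder.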


In [7]:
run.finish()


## Step 03: Clean Data

### Project Config.

In [8]:
data_dir

'./artifacts/animals_raw_data:v0'

In [9]:
# since we are using Jupyter Notebooks we can replace our argument
# parsing code with *hard coded* arguments and values
args = {
  "features": "clean_features",
  "target": "labels",
  "project_name": "alexnet"
}

In [10]:
# open a W&B run in the project that will store the clean data
run = wandb.init(entity="ivanovitch-silva", project=args["project_name"], job_type="preprocessing")

### Loader and Preprocessing Classes

Source code based on **Rosebrock, Adrian. Deep Learning for Computer Vision with Python, 2019** ([link](https://pyimagesearch.com/deep-learning-computer-vision-python-book/)).

In [11]:
#
# a simple preprocessor that resizes an image
#
class SimplePreprocessor:
	def __init__(self, width, height, inter=cv2.INTER_AREA):
		# store the target image width, height, and interpolation
		# method used when resizing
		self.width = width
		self.height = height
		self.inter = inter

	def preprocess(self, image):
		# resize the image to a fixed size, ignoring the aspect
		# ratio
		return cv2.resize(image, (self.width, self.height), interpolation=self.inter)

In [12]:
#
# Rearrange the dimensions of an image and return a NumPy array
# Default dimension order is (height, width, channel)
#
class ImageToArrayPreprocessor:
	def __init__(self, dataFormat=None):
		# store the image data format
		self.dataFormat = dataFormat

	def preprocess(self, image):
		# apply the Keras utility function that correctly rearranges
		# the dimensions of the image
		return tf.keras.utils.img_to_array(image, data_format=self.dataFormat)
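
The two layouts that `data_format` can select are easy to picture with plain NumPy. The sketch below only illustrates the axis order; it is not how Keras implements the conversion, and the zero-filled image is a stand-in. When `data_format` is `None`, Keras falls back to its configured image data format, which is `channels_last` by default.

```python
import numpy as np

# a dummy 227x227 RGB image in (height, width, channel) order ("channels_last")
image_hwc = np.zeros((227, 227, 3), dtype="float32")

# moving the channel axis to the front gives (channel, height, width),
# the layout known as "channels_first"
image_chw = np.transpose(image_hwc, (2, 0, 1))

print(image_hwc.shape, image_chw.shape)
```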

In [13]:
# Building an image loader
class SimpleDatasetLoader:
  def __init__(self, preprocessors=None, logger=None):
    # store the image preprocessors; if none are given, use an
    # empty list
    self.preprocessors = preprocessors if preprocessors is not None else []

    # fall back to the root logger when no logger is supplied
    self.logger = logger if logger is not None else logging.getLogger()

  def load(self, imagePaths, verbose=-1):
    # initialize the list of features and labels
    data = []
    labels = []

    # loop over the input images
    for (i, imagePath) in enumerate(imagePaths):
      # load the image and extract the class label assuming
      # that our path has the following format:
      # /path/to/dataset/{class}/{image}.jpg
      # e.g. "./artifacts/animals_raw_data:v0/dogs/dogs_00892.jpg"
      # imagePath.split(os.path.sep)[-2] will return "dogs"
      image = cv2.imread(imagePath)
      label = imagePath.split(os.path.sep)[-2]

      # apply each preprocessor to the image, in order
      for p in self.preprocessors:
        image = p.preprocess(image)

      # treat our processed image as a "feature vector"
      # by updating the data list followed by the labels
      data.append(image)
      labels.append(label)

      # show an update every `verbose` images
      if verbose > 0 and (i + 1) % verbose == 0:
        self.logger.info("[INFO] processed {}/{}".format(i + 1, len(imagePaths)))

    # return a tuple of the data and labels
    return (np.array(data), np.array(labels))
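
The label extraction inside `load` depends only on the folder layout, so it can be tried on its own. This is a small sketch with a hypothetical path; it is built with `os.path.join` so the separator matches the operating system.

```python
import os

# hypothetical path following the /path/to/dataset/{class}/{image}.jpg layout
image_path = os.path.join("artifacts", "animals_raw_data:v0", "dogs", "dogs_00892.jpg")

# the class label is the name of the folder that holds the image
label = image_path.split(os.path.sep)[-2]
print(label)  # dogs
```

If the images were stored one level deeper or in a flat folder, the index `-2` would no longer point at the class name, so the layout assumption is worth checking before loading.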

### Cleaning

In [14]:
# grab the list of images that we'll be describing
logger.info("[INFO] preprocessing images...")
imagePaths = list(paths.list_images(data_dir))

# initialize the image preprocessors
sp = SimplePreprocessor(227, 227)
iap = ImageToArrayPreprocessor()

# load the dataset from disk then scale the raw pixel intensities
# to the range [0, 1]
sdl = SimpleDatasetLoader(preprocessors=[sp, iap])
(data, labels) = sdl.load(imagePaths, verbose=500)
data = data.astype("float") / 255.0

# show some information on memory consumption of the images
logger.info("[INFO] features matrix: {:.1f}MB".format(data.nbytes / (1024 * 1024)))
logger.info("[INFO] labels vector: {:.1f}MB".format(labels.nbytes / (1024 * 1024)))
logger.info("[INFO] features shape: {}, labels shape: {}".format(data.shape,labels.shape))

24-10-2022 23:52:09 [INFO] preprocessing images...
24-10-2022 23:52:11 [INFO] processed 500/3000
24-10-2022 23:52:13 [INFO] processed 1000/3000
24-10-2022 23:52:17 [INFO] processed 1500/3000
24-10-2022 23:52:22 [INFO] processed 2000/3000
24-10-2022 23:52:24 [INFO] processed 2500/3000
24-10-2022 23:52:27 [INFO] processed 3000/3000
24-10-2022 23:52:31 [INFO] features matrix: 3538.2MB
24-10-2022 23:52:31 [INFO] labels vector: 0.1MB
24-10-2022 23:52:31 [INFO] features shape: (3000, 227, 227, 3), labels shape: (3000,)
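
The size of the features matrix follows directly from its shape and data type. The arithmetic below reproduces the logged value, given that `astype("float")` yields 64-bit floats (8 bytes per value). The float32 figure is added only for comparison and is an assumption about an alternative the notebook does not use.

```python
# shape reported by the log above: (samples, height, width, channels)
shape = (3000, 227, 227, 3)

n_values = 1
for dim in shape:
    n_values *= dim

bytes_float64 = n_values * 8  # astype("float") yields 64-bit floats
bytes_float32 = n_values * 4  # assumption: what float32 would need instead

mb_float64 = bytes_float64 / (1024 * 1024)
mb_float32 = bytes_float32 / (1024 * 1024)
print("float64: {:.1f}MB, float32: {:.1f}MB".format(mb_float64, mb_float32))
# float64: 3538.2MB, float32: 1769.1MB
```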


### Dump the artifacts to disk and upload to W&B

In [15]:
# Save the feature artifacts using joblib
joblib.dump(data, args["features"])

# Save the target using joblib
joblib.dump(labels, args["target"])

logger.info("Dumping the clean data artifacts to disk")

24-10-2022 23:53:02 Dumping the clean data artifacts to disk
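
A later step would need to read these files back. The sketch below shows a `joblib` round trip on small stand-in arrays written to a temporary folder. The array contents and the temporary paths are assumptions; only the file names `clean_features` and `labels` come from `args`.

```python
import os
import tempfile

import joblib
import numpy as np

# small stand-in arrays (hypothetical values, not the real dataset)
features = np.random.rand(4, 8, 8, 3)
targets = np.array(["cats", "dogs", "panda", "dogs"])

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "clean_features")
    target_path = os.path.join(tmp, "labels")

    # write both arrays to disk, as the notebook cell does
    joblib.dump(features, features_path)
    joblib.dump(targets, target_path)

    # read them back into memory before the temporary folder is removed
    restored_features = joblib.load(features_path)
    restored_targets = joblib.load(target_path)

print(np.array_equal(features, restored_features))
```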


In [16]:
# clean data artifact
artifact = wandb.Artifact(args["features"],
                          type="CLEAN_DATA",
                          description="A joblib file containing the clean features data"
                          )

logger.info("Logging clean data artifact")
artifact.add_file(args["features"])
run.log_artifact(artifact)

24-10-2022 23:53:05 Logging clean data artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fe2264eff50>

In [17]:
# clean label artifact
artifact = wandb.Artifact(args["target"],
                          type="CLEAN_DATA",
                          description="A joblib file containing the clean target"
                          )

logger.info("Logging clean target artifact")
artifact.add_file(args["target"])
run.log_artifact(artifact)

24-10-2022 23:53:36 Logging clean target artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7fe226495e90>

In [18]:
run.finish()