# Introduction to Deep Learning using TensorFlow and Keras.

## 2.- Image classifier (pt.1) - Preprocessing our data.

In this second tutorial we are going to build an **image classifier** for dogs and cats. This time the **dataset is not prepared** and must be adapted to a form suitable for a neural network. The images are in color and come in different shapes and sizes. **We will adjust these features** so that the model receives consistent input and can classify correctly.

The **dataset will be downloaded** from the Microsoft website, which you can find [here](https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2


# Dataset location inside the user's home directory
DATADIR = os.path.join(os.path.expanduser('~'), "Documents", "datasets", "PetImages")
CATEGORIES = ['Dog', 'Cat']

# Display only the first image of the first category as a quick check
for category in CATEGORIES:
    path = os.path.join(DATADIR, category) # Path to cats or dogs dir
    for img in os.listdir(path):
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
        plt.imshow(img_array, cmap="gray")
        plt.show()
        break  # Stop after the first image
    break  # Stop after the first category

In [None]:
print(DATADIR)

Why do we convert all the images to grayscale? Because this particular classifier does not need color to tell the two classes apart, and working with a single channel saves computation time. Note also that OpenCV does not store color images in the usual RGB order but in BGR. If we display a color image loaded with `cv2.imread` using `plt.imshow`, the colors will look strange, because matplotlib expects RGB. Grayscale images have a single channel, so they are not affected.
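
The channel-order point can be illustrated without any image file. The following is a minimal sketch, assuming a tiny hand-made array stands in for a color image loaded by OpenCV; the helper name `bgr_to_rgb` is hypothetical and not part of OpenCV.

```python
import numpy as np

def bgr_to_rgb(image):
    """Return a copy of a (height, width, 3) image with its channel order reversed."""
    return image[..., ::-1].copy()

# A 1x2 "image" in BGR order: a pure blue pixel and a pure red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)

print(rgb[0, 0])  # The blue pixel, now in RGB order (0, 0, 255)
print(rgb[0, 1])  # The red pixel, now in RGB order (255, 0, 0)
```

With OpenCV itself, `cv2.cvtColor(image, cv2.COLOR_BGR2RGB)` performs the same conversion.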

In [None]:
print("Shape: " + str(img_array.shape) + "\n")
print("Data: \n\n" + str(img_array))

Since the images in the dataset have **different shapes** (as `img_array.shape` shows), we are going to **normalize** the entire dataset to a single size, as in exercise 1. The chosen size should strike a **balance**: enough **definition** for the animal to remain recognizable, while **minimizing the size** of the data.

In [None]:
IMG_SIZE = 50
new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
plt.imshow(new_array, cmap="gray")
plt.show()

Since **the dataset is not labeled**, **we will label each image** in the same procedure where we resize it. The labels don't need to be 'Dog' or 'Cat'; a numeric label serves the same purpose. We set 0 to be a dog and 1 to be a cat ;-)

So, starting with the first category, dog, we go through the images one by one, resizing each and assigning it the label 0. Then the same is done with the images of cats, using the label 1.

In [None]:
training_data = []

def create_training_data():
    for category in CATEGORIES:
        path = os.path.join(DATADIR, category) # Path to cats or dogs dir
        class_num = CATEGORIES.index(category) # 0 will be a dog and 1 will be a cat
        
        for img in os.listdir(path):
            try:
                img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE) # Get original image
                new_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))               # Resize image
                training_data.append([new_array, class_num])                          # [img, label]
            except Exception as e:
                # Skip files that OpenCV cannot read or resize
                print(e)

create_training_data()

In [None]:
print(len(training_data))
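
Besides the total, it can be useful to check how many samples each class contributed. A minimal sketch, assuming the data holds `[image, label]` pairs as above; the small stand-in list and the helper name `count_per_category` are hypothetical, used only so the example runs on its own.

```python
from collections import Counter

CATEGORIES = ['Dog', 'Cat']

def count_per_category(samples, categories):
    """Count samples per category name, given [image, label] pairs."""
    counts = Counter(label for _, label in samples)
    return {name: counts.get(index, 0) for index, name in enumerate(categories)}

# Stand-in for training_data: the image slot is left as None here
example_data = [[None, 0], [None, 0], [None, 1]]
print(count_per_category(example_data, CATEGORIES))  # {'Dog': 2, 'Cat': 1}
```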

The next step is to **shuffle our dataset**. Because the loop goes through one category at a time (dogs first, then cats), the list is ordered by class, and we want to avoid the network training first on dogs only and then on cats only. Therefore, using the `shuffle` function from Python's `random` module, we will **shuffle** the list in place so that the next input image is random (a probability of about 0.5 for each class, assuming the two classes are about the same size).

In [None]:
import random

random.shuffle(training_data)

In [None]:
for sample in training_data[:10]:
    print(sample[1])
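
If the same order is needed on every run (for example, to compare experiments), the shuffle can be seeded. A minimal sketch, assuming a dedicated `random.Random` instance; the helper name `shuffled_copy`, the seed value 42 and the stand-in list are arbitrary choices for illustration.

```python
import random

def shuffled_copy(samples, seed):
    """Return a shuffled copy of samples; the same seed gives the same order."""
    rng = random.Random(seed)  # Independent generator, leaves the global state untouched
    result = list(samples)
    rng.shuffle(result)
    return result

# Stand-in for training_data: five "dogs" (0) followed by five "cats" (1)
ordered = [[None, 0]] * 5 + [[None, 1]] * 5
first = shuffled_copy(ordered, seed=42)
second = shuffled_copy(ordered, seed=42)

print([label for _, label in first])
print(first == second)  # True: same seed, same order
```

Unlike `random.shuffle(training_data)`, this version leaves the original list untouched and returns a new one.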

Let's pack the data into the variables that we will feed into our neural network: an empty list `X` and an empty list `y`. In general, capital `X` is your **feature** set and lowercase `y` is your **labels**.

In [None]:
# X = feature set, y = labels
X = []
y = []

**We can't pass a plain Python list to the neural network**, at least for the time being (maybe Keras will allow it in the future), so the features are converted to a NumPy array.

In [None]:
for features, label in training_data:
    X.append(features)     # X is a list of image arrays
    y.append(label)        # y is a list of labels

# X has to be a NumPy array first, so let's convert and reshape it:
X = np.array(X).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
# -1 lets NumPy infer that dimension (the number of images).
# The last 1 is the number of channels: 1 for grayscale, 3 for color (BGR in OpenCV).
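
The effect of the `reshape` call is easier to see on a small example. A minimal sketch, assuming three blank 50x50 arrays stand in for the resized grayscale images.

```python
import numpy as np

IMG_SIZE = 50

# Stand-in for the feature list: three blank 50x50 grayscale images
features = [np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.uint8) for _ in range(3)]

stacked = np.array(features)
print(stacked.shape)  # (3, 50, 50)

# -1 asks NumPy to infer that dimension (here the number of images);
# the trailing 1 is the single grayscale channel
batch = stacked.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
print(batch.shape)    # (3, 50, 50, 1)
```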

To avoid having to preprocess the images again every time you run this exercise, you can use the Python [pickle](https://docs.python.org/3/library/pickle.html) module, which **allows us to store (serialize)** the data for later use.

In [None]:
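import pickle

os.makedirs("models", exist_ok=True)  # Make sure the output directory exists

# "with" closes each file automatically, even if dump() fails
with open("models/X.pickle", "wb") as pickle_out:
    pickle.dump(X, pickle_out)

with open("models/y.pickle", "wb") as pickle_out:
    pickle.dump(y, pickle_out)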

In [None]:
# Load the features back; "with" closes the file automatically
with open("models/X.pickle", "rb") as pickle_in:
    X = pickle.load(pickle_in)
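
To confirm that nothing is lost on the way to disk and back, a round trip can be checked. A minimal sketch, assuming a temporary directory instead of the `models` folder and a small stand-in list instead of the real data.

```python
import os
import pickle
import tempfile

labels = [0, 1, 1, 0]  # Stand-in for y

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "y.pickle")

    # Pickle files are binary, hence the "wb" and "rb" modes
    with open(file_path, "wb") as pickle_out:
        pickle.dump(labels, pickle_out)

    with open(file_path, "rb") as pickle_in:
        restored = pickle.load(pickle_in)

print(restored == labels)  # True: the loaded copy equals the original
```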

In the next part we will create the model and train the neural network to see how it classifies each of the images. See you in the next part ;-)