# Day 2 Example 5
# Object Detection and Tracking

Last edited 2019/07/15

Some cells depend on imports or outputs from previous cells, so run the cells in order.

In [0]:
# change Runtime type by going to Runtime>Change runtime type>Hardware accelerator>GPU

%matplotlib inline

# import several packages and files with associated file structure
# requirements
!pip install pyamg
# the contrib build is needed for cv2.ximgproc (used in the selective search example)
!pip install opencv-contrib-python

# data
!apt-get install subversion
!svn checkout "https://github.com/jojker/PML_Workshops/trunk/Summer 2019/Day 2 - Goal 1 - Turning Images into Data/Ex 5 - object detection and tracking/Data"

path = 'Data/'

# Finding contours and the center of a blob with OpenCV

In [0]:
import cv2
from google.colab.patches import cv2_imshow

# read the image from the data folder
img = cv2.imread(path + 'multiple-blob.png')

# convert the image to grayscale
gray_image = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

print("Grayscale Image")
cv2_imshow(gray_image)

# convert the grayscale image to binary image
ret,thresh = cv2.threshold(gray_image,127,255,0)

# find contours in the binary image
# (OpenCV 3 returns three values and OpenCV 4 returns two; [-2:] works with both)
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2:]

# draw the contours on the image
img_contours = img.copy()
cv2.drawContours(img_contours, contours, -1, (10,10,255), 2)

print("Image with Contours")
cv2_imshow(img_contours)

for c in contours:
  # calculate moments for each contour
  M = cv2.moments(c)
  
  # catch zero M00 term and calculate x,y coordinate of center
  if M["m00"] != 0:
    cX = int(M["m10"] / M["m00"])
    cY = int(M["m01"] / M["m00"])
  else:
    cX, cY = 0, 0
    
  # add point on image indicating the centroid
  cv2.circle(img, (cX, cY), 4, (255, 255, 255), -1)
  
  # add text to the image if desired
  # cv2.putText(img, "centroid", (cX - 25, cY - 25),cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 255, 255), 2)

# display image with centroids
print("\n\nImage with Centroids")
cv2_imshow(img)
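
The centroid formula used above, cX = m10 / m00 and cY = m01 / m00, can be checked by hand on a small synthetic mask. The following is a minimal NumPy sketch and not OpenCV's implementation. The helper name `centroid_from_mask` is hypothetical, and it computes moments from the filled pixel mask rather than from a contour polygon, so results may differ slightly from `cv2.moments(c)`.

```python
import numpy as np

def centroid_from_mask(mask):
    """Return (cX, cY) of the nonzero pixels in a 2D mask, or (0, 0) if empty."""
    ys, xs = np.nonzero(mask)
    m00 = len(xs)        # zeroth moment: number of foreground pixels
    if m00 == 0:
        return 0, 0      # same fallback as the loop above
    m10 = xs.sum()       # first moment along x (columns)
    m01 = ys.sum()       # first moment along y (rows)
    return int(m10 / m00), int(m01 / m00)

# a 3x5 blob covering rows 2..4 and columns 4..8
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 4:9] = 1
print(centroid_from_mask(mask))
```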

# Object Recognition and Object Detection

**Object recognition** identifies which objects are present in an image. It takes the entire image as an input and outputs class labels and class probabilities of objects present in that image.

**Object detection** not only tells you which objects are present in the image, it also outputs bounding boxes (x, y, width, height) to indicate the location of the objects inside the image.


To localize an object, we have to select sub-regions (patches) of the image and apply the object recognition algorithm to these patches. The location of the objects is given by the location of the image patches where the class probability returned by the object recognition algorithm is high.

The most straightforward way to generate smaller sub-regions (patches) is the **sliding window** approach, where a box slides over the image to select patches and each patch is classified using the object recognition model. It is an exhaustive search and can be costly because it is often necessary to search multiple aspect ratios over the entire image.
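
To get a feel for the cost of the exhaustive search, here is a minimal sketch that only enumerates the patches a sliding window would visit. The function name `sliding_windows`, the window sizes, and the stride are illustrative assumptions; a real detector would also run the recognition model on every patch.

```python
def sliding_windows(image_width, image_height, window_sizes, stride):
    """Yield (x, y, w, h) boxes covering the image for every window size."""
    for w, h in window_sizes:
        for y in range(0, image_height - h + 1, stride):
            for x in range(0, image_width - w + 1, stride):
                yield (x, y, w, h)

# three aspect ratios over a 640x480 image
sizes = [(64, 64), (128, 64), (64, 128)]
boxes = list(sliding_windows(640, 480, sizes, stride=16))
print(len(boxes), "patches to classify")
```

Even with a coarse stride and only three window shapes, the model would have to be evaluated thousands of times per image.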

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/object-recognition-dogs-768x436.jpg width="500">

The cost of the sliding window approach can be reduced with **region proposal** algorithms, which output only the patches that are most likely to be objects. The proposed regions can be noisy, overlapping, and may not contain the object perfectly. The regions with high probability scores are the object locations.

<img src=https://www.learnopencv.com/wp-content/uploads/2017/10/object-recognition-false-positives-true-positives-768x436.jpg width="500">

In region proposal algorithms, possible objects are identified using segmentation, where similar adjacent regions are grouped based on criteria such as color and texture. The final number of proposals is many times smaller than in the sliding window approach, and the regions have different aspect ratios.

It is okay for the region proposal to produce a lot of false positives so long as it catches all the true positives (high recall) because the false positives will be rejected by the object recognition step.

**Selective search** is a popular choice for region proposal because it is fast and has high recall. In selective search the image is segmented by pixel intensity.

original image

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/breakfast-300x200.jpg width="300">

segmented image

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/breakfast_fnh-300x200.jpg width="300">

Selective search takes an oversegmented image as its initial input and repeats the following steps:
1. Add all bounding boxes corresponding to segmented parts to the list of region proposals
2. Group adjacent segments based on similarity
3. Go to step 1

oversegmented image

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/breakfast_oversegment-300x200.jpg width="300">

At each iteration, larger segments are formed and added to the list of region proposals.
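
The three steps above can be sketched as a greedy merging loop. This is a toy illustration under strong assumptions, not OpenCV's implementation: segments are represented only by their bounding boxes, the adjacency check is skipped, and `closeness` is a hypothetical stand-in for the real similarity measures.

```python
def merge_boxes(a, b):
    """Smallest (x, y, w, h) box that contains boxes a and b."""
    x1, y1 = min(a[0], b[0]), min(a[1], b[1])
    x2 = max(a[0] + a[2], b[0] + b[2])
    y2 = max(a[1] + a[3], b[1] + b[3])
    return (x1, y1, x2 - x1, y2 - y1)

def hierarchical_grouping(boxes, similarity):
    """Greedily merge the most similar pair until one region remains."""
    regions = list(boxes)
    proposals = list(boxes)       # step 1: every initial segment is a proposal
    while len(regions) > 1:
        # step 2: pick the most similar pair (a real implementation would
        # only consider adjacent segments)
        pairs = [(similarity(regions[i], regions[j]), i, j)
                 for i in range(len(regions))
                 for j in range(i + 1, len(regions))]
        _, i, j = max(pairs)
        merged = merge_boxes(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)  # step 3: the merged region is a new proposal
    return proposals

def closeness(a, b):
    """Hypothetical similarity: negative distance between box centers."""
    ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx, by = b[0] + b[2] / 2, b[1] + b[3] / 2
    return -((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

segments = [(0, 0, 10, 10), (10, 0, 10, 10), (50, 50, 10, 10)]
proposals = hierarchical_grouping(segments, closeness)
print(proposals)
```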

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/hierarchical-segmentation-1.jpg width="800">

Selective search uses 4 similarity measures based on color, texture, size, and shape. 

Color:
A color histogram of 25 bins is calculated for each channel of the image, and the histograms for all channels are concatenated to obtain a 25×3 = 75-dimensional color descriptor.
The color similarity of two regions is based on histogram intersection and can be calculated as:

<img src=https://www.learnopencv.com/wp-content/ql-cache/quicklatex.com-3a99604c3b9fc1664b0ebd9b16aa190c_l3.png width="300">


c<sub>i</sub><sup>k</sup> is the histogram value for the k<sup>th</sup> bin in the color descriptor of region r<sub>i</sub>.
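
As a rough illustration of histogram intersection, the sketch below builds a 75-dimensional descriptor for two synthetic regions and sums the bin-wise minimum. The function names and the random pixel data are made up for this example, and normalizing each descriptor to sum to 1 is an assumption that keeps the similarity between 0 and 1.

```python
import numpy as np

def color_descriptor(pixels, bins=25):
    """Concatenate per-channel histograms and normalize them to sum to 1.

    pixels: array of shape (n_pixels, 3) with values in 0..255.
    """
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    desc = np.concatenate(hists).astype(float)
    return desc / desc.sum()

def color_similarity(desc_i, desc_j):
    """Histogram intersection: sum over bins of the smaller value."""
    return np.minimum(desc_i, desc_j).sum()

rng = np.random.default_rng(0)
region_i = rng.integers(0, 256, size=(500, 3))  # full intensity range
region_j = rng.integers(0, 128, size=(500, 3))  # darker region
d_i, d_j = color_descriptor(region_i), color_descriptor(region_j)
print(d_i.shape, color_similarity(d_i, d_i), color_similarity(d_i, d_j))
```

A region compared with itself scores 1, while regions with different color distributions score lower.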

Texture:
Texture features are calculated by extracting Gaussian derivatives at 8 orientations for each channel. For each orientation and for each color channel, a 10-bin histogram is computed, resulting in a 10×8×3 = 240-dimensional feature descriptor. The texture similarity of two regions is also calculated using histogram intersection:

<img src=https://www.learnopencv.com/wp-content/ql-cache/quicklatex.com-169d419080f56b69f9645cd13ee5b0ac_l3.png width="300">

Size Similarity: 
Size similarity encourages smaller regions to merge early. It ensures that region proposals at all scales are formed at all parts of the image. If this similarity measure is not taken into consideration a single region will keep gobbling up all the smaller adjacent regions one by one and hence region proposals at multiple scales will be generated at this location only. Size similarity is defined as:

<img src=https://www.learnopencv.com/wp-content/ql-cache/quicklatex.com-ed6bd32a9661aa84228d1ca1c75f5d29_l3.png width="325">

where size(im) is the size of the image in pixels.
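
A small numeric sketch of the size measure, assuming the formula 1 - (size(r<sub>i</sub>) + size(r<sub>j</sub>)) / size(im). The region sizes below are invented for illustration.

```python
def size_similarity(size_i, size_j, size_im):
    """Close to 1 when both regions are small relative to the image."""
    return 1.0 - (size_i + size_j) / size_im

size_im = 300 * 200                                # image size in pixels
small_pair = size_similarity(500, 700, size_im)    # two small regions
large_pair = size_similarity(20000, 700, size_im)  # one region already large
print(small_pair, large_pair)
```

The pair of small regions scores higher, so it is merged before the large region can absorb its neighbors.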

Shape Compatibility:
Shape compatibility measures how well two regions (r<sub>i</sub> and r<sub>j</sub>) fit into each other. If r<sub>i</sub> fits into r<sub>j</sub> we would like to merge them in order to fill gaps.  If they are not touching they should not be merged.  Shape compatibility is defined as:

<img src=https://www.learnopencv.com/wp-content/ql-cache/quicklatex.com-9a3fdf638488b3c77915b9b83bf2f3e1_l3.png width="400">

where BB<sub>ij</sub> is the bounding box around r<sub>i</sub> and r<sub>j</sub>, and size(BB<sub>ij</sub>) is its size in pixels.
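
The sketch below evaluates shape compatibility for two made-up cases, assuming the formula 1 - (size(BB<sub>ij</sub>) - size(r<sub>i</sub>) - size(r<sub>j</sub>)) / size(im). The helper names are hypothetical, and each region is taken to be a filled rectangle for simplicity.

```python
def union_box_size(box_i, box_j):
    """Area in pixels of the tight (x, y, w, h) box around both boxes."""
    x1 = min(box_i[0], box_j[0])
    y1 = min(box_i[1], box_j[1])
    x2 = max(box_i[0] + box_i[2], box_j[0] + box_j[2])
    y2 = max(box_i[1] + box_i[3], box_j[1] + box_j[3])
    return (x2 - x1) * (y2 - y1)

def shape_compatibility(size_i, size_j, size_bb, size_im):
    """Penalize the empty space left inside the joint bounding box."""
    return 1.0 - (size_bb - size_i - size_j) / size_im

size_im = 300 * 200
# two 10x10 regions side by side fill their joint box completely
touching = shape_compatibility(
    100, 100, union_box_size((0, 0, 10, 10), (10, 0, 10, 10)), size_im)
# two 10x10 regions far apart leave a large gap in the joint box
apart = shape_compatibility(
    100, 100, union_box_size((0, 0, 10, 10), (90, 90, 10, 10)), size_im)
print(touching, apart)
```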

Final Similarity:
The final similarity between two regions is defined as a linear combination of the four similarities above.

<img src=https://www.learnopencv.com/wp-content/ql-cache/quicklatex.com-67a3c5c3f45a9407ee513056c759f095_l3.png width="700">

where r<sub>i</sub> and r<sub>j</sub> are two regions or segments in the image and a<sub>i</sub> is either 0 or 1 denoting if the similarity measure is used or not.
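
Written as code, the combination is a sum gated by the 0/1 flags. The argument order (color, texture, size, shape) and the example values are assumptions made for illustration.

```python
def final_similarity(s_color, s_texture, s_size, s_fill, a=(1, 1, 1, 1)):
    """Sum of the four similarities, each switched on or off by a flag in a."""
    if any(flag not in (0, 1) for flag in a):
        raise ValueError("each a_i must be 0 or 1")
    return a[0] * s_color + a[1] * s_texture + a[2] * s_size + a[3] * s_fill

all_measures = final_similarity(0.5, 0.25, 0.75, 1.0)
color_and_size = final_similarity(0.5, 0.25, 0.75, 1.0, a=(1, 0, 1, 0))
print(all_measures, color_and_size)
```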

The Selective Search implementation in OpenCV gives thousands of region proposals arranged in decreasing order of objectness. For clarity, the image below shows only the top 200-250 boxes. In general, 1000-1200 proposals are enough to capture all the correct region proposals.

<img src=https://www.learnopencv.com/wp-content/uploads/2017/09/breakfast-top-200-proposals-300x200.jpg width="500">


# Selective Search Example

In [0]:
#!/usr/bin/env python
 
import sys
import cv2
from google.colab.patches import cv2_imshow

# select fast ('f') but low recall selective search or slow ('q') but high recall
# selective search
selection = 'q'

im_loc = path + 'dog.jpg'

# speed-up using multithreads
cv2.setUseOptimized(True)
cv2.setNumThreads(4)

# read image
im = cv2.imread(im_loc)

# resize image
newHeight = 300
newWidth = int(im.shape[1]*newHeight/im.shape[0])
im = cv2.resize(im, (newWidth, newHeight))    

# create Selective Search Segmentation Object using default parameters
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()

# set input image on which we will run segmentation
ss.setBaseImage(im)

# Switch to fast but low recall Selective Search method
if (selection == 'f'):
  ss.switchToSelectiveSearchFast()

# Switch to high recall but slow Selective Search method
elif (selection == 'q'):
  ss.switchToSelectiveSearchQuality()

# run selective search segmentation on input image
rects = ss.process()
print('Total Number of Region Proposals: {}'.format(len(rects)))

# number of region proposals to show
numShowRects = 60

print('showing', numShowRects, 'region proposals')
 
# create a copy of original image
imOut = im.copy()
  
# iterate over all the region proposals
for i, rect in enumerate(rects):
  # draw rectangle for region proposal till numShowRects
  if (i < numShowRects):
    x, y, w, h = rect
    cv2.rectangle(imOut, (x, y), (x+w, y+h), (0, 255, 0), 1, cv2.LINE_AA)
  else:
    break

# show output of resized image with proposal boxes
cv2_imshow(imOut)

# Face Detection

Viola-Jones Object Detection Framework

This algorithm is named after two computer vision researchers who proposed the method in 2001: Paul Viola and Michael Jones.

Given an image, the algorithm looks at many smaller subregions and tries to find a face by looking for specific features in each subregion. It needs to check many different positions and scales because an image can contain many faces of various sizes. Viola and Jones used Haar-like features to detect faces.

Haar-Like Features:

All human faces share some similarities. These similarities help the algorithm understand if an image contains a human face.

A simple way to find contrast between regions is to sum up the pixel values of both regions and compare them. The sum of pixel values in the darker region will be smaller than the sum of pixels in the lighter region. This can be accomplished using Haar-like features.

A Haar-like feature is represented by taking a rectangular part of an image and dividing that rectangle into multiple parts. They are often visualized as black and white adjacent rectangles:

<img src=https://files.realpython.com/media/Haar.885b5c872b35.png width="500">

In this image, you can see 4 basic types of Haar-like features:

1. Horizontal feature with two rectangles - edge detection
2. Vertical feature with two rectangles - edge detection
3. Vertical feature with three rectangles - line detection
4. Diagonal feature with four rectangles

The value of the feature is calculated as a single number: the sum of pixel values in the black area minus the sum of pixel values in the white area. For uniform areas like a wall, this number would be close to zero.
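
As a concrete illustration, the sketch below evaluates a two-rectangle feature on two tiny synthetic patches. The function name and pixel values are made up, and treating the left half as the "black" rectangle is an arbitrary choice for this example.

```python
import numpy as np

def two_rectangle_feature(patch):
    """Sum of the left ('black') half minus the sum of the right ('white') half."""
    w = patch.shape[1]
    black = patch[:, : w // 2].sum()
    white = patch[:, w // 2:].sum()
    return int(black) - int(white)

# a uniform area, like a wall
flat_wall = np.full((4, 4), 120, dtype=np.uint8)
# a dark region next to a bright region, like a vertical edge
edge = np.hstack([np.full((4, 2), 30, dtype=np.uint8),
                  np.full((4, 2), 220, dtype=np.uint8)])

print(two_rectangle_feature(flat_wall))
print(two_rectangle_feature(edge))
```

The uniform patch gives zero, while the edge patch gives a value that is large in magnitude.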

To be useful, a Haar-like feature needs to give a large number, meaning the areas in the black and white rectangles are very different. There are known features that perform very well to detect human faces.

Integral Images:

An integral image is the name of both a data structure and an algorithm used to obtain this data structure. It is used as a quick and efficient way to calculate the sum of pixel values in an image or rectangular part of an image. The value of each point is the sum of all pixels above and to the left, including the target pixel:

<img src=https://files.realpython.com/media/Integral_image.ff570b17c188.png width='400'>

The integral image can be calculated in a single pass over the original image. This reduces summing the pixel intensities within a rectangle into only three operations with four numbers, regardless of rectangle size:

<img src=https://files.realpython.com/media/ABCD.97ca0ef04d39.png width='200'>

The sum of pixels in the rectangle ABCD can be derived from the values of points A, B, C, and D, using the formula D - B - C + A.
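
The D - B - C + A formula can be verified with a few lines of NumPy. This is a minimal sketch with hypothetical helper names. Padding the integral image with a leading row and column of zeros is a convenience assumed here so that rectangles touching the image border need no special case.

```python
import numpy as np

def integral_image(img):
    """Cumulative sum over rows and columns, with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] from four lookups."""
    A = ii[top, left]
    B = ii[top, left + width]
    C = ii[top + height, left]
    D = ii[top + height, left + width]
    return int(D - B - C + A)

img = np.arange(20).reshape(4, 5)
ii = integral_image(img)
# the fast lookup and the direct sum should agree
print(rect_sum(ii, 1, 2, 2, 3), img[1:3, 2:5].sum())
```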

But how do you decide which of these features and in what sizes to use for finding faces in images? This is solved by a machine learning algorithm called boosting.

Adaptive Boosting:

Boosting is based on the following question: “Can a set of weak learners create a single strong learner?” A weak learner is defined as a classifier that is only slightly better than random guessing.

In face detection, this means that a weak learner can classify a subregion of an image as a face or not-face only slightly better than random guessing.

The power of boosting comes from combining many (thousands of) weak classifiers into a single strong classifier. In the Viola-Jones algorithm, each Haar-like feature represents a weak learner. To decide the type and size of a feature that goes into the final classifier, adaptive boosting checks the performance of all classifiers that you supply to it.

To calculate the performance of a classifier, you evaluate it on all subregions of all the images used for training. Some subregions will produce a strong response in the classifier. Those will be classified as positives, meaning the classifier thinks they contain a human face.
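
To make the idea of combining weak learners concrete, here is a minimal AdaBoost sketch on a toy one-dimensional problem. It is not the Viola-Jones training code: the data are invented, and each weak learner is a simple threshold ("stump") on a single number, which stands in for a thresholded Haar-like feature value.

```python
import math

def train_adaboost(xs, ys, rounds):
    """AdaBoost with threshold stumps on one feature; labels are +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        # try every stump and keep the one with the lowest weighted error
        for threshold in sorted(set(xs)):
            for polarity in (1, -1):
                preds = [polarity if x >= threshold else -polarity for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, threshold, polarity, preds)
        err, threshold, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # raise the weight of misclassified samples and lower the rest
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (pol if x >= thr else -pol)
                for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# no single threshold separates these labels
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
for rounds in (1, 3):
    model = train_adaboost(xs, ys, rounds)
    correct = sum(predict(model, x) == y for x, y in zip(xs, ys))
    print(rounds, "round(s):", correct, "of", len(xs), "correct")
```

One stump alone cannot fit this pattern, but the weighted vote of three stumps classifies every training sample correctly.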

<img src=https://files.realpython.com/media/AdaBoost-7.2ec2db197252.png width='300'>

Cascading Classifiers:

The definition of a cascade is a series of waterfalls coming one after another. A similar concept is used in computer science to solve a complex problem with simple units. Here we want to reduce the number of computations for each image.

Viola and Jones turned their strong classifier (consisting of thousands of weak classifiers) into a cascade of stages, where each stage is a smaller classifier built from a subset of the weak classifiers. The job of the cascade is to quickly discard non-faces and avoid wasting precious time and computations.

When an image subregion enters the cascade, it is evaluated by the first stage. If that stage evaluates the subregion as positive, meaning that it thinks it’s a face, the output of the stage is "maybe" and the subregion is passed on to the next stage.

If any stage gives a negative evaluation, then the image is immediately discarded as not containing a human face.

<img src=https://files.realpython.com/media/Classifier_cascade.e3b2a5652044.png width='500'>

This is designed so that non-faces get discarded very quickly, which saves a lot of time and computational resources.
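
The early-exit behavior can be sketched in a few lines. The stage functions and the feature names in the dictionaries below are hypothetical placeholders; in a trained cascade each stage would be a boosted classifier over Haar-like features.

```python
def run_cascade(stages, subregion):
    """Return (is_face, stages_evaluated), stopping at the first negative stage."""
    for n, stage in enumerate(stages, start=1):
        if not stage(subregion):
            return False, n   # discarded immediately, later stages never run
    return True, len(stages)  # every stage answered "maybe"

# made-up stages, ordered from cheapest to most selective
stages = [
    lambda r: r["contrast"] > 10,
    lambda r: r["eye_band"] > 0.5,
    lambda r: r["nose_bridge"] > 0.5,
]

wall = {"contrast": 2, "eye_band": 0.0, "nose_bridge": 0.0}
face = {"contrast": 40, "eye_band": 0.9, "nose_bridge": 0.8}

print(run_cascade(stages, wall))
print(run_cascade(stages, face))
```

The uniform "wall" subregion is rejected after a single stage, while the face-like subregion has to pass all of them.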

Though the theory may sound complicated, in practice it is quite easy. The cascades themselves are just a bunch of XML files that contain OpenCV data used to detect objects. You initialize your code with the cascade you want, and then it does the work for you.

Since face detection is such a common case, OpenCV comes with a number of built-in cascades for detecting everything from faces to eyes to hands to legs.

We will load a cascade file called haarcascade_frontalface_alt.xml from the OpenCV repository:
https://github.com/opencv/opencv/blob/master/data/haarcascades/haarcascade_frontalface_alt.xml

In [0]:
import cv2

# Read image from your local file system
im = cv2.imread(path + 'people.jpg')

# resize image
newHeight = 400
newWidth = int(im.shape[1]*newHeight/im.shape[0])
im = cv2.resize(im, (newWidth, newHeight))   

# Convert color image to grayscale for Viola-Jones
grayscale_image = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

# Load the classifier and create a cascade object for face detection
face_cascade = cv2.CascadeClassifier(path + 'haarcascade_frontalface_alt.xml')

The face_cascade object has a method detectMultiScale(), which receives an image as an argument and runs the classifier cascade over the image. The term MultiScale indicates that the algorithm looks at subregions of the image in multiple scales, to detect faces of varying sizes.

The detectMultiScale function is a general function that detects objects. Since we are calling it on the face cascade, that’s what it detects. The function has several options:

1. The image to be searched (here, the grayscale image).

2. **scaleFactor** - since some faces may be closer to the camera, they appear bigger than the faces in the back. The scale factor compensates for this by setting how much the image is shrunk between successive scales that are searched.

3. The detection algorithm uses a moving window to detect objects. **minNeighbors** defines how many overlapping detections must be found near a candidate before it is declared a face. **minSize** gives the minimum object size; anything smaller is ignored.
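
To see why scaleFactor trades speed for thoroughness, the sketch below lists the window sizes that would be searched if the window grew by the scale factor at each step. This is a simplified model and not OpenCV's internal logic; the function name and the 400-pixel image size are assumptions for illustration.

```python
def window_sizes(min_size, image_size, scale_factor):
    """Square window sizes from min_size up to the image size."""
    sizes = []
    size = float(min_size)
    while size <= image_size:
        sizes.append(int(round(size)))
        size *= scale_factor
    return sizes

fine = window_sizes(30, 400, 1.1)    # small steps: many scales, slower
coarse = window_sizes(30, 400, 1.5)  # large steps: few scales, faster
print(len(fine), "scales with scaleFactor=1.1")
print(len(coarse), "scales with scaleFactor=1.5")
```

A factor close to 1 searches many more scales, which is slower but less likely to miss a face whose size falls between two steps.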

In [0]:
# Detect faces in the image
detected_faces = face_cascade.detectMultiScale(
    grayscale_image,
    scaleFactor=1.1,
    minNeighbors=7,
    minSize=(30, 30)
)

The variable detected_faces now contains all the detections for the target image. To visualize the detections, you need to iterate over all detections and draw rectangles over the detected faces.

OpenCV’s rectangle() draws rectangles over images, and it needs to know the pixel coordinates of the top-left and bottom-right corner. The coordinates indicate the row and column of pixels in the image.

Luckily, detections are saved as pixel coordinates. Each detection is defined by its top-left corner coordinates and width and height of the rectangle that encompasses the detected face.

Adding the width to the column and the height to the row will give you the bottom-right corner of the rectangle:

In [0]:
for (column, row, width, height) in detected_faces:
    cv2.rectangle(
        im,
        (column, row),
        (column + width, row + height),
        (0, 255, 0),
        2
    )

"""
rectangle() accepts the following arguments:

The original image
The coordinates of the top-left point of the detection
The coordinates of the bottom-right point of the detection
The color of the rectangle (a tuple that defines the amount of blue, green, and red (0-255), since OpenCV uses BGR order)
The thickness of the rectangle lines
"""

# Display the image (cv2_imshow renders inline in Colab, so waitKey/destroyAllWindows are not needed)
cv2_imshow(im)

# Brief Example on Transfer Learning
- quickly train a model to identify new objects

ImageNet is a research project to develop a large database of annotated images, i.e. images and their descriptions. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC) asks models to classify images into 1000 categories.

Researchers from the Oxford Visual Geometry Group, or VGG for short, participated in the challenge and created the VGG16 model (16 weight layers).

In [0]:
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.applications.vgg16 import decode_predictions
from keras.applications.vgg16 import VGG16

# load the model (this may take a couple of minutes, but you only have to do it once per colab session)
model = VGG16()

In [0]:
# Visualize the model (don't get hung up on the details here)
# summary() prints the table itself, so no print() is needed
model.summary()

In [0]:
# some imports to get plotting working dynamically through a for loop
import matplotlib.pyplot as plt
from time import sleep
import glob

# select all the images to test using glob to extract all files with the extension .jpg
new_image_files = glob.glob(path + '*.jpg')

for file in new_image_files:
  
  # load the image
  image = load_img(file, target_size=(224, 224))
  
  # convert the image pixels to a numpy array
  im = img_to_array(image)

  # reshape data for the model
  im = im.reshape((1, im.shape[0], im.shape[1], im.shape[2]))

  # prepare the image for the VGG model
  im = preprocess_input(im)
  
  # predict the probability across all output classes
  yhat = model.predict(im)
  
  # convert the probabilities to class labels
  label = decode_predictions(yhat)
  
  # retrieve the most likely result, e.g. highest probability
  label = label[0][0]

  # print the classification and show the image
  print('%s (%.2f%%)' % (label[1], label[2]*100))
  img = plt.imshow(image)
  plt.axis('off')
  plt.show()
  sleep(1)
  

# How to go from this to a new data set?

Remove the fully-connected layers at the top of the network and add new fully-connected layers with the correct number of outputs.

Training then fits the new fully-connected layers. In the code below the convolutional layers are frozen, so they are reused as-is; leaving them trainable would fine-tune them as well.



In [0]:
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import decode_predictions
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
from keras.layers import Input, Flatten, Dense
from keras.models import Model
import numpy as np

#Get back the convolutional part of a VGG network trained on ImageNet
model_vgg16_conv = VGG16(weights='imagenet', include_top=False)

# make the VGG16 layers non-trainable (making use of their pretraining)
# this leaves only the new fully-connected layers to train: under 1 million parameters here,
# versus roughly 15 million if the VGG16 convolutional layers were also trained
for layer in model_vgg16_conv.layers:
  layer.trainable = False

# Create your own input format (here 224x224x3)
# (named image_input to avoid shadowing Python's built-in input)
image_input = Input(shape=(224, 224, 3), name='image_input')

# Use the generated model
output_vgg16_conv = model_vgg16_conv(image_input)

# Add the fully-connected layers
x = Flatten(name='flatten')(output_vgg16_conv)
x = Dense(32, activation='relu', name='fc1')(x)
x = Dense(64, activation='relu', name='fc2')(x)

# output layer: one unit per class (7 classes in the new data set)
x = Dense(7, activation='softmax', name='predictions')(x)

# Create your own model (Keras 2 uses the keyword names inputs/outputs)
my_model = Model(inputs=image_input, outputs=x)
my_model.summary()

In [0]:
# collect data to train the new model
import glob

# select all the images to train with, using glob to extract all files in the folder with the extension .jpg
poke_folders = glob.glob(path +'pokemon/*/')
poke_type = []
for p in poke_folders:
  p = p[:-1]
  p = p.split('/')
  poke_type.append(p[-1])

print("new labels")
print(poke_type)

# create integers associated with each label
poke_num = range(len(poke_type))

labels = []
labels_num = []
total_count = 0
label_count = 0

for label in poke_type:
  
  new_images = glob.glob(path + 'pokemon/' + label + '/*.jpg')
  for file in new_images:
    
    # specify the label and a corresponding number for the label
    labels.append(label)
    labels_num.append(label_count)
    
    # load the images
    image = load_img(file, target_size=(224, 224))

    # convert the image pixels to a numpy array
    im = img_to_array(image)

    # reshape data for the model
    im = im.reshape((1, im.shape[0], im.shape[1], im.shape[2]))
    
    # preprocess the image for the input type of the VGG16 model
    # im = preprocess_input(im)
    
    # initialize the data variable if this is the first time through the loops
    if total_count==0:
      data = im
    
    # otherwise, stack the arrays to create a single input array
    else:
      # stack the new arrays
      data = np.concatenate((data,im))
    
    # bump the counter
    total_count = total_count + 1
  
  # bump the label count
  label_count = label_count + 1


# show the shape of the new data
print('\ndata shape: ', data.shape)
print('number of images: ',total_count)

In [0]:
# make the labels a numpy array
labels_num1 = np.reshape(np.array(labels_num),(len(labels_num),1))

shape0 = data.shape[0]
shape1 = data.shape[1]
shape2 = data.shape[2]
shape3 = data.shape[3]

data = np.reshape(data, (shape0,shape1*shape2*shape3))
data = np.append(data,labels_num1,axis=1)

# shuffle the array so the data is not fed in sequentially by type
np.random.shuffle(data)
print('data shape after shuffle: ', data.shape)

# separate the last column to extract the labels from the data set
labels_num = data[:, -1] # for last column
data = data[:, :-1]        # for all but last column
print('data shape after shuffle with the labels removed: ', data.shape)

data = np.reshape(data,(shape0,shape1,shape2,shape3))

print('data shape after reshape to the original shape: ', data.shape)
print('labels shape: ', labels_num.shape)

# make the labels into one hot encoded labels
# example 3 -> (0,0,0,1,0,0,0) and 0 -> (1,0,0,0,0,0,0) and 5 -> (0,0,0,0,0,1,0)
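# the labels became floats when they were stacked with the image data; convert them back to integers
labels_num = labels_num.astype(int)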
labels_one_hot = []
for i in labels_num:
  if i==0:
    labels_one_hot.append((1,0,0,0,0,0,0))
  if i==1:
    labels_one_hot.append((0,1,0,0,0,0,0))
  if i==2:
    labels_one_hot.append((0,0,1,0,0,0,0))
  if i==3:
    labels_one_hot.append((0,0,0,1,0,0,0))
  if i==4:
    labels_one_hot.append((0,0,0,0,1,0,0))
  if i==5:
    labels_one_hot.append((0,0,0,0,0,1,0))
  if i==6:
    labels_one_hot.append((0,0,0,0,0,0,1))

labels_one_hot = np.array(labels_one_hot)
print('one hot labels shape: ', labels_one_hot.shape)
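
The chain of `if` statements above can also be written as a lookup into an identity matrix. This is an optional, more compact alternative and only a sketch; `labels_example` is a made-up array, and the labels must be integers for the indexing to work.

```python
import numpy as np

num_classes = 7
labels_example = np.array([3, 0, 5])

# row k of the identity matrix is the one-hot vector for class k
one_hot = np.eye(num_classes, dtype=int)[labels_example]
print(one_hot)
```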
  

In [0]:
# compile the model
import keras.optimizers

# select an optimizer

#sgd = keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
#rmsprop = keras.optimizers.RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0)
#adagrad = keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)
#adadelta = keras.optimizers.Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0)
#adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
#adamax = keras.optimizers.Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
nadam = keras.optimizers.Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)

my_model.compile(optimizer=nadam, loss='categorical_crossentropy', metrics=['accuracy'])

In [0]:
# check that the GPU is available

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

In [0]:
# train the model using keras fit for a small number of epochs (this may take ~2 minutes per epoch without the GPU)
with tf.device('/gpu:0'):
  my_model.fit(data,labels_one_hot, epochs=20, verbose=1, shuffle=True, batch_size=32)

  # evaluate the model
  scores = my_model.evaluate(data,labels_one_hot)

print("%s: %.2f%%" % (my_model.metrics_names[1], scores[1]*100))

In [0]:
# save model weights
my_model.save_weights(path + 'pokemon_model.h5')
print("Saved model to disk")
print("Need to export from Keras before closing session!")

In [0]:
# OR load some pretrained weights instead of training the model using the two cells above
my_model.load_weights(path + 'pokemon_model.h5')
print("Loaded model from disk")

# evaluate the model
scores = my_model.evaluate(data,labels_one_hot)
print("%s: %.2f%%" % (my_model.metrics_names[1], scores[1]*100))

In [0]:
# run the model on some images it hasn't seen

import matplotlib.pyplot as plt
import glob
from time import sleep


# path already ends with '/', so no leading slash is needed
files = glob.glob(path + 'pokemon/*.jpg')
for picture in files:
  image = load_img(picture, target_size=(224, 224))
  
  # convert the image pixels to a numpy array
  im = img_to_array(image)
  
  # reshape data for the model
  im = im.reshape((1, im.shape[0], im.shape[1], im.shape[2]))

  # predict the probability across all output classes
  predicted_label = my_model.predict(im)
  val = np.argmax(predicted_label[0])
  print(poke_type[val])
  print('activation percentage: ', round(100*predicted_label[0][val],4), '%')

  # show the image
  img = plt.imshow(image)
  plt.axis('off')
  plt.show()
  sleep(1)


It may be helpful to learn how to save and load entire models, not just the weights. See the link below for a description:

https://machinelearningmastery.com/save-load-keras-deep-learning-models/

Additional Resources

https://www.learnopencv.com/object-tracking-using-opencv-cpp-python/
https://www.learnopencv.com/deep-learning-based-object-detection-and-instance-segmentation-using-mask-r-cnn-in-opencv-python-c/
https://www.learnopencv.com/find-center-of-blob-centroid-using-opencv-cpp-python/
https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html?highlight=object%20detection
https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html?highlight=object%20detection
https://realpython.com/traditional-face-detection-python/
https://realpython.com/face-recognition-with-python/
https://www.pyimagesearch.com/2017/12/11/image-classification-with-keras-and-deep-learning/
https://www.tensorflow.org/lite/models/object_detection/overview
https://research.google.com/seedbank/seed/tfhub_action_recognition_model
https://research.google.com/seedbank/seed/the_whatif_tool_analyzing_an_image_classifier