# The attention center model with OpenVINO™

This notebook demonstrates how to use the [attention center model](https://github.com/google/attention-center/tree/main) with OpenVINO. The model is in the [TensorFlow Lite format](https://www.tensorflow.org/lite), which OpenVINO now supports through its TFLite frontend.

Eye tracking is commonly used in visual neuroscience and cognitive science to answer questions about visual attention and decision making. Computational models that predict where people look have direct applications to a variety of computer vision tasks. The attention center model takes an RGB image as input and returns a 2D point as output. This 2D point is the predicted center of human attention on the image, i.e. the most salient part of the image, which people look at first. This makes it possible to find the most visually salient regions and process them as early as possible. For example, it could be used with the latest generation of image formats (such as [JPEG XL](https://github.com/libjxl/libjxl)), which support encoding the parts that attract attention first. This can improve the user experience, because the image appears to load faster.

The attention center model architecture is described as follows:
> The attention center model is a deep neural net, which takes an image as input, and uses a pre-trained classification network, e.g, ResNet, MobileNet, etc., as the backbone. Several intermediate layers that output from the backbone network are used as input for the attention center prediction module. These different intermediate layers contain different information e.g., shallow layers often contain low level information like intensity/color/texture, while deeper layers usually contain higher and more semantic information like shape/object. All are useful for the attention prediction. The attention center prediction applies convolution, deconvolution and/or resizing operator together with aggregation and sigmoid function to generate a weighting map for the attention center. And then an operator (the Einstein summation operator in our case) can be applied to compute the (gravity) center from the weighting map. An L2 norm between the predicted attention center and the ground-truth attention center can be computed as the training loss. Source: [google AI blogpost](https://opensource.googleblog.com/2022/12/open-sourcing-attention-center-model.html).

<img align='center' src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjxLCDJHzJNjB_von-vFlq8TJJFA41aB85T-QE3ZNxW8kshAf3HOEyIEJ4uggXjbJmZhsdj7j6i6mvvmXtyaxXJPm3JHuKILNRTPfX9KvICbFBRD8KNuDVmLABzYuhQci3BT2BqV-wM54IxaoAV1YDBbnpJC92UZfEBGvakLusiqND2AaPpWPr2gJV1/s1600/image4.png" alt="drawing" width="80%"/>

The attention center model has been trained with images from the [COCO dataset](https://cocodataset.org/#home) annotated with saliency from the [SALICON dataset](http://salicon.net/).
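
The gravity-center step mentioned in the architecture description can be illustrated with a small NumPy sketch. This is not the model's actual implementation: the function name `gravity_center` and the toy weighting map are hypothetical, and the sketch only assumes that the center is the weight-averaged pixel coordinate of a non-negative 2D map.

```python
import numpy as np


def gravity_center(weight_map):
    """Return the (x, y) center of mass of a non-negative 2D weighting map.

    Hypothetical helper for illustration only; the real model computes
    its center inside the network.
    """
    weights = np.asarray(weight_map, dtype=np.float64)
    height, width = weights.shape
    ys = np.arange(height, dtype=np.float64)  # row index = y coordinate
    xs = np.arange(width, dtype=np.float64)   # column index = x coordinate
    total = weights.sum()
    # Einstein summation contracts the map against each coordinate axis.
    center_x = np.einsum("ij,j->", weights, xs) / total
    center_y = np.einsum("ij,i->", weights, ys) / total
    return float(center_x), float(center_y)


# Toy map with two equally weighted peaks in the same row.
toy_map = np.zeros((4, 6))
toy_map[1, 2] = 1.0
toy_map[1, 4] = 1.0
print(gravity_center(toy_map))  # → (3.0, 1.0)
```

With two equally weighted peaks, the computed center falls halfway between them rather than on either peak.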


The tutorial consists of the following steps:
* Downloading the model
* Converting the model to OpenVINO IR and running inference with the OpenVINO API
* Running live attention center detection

## Imports

In [None]:
import time
import cv2
import sys
import collections

import numpy as np
import tensorflow as tf
from pathlib import Path
from IPython import display
import matplotlib.pyplot as plt

from openvino.tools import mo
from openvino.runtime import serialize, Core

sys.path.append("../utils")
import notebook_utils as utils

## Download the attention-center model

Download the model as part of the [attention-center repo](https://github.com/google/attention-center/tree/main). The repository contains the model in the `./model` folder.

In [None]:
if not Path('./attention-center').exists():
    ! git clone https://github.com/google/attention-center

### Convert TensorFlow Lite model to OpenVINO IR format

The attention-center model is a pre-trained model in TensorFlow Lite format. In this notebook, the model is converted to OpenVINO IR format with Model Optimizer. This step is skipped if the model has already been converted. For more information about Model Optimizer, see the [Model Optimizer Developer Guide](https://docs.openvino.ai/2023.0/openvino_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html).

The TFLite model format is also supported in OpenVINO through the TFLite frontend, so the model can be passed directly to `core.read_model()`. You can find an example in [002-openvino-api](https://github.com/openvinotoolkit/openvino_notebooks/tree/main/notebooks/002-openvino-api).
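
As a minimal sketch of that direct path, the helper below reads a `.tflite` file with `core.read_model()` and compiles it without an intermediate IR file. The function name `read_tflite_directly` is hypothetical, and the OpenVINO import is placed inside the function only so that the helper can be defined before OpenVINO is needed.

```python
from pathlib import Path


def read_tflite_directly(tflite_path, device="CPU"):
    """Read a TFLite model with the OpenVINO TFLite frontend and compile it.

    Hypothetical helper for illustration; it skips the IR conversion step.
    """
    # Imported lazily so that defining the helper does not require OpenVINO.
    from openvino.runtime import Core

    core = Core()
    # Pass the path as a string; the frontend is selected from the file content.
    model = core.read_model(Path(tflite_path).as_posix())
    return core.compile_model(model=model, device_name=device)
```

For example, `read_tflite_directly("./attention-center/model/center.tflite")` would return a compiled model, assuming the repository has been cloned as shown above.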

In [None]:
tflite_model_path = Path("./attention-center/model/center.tflite")

ir_model_path = Path("./model/ir_center_model.xml")

core = Core()

if not ir_model_path.exists():
    model = mo.convert_model(tflite_model_path)
    serialize(model, ir_model_path.as_posix())
    print("IR model saved to {}".format(ir_model_path))
else:
    print("Read IR model from {}".format(ir_model_path))
    model = core.read_model(ir_model_path)

device = "CPU"
compiled_model = core.compile_model(model=model, device_name=device)

## Prepare image to use with attention-center model

The attention-center model takes an RGB image with shape (480, 640), that is, height 480 and width 640, as input.

In [None]:
class Image():
    def __init__(self, model_input_image_shape, image_path=None, image=None):
        self.model_input_image_shape = model_input_image_shape
        self.image = None
        self.real_input_image_shape = None

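        if image_path is not None:
            self.image = cv2.imread(image_path)
            # cv2.imread returns None instead of raising when the file cannot be read.
            if self.image is None:
                raise FileNotFoundError(f"Cannot read image from {image_path}")
            self.real_input_image_shape = self.image.shape
        elif image is not None:
            self.image = image
            self.real_input_image_shape = self.image.shape
        else:
            raise ValueError("No image provided: specify image_path or image")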

    def prepare_image_tensor(self):
        rgb_image = cv2.cvtColor(self.image, cv2.COLOR_BGR2RGB)
        resized_image = cv2.resize(rgb_image, (self.model_input_image_shape[1], self.model_input_image_shape[0]))

        image_tensor = tf.constant(np.expand_dims(resized_image, axis=0),
                                   dtype=tf.float32)
        return image_tensor

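    def scalt_center_to_real_image_shape(self, predicted_center):
        # real_input_image_shape is (height, width, channels); model_input_image_shape is (height, width).
        # The first coordinate is scaled by the width ratio (x), the second by the height ratio (y).
        center_x = round(predicted_center[0] * self.real_input_image_shape[1] / self.model_input_image_shape[1])
        center_y = round(predicted_center[1] * self.real_input_image_shape[0] / self.model_input_image_shape[0])
        # cv2.circle expects plain Python integers in (x, y) order.
        return (int(center_x), int(center_y))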

    def draw_attention_center_point(self, predicted_center):
        image_with_circle = cv2.circle(self.image,
                                       predicted_center,
                                       radius=10,
                                       color=(3, 3, 255),
                                       thickness=-1)
        return image_with_circle

    def print_image(self, predicted_center=None):
        image_to_print = self.image
        if predicted_center is not None:
            image_to_print = self.draw_attention_center_point(predicted_center)

        plt.imshow(cv2.cvtColor(image_to_print, cv2.COLOR_BGR2RGB))

image_file_name = Path("../data/image/coco.jpg")
input_image = Image((480, 640), image_file_name.as_posix())
image_tensor = input_image.prepare_image_tensor()
input_image.print_image()

## Get result with OpenVINO IR model

In [None]:
output_layer = compiled_model.output(0)

# Run inference; the predicted point is in model input resolution.
res = compiled_model([image_tensor])[output_layer]
# Scale the point to the original image resolution.
predicted_center = input_image.scalt_center_to_real_image_shape(res[0])
print(f'Predicted attention center point {predicted_center}')
input_image.print_image(predicted_center)
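
The rescaling from model input resolution to the original image resolution can be checked by hand with a small standalone sketch. The helper name `scale_point` is hypothetical, and the sketch assumes, as the drawing code in this notebook does, that the predicted point is given in (x, y) order in model input pixels.

```python
def scale_point(point_xy, model_hw, image_hw):
    """Scale an (x, y) point from model input size to the original image size.

    Hypothetical helper; both sizes are given as (height, width).
    """
    x, y = point_xy
    model_h, model_w = model_hw
    image_h, image_w = image_hw
    # x scales with the width ratio, y scales with the height ratio.
    return (int(round(x * image_w / model_w)), int(round(y * image_h / model_h)))


# A point at the middle of the 480x640 input maps to the middle of a 960x1280 image.
print(scale_point((320.0, 240.0), (480, 640), (960, 1280)))  # → (640, 480)
```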

## Live attention center detection

Use a webcam as the video input. By default, the primary webcam is set with `source=0`. If you have multiple webcams, each one will be assigned a consecutive number starting at 0. Set `flip=True` when using a front-facing camera. Some web browsers, especially Mozilla Firefox, may cause flickering. If you experience flickering, set `use_popup=True`.

>**NOTE**: To use this notebook with a webcam, you need to run the notebook on a computer with a webcam. If you run the notebook on a server (for example, Binder), the webcam will not work. Popup mode may not work if you run this notebook on a remote computer (for example, Binder).


In [None]:
def run_live_attention_center_detection(source=0,
                                        flip=False,
                                        use_popup=False,
                                        skip_first_frames=0,
                                        model=model,
                                        device='CPU'):
    player = None
    compiled_model = core.compile_model(model, device)
    try:
        # Create a video player to play with target fps.
        player = utils.VideoPlayer(
            source=source, flip=flip, fps=30, skip_first_frames=skip_first_frames
        )
        # Start capturing.
        player.start()
        if use_popup:
            title = "Press ESC to Exit"
            cv2.namedWindow(
                winname=title, flags=cv2.WINDOW_GUI_NORMAL | cv2.WINDOW_AUTOSIZE
            )

        processing_times = collections.deque()
        while True:
            # Grab the frame.
            frame = player.next()
            if frame is None:
                print("Source ended")
                break

            # prepare the image, reshape it and change color format
            image = Image((480, 640), image=frame)
            image_tensor = image.prepare_image_tensor()

            output_layer = compiled_model.output(0)

            # make inference
            start_time = time.time()
            res = compiled_model([image_tensor])[output_layer]
            stop_time = time.time()
            
            # draw the attention center point on image
            predicted_center = image.scalt_center_to_real_image_shape(res[0])
            frame = image.draw_attention_center_point(predicted_center)

            processing_times.append(stop_time - start_time)
            # Use processing times from last 200 frames.
            if len(processing_times) > 200:
                processing_times.popleft()

            _, f_width = frame.shape[:2]
            # Mean processing time [ms].
            processing_time = np.mean(processing_times) * 1000
            fps = 1000 / processing_time
            cv2.putText(
                img=frame,
                text=f"Inference time: {processing_time:.1f}ms ({fps:.1f} FPS)",
                org=(20, 40),
                fontFace=cv2.FONT_HERSHEY_COMPLEX,
                fontScale=f_width / 1000,
                color=(0, 0, 255),
                thickness=1,
                lineType=cv2.LINE_AA,
            )
            # Use this workaround if there is flickering.
            if use_popup:
                cv2.imshow(winname=title, mat=frame)
                key = cv2.waitKey(1)
                # escape = 27
                if key == 27:
                    break
            else:
                # Encode numpy array to jpg.
                _, encoded_img = cv2.imencode(
                    ext=".jpg", img=frame, params=[cv2.IMWRITE_JPEG_QUALITY, 100]
                )
                # Create an IPython image.
                i = display.Image(data=encoded_img)
                # Display the image in this notebook.
                display.clear_output(wait=True)
                display.display(i)
    # ctrl-c
    except KeyboardInterrupt:
        print("Interrupted")
    # Any other runtime error.
    except RuntimeError as e:
        print(e)
    finally:
        if player is not None:
            # Stop capturing.
            player.stop()
        if use_popup:
            cv2.destroyAllWindows()
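
The loop above reports the mean inference time over the most recent 200 frames. The same bookkeeping can be sketched in isolation; the helper name `rolling_fps` is hypothetical, and it uses `collections.deque(maxlen=...)` as one possible alternative to the explicit `popleft()` call.

```python
import collections


def rolling_fps(durations_s, window=200):
    """Return (mean time in ms, FPS) over the last `window` durations in seconds.

    Hypothetical helper for illustration only.
    """
    # A deque with maxlen discards the oldest entry automatically.
    recent = collections.deque(maxlen=window)
    for duration in durations_s:
        recent.append(duration)
    mean_ms = sum(recent) / len(recent) * 1000
    return mean_ms, 1000 / mean_ms


# The slow first frame falls out of a 3-frame window.
print(rolling_fps([1.0, 0.25, 0.25, 0.25], window=3))  # → (250.0, 4.0)
```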

## Run live attention center detection

Note that an image may contain several visually important regions. In that case, the attention center point will be placed in the middle between them.


In [None]:
run_live_attention_center_detection(source=0, flip=True, use_popup=False, model=model, device=device)