# <font style = "color:rgb(50,120,229)">GOTURN: Deep Learning based Object Tracking </font>
In this module, we will learn about a Deep Learning based object tracking algorithm called GOTURN. The original implementation of GOTURN is in Caffe.

We covered the use of GOTURN in the previous module, along with other tracking algorithms like MIL, KCF, MOSSE, CSRT, etc.

To use the GOTURN tracker, you need to download the model weights. You can use this [dropbox link](https://www.dropbox.com/sh/77frbrkmf9ojfm6/AACgY7-wSfj-LIyYcOgUSZ0Ua?dl=0) to download the model. Please keep in mind that the download may take a long time because the file is about 370 MB! Once you have downloaded the weights, keep them in the same folder as the code.
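
Before running the tracker, it can help to confirm that the model files really are next to the code. The sketch below is only a convenience check, not part of GOTURN itself. The file names `goturn.prototxt` and `goturn.caffemodel` are an assumption based on the names OpenCV's GOTURN tracker looks for by default, and `missing_goturn_files` is a hypothetical helper; adjust the names if your download differs.

```python
from pathlib import Path

# Assumed file names (OpenCV's GOTURN tracker defaults); change if yours differ.
GOTURN_FILES = ("goturn.prototxt", "goturn.caffemodel")


def missing_goturn_files(folder="."):
    """Return the expected model files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in GOTURN_FILES if not (folder / name).is_file()]


# Check the current working directory, i.e. the folder the code runs from.
missing = missing_goturn_files()
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All GOTURN model files found.")
```

If nothing is reported missing, creating the tracker (for example with `cv2.TrackerGOTURN_create()` in recent OpenCV builds) should be able to load the model from the working directory.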

The authors have released [a Caffe model for GOTURN](https://github.com/davheld/GOTURN), so you can also try it directly in Caffe.

Next, we will have a quick look at how GOTURN works along with its strengths and weaknesses.


## <font style = "color:rgb(50,120,229)">What is GOTURN?</font>
GOTURN, short for Generic Object Tracking Using Regression Networks, is a Deep Learning based tracking algorithm. [This video](https://www.youtube.com/watch?v=SygkiWNSkWk) explains GOTURN and shows a few results.

Most tracking algorithms are trained in an online manner. In other words, the tracking algorithm learns the appearance of the object it is tracking at runtime.

Therefore, many real-time trackers rely on online learning algorithms that are typically much faster than a Deep Learning based solution.

**GOTURN** changed the way we apply Deep Learning to the problem of tracking by learning the motion of an object in an offline manner. The GOTURN model is trained on thousands of video sequences and does not need to perform any learning at runtime.

## <font style = "color:rgb(50,120,229)">How does GOTURN work?</font>
GOTURN was introduced by David Held, Sebastian Thrun, and Silvio Savarese in their paper titled [“Learning to Track at 100 FPS with Deep Regression Networks”](http://davheld.github.io/GOTURN/GOTURN.pdf).

<center><img src="https://www.learnopencv.com/wp-content/uploads/2018/07/goturn-inputs-ouputs.jpg"/></center>

<center>Figure 1: GOTURN takes two cropped frames as input and outputs the bounding box around the object in the second frame.</center>

&nbsp;
In the first frame (also referred to as the previous frame), the location of the object is known, and the frame is cropped to two times the size of the bounding box around the object. The object in the first cropped frame is always centered.

The location of the object in the second frame (also referred to as the current frame) needs to be predicted. The bounding box used to crop the first frame is also used to crop the second frame. Because the object might have moved, the object is not centered in the second frame.
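
The cropping step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `search_region` is a hypothetical helper, the bounding box is assumed to be in `(x, y, w, h)` pixel format, and the crop is simply clipped at the frame border (so near the border the object will not be perfectly centered).

```python
def search_region(bbox, frame_w, frame_h, scale=2.0):
    """Return the crop (x1, y1, x2, y2) centered on bbox = (x, y, w, h).

    The crop is `scale` times the size of the bounding box and is
    clipped to the frame, which is a simplification for this sketch.
    """
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0            # center of the known box
    half_w, half_h = scale * w / 2.0, scale * h / 2.0
    x1 = max(0, int(round(cx - half_w)))
    y1 = max(0, int(round(cy - half_h)))
    x2 = min(frame_w, int(round(cx + half_w)))
    y2 = min(frame_h, int(round(cy + half_h)))
    return x1, y1, x2, y2


# Known box in the previous frame of a 640x480 video.
prev_bbox = (100, 80, 40, 60)
region = search_region(prev_bbox, frame_w=640, frame_h=480)
print(region)  # (80, 50, 160, 170)

# The SAME region is used for both frames, e.g. with NumPy image arrays:
#   prev_crop = prev_frame[y1:y2, x1:x2]
#   curr_crop = curr_frame[y1:y2, x1:x2]
```

Because both crops share one region, any shift of the object inside the second crop directly reflects how far it moved between the two frames.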

A Convolutional Neural Network (CNN) is trained to predict the location of the bounding box in the second frame.

**<font style="color:rgb(255,0,0)">Note for Beginners</font>**
If you are an absolute beginner, think of the CNN as a black box with many knobs that can be set to different values. When the settings on the knobs are right, the CNN produces the right bounding box. Initially, the settings of the knobs are random. At the time of training, we show the neural network pairs of frames for which we know the location of the object (i.e. bounding boxes). If the CNN makes a mistake, the knobs are changed in a principled way using an algorithm called back propagation so that it gradually stops making as many mistakes. When changing the knob settings no longer improves the results, we say the model is trained.
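
To make the knob analogy concrete, here is a toy example with a single knob `w`. It is not GOTURN's training code: the model, the data, and the squared-error loss are all made up for illustration, and the derivative is written out by hand. In a real CNN, back propagation is what computes this kind of derivative for every knob at once.

```python
# Toy training data: the "right" knob setting is 3.0, since y = 3 * x.
samples = [(1.0, 3.0), (2.0, 6.0)]


def loss(w):
    """Mean squared error of the one-knob model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in samples) / len(samples)


def gradient(w):
    """Derivative of the loss with respect to the knob w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def train(w=0.0, learning_rate=0.1, tolerance=1e-12, max_steps=1000):
    """Nudge the knob against the gradient until the loss stops improving."""
    previous = loss(w)
    for _ in range(max_steps):
        w -= learning_rate * gradient(w)   # small, principled change
        current = loss(w)
        if previous - current < tolerance:  # no meaningful improvement left
            break
        previous = current
    return w


print(round(train(), 3))  # 3.0
```

The stopping rule mirrors the last sentence of the note: once turning the knob no longer reduces the error, training is considered finished.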

## <font style = "color:rgb(50,120,229)">GOTURN Architecture</font>
In the previous section, we just showed the CNN as a black box. Now, let’s see what is inside the box.

<center><img src="https://www.learnopencv.com/wp-content/uploads/2018/07/GOTURN-architecture.jpg"/></center>
<center>Figure 2: GOTURN Architecture</center>

&nbsp;

Figure 2 shows the architecture of GOTURN. As mentioned before, it takes two cropped frames as input. Notice that the previous frame, shown at the bottom, is centered, and our goal is to find the bounding box for the current frame shown on the top.

Both frames pass through a bank of convolutional layers. The layers are simply the first five convolutional layers of the CaffeNet architecture. The outputs of these convolutional layers (i.e. the pool5 features) are concatenated into a single vector. This vector is input to 3 fully connected layers, each with 4096 nodes. The last fully connected layer is finally connected to the output layer containing 4 nodes representing the top-left and bottom-right corners of the bounding box.
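
The four output values describe a box inside the cropped search region, so they have to be mapped back to the full frame before they can be drawn. The sketch below shows one way to do that. It reads the outputs as two corner points `(x1, y1, x2, y2)` and assumes they are expressed as fractions of the crop width and height; that normalization, and the helper name `to_frame_coords`, are assumptions made for illustration rather than the exact convention of any particular GOTURN model.

```python
def to_frame_coords(output, region):
    """Map a box predicted inside the crop back to full-frame pixels.

    output: (x1, y1, x2, y2) corners, assumed to be fractions of the crop size.
    region: (rx1, ry1, rx2, ry2) crop rectangle in full-frame pixels.
    """
    rx1, ry1, rx2, ry2 = region
    crop_w, crop_h = rx2 - rx1, ry2 - ry1
    x1, y1, x2, y2 = output
    return (rx1 + x1 * crop_w, ry1 + y1 * crop_h,
            rx1 + x2 * crop_w, ry1 + y2 * crop_h)


# Example crop rectangle and a prediction covering the middle of the crop.
region = (80, 50, 160, 170)
print(to_frame_coords((0.25, 0.25, 0.75, 0.75), region))
# (100.0, 80.0, 140.0, 140.0)
```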

**<font style="color:rgb(255,0,0)">Note for Beginners</font>**
Whenever you see a bank of convolutional layers and are confused about what it means, think of the layers as filters that change the original image so that information important for solving the problem at hand is retained and unimportant information is thrown away.

The multi-dimensional image (tensor) obtained at the end of the convolutional filters is converted to a long vector of numbers by simply unrolling the tensor. This vector serves as input to a few fully connected layers and finally the output layer. The fully connected layers can be thought of as the learning algorithm that uses the useful information extracted from the images by the convolutional layers to solve the classification or regression problem at hand.

## <font style = "color:rgb(50,120,229)">Strengths and Limitations of GOTURN</font>
Compared to other Deep Learning based trackers, GOTURN is fast. It runs at 100 FPS on a GPU in Caffe and at about 20 FPS in OpenCV on a CPU. Even though the tracker is generic, one can, in theory, achieve superior results on specific objects (say pedestrians) by biasing the training set with the specific kind of object.

I have identified some weaknesses in GOTURN. Please keep in mind that these observations are based on limited tests, so take them with a grain of salt. Also, note that the OpenCV version of GOTURN uses a different model than the Caffe version. The following observations are for the OpenCV version, which was created without any guidance from the authors.

1. **Tracking objects that are not in the training set in the presence of objects that are in the training set:** I was tracking the palm of my hand and as I moved it over my face, the tracker latched on to the face and never recovered. I tried covering my face with my palm just to see if I could get the tracker off my face, but it did not.
Then, I tried tracking my face and occluded it with my hands, but the tracker was able to track the face through the occlusion.
My guess is that there were many more faces in the training set than palms and so it has a problem tracking a hand when a face is in the neighborhood.
This may be a more general problem when there are multiple objects in the scene interacting. The tracker may latch onto objects in the scene which are in the training set when they come close to the tracked objects that are not in the training set.

2. **Tracking part of an object:** It also appears that the tracker would have a hard time tracking a part of an object compared to the entire object. For example, when I tried to use it to track the tip of my finger, it ended up tracking the hand. This is probably because it is not trained on parts of objects, but entire objects.

3. **Lack of Motion information:** Since motion information is not incorporated in the two frame model, if we are tracking an object (say a face) moving in one direction, and it gets partially occluded by a similar object (say another face) moving in the other direction, there is a chance the tracker will latch onto the wrong face. This problem can be fixed by using the first frame as the previous frame.