TLD: Track Learn Detect
This algorithm, also known as "Predator" algorithm, developed by Zdenek Kalal. For more information, please visit his homepage: http://info.ee.surrey.ac.uk/Personal/Z.Kalal/tld.html
How it works?
This is a long story, please read Zdenek's paper. Here is how it works in command-line if you compiled ccv with FFMPEG support:
./tld <Your Video> x y width height
It will output each tracking coordinates for each frame.
What about performance?
TLD is implemented closely after Zdenek's paper, but still, varies in quite a few aspects significantly. I've done excessive tests to make sure performance, in terms of accuracy and speed matches the original implementation.
TLD uses randomization algorithm, thus, the result can vary from time to time, I managed to run ccv's TLD implementation on test videos with "rotation == 0" and default parameters. With 3 runs and then pick the median, I've able to generate some meaningful data to analyze on.
detections : 901 true detections : 1412 correct detections : 833 precision : 0.924528 recall : 0.589943 f-measure : 0.720277
The result on the same video reported in: Zdenek Kalal, Jiri Matas and Krystian Mikolajczyk, Online Learning of Robust Object Detectors during Unstable Tracking:
precision : 0.96 recall : 0.54
After 69th frame failed to recover (out of 140 frames)
The result on the same video reported in: Zdenek Kalal, Jiri Matas and Krystian Mikolajczyk, P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints:
After 27th frame failed to recover (out of 140 frames)
Note that a few runs I can get outperformed results than Zdenek's implementation sometimes, but choose to ignore these instead.
All these results are obtained with alantrrs' evaluate_vis.py script in https://github.com/alantrrs/OpenTLD/blob/master/datasets/evaluate_vis.py and the dataset in that repository. Thanks alantrrs!
By enable "rotation" technique, you can achieve near real-time performance on QVGA video, with minor accuracy loss. With "rotation == 1" (default parameter), TLD spends around 15ms on tracking, 50ms on detecting, 50ms on learning for 320x240 video on single thread of i7-2620M 2.7GHz.
Under the hood?
ccv's TLD implementation varies from Zdenek's original Matlab implementation in several significant ways:
Zdenek's implementation uses a smaller LK window for computation (5x5), whereas ccv's implementation uses a 15x15 window for such.
2). Ferns Detection (Random Forest):
Zdenek's implementation uses random forest for object detection (in short, the probability for each feature add up), whereas ccv's implementation uses ferns for object detection (using multiplication of probabilities, A.K.A. semi-naive Bayes classifier). To compensate such choice, ccv's implementation uses 40 ferns, and for each fern, uses 18 features (the default parameter), and the default ferns threshold for ccv's implementation is 0.
3). Nearest-neighbor Classifier:
Zdenek's implementation uses aspect-ratio normalized examples (15x15); these examples are normalized so that a simple multiply can yield correlation confidence. ccv's implementation uses aspect-aware examples (constraint to area size of 400); examples are left as it is and using normalized coefficient computation to get confidence score.
4). Pseudo-random Number Generator:
Zdenek's implementation uses srand() for random number generation, and seed it with 0. ccv's implementation uses a Mersenne-Twister random number generator with an environment-dependent seed.