
Can someone clarify the anchor box concept used in Yolo? #568

Open
hbzhang opened this issue Mar 26, 2018 · 66 comments

Comments

@hbzhang

hbzhang commented Mar 26, 2018

I know this might be too simple for many of you, but I cannot seem to find good literature that clearly and definitively illustrates the idea and concept of anchor boxes in Yolo (V1, V2, and V3). Thanks!

@vkmenon

vkmenon commented Mar 26, 2018

Here's a quick explanation based on what I understand (which might be wrong but hopefully gets the gist of it). After doing some clustering studies on ground truth labels, it turns out that most bounding boxes have certain height-width ratios. So instead of directly predicting a bounding box, YOLOv2 (and v3) predict offsets from a predetermined set of boxes with particular height-width ratios - those predetermined boxes are the anchor boxes.

@AlexeyAB
Collaborator

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c

Lines 88 to 89 in 6f6e475

b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;

  • x[...] - outputs of the neural network

  • biases[...] - anchors

  • b.w and b.h - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).
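
To make the formula above concrete, here is a minimal Python sketch of the v3-style decode (not the darknet code itself; the names and the 416 input size are illustrative assumptions):

```python
import math

def decode_wh(t_w, t_h, anchor_w, anchor_h, net_w=416, net_h=416):
    """Turn raw outputs (t_w, t_h) into a box size, as in the snippet above.

    anchor_w/anchor_h are in network-input pixels (YOLOv3 convention);
    the result is normalized to 0..1, like b.w and b.h.
    """
    b_w = math.exp(t_w) * anchor_w / net_w
    b_h = math.exp(t_h) * anchor_h / net_h
    return b_w, b_h

# The network only nudges the nearest anchor, e.g. the 116x90 one:
print(decode_wh(0.2, -0.1, 116, 90))  # roughly (0.34, 0.20) of the image
```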

@hbzhang
Author

hbzhang commented Mar 28, 2018

Thanks!

@spinoza1791

For YoloV2 (5 anchors) and YoloV3 (9 anchors) is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors or could I potentially get higher IoU with more?

@CageCode

CageCode commented Apr 18, 2018

For YoloV2 (5 anchors) and YoloV3 (9 anchors) is it advantageous to use more anchors? For example, if I have one class (face), should I stick with the default number of anchors or could I potentially get higher IoU with more?

I was wondering the same. The more anchors used, the higher the IoU; see (https://medium.com/@vivek.yadav/part-1-generating-anchor-boxes-for-yolo-like-network-for-vehicle-detection-using-kitti-dataset-b2fe033e5807).
However, when you try to detect one class, which often shows the same object aspect ratios (like faces), I don't think that increasing the number of anchors is going to increase the IoU by a lot, while the computational overhead is going to increase significantly.
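
If it helps, one way to check this empirically is to compare the mean best shape-IoU of your ground-truth box sizes against different anchor sets; a small sketch (gt_sizes and the anchor lists are placeholders you would fill in, with boxes and anchors in the same units):

```python
def shape_iou(box, anchor):
    """IoU of two (w, h) pairs treated as boxes sharing the same centre."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union

def mean_best_iou(gt_sizes, anchors):
    """Average, over ground-truth boxes, of the best IoU any anchor achieves."""
    return sum(max(shape_iou(g, a) for a in anchors) for g in gt_sizes) / len(gt_sizes)

# gt_sizes = [(w, h), ...] from your labels; compare e.g. 5 vs 9 vs 12 anchors:
# print(mean_best_iou(gt_sizes, anchors_5), mean_best_iou(gt_sizes, anchors_9))
```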

@fkoorc

fkoorc commented Apr 22, 2018

I used YOLOv2 to detect some industrial meter boards a few weeks ago and tried the same idea spinoza1791 and CageCode referred to.
The reason was that I needed high accuracy but also wanted to stay close to real time, so I thought of changing the number of anchors (YOLOv2 -> 5), but it all ended up crashing after about 1800 iterations.
So I might be missing something there.

@frozenscrypt

frozenscrypt commented Sep 10, 2018

@AlexeyAB How do you get the initial anchor box dimensions after clustering? The width and height after clustering are all numbers less than 1, but the anchor box dimensions are greater than 1. How do you get the anchor box dimensions?
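
In case it helps, the usual conversion is just a rescale of the normalized centroids; a sketch, assuming the clustering was run on YOLO-format labels (w, h in 0..1):

```python
def to_anchor_units(centroids, net_size=416, stride=32, yolo_version=3):
    """Scale normalized k-means centroids (w, h in 0..1) to anchor units.

    YOLOv2 anchors are in final-feature-map cells (net_size / stride, e.g. 13),
    YOLOv3 anchors are in network-input pixels (net_size, e.g. 416).
    """
    scale = net_size if yolo_version == 3 else net_size // stride
    return [(w * scale, h * scale) for w, h in centroids]

# to_anchor_units([(0.083, 0.092)], yolo_version=2)  # -> [(1.08, 1.19)] roughly
```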

@saiteja011

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c

Lines 88 to 89 in 6f6e475
b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;

* `x[...]` - outputs of the neural network

* `biases[...]` - anchors

* `b.w` and `b.h` - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

great explanation bro. thank you.

@andyrey

andyrey commented Nov 5, 2018

Sorry, the phrase is still unclear:
"In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map"
What are the "final feature map" sizes?
For yolo-voc.2.0.cfg the input image size is 416x416,
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52.
I get it - each pair represents an anchor width and height, centered in every one of the 13x13 cells.
The last anchor - 16.62 (width?), 10.52 (height?) - what units are they in? Can somebody explain literally
with this example?
And maybe someone has uploaded code for deducing the best anchors from a given dataset with K-means?
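
Not the official script, but here is a rough sketch of the idea behind gen_anchors.py / calc_anchors: k-means on the label widths and heights with 1 - IoU as the distance (this assumes the labels are already parsed into normalized (w, h) pairs; for a v2 cfg like the one above, multiply the resulting centroids by 13):

```python
import random

def shape_iou(a, b):
    """IoU of two (w, h) boxes sharing the same centre."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(wh_pairs, k=5, iters=200, seed=0):
    """k-means on (w, h) pairs using 1 - IoU as the distance metric."""
    random.seed(seed)
    centroids = random.sample(wh_pairs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for wh in wh_pairs:
            best = max(range(k), key=lambda i: shape_iou(wh, centroids[i]))
            clusters[best].append(wh)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centroids, key=lambda wh: wh[0] * wh[1])

# wh_pairs = [(w, h), ...] read from the YOLO .txt label files (values in 0..1)
# anchors = kmeans_anchors(wh_pairs, k=5)   # then scale: x13 for v2, x416 for v3
```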

@fkoorc

fkoorc commented Nov 9, 2018

I think maybe your anchors have some error. In yolo2 the anchor size is based on the final feature map (13x13), as you said.
So the anchor width and height must be smaller than 13x13.
But in yolo3 the author changed the anchor size to be based on the initial input image size.
As the author said:
"In YOLOv3 anchor sizes are actual pixel values. this simplifies a lot of stuff and was only a little bit harder to implement"
Hope I am not missing anything :)

@jalaldev1980

Dears,

Is it necessary to compute the anchor values before training to enhance the model?

I am building my own dataset to detect 6 classes using tiny yolov2, and I used the command below to get the anchor values.
Do I need to change the width and height here if I am changing them in the cfg file?

Are the anchors below acceptable, or are the values too large?
What is num_of_clusters 9?

....\build\darknet\x64>darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416

num_of_clusters = 9, width = 416, height = 416
read labels from 8297 images
loaded image: 2137 box: 7411

Wrong label: data/obj/IMG_0631.txt - j = 0, x = 1.332292, y = 1.399537, width = 0.177083, height = 0.412037
loaded image: 2138 box: 7412
calculating k-means++ ...

avg IoU = 59.41 %

Saving anchors to the file: anchors.txt
anchors = 19.2590,25.4234, 42.6678,64.3841, 36.4643,117.4917, 34.0644,235.9870, 47.0470,171.9500, 220.3569,59.5293, 48.2070,329.3734, 99.0149,240.3936, 165.5850,351.2881

@fkoorc

fkoorc commented Nov 21, 2018

Getting the anchor values first makes training faster, but it is not necessary.
Tiny yolo is not very accurate; if you can, I suggest you use yolov2.

@andyrey

andyrey commented Nov 21, 2018

@jalaldev1980
Just guessing - where did you find this calc_anchors flag for your command line? I didn't find it in YOLO-2;
maybe it is in YOLO-3?

@developer0hye

@jalaldev1980
Just guessing - where did you find this calc_anchors flag for your command line? I didn't find it in YOLO-2;
maybe it is in YOLO-3?

./darknet detector calc_anchors your_obj.data -num_of_clusters 9 -width 416 -height 416

@jalaldev1980

jalaldev1980 commented Nov 23, 2018 via email

@NadimSKanaan

Can someone provide some insights into YOLOv3's time complexity if we change the number of anchors?

@CoinCheung

Hi guys,

I got to know that yolov3 employs 9 anchors, but there are three layers used to generate yolo targets. Does this mean each yolo target layer should have 3 anchors at each feature point according to its scale, as is done in FPN, or do we need to match all 9 anchors with one gt on all 3 yolo output layers?

@andyrey

andyrey commented Jan 10, 2019

I use a single set of 9 anchors for all 3 layers in the cfg file; it works fine.
I believe this set is for one base scale and is rescaled in the other 2 layers somewhere in the framework code.
Let someone correct me if I am wrong.

@weiaicunzai

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c
Lines 88 to 89 in 6f6e475
b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Thanks, but why do darknet's yolov3 config files https://github.com/pjreddie/darknet/blob/master/cfg/yolov3-voc.cfg and https://github.com/pjreddie/darknet/blob/master/cfg/yolov3.cfg have different input sizes (416 and 608) but use the same anchor sizes, if yolo v3 anchors are sizes of objects on the image resized to the network size?

@andyrey

andyrey commented Jan 14, 2019

@weiaicunzai
You are right, the 2 cfg files with different input sizes (416 and 608) have the same anchor box sizes. Seems to be a mistake. As for me, I use a utility to find anchors specific to my dataset; it increases accuracy.

@CoinCheung

Hi, here I have an anchor question please:
If I did not misunderstand the paper, there is also a positive-negative mechanism in yolov3, but only when we compute the confidence loss, since xywh and classification rely only on the best match. Thus the xywh loss and classification loss are computed with the gt and only one associated match. As for the confidence, the division into positive and negative is based on the IoU value. Here my question is: is this IoU computed between the gt and the anchors, or between the gt and the predictions which are computed from the anchors and the model outputs (the offsets generated by the model)?
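
For what it's worth, my reading of yolo_layer.c is that both comparisons exist: the decoded predictions are compared with the gt for the objectness/ignore logic, while the choice of which anchor is responsible for a gt is a shape-only IoU between the gt box and the raw anchors (both moved to the origin). A sketch of that second part, to be double-checked against the source:

```python
def shape_iou(wh_a, wh_b):
    """IoU of two (w, h) boxes centred at the origin (shape-only IoU)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    return inter / (wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter)

def best_anchor_for_gt(gt_wh, anchors):
    """Index of the anchor whose shape best matches the ground-truth box."""
    return max(range(len(anchors)), key=lambda i: shape_iou(gt_wh, anchors[i]))

# anchors = [(10, 13), (16, 30), (33, 23), ...]  # v3 anchors, input pixels
# best_anchor_for_gt((35, 28), anchors)          # -> 2, i.e. the (33, 23) anchor
```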

@Sauraus

Sauraus commented Jan 15, 2019

Say I have a situation where all my objects that I need to detect are of the same size 30x30 pixels on an image that is 295x295 pixels, how would I go about calculating the best anchors for yolo v2 to use during training?

@andyrey

andyrey commented Jan 16, 2019

@Sauraus
There is a special Python program (see AlexeyAB's reference on github) which calculates the 5 best anchors based on your dataset variety (for YOLO-2). Very easy to use. Then replace the anchors string in your cfg file with the new anchor boxes. If you have same-size objects, it would probably give you a set of nearly identical pairs of numbers.

@Sauraus

Sauraus commented Jan 16, 2019

@andyrey are you referring to this: https://github.com/AlexeyAB/darknet/blob/master/scripts/gen_anchors.py by any chance?

@andyrey

andyrey commented Jan 16, 2019

@Sauraus:
Yes, I used this for YOLO-2 with cmd:
python gen_anchors.py -filelist train.txt -output_dir ./ -num_clusters 5

and for 9 anchors for YOLO-3 I used C-language darknet:
darknet3.exe detector calc_anchors obj.data -num_of_clusters 9 -width 416 -height 416 -showpause

@pkhigh

pkhigh commented Feb 15, 2019

Is anyone facing an issue with YoloV3 prediction where occasionally the bounding box centre is negative or the overall bounding box height/width exceeds the image size?

@Sauraus

Sauraus commented Feb 18, 2019

Yes and it's driving me crazy.

Is anyone facing an issue with YoloV3 prediction where occasionally the bounding box centre is negative or the overall bounding box height/width exceeds the image size?

@fkoorc

fkoorc commented Feb 19, 2019

I think the bounding box is hard to fit precisely to your target.
There is always some deviation; the question is just how large the error is.
If the error is very large, maybe you should check your training data and test data.
But there are still so many possible reasons that could cause that.
Maybe you can post your picture?

@gouravsinghbais

Can anyone explain the process flow? I am getting different concepts from different sources. I am not clear whether Yolo first divides the images into n x n grids and then does the image classification, or classifies the object in one pass. It would be very helpful if someone explained the process from the start.

@klopezlinar

Hi all,

We're struggling to get our Yolov3 working for a 2 class detection problem (the sizes of the objects of both classes vary and are similar, generally small, and the size itself does not help differentiate the object type). We think that the training is not working due to some problem with the anchor boxes, since we can clearly see that depending on the assigned anchor values the yolo_output_0, yolo_output_1 or yolo_output_2 fail to return a loss value different from 0 (for the xy, hw and class components). However, even if there are multiple threads about anchor boxes, we cannot find a clear explanation about how they are assigned specifically for YOLOv3.

So far, what we're doing to know the size of the boxes is:
1- We run a clustering method on the normalized ground truth bounding boxes (according to the original size of the image) and get the centroids of the clusters. In our case, we have 2 clusters and the centroids are something about (0.087, 0.052) and (0.178, 0.099).
2- Then we rescale the values according to the rescaling we are going to apply to the images during training. We are working with rectangular images of (256, 416), so we get bounding boxes of (22,22) and (46,42). Note that we have rounded the values as we have read that yoloV3 expects actual pixel values.
3- Since we compute anchors at 3 different scales (3 skip connections), the previous anchor values will correspond to the large scale (52). The anchors for the other two scales (13 and 26) are calculated by dividing the first anchors by 2 and by 4.

We are not even sure if we are correct up to this point. If we look at the code in the original models.py what we see is the following:

yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119), (116, 90), (156, 198), (373, 326)], np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

So, there are 9 anchors, which are ordered from smaller to larger, and the anchor_masks determine the resolution at which they are used - is this correct? In fact, our first question is: are they 9 anchors, or 3 anchors at 3 different scales? If so, how are they calculated? We know about the gen_anchors script in yolo_v2 and a similar script in yolov3, however we don't know if they calculate 9 clusters and then order them according to size, or if they follow a procedure similar to ours.

Additionally, we don't fully understand why these boxes are divided by 416 (the image size). This would mean having anchors that are not integers (pixel values), which was stated to be necessary for yolov3.

We would be really grateful if someone could provide us with some insight into these questions and help us better understanding how yoloV3 performs.

Thanks and regards
Karen
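
Not an authoritative answer, but here is how the models.py snippet above is usually read: it is one jointly clustered set of 9 anchors sorted by area, the masks just split them across the 3 output scales, and the division by 416 only expresses the pixel anchors as fractions of the network input. An illustrative sketch:

```python
import numpy as np

yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416      # pixels -> fraction of net input
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

# The coarse 13x13 grid gets the 3 largest anchors (indices 6,7,8), the 26x26
# grid the middle 3, and the 52x52 grid the 3 smallest; nothing is re-clustered
# per scale.
for grid, mask in zip((13, 26, 52), yolo_anchor_masks):
    print(grid, yolo_anchors[mask] * 416)      # back to pixels for that scale
```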

@andyrey

andyrey commented Apr 6, 2020

Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors, regardless of the number of classes. Each scale of the net uses 3 of them (3x3=9).
Look at the lines mask = 0,1,2, then mask = 3,4,5, and mask = 6,7,8 in the cfg file.

@Sauraus

Sauraus commented Apr 6, 2020

When you say small can you quantify that? From experience I can say that YOLO V2/3 is not great on images below 35x35 pixels.

@klopezlinar

Why do you use 2 clusters for your dataset? In YOLO-3 you can prepare 9 anchors, regardless of the number of classes. Each scale of the net uses 3 of them (3x3=9).
Look at the lines mask = 0,1,2, then mask = 3,4,5, and mask = 6,7,8 in the cfg file.

Thanks for your response. We use 2 because if we look at our data the sizes of our bounding boxes can be clustered into 2 groups (even one would be enough), so we don't need to use 3 of them. We do not set 2 anchor boxes because of the number of classes.

@klopezlinar

klopezlinar commented Apr 6, 2020

When you say small can you quantify that? From experience I can say that YOLO V2/3 is not great on images below 35x35 pixels.

Hi Sauraus, thanks for your response. The original size of our images is something about (2000-5000)x(4800-7000), and the average size of the object bounding boxes is 300x300. Do you think this is a problem?

@andyrey

andyrey commented Apr 6, 2020

As I understood, your dataset objects differ only in size? Then you should detect all of them as 1 class and differentiate them with simple size threshold.

@klopezlinar

As I understood, your dataset objects differ only in size? Then you should detect all of them as 1 class and differentiate them with simple size threshold.

No, they don't differ in size, they differ in content/appearance

@Sauraus

Sauraus commented Apr 6, 2020

As I understood, your dataset objects differ only in size? Then you should detect all of them as 1 class and differentiate them with simple size threshold.

No, they don't differ in size, they differ in content/appearance

Content = class (cat/dog/horse etc.)
Appearance = variance in class (black/red/brown cat)

Is that how you classify those words? :)

@klopezlinar

As I understood, your dataset objects differ only in size? Then you should detect all of them as 1 class and differentiate them with simple size threshold.

No, they don't differ in size, they differ in content/appearance

Content = class (cat/dog/horse etc.)
Appearance = variance in class (black/red/brown cat)

Is that how you classify those words? :)

We have breast masses, some of them malignant, some of them benign. Our classes then are "malignant" and "benign".

@andyrey

andyrey commented Apr 6, 2020

Does that mean you deal with gray-scale pictures, with the content occupying the whole picture area, so that you have to classify the structure of the tissue without detecting some compact objects in it? Can you refer me to such pictures?

@klopezlinar

Does that mean you deal with gray-scale pictures, with the content occupying the whole picture area, so that you have to classify the structure of the tissue without detecting some compact objects in it? Can you refer me to such pictures?

Yes, they are grayscale images (we have already changed the code for 1 channel). The content usually occupies half the image, so we are also trying to crop it in order to reduce the amount of background. The objects to detect are masses, sometimes compact, sometimes more disperse. Then, from a clinical point of view, according to some characteristics of the masses (borders, density, shape...) they are classified as malignant or benign. Here you have some sample images (resized to 216*416):

[two sample images attached]

@andyrey

andyrey commented Apr 7, 2020

These objects (tumors) can be of different sizes. So you shouldn't restrict yourself to 2 anchor sizes, but use as many as possible, that is 9 in our case. If this is redundant, the clustering program will yield 9 closely sized anchors; it is not a problem. What is more important, this channel is probably not 8-bit but deeper, and quantizing from 16 to 8 bits may lose valuable information. Or maybe split the 16 bits into two different channels - I don't know, but this is an issue to think about...

@klopezlinar

These objects (tumors) can be of different sizes. So you shouldn't restrict yourself to 2 anchor sizes, but use as many as possible, that is 9 in our case. If this is redundant, the clustering program will yield 9 closely sized anchors; it is not a problem. What is more important, this channel is probably not 8-bit but deeper, and quantizing from 16 to 8 bits may lose valuable information. Or maybe split the 16 bits into two different channels - I don't know, but this is an issue to think about...

Ok, we will try with the 9 anchors. Regarding the 16-bit, we are using tf2 so that's not a problem I think...

Now we are able to detect some masses, but only when we lower the score_threshold in the detection.

@ameeiyn

ameeiyn commented Apr 12, 2020

So far, what we're doing to know the size of the boxes is:
1- We run a clustering method on the normalized ground truth bounding boxes (according to the original size of the image) and get the centroids of the clusters. In our case, we have 2 clusters and the centroids are something about (0.087, 0.052) and (0.178, 0.099).
2- Then we rescale the values according to the rescaling we are going to apply to the images during training. We are working with rectangular images of (256, 416), so we get bounding boxes of (22,22) and (46,42). Note that we have rounded the values as we have read that yoloV3 expects actual pixel values.
3- Since we compute anchors at 3 different scales (3 skip connections), the previous anchor values will correspond to the large scale (52). The anchors for the other two scales (13 and 26) are calculated by dividing the first anchors by 2 and by 4.

First of all, sorry to join the party late. From what I understand here, you have two classes, malignant and benign, which are merely the output classes and don't necessarily have the same bounding-box dimensions. Therefore (as @andyrey suggested) I would suggest either using the default number and sizes of anchors, or running k-means on your dataset to obtain the best sizes and number of anchors. I am not sure about the sizes, but you can at least increase the number of anchors, as the objects might have different ratios (even if the tumours are of the same size, which again might not be the case), and I think that would be favourable for your application.

Are all the input images of fixed dimensions, i.e. (256x416)? You have also suggested two bounding boxes of (22,22) and (46,42). Are the bounding boxes always of these dimensions? If so there might be something wrong, as they may start from those values but should be able to form the box around tumours as tightly as possible. Need more clarification.

Although there is a possibility you might get results, I am not quite sure YOLO is the perfect algorithm for non-RGB input. It has been quite some time since I worked with YOLO and referred to the theoretical scripts and papers, so I am not quite sure, but I would suggest you first test it by training on your dataset without making a lot of changes, and then fine-tune by making changes to get more accuracy if you receive promising results in the first case.

@pra-dan

pra-dan commented Apr 28, 2020

@ameeiyn @andyrey Thanks for clarifying how to get w and h from the predictions and anchor values. I think I have got the box w and h successfully using

box_w = anchor_sets[anchor_index] * exp(offset_w) * 32
box_h = anchor_sets[anchor_index+1] * exp(offset_h) * 32

where offset_w and offset_h are the predicted values for w and h. But for obtaining the x and y values of the bounding boxes, I am simply multiplying the predicted coordinates (x and y) by the image width and height. I am getting poor predictions as well as dislocated boxes:

[screenshot attached]

Can you guys kindly help?
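
Not sure which framework those predictions come from, but in the usual YOLOv2/v3 decode the centre is not the raw output times the image size: you pass the raw x/y through a sigmoid, add the grid-cell offset, then divide by the grid size. A sketch under those assumptions (cx, cy are the cell column/row; anchors are in grid-cell units here so they match the *32 formula above when the image is the network input size):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, cx, cy, anchor_w, anchor_h,
               grid_w, grid_h, img_w, img_h):
    """Decode one YOLOv2/v3-style prediction into image-pixel units.

    (cx, cy) is the grid cell the prediction belongs to; anchor_w/anchor_h are
    in grid-cell units here (scale differently if yours are already in pixels).
    Returns centre x, centre y, width, height in pixels.
    """
    box_x = (cx + sigmoid(t_x)) / grid_w * img_w
    box_y = (cy + sigmoid(t_y)) / grid_h * img_h
    box_w = anchor_w * math.exp(t_w) / grid_w * img_w
    box_h = anchor_h * math.exp(t_h) / grid_h * img_h
    return box_x, box_y, box_w, box_h
```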

@mark198181

I want to learn this, please.

@easy-and-simple

Your explanations are useless, like your existence, obviously.
Only real morons would explain pictures with words instead of drawing them.
Are there any normal humans who can draw a few pictures of how anchors look and work?

@apd888

apd888 commented Jun 11, 2020

This may be fundamental: what if I train the network for an object in location (x,y), but detect the same object located in (x+10, y) in a picture ? How can YOLO detect the physical location?

@rajan780

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c

Lines 88 to 89 in 6f6e475

b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

Hi @AlexeyAB
I understand that yolo v3 anchors are sizes of objects on the image resized to the network size, while in yolov2 anchors are sizes of objects relative to the final feature map.
But my question is:
the way of calculating width and height is the same for both yolov3 and yolov2, for example width = e^(tw)*pw and height = e^(th)*ph. Then why does yolov3 use anchors at the network size while yolov2 uses anchors at the final feature map scale?
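
My reading of the darknet source (region_layer.c for v2 vs yolo_layer.c for v3), to be double-checked: the formula is the same, but the divisor differs - v2 divides by the final feature map size and v3 by the network input size - so the anchors must be in matching units for the decoded box to come out normalized. A sketch of the difference (values are illustrative):

```python
import math

def decode_wh_v2(tw, th, anchor_w, anchor_h, grid_w=13, grid_h=13):
    # region-layer style: divide by the final feature map size,
    # so anchors are expressed in feature-map cells (1.08, 1.19, ...)
    return math.exp(tw) * anchor_w / grid_w, math.exp(th) * anchor_h / grid_h

def decode_wh_v3(tw, th, anchor_w, anchor_h, net_w=416, net_h=416):
    # yolo-layer style: divide by the network input size,
    # so anchors are expressed in input pixels (10, 13, ..., 373, 326)
    return math.exp(tw) * anchor_w / net_w, math.exp(th) * anchor_h / net_h

# Same raw outputs, same physical anchor (3.42 cells == 3.42 * 32 px at stride 32),
# same normalized box either way:
print(decode_wh_v2(0.0, 0.0, 3.42, 4.41))            # ~ (0.263, 0.339)
print(decode_wh_v3(0.0, 0.0, 3.42 * 32, 4.41 * 32))  # ~ (0.263, 0.339)
```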

@Vedant1parekh

Anchors are initial sizes (width, height) some of which (the closest to the object size) will be resized to the object size - using some outputs from the neural network (final feature map):

darknet/src/yolo_layer.c

Lines 88 to 89 in 6f6e475

b.w = exp(x[index + 2*stride]) * biases[2*n] / w;
b.h = exp(x[index + 3*stride]) * biases[2*n+1] / h;

  • x[...] - outputs of the neural network
  • biases[...] - anchors
  • b.w and b.h - resulting width and height of the bounding box that will be shown on the result image

Thus, the network should not predict the final size of the object, but should only adjust the size of the nearest anchor to the size of the object.

In Yolo v3 anchors (width, height) - are sizes of objects on the image resized to the network size (width= and height= in the cfg-file).

In Yolo v2 anchors (width, height) - are sizes of objects relative to the final feature map (32 times smaller than in Yolo v3 for default cfg-files).

How many anchor boxes are needed in yolov4?

@mathemaphysics

So far, reading through this whole three-year-long thread, I've concluded that it's probably best just to re-read the papers. There are diagrams in the papers. Both in this thread and, for the most part, in the papers it is not made clear whether the anchor boxes are (x, y, w, h) in the input image or in the output feature layers (plural because of the skip connections).

I'm seeing no connection made between the input and the output of the network at all. Literally everything else, from batch normalization to internal covariate shift, makes sense to me. The anchor boxes don't. It would really help to have a better summary.
