# Data Prep

The paper is available at [https://www.gperezs.com/papers/bppa.pdf](https://www.gperezs.com/papers/bppa.pdf), and the CVPR 2019 workshop publication page is at [https://biomedicalcomputervision.uniandes.edu.co/publications/finding-four-leaf-clovers-a-benchmark-for-fine-grained-object-localization/](https://biomedicalcomputervision.uniandes.edu.co/publications/finding-four-leaf-clovers-a-benchmark-for-fine-grained-object-localization/).

Given over 100,000 images and several types of annotations (instance, semantic, etc.), how do we get this ready for use?   
Here is a file tree of the [FLC data](https://drive.google.com/file/d/1fq6ZeVg2-CjRtKgcBcnKy9jLfUIabfe8/view?usp=sharing)   
```
FLC2019 % tree -C -r -I '*.jpg|*.png|*.mat'   
.
├── trainval
│   ├── coco_annotations
│   │   ├── leaves_trainval_pos.json
│   │   ├── leaves_trainval_negs.json
│   │   ├── instances_trainval_pos.json
│   │   ├── instances_trainval_negs.json
│   │   └── instances_trainval_hard_pos.json
│   ├── SegmentationClass
│   ├── JPEGImages_pos
│   ├── JPEGImages
│   └── BorderClass
├── test
│   ├── gt_mats_pos
│   ├── gt_mats
│   │   └── gt_mats_pos
│   ├── coco_annotations
│   │   ├── leaves_test_pos.json
│   │   ├── leaves_test_negs.json
│   │   ├── instances_test_pos.json
│   │   ├── instances_test_negs.json
│   │   ├── instances_test_hard_pos.json
│   │   └── instances_test_all.json
│   ├── SegmentationClass
│   ├── JPEGImages_pos
│   ├── JPEGImages_neg
│   ├── JPEGImages
│   ├── BorderClass_pos
│   ├── BorderClass_alejo
│   └── BorderClass
│       └── BorderClass_pos
└── create_contours.py

```   

Of this data, we will not be using the .mat files or the BorderClass files.   
Here is some more information about the image directories:   
* In the **trainval**/JPEGImages_pos dir, we have 1000 jpg files totalling 1.63GB. These are images *with* four-leaf clovers   
* In the **trainval**/JPEGImages dir, we have 52,638 jpg files totalling 23.8GB. These are images *without* four-leaf clovers as well as the same 1000 images *with* four-leaf clovers   
  
-----

*   In the **test**/JPEGImages_pos dir, we have 500 jpg files totalling 888.8MB. These are images *with* four-leaf clovers  
*   In the **test**/JPEGImages_neg dir, we have 51,671 jpg files totalling 22.2GB. These are images *without* four-leaf clovers  
*   Additionally, we have the  **test**/JPEGImages dir which combines both the pos and neg images into a single dir   
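
The file counts and sizes listed above can be re-checked on a local copy with a few lines of Python. This is a minimal sketch: `summarize_jpgs` is a hypothetical helper name, and it assumes that sizes are reported in decimal gigabytes (1 GB = 10^9 bytes) and that images sit directly inside each directory rather than in subfolders.

```python
from pathlib import Path


def summarize_jpgs(directory):
    """Return (file count, total size in GB) for .jpg files directly inside `directory`."""
    # Match the extension case-insensitively so .JPG files are counted too.
    files = [p for p in Path(directory).iterdir()
             if p.is_file() and p.suffix.lower() == ".jpg"]
    total_bytes = sum(p.stat().st_size for p in files)
    # Assumption: decimal gigabytes (10**9 bytes), not GiB.
    return len(files), total_bytes / 1e9
```

Calling it on the **trainval**/JPEGImages_pos directory should give a count of 1000 if the local copy matches the numbers above; the reported size may differ slightly depending on how the operating system rounds.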
  
-----   
   
According to their paper:

| General statistics         | Trainval set | Test set |
|----------------------------|--------------|----------|
| Total positive images      | 1,000        | 500      |
| Total negative images      | 51,637       | 51,670   |
| Total images               | 52,637       | 52,170   |
| 4-leaf clover instances    | 1,412        | 739      |
| 4-leaf clover leaves       | 5,858        | 3,094    |
| 4-leaf clover pixels       | 0.0445%      | 0.0588%  |
| 4-leaf clover boundary pixels | 0.0026%   | 0.0030%  |
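
A few ratios derived from the table make the class imbalance easier to see. The sketch below only re-does arithmetic on the numbers copied from the table; the dictionary layout and the `derived_stats` name are illustrative choices, not anything from the paper.

```python
# Numbers copied from the statistics table above.
stats = {
    "trainval": {"pos": 1000, "neg": 51637, "total": 52637,
                 "instances": 1412, "leaves": 5858},
    "test": {"pos": 500, "neg": 51670, "total": 52170,
             "instances": 739, "leaves": 3094},
}


def derived_stats(s):
    """Derive simple ratios from one column of the table."""
    return {
        # Sanity check: positives plus negatives should equal the total.
        "totals_consistent": s["pos"] + s["neg"] == s["total"],
        "positive_fraction": s["pos"] / s["total"],
        "instances_per_pos_image": s["instances"] / s["pos"],
        "leaves_per_instance": s["leaves"] / s["instances"],
    }


for split, s in stats.items():
    print(split, derived_stats(s))
```

Positive images make up roughly 1.9% of trainval and just under 1% of test, and each positive image holds about 1.4 to 1.5 four-leaf clover instances on average. Note that the directory counts listed earlier (52,638 files in trainval/JPEGImages and 51,671 in test/JPEGImages_neg) are each one higher than the paper's figures; the reason is not clear from the data alone.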

Both the trainval and test dirs include a coco_annotations dir. I will be using *instances_trainval_hard_pos.json* and *instances_test_hard_pos.json* because my training time and compute resources are limited. These are standard COCO bbox and segmentation annotations for the pos images. They annotate both three-leaf clovers and four-leaf clovers across all 1,500 positive images (1,000 trainval and 500 test).
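
Before training on these files it helps to look at what they contain. The sketch below uses only the standard library and assumes the files follow the usual COCO layout with top-level `images`, `annotations`, and `categories` keys; `load_coco` and `summarize_coco` are hypothetical helper names, and the actual category names in the FLC files are not assumed here.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO-style annotation file into a dict."""
    with open(path) as f:
        return json.load(f)


def summarize_coco(coco):
    """Count images, annotations, and annotations per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco.get("categories", [])}
    # Fall back to the raw category id if a name is missing.
    per_category = Counter(
        id_to_name.get(a["category_id"], a["category_id"])
        for a in coco.get("annotations", [])
    )
    return {
        "images": len(coco.get("images", [])),
        "annotations": len(coco.get("annotations", [])),
        "per_category": dict(per_category),
    }
```

Running `summarize_coco(load_coco(path))` on each of the two hard_pos files should show how the annotations split between the three-leaf and four-leaf categories.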

### What results are we trying to replicate?   

Object Detection:   
Semantic Segmentation:   
Instance Segmentation:   

### Let's get some basic vars out of the way

In [None]:
import os

# dataset root (adjust to wherever your local copy of FLC2019 lives)
flc_root = '/Users/mattsloan/Downloads/FLC2019'

# image dirs
trainval_pos_img_dir = os.path.join(flc_root, 'trainval', 'JPEGImages_pos')
test_pos_img_dir = os.path.join(flc_root, 'test', 'JPEGImages_pos')

# coco annotation files
trainval_hard_ann = os.path.join(flc_root, 'trainval', 'coco_annotations', 'instances_trainval_hard_pos.json')
test_hard_ann = os.path.join(flc_root, 'test', 'coco_annotations', 'instances_test_hard_pos.json')
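
With the paths defined, a quick check that every image listed in an annotation file is actually present in the matching image dir can catch a bad download early. This is a minimal sketch: `missing_images` is a hypothetical helper, and it assumes each entry under `images` has a `file_name` field, as in the standard COCO format.

```python
import json
import os


def missing_images(ann_file, img_dir):
    """Return file names listed in a COCO annotation file that are absent from img_dir."""
    with open(ann_file) as f:
        coco = json.load(f)
    on_disk = set(os.listdir(img_dir))
    # basename() guards against file_name values that include a leading folder.
    names = (os.path.basename(img["file_name"]) for img in coco["images"])
    return sorted(name for name in names if name not in on_disk)
```

If the download is complete, `missing_images(trainval_hard_ann, trainval_pos_img_dir)` and `missing_images(test_hard_ann, test_pos_img_dir)` should both return an empty list.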