# Embrapa Wine Grape Instance Segmentation Dataset (WGISD)

## Why was the dataset created?

Embrapa WGISD (Wine Grape Instance Segmentation Dataset) was created to provide images and annotations for studying
object detection and instance segmentation in image-based monitoring and field robotics for viticulture. It provides
instances of five different grape varieties photographed in the field. These instances show variance in grape pose,
illumination and focus, as well as genetic and phenological variations such as shape, color and compactness.

# Dataset Composition

In [1]:
varietals = ['CDY', 'CFR', 'CSV', 'SVB', 'SYH']

## How many instances of each type are there?

In [2]:
import os
import numpy as np

In [3]:
instances = {v: [] for v in varietals}

for dirname, dirnames, filenames in os.walk('.'):
    if dirname == '.':
        for filename in [f for f in filenames if f.endswith('.txt')]:
            for v in varietals:
                if filename.startswith(v):
                    instances[v].append(filename[:-4])

In [4]:
n_vimages = {v: len(inst_v) for v, inst_v in instances.items()}
n_vimages

{'CDY': 0, 'CFR': 0, 'CSV': 0, 'SVB': 0, 'SYH': 0}

In [5]:
n_images = np.array([n for __, n in n_vimages.items()]).sum()
n_images

0

### Bounding boxes

In [6]:
n_iboxes = {v: {} for v in varietals}

for v in varietals:
    for ii in instances[v]:
        annot_file = ii + '.txt'
        # ndmin=2 keeps a single-box file as one row instead of a flat vector
        bboxes = np.loadtxt(annot_file, ndmin=2)
        n_iboxes[v][ii] = bboxes.shape[0]
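
Counting boxes as the number of rows returned by `np.loadtxt` has one caveat: by default a file with a single line is parsed as a 1-D array, so `shape[0]` would report the number of fields rather than one box. The sketch below illustrates this with made-up annotation lines; the five-field layout is an assumption for illustration, not a statement of the dataset's format.

```python
import io
import numpy as np

# Hypothetical annotation contents: one line per box, five numeric fields.
one_box = "0 0.51 0.42 0.10 0.20\n"
two_boxes = one_box + "0 0.25 0.70 0.08 0.15\n"

flat = np.loadtxt(io.StringIO(one_box))             # default: 1-D array
single = np.loadtxt(io.StringIO(one_box), ndmin=2)  # forced to 2-D: one row
pair = np.loadtxt(io.StringIO(two_boxes), ndmin=2)  # two rows

print(flat.shape, single.shape, pair.shape)
```

With `ndmin=2`, `shape[0]` is the number of lines in the file regardless of how many boxes it holds.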

In [7]:
n_vboxes = {v: np.array([n for ii, n in n_iboxes[v].items()]).sum() for v in varietals}
n_vboxes

{'CDY': 0.0, 'CFR': 0.0, 'CSV': 0.0, 'SVB': 0.0, 'SYH': 0.0}

### Masks

In [8]:
n_imasks = {v: {} for v in varietals}

for v in varietals:
    for ii in instances[v]:
        annot_file = ii + '.npz'
        if os.path.isfile(annot_file):
            masks = np.load(annot_file)['arr_0']
            n_imasks[v][ii] = masks.shape[2]
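
The cell above reads the number of masks from the third axis of the stored array, which implies a height x width x instances layout. A minimal sketch of that layout follows, using a small synthetic array written to an in-memory buffer; the sizes and pixel patterns are invented for illustration.

```python
import io
import numpy as np

# Synthetic stack of three binary masks on a 4 x 6 image (invented values).
height, width, n_bunches = 4, 6, 3
masks = np.zeros((height, width, n_bunches), dtype=bool)
masks[0:2, 0:2, 0] = True
masks[1:3, 2:5, 1] = True
masks[3, :, 2] = True

# Arrays passed positionally to savez are stored under 'arr_0', 'arr_1', ...
buffer = io.BytesIO()
np.savez_compressed(buffer, masks)
buffer.seek(0)

loaded = np.load(buffer)['arr_0']
n_masks = loaded.shape[2]                  # one mask per slice on the last axis
pixels_per_mask = loaded.sum(axis=(0, 1))  # area of each mask in pixels

print(n_masks, pixels_per_mask)
```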

Is there a mask for each bounding box in the masked images?

In [9]:
for v in varietals:
    for ii in n_imasks[v]:
        assert n_imasks[v][ii] == n_iboxes[v][ii]
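
A bare assertion stops at the first inconsistent image without saying which one it was. As a possible alternative, the sketch below collects every mismatch instead; `find_mask_box_mismatches` is a hypothetical helper and the toy dictionaries are invented, mirroring only the structure of `n_imasks` and `n_iboxes`.

```python
def find_mask_box_mismatches(mask_counts, box_counts):
    """Return (variety, image, n_masks, n_boxes) for each masked image whose counts differ."""
    mismatches = []
    for v, per_image in mask_counts.items():
        for ii, n_masks in per_image.items():
            n_boxes = box_counts.get(v, {}).get(ii)  # None if the image has no box count
            if n_masks != n_boxes:
                mismatches.append((v, ii, n_masks, n_boxes))
    return mismatches


# Toy data: 'img_b' has five boxes but only four masks.
toy_boxes = {'CDY': {'img_a': 3, 'img_b': 5}}
toy_masks = {'CDY': {'img_a': 3, 'img_b': 4}}

print(find_mask_box_mismatches(toy_masks, toy_boxes))
```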

In [10]:
n_vmasks = {v: np.array([n for ii, n in n_imasks[v].items()]).sum() for v in varietals}
n_vmasks

{'CDY': 0.0, 'CFR': 0.0, 'CSV': 0.0, 'SVB': 0.0, 'SYH': 0.0}

In [11]:
import pandas as pd

In [12]:
n_vimages

{'CDY': 0, 'CFR': 0, 'CSV': 0, 'SVB': 0, 'SYH': 0}

In [13]:
df = pd.DataFrame(index=varietals, columns=['Images', 'BoxedBunches', 'MaskedBunches'])
for v, val in n_vimages.items():
    df.loc[v, 'Images'] = val
    df.loc[v, 'BoxedBunches'] = n_vboxes[v]
    df.loc[v, 'MaskedBunches'] = n_vmasks[v]
     
df

    Images BoxedBunches MaskedBunches
CDY      0          0.0           0.0
CFR      0          0.0           0.0
CSV      0          0.0           0.0
SVB      0          0.0           0.0
SYH      0          0.0           0.0


In [14]:
df.sum()

Images             0
BoxedBunches     0.0
MaskedBunches    0.0
dtype: object

## Are there recommended data splits or evaluation measures?

In [15]:
with open('train.txt', 'r') as fp:
    train = fp.readlines()
train = {line.strip() for line in train}  # strip() is safe even if the last line lacks a newline

len(train)

242

In [16]:
with open('test.txt', 'r') as fp:
    test = fp.readlines()
test = {line.strip() for line in test}  # strip() is safe even if the last line lacks a newline

len(test)

58

Check that train and test are _disjoint_:

In [17]:
train.intersection(test)

set()
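
Besides overlap, it can be worth checking that every annotated image is listed in one of the two splits. The sketch below does both with plain set operations; `check_splits` is a hypothetical helper and the image names are invented.

```python
def check_splits(train_names, test_names, annotated_names):
    """Return (overlap, unassigned): names in both splits, and annotated names in neither."""
    overlap = train_names & test_names
    unassigned = annotated_names - (train_names | test_names)
    return overlap, unassigned


# Toy data: 'img_d' is annotated but appears in neither split.
toy_train = {'img_a', 'img_b'}
toy_test = {'img_c'}
toy_annotated = {'img_a', 'img_b', 'img_c', 'img_d'}

overlap, unassigned = check_splits(toy_train, toy_test, toy_annotated)
print(overlap, unassigned)
```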

In [18]:
data = []
for v, val in n_iboxes.items():
    for i in val:
        if i in train:
            sp = 'Train'
        else:
            sp = 'Test'
            
        if i in n_imasks[v]:
            nm = n_imasks[v][i]
        else:
            nm = 0
            
        data.append((i, v, sp, 1, n_iboxes[v][i], nm))

dfi = pd.DataFrame(data, 
                   columns=['Inst', 'Variety', 'Split', 'Image', 'BoxedBunches', 'MaskedBunches']).set_index('Inst')

In [19]:
dfi.groupby(['Split']).sum()

       Variety Image BoxedBunches MaskedBunches
Split


In [20]:
dfi.groupby(['Split']).sum().sum()

Variety          0
Image            0
BoxedBunches     0
MaskedBunches    0
dtype: object
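
Summing every column per split also sweeps in the non-numeric `Variety` column. One way to avoid that, and to break the counts down by variety as well, is to select the count columns explicitly before aggregating. The sketch below uses a small invented table with the same columns as `dfi`; the rows are toy values, not dataset statistics.

```python
import pandas as pd

# Toy rows shaped like dfi: (Inst, Variety, Split, Image, BoxedBunches, MaskedBunches).
toy = pd.DataFrame(
    [('img_a', 'CDY', 'Train', 1, 3, 3),
     ('img_b', 'CDY', 'Test', 1, 5, 0),
     ('img_c', 'SYH', 'Train', 1, 2, 2)],
    columns=['Inst', 'Variety', 'Split', 'Image', 'BoxedBunches', 'MaskedBunches'],
).set_index('Inst')

counts = ['Image', 'BoxedBunches', 'MaskedBunches']

# Totals per split, restricted to the numeric count columns.
per_split = toy.groupby('Split')[counts].sum()

# The same totals broken down by variety within each split.
per_split_variety = toy.groupby(['Split', 'Variety'])[counts].sum()

print(per_split)
print(per_split_variety)
```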