In [1]:
import numpy as np
import pandas as pd

I think we need to be careful about how this dataset is built in order to get a trustworthy validation set.

For whales with multiple examples, we can use stratified sampling so that we get the same distribution of whales in the training and validation sets. Since we need to create pairs, a whale needs at least 4 examples to appear in both sets (2 for training, 2 for validation). That leaves the whales with 3 examples and the whales with 2. We can potentially use some of them for training as well, but first let's establish how many whales there are at each count. I think whales with 2 or 3 samples will either ALL go in training or ALL go in validation.
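The bucketing described above could be sketched roughly as follows, on a toy table rather than the real data. The helper name `split_images_by_whale`, its `min_per_side` and `val_frac` parameters, and the rule of sending a small whale to validation with probability `val_frac` are all assumptions made for illustration, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

def split_images_by_whale(df, min_per_side=2, val_frac=0.2, seed=0):
    """Hypothetical helper: return a copy of df with a 'split' column.

    Whales with at least 2 * min_per_side images are split within the
    whale, so each side can still form a pair. Whales with fewer images
    go entirely to one side.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    out['split'] = 'train'
    for whale, idx in out.groupby('Id').groups.items():
        idx = np.array(idx)
        n = len(idx)
        if n >= 2 * min_per_side:
            # Hold out a share of this whale's images for validation,
            # keeping at least min_per_side images on each side.
            n_val = int(round(n * val_frac))
            n_val = min(max(n_val, min_per_side), n - min_per_side)
            val_idx = rng.permutation(idx)[:n_val]
            out.loc[val_idx, 'split'] = 'val'
        elif rng.random() < val_frac:
            # Small whale: all of its images move to validation together.
            out.loc[idx, 'split'] = 'val'
    return out

# Toy data (made up): one whale each with 4, 3 and 2 images.
toy = pd.DataFrame({
    'Image': [f'img{i}.jpg' for i in range(9)],
    'Id': ['w_a'] * 4 + ['w_b'] * 3 + ['w_c'] * 2,
})
result = split_images_by_whale(toy)
print(result.groupby(['Id', 'split']).size())
```

With these defaults the 4-image whale ends up with 2 images on each side, while the 3-image and 2-image whales each land wholly in one set.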

This may actually be a bad way to do things. Should we have whales that are in both the training and validation set, or should we ACTUALLY do the split based on whale?
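For comparison, a split based on whale rather than on image might look like the sketch below. The helper name `split_by_whale` and the `val_frac` parameter are hypothetical, and the toy table stands in for the real training data.

```python
import numpy as np
import pandas as pd

def split_by_whale(df, val_frac=0.2, seed=0):
    """Hypothetical helper: keep every image of a whale on the same side."""
    rng = np.random.default_rng(seed)
    whales = sorted(df['Id'].unique())
    n_val = max(1, int(round(len(whales) * val_frac)))
    # Shuffle positions, then pick the first n_val whales for validation.
    order = rng.permutation(len(whales))
    val_whales = [whales[i] for i in order[:n_val]]
    is_val = df['Id'].isin(val_whales)
    return df[~is_val], df[is_val]

# Toy data (made up): four whales with 4, 3, 2 and 1 images.
toy = pd.DataFrame({
    'Image': [f'img{i}.jpg' for i in range(10)],
    'Id': ['w_a'] * 4 + ['w_b'] * 3 + ['w_c'] * 2 + ['w_d'],
})
train_df, val_df = split_by_whale(toy, val_frac=0.25)

# No whale should appear on both sides of the split.
print(set(train_df['Id']).isdisjoint(val_df['Id']))
```

The trade-off is that validation then measures performance on whales the model has never seen, which is a stricter test than a per-image split.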

It's funny, because in terms of end performance, overfitting to individual whales that we might see again could actually help on the test set, but it doesn't seem like the right way to build sets for machine learning.

In [2]:
tr = pd.read_csv('../train.csv')

In [3]:
tr.head()

Unnamed: 0,Image,Id
0,0000e88ab.jpg,w_f48451c
1,0001f9222.jpg,w_c3d896a
2,00029d126.jpg,w_20df2c5
3,00050a15a.jpg,new_whale
4,0005c1ef8.jpg,new_whale


In [4]:
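# Number of images per whale Id, most frequent first.
wcts = tr['Id'].value_counts()

# Drop the 'new_whale' catch-all label by name rather than by position.
nonew = pd.DataFrame(wcts.drop('new_whale')).reset_index()
nonew.columns = ['Id', 'cts']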

In [15]:
nonew.shape

(5004, 2)

So there are 5004 individual whales (excluding new_whale), each with at least 1 sample.

In [16]:
nonew[nonew['cts']>1].shape

(2931, 2)

In [17]:
nonew[nonew['cts']==1].shape

(2073, 2)

Of these, 2931 whales have 2 or more pictures and 2073 have only 1. So I think we just do two training / validation splits: one for whales with 2 or more images and one for whales with only 1 image. Then we can worry about how to create the datasets from those.
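It may also help to see the full breakdown by image count rather than only the 1 vs. 2-or-more bins, since the earlier notes treat 2, 3 and 4+ differently. A minimal sketch on a made-up table with the same `Id` / `cts` layout as `nonew`; the toy values and the `bucket` column name are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for `nonew`: one row per whale with its image count.
toy_counts = pd.DataFrame({
    'Id': ['w_a', 'w_b', 'w_c', 'w_d', 'w_e', 'w_f'],
    'cts': [1, 1, 2, 3, 4, 7],
})

# Bin the counts into the groups discussed above: 1, 2, 3, and 4 or more.
toy_counts['bucket'] = pd.cut(
    toy_counts['cts'],
    bins=[0, 1, 2, 3, float('inf')],
    labels=['1', '2', '3', '4+'],
)

# Number of whales in each bucket, in bucket order.
per_bucket = toy_counts['bucket'].value_counts().sort_index()
print(per_bucket)
```

Run against the real `nonew`, the same binning would show how many whales fall into each bucket.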

In [20]:
same = nonew[nonew['cts']>1]
diff = nonew[nonew['cts']==1]

In [21]:
same.shape, diff.shape

((2931, 2), (2073, 2))

Let's do the same inspection, but restricted to images with 3 channels.

In [24]:
tr = pd.read_csv('./image_dims.csv')

In [25]:
tr = tr[tr['channels']==3]

In [26]:
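# Recount images per whale Id on the 3-channel subset.
wcts = tr['Id'].value_counts()

# Drop the 'new_whale' catch-all label by name rather than by position.
nonew = pd.DataFrame(wcts.drop('new_whale')).reset_index()
nonew.columns = ['Id', 'cts']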

In [27]:
nonew.shape

(4725, 2)

In [28]:
nonew[nonew['cts']>1].shape

(2722, 2)

In [29]:
nonew[nonew['cts']==1].shape

(2003, 2)

Let's only work with 3 channel images for now. We can add the others pretty easily, just 