# Who can identify whale species in images?

In this competition, we are given a [**multilabel classification**](https://jmread.github.io/talks/Tutorial-MLC-Porto.pdf) problem: given an image, we have to decide which labels it belongs to.

Our task is as follows:
>"For each Image in the test set, you may predict up to 5 labels for the whale Id"

In this notebook, we will:
1. Look at the actual images to get a first impression of the data
2. Cluster the images according to their pixel intensities to find potential groups
3. Generate a baseline Bernoulli sample submission

A standard approach to multilabel classification is to train as many OVA (one-vs-all) models as there are distinct labels and then assign labels based on the output of each model. We'll get to that later.
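
As a rough illustration of the one-vs-all idea, here is a minimal sketch on made-up toy data. It assumes plain least-squares linear scorers as the per-label models, which is only one of many possible choices; the function names `fit_one_vs_all` and `predict_one_vs_all` are hypothetical and not part of any library.

```python
import numpy as np

def fit_one_vs_all(X, Y):
    """Fit one least-squares linear scorer per label column of Y.

    X: (n_samples, n_features) features.
    Y: (n_samples, n_labels) 0/1 indicator matrix.
    Returns weights of shape (n_features + 1, n_labels); the last row is the bias.
    """
    X_bias = np.hstack([X, np.ones((X.shape[0], 1))])
    # lstsq solves every label column at once; each column is an independent model
    W, *_ = np.linalg.lstsq(X_bias, Y, rcond=None)
    return W

def predict_one_vs_all(W, X, threshold=0.5):
    """Assign every label whose scorer output exceeds the threshold."""
    X_bias = np.hstack([X, np.ones((X.shape[0], 1))])
    return (X_bias @ W > threshold).astype(int)

# Toy data (made up): label 0 follows feature 0, label 1 follows feature 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

W_toy = fit_one_vs_all(X_toy, Y_toy)
pred_toy = predict_one_vs_all(W_toy, X_toy)
print(pred_toy)
```

With thousands of distinct ids, a real setup would need one such model per id, which is why the number of labels matters so much for this approach.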

**If this notebook earns your upvote, please upvote this :)**


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

# Load the dataset

In [None]:
# Get the list of training files 
train = os.listdir('../input/train')
# Get the list of test files
test = os.listdir('../input/test')

print("Total number of training images: ",len(train))
print("Total number of test images: ",len(test))

In [None]:
sample = pd.read_csv('../input/sample_submission.csv')
print(sample.shape)
sample.head()

In [None]:
# load training labels into a pandas dataframe
train_labels = pd.read_csv('../input/train.csv')
train_labels.head()

In [None]:
train_labels.info()

# Id counts
Let's count all of the ids.

In [None]:
all_labels = train_labels['Id']
unique_labels = all_labels.unique()

In [None]:
print("There are {} unique IDs".format(unique_labels.shape[0]))

In [None]:
print("There are {} Id entries in total (non-unique)".format(all_labels.shape[0]))

In [None]:
# Note: all_labels has one entry per row of train_labels, so this ratio is 1.0 by construction
print("Average number of labels per image {}".format(1.0*all_labels.shape[0]/train_labels.shape[0]))

In [None]:
# Split each 'Id' entry on spaces and flatten the result into one list of ids
all_ids = [item for row in train_labels['Id'] for item in row.split(" ")]
print('total of {} non-unique tags in all training images'.format(len(all_ids)))
print('average number of labels per image {}'.format(1.0*len(all_ids)/train_labels.shape[0]))


Now, let's do the actual counting. We are going to use the pandas `DataFrame.groupby` method for that.

In [None]:
# Count the rows per id; after reset_index() the counts live in a column named 0
ids_counted_and_sorted = pd.DataFrame({'Id': all_labels}).groupby('Id')\
                            .size().reset_index().sort_values(0, ascending=False)
ids_counted_and_sorted.head(20)

There are only a few ids that occur very often in the data:
1. new_whale
2. w_1287bfc
3. w_98baff9



# Submission from training tag counts

It is time for the fun part. Let's take the training id distribution and use it as a prior for our test data. For that we will configure a Bernoulli distribution for each id, using its observed training frequency as the success probability, and draw from it once for each test image. With that we'll generate a submission without ever looking at the actual images.
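
One consequence of this setup is worth checking before sampling: because the per-id probabilities are normalised counts, they sum to 1, so each test image receives one id on average and a sizeable share of images receive none. The sketch below illustrates this with made-up counts (`toy_counts` is hypothetical) and uses NumPy's random generator as a stand-in for `scipy.stats.bernoulli`.

```python
import numpy as np

# Made-up id counts, standing in for the real training counts
toy_counts = np.array([50, 30, 15, 5])
toy_probas = toy_counts / toy_counts.sum()  # observed frequencies, sum to 1

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n_images = 10000
# One independent Bernoulli draw per (image, id) pair
toy_indicators = (rng.random((n_images, toy_probas.size)) < toy_probas).astype(int)

ids_per_image = toy_indicators.sum(axis=1)
expected_ids = toy_probas.sum()        # expected number of ids per image
p_empty = np.prod(1.0 - toy_probas)    # probability that an image gets no id at all
simulated_mean = ids_per_image.mean()
simulated_empty = (ids_per_image == 0).mean()

print("expected ids per image:", expected_ids)
print("simulated ids per image:", simulated_mean)
print("P(no id), exact vs simulated:", p_empty, simulated_empty)
```

The exact and simulated values should agree closely, which is a quick sanity check that the sampling step does what the description says.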


In [None]:
from scipy.stats import bernoulli

In [None]:
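# Relative training frequency of each id (column 0 holds the counts)
id_probas = ids_counted_and_sorted[0].values / (ids_counted_and_sorted[0].values.sum())
# One column of 0/1 draws per id, one row per test image
indicators = np.hstack([bernoulli.rvs(p, size=sample.shape[0]).reshape(sample.shape[0], 1) for p in id_probas])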

In [None]:
indicators = np.array(indicators)
indicators.shape

In [None]:
indicators[:10,:]

In [None]:
sorted_ids = ids_counted_and_sorted['Id'].values
all_test_ids = []

In [None]:
for index in range(indicators.shape[0]):
    all_test_ids.append(' '.join(list(sorted_ids[np.where(indicators[index, :] == 1)[0]])))
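
Since the task allows at most 5 labels per image and the Bernoulli draws can produce more than that, or none at all, the join step could be wrapped in a small helper. The following is only a sketch: the helper name `indicators_to_ids`, the fallback to the most frequent id for empty rows, and the `w_a` ... `w_f` ids are assumptions made for illustration.

```python
import numpy as np

def indicators_to_ids(indicator_row, ids_by_frequency, max_ids=5):
    """Join the ids flagged in one 0/1 row, keeping at most `max_ids`.

    `ids_by_frequency` is assumed to be sorted from most to least frequent,
    so truncation keeps the most common ids. An empty row falls back to the
    single most frequent id (an assumption of this sketch).
    """
    chosen = list(ids_by_frequency[np.flatnonzero(indicator_row)][:max_ids])
    if not chosen:
        chosen = [ids_by_frequency[0]]
    return ' '.join(chosen)

# Hypothetical ids, sorted by frequency; only 'new_whale' appears in the data above
toy_ids = np.array(['new_whale', 'w_a', 'w_b', 'w_c', 'w_d', 'w_e', 'w_f'])

too_many = indicators_to_ids(np.ones(7, dtype=int), toy_ids)
empty = indicators_to_ids(np.zeros(7, dtype=int), toy_ids)
single = indicators_to_ids(np.array([0, 0, 1, 0, 0, 0, 0]), toy_ids)
print(too_many)
print(empty)
print(single)
```

Applied to the real data, the helper would be called once per row of `indicators` with `sorted_ids` as the second argument.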

In [None]:
len(all_test_ids)

In [None]:
sample['Id'] = all_test_ids
# Write the submission first, then show a preview as the cell output
sample.to_csv('bernoulli_submission.csv', index=False)
sample.head()

In [None]:
!ls