# Real test set size estimation

In this notebook I'm going to try to estimate the real size of the test set. We know the following facts:

> To deter hand labeling, we have supplemented the test set with car images that are ignored in scoring.

The leaderboard also says the following:

> This leaderboard is calculated with approximately 25% of the test data.
> The final results will be based on the other 75%, so the final standings may be different.

## Strategy
I have made a submission using a mask that covers the full image. The score was 0.361.

Now I'm going to make random submissions in which only 20% of the images keep the full mask and the remaining masks are set to zero. I will collect the scores, which hopefully carry enough information to estimate the size of the real test set.
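
The sampling step described above can be sketched as follows. This is only an illustration: `choose_full_mask_ids` and the `img_...` ids are hypothetical names, and the expected score assumes that the score grows linearly with the fraction of scored images that keep the full mask (the same assumption the simulations in this notebook make).

```python
import numpy as np

def choose_full_mask_ids(image_ids, full_mask_ratio=0.2, seed=None):
    """Pick the ids that keep the full mask; all other ids get an empty mask."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(image_ids)) < full_mask_ratio
    return [img_id for img_id, k in zip(image_ids, keep) if k]

# Hypothetical ids standing in for the real test file names
image_ids = ["img_%06d" % i for i in range(100064)]
full_ids = choose_full_mask_ids(image_ids, full_mask_ratio=0.2, seed=0)
print("fraction with full mask: %.3f" % (len(full_ids) / len(image_ids)))

# Under the linearity assumption, the average score of such a submission
# should be about full_mask_ratio times the full-mask score.
expected_score = 0.2 * 0.361
print("expected mean score: %.4f" % expected_score)
```

The mean score tells us nothing about the test size. What matters is how much the score varies from one random submission to the next.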

## Simulating the experiments
First of all, let's run simulations to see whether different test sizes produce different score distributions.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import seaborn as sns

%matplotlib inline

In [4]:
def simulate_n_submissions(n_submissions, public_ratio, full_mask_ratio, score_scale=0.361):
    """
    Simulates the submissions that we are performing on the public leaderboard
    
    Parameters
    ----------
    n_submissions : int
        Number of submissions to be made
    public_ratio : float
        Fraction of the test set that belongs to public leaderboard
    full_mask_ratio : float
        Fraction of the submission that will use full_mask
    score_scale : float
        A factor for scaling the score. Typical value is 0.361, which is the score for using
        full mask on all the submission instances
    """
    # Create the synthetic test set
    test_size = int(1e5)
    test_set = np.concatenate((np.ones(int(test_size*public_ratio)), 
                               np.zeros(int(test_size*(1-public_ratio)))))
    test_set /= np.sum(test_set)
    # Do the submissions and collect the scores
    score_list = []
    for _ in tqdm(range(n_submissions)):
        # np.int was removed in NumPy 1.24; the builtin int is equivalent here
        sampling_mask = (np.random.rand(test_size) < full_mask_ratio).astype(int)
        score = np.sum(sampling_mask*test_set)*score_scale
        
        score_list.append(score)
    return score_list
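
In the function above only the public images carry weight, so each score equals `score_scale` times the fraction of public images that received the full mask. That count follows a binomial distribution, which suggests a faster equivalent. The following is a sketch, not a replacement used elsewhere in this notebook: `simulate_scores_binomial` is a hypothetical name, and it takes the number of public images directly instead of `public_ratio`.

```python
import numpy as np

def simulate_scores_binomial(n_submissions, n_public, full_mask_ratio,
                             score_scale=0.361, seed=None):
    """Draw all simulated scores at once from a binomial distribution."""
    rng = np.random.default_rng(seed)
    # Number of public images that received the full mask in each submission
    n_full = rng.binomial(n_public, full_mask_ratio, size=n_submissions)
    return score_scale * n_full / n_public

# public_ratio=0.01 with test_size=1e5 corresponds to n_public=1000
scores = simulate_scores_binomial(10000, n_public=1000, full_mask_ratio=0.2, seed=0)
print(scores.mean(), scores.std())
```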

In [9]:
plt.figure(figsize=(12, 6))
score_list = []
p_range = [0.01, 0.05, 0.5]
labels = ['test set public ratio= %.2f' % i for i in p_range]
for p in p_range:
    score_list.append(simulate_n_submissions(1000, public_ratio=p, full_mask_ratio=0.2, score_scale=0.361))
    
for values, label in zip(score_list, labels):
    sns.distplot(values, label=label)
plt.xlabel('Submission score')
plt.legend();

The plot above shows that the score distribution changes markedly with the real test size: the bigger the test size, the narrower the distribution.
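
The narrowing can also be checked without simulating. Under the model used in `simulate_n_submissions`, the score is `score_scale` times a binomial proportion, so its standard deviation should be `score_scale * sqrt(q * (1 - q) / n_public)`, where `q` is the full mask ratio. The helper name `theoretical_score_std` is my own, and the sketch assumes the same synthetic test size of 1e5.

```python
import numpy as np

def theoretical_score_std(n_public, full_mask_ratio=0.2, score_scale=0.361):
    """Standard deviation of the score under the binomial model."""
    q = full_mask_ratio
    return score_scale * np.sqrt(q * (1 - q) / n_public)

test_size = int(1e5)
for p in [0.01, 0.05, 0.5]:
    n_public = int(test_size * p)
    print("public ratio %.2f -> n_public=%d, score std=%.5f"
          % (p, n_public, theoretical_score_std(n_public)))
```

The standard deviation shrinks with the square root of the number of public images, which is why the three curves are so easy to tell apart.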

## Comparing the real data with the simulations
I have made 25 random submissions. Let's compare the distribution of their scores with the simulated scores.

In [8]:
submission_scores = [
    0.075, 0.074, 0.068, 0.076, 0.069, 0.075, 0.07, 0.071,
    0.076, 0.071, 0.071, 0.07, 0.073, 0.075, 0.076, 0.065,
    0.072, 0.077, 0.063, 0.074, 0.067, 0.068, 0.065, 0.069,
]

In [10]:
plt.figure(figsize=(12, 6))
score_list = [submission_scores]
p_range = [0.01, 0.05, 0.5]
labels = ['real submission'] + ['test set public ratio= %.3f' % i for i in p_range]
for p in p_range:
    score_list.append(simulate_n_submissions(1000, public_ratio=p, full_mask_ratio=0.2, score_scale=0.361))
    
for values, label in zip(score_list, labels):
    sns.distplot(values, label=label)
plt.xlabel('Submission score')
plt.legend();

The plot above shows that the submission distribution is very different from the simulations with a test set public ratio of 0.5 and 0.05.

Let's try smaller values, and get more samples for higher precision.

In [14]:
plt.figure(figsize=(12, 6))
score_list = [submission_scores]
p_range = [ 0.005, 0.01, 0.02]
labels = ['real submission'] + ['test set public ratio= %.3f' % i for i in p_range]
for p in p_range:
    score_list.append(simulate_n_submissions(10000, public_ratio=p, full_mask_ratio=0.2, score_scale=0.361))
    
for values, label in zip(score_list, labels):
    sns.distplot(values, label=label)
plt.xlabel('Submission score')
plt.legend();

The most similar distribution is the one with a public ratio of 0.01.  
**This means that only 1% of the test set belongs to the public set.**
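
As a rough cross-check of the visual comparison, the binomial model behind the simulation can be inverted to estimate the number of public images directly from the spread of the real scores. This is a sketch under the same assumptions as the simulation (each public image contributes equally to the score), and with only a couple of dozen scores, each rounded to three decimals, the estimate is noisy. The test size of 100064 is the number of test images reported for this competition.

```python
import numpy as np

submission_scores = [
    0.075, 0.074, 0.068, 0.076, 0.069, 0.075, 0.07, 0.071,
    0.076, 0.071, 0.071, 0.07, 0.073, 0.075, 0.076, 0.065,
    0.072, 0.077, 0.063, 0.074, 0.067, 0.068, 0.065, 0.069,
]
full_mask_ratio = 0.2
score_scale = 0.361
total_test_size = 100064

scores = np.array(submission_scores)
score_std = scores.std(ddof=1)  # sample standard deviation

# Invert std = score_scale * sqrt(q * (1 - q) / n_public) to solve for n_public
n_public_est = score_scale ** 2 * full_mask_ratio * (1 - full_mask_ratio) / score_std ** 2
public_ratio_est = n_public_est / total_test_size

print("estimated public images: %.0f" % n_public_est)
print("estimated public ratio: %.4f" % public_ratio_est)
```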

## Reflections
From my experience in other competitions, the test set is usually similar in size to the train set.

In this competition we have 5,088 images in the train set and 100,064 in the test set. If the real test set has the same size as the train set, and the public test set is 25% of the real test set, then the ratio between the public test set and the total test set is 1.27%. This is very close to our estimate.
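
The arithmetic behind these percentages, written out as a small sketch (the variable names are mine):

```python
train_size = 5088
total_test_size = 100064
public_fraction = 0.25  # from the leaderboard note

# Hypothesis: the real (scored) test set is as large as the train set
real_test_size = train_size
public_size = public_fraction * real_test_size

public_ratio = public_size / total_test_size
unused_ratio = 1 - real_test_size / total_test_size

print("public ratio: %.2f%%" % (100 * public_ratio))  # prints "public ratio: 1.27%"
print("unused share of the test set: %.1f%%" % (100 * unused_ratio))
```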

**So I think that the real test set has the same size as the train set. This means that 95% of the test set won't be used.**