There was a suggestion floating around to group the ids into clusters based on their missing values. This notebook explores that suggestion and assesses its feasibility.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import kagglegym

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
env = kagglegym.make()
observation = env.reset()
train = observation.train
print("Train has {} rows".format(len(train)))

In [None]:
train.describe()

In [None]:
unique_id = train["id"].unique()
print("There are {} unique ids".format(len(unique_id)))

In [None]:
train.groupby("id").count()

In [None]:
# For each id, flag the columns in which every value is missing (NaN).
# Testing for NaN directly avoids flagging columns whose values merely sum to zero.
missing_value = np.zeros((len(unique_id), train.shape[1]))
for count, item in enumerate(unique_id):
    tmp_id = train[train["id"] == item]
    missing_value[count] = tmp_id.isnull().all(axis=0).values
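
The same per-id indicator can also be built without an explicit loop. The following is a minimal sketch on a hypothetical toy frame (the column names `f1` and `f2` and all values are made up for illustration, not taken from the competition data), assuming that "completely missing" means every value of a feature is NaN for that id.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three ids and two feature columns.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "f1": [np.nan, np.nan, 0.5, np.nan, np.nan],
    "f2": [1.0, -1.0, np.nan, np.nan, 2.0],
})

feature_cols = [c for c in toy.columns if c != "id"]

# True where a feature is NaN in every row belonging to an id.
# Note that f2 for id 1 sums to zero but is present, so it is not flagged.
all_missing = toy[feature_cols].isnull().groupby(toy["id"]).all()

# Binary matrix: one row per id, 1.0 marks a completely missing feature.
missing_matrix = all_missing.values.astype(float)
print(all_missing)
```

On the real data, `toy` would be replaced by `train`; the rows of `all_missing` are ordered by sorted id rather than by order of first appearance.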

In [None]:
missing_value # binary matrix indicating (1s) completely missing features for each id

In [None]:
missing_value.shape

In [None]:
# np.vstack needs a sequence, not a set; sorting also makes the group order reproducible
unique_missing_value = np.vstack(sorted(set(map(tuple, missing_value))))

In [None]:
print("So the ids can be clustered into {} groups, based on completely missing features".format(unique_missing_value.shape[0]))

In [None]:
# create a frequency table of ids in each group
freq_missing_value = []
for row in missing_value:
    for index, row_unique in enumerate(unique_missing_value):
        if np.array_equal(row, row_unique):
            freq_missing_value.append(index)
            
group, id_count = np.unique(freq_missing_value, return_counts=True)            

plt.plot(id_count)
plt.xlabel("group")
plt.ylabel("frequency")
plt.show()

The plot above shows that a few groups contain up to 25 ids, but most groups contain only one id.
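
The grouping and counting steps can also be expressed with `np.unique` alone. This is a sketch on a small made-up matrix (the values are hypothetical, standing in for the per-id missing-feature matrix), assuming a NumPy version that supports the `axis` argument of `np.unique`.

```python
import numpy as np

# Toy stand-in for the per-id missing-feature matrix (hypothetical values):
# one row per id, 1.0 marks a feature that is missing for every row of that id.
toy_missing = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

# Distinct rows are the groups; the inverse maps each id to its group index,
# and the counts give the number of ids in each group.
patterns, group_of_id, ids_per_group = np.unique(
    toy_missing, axis=0, return_inverse=True, return_counts=True
)
group_of_id = np.ravel(group_of_id)  # keep it 1-D across NumPy versions

# Frequency table of group sizes: how many groups contain exactly k ids.
group_size, n_groups = np.unique(ids_per_group, return_counts=True)

print(patterns.shape[0], "groups")
print(np.column_stack((group_size, n_groups)))
```

Compared with comparing every row against every distinct pattern, this does the work in one sorted pass, which matters little for about a thousand ids but scales better.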

To dive deeper, a frequency table of the counts in each group is created.

In [None]:
supergroup, group_count = np.unique(id_count, return_counts=True)

print(np.asarray((supergroup, group_count)).T)

plt.plot(supergroup, group_count, '-o', label="Original noisy data")
plt.xlabel("supergroup")
plt.ylabel("frequency")
plt.legend()
plt.show()

It appears that the supergroup counts follow a power law. Fitting a power-law curve of the form a * x^(-b) + c to the data gives the plots below.

In [None]:
group_count = group_count*10 # rescale data to zoom into the long tail

from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * np.power(x,-b) + c

popt, pcov = curve_fit(func, supergroup, group_count, maxfev=2000)

print("Fit parameters are", popt) 
print("Standard errors of the parameters are", np.sqrt(np.diag(pcov)))

plt.plot(supergroup, group_count, 'ko', label="Original noisy data")
plt.plot(supergroup, func(supergroup, *popt), 'r-', label="Power-law fit")
plt.xlabel("supergroup")
plt.ylabel("frequency")
plt.legend()
plt.show()

In [None]:
plt.loglog(supergroup, group_count, 'ko', label="Original noisy data")
plt.loglog(supergroup, func(supergroup, *popt), 'r-', label="Fitted Curve")
plt.xlabel("supergroup")
plt.ylabel("frequency")
plt.legend()
plt.show()

The plot above shows that the power-law curve fits the data fairly well.
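
A pure power law is a straight line on log-log axes, so one quick cross-check of the fitted exponent is a linear fit in log space. The sketch below uses synthetic data that follows an exact power law (the prefactor 700 and exponent 2.5 are arbitrary illustrative choices, not results from this dataset) and ignores the constant offset `c` used in `func`.

```python
import numpy as np

# Synthetic, exact power law for illustration only (not values from the dataset).
sizes = np.arange(1, 11, dtype=float)
counts = 700.0 * sizes ** -2.5

# On log-log axes a pure power law y = a * x**(-b) is the line
# log(y) = log(a) - b * log(x), so a degree-1 fit recovers b and a.
slope, intercept = np.polyfit(np.log(sizes), np.log(counts), 1)
b_est = -slope
a_est = np.exp(intercept)

print("estimated exponent b:", b_est)
print("estimated prefactor a:", a_est)
```

On real counts the points will scatter around the line, and any zero counts have to be dropped before taking logarithms.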

## Conclusion:
There are 1096 unique ids. They can be clustered into 784 groups based on completely missing features. About 700 groups are composed of a single id, which means that such a clustering is of no help.

The group sizes appear to follow a power law, as is common in many unconstrained systems. The system here is presumably the financial market, and the ids seem to represent market players such as companies or their stocks.

NB: Although these conclusions are not very useful in practice, I thought it might be a good idea to share the insights. Cheers!