In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns', 500)
pd.options.display.max_rows = 1000

In [None]:
hdf = pd.read_hdf('../input/train.h5')
print(len(hdf["id"].unique()))
hdf.head()

Note: Plotting all 1,424 ids produces cluttered charts and very large facet grids that take a long time to render. It is better to sample a few ids, use the same sample consistently throughout the notebook, and rerun the analyses on a few different samples before drawing final conclusions.

In [None]:
n_samples = 200
df_sample = pd.DataFrame({"sample_id": pd.Series(hdf["id"].unique()).sample(n_samples) })
df_sample = df_sample.reset_index(drop=True).reset_index(drop=False)
df_sample["group"] = df_sample["index"] // 10  # Create groups of 10 sample ids: rows 0-9 -> group 0, rows 10-19 -> group 1, ...
hdf = pd.merge(left = hdf, right = df_sample[["sample_id", "group"]], left_on="id", right_on="sample_id", how = "left")
df_sample.head()

In [None]:
# Look at "y" for a random subset of n_samples ids
hdf_plot = hdf.loc[~hdf["sample_id"].isnull(), ["sample_id", "timestamp", "y"]]
grid = sns.FacetGrid( hdf_plot, col = "sample_id", col_wrap=5)
grid = (grid.map(plt.plot, "timestamp", "y")).add_legend()
plt.show()

In [None]:
# Look at all feature columns for 10 sampled ids
sample_id_group = 1
cols_included = [c for c in hdf.columns if c not in ("id", "group")]  # drop the id and the sampling-group helper column
hdf_melt = hdf.loc[ (~hdf["sample_id"].isnull()) & (hdf["group"] == sample_id_group), cols_included]
hdf_melt = pd.melt (hdf_melt, id_vars=["timestamp","sample_id"])
hdf_melt.head()

In [None]:
grid = sns.FacetGrid(hdf_melt, col="variable", col_wrap=5, hue="sample_id", ylim=(-2, 2))  # clip the y-axis to [-2, 2], as suggested in other kernels
grid = (grid.map(plt.plot, "timestamp", "value")).add_legend()
plt.show()

Comments:

a) on "y"
- seems to oscillate around a mean of zero, with a stationary pattern at least until time period c.1500
- differs across ids in terms of scale
- differs across ids in terms of time periods available and general pattern. For example, some ids show large volatility around time=1500 (e.g. id 2140) while others don't (e.g. id 1419)

b) on feature columns
- Many features, like fundamental_1, are clearly non-stationary, at least for some ids
- Technical_x features have an odd sinusoidal shape on a binary scale. For example, technical_43 seems to be either 0 or -2
- Within one feature, ids may differ greatly in terms of scale
- Zooming in on shorter time periods shows some sort of seasonality for some features (not done here). De-trending could be a good idea
- Time periods available are inconsistent across features for a given id. Note: some kernels have dealt with this by simply filling in missing values with the median or mean. Not sure that's a good idea
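
As a rough illustration of the de-trending idea mentioned above, here is a minimal sketch that takes first differences within each id. First differencing is only one of several possible de-trending choices, and it is my assumption here, not something tested in this notebook. The helper name `detrend_by_id` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def detrend_by_id(df, col, id_col="id", time_col="timestamp"):
    """Hypothetical helper: first-difference `col` within each id.

    The first observation of every id has no predecessor, so it becomes NaN.
    """
    ordered = df.sort_values([id_col, time_col])
    return ordered.groupby(id_col)[col].diff()

# Toy data (made-up values): id 1 trends up, id 2 trends down
toy = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "timestamp": [0, 1, 2, 3, 0, 1, 2],
    "fundamental_1": [1.0, 2.0, 3.0, 4.0, 10.0, 8.0, 6.0],
})
toy["fundamental_1_diff"] = detrend_by_id(toy, "fundamental_1")
print(toy)
```

Differencing within each id, rather than across the whole frame, avoids mixing the last value of one id with the first value of the next.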

Many kernels related to this competition involve fitting linear regressions on all the instruments (ids) and all the timestamps (time periods). Whether the models used are simple, ridge or LASSO regressions, they would perform best under a number of assumptions, including that all id-period observations are independent and identically distributed (iid), which may not apply to this dataset.

One reason why id-period observations may not be iid is the presence of specific, unobserved id-level characteristics that a regression model trained across all id-periods in a "pooled" fashion would not be able to capture. The problem caused by these unobserved differences (also called unobserved heterogeneity) may be avoided by training separate models for each id.

But modeling ids separately would not leverage the relationships that are constant across all ids, which a model spanning all id-period observations can capture. Also, some ids have very few time periods available for any feature, which means they could benefit from what can be learnt using other ids.

So a solution could be to ensemble id-specific regressions with panel data regressions that account for unobserved heterogeneity. An example of an ensembling-based solution could be a weighted score of two models, one that is id-specific and one that spans several ids. The weight of the latter model would be high for ids with few time periods available.
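
To make the weighting idea concrete, here is a minimal sketch of one possible blending rule. The functional form `half_point / (half_point + n_periods)` and the `half_point` value are my assumptions for illustration only; the function names and the toy predictions are hypothetical.

```python
import numpy as np
import pandas as pd

def blend_weight(n_periods, half_point=200.0):
    """Weight given to the pooled (multi-id) model.

    Close to 1 for ids with few periods, falling toward 0 as history grows.
    `half_point` (hypothetical) is the number of periods at which both
    models receive equal weight.
    """
    n = np.asarray(n_periods, dtype=float)
    return half_point / (half_point + n)

def blend(pred_id, pred_pooled, n_periods, half_point=200.0):
    """Weighted average of the id-specific and pooled predictions."""
    w = blend_weight(n_periods, half_point)
    return w * np.asarray(pred_pooled) + (1.0 - w) * np.asarray(pred_id)

# Toy example: three ids with short, medium and long histories (made-up predictions)
toy = pd.DataFrame({
    "n_periods": [20, 200, 1800],
    "pred_id": [0.010, 0.010, 0.010],
    "pred_pooled": [0.000, 0.000, 0.000],
})
toy["w_pooled"] = blend_weight(toy["n_periods"])
toy["pred_blend"] = blend(toy["pred_id"], toy["pred_pooled"], toy["n_periods"])
print(toy)
```

In practice the weight would probably be tuned on a validation set rather than fixed in advance.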

In the following cells I look at the problem from a panel data perspective, i.e. considering the availability and distribution of time periods by id. The ultimate question is: "Is there any reason why I should restrict my regression to a subset of ids, for example because their features span a consistent time range or because they have data available across a consistent subset of columns?"

In [None]:
# Set an index with two dimensions, id and timestamp. Useful for subsequent computations
hdf = hdf.set_index(["id", "timestamp"])
hdf = hdf.sort_index()  # sorts by id, then by timestamp within each id

# Number of periods available by id
t="Number of periods available by id"
res = hdf.groupby(level = 0)["y"].apply(lambda x: len(x[~x.isnull()])) # Note: some ids have null rows that would be counted if not removed
res.plot(kind = 'hist', bins = 100, title = t)
plt.show()

Over 500 ids have the maximum number of available observations for y (1812). That leaves c.900 ids with some gaps. If that's because they were created after timestamp=0, that's fine: there will still be a continuous set of observations. But if it's because specific time ranges are missing, that could be a problem. I am specifically worried that some ids may have missing observations from time period 1500 to the last one. Volatility tends to increase at this point in time, and predictions likely apply to timestamps beyond 1812, so it's important to accurately model relationships for this time period. It is therefore worth looking at the maximum time period available by id: is the max timestamp mostly within 1500-1812?

In [None]:
res = hdf.groupby(level = 0)["y"].apply(lambda x: x.index.max()[1] )
res.plot(kind = 'hist', bins = 100, title = "Maximum time period for 'y' column by id")
plt.show()

About 1100 ids have their last observation of y in time period 1812, which is the maximum timestamp. That's reassuring. Let's look at the c.300 other ids.

In [None]:
# How about ids that end before 1812, do many of them end early on?
res[res < 1812].plot(kind = 'hist', bins = 4)
plt.show()

About 250 ids end before timestamp 1400. As suggested previously, these will most benefit from what can be learnt using ids with recent time periods.
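
The max-timestamp plots show where each id ends, but not whether it has holes in the middle of its history. A minimal sketch of how a late start or early end could be separated from internal gaps follows. It assumes timestamps are consecutive integers and that rows with a null y have already been dropped. The helper name `gap_summary` and the toy data are hypothetical.

```python
import pandas as pd

def gap_summary(df, id_col="id", time_col="timestamp"):
    """Hypothetical helper: per-id first/last timestamp and number of internal gaps."""
    g = df.groupby(id_col)[time_col]
    out = pd.DataFrame({
        "first": g.min(),
        "last": g.max(),
        "n_obs": g.nunique(),
    })
    # Periods expected between first and last, minus periods actually observed
    out["n_missing_inside"] = (out["last"] - out["first"] + 1) - out["n_obs"]
    out["late_start"] = out["first"] > df[time_col].min()
    out["early_end"] = out["last"] < df[time_col].max()
    return out

# Toy data: id 1 is complete, id 2 starts late, id 3 has a hole at timestamps 2-3
toy = pd.DataFrame({
    "id": [1] * 6 + [2] * 3 + [3] * 4,
    "timestamp": [0, 1, 2, 3, 4, 5, 3, 4, 5, 0, 1, 4, 5],
})
summary = gap_summary(toy)
print(summary)
```

Ids with `n_missing_inside > 0` are the ones with genuinely missing time ranges, as opposed to ids that simply entered late or left early.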

The next plot looks at the number of empty columns. If most ids have only a few columns available, we would have to check whether those columns tend to be the same across ids, and hope that they are. Otherwise a model that spans multiple ids will be no better than a collection of id-specific models.

In [None]:
def number_empty_columns(df):
    # Ignore the helper columns added by the sampling merge above
    df = df.drop(columns=["sample_id", "group"], errors="ignore")
    # A column is "empty" for an id when every one of its values is null
    return df.isnull().all(axis=0).sum()

res = hdf.groupby(level = 0).apply(number_empty_columns)
res.plot(kind = 'hist', bins = 100, title = "Number of empty columns by id")
plt.show()

Close to 700 ids have only a few columns completely missing. Yet quite a few ids miss 60 feature columns or more. The next plot shows the same histogram with cumulative % frequency, to get a better sense of how many ids miss a large number of feature columns.

In [None]:
res.hist(cumulative=True, density=True, bins=10)  # `normed` was removed from matplotlib; `density` replaces it
plt.suptitle( "Number of empty columns by id - cumulative % frequency")
plt.show()

40% of ids miss 60 features or more, and c.20% miss 80 features or more. That's quite high and would require more investigation. For example, it would be useful to know whether, among the c.20% of ids missing 80 features or more, the features that are available tend to overlap across ids (not done here).
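
A possible way to check that overlap is sketched below: build an id-by-feature availability matrix, then compute the Jaccard similarity between each pair of ids. Using Jaccard similarity is my assumption for illustration, and the helper names and toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def availability_matrix(df, id_col="id", feature_cols=None):
    """One row per id, one column per feature: True if the id has at least one non-null value."""
    if feature_cols is None:
        feature_cols = [c for c in df.columns if c != id_col]
    return df[feature_cols].notnull().groupby(df[id_col]).any()

def jaccard_overlap(avail):
    """Pairwise Jaccard similarity between the sets of available features of each id."""
    a = avail.to_numpy(dtype=float)
    inter = a @ a.T                      # features available for both ids
    counts = a.sum(axis=1)               # features available per id
    union = counts[:, None] + counts[None, :] - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        jac = np.where(union > 0, inter / union, np.nan)
    return pd.DataFrame(jac, index=avail.index, columns=avail.index)

# Toy data: ids 1 and 2 share the same features, id 3 has a disjoint one
nan = np.nan
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 3],
    "f1": [0.1, 0.2, 0.3, nan, nan, nan],
    "f2": [nan, 0.5, 0.1, 0.2, nan, nan],
    "f3": [nan, nan, nan, nan, 0.7, 0.8],
})
avail = availability_matrix(toy)
overlap = jaccard_overlap(avail)
print(overlap)
```

A high average similarity within the group of sparse ids would suggest that a shared model on the common features is still viable for them.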

On the bright side, 40% of ids have 60 or fewer empty columns, which is not too bad. It would be even better if the non-null columns had a decent number of time periods available rather than just a few. So the next step looks at the average number of time periods available for non-empty columns.

In [None]:
def average_periods_by_col(df):
    # Ignore the helper columns added by the sampling merge above
    df = df.drop(columns=["sample_id", "group"], errors="ignore")
    counts = df.notnull().sum(axis=0)  # number of non-null time periods per column
    non_empty = counts[counts > 0]     # keep only columns with at least one value
    return non_empty.mean()            # average number of time periods across non-empty columns

res2 = hdf.groupby(level=0).apply(average_periods_by_col)
res2.plot(kind = 'hist', bins = 10, title = "Average number of time periods for non empty columns" )
plt.show()

Quite a few ids have fewer than 100 time periods available on average per non-null column. 100 periods is not a lot considering the total period covered (1812 timestamps).

Note that this could be due to the ids with a large number of empty columns, a group that we may want to treat separately. Let's look at the same plot, segmented between ids that have fewer than 60 empty columns and the other ids.

In [None]:
target_ids = res[res < 60].index
mapping = np.isin(res2.index, target_ids)  # True if the id has fewer than 60 empty columns
mapping = np.where(mapping, "target", "not target")

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True)  # share axes so both panels are comparable
fig.suptitle("Average number of time periods for non empty columns", fontsize=14)
title1, title2 = "Fewer than 60 empty columns", "60 or more empty columns"
res2[res2.index.isin(target_ids)].plot(kind='hist', bins=10, ax=axes[0], title=title1, xlim=(0, 1400))
res2[~res2.index.isin(target_ids)].plot(kind='hist', bins=10, ax=axes[1], title=title2, xlim=(0, 1400))
plt.show()

The same conclusion applies when we restrict to the group of ids with fewer than 60 empty columns: most have on average fewer than 100 time periods available.

The implication is that some feature columns may cover only a narrow time range, which relates to the earlier concern that the most recent, high-volatility period cannot be properly modelled.

Also, I wonder to what extent that can be harmful for a model accounting for unobserved heterogeneity. For example, a fixed effects model can be implemented by demeaning the data at the id level - the so-called "within model". If for a given feature column, only a few time periods are available, the mean value could potentially be far off the actual value we would get across the 0-1812 period.
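
For reference, the "within" transformation mentioned above can be sketched as follows. This is a minimal illustration on made-up data, not a fitted model for this dataset. The helper name `within_transform` is hypothetical, and dropping incomplete rows before demeaning is my assumption.

```python
import numpy as np
import pandas as pd

def within_transform(df, cols, id_col="id"):
    """Hypothetical helper: subtract the id-level mean from each column in `cols`.

    Rows with a null in any of `cols` are dropped first, so that all columns
    are demeaned over the same set of time periods.
    """
    out = df.dropna(subset=cols).copy()
    out[cols] = out[cols] - out.groupby(id_col)[cols].transform("mean")
    return out

# Toy data: two ids with very different levels of x and y (made-up values)
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "timestamp": [0, 1, 2, 0, 1, 2],
    "x": [1.0, 2.0, 3.0, 10.0, np.nan, 14.0],
    "y": [0.1, 0.2, 0.3, 1.0, 1.1, 1.2],
})
demeaned = within_transform(toy, ["x", "y"])

# With a single regressor, the within (fixed effects) slope is OLS on the demeaned data
slope = (demeaned["x"] * demeaned["y"]).sum() / (demeaned["x"] ** 2).sum()
print(demeaned)
print(slope)
```

Note how id 2 is demeaned using only two periods: with so few observations, its estimated mean depends heavily on which periods happen to be available, which is exactly the concern raised above.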