In [1]:
from curves import plot_lines, functional_boxplot, functional_boxplot_from_df, split_datasets
from bokeh.io import output_notebook
from bokeh.plotting import show

In [2]:
output_notebook()

# Data depth

Given an ensemble of data drawn from a distribution F, _data depth_ quantifies how central (or deep) a particular sample is within the cloud of sampled data. Deeper samples are considered more representative of the ensemble and are assigned high depth values, whereas samples farther from the rest of the ensemble are considered outliers and are assigned correspondingly lower depth values. The notion of data depth therefore provides a center-outward ordering (also known as order statistics) for an ensemble of sampled data. [Mirzargar, Whitaker (2014)](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6875964)
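
As a small illustration of the center-outward ordering, the sketch below uses one simple depth measure for scalar samples: the univariate halfspace (Tukey) depth, taken here as the smaller of the fraction of samples at or below a value and the fraction at or above it. This particular measure, the function name `halfspace_depth_1d` and the sample values are assumptions made for illustration only; they are not part of the `curves` module used in this notebook.

```python
import numpy as np


def halfspace_depth_1d(samples):
    """Univariate halfspace depth of every sample with respect to the whole set."""
    samples = np.asarray(samples, dtype=float)
    # Fraction of samples at or below / at or above each sample (the sample itself counts).
    at_or_below = (samples[None, :] <= samples[:, None]).mean(axis=1)
    at_or_above = (samples[None, :] >= samples[:, None]).mean(axis=1)
    return np.minimum(at_or_below, at_or_above)


# Hypothetical scalar ensemble with one obvious outlier (9.5).
samples = np.array([2.0, 9.5, 3.1, 2.9, 3.0])
depth = halfspace_depth_1d(samples)

# Sort from deepest (most central) to shallowest (most outlying).
center_outward = samples[np.argsort(-depth, kind="stable")]
print(depth)
print(center_outward)
```

The median-like value 3.0 receives the highest depth, while the extremes 2.0 and 9.5 receive the lowest, which is the ordering described above.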

# Band Depth

The work of [Lopez-Pintado, Romo (2009)](https://www.tandfonline.com/doi/pdf/10.1198/jasa.2009.0108) introduces the notion of _functional band depth_, a generalization of data depth to higher dimensions that is designed for ensembles of functions. Unlike other higher-dimensional generalizations, functional band depth goes beyond the point-wise analysis of functional data: it provides a measure of the centrality of a function within an ensemble that is sensitive to both the shape and the position of the function relative to the other ensemble members.

The _band_ in band depth is the region between two (or more) curves in a line chart. It appears as the grey region in the image below, from [__link sun, genton, nychka 2012 paper here__]:

![error loading image](images/band_SUN2012.jpeg "Example of a band delimited by two curves")

Thus, the _band depth_ of a curve can be defined as the proportion of bands delimited by _j_ different curves that contain the whole graph of the curve in question. One problem with this definition is that, in datasets with a large number of curves or with curves that cross one another, few bands completely contain any curve, which results in a poorly defined ranking in which many curves share the same band depth value. The example below shows 5 curves with 5 points each, where the central function has a spike at the middle point of its graph.

In [3]:
data, plot_lines_test = plot_lines('../data/infile.csv',"../outputs/outfile_out.txt",'od','1/1/2018','1/5/2018', 5)
show(plot_lines_test)

(17, 5)


If we color the curves according to their band depth values (below), with lighter colors for lower band depth values and stronger colors for higher ones, a weakness of the original definition becomes visible (the parameter 'od' selects the original method for calculating band depth). The central curve has the same color as the outer curves, which are contained only in the bands they themselves define. It therefore receives a low band depth value, even though it lies within the bands defined by the other curves 80% of the time.

In [10]:
data, plot_lines_depth = plot_lines('data/infile.csv',"outputs/outfile_out.txt",'od','1/1/2018','1/5/2018',
                              5, depth_color=True)
data.head()
show(plot_lines_depth)

(17, 5)
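
The tie between the central curve and the outer curves can be reproduced with a minimal sketch of the original band depth. It assumes bands delimited by _j_ = 2 curves and uses made-up toy data (five flat curves, the middle one with a spike) rather than the contents of `infile.csv`. The function `original_band_depth` is a hypothetical helper written for this illustration and is not necessarily how `plot_lines` computes the 'od' depth.

```python
from itertools import combinations

import numpy as np


def original_band_depth(curves, j=2):
    """Fraction of bands, each delimited by j curves, that contain a curve at every point."""
    curves = np.asarray(curves, dtype=float)
    n_curves = curves.shape[0]
    contained = np.zeros(n_curves)
    n_bands = 0
    for members in combinations(range(n_curves), j):
        # A band is the region between the point-wise min and max of its curves.
        lower = curves[list(members)].min(axis=0)
        upper = curves[list(members)].max(axis=0)
        inside = (curves >= lower) & (curves <= upper)
        # A curve counts only if it stays inside the band at all points.
        contained += inside.all(axis=1)
        n_bands += 1
    return contained / n_bands


# Hypothetical toy ensemble: 5 curves, 5 points each, spike in the central curve.
toy_curves = np.array([
    [1, 1, 1, 1, 1],
    [2, 2, 2, 2, 2],
    [3, 3, 10, 3, 3],
    [4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5],
], dtype=float)

depths = original_band_depth(toy_curves)
print(depths)

# Share of points at which the central curve lies between the two outermost curves.
central, lowest, highest = toy_curves[2], toy_curves[0], toy_curves[4]
share_inside = np.mean((central >= lowest) & (central <= highest))
print(share_inside)
```

On this toy data the central curve and the two outermost curves all receive a depth of 0.4, because each is fully contained only in the four bands it helps define, even though the central curve lies between the outermost curves at 4 of its 5 points (0.8).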


In [9]:
data, plot_lines_depth = plot_lines('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'tmd','1/1/2018','12/31/2018',
                                    24, depth_color=True)
data.head()
show(plot_lines_depth)

(36, 365)


In [None]:
fbplot_taxisv1 = functional_boxplot('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'01/01/2018','12/31/2018', 24,'tmd')
show(fbplot_taxisv1)

In [None]:
data_weekdays_taxi, data_weekends_taxi = split_datasets('data/taxis_v1.csv',"outputs/taxis_v1_out.txt",'01/01/2018','12/31/2018', 24)
print(data_weekdays_taxi.shape, data_weekends_taxi.shape)

In [None]:
fbplot_taxisv1_weekday_od = functional_boxplot_from_df(data_weekdays_taxi,'od','Taxi trips on weekdays using original depth')
show(fbplot_taxisv1_weekday_od)