# Alibaba Development Set
Creates an optimal class-proportional, distribution-preserving sample of the Alibaba dataset for use in profiler and exploratory development. Four files make up the Alibaba dataset:

- impression: Core file containing a user, an ad, and the target variable, 'click'.
- ad: Basic information for every ad in the impression table.
- user: User profile information.
- behavior: User browsing, shopping cart, favorite, and purchase behaviors.

The samples will be taken from the impression table. Ad, user, and behavior data will be included via joins with the sampled impression data.

To obtain a subsample that reflects the structure of the original Alibaba dataset, we draw repeated samples and keep the one whose distribution is most similar to that of the full dataset.
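The repeated, class-proportional draw described above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the helper `stratified_sample`, the toy `rows` data, the 10% fraction, and the number of candidates are all assumptions introduced here; only the `click` target comes from the impression table.

```python
import random
from collections import defaultdict


def stratified_sample(rows, label_key, fraction, seed):
    """Draw `fraction` of the rows from each class so class proportions are kept.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sample = []
    for members in by_class.values():
        k = round(len(members) * fraction)  # per-class quota
        sample.extend(rng.sample(members, k))
    return sample


# Toy stand-in for the impression table: 5% of rows are clicks (assumed data).
rows = [{"user": i, "click": 1 if i % 20 == 0 else 0} for i in range(1000)]

# Draw several candidate samples; each one differs only in its random seed.
candidates = [stratified_sample(rows, "click", 0.1, seed) for seed in range(5)]
```

Each candidate keeps the click rate of the toy table, so the candidates differ only in how well the remaining variables match the full data.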

## Distribution Density Measure
Several statistical methods are available for measuring and comparing density distributions. For this assignment, we will measure the degree to which the data follow a normal distribution using the Anderson-Darling test. The test statistic is defined as:

$$A^2=-n-S$$

where:
$$
S=\sum_{i=1}^n\frac{2i-1}{n}\left[\ln F(Y_i) + \ln\left(1-F(Y_{n+1-i})\right)\right]
$$

$F$ is the [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) of the normal distribution, $Y_1 \le Y_2 \le \dots \le Y_n$ are the data sorted in ascending order, and $n$ is the sample size.
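As a rough illustration of the formula, the statistic can be computed with the Python standard library alone. The sketch below assumes that $F$ is a normal distribution fitted with the sample mean and sample standard deviation; the function name `anderson_darling_normal` and the two synthetic samples are hypothetical and exist only for this example.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Return A^2 = -n - S for `values` against a fitted normal distribution.

    Assumption: the normal's mean and standard deviation are estimated
    from the sample itself.
    """
    y = sorted(values)  # Y_1 <= ... <= Y_n
    n = len(y)
    fitted = NormalDist(mean(y), stdev(y))

    # Clamp the CDF away from 0 and 1 so the logarithms stay finite.
    eps = 1e-12
    cdf = [min(max(fitted.cdf(v), eps), 1 - eps) for v in y]

    # With 0-based lists, Y_i is cdf[i - 1] and Y_{n+1-i} is cdf[n - i].
    s = sum(
        (2 * i - 1) / n * (math.log(cdf[i - 1]) + math.log(1 - cdf[n - i]))
        for i in range(1, n + 1)
    )
    return -n - s


rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

a2_normal = anderson_darling_normal(normal_sample)
a2_skewed = anderson_darling_normal(skewed_sample)
```

A smaller value means the data sit closer to the fitted normal, so the skewed sample should score noticeably higher than the normal one.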

## Selection Criteria
Using the Anderson-Darling test statistic, we measure, for each variable, the distance between the sample's distribution and that of the original variable. The best subsample is the one whose largest distance across all variables is smallest. Concretely, we select the best sample as:

$$
\text{Best Sample} = \arg\min_{\text{samples}}\;\max_{\text{variables}}\;\text{Distance}(\text{Sample}, \text{Original})
$$
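The selection rule is a minimax choice: score each candidate sample by its worst variable, then keep the candidate with the lowest score. A minimal sketch follows, assuming the per-variable distances have already been computed. The function `select_best_sample`, the sample identifiers, the variable names, and the distance values are all made up for illustration.

```python
def select_best_sample(distances):
    """Pick the sample whose largest per-variable distance is smallest.

    `distances` maps a sample id to a {variable: distance} mapping.
    Hypothetical helper for illustration only.
    """
    worst = {sample_id: max(per_var.values()) for sample_id, per_var in distances.items()}
    best_id = min(worst, key=worst.get)
    return best_id, worst[best_id]


# Illustrative distances only; these are not measured values.
distances = {
    "sample_0": {"var_a": 0.40, "var_b": 0.10},
    "sample_1": {"var_a": 0.25, "var_b": 0.30},
    "sample_2": {"var_a": 0.05, "var_b": 0.90},
}

best_id, best_score = select_best_sample(distances)
```

Here `sample_1` is chosen: it is not the closest on either variable, but its worst distance (0.30) is lower than that of the other two candidates (0.40 and 0.90).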

## Original Dataset
The distribution densities for the original impression table are obtained as follows.

In [None]:
from deepctr.persistence.dal import DataParam, DataTableDAO

In [None]:
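# Describe which table to load: the staged impression table in the test environment.
dto = DataParam(name='impression', dataset='alibaba_staged', asset='alibaba', stage='staged', env='test', format='parquet')
dao = DataTableDAO()
# Read the table and preview its rows.
impression = dao.read(dto)
impression.show()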
      