# Introduction
The Alibaba Click and Conversion Prediction (Ali-CCP) benchmark dataset was collected from the Taobao recommender system and contains over 84 million impressions {cite}`maEntireSpaceMultiTask2018`.

{numref}`ali_ccp_stats`: Alibaba Click and Conversion Prediction (Ali-CCP) Statistics

```{table} Dataset Statistics
:name: ali_ccp_stats
| Users | Items | Impressions | Clicks | Conversions |
|---|---|---|---|---|
|          400,000  |    4,300,000  |    84,000,000  |    3,400,000  |          18,000  |
```
The dataset $S=\{(x_{i},y_{i} \rightarrow z_{i}),i=1,...,N\}$ is drawn from a distribution $D$ with domain $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$, where:
- $\mathcal{X}$ is the feature space and $x_i$ denotes the feature vector of the $i^{th}$ impression,
- $\mathcal{Y}$ is the click label space and $y_i \in \{0,1\}$ is the binary indicator of a click-through for the $i^{th}$ impression,
- $\mathcal{Z}$ is the conversion label space and $z_i \in \{0,1\}$ is the binary indicator of a conversion for the $i^{th}$ impression, and
- $N$ is the number of impressions.

## Dataset Label Distribution    
The notation $y\rightarrow z$ denotes the sequential dependence of conversions on clicks in the $\mathcal{Y}\times\mathcal{Z}$ label space: a conversion can only follow a click, so the combination $y=0$, $z=1$ does not arise, as the label distribution below indicates.

{numref}`ali_ccp_labels`: Alibaba Click and Conversion Prediction (Ali-CCP) Label Distribution

```{table} Label Distribution
:name: ali_ccp_labels
| Y | Z | Click | Conversion |
|:---:|:---:|:---:|:---:|
| 0 | 0 | No | No |
| 0 | 1 | --- | --- |
| 1 | 0 | Yes | No |
| 1 | 1 | Yes | Yes |
```
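
The label table can be read as a simple validity rule on label pairs. The following is a minimal sketch of that rule; the function name `is_valid_label_pair` is hypothetical and is not part of the dataset or of any library.

```python
def is_valid_label_pair(click: int, conversion: int) -> bool:
    """Return True if (click, conversion) respects the y -> z dependence."""
    # Both labels are binary indicators.
    if click not in (0, 1) or conversion not in (0, 1):
        return False
    # A conversion without a click is the one undefined combination.
    return not (click == 0 and conversion == 1)


# Enumerate the label space and keep only the combinations that can occur.
valid_pairs = [
    (y, z) for y in (0, 1) for z in (0, 1) if is_valid_label_pair(y, z)
]
print(valid_pairs)
```

A check of this kind could be applied to each parsed impression to flag malformed rows.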

A sample click-through rate of roughly 4.05% and a conversion rate of roughly 0.02% of impressions show considerable class imbalance, a distinguishing characteristic of digital marketing data.
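
These rates follow from the rounded counts in the dataset statistics table. The sketch below reproduces the arithmetic. The post-click conversion rate (conversions divided by clicks) is not quoted in the text and is included only as an additional quantity derived from the same counts.

```python
# Rounded counts taken from the dataset statistics table.
impressions = 84_000_000
clicks = 3_400_000
conversions = 18_000

ctr = clicks / impressions                      # click-through rate
cvr_per_impression = conversions / impressions  # conversions per impression
cvr_per_click = conversions / clicks            # conversions per click (derived)

print(f"CTR:                      {ctr:.2%}")
print(f"Conversions / impression: {cvr_per_impression:.2%}")
print(f"Conversions / click:      {cvr_per_click:.2%}")
```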


## Dataset Organization
The data are split into a training set and a corresponding test set. Each set is composed of two CSV-formatted files: one that includes the labels and feature structures, and a second that captures the features most common among the impressions. A complete sample contains the labels and core features, together with the corresponding record from the common features file. The following tables summarize the dataset statistics.

{numref}`ali_ccp_training_set`: Alibaba Click and Conversion Prediction (Ali-CCP) Training Set Statistics

```{table} Training Set
:name: ali_ccp_training_set
| Training Set           |  Observations           | Filename                  | File Size |
|------------------------|-------------------------|---------------------------|-----------|
| Core Dataset           |          42,300,135     | sample_skeleton_train.csv | 10G       |
| Common Feature Dataset |                730,600  | common_features_train.csv | 8.0G      |
```


{numref}`ali_ccp_test_set`: Alibaba Click and Conversion Prediction (Ali-CCP) Test Set Statistics

```{table} Test Set
:name: ali_ccp_test_set
| Test Set               |  Observations           | Filename                 | File Size |
|------------------------|-------------------------|--------------------------|-----------|
| Core Dataset           |          43,016,840     | sample_skeleton_test.csv | 10G       |
| Common Feature Dataset |                884,212  | common_features_test.csv | 10G       |
```
The training and test sets are split in approximately 50/50 proportions. Combined, the four files occupy approximately 38G uncompressed on disk.
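
As a quick check, the split proportions and total size can be recomputed from the two tables above. This is a sketch of the arithmetic only; the row counts and file sizes are copied from the tables.

```python
# Core dataset row counts from the training and test set tables.
train_rows = 42_300_135
test_rows = 43_016_840
total_rows = train_rows + test_rows
train_share = train_rows / total_rows

# Uncompressed file sizes in gigabytes, as listed in the tables.
file_sizes_gb = {
    "sample_skeleton_train.csv": 10,
    "common_features_train.csv": 8.0,
    "sample_skeleton_test.csv": 10,
    "common_features_test.csv": 10,
}
total_size_gb = sum(file_sizes_gb.values())

print(f"Training share of impressions: {train_share:.2%}")
print(f"Total uncompressed size: {total_size_gb:.0f}G")
```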

As shown above, the training and test data each consist of two parts. The training set comprises a core training dataset (sample_skeleton_train.csv) and a common features dataset (common_features_train.csv). Next, we describe the structure and organization of the training set, which also applies to the test set.

### Core Training Dataset
Each row in the sample skeleton file represents an impression and is organized as follows:
```{figure} ../images/core_file_structure.png
---
name: core_file_structure
align: center
alt: Core Dataset Structure
---
Core Dataset Structure
```

The core datasets consist of three sections:
- Sample Id Section: Primary key in the range [1,n], where n is the number of records in the file.
- Labels Section: Two labels, click and conversion, following the sequential pattern described in {ref}`ali_ccp_labels`.
- Features Section: This section contains three fields:
  1. Common Feature Index: A foreign key that references a row in the common features file,
  2. Feature Number: The number of features in the following feature list, and
  3. Feature List: Several features separated by the ASCII character 0x01. Each element in the feature list is a feature structure containing the following three components:
     - Feature Field Id: This corresponds to the feature field id in {ref}`ali_ccp_feature_sets`
     - Feature Id: A global identifier for the feature
     - Feature Value: The real value of the feature

     Within a feature structure, the ASCII character 0x02 separates the feature field id from the feature id, and 0x03 separates the feature id from the feature value.
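
The layout described above can be sketched as a small parser. This is a minimal illustration, not the project's actual loader. It assumes the six columns appear in the order sample id, click, conversion, common feature index, feature number, feature list, and that they are comma-delimited. The names `Feature`, `parse_feature_list`, and `parse_core_row` are hypothetical, and the example row is synthetic.

```python
from typing import List, NamedTuple

FEATURE_SEP = "\x01"  # between feature structures in the feature list
FIELD_SEP = "\x02"    # between feature field id and feature id
VALUE_SEP = "\x03"    # between feature id and feature value


class Feature(NamedTuple):
    field_id: str     # kept as a string because ids such as 109_14 occur
    feature_id: str
    value: float


def parse_feature_list(raw: str) -> List[Feature]:
    """Split a raw feature list into its feature structures."""
    features = []
    for item in raw.split(FEATURE_SEP):
        if not item:
            continue  # tolerate an empty list or a trailing separator
        field_id, rest = item.split(FIELD_SEP, 1)
        feature_id, value = rest.split(VALUE_SEP, 1)
        features.append(Feature(field_id, feature_id, float(value)))
    return features


def parse_core_row(line: str) -> dict:
    """Parse one row of the core file (assumed comma-delimited, six columns)."""
    sample_id, click, conversion, common_idx, feature_num, raw = (
        line.rstrip("\n").split(",", 5)
    )
    features = parse_feature_list(raw)
    if int(feature_num) != len(features):
        raise ValueError(f"sample {sample_id}: feature count mismatch")
    return {
        "sample_id": int(sample_id),
        "click": int(click),
        "conversion": int(conversion),
        "common_feature_index": common_idx,
        "features": features,
    }


# Synthetic row for illustration only; the ids are made up.
example = "1,1,0,cf_0001,2,205\x02123\x031.0\x01301\x02456\x031.0"
row = parse_core_row(example)
print(row["click"], row["conversion"], row["features"])
```

Keeping the separators as named constants makes the three nesting levels explicit and easy to adjust if the real files differ from the assumptions stated here.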

```{table} Description of Feature Sets
:name: ali_ccp_feature_sets
| Feature Category       | Feature Field ID | Feature Field Description                                    |
|------------------------|------------------|--------------------------------------------------------------|
| User Features          | 101              | User ID.                                                     |
|                        | 109_14           | User historical behaviors of category ID and count*.         |
|                        | 110_14           | User historical behaviors of shop ID and count*.             |
|                        | 127_14           | User historical behaviors of brand ID and count*.            |
|                        | 150_14           | User historical behaviors of intention node ID and count*.   |
|                        | 121              | Categorical ID of User Profile.                              |
|                        | 122              | Categorical group ID of User Profile.                        |
|                        | 124              | User's Gender ID.                                            |
|                        | 125              | User's Age ID.                                               |
|                        | 126              | User's Consumption Level Type I.                             |
|                        | 127              | User's Consumption Level Type II.                            |
|                        | 128              | User's Occupation: whether or not the user works.            |
|                        | 129              | User's Geography Information.                                |
| Item Features          | 205              | Item ID.                                                     |
|                        | 206              | Category ID to which the item belongs.                       |
|                        | 207              | Shop ID to which item belongs.                               |
|                        | 210              | Intention node ID to which the item belongs.                 |
|                        | 216              | Brand ID of the item.                                        |
| Combination Features   | 508              | The combination of features 109_14 and 206.                  |
|                        | 509              | The combination of features 110_14 and 207.                  |
|                        | 702              | The combination of features 127_14 and 216.                  |
|                        | 853              | The combination of features 150_14 and 210.                  |
| Context Features       | 301              | A categorical expression of position.                        |
```

### Common Features Dataset
Each row in the common features file represents a collection of common features shared by many impressions in the core data files and is organized as follows:
```{figure} ../images/common_features_file_structure.png
---
name: common_features_file_structure
align: center
alt: Common Features Dataset Structure
---
Common Features Dataset Structure
```
Like the core dataset, the common features dataset has three components:   
1. Common Features Index: Unique identifier for the common features row     
2. Feature Num: Number of features in the following feature list       
3. Feature List: A list of feature structures separated by ASCII character 0x01 as in the core dataset.
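
A complete sample is obtained by looking up a core row's common feature index in the common features file and combining the two feature lists. The following is a minimal sketch under stated assumptions: both files are taken to be comma-delimited with the columns in the order described, and the names `load_common_features` and `complete_sample` are hypothetical. The rows are synthetic.

```python
from typing import Dict, Iterable, List, Tuple

Feature = Tuple[str, str, float]  # (feature field id, feature id, value)


def parse_feature_list(raw: str) -> List[Feature]:
    """Split a feature list on 0x01, then each structure on 0x02 and 0x03."""
    features = []
    for item in filter(None, raw.split("\x01")):
        field_id, rest = item.split("\x02", 1)
        feature_id, value = rest.split("\x03", 1)
        features.append((field_id, feature_id, float(value)))
    return features


def load_common_features(lines: Iterable[str]) -> Dict[str, List[Feature]]:
    """Index the common features rows by their common features index."""
    table = {}
    for line in lines:
        index, _feature_num, raw = line.rstrip("\n").split(",", 2)
        table[index] = parse_feature_list(raw)
    return table


def complete_sample(core_line: str, common: Dict[str, List[Feature]]) -> dict:
    """Join one core row with its common features via the foreign key."""
    sample_id, click, conversion, common_idx, _num, raw = (
        core_line.rstrip("\n").split(",", 5)
    )
    return {
        "sample_id": int(sample_id),
        "click": int(click),
        "conversion": int(conversion),
        "features": common[common_idx] + parse_feature_list(raw),
    }


# Synthetic rows for illustration only; the ids are made up.
common_lines = ["cf_0001,2,101\x0211\x031.0\x01124\x0222\x031.0"]
core_line = "1,1,0,cf_0001,1,205\x0233\x031.0"

common = load_common_features(common_lines)
sample = complete_sample(core_line, common)
print(sample)
```

Because the common features files are several gigabytes each, holding the whole index in memory as done here may not be practical at full scale. A chunked or database-backed join is a reasonable alternative.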

## Sourcing the Data
The data may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408). The training and test sets are approximately 9 GB compressed.

## Section Organization
Exploratory data analysis and inference will be the focus of this section. As such, the remainder of this section is organized as follows:

 - [Section 2.2](section_22_data_build) **Data Build**: The data are downloaded, extracted, merged, and subsampled for downstream analysis.
 - [Section 2.3](section_23_data_profile) **Data Profiling**: Evaluation of the structure, quality, and suitability of the data for analysis. This high-level overview examines data completeness, data classification, format, and properties of the data values.
 - [Section 2.4](section_24_data_preprocessing) **Data Processing**: Transformation, encoding, and filtering address structure, format, and scale so that the dataset is suitable for exploratory analysis.
 - [Section 2.5](section_25_eda) **Exploratory Data Analysis**: We aim to maximize insight into the dataset: uncover its underlying structure, extract important variables, and detect outliers and anomalies.
