# Introduction
The Alibaba Click and Conversion Prediction (Ali-CCP) benchmark dataset was collected from the Taobao recommender system and contains over 84 million impressions {cite}`maEntireSpaceMultiTask2018`.

{numref}`ali_ccp_stats`: Alibaba Click and Conversion Prediction (Ali-CCP) Statistics

```{table} Dataset Statistics
:name: ali_ccp_stats
| Users | Items | Impressions | Clicks | Conversions |
|---|---|---|---|---|
|          400,000  |    4,300,000  |    84,000,000  |    3,400,000  |          18,000  |
```
The dataset $S=\{(x_{i},y_{i} \rightarrow z_{i}),i=1,...,N\}$ is drawn from a distribution $D$ with domain $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$, where:
- $\mathcal{X}$ is the feature space and $x_i$ denotes the feature vector of the $i^{th}$ impression,
- $\mathcal{Y}$ is the click label space and $y_i \in \{0,1\}$ is the binary indicator of a click on the $i^{th}$ impression,
- $\mathcal{Z}$ is the conversion label space and $z_i \in \{0,1\}$ is the binary indicator of a conversion for the $i^{th}$ impression, and
- $N$ is the number of impressions.

## Dataset Label Distribution    
Further, $y\rightarrow z$ denotes the sequential dependence of clicks and conversions in the $\mathcal{Y}\times\mathcal{Z}$ label space: a conversion can occur only after a click, so the combination $y=0$, $z=1$ does not arise, as the label distribution below indicates.

{numref}`ali_ccp_labels`: Alibaba Click and Conversion Prediction (Ali-CCP) Label Distribution

```{table} Label Distribution
:name: ali_ccp_labels
| Y | Z | Click | Conversion |
|:---:|:---:|:---:|:---:|
| 0 | 0 | No | No |
| 0 | 1 | --- | --- |
| 1 | 0 | Yes | No |
| 1 | 1 | Yes | Yes |
```
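
The dependence $y \rightarrow z$ can be checked mechanically when labels are loaded. The following is a minimal sketch, assuming labels arrive as `(y, z)` integer pairs; the function name `label_distribution` and the toy labels are hypothetical and are not part of the Ali-CCP release.

```python
from collections import Counter
from typing import Iterable, Tuple

# Valid (y, z) pairs under the dependence y -> z: a conversion requires a click.
VALID_LABELS = {(0, 0), (1, 0), (1, 1)}


def label_distribution(labels: Iterable[Tuple[int, int]]) -> Counter:
    """Count (click, conversion) pairs, rejecting any pair with y=0 and z=1."""
    counts = Counter()
    for y, z in labels:
        if (y, z) not in VALID_LABELS:
            raise ValueError(f"invalid label pair (y={y}, z={z})")
        counts[(y, z)] += 1
    return counts


# Toy labels for illustration only; they are not drawn from Ali-CCP.
toy_labels = [(0, 0), (0, 0), (1, 0), (1, 1), (0, 0)]
dist = label_distribution(toy_labels)
print(dist)
```

Raising on an invalid pair is one design choice; a profiling pass might instead count such rows and report them.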

The sample click-through rate of approximately 4.05% and conversion rate of approximately 0.02% reveal considerable class imbalance, a distinguishing characteristic of digital marketing data.
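
The rates quoted above follow directly from the rounded counts in the dataset statistics table. A short sketch of the arithmetic is given below; the post-click conversion rate is an additional derived quantity that is not reported in the text above, and all variable names are illustrative.

```python
# Rounded counts taken from the dataset statistics table.
impressions = 84_000_000
clicks = 3_400_000
conversions = 18_000

ctr = clicks / impressions              # click-through rate per impression
cvr = conversions / impressions         # conversion rate per impression
post_click_cvr = conversions / clicks   # conversion rate among clicked impressions

print(f"CTR: {ctr:.2%}")
print(f"CVR: {cvr:.2%}")
print(f"Post-click CVR: {post_click_cvr:.2%}")
```

Because the inputs are rounded, these figures are approximations of the rates in the full dataset.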

## Dataset Feature Space 
The dataset feature space $\mathcal{X}$ contains 13 user features, 5 item features, 4 combination features, and a context feature.

**User Features**     
User identity, behavior, demographics, and historical information are captured by the following features.

1. User ID.
2. User historical behaviors of category ID and count*.
3. User historical behaviors of shop ID and count*.
4. User historical behaviors of brand ID and count*.
5. User historical behaviors of intention node ID and count*.
6. Categorical ID of User Profile.
7. Categorical group ID of User Profile.
8. User gender ID.
9. User age ID.
10. User consumption level type I.
11. User consumption level type II.
12. User occupation: whether or not the user works.
13. User geography information.

**Item Features**    
The Taobao dataset distinguishes over 4.3 million items using five features:

1. Item ID.
2. Category ID to which the item belongs.
3. Shop ID to which the item belongs.
4. Intention node ID to which the item belongs.
5. Brand ID of the item.

Combination features capture the relationship between a user's historical behavior and item attributes. Lastly, the dataset includes a geographical context feature.

## Dataset Organization
The data are split into a training set and a corresponding test set. Each set comprises two CSV-formatted files: a core file that contains the labels and core features, and a common features file that captures the features shared by many impressions. A complete sample combines the labels and core features with the corresponding record from the common features file. The following tables summarize the dataset statistics.

{numref}`ali_ccp_training_set`: Alibaba Click and Conversion Prediction (Ali-CCP) Training Set Statistics

```{table} Training Set
:name: ali_ccp_training_set
| Training Set           |  Observations           | File Size |
|------------------------|-------------------------|-----------|
| Core Dataset           |          42,300,135     | 10 GB     |
| Common Feature Dataset |                730,600  | 8 GB      |
```


{numref}`ali_ccp_test_set`: Alibaba Click and Conversion Prediction (Ali-CCP) Test Set Statistics

```{table} Test Set
:name: ali_ccp_test_set
| Test Set               |  Observations           | File Size |
|------------------------|-------------------------|-----------|
| Core Dataset           |          43,016,840     | 10 GB     |
| Common Feature Dataset |                884,212  | 10 GB     |
```
The training and test sets are split in approximately 50/50 proportions. Combined, the four files occupy approximately 38 GB uncompressed on disk.
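
Assembling a complete sample amounts to a keyed join between the core file and the common features file. The sketch below illustrates the idea on two tiny in-memory CSVs using only the standard library. The column names (`sample_id`, `common_key`, `core_features`, `common_features`) and the helper `complete_samples` are assumptions made for illustration and do not reflect the actual Ali-CCP file layout.

```python
import csv
import io

# Hypothetical miniature files; the column names are assumptions for illustration.
core_csv = io.StringIO(
    "sample_id,click,conversion,common_key,core_features\n"
    "1,0,0,a,f1\n"
    "2,1,0,a,f2\n"
    "3,1,1,b,f3\n"
)
common_csv = io.StringIO(
    "common_key,common_features\n"
    "a,u1\n"
    "b,u2\n"
)

# The common features file has far fewer rows, so index it by key.
common_by_key = {
    row["common_key"]: row["common_features"]
    for row in csv.DictReader(common_csv)
}


def complete_samples(core_file, lookup):
    """Stream core rows and attach the matching common-feature record to each."""
    for row in csv.DictReader(core_file):
        row["common_features"] = lookup[row["common_key"]]
        yield row


samples = list(complete_samples(core_csv, common_by_key))
for sample in samples:
    print(sample)
```

At full scale the common features file is several gigabytes, so an in-memory dictionary may not be practical; chunked processing or a disk-backed key-value store would be reasonable alternatives.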

## Sourcing the Data
The data may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408). The training and test sets are approximately 9 GB compressed.

## Section Organization
Exploratory data analysis and inference will be the focus of this section. As such, the remainder of this section is organized as follows:

 - [Section 2.2](section_22_data_acquisition) **Data Acquisition**: The data are extracted, stored, and subsampled for downstream analysis.
 - [Section 2.3](section_23_data_profile) **Data Profiling**: Evaluation of the structure, quality, and suitability of the data for exploratory analysis. This high-level overview will examine data completeness, data classification, format, and properties of the data values.
 - [Section 2.4](section_24_data_preprocessing) **Data Processing**: Structure, format, and scale will be addressed during the data processing section. Transformation, encoding, and filtering activities will ensure that the dataset is suitable for exploratory analysis. 
 - [Section 2.5](section_25_eda) **Exploratory Data Analysis**: We aim to maximize insight into the dataset, uncover its underlying structure, extract important variables, and detect outliers and anomalies.


