In [1]:
# Imports
import os
import pandas as pd

In [2]:
# REMOVE-CELL
home = "/home/john/projects/DeepCVR/"
os.chdir(home)

# Introduction
The Alibaba Click and Conversion Prediction (Ali-CCP) benchmark dataset was collected from the Taobao recommender system and contains over 84 million impressions {cite}`maEntireSpaceMultiTask2018`.

{numref}`ali_ccp_stats`: Alibaba Click and Conversion Prediction (Ali-CCP) Statistics

```{table} Dataset Statistics
:name: ali_ccp_stats
| Users | Items | Impressions | Clicks | Conversions |
|---|---|---|---|---|
|          400,000  |    4,300,000  |    84,000,000  |    3,400,000  |          18,000  |
```
The dataset $S=\{(x_{i},y_{i} \rightarrow z_{i}),i=1,...,N\}$ is drawn from a distribution $D$ with domain $\mathcal{X}\times\mathcal{Y}\times\mathcal{Z}$, where:
- $\mathcal{X}$ is the feature space and $x_i$ denotes the $i^{th}$ feature vector in the feature space,
- $\mathcal{Y}$ is the click label space and $y_i \in \{0,1\}$ is the binary indicator for a click-through for the $i^{th}$ impression,
- $\mathcal{Z}$ is the conversion label space and $z_i \in \{0,1\}$ is the binary indicator for a conversion for the $i^{th}$ impression, and
- $N$ is the number of impressions.

## Dataset Label Distribution
Further, $y\rightarrow z$ denotes the sequential dependence of clicks and conversions in the $\mathcal{Y}\times\mathcal{Z}$ label space: a conversion can follow only a click, so the combination $y=0, z=1$ does not occur, as the label distribution below indicates.

{numref}`ali_ccp_labels`: Alibaba Click and Conversion Prediction (Ali-CCP) Label Distribution

```{table} Label Distribution
:name: ali_ccp_labels
| Y | Z | Click | Conversion |
|:---:|:---:|:---:|:---:|
| 0 | 0 | No | No |
| 0 | 1 | --- | --- |
| 1 | 0 | Yes | No |
| 1 | 1 | Yes | Yes |
```

A sample click-through rate of approximately 4.05% and a conversion rate of approximately 0.02% evidence considerable class imbalance, a distinguishing characteristic of digital marketing data.
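
As a quick check, the rates quoted above can be reproduced from the rounded counts in the dataset statistics table. The snippet below is a minimal sketch; the per-click conversion rate is derived here for illustration only and is not a figure reported in the text.

```python
# Rounded counts taken from the dataset statistics table.
impressions = 84_000_000
clicks = 3_400_000
conversions = 18_000

ctr = clicks / impressions              # click-through rate over all impressions
cvr = conversions / impressions         # conversion rate over all impressions
cvr_given_click = conversions / clicks  # conversion rate among clicked impressions (illustrative)

print(f"CTR: {ctr:.2%}")
print(f"CVR (per impression): {cvr:.2%}")
print(f"CVR (per click): {cvr_given_click:.2%}")
```

Because the inputs are rounded, the results are approximate; exact rates would require the unrounded counts.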

## Dataset Features
The dataset includes user, item, combination and context features as specified in {ref}`ali_ccp_feature_sets` below.
```{table} Description of Feature Sets
:name: ali_ccp_feature_sets
| Feature Category       | Feature Field ID | Feature Field Description                                    |
|------------------------|------------------|--------------------------------------------------------------|
| User Features          | 101              | User ID.                                                     |
|                        | 109_14           | User historical behaviors of category ID and count*.         |
|                        | 110_14           | User historical behaviors of shop ID and count*.             |
|                        | 127_14           | User historical behaviors of brand ID and count*.            |
|                        | 150_14           | User historical behaviors of intention node ID and count*.   |
|                        | 121              | Categorical ID of User Profile.                              |
|                        | 122              | Categorical group ID of User Profile.                        |
|                        | 124              | User's gender ID.                                            |
|                        | 125              | User's age ID.                                               |
|                        | 126              | User's consumption level type I.                             |
|                        | 127              | User's consumption level type II.                            |
|                        | 128              | User's occupation: whether or not the user works.            |
|                        | 129              | User's geography information.                                |
| Item Features          | 205              | Item ID.                                                     |
|                        | 206              | Category ID to which the item belongs.                       |
|                        | 207              | Shop ID to which item belongs.                               |
|                        | 210              | Intention node ID to which the item belongs.                 |
|                        | 216              | Brand ID of the item.                                        |
| Combination Features   | 508              | The combination of features 109_14 and 206.                  |
|                        | 509              | The combination of features 110_14 and 207.                  |
|                        | 702              | The combination of features 127_14 and 216.                  |
|                        | 853              | The combination of features 150_14 and 210.                  |
| Context Features       | 301              | A categorical expression of position.                        |
```

## Dataset Organization
The data are split approximately 50/50 between training and test sets, each containing two CSV files, as outlined in {numref}`dataset_statistics` below.

{numref}`dataset_statistics`: Alibaba Click and Conversion Prediction (Ali-CCP) Dataset Statistics

```{table} Dataset Statistics
:name: dataset_statistics

| Name                               |    Set   |          Filename         | Size (GB) |       n      | p |
|------------------------------------|:--------:|:-------------------------:|:---------:|:------------:|:-:|
| Impressions                        | Training | sample_skeleton_train.csv |     10    |  42,300,134  | 6 |
| Common Features                    | Training | common_features_train.csv |     8     |     730,599  | 3 |
| Impressions                        |   Test   |  sample_skeleton_test.csv |     10    |  43,016,839  | 6 |
| Common Features                    |   Test   |  common_features_test.csv |     10    |     884,211  | 3 |
| Total                              |          |                           |     38    |  86,931,783  |   |
| Training Set Non-Response Rate     |   96.1%  |                           |           |              |   |
| Training Set Click-Through Rate    |   3.89%  |                           |           |              |   |
| Training Set Conversion Rate (CVR) |   0.02%  |                           |           |              |   |
```
The summary statistics expose a challenge characteristic of digital marketing and analytics datasets: responses are rare, so the classes are severely imbalanced. Average CTR for display networks is below 2% {cite}`GoogleAdsBenchmarks`. CTR for B2B companies averages approximately 2.3% across all display platforms. Such severe imbalance between majority and minority classes in Big Data can bias the predictive performance of machine learning algorithms. Unable to distinguish the classes effectively, these algorithms tend to label almost all instances as the majority class, producing deceptively high accuracy scores. Several data generation and sampling techniques address the class imbalance problem; Random Undersampling is the most notable, and it has been reported to improve binary classification performance while reducing computational burden and training time.
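
To make Random Undersampling concrete, the following is a minimal sketch using pandas. It assumes the minority class is labeled 1, and the function name `random_undersample`, the `ratio` parameter, and the synthetic data are all hypothetical and introduced only for illustration; it is not necessarily the procedure applied to Ali-CCP in later sections.

```python
import numpy as np
import pandas as pd


def random_undersample(df, label_col, ratio=1.0, seed=0):
    """Keep every minority row (label 1) and a random subset of majority rows (label 0).

    `ratio` is the desired number of majority rows per minority row.
    """
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_keep = min(len(majority), int(len(minority) * ratio))
    sampled = majority.sample(n=n_keep, random_state=seed)
    # Shuffle so that the two classes are interleaved in the result.
    combined = pd.concat([minority, sampled])
    return combined.sample(frac=1.0, random_state=seed).reset_index(drop=True)


# Synthetic impressions with roughly a 4% click rate (illustrative only).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "click": (rng.random(10_000) < 0.04).astype(int),
})

balanced = random_undersample(toy, label_col="click", ratio=1.0)
print(f"click rate before: {toy['click'].mean():.3f}")
print(f"click rate after:  {balanced['click'].mean():.3f}")
```

With `ratio=1.0` the result contains equal numbers of clicked and non-clicked rows. The trade-off is that most majority-class rows are discarded, which is why the reduced dataset is cheaper to train on.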

### Core Dataset
```{figure} ../images/core_file_structure.png
---
height: 600px
width: 900px
name: core_dataset_structure
---
Core Dataset Structure
```
As shown in {ref}`core_dataset_structure`, each core dataset comprises three sections:

- **Sample Id Section**: Primary key in the range [1,n], where n is the number of records in the file.
- **Labels Section**: Two labels, a click and a conversion, following the sequential pattern described in {ref}`ali_ccp_labels`.
- **Features Section**: The features section contains three fields:
  - Common Feature Index: A foreign key that references a row in the common features file,
  - Feature Number: The number of features in the following feature list, and
  - Feature List: Several features separated by the ASCII character 0x01. Each element in the feature list is represented by a feature structure containing the following three components:
    - Feature Field Id: Corresponds to the feature field id in {ref}`ali_ccp_feature_sets`,
    - Feature Id: A global identifier for the feature, and
    - Feature Value: The real value of the feature.

    Within a feature structure, the ASCII character 0x02 separates the Feature Field Id from the Feature Id, and 0x03 separates the Feature Id from the Feature Value.

Here, we have a sample of five rows from the Core Dataset.


In [5]:
# HIDE-INPUT
# Core Dataset Sample
filepath = "data/development/raw/sample_skeleton_train.csv"
core = pd.read_csv(filepath, header=None, index_col=None)
core.head()

Unnamed: 0,0,1,2,3,4,5
0,11515523,0,0,d5f794198192a713,8,20787184361.021091122781.021090428991....
1,11060573,0,0,5ab1af84e729a269,13,21090299621.021090904571.021691506511....
2,19579112,0,0,15100e25b3982cc3,17,30193516661.021692536751.021091144811....
3,12131559,0,0,4e21ad4cf5b148d7,10,20555650711.021090633511.021090746901....
4,34008079,0,0,654765381422bb63,17,21090623511.021090762221.050893550770....
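
The feature list layout described above (0x01 between feature structures, 0x02 and 0x03 inside each structure) can be parsed with plain string splitting. The following is a minimal sketch; the names `Feature` and `parse_feature_list` are hypothetical, and the input string is made up to follow the documented separators rather than taken from the dataset.

```python
from typing import List, NamedTuple

FEATURE_SEP = "\x01"  # separates feature structures within the feature list
FIELD_SEP = "\x02"    # separates the feature field id from the feature id
VALUE_SEP = "\x03"    # separates the feature id from the feature value


class Feature(NamedTuple):
    field_id: str    # kept as a string because ids such as "109_14" are not numeric
    feature_id: str
    value: float


def parse_feature_list(raw: str) -> List[Feature]:
    """Split a raw feature list string into Feature records."""
    features = []
    for item in raw.split(FEATURE_SEP):
        if not item:
            continue  # tolerate an empty list or a trailing separator
        field_id, rest = item.split(FIELD_SEP, 1)
        feature_id, value = rest.split(VALUE_SEP, 1)
        features.append(Feature(field_id, feature_id, float(value)))
    return features


# Made-up two-feature example built with the documented separators.
raw = "205\x021234\x031.0\x01109_14\x025678\x030.69315"
parsed = parse_feature_list(raw)
for feature in parsed:
    print(feature)
```

Because the separators are non-printing control characters, they are invisible in the rendered sample above, which is why the feature list column appears as one unbroken run of digits.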



### Common Features Dataset
Each row in the common features file represents a collection of common features shared by many impressions in the core data files and is organized as follows:
```{figure} ../images/common_features_file_structure.png
---
name: common_features_file_structure
align: center
alt: Common Features Dataset Structure
---
Common Features Dataset Structure
```
The common features dataset has three components:
1. Common Features Index: Unique identifier for the common features row
2. Feature Num: Number of features in the following feature list
3. Feature List: A list of feature structures separated by ASCII character 0x01. Feature structures share the same organization as in the core dataset.

Here is a sample from a common features dataset.

In [6]:
# HIDE-INPUT
# Common Features Dataset Sample
filepath = "data/development/raw/common_features_train.csv"
# Use a distinct name so the core dataset loaded earlier is not overwritten.
common = pd.read_csv(filepath, header=None, index_col=None)
common.head()

Unnamed: 0,0,1,2
0,023a8f5b7b8a3348,1052,110_1414381141.09861110_1418460592.07944...
1,030dab7c09c9213d,748,150_1438980633.37304150_1439196042.03693...
2,05b3fd32a3e72c87,852,127_1434944021.09861127_1438180851.09861...
3,09ed88afc2780752,459,150_1439081542.19722150_1438815952.99987...
4,0b7a30a3cacee086,459,150_1439261452.6390612134386581.0122343...
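
Since the Common Feature Index in the core dataset is a foreign key into the common features file, the two can be combined with a many-to-one join. The following is a minimal sketch on toy data; the raw CSV files have no header row, so every column name below is hypothetical and chosen only for readability.

```python
import pandas as pd

# Toy stand-ins for the two files (all names and values are illustrative).
core = pd.DataFrame({
    "sample_id": [1, 2, 3],
    "click": [0, 1, 1],
    "conversion": [0, 0, 1],
    "common_feature_index": ["a1", "b2", "a1"],
    "feature_num": [1, 1, 1],
    "feature_list": ["f1", "f2", "f3"],
})
common = pd.DataFrame({
    "common_feature_index": ["a1", "b2"],
    "common_feature_num": [2, 1],
    "common_feature_list": ["c1", "c2"],
})

# Many impressions share one common-features row; `validate` raises an error
# if the index is not unique on the common-features side.
joined = core.merge(
    common,
    on="common_feature_index",
    how="left",
    validate="many_to_one",
)
print(joined[["sample_id", "common_feature_index", "common_feature_list"]])
```

A left join keeps every impression even when no matching common-features row exists, which makes missing references easy to spot as null values.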


## Source of Data
The Ali-CCP (Alibaba Click and Conversion Prediction) dataset may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408) after registering with the website. Moreover, the data have been made publicly available for download from Amazon S3 as the [ALI-CCP Training Set](https://deepcvr-data.s3.amazonaws.com/taobao_train.tar.gz) and the [ALI-CCP Test Set](https://deepcvr-data.s3.amazonaws.com/taobao_test.tar.gz).


## Section Organization
With the source dataset described, the remainder of this section is organized as follows:

 - [Section 2.2](section_22_data_acquisition) **Data Acquisition**: The Extract-Transform-Load pattern used to obtain and ingest the data is covered in detail.
 - [Section 2.3](section_23_data_exploration) **Data Exploration**: This section is an exploratory data analysis revealing feature importances, uncovering hidden insights, and discovering underlying structures in the data.
 - [Section 2.4](section_24_data_preparation) **Data Preparation**: Filtering, transforming, and encoding the data for modeling are covered in this section.
 - [Section 2.5](section_25_summary) **Summary**: Finally, we review the data acquisition, exploration, and preparation process, highlighting key insights and structures in the data.