# REMOVE-CELL
import os
import pandas as pd
home = "/home/john/projects/DeepCVR/"
os.chdir(home)

# Data
The Alibaba Ad Display/Click Dataset originates from the real-world traffic logs of the Taobao Marketplace recommender system. Headquartered in Hangzhou, Zhejiang, People's Republic of China, Taobao Marketplace facilitates consumer-to-consumer (C2C) retail for small businesses and entrepreneurs.    

{numref}`ali_display_summary`: Alibaba Ad Display / Click Summary

```{table} Dataset Summary
:name: ali_display_summary

|        Users        |              User Profiles              |                 Interactions                 |    Behaviors    |
|:-------------------:|:---------------------------------------:|:---------------------------------------:|:---------------:|
|          1,140,000  |                              1,060,000  |                             26,000,000  |    700,000,000  |
```

The advertising / click interactions summarized in {numref}`ali_display_summary` represent the ads and impressions served to approximately 1.1 million randomly selected users over the eight days beginning May 6, 2017, and ending May 13, 2017. In addition, user profiles were obtained for 1.06 million of the users to whom ads were served during the eight-day period. In total, over 26 million interactions were captured.
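As a quick check on the observation window, the following sketch counts the days between the two dates given above, treating both endpoints as inclusive. The variable names are illustrative only.

```python
from datetime import date

# Observation window as stated in the text (inclusive on both ends).
start = date(2017, 5, 6)
end = date(2017, 5, 13)

# Subtracting dates gives the gap between them; add 1 to count both endpoints.
n_days = (end - start).days + 1
print(n_days)  # 8
```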

The dataset also includes an additional 22 days of page views, favorite tagging, shopping cart activity, and product purchases. Combining this department-level data yields a user behavior log exceeding 700 million records.

## Alibaba Ad Display / Click Dataset 
The logical data model presented in {ref}`ali_display_data_model` comprises four tables containing some 23 features, as well as click, favorite, shopping cart, and conversion labels.

{ref}`ali_display_data_model`: Alibaba Ad Display / Click Data Model

```{figure} ../images/ali-display.png
:name: ali_display_data_model
:alt: Ali Display / Click Data Model

Ali Display / Click Data Model
``` 
Note: All data have been de-identified and desensitized in accordance with relevant privacy regulations. 

### Data Model Design v1.0.0
The data model diagram in {ref}`ali_display_data_model` makes explicit the relationships among the user, interaction, advertising, and behavior tables. At the center of the diagram, the interaction table records each impression of a single advertisement served to a single user. Accordingly, a one-to-many relationship exists between an advertisement and its interactions (samples). Similarly, a single user, represented by a user profile observation, may have zero, one, or many interactions. Lastly, each user in the user table has one or more behavior records.
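The zero, one, or many cardinality between users and interactions can be illustrated with a small sketch. It uses an in-memory SQLite database and invented rows; the reduced table layouts are assumptions for illustration, not the physical design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user (user_id INTEGER PRIMARY KEY);
CREATE TABLE interaction (user_id INTEGER, time_stamp INTEGER, adgroup_id INTEGER);
-- Toy data: user 1 has two interactions, user 2 has one, user 3 has none.
INSERT INTO user VALUES (1), (2), (3);
INSERT INTO interaction VALUES (1, 100, 10), (1, 101, 11), (2, 100, 10);
""")

# LEFT JOIN keeps users without interactions; COUNT ignores the NULLs they produce.
rows = conn.execute("""
SELECT u.user_id, COUNT(i.user_id) AS n_interactions
FROM user AS u
LEFT JOIN interaction AS i ON i.user_id = u.user_id
GROUP BY u.user_id
ORDER BY u.user_id
""").fetchall()
print(rows)  # [(1, 2), (2, 1), (3, 0)]
```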

### Table Descriptions
Let's briefly introduce the tables and their corresponding features. 

**interaction table**
There are approximately 26 million user/advertisement interactions which are collectively defined by:   
- **user_id** (int): A de-identified identifier for a user.   (Composite Primary Key)
- **time_stamp** (timestamp): The timestamp when the interaction occurred. (Composite Primary Key)   
- **adgroup_id** (int): A desensitized advertising unit identifier.   
- **scenario** (varchar): Definition unspecified.  
- **click** (int): Binary indicator of the occurrence of a click. 1 if yes. 0 if no.

Note: The raw presentation included a 'noclick' field with the opposite boolean value. This was deemed redundant and removed from the model. 
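Before dropping a field as redundant, it is worth confirming that it carries no independent information. The sketch below shows one way to do so on a few invented rows; the row layout is an assumption based on the field names above.

```python
# Toy rows mimicking the raw layout; the values are invented for illustration.
raw_rows = [
    {"user_id": 1, "time_stamp": 100, "adgroup_id": 10, "noclick": 1, "click": 0},
    {"user_id": 1, "time_stamp": 101, "adgroup_id": 11, "noclick": 0, "click": 1},
    {"user_id": 2, "time_stamp": 100, "adgroup_id": 10, "noclick": 1, "click": 0},
]

# 'noclick' is redundant only if it is always the binary complement of 'click'.
is_redundant = all(
    r["click"] in (0, 1) and r["noclick"] == 1 - r["click"] for r in raw_rows
)

# Drop the field only after the check passes; otherwise keep the rows unchanged.
if is_redundant:
    rows = [{k: v for k, v in r.items() if k != "noclick"} for r in raw_rows]
else:
    rows = raw_rows

print(is_redundant, sorted(rows[0]))
```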

The interaction table has a composite primary key formed by user_id and time_stamp. It also has a foreign key, adgroup_id, referencing the primary key of the same name on the ad table.
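The key configuration described above could be sketched as follows. This is a minimal illustration using an in-memory SQLite database, not the project's physical schema: the SQLite column types (for example, storing time_stamp as an integer), the CHECK constraint, and the helper `try_insert` are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE ad (
    adgroup_id  INTEGER PRIMARY KEY,
    category_id INTEGER,
    campaign_id INTEGER,
    customer_id INTEGER,
    brand       REAL,
    price       REAL
);
CREATE TABLE interaction (
    user_id    INTEGER NOT NULL,
    time_stamp INTEGER NOT NULL,
    adgroup_id INTEGER NOT NULL REFERENCES ad (adgroup_id),
    scenario   TEXT,
    click      INTEGER CHECK (click IN (0, 1)),
    PRIMARY KEY (user_id, time_stamp)
);
""")

# Invented rows for illustration.
conn.execute("INSERT INTO ad VALUES (10, 1, 1, 1, 5.0, 9.99)")
conn.execute("INSERT INTO interaction VALUES (1, 100, 10, 'a', 0)")


def try_insert(row):
    """Return True if the interaction row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO interaction VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_key_ok = try_insert((1, 100, 10, "a", 1))  # repeats (user_id, time_stamp)
missing_ad_ok = try_insert((2, 100, 999, "a", 0))    # adgroup_id 999 is not in ad
valid_ok = try_insert((2, 100, 10, "a", 1))          # new key, existing ad
print(duplicate_key_ok, missing_ad_ok, valid_ok)     # False False True
```

The duplicate row is rejected by the composite primary key, and the row pointing at an unknown adgroup_id is rejected by the foreign key.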

**ad table**
Advertising impressions are structures connecting the product, customer, and campaign. The six features are:
- **adgroup_id** (int):  A desensitized advertising unit identifier. (Primary Key)  
- **category_id** (int): A product's desensitized commodity category id.
- **campaign_id** (int): A desensitized advertising plan identifier.  
- **customer_id** (int): A desensitized customer segment identifier.  
- **brand** (float): A desensitized brand to which the product belongs.    
- **price** (float): The price for the product. Currency not specified.  

Together, an adgroup_id, a category_id, and a brand represent the concept of an 'item' for inference and analysis purposes, yet no explicit representation of an 'item' exists in the data model. Nevertheless, the conceptual 'item' is understood to have one-to-many relationships with both category_id and brand. Correspondingly, an adgroup_id represents but a single 'item'.
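One way to check the claim that an adgroup_id represents a single 'item' is to confirm that no adgroup_id maps to more than one combination of category_id and brand. The sketch below does this on invented rows; the helper `violations` is hypothetical and not part of the project code.

```python
from collections import defaultdict

# Invented ad rows, reduced to the fields that define the conceptual 'item'.
ad_rows = [
    {"adgroup_id": 10, "category_id": 1, "brand": 5.0},
    {"adgroup_id": 11, "category_id": 1, "brand": 5.0},
    {"adgroup_id": 12, "category_id": 2, "brand": 7.0},
]


def violations(rows, key, dependents):
    """Return key values that map to more than one combination of dependents."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(tuple(row[d] for d in dependents))
    return sorted(k for k, combos in seen.items() if len(combos) > 1)


clean = violations(ad_rows, "adgroup_id", ["category_id", "brand"])
print(clean)  # []

# A conflicting row for adgroup_id 10 is reported as a violation.
conflicting = ad_rows + [{"adgroup_id": 10, "category_id": 3, "brand": 5.0}]
dirty = violations(conflicting, "adgroup_id", ["category_id", "brand"])
print(dirty)  # [10]
```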

**user table**
The user table contains some 1.06 million user profiles. The nine features captured for each user are:
- **user_id** (int): A de-identified identifier for a user.   (Primary Key)
- **cms_segid** (int): A micro-group identifier.  
- **cms_group_id** (int): Unspecified.
- **gender_code** (int): 1 for male, 2 for female.
- **age_level** (int): Unspecified.
- **consumption_level** (float): 1.0: low-grade, 2.0: mid-grade, 3.0: high-grade.
- **shopping_level** (int): 1: shallow user, 2: moderate user, 3: deep user.
- **student** (int): 1 if the user is a college student, 0 if not.
- **city_level** (float): Unspecified.
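The coded fields documented above can be translated into readable labels for exploratory work. The following is a minimal sketch: the mappings come from the list above, while the function `decode_profile`, the output key names, and the example row are assumptions for illustration. Fields whose codes are unspecified are left untouched.

```python
# Code-to-label mappings taken from the feature descriptions above.
GENDER = {1: "male", 2: "female"}
CONSUMPTION = {1.0: "low-grade", 2.0: "mid-grade", 3.0: "high-grade"}
SHOPPING = {1: "shallow user", 2: "moderate user", 3: "deep user"}


def decode_profile(row):
    """Return a copy of the row with label fields added; unknown codes map to None."""
    decoded = dict(row)
    decoded["gender"] = GENDER.get(row.get("gender_code"))
    decoded["consumption"] = CONSUMPTION.get(row.get("consumption_level"))
    decoded["shopping"] = SHOPPING.get(row.get("shopping_level"))
    decoded["is_student"] = row.get("student") == 1
    return decoded


# Invented profile with a missing consumption_level.
profile = {"user_id": 1, "gender_code": 2, "consumption_level": None,
           "shopping_level": 3, "student": 0}
decoded = decode_profile(profile)
print(decoded["gender"], decoded["consumption"], decoded["shopping"], decoded["is_student"])
```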

**behavior table**
Finally, the behavior table contains over 700 million events, having five attributes:   
- **user_id** (int): A de-identified identifier for a user. (Composite Primary Key)
- **timestamp** (timestamp): The timestamp when the behavior occurred. (Composite Primary Key)
- **btag** (varchar): Tag describing one of the following four behaviors:
    1. pv: Page view
    2. fav: Like
    3. cart: Add to shopping cart
    4. buy: Purchase conversion
- **category_id** (int): A product's desensitized commodity category id.
- **brand** (float): A desensitized brand to which the product belongs.
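As a small illustration of how the btag field partitions the behavior log, the sketch below tallies a handful of invented events by behavior type. The tuple layout follows the attribute order listed above and is an assumption about how rows might be held in memory.

```python
from collections import Counter

# Behavior tags and their meanings, as documented above.
BTAG_LABELS = {
    "pv": "page view",
    "fav": "like",
    "cart": "add to shopping cart",
    "buy": "purchase conversion",
}

# Invented events: (user_id, timestamp, btag, category_id, brand).
events = [
    (1, 100, "pv", 1, 5.0),
    (1, 101, "pv", 1, 5.0),
    (1, 102, "cart", 1, 5.0),
    (2, 100, "fav", 2, 7.0),
    (1, 103, "buy", 1, 5.0),
]

# Count events per behavior tag; a Counter returns 0 for tags never seen.
counts = Counter(btag for _, _, btag, _, _ in events)
for tag, label in BTAG_LABELS.items():
    print(f"{tag:>4} ({label}): {counts[tag]}")
```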

### Data Model Design v1.1.0
We've put forward an initial database model for the Alibaba Ad Display / Click Dataset, based primarily on its current flat-file structure. We've illuminated the inter-table relationships, the key and index configuration, and the data definitions and datatypes. As we contemplate the physical design of a database of such considerable size, query performance, file I/O throughput, and efficient cache and memory utilization will shape the decisions that ensure the database supports effective data utilization, analytics, and inference. A performant database design will:

- minimize redundancy, repeating groups, and storage costs, with each attribute represented once;
- be free of unwanted functional dependencies;
- mitigate insert, update, and delete anomalies; and
- secure referential integrity.

Database design strategies abound. Those based on normalization principles best address structural, performance, and management requirements such as those listed above. Denormalized approaches, on the other hand, provide greater flexibility for analysts and business intelligence practitioners with complex or denormalized data access workloads. For this use case, however, a performance-centric, normalization-based design best aligns with our efficiency and performance objectives. Perhaps the most valuable benefit of a normalized approach is that it forces a logical and efficient organization of data that facilitates data management and governance.

Database normalization was introduced in 1970 by E.F. Codd, the English computer scientist who invented the relational model for database management {cite}`coddRelationalModelData1970`. The objectives of normalization, as defined by Codd in 1971 {cite}`persons/Codd71a`, are:

1. To permit data to be queried and manipulated using a "universal data sub-language" grounded in first-order logic {cite}`coddRelationalModelData1970`.
2. To free the collection of relations from undesirable insertion, update and deletion dependencies.
3. To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs.
4. To make the relational model more informative to users.
 
Today, there are 11 normal forms; however, a relational database is considered "normalized" if it meets the third normal form (3NF) {cite}`dateIntroductionDatabaseSystems2004`.

### Data Model Normalization
Database normalization proceeds sequentially from first normal form (1NF) through the higher normal forms. We therefore begin with 1NF: a relation is in first normal form if and only if no attribute domain has relations as elements {cite}`coddRelationalModelData1970`. Let's take another look at our database model from above.

{ref}`ali_display_data_model2`: Alibaba Ad Display / Click Data Model

```{figure} ../images/ali-display.png
:name: ali_display_data_model2
:alt: Ali Display / Click Data Model

Ali Display / Click Data Model
``` 
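To make the first normal form requirement concrete, the sketch below contrasts a hypothetical relation whose attribute holds a collection with its 1NF equivalent. The `unnormalized` rows are invented to show the violation; they are not how the dataset is delivered.

```python
# A hypothetical relation that violates 1NF: 'btags' holds a collection of values.
unnormalized = [
    {"user_id": 1, "btags": ["pv", "cart", "buy"]},
    {"user_id": 2, "btags": ["pv"]},
]

# 1NF version: every attribute holds a single atomic value,
# giving one row per (user_id, btag) pair.
first_normal_form = [
    {"user_id": row["user_id"], "btag": tag}
    for row in unnormalized
    for tag in row["btags"]
]

for row in first_normal_form:
    print(row)
```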
 





Starting at the core, each row of the sample table represents a single advertising impression served to a user, together with a timestamp, a click label, and a variable whose meaning is unspecified.


{numref}`ali_display_dataset`: Alibaba Ad Display / Click Dataset

```{table} Dataset Overview
:name: ali_display_dataset

| Table    | Description            | Filename                |   Size   |
|----------|------------------------|-------------------------|:--------:|
| sample   | Raw Training Samples   | raw_sample.csv.tar.gz   |  15.84MB |
| ad       | Ad's Basic Information | ad_feature.csv.tar.gz   | 231.38MB |
| user     | User Profile           | user_profile.csv.tar.gz |  9.51MB  |
| behavior | User Behavior Log      | behavior_log.csv.tar.gz |  5.80MB  |
```
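The files listed above are gzip-compressed tar archives. A minimal sketch of reading a CSV out of such an archive with the standard library follows. The member name `raw_sample.csv`, the column header, and the helpers `make_demo_archive` and `read_csv_from_targz` are assumptions; the demo builds a tiny archive in memory instead of touching the real files.

```python
import csv
import io
import tarfile


def make_demo_archive():
    """Build a tiny in-memory .tar.gz holding one CSV (invented content)."""
    payload = b"user_id,time_stamp,adgroup_id,scenario,click\n1,100,10,a,0\n"
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="raw_sample.csv")  # assumed member name
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_csv_from_targz(fileobj, member_name):
    """Read one CSV member of a gzip-compressed tar archive into a list of dicts."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        extracted = archive.extractfile(member_name)
        text = io.TextIOWrapper(extracted, encoding="utf-8", newline="")
        return list(csv.DictReader(text))


rows = read_csv_from_targz(make_demo_archive(), "raw_sample.csv")
print(rows)
```

For the real archives, the same function could be given a file opened in binary mode. Note that `csv.DictReader` yields every value as a string, and that materializing tens of millions of rows as a list would be memory intensive, so a streaming or chunked approach would likely be preferable at full scale.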

## Raw Sample
The raw sample contains the ad display and click-through logs for 1,140,000 users randomly selected from the Taobao website over a period of eight days. In total, the dataset comprises approximately 26 million records.