# Data Acquisition
Borrowing the state of the art from the 1970s, we'll obtain the data using the standard extract, transform, and load (ETL) design pattern. This requires a Criteo data source configuration, a series of tasks to perform the ETL operations, and a pipeline to orchestrate the process. We'll import them now and briefly introduce the modules en route.

In [1]:
from myst_nb import glue
import pandas as pd
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.2f}'.format

from cvr.utils.config import CriteoConfig 
from cvr.data.etl import Extract, TransformETL, LoadDataset
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder
from cvr.utils.config import WorkspaceConfig
from cvr.core.workspace import Workspace

## Data Source
CriteoConfig packages the URL, file structure, and local destination file path information for the Criteo data source. For illustrative purposes, the CriteoConfig on this machine is shown below.

In [2]:
config = CriteoConfig()
config.print()



                        Criteo Data Source Configuration                        
                        ________________________________                        
                          name : Criteo Sponsored Search Conversion Log Dataset
                        source : http://go.criteo.net/criteo-research-search-conversion.tar.gz
                   destination : data\external\criteo.tar.gz
              filepath_extract : Criteo_Conversion_Search/CriteoSearchData
                  filepath_raw : raw\criteo.csv
                     workspace : root
                           sep : \t
                       missing : -1


## Data Pipeline Steps
Our three workers, Extract, TransformETL, and LoadDataset, are described below.

| Step | Module       | Description                                                                     |
|------|--------------|---------------------------------------------------------------------------------|
| 1    | Extract      | Downloads the source data into a local raw data directory.                      |
| 2    | TransformETL | Transforms the raw data into a Dataset object and performs basic preprocessing. |
| 3    | LoadDataset  | Loads the Dataset object into our local workspace.                              |

Two basic preprocessing steps are taken a priori, based on the description of the data provided by Criteo Labs. First, we convert the missing-value indicator (-1) to NaN. Second, we convert non-numeric columns to the pandas category data type for computational and space efficiency. Let's instantiate our tasks.

In [3]:
extract = Extract(config=config)
transform = TransformETL(value=[-1, "-1"])
load = LoadDataset()
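
The two preprocessing steps described above can be sketched in plain pandas. This is a minimal illustration, not the TransformETL implementation: the function name `replace_missing_and_categorize` and the toy frame are assumptions made for this example, and the rule "every non-numeric column becomes a category" is a simplification of whatever the task does internally.

```python
import numpy as np
import pandas as pd


def replace_missing_and_categorize(df, missing_values=(-1, "-1")):
    """Hypothetical helper: swap missing-value indicators for NaN, then
    cast non-numeric columns to the pandas category dtype."""
    # Replace both the numeric and the string form of the indicator.
    cleaned = df.replace(list(missing_values), np.nan)
    # Cast every non-numeric column to category in a single astype call.
    non_numeric = cleaned.select_dtypes(exclude="number").columns
    return cleaned.astype({col: "category" for col in non_numeric})


# Toy frame standing in for the Criteo data (column names borrowed from it).
raw = pd.DataFrame(
    {
        "sales_amount": [12.5, -1.0, 30.0],
        "device_type": ["a1", "-1", "b2"],
    }
)
clean = replace_missing_and_categorize(raw)

# Per-column missing counts before and after the replacement.
missing_before = raw.isna().sum()
missing_after = clean.isna().sum()
```

Comparing `missing_before` with `missing_after` gives the same kind of before/after view that the transform task reports in its summary.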

## Data Pipeline Builder
Next, we construct a DataPipelineBuilder, which will produce our ETL pipeline. The pipeline, and the Dataset it produces, are given a name, in this case 'criteo_full', and a stage, here 'preprocessed'. An underscore concatenation of stage and name makes up the Dataset object's asset identifier, or AID ('id' was taken; it is a Python built-in function). Using the AID, we can store and retrieve our Dataset objects from the workspace.

In [4]:
builder = DataPipelineBuilder()
builder.create()
builder.set_name("criteo_full").set_stage("preprocessed").set_force(True).set_verbose(True) 
builder.add_task(extract)
builder.add_task(transform)
builder.add_task(load)
builder.build()
pipeline = builder.pipeline
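
For readers unfamiliar with the builder pattern, the sketch below shows the general idea: setters return the builder so calls can be chained, and `build()` hands back the finished pipeline. The classes `SketchPipeline` and `SketchPipelineBuilder` are hypothetical stand-ins written for this example and are not the cvr classes; the AID is formed stage first, then name, which is the order shown in the load summary.

```python
class SketchPipeline:
    """Hypothetical stand-in for a pipeline: runs tasks in the order added."""

    def __init__(self, name, stage, tasks):
        self.name = name
        self.stage = stage
        self.tasks = list(tasks)

    @property
    def aid(self):
        # Asset identifier: stage and name joined by an underscore.
        return f"{self.stage}_{self.name}"

    def run(self, data=None):
        # Each task receives the previous task's output.
        for task in self.tasks:
            data = task(data)
        return data


class SketchPipelineBuilder:
    """Hypothetical fluent builder: every setter returns self for chaining."""

    def __init__(self):
        self._name = None
        self._stage = None
        self._tasks = []

    def set_name(self, name):
        self._name = name
        return self

    def set_stage(self, stage):
        self._stage = stage
        return self

    def add_task(self, task):
        self._tasks.append(task)
        return self

    def build(self):
        return SketchPipeline(self._name, self._stage, self._tasks)


pipeline_sketch = (
    SketchPipelineBuilder()
    .set_name("criteo_full")
    .set_stage("preprocessed")
    .add_task(lambda _: [1, -1, 3])  # stands in for "extract"
    .add_task(lambda rows: [None if r == -1 else r for r in rows])  # "transform"
    .build()
)
result = pipeline_sketch.run()
```

Separating construction from execution keeps the pipeline itself small: it only needs to know its name, stage, and ordered tasks.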


## ETL Pipeline Execution
We've instantiated the builder, added the tasks, and built the pipeline. The dataset is about 6 GB once decompressed, making this ETL a network- and IO-intensive process estimated to complete in around 12 minutes.

In [5]:
dataset = pipeline.run()

Started criteo_full
	Downloading 1910.08 Mb in 10 Mb chunks

	Chunk #20: 10.47 percent downloaded at 4 Mbps
	Chunk #40: 20.94 percent downloaded at 4 Mbps
	Chunk #60: 31.41 percent downloaded at 5 Mbps
	Chunk #80: 41.88 percent downloaded at 4 Mbps
	Chunk #100: 52.35 percent downloaded at 4 Mbps
	Chunk #120: 62.82 percent downloaded at 4 Mbps
	Chunk #140: 73.3 percent downloaded at 4 Mbps
	Chunk #160: 83.77 percent downloaded at 4 Mbps
	Chunk #180: 94.24 percent downloaded at 4 Mbps

	Download complete! 1910.081 Mb downloaded in 193 10 Mb chunks.
	Decompression initiated.
	Decompression Complete! 6129.08 Mb Extracted.
Completed criteo_full
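
The chunked download reported in the log above could be implemented along the following lines. This is a sketch under stated assumptions rather than the Extract task's code: `copy_in_chunks` is a hypothetical helper, and it reads from any binary file-like object so that it can be demonstrated in memory without touching the network.

```python
import io


def copy_in_chunks(source, destination, total_bytes,
                   chunk_size=10 * 1024 * 1024, report_every=20):
    """Hypothetical helper: copy a binary stream in fixed-size chunks and
    return the number of chunks written."""
    chunks = 0
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read signals the end of the stream
            break
        destination.write(chunk)
        chunks += 1
        written += len(chunk)
        if chunks % report_every == 0:
            percent = 100 * written / total_bytes
            print(f"\tChunk #{chunks}: {percent:.2f} percent downloaded")
    return chunks


# In-memory demonstration: a 25-byte payload copied in 10-byte chunks.
payload = bytes(range(25))
sink = io.BytesIO()
n_chunks = copy_in_chunks(
    io.BytesIO(payload), sink, total_bytes=len(payload),
    chunk_size=10, report_every=1,
)
```

For a real download, the object returned by `urllib.request.urlopen(url)` exposes `read(n)` and could serve as `source`, with the total taken from the Content-Length header. A single `read` may return fewer bytes than requested, so chunk counts can differ from a simple size-divided-by-chunk-size estimate.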


In [6]:
pipeline.summary



                        DataPipeline criteo_full Summary                        
                        ________________________________                        
           Task                      Start                        End  Minutes   Status
0       Extract 2022-01-25 00:29:26.724909 2022-01-25 00:42:31.781926    13.08  200: OK
1  TransformETL 2022-01-25 00:42:31.786223 2022-01-25 00:42:47.944636     0.27  200: OK
2   LoadDataset 2022-01-25 00:42:47.947636 2022-01-25 00:43:11.399116     0.39  200: OK
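
A summary like the one above can be assembled by timestamping each task as it runs. The sketch below is an assumption about how such bookkeeping might look, not the DataPipeline implementation; `run_with_timing` and the toy tasks are hypothetical.

```python
from datetime import datetime


def run_with_timing(tasks, data=None):
    """Hypothetical helper: run (name, callable) pairs in order and collect
    one summary row per task."""
    rows = []
    for name, task in tasks:
        start = datetime.now()
        data = task(data)
        end = datetime.now()
        rows.append(
            {
                "Task": name,
                "Start": start,
                "End": end,
                # Elapsed wall-clock time in minutes, rounded like the summary.
                "Minutes": round((end - start).total_seconds() / 60, 2),
            }
        )
    return data, rows


toy_tasks = [
    ("Extract", lambda _: [1, 2, 3]),
    ("TransformETL", lambda rows: [r * 2 for r in rows]),
]
output, summary_rows = run_with_timing(toy_tasks)
```

Passing `summary_rows` to `pd.DataFrame` would yield a table with one row per task, similar in shape to the pipeline summary.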


Our pipeline appears to have run successfully. Let's check the task summaries.

In [7]:
xsum = extract.summary



                              Extract Task Summary                              
                   Dataset: Criteo Full / Preprocessed Stage                    
                   _________________________________________                    
                            Status Code : 200
                           Content Type : application/x-gzip
                          Last Modified : Wed, 08 Apr 2020 13:39:53 GMT
                    Content Length (Mb) : 1,910.081
                        Chunk Size (Mb) : 10
                      Chunks Downloaded : 193
                        Downloaded (Mb) : 1,910.081
                                   Mbps : 4.88
                    Size Extracted (Mb) : 6,129.08
                                  Start : 2022-01-25 00:29:26.724909
                                    End : 2022-01-25 00:42:31.781926
                               Duration : 0:13:05.057017
                                 Status : 200: OK
                            Status Dat

In [8]:
_ = glue("downloaded",xsum["Content Length (Mb)"])
_ = glue("size", xsum["Size Extracted (Mb)"])

1910.081

6129.08

From this we see that we downloaded about 1.9 GB in 193 chunks of 10 Mb each, which decompressed to slightly over 6 GB. The extract step took about 13 minutes at a reported average throughput of 4.88 Mbps. Next, we have the transform step.

In [9]:
_ = transform.summary



                           TransformETL Task Summary                            
      Missing Values Replacement: Dataset Criteo Full / Preprocessed Stage      
      ____________________________________________________________________      
                       Before     After
sale                        0         0
sales_amount                0  14262913
conversion_time_delay       0  14268293
click_ts                    0         0
n_clicks_1week              0   6744427
product_price               0         0
product_age_group           0  11760058
device_type                 0      3032
audience_id                 0  11502018
product_gender              0  11654195
product_brand               0   7241560
product_category_1          0   6142756
product_category_2          0   6151249
product_category_3          0   7342676
product_category_4          0  10494308
product_category_5          0  14595054
product_category_6          0  15717150
product_category_7          0  1599

Replacing the missing-value indicators with NaNs will simplify data processing and analysis. It does reveal, however, that nearly half of the data are missing. Notably, diversity and sparsity in observations are common challenges in marketing and customer analytics. Lastly, we have the load step.

In [10]:
_ = load.summary



                            LoadDataset Task Summary                            
                   Dataset: Criteo Full / Preprocessed Stage                    
                   _________________________________________                    
                            AID : preprocessed_criteo_full
                      Workspace : full_monte
                   Dataset Name : criteo_full
                          Stage : preprocessed
                       filepath : workspaces\full_monte\Dataset\preprocessed\full_monte_Dataset_preprocessed_criteo_full.pkl
                          Start : 2022-01-25 00:42:47.947636
                            End : 2022-01-25 00:43:11.399116
                       Duration : 0:00:23.451480
                         Status : 200: OK
                    Status Date : 2022-01-25
                    Status Time : 00:43:11


We'll note the name and stage for this dataset, which we will use to obtain the Dataset object from the workspace. Before closing this section, we'll demonstrate how an object can be stored and retrieved from a workspace.

## Workspaces
Workspaces are parameterized by some storage space, a dataset of a particular sample size, an asset manager to persist Datasets and the data itself. To inspect or change the current workspace, we instantiate a WorkspaceConfig object as follows:

In [11]:
config = WorkspaceConfig()
config.get_workspace()

'full_monte'

At last, we have a name. Each workspace has a designated sample size for the seeding dataset.

In [12]:
config.get_sample_size()

1.0

A sample size of 1.0 simply means the full dataset. To commit a dataset to the workspace, we instantiate a Workspace object using the workspace name 'full_monte' and add the dataset as follows.

In [13]:
workspace = Workspace('full_monte')
workspace.add_dataset(dataset)

'workspaces\\full_monte\\Dataset\\preprocessed\\full_monte_Dataset_preprocessed_criteo_full.pkl'

To retrieve the dataset from the workspace, we pass the name and stage to the appropriate method as follows.

In [14]:
ds2 = workspace.get_dataset(name='criteo', stage='preprocessed')
ds2.info()



                                 Dataset criteo                                 
                                 ______________                                 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   sale                   10000 non-null  category
 1   sales_amount           1612 non-null   float64 
 2   conversion_time_delay  1608 non-null   float64 
 3   click_ts               10000 non-null  float64 
 4   n_clicks_1week         4700 non-null   float64 
 5   product_price          10000 non-null  float64 
 6   product_age_group      1815 non-null   category
 7   device_type            9991 non-null   category
 8   audience_id            2809 non-null   category
 9   product_gender         1794 non-null   category
 10  product_brand          2748 non-null   category
 11  product_category_1     4786 non-nu
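
The store-and-retrieve round trip shown above can be sketched with the standard library. The `SketchWorkspace` class below is hypothetical and written only for illustration: it assumes objects are pickled under a path built from the workspace name, stage, and dataset name, mirroring the file path pattern in the load summary. Unlike the real `add_dataset`, it takes the name and stage explicitly instead of reading them from a Dataset object.

```python
import pickle
import tempfile
from pathlib import Path


class SketchWorkspace:
    """Hypothetical workspace: pickles objects under
    <root>/<workspace>/Dataset/<stage>/."""

    def __init__(self, name, root):
        self.name = name
        self.root = Path(root)

    def _filepath(self, dataset_name, stage):
        # File name pattern assumed from the load summary's filepath entry.
        filename = f"{self.name}_Dataset_{stage}_{dataset_name}.pkl"
        return self.root / self.name / "Dataset" / stage / filename

    def add_dataset(self, obj, name, stage):
        path = self._filepath(name, stage)
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        return str(path)

    def get_dataset(self, name, stage):
        with open(self._filepath(name, stage), "rb") as f:
            return pickle.load(f)


# Round trip inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    ws = SketchWorkspace("full_monte", root=tmp)
    saved_path = ws.add_dataset(
        {"rows": 3}, name="criteo_full", stage="preprocessed"
    )
    restored = ws.get_dataset(name="criteo_full", stage="preprocessed")
```

Because the path is derived entirely from the workspace name, stage, and dataset name, the same two strings used when saving are all that is needed to load the object again.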

Voilà! This closes the data acquisition portion of this series. In the next section, we get our first glimpse of the data from a profiling and data quality perspective.