# Data Acquisition
Modeling the state of the art from the 1970s, we'll obtain the data using the standard extract, transform, and load (ETL) design pattern. This will require a Criteo data source configuration, a series of tasks to perform the ETL operations, a pipeline to orchestrate the process, and a workspace into which the output Dataset object will be loaded. We'll import them now and introduce the modules en route.

In [1]:
from myst_nb import glue
import pandas as pd
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.2f}'.format

from cvr.core.workspace import Workspace, WorkspaceManager
from cvr.utils.config import CriteoConfig 
from cvr.data.etl import Extract, TransformETL, LoadDataset
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder


## Workspace
Before we set up the ETL pipeline, we need to establish a workspace for it. A Workspace object is essentially a container for Datasets, Models, and Experiments. The singleton class WorkspaceManager allows us to create and manage Workspace objects.

In [2]:
wsm = WorkspaceManager()
vesuvio = wsm.create_workspace(name="Vesuvio", description="Vesuvio Workspace", current=True)

Workspace Vesuvio has been created and is now the 'current' workspace. Datasets, Models, and Experiments will be contained within this workspace until another workspace is created and/or set current.
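
To make the singleton idea concrete, here is a minimal sketch of how a manager of this kind could behave. It is an illustration only, not the cvr implementation: the classes SketchWorkspace and SketchWorkspaceManager and all of their methods are hypothetical names invented for this example.

```python
class SketchWorkspace:
    """Hypothetical container for datasets, keyed by (name, stage)."""

    def __init__(self, name, description=""):
        self.name = name
        self.description = description
        self._datasets = {}

    def add_dataset(self, name, stage, data):
        self._datasets[(name, stage)] = data

    def get_dataset(self, name, stage):
        return self._datasets[(name, stage)]


class SketchWorkspaceManager:
    """Hypothetical singleton: every instantiation returns the same object."""

    _instance = None

    def __new__(cls):
        # Create the shared instance once, then hand it back on every call.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._workspaces = {}
            cls._instance.current = None
        return cls._instance

    def create_workspace(self, name, description="", current=True):
        workspace = SketchWorkspace(name, description)
        self._workspaces[name] = workspace
        if current:
            self.current = workspace
        return workspace


manager_a = SketchWorkspaceManager()
manager_b = SketchWorkspaceManager()
demo = manager_a.create_workspace(name="Demo", description="Demo Workspace")
print(manager_a is manager_b)    # both names refer to one manager
print(manager_b.current.name)    # the workspace created through manager_a
```

Because both variables point at the same manager, a workspace set current through one is the current workspace everywhere, which is the behavior the text above relies on.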

## Data Source
The CriteoConfig object packages the URL, file structure, and local destination file path information for the Criteo data source. For illustrative purposes, the CriteoConfig on this machine is shown below.

In [3]:
config = CriteoConfig()
config.print()



                        Criteo Data Source Configuration                        
                        ________________________________                        
                          name : Criteo Sponsored Search Conversion Log Dataset
                        source : http://go.criteo.net/criteo-research-search-conversion.tar.gz
                   destination : data\external\criteo.tar.gz
              filepath_extract : Criteo_Conversion_Search/CriteoSearchData
                  filepath_raw : raw\criteo.csv
                     workspace : root
                           sep : \t
                       missing : -1


## Data Pipeline Steps
Our pipeline consists of three steps, Extract, TransformETL, and LoadDataset, described below.

| Step | Module       | Description                                                                     |
|------|--------------|---------------------------------------------------------------------------------|
| 1    | Extract      | Downloads the source data into a local raw data directory                       |
| 2    | TransformETL | Transforms the raw data into a Dataset object and performs basic preprocessing. |
| 3    | LoadDataset  | Loads the Dataset object into our current workspace.                            |

The Extract step takes the CriteoConfig data source configuration object as input and produces our raw data. TransformETL performs two basic preprocessing steps, chosen a priori from the description of the data provided by Criteo Labs. First, we convert the missing values indicator (-1) to NaNs. Second, we convert the categorical variables to the pandas category data type for computation and space efficiency. Finally, LoadDataset loads the preprocessed Dataset object into our current workspace. The pipeline tasks are instantiated below.

In [4]:
extract = Extract(config=config, chunk_size=20)
transform = TransformETL(value=[-1, "-1"])
load = LoadDataset()
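
For readers who want to see the two preprocessing steps in plain pandas, the following is a minimal sketch on a toy frame. It assumes the transform boils down to a value replacement followed by a dtype conversion; the internals of TransformETL are not shown in this section, and the data values and the categorical_columns list here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw Criteo data; the values are invented.
raw = pd.DataFrame(
    {
        "sales_amount": [12.5, -1.0, 3.0],
        "product_brand": ["A1", "-1", "B2"],
        "device_type": ["X", "Y", "-1"],
    }
)

# Step 1: replace the missing-value indicator, in numeric and string form, with NaN.
clean = raw.replace([-1, "-1"], np.nan)

# Step 2: store the string-valued columns as the pandas category dtype.
categorical_columns = ["product_brand", "device_type"]  # assumed column list
clean[categorical_columns] = clean[categorical_columns].astype("category")

print(clean.isna().sum())   # missing values per column after replacement
print(clean.dtypes)         # dtypes after the category conversion
```

Passing both -1 and "-1" mirrors the value argument given to TransformETL above, so the indicator is caught whether a column was read as numbers or as strings.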

## Data Pipeline Builder
The pipeline is constructed via the DataPipelineBuilder, parameterized by the name and stage for the final Dataset object it produces.

In [5]:
builder = DataPipelineBuilder()
builder.set_name("criteo").set_stage("preprocessed").set_force(True).set_verbose(True) 
builder.add_task(extract)
builder.add_task(transform)
builder.add_task(load)
builder.build()
pipeline = builder.pipeline
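
The chained setter calls above are characteristic of a fluent builder. As a rough sketch of that pattern, and assuming nothing about how DataPipelineBuilder is actually written, the hypothetical classes SketchPipeline and SketchPipelineBuilder below show why each setter returns the builder itself and how the built pipeline runs its tasks in order. The three stand-in tasks are plain functions invented for this example.

```python
class SketchPipeline:
    """Hypothetical pipeline: runs its tasks in order, passing data along."""

    def __init__(self, name, stage, tasks):
        self.name = name
        self.stage = stage
        self.tasks = tasks

    def run(self, data=None):
        for task in self.tasks:
            data = task(data)
        return data


class SketchPipelineBuilder:
    """Hypothetical fluent builder: each method returns self so calls can chain."""

    def __init__(self):
        self._name = None
        self._stage = None
        self._tasks = []
        self.pipeline = None

    def set_name(self, name):
        self._name = name
        return self

    def set_stage(self, stage):
        self._stage = stage
        return self

    def add_task(self, task):
        self._tasks.append(task)
        return self

    def build(self):
        self.pipeline = SketchPipeline(self._name, self._stage, list(self._tasks))
        return self


# Three stand-in tasks written as plain functions.
def fake_extract(_):
    return [3, -1, 5]


def fake_transform(rows):
    return [None if value == -1 else value for value in rows]


def fake_load(rows):
    return {"rows": rows}


sketch_builder = SketchPipelineBuilder()
sketch_builder.set_name("demo").set_stage("preprocessed")
sketch_builder.add_task(fake_extract).add_task(fake_transform).add_task(fake_load)
result = sketch_builder.build().pipeline.run()
print(result)
```

Keeping configuration in the builder and execution in the pipeline means the pipeline object is only created once every setting and task is in place.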


## ETL Pipeline Execution
The dataset is approximately 6.5 GB, making this ETL a network- and IO-intensive process. Estimated processing time: 12 minutes, although the run recorded below took about 18 minutes at roughly 3 Mbps.

In [6]:
dataset = pipeline.run()
pipeline.summary

Started criteo
	Downloading 1910.08 Mb in 20 Mb chunks

	Chunk #10: 10.47 percent downloaded at 3 Mbps
	Chunk #20: 20.94 percent downloaded at 3 Mbps
	Chunk #30: 31.41 percent downloaded at 3 Mbps
	Chunk #40: 41.88 percent downloaded at 3 Mbps
	Chunk #50: 52.35 percent downloaded at 3 Mbps
	Chunk #60: 62.82 percent downloaded at 3 Mbps
	Chunk #70: 73.3 percent downloaded at 3 Mbps
	Chunk #80: 83.77 percent downloaded at 3 Mbps
	Chunk #90: 94.24 percent downloaded at 2 Mbps

	Download complete! 1910.081 Mb downloaded in 97 20 Mb chunks.
	Decompression initiated.
	Decompression Complete! 6129.08 Mb Extracted.
Completed criteo




                          DataPipeline criteo Summary                           
                          ___________________________                           
           Task                      Start                        End  Minutes   Status
0       Extract 2022-01-25 18:54:48.889947 2022-01-25 19:12:26.990696    17.64  200: OK
1  TransformETL 2022-01-25 19:12:26.994700 2022-01-25 19:12:43.938661     0.28  200: OK
2   LoadDataset 2022-01-25 19:12:43.942662 2022-01-25 19:13:08.104044     0.40  200: OK
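
The progress log above reflects a chunked download: the archive is read a fixed number of bytes at a time, and progress is reported every few chunks. The sketch below shows that loop under stated assumptions. It is not the Extract task's code; the function copy_in_chunks and its parameters are hypothetical, and an in-memory stream stands in for the HTTP response so the example runs without a network connection.

```python
import io
import time


def copy_in_chunks(source, destination, total_bytes, chunk_size, report_every=10):
    """Copy a binary stream in fixed-size chunks; return (chunk count, bytes copied)."""
    chunks = 0
    copied = 0
    started = time.perf_counter()
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read marks the end of the stream
            break
        destination.write(chunk)
        chunks += 1
        copied += len(chunk)
        if chunks % report_every == 0:
            elapsed = max(time.perf_counter() - started, 1e-9)
            percent = 100 * copied / total_bytes
            rate = copied / elapsed / 1_000_000
            print(f"Chunk #{chunks}: {percent:.2f} percent copied at {rate:.1f} MB/s")
    return chunks, copied


# In-memory stand-in for the remote archive: 95 bytes read 10 bytes at a time.
payload = bytes(95)
target = io.BytesIO()
n_chunks, n_bytes = copy_in_chunks(
    io.BytesIO(payload), target, len(payload), chunk_size=10, report_every=3
)
print(n_chunks, n_bytes)
```

Reading in chunks keeps memory use bounded by the chunk size rather than the size of the archive, which matters for a download of this scale.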


Our pipeline appears to have run successfully. Let's check the task summaries.

In [7]:
xsum = extract.summary



                              Extract Task Summary                              
                      Dataset: Criteo / Preprocessed Stage                      
                      ____________________________________                      
                            Status Code : 200
                           Content Type : application/x-gzip
                          Last Modified : Wed, 08 Apr 2020 13:39:53 GMT
                    Content Length (Mb) : 1,910.081
                        Chunk Size (Mb) : 20
                      Chunks Downloaded : 97
                        Downloaded (Mb) : 1,910.081
                                   Mbps : 2.899
                    Size Extracted (Mb) : 6,129.08
                                  Start : 2022-01-25 18:54:48.889947
                                    End : 2022-01-25 19:12:26.990696
                               Duration : 0:17:38.100749
                                 Status : 200: OK
                            Status Dat

In [13]:
_ = glue("etl_downloaded", xsum["Content Length (Mb)"], display=False)
_ = glue("etl_chunk_size", xsum["Chunk Size (Mb)"], display=False)
_ = glue("etl_chunks_downloaded", xsum["Chunks Downloaded"], display=False)
_ = glue("etl_speed", xsum["Mbps"], display=False)
_ = glue("etl_size", xsum["Size Extracted (Mb)"], display=False)
_ = glue("etl_duration", round(xsum["Duration"].total_seconds() / 60, 2), display=False)

From this we see that we've downloaded {glue:}`etl_downloaded` Mb in about {glue:}`etl_duration` minutes, in {glue:}`etl_chunks_downloaded` chunks of {glue:}`etl_chunk_size` Mb each, with an average throughput of {glue:}`etl_speed` Mbps. Next, we have the transform step.

In [9]:
_ = transform.summary



                           TransformETL Task Summary                            
        Missing Values Replacement: Dataset Criteo / Preprocessed Stage         
        _______________________________________________________________         
                       Before     After
sale                        0         0
sales_amount                0  14262913
conversion_time_delay       0  14268293
click_ts                    0         0
n_clicks_1week              0   6744427
product_price               0         0
product_age_group           0  11760058
device_type                 0      3032
audience_id                 0  11502018
product_gender              0  11654195
product_brand               0   7241560
product_category_1          0   6142756
product_category_2          0   6151249
product_category_3          0   7342676
product_category_4          0  10494308
product_category_5          0  14595054
product_category_6          0  15717150
product_category_7          0  1599

Replacing the missing value indicators with NaNs will simplify data processing and analysis. It does, however, reveal significant data sparsity: sales_amount and conversion_time_delay, for example, are each missing for more than 14 million observations. Notably, diversity and sparsity in observations are common challenges in marketing and customer analytics.
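
Once the indicators are NaNs, sparsity can be expressed as a share of rows rather than a raw count. The following is a minimal sketch on invented toy data; the column names echo the dataset, but the values and the resulting percentages do not come from it.

```python
import numpy as np
import pandas as pd

# Toy data; the column names echo the dataset but the values are invented.
toy = pd.DataFrame(
    {
        "sale": [0, 1, 0, 0],
        "sales_amount": [np.nan, 20.0, np.nan, np.nan],
        "n_clicks_1week": [3.0, np.nan, 1.0, 7.0],
    }
)

# isna() marks missing cells; the column mean of those booleans is the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print((100 * missing_share).round(1))
```

Sorting by the missing share puts the sparsest columns first, which is usually the order in which they need attention.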

Lastly, we have the load step.

In [10]:
_ = load.summary



                            LoadDataset Task Summary                            
                      Dataset: Criteo / Preprocessed Stage                      
                      ____________________________________                      
                             AID : preprocessed_criteo
                       Workspace : Vesuvio
                    Dataset Name : criteo
                           Stage : preprocessed
                        filepath : workspaces\Vesuvio\Dataset\preprocessed\Vesuvio_Dataset_preprocessed_criteo.pkl
                           Start : 2022-01-25 19:12:43.942662
                             End : 2022-01-25 19:13:08.104044
                        Duration : 0:00:24.161382
                          Status : 200: OK
                     Status Date : 2022-01-25
                     Status Time : 19:13:08


We'll note the name and stage for this dataset, which we will use to obtain the Dataset object from the workspace. Before closing this section, we'll demonstrate how a stored object can be retrieved from our current workspace, Vesuvio.

In [11]:
dataset = vesuvio.get_dataset(name='criteo', stage='preprocessed')
dataset.info()



                                 Dataset criteo                                 
                                 ______________                                 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15995634 entries, 0 to 15995633
Data columns (total 23 columns):
 #   Column                 Non-Null Count     Dtype   
---  ------                 --------------     -----   
 0   sale                   15995634 non-null  category
 1   sales_amount           1732721 non-null   float64 
 2   conversion_time_delay  1727341 non-null   float64 
 3   click_ts               15995634 non-null  float64 
 4   n_clicks_1week         9251207 non-null   float64 
 5   product_price          15995634 non-null  float64 
 6   product_age_group      4235576 non-null   category
 7   device_type            15992602 non-null  category
 8   audience_id            4493616 non-null   category
 9   product_gender         4341439 non-null   category
 10  product_brand          8754074 non-null   ca

Voilà! This closes the data acquisition portion of this series. In the next section, we will get our first glimpses of the data from a profiling and data quality perspective.