# Data Acquisition
Modeling the state of the art from the 1970s, we'll obtain the data using the standard extract, transform, and load (ETL) design pattern. This will require a Criteo data source configuration, a series of tasks to perform the ETL operations, and a pipeline to orchestrate the process. We'll import them now and briefly introduce the modules en route.

In [None]:
from cvr.utils.config import CriteoConfig, WorkspaceConfig
from cvr.data.etl import Extract, TransformETL, LoadDataset
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder
from cvr.core.workspace import Workspace

import pandas as pd
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.2f}'.format


In [None]:
import sys
sys.version

## Data Source
CriteoConfig packages the URL, file structure, and local destination file path information for the Criteo data source. For illustrative purposes, the CriteoConfig on this machine is shown below.

In [None]:
config = CriteoConfig()
config.print()

## Data Pipeline Steps
Our three workers, Extract, TransformETL, and LoadDataset, are described below.

| Step | Module       | Description                                                                     |
|------|--------------|---------------------------------------------------------------------------------|
| 1    | Extract      | Downloads the source data into a local raw data directory                       |
| 2    | TransformETL | Transforms the raw data into a Dataset object and performs basic preprocessing. |
| 3    | LoadDataset  | Loads the Dataset object into our local workspace.                              |

Two basic preprocessing steps are taken a priori, based on the description of the data provided by Criteo Labs. First, we convert the missing-value indicator (-1) to NaN. Second, we convert non-numeric columns to the pandas category data type for computational and space efficiency. Let's instantiate our tasks.

In [None]:
extract = Extract(config=config)
transform = TransformETL(value=[-1, "-1"])
load = LoadDataset()
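
For readers who want to see the two preprocessing steps in isolation, here is a minimal sketch in plain pandas. It is not the TransformETL implementation: the function name `sketch_preprocess`, the toy column names, and the values are all hypothetical, and the sentinel list simply mirrors the `value=[-1, "-1"]` argument above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def sketch_preprocess(df, sentinels=(-1, "-1")):
    """Hypothetical stand-in for the two preprocessing steps (not TransformETL)."""
    # Step 1: blank out every cell equal to a sentinel; masked cells become NaN.
    out = df.mask(df.isin(list(sentinels)))
    # Step 2: cast each non-numeric column to the pandas category dtype.
    for col in out.columns:
        if not is_numeric_dtype(out[col]):
            out[col] = out[col].astype("category")
    return out


# Toy data invented for illustration; these are not Criteo columns or values.
raw = pd.DataFrame(
    {
        "sale": [1, -1, 0, -1],
        "price": [9.5, -1.0, 3.0, 4.0],
        "device": ["mobile", "-1", "desktop", "mobile"],
    }
)

clean = sketch_preprocess(raw)

# Share of cells that are missing once the sentinels have been replaced.
missing_fraction = float(clean.isna().to_numpy().mean())
print(clean.dtypes)
print(f"missing fraction: {missing_fraction:.2f}")
```

Once the sentinels are NaN, the usual pandas tools (`isna`, `dropna`, `fillna`) treat them as missing without any special handling, which is the main reason for doing the replacement up front.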

## Data Pipeline Builder
Next, we construct a DataPipelineBuilder, which will produce our ETL pipeline. The pipeline, and the Dataset it produces, are given a name, in this case 'criteo', and a stage, in this case 'preprocessed'. An underscore concatenation of name and stage makes up the Dataset object's asset identifier, or AID. (The name id was already taken: it is a Python built-in function.) Using the AID, we can store and retrieve our Dataset objects from the workspace.

In [None]:
builder = DataPipelineBuilder()
builder.create()
builder.set_name("criteo").set_stage("preprocessed").set_force(True).set_verbose(True) 
builder.add_task(extract)
builder.add_task(transform)
builder.add_task(load)
builder.build()
pipeline = builder.pipeline
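
To make the builder idea and the AID convention concrete, here is a small self-contained sketch. It is an illustration only and does not reproduce the cvr classes: `SketchPipeline`, `SketchPipelineBuilder`, and the three toy tasks are hypothetical names, and the only detail taken from the text is that the AID is the name and stage joined by an underscore.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

Task = Callable[[Any], Any]


@dataclass
class SketchPipeline:
    """Hypothetical pipeline: runs its tasks in order, feeding each the previous output."""

    name: str
    stage: str
    tasks: List[Task] = field(default_factory=list)

    @property
    def aid(self) -> str:
        # Asset identifier: underscore concatenation of name and stage.
        return f"{self.name}_{self.stage}"

    def run(self, data: Any = None) -> Any:
        for task in self.tasks:
            data = task(data)
        return data


class SketchPipelineBuilder:
    """Hypothetical fluent builder; every setter returns self so calls can be chained."""

    def __init__(self) -> None:
        self._name: Optional[str] = None
        self._stage: Optional[str] = None
        self._tasks: List[Task] = []

    def set_name(self, name: str) -> "SketchPipelineBuilder":
        self._name = name
        return self

    def set_stage(self, stage: str) -> "SketchPipelineBuilder":
        self._stage = stage
        return self

    def add_task(self, task: Task) -> "SketchPipelineBuilder":
        self._tasks.append(task)
        return self

    def build(self) -> SketchPipeline:
        if self._name is None or self._stage is None:
            raise ValueError("name and stage must be set before build()")
        return SketchPipeline(self._name, self._stage, list(self._tasks))


# Toy stand-ins for extract, transform, and load.
def toy_extract(_: Any) -> List[int]:
    return [3, -1, 7]


def toy_transform(rows: List[int]) -> List[Optional[int]]:
    return [None if r == -1 else r for r in rows]


def toy_load(rows: List[Optional[int]]) -> List[Optional[int]]:
    return rows


sketch = (
    SketchPipelineBuilder()
    .set_name("criteo")
    .set_stage("preprocessed")
    .add_task(toy_extract)
    .add_task(toy_transform)
    .add_task(toy_load)
    .build()
)
result = sketch.run()
print(sketch.aid, result)
```

Separating construction from execution this way means the pipeline object is only created once all of its settings are known, so a half-configured pipeline can never be run.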


## ETL Pipeline Execution
We've instantiated the builder, added the tasks, and built the pipeline. Our dataset is slightly under 6.5 GB, making this ETL a network- and IO-intensive process estimated to complete in around 12 minutes.

In [None]:
dataset = pipeline.run()

In [None]:
pipeline.summary

Our pipeline appears to have run successfully. Let's check the task summaries.

In [None]:
xsum = extract.summary

In [None]:
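# glue is not imported anywhere above; it is assumed here to be the Jupyter Book helper from myst_nb.
from myst_nb import glue

_ = glue("downloaded", xsum["Content Length (Mb)"])
_ = glue("size", xsum["Size Extracted (Mb)"])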

From this we see that we've downloaded slightly over 6 GB in 9 minutes, in 193 chunks of 10 Mb each, with an average throughput of 4 Mbps. Next, we have the transform step.

In [None]:
_ = transform.summary

Replacing the missing value indicators with NaNs will simplify data processing and analysis. It does reveal, however, that nearly half of the data are missing. Notably, diversity and sparsity in observations are common challenges in marketing and customer analytics. Lastly, we have the load step.

In [None]:
_ = load.summary

We'll note the name and stage for this dataset, which we will use to obtain the Dataset object from the workspace. Before closing this section, we'll demonstrate how an object can be stored and retrieved from a workspace.

## Workspaces
Workspaces are parameterized by some storage space, a dataset of a particular sample size, an asset manager to persist Datasets and the data itself. To inspect or change the current workspace, we instantiate a WorkspaceConfig object as follows:

In [None]:
config = WorkspaceConfig()
config.get_workspace()

At last, we have a name. Each workspace has a designated sample size for the seeding dataset.

In [None]:
config.get_sample_size()

A sample size of 1.0 simply means the full dataset. To commit a dataset to the workspace, we instantiate a Workspace object using the workspace name 'full_month' and add the dataset to it as follows.

In [None]:
workspace = Workspace('full_month')
workspace.add_dataset(dataset)

To retrieve the dataset from the workspace, we pass the name and stage to the appropriate method as follows.

In [None]:
ds2 = workspace.get_dataset(name='criteo', stage='preprocessed')
ds2.info()
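
The store-and-retrieve round trip above can be pictured with a small stand-in. The sketch below is not how the cvr Workspace is implemented: `SketchWorkspace`, its `add` and `get` methods, and the pickle-file layout are assumptions made for illustration. The only ideas borrowed from the text are that a workspace is a named storage location and that assets are looked up by name and stage.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any


class SketchWorkspace:
    """Hypothetical workspace: one pickle file per asset, named by its AID."""

    def __init__(self, directory: Path) -> None:
        self._directory = Path(directory)
        self._directory.mkdir(parents=True, exist_ok=True)

    def _path(self, name: str, stage: str) -> Path:
        # AID = name and stage joined by an underscore.
        return self._directory / f"{name}_{stage}.pkl"

    def add(self, asset: Any, name: str, stage: str) -> None:
        with open(self._path(name, stage), "wb") as handle:
            pickle.dump(asset, handle)

    def get(self, name: str, stage: str) -> Any:
        with open(self._path(name, stage), "rb") as handle:
            return pickle.load(handle)


# Round trip in a throwaway directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    sketch_ws = SketchWorkspace(Path(tmp) / "full_month")
    sketch_ws.add({"rows": 3}, name="criteo", stage="preprocessed")
    restored = sketch_ws.get(name="criteo", stage="preprocessed")

print(restored)
```

Keying storage on the name and stage pair lets several stages of the same dataset sit side by side in one workspace without overwriting each other.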

Voilà! This closes the data acquisition portion of this series. In the next section, we get our first glimpses of the data from a profiling and data quality perspective.