# Data Acquisition
Modeling the state of the art from the 1970s, we'll obtain the data using the standard extract, transform, and load (ETL) design pattern. This will require a Criteo data source configuration, a series of tasks to perform the ETL operations, a pipeline to orchestrate the process, and a workspace into which the output Dataset object will be loaded. We'll import them now and introduce the modules en route.

In [1]:
# IMPORTS
from myst_nb import glue
import pandas as pd
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.2f}'.format

from cvr.core.dataset import DatasetRequest
from cvr.core.workspace import Workspace, WorkspaceManager
from cvr.utils.config import CriteoConfig 
from cvr.data.etl import Extract, TransformETL, LoadDataset
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder, DataPipelineRequest


ImportError: cannot import name 'Asset' from 'cvr.core.asset' (c:\Users\John\Documents\Data Science\Projects\cvr\cvr\core\asset.py)

## Workspace
Before we set up the ETL pipeline, we need to establish a workspace for it. A Workspace object is essentially a container for Datasets, Models, and Experiments. The singleton class WorkspaceManager allows us to create and manage Workspace objects.

In [None]:
# REMOVE
# wsm = WorkspaceManager()
# wsm.delete_workspace('Trieste')

In [None]:
wsm = WorkspaceManager()
wsm.set_current_workspace(name="Vesuvio")
vesuvio = wsm.get_current_workspace()

Workspace Vesuvio has been created and is now the 'current' workspace. Datasets, Models, and Experiments will be contained within this workspace until another workspace is created and/or set current.
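
For readers unfamiliar with the pattern, a singleton guarantees that every call to the constructor returns the same instance, so all parts of a project share one view of the 'current' workspace. The sketch below is illustrative only: `ToyWorkspaceManager` and its internals are hypothetical and are not the cvr implementation.

```python
class ToyWorkspaceManager:
    """Hypothetical stand-in for a singleton manager; not the cvr class."""

    _instance = None  # the single shared instance

    def __new__(cls):
        # Create the instance on the first call, then keep returning it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._workspaces = {}
            cls._instance._current = None
        return cls._instance

    def set_current_workspace(self, name):
        # Register the workspace on first use and mark it as current.
        self._workspaces.setdefault(name, {"name": name, "datasets": {}})
        self._current = name

    def get_current_workspace(self):
        return self._workspaces[self._current]


a = ToyWorkspaceManager()
a.set_current_workspace(name="Vesuvio")
b = ToyWorkspaceManager()  # same object as `a`, so it sees the same state
print(a is b, b.get_current_workspace()["name"])
```

Overriding `__new__` is only one of several ways to get singleton behavior in Python; a module-level instance would serve the same purpose.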

## Data Source
The CriteoConfig object packages the URL, file structure, and local destination file path information for the Criteo data source. For illustrative purposes, the CriteoConfig on this machine is shown below.

In [None]:
config = CriteoConfig()
config.print()

## Data Pipeline Steps
Our pipeline consists of three steps, Extract, TransformETL, and LoadDataset, described below.

| Step | Module       | Description                                                                     |
|------|--------------|---------------------------------------------------------------------------------|
| 1    | Extract      | Downloads the source data into a local raw data directory                       |
| 2    | TransformETL | Transforms the raw data into a Dataset object and performs basic preprocessing. |
| 3    | LoadDataset  | Loads the Dataset object into our current workspace.                            |
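
Large downloads like this one are typically streamed in fixed-size chunks so that the whole file never has to sit in memory at once. How cvr's Extract task does this internally is not shown here; the following is a minimal sketch of the general idea using in-memory streams, and `copy_in_chunks` is a hypothetical helper.

```python
import io


def copy_in_chunks(source, destination, chunk_size):
    """Copy a binary stream chunk by chunk; return (chunks, bytes) copied."""
    chunks = 0
    total_bytes = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        destination.write(chunk)
        chunks += 1
        total_bytes += len(chunk)
    return chunks, total_bytes


# In-memory stand-ins for the remote file and the local raw data file.
remote = io.BytesIO(b"x" * 1050)
local = io.BytesIO()
n_chunks, n_bytes = copy_in_chunks(remote, local, chunk_size=200)
print(n_chunks, n_bytes)
```

The same loop works with any file-like source, which is why counting chunks and bytes as they pass through is a cheap way to report download statistics.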

The Extract step takes the CriteoConfig data source configuration object as input and produces our raw data. TransformETL performs two basic preprocessing steps, chosen a priori from the description of the data provided by Criteo Labs. First, we convert the missing value indicator (-1) to NaN. Second, we convert the categorical variables to the pandas category data type for computational and space efficiency. Finally, LoadDataset loads the preprocessed Dataset object into our current workspace. The pipeline tasks are instantiated below.

In [None]:
extract = Extract(datasource_config=config, chunk_size=20)
transform = TransformETL(value=[-1, "-1"])
load = LoadDataset()
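
To make the two preprocessing steps concrete, here is a small pandas sketch on toy data. It is not the TransformETL implementation; the column names are made up and are not the Criteo schema.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical, not the Criteo schema.
df = pd.DataFrame(
    {
        "sale_amount": [12.5, -1.0, 3.0],
        "product_category": ["A", "-1", "B"],
    }
)

# Step 1: replace the missing value indicator with NaN. The indicator
# appears as the number -1 in numeric columns and as the string "-1"
# in text columns, so each column is handled with the matching form.
df["sale_amount"] = df["sale_amount"].replace(-1, np.nan)
df["product_category"] = df["product_category"].replace("-1", np.nan)

# Step 2: store the text column as a pandas categorical.
df["product_category"] = df["product_category"].astype("category")

print(df.isna().sum().to_dict())
print(df["product_category"].dtype)
```

Doing the replacement before the conversion keeps the indicator from becoming a category of its own.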

## Data Pipeline Request
We specify the parameters for the pipeline and its resultant Dataset via request objects.

In [None]:
dataset_request = DatasetRequest(name="criteo", 
                                 description="Criteo Preprocessed", 
                                 stage="preprocessed", 
                                 sample_size=None,
                                 workspace=vesuvio,
                                 )
pipeline_request = DataPipelineRequest(name="etl", 
                                       stage="preprocessed", 
                                       workspace=vesuvio,
                                       random_state=602, 
                                       force=True, 
                                       logging_level='info',
                                       dataset_request=dataset_request
                                       )
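
A request object is, in essence, a plain bundle of parameters that can be validated and passed around as a unit. As a rough illustration, assuming nothing about how cvr defines its request classes, the same idea can be sketched with frozen dataclasses. The `Toy*` names and the default values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyDatasetRequest:
    """Hypothetical request describing the Dataset a pipeline should produce."""

    name: str
    stage: str
    description: str = ""
    sample_size: Optional[int] = None  # assumption: None means "no sampling"


@dataclass(frozen=True)
class ToyPipelineRequest:
    """Hypothetical request bundling run settings with the dataset request."""

    name: str
    stage: str
    dataset_request: ToyDatasetRequest
    random_state: Optional[int] = None
    force: bool = False


toy_request = ToyPipelineRequest(
    name="etl",
    stage="preprocessed",
    dataset_request=ToyDatasetRequest(name="criteo", stage="preprocessed"),
    random_state=602,
    force=True,
)
print(toy_request.dataset_request.name, toy_request.force)
```

Freezing the dataclasses means a request cannot be altered after it has been handed to a builder, which keeps a run's parameters traceable.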

## Data Pipeline Builder
Now, we pass our request to the DataPipelineBuilder object, add the tasks, and call the build method. The pipeline is provided via a property on the builder.

In [None]:
builder = DataPipelineBuilder()
builder.reset()
builder.make_request(pipeline_request) 
builder.add_task(extract)
builder.add_task(transform)
builder.add_task(load)
builder.build()
pipeline = builder.pipeline
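
The calls above follow the classic builder pattern: configure step by step, call build, then collect the finished product. The toy version below shows the moving parts under simplifying assumptions (tasks are plain callables, and the request is just a name). The `ToyPipeline` and `ToyPipelineBuilder` classes are hypothetical and are not the cvr classes.

```python
class ToyPipeline:
    """Hypothetical pipeline that runs its tasks in order."""

    def __init__(self, name, tasks):
        self.name = name
        self.tasks = list(tasks)

    def run(self, data=None):
        # Feed each task's output into the next task.
        for task in self.tasks:
            data = task(data)
        return data


class ToyPipelineBuilder:
    """Hypothetical builder mirroring the reset / add_task / build flow."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear any state left over from a previous build.
        self._name = None
        self._tasks = []
        self._pipeline = None

    def make_request(self, name):
        self._name = name

    def add_task(self, task):
        self._tasks.append(task)

    def build(self):
        self._pipeline = ToyPipeline(self._name, self._tasks)

    @property
    def pipeline(self):
        return self._pipeline


toy_builder = ToyPipelineBuilder()
toy_builder.make_request("etl")
toy_builder.add_task(lambda _: [3, -1, 7])  # stands in for "extract"
toy_builder.add_task(lambda rows: [None if r == -1 else r for r in rows])  # "transform"
toy_builder.add_task(lambda rows: {"criteo": rows})  # "load"
toy_builder.build()
result = toy_builder.pipeline.run()
print(result)
```

Calling `reset` first, as the notebook does, guards against tasks from an earlier build leaking into a new pipeline.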


## ETL Pipeline Execution
The dataset is approximately 6.5 GB, making this ETL a network- and IO-intensive process. Estimated processing time: 12 minutes.

In [None]:
dataset = pipeline.run()
pipeline.summary

Our pipeline appears to have run successfully. Let's check the task summaries.

In [None]:
xsum = extract.summary

In [None]:
# GLUE
_ = glue("etl_downloaded", xsum.get("Content Length (Mb)", 1910.081), display=False)
_ = glue("etl_chunk_size", xsum.get("Chunk Size (Mb)", 20), display=False)
_ = glue("etl_chunks_downloaded", xsum.get("Chunks Downloaded", 97), display=False)
_ = glue("etl_speed", xsum.get("Mbps", 5.102), display=False)
_ = glue("etl_size", xsum.get("Size Extracted (Mb)", 6129.08), display=False)
# Report the duration in minutes, falling back to a default if it is missing.
if "Duration" in xsum:
    _ = glue("etl_duration", round(xsum["Duration"].total_seconds() / 60, 2), display=False)
else:
    _ = glue("etl_duration", 17.38, display=False)

From this we see that we've downloaded {glue:}`etl_downloaded` Mb in about {glue:}`etl_duration` minutes, in {glue:}`etl_chunks_downloaded` chunks of {glue:}`etl_chunk_size` Mb each, with an average throughput of {glue:}`etl_speed` Mbps. Next, we have the transform step.

In [None]:
_ = transform.summary

Replacing the missing value indicators with NaNs will simplify data processing and analysis. It does, however, reveal significant data sparsity. Notably, diversity and sparsity in observations are common challenges in marketing and customer analytics.

Lastly, we have the load step.

In [None]:
_ = load.summary

We'll note the name and stage for this dataset, which we will use to obtain the Dataset object from the workspace. Before closing this section, we'll demonstrate how an object can be stored and retrieved from our current workspace, 'vesuvio'.

In [None]:
dataset = vesuvio.get_dataset(name='criteo', stage='preprocessed')
dataset.info()

Voilà! This closes the data acquisition portion of this series. In the next section, we will get our first glimpses of the data from a profiling and data quality perspective.