# Data Acquisition
In keeping with the state of the art circa the 1970s, we'll obtain the Criteo Sponsored Search Conversion Log Dataset using the extract, transform, and load (ETL) pattern. This requires a Criteo data source configuration, a series of tasks to perform the ETL operations, and a pipeline to orchestrate the process. We'll import them now and briefly introduce the modules en route.

In [1]:
from cvr.utils.config import CriteoConfig
from cvr.data.etl import Extract, TransformETL, LoadDataset
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder
import platform

## Data Source
CriteoConfig packages the URL, file structure, and local destination file path information for the Criteo data source. For illustrative purposes, the CriteoConfig on this machine is shown below.

In [2]:
config = CriteoConfig()
config.print()



                        Criteo Data Source Configuration                        
                        ________________________________                        
                          name : Criteo Sponsored Search Conversion Log Dataset
                        source : http://go.criteo.net/criteo-research-search-conversion.tar.gz
                   destination : data\external\criteo.tar.gz
              filepath_extract : Criteo_Conversion_Search/CriteoSearchData
                  filepath_raw : raw\criteo.csv
                     workspace : root
                           sep : \t
                       missing : -1
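
To make the shape of such a configuration concrete, here is a minimal sketch using a frozen dataclass. `SourceConfig` is a hypothetical stand-in and not the actual `CriteoConfig` implementation; only the field names and values are taken from the printout above, and the defaults are assumptions.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical stand-in for CriteoConfig; fields mirror the printout above."""

    name: str
    source: str
    destination: str
    filepath_extract: str
    filepath_raw: str
    workspace: str = "root"  # assumed default
    sep: str = "\t"          # the raw file is tab separated
    missing: int = -1        # sentinel that marks a missing value


# Raw strings keep the Windows-style backslashes shown in the printout.
config_sketch = SourceConfig(
    name="Criteo Sponsored Search Conversion Log Dataset",
    source="http://go.criteo.net/criteo-research-search-conversion.tar.gz",
    destination=r"data\external\criteo.tar.gz",
    filepath_extract="Criteo_Conversion_Search/CriteoSearchData",
    filepath_raw=r"raw\criteo.csv",
)

# Print each setting in a right-aligned "key : value" layout.
for key, value in asdict(config_sketch).items():
    print(f"{key:>20} : {value!r}")
```

Freezing the dataclass keeps the settings read-only once created, which suits a configuration shared by several pipeline tasks.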


## Data Pipeline Steps
Our three-step data pipeline involves the following: 

| Step | Module       | Description                                                                     |
|------|--------------|---------------------------------------------------------------------------------|
| 1    | Extract      | Downloads the source data into a local raw data directory.                      |
| 2    | TransformETL | Transforms the raw data into a Dataset object and performs basic preprocessing. |
| 3    | LoadDataset  | Loads the Dataset object into our local workspace.                              |

Two basic preprocessing steps are taken a priori, based upon the description of the data provided by Criteo Labs. First, we convert the missing value indicator (-1) to NaN. Second, we convert non-numeric columns to the pandas category data type to save memory and speed up computation. The tasks are instantiated below.

In [3]:
extract = Extract(config=config)
transform = TransformETL(value=[-1, "-1"])
load = LoadDataset()
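
For readers who want to see the two preprocessing steps outside the pipeline, the following is a minimal sketch in plain pandas. It is not the `TransformETL` implementation. The three-row sample and its column names (`sale`, `price`, `country`) are made up for illustration; only the tab separator and the `-1` missing value indicator come from the configuration above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row sample in tab-separated form; the real file is far wider.
raw = "sale\tprice\tcountry\n1\t12.5\tUS\n0\t-1\tFR\n-1\t3.0\t-1\n"
df = pd.read_csv(io.StringIO(raw), sep="\t")

# Step 1: treat both the numeric and the string form of -1 as missing.
df = df.replace([-1, "-1"], np.nan)

# Step 2: store every non-numeric column as a pandas categorical.
for column in df.select_dtypes(exclude="number").columns:
    df[column] = df[column].astype("category")

print(df.dtypes)
print(df.isna().sum())
```

Passing both `-1` and `"-1"` matters because the indicator can appear as a number in numeric columns and as text in string columns.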

## Data Pipeline Builder
Next, we'll construct the pipeline and place the resultant data in the staging area. 

In [4]:
builder = DataPipelineBuilder()
builder.create()
(
    builder.set_name("vesuvio")
    .set_stage("staging")
    .set_force(True)
    .set_keep_interim(True)
    .set_verbose(True)
)
builder.add_task(extract)
builder.add_task(transform)
builder.add_task(load)
builder.build()
pipeline = builder.pipeline
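
The chained setters above follow the fluent builder pattern: each setter returns the builder itself, so calls can be strung together. The sketch below illustrates that idea with hypothetical classes (`SketchPipeline`, `SketchPipelineBuilder`) and toy tasks. It is an assumption about the general pattern, not the `cvr` source code.

```python
class SketchPipeline:
    """Hypothetical pipeline that feeds each task's output to the next task."""

    def __init__(self, name, stage, tasks):
        self.name = name
        self.stage = stage
        self.tasks = tasks

    def run(self, data=None):
        for task in self.tasks:
            data = task(data)
        return data


class SketchPipelineBuilder:
    """Hypothetical fluent builder; every setter returns self to allow chaining."""

    def __init__(self):
        self.create()

    def create(self):
        # Reset the builder so it can be reused for another pipeline.
        self._name = None
        self._stage = None
        self._tasks = []
        self.pipeline = None
        return self

    def set_name(self, name):
        self._name = name
        return self

    def set_stage(self, stage):
        self._stage = stage
        return self

    def add_task(self, task):
        self._tasks.append(task)
        return self

    def build(self):
        self.pipeline = SketchPipeline(self._name, self._stage, list(self._tasks))
        return self


# Toy extract, transform, and load steps written as plain functions.
sketch_builder = SketchPipelineBuilder()
sketch_builder.set_name("demo").set_stage("staging")
sketch_builder.add_task(lambda _: [3, -1, 5])
sketch_builder.add_task(lambda rows: [None if r == -1 else r for r in rows])
sketch_builder.add_task(lambda rows: {"staging": rows})
result = sketch_builder.build().pipeline.run()
print(result)
```

Separating construction from execution lets the same task objects be reused in differently configured pipelines.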

## Data Pipeline Execution
The data are about 1.9 GB compressed and about 6.1 GB expanded; hence, downloading and extracting the data are network- and disk-intensive operations. Expect the run to take several minutes; the one logged below took about eight and a half.
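
Large downloads like this one are usually streamed in fixed-size chunks so that the whole archive never has to sit in memory. The sketch below shows the chunking loop on in-memory streams. `copy_in_chunks` is a hypothetical helper, not the `Extract` implementation; with a real HTTP client the source would be a streaming response rather than a `BytesIO` object.

```python
import io


def copy_in_chunks(src, dst, chunk_size=10 * 1024 * 1024):
    """Copy src to dst one chunk at a time; return (bytes_copied, chunk_count)."""
    total = 0
    chunks = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals the end of the stream
            break
        dst.write(chunk)
        total += len(chunk)
        chunks += 1
    return total, chunks


# Tiny demonstration: 25 bytes copied with a 10-byte chunk size.
src = io.BytesIO(b"x" * 25)
dst = io.BytesIO()
total, chunks = copy_in_chunks(src, dst, chunk_size=10)
print(total, chunks)  # 25 bytes arrive in 3 chunks (10 + 10 + 5)
```

Only the final chunk may be shorter than `chunk_size`, so the chunk count is the total size divided by the chunk size, rounded up.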

In [5]:
pipeline.run()
pipeline.summary

Started vesuvio
	Downloading 1910.08 Mb in 10 Mb chunks

100%|██████████| 2.00G/2.00G [06:56<00:00, 4.81MiB/s]
	1910.081 Mb downloaded in 193 10 Mb chunks. Download complete!
	Decompression initiated.
	Sampling dataset initiated.
	Sampling Complete! 10000.0 Rows Sampled.
	Decompression Complete! 6129.08 Mb Extracted.
Extract Complete. Status: 200: OK
TransformETL Complete. Status: 200: OK
LoadDataset Complete. Status: 200: OK
Completed vesuvio. Duration 0:08:25.674205




                          DataPipeline vesuvio Summary                          
                                  Extract Step                                  
                          ____________________________                          
                            Status Code : 200
                           Content Type : application/x-gzip
                          Last Modified : Wed, 08 Apr 2020 13:39:53 GMT
                    Content Length (Mb) : 1,910.081
                        Chunk Size (Mb) : 10
                      Chunks Downloaded : 193
                        Downloaded (Mb) : 1,910.081
                         File Size (Mb) : 1,910.081
                                   Mbps : 4.572
                    Size Extracted (Mb) : 6,129.08
           Sampled Dataset Observations : 10,000.0
                   Sampled Dataset Size : 3.48
                                  Start : 2022-01-24 01:26:00.489686
                                    End : 2022-01-24 01:34:26.0