# Data Acquisition
The Criteo Sponsored Search Conversion Log Dataset contains 90 days of live click and conversion traffic, with twenty-three product features for more than 16 million observations. This preliminary segment downloads the data from the Criteo Labs website and stages it for analysis and downstream processing.

In [2]:
# IMPORTS
from myst_nb import glue
import pandas as pd
from datetime import datetime
pd.set_option('display.width', 1000)
pd.options.display.float_format = '{:,.2f}'.format

from cvr.core.asset import AssetPassport
from cvr.core.pipeline import DataPipelineBuilder, PipelineConfig
from cvr.core.atelier import AtelierFabrique
from cvr.utils.config import CriteoConfig 
from cvr.data import criteo_columns, criteo_dtypes
from cvr.core.dataset import Dataset, DatasetFactory, DatasetWriter, Download, Extract, DatasetReader

## Studio
Studios provide self-contained environments and persistent storage to support experimentation. Let's instantiate a space for the raw data.

In [3]:
name = 'incept'
description = 'là où tout commence'  # French: "where it all begins"
factory = AtelierFabrique()
studio = factory.create(name=name, description=description, logging_level='info')

## Datasource
A configuration object contains the data source URL, file structure and local storage paths.

In [4]:
source = CriteoConfig()
source.print()



                        Criteo Data Source Configuration                        
                        ________________________________                        
                         name : Criteo Sponsored Search Conversion Log Dataset
                          url : http://go.criteo.net/criteo-research-search-conversion.tar.gz
            download_filepath : data/external/criteo.tar.gz
             extract_filepath : Criteo_Conversion_Search/CriteoSearchData
                  destination : data/raw/criteo.csv
                          sep : \t
                      missing : -1
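
As a rough illustration of what such a configuration object holds, the sketch below models the printed fields as a frozen dataclass. `SourceConfig` and `as_lines` are hypothetical names, not the actual `CriteoConfig` API; only the field values are taken from the printout above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical stand-in for CriteoConfig; values copied from the printout."""

    name: str = "Criteo Sponsored Search Conversion Log Dataset"
    url: str = "http://go.criteo.net/criteo-research-search-conversion.tar.gz"
    download_filepath: str = "data/external/criteo.tar.gz"
    extract_filepath: str = "Criteo_Conversion_Search/CriteoSearchData"
    destination: str = "data/raw/criteo.csv"
    sep: str = "\t"
    missing: int = -1

    def as_lines(self) -> List[str]:
        """Render every field as a right-aligned 'key : value' line."""
        width = max(len(f.name) for f in fields(self))
        return [f"{f.name:>{width}} : {getattr(self, f.name)!r}" for f in fields(self)]


source_sketch = SourceConfig()
print("\n".join(source_sketch.as_lines()))
```

Freezing the dataclass keeps the source definition read-only, so tasks that share it cannot silently change a path or separator.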


## Data Pipeline
A small data pipeline will download the data, decompress it, transform it into a Dataset object for profiling, and store the object in our studio. We'll set up the tasks, configure the pipeline, and then launch it.

#### Download
The download task takes roughly 3 minutes on a 500 Mbps line.

In [5]:
passport = {'aid': studio.next_aid, 
            'asset_type': 'task',
            'name': 'download', 
            'description': 'Download Criteo Data from Criteo Labs', 
            'stage': 'raw'}
download = Download(passport=passport,
                    source=source.url,
                    destination=source.download_filepath)
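
The internals of the `Download` task are not shown here. As a minimal sketch of what such a step could do, assuming a plain streamed HTTP fetch with the standard library, the hypothetical `download_file` helper below copies a URL to disk in chunks so a large archive never has to fit in memory. The demonstration uses a local `file://` URL so that it runs offline.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url: str, destination: str, chunk_size: int = 1 << 20) -> Path:
    """Stream `url` to `destination` in chunks and return the destination path."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return target


# Demonstrate against a local file:// URL so the example needs no network access.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.bin"
    src.write_bytes(b"criteo" * 1000)
    saved = download_file(src.as_uri(), str(Path(tmp) / "external" / "copy.bin"))
    downloaded_size = saved.stat().st_size

print(downloaded_size)
```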

#### Extract
Extract the contents of the gzipped tar archive.

In [6]:
passport = {'aid': studio.next_aid,
            'name': 'extract',
            'description': 'Extract Criteo Data from the Archive',
            'stage': 'raw'}
extract = Extract(passport=passport,
                  download=source.download_filepath,
                  extract=source.extract_filepath,
                  destination=source.destination,
                  colnames=criteo_columns,
                  dtypes=criteo_dtypes)
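
How `Extract` works internally is not shown, so the following is only a sketch under stated assumptions: it pulls one member out of a `.tar.gz` archive with `tarfile`, then reads the tab-separated file with `pandas.read_csv`, treating `-1` as missing. The helper names (`extract_member`, `read_raw`), the toy archive, and the three column names are all hypothetical. Passing `header=None` together with `names` keeps the first record as data rather than promoting it to column labels.

```python
import io
import shutil
import tarfile
import tempfile
from pathlib import Path

import pandas as pd


def extract_member(archive: str, member: str, destination: str) -> Path:
    """Copy a single member of a .tar.gz archive to `destination`."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        source = tar.extractfile(member)
        if source is None:
            raise ValueError(f"{member} is not a regular file in {archive}")
        with source, open(target, "wb") as out:
            shutil.copyfileobj(source, out)
    return target


def read_raw(path, colnames, sep="\t", missing="-1"):
    """Read a headerless delimited file, mapping the missing marker to NaN."""
    return pd.read_csv(path, sep=sep, header=None, names=colnames, na_values=[missing])


# Build a two-row toy archive so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    payload = b"0\t-1\tA1\n1\t9.5\t-1\n"
    archive = Path(tmp) / "toy.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="Toy/ToyData")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    csv_path = extract_member(str(archive), "Toy/ToyData", str(Path(tmp) / "raw" / "toy.csv"))
    toy = read_raw(csv_path, colnames=["sale", "sales_amount", "product_id"])

print(toy)
```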

#### Create Dataset Object
The DatasetFactory task wraps the extracted data in a Dataset object identified by the passport below.

In [7]:
passport = AssetPassport(
    aid=studio.next_aid,    
    asset_type='dataset',
    name='criteo',
    description='Create Dataset object from Criteo Labs Conversion Logs Dataset (Raw)',
    stage='raw'    
)
dataset_factory = DatasetFactory(passport=passport)

#### Load Dataset Object
Note the name, the stage, and the asset type (dataset) for future reference. We can use these labels, or the asset identifier, to retrieve the data from the studio object.

In [8]:
dataset_writer = DatasetWriter(
    passport=passport
)

#### DataPipeline Config
Let's configure the pipeline with logging for progress monitoring.

In [9]:
config = PipelineConfig(
    logger=studio.logger,               # Logging object
    verbose=True,                       # Print messages
    force=False,                        # If step already completed, don't force it.
    progress=False,                     # No progress bar
    dataset_repo=studio.assets,         # Dataset repository
    directory=studio.assets_directory,  # Assets directory for the studio
)
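
The `force` flag suggests that completed steps are skipped on a rerun. How the pipeline actually detects completion is not shown. The sketch below assumes the simplest rule, that a step counts as done when its output file already exists. `run_step` and `write_payload` are hypothetical helpers.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step(output: Path, action: Callable[[Path], object], force: bool = False) -> bool:
    """Run `action` unless `output` already exists; return True if it ran."""
    if output.exists() and not force:
        return False  # step already completed, nothing to do
    output.parent.mkdir(parents=True, exist_ok=True)
    action(output)
    return True


def write_payload(path: Path) -> None:
    """Stand-in for an expensive step such as a download."""
    path.write_text("payload")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "raw" / "data.txt"
    ran = [
        run_step(out, write_payload),              # first run does the work
        run_step(out, write_payload),              # second run is skipped
        run_step(out, write_payload, force=True),  # forced run repeats it
    ]

print(ran)  # [True, False, True]
```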


Lastly, the builder will create the data pipeline and we are a 'go' for ignition.

In [10]:
builder = DataPipelineBuilder()

pipeline = (
    builder.set_config(config)
    .set_passport(passport)
    .add_task(download)
    .add_task(extract)
    .add_task(dataset_factory)
    .add_task(dataset_writer)
    .build()
    .data_pipeline
)

pipeline.run()

	Started Extract run
	Completed Extract run
	Started DatasetFactory run
	Completed DatasetFactory run
	Started DatasetWriter run
	Completed DatasetWriter run
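
The chained calls above follow the fluent builder pattern: each setter returns the builder itself, so configuration reads as a single expression. The sketch below is a toy illustration of that pattern, not the `DataPipelineBuilder` implementation. `Task`, `Pipeline`, and `PipelineBuilder` are hypothetical, and here `build()` returns the pipeline directly rather than exposing it through a `data_pipeline` attribute.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    name: str
    action: Callable[[object], object]  # receives the previous task's result


@dataclass
class Pipeline:
    tasks: List[Task]
    verbose: bool = True
    log: List[str] = field(default_factory=list)

    def run(self, data=None):
        """Run tasks in order, feeding each result into the next task."""
        for task in self.tasks:
            self._emit(f"Started {task.name} run")
            data = task.action(data)
            self._emit(f"Completed {task.name} run")
        return data

    def _emit(self, message: str) -> None:
        self.log.append(message)
        if self.verbose:
            print("\t" + message)


class PipelineBuilder:
    def __init__(self):
        self._tasks: List[Task] = []
        self._verbose = True

    def set_verbose(self, verbose: bool) -> "PipelineBuilder":
        self._verbose = verbose
        return self  # returning self is what makes chaining possible

    def add_task(self, task: Task) -> "PipelineBuilder":
        self._tasks.append(task)
        return self

    def build(self) -> Pipeline:
        return Pipeline(tasks=list(self._tasks), verbose=self._verbose)


toy_pipeline = (
    PipelineBuilder()
    .set_verbose(False)
    .add_task(Task("Extract", lambda _: [3, 1, 2]))
    .add_task(Task("Sort", sorted))
    .build()
)
result = toy_pipeline.run()
print(result)
```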


Voilà! Our dataset has been secured. Just to confirm, let's obtain it from the studio.

In [11]:
dataset = studio.get_asset(name='criteo', asset_type='dataset', stage='raw')

In [12]:
dataset.head()



                                     criteo                                     
                                  First 5 Rows                                  
                                  ____________                                  
   0    -1    -1.1  1598891820  -1.2   0.0                              -1.3  7E56C27BFF0305E788DA55A029EC4988                              -1.4                              -1.5  ...                              -1.9                             -1.10                             -1.11 -1.12 -1.13  57A1D462A03BD076E029CF9310C11FC5  A66DB02AC1726A8D79C518B7F7AB79F0                                              -1.14  E3DDEB04F8AFF944B11943BB57D2F620  493CFB4A87C50804C94C0CF76ABD19CD
0  0 -1.00      -1  1598925284  0.00  0.00  4C90FD52FC53D2C1C205844CB69575AB  D7D1FB49049702BF6338894757E0D959                                -1  1B491180398E2F0390E6A588B3BCE291  ...                                -1                                -1                   

In [13]:
dataset.passport.print()



                                 Asset Passport                                 
      Create Dataset Object From Criteo Labs Conversion Logs Dataset (Raw)      
      ____________________________________________________________________      
                              aid : 0003
                       asset_type : dataset
                             name : criteo
                      description : Create Dataset object from Criteo Labs Conversion Logs Dataset (Raw)
                            stage : raw
                          version : 1
                         filepath : ateliers\incept\raw_dataset_criteo_v001.pkl


We can obtain this dataset from the studio 'incept' by its asset id (aid) '0003', or by its name, asset_type, and stage. Data acquisition: complete.
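
To make the two lookup routes concrete, here is a minimal in-memory sketch of an asset registry. It assumes that asset ids are zero-padded sequence numbers and that filenames follow the `stage_assettype_name_vNNN.pkl` pattern seen in the passport printout. `Passport` and `AssetRegistry` are hypothetical classes, not the studio's real repository.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Passport:
    aid: str
    asset_type: str
    name: str
    stage: str
    version: int = 1

    @property
    def filename(self) -> str:
        # Assumed naming pattern, inferred from the printed filepath.
        return f"{self.stage}_{self.asset_type}_{self.name}_v{self.version:03d}.pkl"


class AssetRegistry:
    def __init__(self):
        self._by_aid: Dict[str, Passport] = {}
        self._counter = 0

    @property
    def next_aid(self) -> str:
        """Hand out the next zero-padded asset id (each access advances the counter)."""
        self._counter += 1
        return f"{self._counter:04d}"

    def add(self, passport: Passport) -> None:
        self._by_aid[passport.aid] = passport

    def get(self, aid=None, name=None, asset_type=None, stage=None) -> Optional[Passport]:
        """Look up by aid if given, otherwise by (name, asset_type, stage)."""
        if aid is not None:
            return self._by_aid.get(aid)
        for passport in self._by_aid.values():
            if (passport.name, passport.asset_type, passport.stage) == (name, asset_type, stage):
                return passport
        return None


registry = AssetRegistry()
for asset_type, name in [("task", "download"), ("task", "extract"), ("dataset", "criteo")]:
    registry.add(Passport(aid=registry.next_aid, asset_type=asset_type, name=name, stage="raw"))

by_aid = registry.get(aid="0003")
by_labels = registry.get(name="criteo", asset_type="dataset", stage="raw")
print(by_aid.filename)
```

Both lookups return the same passport, which is why either the aid or the label triple is enough to find the stored dataset.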