# Data Acquisition
Extract, transform, load (ETL), popularized in the 1970s, is a design pattern with three stages: continuously extracting data from multiple heterogeneous sources, enforcing strict data quality and consistency standards, and loading the transformed, normalized, and sanitized data into enterprise-level data warehouses or data lakes for analysis and business intelligence. Though this project involves none of that scale or complexity, we’ll adopt ETL as our organizing framework for the data acquisition phase.

## Extract Transform Load (ETL) Overview
To implement the ETL process, we'll need the data source configuration, a pipeline to orchestrate the process, and the tasks to perform the operations on the data. Let's import these assets here. 

In [1]:
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder, PipelineCommand
from cvr.data.etl import Download, ExtractRawData, ConvertDtypes, SetNA, LoadData
from cvr.utils.config import CriteoConfig 

### Data Source Configuration
The Criteo configuration object, CriteoConfig, holds the data source parameters, such as the download URL, the local file paths, and the location of the data file within the archive. It supplies the parameters for each of the tasks described below.

### ETL Pipeline
The ETL process is orchestrated by the DataPipeline, constructed by a DataPipelineBuilder, and parameterized by the PipelineCommand object.

| # | Class               | Description                                               |
|---|---------------------|-----------------------------------------------------------|
| 1 | DataPipeline        | Executes the series of extract, transform and load tasks. |
| 2 | DataPipelineBuilder | Constructs the DataPipeline object.                       |
| 3 | PipelineCommand     | Encapsulates the parameters of the DataPipeline.          |
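
To make the relationship between these three classes concrete, here is a minimal sketch of the builder pattern they follow. The classes SketchCommand, SketchPipeline, and SketchPipelineBuilder, and the convention that a task is a plain callable, are hypothetical stand-ins for illustration. They are not the actual cvr implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

# Hypothetical stand-ins for illustration only; not the cvr classes.
Task = Callable[[Any], Any]


@dataclass
class SketchCommand:
    """Holds the parameters that configure a pipeline run."""
    name: str
    force: bool = False
    verbose: bool = True


@dataclass
class SketchPipeline:
    """Runs its tasks in order, passing each result to the next task."""
    command: SketchCommand
    tasks: List[Task] = field(default_factory=list)

    def run(self, data: Any = None) -> Any:
        for task in self.tasks:
            if self.command.verbose:
                print(f"Started Task: {getattr(task, '__name__', type(task).__name__)}")
            data = task(data)
        return data


class SketchPipelineBuilder:
    """Assembles a pipeline step by step, then exposes it as `pipeline`."""

    def __init__(self) -> None:
        self._command: Optional[SketchCommand] = None
        self._tasks: List[Task] = []
        self.pipeline: Optional[SketchPipeline] = None

    def create(self, command: SketchCommand) -> None:
        # Start a fresh build with the given run parameters.
        self._command = command
        self._tasks = []

    def add_task(self, task: Task) -> None:
        self._tasks.append(task)

    def build(self) -> None:
        if self._command is None:
            raise ValueError("call create() before build()")
        self.pipeline = SketchPipeline(self._command, list(self._tasks))


def double(x):
    return x * 2


def increment(x):
    return x + 1


builder = SketchPipelineBuilder()
builder.create(SketchCommand(name="demo", verbose=False))
builder.add_task(double)
builder.add_task(increment)
builder.build()
result = builder.pipeline.run(5)  # double first, then increment
```

Separating the builder from the pipeline keeps task ordering and configuration in one place, so the pipeline object itself only needs to know how to run.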

### ETL Pipeline Tasks
Finally, we have the tasks that perform the operations.

| Step | Phase     | Class          | Description                                                     |
|------|-----------|----------------|-----------------------------------------------------------------|
| 1    | Extract   | Download       | Downloads the source data into a local directory.               |
| 2    | Extract   | ExtractRawData | Extracts the raw data from the gzipped tar archive.             |
| 3    | Transform | ConvertDtypes  | Converts the target and object variables to category data type. |
| 4    | Transform | SetNA          | Changes the missing value indicator, '-1', to NaNs.             |
| 5    | Load      | LoadData       | Loads the preprocessed data into a staging directory.           |
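
As an illustration of the second extract step, the sketch below copies a single member out of a gzipped tar archive using Python's standard tarfile module. The function name extract_member and its signature are assumptions made for this example, and the actual ExtractRawData class may work differently.

```python
import shutil
import tarfile
from pathlib import Path


def extract_member(archive: str, member: str, destination: str) -> int:
    """Copy one member of a .tar.gz archive to `destination`; return its size in bytes.

    Hypothetical helper for illustration; not the cvr ExtractRawData class.
    """
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        source = tar.extractfile(member)  # raises KeyError if the member is absent
        if source is None:
            raise ValueError(f"{member} is not a regular file in {archive}")
        with source, open(dest, "wb") as target:
            # Stream the member in chunks rather than reading it all into memory.
            shutil.copyfileobj(source, target)
    return dest.stat().st_size
```

Streaming the copy matters for a file of this size, since the expanded data would not comfortably fit in memory on many machines.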

## Data Source Configuration
We start with the Criteo configuration object, CriteoConfig. It contains the data source parameters, such as the download URL, the local file paths, and the path of the data file within the gzipped tar archive. It exposes the parameters for each of the tasks described below. The configuration on this machine is shown for illustrative purposes.

In [2]:
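# Path to the test configuration file. Both backslashes are escaped so that
# "\c" is not treated as an (invalid) escape sequence. Note that this variable
# is not passed to CriteoConfig() below.
config_filepath = "tests\\test_config\\criteo.yaml"
config = CriteoConfig()
config.print()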



                        Criteo Data Source Configuration                        
                        ________________________________                        
                           name : Criteo Sponsored Search Conversion Log Dataset
                download_source : http://go.criteo.net/criteo-research-search-conversion.tar.gz
           download_destination : data/criteo/external/criteo.tar.gz
                 extract_source : data/criteo/external/criteo.tar.gz
            extract_destination : data/criteo/raw/criteo.csv
               extract_filepath : Criteo_Conversion_Search/CriteoSearchData
                 convert_source : data/criteo/raw/criteo.csv
               load_destination : data/criteo/staged/criteo.csv
                      workspace : root
                            sep : \t
                        missing : -1


## ETL Pipeline Tasks
Next, we instantiate the five pipeline tasks, taking their parameters from the configuration object.

In [3]:
# Extract
download = Download(source=config.download_source, destination=config.download_destination)
extract = ExtractRawData(source=config.extract_source, 
                         destination=config.extract_destination, 
                         filepath_extract=config.extract_filepath)

# Transform
convert = ConvertDtypes(source=config.convert_source)
# Treat both the integer -1 and the string "-1" as missing
setna = SetNA(value=[-1, "-1"])

# Load
load = LoadData(destination=config.load_destination)
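
SetNA is given both -1 and "-1" because the missing value indicator can appear as a number or as text, depending on the column's data type. The sketch below shows the idea in plain Python on a couple of toy rows. The helper names set_na and count_missing are hypothetical, the rows are made up for illustration, and the actual SetNA task presumably operates on a pandas DataFrame instead.

```python
import math
from typing import Any, Dict, Iterable, List

MISSING_INDICATORS = (-1, "-1")  # the same values passed to SetNA above


def set_na(rows: Iterable[Dict[str, Any]],
           indicators: tuple = MISSING_INDICATORS) -> List[Dict[str, Any]]:
    """Return copies of `rows` with every missing-value indicator replaced by NaN.

    Hypothetical helper for illustration; not the cvr SetNA class.
    """
    return [
        {key: (math.nan if value in indicators else value) for key, value in row.items()}
        for row in rows
    ]


def count_missing(rows: Iterable[Dict[str, Any]]) -> int:
    """Count the cells that hold NaN."""
    return sum(
        1
        for row in rows
        for value in row.values()
        if isinstance(value, float) and math.isnan(value)
    )


# Toy rows, made up for this example.
rows = [
    {"sale": 1, "price": 12.5, "brand": "A1"},
    {"sale": 0, "price": -1, "brand": "-1"},
]
cleaned = set_na(rows)
print(count_missing(rows), count_missing(cleaned))
```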

## ETL Pipeline Development
Next, we create the PipelineCommand, which parameterizes the DataPipeline object. The force parameter indicates whether a step should be executed even if its output data already exists, and verbose specifies whether progress is reported during pipeline execution.

In [4]:
command = PipelineCommand(name="Criteo ETL", force=True, verbose=True)
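
The effect of a flag like force can be sketched as a simple check on the step's output path. The helper should_run below is hypothetical and only illustrates the rule described above. The actual cvr logic may differ.

```python
import tempfile
from pathlib import Path


def should_run(output_path: str, force: bool) -> bool:
    """Run a step if forced, or if its output does not exist yet.

    Hypothetical helper for illustration; the actual cvr logic may differ.
    """
    return force or not Path(output_path).exists()


with tempfile.TemporaryDirectory() as tmp:
    staged = Path(tmp) / "criteo.csv"
    first = should_run(str(staged), force=False)   # no output yet, so run
    staged.write_text("placeholder")
    second = should_run(str(staged), force=False)  # output exists, step can be skipped
    forced = should_run(str(staged), force=True)   # force overrides the check
```

With a download of nearly 2 GB, skipping completed steps saves considerable time on reruns, while force=True guarantees a clean rebuild.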

Here, we construct the DataPipeline.

In [5]:
builder = DataPipelineBuilder()
builder.create(command)
# Extract
builder.add_task(download)
builder.add_task(extract)
# Transform
builder.add_task(convert)
builder.add_task(setna)
# Load
builder.add_task(load)

builder.build()
etl = builder.pipeline

## ETL Pipeline Execution
The pipeline is ready to run. The expanded dataset is about 6.5 GB, and the estimated processing time is 15 minutes.

In [6]:
etl.run()

2022-01-21 15:23:22,607 - root_pipeline - INFO - Started root_pipeline
2022-01-21 15:23:22,608 - root_pipeline - INFO - Started Task: Download
2022-01-21 15:23:23,680 - root_pipeline - INFO - 	Downloading 1910.08 Mb

2022-01-21 15:23:46,973 - root_pipeline - INFO - 	Chunk #10: 5.24 percent downloaded at 4 Mbps
2022-01-21 15:24:11,447 - root_pipeline - INFO - 	Chunk #20: 10.47 percent downloaded at 4 Mbps
2022-01-21 15:24:34,605 - root_pipeline - INFO - 	Chunk #30: 15.71 percent downloaded at 4 Mbps
2022-01-21 15:24:57,618 - root_pipeline - INFO - 	Chunk #40: 20.94 percent downloaded at 4 Mbps
2022-01-21 15:25:20,783 - root_pipeline - INFO - 	Chunk #50: 26.18 percent downloaded at 4 Mbps
2022-01-21 15:25:43,924 - root_pipeline - INFO - 	Chunk #60: 31.41 percent downloaded at 4 Mbps
2022-01-21 15:26:06,523 - root_pipeline - INFO - 	Chunk #70: 36.65 percent downloaded at 4 Mbps
2022-01-21 15:26:29,132 - root_pipeline - INFO - 	Chunk #80: 41.88 percent downloaded at 4 Mbps
2022-01-21 15:26



                        DataPipeline Criteo ETL Summary                         
                                 Download Step                                  
                        _______________________________                         
                            Status Code : 200
                           Content Type : application/x-gzip
                          Last Modified : Wed, 08 Apr 2020 13:39:53 GMT
                    Content Length (Mb) : 1,910.081
                        Chunk Size (Mb) : 10
                      Chunks Downloaded : 193
                        Downloaded (Mb) : 1,910.081
                         File Size (Mb) : 1,910.081
                                   Mbps : 4.308
                                  Start : 2022-01-21 15:23:22.610763
                                    End : 2022-01-21 15:30:45.996134
                               Duration : 0:07:23.385371
                                 Status : 200: OK
                            Status D

2022-01-21 15:32:41,664 - root_pipeline - INFO - Ended Task: ExtractRawData. Status: 200: OK
2022-01-21 15:32:41,669 - root_pipeline - INFO - Started Task: ConvertDtypes




                        DataPipeline Criteo ETL Summary                         
                              ExtractRawData Step                               
                        _______________________________                         
                                 Source : data/criteo/external/criteo.tar.gz
                        Compressed Size : 2,002,864,638
                            Destination : data/criteo/raw/criteo.csv
                          Expanded Size : 6,426,808,162
                                  Start : 2022-01-21 15:30:46.000168
                                    End : 2022-01-21 15:32:41.664393
                               Duration : 0:01:55.664225
                                 Status : 200: OK
                            Status Date : 2022-01-21
                            Status Time : 15:32:41


2022-01-21 15:37:25,359 - root_pipeline - INFO - Ended Task: ConvertDtypes. Status: 200: OK
2022-01-21 15:37:25,363 - root_pipeline - INFO - Started Task: SetNA




                        DataPipeline Criteo ETL Summary                         
                               ConvertDtypes Step                               
                        _______________________________                         
                                   Rows : 15,995,634
                                Columns : 23
                                   Size : 5,380,123,480
                                Missing : 65,417
                        Rows w/ Missing : 65,417
                     Columns w/ Missing : 1
                               category : 17
                                  int64 : 3
                                float64 : 2
                                 object : 1
                                  Start : 2022-01-21 15:32:41.669787
                                    End : 2022-01-21 15:37:25.359735
                               Duration : 0:04:43.689948
                                 Status : 200: OK
                            Status Da

2022-01-21 15:37:40,866 - root_pipeline - INFO - Ended Task: SetNA. Status: 200: OK
2022-01-21 15:37:40,868 - root_pipeline - INFO - Started Task: LoadData




                        DataPipeline Criteo ETL Summary                         
                                   SetNA Step                                   
                        _______________________________                         
                                   Rows : 15,995,634
                                Columns : 23
                                  Cells : 367,899,582
                         Missing Before : 65,417
                       Missing % Before : 0.02
                          Missing After : 167,746,417
                        Missing % After : 45.6
                               % Change : 256,326.34
                                  Start : 2022-01-21 15:37:25.363766
                                    End : 2022-01-21 15:37:40.866263
                               Duration : 0:00:15.502497
                                 Status : 200: OK
                            Status Date : 2022-01-21
                            Status Time : 15:37:40


2022-01-21 15:37:59,855 - root_pipeline - INFO - Ended Task: LoadData. Status: 200: OK
2022-01-21 15:37:59,857 - root_pipeline - INFO - Completed root_pipeline. Duration 0:14:37.249498




                        DataPipeline Criteo ETL Summary                         
                                 LoadData Step                                  
                        _______________________________                         
                            Destination : data/criteo/staged/criteo.csv
                         File Size (Mb) : 1,835.47
                                  Start : 2022-01-21 15:37:40.868256
                                    End : 2022-01-21 15:37:59.855772
                               Duration : 0:00:18.987516
                                 Status : 200: OK
                            Status Date : 2022-01-21
                            Status Time : 15:37:59


                        DataPipeline Criteo ETL Summary                         
                        _______________________________                         
                               Download : 200: OK
                         ExtractRawData : 200: OK
                   

TypeError: 'NoneType' object is not callable
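
The figures in the SetNA summary can be checked with a few lines of arithmetic. The sketch below recomputes the cell count, the missing percentages, and the percent change from the row, column, and missing counts reported in the output. The variable names are chosen for this example.

```python
# Values copied from the SetNA step summary.
rows, columns = 15_995_634, 23
missing_before, missing_after = 65_417, 167_746_417

cells = rows * columns
pct_before = round(missing_before / cells * 100, 2)
pct_after = round(missing_after / cells * 100, 1)
pct_change = round((missing_after - missing_before) / missing_before * 100, 2)

print(cells, pct_before, pct_after, pct_change)
```

The jump from 0.02 percent to 45.6 percent missing shows that nearly all missing values in the raw data were encoded with the -1 indicator rather than left empty.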