# Data Profile
This first examination characterizes the quality of the data in its (near) raw form. The aim is to establish the scope and breadth of the preprocessing required before advancing to exploratory analysis. The remainder of this section is organized as follows:

1. Data Acquisition and Ingestion    
    1.0. Extract: Extract data from the source site.    
    1.1. Transform: Preliminary preprocessing prior to loading and analysis.     
    1.2. Load: Load the data into Dataset objects.    
    
2. Data Profile    
    2.0. Descriptive statistics, missing values and cardinality.    
    2.1. Distribution analysis of continuous variables.    
    2.2. Frequency analysis of categorical variables.     
    
3. Summary and Recommendations: Characterize key findings and data preprocessing recommendations.

## Data Acquisition and Ingestion
Extract, transform, load (ETL), popularized in the 1970s, is a design pattern for continuously extracting data from multiple heterogeneous sources, enforcing strict data quality and consistency standards, and loading the transformed, normalized, and sanitized data into enterprise-level data warehouses and data lakes for analysis and business intelligence. Although this project has none of that scale or complexity, we'll adopt ETL as the organizing framework for the data acquisition phase.

### Extract Transform Load Data Pipeline
To orchestrate the ETL process, we'll construct a DataPipeline object and a series of Tasks that perform the individual operations. The classes involved are summarized in the following table.

| Step | Class         | Description                                                             |
|------|---------------|-------------------------------------------------------------------------|
| 1    | Download      | Downloads the source data into a local directory.                       |
| 2    | Decompress    | Decompresses the data from a gzip archive.                              |
| 3    | Copy          | Copies the data to an immutable raw data file.                          |
| 4    | ConvertDtypes | Converts the target and object variables to category data type.         |
| 5    | SetNA         | Changes the missing value indicator, '-1', to NaNs.                     |
| 6    | BuildDataset  | Constructs a Dataset object from the data.                              |
| 7    | Dataset       | Object encapsulating the data and various behaviors used for profiling. |
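
As a rough illustration of the two Transform steps in the table, the sketch below casts selected columns to the `category` dtype and then replaces the `-1` sentinel with NaN using pandas. The functions `convert_dtypes` and `set_na`, and the column names in the toy frame, are hypothetical stand-ins chosen for this example; they are not the `cvr` implementations.

```python
import pandas as pd


def convert_dtypes(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return a copy of df with the named columns cast to category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out


def set_na(df: pd.DataFrame, values=(-1, "-1")) -> pd.DataFrame:
    """Return a copy of df with every cell equal to a sentinel set to NaN."""
    return df.mask(df.isin(list(values)))


# Toy frame with hypothetical column names; -1 and "-1" mark missing values.
df = pd.DataFrame(
    {
        "sale": [0, 1, -1],
        "price": [9.99, 2.0, 4.5],
        "brand": ["a", "-1", "b"],
    }
)

clean = set_na(convert_dtypes(df, columns=["brand"]))

# Count missing cells per column after the sentinel has been recoded.
missing_per_column = clean.isna().sum()
print(missing_per_column)
```

Matching both the integer and the string form of the sentinel is what lets a single pass cover numeric and categorical columns alike. Note that an integer column containing a sentinel is upcast to float once NaN is introduced.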

In [1]:
from cvr.core.pipeline import DataPipeline, DataPipelineBuilder, PipelineCommand
from cvr.core.task import Download, Decompress, Copy, ConvertDtypes, SetNA, SavePKLDataFrame
from cvr.utils.config import CriteoConfig 


#### Build Pipeline
##### Configuration
We begin with the CriteoConfig object, which packages information about the Criteo data source, such as:

- url: Web address for the Criteo Sponsored Search Conversion Log Dataset
- destination: Local filepath to which the data will be downloaded
- filepath_decompressed: Filepath for the decompressed Criteo data
- filepath_raw: Filepath for the raw data
- workspace: The workspace in which the final Dataset object will be stored

For illustration purposes, the configuration on this machine is as follows.

In [2]:
# Forward slashes avoid the invalid "\c" escape sequence and work on every platform.
config_filepath = "tests/test_config/criteo.yaml"
config = CriteoConfig(config_filepath=config_filepath)
config.print()




                        Criteo Data Source Configuration                        
                        ________________________________                        
                      name : Criteo Sponsored Search Conversion Log Dataset
                       url : http://go.criteo.net/criteo-research-search-conversion.tar.gz
               destination : tests/test_data/criteo/external/criteo.tar.gz
     filepath_decompressed : tests/test_data/criteo/external/Criteo_Conversion_Search/CriteoSearchData
              filepath_raw : tests/test_data/criteo/raw/criteo.csv
                 workspace : root
                       sep : \t
                   missing : -1
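
To make the shape of this configuration concrete, here is a minimal sketch of a comparable container built on a dataclass. `SourceConfig` and its `render` method are hypothetical; the real `CriteoConfig` reads its values from a YAML file, and its internals are not shown in this section. The field values are copied from the output above.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical container mirroring the fields printed above."""

    name: str
    url: str
    destination: str
    filepath_decompressed: str
    filepath_raw: str
    workspace: str = "root"
    sep: str = "\t"
    missing: int = -1

    def render(self, width: int = 26) -> str:
        """Format the fields as right-aligned 'key : value' lines."""
        return "\n".join(
            f"{key:>{width}} : {value}" for key, value in asdict(self).items()
        )


config = SourceConfig(
    name="Criteo Sponsored Search Conversion Log Dataset",
    url="http://go.criteo.net/criteo-research-search-conversion.tar.gz",
    destination="tests/test_data/criteo/external/criteo.tar.gz",
    filepath_decompressed=(
        "tests/test_data/criteo/external/Criteo_Conversion_Search/CriteoSearchData"
    ),
    filepath_raw="tests/test_data/criteo/raw/criteo.csv",
)
print(config.render())
```

Freezing the dataclass keeps the source description read-only once it is constructed, which suits values that every task in the pipeline shares.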


##### Tasks
The following Task objects comprise the ETL pipeline steps.

In [3]:
# Extract
download = Download(source=config.url, destination=config.destination)
decompress = Decompress(source=config.destination, destination=config.filepath_decompressed)
copy = Copy(source=config.filepath_decompressed, destination=config.filepath_raw)
# Transform
convert = ConvertDtypes(source=config.filepath_raw)
setna = SetNA(value=[-1,"-1"])
# Load
#builder = BuildDataset()
#load = LoadDataset()
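
For readers curious about what an Extract task such as Download might do internally, the following is a minimal sketch of a chunked download with progress reporting, using only the standard library. The function `download_chunked` is hypothetical and is not the `cvr` Download task; the demo reads a small local file through a `file://` URL so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path


def download_chunked(url: str, destination: str, chunk_size: int = 10 * 1024 * 1024) -> int:
    """Stream url to destination in fixed-size chunks; return the chunk count."""
    dest = Path(destination)
    dest.parent.mkdir(parents=True, exist_ok=True)
    chunks = 0
    written = 0
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Content-Length may be absent, in which case no percentage is shown.
        total = int(response.headers.get("Content-Length") or 0)
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            chunks += 1
            written += len(chunk)
            if total:
                print(f"Chunk #{chunks}: {100 * written / total:.2f} percent downloaded")
    return chunks


# Demo on a 25-byte local file with a 10-byte chunk size.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"x" * 25)
    target = Path(tmp) / "external" / "copy.bin"
    n_chunks = download_chunked(source.as_uri(), str(target), chunk_size=10)
    copied = target.read_bytes()
```

Streaming in chunks keeps memory use flat regardless of file size, which matters for an archive of roughly 1.9 GB.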



##### Data Pipeline
Next, we construct the data pipeline and a command object to parameterize it.

In [4]:
command = PipelineCommand(name="Criteo ETL", force=False, verbose=True)

In [5]:
builder = DataPipelineBuilder()
builder.create(command)
builder.add_task(download)
builder.add_task(decompress)
builder.add_task(copy)
builder.add_task(convert)
builder.add_task(setna)
builder.build()
pipeline = builder.pipeline
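
The builder calls above follow the classic builder pattern: a command is registered, tasks are appended in order, and `build()` assembles the finished pipeline. The sketch below shows one way such a design could work. Every class here (`Command`, `WriteFile`, `Pipeline`, `PipelineBuilder`) is a hypothetical simplification, and the rule that a task is skipped when its output exists and `force` is false is an assumption inferred from the status messages in the run log, not taken from the `cvr` source.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical run parameters shared by every task in a pipeline."""

    name: str
    force: bool = False
    verbose: bool = True


class WriteFile:
    """Hypothetical task that writes text to a destination file."""

    def __init__(self, destination: str, text: str) -> None:
        self.destination = destination
        self.text = text

    def run(self, command: Command) -> str:
        # Skip the work when the output exists, unless the run is forced.
        if os.path.exists(self.destination) and not command.force:
            return "215: Complete - Not Executed: Output Data Already Exists"
        with open(self.destination, "w") as f:
            f.write(self.text)
        return "200: OK"


class Pipeline:
    """Runs its tasks in order and collects their status strings."""

    def __init__(self, command: Command, tasks: tuple) -> None:
        self.command = command
        self.tasks = tasks

    def run(self) -> list:
        statuses = []
        for task in self.tasks:
            status = task.run(self.command)
            if self.command.verbose:
                print(f"{type(task).__name__}: {status}")
            statuses.append(status)
        return statuses


class PipelineBuilder:
    """Collects a command and tasks, then assembles a Pipeline."""

    def create(self, command: Command) -> None:
        self._command = command
        self._tasks = []
        self.pipeline = None

    def add_task(self, task) -> None:
        self._tasks.append(task)

    def build(self) -> None:
        # Freeze the task list so the built pipeline cannot be altered.
        self.pipeline = Pipeline(self._command, tuple(self._tasks))


with tempfile.TemporaryDirectory() as tmp:
    builder = PipelineBuilder()
    builder.create(Command(name="demo", force=False, verbose=False))
    builder.add_task(WriteFile(os.path.join(tmp, "raw.txt"), "data"))
    builder.build()
    first_run = builder.pipeline.run()
    second_run = builder.pipeline.run()
```

Running the same pipeline twice shows the effect of `force=False`: the first run writes the file, and the second reports that the output already exists.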

The pipeline is ready to run. The raw dataset is approximately 6.4 GB once decompressed, and the run logged below took about 12.5 minutes on this machine.

In [6]:
pipeline.run()

2022-01-21 05:31:41,104 - root_pipeline - INFO - Started root_pipeline
2022-01-21 05:31:41,105 - root_pipeline - INFO - Started Task: Download
2022-01-21 05:31:41,953 - root_pipeline - INFO - 	Downloading 1910.08 Mb

2022-01-21 05:32:00,835 - root_pipeline - INFO - 	Chunk #10: 5.24 percent downloaded at 5 Mbps
2022-01-21 05:32:19,435 - root_pipeline - INFO - 	Chunk #20: 10.47 percent downloaded at 5 Mbps
2022-01-21 05:32:37,856 - root_pipeline - INFO - 	Chunk #30: 15.71 percent downloaded at 5 Mbps
2022-01-21 05:32:56,721 - root_pipeline - INFO - 	Chunk #40: 20.94 percent downloaded at 5 Mbps
2022-01-21 05:33:15,329 - root_pipeline - INFO - 	Chunk #50: 26.18 percent downloaded at 5 Mbps
2022-01-21 05:33:34,072 - root_pipeline - INFO - 	Chunk #60: 31.41 percent downloaded at 5 Mbps
2022-01-21 05:33:53,035 - root_pipeline - INFO - 	Chunk #70: 36.65 percent downloaded at 5 Mbps
2022-01-21 05:34:11,422 - root_pipeline - INFO - 	Chunk #80: 41.88 percent downloaded at 5 Mbps
2022-01-21 05:34



                        DataPipeline Criteo ETL Summary                         
                                 Download Step                                  
                        _______________________________                         
                            Status Code : 200
                           Content Type : application/x-gzip
                          Last Modified : Wed, 08 Apr 2020 13:39:53 GMT
                    Content Length (Mb) : 1,910.081
                        Chunk Size (Mb) : 10
                      Chunks Downloaded : 193
                        Downloaded (Mb) : 1,910.081
                         File Size (Mb) : 1,910.081
                                   Mbps : 5.33
                                  Start : 2022-01-21 05:31:41.107671
                                    End : 2022-01-21 05:37:39.471976
                               Duration : 0:05:58.364305
                                 Status : 200: OK
                            Status Da

2022-01-21 05:39:09,846 - root_pipeline - INFO - Ended Task: Decompress. Status: 200: OK
2022-01-21 05:39:09,849 - root_pipeline - INFO - Started Task: Copy
2022-01-21 05:39:09,852 - root_pipeline - INFO - Ended Task: Copy. Status: 215: Complete - Not Executed: Output Data Already Exists
2022-01-21 05:39:09,853 - root_pipeline - INFO - Started Task: ConvertDtypes




                        DataPipeline Criteo ETL Summary                         
                                Decompress Step                                 
                        _______________________________                         
                          Source : tests/test_data/criteo/external/criteo.tar.gz
                 Compressed Size : 2,002,864,638
                     Destination : tests/test_data/criteo/external/Criteo_Conversion_Search/CriteoSearchData
                   Expanded Size : 6,426,808,162
                           Start : 2022-01-21 05:37:39.473976
                             End : 2022-01-21 05:39:09.846819
                        Duration : 0:01:30.372843
                          Status : 200: OK
                     Status Date : 2022-01-21
                     Status Time : 05:39:09


                        DataPipeline Criteo ETL Summary                         
                                   Copy Step                                 

2022-01-21 05:43:53,338 - root_pipeline - INFO - Ended Task: ConvertDtypes. Status: 200: OK
2022-01-21 05:43:53,342 - root_pipeline - INFO - Started Task: SetNA




                        DataPipeline Criteo ETL Summary                         
                               ConvertDtypes Step                               
                        _______________________________                         
                                   Rows : 15,995,634
                                Columns : 23
                                   Size : 5,380,123,480
                                Missing : 65,417
                        Rows w/ Missing : 65,417
                     Columns w/ Missing : 1
                               category : 17
                                  int64 : 3
                                float64 : 2
                                 object : 1
                                  Start : 2022-01-21 05:39:09.853834
                                    End : 2022-01-21 05:43:53.338569
                               Duration : 0:04:43.484735
                                 Status : 200: OK
                            Status Da

2022-01-21 05:44:09,037 - root_pipeline - INFO - Ended Task: SetNA. Status: 200: OK
2022-01-21 05:44:09,040 - root_pipeline - INFO - Completed root_pipeline. Duration 0:12:27.935732




                        DataPipeline Criteo ETL Summary                         
                                   SetNA Step                                   
                        _______________________________                         
                                   Rows : 15,995,634
                                Columns : 23
                                  Cells : 367,899,582
                         Missing Before : 65,417
                       Missing % Before : 0.02
                          Missing After : 167,746,417
                        Missing % After : 45.6
                               % Change : 256,326.34
                                  Start : 2022-01-21 05:43:53.342629
                                    End : 2022-01-21 05:44:09.037405
                               Duration : 0:00:15.694776
                                 Status : 200: OK
                            Status Date : 2022-01-21
                            Status Time : 05:44:09
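
The figures in the SetNA summary can be reproduced from the row, column, and missing-value counts alone. The short check below recomputes them; the formulas are our reading of the summary labels and are an assumption, not code taken from the `cvr` package.

```python
# Counts copied from the ConvertDtypes and SetNA summaries above.
rows, columns = 15_995_634, 23
missing_before, missing_after = 65_417, 167_746_417

cells = rows * columns
pct_before = 100 * missing_before / cells
pct_after = 100 * missing_after / cells
pct_change = 100 * (missing_after - missing_before) / missing_before

print(f"Cells            : {cells:,}")
print(f"Missing % Before : {pct_before:.2f}")
print(f"Missing % After  : {pct_after:.1f}")
print(f"% Change         : {pct_change:,.2f}")
```

Recoding the `-1` sentinel therefore raises the share of missing cells from about 0.02 percent to about 45.6 percent of the dataset.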
