# Data Discovery
This first examination of the data seeks to characterize data quality and to identify any data preparation that should be conducted in advance of an exploratory data analysis. The remainder of this section is organized as follows:

1. Data Acquisition: Obtaining the Criteo data from the source.
2. Dataset Creation: Transforming the raw data into a Dataset object.
3. Dataset Profile: Basic data quality assessment, descriptive statistics, missing values analysis and cardinality.
4. Distribution Analysis: Evaluation of continuous variable distributions and potential transformations.
5. Frequency Analysis: Frequency analysis for categorical variables.
6. Summary and Recommendations: Key findings and data preprocessing recommendations.

## Preliminaries
### Module Imports

In [None]:
from cvr.data.source import CriteoETL, CriteoTransformer
from cvr.utils.config import DataSourceConfig

### Workspaces
Workspace functionality was developed to support prototyping and experimentation with various datasets in isolated, persistent environments. Here, we'll create two workspaces: tom and jerry. Tom includes the entire dataset, whereas jerry operates on a random subsample of the larger dataset.

In [None]:
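# NOTE: Workspace is not among the imports above; import it from the cvr package before running this cell.
tom = Workspace('tom', 'full dataset')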
jerry = Workspace('jerry', 'sample dataset')
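
How jerry's subsample is drawn is not shown in this section. As a rough sketch only, assuming the data fit in a pandas DataFrame, a reproducible random subsample could be taken as follows. The function name `draw_sample`, the 1% fraction, and the seed are illustrative assumptions, not part of the cvr package.

```python
import pandas as pd


def draw_sample(df: pd.DataFrame, frac: float = 0.01, seed: int = 42) -> pd.DataFrame:
    """Return a random subsample of df (hypothetical helper, not cvr code)."""
    # Fixing random_state makes the same rows come back on every run,
    # which keeps experiments in the sample workspace repeatable.
    return df.sample(frac=frac, random_state=seed)


# Toy stand-in for the full dataset.
full = pd.DataFrame({"row_id": range(1000)})
sample = draw_sample(full)
```

Fixing the seed matters here because a persistent workspace is only useful if rerunning the notebook reproduces the same sample.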

## Data Acquisition
The data are extracted from the Criteo website, transformed into Dataset objects and loaded into the workspace for data profiling. 
### Extract
The DataSourceConfig object contains the URL and filenames for the Criteo site. CriteoETL uses this configuration to download the data to the raw data directory.


In [None]:
config = DataSourceConfig()
source = CriteoETL(config)
source.extract()
source.summary
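
The internals of `extract()` are not shown here. As a minimal sketch of the general pattern, assuming a single file fetched with the Python standard library, a download step that skips files already present in the raw data directory might look like this. The function `download_if_missing` and its parameters are hypothetical and are not the CriteoETL API.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, raw_dir: str, filename: str) -> Path:
    """Download url into raw_dir/filename unless it already exists (hypothetical helper)."""
    destination = Path(raw_dir) / filename
    # Create the raw data directory on first use.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Skip the download when the file is already there, so reruns are cheap.
    if not destination.exists():
        urlretrieve(url, str(destination))
    return destination
```

Skipping existing files keeps repeated notebook runs from re-downloading a large archive.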

### Transform
CriteoTransformer implements several a priori transformations. Specifically, missing values, indicated by '-1', are converted to NumPy NaN values; the target variable, 'sale', is converted from an integer to a binary categorical variable; and object datatypes are converted to category types.

In [None]:
x4mr = CriteoTransformer()
x4mr.transform()
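
The three transformations described above can be approximated in plain pandas. This is a sketch under stated assumptions, not the CriteoTransformer implementation: the function name `apply_a_priori_transforms` and the toy columns `price` and `brand` are hypothetical, and the sketch assumes '-1' may appear either as a number or as a string.

```python
import numpy as np
import pandas as pd


def apply_a_priori_transforms(df: pd.DataFrame, target: str = "sale") -> pd.DataFrame:
    """Hypothetical pandas version of the transformations described in the text."""
    out = df.copy()
    # 1. Treat -1 (numeric or string) as a missing-value marker.
    out = out.replace([-1, "-1"], np.nan)
    # 2. Convert the integer target into a categorical variable.
    out[target] = out[target].astype("category")
    # 3. Convert remaining text (object or string) columns to category dtype.
    for col in out.columns:
        is_text = pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col])
        if is_text:
            out[col] = out[col].astype("category")
    return out


# Toy data with assumed column names, for illustration only.
toy = pd.DataFrame({"sale": [0, 1, 0], "price": [-1, 10, 20], "brand": ["a", "-1", "b"]})
transformed = apply_a_priori_transforms(toy)
```

Replacing the sentinel before casting to category keeps '-1' from surviving as a spurious category level.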

Next, we transform the data into a Dataset object encapsulating the data, basic analysis functionality and metadata. 