(section_22_data_acquisition)=
# Data Acquisition
Wrangling, munging, cleansing and manipulating data are irreducible variables in the machine learning and big data value equation. Statistical inference, predictive analytics, and problem solving with machines and math require data, in the right format, volume, and veracity. In this section, we design, build and execute a simple, automated and reproducible data ingestion pipeline that extracts the data from its source, transforms it into a usable and reliable resource, then loads the data into a database for downstream analysis and modeling. The main components are put forward as follows:

![ETL](jbook/images/ETL-DAG.png)

## Extract
A DAG object containing the extract, transform, and load objects is instantiated with a pipeline specification: a configuration file containing basic declarative statements, expressions, and instructions for each of the pipeline tasks. The data is downloaded, decompressed, and stored as raw data. After some minor preprocessing, the data is registered in the metadata database and persisted in a staging area pending the next step in the ETL.

## Transform
Some 23 user, demographic, behavioural, and item features are split between two files: the impressions file, containing a single ad impression per row, and a common features file that aggregates lists of features common among many sample impressions. As depicted in the entity relationship diagram below, the impressions file contains our targets (the click and conversion labels), a unique sample id, a feature count, and a few gigabytes of strings containing feature lists. Our common features dataset is similarly formatted. A few samples are printed for illustration purposes.

```python
# IMPORTS
import pandas as pd
```

```python
# Load the staged sample of the impressions file and preview the first rows
df = pd.read_csv("data/development/staged/sample_skeleton_train.csv")
df.head()
```

|   | sample_id | click_label | conversion_label | common_features_index | num_features | features_list | partition |
|:-:|:---------:|:-----------:|:----------------:|:---------------------:|:------------:|---------------|:---------:|
| 0 | 11515523 | 0 | 0 | d5f794198192a713 | 8 | 20787184361.021091122781.021090428991.... | 0 |
| 1 | 11060573 | 0 | 0 | 5ab1af84e729a269 | 13 | 21090299621.021090904571.021691506511.... | 1 |
| 2 | 19579112 | 0 | 0 | 15100e25b3982cc3 | 17 | 30193516661.021692536751.021091144811.... | 2 |
| 3 | 12131559 | 0 | 0 | 4e21ad4cf5b148d7 | 10 | 20555650711.021090633511.021090746901.... | 3 |
| 4 | 34008079 | 0 | 0 | 654765381422bb63 | 17 | 21090623511.021090762221.050893550770.... | 4 |


Our aim for the transform step is a fully realized third normal form (3NF) target data model, free of redundancy, logical inconsistencies, transitive dependencies, and read/write anomalies. Normalization can improve memory, CPU, and disk efficiency, boost ad-hoc query processing, and reduce the computational effort associated with big data analytics. Not all optimization is premature.

Notwithstanding, transforming our features from a series of strings to rows of feature structures will involve computationally inefficient row-wise DataFrame operations on some 88 million rows. Fortunately, Apache Spark's grouped map pandas UDFs implement a so-called 'split-apply-combine' pattern: a Spark DataFrame is split into groups, the groups are dispatched to a configurable number of CPU cores running in parallel, a function is applied to each group, and the results are combined into a single final DataFrame.
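To make the split-apply-combine pattern concrete, here is a minimal sketch in plain Python. It is not the pipeline's Spark code: the thread pool stands in for Spark's executors, and the names `split`, `count_features`, and `split_apply_combine`, as well as the `|` delimiter in the toy rows, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(rows, key):
    """Split: group rows by a key function (here, the partition number)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups


def count_features(group):
    """Apply: a stand-in for the row-wise parsing performed on each group."""
    return [(sample_id, len(features.split("|"))) for sample_id, _, features in group]


def split_apply_combine(rows, key, func, max_workers=4):
    groups = split(rows, key)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per group; list() collects the per-group results
        results = list(pool.map(func, groups.values()))
    # Combine: flatten the per-group results into a single list
    return [item for result in results for item in result]


# Toy rows of (sample_id, partition, features); the "|" delimiter is hypothetical
rows = [
    (101, 0, "a|b|c"),
    (102, 1, "a"),
    (103, 0, "b|c"),
    (104, 1, "a|b|c|d"),
]
combined = split_apply_combine(rows, key=lambda r: r[1], func=count_features)
print(sorted(combined))
```

Threads in CPython do not give true CPU parallelism, so this sketch shows only the shape of the pattern; the speed-up described above comes from Spark distributing the groups across cores.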

Third normal form also ensures referential integrity and is generally considered ideal for online transaction processing (OLTP). Memory utilization, disk operations, and query response times all stand to benefit from a 3NF database design as we move to the exploratory data analysis work. The price is the parsing itself: the structure of the feature data demands a tedious, row-wise treatment that can't be easily vectorized.

Our feature set includes some 23 user demographic, behavioural, transactional, and item features, concatenated into strings and stored across the two files that collectively make up our training set. The impressions file contains the target click and conversion labels, a feature count, a sample id, and a string of one or more feature structures delimited by selected non-printable ASCII characters. The second file, the common features file, is similarly organized: it aggregates the feature groups that are common among many of the samples in the impressions file and packs them into delimited strings of feature structures. Each structure contains an id, a feature name, and a corresponding feature value.

![ERD](/jbook/images/ETL-DAG.png)

The primary aim of the transform step is to parse the feature structures into individual features and samples. Concretely, our core impressions file will be split into an impressions table, containing a single observation for each impression, and a features table whose rows carry foreign key references to the impressions table, so that one impression maps to many features.
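The parsing of a feature string into feature rows can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's transform code: the three delimiter bytes, the ordering of name, id, and value within a structure, the function name `parse_features`, and the sample string `raw` are all assumptions made for the example.

```python
# Assumed delimiters: the text says only "selected non-printable ASCII characters"
FEATURE_SEP = "\x01"  # assumed: separates feature structures within a list
FIELD_SEP = "\x02"    # assumed: separates the feature name from the feature id
VALUE_SEP = "\x03"    # assumed: separates the feature id from the feature value


def parse_features(sample_id, features_list):
    """Explode one delimited feature string into (sample_id, name, id, value) rows."""
    rows = []
    for structure in features_list.split(FEATURE_SEP):
        if not structure:
            continue  # tolerate a trailing delimiter
        name, rest = structure.split(FIELD_SEP, 1)
        feature_id, value = rest.split(VALUE_SEP, 1)
        rows.append((sample_id, name, feature_id, float(value)))
    return rows


# Hypothetical feature string with the assumed delimiters written out explicitly
raw = "207\x028718436\x031.0\x01210\x029112278\x031.0"
for row in parse_features(11515523, raw):
    print(row)
```

Each returned tuple corresponds to one row of the features table, with `sample_id` serving as the foreign key back to the impressions table.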


## Extract

The remote S3 data source is downloaded into the external data directory on the local drive, decompressed from its GZIP archive, and stored in the raw data directory. A staging process then adds column names and assigns each observation a partition number to support parallel processing in the transform stage.
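The staging step might look something like the sketch below, which decompresses a gzipped, headerless CSV, attaches column names, and assigns partition numbers. The column names are taken from the staged sample shown earlier; the function name `stage`, the round-robin partition assignment, and the in-memory toy data are assumptions for illustration.

```python
import csv
import gzip
import io

# Column names as they appear in the staged sample
COLUMNS = ["sample_id", "click_label", "conversion_label",
           "common_features_index", "num_features", "features_list"]


def stage(raw_gz_bytes, n_partitions):
    """Decompress a gzipped, headerless CSV and return staged rows as dicts."""
    text = gzip.decompress(raw_gz_bytes).decode("utf-8")
    staged = []
    for i, record in enumerate(csv.reader(io.StringIO(text))):
        row = dict(zip(COLUMNS, record))
        row["partition"] = i % n_partitions  # assumed: round-robin assignment
        staged.append(row)
    return staged


# Toy in-memory archive standing in for the downloaded file
raw = gzip.compress(b"1,0,0,abc,2,f1\n2,1,0,def,1,f2\n3,1,1,abc,3,f3\n")
staged = stage(raw, n_partitions=2)
print([(r["sample_id"], r["partition"]) for r in staged])
```

Round-robin assignment keeps the partitions close to equal in size, which helps balance the work across cores in the transform stage.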

We begin the ETL design with a quick assessment of the data vis-a-vis our (heretofore unspecified) target database in order to:

- quickly illuminate structural or data quality issues,
- assess the complexity of the integration effort, and
- evaluate the utility of the dataset and its attributes to the analysis and modeling efforts. 

![ERD](jbook/images/ERD.png)


Though our dataset would not be considered big data in any modern context, analyzing and manipulating some 90 million observations across 40 GB carries a real computational cost. To reduce that burden and to facilitate the data profiling and ETL development effort, multivariate proportional stratified downsampling was conducted to produce a sample dataset that reflects the distributions, diversity, and statistical properties of the full training set.

{numref}`sampling_strata` lists the sampling strata for the Alibaba Click and Conversion Prediction (Ali-CCP) dataset.

```{table} Sampling Strata
:name: sampling_strata

| Stratum | Click | Conversion | Proportion | Response                     |
|:-------:|:-----:|:----------:|:----------:|------------------------------|
|    1    |   0   |      0     |   96.11%   | No response to ad            |
|    2    |   1   |      0     |    3.89%   | Click through                |
|    3    |   1   |      1     |    0.02%   | Click-through and conversion |
```
Combined, we have approximately 86 million observations split almost evenly between the training and test sets. For computational convenience, we'll extract a *representative* sample from the *training* set for this stage of the analysis. And since the common features dataset extends the impressions dataset, we'll treat both as a single training set of 42.3 million observations. The sample is drawn by stratified random sampling: each stratum contributes observations in proportion to its share of the training set, which preserves the class distribution.
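Proportional stratified sampling can be sketched in a few lines of standard-library Python. The function name `stratified_sample`, the rule of keeping at least one observation per stratum, and the toy population are assumptions for illustration; the toy proportions are not the Ali-CCP proportions.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # assumed: at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


# Toy population of (click, conversion) labels
population = [(0, 0)] * 900 + [(1, 0)] * 90 + [(1, 1)] * 10
sample = stratified_sample(population, stratum_of=lambda r: r, fraction=0.1)
counts = {s: sample.count(s) for s in sorted(set(sample))}
print(counts)
```

Because every stratum is sampled at the same rate, the class proportions in the sample match those in the population, which is the property the downsampling is meant to preserve.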

Thus, we need to know how large a representative sample needs to be, assuming a margin of error of ±5%. Restating the problem, we seek a sample size for which the 100(1-$\alpha$)% confidence interval around the sample conversion rate contains the true population conversion rate with probability of at least 1-$\alpha$. With $\alpha = 0.05$, we are 95% confident that the true conversion rate lies inside the interval.

Conversions are discrete events following a binomial distribution. Defining *representative* in terms of conversion rate, we seek a sample size at which the sample mean conversion rate and its variance approximate the associated mean and variance of the *population* of some 42 million impressions, within some margin of error, say, 0.05%. Fortunately, the central limit theorem (CLT) provides a principled basis for choosing such a sample size.
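One common way to turn this reasoning into a number is the normal-approximation formula for a proportion, $n = z_{\alpha/2}^2 \, p(1-p) / e^2$, where $p$ is the anticipated rate and $e$ the margin of error. The sketch below assumes that formula and an optional finite population correction; the function name `sample_size` and the worst-case choice $p = 0.5$ in the example are assumptions, not the values used in this analysis.

```python
import math
from statistics import NormalDist


def sample_size(p, margin, confidence=0.95, population=None):
    """Sample size for estimating a proportion p to within +/- margin."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)  # finite population correction
    return math.ceil(n)


# Worst-case variance (p = 0.5) with a margin of error of 5 percentage points
print(sample_size(p=0.5, margin=0.05))
# Same settings with the training set size as the finite population
print(sample_size(p=0.5, margin=0.05, population=42_300_000))
```

Note that `margin` is an absolute margin on the proportion. For a rare event such as a conversion, a far tighter margin is needed, and the required sample size grows with the inverse square of that margin.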

### Impressions Data



(section_22_data_acquisition)=
# Data Acquisition
Wrangling, munging, cleansing and manipulating data are irreducible variables in the machine learning and big data value equation. Statistical inference, predictive analytics, and problem solving with machines and math require data, in the right format, volume, and veracity. In this section, we design, build and execute a simple, automated and reproducible data ingestion pipeline that extracts the data from its source, transforms it into a usable and reliable resource, then loads the data into a database for downstream analysis and modeling. The main components are put forward as follows:

## Extract
Our ETL pipeline is defined using declarative pipeline syntax - basic statements and expressions which sequence the parameterized tasks that collectively execute the ETL process. First, the data are downloaded from an Amazon S3 instance, unzipped, persisted, and this raw data are registered as assets in the metadata database. Column headings are added, partitions are assigned, andd the data are stored in a staging area for the transformation step. 

## Transform
Some 23 user, demographic, behavioural, and item features are split among two files: the impressions file containing a single ad impression per row, and a common features file that aggregates lists of features common among many sample impressions. As depicted in the entity relationship diagram below, the impressions file contains our targets, the click and conversion labels, a unique sample id, a feature count, and a few gigabytes of strings containing feature lists. Our common features dataset is similarly formatted. A few samples are printed for illustration purposes.

Our aim for the transform step, is a fully realized 3rd normal target data model free of redundancy, and logical inconsistencies, inappropriate and transitive dependencies, and read/write anomalies. Normalization improves memory, cpu, and disk efficiency, boosts ad-hoc query processing and reduces the computational effort associated with big data analytics. Not all optimization is premature. 

Notwithstanding, transforming our feature data will involve computationally inefficient row-wise dataframe operations on some 88 million rows. Fortunately, Apache Spark's Pandas UDF functions implement a so-called 'split-apply-combine' pattern in which a Spark DataFrame is split into groups, a function is applied to each group, and dispatched to one of a configurable number of CPU cores, then results are combined into a final single DataFrame. 

The source code for the transform step 
Viola! 
dispatched to which allows one to split trials using Pandas apply method on a sample dataset were not Fortunately, Spark's recent    row-wise dataframe operations that can't task will involve 84 million costly row-wise 





 in Memory, cpu, and disk utilization of 3NFare optimized while the additiona, memory requirement, faster disk operations, Disk operations, memory utilization, query response times are advantaged by a 3NF database design and and as we move to the exploratory data analysis work,  

Analyzing our data

Space and time complexity oThird normal form provides flexibility, ensures referential integrity, and can be considered increases data processing efficiency, reduces storage space, , ideal for online transaction processing (OLTP)   with referential integrity,  is to parse, extract, and convert the data  the data  features into into a 3rd normal form (3NF), thereby eliminating redundancy, ensuring referential integrity, and simplify data management and exploratory analysis. 

Unfortunately, this parsing exercise involves a rather tedious, row-wise treatment that can't be easily vectorized. Processing 84 million rows of the data. 
Unfortunately, the structure of the feature data will require row-wise parsing - a rather computationally burdensome task 

This row-wise parsing exercise can't be efficiently vectorized, but park  

 and management.   form  the features The  Each impression in the impression file contains a list of one ore more feature structures concatenated into strings, which delimited by selected non-printable ASCII characters. Similarly, lists of feature structures 
Our feature set includes some 23 user demographic, behavioural, transactional and item features concatenated, and compressed into two strings stored across the two files which collectively make up our training set.  files which collectively series of strings across across two files. impressions file contains:

![ERD](/jbook/images/ETL-DAG.png)


 the the target click and conversion labels, a feature count, a sample id and a series of strings containing one ore more feature lists. The second file, contains a similar collection of features lists organized into a series of concatenated feature structures.
 features that are common among many of the samples in the impressions file.   common feature file contains a collection of feature groups that have been aggregated , packed into ASCII character delimited strings containing the feature structures. Each structure contains and id, a feature name and a corresponding feature value. The primary aim of the transform step is to parse the features structures into the individual features and samples. Concretely, our core impressions will be split into an impressions table, containing a single observation for each  impression, and a features table with one-to-many foreign references to the impressions   into file will be transformed into features these features into feature structures that can be analyzed and processed. The sample below  containing   in comma separated strings.concatenated and encoded into comma separated strings  strings   
 and partiti in the metatadatabase  that  the tasks to be completed, the parameters  
Step 1. Download our data from its Amazon S3 instance, unzip the compressed archives, persist and register the raw data. Next, column names are added, partitions are assigned, and the assets are registered in the metadata database before staging the data for the transformation phase.  


## Extract

The remote S3 datasource is downloaded, decompressed, and stored in the raw data directory. A staging process adds column names and assigns each observation a partition number to support parallel processing in the transform stage.
 partitions   this data management framework is to download the source data into the external data directory on the local drive. It is then decompressed from its GZIP archive and migrated to teh loca

We begin the ETL design with a quick assessment of the data vis-a-vis our (heretofore unspecified) target database in order to:

- quickly illuminate structural or data quality issues 
- assess the complexity of the integration effort, and
- evaluate the utility of the dataset and its attributes to the analysis and modeling efforts. 

[erd](jbook/images/ERD.png)


To reduce our computational burden, advance the ETL analysis, design, and development effort, a multivariate multi-objective stratified optimal distribution-preserving class-proportional downsampling dataset will be created that reflects the structure of the entire training set.

sampling and allocation data profiling effort and the analysis, design, and ToTo mitigate computational burden  and of Analyzing and manipulating 90 million observations across 40 Gb To reduce computational cost and to facilitate the data profiling and discovery effort, a random sample   ETL development  deTo address the class imbalance question, data generation and sampling techniques have evolved    
To moderate the computational cost of analyzing and manipulating our data,  Though our dataset would not be considered big data in any modern context, the computational cost of analyzing and manipulating such datasets motivates   increases controlling the computational cost of the data acquisition and exploratory analysis efforts  motivated questions about the optimal size and allocation of data samples    analyzing and manipulating datasets of these sizes came with a computational burden 
To reduce the computational burden, multivariate proportional stratified downsampling was conducted to produce a sample dataset that reflected the distributions, diversity, and statistical properties of the full training. 

{numref}`sampling_strata`: Alibaba Click and Conversion Prediction (Ali-CCP) Dataset Sampling Strata

```{table} Sampling Strata
:name: sampling_strata

| Stratum | Click | Conversion | Proportion | Response                     |
|:-------:|:-----:|:----------:|:----------:|------------------------------|
|    1    |   0   |      0     |   96.11%   | No response to ad            |
|    2    |   1   |      0     |    3.89%   | Click through                |
|    3    |   1   |      1     |    0.02%   | Click-through and conversion |
```
A sample size 

Next, an optimal total sample size was calculated and stratified random sampling from each strata was conducted in accordance with the distribution conducted to preserve 
   was  , Analyzing and manipulating mid-sized datasets To mitigate some computational cost 
Combined, we have approximately 86 million observations split almost evenly between the training and test sets. Restricting our   observations in our training and test sets. 
For computational convenience, we'll extract a *representative* sample from the *training* set for this stage of the analysis. And since the common features dataset extends the impression dataset, we'll treat both as a single training set of 42.3 million observations. 

Thus, we need to know how large a representative sample needs to be, assuming a margin of error of +/-5%. Restating the problem, we seek a dataset in which the 100(1-$\alpha$)% confidence interval for the sample conversion rate contains the true population conversion rate with probability of at least 1-$\alpha$. Hence, we have a 95% confidence that the true conversion rate is contained inside the 95% confidence interval. 

Conversions are discrete events following a binomial distribution. Defining *representative* in terms of conversion rate, we seek a sample size in which the sample mean conversion rate and its variance approximate the associated mean and variance of the *population* within some margin of error, say, 0.05%. Fortunately, the central limit theorem (CLT) provides a principled method for reasoning about such a sample size.
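
One common way to make this concrete is Cochran's sample size formula for a proportion, $n_0 = z^2 p(1-p)/e^2$, optionally adjusted with a finite population correction. The sketch below is an assumption about how such a calculation could look, not the calculation used in this project; the function name, the conservative $p=0.5$, and the 5% margin are illustrative choices.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence=0.95, population=None):
    """Cochran's sample size for estimating a proportion (illustrative helper)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z**2 * p * (1 - p) / margin**2
    if population is not None:
        # Finite population correction shrinks n0 when the population is small.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Conservative case (p = 0.5 maximizes the variance) with a 5% margin of error.
n_infinite = sample_size_for_proportion(p=0.5, margin=0.05)
# Same settings, corrected for the 42.3 million training observations cited above.
n_finite = sample_size_for_proportion(p=0.5, margin=0.05, population=42_300_000)
print(n_infinite, n_finite)
```

Because the margin enters the formula squared, a tighter margin raises the required sample size quickly, which matters for a rare event such as a conversion.
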

### Impressions Data

In [2]:
# IMPORTS
import pandas as pd

In [4]:
impressions = "data/archive/production/raw/sample_skeleton_train.csv"
df = pd.read_csv(impressions, header=None, index_col=None)
df.loc[(df[1]==0) & (df[2]==0)].shape[0] / df.shape[0] * 100

96.1123386485646

In [9]:
df.head()

Unnamed: 0,0,1,2,3,4,5
0,11515523,0,0,d5f794198192a713,8,20787184361.021091122781.021090428991....
1,11060573,0,0,5ab1af84e729a269,13,21090299621.021090904571.021691506511....
2,19579112,0,0,15100e25b3982cc3,17,30193516661.021692536751.021091144811....
3,12131559,0,0,4e21ad4cf5b148d7,10,20555650711.021090633511.021090746901....
4,34008079,0,0,654765381422bb63,17,21090623511.021090762221.050893550770....


# Data Acquisition
One of the most challenging problems to solve in deep learning has little to do with neural network architectures, algorithm design, or AI framework selection. Teaching machines to learn without explicitly programming them to do so rests on our ability to acquire, prepare, and serve the right data, of the right quality, quantity and format. In this section, we design, build and execute a simple, automated and reproducible data extraction, transformation and loading (ETL) pipeline.

## Data Profile
Our dataset consists of a training set and a test set, each comprising an impressions file and a common features file, as indicated below.

| Name            | Set      | Filename                  | Size (GB) |
|-----------------|----------|---------------------------|-----------|
| Impressions     | Training | sample_skeleton_train.csv | 10        |
| Common Features | Training | common_features_train.csv | 8         |
| Impressions     | Test     | sample_skeleton_test.csv  | 10        |
| Common Features | Test     | common_features_test.csv  | 10        |

Let's take a look.

### Impressions Data




Our goal is to extract our data from its source, transform it into a structure and format consistent with our target data model, then load the data into the target database for downstream processing, cleaning, selection, analytics and modeling. This extract-transform-load (ETL) design pattern dates back to the centralized data repositories of the 1970s and the emergence of data warehouses in the 1990s.

We begin our design process with a detailed, field-level analysis and mapping of our source data to our target data model. One of the most important data integration tasks, the source-to-target mapping (STTM) exercise:
- ensures that the source data conforms with the target data model,
- confirms that the source data meets the structural, integrity, and data quality requirements of the analytics and modeling efforts,
- provides a sense of the complexity of the integration effort, and
- reveals the logic required to convert the source data to the target data model.

### Source-to-Target Mapping (STTM)
```{figure} ../images/STTM.png
---
height: 500px
width: 900px
name: sttm
---
Source to Target Model
```
As indicated in {ref}`sttm`, our source data are comprised of two types of datasets: a CoreData dataset containing id, label, and feature data for each impression, and a CommonFeaturesData dataset which aggregates features common to many impressions. The target data model contains the following four tables:

- **Impressions**: the id and label data for each impression,
- **CoreFeatures**: the feature data for each impression, represented in third normal form,
- **CommonFeaturesSummary**: the common features index and the number of features in each common feature list, and
- **CommonFeatures**: the feature data for each common feature list.

This mapping also serves as a blueprint for our data acquisition solution.

The goal of the data acquisition pipeline is to convert the source data depicted on the left of {ref}`sttm` to the structure and format represented on the right. Specifically, the pipeline will:
1. **Extract** the CoreData and CommonFeaturesData datasets depicted on the left of {ref}`sttm` to a local staging area,
2. **Transform** the CoreData and CommonFeaturesData datasets into the four target environment datasets described on the right of {ref}`sttm`, and finally,
3. **Load** the four datasets into the target relational database management system.


Let's review the mapping of the CoreData and CommonFeaturesData datasets.

#### CoreData Dataset Mapping
The CoreData datasets map to an Impressions table, and a CoreFeatures table. The Impressions table has all the fields contained in the CoreData datasets with one exception: the feature list. Feature lists found in the CoreData and CommonFeaturesData datasets are variable length lists of feature structures, each containing a feature name, feature id, and feature value. The lists of feature structures will be parsed, normalized, and stored in a separate CoreFeatures table where each observation corresponds to a single feature for an impression. 
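
The parsing and normalization step described above can be sketched as follows. The raw printouts do not show how the feature structures are delimited, so the separators below (`\x01` between structures, `\x02` and `\x03` within a structure) are assumptions to verify against the raw files, and `parse_feature_list` is a hypothetical helper rather than the project's implementation.

```python
def parse_feature_list(
    sample_id, feature_list, struct_sep="\x01", name_sep="\x02", value_sep="\x03"
):
    """Explode one feature list into one row per feature (hypothetical helper)."""
    rows = []
    for structure in feature_list.split(struct_sep):
        if not structure:
            continue  # Skip empty fragments, e.g. from a trailing separator
        feature_name, remainder = structure.split(name_sep, 1)
        feature_id, feature_value = remainder.split(value_sep, 1)
        rows.append(
            {
                "sample_id": sample_id,
                "feature_name": feature_name,
                "feature_id": feature_id,
                "feature_value": float(feature_value),
            }
        )
    return rows


# Made-up feature list with two structures, built with the assumed separators.
raw = "205\x021234\x031.0\x01301\x025678\x032.5"
for row in parse_feature_list(sample_id=1, feature_list=raw):
    print(row)
```

Each returned row carries the impression's sample id, so the rows can be loaded into a features table keyed back to the Impressions table.
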

#### CommonFeaturesData Dataset Mapping
Similarly, the CommonFeaturesData dataset is comprised of rows of common feature lists observed across many impressions. This dataset will map to a CommonFeaturesSummary and a CommonFeatures table.  The CommonFeaturesSummary simply stores the common_feature_index and the number of feature structures in feature lists stored in the CommonFeatures table.  

### Directed Acyclic Graph 
We've described the ETL process as a pipeline through which data flows sequentially from one end to the other. In practice, the metaphor is a bit misleading. Data isn't literally flowing from one end of a single tube to the other. Rather, ETL processes may be complex, non-linear networks of objects, tasks performed on those objects, and dependencies between tasks. A more apt theoretical framework for reasoning about ETL workflows can be borrowed from graph theory.

A graph is a pair $G=(V,E)$, where: 
- $V$ is a set of vertices, and 
- $E$ is a set of paired vertices or edges.

In a *directed* graph or *digraph*, the edge set $E \subseteq \{(x,y) \mid (x,y)\in V^2 \text{ and } x\ne y\}$ consists of ordered pairs of vertices, so each edge has a polarity or orientation from one vertex to another. For instance, the pair of vertices may be tasks to perform within a data pipeline, and the edge between them may represent the constraint that the end task must initiate after the start task has completed.  

A *path* is a sequence of edges in a graph in which the ending vertex or task of each edge is the same as the starting vertex or task of the next edge in the sequence. More formally, a path graph of order $n\ge2$ is a graph in which the vertices can be listed in an order $v_1,v_2,\dots,v_n$ such that the edges are $\{v_i,v_{i+1}\}$ for $i=1,2,\dots,n-1$. When the starting vertex of a path is the same as its ending vertex, a cycle has been formed.

Finally, a directed *acyclic* graph (DAG) has at least one topological ordering of its vertices, that is, a sequence in which the start vertex of every directed edge occurs earlier than the ending vertex of that edge. Conversely, any graph that has a topological ordering cannot have any cycles, because the edge into the earliest vertex of the cycle would have to be oriented in the wrong direction.

Hence, graph theory provides a powerful, simple, and mathematically elegant language for expressing ETL workflows.
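
To illustrate topological ordering and cycle detection, the sketch below uses the standard library's `graphlib` module (Python 3.9 or later). The task names are hypothetical stand-ins for pipeline steps and are not the project's actual DAG.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# The task names are illustrative, not the project's actual DAG.
dependencies = {
    "download": set(),
    "extract": {"download"},
    "transform": {"extract"},
    "load": {"transform"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making "download" depend on "load" closes a loop,
# so no topological ordering exists.
cyclic = dict(dependencies, download={"load"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

A scheduler can run tasks in any topological order and be sure that every dependency has completed first, which is why acyclicity is the property that matters for an ETL workflow.
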





Given their mathematical properties, DAGs have been used in a wide range of scientific, computational, biological, and sociological applications. 

Now, we can represent our ETL process as a directed acyclic graph $G=(V,E)$ where $V$ is a set of objects or vertices, and $E$ is the set of edges or tasks directionally connecting objects. The high-level ETL DAG is summarized in {ref}`etl_dag`.

```{figure} ../images/ETLDAG.png
---
height: 500px
width: 900px
name: etl_dag
---
Extract Transform Load DAG
```
## Extract 

In [3]:
# Imports
# External Modules
import os
import boto3
from botocore.exceptions import NoCredentialsError
import logging
import progressbar
import tarfile
import tempfile
import numpy as np
# NumExpr reads these variables at import time, so set them before importing numexpr
os.environ['NUMEXPR_MAX_THREADS'] = '24'
os.environ['NUMEXPR_NUM_THREADS'] = '16'
import numexpr as ne
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)
# Logging Configuration
# ------------------------------------------------------------------------------------------------ #
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------------------------ #

In [1]:
# REMOVE-CELL
# Must reset current directory to the project home before importing internal modules
home = "/home/john/projects/DeepCVR/"
os.chdir(home)

In [None]:
# Imports
# Internal Modules
from deepcvr.utils.config import S3Config

In [2]:
# Constants 
S3_BUCKET = 'deepcvr-data'
DIRECTORY_EXTERNAL = "data/external"
DIRECTORY_RAW = 'data/raw'
DIRECTORY_STAGED = 'data/staged'
DIRECTORY_SAMPLE = 'data/sample'
FILEPATH_EXTERNAL_TRAIN = os.path.join(DIRECTORY_EXTERNAL, 'taobao_train.tar.gz')
FILEPATH_EXTERNAL_TEST = os.path.join(DIRECTORY_EXTERNAL, 'taobao_test.tar.gz')
FILEPATH_RAW_TRAIN_CORE = os.path.join(DIRECTORY_RAW,"sample_skeleton_train.csv")
FILEPATH_RAW_TRAIN_COMMON = os.path.join(DIRECTORY_RAW,"common_features_train.csv")
FILEPATH_RAW_TEST_CORE = os.path.join(DIRECTORY_RAW,"sample_skeleton_test.csv")
FILEPATH_RAW_TEST_COMMON = os.path.join(DIRECTORY_RAW,"common_features_test.csv")


## Download Data
Downloading the data from our S3 bucket will take approximately 15 minutes on a standard 40 Mbps internet connection.

In [4]:
# %load -s S3Downloader deepcvr/data/download.py
class S3Downloader:
    """Download operator for Amazon S3 Resources

    Args:
        bucket (str): The name of the S3 bucket
        destination (str): Directory to which all resources are to be downloaded
        force (bool): Forces download even when the files already exist.
    """

    def __init__(self, bucket: str, destination: str, force: bool = False) -> None:
        self._bucket = bucket
        self._destination = destination
        self._force = force
        config = S3Config()
        self._s3 = boto3.client(
            "s3", aws_access_key_id=config.key, aws_secret_access_key=config.secret
        )
        self._progressbar = None

    def execute(self) -> None:

        object_keys = self._list_bucket_contents()

        for object_key in object_keys:
            destination = os.path.join(self._destination, object_key)
            if not os.path.exists(destination) or self._force:
                self._download(object_key, destination)
            else:
                logger.info(
                    "Bucket resource {} already exists and was not downloaded.".format(destination)
                )

    def _list_bucket_contents(self) -> list:
        """Returns a list of object keys in the designated bucket"""
        s3 = boto3.resource("s3")
        bucket = s3.Bucket(self._bucket)
        # Avoid shadowing the built-in `object` with the loop variable
        return [obj.key for obj in bucket.objects.all()]

    def _download(self, object_key: str, destination: str) -> None:
        """Downloads the object designated by the object key to the destination"""

        response = self._s3.head_object(Bucket=self._bucket, Key=object_key)
        size = response["ContentLength"]

        self._progressbar = progressbar.progressbar.ProgressBar(maxval=size)
        self._progressbar.start()

        os.makedirs(os.path.dirname(destination), exist_ok=True)
        try:
            self._s3.download_file(
                self._bucket, object_key, destination, Callback=self._download_callback
            )
            logger.info("Download of {} Complete!".format(object_key))
        except NoCredentialsError:
            # Log context and re-raise; NoCredentialsError does not accept a positional message
            logger.error("Credentials not available for {} bucket".format(self._bucket))
            raise

    def _download_callback(self, size):
        self._progressbar.update(self._progressbar.currval + size)


In [5]:
downloader = S3Downloader(bucket=S3_BUCKET, destination=DIRECTORY_EXTERNAL)
downloader.execute()

INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
INFO:__main__:Bucket resource data/external/taobao_test.tar.gz already exists and was not downloaded.
INFO:__main__:Bucket resource data/external/taobao_train.tar.gz already exists and was not downloaded.


## Extract Raw Data
Here, we extract the compressed files into a raw data directory.

In [6]:
# %load -s Extractor deepcvr/data/extract.py
class Extractor:
    """Extracts csv files from a gzip-compressed tar archive and stores the raw data

    Args:
        source (str): The filepath to the source file to be decompressed
        destination (str): The destination directory into which data shall be stored.
        force (bool): Forces extraction even when files already exist.
    """

    def __init__(self, source: str, destination: str, force: bool = False) -> None:

        self._source = source
        self._destination = destination
        self._force = force

    def execute(self) -> None:
        """Extracts the csv files from the archive and stores them in the destination."""
        logger.debug("\tSource: {}\tDestination: {}".format(self._source, self._destination))

        # Create the destination so that os.listdir does not fail on a first run
        os.makedirs(self._destination, exist_ok=True)

        # If all 4 raw files exist, it is assumed that the data have already been extracted
        n_files = len(os.listdir(self._destination))
        if n_files < 4 or self._force:

            with tempfile.TemporaryDirectory() as tempdir:
                # csv members go to the destination; all other members go to tempdir
                self._extract(source=self._source, destination=tempdir)

    def _extract(self, source: str, destination: str) -> None:
        """Extracts csv members to the destination directory, all others to a temp directory"""

        logger.debug("\t\tOpening {}".format(source))
        # The context manager closes the archive once extraction completes
        with tarfile.open(source) as data:
            for member in data.getmembers():
                if self._is_csvfile(filename=member.name):
                    # Skip csv files that already exist unless force is True
                    if self._not_exists_or_force(member_name=member.name):
                        logger.debug(
                            "\t\tExtracting {} to {}".format(member.name, self._destination)
                        )
                        data.extract(member, self._destination)  # Extract to destination
                else:
                    logger.debug("\t\tExtracting {} to {}".format(member.name, destination))
                    data.extract(member, destination)  # Extract to temp directory

    def _not_exists_or_force(self, member_name: str) -> bool:
        """Returns true if the file doesn't exist or force is True."""
        filepath = os.path.join(self._destination, member_name)
        return not os.path.exists(filepath) or self._force

    def _is_csvfile(self, filename: str) -> bool:
        """Returns True if filename is a csv file, returns False otherwise."""
        return ".csv" in filename


In [7]:
extractor = Extractor(source=FILEPATH_EXTERNAL_TRAIN, destination=DIRECTORY_RAW)
extractor.execute()  # execute() returns None; the files are written to DIRECTORY_RAW
os.listdir(DIRECTORY_RAW)

['sample_skeleton_train.csv',
 'sample_skeleton_test.csv',
 'common_features_test.csv',
 'common_features_train.csv']

## Core Dataset Preprocessing
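Let's take a preliminary look at the core dataset. Note that the first cell below reads the first 10,000 rows of the core *test* file, which shares its column layout with the training file.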
### Core Raw Training Set

In [18]:
df = pd.read_csv(FILEPATH_RAW_TEST_CORE, header=None, index_col=[0], nrows=10000)
df.head()

Unnamed: 0_level_0,1,2,3,4,5
0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0,0,23bd0f75de327c60,14,21691810781.030193516651.020555871431.020683152771.020788010261.070298787552.07...
2,0,0,23bd0f75de327c60,15,20556627321.020683168931.020789873281.0853100205382.53970298966523.5835250893551...
3,0,0,23bd0f75de327c60,12,20683154051.020565395121.030193516651.021692734271.021091004791.021090841271.0...
4,0,0,23bd0f75de327c60,11,50996861712.9957321090689921.020788010261.020683152761.020580106491.02109104804...
5,0,0,543b0cd53c7d5858,11,20683170931.050893553232.6390621090204101.021090452281.021090890731.02109035934...


Here we have: 

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Sample-id                              |
| 1      | Click Label                            |
| 2      | Conversion Label                       |
| 3      | Common Features Foreign Key            |
| 4      | Number of features in the feature list |
| 5      | Feature List                           |
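
Since the raw files carry no header row, the column mapping above can be applied at read time. The snippet below is a minimal sketch using two made-up rows in place of the raw file; the column names are taken from the staged impressions file shown earlier in this section, and `CORE_COLUMNS` is a hypothetical constant.

```python
import io

import pandas as pd

# Column names taken from the staged impressions file shown earlier in this section.
CORE_COLUMNS = [
    "sample_id",
    "click_label",
    "conversion_label",
    "common_features_index",
    "num_features",
    "features_list",
]

# Two made-up rows standing in for the headerless raw file.
raw_csv = io.StringIO("1,0,0,23bd0f75de327c60,14,abc\n2,1,0,543b0cd53c7d5858,11,def\n")

core = pd.read_csv(raw_csv, header=None, names=CORE_COLUMNS, index_col="sample_id")
print(core[["click_label", "conversion_label", "num_features"]])
```

Named columns make later filters, such as selecting rows by click and conversion label, easier to read than positional indices.
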


In [19]:
df = pd.read_csv(FILEPATH_RAW_TRAIN_COMMON, header=None, index_col=0, nrows=100)
df.head()

Unnamed: 0_level_0,1,2
0,Unnamed: 1_level_1,Unnamed: 2_level_1
84dceed2e3a667f8,343,101313191.012534387741.012634387791.012734387821.012838648851.012938648881.015...
0000350f0c2121e7,811,127_1437162241.94591127_1435146270.69315127_1437728710.69315127_1435432831.60944127_...
000091a89d1867ab,7,12534387731.012434387691.012234387611.012134386581.012938648891.012838648851.0...
0001a4114b0ae8bf,231,150_1439166842.3979150_1439407981.07056150_1438923681.6259150_1439146340.55962150_14...
0001def19d7cb335,964,150_1439091500.84715150_1439330134.44265150_1439340833.3322150_1438742584.09988150_1...


Here we have: 

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Common Features Index                  |
| 1      | Number of features in the feature list |
| 2      | Feature List                           |

# REMOVE-CELL
# References and Notes
Refer to  https://www.netquest.com/blog/en/random-sampling-stratified-sampling for sampling techniques