In [None]:
# REMOVE-CELL
import os
home = "/home/john/projects/DeepCVR/"
os.chdir(home)

(section_22_data_acquisition)=
# Data Acquisition and Ingestion
Wrangling, munging, cleansing, and manipulating data are irreducible factors of machine learning and its value equation. Statistical inference, predictive analytics, and problem solving with math and machines all require the right data, in the right volumes, in a structure and format suited to the task. This section describes the processes by which we obtain our source data and convert it into forms and formats suitable for analysis and modeling. 

## Data Model Design Principles
As we kick off what is reportedly the most time-consuming phase of a machine learning project, let's align on a few design principles that will inform decisions around reproducibility, transparency, productivity, and efficiency.

1. **Simplicity**: This is a fairly simple data migration, and the aim is to keep it so while providing a detailed account of the data preparation processes, software, and tool stacks used.
2. **Declarative Pipelines**: Pipelines, or more formally, directed acyclic graphs (DAGs) of tasks, will be defined in declarative YAML-based configuration files. Whereas imperative programming decomposes a problem into a sequence of computational steps that must be carried out, declarative programming abstracts away control flow and states the task or desired outcome without listing the commands required to complete it. Defining pipelines in this way emphasizes 'what' must be done rather than 'how'. Declarative pipelines have a restricted, simpler syntax, which makes them less error-prone and easier to read, write, and maintain. They are the central idea behind modern 'Pipeline as Code' approaches.
3. **Parallel Functional Style Implementation**: For discrete tasks, operators should follow a stateless functional programming paradigm that defines operations only in terms of functions or methods, their inputs, and their return values. Small, deterministic functions compose into modules that are unaffected by mutable state or unintended side effects. Defining operations this way supports modularity and easier debugging, testing, and verification.
4. **Normalize First**: Design target database schemas to third normal form (3NF) to mitigate data anomalies, ensure referential integrity, and eliminate duplication and redundancy. Denormalize into dimensions and star schemas as dimensions emerge that materially improve inference and consumability of the data.
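
To make the second and third principles concrete, here is a minimal sketch of a declaratively configured pipeline built from stateless functions. It is not the DeepCVR implementation: the operator names (`drop_empty`, `split_fields`), the configuration keys, and `run_pipeline` are hypothetical, and the configuration is written as a Python dictionary standing in for what would be loaded from a YAML file.

```python
from functools import reduce
from typing import Any, Callable, Dict, List


# Hypothetical operators: pure functions of (data, params) that return new data.
def drop_empty(rows: List[str], params: Dict[str, Any]) -> List[str]:
    """Return a new list without blank rows; the input list is not modified."""
    return [row for row in rows if row.strip()]


def split_fields(rows: List[str], params: Dict[str, Any]) -> List[List[str]]:
    """Split each row on the configured delimiter."""
    return [row.split(params["delimiter"]) for row in rows]


# Registry mapping the operator names used in the configuration to functions.
OPERATORS: Dict[str, Callable] = {
    "drop_empty": drop_empty,
    "split_fields": split_fields,
}

# Declarative description of the pipeline: 'what' to do and in which order.
# In practice this dictionary would be loaded from a YAML file.
CONFIG = {
    "dag": "demo",
    "tasks": [
        {"task_no": 1, "operator": "drop_empty", "params": {}},
        {"task_no": 2, "operator": "split_fields", "params": {"delimiter": ","}},
    ],
}


def run_pipeline(config: Dict[str, Any], data: Any) -> Any:
    """Apply the configured tasks in task_no order, passing each result to the next task."""
    tasks = sorted(config["tasks"], key=lambda task: task["task_no"])
    return reduce(
        lambda acc, task: OPERATORS[task["operator"]](acc, task["params"]),
        tasks,
        data,
    )


raw = ["a,b", "", "c,d"]
result = run_pipeline(CONFIG, raw)
print(result)  # [['a', 'b'], ['c', 'd']]
```

Because each operator returns a new object rather than mutating its input, tasks can be tested in isolation and reordered by editing the configuration alone.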

## Data Ingestion Framework
Our development framework aligns our design principles with execution. As a general-purpose, high-level, multi-paradigm programming language, Python supports functional, object-oriented, procedural, and scripting styles for rapid development and prototyping. Python also cleans up after itself: memory for Python objects is dynamically allocated on a private heap and reclaimed automatically by the Python memory manager when no longer needed. For in-memory analytics, Pandas further reduces memory cost through efficient storage of numeric and categorical data types. The Pandas DataFrame API has become something of a standard, implemented and extended by an ecosystem that includes Koalas, a DataFrame interface on top of Apache Spark. In fact, we will use Spark's Pandas User-Defined Function (UDF) facility with PySpark to parallelize and scale DataFrame processing across multiple cores during the Transform stage of our ETL. Finally, our database, consisting of millions of advertising impressions and billions of user and item feature structures, will run on MySQL. Version 8.0.28 includes a host of improvements designed to take advantage of modern hardware and operating system resources.

## Data Ingestion Module

1. The imperative components of the ETL module include the DagBuilder, an ExtractDAG, a TransformDAG, and a LoadDAG. DAG tasks will be performed by Python Operator objects.
2. The datasets are reasonably large, and certain transformations must occur at the instance level. We will parallelize where feasible with Apache Spark.
3. Our database is largely composed of billions of feature structures. For its performance, scalability, and simplicity, MySQL will provide the backend database for analysis.

## Target Data Model
For the analysis stage, we will restructure the training data into third normal form (3NF), a relational database schema design originally defined by E.F. Codd in the early 1970s. A 3NF database minimizes data anomalies, guarantees referential integrity, and ensures that every non-key attribute provides a fact about the key, the whole key, and nothing but the key, "so help me Codd" {cite}`diehrDatabaseManagement1989`.  

{numref}`s2t` relates the source to target database structure.

```{figure} ../images/s2t.png
---
name: s2t
align: center
alt: Source to Target Model
---
Source to Target (S2T) Model
```
As depicted in {numref}`s2t`, our sample and common features datasets will become a relational database containing:  

- a **sample** table holding the sample_id, the target click and conversion labels and a common_feature_index, 
- a **feature** *reference* table holding the global feature ids and their associated names,  
- a **sample_feature** table composed of one or more feature structures (id, value) for each sample, and 
- a **common_feature_group** table holding feature structures common among many samples. 
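
The four tables listed above can be sketched as a schema. The following is an illustration only: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the column names beyond those mentioned in the text (`feature_id`, `feature_name`, `feature_value`), the column types, and the choice of keys are assumptions rather than the project's actual data definition.

```python
import sqlite3

# Assumed column names, types, and keys; only the table names and the sample
# columns (sample_id, click, conversion, common_feature_index) come from the text.
SCHEMA = """
CREATE TABLE feature (
    feature_id   INTEGER PRIMARY KEY,
    feature_name TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id            INTEGER PRIMARY KEY,
    click                INTEGER NOT NULL,
    conversion           INTEGER NOT NULL,
    common_feature_index TEXT
);
CREATE TABLE sample_feature (
    sample_id     INTEGER NOT NULL REFERENCES sample (sample_id),
    feature_id    INTEGER NOT NULL REFERENCES feature (feature_id),
    feature_value REAL NOT NULL,
    PRIMARY KEY (sample_id, feature_id)
);
CREATE TABLE common_feature_group (
    common_feature_index TEXT NOT NULL,
    feature_id           INTEGER NOT NULL REFERENCES feature (feature_id),
    feature_value        REAL NOT NULL,
    PRIMARY KEY (common_feature_index, feature_id)
);
"""


def create_schema(connection: sqlite3.Connection) -> None:
    """Create the four tables and switch on foreign key enforcement."""
    connection.execute("PRAGMA foreign_keys = ON")
    connection.executescript(SCHEMA)


connection = sqlite3.connect(":memory:")
create_schema(connection)
tables = sorted(
    row[0]
    for row in connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Every non-key column in each table depends only on that table's key, which is the property 3NF asks for; the feature names live once in the reference table rather than being repeated for every sample.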

## Data Acquisition and Ingestion Process
The roots of the extract-transform-load (ETL) design pattern extend back to the centralized data repositories of the 1970s and the data warehouses of the 1980s and early 1990s. Today, ETL and its variants describe the doctrine for obtaining data from disparate systems, staging and converting the data to a suitable format, and integrating the data into a data warehouse, target environment, or similar storage facility. We adopt the ETL design pattern for moving from our source to our target data model. The remaining subsections will be devoted to the construction and execution of a simple, configuration-based, automated, and reproducible data ingestion pipeline that 

- **extracts** the data from its [source](s3://deepcvr-data/production/taobao_train.tar.gz), 
- **transforms** it into a usable and reliable structure, then 
- **loads** the data into a [database](https://www.mysql.com/) for downstream analysis and modeling. 




The main components of this extract-transform-load (ETL) pipeline are put forward in {numref}`etl_dag`.

```{figure} ../images/ETL-DAG.png
---
name: etl_dag
align: center
alt: Extract Transform Load
---
Extract Transform Load
```


Allons-y!

## Extract
The extract phase consists of two steps. The first step is executed by the S3Downloader class in the deepcvr.data.extract module. It downloads the data from the production folder of the deepcvr-data AWS S3 bucket and stores it in the data/external directory. The second step is performed by the Decompress class in the same module. It unpacks the data from its GZIP archive and stores the raw data in CSV format in the data/raw/ directory. The YAML configuration file, config/extract.yaml, loaded below describes a simple extract directed acyclic graph (DAG) in terms of the task executables and their parameters.
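
As an illustration of the second step, the sketch below unpacks a gzip-compressed tar archive using only the standard library. It is not the project's Decompress class: the function name `decompress_archive` and its parameters are hypothetical, and the error handling simply reports a damaged archive as a single `RuntimeError`.

```python
import os
import tarfile
import zlib


def decompress_archive(source: str, destination: str, force: bool = False) -> list:
    """Extract a gzip-compressed tar archive and return the names of its members.

    Extraction is skipped when every member already exists in the destination,
    unless force is True.
    """
    os.makedirs(destination, exist_ok=True)
    try:
        with tarfile.open(source, mode="r:gz") as tar:
            names = tar.getnames()
            already_there = all(
                os.path.exists(os.path.join(destination, name)) for name in names
            )
            if already_there and not force:
                return names
            # Use the safer 'data' extraction filter where the interpreter supports it.
            kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
            tar.extractall(destination, **kwargs)
            return names
    except (tarfile.TarError, zlib.error, EOFError) as error:
        # A truncated or corrupted download surfaces here as one clear error.
        raise RuntimeError(f"Archive {source} could not be decompressed: {error}") from error
```

Calling `decompress_archive("data/external/taobao_train.tar.gz", "data/raw/")` would unpack the archive named in the source link above, assuming it has been downloaded to that location.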

In [1]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from deepcvr.base.dag import DagBuilder
from deepcvr.utils.config import config_dag

In [2]:

config_filepath = "config/extract.yaml"
config = config_dag(config_filepath)
dag = DagBuilder(config=config).build()
dag.run()

INFO:task_event:S3Downloader.execute called with dict_values([1, 's3_download', OrderedDict([('bucket', 'deepcvr-data'), ('folder', 'production/'), ('destination', 'data/external/'), ('force', False)]), 'deepcvr-data', 'production/', 'data/external/', False, None])
DEBUG:deepcvr.data.extract:	Started S3Downloader execute
DEBUG:botocore.hooks:Changing event name from creating-client-class.iot-data to creating-client-class.iot-data-plane
DEBUG:botocore.hooks:Changing event name from before-call.apigateway to before-call.api-gateway
DEBUG:botocore.hooks:Changing event name from request-created.machinelearning.Predict to request-created.machine-learning.Predict
DEBUG:botocore.hooks:Changing event name from before-parameter-build.autoscaling.CreateLaunchConfiguration to before-parameter-build.auto-scaling.CreateLaunchConfiguration
DEBUG:botocore.hooks:Changing event name from before-parameter-build.route53 to before-parameter-build.route-53
DEBUG:botocore.hooks:Changing event name from requ

	Task 1:	s3 download complete.	Duration: 0.63 seconds.


ERROR:task_event:Exception raised in execute. exception: Error -3 while decompressing data: invalid stored block lengths
Traceback (most recent call last):
  File "/home/john/projects/DeepCVR/deepcvr/utils/decorators.py", line 74, in wrapper
    result = func(self, *args, **kwargs)
  File "/home/john/projects/DeepCVR/deepcvr/data/extract.py", line 158, in execute
    tar.extractall(self._destination)
  File "/home/john/anaconda3/envs/deepcvr/lib/python3.8/tarfile.py", line 2028, in extractall
    self.extract(tarinfo, path, set_attrs=not tarinfo.isdir(),
  File "/home/john/anaconda3/envs/deepcvr/lib/python3.8/tarfile.py", line 2069, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name),
  File "/home/john/anaconda3/envs/deepcvr/lib/python3.8/tarfile.py", line 2141, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/home/john/anaconda3/envs/deepcvr/lib/python3.8/tarfile.py", line 2190, in makefile
    copyfileobj(source, target, tarinfo.size, Read

error: Error -3 while decompressing data: invalid stored block lengths

In [None]:
df = pd.read_csv("data/development/staged/sample_skeleton_train.csv")
df.head()


## Transform
Some 23 user, demographic, behavioural, and item features are split among two files: the impressions file containing a single ad impression per row, and a common features file that aggregates lists of features common among many sample impressions. As depicted in the entity relationship diagram below, the impressions file contains our targets, the click and conversion labels, a unique sample id, a feature count, and a few gigabytes of strings containing feature lists. Our common features dataset is similarly formatted. A few samples are printed for illustration purposes.

Our aim for the transform step is a fully realized third normal form (3NF) target data model, free of redundancy, logical inconsistencies, transitive dependencies, and read/write anomalies. Normalization improves memory, CPU, and disk efficiency, boosts ad-hoc query processing, and reduces the computational effort associated with big data analytics. Not all optimization is premature. 

Notwithstanding, transforming our features from a series of strings to rows of feature structures will involve computationally inefficient row-wise DataFrame operations on some 88 million rows. Fortunately, Apache Spark's Pandas UDFs implement a so-called 'split-apply-combine' pattern in which a Spark DataFrame is split into groups, the groups are dispatched to a configurable number of CPU cores running in parallel, a function is applied to each group, and the results are combined into a single final DataFrame. 
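
The split-apply-combine pattern itself can be shown without Spark. The sketch below is a stand-in, not a Pandas UDF: it uses the standard library's `concurrent.futures` on plain dictionaries, and the `partition` and `click` fields, the per-group function, and all function names are made up for illustration. A thread pool keeps the example portable; for CPU-bound parsing, a process pool or Spark itself would be needed to gain real parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def split(rows, key):
    """Split rows into groups that share the same value of key."""
    ordered = sorted(rows, key=itemgetter(key))
    return [list(group) for _, group in groupby(ordered, key=itemgetter(key))]


def summarize(group):
    """Per-group function: count the rows and sum the clicks in one partition."""
    return {
        "partition": group[0]["partition"],
        "rows": len(group),
        "clicks": sum(row["click"] for row in group),
    }


def split_apply_combine(rows, key, func, max_workers=4):
    """Split rows by key, apply func to each group concurrently, combine the results."""
    groups = split(rows, key)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the input groups.
        return list(pool.map(func, groups))


rows = [
    {"partition": 0, "click": 1},
    {"partition": 1, "click": 0},
    {"partition": 0, "click": 0},
    {"partition": 1, "click": 1},
    {"partition": 1, "click": 1},
]
summary = split_apply_combine(rows, "partition", summarize)
print(summary)
# [{'partition': 0, 'rows': 2, 'clicks': 1}, {'partition': 1, 'rows': 3, 'clicks': 2}]
```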

Each impression in the impressions file contains a list of one or more feature structures concatenated into a string and delimited by selected non-printable ASCII characters. The common features file is organized similarly: it contains feature groups that are common among many of the samples in the impressions file, each packed into an ASCII-character-delimited string of feature structures. Each structure contains an id, a feature name, and a corresponding feature value.

![ERD](/jbook/images/ETL-DAG.png)

The primary aim of the transform step is to parse these strings into individual features and samples. Concretely, the core impressions file will be split into an impressions table, containing a single observation for each impression, and a features table with one-to-many foreign references to the impressions. Converting the data into third normal form in this way reduces storage space and simplifies data management and exploratory analysis. Unfortunately, the structure of the feature data requires a rather tedious, row-wise parsing that cannot be easily vectorized, which makes it a computationally burdensome task.

Before the transform step can begin, the data are downloaded from Amazon S3, the compressed archives are unzipped, and the raw data are persisted and registered. Next, column names are added, partitions are assigned, and the assets are registered in the metadata database before the data are staged for the transformation phase.  
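
The row-wise parsing of feature strings discussed in this section can be sketched as follows. The delimiters and the field order are assumptions: we take `\x01` to separate feature structures, `\x02` to separate the feature name from the id, and `\x03` to separate the id from the value, and we expose all three as parameters so they can be changed. The `FeatureStructure` type, the function name, and the encoded string are made up for illustration.

```python
from typing import List, NamedTuple


class FeatureStructure(NamedTuple):
    sample_id: str
    feature_name: str
    feature_id: str
    feature_value: float


def parse_feature_list(
    sample_id: str,
    feature_list: str,
    feature_sep: str = "\x01",  # assumed separator between feature structures
    name_sep: str = "\x02",     # assumed separator between feature name and id
    value_sep: str = "\x03",    # assumed separator between feature id and value
) -> List[FeatureStructure]:
    """Convert one delimited feature string into one row per feature structure."""
    structures = []
    for item in feature_list.split(feature_sep):
        if not item:
            continue  # tolerate empty segments such as a trailing separator
        name, _, remainder = item.partition(name_sep)
        feature_id, _, value = remainder.partition(value_sep)
        structures.append(FeatureStructure(sample_id, name, feature_id, float(value)))
    return structures


# A made-up feature string holding two feature structures.
encoded = "101\x02931\x031.0\x01205\x024636\x030.5"
parsed = parse_feature_list("s1", encoded)
for structure in parsed:
    print(structure)
```

Each returned tuple corresponds to one row of the features table, keyed back to the impression through `sample_id`.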


## Extract

The remote S3 data source is downloaded into the external data directory on the local drive, decompressed from its GZIP archive, and stored in the raw data directory. A staging process then adds column names and assigns each observation a partition number to support parallel processing in the transform stage.

We begin the ETL design with a quick assessment of the data vis-a-vis our (heretofore unspecified) target database in order to:

- quickly illuminate structural or data quality issues 
- assess the complexity of the integration effort, and
- evaluate the utility of the dataset and its attributes to the analysis and modeling efforts. 

[erd](jbook/images/ERD.png)


Though our dataset would not be considered big data in any modern context, analyzing and manipulating some 90 million observations across 40 GB comes with a computational burden. To reduce that burden and to facilitate the data profiling, discovery, and ETL development effort, multivariate proportional stratified downsampling was conducted to produce a sample dataset that reflects the distributions, diversity, and statistical properties of the full training set. 

{numref}`sampling_strata` lists the sampling strata for the Alibaba Click and Conversion Prediction (Ali-CCP) dataset.

```{table} Sampling Strata
:name: sampling_strata

| Stratum | Click | Conversion | Proportion | Response                     |
|:-------:|:-----:|:----------:|:----------:|------------------------------|
|    1    |   0   |      0     |   96.11%   | No response to ad            |
|    2    |   1   |      0     |    3.89%   | Click through                |
|    3    |   1   |      1     |    0.02%   | Click-through and conversion |
```
Next, an optimal total sample size was calculated, and stratified random sampling was conducted from each stratum in accordance with the distribution of the strata.

Combined, we have approximately 86 million observations split almost evenly between the training and test sets. For computational convenience, we'll extract a *representative* sample from the *training* set for this stage of the analysis. And since the common features dataset extends the impressions dataset, we'll treat both as a single training set of 42.3 million observations. 
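
Proportional stratified sampling can be sketched with the standard library alone. This is an illustration rather than the procedure used in the project: the function name, the sampling fraction, and the synthetic population of (click, conversion) pairs are all made up, and the rule of keeping at least one row per stratum is an assumption made so that very rare strata survive.

```python
import random
from collections import defaultdict


def stratified_sample(rows, stratum_of, fraction, seed=None):
    """Draw the same fraction from every stratum so the sample keeps the population mix."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)

    sample = []
    for members in strata.values():
        # Assumption: keep at least one row so that rare strata are not lost.
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


# Synthetic (click, conversion) pairs; the counts are illustrative only.
population = [(0, 0)] * 9600 + [(1, 0)] * 390 + [(1, 1)] * 10
sample = stratified_sample(population, stratum_of=lambda row: row, fraction=0.1, seed=42)
print(len(sample), sample.count((1, 0)), sample.count((1, 1)))  # 1000 39 1
```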

Thus, we need to know how large a representative sample needs to be, assuming a margin of error of +/-5%. Restating the problem, we seek a sample for which the 100(1-$\alpha$)% confidence interval for the sample conversion rate contains the true population conversion rate with probability of at least 1-$\alpha$. With $\alpha = 0.05$, we have 95% confidence that the true conversion rate is contained inside the 95% confidence interval. 

Conversions are discrete events following a binomial distribution. Defining *representative* in terms of conversion rate, we seek a sample size for which the sample mean conversion rate and its variance approximate the associated mean and variance of the *population*, some 42 million observations in our impressions dataset, within some margin of error, say, 0.05%. Fortunately, the central limit theorem (CLT) provides a principled method for determining that size.
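
One common way to size a sample for a proportion uses the normal approximation to the binomial: $n_0 = z_{\alpha/2}^2 \, p(1-p) / e^2$, where $p$ is the anticipated proportion and $e$ the margin of error, optionally followed by a finite population correction $n = n_0 / (1 + (n_0 - 1)/N)$. The sketch below implements that textbook formula. It is not necessarily the calculation used for this dataset, the function name is hypothetical, and the inputs in the usage line ($p = 0.5$, the most conservative choice, with a 5% margin) are illustrative rather than the observed conversion rate.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(p, margin, confidence=0.95, population=None):
    """Sample size for estimating a proportion p to within +/- margin.

    Uses the normal approximation to the binomial. When population is given,
    the finite population correction is applied.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


n = sample_size_for_proportion(p=0.5, margin=0.05)
print(n)  # 385
```

Note that the required size grows with the inverse square of the margin, so tightening the margin tenfold multiplies the sample size by one hundred.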

### Core Data



In [None]:
# IMPORTS
import pandas as pd

In [None]:
impressions = "data/archive/production/raw/sample_skeleton_train.csv"
df = pd.read_csv(impressions, header=None, index_col=None)
df.loc[(df[1]==0) & (df[2]==0)].shape[0] / df.shape[0] * 100

In [None]:
df.head()

## Download Data
Downloading the data from our S3 instance will take approximately 15 minutes on a standard 40 Mbps internet line.

In [None]:
# %load -s S3Downloader deepcvr/data/download.py
class S3Downloader:
    """Download operator for Amazon S3 Resources

    Args:
        bucket (str): The name of the S3 bucket
        destination (str): Director to which all resources are to be downloaded
    """

    def __init__(self, bucket: str, destination: str, force: bool = False) -> None:
        self._bucket = bucket
        self._destination = destination
        self._force = force
        config = S3Config()
        self._s3 = boto3.client(
            "s3", aws_access_key_id=config.key, aws_secret_access_key=config.secret
        )
        self._progressbar = None

    def execute(self) -> None:

        object_keys = self._list_bucket_contents()

        for object_key in object_keys:
            destination = os.path.join(self._destination, object_key)
            if not os.path.exists(destination) or self._force:
                self._download(object_key, destination)
            else:
                logger.info(
                    "Bucket resource {} already exists and was not downloaded.".format(destination)
                )

    def _list_bucket_contents(self) -> list:
        """Returns a list of the keys of all objects in the designated bucket"""
        # Note: this resource uses boto3's default credential lookup, not the keys passed to self._s3
        s3 = boto3.resource("s3")
        bucket = s3.Bucket(self._bucket)
        return [obj.key for obj in bucket.objects.all()]

    def _download(self, object_key: str, destination: str) -> None:
        """Downloads the object designated by the object key to the destination"""

        response = self._s3.head_object(Bucket=self._bucket, Key=object_key)
        size = response["ContentLength"]

        self._progressbar = progressbar.progressbar.ProgressBar(maxval=size)
        self._progressbar.start()

        os.makedirs(os.path.dirname(destination), exist_ok=True)
        try:
            self._s3.download_file(
                self._bucket, object_key, destination, Callback=self._download_callback
            )
            logger.info("Download of {} Complete!".format(object_key))
        except NoCredentialsError:
            # NoCredentialsError accepts no positional message, so log the context and re-raise
            logger.error("Credentials not available for {} bucket".format(self._bucket))
            raise

    def _download_callback(self, size):
        self._progressbar.update(self._progressbar.currval + size)


In [None]:
downloader = S3Downloader(bucket=S3_BUCKET, destination=DIRECTORY_EXTERNAL)
downloader.execute()

## Extract Raw Data
Here, we extract the compressed files into a raw data directory.

In [None]:
# %load -s Extractor deepcvr/data/extract.py
import logging
import os
import tarfile
import tempfile

logger = logging.getLogger(__name__)


class Extractor:
    """Extracts the csv files from a (gzip-compressed) tar archive and stores the raw data

    Args:
        source (str): The filepath to the source archive to be extracted
        destination (str): The destination directory into which data shall be stored.
        force (bool): Forces extraction even when files already exist.
    """

    def __init__(self, source: str, destination: str, force: bool = False) -> None:

        self._source = source
        self._destination = destination
        self._force = force

    def execute(self) -> None:
        """Extracts the csv files from the archive and stores them in the destination directory."""
        logger.debug("\tSource: {}\tDestination: {}".format(self._source, self._destination))

        # Make sure the destination exists before listing its contents
        os.makedirs(self._destination, exist_ok=True)

        # If all 4 raw files exist, it is assumed that the data have already been extracted
        n_files = len(os.listdir(self._destination))
        if n_files < 4 or self._force:

            with tempfile.TemporaryDirectory() as tempdir:
                # csv members go to the destination; all other members go to tempdir and are discarded
                self._extract(source=self._source, destination=tempdir)

    def _extract(self, source: str, destination: str) -> None:
        """Extracts csv members to self._destination and any other members to destination"""

        logger.debug("\t\tOpening {}".format(source))
        # The context manager closes the archive once extraction is done
        with tarfile.open(source) as data:
            for member in data.getmembers():
                if self._is_csvfile(filename=member.name):
                    # Skip csv files that already exist unless force is True
                    if self._not_exists_or_force(member_name=member.name):
                        logger.debug("\t\tExtracting {} to {}".format(member.name, self._destination))
                        data.extract(member, self._destination)  # Extract to destination
                else:
                    logger.debug("\t\tExtracting {} to {}".format(member.name, destination))
                    data.extract(member, destination)  # Extract to temporary directory

    def _not_exists_or_force(self, member_name: str) -> bool:
        """Returns true if the file doesn't exist or force is True."""
        filepath = os.path.join(self._destination, member_name)
        return not os.path.exists(filepath) or self._force

    def _is_csvfile(self, filename: str) -> bool:
        """Returns True if filename is a csv file, returns False otherwise."""
        return filename.endswith(".csv")


In [None]:
extractor = Extractor(source=FILEPATH_EXTERNAL_TRAIN, destination=DIRECTORY_RAW)
extractor.execute()  # execute() returns None; the extracted files are listed below
os.listdir(DIRECTORY_RAW)
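
Before extracting a large archive, it can be useful to check which csv members it contains. The following is a minimal, standalone sketch using only the standard library; `csv_members` is a hypothetical helper that is not part of the `Extractor` class, and the tiny archive is built on the fly purely so the example runs on its own.

```python
import os
import tarfile
import tempfile


def csv_members(archive_path):
    """Return the names of the .csv members of a tar archive without extracting them."""
    with tarfile.open(archive_path) as archive:
        return [m.name for m in archive.getmembers() if m.isfile() and m.name.endswith(".csv")]


# Build a throwaway archive holding one csv file and one non-csv file.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("a.csv", "notes.txt"):
        with open(os.path.join(workdir, name), "w") as f:
            f.write("1,0,0\n")
    archive_path = os.path.join(workdir, "demo.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in ("a.csv", "notes.txt"):
            archive.add(os.path.join(workdir, name), arcname=name)
    found = csv_members(archive_path)

print(found)
```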

## Core Dataset Preprocessing
Let's take a preliminary look at the core training dataset.

### Core Raw Training Set

In [None]:
df = pd.read_csv(FILEPATH_RAW_TEST_CORE, header=None, index_col=[0], nrows=10000)
df.head()

Here we have: 

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Sample-id                              |
| 1      | Click Label                            |
| 2      | Conversion Label                       |
| 3      | Common Features Foreign Key            |
| 4      | Number of features in the feature list |
| 5      | Feature List                           |
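
Because the raw file has no header row, the fields in the table above can be attached as column names when the file is read. The sketch below uses the standard library's `csv.DictReader` on two made-up rows; the snake_case names and the sample values (including the feature list strings) are assumptions for illustration and do not reflect the real file contents. The same list of names could be passed to `pd.read_csv` through its `names` parameter.

```python
import csv
import io

# Names derived from the table above; the snake_case spellings are our own choice.
CORE_COLUMNS = [
    "sample_id",
    "click_label",
    "conversion_label",
    "common_features_key",
    "num_features",
    "feature_list",
]

# Two made-up rows standing in for the raw file.
raw = io.StringIO("1,0,0,abc,2,f1;f2\n2,1,1,abc,1,f3\n")

rows = list(csv.DictReader(raw, fieldnames=CORE_COLUMNS))
clicks = sum(int(r["click_label"]) for r in rows)
conversions = sum(int(r["conversion_label"]) for r in rows)
print(clicks / len(rows), conversions / len(rows))
```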


In [None]:
df = pd.read_csv(FILEPATH_RAW_TRAIN_COMMON, header=None, index_col=0, nrows=100)
df.head()

Here we have the common features, keyed by the value referenced in the core data:

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Common Features Key                    |
| 1      | Number of features in the feature list |
| 2      | Feature List                           |

# REMOVE-CELL
# References and Notes
Refer to https://www.netquest.com/blog/en/random-sampling-stratified-sampling for sampling techniques.