(section_22_data_build)=
# Data Build


In [1]:
# REMOVE-CELL
import os

home = "/home/john/projects/DeepCVR/"
os.chdir(home)

The data pipeline constructed in this section downloads, extracts, merges, and loads the data that will be used to evaluate state-of-the-art conversion rate prediction algorithms throughout the rest of this series. Our dataset, the Ali-CCP (Alibaba Click and Conversion Prediction) dataset, may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408) after registering with the website. Alternatively, the data may be downloaded from an Amazon S3 bucket. For reproducibility, this notebook obtains all artifacts from Amazon S3.

Our data pipeline is summarized as follows:

| # | Step                                              | Destination   |
|---|---------------------------------------------------|---------------|
| 1 | Download the data from an Amazon S3 bucket        | data/external |
| 2 | Extract and secure the raw training and test sets | data/raw      |
| 3 | Merge the core and common features datasets       | data/staged   |
| 4 | Subsample the data for profiling and analysis     | data/sample   |
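
The destination directories in the table can be created up front so that each step has somewhere to write. The following is a minimal sketch, assuming the paths are relative to the project root. The `ensure_directories` helper is hypothetical and is not part of the `deepcvr` package.

```python
import os

# Destination directories from the pipeline summary table
PIPELINE_DIRECTORIES = ["data/external", "data/raw", "data/staged", "data/sample"]


def ensure_directories(base: str = ".") -> list:
    """Creates each pipeline directory under base and returns their paths."""
    paths = []
    for directory in PIPELINE_DIRECTORIES:
        path = os.path.join(base, directory)
        os.makedirs(path, exist_ok=True)  # No error if the directory already exists
        paths.append(path)
    return paths
```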

```{tip} Reproducibility
:class: dropdown
To run this notebook on the cloud, click on the shuttle icon in the external links at the top right of the page. This will launch the notebook on a Google Colab instance.
```
```{tip} Viewing Source Code
:class: dropdown 
Some of the source code is hidden by default. To view source code in this notebook, click on the icon 'Click to show' to the right.
```

In [12]:
# Constants 
DIRECTORY_EXTERNAL = "data/external"
DIRECTORY_RAW = 'data/raw'
DIRECTORY_STAGED = 'data/staged'
DIRECTORY_SAMPLE = 'data/sample'
S3_BUCKET = 'deepcvr-data'
FILENAME_CORE_TRAIN = "sample_skeleton_train.csv"
FILENAME_COMMON_FEATURES_TRAIN = "common_features_train.csv"

In [24]:
# Imports
import logging
import os
import tarfile
import tempfile

# Set the numexpr thread limits before numexpr is imported
os.environ['NUMEXPR_MAX_THREADS'] = '24'
os.environ['NUMEXPR_NUM_THREADS'] = '16'

import boto3
import numexpr as ne
import numpy as np
import pandas as pd
import progressbar
from botocore.exceptions import NoCredentialsError
from deepcvr.utils.config import S3Config

# Display options for inspection
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)

# Logging objects
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## Download Data
Downloading the data from our S3 bucket takes approximately 15 minutes on a 40 Mbps internet connection.
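
The 15 minute figure can be sanity-checked with simple arithmetic. The sketch below assumes the connection sustains its full 40 Mbps and uses decimal units (1 GB = 1000 MB). The implied transfer size of roughly 4.5 GB is derived from the estimate above; it is not a measured size of the archives. Both function names are hypothetical.

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Estimated transfer time in minutes for size_gb gigabytes at mbps megabits per second."""
    megabits = size_gb * 1000 * 8  # 1 GB = 1000 MB, 1 MB = 8 megabits
    return megabits / mbps / 60


def implied_size_gb(minutes: float, mbps: float) -> float:
    """Transfer size in gigabytes implied by a download time and a line rate."""
    megabits = minutes * 60 * mbps
    return megabits / 8 / 1000


print(implied_size_gb(minutes=15, mbps=40))  # 4.5
print(download_minutes(size_gb=4.5, mbps=40))  # 15.0
```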

In [4]:
# %load -s S3Downloader deepcvr/data/download.py
class S3Downloader:
    """Download operator for Amazon S3 resources

    Args:
        bucket (str): The name of the S3 bucket
        destination (str): Directory to which all resources are to be downloaded
        force (bool): Forces the download even when the file already exists.
    """

    def __init__(self, bucket: str, destination: str, force: bool = False) -> None:
        self._bucket = bucket
        self._destination = destination
        self._force = force
        config = S3Config()
        self._s3 = boto3.client(
            "s3", aws_access_key_id=config.key, aws_secret_access_key=config.secret
        )
        self._progressbar = None

    def execute(self) -> None:

        object_keys = self._list_bucket_contents()

        for object_key in object_keys:
            destination = os.path.join(self._destination, object_key)
            if not os.path.exists(destination) or self._force:
                self._download(object_key, destination)
            else:
                logger.info(
                    "Bucket resource {} already exists and was not downloaded.".format(destination)
                )

    def _list_bucket_contents(self) -> list:
        """Returns a list of objects in the designated bucket"""
        objects = []
        s3 = boto3.resource("s3")
        bucket = s3.Bucket(self._bucket)
        for obj in bucket.objects.all():
            objects.append(obj.key)
        return objects

    def _download(self, object_key: str, destination: str) -> None:
        """Downloads the object designated by the object key to the destination filepath"""

        response = self._s3.head_object(Bucket=self._bucket, Key=object_key)
        size = response["ContentLength"]

        self._progressbar = progressbar.progressbar.ProgressBar(maxval=size)
        self._progressbar.start()

        os.makedirs(os.path.dirname(destination), exist_ok=True)
        try:
            self._s3.download_file(
                self._bucket, object_key, destination, Callback=self._download_callback
            )
            logger.info("Download of {} Complete!".format(object_key))
        except NoCredentialsError:
            # Log the context, then re-raise the original exception
            logger.error("Credentials not available for {} bucket".format(self._bucket))
            raise

    def _download_callback(self, size):
        self._progressbar.update(self._progressbar.currval + size)


In [5]:
downloader = S3Downloader(bucket=S3_BUCKET, destination=DIRECTORY_EXTERNAL)
downloader.execute()

INFO:botocore.credentials:Credentials found in config file: ~/.aws/config
INFO:__main__:Bucket resource data/external/taobao_test.tar.gz already exists and was not downloaded.
INFO:__main__:Bucket resource data/external/taobao_train.tar.gz already exists and was not downloaded.
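
The log output above shows both archives being skipped because they already exist locally. The decision can be sketched independently of S3 as follows. `plan_downloads` is a hypothetical helper, not part of `S3Downloader`, and it assumes each object key maps directly to a filename under the destination directory.

```python
import os


def plan_downloads(object_keys: list, destination: str, force: bool = False) -> list:
    """Returns the object keys that still need to be downloaded."""
    pending = []
    for object_key in object_keys:
        filepath = os.path.join(destination, object_key)
        # Download when the local copy is missing or a refresh is forced
        if force or not os.path.exists(filepath):
            pending.append(object_key)
    return pending
```

With `force=False`, re-running the download cell is cheap: only missing files are fetched.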


## Extract Raw Data
Here, we extract the CSV files from the compressed archives into the raw data directory.

In [9]:
# %load -s Extractor deepcvr/data/extract.py
class Extractor:
    """Extracts the CSV files from a compressed tar archive and stores the raw data

    Args:
        source (str): The filepath to the source file to be decompressed
        destination (str): The destination directory into which data shall be stored.
        force (bool): Forces extraction even when files already exist.
    """

    def __init__(self, source: str, destination: str, force: bool = False) -> None:

        self._source = source
        self._destination = destination
        self._force = force

    def execute(self) -> None:
        """Extracts the CSV files from the source archive into the destination directory."""
        logger.debug("\tSource: {}\tDestination: {}".format(self._source, self._destination))

        with tempfile.TemporaryDirectory() as tempdir:
            # CSV files go to the destination directory; all other members go to
            # the temporary directory, which is deleted when this block exits
            self._extract(source=self._source, destination=tempdir)

    def _extract(self, source: str, destination: str) -> None:
        """Extracts CSV members to self._destination and all other members to destination."""

        logger.debug("\t\tOpening {}".format(source))
        # The context manager closes the archive once extraction is complete
        with tarfile.open(source) as data:
            for member in data.getmembers():
                if self._is_csvfile(filename=member.name):
                    if self._not_exists_or_force(member_name=member.name):
                        logger.debug(
                            "\t\tExtracting {} to {}".format(member.name, self._destination)
                        )
                        data.extract(member, self._destination)  # Extract to destination
                    # Otherwise the csv file already exists and force is False
                else:
                    logger.debug("\t\tExtracting {} to {}".format(member.name, destination))
                    data.extract(member, destination)  # Extract to temporary directory

    def _not_exists_or_force(self, member_name: str) -> bool:
        """Returns true if the file doesn't exist or force is True."""
        filepath = os.path.join(self._destination, member_name)
        return not os.path.exists(filepath) or self._force

    def _is_csvfile(self, filename: str) -> bool:
        """Returns True if filename is a csv file, returns False otherwise."""
        return ".csv" in filename


In [10]:
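# Extract each downloaded archive into the raw data directory
for filename in sorted(os.listdir(DIRECTORY_EXTERNAL)):
    source = os.path.join(DIRECTORY_EXTERNAL, filename)
    extractor = Extractor(source=source, destination=DIRECTORY_RAW)
    extractor.execute()
os.listdir(DIRECTORY_RAW)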

['sample_skeleton_train.csv',
 'sample_skeleton_test.csv',
 'common_features_test.csv',
 'common_features_train.csv']
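
The core of the extraction step can be sketched with the standard library alone. This is a simplified illustration, not the `Extractor` class itself: it skips non-CSV members instead of extracting them to a temporary directory, and it matches on the `.csv` suffix. The `extract_csv_members` name is hypothetical.

```python
import os
import tarfile


def extract_csv_members(source: str, destination: str, force: bool = False) -> list:
    """Extracts CSV members of a tar archive into destination and returns their names."""
    extracted = []
    with tarfile.open(source) as archive:  # Compression is detected automatically
        for member in archive.getmembers():
            if not member.name.endswith(".csv"):
                continue  # Skip anything that is not a CSV file
            filepath = os.path.join(destination, member.name)
            # Extract when the file is missing or extraction is forced
            if force or not os.path.exists(filepath):
                archive.extract(member, destination)
                extracted.append(member.name)
    return extracted
```

As with the download step, a second call with `force=False` extracts nothing because the files already exist.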

## Core Dataset Preprocessing
Let's take a preliminary look at the core training dataset.
### Core Raw Training Set

In [25]:
raw_train_core_filepath = os.path.join(DIRECTORY_RAW,FILENAME_CORE_TRAIN)
df = pd.read_csv(raw_train_core_filepath, header=None, index_col=0, nrows=100)
df.head()

| 0 | 1 | 2 | 3                | 4  | 5                                                                                    |
|---|---|---|------------------|----|--------------------------------------------------------------------------------------|
| 1 | 0 | 0 | bacff91692951881 | 9  | 21090522181.021090645531.021090934451.021691547801.030193516651.020541862221.0...    |
| 2 | 0 | 0 | bacff91692951881 | 10 | 21091097321.021090462841.021090990351.021090465401.021091041071.021691887571.0...    |
| 3 | 1 | 0 | bacff91692951881 | 20 | 21090897311.021090475601.050995117692.3025950893548375.1416670298677624.02535853...  |
| 4 | 0 | 0 | bacff91692951881 | 13 | 30193516651.021090503641.021090833881.021091042291.021090353951.050893544482.39...   |
| 5 | 0 | 0 | bacff91692951881 | 9  | 20549456631.030193516651.021691721791.021090283891.021090622721.021090305541.0...    |


Here we have: 

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Sample-id                              |
| 1      | Click Label                            |
| 2      | Conversion Label                       |
| 3      | Common Features Foreign Key            |
| 4      | Number of features in the feature list |
| 5      | Feature List                           |
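
The feature list appears as an unbroken run of characters in the output above, which suggests that its separators are non-printing control characters. The sketch below assumes the layout given in the public Ali-CCP dataset description: features separated by `\x01`, with each feature written as a field id, `\x02`, a feature id, `\x03`, and a value. These separators are an assumption and should be verified against the raw file. `parse_feature_list` is a hypothetical helper and the example string uses illustrative values.

```python
FEATURE_SEP = "\x01"  # Assumed separator between features
FIELD_SEP = "\x02"    # Assumed separator between field id and feature id
VALUE_SEP = "\x03"    # Assumed separator between feature id and value


def parse_feature_list(feature_list: str) -> list:
    """Parses a feature list string into (field_id, feature_id, value) tuples."""
    features = []
    for entry in feature_list.split(FEATURE_SEP):
        if not entry:
            continue  # Ignore empty entries, e.g. from a trailing separator
        field_id, rest = entry.split(FIELD_SEP)
        feature_id, value = rest.split(VALUE_SEP)
        features.append((field_id, int(feature_id), float(value)))
    return features


# Illustrative input with two features
example = "210\x029052218\x031.0\x01127_14\x023716224\x031.94591"
parsed = parse_feature_list(example)
```

Field ids are kept as strings because some of them, such as `127_14`, are not purely numeric.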


In [26]:
raw_train_common_features_filepath = os.path.join(DIRECTORY_RAW,FILENAME_COMMON_FEATURES_TRAIN)
df = pd.read_csv(raw_train_common_features_filepath, header=None, index_col=0, nrows=100)
df.head()

| 0                | 1   | 2                                                                                     |
|------------------|-----|---------------------------------------------------------------------------------------|
| 84dceed2e3a667f8 | 343 | 101313191.012534387741.012634387791.012734387821.012838648851.012938648881.015...     |
| 0000350f0c2121e7 | 811 | 127_1437162241.94591127_1435146270.69315127_1437728710.69315127_1435432831.60944127_... |
| 000091a89d1867ab | 7   | 12534387731.012434387691.012234387611.012134386581.012938648891.012838648851.0...     |
| 0001a4114b0ae8bf | 231 | 150_1439166842.3979150_1439407981.07056150_1438923681.6259150_1439146340.55962150_14... |
| 0001def19d7cb335 | 964 | 150_1439091500.84715150_1439330134.44265150_1439340833.3322150_1438742584.09988150_1... |


Here we have: 

| Column | Field                                  |
|--------|----------------------------------------|
| 0      | Common Features Key                    |
| 1      | Number of features in the feature list |
| 2      | Feature List                           |
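
Step 3 of the pipeline merges the two datasets. The following is a minimal sketch on toy data, assuming a many-to-one left join from the core dataset's common features foreign key to the matching key in the common features dataset. The column names and key values are hypothetical, chosen for illustration only.

```python
import pandas as pd

# Toy core dataset: several samples may share one common features key
core = pd.DataFrame({
    "sample_id": [1, 2, 3],
    "click_label": [0, 0, 1],
    "conversion_label": [0, 0, 0],
    "common_features_key": ["k1", "k1", "k2"],
})

# Toy common features dataset: one row per key
common = pd.DataFrame({
    "common_features_key": ["k1", "k2"],
    "num_common_features": [343, 811],
})

# A left join keeps every core sample; validate raises if a key repeats in common
staged = core.merge(
    common, on="common_features_key", how="left", validate="many_to_one"
)
```

The `validate` argument guards against accidental row duplication if the common features keys turn out not to be unique.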
