(section_22_data_acquisition)=
# Data Acquisition


For these datasets, we will implement the extract-load-transform (ELT) pattern, a variant of the well-established extract-transform-load (ETL) pattern that dates from the 1970s. Rather than transforming the data before loading it, we'll load the raw data into the database and perform the transformations there. Concretely, our data pipeline will:

1. Extract: Download the data from its source site to our raw data directory.
2. Load: Create a MySQL database and load the raw data.
3. Transform: Apply transformations inside the database. Their details are not yet determined.
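
The three steps can be sketched end to end as follows. This is a minimal illustration rather than the pipeline built in this section: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table names, column names, and rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract: header-less CSV rows of (sample_id, click, conversion).
RAW_CSV = "1,0,0\n2,1,0\n3,1,1\n"


def load_raw(conn: sqlite3.Connection, raw_csv: str) -> int:
    """Load: copy the raw rows into a staging table without changing them."""
    conn.execute("CREATE TABLE raw_samples (sample_id TEXT, click TEXT, conversion TEXT)")
    rows = list(csv.reader(io.StringIO(raw_csv)))
    conn.executemany("INSERT INTO raw_samples VALUES (?, ?, ?)", rows)
    return len(rows)


def transform(conn: sqlite3.Connection) -> None:
    """Transform: derive a typed table inside the database using SQL."""
    conn.execute(
        """
        CREATE TABLE samples AS
        SELECT CAST(sample_id AS INTEGER) AS sample_id,
               CAST(click AS INTEGER) AS click,
               CAST(conversion AS INTEGER) AS conversion
        FROM raw_samples
        """
    )


# sqlite3 (in memory) is an assumption standing in for the MySQL database.
conn = sqlite3.connect(":memory:")
n_loaded = load_raw(conn, RAW_CSV)
transform(conn)
n_clicks = conn.execute("SELECT SUM(click) FROM samples").fetchone()[0]
print(n_loaded, n_clicks)
```

Because the raw table is kept untouched, the transformation can be revised and rerun without repeating the extract and load steps.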

Once the data are loaded, we'll extract a sample that represents the distribution of labels in the full training dataset. The data pipeline will be operationalized so that the transformations applied to the training set can be applied to the test set without data leakage.
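
One way to draw a sample that preserves the label distribution is stratified sampling: group the rows by label and sample the same fraction from each group. The sketch below uses only the standard library; the function name `stratified_sample`, the row layout, and the synthetic data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, label_of, frac, seed=0):
    """Sample `frac` of the rows from each label group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))  # keep at least one row per label
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: (sample_id, click) with 90% negatives and 10% positives.
rows = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
sample = stratified_sample(rows, label_of=lambda r: r[1], frac=0.1)
counts = Counter(label for _, label in sample)
print(counts)
```

With these inputs the 10% sample keeps the 9:1 ratio of negatives to positives found in the full set.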

```{tip} Reproducibility
:class: dropdown
To run this notebook on the cloud, click on the shuttle icon in the external links at the top right of the page. This will launch the notebook on a Google Colab instance. The data may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408); however, we will be downloading the files from an Amazon AWS server. 
```
```{tip} Acquisition Imports
:class: dropdown 
To view source code in this notebook, click on the icon 'Click to show' to the right.
```

In [1]:
# Imports
import os
import sys
from IPython import display
import logging
import pandas as pd
import inspect
import tarfile
import shutil
import tempfile

In [2]:
# REMOVE-CELL
home = "/home/john/projects/deepcvr"
os.chdir(home)

## Extract
### Download Data
Here we download the compressed archive files. The AWS access credentials are stored in a configuration file, which parameterizes our download function.
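
The download code is not shown in this section, so the following is only a sketch of how a configuration file might parameterize it. It assumes the files live in an S3 bucket and that the configuration uses INI format; the section name, key names, bucket, and object key are all hypothetical. The `download` function requires the third-party `boto3` package and is defined but not called here.

```python
import configparser

# Hypothetical config layout; the section and key names are assumptions.
EXAMPLE_CONFIG = """
[aws]
access_key_id = EXAMPLE_KEY_ID
secret_access_key = EXAMPLE_SECRET
bucket = example-bucket
key = taobao_train.tar.gz
"""


def read_s3_settings(config_text: str, section: str = "aws") -> dict:
    """Parse the credentials and object location from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser[section])


def download(settings: dict, destination: str) -> None:
    """Download one S3 object to `destination` (requires the boto3 package)."""
    import boto3  # imported lazily so the parsing step runs without boto3

    client = boto3.client(
        "s3",
        aws_access_key_id=settings["access_key_id"],
        aws_secret_access_key=settings["secret_access_key"],
    )
    client.download_file(settings["bucket"], settings["key"], destination)


settings = read_s3_settings(EXAMPLE_CONFIG)
print(sorted(settings))
```

Keeping the credentials in a file that is excluded from version control avoids hard-coding secrets in the notebook.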


### Extract Raw Data
The files have been downloaded to our project's external data directory, `data/external`. Next, we'll extract the CSV files from the archive and store the raw data in our raw directory, `data/raw`. The following code extracts the data and returns the filepaths.

In [3]:
# %load -s Extractor deepcvr/data/extract.py
class Extractor:
    """Decompresses a tar archive (including nested archives), stores the raw data, and returns the filepaths.

    The source filepath and destination directory are passed to `execute`.
    """

    def __init__(self) -> None:

        self._source = None
        self._destination = None

    def execute(self, source: str, destination: str) -> dict:
        """Extracts and stores the data, then returns the filepaths.

        Args:
            source (str): The source filepath
            destination (str): The destination directory

        Returns:
            dict: Maps each file's base name (without extension) to its filepath
        """

        self._source = source
        self._destination = destination

        if not self._exists():
            # Recursively extract data and store in destination directory
            self._extract(self._source)

        # Extract filepaths for all data downloaded and extracted
        filepaths = self._get_filepaths()

        return filepaths

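    def _exists(self) -> bool:
        """Returns True if the destination directory exists and is not empty, False otherwise."""

        # os.listdir raises FileNotFoundError on a missing directory, so check for it first
        return os.path.isdir(self._destination) and len(os.listdir(self._destination)) > 0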

    def _extract(self, filepath: str) -> None:
        """Recursively extracts nested archives and saves the extracted files to the destination."""

        if tarfile.is_tarfile(filepath):
            with tempfile.TemporaryDirectory() as tempdirname:
                with tarfile.open(filepath) as data:
                    for member in data.getmembers():
                        # Skip directory entries; only regular files are extracted
                        if not member.isfile():
                            continue
                        data.extract(member, tempdirname)
                        # Recurse in case the member is itself an archive.
                        # No return here, so every member is processed.
                        self._extract(os.path.join(tempdirname, member.name))
        else:
            self._savefile(filepath)

    def _savefile(self, filepath: str) -> None:
        """Moves the extracted file to the destination directory.

        Args:
            filepath (str): the path to the extracted file in the temp directory
        """

        # Create destination filepath and move the file
        destination = os.path.join(self._destination, os.path.basename(filepath))
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.move(filepath, destination)

    def _get_filepaths(self) -> dict:
        """Creates a dictionary of destination file paths."""
        filepaths = {}

        filenames = os.listdir(self._destination)
        if len(filenames) > 0:
            for filename in filenames:
                filepath = os.path.join(self._destination, filename)
                name = os.path.splitext(filename)[0]
                filepaths[name] = filepath
        else:
            msg = "Destination directory is empty"
            raise FileNotFoundError(msg)

        return filepaths


In [4]:
source = "data/external/taobao_train.tar.gz"
destination = 'data/raw'
extractor = Extractor()
filenames = extractor.execute(source=source, destination=destination)
os.listdir(destination)

['sample_skeleton_train.csv', 'common_features_train.csv']
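
The recursive extraction can be checked in isolation on a small synthetic archive. The sketch below is a standalone function, not the `Extractor` class; the function name `extract_nested`, the file names, and the archive layout are hypothetical.

```python
import os
import shutil
import tarfile
import tempfile


def extract_nested(archive: str, destination: str) -> list:
    """Unpack `archive`, descend into nested tar files, and move leaf files to `destination`."""
    os.makedirs(destination, exist_ok=True)
    # Extraction filters exist only in newer Python versions; use one when available.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(archive) as tar:
            tar.extractall(workdir, **kwargs)
        for root, _, files in os.walk(workdir):
            for name in files:
                path = os.path.join(root, name)
                if tarfile.is_tarfile(path):
                    extract_nested(path, destination)  # nested archive
                else:
                    shutil.move(path, os.path.join(destination, name))
    return sorted(os.listdir(destination))


def make_tar(path: str, members: list) -> None:
    """Create a gzip-compressed tar at `path` containing `members`."""
    with tarfile.open(path, "w:gz") as tar:
        for member in members:
            tar.add(member, arcname=os.path.basename(member))


with tempfile.TemporaryDirectory() as tmp:
    # Build outer.tar.gz containing b.csv and inner.tar.gz, which contains a.csv.
    for name in ("a.csv", "b.csv"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x,y\n1,2\n")
    inner = os.path.join(tmp, "inner.tar.gz")
    make_tar(inner, [os.path.join(tmp, "a.csv")])
    outer = os.path.join(tmp, "outer.tar.gz")
    make_tar(outer, [inner, os.path.join(tmp, "b.csv")])
    extracted = extract_nested(outer, os.path.join(tmp, "raw"))

print(extracted)
```

Both CSV files should end up in the destination directory, including the one that was wrapped in the inner archive.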

In [8]:
# Build the paths by name; os.listdir() returns entries in arbitrary order
core_train_raw_filepath = os.path.join(destination, "sample_skeleton_train.csv")
common_features_train_raw_filepath = os.path.join(destination, "common_features_train.csv")
core_train_raw_sample = pd.read_csv(core_train_raw_filepath, nrows=1000, header=None, index_col=0)
core_train_raw_sample.head()


| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 0 | 0 | bacff91692951881 | 9 | 21090522181.021090645531.021090934451.... |
| 2 | 0 | 0 | bacff91692951881 | 10 | 21091097321.021090462841.021090990351.... |
| 3 | 1 | 0 | bacff91692951881 | 20 | 21090897311.021090475601.050995117692.... |
| 4 | 0 | 0 | bacff91692951881 | 13 | 30193516651.021090503641.021090833881.... |
| 5 | 0 | 0 | bacff91692951881 | 9 | 20549456631.030193516651.021691721791.... |


# REMOVE-CELL
# References and Notes
Refer to  https://www.netquest.com/blog/en/random-sampling-stratified-sampling for sampling techniques