(section_22_data_acquisition)=
# Data Acquisition


This section describes the methods used to acquire the data. Our data pipeline will:

1. Download the data from its source site, 
2. Extract and secure the raw training and test set data,  
3. Merge the core and common features datasets,  
4. Store the data in a MySQL database, then
5. Subsample the data for profiling and analysis.

## Reproducibility
To run this notebook in the cloud, click the shuttle icon in the external links at the top right of the page. This launches the notebook on a Google Colab instance. The data may be obtained from [Alibaba's Tianchi website](https://tianchi.aliyun.com/dataset/dataDetail?dataId=408); however, we will download the files from an Amazon Web Services (AWS) server.

```{tip} Acquisition Imports
:class: dropdown 
To view source code in this notebook, click on the icon 'Click to show' to the right.
```

In [1]:
# Imports
import os
import sys
from IPython import display
import logging
import inspect
import tarfile
import shutil
import tempfile

In [2]:
# REMOVE-CELL
home = "/home/john/projects/deepcvr"
os.chdir(home)

## Extract
### Download Data
Here we download the data as compressed archive files. The AWS access credentials are stored in a configuration file, which parameterizes our download function.
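
The download code itself is not shown in this section, so the following is only a minimal sketch of how a configuration file might parameterize a download function. The INI layout, the `aws` section name, the `bucket` and `key` options, and both function names are hypothetical. The sketch also assumes the files live in S3-style object storage; the `client` argument is any object exposing `download_file(bucket, key, filename)`, which matches the boto3 S3 client method of that name.

```python
import configparser
import os


def load_download_config(config_path: str, section: str = "aws") -> dict:
    """Reads (hypothetical) download settings from an INI-style configuration file."""
    parser = configparser.ConfigParser()
    # ConfigParser.read returns the list of files it parsed; empty means not found
    if not parser.read(config_path):
        raise FileNotFoundError(f"Configuration file not found: {config_path}")
    return dict(parser[section])


def download(config: dict, destination: str, client, force: bool = False) -> str:
    """Downloads one object into the destination directory and returns its filepath.

    Assumption: `client` exposes download_file(bucket, key, filename).
    The download is skipped if the file already exists, unless force is True.
    """
    os.makedirs(destination, exist_ok=True)
    filepath = os.path.join(destination, os.path.basename(config["key"]))
    if force or not os.path.exists(filepath):
        client.download_file(config["bucket"], config["key"], filepath)
    return filepath
```

If the credentials were stored in the same file under the (again hypothetical) options `aws_access_key_id` and `aws_secret_access_key`, a client could be built with `boto3.client("s3", aws_access_key_id=..., aws_secret_access_key=...)` and passed to `download` together with the `data/external` directory. Skipping files that already exist keeps repeated notebook runs from downloading the archives again.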


### Extract Raw Data
The files have been downloaded to our project's external data directory, `data/external`. Next, we extract the CSV files from the archive and secure the raw data in our raw directory, `data/raw`. The following code extracts the data and returns the filepaths.

In [1]:
# %load -s Extractor deepcvr/data/extract.py
class Extractor:
    """Extracts a gzip-compressed tar archive, stores the raw data and returns the filepaths

    Args:
        source (str): The source filepath
        destination (str): The destination directory
    """

    def __init__(self) -> None:

        self._source = None
        self._destination = None

    def execute(self, source: str, destination: str) -> dict:
        """Extracts and stores the data, then returns a dictionary of filepaths."""

        self._source = source
        self._destination = destination

        if not self._exists():
            # Recursively extract data and store in destination directory
            self._extract(self._source)

        # Extract filepaths for all data downloaded and extracted
        filepaths = self._get_filepaths()

        return filepaths

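    def _exists(self) -> bool:
        """Returns True if the destination directory exists and is not empty. False otherwise."""

        # os.listdir raises FileNotFoundError on a missing directory, so check for it first
        return os.path.isdir(self._destination) and len(os.listdir(self._destination)) > 0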

    def _extract(self, filepath: str) -> None:
        """Recursively extracts the archive and saves its files to the destination."""

        if tarfile.is_tarfile(filepath):
            with tempfile.TemporaryDirectory() as tempdirname:
                with tarfile.open(filepath) as data:
                    for member in data.getmembers():
                        # Skip directories and special files; only regular files are extracted
                        if not member.isfile():
                            continue
                        data.extract(member, tempdirname)
                        # Recurse, since the extracted member may itself be an archive.
                        # No early return here, so every member is processed.
                        self._extract(os.path.join(tempdirname, member.name))
        else:
            self._savefile(filepath)

    def _savefile(self, filepath: str) -> None:
        """Moves an extracted file to the destination directory

        Args:
            filepath (str): the path to the extracted file in the temp directory
        """

        # Create destination filepath and move the file
        destination = os.path.join(self._destination, os.path.basename(filepath))
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.move(filepath, destination)

    def _get_filepaths(self) -> dict:
        """Creates a dictionary of destination file paths."""
        filepaths = {}

        filenames = os.listdir(self._destination)
        if len(filenames) > 0:
            for filename in filenames:
                filepath = os.path.join(self._destination, filename)
                name = os.path.splitext(filename)[0]
                filepaths[name] = filepath
        else:
            msg = "Destination directory is empty"
            raise FileNotFoundError(msg)

        return filepaths


In [None]:
source = "data/external/taobao_train.tar.gz"
destination = "data/raw"
extractor = Extractor()
# execute returns a dictionary mapping each file's base name to its filepath
filepaths = extractor.execute(source=source, destination=destination)
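
Once extraction finishes, a quick sanity check on the returned dictionary can catch an incomplete or empty extraction before the data is merged and loaded. The helper below is a sketch only; `summarize_files` is a hypothetical name and not part of the `deepcvr` package.

```python
import os


def summarize_files(filepaths: dict) -> dict:
    """Returns {name: size in bytes} for each file, failing fast if a file is missing."""
    summary = {}
    for name, filepath in filepaths.items():
        if not os.path.isfile(filepath):
            raise FileNotFoundError(f"Expected extracted file is missing: {filepath}")
        summary[name] = os.path.getsize(filepath)
    return summary
```

Passing it the dictionary returned by `Extractor.execute` gives a name-to-size mapping that can be printed or logged. A size of zero bytes would suggest that an archive member was not extracted correctly.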
