# Data preparation

The purpose of this notebook is to download the data (already imported and persisted in Azure Blob Storage) and prepare it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data preparation pipeline (ingestion, cleaning and featurization)](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure that the Python code responsible for the data ingestion, cleaning and featurization pipeline is in a stable state.
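pytest also reports the outcome of a run through its process exit code, which is useful when the check should be automated rather than read from the console. The sketch below maps the exit codes documented by pytest to short descriptions; the helper names `describe_pytest_exit` and `run_pytest` are hypothetical and not part of this project. Note that code 5 means no tests were collected, so a run without failures is not necessarily a run that verified anything.

```python
import subprocess
import sys

# Exit codes documented by pytest (exposed as the ExitCode enum in pytest >= 5.0).
PYTEST_EXIT_CODES = {
    0: "all collected tests passed",
    1: "some tests failed",
    2: "execution was interrupted by the user",
    3: "internal error while running tests",
    4: "pytest command line usage error",
    5: "no tests were collected",
}


def describe_pytest_exit(code: int) -> str:
    """Translate a pytest exit code into a human-readable message."""
    return PYTEST_EXIT_CODES.get(code, f"unknown exit code {code}")


def run_pytest() -> int:
    """Run pytest in the current working directory and return its exit code.

    Hypothetical helper: it starts the same command as `!python -m pytest`,
    using the interpreter that runs this notebook.
    """
    completed = subprocess.run([sys.executable, "-m", "pytest"])
    return completed.returncode
```

In the notebook this could be called as `print(describe_pytest_exit(run_pytest()))` after changing into the `bikeavailability` folder.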

In [2]:
# Make the 'bikeavailability' folder the current working directory
os.chdir('..')
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform darwin -- Python 3.6.9, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski, inifile: tox.ini
plugins: cov-2.8.1
collecting ... collected 0 items



## Data preparation pipeline

This step is implemented as a single pipeline that consists of the following steps:
* data ingestion (downloading the already imported data from Azure Blob Storage to a local destination),
* cleaning, with soft or hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.

In fact, we'll execute the pipeline twice. This way we'll get two output datasets: one with soft-deleted records and one with hard-deleted records.
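The difference between the two modes can be illustrated with a minimal sketch. It assumes that soft deletion keeps every record and only flags it, while hard deletion drops the flagged records; the actual pipeline code is not shown in this notebook, so the function `remove_records`, the flag `is_removed` and the field `duration_min` are hypothetical. Plain Python is used to keep the example self-contained.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def remove_records(
    records: List[Record],
    should_remove: Callable[[Record], bool],
    hard_delete: bool,
) -> List[Record]:
    """Apply a removal rule in soft or hard mode (illustrative only).

    Soft mode keeps all records and adds a boolean `is_removed` flag.
    Hard mode returns only the records that the rule does not match.
    """
    if hard_delete:
        return [dict(r) for r in records if not should_remove(r)]
    return [{**r, "is_removed": should_remove(r)} for r in records]


# Hypothetical sample data: a zero-minute rental is treated as invalid.
rentals = [
    {"id": 1, "duration_min": 0},
    {"id": 2, "duration_min": 15},
]


def is_invalid(record: Record) -> bool:
    return record["duration_min"] == 0


soft = remove_records(rentals, is_invalid, hard_delete=False)
hard = remove_records(rentals, is_invalid, hard_delete=True)

print(len(soft), len(hard))  # prints "2 1"
```

The soft variant is the one suited for inspection, because the flagged records remain available for review.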

In [4]:
# Go to source folder
os.chdir('src')

In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py --hard-delete=False --save=bike_availability_soft

ModuleNotFoundError: No module named 'bikeavailability.src.ingestion.bike_rental_data_downloader'


IPython CPU timings (estimated):
  User   :       0.39 s.
  System :       0.14 s.
Wall time:       0.45 s.
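
The run above stops with a `ModuleNotFoundError`. One possible cause, stated here as an assumption rather than a diagnosis, is that the directory containing the `bikeavailability` package is not on `sys.path` when the script is started from `src`. The sketch below shows a way to check whether a module can be located before running the pipeline; `is_importable` is a hypothetical helper.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located on the current sys.path.

    Note: for a dotted name, find_spec imports the parent packages as a
    side effect and raises ModuleNotFoundError if one of them is missing.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        return False


# Module name taken from the error message above.
module = "bikeavailability.src.ingestion.bike_rental_data_downloader"
if not is_importable(module):
    print(f"Cannot locate {module}")
    print("Directories searched:", sys.path)
```

If the module cannot be located, comparing the printed search path with the location of the package on disk is a reasonable first step.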


In [None]:
# run data loading and processing pipeline (with hard deleting)
#%run -t run_pipeline.py --hard-delete=True --save=bike_rentals

## Post-execution verification

In [None]:
# set up paths to created datasets
#filepath_soft = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_availability_soft.csv')
#filepath_hard = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals.csv')
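
Because the pipeline run may fail before anything is saved, it can help to confirm that the output files exist before loading them. The following is a minimal sketch using only the standard library; `find_missing` is a hypothetical helper and not part of the project code.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing(paths: Iterable[str]) -> List[Path]:
    """Return the paths that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]
```

With the paths defined in the cell above, `find_missing([filepath_soft, filepath_hard])` would return an empty list when both datasets were written.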

In [None]:
# load data
#bike_rentals_soft_df = pd.read_csv(filepath_soft)
#bike_rentals_hard_df = pd.read_csv(filepath_hard)

In [None]:
#print('Dataset with soft deleted records: ', bike_rentals_soft_df.shape)
#print('Dataset with hard deleted records: ', bike_rentals_hard_df.shape)