# Data preparation

This notebook downloads the data (previously imported and persisted in Azure Blob Storage) and prepares it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data ingestion, cleaning and featurization pipeline](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure that the Python code responsible for the data ingestion, cleaning and featurization pipeline is in a stable state.

In [2]:
# Make the 'bikerentals' folder the current directory
%cd ..

/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikerentals


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform darwin -- Python 3.6.9, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski, inifile: tox.ini
plugins: cov-2.8.1
collected 28 items

src/tests/cleaning/test_extract_gps_from_station_name.py ..              [  7%]
src/tests/cleaning/test_pipeline.py ....                                 [ 21%]
src/tests/cleaning/test_remove_missing_gps.py ..                         [ 28%]
src/tests/cleaning/test_remove_same_location.py ...                      [ 39%]
src/tests/features/test_day_of_week.py ..                                [ 46%]
src/tests/features/test_distance.py ..                                   [ 53%]
src/tests/features/test_holidays.py ...
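
The test files themselves are not shown in this notebook. As an illustration only, the sketch below shows the kind of check a file such as `test_distance.py` might contain. It assumes the distance feature is a great-circle (haversine) distance between two GPS points; `haversine_km` and both test names are hypothetical and are not taken from the project.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres (hypothetical helper, not the project's code)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# pytest collects functions whose names start with 'test_'
def test_same_point_has_zero_distance():
    assert haversine_km(51.11, 17.03, 51.11, 17.03) == 0.0


def test_one_degree_of_latitude_is_about_111_km():
    assert math.isclose(haversine_km(0.0, 0.0, 1.0, 0.0), 111.19, abs_tol=0.01)
```

Saved in a file whose name starts with `test_`, these functions would be picked up by the `python -m pytest` call above.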

## Data preparation pipeline

This step is implemented as a full pipeline that consists of the following steps:
* data ingestion (downloading the already imported data from Azure Blob Storage to a local destination),
* cleaning, with soft or hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.

We'll execute the pipeline twice to obtain two output datasets: one in which records are soft-deleted (kept for inspection), and one in which they are hard-deleted (removed).
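
The difference between the two modes can be sketched with pandas. This is a minimal illustration, not the code in `run_pipeline.py`: the `remove_records` function, the `is_deleted` flag column and the station column names are all hypothetical.

```python
import pandas as pd


def remove_records(df, mask, hard_delete):
    """Drop the rows selected by `mask` (hard delete) or only flag them (soft delete)."""
    if hard_delete:
        # Hard delete: the rows are gone from the output
        return df.loc[~mask].reset_index(drop=True)
    # Soft delete: every row is kept, with a flag column added (hypothetical name)
    return df.assign(is_deleted=mask)


# Toy data: rentals returned to the station they started from are to be removed
rentals = pd.DataFrame({
    "rental_station": ["A", "B", "C"],
    "return_station": ["A", "C", "C"],
})
same_location = rentals["rental_station"] == rentals["return_station"]

soft = remove_records(rentals, same_location, hard_delete=False)
hard = remove_records(rentals, same_location, hard_delete=True)

print(soft.shape)  # (3, 3)
print(hard.shape)  # (1, 2)
```

In this sketch the soft-deleted output keeps all rows and gains one column, while the hard-deleted output has fewer rows and no flag column.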

In [4]:
# Go to source folder
%cd src

/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikerentals/src


In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py --hard-delete=False --save=bike_rentals_soft

Script execution started
Root folder set to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikerentals
Pipeline execution about to start!
**** DataPreparationPipeline stage - start ****
    **** DataIngestion stage - start ****
Client-Request-ID=fded5ba4-0ca3-11ea-8152-4c327596358d Outgoing request: Method=GET, Path=/bike-rentals, Query={'restype': 'container', 'comp': 'list', 'prefix': None, 'delimiter': None, 'marker': None, 'maxresults': None, 'include': None, 'timeout': None}, Headers={'x-ms-version': '2018-03-28', 'User-Agent': 'Azure-Storage/1.4.2-1.5.0 (Python CPython 3.6.9; Darwin 19.0.0)', 'x-ms-client-request-id': 'fded5ba4-0ca3-11ea-8152-4c327596358d', 'x-ms-date': 'Thu, 21 Nov 2019 21:15:06 GMT', 'Authorization': 'REDACTED'}.
Client-Request-ID=fded5ba4-0ca3-11ea-8152-4c327596358d Receiving Response: Server-Timestamp=Thu, 21 Nov 2019 21:15:06 GMT, Server-Request-ID=75231c75-201e-0040-44b0-a04b43000000, HTTP Status Code=200, Message=OK, Headers={'transfer-encoding': 'c


IPython CPU timings (estimated):
  User   :      99.95 s.
  System :       2.51 s.
Wall time:     104.01 s.


In [6]:
# run data loading and processing pipeline (with hard deleting)
%run -t run_pipeline.py --hard-delete=True --save=bike_rentals

Script execution started
Root folder set to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikerentals
Pipeline execution about to start!
**** DataPreparationPipeline stage - start ****
    **** DataIngestion stage - start ****
Client-Request-ID=3b89a9e2-0ca4-11ea-b5f4-4c327596358d Outgoing request: Method=GET, Path=/bike-rentals, Query={'restype': 'container', 'comp': 'list', 'prefix': None, 'delimiter': None, 'marker': None, 'maxresults': None, 'include': None, 'timeout': None}, Headers={'x-ms-version': '2018-03-28', 'User-Agent': 'Azure-Storage/1.4.2-1.5.0 (Python CPython 3.6.9; Darwin 19.0.0)', 'x-ms-client-request-id': '3b89a9e2-0ca4-11ea-b5f4-4c327596358d', 'x-ms-date': 'Thu, 21 Nov 2019 21:16:50 GMT', 'Authorization': 'REDACTED'}.
Client-Request-ID=3b89a9e2-0ca4-11ea-b5f4-4c327596358d Receiving Response: Server-Timestamp=Thu, 21 Nov 2019 21:16:49 GMT, Server-Request-ID=fce034a7-101e-0126-09b0-a0fa6e000000, HTTP Status Code=200, Message=OK, Headers={'transfer-encoding': 'c


IPython CPU timings (estimated):
  User   :      85.94 s.
  System :       2.15 s.
Wall time:      89.92 s.


## Post-execution verification

In [7]:
# set up paths to created datasets
filepath_soft = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals_soft.csv')
filepath_hard = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals.csv')

In [8]:
# load data
bike_rentals_soft_df = pd.read_csv(filepath_soft)
bike_rentals_hard_df = pd.read_csv(filepath_hard)

In [9]:
print('Dataset with soft deleted records: ', bike_rentals_soft_df.shape)
print('Dataset with hard deleted records: ', bike_rentals_hard_df.shape)

Dataset with soft deleted records:  (480640, 18)
Dataset with hard deleted records:  (345999, 17)
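
The shapes above imply that hard deletion removed 480640 - 345999 = 134641 rows and that the soft-deleted dataset carries one extra column. The sketch below shows how such a comparison could be automated. `compare_outputs` and the `is_deleted` column are hypothetical names, and the toy frames stand in for the real CSV files; which column is actually extra is not shown in this notebook.

```python
import pandas as pd


def compare_outputs(soft_df, hard_df):
    """Summarise how the soft-deleted and hard-deleted datasets differ."""
    return {
        "rows_removed": len(soft_df) - len(hard_df),
        "extra_columns": sorted(set(soft_df.columns) - set(hard_df.columns)),
    }


# Toy frames; 'is_deleted' is an assumed flag column
soft = pd.DataFrame({"rental_id": [1, 2, 3], "is_deleted": [False, True, False]})
hard = soft.loc[~soft["is_deleted"], ["rental_id"]]

summary = compare_outputs(soft, hard)
print(summary)  # {'rows_removed': 1, 'extra_columns': ['is_deleted']}
```

Applied to the real data, the call would be `compare_outputs(bike_rentals_soft_df, bike_rentals_hard_df)`.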
