# Data preparation

This notebook downloads the data (already imported and persisted in Azure Blob Storage) and prepares it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data preparation pipeline (ingestion, cleaning and featurization)](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure that the Python code responsible for the data ingestion, cleaning and featurization pipeline is in a stable state.

In [2]:
# Make the 'bikerentals' folder the current directory
%cd ..

C:\Dev\private\wroclawski-rower-miejski\bikerentals


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform win32 -- Python 3.6.8, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: C:\Dev\private\wroclawski-rower-miejski, inifile: tox.ini
collected 24 items

src\tests\cleaning\test_extract_gps_from_station_name.py ..              [  8%]
src\tests\cleaning\test_pipeline.py ....                                 [ 25%]
src\tests\cleaning\test_remove_missing_gps.py ..                         [ 33%]
src\tests\cleaning\test_remove_same_location.py ...                      [ 45%]
src\tests\features\test_day_of_week.py ..                                [ 54%]
src\tests\features\test_distance.py ..                                   [ 62%]
src\tests\features\test_holidays.py ...                                  [ 75%]
src\tests\features\test_hour.py ..                                       [ 83%]
src\tests\features\test_month.py ..                                      [ 91%]
src\tests\features\test_season.py ..                                     [100%]
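
To give a feel for what the feature tests listed above check, here is a minimal sketch in the style of `test_day_of_week.py`. The function `add_day_of_week` and the column name `rental_date` are assumptions made for illustration; the project's actual featurizers and column names may differ.

```python
import pandas as pd


def add_day_of_week(df: pd.DataFrame, date_col: str = "rental_date") -> pd.DataFrame:
    """Hypothetical featurizer: append a 0-6 day-of-week column (Monday = 0)."""
    out = df.copy()  # never modify the caller's frame
    out["day_of_week"] = pd.to_datetime(out[date_col]).dt.dayofweek
    return out


def test_day_of_week_is_monday_based():
    # 2019-07-01 was a Monday, 2019-07-07 a Sunday
    df = pd.DataFrame({"rental_date": ["2019-07-01", "2019-07-07"]})
    result = add_day_of_week(df)
    assert result["day_of_week"].tolist() == [0, 6]


def test_day_of_week_does_not_mutate_input():
    df = pd.DataFrame({"rental_date": ["2019-07-01"]})
    add_day_of_week(df)
    assert "day_of_week" not in df.columns
```

Saved under a `test_*.py` name, both functions would be collected by `python -m pytest` in the same way as the tests above.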



## Data preparation pipeline

This step is implemented as a full pipeline that consists of the following steps:
* data ingestion (downloading the already imported data from Azure Blob Storage to a local destination),
* cleaning and soft/hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.
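
The steps above can be pictured as a chain of stages, each of which receives a DataFrame and returns a new one while reporting its input and output shapes. The sketch below is not the code behind `run_pipeline.py`; the class names `Stage`, `Pipeline`, `DropMissingStart` and `HourFeature`, as well as the column names, are assumptions chosen only to illustrate the idea.

```python
import pandas as pd


class Stage:
    """Hypothetical base class: subclasses implement `run`, `execute` adds logging."""

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def execute(self, df: pd.DataFrame) -> pd.DataFrame:
        name = type(self).__name__
        print(f"**** {name} stage - start ****")
        print(f"Input data shape: {df.shape}")
        result = self.run(df)
        print(f"Output data shape: {result.shape}")
        print(f"**** {name} stage - end ****")
        return result


class DropMissingStart(Stage):
    """Illustrative cleaning stage: drop rentals without a start station."""

    def run(self, df):
        return df.dropna(subset=["rental_place"]).reset_index(drop=True)


class HourFeature(Stage):
    """Illustrative featurization stage: derive the rental hour from a timestamp."""

    def run(self, df):
        out = df.copy()
        out["hour"] = pd.to_datetime(out["rental_date"]).dt.hour
        return out


class Pipeline(Stage):
    """A pipeline is itself a stage that runs its child stages in order."""

    def __init__(self, stages):
        self.stages = stages

    def run(self, df):
        for stage in self.stages:
            df = stage.execute(df)
        return df


# Toy input with assumed column names
raw = pd.DataFrame({
    "rental_date": ["2019-07-01 08:15:00", "2019-07-01 17:40:00"],
    "rental_place": ["Station A", None],
})
processed = Pipeline([DropMissingStart(), HourFeature()]).execute(raw)
```

Treating the pipeline as a stage makes nesting possible, which is one way to obtain nested log sections such as a cleaning stage that contains several smaller stages.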

We'll execute the pipeline twice to get two output datasets: one with soft-deleted records (marked but kept in the dataset, so they can be inspected) and one with hard-deleted records (removed from the dataset).
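
The difference between the two modes can be sketched with a small helper. This is an illustration under assumptions, not the pipeline's actual implementation: the function `remove_records`, the flag column `is_deleted` and the columns `rental_place` / `return_place` are hypothetical names.

```python
import pandas as pd


def remove_records(df, mask, hard_delete, flag_col="is_deleted"):
    """Drop the rows selected by `mask` (hard) or only mark them in `flag_col` (soft)."""
    if hard_delete:
        return df.loc[~mask].reset_index(drop=True)
    out = df.copy()
    # Combine with any earlier flags so several cleaning steps can share one column
    previous = out[flag_col] if flag_col in out.columns else False
    out[flag_col] = mask | previous
    return out


# Toy example: rentals returned to the station they were taken from
rentals = pd.DataFrame({
    "rental_place": ["A", "B", "C"],
    "return_place": ["A", "C", "C"],
})
same_location = rentals["rental_place"] == rentals["return_place"]

soft = remove_records(rentals, same_location, hard_delete=False)
hard = remove_records(rentals, same_location, hard_delete=True)
print(soft.shape, hard.shape)  # (3, 3) (1, 2)
```

In this sketch the soft variant keeps every row and gains one column, while the hard variant loses the flagged rows and keeps the original columns.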

In [4]:
# Go to source folder
%cd src

C:\Dev\private\wroclawski-rower-miejski\bikerentals\src


In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py --hard-delete=False --save=bike_rentals_soft

Script execution started
Root folder set to: C:\Dev\private\wroclawski-rower-miejski\bikerentals
Pipeline execution about to start!
**** DataIngestion stage - start ****
Output data shape: (480640, 10)
**** DataIngestion stage - end ****

**** DataCleaning stage - start ****
Input data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - end ****

    **** GpsFromStationNameExtractor stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - end ****

    **** SameLocationRemover stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 11)
    **** SameLocationRemover stage - end ****

    **** MissingGpsLocationRemover stage - start ****
    Input data shape: (480640, 11)
    Output data shape: (480640, 11)
    **** MissingGpsLocationRemover sta


IPython CPU timings (estimated):
  User   :      61.44 s.
  System :       0.00 s.
Wall time:      61.44 s.


In [6]:
# run data loading and processing pipeline (with hard deleting)
%run -t run_pipeline.py --hard-delete=True --save=bike_rentals

Script execution started
Root folder set to: C:\Dev\private\wroclawski-rower-miejski\bikerentals
Pipeline execution about to start!
**** DataIngestion stage - start ****
Output data shape: (480640, 10)
**** DataIngestion stage - end ****

**** DataCleaning stage - start ****
Input data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - end ****

    **** GpsFromStationNameExtractor stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 10)
    **** GpsFromStationNameExtractor stage - end ****

    **** SameLocationRemover stage - start ****
    Input data shape: (480640, 10)
    Output data shape: (480640, 11)
    **** SameLocationRemover stage - end ****

    **** MissingGpsLocationRemover stage - start ****
    Input data shape: (480640, 11)
    Output data shape: (480640, 11)
    **** MissingGpsLocationRemover sta


IPython CPU timings (estimated):
  User   :      49.83 s.
  System :       0.00 s.
Wall time:      49.83 s.


## Post-execution verification

In [7]:
# set up paths to created datasets
filepath_soft = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals_soft.csv')
filepath_hard = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals.csv')

In [8]:
# load data
bike_rentals_soft_df = pd.read_csv(filepath_soft)
bike_rentals_hard_df = pd.read_csv(filepath_hard)

In [9]:
print('Dataset with soft deleted records: ', bike_rentals_soft_df.shape)
print('Dataset with hard deleted records: ', bike_rentals_hard_df.shape)

Dataset with soft deleted records:  (480640, 17)
Dataset with hard deleted records:  (351571, 16)
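
A slightly more detailed cross-check compares the two datasets directly. The helper below is a sketch; `compare_outputs` is a hypothetical name, and the toy frames stand in for `bike_rentals_soft_df` and `bike_rentals_hard_df` so that the block runs on its own.

```python
import pandas as pd


def compare_outputs(soft_df: pd.DataFrame, hard_df: pd.DataFrame) -> dict:
    """Summarize how the soft-deleted and hard-deleted datasets differ."""
    extra_cols = [c for c in soft_df.columns if c not in hard_df.columns]
    rows_removed = len(soft_df) - len(hard_df)
    return {
        "extra_columns_in_soft": extra_cols,
        "rows_removed": rows_removed,
        "share_removed": rows_removed / len(soft_df),
    }


# Toy stand-ins; in the notebook, pass bike_rentals_soft_df and bike_rentals_hard_df
toy_soft = pd.DataFrame({"id": [1, 2, 3, 4], "is_deleted": [True, False, False, False]})
toy_hard = pd.DataFrame({"id": [2, 3, 4]})

summary = compare_outputs(toy_soft, toy_hard)
print(summary)
```

With the shapes printed above, the same arithmetic gives 480640 - 351571 = 129069 removed rows, roughly 27% of the soft-deleted dataset, and one extra column in the soft-deleted output.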
