# Data preparation

This notebook downloads data that has already been imported and persisted in Azure Blob Storage, and prepares it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data ingestion, cleaning and featurization pipeline](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure that the Python code responsible for the data ingestion, cleaning and featurization pipeline is in a stable state.

In [2]:
# Make the 'bikeavailability' folder the current directory
os.chdir('..')
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform darwin -- Python 3.6.9, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski, inifile: tox.ini
plugins: cov-2.8.1
collected 0 items
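The output above reports `collected 0 items`, which means pytest ran but found no tests. A notebook can detect this case from the pytest exit code instead of relying on a visual check. The sketch below is only an illustration: `describe_pytest_exit` and `run_tests` are hypothetical helper names that are not part of this project. The exit code meanings follow pytest's documented `ExitCode` values.

```python
import subprocess
import sys

# Exit codes documented by pytest (see pytest.ExitCode).
PYTEST_EXIT_CODES = {
    0: "all collected tests passed",
    1: "some tests failed",
    2: "test run interrupted",
    3: "internal pytest error",
    4: "pytest command-line usage error",
    5: "no tests were collected",
}


def describe_pytest_exit(code):
    """Translate a pytest exit code into a readable message."""
    return PYTEST_EXIT_CODES.get(code, f"unknown exit code {code}")


def run_tests(test_dir="."):
    """Run pytest in a subprocess (hypothetical helper).

    Returns a (exit_code, description) tuple. If pytest is not installed,
    the interpreter itself exits with code 1, so inspect the captured
    output before trusting the description.
    """
    result = subprocess.run(
        [sys.executable, "-m", "pytest", test_dir],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, describe_pytest_exit(result.returncode)
```

Calling `run_tests()` from a notebook cell and asserting that the returned code is `0` would stop the notebook both when tests fail and when no tests are discovered.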



## Data preparation pipeline

This step is implemented as a full pipeline consisting of the following steps:
* data ingestion (downloading already imported data from Azure Blob Storage to a local destination),
* cleaning and soft or hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.

In fact, we'll execute the pipeline twice. This way we'll get two output datasets: one with soft-deleted records and one with hard-deleted records.
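To illustrate the difference between the two outputs, here is a minimal sketch of soft versus hard deletion in pandas. It does not reproduce the project's actual cleaning code: the helper `remove_records`, the flag column `is_deleted`, the toy rows and the "negative bike count" rule are all assumptions made for the example.

```python
import pandas as pd


def remove_records(df, mask, hard_delete):
    """Remove the rows selected by the boolean `mask`.

    hard_delete=True  -> the rows are dropped from the result.
    hard_delete=False -> the rows are kept and marked in a hypothetical
                         `is_deleted` column, so they can be inspected later.
    """
    if hard_delete:
        return df.loc[~mask].reset_index(drop=True)
    flagged = df.copy()
    flagged["is_deleted"] = mask
    return flagged


# Toy data invented for this example (not taken from the real dataset).
sample = pd.DataFrame({
    "timestamp": ["2019-10-25T15:20:00"] * 3,
    "bikes": [3, -1, 11],
    "number": [15001, 15002, 15003],
})

# Illustrative cleaning rule: a negative bike count is treated as invalid.
invalid = sample["bikes"] < 0

soft = remove_records(sample, invalid, hard_delete=False)
hard = remove_records(sample, invalid, hard_delete=True)

print(soft.shape)  # same number of rows, one extra flag column
print(hard.shape)  # invalid row removed, original columns only
```

The soft variant keeps every row, which makes it possible to review what the cleaning rules would remove before committing to the hard variant.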

In [4]:
# Go to source folder
os.chdir('src')

In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py --hard-delete=False --save=bike_availability_soft

Script execution started
Root folder set to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability
Pipeline execution about to start!
**** DataPreparationPipeline stage - start ****
    **** DataIngestion stage - start ****
    **** DataIngestion stage - end ****

**** DataPreparationPipeline stage - end ****

Data saved to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/processed/bike_availability_soft.csv
Pipeline execution completed!



IPython CPU timings (estimated):
  User   :     224.19 s.
  System :      45.39 s.
Wall time:     274.46 s.
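The `%run` call above passes `--hard-delete=False` on the command line, where the value arrives as the string `"False"`. The source of `run_pipeline.py` is not shown in this notebook, so the following is only a sketch of how such a flag could be parsed with `argparse`. The names `str_to_bool` and `build_parser` are hypothetical. An explicit converter is used because `type=bool` would turn any non-empty string, including `"False"`, into `True`.

```python
import argparse


def str_to_bool(value):
    """Convert a command-line string such as 'False' into a real boolean."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser for the two options used in this notebook (assumed design)."""
    parser = argparse.ArgumentParser(description="Run the data preparation pipeline")
    parser.add_argument("--hard-delete", type=str_to_bool, default=False,
                        help="drop removed records instead of only flagging them")
    parser.add_argument("--save", required=True,
                        help="name of the output dataset, without extension")
    return parser


# Parse the same arguments that the notebook cell passes to the script.
args = build_parser().parse_args(
    ["--hard-delete=False", "--save=bike_availability_soft"]
)
print(args.hard_delete, args.save)
```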


In [None]:
# run data loading and processing pipeline (with hard deleting)
#%run -t run_pipeline.py --hard-delete=True --save=bike_rentals

## Post-execution verification

In [6]:
# set up paths to created datasets
filepath_soft = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_availability_soft.csv')
#filepath_hard = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals.csv')

In [8]:
# load data
bike_availability_soft_df = pd.read_csv(filepath_soft)
#bike_availability_hard_df = pd.read_csv(filepath_hard)

In [9]:
print('Dataset with soft deleted records: ', bike_availability_soft_df.shape)
#print('Dataset with hard deleted records: ', bike_availability_hard_df.shape)

Dataset with soft deleted records:  (952573, 3)


In [10]:
bike_availability_soft_df.head()

Unnamed: 0,timestamp,bikes,number
0,2019-10-25T15:20:00.201851,3,15001
1,2019-10-25T15:20:00.201851,11,15002
2,2019-10-25T15:20:00.201851,3,15003
3,2019-10-25T15:20:00.201851,37,15004
4,2019-10-25T15:20:00.201851,1,15005


In [11]:
bike_availability_soft_df.tail()

Unnamed: 0,timestamp,bikes,number
952568,2019-11-27T20:50:00.224191,3,15197
952569,2019-11-27T20:50:00.224191,3,15198
952570,2019-11-27T20:50:00.224191,7,15199
952571,2019-11-27T20:50:00.224191,1,15200
952572,2019-11-27T20:50:00.224191,7,15167
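Beyond looking at the shape, head and tail, a few automated checks can make the verification more systematic. The sketch below is one possible approach, not part of the pipeline. It inlines the first five rows shown above so that it runs on its own, and the helper `summarize` is a hypothetical name. Treating `number` as a station identifier is an assumption based on the column values.

```python
import io

import pandas as pd

# First rows of the output shown above, inlined so the sketch is self-contained.
sample_csv = io.StringIO(
    "timestamp,bikes,number\n"
    "2019-10-25T15:20:00.201851,3,15001\n"
    "2019-10-25T15:20:00.201851,11,15002\n"
    "2019-10-25T15:20:00.201851,3,15003\n"
    "2019-10-25T15:20:00.201851,37,15004\n"
    "2019-10-25T15:20:00.201851,1,15005\n"
)

# Parse timestamps on load so that min/max work on real datetimes.
df = pd.read_csv(sample_csv, parse_dates=["timestamp"])


def summarize(frame):
    """Collect basic sanity metrics for an availability dataset."""
    return {
        "rows": len(frame),
        "missing_values": int(frame.isna().sum().sum()),
        # The same station reported twice in one snapshot would be suspicious.
        "duplicate_snapshots": int(
            frame.duplicated(subset=["timestamp", "number"]).sum()
        ),
        "negative_bike_counts": int((frame["bikes"] < 0).sum()),
        "stations": int(frame["number"].nunique()),
        "first_timestamp": frame["timestamp"].min(),
        "last_timestamp": frame["timestamp"].max(),
    }


summary = summarize(df)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The same function could be applied to `bike_availability_soft_df` after loading it with `parse_dates=['timestamp']`.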
