# Data preparation

This notebook downloads data that has already been imported and persisted in Azure Blob Storage and prepares it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data ingestion, cleaning and featurization pipeline](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure that the Python code responsible for the data ingestion, cleaning and featurization pipeline is in a stable state.

In [2]:
# Make the 'bikeavailability' folder the current directory
os.chdir('..')
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability
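
Because `os.chdir` changes state for the whole kernel, re-running the cell above moves one level further up each time. One way to make such a step repeatable is a small context manager that restores the previous directory on exit. This is only a sketch: the helper name `working_directory` is hypothetical and not part of the project.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Temporarily switch the current working directory to `path`."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always restore the original directory, even if the body raises
        os.chdir(previous)


# Usage example with a throwaway directory
start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    with working_directory(tmp):
        inside = os.getcwd()
end = os.getcwd()
```

After the `with` block finishes, the kernel is back in the directory it started from, so the cell can be executed any number of times.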


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform darwin -- Python 3.6.9, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski, inifile: tox.ini
plugins: cov-2.8.1
collected 0 items
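
The output above reports `collected 0 items`, so no tests actually ran. pytest signals this case with exit code 5 and a fully successful run with exit code 0. A minimal sketch of a guard that stops the notebook unless the tests really passed is shown below. The function name `run_checked` is hypothetical, and the example drives it with a stand-in command so that it runs without the project.

```python
import subprocess
import sys

# Exit codes documented by pytest
PYTEST_OK = 0
PYTEST_NO_TESTS_COLLECTED = 5


def run_checked(command):
    """Run `command` and raise unless it exits with pytest's success code."""
    result = subprocess.run(command)
    if result.returncode == PYTEST_NO_TESTS_COLLECTED:
        raise RuntimeError("pytest collected no tests; check rootdir and test paths")
    if result.returncode != PYTEST_OK:
        raise RuntimeError(f"tests failed with exit code {result.returncode}")
    return result.returncode


# In the notebook this would be: run_checked([sys.executable, "-m", "pytest"])
# Stand-in command that mimics a run in which no tests were collected:
fake_no_tests = [sys.executable, "-c", "import sys; sys.exit(5)"]
try:
    run_checked(fake_no_tests)
    outcome = "passed"
except RuntimeError as error:
    outcome = str(error)
```

Treating "no tests collected" as a failure is a design choice: it prevents an empty test run from being mistaken for a green one.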



## Data preparation pipeline

This step is implemented as a full pipeline that consists of the following steps:
* data ingestion (downloading already imported data from Azure Blob Storage to a local destination),
* cleaning and soft/hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.

In fact, we'll execute the pipeline twice. This way we'll get two output datasets: one with soft-deleted records and the other with hard-removed records.
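
The difference between the two modes can be illustrated with a small pandas sketch. It assumes that a soft delete marks a record with a boolean flag while a hard delete drops the row. The `IsDeleted` column name, the `remove_records` helper and the negative-count rule are illustrative assumptions, not the project's actual cleaning logic.

```python
import pandas as pd


def remove_records(df, mask, hard=False):
    """Remove rows selected by the boolean `mask`.

    hard=False: keep every row and flag the selected ones in 'IsDeleted'.
    hard=True:  drop the selected rows from the result.
    """
    if hard:
        return df.loc[~mask].reset_index(drop=True)
    flagged = df.copy()
    flagged["IsDeleted"] = mask
    return flagged


sample = pd.DataFrame({
    "Available Bikes": [3, -1, 4],
    "Bike Station Number": [15171, 15164, 15162],
})
invalid = sample["Available Bikes"] < 0  # hypothetical cleaning rule

soft = remove_records(sample, invalid)             # all rows kept, one flagged
hard = remove_records(sample, invalid, hard=True)  # flagged row dropped
```

The soft-deleted output keeps every record available for inspection, while the hard-deleted output is the smaller dataset one would typically feed into modeling.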

In [4]:
# Go to source folder
os.chdir('src')

In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py

Script execution started
Root folder set to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability
Pipeline execution about to start!
**** DataPreparationPipeline stage - start ****
    **** DataIngestion stage - start ****
Downloading file /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/raw/2019_11_30_22_40_00_0.json
Downloading file /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/raw/2019_11_30_22_40_00_100.json
Downloading file /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/raw/2019_11_30_22_40_00_200.json
    **** DataIngestion stage - end ****

    Data saved to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/processed/bike_availability.csv
    Data saved to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/processed/processed_files.csv
**** DataPreparationPipeline stage - end ****

Pipeline execution completed!



IPython CPU timings (estimated):
  User   :      10.78 s.
  System :       0.52 s.
Wall time:      15.69 s.


## Post-execution verification

In [6]:
# set up the path to the created dataset
filepath = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_availability.csv')

In [7]:
# load data
bike_availability_df = pd.read_csv(filepath)

In [8]:
print('Dataset: ', bike_availability_df.shape)
#print('Dataset with hard deleted records: ', bike_availability_hard_df.shape)

Dataset:  (1042502, 3)


In [9]:
bike_availability_df.head()

| | Timestamp | Available Bikes | Bike Station Number |
|---|---|---|---|
| 0 | 2019-10-25 15:20:00 | 3 | 15171 |
| 1 | 2019-10-25 15:20:00 | 0 | 15164 |
| 2 | 2019-10-25 15:20:00 | 1 | 15163 |
| 3 | 2019-10-25 15:20:00 | 4 | 15162 |
| 4 | 2019-10-25 15:20:00 | 0 | 15161 |


In [10]:
bike_availability_df.tail()

| | Timestamp | Available Bikes | Bike Station Number |
|---|---|---|---|
| 1042497 | 2019-11-30 22:40:00 | 0 | 15071 |
| 1042498 | 2019-11-30 22:40:00 | 5 | 15072 |
| 1042499 | 2019-11-30 22:40:00 | 2 | 15073 |
| 1042500 | 2019-11-30 22:40:00 | 0 | 15063 |
| 1042501 | 2019-11-30 22:40:00 | 10 | 15167 |
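
Beyond eyeballing `head()` and `tail()`, a few explicit checks can make the verification repeatable. The sketch below uses the column names shown above. The specific checks (no missing values, non-negative counts, one row per station and timestamp) are assumptions about what a valid dataset should satisfy, and the `verify` helper is hypothetical.

```python
import pandas as pd


def verify(df):
    """Return a list of human-readable problems found in the dataset."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values present")
    if (df["Available Bikes"] < 0).any():
        problems.append("negative bike counts")
    if df.duplicated(["Timestamp", "Bike Station Number"]).any():
        problems.append("duplicate station snapshots")
    return problems


# Small stand-in frame with the same columns as the real dataset
sample = pd.DataFrame({
    "Timestamp": ["2019-10-25 15:20:00", "2019-10-25 15:20:00"],
    "Available Bikes": [3, 0],
    "Bike Station Number": [15171, 15164],
})
problems = verify(sample)
```

In the notebook the same helper could be called as `verify(bike_availability_df)`; an empty list would mean that none of the assumed checks failed.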
