# Data preparation

The purpose of this notebook is to download data (already imported and persisted in Azure Blob Storage) and prepare it. The result is a dataset ready for further analysis and modeling.

The data preparation steps are:
* [Environment configuration](#Environment-configuration)
* [Unit tests execution](#Unit-tests-execution)
* [Data ingestion, cleaning and featurization pipeline](#Data-preparation-pipeline)
* [Quick verification of pipeline outputs](#Post-execution-verification)

## Environment configuration

In [1]:
import os
import pandas as pd

## Unit tests execution

This step makes sure the Python code (responsible for the data ingestion, cleaning and featurization pipeline) is in a stable state.

In [2]:
# Make the 'bikeavailability' folder the current directory
os.chdir('..')
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability


In [3]:
# execute tests to make sure everything is working as expected
!python -m pytest

platform darwin -- Python 3.6.9, pytest-5.2.2, py-1.8.0, pluggy-0.13.0
rootdir: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski, inifile: tox.ini
plugins: cov-2.8.1
collected 0 items
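
Note that the run above reports `collected 0 items`, so no tests were actually exercised. The following is a minimal sketch, assuming pytest is installed in the active environment, of running the suite from Python and accepting only a clean pass. The helper names `run_pytest` and `suite_is_trustworthy` are hypothetical and not part of the project. pytest documents exit code 0 for "all tests passed" and 5 for "no tests collected".

```python
import subprocess
import sys

# Exit codes documented by pytest
EXIT_OK = 0                  # tests were collected and all of them passed
EXIT_NO_TESTS_COLLECTED = 5  # pytest found no tests to run


def run_pytest(extra_args=()):
    """Run pytest in a subprocess and return its exit code."""
    completed = subprocess.run([sys.executable, "-m", "pytest", *extra_args])
    return completed.returncode


def suite_is_trustworthy(returncode):
    """True only when tests were collected and all of them passed.

    An empty collection (exit code 5) is treated as a failure, because
    it says nothing about the state of the code.
    """
    return returncode == EXIT_OK
```

In a notebook cell this could be used as `assert suite_is_trustworthy(run_pytest())`, which would stop execution before the pipeline runs on untested code.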



## Data preparation pipeline

This step is implemented as a full pipeline and consists of the following steps:
* data ingestion (downloading already imported data from Azure Blob Storage to a local destination),
* cleaning and soft or hard removal of records,
* creating new features,
* saving the dataset in the `data/processed` location.

We'll execute the pipeline twice. This way we get two output datasets: one with soft-deleted records, the other with hard-deleted records.
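
The difference between the two modes can be sketched as follows. This is a minimal illustration, not the project's implementation: the `is_deleted` flag column, the `remove_records` helper, and the example rule (a negative bike count is invalid) are all assumptions made for the example.

```python
import pandas as pd


def remove_records(df, mask, hard_delete):
    """Remove the rows selected by `mask`, softly or permanently.

    Soft delete keeps every row and flags the selected ones in a
    hypothetical `is_deleted` column; hard delete drops them.
    """
    if hard_delete:
        return df.loc[~mask].reset_index(drop=True)
    flagged = df.copy()
    flagged["is_deleted"] = mask
    return flagged


# Toy data; the rule "negative bike count is invalid" is an assumption
sample = pd.DataFrame({"number": [15001, 15002, 15003], "bikes": [3, -1, 11]})
invalid = sample["bikes"] < 0

soft = remove_records(sample, invalid, hard_delete=False)  # 3 rows, one flagged
hard = remove_records(sample, invalid, hard_delete=True)   # 2 rows, none flagged
```

The soft-deleted output stays the same length as the input, which makes it possible to inspect what a cleaning rule would remove before committing to it.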

In [4]:
# Go to source folder
os.chdir('src')

In [5]:
# run data loading and processing pipeline 
# (with soft deleting so that we can inspect everything)
%run -t run_pipeline.py --hard-delete=False --save=bike_availability_soft

Script execution started
Root folder set to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability
Pipeline execution about to start!
**** DataPreparationPipeline stage - start ****
    **** DataIngestion stage - start ****
    Output data shape: (0, 0)
    **** DataIngestion stage - end ****

Output data shape: (0, 0)
**** DataPreparationPipeline stage - end ****

Data saved to: /Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/data/processed/bike_availability_soft.csv
Pipeline execution completed!



IPython CPU timings (estimated):
  User   :       7.18 s.
  System :       0.41 s.
Wall time:      12.99 s.


In [6]:
# run data loading and processing pipeline (with hard deleting)
#%run -t run_pipeline.py --hard-delete=True --save=bike_rentals

## [tmp] data ingestion code development

In [7]:
raw_data_folder = os.path.join(os.getcwd(), '..', 'data', 'raw')
processed_data_folder = os.path.join(os.getcwd(), '..', 'data', 'processed')

In [8]:
import json
import glob
import pandas as pd

In [9]:
# names of the files already read and processed
processed_filenames = []
all_df = pd.DataFrame(data=[], columns=['bikes', 'number', 'timestamp'])

# during development, read only the first 10 files (sorted by name)
for filename in sorted(glob.glob(os.path.join(raw_data_folder, '*.json')))[:10]:
    # load json file
    print(filename)
    with open(filename) as json_file:
        data = json.load(json_file)

    if data['success']:
        # make a dataframe from a single json file and stamp it with the snapshot time
        records_df = pd.DataFrame(data=data['result']['records'])
        records_df['timestamp'] = data['datetime']
        all_df = pd.concat([all_df, records_df], sort=True, ignore_index=True)
        processed_filenames.append(os.path.basename(filename))

/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_20_00_0.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_20_00_100.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_20_00_200.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_30_00_0.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_30_00_100.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_30_00_200.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_40_00_0.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/2019_10_25_15_40_00_100.json
/Users/mariuszrokita/GitHub/wroclawski-rower-miejski/bikeavailability/src/../data/raw/

In [10]:
all_df.head()

Unnamed: 0,bikes,number,timestamp
0,3,15001,2019-10-25T15:20:00.201851
1,11,15002,2019-10-25T15:20:00.201851
2,3,15003,2019-10-25T15:20:00.201851
3,37,15004,2019-10-25T15:20:00.201851
4,1,15005,2019-10-25T15:20:00.201851


In [11]:
all_df.tail()

Unnamed: 0,bikes,number,timestamp
704,20,15096,2019-10-25T15:50:00.187345
705,8,15104,2019-10-25T15:50:00.187345
706,14,15097,2019-10-25T15:50:00.187345
707,10,15098,2019-10-25T15:50:00.187345
708,12,15099,2019-10-25T15:50:00.187345


In [12]:
all_df

Unnamed: 0,bikes,number,timestamp
0,3,15001,2019-10-25T15:20:00.201851
1,11,15002,2019-10-25T15:20:00.201851
2,3,15003,2019-10-25T15:20:00.201851
3,37,15004,2019-10-25T15:20:00.201851
4,1,15005,2019-10-25T15:20:00.201851
...,...,...,...
704,20,15096,2019-10-25T15:50:00.187345
705,8,15104,2019-10-25T15:50:00.187345
706,14,15097,2019-10-25T15:50:00.187345
707,10,15098,2019-10-25T15:50:00.187345
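
Because `all_df` starts as an empty frame and its values come from JSON, it may be worth normalising the column types before any analysis. The sketch below assumes that `bikes` and `number` are numeric and that `timestamp` is an ISO 8601 string, as in the output above. The helper name `normalise_types` is hypothetical, and storing the sample values as strings is an assumption made only to show the conversion.

```python
import pandas as pd


def normalise_types(df):
    """Return a copy with numeric counts and ids and a datetime timestamp."""
    result = df.copy()
    result["bikes"] = pd.to_numeric(result["bikes"])
    result["number"] = pd.to_numeric(result["number"])
    result["timestamp"] = pd.to_datetime(result["timestamp"])
    return result


# Three rows taken from the output above, stored as strings (an assumption)
sample = pd.DataFrame({
    "bikes": ["3", "11", "20"],
    "number": ["15001", "15002", "15096"],
    "timestamp": [
        "2019-10-25T15:20:00.201851",
        "2019-10-25T15:20:00.201851",
        "2019-10-25T15:50:00.187345",
    ],
})

typed = normalise_types(sample)

# Number of station records captured in each snapshot
records_per_snapshot = typed.groupby("timestamp").size()
```

Counting records per snapshot is a quick way to spot snapshots where one of the paged files is missing.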


In [13]:
processed_filenames

['2019_10_25_15_20_00_0.json',
 '2019_10_25_15_20_00_100.json',
 '2019_10_25_15_20_00_200.json',
 '2019_10_25_15_30_00_0.json',
 '2019_10_25_15_30_00_100.json',
 '2019_10_25_15_30_00_200.json',
 '2019_10_25_15_40_00_0.json',
 '2019_10_25_15_40_00_100.json',
 '2019_10_25_15_40_00_200.json',
 '2019_10_25_15_50_00_0.json']
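
The exploratory loop above could be folded into a reusable function along these lines. This is a sketch under the same assumptions as the loop (each file has a `success` flag, a `datetime` value, and records under `result` → `records`). The function name `load_snapshots` and the `limit` parameter are hypothetical. It collects the per-file frames in a list and concatenates once, which avoids repeatedly copying a growing DataFrame.

```python
import glob
import json
import os

import pandas as pd


def load_snapshots(folder, limit=None):
    """Read raw JSON snapshots from `folder` into a single DataFrame.

    Returns the combined DataFrame and the base names of the files
    that were successfully processed. `limit=None` reads every file.
    """
    frames = []
    processed = []
    for path in sorted(glob.glob(os.path.join(folder, "*.json")))[:limit]:
        with open(path) as json_file:
            data = json.load(json_file)
        if not data.get("success"):
            continue  # skip snapshots flagged as unsuccessful
        records_df = pd.DataFrame(data["result"]["records"])
        records_df["timestamp"] = data["datetime"]
        frames.append(records_df)
        processed.append(os.path.basename(path))

    if not frames:
        # keep the expected columns even when nothing was loaded
        return pd.DataFrame(columns=["bikes", "number", "timestamp"]), processed
    return pd.concat(frames, sort=True, ignore_index=True), processed
```

With the variables defined earlier, `all_df, processed_filenames = load_snapshots(raw_data_folder, limit=10)` would correspond to the loop above.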

## Post-execution verification

In [14]:
# set up paths to created datasets
#filepath_soft = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_availability_soft.csv')
#filepath_hard = os.path.join(os.getcwd(), '..', 'data', 'processed', 'bike_rentals.csv')

In [15]:
# load data
#bike_rentals_soft_df = pd.read_csv(filepath_soft)
#bike_rentals_hard_df = pd.read_csv(filepath_hard)

In [16]:
#print('Dataset with soft deleted records: ', bike_rentals_soft_df.shape)
#print('Dataset with hard deleted records: ', bike_rentals_hard_df.shape)
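
Once both runs have produced their files, the checks could look like the sketch below. It assumes that soft deletion keeps every row while hard deletion only drops rows, so the hard-deleted dataset can never be the larger of the two. `verify_outputs` is a hypothetical helper, not project code.

```python
import pandas as pd


def verify_outputs(soft_df, hard_df):
    """Run basic sanity checks on the two pipeline outputs.

    Assumes soft deletion keeps all rows while hard deletion drops
    some of them. Returns a list of problems (empty when all pass).
    """
    problems = []
    if soft_df.empty:
        problems.append("soft-deleted dataset is empty")
    if hard_df.empty:
        problems.append("hard-deleted dataset is empty")
    if len(hard_df) > len(soft_df):
        problems.append("hard-deleted dataset has more rows than the soft-deleted one")
    return problems
```

With the commented-out cells above enabled, this would be called as `verify_outputs(bike_rentals_soft_df, bike_rentals_hard_df)`. The soft-delete run logged earlier reported `Output data shape: (0, 0)`, so the first check would likely flag that output.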