# Download Bike Rentals Data

Capital Bikeshare publishes its bike rental data in an [S3 bucket](https://s3.amazonaws.com/capitalbikeshare-data/index.html). The datasets are available as zip archives. Through the end of 2017 the data were grouped into yearly packages; from 2018 onward they are published monthly.

The script below automates the download and combines the complete dataset into a single file.

In [None]:
import os
os.chdir('..')

In [1]:
import pandas as pd
import numpy as np

import os
import shutil
from zipfile import ZipFile
from urllib.request import urlopen

* The file URLs follow a consistent pattern; only the date part needs to be substituted for each archive.
* All downloaded files are stored in a temporary folder, which is deleted with all its contents once processing is finished.
* The complete dataset is saved to a single file in Parquet format.

In [2]:
# Yearly archives for 2010-2017, then monthly archives from 2018-01 to 2019-02.
bikeRental_batch = list(map(str, range(2010, 2018))) \
                    + pd.period_range('2018-01', periods=14, freq='M').strftime('%Y%m').to_list()
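To make the list of batches and the URL pattern explicit, here is a minimal sketch that builds the same labels with the standard library only. The helper names `batch_labels` and `archive_url` are hypothetical and are not used elsewhere in this notebook; the base URL and file suffix are taken from the download cell.

```python
BASE_URL = 'https://s3.amazonaws.com/capitalbikeshare-data/'


def batch_labels(first_year=2010, last_full_year=2017, monthly_start=(2018, 1), n_months=14):
    """Return yearly labels ('2010') followed by monthly labels ('201801')."""
    labels = [str(year) for year in range(first_year, last_full_year + 1)]
    year, month = monthly_start
    for _ in range(n_months):
        labels.append(f'{year}{month:02d}')
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


def archive_url(label):
    """Build the archive URL for one batch label (hypothetical helper)."""
    return BASE_URL + label + '-capitalbikeshare-tripdata.zip'


labels = batch_labels()
print(len(labels), labels[0], labels[-1])
print(archive_url(labels[8]))
```

With the defaults this gives 8 yearly labels followed by 14 monthly ones, ending at `201902`.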

In [3]:
import tempfile

# Create a uniquely named temporary folder under data/raw_data.
# mkdtemp picks an unused name, so it cannot collide with an existing folder.
temp_dir = tempfile.mkdtemp(prefix='temp', dir='data/raw_data')
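In this notebook the temporary folder is removed only in the final step, so it is left behind if a download fails midway. One possible alternative, shown here as a sketch rather than as the approach used in this notebook, is `tempfile.TemporaryDirectory`, which deletes the folder when the `with` block exits, even after an exception. The file name and its contents below are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory(prefix='bike_temp_') as temp_dir:
    # Write a placeholder file where a downloaded archive would go.
    zf_name = os.path.join(temp_dir, 'bike_data_2010.zip')
    with open(zf_name, 'wb') as output:
        output.write(b'placeholder bytes, not a real archive')
    existed_inside = os.path.exists(zf_name)

# The folder and everything in it are gone once the block exits.
removed_after = not os.path.exists(temp_dir)
print(existed_inside, removed_after)
```

The trade-off is that all work on the downloaded files has to happen inside the `with` block.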

In [4]:
frames = []
n_records = 0

for batch in bikeRental_batch:
    bikeRentalUrl = 'https://s3.amazonaws.com/capitalbikeshare-data/' + batch + '-capitalbikeshare-tripdata.zip'
    zfName = temp_dir + '/bike_data_' + batch + '.zip'

    # Download the archive into the temporary folder.
    with urlopen(bikeRentalUrl) as response, open(zfName, 'wb') as output:
        shutil.copyfileobj(response, output)

    # Read every top-level CSV in the archive; entries inside subfolders are skipped.
    with ZipFile(zfName) as zf:
        df_bikeTemp = pd.concat([pd.read_csv(zf.open(name)) for name in zf.namelist() if '/' not in name])

    # Collect the frames and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0).
    frames.append(df_bikeTemp)
    n_records += len(df_bikeTemp)
    print('number of records with ', batch, 'data: ', n_records)

df_bikeRental = pd.concat(frames)
df_bikeRental.to_parquet('data/raw_data/bikeRental.parquet')
shutil.rmtree(temp_dir)

number of records with  2010 data:  115597
number of records with  2011 data:  1342364
number of records with  2012 data:  3371275
number of records with  2013 data:  5926816
number of records with  2014 data:  8839966
number of records with  2015 data:  12025872
number of records with  2016 data:  15359866
number of records with  2017 data:  19117643
number of records with  201801 data:  19286233
number of records with  201802 data:  19468611
number of records with  201803 data:  19707609
number of records with  201804 data:  20036516
number of records with  201805 data:  20410631
number of records with  201806 data:  20802969
number of records with  201807 data:  21207730
number of records with  201808 data:  21611596
number of records with  201809 data:  21937396
number of records with  201810 data:  22280417
number of records with  201811 data:  22501474
number of records with  201812 data:  22660327
number of records with  201901 data:  22811107
number of records with  201902 data

  result = infer_dtype(pandas_collection)
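The download loop reads only the top-level members of each archive, using the `'/' not in name` filter. The sketch below illustrates that filter on a small in-memory archive. The file names, column names, and values are made up for illustration, and the standard `csv` module stands in for `pd.read_csv`.

```python
import csv
import io
from zipfile import ZipFile

# Build a small in-memory archive with made-up contents.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('trips_q1.csv', 'Duration,Bike number\n100,W00001\n200,W00002\n')
    zf.writestr('trips_q2.csv', 'Duration,Bike number\n300,W00003\n')
    zf.writestr('extras/notes.csv', 'ignored\n1\n')

rows = []
with ZipFile(buffer) as zf:
    # Keep only members that are not inside a subfolder.
    top_level = [name for name in zf.namelist() if '/' not in name]
    for name in top_level:
        with zf.open(name) as member:
            text = io.TextIOWrapper(member, encoding='utf-8', newline='')
            rows.extend(csv.DictReader(text))

print(top_level)
print(len(rows))
```

Only the two top-level CSV files are read; the entry under `extras/` is skipped.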
