# Data loading
This notebook initializes the project. It downloads the raw taxi trip data, preprocesses it, and converts it to an optimized HDF5 file. It covers four steps:
1. Download the archived data from a shared cloud directory
2. Unzip the received data
3. Load the data and store it as an optimized HDF5 file
4. Run simple checks on the loaded data

Note: the first run takes a few minutes because the CSV file has to be downloaded, unzipped, and preprocessed. Later runs reuse the stored results, which saves time in the future ;)

## Download the zipped CSV file and unzip it

In [4]:
import os
from urllib import request

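# Make sure the target directory exists, otherwise the download fails
os.makedirs("data/raw", exist_ok=True)

# Download the archive only if it is not already present
if not os.path.exists("data/raw/Taxi_Trips_-_2014.zip"):
    request.urlretrieve("https://onedrive.live.com/download?cid=E8998C463734AF4D&resid=E8998C463734AF4D%21142424&authkey=AJTPTaNOMKxKPE0", "data/raw/Taxi_Trips_-_2014.zip")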
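
The download-once logic above can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `download_if_missing`, the `.part` suffix, and the injectable `fetch` parameter are assumptions made for illustration. It downloads to a temporary name first, so an interrupted transfer does not leave a partial file that a later run would mistake for a complete archive.

```python
import os
from urllib import request


def download_if_missing(url, dest, fetch=request.urlretrieve):
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if dest was already present.
    The name, the ".part" suffix, and the fetch parameter are hypothetical.
    """
    if os.path.exists(dest):
        return False
    # Create the target directory so the download cannot fail on a missing folder.
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    # Write to a temporary name and rename only after the transfer finished.
    partial = dest + ".part"
    fetch(url, partial)
    os.replace(partial, dest)
    return True
```

Called with the URL and the path `data/raw/Taxi_Trips_-_2014.zip` from the cell above, it would behave like the original `if` block, with the directory check and the rename step added.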

In [7]:
import zipfile_deflate64 as zipfile

# Unzip only if the zip file exists and the CSV has not been extracted yet
if os.path.exists("data/raw/Taxi_Trips_-_2014.zip") and not os.path.exists("data/raw/Taxi_Trips_-_2014.csv"):
    with zipfile.ZipFile("data/raw/Taxi_Trips_-_2014.zip", 'r') as zip_ref:
        zip_ref.extract(member="Taxi_Trips_-_2014.csv", path="data/raw")
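
The notebook imports `zipfile_deflate64` under the name `zipfile`, presumably because the archive uses Deflate64 compression, which the standard library module does not decompress (this is an assumption about the archive, not something the notebook states). Since both expose the same `ZipFile` interface, the extract-if-missing step can be sketched with the standard library alone. The helper name `extract_if_missing` and its return value are hypothetical.

```python
import os
import zipfile  # standard library; swap in zipfile_deflate64 for Deflate64 archives


def extract_if_missing(zip_path, member, out_dir):
    """Extract one member of zip_path into out_dir unless it is already there.

    Returns True if the member was extracted, False if nothing was done.
    """
    target = os.path.join(out_dir, member)
    if os.path.exists(target) or not os.path.exists(zip_path):
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Fail early with a clear message if the archive lacks the expected file.
        if member not in zip_ref.namelist():
            raise KeyError(f"{member!r} not found in {zip_path!r}")
        zip_ref.extract(member=member, path=out_dir)
    return True
```

Checking `namelist()` before extracting is a design choice of this sketch. It turns a renamed or missing CSV inside the archive into an explicit error message.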

## Loading the DataFrame with Vaex

Vaex is used to load and preprocess the CSV file because, for large datasets on a single local machine, it outperforms both pandas and Dask.
Dask focuses on scalable computation across clusters and on data processing and wrangling, while Vaex is built for working with large datasets on a single machine and also offers features for visualizing and presenting them. (Source: https://www.datarevenue.com/de-blog/pandas-skalieren-ein-vergleich-von-dask-ray-modin-vaex-und-rapids#:~:text=Vaex%2C%20ermöglicht%20mit%20großen%20Datenmengen,diese%20möglichst%20effizient%20nutzen%20möchte.)

Documentation: https://vaex.io/docs/index.html

Comparison of Python data processing libraries: https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13

In [2]:
import vaex

In [5]:
# Load the Vaex DataFrame from the CSV and store an HDF5 file for future access.
# HDF5 is a binary, access-optimized format that lets Vaex open the complete dataset in a fraction of a second.

if not os.path.exists("data/trips/trips.hdf5"):
    # Make sure the output directory exists before converting
    os.makedirs("data/trips", exist_ok=True)
    df_vaex = vaex.from_csv("data/raw/Taxi_Trips_-_2014.csv", convert="data/trips/trips.hdf5", progress=True)
else:
    df_vaex = vaex.open("data/trips/trips.hdf5")
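
To illustrate why a binary column format opens so much faster than a CSV, here is a small standard-library sketch. It is not how Vaex is implemented, and the file name `trip_seconds.bin` is made up. The idea it shows is memory mapping: the numbers are stored in their in-memory representation, so opening the file needs no text parsing, and values are read from disk only when they are accessed.

```python
import mmap
import os
import tempfile
from array import array

# Five "Trip Seconds" values taken from the preview further down in this notebook.
values = array("d", [1080.0, 1080.0, 720.0, 480.0, 0.0])

# Write the column as raw float64 bytes (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "trip_seconds.bin")
with open(path, "wb") as f:
    values.tofile(f)

# Map the file into memory and view the bytes as float64 without parsing or copying.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)
    column = view.cast("d")
    third = column[2]    # random access to a single value
    total = sum(column)  # values are fetched only while iterating
    # Release the views before closing the mapping.
    column.release()
    view.release()
    mapped.close()
```

A CSV, by contrast, has to be read and parsed line by line before any value is usable, which is the cost the one-time conversion pays up front.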

Check that the import succeeded and get basic information about the data.

In [11]:
df_vaex.info()

column,type,unit,description,expression
Trip ID,str,,,
Taxi ID,str,,,
Trip Start Timestamp,str,,,
Trip End Timestamp,str,,,
Trip Seconds,float64,,,
Trip Miles,float64,,,
Pickup Census Tract,float64,,,
Dropoff Census Tract,float64,,,
Pickup Community Area,float64,,,
Dropoff Community Area,float64,,,

#,Trip ID,Taxi ID,Trip Start Timestamp,Trip End Timestamp,Trip Seconds,Trip Miles,Pickup Census Tract,Dropoff Census Tract,Pickup Community Area,Dropoff Community Area,Fare,Tips,Tolls,Extras,Trip Total,Payment Type,Company,Pickup Centroid Latitude,Pickup Centroid Longitude,Pickup Centroid Location,Dropoff Centroid Latitude,Dropoff Centroid Longitude,Dropoff Centroid Location
0,506646dd0685bd55094f1b7dfac97cf3b07b6126,'1099c684f9da1ac1d0dd253d394f1c8dad37bc1a53d944d...,05/17/2014 10:45:00 AM,05/17/2014 11:00:00 AM,1080.0,7.0,17031281900.0,17031061100.0,28.0,6.0,17.45,0.0,0.0,1.5,18.95,Cash,Taxi Affiliation Services,41.879255084,-87.642648998,POINT (-87.642648998 41.8792550844),41.949139771,-87.656803909,POINT (-87.6568039088 41.9491397709)
1,1462669f59c3b5a1730e8a5f511a4102d1998c21,'379e0fd9da136cabc9eec3aca37047bbdee373ca2ef7a03...,04/17/2014 06:30:00 PM,04/17/2014 07:00:00 PM,1080.0,3.0,,17031841100.0,,34.0,11.45,0.0,0.0,1.0,12.45,Cash,Taxi Affiliation Services,,,--,41.851017824,-87.635091856,POINT (-87.6350918563 41.8510178239)
2,8d699aec32ce70e3647f7b06147001d3926975cc,'6768c7ebfdee8e7e7b3f5ec44739316241895aa2f7edf10...,05/03/2014 11:30:00 PM,05/03/2014 11:45:00 PM,720.0,2.1,17031081800.0,17031081403.0,8.0,8.0,8.45,2.0,0.0,1.0,11.45,Credit Card,Dispatch Taxi Affiliation,41.89321636,-87.63784421,POINT (-87.6378442095 41.8932163595),41.890922026,-87.618868355,POINT (-87.6188683546 41.8909220259)
3,ae05f6f7a766b58b059f04c7549892da4dc3cf54,'7dc01f4be54a4058ffb81098be25f52c9f1249afc88e3ef...,05/17/2014 01:30:00 AM,05/17/2014 01:45:00 AM,480.0,0.0,17031071500.0,17031062200.0,7.0,6.0,7.85,3.0,0.0,1.0,11.85,Credit Card,Taxi Affiliation Services,41.914616286,-87.631717366,POINT (-87.6317173661 41.9146162864),41.94258518,-87.656644092,POINT (-87.6566440918 41.9425851797)
4,b45b42620c3a926ea633b9115dc624a33dcf3b56,'9f87b11be025b5dcc55e9472b8c2158664c97d193cf15d9...,04/25/2014 02:30:00 PM,04/25/2014 02:30:00 PM,0.0,0.0,,,,,42.05,7.0,0.0,0.0,49.05,Credit Card,T.A.S. - Payment Only,,,--,,,--
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37395431,40aa131f22b63a5ca71d1eff4477500e6b875036,'8c244243c4123f4b467c2931cf59a691059a45302dc13e3...,04/03/2014 08:30:00 PM,04/03/2014 08:45:00 PM,900.0,3.7,,,6.0,24.0,11.25,0.0,0.0,0.0,11.25,Cash,Taxi Affiliation Services,41.944226601,-87.655998182,POINT (-87.6559981815 41.9442266014),41.901206994,-87.676355989,POINT (-87.6763559892 41.90120699410001)
37395432,0cd3c0015580380a89b2da5d4ed01016c0399e6a,'01480513a7f6f4664d6cc4c41ff3043ae6ecbc8cb17404f...,05/13/2014 11:30:00 PM,05/13/2014 11:45:00 PM,540.0,0.0,,,8.0,6.0,11.85,3.55,0.0,0.0,15.4,Credit Card,Blue Ribbon Taxi Association Inc.,41.899602111,-87.633308037,POINT (-87.6333080367 41.899602111),41.944226601,-87.655998182,POINT (-87.6559981815 41.9442266014)
37395433,d378e1b76a0de5844b9d50a719239bcf878da1fc,'fdeaab06ce15ad69658acbc0e591218ef6146a29ef21275...,05/14/2014 12:45:00 PM,05/14/2014 12:45:00 PM,240.0,0.0,17031839100.0,17031833000.0,32.0,28.0,5.25,3.0,0.0,1.5,9.75,Credit Card,Taxi Affiliation Services,41.880994471,-87.632746489,POINT (-87.6327464887 41.8809944707),41.88528132,-87.6572332,POINT (-87.6572331997 41.8852813201)
37395434,d64938a724961d6f2c0ae58932e78e3671784822,'a42edac48945c9dab13540adee09290f11b123a513580f2...,05/14/2014 12:15:00 PM,05/14/2014 12:15:00 PM,360.0,0.0,17031081202.0,17031081300.0,8.0,8.0,5.05,0.0,0.0,0.0,5.05,Cash,KOAM Taxi Association,41.902788048,-87.62614559,POINT (-87.6261455896 41.9027880476),41.898331794,-87.620762865,POINT (-87.6207628651 41.8983317935)
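
The column listing reports both timestamp columns as `str`. As one possible additional check, the following sketch parses the timestamp format seen in the preview with the standard library. The format string is inferred from the printed rows, and the helper name `parse_trip_timestamp` is hypothetical.

```python
from datetime import datetime

# Format inferred from preview values such as "05/17/2014 10:45:00 AM".
TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_trip_timestamp(text):
    """Parse a trip timestamp string into a datetime object."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)


# Start and end timestamp of row 0 in the preview.
start = parse_trip_timestamp("05/17/2014 10:45:00 AM")
end = parse_trip_timestamp("05/17/2014 11:00:00 AM")
elapsed_seconds = (end - start).total_seconds()
```

For row 0 the difference between the two timestamps does not match the `Trip Seconds` value of 1080.0. All previewed timestamps fall on quarter hours, so they appear to be rounded. That makes `Trip Seconds` the more precise duration for later analysis.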


In [1]:
#df_vaex.describe()

In [14]:
df_vaex.close()