# Working with large data
In this extra notebook we want to show how to work with large data files (often several GB in size). In such cases the data often needs more memory than your computer can offer, so we need to think about a different strategy to tackle such large datasets (especially when working locally on our laptops).

In this notebook we suggest one way to do so using a library named [VAEX](https://vaex.io/). VAEX offers a DataFrame API similar to that of Pandas and aims at giving users the same data handling functionality with far less memory-intensive workflows. The idea of VAEX is based on memory mapping: the data lives in a single memory-mapped object, and dataframes are lightweight mappings (views) of it.
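
To make the idea of memory mapping more concrete, here is a minimal sketch using only Python's standard library. It is not how VAEX is implemented internally; the file name `values.bin` and the variable names are made up for this illustration. The sketch writes a binary file of numbers, maps it into memory and reads only the first and last five values, without loading the whole file.

```python
import mmap
import os
import tempfile
from array import array

# Write one million float64 values (8 MB) to a temporary binary file.
# The file name is hypothetical and only used for this sketch.
n_values = 1_000_000
path = os.path.join(tempfile.mkdtemp(), "values.bin")
with open(path, "wb") as f:
    array("d", (float(i) for i in range(n_values))).tofile(f)

# Map the file into virtual memory. The operating system only reads the
# pages that are actually accessed, not the whole file.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Interpret the raw bytes as float64 values without copying them.
        with memoryview(mapped).cast("d") as column:
            first_five = column[:5].tolist()
            last_five = column[-5:].tolist()

print(first_five)
print(last_five)
```

Only the two small slices are turned into Python lists; the rest of the file stays on disk until it is needed.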

## Download the dataset
To show the power of this library we will take as an example the famous yellow taxi dataset that contains taxi trip data from yellow cabs in New York City in 2019. Execute the command below and wait for the download to begin. This will take several minutes as we download 7.7 GB.

In [None]:
# Download the yellow taxi data file into the data folder.
!curl -L "https://data.cityofnewyork.us/api/views/2upf-qytp/rows.csv?accessType=DOWNLOAD" --output data/yellow_taxi_2019.csv

## Convert the data

The data is stored as a `CSV` file. __Please do not try to open this file directly, as it will probably fill up your computer's memory.__ The same would happen when reading the whole file with a plain Python file reader or with Pandas, so the conversion has to be done in a memory-friendly way. VAEX offers this by loading the file in chunks (we use chunks of 100,000 rows) and exporting each chunk to a subfolder in `HDF5` format. [`HDF5`](https://support.hdfgroup.org/HDF5/doc/H5.intro.html) is a hierarchical data format that stores the data together with its metadata in a structure that is fast to read. I/O access can be managed in parallel, and the data is organized in groups, so that one part of the data can be loaded without touching the rest. This enables very fast data access.

In [None]:
import os, vaex
from tqdm.notebook import tqdm

# Read in the first 30 rows of data and store the inferred
# data types in a dictionary.
df_first2 = vaex.from_csv('data/yellow_taxi_2019.csv', nrows = 30)
column_types = {name: (str(dtype.__name__) if isinstance(dtype, type) else 'float64') for name,dtype in df_first2.dtypes.to_dict().items()}

# Create a new directory in the data folder that holds the 
# converted data chunks for further preprocessing.
!mkdir -p data/yellow_taxi

# Read in the data in memory-friendly chunks. We clean the dataset
# on-the-fly and keep only entries with fewer than 6 passengers.
# Export the data in the data format 'HDF5'. The chunk index is
# zero-padded to three digits so that the file names sort in order.
for i, df in tqdm(enumerate(vaex.from_csv('data/yellow_taxi_2019.csv', chunk_size=100_000, dtype=column_types)), total=844):
    df = df[df.passenger_count < 6]
    df.export_hdf5(f'data/yellow_taxi/taxi_{i:03}.hdf5')
    
# Count the files exported.
print( f"# HDF5 Files in data/yellow_taxi: {len(os.listdir('data/yellow_taxi/'))}")
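
The chunked conversion above follows a general pattern: read a limited number of rows, process them, write them out, and repeat. The following sketch shows this pattern with the standard library only. It is an illustration under assumptions, not the VAEX implementation: the helper `iter_chunks`, the small `trips.csv` file and the output file names are all made up for this example.

```python
import csv
import os
import tempfile
from itertools import islice

def iter_chunks(csv_path, chunk_size):
    """Yield lists of at most `chunk_size` rows (as dicts) from a CSV file."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk

# Build a small hypothetical CSV so that the sketch is self-contained.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "trips.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trip_id", "passenger_count"])
    writer.writerows((i, i % 8) for i in range(25))

# Process the file chunk by chunk: filter the rows of each chunk and
# write them to their own output file. Only one chunk is in memory at a time.
rows_kept = 0
for i, chunk in enumerate(iter_chunks(csv_path, chunk_size=10)):
    kept = [row for row in chunk if int(row["passenger_count"]) < 6]
    rows_kept += len(kept)
    out_path = os.path.join(workdir, f"chunk_{i:03}.csv")
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["trip_id", "passenger_count"])
        writer.writeheader()
        writer.writerows(kept)

chunk_files = sorted(name for name in os.listdir(workdir) if name.startswith("chunk_"))
print(chunk_files, rows_kept)
```

The peak memory use depends on the chunk size and not on the size of the input file, which is what makes this approach suitable for files larger than RAM.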

In [None]:
# Check the folder size.
!du -h data/yellow_taxi

The exported chunk files now occupy about 14 GB of disk space, roughly twice the size of the `CSV` file. This is largely because `HDF5` stores the columns in an uncompressed binary layout, together with the metadata that describes the file structure. The cost is negligible: disk space is cheap, while memory is expensive and the resource to be guarded. However, to use the full power that `HDF5` offers, we have to create a single file in this data format. This is done in the next cells.

In [None]:
# Import all .hdf5 chunks into a single dataset and
# export it to a single .hdf5 file we can work on
df = vaex.open('data/yellow_taxi/taxi*')
df.export_hdf5('data/yellow_taxi_aggregated.hdf5')

# Remove the temporary file directory from the data 
# preprocessing step.
!rm -r data/yellow_taxi

## Reading the data

After the data preprocessing has been done, we can finally open the data file and start with the data transformation and analysis. For this, VAEX offers the `open()` function, which opens an `HDF5` file without loading it into memory.

In [None]:
# Open the HDF5 file and show the dimensions
df = vaex.open('data/yellow_taxi_aggregated.hdf5')
print(f'Data dimensions: {df.shape}')

Over 80 million entries. This is impressive. Let us preview the data in the next cell.

In [None]:
# Preview the data
df

Why is it so fast? This is due to the fact that it reads in only the HDF5 metadata (i.e. path, data structure, file description, etc.) and not the data itself. Then, displaying the data requires only the first and last 5 rows to be read from disk. VAEX only goes over the entire dataset if needed. 

For a good starting point we always look at the data summaries so let us see how this works with VAEX on a large dataset. 

In [None]:
# Show some summary statistics
# NOTE: This will probably raise a warning about
# division by NaN. You can safely ignore this warning.
df.describe()

All of these statistics are calculated in a single pass over the data. This illustrates nicely how efficiently VAEX works: other libraries would need far more computing resources, while VAEX needs only a small amount of RAM to run this operation.
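
How can several statistics be computed in one pass? One common technique is Welford's online algorithm, sketched below in plain Python. This only illustrates the idea; we do not claim that VAEX uses exactly this algorithm, and the function name `describe_single_pass` is made up for this example. The standard deviation here is the population standard deviation.

```python
import math

def describe_single_pass(values):
    """Compute count, mean, std, min and max in one pass (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    minimum = math.inf
    maximum = -math.inf
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
        minimum = min(minimum, x)
        maximum = max(maximum, x)
    std = math.sqrt(m2 / count) if count else math.nan
    return {"count": count, "mean": mean, "std": std, "min": minimum, "max": maximum}

# A generator can be consumed only once, so this really is a single pass
# and the values are never all held in memory at the same time.
stats = describe_single_pass(float(x) for x in range(1, 101))
print(stats)
```

Because only a handful of running totals are kept, the memory needed does not grow with the number of rows.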

## Data manipulations
We can see a number of different variables here, 38 altogether, for taxi trips in NYC in 2019. Quite often we want to filter the dataset for certain groups of values. Here, for example, we might want to filter the dataset by the number of passengers:

In [None]:
# Filter the dataset for entries with certain passenger counts and
# display a preview.
df_filtered = df[(df.passenger_count > 0) & (df.passenger_count < 5)]
df_filtered

Why does this work so fast when we actually have over 80 million entries in the dataset?

> It does so because VAEX follows a zero-memory-copy policy,

i.e. filtering the data does not create a copy of it but simply a mapping. `df_filtered` takes almost no extra memory; it is a [shallow copy](https://stackoverflow.com/questions/184710/what-is-the-difference-between-a-deep-copy-and-a-shallow-copy). So instead of copying the data and creating an object that holds the filtered data, VAEX creates a boolean mask that is applied to the original data. As a rough comparison: one needs around 1.2 GB of memory to filter a dataset of around 1 billion rows that might take around 100 GB on disk. Other libraries would have to use another 100 GB of memory for the filtered copy.
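
The difference between copying and masking can be sketched in a few lines of plain Python. The class `FilteredView` below is a toy written for this illustration and is not part of VAEX: it keeps a reference to the original data and stores one mask byte per row.

```python
class FilteredView:
    """A filtered 'copy' that keeps a reference to the data plus a byte mask."""

    def __init__(self, data, predicate):
        self.data = data  # reference only, the rows are not copied
        self.mask = bytearray(1 if predicate(x) else 0 for x in data)

    def __len__(self):
        # Number of rows that pass the filter.
        return sum(self.mask)

    def __iter__(self):
        # Rows are selected lazily while iterating over the original data.
        return (x for x, keep in zip(self.data, self.mask) if keep)

passenger_count = [i % 7 for i in range(100_000)]
view = FilteredView(passenger_count, lambda n: 0 < n < 5)

print(len(view))                     # number of rows that pass the filter
print(view.data is passenger_count)  # True: the original data is shared
print(len(view.mask))                # one mask byte per original row
```

With one byte per row, the mask for about one billion rows would take about 1 GB, regardless of how many columns the dataset has.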

### Transforming data

If we transform data, we usually operate on the columns and create new ones. In doing so we usually create more data objects (e.g. Pandas Series) to hold the transformed data. With more than 80 million entries, as we have here, this is a significant increase in the memory needed. To avoid this, VAEX follows its zero-memory-copy policy and works with so-called virtual columns. Virtual columns work as follows: if we create a column from other columns in the dataframe, VAEX simply stores the expression and evaluates it on-the-fly when the column is needed. This policy is called [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation).

In [None]:
# Create a new column with percentage values
df['tip_percentage'] = df.tip_amount / df.total_amount * 100
df.tip_percentage

In [None]:
# Show the new column in a group
df[['fare_amount', 'total_amount', 'tip_amount', 'tip_percentage']].head(5)
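
To illustrate what a virtual column does behind the scenes, here is a toy sketch in plain Python. The class `LazyFrame` and its methods `set_virtual` and `head` are hypothetical names invented for this example; they are not the VAEX API. The point is that defining the column stores only the expression, and values are computed only for the rows that are actually requested.

```python
class LazyFrame:
    """Toy dataframe: real columns hold data, virtual columns hold expressions."""

    def __init__(self, **columns):
        self.columns = dict(columns)  # materialised data
        self.virtual = {}             # column name -> function of one row
        self.evaluations = 0          # counts how often an expression ran

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def set_virtual(self, name, expression):
        # Only the expression is stored; nothing is computed here.
        self.virtual[name] = expression

    def head(self, name, n=5):
        if name in self.columns:
            return self.columns[name][:n]
        expression = self.virtual[name]
        result = []
        for i in range(min(n, len(self))):
            row = {key: values[i] for key, values in self.columns.items()}
            result.append(expression(row))
            self.evaluations += 1
        return result

frame = LazyFrame(
    tip_amount=[1.0, 4.0, 0.0, 3.0],
    total_amount=[4.0, 8.0, 5.0, 12.0],
)
frame.set_virtual(
    "tip_percentage",
    lambda row: row["tip_amount"] / row["total_amount"] * 100,
)
evaluations_after_definition = frame.evaluations
preview = frame.head("tip_percentage", n=2)

print(evaluations_after_definition)  # no work was done when defining the column
print(preview)
print(frame.evaluations)             # only the previewed rows were evaluated
```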

The evaluation of such virtual columns in VAEX is very fast, as it is written in `C++` and runs in parallel. Aggregation functions and other operations are fast in VAEX for the same reason. We consider an example where we want to count the trips that took place during the year 2019 for each number of passengers.

In [None]:
# Count the number of trips in each passenger number group.
df.passenger_count.value_counts(progress='widget')
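
An aggregation like this can be split into independent pieces: count the values in each chunk separately and then merge the partial results. The sketch below shows this pattern with the standard library. It is only an illustration of the idea and not the VAEX implementation; the function names are made up, and Python threads are used here just to show the structure (for pure Python code they do not give a real speed-up).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Count the values of a single chunk."""
    return Counter(chunk)

def chunked_value_counts(values, chunk_size=1000, workers=4):
    """Count values chunk by chunk and merge the partial counts."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total

# Hypothetical passenger counts: the values 1 to 4, evenly distributed.
passenger_count = [i % 4 + 1 for i in range(10_000)]
counts = chunked_value_counts(passenger_count)
print(sorted(counts.items()))
```

Each partial result is tiny compared to the data, so merging is cheap no matter how many rows were counted.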

Next, we show how to plot results from large data using [matplotlib](https://matplotlib.org/) and [seaborn](https://seaborn.pydata.org/introduction.html) together with VAEX.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Get the number of trips for each number of passengers
passengers_num = df.passenger_count.value_counts(progress=True)

# Plot the results in a bar chart.
plt.figure(figsize=(10,4))
sns.barplot(x=passengers_num.index, y=np.log10(passengers_num.values))
plt.xlabel('Number of passengers')
plt.ylabel('log10(number of trips)')
plt.xticks(rotation=45)
plt.show()

You now have some insight into how to work with large data files (often larger than your memory). Check out the [VAEX API](https://vaex.readthedocs.io/en/latest/api.html) and read through the documentation to learn more.