# Progressive Loading and Visualization with Issues

This notebook shows simple code that downloads and visualizes all the New York Yellow Taxi trips from January 2015.
The trip data is stored in multiple CSV files containing geolocated taxi trips.
We progressively visualize the pickup locations (where passengers were picked up by the taxis).
Unfortunately, with big data, unexpected results can happen.

In [1]:
# We make sure the libraries are reloaded when modified, and avoid warning messages
# %load_ext autoreload
# %autoreload 2
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Some constants we'll need: the data file to download and final image size
LARGE_TAXI_FILE = "https://www.aviz.fr/nyc-taxi/yellow_tripdata_2015-01.csv.bz2"
RESOLUTION = 512

## Create Modules
First, create the five modules we need.

In [3]:
from progressivis import CSVLoader, Histogram2D, Min, Max, Heatmap

# Create a CSVLoader module, a Min and Max module, a Histogram2D module, and a Heatmap module.

# The CSV Loader only loads two columns of interest here with the 'usecols' keyword.
csv = CSVLoader(LARGE_TAXI_FILE, usecols=['pickup_longitude', 'pickup_latitude'])
min = Min()
max = Max()
# The Histogram2D module computes a 2D histogram of the two columns with RESOLUTION x RESOLUTION bins
histogram2d = Histogram2D('pickup_longitude', 'pickup_latitude', xbins=RESOLUTION, ybins=RESOLUTION)
heatmap = Heatmap()

## Connect Modules

Then, connect the modules.

In [4]:
# Now, connect the modules to create the Dataflow graph
# Min and Max take a table as input and output the min/max values of all its numeric columns
min.input.table = csv.output.result
max.input.table = csv.output.result
histogram2d.input.table = csv.output.result
histogram2d.input.min = min.output.result
histogram2d.input.max = max.output.result
# Connect the Histogram2D to the Heatmap module
heatmap.input.array = histogram2d.output.result
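
To make the role of each connection concrete, here is a minimal NumPy sketch of the same idea outside ProgressiVis: data arrives in chunks, the min and max of the data seen so far define the histogram bounds, and the 2D histogram is recomputed within those bounds. This only illustrates the dataflow under simplifying assumptions and is not how ProgressiVis implements its modules; the function name `progressive_histograms` and the synthetic coordinates are hypothetical.

```python
import numpy as np

def progressive_histograms(chunks, bins):
    """Yield a 2D histogram after each chunk, bounded by the min/max seen so far."""
    seen_x, seen_y = [], []
    for x, y in chunks:
        seen_x.append(np.asarray(x, dtype=float))
        seen_y.append(np.asarray(y, dtype=float))
        all_x = np.concatenate(seen_x)
        all_y = np.concatenate(seen_y)
        # Plays the role of the Min and Max modules feeding the histogram
        bounds = [[all_x.min(), all_x.max()], [all_y.min(), all_y.max()]]
        counts, _, _ = np.histogram2d(all_x, all_y, bins=bins, range=bounds)
        yield counts

# Synthetic stand-in for three chunks of pickup coordinates (not real taxi data)
rng = np.random.default_rng(0)
chunks = [
    (rng.uniform(-74.05, -73.75, size=500), rng.uniform(40.60, 40.90, size=500))
    for _ in range(3)
]
results = list(progressive_histograms(chunks, bins=16))
for counts in results:
    print(counts.shape, int(counts.sum()))
```

Each intermediate histogram covers every point received so far, which is why the image can be refreshed before the whole file has been loaded.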

## Display the Heatmap

The displayed image only shows two small points instead of revealing the map of Manhattan.
This is because a few taxi trips go to Florida and other faraway locations with their meter on. Since the histogram bounds come from the minimum and maximum coordinates, the image is scaled down to show most of the US instead of Manhattan only.
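
The effect can be reproduced on a small scale with NumPy alone. The sketch below uses synthetic coordinates (a uniform cloud roughly around New York and one made-up, Florida-like outlier), not the taxi data, and counts how many histogram bins are non-empty when the bounds are taken from the data's min and max. The helper `nonempty_bins` is a hypothetical name introduced for this illustration.

```python
import numpy as np

def nonempty_bins(x, y, bins=64):
    """Count non-empty bins of a 2D histogram bounded by the data's min/max."""
    bounds = [[x.min(), x.max()], [y.min(), y.max()]]
    counts, _, _ = np.histogram2d(x, y, bins=bins, range=bounds)
    return int(np.count_nonzero(counts))

# Synthetic pickup locations, roughly around New York (assumed values)
rng = np.random.default_rng(0)
lon = rng.uniform(-74.05, -73.75, size=10_000)
lat = rng.uniform(40.60, 40.90, size=10_000)

clean = nonempty_bins(lon, lat)

# Add a single far-away point (made-up, Florida-like coordinates)
lon_out = np.append(lon, -80.2)
lat_out = np.append(lat, 25.8)
skewed = nonempty_bins(lon_out, lat_out)

print("non-empty bins without outlier:", clean)
print("non-empty bins with outlier:   ", skewed)
```

A single outlier stretches the bounds so much that nearly all points fall into a handful of bins, which matches the few bright points seen in the heatmap.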

In [5]:
heatmap.display_notebook()

VBox(children=(HBox(children=(IntProgress(value=0, description='0/0', max=1000), Button(description='Save', ic…

## Start the scheduler

In [6]:
csv.scheduler.task_start()

<Task pending name='Task-15' coro=<Scheduler.start() running at /home/fekete/src/progressivis/progressivis/core/scheduler.py:277>>

Starting scheduler
# Scheduler added module(s): ['csv_loader_1', 'heatmap_1', 'histogram2_d_1', 'max_1', 'min_1']


## Show the modules
Printing the scheduler shows all the modules and their states.

In [7]:
csv.scheduler

Id,Class,State,Last Update,Order
csv_loader_1,csv_loader,state_ready,71,0
min_1,min,state_blocked,72,1
max_1,max,state_blocked,73,2
histogram2_d_1,histogram2_d,state_blocked,69,3
heatmap_1,heatmap,state_blocked,70,4


## Stop the scheduler
To stop the scheduler, uncomment the line in the next cell and run it.

In [None]:

# csv.scheduler.task_stop()