# barclay-viz

## Where to find the raw data

TFL download link

## Pre-processing steps

- Extract the raw data.

- CSV conversion: load in Excel or LibreOffice, then save as CSV.

- You may end up with Excel's CR line endings (`^M` in vim); convert them on Linux/macOS with:

  ```
  tr "\r" "\n" < input.csv > output.csv
  ```

- When building an aggregated dataset, remember to strip the headers from the CSV files (see the sketch after this list).
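
A minimal sketch of that aggregation step, assuming the monthly extracts live under `inputs/csv/` (the path used by the awk hack later in this README) and each carries a single header row:

```bash
#!/usr/bin/env bash
# Concatenate all monthly extracts into one file, dropping the
# header row of each input. Paths are assumptions.
: > aggregated_data.csv
for f in inputs/csv/*.csv; do
  tail -n +2 "$f" >> aggregated_data.csv  # skip line 1, the header
done
```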

## Data processing

### Generate list of Boris Stations

#### Option 1 (derive from recent list)

1. `cat inputs/boris_station_snapshot.xml | ./parse_stations.rb outputs/boris_stations.csv` (a sketch of an equivalent command follows this list)
2. `csvfix sql_insert -t stations -f 1:logical_terminal,2:full_name src_stations.03.uniq.clean.csv > src_stations.sql`
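
`parse_stations.rb` is not reproduced in this README; for orientation, here is an equivalent sketch using `xmlstarlet`, assuming the snapshot follows the TfL live-feed layout in which each `<station>` element carries `<terminalName>` and `<name>` children (the tool and the layout are both assumptions). Note that real station names can contain commas, so the Ruby script presumably handles CSV quoting; this sketch does not.

```bash
# Pull (terminal, name) pairs out of the station snapshot XML.
# Element names are assumptions based on the TfL live feed.
xmlstarlet sel -t -m '//station' \
  -v 'terminalName' -o ',' -v 'name' -n \
  inputs/boris_station_snapshot.xml > outputs/boris_stations.csv
```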

#### Option 2 (derive from source data)

1. `csvfix order -f 9,10 aggregated_data.csv > src_stations.01.orig.csv`
2. `csvfix order -f 14,15 aggregated_data.csv >> src_stations.01.orig.csv`
3. `cat src_stations.01.orig.csv | sort | uniq > src_stations.02.uniq.csv`
4. Remove leading zeros (edit by hand, `sed`, or Google Refine; see the sketch after this list).
5. Change non-numeric terminal IDs to numeric.
6. `cat src_stations.02.uniq_edit.csv | sort | uniq > src_stations.03.uniq.clean.csv`
7. `csvfix sql_insert -t stations -f 1:logical_terminal,2:full_name src_stations.03.uniq.clean.csv > src_stations.sql`
8. Remove duplicates caused by changed names (cross-reference the recent list of stations).
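
For step 4, a hedged `sed` one-liner, assuming the logical terminal ID is the first field (as in the `csvfix -f 1:logical_terminal` mapping) and the stray zeros sit at the start of it:

```bash
# Trim leading zeros from field 1, keeping at least one digit.
sed -E 's/^0+([0-9])/\1/' src_stations.02.uniq.csv > src_stations.02.uniq_edit.csv
```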

#### Notes

Option 2 yields a number of duplicate names, which are rejected on DB insertion. In each case there is a reasonable replacement in the recent data, suggesting that the logical terminal IDs have been retained for modified or renamed stations. The differences are:

- Marylebone Flyover -> Paddington Green Police Station
- Gloucester Slips Car Park -> London Zoo Car Park
- Wapping Lane 2 -> Wapping Lane
- 6 -> Mechanical Workshop PS

Comparing a sample of 10 records between the two suggests the downloaded source can be used reliably as a list of logical terminal IDs. However, it should be augmented with additional records from the source data, such as `NA`, `0`, and `Tabletop1`.

1. Load into the DB (use SQL Workbench or similar):

   ```
   mysql -utfl -f tfl_bike < stations.downloaded.sql
   mysql -utfl -f tfl_bike < stations.source.sql
   ```

   Duplicates will be filtered from the second file by the `-f` (force) option, which makes mysql continue past insert errors.
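
The schema itself is not shown in this README; below is a minimal sketch of a `stations` table consistent with the steps above. Column names come from the `csvfix -f` mapping; the types and the unique key on `full_name` (which is what makes duplicate names fail and be skipped under `-f`) are assumptions.

```bash
mysql -utfl tfl_bike <<'SQL'
-- Hypothetical schema; adjust types to match the real one.
CREATE TABLE IF NOT EXISTS stations (
  logical_terminal INT          NOT NULL PRIMARY KEY,
  full_name        VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_full_name (full_name)
);
SQL
```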

### Generate journey data

1. Ensure the schema is loaded into the MySQL DB (a hypothetical sketch of the `journeys` table follows this list).

2. Fix references to non-numeric terminal IDs:

   ```
   csvfix edit -f 9,14 -e 's/^Tabletop1$/-2/' aggregated_data.csv > aggregated_data.01.fix_table.csv
   csvfix edit -f 9,14 -e 's/^NA$/-1/' aggregated_data.01.fix_table.csv > aggregated_data.02.fixed.csv
   ```

3. Fix up the dates for insertion:

   ```
   csvfix edit -f 7,12 -e "s|\(.*\)|STR_TO_DATE('\1', '%e/%c/%Y %T')|" aggregated_data.02.fixed.csv > aggregated_data.03.date_func.csv
   ```
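
   For example, a source value of `15/1/2012 10:23:00` becomes `STR_TO_DATE('15/1/2012 10:23:00', '%e/%c/%Y %T')`, which MySQL evaluates to a DATETIME at insert time (the `%e/%c/%Y %T` format string reads day/month/year followed by a 24-hour time; the sample value itself is illustrative).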

4. Use csvfix to generate the SQL statements (`-nq 7,11` marks the two timestamp columns as non-quoted, so the `STR_TO_DATE(...)` expressions reach MySQL as expressions rather than strings):

   ```
   csvfix sql_insert -t journeys -f 1:rental_id,2:billable_duration,3:duration,4:customer_record_number,5:subscription_id,6:bike_id,7:end_timestamp,8:end_station_id,9:end_station_logical_term,11:end_station_priority_id,12:start_timestamp,13:start_station_id,14:start_station_logical_term,16:start_station_priority_id,17:end_hour_category_id,18:start_hour_category_id,19:bike_user_type_id -nq 7,11 aggregated_data.03.date_func.csv > aggregated_data.import.sql
   ```

5. Load into the DB:

   ```
   mysql -utfl -f tfl_bike < aggregated_data.import.sql > import.out 2>&1 &
   ```
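
As with the stations table, the `journeys` schema is not reproduced here; a hypothetical sketch consistent with the csvfix column mapping above (column names come from that mapping, all types are assumptions):

```bash
mysql -utfl tfl_bike <<'SQL'
-- Hypothetical schema; adjust types to match the real one.
CREATE TABLE IF NOT EXISTS journeys (
  rental_id                  INT PRIMARY KEY,
  billable_duration          INT,
  duration                   INT,
  customer_record_number     INT,
  subscription_id            INT,
  bike_id                    INT,
  end_timestamp              DATETIME,
  end_station_id             INT,
  end_station_logical_term   INT,
  end_station_priority_id    INT,
  start_timestamp            DATETIME,
  start_station_id           INT,
  start_station_logical_term INT,
  start_station_priority_id  INT,
  end_hour_category_id       INT,
  start_hour_category_id     INT,
  bike_user_type_id          INT
);
SQL
```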

## CSV hacks

### Summarize journeys directly from CSV

- Run the data through `summ_by_X.rb`.
- There are separate scripts for day versus journey summaries.
- Input is read from STDIN.
- Specify the output CSV file as the first parameter (see the usage sketch below).
- The day script requires records ordered by timestamp.
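
A hedged usage sketch, assuming the day variant is named `summ_by_day.rb` (the real scripts substitute day or journey for the `X`):

```bash
# STDIN carries the journey CSV; the first argument names the output
# file. Input must already be ordered by timestamp for the day
# script. The script filename is an assumption.
cat aggregated_data.csv | ./summ_by_day.rb outputs/day_summary.csv
```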

### Generate ranked list of customers (for filtering)

```
awk -F"," '{print $4}' inputs/csv/*.csv | sort | uniq -c | sort -rn > outputs/riders/top_customer.ids
```

(Field 4 is the customer record number, per the csvfix mapping above; `sort -rn` puts the heaviest riders first.)

## Creating Gephi visualizations

### Prepare data for Gephi (non-DB)

- Filter the journeys down to a manageable set (e.g. top 2k / popular routes; eliminate missing bikes).
- Create an edge table with 'Source', 'Target', and 'Weight' columns (see the sketch after this list). Other attributes can be added for partitioning, etc.
- Prepare for import, e.g. store as CSV for import into Gephi.
- Extract the boris station list for Gephi use. It must have an `id` column.
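
A hedged sketch of the edge-table step, assuming the start and end logical terminal IDs sit in fields 14 and 9 of the aggregated CSV (as in the csvfix mappings above) and that the journey count per station pair serves as the edge weight:

```bash
# Build a Source,Target,Weight edge table for Gephi by counting
# journeys per (start, end) station pair. Field positions and the
# output path are assumptions.
echo "Source,Target,Weight" > outputs/gephi_edges.csv
awk -F"," '{ n[$14 "," $9]++ } END { for (k in n) print k "," n[k] }' \
  aggregated_data.02.fixed.csv >> outputs/gephi_edges.csv
```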

### Import data into Gephi

- Open Gephi and go to the Data Laboratory.
- Import spreadsheet - straight edge table.
- Import spreadsheet - node list.
- Check for added duplicates (sort by name, look for blank lat/long).
- Generate area labels (copy using regex). Step 1 == `,.*$` ... Step 2 == `[^,].*`

### Visualize

- Apply edge and node ranking ...
- Apply clustering, then use it to partition ...

### Notes

Labels in the `gephi_project/complete_viz/bojo_viz3_areas_deletions.gephi` visualization have been merged for consistency.

## TODOs

### Web visualization

- Move all controls into panels DONE
- Add hover info panel DONE
- Add reset button DONE
- Make #journeys threshold absolute ?
- Try fixed-range scaling for edges (2-20?) DONE
- Try different edge opacity DONE
- Try edge colour range based on weight or time x
- Scale nodes? DONE
- Fix count slider events DONE
- Fix new dataset filtering / reset DONE
- Finalize 1440 vs 1439 design