# Nonparametric Bayesian label prediction on a graph

This script generates the plots and parts of the storyline for a presentation of the doctoral thesis "Nonparametric Bayesian label prediction on a graph" aimed at non-specialists. It includes an analysis of simulated station traffic data on the Transport for London network (tube, overground, DLR, tramlink). The edges in the network are actual line connections, plus links between stations that are within walking distance of each other, as per the [TfL network](http://content.tfl.gov.uk/large-print-tube-map.pdf) (May 2019).

* The data folder contains the edge list of the network and the station coordinates and zones in .csv format. The Lines.xlsx file contains the edge list with a separate sheet per line; it is not used in our script, but is included for readability.
* The modules folder contains thesis_presentation_aux.py, a module with auxiliary functions used in this notebook.
* This notebook contains the storyline, plots and example analysis of the TfL network.

## Introduction

My research is about label prediction on a graph. A graph, or network, I encounter or hear about almost every day is the tube in London. This presentation will introduce some of the concepts related to my thesis to give you some idea of its contents.

Let's have a look at the TfL (Transport for London) network. This contains the tube, DLR, cable car, overground, TfL rail and trams:

![TfL map](data/tube-map.gif)

We can also represent the TfL network using a node for each station and connecting two nodes if there is a line, an internal interchange or a walk of under 10 minutes between them.
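As a minimal sketch of this representation, the snippet below builds an undirected graph from a list of station pairs. The handful of edges is a hand-picked illustration, not the edge list from the data folder, and `build_graph` is a hypothetical helper name, not a function from thesis_presentation_aux.py.

```python
from collections import defaultdict

# Illustrative edges only; the real edge list lives in the data folder and
# its file name and column layout are not assumed here.
edges = [
    ("Oxford Circus", "Bond Street"),
    ("Oxford Circus", "Tottenham Court Road"),
    ("Oxford Circus", "Green Park"),
    ("Bond Street", "Green Park"),
    ("Bank", "Monument"),  # an internal interchange rather than a line
]


def build_graph(edge_list):
    """Return an undirected graph as a dict: station -> set of neighbors."""
    graph = defaultdict(set)
    for a, b in edge_list:
        # Each connection works in both directions.
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


graph = build_graph(edges)
print(sorted(graph["Oxford Circus"]))
```

Storing the neighbors as sets means a pair of stations that is connected by several lines still counts as a single edge.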

We use the (geographic) coordinates of each station to make a picture similar to the above map. This will be the representation of the TfL network we work with.

One of the features of the TfL network is that it is very busy in peak hours. We represent how crowded a station is on a red-amber-green color scale with red being overcrowded and green being quiet. Let's say we have collected information about the stations in the network. Some stations are busy (red) and other stations are quiet (green). For half of the stations we have no information (light grey). Additionally, the information we have might have measurement errors. 

In my research, I have developed methods to complete this picture and get rid of the noise. We will use the incomplete and noisy observation of station crowdedness to predict the crowdedness of the entire network. The underlying assumption of these methods is that stations that are close to each other in the network should be about equally busy.
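To make the underlying assumption concrete, here is a deliberately naive sketch that fills in unobserved stations by repeatedly averaging over their neighbors. It is not the nonparametric Bayesian method from the thesis: it keeps the observed values fixed, so it does not remove noise, and it only illustrates the idea that neighboring stations should be about equally busy. The toy graph, the 0-to-1 crowdedness scale and the function name are all assumptions made for this example.

```python
# Hypothetical toy network: station -> set of neighboring stations.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}

# Assumed scale: 0 = quiet (green), 1 = overcrowded (red).
# Stations missing from this dict are the "light grey" ones.
observed = {"A": 0.9, "B": 0.8, "E": 0.1}


def neighbor_average(graph, observed, n_sweeps=50):
    """Fill in unobserved stations by repeated averaging over neighbors."""
    # Start every unobserved station at 0.5, i.e. "no idea yet".
    values = {s: observed.get(s, 0.5) for s in graph}
    for _ in range(n_sweeps):
        for s in graph:
            if s not in observed:
                values[s] = sum(values[t] for t in graph[s]) / len(graph[s])
    return values


estimate = neighbor_average(graph, observed)
for station in sorted(estimate):
    print(station, round(estimate[station], 3))
```

Station C sits next to two busy stations and ends up busier than station D, which is pulled towards its quiet neighbor E.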

## Scale and frequency

To estimate the correct and full colors on the TfL network we interpret our data as a noisy observation of a signal.

In the above picture, we are interested in finding the true signal (blue) from our noisy observations (dots). In our analogy, this picture corresponds to the partially colored picture of the TfL network.

To find the true signal, we try to build it from basic signals. For a function on the line (as above), these basic signals correspond to frequencies.
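The idea of building a signal from basic signals can be sketched as follows, assuming cosine waves on an evenly spaced grid as the basic signals. The grid size, the example signal and the function names are choices made for this illustration only, and the example leaves out the noise to keep it short.

```python
import math

n = 64  # number of grid points on the line (illustrative choice)
grid = [(i + 0.5) / n for i in range(n)]


def basic_signal(k):
    """k-th cosine 'frequency' on the grid; k = 0 is the constant signal."""
    return [math.cos(math.pi * k * x) for x in grid]


def coefficient(signal, k):
    """How much of the k-th basic signal is present in `signal`."""
    basis = basic_signal(k)
    norm = sum(b * b for b in basis)
    return sum(s * b for s, b in zip(signal, basis)) / norm


def reconstruct(signal, n_freq):
    """Approximate `signal` using only the first n_freq basic signals."""
    approx = [0.0] * n
    for k in range(n_freq):
        c = coefficient(signal, k)
        approx = [a + c * b for a, b in zip(approx, basic_signal(k))]
    return approx


# Hypothetical true signal made of frequencies 0, 1 and 3.
truth = [
    1.0 + 0.5 * math.cos(math.pi * x) - 0.3 * math.cos(3 * math.pi * x)
    for x in grid
]


def max_error(n_freq):
    """Largest gap between the truth and its n_freq-term approximation."""
    approx = reconstruct(truth, n_freq)
    return max(abs(t - a) for t, a in zip(truth, approx))


for n_freq in (1, 2, 4):
    print(n_freq, max_error(n_freq))
```

Using only the constant signal gives a rough approximation, and the error shrinks as more frequencies are included. With the first four basic signals the example signal is recovered up to rounding error, because it contains no higher frequencies.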

As you can see, we start with a simple constant signal. Low frequencies correspond to smooth signals, whereas high frequencies correspond to rough signals. If we include higher frequencies, we can build more complicated functions. On the TfL network, frequencies correspond to how often the signal on the network changes color.
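One simple way to make "how often the signal changes color" concrete is to count the connections whose two end stations have different colors. The sketch below does this on a hypothetical five-station loop. This count is an illustration and not necessarily the notion of frequency used in the thesis.

```python
# Hypothetical toy network: five stations connected in a loop.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]


def color_changes(edges, colors):
    """Count the edges whose two end stations have different colors."""
    return sum(1 for a, b in edges if colors[a] != colors[b])


# Three colorings of the same network, from constant to rough.
constant = {s: "green" for s in "ABCDE"}
smooth = {"A": "green", "B": "green", "C": "green", "D": "red", "E": "red"}
rough = {"A": "green", "B": "red", "C": "green", "D": "red", "E": "green"}

for name, colors in [("constant", constant), ("smooth", smooth), ("rough", rough)]:
    print(name, color_changes(edges, colors))
```

The constant coloring never changes color along an edge, the smooth one changes on two edges, and the rough one changes on four of the five edges, which mirrors the low-to-high frequency picture on the line.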