Hacking the Paris metro
Latest commit 1b5a9e9 Jun 6, 2017

Big data analysis of real-time RATP data

This project uses real data from RATP (Régie Autonome des Transports Parisiens, the state-owned public transport operator in Paris) to predict the lateness of trains on the RER B line.

The work consists of three parts:

  • Acquiring Data

    We first need to acquire the data with the help of the RATP API as well as some RATP docs that were provided.

    How to use the API of RATP?

  • Data Extraction

    Then we need to retrieve the data from log files.

  • Machine Learning

    Finally, we can apply some machine learning methods to predict the lateness of trains.


Acquiring Data

What we need is the estimated and real arrival times of a mission at every station. However, with the RATP API we can neither get the real arrival time directly nor track a specific mission. Still, almost all the information is included in the API response as stationsMessages.

So we only need to monitor all the stations with the getMissionsNext method and analyze the log file later.

getMissionsNext

To use the getMissionsNext method (example), we need to pass the following parameters:

parameter          value
station line id    'RB' (real) or '78' (theoretical)
station name       like Lozere
direction sens     A or R
limit              1 or 2, 3, ...

where:

  • if line id is 'RB'

    A means from Saint Remy les Chevreuse to Aéroport CDG

    R means from Aéroport CDG to Saint Remy les Chevreuse

  • if line id is '78' (It's different from 'RB')

    A means from Aéroport CDG to Saint Remy les Chevreuse

    R means from Saint Remy les Chevreuse to Aéroport CDG

And the response has a structure like:

properties                   comments
missions id                  id of the mission, like 'EPOL'
missions stationsDates       estimated arrival time, like '201705160743'
missions stationsStops       true if the train stops at this station
missions stationsMessages    messages like '07:43', "Train à l'approche", 'Train à quai', etc.
perturbations level          type of perturbation, like 'info', 'alert', 'critical', etc.
perturbations message        message-text includes the message of the perturbation

Remark: when we ask for the theoretical timetable with a numerical id like '78', stationsDates has two elements: the first is the theoretical arrival time at the station and the second is the theoretical arrival time at the mission's terminus.
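The stationsDates values can be turned into Python datetime objects before any arithmetic. The sketch below assumes the layout is YYYYMMDDhhmm, which is inferred from the example '201705160743' and not confirmed by the API documentation; the helper names are hypothetical.

```python
from datetime import datetime

def parse_ratp_time(stamp):
    """Parse a timestamp like '201705160743' (assumed YYYYMMDDhhmm)."""
    return datetime.strptime(stamp, "%Y%m%d%H%M")

def seconds_until(stamp, now):
    """Seconds between `now` and the arrival time encoded in `stamp`."""
    return (parse_ratp_time(stamp) - now).total_seconds()

arrival = parse_ratp_time("201705160743")
print(arrival.isoformat())  # 2017-05-16T07:43:00
```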

Monitor the Stations

The use of the API is in fact limited by the number of requests per month: we can send at most 30 000 000 requests per month, which is about 11.5 requests per second. That is quite enough, but due to some IP issues we may be sharing this quota with the whole school, so it is better to reduce the number of requests where possible.

We used to send a series of requests for all stations every 30 seconds, which is enough to observe changes of messages such as "Train à l'approche" or 'Train à quai'. But each station's situation is different: if the next train is 15 minutes away, there is no need to send requests every 30 seconds.

Monitors

So we use the multiprocessing package to create an individual Process (monitor) for each station. For each station, we analyze the stationsMessages:

  • If msg[2] == ':', the message is a time like '07:43' and the train is still far away. The estimated arrival time is given in stationsDates; suppose it is time2wait seconds from now.

    • If time2wait is greater than 60, we call time.sleep(min(time2wait - 60, sleepMax)).

      We wait until one minute before the arrival, but at most sleepMax seconds, to avoid blocking for a long time.

  • Otherwise

    we call time.sleep(sleeptime), i.e. we wait for sleeptime seconds.
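The waiting rule above can be sketched as a small pure function, which makes it easy to check without actually sleeping. The function name and the default values are hypothetical (30 seconds echoes the polling period mentioned earlier, 300 seconds is an arbitrary cap), and the sketch assumes that a time-like message with less than a minute to wait falls back to the normal sleeptime.

```python
def next_sleep(msg, time2wait, sleeptime=30, sleep_max=300):
    """Return how many seconds a monitor should sleep before its next request."""
    # A message such as '07:43' has ':' as its third character.
    is_clock_time = len(msg) > 2 and msg[2] == ":"
    if is_clock_time and time2wait > 60:
        # Wake up one minute before the estimated arrival, but never
        # sleep longer than sleep_max seconds.
        return min(time2wait - 60, sleep_max)
    # Train approaching, at the platform, or any other message (assumption:
    # a clock-time message with time2wait <= 60 is handled the same way).
    return sleeptime
```

A monitor would then call time.sleep(next_sleep(msg, time2wait)) after each response.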

Printer

Moreover, we normally need to monitor the stations for a long time, and it is a bad idea to open the log file once and keep writing until the end. We also need to handle concurrent writes from the individual monitor processes.

As a result, we create a new Process (printer) that uses a Queue to receive messages from the other processes and prints them one by one. To limit data loss in case of a crash, we close and reopen the log file after every 50 lines written.

Algorithm

Thus, the algorithm is:

  1. Set parameters.
  2. Request for the list of stations.
  3. Create a queue.
  4. Initialize the monitors and the printer.
  5. Start the monitors. (.start())
  6. Start the printer.
  7. Wait until the monitors finish their jobs. (.join())
  8. Put a special item into the queue to tell the printer to stop.
  9. Wait until the printer stops.
  10. Done !
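The steps above can be sketched with the multiprocessing package as follows. This is only an outline under stated assumptions, not the project's actual script: the monitor body is a placeholder that does not call the API, the sentinel value STOP and all function names are hypothetical, and the station list is passed in instead of being requested.

```python
import multiprocessing as mp

STOP = None  # special queue item telling the printer to stop (assumed sentinel)

def monitor(station, out_queue, n_polls=3):
    """Placeholder monitor: a real one would call getMissionsNext and sleep."""
    for i in range(n_polls):
        out_queue.put(f"{station},poll {i}")

def printer(in_queue, log_path, reopen_every=50):
    """Write queued lines to the log, reopening the file every `reopen_every` lines."""
    written = 0
    log = open(log_path, "a", encoding="utf-8")
    try:
        while True:
            line = in_queue.get()
            if line is STOP:
                break
            log.write(line + "\n")
            written += 1
            if written % reopen_every == 0:
                # Closing forces the data to disk; reopen to keep appending.
                log.close()
                log = open(log_path, "a", encoding="utf-8")
    finally:
        log.close()
    return written

def run(stations, log_path):
    queue = mp.Queue()                                        # step 3
    monitors = [mp.Process(target=monitor, args=(s, queue))   # step 4
                for s in stations]
    writer = mp.Process(target=printer, args=(queue, log_path))
    for m in monitors:                                        # step 5
        m.start()
    writer.start()                                            # step 6
    for m in monitors:                                        # step 7
        m.join()
    queue.put(STOP)                                           # step 8
    writer.join()                                             # step 9
```

Calling run(["Lozere", "Orsay Ville"], "ratp.log") from under an `if __name__ == "__main__":` guard would start one monitor per station and a single printer. Since only the printer touches the file, the monitors never write concurrently.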

Data Extraction

Description of Log File

In fact, when we send requests, we first ask for the next mission with id 'RB' to get the real time data, then we ask for the next theoretical mission with id '78'.

When we store the response, we write the log file only when the real time data is available. If the theoretical data is available too, we write real and theoretical data at the same time.

Moreover, sometimes there are perturbations and we write only the perturbations whose level isn't info.

Therefore, there are three types of log:

  • logTime, stationName, midE, arrivalTimeE, stop, msg, midT, arrivalTimeT, terminalTime

    like "20170527050515,Orsay Ville,EPOL,201705270505,true,Train à quai V.2,EPOL,201705270506,201705270620"

  • logTime, stationName, midE, arrivalTimeE, stop, msg

    like "20170528003621,Laplace,ILUS,201705280036,true,Train à quai V.2"

  • perturbation, stationName, level, msg

    like "perturbation,Arcueil Cachan,alert,RER A:TRAFIC INTERROMPU entre NANTERRE PREF ET CERGY/POISSY jusqu'à 20h environ, pour une rupture alimentation électrique à HOUILLES"

where:

  • logTime refers to the time when we receive response from API
  • stationName refers to the name of station
  • When the real data of next mission is available
    • midE refers to its id
    • arrivalTimeE refers to the estimated arrival time at this station when we send request
    • stop indicates whether the train will stop at this station
    • msg is missions-stationsMessages
  • When the data of the next theoretical mission is available
    • midT refers to its id
    • arrivalTimeT refers to its theoretical arrival time at this station
    • terminalTime refers to its theoretical arrival time at the end of the line
  • When the perturbation exists
    • perturbation is a keyword for identification
    • stationName refers to the name of station
    • level is perturbations-level
    • msg is perturbations-message
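A parser for these three layouts could look like the sketch below. It assumes that station names and mission messages never contain commas, while perturbation messages may (as in the example above); the function name and the extra 'type' key are hypothetical.

```python
REAL_FIELDS = ["logTime", "stationName", "midE", "arrivalTimeE", "stop", "msg"]
THEORETICAL_FIELDS = ["midT", "arrivalTimeT", "terminalTime"]

def parse_log_line(line):
    """Split one log line into a dict, according to the three layouts."""
    if line.startswith("perturbation,"):
        # The perturbation text may itself contain commas: split at most 3 times.
        _, station, level, msg = line.split(",", 3)
        return {"type": "perturbation", "stationName": station,
                "level": level, "msg": msg}
    fields = line.split(",")
    record = dict(zip(REAL_FIELDS, fields))
    record["type"] = "real"
    if len(fields) == 9:
        # Theoretical data was available too.
        record.update(zip(THEORETICAL_FIELDS, fields[6:]))
        record["type"] = "real+theoretical"
    return record
```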

Retrieve Data

To store the data, we use a two-dimensional dictionary whose values are lists of dictionaries. So for each mission id and each station, we have a list of logs.

data[midE][stationName]=[{'logTime':logTime, 'arrivalTimeE':arrivalTimeE, 'stop':stop, 'msg':msg, 'midT':midT, 'arrivalTimeT':arrivalTimeT, 'terminalTime':terminalTime},{...},...]

Here the msg should be unified before being stored.

First of all, since there are French accents, we use stripAccents() to remove them:

import unicodedata

def stripAccents(s):
    # Decompose accented characters (NFD), then drop the combining marks ('Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

Then we translate them into English according to the following table:

translated      original                                              comments
'coming'        'Train a l'approche' or starts with 'A l'approche'
'arrive'        starts with 'Train a quai'
'arrive'        'Voie 2', 'Voie Z', 'Voie 2B'                         strange messages
'late'          'Train retarde'
'depart'        contains 'Depart'
'pass'          'Train sans arret' or starts with 'Sans arret'
'deleted'       'Supprime'
'terminal'      'Train terminus'
'notraveler'    'Sans voyageurs'
'park'          'Stationne'
msg             msg[2] == ':'                                         messages like '07:43'
msg             else                                                  unknown messages
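The translation table can be sketched as a single function. The order in which the rules are tested is an assumption, since the table does not specify it, and the names strip_accents and unify_message are hypothetical.

```python
import unicodedata

def strip_accents(s):
    """Remove accents by dropping combining marks after NFD decomposition."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

EXACT = {"Train retarde": "late", "Supprime": "deleted",
         "Train terminus": "terminal", "Sans voyageurs": "notraveler",
         "Stationne": "park"}

def unify_message(raw):
    """Translate a raw stationsMessages entry into its unified form."""
    msg = strip_accents(raw).strip()
    if msg == "Train a l'approche" or msg.startswith("A l'approche"):
        return "coming"
    if msg.startswith("Train a quai") or msg in ("Voie 2", "Voie Z", "Voie 2B"):
        return "arrive"
    if "Depart" in msg:
        return "depart"
    if msg == "Train sans arret" or msg.startswith("Sans arret"):
        return "pass"
    # Clock times like '07:43' and unknown messages are returned unchanged.
    return EXACT.get(msg, msg)

print(unify_message("Train à quai V.2"))  # arrive
```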

Clean Data

Before using the data, we need to clean it, because real-time data is never perfect, whether because of the API or because of bugs.

For example, we need to delete records with unknown mission ids. In particular, in the early morning there are some missions whose ids have the form W*W* (e.g. WAWJ), and these missions never take passengers. Moreover, a train sometimes stops at unexpected stations or skips some stations because of a perturbation, etc.

There are many small problems in the data, and we need to solve them one by one by adding exceptions.

Get Arrival Time

Now that we have the log sequence of each mission at each station, we can obtain the exact arrival time by analyzing the message sequence.

For example, if we check the information screen every 30 seconds, we can normally observe a sequence like (here we use the unified message):

'07:47', '07:47', '07:47', 'coming', 'coming', 'arrive', 'arrive', '08:00'

which means that:

  • when the train is far away, the message is the estimated arrival time
  • when the train is approaching, the message is 'coming'
  • when the train arrives, the message is 'arrive'
  • when the train leaves, the message is the estimated arrival time of the next train

In fact, for each mission at each station, we would like to get three arrival times:

  • First observed estimated arrival time

    This is the first estimated arrival time that we observe. We keep the first one because RATP may later modify the estimate so that the train ends up "on time".

  • Real arrival time

  • Theoretical arrival time

The following example may help to understand these three arrival times:

We arrived at Massy Palaiseau and checked the information screen. It said that a train would arrive at 07H43, but the train finally arrived at 07H44. Moreover, according to the planned timetable, a train should arrive at Massy Palaiseau at 07H45 every day.

In this case:

  • First observed estimated arrival time = 07H43
  • Real arrival time = 07H44
  • Theoretical arrival time = 07H45

So how can we retrieve these three pieces of information from the message sequence? The idea is the following:

When we first observe an estimated arrival time, we note it down, together with the arrival time of the next theoretical mission if its mission id is the same. Then we keep reading messages until we observe 'coming' followed by 'arrive', and we use the 'logTime' of the first 'arrive' message as the real arrival time of this mission. After that, we ignore any further 'arrive' messages so that we can observe the next mission.

In reality, with many different types of messages, the situation is much more complicated and we apply the following rules:

  • We don't handle the 'depart', 'terminal' messages for the moment
  • We create a new message 'sametime' when the time in the message equals the estimated arrival time, to compensate for a bug
  • We treat 'park','sametime' in the same way as 'arrive'
  • We treat 'late' in the same way as messages like '07:43'
  • We restart when observing 'deleted' since the train doesn't exist any more.

The exact algorithm is explained in function data2table() of the script ProcessData.py.
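As a rough illustration of the idea, and not a copy of data2table(), the sketch below scans the unified messages of one mission at one station and returns the first observed estimate together with the logTime of the first arrival. It is simplified on purpose: it does not require a 'coming' message before 'arrive', it does not generate 'sametime' itself, and it ignores the theoretical time. All names are hypothetical.

```python
ARRIVED = {"arrive", "park", "sametime"}

def is_estimate(msg):
    """True for clock-time messages like '07:43' and for 'late'."""
    return msg == "late" or (len(msg) > 2 and msg[2] == ":")

def extract_arrivals(logs):
    """logs: dicts with 'logTime', 'msg' (unified) and 'arrivalTimeE'.

    Returns a list of {'E': first observed estimate, 'R': real arrival time}.
    """
    arrivals = []
    first_estimate = None  # estimate noted for the train we are waiting for
    for log in logs:
        msg = log["msg"]
        if is_estimate(msg):
            if first_estimate is None:
                first_estimate = log["arrivalTimeE"]
        elif msg in ARRIVED:
            # Only the first arrival message counts; later ones are ignored
            # because first_estimate has been reset.
            if first_estimate is not None:
                arrivals.append({"E": first_estimate, "R": log["logTime"]})
                first_estimate = None
        elif msg == "deleted":
            first_estimate = None  # the train no longer exists: restart
    return arrivals
```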

Restore Mission's Timeline

By analyzing the messages, we now have a list of arrival times for each mission id at each station:

{'E':arrivalTimeE,'R':arrivalTimeR,'T':arrivalTimeT}

and we would like to use this information to restore the mission's timeline.

The problem is that we do not observe the same number of missions at each station. For example, a mission should stop at two stations, but due to some bugs we observed ten missions at the first station and eight at the second. So we have to find a mapping between these two lists of arrival times.

We use the following algorithm:

  • From the first station to the second-to-last station:
    • Note its real arrival time list as list0 and its next station's list as list1
    • Read a timestamp t0 from list0 and a timestamp t1 from list1
    • Until the end of list0 or list1
      • if t0 < t1
        • if t1 - t0 > 10 minutes: delete t0 from list0 and read the next t0
        • else: read the next t0, t1
      • else: delete t1 from list1 and read the next t1

With this algorithm, we delete the unpaired timestamps. But this is not enough: we need to use the same technique to traverse from the last station back to the first, to make sure that we have the same number of observations at each station.
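For one pair of adjacent stations, the pairing step could be sketched as below. The sketch assumes timestamps are numbers of seconds sorted in ascending order, and it returns the kept timestamps as new lists instead of deleting in place; the function name is hypothetical.

```python
def pair_timestamps(list0, list1, max_gap=600):
    """Keep only the timestamps of two adjacent stations that can be paired.

    list0: real arrival times at a station, list1: at the next station.
    max_gap: largest accepted travel time in seconds (10 minutes).
    """
    kept0, kept1 = [], []
    i = j = 0
    while i < len(list0) and j < len(list1):
        t0, t1 = list0[i], list1[j]
        if t0 < t1:
            if t1 - t0 > max_gap:
                i += 1  # t0 has no partner at the next station: drop it
            else:
                kept0.append(t0)
                kept1.append(t1)
                i += 1
                j += 1
        else:
            j += 1  # t1 is not later than t0: drop it
    # Whatever remains in the longer list is unpaired and therefore dropped.
    return kept0, kept1

print(pair_timestamps([0, 100, 2000, 3000], [60, 2100, 2500, 3050]))
# ([0, 2000, 3000], [60, 2100, 3050])
```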

For example, at first, we observe:

station name          # of missions
Cite Universitaire    20
Denfert Rochereau     22
Port Royal            21
Luxembourg            18
Saint Michel          23
Chatelet              20

After deleting unpaired observations by traversing from Cite Universitaire to Chatelet, we have:

station name          # of missions
Cite Universitaire    20
Denfert Rochereau     19
Port Royal            17
Luxembourg            17
Saint Michel          15
Chatelet              15

Here, for example, 19 for Denfert Rochereau means that there are 19 pairs of observations between Denfert Rochereau and Port Royal.

Then we need to traverse from Chatelet to Cite Universitaire to make sure that we have the same number of observations at each station. So we have:

station name          # of missions
Cite Universitaire    15
Denfert Rochereau     15
Port Royal            15
Luxembourg            15
Saint Michel          15
Chatelet              15

We have thus found 15 missions that run from Cite Universitaire to Chatelet.

Training Data

Now we can calculate the lateness by using arrivalTimeR - arrivalTimeE. (For the moment we don't use the theoretical data.)

For example, if we find 15 missions at 6 stations, then the data is:

  • X of size (15,5) which refers to the lateness of these trains at Cite Universitaire, Denfert Rochereau, Port Royal, Luxembourg, Saint Michel.
  • Y of size (15,1) which refers to the lateness of these trains at Chatelet.
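Building X and Y from the restored timelines could look like the following sketch. It assumes each mission is a dictionary mapping a station name to its {'E': ..., 'R': ...} entry, with both times already converted to seconds; this representation and the function name are hypothetical.

```python
def build_training_data(timelines, stations):
    """Return (X, Y) as nested lists of lateness values in seconds.

    timelines: one dict per mission, station name -> {'E': ..., 'R': ...}.
    stations: station names in travel order; the last one is the target.
    """
    X, Y = [], []
    for mission in timelines:
        # Lateness = real arrival time - first observed estimated arrival time.
        lateness = [mission[s]["R"] - mission[s]["E"] for s in stations]
        X.append(lateness[:-1])   # all stations except the last
        Y.append([lateness[-1]])  # the last station
    return X, Y
```

With 15 missions and the 6 stations above, X has 15 rows of 5 values and Y has 15 rows of 1 value.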

Machine Learning


Appendix

How to use the API of RATP

Example

Thanks to Victor's example and advice, and with the help of Postman, we will run a small test using the getMissionsNext method to get information about RER B at the Lozere station.

  1. Download Postman and open it.
  2. Choose POST and enter URL: http://opendata-tr.ratp.fr/wsiv/services/Wsiv
  3. Create a new key in Headers:
    • Key: SOAPAction
    • Value: ""
  4. Choose mode raw and XML(text/xml) in Body, then enter the request code.
  5. Click Send to test. It should work if you have applied for access.
  6. Click the Code button to generate Python code of type: Python http.client (Python 3)
  7. Use the generated code to send requests, and convert the XML response to a dict with the help of xmltodict.
  8. Retrieve the data from the response, and that's all!
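Steps 6 to 8 could end up looking roughly like the sketch below. It is not the code generated by Postman: the function names are hypothetical, the Content-Type header is an assumption based on the text/xml body mode chosen in step 4, and the network call only works from an IP address that has been granted access. The envelope is the getMissionsNext request listed under Requests.

```python
import http.client
from xml.sax.saxutils import escape

ENVELOPE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://wsiv.ratp.fr/xsd" xmlns:wsiv="http://wsiv.ratp.fr">
    <soapenv:Header/>
    <soapenv:Body>
        <wsiv:getMissionsNext>
            <wsiv:station>
                <xsd:line><xsd:id>{line_id}</xsd:id></xsd:line>
                <xsd:name>{name}</xsd:name>
            </wsiv:station>
            <wsiv:direction><xsd:sens>{sens}</xsd:sens></wsiv:direction>
        </wsiv:getMissionsNext>
    </soapenv:Body>
</soapenv:Envelope>"""

def build_request(line_id, name, sens):
    """Fill the SOAP envelope, escaping XML special characters."""
    return ENVELOPE.format(line_id=escape(line_id), name=escape(name),
                           sens=escape(sens))

def get_missions_next(line_id, name, sens):
    """Send the request and return the response as a dict."""
    import xmltodict  # third-party package, only needed for the real call
    conn = http.client.HTTPConnection("opendata-tr.ratp.fr", timeout=10)
    try:
        body = build_request(line_id, name, sens).encode("utf-8")
        headers = {"Content-Type": "text/xml; charset=utf-8",
                   "SOAPAction": '""'}
        conn.request("POST", "/wsiv/services/Wsiv", body=body, headers=headers)
        return xmltodict.parse(conn.getresponse().read())
    finally:
        conn.close()
```

For example, get_missions_next("RB", "Lozere", "A") would ask for the next real-time missions at Lozere in direction A.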

References:

  1. How I “hacked” into RATP’s API
  2. Making SOAP requests using Postman
  3. Postman Documentation

Requests

getMissionsNext

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://wsiv.ratp.fr/xsd" xmlns:wsiv="http://wsiv.ratp.fr">
    <soapenv:Header/>
    <soapenv:Body>
        <wsiv:getMissionsNext>
            <wsiv:station>
                <xsd:line>
                    <xsd:id>RB</xsd:id>
                </xsd:line>
                <xsd:name>Lozere</xsd:name>
            </wsiv:station>
            <wsiv:direction>
                <xsd:sens>A</xsd:sens>
            </wsiv:direction>
        </wsiv:getMissionsNext>
    </soapenv:Body>
</soapenv:Envelope>

getMission

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://wsiv.ratp.fr/xsd" xmlns:wsiv="http://wsiv.ratp.fr">
    <soapenv:Header/>
    <soapenv:Body>
        <wsiv:getMission>
            <wsiv:mission>
                <xsd:id>INKE</xsd:id>
                <xsd:line>
                    <xsd:id>RB</xsd:id>
                </xsd:line>
            </wsiv:mission>
        </wsiv:getMission>
    </soapenv:Body>
</soapenv:Envelope>

getStations

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://wsiv.ratp.fr/xsd" xmlns:wsiv="http://wsiv.ratp.fr">
    <soapenv:Header/>
    <soapenv:Body>
        <wsiv:getStations>
            <wsiv:station>
                <xsd:line>
                    <xsd:id>RB</xsd:id>
                </xsd:line>
            </wsiv:station>
        </wsiv:getStations>
    </soapenv:Body>
</soapenv:Envelope>

Get theoretical arrival time from RATP docs

We are going to dig out the theoretical timetable of RER B by using the data RATP_GTFS_LINES and the doc OffreDeTransport_GTFS_RATP.

We can also use the API with a numerical id 78 (RER B) to get the theoretical arrival time.

First, we can find several files in the folder \RATP_GTFS_LINES\RATP_GTFS_RER_B\ but we use only trips.txt and stop_times.txt.

trips.txt

trips.txt contains thousands of records like:

route_id    service_id    trip_id             trip_headsign    trip_short_name    direction_id    shape_id
1631687     2350656       1023506560917735    SAXO             SAXO               0

Each record corresponds to a "trip", but we only care about the relation between the trip_id and the trip_headsign (= trip_short_name).

In fact, we find that each distinct trip_headsign corresponds to several trip_id values, but the last seven digits of all these trip_id values are the same, which means that:

From the last seven digits of trip_id, we can map a trip_id to a unique trip_headsign.

stop_times.txt

stop_times.txt contains more than one hundred thousand records like:

trip_id             arrival_time    departure_time    stop_id    stop_sequence    stop_headsign    shape_dist_traveled
1023506560917735    05:25:00        05:25:00          2056       1

Each record corresponds to the theoretical arrival time of a mission (train) at one station.

With the help of trips.txt, we can easily map an arrival time (arrival_time) to a station (stop_id) and a mission (trip_headsign).
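The mapping described above can be sketched with the csv module, since GTFS files are comma-separated text with a header row. The two in-memory files below only contain the sample records shown earlier, and the function names are hypothetical.

```python
import csv
import io

def headsign_by_suffix(trips_file):
    """Map the last seven digits of trip_id to trip_headsign."""
    return {row["trip_id"][-7:]: row["trip_headsign"]
            for row in csv.DictReader(trips_file)}

def theoretical_arrivals(stop_times_file, suffix_to_headsign):
    """Yield (trip_headsign, stop_id, arrival_time) for each known trip."""
    for row in csv.DictReader(stop_times_file):
        headsign = suffix_to_headsign.get(row["trip_id"][-7:])
        if headsign is not None:
            yield headsign, row["stop_id"], row["arrival_time"]

trips = io.StringIO(
    "route_id,service_id,trip_id,trip_headsign,trip_short_name,direction_id,shape_id\n"
    "1631687,2350656,1023506560917735,SAXO,SAXO,0,\n")
stop_times = io.StringIO(
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,shape_dist_traveled\n"
    "1023506560917735,05:25:00,05:25:00,2056,1,,\n")

mapping = headsign_by_suffix(trips)
print(list(theoretical_arrivals(stop_times, mapping)))
# [('SAXO', '2056', '05:25:00')]
```

With the real files, the StringIO objects would be replaced by open("trips.txt", newline="", encoding="utf-8") and the equivalent for stop_times.txt.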