Juan Rodriguez Hortala edited this page Jun 19, 2014 · 9 revisions

#General goal

Provide the means to analyze the data (see data sources) that the City of Barcelona publishes about the use of the "bicing" public bike lending system, in order to detect usage patterns that could be used to optimize the system.

#Analysis of the available information

###Information we don’t have:

  • Bike location: we don’t know the coordinates of the bikes; we only have information about the stations
  • Bike id: bikes are anonymous, so we have no way to detect the transfer of a bike from one station to another

Note that a bike does not necessarily travel from a station to one of its closest neighbours, so the average journey will skip several stations.

###Information we can deduce:

  • Event of a bike leaving or entering a station
  • Enrichment of the spatial information, by 1) associating a district with each station; 2) obtaining additional information about the station or district that intuitively should have an impact on the use of the bicing system, such as tourism-related information (number of monuments in the district or within some distance of the station, number of hotels, …) and socio-economic information (per district: average income, employment rate, census, number of businesses, ...)

##Dimensions and measures

Taking that into account, we consider the following dimensions and measures relevant.

Dimensions

  • Station location: long/lat, height, district
  • Station metainfo: address and additional info from enrichment of spatial information
  • Station status: open/closed
  • Time: hour, day sixth ([4, 8), [8, 12), [12, 16), [16, 20), [20, 0), [0, 4)), other day division (going to work, lunch time, back home, rest of the hours), day, week, month, trimester, year
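As a rough illustration of the time dimension, the sketch below derives some of the listed attributes from a timestamp. The function and key names are hypothetical, and the "going to work / lunch time / back home" division is left out because its hour boundaries are not defined here.

```python
from datetime import datetime

# Four-hour intervals of the day, indexed by hour // 4.
# The interval starting at 0h comes first only to keep the arithmetic simple.
DAY_SIXTHS = ["[0, 4)", "[4, 8)", "[8, 12)", "[12, 16)", "[16, 20)", "[20, 0)"]


def day_sixth(ts: datetime) -> str:
    """Return the four-hour interval of the day that contains ts."""
    return DAY_SIXTHS[ts.hour // 4]


def time_dimensions(ts: datetime) -> dict:
    """Derive time attributes for one observation (hypothetical key names)."""
    iso_year, iso_week, _ = ts.isocalendar()
    return {
        "hour": ts.hour,
        "day_sixth": day_sixth(ts),
        "day": ts.date().isoformat(),
        "week": f"{iso_year}-W{iso_week:02d}",  # ISO week, an assumption
        "month": ts.month,
        "trimester": (ts.month - 1) // 3 + 1,
        "year": ts.year,
    }
```

Using ISO weeks is only one possible convention; a week starting on another day would need a different computation.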

NOTE: either we assume integrity for the field id in the source, or we avoid using it and identify stations by their long/lat pair.

Maybe nearbyStationList could be useful for defining possibly overlapping clusters of stations. It is interesting because such clusters would probably go beyond district limits, but I don’t have a concrete idea yet of how to use it.

Measures

The following measures are defined at the station level:

  • Capability measures: number of bike slots, number of bikes available, number of stations. Note that a filter based on the station status would be useful in the predefined reports, for example to count the number of open stations, or to avoid counting the bikes in a closed station as available
  • Traffic measures: number of bikes lent, number of bikes returned. These transactions are identified by a decrease or increase in the bikes field with respect to the previous bicing xml update.
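The traffic measures can be sketched as a comparison of two consecutive updates. This is a minimal sketch that assumes each update has already been parsed into a mapping from station id to the value of the bikes field; the type alias and function name are hypothetical.

```python
from typing import Dict, Tuple

Snapshot = Dict[str, int]  # station id -> value of the bikes field


def traffic_between(previous: Snapshot, current: Snapshot) -> Dict[str, Tuple[int, int]]:
    """Return {station id: (bikes lent, bikes returned)} between two updates."""
    traffic = {}
    for station, bikes_now in current.items():
        if station not in previous:
            # First sighting of the station: no baseline to compare against.
            continue
        delta = bikes_now - previous[station]
        # A decrease counts as bikes lent, an increase as bikes returned.
        traffic[station] = (max(-delta, 0), max(delta, 0))
    return traffic
```

Only the net change between updates is visible: if one bike is lent and another returned at the same station between two updates, neither transaction is counted.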

#Specific goals

This section is structured as several lists of questions about the data, paired with charts that could provide a partial answer to those questions. There is one subsection per analysis type.

##OLAP reports

In the first approach, based on Apache Phoenix, a continuous ETL will insert the data into HBase, so reports are updated in near real time. Near-real-time refresh is a secondary goal, so it is not a problem if changes are not refreshed in Saiku at that speed.

Which areas of the city are used more frequently as origin and destination?

  • Report with a stacked bar per station with one part for each fragment of the day, for the number of bikes lent during the last week. Bars are sorted by the number of bikes lent.
  • Same as before but with bikes returned

These charts may have the following variants: group stations by district or other criteria (e.g. socio-economic information for the station district), other time groups like day or month, other time intervals like one chart per month for the last year

Where should we place a new station?

The idea is that we should place a station in an area whose capacity is close to its limit and which has a lot of traffic. Therefore we can define a table with the following columns: district, capacity as the average number of bikes available (0 when a station is closed), and traffic as the number of bikes lent plus the number of bikes returned. The report will return the top ten areas, sorted ascending by capacity and then descending by traffic. We define reports for the last week and the last month.
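The ranking can be sketched as a sort on a composite key. The class and field names below are hypothetical, and the values in the usage note are made up for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AreaStats:
    district: str
    capacity: float  # average number of bikes available (0 for closed stations)
    traffic: int     # bikes lent + bikes returned


def top_candidate_areas(stats: List[AreaStats], limit: int = 10) -> List[AreaStats]:
    """Sort ascending by capacity, break ties by descending traffic, keep the top rows."""
    return sorted(stats, key=lambda s: (s.capacity, -s.traffic))[:limit]
```

For example, given `AreaStats("A", 2.0, 100)`, `AreaStats("B", 2.0, 300)` and `AreaStats("C", 5.0, 900)`, the result is B, A, C. With this ordering traffic only breaks ties in capacity, so a very busy area with slightly higher capacity ranks below a quiet one; a combined score would be an alternative if that is not the intent.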

What are the usage patterns of the stations throughout the day?

Define a line chart per station, with the horizontal axis for the hour, and the following lines:

  • Average number of bikes available in that hour
  • Average number of bikes lent in that hour
  • Average number of bikes returned in that hour

This report can be defined for the last day (so the average acts as the identity function), or for the last week or month. Also this report might group stations by district or other criteria, and also group hours into parts of the day.
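A minimal sketch of how the values of one line in this chart could be computed for a single station, assuming the input is a list of (timestamp, value) samples. The function name and the `within_hour` parameter are hypothetical: availability would be averaged within each hour, while bikes lent or returned would be summed.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def hourly_profile(samples, within_hour=mean):
    """Return {hour of day: average over days of the per-hour aggregate}.

    samples: iterable of (datetime, number) for one station.
    within_hour: how samples falling in the same day and hour are combined,
                 e.g. mean for bikes available, sum for bikes lent or returned.
    """
    per_day_hour = defaultdict(list)
    for ts, value in samples:
        per_day_hour[(ts.date(), ts.hour)].append(value)

    per_hour = defaultdict(list)
    for (_, hour), values in per_day_hour.items():
        per_hour[hour].append(within_hour(values))

    # Average across the days covered by the samples.
    return {hour: mean(values) for hour, values in sorted(per_hour.items())}
```

When the samples cover a single day, each hour has exactly one per-day value, so the outer average leaves it unchanged.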

Other questions

The user can also define new reports using Saiku’s drag and drop user interface.

##Real-time visualizations

What is the current state of the bicing system?

Real time heatmaps in CartoDB for:

  • Stations state as open or closed
  • Number of bikes available per station, using 0 for closed stations
  • Simple moving average during the last hour for the number of bikes lent, per station
  • Simple moving average during the last hour for the number of bikes returned, per station

Also real-time graphs in OpenTSDB corresponding to those four maps:

  • Line chart with one line for the number of open stations and another for the number of closed stations
  • Line chart with one line per district with the number of bikes available at the stations in that district
  • Line chart with one line per district with the simple moving average during the last hour for the number of bikes lent at the stations in that district
  • Line chart with one line per district with the simple moving average during the last hour for the number of bikes returned at the stations in that district
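The simple moving average over the last hour could be maintained incrementally as updates arrive. This is only a sketch under the assumption that one value (for example, the number of bikes lent since the previous update) is fed in per update; the class name is hypothetical and the source does not prescribe this implementation.

```python
from collections import deque
from datetime import datetime, timedelta


class LastHourAverage:
    """Simple moving average of the values seen during a trailing time window."""

    def __init__(self, window: timedelta = timedelta(hours=1)):
        self.window = window
        self.samples = deque()  # (timestamp, value), oldest first
        self.total = 0

    def add(self, ts: datetime, value: float) -> float:
        """Record a new value and return the average over the window ending at ts."""
        self.samples.append((ts, value))
        self.total += value
        # Drop samples that are one window or more older than the newest one.
        while self.samples[0][0] <= ts - self.window:
            _, old_value = self.samples.popleft()
            self.total -= old_value
        return self.total / len(self.samples)
```

One instance per station (or per district) would be kept, and timestamps are expected to arrive in increasing order.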

Other possible extensions:

  • For any of the metrics above, it would be interesting to compute the simple moving average for the whole city or for a longer period (day, week, ...) and show it as a horizontal reference line
  • It would also be interesting to combine historical information with real-time information. A very simple approach could be to define a Pig job that computes the average during the last month and trimester for those three values and stores it in HBase. This job could run hourly, and its results could be represented as a pair of constant horizontal reference lines in each of the charts.

##Time series analysis

Which are the predictions for the capability and traffic of the stations?

Time series forecasting will be performed over the following time series:

  • A variable per station, new value each hour for the average number of available bikes during that hour
  • A variable per station, new value each hour for the number of bikes lent during that hour
  • A variable per station, new value each hour for the number of bikes returned during that hour