# TPL Data

## Preprocessing GTFS data

**GTFS**: General Transit Feed Specification

The General Transit Feed Specification (GTFS), also known as GTFS static or static transit to differentiate it from the GTFS realtime extension, defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" let public transit agencies publish their transit data and developers write applications that consume that data in an interoperable way.

A GTFS feed is a set of text files collected in a ZIP file. Each file models a particular aspect of transit information: stops, routes, trips, and other schedule data. The details of each file are defined in the [GTFS reference](https://developers.google.com/transit/gtfs/reference/).

See also: https://it.wikipedia.org/wiki/General_Transit_Feed_Specification
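
Since a feed is a ZIP archive of CSV-style text files, it can be inspected with the Python standard library alone. The sketch below builds a tiny in-memory feed with made-up content (the agency and stop values are illustrative, not taken from any real feed) and reads one table back. The helper names `list_feed_files` and `read_feed_table` are hypothetical and are not part of `gtfstk`.

```python
import csv
import io
import zipfile


def list_feed_files(feed_zip):
    """Return the names of the text files contained in a GTFS ZIP archive."""
    with zipfile.ZipFile(feed_zip) as zf:
        return sorted(zf.namelist())


def read_feed_table(feed_zip, file_name):
    """Read one GTFS text file (CSV with a header row) into a list of dicts."""
    with zipfile.ZipFile(feed_zip) as zf:
        with zf.open(file_name) as raw:
            # utf-8-sig drops the byte-order mark that some feeds include
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


# Build a tiny in-memory feed so the sketch runs without any download
demo_feed = io.BytesIO()
with zipfile.ZipFile(demo_feed, "w") as zf:
    zf.writestr(
        "agency.txt",
        "agency_name,agency_url,agency_timezone\n"
        "Demo Transit,https://example.org,Europe/Rome\n",
    )
    zf.writestr(
        "stops.txt",
        "stop_id,stop_name,stop_lat,stop_lon\n"
        "S1,Central,41.90,12.50\n",
    )

feed_files = list_feed_files(demo_feed)
agency_rows = read_feed_table(demo_feed, "agency.txt")
print(feed_files)
print(agency_rows[0]["agency_timezone"])
```

The same two helpers should also accept a path to a downloaded feed, since `zipfile.ZipFile` takes either a file name or a file-like object.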

* **agency**: the agency table holds the information about the transit agency.

agency_name

agency_url

agency_timezone

* **routes**: the routes table contains the transit routes.


route_id (primary key)

route_short_name

route_long_name

route_type --> important: https://developers.google.com/transit/gtfs/reference/extended-route-types

* **trips**:

trip_id (primary key)

route_id (foreign key)

service_id (foreign key)

Optional fields:

block_id - The block ID identifies the block to which a trip belongs.

* **stop_times**: the times at which a vehicle arrives at and departs from each stop of a trip.

stop_id (foreign key)

trip_id (foreign key)

arrival_time

departure_time

stop_sequence

* **stops**: the stops table defines the geographic information of each stop.

stop_id (primary key)

stop_name

stop_lon

stop_lat

* **calendar**: the calendar table defines when a service runs: the days of the week on which it operates and the period over which it is valid.

service_id (primary key)

monday

tuesday

wednesday

thursday

friday

saturday

sunday

start_date

end_date
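
The calendar fields above can be combined into a simple check of whether a service runs on a given date. This is a minimal sketch, assuming the rows are read as strings (weekday flags `"0"`/`"1"`, dates as `YYYYMMDD`, as in the GTFS reference). It ignores the exceptions listed in `calendar_dates.txt`, and the helper name `service_runs_on` and the sample row are hypothetical.

```python
from datetime import date, datetime

# Column names in the order returned by date.weekday() (Monday == 0)
WEEKDAY_COLUMNS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def service_runs_on(calendar_row, day):
    """Return True if the service in calendar_row operates on the given date."""
    start = datetime.strptime(calendar_row["start_date"], "%Y%m%d").date()
    end = datetime.strptime(calendar_row["end_date"], "%Y%m%d").date()
    if not (start <= day <= end):
        return False
    # The weekday flag is "1" when the service runs on that day of the week
    return calendar_row[WEEKDAY_COLUMNS[day.weekday()]] == "1"


# Made-up weekday-only service valid for the whole of 2024
weekday_service = {
    "service_id": "WD",
    "monday": "1", "tuesday": "1", "wednesday": "1", "thursday": "1",
    "friday": "1", "saturday": "0", "sunday": "0",
    "start_date": "20240101", "end_date": "20241231",
}

print(service_runs_on(weekday_service, date(2024, 3, 4)))  # a Monday
print(service_runs_on(weekday_service, date(2024, 3, 3)))  # a Sunday
```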


**Optional tables:**

calendar_dates.txt

fare_attributes.txt

fare_rules.txt

shapes.txt

frequencies.txt

transfers.txt

feed_info.txt

**GTFS DATA:**

* MILANO: https://www.amat-mi.it/it/mobilita/dati-strumenti-tecnologie/dati-gtfs/

* TORINO: http://opendata.5t.torino.it/gtfs/piemonte_it.zip

* ROMA: http://dati.muovi.roma.it/gtfs/rome_static_gtfs.zip

In [1]:
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import gtfstk as gt
import sys
from pathlib import Path

In [2]:
DIR = Path('../..')
sys.path.append(str(DIR))

TPL_DIR = DIR/'data/raw/tpl/'
GEO_DIR = DIR/'data/processed/'

In [3]:
from references import common_cfg, city_settings
from src.models.factories import TransportStopFactory

### Load TPL GTFS Data

In [4]:
# TODO: loop over the cities later
selectedCity = 'Roma'
print(selectedCity)
city = city_settings.get_city_config(selectedCity)
istatData = city.istat_cpa_data

Roma


In [5]:
# Feed file per city (hard coded, to be improved)
if selectedCity == "Torino":
    fileName = "torino/piemonte_it.zip"
elif selectedCity == "Milano":
    fileName = "milano/Export_OpenDataTPL_Current.zip"
elif selectedCity == "Roma":
    fileName = "roma/rome_static_gtfs.zip"

path = TPL_DIR/fileName
gt.list_gtfs(path)

| | file_name | file_size |
|---|---|---|
| 0 | agency.txt | 275 |
| 1 | calendar_dates.txt | 5311374 |
| 2 | feed_info.txt | 170 |
| 3 | routes.txt | 25595 |
| 4 | shapes.txt | 21710823 |
| 5 | stop_times.txt | 179087168 |
| 6 | stops.txt | 1331982 |
| 7 | transfers.txt | 56 |
| 8 | trips.txt | 6272835 |


In [6]:
# Read the feed (distances expressed in km)
feed = gt.read_gtfs(path, dist_units='km')

In [7]:
# Load the stops, to be intersected with the census sections.
# Multiple routes may use the same stop. Systems use stop_id as the internal identifier
# of the record (e.g. the primary key in a database), so it must be unique within the dataset.
# The "wheelchair_boarding" column is kept.
stops = feed.stops
# Rename the coordinate columns to the names expected when converting to a GeoDataFrame
stops.rename(columns={
    "stop_lon": common_cfg.coord_col_names[0],
    "stop_lat": common_cfg.coord_col_names[1]},
             inplace=True)

In [8]:
# stops_geo: convert the stops DataFrame to a GeoDataFrame
stops_geo = common_cfg.df_to_gdf(input_df=stops)

In [9]:
stops_geo.columns

Index(['stop_id', 'stop_code', 'stop_name', 'stop_desc', 'Lat', 'Long',
       'zone_id', 'stop_url', 'location_type', 'parent_station',
       'stop_timezone', 'wheelchair_boarding', 'geometry'],
      dtype='object')

In [24]:
istatData.reset_index(level=0, inplace=True)
istatDataMin = istatData[['geometry','IDquartiere','SEZ2011']]

| | geometry | IDquartiere | SEZ2011 |
|---|---|---|---|
| 0 | POINT (12.5382291021966 41.90741114226486) | Q22 | 580912220072 |
| 1 | POINT (12.53998076747949 41.88798204118268) | Q07 | 580912070152 |
| 2 | POINT (12.55146893794632 41.90745579228835) | Q22 | 580912220197 |
| 3 | POINT (12.540534989828 41.90735481993809) | Q22 | 580912220079 |
| 4 | POINT (12.5243794373368 41.91094320131206) | Q05 | 580912050143 |
| 5 | POINT (12.53938704850022 41.88564638371127) | Q07 | 580912070193 |
| 6 | POINT (12.6167928296514 41.81216339244668) | ZAR19 | 580914190120 |
| 7 | POINT (12.25289740311049 42.00678895991031) | ZAR49 | 580914490066 |
| 8 | POINT (12.25235697130869 42.01472707935229) | ZAR49 | 580914490040 |
| 9 | POINT (12.25263809487124 42.01175218393681) | ZAR49 | 580914490053 |


In [11]:
# A feed may cover more than the selected city (e.g. the Torino feed has data for the entire region),
# so keep only the stops inside the city = geometry + IDquartiere + SEZ2011
# --> spatial join on geometry
# Note: recent geopandas releases use `predicate=` in place of `op=`
stops_geo_internal = gpd.sjoin(istatDataMin,stops_geo, how="inner", op='intersects')
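
To make the idea behind the spatial filter concrete, the sketch below shows a plain ray-casting point-in-polygon test in pure Python. It is not how geopandas implements `sjoin` (which relies on a spatial index and a geometry engine), and it assumes the census sections are available as polygons. The section outlines, stop coordinates, and all names here are made up for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: count the polygon edges crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Two adjacent square "sections" given as (lon, lat) vertices
toy_sections = {
    "A": [(12.0, 41.0), (12.5, 41.0), (12.5, 41.5), (12.0, 41.5)],
    "B": [(12.5, 41.0), (13.0, 41.0), (13.0, 41.5), (12.5, 41.5)],
}
toy_stops = {"S1": (12.2, 41.2), "S2": (12.7, 41.3), "S3": (14.0, 42.0)}

# Inner-join semantics: stops outside every section are dropped
stops_in_sections = {
    stop_id: section_id
    for stop_id, (lon, lat) in toy_stops.items()
    for section_id, polygon in toy_sections.items()
    if point_in_polygon(lon, lat, polygon)
}
print(stops_in_sections)
```

Stop `S3` lies outside both sections, so it does not appear in the result, which mirrors what `how="inner"` does in the join.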

In [25]:
# Stop times: link each stop to its trips, to trace back the line type (metro, bus)
stop_times = feed.stop_times
stop_times = stop_times[["trip_id","stop_id"]]

In [13]:
# Keep only the stop times at stops inside the city
stop_times_city = stop_times.loc[stop_times['stop_id'].isin(stops_geo_internal["stop_id"])]

In [14]:
# Get the trip IDs, which are then used to look up the route type
trips = feed.trips
trips = trips[["route_id","trip_id"]]
trips_stop_times_city = trips.merge(stop_times_city, on="trip_id")

In [15]:
# Routes: final join, attaching route_type to each (trip, stop) pair
routes = feed.routes
routes = routes[["route_id","route_type"]]
routes_trips_stop_times_city = routes.merge(trips_stop_times_city, on="route_id" )

In [16]:
final = routes_trips_stop_times_city[["route_id","route_type","stop_id"]].copy()
final.drop_duplicates(inplace=True)
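
The chain of joins in the cells above (stop_times to trips on `trip_id`, then to routes on `route_id`, then de-duplication) can be tried on a toy feed. This is a sketch with made-up IDs. The label mapping covers only the basic GTFS route types 0 to 3, and any other code (including the extended route types) falls back to `"other"`. The `toy_*` names and `ROUTE_TYPE_LABELS` are hypothetical.

```python
import pandas as pd

# Toy tables with made-up IDs, mirroring the GTFS key relationships
toy_routes = pd.DataFrame({"route_id": ["R1", "R2"], "route_type": [3, 1]})
toy_trips = pd.DataFrame({
    "trip_id": ["T1", "T2", "T3"],
    "route_id": ["R1", "R1", "R2"],
})
toy_stop_times = pd.DataFrame({
    "trip_id": ["T1", "T1", "T2", "T3"],
    "stop_id": ["S1", "S2", "S1", "S2"],
})

# stop_times -> trips (on trip_id) -> routes (on route_id),
# then keep one row per (route, stop) pair
toy_stop_routes = (
    toy_stop_times.merge(toy_trips, on="trip_id")
    .merge(toy_routes, on="route_id")
    [["route_id", "route_type", "stop_id"]]
    .drop_duplicates()
    .sort_values(["stop_id", "route_id"])
    .reset_index(drop=True)
)

# Basic GTFS route types only; anything else is labelled "other"
ROUTE_TYPE_LABELS = {0: "tram", 1: "metro", 2: "rail", 3: "bus"}
toy_stop_routes["route_label"] = (
    toy_stop_routes["route_type"].map(ROUTE_TYPE_LABELS).fillna("other")
)
print(toy_stop_routes)
```

Stop `S2` ends up with two rows, one per route serving it, which is why a stop can appear several times in the final table.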

In [17]:
final=final.merge(stops_geo_internal, on="stop_id" )

In [18]:
# Drop the columns that are not needed in the output
# (the positional axis argument of drop() is not accepted by recent pandas versions)
final = final.drop(columns=['geometry', 'stop_timezone', 'stop_code', 'index_right'])

In [19]:
# Build a unique ID for each (stop, route) pair and check that it has no duplicates
final[TransportStopFactory.id_col] = final['stop_id'] + '_' + final['route_id']
assert not final[TransportStopFactory.id_col].duplicated().any(), 'Unexpected duplicates'
# Save the final data
outFilename = GEO_DIR / f'{selectedCity}_TPL.csv'
final.to_csv(outFilename, sep=';', index=False)
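
The ID construction and the semicolon-separated export can be checked on a toy table. This sketch assumes both ID columns hold strings (string concatenation would fail on numeric columns) and uses the hypothetical column name `stop_route_id` in place of `TransportStopFactory.id_col`. The file is written to an in-memory buffer and read back with the same separator.

```python
import io

import pandas as pd

# Made-up (stop, route) pairs after de-duplication
toy_final = pd.DataFrame({
    "stop_id": ["S1", "S2", "S2"],
    "route_id": ["R1", "R1", "R2"],
    "route_type": [3, 3, 1],
})

# "stop_route_id" is a hypothetical stand-in for TransportStopFactory.id_col
toy_final["stop_route_id"] = toy_final["stop_id"] + "_" + toy_final["route_id"]
assert not toy_final["stop_route_id"].duplicated().any(), "Unexpected duplicates"

# Round trip through a ';'-separated CSV held in memory
buffer = io.StringIO()
toy_final.to_csv(buffer, sep=";", index=False)
buffer.seek(0)
# Read the IDs back as strings so numeric-looking IDs keep any leading zeros
reloaded = pd.read_csv(buffer, sep=";", dtype={"stop_id": str, "route_id": str})
print(reloaded)
```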

In [20]:
final.route_id.nunique()

0

In [21]:
final.stop_id.nunique()

0

In [22]:
#example with 1871 
#stop_times_TO.loc[stop_times_TO.stop_id=='1871'].trip_id.value_counts()
#test = routes_trips_stop_times_TO.loc[routes_trips_stop_times_TO.stop_id=='1871']
#test.sort_values(['route_id', 'trip_id'], ascending=[True, True])
#test_min = test[["route_id","route_type","stop_id"]]