<img src="https://www.saperessere.com/wp-content/uploads/2013/08/logo-sapienza-new.jpg" width="300"/>

# Directions and information about this notebook

This notebook calculates the latitude and longitude of the stations from the address and city provided for each one.

It relies on `geopy`, a Python library that locates the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources.

The map is plotted with `plotly`, which uses a **Mapbox** access token stored in a text file named *mapbox_token.txt*. To create the token, follow [this](https://www.youtube.com/watch?v=6iQEhaE1bCY) tutorial.
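
A token file often ends with a newline, which would otherwise become part of the token. The sketch below reads the file and strips surrounding whitespace. The helper name `read_mapbox_token` is hypothetical and is not part of `plotly`.

```python
from pathlib import Path


def read_mapbox_token(token_file):
    """Return the token stored in `token_file`, without surrounding whitespace."""
    token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        # An empty file would only fail later, when the map tiles are requested
        raise ValueError(f"No token found in {token_file}")
    return token
```

The result could then be passed to `px.set_mapbox_access_token(...)` in place of the raw file contents.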

# Import libraries

In [None]:
import time
import pandas as pd
import numpy as np
import plotly.express as px

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim

from google.colab import drive
# Mount drive from Google
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


# Define parameters

The parameters are defined in one place so that the notebook can be run by changing only the location where the data is (or will be) stored.

In [None]:
# Define paths
root_path = '/content/gdrive/MyDrive/Students_honour_programme/project_with_professor_Francesca_Cuomo'
%cd $root_path

output_path = root_path + '/output'
# models_path = root_path + '/models'

# Define random state for reproducibility
random_state = 0

/content/gdrive/MyDrive/Students_honour_programme/project_with_professor_Francesca_Cuomo
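
One possible way to keep every location tied to a single root is to derive the folders with `pathlib` instead of string concatenation. This is only a sketch: `project_paths` is a hypothetical helper and the root used here is a placeholder, not the project folder above.

```python
from pathlib import Path


def project_paths(root):
    """Derive the folders used in this notebook from a single root folder."""
    root = Path(root)
    return {
        "root": root,
        "data": root / "data",      # input and geocoded CSV files
        "output": root / "output",  # generated results
    }


paths = project_paths("/content/gdrive/MyDrive/my_project")  # placeholder root
print(paths["data"] / "stations_geocoded.csv")
```

Changing the argument of `project_paths` is then the only edit needed to move the project.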


# Load data

The file **Possible places for stations.csv** contains the 130 selected places to be treated as stations.

In [None]:
# Read stations
stations = pd.read_csv("data/Possible places for stations.csv", sep=";", encoding='latin-1', dtype = {"station_id" : "str"})
# stations = stations.head(5)
stations

Unnamed: 0,station_id,place_name,address,city,country,type,latitude,longitude,weights
0,3001,The Colosseum,"Piazza del Colosseo, 1, 00184 Roma RM",Rome,Italy,Building,41.890542,12.492317,3.0
1,3002,The Pantheon,"Piazza della Rotonda, 00186 Roma RM",Rome,Italy,Building,,,3.0
2,3003,Trevi Fountain,"Piazza di Trevi, 00187 Roma RM",Rome,Italy,Building,,,3.0
3,3004,Altare della patria,"Piazza di S. Marco, 4289, 00186 Roma RM",Rome,Italy,Building,41.894576,12.483123,2.0
4,3005,Spanish Steps,"Piazza della Trinità dei Monti, 00187 Roma RM",Rome,Italy,Building,,,0.9
...,...,...,...,...,...,...,...,...,...
125,3126,Teano Station,"Teano,00177 Roma",Rome,Italy,Subway station,41.889673,12.550923,0.7
126,3127,Lodi Station,"Via la Spezia, 121, 00182 Roma RM",Rome,Italy,Subway station,41.887003,12.518642,1.5
127,3128,Piazza del Risorgimento,"Piazza del Risorgimento, 00192 Roma RM",Rome,Italy,Bus Stop,41.906330,12.457332,2.5
128,3129,Cipro Station,"Via Cipro, 00136 Roma RM",Rome,Italy,Subway station,41.907718,12.447160,2.5


# Geocode (get latitude and longitude)

In [None]:
start_time = time.time()
# Define geolocator
locator = Nominatim(user_agent="name_of_your_app")

# Separate locations already geocoded
stations_geocoded = stations.loc[stations['latitude'].notnull()].reset_index(drop=True)
stations_NOgeocoded = stations.loc[stations['latitude'].isnull()].reset_index(drop=True)

# Wait at least one second between geocoding calls so that the service does not refuse requests
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# Extract the location
location = stations_NOgeocoded['address'].apply(geocode)

# Split each point into latitude, longitude and altitude; a failed lookup (None) becomes a row of NaN
points = location.apply(lambda loc: tuple(loc.point) if loc else (np.nan, np.nan, np.nan)).tolist()
stations_NOgeocoded[['latitude', 'longitude', 'altitude']] = pd.DataFrame(points, index=stations_NOgeocoded.index)

# Remove unnecessary columns
del stations_NOgeocoded['altitude']

# Combine the locations that were already geocoded with the newly geocoded ones
stations = pd.concat([stations_geocoded, stations_NOgeocoded], axis = 0).reset_index(drop=True)

# Export file with georeferenced values
stations.to_csv(root_path + '/data/stations_geocoded.csv', index = False, sep=";")
print("Execution time: %s seconds" % (time.time() - start_time))

Execution time: 114.63174867630005 seconds
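
The cell above splits the table, geocodes one part and concatenates both parts, which moves the newly geocoded rows to the end. A possible alternative, sketched below, fills the missing coordinates in place, so the original row order is kept and a failed lookup simply leaves the row empty. `fill_missing_coordinates` is a hypothetical helper. It assumes that `geocode` is a callable returning either `None` or an object with `latitude` and `longitude` attributes, as the rate-limited `geocode` defined above does.

```python
import pandas as pd


def fill_missing_coordinates(stations, geocode):
    """Return a copy of `stations` with missing latitude/longitude filled in."""
    result = stations.copy()
    missing = result["latitude"].isna() | result["longitude"].isna()

    # Only rows without coordinates are sent to the geocoding service
    for idx, address in result.loc[missing, "address"].items():
        location = geocode(address)  # None when the address is not found
        if location is not None:
            result.loc[idx, ["latitude", "longitude"]] = [
                location.latitude,
                location.longitude,
            ]
    return result
```

Rows that still have no coordinates afterwards are the ones the service could not resolve.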


# Check whether any address could not be geocoded

This check verifies that every address was geocoded; the result should always have 0 rows.

In [None]:
stations.loc[stations["latitude"].isnull()]

Unnamed: 0,station_id,place_name,address,city,country,type,latitude,longitude,weights
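
When the notebook runs unattended, an empty table is easy to overlook. As a sketch, the same check can be turned into an explicit error that names the failing addresses. Both function names below are hypothetical, and the code assumes the `address`, `latitude` and `longitude` columns shown earlier.

```python
import pandas as pd


def rows_without_coordinates(stations):
    """Return the rows whose latitude or longitude is missing."""
    missing = stations[["latitude", "longitude"]].isna().any(axis=1)
    return stations.loc[missing]


def require_all_geocoded(stations):
    """Raise an error naming the addresses that could not be geocoded."""
    failed = rows_without_coordinates(stations)
    if not failed.empty:
        addresses = "; ".join(failed["address"].astype(str))
        raise ValueError(f"{len(failed)} address(es) not geocoded: {addresses}")
```

Calling `require_all_geocoded(stations)` before exporting would stop the run as soon as an address fails.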


# Visualize the geocoded stations

In [None]:
# Set the Mapbox access token (strip the trailing newline the file may contain)
with open(root_path + "/data/mapbox_token.txt") as token_file:
    px.set_mapbox_access_token(token_file.read().strip())

# Read stations
stations = pd.read_csv("data/stations_geocoded.csv", sep=";", encoding='utf-8')

# Define map
fig = px.scatter_mapbox(stations,
                        lat=stations.latitude,
                        lon=stations.longitude,
                        hover_name="place_name",
                        width=1200,
                        height=800,
                        zoom=11.3,
                        color="type", size = np.ones(stations.shape[0]),
                        color_discrete_sequence = px.colors.qualitative.Light24,
                        opacity = 0.7
                        )

fig.update_layout(legend_font_size=20, hoverlabel=dict(font=dict(family='sans-serif', size=20)))

fig.show()

In [None]:
# stations.loc[stations["place_name"] == 'Villa Pamphilj']