# OpenStreetMap Case Study: Mestre Venice


In this project, I apply data munging techniques, such as assessing data quality for validity, accuracy, completeness, consistency and uniformity, to clean the OpenStreetMap data for my hometown, [Venice Mestre, Italy](https://www.openstreetmap.org/node/29997772#map=10/45.4953/12.2415).

The following image is a satellite picture of the Venice area taken from Google Maps. [Mestre](https://en.wikipedia.org/wiki/Mestre) is part of the city of Venice and is marked with the red flag A.
![alt text](https://7crooks.files.wordpress.com/2011/04/screen-shot-2011-04-20-at-11-23-56-am.png "")

The dataset is about **112 MB** and is used entirely in this analysis. 

In this project, the following fields of the dataset are audited:

1. street names
2. postal codes between 30121 and 30176
3. names of city suburbs
4. province information
5. telephone numbers in the format +39 XXX XXXXXXX


## PART 1: find errors in the dataset and define functions to correct them

In this part of the project, we check the fields of the OSM dataset listed in the introduction to find errors. Given the list of errors, functions that correct them are created and tested.

### Street names

The XML file of the Mestre OpenStreetMap extract is parsed using Python's cElementTree. We access the street names through the tags of the XML file. Using Python's regular expression library, we extract the street type (road, avenue, etc.) from each name and compare it with a list of correct types. If the type is not in the list, we save it in a dictionary of wrong names, where the key is the wrong street type and the value is the list of full street names that contain it. We then create a second dictionary that maps each wrong name (key) to the correct name (value). Using this map, we correct the street names.

In this analysis, numerous typos were found (for example, Dorsorduro instead of Dorsoduro). Moreover, several names did not start with a capital letter; we normalize them to the capitalized form.
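
The `audit_street` module imported in the next cell is not reproduced in this report. The audit-and-map idea described above could be sketched roughly as follows. The names `EXPECTED`, `MAPPING`, `audit_street_sketch` and `correct_street_name_sketch`, as well as the sample XML, are assumptions made for illustration only, and the sketch assumes that the street type is the first word of the name, as is usual in Italian.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from io import BytesIO

# Hypothetical list of accepted street types (first word of the name).
EXPECTED = {"Via", "Viale", "Piazza", "Calle", "Corso"}
# Hypothetical mapping from a wrong leading word to its correction.
MAPPING = {"via": "Via", "Dorsorduro": "Dorsoduro"}

FIRST_WORD = re.compile(r"^\S+")

def audit_street_sketch(osm_source):
    """Map each unexpected leading word to the set of full names using it."""
    wrong = defaultdict(set)
    for _, elem in ET.iterparse(osm_source, events=("end",)):
        if elem.tag == "tag" and elem.get("k") == "addr:street":
            name = elem.get("v")
            match = FIRST_WORD.search(name)
            if match and match.group() not in EXPECTED:
                wrong[match.group()].add(name)
    return wrong

def correct_street_name_sketch(name, mapping):
    """Replace the leading word if the mapping knows a correction for it."""
    match = FIRST_WORD.search(name)
    if match and match.group() in mapping:
        return mapping[match.group()] + name[match.end():]
    return name

# Made-up sample data, not taken from the Mestre extract.
SAMPLE = b"""<osm>
  <node id="1"><tag k="addr:street" v="via Piave"/></node>
  <node id="2"><tag k="addr:street" v="Dorsorduro"/></node>
  <node id="3"><tag k="addr:street" v="Via Roma"/></node>
</osm>"""

wrong = audit_street_sketch(BytesIO(SAMPLE))
fixed = {name: correct_street_name_sketch(name, MAPPING)
         for names in wrong.values() for name in names}
```

Names whose leading word is already in `EXPECTED` never enter the dictionary of wrong names, so only the entries that need a decision have to be reviewed by hand.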

In [1]:
# File path
OSMFILE = "C:/Users/jacopo/Desktop/Deep Learning/Udacity/Projects/DataWrangling/Final project/mestre.osm"

In [2]:
from audit_street import audit_street, correct_street_name, mapping_street

# Run codes
st_types = audit_street(OSMFILE)
print('Correction of street names (first 5):\n')  
counter = 0
for st_type, ways in st_types.items():
    
    for name in ways:
        better_name = correct_street_name(name, mapping_street)
        if counter <5:  
            print(name + '-->' + better_name)
            counter +=1

Correction of street names (first 5):

Sestiere Dorsoduro-->Dorsoduro
Sestiere Cannaregio-->Cannaregio
Dorsoduro, San Trovaso-->Dorsoduro
Gallion-->Calle Gallion
Stazione Santa Lucia-->Cannaregio


### Postal code

We check whether each postal code lies between 30121 and 30176. Two tags contain the city name (Venice) or a street name (Ponte dei Pugni) in the postal code field, and several postal codes fall outside the expected range: the map of Mestre contains several locations outside the city!
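
The range check described above can be illustrated with a small sketch. The helpers `classify_postcode` and `extract_postcode` are hypothetical and are not the functions of the `audit_postal` module used in the next cell; only the range 30121 to 30176 and the example values come from this report.

```python
import re

FIVE_DIGITS = re.compile(r"\d{5}")

def classify_postcode(value, low=30121, high=30176):
    """Return 'ok', 'out_of_range' or 'malformed' for a raw postcode value."""
    if not FIVE_DIGITS.fullmatch(value):
        return "malformed"
    return "ok" if low <= int(value) <= high else "out_of_range"

def extract_postcode(value):
    """Return a 5-digit code embedded in a malformed value, or None."""
    match = FIVE_DIGITS.search(value)
    return match.group() if match else None

# Values taken from the audit output reported in this section.
labels = {v: classify_postcode(v)
          for v in ["30123", "30034", "Venice30123", "PontedeiPugni"]}
recovered = extract_postcode("Venice30123")
unrecoverable = extract_postcode("PontedeiPugni")
```

A value such as `PontedeiPugni` carries no digits at all, so it cannot be repaired automatically and needs an explicit entry in a correction mapping.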

In [3]:
from audit_postal import audit_postcode, correct_postal_code, mapping_postal_code

# Run codes
wrong_postal_codes = audit_postcode(OSMFILE)

print('\nCorrection of postal codes:')  
for code_type, ways in wrong_postal_codes.items():
    for name in ways:
        new_postal_code = correct_postal_code(name, mapping_postal_code)
        print(name + '-->' + new_postal_code)
        

Postal codes outside Mestre area (first 5):
30100
30034
30020
30020
30030

Correction of postal codes:
PontedeiPugni-->30123
Venice30123-->30123


### Names of city suburbs

Here we check whether the suburb names are correct. In this case too, we found several locations outside the city. Moreover, we correct misspelled names as we did for street names. In two cases, we found a postal code instead of the suburb name; these were corrected by inserting the suburb name.

In [4]:
from audit_suburb import audit_city, mapping_city, correct_city_sub

# Run codes
city_sub_wrong = audit_city(OSMFILE)
counter = 0
print('Correction of suburb names (first 5):\n')  
for code_type, ways in city_sub_wrong.items():
    for name in ways:
        new_suburb = correct_city_sub(name, mapping_city)
        if counter < 5:
            print(name + '-->' + new_suburb)
            counter+=1


Correction of suburb names (first 5):

Venice-->Venezia
Marghera VE-->Marghera
30173-->Tessera
3073-->Tessera
Venezia Mestre-->Mestre


### Province information

We make sure that the province is "Venezia" in every node: the province name is changed from "VE" to "Venezia".

In [5]:
from audit_province import audit_prov

audit_prov(OSMFILE)

### Telephone number

We check telephone numbers. Using the same approach as for street names, we ensure that all phone numbers have the same format, +39 XXX XXXXXXX.
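
One way such a normalization could work is sketched below. The function `normalize_phone` is hypothetical and is not the code of `audit_phone_number`. It assumes a 3-digit area prefix, which holds for the Venice prefix 041 but not for every Italian area code.

```python
import re

def normalize_phone(raw):
    """Normalize a raw number to '+39 XXX XXXXXXX' (3-digit prefix assumed)."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    if digits.startswith("0039"):
        digits = digits[4:]                  # drop international prefix 0039
    elif digits.startswith("39") and len(digits) > 10:
        digits = digits[2:]                  # drop country code 39
    if len(digits) < 6:
        return None                          # too short to be a full number
    return "+39 {} {}".format(digits[:3], digits[3:])

# Raw values taken from the output reported in this section.
examples = ["390412776142", "+39 0412749227", "+39 041 52 87 409", "+390415470160"]
normalized = [normalize_phone(e) for e in examples]
```

Stripping every non-digit character first means that spaces, dots and a leading `+` are all handled by the same code path.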

In [6]:
from audit_phone_number import audit_phone

audit_phone(OSMFILE)

Corrected phone numbers (first 5):

390412776142-->+39 041 2776142
+39 0412749227-->+39 041 2749227
+39 0415341310-->+39 041 5341310
+39 041 52 87 409-->+39 041 5287409
+390415470160-->+39 041 5470160


## PART 2: correct data and create a RDBMS database

The OSM dataset is now parsed, and its "node" and "way" elements are saved as dictionaries.
Each element of the OSM XML file is analyzed and corrected using the functions defined in Part 1. The document is transformed to a tabular format using Python dictionaries and saved into .csv files. These .csv files can then easily be imported into a SQL database as tables.
This process consists of the following steps:
- iteratively step through each top level element in the OSM XML 
- analyze each node by correcting wrong values and saving them into dictionaries
- use a schema and validation library to ensure the transformed data is in the correct format
- write each data structure to the appropriate .csv files
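
The steps above can be sketched for a single node as follows. This is only an illustration under stated assumptions: `shape_node` and `fix_value` are hypothetical names, the sample XML is made up, the schema validation step is omitted, and the real `process_map()` may organize its columns differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

TAG_FIELDS = ["id", "key", "value"]

def fix_value(key, value):
    """Stand-in for the Part 1 correction functions (hypothetical)."""
    if key == "addr:province" and value == "VE":
        return "Venezia"
    return value

def shape_node(elem):
    """Turn a <node> element into one node dict and a list of tag dicts."""
    node = {field: elem.get(field) for field in ("id", "lat", "lon", "user")}
    tags = [{"id": elem.get("id"),
             "key": tag.get("k"),
             "value": fix_value(tag.get("k"), tag.get("v"))}
            for tag in elem.iter("tag")]
    return node, tags

# Made-up sample data, not taken from the Mestre extract.
SAMPLE = (b'<osm><node id="1" lat="45.49" lon="12.24" user="demo">'
          b'<tag k="addr:province" v="VE"/></node><way id="9"/></osm>')

out = io.StringIO()                      # stands in for nodes_tags.csv
writer = csv.DictWriter(out, fieldnames=TAG_FIELDS)
writer.writeheader()
for _, elem in ET.iterparse(io.BytesIO(SAMPLE), events=("end",)):
    if elem.tag == "node":
        node, tags = shape_node(elem)
        writer.writerows(tags)
        elem.clear()                     # free memory on large files
csv_lines = out.getvalue().splitlines()
```

Correcting values while the element is being shaped keeps the .csv files clean, so no further cleaning is needed once they are loaded into the database.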

The function *process_map()* defined in *create_RDBMS.py* corrects the errors in the OSM dataset and creates the following files:
- nodes.csv
- nodes_tags.csv
- ways.csv
- ways_nodes.csv
- ways_tags.csv

Using these files, a new relational database called **mestre.db** is created using SQLite3.
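
The import step is not shown in this report. A minimal sketch of loading one of the .csv files into SQLite is given below. It assumes a `Nodes_tags` table with the columns `id`, `key` and `value` (the columns used by the queries in Part 3), uses made-up rows, and works on an in-memory database instead of mestre.db.

```python
import csv
import io
import sqlite3

# Made-up stand-in for nodes_tags.csv.
NODES_TAGS_CSV = (
    "id,key,value\n"
    "1,amenity,restaurant\n"
    "1,cuisine,italian\n"
    "2,amenity,cafe\n"
)

demo_db = sqlite3.connect(":memory:")    # the project writes mestre.db instead
demo_db.execute("CREATE TABLE Nodes_tags (id INTEGER, key TEXT, value TEXT)")

reader = csv.DictReader(io.StringIO(NODES_TAGS_CSV))
rows = [(row["id"], row["key"], row["value"]) for row in reader]
demo_db.executemany(
    "INSERT INTO Nodes_tags (id, key, value) VALUES (?, ?, ?)", rows)
demo_db.commit()

row_count = demo_db.execute("SELECT COUNT(*) FROM Nodes_tags").fetchone()[0]
```

Using `?` placeholders with `executemany` lets SQLite handle quoting, which matters for tag values that contain apostrophes or commas.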


In [None]:
from create_RDBMS import process_map

process_map(OSMFILE, validate=True)

## PART 3: Statistics of Mestre's OSM database (corrected)

The relational database created previously is analyzed using SQL. The original OSM file is about 113 MB, whereas the size of the database is 87 MB.

In [7]:
# import sqlite3
import sqlite3
from query_db import query_db

# connect to the database and create cursor for querying
conn = sqlite3.connect('mestre.db')
cursor=conn.cursor()
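
The helper `query_db` is imported from a module that is not reproduced here. Judging only from the printed outputs in this part, it could behave roughly like the hypothetical `query_db_sketch` below, which is an assumption and not the actual implementation. The demo table and user names are made up.

```python
import sqlite3

def query_db_sketch(query, cursor):
    """Run a query, print each row as 'col1: col2', and return the rows."""
    rows = cursor.execute(query).fetchall()
    for row in rows:
        print(": ".join(str(col) for col in row))
    return rows

# Made-up in-memory data to exercise the helper.
demo_conn = sqlite3.connect(":memory:")
demo_cursor = demo_conn.cursor()
demo_cursor.execute("CREATE TABLE Nodes (id INTEGER, user TEXT)")
demo_cursor.executemany("INSERT INTO Nodes VALUES (?, ?)",
                        [(1, "alice"), (2, "alice"), (3, "bob")])

demo_rows = query_db_sketch(
    "SELECT user, COUNT(*) AS total FROM Nodes "
    "GROUP BY user ORDER BY total DESC;",
    demo_cursor)
```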


Check **number of nodes and ways** in the map:

In [8]:
print('Nodes: ')
QUERY = "SELECT COUNT(*) FROM Nodes"
result = query_db(QUERY,cursor)

print('\nWays: ')
QUERY = "SELECT COUNT(*) FROM Ways"
query_db(QUERY,cursor)

Nodes: 
528507

Ways: 
81700


Check the **users who contributed most** to the OSM database.

In [9]:
print('Top 3 users that most contributed to ways: ')
QUERY = "SELECT user, COUNT(*) as total FROM Ways GROUP BY user ORDER BY total DESC LIMIT 3;"
result = query_db(QUERY,cursor)

print('\nTop 3 users that most contributed to nodes: ')
QUERY = "SELECT user, COUNT(*) as total FROM Nodes GROUP BY user ORDER BY total DESC LIMIT 3;"
result = query_db(QUERY,cursor)

print('\nTop 3 users that most contributed to OSM database: ')
QUERY = "SELECT way_node.user, COUNT(*) as total FROM (SELECT user FROM Ways UNION ALL SELECT user FROM Nodes) way_node GROUP BY way_node.user ORDER BY total DESC LIMIT 3;"
result = query_db(QUERY,cursor)

Top 3 users that most contributed to ways: 
DarkSwan_Import: 22837
bellazambo: 15054
Arlas: 13230

Top 3 users that most contributed to nodes: 
DarkSwan_Import: 130932
bellazambo: 98640
Arlas: 69366

Top 3 users that most contributed to OSM database: 
DarkSwan_Import: 153769
bellazambo: 113694
Arlas: 82596


The most active user is **DarkSwan_Import**, who created 28% of the ways, 25% of the nodes and 25% of all elements (nodes and ways combined). The top 3 users, **DarkSwan_Import, bellazambo**, and **Arlas**, together created 63% of the ways, 57% of the nodes and 57% of all elements.
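
These shares can be recomputed from the counts printed above. The short check below uses only those numbers; the helper `share` is introduced here for illustration.

```python
# Totals and per-user counts copied from the query outputs above.
ways_total, nodes_total = 81700, 528507
top3_ways = {"DarkSwan_Import": 22837, "bellazambo": 15054, "Arlas": 13230}
top3_nodes = {"DarkSwan_Import": 130932, "bellazambo": 98640, "Arlas": 69366}

def share(part, whole):
    """Percentage of `whole` represented by `part`, rounded to an integer."""
    return round(100 * part / whole)

all_total = ways_total + nodes_total
top_user = "DarkSwan_Import"

shares = {
    "top_user_ways": share(top3_ways[top_user], ways_total),
    "top_user_nodes": share(top3_nodes[top_user], nodes_total),
    "top_user_all": share(top3_ways[top_user] + top3_nodes[top_user], all_total),
    "top3_ways": share(sum(top3_ways.values()), ways_total),
    "top3_nodes": share(sum(top3_nodes.values()), nodes_total),
    "top3_all": share(sum(top3_ways.values()) + sum(top3_nodes.values()), all_total),
}
print(shares)
```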

**Analyze restaurants** in Mestre's OSM database.

In [10]:
print('Number of restaurants in Mestre Venezia OSM:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%restaurant%';"
query_db(QUERY,cursor)

print('\nMost popular cuisines:')
QUERY = "SELECT Nodes_tags.value, COUNT(*) AS tot FROM Nodes_tags WHERE Nodes_tags.id IN (SELECT DISTINCT(id) FROM Nodes_tags WHERE value LIKE '%restaurant%') AND Nodes_tags.key LIKE '%cuisine%' GROUP BY Nodes_tags.value ORDER BY tot DESC LIMIT 5;"
query_db(QUERY,cursor)

print('\nNumber of restaurants having phone number specified:')
QUERY = "SELECT COUNT(*) AS tot FROM Nodes_tags WHERE Nodes_tags.id IN (SELECT DISTINCT(id) FROM Nodes_tags WHERE value LIKE '%restaurant%') AND Nodes_tags.key LIKE '%phone%';"
query_db(QUERY,cursor)


print('\nTotal number of cuisine specifications:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE key LIKE '%cuisine%';"
query_db(QUERY,cursor)

Number of restaurants in Mestre Venezia OSM:
236

Most popular cuisines:
italian: 33
pizza: 22
regional: 12
italian;pizza: 5
chinese: 3

Number of restaurants having phone number specified:
43

Total number of cuisine specifications:
129


There are 236 restaurants in the OSM database. However, **only 129 cuisine tags are present, so at most 55% of the restaurants specify their cuisine** (the count covers every node with a cuisine key, not only restaurants). The most popular cuisine is italian (33), followed by pizza (22). Counting pizza and regional cuisine as Italian specialities, **72 of the 129 cuisine tags (56%) refer to Italian specialities**.

Moreover, **only 43 restaurants (18%) have a phone number specified**. Therefore, another resource must be used to look up the phone number and book a table.

In addition to the 236 restaurants, there are 36 pubs, 179 bars and 77 cafes in the city, as shown by the following query:

In [11]:
print('Number of pubs in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%pub%';"
query_db(QUERY,cursor)

print('Number of bars in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%bar%';"
query_db(QUERY,cursor)

print('Number of cafes in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%cafe%';"
query_db(QUERY,cursor)

Number of pubs in Mestre Venezia:
36
Number of bars in Mestre Venezia:
179
Number of cafes in Mestre Venezia:
77
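
Note that `LIKE '%bar%'` matches any tag value containing the substring, so counts of this kind may include unrelated tags such as names. A stricter variant is sketched below on made-up rows. It assumes that the OSM key `amenity` is stored unchanged in the `key` column of `Nodes_tags`, which has not been verified against mestre.db.

```python
import sqlite3

amenity_db = sqlite3.connect(":memory:")
amenity_db.execute("CREATE TABLE Nodes_tags (id INTEGER, key TEXT, value TEXT)")
amenity_db.executemany("INSERT INTO Nodes_tags VALUES (?, ?, ?)", [
    (1, "amenity", "bar"),
    (2, "amenity", "cafe"),
    (2, "name", "Bar Centrale"),   # a cafe whose name contains "Bar"
    (3, "name", "Barbara"),        # unrelated name containing "bar"
])

# Substring match, as in the queries above (LIKE ignores ASCII case in SQLite).
loose = amenity_db.execute(
    "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%bar%';").fetchone()[0]

# Exact match on both key and value.
strict = amenity_db.execute(
    "SELECT COUNT(*) FROM Nodes_tags "
    "WHERE key = 'amenity' AND value = 'bar';").fetchone()[0]
```

On this sample the substring query counts three rows while the exact match counts one, which shows how far the two approaches can diverge.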


Check **public transportation stops**.

In [12]:
print('Number of stops for public transportation in Mestre Venezia:')
QUERY = "SELECT value,COUNT(*) AS tot FROM Nodes_tags WHERE value LIKE '%bus_stop%' OR value LIKE '%tram_stop%' GROUP BY value ORDER BY tot DESC ;"
query_db(QUERY,cursor)

Number of stops for public transportation in Mestre Venezia:
bus_stop: 499
tram_stop: 70
bus stop only, private lane: 1
bus stop to PLUS Camping Jolly: 1


The OSM data for Mestre lists **571 stops for public transportation**, of which **87% are bus stops**. The tram was built in 2015 and consists of only two lines, whereas the bus network has about 50 lines, which results in many more bus stops than tram stops. Typical bus and tram stops in downtown Mestre are shown in the following picture, taken from the local news website www.veneziatoday.it.

![alt text](http://2.citynews-veneziatoday.stgy.ovh/~media/original-hi/36526749365012/piazzale-cialdini-mestre-autobus-tram-2.jpg)



Check **number of hotels**:

In [13]:
print('Number of Hotels in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%hotel%';"
query_db(QUERY,cursor)

Number of Hotels in Mestre Venezia:
291


Check **number of supermarkets** in the city:

In [14]:
print('Number of supermarket in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%supermarket%';"
query_db(QUERY,cursor)

print('\nMost numerous supermarkets:')
QUERY = "SELECT Nodes_tags.value, COUNT(*) AS tot FROM Nodes_tags WHERE Nodes_tags.id IN (SELECT DISTINCT(id) FROM Nodes_tags WHERE value LIKE '%supermarket%') AND Nodes_tags.key LIKE '%name%' GROUP BY Nodes_tags.value ORDER BY tot DESC LIMIT 5;"
query_db(QUERY,cursor)

Number of supermarket in Mestre Venezia:
46

Most numerous supermarkets:
Prix: 5
Conad City: 4
Coop: 4
Cadoro: 3
Lidl: 3


Check **number of doctors** in the OSM database. Interestingly, the OSM database for the city of Mestre Venezia lists only 1 doctor. This field should be improved in the future.

In [15]:
print('Number of doctors in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%doctor%';"
query_db(QUERY,cursor)

Number of doctors in Mestre Venezia:
1


Check **pedestrian crossing and traffic signals**.

In [16]:
print('Number of traffic signals in Mestre Venezia:')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%traffic_signal%';"
query_db(QUERY,cursor)

print('\nNumber of street crossing area in Mestre Venezia: ')
QUERY = "SELECT COUNT(*) FROM Nodes_tags WHERE value LIKE '%crossing%';"
query_db(QUERY,cursor)

Number of traffic signals in Mestre Venezia:
124

Number of street crossing area in Mestre Venezia: 
1355


## Suggestions for improvement

This analysis showed that the OSM file is incomplete and contains several errors. The main issue found was the presence of nodes that are not part of the city of Mestre. For example, many nodes and ways located in other nearby small cities are included in Mestre's OSM extract. Other errors, like typos, can be corrected by auditing the XML file as we did in this project.

However, the dataset should be improved substantially in order to provide useful and complete information. The poorest part was the information about doctors, which is important for locals and tourists. For this reason, it would be beneficial to use third-party sites like ulss12.it, which is the official healthcare website of the area, to add this information. Since Mestre is part of the City of Venice, more information about public transportation to and from Venice should be provided. In this case too, a third-party website like actv.it (the official public transportation website) should be used to create new tags for the bus stop nodes. Information about restaurants and museums can be retrieved from Google Maps, whereas information about hotels can be obtained from booking.com. However, a lot of coding is required to access this information automatically through each website's API. To ensure the correct format of the data, the code developed in this project can be reused.