<h1 align="center"><font color="#000066" size="6">OpenStreetMap Data Wrangling Project</font></h1> 



For this project I chose to investigate OpenStreetMap data for Nancy, France, which is my hometown.

# 1. Problems Encountered in the Map


#### a) Names misspelled

The main problem I encountered was that different spellings were used for the same city or the same street. For instance, the city "Vandoeuvre-lès-Nancy" appeared with these four different spellings:
* Vandoeuvre-les-Nancy
* Vandoeuvre-lès-Nancy
* Vandœuvre-lès-Nancy
* Vandoeuvre les Nancy

So I made sure to update these names to a common name before inserting them into MongoDB.
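
As an illustration of this kind of cleanup, here is a minimal sketch of a lookup-based normalization. The names `CANONICAL_CITY`, `CITY_VARIANTS` and `normalize_city` are hypothetical, and the choice of "Vandoeuvre-lès-Nancy" as the common name is an assumption; the project's actual update code is not shown here.

```python
# Assumed canonical spelling; the report does not say which one was chosen.
CANONICAL_CITY = "Vandoeuvre-lès-Nancy"

# Every spelling observed in the data points to the canonical one.
CITY_VARIANTS = {
    "Vandoeuvre-les-Nancy": CANONICAL_CITY,
    "Vandoeuvre-lès-Nancy": CANONICAL_CITY,
    "Vandœuvre-lès-Nancy": CANONICAL_CITY,
    "Vandoeuvre les Nancy": CANONICAL_CITY,
}


def normalize_city(name):
    """Return the canonical spelling for a known variant, else the name unchanged."""
    cleaned = name.strip()
    return CITY_VARIANTS.get(cleaned, cleaned)


print(normalize_city("Vandoeuvre les Nancy"))
print(normalize_city("Maxéville"))
```

A dictionary keeps the accepted variants explicit, so an unexpected spelling passes through unchanged and can be spotted later instead of being silently rewritten.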

#### b) Street names not capitalized

Some street names started with a lowercase letter and others with an uppercase letter, so I updated them programmatically with Python's `capitalize` string method before inserting them into MongoDB.
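
One thing to keep in mind is that `str.capitalize` also lowercases every character after the first one. The sketch below contrasts it with a hypothetical helper, `capitalize_first`, that only touches the first letter; the street name is just an illustrative value, and this helper is not the one used in the project.

```python
def capitalize_first(street):
    """Upper-case only the first character and leave the rest untouched."""
    return street[:1].upper() + street[1:]


street = "rue Saint-Jean"  # illustrative value

# str.capitalize lowercases the rest of the string.
print(street.capitalize())       # Rue saint-jean

# The helper preserves the existing capitals.
print(capitalize_first(street))  # Rue Saint-Jean
```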

#### c) Postal codes not corresponding to Nancy

Before inserting postal codes into the database, I implemented a routine to check that all of them have a consistent format. Postal codes for Nancy and its suburbs all begin with '54' followed by three digits. So the code is:
```
import re

# '54' followed by exactly three digits
p = re.compile(r'54[0-9]{3}\b')
# postal_code holds the postal code string being checked
p.match(postal_code)
```

Postal codes that didn't match were not inserted into the database but were written out to a separate file for further inspection. Surprisingly, none of the postal codes were inconsistent. A simple database query, `db.nodes.distinct('address.postcode', {})`, confirms that all postal codes have a correct format, but not all of them correspond to the postal code of Nancy itself: the data covers not only the city of Nancy but also its suburban area. Below is a count of the nodes for each postal code:

```
db.nodes.aggregate([{ $group: {_id:"$address.postcode", count: {$sum: 1}}} ])

{ "_id" : "54770", "count" : 3 }
{ "_id" : "54425", "count" : 3 }
{ "_id" : "54270", "count" : 3 }
{ "_id" : "54320", "count" : 2 }
...

```
 
It appears that very few documents include postal codes.
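
The validation step described in this section (keep matching postal codes, write the others to a separate file) could be sketched as follows. The function name `check_postcodes` and the sample values are hypothetical, and an in-memory buffer stands in for the real output file so the example runs on its own.

```python
import io
import re

# Same pattern as above: '54' followed by exactly three digits.
POSTCODE_RE = re.compile(r'54[0-9]{3}\b')


def check_postcodes(postcodes, rejected_file):
    """Return the valid postal codes; write the others to rejected_file, one per line."""
    valid = []
    for code in postcodes:
        if POSTCODE_RE.match(code):
            valid.append(code)
        else:
            rejected_file.write(code + "\n")
    return valid


# io.StringIO replaces a file opened with open(...) for this example.
rejected = io.StringIO()
valid = check_postcodes(["54000", "54320", "75001"], rejected)
print(valid)
print(rejected.getvalue())
```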
 
 
#### d) Timestamp consistency

The consistency of the timestamps was checked using the same procedure as for postal codes: a regex was used to ensure that their format is correct, and inconsistent timestamps are written out to a separate file. The following regex is used:
```
# Expected shape: YYYY-MM-DDTHH:MM:SSZ
t = re.compile(r'[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z')
# timestamp holds the timestamp string being checked
t.match(timestamp)
```

Once again no inconsistent timestamps were found.
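
The regex only checks the shape of the string, so a value such as month 13 would still pass. As a possible complement (not something used in this project), the timestamp could also be parsed with `datetime.strptime`, which rejects impossible dates. The helper name `is_valid_timestamp` and the sample values are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def is_valid_timestamp(value):
    """Return True when value parses as a real date and time in the expected format."""
    try:
        datetime.strptime(value, TIMESTAMP_FORMAT)
        return True
    except ValueError:
        return False


print(is_valid_timestamp("2015-06-01T12:30:45Z"))  # True
print(is_valid_timestamp("2015-13-01T12:30:45Z"))  # False, there is no month 13
```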


# 2. Data Overview

#### Size of the file:

  nancy_france.osm : 148M

#### Number of unique users:

```
> db.nodes.distinct('created.user').length
256
```
   
#### Number of nodes:

```
> db.nodes.count()
624594
```
   
#### Number of ways:

```
> db.ways.count() 
113654 
```
   
#### Number of different cities:

```
> db.nodes.distinct('address.city', {}).length
8
```

   
# 3. Other Ideas


I wanted to compute some statistics for each city, such as the number of houses or streets per city, but as the following query shows, most of the nodes have no city assigned:

```
> db.nodes.group( { key: {'address.city':1}, reduce: function ( curr, result ) { result.total++; }, initial: { total : 0 } } )
[
	{
		"address.city" : null,
		"total" : 624577
	},
	{
		"address.city" : "Maxéville",
		"total" : 1
	},
	{
		"address.city" : "Villers-lès-Nancy",
		"total" : 1
	},
	...
]
```

So one improvement to the data would be to assign each node its correct city. To do that, the location (latitude and longitude) stored with each node could be used. For example, the following code shows how the address of a node can be obtained using the [geopy][1] Python library:

```
from geopy.geocoders import Nominatim

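# Recent geopy releases expect a custom user_agent; the value below is a placeholder
geolocator = Nominatim(user_agent="nancy-osm-wrangling")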
location = geolocator.reverse("48.7040581, 6.1336632")
print(location.address)

"A 31, Maxéville, Nancy, Meurthe-et-Moselle, Alsace-Champagne-Ardenne-Lorraine, France métropolitaine, 54320, France"
```

The coordinates used here are real coordinates stored in our database, and we can observe that the library correctly returns meaningful information such as the name of the city (Maxéville), the postal code (54320) and the street name (A 31). Behind the scenes, geopy can rely on several geocoding services (Google Geocoding API, OpenStreetMap Nominatim, Bing Maps API...); the example above uses Nominatim.
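
To make the idea more concrete, here is a rough sketch of how a missing city could be filled in from a reverse-geocoding result. Several things are assumptions: the node layout (a `pos` field holding `[lat, lon]`), the address keys `city`, `town` and `village`, and the helper names `city_from_address` and `assign_city`. The geocoder is passed in as a function so the example runs without network access; `fake_reverse` simply returns values taken from the output shown above.

```python
import copy


def city_from_address(address):
    """Pick the first city-like field found in an address dictionary."""
    for key in ("city", "town", "village"):  # assumed key names
        if address.get(key):
            return address[key]
    return None


def assign_city(node, reverse_geocode):
    """Return the node with address.city filled in when it is missing."""
    if node.get("address", {}).get("city"):
        return node  # a city is already assigned, nothing to do
    lat, lon = node["pos"]  # assumed field holding [lat, lon]
    city = city_from_address(reverse_geocode(lat, lon))
    if city is None:
        return node
    updated = copy.deepcopy(node)  # leave the original document untouched
    updated.setdefault("address", {})["city"] = city
    return updated


def fake_reverse(lat, lon):
    """Stand-in for a real geocoder call; returns a fixed address."""
    return {"road": "A 31", "town": "Maxéville", "postcode": "54320"}


node = {"id": "1", "pos": [48.7040581, 6.1336632]}
print(assign_city(node, fake_reverse)["address"]["city"])
```

In a real run, `reverse_geocode` would wrap the geolocator call shown earlier. With more than 600,000 nodes lacking a city, the request limits of a public geocoding service would likely make batching or a local lookup necessary.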
 
[1]: https://github.com/geopy/geopy

    
   
   


