# Udacity Data Analyst Project 3: Wrangling OpenStreetMap Data

_ by Jens Laufer _

_ jenslaufer@gmail.com _

_ Ph. +49-172-8443048 _

## Introduction

In this project I import XML data from OpenStreetMap, audit and wrangle it, and export it to MongoDB. Afterwards I run some analysis on the imported data.

I selected the area of Ostfriesland (East Frisia) in northwestern Germany, on the North Sea coast. Although I am not from there, I am interested in the area because I want to extend my Airbnb business and it looks promising to me: it is touristic and property prices are low. I got these insights about the area from Google search data and data from the German real estate website [Immobilienscout24](http://immobilienscout24.de), but this is not part of the assignment. I am especially interested in all data relating to lodging.

![](img/map.png)

I moved the code from the case study of the Udacity Data Analyst Nanodegree course into a module called audit.py, which I use throughout this assignment. The provided code for extracting part of the large OSM file into a smaller file went into a module called extractor.

In [8]:
# python imports
from collections import defaultdict
import pprint
import re
import codecs
import json
import os
import file_size_humanize as humanize
import os.path as path
import time
import pymongo as mongo
# this is the import of the code from the case study
import audit
# import of the extractor module with the code provided by Udacity
import extractor

ImportError: No module named phonenumbers

In [2]:
# some constants I am using 
OSM_URL = "http://overpass-api.de/api/map?bbox=6.6309,53.4302,7.8291,53.8227"
OSM_FILE = "ostfriesland.osm" 
OSM_EXTRACT_FILE = "{0}_extract.osm".format(OSM_FILE[:OSM_FILE.find('.osm')])

OSM_EXPORT_FILE = "{0}.json".format(OSM_FILE)

MONGO_URL = 'mongodb://localhost:27017/'

## Download of the dataset

In [3]:
# Stream the OSM data for the bounding box in the Overpass URL into a local file,
# unless the file already exists
from urllib2 import urlopen

if not path.exists(OSM_FILE):
    response = urlopen(OSM_URL)
    CHUNK = 16 * 1024
    with open(OSM_FILE, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)

## Auditing of the data

#### Auditing contact data

In [4]:
invalid_phone, invalid_email, invalid_url = audit.audit_contact_data(OSM_FILE)

In [5]:
invalid_phone

['+49 (44 62) 92 25 23',
 '+49 (44 62) 92 25 20',
 '+49 (44 62) 92 20 56',
 '+49 (44 62) 55 11',
 '+49 (44 62) 9 46 99 37',
 '+49 (44 62) 2 34 04',
 '+49 (44 62) 92 39 28',
 '+49 (44 62) 2 33 54',
 '+49 (44 62) 2 33 55',
 '+49 4935 91 12 0002',
 '+49(0)4462/943781',
 '+49 (44 62) 56 39',
 '+49 4935 99 99 83 60',
 '+49 (44 62) 20 51 24',
 '+49 (44 62) 98 64 91',
 '+49 (44 62) 98 64 91',
 '+49 (44 62) 20 52 76',
 '+49 (44 62) 9 48 80',
 '+49 (44 62) 9 23 89 81',
 '+49 (44 62) 202 90 53',
 '+49 (44 62) 22 43',
 '+49 (4939) 410',
 '+49 (4939) 914010',
 '+49 4935 91 81 0',
 '+49 251 2 76 16 03',
 '+49 4935 806 0',
 '+49 4935 804-0',
 '+49-4931-9755117',
 '+49-4931-9755117',
 '+49-4931-9347611',
 '+49-4931-9563237',
 '+49 (44 62) 43 34',
 '+49 (0)4920-939004',
 '+49 (0)4920-569',
 '+49 (4974) 249',
 '0173 - 292 21 90']
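The validation rule inside audit.py is not shown here, but most of the rejected numbers differ only in punctuation: parenthesised area codes, a "(0)" trunk prefix, hyphens and slashes. A minimal sketch of a normalizer follows, assuming the goal is a compact +49... form. `normalize_phone` is a hypothetical helper and not part of audit.py.

```python
import re


def normalize_phone(raw, country_code="49"):
    """Return a compact '+<digits>' form, or None if the number is ambiguous.

    Hypothetical helper: assumes German numbers and a '+49...' target format.
    """
    # Drop the optional trunk prefix that is sometimes written as "(0)"
    cleaned = raw.replace("(0)", "").strip()
    digits = re.sub(r"\D", "", cleaned)
    if not digits:
        return None
    if cleaned.startswith("+"):
        # Already international: keep the digits as they are
        return "+" + digits
    if digits.startswith("00"):
        # International prefix written as 00
        return "+" + digits[2:]
    if digits.startswith("0"):
        # National format: replace the trunk zero with the country code
        return "+" + country_code + digits[1:]
    # No prefix at all: cannot tell whether an area code is present
    return None
```

With this sketch, '+49(0)4462/943781' becomes '+494462943781' and '0173 - 292 21 90' becomes '+491732922190'.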

In [6]:
# print each address UTF-8 encoded so the umlauts stay readable
for email in invalid_email:
    print email.encode('utf-8')

info@küchenwerkstatt-juist.de
info@üetra-juist.de
immergrün.juist@t-online.de
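All three rejected addresses contain umlauts. If the audit rule only accepts ASCII (an assumption, since the rule is not shown here), one option is to convert an internationalized domain to its ASCII (Punycode) form with Python's built-in `idna` codec. `to_ascii_domain` is a hypothetical helper; it leaves the local part untouched, so an address with an umlaut before the @ would still need manual review.

```python
# -*- coding: utf-8 -*-


def to_ascii_domain(email):
    """Return the address with its domain converted to ASCII (Punycode).

    Hypothetical helper: only the part after the last '@' is converted.
    """
    local, sep, domain = email.rpartition(u"@")
    if not sep:
        return None  # no '@' at all, so this is not an e-mail address
    # The built-in 'idna' codec encodes each non-ASCII label as 'xn--...'
    ascii_domain = domain.encode("idna").decode("ascii")
    return local + u"@" + ascii_domain
```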


In [7]:
invalid_url

['www.silvis-bungalow.de',
 'www.haus-thomas.de',
 'www.tuedelpott.de',
 'www.hotel-cafecaro.de',
 'www.hotel-westfalenhof.de']
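All five rejected URLs lack a scheme. A minimal sketch of a fix, assuming that a missing scheme is the only reason they were flagged and that http is an acceptable default; `ensure_scheme` is a hypothetical helper:

```python
def ensure_scheme(url, default_scheme="http"):
    """Prefix the URL with a scheme if it has none (hypothetical helper)."""
    url = url.strip()
    if not url or "://" in url:
        # Empty values and URLs that already carry a scheme stay unchanged
        return url
    return "{0}://{1}".format(default_scheme, url)
```

For example, 'www.haus-thomas.de' becomes 'http://www.haus-thomas.de'.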

## Export of OSM to JSON and Import to MongoDB

In [None]:
start = time.time()
audit.process(OSM_FILE,OSM_EXPORT_FILE);
(time.time() - start)


We add an index on id to speed up the upsert, which matches documents on the id field.

In [None]:

nodes = mongo.MongoClient(MONGO_URL).osm.nodes
nodes.create_index([("id", mongo.ASCENDING)])


In [None]:
start = time.time()
os.system('mongoimport --quiet --upsertFields id --db osm --collection nodes --file {0}'.format(OSM_EXPORT_FILE));
(time.time() - start)
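`os.system` only returns the exit code, so a failed import is easy to miss. As a sketch of an alternative, the same command can be passed to `subprocess.check_call` as an argument list, which raises an exception on a non-zero exit code. The flags are the ones used in the cell above; `mongoimport_args` and `run_mongoimport` are hypothetical helpers.

```python
import subprocess


def mongoimport_args(json_file, db="osm", collection="nodes", upsert_field="id"):
    """Build the mongoimport command as a list (no shell quoting needed)."""
    return [
        "mongoimport", "--quiet",
        "--upsertFields", upsert_field,
        "--db", db,
        "--collection", collection,
        "--file", json_file,
    ]


def run_mongoimport(json_file):
    """Run the import; raises subprocess.CalledProcessError if it fails."""
    subprocess.check_call(mongoimport_args(json_file))
```

In the notebook this would be called as `run_mongoimport(OSM_EXPORT_FILE)`.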


### Addition of some indexes

In [None]:
nodes.create_index([("type", mongo.ASCENDING)])
nodes.create_index([("address.city", mongo.ASCENDING)])

nodes.create_index([("pos", mongo.GEO2D)]);
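The 2d index on pos allows proximity queries with the `$near` operator. The sketch below only builds the query document. `near_query` is a hypothetical helper, and the point must use the same coordinate order in which pos is stored in the collection, which is not shown here.

```python
def near_query(point):
    """Build a $near query on the 'pos' field (hypothetical helper).

    `point` is a coordinate pair in the same order as the stored 'pos' values.
    """
    return {"pos": {"$near": list(point)}}
```

A call such as `nodes.find(near_query(some_point)).limit(10)` would then return the ten documents closest to that point, nearest first.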

## Analysis of the data

In [None]:
audit.count_tags(OSM_FILE)

In [None]:
# File size of the full osm file
info = os.stat(OSM_FILE)
"Filesize of {0} {1}".format(OSM_FILE, humanize.humansize(info.st_size))

In [None]:
# File size of the exported JSON file
info = os.stat(OSM_EXPORT_FILE)
"Filesize of {0} {1}".format(OSM_EXPORT_FILE, humanize.humansize(info.st_size))

In [None]:
"{} Documents in MongoDB".format(nodes.find().count())

In [None]:
list(nodes.aggregate([
        {'$group':{'_id':'$type','count':{'$sum':1}}}
    ]))

Most of the documents are nodes.

In [None]:
"{} Documents with contact in MongoDB".format(nodes.find({'contact':{'$exists':1}}).count())

In [None]:
"{} Documents with contact.website in MongoDB".format(nodes.find({'contact.website':{'$exists':1}}).count())

I have the opportunity to invest in a house in the village of Dornum, so I am interested in how many accommodations there are:

In [None]:
"{} Documents with contact.website in MongoDB".format(nodes.find({'contact.website':{'$exists':1}}).count())

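The cell above repeats the contact.website count and does not yet restrict the result to Dornum or to lodging. A query along the following lines could answer the question. It is a sketch under two assumptions: that audit.process stores the OSM tourism tag as a top-level field named tourism, and that the listed tag values cover the lodging types of interest. Only the address.city field is confirmed by the index created earlier. `lodging_query` is a hypothetical helper.

```python
# OSM tourism values that describe places to stay (selection is an assumption)
LODGING_KINDS = ("hotel", "guest_house", "hostel", "motel",
                 "apartment", "chalet", "camp_site")


def lodging_query(city, kinds=LODGING_KINDS):
    """Build a query for lodging documents in one city (hypothetical helper).

    Assumes a top-level 'tourism' field; 'address.city' comes from the notebook.
    """
    return {"address.city": city, "tourism": {"$in": list(kinds)}}
```

In the notebook, `nodes.find(lodging_query("Dornum")).count()` would then give the number of matching documents.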
## Additional problems in the data

  - Cross-field consistency: does the postcode match the street and city?
  - Does an address always have a house number when there is a street?
  - Handling of P.O. boxes
  - Are e-mail addresses still valid?
  - Are websites still available?
  - Street names
  - City names
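As a sketch of how the first point could be audited, the function below collects the cities seen for each postcode and reports postcodes that appear with more than one city. It assumes each document carries an address sub-document with postcode and city keys; only address.city is confirmed by the notebook, and address.postcode is an assumption. A postcode shared by several places can be legitimate, so the result is a list of candidates for manual review, not a list of errors.

```python
from collections import defaultdict


def postcodes_with_multiple_cities(documents):
    """Map each postcode that occurs with more than one city to those cities.

    Hypothetical helper: assumes documents shaped like
    {'address': {'postcode': ..., 'city': ...}}.
    """
    cities_by_postcode = defaultdict(set)
    for doc in documents:
        address = doc.get("address", {})
        postcode = address.get("postcode")
        city = address.get("city")
        if postcode and city:
            cities_by_postcode[postcode].add(city)
    # Keep only postcodes that were seen with at least two different cities
    return {postcode: sorted(cities)
            for postcode, cities in cities_by_postcode.items()
            if len(cities) > 1}
```

In the notebook this could be fed with `nodes.find({'address': {'$exists': 1}})`.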