# Udacity Data Analyst Project 3: Wrangling OpenStreetMap Data

_by Jens Laufer_

_jenslaufer@gmail.com_


## Introduction

In this project I import XML data from OpenStreetMap, audit and wrangle it, and export it to MongoDB. Afterwards I analyse the imported data.

I selected the area of Ostfriesland (East Frisia) in north-western Germany, on the North Sea coast. Although I am not from there, I am interested in the area because I want to extend my Airbnb hosting business, and it looks promising to me: it is popular with tourists and property prices are low. I got these insights from Google search data and from the German real estate website [Immobilienscout24](http://immobilienscout24.de), but this is not part of the assignment. I am especially interested in all data related to lodging.

![](img/map.png)

I moved the code from the case study of the Udacity Data Analyst Nanodegree course into a module called osm.py, which I use throughout this assignment. I added some functions there, e.g. for auditing the contact data.

In [1]:
# python imports
import pandas as pd
import numpy as np
from collections import defaultdict
import re
import codecs
import json
import os
import file_size_humanize as humanize
import os.path as path
import time
from pprint import pprint
import pymongo as mongo
# this is the import of the code from the case study
import osm

In [2]:
"""
definition of constants which I am using in this notebook 
"""

FORCE_IMPORT = True

OSM_URL = "http://overpass-api.de/api/map?bbox=6.6309,53.4302,7.8291,53.8227"
OSM_FILE = "ostfriesland.osm" 
OSM_EXPORT_FILE = "{0}.json".format(OSM_FILE)

MONGO_URL = 'mongodb://localhost:27017/'

## Download of the dataset

In [3]:
"""
Stream the OSM data for the chosen bounding box from the Overpass API into a local file, unless the file already exists
"""
from urllib2 import urlopen

if not path.exists(OSM_FILE):
    response = urlopen(OSM_URL)
    CHUNK = 16 * 1024
    with open(OSM_FILE, 'wb') as f:
        while True:
            chunk = response.read(CHUNK)
            if not chunk:
                break
            f.write(chunk)

## Auditing of the data

#### Auditing contact data

I am auditing the contact data. For this I created a function audit_contact_data in the osm module, which checks email addresses and URLs against regular expressions. For phone numbers I am using a port of [Google's libphonenumber for python](https://github.com/daviddrysdale/python-phonenumbers) to test their validity.
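
The implementation of audit_contact_data lives in osm.py and is not shown in this notebook. As a rough sketch of the regular-expression part only, the check could look like the following; the patterns `EMAIL_RE` and `URL_RE`, the helper `find_invalid` and the sample values are assumptions for illustration, not the code actually used:

```python
import re

# Hypothetical patterns; the real ones in osm.py are not shown here.
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
URL_RE = re.compile(r'^https?://[^\s/]+\.[^\s/]+(/\S*)?$', re.IGNORECASE)


def find_invalid(values, pattern):
    """Return the values that do not match the given compiled pattern."""
    return [value for value in values if not pattern.match(value)]


# Made-up sample values: one valid and one invalid entry per field.
urls = ['www.haus-thomas.de', 'http://www.haus-thomas.de']
emails = ['info@example.org', 'info(at)example.org']

invalid_urls = find_invalid(urls, URL_RE)
invalid_emails = find_invalid(emails, EMAIL_RE)
```

With a pattern like this, a website without a scheme is reported as invalid, which would be consistent with the kind of entries flagged in the audit.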

In [4]:
invalid_phone, invalid_email, invalid_url = osm.audit_contact_data(OSM_FILE)

Error parsing 0800 283 50000: (0) Missing or invalid default region.
Error parsing 01622 11 90 66: (0) Missing or invalid default region.
Error parsing 0173 - 292 21 90: (0) Missing or invalid default region.


In [5]:
invalid_phone

['0800 283 50000', '01622 11 90 66', '0173 - 292 21 90']

In [6]:
invalid_email

[]

In [None]:
invalid_url

['www.silvis-bungalow.de',
 'www.haus-thomas.de',
 'www.tuedelpott.de',
 'www.hotel-cafecaro.de',
 'www.hotel-westfalenhof.de']

I need to fix this problematic data before exporting it to MongoDB.
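
The fixing code itself is part of osm.py and not shown here. The phone errors above ("Missing or invalid default region") occur when a number without a leading `+` is parsed without a region, so a fix with python-phonenumbers would probably pass a default region such as `"DE"` to `phonenumbers.parse`. The sketch below is a simplified standard-library stand-in under the assumption that every number without a country prefix is German; `fix_url` and `to_e164_de` are hypothetical names, and the phone helper neither validates numbers nor inserts spaces the way libphonenumber does:

```python
import re


def fix_url(url):
    """Prefix scheme-less URLs with http:// and leave others untouched."""
    if re.match(r'^https?://', url, re.IGNORECASE):
        return url
    return 'http://' + url


def to_e164_de(phone):
    """Convert a number to +<country><number> form, assuming Germany (+49)
    whenever no international prefix is present. No validation is done."""
    digits = re.sub(r'\D', '', phone)
    if phone.strip().startswith('+'):
        return '+' + digits
    if digits.startswith('00'):   # international prefix written as 00
        return '+' + digits[2:]
    if digits.startswith('0'):    # national trunk prefix
        return '+49' + digits[1:]
    return '+49' + digits


fixed_url = fix_url('www.tuedelpott.de')
fixed_phone = to_e164_de('0173 - 292 21 90')
```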

## Export of OSM to JSON and Import to MongoDB

I am extracting the data I am interested in, fixing the problematic entries and exporting the result to JSON for the MongoDB import.

In [None]:
start = time.time()

if FORCE_IMPORT:
    osm.process(OSM_FILE,OSM_EXPORT_FILE);
    
time.time() - start
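
osm.process is not shown in this notebook. A minimal sketch of the kind of transformation it might perform is given below, assuming that `addr:*` tags are collected under `address`, `contact:*` tags under `contact`, and the position is stored as `[lat, lon]` (the order used in the queries later in this notebook). The function name `shape_element` and the sample node are hypothetical:

```python
import json
import xml.etree.ElementTree as ET


def shape_element(element):
    """Turn an OSM <node> or <way> element into a dict, or None otherwise."""
    if element.tag not in ('node', 'way'):
        return None
    doc = {'id': element.get('id'), 'type': element.tag}
    if element.get('lat') and element.get('lon'):
        # Assumed order: [lat, lon]
        doc['pos'] = [float(element.get('lat')), float(element.get('lon'))]
    for tag in element.iter('tag'):
        key, value = tag.get('k'), tag.get('v')
        if key.startswith('addr:'):
            doc.setdefault('address', {})[key[len('addr:'):]] = value
        elif key.startswith('contact:'):
            doc.setdefault('contact', {})[key[len('contact:'):]] = value
        else:
            doc[key] = value
    return doc


# Made-up sample node
sample = ET.fromstring(
    '<node id="1" lat="53.6459" lon="7.4305">'
    '<tag k="tourism" v="hotel"/>'
    '<tag k="addr:city" v="Dornum"/>'
    '<tag k="contact:phone" v="+49 173 2922190"/>'
    '</node>')
doc = shape_element(sample)
json_line = json.dumps(doc, sort_keys=True)
```

Writing one such JSON document per line yields a file that `mongoimport` can read without the `--jsonArray` flag.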

I am adding an index on the 'id' field to speed up the upserts.

In [None]:
nodes = mongo.MongoClient(MONGO_URL).osm.nodes
nodes.create_index([("id", mongo.ASCENDING)]);


In [None]:
"""
I am using the command line mongoimport
"""
start = time.time()

if FORCE_IMPORT:
    os.system('mongoimport --quiet --upsertFields id --db osm --collection nodes --file {0}'.format(OSM_EXPORT_FILE));
    
time.time() - start

#### Addition of some indexes

In [None]:
nodes.create_index([("type", mongo.ASCENDING)])
nodes.create_index([("address.city", mongo.ASCENDING)])
nodes.create_index([("pos", mongo.GEOSPHERE)]);

## Overview of the Data

#### Tags in OSM file

In [None]:
pprint(osm.count_tags(OSM_FILE))
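
`osm.count_tags` comes from the case-study code and is not shown here. A sketch of how such a count could be implemented with a streaming parser, so that the whole file never has to be held in memory, might look like this; the sample XML is made up:

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def count_tags(source):
    """Count how often each XML tag occurs in a file or file-like object."""
    counts = defaultdict(int)
    # iterparse yields an element once its closing tag has been read
    for _, element in ET.iterparse(source, events=('end',)):
        counts[element.tag] += 1
        element.clear()  # drop children that were already counted
    return dict(counts)


sample_xml = io.BytesIO(
    b'<osm>'
    b'<node id="1"><tag k="a" v="b"/></node>'
    b'<node id="2"/>'
    b'<way id="3"><nd ref="1"/></way>'
    b'</osm>')
tag_counts = count_tags(sample_xml)
```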

#### File sizes

In [None]:
# File size of the full osm file
info = os.stat(OSM_FILE)
"Filesize of {0} {1}".format(OSM_FILE, humanize.humansize(info.st_size))

In [None]:
# File size of the exported JSON file
info = os.stat(OSM_EXPORT_FILE)
"Filesize of {0} {1}".format(OSM_EXPORT_FILE, humanize.humansize(info.st_size))

#### Number of documents in the database

In [None]:
"{} Documents in MongoDB".format(nodes.find().count())

#### Example document in MongoDB

In [None]:
pprint(nodes.find_one({ "contact.phone": { '$exists': 1 }, "contact.fax": { '$exists': 1 }, 
            "contact.website": { '$exists': 1 }, "address.street": { '$exists': 1 }, 
            "address.city": { '$exists': 1 } }))

#### Document types

In [None]:
pd.DataFrame(list(nodes.aggregate([
        {'$group':{'_id':'$type','count':{'$sum':1}}}
    ])))

#### Documents with contact data

In [None]:
"{} Documents with contact in MongoDB".format(nodes.find({'contact':{'$exists':1}}).count())

#### Documents with address data

In [None]:
"{} Documents with address in MongoDB".format(nodes.find({'address':{'$exists':1}}).count())

#### Testing the fixes for problematic data

I am checking if my code for importing data fixed the problematic phone numbers and URLs.

In [None]:
assert nodes.find({'contact.phone':{'$in':invalid_phone}}).count() == 0

In [None]:
nodes.find({'contact.phone':{'$in':['+49 800 28350000', '+49 162 2119066', '+49 173 2922190']}}).count()

In [None]:
assert nodes.find({'contact.website':{'$in':invalid_url}}).count() == 0

In [None]:
nodes.find({'contact.website':{'$in':
['http://www.silvis-bungalow.de',
 'http://www.haus-thomas.de',
 'http://www.tuedelpott.de',
 'http://www.hotel-cafecaro.de',
 'http://www.hotel-westfalenhof.de']}}).count()

All problematic contact data was fixed correctly.

#### Analysis of lodging data

What types of tourism-related data are there?

In [None]:
pd.DataFrame(list(nodes.aggregate([
    {'$match': {'tourism': {'$exists': 1}}},
    {'$group': {'_id': '$tourism', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}}
])))

I am interested in the number of lodging facilities in the area:

In [None]:
pd.DataFrame(list(nodes.aggregate([
    {'$match': {'tourism': {'$in': ['bed_and_breakfast', 'motel',
                                    'apartment', 'hostel', 'guest_house', 'chalet', 'hotel']}}},
    {'$group': {'_id': '$tourism', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}}
])))

There seem to be quite a number of lodging facilities in the area. It would be interesting to compare these numbers with other areas to get an idea of whether they are really high.

I am interested in buying a property in the village of Dornum, so I want to know how many lodging facilities there are within 1,500 m of the village centre:

In [None]:

pd.DataFrame(list(nodes.aggregate(
    [
        {
            '$geoNear':
            {
                'near':
                {
                    'type': 'Point',
                            'coordinates': [53.645903, 7.430451]
                },
                'spherical': True,
                'query': {
                    'tourism': {'$in': ['bed_and_breakfast', 'motel', 'apartment', 'hostel', 'guest_house', 'chalet', 'hotel']}},
                'maxDistance': 1500,
                'distanceField':'dist',
            },
        }
    ])))
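
With `spherical: True` and a GeoJSON point, `$geoNear` measures great-circle distances, and `maxDistance` and the `dist` field are expressed in metres. The sketch below reproduces that kind of distance check in plain Python with the haversine formula. The Earth radius is an approximation, the helper names and the two comparison points are made up, and the helper takes its arguments explicitly as latitude and longitude. Note that GeoJSON itself orders coordinates as `[longitude, latitude]`; how `pos` was stored during the import is not shown here, so it is worth confirming that the query point uses the same order as the stored data.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def within(center, point, max_distance_m):
    """True if point lies within max_distance_m of center; both (lat, lon)."""
    distance = haversine_m(center[0], center[1], point[0], point[1])
    return distance <= max_distance_m


dornum = (53.645903, 7.430451)   # query point used above, as (lat, lon)
nearby = (53.650000, 7.430451)   # made-up point a few hundred metres north
far = (53.700000, 7.430451)      # made-up point several kilometres north

nearby_is_close = within(dornum, nearby, 1500)
far_is_close = within(dornum, far, 1500)
```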

Let's check whether there are also holiday apartments on Airbnb in Dornum. I am using data I scraped from Airbnb for that:

In [None]:

listings = mongo.MongoClient(MONGO_URL).airbnb.listings

pd.DataFrame(list(listings.aggregate([ 
                {
                    '$geoNear':
                    {
                        'near':
                        {
                            'type': 'Point',
                            'coordinates':[53.645903, 7.430451]
                        },
                        'spherical': True,
                        'query': {},
                        'maxDistance' : 1500,
                        'distanceField':'dist',
                    },
                }
    ])))[['name', 'public_address', 'dist', 'loc']]

How many nights of these holiday apartments are booked over the coming months, and what is the average price per night?

In [None]:

dates = mongo.MongoClient(MONGO_URL).airbnb.dates

pd.DataFrame(list(dates.aggregate([
    {
        '$geoNear':
        {
            'near':
            {
                'type': 'Point',
                'coordinates': [53.645903, 7.430451]
            },
            'spherical': True,
            'query': {},
            'maxDistance': 1500,
            'distanceField': 'dist',
            'limit': 1000000
        },
    },
    {
        '$group':
        {
            '_id': 'availability',
            "total": {'$sum': 1},
            "avg_price_per_night": {'$avg': '$price.native_price'},
            "available":
            {
                "$sum": {"$cond": ["$available", 1, 0]}
            },
            "not_available":
            {
                "$sum": {"$cond": ["$available", 0, 1]}
            }
        }
    },
    {
        '$project':
         {
             '_id': '$_id',
             'avg_price_per_night': '$avg_price_per_night',
             'available': '$available',
             'not_available': '$not_available',
             'booking_ratio': {'$divide': ['$not_available', '$total']},
             'est_gross_income': {'$multiply': [365, '$avg_price_per_night', {'$divide': ['$not_available', '$total']}]},
         }
    }
])))
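
The pipeline above computes the booking ratio as `not_available / total` and the estimated gross income as `365 * avg_price_per_night * booking_ratio`. The same arithmetic in plain Python, on a handful of made-up calendar entries (the field names `available` and `price` and all values are hypothetical), looks like this:

```python
def estimate(days):
    """days: list of dicts with 'available' (bool) and 'price' (number)."""
    total = len(days)
    not_available = sum(1 for day in days if not day['available'])
    avg_price = sum(day['price'] for day in days) / float(total)
    booking_ratio = not_available / float(total)
    return {
        'avg_price_per_night': avg_price,
        'booking_ratio': booking_ratio,
        'est_gross_income': 365 * avg_price * booking_ratio,
    }


# Hypothetical data: four calendar days, two of them not available
sample_days = [
    {'available': False, 'price': 60},
    {'available': False, 'price': 80},
    {'available': True, 'price': 70},
    {'available': True, 'price': 70},
]
result = estimate(sample_days)
```

The estimate treats every unavailable night as a paid booking. Nights blocked by the host would inflate the ratio, so the figure is best read as an upper bound.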


### Conclusion

My investment in the apartment seems quite promising: its price is about €50,000, and my estimate of the gross income from renting it out as a holiday flat, based on the average Dornum booking ratio and average price per night, is about €19,000 per year.

It is also quite interesting that none of the holiday apartments from Airbnb is on OpenStreetMap. This might be a marketing opportunity, as I could place a node there and link it to my apartment on Airbnb.

## Other ideas about the datasets

### Additional problems

There might be other problems in the datasets that should be part of a further analysis:

  - Cross-field consistency of postcode, street and city
  - Do all addresses with a street have a house number?
  - Handling of P.O. boxes
  - Is the email address still valid?
  - Is the website still available?
  - Is the street name correct?
  - Is the city name correct?
  - Phone number: if the country is unknown, how can the number be converted to international format?
  - Phone number: is the area code consistent with the city?
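
As an illustration of the first point, a cross-field check could start by listing postcodes that appear with more than one city name. The sketch below assumes documents with an `address` sub-document containing `postcode` and `city`; the function name and the sample documents are made up:

```python
from collections import defaultdict


def postcodes_with_multiple_cities(documents):
    """Map each postcode to its city names if more than one city occurs."""
    cities_by_postcode = defaultdict(set)
    for doc in documents:
        address = doc.get('address', {})
        postcode, city = address.get('postcode'), address.get('city')
        if postcode and city:
            cities_by_postcode[postcode].add(city)
    return dict(
        (postcode, sorted(cities))
        for postcode, cities in cities_by_postcode.items()
        if len(cities) > 1)


# Hypothetical documents with placeholder postcodes
sample_docs = [
    {'address': {'postcode': '00001', 'city': 'Dornum'}},
    {'address': {'postcode': '00001', 'city': 'Dornumersiel'}},
    {'address': {'postcode': '00002', 'city': 'Norden'}},
    {'address': {'street': 'Hauptstrasse'}},
]
suspicious = postcodes_with_multiple_cities(sample_docs)
```

A postcode can legitimately cover several localities, so the result is a list of candidates for manual review rather than a list of errors.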

### Additional ideas

It would be interesting to compare the data for the area with data from Google Maps. Google Maps offers a public REST API, so this should be possible.