<a id='Top Doc'></a>

# P3: Wrangling OpenStreetMap Data

## Udacity Data Analyst NanoDegree
___
#### Los Angeles, California, USA

### Contents
___

1. [Map Data](#Map-Data)
2. [Data Audit](#Data-Audit)
  1. [Data Structure](#Data-Structure)
  2. [Tag Attributes and Values](#Tag-Attributes-and-Values)
    1. [Node Tags Audit](#Node-tags)
    2. [Way Tags Audit](#Way-tags)
3. [Data Cleaning](#Data-Cleaning)
4. [References](#References)

### Map Data
___

##### Los Angeles, California, USA
[Mapzen](https://mapzen.com/data/metro-extracts/metro/los-angeles_california): `https://mapzen.com/data/metro-extracts/metro/los-angeles_california`

I downloaded the initial `los-angeles_california.osm` dataset from Mapzen (not included in the repository). The original dataset is 8.6 GB, and running some of the course code against it locally crashed my computer. I therefore made a series of smaller sample files (based on the `sample.py` script given in the Project Details), which are described in this table:

##### Sample files used for Project
Name | K-val | Size | Lines | Note
---|---:|---:|---:|---|
*la-sample.osm* | 1,000 | 8.6 MB | 110,729 | This file was used to test early python data auditing and cleaning scripts.
*la-med.osm* | 500 | 17.4 MB | 222,181 | Used as an intermediate test of the data auditing and cleaning scripts. Not included in this repository.
*la-final.osm* | 150 | 58.0 MB | 741,172 | Final file used for auditing, cleaning, and importing into the MongoDB database. Not included in this repository, but can be added for resubmission if necessary.
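
The sampling step can be sketched roughly as follows. This is only an illustration, assuming that the K-val column means "keep every k-th top-level element" (which is consistent with the sizes in the table). The names `iter_top_level` and `write_sample` are hypothetical and are not taken from `sample.py`.

```python
import xml.etree.ElementTree as ET

TOP_LEVEL_TAGS = ("node", "way", "relation")


def iter_top_level(source, tags=TOP_LEVEL_TAGS):
    """Yield each finished top-level element, freeing memory as we go."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # discard elements that were already handled


def write_sample(source, out, k):
    """Copy every k-th top-level element from source into the binary file out."""
    out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
    kept = 0
    for i, elem in enumerate(iter_top_level(source)):
        if i % k == 0:  # assumption: K-val is the sampling interval
            out.write(ET.tostring(elem, encoding="utf-8"))
            kept += 1
    out.write(b"</osm>\n")
    return kept
```

Streaming with `iterparse` and clearing the root is what keeps memory use flat, which matters for a multi-gigabyte extract. `source` can be a file path or an open binary file.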

[Back To Contents](#Contents)

### Data Audit
___
#### Data Structure

After reviewing the [OSM XML Content][1] page, I wanted to check that the sample dataset actually had the tags and structure described there: namely, that the XML is organized as blocks of ***nodes*** with tags for each node, ***ways*** with tags and references to their respective nodes, and ***relations*** with tags and references as well.

Using and modifying the `mapparser.py` code from the MongoDB Case Study for OSM Data, I generated a dictionary with element tag names, counts, and attributes. A simple table with the results can be seen here:

[1]: #References

Name | Count | Attributes
---|---:|---
member   | 78     | ref, role, type
nd       | 41,121 | ref
node     | 37,437 | changeset, id, lat, lon, timestamp, uid, user, version
relation | 32     | changeset, id, timestamp, uid, user, version
tag      | 24,813 | k, v
way      | 3,655  | changeset, id, timestamp, uid, user, version
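
A table like the one above can be produced with a short pass over the file. The following is a minimal sketch, not the modified `mapparser.py` itself; `summarize_elements` is a hypothetical name. It never clears elements, so it is only suitable for the sample files.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_elements(source):
    """Map each element name to its count and the set of attribute names seen."""
    summary = defaultdict(lambda: {"count": 0, "attributes": set()})
    # "start" events are enough here: attributes are known when a tag opens.
    for _, elem in ET.iterparse(source, events=("start",)):
        entry = summary[elem.tag]
        entry["count"] += 1
        entry["attributes"].update(elem.attrib.keys())
    return dict(summary)
```

Note that this sketch also reports the root `osm` element, which the table above leaves out.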

I further modified `mapparser.py` to see the structure of the data. I named the result `data_structure.py`, and its output is here:

In [1]:
run data_structure.py

{'member': {'count': 78},
 'nd': {'count': 41121},
 'node': {'count': 37437,
          'tag': {'attributes': {'k': 968, 'v': 968}, 'count': 968}},
 'osm': {'count': 1,
         'node': {'attributes': {'changeset': 37437,
                                 'id': 37437,
                                 'lat': 37437,
                                 'lon': 37437,
                                 'timestamp': 37437,
                                 'uid': 37437,
                                 'user': 37437,
                                 'version': 37437},
                  'count': 37437},
         'relation': {'attributes': {'changeset': 32,
                                     'id': 32,
                                     'timestamp': 32,
                                     'uid': 32,
                                     'user': 32,
                                     'version': 32},
                      'count': 32},
         'way': {'attributes': {'changeset': 3655,
            

Right away, you can see that ***nodes***, ***ways***, and ***relations*** do in fact contain the child elements (***tag***, ***nd***, and ***member***) described by the OSM XML Content page. I also printed out the attributes, and nothing seems out of place. Further, the count for each attribute matches the count of its element, so I don't have to worry about fixing any of those.
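
The "attribute counts match element counts" check can also be automated instead of read off the printout. This is a sketch under that assumption; `incomplete_attributes` is a hypothetical helper and not part of `data_structure.py`.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def incomplete_attributes(source):
    """Return {element: {attribute: n}} for attributes missing on some elements."""
    element_counts = Counter()
    attribute_counts = defaultdict(Counter)
    for _, elem in ET.iterparse(source, events=("start",)):
        element_counts[elem.tag] += 1
        attribute_counts[elem.tag].update(elem.attrib.keys())

    gaps = {}
    for tag, total in element_counts.items():
        # An attribute is "incomplete" if it appears on fewer elements than exist.
        missing = {a: n for a, n in attribute_counts[tag].items() if n != total}
        if missing:
            gaps[tag] = missing
    return gaps
```

An empty result means every attribute appears on every element of its type, which is what the output above suggests for this sample.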

Interestingly, the OSM XML page also mentions some uncertainties that could merit further investigation if the data auditing and cleaning functions were ever to be used as a service. Of note: id or user names are not necessarily present, nodes can be untagged and unconnected, and element IDs can be negative, among others. It would be important to implement a solution that checks for and corrects these problems.
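
A first pass at checking for those uncertainties might look like the sketch below. It is an illustration only: `audit_uncertainties` and the keys of the returned dictionary are hypothetical, and "unconnected" is taken to mean "not referenced by any way or relation member", which is my assumption.

```python
import xml.etree.ElementTree as ET


def audit_uncertainties(source):
    """Count elements affected by a few of the uncertainties listed above."""
    issues = {"missing_user": 0, "negative_id": 0}
    untagged_nodes, referenced = set(), set()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in ("node", "way", "relation"):
            continue
        if "user" not in elem.attrib or "uid" not in elem.attrib:
            issues["missing_user"] += 1
        if elem.get("id", "").startswith("-"):
            issues["negative_id"] += 1
        if elem.tag == "node":
            if elem.find("tag") is None:
                untagged_nodes.add(elem.get("id"))
        elif elem.tag == "way":
            referenced.update(nd.get("ref") for nd in elem.iter("nd"))
        else:  # relation: only node members count as node references
            referenced.update(
                m.get("ref") for m in elem.iter("member") if m.get("type") == "node"
            )
        elem.clear()  # children are no longer needed
    issues["untagged_unconnected_nodes"] = len(untagged_nodes - referenced)
    return issues
```

Because it keeps node ids in memory, this sketch is meant for the sample files rather than the full extract.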

#### Tag Attributes and Values

The attributes for the tags member, nd, node, relation, and way seem to be fairly straightforward and easy to organize, so my Data Cleaning plan will organize those in a simple manner. However, the tag attributes may be a little more difficult to work with. Since the attribute ***k*** represents ***key***, which is assigned by the human user, there can be any number of different values for the k attribute.

Again, I modified an example of the course code to make audit.py, which looks at the k values for tags of the given tag type. I ran it for both ***node*** and ***way*** tags, and found some interesting results. The lists were very long, so I will go over the issues that stand out below. To evaluate specific tag values, I'm using the function `tag_search(filename, tag_name, regex)` in the audit.py file, with an example below. The `regex` value was changed for each tag attribute investigated.
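
The body of `tag_search` is not reproduced in this report. A function with the same three parameters could look roughly like the sketch below. It is named `tag_search_sketch` to make clear that it is a guess at the behaviour (return the `k`/`v` pairs whose key matches the regular expression), not the code in `audit.py`.

```python
import re
import xml.etree.ElementTree as ET


def tag_search_sketch(filename, tag_name, regex):
    """Return {'k', 'v'} dicts for child tags of tag_name elements whose key matches regex."""
    pattern = re.compile(regex)
    matches = []
    for _, elem in ET.iterparse(filename, events=("end",)):
        if elem.tag != tag_name:
            continue
        for tag in elem.iter("tag"):
            if pattern.search(tag.get("k", "")):
                matches.append({"k": tag.get("k"), "v": tag.get("v")})
        elem.clear()  # free the element once its tags were inspected
    return matches
```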

___
##### Node tags
**addr:street** and **addr:street_direction_prefix** - these could be redundant, but I will have to review a few samples to see if this is worth correcting. Running the following code
```python
from audit import tag_search
import pprint

results = tag_search("data/la-small.osm", "node", r'addr:street')
pprint.pprint(results)
```
results in a single tag having the **addr:street_direction_prefix** key: `{'k':'addr:street_direction_prefix', 'v': 'W'}`. This means I may be able to combine this value with its accompanying **addr:street** tag, if both are tags of the same node. Further, we can use the same code and change the regular expression to inspect each of the other keys discussed below.
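
Folding the direction prefix into the street name could be done per node along these lines. This is a sketch, assuming the tags of one node have already been collected into a plain dictionary; `merge_direction_prefix` is a hypothetical helper.

```python
def merge_direction_prefix(tags):
    """Fold addr:street_direction_prefix into addr:street for one node's tags."""
    prefix = tags.get("addr:street_direction_prefix")
    street = tags.get("addr:street")
    if not prefix or not street:
        return dict(tags)  # nothing to merge, return an unchanged copy
    merged = {k: v for k, v in tags.items() if k != "addr:street_direction_prefix"}
    if not street.startswith(prefix + " "):  # avoid "W W ..." duplicates
        merged["addr:street"] = f"{prefix} {street}"
    return merged


# Hypothetical street name, for illustration only.
example = {"addr:street": "Sunset Boulevard", "addr:street_direction_prefix": "W"}
print(merge_direction_prefix(example))
```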

**Color** and **Colour** - Simply, the **color** and **colour** keys are going to be the same thing. Since this is LA, I'm going to change all **colour** keys to **color** when cleaning.

**Fixme** - There was one of these tags, and I may just exclude it in cleaning.

In [36]:
from audit import tag_search
import pprint

results = tag_search("data/la-small.osm", "node", r'fixme')
pprint.pprint(results)

[{'k': 'fixme', 'v': 'Transfer_info'}]


**Is_in** - This field returns redundant information, such as the state and country in which the nodes are located. Since I am evaluating Los Angeles, California, all these nodes should be in California, and most definitely in the US. I am going to exclude these tags from the database.

**GNIS** - Finally, it appears that someone has (programmatically or not) included data from the United States Geological Survey's Geographic Names Information System ([USGS GNIS][1]) as tags for certain nodes. After a little research, it appears that US GNIS data was bulk imported into OSM in 2009. According to the OSM wiki entry, the GNIS is a database of "names" and not "features", and many of these entries are incorrect or no longer exist. This poses a fantastic challenge to OSM and would be a great opportunity for programmatically cleaning the OSM database, which I will discuss in review below. Using `tag_search`, it doesn't look like there is much of value in these tags, at least for this project, so I am going to exclude them as well.

Other than these, every other tag is fairly straightforward.
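
Taken together, the node tag decisions above (rename **colour**, drop **fixme**, **is_in**, and **gnis** keys) could be expressed as one small cleaning step. This is a sketch of the plan, not the final cleaning code; `clean_node_tags` and the exact regular expression are assumptions.

```python
import re

# Keys (and their sub-keys, e.g. "is_in:state") that the audit chose to drop.
EXCLUDED = re.compile(r"^(fixme|is_in|gnis)(:|$)", re.IGNORECASE)


def clean_node_tags(tags):
    """Drop excluded keys and normalise 'colour' to 'color' in the remaining keys."""
    cleaned = {}
    for key, value in tags.items():
        if EXCLUDED.match(key):
            continue
        cleaned[key.replace("colour", "color")] = value
    return cleaned
```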
___
##### Way tags

**FIXME**, **FMMP**, **NHD**, **NHS**, **gnis** - These will be ignored.

**tiger:** - Oddly, I found a series of tags with keys that had **tiger:** in them.

In [35]:
results = tag_search("data/la-small.osm", "way", r'tiger:')
pprint.pprint(results[:6])

[{'k': 'tiger:cfcc', 'v': 'A74'},
 {'k': 'tiger:tlid', 'v': '195710849:195710852'},
 {'k': 'tiger:county', 'v': 'San Diego, CA'},
 {'k': 'tiger:source', 'v': 'tiger_import_dch_v0.6_20070809'},
 {'k': 'tiger:reviewed', 'v': 'no'},
 {'k': 'tiger:upload_uuid',
  'v': 'bulk_upload.pl-5dac241b-d144-4c9c-9e26-b4dec4590a61'}]


These look like the output of some sort of automated utility for uploading data to OSM. Further investigation confirms this. According to the [OSM wiki page][1]: "The Topologically Integrated Geographic Encoding and Referencing system (TIGER) data, [which is] produced by the US Census Bureau, is a public domain data source which has many geographic features. The TIGER/Line files are extracts of selected geographic information, including roads, boundaries, and hydrography features. All of the roads were imported into OSM in 2007 and 2008, populating the nearly empty map of the United States." I am going to exclude these tags as well. The wiki explains that much of the US mapping is now done by the OSM mapping community, since the mass uploads of TIGER data ended with that import, so these tags should be unimportant to the analysis for this project.

Other than these, the other tags should clean well, and any issues in the tags' key values will have to be the subject of a secondary cleaning.
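
To see how widespread bulk-import keys such as **tiger:** or **gnis:** are before excluding them, the keys can be grouped by the part before the first colon. This is a sketch; `count_key_namespaces` is a hypothetical helper and not part of `audit.py`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_key_namespaces(source, tag_name):
    """Count tag keys of tag_name elements by the part before the first colon."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag_name:
            for tag in elem.iter("tag"):
                counts[tag.get("k", "").split(":", 1)[0]] += 1
            elem.clear()  # free the element once it was counted
    return counts
```

Calling `count_key_namespaces(path, "way").most_common(10)` would then list the most frequent key groups on ways.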

[1]: #References

### Data Cleaning
___

I modified the `data.py` code from the final quiz in the MongoDB for OSM Case Study as a base script for shaping and converting the OSM data into JSON.
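
The shaping step could look roughly like the sketch below. The document layout used here (a `created` sub-document, a `pos` list, a `tags` sub-document, and `node_refs` for ways) is an assumption for illustration and is not necessarily the layout produced by the modified `data.py`.

```python
import json
import xml.etree.ElementTree as ET

CREATED = ("version", "changeset", "timestamp", "user", "uid")


def shape_element(elem):
    """Turn a <node> or <way> element into a JSON-ready dict, else None."""
    if elem.tag not in ("node", "way"):
        return None
    doc = {"type": elem.tag, "created": {}, "tags": {}}
    for name, value in elem.attrib.items():
        if name in CREATED:
            doc["created"][name] = value
        elif name not in ("lat", "lon"):
            doc[name] = value
    if "lat" in elem.attrib and "lon" in elem.attrib:
        doc["pos"] = [float(elem.get("lat")), float(elem.get("lon"))]
    for tag in elem.iter("tag"):
        # Kept in a sub-document so a key such as "type" cannot clash.
        doc["tags"][tag.get("k")] = tag.get("v")
    refs = [nd.get("ref") for nd in elem.iter("nd")]
    if refs:
        doc["node_refs"] = refs
    return doc


# Hypothetical element, for illustration only.
node = ET.fromstring('<node id="1" lat="34.05" lon="-118.25" user="a" uid="7"/>')
print(json.dumps(shape_element(node)))
```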


[Back To Contents](#Contents)

### References
___

1. [OSM: XML Contents wiki](https://wiki.openstreetmap.org/wiki/OSM_XML#Contents): `https://wiki.openstreetmap.org/wiki/OSM_XML#Contents`
2. [OSM: USGS GNIS wiki](http://wiki.openstreetmap.org/wiki/USGS_GNIS): `http://wiki.openstreetmap.org/wiki/USGS_GNIS`
3. [OSM: TIGER wiki](http://wiki.openstreetmap.org/wiki/TIGER): `http://wiki.openstreetmap.org/wiki/TIGER`
___
[Back To Contents](#Contents)