<a id='Top Doc'></a>

# P3: Wrangling OpenStreetMap Data

## Udacity Data Analyst NanoDegree
___

### Contents
___

1. [Map Data](#Map-Data)

2. [Data Audit](#Data-Audit)
 
 a. [Data Structure](#Data-Structure)
 
 b. [Tag Attributes and Values](#Tag-Attributes-and-Values)

3. [Data Cleaning](#Data-Cleaning)

4. [References](#References)

### Map Data
___

##### Los Angeles, California, USA
[Mapzen](https://mapzen.com/data/metro-extracts/metro/los-angeles_california): `https://mapzen.com/data/metro-extracts/metro/los-angeles_california`

I downloaded the initial `los-angeles_california.osm` dataset from Mapzen (not included in the repository). The original dataset is 8.6 GB, and running some of the course code on it locally crashed my computer. I therefore made a series of smaller sample files (based on the `sample.py` script given in the Project Details), which are described in this table:

##### Sample files used for Project
Name | K-val | Size | Lines | Note
---|---:|---:|---:|---|
*la-sample.osm* | 1,000 | 8.6 MB | 110,729 | Used to test early Python data auditing and cleaning scripts.
*la-med.osm* | 500 | 17.4 MB | 222,181 | Used as an intermediate test of the data auditing and cleaning scripts. Not included in this repository.
*la-final.osm* | 150 | 58.0 MB | 741,172 | Final file used for auditing, cleaning, and importing into the MongoDB database. Not included in this repository, but can be added for resubmission if necessary.
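
The sampling step can be sketched roughly as follows. This is a minimal illustration, not the course's `sample.py` itself. It assumes that the K-val column is a step size, so that every k-th top-level element (`node`, `way`, or `relation`) is kept. The names `iter_top_level` and `write_sample` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Top-level OSM element types that are candidates for sampling.
TOP_LEVEL = ("node", "way", "relation")


def iter_top_level(source, tags=TOP_LEVEL):
    """Yield each top-level element once it has been fully parsed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in tags:
            yield elem
            root.clear()  # drop already-processed elements to limit memory use


def write_sample(source, out, k):
    """Write every k-th top-level element of `source` to binary file `out`."""
    out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n<osm>\n')
    for i, elem in enumerate(iter_top_level(source)):
        if i % k == 0:
            out.write(ET.tostring(elem, encoding="utf-8"))
    out.write(b"</osm>\n")
```

Because the file is streamed with `iterparse` and the root is cleared as it goes, this approach should avoid loading the full extract into memory at once. Usage would look like `write_sample("los-angeles_california.osm", open("la-sample.osm", "wb"), 1000)`.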

[Back To Contents](#Contents)

### Data Audit
___
#### Data Structure

After reviewing the [OSM XML Content][1] page, I wanted to check that the sample dataset actually had the tags and structure it describes: namely, that the XML is organized as blocks of ***nodes*** with tags for each node, ***ways*** with tags and references to their respective nodes, and ***relations*** with tags and references as well.

By modifying the `mapparser.py` code from the MongoDB Case Study for OSM Data, I generated a dictionary with element tag names, counts, and attributes. A simple table with the results can be seen here:

[1]: #References

Name | Count | Attributes
---|---:|---
member   | 78     | ref, role, type
nd       | 41,121 | ref
node     | 37,437 | changeset, id, lat, lon, timestamp, uid, user, version
relation | 32     | changeset, id, timestamp, uid, user, version
tag      | 24,813 | k, v
way      | 3,655  | changeset, id, timestamp, uid, user, version
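
A summary like the table above could be produced along these lines. This is a minimal sketch rather than the modified `mapparser.py`. The function name `summarize_elements` is hypothetical, and the sketch also counts the root `osm` element, which the table omits.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def summarize_elements(source):
    """Return (counts, attributes) keyed by element tag name."""
    counts = defaultdict(int)
    attributes = defaultdict(set)
    for _, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        attributes[elem.tag].update(elem.attrib)  # collect attribute names
        elem.clear()  # the element has been recorded, so release its contents
    return dict(counts), {tag: sorted(names) for tag, names in attributes.items()}
```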

I further modified `mapparser.py` to show the nesting structure of the data and named the new script `data_structure.py`. The results are here:

In [1]:
run data_structure.py

{'member': {'count': 78},
 'nd': {'count': 41121},
 'node': {'count': 37437,
          'tag': {'attributes': {'k': 968, 'v': 968}, 'count': 968}},
 'osm': {'count': 1,
         'node': {'attributes': {'changeset': 37437,
                                 'id': 37437,
                                 'lat': 37437,
                                 'lon': 37437,
                                 'timestamp': 37437,
                                 'uid': 37437,
                                 'user': 37437,
                                 'version': 37437},
                  'count': 37437},
         'relation': {'attributes': {'changeset': 32,
                                     'id': 32,
                                     'timestamp': 32,
                                     'uid': 32,
                                     'user': 32,
                                     'version': 32},
                      'count': 32},
         'way': {'attributes': {'changeset': 3655,
            

Right away, you can see that ***nodes***, ***ways***, and ***relations*** do in fact have the nested ***tag*** and ***nd*** tags as described by the OSM XML Content page. I also printed out the attributes, and nothing seems out of place. Further, the count for each attribute matches the count of the tag it belongs to, so I don't have to worry about fixing any missing attributes.

Interestingly, the OSM XML Content page mentions some uncertainties that could merit further investigation if the data auditing and cleaning functions were ever used as a service. Of note: user IDs or usernames are not necessarily present, nodes can be untagged and unconnected, and element IDs can be negative, among others. It would be important to implement checks for these problems and correct them.
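
As a rough illustration of what such checks might look like, the sketch below flags three of the cases listed above. The function name `find_suspect_elements` and the shape of its return value are hypothetical. Note that the untagged-and-unconnected check is only meaningful on a complete extract: in a sampled file, many nodes will look unconnected simply because the ways that reference them were not sampled.

```python
import xml.etree.ElementTree as ET

TOP_LEVEL = ("node", "way", "relation")


def find_suspect_elements(source):
    """Collect elements matching some uncertainties noted on the OSM XML page."""
    issues = {"missing_user": [], "negative_id": []}
    untagged_nodes = set()
    referenced_nodes = set()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag not in TOP_LEVEL:
            continue
        elem_id = elem.get("id")
        # User ID or username absent
        if elem.get("uid") is None or elem.get("user") is None:
            issues["missing_user"].append((elem.tag, elem_id))
        # Negative element ID
        if elem_id is not None and elem_id.startswith("-"):
            issues["negative_id"].append((elem.tag, elem_id))
        # Track untagged nodes and every node referenced by a way or relation
        if elem.tag == "node" and elem.find("tag") is None:
            untagged_nodes.add(elem_id)
        for nd in elem.findall("nd"):
            referenced_nodes.add(nd.get("ref"))
        for member in elem.findall("member"):
            if member.get("type") == "node":
                referenced_nodes.add(member.get("ref"))
        elem.clear()
    issues["untagged_unconnected"] = sorted(untagged_nodes - referenced_nodes)
    return issues
```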

#### Tag Attributes and Values

The attributes for the member, nd, node, relation, and way tags seem fairly straightforward and easy to organize, so my Data Cleaning plan will handle those in a simple manner. However, the attributes of the ***tag*** tags may be more difficult to work with. Since the attribute ***k*** represents the ***key***, which is assigned by the human user, there can be any number of different values for the k attribute.

Again, I modified an example from the course code to make `audit.py`, which looks at the k values for tags of the given tag type. I ran it for both ***node*** and ***way*** tags and found some interesting results. The lists were very long, so I will go over only the issues that stand out below.
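
One way to get an overview of the k values before reading the full lists is to count them per parent tag type. The sketch below is a hypothetical helper, not the contents of `audit.py`. The name `count_tag_keys` and its optional regular-expression filter are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def count_tag_keys(source, parent_tag, pattern=None):
    """Count the k values of <tag> children under each `parent_tag` element."""
    regex = re.compile(pattern) if pattern else None
    keys = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != parent_tag:
            continue
        for tag in elem.findall("tag"):
            key = tag.get("k")
            # Keep the key if there is no filter or the filter matches it
            if key is not None and (regex is None or regex.search(key)):
                keys[key] += 1
        elem.clear()
    return keys
```

Calling `count_tag_keys(path, "node", r"^addr:").most_common()` would then list the address-related keys on nodes from most to least frequent.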

##### Node tags
**addr:street** and **addr:street_direction_prefix** - these could be redundant, but I will have to review a few samples to see if this is worth correcting.

In [2]:
from audit import tag_search

In [4]:
results = tag_search("data/la-small.osm", "node", r'addr:street+')
#tag_k_search("data/la-small.osm", "tag", "tiger")

[{'k': 'tiger:cfcc', 'v': 'A74'},
 {'k': 'tiger:tlid', 'v': '195710849:195710852'},
 {'k': 'tiger:county', 'v': 'San Diego, CA'},
 {'k': 'tiger:source', 'v': 'tiger_import_dch_v0.6_20070809'},
 {'k': 'tiger:reviewed', 'v': 'no'},
 {'k': 'tiger:upload_uuid',
  'v': 'bulk_upload.pl-5dac241b-d144-4c9c-9e26-b4dec4590a61'},
 {'k': 'tiger:cfcc', 'v': 'A41'},
 {'k': 'tiger:county', 'v': 'San Diego, CA'},
 {'k': 'tiger:reviewed', 'v': 'no'},
 {'k': 'tiger:zip_left', 'v': '92061'},
 {'k': 'tiger:name_base', 'v': 'Sukat'},
 {'k': 'tiger:name_type', 'v': 'Trl'},
 {'k': 'tiger:zip_right', 'v': '92061'},
 {'k': 'tiger:cfcc', 'v': 'A41'},
 {'k': 'tiger:tlid', 'v': '194813633'},
 {'k': 'tiger:county', 'v': 'Riverside, CA'},
 {'k': 'tiger:source', 'v': 'tiger_import_dch_v0.6_20070809'},
 {'k': 'tiger:reviewed', 'v': 'no'},
 {'k': 'tiger:separated', 'v': 'no'},
 {'k': 'tiger:upload_uuid',
  'v': 'bulk_upload.pl-1be79c47-45e8-4ca4-8995-bc018e72ba7a'},
 {'k': 'tiger:cfcc', 'v': 'A41'},
 {'k': 'tiger:tlid

### Data Cleaning
___


[Back To Contents](#Contents)

### References
___

1. [OSM XML Contents](https://wiki.openstreetmap.org/wiki/OSM_XML#Contents): `https://wiki.openstreetmap.org/wiki/OSM_XML#Contents`

___
[Back To Contents](#Contents)