#Notes on Auditing Process
Street names in Atlanta are traditionally divided by *ordinal directions*, so the majority of street names end with one of these directions (Southwest, Southeast, etc.). Although the street names that end with an ordinal direction appear to have been cleaned well, I think it would be appropriate to move the direction into a separate key/value pair. The next step was to print out the various key values to see whether the direction could be placed in an existing tag. After looking at the tags, there is no key attribute for the ordinal direction. Thus we will add the key attribute **streetSuffix** to the dataset to describe the ordinal direction of the street. Some street names also contain abbreviations, and some contain numbers. The cleaning plan is as follows:
- Fix street names that end with a direction initial and expand them to the full name
- Move each street name's direction suffix to a new key/value pair called *streetSuffix*
- Fix all abbreviated street names to their full name
Only a small number of addresses end with numbers, so we omit them in this project. Most of them belong to a state highway.
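The cleaning plan above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function name `clean_street` and the lookup tables `DIRECTIONS` and `STREET_TYPES` are hypothetical, and the abbreviation list would need to be extended from the actual audit results.

```python
# Hypothetical lookup tables; a real audit would extend them from the data.
DIRECTIONS = {
    "N": "North", "S": "South", "E": "East", "W": "West",
    "NE": "Northeast", "NW": "Northwest",
    "SE": "Southeast", "SW": "Southwest",
}
FULL_DIRECTIONS = set(DIRECTIONS.values())
STREET_TYPES = {
    "St": "Street", "Ave": "Avenue", "Rd": "Road",
    "Dr": "Drive", "Blvd": "Boulevard",
}


def clean_street(name):
    """Return (street, streetSuffix); streetSuffix is None if no direction is found."""
    words = name.strip().split()
    suffix = None
    if len(words) > 1:
        last = words[-1].rstrip(".")
        # Expand a trailing direction initial, e.g. "NE" -> "Northeast".
        expanded = DIRECTIONS.get(last.upper(), last.title())
        if expanded in FULL_DIRECTIONS:
            suffix = expanded
            words = words[:-1]
    # Expand abbreviated street types, e.g. "St." -> "Street".
    words = [STREET_TYPES.get(w.rstrip("."), w) for w in words]
    return " ".join(words), suffix
```

Under these assumptions, `clean_street("Peachtree St NE")` returns `("Peachtree Street", "Northeast")`, and the second element would be stored under *streetSuffix*.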
##Auditing Process on Postal Codes
Some of the postal codes have the full 9-digit code and some do not.  Also, some have the state code as a prefix.  Thus our cleaning process will involve:
- Removing the state code from the postal codes
- Moving the last 4 digits of the postal code into a new key/value pair called *postExt*
An initial audit of the postal codes showed that many of them are not in Atlanta.  The gold-standard list of postal codes was generated by screen scraping **zipcodestogo.com**. Since a large portion of the zip codes do not belong to the City of Atlanta, we generated a separate key/value pair called *zipInAtlanta* so that we can analyze how many entries do not belong to the City of Atlanta.
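A possible shape for the postal code cleaning is sketched below. The names `clean_postcode`, `POSTCODE_RE` and `ATLANTA_ZIPS` are hypothetical, the three ZIP codes are only a stand-in for the scraped list, and the `'F'` value for entries outside Atlanta is an assumption (the queries in this notebook only show `'T'`).

```python
import re

# Hypothetical stand-in for the list scraped from zipcodestogo.com.
ATLANTA_ZIPS = {"30303", "30308", "30309"}

# Optional two-letter state prefix, five digits, optional four-digit extension.
POSTCODE_RE = re.compile(r"^(?:[A-Za-z]{2}[\s-]*)?(\d{5})(?:-?(\d{4}))?$")


def clean_postcode(raw, atlanta_zips=ATLANTA_ZIPS):
    """Split a raw postcode into postcode, postExt and zipInAtlanta fields."""
    match = POSTCODE_RE.match(raw.strip())
    if match is None:
        return None  # leave unparseable values for manual review
    postcode, ext = match.groups()
    fields = {
        "postcode": postcode,
        # 'F' is an assumed marker for "not in Atlanta".
        "zipInAtlanta": "T" if postcode in atlanta_zips else "F",
    }
    if ext:
        fields["postExt"] = ext
    return fields
```

With this sketch, `clean_postcode("GA 30303-1234")` yields a postcode of `30303`, a *postExt* of `1234`, and a *zipInAtlanta* of `'T'`.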
##Auditing Process on County field
Looking at the county field, I see that none of the counties lie within the City of Atlanta. Also, some of the county values have the state code appended. Thus, as part of the cleaning process, we remove the state code from the end of the county field.   After auditing a sample set, it seems that a good portion of the counties are not in the City of Atlanta.  Thus an analysis should be done on the database to determine whether this dataset covers the Atlanta metropolitan area.
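Removing the state code from a county value could look something like this. It is a sketch that assumes the code appears as a trailing two-letter suffix such as `", GA"`; the names `clean_county` and `STATE_SUFFIX_RE` are hypothetical, and values that list several counties in one field are not handled.

```python
import re

# Trailing state code such as ", GA" or " GA" (assumed format).
STATE_SUFFIX_RE = re.compile(r",?\s+[A-Z]{2}$")


def clean_county(raw):
    """Drop a trailing two-letter state code from a county value."""
    return STATE_SUFFIX_RE.sub("", raw.strip())
```

For example, `clean_county("Fulton, GA")` returns `"Fulton"`, while a value without a state code, such as `"DeKalb"`, is returned unchanged.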

In [1]:
import xml.etree.ElementTree as ET

osm_file = "atlanta_georgia.osm"
i = 0
for _, element in ET.iterparse(osm_file):
    if i == 10:
        break
    # Print the attributes of the first 10 node/way elements.
    if element.tag in ("node", "way"):
        print element.attrib
        i += 1


{'changeset': '14903606', 'uid': '1081376', 'timestamp': '2013-02-03T22:37:10Z', 'lon': '-84.262916', 'version': '3', 'user': 'greenv505', 'lat': '34.0852695', 'id': '1481302'}
{'changeset': '3539549', 'uid': '147510', 'timestamp': '2010-01-04T18:34:56Z', 'lon': '-85.193644', 'version': '2', 'user': 'woodpeck_fixbot', 'lat': '32.870928', 'id': '52374104'}
{'changeset': '3173372', 'uid': '147510', 'timestamp': '2009-11-21T04:30:48Z', 'lon': '-85.193643', 'version': '2', 'user': 'woodpeck_fixbot', 'lat': '32.871276', 'id': '52374106'}
{'changeset': '3173372', 'uid': '147510', 'timestamp': '2009-11-21T04:30:48Z', 'lon': '-85.193652', 'version': '2', 'user': 'woodpeck_fixbot', 'lat': '32.87149', 'id': '52374108'}
{'changeset': '3173372', 'uid': '147510', 'timestamp': '2009-11-21T04:30:48Z', 'lon': '-85.193689', 'version': '2', 'user': 'woodpeck_fixbot', 'lat': '32.871997', 'id': '52374110'}
{'changeset': '3173372', 'uid': '147510', 'timestamp': '2009-11-21T04:30:52Z', 'lon': '-85.380091', 

In [1]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.osm

In [5]:
count = db.atlanta.count()

In [4]:
entries_with_address = {"address" : {"$exists" : 1}}
addr_entries = db.atlanta.count(filter=entries_with_address)




In [6]:
entries_with_pos = {"pos": {"$exists" : 1}}
pos_entries = db.atlanta.count(filter=entries_with_pos)

In [7]:
entries_with_postcodes = {"address.postcode" : {"$exists" : 1}}
postcode_entries = db.atlanta.count(filter=entries_with_postcodes)

In [8]:
print "Number of entries in database ", count
print "Entries with an address field: ",addr_entries
print "Entries with postcodes: ",postcode_entries

Number of entries in database  12167874
Entries with an address field:  100744
Entries with postcodes:  82402


## Which street suffix is used most often in the street names?

In [10]:
match_pipe = {"$match" : {"address" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$address.streetSuffix", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
pipeline = [match_pipe,group_pipe,sort_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print(entry)

{u'count': 88496, u'_id': None}
{u'count': 4261, u'_id': u'Northwest'}
{u'count': 3158, u'_id': u'Northeast'}
{u'count': 2995, u'_id': u'Southwest'}
{u'count': 1602, u'_id': u'Southeast'}
{u'count': 101, u'_id': u'East'}
{u'count': 51, u'_id': u'South'}
{u'count': 50, u'_id': u'West'}
{u'count': 30, u'_id': u'North'}


In [15]:
match_pipe = {"$match" : {"address.zipInAtlanta" : 'T'}}
group_pipe = {"$group" : {"_id" : "$address.zipInAtlanta", "count" : {"$sum" : 1}}}
pipeline = [match_pipe,group_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print(entry)

{u'count': 17895, u'_id': u'T'}
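To put the counts in perspective, the shares can be worked out directly from the query outputs above. This is only arithmetic on the printed numbers; the helper `share` is hypothetical, and the last ratio assumes that every entry with a postcode also received a *zipInAtlanta* value.

```python
# Counts copied from the query outputs above.
total_entries = 12167874
with_address = 100744
with_postcode = 82402
zip_in_atlanta = 17895


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


address_share = share(with_address, total_entries)    # entries that have an address
postcode_share = share(with_postcode, with_address)   # addresses that have a postcode
atlanta_share = share(zip_in_atlanta, with_postcode)  # postcodes flagged as in Atlanta
```

Under that assumption, under 1% of entries carry an address, about 82% of those have a postcode, and only about 22% of the postcodes fall inside the City of Atlanta, which is consistent with the audit finding that many postal codes lie outside the city.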
