#Problems Encountered in the Map
Atlanta's streets are traditionally divided by *ordinal direction*, so the majority of street names end with one of these directions (Southwest, Southeast, etc.). Although the street names that end with an ordinal direction appear to have been cleaned well already, I think it would be appropriate to move the direction into a separate key/value pair. The next step was to print out the existing keys to see whether one of them could hold the direction. After looking at the tags, I found no key that records the ordinal direction, so we will add the key **streetSuffix** to the dataset to describe the ordinal direction of the street. Some street names also contain abbreviations, and some contain numbers. The cleaning plan is as follows:
- Fix street names that end with a direction initial and expand them to the full name
- Move each street name's direction suffix to a new key/value pair called *streetSuffix*
- Fix all abbreviated street names to their full name
Only a small number of addresses have numbers at the end of the street name, so we omit them from this project. Most of them belong to state highways.
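The cleaning plan above can be sketched roughly as follows. This is a minimal illustration, not the project's actual cleaning code: the lookup tables `DIRECTIONS` and `ABBREVIATIONS`, the function name `clean_street`, and the sample street names are all assumptions made for the example.

```python
# Hypothetical lookup tables; the mappings used in the project are not shown here.
DIRECTIONS = {
    "N": "North", "S": "South", "E": "East", "W": "West",
    "NE": "Northeast", "NW": "Northwest", "SE": "Southeast", "SW": "Southwest",
}
FULL_DIRECTIONS = set(DIRECTIONS.values())
ABBREVIATIONS = {
    "St": "Street", "Ave": "Avenue", "Rd": "Road",
    "Dr": "Drive", "Blvd": "Boulevard",
}


def clean_street(name):
    """Return (street, suffix): the expanded name and its direction, or None."""
    words = name.replace(".", "").split()
    suffix = None
    if len(words) > 1:
        last = words[-1]
        # Expand a trailing initial such as "NE"; normalize "SouthWest" to "Southwest".
        candidate = DIRECTIONS.get(last.upper(), last.title())
        if candidate in FULL_DIRECTIONS:
            suffix = candidate
            words = words[:-1]
    # Expand abbreviated street types in what is left of the name.
    words = [ABBREVIATIONS.get(word, word) for word in words]
    return " ".join(words), suffix
```

The returned suffix would go into the *streetSuffix* key, and the first element would replace the original street name. A street whose real name ends in a single letter such as "E" would be misread as a direction, so a real audit would need to check for those cases.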
##Auditing Process on Postal Codes
Some of the postal codes have the full 9-digit code and some do not. Some also have the state code as a prefix. Thus our cleaning process will involve:
- Removing the state code from the postal codes
- Moving the last 4 digits of the postal code into a new key/value pair called *postExt*
An initial audit of the postal codes showed that many of them are not in Atlanta. The gold-standard list of Atlanta postal codes was generated by screen scraping **zipcodestogo.com**. Since a large portion of the zip codes do not belong to the City of Atlanta, we generated a separate key/value pair called *zipInAtlanta* so that we can analyze how many entries fall outside the City of Atlanta.
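A sketch of how these postal code steps might fit together is shown below. The pattern `ZIP_RE`, the function name `clean_postcode`, the three-entry `ATLANTA_ZIPS` sample, and the `'F'` value for codes outside the city are assumptions made for illustration; the real reference list came from the scrape described above. The sketch keeps the postcode as a string for simplicity.

```python
import re

# Hypothetical sample of the reference list; the full list was scraped.
ATLANTA_ZIPS = {"30303", "30308", "30309"}

# Optional two-letter state prefix, five digits, optional four-digit extension.
ZIP_RE = re.compile(r"^(?:[A-Za-z]{2}\s*)?(\d{5})(?:-?(\d{4}))?$")


def clean_postcode(raw, atlanta_zips=ATLANTA_ZIPS):
    """Return a dict of cleaned postcode fields, or None if raw is unparseable."""
    match = ZIP_RE.match(raw.strip())
    if match is None:
        return None
    postcode, ext = match.groups()
    cleaned = {
        "postcode": postcode,
        # 'T' matches the flag value queried later; 'F' is assumed here.
        "zipInAtlanta": "T" if postcode in atlanta_zips else "F",
    }
    if ext:
        cleaned["postExt"] = ext
    return cleaned
```

For example, `clean_postcode("GA 30303-1234")` yields a postcode of `"30303"`, a *postExt* of `"1234"`, and a *zipInAtlanta* flag of `"T"`.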
##Auditing Process on County field
Looking at the county field, I see that none of the counties reside in the City of Atlanta. Also, some of the counties have the state code appended. Thus, as part of the cleaning process, we remove the state code from the end of the county field. After performing an audit on a sample set, it seems that a good portion of the counties are not in the City of Atlanta. Thus an analysis should be done on the database to determine whether this dataset includes the Atlanta metropolitan area.
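Stripping the state code could look something like the sketch below. It assumes the state code appears as a two-letter uppercase suffix separated by a comma or whitespace (for example `Fulton, GA`); the exact format in the dataset is not shown above, and the names `STATE_SUFFIX_RE` and `clean_county` are hypothetical.

```python
import re

# Assumed format: county name, then a comma and/or spaces, then a two-letter state code.
STATE_SUFFIX_RE = re.compile(r"(?:,\s*|\s+)[A-Z]{2}$")


def clean_county(raw):
    """Remove a trailing state code such as ', GA' from a county value."""
    return STATE_SUFFIX_RE.sub("", raw.strip())
```

Values without a state code, such as `DeKalb`, pass through unchanged.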

##Data Overview
I found that when using the *pymongo* driver, it is best to separate the database queries into separate cells. We first run some queries to find out the number of entries and how many of those entries have address fields.


In [41]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017")
db = client.osm

In [42]:
count = db.atlanta.count()

In [43]:
entries_with_address = {"address" : {"$exists" : 1}}
addr_entries = db.atlanta.count(filter=entries_with_address)

In [44]:
entries_with_pos = {"pos": {"$exists" : 1}}
pos_entries = db.atlanta.count(filter=entries_with_pos)

In [45]:
entries_with_postcodes = {"address.postcode" : {"$exists" : 1}}
postcode_entries = db.atlanta.count(filter=entries_with_postcodes)

In [46]:
print "Number of entries in database ", count
print "Entries with an address field: ",addr_entries
print "Entries with postcodes: ",postcode_entries
print "Entries with position information", pos_entries

Number of entries in database  12167874
Entries with an address field:  100744
Entries with postcodes:  82402
Entries with position information 11390067
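
The `count()` calls above rely on a method that PyMongo deprecated in version 3.7 and removed in 4.0. On a newer driver, a sketch along these lines should give the same numbers using `count_documents`; the helper name `count_entries` is hypothetical and not part of the original notebook.

```python
def count_entries(collection, field=None):
    """Count all documents, or only those in which `field` exists."""
    # count_documents requires an explicit filter; {} matches every document.
    query = {} if field is None else {field: {"$exists": True}}
    return collection.count_documents(query)
```

With the `db` handle from In [41], `count_entries(db.atlanta)` would replace the call in In [42] and `count_entries(db.atlanta, "address.postcode")` the one in In [45].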


## Which street suffix was used the most in the address's street names?
As I stated above, most street names in the City of Atlanta end with an ordinal direction as a suffix. Here we find out which street suffix is used the most in our address field. This may give us some insight into which quadrant of the city is best documented.

In [47]:
match_pipe = {"$match" : {"address" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$address.streetSuffix", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
pipeline = [match_pipe,group_pipe,sort_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print(entry)

{u'count': 88496, u'_id': None}
{u'count': 4261, u'_id': u'Northwest'}
{u'count': 3158, u'_id': u'Northeast'}
{u'count': 2995, u'_id': u'Southwest'}
{u'count': 1602, u'_id': u'Southeast'}
{u'count': 101, u'_id': u'East'}
{u'count': 51, u'_id': u'South'}
{u'count': 50, u'_id': u'West'}
{u'count': 30, u'_id': u'North'}


It seems that Northwest is the most used street suffix. As a fun fact, I went to high school on this side of town. Most entries in this dataset don't have a suffix at all, which hints to me that this dataset covers the Atlanta metro area, not just the City of Atlanta.

##What is the ratio of Zipcodes that are actually in the City of Atlanta?

In [48]:
match_pipe = {"$match" : {"address.zipInAtlanta" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : 'null', "count" : {"$sum" : 1}}}
pipeline = [match_pipe,group_pipe]
result = db.atlanta.aggregate(pipeline)
total = 0
for entry in result:
    total = entry['count']
    print(entry)

{u'count': 82402, u'_id': u'null'}


In [49]:
match_pipe = {"$match" : {"address.zipInAtlanta" : 'T'}}
group_pipe = {"$group" : {"_id" : "$address.zipInAtlanta", "count" : {"$sum" : 1}}}
pipeline = [match_pipe,group_pipe]
result = db.atlanta.aggregate(pipeline)
in_atl = 0
for entry in result:
    in_atl = entry['count']
    print(entry)

{u'count': 17895, u'_id': u'T'}


In [50]:
print "Ratio of zip codes entries that are in Atlanta City Limits:", float(in_atl)/float(total)

Ratio of zip codes entries that are in Atlanta City Limits: 0.217167059052


Only about 22% of the entries with a zip code fall within the city limits, which is another hint that this dataset covers the Atlanta metro area. One can also conclude that most of the edits were made to places outside the City of Atlanta.

##Which Zipcode is the most common?

In [51]:
match_pipe = {"$match" : {"address.postcode" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$address.postcode", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 7259, u'_id': 30114}


This zipcode is in Canton, Georgia.

##What is the most common amenity in Atlanta?

In [52]:
match_pipe = {"$match" : {"amenity" : {"$exists" : 1}}}
group_pipe = {"$group" : {"_id" : "$amenity", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 5609, u'_id': u'place_of_worship'}


No surprise that the most common amenity is a place of worship.  The only thing bigger than Texas in the South is Religion.

## Which user contributed the most to this dataset?

In [53]:
group_pipe = {"$group" : {"_id" : "$created.user", "count" : {"$sum" : 1}}}
sort_pipe = {"$sort" : {"count" : -1}}
limit_pipe = {"$limit" : 1}
pipeline = [match_pipe,group_pipe,sort_pipe,limit_pipe]
result = db.atlanta.aggregate(pipeline)
for entry in result:
    print entry

{u'count': 4057, u'_id': u'iandees'}


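This doesn't look like a bot name, so I must commend *iandees* for making so many edits. Note that the pipeline in In [53] still includes the `match_pipe` defined in In [52], so the count of 4057 covers only entries that have an amenity tag.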
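
Because the cells above reuse variable names such as `match_pipe`, a stage defined in an earlier cell can slip into a later pipeline. One way to avoid that is to build each "most common value" pipeline from scratch. The helper below is a sketch with a hypothetical name, `top_n_pipeline`; it only assembles the stage dictionaries and does not touch the database.

```python
def top_n_pipeline(field, n=1, match=None):
    """Build a group/sort/limit pipeline for the most common values of `field`."""
    pipeline = []
    if match is not None:
        # The filter is passed explicitly, so nothing is inherited from earlier cells.
        pipeline.append({"$match": match})
    pipeline.append({"$group": {"_id": "$" + field, "count": {"$sum": 1}}})
    pipeline.append({"$sort": {"count": -1}})
    pipeline.append({"$limit": n})
    return pipeline
```

Passing `top_n_pipeline("created.user")` to `db.atlanta.aggregate` would rank users across all entries, while `top_n_pipeline("amenity", match={"amenity": {"$exists": 1}})` reproduces the amenity query in In [52].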

##Additional Ideas
