# OpenStreetMap Case Study

### Map Area

Austin, TX, United States

I have lived in Austin for 3 years, and I am excited to explore the city through OpenStreetMap.

### Data Overview
Because the full OSM file for Austin is more than 1.2 GB, I chose a portion of it.

#### File Size
    austin_texas.osm ....... 158 MB
    austin_texas.db ........ 93 MB
    nodes.csv .............. 66 MB
    nodes_tags.csv ......... 1.9 MB
    ways.csv ............... 5.6 MB
    ways_tags.csv .......... 9.4 MB
    ways_nodes.csv ......... 20 MB

#### Number of nodes
    Select Count(*) From nodes
736203

#### Number of ways
    Select Count(*) From ways
82267

#### Number of Unique Users
    SELECT COUNT(DISTINCT(e.uid)) FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
431
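
The three counts above can also be collected in one pass from Python. This is a minimal sketch, assuming only that `nodes` and `ways` each have a `uid` column (as the queries above imply). The helper name `overview` and the in-memory sample rows are made up for illustration and are not the Austin data.

```python
import sqlite3

def overview(conn):
    """Return node count, way count, and unique-user count as a dict."""
    cur = conn.cursor()
    n_nodes = cur.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    n_ways = cur.execute("SELECT COUNT(*) FROM ways").fetchone()[0]
    n_users = cur.execute(
        "SELECT COUNT(DISTINCT e.uid) "
        "FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e"
    ).fetchone()[0]
    return {"nodes": n_nodes, "ways": n_ways, "unique_users": n_users}

# Tiny in-memory stand-in for austin_texas.db (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, uid INTEGER);
    CREATE TABLE ways  (id INTEGER PRIMARY KEY, uid INTEGER);
    INSERT INTO nodes VALUES (1, 10), (2, 10), (3, 11);
    INSERT INTO ways  VALUES (100, 11), (101, 12);
""")
sample_counts = overview(conn)
print(sample_counts)
```

To run it against the real database, replace `":memory:"` with the path to the `.db` file and drop the `executescript` call.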

#### Number of chosen types of nodes, like cafes and shops
    Cafe:
    Select count(distinct id) From nodes_tags Where value Like "%cafe%"
76

    Shop:
    Select count(distinct id) From nodes_tags Where value Like "%shop%"
21

### Problems Encountered

I noticed the following problems in the data:
- Overabbreviated street names, such as "E. 43rd St."
- Street names imported from TIGER data are split into segments stored under second-level 'k' keys, like the following:

      <tag k="tiger:name_base" v="2nd"/>
      <tag k="tiger:name_direction_prefix" v="S"/>
      <tag k="tiger:name_type" v="St"/>
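
One possible way to handle the segmented TIGER names is to join the parts back into a single string before auditing them. The sketch below is only an illustration: the function name `tiger_full_name` is hypothetical, and the join order (direction prefix, base name, type) is an assumption based on the three keys shown above.

```python
import xml.etree.ElementTree as ET

# Order in which the TIGER name parts are joined (an assumption for this sketch).
TIGER_PARTS = [
    "tiger:name_direction_prefix",
    "tiger:name_base",
    "tiger:name_type",
]

def tiger_full_name(element):
    """Join the TIGER name segments of one element into a single street name."""
    tags = {t.get("k"): t.get("v") for t in element.iter("tag")}
    # Skip parts that are missing or empty on this element.
    parts = [tags[k] for k in TIGER_PARTS if tags.get(k)]
    return " ".join(parts)

# Hypothetical way element carrying the three tags from the example above.
sample = ET.fromstring(
    '<way id="1">'
    '<tag k="tiger:name_base" v="2nd"/>'
    '<tag k="tiger:name_direction_prefix" v="S"/>'
    '<tag k="tiger:name_type" v="St"/>'
    '</way>'
)
tiger_name = tiger_full_name(sample)
print(tiger_name)  # S 2nd St
```

The joined name is still abbreviated, so it would need the same expansion step as any other street name.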

#### Overabbreviated street names
After loading the data into the database, a few simple queries reveal overabbreviated street names. To deal with this, I wrote a function that checks each word of a street name against a mapping and replaces any abbreviation it finds.

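    def update(name, mapping):
        """Expand every abbreviated word in a street name using mapping."""
        words = name.split()
        for i, word in enumerate(words):
            # Replace only whole words that appear as keys in mapping.
            if word in mapping:
                words[i] = mapping[word]
        return " ".join(words)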
    
I also updated my mapping dictionary:

    mapping = { "St": "Street", "St.": "Street",
                "Ave": "Avenue", "Ave.": "Avenue",
                "Rd": "Road", "Rd.": "Road",
                "E": "East", "E.": "East",
                "W": "West", "W.": "West",
                "N": "North", "N.": "North",
                "S": "South", "S.": "South"}

This updates my problematic string "E 43rd St." to "East 43rd Street".
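
As a quick check, the block below applies the expansion to the sample string. It repeats the mapping and a compact equivalent of `update` so that it runs without any other code. The second input, "W. 6th St,", is a hypothetical name added to show a limitation.

```python
mapping = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
    "E": "East", "E.": "East",
    "W": "West", "W.": "West",
    "N": "North", "N.": "North",
    "S": "South", "S.": "South",
}

def update(name, mapping):
    """Replace each whitespace-separated word found in mapping; keep the rest."""
    return " ".join(mapping.get(word, word) for word in name.split())

fixed = update("E 43rd St.", mapping)
print(fixed)  # East 43rd Street

# Only exact whole-word matches are replaced, so "St," is left alone.
partly_fixed = update("W. 6th St,", mapping)
print(partly_fixed)  # West 6th St,
```

Because the lookup is on exact words, abbreviations followed by other punctuation (such as a trailing comma) pass through unchanged and would need extra mapping entries or stripping beforehand.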

### Additional Ideas

#### Top Contributors
    select count(e.user) from (select user from nodes union all select user from ways) e 

818470

    select e.user, count(e.user) from (select user from nodes union all select user from ways) e 
    group by e.user
    order by count(e.user) desc
    
patisilva_atxbuildings 334716

jseppi_atxbuildings 201431

......


- patisilva_atxbuildings contributed 40.90% of the total.
- patisilva_atxbuildings and jseppi_atxbuildings combined contributed 65.51% of the total.
- The top 10 contributors combined contributed 90.36%.
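
The first two percentages follow directly from the query results above. A short sketch of the arithmetic, using only the counts already shown (the helper name `share` is made up for illustration):

```python
total = 818470  # rows in nodes + ways, from the first query above

# Counts for the two top contributors, from the second query above.
top = {
    "patisilva_atxbuildings": 334716,
    "jseppi_atxbuildings": 201431,
}

def share(count, total):
    """Percentage of total, rounded to two decimal places."""
    return round(100.0 * count / total, 2)

first_share = share(top["patisilva_atxbuildings"], total)
combined_share = share(sum(top.values()), total)
print(first_share, combined_share)  # 40.9 65.51
```

The top-10 figure is computed the same way, summing the first ten rows of the grouped query.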

#### Additional Data Investigation

    select value, count(value) as count from nodes_tags
    where key = 'cuisine'
    group by value
    order by count desc

Top 10 Cuisines

    mexican    31
    sandwich   16
    pizza      15
    burger     10
    coffe_shop 10
    american   8
    thai       7
    italian    6
    indian     5
    regional   5

### Improvements and Conclusion

In this project, I investigated Austin's OpenStreetMap data. After downloading the XML data, I worked through the XML -> CSV -> SQLite pipeline to generate my SQLite database. I then wrote a variety of SQL queries to get an overview of the data and check its quality. After noticing the overabbreviated street names, I used a function to update them to the names I expected.
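
For reference, the CSV -> SQLite step of that pipeline can be sketched roughly as follows. The three-column layout (`id`, `user`, `uid`) is a simplifying assumption, not the full schema of `nodes.csv`, and `load_nodes` is a hypothetical helper name.

```python
import csv
import io
import sqlite3

# Stand-in for nodes.csv; the column set here is an assumption.
NODES_CSV = "id,user,uid\n1,alice,10\n2,bob,11\n"

def load_nodes(conn, csv_file):
    """Create a nodes table and bulk-insert the rows of a CSV file object."""
    conn.execute(
        "CREATE TABLE nodes (id INTEGER PRIMARY KEY, user TEXT, uid INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = [(int(r["id"]), r["user"], int(r["uid"])) for r in reader]
    conn.executemany("INSERT INTO nodes (id, user, uid) VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_nodes(conn, io.StringIO(NODES_CSV))
print(loaded)  # 2
```

With a real file, `io.StringIO(NODES_CSV)` would be replaced by an opened CSV file, and the table definition would list every column in the file.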

Some improvements I can think of are:
#### 1. Complete the tags for all the nodes.
Only 1148 of 32714 nodes have tags, so one major contribution would be to add tags to the remaining nodes.
#### 2. Specify all the street types.
In some cases it is very hard to detect a missing street type. For example, we can correct "Rio Grande St" to "Rio Grande Street", but if the original value is just "Rio Grande", it is very hard to correct this programmatically.
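
Although such names cannot be fixed automatically, they can at least be flagged for manual review. The sketch below is one possible approach, not part of the project code: the `EXPECTED` set is an assumed, incomplete list of street types, and `missing_street_type` is a hypothetical helper.

```python
# Expected street types; this set is an assumption and would need extending.
EXPECTED = {"Street", "Avenue", "Road", "Boulevard", "Drive", "Lane"}

def missing_street_type(name):
    """Return True when the last word of the name is not a known street type."""
    words = name.split()
    return not words or words[-1] not in EXPECTED

# Hypothetical sample of already-expanded street names.
names = ["Rio Grande Street", "Rio Grande", "East 43rd Street"]
flagged = [n for n in names if missing_street_type(n)]
print(flagged)  # ['Rio Grande']
```

This check should run after abbreviations have been expanded; otherwise a name like "Rio Grande St" is flagged as well.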

Both improvements require a lot of effort, but the benefits are also large.