# Open Street Map Case Study

## Destination: Pasadena, CA!

https://www.openstreetmap.org/#map=13/34.1247/-118.0944

I chose Pasadena because I grew up there and lived there for more than 20 years, so I know the area well.

Please see Open_street_map.py for code.

## Issues:

Several lists and dictionaries had to be given specific types and declared as global variables, so that multiple functions could each do one focused piece of the data cleaning on shared state.

    street_types = defaultdict(set)
    zip_types = defaultdict(set)
    street_fixes_list = []
    street_fixes = defaultdict(list)
    names_dict = defaultdict()

### Street Name issues:

Skipping street names that end in a unit number when populating the street_fixes dictionary:

    if not street_type[-1].isdigit():

Passing the element into the audit function so that its id can be extracted once it has been determined whether the tag value is usable:

    def audit_street_type(street_types, street_name, elem):
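
A minimal sketch of what such an audit function might look like, written for Python 3. The regular expression, the `expected` set, and the idea of keying `street_fixes` by `elem.attrib['id']` are assumptions made for illustration; the actual implementation lives in Open_street_map.py.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed pattern: the last whitespace-delimited token of the street name.
street_type_re = re.compile(r'\S+$')

# Assumed set of street types that need no cleaning.
expected = {"Street", "Avenue", "Boulevard", "Drive", "Road", "Way"}

street_types = defaultdict(set)   # street type -> street names using it
street_fixes = defaultdict(list)  # element id  -> names that need fixing


def audit_street_type(street_types, street_name, elem):
    """Record unexpected street types and the element that used them."""
    match = street_type_re.search(street_name.strip())
    if not match:
        return
    street_type = match.group()
    # Names ending in a digit carry a unit or suite number, so skip them.
    if street_type[-1].isdigit():
        return
    if street_type not in expected:
        street_types[street_type].add(street_name)
        street_fixes[elem.attrib['id']].append(street_name)


# Hypothetical element, used only to exercise the function.
way = ET.fromstring('<way id="101"><tag k="addr:street" v="E Colorado Blvd"/></way>')
audit_street_type(street_types, "E Colorado Blvd", way)
audit_street_type(street_types, "S Lake Ave #200", way)  # skipped: ends in a digit
```

After these two calls, only "E Colorado Blvd" is recorded, under the street type "Blvd" and the element id "101".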

Combining the separate dictionaries into a single useful one, so that the section of code that writes the CSV files barely needs to change:

    # For every element whose original street name has a fix, append the cleaned name.
    for old_name, new_name in names_dict.iteritems():
        for k, v in street_fixes.iteritems():
            if v[0] == old_name:
                v.append(new_name)
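
To make the resulting structure concrete, here is a small Python 3 sketch of the same merge on made-up data. The element ids, street names, and cleaned names are hypothetical, and the single pass with a dictionary lookup is a simplification rather than the code used in the project.

```python
from collections import defaultdict

# Hypothetical sample data: element id -> [original street name]
street_fixes = defaultdict(list)
street_fixes["101"].append("E Colorado Blvd")
street_fixes["102"].append("N Lake Ave")
street_fixes["103"].append("E Colorado Blvd")

# Hypothetical mapping: original street name -> cleaned street name
names_dict = {
    "E Colorado Blvd": "East Colorado Boulevard",
    "N Lake Ave": "North Lake Avenue",
}

# Append the cleaned name after the original one for each element.
for elem_id, names in street_fixes.items():
    original = names[0]
    if original in names_dict:
        names.append(names_dict[original])
```

Each entry then holds the original name followed by its replacement, for example `street_fixes["101"] == ["E Colorado Blvd", "East Colorado Boulevard"]`, which is all the CSV-writing code needs to look up by element id.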

### Zip Code specific issues:

Even after building the dictionary of zip codes and their associated raw values, the values containing "CA" still had to be found and cleaned:

    for zip_type, zipcode in zip_types.iteritems():
        for zipc in zipcode:
            if "CA" in zipc:
                print zipc, "=>", zip_type

One zip code was not in the dictionary because of problem characters, so its value was not updated automatically. Specific logic for it had to be added to the change_name function:

    elif tag.attrib['k'] == "addr:postcode":
        for k, v in zip_types.iteritems():
            if k in name:
                return k
            elif "90032" in name:
                return "90032"
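
One possible alternative to special-casing individual values is to extract the five-digit code with a regular expression. This is a sketch in Python 3, not the approach taken in Open_street_map.py; the function name `clean_postcode` and the pattern are assumptions, and truncating ZIP+4 values to five digits is a design choice that may not suit every use.

```python
import re

# Assumed pattern: a run of exactly five digits, not part of a longer number.
ZIP_RE = re.compile(r'(?<!\d)(\d{5})(?!\d)')


def clean_postcode(raw):
    """Return the five-digit ZIP code found in raw, or None if there is none."""
    match = ZIP_RE.search(raw)
    return match.group(1) if match else None
```

With this helper, `clean_postcode("CA 91105")` gives `"91105"`, `clean_postcode("90041-1238")` gives `"90041"`, and a value with no five-digit run, such as `"CA"`, gives `None` so it can be reviewed by hand.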

## Solution

Once all the dictionaries had been combined into two useful ones (one for street names and one for zip codes), a single function call did all the work of cleaning the data.

Updating every value that needs it in one line:

    tag_dict["value"] = change_name(tag, tag_dict["id"], tag.attrib['v'])
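
The shape of such a function might be sketched as follows in Python 3. Only the signature and the postcode loop are taken from the snippets above; the street-name branch, the layout of `street_fixes` as `[original, cleaned]`, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables standing in for the ones built during the audit.
street_fixes = {"101": ["E Colorado Blvd", "East Colorado Boulevard"]}
zip_types = {"91105": {"CA 91105", "91105"}}


def change_name(tag, elem_id, name):
    """Return the cleaned value for a tag, or the original value unchanged."""
    key = tag.attrib['k']
    if key == "addr:street":
        fixes = street_fixes.get(elem_id)
        # Assumed layout: [original name, cleaned name]
        if fixes and len(fixes) > 1 and fixes[0] == name:
            return fixes[-1]
    elif key == "addr:postcode":
        for zipcode in zip_types:
            if zipcode in name:
                return zipcode
    # Anything without a known fix passes through untouched.
    return name


street_tag = ET.fromstring('<tag k="addr:street" v="E Colorado Blvd"/>')
zip_tag = ET.fromstring('<tag k="addr:postcode" v="CA 91105"/>')
cleaned_street = change_name(street_tag, "101", street_tag.attrib['v'])
cleaned_zip = change_name(zip_tag, "101", zip_tag.attrib['v'])
```

Returning the input unchanged when no fix applies is what lets the caller assign the result unconditionally, as in the one-line update above.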

# Incredible Data!!

After living in Pasadena for so many years, seeing all this real data is really fascinating!

### Zip codes in Region:

    sqlite> SELECT tags.value, COUNT(*) as count 
       ...> FROM (SELECT * FROM nodes_tags 
       ...>   UNION ALL 
       ...>       SELECT * FROM ways_tags) tags
       ...> WHERE tags.key='postcode'
       ...> GROUP BY tags.value
       ...> ORDER BY count DESC;
    91776,229
    91030,106
    91105,106
    91106,81
    91007,42
    91801,35
    90042,32
    91101,28
    91107,27
    91103,24
    91780,13
    90041,7
    91775,6
    91803,6
    91102,5
    91108,3
    90032,2
    90065,2
    90041-1238,1
    90042-4229,1
    91006,1
    91109,1
    91125,1
    91182,1
    91770,1
    91778,1

### Top 10 Amenities in Pasadena:

    sqlite> SELECT value, COUNT(*) as num
       ...> FROM nodes_tags
       ...> WHERE key='amenity'
       ...> GROUP BY value
       ...> ORDER BY num DESC
       ...> LIMIT 10;
    restaurant,187
    place_of_worship,140
    fast_food,93
    cafe,66
    school,66
    bank,40
    fuel,32
    post_box,25
    post_office,22
    library,19

### Top 10 types of Restaurants in Pasadena:

    sqlite> SELECT nodes_tags.value, COUNT(*) as num
       ...> FROM nodes_tags 
       ...>     JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value='restaurant') i
       ...>     ON nodes_tags.id=i.id
       ...> WHERE nodes_tags.key='cuisine'
       ...> GROUP BY nodes_tags.value
       ...> ORDER BY num DESC
       ...> LIMIT 10;
    mexican,15
    chinese,14
    american,9
    italian,9
    burger,7
    japanese,7
    sandwich,7
    sushi,7
    thai,5
    asian,4
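
The same kind of query can also be run from Python with the standard `sqlite3` module. The sketch below uses a tiny in-memory table with made-up rows (not the Pasadena data) and an assumed three-column schema. It also restricts the subquery to `key = 'amenity'`, an extra condition not present in the query above, to guard against other tags whose value happens to be `restaurant`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Assumed minimal schema; the real table may have more columns.
cur.execute("CREATE TABLE nodes_tags (id INTEGER, key TEXT, value TEXT)")
rows = [
    (1, "amenity", "restaurant"), (1, "cuisine", "mexican"),
    (2, "amenity", "restaurant"), (2, "cuisine", "mexican"),
    (3, "amenity", "restaurant"), (3, "cuisine", "thai"),
    (4, "amenity", "cafe"),       (4, "cuisine", "coffee_shop"),
]
cur.executemany("INSERT INTO nodes_tags VALUES (?, ?, ?)", rows)

QUERY = """
SELECT nodes_tags.value, COUNT(*) AS num
FROM nodes_tags
    JOIN (SELECT DISTINCT id FROM nodes_tags
          WHERE key = 'amenity' AND value = 'restaurant') i
    ON nodes_tags.id = i.id
WHERE nodes_tags.key = 'cuisine'
GROUP BY nodes_tags.value
ORDER BY num DESC
LIMIT 10;
"""

# Each result row is a (cuisine, count) tuple.
top_cuisines = cur.execute(QUERY).fetchall()
conn.close()
```

On the sample rows, the cafe's cuisine tag is excluded and `top_cuisines` is `[("mexican", 2), ("thai", 1)]`.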

# Data Overview

### File Sizes

    nodes_tags.csv.......624 KB
    nodes.csv............128.7 MB
    pasadena.osm.........320.4 MB
    ways_nodes.csv.......36.6 MB
    ways_tags.csv........28.1 MB
    ways.csv.............9.4 MB

### Number of Nodes:
    
    sqlite> SELECT COUNT(*) FROM nodes;
    1361733

### Number of Ways:

    sqlite> SELECT COUNT(*) FROM ways;
    135944

### Number of unique users

    sqlite> SELECT COUNT(DISTINCT(e.uid))          
       ...> FROM (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e;
    443


### Top 10 contributing users

    sqlite> SELECT e.user, COUNT(*) as num
       ...> FROM (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e
       ...> GROUP BY e.user
       ...> ORDER BY num DESC
       ...> LIMIT 10;
    RichRico_labuildings,156414
    upendra_labuilding,143185
    poornima_labuildings,134275
    nammala_labuildings,133365
    Luis36995_labuildings,105991
    dannykath_labuildings,99443
    schleuss_imports,87816
    piligab_labuildings,85610
    calfarome_labuilding,71810
    Aloisian,60081

### Other ideas about the dataset

When one looks at the top 20 contributors:
    
    RichRico_labuildings,156414
    upendra_labuilding,143185
    poornima_labuildings,134275
    nammala_labuildings,133365
    Luis36995_labuildings,105991
    dannykath_labuildings,99443
    schleuss_imports,87816
    piligab_labuildings,85610
    calfarome_labuilding,71810
    Aloisian,60081
    karitotp_labuildings,56976
    saikabhi_LA_imports,39034
    manings_labuildings,28213
    nikhil_imports,27740
    yurasi_import,27091
    ridixcr_import,25003
    BharataHS_laimport,24315
    JRHutson_Import,22001
    jerjozwik,19225
    Fa7C0N_imports,17445
    
The majority of contributions come from two groups, judging by the username suffixes: labuilding(s) and import(s).

    Total contributions: 1,497,677
    Total contributions from top 10 users: 1,077,990
    Total contributions from top 20 users: 1,365,033
    Top 10 users are contributing 72% of the data.
    Top 20 users are contributing 91% of the data.

    Number of unique users: 443
    Number of unique users having <10 contributions: 229
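
The totals and percentages above can be reproduced from the counts already listed. This Python 3 snippet copies the top-20 figures and the node and way counts from this document; nothing in it queries the database.

```python
# Contribution counts copied from the top-20 list above, in order.
top_20 = [
    156414, 143185, 134275, 133365, 105991,
    99443, 87816, 85610, 71810, 60081,
    56976, 39034, 28213, 27740, 27091,
    25003, 24315, 22001, 19225, 17445,
]

# Total contributions = number of nodes + number of ways.
total = 1361733 + 135944

top_10_sum = sum(top_20[:10])
top_20_sum = sum(top_20)
top_10_share = top_10_sum / total
top_20_share = top_20_sum / total

print("Total: {:,}".format(total))                              # Total: 1,497,677
print("Top 10: {:,} ({:.0%})".format(top_10_sum, top_10_share))  # Top 10: 1,077,990 (72%)
print("Top 20: {:,} ({:.0%})".format(top_20_sum, top_20_share))  # Top 20: 1,365,033 (91%)
```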
    
Gamification could lead to a more balanced spread of contributions. Individual users are clearly interested and are contributing, but their motivation seems limited. Rewards they can share, or competitions in which individual users gain prestige or prominence in their region, would give them an incentive to contribute more.

More information on how gamification can increase user engagement:

https://www.forbes.com/sites/gartnergroup/2014/04/10/how-gamification-motivates-the-masses/#73c2ec0d5c04

http://www.cmswire.com/cms/social-business/how-gamification-can-impact-employee-engagement-infographic-019914.php

http://engageemployee.com/peer-peer-gamification-can-democratise-employee-engagement/

### Anticipated issues of improving the dataset

The main obstacle is that contributing requires some knowledge of coding, which is a steep learning curve for the average citizen. Expecting people to climb it without any financial compensation, purely for the greater good, is a lot to ask.

However, having gone through the process myself, I can say that a step-by-step guide to contributing, written in layman's terms, would go a long way towards widening the pool of people who can contribute. Basic learning could also be encouraged by pointing out that contributors would not only be improving the data for their own hometown but also learning basic skills that increasingly power the world.