# OSM data wrangling project:

## Metropolitan Area:
San Jose, California

<https://mapzen.com/data/metro-extracts/metro/san-jose_california/>
    
## Why I chose San Jose:

1. I live close to Chicago, but the Chicago data was used in the lessons, so I didn't want to use it again.
2. I have been to San Jose. The decompressed OSM data was around 383 MB, so I sampled a 55 MB OSM file from it, which is a suitable size for the project.

The only other resource I used is the schema from:

<https://gist.github.com/swwelch/f1144229848b407e0a5d13fcb7fbbd6f>

# Data used in the analysis:

### Files
```
1. san_jose_compressed.osm            55 MB
2. mymap.db                           31 MB
3. nodes.csv                          21 MB
4. nodes_tags.csv                     0.5 MB
5. ways.csv                           2 MB
6. ways_nodes.csv                     7 MB
7. ways_tags.csv                      3 MB
```

## Problems encountered in the map:

### There are some discrepancies in the postcode:

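- u'94087\u200e': Postcode followed by an invisible Unicode character (U+200E, left-to-right mark)
- '95014': Postcode without extension; this is acceptable, since the extension doesn't serve much purpose
- '95014-030': Postcode with an incomplete extension (three digits instead of four)
- '95014-0438': Postcode with correct extension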

### Abbreviations in the street names: 

- 'Blvd': set(['Los Gatos Blvd', 'Palm Valley Blvd']),
- 'Dr': set(['Samaritan Dr']),
- 'Ln': set(['Barber Ln', 'Branham Ln']),
- 'Ave': set(['1425 E Dunne Ave', 'Greenbriar Ave','N Blaney Ave','W Washington Ave']),
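A grouping like the one above can be produced by an audit pass over the street names. The following is a minimal sketch, not the project's actual audit code: the regular expression, the shortened `expected` set, the function name `audit_street_types`, and the entry "Main Street" are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: the last whitespace-delimited token of the street name.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical, shortened set of accepted street types.
expected = {"Avenue", "Boulevard", "Drive", "Lane", "Road", "Street"}

def audit_street_types(street_names, expected):
    """Group street names by their final token when that token is not expected."""
    street_types = defaultdict(set)
    for name in street_names:
        m = street_type_re.search(name)
        if m and m.group() not in expected:
            street_types[m.group()].add(name)
    return dict(street_types)

# "Main Street" is a made-up entry that should pass the audit untouched.
names = ["Los Gatos Blvd", "Palm Valley Blvd", "Samaritan Dr", "Main Street"]
problems = audit_street_types(names, expected)
for street_type in sorted(problems):
    print("%s: %s" % (street_type, sorted(problems[street_type])))
```

Collecting the names in a set per street type makes it easy to see at a glance which abbreviations need an entry in a mapping dictionary.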

### The ‘other’ category of k values in the tag element has a wide variety of issues

```
 'gnis:Class',
 'gnis:County',
 'gnis:ST_num',
 'gnis:ST_alpha',
 'gnis:County_num'

 'tiger:name_base_1',
 'tiger:name_base_2',
 'tiger:name_type_1',
 'turn:lanes:both_ways',
 'tiger:MTFCC',
 'tiger:RTTYP',
 'tiger:LINEARID'
 ```
Fixing the problems above would require quite a number of custom functions, but the benefit is that the data would become much easier to understand.

Besides that, some keys make no geographic sense and merely describe features of the location/place:

- 'service:bicycle:chain_tool',
- 'FIXME',
- 'socket:type1',
- 'socket:type1_chademo',
- 'socket:type1_combo',
- 'service:bicycle:pump'
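One way keys like these end up in an 'other' bucket is a regex-based classification of the `k` values. The sketch below assumes three illustrative patterns (all lowercase, lowercase with a single colon, and keys containing problematic characters); the patterns and the function name `key_type` are assumptions, not taken from the project code.

```python
import re
from collections import Counter

# Assumed categories; the exact patterns are illustrative.
lower = re.compile(r'^[a-z_]+$')                  # e.g. 'highway'
lower_colon = re.compile(r'^[a-z_]*:[a-z_]*$')    # e.g. 'addr:street'
problemchars = re.compile(r'[=+/&<>;\'"?%#$@,. \t\r\n]')

def key_type(key):
    """Classify a tag key as 'problemchars', 'lower', 'lower_colon' or 'other'."""
    if problemchars.search(key):
        return "problemchars"
    if lower.match(key):
        return "lower"
    if lower_colon.match(key):
        return "lower_colon"
    return "other"

keys = ["highway", "addr:street", "gnis:County_num",
        "tiger:MTFCC", "FIXME", "socket:type1"]
counts = Counter(key_type(k) for k in keys)
print(dict(counts))
```

Under these assumed patterns, keys with uppercase letters, digits, or more than one colon (such as 'turn:lanes:both_ways') fall through to 'other', which is why that bucket collects such a wide variety of keys.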

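### Code snippet to count the tags and nodes:

```python
import xml.etree.ElementTree as ET

filename = "san_jose_compressed.osm"  # the sampled OSM file from the file list

def count_tags(filename):
    """Count how often each element tag occurs in the OSM file."""
    tags = {}  # tag name -> number of occurrences
    for event, elem in ET.iterparse(filename):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags


print(count_tags(filename))
```

#### Output:
```
{'node': 257543, 'nd': 302148, 'member': 3370, 'tag': 107229, 'relation': 372, 'way': 34265, 'osm': 1}
```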

## A sample expected list and mapping dictionary

```python
import re

# Assumption: street_type_re is defined elsewhere in the project; this pattern
# (the last token of the street name) is a stand-in.
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

expected = ["Avenue","Alameda", "Barcelona","Boulevard","Broadway","Circle",
            "Drive","Court","East","Expressway", "Highway","Lane","Loop",
            "Luna","Marino","Napoli","Palamos","Parkway","Paviso","Place",
            "Plaza","Road", "Real","Sorrento","Square","Street",
            "Trail","Terrace","Volante","Way","West"]

mapping = { "St": "Street","Ave": "Avenue", "ave":"Avenue","Blvd":"Boulevard",
           "Rd": "Road", "Ln":"Lane","Dr":"Drive"}

def update_name(name, mapping):
    """Replace an abbreviated street type at the end of name with its full form."""
    m = street_type_re.search(name)
    if m:
        street_type = m.group()
        if street_type in mapping:
            # Replace only the matched suffix, not every occurrence in the name
            # (otherwise "Drake Dr" would also change inside "Drake").
            name = name[:m.start()] + mapping[street_type]
    return name
```

#### Sample output:

```
Barber Ln => Barber Lane
Branham Ln => Branham Lane
Pruneridge Ave #6 => Pruneridge Ave #6
Casa Verde St => Casa Verde Street
Wolfe Rd => Wolfe Road
Berryessa Rd => Berryessa Road
Mt Hamilton Rd => Mt Hamilton Road
```

### Cleaning the postal codes:

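```python
import re

zip1 = re.compile(r'^\d{5}')        # five digits at the start of the string
zip2 = re.compile(r'\d{5}-\d{4}')   # five digits, a hyphen, and a four-digit extension


def correct_zip(t):
    """Return (original, arrow, cleaned) so the before/after values can be compared."""
    m = zip1.search(t)
    if m:
        n = zip2.search(t)
        if n:
            return t, " ==>> ", n.group()   # keep a complete extension
        return t, " ==>> ", m.group()       # otherwise keep the 5-digit code only
    return t, " ==>>", "None"               # no leading 5-digit code found
```

#### Sample output:

```
('94086-6406', ' ==>> ', '94086-6406')
('94086-640', ' ==>> ', '94086')
('94086', ' ==>> ', '94086')
('None', ' ==>>', 'None')
('CUPERTINO', ' ==>>', 'None')
```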

# SQL queries:

```
select count(distinct uid) from nodes
```

909

```
select count(distinct uid) from ways
```

598

```
select count(*) from nodes;
select count(*) from ways;
```

##### nodes:
257543
##### ways:
34265
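
The queries in this section can also be run from Python with the standard `sqlite3` module. The following is a minimal sketch under stated assumptions: it uses an in-memory database with made-up rows so that it runs on its own, the tables are reduced to the `id` and `uid` columns, and the helper name `count_distinct_users` is hypothetical. Against the real data, the connection would presumably be opened on `mymap.db` instead.

```python
import sqlite3

def count_distinct_users(conn, table):
    """Return the number of distinct uid values in the given table."""
    # Table names cannot be bound as SQL parameters, so only known names are allowed.
    if table not in ("nodes", "ways"):
        raise ValueError("unexpected table: %s" % table)
    cur = conn.execute("select count(distinct uid) from %s" % table)
    return cur.fetchone()[0]

# In-memory stand-in with made-up rows; for the real data this would be
# sqlite3.connect("mymap.db").
conn = sqlite3.connect(":memory:")
conn.execute("create table nodes (id integer primary key, uid integer)")
conn.execute("create table ways (id integer primary key, uid integer)")
conn.executemany("insert into nodes values (?, ?)", [(1, 10), (2, 10), (3, 11)])
conn.executemany("insert into ways values (?, ?)", [(1, 10)])

print(count_distinct_users(conn, "nodes"))
print(count_distinct_users(conn, "ways"))
```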

### Top 10 nodes_tags

```
select key, count(*) from nodes_tags
group by key
order by count(*) desc
limit 10
```

#### Output:
```
    key           count(*)
0   highway       2285
1   housenumber   999
2   street        973
3   name          863
4   amenity       624
5   crossing      604
6   city          506
7   postcode      439
8   natural       397
9   source        307
```

### Top five amenities:

```
select value, count(*) from ways_tags
where key = 'amenity' 
group by value
order by count(*) desc
limit 5
```

#### Output:
```
    value	            count(*)
0	parking	            309
1	school	            73
2	restaurant	        29
3	place_of_worship	26
4	fast_food	        20
```

### Top 5 kinds of highways in the nodes:

```
select value, count(*) from nodes_tags
where key = 'highway'
group by value
order by count(*) desc limit 5;
```

#### Output:
```
    value	         count(*)
0	turning_circle	 848
1	crossing	     643
2	traffic_signals	 426
3	stop	         201
4	bus_stop	     69
```

### Top 5 most popular cuisines in San Jose:

```
select value, count(*) from nodes_tags
where key = 'cuisine'
group by value
order by count(*) desc limit 5
```

#### Output:
```
    value	    count(*)
0	vietnamese	16
1	sandwich	15
2	chinese	    13
3	coffee_shop	13
4	mexican	    11
```

### Chinese food enthusiasts in San Jose:

```
select nodes.user 
from nodes join nodes_tags on nodes.id=nodes_tags.id
where key = 'cuisine' and value = 'chinese'
```

#### Output:
```
	user
0	andyyue
1	xybot
2	xybot
3	Walk and walk around
4	YC Chao
5	lyiu
```

# Insights:

1. There are far more nodes (257,543) than ways (34,265) on the map.
2. However, ways are tagged much more often than nodes, by a factor of nearly 8.5.
3. It looks like nodes tend to mark markets or hangout places, while ways tend to mark offices or schools.
4. Asian food is quite popular in San Jose.
5. One user, Minh Nguyen, seems to be crazy about Vietnamese food in San Jose.
6. There are probably a lot of people who like Chinese food.


# Benefits and anticipated problems in implementing the solution:


- Some of the street names look like "Pruneridge Ave #6" or "Concourse Dr #81".
- Ideally, they should end with "Avenue" and "Drive" respectively.

**Benefits**:

- A street address usually looks like "6 Pruneridge Avenue" or "81 Concourse Drive", so the foremost benefit would be consistency in the street addresses.

**Anticipated issues**:

- Such erroneous street names are quite unique, so writing a function for them or visually inspecting the data won't be worth the benefit. For example, we may have to write a customised function or block of code to clean just one, two, or three such streets, and there are quite a few one-off cases of this kind.

Further, there are a few erroneous entries. For example, one street address is just the single word "yes". It is extremely difficult to know what the user intended, so such corrections might be painstaking.
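
For the "Pruneridge Ave #6" kind of entry, one possible approach is to split off the trailing "#..." part before expanding the street type. The sketch below is only an illustration: the pattern `unit_re`, the helper names `split_unit` and `clean_street`, and the shortened mapping are assumptions, and whether the number is a house number or a unit number cannot be decided from the string alone, so it is returned separately.

```python
import re

# Hypothetical pattern: a trailing marker such as " #6" or " #81".
unit_re = re.compile(r'\s*#\s*(\w+)\s*$')

# Shortened version of the mapping dictionary, for illustration only.
mapping = {"Ave": "Avenue", "Dr": "Drive"}

def split_unit(name):
    """Return (street, unit); unit is None when there is no trailing '#...' part."""
    m = unit_re.search(name)
    if m:
        return name[:m.start()], m.group(1)
    return name, None

def clean_street(name, mapping):
    """Expand the street type of the last word and return (street, unit)."""
    street, unit = split_unit(name)
    words = street.split()
    if words and words[-1] in mapping:
        words[-1] = mapping[words[-1]]
    return " ".join(words), unit

print(clean_street("Pruneridge Ave #6", mapping))
print(clean_street("Concourse Dr #81", mapping))
```

An entry such as "yes" passes through unchanged, so it would still need manual review.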






- For the zip code, the majority of entries have the format "99999-9999" or "99999". The former includes the extension and describes a more precise location, while the latter is perfectly acceptable. However, there were a few like "99999-99" or "99999-999", and for these the incomplete extension has been dropped. There were also a few entries like 'CA 95116' and 'CUPERTINO' that require a disproportionate amount of effort.

**Benefits**:

- 'CA 95116' can be corrected, as it does contain the zip code, just preceded by the state abbreviation.

**Anticipated issues**:

- 'CUPERTINO', however, is useless as a zip code.
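
A value such as 'CA 95116' could be recovered by searching for the zip code anywhere in the string rather than only at the start. This is a minimal sketch; the pattern `zip_anywhere` and the function name `extract_zip` are assumptions introduced for illustration.

```python
import re

# Assumed pattern: a 5-digit code, optionally followed by a 4-digit extension,
# anywhere in the string and not embedded in a longer run of digits.
zip_anywhere = re.compile(r'(?<!\d)(\d{5})(?:-(\d{4}))?(?!\d)')

def extract_zip(raw):
    """Return '99999' or '99999-9999' found in raw, or None if there is none."""
    m = zip_anywhere.search(raw)
    if not m:
        return None
    five, ext = m.groups()
    return five + "-" + ext if ext else five

for value in ["CA 95116", "94086-640", "94086-6406", "CUPERTINO"]:
    print("%s ==>> %s" % (value, extract_zip(value)))
```

An incomplete extension is still dropped, and an entry without any digits, like 'CUPERTINO', still yields nothing and would have to be looked up by hand.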


# Suggestions:

Other than GPS data, I think the tagging of places and their street names and types can be made error-proof
by building in a feature that prompts the user, as he or she types in the address, to choose the street type
or zip code from a list of expected values created from vetted street names. For example, if a user types
'bou', the editor can suggest 'boulevard', and when a user types an incomplete or wrong zip code, the editor
can warn "wrong zip code", because zip codes follow a fairly standard numeric pattern.
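
The suggestion above could be prototyped roughly as follows. This is only a sketch of the idea: the list `STREET_TYPES`, the pattern `ZIP_RE`, and the function names `suggest_street_type` and `check_zip` are hypothetical, and a real editor would build its list from audited street names.

```python
import re

# Hypothetical vetted list; a real one would come from audited street names.
STREET_TYPES = ["Avenue", "Boulevard", "Court", "Drive", "Lane", "Road", "Street"]

# Five digits, optionally followed by a hyphen and four more digits.
ZIP_RE = re.compile(r'^\d{5}(-\d{4})?$')

def suggest_street_type(prefix, choices=STREET_TYPES):
    """Return the vetted street types that start with the typed prefix."""
    p = prefix.strip().lower()
    if not p:
        return []
    return [c for c in choices if c.lower().startswith(p)]

def check_zip(text):
    """Return None if the code is well formed, else a warning message."""
    return None if ZIP_RE.match(text.strip()) else "wrong zip code"

print(suggest_street_type("bou"))
print(check_zip("9501"))
```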

Further, a standardised format could be assigned to the features of shops or amenities, so that, for example, XYZ Motors
is tagged as a mechanic shop, not a car dealership. This part is a little difficult and could only be corrected over a long period of time.