John McGonagle  
Udacity Data Analyst Nanodegree  
Data Wrangling Final Project  
  
## 1. Problems During Data Auditing and Cleaning
I chose to study an OpenStreetMap dataset spanning Boston, MA and Cambridge, MA (the two cities are across a river from one another).
  
Although I imported all of the map data into a MongoDB database, I focused on auditing and cleaning two fields: street and housenumber.  These were originally tagged with the keys 'addr:street' and 'addr:housenumber' respectively, where 'addr' stands for 'address'.  Auditing each revealed some interesting problems that required a mix of programmatic and manual corrections.  Finally, the conversion to JSON revealed that the original code mishandled the type field for relations.

### 1A. House Numbers
I started my investigation by creating a regular expression to test for standard house numbers between 1 and 5 digits:
```python
r'^[1-9]\d{0,4}?$'
```
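As an illustration of how such an audit might run, the sketch below collects every value that fails the simple pattern. The function name `audit_housenumbers` and the sample values are assumptions made for this example, not the project's actual code.

```python
import re

# Simple pattern from above: 1 to 5 digits, no leading zero.
SIMPLE_RE = re.compile(r'^[1-9]\d{0,4}?$')

def audit_housenumbers(values):
    """Return the set of values that are not plain 1-5 digit house numbers."""
    return {v for v in values if not SIMPLE_RE.match(v)}

# Hypothetical input; in practice the values come from 'addr:housenumber' tags.
odd = audit_housenumbers(['12', '308A', '14-16', '99999', '123456', 'One'])
print(sorted(odd))
```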

This revealed the presence of many non-canonical formats for house numbers: some contained letters or fractions at the end (for multi-unit addresses), commas or semicolons (for locations with multiple, non-consecutive addresses), hyphens or colons (for locations with multiple, consecutive addresses spanning a range), or spelled-out numbers (e.g. 'Two' instead of '2').  Some examples from my dataset were:

```
{'11,11A',
 '111,111A',
 '114-124',
 '126-130',
 '14,16',
 '14-16',
 '21;23',
 '27,27 1/2',
 '308A',
 '33,35',
 '347;349',
 '35-39',
 '37R',
 '38,40',
 '92,94',
 'One'}
```

I decided that the canonical punctuation would be commas for non-consecutive lists and hyphens for consecutive lists.  This was easiest, as the vast majority of house numbers not satisfying my original regular expression used only these two forms of punctuation.  Furthermore, I decided letter suffixes would have to be capitalized and fractions would have to take the form of '1/2' (which all house numbers did anyway).  Also, no white space would be allowed in the cleaned house number, except in the case of a space separating the substring '1/2'.  Lastly, for non-consecutive lists as many as four house numbers would be permitted, which all existing house numbers satisfied.  The new regular expression was:

```python
r'^([1-9]\d{0,4}|0)([A-Z]| 1/2)?((-([1-9]\d{0,4}|0)([A-Z]| 1/2)?)?|(,([1-9]\d{0,4}|0)([A-Z]| 1/2)?){0,3})$'
```

A sample of house numbers that did not satisfy this regular expression:

```
{'112:114',
 '13B;13C',
 '16;18',
 '17/19',
 '250;252;254;256',
 '25;27;29;29A',
 '299;301',
 '2;4',
 '347;349',
 '3;5',
 '4 Suite S-1155',
 '5;7',
 '62:64',
 '8B;10;12;14',
 '9;11;13;15',
 'Building 22',
 'PO Box 846028',
 'Ten',
 'Zero'}
```

In order to remove those house numbers that could be fixed by replacing semicolons with commas and colons with hyphens, I tried replacing each non-canonical string with the aforementioned replacements and then testing those strings against the above regular expression once again.  After I did this, only the following house numbers remained:

```
{'17/19',
 '33, 33 1/2',
 '34,36a,36b',
 '4 Suite S-1155',
 '6; 8',
 '84; 86',
 'One',
 'Ten',
 'Zero'}
 
{'Building 22',
 'PO Box 846028'}
```

The first set was small enough for me to handle through manual correction (in the form of a dictionary mapping the wrong value to the corrected value), while I considered the second set to be in error, since for those elements the housenumber field did not include actual house number information (the first is an internal college campus designation, while the second is not a street address).  House numbers that were in error were thrown out of the dataset (i.e. not included in the final JSON file).  
  
The final data cleaning consisted of the following logic:
1. If the string is a well-formed house number, return the original string.
2. Otherwise, try replacing all the semicolons with commas and colons with hyphens.  If the string is now canonical, return this new string.
3. If the string is in the error set, exit and do not add the element to the final JSON file.
4. Otherwise, replace the string with the corresponding value from the manual corrections map.
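
The four steps above can be sketched as a single function. This is a minimal illustration rather than the project's actual code: `clean_housenumber`, `ERROR_SET`, and `MANUAL_FIXES` are assumed names, the correction values shown are illustrative guesses, and raising `KeyError` for a value missing from the map is a design choice of this sketch so that unaudited values are noticed.

```python
import re

# Canonical pattern from section 1A, split across lines for readability.
CANONICAL_RE = re.compile(
    r'^([1-9]\d{0,4}|0)([A-Z]| 1/2)?'
    r'((-([1-9]\d{0,4}|0)([A-Z]| 1/2)?)?'
    r'|(,([1-9]\d{0,4}|0)([A-Z]| 1/2)?){0,3})$'
)

# Values that carry no real house number (taken from the audit above).
ERROR_SET = {'Building 22', 'PO Box 846028'}

# Hypothetical manual corrections; the actual map is not shown in this report.
MANUAL_FIXES = {'One': '1', '6; 8': '6,8', '34,36a,36b': '34,36A,36B'}

def clean_housenumber(value):
    """Return the cleaned house number, or None if the value should be dropped."""
    # Step 1: already canonical.
    if CANONICAL_RE.match(value):
        return value
    # Step 2: try the punctuation swap.
    swapped = value.replace(';', ',').replace(':', '-')
    if CANONICAL_RE.match(swapped):
        return swapped
    # Step 3: known errors are dropped.
    if value in ERROR_SET:
        return None
    # Step 4: fall back to the manual map (KeyError signals an unaudited value).
    return MANUAL_FIXES[value]

for raw in ['14-16', '21;23', '112:114', 'Building 22', 'One']:
    print(repr(raw), '->', repr(clean_housenumber(raw)))
```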

### 1B. Street Names
Auditing and cleaning the street field took largely the same form.  I used the original regular expression to extract the last word, modified to allow optional trailing whitespace (since some street fields in my dataset had whitespace at the end):

```python
r'\b\S+\.?\s*$'
```
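
A street-type audit built on this pattern might look like the following sketch. The names `audit_street_types` and `EXPECTED` are assumptions, and the expected set is deliberately partial. Note that the match can include trailing whitespace, so the sketch strips it before comparing.

```python
import re
from collections import defaultdict

# Pattern from above: the last whitespace-delimited token, optional trailing spaces.
STREET_TYPE_RE = re.compile(r'\b\S+\.?\s*$')

# Partial, illustrative set of canonical street types.
EXPECTED = {'Street', 'Avenue', 'Road', 'Boulevard', 'Court'}

def audit_street_types(street_names, expected=EXPECTED):
    """Group street names by their last word when that word is not expected."""
    unexpected = defaultdict(set)
    for name in street_names:
        match = STREET_TYPE_RE.search(name)
        if match:
            street_type = match.group().strip()
            if street_type not in expected:
                unexpected[street_type].add(name)
    return dict(unexpected)

report = audit_street_types(
    ['Main Street', 'Western Ave', 'Somerville Ave. ', 'Bromfield Street #501']
)
print(report)
```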

Using the values extracted by this regular expression, I compared each to a set of common canonical street types (e.g. 'Road', 'Street', and 'Avenue').  After inspecting the values, I added any common street types not in the canonical list (e.g. 'Boulevard' and 'Court') and reinspected the values.  There were also some locale-specific values that I added to a whitelist for ignoring, the most prominent example being `'Faneuil Hall'`, a popular Boston location.  Some examples of street types not satisfying this stage were:

```
{'303': {'First Street, Suite 303'},
 '501': {'Bromfield Street #501'},
 'Ave': {'Western Ave', 'Lexington Ave', 'Somerville Ave'},
 'Ave.': {'Somerville Ave.'},
 'Center': {'Cambridge Center'},
 'Ct': {'Kelley Ct'},
 'Floor': {'Boylston Street, 5th Floor'},
 'Pkwy': {'Birmingham Pkwy'},
 'Rd': {'Abby Rd', 'Soldiers Field Rd'}}
```

Some of the preceding values were just abbreviations (with or without punctuation) of street types in the canonical street type list.  Thus, to correct these, I built a dictionary mapping abbreviations to their canonical form.  Using a regular expression, the canonical type could be substituted for its corresponding abbreviation in the dictionary.  The eventual result of this process was the map:

```
{'Ave': 'Avenue', 
 'Ave.': 'Avenue', 
 'Ct': 'Court',
 'Hwy': 'Highway', 
 'Pkwy': 'Parkway', 
 'Pl': 'Place', 
 'rd.': 'Road', 
 'Rd': 'Road',
 'Rd.': 'Road', 
 'Sq.': 'Square', 
 'st': 'Street', 
 'St': 'Street',
 'St.': 'Street',
 'ST': 'Street'}
```
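
One way to apply such a map is to pass a replacement function to `re.sub`, as in the sketch below. The function name `expand_street_type` is an assumption, and this is not necessarily how the substitution was done in the project.

```python
import re

STREET_TYPE_RE = re.compile(r'\b\S+\.?\s*$')

# Abbreviation map from above.
ABBREVIATIONS = {
    'Ave': 'Avenue', 'Ave.': 'Avenue', 'Ct': 'Court', 'Hwy': 'Highway',
    'Pkwy': 'Parkway', 'Pl': 'Place', 'rd.': 'Road', 'Rd': 'Road',
    'Rd.': 'Road', 'Sq.': 'Square', 'st': 'Street', 'St': 'Street',
    'St.': 'Street', 'ST': 'Street',
}

def expand_street_type(name):
    """Replace a trailing abbreviation with its canonical street type."""
    def replace(match):
        street_type = match.group().strip()
        # Unknown last words are returned unchanged (minus trailing whitespace).
        return ABBREVIATIONS.get(street_type, street_type)
    return STREET_TYPE_RE.sub(replace, name)

for raw in ['Western Ave', 'Somerville Ave. ', 'Main Street', 'Harvard St #12']:
    print(repr(raw), '->', repr(expand_street_type(raw)))
```

Values such as `'Harvard St #12'` pass through unchanged because their last word is not an abbreviation, which is why they still need a manual correction.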

Once this was done, I collected the street types that still did not conform:

```
{'1100': {'First Street, Suite 1100'},
 '12': {'Harvard St #12'},
 '1302': {'Cambridge Street #1302'},
 '1702': {'Franklin Street, Suite 1702'},
 '3': {'Kendall Square - 3'},
 '303': {'First Street, Suite 303'},
 '501': {'Bromfield Street #501'},
 '6': {'South Station, near Track 6'},
 '846028': {'PO Box 846028'},
 'Albany': {'Albany'},
 'Boylston': {'Boylston'},
 'Building': {'South Market Building'},
 'Cambrdige': {'Cambrdige'},
 'Corner': {'Webster Street, Coolidge Corner'},
 'Dartmouth': {'Dartmouth'},
 'Floor': {'Boylston Street, 5th Floor'},
 'Garage': {'Stillings Street Garage'},
 'Hampshire': {'Hampshire'},
 'LEVEL': {'LOMASNEY WAY, ROOF LEVEL'},
 'Lafayette': {'Avenue De Lafayette'},
 'Longwood': {'Longwood'},
 'Market': {'Faneuil Hall Market'},
 'Newbury': {'Newbury'},
 'Pasteur': {'Avenue Louis Pasteur'},
 'South': {'Charles Street South'},
 'Windsor': {'Windsor'},
 'Winsor': {'Winsor'},
 'floor': {'Sidney Street, 2nd floor', 'First Street, 18th floor'}}
```

Values in this list fell into two categories: values that could be corrected manually and values that were errors.  Errors were, like the house number errors, values that did not contain street information, such as `'South Station, near Track 6'` and `'PO Box 846028'`.  These were not included in the final JSON file.  Values that could be corrected manually largely consisted of streets missing street types, such as `'Windsor'` and `'Longwood'`.  Other values were French street names (e.g. `'Avenue De Lafayette'`) that had the street type before the street name, values including unit and floor numbers (e.g. `'Cambridge Street #1302'` and `'First Street, 18th floor'`), and values referencing local landmarks (e.g. `'Webster Street, Coolidge Corner'`).  Manual corrections were handled by a dictionary mapping the wrong value to the corrected value.  
  
A final pass over the data revealed that some street names, while having the correct (in some cases, replaced) street types, also included a series of numbers at the beginning, corresponding to the housenumber.  This information was removed by using a regular expression to substitute the offending number (and optional adjacent whitespace) with the empty string.
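
A sketch of that substitution is shown below. The exact pattern used in the project is not given here, so `LEADING_NUMBER_RE` is an assumption; the `\b` is a guard added in this sketch so that ordinal street names such as '1st Avenue' are left alone.

```python
import re

# Leading digits followed by a word boundary, plus any whitespace after them.
LEADING_NUMBER_RE = re.compile(r'^\d+\b\s*')

def strip_leading_number(street):
    """Remove a house number that was typed at the start of a street name."""
    return LEADING_NUMBER_RE.sub('', street)

for raw in ['123 Smith Street', '1st Avenue', 'Main Street']:
    print(repr(raw), '->', repr(strip_leading_number(raw)))
```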

The final data cleaning consisted of the following logic:
1. If the string is a well-formed street, do nothing.
2. Otherwise, test for the occurrence of a street type abbreviation using the abbreviation-to-canonical dictionary.  If one is found, replace the abbreviation with the canonical street type.
3. If the string is in the error set, exit and do not add the element to the final JSON file.
4. Otherwise, replace the string with the corresponding value from the manual corrections map.
5. Finally, if the string contains a number (plus optional whitespace) at the beginning of the string, replace it with an empty string `''`.

### 1C. JSON Conversion
For completeness, I decided to add the 'relation' elements to my MongoDB database.  During the data cleaning and JSON conversion process, I discovered that the original code

```python
if element.tag == 'node' or element.tag == 'way' or element.tag == 'relation':
        node['type'] = element.tag
        node['created'] = {}
        node['pos'] = [None, None]
        attrib = element.attrib
```

yielded an unexpected value, since some relation elements contain tags with `k="type"`, causing the actual type (i.e. 'relation', the value of `element.tag`) to be overwritten.  For example:

```
{
  "type": "multipolygon",
  "created": {
    "changeset": "48277481",
    "timestamp": "2017-04-30T09:54:31Z",
    "uid": "665748",
    "user": "sebastic",
    "version": "2"
  }
```

does not have type equal to `'node'`, `'way'`, or `'relation'`, as the logic above would appear to require.  The error is due to the blanket assignment statement `node[key] = value` later in the for loop over the element's inner tags, specifically when the `key` variable equals `'type'`.  
  
Changing the above code to:

```python
if element.tag == 'node' or element.tag == 'way' or element.tag == 'relation':
        node['tag'] = element.tag
        node['created'] = {}
        node['pos'] = [None, None]
        attrib = element.attrib
```

fixed the problem.
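
To illustrate why the rename helps, here is a simplified sketch of the shaping step. It is not the course's original function: `shape_element` and the `CREATED` list are assumed names, and position handling is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Attributes grouped under 'created' (assumed list for this sketch).
CREATED = ['version', 'changeset', 'timestamp', 'user', 'uid']

def shape_element(element):
    """Convert a node/way/relation element into a dict for JSON export."""
    if element.tag not in ('node', 'way', 'relation'):
        return None
    # Store the element kind under 'tag' so a k="type" tag cannot overwrite it.
    node = {'tag': element.tag, 'created': {}}
    for key, value in element.attrib.items():
        if key in CREATED:
            node['created'][key] = value
        else:
            node[key] = value
    # Blanket assignment of child tags, as in the original logic.
    for child in element.iter('tag'):
        node[child.attrib['k']] = child.attrib['v']
    return node

xml = '<relation id="1" version="2"><tag k="type" v="multipolygon"/></relation>'
doc = shape_element(ET.fromstring(xml))
print(doc)
```

In this sketch the relation keeps both pieces of information: `'tag'` holds `'relation'` and `'type'` holds `'multipolygon'`. A child tag whose key happened to be `tag` or `created` would still collide, so the rename narrows the problem rather than eliminating every possible clash.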

I also added the field `'addressstring'` to handle the rare case that a node has both an address tag (usually a string of the full address, such as `'123 Smith Street, Boston, MA 02215'`) and tags of the form 'addr:[some address subfield]'.  This preserves as much address information as possible in the case that the address subfields are incomplete compared to the address field.
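
The report does not show how this was implemented, so the following is only a guess at one possible shape. It assumes the 'addr:' subfields are collected into a nested `'address'` dictionary, which is why a plain `address` tag needs a separate key; `collect_address` is a hypothetical helper.

```python
def collect_address(tags):
    """Build address fields from a dict of OSM tag keys and values."""
    doc = {}
    for key, value in tags.items():
        if key == 'address':
            # Free-form address string, kept apart from the structured subfields.
            doc['addressstring'] = value
        elif key.startswith('addr:'):
            # Assumed layout: subfields nested under 'address'.
            doc.setdefault('address', {})[key[len('addr:'):]] = value
    return doc

example = collect_address({
    'address': '123 Smith Street, Boston, MA 02215',
    'addr:street': 'Smith Street',
    'addr:housenumber': '123',
})
print(example)
```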
  
## 2. Data Overview
After importing the JSON file into a MongoDB database, I ran multiple queries to calculate several different statistics of the dataset.
### 2A. Dataset Statistics
#### File Sizes
| Filename                  | Size   |
| ------------------------- |-------:|
| boston_cambridge.osm      | 69MB   |
| boston_cambridge_osm.json | 93MB   |
  
#### Number Of Top Level Elements
```python
osm_data.find().count()
```
```
Out: 342355
```
#### Number Of Nodes
```python
osm_data.find({'tag': 'node'}).count()
```
```
Out: 294037
```
#### Number Of Ways
```python
osm_data.find({'tag': 'way'}).count()
```
```
Out: 47703
```
#### Number Of Relations
```python
osm_data.find({'tag': 'relation'}).count()
```
```
Out: 615
```
#### Number Of Distinct Users
```python
len(osm_data.find().distinct('created.user'))
```
```
Out: 885
```
### 2B. Other Interesting Statistics
#### Top 20 Most Popular Types Of Nodes 
```python
pipeline = [{'$match': {'tag': 'node', 'type': {'$exists': True}}}, 
            {'$group': {'_id': '$type', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 20}]

pprint.pprint(list(osm_data.aggregate(pipeline)))
```
```
Out: [{'_id': 'Special', 'count': 27},
      {'_id': 'Academic', 'count': 19},
      {'_id': 'Public', 'count': 16},
      {'_id': 'Private', 'count': 14},
      {'_id': 'School', 'count': 11},
      {'_id': 'Charter', 'count': 3},
      {'_id': 'Special-Law', 'count': 3},
      {'_id': 'Approved Special Education', 'count': 2},
      {'_id': 'County', 'count': 2},
      {'_id': 'Special-Institutional', 'count': 2},
      {'_id': 'Special-Medical', 'count': 2},
      {'_id': 'Collaborative Program', 'count': 1},
      {'_id': 'private', 'count': 1},
      {'_id': 'video', 'count': 1}]
```
#### Top 20 Most Popular Amenities
```python
pipeline = [{'$match': {'tag': 'node', 'amenity': {'$exists': True}}}, 
            {'$group': {'_id': '$amenity', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 20}]

pprint.pprint(list(osm_data.aggregate(pipeline)))
```
```
Out: [{'_id': 'restaurant', 'count': 470},
      {'_id': 'bench', 'count': 415},
      {'_id': 'bicycle_parking', 'count': 202},
      {'_id': 'cafe', 'count': 184},
      {'_id': 'library', 'count': 140},
      {'_id': 'fast_food', 'count': 114},
      {'_id': 'school', 'count': 107},
      {'_id': 'bicycle_rental', 'count': 97},
      {'_id': 'place_of_worship', 'count': 92},
      {'_id': 'fountain', 'count': 63},
      {'_id': 'post_box', 'count': 62},
      {'_id': 'waste_basket', 'count': 62},
      {'_id': 'bank', 'count': 59},
      {'_id': 'bar', 'count': 51},
      {'_id': 'pub', 'count': 46},
      {'_id': 'atm', 'count': 36},
      {'_id': 'pharmacy', 'count': 35},
      {'_id': 'parking', 'count': 28},
      {'_id': 'drinking_water', 'count': 27},
      {'_id': 'bicycle_repair_station', 'count': 24}]
```
#### Banks With The Most ATMs
```python
pipeline = [{'$match': {'tag': 'node', 'amenity': 'bank', 'atm': 'yes'}}, 
            {'$group': {'_id': '$name', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 20}]

pprint.pprint(list(osm_data.aggregate(pipeline)))
```
```
Out: [{'_id': 'Bank of America', 'count': 6},
      {'_id': 'Citizens Bank', 'count': 5},
      {'_id': 'Santander', 'count': 3},
      {'_id': 'Cambridge Trust Company', 'count': 2},
      {'_id': 'Citibank', 'count': 2},
      {'_id': 'Eastern Bank', 'count': 2},
      {'_id': 'Admirals Bank', 'count': 1},
      {'_id': 'Cambridge Savings Bank', 'count': 1},
      {'_id': 'Cambridge Trust', 'count': 1},
      {'_id': "Citizen's Bank", 'count': 1},
      {'_id': 'MIT Federal Credit Union', 'count': 1},
      {'_id': "People's United Bank", 'count': 1},
      {'_id': 'TD Bank', 'count': 1}]
```
#### Top 20 Most Popular Cuisines
```python
pipeline = [{'$match': {'tag': 'node', 'cuisine': {'$exists': True}}}, 
            {'$group': {'_id': '$cuisine', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 20}]

pprint.pprint(list(osm_data.aggregate(pipeline)))
```
```
Out: [{'_id': 'coffee_shop', 'count': 49},
      {'_id': 'pizza', 'count': 42},
      {'_id': 'mexican', 'count': 34},
      {'_id': 'sandwich', 'count': 32},
      {'_id': 'american', 'count': 29},
      {'_id': 'italian', 'count': 22},
      {'_id': 'chinese', 'count': 19},
      {'_id': 'burger', 'count': 17},
      {'_id': 'asian', 'count': 15},
      {'_id': 'donut', 'count': 15},
      {'_id': 'thai', 'count': 15},
      {'_id': 'indian', 'count': 14},
      {'_id': 'ice_cream', 'count': 13},
      {'_id': 'japanese', 'count': 13},
      {'_id': 'international', 'count': 6},
      {'_id': 'regional', 'count': 6},
      {'_id': 'seafood', 'count': 6},
      {'_id': 'sushi', 'count': 6},
      {'_id': 'mediterranean', 'count': 5},
      {'_id': 'french', 'count': 4}]
```
#### Top 20 Users With The Most Created Elements
```python
pipeline = [{'$group': {'_id': '$created.user', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 20}]

pprint.pprint(list(osm_data.aggregate(pipeline)))
```
```
Out: [{'_id': 'crschmidt', 'count': 144099},
      {'_id': 'ryebread', 'count': 34851},
      {'_id': 'wambag', 'count': 32600},
      {'_id': 'jremillard-massgis', 'count': 30442},
      {'_id': 'mapper999', 'count': 12981},
      {'_id': 'morganwahl', 'count': 12871},
      {'_id': 'OceanVortex', 'count': 7705},
      {'_id': 'MassGIS Import', 'count': 4059},
      {'_id': 'JasonWoof', 'count': 3907},
      {'_id': 'JessAk71', 'count': 3655},
      {'_id': 'Utible', 'count': 2857},
      {'_id': 'Shannon Kelly', 'count': 1935},
      {'_id': 'Alexey Lukin', 'count': 1893},
      {'_id': 'Ahlzen', 'count': 1814},
      {'_id': 'cspanring', 'count': 1796},
      {'_id': 'fiveisalive', 'count': 1661},
      {'_id': 'probiscus', 'count': 1414},
      {'_id': 'phyzome', 'count': 1325},
      {'_id': 'synack', 'count': 1322},
      {'_id': 'Extant', 'count': 1164}]
```
#### Histogram Of Number Of Edits
```python
pipeline = [{'$group': {'_id': '$created.version', 'count': {'$sum': 1}}}, 
            {'$sort': {'count': -1, '_id': 1}}, 
            {'$limit': 50}]

version_dist = list(osm_data.aggregate(pipeline))
hist = [(vd['_id'], vd['count']) for vd in version_dist]
```
```python
import matplotlib.pyplot as plt
%matplotlib inline

# take log log plot of histogram, roughly straight line implies power law
plt.plot(*zip(*hist), 'o')
plt.xscale('log')
plt.yscale('log')
plt.title('Power Law Distribution Of \nNumber Of Edits By Count')
plt.xlabel('Number of Edits')
plt.ylabel('Count');
```
![Log Log Plot of Edits by Count](../images/version_histogram.png)
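
A rough way to quantify the straight-line claim is to fit a least-squares line to the log-log points and read off the slope, which estimates the power-law exponent. The sketch below uses synthetic data, not the project's results, and `loglog_slope` is an assumed helper name.

```python
import math

def loglog_slope(points):
    """Least-squares slope of log10(count) against log10(version)."""
    xs = [math.log10(x) for x, _ in points]
    ys = [math.log10(y) for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Synthetic data following count = 1000 * version**-2 exactly.
synthetic = [(v, 1000 * v ** -2) for v in range(1, 11)]
slope = loglog_slope(synthetic)
print(round(slope, 6))
```

On real data, `loglog_slope(hist)` could be applied to the `(version, count)` pairs built above, provided every version and count is a positive number.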
#### Most Edited Elements
```python
max_version = osm_data.find_one(sort=[('created.version', -1)])['created']['version']

pprint.pprint(list(osm_data.find({'created.version': max_version}, ['name', 'tag', 'created.version'])))
```
```
Out: [{'_id': ObjectId('59bceb1c9d250b2eb76c0164'),
       'created': {'version': 463},
       'name': 'United States of America',
       'tag': 'relation'}]
```
Since Boston/Cambridge is on the ocean, the map extract touches the United States border and therefore contains the United States boundary relation.  Because this relation defines the outer edge of the entire United States, it is most likely edited by many members working on local sections along the US border.  This explains the high number of edits.  A similar phenomenon should be seen in state and city boundaries as well.  It would not be too far-fetched to propose that the number of edits for a boundary relation is proportional to the boundary's length.
  
## 3. Potential Improvements

### 3A. Data Gathering Process

Since most smartphones these days are GPS-enabled and machine vision has reached a point where scalable image recognition is possible, I imagine an automated crowdsourcing technique could be used to increase the amount and reliability of location tags.  While it is not possible, using only the given dataset, to verify the accuracy of tag information, it is easy to see that existing tags are inconsistent between locations of a similar type (e.g. restaurants or cafes). For example, the following query

```python
from bson.code import Code

# define javascript map function
# basically an unwind on each document's dictionary keys/fields
map_function = Code("""
                    function() {
                        for (var tag_name in this) {
                            emit(tag_name, 1);
                        }
                    }
                    """)

# define javascript reduce function
# count the number of each tag
reduce_function = Code("""
                       function(tag_name, tag_count) {
                           return Array.sum(tag_count);
                       }
                       """)

# perform the map reduce and name the resulting collection 'cafe_tags_pct'
# use the query argument to filter out all documents that aren't cafes before performing the map reduce
osm_data.map_reduce(map_function, 
                    reduce_function, 
                    out='cafe_tags_pct',
                    query={'amenity': 'cafe'})
                          
ctp = db.cafe_tags_pct

# used for calculating the percentage
num_cafes = ctp.find_one({'_id': '_id'})['value']

# 1. Take out known common tags ('_id' and 'amenity', by the above map reduce query operation)
# 2. Calculate the percentage as the count of tags divided by the total number of cafe documents
# 3. Sort by percentage
# 4. Format to remove excessive precision and add '%' symbol
pipeline = [{'$match': {'_id': {'$nin': ['_id', 'amenity']}}}, 
            {'$project': {'pct': {'$multiply': [{'$divide': ['$value', num_cafes]}, 100]}}}, 
            {'$sort': {'pct': -1, '_id': 1}}, 
            {'$project': {'pct': {'$concat': [{'$substr': ['$pct', 0, 5]}, '%']}}}]

pprint.pprint(list(ctp.aggregate(pipeline)))
```
```
Out: [{'_id': 'created', 'pct': '100%'},
      {'_id': 'id', 'pct': '100%'},
      {'_id': 'tag', 'pct': '100%'},
      {'_id': 'name', 'pct': '98.42%'},
      {'_id': 'pos', 'pct': '96.84%'},
      {'_id': 'cuisine', 'pct': '48.42%'},
      {'_id': 'address', 'pct': '35.26%'},
      {'_id': 'opening_hours', 'pct': '14.21%'},
      {'_id': 'website', 'pct': '12.63%'},
      {'_id': 'phone', 'pct': '10%'},
      {'_id': 'internet_access', 'pct': '8.947%'},
      {'_id': 'smoking', 'pct': '5.263%'},
      {'_id': 'takeaway', 'pct': '5.263%'},
      {'_id': 'source', 'pct': '4.736%'},
      {'_id': 'wheelchair', 'pct': '4.736%'},
      {'_id': 'wifi', 'pct': '4.736%'},
      {'_id': 'building', 'pct': '3.157%'},
      {'_id': 'node_refs', 'pct': '3.157%'},
      {'_id': 'outdoor_seating', 'pct': '2.631%'},
      {'_id': 'toilets', 'pct': '1.578%'},
      {'_id': 'wikidata', 'pct': '1.578%'},
      {'_id': 'designation', 'pct': '1.052%'},
      {'_id': 'diaper', 'pct': '1.052%'},
      {'_id': 'drive_through', 'pct': '1.052%'},
      {'_id': 'email', 'pct': '1.052%'},
      {'_id': 'operator', 'pct': '1.052%'},
      {'_id': 'shop', 'pct': '1.052%'},
      {'_id': 'brand', 'pct': '0.526%'},
      {'_id': 'coffee', 'pct': '0.526%'},
      {'_id': 'created_by', 'pct': '0.526%'},
      {'_id': 'drinking_water', 'pct': '0.526%'},
      {'_id': 'level', 'pct': '0.526%'},
      {'_id': 'note', 'pct': '0.526%'},
      {'_id': 'wikipedia', 'pct': '0.526%'}]
``` 

reveals that most tags apply only to a fraction of the existing locations in a particular category (cafes in this example).  This reduces the usefulness of doing search queries against such data, as sampling errors (whether due to small samples for a given tag or large samples with selection bias) hamper the ability to extract useful statistics.
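
The same coverage figures can also be computed client-side without map-reduce, which may be useful because newer PyMongo releases no longer provide `map_reduce`. The sketch below is an alternative technique, not the query used above; `tag_coverage` is an assumed name and the sample documents are made up.

```python
from collections import Counter

def tag_coverage(documents):
    """Return (field, percent of documents having it), most common first."""
    docs = list(documents)
    total = len(docs)
    if total == 0:
        return []
    # Count how many documents contain each top-level field.
    counts = Counter(key for doc in docs for key in doc)
    return sorted(
        ((key, 100.0 * n / total) for key, n in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )

# Made-up documents standing in for the cafe query results.
cafes = [
    {'amenity': 'cafe', 'name': 'A', 'cuisine': 'coffee_shop'},
    {'amenity': 'cafe', 'name': 'B'},
    {'amenity': 'cafe'},
    {'amenity': 'cafe', 'name': 'D', 'wifi': 'yes'},
]
coverage = tag_coverage(cafes)
print(coverage)
```

Against the database, the input would be the cursor returned by a query such as `osm_data.find({'amenity': 'cafe'})`.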
  
Applying machine vision recognition techniques to satellite imagery, one could identify likely buildings, landmarks, and points-of-interest.  Then, a smartphone app could detect when a user is near untagged or under-tagged locations and send the user a notification asking them to answer a question.  This question would either ask the user to confirm a location's tag information or to supply new tag information, such as the type of amenity or the name of a restaurant.  Location-time distributions could also be used to detect areas of interest, as people tend to spend more time at a location than they do just traveling between locations (i.e. areas of low motion, or change in position over time, may reflect a location of interest).  This could be combined with machine vision techniques to make sure that question answering is targeted towards the most popular areas (since users' time and/or willingness to answer questions is an exhaustible resource).  In the case that machine vision techniques are too expensive or there is a lack of organizational expertise, out-of-the-box clustering algorithms applied to location-time data (gathered by smartphone GPS) would provide an approachable way for non-domain experts to extract points-of-interest.
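
As a very rough illustration of the low-motion idea, the sketch below buckets GPS samples into grid cells and sums the time spent in each cell. This is simple grid bucketing, not a true clustering algorithm, and everything in it (the function name, the track format, the cell size) is an assumption made for the example.

```python
from collections import defaultdict

def dwell_time_by_cell(track, cell_size=0.001):
    """Sum seconds spent per grid cell.

    track: list of (timestamp_seconds, lat, lon) tuples sorted by time.
    cell_size: cell width in degrees (0.001 is roughly 100 m of latitude).
    """
    dwell = defaultdict(float)
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(track, track[1:]):
        cell0 = (round(lat0 / cell_size), round(lon0 / cell_size))
        cell1 = (round(lat1 / cell_size), round(lon1 / cell_size))
        # Only count intervals where the user stayed inside one cell.
        if cell0 == cell1:
            dwell[cell0] += t1 - t0
    return dict(dwell)

# Made-up track: a short stop, a move, then a long stop.
track = [
    (0, 42.3601, -71.0589),
    (60, 42.3601, -71.0589),
    (120, 42.3700, -71.0500),
    (600, 42.3700, -71.0500),
    (1500, 42.3700, -71.0500),
]
dwell = dwell_time_by_cell(track)
busiest = max(dwell, key=dwell.get)
print(busiest, dwell[busiest])
```

Cells with the largest dwell times would be the first candidates for a tagging question.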
  
### 3B. Pros and Cons

Proactive, crowdsourced information gathering would allow OSM to leverage the power of the crowd as well as become more responsive over time, in the case that tag information changes (e.g. when a building's occupant changes).  Since some areas are under rapid redevelopment, tag information can quickly become stale, reducing its usefulness to consumers.  Currently, very few users are generating the majority of the OSM data (i.e. the Pareto Principle), likely due to the lack of an intuitive interface and locality in space (i.e. gathering data when it is most available and accurate).  A Q&A based, location-aware service would ameliorate these two issues.

The main drawback of this approach is getting users to actually submit accurate data upon request.  Many users may be too busy or not care enough about the OSM mission to supply answers (or even worse, inaccurate answers).  One way to address the latter issue may be to match a user's answers against the answers of other users, to see how reliable a person's answers are (i.e. how well they correspond to the 'wisdom of the crowd').  The former problem could be alleviated by leveraging existing social media platforms to publicize a person's contributions.  While one may not expect virality from such an approach, using social media may allow such an app to reach the critical mass needed to sustain the information gathering process.  If enough users are contributing, the app could even create its own internal social network or community to motivate users to contribute more and work together on larger projects (e.g. Wikipedia has a community of contributors known as the Wikimedia movement).
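
The answer-matching idea could be prototyped by scoring each user on how often they agree with the majority answer to a question. The sketch below is one simple way to do that under made-up data; `user_reliability` and the answer tuples are hypothetical, and ties in the majority vote are not handled specially.

```python
from collections import Counter, defaultdict

def user_reliability(answers):
    """Return the fraction of each user's answers that match the majority.

    answers: list of (user, question_id, answer) tuples.
    """
    by_question = defaultdict(list)
    for _, question_id, answer in answers:
        by_question[question_id].append(answer)
    # Majority answer per question stands in for the 'wisdom of the crowd'.
    consensus = {
        question_id: Counter(values).most_common(1)[0][0]
        for question_id, values in by_question.items()
    }
    agree = Counter()
    total = Counter()
    for user, question_id, answer in answers:
        total[user] += 1
        agree[user] += int(answer == consensus[question_id])
    return {user: agree[user] / total[user] for user in total}

# Made-up answers from three hypothetical users.
answers = [
    ('ann', 'q1', 'cafe'), ('bob', 'q1', 'cafe'), ('cat', 'q1', 'bank'),
    ('ann', 'q2', 'pizza'), ('bob', 'q2', 'pizza'), ('cat', 'q2', 'pizza'),
]
scores = user_reliability(answers)
print(scores)
```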