# DATA WRANGLING WITH MONGODB
## OpenStreetMap Project
## The Greater Seattle Area
The goal of this project is to assess the quality of a map dataset in terms of validity, accuracy, completeness, and uniformity. I chose the Greater Seattle Area map for this data wrangling project because I live there.

The Greater Seattle area is home to many of the world's top companies, such as Amazon, Boeing, Microsoft, and Starbucks Coffee. It is a rapidly growing region of the United States, and the rise of tech-related jobs attracts talent from across the globe. With such growth comes a great deal of data about the area's many exciting places.

Using the OpenStreetMap dataset, a data wrangling task is performed to check whether the data stored there is intact; that is, whether values are missing and whether the data is accurate.


# Problems with the Dataset
# Street Name Issues
One of the problems I came across with the map dataset is that some street names are abbreviated and therefore corrections are needed to format them into full street names. For example, *'S'* must be changed to *'South'* and *'Ave'* to *'Avenue'*. 

The first step in fixing the abbreviated street names is a regular expression that matches the last word of a street name, which is usually the street type.

The regular expression below is used to find the part of the name that needs to change.

***
`street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)`
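
To make the role of the regular expression concrete, here is a minimal sketch of how it could be used to audit street names. The `EXPECTED` set and the helper names `last_word` and `audit_street_types` are assumptions made for illustration; they are not taken from the project code.

```python
import re
from collections import defaultdict

street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Hypothetical set of last words that are considered already correct.
EXPECTED = {"Street", "Avenue", "Boulevard", "Drive", "Road", "Way", "Place", "Lane", "Court"}


def last_word(street_name):
    """Return the last word of a street name, or None if nothing matches."""
    match = street_type_re.search(street_name)
    return match.group() if match else None


def audit_street_types(street_names, expected=EXPECTED):
    """Group street names by their last word when that word is not expected."""
    unexpected = defaultdict(set)
    for name in street_names:
        word = last_word(name)
        if word is not None and word not in expected:
            unexpected[word].add(name)
    return dict(unexpected)


sample = ["72nd Ave S", "S 2nd St", "Main Street"]
print(audit_street_types(sample))
```

Because the pattern is anchored at the end of the string, it only ever sees the final word. An abbreviation earlier in the name, such as a leading `S`, is not reported by this kind of audit.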





While the majority of the abbreviated street names were successfully changed to full street names, a problem arose with names that contain more than one abbreviation. Because the regular expression matches only the last word, an abbreviation elsewhere in the name, for example a leading 'S' for 'South', was not correctly formatted. The examples below show the problem.

* *72nd Ave S => 72nd Ave South*
* *Wells Ave S => Wells Ave South*
* *Williams Ave S => Williams Ave South*
* *Burnett Ave S => Burnett Ave South*
* *Park Ave N => Park Ave North*
* *S 2nd St => S 2nd Street*
* *S 212th St => S 212th Street*

To overcome this obstacle, the `update_name` function was used. It checks every word of the street name against the mapping, not only the last one.

    def update_name(name, mapping):
        """Return name with every abbreviated word replaced by its full form."""
        # Look up each space-separated word in mapping; words without an
        # entry are kept unchanged.
        words = [mapping.get(word, word) for word in name.split(" ")]
        return " ".join(words)

The result of running the above function is shown below:
* *72nd Ave S => 72nd Avenue South*
* *Wells Ave S => Wells Avenue South*
* *Williams Ave S => Williams Avenue South*
* *Burnett Ave S => Burnett Avenue South*
* *Park Ave N => Park Avenue North*
* *S 2nd St => South 2nd Street*
* *S 212th St => South 212th Street*
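
The mapping passed to `update_name` is not shown in this report. The sketch below uses a hypothetical mapping, and a compact version of the function, to reproduce a few of the results listed above. The actual mapping used in the project may contain more entries.

```python
# Hypothetical mapping; the project's actual table may contain more entries.
mapping = {
    "St": "Street",
    "St.": "Street",
    "Ave": "Avenue",
    "Ave.": "Avenue",
    "N": "North",
    "S": "South",
    "E": "East",
    "W": "West",
}


def update_name(name, mapping):
    """Compact sketch of the word-by-word replacement described above."""
    return " ".join(mapping.get(word, word) for word in name.split(" "))


for street in ["72nd Ave S", "Park Ave N", "S 2nd St"]:
    print(street, "=>", update_name(street, mapping))
```

One caveat of replacing every word: with single-letter keys such as `E`, a street that is literally named "E Street" would become "East Street", so the output is worth spot-checking.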

# Overview of the Data
After the arduous process of wrangling the dataset, MongoDB was used to analyze it statistically. The questions of interest in this section are answered with MongoDB query methods.

The file sizes below, however, were obtained without any database query, after converting the XML file to JSON format.
* *seattle.osm 169.6 MB*
* *seattle.osm.json 184.9 MB*
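
File sizes like these can be read directly from the file system. The following is a minimal sketch, assuming that a megabyte means 10**6 bytes; the report does not say which convention was used, and the helper name `file_size_mb` is made up for illustration.

```python
import os


def file_size_mb(path):
    """Return the size of the file at path in megabytes (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 10**6


# Hypothetical usage; the file names are the ones listed above and are
# skipped if they are not present in the working directory.
for filename in ["seattle.osm", "seattle.osm.json"]:
    if os.path.exists(filename):
        print(f"{filename} {file_size_mb(filename):.1f} MB")
```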

### Size of the dataset
To find how many elements the selected map area contains, the MongoDB *count()* method is applied.
*collection.find().count()* returns the number of documents in the database collection, which is the number of map elements loaded from the file rather than the file's size in bytes.

`Number of documents: 834061`

### The number of unique users
The number of unique users who contributed to the dataset is obtained with this query:
***
`len(collection.distinct('created.user'))`
***
The result is

***
**`925`**
***

### The number of nodes and ways
To count the elements of one kind, filter on the `type` field. The query below counts the ways; replacing `'way'` with `'node'` counts the nodes.
***
`collection.find({'type':'way'}).count()`
***

`Ways: 111,469`
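
A note for readers on newer drivers: `Cursor.count()` was deprecated in PyMongo 3.7 and removed in PyMongo 4.0, where `Collection.count_documents()` is the replacement. The sketch below shows how the counts in this section might be written with it. The helper name `count_elements` is an assumption, and `collection` is assumed to be a PyMongo `Collection`.

```python
def count_elements(collection, element_type=None):
    """Count documents in collection, optionally restricted to one 'type' value.

    collection is assumed to be a pymongo Collection (PyMongo 3.7 or later).
    """
    # An empty filter counts every document in the collection.
    query = {} if element_type is None else {"type": element_type}
    return collection.count_documents(query)


# Hypothetical usage with the collection used in this report:
# total = count_elements(collection)
# ways = count_elements(collection, "way")
# nodes = count_elements(collection, "node")
```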

### The Top 5 Banks
Knowing which banks have the most branches in the area helps existing and new customers choose a bank with a branch nearby and avoid a long commute.

***
`top_5_banks = list(collection.aggregate([{'$match': {'amenity': 'bank'}},
                                {'$group': {'_id': '$name',
                                            'count': {'$sum': 1}}},
                                {'$sort': {'count': -1}},
                                {'$limit': 5}]))
print(top_5_banks)`
***                            
                                
`[{'_id': 'Chase', 'count': 7},
 {'_id': 'Bank of America', 'count': 6},
 {'_id': 'Wells Fargo', 'count': 6},
 {'_id': 'KeyBank', 'count': 5},
 {'_id': 'U.S. Bank', 'count': 5}]`
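
To illustrate what the four pipeline stages compute, here is a rough pure-Python emulation that runs on a plain list of dictionaries. It is a sketch for understanding only, not the code used in the project, and the sample documents are made up rather than taken from the dataset.

```python
from collections import Counter


def top_names(documents, amenity, limit=5):
    """Emulate $match, $group, $sort and $limit on a list of dicts."""
    # $match: keep only documents with the requested amenity.
    matched = [doc for doc in documents if doc.get("amenity") == amenity]
    # $group with $sum: 1 counts documents per name; a missing name groups under None.
    counts = Counter(doc.get("name") for doc in matched)
    # $sort by count in descending order, then $limit.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [{"_id": name, "count": count} for name, count in ranked[:limit]]


# Made-up documents for illustration only, not taken from the dataset.
sample = [
    {"amenity": "bank", "name": "Bank A"},
    {"amenity": "bank", "name": "Bank B"},
    {"amenity": "bank", "name": "Bank A"},
    {"amenity": "cafe", "name": "Cafe C"},
]
print(top_names(sample, "bank"))
```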
 
### Most Popular Cafe
Since this region is home to many of the world's renowned technology and other companies, it is interesting to find out how many Starbucks locations there are and who its closest competitors are.

***
`
top_cafes = list(collection.aggregate([{'$match': {'amenity': 'cafe'}}, 
                                {'$group': {'_id': '$name', 
                                            'count': {'$sum': 1}}}, 
                                {'$sort': {'count': -1}}, 
                                {'$limit': 5}]))
print(top_cafes) `
***
***
`[{'_id': 'Starbucks', 'count': 33},
 {'_id': None, 'count': 8},
 {'_id': 'BigFoot Java', 'count': 4},
 {'_id': 'LadyBug Bikini Espresso', 'count': 3},
 {'_id': 'Mighty Mugs Coffee', 'count': 2}]`
 ***
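
The `None` entry in the result groups the cafes that have no `name` value. One possible way to leave them out, sketched below, is to add a `name` condition to the `$match` stage. The helper `top_names_pipeline` is hypothetical and simply builds the same pipeline used throughout this section.

```python
def top_names_pipeline(amenity, limit=5, named_only=False):
    """Build the aggregation pipeline used for the rankings in this section."""
    match = {"amenity": amenity}
    if named_only:
        # {'$ne': None} excludes documents whose name is missing or null.
        match["name"] = {"$ne": None}
    return [
        {"$match": match},
        {"$group": {"_id": "$name", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


# Hypothetical usage with the collection from this report:
# top_cafes = list(collection.aggregate(top_names_pipeline("cafe", named_only=True)))
```

The same helper could also build the bank, restaurant, and fast-food rankings by changing the `amenity` argument.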
 
### Top 5 Restaurants

The restaurants with the most locations in the area are listed below. It is surprising that Pizza Hut has more locations than any other restaurant.
***
`
restaurants = list(collection.aggregate([{'$match': {'amenity': 'restaurant'}},
                                {'$group': {'_id': '$name',
                                            'count': {'$sum': 1}}},
                                {'$sort': {'count': -1}},
                                {'$limit': 5}]))`
***
***
`
[{'_id': 'Pizza Hut', 'count': 6},
 {'_id': "Denny's", 'count': 4},
 {'_id': 'IHOP', 'count': 3},
 {'_id': "Applebee's", 'count': 2},
 {'_id': 'Tortas Locas', 'count': 2}]`
***

### Top 5 Fast-Food Places

***
`
list(collection.aggregate([{'$match': {'amenity': 'fast_food'}},
                                {'$group': {'_id': '$name',
                                            'count': {'$sum': 1}}},
                                {'$sort': {'count': -1}},
                                {'$limit': 5}]))`

***
`
[{'_id': 'Subway', 'count': 21},
 {'_id': "McDonald's", 'count': 12},
 {'_id': 'Jack in the Box', 'count': 10},
 {'_id': 'Taco Time', 'count': 6},
 {'_id': "Wendy's", 'count': 6}]
`

# Other Ideas About The Dataset

The majority of the fields in the map dataset are correctly formatted, but there is room for improvement in the fields that are missing values. For example, further wrangling shows that the city is missing for some businesses and organizations, and the address portion of the data is mostly blank.

The query below selects entries whose phone number starts with 253 and collects their phone numbers and cities. A sample of its output follows the query:



***
`
collection.aggregate([{'$match': {'phone': {'$regex': '^253'}}},
                       {'$group': {'_id': '$name', 'phone': {'$push': '$phone'}, 'address': {'$push': '$address.city'},
                                   'count': {'$sum': 1}}},
                       {'$sort': {'count': -1}}, {'$limit': 10}])`

***

```
[{'_id': 'Afford-A-Vet Animal Clinic',
  'phone': ['253-859-8387'],
  'address': [],
  'count': 1},
  
 {'_id': 'WinCo Foods', 
  'phone': ['253-850-8818'], 
  'address': [],
  'count': 1},
  
 {'_id': 'Grace Fellowship of Kent',
  'phone': ['253-854-4248'],
  'address': [],
  'count': 1},
 
 {'_id': 'New Hope Presbyterian Church',
  'phone': ['253-859-8998'],
  'address': [],
  'count': 1},
```

***
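
The empty `address` lists above suggest a simple completeness measure. As a minimal sketch, the hypothetical helper below computes the share of documents without an `address.city` value. It assumes each document is a dictionary whose `address` field, when present, is itself a dictionary.

```python
def missing_city_share(documents):
    """Return the fraction of documents that lack an address.city value."""
    total = 0
    missing = 0
    for doc in documents:
        total += 1
        # Treat a missing or empty address the same as an address without a city.
        address = doc.get("address") or {}
        if not address.get("city"):
            missing += 1
    return missing / total if total else 0.0


# Hypothetical usage with the collection from this report:
# share = missing_city_share(collection.find({'phone': {'$regex': '^253'}}))
```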

## Conclusion

After wrangling the Greater Seattle area map dataset, I found that the contributors have mostly done a great job of minimizing errors. I believe that, as technology advances, data-entry errors will be detected and corrected quickly by auto-correcting data-entry forms built on supervised learning algorithms that cross-reference the data before it is officially submitted.

**Benefits** 

One advantage of auto-correcting data-entry forms is that contributors do not have to spend time fixing the problems themselves. Another benefit is that such forms lower the barrier to entry for those who want to join an open-source community and volunteer their time to make the map dataset well documented, valid, and error-free.

**Anticipated problems**

The overarching challenge facing any open-source platform is the volume of data submitted. In the case of the OSM dataset, there are likely duplicate entries that cannot be detected in time until a verification process is in place. Duplicate entries can cause previous data to be overwritten and therefore contribute to missing values. For example, as documented in the database query section, some businesses and organizations are missing the cities they operate in. Another problem facing the OSM dataset is the integrity of the data and whether the volunteers are qualified to contribute.
