# Programming Project - Unit 2
*by Débora Azevedo, Eliseu Jayro, Francisco de Paiva and Igor Brandão*

**Goals**
The purpose of this project is to explore the following:

- Choropleth maps

## 1. Introduction

This notebook is organized as follows. Section 2 presents the GeoJSON file that we are going to use. Section 3 gives a brief explanation of the Uber API. Section 4 shows how the wait-time dataset was generated and describes the other APIs we tested. Section 5 shows the choropleth maps generated and offers some theories about the findings.

In [4]:
!pip install Shapely-1.6.4.post1-cp37-cp37m-win32.whl

Shapely-1.6.4.post1-cp37-cp37m-win32.whl is not a supported wheel on this platform.
You are using pip version 10.0.1, however version 18.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.


In [None]:
### Library necessary to run this IPython Notebook
!pip install shapely
!pip install tqdm
!pip install tabulate
!pip install pandas-datareader
!pip install requests

In [7]:
import os
import folium
import json
import pandas as pd
from branca.colormap import linear
import numpy as np
from shapely.geometry import Polygon
from shapely.geometry import Point
from numpy import random

import csv
import datetime as dt

ModuleNotFoundError: No module named 'Shapely'

## 2. GeoJSON - Neighborhoods of Natal - RN

To draw the choropleth map, the first thing we need is the GeoJSON file of the area to be analyzed. To get it, we used [Overpass turbo](http://overpass-turbo.eu/) and queried the neighborhoods of Natal - RN with the code shown below. The site allows the output to be downloaded as a GeoJSON file. More details about the Overpass turbo project can be found on its [wiki](http://wiki.openstreetmap.org/wiki/Overpass_turbo) page.

>```python
[out:json][timeout:25];
{{geocodeArea:Natal RN Brasil}}->.searchArea;
(
  relation["admin_level"="10"](area.searchArea);
);
out body;
>;
out skel qt;
```

The code below simply imports the GeoJSON file and prints some useful information.

In [6]:
# import geojson file about natal neighborhoods
natal_neigh = os.path.join('geojson', 'natal.geojson')

# load the data using UTF-8 encoding (the with-block closes the file afterwards)
with open(natal_neigh, encoding='UTF-8') as f:
    geo_json_natal = json.load(f)

In [7]:
# print the keys of the dictionary
print(geo_json_natal.keys())
# print the list of features (neighborhoods)
geo_json_natal['features']

dict_keys(['type', 'generator', 'copyright', 'timestamp', 'features'])


[{'type': 'Feature',
  'properties': {'@id': 'relation/388146',
   'admin_level': '10',
   'boundary': 'administrative',
   'is_in': 'Natal',
   'name': 'Pitimbu',
   'place': 'suburb',
   'type': 'boundary'},
  'geometry': {'type': 'Polygon',
   'coordinates': [[[-35.2251535, -5.8800875],
     [-35.2245789, -5.8789859],
     [-35.2235407, -5.8773961],
     [-35.2216713, -5.8748329],
     [-35.219967, -5.8725269],
     [-35.219495, -5.8717499],
     [-35.2183771, -5.8693635],
     [-35.2158321, -5.8640165],
     [-35.2159318, -5.8639635],
     [-35.2160512, -5.8638541],
     [-35.2207751, -5.8610102],
     [-35.226799, -5.857474],
     [-35.2287216, -5.8563299],
     [-35.2288872, -5.8562443],
     [-35.2292113, -5.8560767],
     [-35.2293013, -5.8560301],
     [-35.2316996, -5.854504],
     [-35.2330707, -5.8537719],
     [-35.2333137, -5.8512412],
     [-35.2351934, -5.85008],
     [-35.2363521, -5.8486285],
     [-35.2382532, -5.846884],
     [-35.2386823, -5.8464656],
     [-35.239

In [8]:
neighborhood = []
# list all neighborhoods
for neigh in geo_json_natal['features']:
        neighborhood.append(neigh['properties']['name'])

In [9]:
# print the number of neighborhoods
len(neighborhood)

36

In [10]:
# print all neighborhoods
neighborhood

['Pitimbu',
 'Planalto',
 'Ponta Negra',
 'Neópolis',
 'Capim Macio',
 'Lagoa Azul',
 'Pajuçara',
 'Lagoa Seca',
 'Barro Vermelho',
 'Candelária',
 'Praia do Meio',
 'Rocas',
 'Santos Reis',
 'Redinha',
 'Salinas',
 'Igapó',
 'Nossa Senhora da Apresentação',
 'Potengi',
 'Ribeira',
 'Cidade Alta',
 'Alecrim',
 'Nordeste',
 'Quintas',
 'Bom Pastor',
 'Dix-Sept Rosado',
 'Nossa Senhora de Nazaré',
 'Lagoa Nova',
 'Mãe Luiza',
 'Nova Descoberta',
 'Tirol',
 'Petrópolis',
 'Areia Preta',
 'Cidade Nova',
 'Cidade da Esperança',
 'Felipe Camarão',
 'Guarapes']
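
Later cells build each polygon from `geom[0]`, which only works when a feature's geometry is a plain `Polygon`. The sketch below is a minimal illustration, using a tiny hand-made feature collection rather than the real file (the names `A` and `B` and all coordinates are made up), of how one might index features by name and check the geometry type up front. The helpers `index_by_name` and `non_polygon_names` are hypothetical and not part of the notebook.

```python
# Hypothetical miniature of the structure printed above; the real file has 36 features.
sample_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"name": "A"},
            "geometry": {"type": "Polygon",
                         "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
        },
        {
            "type": "Feature",
            "properties": {"name": "B"},
            "geometry": {"type": "MultiPolygon",
                         "coordinates": [[[[2, 2], [3, 2], [3, 3], [2, 2]]]]},
        },
    ],
}


def index_by_name(geojson):
    """Map each neighborhood name to its feature."""
    return {f["properties"]["name"]: f for f in geojson["features"]}


def non_polygon_names(geojson):
    """Names of features whose geometry is not a simple Polygon."""
    return [f["properties"]["name"]
            for f in geojson["features"]
            if f["geometry"]["type"] != "Polygon"]


features_by_name = index_by_name(sample_geojson)
print(sorted(features_by_name))
print(non_polygon_names(sample_geojson))
```

Running the same check on `geo_json_natal` would show whether any neighborhood needs special handling before `Polygon(geom[0])` is called.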

Here is the map with the GeoJSON layer set to our imported file. It lets us confirm that the file is correct.

In [11]:
# Create a map object
m = folium.Map(
    location=[-5.802592, -35.212558],
    zoom_start=12,
    tiles='OpenStreetMap'
)

# Configure geojson layer
folium.GeoJson(geo_json_natal).add_to(m)

# print the map
m

## 3. Exercise data

Now that we have our GeoJSON file with the definition of all the neighborhoods of Natal - RN, which makes it possible to draw the map we want, we need the exercise data itself. We import it from the CSV file **geolocation.csv**.

In [13]:
# Import the geolocation.csv data: data
geolocation_data = pd.read_csv("geolocation.csv", encoding = 'latin2')

### 3.1. Exercise coordinates

The exercise data can be analysed through the following columns:

>```python
data_instance(
    altitude= 6.686207311207311,
    latitude= -5.778328,
    longitude= -35.204698,
    timestamp= 0.0,
    type= "start",
    datetime= "Wed, 24 Oct 2018 12:33:35",
    year= 2018,
)
```

where,

- **`altitude`**: (float) Altitude related to the tracked coordinate.
- **`latitude`**: (float) Latitude related to the tracked coordinate.
- **`longitude`**: (float) Longitude related to the tracked coordinate.
- **`timestamp`**: (float) Elapsed time relative to `datetime`.
- **`type`**: (string) `start`, `gps` or `end`.
- **`datetime`**: (string) Date and time of the record, e.g. "Wed, 24 Oct 2018 12:33:35".
- **`year`**: (int) Year of the record.
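
As a rough illustration of how `datetime` and `timestamp` could be combined, the sketch below parses the `datetime` string with the standard library and adds the elapsed `timestamp` to it. It assumes the timestamp is expressed in seconds, which the source does not state, and `sample_moment` is a hypothetical helper that is not part of the notebook.

```python
from datetime import datetime, timedelta

DATETIME_FORMAT = "%a, %d %b %Y %H:%M:%S"  # e.g. "Wed, 24 Oct 2018 12:33:35"


def sample_moment(datetime_str, elapsed):
    """Return the moment of a sample, ASSUMING `elapsed` is in seconds."""
    start = datetime.strptime(datetime_str, DATETIME_FORMAT)
    return start + timedelta(seconds=elapsed)


moment = sample_moment("Wed, 30 May 2018 14:28:34", 11.643)
print(moment.isoformat())
```

If the unit turns out to be different, only the `timedelta` argument needs to change.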

In [14]:
# show the first rows of the exercise data
geolocation_data.head()

| Unnamed: 0.1 | Unnamed: 0 | altitude | datetime | latitude | longitude | month | timestamp | type | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 7.0 | Wed, 30 May 2018 14:28:34 | -5.778329 | -35.204913 | 5 | 0.0 | start | 2018 |
| 1 | 1 | 6.714286 | Wed, 30 May 2018 14:28:34 | -5.778428 | -35.204899 | 5 | 11.643 | gps | 2018 |
| 2 | 2 | 6.625 | Wed, 30 May 2018 14:28:34 | -5.778516 | -35.204833 | 5 | 18.645 | gps | 2018 |
| 3 | 3 | 6.555556 | Wed, 30 May 2018 14:28:34 | -5.778625 | -35.204843 | 5 | 22.642 | gps | 2018 |
| 4 | 4 | 6.5 | Wed, 30 May 2018 14:28:34 | -5.778716 | -35.20487 | 5 | 25.642 | gps | 2018 |


## 4. Dataset

To build a representative dataset of Uber wait times across all neighborhoods of Natal, we adopted the following requirements:

- We must have data for every day of the week, including the weekend;
- We must have data for every period of the day;
- We must have as many points as possible;
- We must have data for every Uber product available at the given point;
- The points should be chosen at random inside each neighborhood;
- Each chosen point should be **valid**. We define a point as **invalid** if it obviously cannot be reached by an Uber car, e.g., a point in the ocean.

### 4.2. Choice of points

If we simply choose random points inside each neighborhood, we will inevitably get some "invalid" points, e.g., in the middle of a river or at the top of Morro do Careca. The code below shows this happening.

In [24]:
# return a number of points inside the polygon
def generate_random(number, polygon, neighborhood):
    list_of_points = []
    minx, miny, maxx, maxy = polygon.bounds
    counter = 0
    while counter < number:
        x = random.uniform(minx, maxx)
        y = random.uniform(miny, maxy)
        pnt = Point(x, y)
        if polygon.contains(pnt):
            list_of_points.append([x,y,neighborhood])
            counter += 1
    return list_of_points
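
The function above relies on Shapely for the containment test. For readers who want to see the idea without that dependency, here is a minimal pure-Python sketch of the same rejection-sampling approach, using a ray-casting point-in-polygon test in place of `polygon.contains`. The helpers `point_in_polygon` and `sample_inside` and the triangle are hypothetical illustrations, not part of the notebook, and the test ignores edge cases such as points lying exactly on the boundary.

```python
import random


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the ring of (x, y) pairs."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_inside(number, ring, rng=random):
    """Rejection sampling: draw from the bounding box, keep points inside the ring."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    points = []
    while len(points) < number:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, ring):
            points.append((x, y))
    return points


# Hypothetical triangle, not a real neighborhood
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
samples = sample_inside(5, triangle, random.Random(0))
print(samples)
```

The narrower the polygon is relative to its bounding box, the more candidates are rejected, which is worth keeping in mind for elongated neighborhoods.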

In [25]:
number_of_points = 10

# search all features
for feature in geo_json_natal['features']:
    # get the name of neighborhood
    neighborhood = feature['properties']['name']
    # take the coordinates of the neighborhood (GeoJSON order: lon, lat)
    geom = feature['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
    # return number_of_points by neighborhood as a list [[log,lat],....]
    points = generate_random(number_of_points,polygon, neighborhood)
    # iterate over all points and print in the map
    for i,value in enumerate(points):
        log, lat, name = value 
        # Draw a small circle
        folium.CircleMarker([lat,log],
                    radius=2,
                    popup='%s %s%d' % (name, '#', i),
                   color='red').add_to(m)

# print the map
m

NameError: name 'Polygon' is not defined

Although Uber provides good documentation for its API, it does not explain what happens when we make a request for an "invalid" point. To investigate this scenario, we made requests for a few known points and placed markers on the map along with the estimated wait time.

We observe that the Uber API returns a time estimate even for points that are clearly in the middle of the ocean, and that this estimate is considerably larger than the one for the closest point on a road. It is therefore safe to say that the Uber API does not snap the request to the closest valid point before estimating the time. **In this case, we need a way to tell whether a random point is a valid one.**

In the map printed below, it is also possible to note that for an "invalid" point close (<300 meters) to a valid one, the difference in the time estimate is negligible. This can be seen with points \#5 and \#6.

In [None]:
!pip install geopy

In [None]:
# import the geopy library (https://pypi.python.org/pypi/geopy)
# to calculate the distance between two coordinates
# (geodesic replaces vincenty, which was removed in geopy 2.0)
from geopy.distance import geodesic

# two chosen points
point1 = [-5.876702, -35.176005]
point2 = [-5.863845, -35.148890]
# distance between two points, in meters
distance12 = geodesic(point1, point2).meters

# two chosen points
point3 = [-5.751032, -35.207708]
point4 = [-5.756198, -35.213887]
# distance between two points, in meters
distance34 = geodesic(point3, point4).meters

# two chosen points
point5 = [-5.783261, -35.247053]
point6 = [-5.783835, -35.249602]
# distance between two points, in meters
distance56 = geodesic(point5, point6).meters


points = [point1, point2, point3, point4, point5, point6]
distances = [0, distance12, 0, distance34, 0, distance56]
lat = 0
log = 1
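
As a cross-check that does not depend on `geopy`, the distance between two of these points can be approximated with the haversine formula. This is a sketch under a spherical-Earth assumption, so it is slightly less accurate than an ellipsoidal calculation; `haversine_m` is a hypothetical helper, not part of the notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)


def haversine_m(point_a, point_b):
    """Great-circle distance in meters between two [lat, lon] points in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


point5 = [-5.783261, -35.247053]
point6 = [-5.783835, -35.249602]
print('%.0f meters' % haversine_m(point5, point6))
```

For points \#5 and \#6 the result should land close to the "about 300 meters" quoted in the text.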

In [None]:
# Create a map object
m = folium.Map(
    location=[-5.802592, -35.212558],
    zoom_start=12,
    tiles='OpenStreetMap'
)

# Configure geojson layer
folium.GeoJson(geo_json_natal).add_to(m)

for i, point in enumerate(points):
    
    # get the estimates times for each point
    wait_time = client.get_pickup_time_estimates( 
                            start_latitude=point[lat], 
                            start_longitude=point[log],
                            product_id='65cb1829-9761-40f8-acc6-92d700fe2924'
                            )

    popup = ('Point #' + str(i) + '<br>' +
             ('Approximate distance to nearest road: %.2f meters' % distances[i]) + '<br>' +
             ('Wait time: %.2f seconds' % wait_time.json.get('times')[0]['estimate'])
            )    
    
    # print a marker for each point with the corresponding wait time
    folium.Marker([point[lat], point[log]],
                  popup=popup
                 ).add_to(m)

#print map
m

#### 4.2.1. Google Maps Roads API

Google has a [Maps Roads API](https://developers.google.com/maps/documentation/roads/intro) with a method that returns the nearest road to a given point. The code below shows an example of its usage.

However, this API has a much stricter limit, allowing only 2,500 free requests per **day**, which already makes it unusable for our purposes. Beyond that, it only returns a nearest road if the given point is already very close to one. For example, for point #6, which is about 300 meters away from a road, `gmaps.nearest_roads()` does not return any road. **Therefore, this API does not fit our needs.**

In [None]:
import googlemaps

#get my google key
google_key = keys['google']

gmaps = googlemaps.Client(key=google_key)

result1 = gmaps.nearest_roads((point2[0], point2[1]))
result2 = gmaps.nearest_roads((point4[0], point4[1]))
result3 = gmaps.nearest_roads((point6[0], point6[1]))

print(result1)
print(result2)
print(result3)

#### 4.2.2. Project OSRM

The [Open Source Routing Machine project](http://project-osrm.org/) maintains an [HTTP server](https://github.com/Project-OSRM/osrm-backend/blob/master/docs/http.md) that answers several kinds of requests. One of them is the `nearest` service, whose description is reproduced below.

##### Nearest service

Snaps a coordinate to the street network and returns the nearest `n` matches.

```endpoint
GET http://{server}/nearest/v1/{profile}/{coordinates}.json?number={number}
```

Where `coordinates` only supports a single `{longitude},{latitude}` entry.

In addition to the general options the following options are supported for this service:

|Option      |Values                        |Description                                         |
|------------|------------------------------|----------------------------------------------------|
|number      |`integer >= 1` (default `1`)  |Number of nearest segments that should be returned. |

**Response**

- `code` if the request was successful `Ok` otherwise see the service dependent and general status codes.
- `waypoints` array of `Waypoint` objects sorted by distance to the input coordinate. Each object has at least the following additional properties:
  - `distance`: Distance in meters to the supplied input coordinate.
  
Below is an example of this service in use, with point \#6 from above.

In [None]:
import requests

# point 6
url = 'http://router.project-osrm.org/nearest/v1/car/-35.249602,-5.783835'
response = requests.get(url)
response_json = json.loads(response.text)
distance = response_json.get('waypoints')[0]['distance']

print(response.text)
print('\nDistance: %.2f' % distance)
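
The request and response handling can also be exercised offline. The sketch below builds the request URL from the endpoint template quoted above and extracts the distance from a hand-written sample in the documented response shape. The sample body and its number are made up for illustration, and `nearest_url` and `nearest_distance` are hypothetical helpers, not part of the notebook.

```python
import json


def nearest_url(lon, lat, number=1,
                server='router.project-osrm.org', profile='car'):
    """Build a `nearest` request URL; OSRM expects longitude first."""
    return 'http://%s/nearest/v1/%s/%s,%s.json?number=%d' % (
        server, profile, lon, lat, number)


def nearest_distance(response_text):
    """Distance in meters to the closest waypoint, or None if the request failed."""
    body = json.loads(response_text)
    if body.get('code') != 'Ok' or not body.get('waypoints'):
        return None
    return body['waypoints'][0]['distance']


# Hand-written sample in the documented response shape; the number is made up.
sample_response = '{"code": "Ok", "waypoints": [{"distance": 123.4}]}'

print(nearest_url(-35.249602, -5.783835))
print(nearest_distance(sample_response))
```

Checking the `code` field before reading `waypoints` makes a failed request distinguishable from a point that is simply far from any road.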

Therefore, with the OSRM project we can define our own criterion to classify a point as "valid" or not. Considering the existence of residential condominiums, which cover large areas and can be considerably far from a road, we decided to use 400 meters as the threshold, i.e., a point more than 400 meters from the nearest road is considered "invalid".

Below, we show an example of 20 points in each neighborhood, filtered by our definition of a "valid" point.

In [None]:
# return the distance to the nearest road for a given longitude and latitude
# using the OSRM Project server
def nearest_road_distance(log, lat):
    # define the options of the request
    server = 'router.project-osrm.org'
    service = 'nearest'
    version = 'v1'
    profile = 'car'
    
    # mount the request
    url = ('http://' + server + '/' + service + '/' + version +
            '/' + profile + '/' + str(log) + ',' + str(lat) )

    # try to get the response of the server
    try:
        # get the response of the server (the timeout avoids hanging forever)
        response = requests.get(url, timeout=10)
        # parse the response body as JSON
        response_json = json.loads(response.text)

        # get the distance of the nearest road, in meters
        distance = response_json.get('waypoints')[0]['distance']

    # if we can't get a usable answer, return infinity
    except (requests.RequestException, ValueError, TypeError, KeyError, IndexError):
        distance = float('inf')

    return distance

In [None]:
# return a number of points inside the polygon that lie within max_distance of a road
def generate_random_with_distance(number, polygon, neighborhood, max_distance):
    list_of_points = []
    minx, miny, maxx, maxy = polygon.bounds
    counter = 0
    while counter < number:
        x = random.uniform(minx, maxx)
        y = random.uniform(miny, maxy)
        pnt = Point(x, y)
        if polygon.contains(pnt) and nearest_road_distance(x, y) <= max_distance:
            list_of_points.append([x,y,neighborhood])
            counter += 1
    return list_of_points
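
Because `nearest_road_distance` returns infinity when a request fails, the loop above keeps drawing candidates for as long as the server is unreachable. One possible safeguard, sketched below, is to cap the number of attempts and to pass the distance function in as a parameter so the logic can be tried without network access. The sketch uses a plain bounding box instead of a Shapely polygon, and `sample_valid_points`, `always_near` and `always_failing` are hypothetical names introduced only for this illustration.

```python
import random


def sample_valid_points(number, bounds, distance_fn, max_distance,
                        max_attempts=1000, rng=random):
    """Draw up to `number` points in `bounds` whose road distance is acceptable.

    bounds is (minx, miny, maxx, maxy); distance_fn(x, y) returns meters.
    Stops after max_attempts draws, so it may return fewer points than requested.
    """
    minx, miny, maxx, maxy = bounds
    points = []
    attempts = 0
    while len(points) < number and attempts < max_attempts:
        attempts += 1
        x = rng.uniform(minx, maxx)
        y = rng.uniform(miny, maxy)
        if distance_fn(x, y) <= max_distance:
            points.append((x, y))
    return points


# Stand-in distance functions for illustration only (no network access).
def always_near(x, y):
    return 50.0


def always_failing(x, y):
    return float('inf')


near = sample_valid_points(3, (0, 0, 1, 1), always_near, 400)
failed = sample_valid_points(3, (0, 0, 1, 1), always_failing, 400, max_attempts=20)
print(len(near), len(failed))
```

With the failing stand-in the function gives up after 20 draws and returns an empty list instead of looping forever.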

In [None]:
# Create a map object
m = folium.Map(
    location=[-5.802592, -35.212558],
    zoom_start=12,
    tiles='OpenStreetMap'
)

# Configure geojson layer
folium.GeoJson(geo_json_natal).add_to(m)

#define the number of points
number_of_points = 20

# search all features
for feature in geo_json_natal['features']:
    # get the name of neighborhood
    neighborhood = feature['properties']['name']
    # take the coordinates of the neighborhood (GeoJSON order: lon, lat)
    geom = feature['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
    # maximum distance from a point to a road for the point to be considered valid
    max_distance = 400
    # return number_of_points by neighborhood as a list [[log,lat],....]
    points = generate_random_with_distance(number_of_points,polygon, neighborhood, max_distance)
    # iterate over all points and print in the map
    for i,value in enumerate(points):
        log, lat, name = value 
        # Draw a small circle
        folium.CircleMarker([lat,log],
                    radius=2,
                   color='red').add_to(m)

# print the map
m

### 4.3. Server to generate the dataset

We now have all the elements needed to run our server and collect the data for a representative estimate of the mean Uber wait time in all the neighborhoods of Natal - RN. The server code is shown below.

```python
from time import sleep

from uber_rides.session import Session
from uber_rides.client import UberRidesClient

# time interval between periods, in minutes
INTERVAL = 7

print('Initializing server...')

# try to establish a connection with the Uber server
try:
    session = Session(server_token=keys['uber'])
    client = UberRidesClient(session)

    print('Uber client initialized')

except Exception:
    print('Unable to establish client connection with Uber.')

#initializing a counter to control the number of iterations
k = 0
# define the initial time
initial_time = dt.datetime.now()
while True:
    if dt.datetime.now() >= initial_time:
        k = k+1
        print('\n\nCollecting the data.')
        print('Iteration number: ', k )
        print('\n\n')

        #number of points for each neighborhood
        number_of_points = 2

        #open the file that will act like a database in the 'append' mode
        #which allow us to append a row each time we open it
        file = open('db.csv','a')
        writer = csv.writer(file)

        # search all features
        for feature in geo_json_natal['features']:
            # get the name of neighborhood
            neighborhood = feature['properties']['name']
            # take the coordinates (lat,log) of neighborhood
            geom = feature['geometry']['coordinates']
            # create a polygon using all coordinates
            polygon = Polygon(geom[0])

            # maximum distance from a point to a road for the point to be considered valid
            max_distance = 400
            # return number_of_points by neighborhood as a list [[log,lat],....]
            # (generate_random_with_distance accepts max_distance; generate_random does not)
            points = generate_random_with_distance(number_of_points, polygon, neighborhood, max_distance)
            # iterate over all points and query the Uber API for each one
            for i,value in enumerate(points):
                log, lat, name = value

                #try to get the products for each point
                try:
                    response = client.get_products(lat,log)

                    # API - get/products
                    products = response.json.get('products')
                    #for each point, get the time estimates and write in the db file
                    for product in products:
                        #get the timestamp for insert into the db
                        now = dt.datetime.now()

                        #try to get the time estimates
                        try:
                            wait_time = client.get_pickup_time_estimates(lat,log,
                                                product['product_id'])

                            #mount the row to be inserted in the db file
                            row = [wait_time.json.get('times')[0]['localized_display_name'],
                                   lat,
                                   log,
                                   neighborhood,
                                   now,
                                   wait_time.json.get('times')[0]['estimate']]

                            #write the row mounted
                            writer.writerow(row)
                            #print the row in the terminal so the user can see what is going on in the server
                            print(row)

                        #we don't make any treatment with the exceptions
                        #because there isn't a problem if we miss a couple of points
                        except:
                            pass

                #we don't make any treatment with the exceptions
                #because there isn't a problem if we miss a couple of points
                except:
                    pass        

        # close the file
        file.close()

        # update the next initial time in order to obey the limitation of the Uber API
        initial_time += dt.timedelta(minutes=INTERVAL)

    #wait 10 seconds
    sleep(10)
```
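
The analysis cells later read `db.csv` using column names such as `UBER_TYPE`, `NEIGHBORHOOD`, `REQUEST_TIME` and `WAIT_TIME`, while the loop above only appends data rows. One way to reconcile the two, sketched below, is to write a header the first time the file is created. The column order follows the `row` list built by the server. `LATITUDE` and `LONGITUDE` are assumed names (they do not appear in the source), `append_row` is a hypothetical helper, and the demonstration values are made up.

```python
import csv
import os
import tempfile

# Column order follows the `row` list built by the server loop.
# LATITUDE and LONGITUDE are assumed names; the others appear in later cells.
HEADER = ['UBER_TYPE', 'LATITUDE', 'LONGITUDE',
          'NEIGHBORHOOD', 'REQUEST_TIME', 'WAIT_TIME']


def append_row(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerow(row)


# Demonstration in a temporary directory with made-up values
demo_path = os.path.join(tempfile.mkdtemp(), 'db.csv')
append_row(demo_path, ['uberX', -5.8, -35.2, 'Tirol', '2000-01-01 03:18:00', 240])
append_row(demo_path, ['uberX', -5.8, -35.2, 'Tirol', '2000-01-01 03:25:00', 300])

with open(demo_path, newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])
print(len(rows))
```

Opening the file inside a `with` block also guarantees it is closed if an exception interrupts the iteration.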

The server ran from Sunday, October 29, 03:18, to Sunday, November 05, 09:06, which corresponds to 7 days in a row. It collected 176,860 points, an average of approximately 4,900 points per neighborhood.

In [None]:
data = pd.read_csv('db.csv')
date = data['REQUEST_TIME']
print('Start date: ', date.min())
print('End date: ',date.max())
print('Number of points: ', len(date))
print('Mean points per neighborhood: ', len(date) / len(neighborhood))

In [None]:
#print the first 5 rows of the dataset
data.head()

## 5. Choropleth map

Once our dataset is fully populated and we have the GeoJSON file for the neighborhoods of Natal - RN, it's easy to make a choropleth map using the `folium` library. We have another notebook that better explains the basics of a [choropleth map](https://github.com/vhfdoliveira/DataScience/blob/master/03_Choropleth/Northeast_Chropleth_map.ipynb).

In this notebook, however, we explore two other features of the `folium` library:

(i) the ability to make custom icons for the markers; 

and (ii) plot a Vega chart on the popup of those markers.

### **Custom icons**:

The full description of `folium.features.CustomIcon()` can be found in the [documentation](http://python-visualization.github.io/folium/docs-master/modules.html) of the `folium` library. In this notebook, we only use its ability to load a given image and resize it. Specifically, we take the image of the default marker and resize it to better fit our neighborhood map.

Below is a summary of the commands that we use to draw a custom icon for the markers.

>```python
icon_path = os.path.join('icon', 'marker-icon.png')
#define the path of the new image
icon_image = icon_path
#customize the new image, resizing it to better fit our map
icon = folium.features.CustomIcon(
    icon_image,
    icon_size=(15, 25)
)
#call the marker function passing the new icon as parameter
folium.Marker({coordinates},
              icon = icon
             ).add_to(m)
```

The parameters used in `folium.features.CustomIcon()` are:

  - **icon_image**: (string, file or array-like object) – The data you want to use as an icon. * If string, it will be written directly in the output file. * If file, its content will be converted and embedded in the output file. * If array-like, it will be converted to a PNG base64 string and embedded in the output.
  - **icon_size**: (tuple of 2 int) – Size of the icon image in pixels.
  
### **Vega chart on popup**:

With `folium.Vega` it's possible to put Vega charts inside a popup. To create the Vega charts we used the [vincent library](https://github.com/wrobstory/vincent).

The `folium` documentation has some [examples](http://python-visualization.github.io/folium/docs-master/quickstart.html#Vincent/Vega-and-Altair/VegaLite-Markers) of how to use Vega charts as popups. The same goes for the `vincent` documentation, whose [examples](http://vincent.readthedocs.io/en/latest/quickstart.html) focus on the charts themselves.

Following these two sets of documentation, we were able to build the charts we wanted: for every neighborhood, the popup contains a bar chart with the mean wait time for each hour of the day, so the user can see how the mean wait time behaves throughout the day.

Below is a summary of the main commands that we use to insert the charts in the popup of the markers.

>```python
import vincent
# create a new column with the hour of the day for each point
data['HOUR'] = pd.to_datetime(data['REQUEST_TIME']).dt.hour
#filter the dataset to each neighborhood
data_neigh = data[ data['NEIGHBORHOOD'] == name ]
# aggregate the data by the hours of the day, calculating the mean of the wait_time column
mean_wait_time_per_hour = data_neigh.pivot_table(index='HOUR', values='WAIT_TIME', aggfunc=np.mean)
#create a Vega bar plot
bar = vincent.Bar(mean_wait_time_per_hour,
                  key_on='HOUR',
                  columns=['WAIT_TIME'],
                  width=450,
                  height=220)
#put the Vega plot in a dictionary
bar_dict = json.loads(bar.to_json())
# Let's create a Vega popup based on bar_dict.
popup = folium.Popup(max_width=580)
folium.Vega(bar_dict, height=270, width=580).add_to(popup)
# print a marker with the name of the neighborhood and the mean wait time
folium.Marker({coordinates},
              popup=popup
             ).add_to(m)
```

The parameters used in `vincent.Bar()` are:
  - **data**: (Tuples, List, Dict, Pandas Series, or Pandas DataFrame) – Input data. Tuple of paired tuples, List of single values, dict of key/value pairs, Pandas Series/DataFrame, Numpy ndarray;
  - **columns**: (list, default None) – Pandas DataFrame columns to plot;
  - **key_on**: (string, default 'idx') – Pandas DataFrame column to key on, if not index;
  - **width**: (int, default 960)  – Chart width;
  - **height**: (int, default 500) – Chart height;
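
For readers who want to see what the `pivot_table(index='HOUR', values='WAIT_TIME', ...)` aggregation computes without pandas, here is a plain-Python sketch of the same per-hour mean. The sample rows are made up, the timestamp format `YYYY-MM-DD HH:MM:SS` is an assumption about how `REQUEST_TIME` is stored, and `mean_by_hour` is a hypothetical helper that is not part of the notebook.

```python
from collections import defaultdict
from datetime import datetime


def mean_by_hour(rows):
    """rows: iterable of (request_time_str, wait_time). Returns {hour: mean wait}."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of waits, count]
    for request_time, wait_time in rows:
        hour = datetime.strptime(request_time, '%Y-%m-%d %H:%M:%S').hour
        totals[hour][0] += wait_time
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in sorted(totals.items())}


# Made-up sample rows for one neighborhood
sample_rows = [
    ('2000-01-01 08:05:00', 240),
    ('2000-01-01 08:40:00', 360),
    ('2000-01-01 23:10:00', 600),
]
hourly_means = mean_by_hour(sample_rows)
print(hourly_means)
```

Hours with no observations are simply absent from the result, which is also how they would be missing from the bar chart.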


### 5.1. All data (all products)

Below we draw our first choropleth map, which contains all the data collected: all the Uber products available in those locations (UberX and UberSelect), all days of the week and all hours of the day.


In [None]:
import vincent

# read the dataset
data = pd.read_csv('db.csv')
# create a new column with the hour of the day for each point
data['HOUR'] = pd.to_datetime(data['REQUEST_TIME']).dt.hour
# aggregate the data by the neighborhoods, calculating the mean of the wait_time column
mean_wait_time = data.pivot_table(index='NEIGHBORHOOD', values='WAIT_TIME', aggfunc=np.mean)
# remove the 'NEIGHBORHOOD' as the index, making them as a regular column
mean_wait_time.reset_index(inplace=True)

In [None]:
# Create a map object
m = folium.Map(
    location = [-5.802592, -35.212558],
    zoom_start = 12,
    tiles='OpenStreetMap'
)

# create a threshold of legend
threshold_scale = np.linspace(mean_wait_time['WAIT_TIME'].min(),
                              mean_wait_time['WAIT_TIME'].max(), 6, dtype=int).tolist()

# draw the choropleth
m.choropleth(
    geo_data=geo_json_natal,
    data=mean_wait_time,
    name='All data',
    columns=['NEIGHBORHOOD', 'WAIT_TIME'],
    key_on='feature.properties.name',
    fill_color = 'OrRd',
    legend_name='Mean wait time for uber in the neighborhoods of Natal (in seconds)',
    highlight=True,
    threshold_scale = threshold_scale
)

# define the path for the default icon
icon_path = os.path.join('icon', 'marker-icon.png')
icon_image = icon_path

# print one marker on each neighborhood
for neighborhood in geo_json_natal['features']:
    # customize the default icon for a marker,
    # making it a little smaller for better visualization on our map
    icon = folium.features.CustomIcon(
        icon_image,
        icon_size=(15, 25)
    )
    
    # get the name of neighborhood
    name = neighborhood['properties']['name']
    # take the coordinates of the neighborhood (GeoJSON order: lon, lat)
    geom = neighborhood['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
    
    #filter the dataset to each neighborhood
    data_neigh = data[ data['NEIGHBORHOOD'] == name ]
    # aggregate the data by the hours of the day, calculating the mean of the wait_time column
    mean_wait_time_per_hour = data_neigh.pivot_table(index='HOUR', values='WAIT_TIME', aggfunc=np.mean)
    mean_wait_time_per_hour.reset_index(inplace=True)    
    mean_wait_time_per_hour.sort_values(by='HOUR',inplace=True)
    
    #create a Vega bar plot
    bar = vincent.Bar(mean_wait_time_per_hour,
                      key_on='HOUR',
                      columns=['WAIT_TIME'],
                      width=450,
                      height=220)
    bar.axis_titles(x='Hour of the day', y='Mean wait time (seconds)')
    #define the title of the legend as the name of the neighborhood
    bar.legend(title=name)
    #put the Vega plot in a dictionary
    bar_dict = json.loads(bar.to_json())
    
    # Let's create a Vega popup based on bar_dict.
    popup = folium.Popup(max_width=580)
    folium.Vega(bar_dict, height=270, width=580).add_to(popup)
    
    # print a marker with the name of the neighborhood and the mean wait time
    folium.Marker([polygon.centroid.y, polygon.centroid.x],
                  icon = icon,
                  popup=popup
                 ).add_to(m)
    

# add a layer control
folium.LayerControl().add_to(m)
# print the map
m
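
The `threshold_scale` in the cell above is built with `np.linspace(..., 6, dtype=int)`, which yields six evenly spaced cut points converted to integers. The plain-Python sketch below reproduces that arithmetic for a made-up minimum and maximum (assuming positive values, where the conversion rounds down) to show a side effect: the top cut point can end up slightly below the true maximum. Whether that matters depends on the `folium` version in use, so treat it as something to check rather than a confirmed problem. `integer_thresholds` is a hypothetical helper.

```python
import math


def integer_thresholds(lowest, highest, count=6):
    """Evenly spaced cut points from lowest to highest, rounded down to integers."""
    step = (highest - lowest) / (count - 1)
    return [math.floor(lowest + i * step) for i in range(count)]


# Made-up mean wait times in seconds
lowest, highest = 251.7, 498.9
thresholds = integer_thresholds(lowest, highest)
print(thresholds)
print(highest > thresholds[-1])
```

If the largest value must fall inside the scale, rounding the last cut point up instead of down is one simple option.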

### 5.2. UberX

Below we print the choropleth map filtering the data to contain only the UberX product, which is the most popular one, with lower prices.
 

In [None]:
# filter the data to only the UberX product
data_X = data[ data['UBER_TYPE'] == 'uberX' ]
# aggregate the data by the neighborhoods, calculating the mean of the wait_time column
mean_wait_time_X = data_X.pivot_table(index='NEIGHBORHOOD', values='WAIT_TIME', aggfunc=np.mean)
# remove the 'NEIGHBORHOOD' as the index, making them as a regular column
mean_wait_time_X.reset_index(inplace=True)

In [None]:
# Create a map object
m = folium.Map(
    location = [-5.802592, -35.212558],
    zoom_start = 12,
    tiles='OpenStreetMap'
)

# create a threshold of legend
threshold_scale_X = np.linspace(mean_wait_time_X['WAIT_TIME'].min(),
                                mean_wait_time_X['WAIT_TIME'].max(), 6, dtype=int).tolist()

# draw the choropleth
m.choropleth(
    geo_data=geo_json_natal,
    data=mean_wait_time_X,
    name='UberX',
    columns=['NEIGHBORHOOD', 'WAIT_TIME'],
    key_on='feature.properties.name',
    fill_color = 'OrRd',
    legend_name='Mean wait time for UberX in the neighborhoods of Natal (in seconds)',
    highlight=True,
    threshold_scale = threshold_scale_X
)

# define the path for the default icon
icon_path = os.path.join('icon', 'marker-icon.png')
icon_image = icon_path

# print one marker on each neighborhood
for neighborhood in geo_json_natal['features']:
    # customize the default icon for a marker,
    # making it a little smaller for better visualization on our map
    icon = folium.features.CustomIcon(
        icon_image,
        icon_size=(15, 25)
    )
    
    # get the name of neighborhood
    name = neighborhood['properties']['name']
    # take the coordinates of the neighborhood (GeoJSON order: lon, lat)
    geom = neighborhood['geometry']['coordinates']
    # create a polygon using all coordinates
    polygon = Polygon(geom[0])
    
    #filter the dataset to each neighborhood
    data_neigh = data_X[ data_X['NEIGHBORHOOD'] == name ]
    # aggregate the data by the hours of the day, calculating the mean of the wait_time column
    mean_wait_time_per_hour = data_neigh.pivot_table(index='HOUR', values='WAIT_TIME', aggfunc=np.mean)
    mean_wait_time_per_hour.reset_index(inplace=True)    
    mean_wait_time_per_hour.sort_values(by='HOUR',inplace=True)
    
    #create a Vega bar plot
    bar = vincent.Bar(mean_wait_time_per_hour,
                      key_on='HOUR',
                      columns=['WAIT_TIME'],
                      width=450,
                      height=220)
    bar.axis_titles(x='Hour of the day', y='Mean wait time (seconds)')
    #define the title of the legend as the name of the neighborhood
    bar.legend(title=name)
    #put the Vega plot in a dictionary
    bar_dict = json.loads(bar.to_json())
    
    # Let's create a Vega popup based on bar_dict.
    popup = folium.Popup(max_width=580)
    folium.Vega(bar_dict, height=270, width=580).add_to(popup)
    
    # add a marker at the centroid of the neighborhood; its popup shows the hourly bar plot
    folium.Marker([polygon.centroid.y, polygon.centroid.x],
                  icon = icon,
                  popup=popup
                 ).add_to(m)
    

# add a layer control
folium.LayerControl().add_to(m)
# print the map
m

### 5.3. UberSelect

Below we draw the choropleth map using only the data for the UberSelect product, which is the most upscale product available in the city and has higher prices.


In [None]:
# filter the data to only the UberSELECT product
data_select = data[ data['UBER_TYPE'] == 'UberSELECT' ]
# aggregate the data by the neighborhoods, calculating the mean of the wait_time column
mean_wait_time_select = data_select.pivot_table(index='NEIGHBORHOOD', values='WAIT_TIME', aggfunc=np.mean)
# remove the 'NEIGHBORHOOD' as the index, making them as a regular column
mean_wait_time_select.reset_index(inplace=True)

In [None]:
# Create a map object
m = folium.Map(
    location = [-5.802592, -35.212558],
    zoom_start = 12,
    tiles='OpenStreetMap'
)

# create a threshold of legend
threshold_scale_select = np.linspace(mean_wait_time_select['WAIT_TIME'].min(),
                              mean_wait_time_select['WAIT_TIME'].max(), 6, dtype=int).tolist()

# draw the choropleth
m.choropleth(
    geo_data=geo_json_natal,
    data=mean_wait_time_select,
    name='UberSELECT',
    columns=['NEIGHBORHOOD', 'WAIT_TIME'],
    key_on='feature.properties.name',
    fill_color = 'OrRd',
    legend_name='Mean wait time for UberSelect in the neighborhoods of Natal (in seconds)',
    highlight=True,
    threshold_scale = threshold_scale_select
)

# define the path for the default icon
icon_path = os.path.join('icon', 'marker-icon.png')
icon_image = icon_path

# print one marker on each neighborhood
for neighborhood in geo_json_natal['features']:
    # customize the default marker icon,
    # making it a little smaller for better visualization on our map
    icon = folium.features.CustomIcon(
        icon_image,
        icon_size=(15, 25)
    )
    
    # get the name of neighborhood
    name = neighborhood['properties']['name']
    # take the coordinates of the neighborhood (GeoJSON order is lon, lat)
    geom = neighborhood['geometry']['coordinates']
    # create a polygon from the outer ring of the neighborhood
    polygon = Polygon(geom[0])
    
    
    #filter the dataset to each neighborhood
    data_neigh = data_select[ data_select['NEIGHBORHOOD'] == name ]
    # aggregate the data by the hours of the day, calculating the mean of the wait_time column
    mean_wait_time_per_hour = data_neigh.pivot_table(index='HOUR', values='WAIT_TIME', aggfunc=np.mean)
    mean_wait_time_per_hour.reset_index(inplace=True)    
    mean_wait_time_per_hour.sort_values(by='HOUR',inplace=True)
    
    #create a Vega bar plot
    bar = vincent.Bar(mean_wait_time_per_hour,
                      key_on='HOUR',
                      columns=['WAIT_TIME'],
                      width=450,
                      height=220)
    bar.axis_titles(x='Hour of the day', y='Mean wait time (seconds)')
    #define the title of the legend as the name of the neighborhood
    bar.legend(title=name)
    #put the Vega plot in a dictionary
    bar_dict = json.loads(bar.to_json())
    
    # Let's create a Vega popup based on bar_dict.
    popup = folium.Popup(max_width=580)
    folium.Vega(bar_dict, height=270, width=580).add_to(popup)
    
    # add a marker at the centroid of the neighborhood; its popup shows the hourly bar plot
    folium.Marker([polygon.centroid.y, polygon.centroid.x],
                  icon = icon,
                  popup=popup
                 ).add_to(m)
    

# add a layer control
folium.LayerControl().add_to(m)
# print the map
m

## 6. Conclusion

It is possible to observe that the central and southern regions have the lowest mean wait times, namely in the neighborhoods of Capim Macio, Lagoa Nova, Lagoa Seca, Barro Vermelho, Tirol, Petrópolis, Dix-Sept Rosado and Nossa Senhora de Nazaré. This is probably because those regions have a higher population density and a lot of commerce.

The peripheral regions tend to have the highest mean wait times, probably for the opposite reasons. They also tend to have lower per capita incomes than the rest of the city, which reduces the probability that residents choose Uber over cheaper public transportation, especially for UberSelect. Supporting these theories, the neighborhood with the highest mean wait time, Guarapes, is very remote and has one of the lowest per capita incomes in the city.

The North Region, formed by the neighborhoods of Salinas, Redinha, Potengi, Igapó, Nossa Senhora da Apresentação and Pajuçara, follows a trend that is somewhat independent of the rest of the city. The neighborhoods in the center of this region also have lower mean wait times. This is probably due to the distance between those neighborhoods and the center of Natal, which likely leads Uber drivers to work only within the North Region, and the best way to do that is to position themselves in its center.

Comparing the maps of the two products (UberX and UberSelect), we can see that, in general, the mean wait times for UberX are considerably lower than those for UberSelect. This is expected, since UberX is the most popular (cheaper) product and therefore has more drivers assigned to it. Also, UberSelect is relatively new in Natal - RN. The general behavior is almost the same in each neighborhood, with a small exception for the peripheral neighborhoods, where the difference tends to be larger, probably because they have lower per capita incomes and their residents therefore more often choose the cheaper product.
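
The comparison between the two products can also be checked numerically instead of only visually. The sketch below is a minimal example, assuming a DataFrame with the same `NEIGHBORHOOD`, `UBER_TYPE` and `WAIT_TIME` columns used in this notebook. The helper name `product_gap` and the toy values are hypothetical and were invented only for illustration; they are not measurements from our dataset.

```python
import pandas as pd

# Toy records with the same columns as the notebook's `data` DataFrame.
# The values are invented for illustration only; they are NOT real measurements.
toy = pd.DataFrame({
    'NEIGHBORHOOD': ['Tirol', 'Tirol', 'Tirol', 'Tirol',
                     'Guarapes', 'Guarapes', 'Guarapes', 'Guarapes'],
    'UBER_TYPE': ['UberX', 'UberX', 'UberSELECT', 'UberSELECT',
                  'UberX', 'UberX', 'UberSELECT', 'UberSELECT'],
    'WAIT_TIME': [200, 240, 300, 340, 500, 540, 800, 840],
})

def product_gap(df, base='UberX', other='UberSELECT'):
    """Mean wait time per neighborhood and product, plus the gap between products."""
    # one row per neighborhood, one column per product
    table = df.pivot_table(index='NEIGHBORHOOD', columns='UBER_TYPE',
                           values='WAIT_TIME', aggfunc='mean')
    # positive gap: `other` takes longer to arrive than `base`
    table['GAP'] = table[other] - table[base]
    # neighborhoods with the largest difference first
    return table.sort_values('GAP', ascending=False)

gap_table = product_gap(toy)
print(gap_table)
```

Calling `product_gap(data)` on the full dataset should list first the neighborhoods where the difference between the products is largest, which makes it easier to verify whether they are indeed the peripheral ones.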

Analyzing the mean wait time by hour of the day, we can see that in nearly all neighborhoods the period from 00:00 (midnight) to 05:59 has the longest mean wait time. This is also expected.
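
One way to confirm this observation is to group the hours into periods of the day and compare the means. The sketch below assumes a DataFrame with the `HOUR` and `WAIT_TIME` columns used in this notebook. The helper `mean_wait_by_period`, the choice of four six-hour periods and the toy values are assumptions made only for illustration; the values are not real measurements.

```python
import pandas as pd

# Toy records; the values are invented for illustration only, NOT real measurements.
toy = pd.DataFrame({
    'HOUR': [1, 4, 8, 13, 19, 23],
    'WAIT_TIME': [600, 640, 300, 320, 340, 360],
})

def mean_wait_by_period(df):
    """Mean wait time for each six-hour period of the day."""
    # left-closed bins: [0, 6), [6, 12), [12, 18), [18, 24)
    period = pd.cut(df['HOUR'], bins=[0, 6, 12, 18, 24], right=False,
                    labels=['00-05', '06-11', '12-17', '18-23'])
    return df.groupby(period, observed=True)['WAIT_TIME'].mean()

by_period = mean_wait_by_period(toy)
print(by_period)
# label of the period with the longest mean wait time
print(by_period.idxmax())
```

Applying the same function to the rows of a single neighborhood (for example `data_X[data_X['NEIGHBORHOOD'] == name]`) would show whether the overnight period is the slowest one there as well.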

One interesting observation is that in the peripheral neighborhoods the mean wait time varies little between 06:00 and 23:59, whereas in the central neighborhoods of Natal, including the center of the North Region and almost all of the South Region, the mean wait times around 12:00 and 18:00 are much higher than at the other daytime hours. This probably happens because these are rush hours, when heavy traffic makes the Ubers take longer to reach the pickup point.

Another interesting observation is that in the Capim Macio neighborhood the mean wait time at 22:00 is higher than at 21:00 and 23:00. This probably happens because there are many colleges in this region, which generates considerable traffic when classes end, at about 22:00.
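
Hours like this one, where the mean wait time is higher than in both neighboring hours, can be found automatically. The sketch below is a minimal example, assuming a DataFrame with the `NEIGHBORHOOD`, `HOUR` and `WAIT_TIME` columns used in this notebook. The helper `local_peaks` and the toy values are hypothetical and serve only as an illustration; the values are not real measurements.

```python
import pandas as pd

# Toy records; the values are invented for illustration only, NOT real measurements.
toy = pd.DataFrame({
    'NEIGHBORHOOD': ['Capim Macio'] * 6,
    'HOUR': [20, 21, 22, 22, 23, 21],
    'WAIT_TIME': [250, 260, 330, 350, 270, 280],
})

def local_peaks(df, neighborhood):
    """Hours whose mean wait time is higher than both the previous and the next hour."""
    hourly = (df[df['NEIGHBORHOOD'] == neighborhood]
              .groupby('HOUR')['WAIT_TIME'].mean()
              .sort_index())
    # shift() compares each hour with the adjacent rows; this assumes there is
    # data for every hour, otherwise non-consecutive hours would be compared
    prev_hour = hourly.shift(1)
    next_hour = hourly.shift(-1)
    # the first and last hours have no neighbor on one side and are never peaks
    is_peak = (hourly > prev_hour) & (hourly > next_hour)
    return hourly[is_peak].index.tolist()

peaks = local_peaks(toy, 'Capim Macio')
print(peaks)
```

Running `local_peaks(data_X, name)` for each neighborhood would give a quick list of the hours that deserve a closer look in the bar plots of the maps.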