## Getting MODA Waze Data

MODA receives Waze's crowdsourced traffic data through Waze's Connected Citizens Program. The data feed is in JSON format and is updated every 2 minutes. The raw data is stored in Microsoft Azure Storage.

This notebook shows you how to:
- access the "raw-upload" container in the "modawazedata" Azure Storage account through its REST API (Azure developer resources: https://docs.microsoft.com/en-us/rest/api/azure/)
- extract all the filenames of the raw feed in the container
- grab the raw data and save it to a file

Contact Mitsue Iwata, miwata@analytics.nyc.gov, with any questions
Sept 2018

In [2]:
# We have been storing Waze data since June 22nd, 2018. 
# To get an approximation of the number of records stored, we multiply 
# number of weeks * number of days/week * number of hours/day * number of feeds/hour
12 * 7 * 24 * 30

60480

There are roughly 60,480 records from Waze. 

Note, this is a rough approximation. There have been a few short timeframes when the data was not collected 
because of server issues or other hiccups.
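
The same estimate can be derived from dates instead of a hand-counted number of weeks. This is only a sketch: `estimate_feed_count` and `FEED_INTERVAL` are names introduced here for illustration, and the result is an upper bound because it ignores the collection gaps mentioned above.

```python
from datetime import datetime, timedelta

FEED_INTERVAL = timedelta(minutes=2)  # the feed is updated every 2 minutes


def estimate_feed_count(start, end, interval=FEED_INTERVAL):
    """Upper-bound estimate of the number of feed files written between start and end."""
    return (end - start) // interval


# Twelve full weeks from the first stored feed reproduces the calculation above.
first_feed = datetime(2018, 6, 22)
twelve_weeks = estimate_feed_count(first_feed, first_feed + timedelta(weeks=12))
print(twelve_weeks)  # 60480
```

Passing the current date as `end` gives an estimate that grows with the archive instead of being fixed at twelve weeks.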

## Accessing Azure Storage through REST API

We can call the API to get a list of filenames in a container. However, each call returns at most 5,000 records. Since there is no way to count the number of records, or "blobs", in an Azure container, we use the approximation from above to decide how many requests to make, in increments of 5,000, and pass the "nextmarker" value returned by each response to fetch the next set of results.


https://docs.microsoft.com/en-us/rest/api/storageservices/list-blobs
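
As an alternative to estimating the number of requests up front, the paging can be driven by the marker itself. The sketch below assumes the service signals the end of the listing with an empty "nextmarker" value. `list_all_blob_names`, `fetch_page`, and `FAKE_PAGES` are hypothetical names used only for this illustration, and the fake pages stand in for real HTTP responses so the example runs offline.

```python
def list_all_blob_names(fetch_page, max_pages=1000):
    """Collect blob names page by page until an empty marker is returned.

    fetch_page(marker) is a caller-supplied function that performs one
    listing request and returns (names, next_marker).
    """
    names, marker = [], ""
    for _ in range(max_pages):  # hard cap guards against an endless loop
        page, marker = fetch_page(marker)
        names.extend(page)
        if not marker:  # assumed end-of-listing signal
            break
    return names


# Stand-in for the real service: three fake pages keyed by marker.
FAKE_PAGES = {
    "": (["a.json", "b.json"], "m1"),
    "m1": (["c.json", "d.json"], "m2"),
    "m2": (["e.json"], ""),
}

all_names = list_all_blob_names(lambda marker: FAKE_PAGES[marker])
print(all_names)  # ['a.json', 'b.json', 'c.json', 'd.json', 'e.json']
```

In this notebook, `fetch_page` could plausibly wrap the helper functions defined in the next cell (build the URL, make the request, extract the names and marker), though that wiring is not tested here.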

### MODA Azure Storage name is: modawazedata
### Container name is: raw-upload



In [3]:
import requests
from bs4 import BeautifulSoup
import lxml
import json
import pandas as pd  #used below to build DataFrames from the feed
from transform_waze_raw_functions import *

In [4]:
##helper functions
MAXRES_STR = "&maxresults="
MARKER_STR = "&marker="

URL = "https://modawazedata.blob.core.windows.net/raw-upload?restype=container&comp=list&timeout=60"

#build url with max results and next marker
def build_string(max_res, next_marker, url=URL, maxres_str=MAXRES_STR, marker_str=MARKER_STR):
    max_str = maxres_str + str(max_res)
    marker = marker_str + next_marker
    return url + max_str + marker

#make request and soupify
def get_request(url):
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c,'lxml')
    return soup

#generate list of blob names
def make_list_of_blobnames(blob_vals):
    blob_names = []
    for n in blob_vals:
        blob_names.append(n.text)
    return blob_names

#get request and return list of filenames and next marker
def get_filenames_and_marker(soup):
    soup_name = soup.find_all('name')
    
    #generate list of blobnames
    filename_list = make_list_of_blobnames(soup_name)
    
    #get next marker
    marker = soup.find_all('nextmarker')[0].text
    
    return filename_list, marker

#use recursive method with max_results to get all filenames in container
def get_next_marker_vals(rec_cnt, marker, url, all_blobs, page_size=5000):
    #stop when the record budget is used up or there is no next marker
    if rec_cnt <= 0 or not marker:
        return all_blobs
    else:
        #request one page at a time (the API returns at most 5000 names per call)
        new_url = build_string(page_size, marker, url=url, maxres_str=MAXRES_STR, marker_str=MARKER_STR)
        #print("new URL in recursion ", new_url)
        soup = get_request(new_url)
        blob_list, marker = get_filenames_and_marker(soup)
        all_blobs += blob_list
        print('number of blob filenames ', len(all_blobs))
        rec_cnt = rec_cnt - page_size
        print('record count ', rec_cnt)

        return get_next_marker_vals(rec_cnt, marker, url, all_blobs, page_size)

In [5]:
NUM_RECORDS = 70000 #increments of 5000
MAXRES_STR = "&maxresults="
MARKER_STR = "&marker="

url = "https://modawazedata.blob.core.windows.net/raw-upload?restype=container&comp=list&timeout=60"

#create list object for all blobs names
all_blobs = []

#make first call to url and get the nextmarker value for subsequent calls
s = get_request(url)
blob_list, mark = get_filenames_and_marker(s)
all_blobs += blob_list # add blob names to list

blobnames = get_next_marker_vals(NUM_RECORDS, mark, url, all_blobs)

number of blob filenames  10000
record count  65000
number of blob filenames  15000
record count  60000
number of blob filenames  20000
record count  55000
number of blob filenames  25000
record count  50000
number of blob filenames  30000
record count  45000
number of blob filenames  35000
record count  40000
number of blob filenames  40000
record count  35000
number of blob filenames  45000
record count  30000
number of blob filenames  50000
record count  25000
number of blob filenames  55000
record count  20000
number of blob filenames  60000
record count  15000
number of blob filenames  65000
record count  10000
number of blob filenames  70000
record count  5000
number of blob filenames  75000
record count  0


In [6]:
print(len(blobnames))
print(len(set(blobnames)))

75000
75000


This code isn't perfect (the number of requests is a fixed estimate, so repeated filenames are possible), but in this run we have 75,000 filenames, all of them unique.

In [7]:
blobnames[:10]

['wazeprocessorraw_2018-06-22T20:04:00.3740299Z.json',
 'wazeprocessorraw_2018-06-22T20:06:00.7231344Z.json',
 'wazeprocessorraw_2018-06-22T20:08:00.9739038Z.json',
 'wazeprocessorraw_2018-06-22T20:10:00.6637321Z.json',
 'wazeprocessorraw_2018-06-22T20:12:00.2506104Z.json',
 'wazeprocessorraw_2018-06-22T20:14:00.7759341Z.json',
 'wazeprocessorraw_2018-06-22T20:16:00.2133851Z.json',
 'wazeprocessorraw_2018-06-22T20:18:00.4600203Z.json',
 'wazeprocessorraw_2018-06-22T20:20:00.7461843Z.json',
 'wazeprocessorraw_2018-06-22T20:22:01.1014014Z.json']
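
Each filename encodes the time the feed was captured, which makes it possible to look for the collection gaps mentioned earlier. The following is a minimal sketch, assuming every blob follows the naming pattern shown above. `NAME_RE`, `blob_timestamp`, `find_gaps`, and the 3-minute tolerance are choices made here for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches names such as wazeprocessorraw_2018-06-22T20:04:00.3740299Z.json
NAME_RE = re.compile(r"wazeprocessorraw_(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")


def blob_timestamp(name):
    """Return the capture time encoded in a blob name, truncated to the second."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError("unexpected blob name: " + name)
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")


def find_gaps(names, tolerance=timedelta(minutes=3)):
    """Return (previous, current) timestamp pairs that are further apart than tolerance."""
    stamps = sorted(blob_timestamp(n) for n in names)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > tolerance]


sample = [
    "wazeprocessorraw_2018-06-22T20:04:00.3740299Z.json",
    "wazeprocessorraw_2018-06-22T20:06:00.7231344Z.json",
    "wazeprocessorraw_2018-06-22T20:12:00.2506104Z.json",  # two feeds missing before this one
]
gaps = find_gaps(sample)
print(gaps)
```

The fractional seconds are dropped on purpose: the names carry seven fractional digits, while `%f` in `strptime` accepts at most six.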

## Raw File URL

Now we build the URL for each raw file so that we can download it and write it to a local file.

In [8]:
base_url = "https://modawazedata.blob.core.windows.net/raw-upload/"
blobnames = list(set(blobnames))

file_url_list = []
for b in blobnames:
    file_url_list.append(base_url + b)

file_url_list[:10]

['https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-09-29T20:38:00.3502926Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-07-26T10:48:00.0634092Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-09-22T21:50:00.6489454Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-07-12T00:08:00.1862321Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-08-05T08:56:01.0974502Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-08-23T20:24:00.6667180Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-08-24T02:52:00.5646405Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-07-12T23:26:01.0355045Z.json',
 'https://modawazedata.blob.core.windows.net/raw-upload/wazeprocessorraw_2018-08-11T14:08:01.0583745Z.json',
 'https://modawazed
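
The download cell in this notebook writes every feed to the same `data.json`, so only the last file survives. One possible way to keep each raw feed is to derive a local filename from the blob name. This is a sketch rather than the notebook's method: `local_path_for` and `save_feed` are hypothetical helpers, and replacing `:` with `-` is an assumption made so the names are also valid on Windows.

```python
import json
import os
import tempfile


def local_path_for(blob_name, out_dir):
    """Map a blob name to a local path, replacing ':' with '-'."""
    return os.path.join(out_dir, blob_name.replace(":", "-"))


def save_feed(data, blob_name, out_dir):
    """Write one parsed feed to its own JSON file and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = local_path_for(blob_name, out_dir)
    with open(path, "w") as f:
        json.dump(data, f)
    return path


# Offline demonstration with a made-up payload instead of a real download.
demo_dir = tempfile.mkdtemp()
demo_name = "wazeprocessorraw_2018-06-22T20:04:00.3740299Z.json"
saved = save_feed({"alerts": [], "jams": []}, demo_name, demo_dir)
print(os.path.basename(saved))  # wazeprocessorraw_2018-06-22T20-04-00.3740299Z.json
```

In a download loop, `save_feed(req.json(), blob_name, "raw")` would take the place of the fixed `open('data.json', 'w')` call.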

In [9]:
import pandas as pd

#collect one DataFrame per downloaded feed for each record type
L1 = []  #alerts
L2 = []  #jams
L3 = []  #irregularities
for i in range(100):
    req = requests.get(file_url_list[i])
    data = req.json()
    #save the raw feed locally (data.json is overwritten on every iteration)
    with open('data.json','w') as f:
        json.dump(data,f)
    if 'alerts' in data:
        al = pd.DataFrame(data['alerts'])
        L1.append(al)
    if 'jams' in data:
        jm = pd.DataFrame(data['jams'])
        L2.append(jm)
    if 'irregularities' in data:
        irr = pd.DataFrame(data['irregularities'])
        L3.append(irr)

#combine the per-file DataFrames into one DataFrame per record type
al = pd.concat(L1)
jm = pd.concat(L2)
irr = pd.concat(L3)

al = transform_alerts(al)
jm = transform_jams(jm)
irr = transform_irreg(irr)

#reset_index returns a new DataFrame, so assign the result
al = al.reset_index(drop=True)
jm = jm.reset_index(drop=True)
irr = irr.reset_index(drop=True)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.




in jams



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


For any questions or suggestions, email miwata@analytics.nyc.gov.

In [10]:
'''
To do:
- How to combine every record with its stored filename
- How to add new records from the feed as files are being ETL'd
- How to append new records that come in
'''

"\nTo do:\n- How to combine every record with its stored filename\n- How to add new records from the feed as files are being ETL'd\n- How to append new records that come in\n"

In [18]:
al.reset_index(drop=True)

Unnamed: 0,uuid,city,state,country,confidence,location,magvar,thumbs_up,reliability,report_description,report_rating,road_type,street,subtype,type,pub_millis,pub_utc_date
0,cc657b5d-f45b-3dae-aef4-052c7eb11bcc,Woodbridge,NJ,US,1,"{'x': -74.26359, 'y': 40.559926}",0,0,9,Road Closed,0,,to Thomas A Edison Service Area,ROAD_CLOSED_EVENT,ROAD_CLOSED,1536579314691,2018-09-10T07:35:14.691000
1,68c3c7fe-e720-32ec-98e8-4d8252e5bc2a,Woodbridge,NJ,US,1,"{'x': -74.263735, 'y': 40.559672}",0,0,9,Road Closed,0,,,ROAD_CLOSED_EVENT,ROAD_CLOSED,1537100524441,2018-09-16T08:22:04.441000
2,45beaa65-7059-3083-a588-f9396881ee4b,Staten Island,NY,US,5,"{'x': -74.19371008872987, 'y': 40.58851264863442}",171,0,10,,0,2.0,,HAZARD_ON_ROAD_CONSTRUCTION,WEATHERHAZARD,1531966160000,2018-07-18T22:09:20
3,f3eec89f-532e-3e89-92fd-9a39249af58f,Staten Island,NY,US,5,"{'x': -74.19337749481203, 'y': 40.58887928499723}",171,0,10,,0,2.0,,HAZARD_ON_ROAD_CONSTRUCTION,WEATHERHAZARD,1530827975000,2018-07-05T17:59:35
4,89fa63e5-9fa9-3b5d-9d6a-d5e359bd68de,Carteret,NJ,US,0,"{'x': -74.216277, 'y': 40.582356}",18,0,7,,3,,Industrial Rd,HAZARD_ON_ROAD_POT_HOLE,WEATHERHAZARD,1538241977496,2018-09-29T13:26:17.496000
5,c66ce900-afb6-339b-8575-73029d420499,Carteret,NJ,US,0,"{'x': -74.232291, 'y': 40.597838}",24,0,6,,3,4.0,to Cars Only,JAM_STAND_STILL_TRAFFIC,JAM,1538251998291,2018-09-29T16:13:18.291000
6,4843fb56-5083-3f7a-9ed3-41a09a09eb70,Staten Island,NY,US,2,"{'x': -74.228874, 'y': 40.525965}",267,0,9,,2,4.0,to NY-440 S / Outerbridge Crossing / New Jersey,HAZARD_ON_SHOULDER_CAR_STOPPED,WEATHERHAZARD,1538252509249,2018-09-29T16:21:49.249000
7,8e4cc133-fa51-3cb5-8823-1662835d2978,Elizabeth,NJ,US,0,"{'x': -74.191085, 'y': 40.681085}",0,0,7,Construction,0,,Basilone Rd,ROAD_CLOSED_EVENT,ROAD_CLOSED,1537100525998,2018-09-16T08:22:05.998000
8,512e9cd3-1564-3fdd-936d-84e0ed1cd219,Elizabeth,NJ,US,0,"{'x': -74.189309, 'y': 40.681168}",0,0,7,Construction,0,,Earhart Dr,ROAD_CLOSED_EVENT,ROAD_CLOSED,1537100542466,2018-09-16T08:22:22.466000
9,32d535c5-4015-3ec2-b8e9-3770525c2540,Elizabeth,NJ,US,0,"{'x': -74.189449, 'y': 40.681968}",0,0,7,Construction,0,,Earhart Dr,ROAD_CLOSED_EVENT,ROAD_CLOSED,1530422387958,2018-07-01T01:19:47.958000
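
For the first to-do item (pairing every record with the file it came from), one option is to tag the records before they are turned into DataFrames. This is a sketch under assumptions: `tag_records` and the `source_file` field are names invented here, and the miniature feed is made up to stand in for the parsed JSON of a real file.

```python
def tag_records(feed, source_file, keys=("alerts", "jams", "irregularities")):
    """Return {key: [records]} with a 'source_file' field added to every record."""
    tagged = {}
    for key in keys:
        # dict(rec, source_file=...) copies the record, leaving the input untouched
        tagged[key] = [dict(rec, source_file=source_file) for rec in feed.get(key, [])]
    return tagged


# Made-up miniature feed standing in for the parsed JSON of one raw file.
demo_feed = {"alerts": [{"uuid": "abc", "type": "JAM"}], "jams": []}
demo_tagged = tag_records(demo_feed, "wazeprocessorraw_2018-06-22T20:04:00.3740299Z.json")
print(demo_tagged["alerts"][0]["source_file"])
```

Each tagged list can then be passed to `pd.DataFrame` as in the download cell, so the filename travels with every row through the concatenation step.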
