# 11.S188/11.S952 Hack the City: Applied Data Science for Public Good

# Hacking Workshop 2: Data Acquisition


## Overview
In this session, we will guide you through three major approaches to acquiring data: (1) downloading files directly, (2) accessing data from the Web, and (3) fetching data via an API. We provide several tutorials that demonstrate basic techniques for data acquisition. Because webpages and APIs vary case by case, we encourage a quick consultation if you are interested in testing a new API.

### 1. Access Data from A File

#### Workshop Demo: Cambridge Open Data Portal
We will go through the [Cambridge Open Data Portal](https://www.cambridgema.gov/departments/opendata) in the workshop session.

Most city open data portals allow you to download data directly in CSV, Excel, or shapefile format. For example, you can visit the Cambridge Open Data Portal (https://data.cambridgema.gov/browse), select a dataset by clicking "View Data", and download it by clicking "Export" and choosing the preferred format (CSV, Excel, JSON, shapefile, etc.).

Some Tips:

- Creating a dedicated folder for raw data (for example, named "input") helps you manage a large number of datasets in a project. As a hacking project progresses, you may acquire multiple datasets from various sources and generate intermediate or output data, so consider saving them in separate folders (for example, input data, processed data, output data).

- Pay attention to file names when downloading datasets directly. If multiple files follow a similar naming convention, you can download, read, and concatenate them in a batch.

- Normally, you can download a file and save it to your local drive. However, this becomes inefficient when you are working with large files. A better way is to load the file directly from its URL and read the data in Python (see below). This approach still uses your local machine's memory to hold the data. If you are working with an extremely large dataset, you may need to consider [loading data line by line](https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line) or [using pandas with certain techniques](https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c).
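
The first two tips (a dedicated input folder, and batch-reading files that share a naming convention) can be sketched with the standard library alone. This is a minimal illustration, not a required workflow: the helper `read_csv_batch`, the folder name `input`, and the `trips_*.csv` file names and values are all made up for the demo.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_batch(folder, pattern):
    """Read every CSV in `folder` whose name matches `pattern` and concatenate the rows."""
    rows = []
    for path in sorted(Path(folder).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["source_file"] = path.name  # remember where each row came from
                rows.append(row)
    return rows


# Demo inside a temporary project folder (file names and values are invented).
with tempfile.TemporaryDirectory() as project:
    input_dir = Path(project) / "input"
    input_dir.mkdir()
    for year, count in [(2019, 10), (2020, 7)]:
        (input_dir / f"trips_{year}.csv").write_text(
            f"year,trips\n{year},{count}\n", encoding="utf-8"
        )

    all_rows = read_csv_batch(input_dir, "trips_*.csv")

print(len(all_rows), "rows from", sorted({r["source_file"] for r in all_rows}))
```

With pandas, the same idea is commonly written as `pd.concat` over a list of `pd.read_csv` results.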

In [None]:
import csv
import requests

# To access a file via its url:
url = 'https://tidesandcurrents.noaa.gov/sltrends/data/8443970_meantrend.csv'

with requests.Session() as s:
    download = s.get(url)

    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    
    for row in list(cr)[:10]:   # print the first 10 rows
        print(row)

In [None]:
# If the size of your data is manageable, you can load the CSV file directly as a DataFrame using pandas:
import pandas as pd

url = 'https://tidesandcurrents.noaa.gov/sltrends/data/8443970_meantrend.csv'

df = pd.read_csv(url)
df.head()
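
When a file is too large to hold in memory, as mentioned in the tips above, one option is to process it row by row and keep only a running summary. The sketch below assumes a CSV with a header row; the helper `mean_of_column` and the sample values are invented for illustration.

```python
import csv
import io


def mean_of_column(lines, column):
    """Average a numeric column while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Made-up sample standing in for a large file. Any iterable of text lines works,
# for example an open file object.
sample = io.StringIO("year,level\n2000,1.0\n2001,2.0\n2002,3.0\n")
print(mean_of_column(sample, "level"))
```

pandas offers a similar pattern: passing `chunksize` to `pd.read_csv` yields the file as a sequence of smaller DataFrames instead of one large one.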

### 2. Access Data from the Web
The __urllib.request__ module or the __requests__ package can be used to download data from the web. Retrieving data over HTTP/HTTPS typically proceeds as follows:
- Open the URL with the function __urlopen()__ to obtain a handle. Note that this handle is a read-only, file-like object, so you need to call __read()__, __readline()__, or __readlines()__ to access the actual data.


- Parse the data and extract the values you are interested in. Parsing techniques vary with the document format (XML, HTML, txt). __BeautifulSoup__ is a popular package for parsing and navigating HTML documents in Python. We provide a detailed example below demonstrating how to use urllib.request and BeautifulSoup to get data from HTML web pages.


- When you need data from multiple web pages whose URLs are structured similarly, you can programmatically (in a for loop, for example) access and parse many pages and save the parsed data into a list, dictionary, or DataFrame.

__NOTE:__ When processing a large number of web pages, some URLs may fail to open, so it is necessary to add exception handling for that situation.
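
The note on exception handling can be sketched as follows. This is one possible pattern, not the only one: the helper `fetch_pages` and its `opener` parameter are hypothetical names, and the demo swaps in a fake opener so it runs without network access.

```python
import io
from urllib.error import URLError
from urllib.request import urlopen


def fetch_pages(urls, opener=urlopen):
    """Return (pages, failures): raw bytes per URL, and the error for each URL that failed."""
    pages, failures = {}, {}
    for url in urls:
        try:
            with opener(url) as handle:
                pages[url] = handle.read()
        except URLError as err:  # HTTPError is a subclass of URLError
            failures[url] = err  # record the problem and keep going
    return pages, failures


# Offline demo: a stand-in opener that fails for one (made-up) URL.
def fake_opener(url):
    if "bad" in url:
        raise URLError("unreachable")
    return io.BytesIO(b"<html>ok</html>")


pages, failures = fetch_pages(
    ["http://good.example/page", "http://bad.example/page"], opener=fake_opener
)
print("fetched:", list(pages))
print("failed:", list(failures))
```

In real use you would call `fetch_pages(urls)` with the default `urlopen` and inspect `failures` afterwards to decide whether to retry.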

#### Access HTML files
HTML is a human-readable markup language for the web. You may find the structure of an HTML document familiar if you have ever written a website. In short, HTML contains text and predefined tags (enclosed in angle brackets <>) that control how the text is presented.
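
To make the tag structure concrete, here is a small sketch that pulls the text out of `<strong>` tags using only the standard library's `html.parser` (the tutorial below uses BeautifulSoup instead). The HTML snippet and the class name `StrongTextParser` are invented for illustration and are not copied from any real page.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collect the text found inside <strong>...</strong> tags."""

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.found.append(data.strip())


# Hypothetical document: nested tags wrap one sentence of text.
html_doc = (
    "<html><body><table><tr><td>"
    "<strong>The index is 85</strong>"
    "</td></tr></table></body></html>"
)

parser = StrongTextParser()
parser.feed(html_doc)
print(parser.found)
print(parser.found[0].split("is ")[-1])  # keep only the text after "is "
```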

#### Tutorial: Parsing a ZIP-code-level housing affordability index from a website
Visit the website https://massachusetts.hometownlocator.com/zip-codes/ and observe how the URL changes when you browse the individual web page for a particular ZIP code.

In [None]:
from bs4 import BeautifulSoup 
from urllib.request import urlopen

# Let's look at this webpage for zipcode 02139:
url = "https://massachusetts.hometownlocator.com/zip-codes/data,zipcode,02139.cfm"
html = urlopen(url)
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly to avoid a warning

In [None]:
#print (bsObj)
#print (bsObj.body)
#print (bsObj.body.td)
#print (bsObj.body.td.strong.string)
#print (bsObj.body.td.strong.string.split('is '))
#print (bsObj.body.td.strong.string.split('is ')[-1])

In [None]:
# Load the page and parse the ZIP-code-level affordability index:
from bs4 import BeautifulSoup
from urllib.request import urlopen

def get_affordability_index(zipcode):
    """Return the affordability index for a ZIP code given as a string, e.g. '02139'."""
    url = "https://massachusetts.hometownlocator.com/zip-codes/data,zipcode," + zipcode + ".cfm"
    html = urlopen(url)
    bsObj = BeautifulSoup(html.read(), "html.parser")
    # This assumes the first <strong> in the first <td> holds text ending in "is <index>"
    affordability_index = bsObj.body.td.strong.string.split('is ')[-1]

    return int(affordability_index)

In [None]:
print (get_affordability_index('02139'))

In [None]:
# Now we can run this function on a list of ZIP codes:
for zipcode in ['02139', '02324', '02466', '02642', '02145']:
    print (zipcode,':',get_affordability_index(zipcode))
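
One practical detail when building these URLs in a loop: Massachusetts ZIP codes start with a zero, which is lost if the codes are ever stored as integers (for example, after being read from a spreadsheet). The sketch below shows one way to guard against that; the helper `zipcode_url` is a hypothetical name, and the URL pattern is the one used in this tutorial.

```python
BASE_URL = "https://massachusetts.hometownlocator.com/zip-codes/data,zipcode,{}.cfm"


def zipcode_url(zipcode):
    """Build the page URL for a ZIP code, restoring a leading zero lost to int conversion."""
    return BASE_URL.format(str(zipcode).zfill(5))


print(zipcode_url(2139))     # integer input: the leading zero is padded back
print(zipcode_url("02139"))  # string input: unchanged
```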

#### Access txt files


In [None]:
# This URL points to a txt file reporting sea level changes in the Boston area.

from urllib.request import urlopen

url = "https://tidesandcurrents.noaa.gov/sltrends/data/8443970_meantrend.txt"
data = urlopen(url) 

# for line in data: 
#     print (line.decode('utf-8'))

In [None]:
# Alternatively, you can use the requests library, which decodes the response body into text for you (response.text).
import requests
response = requests.get(url)

#print (response.text)

In [None]:
# Similarly, if the size of your data is manageable, you can load the txt file directly as a DataFrame using pandas:
import pandas as pd

df = pd.read_csv(url, sep=r'\s+')  # This txt file uses runs of spaces to separate columns
df.head()
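
To see what a whitespace separator such as `\s+` does, the sketch below splits a few whitespace-aligned lines by hand. The sample text is made up and only mimics this kind of layout; the columns of the real file may differ.

```python
import re

# Invented lines in a whitespace-aligned layout.
raw_text = """Year  Month   Value
2000      1   0.012
2000      2  -0.034
"""

lines = raw_text.strip().splitlines()
header = re.split(r"\s+", lines[0].strip())  # any run of spaces counts as one separator
rows = [dict(zip(header, re.split(r"\s+", line.strip()))) for line in lines[1:]]

print(header)
print(rows)
```

Note that every value comes back as a string here, whereas pandas also infers numeric types for you.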

### 3. Fetch Data via API

An application programming interface (API) is an interface or communication protocol between different parts of a computer program intended to simplify the implementation and maintenance of software ([Wikipedia](https://en.wikipedia.org/wiki/Application_programming_interface)). There are various APIs provided by public and private sectors that are related to cities, with different specifications and data fetching procedures (typically you can follow a tutorial to get started). 

Commonly, you fetch data with a GET request, similar to how you request a web page. You can then pass parameters in a query string to get data for a particular region, time range, or other criteria. Because many APIs only allow a limited number of requests, it is important to define a proper scope with these parameters.
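
As a rough illustration of how parameters end up in a query string, the sketch below builds a request URL with the standard library. The endpoint `api.example.com` and the parameter names are hypothetical; each real API documents its own.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameter names, for illustration only.
base_url = "https://api.example.com/v1/records"
params = {"city": "Cambridge", "start_date": "2020-01-01", "limit": 100}

request_url = base_url + "?" + urlencode(params)
print(request_url)

# The server sees the same key/value pairs after decoding the query string.
decoded = parse_qs(urlsplit(request_url).query)
print(decoded)
```

`requests.get(base_url, params=params)` builds the same kind of URL for you, which is the form used in the tutorial below.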

We provide a tutorial below demonstrating how to fetch data using an OpenStreetMap (OSM) related API. Please note that APIs differ in their specifications and usage rules, so it is necessary to read the manual or tutorial carefully before accessing a new API. Some APIs (for example, Zillow and Twitter) may require an account token for authentication. Normally you can obtain one by registering a developer account and requesting a token.
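
A token is usually sent along with each request, often in an HTTP header. The sketch below shows one common arrangement, with the token read from an environment variable so it does not get saved in the notebook. The endpoint, the variable name `EXAMPLE_API_TOKEN`, and the `Bearer` header scheme are assumptions; check each API's documentation for what it actually expects.

```python
import os
from urllib.request import Request

# Hypothetical endpoint, environment variable, and header scheme.
API_URL = "https://api.example.com/v1/listings"
token = os.environ.get("EXAMPLE_API_TOKEN", "replace-with-your-token")

# Build the request object; nothing is sent over the network here.
request = Request(API_URL, headers={"Authorization": f"Bearer {token}"})

print(request.full_url)
print(sorted(request.headers))  # header names only, so the token is not printed
```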


#### Tutorial: OpenStreetMap (OSM) Overpass API

The Overpass API (formerly known as OSM Server Side Scripting, or OSM3S before 2011) is a read-only API that serves up custom selected parts of the OSM map data. It acts as a database over the web: the client sends a query to the API and gets back the data set that corresponds to the query (source: https://wiki.openstreetmap.org/wiki/Overpass_API).

Unlike the main OSM API, which is optimized for editing, the Overpass API is optimized for data consumers that need anything from a few elements almost instantly up to roughly 10 million elements in some minutes, selected by search criteria such as location, type of objects, tag properties, proximity, or combinations of them. It acts as a database backend for various services.

#### Get existing bus stops from OSM
You will define the map extent as a bounding box for loading OSM data. You can specify a bounding box as (south, west, north, east) in latitude and longitude. Bus stops are nodes in OSM. Inside the query, you can also specify which type of nodes to acquire based on their tags.

Learn more about how to query data via the Overpass API: https://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL
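
Because the bounding-box order is easy to mix up, it can help to build the query from named arguments. The helper `bus_stop_query` below is a hypothetical convenience, not part of any Overpass client library; it only assembles the query text and sends nothing.

```python
def bus_stop_query(south, west, north, east):
    """Return an Overpass QL query for bus-stop nodes inside a bounding box."""
    if south >= north:
        raise ValueError("south must be smaller than north")
    bbox = f"{south},{west},{north},{east}"  # order: south, west, north, east
    return f'[out:json];\nnode["highway"="bus_stop"]\n({bbox});\nout center;'


query = bus_stop_query(south=42.291298, west=-71.19126, north=42.447343, east=-70.971799)
print(query)
```

The resulting string can be passed as the `data` parameter of the request, in place of a hand-written query.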

In [None]:
import pandas as pd
import numpy as np
import requests
import json

# Cambridge, MA and the surrounding Boston area
overpass_url = "https://lz4.overpass-api.de/api/interpreter"

overpass_query = """
[out:json];
node["highway"="bus_stop"]
(42.291298,-71.19126,42.447343,-70.971799); 
out center;
"""
response = requests.get(overpass_url, params={'data': overpass_query})
data = response.json()

stop_list = data.get('elements')
print ("Total number of bus stops: ",len(stop_list))

In [None]:
print (stop_list[0])

In [None]:
# Convert the list of stop dictionaries to a DataFrame.
# DataFrame.append was removed in pandas 2.0, so build the frame in a single call.
# Each row is one bus stop; the 'tags' column keeps the stop's full tag dictionary.
df = pd.DataFrame(stop_list)[['id', 'lat', 'lon', 'tags']]

# Collect coords into list
coords = []

for element in data['elements']:
    if element['type'] == 'node':
        lon = element['lon']
        lat = element['lat']
        coords.append((lon, lat))
    elif 'center' in element:
        lon = element['center']['lon']
        lat = element['center']['lat']
        coords.append((lon, lat))

In [None]:
# NOTE: it is not always necessary to create this DataFrame.
# For example, if you plan to visualize data in D3.js, 
# you may want to maintain its json format or convert dataframe to json.
# (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html)

#df.head()
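
If you plan to keep the data as JSON for a web visualization, one option is to reshape the elements into GeoJSON. The sketch below is a minimal version under stated assumptions: the helper `to_geojson` is a hypothetical name, the sample element is made up in the shape of the Overpass output, and only node elements are handled.

```python
import json


def to_geojson(elements):
    """Convert Overpass node elements into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [el["lon"], el["lat"]]},
            "properties": {"id": el["id"], **el.get("tags", {})},
        }
        for el in elements
        if el.get("type") == "node"
    ]
    return {"type": "FeatureCollection", "features": features}


# Invented element for the demo.
sample = [{"type": "node", "id": 1, "lat": 42.36, "lon": -71.09,
           "tags": {"highway": "bus_stop", "name": "Example Stop"}}]

geojson_text = json.dumps(to_geojson(sample), indent=2)
print(geojson_text)
```

With real data you would pass `stop_list` instead of `sample` and write `geojson_text` to a file in your output folder.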

In [None]:
# Now we use matplotlib to visualize all bus stops based on their coordinates:
# We will look into matplotlib more in a later workshop session on plotting data in python.

import matplotlib.pyplot as plt
%matplotlib inline

X = np.array(coords)  # Convert coordinates into a NumPy array

plt.plot(X[:, 0], X[:, 1], 'o', markersize=1)
plt.title('Bus Stops in Boston Area')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.axis('equal')
plt.show()

Many city open data portals provide APIs, so you can access data beyond direct downloading. In addition, here are some popular APIs that are relevant to urban science. Happy coding!

[Zillow](https://www.zillow.com/howto/api/APIOverview.htm)

[Uber](https://developer.uber.com/)

[Airbnb](https://www.airbnb.com/partner)

[Foursquare](https://developer.foursquare.com/)

[Yelp](https://www.yelp.com/developers)

[Education Data Explorer by Urban Institute](https://educationdata.urban.org/documentation/#about_the_api)

[Bing Maps API](https://www.microsoft.com/en-us/maps/choose-your-bing-maps-api)

[Boston MBTA API](https://www.mbta.com/developers/v3-api)

[Coord by Sidewalk Labs](https://www.coord.com/api-developers)
