# Week 4: APIs and Distance Metrics

Today we continue learning how to gather data. So far we have discussed open data portals and how to get US Census data from the Census Bureau API. We will now look at other types of APIs and learn how to use the Google Maps API to geocode places, get directions, and more. Along the way, we will also learn about different distance measurements.

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd

### Brief aside: dealing with strings

Recall: strings are sequences of text made up of individual characters. Here are a few important ways to manipulate them:

In [None]:
# Recall that strings can be indexed just like lists
address = "921 University Ave, Ithaca, NY 14853"
address[0:10]

In [None]:
# You can also replace chunks of text with another
# The format is x.replace(old_text, new_text)
address.replace("University", "College")

In [None]:
# This is helpful when you want to remove characters: replace them with an empty string, ""
address.replace("921", "")

In [None]:
# It will replace ALL instances of the old text, for example every space:
address.replace(" ", "")

In [None]:
# You can also concatenate or join text in a list together into one string:
# The format is "separator".join(list)

address_chunks = ["921", "University", "Ave", "Ithaca", "NY", "14853"]
# You usually want to add a space between words so the separator can be " "
" ".join(address_chunks)

In [None]:
# But it can be any character that you want
"*".join(address_chunks)

In [None]:
# Let's look at our list of New York housing listings

nyc_housing = pd.read_csv('newyork_housing.csv')
nyc_housing.head()

In [None]:
# With dataframes, you can also combine text in a few different ways:

# For example, let's save the city and state
nyc_housing['citystate'] = nyc_housing['address/city'] + ", " + nyc_housing['address/state']
nyc_housing.head()

In [None]:
# We can also use .apply and .join

nyc_housing['citystate'] = nyc_housing.apply(lambda x: ", ".join([x['address/city'], x['address/state']]), axis=1) 
# axis=1 applies the function to each row, so the lambda can read more than one column at once

In [None]:
# Error! .join expects strings and it looks like sometimes we have NaN values
# We can REPLACE the NaNs with empty strings first and then ensure the type is a string
# (calling .astype(str) first would turn each NaN into the text 'nan')

nyc_housing['address/city'] = nyc_housing['address/city'].fillna('').astype(str)
nyc_housing['address/state'] = nyc_housing['address/state'].fillna('').astype(str)
nyc_housing['citystate'] = nyc_housing.apply(lambda x: ", ".join([x['address/city'], x['address/state']]), axis=1) 
nyc_housing.head()

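Another way to sidestep the NaN problem, without overwriting the original columns, is a small helper that skips missing pieces before joining. This is only a sketch: `join_parts` is a hypothetical helper name (not a pandas function) and the sample values are made up. Under those assumptions it could be used as `nyc_housing.apply(lambda x: join_parts([x['address/city'], x['address/state']]), axis=1)`.

```python
import math


def join_parts(parts, sep=", "):
    """Join address pieces into one string, skipping missing values."""
    cleaned = []
    for part in parts:
        # pandas represents missing values as a float NaN (or sometimes None)
        if part is None or (isinstance(part, float) and math.isnan(part)):
            continue
        text = str(part).strip()
        if text:  # also skip empty strings so we never get a dangling separator
            cleaned.append(text)
    return sep.join(cleaned)


print(join_parts(["Ithaca", "NY"]))      # Ithaca, NY
print(join_parts([float("nan"), "NY"]))  # NY
```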
In [None]:
## YOUR TURN
## Create a new column that has the whole string address 

## HINT: You will want to convert the zip codes to integers (and then to strings) before joining


## Google Maps API

One powerful API we can use is from Google Maps.

First, we need to get set up in the Google Cloud Console. 

Once we are set up, we can import the `googlemaps` package and insert our API key. 

In [None]:
!pip install googlemaps

In [None]:
import googlemaps
from gmaps_key import API_KEY # Add your key to the gmaps_key.py file

# Just like with the census data, we need to provide our API_KEY to the googlemaps library.
gmaps = googlemaps.Client(key=API_KEY)

In [None]:
# We can get the latitude-longitude location of an address

geocode_result = gmaps.geocode("921 University Ave, Ithaca, NY 14853")
geocode_result

In [None]:
# This is a list of length = 1 because we gave it one address
# Inside is a dictionary

# We can get just the location we need:
geocode_result[0]['geometry']['location'] # It is really nested in there!

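Because the location is nested so deeply, it can help to wrap the lookup in a small function. The sketch below works on a made-up response that mimics the nesting shown above, so it runs without an API key. `extract_lat_lng` and `mock_result` are hypothetical names, and the coordinates are illustrative rather than real geocoder output. It also assumes that an address with no match comes back as an empty list.

```python
def extract_lat_lng(geocode_result):
    """Return (lat, lng) from a geocode response, or (None, None) if it is empty."""
    if not geocode_result:
        return None, None
    location = geocode_result[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Made-up response with the same nesting as the real one (values are illustrative)
mock_result = [
    {
        "formatted_address": "921 University Ave, Ithaca, NY 14853, USA",
        "geometry": {"location": {"lat": 42.45, "lng": -76.48}},
    }
]

print(extract_lat_lng(mock_result))  # (42.45, -76.48)
print(extract_lat_lng([]))           # (None, None)
```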
In [None]:
# We can apply the API to a set of addresses from a dataframe:
# Looping through each row of a dataframe uses .iterrows() like this

# First let's create a temporary smaller dataframe to make this quicker
# (.copy() lets us add columns later without a SettingWithCopyWarning)
tmp = nyc_housing.iloc[0:100].copy()

# Now, let's save the lat and lon of each address
all_lats, all_lons = [],[]
for idx, row in tmp.iterrows():
    location = gmaps.geocode(row['full_address'])
    lat = location[0]['geometry']['location']['lat']
    lon = location[0]['geometry']['location']['lng']
    all_lats.append(lat)
    all_lons.append(lon)

tmp['geocode_lat'] = all_lats
tmp['geocode_lon'] = all_lons
tmp.head()
    


In [None]:
# We can also go the opposite direction. Take the lat/lon values and get an address.

house = tmp.iloc[0][['full_address', 'geocode_lat', 'geocode_lon']]
## Note, that the format here is Lat, Lng! (y, x)
reverse_geocode_result = gmaps.reverse_geocode((house['geocode_lat'],house['geocode_lon']))
reverse_geocode_result

In [None]:
print(reverse_geocode_result[1]['formatted_address'])
print(house['full_address'])

In [None]:
## YOUR TURN 
## Compute the lat/lon coordinates of a set of listings


In [None]:
## Load in our Citi Bike stations and compute the distance between them and the listings


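As a starting point for the distance exercise, here is a minimal sketch of two common distance measurements on lat/lon coordinates: the great-circle (haversine) distance and a rough "city block" distance. The function names, the Earth radius constant, and the sample coordinates are assumptions for illustration, not values from our datasets. The city block version also assumes a street grid aligned north-south and east-west, which real cities rarely follow exactly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (an approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def manhattan_km(lat1, lon1, lat2, lon2):
    """Rough city block distance: a north-south leg plus an east-west leg."""
    north_south = haversine_km(lat1, lon1, lat2, lon1)
    east_west = haversine_km(lat2, lon1, lat2, lon2)
    return north_south + east_west


# Hypothetical listing and bike station coordinates (lat, lon)
listing = (40.7128, -74.0060)
station = (40.7306, -73.9866)

print(round(haversine_km(*listing, *station), 2), "km as the crow flies")
print(round(manhattan_km(*listing, *station), 2), "km along a grid")
```

The straight-line distance is always the shorter of the two, which is one reason the choice of distance measurement matters when we ask how "close" a listing is to a station.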

## (Optional) Webscraping

Webscraping is a method of programmatically retrieving data from websites. It is used when there is no API or when the data published on a website isn't made available in a downloadable format.

#### A note on webscraping:
Webscraping can be used for all kinds of malicious purposes, for instance, to copy website content and republish it. Here's a [complaint from Craigslist](https://www.scribd.com/doc/313832868/CraigslistVRadpad-Complaint?secret_password=7gTybamKvrbeVhxfi4mx) about a company called Radpad scraping Craigslist and reposting those listings on their own website:

<mark>
“[The content scraping service] would, on a daily basis, send an army of digital robots to craigslist to copy and download the full text of millions of craigslist user ads. [The service] then indiscriminately made those misappropriated listings available—through its so-called ‘data feed’—to any company that wanted to use them, for any purpose. Some such ‘customers’ paid as much as $20,000 per month for that content…”</mark>
<br>
<br>

<mark>
According to the claim, scraped data was used for spam and email fraud, among other activities: </mark>
<br>
<br>

<mark>
“[The defendants] then harvest craigslist users’ contact information from that database, and initiate many thousands of electronic mail messages per day to the addresses harvested from craigslist servers…. [The messages] contain misleading subject lines and content in the body of the spam messages, designed to trick craigslist users into switching from using craigslist’s services to using [the defendants’] service…”
</mark>
<br>
<br>

Uff. 

**What about webscraping for research or academic purposes?** Most of the above issues most likely won't apply to you, but webscraping makes a website's traffic *spike* if you don't modulate how often you're pinging the website. This can cause the website's server to crash. This is not very nice. Also, a lot of websites won't allow you to do it. (If you go to the root of almost any website and put `/robots.txt` after it, you can see a list of paths that site will or won't allow you to scrape.)

(adapted from Wenfei Xu)

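Both concerns above (checking `robots.txt` and not hammering the server) can be sketched with Python's standard library `urllib.robotparser`. Everything site-specific here is an assumption: the sample `robots.txt` text is made up, `build_parser` and `polite_fetch_all` are hypothetical helper names, and `fetch` stands in for whatever function actually downloads a page. Not every site sets a `Crawl-delay`, so the sketch falls back to a default pause.

```python
import time
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one lives at the site's root, under /robots.txt
SAMPLE_ROBOTS_TXT = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.strip().splitlines())
    return parser


def polite_fetch_all(urls, fetch, parser, user_agent="*", default_delay=1.0):
    """Call fetch(url) for each allowed URL, pausing between requests."""
    delay = parser.crawl_delay(user_agent) or default_delay
    pages = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # the site asks us not to scrape this path
        pages[url] = fetch(url)
        time.sleep(delay)  # spread requests out so traffic does not spike
    return pages


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("*", "https://example.com/listings"))      # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
print(parser.crawl_delay("*"))                                    # 5
```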
However, if you webscrape ethically and legally, the place to start in Python is Beautiful Soup (the `beautifulsoup4` package). Let's install it.

In [None]:
!pip install beautifulsoup4

In [None]:
from bs4 import BeautifulSoup
import requests

Here, we are importing BeautifulSoup from the bs4 library, which we will use to parse the HTML data we scrape from a website. We are also importing `requests`, which we will use to send HTTP requests to the website and retrieve the HTML data.

The next step is to send a request to the website and retrieve the HTML data. You can do this using the `requests.get()` method, which takes the website URL as a parameter and returns a response object. The HTML content of the page is stored in that object's `.text` attribute.

In [None]:
## This is a list of Los Angeles houses available for purchase
url = 'https://www.mlslistings.com/Search/Result/los-angeles/1?criteria=H4sIAAAAAAAACnWPwU7DMBBEfwXtFRttHNdpfEOFAzekph-wcbbBUuRU9ga1Qvw7CoQDB06jndHb0XxAYcrh7VQ4H3niIHFOLwN4VFuyHlD1ja1DOGtHfaWtdajbintd01Bbt3fnxrTwS3R8FfAwzeWO0sgTlz9Rd7sweAAFkikV-q7czH65gYIpFolpPArJUsDDY5D4zupHTmngfJiTZAqiXjkNMY2gIGQm4SeS9Y9B4zRWGvedQW8aX-8ejMUGd-09okdcW6jI85XDIrxO_BdxZkM-vwBP4133LQEAAKqVmB1bu5Y1yo6-ZKtZzPNRqElLF3R7PAK_NhXGODxr'
# First we set the "User-Agent" header so the request looks like it comes from a web browser
headers = requests.utils.default_headers()
ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
headers.update({"User-Agent": ua})

response = requests.get(url, headers=headers)
# What is in response?
response

There are a number of "response codes". Generally, anything in the 200s is good, and anything in the 400s or 500s signals an error (400s on the client side, 500s on the server side).

Here is the full list of codes: [https://en.wikipedia.org/wiki/List_of_HTTP_status_codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)

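To make the code families concrete, here is a small sketch that classifies a status code by its first digit, using the standard library's `http.HTTPStatus` for the official phrase. `describe_status` is a hypothetical helper name. With `requests`, the number itself is available as `response.status_code`, and `response.raise_for_status()` raises an error for 400s and 500s.

```python
from http import HTTPStatus


def describe_status(code):
    """Classify an HTTP status code by its hundreds digit."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    family = families.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognized code"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
# 200 OK (success)
# 404 Not Found (client error)
# 503 Service Unavailable (server error)
```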
Once we have retrieved the HTML data, the next step is to parse it using Beautiful Soup. We can do this by creating a BeautifulSoup object and passing the HTML data as a parameter.

The second parameter `'html.parser'` specifies the parser to use for parsing the HTML data. In this case, we are using the HTML parser built into Python's standard library, so nothing extra needs to be installed.

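Before parsing a real page, it can help to see the same steps on a tiny piece of HTML. The sketch below uses made-up markup that only loosely imitates a listings page. The class names, addresses, and prices are assumptions for illustration, and the variables are prefixed with `sample_` so they don't overwrite anything else in the notebook.

```python
from bs4 import BeautifulSoup

# Made-up HTML that loosely imitates a listings page
sample_html = """
<div class="card">
  <h5 class="card-title listing-address">123 Example St, Los Angeles, CA</h5>
  <strong class="listing-price d-block">$750,000</strong>
</div>
<div class="card">
  <h5 class="card-title listing-address">456 Sample Ave, Los Angeles, CA</h5>
  <strong class="listing-price d-block">$1,200,000</strong>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Passing a single class name to class_ matches any tag that includes that class
sample_prices = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all("strong", class_="listing-price")
]
sample_addresses = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all("h5", class_="listing-address")
]

print(sample_prices)     # ['$750,000', '$1,200,000']
print(sample_addresses)  # ['123 Example St, Los Angeles, CA', '456 Sample Ave, Los Angeles, CA']
```

`find_all` takes a tag name plus optional filters such as `class_`, and `get_text(strip=True)` returns the text inside each tag with surrounding whitespace removed.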

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
# What is in soup?
soup

In [None]:
# Collect the text of every tag that matches each tag name and class string
prices = []
for tag in soup.find_all('strong', class_='listing-price d-block pull-left pr-25'):
    prices.append(tag.get_text(strip=True))

beds = []
for tag in soup.find_all('strong', class_='info-item-value d-block pull-left pr-25'):
    beds.append(tag.get_text(strip=True))

addresses = []
for tag in soup.find_all('h5', class_='card-title font-weight-bold listing-address mb-25'):
    addresses.append(tag.get_text(strip=True))


In [None]:
prices