# Brunching in Chicago
Time Out Chicago (Amy Cavanaugh, 8th March 2016) published an [article](http://www.timeout.com/chicago/restaurants/20-best-brunch-spots-in-chicago) listing the 20 best brunch spots in Chicago. Yum! I want to see all the restaurant locations on a map and find the closest one to me that I haven't already visited. To do this, I need to scrape the Time Out Chicago website for the restaurant names and locations, then convert the addresses to latitude and longitude co-ordinates so that I can calculate distances and plot the restaurants on a map.

> This project is inspired by Lecture 2 (about data scraping) of Harvard's CS109 Data Science course and is adapted from a number of the examples in the relevant lecture notes, including the example taken from Katharine Jarmul's talk about data scraping at PyCon 2014. I'm using this project to practice the skills taught in these lectures.

Let's start by importing all the libraries needed:

In [1]:
# All imports
import requests
import urllib2
import bs4
import socket
import re
import time
from pygeocoder import Geocoder
from math import radians, cos, sin, asin, sqrt
import folium
from IPython.core.display import HTML
from collections import OrderedDict

import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

I am using `requests` to download the page and `BeautifulSoup` to parse the HTML from the website: http://www.timeout.com/chicago/restaurants/20-best-brunch-spots-in-chicago.

In [2]:
url = 'http://www.timeout.com/chicago/restaurants/20-best-brunch-spots-in-chicago'

# While urllib2 works, CS109 lab 2 suggests becoming acquainted with requests
# source = urllib2.urlopen(url).read()
# tree = bs4.BeautifulSoup(source, 'lxml')

req = requests.get(url)
page = req.text
tree = bs4.BeautifulSoup(page, 'html.parser')

# print tree.prettify()

Inspecting the elements of the website, the restaurant names are contained in `<h3>` tags, which are children of `<div>` tags with the class `"feature-item__column"`. There are 40 such `<div>` tags, two for each restaurant (one for an image and the other for text). The following code takes the restaurant names from the `<h3>` tags using `.findChild()`, which returns the first child of each `<div>`.

Stepping down to the next child, the `<h3>` tags contain an `<a>` tag with the internal link to the Time Out page for each individual restaurant (which contains a more detailed description of the restaurant including reviews, average prices, location and transport). Using the `.findChild()` method twice, I can extract the `href` attribute for each restaurant and then follow these links to get additional information on the restaurants (such as the physical address).

In [3]:
# Get names of restaurants
div_list = tree.find_all('div', 'feature-item__column')
div_text = [t.findChild().text.strip() for t in div_list]
brunch_names = [x for x in div_text if x] # Removes empty elements in div_text list

# Get internal links to Time Out restaurant pages
child_text = [s.findChild().findChild().get('href') for s in div_list]
brunch_links = list(OrderedDict.fromkeys(child_text)) # Removes duplicate links, preserving order of list
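
To make the two clean-up steps concrete, here is a minimal sketch that runs without touching the website. The `div_text` and `child_links` lists are hypothetical stand-ins for what the 40 `<div>` tags might yield (an empty string for each image column, and each link appearing twice); only the filtering and de-duplication logic mirrors the cell above.

```python
from collections import OrderedDict

# Hypothetical stand-ins for the scraped values: each restaurant shows up
# twice, once for the image column (no text) and once for the text column.
div_text = ["", "Longman & Eagle", "", "Au Cheval"]
child_links = [
    "/chicago/restaurants/longman-eagle",
    "/chicago/restaurants/longman-eagle",
    "/chicago/restaurants/au-cheval",
    "/chicago/restaurants/au-cheval",
]

# Drop the empty strings contributed by the image columns
names = [text for text in div_text if text]

# OrderedDict keys are unique and keep first-seen order, so this removes
# the duplicate links without shuffling the restaurants
links = list(OrderedDict.fromkeys(child_links))

# Names and links now line up one-to-one
pairs = list(zip(names, links))
```

Because both lists keep their original order, position `i` in `names` and position `i` in `links` refer to the same restaurant, which is what the later cells rely on.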

The variable `brunch_names` is now a list of 20 restaurant names, and `brunch_links` is a list of 20 internal links. Inspecting the elements of the Time Out webpage for the first restaurant (Longman & Eagle), the physical address of the restaurant is contained within a `<td>` tag. There are multiple `<td>` tags on the page, but the address is in the same location on each page: the 3rd row of the "Details" table.

In [4]:
brunch_addresses = []
counter = 0

for link in brunch_links:
    counter += 1  
    url = "http://www.timeout.com" + link
    req = requests.get(url)
    html_page = req.text
    html_tree = bs4.BeautifulSoup(html_page, 'html.parser')
    address = html_tree.find_all('td')[2].text.strip()
    address = re.sub(" \n                                       ", ",", address)
    brunch_addresses.append(address)
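
The `re.sub` call above depends on the exact number of spaces in the page source. A slightly more forgiving variant, sketched below, collapses any whitespace around the line break instead. The helper name `clean_address` and the sample `raw` string are assumptions for illustration, modelled on the pattern used in the cell above rather than copied from the live page.

```python
import re


def clean_address(raw):
    """Join the street and city lines of a scraped address with ', '."""
    # \s*\n\s* matches the line break plus any indentation around it,
    # however many spaces the page happens to use
    return re.sub(r"\s*\n\s*", ", ", raw.strip())


# Assumed layout: street, line break, indentation, city
raw = "2657 N Kedzie Ave \n                                       Chicago"
cleaned = clean_address(raw)
```

With this sample input, `cleaned` is `"2657 N Kedzie Ave, Chicago"`, the same format as the addresses used in the rest of the notebook.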

Let's put all the information collected so far into a pandas DataFrame and look at it:

In [5]:
table = pd.DataFrame(data = zip(brunch_names, brunch_addresses, brunch_links), 
                          columns = ["Restaurant Name", "Address", "Time Out (internal) link"])
table

| | Restaurant Name | Address | Time Out (internal) link |
|---|---|---|---|
| 0 | Longman & Eagle | 2657 N Kedzie Ave, Chicago | /chicago/restaurants/longman-eagle |
| 1 | Au Cheval | 800 W Randolph St, Chicago | /chicago/restaurants/au-cheval |
| 2 | Cherry Circle Room | 12 S Michigan Ave, Chicago | /chicago/restaurants/cherry-circle-room |
| 3 | Bohemian House | 11 W Illinois St, Chicago | /chicago/restaurants/bohemian-house |
| 4 | Southport Grocery and Café | 3552 N Southport Ave, Chicago | /chicago/restaurants/southport-grocery-and-cafe |
| 5 | Analogue | 2523 N Milwaukee Ave, Chicago | /chicago/bars/analogue |
| 6 | Baker Miller Bakery & Millhouse | 4610 N Western Ave, Chicago | /chicago/restaurants/baker-miller-bakery-millh... |
| 7 | Jam | 3057 W Logan Blvd, Chicago | /chicago/restaurants/jam |
| 8 | Pub Royale | 2049 W Division St, Chicago | /chicago/bars/pub-royale |
| 9 | Cantina 1910 | 5025 N Clark St, Chicago | /chicago/restaurants/cantina-1910 |


Let's say I am at the Tribune Tower. Now I want to calculate the straight-line ("as the crow flies") distance between me and each of these restaurants. So I convert the physical addresses into latitude and longitude co-ordinates in preparation for calculating the distance between these points.


In [6]:
# Geocoding assistance from:
# http://stackoverflow.com/questions/22342097/is-it-possible-to-create-a-google-map-from-python
coordinates = []

for address in brunch_addresses:
    result = Geocoder.geocode(address)
    coordinates.append(result[0].coordinates)
    time.sleep(1) # slow number of requests to prevent OVER_QUERY_LIMIT error
    
# Let's say I am at the Tribune Tower
tribune = '435 N Michigan Ave, Chicago'
result = Geocoder.geocode(tribune)
tribune_coords = result[0].coordinates
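
A fixed one-second pause may not always be enough to stay under a geocoder's rate limit. One possible refinement is to retry a failed lookup with a growing delay. The sketch below is written against a generic `geocode` callable rather than `pygeocoder` itself; the function name `geocode_with_retry`, its parameters, and the choice to treat every exception as retryable are all assumptions for illustration.

```python
import time


def geocode_with_retry(geocode, address, attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Call geocode(address), pausing longer after each failed attempt."""
    for attempt in range(attempts):
        try:
            return geocode(address)
        except Exception:
            # Assumption: any exception is worth retrying. Give up and
            # re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** attempt)
```

In this notebook it could be called as `geocode_with_retry(lambda a: Geocoder.geocode(a)[0].coordinates, address)`, so the lookup itself stays exactly as in the cell above. The `sleep` parameter exists only so the waiting can be swapped out when experimenting.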


To calculate the distance between the Tribune Tower and each of the 20 restaurants, I am going to use the haversine formula, which calculates the great-circle distance between two points on a sphere. While the curvature of the Earth probably doesn't have a large impact on the distances we are talking about here, Michael Dunn on [Stack Overflow](http://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points) has already written a function to calculate these distances, so let's use it.
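
For reference, the haversine formula can be written as follows. The symbols are chosen here for illustration: $\varphi$ is latitude and $\lambda$ is longitude (both in radians), $R$ is the radius of the sphere, and $d$ is the great-circle distance. In the code these correspond to `lat`, `lon`, the constant `3956`, and `miles`.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
d = 2R \arcsin\!\left(\sqrt{a}\right)
$$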


In [7]:
# Haversine function from Stack Overflow:
# http://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    # Radius of earth in miles.
    miles = 3956 * c  
    return miles

# Calculate distances between Tribune Tower and the 20 restaurants
distances = []

for lat, lon in coordinates:
    distances.append(haversine(tribune_coords[1], tribune_coords[0], lon, lat))

# Add the distances as a new column (list order matches the table's row order)
table['Distance from Tribune (miles)'] = distances
table


| | Restaurant Name | Address | Time Out (internal) link | Distance from Tribune (miles) |
|---|---|---|---|---|
| 0 | Longman & Eagle | 2657 N Kedzie Ave, Chicago | /chicago/restaurants/longman-eagle | 5.103849 |
| 1 | Au Cheval | 800 W Randolph St, Chicago | /chicago/restaurants/au-cheval | 1.312113 |
| 2 | Cherry Circle Room | 12 S Michigan Ave, Chicago | /chicago/restaurants/cherry-circle-room | 0.603255 |
| 3 | Bohemian House | 11 W Illinois St, Chicago | /chicago/restaurants/bohemian-house | 0.268448 |
| 4 | Southport Grocery and Café | 3552 N Southport Ave, Chicago | /chicago/restaurants/southport-grocery-and-cafe | 4.430205 |
| 5 | Analogue | 2523 N Milwaukee Ave, Chicago | /chicago/bars/analogue | 4.876605 |
| 6 | Baker Miller Bakery & Millhouse | 4610 N Western Ave, Chicago | /chicago/restaurants/baker-miller-bakery-millh... | 6.182131 |
| 7 | Jam | 3057 W Logan Blvd, Chicago | /chicago/restaurants/jam | 4.928953 |
| 8 | Pub Royale | 2049 W Division St, Chicago | /chicago/bars/pub-royale | 2.999283 |
| 9 | Cantina 1910 | 5025 N Clark St, Chicago | /chicago/restaurants/cantina-1910 | 6.162883 |


To sanity-check these calculated distances, I picked two restaurants:

- According to the haversine function, the closest brunch location is Bohemian House, approximately 0.3 miles from the Tribune Tower. Google Maps gives an approximate walking distance of 0.4 miles between the two locations, which seems reasonable.
- Au Cheval is calculated to be 1.3 miles from the Tribune Tower. Google Maps puts the shortest walking distance at 1.5 miles, which is also reasonable, given that the haversine formula calculates the direct distance between two GPS points, while Google approximates walking distance along sidewalks.
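
A further check that needs no map at all is to test the function against values that can be worked out by hand. The sketch below repeats the haversine function so that it runs on its own; the `radius` parameter is an addition for illustration and defaults to the 3956 miles used above. One degree of latitude along a meridian should come out as $2\pi R / 360$, roughly 69 miles.

```python
from math import radians, cos, sin, asin, sqrt, pi


def haversine(lon1, lat1, lon2, lat2, radius=3956):
    """Great-circle distance between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return radius * 2 * asin(sqrt(a))


# A point is zero miles from itself
same_point = haversine(-87.6, 41.9, -87.6, 41.9)

# One degree of latitude along a meridian, compared with 2*pi*R/360
one_degree = haversine(0, 0, 0, 1)
expected_one_degree = 2 * pi * 3956 / 360

# Swapping the two points should not change the distance
there = haversine(-87.6, 41.9, -87.7, 42.0)
back = haversine(-87.7, 42.0, -87.6, 41.9)
```

Here `one_degree` and `expected_one_degree` agree to within floating-point error, and `there` equals `back`. The co-ordinates in the last two calls are arbitrary test values, not geocoded addresses.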

Let's visualise this! The code below puts these restaurant locations on a map.

In [8]:
map_data = pd.DataFrame(data = coordinates, columns = ["Latitude", "Longitude"]) 
map_data['Restaurant Name'] = table['Restaurant Name']

map_osm = folium.Map(location = [tribune_coords[0], tribune_coords[1]], 
                    tiles = 'Stamen Terrain',
                    zoom_start = 11)
# Add one marker per restaurant, looking rows up by their integer label
for row in range(len(map_data)):
    folium.Marker([map_data.loc[row, 'Latitude'], map_data.loc[row, 'Longitude']], 
                  popup = map_data.loc[row, 'Restaurant Name']).add_to(map_osm)
folium.Marker([tribune_coords[0], tribune_coords[1]], popup = 'Tribune Tower', 
              icon = folium.Icon(color = 'red', icon = 'question-sign')).add_to(map_osm)

map_osm

Great!

Now I want to send myself an email that suggests a new brunch spot. But first, I want to eliminate those places I've already been: 

In [9]:
# Places I've been to from the brunch list:
brunch_done = ['Analogue', 'Au Cheval', 'Beatrix']

# Places from the brunch list I haven't been to:
brunch_not_done = [i for i in table['Restaurant Name'] if i not in brunch_done]

brunch_df = table.loc[table['Restaurant Name'].isin(brunch_not_done)]
brunch_df

| | Restaurant Name | Address | Time Out (internal) link | Distance from Tribune (miles) |
|---|---|---|---|---|
| 0 | Longman & Eagle | 2657 N Kedzie Ave, Chicago | /chicago/restaurants/longman-eagle | 5.103849 |
| 2 | Cherry Circle Room | 12 S Michigan Ave, Chicago | /chicago/restaurants/cherry-circle-room | 0.603255 |
| 3 | Bohemian House | 11 W Illinois St, Chicago | /chicago/restaurants/bohemian-house | 0.268448 |
| 4 | Southport Grocery and Café | 3552 N Southport Ave, Chicago | /chicago/restaurants/southport-grocery-and-cafe | 4.430205 |
| 6 | Baker Miller Bakery & Millhouse | 4610 N Western Ave, Chicago | /chicago/restaurants/baker-miller-bakery-millh... | 6.182131 |
| 7 | Jam | 3057 W Logan Blvd, Chicago | /chicago/restaurants/jam | 4.928953 |
| 8 | Pub Royale | 2049 W Division St, Chicago | /chicago/bars/pub-royale | 2.999283 |
| 9 | Cantina 1910 | 5025 N Clark St, Chicago | /chicago/restaurants/cantina-1910 | 6.162883 |
| 10 | Dove's Luncheonette | 1545 N Damen Ave, Chicago | /chicago/restaurants/doves-luncheonette | 3.074009 |
| 11 | A10 | 1462 E 53rd St, Chicago | /chicago/restaurants/a10 | 6.485957 |


Okay, from this list of brunch places I haven't been to yet, I want my py script to pick the closest restaurant and suggest it to me. 
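
The selection logic itself is small enough to sketch without pandas. In the example below, the distances are copied from a few rows of the table above, while the helper name `closest_unvisited` is an assumption for illustration and not part of the notebook.

```python
# Miles from the Tribune Tower, copied from a few rows of the table above
distances = {
    "Au Cheval": 1.312113,
    "Cherry Circle Room": 0.603255,
    "Bohemian House": 0.268448,
    "Pub Royale": 2.999283,
}
visited = {"Analogue", "Au Cheval", "Beatrix"}


def closest_unvisited(distances, visited):
    """Return the name with the smallest distance that is not in visited."""
    candidates = {name: miles for name, miles in distances.items()
                  if name not in visited}
    if not candidates:
        return None  # every place on the list has been visited
    # min over the keys, ranked by their distance
    return min(candidates, key=candidates.get)


suggestion = closest_unvisited(distances, visited)
```

For these values `suggestion` is `"Bohemian House"`. Adding it to `visited` and calling the helper again would move on to Cherry Circle Room.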

In [10]:
brunch_df[brunch_df['Distance from Tribune (miles)'] == min(brunch_df['Distance from Tribune (miles)'])]

# Row label of the closest restaurant I haven't visited yet
brunch_index = brunch_df['Distance from Tribune (miles)'].idxmin()
closest_new_brunch = brunch_names[brunch_index]

# Create list of Time Out's short descriptions of each restaurant
brunch_desc = []

for t in tree.find_all('div', "column"):
    desc = t.findChild().text.encode('utf-8')
    # I feel like there is a better way to deal with the unicode below, more research needed
    desc = desc.replace("\xc2\xa0", " ").replace("\xe2\x80\x94", "-").replace(
                        "\xe2\x80\x99", "'").replace("\xcc\x81", "")
    brunch_desc.append(desc)

# Now, let's make a mail message we can read:
# Code adapted from https://github.com/kjam/python-web-scraping-tutorial/blob/master/scraper.py
message = 'Hey Jill,\n\n'
message += 'You should go to brunch this weekend! \n'
message += "The closest place you haven't been to already is %s at %s. " % (
            closest_new_brunch, brunch_addresses[brunch_index])
message += 'This is what Time Out Chicago says: \n\n'
message += '==============================\n'
message += '%s \n' % (brunch_desc[brunch_index])
message += '==============================\n\n'
message += 'Check out this link for more details: www.timeout.com%s. ' % (brunch_links[brunch_index])
message += '\n\nFrom,\n Your Py Script'

print message

Hey Jill,

You should go to brunch this weekend! 
The closest place you haven't been to already is Bohemian House at 11 W Illinois St, Chicago. This is what Time Out Chicago says: 

The best doughnuts in the city are hiding in an unlikely place: a modern Central European restaurant in River North. Here, chef Jimmy Papadopoulos and crew turn out warm orbs of fried dough tossed with sugar and served with Bavarian cream and raspberry jam. The rest of the menu is just as stellar: Tender nuggets of beef tongue mingle with potatoes and cabbage in the hash, while the giant open-faced schnitzel sandwich could easily serve two.  

Check out this link for more details: www.timeout.com/chicago/restaurants/bohemian-house. 

From,
 Your Py Script
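
The comment in the cell above wonders whether there is a better way to deal with the unicode. One possible approach, sketched here, is to work on the text before it is encoded: map the punctuation to plain equivalents, then use the standard library's `unicodedata` module to drop accent marks. The helper name `to_plain_text` and the sample string are assumptions for illustration. Note that this goes slightly further than the chained `replace` calls, because it also strips accents from precomposed characters such as "é".

```python
import unicodedata

# Punctuation that has an obvious plain-text equivalent
REPLACEMENTS = {
    u"\u00a0": u" ",  # non-breaking space
    u"\u2014": u"-",  # em dash
    u"\u2019": u"'",  # right single quotation mark
}


def to_plain_text(text):
    """Replace fancy punctuation and strip accent marks from text."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks (for example U+0301)
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


sample = u"Caf\u00e9\u00a0\u2014 it\u2019s tasty"
plain = to_plain_text(sample)
```

For the sample string, `plain` is `"Cafe - it's tasty"`.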


Now doesn't that sound tasty? 