# Web scraping
*Katarzyna O Rachuta, last updated 2017-02-07*

This notebook shows how I scraped Google search results (www.google.com) to get the average visit time for different places. This information appears in a box on the right-hand side of the results page, where one can see 'Plan your visit: average visit time X min'.

In [2]:
# Importing necessary modules.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import urllib
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Because I want to look up a whole list of locations, I have wrapped the scraping in a function.

The function takes a location name and URL-encodes it so that it can be used in a Google search. It then opens the webdriver (using Chrome) and fetches the page. Not all locations have an average visit time, in which case I want to see 'None'.<br>
I want the data to be output in a pandas DataFrame.
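
The two text-handling steps described above (encoding the name and pulling out 'X min') can be tried without opening a browser. This is only a minimal sketch under stated assumptions: it targets Python 3, where `quote_plus` lives in `urllib.parse`, and the location name, the sample box text, and the helper `extract_visit_time` are hypothetical stand-ins, not part of the original notebook.

```python
import re
from urllib.parse import quote_plus  # Python 3 location of quote_plus

# Hypothetical location name, used only for illustration.
name = "Golden Gate Park"

# Encode the name so it is safe to place in a search URL (spaces become '+').
encoded = quote_plus(name)
search_url = "https://www.google.com/#q=" + encoded

# Hypothetical strings standing in for the text of the visit-time box.
sample_box_text = "Plan your visit: average visit time 45 min"
missing_box_text = "No visit information here"


def extract_visit_time(text):
    """Return the first 'N min' substring, or None when there is none."""
    match = re.search(r"\d+ min", text)
    return match.group(0) if match else None


print(search_url)
print(extract_visit_time(sample_box_text))
print(extract_visit_time(missing_box_text))
```

Returning `None` for a missing match mirrors the behaviour I want from the scraper when a location has no visit-time box.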

In [None]:
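# My function: look up the average visit time that Google shows for one location.
from urllib.parse import quote_plus  # Python 3; on Python 2 this is urllib.quote_plus

def visit_time_function(original_merchant_name):
    # URL-encode the name so it can be used in the search URL.
    encoded_merchant = quote_plus(original_merchant_name)
    browser = webdriver.Chrome()
    try:
        browser.get('https://www.google.com/#q=' + encoded_merchant)
        delay = 3  # seconds
        try:
            WebDriverWait(browser, delay).until(
                EC.presence_of_element_located((By.CLASS_NAME, "_B1k")))
        except TimeoutException:
            pass  # No visit-time box appeared; fall through and report None.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        # Always close the browser, even if loading the page fails.
        browser.quit()

    # The visit time, if present, sits inside the div with class '_B1k'.
    box = soup.find('div', {'class': '_B1k'})
    match = re.search(r'\d+ min', str(box)) if box is not None else None
    visit_time = match.group(0) if match is not None else None

    # The original referenced an undefined name here; use the function argument.
    return pd.DataFrame({
        'Location': [original_merchant_name],
        'Visit time': [visit_time]
    })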

<b>To do:</b><br>
    a) Add a list of sample locations.<br>
    b) Loop over these locations.
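
One possible shape for the two to-do items is sketched below. It is an assumption-laden outline rather than the finished loop: the location names are hypothetical, `collect_visit_times` is a helper invented for this sketch, and `fake_lookup` stands in for the real scraper so that the example runs without opening a browser.

```python
def collect_visit_times(location_names, lookup):
    """Call lookup(name) for each name and gather one row per location."""
    rows = []
    for name in location_names:
        rows.append({"Location": name, "Visit time": lookup(name)})
    return rows


# Hypothetical sample locations; replace these with the real list.
sample_locations = ["Museum A", "Cafe B"]


def fake_lookup(name):
    """Stand-in for the scraper: returns a canned visit time or None."""
    canned = {"Museum A": "90 min"}
    return canned.get(name)  # None when no visit time is known


rows = collect_visit_times(sample_locations, fake_lookup)
print(rows)
```

Because `visit_time_function` returns a one-row DataFrame rather than a string, the real loop could instead gather those frames in a list and combine them with `pd.concat`.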

# Scraping Yelp
This part of the script scrapes each business's Yelp page for its price range, which is not available through the API. If the price range does not appear on the Yelp page, 'None' will be returned.

In [3]:
# I chose 5 sample businesses that I know of.
locations = pd.read_csv("sample_businesses.csv")
locations

| | Name | Yelp URL |
|---|---|---|
| 0 | Teaquation | https://www.yelp.com/biz/teaquation-redwood-city |
| 1 | Taqueria Guadalajara | https://www.yelp.com/biz/taqueria-guadalajara-... |
| 2 | Lolo | https://www.yelp.com/biz/loló-san-francisco-4 |
| 3 | Philz Coffee | https://www.yelp.com/biz/philz-coffee-san-fran... |
| 4 | 21st Amendment Brewpub | https://www.yelp.com/biz/21st-amendment-brewpu... |


In [4]:
# I define a function that scrapes the price range from one business's Yelp page.
def get_price_range(business_name):
    # Look up the Yelp URL for this business in the locations table.
    url_df = locations[locations['Name'] == business_name]
    url = url_df['Yelp URL'].iloc[0]
    browser = webdriver.Chrome()
    try:
        browser.get(url)
        soup = BeautifulSoup(browser.page_source, 'html.parser')
    finally:
        # Always close the browser, even if loading the page fails.
        browser.quit()

    tag = soup.find('dd', {'class': 'price-description'})
    # Only search the text when the element exists; otherwise str(None)
    # would match the pattern and yield the string 'None'.
    match = re.search(r'\w+.+', tag.getText()) if tag is not None else None
    price_range = match.group(0) if match is not None else None

    return pd.DataFrame({
        'Business name': [business_name],
        'Price range': [price_range]
    })
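
To see what the pattern `\w+.+` keeps and drops, it can be run on a few strings outside the scraper. This is a small sketch with hypothetical inputs: the raw strings below are made up to resemble element text with surrounding whitespace, and `extract_price_text` is a helper named only for this example.

```python
import re


def extract_price_text(raw_text):
    """Return the first match of \\w+.+ in raw_text, or None if there is none."""
    match = re.search(r"\w+.+", raw_text)
    return match.group(0) if match else None


# Hypothetical raw strings, padded with whitespace like scraped element text.
print(extract_price_text("\n    Inexpensive\n  "))  # Inexpensive
print(extract_price_text("\n    Under $10\n  "))    # Under $10
print(extract_price_text("\n    $11-30\n  "))       # 11-30
print(extract_price_text("   \n  "))                # None
```

The match starts at the first word character and stops at the end of the line, so surrounding whitespace is trimmed and a leading symbol such as `$` is skipped. That may be why a value like 11-30 can show up without a currency sign. The pattern also needs at least two characters to match.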

In [7]:
# Creating a list of businesses to consider.

businesslist = locations['Name'].tolist()

# Scraping each business and stacking the one-row results into one DataFrame.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
# Each row keeps the index 0 of its own one-row DataFrame.

price_range_df_full = pd.concat([get_price_range(business) for business in businesslist])

# Showing the output.

price_range_df_full

| | Business name | Price range |
|---|---|---|
| 0 | Teaquation | Inexpensive |
| 0 | Taqueria Guadalajara | Under $10 |
| 0 | Lolo | 11-30 |
| 0 | Philz Coffee | Inexpensive |
| 0 | 21st Amendment Brewpub | Moderate |
