## Web Scraping, Part II ##

In [1]:
# Example: List of events from nyc.com
import urllib.request as ur
url = 'http://www.nyc.com/events/?int4=1&from=10/15/2015&to=10/16/2015'
data = ur.urlopen(url).read().decode('utf-8')
print(data)


<!DOCTYPE html>
<!--[if lt IE 7]>      <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js">
<!--<![endif]-->
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no, minimal-ui" />

    
    <script type="text/javascript">
        (function () {
            if (navigator.userAgent.toLocaleLowerCase().indexOf("ipad") >= 0)
                document.getElementsByName("viewport")[0].setAttribute("content", "width=device-width, initial-scale=0.75");
        })();
    </script>

        <title>New York Events and Event Calendar | NYC.com - New York&#39;s Box Office</title>
    <meta name="keywords" content="new york event calendar, new york, manhat

In [2]:
def getListFromString(data_string, search_string, terminator_string):
    """Return every substring found between search_string and terminator_string."""
    listToBeReturned = list()
    searchStringLoc = data_string.find(search_string)
    while searchStringLoc > -1:
        start_index = searchStringLoc + len(search_string)
        end_index = data_string.find(terminator_string, start_index)
        if end_index == -1:  # No closing terminator: stop instead of appending a bogus item
            break
        item = data_string[start_index:end_index]
        listToBeReturned.append(item)

        # Resume the search just past the terminator
        searchStringLoc = data_string.find(search_string, end_index + len(terminator_string))

    return listToBeReturned
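
To see what the function returns, it helps to run it on a small hand-written snippet before pointing it at a real page. The sketch below restates the function in compact form so the cell runs on its own; the name `get_list_from_string`, the sample HTML, and the event titles are all made up for illustration.

```python
def get_list_from_string(data_string, search_string, terminator_string):
    """Compact restatement of getListFromString so this cell is self-contained."""
    items = []
    start = data_string.find(search_string)
    while start > -1:
        start += len(search_string)
        end = data_string.find(terminator_string, start)
        if end == -1:  # unterminated item: stop searching
            break
        items.append(data_string[start:end])
        start = data_string.find(search_string, end + len(terminator_string))
    return items

# Hypothetical markup, written by hand for this example
sample_html = (
    '<div><h3 itemprop="name">Jazz Night</h3><p>8 pm</p>'
    '<h3 itemprop="name">Street Fair</h3></div>'
)
titles = get_list_from_string(sample_html, '<h3 itemprop="name">', '</h3>')
print(titles)  # ['Jazz Night', 'Street Fair']
```

If this works on the sample but returns an empty list on the live page, the problem is more likely in what the server sent than in the search logic.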

In [3]:
links_to_events=getListFromString(data,'<h3 itemprop="name">','</h3>')
print(len(links_to_events))
print(links_to_events)

0
[]
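
One plausible reason for the empty result: the event headings are not in the HTML that the server sends, and only appear after a browser runs the page's scripts. The sketch below imitates that situation with two hand-written strings. Both the markup and the `/api/events` endpoint are hypothetical, not taken from nyc.com.

```python
# Hypothetical raw response: the event list is an empty placeholder that a
# script is supposed to fill in later (loadEvents and /api/events are made up).
raw_html = """
<html><body>
  <div id="events"></div>
  <script>loadEvents("/api/events");</script>
</body></html>
"""

# Hypothetical page content after a browser has run the script.
rendered_html = """
<html><body>
  <div id="events">
    <h3 itemprop="name">Jazz Night</h3>
    <h3 itemprop="name">Street Fair</h3>
  </div>
</body></html>
"""

marker = '<h3 itemprop="name">'
raw_count = raw_html.count(marker)            # what a plain urllib fetch would see
rendered_count = rendered_html.count(marker)  # what a browser would see
print(raw_count, rendered_count)  # 0 2
```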


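1. Common issues with web scraping:
- Static vs. dynamic vs. real-time websites: a static page arrives complete in the HTML response, while dynamic and real-time pages fill in content with JavaScript after loading, so the raw HTML may not contain the data you want.
- Browsers vs. programs/robots/crawlers: some sites respond differently to, or refuse, requests that do not come from a browser.

2. To view real-time pages -> Selenium
- Selenium sends the URL request through a real browser, so it gets data from sites that don't want to talk to programs and from pages that build their content with JavaScript.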
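
On the "browser vs. program" point: by default urllib announces itself with a `Python-urllib/3.x` User-Agent header, which some sites reject. A lighter-weight thing to try before reaching for Selenium is sending a browser-like header. The sketch below only builds the request and does not send it; the User-Agent string is an illustrative value, and whether nyc.com checks this header is an assumption that has not been verified here.

```python
import urllib.request as ur

url = 'http://www.nyc.com/events/?int4=1&from=10/15/2015&to=10/16/2015'

# Illustrative browser-like User-Agent string (not tied to any real install)
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/115.0'}
request = ur.Request(url, headers=headers)

# urllib stores header names capitalized as 'User-agent'
print(request.get_header('User-agent'))

# Uncomment to actually send the request:
# data = ur.urlopen(request).read().decode('utf-8')
```

This does nothing for content that is generated by JavaScript; for that, a real browser is still needed.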

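# You will need to install Selenium. If on a Mac: sudo pip install selenium
# With the Selenium release used here, the Firefox driver comes pre-installed. Drivers are also available for Chrome and IE.
# Newer Selenium releases (3.0 and later) drive Firefox through a separate geckodriver executable.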

In [6]:
# In our example, we'll use Firefox as a conduit. That's usually the least complicated option.
# Basic steps:
# 1. Open a Firefox browser using Selenium's webdriver
# 2. Tell the browser to get data from the url
# 3. Use the appropriate find_elements function.
# 4. Get and print the data
# 5. Close the browser (prevents multiple windows opening on your computer)

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox() #1
url='http://www.gobiernotransparentechile.cl/directorio/entidad/25/351/per_planta/Ao-2015'
browser.get(url) #2
# find_elements(By...) replaces find_elements_by_tag_name, which was removed in Selenium 4.3
table_bodies=browser.find_elements(By.TAG_NAME, "tbody") #3
print(len(table_bodies)) #4
browser.quit() #5

1


In [10]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

browser = webdriver.Firefox() #1
url='http://www.nyc.com/events/?int4=1&from=10/15/2015&to=10/16/2015'
browser.get(url) #2 This will open a Firefox window on your machine

browser.execute_script("window.scrollTo(0,5000)") # Scroll down the page
time.sleep(5) # Fixed wait for the page to finish loading after the scroll
links_to_events=browser.find_elements(By.TAG_NAME, "h3") #3
print(len(links_to_events)) #4
browser.quit() #5

47


#### Next - Code elegance

1. Get rid of the annoying browser popup window: 
- Use PhantomJS, a "headless" browser (one that runs without opening a window). Install it from http://phantomjs.org, or from https://github.com/eugene1g/phantomjs/releases for OS X Yosemite.

2. Handle delays better:
- Use Selenium's WebDriverWait to implement a "smart" wait, one that returns as soon as a condition is met instead of always sleeping for a fixed time.
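
To make the idea of a "smart" wait concrete, here is a rough pure-Python picture of a polling wait: check a condition repeatedly and return as soon as it holds, giving up after a timeout. This is a sketch of the general technique, not Selenium's implementation. `wait_until` and `page_ready` are hypothetical names, and the 0.5 second default interval simply mirrors WebDriverWait's default polling frequency.

```python
import time

def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll_interval)

# Stand-in for "has the element appeared yet?": becomes true on the third check
attempts = []
def page_ready():
    attempts.append(1)
    return len(attempts) >= 3

ready = wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print(ready, len(attempts))  # True 3
```

Compared with `time.sleep(5)`, the wait ends as soon as the page is ready, and a page that never becomes ready produces an explicit error instead of silently giving incomplete results.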

In [11]:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS(executable_path='/Applications/phantomjs')
url='http://www.nyc.com/events/?int4=1&from=10/15/2015&to=10/16/2015'
browser.get(url)
browser.execute_script("window.scrollTo(0,20000)") # Scroll far down the page

try: # Wait up to 20 seconds for the footer to become visible
    element = WebDriverWait(browser,20).until(EC.visibility_of_element_located((By.TAG_NAME,"footer")))
except TimeoutException: # Footer never appeared; count whatever has loaded
    pass
finally:
    links_to_events=browser.find_elements(By.TAG_NAME,"h3")
    print(len(links_to_events))
    browser.quit()


47
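
A note for newer setups: PhantomJS is no longer maintained, and Selenium 4 dropped `webdriver.PhantomJS`. The same two goals (no popup window, smart wait) can be sketched with Firefox in headless mode. This is an alternative under stated assumptions, not the method used above: it assumes Selenium 4 with Firefox and geckodriver available, and `count_tags_headless` is a hypothetical helper name.

```python
def count_tags_headless(url, tag_name="h3", timeout=20):
    """Count <tag_name> elements on url using headless Firefox (sketch)."""
    # Imports live inside the function so the definition loads even on a
    # machine where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Smart wait: returns as soon as one matching element is in the page,
        # raises TimeoutException if none shows up within `timeout` seconds
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, tag_name)))
        return len(browser.find_elements(By.TAG_NAME, tag_name))
    finally:
        browser.quit()  # always close the browser, even on errors

# Example call (needs Firefox, geckodriver, and network access):
# print(count_tags_headless('http://www.nyc.com/events/?int4=1&from=10/15/2015&to=10/16/2015'))
```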
