# Introduction

Some web pages are built by JavaScript after the document has loaded in the browser.
An example is http://news.gsu.edu, where the news content is not part of the HTML document but is loaded
dynamically.

In other cases, user interactions like logging into the site might be required. We demonstrate this on Google Analytics.


In both cases, we need a real web browser to render the page before we can extract any data.
The browser can be either
- headless (https://en.wikipedia.org/wiki/Headless_browser), or
- a regular web browser that can be controlled from a program.

This notebook gives an example of using Selenium WebDriver (http://www.seleniumhq.org/projects/webdriver/).

**Reference:** http://selenium-python.readthedocs.io

# Install Chrome Driver (macOS)

In [2]:
%%sh
brew services list

Name         Status  User  Plist
zookeeper    stopped       
kafka        stopped       
chromedriver started Peter /Users/Peter/Library/LaunchAgents/homebrew.mxcl.chromedriver.plist


In [3]:
%%sh
# brew services start chromedriver

We also need:
1. Beautiful Soup (`bs4`)
2. `lxml`
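
Before starting the browser, it can help to confirm that these packages are importable. The following is a minimal sketch, assuming Python 3 and the import names `selenium`, `bs4`, `lxml`, and `pandas`; the helper `missing_packages` is a hypothetical name introduced here for illustration.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used in this notebook (Beautiful Soup is imported as "bs4").
required = ["selenium", "bs4", "lxml", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing packages: " + ", ".join(missing))
else:
    print("All required packages are available.")
```

Anything reported as missing can usually be installed with `pip install <name>`; note that the pip name for `bs4` is `beautifulsoup4`.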

# Getting Data from Google Analytics

In [1]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


In [2]:
driver = webdriver.Chrome()

Chrome should now open a window that looks like this:
(screenshot)


In [3]:
driver.get("https://analytics.google.com")

In [7]:
# Find the 'BEHAVIOR' entry in the left navigation and click it
for el in driver.find_elements_by_class_name('ga-nav-link-label'):
    if el.text == 'BEHAVIOR':
        print("clicking on it ... ")
        el.click()
        break
print("done.")

clicking on it ... 
done.


In [8]:
def ga_nav_link_labels(drv):
    # Collect the text of every navigation label, skipping empty (hidden) ones
    lst = [el.text for el in drv.find_elements_by_class_name('ga-nav-link-label')]
    return [s for s in lst if len(s) > 0]

In [9]:
ga_nav_link_labels(driver)

[u'HOME',
 u'CUSTOMIZATION',
 u'REAL-TIME',
 u'AUDIENCE',
 u'ACQUISITION',
 u'BEHAVIOR',
 u'Overview',
 u'Behavior Flow',
 u'Site Content',
 u'Site Speed',
 u'Site Search',
 u'Events',
 u'Publisher',
 u'Experiments',
 u'CONVERSIONS',
 u'DISCOVER',
 u'ADMIN']

In [10]:
def get_ga_nav_link(drv, txt):
    # Return the first navigation label whose text equals txt, or None if there is none
    for el in drv.find_elements_by_class_name('ga-nav-link-label'):
        if el.text == txt:
            return el
    return None

In [11]:
get_ga_nav_link(driver, 'Site Search').click()
get_ga_nav_link(driver, 'Search Terms').click()
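
Because the navigation is rendered dynamically, the 'Search Terms' entry may not exist yet when the second call runs, in which case `get_ga_nav_link` returns `None` and `.click()` fails. The sketch below is a plain-Python polling helper, not Selenium's own `WebDriverWait`; the name `wait_for` and its default timings are assumptions made for illustration, and it assumes Python 3.

```python
import time


def wait_for(fetch, timeout=10.0, poll=0.5):
    """Call fetch() repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)
```

With this helper, the second click could be written as `wait_for(lambda: get_ga_nav_link(driver, 'Search Terms')).click()`, which retries the lookup until the label appears.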

In [18]:
body = driver.find_element_by_tag_name('body')
body.get_attribute('innerHTML')



In [11]:
tbl = driver.find_element_by_id('ID-rowTable')
print(tbl.id)

0.6139261861771723-2


In [5]:
import pandas as pd
from io import StringIO

In [12]:
tbl_doc = tbl.get_attribute('innerHTML')
print(tbl_doc[:100])

<thead><tr class="_GABB"><th class="_GATf _GARIb ACTION-sort TARGET-analytics.query ID-dimension-col


In [13]:
# innerHTML omits the enclosing <table> tag, so wrap the fragment before parsing it
df = pd.read_html(StringIO(u'<table>%s</table>' % tbl_doc))[0]
print(df.shape)
df.head()

(5000, 6)


Unnamed: 0,Search Query,Clicks,Impressions,CTR,Average Position,Unnamed: 5
0,1.0,(other),"1,118(28.96%)","23,045(11.05%)",4.85%,22.0
1,2.0,youtube merchandise,256(6.63%),721(0.35%),35.51%,1.0
2,3.0,youtube merch,211(5.46%),838(0.40%),25.18%,1.7
3,4.0,youtube shop,148(3.83%),315(0.15%),46.98%,1.0
4,5.0,youtube store,110(2.85%),274(0.13%),40.15%,1.0
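
The Clicks and Impressions cells combine an absolute count with a percentage share, for example `1,118(28.96%)`, so they are read as strings rather than numbers. A minimal sketch of splitting such a cell follows; the function name `split_count_and_share` is hypothetical, and the pattern assumes every cell has exactly the `count(share%)` shape seen in the output above.

```python
import re

# One count with optional thousands separators, followed by a percentage in parentheses.
_CELL = re.compile(r"^\s*([\d,]+)\s*\(\s*([\d.]+)%\s*\)\s*$")


def split_count_and_share(cell):
    """Split a cell such as '1,118(28.96%)' into the tuple (1118, 28.96)."""
    match = _CELL.match(cell)
    if match is None:
        raise ValueError("unexpected cell format: %r" % (cell,))
    count = int(match.group(1).replace(",", ""))
    share = float(match.group(2))
    return count, share


print(split_count_and_share("1,118(28.96%)"))  # (1118, 28.96)
```

Applied to the data frame, something like `df['Clicks'].map(split_count_and_share)` would yield one tuple per row, assuming the column is named `Clicks` as in the output above.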


# Exercise

The task is to build a script that:
1. navigates to the right analysis view (using the left navigation),
2. sets the date range for the analysis using the date fields in the top right corner,
3. traverses all result pages, and
4. extracts and downloads the data table on each page.
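
As a starting point for the page traversal, the loop can be kept separate from the Selenium details. The sketch below is only one possible structure: `collect_all_pages`, `extract_page`, and `go_to_next_page` are hypothetical names, and the two callables are assumed to be supplied by you (for instance, one wrapping the `pd.read_html` step shown earlier, the other clicking the pager's "next" control and returning `False` when there is no further page).

```python
def collect_all_pages(extract_page, go_to_next_page, max_pages=1000):
    """Extract the current page, advance, and repeat until no next page exists.

    extract_page:    callable returning the data of the page currently shown.
    go_to_next_page: callable that advances one page and returns True,
                     or returns False when already on the last page.
    max_pages:       safety limit so a faulty pager cannot loop forever.
    """
    results = []
    for _ in range(max_pages):
        results.append(extract_page())
        if not go_to_next_page():
            break
    return results
```

If `extract_page` returns a data frame per page, the resulting list could then be combined with `pd.concat` before saving it to disk.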
