
Client-side scripting languages are languages that are run in the browser itself, rather
than on a web server. The success of a client-side language depends on your browser’s
ability to interpret and execute the language correctly.

JavaScript is the most common and most well-supported client-side scripting language on the web today. It is embedded between `script` tags in the page's source code:


```
<script>
alert("This creates a pop-up using JavaScript");
</script>
```

#A Brief Introduction to JavaScript
JavaScript is a weakly typed language, with a syntax that is often compared to C++
and Java. For example, the following recursively calculates values in the Fibonacci sequence and prints them to the browser's developer console:


```
<script>
function fibonacci(a, b){
  var nextNum = a + b;
  console.log(nextNum+" is in the Fibonacci sequence");
  if(nextNum < 100){
    fibonacci(b, nextNum);
  }
}
fibonacci(1, 1);
</script>
```
Variables are declared by preceding them with `var`. This resembles the type declaration (`int`, `String`, `List`, etc.) in Java or C++, except that `var` carries no type information.

JavaScript is also extremely good at passing around functions just like variables:


```
<script>
var fibonacci = function() {
  var a = 1;
  var b = 1;
  return function () {
    var temp = b;
    b = a + b;
    a = temp;
    return b;
  }
}
var fibInstance = fibonacci();
console.log(fibInstance()+" is in the Fibonacci sequence");
console.log(fibInstance()+" is in the Fibonacci sequence");
console.log(fibInstance()+" is in the Fibonacci sequence");
</script>
```

This is similar to a lambda expression in other languages. The variable `fibonacci` is defined as a function, and calling it returns another function that returns increasingly large values in the Fibonacci sequence each time it is called.
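For comparison, the same closure pattern can be sketched in Python, the language used for the scrapers later in this chapter. The names `make_fibonacci` and `next_value` are illustrative and do not come from the original example.

```python
def make_fibonacci():
    """Return a function that produces the next Fibonacci value on each call."""
    a, b = 1, 1

    def next_value():
        # nonlocal lets the inner function rebind the enclosing a and b,
        # mirroring how the JavaScript closure updates its outer variables
        nonlocal a, b
        a, b = b, a + b
        return b

    return next_value


fib_instance = make_fibonacci()
first_three = [fib_instance() for _ in range(3)]
print(first_three)  # [2, 3, 5]
```

As in the JavaScript version, the state (`a` and `b`) lives in the enclosing function and survives between calls to the returned function.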

##Common JavaScript Libraries
**jQuery** \\
A site that uses *jQuery* can be identified by an import such as the following somewhere in its code:
```
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
```
jQuery is adept at dynamically creating HTML content that appears only after the JavaScript is executed. If a page's content is scraped by using traditional methods, only the preloaded page that exists before the JavaScript has created the content will be retrieved.
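One way to check whether a page relies on a library such as jQuery is to list the external scripts referenced in its static HTML. The following is a minimal sketch using only the standard library; the class `ScriptSrcCollector`, the helper `find_script_sources`, and the sample HTML are hypothetical and serve only as an illustration.

```python
from html.parser import HTMLParser


class ScriptSrcCollector(HTMLParser):
    """Collect the src attribute of every <script> tag in a document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            src = dict(attrs).get('src')
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


def find_script_sources(html):
    """Return the list of external script URLs found in an HTML string."""
    collector = ScriptSrcCollector()
    collector.feed(html)
    return collector.sources


# Hypothetical page: the div is empty until JavaScript fills it in
sample_html = """
<html><head>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>alert("inline script, no src");</script>
</head><body><div id="content"></div></body></html>
"""

sources = find_script_sources(sample_html)
uses_jquery = any('jquery' in src.lower() for src in sources)
print(sources, uses_jquery)
```

A match on the file name is only a hint: it suggests that some of the page's content, such as the empty `div` above, may be created after the scripts run.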

**Google Analytics** \\
A page that uses Google Analytics will have JavaScript near the bottom, similar to the following:

```
<!-- Google Analytics -->
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-4591498-1']);
_gaq.push(['_setDomainName', 'oreilly.com']);
_gaq.push(['_addIgnoredRef', 'oreilly.com']);
_gaq.push(['_setSiteSpeedSampleRate', 50]);
_gaq.push(['_trackPageview']);
(function() { var ga = document.createElement('script'); ga.type =
'text/javascript'; ga.async = true; ga.src = ('https:' ==
document.location.protocol ? 'https://ssl' : 'http://www') +
'.google-analytics.com/ga.js'; var s =
document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(ga, s); })();
</script>
```

This script handles Google Analytics–specific cookies used to track your visit from
page to page. This can sometimes be a problem for web scrapers that are designed to
execute JavaScript and handle cookies.

**Google Maps** \\
One of the most common ways to denote a location in Google Maps is through a *marker* (also known as a *pin*). Markers can be inserted into any Google Map by using code such as the following:


```
var marker = new google.maps.Marker({
  position: new google.maps.LatLng(-25.363882,131.044922),
  map: map,
  title: 'Some marker text'
});
```



#Ajax and Dynamic HTML
Ajax is a group of technologies used to accomplish a certain task. Ajax stands for *Asynchronous JavaScript and XML*, and is used to send information to and receive it from a web server without making a separate page request.

Like Ajax, *dynamic HTML* (DHTML) is a collection of technologies used for a common purpose. DHTML is HTML code, CSS language, or both that changes as client-side scripts change HTML elements on the page. A button might appear only after the user moves the cursor, a background color might change on a click, or an Ajax request might trigger a new block of content to load.

The content shown in the browser may not match the content in the source retrieved from the site. The web page might also have a loading page that appears to redirect to another
page of results, even though the page’s URL never changes when this redirect happens. Both symptoms occur when a scraper fails to execute the JavaScript on the page.

##Executing JavaScript in Python with Selenium
PhantomJS is what is known as a *headless browser*. It loads websites into memory and
executes JavaScript on the page, but does it without any graphic rendering of the website to the user.

Selenium works by automating browsers to load the website,
retrieve the required data, and even take screenshots or assert that certain actions
happen on the website.

The example here uses the page (http://pythonscraping.com/pages/javascript/ajaxDemo.html).

In [1]:
!pip3 install selenium

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


The Selenium library is an API called on the object WebDriver. The WebDriver is a
bit like a browser in that it can load websites, but it can also be used like a `BeautifulSoup` object to find page elements, interact with elements on the page (send text,
click, etc.), and do other actions to drive the web scrapers.

The following code retrieves text behind an Ajax "wall" on the test page:

In [0]:
from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='<PhantomJS Path Here>')
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)
print(driver.find_element_by_id('content').text)
driver.close()

This creates a new Selenium WebDriver using the *PhantomJS* library, tells the
WebDriver to load a page, and then pauses execution for three seconds before looking
at the page to retrieve the (hopefully loaded) content.

To run this on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

In [0]:
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='drivers/chromedriver', 
                          options=chrome_options)
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
time.sleep(3)
print(driver.find_element_by_id('content').text)
driver.close()

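If the `time.sleep` pause is changed to one second instead of three, the Ajax content may not have finished loading, and the text returned may be the page's original placeholder content: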


```
time.sleep(1)
```



Selenium uses an entirely new set of selectors to find an element in a WebDriver’s DOM, although they have fairly straightforward names. \\
The DOM (Document Object Model) connects web pages to scripts or programming languages by representing the structure of a document, such as the HTML of a web page, in memory. The language is usually JavaScript, although modeling HTML, SVG, or XML documents as objects is not part of the JavaScript language itself. The DOM represents a document as a logical tree. Each branch of the tree ends in a node, and each node contains objects. DOM methods allow programmatic access to the tree; with them the document's structure, style, or content can be changed. Nodes can also have event handlers attached to them; once an event is triggered, the event handlers are executed.
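The tree structure and the programmatic access described above can be illustrated without a browser. The following is a minimal sketch using Python's standard `xml.dom.minidom` module on a small, well-formed document; the sample markup and the helper `describe` are hypothetical, and a browser's DOM is considerably richer than this.

```python
from xml.dom import minidom

# A small, well-formed document; minidom is an XML parser, so real-world
# HTML that is not valid XML would need a more forgiving parser
document = minidom.parseString(
    '<html><body><div id="content">Loading...</div></body></html>'
)


def describe(node, depth=0):
    """Return an indented outline of the element nodes under `node`."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append('  ' * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(describe(child, depth))
    return lines


outline = describe(document)
print('\n'.join(outline))

# DOM methods give programmatic access to the tree: locate the div and
# replace its text, much as a client-side script might after an Ajax call
div = document.getElementsByTagName('div')[0]
div.firstChild.data = 'Here is the loaded content'
print(div.toxml())
```

The in-memory tree now differs from the markup it was parsed from, which is the same reason a browser's view of a page can differ from the source a scraper downloads.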

In the example, `find_element_by_id` is used, although the following other selectors can be used as well:


```
driver.find_element_by_css_selector('#content')
driver.find_element_by_tag_name('div')
```

To select multiple elements on the page:


```
driver.find_elements_by_css_selector('#content')
driver.find_elements_by_css_selector('div')
```

To use `BeautifulSoup` to parse the content, use the WebDriver's `page_source` attribute, which holds the page's current source as a string:


```
from bs4 import BeautifulSoup

pageSource = driver.page_source
bs = BeautifulSoup(pageSource, 'html.parser')
print(bs.find(id='content').get_text())
```



Although the fixed `time.sleep` pause works, it is somewhat inefficient, and implementing it could
cause problems on a large scale. Page-load times are inconsistent, depending on the
server load at any particular millisecond, and natural variations occur in connection
speed.

A more efficient solution would repeatedly check for the existence of a particular element on a fully loaded page and return only when that element exists. This code uses the presence of the button with the ID `loadedButton` to declare that the page has been fully loaded:

In [0]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [0]:
driver = webdriver.PhantomJS(executable_path='')
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
try:
  element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'loadedButton')))
finally:
  print(driver.find_element_by_id('content').text)
  driver.close()
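The idea behind this kind of wait (check a condition repeatedly and return as soon as it holds, or give up after a time-out) can be sketched in plain Python. This is not Selenium's implementation; `wait_until`, its parameters, and the stand-in condition `content_has_loaded` are hypothetical names used only for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or raises TimeoutError if time runs out.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %.1f seconds' % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose content appears on the third check
attempts = {'count': 0}


def content_has_loaded():
    attempts['count'] += 1
    return 'loaded' if attempts['count'] >= 3 else None


result = wait_until(content_has_loaded, timeout=2.0, poll_interval=0.01)
print(result, attempts['count'])
```

Compared with a fixed `time.sleep`, the loop returns as soon as the condition is met, so a fast page costs little time and a slow page still gets the full time-out.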

The `WebDriverWait` and `expected_conditions` are combined to form what Selenium calls an *explicit wait*. It waits for a certain state in the DOM to occur before continuing, whereas an *implicit wait* (set with `driver.implicitly_wait`) tells the WebDriver to keep polling for up to a fixed time whenever it looks up an element. In an explicit wait, the triggering DOM state is defined by `expected_conditions` (imported as `EC` here).

Most of these expected conditions require that users specify an element to watch for in
the first place. Elements are specified using locators. A *locator* is an abstract query language, using the `By` object, which can be used in a variety of ways, including to make selectors.

A locator is used to find elements with the ID `loadedButton`:
```
EC.presence_of_element_located((By.ID, 'loadedButton'))
```

A locator can be used to create selectors, using the `find_element` WebDriver function:
```
print(driver.find_element(By.ID, 'content').text)
```
which is functionally equivalent to:
```
print(driver.find_element_by_id('content').text)
```

To run this on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [0]:
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='drivers/chromedriver',
                          options=chrome_options)

driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
try:
  element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'loadedButton')))
finally:
  print(driver.find_element_by_id('content').text)
  driver.close()

##Additional Selenium Webdrivers
Most major browsers have their own webdriver, and the Selenium group curates a collection of these webdrivers for easy reference.

```
firefox_driver = webdriver.Firefox('<path to Firefox webdriver>')
chrome_driver = webdriver.Chrome('<path to Chrome webdriver>')
safari_driver = webdriver.Safari('<path to Safari webdriver>')
ie_driver = webdriver.Ie('<path to Internet Explorer webdriver>')
```



#Handling Redirects
Client-side redirects are page redirects that are executed in the browser by JavaScript, rather than redirects performed on the server before the page content is sent.

A server-side redirect, depending on how it is handled, can be easily traversed by Python’s urllib library without any help from Selenium (as mentioned in Chapter 3).

Client-side redirects won’t be handled at all unless something is executing the JavaScript.

One way to detect the redirect is to "watch" an element in the DOM when the page initially loads and then repeatedly check that element until Selenium throws a `StaleElementReferenceException`; at that point the element is no longer attached to the page's DOM and the site has redirected:

In [0]:
from selenium import webdriver
import time
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException

In [0]:
def waitForLoad(driver):
  elem = driver.find_element_by_tag_name("html")
  count = 0
  while True:
    count += 1
    if count > 20:
      print('Timing out after 10 seconds and returning')
      return
    time.sleep(.5)
    try:
      # Reading a property queries the browser about the saved element,
      # which raises StaleElementReferenceException once elem is stale
      elem.tag_name
    except StaleElementReferenceException:
      return

In [0]:
driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
waitForLoad(driver)
print(driver.page_source)

This script checks the page every half second, with a time-out of 10 seconds, although
the times used for the checking time and time-out can be easily adjusted up or down
as needed.

To run this on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException
import time

In [0]:
def waitForLoad(driver):
  elem = driver.find_element_by_tag_name("html")
  count = 0
  while True:
    count += 1
    if count > 20:
      print("Timing out after 10 seconds and returning")
      return
    time.sleep(.5)
    try:
      # Reading a property queries the browser about the saved element,
      # which raises StaleElementReferenceException once elem is stale
      elem.tag_name
    except StaleElementReferenceException:
      return

In [0]:
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='drivers/chromedriver',
                          options=chrome_options)
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)
driver.close()

An alternative is to write a similar loop that checks the current URL of the page until the URL changes or matches a specific URL that is being searched for.
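A minimal sketch of such a URL-checking loop follows. It relies only on the WebDriver's `current_url` property; the function `wait_for_url_change`, its parameters, and the `FakeDriver` stand-in (used here so the example runs without a browser) are hypothetical.

```python
import time


def wait_for_url_change(driver, old_url, timeout=10.0, poll_interval=0.5):
    """Poll driver.current_url until it differs from old_url.

    Returns the new URL, or None if the URL has not changed within timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = driver.current_url
        if current != old_url:
            return current
        time.sleep(poll_interval)
    return None


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; 'redirects' on the third read."""

    def __init__(self, start_url, final_url):
        self._urls = [start_url, start_url, final_url]
        self._reads = 0

    @property
    def current_url(self):
        url = self._urls[min(self._reads, len(self._urls) - 1)]
        self._reads += 1
        return url


fake = FakeDriver('http://example.com/start', 'http://example.com/final')
new_url = wait_for_url_change(fake, 'http://example.com/start',
                              timeout=2.0, poll_interval=0.01)
print(new_url)
```

With a real WebDriver, `old_url` would be the address passed to `driver.get`, so that a redirect completed before the first check is still detected. This approach only helps when the redirect actually changes the URL.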

Waiting for elements to appear and disappear is a common task in Selenium, and the `WebDriverWait` can also be used. The following example provides a time-out of 15 seconds and an XPath selector that looks for the page body content to accomplish the same task:

In [0]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [0]:
driver = webdriver.PhantomJS(executable_path='drivers/phantomjs/phantomjs-2.1.1-macosx/bin/phantomjs')
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')

try:
  bodyElement = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, '//body[contains(text(), "This is the page you are looking for!")]')))
  print(bodyElement.text)
except TimeoutException:
  print('Did not find the element')

To run this on Chrome:

In [0]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [0]:
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='drivers/chromedriver', 
                          options=chrome_options)
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
try:
  bodyElement = WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, '//body[contains(text(), "This is the page you are looking for!")]')))
  print(bodyElement.text)
except TimeoutException:
  print('Did not find the element')