# Intro

Due to the rise of single-page applications and rising user expectations for dynamic, interactive websites, JavaScript is increasingly used to build websites. This can cause trouble when scraping, because content generated by JavaScript is not present in the HTML that the server initially returns.


# Splash

Splash is a lightweight scriptable browser. We can interact with it by writing Lua browsing scripts and by executing JavaScript. To use Splash we first need to install [docker](https://docs.docker.com/install/). Docker is similar to a virtual machine but more lightweight; it provides an easy way to run other virtual environments.
Follow the Docker install guides to install it. You can check that Docker is running with:

```
sudo systemctl status docker
```

After you've installed Docker you can install [splash](https://splash.readthedocs.io/en/stable/install.html) by running:

```
sudo docker pull scrapinghub/splash
```

To launch the splash server run:

```
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash
```


Since that's quite a long command to remember, you may wish to put an alias in your `.bashrc` (if on Linux or Mac).

```
alias splash="sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash"
```

That way you can launch the server by just running `splash`. Once Splash is running you can visit http://localhost:8050/, where there is an interactive environment that you can use to test Splash scripts. Below is an example Splash script.

```lua
function main(splash, args)
  splash:go(args.url)
  local scroll_to = splash:jsfunc("window.scrollTo")
  scroll_to(0, 300)
  return {png=splash:png()}
end
```

As you can see the code is very pythonic. If you're unfamiliar with Lua you can use this [cheatsheet](https://learnxinyminutes.com/docs/lua/) to help you out. The Splash webpage (http://localhost:8050/) also has a series of examples to help get you started.
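
The web UI is a front end for Splash's HTTP API, which listens on the same port. As a rough sketch (assuming the server started above is reachable at `http://localhost:8050`; the helper names `render_url` and `fetch_rendered_html` are made up for this example), a rendered page can be requested with nothing but the Python standard library:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default port from the docker command above.
SPLASH = "http://localhost:8050"


def render_url(page_url, wait=0.5, splash=SPLASH):
    """Build a render.html URL asking Splash to load page_url and wait `wait` seconds."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5):
    """Return the HTML of page_url after Splash has run its JavaScript.

    Needs a running Splash server, so it is not called at import time.
    """
    with urlopen(render_url(page_url, wait)) as resp:
        return resp.read().decode("utf-8")
```

With the server running, `fetch_rendered_html("http://quotes.toscrape.com/js/")` should return the page as it looks after its scripts have executed.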

# Scrapy Splash

Scrapy Splash is a plugin for Scrapy that lets it send requests to Splash for rendering and for the execution of scripts. Follow the [setup guide](https://github.com/scrapy-plugins/scrapy-splash), making sure to copy the relevant parts into the settings of your Scrapy project.
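
For orientation, the settings the scrapy-splash README asks for look roughly like the following. Treat this as a sketch and check the README for the version you install; `SPLASH_URL` assumes the Docker container from the previous section is running on the same machine.

```python
# settings.py of your Scrapy project

# Where the Splash server is listening (assumption: local docker container).
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and the HTTP cache aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

If your project already defines `DOWNLOADER_MIDDLEWARES` or `SPIDER_MIDDLEWARES`, merge these entries into the existing dictionaries rather than replacing them.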

## Requests

To get rendered pages we need to send our requests to Splash, and there are two ways to do this. One way is to use `scrapy_splash.SplashRequest`; you'll likely do this inside the `start_requests` method.


```python
from scrapy_splash import SplashRequest
...
# Inside the spider class add
    def start_requests(self):
        yield SplashRequest(
            url,
            self.parse,  # parsing callback we wish to use
            endpoint='execute',
            args={
                'lua_source': splash_script,
                'wait': 1,  # passed to the script as args.wait
                'elements': ['#id-of-element', '.some-class'],  # CSS selectors
                'maxwait': [20, 20],  # time willing to wait for each, in seconds
            },
        )
...
```

The second way is to modify the original Scrapy request. This method works well when we are getting our links from the LinkExtractor.

```python
...
    rules = [
        Rule(LinkExtractor(allow=(r'https://detail.tmall.com/item.htm.*')),
             process_request="use_splash"),
    ]

    def use_splash(self, request, response=None):
        # Scrapy 2.0+ also passes the response the link was extracted from
        request.meta['splash'] = {
            'endpoint': 'execute',
            'args': {
                'lua_source': splash_script,
                'wait': 1,
                'elements': ['#id-of-element', '.some-class'],
                'maxwait': [20, 20],
            },
        }
        return request
...
```



Make sure you read the Splash script in from its file.

```python
with open('path/to/splash/script','r') as f:
    splash_script = f.read()
```
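
A relative path like the one above is resolved against the directory the crawl is started from, which may not be where the spider lives. One possible way around that, sketched with a hypothetical helper `load_splash_script` (not part of Scrapy or scrapy-splash), is to resolve the script relative to an explicit base directory:

```python
from pathlib import Path


def load_splash_script(name, base_dir=None):
    """Read a Lua script, resolving `name` relative to base_dir.

    If base_dir is omitted, the current working directory is used.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    return (base / name).read_text(encoding="utf-8")
```

Inside a spider module this could be called as `load_splash_script("wait_for_element.lua", Path(__file__).parent)`, where the file name is only an example.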

In the above examples we're asking Splash to execute a Lua script for us, but this might not always be needed. Sometimes just having Splash render the page, along with a short wait, is enough to get the correct HTML.

```python
    yield SplashRequest(url,
        args={
            # optional; parameters passed to Splash HTTP API
            'wait': 5,
        },
        endpoint='render.html',
    )
```

Note how the endpoint argument changes from `execute` to `render.html`.

## Splash Response

Splash returns a subclass of a Scrapy response. There are 3 different types, depending on what we asked Splash to return.

* SplashResponse is returned for binary Splash responses - e.g. for /render.png responses;
* SplashTextResponse is returned when the result is text - e.g. for /render.html responses;
* SplashJsonResponse is returned when the result is a JSON object - e.g. for /render.json responses or /execute responses when script returns a Lua table.


For handling multiple return values from Splash in Scrapy, see this Stack Overflow question: https://stackoverflow.com/questions/37203458/how-to-handle-multiple-return-values-in-scrapy-from-splash
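
The scrapy-splash README documents that a `SplashJsonResponse` exposes the decoded JSON as `response.data`. Below is a small sketch of unpacking such a result, assuming the Lua script returned a table like `{html=splash:html(), png=splash:png()}` (binary values such as screenshots arrive base64-encoded inside the JSON). `unpack_splash_result` is a hypothetical helper name.

```python
import base64


def unpack_splash_result(data):
    """Split the decoded JSON of an /execute response into HTML and PNG bytes.

    `data` is the dict that scrapy-splash exposes as `response.data`.
    Missing keys come back as None.
    """
    html = data.get("html")
    png_b64 = data.get("png")
    # Splash base64-encodes binary values when they are returned in a table.
    png = base64.b64decode(png_b64) if png_b64 else None
    return html, png


# Possible use inside a spider callback:
#     def parse(self, response):
#         html, png = unpack_splash_result(response.data)
```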

# Example

Below is an example script that we can use to wait for a particular element to load.

## Waiting For An Element

Often we'll request a web page but the HTML element that we're looking for isn't there, because it's generated dynamically by JavaScript after the page has loaded. In these cases we can use Splash to wait until the element has loaded and then send us the HTML.

```lua
function wait_for_element(splash, css, maxwait)
    -- Wait until a selector matches an element
    -- in the page. Return an error if waited more
    -- than maxwait seconds.
    if maxwait == nil then
        maxwait = 10
    end
    return splash:wait_for_resume(string.format([[
      function main(splash) {
        var selector = '%s';
        var maxwait = %s;
        var end = Date.now() + maxwait*1000;
        function check() {
          if(document.querySelector(selector)) {
            splash.resume('Element found');
          } else if(Date.now() >= end) {
            var err = 'Timeout waiting for element';
            splash.error(err + " " + selector);
          } else {
            setTimeout(check, 200);
          }
        }
        check();
      }
    ]], css, maxwait))
end

function main(splash, args)
    splash:go(args.url)
    for i = 1, #args.elements do
        wait_for_element(splash, args.elements[i], args.maxwait[i])
    end
    return {html=splash:html()}
end
```
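
The `main` function above reads `args.elements[i]` and `args.maxwait[i]` in lockstep, so the two lists sent from Scrapy have to line up. One way to keep them in sync is to build both from a single mapping. This is a sketch, where `build_wait_args` is a hypothetical helper and not part of scrapy-splash:

```python
def build_wait_args(lua_source, waits):
    """Build the `args` dict for a Splash request that uses the script above.

    `waits` maps each CSS selector to the number of seconds we are
    willing to wait for it, e.g. {"#id-of-element": 20}.
    """
    for selector in waits:
        # The Lua script interpolates the selector into a single-quoted
        # JavaScript string, so a quote or backslash would break it.
        if "'" in selector or "\\" in selector:
            raise ValueError(f"unsupported character in selector: {selector!r}")
    return {
        "lua_source": lua_source,
        "elements": list(waits),          # dicts keep insertion order (Python 3.7+)
        "maxwait": list(waits.values()),  # same order as "elements"
    }
```

The result can be passed as `args=` to `SplashRequest` or stored under `request.meta['splash']['args']`, as in the request examples earlier.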

# Exercise

* Use scrapy and splash to scrape http://quotes.toscrape.com/js/