# When `wget` fails you: Using Selenium to tackle tough dataset downloads

If you're a data scientist, you've probably had to download big datasets while SSH'd into a remote Linux machine. `wget` and `curl` often fail when the download is locked behind a JavaScript mechanism that requires real browser interaction.

This gives me some choices:

1. *Download to my laptop and upload to the server.* I don't want to do this, because my laptop is short on memory, and this method would be slow.

2. *Use X11 forwarding and download that way.* This is the most convenient method, but it is unsafe. (See [this security.stackexchange post 14815](https://security.stackexchange.com/questions/14815/security-concerns-with-x11-forwarding) for more details.) So this was off the table.

3. *Programmatically interact with the webpage to download it.* This is fast, does not require X11 forwarding, and can be extended to other webpages.

In this notebook, I'll use Selenium and the Firefox WebDriver to download some data from [nuscenes.org](https://www.nuscenes.org/), a public autonomous driving dataset. This tutorial assumes familiarity with Linux environments (the command line and the concept of a 'path'), Python environments (i.e. installing with `pip` or `conda`), and introductory web development (i.e. HTML and browser developer tools).

---

# Part 0: How `wget` fails us

> **TLDR:** The download is locked behind Amazon Cognito, which time-gates the downloads and makes using `wget` difficult.

First, let's confirm that `wget` actually does fail. Let's try to download a dataset from the US server.

Here's the NuScenes download page:

> ![image.png](attachment:d7ed109b-42aa-4c0f-a251-d9aa56d3a6d7.png)

Right-click + copy URL gives us `https://www.nuscenes.org/download#`, which means the download is hidden behind some JavaScript garbage. Sigh.

> ![image.png](attachment:6d70c28d-df69-4c96-8b67-6e23b06927d4.png)

But what's this? If we inspect our Download folder, we find a URL. Perhaps we can use this same URL on the server? Let's try with `wget`:

In [8]:
!wget -O nuscenes 'https://s3.amazonaws.com/data.nuscenes.org/public/v1.0/v1.0-test_meta.tgz?AWSAccessKeyId=AKIA6RIK4RRMFUKM7AM2&Signature=SWb%2BtTHDTWvYiss92nLY%2FyY4dDQ%3D&Expires=1634561933'

--2021-10-25 14:46:31--  https://s3.amazonaws.com/data.nuscenes.org/public/v1.0/v1.0-test_meta.tgz?AWSAccessKeyId=AKIA6RIK4RRMFUKM7AM2&Signature=SWb%2BtTHDTWvYiss92nLY%2FyY4dDQ%3D&Expires=1634561933
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.142.216
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.142.216|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2021-10-25 14:46:31 ERROR 403: Forbidden.
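
The `Expires` parameter in that URL hints at why the request was rejected. A minimal sketch, assuming the link is a standard pre-signed S3 URL in which `Expires` is a Unix timestamp (the helper name `presigned_expiry` is made up for this example):

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def presigned_expiry(url):
    """Return the expiry time (UTC) stored in the URL's `Expires` query parameter."""
    query = parse_qs(urlsplit(url).query)
    return datetime.fromtimestamp(int(query["Expires"][0]), tz=timezone.utc)


# The same URL we passed to wget above
url = (
    "https://s3.amazonaws.com/data.nuscenes.org/public/v1.0/v1.0-test_meta.tgz"
    "?AWSAccessKeyId=AKIA6RIK4RRMFUKM7AM2"
    "&Signature=SWb%2BtTHDTWvYiss92nLY%2FyY4dDQ%3D&Expires=1634561933"
)

expiry = presigned_expiry(url)
print(expiry.isoformat())  # 2021-10-18T12:58:53+00:00

# The wget call above ran on 2021-10-25, a week after the link expired
print(expiry < datetime(2021, 10, 25, tzinfo=timezone.utc))  # True
```

So a copied link is only useful for a short window, which is consistent with the 403 above.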



Sigh... Well, no problem! This gives us a chance to break out **Selenium** for web-scraping.

Selenium is really cool: We'll use it to interact with the webpage using a full-fat web-browser. So let's try this again, using Selenium.

---

# Part 1: Setting things up

[Selenium (selenium.dev)](https://www.selenium.dev/documentation/) is, as we use it, a tool to interface with a **WebDriver**. It has libraries for Python, Ruby, JavaScript, Java, Kotlin, and C#.

A WebDriver ([see Selenium docs here](https://www.selenium.dev/documentation/webdriver/)) is an interface that lets us drive a real web browser programmatically. For Firefox, that interface is provided by the `geckodriver` binary.

> **TLDR:** In this section, we
> 1. Download the Firefox 'geckodriver' binary from [github.com/mozilla/geckodriver/releases](https://github.com/mozilla/geckodriver/releases)
> 2. Install Selenium
> 3. Instantiate the driver in Python
> 4. Set preferences so that the browser does not ask for confirmation before downloading

First things first, we need to download our WebDriver and put it somewhere that is in our path:

In [23]:
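# This downloads geckodriver 0.30.0 (the 32-bit Linux build) to our current directory.
# On a 64-bit machine, use the `linux64` asset from the same release page instead.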
!wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux32.tar.gz

--2021-10-25 14:54:21--  https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux32.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github-releases.githubusercontent.com/25354393/e7fc3349-3879-407e-867c-399a10191a07?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20211025%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211025T185421Z&X-Amz-Expires=300&X-Amz-Signature=99ddd1c95e3fdaa811de959410fbbee00f2351e845a7f55792a11e9abb6c84d0&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=25354393&response-content-disposition=attachment%3B%20filename%3Dgeckodriver-v0.30.0-linux32.tar.gz&response-content-type=application%2Foctet-stream [following]
--2021-10-25 14:54:21--  https://github-releases.githubusercontent.com/25354393/e7fc3349-3879-407e-867c-399a10191a07?X-Amz-Algorithm=AWS4-HMAC-SHA25

In [24]:
# This extracts the contents of the download, which is just the executable, `geckodriver`
!tar xf geckodriver-v0.30.0-linux32.tar.gz

In [25]:
# Optionally, move this to /usr/bin, /usr/local/bin, or anywhere in your path
# I have ~/bin/ in my path, so I move it there
!mv geckodriver ~/bin/geckodriver

!rm geckodriver-v0.30.0-linux32.tar.gz
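
Before moving on, it can be worth confirming that the binary really is discoverable. A small sketch using only the standard library (the helper name `locate_driver` is made up for this example); with the Selenium 3.141.0 used in this notebook, the driver is looked up by name on the path unless you pass an explicit location:

```python
import shutil


def locate_driver(name="geckodriver"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


path = locate_driver()
if path is None:
    print("geckodriver is not on PATH; move it into a directory that is")
else:
    print(f"geckodriver found at {path}")
```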

Now that we have downloaded `geckodriver`, let's install Selenium. I use `conda`, but you can install it using `pip` or other tools as well.

In [27]:
!conda install -c conda-forge selenium --yes
# Note: `--yes` skips conda's confirmation prompt, so use it with care.

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/lynn/miniconda3/envs/ds

  added / updated specs:
    - selenium


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2021.10.8          |   py37h89c1867_0         144 KB  conda-forge
    selenium-3.141.0           |py37h5e8e339_1002         870 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        1014 KB

The following packages will be UPDATED:

  ca-certificates    pkgs/main::ca-certificates-2021.7.5-h~ --> conda-forge::ca-certificates-2021.10.8-ha878542_0
  certifi            pkgs/main::certifi-2021.5.30-py37h06a~ --> conda-forge::certifi-2021.10.8-py37h89c1867_0
  selenium           pkgs/main::selenium-3.141.0-py37h14c3~ --> conda-forge::selenium-3.141.0-py37h5e8

Voila! We're now ready for the main event. Let's import Selenium and set some important preferences in Python. Specifically, we want Firefox to save downloads without prompting, since a prompt would require GUI interaction.

In [29]:
from selenium import webdriver # we use this to start up our driver
from selenium.webdriver.common.keys import Keys # we use this to send keystrokes

In [32]:
firefox_profile = webdriver.FirefoxProfile()

# MIME types that Firefox should save to disk without showing a prompt
firefox_profile.set_preference(
    "browser.helperApps.neverAsk.saveToDisk",
    "application/x-tar,application/gzip,application/x-gzip,application/x-gtar,"
    "application/x-tgz,application/tar,application/tar+gzip,"
    "application/octet-stream,application/json"
)

# Optionally, uncomment the `--headless` line below to run the browser
# without a window, for servers without displays
firefox_options = webdriver.FirefoxOptions()
#firefox_options.add_argument('--headless')

# `options=` replaces the deprecated `firefox_options=` keyword
driver = webdriver.Firefox(firefox_profile=firefox_profile, options=firefox_options)



> ![image.png](attachment:cb51f614-37e0-490d-9814-ae213aeeecbf.png)

Voila! This opens Firefox in a new window.
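
On a server you will usually also want to choose where the files land. A hedged sketch of how the download preferences could be assembled: `browser.download.folderList` and `browser.download.dir` are standard Firefox preferences, while the helper names and the `/data/nuscenes` path are assumptions made for illustration.

```python
def download_prefs(download_dir, mime_types):
    """Build Firefox preferences for prompt-free downloads into `download_dir`."""
    return {
        # 2 tells Firefox to use the custom directory given below
        "browser.download.folderList": 2,
        "browser.download.dir": download_dir,
        # De-duplicate and sort so the value is stable between runs
        "browser.helperApps.neverAsk.saveToDisk": ",".join(sorted(set(mime_types))),
    }


def apply_prefs(profile, prefs):
    """Copy each preference onto any object exposing `set_preference(key, value)`."""
    for key, value in prefs.items():
        profile.set_preference(key, value)


# "/data/nuscenes" is a hypothetical target directory
prefs = download_prefs(
    "/data/nuscenes",
    ["application/gzip", "application/x-tar", "application/gzip"],
)
print(prefs["browser.helperApps.neverAsk.saveToDisk"])  # application/gzip,application/x-tar
```

Calling `apply_prefs(firefox_profile, prefs)` before creating the driver would apply these settings to the profile.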

Now, let's show the code that will let us navigate the nuscenes.org site.

# Part 2: Logging into NuScenes and downloading stuff

In this section, we explore the actual functionality that gives us our downloading superpowers.

> **TLDR:**
> 
> The `driver` object offers many methods for navigation, such as `find_element_by_id`.
> Combine these with `send_keys` commands (for sending strings or keys such as `Keys.ENTER`).

Let's log in to `nuscenes`:

In [34]:
driver.get("https://nuscenes.org/login")

Using methods such as `find_element_by_name`, we can jump straight to parts of the page. To find the name (or ID or XPath), use your browser's developer tools.
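
A note on versions: the `find_element_by_*` helpers exist in the Selenium 3.141.0 installed above, but newer Selenium 4 releases removed them in favour of the generic `find_element(by, value)`, which is available in both. A small sketch of a version-agnostic helper (the name `fill_field` is made up for this example):

```python
def fill_field(driver, field_name, text):
    """Type `text` into the input whose HTML `name` attribute is `field_name`."""
    # "name" is the value of selenium.webdriver.common.by.By.NAME
    element = driver.find_element("name", field_name)
    element.send_keys(text)
    return element
```

With a live driver, `fill_field(driver, "username", email)` would do the same job as the `find_element_by_name` call in the next cell.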

In [35]:
from getpass import getpass
# `getpass` operates just like `input`, but does not display the keys entered.

email = getpass("What is your email address?")
driver.find_element_by_name("username").send_keys(email)

What is your email address? ····················


In [36]:
password = getpass("and password?")
driver.find_element_by_name("password").send_keys(password)

and password? ········


... and now we log in!

In [41]:
dir(driver)

['CONTEXT_CHROME',
 'CONTEXT_CONTENT',
 'NATIVE_EVENTS_ALLOWED',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_file_detector',
 '_is_remote',
 '_mobile',
 '_switch_to',
 '_unwrap_value',
 '_web_element_cls',
 '_wrap_value',
 'add_cookie',
 'application_cache',
 'back',
 'binary',
 'capabilities',
 'close',
 'command_executor',
 'context',
 'create_web_element',
 'current_url',
 'current_window_handle',
 'delete_all_cookies',
 'delete_cookie',
 'desired_capabilities',
 'error_handler',
 'execute',
 'execute_async_script',
 'execute_script',
 'file_detector',
 'file_detector_context',
 'find_element',
 'find_element_by_class_name',
 'find_e

In [44]:
# You can get the XPath in most browsers using Right Click > Copy > ... while using developer tools.
# It's like an "address" for a specific element when a more convenient accessor (such as 'name' or 'id') is not defined.
# Pressing ENTER inside the form's input submits the login form.
driver.find_element_by_xpath("/html/body/div/div/div/div[3]/div/div/form/div[1]/input").send_keys(Keys.ENTER)

... Now, using these same tricks, we start a download.

In [45]:
driver.get("https://nuscenes.org/download")

In [47]:
xpath_to_download_link = "/html/body/div/div/div/div[3]/div/div[2]/div[2]/div[2]/div[6]/div/div[1]/div[2]/div/div[1]/div/div/span[1]/a"

download_link_element = driver.find_element_by_xpath(xpath_to_download_link)

download_link_element.send_keys(Keys.ENTER)

> ![image.png](attachment:7532f110-8223-4798-a3a4-7f161f995f11.png)

Success! The download starts.
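
Starting the download is not the same as finishing it. Firefox writes an in-progress download to a `<name>.part` file next to the final file, so one rough way to wait is to poll until no `.part` files remain. A minimal sketch, assuming you know the download directory (the helper name and the `~/Downloads` location in the usage note are assumptions):

```python
import time
from pathlib import Path


def wait_for_downloads(download_dir, timeout=3600.0, poll=1.0):
    """Wait until no `*.part` files remain in `download_dir`.

    Returns True once the directory has no partial downloads,
    or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not any(Path(download_dir).glob("*.part")):
            return True
        time.sleep(poll)
    return False
```

For example, `wait_for_downloads(Path.home() / "Downloads")` before quitting the driver. Note that this returns `True` immediately if it is called before Firefox has created the `.part` file, so give the download a moment to start first.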

In [48]:
# Run this only once the download has finished: quitting closes the browser
driver.quit()