# Web scraping and crawling

Now we're moving forward in terms of difficulty - writing code to traverse and capture data from the web.

You largely already have the skills necessary to do this; the major one is being able to parse the structure and text of an HTML document. Now we are simply going to put together the mental map of how to instruct a program to walk through a site.

# Orders of complexity

Scraping approaches come in increasing levels of difficulty, and how strongly your target resists being scraped should determine which approach you implement (i.e. don't buy a bazooka to go to a knife fight).

* Exploiting regularly structured urls (`requests`)
* Crawling a site with typically static content 
* Crawling a site with dynamic content and human restrictions (`selenium`)
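
The first and simplest approach can be sketched as follows. This is only an illustration, assuming a site whose pages differ by a single query parameter; `example.com`, the `page` parameter, and the helper names `build_page_urls` and `fetch_pages` are hypothetical placeholders rather than a real target.

```python
import time
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, param="page"):
    """Return one URL per page number, built from a regular URL pattern."""
    return [
        f"{base_url}?{urlencode({param: page})}"
        for page in range(first_page, last_page + 1)
    ]


def fetch_pages(urls, delay_seconds=1.0):
    """Download each URL in turn and return a dict mapping URL to HTML text."""
    # Imported here so build_page_urls can be used without the third-party package.
    import requests

    pages = {}
    for url in urls:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        pages[url] = response.text
        time.sleep(delay_seconds)  # pause between requests to be polite to the server
    return pages


# Placeholder site and parameter name; substitute the pattern of your real target.
urls = build_page_urls("https://example.com/news", 1, 3)
print(urls)
```

Calling `fetch_pages(urls)` would then download the pages one by one. If the URLs of a site follow a pattern like this, nothing heavier than `requests` is needed.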



## Scraping with Selenium

We will need to download and install `geckodriver` following the instructions for your system. You will also need to move the `geckodriver` binary into a directory on your `PATH`, such as `/usr/local/bin/` or `C:\Windows\System32\`.

Now watch for something totally crazy.

In [8]:
!wget  https://github.com/mozilla/geckodriver/releases/download/v0.33.0/geckodriver-v0.33.0-linux64.tar.gz


Then unpack it with the command below (add `sudo` if you are extracting directly into a system directory such as `/usr/local/bin/`).

In [10]:
!tar -xvf geckodriver-v0.33.0-linux64.tar.gz

geckodriver
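
Once the binary has been unpacked and moved, it may be worth confirming that it can actually be found before launching a browser. A minimal sketch using only the standard library; the helper name `find_driver` is hypothetical:

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("geckodriver not found; move it into a directory listed in PATH")
else:
    print(f"geckodriver found at {path}")
```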


In [None]:
!python selenium_example.py

Yup, that's right. It started an entire web browser (Firefox in this case). This is why Selenium is the most powerful (and most costly) solution to scraping.

So now let's inspect this code:

In [1]:
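from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Start the browser of your choice (Firefox here, driven by geckodriver).
driver = webdriver.Firefox()
# Assumption: the target is python.org, whose search box is named "q".
driver.get("https://www.python.org")

# Find the search box by its name, empty it, and type the query.
elem = driver.find_element(by=By.NAME, value="q")
elem.clear()
elem.send_keys("pycon")
# Uncomment to submit the search, check for results, and close the browser:
# elem.send_keys(Keys.RETURN)
# assert "No results found." not in driver.page_source
# driver.close()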

You start by creating a webdriver for the browser of your choice.

Using `driver.get()` you give it a URL to load.

Once there, you can instruct it to search for a specific element by its name. In this case `q` is the input field for searching the site.

As a pre-emptive move, the code clears the box and then sends the query `pycon`.

The last three lines, commented out here so the browser stays open for the cells below, hit return, check that the search actually returned results, and close the browser.

Simple, right?

Now let's try to search for `Biden` on CNN.

In [None]:
#Exercise


Amazing! **But complicated**. We can also use the browser's back and forward buttons.

In [None]:
driver.back()

In [None]:
driver.forward()

You could also print the page source (and thus save it) or hand it to Beautiful Soup.

In [None]:
driver.page_source

But this won't work magic: if something is not in the page source in your own browser, it won't be in the source that Selenium sees either.
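
Saving the page source and handing it to Beautiful Soup might look like the sketch below. It uses a small stand-in HTML string so that it runs without a browser; in the notebook you would pass `driver.page_source` instead. The helper names `save_page_source` and `extract_text_by_class`, the `headline` class, and the output file name are hypothetical.

```python
import tempfile
from pathlib import Path


def save_page_source(html, path):
    """Write the HTML string to disk so it can be re-parsed without reopening the browser."""
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


def extract_text_by_class(html, class_name):
    """Return the stripped text of every element carrying the given CSS class."""
    # Imported here so save_page_source can be used without the third-party package.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.find_all(class_=class_name)]


# Stand-in for driver.page_source; the class name is a made-up example.
html = '<h3 class="headline"> First story </h3><h3 class="headline">Second story</h3>'
saved = save_page_source(html, Path(tempfile.gettempdir()) / "page_source.html")
print(f"saved {len(html)} characters to {saved}")
```

With `beautifulsoup4` installed, `extract_text_by_class(html, "headline")` returns the text of each matching element, which lets you reuse your existing parsing skills on pages that only Selenium can load.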

We can also find all elements that share the same class name.

In [None]:
# find_elements_by_class_name was removed in Selenium 4.3; use find_elements with By.
headlines = driver.find_elements(By.CLASS_NAME, "cnn-search__result-headline")

In [None]:
headlines

In [None]:
for hl in headlines:
    print(hl.text)