# Getting Started with Python Web Scraping
Source: https://learning.oreilly.com/videos/getting-started-with/9781787283244/9781787283244/

## Chapter 1 - Scraping with Selenium

### When and Why Web Scrape
* Increased efficiency
* Retrieve updating data
* Automate tedious/repetitive tasks
* Make informed decisions


In the browser DevTools console, you can test CSS selectors with $$(".class_name"), which returns every element that matches.

Example: https://github.com/microsoft/TypeScript/commits/main
    $$(".mb-1") yields an array of 35 matching elements:

    (35) [p.mb-1, p.mb-1, p.mb-1, ...]

If there isn't a good ID nearby, we can fall back on XPath. However, XPath queries that rely on element position break easily when the page markup changes.

     $x('//*[@id="code-tab"]/span[1]')

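To make the fragility concrete, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` (a stand-in for the browser console, not Selenium) on two hypothetical snippets of markup. The element names and the `label` class are assumptions made up for illustration. In Selenium the same kind of expression would be passed to `find_element(By.XPATH, ...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a tab that holds an icon and a label.
before = ET.fromstring(
    '<nav><a id="code-tab"><span>icon</span>'
    '<span class="label">Code</span></a></nav>'
)
# The same tab after a small redesign: one extra span is inserted first.
after = ET.fromstring(
    '<nav><a id="code-tab"><span>badge</span><span>icon</span>'
    '<span class="label">Code</span></a></nav>'
)

# Depends on element order inside the tab.
positional = ".//*[@id='code-tab']/span[2]"
# Depends on an attribute that survived the redesign.
by_class = ".//*[@id='code-tab']/span[@class='label']"

results = {
    "positional": (before.find(positional).text, after.find(positional).text),
    "by_class": (before.find(by_class).text, after.find(by_class).text),
}
print(results)
```

The positional query silently returns a different element after the redesign, while the attribute-based query keeps returning the label.
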
### Using the Selenium Module

#### WebDriver
* Commonly used to test web-applications
* Allows us to write Python to automate the browser


In [137]:
import pprint

import selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome in headless mode (no visible browser window).
options = Options()
options.add_argument('--headless')
browser = webdriver.Chrome(options=options)

In [138]:
browser.get("https://old.reddit.com")

In [141]:
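# Older tutorials call browser.find_elements_by_css_selector("a.title");
# recent Selenium 4 releases removed it in favor of find_elements(By.CSS_SELECTOR, ...).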

titles2 = browser.find_elements(By.CSS_SELECTOR, "a.title")
[t.text for t in titles2]

["What show has an intro that is such a banger, you wouldn't dream of hitting the skip button?",
 '[Highlight] Nikola Jokic shoves Markieff Morris to the ground from behind',
 'NASA’s James Webb Telescope Arriving for Launch',
 "LPT Request: To poor spellers out there....the reason people don't respect your poor spelling isn't purely because you spell poorly. It's because...",
 'He went naked to prove a point',
 "i'd rather be saved than my baby",
 'I was involved in a hit and run, but the idiot left their plate in my wheel.',
 'I damn near hate my downstairs neighbors austic child',
 'How the fu*k can billionaire corporations pay so little in wages that their employees get food stamps paid by tax ?',
 'Why Facebook’s Metaverse Is Dead on Arrival',
 'Fortnite Pulls the Travis Scott Emote After Astroworld Concert Tragedy',
 'Politician to miss his anti-vaccine mandate rally because he has COVID',
 '1977 interview - the dignity of Dolly Parton, while Barbara Walters does her best to humi

In [147]:
# What if we want to navigate to the next page?
# (The older find_element_by_css_selector(".next-button a") call is gone in recent Selenium 4 releases.)
# Named next_link rather than next to avoid shadowing the built-in next().
next_link = browser.find_element(By.CSS_SELECTOR, ".next-button a")
next_link.click()
titles = browser.find_elements(By.CSS_SELECTOR, "a.title")
[t.text for t in titles]

['Inspirational',
 'One of the best actors out there',
 'Deepest earthquake ever detected should have been impossible',
 'ich学iel',
 'My lack of religious beliefs was used against me in family court',
 'Love is powerful.',
 'QT ran into a glass door and is currently in the ER',
 "Don't stop believing",
 'Florida, Ladies and Gentlemen...',
 'me_irl',
 'They probably gonna invite her to their birthday party idk',
 'YSK how to increase your chances of survival in a crowd crush',
 "Paying £400 a month for a monthly train ticket, only to see the prick infront push through the barriers whilst the attendant looks on and I'm stuck there for 10 seconds like a moron as it resets.",
 'Yeah.... Can you imagine?',
 'Calling any or all strangers to watch this toddler for $0.52 /hour!',
 'The only thing said in her bio was that she designed clothes…',
 'meirl',
 'My first Six years of employment as an Immigrant in US [OC]',
 "There's no Excusing This. This is Trashy in Any Context.",
 'Professor X as

In [148]:
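titles_text = []

# Collect the titles from three consecutive pages.
for i in range(3):
    titles = browser.find_elements(By.CSS_SELECTOR, "a.title")
    titles_text += [t.text for t in titles]

    # Named next_link rather than next to avoid shadowing the built-in next().
    next_link = browser.find_element(By.CSS_SELECTOR, ".next-button a")
    next_link.click()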
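The loop above assumes every page has a next link. As a hedged sketch, the same idea can be wrapped in a function that stops early when the link is missing. `collect_titles` is a hypothetical helper name, and the locator strategy is passed as the plain string `"css selector"`, which is the value of Selenium's `By.CSS_SELECTOR`, so the function has no import of its own.

```python
CSS = "css selector"  # same string value as selenium's By.CSS_SELECTOR


def collect_titles(browser, max_pages):
    """Gather title text from up to max_pages pages.

    Stops early when a page has no next link. `browser` is expected to
    behave like a Selenium WebDriver (it only needs find_elements).
    """
    collected = []
    for _ in range(max_pages):
        collected += [t.text for t in browser.find_elements(CSS, "a.title")]

        # find_elements returns an empty list when nothing matches,
        # so no exception handling is needed to detect the last page.
        next_links = browser.find_elements(CSS, ".next-button a")
        if not next_links:
            break
        next_links[0].click()
    return collected
```

With the `browser` object created earlier, `collect_titles(browser, 3)` would play the role of the three-page loop.
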


In [149]:
for t in titles_text:
    print(t)

Inspirational
One of the best actors out there
Deepest earthquake ever detected should have been impossible
ich学iel
My lack of religious beliefs was used against me in family court
Love is powerful.
QT ran into a glass door and is currently in the ER
Don't stop believing
Florida, Ladies and Gentlemen...
me_irl
They probably gonna invite her to their birthday party idk
YSK how to increase your chances of survival in a crowd crush
Paying £400 a month for a monthly train ticket, only to see the prick infront push through the barriers whilst the attendant looks on and I'm stuck there for 10 seconds like a moron as it resets.
Yeah.... Can you imagine?
Calling any or all strangers to watch this toddler for $0.52 /hour!
The only thing said in her bio was that she designed clothes…
meirl
My first Six years of employment as an Immigrant in US [OC]
There's no Excusing This. This is Trashy in Any Context.
Professor X asks a girl, "what is your mutant power?"
I don't know how Dad could afford my s

***
## Chapter 2 - Parsing with BeautifulSoup

### Server-side
* The server sends fully generated HTML, which the browser simply renders

### Client-side
* The server sends code that runs in your browser to generate the HTML
* Telltale sign: the page source differs from the HTML shown in the browser's element inspector
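One rough way to tell the two apart from Python is to check whether the element you want already exists in the raw response body. The sketch below assumes BeautifulSoup is installed; `selector_in_source` is a hypothetical helper and both HTML strings are made-up examples rather than real responses.

```python
from bs4 import BeautifulSoup


def selector_in_source(raw_html, css_selector):
    """Return True if the raw (un-rendered) HTML already contains a match."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return soup.select_one(css_selector) is not None


# Hypothetical responses: one ships its content, the other ships a script.
server_rendered = '<div id="app"><a class="title" href="/1">First post</a></div>'
client_rendered = '<div id="app"></div><script src="/bundle.js"></script>'

print(selector_in_source(server_rendered, "a.title"))  # True
print(selector_in_source(client_rendered, "a.title"))  # False
```

If the check fails on the raw HTML but the element is visible in the browser, the content is probably generated client-side and a browser-driving tool such as Selenium is the safer choice.
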

In [51]:
import requests

In [71]:
url = "https://en.wikipedia.org/wiki/List_of_HTTP_status_codes"
header = { "From":"Student at CU Boulder" }

response = requests.get(url, headers=header)
if response.status_code != 200:
    print("Failed to get HTML:", response.status_code, response.reason)
    raise SystemExit(1)  # stop here; the following cells need the HTML
html = response.text

In [75]:
test_html = '''
    <p id="foo1"></p>
    <p id="foo2"></p>
'''
# Compare the output of the html.parser and html5lib parsers:
#
from bs4 import BeautifulSoup
BeautifulSoup(test_html, "html.parser")


<p id="foo1"></p>
<p id="foo2"></p>

In [76]:
BeautifulSoup(test_html, "html5lib")

<html><head></head><body><p id="foo1"></p>
    <p id="foo2"></p>
</body></html>

### Navigating HTML with Beautiful Soup
### Using "find_all" and "find"
* soup.find_all("p")                            | # Returns all "p" tags
* soup.find_all(["th","td"])                    | # Returns tags matching either name
* soup.find_all(class_="buzz")                  | # Returns tags with the class "buzz"
* soup.find_all(id=re.compile("^foo"))          | # Returns tags whose id begins with "foo" (needs import re)
* soup.find("p")                                | # Returns only the first "p" tag, or None if there is none
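The calls listed above can be tried on a small document. This is a minimal sketch on made-up HTML (the ids, classes, and text are assumptions for illustration), parsed with the built-in `html.parser` backend.

```python
import re

from bs4 import BeautifulSoup

sample = """
<table>
  <tr><th id="foo-head">Name</th><td class="buzz">Ada</td></tr>
</table>
<p id="foo1" class="buzz">first</p>
<p id="bar2">second</p>
"""
soup = BeautifulSoup(sample, "html.parser")

paragraphs = soup.find_all("p")                  # every <p> tag
cells = soup.find_all(["th", "td"])              # tags matching either name
buzz = soup.find_all(class_="buzz")              # tags carrying class "buzz"
foo_ids = soup.find_all(id=re.compile("^foo"))   # ids starting with "foo"
first_p = soup.find("p")                         # first match only, or None

print([p["id"] for p in paragraphs])
print([c.name for c in cells])
print([t.get_text() for t in buzz])
print([t["id"] for t in foo_ids])
print(first_p.get_text())
```
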

#### A Tag Object
* Tag attributes are in a dictionary
    * Can see full dictionary with: tag.attrs
    * tag["id"] raises a KeyError if the attribute is not present; use tag.get("id") to get None instead
    * Attributes are both readable and writable
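A short sketch of the attribute behavior described above, run on a made-up two-paragraph snippet (the ids and classes are assumptions for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="foo1" class="buzz intro">hello</p><p>bare</p>', "html.parser"
)
first, second = soup.find_all("p")

# The full attribute dictionary; class is multi-valued, so it comes back as a list.
attrs = first.attrs

# .get() returns None for a missing attribute, while indexing raises KeyError.
missing = second.get("id")
raised = False
try:
    second["id"]
except KeyError:
    raised = True

# Attributes can be assigned and deleted like dictionary entries.
first["id"] = "renamed"
del first["class"]

print(attrs, missing, raised)
print(first)
```
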
    
#### Warning: Don't Go Down the Rabbit Hole
* Take advantage of the power of selectors
* Don't over-Python the parsing
* It's easy to start writing bad code and hard to stop.
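As an illustration of the warning above, the sketch below gets the same result twice on made-up HTML: once with nested Python loops and once with a single CSS selector. The table contents are hypothetical.

```python
from bs4 import BeautifulSoup

sample = """
<table class="wikitable"><tr>
  <td><a href="/a"><b>Alfred</b></a></td>
  <td><a href="/x">note</a></td>
</tr></table>
<table class="other"><tr>
  <td><a href="/b"><b>Skip me</b></a></td>
</tr></table>
"""
soup = BeautifulSoup(sample, "html.parser")

# Over-Pythoned: every level of the hierarchy becomes a loop or an if.
manual = []
for table in soup.find_all("table"):
    if "wikitable" in table.get("class", []):
        for link in table.find_all("a"):
            for bold in link.find_all("b"):
                manual.append(bold.get_text())

# The same query expressed as one selector.
with_selector = [b.get_text() for b in soup.select(".wikitable a b")]

print(manual, with_selector)
```

The selector version states the structure being matched in one place, which makes it easier to adjust when the page layout changes.
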


#### Example: Scraping Wikipedia
* Goals:
    * Working through an example with inconsistent data
    * Using CSS Selectors with BeautifulSoup
    * Using Python and BeautifulSoup methods to narrow down data

In [80]:
url = "https://en.wikipedia.org/wiki/List_of_English_monarchs"
header = { "From":"Student at CU Boulder" }

response = requests.get(url, headers=header)
html = response.text

soup = BeautifulSoup(html, "html5lib")

In [94]:
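# ".wikitable a b" matches bold text inside links within the wiki tables.
for b in soup.select(".wikitable a b"):
    name = b.text
    cell = b.find_parent("td")  # the table cell that contains the bold text

    contents = cell.text.split("\n")
    print(contents)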
   

['Ælfweardc.\u200917 July 924–2 August 924[9](16\xa0days)', '']
['(1st reign)[a]ÆthelredÆthelred the Unready18 March 978–1013(34–35 years)', '']
['SweynSweyn Forkbeard25 December 1013–3 February 1014(41\xa0days)', '']
['(2nd reign)ÆthelredÆthelred the Unready3 February 1014–23 April 1016(2\xa0years, 81\xa0days)', '']
['CanuteCnut the Great18 October 1016–12 November 1035(19\xa0years, 26\xa0days)', '']
['William IWilliam the Conqueror[d]25 December 1066–9 September 1087(20\xa0years, 259\xa0days)', '']
['William IIWilliam Rufus26 September 1087[i]–2 August 1100(12\xa0years, 311\xa0days)', '']
['Henry IHenry Beauclerc5 August 1100[ii]–1 December 1135(35\xa0years, 119\xa0days)', '']
['StephenStephen of Blois22 December 1135[iii]–25 October 1154(18\xa0years, 308\xa0days)', '']
['MatildaEmpress Matilda7 April 1141–1 November 1141(209\xa0days)', '']
['Henry IIHenry Curtmantle19 December 1154[iv]–6 July 1189(34\xa0years, 200\xa0days)', '']
['Richard IRichard the Lionheart3 September 1189[v]–6 