# Automate the Boring Stuff with Python
## Section 13: Web Scraping

In [2]:
import webbrowser

In [4]:
# Open a website
webbrowser.open('https://automatetheboringstuff.com')

True

In [5]:
# Tip: many web services encode their input in the URL; spot that structure and you can build the URLs in code
# Example: Google Maps
def mapit(address):
    url = 'https://www.google.com/maps/place/' + address
    webbrowser.open(url)

In [6]:
mapit('Tximistarri bidea 9, San Sebastian')
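
Addresses often contain spaces, commas, or accented characters. Browsers usually tolerate them in a URL, but percent-encoding them first is safer. A minimal sketch, assuming the same Google Maps prefix as `mapit()`; the names `build_maps_url` and `mapit_encoded` are hypothetical helpers, not part of the course code:

```python
import webbrowser
from urllib.parse import quote

# Same URL prefix that mapit() uses
MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_maps_url(address):
    """Return the Maps URL for an address, percent-encoding unsafe characters."""
    return MAPS_PREFIX + quote(address)

def mapit_encoded(address):
    """Open the encoded Maps URL in the default browser."""
    webbrowser.open(build_maps_url(address))

# Building the URL does not open a browser, so it is easy to inspect
print(build_maps_url('Tximistarri bidea 9, San Sebastian'))
```

Keeping the URL construction separate from `webbrowser.open()` makes it possible to check the URL before opening anything.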

### Downloading from webpages

In [7]:
# Install: requests
# pip install requests
# pip3 install requests
# conda install requests

In [8]:
import requests

In [9]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')

In [12]:
# check if everything went ok with the status code
# 200: everything ok, 404: not found, etc
res.status_code

200
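
The numeric codes have standard reason phrases, which the standard library exposes through `http.HTTPStatus`. A small sketch for turning a code such as `res.status_code` into readable text; `describe_status` is a hypothetical helper name:

```python
from http import HTTPStatus

def describe_status(code):
    """Return the standard reason phrase for an HTTP status code, or 'Unknown'."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # Raised when the code is not a registered HTTP status
        return 'Unknown'

for code in (200, 404, 500):
    print(code, describe_status(code))
```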

In [17]:
# access the text in the downloaded text file
len(res.text)

178978

In [18]:
# raise an error if something went wrong
res.raise_for_status()

In [19]:
# More info
# https://requests.readthedocs.io/en/master/

In [20]:
# To save the file AND preserve its encoding,
# write the raw bytes (binary mode) instead of the decoded text

In [21]:
text_file = open('romeo_and_juliet.txt', 'wb') # create & write as binary

In [22]:
# binary data can be written in chunks of bytes
# we can decide the size of each chunk (here 10000 bytes)
for chunk in res.iter_content(10000):
    text_file.write(chunk)
text_file.close()  # flush the data to disk and release the file
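
The same chunked write can be wrapped in a `with` block, which closes the file even if an error occurs midway. A sketch under the assumption that the chunks come from any iterable of `bytes` (for instance `res.iter_content(10000)`); `save_chunks` is a hypothetical helper, and the demo below uses in-memory bytes so it runs without a network connection:

```python
import os
import tempfile

def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path; return the total bytes written."""
    total = 0
    with open(path, 'wb') as f:  # binary mode, closed automatically on exit
        for chunk in chunks:
            total += f.write(chunk)
    return total

# Demo with in-memory chunks written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
written = save_chunks([b'Romeo ', b'and ', b'Juliet'], demo_path)
print(written)
```

With a real download, the call would look like `save_chunks(res.iter_content(10000), 'romeo_and_juliet.txt')`.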

### Parsing HTML with the Beautiful Soup module

In [24]:
# Firefox: F12 shows the HTML elements; right click + 'Inspect element' highlights the HTML code of that element
# requests downloads the whole webpage as text, but we still need to parse it --> pip/pip3/conda install beautifulsoup4

In [26]:
import bs4
import requests

In [31]:
res = requests.get('https://automatetheboringstuff.com')

In [32]:
res.raise_for_status()

In [35]:
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly to avoid a bs4 warning

In [38]:
# Firefox: F12, select element + right click -> Inspect; in the code, right click + Copy > CSS Path
# Then, paste it in .select()
# elem: a list of all the HTML elements that match the selector
elem = soup.select('html body div.main div ul li a')

In [45]:
# Visualize content
elem

[<a href="/2e/chapter0/">Chapter  0 – Introduction</a>,
 <a href="/2e/chapter1/">Chapter  1 – Python Basics</a>,
 <a href="/2e/chapter2/">Chapter  2 – Flow Control</a>,
 <a href="/2e/chapter3/">Chapter  3 – Functions</a>,
 <a href="/2e/chapter4/">Chapter  4 – Lists</a>,
 <a href="/2e/chapter5/">Chapter  5 – Dictionaries and Structuring Data</a>,
 <a href="/2e/chapter6/">Chapter  6 – Manipulating Strings</a>,
 <a href="/2e/chapter7/">Chapter  7 – Pattern Matching with Regular Expressions</a>,
 <a href="/2e/chapter8/">Chapter  8 – Input Validation</a>,
 <a href="/2e/chapter9/">Chapter  9 – Reading and Writing Files</a>,
 <a href="/2e/chapter10/">Chapter 10 – Organizing Files</a>,
 <a href="/2e/chapter11/">Chapter 11 – Debugging</a>,
 <a href="/2e/chapter12/">Chapter 12 – Web Scraping</a>,
 <a href="/2e/chapter13/">Chapter 13 – Working with Excel Spreadsheets</a>,
 <a href="/2e/chapter14/">Chapter 14 – Working with Google Spreadsheets</a>,
 <a href="/2e/chapter15/">Chapter 15 – Working wi

In [46]:
# Visualize content of first element
elem[0]

<a href="/2e/chapter0/">Chapter  0 – Introduction</a>

In [47]:
# Get text
elem[0].text

'Chapter  0 – Introduction'

In [48]:
# Strip leading and trailing whitespace
elem[0].text.strip()

'Chapter  0 – Introduction'
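
Two details in the output above are worth handling before reusing the links: the `href` values are relative to the site root, and `strip()` leaves the double space inside the text untouched. A minimal standard-library sketch, assuming the `href` and text have already been read from a tag (for example with `elem[0].get('href')` and `elem[0].text`); `absolute_link` and `tidy` are hypothetical helper names:

```python
from urllib.parse import urljoin

# The page the links were scraped from
BASE_URL = 'https://automatetheboringstuff.com'

def absolute_link(href, base=BASE_URL):
    """Resolve a relative href against the page it was found on."""
    return urljoin(base, href)

def tidy(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return ' '.join(text.split())

print(absolute_link('/2e/chapter0/'))
print(tidy('Chapter  0 – Introduction'))
```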

### Example: get price of Amazon product

For me, it didn't work, probably because Amazon changed their page to hide this kind of info.
That's a pity, because automatically fetching the Amazon prices of hundreds of thousands of products would be a very interesting application.

#### Other cool applications we could try
- weather.org: parse weather
- xkcd.com: download comic strips: download the current one, then go back by following the Prev link

### Controlling the web browser with selenium

In [None]:
# Install selenium: pip/3/conda install selenium
# https://selenium-python.readthedocs.io

Sometimes webpages rely on JavaScript, or you need to log in, etc.; then, downloading the content is not enough.
The third-party Selenium module lets us control a web browser from code.
We can fill out forms, click submit buttons, etc.
It's slower than requests + beautifulsoup, because it opens a real browser.

geckodriver must be downloaded & installed:

https://learn-automation.com/firefox-browser-on-mac-using-selenium-webdriver/

https://firefox-source-docs.mozilla.org/testing/geckodriver/Notarization.html

In [67]:
from selenium import webdriver

In [95]:
# Open the browser
browser = webdriver.Firefox()

In [71]:
# Open a page/URL
browser.get('https://automatetheboringstuff.com/')

In [80]:
# Get an element
# Firefox: right click + Inspect element, right click on code + copy CSS selector
elem = browser.find_element_by_css_selector('.main > div:nth-child(1) > ul:nth-child(21) > li:nth-child(1) > a:nth-child(1)')

In [81]:
# Once we have the element, we can perform typical actions on it with code!
elem.click()

In [82]:
# We can also get more general elements, eg, all paragraphs
elems = browser.find_elements_by_css_selector('p')

In [77]:
len(elems)

109

In [84]:
# Other Selenium webdriver methods
# https://automatetheboringstuff.com/2e/chapter12/
# browser.find_element_by_class_name()
# browser.find_elements_by_class_name()
# browser.find_element_by_id()
# browser.find_elements_by_id()
# ...
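
Newer Selenium releases (4.3 and later) removed the `find_element_by_*` helpers; the locator strategy is passed to `find_element()` / `find_elements()` as the first argument instead, normally via the `By` constants from `selenium.webdriver.common.by`. The sketch below passes the plain string `'css selector'`, which is the value behind `By.CSS_SELECTOR`, so the helper itself needs no Selenium import. `paragraph_texts` is a hypothetical name, and `browser` is assumed to be an already opened webdriver instance:

```python
def paragraph_texts(browser):
    """Return the visible text of every <p> element on the current page."""
    # 'css selector' is the string value of selenium's By.CSS_SELECTOR constant
    paragraphs = browser.find_elements('css selector', 'p')
    return [p.text for p in paragraphs]
```

After `browser.get(...)`, calling `paragraph_texts(browser)` would return one string per paragraph on the page.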

In [96]:
# Use the search field
browser.get('https://nostarch.com/automatestuff2')
searchElem = browser.find_element_by_css_selector('div.logo-wrapper:nth-child(2) > div:nth-child(1) > div:nth-child(1) > section:nth-child(1) > form:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > input:nth-child(1)')

In [97]:
# Type text into the field and submit it (no click needed: submit() finds the enclosing form)
searchElem.send_keys('deep learning')
searchElem.submit()

In [98]:
# Navigate
browser.back()

In [108]:
browser.forward()

In [100]:
browser.refresh()

In [101]:
# Close browser
browser.quit()

In [102]:
browser = webdriver.Firefox()

In [103]:
browser.get('https://automatetheboringstuff.com/')

In [104]:
# Get a paragraph element
elem = browser.find_element_by_css_selector('.main > div:nth-child(1) > p:nth-child(9)')

In [105]:
# Extract text
elem.text

"If you've ever spent hours renaming files or updating hundreds of spreadsheet cells, you know how tedious tasks like these can be. But what if you could have your computer do them for you?"

In [106]:
# Get the entire page: the html element contains the whole web page
elem = browser.find_element_by_css_selector('html')

In [107]:
elem.text

'Home | Buy on No Starch Press | Buy on Amazon | @AlSweigart |\nAutomate the Boring Stuff with Python\nBy Al Sweigart. Free to read under a Creative Commons license.\n\nNew Book: "Beyond the Basic Stuff with Python"\nYou\'ve read a beginner resource like Automate the Boring Stuff with Python or Python Crash Course, but still don\'t feel like a "real" programmer? Beyond the Basic Stuff covers software development tools and best practices so you can code like a professional. Available in November 2020, but you can use discount code PREORDER for 25% off.\n\nSecond Edition of Automate the Boring Stuff with Python\n\nPurchase directly from the publisher to get free PDF, Kindle, and epub ebook copies.\nBuy on Amazon\n\n\n\nUse this link to sign up for the Automate the Boring Stuff with Python online course on Udemy.\nPreview the first 15 of the course\'s 50 videos for free on YouTube.\n"The best part of programming is the triumph of seeing the machine do something useful. Automate the Boring