
The scripts and notes below are based on, and in places copied directly from, the exercises in [Teclado Code](https://github.com/tecladocode/complete-python-course/blob/master/course_contents/11_web_scraping/projects).


## WebScraping
Web scraping is the process of downloading the HTML of a page and parsing it to extract the data it contains.


## BeautifulSoup4
BeautifulSoup4 is a library for parsing HTML and XML documents and for navigating and extracting the data they contain.

To run the examples, install the following packages:

```pip3 install beautifulsoup4```  
```pip3 install requests```

In [1]:
from bs4 import BeautifulSoup

SIMPLE_HTML = '''<html>
<head></head>
<body>
<h1>This is a title</h1>
<p class="subtitle">Lorem ipsum dolor sit amet. Consectetur edipiscim elit.</p>
<p>Here's another p without a class</p>
<ul>
    <li>Rolf</li>
    <li>Charlie</li>
    <li>Jen</li>
    <li>Jose</li>
</ul>
</body>
</html>'''

simple_soup = BeautifulSoup(SIMPLE_HTML, 'html.parser')


def find_title():
    print(simple_soup.find('h1').string)


def find_list_items():
    list_items = simple_soup.find_all('li')
    list_content = [e.string for e in list_items]
    print(list_content)


def find_paragraph():
    print(simple_soup.find('p', {'class': 'subtitle'}).string)


def find_other_paragraph():
    paragraphs = simple_soup.find_all('p')
    other_paragraph = [p for p in paragraphs if 'subtitle' not in p.attrs.get('class', [])]
    print(other_paragraph[0].string)


find_title()
find_list_items()
find_paragraph()
find_other_paragraph()


This is a title
['Rolf', 'Charlie', 'Jen', 'Jose']
Lorem ipsum dolor sit amet. Consectetur edipiscim elit.
Here's another p without a class
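
One detail worth knowing about `.string`, used throughout the cell above: it returns `None` when a tag has more than one child, and `find` itself returns `None` when nothing matches. The following is a minimal sketch of that behaviour on a made-up snippet (the HTML here is illustrative, not from the course); `get_text()` is one possible alternative when a tag mixes text and nested tags.

```python
from bs4 import BeautifulSoup

# Illustrative HTML: the <p> mixes plain text with a nested <b> tag
html = '<p class="subtitle">Lorem <b>ipsum</b> dolor</p><h1>This is a title</h1>'
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p", {"class": "subtitle"})
print(paragraph.string)        # None: the <p> has three children (text, <b>, text)
print(paragraph.get_text())    # Lorem ipsum dolor
print(soup.find("h1").string)  # This is a title

# find() returns None when no tag matches, so check before using .string
missing = soup.find("h2")
print(missing.string if missing is not None else "no <h2> found")
```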


A more involved example: extracting several fields from a product listing with CSS selectors and a regular expression:

In [42]:
import re

from bs4 import BeautifulSoup


ITEM_HTML = '''<html><head></head><body>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
            <div class="image_container">
                    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
            </div>
                <p class="star-rating Three">
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                </p>
            <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
            <div class="product_price">
        <p class="price_color">£51.77</p>
<p class="instock availability">
    <i class="icon-ok"></i>
        In stock
</p>
    <form>
        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
    </form>
            </div>
    </article>
</li>
</body></html>
'''

soup = BeautifulSoup(ITEM_HTML, 'html.parser')


locator = 'article.product_pod div.image_container a'
test = soup.select_one(locator).attrs["href"]
print(test)

locator = 'article.product_pod h3 a'
item_name = soup.select_one(locator).attrs['title']
print("\n", item_name)

locator = "article.product_pod p.price_color"
test = soup.select_one(locator).string
print("\n", test)

# Loose first attempt: every "." matches any character, so this is for exploration only
pattern = r".([0-9]+.+)"
check = re.search(pattern, test)
print(check.group(0), "-", check.group(1))


def find_item_name():
    locator = 'article.product_pod h3 a'
    item_name = soup.select_one(locator).attrs['title']
    return item_name


def find_item_page_link():
    locator = 'article.product_pod h3 a'
    item_url = soup.select_one(locator).attrs['href']
    return item_url


def find_item_price():
    locator = 'article.product_pod p.price_color'
    item_price = soup.select_one(locator).string

    pattern = r'£([0-9]+\.[0-9]+)'  # raw string, so "\." is not read as a string escape
    matcher = re.search(pattern, item_price)
    return float(matcher.group(1))


def find_item_rating():
    locator = 'article.product_pod p.star-rating'
    star_rating_element = soup.select_one(locator)
    classes = star_rating_element.attrs['class']
    #rating_classes = [x for x in classes if x != 'star-rating']
    #return rating_classes
    rating_classes = filter(lambda x: x != 'star-rating', classes)
    return next(rating_classes)
    


print(find_item_name())
print(find_item_page_link())
print(find_item_price())
print(find_item_rating())

# You can then turn it into a dictionary or whichever
# way is easiest to store and work with:

item = {
    'name': find_item_name(),
    'link': find_item_page_link(),
    'price': find_item_price(),
    'rating': find_item_rating()
}

print(item)

catalogue/a-light-in-the-attic_1000/index.html

 A Light in the Attic

 £51.77
£51.77 - 51.77
A Light in the Attic
catalogue/a-light-in-the-attic_1000/index.html
51.77
Three
{'name': 'A Light in the Attic', 'link': 'catalogue/a-light-in-the-attic_1000/index.html', 'price': 51.77, 'rating': 'Three'}
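
The two price patterns above behave differently on messier input. Here is a small sketch, using only the standard library, of how they compare. The `parse_price` helper and the longer sample strings are hypothetical, added for illustration.

```python
import re

# Strict pattern from find_item_price(): a pound sign, digits, a literal dot, digits
PRICE_PATTERN = re.compile(r"£([0-9]+\.[0-9]+)")
# Loose exploratory pattern from the top of the cell: "." matches any character
LOOSE_PATTERN = re.compile(r".([0-9]+.+)")


def parse_price(text):
    """Return the price as a float, or None when no £-prefixed amount is found."""
    match = PRICE_PATTERN.search(text)
    return float(match.group(1)) if match else None


print(parse_price("£51.77"))                   # 51.77
print(parse_price("Price: £51.77 incl. tax"))  # 51.77
print(parse_price("out of stock"))             # None

# The loose pattern keeps everything after the digits, so float() would fail on it
loose = LOOSE_PATTERN.search("Price: £51.77 incl. tax").group(1)
print(loose)                                   # 51.77 incl. tax
```

Returning `None` for a missing price is one design choice; raising an exception would be equally reasonable if a missing price should stop the scrape.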


A good way to structure a scraper is to wrap its functions in classes, and to give each class a single purpose. For instance, one class holds the locators, which describe HOW TO GET THE DATA (they depend on the structure of the HTML). A second class parses the data: it does not care how each element is located, it simply uses the locators defined in the first class.

In [43]:
import re

from bs4 import BeautifulSoup


ITEM_HTML = '''<html><head></head><body>
<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <article class="product_pod">
            <div class="image_container">
                    <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg" alt="A Light in the Attic" class="thumbnail"></a>
            </div>
                <p class="star-rating Three">
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                    <i class="icon-star"></i>
                </p>
            <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
            <div class="product_price">
        <p class="price_color">£51.77</p>
<p class="instock availability">
    <i class="icon-ok"></i>
        In stock
</p>
    <form>
        <button type="submit" class="btn btn-primary btn-block" data-loading-text="Adding...">Add to basket</button>
    </form>
            </div>
    </article>
</li>
</body></html>
'''


class ParsedItemLocators:
    """
    Locators for an item in the HTML page.
    This allows us to easily see what our code will be looking at
    as well as change it quickly if we notice it is now different.
    """
    NAME_LOCATOR = 'article.product_pod h3 a'
    LINK_LOCATOR = 'article.product_pod h3 a'
    PRICE_LOCATOR = 'article.product_pod p.price_color'
    RATING_LOCATOR = 'article.product_pod p.star-rating'


class ParsedItem:
    """
    A class to take in an HTML page or content, and find properties of an item
    in it.
    """
    def __init__(self, page):
        self.soup = BeautifulSoup(page, 'html.parser')

    @property
    def name(self):
        locator = ParsedItemLocators.NAME_LOCATOR
        item_name = self.soup.select_one(locator).attrs['title']
        return item_name

    @property
    def link(self):
        locator = ParsedItemLocators.LINK_LOCATOR
        item_url = self.soup.select_one(locator).attrs['href']
        return item_url

    @property
    def price(self):
        locator = ParsedItemLocators.PRICE_LOCATOR
        item_price = self.soup.select_one(locator).string

        pattern = r'£([0-9]+\.[0-9]+)'  # raw string, so "\." is not read as a string escape
        matcher = re.search(pattern, item_price)
        return float(matcher.group(1))

    @property
    def rating(self):
        locator = ParsedItemLocators.RATING_LOCATOR
        star_rating_element = self.soup.select_one(locator)
        classes = star_rating_element.attrs['class']
        rating_classes = filter(lambda x: x != 'star-rating', classes)
        return next(rating_classes)


item = ParsedItem(ITEM_HTML)
print(item.price)

51.77
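
The `rating` property returns the class name as text (`'Three'`), and `next()` on the filter raises `StopIteration` if the element has no class other than `star-rating`. A possible refinement, sketched below under the assumption that the site only uses the class names `One` to `Five`, maps the name to an integer and returns `None` otherwise. `RATINGS` and `rating_to_int` are hypothetical names.

```python
# Assumed set of rating class names; adjust if the page uses others
RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(classes):
    """Map a class list such as ['star-rating', 'Three'] to an integer rating.

    Returns None when none of the classes is a known rating name.
    """
    for name in classes:
        if name in RATINGS:
            return RATINGS[name]
    return None


print(rating_to_int(["star-rating", "Three"]))  # 3
print(rating_to_int(["star-rating"]))           # None
```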


How to download the HTML of a page and parse it:

In [2]:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.example.com")
print(page)  # shows the response status, e.g. <Response [200]>
soup = BeautifulSoup(page.content, "html.parser")

print(soup.find("h1").string)
print(soup.select_one("p a").attrs["href"])

<Response [200]>
Example Domain
https://www.iana.org/domains/example
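
The request above has no timeout and does not check the status code. Below is a hedged sketch of a slightly more defensive version. `fetch_soup` and `heading_and_first_link` are hypothetical helpers, and the 10-second timeout is an arbitrary choice. `requests.get(..., timeout=...)` and `Response.raise_for_status()` are standard `requests` calls.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it; raises requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


def heading_and_first_link(soup):
    """Return (h1 text, href of the first 'p a' link); None for whichever is missing."""
    heading = soup.find("h1")
    link = soup.select_one("p a")
    return (
        heading.string if heading is not None else None,
        link.attrs.get("href") if link is not None else None,
    )
```

With network access, `heading_and_first_link(fetch_soup("http://www.example.com"))` would be the equivalent of the two `print` calls in the cell above.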


Scraping http://quotes.toscrape.com/ and selecting the author, text and tags of each quote:


In [1]:
from bs4 import BeautifulSoup

import requests

page = requests.get("http://quotes.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")

# Define the locators
quote_page_locator = "div.quote"
author_locator = "small.author"
quote_locator = "span.text"
tag_locator = "div a.tag"

print("******************************\n")

# Get data from each block of html
for content_html in soup.select(quote_page_locator):
    # print(content_html)

    author = content_html.select_one(author_locator)
    quote = content_html.select_one(quote_locator)
    tags = [tag.string for tag in content_html.select(tag_locator)]

    print("Author: ", author.string)
    print("Quote Text: ", quote.string)
    print("Tags: ", tags, "\n")
    #print("-----------\n")

print("******************************\n")

******************************

Author:  Albert Einstein
Quote Text:  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tags:  ['change', 'deep-thoughts', 'thinking', 'world'] 

Author:  J.K. Rowling
Quote Text:  “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tags:  ['abilities', 'choices'] 

Author:  Albert Einstein
Quote Text:  “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tags:  ['inspirational', 'life', 'live', 'miracle', 'miracles'] 

Author:  Jane Austen
Quote Text:  “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Tags:  ['aliteracy', 'books', 'classic', 'humor'] 

Author:  Marilyn Monroe
Quote Text:  “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Tags:  ['be-yourself', 'i
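
Instead of printing inside the loop, the same locators can feed a list of dictionaries, which is easier to store or test. Here is a minimal sketch, run against a small hand-written snippet that imitates the structure implied by the locators above (the sample HTML and the `parse_quotes` name are assumptions, not the site's actual markup):

```python
from bs4 import BeautifulSoup

# Hand-written sample that follows the locators used above
SAMPLE_HTML = '''
<div class="quote">
  <span class="text">A sample quote.</span>
  <span>by <small class="author">Some Author</small></span>
  <div class="tags">
    <a class="tag" href="/tag/one/">one</a>
    <a class="tag" href="/tag/two/">two</a>
  </div>
</div>
'''


def parse_quotes(html):
    """Return one dict per div.quote block with its author, text and tags."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "author": block.select_one("small.author").get_text(strip=True),
            "text": block.select_one("span.text").get_text(strip=True),
            "tags": [tag.get_text(strip=True) for tag in block.select("div a.tag")],
        }
        for block in soup.select("div.quote")
    ]


print(parse_quotes(SAMPLE_HTML))
```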

Scraping https://webscraper.io/test-sites/e-commerce/allinone and selecting the name, description and price of each best-selling product:


In [1]:
from bs4 import BeautifulSoup

import requests
import re

page = requests.get("https://webscraper.io/test-sites/e-commerce/allinone")
soup = BeautifulSoup(page.content, "html.parser")

# Define the locators
page_locator = "div.container div.row div.col-md-9"
product_locator = "a.title"
product_desc_locator = "p.description"
price_locator = "h4.pull-right"

# Define regex for prices
price_regex = r".([0-9]+\.[0-9]+)"  # raw string; the leading "." skips the currency symbol

print("******************************\n")

# Get data from each block of html
for content_html in soup.select(page_locator):
    # print(content_html)

    product = [prod.string for prod in content_html.select(product_locator)]
    product_desc = [prod.string
                    for prod in content_html.select(product_desc_locator)]
    price = [prod.string for prod in content_html.select(price_locator)]
    price_treated = [float(re.search(price_regex, prod.string).group(1))
                     for prod in content_html.select(price_locator)]

for name, description, cost in zip(product, product_desc, price_treated):
    print("Product: ", name)
    print("Product Description: ", description)
    print("Price Treated: ", cost, "\n")

print("Total of products: ", sum(price_treated))

print("******************************\n")


******************************

Product:  Dell Latitude 55...
Product Description:  Dell Latitude 5580, 15.6" FHD, Core i7-7600U, 16GB, 256GB SSD, GeForce GT930MX, Linux
Price Treated:  1341.22 

Product:  Acer Aspire ES1-...
Product Description:  Acer Aspire ES1-572 Black, 15.6" HD, Core i3-6006U, 4GB, 128GB SSD, Windows 10 Home
Price Treated:  469.1 

Product:  LG Optimus
Product Description:  3.2" screen
Price Treated:  57.99 

Total of products:  1868.3100000000002
******************************
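
The total printed above, `1868.3100000000002`, shows the usual binary floating-point rounding. One way around it, sketched here with the standard library's `decimal` module, is to convert the matched text to `Decimal` instead of `float`. The sample strings and the `to_decimal` helper are hypothetical; the values mirror the three prices in the output.

```python
import re
from decimal import Decimal

PRICE_PATTERN = re.compile(r"([0-9]+\.[0-9]+)")


def to_decimal(price_text):
    """Extract the numeric part of a price string and return it as a Decimal."""
    match = PRICE_PATTERN.search(price_text)
    if match is None:
        raise ValueError(f"no price found in {price_text!r}")
    return Decimal(match.group(1))


# Made-up price strings carrying the same amounts as the output above
sample_prices = ["$1341.22", "$469.10", "$57.99"]
total = sum((to_decimal(text) for text in sample_prices), Decimal("0"))
print(total)  # 1868.31
```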



## Browsing automation with Selenium

Selenium is a portable framework for testing web applications. It is also used to automate browsers and to scrape pages whose content is generated by JavaScript. BeautifulSoup only parses the HTML it is given and does not execute scripts, so it never sees that content, whereas Selenium drives a real browser and simulates a user's actions.
To try Selenium you need to:

* Install the selenium library  
`pip3 install selenium`
* Install ChromeDriver from this page: https://chromedriver.chromium.org/downloads
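
To make the limitation described above concrete, here is a minimal sketch: the page below (hand-written for illustration) fills a `div` from a script. BeautifulSoup parses the markup but never runs the script, so the selector finds nothing.

```python
from bs4 import BeautifulSoup

# Illustrative page: the quote only exists after the browser runs the script
PAGE = '''<html><body>
<div id="quotes"></div>
<script>
  document.getElementById("quotes").innerHTML =
      '<span class="text">Rendered by JavaScript</span>';
</script>
</body></html>'''

soup = BeautifulSoup(PAGE, "html.parser")

# The script body is kept as plain text, not parsed into tags
print(soup.select("span.text"))                     # []
print(repr(soup.select_one("#quotes").get_text()))  # ''
```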

Example of scraping with Selenium, equivalent to the BeautifulSoup quotes example above:

In [38]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 API: the driver path goes through Service and elements are located
# with find_element(s)(By.CSS_SELECTOR, ...). The executable_path argument and the
# find_element(s)_by_css_selector helpers from Selenium 3 were removed in later 4.x releases.
chrome = webdriver.Chrome(service=Service("/home/milhomem/Downloads/chromedriver_linux64/chromedriver"))
chrome.get("http://quotes.toscrape.com")
#soup = BeautifulSoup(page.content, "html.parser")

# Define the locators
quote_page_locator = "div.quote"
author_locator = "small.author"
quote_locator = "span.text"
tag_locator = "div a.tag"

print("******************************\n")

# Get data from each block of html
for content_html in chrome.find_elements(By.CSS_SELECTOR, quote_page_locator):
    #print(content_html)

    author = content_html.find_element(By.CSS_SELECTOR, author_locator)
    quote = content_html.find_element(By.CSS_SELECTOR, quote_locator)
    tags = [tag.text for tag in content_html.find_elements(By.CSS_SELECTOR, tag_locator)]

    print("Author: ", author.text)
    print("Quote Text: ", quote.text)
    print("Tags: ", tags, "\n")
    #print("-----------\n")

print("******************************\n")

******************************

Author:  Albert Einstein
Quote Text:  “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Tags:  ['change', 'deep-thoughts', 'thinking', 'world'] 

Author:  J.K. Rowling
Quote Text:  “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Tags:  ['abilities', 'choices'] 

Author:  Albert Einstein
Quote Text:  “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Tags:  ['inspirational', 'life', 'live', 'miracle', 'miracles'] 

Author:  Jane Austen
Quote Text:  “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Tags:  ['aliteracy', 'books', 'classic', 'humor'] 

Author:  Marilyn Monroe
Quote Text:  “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Tags:  ['be-yourself', 'i

Example of manipulating dropdown options on a JavaScript-driven page:

In [52]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

# Selenium 4 API: driver path via Service, lookups via find_element(By.CSS_SELECTOR, ...)
chrome = webdriver.Chrome(service=Service("/home/milhomem/Downloads/chromedriver_linux64/chromedriver"))
chrome.get("http://quotes.toscrape.com/search.aspx")
#soup = BeautifulSoup(page.content, "html.parser")

# Define the locators
quote_page_locator = "div.quote"

author_locator = "span.author"
quote_locator = "span.text"
tag_locator = "span.tag"

author_dropdown = "select#author"
tag_dropdown = "select#tag"
search_button = 'input[name="submit_button"]'


print("******************************\n")

# Choose an author in the author dropdown
element_author = chrome.find_element(By.CSS_SELECTOR, author_dropdown)
Select(element_author).select_by_visible_text("Jane Austen")

# List every option available in the tag dropdown
tags = [option.text.strip() for option in Select(chrome.find_element(By.CSS_SELECTOR, tag_dropdown)).options]
print(tags)

# Choose a tag in the tag dropdown
element_tag = chrome.find_element(By.CSS_SELECTOR, tag_dropdown)
Select(element_tag).select_by_visible_text("love")

# Click the search button to show the quote
chrome.find_element(By.CSS_SELECTOR, search_button).click()


print("******************************\n")

******************************

['----------', 'aliteracy', 'books', 'classic', 'humor', 'friendship', 'love', 'romantic', 'women', 'library', 'reading', 'elizabeth-bennet', 'jane-austen']
******************************



### <a href=https://www.linkedin.com/in/jmilhomem/>br.linkedin.com/in/jmilhomem</a> ###