# Web Scraping

## Data Access Schemes
* Bulk downloads
* API access
* Web Scraping



### API (Application Programming Interface)
* set of rules, protocols, and tools that allows different software applications to communicate with each other
* defines the methods and data formats that developers can use to interact with a software component, service, or system
<br>
* Interoperability: lets different software systems exchange data and perform actions
* Abstraction: hides the underlying implementation details of a system behind a simplified interface
* Reuse and Modularity: promotes code reuse and modularity
* Standardization: following industry standards and conventions ensures consistency and comparability
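
To make the "data formats" point concrete: many web APIs return JSON, which maps directly onto Python data structures. The sketch below uses only the standard library, and the payload is invented for illustration; it is not the response of any real service.

```python
import json

# Hypothetical JSON body, as an API endpoint might return it (invented for illustration).
payload = '{"city": "Vienna", "stations": [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": 19.0}]}'

data = json.loads(payload)  # JSON object -> dict, JSON array -> list

# Structured access by key: no HTML parsing is required.
temps = [station["temp_c"] for station in data["stations"]]
mean_temp = sum(temps) / len(temps)
print(data["city"], mean_temp)
```

When an API like this exists, it is usually a more stable source than scraping the rendered page, because the field names are part of a documented interface.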

## HTTP (Hyper Text Transfer Protocol)

### Basics of Protocol
* A request-response protocol for communications between clients and servers
* used for communication between web browsers and web servers
* Request
    * `GET` requests are mostly used for querying data from a server
    * `POST` and `PUT` requests are mostly used for sending data to a server
    * other request methods are available: `DELETE`, `OPTIONS`, etc.
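
The request methods can be inspected without contacting any server. The sketch below uses the standard library's `urllib.request.Request` (rather than `requests`, which appears later in these notes) only to build request objects; the URL and the body are placeholders.

```python
from urllib.request import Request

url = "http://example.com/items"  # placeholder URL; nothing is sent

# Without a body, urllib defaults to GET (querying data).
get_req = Request(url)

# Supplying a body switches the default method to POST (sending data).
post_req = Request(url, data=b"name=widget")

# Any other method can be named explicitly.
put_req = Request(url, data=b"name=widget", method="PUT")
delete_req = Request(url, method="DELETE")

for req in (get_req, post_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```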


## HTML (Hypertext Markup Language)
* markup language used to create and structure web pages
* consists of a series of elements (tags) that define the structure and content of a web page
* HTML documents are interpreted by web browsers, which render the content and display it to users
* HTML *Element* is an individual component of an HTML document or web page


```HTML
<!DOCTYPE html> 
<html>
    <head>
        <title>
            My beautiful web page!
        </title>
    </head> 
    <body>
        Here is my content.
    </body>
</html>
```

<img src="./08.1 images/html_elements.png" alt="HTML elements" style="width:700px;"/>


### Building blocks of an HTML element

* Tags begin with `<` and end with `>`
* Usually occur in pairs:
    ```HTML
    <p> Wow! </p>
    ```
* Elements can be nested: 
    ```HTML
    <p>This is a <em>really</em> interesting paragraph.</p>
    ```
* Some tags never occur in pairs; a trailing slash is customary but not required
    ```HTML
    <img src="photo.jpg" />
    ```
* Tags may have one or more attributes
    ```HTML
    <a href="contact.html">Contact us</a>
    ```
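
The building blocks above (opening tags, closing tags, nesting, attributes) can be observed one by one with a small event logger. This is a minimal sketch built on the standard library's `html.parser.HTMLParser`; the `TagLogger` class is a hypothetical teaching aid, not a scraping tool.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each tag event in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the opening tag
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags
        if data.strip():
            self.events.append(("text", data.strip()))


parser = TagLogger()
parser.feed('<p>This is a <em>really</em> interesting paragraph.</p>')
parser.feed('<a href="contact.html">Contact us</a>')
parser.close()

for event in parser.events:
    print(event)
```

The `<em>` start and end events fall between the start and end of `<p>`, which is what nesting means in practice; the `href` attribute arrives as a name/value pair on the opening `<a>` tag.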

### Noteworthy tags

* `<h1>`, ... , `<h6>` - heading,
* `<p>` - paragraph of text,
* `<a>` - link, typically displayed as underlined, blue text
    ```HTML
    <a href="http://www.example.com/">An example link</a>
    ```
* `<span>` - arbitrary span of text, typically within a larger containing element like `<p>`
* `<div>` - arbitrary division within the document used for grouping and containing related elements
* Lists
  * `<ul>` - unordered lists (e.g., bulleted lists)
  * `<ol>` - ordered lists (often numbered) 
  * `<li>` - list items within `<ul>` and `<ol>`
      ```HTML
      <ul>
          <li> Item 1 </li>
          <li> Item 2 </li>
      </ul>
      ```

## Attributes

* HTML elements can be assigned attributes by including property/value pairs in the opening tag
    ```HTML
    <tagname property="value"></tagname>
    ```
* E.g., a link can be given an href attribute, whose value specifies the URL for that link
    ```HTML
    <a href="http://d3js.org/">The D3 website</a>
    ```

## Classes and Ids

The `class` and `id` attributes make it possible to identify elements in an HTML document. This is useful for applying styles and for manipulating elements.

* The `id` attribute uniquely identifies a single element in an HTML document.
    * Once a name is used as the value of one element's `id` attribute, it **cannot be** used as the `id` of any other element in the same document.
    ```HTML
    <div id="content">This is content</div>
    <div id="button">This is button</div>
    ```
<br> 

* The `class` attribute identifies a group of elements in an HTML document:
    * An element belongs to a class when the value of its `class` attribute contains the class name.
    * Classes are usually used to apply styles.
    * An element can be assigned to multiple classes, and multiple elements on the same page can share the same class.

    ```HTML
    <p class="awesome">Awe-inspiring paragraph</p>
    <p class="uplifting awesome">
        Awe-inspiring uplifting paragraph
    </p>
    <p class="awesome">Awe-inspiring paragraph</p>
    ```
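
Ids and classes are also the usual handles for locating elements when scraping. The sketch below uses Beautiful Soup, the parsing library covered later in these notes, and assumes the `bs4` package is installed; the markup repeats the examples above.

```python
from bs4 import BeautifulSoup

html = """
<div id="content">This is content</div>
<p class="awesome">Awe-inspiring paragraph</p>
<p class="uplifting awesome">Awe-inspiring uplifting paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

# An id is unique, so find() returns the single matching element (or None).
content = soup.find(id="content")
print(content.get_text())

# A class can be shared; class_ is used because "class" is a Python keyword.
awesome = soup.find_all(class_="awesome")
print(len(awesome))

# A multi-valued class attribute comes back as a list of class names.
print(awesome[1]["class"])
```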

<img src="./08.1 images/DOM.png" alt="DOM" style="height:400px;"/>

<img src="./08.1 images/DOM_element.png" alt="DOM element"/>

# BeautifulSoup

* Beautiful Soup (`bs4`) is a Python library for parsing HTML and XML that can work with different underlying parsers.

* **Pros**
    * Shallow learning curve.
    * Copes well with messy, malformed documents.

<br> 

* **Cons**
    * If the parser chosen by default does not suit the document, the document may be parsed incorrectly without warning, which can lead to badly wrong results.
    * Projects built on bs4 can be hard to extend.
    * It is relatively slow; speeding it up means parallelising the work yourself, e.g. with `multiprocessing`.
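
One way to reduce the parser risk mentioned above is to always name the parser explicitly. The sketch below assumes `bs4` is installed and uses `html.parser`, which ships with Python; other parsers such as `lxml` or `html5lib` may build a different tree from the same invalid input.

```python
from bs4 import BeautifulSoup

messy = "<a></p>"  # stray closing tag: invalid HTML

# Naming the parser keeps the result the same on every machine,
# regardless of which optional parsers happen to be installed.
soup = BeautifulSoup(messy, "html.parser")
print(soup)
```

With `html.parser` the stray `</p>` is dropped and the `<a>` tag is closed, and no `<html>` or `<body>` wrapper is added.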


## Loading and examining HTML page

Import 
* `requests` for loading web pages
* `math` for math operations
* `bs4` for loading *BeautifulSoup* for working with HTML

```python
import requests
import bs4
import math
```

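```python
url = 'http://www.crummy.com/software/BeautifulSoup'
response = requests.get(url)  # keep the response object so the status stays accessible
source = response.text        # page body as a string
print(source[:500])
print(len(source))
# print(response.status_code)
```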

```
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link rev="made" href="mailto:leonardr@segfault.org">
<link rel="stylesheet" type="text/css" href="/nb/themes/Default/nb.css">
<meta name="Description" content="Beautiful Soup: a library designed for screen-scraping HTML and XML
10563
```
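
The request above assumes the server answers promptly and successfully. A slightly more defensive variant might look like the sketch below; `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper, not part of the requests library.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx statuses into exceptions
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

A call such as `source = fetch_html(url)` then yields either the HTML text or `None`, so later parsing code can check for failure before continuing.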


```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">There were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a 
<a href="./well">well</a>.</p>

<p class="story">...</p>
The <a>end</a>
</body>
<html/>
"""

type(html_doc)

# BeautifulSoup object,
# which represents the document as a nested data structure:
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
```

```HTML
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   There were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a
   <a href="./well">
    well
   </a>
   .
  </p>
  <p class="story">
   ...
  </p>
  The
  <a>
   end
  </a>
 </body>
 <html>
 </html>
</html>
```
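
Because the parsed document is a nested data structure, individual elements can be reached by walking the tree. The sketch below assumes `bs4` is installed and uses a shortened copy of the document so that it runs on its own.

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)       # text inside <title>
print(soup.title.parent.name)  # name of the tag that contains <title>
print(soup.p["class"])         # attributes are read like dictionary keys
print(soup.find(id="link3")["href"])
```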



```python
# Extract all the URLs found in the page's <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
```

```
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
./well
None
```
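
The output contains a relative path (`./well`) and a `None` for the `<a>` tag that has no `href`. A possible cleanup, sketched below, skips tags without an `href` and resolves relative paths with `urllib.parse.urljoin`. It assumes `bs4` is installed, and `base_url` is an assumed page address chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://example.com/stories/"  # assumed address of the page being parsed
html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="./well">well</a>
<a>end</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute;
# urljoin resolves relative paths against the page address.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```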


```python
# extracting all the text from a page:
print(soup.get_text())
```



```
The Dormouse's story

The Dormouse's story
There were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a 
well.
...
The end
```
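
By default `get_text()` concatenates the text pieces exactly as they appear, which can glue neighbouring elements together. The sketch below, assuming `bs4` is installed, shows the `separator` and `strip` arguments on a small made-up snippet.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p><p>Bye</p>", "html.parser")

# Default: pieces are concatenated as-is, so "world" and "Bye" run together.
raw = soup.get_text()

# strip=True trims each piece; separator is placed between the pieces.
clean = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```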



