# Web scraping with requests and beautifulsoup

- https://docs.python-requests.org/en/latest/
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Install modules with:
```
pip install requests
pip install beautifulsoup4
```

## Import requests module
```Python
import requests
```

## Make a basic request with requests module
```Python
res = requests.get('https://google.com')
```

The `get` method returns a response object. Before doing anything else, it is better to check whether the request went through:
```Python
if not res.ok:
    print("Error: Webpage can't be downloaded")

# or:

if res.status_code < 400:
    # Do something with res; the webpage content is available in res.text
    print(res.text[:200])
```

## requests functions

- `delete(url, args)`: Sends a DELETE request to the specified url.
- `get(url, args)`: Sends a GET request to the specified url.
- `head(url, args)`: Sends a HEAD request to the specified url.
- `patch(url, args)`: Sends a PATCH request to the specified url.
- `post(url, args)`: Sends a POST request to the specified url.
- `options(url, args)`: Sends an OPTIONS request to the specified url.
- `put(url, args)`: Sends a PUT request to the specified url.
- `request(method, url, args)`: Sends a request of the specified [method](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods) to the specified url.

The most commonly used request type is **get**.

### Common args for functions:
- `auth`: Auth tuple to enable Basic/Digest/Custom HTTP Auth.
- `cookies`: Dict or CookieJar object to send with the Request.
- `data`: Dictionary, list of tuples, bytes, or file-like object to send in the body of the Request.
- `files`: Dictionary of 'name': file-like-objects (or {'name': file-tuple}) for multipart encoding upload.
- `headers`: Dictionary of HTTP Headers to send with the Request.
- `json`: A JSON serializable Python object to send in the body of the Request.
- `params`: Dictionary, list of tuples or bytes to send in the query string for the Request. Most often used with `get`, but accepted by the other functions too.
- `timeout`: How many seconds to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.
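
As a rough illustration of how `params` and `headers` end up in the request, the sketch below builds a request with `requests.Request` and prepares it without sending anything, so it runs offline. The URL, the query values and the `User-Agent` string are placeholders chosen for this example.

```python
import requests

# Build (but do not send) a GET request so the effect of the arguments is visible.
# The URL, query values and User-Agent string are placeholders.
req = requests.Request(
    'GET',
    'https://example.com/search',
    params={'q': 'python', 'page': 2},
    headers={'User-Agent': 'my-scraper/0.1'},
)
prepared = req.prepare()

print(prepared.url)                    # https://example.com/search?q=python&page=2
print(prepared.headers['User-Agent'])  # my-scraper/0.1

# Sending the same request for real would look like this (needs network access):
# res = requests.get('https://example.com/search',
#                    params={'q': 'python', 'page': 2},
#                    headers={'User-Agent': 'my-scraper/0.1'},
#                    timeout=(3.05, 10))  # (connect timeout, read timeout)
```

Note that `timeout` is not part of the prepared request; it is only passed when the request is actually sent.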

## Other requests module attributes

- `codes`: The `codes` object defines a mapping from common names for HTTP statuses to their numerical codes in dict form (requests.structures.LookupDict)
- `requests.utils.get_encodings_from_content(content)`: Returns encodings from given content string.
- `requests.utils.get_encoding_from_headers(headers)`: Returns the encoding declared in the given HTTP header dict.
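
A small sketch of these helpers in use. The header dictionary is made up for the example; in practice it would usually be `res.headers`.

```python
import requests

# Status codes can be looked up by name instead of hard-coding numbers
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404

# Example headers (made up for this sketch); normally this would be res.headers
headers = {'content-type': 'text/html; charset=utf-8'}
encoding = requests.utils.get_encoding_from_headers(headers)
print(encoding)  # utf-8
```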

## Response object

- `apparent_encoding`: The apparent encoding of the response (str)
- `close()`: Releases the connection back to the pool.
- `connection`: Connection object (requests.adapters.HTTPAdapter)
- `content`: Content of the response in bytes (raw content) (bytes)
- `cookies`: Cookies object (requests.cookies.RequestsCookieJar)
- `elapsed`: Timedelta object with the time elapsed from sending the request to the arrival of the response (datetime.timedelta)
- `encoding`: Encoding used to decode the content when accessing `response.text` (str)
- `headers`: Response headers dict object (requests.structures.CaseInsensitiveDict)
- `history`: Response history list (list)
- `is_permanent_redirect`: (bool)
- `is_redirect`: (bool)
- `iter_content()`: Iterates over the response data.
- `iter_lines()`: Iterates over the response data, one line at a time.
- `json()`: Returns the json-encoded content of a response, if any.
- `links`: Response header links (dict)
- `next`: PreparedRequest object for the next request in a redirect chain, if there is one (requests.models.PreparedRequest)
- `ok`: True if status_code is less than 400, otherwise False (bool)
- `raise_for_status()`: Raises `HTTPError`, if one occurred.
- `raw`: Raw response object (urllib3.response.HTTPResponse)
- `reason`: Text corresponding to the status code (str)
- `request`: Request object that requested this response (requests.models.PreparedRequest)
- `status_code`: Number that indicates the [HTTP status](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) of the response (int)
- `text`: The content of the response as Unicode text, decoded using `encoding` (str)
- `url`: The URL of the response (str)
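
The following sketch shows how `ok`, `status_code` and `raise_for_status()` relate to each other. To keep it runnable without network access, it fills in a `Response` object by hand; this is only a stand-in for what `requests.get()` would return, and the URL is a placeholder.

```python
import requests

# A hand-built Response stands in for a real one so the example runs offline.
res = requests.Response()
res.status_code = 404
res.reason = 'Not Found'
res.url = 'https://example.com/missing'  # placeholder URL

print(res.ok)  # False, because status_code is 400 or above

try:
    # Raises HTTPError for 4xx and 5xx status codes, does nothing otherwise
    res.raise_for_status()
except requests.exceptions.HTTPError as err:
    error_message = str(err)
    print(error_message)
```

Using `raise_for_status()` inside a `try` block is an alternative to checking `res.ok` by hand, and it gives an error message that already includes the status code and URL.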


## Import the BeautifulSoup module

How to import Beautiful Soup:
```Python
from bs4 import BeautifulSoup
```

## Starting the BeautifulSoup parser

After that, create a soup object to work with (`html_doc` is a string containing the HTML source, for example `res.text`):
```Python
# Start the parser (basic parser: 'html.parser')
soup = BeautifulSoup(html_doc, 'html.parser')
```
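
A self-contained sketch of the same step, using a small inline HTML string instead of a downloaded page. The HTML content is invented for the example.

```python
from bs4 import BeautifulSoup

# Inline HTML invented for this example; with requests it would be res.text
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body><h1>Hello</h1><p>First paragraph.</p></body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.title.string)  # Sample page
print(soup.h1.string)     # Hello
```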

## BeautifulSoup object types

There are four basic object types:
- `BeautifulSoup`: The basic element, which contains the whole parsed document.
- `Tag`: An HTML element/tag node object. It contains an HTML element including its descendants.
- `NavigableString`: A text node element. Every piece of text is contained in one object of this type (one object for every piece of text), including possible new lines between HTML elements.
- `Comment`: A comment node element.

There are more types, but these are the most common ones.
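
To see these types side by side, the sketch below parses a tiny fragment (invented for the example) and checks the type of each node.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Tiny fragment invented for this example
soup = BeautifulSoup('<p>Hello <b>world</b><!-- a note --></p>', 'html.parser')

p = soup.p
text_node = p.contents[0]  # the text 'Hello '
bold_tag = p.contents[1]   # the <b> element
comment = p.contents[2]    # the HTML comment

print(type(soup).__name__)       # BeautifulSoup
print(type(p).__name__)          # Tag
print(type(text_node).__name__)  # NavigableString
print(type(comment).__name__)    # Comment
```

`Comment` is a subclass of `NavigableString`, so an `isinstance(node, NavigableString)` check is also true for comments.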

## Locate elements

Find the first element with a certain tag name using `soup.tag_name`:
```Python
p = soup.p      # First paragraph
h1 = soup.h1    # First header h1
img = soup.img  # First image
li = soup.ul.li # First list item in the first unordered list
```

`p`, `h1`, `img` and `li` are Tag objects. More searches can be made on those objects if necessary.
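
One detail worth knowing: when no element with that tag name exists, the attribute access returns `None`, so chained lookups can fail. The sketch below, on an invented fragment, shows one way to guard against that.

```python
from bs4 import BeautifulSoup

# Fragment invented for this example: it has a list but no table
soup = BeautifulSoup('<ul><li>One</li><li>Two</li></ul>', 'html.parser')

print(soup.ul.li.text)  # One
print(soup.table)       # None, there is no <table> in the document

# soup.table.td would raise AttributeError, so check for None first
table = soup.table
first_cell = table.td if table is not None else None
print(first_cell)       # None
```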

## Tag object methods

### Altering/Modifying

- `append()`: Appends the given PageElement to the contents of this one.
- `clear()`: Wipe out all children of this PageElement by calling extract() on them.
- `decompose()`: Recursively destroys this PageElement and its children.
- `extend()`: Appends the given PageElements to this one's contents.
- `extract()`: Destructively rips this element out of the tree.
- `insert()`: Insert a new PageElement in the list of this PageElement's children.
- `insert_after()`: Makes the given element(s) the immediate successor of this one.
- `insert_before()`: Makes the given element(s) the immediate predecessor of this one.
- `replace_with()`: Replace this PageElement with one or more PageElements, keeping the rest of the tree the same.
- `replace_with_children()`: Replace this PageElement with its contents.
- `smooth()`: Smooth out this element's children by consolidating consecutive strings.
- `unwrap()`: Replace this PageElement with its contents.
- `wrap()`: Wrap this PageElement inside another one.
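
A short sketch of a few of these methods working together. The HTML fragment and the link target are invented for the example; `new_tag()` is the `BeautifulSoup` method for creating a fresh `Tag`.

```python
from bs4 import BeautifulSoup

# Fragment invented for this example
soup = BeautifulSoup(
    '<div><p>Keep <b>this</b></p><script>x()</script></div>',
    'html.parser',
)

# Remove the <script> element and everything inside it
soup.script.decompose()

# Replace the <b> element with its own contents (drops the tag, keeps the text)
soup.b.unwrap()

# Create a new <a> element and append it to the paragraph
link = soup.new_tag('a', href='https://example.com')
link.string = 'link'
soup.p.append(link)

print(soup)
```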

### Finding/Locating

- `find()`: Look in the children of this PageElement and find the first PageElement that matches the given criteria.
- `find_all()`: Look in the children of this PageElement and find all PageElements that match the given criteria.
- `find_all_next()`: Find all PageElements that match the given criteria and appear later in the document than this PageElement.
- `find_all_previous()`: Look backwards in the document from this PageElement and find all PageElements that match the given criteria.
- `find_next()`: Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.
- `find_next_sibling()`: Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.
- `find_next_siblings()`: Find all siblings of this PageElement that match the given criteria and appear later in the document.
- `find_parent()`: Find the closest parent of this PageElement that matches the given criteria.
- `find_parents()`: Find all parents of this PageElement that match the given criteria.
- `find_previous()`: Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.
- `find_previous_sibling()`: Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.
- `find_previous_siblings()`: Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.
- `select()`: Perform a CSS selection operation on the current element and return all matching elements.
- `select_one()`: Perform a CSS selection operation on the current element and return only the first match.
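
The sketch below tries a few of these methods starting from one element rather than from the whole document. The list markup is invented for the example.

```python
from bs4 import BeautifulSoup

# List markup invented for this example
soup = BeautifulSoup(
    '<ul id="menu"><li class="first">One</li><li>Two</li><li>Three</li></ul>',
    'html.parser',
)

first = soup.find('li')

# Siblings that come later in the document
next_item = first.find_next_sibling('li')
later_items = [li.text for li in first.find_next_siblings('li')]

# Closest enclosing <ul>
parent_list = first.find_parent('ul')

# CSS selector returning only the first match
selected = soup.select_one('ul#menu > li.first')

print(next_item.text)     # Two
print(later_items)        # ['Two', 'Three']
print(parent_list['id'])  # menu
print(selected.text)      # One
```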

### Others

- `decode()`: Render a Unicode representation of this PageElement and its contents.
- `decode_contents()`: Renders the contents of this tag as a Unicode string.
- `encode()`: Render a bytestring representation of this PageElement and its contents.
- `encode_contents()`: Renders the contents of this PageElement as a bytestring.
- `get()`: Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.
- `get_attribute_list()`: The same as get(), but always returns a list.
- `get_text()`: Get all child strings of this PageElement, concatenated using the given separator.
- `has_attr()`: Checks whether this PageElement has an attribute with the given name.
- `index()`: Find the index of a child by identity, not value.
- `prettify()`: Pretty-print this PageElement as a string.
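
A brief sketch of the attribute and text helpers on a single link element. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Markup invented for this example
soup = BeautifulSoup(
    '<a href="/home" class="nav main">Home <i>page</i></a>',
    'html.parser',
)
a = soup.a

print(a.get('href'))                   # /home
print(a.get('title', 'n/a'))           # n/a (attribute missing, default returned)
print(a.has_attr('class'))             # True
print(a['class'])                      # ['nav', 'main'] (class is multi-valued)
print(a.get_attribute_list('href'))    # ['/home']
print(a.get_text(' ', strip=True))     # Home page
```

`get()` is handy when an attribute may be absent, since `a['title']` would raise `KeyError` here.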


## Tag object attributes

- `attrs`: Dictionary with Tag's HTML attributes (dict)
- `children`: Tag's direct children as an iterator (list_iterator)
- `contents`: Tag's direct children as a list (list)
- `decomposed`: Has node been decomposed? (bool)
- `descendants`: Tag's descendants as an iterator (generator)
- `is_empty_element`: Is this a self-closing element, like img? (bool)
- `name`: Tag name (str)
- `namespace`: XML namespace (str)
- `next_element`: Next node (text, tag, comment, ...) after this tag's start tag; it may be inside the current tag (bs4 object)
- `next_elements`: Generator for next elements (generator)
- `next_sibling`: Next sibling node after this tag (skips this tag's content) (bs4 object)
- `next_siblings`: Generator for next siblings (generator)
- `parent`: Tag's direct parent tag/element (bs4.element.Tag)
- `parents`: Tag's parents (ascendants) as generator (generator)
- `previous_element`: Previous node (text, tag, comment, ...) before this tag (bs4 object)
- `previous_elements`: Generator for previous elements/tags (generator)
- `previous_sibling`: Previous sibling node before this tag (bs4 object)
- `previous_siblings`: Generator for previous siblings (generator)
- `sourceline`: Position row in source code (int)
- `sourcepos`: Position column in source code (int)
- `string`: Tag's text node as NavigableString (can also be this tag's only child's string) (bs4.element.NavigableString)
- `strings`: Generator for all text contained in the tag (including descendants) (generator)
- `stripped_strings`: Same as previous but strings are stripped (generator)
- `text`: Tag's contained text (includes text from descendants) (str)
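
The navigation attributes are easier to follow on a concrete tree. The sketch below uses an invented fragment written without whitespace between elements, so that no extra new-line text nodes appear among the siblings.

```python
from bs4 import BeautifulSoup

# Fragment invented for this example (no whitespace between the elements)
soup = BeautifulSoup(
    '<div id="box"><h1>Title</h1><p>First</p><p>Second</p></div>',
    'html.parser',
)
h1 = soup.h1

print(h1.parent['id'])        # box
print(h1.next_sibling.name)   # p (the sibling after <h1>, skipping its content)
print(repr(h1.next_element))  # 'Title' (the text node inside <h1>)

child_names = [child.name for child in soup.div.children]
all_text = list(soup.div.stripped_strings)

print(child_names)  # ['h1', 'p', 'p']
print(all_text)     # ['Title', 'First', 'Second']
```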

Tag's HTML attributes can be accessed as in a dictionary:
```Python
tag['id']       # Tag's ID attribute
tag['class']    # Tag's class attribute
tag['name']     # Tag's name attribute
tag['style']    # Tag's style attribute

# See also tag.attrs
```

The two main methods for locating elements/tags are:
- `find`: Finds the first element/tag that matches the search criteria. Returns a single object, or `None` if nothing matches.
- `find_all`: Finds all the elements/tags that match the search criteria. Returns a list.

The rest of the `find_*` methods work similarly.

You can use different search criteria to filter based on a tag’s name, on its attributes, on the text of a string, or on some combination of these.
- A **string** will find matching tag names. The `name` parameter can also be used.
```Python
soup.find('b')          # Find first 'b' tag
soup.find_all('b')      # Find all 'b' tags (returns a list)
soup.find_all(name='b') # Find all 'b' tags (returns a list)
```
- A **regular expression** (using the `re` module) that matches against tag names
```Python
import re

soup.find_all(re.compile('^b'))     # Find tags like 'b' and 'body' (start with b)
```
- A **list** will find tag names included in the list
```Python
soup.find_all(['img', 'a'])     # Find img and a elements
```
- A **True** value will find all tags
- A custom **function** can be passed to find tags based on the user's own criteria. The function receives a tag as its parameter and should return True if the tag matches and False otherwise.
- An **HTML attribute** in the form "key=value". Multiple attributes can be defined. As `class` is a Python reserved keyword and can't be used, `class_` should be used instead. An element's **text** can also be searched, using the `string` argument. A regular expression can also be used as the value.
```Python
soup.find_all(id='main', class_='important') # Find all elements with given attributes
soup.find_all(string='Download')             # Find all text nodes that exactly match 'Download'
soup.find_all(href=re.compile('php'))        # Find elements whose href contains 'php'
```
- Multiple criteria can be combined together:
```Python
soup.find('a', id='main')   # Find the first 'a' element with the given id
```
- Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments. A search can be done on those using the `attrs` parameter:
```Python
soup.find(attrs={"data-foo": "value"})  # Find element with data-foo attribute equal to value
```
- Also, the `name` HTML attribute can't be searched directly, as the `name` parameter is reserved for searching tag names. The `attrs` parameter should be used instead:
```Python
soup.find(attrs={"name": "email"})
```
- A parameter's value can be a string, a regular expression (using `re`), a boolean value or a function:
    - A string means an exact match against the value. In the case of an attribute with multiple possible values (like class), the search will try to match every value separately.
    - A regular expression is searched for within the value. As before, in the case of attributes with multiple possible values (like class), the search will try to match every value separately.
    - A Boolean states whether the element should have that attribute or not.
    - A custom function receives the attribute value as its parameter and returns True or False, depending on whether it matches or not.
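
The criteria without an example above (`True`, a tag function, boolean and function attribute values) can be sketched as follows. The HTML fragment and the helper name `has_class_but_no_id` are invented for this example. The attribute function guards against `None`, since it may be called for tags that lack the attribute.

```python
import re
from bs4 import BeautifulSoup

# Fragment invented for this example
html_doc = (
    '<div id="main" class="box wide">'
    '<p class="intro">Hi</p>'
    '<a href="/index.php">Home</a>'
    '<a href="/about.html">About</a>'
    '</div>'
)
soup = BeautifulSoup(html_doc, 'html.parser')

# True matches every tag in the document
all_names = [tag.name for tag in soup.find_all(True)]

# A custom function that receives a tag and returns True or False
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

no_id = soup.find_all(has_class_but_no_id)

# Boolean attribute value: any tag that has an id attribute
with_id = soup.find_all(id=True)

# String against a multi-valued attribute: each class value is tried separately
wide = soup.find_all(class_='wide')

# Function attribute value: receives the attribute value (None if missing)
php_links = soup.find_all(
    href=lambda value: value is not None and value.endswith('.php')
)

# Tag name combined with a regular expression on an attribute
html_links = soup.find_all('a', href=re.compile(r'\.html$'))

print(all_names)                        # ['div', 'p', 'a', 'a']
print([tag.name for tag in no_id])      # ['p']
print([tag.name for tag in with_id])    # ['div']
print([a['href'] for a in php_links])   # ['/index.php']
print([a['href'] for a in html_links])  # ['/about.html']
```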

Aside from the find_* functions, the `select` function can also be used to search based on CSS selectors:
```Python
soup.select('ul>li.main')   # Select a list of elements based on a selector
```
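
A few more selector forms, as a sketch on an invented menu fragment. `select` always returns a list, while `select_one` returns the first match or `None`.

```python
from bs4 import BeautifulSoup

# Menu fragment invented for this example
soup = BeautifulSoup(
    '<ul id="menu">'
    '<li class="main"><a href="https://example.com">Ext</a></li>'
    '<li><a href="/local">Local</a></li>'
    '</ul>',
    'html.parser',
)

main_items = soup.select('ul > li.main')           # child combinator plus class
external = soup.select('a[href^="https"]')         # attribute starts with 'https'
menu_links = soup.select('#menu a')                # descendants of the element with id 'menu'
missing = soup.select_one('li.missing')            # no match

print(len(main_items))                  # 1
print([a.text for a in external])       # ['Ext']
print([a.text for a in menu_links])     # ['Ext', 'Local']
print(missing)                          # None
```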

Calling a `BeautifulSoup` or `Tag` object as if it were a function is the same as calling `find_all`:
```Python
soup('a')   # Find all a elements
```

### Exercise:

From https://realpython.github.io/fake-jobs/, find all the job offers located in the state 'AE' and print each offer's title, location and apply link.

How many offers match these criteria?

```Python
from bs4 import BeautifulSoup
import requests
import sys


# Download the webpage with the requests module. Set a timeout just in case
res = requests.get('https://realpython.github.io/fake-jobs/', timeout=30)
if not res.ok:
    print("Error: Can't download webpage")
    sys.exit(1)

# Start the parser (basic parser: 'html.parser')
soup = BeautifulSoup(res.text, 'html.parser')

# Find the job offer cards: the elements that have 'card' as a class
jobs = soup.find_all(class_='card')

# Count how many job offers match
job_count = 0

# Traverse all job offers
for job in jobs:
    # For every job:
    # Get the job location. It is the HTML element with class 'location'.
    # Get the text of the element and strip it.
    location = job.find(class_='location').text.strip()

    # Get the state from the location (the part after the comma)
    state = location.split(',')[1].strip()

    # If the state is 'AE', list job title, location and apply link
    if state == 'AE':
        # Title is the H2 element with class 'title'. Get the text of the element.
        title = job.find('h2', class_='title').text.strip()

        # The apply link is the link with 'Apply' as text
        # Find the text node and navigate to the parent (a element) to get the href attribute
        # (the 'string' argument replaces the older, deprecated 'text' argument)
        link = job.find(string='Apply').parent['href']

        # Print the desired fields of the job offer
        print(f'Title: {title}\nLocation: {location}\nApply here: {link}\n\n')

        # Increment job count
        job_count += 1

# Print the total number of job offers
print(f'Job count: {job_count}')
```