# WEB SCRAPING WITH PYTHON

Second Edition by Ryan Mitchell (O’Reilly). Copyright 2018 Ryan Mitchell, 978-1-491-98557-1.

Code examples on [GitHub](https://github.com/REMitchell/python-scraping).

_ChatGPT:_  
The following modules are commonly used for web scraping tasks in Python:

- **Beautiful Soup**: A powerful parsing library for extracting data from HTML and XML files.
- **Scrapy**: An open-source web crawling and scraping framework for Python.
- **Requests**: A simple and easy-to-use library for making HTTP requests and handling responses.
- **Selenium**: A web browser automation tool that can be used for web scraping by simulating user interactions.
- **LXML**: A fast and feature-rich XML and HTML processing library.
- **PyQuery**: A library that provides jQuery-like syntax for parsing and manipulating HTML and XML documents.
- **Pandas**: A data manipulation library that can be useful for organizing and cleaning data retrieved from web scraping.

# <b>Preface</b>

## What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as 
- screen scraping, 
- data mining, 
- web harvesting,

or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.

In theory, 

> **web scraping** is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). 

This is most commonly accomplished by writing an automated program that 
- queries a web server, 
- requests data (usually in the form of HTML and other files that compose web pages), and then 
- parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as 
- data analysis, 
- natural language parsing, and 
- information security. 

Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and dip into the more specialized topics of the second part as needed.

With few exceptions, if you can view data in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

Web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of “data gathering.”

# <b>PART I. BUILDING SCRAPERS</b>

This part covers the basic mechanics of web scraping:

- how to use Python to request information from a web server, 
- how to perform basic handling of the server’s response, and 
- how to begin interacting with a website in an automated fashion.

In all likelihood, 90% of web scraping projects you’ll encounter will draw on techniques used in just the next six chapters. This section covers what the general (albeit technically savvy) public tends to think of when they think of “web scrapers”:
- Retrieving HTML data from a domain name,
- Parsing that data for target information,
- Storing the target information,
- Optionally, moving to another page to repeat the process.

# <b>1. Your First Web Scraper</b>

# 1.1 Connecting

In fact, browsers are a relatively recent invention in the history of the internet, considering Nexus was released in 1990.

Yes, the web browser is a useful application for creating these packets of information, telling your operating system to send them off, and interpreting the data you get back as pretty pictures, sounds, videos, and text. However, a web browser is just code, and code can be taken apart, broken into its basic components, rewritten, reused, and made to do anything you want. A web browser can tell the processor to send data to the application that handles your wireless (or wired) interface, but you can do the same thing in Python with just three lines of code:

In [2]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


This code outputs the complete HTML for `page1`, located at the URL `http://pythonscraping.com/pages/page1.html`. More accurately, it outputs the HTML file `page1.html`, found in the directory `<web root>/pages`, on the server located at the domain name `pythonscraping.com`.

Why is it important to start thinking of these addresses as “files” rather than “pages”? Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting is linked to. When a web browser hits a tag such as `<img src="cuteKitten.jpg">`, the browser knows that it needs to make another request to the server to get the data at the file `cuteKitten.jpg` in order to fully render the page for the user.

Of course, your Python script doesn’t have the logic to go back and request multiple files (yet); it can only read the single HTML file that you’ve directly requested.

`urllib` is a standard Python library and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. 

`urlopen` is used to open a remote object across a network and read it. Because it is a fairly generic function (it can read HTML files, image files, or any other file stream with ease), we will be using it quite frequently throughout the book.
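
As a small sketch of the header and user-agent handling mentioned above, the standard library's `urllib.request.Request` can carry custom headers. The user-agent string and the helper names `build_request` and `fetch` are assumptions made up for illustration; nothing is sent over the network until `fetch` is called.

```python
from urllib.request import Request, urlopen

# Hypothetical user-agent string, chosen only for illustration.
USER_AGENT = 'my-first-scraper/0.1'


def build_request(url):
    """Create a request object that carries a custom User-Agent header."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch(url):
    """Send the request and return the raw response body as bytes."""
    with urlopen(build_request(url)) as response:
        return response.read()


req = build_request('http://pythonscraping.com/pages/page1.html')
print(req.full_url)
# urllib stores header names in capitalized form ('User-agent').
print(req.get_header('User-agent'))
```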

# 1.2 An Introduction to BeautifulSoup

Like its Wonderland namesake, `BeautifulSoup` tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

```sh
(venv) user@host:~$ pip install beautifulsoup4
```

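Activating a venv with `!` in Jupyter does not persist: each `!` line runs in its own subshell, so the second command below falls back to the default `python`: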

In [8]:
! source ~/venv/venv3.12/bin/activate && echo $(python -V)
! python -V

Python 3.12.1
Python 3.11.2


In [33]:
import bs4

help(bs4)

Help on package bs4:

NAME
    bs4 - Beautiful Soup Elixir and Tonic - "The Screen-Scraper's Friend".

DESCRIPTION
    http://www.crummy.com/software/BeautifulSoup/

    Beautiful Soup uses a pluggable XML or HTML parser to parse a
    (possibly invalid) document into a tree representation. Beautiful Soup
    provides methods and Pythonic idioms that make it easy to navigate,
    search, and modify the parse tree.

    Beautiful Soup works with Python 3.6 and up. It works better if lxml
    and/or html5lib is installed.

    For more than you ever wanted to know about Beautiful Soup, see the
    documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

PACKAGE CONTENTS
    builder (package)
    css
    dammit
    diagnose
    element
    formatter
    tests (package)

CLASSES
    bs4.element.Tag(bs4.element.PageElement)
        BeautifulSoup

    class BeautifulSoup(bs4.element.Tag)
     |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding

## Running BeautifulSoup

### `html.parser`

The most commonly used object in the BeautifulSoup library is, appropriately, the `BeautifulSoup` object. Let’s take a look at it in action, modifying the example from the beginning of this chapter:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


Note that this returns only the first instance of the `h1` tag found on the page. By convention, only one `h1` tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve the first instance of the tag only, and not necessarily the one that you’re looking for.
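
To see the "first instance only" behavior without touching the network, here is a minimal sketch on an inline HTML string (the markup is made up for illustration). It contrasts attribute access with `find_all`, which returns every match:

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page (an assumption for illustration).
html = '<html><body><h1>First</h1><p>text</p><h1>Second</h1></body></html>'
bs = BeautifulSoup(html, 'html.parser')

first_h1 = bs.h1              # attribute access: first match only
all_h1 = bs.find_all('h1')    # every match, in document order

print(first_h1)
print(len(all_h1))
```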

As in previous web scraping examples, you are importing the `urlopen` function and calling `html.read()` in order to get the HTML content of the page. In addition to the text string, BeautifulSoup can also use the file object directly returned by `urlopen`, without needing to call `.read()` first:

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


This HTML content is then transformed into a BeautifulSoup object, with the following structure:
```html
html → <html><head>...</head><body>...</body></html>
    head → <head><title>A Useful Page</title></head>
        title → <title>A Useful Page</title>
    body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        h1 → <h1>An Interesting Title</h1>
        div → <div>Lorem Ipsum dolor...</div>
```

Note that the `h1` tag that you extract from the page is nested two layers deep into your BeautifulSoup object structure (`html → body → h1`). However, when you actually fetch it from the object, you call the `h1` tag directly:

```python
bs.h1
```

In fact, any of the following function calls would produce the same output:

```python
bs.html.body.h1
bs.body.h1
bs.html.h1
```
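
A quick way to check this claim is to compare the results on a small inline copy of the page structure (the HTML string here is a shortened stand-in, not the downloaded page):

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>A Useful Page</title></head>'
        '<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>')
bs = BeautifulSoup(html, 'html.parser')

# Four different paths to the same tag.
paths = [bs.h1, bs.html.body.h1, bs.body.h1, bs.html.h1]

# Every path ends at the very same Tag object in the parse tree.
print(all(tag is paths[0] for tag in paths))
```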

When you create a BeautifulSoup object, two arguments are passed in:

```python
bs = BeautifulSoup(html.read(), 'html.parser')
```

The first is the HTML text the object is based on, and the second specifies the parser that you want BeautifulSoup to use in order to create that object. In the majority of cases, it makes no difference which parser you choose.

- `html.parser` is a parser that is included with Python 3 and requires no extra installations in order to use. Except where required, we will use this parser throughout the book.

### `lxml`

Another popular parser is `lxml`. This can be installed through pip:

```sh
$ pip3 install lxml
```

`lxml` can be used with BeautifulSoup by changing the parser string provided:

```python
bs = BeautifulSoup(html.read(), 'lxml')
```

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'lxml')
print(bs.h1)

<h1>An Interesting Title</h1>


> `lxml` has some advantages over `html.parser` in that it is generally better at parsing “messy” or malformed HTML code. 

It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than `html.parser`, although speed is not necessarily an advantage in web scraping, given that the speed of the network itself will almost always be your largest bottleneck.

One of the disadvantages of `lxml` is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to `html.parser`.

### `html5lib`

Another popular HTML parser is `html5lib`. 

Like `lxml`, `html5lib` is an extremely forgiving parser that takes even more initiative correcting broken HTML. It also requires an external dependency, and is slower than both `lxml` and `html.parser`. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.

It can be used by installing it and passing the string `'html5lib'` to the BeautifulSoup object:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html5lib')
print(bs.h1)

<h1>An Interesting Title</h1>


Virtually any information can be extracted from any HTML (or XML) file, as long as it has an identifying tag surrounding it or near it. Chapter 2 delves more deeply into more-complex BeautifulSoup function calls, and presents regular expressions and how they can be used with BeautifulSoup in order to extract information from websites.
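
To compare how the three parsers repair broken markup, the following sketch feeds the same malformed string to each one. The markup is made up for illustration, and the exact repaired output depends on the parser and its version, so none is shown here. A parser that is not installed is skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately broken markup: unclosed <b> and <p>, no <html>/<body> wrapper.
broken_html = '<p>Unclosed paragraph <b>bold text'

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        results[parser] = None

for parser, repaired in results.items():
    print(f'{parser:12} {repaired}')
```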

### `html` or `html.read()`?

ChatGPT:

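When you use `html = urlopen(url)`, it returns a file-like response object that holds the HTML content. You can pass this object directly to the `BeautifulSoup` constructor, which calls `.read()` on it for you, so `BeautifulSoup(html, ...)` and `BeautifulSoup(html.read(), ...)` give the same result. Either way, the response can be read only once.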
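
A minimal offline sketch of the same point, with `io.BytesIO` standing in for the response object (an assumption for illustration: like a network response, it has a `.read()` method and is consumed once it has been read):

```python
import io
from bs4 import BeautifulSoup

raw = b'<html><body><h1>An Interesting Title</h1></body></html>'

# File-like stand-in for the object returned by urlopen().
response_like = io.BytesIO(raw)

from_file_object = BeautifulSoup(response_like, 'html.parser')
from_bytes = BeautifulSoup(raw, 'html.parser')

print(from_file_object.h1 == from_bytes.h1)
# The stream was already consumed by BeautifulSoup, so nothing is left.
print(response_like.read())
```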

# 1.3 Connecting Reliably and Handling Exceptions

## Server error Exceptions

Let’s take a look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw:

```python
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
```

Two main things can go wrong in this line:
- The page is not found on the server (or there was an error in retrieving it):
    - an `HTTPError` is raised (“404 Page Not Found,” “500 Internal Server Error,” and so forth);
- The server is not found:
    - `urlopen` raises a `URLError`. No server could be reached at all and, because the remote server is responsible for returning HTTP status codes, an `HTTPError` cannot be raised, so the more serious `URLError` must be caught.

Author's solutions:

In [15]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/page/page1.html')
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass

HTTP Error 404: Not Found


In [9]:
# my simple solution
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    bad_url = 'http://www.pythonscraping.com/page/page1.html'
    try:
        html = urlopen(bad_url)
    except Exception as e:
        print(e)
        return 1

    bs = BeautifulSoup(html.read(), 'html5lib')
    print(bs.h1)

    return 0


main()

HTTP Error 404: Not Found


1

In [16]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

The server could not be found!
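
The order of the `except` clauses matters here, because `HTTPError` is a subclass of `URLError`. The sketch below wraps both cases in one helper; the name `fetch` and the printed messages are assumptions for illustration, not the book's code.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print(f'HTTP error {e.code} for {url}')
    except URLError as e:
        # No usable answer at all: unknown host, refused connection, etc.
        print(f'Could not reach the server: {e.reason}')
    return None


# HTTPError is a subclass of URLError, so it has to be caught first.
print(issubclass(HTTPError, URLError))
```

If the two clauses were swapped, the `URLError` branch would also swallow every 404 and 500.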


## Tag errors - `None`

Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a `None` object. The problem is, attempting to access a tag on a `None` object itself will result in an `AttributeError` being thrown:

```python
AttributeError: 'NoneType' object has no attribute 'someTag'
```

The easiest way is to explicitly check for both situations:

```python
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
```
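
Both situations can be reproduced on an inline HTML string. The helper `get_nested` is a hypothetical convenience function (not part of BeautifulSoup) that follows a chain of tag names and stops at the first missing one:

```python
from bs4 import BeautifulSoup

bs = BeautifulSoup('<html><body><h1>Title</h1></body></html>', 'html.parser')

missing = bs.body.h2    # the tag does not exist, so this is None
print(missing is None)

try:
    bs.body.h2.p        # attribute access on None
except AttributeError as e:
    print(f'Caught: {e}')


def get_nested(tag, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    for name in names:
        if tag is None:
            return None
        tag = tag.find(name)
    return tag


print(get_nested(bs, 'body', 'h2', 'p'))
print(get_nested(bs, 'body', 'h1'))
```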

This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). This code, for example, is our same scraper written in a slightly different way:

## Examples

1. Working url:

In [10]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        print(e)
        return None
    return title


main()

<h1>An Interesting Title</h1>


2. Bad url:

In [13]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    bad_url = 'http://www.pythonscraping.com/page/page1.html'

    title = get_title(bad_url)    # modified
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        print(e)
        return None
    return title


main()

HTTP Error 404: Not Found
Title could not be found


3. Access a child of a missing tag (`None`):

In [14]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h2.p    # modified: there is no h2 tag and we try to get h2.p
    except AttributeError as e:
        print(e)
        return None
    return title


main()

'NoneType' object has no attribute 'p'
Title could not be found


4. Or, even simpler, catch any exception:

In [15]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except Exception as e:    # modified
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h2.p
    except Exception as e:    # modified
        print(e)
        return None
    return title


main()

'NoneType' object has no attribute 'p'
Title could not be found


When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions and make it readable at the same time. 

You’ll also likely want to heavily reuse code. Having generic functions such as `get_site_html` and `get_title` (complete with thorough exception handling) makes it easy to quickly — and reliably — scrape the web.
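
One possible split into two such functions is sketched below. This is an assumption about how they could look, not the book's code: here `get_title` takes already-downloaded HTML rather than a URL, so it can be tried on inline bytes without a network connection.

```python
from urllib.error import URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_site_html(url):
    """Download a page and return its bytes, or None on any URL/HTTP error."""
    try:
        with urlopen(url) as response:
            return response.read()
    except URLError as e:   # also covers HTTPError, which subclasses URLError
        print(e)
        return None


def get_title(html):
    """Return the first <h1> inside <body>, or None if it cannot be found."""
    if html is None:
        return None
    bs = BeautifulSoup(html, 'html.parser')
    if bs.body is None:
        return None
    return bs.body.h1


sample = b'<html><body><h1>An Interesting Title</h1></body></html>'
print(get_title(sample))
print(get_title(b'<p>no body, no title</p>'))
print(get_title(None))
```

With a network connection, the two compose as `get_title(get_site_html(url))`.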

# <b>2. Advanced HTML Parsing</b>

In this section, we’ll discuss 
- searching for tags by attributes, 
- working with lists of tags, and 
- navigating parse trees.

Keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!

- Look for a “Print This Page” link, or perhaps a mobile version of the site that has better-formatted HTML (more on presenting yourself as a mobile device — and receiving mobile site versions — in **Chapter 14**).
- Look for the information hidden in a JavaScript file. Remember, you might need to examine the imported JavaScript files in order to do this. For example, I once collected street addresses (along with latitude and longitude) off a website in a neatly formatted array by looking at the JavaScript for the embedded Google Map that displayed a pinpoint over each address.
- This is more common for page titles, but the information might be available in the URL of the page itself.
- If the information you are looking for is unique to this website for some reason, you’re out of luck. If not, try to think of other sources you could get this information from. Is there another website with the same data? Is this website displaying data that it scraped or aggregated from another website?

Especially when faced with buried or poorly formatted data, it’s important not to just start digging and write yourself into a hole that you might not be able to get out of. 

> Take a deep breath and think of alternatives.

If you’re certain no alternatives exist, the rest of this chapter explains standard and creative ways of selecting tags based on their position, context, attributes, and contents. The techniques presented here, when used correctly, will go a long way toward writing more stable and reliable web crawlers.

# 2.1 CSS

CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. Some tags might look like this:

```html
<span class="green"></span>
```

Others look like this:

```html
<span class="red"></span>
```

Web scrapers can easily separate these two tags based on their class; for example, they might use BeautifulSoup to grab all the red text but none of the green text. Because CSS relies on these identifying attributes to style sites appropriately, you are almost guaranteed that these class and ID attributes will be plentiful on most modern websites.

Let’s create an example web scraper that scrapes the page located at http://www.pythonscraping.com/pages/warandpeace.html.

On this page, the lines spoken by characters in the story are in red, whereas the names of characters are in green. You can see the span tags, which reference the appropriate CSS classes, in the following sample of the page’s source code:

```html
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
```

You can grab the entire page and create a BeautifulSoup object with it by using a program similar to the one used in **Chapter 1**. Using this BeautifulSoup object, you can use the `find_all` function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags (`find_all` is an extremely flexible function you’ll be using a lot later in this book):

In [23]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')

name_list = bs.find_all('span', {'class': 'green'})
for name in name_list:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


When run, it should list all the proper nouns in the text, in the order they appear in **"War and Peace"**. So what’s going on here? 

Previously, you’ve called `bs.tagName` to get the first occurrence of that tag on the page. Now, you’re calling `bs.find_all(tagName, tagAttributes)` to get a list of all of the tags on the page, rather than just the first.

After getting a list of names, the program iterates through all names in the list, and prints `name.get_text()` in order to separate the content from the tags.
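
The same selection can be tried offline on the short sample of the page's source quoted earlier. This is a sketch on that fragment only, so the lists are much shorter than for the full page:

```python
from bs4 import BeautifulSoup

# The sample markup from the page, used in place of the full download.
html = '''
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
'''
bs = BeautifulSoup(html, 'html.parser')

names = [span.get_text() for span in bs.find_all('span', {'class': 'green'})]
quotes = [span.get_text() for span in bs.find_all('span', {'class': 'red'})]

print(names)
print(quotes)
```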

## When To `get_text()` And When To Preserve Tags

`.get_text()` strips all tags from the document you are working with and returns a Unicode string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away, and you’ll be left with a tagless block of text.
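
A minimal sketch of what is lost when the tags are stripped (the paragraph is made up for illustration): the link target is easy to reach while the tag structure is intact, but it cannot be recovered from the flattened text.

```python
from bs4 import BeautifulSoup

html = ('<p>Read the <a href="/pages/page1.html">first page</a> '
        'before <b>anything</b> else.</p>')
bs = BeautifulSoup(html, 'html.parser')

# While the tags are still there, the link target is easy to reach.
link = bs.p.a['href']

# get_text() flattens the paragraph; the href is gone from the result.
text = bs.p.get_text()

print(link)
print(text)
```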

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling `.get_text()` should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, 

> you should try to preserve the tag structure of a document as long as possible.

## `find()` and `find_all()`

In [26]:
help(bs.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find all
    PageElements that match the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet



In [27]:
help(bs.find)

Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, string=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find the first
    PageElement that matches the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A PageElement.
    :rtype: bs4.element.Tag | bs4.element.NavigableString



```python
find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
find(name=None, attrs={}, recursive=True, string=None, **kwargs)
```

- The `name` argument is one that you’ve seen before; you can pass a string name of a tag or even a Python list of string tag names. For example, the following returns a list of all the header tags in a document:

```python
.find_all(['h1','h2','h3','h4','h5','h6'])
```

- The `attrs` argument takes a Python dictionary of attributes and matches tags that have those attributes. When a single attribute is given a collection of values, a tag matches if it has any one of them. For example, the following function would return both the green and red span tags in the HTML document:

```python
.find_all('span', {'class':{'green', 'red'}})
```
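
Both "or" filters can be checked on a small inline document (made up for illustration). This sketch passes the class values as a list, which is one of the forms BeautifulSoup accepts for attribute values:

```python
from bs4 import BeautifulSoup

html = '''
<h1>Title</h1>
<h2>Subtitle</h2>
<span class="green">Anna</span>
<span class="red">Well, Prince</span>
<span class="blue">unrelated</span>
'''
bs = BeautifulSoup(html, 'html.parser')

# A list of tag names matches any of the names.
headers = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])

# A list of values for one attribute matches any of the values.
coloured = bs.find_all('span', {'class': ['green', 'red']})

print([tag.name for tag in headers])
print([tag.get_text() for tag in coloured])
```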

- The `recursive` argument is a boolean. How deeply into the document do you want to go? 
    - If recursive is set to `True`, the `find_all` function looks into children, and children’s children, for tags that match your parameters. 
    - If it is `False`, it will look only at the top-level tags in your document.
<br>By default, `find_all` works recursively (`recursive` is set to `True`); it’s generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.<br>
<br>
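
The effect of `recursive` is easiest to see on a nested fragment. In this sketch (inline HTML made up for illustration), the same search is run from the outer `div` with and without recursion:

```python
from bs4 import BeautifulSoup

html = ('<div id="outer"><p>direct child</p>'
        '<section><p>grandchild</p></section></div>')
bs = BeautifulSoup(html, 'html.parser')
outer = bs.find('div', {'id': 'outer'})

deep = outer.find_all('p')                      # recursive=True (default)
shallow = outer.find_all('p', recursive=False)  # direct children only

print(len(deep), len(shallow))
```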

### `string` argument

- The `string` argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, if you want to find the number of times “the prince” is surrounded by tags on the example page, you could replace your `.find_all()` function in the previous example with the following lines:

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    name_list = bs.find_all(string='the prince')
    print(len(name_list))


main()

7
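
Note that a plain string passed as `string` has to match the whole text node exactly, including case. The sketch below, on a made-up fragment, contrasts that with a compiled regular expression, which matches anywhere in the text:

```python
import re
from bs4 import BeautifulSoup

html = '''
<span class="green">the prince</span> spoke to
<span class="green">The prince</span> and then to
<span class="green">Prince Vasili</span>.
'''
bs = BeautifulSoup(html, 'html.parser')

# Exact, case-sensitive match on the whole text node.
exact = bs.find_all(string='the prince')

# Regular expression: matches anywhere in the text, ignoring case.
pattern = bs.find_all(string=re.compile('prince', re.IGNORECASE))

print(len(exact))
print(len(pattern))
```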


- The `limit` argument, of course, is used only in the `find_all` method; `find` is equivalent to the same `find_all` call, with a limit of `1`. You might set this if you’re interested only in retrieving the first `x` items from the page. Be aware, however, that this gives you the first items on the page in the order that they occur, not necessarily the first ones that you want.

- The `**kwargs` argument allows you to select tags that contain a particular attribute or set of attributes. For example:

```python
title = bs.find_all(id='title', class_='text')
```

This returns a list of every tag with the word “text” in its `class` attribute and “title” as its `id`. Note that, by convention, each value for an `id` should be used only once on the page. Therefore, in practice, a line like this may not be particularly useful: the list should hold at most one tag, which the following call returns directly:

```python
title = bs.find(id='title')
```
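
Returning to the `limit` argument, here is a short sketch (inline HTML made up for illustration) of how `find` relates to `find_all`: `find` gives the tag itself, while `find_all` with `limit=1` gives a one-element list holding that same tag.

```python
from bs4 import BeautifulSoup

html = '<ul><li>one</li><li>two</li><li>three</li><li>four</li></ul>'
bs = BeautifulSoup(html, 'html.parser')

first_two = bs.find_all('li', limit=2)
only_first = bs.find_all('li', limit=1)

print([li.get_text() for li in first_two])
print(only_first[0] is bs.find('li'))
```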

_ChatGPT:_  
The `id` and `class_` attributes are commonly used in HTML to uniquely identify an element (`id`) or to categorize elements (`class`). These attributes come from the HTML code of the page being parsed.

The `id` attribute should be unique for each element on a page, and it's typically used to identify a specific element. The `class` attribute can be used to apply similar styling or behavior to multiple elements.

In practice, using `find_all` with both `id` and `class_` attributes may not be particularly useful since the `id` should be unique. 

Recall that passing a list of tags to `.find_all()` via the attributes list acts as an “or” filter (it selects a list of all tags that have `tag1, tag2, or tag3...`). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. The `keyword argument` allows you to add an additional “and” filter to this.
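
A small sketch of the “and” behavior, using made-up markup: filtering by class alone returns two tags, while adding the `id` keyword narrows the result to the one tag that satisfies both filters.

```python
from bs4 import BeautifulSoup

html = '''
<div id="title" class="text">Main title</div>
<div class="text">Body text</div>
<div id="footer" class="note">Footer</div>
'''
bs = BeautifulSoup(html, 'html.parser')

by_class = bs.find_all(class_='text')             # two matches
by_both = bs.find_all(id='title', class_='text')  # must satisfy both filters

print(len(by_class), len(by_both))
print(by_both[0].get_text())
```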

## Keyword arguments and “class”

The keyword arguments can be helpful in some situations. However, they are technically redundant as a BeautifulSoup feature. Keep in mind that anything that can be done with keyword arguments can also be accomplished using techniques covered later in this chapter (regular expressions and lambda expressions).

For instance, the following two lines are identical:

```python
bs.find_all(id='text')
bs.find_all('', {'id':'text'})
```

_VR:_  
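Not true (at least with the `bs4` version used here)! An empty string for `name` matches nothing; you have to put the name of the tag (or pass the dictionary as `attrs=` and leave `name` out):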

In [12]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    test2 = bs.find('', {'class': 'red'})

    print(test2)


main()

None


And this thing works:

In [9]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    test1 = bs.find_all(class_='red')
    test2 = bs.find_all('span', {'class': 'red'})

    for i, j in zip(test1, test2):
        if i == j:
            print("Identical")
        else:
            print("Different")
        break


main()

Identical


In addition, you might occasionally run into problems using keyword arguments, most notably when searching for elements by their `class` attribute, because `class` is a reserved word in Python. That is, it cannot be used as a variable or argument name (this has nothing to do with the `BeautifulSoup.find_all()` keyword arguments discussed previously). For example, if you try the following call, you’ll get a syntax error due to the nonstandard use of `class`:

```python
bs.find_all(class='green')
```

Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
```python
bs.find_all(class_='green')
```

Alternatively, you can enclose `class` in quotes (_no, you cannot! You would have to provide `name='tagname'`, which is not needed in the first case._ - VR):
```python
bs.find_all('', {'class':'green'})
```
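One form that avoids the empty-string question altogether is to leave the tag name out and pass the dictionary through the `attrs` parameter of `find_all`. A small sketch on a made-up HTML snippet (not the book's page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = ('<p><span class="green">Anna</span> said '
        '<span class="red">"Well, Prince"</span> to '
        '<b class="green">Pierre</b></p>')
bs = BeautifulSoup(html, 'html.parser')

# Keyword form with the trailing underscore
by_keyword = bs.find_all(class_='green')

# Dictionary form, no tag name required
by_attrs = bs.find_all(attrs={'class': 'green'})

print(by_keyword == by_attrs)
print([tag.name for tag in by_attrs])
```

Both calls match tags of any name, so the result contains the `span` as well as the `b`.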

# 2.2 Other `BeautifulSoup` Objects

So far in the book, you’ve seen two types of objects in the BeautifulSoup library:
- **BeautifulSoup objects** - `bs`, and
- **Tag objects** - retrieved in lists or individually by calling `find_all` and `find` on a BeautifulSoup object, or by drilling down, for example `bs.div.h1`.

However, there are two more objects in the library that, although less commonly used, are still important to know about:
- **NavigableString objects** ("NUH-vi-guh-buhl-string") - used to represent text within tags, rather than the tags themselves (some functions operate on and produce `NavigableStrings`, rather than tag objects).
- **Comment object** - used to find HTML comments in comment tags, `<!--like this one-->`.

These four objects are the only objects you will ever encounter in the BeautifulSoup library (at the time of this writing).

See details [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects).
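The four object types can be seen side by side in a short sketch. The HTML string here is invented for the example:

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: one tag with text, followed by an HTML comment
html = '<div><h1>Title</h1><!--a hidden note--></div>'
bs = BeautifulSoup(html, 'html.parser')

h1 = bs.div.h1                 # Tag
text = h1.string               # NavigableString (the text inside the tag)
comment = bs.div.contents[-1]  # Comment (last node inside the div)

for node in (bs, h1, text, comment):
    print(type(node).__name__)

# Comment is a special kind of NavigableString
print(isinstance(comment, NavigableString))
```

Checking `isinstance(node, Tag)` is a common way to skip text and comment nodes when looping over a tag's contents.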

# 2.3 Navigating Trees

The `find_all` function is responsible for finding tags based on their name and attributes. But 

> what if you need to find a tag based on its location in a document? 

That’s where tree navigation comes in handy. In **Chapter 1**, you looked at navigating a BeautifulSoup tree in a single direction:

```python
bs.tag.subTag.anotherSubTag
```

Now let’s look at navigating up, across, and diagonally through HTML trees. You’ll use our highly questionable online shopping site at http://www.pythonscraping.com/pages/page3.html.

The HTML for this page, mapped out as a tree (with some tags omitted for brevity), looks like this:

- HTML
    - body
        - div.wrapper
            - h1
            - div.content
        - table#giftList
            - tr
                - th
                - th
                - th
                - th
            - tr.gift#gift1
                - td
                - td
                    - span.excitingNote
                - td
                - td
                    - img
            - ...table rows continue...
        - div.footer
        
You will use this same HTML structure as an example in the next few sections.

## Children and other descendants

In the BeautifulSoup library, as well as many other libraries, there is a distinction drawn between **children** and **descendants**: 
- much like in a human family tree, children are always exactly one tag below a parent, whereas 
- descendants can be at any level in the tree below a parent. 

For example, the `tr` tags are children of the `table` tag, whereas `tr`, `th`, `td`, `img`, and `span` are all descendants of the `table` tag (at least in our example page). All children are descendants, but not all descendants are children.

In general, BeautifulSoup functions always deal with the descendants of the current tag selected. For instance, `bs.body.h1` selects the first `h1` tag that is a descendant of the `body` tag. It will not find tags located outside the `body`.

Similarly, `bs.div.find_all('img')` will find the first `div` tag in the document, and then retrieve a list of all `img` tags that are descendants of that `div` tag.

If you want to find only descendants that are children, you can use the `.children` attribute:

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'html.parser')

    count = 0
    for child in bs.find('table', {'id': 'giftList'}).children:
        count += 1
        print(child)

    print(count)


main()



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


This code prints the list of product rows in the `giftList` table, including the initial row of column labels. If you were to iterate over `.descendants` instead of `.children`, about two dozen tags would be found within the table and printed, including `img` tags, `span` tags, and individual `td` tags. Note that both attributes also yield the text nodes (`NavigableString` objects) between and inside the tags, which is why the count printed below is much higher than the number of tags alone.

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'html.parser')

    count = 0
    for child in bs.find('table', {'id': 'giftList'}).descendants:
        count += 1
        # print(child)

    print(count)


main()

86
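To see where such a count comes from, here is a sketch on a tiny stand-in table (made-up markup without whitespace between tags, not the real page) that counts children, all descendants, and descendants that are tags:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table standing in for giftList
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Basket</td><td>$15.00</td></tr>'
        '</table>')
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

children = list(table.children)        # direct children only: the two tr tags
descendants = list(table.descendants)  # every nested node, text included
tag_descendants = [node for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), len(tag_descendants))
```

The four text nodes (`Item`, `Cost`, `Basket`, `$15.00`) account for the gap between all descendants and tag descendants. On a real page, the whitespace between tags adds further text nodes.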


## Siblings

### `next_siblings`

The BeautifulSoup `next_siblings` property makes it trivial to collect data from tables, especially ones with title rows:

In [34]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'lxml')

    for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
        print(sibling)


main()



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

This code prints all product rows from the table except the first title row.

> Why does the title row get skipped? 

Objects cannot be siblings with themselves. Anytime you get siblings of an object, the object itself will not be included in the list. As the name implies, `next_siblings` returns next siblings only. If you were to select a row in the middle of the list, for example, and call `next_siblings` on it, only the subsequent siblings would be returned. So, by selecting the title row and calling `next_siblings`, you can select all the rows in the table without selecting the title row itself.

> MAKE SELECTIONS SPECIFIC
>
> The preceding code will work just as well if you select `bs.table.tr` or even just `bs.tr` in order to select the first row of the table. However, in the code, I go through all of the trouble of writing everything out in a longer form:
>
> `bs.find('table',{'id':'giftList'}).tr`
>
> Even if it looks like there’s just one table (or other target tag) on the page, it’s easy to miss things. In addition, page layouts change all the time. What was once the first of its kind on the page might someday be the second or third tag of that type found on the page. To make your scrapers more robust, it’s best to be as specific as possible when making tag selections. **Take advantage of tag attributes when they are available.**

### `previous_siblings`

The `previous_siblings` property can often be helpful if there is an easily selectable tag at the end of a list of sibling tags that you would like to get.

And, of course, there are the `next_sibling` and `previous_sibling` properties, which work much like `next_siblings` and `previous_siblings`, except that they return a single node rather than all of them.
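One caveat with the singular forms: the nearest sibling is often a whitespace text node rather than a tag. The sketch below uses a made-up table, and it also uses `find_next_sibling` and `find_previous_siblings`, two BeautifulSoup methods not covered in the text above, to skip over text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with newlines between rows, as on most real pages
html = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr id="gift1"><td>Vegetable Basket</td></tr>
<tr id="gift2"><td>Russian Nesting Dolls</td></tr>
</table>"""
bs = BeautifulSoup(html, 'html.parser')
title_row = bs.find('table', {'id': 'giftList'}).tr

# The immediate next sibling is the newline between the rows, not a tr
print(repr(title_row.next_sibling))

# find_next_sibling('tr') returns the next sibling that is a tr tag
first_gift = title_row.find_next_sibling('tr')
print(first_gift['id'])

# Starting from the last row, collect the tr tags that come before it
last_gift = bs.find('tr', {'id': 'gift2'})
rows_before = last_gift.find_previous_siblings('tr')
print(len(rows_before))
```

`find_previous_siblings` returns the closest sibling first, so `rows_before[0]` is the `gift1` row and the title row comes last.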

## Parents

When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. Typically, when you look at HTML pages with the goal of crawling them, you start by looking at the top layer of tags, and then figure out how to drill your way down into the exact piece of data that you want. Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, `.parent` and `.parents`. For example:

In [37]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'lxml')

    print(bs.find(
            'img', {'src': '../img/gifts/img1.jpg'}
                ).parent.previous_sibling.get_text())


main()


$15.00



This code will print the price of the object represented by the image at the location `../img/gifts/img1.jpg` (in this case, the price is `$15.00`).

![](./data/images/Screenshot_20240122_004505.png)

1. The image tag where `src="../img/gifts/img1.jpg"` is first selected.
1. You select the parent of that tag (in this case, the `td` tag).
1. You select the `previous_sibling` of the `td` tag (in this case, the `td` tag that contains the dollar value of the product).
1. You select the text within that tag, `“$15.00”`.
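The four steps above can be written out one per line. This is a sketch on a one-row stand-in for the gift table (made-up markup with no whitespace between cells), not the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical single-row table mirroring the structure of page3.html
html = ('<table><tr class="gift"><td>Vegetable Basket</td><td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"/></td></tr></table>')
bs = BeautifulSoup(html, 'html.parser')

img = bs.find('img', {'src': '../img/gifts/img1.jpg'})  # step 1: the image tag
cell = img.parent                                       # step 2: its parent td
price_cell = cell.previous_sibling                      # step 3: the td before it
price = price_cell.get_text()                           # step 4: the text inside
print(price)

# .parents walks all the way up to the document root
print([parent.name for parent in img.parents])
```

The last entry produced by `.parents` is the `BeautifulSoup` object itself, whose name is `[document]`.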

# <b>Additional</b>

|bash|description|
|-|-|
|`scrapy startproject <name>`|start a new Scrapy project|
|`scrapy genspider <spider_name> <domain>`|generate a spider in the `spiders` dir|
|`scrapy runspider <spider_file>.py`|start the crawler|

# 3. Creating a Scrapy project

Work inside a virtual environment.

```sh
pip install --upgrade pip
pip install scrapy
```

A **spider** is a Scrapy project that, like its arachnid namesake, is designed to crawl webs.

```sh
$ scrapy startproject test1
New Scrapy project 'test1', using template directory '/home/commi/venv/venv3.12/lib/python3.12/site-packages/scrapy/templates/project', created in:
    /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/test1

You can start your first spider with:
    cd test1
    scrapy genspider example example.com
```

## Project dir

In [35]:
cd /home/commi/Yandex.Disk/it_learning/08_parsing_data/data/

In [36]:
tree test1

test1
├── scrapy.cfg
└── test1
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files


In [37]:
cat test1/scrapy.cfg

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = test1.settings

[deploy]
#url = http://localhost:6800/
project = test1


### Deeper

In [38]:
tree test1/test1

test1/test1
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
    └── __init__.py

2 directories, 6 files


In [39]:
cat test1/test1/items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class Test1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


In [40]:
cat test1/test1/middlewares.py

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class Test1SpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spide

In [41]:
cat test1/test1/pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class Test1Pipeline:
    def process_item(self, item, spider):
        return item
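As the comment at the top of the file says, the pipeline only runs once it is listed in the `ITEM_PIPELINES` setting. A minimal sketch of that entry for `settings.py`, assuming the project and class names generated above (`test1`, `Test1Pipeline`). The number 300 is an arbitrary choice: Scrapy runs pipelines in ascending order of this value, conventionally picked from the 0-1000 range.

```python
# test1/test1/settings.py (fragment)
# Maps the import path of each pipeline class to its order number.
ITEM_PIPELINES = {
    "test1.pipelines.Test1Pipeline": 300,
}
```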


In [42]:
cat test1/test1/settings.py

# Scrapy settings for test1 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = "test1"

SPIDER_MODULES = ["test1.spiders"]
NEWSPIDER_MODULE = "test1.spiders"


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = "test1 (+http://www.yourdomain.com)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay settin

### Even deeper

In [43]:
tree test1/test1/spiders

test1/test1/spiders
└── __init__.py

1 directory, 1 file


In [44]:
cat test1/test1/spiders/__init__.py

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.


# 4. Write a Simple Scraper

To create a crawler, you will add a new file inside the spiders directory at `test1/test1/spiders/bookspider.py`. The `scrapy genspider` command generates it from a template:

```sh
$ cd test1/test1/spiders/
$ scrapy genspider bookspider books.toscrape.com
```
```
Created spider 'bookspider' using template 'basic' in module:
  test1.spiders.bookspider
```

In [45]:
tree test1/test1/spiders/

test1/test1/spiders/
├── bookspider.py
├── __init__.py
└── __pycache__
    └── __init__.cpython-312.pyc

2 directories, 3 files


In [46]:
cat test1/test1/spiders/bookspider.py

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
