# WEB SCRAPING WITH PYTHON workbook

Second Edition by Ryan Mitchell (O’Reilly). Copyright 2018 Ryan Mitchell, 978-1-491-98557-1.

Code examples on [GitHub](https://github.com/REMitchell/python-scraping).

_Creating a well-structured web scraper doesn’t require a lot of arcane knowledge, but it does require taking a moment to step back and think about your project._

In [97]:
!neofetch

[?25l[?7l[37m[0m[1m       _,met$$$$$gg.
    ,g$$$$$$$$$$$$$$$P.
  ,g$$P"     """Y$$.".
 ,$$P'              `$$$.
',$$P       ,ggs.     `$$b:
`d$$'     ,$P"'   [0m[31m[1m.[37m[0m[1m    $$$
 $$P      d$'     [0m[31m[1m,[37m[0m[1m    $$P
 $$:      $$.   [0m[31m[1m-[37m[0m[1m    ,d$$'
 $$;      Y$b._   _,d$P'
 Y$$.    [0m[31m[1m`.[37m[0m[1m`"Y$$$$P"'
[37m[0m[1m `$$b      [0m[31m[1m"-.__
[37m[0m[1m  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""[0m
[17A[9999999D[30C[0m[1m[31m[1mcommi[0m@[31m[1mdebian-laptop[0m 
[30C[0m-------------------[0m 
[30C[0m[31m[1mOS[0m[0m:[0m Debian GNU/Linux 12 (bookworm) x86_64[0m 
[30C[0m[31m[1mHost[0m[0m:[0m Inspiron 7348 A15[0m 
[30C[0m[31m[1mKernel[0m[0m:[0m 6.1.0-17-amd64[0m 
[30C[0m[31m[1mUptime[0m[0m:[0m 5 hours, 55 mins[0m 
[30C[0m[31m[1mPackages[0m[0m:[0m 2732 (dpkg)[0m 
[30C[0m[31m[1mShell[0m[0m:[0m bash 5.2.15[0m 
[30C[0m

_ChatGPT:_  
These modules are commonly used for web scraping tasks in Python and are known for their flexibility, ease of use, and robust functionality.

- **Beautiful Soup**: A powerful parsing library for extracting data from HTML and XML files.
- **Scrapy**: An open-source web crawling and scraping framework for Python.
- **Requests**: A simple and easy-to-use library for making HTTP requests and handling responses.
- **Selenium**: A web browser automation tool that can be used for web scraping by simulating user interactions.
- **LXML**: A fast and feature-rich XML and HTML processing library.
- **PyQuery**: A library that provides jQuery-like syntax for parsing and manipulating HTML and XML documents.
- **Pandas**: A data manipulation library that can be useful for organizing and cleaning data retrieved from web scraping.

# <b>Preface</b>

## What Is Web Scraping?

The automated gathering of data from the internet is nearly as old as the internet itself. Although web scraping is not a new term, in years past the practice has been more commonly known as 
- screen scraping, 
- data mining, 
- web harvesting,

or similar variations. General consensus today seems to favor web scraping, so that is the term I use throughout the book, although I also refer to programs that specifically traverse multiple pages as web crawlers or refer to the web scraping programs themselves as bots.

In theory, 

> **web scraping** is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). 

This is most commonly accomplished by writing an automated program that 
- queries a web server, 
- requests data (usually in the form of HTML and other files that compose web pages), and then 
- parses that data to extract needed information.

In practice, web scraping encompasses a wide variety of programming techniques and technologies, such as 
- data analysis, 
- natural language parsing, and 
- information security. 

Because the scope of the field is so broad, this book covers the fundamentals of web scraping and crawling in Part I and delves into advanced topics in Part II. I suggest that all readers carefully study the first part and turn to the more specific topics in the second part as needed.

With few exceptions, if you can view data in your browser, you can access it via a Python script. If you can access it in a script, you can store it in a database. And if you can store it in a database, you can do virtually anything with that data.

Web scraping is a relatively disparate subject, with practices that require the use of databases, web servers, HTTP, HTML, internet security, image processing, data science, and other tools. This book attempts to cover all of these, and other topics, from the perspective of “data gathering.”

# <b>PART I. BUILDING SCRAPERS</b>

This part focuses on the basic mechanics of web scraping:

- how to use Python to request information from a web server,
- how to perform basic handling of the server’s response, and
- how to begin interacting with a website in an automated fashion.

In all likelihood, 90% of web scraping projects you’ll encounter will draw on techniques used in just the next six chapters. This section covers what the general (albeit technically savvy) public tends to think of when they think of “web scrapers”:
- Retrieving HTML data from a domain name,
- Parsing that data for target information,
- Storing the target information,
- Optionally, moving to another page to repeat the process.

# <b>1. Your First Web Scraper</b>

# 1.1 Connecting

In fact, browsers are a relatively recent invention in the history of the internet, considering Nexus was released in 1990.

Yes, the web browser is a useful application for creating these packets of information, telling your operating system to send them off, and interpreting the data you get back as pretty pictures, sounds, videos, and text. However, a web browser is just code, and code can be taken apart, broken into its basic components, rewritten, reused, and made to do anything you want. A web browser can tell the processor to send data to the application that handles your wireless (or wired) interface, but you can do the same thing in Python with just three lines of code:

In [2]:
from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'


This command outputs the complete HTML code for `page1` located at the URL `http://pythonscraping.com/pages/page1.html`. More accurately, this outputs the HTML file `page1.html`, found in the directory `<web root>/pages`, on the server located at the domain name `pythonscraping.com`.

Why is it important to start thinking of these addresses as “files” rather than “pages”? Most modern web pages have many resource files associated with them. These could be image files, JavaScript files, CSS files, or any other content that the page you are requesting is linked to. When a web browser hits a tag such as `<img src="cuteKitten.jpg">`, the browser knows that it needs to make another request to the server to get the data at the file `cuteKitten.jpg` in order to fully render the page for the user.
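To make the “files, not pages” point concrete, here is a minimal sketch that lists the extra files a browser would have to request for a small page. It uses only the standard library’s `html.parser.HTMLParser`. The class name `ResourceCollector`, the sample markup, and the short list of tags are assumptions made for illustration; they do not come from the book.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect the URLs of files that a page asks the browser to load."""

    # tag name -> attribute that holds the resource address
    # (illustrative subset; real pages reference files in more ways)
    RESOURCE_ATTRS = {'img': 'src', 'script': 'src', 'link': 'href'}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for attr_name, value in attrs:
            if attr_name == wanted and value:
                # Resolve relative addresses against the page URL
                self.resources.append(urljoin(self.base_url, value))


page_url = 'http://pythonscraping.com/pages/page1.html'
# Made-up markup; the real page1.html references no extra files
sample_html = '''
<html><head><link rel="stylesheet" href="/css/style.css"></head>
<body><img src="cuteKitten.jpg"><script src="app.js"></script></body></html>
'''

collector = ResourceCollector(page_url)
collector.feed(sample_html)
for resource in collector.resources:
    print(resource)
```

Each printed address is one more request a browser would make after fetching the HTML file itself.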

Of course, your Python script doesn’t have the logic to go back and request multiple files (yet); it can only read the single HTML file that you’ve directly requested.

`urllib` is a standard Python library and contains functions for requesting data across the web, handling cookies, and even changing metadata such as headers and your user agent. 

`urlopen` is used to open a remote object across a network and read it. Because it is a fairly generic function (it can read HTML files, image files, or any other file stream with ease), we will be using it quite frequently throughout the book.
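As a small sketch of the “changing metadata” part, `urllib.request.Request` lets you describe a request, including its headers, before anything is sent. The user-agent string below is a made-up value for illustration only.

```python
from urllib.request import Request

# Hypothetical user-agent string, chosen only for this example
request = Request(
    'http://pythonscraping.com/pages/page1.html',
    headers={'User-Agent': 'my-first-scraper/0.1'},
)

# Nothing has been sent yet; the Request object only describes the call.
# urllib stores header names capitalized, hence 'User-agent'.
print(request.full_url)
print(request.get_method())
print(request.get_header('User-agent'))

# To actually send it:
#   from urllib.request import urlopen
#   html = urlopen(request)
```

`urlopen` accepts either a plain URL string or a `Request` object like this one.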

# 1.2 An Introduction to BeautifulSoup

Like its Wonderland namesake, `BeautifulSoup` tries to make sense of the nonsensical; it helps format and organize the messy web by fixing bad HTML and presenting us with easily traversable Python objects representing XML structures.

```sh
(venv) user@host:~$ pip install beautifulsoup4
```

You cannot easily activate a venv from a Jupyter cell: each `!` line runs in its own shell, so the activation in the first line is gone by the second:

In [8]:
! source ~/venv/venv3.12/bin/activate && echo $(python -V)
! python -V

Python 3.12.1
Python 3.11.2
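Neither of the two shell lines says which interpreter the notebook kernel itself uses, and that is the one that matters for `import bs4`. A quick check from inside a cell, as a minimal sketch (the printed path and version depend on your setup):

```python
import sys

# The interpreter that actually runs this notebook's cells
print(sys.executable)
print('.'.join(str(part) for part in sys.version_info[:3]))
```

Packages have to be installed into the environment that this path belongs to.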


In [33]:
import bs4

help(bs4)

Help on package bs4:

NAME
    bs4 - Beautiful Soup Elixir and Tonic - "The Screen-Scraper's Friend".

DESCRIPTION
    http://www.crummy.com/software/BeautifulSoup/

    Beautiful Soup uses a pluggable XML or HTML parser to parse a
    (possibly invalid) document into a tree representation. Beautiful Soup
    provides methods and Pythonic idioms that make it easy to navigate,
    search, and modify the parse tree.

    Beautiful Soup works with Python 3.6 and up. It works better if lxml
    and/or html5lib is installed.

    For more than you ever wanted to know about Beautiful Soup, see the
    documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

PACKAGE CONTENTS
    builder (package)
    css
    dammit
    diagnose
    element
    formatter
    tests (package)

CLASSES
    bs4.element.Tag(bs4.element.PageElement)
        BeautifulSoup

    class BeautifulSoup(bs4.element.Tag)
     |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding

## Running BeautifulSoup

### `html.parser`

The most commonly used object in the BeautifulSoup library is, appropriately, the `BeautifulSoup` object. Let’s take a look at it in action, modifying the example found in the beginning of this chapter:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


Note that this returns only the first instance of the `h1` tag found on the page. By convention, only one `h1` tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve the first instance of the tag only, and not necessarily the one that you’re looking for.

As in the previous example, you import the `urlopen` function and call `html.read()` to get the HTML content of the page as bytes. BeautifulSoup can also use the file-like object returned by `urlopen` directly, without needing to call `.read()` first:

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html, 'html.parser')
print(bs.h1)

<h1>An Interesting Title</h1>


This HTML content is then transformed into a BeautifulSoup object, with the following structure:
```html
html → <html><head>...</head><body>...</body></html>
    head → <head><title>A Useful Page</title></head>
        title → <title>A Useful Page</title>
    body → <body><h1>An Int...</h1><div>Lorem ip...</div></body>
        h1 → <h1>An Interesting Title</h1>
        div → <div>Lorem Ipsum dolor...</div>
```

Note that the `h1` tag that you extract from the page is nested two layers deep into your BeautifulSoup object structure (`html → body → h1`). However, when you actually fetch it from the object, you call the `h1` tag directly:

```python
bs.h1
```

In fact, any of the following expressions would produce the same output:

```python
bs.html.body.h1
bs.body.h1
bs.html.h1
```
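The equivalence can be checked without a network call. The sketch below parses a shortened copy of `page1.html` (the body text is abbreviated by me) and compares the four expressions. It relies on `bs.tagName` being shorthand for `bs.find('tagName')`, which searches all descendants and not only direct children.

```python
from bs4 import BeautifulSoup

# Shortened copy of page1.html so the example runs without a network call
page = ('<html><head><title>A Useful Page</title></head>'
        '<body><h1>An Interesting Title</h1><div>Lorem ipsum</div></body></html>')
bs = BeautifulSoup(page, 'html.parser')

paths = [bs.h1, bs.html.body.h1, bs.body.h1, bs.html.h1]
for tag in paths:
    print(tag)

# All four expressions reach the very same Tag object in the tree
print(all(tag is bs.h1 for tag in paths))
```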

When you create a BeautifulSoup object, two arguments are passed in:

```python
bs = BeautifulSoup(html.read(), 'html.parser')
```

The first is the HTML text the object is based on, and the second specifies the parser that you want BeautifulSoup to use in order to create that object. In the majority of cases, it makes no difference which parser you choose.

- `html.parser` is a parser that is included with Python 3 and requires no extra installations in order to use. Except where required, we will use this parser throughout the book.

### `lxml`

Another popular parser is `lxml`. This can be installed through pip:

```sh
$ pip3 install lxml
```

`lxml` can be used with BeautifulSoup by changing the parser string provided:

```python
bs = BeautifulSoup(html.read(), 'lxml')
```

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'lxml')
print(bs.h1)

<h1>An Interesting Title</h1>


> `lxml` has some advantages over `html.parser` in that it is generally better at parsing “messy” or malformed HTML code. 

It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than `html.parser`, although speed is not necessarily an advantage in web scraping, given that the speed of the network itself will almost always be your largest bottleneck.

One of the disadvantages of `lxml` is that it has to be installed separately and depends on third-party C libraries to function. This can cause problems for portability and ease of use, compared to `html.parser`.

### `html5lib`

Another popular HTML parser is `html5lib`. 

Like `lxml`, `html5lib` is an extremely forgiving parser that takes even more initiative in correcting broken HTML. It also requires an external dependency, and is slower than both `lxml` and `html.parser`. Despite this, it may be a good choice if you are working with messy or handwritten HTML sites.

To use it, install `html5lib` and pass the string `'html5lib'` to the BeautifulSoup constructor:

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html5lib')
print(bs.h1)

<h1>An Interesting Title</h1>


Virtually any information can be extracted from any HTML (or XML) file, as long as it has an identifying tag surrounding it or near it. Chapter 2 delves more deeply into more-complex BeautifulSoup function calls, and presents regular expressions and how they can be used with BeautifulSoup in order to extract information from websites.

### `html` or `html.read()`?

ChatGPT:

When you use `html = urlopen(url)`, it returns a response object which contains the HTML content. You can directly pass this response object to the BeautifulSoup constructor to parse the HTML content.
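A minimal offline sketch of the same point: BeautifulSoup accepts a file-like object, raw bytes, or a text string. Here `io.BytesIO` stands in for the response returned by `urlopen` (an assumption made so the example needs no network); like that response, it has a `.read()` method that returns bytes.

```python
import io
from bs4 import BeautifulSoup

raw = b'<html><body><h1>An Interesting Title</h1></body></html>'

# Three ways to hand the same document to BeautifulSoup
from_file_object = BeautifulSoup(io.BytesIO(raw), 'html.parser')
from_bytes = BeautifulSoup(raw, 'html.parser')
from_text = BeautifulSoup(raw.decode('utf-8'), 'html.parser')

for soup in (from_file_object, from_bytes, from_text):
    print(soup.h1)

# A stream can be consumed only once: a second read() returns nothing
stream = io.BytesIO(raw)
first_read = stream.read()
second_read = stream.read()
print(len(first_read), len(second_read))
```

The last lines hint at one practical difference: once a stream has been read, whether by you or by BeautifulSoup, there is nothing left to read a second time, so keep the bytes in a variable if you need them again.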

# 1.3 Connecting Reliably and Handling Exceptions

## Server error Exceptions

Let’s take a look at the first line of our scraper, after the import statements, and figure out how to handle any exceptions this might throw:

```python
html = urlopen('http://www.pythonscraping.com/pages/page1.html')
```

Two main things can go wrong in this line:
- The page is not found on the server (or there was an error in retrieving it):
    - an `HTTPError` is raised (“404 Page Not Found,” “500 Internal Server Error,” and so forth);
- The server is not found:
    - `urlopen` raises a `URLError`. No server could be reached at all, and because the remote server is responsible for returning HTTP status codes, an `HTTPError` cannot be raised; the more serious `URLError` must be caught instead.

Author's solutions:

In [15]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://www.pythonscraping.com/page/page1.html')
except HTTPError as e:
    print(e)
    # return None, break, or do some other "Plan B"
else:
    # program continues. Note: If you return or break in the
    # exception catch, you do not need to use the "else" statement
    pass

HTTP Error 404: Not Found


In [9]:
# my simple solution
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    bad_url = 'http://www.pythonscraping.com/page/page1.html'
    try:
        html = urlopen(bad_url)
    except Exception as e:
        print(e)
        return 1

    bs = BeautifulSoup(html.read(), 'html5lib')
    print(bs.h1)

    return 0


main()

HTTP Error 404: Not Found


1

In [16]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen('https://pythonscrapingthisurldoesnotexist.com')
except HTTPError as e:
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    print('It Worked!')

The server could not be found!
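The order of the two `except` clauses matters, because `HTTPError` is a subclass of `URLError`. The sketch below wraps the same pattern in a function. The name `open_url` and its `opener` parameter are hypothetical, added only so the function can be tried without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_url(url, opener=urlopen):
    """Return (response, problem); exactly one of the two is None.

    `opener` is a hypothetical hook for trying the function offline;
    by default it is urlopen.
    """
    try:
        return opener(url), None
    except HTTPError as e:
        # A server answered, but with an error status such as 404 or 500
        return None, f'HTTP error {e.code}'
    except URLError as e:
        # No usable answer at all (unknown host, refused connection, ...)
        return None, f'server not reachable: {e.reason}'


# HTTPError is a subclass of URLError, so it has to be caught first;
# with the clauses swapped, the URLError branch would swallow 404s too.
print(issubclass(HTTPError, URLError))
```

Called as `open_url('http://www.pythonscraping.com/page/page1.html')`, it would be expected to return `None` plus a short description of the HTTP error instead of raising.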


## Tag errors - `None`

Every time you access a tag in a BeautifulSoup object, it’s smart to add a check to make sure the tag actually exists. If you attempt to access a tag that does not exist, BeautifulSoup will return a `None` object. The problem is, attempting to access a tag on a `None` object itself will result in an `AttributeError` being thrown:

```python
AttributeError: 'NoneType' object has no attribute 'someTag'
```

The easiest way is to explicitly check for both situations:

```python
try:
    badContent = bs.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
```
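Another way to avoid the `AttributeError` is to walk the chain one step at a time and stop at the first missing tag. The helper name `find_nested` is hypothetical and the markup is a shortened stand-in for `page1.html`; this is a sketch of the idea, not the book’s approach.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow tag_names step by step; return None if any step is missing."""
    node = root
    for tag_name in tag_names:
        if node is None:
            return None
        node = node.find(tag_name)   # same lookup as node.tag_name
    return node


bs = BeautifulSoup(
    '<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')

print(find_nested(bs, 'body', 'h1'))        # the h1 tag
print(find_nested(bs, 'body', 'h2', 'p'))   # None, and no exception
```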

This checking and handling of every error does seem laborious at first, but it’s easy to add a little reorganization to this code to make it less difficult to write (and, more important, much less difficult to read). This code, for example, is our same scraper written in a slightly different way:

## Examples

1. Working url:

In [10]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        print(e)
        return None
    return title


main()

<h1>An Interesting Title</h1>


2. Bad url:

In [13]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    bad_url = 'http://www.pythonscraping.com/page/page1.html'

    title = get_title(bad_url)    # modified
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        print(e)
        return None
    return title


main()

HTTP Error 404: Not Found
Title could not be found


3. Accessing a child of a tag that does not exist (`None`):

In [14]:
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h2.p    # modified: there is no h2 tag and we try to get h2.p
    except AttributeError as e:
        print(e)
        return None
    return title


main()

'NoneType' object has no attribute 'p'
Title could not be found


4. Or, even simpler, catching any `Exception`:

In [15]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    working_url = 'http://www.pythonscraping.com/pages/page1.html'

    title = get_title(working_url)
    if title is None:
        print('Title could not be found')
    else:
        print(title)


def get_title(url):
    try:
        html = urlopen(url)
    except Exception as e:    # modified
        print(e)
        return None
    try:
        bs = BeautifulSoup(html, 'html.parser')
        title = bs.body.h2.p
    except Exception as e:    # modified
        print(e)
        return None
    return title


main()

'NoneType' object has no attribute 'p'
Title could not be found


When writing scrapers, it’s important to think about the overall pattern of your code in order to handle exceptions and make it readable at the same time. 

You’ll also likely want to heavily reuse code. Having generic functions such as `get_site_html` and `get_title` (complete with thorough exception handling) makes it easy to quickly — and reliably — scrape the web.

# <b>2. Advanced HTML Parsing</b>

In this section, we’ll discuss 
- searching for tags by attributes, 
- working with lists of tags, and 
- navigating parse trees.

Keep in mind that layering the techniques used in this section with reckless abandon can lead to code that is difficult to debug, fragile, or both. Before getting started, let’s take a look at some of the ways you can avoid altogether the need for advanced HTML parsing!

- Look for a “Print This Page” link, or perhaps a mobile version of the site that has better-formatted HTML (more on presenting yourself as a mobile device — and receiving mobile site versions — in **Chapter 14**).
- Look for the information hidden in a JavaScript file. Remember, you might need to examine the imported JavaScript files in order to do this. For example, I once collected street addresses (along with latitude and longitude) off a website in a neatly formatted array by looking at the JavaScript for the embedded Google Map that displayed a pinpoint over each address.
- This is more common for page titles, but the information might be available in the URL of the page itself.
- If the information you are looking for is unique to this website for some reason, you’re out of luck. If not, try to think of other sources you could get this information from. Is there another website with the same data? Is this website displaying data that it scraped or aggregated from another website?

Especially when faced with buried or poorly formatted data, it’s important not to just start digging and write yourself into a hole that you might not be able to get out of. 

> Take a deep breath and think of alternatives.

If you’re certain no alternatives exist, the rest of this chapter explains standard and creative ways of selecting tags based on their position, context, attributes, and contents. The techniques presented here, when used correctly, will go a long way toward writing more stable and reliable web crawlers.

# 2.1 CSS

CSS relies on the differentiation of HTML elements that might otherwise have the exact same markup in order to style them differently. Some tags might look like this:

```html
<span class="green"></span>
```

Others look like this:

```html
<span class="red"></span>
```

Web scrapers can easily separate these two tags based on their class; for example, they might use BeautifulSoup to grab all the red text but none of the green text. Because CSS relies on these identifying attributes to style sites appropriately, you are almost guaranteed that these class and ID attributes will be plentiful on most modern websites.

Let’s create an example web scraper that scrapes the page located at http://www.pythonscraping.com/pages/warandpeace.html.

On this page, the lines spoken by characters in the story are in red, whereas the names of characters are in green. You can see the span tags, which reference the appropriate CSS classes, in the following sample of the page’s source code:

```html
<span class="red">Heavens! what a virulent attack!</span> replied
<span class="green">the prince</span>, not in the least disconcerted
by this reception.
```

You can grab the entire page and create a BeautifulSoup object with it by using a program similar to the one used in **Chapter 1**. Using this BeautifulSoup object, you can use the `find_all` function to extract a Python list of proper nouns found by selecting only the text within `<span class="green"></span>` tags (`find_all` is an extremely flexible function you’ll be using a lot later in this book):

In [23]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bs = BeautifulSoup(html, 'html.parser')

name_list = bs.find_all('span', {'class': 'green'})
for name in name_list:
    print(name.get_text())

Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron
Funke
The prince
Anna
Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna
Pavlovna
Anna Pavlovna


When run, it should list all the proper nouns in the text, in the order they appear in **"War and Peace"**. So what’s going on here? 

Previously, you’ve called `bs.tagName` to get the first occurrence of that tag on the page. Now, you’re calling `bs.find_all(tagName, tagAttributes)` to get a list of all of the tags on the page, rather than just the first.

After getting a list of names, the program iterates through all names in the list, and prints `name.get_text()` in order to separate the content from the tags.

## When To `get_text()` And When To Preserve Tags

`.get_text()` strips all tags from the document you are working with and returns a Unicode string containing the text only. For example, if you are working with a large block of text that contains many hyperlinks, paragraphs, and other tags, all those will be stripped away, and you’ll be left with a tagless block of text.

Keep in mind that it’s much easier to find what you’re looking for in a BeautifulSoup object than in a block of text. Calling `.get_text()` should always be the last thing you do, immediately before you print, store, or manipulate your final data. In general, 

> you should try to preserve the tag structure of a document as long as possible.
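A short sketch of why the order matters, using a made-up snippet: while the tags are still present you can ask structural questions such as “which links are there?”, and after `.get_text()` only the characters are left.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
snippet = ('<p>Read <a href="/pages/page1.html">page one</a> and '
           '<a href="/pages/warandpeace.html">the novel</a>.</p>')
bs = BeautifulSoup(snippet, 'html.parser')

# While the tags are still there, structure can be queried
links = [a['href'] for a in bs.find_all('a')]
print(links)

# After get_text() only the characters remain; the hrefs are gone
text = bs.p.get_text()
print(text)
```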

## `find()` and `find_all()`

In [26]:
help(bs.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find all
    PageElements that match the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet



In [27]:
help(bs.find)

Help on method find in module bs4.element:

find(name=None, attrs={}, recursive=True, string=None, **kwargs) method of bs4.BeautifulSoup instance
    Look in the children of this PageElement and find the first
    PageElement that matches the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A PageElement.
    :rtype: bs4.element.Tag | bs4.element.NavigableString



```python
find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)
find(name=None, attrs={}, recursive=True, string=None, **kwargs)
```

- The `name` argument is one that you’ve seen before; you can pass a string name of a tag or even a Python list of string tag names. For example, the following returns a list of all the header tags in a document:

```python
.find_all(['h1','h2','h3','h4','h5','h6'])
```

- The `attrs` argument takes a Python dictionary of attributes and their values. When several values are given for one attribute, a tag matches if it has any one of them. For example, the following function would return both the green and red span tags in the HTML document:

```python
.find_all('span', {'class':{'green', 'red'}})
```

- The `recursive` argument is a boolean. How deeply into the document do you want to go? 
    - If recursive is set to `True`, the `find_all` function looks into children, and children’s children, for tags that match your parameters. 
    - If it is `False`, it will look only at the top-level tags in your document.

By default, `find_all` works recursively (`recursive` is set to `True`); it’s generally a good idea to leave this as is, unless you really know what you need to do and performance is an issue.
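The three arguments described so far can be tried on a small, made-up document without any network access. This is a minimal sketch; the markup is my own and only loosely mimics the example page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
doc = ('<html><body><h1>Title</h1>'
       '<span class="green">Anna</span>'
       '<div><span class="red">Heavens!</span><h2>Subtitle</h2></div>'
       '</body></html>')
bs = BeautifulSoup(doc, 'html.parser')

# name: a list of tag names matches any of them
headers = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([tag.name for tag in headers])

# attrs: a list of values for one attribute matches any of the values
spans = bs.find_all('span', {'class': ['green', 'red']})
print([tag.get_text() for tag in spans])

# recursive: compare all descendants with direct children only
all_spans = bs.body.find_all('span')
direct_spans = bs.body.find_all('span', recursive=False)
print(len(all_spans), len(direct_spans))
```

The red span sits inside a `div`, so it is found by the recursive search but not when only the direct children of `body` are considered.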

### `string` argument

- The `string` argument is unusual in that it matches based on the text content of the tags, rather than properties of the tags themselves. For instance, if you want to find the number of times “the prince” is surrounded by tags on the example page, you could replace your `.find_all()` function in the previous example with the following lines:

In [30]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    name_list = bs.find_all(string='the prince')
    print(len(name_list))


main()

7


- The `limit` argument, of course, is used only in the `find_all` method; `find` is equivalent to the same `find_all` call, with a limit of `1`. You might set this if you’re interested only in retrieving the first `x` items from the page. Be aware, however, that this gives you the first items on the page in the order that they occur, not necessarily the first ones that you want.

- The `**kwargs` argument allows you to select tags that contain a particular attribute or set of attributes. For example:

```python
title = bs.find_all(id='title', class_='text')
```

This returns a list of all tags with the word “text” in the `class` attribute and “title” in the `id` attribute. Note that, by convention, each value for an `id` should be used only once on the page. Therefore, in practice, a line like this may not be particularly useful: the list should hold at most one tag, the same one returned by the following:

```python
title = bs.find(id='title')
```

_ChatGPT:_  
The `id` and `class` attributes are commonly used in HTML to uniquely identify an element (`id`) or to categorize elements (`class`). These attributes come from the HTML code of the page being parsed; `class_` is only BeautifulSoup’s keyword-argument spelling of `class`.

The `id` attribute should be unique for each element on a page, and it's typically used to identify a specific element. The `class` attribute can be used to apply similar styling or behavior to multiple elements.

In practice, using `find_all` with both `id` and `class_` attributes may not be particularly useful since the `id` should be unique. 

Recall that passing a list of tag names to `.find_all()` acts as an “or” filter (it selects all tags that are `tag1`, `tag2`, or `tag3`...). If you have a lengthy list of tags, you can end up with a lot of stuff you don’t want. Keyword arguments allow you to add an additional “and” filter to this.
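A compact sketch of “or” versus “and” filtering, together with `limit`, on made-up markup. The helper `texts` is a hypothetical convenience function used only to keep the printed results short.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration
doc = ('<p id="title" class="text">A</p>'
       '<p class="text">B</p>'
       '<p class="note">C</p>'
       '<span class="text">D</span>')
bs = BeautifulSoup(doc, 'html.parser')


def texts(tags):
    """Return the text of each tag, to keep the output short."""
    return [tag.get_text() for tag in tags]


# "or": any of the listed class values
print(texts(bs.find_all(class_=['text', 'note'])))
# "and": tag name and class must both match
print(texts(bs.find_all('p', class_='text')))
# "and": both keyword filters must match
print(texts(bs.find_all(id='title', class_='text')))
# limit stops after the first n matches, in document order
print(texts(bs.find_all(class_='text', limit=2)))
# find() behaves like find_all(..., limit=1), but returns the tag itself
print(bs.find(class_='text') is bs.find_all(class_='text', limit=1)[0])
```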

## Keyword arguments and “class”

Keyword arguments can be helpful in some situations. However, they are technically redundant as a BeautifulSoup feature. Keep in mind that anything that can be done with keyword arguments can also be accomplished using techniques covered later in this chapter (see the sections on regular expressions and lambda expressions).

For instance, the following two lines are identical:

```python
bs.find_all(id='text')
bs.find_all('', {'id':'text'})
```

_VR:_  
Not true! You cannot leave `name=''`, you have to put the name of the tag: 

In [59]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    test1 = bs.find_all(class_='red')
    test2 = bs.find_all('', {'class': 'red'})

    print("test1 length:", len(test1))
    print("test2 length:", len(test2))


if __name__ == "__main__":
    main()

test1 length: 34
test2 length: 0


And this thing works:

In [63]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'html.parser')

    test1 = bs.find_all(class_='red')
    test2 = bs.find_all('span', {'class': 'red'})

    print("test1 length:", len(test1))
    print("test2 length:", len(test2))


main()

test1 length: 34
test2 length: 34
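
In the cells above, an empty string as the tag name matched nothing. As a hedged sketch (run on hypothetical inline HTML rather than the live page), a tag name can still be left out by passing the dictionary through the `attrs` keyword, or by passing `True` as the name, which matches every tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = (
    '<p class="red">one</p>'
    '<span class="red">two</span>'
    '<span class="green">three</span>'
)
bs = BeautifulSoup(html, 'html.parser')

by_keyword = bs.find_all(class_='red')            # keyword argument
by_attrs = bs.find_all(attrs={'class': 'red'})    # dictionary, no tag name
by_true = bs.find_all(True, {'class': 'red'})     # True matches any tag name

print(len(by_keyword), len(by_attrs), len(by_true))
```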


In addition, you might occasionally run into problems using keyword arguments, most notably when searching for elements by their `class` attribute, because `class` is a reserved word in Python. That is, `class` cannot be used as a variable or argument name (no relation to the `BeautifulSoup.find_all()` keyword arguments, previously discussed). For example, if you try the following call, you’ll get a syntax error due to the nonstandard use of `class`:

```python
bs.find_all(class='green')
```

Instead, you can use BeautifulSoup’s somewhat clumsy solution, which involves adding an underscore:
```python
bs.find_all(class_='green')
```

Alternatively, you can enclose `class` in quotes (_no, you cannot! You have to provide `name='tagname'`, which is not needed in the first case._ - VR):
```python
bs.find_all('', {'class':'green'})
```

# 2.2 Other `BeautifulSoup` Objects

So far in the book, you’ve seen two types of objects in the BeautifulSoup library:
- **BeautifulSoup objects** - `bs`, and
- **Tag objects** - retrieved in lists or individually by calling `find_all` and `find` on a BeautifulSoup object, or by drilling down, as in `bs.div.h1`.

However, there are two more objects in the library that, although less commonly used, are still important to know about:
- **NavigableString objects** ("NUH-vi-guh-buhl-string") - used to represent text within tags, rather than the tags themselves (some functions operate on and produce `NavigableStrings`, rather than tag objects).
- **Comment object** - used to find HTML comments in comment tags, `<!--like this one-->`.

These four objects are the only objects you will ever encounter in the BeautifulSoup library (at the time of this writing).

See details [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects).
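
A minimal sketch that produces one object of each of the four kinds from a hypothetical snippet (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup, invented for this example.
html = '<div><h1>Title</h1><!--a hidden note--></div>'
bs = BeautifulSoup(html, 'html.parser')

h1 = bs.div.h1                 # a Tag, reached by drilling down
text = h1.string               # the NavigableString inside the tag
# A function passed as string= is called on each string in the subtree.
comment = bs.div.find(string=lambda s: isinstance(s, Comment))

for obj in (bs, h1, text, comment):
    print(type(obj).__name__)
```

`Comment` is a subclass of `NavigableString`, which is why a string filter can find it.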

# 2.3 Navigating Trees

The `find_all` function is responsible for finding tags based on their name and attributes. But 

> what if you need to find a tag based on its location in a document? 

That’s where tree navigation comes in handy. In **Chapter 1**, you looked at navigating a BeautifulSoup tree in a single direction:

```python
bs.tag.subTag.anotherSubTag
```

Now let’s look at navigating up, across, and diagonally through HTML trees. You’ll use our highly questionable online shopping site at http://www.pythonscraping.com/pages/page3.html.

The HTML for this page, mapped out as a tree (with some tags omitted for brevity), looks like this:

- HTML
    - body
        - div.wrapper
            - h1
            - div.content
            - table#giftList
                - tr
                    - th
                    - th
                    - th
                    - th
                - tr.gift#gift1
                    - td
                    - td
                        - span.excitingNote
                    - td
                    - td
                        - img
                - ...table rows continue...
            - div.footer
        
You will use this same HTML structure as an example in the next few sections.

## Children and other descendants

In the BeautifulSoup library, as well as many other libraries, there is a distinction drawn between **children** and **descendants**: 
- much like in a human family tree, children are always exactly one tag below a parent, whereas 
- descendants can be at any level in the tree below a parent. 

For example, the `tr` tags are children of the `table` tag, whereas `tr`, `th`, `td`, `img`, and `span` are all descendants of the `table` tag (at least in our example page). All children are descendants, but not all descendants are children.

In general, BeautifulSoup functions always deal with the descendants of the current tag selected. For instance, `bs.body.h1` selects the first `h1` tag that is a descendant of the `body` tag. It will not find tags located outside the `body`.

Similarly, `bs.div.find_all('img')` will find the first `div` tag in the document, and then retrieve a list of all `img` tags that are descendants of that `div` tag.

If you want to find only descendants that are children, you can use the `.children` attribute:

In [24]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'html.parser')

    count = 0
    for child in bs.find('table', {'id': 'giftList'}).children:
        count += 1
        print(child)

    print(count)


main()



<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>


This code prints the list of product rows in the `giftList` table, including the initial row of column labels. If you were to write it using the `.descendants` attribute instead of `.children`, every tag nested anywhere within the table would be found and printed, including `img` tags, `span` tags, and individual `td` tags. The following cell counts 86 descendants, a figure that includes text nodes (`NavigableString` objects) as well as tags:

In [27]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'html.parser')

    count = 0
    for child in bs.find('table', {'id': 'giftList'}).descendants:
        count += 1
        # print(child)

    print(count)


main()

86
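
The count of 86 above includes text nodes as well as tags. A hedged sketch on a much smaller, hypothetical table (not the real page) shows how to separate the two with an `isinstance` check:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, shortened table, invented for this example.
html = """<table id="giftList">
<tr><th>Item</th><th>Cost</th></tr>
<tr class="gift"><td>Basket <span>New!</span></td><td>$15.00</td></tr>
</table>"""
bs = BeautifulSoup(html, 'html.parser')
table = bs.find('table', {'id': 'giftList'})

children = list(table.children)
descendants = list(table.descendants)

# Keep only Tag objects, dropping strings such as the newlines between rows.
child_tags = [node for node in children if isinstance(node, Tag)]
descendant_tags = [node for node in descendants if isinstance(node, Tag)]

print([tag.name for tag in child_tags])
print([tag.name for tag in descendant_tags])
print(len(descendants) - len(descendant_tags), 'text nodes')
```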


## Siblings

### `next_siblings`

The BeautifulSoup `next_siblings` attribute makes it trivial to collect data from tables, especially ones with title rows:

In [34]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'lxml')

    for sibling in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
        print(sibling)


main()



<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parr

The output of this code is to print all rows of products from the product table, except for the first title row. 

> Why does the title row get skipped? 

Objects cannot be siblings with themselves. Anytime you get siblings of an object, the object itself will not be included in the list. As the name of the function implies, it calls next siblings only. If you were to select a row in the middle of the list, for example, and call `next_siblings` on it, only the subsequent siblings would be returned. So, by selecting the title row and calling `next_siblings`, you can select all the rows in the table, without selecting the title row itself.

> MAKE SELECTIONS SPECIFIC
>
> The preceding code will work just as well, if you select `bs.table.tr` or even just `bs.tr` in order to select the first row of the table. However, in the code, I go through all of the trouble of writing everything out in a longer form:
>
> `bs.find('table',{'id':'giftList'}).tr`
>
> Even if it looks like there’s just one table (or other target tag) on the page, it’s easy to miss things. In addition, page layouts change all the time. What was once the first of its kind on the page might someday be the second or third tag of that type found on the page. To make your scrapers more robust, it’s best to be as specific as possible when making tag selections. **Take advantage of tag attributes when they are available.**

### `previous_siblings`

The `previous_siblings` attribute can often be helpful if there is an easily selectable tag at the end of a list of sibling tags that you would like to get.

And, of course, there are the `next_sibling` and `previous_sibling` attributes, which work nearly the same way as `next_siblings` and `previous_siblings`, except that they return a single sibling rather than a sequence of them.
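
A small sketch of these attributes on a hypothetical table (the markup is invented for illustration). One caveat it demonstrates: whitespace between tags counts as a sibling too, so the singular forms may return a newline string; `find_next_sibling` is one way to skip straight to the next tag:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup, invented for this example.
html = """<table>
<tr id="header"><th>Item</th></tr>
<tr id="gift1"><td>Basket</td></tr>
<tr id="gift2"><td>Dolls</td></tr>
</table>"""
bs = BeautifulSoup(html, 'html.parser')

header = bs.find('tr', {'id': 'header'})
last_row = bs.find('tr', {'id': 'gift2'})

# previous_siblings walks backwards from the selected row, nearest first.
earlier_ids = [
    sib['id'] for sib in last_row.previous_siblings if isinstance(sib, Tag)
]
print(earlier_ids)

# next_sibling returns exactly one node, here the newline after the header row.
raw_next = header.next_sibling
print(repr(raw_next))

# find_next_sibling skips text nodes and returns the next matching tag.
next_row = header.find_next_sibling('tr')
print(next_row['id'])
```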

## Parents

When scraping pages, you will likely discover that you need to find parents of tags less frequently than you need to find their children or siblings. Typically, when you look at HTML pages with the goal of crawling them, you start by looking at the top layer of tags, and then figure out how to drill your way down into the exact piece of data that you want. Occasionally, however, you can find yourself in odd situations that require BeautifulSoup’s parent-finding functions, `.parent` and `.parents`. For example:

In [37]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    html = urlopen('http://www.pythonscraping.com/pages/page3.html')
    bs = BeautifulSoup(html, 'lxml')

    print(bs.find(
            'img', {'src': '../img/gifts/img1.jpg'}
                ).parent.previous_sibling.get_text())


main()


$15.00



This code will print the price of the object represented by the image at the location `../img/gifts/img1.jpg` (in this case, the price is `$15.00`).

![](./data/images/Screenshot_20240122_004505.png)

1. The image tag where `src="../img/gifts/img1.jpg"` is first selected.
1. You select the parent of that tag (in this case, the `td` tag).
1. You select the `previous_sibling` of the `td` tag (in this case, the `td` tag that contains the dollar value of the product).
1. You select the text within that tag, `“$15.00”`.
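
The text mentions `.parents` as well as `.parent`. A minimal sketch on a hypothetical one-row table (the markup is an assumption for illustration) shows both, plus `find_parent`, which climbs to the nearest ancestor with a given name:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (no whitespace between tags).
html = (
    '<table id="giftList"><tr class="gift">'
    '<td>$15.00</td><td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
bs = BeautifulSoup(html, 'html.parser')
img = bs.find('img', {'src': '../img/gifts/img1.jpg'})

# .parent is the immediate container; .parents iterates all the way up.
ancestor_names = [parent.name for parent in img.parents]
print(ancestor_names)

# The same walk as in the numbered steps above.
price = img.parent.previous_sibling.get_text()
print(price)

# find_parent jumps directly to the closest enclosing row.
row = img.find_parent('tr')
print(row['class'])
```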

# 2.4 Regular Expressions

As the old computer science joke goes: “_Let’s say you have a problem, and you decide to solve it with regular expressions. Well, now you have two problems_.”

Unfortunately, **regular expressions** (often shortened to **regex**) are often taught using large tables of random symbols, strung together to look like a lot of nonsense. This tends to drive people away, and later they get out into the workforce and write needlessly complicated searching and filtering functions, when all they needed was a one-line regular expression in the first place!

Regular expressions are so called because they are used to identify **regular strings**; they can definitively say, “_Yes, this string you’ve given me follows the rules, and I’ll return it_,” or “_This string does not follow the rules, and I’ll discard it_.” This can be exceptionally handy for quickly scanning large documents to look for strings that look like phone numbers or email addresses.

What is a regular string? It’s any string that can be generated by a series of **linear rules** such as these:

- Write the letter `a` at least once.
- Append to this the letter `b` exactly five times.
- Append to this the letter `c` any even number of times.
- Write either the letter `d` or `e` at the end.

Strings that follow these rules are `aaaabbbbbccccd`, `aabbbbbcce`, and so on (there are an infinite number of variations).

> You might be asking yourself, “_Are there ‘irregular’ expressions_?” **Nonregular expressions** are beyond the scope of this book, but they encompass strings such as “write a prime number of as, followed by exactly twice that number of bs” or “write a palindrome.” It’s impossible to identify strings of this type with a regular expression. 

Regular expressions are merely a shorthand way of expressing these sets of rules. For instance, here’s the regular expression for the series of steps just described:

```
aa*bbbbb(cc)*(d|e)
```

This string might seem a little daunting at first, but it becomes clearer when you break it into its components:

- `aa*` - The letter `a` is written, followed by `a*` (read as a **star**), which means “any number of `a`s, including `0` of them.” In this way, you can guarantee that the letter `a` is written at least once.
- `bbbbb` - No special effects here, just five `b`s in a row.
- `(cc)*` - Any even number of things can be grouped into pairs, so in order to enforce this rule about even things, you can write two `c`s, surround them in parentheses, and write an asterisk after it, meaning that you can have any number of pairs of `c`s (note that this can mean `0` pairs, as well).
- `(d|e)` - Adding a bar in the middle of two expressions means that it can be “_this thing or that thing_.” In this case, you are saying “_add a d or an e_.” In this way, you can guarantee that there is exactly one of either of these two characters.
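
The expression can be checked with Python's `re` module. This is a small sketch: the candidate strings other than the two given in the text are made up for illustration, and `fullmatch` is used so that the whole string must follow the rules:

```python
import re

pattern = re.compile(r'aa*bbbbb(cc)*(d|e)')

candidates = [
    'aaaabbbbbccccd',  # from the text
    'aabbbbbcce',      # from the text
    'abbbbbd',         # zero pairs of c is allowed
    'bbbbbd',          # no leading a
    'abbbbbcccd',      # odd number of c
    'abbbbbcc',        # missing the final d or e
]

# fullmatch succeeds only if the entire string matches the pattern.
results = {s: pattern.fullmatch(s) is not None for s in candidates}
for s, ok in results.items():
    print(s, ok)
```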

## Regular Expressions and BeautifulSoup

Most functions that take in a string argument (e.g., `find(id="aTagIdHere")`) will take in a regular expression just as well.

Let’s take a look at some examples, scraping the page found at http://www.pythonscraping.com/pages/page3.html. 

Notice that the site has many product images, which take the following form:

```html
<img src="../img/gifts/img3.jpg">
```

If you wanted to grab URLs to all of the product images, it might seem fairly straightforward at first: just grab all the image tags by using `.find_all("img")`. But there’s a problem. In addition to the obvious “extra” images (e.g., logos), modern websites often have hidden images, blank images used for spacing and aligning elements, and other random image tags you might not be aware of. Certainly, you can’t count on the only images on the page being product images.

Let’s also assume that the layout of the page might change, or that, for whatever reason, you don’t want to depend on the position of the image in the page in order to find the correct tag. This might be the case when you are trying to grab specific elements or pieces of data that are scattered randomly throughout a website. For instance, a featured product image might appear in a special layout at the top of some pages, but not others.

The solution is to look for something identifying about the tag itself. In this case, you can look at the file path of the product images:

In [5]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('http://www.pythonscraping.com/pages/page3.html')
bs = BeautifulSoup(html, 'lxml')
images = bs.find_all('img', {'src': re.compile(r'\.\.\/img\/gifts\/img.*\.jpg')})

for image in images:
    print(image['src'])

../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg


This prints only the relative image paths that start with `../img/gifts/img` and end in `.jpg`.

# 2.5 Accessing Attributes

So far, you’ve looked at how to access and filter tags and access content within them. However, often in web scraping you’re not looking for the content of a tag; you’re looking for its **attributes**. This becomes especially useful for tags such as `a`, where the URL it is pointing to is contained within the `href` attribute; or the `img` tag, where the target image is contained within the `src` attribute.

With tag objects, a Python dictionary of attributes can be accessed by calling this:

```python
myTag.attrs
```

Keep in mind that this literally returns a Python dictionary object, which makes retrieval and manipulation of these attributes trivial. The source location for an image, for example, can be found using the following line:

```python
myImgTag.attrs['src']
```

In [22]:
# chatgpt
from bs4 import BeautifulSoup

# Assume that the following HTML content is stored in a variable called html_content
html_content = '''
<html>
  <body>
    <a href="https://www.example.com">Example Website</a>
    <img src="https://www.example.com/images/example.jpg">
  </body>
</html>
'''

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Accessing the 'href' attribute of the 'a' tag
a_tag = soup.find('a')
print(a_tag.attrs['href'])  # Output: https://www.example.com

# Accessing the 'src' attribute of the 'img' tag
img_tag = soup.find('img')
print(img_tag.attrs['src'])  # Output: https://www.example.com/images/example.jpg

https://www.example.com
https://www.example.com/images/example.jpg
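
As a short supplementary sketch (the snippet is hypothetical), a tag can also be indexed directly like a dictionary, and `.get()` avoids a `KeyError` when an attribute is absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '<a href="https://www.example.com">Example</a><a name="top">No link here</a>'
soup = BeautifulSoup(html, 'html.parser')
first, second = soup.find_all('a')

# tag['href'] is shorthand for tag.attrs['href'].
print(first['href'] == first.attrs['href'])

# .get() returns None (or a supplied default) instead of raising KeyError.
missing = second.get('href')
fallback = second.get('href', '#')
print(missing, fallback)
```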


## `a` vs `link` tags

In HTML, the `a` tag is an anchor tag used to create hyperlinks to other web pages, files, or locations within the same page. Its `href` attribute specifies the URL or destination of the link.

The `link` tag, on the other hand, defines relationships between the current document and external resources such as stylesheets. It is commonly used in the `head` section of an HTML document to refer to external CSS files, and its `href` attribute gives the location of the external resource.

In summary, the `a` tag creates hyperlinks within the content of a webpage, while the `link` tag references external resources, typically from the `head` section of the HTML document.

# 2.6 Lambda Expressions

Essentially, a **lambda expression** is a small anonymous function that is passed into another function as an argument; instead of defining a function as `f(x, y)`, you may define a function as `f(g(x), y)` or even `f(g(x), h(x))`.

BeautifulSoup allows you to pass certain types of functions as parameters into the `find_all` function.

The only restriction is that these functions must take a tag object as an argument and return a boolean. Every tag object that BeautifulSoup encounters is evaluated in this function, and tags that evaluate to `True` are returned, while the rest are discarded.

For example, the following retrieves all tags that have exactly two attributes:

```python
bs.find_all(lambda tag: len(tag.attrs) == 2)
```

Here, the function that you are passing as the argument is `len(tag.attrs) == 2`. Where this is `True`, the  `find_all` function will return the tag. That is, it will find tags with two attributes, such as the following:

```html
<div class="body" id="content"></div>
<span style="color:red" class="title"></span>
```
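
A runnable sketch of the two-attribute filter on hypothetical markup (the extra `p` and `img` tags are invented to show what gets discarded). A named function works the same way as the lambda:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div class="body" id="content">two attributes</div>
<span style="color:red" class="title">two attributes</span>
<p class="intro">one attribute</p>
<img src="a.jpg" alt="A" width="10"/>
"""
bs = BeautifulSoup(html, 'html.parser')

# The lambda is called once per tag and must return True or False.
two_attrs = bs.find_all(lambda tag: len(tag.attrs) == 2)
print([tag.name for tag in two_attrs])


def has_two_attrs(tag):
    """Named equivalent of the lambda above."""
    return len(tag.attrs) == 2


same_result = bs.find_all(has_two_attrs)
print(same_result == two_attrs)
```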

Lambda functions are so useful you can even use them to replace existing BeautifulSoup functions:

```python
bs.find_all(lambda tag: tag.get_text() ==
    'the prince')
```

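A similar search can be written without a lambda function, although it returns the matching strings (`NavigableString` objects) rather than the tags that contain them:

```python
bs.find_all(string='the prince')
```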

> VR: With a lambda function, the results are tags, so you can easily access their attributes!

In [38]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'lxml')

    string = bs.find_all(
        lambda tag: tag.get_text() == 'the prince'
    )
    print(len(string))

    print("And this is the attribute:", end=' ')
    for i in string:
        print(i.attrs['class'])
        break


if __name__ == "__main__":
    main()

7
And this is the attribute: ['green']


_ChatGPT:_  
In this case, `find_all` calls the lambda function on each tag in the document; the function returns `True` if the tag's text content is equal to 'the prince', and `False` otherwise. This filters the tags so that only the ones with the specified text content are returned.

So, the lambda function does take a tag as its argument (the `tag` in `lambda tag: ...`); `find_all` supplies that argument, and the function acts as a filter that selects tags based on their text content.

In [64]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'lxml')

    string = bs.find_all(string='the prince')
    print(len(string))

    try:
        for i in string:
            print(i.attrs['class'])
            break
    except Exception as e:
        print(e)


if __name__ == "__main__":
    main()

7
'NavigableString' object has no attribute 'attrs'


In [19]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():

    html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
    bs = BeautifulSoup(html, 'lxml')

    string = bs.find_all(
        lambda tag: tag.get_text() == 'the prince'
    )
    print(string)
    print(len(string))

    for i in string:
        try:
            # Tag attributes live in .attrs (not .attr)
            print(i.attrs['class'])
        except Exception as e:
            # Report the problem before moving on to the next tag
            print(e)
            continue
        print('y')


if __name__ == "__main__":
    main()

[<span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>, <span class="green">the prince</span>]
7
['green']
y
['green']
y
['green']
y
['green']
y
['green']
y
['green']
y
['green']
y


However, if you remember the syntax for the lambda function, and how to access tag properties, you may never need to remember any other BeautifulSoup syntax again! 

Because the provided lambda function can be any function that returns a `True` or `False` value, you can even combine them with regular expressions to find tags with an attribute matching a certain string pattern.
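
One way to combine the two is sketched below, under the assumption of a hypothetical snippet and a made-up pattern for product image filenames (`img` followed by digits and `.jpg`):

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<img src="../img/gifts/logo.jpg"/>
<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<a href="../img/gifts/img3.jpg">not an image tag</a>
"""
bs = BeautifulSoup(html, 'html.parser')

# Assumed naming scheme for product images.
product_image = re.compile(r'img\d+\.jpg$')

# The lambda checks the tag name and tests the src attribute with the regex.
matches = bs.find_all(
    lambda tag: tag.name == 'img'
    and product_image.search(tag.get('src', '')) is not None
)
sources = [tag['src'] for tag in matches]
print(sources)
```

`tag.get('src', '')` supplies an empty string for tags without a `src`, so the regex never receives `None`.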

# <b>3. Writing Web Crawlers</b>

**Web crawlers** are called such because they crawl across the web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.

# 3.1 Traversing a Single Domain

## ASCII URLs

In [3]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

n = 20

def main():
    html = urlopen('https://en.wikipedia.org/wiki/Joseph_Stalin')
    bs = BeautifulSoup(html, 'lxml')

    a_tags = bs.find_all('a')

#     for link in links:
#         print(link)

    count = 0
    for tag in a_tags:
        if 'href' in tag.attrs:
            print(tag.attrs['href'])
            count += 1
        if count == n:
            break


main()

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Joseph+Stalin
/w/index.php?title=Special:UserLogin&returnto=Joseph+Stalin
/w/index.php?title=Special:CreateAccount&returnto=Joseph+Stalin
/w/index.php?title=Special:UserLogin&returnto=Joseph+Stalin
/wiki/Help:Introduction


## Russian URLs

In [21]:
import requests

r = requests.get('https://ru.wikipedia.org/wiki/Сталин,_Иосиф_Виссарионович')

In [23]:
print(r.text)

<!DOCTYPE html>
<html class="client-nojs" lang="ru" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Сталин, Иосиф Виссарионович — Википедия</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":[",\t."," \t,"],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","январь","февраль","март","апрель","май","июнь","июль","август","сентябрь","октябрь","ноябрь","декабрь"],"wgRequestId":"9c608947-924b-455d-a087-b5a17adeb11a","wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Сталин,_Иосиф_Виссарионович","wgTitle":"Сталин, Иосиф Виссарионович","wgCurRevisionId":135753261,"wgRevisionId":135753261,"wgArticleId":3010,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Википедия:Cite web (не указан язык)","Википедия:Статьи с нераспознанным языком (ref)",
"Википедия:Статьи с нерабочими ссылками","Викип

In [14]:
from urllib.request import urlopen
from bs4 import BeautifulSoup


def main():
    try:
        html = urlopen('https://ru.wikipedia.org/wiki/Сталин,_Иосиф_Виссарионович')
    except Exception as e:
        print(type(e), e)


main()

<class 'UnicodeEncodeError'> 'ascii' codec can't encode characters in position 10-15: ordinal not in range(128)


_ChatGPT:_  
The error you are encountering is due to the attempt to encode non-ASCII characters in the URL. The problem is that the URL you are trying to access contains non-ASCII characters such as Cyrillic characters (for the Russian language), and it's causing the urlopen function to fail due to encoding issues.

To resolve this issue, you can encode the URL using the `quote` function from the `urllib.parse` module, which will properly handle any non-ASCII characters in the URL. 

By using the `quote` function to encode the URL before passing it to `urlopen`, you can avoid the `UnicodeEncodeError` related to non-ASCII characters in the URL. 

Afterwards, the `unquote` function from the same module decodes the percent-encoded `href` values so that they print as readable Cyrillic text.

In [10]:
from urllib.parse import quote, unquote
from urllib.request import urlopen
from bs4 import BeautifulSoup

n = 20

def main():
    url = 'https://ru.wikipedia.org/wiki/Сталин,_Иосиф_Виссарионович'
    encoded_url = quote(url, safe=':/')
    html = urlopen(encoded_url)
    bs = BeautifulSoup(html, 'lxml')

    a_tags = bs.find_all('a')

#     for link in links:
#         print(link)

    count = 0
    for tag in a_tags:
        if 'href' in tag.attrs:
            decoded_url = unquote(tag.attrs['href'])
            print(decoded_url)
            count += 1
        if count == n:
            break


main()

/wiki/Википедия:Патрулирование
https://ru.wikipedia.org/w/index.php?title=Служебная:Журналы&type=review&page=Сталин,_Иосиф_Виссарионович
#mw-head
#searchInput
/wiki/Сталин_(значения)
/wiki/Иосиф_Сталин_(значения)
/wiki/Джугашвили_(значения)
/wiki/Грузинский_язык
/wiki/Файл:CroppedStalin1943.jpg
/wiki/Тегеранская_конференция
/wiki/1943_год
/wiki/Файл:Coat_of_arms_of_the_Soviet_Union_(1946–1956).svg
/wiki/Председатель_Совета_Министров_СССР
/wiki/Файл:Flag_of_the_USSR_(1936-1955).svg
/wiki/Калинин,_Михаил_Иванович
/wiki/Шверник,_Николай_Михайлович
/wiki/Совет_народных_комиссаров_СССР
/wiki/Маленков,_Георгий_Максимилианович
/wiki/Файл:КПСС.svg
/wiki/Секретариат_ЦК_КПСС


In [19]:
help(quote)

Help on function quote in module urllib.parse:

quote(string, safe='/', encoding=None, errors=None)
    quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional char

In [20]:
help(unquote)

Help on function unquote in module urllib.parse:

unquote(string, encoding='utf-8', errors='replace')
    Replace %xx escapes by their single-character equivalent. The optional
    encoding and errors parameters specify how to decode percent-encoded
    sequences into Unicode characters, as accepted by the bytes.decode()
    method.
    By default, percent-encoded sequences are decoded with UTF-8, and invalid
    sequences are replaced by a placeholder character.

    unquote('abc%20def') -> 'abc def'.
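
Calling `quote(url, safe=':/')` on a whole URL works for the article address above, but it would also escape `?`, `=`, and `&` in a URL that has a query string. A hedged sketch of one alternative is to split the URL first and quote each part separately. The helper name `to_ascii_url` and the example URL are assumptions for illustration, and non-ASCII host names are not handled here:

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode path, query, and fragment, keeping the URL structure."""
    parts = urlsplit(url)
    # '%' stays in the safe set so already-encoded text is not encoded twice.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%')
    fragment = quote(parts.fragment, safe='%')
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


# Hypothetical URL with Cyrillic text in the query string.
url = 'https://ru.wikipedia.org/w/index.php?title=Сталин&action=history'

encoded = to_ascii_url(url)
naive = quote(url, safe=':/')  # escapes the query delimiters as well

print(encoded)
print(naive)
print(unquote(encoded) == url)
```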



## Filter links

If you look at the list of links produced, you’ll notice that all the articles you’d expect are there: `/wiki/Pravda`, `/wiki/Central_Committee_of_the_Communist_Party_of_the_Soviet_Union`, `/wiki/Prague_Conference` and so on. However, there are some things that you don’t want as well:
`//wikimediafoundation.org/wiki/Privacy_policy`, `//en.wikipedia.org/wiki/Wikipedia:Contact_us`.

In fact, Wikipedia is full of sidebar, footer, and header links that appear on every page, along with links to the category pages, talk pages, and other pages that do not contain different articles:
- `/wiki/Help:Introduction`
- `/wiki/Wikipedia:Community_portal`

If you examine the links that point to article pages (as opposed to other internal pages), you’ll see that they all have three things in common:

- They reside within the `div` with the `id` set to `bodyContent`.
- The URLs do not contain colons.
- The URLs begin with `/wiki/`.

You can use these rules to revise the code slightly to retrieve only the desired article links by using the regular expression `^(/wiki/)((?!:).)*$`:

In [21]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('https://en.wikipedia.org/wiki/Joseph_Stalin')
bs = BeautifulSoup(html, 'lxml')

links = bs.find('div', {'id': 'bodyContent'}
               ).find_all('a', href=re.compile("^(/wiki/)((?!:).)*$")
)

In [24]:
n = 20
count = 0
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])
        count += 1
    if count == n:
        break

/wiki/Stalin_(disambiguation)
/wiki/Eastern_Slavic_naming_conventions
/wiki/Patronymic
/wiki/Surname
/wiki/Generalissimus_of_the_Soviet_Union
/wiki/Tehran_Conference
/wiki/General_Secretary_of_the_Communist_Party_of_the_Soviet_Union
/wiki/Vyacheslav_Molotov
/wiki/Nikita_Khrushchev
/wiki/Chairman_of_the_Council_of_Ministers_of_the_Soviet_Union
/wiki/Georgy_Malenkov
/wiki/Minister_of_Defence_(Soviet_Union)
/wiki/Semyon_Timoshenko
/wiki/Nikolai_Bulganin
/wiki/People%27s_Commissariat_for_Nationalities
/wiki/Vladimir_Lenin
/wiki/Old_Style_and_New_Style_dates
/wiki/Gori,_Georgia
/wiki/Tiflis_Governorate
/wiki/Russian_SFSR
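
The filter used above can be checked offline before any page is requested. The following is a minimal sketch that applies the same regular expression to a few of the hrefs mentioned earlier; the helper name `is_article_link` is an assumption made for illustration and does not appear in the book.

```python
import re

# Same pattern as in the crawler: starts with /wiki/ and contains no colon.
ARTICLE_LINK = re.compile(r"^(/wiki/)((?!:).)*$")


def is_article_link(href):
    """Hypothetical helper: True if href looks like a Wikipedia article path."""
    return ARTICLE_LINK.match(href) is not None


candidates = [
    "/wiki/Pravda",
    "/wiki/Help:Introduction",
    "/wiki/Wikipedia:Community_portal",
    "//wikimediafoundation.org/wiki/Privacy_policy",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
]
for href in candidates:
    print(href, is_article_link(href))
```

Only `/wiki/Pravda` passes: the namespace pages contain a colon, and the protocol-relative links do not begin with `/wiki/`.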


Of course, having a script that finds all article links in one, hardcoded Wikipedia article, while interesting, is fairly useless in practice. You need to be able to take this code and transform it into something more like the following:
- A single function, `get_links`, that takes in a Wikipedia article URL of the form `/wiki/<Article_Name>` and returns a list of all linked article URLs in the same form.
- A `main` function that calls `get_links` with a starting article, chooses a random article link from the returned list, and calls `get_links` again, until you stop the program or until no article links are found on the new page.

Here is the complete code that accomplishes this:

In [2]:
import random
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

LIMIT = 20


def main():
    links = get_links("/wiki/Joseph_Stalin")

    count = 0
    while len(links) > 0:
        new_article = links[random.randint(0, len(links) - 1)].attrs["href"]
        print(new_article)
        links = get_links(new_article)
        count += 1
        if count == LIMIT:
            break
    return 0


def get_links(url):
    html = urlopen(f"https://en.wikipedia.org{url}")
    bs = BeautifulSoup(html, "lxml")

    return bs.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")
    )


if __name__ == "__main__":
    main()

/wiki/Volga-Volga
/wiki/Isaak_Dunayevsky
/wiki/Kharkiv_National_Kotlyarevsky_University_of_Arts
/wiki/National_Aerospace_University_%E2%80%93_Kharkiv_Aviation_Institute
/wiki/Vinnytsia_National_Technical_University
/wiki/Kyiv_National_Economic_University
/wiki/Management
/wiki/Procedure_(business)
/wiki/Standard_Operating_Procedure
/wiki/List_of_signs_and_symptoms_of_diving_disorders
/wiki/Type_7103_DSRV
/wiki/Scuba_gas_planning
/wiki/Constant_weight
/wiki/Dottie_Frazier
/wiki/Polespear
/wiki/GRUMEC
/wiki/HeinrichsWeikamp
/wiki/Underwater_Demolition_Team
/wiki/Dewey_Smith
/wiki/DIN_7876


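Russian URLs (the Cyrillic path is percent-encoded with `urllib.parse.quote` before the request and decoded with `urllib.parse.unquote` for printing):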

In [1]:
import random
import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

LIMIT = 20


def main():
    links = get_links("/wiki/Сталин,_Иосиф_Виссарионович")

    count = 0
    while len(links) > 0:
        new_article = links[random.randint(0, len(links) - 1)].attrs["href"]
        new_article = urllib.parse.unquote(new_article)
        print(new_article)
        links = get_links(new_article)
        count += 1
        if count == LIMIT:
            break
    return 0


def get_links(url):
    url = urllib.parse.quote(url)
    html = urllib.request.urlopen(f"https://ru.wikipedia.org{url}")
    bs = BeautifulSoup(html, "lxml")

    return bs.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")
    )


if __name__ == "__main__":
    main()

/wiki/Родионов,_Михаил_Иванович
/wiki/Черноусов,_Борис_Николаевич
/wiki/Оргбюро_ЦК_ВКП(б)
/wiki/Сталин
/wiki/Спасибо_товарищу_Сталину_за_наше_счастливое_детство!
/wiki/Сталинские_расстрельные_списки
/wiki/СОФИН
/wiki/Национальная_объединённая_партия_Армении
/wiki/Армянская_Советская_Социалистическая_Республика
/wiki/Спитакский_район
/wiki/Амасийский_район
/wiki/Арташатский_район
/wiki/Эчмиадзинский_район
/wiki/Анийский_район
/wiki/Мартунинский_район_(Армянская_ССР)
/wiki/Севанский_район
/wiki/Армения
/wiki/Западная_Армения
/wiki/Киперт,_Генрих
/wiki/Контрольный_номер_Библиотеки_Конгресса


### Needless code - VR

_VR:_  
It seems like this part is needless: 

```python
bs.find("div", {"id": "bodyContent"}).
```

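Compare the results when the crawler always follows the first link in the `links` list (`links[0]`), with and without the `bodyContent` restriction; the output is the same. This only demonstrates that the first matching link is the same on these pages, not that the full link lists are identical.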

In [7]:
import random
import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

LIMIT = 20


def main():
    links = get_links("/wiki/Сталин,_Иосиф_Виссарионович")

    count = 0
    while len(links) > 0:
        new_article = links[0].attrs["href"]
        new_article = urllib.parse.unquote(new_article)
        print(new_article)
        links = get_links(new_article)
        count += 1
        if count == LIMIT:
            break
    return 0


def get_links(url):
    url = urllib.parse.quote(url)
    html = urllib.request.urlopen(f"https://ru.wikipedia.org{url}")
    bs = BeautifulSoup(html, "lxml")

    return bs.find("div", {"id": "bodyContent"}).find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")
    )


if __name__ == "__main__":
    main()

/wiki/Сталин_(значения)
/wiki/Псевдоним
/wiki/Псевдоним_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык


In [6]:
import random
import re
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

LIMIT = 20


def main():
    links = get_links("/wiki/Сталин,_Иосиф_Виссарионович")

    count = 0
    while len(links) > 0:
        new_article = links[0].attrs["href"]
        new_article = urllib.parse.unquote(new_article)
        print(new_article)
        links = get_links(new_article)
        count += 1
        if count == LIMIT:
            break
    return 0


def get_links(url):
    url = urllib.parse.quote(url)
    html = urllib.request.urlopen(f"https://ru.wikipedia.org{url}")
    bs = BeautifulSoup(html, "lxml")

    return bs.find_all(
        "a", href=re.compile("^(/wiki/)((?!:).)*$")
    )


if __name__ == "__main__":
    main()

/wiki/Сталин_(значения)
/wiki/Псевдоним
/wiki/Псевдоним_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык
/wiki/Таблица_МФА_для_английского_языка
/wiki/МФА
/wiki/IPA_(значения)
/wiki/Викисловарь
/wiki/Английский_язык


# 3.2 Crawling an Entire Site

Crawling an entire site, especially a large one, is a memory-intensive process that is best suited to applications for which a database to store crawling results is readily available. However, you can explore the behavior of these types of applications without running them full-scale. To learn more about running these applications by using a database, see **Chapter 6**.

## Deep web

The **deep web** is any part of the web that’s not part of the **surface web**. The surface web is the part of the internet that is indexed by search engines. Estimates vary widely, but the deep web almost certainly makes up about 90% of the internet. Because Google can’t do things like submit forms, find pages that haven’t been linked to by a top-level domain, or investigate sites where `robots.txt` prohibits it, the surface web stays relatively small.

## Avoid crawling the same page twice

To avoid crawling the same page twice, it is extremely important that all internal links discovered are formatted consistently and kept in a running `set` for easy lookups while the program is running (a relational database with a `UNIQUE` constraint is much better). A `set` is similar to a `list`, but elements do not have a specific order, and only unique elements will be stored, which is ideal for our needs. Only links that are “new” should be crawled and searched for additional links:

```python
# I just put it here without check, it must be implemented with the database VR
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)
getLinks('')
```

Initially, `getLinks` is called with an empty URL. This is translated as “the front page of Wikipedia” as soon as the empty URL is prepended with `http://en.wikipedia.org` inside the function. Then, each link on the first page is iterated through and a check is made to see whether it is in the global set of pages (a set of pages that the script has encountered already). If not, it is added to the set, printed to the screen, and the `getLinks` function is called recursively on it.
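
The set only prevents repeat visits if the same page always produces the same string. One possible way to keep links “formatted consistently” is sketched below with the standard library; the function name `normalize_link` and the specific rules (resolve relative links, drop the fragment, lowercase the scheme and host) are assumptions for illustration, not the book's method.

```python
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


def normalize_link(page_url, href):
    """Hypothetical helper: turn an href found on page_url into a canonical URL."""
    # Resolve relative and protocol-relative links against the current page.
    absolute = urljoin(page_url, href)
    # "#section" anchors point at the same page, so drop them.
    without_fragment, _ = urldefrag(absolute)
    parts = urlsplit(without_fragment)
    # Scheme and host are case-insensitive; the path is left untouched.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


page = "https://en.wikipedia.org/wiki/Pravda"
seen = set()
for href in ["/wiki/Prague_Conference", "/wiki/Prague_Conference#History",
             "https://EN.wikipedia.org/wiki/Prague_Conference"]:
    seen.add(normalize_link(page, href))
print(seen)
```

All three hrefs collapse to a single entry, so the page would be crawled once.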

Python has a default recursion limit (the number of times a program can recursively call itself) of 1,000. Because Wikipedia’s network of links is extremely large, this program will eventually hit that recursion limit and stop, unless you put in a recursion counter or something to prevent that from happening.
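
One way to avoid the recursion limit altogether is to replace the recursive call with an explicit queue. The sketch below is an assumption-laden illustration, not the book's code: `crawl`, `max_pages`, and the `get_links` callable are hypothetical names, and a small in-memory dictionary stands in for the network so the example runs offline.

```python
from collections import deque


def crawl(start, get_links, max_pages=100):
    """Breadth-first crawl; get_links(page) must return an iterable of links."""
    seen = {start}           # every page that has ever been queued
    queue = deque([start])   # pages waiting to be visited
    order = []               # pages in the order they were visited
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy link graph standing in for real pages (hypothetical data).
toy_site = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/C"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}
visited = crawl("/wiki/A", lambda page: toy_site.get(page, []))
print(visited)
```

In a real crawler, `get_links` would fetch the page and return its filtered hrefs; the loop depth no longer grows with the number of pages, and `max_pages` gives an explicit stopping point.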

## Collecting Data Across an Entire Site

Let’s look at how to build a scraper that collects the title, the first paragraph of content, and the link to edit the page (if available).

As always, the first step to determine how best to do this is to look at a few pages from the site and determine a pattern. By looking at a handful of Wikipedia pages (both articles and nonarticle pages such as the privacy policy page), the following things should be clear:

- All titles (on all pages, regardless of their status as an article page, an edit history page, or any other page) appear under `h1 → span` tags, and these are the only `h1` tags on the page.
- As mentioned before, all body text lives under the `div#bodyContent` tag. However, if you want to get more specific and access just the first paragraph of text, you might be better off using `div#mw-content-text → p` (selecting the first paragraph tag only). This is true for all content pages except file pages (for example, `https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg`), which do not have sections of content text.
- Edit links occur only on article pages. If they occur, they will be found in the `li#ca-edit` tag, under `li#ca-edit → span → a`.

By modifying our basic crawling code, you can create a combination crawler/data-gathering (or, at least, data-printing) program:

In [20]:
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

pages = set()


def main():

    try:
        getLinks('')
    except Exception as e:
        print(type(e), e)

    return 0


def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span')
                    .find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                #We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)


if __name__ == "__main__":
    main()

Main Page
<p><b><a href="/wiki/W._Somerset_Maugham" title="W. Somerset Maugham">W. Somerset Maugham</a></b> (25 January 1874 – 16 December 1965) was an English writer. He achieved national celebrity as a playwright; by 1908 he had four plays running simultaneously in London's <a href="/wiki/West_End_theatre" title="West End theatre">West End</a>. After 1933 he concentrated on novels and short stories. His popularity provoked adverse reactions from highbrow critics, and many belittled him as merely competent. More recent assessments generally rank <i><a href="/wiki/Of_Human_Bondage" title="Of Human Bondage">Of Human Bondage</a></i> as a masterpiece, and his short stories are held in high critical regard. Maugham's plain prose became known for its lucidity, but his reliance on clichés attracted adverse critical commentary. During <a href="/wiki/World_War_I" title="World War I">World War I</a> Maugham worked for the <a href="/wiki/MI6" title="MI6">British Secret Service</a>, later drawing

> _VR:_ The code is not okay... Starting from the design (which I always have to remake) and finishing with some strange techniques...

The `for` loop in this program is essentially the same as it was in the original crawling program (with the addition of printed dashes for clarity, separating the printed content).

Because you can never be entirely sure that all the data is on each page, each `print` statement is arranged in the order that it is likeliest to appear on the site. That is, the `h1` `title` tag appears on every page (as far as I can tell, at any rate) so you attempt to get that data first. The `text` content appears on most pages (except for file pages), so that is the second piece of data retrieved. The `Edit` button appears only on pages in which both titles and text content already exist, but it does not appear on all of those pages.

You might notice that in this and all the previous examples, you haven’t been “collecting” data so much as “printing” it. Obviously, data in your terminal is hard to manipulate. You’ll look more at storing information and creating databases in **Chapter 6**.

## Handling redirects

**Redirects** allow a web server to point one domain name or URL to a piece of content at a different location. There are two types of redirects:

- Server-side redirects, where the URL is changed before the page is loaded;
- Client-side redirects, sometimes seen with a “You will be redirected in 10 seconds” type of message, where the page loads before redirecting to the new one.

With server-side redirects, you usually don’t have to worry. If you’re using the `urllib` library with Python `3.x`, it handles redirects automatically. The `requests` library also follows redirects by default for every method except `HEAD`; passing `allow_redirects=True` simply makes that behavior explicit:

```python
import requests

r = requests.get('http://github.com', allow_redirects=True)
```

Just be aware that, occasionally, the URL of the page you’re crawling might not be exactly the URL that you entered the page on.
For more information on client-side redirects, which are performed using JavaScript or HTML, see Chapter 12.
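
To see whether the final URL differs from the one requested, a `requests` response exposes the intermediate responses in `history` and the final address in `url`. The sketch below relies only on those two attributes; the helper names are hypothetical, and a `SimpleNamespace` stand-in replaces a real response so the example runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return every URL visited, ending with the final one."""
    return [hop.url for hop in response.history] + [response.url]


def was_redirected(response, requested_url):
    """True if the server sent the client somewhere other than requested_url."""
    return bool(response.history) or response.url != requested_url


# Stand-in object with the same two attributes as a requests response
# (illustrative values, not captured from a real request).
fake_response = SimpleNamespace(
    url="https://github.com/",
    history=[SimpleNamespace(url="http://github.com/")],
)
print(redirect_chain(fake_response))
print(was_redirected(fake_response, "http://github.com/"))
```

With a real response, the same functions can be called on the object returned by `requests.get(...)`; storing the final `url` rather than the requested one helps keep the set of visited pages consistent.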

# 3.3 Crawling Across the Internet

Whenever I give a talk on web scraping, someone inevitably asks, “How do you build Google?” My answer is always twofold: “First, you get many billions of dollars so that you can buy the world’s largest data warehouses and place them in hidden locations all around the world. Second, you build a web crawler.”

When Google started in 1996, it was just two Stanford graduate students with an old server and a Python web crawler. Now that you know how to scrape the web, you officially have the tools you need to become the next tech multibillionaire!

In all seriousness, web crawlers are at the heart of what drives many modern web technologies, and you don’t necessarily need a large data warehouse to use them. To do any cross-domain data analysis, you do need to build crawlers that can interpret and store data across the myriad of pages on the internet.

Just as in the previous example, the web crawlers you are going to build will follow links from page to page, building out a map of the web. But this time, they will not ignore external links; they will follow them.

Before you start writing a crawler that follows all outbound links willy-nilly, you should ask yourself a few questions:
- What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
- When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
- Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
- How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across? (Check out Chapter 18 for more information on this subject.)

A flexible set of Python functions that can be combined to perform a variety of types of web scraping can be easily written in fewer than 60 lines of code (with modifications - VR):

In [1]:
import random
import re
import signal
import sys

import requests
from bs4 import BeautifulSoup
from requests import utils


def get_response_object(url):
    """
    Returns
    - response object if success;
    - 0 if the error happens
    """

    result = None
    try:
        r = requests.get(url, allow_redirects=True)
        result = r.status_code
        if result == 200:
            return r
    except Exception as e:
        result = e
    print(f"Error {result}: {url}")
    return 0


# Retrieves a list of all Internal links found on a page
def get_internal_links(bs, include_url):
    """
    Returns list
    """

    internal_links = list()
    base_url, domain = get_domain(include_url)

    # Finds all links that begin with a "/"
    try:
        links = bs.find_all("a")
    except:
        return internal_links
    for link in links:
        try:
            reference = link.attrs["href"]
            # print("INTERNAL LINK:", reference)
        except Exception as e:
            print(type(e), e)
            continue
        if reference.startswith("/") or domain in reference:
            if reference.startswith("/"):
                reference = base_url + reference
            if reference not in internal_links:
                internal_links.append(reference)
    return internal_links


# Retrieves a list of all external links found on a page
def get_external_links(bs, exclude_url):
    """
    Returns: list
    """
    external_links = list()
    _, domain = get_domain(exclude_url)

    # Finds all links that start with "http" that do
    # not contain the current URL
    try:
        links = bs.find_all("a")
    except:
        return external_links
    for link in links:
        if "href" in link.attrs:
            reference = link.attrs["href"]
        else:
            continue
        if (domain not in reference) and reference.startswith("http"):
            if reference not in external_links:
                external_links.append(reference)
    return external_links


def get_random_external_link(starting_page):
    """
    Returns: string
    """
    html = requests.get(starting_page).text
    bs = BeautifulSoup(html, "lxml")
    external_links = get_external_links(bs, starting_page)
    if len(external_links) == 0:
        try:
            print("No external links, looking around the site for one")
            internal_links = get_internal_links(bs, starting_page)
            return get_random_external_link(
                internal_links[random.randint(0, len(internal_links) - 1)]
            )
        except:
            print("No external links found.")
            return 0
    else:
        return external_links[random.randint(0, len(external_links) - 1)]


def follow_external_only(starting_site):
    try:
        external_link = get_random_external_link(starting_site)
        print("Random external link: {}".format(external_link))
        follow_external_only(external_link)
    except Exception as e:
        print("Error:", e)
        return 1


def get_domain(url):
    """
    Returns: tuple
    - base_url - complete url with protocol, subdomain and domain;
    - domain (no 'www').
    """
    domain = utils.urlparse(url).netloc
    base_url = f"{utils.urlparse(url).scheme}://{domain}"
    # take only site.com
    domain_list = domain.split(".")
    domain = ".".join(domain_list[-2:])
    return base_url, domain


# Handling SIGINT (Ctrl+C), which Python would otherwise raise as KeyboardInterrupt
def signal_handler(sig, frame):
    """
    Catches system signals and exits.
    """
    print("KeyboardInterrupt caught")
    sys.exit(0)

In [3]:
pages = set()


def main():
    signal.signal(signal.SIGINT, signal_handler)
    url = "http://oreilly.com"
    follow_external_only(url)
    return 0


if __name__ == "__main__":
    main()

Random external link: https://oreilly.hk/
Random external link: https://www.oreilly.com/emails/newsletters/
Random external link: https://twitter.com/oreillymedia
KeyboardInterrupt caught


SystemExit: 0

In [26]:
all_int_links = set()
all_ext_links = set()


def main():
    signal.signal(signal.SIGINT, signal_handler)
    url = "http://oreilly.com"
    all_int_links.add(url)
    get_all_ext_links(url)


# Collects a list of all external URLs found on the site
def get_all_ext_links(url):
    r = get_response_object(url)
    if r:
        html = r.text
    else:
        return r

    bs = BeautifulSoup(html, "lxml")
    internal_links = get_internal_links(bs, url)
    external_links = get_external_links(bs, url)

    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            all_int_links.add(link)
            get_all_ext_links(link)


if __name__ == "__main__":
    main()

https://twitter.com/oreillymedia
https://www.linkedin.com/company/oreilly-media
https://www.youtube.com/user/OreillyMedia
https://oreilly.hk/
https://oreillylearning.in/
https://oreilly.id/
https://www.oreilly.co.jp/index.shtml
https://itunes.apple.com/us/app/safari-to-go/id881697395
https://play.google.com/store/apps/details?id=com.safariflow.queue
https://channelstore.roku.com/details/c8a2d0096693eb9455f6ac165003ee06/oreilly
https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&keywords=oreilly&qid=1604964116&s=mobile-apps&sr=1-2
https://www.amazon.com/OReilly-Media-Inc/dp/B087YYHL5C/ref=sr_1_2?dchild=1&amp;keywords=oreilly&amp;qid=1604964116&amp;s=mobile-apps&amp;sr=1-2
KeyboardInterrupt caught


SystemExit: 0

This code can be thought of as two loops—one gathering internal links, one gathering external links—working in conjunction with each other.
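
The internal/external decision at the center of both loops can also be made by comparing parsed host names rather than searching for the domain as a substring. The following is a sketch of that alternative under stated assumptions: `split_links` is a hypothetical helper, it takes plain href strings instead of a BeautifulSoup object, and it treats any subdomain of the starting host as internal.

```python
from urllib.parse import urljoin, urlsplit


def split_links(page_url, hrefs):
    """Hypothetical helper: return (internal, external) sets of absolute URLs."""
    site_host = urlsplit(page_url).hostname or ""
    if site_host.startswith("www."):
        site_host = site_host[4:]

    internal, external = set(), set()
    for href in hrefs:
        absolute = urljoin(page_url, href)   # resolves "/about" style links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue                         # skip mailto:, javascript:, tel:
        host = parts.hostname or ""
        if host == site_host or host.endswith("." + site_host):
            internal.add(absolute)
        else:
            external.add(absolute)
    return internal, external


hrefs = [
    "/about",
    "https://www.oreilly.com/emails/newsletters/",
    "https://twitter.com/oreillymedia",
    "https://oreilly.hk/",
    "mailto:info@example.com",
]
internal, external = split_links("http://oreilly.com", hrefs)
print(sorted(internal))
print(sorted(external))
```

Comparing host names avoids misclassifying an external URL that merely mentions the domain somewhere in its path or query string.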

Jotting down or making diagrams of what the code should do before you write the code itself is a fantastic habit to get into, and one that can save you a lot of time and frustration as your crawlers get more complicated.

<center>
    <img src="./data/images/simple_crawler.png" alt="Figure 3-2. Flow diagram for the website crawler that collects all external links" style="width: 500px">
    <p style="text-align: center"><i>Figure 3-2. Flow diagram for the website crawler that collects all external links</i></p>
</center>

## Drafts

In [23]:
from requests import utils

include_url = 'https://www.oreilly.com/online-learning/features.html'
url = "{}://{}".format(
                    utils.urlparse(include_url).scheme,
                    utils.urlparse(include_url).netloc
)

print(utils.urlparse(include_url).scheme)
print(utils.urlparse(include_url).netloc)
print(url)

https
www.oreilly.com
https://www.oreilly.com


In [39]:
import re

domain = 'www.oreilly.com'
url = 'https://www.oreilly.com/online-learning/features.html'

pattern = re.compile('^(/|.*'+domain.replace('.', r'\.')+')')
re.findall(pattern, url)

['https://www.oreilly.com']

_ChatGPT:_  

- `^`: This symbol indicates the start of the string. So `^/` matches strings that start with a forward slash.
- `|`: This symbol is the `OR` operator in regex. It allows you to define multiple alternatives, and the pattern will match if any of the alternatives are found.
- `.*`: This matches any character (except for line terminators) zero or more times. In this context, it's used to match any sequence of characters.
- `'+domain.replace('.', r'\.')+'`: This part splices the `domain` string, with its periods escaped, into the pattern. This means that the pattern will also match any string that contains `domain`.

So, the combined pattern `'^(/|.*'+domain.replace('.', r'\.')+')'` means that it will match strings that start with a forward slash **or** contain the `domain` string anywhere within them.

In [21]:
import re

domain = 'https://oreilly.com'
subdomain = ''
url = 'https://www.верховныйсовет.ссср/online-learning/features.html'

pattern = re.compile("^(http|https)://(?:"+ subdomain+ ")(?!"+ domain+ r").*$")
re.search(pattern, url)

<re.Match object; span=(0, 61), match='https://www.верховныйсовет.ссср/online-learning/f>

_ChatGPT:_  
- The code `exclude_url.replace('.', r'\.')` is likely being used to escape the period (`.`) character in the `exclude_url` string. In regular expressions, the period (`.`) is a special character that matches any single character. By replacing the periods with `r'\.'`, it alters the string so that the periods will be treated as literal characters and not as special regex characters. This is a common technique used to escape special characters in regular expressions to match them literally.
- `?!` is a negative lookahead assertion. It is used to assert that the following characters do not match the pattern inside the lookahead (inside the parentheses). In this case, `'+exclude_url.replace('.', r'\.')+'` is the pattern inside the negative lookahead.
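
The two ideas above, escaping the domain and excluding it with a negative lookahead, can be combined in one small pattern. This is a sketch rather than the book's expression: `external_link_pattern` is a hypothetical helper, and it uses the standard `re.escape` in place of the manual `replace('.', r'\.')`.

```python
import re


def external_link_pattern(domain):
    """Hypothetical helper: match http(s) URLs whose host is not `domain`
    or one of its subdomains."""
    escaped = re.escape(domain)  # "oreilly.com" -> "oreilly\.com"
    return re.compile(
        r"^https?://"                      # absolute http or https URL
        r"(?!(?:[^/]*\.)?" + escaped +     # host must NOT be (sub.)domain ...
        r"(?:[/:?#]|$))"                   # ... followed by the end of the host
    )


pattern = external_link_pattern("oreilly.com")
for url in [
    "https://twitter.com/oreillymedia",
    "https://www.oreilly.com/online-learning/features.html",
    "https://oreilly.com",
    "https://oreillyXcom/",
]:
    print(url, bool(pattern.search(url)))
```

The last URL is reported as external only because the period is escaped; with an unescaped `.` it would match any character and the host would be mistaken for `oreilly.com`.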

In [4]:
exclude_url = 'https://www.youtube.com/user/OreillyMedia'

domain = utils.urlparse(exclude_url).netloc
# we are not sure whether there is 'www' in the .netloc
subdomain = 'www'
print(domain.split('.')[0])
domain.split()

www


['www.youtube.com']

In [9]:
lst = ['ab', 'bc', 'cd', 'de']

lst[-2:]

['cd', 'de']
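
The `[-2:]` slice is the step `get_domain` uses to reduce a host name to its last two labels. The sketch below isolates that heuristic with the standard library (the function name `last_two_labels` is an assumption) and shows one host from the crawl output above where it is too coarse.

```python
from urllib.parse import urlsplit


def last_two_labels(url):
    """Hypothetical helper: keep only the last two dot-separated host labels."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(last_two_labels("https://www.oreilly.com/online-learning/features.html"))
print(last_two_labels("https://www.oreilly.co.jp/index.shtml"))
```

The first call yields `oreilly.com`, but the second yields `co.jp`, so every `.co.jp` site would count as the same domain. Handling multi-part suffixes correctly generally requires a list of public suffixes, which this sketch does not attempt.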

# <b>4. Web Crawling Models</b>

This chapter focuses primarily on web crawlers that collect a limited number of “types” of data (such as restaurant reviews, news articles, company profiles) from a variety of websites, and that store these data types as Python objects that read and write from a database.

# 4.1 Planning and Defining Objects

_This is the topic of the database design. See "many_to_many" model and the JSON usage in Postgres_ - VR

It can be tempting, when faced with a new project, to dive in and start writing Python to scrape websites immediately. The data model, left as an afterthought, often becomes strongly influenced by the availability and format of the data on the first website you scrape.

However, the data model is the underlying foundation of all the code that uses it. A poor decision in your model can easily lead to problems writing and maintaining code down the line, or difficulty in extracting and efficiently using the resulting data. Especially when dealing with a variety of websites—both known and unknown—it becomes vital to give serious thought and planning to what, exactly, you need to collect and how you need to store it.

One common trap of web scraping is defining the data that you want to collect based entirely on what’s available in front of your eyes. Simply adding attributes to your product type every time you see a new piece of information on a website will lead to far too many fields to keep track of. Not only that, but every time you scrape a new website, you’ll be forced to perform a detailed analysis of the fields the website has and the fields you’ve accumulated so far, and potentially add new fields (modifying your Python object type and your database structure). This will result in a messy and difficult-to-read dataset that may lead to problems using it.

One of the best things you can do when deciding which data to collect is often to ignore the websites altogether. You don’t start a project that’s designed to be large and scalable by looking at a single website and saying, “What exists?” but by saying, 

> **“What do I need?”**

and then finding ways to seek the information that you need from there.

Perhaps what you really want to do is compare product prices among multiple stores and track those product prices over time. In this case, you need enough information to uniquely identify the product, and that’s it:
- Product title
- Manufacturer
- Product ID number (if available/relevant)

It’s important to note that none of this information is specific to a particular store. For instance, product reviews, ratings, price, and even description are specific to the instance of that product at a particular store. That can be stored separately.

Other information (colors the product comes in, what it’s made of) is specific to the product, but may be sparse: it’s not applicable to every product. It’s important to take a step back, run through a checklist for each item you consider, and ask yourself the following questions:
- Will this information help with the project goals? Will it be a roadblock if I don’t have it, or is it just “nice to have” but won’t ultimately impact anything?
- If it might help in the future, but I’m unsure, how difficult will it be to go back and collect the data at a later time?
- Is this data redundant to data I’ve already collected?
- Does it make logical sense to store the data within this particular object? (As mentioned before, storing a description in a product doesn’t make sense if that description changes from site to site for the same product.)

If you do decide that you need to collect the data, it’s important to ask a few more questions to then decide how to store and handle it in code:
- Is this data sparse or dense? Will it be relevant and populated in every listing, or just a handful out of the set?
- How large is the data? 
- Especially in the case of large data, will I need to regularly retrieve it every time I run my analysis, or only on occasion? 
- How variable is this type of data? Will I regularly need to add new attributes, modify types (such as fabric patterns, which may be added frequently), or is it set in stone (shoe sizes)? 

Let’s say you plan to do some meta analysis around product attributes and prices: for example, the number of pages a book has, or the type of fabric a piece of clothing is made of, and potentially other attributes in the future, correlated to price. You run through the questions and realize that this data is sparse (relatively few products have any one of the attributes), and that you may decide to add or remove attributes frequently. In this case, it may make sense to create a product type that looks like this:
- Product title
- Manufacturer
- Product ID number (if available/relevant)
- Attributes (optional list or dictionary)

And an attribute type that looks like this:
- Attribute name
- Attribute value

This allows you to flexibly add new product attributes over time, without requiring you to redesign your data schema or rewrite code. When deciding how to store these attributes in the database, you can 
- write JSON to the attribute field, or 
- store each attribute in a separate table with a product ID. 
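
The product type and both storage options can be sketched with a dataclass. Everything here is illustrative: the class and method names, the field types, and the sample values are assumptions, and no database is involved.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Product:
    title: str
    manufacturer: str
    product_id: Optional[str] = None                            # if available/relevant
    attributes: Dict[str, str] = field(default_factory=dict)    # sparse, optional

    def attributes_json(self):
        """Option 1: serialize all attributes into one JSON field."""
        return json.dumps(self.attributes, sort_keys=True)

    def attribute_rows(self):
        """Option 2: one (product_id, name, value) row per attribute."""
        return [(self.product_id, name, value)
                for name, value in sorted(self.attributes.items())]


# Hypothetical sample data.
book = Product("Example Book", "Example Press", "sku-001",
               {"pages": "300", "binding": "paperback"})
print(book.attributes_json())
print(book.attribute_rows())
```

Adding a new attribute only adds a dictionary key, so neither the class nor the table layout has to change.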

See **Chapter 6** for more information about implementing these types of database models.

You can apply the preceding questions to the other information you’ll need to store as well. 

# 4.2 Dealing with Different Website Layouts

One of the most impressive feats of a search engine such as Google is that it manages to extract relevant and useful data from a variety of websites, having no upfront knowledge about the website structure itself.

Fortunately, in most cases of web crawling, you’re not looking to collect data from sites you’ve never seen before, but from a few, or a few dozen, websites that are pre-selected by a human. This means that you don’t need to use complicated algorithms or machine learning to detect which text on the page “looks most like a title” or which is probably the “main content.” You can determine what these elements are manually.

The most obvious approach is to 

> write a separate web crawler or page parser for each website. 

Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.

The following is an example of a `Content` class (representing a piece of content on a website, such as a news article) and two scraper functions that take in a URL, fetch the page as a BeautifulSoup object, and return an instance of `Content` (slightly modified from the original - VR):

In [35]:
import requests
from bs4 import BeautifulSoup


def main():

    url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
    brook_content = scrape_brookings(url)
    brook_content.print()

    url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
    ny_content = scrape_nytimes(url)
    ny_content.print()

    return 0


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
    def print(self):
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY: {}'.format(self.body), '\n')


def get_page(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'lxml')


def scrape_nytimes(url):
    bs = get_page(url)
    # get title: find() returns None when the page has no <h1>
    h1 = bs.find("h1")
    title = h1.text if h1 is not None else "Not found"
    # get body: find_all() returns an empty list when nothing matches
    lines = bs.find_all("p", {"class": "story-content"})
    body = '\n'.join(line.text for line in lines) or "Not found"
    return Content(url, title, body)


def scrape_brookings(url):
    bs = get_page(url)
    # get title: find() returns None when the page has no <h1>
    h1 = bs.find("h1")
    title = h1.text if h1 is not None else "Not found"
    # get body: keep only paragraphs that carry no class attribute
    ps = bs.find_all("p")
    body = '\n'.join(p.text for p in ps if "class" not in p.attrs) or "Not found"
    return Content(url, title, body)


if __name__ == "__main__":
    main()

URL: https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/
TITLE: Delivering inclusive urban access: 3 uncomfortable truths
BODY: The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.
But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel times, rising housing costs, and unsustainable public spending. Moreover, policymakers are questioning traditional policies and approaches.
The past couple years, we’ve led a p

In [26]:
url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
p = get_page(url)
print(p)

<html><head><title>nytimes.com</title><style>#cmsg{animation: A 1.5s;}@keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style></head><body style="margin:0"><p id="cmsg">Please enable JS and disable any ad blocker</p><script data-cfasync="false">var dd={'rt':'c','cid':'AHrlqAAAAAMAdQ98v2KMCEIAWftA_Q==','hsh':'499AE34129FA4E4FABC31582C3075D','t':'bv','s':17439,'e':'c699ef4994123637bd31af7d0e15a42cf985a302cf60dc33d8f7fcf1de3df8d0','host':'geo.captcha-delivery.com'}</script><script data-cfasync="false" src="https://ct.captcha-delivery.com/c.js"></script></body></html>



The only real site-dependent variables here are the CSS selectors used to obtain each piece of information. BeautifulSoup’s `find`  and `find_all` functions take in two arguments — a `tag` string and a dictionary of key/value `attributes` — so you can pass these arguments in as parameters that define the structure of the site itself and the location of the target data.

To make things even more convenient, rather than dealing with all of these tag arguments and key/value pairs, you can use the BeautifulSoup `select` function with a single string CSS selector for each piece of information you want to collect and put all of these selectors in a dictionary object.

Note that the `Website` class does not store information collected from the individual pages themselves, but stores instructions about how to collect that data. It doesn’t store the title “My Page Title.” It simply stores the string tag `h1` that indicates where the titles can be found. This is why the class is called `Website` (the information here pertains to the entire website) and not `Content` (which contains information from just a single page).

Using these `Content` and `Website` classes you can then write a `Crawler` to scrape the title and content of any URL that is provided for a given web page from a given website.

In [83]:
import requests
from bs4 import BeautifulSoup


class Website:

    def __init__(self, name, url, title_tag, body_tag):
        self.name = name
        self.url = url
        self.title_tag = title_tag
        self.body_tag = body_tag


class Content:

    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print('URL: {}'.format(self.url))
        print('TITLE: {}'.format(self.title))
        print('BODY: {}'.format(self.body), '\n')


class Crawler:

    def get_page(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        print(req)
        return BeautifulSoup(req.text, 'lxml')

    def safe_get(self, page_obj, selector):
        """
        Utility function used to get a content string from a
        Beautiful Soup object and a selector. Returns an empty
        string if no object is found for the given selector
        """
        selected_elements = page_obj.select(selector)
        if selected_elements is not None and len(selected_elements) > 0:
            return '\n'.join([elem.get_text() for elem in selected_elements])
        return ''

    def parse(self, site, url):
        """
        Extract content from a given page URL
        """
        bs = self.get_page(url)
        if bs is not None:
            title = self.safe_get(bs, site.title_tag)
            body = self.safe_get(bs, site.body_tag)
            if title != '' and body != '':
                content = Content(url, title, body)
                content.print()
            else:
                print("Something is wrong", url)

In [84]:
def main():

    crawler = Crawler()
    site_data = [
        ['O\'Reilly Media', 'http://oreilly.com', 'h1', 'span'],
        ['Reuters', 'http://reuters.com', 'h1', 'div.article-body__content__17Yit'],
        ['Brookings', 'http://www.brookings.edu', 'h1', 'p'],
        ['New York Times', 'http://nytimes.com', 'h1', 'p.story-content']
    ]

    websites = []
    for row in site_data:
        websites.append(Website(*row))
    print(websites)

    crawler.parse(websites[0], 'http://shop.oreilly.com/product/0636920028154.do')
    crawler.parse(websites[1], 'http://www.reuters.com/article/us-usa-epa-pruitt-idUSKBN19W2D0')
    crawler.parse(websites[2], 'https://www.brookings.edu/blog/techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
    crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/28/business/energy-environment/oil-boom.html')

    return 0


if __name__ == "__main__":
    main()

[<__main__.Website object at 0x7f78493b71d0>, <__main__.Website object at 0x7f78493b7d40>, <__main__.Website object at 0x7f78493b4080>, <__main__.Website object at 0x7f78493b7680>]
<Response [200]>
URL: http://shop.oreilly.com/product/0636920028154.do
TITLE: Learning Python, 5th Edition
BODY: Skip to main content




and more.
Mark Lutz
free trial.
200 top publishers.
your
Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages.Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn som

While this new method might not seem remarkably simpler than writing a new Python function for each new website at first glance, imagine what happens when you go from a system with 4 website sources to a system with 20 or 200 sources.

Each list of strings is relatively easy to write. It doesn’t take up much space. It can be loaded from a database or a CSV file. It can be imported from a remote source or handed off to a nonprogrammer with some frontend experience, who can fill it out and add new websites without ever looking at a line of code.
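
As a rough sketch of the CSV option, assuming a file whose header row uses the same names as the `Website` constructor arguments: the `SITES_CSV` text and the `load_websites` helper are hypothetical and stand in for a real file on disk.

```python
import csv
import io


class Website:
    """Same shape as the Website class above: selectors, not page content."""

    def __init__(self, name, url, title_tag, body_tag):
        self.name = name
        self.url = url
        self.title_tag = title_tag
        self.body_tag = body_tag


# Stand-in for a CSV file that a nonprogrammer could maintain
SITES_CSV = """name,url,title_tag,body_tag
O'Reilly Media,http://oreilly.com,h1,span
Brookings,http://www.brookings.edu,h1,p
"""


def load_websites(csv_file):
    """Build one Website per CSV row, using the header row for field names."""
    return [Website(**row) for row in csv.DictReader(csv_file)]


websites = load_websites(io.StringIO(SITES_CSV))
for site in websites:
    print(site.name, site.title_tag, site.body_tag)
```

Adding a new source then means adding one row to the file rather than writing a new function.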

Of course, the downside is that you are giving up a certain amount of flexibility. In the first example, each website gets its own free-form function to select and parse HTML however necessary, in order to get the end result. In the second example, each website needs to have a certain structure in which fields are guaranteed to exist, data must be clean coming out of the field, and each target field must have a unique and reliable CSS selector.

However, I believe that the power and relative flexibility of this approach more than makes up for its real or perceived shortcomings.

# 4.3 Structuring Crawlers

Creating flexible and modifiable website layout types doesn’t do much good if you still have to locate each link you want to scrape by hand. The previous chapter showed various methods of crawling through websites and finding new pages in an automated way.

This section shows how to incorporate these methods into a well-structured and expandable website crawler that can gather links and discover data in an automated way. I present just three basic web crawler structures here, although I believe that they apply to the majority of situations that you will likely need when crawling sites in the wild, perhaps with a few modifications here and there.

## Crawling Sites Through Search

One of the easiest ways to crawl a website is via the same method that humans do: using the **search bar**. Although the process of searching a website for a keyword or topic and collecting a list of search results may seem like a task with a lot of variability from site to site, several key points make this surprisingly trivial:

- Most sites retrieve a list of search results for a particular topic by passing that topic as a string through a parameter in the URL. For example: `http://example.com?search=myTopic`. The first part of this URL can be saved as a property of the Website object, and the topic can simply be appended to it.
- After searching, most sites present the resulting pages as an easily identifiable list of links, usually with a convenient surrounding tag such as `<span class="result">`, the exact format of which can also be stored as a property of the Website object.
- Each result link is either a relative URL (e.g., `/articles/page.html`) or an absolute URL (e.g., `http://example.com/articles/page.html`). Whether or not you are expecting an absolute or relative URL can be stored as a property of the Website object.
- After you’ve located and normalized the URLs on the search page, you’ve successfully reduced the problem to the example in the previous section — extracting data from a page, given a website format.
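
The URL handling in the points above can be sketched with the standard library. This is one possible approach, not the one used in the listings that follow, which simply concatenate strings; the helper names `build_search_url` and `normalize_result_url` are hypothetical.

```python
from urllib.parse import quote_plus, urljoin


def build_search_url(search_url, topic):
    """Append the topic to the site's search URL, escaping spaces and symbols."""
    return search_url + quote_plus(topic)


def normalize_result_url(site_url, href, absolute_url):
    """Return a full URL for a search result link."""
    if absolute_url:
        return href
    # urljoin resolves a relative path against the site's base URL
    return urljoin(site_url, href)


print(build_search_url("http://example.com?search=", "data science"))
print(normalize_result_url("http://example.com", "/articles/page.html", False))
```

Escaping the topic matters for multiword topics such as "data science", where a raw space in the URL may not be handled the same way by every server.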

Let’s look at an implementation of this algorithm in code. The `Content` class is much the same as in previous examples. You are adding the `topic` property to keep track of which search topic the content was found for:

In [2]:
class Content:
    """Common base class for all articles/pages"""

    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print("New article found for topic: {}".format(self.topic))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
        print("URL: {}".format(self.url))

The `Website` class has a few new properties added to it. 
- `search_url` defines where you should go to get search results if you append the topic you are looking for. 
- `result_listing` defines the “box” that holds information about each result.
- `result_url` defines the tag inside this box that will give you the exact URL for the result (this URL may not work properly - VR). 
- `absolute_url` is a boolean that tells you whether these search results are absolute or relative URLs.

In [3]:
class Website:
    """Contains information about website structure"""

    def __init__(
        self,
        name,
        url,
        search_url,
        result_listing,
        result_url,
        absolute_url,
        title_tag,
        body_tag,
    ):
        self.name = name
        self.url = url
        self.search_url = search_url
        self.result_listing = result_listing
        self.result_url = result_url
        self.absolute_url = absolute_url
        self.title_tag = title_tag
        self.body_tag = body_tag


`crawler.py` has been expanded a bit and contains 
- our `Website` data, 
- a list of topics to search for, and 
- two loops that iterate through all the topics and all the websites. 

It also contains a search function that navigates to the search page for a particular website and topic, and extracts all the result URLs listed on that page.

In [13]:
import requests
from bs4 import BeautifulSoup


class Crawler:
    def get_page(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, "lxml")

    def safe_get(self, page_obj, selector):
        child_obj = page_obj.select(selector)
        if child_obj is not None and len(child_obj) > 0:
            return child_obj[0].get_text()
        return ""

    def search(self, topic, site):
        """
        Searches a given website for a given topic and records all pages found
        """
        bs = self.get_page(site.search_url + topic)
        if bs is None:
            print("Could not load the search page. Skipping!")
            return
        search_results = bs.select(site.result_listing)
        for result in search_results:
            links = result.select(site.result_url)
            if not links:
                continue
            url = links[0].get("href")
            # check to see whether it's a relative or an absolute url
            if not site.absolute_url:
                url = site.url + url
            page = self.get_page(url)
            if page is None:
                print("Something was wrong with that page or URL. Skipping!")
                continue
            title = self.safe_get(page, site.title_tag)
            body = self.safe_get(page, site.body_tag)
            if title != "" and body != "":
                # argument order must match Content(topic, url, title, body)
                content = Content(topic, url, title, body)
                content.print()


In [14]:
crawler = Crawler()

site_data = [
    [
        "O'Reilly Media",
        "http://oreilly.com",
        "https://ssearch.oreilly.com/?q=",
        "article.product-result",
        "p.title a",
        True,
        "h1",
        "span",
    ],
    [
        "Reuters",
        "http://reuters.com",
        "http://www.reuters.com/search/news?blob=",
        "div.search-result-content",
        "h3.search-result-title a",
        False,
        "h1",
        "div.article-body__content__17Yit",
    ],
    [
        "Brookings",
        "http://www.brookings.edu",
        "https://www.brookings.edu/search/?s=",
        "div.list-content article",
        "h4.title a",
        True,
        "h1",
        "p",
    ],
]


sites = []
for row in site_data:
    sites.append(Website(*row))

topics = ["python", "data science"]
for topic in topics:
    print("GETTING INFO ABOUT:", topic)
    for target_site in sites:
        crawler.search(topic, target_site)

GETTING INFO ABOUT: python
GETTING INFO ABOUT: data science


This script loops through all the topics in the topics list and announces before it starts scraping for a topic:
```sh
GETTING INFO ABOUT: python
```

Then it loops through all of the sites in the sites list and crawls each particular site for each particular topic. Each time that it successfully scrapes information about a page, it prints it to the console:

```
New article found for topic: python
TITLE: Page Title Here
BODY:
Body content is here
URL: http://example.com/examplepage.html
```

Note that it loops through all topics and then loops through all websites in the inner loop. Why not do it the other way around, collecting all topics from one website, and then all topics from the next website? Looping through all topics first is a way to 

> more evenly distribute the load placed on any one web server. 

This is especially important if you have a list of hundreds of topics and dozens of websites. You’re not making tens of thousands of requests to one website at once; you’re making 10 requests, waiting a few minutes, making another 10 requests, waiting a few minutes, and so forth.

Although the number of requests is ultimately the same either way, it’s generally better to distribute these requests over time as much as is reasonable. Paying attention to how your loops are structured is an easy way to do this.
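
The effect of loop order can be illustrated without making any requests. The sketch below only builds the two request orders and measures the longest run of consecutive requests to one site; the site names and the `longest_burst` helper are hypothetical.

```python
topics = ["python", "data science"]
sites = ["site-a.example", "site-b.example", "site-c.example"]  # hypothetical

# Topics in the outer loop: consecutive requests go to different sites
topic_first = [(topic, site) for topic in topics for site in sites]

# Sites in the outer loop: each site receives all of its requests in a burst
site_first = [(topic, site) for site in sites for topic in topics]


def longest_burst(order):
    """Length of the longest run of consecutive requests to the same site."""
    longest = run = 0
    previous = None
    for _, site in order:
        run = run + 1 if site == previous else 1
        longest = max(longest, run)
        previous = site
    return longest


print(longest_burst(topic_first))
print(longest_burst(site_first))
```

With topics in the outer loop the burst length stays at 1 regardless of how many topics there are, while with sites in the outer loop it grows with the number of topics.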

## Crawling Sites Through Links

The previous chapter covered some ways of identifying internal and external links on web pages and then using those links to crawl across the site. In this section, you’ll combine those same basic methods into a more flexible website crawler that can follow any link matching a specific URL pattern.

This type of crawler works well for projects when you want to gather all the data from a site — not just data from a specific search result or page listing. It also works well when the site’s pages may be disorganized or widely dispersed.

These types of crawlers don’t require a structured method of locating links, as in the previous section on crawling through search pages, so the attributes that describe the search page aren’t required in the `Website` object. However, because the crawler isn’t given specific instructions for the locations/positions of the links it’s looking for, you do need some rules to tell it what sorts of pages to select. You provide a `target_pattern` (a regular expression for the target URLs) and keep the boolean `absolute_url` variable to accomplish this:

In [15]:
class Website:
    """Contains information about website structure"""

    def __init__(self, name, url, target_pattern, absolute_url, title_tag, body_tag):
        self.name = name
        self.url = url
        self.target_pattern = target_pattern
        self.absolute_url = absolute_url
        self.title_tag = title_tag
        self.body_tag = body_tag


class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

    def print(self):
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))

The `Content` class is the same one used in the first crawler example.

The `Crawler` class is written to start from the home page of each site, locate internal links, and parse the content from each internal link found:

In [17]:
import re

import requests
from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, site):
        self.site = site
        self.visited = []

    def get_page(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text, "lxml")

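    def safe_get(self, page_obj, selector):
        # select() returns a list of every element matching the CSS selector
        selected_elems = page_obj.select(selector)
        if selected_elems is not None and len(selected_elems) > 0:
            return "\n".join([elem.get_text() for elem in selected_elems])
        return ""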

    def parse(self, url):
        bs = self.get_page(url)
        if bs is not None:
            title = self.safe_get(bs, self.site.title_tag)
            body = self.safe_get(bs, self.site.body_tag)
            if title != "" and body != "":
                content = Content(url, title, body)
                content.print()

    def crawl(self):
        """
        Get pages from website home page
        """
        bs = self.get_page(self.site.url)
        if bs is None:
            return
        # the Beautiful Soup method is find_all; bs.findall evaluates to None
        target_pages = bs.find_all("a", href=re.compile(self.site.target_pattern))
        for target_page in target_pages:
            target_page = target_page.get("href")
            if target_page not in self.visited:
                self.visited.append(target_page)
                if not self.site.absolute_url:
                    target_page = "{}{}".format(self.site.url, target_page)
                self.parse(target_page)

In [18]:
reuters = Website(
    "Reuters",
    "https://www.reuters.com",
    "^(/article/)",
    False,
    "h1",
    "div.StandardArticleBody_body_1gnLA",
)
crawler = Crawler(reuters)
crawler.crawl()

Another change here that was not used in previous examples: the `Website` object (in this case, the variable `reuters`) is a property of the `Crawler` object itself. This works well to store the visited pages (`visited`) in the `crawler`, but means that a new `crawler` must be instantiated for each website rather than reusing the same one to crawl a list of websites.

Whether you choose to make a `crawler` website-agnostic or choose to make the `website` an attribute of the `crawler` is a design decision that you must weigh in the context of your own specific needs. Either approach is generally fine.

Another thing to note is that this `crawler` will get the pages from the home page, but will not continue crawling after all those pages have been logged. You may want to write a crawler incorporating one of the patterns in **Chapter 3** and have it look for more targets on each page it visits. You can even follow all the URLs on each page (not just ones matching the target pattern) to look for URLs containing the target pattern.
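
One way to keep crawling past the home page is a queue of pages still to visit. The sketch below runs on a small in-memory link graph instead of a live site, so `LINKS` and its paths are invented for illustration; a real crawler would fetch each page and extract its links at the marked step.

```python
import re
from collections import deque

# Hypothetical site: each path maps to the links found on that page
LINKS = {
    "/": ["/about", "/article/one"],
    "/about": ["/article/two", "/"],
    "/article/one": ["/article/two"],
    "/article/two": ["/article/one", "/contact"],
    "/contact": [],
}


def crawl(start, target_pattern):
    """Follow every internal link breadth-first; collect pages matching the pattern."""
    target = re.compile(target_pattern)
    visited = {start}
    queue = deque([start])
    found = []
    while queue:
        page = queue.popleft()
        if target.match(page):
            found.append(page)  # a real crawler would parse the page here
        for link in LINKS.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return found


print(crawl("/", "^(/article/)"))
```

Note that `/article/two` is reached only through `/about`, a page that does not itself match the target pattern, which is why following all links can find more targets.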

## Crawling Multiple Page Types

Unlike crawling through a predetermined set of pages, crawling through all internal links on a website can present a challenge in that you never know exactly what you’re getting. Fortunately, there are a few basic ways to identify the page type:

- By the URL
    - All blog posts on a website might share a URL pattern (`http://example.com/blog/title-of-post`, for example).
- By the presence or lack of certain fields on a site
    - If a page has a date, but no author name, you might categorize it as a press release. If it has a title, main image, price, but no main content, it might be a product page. 
- By the presence of certain tags on the page to identify the page
    - You can take advantage of tags even if you’re not collecting the data within the tags. Your crawler might look for an element such as `<div id="related-products">` to identify the page as a product page, even though the crawler is not interested in the content of the related products. 
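
The first of these approaches, identifying a page by its URL, might look like the following sketch. The patterns and the type labels are assumptions chosen for the example, not rules from the original text.

```python
import re

# Hypothetical URL patterns, checked in order; the first match wins
PAGE_TYPE_PATTERNS = [
    ("blog_post", re.compile(r"^https?://[^/]+/blog/")),
    ("product", re.compile(r"^https?://[^/]+/products?/")),
]


def page_type_from_url(url):
    """Return a page type label based on the URL alone, or 'unknown'."""
    for page_type, pattern in PAGE_TYPE_PATTERNS:
        if pattern.match(url):
            return page_type
    return "unknown"


print(page_type_from_url("http://example.com/blog/title-of-post"))
print(page_type_from_url("http://example.com/about"))
```

Pages that come back as `unknown` are candidates for the other two checks, which look at the fields and tags on the page itself.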

To keep track of multiple page types, you need to have multiple types of page objects in Python. This can be done in two ways:

- If the pages are all similar (they all have basically the same types of content), you may want to add a `page_type` attribute to your existing web-page object:

In [19]:
class Website:
    """Contains information about website structure, including the page type"""

    def __init__(
        self,
        name,
        url,
        search_url,
        result_listing,
        result_url,
        absolute_url,
        title_tag,
        body_tag,
        page_type,
    ):
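        self.name = name
        self.url = url
        self.search_url = search_url
        self.result_listing = result_listing
        self.result_url = result_url
        self.absolute_url = absolute_url
        self.title_tag = title_tag
        self.body_tag = body_tag
        self.page_type = page_type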

If you’re storing these pages in an SQL-like database, this type of pattern indicates that all these pages would probably be stored in the same table, and that an extra `page_type` column would be added.
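
A single-table layout of this kind can be tried out with the standard library’s `sqlite3` module. The table name, column names, and sample rows below are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    """
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        url TEXT NOT NULL,
        title TEXT,
        body TEXT,
        page_type TEXT NOT NULL
    )
    """
)
rows = [
    ("http://example.com/blog/a", "Post A", "Some text", "blog_post"),
    ("http://example.com/news/b", "Release B", "More text", "press_release"),
    ("http://example.com/blog/c", "Post C", "Other text", "blog_post"),
]
conn.executemany(
    "INSERT INTO pages (url, title, body, page_type) VALUES (?, ?, ?, ?)", rows
)

# The page_type column lets one table serve every kind of page
blog_titles = [
    title
    for (title,) in conn.execute(
        "SELECT title FROM pages WHERE page_type = ? ORDER BY id", ("blog_post",)
    )
]
print(blog_titles)
```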

If the pages/content you’re scraping are different enough from each other (they contain different types of fields), this may warrant creating new objects for each page type. Of course, some things will be common to all web pages — they will all have a URL, and will likely also have a name or page title. This is an ideal situation in which to use subclasses. 

The `Webpage` base class below is not an object that will be used directly by your crawler, but one that will be extended by your page types:

In [20]:
class Webpage:
    """Common base class for all articles/pages"""

    def __init__(self, name, url, title_tag):
        self.name = name
        self.url = url
        self.title_tag = title_tag


class Product(Webpage):
    """Contains information for scraping a product page"""

    def __init__(self, name, url, title_tag, product_number_tag, price_tag):
        super().__init__(name, url, title_tag)
        self.product_number_tag = product_number_tag
        self.price_tag = price_tag


class Article(Webpage):
    """Contains information for scraping an article page"""

    def __init__(self, name, url, title_tag, body_tag, date_tag):
        super().__init__(name, url, title_tag)
        self.body_tag = body_tag
        self.date_tag = date_tag

The `Product` class extends the `Webpage` base class and adds the attributes `product_number_tag` and `price_tag`, which apply only to products; the `Article` class adds the attributes `body_tag` and `date_tag`, which don’t apply to products.

You can use these two classes to scrape, for example, a store website that might contain blog posts or press releases in addition to products.

# 4.4 Thinking About Web Crawler Models

Collecting information from the internet can be like drinking from a fire hose. There’s a lot of stuff out there, and it’s not always clear what you need or how you need it. The first step of any large web scraping project (and even some of the small ones) should be to answer these questions.

When collecting similar data across multiple domains or from multiple sources, your goal should almost always be to try to **normalize** it. Dealing with data with identical and comparable fields is much easier than dealing with data that is completely dependent on the format of its original source.
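
As a small illustration of normalizing, the sketch below maps records from two made-up sources onto one shared set of fields. The source names, field names, and the `normalize` helper are all hypothetical.

```python
# Hypothetical raw records, each shaped by the site it came from
raw_store_a = {"name": "Example Book", "cost_usd": "29.99"}
raw_store_b = {"productTitle": "Example Shirt", "price": "$15.00"}

# Per-source mapping from source field name to the shared field name
FIELD_MAPS = {
    "store_a": {"name": "title", "cost_usd": "price"},
    "store_b": {"productTitle": "title", "price": "price"},
}


def normalize(source, record):
    """Rename fields to the shared schema and convert the price to a float."""
    mapping = FIELD_MAPS[source]
    item = {mapping[key]: value for key, value in record.items() if key in mapping}
    item["price"] = float(item["price"].lstrip("$"))
    item["source"] = source
    return item


print(normalize("store_a", raw_store_a))
print(normalize("store_b", raw_store_b))
```

Once both records share the `title` and `price` fields, they can be compared or stored together without any source-specific handling.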

In many cases, you should build scrapers under the assumption that more sources of data will be added to them in the future, and with the goal to minimize the programming overhead required to add these new sources. Even if a website doesn’t appear to fit your model at first glance, there may be more subtle ways that it does conform. Being able to see these underlying patterns can save you time, money, and a lot of headaches in the long run.

The connections between pieces of data should also not be ignored. Are you looking for information that has properties such as “type,” “size,” or “topic” that span across data sources? How do you store, retrieve, and conceptualize these attributes?

Software architecture is a broad and important topic that can take an entire career to master. Fortunately, software architecture for web scraping is a much more finite and manageable set of skills that can be relatively easily acquired. As you continue to scrape data, you will likely find the same basic patterns occurring over and over. Creating a well-structured web scraper doesn’t require a lot of arcane knowledge, but it does require taking a moment to step back and think about your project.

# <b>5. Scrapy</b>

You should work in a virtual environment.

```sh
. /venv/bin/activate
(venv) pip install --upgrade pip
(venv) pip install scrapy
```

> Note: Python 3.12 does not contain all the needed dependencies; use Python 3.11 (as of Feb. 4, 2024).

|Command|Description|
|-|-|
|`scrapy startproject <name>`|start a new Scrapy project|
|`scrapy genspider <spider_name> <domain>`|generate a spider in the `spiders` dir|
|`scrapy runspider <spider_file>.py`|run the spider defined in the given file|
|**IPython**||
|`scrapy.cfg`|add the line `shell = ipython` under `[settings]`|
|`scrapy shell`|start the Scrapy shell (in IPython, if configured)|

The previous chapter presented some techniques and patterns for building large, scalable, and (most important!) maintainable web crawlers. Although this is easy enough to do by hand, many libraries, frameworks, and even GUI-based tools will do this for you, or at least try to make your life a little easier.

This chapter introduces one of the best frameworks for developing crawlers: [Scrapy](https://scrapy.org/). 

One of the challenges of writing web crawlers is that you’re often performing the same tasks again and again: 
- find all links on a page, 
- evaluate the difference between internal and external links, 
- go to new pages. 

These basic patterns are useful to know and to be able to write from scratch, but the Scrapy library handles many of these details for you.

Of course, Scrapy isn’t a mind reader. You still need to define page templates, give it locations to start scraping from, and define URL patterns for the pages that you’re looking for. But in these cases, it provides a clean framework to keep your code organized.

# 5.1 Initializing a New Spider

Once you’ve installed the Scrapy framework, a small amount of setup needs to be done for each spider. A `spider` is a Scrapy project that, like its arachnid namesake, is designed to crawl webs. Throughout this chapter, I use **“spider”** to describe a Scrapy project in particular, and **“crawler”** to mean “any generic program that crawls the web, using Scrapy or not.”

To create a new spider in the current directory, run the following from the command line:

```sh
$ scrapy startproject test1
```
```
New Scrapy project 'test1', using template directory '/home/username/venv/venv3.12/lib/python3.12/site-packages/scrapy/templates/project', created in:
    /home/username/web_scraping/data/test1

You can start your first spider with:
    cd test1
    scrapy genspider example example.com
```

This creates a new subdirectory in the directory the project was created in, with the title `test1`. Inside this directory is the following file structure:

In [2]:
!tree ./data/test1/

./data/test1/
├── scrapy.cfg
└── test1
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files


These Python files are initialized with stub code to provide a fast means of creating a new spider project. Each section in this chapter works with this `test1` project.

# 5.2 Writing a Simple Scraper

To create a crawler, you will add a new file inside the spiders directory at `test1/test1/spiders/article.py`. In your newly created `article.py` file, write the following:

In [3]:
import scrapy


class ArticleSpider(scrapy.Spider):
    name = "article"

    def start_requests(self):
        urls = [
            "http://en.wikipedia.org/wiki/Python_%28programming_language%29",
            "https://en.wikipedia.org/wiki/Functional_programming",
            "https://en.wikipedia.org/wiki/Monty_Python",
        ]
        return [scrapy.Request(url=url, callback=self.parse) for url in urls]

    def parse(self, response):
        url = response.url
        title = response.css("h1::text").extract_first()
        print("URL is: {}".format(url))
        print("Title is: {}".format(title))

## Spider class

The name of this class (`ArticleSpider`) is different from the name of the project directory (`wikiSpider` in the book, `test1` here), indicating that this class in particular is responsible for spidering through only article pages, under the broader `test1` project, which you may later want to use to search for other page types.

For large sites with many types of content, you might have separate Scrapy items for each type (blog posts, press releases, articles, etc.), each with different fields, but all running under the same Scrapy project. 

> The name of each spider must be unique within the project.

## Spider methods

The other key things to notice about this spider are the two functions: 
- `start_requests` and 
- `parse`. 

`start_requests` is a Scrapy-defined entry point to the program used to generate Request objects that Scrapy uses to crawl the website.

`parse` is a callback function defined by the user, and is passed to the Request object with `callback=self.parse`. Later, you’ll look at more-powerful things that can be done with the parse function, but for now it prints the title of the page.

> _ChatGPT:_ A **callback function** in Python is a function that is passed as an argument to another function and is expected to be called at a later time. This allows the calling function to "call back" to the passed function once it completes its task. Callback functions are commonly used in event-driven programming, asynchronous programming, and in libraries and frameworks to handle specific events or conditions.
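
To make the callback idea concrete, here is a minimal sketch in plain Python that mimics the request/callback pattern without using Scrapy at all. `FakeRequest`, `FakeResponse`, and `run` are hypothetical stand-ins invented for this illustration; they are not Scrapy classes, and no network access happens.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a downloaded page (not a Scrapy class)."""
    url: str
    title: str


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: a URL plus the function to call later."""
    url: str
    callback: Callable[[FakeResponse], None]


def run(requests):
    """Toy 'engine': pretend to fetch each URL, then call back with the result."""
    for request in requests:
        # Fake a page title from the last path segment instead of downloading.
        title = request.url.rsplit("/", 1)[-1].replace("_", " ")
        request.callback(FakeResponse(url=request.url, title=title))


seen = []


def parse(response):
    # The callback: it runs only when the engine decides the response is ready.
    seen.append((response.url, response.title))
    print("Title is: {}".format(response.title))


run([FakeRequest("https://en.wikipedia.org/wiki/Monty_Python", callback=parse)])
```

The point of the pattern is that `parse` is handed over as an object and never called directly by the code that creates the request; the engine invokes it once there is something to parse.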

You can run this article spider by navigating to the `test1/test1/spiders` directory, where `article.py` was saved, and running:

```sh
$ scrapy runspider article.py
```

The default Scrapy output is fairly verbose. Along with debugging information, this should print out lines like the following:

2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/robots.txt> (referer: None)
2018-01-21 23:28:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://en.wikipedia.org/wiki/Python_%28programming_language%29> from <GET http://en.wikipedia.org/wiki/Python_%28programming_language%29>
2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Functional_programming> (referer: None)
URL is: https://en.wikipedia.org/wiki/Functional_programming
Title is: Functional programming
2018-01-21 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/Monty_Python> (referer: None)
URL is: https://en.wikipedia.org/wiki/Monty_Python
Title is: Monty Python

The scraper goes to the three pages listed in `start_requests`, gathers information, and then terminates.

# <b>2. Advanced HTML Parsing</b>
