# Web Scraping with Python Using Beautiful Soup

The internet is an absolutely massive source of data — data that we can access using web scraping and Python!

In fact, web scraping is often the only way we can access data. There is a lot of information out there that isn’t available in convenient CSV exports or easy-to-connect APIs. And websites themselves are often valuable sources of data — consider, for example, the kinds of analysis you could do if you could download every post on a web forum.

To access those sorts of on-page datasets, we’ll have to use web scraping.

## The Fundamentals of Web Scraping
### What is Web Scraping in Python?
Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don’t offer these convenient options.

If we wanted to analyze the data from a website, or download it for use in some other app, we wouldn’t want to painstakingly copy-paste everything. Web scraping is a technique that lets us use programming to do the heavy lifting.

### How Does Web Scraping Work?
When we scrape the web, we write code that sends a request to the server that’s hosting the page we specified. The server will return the source code — HTML, mostly — for the page (or pages) we requested.

So far, we’re essentially doing the same thing a web browser does — sending a server request with a specific URL and asking the server to return the code for that page.

But unlike a web browser, our web scraping code won’t interpret the page’s source code and display the page visually. Instead, we’ll write some custom code that filters through the page’s source code looking for specific elements we’ve specified, and extracting whatever content we’ve instructed it to extract.

For example, if we wanted to get all of the data from inside a table that was displayed on a web page, our code would be written to go through these steps in sequence:

- Request the content (source code) of a specific URL from the server

- Download the content that is returned

- Identify the elements of the page that are part of the table we want

- Extract and (if necessary) reformat those elements into a dataset we can analyze or use in whatever way we require.
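The last two steps can be sketched roughly as follows. This is a minimal illustration, not a complete scraper: the HTML table is a made-up string standing in for the content a server would return, so no request is sent, and the variable names are assumptions. Beautiful Soup itself is introduced later in this tutorial.

```python
from bs4 import BeautifulSoup

# Stand-in for downloaded page content (hypothetical example data)
html = """
<table>
  <tr><th>Fruit</th><th>Count</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Identify the elements of the page that are part of the table we want
table = soup.find("table")

# Extract the text of every cell, row by row, into a list of lists
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

The resulting list of lists can then be written to a CSV file or loaded into whatever analysis tool we prefer.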

## The Components of a Web Page


Before we start writing code, we need to understand a little bit about the structure of a web page. We’ll use the site’s structure to write code that gets us the data we want to scrape, so understanding that structure is an important first step for any web scraping project.

When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. These files will typically include:

- HTML — the main content of the page.

- CSS — used to add styling to make the page look nicer.

- JS — Javascript files add interactivity to web pages.

- Images — image formats, such as JPG and PNG, allow web pages to show pictures.


After our browser receives all the files, it renders the page and displays it to us.

There’s a lot that happens behind the scenes to render a page nicely, but we don’t need to worry about most of it when we’re web scraping. When we perform web scraping, we’re interested in the main content of the web page, so we look primarily at the HTML.

## HTML


HyperText Markup Language (HTML) is the language that web pages are created in. HTML isn’t a programming language, like Python, though. It’s a markup language that tells a browser how to display content.

HTML has many functions that are similar to what you might find in a word processor like Microsoft Word — it can make text bold, create paragraphs, and so on.

If you’re already familiar with HTML, feel free to jump to the next section of this tutorial. Otherwise, let’s take a quick tour through HTML so we know enough to scrape effectively.

HTML consists of elements called tags. The most basic tag is the `<html>` tag. This tag tells the web browser that everything inside of it is HTML. We can make a simple HTML document just using this tag:


```html
<html>
</html>
```


We haven’t added any content to our page yet, so if we viewed our HTML document in a web browser, we wouldn’t see anything.

Right inside an `html` tag, we can put two other tags: the `head` tag, and the `body` tag.

The main content of the web page goes into the body tag. The head tag contains data about the title of the page, and other information that generally isn’t useful in web scraping:

```html
<html>
<head>
</head>
<body>
</body>
</html>
```


We still haven’t added any content to our page (that goes inside the body tag), so if we open this HTML file in a browser, we still won’t see anything.

You may have noticed above that we put the head and body tags inside the html tag. In HTML, tags are nested, and can go inside other tags.

We’ll now add our first content to the page, inside a p tag. The p tag defines a paragraph, and any text inside the tag is shown as a separate paragraph:


```html
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
</p>
<p>
Here's a second paragraph of text!
</p>
</body>
</html>
```

Rendered in a browser, that HTML file will look like this:

Here’s a paragraph of text!

Here’s a second paragraph of text!

Tags have commonly used names that depend on their position in relation to other tags:

- child — a child is a tag inside another tag. So the two p tags above are both children of the body tag.

- parent — a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.

- sibling — a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they’re both inside html. Both p tags are siblings, since they’re both inside body.
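These relationships can also be explored in code. The sketch below uses Beautiful Soup (introduced later in this tutorial) on a compact copy of the example document; the HTML string and variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Compact copy of the example document, written without extra whitespace
html = "<html><head></head><body><p>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

body = soup.body
first_p = body.find("p")

# Children: the tags directly inside body (recursive=False skips deeper levels)
child_names = [child.name for child in body.find_all(True, recursive=False)]

# Parent: the tag that body is nested in
parent_name = body.parent.name

# Sibling: the next p tag that shares the same parent as the first p tag
sibling_text = first_p.find_next_sibling("p").get_text()

print(child_names, parent_name, sibling_text)
```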


We can also add properties (called attributes in HTML) to tags that change their behavior. Below, we’ll add some extra text and hyperlinks using the a tag.

```html
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.uni-osnabrueck.de/">Uni Osnabrueck</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org">Python</a> </p>
</body></html>
```


Here’s how this will look:

Here’s a paragraph of text! Uni Osnabrueck

Here’s a second paragraph of text! Python

In the above example, we added two a tags. a tags are links, and tell the browser to render a link to another web page. The href property of the tag determines where the link goes.

a and p are extremely common HTML tags. Here are a few others:

- div — indicates a division, or area, of the page.

- b — bolds any text inside.

- i — italicizes any text inside.

- table — creates a table.

- form — creates an input form.


For a full list of tags, visit [HTML Tags](https://www.w3schools.com/tags/default.asp).


Note: Even though there are lots of them, most of them are rarely used.


Before we move into actual web scraping, let’s learn about the `class` and `id` properties. These special properties give HTML elements names, and make them easier to interact with when we’re scraping.

One element can have multiple classes, and a class can be shared between elements. Each element can only have one id, and an id can only be used once on a page. Classes and ids are optional, and not all elements will have them.

We can add classes and ids to our example:

```html
<html>
<head>
</head>
<body>
<p>
Here's a paragraph of text!
<a href="https://www.uni-osnabrueck.de/" id="learn-link">Uni Osnabrueck</a>
</p>
<p>
Here's a second paragraph of text!
<a href="https://www.python.org" class="extra-large bold">Python</a> </p>
</body></html>
```

Here’s how this will look:

Here’s a paragraph of text! Uni Osnabrueck

Here’s a second paragraph of text! Python

As you can see, adding classes and ids doesn’t change how the tags are rendered at all. They only name the elements, so that CSS selectors can refer to them when deciding which tags a style applies to.
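When scraping, these properties can be read directly from a parsed tag. The following sketch assumes Beautiful Soup (introduced later in this tutorial) and uses a shortened copy of the example above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document
html = """
<p><a href="https://www.uni-osnabrueck.de/" id="learn-link">Uni Osnabrueck</a></p>
<p><a href="https://www.python.org" class="extra-large bold">Python</a></p>
"""
soup = BeautifulSoup(html, "html.parser")
first_link, second_link = soup.find_all("a")

# Attributes are read like dictionary entries
link_target = first_link["href"]
link_id = first_link["id"]

# class can hold several values, so it comes back as a list
link_classes = second_link["class"]

# .get() returns None instead of raising KeyError when an attribute is absent
missing_id = second_link.get("id")

print(link_target, link_id, link_classes, missing_id)
```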


## The requests library

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

The requests library will make a `GET` request to a web server, which will download the HTML contents of a given web page for us. There are several different types of requests we can make using requests, of which GET is just one.

Let’s try downloading a simple sample website:



In [1]:
import requests

url = "https://webscrapingtutorial.netlify.app/simple.html"
headers = {"Accept-Language": "en-US"}
# Headers are optional, but some websites choose the page language based on where you connect from.
# Setting Accept-Language to en-US asks the server for the English version of the page.

page = requests.get(url, headers=headers)
print(page.status_code)

200


A status_code of `200` means that the page downloaded successfully. We won’t fully dive into status codes here, but a status code starting with a 2 generally indicates success, and a code starting with a 4 or a 5 indicates an error.

[Check more on http status codes (optional)](https://http.cat)
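To make the grouping by first digit concrete, here is a small sketch built on the standard library's `http.HTTPStatus`. The helper `describe_status` is a hypothetical name invented for this example; it is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Hypothetical helper: label a status code by the range it falls in."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "Not Found"
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"


for code in (200, 404, 503):
    print(describe_status(code))
```

In a real scraper, calling `page.raise_for_status()` on the response from requests is a common shortcut: it raises an exception for 4xx and 5xx codes, so failed downloads are not parsed by accident.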


We can print out the HTML content of the page using the content property:

In [2]:
print(page.content)

b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n        <p>Here is some more text.</p>\n    </body>\n</html>'
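The `b'...'` prefix in the output shows that `page.content` holds raw bytes rather than a string. requests also offers `page.text`, which is the same content decoded to a string. The sketch below illustrates the difference with a made-up byte string, assuming UTF-8 encoding, and does not need a network connection.

```python
# Hypothetical bytes, similar to what page.content could contain (UTF-8 encoded)
raw = b'<p>Gr\xc3\xbc\xc3\x9fe aus Osnabr\xc3\xbcck</p>'

# Decoding turns the bytes into a regular Python string
decoded = raw.decode("utf-8")

print(type(raw).__name__, type(decoded).__name__)
print(decoded)
```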


## Parsing a page with BeautifulSoup

As you can see above, we have now downloaded an HTML document.

We can use the BeautifulSoup library to parse this document, and extract the text from the p tag.

We first have to import the library, and create an instance of the BeautifulSoup class to parse our document:


In [2]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
# BeautifulSoup expects the HTML code and the name of a parser.
# lxml can also be used as the parser instead of html.parser (it has to be installed separately).

print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <title>
   A simple example page
  </title>
 </head>
 <body>
  <p>
   Here is some simple content for this page.
  </p>
  <p>
   Here is some more text.
  </p>
 </body>
</html>


## Finding all instances of a tag at once

If we want to extract every instance of a tag, we can use the find_all method, which will find all the instances of a tag on a page.


In [4]:
soup.find_all('p')

[<p>Here is some simple content for this page.</p>,
 <p>Here is some more text.</p>]

Note that find_all returns a list, so we’ll have to loop through it, or use list indexing, to extract text:

In [5]:
soup.find_all('p')[1]

<p>Here is some more text.</p>

In [6]:
# to get the text instead of the html code we can extract it using .text or .get_text()
soup.find_all('p')[1].text
soup.find_all('p')[1].get_text()

'Here is some more text.'

If you instead only want to find the first instance of a tag, you can use the find method, which will return a single Tag object (or None if nothing matches):

In [7]:
soup.find('p')

<p>Here is some simple content for this page.</p>
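The difference between the two methods matters most when nothing matches. The sketch below works on an inline copy of the sample page, so it runs without a network connection; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the simple sample page
html = (
    "<html><body>"
    "<p>Here is some simple content for this page.</p>"
    "<p>Here is some more text.</p>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Loop over the list returned by find_all to collect the text of every p tag
paragraph_texts = [p.get_text() for p in soup.find_all("p")]

# find returns None when there is no match, so check before using the result
table = soup.find("table")
table_text = table.get_text() if table is not None else ""

# find_all returns an empty list in the same situation
no_tables = soup.find_all("table")

print(paragraph_texts, repr(table_text), no_tables)
```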

## Searching for tags by class and id

We introduced classes and ids earlier, but it probably wasn’t clear why they were useful.

Classes and ids are used by CSS to determine which HTML elements to apply certain styles to. But when we’re scraping, we can also use them to specify the elements we want to scrape.

To illustrate this principle, we’ll work with the following page:

```html
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                First paragraph.
            </p>
            <p class="inner-text">
                Second paragraph.
            </p>
        </div>
            <p class="outer-text first-item" id="second">
                <b>
                First outer paragraph.
                </b>
            </p>
            <p class="outer-text">
                <b>
                Second outer paragraph.
                </b>
            </p>
    </body>
</html>
```


Let’s first download the page and create a BeautifulSoup object:

In [3]:
page = requests.get("https://webscrapingtutorial.netlify.app/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup

<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
                </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
                </b>
</p>
</body>
</html>

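Now, we can use the `find_all` method to search for items by class or by id. The cells immediately below experiment with regular expressions as search filters, which the Regex section explains in detail. After those, we’ll search for any p tag that has the class outer-text.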

In [24]:
import re
soup.find_all(re.compile(r'p|(b\s)'))

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>,
 <p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
                 </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
                 </b>
 </p>]

In [None]:
soup.find_all(string=re.compile(r'\s(\w)* outer'))

In [25]:
# The method is called find_all. soup.findall (without the underscore) is treated as a
# search for a <findall> tag and evaluates to None, so help() would describe NoneType.
help(soup.find_all)


In [46]:
print("1st try:", soup.find_all("p", "-text"), "", sep="\n")
print("2nd try:", soup.find_all("p", re.compile("-text")), sep="\n")

1st try:
[]

2nd try:
[<p class="inner-text first-item" id="first">
                First paragraph.
            </p>, <p class="inner-text">
                Second paragraph.
            </p>, <p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
                </b>
</p>, <p class="outer-text">
<b>
                Second outer paragraph.
                </b>
</p>]


In [12]:
for tag in soup.find_all(re.compile("b")):
    print(tag.name)

body
b
b


In [14]:
print(soup.find(re.compile(r'o.{3}r')))

None


In [9]:
# find all p tags that have the outer-text class
soup.find_all('p', class_='outer-text')

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
                 </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
                 </b>
 </p>]

In [10]:
# find all tags whose class attribute is exactly "outer-text first-item" (the order of the classes matters here)
soup.find_all(class_="outer-text first-item")

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
                 </b>
 </p>]

In [11]:
# We can also search for elements by id:
soup.find_all(id="first")

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>]

In [12]:
# or we can pass a dictionary to match several attributes at once
soup.find_all('p', {'class': 'outer-text first-item', 'id': 'second'})

[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
                 </b>
 </p>]
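One detail worth illustrating is how a string containing several classes is matched. The sketch below uses a shortened inline copy of the outer paragraphs; as far as the Beautiful Soup documentation describes it, a multi-class string is compared against the exact value of the class attribute, so the order of the classes matters, while a CSS selector does not care about order.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the two outer paragraphs
html = """
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every tag that carries that class
single = soup.find_all(class_="outer-text")

# A multi-class string matches only the exact attribute value
exact = soup.find_all(class_="outer-text first-item")
reordered = soup.find_all(class_="first-item outer-text")

# A CSS selector matches both classes regardless of their order
by_selector = soup.select("p.first-item.outer-text")

print(len(single), len(exact), len(reordered), len(by_selector))
```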

## Using CSS Selectors

We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:

- p a — finds all a tags inside of a p tag.

- body p a — finds all a tags inside of a p tag inside of a body tag.

- html body — finds all body tags inside of an html tag.

- p.outer-text — finds all p tags with a class of outer-text.

- p#first — finds all p tags with an id of first.

- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

[Check more on CSS selectors](https://www.w3schools.com/css/css_selectors.asp)

In [13]:
soup.select("div p")
# Note that the select method above returns a list of Tag objects, just like find_all.

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>]
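A few of the selectors listed above can be tried out on an inline copy of the example page, as in this sketch. `select_one` is the single-result counterpart of `select`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Inline copy of the ids_and_classes example page
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# body p.outer-text: p tags with class outer-text inside the body tag
outer = [p.get_text(strip=True) for p in soup.select("body p.outer-text")]

# p#first: the p tag with id first (select_one returns one tag or None)
first = soup.select_one("p#first").get_text(strip=True)

# p b: all b tags inside a p tag
bold_count = len(soup.select("p b"))

# p a: there are no links on this page, so nothing matches
no_link = soup.select_one("p a")

print(outer, first, bold_count, no_link)
```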

## Regex
Regular expressions are an important tool for web scraping. You will often need to search for patterns in the strings of an HTML document, for instance to find certain tags. Before we start using regex in Beautiful Soup, we will go over a few basics. The examples below use the first chapter of "Harry Potter and the Philosopher's Stone" as sample text.

In [None]:
import re
with open("Potter1.1.txt", "rt") as file:
    text = file.read()

print(text[:200])

In [None]:
# find all adverbs (and Lily)
re.findall(r"\w+ly\b", text)

In [None]:
# What is Harry's aunt called again? It's Mrs. something
# the dot is escaped so that it matches a literal "." instead of any character
m = re.search(r"Mrs\. (\w+)", text)
print("The whole match:", m[0], sep="\n")
print("\nJust the name, i.e. the first group of the match:", m[1], sep="\n")

### Regex Syntax
Before we can query a string for text passages, we should create and compile a regular expression. A compiled regex is a ``Pattern`` object. Further down you will see that you often don't need to compile regular expressions explicitly, because the functions you use (e.g. in Beautiful Soup) do that for you. Nevertheless, the following cells show the explicit way for better understanding.

In [None]:
# a match can be checked with an if-statement on a `Match` object
def print_result(match):
    if match:
        print("It's a match!")
    else:
        print("No match :'(")

In [None]:
# this regular expression matches a run of lower-case letters that contains an "l"
# note that match() only checks whether the pattern fits at the start of the string
r = r"[a-z]*l[a-z]*"

# compile regex
p = re.compile(r)

# create a `Match` object
m = p.match("hello")

print_result(m)

# No upper case letters before l
print_result(p.match("Hello"))

print_result(p.match("hi"))

# After a match is found, the remainder of the string is ignored
print_result(p.match("hellO"))
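Since `match()` only anchors at the start of the string, it helps to compare it with `search()` and `fullmatch()`, two other methods of compiled patterns in the `re` module. The following sketch reuses the regex from the cell above; the variable names are illustrative.

```python
import re

pattern = re.compile(r"[a-z]*l[a-z]*")

# match: the pattern must fit at the start, but the rest of the string is ignored
start_match = pattern.match("hellO")

# fullmatch: the pattern must cover the whole string
whole_fail = pattern.fullmatch("hellO")
whole_ok = pattern.fullmatch("hello")

# search: the pattern may fit anywhere in the string
nowhere_at_start = pattern.match("Hello")
anywhere = pattern.search("Hello")

print(start_match.group(), whole_fail, whole_ok.group(), nowhere_at_start, anywhere.group())
```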

#### r before string

In [None]:
print("hello \n world!")
print(r"hello \n world!")
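The raw-string prefix matters for regular expressions because several regex escapes collide with Python's own string escapes. A short sketch of the effect, using the word-boundary escape `\b` as the example:

```python
import re

# In a normal string, \n is a single newline character
normal_length = len("\n")
# In a raw string, \n stays as two characters: a backslash and the letter n
raw_length = len(r"\n")

# Without the r prefix, Python turns \b into a backspace character,
# so the regex looks for backspaces and finds nothing
without_raw = re.findall("\bcat\b", "a cat sat")

# With the r prefix, the regex engine receives \b and treats it as a word boundary
with_raw = re.findall(r"\bcat\b", "a cat sat")

print(normal_length, raw_length, without_raw, with_raw)
```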

#### []

In [None]:
# square brackets define a set of characters, any one of which matches
p = re.compile(r"[ab]")

print_result(p.match("a"))
print_result(p.match("b"))
print_result(p.match("c"))

#### * and +

In [None]:
# * matches zero or more repetitions of the preceding symbol
p = re.compile(r"a*")

print_result(p.match("aaaaaa"))
print_result(p.match(""))

# + matches an arbitrary amount greater than zero of the preceding symbol
p = re.compile(r"a+")

print_result(p.match("aaaaaa"))
print_result(p.match(""))

# quantify set
p = re.compile(r"[ab]*c")

print_result(p.match("abbbaaabc"))
print_result(p.match("bbbbbda"))

#### {}

In [None]:
# curly brackets can be used for specifying the amount of repetitions
p = re.compile(r"a{5}b")

print_result(p.match("aaaaab"))
print_result(p.match("aaaab"))

# or a range of repetition amounts
p = re.compile(r"a{4,5}b")

print_result(p.match("aaaaab"))
print_result(p.match("aaaab"))

#### .

In [None]:
# . matches any symbol, except newline
p = re.compile(r".*y")

print_result(p.match("Hi! This string is matched, as soon as a 'y' is encountered"))
print_result(p.match("Newlines\nbefore 'y' can cause trouble"))

# the `re.DOTALL` flag can help
p = re.compile(r".*y", re.DOTALL)
print_result(p.match("Newlines\nbefore 'y' can cause trouble"))

#### ^ and $

In [None]:
# ^ indicates the start of the string
p = re.compile(r"^hello")

print_result(p.match("hello world!"))
print_result(p.match("welcome and hello"))

# $ indicates the end of the string
p = re.compile(r".*hello$")

print_result(p.match("hello world!"))
print_result(p.match("welcome and hello"))

#### \

In [None]:
# \ is the escape character
p = re.compile(r"\*$")

print_result(p.match("*"))

#### ()

In [None]:
# round brackets are used for creating groups
p = re.compile(r".*(hello).*(world)")
m = p.match("Somewhere in this text are the words hello and world and some stuff afterwards")

print(m[0]) # the whole match
print(m[1]) # first group
print(m[2]) # second group

# (?P<name>...) can be used for naming groups
p = re.compile(r".*(?P<group01>hello)")
m = p.match("The word hello will be found")

m["group01"]

#### *?

In [None]:
# * and + are greedy, they try to match as many symbols as possible
p = re.compile(r".*in")
m = p.match("The word 'in' is twice in this sentence")
print(m[0])

# use *? for non-greedy behaviour
p = re.compile(r".*?in")
m = p.match("The word 'in' is twice in this sentence")
print(m[0])
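Greediness is a common stumbling block when a pattern is applied to HTML, because one greedy match can swallow several tags at once. A small sketch with a made-up snippet:

```python
import re

# Made-up HTML snippet with two bold passages
snippet = "<b>First</b> and <b>Second</b>"

# Greedy: .* runs to the last </b>, so both passages end up in one match
greedy = re.findall(r"<b>.*</b>", snippet)

# Non-greedy: .*? stops at the first </b> it can reach
lazy = re.findall(r"<b>(.*?)</b>", snippet)

print(greedy)
print(lazy)
```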

#### \w and \s

In [None]:
# \w matches any 'word character'
p = re.compile(r"^\w*$")
print_result(p.match("abc_öüäçâбш大水سلام"))
print_result(p.match(" "))

# \s matches whitespace characters
p = re.compile(r"^\s*$")
print_result(p.match(" \t\n\r\f\v"))

#### And much more
If you find yourself in a situation where the above is not enough, you can find (much) more syntax in the Python `re` [documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax).

### Pattern matching
So far we have used ``match()`` simply to find out whether the regex and the string match. Often we would rather extract substrings with the help of a regex. [Pattern objects](https://docs.python.org/3/library/re.html#regular-expression-objects) offer several helpful methods for this.

In [None]:
p = re.compile(r".{10}[Hh]arry.{10}")

# search finds the first occurrence of a regex
p.search(text)[0]

In [None]:
# findall returns all non-overlapping matches as a list
p.findall(text)

In [None]:
# split the text to words
p = re.compile(r"\W+")

p.split(text)[:20]

You usually don't have to compile a regular expression. Instead you can put the uncompiled regex as an argument in a function of the ``re`` module.

In [None]:
re.search(r".{10}[Hh]arry.{10}", text)[0]

In [None]:
re.findall(r".{10}[Hh]arry.{10}", text)[:20]
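Two further pattern methods that often come in handy are `finditer`, which yields a `Match` object for every hit, and `sub`, which replaces every hit. The sketch below runs on a short made-up sentence instead of the chapter text, so the result is easy to check by eye.

```python
import re

# Short made-up sample, not taken from the chapter
sample = "Harry met Hagrid. Then harry left."
pattern = re.compile(r"[Hh]arry")

# finditer: one Match object per hit, which also gives us the position
positions = [(m.group(), m.start()) for m in pattern.finditer(sample)]

# sub: replace every hit with another string
replaced = pattern.sub("Potter", sample)

print(positions)
print(replaced)
```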

### Match Objects
We have already indexed ``Match`` objects to get the matched string, and used them in an ``if`` statement to check whether a regex and a string match. There is more we can do with [``Match`` objects](https://docs.python.org/3/library/re.html#match-objects).

In [None]:
import re
m = re.match(r".*?(?P<McG>McGonagall)", text, re.DOTALL)

# start gives the starting index of the specified group in the whole string
startpos = m.start("McG")
endpos = m.end("McG")
print(text[startpos-20:endpos+1])

In [None]:
# groupdict can be used to create a dictionary out of named groups
m = re.match(r".*?(?P<first_name>\w+) (?P<last_name>Dumbledore)", text, re.DOTALL)
print(m.groupdict())
m = re.match(r".*?(?P<first_name>\w+) (?P<last_name>McGonagall)", text, re.DOTALL)
print(m.groupdict())
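The same methods can be tried on a short made-up sentence, which makes the indices easy to verify. This sketch also shows `groups()` and `span()`, and that a failed search returns `None` rather than a `Match` object.

```python
import re

# Short made-up sample, not taken from the chapter
sample = "Professor Minerva McGonagall arrived."
m = re.search(r"(?P<first_name>\w+) (?P<last_name>McGonagall)", sample)

whole = m.group(0)                  # the whole match
first_name = m.group("first_name")  # a group addressed by name
all_groups = m.groups()             # all groups as a tuple
last_span = m.span("last_name")     # (start, end) indices of one group

# A failed search returns None, so check before calling methods on the result
missing = re.search(r"Dumbledore", sample)

print(whole, first_name, all_groups, last_span, missing)
```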

### Spice up your Soup with Regex
When you have extracted some wall of text from a website, you might want to extract certain information out of it with regular expressions. But there are also a few ways of using regex directly in BeautifulSoup functions.

In [None]:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://webscrapingtutorial.netlify.app/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')

In [None]:
# matching the tags
soup.find_all(re.compile(r'p|(b\s)'))

In [None]:
# matching the class
print("1st try:", soup.find_all("p", "-text"), "", sep="\n")
print("2nd try:", soup.find_all("p", re.compile("-text")), sep="\n")

In [None]:
# match a tag's attribute
soup.find_all(id=re.compile(r"(first)|(second)"))

In [None]:
# matching the text instead of tags
soup.find_all(string=re.compile(r'\s(\w)* outer'))

In [None]:
# Question: What kinds of outer paragraphs are in the tags with class = "outer..."?
# use a separate variable name so the chapter stored in `text` is not overwritten
outer_html = str(soup.find_all("p", re.compile("outer")))
re.compile(r'\s(\w*)\souter paragraph').findall(outer_html)
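For a version that runs without a network connection, the sketch below applies the same ideas to an inline copy of the example page. The keyword `class_` and the variable names are choices made for this illustration; the patterns are slightly stricter variants of the ones above.

```python
import re
from bs4 import BeautifulSoup

# Inline copy of the ids_and_classes example page
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Match the class: p tags with a class that starts with "outer-"
outer_paragraphs = soup.find_all("p", class_=re.compile(r"^outer-"))

# Match an attribute: tags whose id is exactly "first" or "second"
ids = [tag["id"] for tag in soup.find_all(id=re.compile(r"^(first|second)$"))]

# Match the text: strings that contain the word "outer"
strings = soup.find_all(string=re.compile(r"outer"))

# Pull the word in front of "outer paragraph" out of each string
kinds = [re.search(r"(\w+) outer paragraph", s).group(1) for s in strings]

print(len(outer_paragraphs), ids, kinds)
```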