# Without Beautiful Soup

In [1]:
from urllib.request import urlopen

In [2]:
url = "http://olympus.realpython.org/profiles/aphrodite"

In [3]:
page = urlopen(url)

In [4]:
page

<http.client.HTTPResponse at 0x7fe3e2e6fe10>

In [5]:
# extract the HTML from the page
html_bytes = page.read()
# decode the bytes returned by the previous method into a string using UTF-8
html = html_bytes.decode("utf-8")

In [6]:
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>



## Extract Text From HTML With String Methods

One way to extract information from a web page’s HTML is to use string methods. For instance, you can use `.find()` to search through the text of the HTML for the `<title>` tags and extract the title of the web page.

Let’s extract the title of the web page you requested in the previous example. If you know the index of the first character of the title and the first character of the closing `</title>` tag, then you can use a string slice to extract the title.

Since `.find()` returns the index of the first occurrence of a substring, you can get the index of the opening `<title>` tag by passing the string `"<title>"` to `.find()`:

In [7]:
title_index = html.find("<title>")
title_index

14

You don’t want the index of the `<title>` tag, though. You want the index of the title itself. To get the index of the first letter in the title, you can add the length of the string "<title>" to title_index:

In [8]:
start_index = title_index + len("<title>")
start_index

21

Now get the index of the closing `</title>` tag by passing the string `"</title>"` to `.find()`:

In [9]:
end_index = html.find("</title>")
end_index

39

Finally, you can extract the title by slicing the html string:

In [10]:
title = html[start_index:end_index]
title

'Profile: Aphrodite'

Now try the same steps on another page, /profiles/poseidon:

In [11]:
url = "http://olympus.realpython.org/profiles/poseidon"
page = urlopen(url)
html = page.read().decode("utf-8")
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]

In [12]:
title

'\n<head>\n<title >Profile: Poseidon'

Whoops! There’s a bit of HTML mixed in with the title. Why’s that?

The HTML for the /profiles/poseidon page looks similar to the /profiles/aphrodite page, but there’s a small difference. The opening `<title>` tag has an extra space before the closing angle bracket (`>`), rendering it as `<title >`.

`html.find("<title>")` returns -1 because the exact substring `"<title>"` doesn’t exist. When -1 is added to `len("<title>")`, which is 7, the start_index variable is assigned the value 6.

The character at index 6 of the string html is a newline character (\n) right before the opening angle bracket (<) of the `<head>` tag. This means that html[start_index:end_index] returns all the HTML starting with that newline and ending just before the `</title>` tag.

These sorts of problems can occur in countless unpredictable ways. You need a more reliable way to extract text from HTML.
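
One way to guard against the failure above is to check for the -1 value that `.find()` returns before slicing. The following is a minimal sketch; the helper name `extract_between` and the two sample strings are hypothetical and stand in for the downloaded pages.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None  # opening marker not found
    start += len(start_marker)
    end = text.find(end_marker, start)  # search only after the opening marker
    if end == -1:
        return None  # closing marker not found
    return text[start:end]


# Hypothetical sample strings modeled on the two profile pages.
good_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>"
bad_html = "<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>"

print(extract_between(good_html, "<title>", "</title>"))
print(extract_between(bad_html, "<title>", "</title>"))
```

This still can’t read the malformed tag, but it fails visibly with None instead of returning a misleading slice.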

## Using Regular Expressions

### A Primer on Regular Expressions

Regular expressions—or regexes for short—are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s `re` module.

In [14]:
import re

Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more of whatever comes just before the asterisk.

In the following example, you use findall() to find any text within a string that matches a given regular expression:

In [15]:
re.findall("ab*c", "ac")

['ac']

The first argument of re.findall() is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern "ab*c" in the string "ac".

The regular expression "ab*c" matches any part of the string that begins with an "a", ends with a "c", __and has zero or more instances of "b" between the two__. re.findall() returns a list of all matches. The string "ac" matches this pattern, so it’s returned in the list.

Here’s the same pattern applied to different strings:

In [17]:
re.findall("ab*c", "abcd"), re.findall("ab*c", "acc"), re.findall("ab*c", "abcac"), re.findall("ab*c", "abdc")

(['abc'], ['ac'], ['abc', 'ac'], [])

Notice that if no match is found, then findall() returns an empty list.

Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value re.IGNORECASE:

In [18]:
re.findall("ab*c", "ABC")

[]

In [22]:
re.findall("ab*c", "ABC", re.IGNORECASE)

['ABC']

You can use a **period (.) to stand for any single character in a regular expression**. For instance, you could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

In [25]:
re.findall("a.c", "abc"), re.findall("a.c", "abbc"), re.findall("a.c", "ac"), re.findall("a.c", "acc")

(['abc'], [], [], ['acc'])

The pattern __.* inside a regular expression stands for any character repeated any number of times__. For instance, "a.*c" can be used to find every substring that starts with "a" and ends with "c", regardless of which letter—or letters—are in between:

In [26]:
re.findall("a.*c", "abc"), re.findall("a.*c", "abbc"), re.findall("a.*c", "ac"), re.findall("a.*c", "acc")

(['abc'], ['abbc'], ['ac'], ['acc'])
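
One detail matters when you apply these patterns to HTML: by default, the period does not match a newline character, so `.*` stops at the end of a line. Here is a small sketch of the difference, using the standard `re.DOTALL` flag. The sample string is made up for illustration.

```python
import re

text = "a\nc"  # "a" and "c" separated by a newline

# By default, "." matches any character except a newline.
default_matches = re.findall("a.*c", text)

# With re.DOTALL, "." also matches newlines.
dotall_matches = re.findall("a.*c", text, re.DOTALL)

print(default_matches)
print(dotall_matches)
```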

Often, you use re.search() to search for a particular pattern inside a string. This function is somewhat more complicated than re.findall() because, instead of a list of strings, it returns a match object (an instance of `re.Match`) describing the first match in the string, or None if there is no match. The match object stores the matched text along with any groups captured by parentheses in the pattern.

The details of the match object are irrelevant here. For now, just know that __calling .group() on a match object with no arguments returns the entire matched text__, which in most cases is just what you want:

In [27]:
match_results = re.search("ab*c", "ABC", re.IGNORECASE)
match_results.group()

'ABC'
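
For reference, here is a small sketch of what a match object exposes. Because re.search() returns None when nothing matches, it is worth checking the result before calling `.group()`. The sample strings are illustrative only.

```python
import re

match = re.search("ab*c", "xxABCxx", re.IGNORECASE)
if match is not None:
    matched_text = match.group()  # the text that matched
    matched_span = match.span()   # (start, end) indexes of the match
    print(matched_text, matched_span)

# No match: re.search() returns None rather than a match object.
no_match = re.search("ab*c", "xyz")
print(no_match)
```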

There’s one more function in the re module that’s useful for parsing out text. re.sub(), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text. It behaves sort of like the .replace() string method.

The arguments passed to re.sub() are the regular expression, followed by the replacement text, followed by the string. Here’s an example:

In [28]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*>", "ELEPHANTS", string)
string

'Everything is ELEPHANTS.'

Perhaps that wasn’t quite what you expected to happen.

re.sub() uses the regular expression __"<.*>"__ to find and replace everything between the first < and last >, which spans from the beginning of `<replaced>` to the end of `<tags>`. This is because Python’s regular expressions are greedy, meaning they try to __find the longest possible match when characters like * are used__.

Alternatively, you can use the non-greedy matching pattern __*?__, which works the same way as * except that it __matches the shortest possible string of text__:

In [29]:
string = "Everything is <replaced> if it's in <tags>."
string = re.sub("<.*?>", "ELEPHANTS", string)
string

"Everything is ELEPHANTS if it's in ELEPHANTS."

This time, re.sub() finds two matches, `<replaced>` and `<tags>`, and substitutes the string "ELEPHANTS" for both matches.
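
To see exactly which substrings each pattern matches, it can help to pass both patterns to re.findall() instead of re.sub(). A quick sketch using the same sentence:

```python
import re

text = "Everything is <replaced> if it's in <tags>."

greedy = re.findall("<.*>", text)       # one long match
non_greedy = re.findall("<.*?>", text)  # one match per tag

print(greedy)
print(non_greedy)
```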

### Extract Text From HTML With Regular Expressions

In [30]:
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title) # Remove HTML tags

print(title)

Profile: Dionysus


Let’s take a closer look at the first regular expression in the pattern string by breaking it down into three parts:

`<title.*?>` matches the opening `<TITLE >` tag in html. The `<title` part of the pattern matches `<TITLE` because re.search() is called with re.IGNORECASE, and `.*?>` matches any text after `<TITLE` up to the first instance of `>`.

`.*?` non-greedily matches all text after the opening `<TITLE >`, stopping at the first match for `</title.*?>`.

`</title.*?>` differs from the first pattern only in its use of the / character, so it matches the closing `</title / >` tag in html.

The second regular expression, the string `"<.*?>"`, also uses the non-greedy .*? to match all the HTML tags in the title string. By replacing any matches with "", re.sub() removes all the tags and returns only the text.
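
As an alternative sketch, a capture group can pull out the title text in a single step, without the second re.sub() call. The helper name `extract_title` and the inline sample strings below are assumptions made for illustration; they stand in for the downloaded pages.

```python
import re

# Parentheses capture the text between the opening and closing tags.
# re.DOTALL lets "." match newlines in case the title spans several lines.
TITLE_PATTERN = re.compile(r"<title.*?>(.*?)</title.*?>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title, or None if no <title> element is found."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1).strip()


# Hypothetical samples modeled on the three profile pages, plus one without a title.
samples = [
    "<head>\n<title>Profile: Aphrodite</title>\n</head>",
    "<head>\n<title >Profile: Poseidon</title>\n</head>",
    "<head>\n<TITLE >Profile: Dionysus</title  / >\n</head>",
    "<head></head>",
]
titles = [extract_title(sample) for sample in samples]
print(titles)
```

Checking for None before calling `.group()` avoids an AttributeError on pages that have no title at all.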

Note: Web scraping in Python or any other language can be tedious. No two websites are organized the same way, and HTML is often messy. Moreover, websites change over time. Web scrapers that work today are not guaranteed to work next year—or next week, for that matter!

## Check Your Understanding

The 1st one

In [33]:
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

In [35]:
print(html)

<html>
<head>
<TITLE >Profile: Dionysus</title  / >
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/dionysus.jpg" />
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"><br><br>
Hometown: Mount Olympus
<br><br>
Favorite animal: Leopard <br>
<br>
Favorite Color: Wine
</center>
</body>
</html>



In [57]:
name_pattern = '<h2.*?>.*?</h2.*?>'
match_results = re.search(name_pattern, html, re.IGNORECASE)
print(match_results.group())

<h2>Name: Dionysus</h2>


In [55]:
name = match_results.group()
name = re.sub("<.*?>", "", name) # Remove HTML tags
print(name)

Name: Dionysus


In [47]:
print(re.search(favcol_pattern, html, re.IGNORECASE))

<re.Match object; span=(289, 298), match='</center>'>


In [63]:
url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")

for string in ["Name: ", "Favorite Color:"]:
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)
    #print(text_start_idx)
    
    next_html_tag_offset = html_text[text_start_idx:].find("<")
    text_end_idx = text_start_idx + next_html_tag_offset

    raw_text = html_text[text_start_idx : text_end_idx]
    #print(raw_text)
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)

Dionysus
Wine
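
The same values could also be extracted with a regular expression. The sketch below is one possible pattern, not the tutorial’s solution. It runs against a shortened inline copy of the page so that it does not need a network connection.

```python
import re

# Shortened copy of the relevant part of the Dionysus profile page.
html_text = (
    "<h2>Name: Dionysus</h2>\n"
    "Favorite animal: Leopard <br>\n"
    "<br>\n"
    "Favorite Color: Wine\n"
    "</center>"
)

values = []
for label in ["Name:", "Favorite Color:"]:
    # Capture everything after the label up to the next "<".
    pattern = re.escape(label) + r"\s*([^<]*)"
    match = re.search(pattern, html_text, re.IGNORECASE)
    if match is not None:
        values.append(match.group(1).strip())

print(values)
```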


# With Beautiful Soup

## Use an HTML Parser for Web Scraping in Python

Install Beautiful Soup  
To install Beautiful Soup, you can run the following in your terminal:

In [68]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 soupsieve-2.0.1


In [2]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [3]:
soup

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>

This program does three things:

1. Opens the URL http://olympus.realpython.org/profiles/dionysus using urlopen() from the urllib.request module

2. Reads the HTML from the page as a string and assigns it to the html variable

3. Creates a BeautifulSoup object and assigns it to the soup variable

The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "html.parser", tells the object which parser to use behind the scenes. "html.parser" represents Python’s built-in HTML parser.

**Use a BeautifulSoup Object**

BeautifulSoup objects provide methods for working with the parsed document. For example, the `.get_text()` method extracts all the text from the document and automatically removes any HTML tags.

In [4]:
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string `.replace()` method if you need to.

In [11]:
print(soup.get_text().replace("\n\n\n", "\n").replace("\n\n", "\n"))


Profile: Dionysus
Name: Dionysus
Hometown: Mount Olympus
Favorite animal: Leopard 
Favorite Color: Wine
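
Chained `.replace()` calls depend on how many consecutive newlines happen to occur. A slightly more general sketch splits the text into lines and keeps only the non-empty ones. The `text` string below is a stand-in for the result of soup.get_text().

```python
# Stand-in for soup.get_text(); the real string comes from the parsed page.
text = (
    "\n\nProfile: Dionysus\n\n\n\n\nName: Dionysus\n\n"
    "Hometown: Mount Olympus\n\nFavorite animal: Leopard \n\n"
    "Favorite Color: Wine\n\n\n"
)

# Strip each line and drop the ones that are empty after stripping.
lines = [line.strip() for line in text.splitlines()]
cleaned = "\n".join(line for line in lines if line)

print(cleaned)
```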



Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the .find() string method is sometimes easier than working with regular expressions.

However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of `<img>` HTML tags.

In this case, you can use `.find()` to return the first instance of a particular tag, or `.find_all()` to return a list of all instances of that tag:

In [15]:
soup.find("img")

<img src="/static/dionysus.jpg"/>

In [14]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

This returns a list of all `<img>` tags in the HTML document. The objects in the list look like they might be strings representing the tags, but they’re actually instances of the Tag object provided by Beautiful Soup. Tag objects provide a simple interface for working with the information they contain.

Let’s explore this a little by first unpacking the Tag objects from the list:

In [16]:
image1, image2 = soup.find_all("img")

Each Tag object has a __.name property that returns a string containing the HTML tag type__:

In [17]:
image1.name

'img'

You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.

For example, the `<img src="/static/dionysus.jpg"/>` tag has a single attribute, src, with the value "/static/dionysus.jpg". Likewise, an HTML tag such as the link `<a href="https://realpython.com" target="_blank">` has two attributes, href and target.

To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:

In [20]:
image1["src"]

'/static/dionysus.jpg'
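
With square brackets, a missing attribute raises a KeyError, just as a missing dictionary key would. The sketch below shows the `.get()` method and the `.attrs` dictionary, both part of Beautiful Soup’s Tag interface, on a small inline snippet that is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet combining the two example tags mentioned above.
snippet = (
    '<img src="/static/dionysus.jpg"/>'
    '<a href="https://realpython.com" target="_blank">Real Python</a>'
)
soup_snippet = BeautifulSoup(snippet, "html.parser")

image = soup_snippet.find("img")
link = soup_snippet.find("a")

print(image.attrs)       # all attributes of the tag as a dictionary
print(link["target"])    # dictionary-style access
print(image.get("alt"))  # .get() returns None instead of raising KeyError
```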

Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the `<title>` tag in a document, you can use the .title property:

In [21]:
soup.title

<title>Profile: Dionysus</title>

In [23]:
soup.head

<head>
<title>Profile: Dionysus</title>
</head>

In [24]:
soup.h2

<h2>Name: Dionysus</h2>

Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.

You can also retrieve just the string between the opening and closing tags with the .string property of the Tag object. Here is the `<h2>` tag as an example:

In [25]:
soup.h2.string

'Name: Dionysus'

One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the `<img>` tags that have a src attribute equal to the value /static/dionysus.jpg, then you can provide the following additional argument to .find_all():

In [28]:
soup.find_all("img", src="/static/dionysus.jpg")

[<img src="/static/dionysus.jpg"/>]

## Check Your Understanding

In [29]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

In [31]:
soup.find_all("a")

[<a href="/profiles/aphrodite">Aphrodite</a>,
 <a href="/profiles/poseidon">Poseidon</a>,
 <a href="/profiles/dionysus">Dionysus</a>]

In [33]:
links = soup.find_all("a")

In [34]:
links[1]["href"]

'/profiles/poseidon'

In [37]:
for link in links:
    # Each href already starts with "/", so the base URL must not end with one.
    print("http://olympus.realpython.org{}".format(link["href"]))

http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus


First, import the urlopen function from the urllib.request module and the BeautifulSoup class from the bs4 package:

<code>from urllib.request import urlopen
from bs4 import BeautifulSoup
</code>
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:  
<code>base_url = "http://olympus.realpython.org"
</code>
You can build a full URL by concatenating base_url with a relative URL.

Now open the /profiles page with urlopen() and use .read() to get the HTML source:
<code>html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")
</code>
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:

<code>soup = BeautifulSoup(html_text, "html.parser")
</code>
soup.find_all("a") returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:

<code>for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)
</code>
The relative URL for each link can be accessed through the "href" subscript. Concatenate this value with base_url to create the full link_url.
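
String concatenation works here because every href starts with a slash. As a more forgiving sketch, urljoin() from the standard library’s urllib.parse module resolves a relative URL against a base URL and handles the slashes for you. The list of hrefs is copied from the output above.

```python
from urllib.parse import urljoin

base_url = "http://olympus.realpython.org"
hrefs = ["/profiles/aphrodite", "/profiles/poseidon", "/profiles/dionysus"]

# urljoin() resolves each relative href against the base URL.
full_urls = [urljoin(base_url, href) for href in hrefs]
for full_url in full_urls:
    print(full_url)
```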

# Beautiful Soup: Build a Web Scraper With Python

https://www.monster.com/jobs/search/?q=Machine-Learning-Engineer&where=California&stpage=1&page=2

https://www.monster.com/jobs/search/?q=Deep-Learning-Engineer&where=New-York

In [38]:
!pip install requests



In [39]:
import requests

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

This code performs an HTTP request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.

If you take a look at the downloaded content, then you’ll notice that it looks very similar to the HTML you were inspecting earlier with developer tools. To improve the structure of how the HTML is displayed in your console output, you can print the object’s .content attribute with pprint().

In [44]:
print(page.content)

b'<!DOCTYPE html>\r\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n    \r\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\r\n<meta http-equiv="Expires" content="0" />\r\n<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=2.0, minimum-scale=1" />\r\n<meta name="j_jp" content="1" />\r\n<meta charset="UTF-8">\r\n<title>Software Developer Jobs in Australia. Australia Software Developer Jobs. | Monster.com</title>\r\n\r\n        <style type="text/css">\r\n                @font-face{font-family:\'Roboto\';font-style:normal;font-weight:100;font-display:optional;src:local(\'Roboto Thin\'),local(\'Roboto-Thin\'),local(\'sans-serif-thin\'),url(https://fonts.gstatic.com/s/roboto/v19/KFOkCnqEu92Fr1MmgVxFIzIXKMnyrYk.woff2) format(\'woff2\');unicode-range:U+0460-052F,U+1C80-1C88,U+20B4,U+2DE0-2DFF,U+A640-A69F,U+FE2E-FE2F}@font-face{font-family:\'Roboto\';font-style:normal;font-weight:100;font-display:optional;src:local(\'R

**Hidden Websites**

Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to see (and scrape) anything from the page. The process to make an HTTP request from your Python script is different than how you access a page from your browser. That means that just because you can log in to the page through your browser, that doesn’t mean you’ll be able to scrape it with your Python script.

However, there are some advanced techniques that you can use with the requests library to access the content behind logins. These techniques will allow you to log in to websites while making the HTTP request from within your script.

**Dynamic Websites**  

Static sites are easier to work with because the server sends you an HTML page that already contains all the information as a response. You can parse an HTML response with Beautiful Soup and begin to pick out the relevant data.

On the other hand, with a dynamic website the server might not send back any HTML at all. Instead, you’ll receive JavaScript code as a response. This will look completely different from what you saw when you inspected the page with your browser’s developer tools.

> **Note:** To offload work from the server to the clients’ machines, many modern websites avoid crunching numbers on their servers whenever possible. Instead, they’ll send JavaScript code that your browser will execute locally to produce the desired HTML.  

As mentioned before, what happens in the browser is not related to what happens in your script. Your browser will diligently execute the JavaScript code it receives back from a server and create the DOM and HTML for you locally. However, doing a request to a dynamic website in your Python script will not provide you with the HTML page content.

When you use requests, you’ll only receive what the server sends back. In the case of a dynamic website, you’ll end up with some JavaScript code, which you won’t be able to parse using Beautiful Soup. The only way to go from the JavaScript code to the content you’re interested in is to **execute** the code, just like your browser does. The requests library can’t do that for you, but there are other solutions that can.

For example, requests-html is a project created by the author of the requests library that allows you to easily render JavaScript using syntax that’s similar to the syntax in requests. It also includes capabilities for parsing the data by using Beautiful Soup under the hood.

> **Note:** Another popular choice for scraping dynamic content is **Selenium.** Selenium automates a real web browser, which executes the JavaScript code for you so that your script can read the rendered HTML.

You won’t go deeper into scraping dynamically-generated content in this tutorial. For now, it’s enough for you to remember that you’ll need to look into the above-mentioned options if the page you’re interested in is generated in your browser dynamically.

## Parse HTML Code With Beautiful Soup

In [46]:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

In [47]:
soup

<!DOCTYPE html>

<html lang="en" xml:lang="en" xmlns="https://www.w3.org/1999/xhtml">
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="0" http-equiv="Expires"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0, minimum-scale=1" name="viewport"/>
<meta content="1" name="j_jp"/>
<meta charset="utf-8"/>
<title>Software Developer Jobs in Australia. Australia Software Developer Jobs. | Monster.com</title>
<style type="text/css">
                @font-face{font-family:'Roboto';font-style:normal;font-weight:100;font-display:optional;src:local('Roboto Thin'),local('Roboto-Thin'),local('sans-serif-thin'),url(https://fonts.gstatic.com/s/roboto/v19/KFOkCnqEu92Fr1MmgVxFIzIXKMnyrYk.woff2) format('woff2');unicode-range:U+0460-052F,U+1C80-1C88,U+20B4,U+2DE0-2DFF,U+A640-A69F,U+FE2E-FE2F}@font-face{font-family:'Roboto';font-style:normal;font-weight:100;font-display:optional;src:local('Roboto Thin'),local('Roboto-Thin'),local('sans-serif-thin'),url(http

### Find Elements by ID

In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.

Switch back to developer tools and identify the HTML object that contains all of the job postings. Explore by hovering over parts of the page and using right-click to Inspect.

Note: Keep in mind that it’s helpful to periodically switch back to your browser and interactively explore the page using developer tools. This helps you learn how to find the exact elements you’re looking for.

At the time of this writing, the element you’re looking for is a `<div>` with an id attribute that has the value "ResultsContainer". It has a couple of other attributes as well, but below is the gist of what you’re looking for:

``` 
<div id="ResultsContainer">
    <!-- all the job listings -->
</div>
```

Beautiful Soup allows you to find that specific element easily by its ID:

In [48]:
results = soup.find(id='ResultsContainer')

In [49]:
results

<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">
<div class="scrollable" id="ResultsScrollable">
<script type="application/ld+json">
            {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Software-Developer&amp;where=Australia"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/technical-support-engineer-iii-messaging-melbourne-victoria-vic-us-twilio/6b4b387e-3c9a-4c6f-996d-9db14e785b4f"}
                    ,
                 {"@type":"ListItem","position":2,"url":""}
                    ,
                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/lead-performance-engineer-software-systems-plantation-fl-sunnyvale-ca-culver-new-york-city-ca-seattle-wa-austin-tx-toronto-ny-us-magic-leap-inc/e5a

For easier viewing, you can .prettify() any Beautiful Soup object when you print it out. If you call this method on the results variable that you just assigned above, then you should see all the HTML contained within the `<div>`:

In [50]:
print(results.prettify())

<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">
 <div class="scrollable" id="ResultsScrollable">
  <script type="application/ld+json">
   {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Software-Developer&amp;where=Australia"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/technical-support-engineer-iii-messaging-melbourne-victoria-vic-us-twilio/6b4b387e-3c9a-4c6f-996d-9db14e785b4f"}
                    ,
                 {"@type":"ListItem","position":2,"url":""}
                    ,
                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/lead-performance-engineer-software-systems-plantation-fl-sunnyvale-ca-culver-new-york-city-ca-seattle-wa-austin-tx-toronto-ny-us-magic-leap-inc/

### Find Elements by HTML Class Name

You’ve seen that every job posting is wrapped in a `<section>` element with the class card-content. Now you can work with your new Beautiful Soup object called results and select only the job postings. These are, after all, the parts of the HTML that you’re interested in! You can do this in one line of code:

In [51]:
job_elems = results.find_all('section', class_='card-content')

In [52]:
job_elems

[<section class="card-content" data-jobid="6b4b387e-3c9a-4c6f-996d-9db14e785b4f" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
 <div class="flex-row">
 <div class="mux-company-logo thumbnail"></div>
 <div class="summary">
 <header class="card-header">
 <h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="545" data-m_impr_j_coc="" data-m_impr_j_jawsid="459452813" data-m_impr_j_jobid="2445702" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="0" data-m_impr_j_lid="0" data-m_impr_j_long="0" data-m_impr_j_occid="11985" data-m_impr_j_p="1" data-m_impr_j_postingid="6b4b387e-3c9a-4c6f-996d-9db14e785b4f" data-m_impr_j_pvc="238a473f-f326-42d6-97e2-4fb4dd657137" data-m_impr_s_t="t" data-m_impr_uuid="a300431b-9655-4454-8acf-7a2d5a893d5f" href="https://job-openings.monster.com/technical-support-engineer-iii-messaging-melbourne-victoria-vic-us-twilio/6b4b387e-3c9a-4c6f-996d-9db14e785b4f" onclick="clickJobTitle('plid=0&amp;pcid=545

Here, you call `.find_all()` on a Beautiful Soup object, which returns an iterable containing all the HTML for all the job listings displayed on that page.

Take a look at all of them:

In [53]:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)

<section class="card-content" data-jobid="6b4b387e-3c9a-4c6f-996d-9db14e785b4f" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="mux-company-logo thumbnail"></div>
<div class="summary">
<header class="card-header">
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="545" data-m_impr_j_coc="" data-m_impr_j_jawsid="459452813" data-m_impr_j_jobid="2445702" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="0" data-m_impr_j_lid="0" data-m_impr_j_long="0" data-m_impr_j_occid="11985" data-m_impr_j_p="1" data-m_impr_j_postingid="6b4b387e-3c9a-4c6f-996d-9db14e785b4f" data-m_impr_j_pvc="238a473f-f326-42d6-97e2-4fb4dd657137" data-m_impr_s_t="t" data-m_impr_uuid="a300431b-9655-4454-8acf-7a2d5a893d5f" href="https://job-openings.monster.com/technical-support-engineer-iii-messaging-melbourne-victoria-vic-us-twilio/6b4b387e-3c9a-4c6f-996d-9db14e785b4f" onclick="clickJobTitle('plid=0&amp;pcid=545&amp;p

In [58]:
for job_elem in job_elems:
    # Each job_elem is a Beautiful Soup Tag object.
    # You can use the same search methods on it as you did before.
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem) 
    print(company_elem)
    print(location_elem)
    print()

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="545" data-m_impr_j_coc="" data-m_impr_j_jawsid="459452813" data-m_impr_j_jobid="2445702" data-m_impr_j_jpm="2" data-m_impr_j_jpt="3" data-m_impr_j_lat="0" data-m_impr_j_lid="0" data-m_impr_j_long="0" data-m_impr_j_occid="11985" data-m_impr_j_p="1" data-m_impr_j_postingid="6b4b387e-3c9a-4c6f-996d-9db14e785b4f" data-m_impr_j_pvc="238a473f-f326-42d6-97e2-4fb4dd657137" data-m_impr_s_t="t" data-m_impr_uuid="a300431b-9655-4454-8acf-7a2d5a893d5f" href="https://job-openings.monster.com/technical-support-engineer-iii-messaging-melbourne-victoria-vic-us-twilio/6b4b387e-3c9a-4c6f-996d-9db14e785b4f" onclick="clickJobTitle('plid=0&amp;pcid=545&amp;poccid=11985','Software Developer',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Technical Support Engineer III (Messaging)&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quo

### Extract Text From HTML Elements

For now, you only want to see the title, company, and location of each job posting. And behold! Beautiful Soup has got you covered. You can add `.text` to a Beautiful Soup object to return only the **text content** of the HTML elements that the object contains:

In [59]:
for job_elem in job_elems:
    # Each job_elem is a Beautiful Soup Tag object.
    # You can use the same search methods on it as you did before.
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem.text)
    print(company_elem.text)
    print(location_elem.text)
    print()

Technical Support Engineer III (Messaging)


Twilio





Melbourne, Victoria, VIC





AttributeError: 'NoneType' object has no attribute 'text'

Run the above code snippet and you’ll see the text content displayed. However, you’ll also get a lot of whitespace. Since you’re now working with Python strings, you can .strip() the superfluous whitespace. You can also apply any other familiar Python string methods to further clean up your text.
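
As a small illustration of that cleanup, here is a sketch on a plain string. The sample value `raw_location` is made up to mimic the padded output above; it is not taken from the page.

```python
# Hypothetical sample that mimics the padded text returned by .text
raw_location = "\n\nMelbourne,   Victoria, VIC\n\n\n"

# .strip() removes leading and trailing whitespace only
stripped = raw_location.strip()

# Splitting on whitespace and re-joining also collapses runs inside the string
collapsed = " ".join(raw_location.split())

print(repr(stripped))
print(repr(collapsed))
```

Which variant you want depends on whether the inner spacing of the text matters to you.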

> **Note:** The web is messy and you can’t rely on a page structure to be consistent throughout. Therefore, you’ll more often than not run into errors while parsing HTML.

When you run the above code, you might encounter an AttributeError like the one at the end of the output above.

If that’s the case, then take a step back and inspect your previous results. Were there any items with a value of None? You might have noticed that the structure of the page is not entirely uniform. There could be an advertisement in there that displays in a different way than the normal job postings, which may return different results. For this tutorial, you can safely disregard the problematic element and skip over it while parsing the HTML:

In [62]:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()

Technical Support Engineer III (Messaging)
Twilio
Melbourne, Victoria, VIC

Lead Performance Engineer, Software Systems
Magic Leap, Inc.
Plantation, FL; Sunnyvale, CA; Culver New York City, CA; Seattle, WA; Austin, TX; Toronto, NY

Technical Support Engineer III (Messaging)
Twilio
Sydney, New South Wales, NSW

Senior/Lead Software Engineer, Browser
Magic Leap, Inc.
Sunnyvale, CA; Plantation, FL (HQ); Austin, TX; Culver New York City, CA; Seattle, WA; Toronto, NY

Sales Engineer – Commercial Real Estate SaaS - Sydney, New South Wales
MRI Software
Sydney, NSW

Tech Intern: Software Engineering
Comcast
New York, WA

Technical Account Manager
Khoros, LLC
Sydney, FL

Customer Experience Technical Analyst - Sydney, New South Wales
Mediaocean
Sydney, NSW

Program Manager
Khoros, LLC
Sydney, FL

Senior Technical Consultant
Khoros, LLC
Sydney, FL

Software Development Engineer
Amazon Corporate LLC
Seattle, WA

Customer Success Manager II
Khoros, LLC
Sydney, FL



### Find Elements by Class Name and Text Content

By now, you’ve cleaned up the list of jobs that you saw on the website. While that’s pretty neat already, you can make your script more useful. Not all of the job listings are developer jobs that you’d be interested in. Suppose you’re looking for one specific posting, Software Development Engineer - Amazon Physical Stores at Amazon. Instead of printing out all of the jobs from the page, you’ll first filter them by title.

You know that job titles in the page are kept within `<h2>` elements. To filter only for specific ones, you can use the string argument:

In [65]:
python_jobs = results.find_all('h2', string='Software Development Engineer - Amazon Physical Stores at Amazon')

In [66]:
python_jobs

[]

There was definitely a job with that title in the search results, so why is it not showing up? When you use string= like you did above, your program looks for exactly that string. Any differences in capitalization or whitespace will prevent the element from matching. In the next section, you’ll find a way to make the string more general.

### Pass a Function to a Beautiful Soup Method
In addition to strings, you can often pass functions as arguments to Beautiful Soup methods. You can change the previous line of code to use a function instead:

In [73]:
python_jobs = results.find_all('h2', string=lambda text:'development' in text.lower())

In [74]:
python_jobs

[<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSRSEC" data-m_impr_j_cid="660" data-m_impr_j_coc="xwag2321849x" data-m_impr_j_jawsid="462828500" data-m_impr_j_jobid="222681619" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_lat="47.6388" data-m_impr_j_lid="0" data-m_impr_j_long="-122.37" data-m_impr_j_occid="11904" data-m_impr_j_p="11" data-m_impr_j_postingid="b70e8f60-d99e-4805-9002-fa6f0e033e93" data-m_impr_j_pvc="monster" data-m_impr_s_t="m" data-m_impr_uuid="5436d918-9c64-45d5-83c0-dabbf6e3a7b6" href="https://job-openings.monster.com/software-development-engineer-seattle-wa-us-amazon-corporate-llc/222681619" onclick="clickJobTitle('plid=0&amp;pcid=660&amp;poccid=11904','Software Developer',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;events.event65&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Software Development Engineer&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSRSEC&quot;,&quot;eVar

Now you’re passing an anonymous function to the string= argument. The lambda function looks at the text of each `<h2>` element, converts it to lowercase, and checks whether the substring 'development' is found anywhere in there. Now you’ve got a match:

In [75]:
print(len(python_jobs))

1


The process of finding specific elements depending on their text content is a powerful way to filter your HTML response for the information that you’re looking for. Beautiful Soup allows you to use either exact strings or functions as arguments for filtering text in Beautiful Soup objects.
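
One caveat with function filters: Beautiful Soup passes the function whatever the element’s `.string` is, and that can be `None` when an element has no single text string, in which case `text.lower()` raises an AttributeError. Below is a minimal sketch of a reusable, None-safe filter. The helper name `title_contains` is an assumption made for illustration and is not part of Beautiful Soup.

```python
def title_contains(keyword):
    """Build a filter function that does a case-insensitive substring match."""
    keyword = keyword.lower()

    def matcher(text):
        # text may be None for elements without a single text string
        return text is not None and keyword in text.lower()

    return matcher


is_dev_title = title_contains('Development')
print(is_dev_title('Software Development Engineer'))
print(is_dev_title('Program Manager'))
print(is_dev_title(None))
```

You could then pass it in place of the lambda, for example `results.find_all('h2', string=title_contains('development'))`.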

### Extract Attributes From HTML Elements
At this point, your Python script already scrapes the site and filters its HTML for relevant job postings. Well done! However, one thing that’s still missing is the link to apply for a job.

While you were inspecting the page, you found that the link is part of the element that has the title HTML class. The current code strips away the entire link when accessing the .text attribute of its parent element. As you’ve seen before, .text only contains the visible text content of an HTML element. Tags and attributes are not part of that. To get the actual URL, you want to extract one of those attributes instead of discarding it.

Look at the list of filtered results python_jobs that you created above. The URL is contained in the href attribute of the nested `<a>` tag. Start by fetching the `<a>` element. Then, extract the value of its href attribute using square-bracket notation:

In [83]:
python_jobs = results.find_all('h2', string=lambda text: 'development' in text.lower())

for dev_job in python_jobs:
    link = dev_job.find('a')['href']
    print(dev_job.text.strip())
    print(f'Apply here: {link}\n')

Software Development Engineer
Apply here: https://job-openings.monster.com/software-development-engineer-seattle-wa-us-amazon-corporate-llc/222681619



The filtered results will only show links to job opportunities that include development in their title. You can use the same square-bracket notation to extract other HTML attributes as well. A common use case is to fetch the URL of a link, as you did above.

# Modern Web Automation With Python and Selenium

In this tutorial you’ll learn advanced Python web automation techniques: using Selenium with a “headless” browser, exporting the scraped data to CSV files, and wrapping your scraping code in a Python class.

Today you will use a full-fledged browser running in headless mode to do the HTTP requests for you.

A headless browser is just a regular web browser, except that it contains no visible UI element. Just like you’d expect, it can do more than make requests: it can also render HTML (though you cannot see it), keep session information, and even perform asynchronous network communications by running JavaScript code.

If you want to automate the modern web, headless browsers are essential.

## Setup

Your first step, before writing a single line of Python, is to install a Selenium-supported WebDriver for your favorite web browser. In what follows, you will be working with Firefox, but Chrome would work just as well.

Assuming that the path ~/.local/bin is in your execution PATH, here’s how you would install the Firefox WebDriver, called geckodriver, on a Linux machine:

In [84]:
! wget https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz
! tar xvfz geckodriver-v0.19.1-linux64.tar.gz
! mv geckodriver ~/.local/bin

--2020-12-10 16:12:03--  https://github.com/mozilla/geckodriver/releases/download/v0.19.1/geckodriver-v0.19.1-linux64.tar.gz
Resolving github.com (github.com)… 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443… connected.
HTTP request sent, awaiting response… 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/25354393/e31e4c22-be6f-11e7-9bc7-dedc3490a7fd?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20201210%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20201210T151205Z&X-Amz-Expires=300&X-Amz-Signature=47af73358694d407d48062d98601efba491780397fc664fb28a88cf2c89a5b20&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=25354393&response-content-disposition=attachment%3B%20filename%3Dgeckodriver-v0.19.1-linux64.tar.gz&response-content-type=application%2Foctet-stream [following]
--2020-12-10 16:12:05--  https://github-production-release-asset-2e65be.s3.amazonaws.com/25354393/e31e4c22-be6f-11e7-9bc7

In [88]:
! mv geckodriver /usr/local/bin

In [85]:
!pip install selenium

Collecting selenium
  Using cached selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


## Test Driving a Headless Browser
To test that everything is working, you decide to try out a basic web search via DuckDuckGo. You fire up your preferred Python interpreter and type the following:

On macOS, you can install geckodriver with Homebrew instead and then check where it was installed:
> brew install geckodriver # Install  
> which geckodriver # Get the correct path  

Source: https://stackoverflow.com/questions/41435983/selenium-in-python-on-mac-geckodriver-executable-needs-to-be-in-path

In [2]:
from selenium import webdriver
browser = webdriver.Firefox(executable_path = '/usr/local/bin/geckodriver')
browser.get('http://inventwithpython.com')

In [5]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.headless = True # set_headless() is deprecated in Selenium 3.x
assert opts.headless # Operating in headless mode
browser = Firefox(options=opts, executable_path = '/usr/local/bin/geckodriver')
browser.get('https://duckduckgo.com')



So far, you have created a headless Firefox browser and navigated to https://duckduckgo.com. You made an Options instance and used it to activate headless mode when you passed it to the Firefox constructor. This is akin to typing firefox -headless at the command line.

The best way to find out what to query on the page is to open your web browser and use its developer tools to inspect the contents of the page. Right now, you want to get ahold of the search form so you can submit a query. By inspecting DuckDuckGo’s home page, you find that the search form `<input>` element has an id attribute "search_form_input_homepage". That’s just what you needed:

In [6]:
search_form = browser.find_element_by_id('search_form_input_homepage')
search_form.send_keys('real python')
search_form.submit()

In [7]:
search_form

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="d1c3a5da-cee4-c845-84f2-ba7529d80d44", element="a95af599-3404-2a42-abfe-3fd3a04d46fe")>

In [13]:
results = browser.find_elements_by_class_name('result')

In [14]:
print(results[1].text) # Index 1 is the second result from the search

Real Python's Office Hours: Learn With Python Experts in ...
https://realpython.com/office-hours-announcement/
Come learn with Python experts at the Real Python Office Hours, a weekly video call that offers Real Python members the chance to get help with Python-related questions, meet new Pythonistas, learn about new and trending topics in the community, and get feedback and tips on Python code and projects.


In [15]:
browser.close()

In [16]:
#quit()

## Groovin’ on Tunes

You’ve tested that you can drive a headless browser using Python. Now you can put it to use:

- You want to play music.
- You want to browse and explore music.
- You want information about what music is playing.

To start, you navigate to https://bandcamp.com and start to poke around in your browser’s developer tools. You discover a big shiny play button towards the bottom of the screen with a class attribute that contains the value "playbutton". You check that it works:

In [20]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.headless = True # set_headless() is deprecated in Selenium 3.x
assert opts.headless # Operating in headless mode
browser = Firefox(options=opts, executable_path = '/usr/local/bin/geckodriver')
browser.get('https://bandcamp.com')



In [23]:
browser.find_element_by_class_name('playbutton').click()

You should hear music! Leave it playing and move back to your web browser. Just to the side of the play button is the discovery section. Again, you inspect this section and find that each of the currently visible available tracks has a class value of "discover-item", and that each item seems to be clickable. In Python, you check this out:

In [26]:
tracks = browser.find_elements_by_class_name('discover-item')
len(tracks) # 8
tracks[3].click()

A new track should be playing! This is the first step to exploring bandcamp using Python! You spend a few minutes clicking on various tracks in your Python environment but soon grow tired of the meagre library of eight songs.

### Exploring the Catalogue

Looking back at your browser, you see the buttons for exploring all of the tracks featured in bandcamp’s music discovery section. By now, this feels familiar: each button has a class value of "item-page". The very last button is the “next” button that will display the next eight tracks in the catalogue. You go to work:

In [28]:
len(browser.find_elements_by_class_name('item-page'))

11

In [29]:
browser.find_elements_by_class_name('item-page')[0].text

'previous'

In [30]:
browser.find_elements_by_class_name('item-page')[1].text

'1'

In [34]:
for elet in browser.find_elements_by_class_name('item-page'):
    print(elet.text.lower().find('next'))

-1
-1
-1
-1
-1
-1
-1
-1
-1
-1
0


In [53]:
next_button = [e for e in browser.find_elements_by_class_name('item-page')
                    if e.text.lower().strip() == str('next')]

In [55]:
if next_button:
    next_button[0].click()

<code>for elt in next_button:
    print(elt)
for e in browser.find_elements_by_class_name('item-page'):
    if e.text.lower().find('next') > -1:
        e.click()
</code>

Great! Now you want to look at the new tracks, so you think, “I’ll just repopulate my tracks variable like I did a few minutes ago.” But this is where things start to get tricky.

First, bandcamp designed their site for humans to enjoy using, not for Python scripts to access programmatically. When you call next_button[0].click(), the real web browser responds by executing some JavaScript code.

If you try it out in your browser, you see that some time elapses as the catalogue of songs scrolls with a smooth animation effect. If you try to repopulate your tracks variable before the animation finishes, you may not get all the tracks, and you may get some that you don’t want.

What’s the solution? You can just sleep for a second, or, if you are just running all this in a Python shell, you probably won’t even notice. After all, it takes time for you to type too.
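
A fixed sleep works, but it either wastes time or turns out too short on a slow connection. One common alternative is to poll for a condition with a timeout. The helper below is a generic sketch in plain Python. `wait_until` and its parameters are names chosen here and are not part of Selenium, which ships its own `WebDriverWait` class for this purpose. The `animation_done` function is only a stand-in for a real check against the page.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} seconds'.format(timeout))
        time.sleep(poll)


# Stand-in for "the animation has finished": becomes true on the third check
checks = []


def animation_done():
    checks.append(1)
    return len(checks) >= 3


wait_until(animation_done, timeout=2.0, poll=0.01)
print(len(checks))
```

The trade-off is a little more code in exchange for returning as soon as the page is ready instead of always waiting the full second.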

Another slight kink is something that can only be discovered through experimentation. You try to run the same code again:

In [57]:
tracks = browser.find_elements_by_class_name('discover-item')
len(tracks)
#assert(len(tracks) == 8)

24

But you notice something strange. len(tracks) is not equal to 8 even though only the next batch of 8 should be displayed. Digging a little further, you find that your list contains some tracks that were displayed before. To get only the tracks that are actually visible in the browser, you need to filter the results a little.

After trying a few things, you decide to keep a track only if its x coordinate on the page falls within the bounding box of the containing element. The catalogue’s container has a class value of "discover-results". Here’s how you proceed:

In [59]:
discover_section = browser.find_element_by_class_name('discover-results')
discover_section

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="5199494b-0b3f-104b-83be-4201af34ee1b")>

In [61]:
left_x = discover_section.location['x']
left_x

130

In [62]:
right_x = left_x + discover_section.size['width']
right_x

860.5166625976562

In [70]:
discover_items = browser.find_elements_by_class_name('discover-item')

In [71]:
discover_items

[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="1e299f51-99fd-f041-950a-62e289d3a614")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="ee640349-2922-0945-baf4-70aed941d319")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="e3d6b59d-4497-f549-8380-d40c8821fd7b")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="da637cf0-6d7a-1f4e-96c0-d496e4bdebbb")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="129c805c-b288-a548-b361-f35fc6008d7e")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="2e41bfbc-3e5e-7846-acbf-207761bf4f65", element="fa5dce85-a77c-3244-9e06-01e3613a51b2")>,
 <selenium.webdriver.firefox.webelement.FirefoxWebElement 

In [72]:
tracks = [t for t in discover_items if t.location['x'] >= left_x and t.location['x'] < right_x]

In [73]:
assert len(tracks) == 8
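
The filter itself is easy to check without a browser. The sketch below uses a made-up `FakeItem` stand-in and invented coordinates, so it only demonstrates the half-open range test, not bandcamp’s real layout.

```python
from collections import namedtuple

# Stand-in for a Selenium element: only the attribute the filter reads
FakeItem = namedtuple('FakeItem', ['location'])


def visible_items(items, left_x, right_x):
    """Keep items whose x coordinate lies in the half-open range [left_x, right_x)."""
    return [t for t in items if left_x <= t.location['x'] < right_x]


# Hypothetical layout: one item off to the left, two visible, one off to the right
items = [FakeItem({'x': x, 'y': 0}) for x in (-600, 130, 495, 860.6)]
shown = visible_items(items, left_x=130, right_x=860.5)
print([t.location['x'] for t in shown])
```

An item sitting exactly on the left edge is kept, while one at or beyond the right edge is dropped.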

In [75]:
browser.close()

## Building a Class

If you are growing weary of retyping the same commands over and over again in your Python environment, you should dump some of it into a module. A basic class for your bandcamp manipulation should do the following:

- Initialize a headless browser and navigate to bandcamp
- Keep a list of available tracks
- Support finding more tracks
- Play, pause, and skip tracks

Here’s the basic code, all in one go:

In [81]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from time import sleep, ctime
from collections import namedtuple
from threading import Thread
from os.path import isfile
import csv

BANDCAMP_FRONTPAGE = 'https://bandcamp.com/'

class BandLeader():
    def __init__(self):
        # Create a headless browser
        opts = Options()
        opts.headless = True # set_headless() is deprecated in Selenium 3.x
        assert opts.headless # Operating in headless mode
        self.browser = Firefox(options=opts)
        self.browser.get(BANDCAMP_FRONTPAGE)
        
        # Track list related state
        self._current_track_number = 1
        self.track_list = []
        self.tracks()
        
    def tracks(self):
        """
        Query the page to populate a list of available tracks.
        """
        
        # Sleep to give the browser time to render and finish any animations
        sleep(1)
        
        # Get the container for the visible track list
        discover_section = self.browser.find_element_by_class_name('discover-results')
        left_x = discover_section.location['x']
        right_x = left_x + discover_section.size['width']
        
        # Filter the items in the list to include only those we can click
        discover_items = self.browser.find_elements_by_class_name('discover-item')
        self.track_list = [t for t in discover_items 
                           if t.location['x'] >= left_x and t.location['x'] < right_x]
        
        # Print the available tracks to the screen
        for (i, track) in enumerate(self.track_list):
            print('[{}]'.format(i+1))
            lines = track.text.split('\n')
            print('Album : {}'.format(lines[0]))
            print('Artist : {}'.format(lines[1]))
            if len(lines) > 2:
                print('Genre : {}'.format(lines[2]))
                
    def catalogue_pages(self):
        """
        Print the available pages, (i.e. previous, 1, 2, ..., next), 
        in the catalogue that are presently accessible.
        """
        print('PAGES')
        for e in self.browser.find_elements_by_class_name('item-page'):
            print(e.text)
        print('')
        
    def more_tracks(self, page='next'):
        """
        Advances the catalogue and repopulates the track list. 
        We can pass in a number to advance any of the available pages.
        """
        next_btn = [e for e in self.browser.find_elements_by_class_name('item-page')
                    if e.text.lower().strip() == str(page)]
        
        if next_btn:
            next_btn[0].click()
            self.tracks()
    
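    def play(self, track=None):
        """
        Play a track. If no track number is supplied, the presently selected
        track will play.
        """
        if track is None:
            # Toggle the page's main play button
            self.browser.find_element_by_class_name('playbutton').click()
        elif type(track) is int and track <= len(self.track_list) and track >= 1:
            self._current_track_number = track
            self.track_list[self._current_track_number - 1].click()

    def play_next(self):
        """
        Plays the next available track
        """
        if self._current_track_number < len(self.track_list):
            self.play(self._current_track_number+1)
        else:
            self.more_tracks()
            self.play(1)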
            
    def pause(self):
        """
        Pauses the playback
        """
        self.play()

In [82]:
instance = BandLeader()



[1]
Album : Christmas Songs Vol. 1
Artist : manchester orchestra
Genre : folk
[2]
Album : High Visceral {Part One} & {Part Two} UK/EU Limited Release
Artist : Psychedelic Porn Crumpets
Genre : rock
[3]
Album : Sinnohvation
Artist : insaneintherainmusic
Genre : jazz
[4]
Album : End Of Forever
Artist : Samsara Blues Experiment
Genre : rock
[5]
Album : Oncle Jazz
Artist : Men I Trust
Genre : electronic
[6]
Album : PRE-SALE: Totschläger (A Saintslayer's Songbook)
Artist : ABIGOR
Genre : metal
[7]
Album : Live at Roundhouse - London, UK - 10/30/17 10/30/17
Artist : Jason Isbell and the 400 Unit
Genre : country
[8]
Album : The Helm of Sorrow
Artist : Emma Ruth Rundle & Thou
Genre : metal


### Collecting Structured Data

Your final task is to keep track of the songs that you actually listened to. How might you do this? What does it mean to actually listen to something anyway? If you are perusing the catalogue, stopping for a few seconds on each song, do each of those songs count? Probably not. You are going to allow some ‘exploration’ time to factor in to your data collection.

Your goals are now to:

- Collect structured information about the currently playing track
- Keep a “database” of tracks
- Save and restore that “database” to and from disk

You decide to use a [namedtuple](https://dbader.org/blog/writing-clean-python-with-namedtuples) (they are like classes but immutable) to store the information that you track. Named tuples are good for representing bundles of attributes with no functionality tied to them, a bit like a database record:

In [83]:
TrackRec = namedtuple('TrackRec', [
    'title',
    'artist',
    'artist_url',
    'album',
    'album_url',
    'timestamp' # When you played it
])
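
To make the “like classes but immutable” point concrete, here is a short sketch. It repeats the `TrackRec` definition so that it runs on its own, and every field value is a placeholder invented for illustration.

```python
from collections import namedtuple
from time import ctime

TrackRec = namedtuple('TrackRec', [
    'title', 'artist', 'artist_url', 'album', 'album_url', 'timestamp'
])

# Placeholder values for illustration only
rec = TrackRec('Some Title', 'Some Artist', 'https://example.com/artist',
               'Some Album', 'https://example.com/album', ctime())

print(rec.title)  # Fields are accessible by name...
print(rec[1])     # ...or by position, like a regular tuple

try:
    rec.title = 'Other Title'
    mutated = True
except AttributeError:
    # Named tuples are immutable, so assignment fails
    mutated = False

# _replace() returns a modified copy and leaves the original untouched
renamed = rec._replace(title='Other Title')
print(renamed.title, '|', rec.title)
```

Because records compare by value, two `TrackRec` objects with identical fields are equal, which is handy when checking whether a record is new.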

In order to collect this information, you add a method to the BandLeader class. Checking back in with the browser’s developer tools, you find the right HTML elements and attributes to select all the information you need. Also, you only want to get information about the currently playing track if music is actually playing at the time. Luckily, the page player adds a "playing" class to the play button whenever music is playing and removes it when the music stops.

With these considerations in mind, you write a couple of methods:

In [84]:
def is_playing(self):
    """
    Returns 'True' if a track is presently playing
    """
    playbtn = self.browser.find_element_by_class_name('playbutton')
    return playbtn.get_attribute('class').find('playing') > -1 # class="playbutton playing"

def currently_playing(self):
    """
    Returns the record for the currently playing track,
    or None if nothing is playing
    """
    try:
        if self.is_playing():
            title = self.browser.find_element_by_class_name('title').text
            album_detail = self.browser.find_element_by_css_selector('.detail-album > a')
            album_title = album_detail.text
            album_url = album_detail.get_attribute('href').split('?')[0]
            artist_detail = self.browser.find_element_by_css_selector('.detail-artist > a')
            artist = artist_detail.text
            artist_url = artist_detail.get_attribute('href').split('?')[0]
            return TrackRec(title, artist, artist_url, album_title, album_url, ctime())
        
    except Exception as e:
        print('there was an error: {}'.format(e))

    return None
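
One detail about the substring test in `is_playing()`: `.find('playing')` also matches any class name that merely contains that text. Splitting the attribute into separate class names is a slightly stricter check. The helper `has_class` and the class string `'playbutton notplaying'` below are hypothetical and serve only to show the difference.

```python
def has_class(class_attr, name):
    """Return True if `name` is one of the space-separated classes in class_attr."""
    # get_attribute() can return None, so fall back to an empty string
    return name in (class_attr or '').split()


playing = has_class('playbutton playing', 'playing')
stopped = has_class('playbutton', 'playing')

# A substring search reports a match here; the token check does not
substring_hit = 'playbutton notplaying'.find('playing') > -1
token_hit = has_class('playbutton notplaying', 'playing')

print(playing, stopped, substring_hit, token_hit)
```

For the play button described above, both approaches give the same answer, so this is a robustness tweak and not a required fix.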

In [None]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from time import sleep, ctime
from collections import namedtuple
from threading import Thread
from os.path import isfile
import csv

BANDCAMP_FRONTPAGE = 'https://bandcamp.com/'

class BandLeader():
    def __init__(self):
        # Create a headless browser
        opts = Options()
        opts.headless = True # set_headless() is deprecated in Selenium 3.x
        assert opts.headless # Operating in headless mode
        self.browser = Firefox(options=opts)
        self.browser.get(BANDCAMP_FRONTPAGE)
        
        # Track list related state
        self._current_track_number = 1
        self.track_list = []
        self.tracks()
        
    def tracks(self):
        """
        Query the page to populate a list of available tracks.
        """
        
        # Sleep to give the browser time to render and finish any animations
        sleep(1)
        
        # Get the container for the visible track list
        discover_section = self.browser.find_element_by_class_name('discover-results')
        left_x = discover_section.location['x']
        right_x = left_x + discover_section.size['width']
        
        # Filter the items in the list to include only those we can click
        discover_items = self.browser.find_elements_by_class_name('discover-item')
        self.track_list = [t for t in discover_items 
                           if t.location['x'] >= left_x and t.location['x'] < right_x]
        
        # Print the available tracks to the screen
        for (i, track) in enumerate(self.track_list):
            print('[{}]'.format(i+1))
            lines = track.text.split('\n')
            print('Album : {}'.format(lines[0]))
            print('Artist : {}'.format(lines[1]))
            if len(lines) > 2:
                print('Genre : {}'.format(lines[2]))
                
    def catalogue_pages(self):
        """
        Print the available pages, (i.e. previous, 1, 2, ..., next), 
        in the catalogue that are presently accessible.
        """
        print('PAGES')
        for e in self.browser.find_elements_by_class_name('item-page'):
            print(e.text)
        print('')
        
    def more_tracks(self, page='next'):
        """
        Advances the catalogue and repopulates the track list. 
        We can pass in a number to advance any of the available pages.
        """
        next_btn = [e for e in self.browser.find_elements_by_class_name('item-page')
                    if e.text.lower().strip() == str(page)]
        
        if next_btn:
            next_btn[0].click()
            self.tracks()
    
    def play(self, track=None):
        """
        Play a track. If no track number is supplied, the presently selected
        track will play.
        """
        if track is None:
            self.browser.find_element_by_class_name('playbutton').click()
        elif type(track) is int and track <= len(self.track_list) and track >= 1:
            self._current_track_number = track
            self.track_list[self._current_track_number - 1].click()
        
        sleep(0.5)
        if self.is_playing():
            self._current_track_record = self.currently_playing()
        
    def play_next(self):
        """
        Plays the next available track
        """
        if self._current_track_number < len(self.track_list):
            self.play(self._current_track_number+1)
        else:
            self.more_tracks()
            self.play(1)
            
    def pause(self):
        """
        Pauses the playback
        """
        self.play()
        
    def is_playing(self):
        """
        Returns 'True' if a track is presently playing
        """
        playbtn = self.browser.find_element_by_class_name('playbutton')
        return playbtn.get_attribute('class').find('playing') > -1 # class="playbutton playing"

    def currently_playing(self):
        """
        Returns the record for the currently playing track,
        or None if nothing is playing
        """
        try:
            if self.is_playing():
                title = self.browser.find_element_by_class_name('title').text
                album_detail = self.browser.find_element_by_css_selector('.detail-album > a')
                album_title = album_detail.text
                album_url = album_detail.get_attribute('href').split('?')[0]
                artist_detail = self.browser.find_element_by_css_selector('.detail-artist > a')
                artist = artist_detail.text
                artist_url = artist_detail.get_attribute('href').split('?')[0]
                return TrackRec(title, artist, artist_url, album_title, album_url, ctime())

        except Exception as e:
            print('there was an error: {}'.format(e))

        return None    

Next, you’ve got to keep a database of some kind. Though it may not scale well in the long run, you can go far with a simple list. You add `self.database = []` to BandLeader’s `__init__()` method. Because you want to allow for time to pass before entering a TrackRec object into the database, you decide to use Python’s threading tools to run a separate thread that maintains the database in the background.

You’ll supply a `_maintain()` method to BandLeader instances that will run in a separate thread. The new method will periodically check the value of `self._current_track_record` and add it to the database if it is new.
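
The pattern can be tried in isolation before wiring it into the browser class. The sketch below is a simplified stand-in. The `Recorder` class, its short `interval`, the `stop()` method, and the string records are all assumptions made so that the example finishes quickly. It also stops the worker with a `threading.Event`, which is a design choice of this sketch.

```python
import threading
import time


class Recorder:
    """Minimal stand-in for the database part of a class like BandLeader."""

    def __init__(self, interval=0.01):
        self.database = []
        self.current = None  # Latest record, set by the main thread
        self._stop = threading.Event()
        self._interval = interval
        self.thread = threading.Thread(target=self._maintain, daemon=True)
        self.thread.start()

    def _maintain(self):
        # Event.wait() doubles as an interruptible sleep
        while not self._stop.is_set():
            self._update_db()
            self._stop.wait(self._interval)

    def _update_db(self):
        rec = self.current
        # Append only if there is a record and it differs from the last one stored
        if rec is not None and (not self.database or self.database[-1] != rec):
            self.database.append(rec)

    def stop(self):
        self._stop.set()
        self.thread.join()


recorder = Recorder()
recorder.current = 'track A'  # Placeholder record
time.sleep(0.2)               # Let the worker run several checks
recorder.current = 'track B'
time.sleep(0.2)
recorder.stop()
print(recorder.database)
```

Even though the worker checks many times while each record is current, the comparison with the last stored entry keeps each record from being added more than once in a row.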

You will start the thread when the class is instantiated by adding some code to `__init__()`:

In [88]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from time import sleep, ctime
from collections import namedtuple
from threading import Thread
from os.path import isfile
import csv

BANDCAMP_FRONTPAGE = 'https://bandcamp.com/'

class BandLeader():
    def __init__(self):
        # Create a headless browser
        opts = Options()
        opts.headless = True # set_headless() is deprecated in Selenium 3.x
        #assert opts.headless # Operating in headless mode
        self.browser = Firefox(options=opts)
        self.browser.get(BANDCAMP_FRONTPAGE)
        
        # Track list related state
        self._current_track_number = 1
        self.track_list = []
        self.tracks()
        
        # State for the database
        self.database = []
        self._current_track_record = None
        
        # The database maintenance thread
        self.thread = Thread(target=self._maintain)
        self.thread.daemon = True # Kills the thread when the main process dies
        self.thread.start()
        
        self.tracks()
        
    def _maintain(self):
        while True:
            self._update_db()
            sleep(20) # Check every 20 seconds
    
    def _update_db(self):
        try:
            check = (self._current_track_record is not None
                     and (len(self.database) == 0
                          or self.database[-1] != self._current_track_record)
                     and self.is_playing())
            if check:
                self.database.append(self._current_track_record)
                
        except Exception as e:
            print('error while updating the db: {}'.format(e))

    def tracks(self):
        """
        Query the page to populate a list of available tracks.
        """

        # Sleep to give the browser time to render and finish any animations
        sleep(1)

        # Get the container for the visible track list
        discover_section = self.browser.find_element_by_class_name('discover-results')
        left_x = discover_section.location['x']
        right_x = left_x + discover_section.size['width']

        # Filter the items in the list to include only those we can click
        discover_items = self.browser.find_elements_by_class_name('discover-item')
        self.track_list = [t for t in discover_items 
                           if t.location['x'] >= left_x and t.location['x'] < right_x]

        # Print the available tracks to the screen
        for (i, track) in enumerate(self.track_list):
            print('[{}]'.format(i+1))
            lines = track.text.split('\n')
            print('Album : {}'.format(lines[0]))
            print('Artist : {}'.format(lines[1]))
            if len(lines) > 2:
                print('Genre : {}'.format(lines[2]))
                
    def catalogue_pages(self):
        """
        Print the available pages, (i.e. previous, 1, 2, ..., next), 
        in the catalogue that are presently accessible.
        """
        print('PAGES')
        for e in self.browser.find_elements_by_class_name('item-page'):
            print(e.text)
        print('')
        
    def more_tracks(self, page='next'):
        """
        Advances the catalogue and repopulates the track list. 
        We can pass in a number to advance any of the available pages.
        """
        next_btn = [e for e in self.browser.find_elements_by_class_name('item-page')
                    if e.text.lower().strip() == str(page)]
        
        if next_btn:
            next_btn[0].click()
            self.tracks()
    
    def play(self, track=None):
        """
        Play a track. If no track number is supplied, the presently selected
        track will play.
        """
        if track is None:
            self.browser.find_element_by_class_name('playbutton').click()
        elif type(track) is int and track <= len(self.track_list) and track >= 1:
            self._current_track_number = track
            self.track_list[self._current_track_number - 1].click()
        
        sleep(0.5)
        if self.is_playing():
            self._current_track_record = self.currently_playing()
        
    def play_next(self):
        """
        Plays the next available track
        """
        if self._current_track_number < len(self.track_list):
            self.play(self._current_track_number+1)
        else:
            self.more_tracks()
            self.play(1)
            
    def pause(self):
        """
        Pauses the playback
        """
        self.play()
        
    def is_playing(self):
        """
        Returns 'True' if a track is presently playing
        """
        playbtn = self.browser.find_element_by_class_name('playbutton')
        return playbtn.get_attribute('class').find('playing') > -1 # class="playbutton playing"

    def currently_playing(self):
        """
        Returns the record for the currently playing track,
        or None if nothing is playing
        """
        try:
            if self.is_playing():
                title = self.browser.find_element_by_class_name('title').text
                album_detail = self.browser.find_element_by_css_selector('.detail-album > a')
                album_title = album_detail.text
                album_url = album_detail.get_attribute('href').split('?')[0]
                artist_detail = self.browser.find_element_by_css_selector('.detail-artist > a')
                artist = artist_detail.text
                artist_url = artist_detail.get_attribute('href').split('?')[0]
                return TrackRec(title, artist, artist_url, album_title, album_url, ctime())

        except Exception as e:
            print('there was an error: {}'.format(e))

        return None    

If you’ve never worked with multithreaded programming in Python, you should [read up on it!](https://dbader.org/blog/python-parallel-computing-in-60-seconds) For your present purpose, you can think of a thread as a loop that runs in the background of the main Python process (the one you interact with directly). Every twenty seconds, the loop checks a few things to see if the database needs to be updated, and if it does, it appends a new record. Pretty cool.
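
To watch the background loop in isolation, here is a small sketch with no browser involved. The `Recorder` class, its `current` attribute, and its `stop()` method are made up for this example. It also swaps the `while True` plus `sleep(20)` loop for a `threading.Event`, which is one possible way to let the loop be stopped on request, not what BandLeader does.

```python
import time
from threading import Event, Thread


class Recorder:
    """Hypothetical stand-in for BandLeader: just the list and the thread."""

    def __init__(self, interval=0.01):
        self.database = []
        self.current = None
        self._interval = interval
        self._stop = Event()
        # daemon=True means the thread will not keep the program alive
        self.thread = Thread(target=self._maintain, daemon=True)
        self.thread.start()

    def _maintain(self):
        # Event.wait() works as an interruptible sleep: it returns True
        # as soon as stop() sets the event, which ends the loop.
        while not self._stop.wait(self._interval):
            self._update_db()

    def _update_db(self):
        rec = self.current
        if rec is not None and (not self.database or self.database[-1] != rec):
            self.database.append(rec)

    def stop(self):
        self._stop.set()
        self.thread.join()


recorder = Recorder()
recorder.current = "track A"

# Give the background loop up to two seconds to notice the new record
deadline = time.monotonic() + 2
while not recorder.database and time.monotonic() < deadline:
    time.sleep(0.01)

recorder.stop()
print(recorder.database)
```

The main thread only sets `recorder.current`; the append happens in the background thread, which is the same division of labor as between `play()` and `_update_db()`.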

The very last step is saving the database and restoring from saved states. Using the csv package, you can ensure your database resides in a highly portable format and remains usable even if you abandon your wonderful BandLeader class!

The `__init__()` method needs to change once more, this time to accept a file path where you’d like to save the database. You’d like to load this database if it is available, and you’d like to save it whenever it is updated. The updates look like this:

In [89]:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
from time import sleep, ctime
from collections import namedtuple
from threading import Thread
from os.path import isfile
import csv

BANDCAMP_FRONTPAGE = 'https://bandcamp.com/'

class BandLeader():
    def __init__(self, csvpath=None):
        self.database_path = csvpath
        self.database = []
        
        # Load database from disk if possible
        if self.database_path is not None and isfile(self.database_path):
            with open(self.database_path, newline='') as dbfile:
                dbreader = csv.reader(dbfile)
                next(dbreader) # To ignore the header line
                self.database = [TrackRec._make(rec) for rec in dbreader]
        
        # Create a headless browser
        opts = Options()
        opts.add_argument('-headless') # set_headless() is deprecated in newer Selenium releases
        #assert opts.headless # Operating in headless mode
        self.browser = Firefox(options=opts)
        self.browser.get(BANDCAMP_FRONTPAGE)
        
        # Track list related state
        self._current_track_number = 1
        self.track_list = []
        self.tracks()
        
        # State for the database
        self._current_track_record = None
        
        # The database maintenance thread
        self.thread = Thread(target=self._maintain)
        self.thread.daemon = True # Kills the thread when the main process dies
        self.thread.start()
        
        self.tracks()
        
    def save_db(self):
        with open(self.database_path, 'w', newline='') as dbfile:
            dbwriter = csv.writer(dbfile)
            dbwriter.writerow(list(TrackRec._fields))
            for entry in self.database:
                dbwriter.writerow(list(entry))
                
    def _maintain(self):
        while True:
            self._update_db()
            sleep(20) # Check every 20 seconds
    
    def _update_db(self):
        try:
            check = (self._current_track_record is not None
                     and (len(self.database) == 0
                          or self.database[-1] != self._current_track_record)
                     and self.is_playing())
            if check:
                self.database.append(self._current_track_record)
                if self.database_path is not None:
                    self.save_db() # Persist the new record to the CSV file
                
        except Exception as e:
            print('error while updating the db: {}'.format(e))

    def tracks(self):
        """
        Query the page to populate a list of available tracks.
        """

        # Sleep to give the browser time to render and finish any animations
        sleep(1)

        # Get the container for the visible track list
        discover_section = self.browser.find_element_by_class_name('discover-results')
        left_x = discover_section.location['x']
        right_x = left_x + discover_section.size['width']

        # Filter the items in the list to include only those we can click
        discover_items = self.browser.find_elements_by_class_name('discover-item')
        self.track_list = [t for t in discover_items 
                           if t.location['x'] >= left_x and t.location['x'] < right_x]

        # Print the available tracks to the screen
        for (i, track) in enumerate(self.track_list):
            print('[{}]'.format(i+1))
            lines = track.text.split('\n')
            print('Album : {}'.format(lines[0]))
            print('Artist : {}'.format(lines[1]))
            if len(lines) > 2:
                print('Genre : {}'.format(lines[2]))
                
    def catalogue_pages(self):
        """
        Print the available pages, (i.e. previous, 1, 2, ..., next), 
        in the catalogue that are presently accessible.
        """
        print('PAGES')
        for e in self.browser.find_elements_by_class_name('item-page'):
            print(e.text)
        print('')
        
    def more_tracks(self, page='next'):
        """
        Advances the catalogue and repopulates the track list. 
        We can pass in a number to advance any of the available pages.
        """
        next_btn = [e for e in self.browser.find_elements_by_class_name('item-page')
                    if e.text.lower().strip() == str(page)]
        
        if next_btn:
            next_btn[0].click()
            self.tracks()
    
    def play(self, track=None):
        """
        Play a track. If no track number is supplied, the presently selected
        track will play.
        """
        if track is None:
            self.browser.find_element_by_class_name('playbutton').click()
        elif type(track) is int and track <= len(self.track_list) and track >= 1:
            self._current_track_number = track
            self.track_list[self._current_track_number - 1].click()
        
        sleep(0.5)
        if self.is_playing():
            self._current_track_record = self.currently_playing()
        
    def play_next(self):
        """
        Plays the next available track
        """
        if self._current_track_number < len(self.track_list):
            self.play(self._current_track_number+1)
        else:
            self.more_tracks()
            self.play(1)
            
    def pause(self):
        """
        Pauses the playback
        """
        self.play()
        
    def is_playing(self):
        """
        Returns 'True' if a track is presently playing
        """
        playbtn = self.browser.find_element_by_class_name('playbutton')
        return playbtn.get_attribute('class').find('playing') > -1 # class="playbutton playing"

    def currently_playing(self):
        """
        Returns the record for the currently playing track,
        or None if nothing is playing
        """
        try:
            if self.is_playing():
                title = self.browser.find_element_by_class_name('title').text
                album_detail = self.browser.find_element_by_css_selector('.detail-album > a')
                album_title = album_detail.text
                album_url = album_detail.get_attribute('href').split('?')[0]
                artist_detail = self.browser.find_element_by_css_selector('.detail-artist > a')
                artist = artist_detail.text
                artist_url = artist_detail.get_attribute('href').split('?')[0]
                return TrackRec(title, artist, artist_url, album_title, album_url, ctime())

        except Exception as e:
            print('there was an error: {}'.format(e))

        return None    

Voilà! You can listen to music and keep a record of what you hear! Amazing.

Something interesting about the above is that using a namedtuple really begins to pay off. When converting to and from CSV format, you take advantage of the order of the fields in each CSV row to fill in the fields of the TrackRec objects. Likewise, you can create the header row of the CSV file by referencing the `TrackRec._fields` attribute. This is one of the reasons using a tuple ends up making sense for columnar data.
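
The round trip between namedtuples and CSV rows can be sketched on its own. The `DemoRec` type, its three field names, and the sample rows below are invented for illustration (they are not the real TrackRec definition), and an in-memory `io.StringIO` buffer stands in for the file on disk.

```python
import csv
import io
from collections import namedtuple

# Assumed field names, for illustration only
DemoRec = namedtuple("DemoRec", ["title", "artist", "timestamp"])

records = [
    DemoRec("Song One", "Artist A", "2024-01-01 10:00"),
    DemoRec("Song Two", "Artist B", "2024-01-01 10:05"),
]

# Write: the header comes from _fields, and each namedtuple is already a row
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(DemoRec._fields)
writer.writerows(records)

# Read: skip the header, then rebuild each record with _make()
buffer.seek(0)
reader = csv.reader(buffer)
header = next(reader)
restored = [DemoRec._make(row) for row in reader]

print(header)               # ['title', 'artist', 'timestamp']
print(restored == records)  # True
```

One caveat: `csv.reader` yields every value as a string, so the comparison holds here only because all the sample fields are strings. A numeric field would need to be converted after loading.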

In [90]:
bl = BandLeader('myhistory.csv')
bl.play() # should start playing a track

[1]
Album : Christmas Songs Vol. 1
Artist : manchester orchestra
Genre : folk
[2]
Album : High Visceral {Part One} & {Part Two} UK/EU Limited Release
Artist : Psychedelic Porn Crumpets
Genre : rock
[3]
Album : Sinnohvation
Artist : insaneintherainmusic
Genre : jazz
[4]
Album : End Of Forever
Artist : Samsara Blues Experiment
Genre : rock
[5]
Album : Oncle Jazz
Artist : Men I Trust
Genre : electronic
[6]
Album : PRE-SALE: Totschläger (A Saintslayer's Songbook)
Artist : ABIGOR
Genre : metal
[7]
Album : Live at Roundhouse - London, UK - 10/30/17 10/30/17
Artist : Jason Isbell and the 400 Unit
Genre : country
[8]
Album : The Helm of Sorrow
Artist : Emma Ruth Rundle & Thou
Genre : metal

In [91]:
bl.play() # if the song is playing it will stop it

In [92]:
bl.play(3) # plays the third track in the listing

In [93]:
bl.play_next() # advances the track

In [94]:
bl.play()

In [95]:
bl.more_tracks() # see more music to play with

[1]
Album : Going to Georgia
Artist : Merge Records
[2]
Album : L'Esprit de Nyege 2020
Artist : Nyege Nyege Tapes
Genre : world
[3]
Album : The New Standards Holiday Show Album
Artist : The New Standards
Genre : jazz
[4]
Album : Live From Symphony Hall (Atlanta, GA - 11/6/15)
Artist : The Milk Carton Kids
Genre : folk
[5]
Album : Introspection
Artist : Ponies at Dawn
Genre : electronic
[6]
Album : RAVENING IRON
Artist : Eternal Champion
Genre : metal
[7]
Album : Mank (Original Musical Score) WITH EXTRAS
Artist : Trent Reznor & Atticus Ross
Genre : soundtrack
[8]
Album : Slowgirl
Artist : Violet Newman
Genre : alternative


In [96]:
bl.browser.quit() # close the webdriver instance
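
Once the browser is closed, whatever history has been written to disk is still an ordinary CSV file. As a rough sketch, assuming `save_db()` has produced `myhistory.csv` with a header row, a hypothetical `load_history()` helper could read it back without BandLeader or Selenium:

```python
import csv
from os.path import isfile


def load_history(path):
    """Return the saved rows as a list of dicts, or [] if the file is missing.

    Hypothetical helper; the dict keys are whatever the header row contains.
    """
    if not isfile(path):
        return []
    with open(path, newline="") as dbfile:
        return list(csv.DictReader(dbfile))


for row in load_history("myhistory.csv"):
    print(row)
```

`csv.DictReader` takes its keys from the header row, so this keeps working if fields are added to the record later on.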

# References
- [A Practical Introduction to Web Scraping in Python: Intro to Beautiful Soup](https://realpython.com/python-web-scraping-practical-introduction/)
- [Beautiful Soup: Build a Web Scraper With Python](https://realpython.com/beautiful-soup-web-scraper-python/)
- [Modern Web Automation With Python and Selenium](https://realpython.com/modern-web-automation-with-python-and-selenium/)
- [BandLeader GitHub](https://github.com/realpython/python-web-scraping-examples)