<center><a target="_blank" href="https://learning.constructor.org/"><img src="https://drive.google.com/uc?id=1wxkbM60NlBlkbGK1JqUypKL24RrTiiYk" width="200" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Web Scraping with BeautifulSoup </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>Constructor Learning, 2023.</center>


<a id='SU' name="SU"></a>
## [Introduction to Web Scraping](#P0)

As we all know, the internet is a tremendous source of information, whatever your interest is. Ideally, every piece of interesting data would be available to download in the form of a csv file or equivalent, so that we could read it directly with pandas and start analyzing it. Of course, this is not the case.

The aim of this session is to demonstrate how you can extract information from web pages. This task is called **web scraping** or **web crawling**.

In Python, we use mainly 4 packages for web scraping/crawling:

- [Requests](https://requests.readthedocs.io/en/master/user/quickstart/): a package for making HTTP requests. In the context of web scraping, it fetches the content of a web page, which can then be analyzed with an HTML parser.
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/): a library for extracting information from HTML documents (and XML documents, which are not relevant here). This is exactly what we need, since HTML is the standard language for writing web pages.
- [Selenium](https://selenium-python.readthedocs.io): a browser-automation library, useful for web pages whose content is rendered dynamically (e.g. you have to click buttons or scroll down before the data appears).
- [Scrapy](https://scrapy.org): a library for building web spiders, i.e. scripts that perform web crawling.

In this course, we learn how to use requests in combination with BeautifulSoup to extract online data. To do so, we first need to understand how information is structured on a web page.

### The components of a web page

When we visit a web page, our web browser makes a request to a web server. The server then sends back files that tell our browser how to render the page for us. The files fall into 4 main types:

* **HTML**: contains the main content of the page.
* **CSS**: adds styling to make the page look nicer.
* **JS**: JavaScript files add interactivity to web pages.
* **Images**: image formats such as JPG and PNG allow web pages to show pictures.


After our browser receives all the files, it renders the page and displays it to us. A lot happens behind the scenes to render a page nicely, but we don't need to worry about most of it when web scraping. We are interested in the main content of the web page: the HTML.

<a id='P1' name="P1"></a>
## [Sample HTML](#P0)

HyperText Markup Language (HTML) is the language that web pages are written in. Unlike Python, HTML is not a programming language: it is a **markup language** that tells browsers how to lay out content. HTML lets you do things similar to what you do in a word processor like Microsoft Word: make text bold, create paragraphs, and so on. Because HTML is not a programming language, it is not nearly as complex as Python.

You can write HTML in the code cells of your Jupyter notebook by using the `%%HTML` magic. Let's do that:

In [8]:
# other cell magics exist as well, e.g. %%time and %%writefile

In [9]:
%%HTML

<html>
<hr>
<head> I am in the header!! </head>
<body>
<h2> This is a title </h2>
<h4> Sample HTML </h4>

<p>This is a <b>paragraph</b>. <br> Where we can write about some stuff.</p>
<p>This is another paragraph. And here we can write about some more stuff,
   as well as include <a href=http://dataquestio.github.io/web-scraping-pages/simple.html>links</a>.</p>
<hr>
</body>
</html>

In HTML, information is organized as nested elements. Each HTML element is delimited by tags, for example:

- `<html> ... </html>` defines the root element: it tells the web browser that everything inside of it is HTML.
- Right inside the `<html>` tag, we have the `<head> ... </head>` and `<body> ... </body>` tags.
- `<p> ... </p>` defines a paragraph, and any text inside the tag is shown as a separate paragraph.
- `<a> ... </a>` declares a link to another web page. Its `href` attribute determines where the link goes.
- `<div>`: indicates a division, or area, of the page.
- `<b>`: bolds any text inside.
- `<i>`: italicizes any text inside.
- `<table>`: creates a table.
- `<form>`: creates an input form.

Check out the full list of tags [here](https://developer.mozilla.org/en-US/docs/Web/HTML/Element).

The best way to learn about HTML is to explore the source code of actual web pages, which you can do in your browser. The procedure differs slightly between browsers; in Google Chrome, you can just press `Ctrl`/`Cmd` + `Shift` + `C`. This is the result you get if you visit [jobs.ch](https://www.jobs.ch/en/vacancies/?term=Data%20Engineer) (searching for Data Engineer):

<img src="https://drive.google.com/uc?id=1BN4XwaGu7Mg7Z6LR8hEZXXVhdf6iNdUp" width="60%" style="background:none; border:none; box-shadow:none;" />

The panel on the right side of the window shows you the raw HTML document. You can recognize the tags that we talked about, and many more. While web scraping, we will alternate between this window and the Jupyter notebook window.

HTML elements can also contain **attributes**. Attributes provide extra information about an HTML element. Every attribute is made up of two parts: a name and a value. Common attributes include `id`, `title`, `class` and `style`. These attributes are specified within the opening tag. For example, check out this close-up from the source code of the same page:

<img src="https://drive.google.com/uc?id=1I6exoU9KIzaXnHk9fC_w0btIMsOXCE7_" width="60%" style="background:none; border:none; box-shadow:none;" />

`id` assigns a unique name to an HTML element, while `class` assigns HTML elements to groups. These attributes make elements easier to target when we are scraping. One element can have multiple classes, and a class can be shared between elements. Each element can only have one `id`, and that `id` can only be used once on a page. `class` and `id` are optional, and not all elements will have them.
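To make the idea of attributes concrete, here is a minimal sketch that lists the attributes of every opening tag in a small, made-up HTML snippet. It uses only Python's built-in `html.parser` module; the `AttributeLister` class and the sample markup are illustrative names, not part of the jobs.ch page.

```python
from html.parser import HTMLParser


class AttributeLister(HTMLParser):
    """Collects a (tag, attributes) pair for every opening tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


# Hypothetical snippet: one id, and a class shared by two elements
sample = (
    '<div id="results">'
    '<p class="job featured">Data Engineer</p>'
    '<p class="job">Data Analyst</p>'
    '</div>'
)

lister = AttributeLister()
lister.feed(sample)

for tag, attrs in lister.seen:
    print(tag, attrs)
```

Note how the first paragraph carries two classes in a single space-separated `class` value, while the `id` appears only once.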

<a id='P2' name="P2"></a>
## [The requests library](#P0)

The first thing we need to do to scrape a web page is to download it. We can do so using the Python `requests` library, which makes a `GET` request to a web server and downloads the HTML contents of a given web page for us. The request returns a `Response` object. Its `status_code` attribute indicates whether the page was downloaded successfully: a `status_code` of 200 means success. We won't fully dive into status codes here, but a status code starting with 2 generally indicates success, and a code starting with 4 or 5 indicates an error.
The raw page content is available as `page.content` (bytes), and the decoded text as `page.text`.
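As a rough illustration of how status codes group into families, the sketch below uses the standard library's `http.HTTPStatus` enumeration. The helper name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase for the code
    if 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    elif 500 <= code < 600:
        family = "server error"
    else:
        family = "informational"
    return f"{code} {phrase} ({family})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a real response you would pass `page.status_code` to such a helper.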

In [10]:
import requests

Let's start with a simple page:

In [11]:
# making a request to a webpage
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
print(page)
print(page.status_code) ## prints 200 as status code
print(page.content) ## prints the entire webpage content

<Response [200]>
200
b'<!DOCTYPE html>\n<html>\n    <head>\n        <title>A simple example page</title>\n    </head>\n    <body>\n        <p>Here is some simple content for this page.</p>\n    </body>\n</html>'


In [12]:
print(page.text)

<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>


- When making a request to an external service, your system will need to wait for an answer before moving on.
- By default, requests waits indefinitely, so you should almost always specify a `timeout` duration to prevent your script from getting stuck:

In [13]:
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html", timeout=15)

The `timeout` argument sets how many seconds `requests.get()` waits for the server to respond before giving up and raising a `requests.exceptions.Timeout` exception.
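A timeout is most useful when the resulting exception is handled. The following is a minimal sketch, assuming you want a failed download to return `None` rather than stop the script; the function name `fetch_html` is hypothetical.

```python
import requests


def fetch_html(url, timeout=15):
    """Return the page text, or None if the request fails or times out."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.exceptions.RequestException as error:
        # Timeout, ConnectionError and HTTPError all derive from RequestException
        print(f"Request failed: {error}")
        return None
    return response.text
```

For example, `fetch_html("https://dataquestio.github.io/web-scraping-pages/simple.html")` should return the HTML of the simple page as a string when the site is reachable.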

Let's see the content for the jobs.ch page:

In [14]:
page = requests.get("https://www.jobs.ch/en/vacancies/?term=data%20engineer", timeout=15)
print(page)
print(page.status_code)
print(page.content[:5000])

<Response [200]>
200
b'<!doctype html>\n<!--[if IE]><html lang="en" class="ie9-or-less"><![endif]-->\n<html lang="en">\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0" />\n\n    <!-- No cache for index.html -->\n    <meta http-equiv="Pragma" content="no-cache" />\n    <meta http-equiv="cache-control" content="no-cache, must-revalidate" />\n\n    <title>1306 Data engineer jobs - jobs.ch</title>\n\n    <!-- Preconnects : only first party origin and used everywhere! -->\n\n    <link rel="preconnect" href="https://media.jobs.ch/" />\n    \n\n    \n    <link rel="preconnect" href="//c.jobs.ch" />\n    \n\n    <!-- Fonts -->\n    \n      <link as="font" type="font/woff2" href="/public/fonts/ab0d2f2149fcf1f8e377.woff2" importance="high" fetchpriority="high" crossorigin />\n    \n      <link as="font" type="font/woff2" href="/public/fonts/bbee25c411374ef3ede3.woff2" importance="high" fetchpriority="high" crossorigin />\n    \n    

As you can see, finding elements within `page.content` is difficult. That is where BeautifulSoup will assist us.

<a id='P3' name="P3"></a>
## [Parsing a page with BeautifulSoup](#P0)

BeautifulSoup is an HTML and XML parser: it analyses the contents of an HTML or XML document and provides methods to access specific elements of the document. In this section, we will parse `page.content` using BeautifulSoup and create a pandas DataFrame that contains:

- job title
- posting date
- company
- location
- link to job ad

### 1. import libraries and parse the page content

In [15]:
from bs4 import BeautifulSoup
import re # a library to do regular expression matching
import pandas as pd
import numpy as np

In [16]:
soup = BeautifulSoup(page.content, "html.parser") # converts the page content into a beautifulsoup object

In [17]:
type(soup)

bs4.BeautifulSoup

We now have a BeautifulSoup object that contains all the content of the original web page in a format that is much easier to access. For example, you can read the `href` attribute of a `<link>` element, or extract all the text content of the page with `get_text()`:

In [18]:
soup.find('link').get('href')

'https://media.jobs.ch/'

In [19]:
soup.find_all('link')[7].get('href')

'https://www.jobs.ch/fr/offres-emplois/?term=data%20engineer'

In [20]:
soup.get_text()[:5000]

"\n\n\n\n\n\n\n\n\n1306 Data engineer jobs - jobs.ch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nYou are currently using an obsolete browser which is no longer 100% supported. It can cause display problems.Please upgrade your browser.Some alternatives: Firefox, Chrome.Skip to contentjobs.ch Navigation logoFind a jobExplore companiesCompare salariesRecruiter AreaDeutschFrançaisEnglishLoginFind a jobExplore companiesCompare salariesRecruiter AreaMy accountLoginDon't have an account yet?Sign upSelect LanguageDeutschFrançaisEnglishProfession, keywords or companydata engineerPlace of work or regionChoose a regionSearchAll filtersPublished sinceWorkloadOccupational fieldEmployment typeLanguageAll filtersReset filters1\u202f306 Data engineer job offersBy dateBy relevancePublished: 24 October 202324 OctData EngineerFlawil80% – 100%Unlimited employmentBÜCHI Labortechnik AG Published: 12 November 202312 NovData Engineer (a) in Züri

In [21]:
soup.get_text()[:500]

"\n\n\n\n\n\n\n\n\n1306 Data engineer jobs - jobs.ch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nYou are currently using an obsolete browser which is no longer 100% supported. It can cause display problems.Please upgrade your browser.Some alternatives: Firefox, Chrome.Skip to contentjobs.ch Navigation logoFind a jobExplore companiesCompare salariesRecruiter AreaDeutschFrançaisEnglishLoginFind a jobExplore companiesCompare salariesRecruiter AreaMy accountLoginDon't have an account yet?Sign upSele"

You can find the first element that has tag `<a>...</a>` by typing:

In [22]:
soup.find('a')

<a class="A-sc-1q4zv2a-0 hPPUKb Link-sc-czsz28-2 bzpUGN" href="https://www.mozilla.org/en-US/firefox/new/" rel="noopener nofollow" target="_blank">Firefox</a>

And you can find all elements that have tags `<a>...</a>` by typing:

In [23]:
soup.find_all('a')[:3]

[<a class="A-sc-1q4zv2a-0 hPPUKb Link-sc-czsz28-2 bzpUGN" href="https://www.mozilla.org/en-US/firefox/new/" rel="noopener nofollow" target="_blank">Firefox</a>,
 <a class="A-sc-1q4zv2a-0 hPPUKb Link-sc-czsz28-2 bzpUGN" href="https://www.google.com/intl/en_us/chrome/" rel="noopener nofollow" target="_blank">Chrome</a>,
 <a class="A-sc-1q4zv2a-0 Header___StyledA-sc-z7prqv-1 biTpub iqmgeY" href="#skip-link-target">Skip to content</a>]

You can also filter tags by their attributes:

In [24]:
soup.find_all('a', {'data-cy': 'job-link'})[0]

<a class="Link__ExtendedRR6Link-sc-czsz28-1 kZzJcl Link-sc-czsz28-2 VacancyLink___StyledLink-sc-ufp08j-0 bzpUGN zoplL" data-cy="job-link" href="/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search" tabindex="0" title="Data Engineer"><div aria-selected="true" class="Div-sc-1cpunnt-0 VacancySerpItemUpdated__StyledSerpItem-sc-i0986f-1 hBFpng dFEfiy" data-cy="vacancy-serp-item-active"><div class="Div-sc-1cpunnt-0 Flex-sc-mjmi48-0 bHInWw" data-cy="serp-item-6cdc0c68-ad20-4b9d-b806-a3bd69296f35"><p class="P-sc-hyu5hk-0 Text__p2-sc-1lu7urs-10 Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 VacancySerpItemUpdated___StyledText-sc-i0986f-3 geNnlb iAfDeY"><span class="Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 ftUOUz eEFkdA">Published: 24 October 2023</span><span aria-hidden="true" class="Span-sc-1ybanni-0" title="24 October 2023">24 Oct</span></p></div><button aria-label="Save" class="Button-sc-zfgt48-0 IconButton__StyledButton-sc-4tlh

In [25]:
len(soup.find_all('a', {'data-cy': 'job-link'}))

21

Once you have an element of interest, you can list its attributes with `.attrs` and read a single attribute with `.get()`:

In [30]:
soup.find_all('a',  {'data-cy': 'job-link'})[0].attrs

{'data-cy': 'job-link',
 'tabindex': '0',
 'title': 'Data Engineer',
 'class': ['Link__ExtendedRR6Link-sc-czsz28-1',
  'kZzJcl',
  'Link-sc-czsz28-2',
  'VacancyLink___StyledLink-sc-ufp08j-0',
  'bzpUGN',
  'zoplL'],
 'href': '/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'}

In [25]:
soup.find_all('a',  {'data-cy': 'job-link'})[0].get('href')

'/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'
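Because the live page changes over time, it can help to try the same calls on a small, fixed snippet first. The sketch below uses made-up HTML that only mimics the structure of a results page; the variable name `demo` is chosen so that it does not overwrite the `soup` object used in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two job links and one unrelated link
html = """
<ul>
  <li><a data-cy="job-link" href="/jobs/1" title="Data Engineer">Data Engineer</a></li>
  <li><a data-cy="job-link" href="/jobs/2" title="Data Analyst">Data Analyst</a></li>
  <li><a href="/about">About us</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

first_link = demo.find("a")                               # first <a> element only
all_links = demo.find_all("a")                            # list of every <a> element
job_links = demo.find_all("a", {"data-cy": "job-link"})   # filtered by attribute

print(len(all_links), len(job_links))
print(first_link.attrs)                                   # all attributes as a dict
print([link.get("href") for link in job_links])           # one attribute per element
print(demo.find("a", {"data-cy": "missing"}))             # find() gives None if nothing matches
```

The last line is worth remembering: `find` returns `None` when no element matches, while `find_all` returns an empty list.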

Now that we know how to extract data from a webpage, let's find the relevant html elements for our project. We need to locate them in the original page:

### 2. Browse the source code in a web browser

The HTML code inspector is slightly different depending on your web browser. Here we use Chrome. With Chrome, we can select the following icon at the top of the code inspector tab:

<img src="https://drive.google.com/uc?id=1piq5H0kGqufsgeKQTs5VifXCGwDBVs1e" width="20%" style="background:none; border:none; box-shadow:none;" />

This allows us to select any component of the page and find the corresponding element in the html code. Let's do this for the list of jobs:

<img src="https://drive.google.com/uc?id=1gpxxDzK0WZR73rLGNVIHkHEqP17X1CN_" width="60%" style="background:none; border:none; box-shadow:none;" />

After navigating within the HTML element tree, we observe that each job box is a separate `article` element whose `class` attribute contains `VacancySerpItemUpdated__ShadowBox-sc-i0986f-0`.

<img src="https://drive.google.com/uc?id=1p3_xg7cS8Era6dl4dHKsePxAtprMNeYR" width="60%" style="background:none; border:none; box-shadow:none;" />

Note that websites get updated, so the name of the class might be different by the time you follow this tutorial!

Here, we can get the html element for a job by typing:

In [53]:
len(soup.find_all('article', {'class' : 'VacancySerpItemUpdated__ShadowBox-sc-i0986f-0'}))

20

In [54]:
one_job_ad = soup.find('article', {'class' : 'VacancySerpItemUpdated__ShadowBox-sc-i0986f-0'})

In [55]:
one_job_ad

<article class="Div-sc-1cpunnt-0 VacancySerpItemUpdated__ShadowBox-sc-i0986f-0 lojXzj" data-cy="serp-item"><a class="Link__ExtendedRR6Link-sc-czsz28-1 kZzJcl Link-sc-czsz28-2 VacancyLink___StyledLink-sc-ufp08j-0 bzpUGN zoplL" data-cy="job-link" href="/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search" tabindex="0" title="Data Engineer"><div aria-selected="true" class="Div-sc-1cpunnt-0 VacancySerpItemUpdated__StyledSerpItem-sc-i0986f-1 hBFpng dFEfiy" data-cy="vacancy-serp-item-active"><div class="Div-sc-1cpunnt-0 Flex-sc-mjmi48-0 bHInWw" data-cy="serp-item-6cdc0c68-ad20-4b9d-b806-a3bd69296f35"><p class="P-sc-hyu5hk-0 Text__p2-sc-1lu7urs-10 Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 VacancySerpItemUpdated___StyledText-sc-i0986f-3 geNnlb iAfDeY"><span class="Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 ftUOUz eEFkdA">Published: 24 October 2023</span><span aria-hidden="true" class="Span-sc-1ybanni-0" title="24 October 2023">

Please note: by the time we updated this notebook, the exact value of the class had already changed (at one point from `VacancySerpItem__ShadowBox-qr45cp-0` to `VacancySerpItem__ShadowBox-ppntto-0`), hence the difference between the code and the screenshot above.

To find all of them, we can write:

In [50]:
job_ads = soup.find_all('article', {'class' : 'Div-sc-1cpunnt-0'})

In [51]:
len(job_ads)

20

We obtain a list of 20 job ads! Each item of the list contains the HTML element describing one job ad:

In [32]:
job_ads[-1]

<article class="Div-sc-1cpunnt-0 VacancySerpItemUpdated__ShadowBox-sc-i0986f-0 dHEdIW" data-cy="serp-item"><a class="Link__ExtendedRR6Link-sc-czsz28-1 kZzJcl Link-sc-czsz28-2 VacancyLink___StyledLink-sc-ufp08j-0 bzpUGN zoplL" data-cy="job-link" href="/en/vacancies/detail/a8760059-ff9f-4574-b997-245e0ca61132/?source=vacancy_search" tabindex="0" title="Ingenieur/in Aeronautical Information System"><div class="Div-sc-1cpunnt-0 VacancySerpItemUpdated__StyledSerpItem-sc-i0986f-1 hBFpng jVoBIW" data-cy="vacancy-serp-item"><div class="Div-sc-1cpunnt-0 Flex-sc-mjmi48-0 bHInWw" data-cy="serp-item-a8760059-ff9f-4574-b997-245e0ca61132"><p class="P-sc-hyu5hk-0 Text__p2-sc-1lu7urs-10 Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 VacancySerpItemUpdated___StyledText-sc-i0986f-3 geNnlb iAfDeY"><span class="Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 ftUOUz eEFkdA">Published: 08 November 2023</span><span aria-hidden="true" class="Span-sc-1ybanni-0" title="08 November 2

Now that we have identified the data blocks for the jobs, we can extract the individual pieces of information. Let's start with the link to the job ad. It is located within an `<a>` tag, and we are interested in its `href` attribute:

In [33]:
job_ads[0].find('a').get('href')

'/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'

Since this is the only `<a>` tag inside a job ad box, the link can be found without specifying any attribute. It is still better to specify the attribute, so that we are sure to pick the right link:

In [34]:
job_ads[0].find('a', {'data-cy' : 'job-link'}).get('href')

'/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'

The link we get is relative to the site root, so to open the job ad page from outside the site we need to prefix it with `https://www.jobs.ch`:

In [35]:
'https://www.jobs.ch' + job_ads[0].find('a',{'data-cy' : 'job-link'}).get('href')

'https://www.jobs.ch/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'

In [36]:
job_ads[0].find('a',{'data-cy' : 'job-link'}).get('href').split('/')[4]

'6cdc0c68-ad20-4b9d-b806-a3bd69296f35'
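String concatenation works here because the `href` starts with `/`. As a more general alternative, the sketch below uses `urljoin` and `urlsplit` from the standard library to build the absolute link and to pull out the job identifier; the variable names are illustrative.

```python
from urllib.parse import urljoin, urlsplit

page_url = "https://www.jobs.ch/en/vacancies/?term=data%20engineer"
href = "/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search"

# urljoin resolves the relative href against the page it was found on
full_url = urljoin(page_url, href)

# the identifier is the last segment of the path, ignoring the query string
job_id = urlsplit(full_url).path.strip("/").split("/")[-1]

print(full_url)
print(job_id)
```

Taking the last path segment avoids relying on a fixed position such as `split('/')[4]`, which would break if the site added or removed a path level.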

Let's check that it works for the other links:

In [37]:
['https://www.jobs.ch' + job.find('a',{'data-cy' : 'job-link'}).get('href') for job in job_ads]

['https://www.jobs.ch/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/ac85350c-40a5-4f77-bea5-c8df34991e0a/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/d87b7c1a-5cca-45b2-a23c-151ab641f3c8/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/3f633030-82d2-4e3b-8278-1d066b47fbcd/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/2e95ec17-5aaa-417d-a6c5-10e377c00101/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/2f2efbf2-9a79-4312-b916-dcbcd9a624b2/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/6b2be353-98ff-4f59-b042-c73d4d412903/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/c804f640-0bc3-49a1-8746-9d2ed5552b00/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/4deeb12d-85f6-47bb-bca5-5361071ef512/?source=vacancy_search',
 'https://www.jobs.ch/en/vacancies/detail/da181874-9de9-40aa-84e

We can also see that the `<a>` element has an attribute called `title`, which corresponds to the title of the ad:

In [None]:
# show attributes

In [56]:
for job in job_ads:
    print(job.find('a', {'data-cy' : 'job-link'}).get('title'))

Data Engineer
Data Engineer (a) in Zürich
Data Engineer in Medical Imaging
Platform Engineer (w/m/d)
Frontend Engineer #1908
Building Service Layout Engineer (E3D Designer)
Senior Engineer SPS/HMI 100 % (m/w/d)
Requirements Engineer / Product Owner Dokumentenmanagement (a) (80-100%)
Scientific Engagement Engineer
Mechanical Engineer
SYSTEM ENGINEER / PROJECT MANAGER (A)
DESIGN ENGINEER
Innovation & Development Engineer (w/m/d) 80-100 %
Senior AVEVA PI Engineer (w/m/d)
Dynamics 365 / Power Platform Engineer (60 - 100%)
Traktionsingenieur 80 - 100 % (m/w/d)
Robotics Engineer (m/w) 100%
Ingenieur/-in / Verkehrsplanung
Data & Analytics Principal Engineer (m/f/x)
Ingenieur/in Aeronautical Information System


We also find that the name of the company, the location of the job and the date of posting are all text inside elements that carry the class `Span-sc-1ybanni-0`, and that they always come in the following order:

- date of publication
- job title
- location
- company

We can then extract the data using `find_all`:

In [57]:
job_ads[0].find('a',{'data-cy' : 'job-link'}).find_all("span", {"class":"Span-sc-1ybanni-0"})

[<span class="Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 ftUOUz eEFkdA">Published: 24 October 2023</span>,
 <span aria-hidden="true" class="Span-sc-1ybanni-0" title="24 October 2023">24 Oct</span>,
 <span class="Span-sc-1ybanni-0 iiJtIM"><svg aria-hidden="true" class="Svg-sc-t66izd-0 Icon__Svg-sc-1yrtjf1-0 cazQux" fill="currentColor" focusable="false" viewbox="0 0 24 24"><path d="M17 3H7c-1.1 0-2 .9-2 2v16l7-3 7 3V5c0-1.1-.9-2-2-2zm0 15-5-2.18L7 18V5h10v13z"></path></svg></span>,
 <span class="Span-sc-1ybanni-0 Text__span-sc-1lu7urs-12 Text-sc-1lu7urs-13 VacancySerpItemUpdated___StyledText2-sc-i0986f-4 fvFrdm eqJMMs iDyMWN">Data Engineer</span>,
 <span class="Span-sc-1ybanni-0 cKKlDN"><svg aria-hidden="true" class="Svg-sc-t66izd-0 Icon__Svg-sc-1yrtjf1-0 cazQux" fill="currentColor" focusable="false" viewbox="0 0 24 24"><path d="M12 12c.55 0 1.021-.196 1.413-.588.391-.391.587-.862.587-1.412a1.93 1.93 0 0 0-.587-1.413A1.928 1.928 0 0 0 12 8c-.55 0-1.02.196-1.412.587A1.9

In [58]:
info = job_ads[6].find('a',{'data-cy' : 'job-link'}).find_all(["span", "p"], {"class":"Span-sc-1ybanni-0"})

This creates a list with the relevant HTML elements:

In [59]:
len(info)

12

In [60]:
[i.get_text().strip() for i in info]

['Published: 07 November 202307 Nov',
 'Published: 07 November 2023',
 '07 Nov',
 '',
 'Senior Engineer SPS/HMI 100 % (m/w/d)',
 '',
 'Zürich - Spreitenbach',
 '',
 '100%',
 '',
 'Unlimited employment',
 'CHRONOS Personalberatung']

We can then get the text part of each element by using the `text` attribute:

In [61]:
info[0].text

'Published: 07 November 202307 Nov'
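The list above contains empty strings, because some spans only hold icons. A minimal sketch of one way to clean this up, using `get_text(strip=True)` on a made-up snippet (the markup and the name `demo` are illustrative):

```python
from bs4 import BeautifulSoup

# Made-up snippet: the middle span holds only an icon, so it has no text
html = (
    "<a>"
    "<span>Published: 07 November 2023</span>"
    "<span><svg></svg></span>"
    "<span>  Senior Engineer  </span>"
    "</a>"
)
demo = BeautifulSoup(html, "html.parser")

# strip=True removes surrounding whitespace from each piece of text
texts = [span.get_text(strip=True) for span in demo.find_all("span")]

# keep only the elements that actually contain text
non_empty = [text for text in texts if text]

print(texts)
print(non_empty)
```

Filtering out the empty entries changes the positions of the remaining ones, so it is safer to do this before relying on the order of the list.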

<div style="background:#EEEDF5;border-top:0.1cm solid #EF475B;border-bottom:0.1cm solid #EF475B;">
    <div style="margin-left: 0.5cm;margin-top: 0.5cm;margin-bottom: 0.5cm;color:#303030">
        <p><strong>Goal:</strong> Collect the Info about more jobs from more pages</p>
    </div>
</div>

Each piece of information can also be targeted directly: the job title, the company, the location and the date of publication each sit in an element with its own distinguishing class or attribute, so we can extract them one at a time with `find`:

### Find Title of the job, Company name, City and Date it was published

In [62]:
# find title
job_ads[0].find('span', {"class":"eqJMMs"}).text

'Data Engineer'

In [63]:
# find company
job_ads[0].find('div', {'class': 'gNMZoz'}).find('strong').text

'BÜCHI Labortechnik AG'

In [64]:
# find date
job_ads[0].find('span', {"class":"Span-sc-1ybanni-0", 'aria-hidden': True}).text

'24 Oct'

**Notice that we can also require the span to have the `aria-hidden` attribute (passing `True` matches any value), which is the case for the date.**

We need this because the class `Span-sc-1ybanni-0` is shared by many other tags.
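The sketch below reproduces this situation on a small, made-up snippet: two spans share a class, and only one has `aria-hidden`. It also shows the equivalent CSS selector via `select_one`, as an alternative way to write the same filter. The class names and the variable `demo` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up snippet: both spans carry the class "Span-demo"
html = """
<p>
  <span class="Span-demo Text-demo">Published: 24 October 2023</span>
  <span class="Span-demo" aria-hidden="true" title="24 October 2023">24 Oct</span>
</p>
"""
demo = BeautifulSoup(html, "html.parser")

# matching on one class also finds elements that have additional classes
by_class = demo.find_all("span", {"class": "Span-demo"})

# True means "the attribute must be present, whatever its value"
with_aria = demo.find("span", {"class": "Span-demo", "aria-hidden": True})

# the same filter written as a CSS selector
via_css = demo.select_one("span.Span-demo[aria-hidden]")

print(len(by_class))
print(with_aria.text, "|", via_css.text)
print(with_aria.get("title"))  # the full date, including the year
```

The `title` attribute of the date span holds the full date, which may be more useful than the abbreviated text if you later need the year.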

In [75]:
# find city
job_ads[0].find('p', {"class":"cMsiEU"}).text

'Flawil'

In [73]:
# find workload percentage
job_ads[0].find('p', {"class":"cMsiEU"}).find_next('p').text

'80% – 100%'

Attribute values, such as the link, are read with `.get()` rather than `.text`:

In [67]:
job_ads[0].find('a',{'data-cy' : 'job-link'}).get('href')

'/en/vacancies/detail/6cdc0c68-ad20-4b9d-b806-a3bd69296f35/?source=vacancy_search'

### Create a DataFrame with all the information

In [None]:
# how do we collect this information for every job on the page?

In [68]:
def get_jobdata_page(job_ads):
    """Collect title, company, area, date and link for every job ad on one page."""

    cols = ["title", "company", "area", "date", "link"]
    rows = []

    for job in job_ads:

        # the href is relative, so we prepend the site root
        url = 'https://www.jobs.ch' + job.find('a', {'data-cy': 'job-link'}).get('href')

        # find() returns None when nothing matches, so .text raises an
        # AttributeError; in that case we store a missing value instead
        try:
            title = job.find('span', {"class": "eqJMMs"}).text
        except AttributeError:
            title = np.nan

        try:
            company = job.find('div', {'class': 'gNMZoz'}).find('strong').text
        except AttributeError:
            company = np.nan

        try:
            area = job.find('p', {"class": "cMsiEU"}).text
        except AttributeError:
            area = np.nan

        try:
            date = job.find('span', {"class": "Span-sc-1ybanni-0", 'aria-hidden': True}).text
        except AttributeError:
            date = np.nan

        rows.append({
            'title': title,
            'company': company,
            'area': area,
            'date': date,
            'link': url})

    # build the DataFrame once, instead of concatenating inside the loop
    return pd.DataFrame(rows, columns=cols)
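The repeated `try`/`except AttributeError` blocks all guard against the same thing: `find` returning `None`. One possible way to shorten them is a small helper, sketched here on a made-up snippet; `text_or_none` is a hypothetical name, not a BeautifulSoup function, and it returns `None` rather than `np.nan`.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, attrs):
    """Return the stripped text of the first matching element, or None."""
    element = parent.find(name, attrs)
    if element is None:
        return None
    return element.get_text(strip=True)


# Made-up job box that has a title but no company
demo = BeautifulSoup(
    '<article><span class="title"> Data Engineer </span></article>',
    "html.parser",
)

print(text_or_none(demo, "span", {"class": "title"}))
print(text_or_none(demo, "span", {"class": "company"}))
```

pandas treats `None` in a column of strings as a missing value, so the resulting DataFrame can still be checked with `isna()`.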

In [69]:
df = get_jobdata_page(job_ads)

In [70]:
df

| | title | company | area | date | link |
|---|---|---|---|---|---|
| 0 | Data Engineer | BÜCHI Labortechnik AG | Flawil | 24 Oct | https://www.jobs.ch/en/vacancies/detail/6cdc0c... |
| 1 | Data Engineer (a) in Zürich | Axept Business Software AG | Zürich | 12 Nov | https://www.jobs.ch/en/vacancies/detail/ac8535... |
| 2 | Data Engineer in Medical Imaging | Universität Bern | Bern | 02 Nov | https://www.jobs.ch/en/vacancies/detail/d87b7c... |
| 3 | Platform Engineer (w/m/d) | Migros Online | Ecublens | 14 Nov | https://www.jobs.ch/en/vacancies/detail/3f6330... |
| 4 | Frontend Engineer #1908 | Competec Service AG | Mägenwil | 20 Nov | https://www.jobs.ch/en/vacancies/detail/2e95ec... |
| 5 | Building Service Layout Engineer (E3D Designer) | Hitachi Zosen Inova AG | Zürich (Schweiz) | 21 Nov | https://www.jobs.ch/en/vacancies/detail/2f2efb... |
| 6 | Senior Engineer SPS/HMI 100 % (m/w/d) | CHRONOS Personalberatung | Zürich - Spreitenbach | 07 Nov | https://www.jobs.ch/en/vacancies/detail/6b2be3... |
| 7 | Requirements Engineer / Product Owner Dokument... | aity AG | Bern-Liebefeld | 13 Nov | https://www.jobs.ch/en/vacancies/detail/c804f6... |
| 8 | Scientific Engagement Engineer | Dectris AG | Baden-Daettwil | 10 Nov | https://www.jobs.ch/en/vacancies/detail/4deeb1... |
| 9 | Mechanical Engineer | Universität Bern | Bern | 21 Nov | https://www.jobs.ch/en/vacancies/detail/da1818... |
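To collect jobs from more than one page, as the goal above suggests, the same steps can be repeated in a loop. The sketch below is only an outline under a loud assumption: that the site paginates with a `page` query parameter, which is not confirmed anywhere in this notebook and should be checked in the browser first. The function names and the `fetch`/`parse` arguments are hypothetical placeholders for your own download and parsing code.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.jobs.ch/en/vacancies/"


def build_page_url(term, page):
    """Build the search URL for one results page."""
    # ASSUMPTION: pagination uses a `page` query parameter; verify on the site
    return f"{BASE_URL}?{urlencode({'term': term, 'page': page})}"


def collect_pages(term, n_pages, fetch, parse, delay=1.0):
    """Fetch up to n_pages result pages and gather the parsed rows.

    fetch(url) should return the HTML as a string, or None on failure.
    parse(html) should return a list of rows for that page.
    """
    rows = []
    for page in range(1, n_pages + 1):
        html = fetch(build_page_url(term, page))
        if html is None:
            break  # stop at the first page that cannot be downloaded
        rows.extend(parse(html))
        time.sleep(delay)  # pause between requests to avoid overloading the server
    return rows


print(build_page_url("data engineer", 2))
```

In this notebook, `parse` could wrap the `BeautifulSoup(...)`, `find_all('article', ...)` and `get_jobdata_page` steps shown above, returning the rows of each page.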


### Saving Extracted Data

In [None]:
df.to_csv("data_engineer_jobs.csv")
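By default, `to_csv` also writes the DataFrame index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below illustrates the difference on a tiny made-up DataFrame, writing to in-memory buffers instead of files; the data and variable names are illustrative.

```python
import io

import pandas as pd

demo_df = pd.DataFrame({"title": ["Data Engineer"], "company": ["Example AG"]})

# default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the actual data columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

If the index carries no information, as is the case here, passing `index=False` to `to_csv` keeps the saved file cleaner.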

<a id='P4' name="P4"></a>
## [Conclusion](#P0)

You now know the most important commands for working with requests and BeautifulSoup. Have fun extracting data from web pages!
If you would like to go further, here are some resources you should check out:

* [Web Scraping Tutorial](https://www.dataquest.io/blog/web-scraping-tutorial-python/)
* [Selenium](https://www.seleniumhq.org/), a library for JavaScript-rendered pages
* [Scrapy](https://scrapy.org), a Python library for web crawling

But first, let's practice!

<div style="border-top:0.1cm solid #EF475B"></div>
    <strong><a href='#Q0'><div style="text-align: right"> <h3>End of this Notebook.</h3></div></a></strong>