# Web Scraper using Beautiful Soup test_1.0

This notebook contains basic web scraping code that uses the **requests** and **Beautiful Soup** packages. For this test, the [Fake Python](https://realpython.github.io/fake-jobs/) job site is used.

### Access Website and Read Elements

In [38]:
# import required packages
import requests
from bs4 import BeautifulSoup

Send an HTTP GET request to the URL and retrieve the HTML data that the server sends back. The response's **.headers** attribute gives access to information about the retrieved data. According to the 'Content-Type' header, the retrieved data is 'text/html'.

In [39]:
# URL of the website we are going to scrape
URL = "https://realpython.github.io/fake-jobs/"
# Request the HTML data from the URL
page = requests.get(URL)
page.headers
#print(page.text)

{'Connection': 'keep-alive', 'Content-Length': '5721', 'Server': 'GitHub.com', 'Content-Type': 'text/html; charset=utf-8', 'permissions-policy': 'interest-cohort=()', 'Last-Modified': 'Mon, 12 Apr 2021 09:01:55 GMT', 'Access-Control-Allow-Origin': '*', 'ETag': 'W/"60740c83-197ed"', 'expires': 'Wed, 17 May 2023 04:06:48 GMT', 'Cache-Control': 'max-age=600', 'Content-Encoding': 'gzip', 'x-proxy-cache': 'MISS', 'X-GitHub-Request-Id': 'ACB6:4CDF:3AE6E50:5ABBFDE:64645080', 'Accept-Ranges': 'bytes', 'Date': 'Wed, 17 May 2023 05:29:03 GMT', 'Via': '1.1 varnish', 'Age': '0', 'X-Served-By': 'cache-dfw-kdfw8210048-DFW', 'X-Cache': 'HIT', 'X-Cache-Hits': '1', 'X-Timer': 'S1684301343.330582,VS0,VE46', 'Vary': 'Accept-Encoding', 'X-Fastly-Request-ID': '83f46ae9feb72fc9fba7b5205f68e5908a84e60c'}
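
The 'Content-Type' value above carries two pieces of information: the media type and the character set. As a minimal sketch, the standard library's `email.message.EmailMessage` can split them. The header string below is copied from the output above rather than fetched live, and the variable names are illustrative.

```python
from email.message import EmailMessage

# Header value copied from the response headers shown above
content_type = "text/html; charset=utf-8"

# EmailMessage parses MIME-style headers such as Content-Type
msg = EmailMessage()
msg["Content-Type"] = content_type

media_type = msg.get_content_type()    # media type without its parameters
charset = msg.get_content_charset()    # value of the charset parameter

print(media_type, charset)
```

With a live response, the same string would come from `page.headers["Content-Type"]`.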

Beautiful Soup is a Python library for pulling data out of HTML and XML files. Create a Beautiful Soup object and parse the HTML page content with Python's built-in HTML parser.

In [40]:
# Create the BeautifulSoup object "s"
# 1st argument is the HTML content requested earlier
# 2nd argument is the appropriate parser, in this case the HTML parser
s = BeautifulSoup(page.content, "html.parser")
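
To see what the parser produces without a network request, the sketch below parses a small hand-written HTML string (an assumption standing in for `page.content`) with the same `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page content
html = "<html><head><title>Fake Python</title></head><body><p>Hello</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the parsed tree
page_title = soup.title.string
first_paragraph = soup.p.get_text()

print(page_title)
print(first_paragraph)
```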

Find the HTML element that contains all the job listings by its ID. You can use the **.prettify()** method to see all the HTML contained within that div.

In [41]:
results = s.find(id="ResultsContainer")
#print(results.prettify())
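
Note that **.find()** returns `None` when no element matches, so a misspelled ID only surfaces later as an `AttributeError`. A minimal sketch with a hand-written HTML snippet (not the real page) shows both cases:

```python
from bs4 import BeautifulSoup

# Hand-written snippet, assumed for illustration only
html = '<div id="ResultsContainer"><p>job listings go here</p></div>'
soup = BeautifulSoup(html, "html.parser")

found = soup.find(id="ResultsContainer")
missing = soup.find(id="resultscontainer")  # ID values are matched case-sensitively

print(found is not None)
print(missing is None)
```

Checking for `None` right after the lookup makes this kind of mistake easier to spot.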

Within the **ResultsContainer** element, every job posting is wrapped in a div element with the class **card-content**. Use **.find_all()** on the **results** Beautiful Soup object to get the HTML for every job.

In [42]:
job_elements = results.find_all("div", class_="card-content")
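
**.find_all()** returns a list-like `ResultSet`, so `len()` and indexing work on it. The sketch below uses a simplified, hand-written snippet. `.select()` with a CSS selector is shown as an equivalent alternative, not as what this notebook uses.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the results container (assumed structure)
html = """
<div id="ResultsContainer">
  <div class="card-content"><h2 class="title">Job A</h2></div>
  <div class="card-content"><h2 class="title">Job B</h2></div>
  <div class="card-footer">not a job</div>
</div>
"""
results = BeautifulSoup(html, "html.parser").find(id="ResultsContainer")

# Keyword filter on the class attribute
cards = results.find_all("div", class_="card-content")
# The same query written as a CSS selector
cards_css = results.select("div.card-content")

print(len(cards), len(cards_css))
```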


Use **.find()** to access the child elements of each job. Use **.text** to return the text content of an HTML element and **.strip()** to remove leading and trailing whitespace.

In [43]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Energy engineer
Vasquez-Davidson
Christopherville, AA

Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA

Fitness centre manager
Savage-Bradley
East Seanview, AP

Product manager
Ramirez Inc
North Jamieview, AP

Medical technical officer
Rogers-Yates
Davidville, AP

Physiological scientist
Kramer-Klein
South Christopher, AE

Textile designer
Meyers-Johnson
Port Jonathan, AE

Television floor manager
Hughes-Williams
Osbornetown, AE

Waste management officer
Jones, Williams and Villa
Scotttown, AP

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Interpreter
Gregory and Sons
Ramireztown, AE

Architect
Clark, Garcia and Sosa
Figueroaview, AA

Meteorologist
Bush PLC
Kelseystad, AA

Audiological scientist
Salazar-Meyers
Williamsburgh, AE

English as a second language teacher
Parker, Murphy and Brooks
Mitchellburgh, AE

Surgeon
Cruz-Brown
West Jessicabury, AA

Equities trader
Macdonald-Ferguson
Maloneshire, AE
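
Printing is fine for inspection, but collecting the fields into a list of dictionaries makes the data reusable. This is a minimal sketch on hand-written HTML that mimics the page's class names. For elements that hold a single piece of text, `get_text(strip=True)` gives the same result as `.text.strip()`.

```python
from bs4 import BeautifulSoup

# Hand-written card that mimics the page's class names (assumed structure)
html = """
<div class="card-content">
  <h2 class="title"> Senior Python Developer </h2>
  <h3 class="company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

jobs = []
for card in soup.find_all("div", class_="card-content"):
    jobs.append({
        "title": card.find("h2", class_="title").get_text(strip=True),
        "company": card.find("h3", class_="company").get_text(strip=True),
        "location": card.find("p", class_="location").get_text(strip=True),
    })

print(jobs)
```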


### Search for specific jobs on the page

Next, we find the job titles that include the word "scientist". All the job titles are in h2 tags, so we use **.find_all()** to search the h2 tags for the string "scientist". A lambda function converts the text of each h2 tag to lower case, which avoids problems with case sensitivity.

In [44]:
spec_jobs = results.find_all("h2", string=lambda text: "scientist" in text.lower())
print(spec_jobs)

[<h2 class="title is-5">Physiological scientist</h2>, <h2 class="title is-5">Audiological scientist</h2>, <h2 class="title is-5">Product/process development scientist</h2>, <h2 class="title is-5">Scientist, research (maths)</h2>, <h2 class="title is-5">Data scientist</h2>, <h2 class="title is-5">Scientist, forensic</h2>]
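
The lambda matters because passing a plain string to `string=` matches the tag's text exactly. A small sketch on hand-written HTML contrasts the two. The `text is not None` guard is an added precaution for tags that have no single text node, not something the page is known to require.

```python
from bs4 import BeautifulSoup

# Hand-written titles, assumed for illustration only
html = """
<h2 class="title">Data scientist</h2>
<h2 class="title">Scientist, forensic</h2>
<h2 class="title">Architect</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text of the tag
exact = soup.find_all("h2", string="scientist")

# A function receives each tag's text and decides whether it matches
partial = soup.find_all(
    "h2",
    string=lambda text: text is not None and "scientist" in text.lower(),
)
titles = [h2.get_text() for h2 in partial]

print(len(exact), titles)
```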


The result above lists the job titles that include the word "scientist". To access the other data for these jobs, we need to go up three levels from each h2 tag to its **parent** element with **class="card-content"**. A list comprehension is used to collect these elements while looping through the selected jobs.

In [45]:
spec_job_elements = [
    h2_title.parent.parent.parent for h2_title in spec_jobs
]

for job_element in spec_job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Physiological scientist
Kramer-Klein
South Christopher, AE

Audiological scientist
Salazar-Meyers
Williamsburgh, AE

Product/process development scientist
Gomez-Carroll
Marktown, AA

Scientist, research (maths)
Manning, Welch and Herring
Laurenland, AE

Data scientist
Thomas Group
Port Robertfurt, AA

Scientist, forensic
Gonzalez LLC
Colehaven, AP
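
Chaining **.parent** three times depends on the exact nesting depth of the page. As an alternative sketch, `.find_parent()` walks upward until it meets a matching ancestor. The HTML below is a simplified, hand-written approximation of one job card, not the page's full markup.

```python
from bs4 import BeautifulSoup

# Simplified approximation of one job card (assumed structure)
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title">Data scientist</h2>
      <h3 class="company">Thomas Group</h3>
    </div>
  </div>
  <div class="content"><p class="location">Port Robertfurt, AA</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_title = soup.find("h2", class_="title")

# Fixed number of steps up the tree
by_chain = h2_title.parent.parent.parent
# Search upward for the nearest ancestor with the wanted class
by_search = h2_title.find_parent("div", class_="card-content")

location = by_search.find("p", class_="location").get_text(strip=True)
print(by_chain is by_search, location)
```

Both approaches reach the same element here. The search by class keeps working if a wrapper div is added or removed, as long as the class name stays the same.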

