In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

# 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

In [18]:
# IMPORT STANDARD LIBRARIES
import re
import requests

# IMPORT 3RD-PARTY LIBRARIES
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# CONSTANTS
base_url = "https://realpython.github.io/fake-jobs/"

# SEND REQUEST
response = requests.get(url=base_url)

# DISPLAY STATUS CODE
print(f"STATUS CODE: {response.status_code}")

STATUS CODE: 200


In [3]:
# CREATE BEAUTIFUL SOUP OBJECT (EXPLICIT PARSER AVOIDS GuessedAtParserWarning)
soup = BeautifulSoup(markup=response.text, features="html.parser")

## a. Use the .find method to find the tag containing the first job title ("Senior Python Developer").
Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.

In [4]:
# GET FIRST JOB TITLE TEXT
job_title_1 = soup.find(name="h2").text
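
Searching by tag name alone works here, but the hint also suggests using a class. The sketch below shows a class-based lookup on a small inline fragment. The fragment and its class names (`card-content`, `title`, `company`) are assumptions about the page's markup and should be checked in the browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one job card on the page.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# class_ matches any tag whose class list contains "title",
# so the h3 with class "subtitle" is not selected.
title_tag = sample_soup.find("h2", class_="title")

# get_text(strip=True) drops leading and trailing whitespace.
first_title = title_tag.get_text(strip=True)
print(first_title)
```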

## b. Now, use what you did for the first title, but extract the job title for all jobs on this page.
Store the results in a list.

In [5]:
# GET ALL JOB TITLE TEXTS
job_title_list = [title.text for title in soup.find_all(name="h2")]

## c. Finally, extract the companies, locations, and posting dates for each job.
For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.

In [6]:
# GET ALL COMPANY TEXTS
company_list = [company.text.strip() for company in soup.find_all(name="h3")]

# GET ALL LOCATION TEXTS
location_list = [location.text.strip() for location in soup.find_all(name="p", attrs={"class": "location"})]

# GET ALL POSTING DATE TEXTS
posting_date_list = [posting_date.text.strip() for posting_date in soup.find_all(name="time")]
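
Building separate lists relies on every job card containing all four fields in the same order. A possible alternative, sketched below, walks the page one card at a time so that each row's fields stay together. The inline fragment and the `card-content` class are assumptions about the page's structure, not confirmed markup.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two job cards, including the
# surrounding whitespace that makes stripping necessary.
sample_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">
    Christopherville, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for card in sample_soup.find_all("div", class_="card-content"):
    # Each lookup is scoped to the current card, so fields cannot drift apart.
    records.append({
        "JOB_TITLE": card.find("h2").get_text(strip=True),
        "COMPANY": card.find("h3").get_text(strip=True),
        "LOCATION": card.find("p", class_="location").get_text(strip=True),
        "POSTING_DATE": card.find("time").get_text(strip=True),
    })

print(records[0])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.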

## d. Take the lists that you have created and combine them into a pandas DataFrame.

In [7]:
# CREATE A DATAFRAME
jobs_df = pd.DataFrame(data={
    "JOB_TITLE": job_title_list,
    "COMPANY": company_list,
    "LOCATION": location_list,
    "POSTING_DATE": posting_date_list
})

# DISPLAY TOP 5 ROWS
jobs_df.head()

| | JOB_TITLE | COMPANY | LOCATION | POSTING_DATE |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 |


# 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

## a. First, use the BeautifulSoup find_all method to extract the urls.

In [8]:
# GET APPLY BUTTON LINKS
apply_link_list_1 = []
for apply_link in soup.find_all(name="a"):
    if apply_link.text == "Apply":
        apply_link_list_1.append(apply_link.get("href"))

apply_link_list_1

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi
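
The loop-and-compare approach can also be written with the `string` argument of `find_all`, which filters tags by their exact text. This is a minimal sketch on an inline fragment. The fragment, including the "Learn" link target, is a hypothetical stand-in for one card footer.

```python
from bs4 import BeautifulSoup

# Hypothetical card footer with two links; only one is the "Apply" button.
sample_html = """
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# string="Apply" keeps only <a> tags whose text is exactly "Apply".
apply_links = [
    link["href"] for link in sample_soup.find_all("a", string="Apply")
]
print(apply_links)
```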

In [9]:
job_title_list

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer',
 'Software Engineer (Python)',
 'Interpreter',
 'Architect',
 'Meteorologist',
 'Audiological scientist',
 'English as a second language teacher',
 'Surgeon',
 'Equities trader',
 'Newspaper journalist',
 'Materials engineer',
 'Python Programmer (Entry-Level)',
 'Product/process development scientist',
 'Scientist, research (maths)',
 'Ecologist',
 'Materials engineer',
 'Historic buildings inspector/conservation officer',
 'Data scientist',
 'Psychiatrist',
 'Structural engineer',
 'Immigration officer',
 'Python Programmer (Entry-Level)',
 'Neurosurgeon',
 'Broadcast engineer',
 'Make',
 'Nurse, adult',
 'Air broker',
 'Editor, film/video',
 'Production assistant, radio',
 'Engineer, communications',
 'Sales executive',
 'Software Deve

## b. Next, get those same urls in a different way.
Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [59]:
job_title = "Python/Programmer (Entry-Level)"

re.sub(pattern=r"[^\w\s]", repl="-", string=job_title)

'Python-Programmer -Entry-Level-'

In [69]:
# GET APPLY BUTTON LINKS
base_url = "https://realpython.github.io/fake-jobs/jobs/"

apply_link_list_2 = []
# "senior-python-developer-0.html"
for index, job_title in enumerate(job_title_list):
    # REPLACE SPACE OR FORWARD SLASH WITH DASH
    job_title_fmt = re.sub(pattern=r"[\s/]", repl="-", string=job_title)

    # REMOVE ANY OTHER SYMBOLS
    job_title_fmt = re.sub(pattern=r"[^-\w]", repl="", string=job_title_fmt)
    
    # SET IN URL
    apply_link = base_url + f"{job_title_fmt.lower()}-{index}.html"

    # APPEND LINK TO LIST
    apply_link_list_2.append(apply_link)

# VALIDATE BOTH LISTS ARE THE SAME
apply_link_list_1 == apply_link_list_2

True
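
A single `True` or `False` does not say where two lists disagree. The sketch below wraps the same two-step cleaning in a function and reports any differing positions, which can help when debugging the pattern. The names `JOBS_BASE_URL`, `build_apply_url`, and `find_mismatches` are hypothetical helpers introduced for illustration.

```python
import re

# Hypothetical constant name; the value is the base URL used above.
JOBS_BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def build_apply_url(job_title, index):
    """Build the detail-page URL for a job title at a given list position."""
    # Spaces and forward slashes become dashes.
    slug = re.sub(r"[\s/]", "-", job_title)
    # Everything that is not a dash or a word character is dropped.
    slug = re.sub(r"[^-\w]", "", slug)
    return f"{JOBS_BASE_URL}{slug.lower()}-{index}.html"


def find_mismatches(expected, actual):
    """Return (index, expected, actual) tuples for positions that differ."""
    if len(expected) != len(actual):
        raise ValueError(f"length mismatch: {len(expected)} vs {len(actual)}")
    return [
        (i, exp, act)
        for i, (exp, act) in enumerate(zip(expected, actual))
        if exp != act
    ]


# Two scraped URLs copied from the earlier output, with their titles and positions.
scraped = [
    "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
    "https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html",
]
built = [
    build_apply_url("Senior Python Developer", 0),
    build_apply_url("Software Engineer (Python)", 10),
]

print(find_mismatches(scraped, built))  # prints []
```

An empty list means the constructed URLs match the scraped ones. Any tuple in the result points at the exact title that needs more cleaning.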

# 3. Finally, we want to get the job description text for each job.

## a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html.
Using BeautifulSoup, extract the job description paragraph.
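
One way to approach this is with a CSS selector. The sketch below assumes the description is the first `<p>` directly inside a `<div class="content">` on the detail page. Both that structure and the placeholder text are assumptions, so inspect the real page before relying on them.

```python
from bs4 import BeautifulSoup

# Hypothetical detail-page fragment; the description text is a placeholder.
sample_html = """
<div class="content">
  <p>Placeholder description paragraph.</p>
  <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
  <p id="date"><strong>Posted:</strong> 2021-04-08</p>
</div>
"""

detail_soup = BeautifulSoup(sample_html, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
description_tag = detail_soup.select_one("div.content > p")
description = description_tag.get_text(strip=True)
print(description)
```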

## b. We want to be able to do this for all pages.
Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".
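
A possible shape for such a function is sketched below. It separates parsing from fetching so the parsing step can be tried on any HTML string without a network call. The names `parse_description` and `get_description`, the `timeout` parameter, and the assumption that the description is the first `<p>` inside `<div class="content">` are all illustrative rather than confirmed.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the description text from a job page's HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the description lives in the first <p> of div.content.
    content_div = soup.find("div", class_="content")
    if content_div is None:
        return None
    paragraph = content_div.find("p")
    if paragraph is None:
        return None
    return paragraph.get_text(strip=True)


def get_description(url, timeout=10):
    """Fetch a job page and return its description text."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_description(response.text)


# Offline check on an inline fragment (no request is sent here).
sample_html = '<div class="content"><p> Sample text. </p><p id="date">x</p></div>'
print(parse_description(sample_html))
```

If the markup assumption holds, calling `get_description` with the television floor manager URL should return the string quoted in the task.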

## c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.
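
The pattern can be sketched as follows. The column names `APPLY_URL` and `DESCRIPTION` are hypothetical, and `get_description_stub` is a stand-in that returns canned text so the example runs without network access. In the notebook it would be replaced by the real function from part b.

```python
import pandas as pd


def get_description_stub(url):
    """Stand-in for a real fetch-and-parse function; returns canned text."""
    page_name = url.rsplit("/", 1)[-1]
    return f"Description for {page_name}"


# Small frame with two of the scraped URLs; APPLY_URL is a hypothetical column name.
jobs_sample = pd.DataFrame({
    "APPLY_URL": [
        "https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html",
        "https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html",
    ]
})

# Series.apply calls the function once per URL and collects the results.
jobs_sample["DESCRIPTION"] = jobs_sample["APPLY_URL"].apply(get_description_stub)
print(jobs_sample["DESCRIPTION"].tolist())
```

With a real fetching function, each row triggers one HTTP request, so running this over all jobs will take noticeably longer than the stub.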