In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
import re

## Webscraping

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert it into a pandas DataFrame.

1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  
a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  

In [2]:
# read in website for webscraping
URL = 'https://realpython.github.io/fake-jobs/'

headers = {
    "User-Agent": "MyPythonScraper"
}

response = requests.get(URL, headers=headers)

In [3]:
# check for connection
response.status_code

200

In [4]:
# inspect source HTML
response.text[:500]

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>'

In [5]:
# create 'soup' object from fake jobs website scrape
fake_jobs_soup = BeautifulSoup(response.text, 'html.parser')

In [15]:
# display readable webscrape 
print(fake_jobs_soup.prettify()[:500])

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="c


In [7]:
# locate the tag holding the first job title and extract its text
fake_jobs_soup.find('h2').get_text(strip=True)

'Senior Python Developer'

b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  

In [8]:
# scrape and clean all fake job titles
fake_job_titles = fake_jobs_soup.find_all('h2')
job_names = [x.get_text(strip=True) for x in fake_job_titles]
job_names[:10]

['Senior Python Developer',
 'Energy engineer',
 'Legal executive',
 'Fitness centre manager',
 'Product manager',
 'Medical technical officer',
 'Physiological scientist',
 'Textile designer',
 'Television floor manager',
 'Waste management officer']

c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  

In [9]:
# scrape all fake job companies
fake_job_companies = fake_jobs_soup.find_all('h3')
fake_job_companies[:5]

[<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>,
 <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>,
 <h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>,
 <h3 class="subtitle is-6 company">Savage-Bradley</h3>,
 <h3 class="subtitle is-6 company">Ramirez Inc</h3>]

In [10]:
# extract the clean text of each company tag
company_names = [x.get_text(strip=True) for x in fake_job_companies]
company_names[:10]

['Payne, Roberts and Davis',
 'Vasquez-Davidson',
 'Jackson, Chambers and Levy',
 'Savage-Bradley',
 'Ramirez Inc',
 'Rogers-Yates',
 'Kramer-Klein',
 'Meyers-Johnson',
 'Hughes-Williams',
 'Jones, Williams and Villa']

In [11]:
# scrape and clean all fake job locations
fake_job_locations = fake_jobs_soup.find_all(attrs={'class': 'location'})
location_names = [x.get_text(strip=True) for x in fake_job_locations]
location_names[:10]

['Stewartbury, AA',
 'Christopherville, AA',
 'Port Ericaburgh, AA',
 'East Seanview, AP',
 'North Jamieview, AP',
 'Davidville, AP',
 'South Christopher, AE',
 'Port Jonathan, AE',
 'Osbornetown, AE',
 'Scotttown, AP']

In [12]:
# scrape and clean all fake job datetimes
fake_job_datetimes = fake_jobs_soup.find_all('time')
dates = [x.get_text(strip=True) for x in fake_job_datetimes]
dates[:10]

['2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08',
 '2021-04-08']
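
Scraping each field with its own find_all call only works if every job card contains every field; if one card lacked a location, the lists would fall out of step. A minimal sketch of a card-by-card alternative follows. The sample HTML is a hand-written stand-in modeled on the tags shown above, and the `card-content` wrapper class and the `parse_cards` and `text_or_none` helpers are assumptions for illustration, not something confirmed by the output in this notebook.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the tags seen above; the "card-content"
# wrapper class is an assumption, so check it against the real page source.
SAMPLE_HTML = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_cards(html):
    """Hypothetical helper: build one dict per job card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="card-content"):
        rows.append({
            "Job Title": text_or_none(card.find("h2")),
            "Company Name": text_or_none(card.find("h3", class_="company")),
            "Location": text_or_none(card.find(class_="location")),
            "Date": text_or_none(card.find("time")),
        })
    return rows


rows = parse_cards(SAMPLE_HTML)
```

A missing field shows up as None in its own row instead of shifting every later value up by one position.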

d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [13]:
data = {
    'Job Title': job_names, 
    'Company Name': company_names, 
    'Location': location_names, 
    'Date': dates
}

fake_jobs_df = pd.DataFrame(data)
fake_jobs_df.head()

|   | Job Title | Company Name | Location | Date |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 |
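
pd.DataFrame raises a ValueError when the lists in the dictionary differ in length, which is what would happen if one of the scraping calls matched more or fewer tags than the others. A small sketch of a sanity check before building the frame; `build_jobs_frame` is a hypothetical helper and the short lists are stand-ins for the scraped ones.

```python
import pandas as pd


def build_jobs_frame(columns):
    """Hypothetical helper: check that all lists align, then build the frame."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every column's length so the odd one out is easy to spot.
        raise ValueError(f"column lengths differ: {lengths}")
    return pd.DataFrame(columns)


# Stand-in data; in the notebook these would be job_names, company_names, etc.
sample = {
    "Job Title": ["Senior Python Developer", "Energy engineer"],
    "Company Name": ["Payne, Roberts and Davis", "Vasquez-Davidson"],
    "Location": ["Stewartbury, AA", "Christopherville, AA"],
    "Date": ["2021-04-08", "2021-04-08"],
}
jobs = build_jobs_frame(sample)
```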


2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.   
    

a. First, use the BeautifulSoup find_all method to extract the urls.  
    

In [17]:
fake_urls = fake_jobs_soup.find_all('a', attrs={'class': 'card-footer-item'}, string='Apply')
fake_urls[:10]

[<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/product-manager-4.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.git

b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [40]:
# pull the href attribute from each "Apply" link
fake_url_filter = [url['href'] for url in fake_urls]
fake_url_filter[:10]

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html']

In [41]:
fake_jobs_df['URLs'] = fake_url_filter
fake_jobs_df.head()

|   | Job Title | Company Name | Location | Date | URLs |
|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... |
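
For part b, the same URLs can be rebuilt from the job titles and the row index. The sketch below assumes a pattern inferred from the URLs displayed above: the title lowercased, every run of non-alphanumeric characters replaced by a single hyphen, then a hyphen, the row number, and `.html`. `BASE_URL`, `slugify`, and `build_url` are hypothetical names, and the result should be compared against the extracted hrefs rather than trusted.

```python
import re

# Prefix shared by the extracted URLs shown above.
BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def slugify(title):
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_url(title, index):
    """Assumed pattern: <base><slug>-<row index>.html"""
    return f"{BASE_URL}{slugify(title)}-{index}.html"


# Stand-in for the first few entries of job_names.
titles = ["Senior Python Developer", "Energy engineer", "Legal executive"]
built = [build_url(title, i) for i, title in enumerate(titles)]
```

In the notebook, the list would be built from `job_names` and checked with `built == fake_url_filter`. Collapsing punctuation runs is what handles titles such as "Software Engineer (Python)", which becomes the slug `software-engineer-python` under this assumption.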


3. Finally, we want to get the job description text for each job.  
    a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.  
    

In [61]:
# read in website for webscraping
URL1 = 'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'

headers1 = {
    "User-Agent": "MyPythonScraper"
}

response1 = requests.get(URL1, headers=headers1)

In [62]:
# check for connection
response1.status_code

200

In [63]:
# create 'soup' object from fake jobs website scrape
fake_description = BeautifulSoup(response1.text, 'html.parser')

In [66]:
senior_dev_description = fake_description.find('div', attrs= {'class' : 'content'}).p.text
senior_dev_description 

'Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.'

b. We want to be able to do this for all pages. Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.".  
    

In [73]:
# fetch a job page and return the text of its description paragraph
def url_description(url):
    headers = {"User-Agent": "MyPythonScraper"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    job_description = soup.find('div', attrs={'class': 'content'}).p.text
    return job_description

In [74]:
url_description('https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html')

'At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along.'
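
The function above assumes that the request succeeds and that the page contains a `div` with class `content` holding a `p` tag. A more defensive sketch separates fetching from parsing so the parsing step can be checked without a network call. `parse_description`, `fetch_description`, and the 10-second timeout are illustrative choices, not part of the exercise.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the first paragraph inside div.content, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None or content.p is None:
        return None
    return content.p.get_text(strip=True)


def fetch_description(url, timeout=10):
    """Download a job page and return its description text."""
    response = requests.get(
        url, headers={"User-Agent": "MyPythonScraper"}, timeout=timeout
    )
    # Raise on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return parse_description(response.text)


# Offline check with a hand-written snippet shaped like the job page.
sample_page = '<div class="content"><p> Example description. </p></div>'
description = parse_description(sample_page)
```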

c. Use the [.apply method](https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html) on the url column you created above to retrieve the description text for all of the jobs.
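
Part c can be sketched as a single `.apply` call on the url column. To keep the example runnable without network access, the function that fetches a description is passed in as a parameter; `add_descriptions`, `fake_fetch`, and the `Description` column name are assumptions for illustration. In the notebook, `url_description` would take the place of `fake_fetch`.

```python
import pandas as pd


def add_descriptions(df, fetch, url_column="URLs"):
    """Return a copy of df with a Description column built by calling fetch on each URL."""
    result = df.copy()
    result["Description"] = result[url_column].apply(fetch)
    return result


def fake_fetch(url):
    """Stand-in for url_description so the sketch runs offline."""
    return f"description for {url}"


sample_df = pd.DataFrame(
    {"URLs": ["https://example.com/a.html", "https://example.com/b.html"]}
)
with_descriptions = add_descriptions(sample_df, fake_fetch)

# In the notebook this would be:
# fake_jobs_df = add_descriptions(fake_jobs_df, url_description)
```

Series.apply calls the function once per row, so with the real fetching function this issues one HTTP request per job.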