### Web Scraping Exercise - Fake Jobs

In this exercise, you'll practice using BeautifulSoup to parse the content of a web page. The page that you'll be scraping, https://realpython.github.io/fake-jobs/, contains job listings. Your job is to extract the data on each job and convert into a pandas DataFrame.

##### Question 1

Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.  
a. Use the .find method to find the tag containing the first job title ("Senior Python Developer"). Hint: can you find a tag type and/or a class that could be helpful for extracting this information? Extract the text from this title.  
b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.  
c. Finally, extract the companies, locations, and posting dates for each job. For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.  
d. Take the lists that you have created and combine them into a pandas DataFrame. 

In [110]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from io import StringIO
import re

In [111]:
# Send get request to the fake jobs URL
URL = 'https://realpython.github.io/fake-jobs/'
response = requests.get(URL)

In [112]:
# Convert response to BeautifulSoup object
soup = BeautifulSoup(response.text)

In [113]:
# Extracting the text from the first job title (navigating to the first card, then finding the text whose tag is h2)
soup.find('div', attrs={'class' : 'card'}).h2.text

'Senior Python Developer'

In [126]:
# Extracting the job title for all jobs on this page
jobs = soup.findAll('div', attrs={'class' : 'card'})
titles_text = [x.h2.text.strip() for x in jobs]

# Extracting the companies for each job
companies = soup.findAll('div', attrs={'class' : 'card'})
companies_text = [x.h3.text.strip() for x in jobs]

# Extracting the locations for each job
locations = soup.findAll('div', attrs={'class' : 'card'})
companies_text = [x.p.text.strip() for x in jobs]

# Extracting the posting date for each job
dates = soup.findAll('div', attrs={'class' : 'card'})
companies_text = [x.time.text.strip() for x in jobs]

In [127]:
jobs_df = pd.DataFrame({'Job_Title':titles_text, 'Company':companies_text, 'Location':locations_text, 'Posting_Date':dates_text})
jobs_df.head(2)

Unnamed: 0,Job_Title,Company,Location,Posting_Date
0,Senior Python Developer,2021-04-08,"Stewartbury, AA",2021-04-08
1,Energy engineer,2021-04-08,"Christopherville, AA",2021-04-08


##### Question 2

Next, add a column that contains the url for the "Apply" button. Try this in two ways.   
    a. First, use the BeautifulSoup find_all method to extract the urls.  
    b. Next, get those same urls in a different way. Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.

In [128]:
# Adding a column that contains the url for the "Apply" button via find_all
jobs_df['URL'] = [x.findAll('a')[1].get('href') for x in jobs]

In [129]:
pd.set_option('display.max_colwidth', None)
jobs_df[jobs_df['Job_Title'].isin(['Software Engineer (Python)','Scientist, research (maths)'])]

Unnamed: 0,Job_Title,Company,Location,Posting_Date,URL
10,Software Engineer (Python),2021-04-08,"Ericberg, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html
22,"Scientist, research (maths)",2021-04-08,"Laurenland, AE",2021-04-08,https://realpython.github.io/fake-jobs/jobs/scientist-research-maths-22.html


In [136]:
# Adding a column that contains the url for the "Apply" button via pattern matching

# First, remove all parentheses and commas from job titles
titles_stripped = [re.sub(r'[(),]', '', title) for title in titles_text]
# Next, replace any slashes with spaces
titles_stripped = [re.sub(r'[\/]', ' ', title) for title in titles_stripped]

# Define the base URL and extension
base_URL = 'https://realpython.github.io/fake-jobs/jobs/'
extension = '.html'

# Build up the URLs using list comprehension, starting with base URL, adding job title with spaces replaced by hyphens followed by index, and ending with extension
built_URLs = [base_URL
              + job.lower().replace(' ','-')+'-'+str(titles_stripped.index(job))
              + extension for job in titles_stripped]

built_URLs

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi

In [137]:
assert jobs_df['URL'].to_list() == built_URLs

jobs_df['URL'].to_list()

['https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html',
 'https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html',
 'https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html',
 'https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html',
 'https://realpython.github.io/fake-jobs/jobs/product-manager-4.html',
 'https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html',
 'https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html',
 'https://realpython.github.io/fake-jobs/jobs/textile-designer-7.html',
 'https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html',
 'https://realpython.github.io/fake-jobs/jobs/waste-management-officer-9.html',
 'https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html',
 'https://realpython.github.io/fake-jobs/jobs/interpreter-11.html',
 'https://realpython.github.io/fake-jobs/jobs/architect-12.html',
 'https://realpython.gi