# mclphlps-web-scraping 

# READ ME

# EDA

### Hypertext Transfer Protocol (HTTP) is the foundation for data communication on the world wide web.
- Entering a URL sends a request for the resource at that address
- The response is what comes back (a page that loads, a 404 error, and so on)

To retrieve the contents of a website, we will be using the [_requests_](https://requests.readthedocs.io/en/master/) library.

In [1]:
import requests

In this notebook, we will be using a **GET** request. This is a request for data from a specified resource.  

Another common type of request is a **POST** request. POST submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource, the update of existing resources, or both.
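To make the GET/POST distinction concrete, here is a minimal sketch that prepares (but never sends) one request of each kind with `requests`. The `https://example.com/search` endpoint and the `q` parameter are placeholders chosen for illustration; they are not part of this exercise.

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent over the network.
SEARCH_URL = "https://example.com/search"

# GET: the parameters become part of the URL's query string, and there is no body.
get_req = requests.Request("GET", SEARCH_URL, params={"q": "python"}).prepare()

# POST: the same data is form-encoded and carried in the request body instead.
post_req = requests.Request("POST", SEARCH_URL, data={"q": "python"}).prepare()

print(get_req.method, get_req.url, get_req.body)
print(post_req.method, post_req.url, post_req.body)
```

Comparing the two printed lines shows where the data travels: in the URL for GET, in the body for POST.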

To perform a GET request, use `requests.get()` and pass in the desired url.

In [2]:
URL = 'https://realpython.github.io/fake-jobs/'

headers = {
    "User-Agent": "MyPythonScript/1.0 (contact@example.com)"
}

response = requests.get(URL, headers = headers)

Let's see what kind of object we get.

In [3]:
type(response)

requests.models.Response

We can check the status code using the `status_code` attribute.

In [4]:
response.status_code

200

A 200 status code is the standard response for a successful request.  

Other common status codes:
 * 400: Bad Request
 * 404: Not Found
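
Status codes fall into ranges (2xx success, 4xx client error, 5xx server error). As a rough sketch, the standard library's `http.HTTPStatus` can turn a numeric code into its standard phrase. The `describe_status` helper is a hypothetical name introduced here for illustration.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)' for a status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The standard library does not know this code.
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational or non-standard"
    return f"{code} {phrase} ({kind})"


for code in (200, 400, 404):
    print(describe_status(code))
```

In a scraping script, `response.ok` and `response.raise_for_status()` from `requests` offer a shorter way to stop early on a 4xx or 5xx response.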

Let's see what this request returned.

In [5]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

It is very hard to decipher the above text. Luckily for us, the [_Beautiful Soup_](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library comes to the rescue. This library assists us in parsing HTML into something usable.

In [6]:
from bs4 import BeautifulSoup as BS

First, we can soupify our response text. Since we are working with HTML, we specify Python's built-in `html.parser`.

In [7]:
soup = BS(response.text, 'html.parser')

Now, we can print it out in a slightly more readable form.

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

What we are looking at is the HTML for this page. Your browser renders it into the Fake Python job board that you see.

<img src="assets/html.png">


If you navigate to this page in your browser, you can view page source or inspect elements to see the underlying HTML.

If you are using Safari, this may not be available until you activate it. According to [this](https://www.socialmeteor.com/2013/03/04/how-to-view-html-source-in-safari-web-browser/) website, you can do so by following these steps:


1. Open Safari.
2. Select ‘Preferences’ from the ‘Safari’ menu.
3. In the ‘Advanced’ section, select ‘Show Develop menu in menu bar’.
4. Visit the web page you want to view HTML source for.
5. Select ‘Show Page Source’ from the ‘Develop’ menu that has been added to Safari.


Beautiful Soup lets us search through this HTML and extract out the contents we want by tag.  

Say we wanted to find the title of this page. We can accomplish this by using the `.find` method on our soup, telling it that we want to find the first `title` tag.

In [9]:
soup.find('title')

<title>Fake Python</title>

Notice that this returns a bs4 Tag object.

In [10]:
type(soup.find('title'))

bs4.element.Tag

To extract out the text, you can use the `.text` attribute.

In [11]:
soup.find('title').text

'Fake Python'
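
One thing to keep in mind is that `.find` returns `None` when nothing matches, so chaining `.text` directly can raise an `AttributeError`. The sketch below guards against that on a tiny stand-in document; `first_text` and `demo_soup` are hypothetical names used only for this illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; the real page is much larger.
html = "<html><head><title>Fake Python</title></head><body><p>Hello</p></body></html>"
demo_soup = BeautifulSoup(html, "html.parser")


def first_text(soup, tag_name, default=None):
    """Return the stripped text of the first matching tag, or `default` if none exists."""
    tag = soup.find(tag_name)
    if tag is None:
        return default
    return tag.get_text(strip=True)


print(first_text(demo_soup, "title"))
print(first_text(demo_soup, "h2", default="(no h2 found)"))
```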

The `.find` method finds only the first matching tag.

We can find _all_ elements with a particular tag using the `.findAll(<tag>)` method (a legacy alias of `.find_all`). Say we want to find all job titles. We'll look for the `h2` tags, which hold the job titles on the web page.

In [12]:
job_title = soup.findAll('h2')
print(type(job_title))
job_title

<class 'bs4.element.ResultSet'>


[<h2 class="title is-5">Senior Python Developer</h2>,
 <h2 class="title is-5">Energy engineer</h2>,
 <h2 class="title is-5">Legal executive</h2>,
 <h2 class="title is-5">Fitness centre manager</h2>,
 <h2 class="title is-5">Product manager</h2>,
 <h2 class="title is-5">Medical technical officer</h2>,
 <h2 class="title is-5">Physiological scientist</h2>,
 <h2 class="title is-5">Textile designer</h2>,
 <h2 class="title is-5">Television floor manager</h2>,
 <h2 class="title is-5">Waste management officer</h2>,
 <h2 class="title is-5">Software Engineer (Python)</h2>,
 <h2 class="title is-5">Interpreter</h2>,
 <h2 class="title is-5">Architect</h2>,
 <h2 class="title is-5">Meteorologist</h2>,
 <h2 class="title is-5">Audiological scientist</h2>,
 <h2 class="title is-5">English as a second language teacher</h2>,
 <h2 class="title is-5">Surgeon</h2>,
 <h2 class="title is-5">Equities trader</h2>,
 <h2 class="title is-5">Newspaper journalist</h2>,
 <h2 class="title is-5">Materials engineer</h2>,
 

In [45]:
type(job_titles)

list

Let's look closer at the first job title.

In [13]:
first_job_title = job_title[0]
print(type(first_job_title))
first_job_title

<class 'bs4.element.Tag'>


<h2 class="title is-5">Senior Python Developer</h2>

You can access attributes of a Tag object in the same way that you would access values from a dictionary.

In [14]:
first_job_title['class']

['title', 'is-5']

You can also safely access attributes using `.get`. This might be useful if, for example, you aren't sure if a particular Tag or all tags had a certain attribute.

In [15]:
# Not safe: raises a KeyError if the tag has no 'class' attribute
# first_job_title['class']

In [16]:
# Safe: meaning it will not break, even if there is no 'class'
first_job_title.get('class')

['title', 'is-5']

You can also specify a default value when using `get`.

In [17]:
first_job_title.get('class', default = 'No Class')

['title', 'is-5']
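
The default only comes into play when the attribute is missing; every `h2` here has a `class`, which is why the cell above still returns the class list. A small sketch on a single hand-written tag shows the difference, using `id` as an attribute this tag does not have.

```python
from bs4 import BeautifulSoup

# One hand-written tag that has a 'class' attribute but no 'id'.
tag = BeautifulSoup('<h2 class="title is-5">Senior Python Developer</h2>', "html.parser").h2

print(tag.get("class"))          # attribute present: its value is returned
print(tag.get("id"))             # attribute missing, no default given
print(tag.get("id", "No ID"))    # attribute missing, default returned instead

try:
    tag["id"]
except KeyError:
    print("square-bracket access raised KeyError")
```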

If you want to grab a particular attribute for all job titles, an easy way to do so is with a list comprehension.

In [18]:
job_title_class = [x.get('class') for x in job_title]

In [19]:
job_title_class

[['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],
 ['title', 'is-5'],


In [20]:
type(job_title_class)

list


# ANSWERS

## 1. Start by performing a GET request on the url above and convert the response into a BeautifulSoup object.

### 1a. Use the .find method to find the tag containing the first job title ("Senior Python Developer").
- Hint: can you find a tag type (h2) and/or a class (title is-5) that could be helpful for extracting this information? Extract the text from this title.

In [49]:
first_job = soup.find('h2').get_text(strip=True)
print(first_job)

Senior Python Developer


### 1b. Now, use what you did for the first title, but extract the job title for all jobs on this page. Store the results in a list.

In [50]:
job_titles = [title.get_text(strip=True) for title in soup.find_all('h2')]
print(job_titles)

['Senior Python Developer', 'Energy engineer', 'Legal executive', 'Fitness centre manager', 'Product manager', 'Medical technical officer', 'Physiological scientist', 'Textile designer', 'Television floor manager', 'Waste management officer', 'Software Engineer (Python)', 'Interpreter', 'Architect', 'Meteorologist', 'Audiological scientist', 'English as a second language teacher', 'Surgeon', 'Equities trader', 'Newspaper journalist', 'Materials engineer', 'Python Programmer (Entry-Level)', 'Product/process development scientist', 'Scientist, research (maths)', 'Ecologist', 'Materials engineer', 'Historic buildings inspector/conservation officer', 'Data scientist', 'Psychiatrist', 'Structural engineer', 'Immigration officer', 'Python Programmer (Entry-Level)', 'Neurosurgeon', 'Broadcast engineer', 'Make', 'Nurse, adult', 'Air broker', 'Editor, film/video', 'Production assistant, radio', 'Engineer, communications', 'Sales executive', 'Software Developer (Python)', 'Futures trader', 'Tour

### 1c. Finally, extract the companies, locations, and posting dates for each job.
- For example, the first job has a company of "Payne, Roberts and Davis", a location of "Stewartbury, AA", and a posting date of "2021-04-08". Ensure that the text that you extract is clean, meaning no extra spaces or other characters at the beginning or end.
- company: tag = "h3"
- location: class = "location"
- date: tag = "time"

In [51]:
# Extract companies (h3)
companies = [company.get_text(strip=True) for company in soup.find_all('h3')]
print(companies)

['Payne, Roberts and Davis', 'Vasquez-Davidson', 'Jackson, Chambers and Levy', 'Savage-Bradley', 'Ramirez Inc', 'Rogers-Yates', 'Kramer-Klein', 'Meyers-Johnson', 'Hughes-Williams', 'Jones, Williams and Villa', 'Garcia PLC', 'Gregory and Sons', 'Clark, Garcia and Sosa', 'Bush PLC', 'Salazar-Meyers', 'Parker, Murphy and Brooks', 'Cruz-Brown', 'Macdonald-Ferguson', 'Williams, Peterson and Rojas', 'Smith and Sons', 'Moss, Duncan and Allen', 'Gomez-Carroll', 'Manning, Welch and Herring', 'Lee, Gutierrez and Brown', 'Davis, Serrano and Cook', 'Smith LLC', 'Thomas Group', 'Silva-King', 'Pierce-Long', 'Walker-Simpson', 'Cooper and Sons', 'Donovan, Gonzalez and Figueroa', 'Morgan, Butler and Bennett', 'Snyder-Lee', 'Harris PLC', 'Washington PLC', 'Brown, Price and Campbell', 'Mcgee PLC', 'Dixon Inc', 'Thompson, Sheppard and Ward', 'Adams-Brewer', 'Schneider-Brady', 'Gonzales-Frank', 'Smith-Wong', 'Pierce-Herrera', 'Aguilar, Rivera and Quinn', 'Lowe, Barnes and Thomas', 'Lewis, Gonzalez and Vasq

In [52]:
# Extract locations (class="location")
locations = [loc.get_text(strip=True) for loc in soup.find_all(class_='location')]
print(locations)

['Stewartbury, AA', 'Christopherville, AA', 'Port Ericaburgh, AA', 'East Seanview, AP', 'North Jamieview, AP', 'Davidville, AP', 'South Christopher, AE', 'Port Jonathan, AE', 'Osbornetown, AE', 'Scotttown, AP', 'Ericberg, AE', 'Ramireztown, AE', 'Figueroaview, AA', 'Kelseystad, AA', 'Williamsburgh, AE', 'Mitchellburgh, AE', 'West Jessicabury, AA', 'Maloneshire, AE', 'Johnsonton, AA', 'South Davidtown, AP', 'Port Sara, AE', 'Marktown, AA', 'Laurenland, AE', 'Lauraton, AP', 'South Tammyberg, AP', 'North Brandonville, AP', 'Port Robertfurt, AA', 'Burnettbury, AE', 'Herbertside, AA', 'Christopherport, AP', 'West Victor, AE', 'Port Aaron, AP', 'Loribury, AA', 'Angelastad, AP', 'Larrytown, AE', 'West Colin, AP', 'West Stephanie, AP', 'Laurentown, AP', 'Wrightberg, AP', 'Alberttown, AE', 'Brockburgh, AE', 'North Jason, AE', 'Arnoldhaven, AE', 'Lake Destiny, AP', 'South Timothyburgh, AP', 'New Jimmyton, AE', 'New Lucasbury, AP', 'Port Cory, AE', 'Gileston, AA', 'Cindyshire, AA', 'East Michaelf

In [53]:
# Extract dates (time tag)
dates = [date.get_text(strip=True) for date in soup.find_all('time')]
print(dates)

['2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021-04-08', '2021
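
Building four independent lists works here because every card has every field. If a card were missing one, the lists would silently fall out of alignment. An alternative sketch is to loop over each card and pull all four fields from it. The sample HTML is a simplified stand-in for the real markup (the extra class names and the `datetime` attribute are assumptions), and `text_of` and `parse_cards` are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for two job cards (not the full markup of the real page).
SAMPLE_HTML = """
<div class="card"><div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div></div>
<div class="card"><div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
</div></div>
"""


def text_of(parent, *args, **kwargs):
    """Find one tag inside `parent` and return its stripped text, or None."""
    tag = parent.find(*args, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


def parse_cards(html):
    """Return one dict per job card, so the fields of a row cannot drift apart."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="card-content"):
        rows.append({
            "Job Title": text_of(card, "h2"),
            "Company": text_of(card, "h3"),
            "Location": text_of(card, class_="location"),
            "Date Posted": text_of(card, "time"),
        })
    return rows


rows = parse_cards(SAMPLE_HTML)
for row in rows:
    print(row)
```

The second sample card deliberately has no `time` tag, so its date comes back as `None` instead of shifting later dates up by one row. A list of dicts like this can be passed straight to `pd.DataFrame(rows)`.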

### 1d. Take the lists that you have created and combine them into a pandas DataFrame.

In [56]:
# Combine into a DataFrame
import pandas as pd

jobs_df = pd.DataFrame({
    'Job Title': job_titles,
    'Company': companies,
    'Location': locations,
    'Date Posted': dates
})

jobs_df.head()

| | Job Title | Company | Location | Date Posted |
|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 |
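
Building a DataFrame from a dict of lists requires every list to have the same length; otherwise pandas raises a `ValueError`. A quick check beforehand can point to the list that came up short. This is a minimal sketch using only plain Python, and `assert_same_length` is a hypothetical helper name.

```python
def assert_same_length(columns):
    """Raise ValueError naming each column's length if the lists differ; else return the length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Small stand-in lists; in the notebook these would be job_titles, companies, and so on.
sample_columns = {
    "Job Title": ["Senior Python Developer", "Energy engineer"],
    "Company": ["Payne, Roberts and Davis", "Vasquez-Davidson"],
}
print(assert_same_length(sample_columns))
```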


## 2. Next, add a column that contains the url for the "Apply" button. Try this in two ways.

### 2a. First, use the BeautifulSoup find_all method to extract the urls.

In [60]:
# Extract all Apply URLs
apply_urls = [a['href'] for a in soup.find_all('a', class_='card-footer-item') if 'Apply' in a.text]

# Add to DataFrame
jobs_df['Apply URL'] = apply_urls

jobs_df.head()

| | Job Title | Company | Location | Date Posted | Apply URL | Constructed URL |
|---|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... | https://realpython.github.io/fake-jobs/jobs/pr... |


### 2b. Next, get those same urls in a different way. 
- Examine the urls and see if you can spot the pattern of how they are constructed. Then, build the url using the elements you have already extracted. Ensure that the urls that you created match those that you extracted using BeautifulSoup. Warning: You will need to do some string cleaning and prep in constructing the urls this way. For example, look carefully at the urls for the "Software Engineer (Python)" job and the "Scientist, research (maths)" job.
- Pattern: `https://realpython.github.io/fake-jobs/jobs/<job-title-slug>-<id>.html`
- Where:
  - `<job-title-slug>` = job title lowercased, spaces → hyphens, special characters cleaned
  - `<id>` = unique numeric identifier (already in the HTML; harder to rebuild perfectly without scraping the href)


In [61]:
# Build slugs (simple version: lowercase, replace spaces with '-')
slugs = [title.lower().replace(' ', '-') for title in jobs_df['Job Title']]

# Build fake URLs (no IDs, just base + slug)
base_url = "https://realpython.github.io/fake-jobs/jobs/"
jobs_df['Constructed URL'] = [f"{base_url}{slug}.html" for slug in slugs]

print(jobs_df[['Job Title', 'Apply URL', 'Constructed URL']].head())

                 Job Title                                          Apply URL  \
0  Senior Python Developer  https://realpython.github.io/fake-jobs/jobs/se...   
1          Energy engineer  https://realpython.github.io/fake-jobs/jobs/en...   
2          Legal executive  https://realpython.github.io/fake-jobs/jobs/le...   
3   Fitness centre manager  https://realpython.github.io/fake-jobs/jobs/fi...   
4          Product manager  https://realpython.github.io/fake-jobs/jobs/pr...   

                                     Constructed URL  
0  https://realpython.github.io/fake-jobs/jobs/se...  
1  https://realpython.github.io/fake-jobs/jobs/en...  
2  https://realpython.github.io/fake-jobs/jobs/le...  
3  https://realpython.github.io/fake-jobs/jobs/fi...  
4  https://realpython.github.io/fake-jobs/jobs/pr...  
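
The simple version above leaves out the numeric id and does not clean punctuation, so its URLs will not match the scraped ones. A fuller sketch follows, under two assumptions that should be checked against the scraped hrefs: the id is the job's position on the page (the first job's URL ends in `-0.html`), and every run of non-alphanumeric characters collapses to a single hyphen. `build_job_url` is a hypothetical helper name.

```python
import re

BASE_URL = "https://realpython.github.io/fake-jobs/jobs/"


def build_job_url(title: str, index: int) -> str:
    """Build a job URL from a title and its position on the page.

    Assumption: runs of characters other than a-z and 0-9 become one hyphen,
    and leading or trailing hyphens are dropped.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{BASE_URL}{slug}-{index}.html"


print(build_job_url("Senior Python Developer", 0))
print(build_job_url("Software Engineer (Python)", 10))
```

In the notebook, this could be applied with `enumerate(jobs_df['Job Title'])` and then compared against the scraped column, for example with `(jobs_df['Apply URL'] == jobs_df['Constructed URL']).all()`.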


In [62]:
jobs_df.head()

| | Job Title | Company | Location | Date Posted | Apply URL | Constructed URL |
|---|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... | https://realpython.github.io/fake-jobs/jobs/pr... |


## 3. Finally, we want to get the job description text for each job.

### 3a. Start by looking at the page for the first job, https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html. Using BeautifulSoup, extract the job description paragraph.

In [64]:
import requests
from bs4 import BeautifulSoup

In [65]:
# Function to get job description from URL
def get_description(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Grab the first <div class="content"><p>...</p>
    description = soup.find('div', class_='content').p.get_text(strip=True)
    return description

# Test on first job
print(get_description(jobs_df['Apply URL'].iloc[0]))

Professional asset web application environmentally friendly detail-oriented asset. Coordinate educational dashboard agile employ growth opportunity. Company programs CSS explore role. Html educational grit web application. Oversea SCRUM talented support. Web Application fast-growing communities inclusive programs job CSS. Css discussions growth opportunity explore open-minded oversee. Css Python environmentally friendly collaborate inclusive role. Django no experience oversee dashboard environmentally friendly willing to learn programs. Programs open-minded programs asset.
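
The function above assumes that every request succeeds and that every page contains a `<div class="content">` with a paragraph. A slightly more defensive sketch separates parsing from fetching, so the parsing step can be tried on a hand-written snippet without touching the network. `parse_description` and `get_description_safe` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
import requests
from bs4 import BeautifulSoup


def parse_description(html):
    """Return the text of the first <p> inside <div class="content">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", class_="content")
    if content is None or content.p is None:
        return None
    return content.p.get_text(strip=True)


def get_description_safe(url, headers=None, timeout=10):
    """Fetch a job page and return its description; raise on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    return parse_description(response.text)


# Hand-written stand-in page, so this demo needs no network access.
sample_page = '<div class="content"><p> Example description text. </p></div>'
print(parse_description(sample_page))
print(parse_description("<p>No content div here</p>"))
```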


### 3b. We want to be able to do this for all pages. 
- Write a function which takes as input a url and returns the description text on that page. For example, if you input "https://realpython.github.io/fake-jobs/jobs/television-floor-manager-8.html" into your function, it should return the string:
- "At be than always different American address. Former claim chance prevent why measure too. Almost before some military outside baby interview. Face top individual win suddenly. Parent do ten after those scientist. Medical effort assume teacher wall. Significant his himself clearly very. Expert stop area along individual. Three own bank recognize special good along."


### 3c. Use the .apply method on the url column you created above to retrieve the description text for all of the jobs.

In [66]:
# 3b & 3c. Apply function to all Apply URLs
jobs_df['Description'] = jobs_df['Apply URL'].apply(get_description)

# Show results
print(jobs_df.head())

                 Job Title                     Company              Location  \
0  Senior Python Developer    Payne, Roberts and Davis       Stewartbury, AA   
1          Energy engineer            Vasquez-Davidson  Christopherville, AA   
2          Legal executive  Jackson, Chambers and Levy   Port Ericaburgh, AA   
3   Fitness centre manager              Savage-Bradley     East Seanview, AP   
4          Product manager                 Ramirez Inc   North Jamieview, AP   

  Date Posted                                          Apply URL  \
0  2021-04-08  https://realpython.github.io/fake-jobs/jobs/se...   
1  2021-04-08  https://realpython.github.io/fake-jobs/jobs/en...   
2  2021-04-08  https://realpython.github.io/fake-jobs/jobs/le...   
3  2021-04-08  https://realpython.github.io/fake-jobs/jobs/fi...   
4  2021-04-08  https://realpython.github.io/fake-jobs/jobs/pr...   

                                     Constructed URL  \
0  https://realpython.github.io/fake-jobs/jobs/se...  

In [67]:
jobs_df.head(20)

| | Job Title | Company | Location | Date Posted | Apply URL | Constructed URL | Description |
|---|---|---|---|---|---|---|---|
| 0 | Senior Python Developer | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/se... | https://realpython.github.io/fake-jobs/jobs/se... | Professional asset web application environment... |
| 1 | Energy engineer | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/en... | https://realpython.github.io/fake-jobs/jobs/en... | Party prevent live. Quickly candidate change a... |
| 2 | Legal executive | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/le... | https://realpython.github.io/fake-jobs/jobs/le... | Administration even relate head color. Staff b... |
| 3 | Fitness centre manager | Savage-Bradley | East Seanview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/fi... | https://realpython.github.io/fake-jobs/jobs/fi... | Tv program actually race tonight themselves tr... |
| 4 | Product manager | Ramirez Inc | North Jamieview, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/pr... | https://realpython.github.io/fake-jobs/jobs/pr... | Traditional page a although for study anyone. ... |
| 5 | Medical technical officer | Rogers-Yates | Davidville, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/me... | https://realpython.github.io/fake-jobs/jobs/me... | Suffer which parent. Republican total policy h... |
| 6 | Physiological scientist | Kramer-Klein | South Christopher, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/ph... | https://realpython.github.io/fake-jobs/jobs/ph... | Friend reach choose coach north. Assume be see... |
| 7 | Textile designer | Meyers-Johnson | Port Jonathan, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/te... | https://realpython.github.io/fake-jobs/jobs/te... | Court boy state table agree moment. Budget hug... |
| 8 | Television floor manager | Hughes-Williams | Osbornetown, AE | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/te... | https://realpython.github.io/fake-jobs/jobs/te... | At be than always different American address. ... |
| 9 | Waste management officer | Jones, Williams and Villa | Scotttown, AP | 2021-04-08 | https://realpython.github.io/fake-jobs/jobs/wa... | https://realpython.github.io/fake-jobs/jobs/wa... | Food record power crime situation since book a... |
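
Using `.apply` this way sends one request per row, back to back. On a real site it is usually kinder to pause between requests and to avoid fetching the same URL twice. The sketch below shows one way to do that; `fetch_all` is a hypothetical helper, the half-second default delay is an arbitrary choice, and a stand-in fetcher is used so the example runs offline.

```python
import time


def fetch_all(urls, fetch, delay_seconds=0.5):
    """Call `fetch(url)` once per distinct URL, pausing between calls."""
    results = {}
    for url in urls:
        if url in results:
            continue  # already fetched; reuse the stored result
        if results:
            time.sleep(delay_seconds)  # pause before every call after the first
        results[url] = fetch(url)
    return results


# Stand-in fetcher so the sketch runs without network access.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"description for {url}"


demo = fetch_all(["a.html", "b.html", "a.html"], fake_fetch, delay_seconds=0)
print(demo)
```

In the notebook, the stand-in could be replaced with the real function, for example `descriptions = fetch_all(jobs_df['Apply URL'], get_description)`, followed by `jobs_df['Description'] = jobs_df['Apply URL'].map(descriptions)`.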


# NOTES FROM WEB SCRAPING EXERCISE

We can further navigate the html tree to extract out other bits of information.

When scraping from a web page, you should make use of "View Page Source" and/or "Inspect Element" in your web browser.

For example, let's say we want to look at the first `h2` header on the page.

In [38]:
soup.findAll('h2')[0]

<h2 class="title is-5">Senior Python Developer</h2>

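Similar to using `find` and `findAll` on the full soup, we can use the `.find` method within a single Tag. It searches only that Tag's descendants, so calling `.find('h2')` on an `h2` Tag that contains no nested `h2` returns `None`, and accessing `.text` on that result raises an `AttributeError`.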

In [None]:
soup.findAll('h2')[0].find('h2').get('id')

In [34]:
soup.findAll('h2')[0].find('h2').text

AttributeError: 'NoneType' object has no attribute 'text'

In [41]:
soup.find('h2').text

'Senior Python Developer'
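
Searching within a Tag is most useful when the Tag is a container. The sketch below uses a hand-written stand-in for one job card: it first finds the card, then searches inside it, and also reproduces the `None` result from the failing cell above. `demo_soup` and the other variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a single job card.
html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <p class="location">Stewartbury, AA</p>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

card = demo_soup.find("div", class_="card-content")
heading = card.find("h2")      # search inside the card only
nested = heading.find("h2")    # an h2 has no h2 among its own descendants

print(heading.text)
print(nested)
```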

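Now, let's look for the table containing the Turing Award winners. The remaining cells assume that `soup` was built from the Wikipedia Turing Award page rather than from the fake-jobs page used above.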

Using `.findAll` reveals that there are multiple tables on the page.

In [None]:
soup.findAll('table')

If we know a bit more about what we are looking for, we can include an `attrs` argument and pass a dictionary. 

Go to the Turing Award page in your browser, right-click on the top of the table, and choose "Inspect". You will notice that this table is defined with the tag `<table class="wikitable">`. Armed with this information, we can narrow down our search.

In [None]:
soup.find('table', attrs={'class' : 'wikitable'})
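
For the `class` attribute specifically, Beautiful Soup also accepts the `class_` keyword and CSS selectors. The following sketch, run on a hand-written two-table stand-in with placeholder cell values (not real award data), suggests that all three forms land on the same table, and then reads its rows into plain lists.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in with placeholder values; not taken from the real page.
html = """
<table class="infobox"><tr><td>Sidebar</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Year</th><th>Recipient</th></tr>
  <tr><td>0000</td><td>Placeholder Name</td></tr>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_attrs = demo_soup.find("table", attrs={"class": "wikitable"})
by_keyword = demo_soup.find("table", class_="wikitable")
by_css = demo_soup.select_one("table.wikitable")

# Read every row of the matched table into a list of cell texts.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in by_attrs.find_all("tr")
]
print(rows)
```

Note that a tag with several classes (here `wikitable sortable`) still matches when only one of them is given.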

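If we want to interact with the table, we can use the _pandas_ `read_html` function. Recent pandas versions expect a file-like object rather than a literal HTML string, so we wrap the markup in `StringIO`.

In [None]:
from io import StringIO

import pandas as pd

In [None]:
pd.read_html(StringIO(str(soup.find('table', attrs={'class' : 'wikitable'}))))[0]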