# Web Scraping With Python



## Introduction

The web is a big database. 

- Billions of web pages cover almost any topic: a great source of text data.
- Most pages also contain images: the ImageNet project gets its image data from the web.
- Some contain other forms of data: videos (e.g., YouTube) and graphs. The web itself is a big graph.


Web scraping is an automated method for obtaining `large` amounts of data from websites. Most of this data is `unstructured` HTML in various layouts, which is then converted into structured data in a spreadsheet or a database so that it can be used in other applications.

Web crawling vs. scraping:
- Crawling: indexing whole pages across the Internet
- Scraping: extracting particular data from the pages of a website

ATTENTION: Do some research on your own and make sure that you’re not violating any Terms of Service before you start a large-scale project.
When scraping a site, it is also crucial to respect the target site’s robots.txt file. According to the Google specification, each domain (or subdomain) can have a robots.txt file. The file is optional, but if it exists it must be placed in the root directory of the domain. In other words, if the base URL of a site is https://example.com, then the robots.txt file is available at https://example.com/robots.txt. The file specifies which bots are allowed to visit the site, which pages and resources they can access, at what rate, and more.
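
As a minimal sketch of how a script might check these rules, Python's standard library ships `urllib.robotparser`. The robots.txt content and the `my-scraper` user agent below are made up for illustration; the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# "my-scraper" is an assumed user-agent name for this sketch.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/jobs")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")

print(allowed_public)   # True: /jobs is not disallowed
print(allowed_private)  # False: /private/ is disallowed for all agents
print(delay)            # 5: requested pause between requests, in seconds
```

For a real site, you would typically call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parser.parse(...)`.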





#### An Alternative to Web Scraping: APIs

Some website providers (Google, Twitter, Facebook, Stack Overflow, etc.) offer application programming interfaces (APIs) that allow you to access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead, you can access the data directly using formats like JSON and XML.
When you use an API, the process is generally more stable than gathering the data through web scraping. That’s because developers create APIs to be consumed by programs rather than by human eyes.
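
To illustrate why API data is easier to consume, here is a small sketch that decodes a JSON payload with the standard `json` module. The payload and its field names (`items`, `title`, `location`) are invented for this example and do not come from any real API.

```python
import json

# Hypothetical payload, standing in for the body of an API response.
SAMPLE_RESPONSE = """
{
  "items": [
    {"title": "Senior Python Developer", "location": "Stewartbury, AA"},
    {"title": "Energy engineer", "location": "Christopherville, AA"}
  ]
}
"""

data = json.loads(SAMPLE_RESPONSE)  # JSON text -> Python dicts and lists
titles = [item["title"] for item in data["items"]]
print(titles)
```

No HTML parsing is needed: the fields are already structured. When the data comes from `requests`, calling `.json()` on the response object performs the same decoding.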

## Steps:

## Step 1: Inspect Your Data Source

You need to get to know the website that you want to scrape. Start by opening the site in your favorite browser, then decipher the information in its URLs. Your web scraping journey will be much easier if you first become familiar with how URLs work and what they’re made of.

Some websites use query parameters to encode values that you submit when performing a search. You can think of them as query strings that you send to the database to retrieve specific records. You’ll find query parameters at the end of a URL. For example, if you go to Indeed and search for “software developer” in “Australia” through their search bar, you’ll see that the URL changes to include these values as query parameters:

`https://au.indeed.com/jobs?q=software+developer&l=Australia`

The query parameters in this URL are `?q=software+developer&l=Australia`. Query parameters consist of three parts:

- Start: The beginning of the query parameters is denoted by a question mark (?).
- Information: The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
- Separator: Every URL can have multiple query parameters, separated by an ampersand symbol (&).

Equipped with this information, you can pick apart the URL’s query parameters into two key-value pairs:

- q=software+developer selects the type of job.
- l=Australia selects the location of the job.
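
The same decomposition can be done in code. The sketch below uses the standard `urllib.parse` module to split the Indeed URL from above into its key-value pairs and to build a new query string; the replacement search term `data engineer` is just an illustrative value.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "https://au.indeed.com/jobs?q=software+developer&l=Australia"

# Everything after the "?" is the query string.
query = urlsplit(url).query
params = parse_qs(query)  # "+" is decoded to a space
print(params)  # {'q': ['software developer'], 'l': ['Australia']}

# Build a URL for a different search by encoding new key-value pairs.
new_query = urlencode({"q": "data engineer", "l": "Australia"})
new_url = "https://au.indeed.com/jobs?" + new_query
print(new_url)  # https://au.indeed.com/jobs?q=data+engineer&l=Australia
```

With `requests`, you can also pass such a dictionary through the `params` argument of `requests.get()` and let the library do the encoding.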

---

#### Next, inspect the Site Using Developer Tools:

All modern browsers come with developer tools installed. In this section, you’ll see how to work with the developer tools in Chrome.

- Right-click anywhere on the page in Chrome and click `Inspect`.
- Select the Elements tab in the developer tools. You’ll see a structure with clickable HTML elements. You can expand, collapse, and even edit elements right in your browser:

![image.png](attachment:image.png)

## Step 2: Scrape HTML Content From a Page

First, you’ll want to get the site’s HTML code into your Python script so that you can interact with it. For this task, you’ll use Python’s requests library. 

`pip install requests`

In [6]:
import requests

url = "https://realpython.github.io/fake-jobs/"

# requests.get() issues an HTTP GET request to the given URL. GET is used to
# request data from a specified resource. The call retrieves the HTML that the
# server sends back and stores it in a Python Response object.
page = requests.get(url)

# The response status code is generated by the server to indicate the outcome of the request.
# Some commonly encountered status codes are:
#   200 OK: The request was fulfilled.
#   403 Forbidden: The server refuses to supply the resource, regardless of the client's identity.
#   404 Not Found: The requested resource cannot be found on the server.
print(page.status_code)  # check whether retrieval was successful

print(page.text) # static HTML content


200
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Fake Python</title>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">
  </head>
  <body>
  <section class="section">
    <div class="container mb-5">
      <h1 class="title is-1">
        Fake Python
      </h1>
      <p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
    </div>
    <div class="container">
    <div id="ResultsContainer" class="columns is-multiline">
    <div class="column is-half">
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image is-48x48">
          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">
        </figure>
      </div>
      <div class="media-content">
        <h2 class="title
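
Rather than only printing the status code, a script can stop early when the request fails. The following is a sketch under assumptions: `fetch_html` and `describe_status` are hypothetical helper names, and the 10-second timeout is an arbitrary choice. `raise_for_status()` is the `requests` method that raises an `HTTPError` for 4xx and 5xx responses.

```python
from http import HTTPStatus

import requests


def fetch_html(url, timeout=10):
    """Return the page HTML, or raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def describe_status(code):
    """Turn a numeric status code into a readable label such as '404 Not Found'."""
    return f"{code} {HTTPStatus(code).phrase}"


for code in (200, 403, 404):
    print(describe_status(code))
```

Calling `fetch_html("https://realpython.github.io/fake-jobs/")` would then return the same HTML as `page.text`, provided the site is reachable.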


## HTML Basics

HTML consists of nested elements (tags). Each element is opened with `<element>` and closed with `</element>`.

The basic structure of a webpage consists of the outer `<html>` tags, the `<head>` tag and the `<body>` tag. The contents of `<head>` are not displayed. Instead, it’s where JavaScript and CSS files can be loaded and where you can set the title of the webpage. The `<body>` is what actually gets displayed on the webpage.

![image-2.png](attachment:image-2.png)

---

#### Important HTML structures

Some HTML structures that you will probably use repeatedly:

- Title:  The title should describe the content and the meaning of the page.

        <title>This will be displayed in the caption bar</title>
![image-4.png](attachment:image-4.png)

- Headers: HTML headings are titles or subtitles that you want to display on a webpage. They are written as `<h?>text</h?>`, where `?` is a number from 1 to 6, 1 being the largest and 6 the smallest.
![image-3.png](attachment:image-3.png)

- Paragraph: A paragraph always starts on a new line, and is usually a block of text.

        <p>paragraph text</p>

- Div: The `<div>` element is often used to group sections of a web page together.

- Forced Line Break: `<br />`

- Links:  Links are found in nearly all web pages. Links allow users to click their way from page to page.

        <a href="url.htm">Text Displayed</a>

- Unordered list: An unordered list starts with the `<ul>` tag. Each list item starts with the `<li>` tag. The list items will be marked with bullets (small black circles) by default

        <ul>
        <li>List item 1</li>
        <li>List item 2</li>
        </ul>

- Ordered list: An ordered list starts with the `<ol>` tag. Each list item starts with the `<li>` tag. The list items will be marked with numbers by default

        <ol>
        <li>List item 1</li>
        <li>List item 2</li>
        </ol>

- Tables: HTML tables allow web developers to arrange data into rows and columns.

        <table>
        <!--First Row-->
        <tr>
        <td>Table Data</td>
        <td>Table Data</td>
        </tr>
        <!--Second Row-->
        <tr>
        <td>Table Data</td>
        <td>Table Data</td>
        </tr>
        </table>

- Horizontal Rule: `<hr />`

- Commenting: `<!--Comments go here; they appear in the code but not on the rendered page-->`

- Font Styles

        <b>bold</b>
        <strong>another bold</strong>
        <i>italics</i>
        <u>underline</u>
        <strike>strikethrough</strike>
        <tt>typewriter text</tt>

- Images: The HTML `<img>` tag is used to embed an image in a web page. The `<img>` tag has two required attributes:

  * src - Specifies the path to the image
  * alt - Specifies an alternate text for the image

        <img src="image.jpg" alt="Alternate text shown if the image cannot be displayed">

- HTML elements can have attributes. Attributes provide additional information about elements. Attributes are always specified in the start tag. Attributes usually come in name/value pairs like: name="value"

  * The HTML id attribute is used to specify a unique id for an HTML element. You cannot have more than one element with the same id in an HTML document.
  * The HTML class attribute is used to specify a class for an HTML element. Multiple HTML elements can share the same class.

## Example of simple HTML file:


<!DOCTYPE html>
<html>
<head>
<style>
.city {
  background-color: tomato;
  color: white;
  margin: 20px;
  padding: 20px;
}
</style>
</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the capital of England.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris is the capital of France.</p>
</div>

<div class="city">
  <h2>Tokyo</h2>
  <p>Tokyo is the capital of Japan.</p>
</div>

</body>
</html>

##### Note: 
We focus on scraping a <i>static website</i>. Static sites are straightforward to work with because the server sends you an HTML page that already contains all the page information in the response. You can parse that HTML response and immediately begin to pick out the relevant data.

A popular choice for scraping dynamic content is <i>Selenium</i>, which won't be covered in this course.

## Step 3: Parse HTML Code With Beautiful Soup

Beautiful Soup is a Python library for parsing structured data.  Say you’ve found some webpages that display data relevant to your research, such as date or address information. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information.

`pip install beautifulsoup4`

In [8]:
from bs4 import BeautifulSoup

# create a Beautiful Soup object that takes page.text, which is the HTML content you scraped earlier, as its input
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c


In Beautiful Soup, `parsing` refers to the process of analyzing a raw HTML or XML document and breaking it down into a structured tree-like representation that can be easily navigated and manipulated. This structured representation allows you to access specific elements, attributes, and text within the document, making it easier to extract the information you need from the HTML or XML source.

Beautiful Soup can work with several parsers for HTML or XML documents, such as Python's built-in HTML parser, the lxml parser, and the html5lib parser.
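
The parser is selected by the second argument to `BeautifulSoup`. As a sketch, the hypothetical helper `make_soup` below prefers `lxml`, which is a separate install, and falls back to the built-in parser when `lxml` is missing. The tiny HTML string is made up for the example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<html><head><title>Fake Python</title></head><body><p>Hello</p></body></html>"


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not available.
        return BeautifulSoup(markup, "html.parser")


soup_demo = make_soup(html)
print(soup_demo.title.string)  # Fake Python
```

For well-formed pages the parsers usually produce the same tree; they mostly differ in speed and in how they repair broken markup.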

By `parsing` an HTML document, we obtain a tree-structured representation like the following:

![image-2.png](attachment:image-2.png)

With this basic understanding, we can see how Python and Beautiful Soup can help us traverse this tree to extract the data we need.
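
A short sketch of such a traversal, using a trimmed version of the city example from earlier. The variable names (`tree`, `first_div`, `child_tags`) are arbitrary.

```python
from bs4 import BeautifulSoup

html = """
<html>
<head><title>Cities</title></head>
<body>
<div class="city"><h2>London</h2><p>London is the capital of England.</p></div>
<div class="city"><h2>Paris</h2><p>Paris is the capital of France.</p></div>
</body>
</html>
"""

tree = BeautifulSoup(html, "html.parser")

# Attribute access walks down the tree to the first matching descendant.
first_div = tree.body.div
print(first_div.h2.string)    # London
print(first_div.parent.name)  # body
print(first_div["class"])     # ['city']

# Direct child tags of <body>; recursive=False stops at one level.
child_tags = [child.name for child in tree.body.find_all(recursive=False)]
print(child_tags)             # ['div', 'div']
```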

### Find Elements by ID:

In an HTML web page, every element can have an id attribute assigned. As the name already suggests, that id attribute makes the element uniquely identifiable on the page. You can begin to parse your page by selecting a specific element by its ID.

In [9]:
# The element you’re looking for is a <div> with an id attribute that has the value "ResultsContainer".
results = soup.find(id="ResultsContainer")
print(results.prettify())

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
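
One detail worth keeping in mind: `find()` returns `None` when nothing matches, so calling a method on the result of a failed lookup raises an `AttributeError`. The sketch below shows a simple guard; the HTML snippet and the id `NoSuchId` are made up for the example.

```python
from bs4 import BeautifulSoup

html = '<div id="ResultsContainer"><p class="location">Stewartbury, AA</p></div>'
doc = BeautifulSoup(html, "html.parser")

found = doc.find(id="ResultsContainer")
missing = doc.find(id="NoSuchId")  # hypothetical id that is not in the markup

print(found is not None)  # True
print(missing)            # None

if missing is None:
    print("No element with that id; check the page in the developer tools.")
```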

### Find Elements by HTML Class Name:

Now you can work with your new object called `results` and select only the elements with the class `card-content`.

In [10]:
job_elements = results.find_all(class_="card-content")  # find_all() returns an iterable of all matching elements
for job_element in job_elements:
    print(job_element.prettify(), end="\n"*2)

<div class="card-content">
 <div class="media">
  <div class="media-left">
   <figure class="image is-48x48">
    <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
   </figure>
  </div>
  <div class="media-content">
   <h2 class="title is-5">
    Senior Python Developer
   </h2>
   <h3 class="subtitle is-6 company">
    Payne, Roberts and Davis
   </h3>
  </div>
 </div>
 <div class="content">
  <p class="location">
   Stewartbury, AA
  </p>
  <p class="is-small has-text-grey">
   <time datetime="2021-04-08">
    2021-04-08
   </time>
  </p>
 </div>
 <footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
   Learn
  </a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">
   Apply
  </a>
 </footer>
</div>


<div class="card-content">
 <div class="media">
  <div class="media-lef

In [11]:
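# The previous output still contains too much HTML, so narrow the search to specific child elements
job_elements = results.find_all(class_="card-content")  # find_all() returns an iterable of all matching elements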
for job_element in job_elements:
    job_title = job_element.find("h2", class_="title")
    company = job_element.find("h3", class_="subtitle")
    location = job_element.find("p", class_="location")
    print(job_title)
    print(company)
    print(location)
    print("\n"*2)

<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">
        Stewartbury, AA
      </p>



<h2 class="title is-5">Energy engineer</h2>
<h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
<p class="location">
        Christopherville, AA
      </p>



<h2 class="title is-5">Legal executive</h2>
<h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
<p class="location">
        Port Ericaburgh, AA
      </p>



<h2 class="title is-5">Fitness centre manager</h2>
<h3 class="subtitle is-6 company">Savage-Bradley</h3>
<p class="location">
        East Seanview, AP
      </p>



<h2 class="title is-5">Product manager</h2>
<h3 class="subtitle is-6 company">Ramirez Inc</h3>
<p class="location">
        North Jamieview, AP
      </p>



<h2 class="title is-5">Medical technical officer</h2>
<h3 class="subtitle is-6 company">Rogers-Yates</h3>
<p class="location">
        Davidville, AP
      </p>





### Extract Text From HTML Elements

You only want to see the title, company, and location of each job posting.
To get just the text, call `get_text()` on each element. Passing `strip=True` also removes the surrounding whitespace.

In [12]:
job_elements = results.find_all(class_="card-content") #returns an iterable 
for job_element in job_elements:
    job_title = job_element.find("h2", class_="title").get_text(strip=True)
    company = job_element.find("h3", class_="subtitle").get_text(strip=True)
    location = job_element.find("p", class_="location").get_text(strip=True)
    print(job_title)
    print(company)
    print(location)
    print("\n")

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA


Energy engineer
Vasquez-Davidson
Christopherville, AA


Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA


Fitness centre manager
Savage-Bradley
East Seanview, AP


Product manager
Ramirez Inc
North Jamieview, AP


Medical technical officer
Rogers-Yates
Davidville, AP


Physiological scientist
Kramer-Klein
South Christopher, AE


Textile designer
Meyers-Johnson
Port Jonathan, AE


Television floor manager
Hughes-Williams
Osbornetown, AE


Waste management officer
Jones, Williams and Villa
Scotttown, AP


Software Engineer (Python)
Garcia PLC
Ericberg, AE


Interpreter
Gregory and Sons
Ramireztown, AE


Architect
Clark, Garcia and Sosa
Figueroaview, AA


Meteorologist
Bush PLC
Kelseystad, AA


Audiological scientist
Salazar-Meyers
Williamsburgh, AE


English as a second language teacher
Parker, Murphy and Brooks
Mitchellburgh, AE


Surgeon
Cruz-Brown
West Jessicabury, AA


Equities trader
Macdonald-Ferguson
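
To see what `strip=True` changes, the following sketch compares a few `get_text()` variants on a small snippet modelled on one job card. The snippet is a simplified stand-in, not the full page.

```python
from bs4 import BeautifulSoup

html = """
<div class="content">
  <p class="location">
    Stewartbury, AA
  </p>
  <p><time datetime="2021-04-08">2021-04-08</time></p>
</div>
"""
doc = BeautifulSoup(html, "html.parser")
location = doc.find("p", class_="location")

raw = location.get_text()              # keeps the newlines and indentation
clean = location.get_text(strip=True)  # whitespace removed at both ends
# separator joins the text pieces of all descendants
combined = doc.find("div").get_text(separator=" | ", strip=True)

print(repr(raw))
print(repr(clean))  # 'Stewartbury, AA'
print(combined)     # Stewartbury, AA | 2021-04-08
```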

### Find Elements by Tag Name and Text Content

In [10]:
# This code finds all <h2> elements whose string contains "python" (case-insensitive).
# The "text is not None" guard skips elements that have no single string.
python_jobs = results.find_all("h2", string=lambda text: text is not None and "python" in text.lower())
python_jobs

[<h2 class="title is-5">Senior Python Developer</h2>,
 <h2 class="title is-5">Software Engineer (Python)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Software Developer (Python)</h2>,
 <h2 class="title is-5">Python Developer</h2>,
 <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>,
 <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Software Developer (Python)</h2>]
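
A function is used for the match because passing a plain string to `string=` only matches elements whose text is exactly that string. The sketch below contrasts the two on a two-card snippet modelled on the page, and then steps from a matching `<h2>` up to its card to read the company name.

```python
from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# Exact match: no <h2> has the text "Python" and nothing else.
exact = doc.find_all("h2", string="Python")
# Function match: any <h2> whose text contains "python".
partial = doc.find_all(
    "h2", string=lambda text: text is not None and "python" in text.lower()
)

print(len(exact))    # 0
print(len(partial))  # 1

# Step up from the matching <h2> to its card to reach sibling data.
card = partial[0].parent
company = card.find("h3", class_="company").get_text(strip=True)
print(company)       # Payne, Roberts and Davis
```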

### Extract Attributes From HTML Elements:


In [11]:
job_elements = results.find_all(class_="card-content") #returns an iterable 
for job_element in job_elements:
    job_title = job_element.find("h2", class_="title").get_text(strip=True)
    company = job_element.find("h3", class_="subtitle").get_text(strip=True)
    location = job_element.find("p", class_="location").get_text(strip=True)
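    # Find the "Apply" link by its text; the guard skips tags that have no single string
    link = job_element.find("a", string=lambda text: text is not None and "apply" in text.lower())
    link_url = link["href"]  # attribute values are read with dictionary-style access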
    print(job_title)
    print(company)
    print(location)
    print(link)
    print(link_url)
    print("\n")

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html


Energy engineer
Vasquez-Davidson
Christopherville, AA
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" target="_blank">Apply</a>
https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html


Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html" target="_blank">Apply</a>
https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html


Fitness centre manager
Savage-Bradley
East Seanview, AP
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html" target="_blank">Apply</a>
https://realpython.g
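
As mentioned in the introduction, scraped data is usually converted into a structured form such as a spreadsheet. The sketch below collects the fields into a list of dictionaries and writes them as CSV with the standard `csv` module. It runs on a two-card snippet modelled on the page and writes to an in-memory buffer; the file name `jobs.csv` in the comment is only a suggestion.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <p class="location">Stewartbury, AA</p>
</div>
<div class="card-content">
  <h2 class="title is-5">Energy engineer</h2>
  <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
  <p class="location">Christopherville, AA</p>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

# One dictionary per job card.
rows = []
for card in doc.find_all("div", class_="card-content"):
    rows.append({
        "title": card.find("h2", class_="title").get_text(strip=True),
        "company": card.find("h3", class_="company").get_text(strip=True),
        "location": card.find("p", class_="location").get_text(strip=True),
    })

# Swap the buffer for open("jobs.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "company", "location"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Values that contain commas, such as the locations, are quoted automatically by the `csv` module.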

## Example: crawling Wikipedia

Let's dive into Wikipedia. Starting from one article, we follow a random link (an `<a>` tag) to another `Wikipedia` article and scrape that page, repeating the process a fixed number of times.

![image.png](attachment:image.png)


In [11]:
import requests
from bs4 import BeautifulSoup
import random
import time

def scrapeWikiArticle(url, num_to_scrape=0):
    time.sleep(2)  # delay 2 seconds to avoid overloading Wikipedia
    if num_to_scrape > 10:  # maximum depth of 10 for crawling; remove this condition for an endless scraper
        return

    response = requests.get(url=url, timeout=10)

    if response.status_code != 200:  # check whether fetching was successful
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # When inspecting the Wikipedia page, we see that the title tag has the id "firstHeading"
    title = soup.find(id="firstHeading")
    print(title.text)

    # Get all links in the article body that have an href attribute
    allLinks = soup.find(id="bodyContent").find_all("a", href=True)
    # Keep only internal links, whose href starts with "/wiki/"
    wikiLinks = [link for link in allLinks if link['href'].startswith("/wiki/")]
    if not wikiLinks:  # nothing to follow from this page
        return
    linkToScrape = random.choice(wikiLinks)

    scrapeWikiArticle("https://en.wikipedia.org" + linkToScrape['href'], num_to_scrape + 1)

scrapeWikiArticle("https://en.wikipedia.org/wiki/Web_scraping")

Web scraping
Injunction
Doctrine of marshalling
Equitable remedy
Fiduciary
Boardman v Phipps
Fibrosa Spolka Akcyjna v Fairbairn Lawson Combe Barbour Ltd
Adair Roche, Baron Roche
Category:Members of the Privy Council of the United Kingdom
George Baker (judge)
Category:Scottish knights


Credits:

- https://www3.ntu.edu.sg/home/ehchua/programming/webprogramming/HTTP_Basics.html
- https://realpython.com/beautiful-soup-web-scraper-python/
- https://web.stanford.edu/class/cs1c/dorms/flomo_west/handouts/05BasicHtml.pdf
- https://brightdata.com/blog/how-tos/robots-txt-for-web-scraping-guide
- https://www.w3schools.com/html/
- https://sharplesson.com/html-page-structure/
- https://www.w3docs.com/learn-html/html-h1-h6-tags.html
- https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/
- https://scrapfly.io/blog/web-scraping-with-python-beautifulsoup/