# Webscraping

- Webscraping is the process of automatically retrieving information from websites.
- Automated web scraping speeds up data collection: you write your code once, and it retrieves the information you want many times and from many pages.
- The main challenges are the *variety* of websites and the *durability* of web scrapers (a scraper can break whenever a site changes its layout).

## How do websites work?

## Web servers

- Physical or cloud servers where the website lives. 
    - *Front end*: What the user sees.
    - *Back end* : Data and services

## Browser

- Software that *runs* the front end. 
    - HTML       : Website skeleton
    - CSS        : Website design
    - JavaScript : Website Dynamics
    
## DNS Server

- A DNS server is a computer server that contains a database of public IP addresses and their associated hostnames, and in most cases serves to resolve, or translate, those names to IP addresses as requested.
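
As a rough illustration of name resolution, the sketch below asks the operating system's resolver for the IPv4 address of a hostname using Python's standard `socket` module. The helper name `resolve` is an assumption made for this example, and the address returned for `localhost` can vary between machines.

```python
import socket

def resolve(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# "localhost" is normally resolved on the machine itself, without internet access
print(resolve("localhost"))

# A string that is already an IPv4 address is returned unchanged
print(resolve("127.0.0.1"))
```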

    

To begin webscraping, we only need to know about HTML and CSS.

### HTML

HTML (HyperText Markup Language) is the most basic building block of the Web. It defines the meaning and structure of web content.

```html
<html>
        <head>
            <title> Title of the webpage </title>
        </head>
        <body>
            Hello World
        </body>
</html>
```

In [1]:
from IPython.display import display, HTML  # IPython.core.display is deprecated for these imports
display(HTML("""Hello World """))

HTML code is made up of **elements**, which are written using **tags**.
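
To make the idea of tags concrete, here is a minimal sketch that lists the opening tags of the page shown above using Python's built-in `html.parser` module. The class name `TagCollector` is invented for this illustration.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag the parser encounters
        self.tags.append(tag)

page = """
<html>
    <head><title> Title of the webpage </title></head>
    <body> Hello World </body>
</html>
"""

collector = TagCollector()
collector.feed(page)
print(collector.tags)  # ['html', 'head', 'title', 'body']
```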

The most common tags include:

**Heading tags (by size)**
```html
<h1>Heading 1 </h1>  
<h2>Heading 2 </h2>
<h3>Heading 3 </h3> 
<h4>Heading 4 </h4> 
<h5>Heading 5 </h5> 
<h6>Heading 6 </h6>

```

In [2]:

display(HTML("""<h1>Heading 1 </h1>  
            <h2>Heading 2 </h2>
            <h3>Heading 3 </h3> 
            <h4>Heading 4 </h4> 
            <h5>Heading 5 </h5> 
            <h6>Heading 6 </h6>
            """))

**Paragraph tag**

```html
<p> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p>
```

In [3]:
display(HTML("""
<p> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. </p>
            """))

**Emphasis Tag**

```html
<em>Emphasis Tag</em>
```

In [4]:
display(HTML("<em>Emphasis Tag</em>"))

**Bold tag**
```html
<b> Bold Text </b>
```

In [5]:
display(HTML("<b> Bold Text </b>"))

**Italic tag**

```html
<i> Italic Text </i>
```

In [6]:
display(HTML("<i> Italic Text </i>"))

**Anchor tag**
```html
<a href = "https://www.google.com"> Click Here </a>
```

In [7]:
display(HTML("<a href = 'https://www.google.com'> Click Here </a>"))
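
An `href` without a scheme such as `https://` is treated as a path relative to the current page, which is why links to other sites are normally written as full URLs. The sketch below uses the standard library's `urljoin` to show how each form would resolve; `page_url` is a hypothetical address chosen for the example.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the link
page_url = "https://example.com/docs/index.html"

# Without a scheme, the href is resolved relative to the current page
relative = urljoin(page_url, "www.google.com")
print(relative)  # https://example.com/docs/www.google.com

# With a scheme, the href is already absolute and is used as-is
absolute = urljoin(page_url, "https://www.google.com")
print(absolute)  # https://www.google.com
```

The same function is handy when scraping, because links collected from a page are often relative and need to be combined with the page address before they can be requested.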

**List Tags**

```html
<li> One </li>
<li> Two </li>
<li> ... </li>
```

In [8]:
display(HTML("""<li> One </li>
                <li> Two </li>
                <li> ... </li>
"""))

**Ordered list**

```html
<ol>
    <li> One </li>
    <li> Two </li>
    <li> ... </li>
</ol>
```

In [9]:
display(HTML("""<ol>
                <li> One </li>
                <li> Two </li>
                <li> ... </li>
                </ol>
"""))

**Unordered list**

```html
<ul>
    <li> One </li>
    <li> Two </li>
    <li> ... </li>
</ul>
```

In [None]:
display(HTML("""<ul>
                <li> One </li>
                <li> Two </li>
                <li> ... </li>
                </ul>
"""))

## Tables (rows and columns)

```html
<table>
 <tr>
   <th>Month</th>
   <th>Savings</th>
 </tr>
 <tr>
   <td>January</td>
   <td>$100</td>
 </tr>
</table>
```

In [10]:
display(HTML("""
<table>
 <tr>
   <th>Month</th>
   <th>Savings</th>
 </tr>
 <tr>
   <td>January</td>
   <td>$100</td>
 </tr>
</table>
"""))

| Month | Savings |
|---|---|
| January | $100 |
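
Tables are a frequent scraping target because their rows and cells map directly onto rows and columns of data. As a sketch, assuming the `beautifulsoup4` package is installed, the table above can be read into a list of rows like this (the names `table_html`, `table_soup`, and `rows` are illustrative):

```python
from bs4 import BeautifulSoup

table_html = """
<table>
 <tr><th>Month</th><th>Savings</th></tr>
 <tr><td>January</td><td>$100</td></tr>
</table>
"""

# "html.parser" is Python's built-in parser, so no extra install is needed
table_soup = BeautifulSoup(table_html, "html.parser")

rows = []
for tr in table_soup.find_all("tr"):
    # Header cells are <th>, data cells are <td>
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)  # [['Month', 'Savings'], ['January', '$100']]
```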


# CSS - Cascading Style Sheets

CSS Syntax

```css
selector {
    property-1: value;
    property-2: value;
    property-3: value;
}
```
Example (embed CSS code inside a `<style>` tag):

```html
<style>
  h3{
      color:red;
     }
</style>  
<h3> Have a great day </h3>
```

In [11]:
display(HTML("""
<style>
  h3{
      color:red;
     }
</style>  
<h3> Have a great day </h3>
"""))

## Classes

What if I don't want to modify all the ```<h3>``` tags? Use **classes**.

```html
<style>
  .my_class{
      color:red;
     }
  .my_class2{
      color:blue;
     }
</style>  
<h3 class = "my_class"> Have a great day </h3>
<h3> Have a great day </h3>
```

In [12]:
display(HTML("""
<style>
  .my_class{
      color:red;
     }
  .my_class2{
      color:blue;
     }
</style>  
<h3 class = "my_class"> Have a great day </h3>
<h3 class = "my_class2"> Have a great day </h3>
"""))

## IDs

```html
<style>
  #my_id{
      background-color:DodgerBlue;
     }
</style>  
<h1 id = "my_id"> Have a great day </h1>
```

In [13]:
display(HTML("""
<style>
  #my_id{
      background-color:DodgerBlue;
     }
</style>   
<h1 id = "my_id"> Have a great day </h1>
"""))

# Webscraping example

1. Open the website you want to scrape

In [14]:
url = "https://realpython.github.io/fake-jobs/"
display(HTML(url))

2. Inspect the HTML of the information you want to retrieve (right click + Inspect)

## We will write code to retrieve this information

In [15]:
!pip install requests
!pip install beautifulsoup4 lxml  # lxml is the parser used below



In [16]:
import requests

In [17]:
from bs4 import BeautifulSoup as bs

In [18]:
response = requests.get(url)

In [19]:
response # 200 - Success status

<Response [200]>
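
The `200` above is an HTTP status code; with `requests` it is also available as `response.status_code`. As a small offline sketch, the standard library's `http.HTTPStatus` can translate a code into its standard phrase. The helper `describe_status` is invented for this example and assumes it is given a standard status code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '200 OK (success)' for a standard HTTP status code."""
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "informational"
    # HTTPStatus(code) raises ValueError for codes that are not defined
    return f"{code} {HTTPStatus(code).phrase} ({kind})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

In a scraper, calling `response.raise_for_status()` right after the request is a common way to stop early when the server returns an error code.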

In [20]:
response.text

'<!DOCTYPE html>\n<html>\n  <head>\n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    <title>Fake Python</title>\n    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css">\n  </head>\n  <body>\n  <section class="section">\n    <div class="container mb-5">\n      <h1 class="title is-1">\n        Fake Python\n      </h1>\n      <p class="subtitle is-3">\n        Fake Jobs for Your Web Scraping Journey\n      </p>\n    </div>\n    <div class="container">\n    <div id="ResultsContainer" class="columns is-multiline">\n    <div class="column is-half">\n<div class="card">\n  <div class="card-content">\n    <div class="media">\n      <div class="media-left">\n        <figure class="image is-48x48">\n          <img src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1" alt="Real Python Logo">\n        </figure>\n      </div>\n      <div class="media-content"

In [21]:
soup = bs(response.text, 'lxml')
print(soup.prettify())

<!DOCTYPE html>
<html>
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   Fake Python
  </title>
  <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
  <section class="section">
   <div class="container mb-5">
    <h1 class="title is-1">
     Fake Python
    </h1>
    <p class="subtitle is-3">
     Fake Jobs for Your Web Scraping Journey
    </p>
   </div>
   <div class="container">
    <div class="columns is-multiline" id="ResultsContainer">
     <div class="column is-half">
      <div class="card">
       <div class="card-content">
        <div class="media">
         <div class="media-left">
          <figure class="image is-48x48">
           <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
          </figure>
         </div>
         <div class="media-content">
          <h2 c

## Get Elements 

```html
<div class = "card-content">
```


In [22]:
elements = soup.find_all('div', {"class" : "card-content"})
len(elements)

100
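
The class and ID selectors from the CSS section can also be used to locate elements. As a sketch on a small inline snippet (the variable names are illustrative and not part of the notebook), BeautifulSoup's `select` and `select_one` methods accept CSS selector strings:

```python
from bs4 import BeautifulSoup

snippet = """
<h3 class="my_class"> Have a great day </h3>
<h3 class="my_class2"> Have a great day </h3>
<h1 id="my_id"> Have a great day </h1>
"""

demo_soup = BeautifulSoup(snippet, "html.parser")

by_class = demo_soup.select(".my_class")   # list of every tag with class my_class
by_id = demo_soup.select_one("#my_id")     # first tag with id my_id, or None

print(len(by_class))  # 1
print(by_id.name)     # h1
```

On the job page, `soup.select("div.card-content")` would be one way to express the same search as the `find_all` call above.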

In [23]:
elements[0]

<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>
<div class="content">
<p class="location">
        Stewartbury, AA
      </p>
<p class="is-small has-text-grey">
<time datetime="2021-04-08">2021-04-08</time>
</p>
</div>
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
</div>

In [24]:
type(elements[0])

bs4.element.Tag

In [25]:
# find_all returns a list, so [0] is required even when there is only one match
elements[0].find_all('h2', {'class' : 'title is-5'})[0].text

'Senior Python Developer'
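
When only the first match is needed, `find` is a possible alternative to `find_all(...)[0]`. The sketch below compares the two on a reduced copy of one job card; `card_html` and the other variable names are made up for the illustration.

```python
from bs4 import BeautifulSoup

card_html = """
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
</div>
"""
card = BeautifulSoup(card_html, "html.parser")

# find_all returns a list, so the first match needs [0]
title_from_list = card.find_all("h2", {"class": "title is-5"})[0].text

# find returns the first matching tag directly
title_from_find = card.find("h2", {"class": "title is-5"}).text

# find returns None when nothing matches, which is easy to test for
missing = card.find("h4")

print(title_from_list == title_from_find)  # True
print(missing)                             # None
```

Checking for `None` avoids the `IndexError` that `find_all(...)[0]` raises when a card lacks the expected element.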

In [26]:
# Get all information
positions = []
companies = []
locations = []
dates     = []
for element in elements:
    position = element.find_all('h2', {'class' : 'title is-5'})[0].text
    company  = element.find_all('h3', {'class' : 'subtitle is-6 company'})[0].text
    date     = element.find_all('p',  {'class' : 'is-small has-text-grey'})[0].text
    location = element.find_all('p',  {'class' : 'location'})[0].text
    
    positions.append(position)
    companies.append(company)
    locations.append(location)
    dates.append(date)
    
import pandas as pd

df = pd.DataFrame.from_dict({'Company'  : companies,
                             'Location' : locations,
                             'Date'     : dates,
                             'Position' : positions})

df.head()

|   | Company | Location | Date | Position |
|---|---|---|---|---|
| 0 | Payne, Roberts and Davis | \n Stewartbury, AA\n | \n2021-04-08\n | Senior Python Developer |
| 1 | Vasquez-Davidson | \n Christopherville, AA\n | \n2021-04-08\n | Energy engineer |
| 2 | Jackson, Chambers and Levy | \n Port Ericaburgh, AA\n | \n2021-04-08\n | Legal executive |
| 3 | Savage-Bradley | \n East Seanview, AP\n | \n2021-04-08\n | Fitness centre manager |
| 4 | Ramirez Inc | \n North Jamieview, AP\n | \n2021-04-08\n | Product manager |


In [27]:
# Strip surrounding newlines and spaces (common in raw scraped text)
for col in df.columns:
    df[col] = df[col].str.strip()
df.head()

|   | Company | Location | Date | Position |
|---|---|---|---|---|
| 0 | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | Senior Python Developer |
| 1 | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | Energy engineer |
| 2 | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | Legal executive |
| 3 | Savage-Bradley | East Seanview, AP | 2021-04-08 | Fitness centre manager |
| 4 | Ramirez Inc | North Jamieview, AP | 2021-04-08 | Product manager |
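
The whitespace can also be removed at extraction time. As a sketch on a single tag copied from the card shown earlier, `get_text(strip=True)` returns the text with surrounding whitespace already stripped; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

location_html = """
<p class="location">
        Stewartbury, AA
      </p>
"""
tag = BeautifulSoup(location_html, "html.parser").find("p", {"class": "location"})

raw = tag.text                    # keeps the newlines and indentation
clean = tag.get_text(strip=True)  # surrounding whitespace removed

print(repr(clean))  # 'Stewartbury, AA'
```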


## Attributes

In [28]:
# Tag attributes (such as href) can be accessed like dictionary keys
apply_links = []
for element in elements:
    # Each card has two footer links: [0] is "Learn", [1] is "Apply"
    apply_link = element.find_all('a', {'class' : 'card-footer-item'})[1]['href']
    apply_links.append(apply_link)

df['href'] = apply_links  # a plain list works; no NumPy array is needed
df.head()
    

|   | Company | Location | Date | Position | href |
|---|---|---|---|---|---|
| 0 | Payne, Roberts and Davis | Stewartbury, AA | 2021-04-08 | Senior Python Developer | https://realpython.github.io/fake-jobs/jobs/se... |
| 1 | Vasquez-Davidson | Christopherville, AA | 2021-04-08 | Energy engineer | https://realpython.github.io/fake-jobs/jobs/en... |
| 2 | Jackson, Chambers and Levy | Port Ericaburgh, AA | 2021-04-08 | Legal executive | https://realpython.github.io/fake-jobs/jobs/le... |
| 3 | Savage-Bradley | East Seanview, AP | 2021-04-08 | Fitness centre manager | https://realpython.github.io/fake-jobs/jobs/fi... |
| 4 | Ramirez Inc | North Jamieview, AP | 2021-04-08 | Product manager | https://realpython.github.io/fake-jobs/jobs/pr... |
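
Relying on position `[1]` works here, but it may break if the footer ever gains or loses a link. One possible alternative, sketched on the footer of the first card, is to pick the link by its visible text and to use `.get` for attributes that might be absent. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

footer_html = """
<footer class="card-footer">
<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
</footer>
"""
footer = BeautifulSoup(footer_html, "html.parser")
links = footer.find_all("a", {"class": "card-footer-item"})

# Select the link by its visible text rather than by its position
apply_link = next(a for a in links if a.get_text(strip=True) == "Apply")

print(apply_link["href"])          # square brackets raise KeyError if the attribute is missing
print(apply_link.get("title"))     # .get returns None instead, so this prints None
print(apply_link.attrs["target"])  # .attrs is the underlying dictionary: _blank
```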
