<h1 style="color: #00BFFF;">Web Scraping with Selenium (advanced)</h1>

![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExbzQ4eDdobnZlenhtN3c5MndmcDZpMW4wdXZzZTcxaDl1Zmo2YWt3dSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/SpopD7IQN2gK3qN4jS/giphy.gif)

### Exploring Web Page Structures

To inspect the underlying HTML of a web page, right-click anywhere on the page. 
- Choose "View Page Source" in browsers like Chrome or Firefox.
- For Internet Explorer, choose "View Source," and for Safari, select "Show Page Source."
- (In Safari, if this option isn't visible, navigate to Safari Preferences, click on the Advanced tab, and enable "Show Develop menu in menu bar.")

To embark on your web scraping journey, you just need to grasp **three foundational aspects** of HTML:

### Fact 1: HTML is Built on Tags

At its core, HTML is composed of content enveloped in `<tags>`. Tags come in various types:
 * **Headings**: `<h1>`, `<h2>`, `<h3>`, `<h4>`...
 * **Phrasing**: `<b>`, `<strong>`, `<sub>`, `<i>`, `<a>`...
 * **Embedded Content**: `<audio>`, `<img>`, `<video>`, `<iframe>`...
 * **Tabulated Data**: `<table>`, `<tr>`, `<td>`, `<tbody>`...
 * **Page Sections**: `<header>`, `<section>`, `<nav>`, `<article>`...
 * **Metadata and Scripts**: `<meta>`, `<title>`, `<script>`, `<link>`...

### Fact 2: Tags Can Have Attributes

HTML tags can possess "attributes," which are defined within the opening tag itself. Examine the following example:
- `<a class="text-monospace" id="name_132" href="http://www.example.com"> Page Content </a>`: This `a` (anchor) tag has the following attributes:
    + `class`: With the value "text-monospace". Remember, a class isn't unique across the page.
    + `id`: With the value "name_132". IDs are meant to be unique identifiers for tags on the page.
    + `href`: With the value "http://www.example.com". The href usually holds a link to another section of the page or to an external website.

**Key Notes**:
- The `id` attribute should be unique for a tag; no two tags should share the same `id`.
- The `class` attribute isn't meant to be unique. Instead, it often groups tags exhibiting similar behavior or styles.
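
To see these attributes programmatically, here is a minimal sketch that parses the anchor tag above and collects its attributes into a dictionary. It uses Python's built-in `html.parser` only to stay dependency-free, and the class name `AttributeCollector` is made up for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record the attributes of every opening tag it sees (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []  # list of (tag_name, attributes_dict) pairs

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.seen.append((tag, dict(attrs)))


snippet = '<a class="text-monospace" id="name_132" href="http://www.example.com"> Page Content </a>'

collector = AttributeCollector()
collector.feed(snippet)

tag_name, attributes = collector.seen[0]
print(tag_name)             # a
print(attributes["class"])  # text-monospace
print(attributes["id"])     # name_132
print(attributes["href"])   # http://www.example.com
```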

For web scraping purposes, **understanding the semantics** behind terms like `<span>`, `class`, or `short-desc` **isn't crucial**.


### Fact 3: Tags Can Be Nested

Imagine the following segment of HTML code:

`Hello <strong><em>Ironhack</em> students</strong>`

Here, the phrase **Ironhack students** would be displayed in bold since it sits between the `<strong>` and `</strong>` tags. Additionally, the word ***Ironhack*** would be italicized because of the `<em>` tag, which marks emphasis and is rendered in italics by default. The word "Hello" remains unformatted, as it lies outside both the `<strong>` and `<em>` tags. This results in the display:

Hello ***Ironhack* students**

This example illustrates a key principle: **tags influence the text from their opening to their closing points,** even if they are nested within other tags.
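
The same principle can be checked in code. The sketch below keeps a stack of currently open tags and records which ones surround each piece of text. `NestingTracker` is a hypothetical helper built on the standard library's `html.parser`, and it assumes the markup is well nested.

```python
from html.parser import HTMLParser


class NestingTracker(HTMLParser):
    """Pair each text chunk with the tags open around it (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.chunks = []     # (text, tags_in_effect) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-nested markup, so the closing tag matches the top of the stack
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        if data.strip():
            self.chunks.append((data.strip(), tuple(self.open_tags)))


tracker = NestingTracker()
tracker.feed("Hello <strong><em>Ironhack</em> students</strong>")
tracker.close()

for text, tags in tracker.chunks:
    print(f"{text!r} is inside: {tags}")
```

"Hello" is reported with no enclosing tags, "Ironhack" with both `strong` and `em`, and "students" with `strong` only.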

### Selecting Specific Elements in Web Scraping

When diving into web scraping, it's essential to target specific elements efficiently. To home in on the precise content you need, consider filtering tags based on:

 * **Tag Name**: The main type of the element (e.g., `<div>`, `<a>`, `<p>`).
 * **Class**: A descriptor that groups multiple elements with similar characteristics.
 * **ID**: A unique identifier assigned to a particular element.
 * **Other Attributes**: Additional properties like `href`, `title`, or `lang` that can further specify the elements of interest.
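
The four filters above can be sketched on a small sample. The markup below is hypothetical and deliberately well formed, so the standard library's `xml.etree.ElementTree` can parse it; real pages are rarely valid XML, which is why a dedicated HTML parser is the usual choice in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (real pages are rarely valid XML)
page = """
<div id="results">
  <p class="short-desc">Junior Data Analyst</p>
  <p class="short-desc">Data Engineer</p>
  <a id="next_page" class="nav" href="/jobs?page=2" title="Next">Next</a>
</div>
"""

root = ET.fromstring(page)

by_tag = root.findall(".//p")                         # filter by tag name
by_class = root.findall(".//*[@class='short-desc']")  # filter by class
by_id = root.find(".//*[@id='next_page']")            # filter by ID
by_attr = root.findall(".//a[@title='Next']")         # filter by another attribute

print([p.text for p in by_tag])
print(by_id.get("href"))
```

Note that `@class='short-desc'` compares the whole attribute string, so an element with several classes would not match here.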


<h1 style="color: #00BFFF;">00 | Use case: Indeed Jobs</h1>

In [5]:
# 📚 Basic libraries
import pandas as pd

#❗New Libraries !
from bs4 import BeautifulSoup
import requests

In [6]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In [7]:
link = 'https://es.indeed.com/jobs?q=data+analyst+junior&l=&from=searchOnHP&vjk=ad48ee37bc54e63e'

In [8]:
requests.get(link)

<Response [403]>

### How to Solve a 403 Error

When you get a `403` status code in response to a web request, it means "Forbidden." The server understands your request, but it refuses to fulfill it. This is often a measure by websites to prevent web scraping or automated access.

Here's why you might get a `403 Forbidden` error:

1. **User-Agent**: Many websites block requests that don't have a standard web browser User-Agent. The default User-Agent of the `requests` library often gets blocked.
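2. **Robots.txt**: This is a file websites use to tell web crawlers which pages or sections of the site shouldn't be processed or scanned. The file itself doesn't block anything, but servers often refuse automated access to the paths it disallows. Respect it.
3. **Rate Limiting**: Websites might block you if you make too many requests in a short period.

Other causes exist too, such as a blocked IP address or missing session cookies.

To solve it, try the following, starting with the User-Agent: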

1. **Change the User-Agent**:
   You can mimic a request from a web browser by setting a User-Agent header.
   ```python
   headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
   }
   response = requests.get(link, headers=headers)  # `link` is the URL defined above
   ```

2. **Use a Web Scraper Library**:
   Libraries like Scrapy or Selenium can help bypass restrictions, especially when JavaScript rendering is involved.

3. **Respect `robots.txt`**:
   Always check `https://www.example.com/robots.txt` (replace `example.com` with the website's domain) to see which URLs you're allowed to access.

4. **Rate Limiting**:
   Implement delays in your requests using `time.sleep(seconds)` to avoid hitting rate limits.

5. **Use Proxies or VPN**:
   Rotate IP addresses or use a VPN service if the server has blocked your IP.

6. **Sessions & Cookies**:
   Some websites might require maintaining sessions or handling cookies.
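
As a sketch of the `robots.txt` check in step 3, the standard library's `urllib.robotparser` can answer whether a given URL is allowed. The rules and the user-agent name `my-scraper` below are made up, and they are parsed from memory so the example runs offline. For a live site you would call `set_url(...)` followed by `read()` instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so the sketch runs offline
sample_robots = """
User-agent: *
Disallow: /private/
Allow: /jobs
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)

jobs_allowed = parser.can_fetch("my-scraper", "https://www.example.com/jobs")
private_allowed = parser.can_fetch("my-scraper", "https://www.example.com/private/report")

print(jobs_allowed)     # True
print(private_allowed)  # False
```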


<h1 style="color: #00BFFF;">00 | Advanced Use case: LinkedIn Jobs using Selenium</h1>

In [9]:
# pip install selenium
# pip install webdriver-manager
# pip install selenium --upgrade

In [10]:
# 📚 Basic libraries
import pandas as pd
import os
import random
import re
from bs4 import BeautifulSoup

#❗New Libraries !
import time
from getpass import getpass # read your password without echoing it on screen
# selenium libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
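from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys # keyboard keys (TAB, ENTER, END...) used in later cells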
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
import pathlib

### Code Breakdown

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import random
from getpass import getpass
```

### Imports:
- **webdriver**: The main module from Selenium that allows controlling the browser.
- **Service**: Used to set up the ChromeDriver service.
- **ChromeDriverManager**: Manages and installs the ChromeDriver automatically.
- **By**: Provides the locator strategies (ID, class name, CSS selector, XPath, etc.) that you pass to `find_element` to locate elements in a webpage.
- **time**, **random**: Used to add delays (for mimicking human behavior).
- **getpass**: Allows securely getting the password input without displaying it on the screen.

In [11]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In web scraping, **WebDriver** is a tool that automates browsers, allowing you to interact with web pages just like a human would. It is the browser-control interface provided by **Selenium**, a popular browser automation library.

The WebDriver works by controlling the browser (e.g., Chrome, Firefox) and simulating user actions such as clicking buttons, filling out forms, and navigating web pages.

In [12]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [13]:
# open the website
driver.get('https://www.linkedin.com/login/')

In [None]:
# Add email
email = input('Enter your email: ')

# Find email box
email_box = driver.find_element(By.ID, "username")

# Clear email box
email_box.clear()

# Input email into browser
email_box.send_keys(email)

# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [None]:
# Add password
password = getpass('Enter your password: ')

# Find password box
pass_box = driver.find_element(By.ID, "password")

# Clear password box
pass_box.clear()

# Input password into browser
pass_box.send_keys(password)

# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [None]:
# Find and click on the log-in button
login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
login.click()
time.sleep(random.random() * 3)
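
The `time.sleep(random.random() * 3)` call repeated in these cells can pause for almost zero seconds. One possible refinement is a small helper with an explicit lower bound. `human_pause` and its default bounds are hypothetical, not part of Selenium.

```python
import random
import time


def human_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    if min_seconds > max_seconds:
        raise ValueError("min_seconds must not exceed max_seconds")
    duration = random.uniform(min_seconds, max_seconds)
    time.sleep(duration)
    return duration


# Short bounds here so the demo finishes quickly
waited = human_pause(0.01, 0.05)
print(f"Paused for {waited:.3f} seconds")
```

In the cells of this notebook, `human_pause()` could replace each `time.sleep(random.random() * 3)` line.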

In [None]:
# Add exception handling
try:
    login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
    login.click()
    time.sleep(random.random() * 3)
except NoSuchElementException:
    print("Log-in already done!")
except Exception as e:
    print(repr(e))
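
Fixed sleeps can still be too short on a slow connection, in which case `find_element` raises `NoSuchElementException` even though the element would have appeared a moment later. An alternative is to poll until a condition holds. Selenium ships `WebDriverWait` (in `selenium.webdriver.support.ui`) for this purpose; the sketch below only illustrates the underlying idea in plain Python, with a hypothetical `wait_until` helper and a stand-in condition instead of a real browser.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "is the element on the page yet?": succeeds on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(element_ready, timeout=2.0, poll_interval=0.01)
print(attempts["count"])  # 3
```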

<h2 style="color: #008080;">Final job position</h2>

In [None]:
# Go to job search bar
try:
    job_icon = driver.find_element(By.CSS_SELECTOR, "span[title='Jobs']")
    job_icon.click()
    time.sleep(random.random() * 3)
except ElementClickInterceptedException:
    print("Click intercepted by another element. Try zooming in or resizing the window")
except Exception as e:
    print(repr(e))

In [None]:
# Zooming in
driver.execute_script("document.body.style.zoom='200%'")

In [None]:
# Zooming out
driver.execute_script("document.body.style.zoom='67%'")

<h2 style="color: #008080;">Type of job</h2>

In [None]:
# Optional - Change window size
# driver.set_window_size(800, 600)

In [None]:
search_job = driver.find_elements(By.CLASS_NAME,'jobs-search-box__text-input')[0]
job = input('What job do you want to search for: ')
search_job.clear()
search_job.send_keys(job)
time.sleep(random.random() * 3)

# Go to the location tab
search_job.send_keys(Keys.TAB)

<h2 style="color: #008080;">Job location</h2>

In [None]:
location_box = driver.switch_to.active_element
location = input('Where do you want to search for jobs: ')
location_box.send_keys(location)
time.sleep(random.random() * 3)

In [None]:
# Now let's search
location_box.send_keys(Keys.ENTER)

In [None]:
# Maximize the window - useful to see all the elements as the page is dynamic
driver.maximize_window()

In [None]:
## Optional: you can also fullscreen the window
# driver.fullscreen_window()

<h2 style="color: #008080;">Scraping from HTML</h2>

Selenium can be quite slow, so it's always worth checking whether we can fetch our data directly using static web scraping tools (i.e. `requests`, `BeautifulSoup`, `scrapy`):

In [None]:
# Check if the source code contains the job listings
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser') # name the parser explicitly to avoid a warning
soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})

In [None]:
# Clean the list
job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
job_list_clean = [job.text.strip() for job in job_list_dirty]
job_list_clean

In [None]:
# Do the same for the company
job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
job_company_clean = [company.text.strip() for company in job_company_dirty]
job_company_clean

In [None]:
# Make it into a dataset
data = zip(job_list_clean, job_company_clean)
df = pd.DataFrame(data, columns=['Job', 'Company'])
df
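
One caveat with this cell: `zip` stops at the shorter list, so if a job card has no company element, a row is dropped and later jobs may be paired with the wrong company. The sketch below uses made-up sample lists to show the effect and a simple length check.

```python
from itertools import zip_longest

# Made-up sample data: the second job card had no company element
jobs = ["Data Analyst", "BI Analyst", "Data Scientist"]
companies = ["Acme", "Globex"]

# zip silently drops the unmatched job and pairs "Globex" with the wrong title
paired = list(zip(jobs, companies))
print(paired)
# [('Data Analyst', 'Acme'), ('BI Analyst', 'Globex')]

# A length check makes the mismatch visible before building the DataFrame
if len(jobs) != len(companies):
    print(f"Mismatch: {len(jobs)} jobs vs {len(companies)} companies")

# zip_longest keeps every row and fills the gap with None
rows = list(zip_longest(jobs, companies))
print(rows)
```

Neither version restores the correct pairing. Extracting the title and the company from each job card element individually would be the more robust design.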

In [None]:
# Great, let's now create a function out of this:
def get_job_postings(driver, page):
    """Scroll the results page and return its job postings as a DataFrame."""

    # Reset zoom to 100% so the layout renders every job card
    driver.execute_script("document.body.style.zoom='100%'")

    # Go to bottom of page to retrieve all job postings
    page.send_keys(Keys.END)
    page.send_keys(Keys.CONTROL + Keys.HOME) # combination of the two keys brings you to the top of the element

    # Parse HTML
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # Get jobs
    job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
    job_list_clean = [job.text.strip() for job in job_list_dirty]

    # Get companies
    job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
    job_company_clean = [company.text.strip() for company in job_company_dirty]

    # Convert data into a dataframe
    data = zip(job_list_clean, job_company_clean)
    return pd.DataFrame(data, columns=['Job', 'Company'])

In [None]:
page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
get_job_postings(driver, page)

In [None]:
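# Get a list with the numbered pagination buttons in the page
def get_buttons(page):
    buttons = []
    # Note: an XPath starting with "//" searches the whole document, not just `page`
    for button in page.find_elements(By.XPATH, "//ul/li/button"):
        try:
            int(button.text)  # keep only buttons whose label is a page number
            buttons.append(button)
        except ValueError:
            pass
    return buttons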
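
The filtering idea in `get_buttons` can be tried without a browser. In the sketch below, `FakeButton` is a hypothetical stand-in for a Selenium element that models only the `text` attribute, and `str.isdigit()` replaces the `try`/`int()` check, which is an equivalent test for plain page labels such as "1" or "12".

```python
from dataclasses import dataclass


@dataclass
class FakeButton:
    """Stand-in for a Selenium WebElement; only the `text` attribute is modeled."""
    text: str


def numeric_buttons(buttons):
    """Keep only the buttons whose label is a page number."""
    return [b for b in buttons if b.text.strip().isdigit()]


found = [FakeButton("1"), FakeButton("2"), FakeButton("…"), FakeButton("Next"), FakeButton("8")]
pages = numeric_buttons(found)
print([b.text for b in pages])  # ['1', '2', '8']
```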

In [None]:
# Get the number of pages to scrape
current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
buttons = get_buttons(current_page)

<h2 style="color: #008080;">Final DataFrame</h2>

In [None]:
# Loop through pages and save results in a dataframe
df = pd.DataFrame()
driver.execute_script("document.body.style.zoom='100%'")

for i in range(len(buttons)):
    # Printing the button number for debugging purposes
    print(i)

    # Extract posts from current page
    current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
    postings = get_job_postings(driver, current_page)

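    # Refresh button list: after each click the page re-renders and the old button references go stale
    # (if you don't, the code will throw an exception... trust me, I spent half an hour debugging it)
    current_buttons = get_buttons(current_page)

    # Add to dataframe
    df = pd.concat([df, postings], axis=0)

    # Go to the next page and give it a moment to load
    current_buttons[i].click()
    time.sleep(random.random() * 3)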

In [None]:
# Check dataframe (drop_duplicates returns a new DataFrame; df itself is unchanged)
df.drop_duplicates()

<h2 style="color: #008080;">Summary</h2>

It's always recommended to check for the availability of an **API** before resorting to web scraping for the following reasons:
 * It is generally much easier to use
 * APIs are usually well-documented
 * Utilizing APIs is often preferred by server administrators

Refer to the `robots.txt` file on a website (by visiting `www.example.com/robots.txt`) to understand the server's guidelines and limitations regarding web scraping.

1. **Web Technologies**:
   - **HTML**: This is the standard markup language that holds the content of the webpage. It is the primary target when we engage in web scraping.
   - **CSS**: Cascading Style Sheets are used to describe the look and formatting of a document written in HTML.
   - **JavaScript**: This is a scripting language used to create interactive and dynamic website content.

2. **HTML Structure**:
   - **Hierarchical**: HTML documents are structured hierarchically, meaning elements are nested within other elements, forming a tree-like structure.
   - **Tags**: These are the building blocks of HTML, defining elements that hold different types of content.
   - **Attributes**: HTML tags can have attributes, which define properties of an element and are used to set various characteristics such as class, ID, and style.

3. **Web Scraping Tools**:
   - **Requests**: A Python library that allows you to send HTTP requests to get the HTML content of a webpage.
   - **Beautiful Soup**: A Python library that facilitates the programmatic analysis of HTML, helping in parsing the HTML and navigating the parse tree.
   - **Selenium**: In cases where the webpage content is dynamic and generated using JavaScript, tools like Selenium are often used. Selenium can interact with JavaScript to load dynamic content, making it accessible for scraping.
   
4. **Finding and Selecting Elements**:
   - **Selection by Tag, Class, and ID**: We can find elements using various attributes such as their tag name, class name, or ID.
   - **CSS Selectors**: These are patterns for more precise selection, combining tags, classes, IDs, attributes, and the relationships between elements to pinpoint the ones you need.


<h2 style="color: #008080;">Further materials</h2>

[Web archive](http://web.archive.org/): Browse historical snapshots of webpages!