<h1 style="color: #00BFFF;">Level 1: Web Scraping with BeautifulSoup (Understanding HTML structure)</h1>

<h2 style="color: #00BFFF;">Three Sisters doc from Alice in Wonderland</h2>

![image.png](attachment:f6867904-6125-4bdf-a53f-9c1032bb70d5.png)

In [2]:
html_doc = """
<!DOCTYPE html>
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</html>
"""

In [3]:
html_doc

'\n<!DOCTYPE html>\n<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title"><b>The Dormouse\'s story</b></p>\n\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n\n<p class="story">...</p>\n</html>\n'

<h2 style="color: #00BFFF;">Creating the Soup</h2>

![Cool GIF](https://i.pinimg.com/originals/e2/9e/f7/e29ef7444d2b503da0720e67ffee4002.gif)

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
**Check documentation!!!!**

In [4]:
from bs4 import BeautifulSoup
# pip install beautifulsoup4  (this is the package that provides the bs4 module)

In [5]:
# parse the element
soup = BeautifulSoup(html_doc, 'html.parser') # standard
soup


<!DOCTYPE html>

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

In [6]:
# what is soup? try with type!

#### `.prettify()`

In [7]:
# Makes things pretty!!!
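
One possible way to fill in the two cells above, shown as a sketch on a shortened copy of the document (`sample_html` and `sample_soup` are names made up for this illustration, not part of the notebook):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the Three Sisters document
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The parsed document is a BeautifulSoup object, not a plain string
soup_type = type(sample_soup).__name__
print(soup_type)

# prettify() returns a string with each tag on its own line,
# indented according to how deeply it is nested
pretty = sample_soup.prettify()
print(pretty)
```

The indentation in the printed result makes the parent/child structure of the page easier to read than the raw string.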

#### `.title`

#### `.p`

#### `.a`

#### `.find()`

#### `.body`
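
As a rough sketch of the dot-style navigation listed in these headings (again on a shortened, hypothetical copy of the document), each attribute returns the first tag with that name:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the Three Sisters document
sample_html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title      # first <title> tag
first_p = sample_soup.p            # first <p> tag
first_a = sample_soup.a            # first <a> tag
same_a = sample_soup.find("a")     # find() returns the same first match
body_tag = sample_soup.body        # the <body> tag and everything inside it

print(title_tag.string)            # text inside <title>
print(first_p["class"])            # class is returned as a list of names
print(first_a["href"])             # attribute access works like a dict
```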

#### `.find_all()`

`find` and `find_all` are methods used to search the soup tree for tags that match given criteria. (`findAll` is the older spelling of `find_all`; both work, but `find_all` is preferred in new code.)

1. **find**:
    - Returns only the **first** tag that matches a given set of criteria.
    - Useful when you know there's only one tag of interest or you only want the first occurrence.
    - Example: If you have multiple `<p>` tags on a page and you use `soup.find('p')`, you'll get only the first `<p>` tag.

2. **find_all (or the older findAll)**:
    - Returns a **list** of tags that match the given criteria.
    - Useful when you want to capture all occurrences of a particular tag or set of tags.
    - Example: Using `soup.find_all('p')` will give you a list containing all `<p>` tags on the page.

Here's a simple illustration:

```html
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
```

Using `find('p')` would return the first `<p>` tag (`<p>First paragraph.</p>`), while `find_all('p')` would return a list containing both `<p>` tags.
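
The illustration can be run directly. This sketch (variable names are chosen for the example) also shows what each method gives back when nothing matches:

```python
from bs4 import BeautifulSoup

illustration = """
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
"""
demo_soup = BeautifulSoup(illustration, "html.parser")

first_paragraph = demo_soup.find("p")        # a single Tag
all_paragraphs = demo_soup.find_all("p")     # a list-like collection of Tags

# When nothing matches, find() returns None and find_all() returns an empty list
missing = demo_soup.find("table")
none_found = demo_soup.find_all("table")

print(first_paragraph)
print([p.get_text() for p in all_paragraphs])
print(missing, len(none_found))
```

Because `find` can return `None`, it is worth checking the result before calling methods on it.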


To get all content text from a website:

In [18]:
p_tags = soup.find_all("p")
len(p_tags)

3

In [19]:
p_tags

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

A common web scraping task is to extract all urls from a website:
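
A minimal sketch of that task, using only the story paragraph from the Three Sisters document (`sisters_html` and the other variable names are made up for this example):

```python
from bs4 import BeautifulSoup

sisters_html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
sisters_soup = BeautifulSoup(sisters_html, "html.parser")

links = sisters_soup.find_all("a")

# .get("href") reads the attribute value; .get_text() reads the visible text
urls = [a.get("href") for a in links]
names = [a.get_text() for a in links]

# .get() returns None when the attribute is absent, instead of raising an error
no_title = links[0].get("title")

print(urls)
print(names)
```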

#### `.get()`

#### `.get_text()`

**Access the attribute**: Once you have the element, use the `.get()` method to access the attribute value.

    ```python
    link_url = link_element.get('href')
    ```

To search for HTML elements by class in a webpage using BeautifulSoup, you can also use the `find` and `find_all` methods.

1. **Using `find` method to get the first matching element**:
   
   ```python
   result = soup.find(class_='your-class-name')
   ```

2. **Using `find_all` method to get a list of all matching elements**:

   ```python
   results = soup.find_all(class_='your-class-name')
   ```
   
Note that we are using the `class_` parameter because `class` is a reserved keyword in Python.
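
Applied to a fragment of the Three Sisters document, a class search might look like this (a sketch; the variable names are illustrative):

```python
from bs4 import BeautifulSoup

fragment = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.</p>
<p class="story">...</p>
"""
fragment_soup = BeautifulSoup(fragment, "html.parser")

first_sister = fragment_soup.find(class_="sister")         # first match only
all_sisters = fragment_soup.find_all(class_="sister")      # every match
story_paragraphs = fragment_soup.find_all("p", class_="story")  # tag + class

print(first_sister.get_text())
print([tag.get_text() for tag in all_sisters])
print(len(story_paragraphs))
```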

### Your turn:

Write code to print the following contents (not including the html tags, only human-readable text):

1. All the "fun facts".

2. The names of all the places.

3. The content (name and fact) of all the cities (only cities, not countries!)

4. The names (not facts!) of all the cities (not countries!)

In [23]:
geography = """
<!DOCTYPE html>
<html>
<head> Geography</head>
<body>

<div class="city">
  <h2>London</h2>
  <p>London is the most popular tourist destination in the world.</p>
</div>

<div class="city">
  <h2>Paris</h2>
  <p>Paris was originally a Roman City called Lutetia.</p>
</div>

<div class="country">
  <h2>Spain</h2>
  <p>Spain produces 43,8% of all the world's Olive Oil.</p>
</div>

</body>
</html>
"""

<h1 style="color: #00BFFF;">Level 2: Web Scraping Multiple Pages</h1>

![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExOGllM2RkeHVmM3diZ2Mzdzhydm00MjFrbW8zMWgwNHEzbTl5eGEyMiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/fJKG1UTK7k64w/giphy.gif)

<h2 style="color: #008080;">Web Scraping with Python</h2>

In this lesson, we'll use the `requests` library to fetch web pages and `Beautiful Soup` from the `bs4` package to parse these pages and extract information.

<h1 style="color: #00BFFF;">00 | Use case 1</h1>

In [26]:
# 📚 Basic libraries
import pandas as pd

#❗New Libraries !
from bs4 import BeautifulSoup
import requests

In [27]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

Let's go to https://books.toscrape.com/, where we'll see a collection of books.
Notice how each book has the following elements:

- Title

- Price

Our objective is to scrape this information and store it in a pandas DataFrame.



In [28]:
# get the link
link = "https://books.toscrape.com/"

In [29]:
response = requests.get(link)
response.status_code # 200 status code means OK!

200

In [30]:
# prepare a beautiful soup
soup = BeautifulSoup(response.text, "html.parser")

<h2 style="color: #008080;">Getting all Books title</h2>

<h2 style="color: #008080;">Getting all Books Prices</h2>
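
A sketch covering both steps (titles and prices), run on a small hand-written snippet so it works offline. The markup below is an assumption about how the listing page is structured (an `<article class="product_pod">` per book, the full title in the `title` attribute of the link, the price in `<p class="price_color">`); inspect the real page in your browser to confirm before relying on it. The book titles and prices are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating one page of book listings
sample_listing = """
<ol class="row">
  <li><article class="product_pod">
    <h3><a href="catalogue/book-one/index.html" title="Hypothetical Book One">Hypothetical Book ...</a></h3>
    <div class="product_price"><p class="price_color">£51.77</p></div>
  </article></li>
  <li><article class="product_pod">
    <h3><a href="catalogue/book-two/index.html" title="Hypothetical Book Two">Hypothetical Book ...</a></h3>
    <div class="product_price"><p class="price_color">£13.99</p></div>
  </article></li>
</ol>
"""
listing_soup = BeautifulSoup(sample_listing, "html.parser")

rows = []
for article in listing_soup.find_all("article", class_="product_pod"):
    # The visible link text may be truncated, so read the title attribute
    title = article.h3.a.get("title")
    price_text = article.find("p", class_="price_color").get_text()
    # Drop the currency symbol and convert to a number
    rows.append({"Title": title, "Price": float(price_text.lstrip("£"))})

print(rows)
```

With the real page, `listing_soup` would be the `soup` built from `response.text`, and `pd.DataFrame(rows)` would turn the list of dictionaries into the DataFrame described above.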

<h1 style="color: #00BFFF;">02 | Extracted Data</h1>

<h1 style="color: #00BFFF;">Another example: Use case 2</h1>

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In [34]:
r = requests.get('https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)')
r.status_code

200

In [35]:
html = r.content

In [36]:
soup = BeautifulSoup(html, 'html.parser')

BeautifulSoup allows filtering results using combinations, such as filtering by tag and class.

```python
tags = soup.find_all(name=tag_name, class_=class_name)
```

To extract the names from the provided HTML content, you can:

1. Use the `find_all` or `findAll` method to locate the `<h4>` tags with the specific class (`de-ProductTile-title` in this case).
2. Extract the text from the found tag.
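
Since the HTML those steps refer to is not included here, the sketch below uses an invented snippet; only the tag name (`h4`) and class name (`de-ProductTile-title`) come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical product tiles; the product names are made up
tiles_html = """
<div class="tile">
  <h4 class="de-ProductTile-title">Product A</h4>
  <h4 class="other">Skip me</h4>
</div>
<div class="tile">
  <h4 class="de-ProductTile-title"> Product B </h4>
  <p class="de-ProductTile-title">Not an h4</p>
</div>
"""
tiles_soup = BeautifulSoup(tiles_html, "html.parser")

# Step 1: keep only <h4> tags that also carry the target class
title_tags = tiles_soup.find_all(name="h4", class_="de-ProductTile-title")

# Step 2: extract the text, trimming surrounding whitespace
product_names = [tag.get_text(strip=True) for tag in title_tags]

print(product_names)
```

Filtering on both tag and class skips the `<h4>` with a different class as well as the `<p>` that happens to share the class name.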

<h1 style="color: #00BFFF;">02 | Extracted Data</h1>

In [38]:
# conda or pip install lxml

<h1 style="color: #00BFFF;">CSS Selectors</h1>

CSS selectors are patterns used to select and manipulate one or more elements in an HTML or XML document. When web scraping with Python, CSS selectors can be used to target specific elements of interest within the page's content.

The `select` method in BeautifulSoup allows you to pass a CSS selector and returns a list of elements matching that selector.

1. **Tag Selector**: Targets elements by their tag name.
   - `p`: selects all `<p>` elements.
   - `soup.select("p")` will retrieve all `<p>` elements

2. **Class Selector**: Targets elements by their class attribute.
   - `.classname`: selects all elements with `class="classname"`.
   - If the `class` attribute contains several space-separated class names, join them with `.` in the selector (for example, `class="a b"` becomes `.a.b`).
   - `soup.select(".classname")`
   - To combine both, we can have `soup.select("tagname.classname")`

3. **Descendant Selector**: Targets an element that is a descendant of another element.
   - `div p`: selects all `<p>` elements inside a `<div>` element.
   - `.class1 .class2`: selects all elements with class2 that is a descendant of an element with class1.
   
4. **Attribute Selector**: Targets elements based on their attributes and values.
   - `a[href]`: selects all `<a>` elements with an `href` attribute.
   - `a[href="https://www.example.com"]`: selects all `<a>` elements with an `href` value of "https://www.example.com".
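
The four selector types above can be tried on a small invented page (a sketch; the HTML and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

page = """
<div class="card featured">
  <p class="intro">Intro text</p>
  <a href="https://www.example.com">Home</a>
  <a>No link</a>
</div>
<p>Outside paragraph</p>
"""
css_soup = BeautifulSoup(page, "html.parser")

all_p = css_soup.select("p")                        # tag selector
by_class = css_soup.select(".intro")                # class selector
two_classes = css_soup.select("div.card.featured")  # tag + two class names
descendants = css_soup.select("div p")              # <p> inside a <div>
with_href = css_soup.select("a[href]")              # <a> that has an href
exact_href = css_soup.select('a[href="https://www.example.com"]')

# select_one() returns the first match (or None) instead of a list
first_match = css_soup.select_one("div p")

print(len(all_p), len(descendants), len(with_href))
print(first_match.get_text())
```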

![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExcGo0MzlvbzI1cHFtZXo2eDNucHo4NGd4dzUzNXFxeWhvbDd4czdiMyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/9O2h5v3hqeSGNmKgDV/giphy.gif)

#### Using css selectors

One. Step. At a time:

- Let's learn first the syntax of css selectors playing this game: https://flukeout.github.io/

**Everyone should reach level 6!**

<h1 style="color: #00BFFF;">Level 3: Web Scraping with Selenium (Advanced)</h1>

![legtsgo](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExbzQ4eDdobnZlenhtN3c5MndmcDZpMW4wdXZzZTcxaDl1Zmo2YWt3dSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/SpopD7IQN2gK3qN4jS/giphy.gif)

<h1 style="color: #00BFFF;">00 | Use case: Indeed Jobs</h1>

In [40]:
# 📚 Basic libraries
import pandas as pd

#❗New Libraries !
from bs4 import BeautifulSoup
import requests

In [41]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In [42]:
link = 'https://es.indeed.com/jobs?q=data+analyst+junior&l=&from=searchOnHP&vjk=ad48ee37bc54e63e'

In [43]:
requests.get(link)

<Response [403]>

### How to Solve a 403 Error

When you get a `403` status code in response to a web request, it means "Forbidden." The server understands your request, but it refuses to fulfill it. This is often a measure by websites to prevent web scraping or automated access.

Here's why you might get a `403 Forbidden` error:

1. **User-Agent**: Many websites block requests that don't have a standard web browser User-Agent. The default User-Agent of the `requests` library often gets blocked.
2. **Robots.txt**: This is a file websites use to guide web crawlers about which pages or sections of the site shouldn't be processed or scanned. Respect it.
3. **Rate Limiting**: Websites might block you if you make too many requests in a short period.

And more...

To solve it, try the following, starting from the user-agent:

1. **Change the User-Agent**:
   You can mimic a request from a web browser by setting a User-Agent header.
   ```python
   headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
   }
   response = requests.get(url, headers=headers)
   ```

2. **Use a Web Scraper Library**:
   Libraries like Scrapy or Selenium can help bypass restrictions, especially when JavaScript rendering is involved.

3. **Respect `robots.txt`**:
   Always check `https://www.example.com/robots.txt` (replace `example.com` with the website's domain) to see which URLs you're allowed to access.

4. **Rate Limiting**:
   Implement delays in your requests using `time.sleep(seconds)` to avoid hitting rate limits.

5. **Use Proxies or VPN**:
   Rotate IP addresses or use a VPN service if the server has blocked your IP.

6. **Sessions & Cookies**:
   Some websites might require maintaining sessions or handling cookies.
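
Points 3 and 4 can be combined in a few lines. The sketch below uses the standard library's `urllib.robotparser` on a made-up `robots.txt` so it runs offline; the rules, URLs and delay are assumptions chosen for illustration.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content. For a real site you could instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = robotparser.RobotFileParser()
parser.parse(robots_lines)

candidate_urls = [
    "https://www.example.com/jobs",
    "https://www.example.com/private/admin",
]

allowed_urls = []
for url in candidate_urls:
    if parser.can_fetch("*", url):
        allowed_urls.append(url)
        # A real scraper would call requests.get(url, headers=headers) here
        time.sleep(0.1)  # brief pause between requests; real crawls usually wait longer
    else:
        print("Skipping (disallowed by robots.txt):", url)

print(allowed_urls)
```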


<h1 style="color: #00BFFF;">00 | Advanced Use case: Linkedin Jobs using Selenium</h1>

In [44]:
# pip install selenium
# pip install webdriver-manager
# pip install selenium --upgrade

In [45]:
# 📚 Basic libraries
import pandas as pd
import os
import random
import re
from bs4 import BeautifulSoup

#❗New Libraries !
import time
from getpass import getpass # to type your password without displaying it on screen
# selenium libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys # needed for Keys.TAB, Keys.ENTER, Keys.END below
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException
import pathlib


### Code Breakdown

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import time
import random
from getpass import getpass
```


### Imports:
- **webdriver**: The main module from Selenium that allows controlling the browser.
- **Service**: Used to set up the ChromeDriver service.
- **ChromeDriverManager**: Manages and installs the ChromeDriver automatically.
- **By**: Provides methods for locating elements in a webpage (like IDs, classes, etc.).
- **time**, **random**: Used to add delays (for mimicking human behavior).
- **getpass**: Allows securely getting the password input without displaying it on the screen.

In [None]:
# ⚙️ Settings
pd.set_option('display.max_columns', None) # display all columns
import warnings
warnings.filterwarnings('ignore') # ignore warnings

<h1 style="color: #00BFFF;">01 | Data Extraction</h1>

In web scraping, **WebDriver** is a tool that automates browsers, allowing you to interact with web pages just like a human would. It’s commonly used in conjunction with **Selenium**, a popular browser automation library. 

The WebDriver works by controlling the browser (e.g., Chrome, Firefox) and simulating user actions such as clicking buttons, filling out forms, and navigating web pages.

In [None]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [None]:
# open the website
driver.get('https://www.linkedin.com/login/')

In [None]:
# Add email
email = input('Enter your email: ')

# Find email box
email_box = driver.find_element(By.ID, "username")

# Clear email box
email_box.clear()

# Input email into browser
email_box.send_keys(email)

# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [None]:
# Add password
password = getpass('Enter your password: ')

# Find password box
pass_box = driver.find_element(By.ID, "password")

# Clear password box
pass_box.clear()

# Input password into browser
pass_box.send_keys(password)

# Add sleeping time to mimic human behaviour
time.sleep(random.random() * 3)

In [None]:
# Find and click on the log-in button
login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
login.click()
time.sleep(random.random() * 3)

In [None]:
# Add exception handling
try:
    login = driver.find_element(By.CLASS_NAME, 'login__form_action_container')
    login.click()
    time.sleep(random.random() * 3)
except NoSuchElementException:
    print("Log-in already done!")
except Exception as e:
    print(repr(e))

<h2 style="color: #008080;">Final job position</h2>

In [None]:
# Go to job search bar
try:
    job_icon = driver.find_element(By.CSS_SELECTOR, "span[title='Jobs']")
    job_icon.click()
    time.sleep(random.random() * 3)
except ElementClickInterceptedException:
    print("Element not displayed by JS. Try zooming in or resizing the window")
except Exception as e:
    print(repr(e))

In [None]:
# Zooming in
driver.execute_script("document.body.style.zoom='200%'")

In [None]:
# Zooming out
driver.execute_script("document.body.style.zoom='67%'")

<h2 style="color: #008080;">Type of job</h2>

In [None]:
# Optional - Change window size
# driver.set_window_size(800, 600)

In [None]:
search_job = driver.find_elements(By.CLASS_NAME,'jobs-search-box__text-input')[0]
job = input('What job do you want to search for: ')
search_job.clear()
search_job.send_keys(job)
time.sleep(random.random() * 3)

# Go to the location tab
search_job.send_keys(Keys.TAB)

<h2 style="color: #008080;">Job location</h2>

In [None]:
location_box = driver.switch_to.active_element
location = input('Where do you want to search for jobs: ')
location_box.send_keys(location)
time.sleep(random.random() * 3)

In [None]:
# Now let's search
location_box.send_keys(Keys.ENTER)

In [None]:
# Maximize the window - useful to see all the elements as the page is dynamic
driver.maximize_window()

In [None]:
## Optional: you can also fullscreen the window
# driver.fullscreen_window()

<h2 style="color: #008080;">Scraping from HTML</h2>

Selenium can be quite slow, so we'd always want to check whether we can fetch our data directly using static web scraping tools (i.e. `requests`, `BeautifulSoup`, `scrapy`):

In [None]:
# Check if the source code contains the job listings
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser') # name the parser explicitly to avoid a warning
soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})

In [None]:
# Clean the list
job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
job_list_clean = [job.text.strip() for job in job_list_dirty]
job_list_clean
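
To see how the regex-based class filter behaves without a live browser session, here is a sketch on an invented `page_source`. The class-name patterns are the ones used in the cells above; the surrounding markup, job titles and company names are made up, and the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source
page_source = """
<ul>
  <li><a class="job-card-list__title--link ember-view"> Junior Data Analyst </a>
      <div class="artdeco-entity-lockup__subtitle ember-view"> Acme Corp </div></li>
  <li><a class="job-card-list__title--link"> Data Scientist </a>
      <div class="artdeco-entity-lockup__subtitle"> Globex </div></li>
  <li><div class="job-card-container__footer">Promoted</div></li>
</ul>
"""
demo_soup = BeautifulSoup(page_source, "html.parser")

# A compiled regex matches any class name that contains the pattern,
# which helps when class names carry generated suffixes
title_pattern = re.compile(r"job-card-list__title")
company_pattern = re.compile(r"^artdeco-entity-lockup__subtitle")

jobs = [tag.text.strip() for tag in demo_soup.find_all(attrs={"class": title_pattern})]
companies = [tag.text.strip() for tag in demo_soup.find_all("div", attrs={"class": company_pattern})]

# Pair each job with its company, position by position
pairs = list(zip(jobs, companies))
print(pairs)
```

Note that `zip` pairs items purely by position, so if one list has a missing entry the pairs after it will be misaligned.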

In [None]:
# Do the same for the company
job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
job_company_clean = [company.text.strip() for company in job_company_dirty]
job_company_clean

In [None]:
# Make it into a dataset
data = zip(job_list_clean, job_company_clean)
df = pd.DataFrame(data, columns=['Job', 'Company'])
df

In [None]:
# Great, let's now create a function out of this:
def get_job_postings(driver, page):

     # Zoom in 100% to ensure all HTML is loaded
     driver.execute_script("document.body.style.zoom='100%'")

     # Go to bottom of page to retrieve all job postings
     page.send_keys(Keys.END)
     page.send_keys(Keys.CONTROL + Keys.HOME) # combination of the two keys brings you to the top of the element

     # Parse HTML
     html = driver.page_source
     soup = BeautifulSoup(html, 'html.parser')

     # Get jobs
     job_list_dirty = soup.find_all(attrs={'class': re.compile(r'job-card-list__title')})
     job_list_clean = [job.text.strip() for job in job_list_dirty]

     # Get companies
     job_company_dirty = soup.find_all('div', attrs={'class': re.compile(r'^artdeco-entity-lockup__subtitle')})
     job_company_clean = [company.text.strip() for company in job_company_dirty]

     # Convert data into a dataframe
     data = zip(job_list_clean, job_company_clean)
     return pd.DataFrame(data, columns=['Job', 'Company'])

In [None]:
page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
get_job_postings(driver, page)

In [None]:
# Get a list with the buttons in the page
def get_buttons(page):
    buttons = []
    for button in page.find_elements(By.XPATH, "//ul/li/button"):
        try:
            int(button.text)
            buttons.append(button)
        except ValueError:  # button text is not a page number
            pass
    return buttons

In [None]:
# Get the number of pages to scrape
current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
buttons = get_buttons(current_page)

<h2 style="color: #008080;">Final DataFrame</h2>

In [None]:
# Loop through pages and save results in a dataframe
df = pd.DataFrame()
driver.execute_script("document.body.style.zoom='100%'")

for i in range(len(buttons)):
    # Printing the button number for debugging purposes
    print(i)

    # Extract posts from current page
    current_page = driver.find_element(By.CSS_SELECTOR,"a[class^='disabled ember-view']")
    postings = get_job_postings(driver, current_page)

    # Refresh button list (if you don't the code will throw an exception.. trust me I spent half an hour debugging it)
    current_buttons = get_buttons(current_page)

    # Add to dataframe
    df = pd.concat([df, postings], axis=0)

    # Go to the next page
    current_buttons[i].click()

In [None]:
# Check dataframe
df.drop_duplicates()

<h2 style="color: #008080;">Summary</h2>

It's always recommended to check for the availability of an **API** before resorting to web scraping for the following reasons:
 * It is generally much easier to use
 * APIs are usually well-documented
 * Utilizing APIs is often preferred by server administrators

Refer to the `robots.txt` file on a website (by visiting `www.example.com/robots.txt`) to understand the server's guidelines and limitations regarding web scraping.

1. **Web Technologies**:
   - **HTML**: This is the standard markup language that holds the content of the webpage. It is the primary target when we engage in web scraping.
   - **CSS**: Cascading Style Sheets are used to describe the look and formatting of a document written in HTML.
   - **JavaScript**: This is a scripting language used to create interactive and dynamic website content.

2. **HTML Structure**:
   - **Hierarchical**: HTML documents are structured hierarchically, meaning elements are nested within other elements, forming a tree-like structure.
   - **Tags**: These are the building blocks of HTML, defining elements that hold different types of content.
   - **Attributes**: HTML tags can have attributes, which define properties of an element and are used to set various characteristics such as class, ID, and style.

3. **Web Scraping Tools**:
   - **Requests**: A Python library that allows you to send HTTP requests to get the HTML content of a webpage.
   - **Beautiful Soup**: A Python library that facilitates the programmatic analysis of HTML, helping in parsing the HTML and navigating the parse tree.
   - **Selenium**: In cases where the webpage content is dynamic and generated using JavaScript, tools like Selenium are often used. Selenium can interact with JavaScript to load dynamic content, making it accessible for scraping.
   
4. **Finding and Selecting Elements**:
   - **Selection by Tag, Class, and ID**: We can find elements using various attributes such as their tag name, class name, or ID.
   - **CSS Selectors**: These are patterns for more precise selections, using the relationships between elements (such as nesting) as well as their tags, classes, and attributes.


<h2 style="color: #008080;">Further materials</h2>

[Web archive](http://web.archive.org/): Find the historical state of webpages in the past!