# Advanced Web Scraping with Python for Social Scientists

<div style="text-align: right;">
<b>Melih Can Yardı</b><br>
Researcher @Politus Analytics<br>
24.12.2024
</div>

## Table of Contents
### 1. Introduction to Web Scraping
- What is web scraping?
- Ethical considerations and legal aspects (`robots.txt`, Terms of Service).
- Differences between APIs and scraping.

### 2. Understanding the Web
- Basic structure of a webpage: HTML, CSS, JavaScript.
- Deep Dive into HTML

### 3. Static Web Scraping
- Making HTTP requests using `requests`.
    - HTTP status codes: `200`, `404`, `500`
- Parsing HTML with `Beautiful Soup`.
    - Selecting elements with Beautiful Soup:
      - `find`
      - `find_all`
- **Example 1:** Collect information about countries

### 4. Dynamic Web Scraping
- Introduction to JavaScript-rendered pages.
- Using Selenium for web scraping.
- Setting up a Chrome browser.
- **Example 2:** Collect Oscar Winning Films
- **Example 3:** Collect Hockey Teams

### 5. Q&A and Further Learning
- Q&A session.
- Sharing resources for further learning.

### Further Resources:
- [Data Collection from the Web using APIs and web scraping techniques](https://github.com/ahurriyetoglu/text-processing-for-social-sciences/blob/main/practical_sessions/1-Data_Collection/Practical_Session-Data_Collection.ipynb)
- [Requests Documentation](https://requests.readthedocs.io/en/latest/)
- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium with Python Documentation](https://selenium-python.readthedocs.io/)

# 1. Introduction to Web Scraping

---

## What is Web Scraping?
- **Web scraping** is the automated process of extracting data from websites. It involves sending HTTP requests to a web server, parsing the returned HTML, and extracting the desired information.
---

## Ethical Considerations and Legal Aspects
As researchers, it is essential to ensure that scraping is done responsibly and within legal boundaries.

- **Terms of Service (ToS)**: Many websites include restrictions against scraping in their ToS. Violating these could lead to legal consequences. Examples:
    - [X's Terms of Service](https://x.com/en/tos)
    - [Reddit's User Agreement](https://redditinc.com/policies/user-agreement)
- **robots.txt**:
    - A `robots.txt` file specifies parts of a website that are restricted for bots.
    - Check this file before scraping to see which paths the site owner asks bots to avoid.
    - Example of a `robots.txt` file:
      ```
      User-agent: *
      Disallow: /private/
      Allow: /public/
      ```
    - Examples:
        - [X's robots.txt file](https://x.com/robots.txt)
        - [Reddit's robots.txt file](https://www.reddit.com/robots.txt)
    - Note: `robots.txt` is not legally binding but indicates the site owner's preference.

- **Best Practices**:
  - Use scraping tools responsibly, adhering to the website's policies.
  - Attribute the source of the data when sharing or publishing your research.
  - Clearly indicate how the collected data will be used, especially for research purposes.
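
The `robots.txt` check described above can be automated with Python's standard library. The following is a minimal sketch, assuming the three example rules shown earlier; the bot name `my-research-bot` and the `example.com` URLs are hypothetical placeholders.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, supplied as text so that no network access is needed
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical bot name and URLs, used only for illustration
can_public = parser.can_fetch("my-research-bot", "https://example.com/public/page.html")
can_private = parser.can_fetch("my-research-bot", "https://example.com/private/data.html")

print(can_public)   # True
print(can_private)  # False
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a string.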

---

## Differences Between APIs and Scraping
Web scraping and APIs are both methods for collecting data, but they differ in approach, use cases, and limitations.

- **Web Scraping**:
  - Involves extracting data from HTML content on webpages.
  - Requires navigating and parsing the structure of the website (e.g., HTML, CSS).
  - Useful when an API is unavailable or provides limited access to the required data.

- **APIs (Application Programming Interfaces)**:
  - APIs are provided by websites or services to allow developers to access structured data directly.
  - Easier and more reliable than scraping (no need to parse HTML).
  - Often come with rate limits and require an API key or authentication.
  - Examples: [Twitter API Documentation](https://developer.x.com/en/docs/x-api), [Reddit API](https://www.reddit.com/dev/api/).

- **Comparison Table**:
  | Feature                | Web Scraping                 | APIs                        |
  |------------------------|------------------------------|-----------------------------|
  | **Ease of Use**        | More complex (requires HTML parsing) | Easier (structured data provided) |
  | **Data Access**        | Any public webpage           | Limited to API endpoints    |
  | **Rate Limits**        | Depends on owners' preferences | Enforced by API providers  |
  | **Reliability**        | Prone to break if site changes | More stable                |
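
To make the "Ease of Use" row concrete, here is a small sketch that builds the same record twice: once from a JSON string, as an API might return it, and once from an HTML fragment. Both inputs are invented for illustration and do not come from a real service.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical API response: the data is already structured
api_response = '{"name": "Andorra", "capital": "Andorra la Vella"}'
record_from_api = json.loads(api_response)

# Hypothetical HTML fragment: the same data has to be located and extracted
html_page = """
<div class="country">
  <h3 class="country-name">Andorra</h3>
  <span class="country-capital">Andorra la Vella</span>
</div>
"""
soup = BeautifulSoup(html_page, "html.parser")
record_from_html = {
    "name": soup.find("h3", class_="country-name").get_text(strip=True),
    "capital": soup.find("span", class_="country-capital").get_text(strip=True),
}

print(record_from_api == record_from_html)  # True
```

The scraping version depends on tag and class names, which is why it tends to break when the site layout changes.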


# 2. Understanding the Web

The basic building blocks of a webpage: **HTML**, **CSS**, and **JavaScript**.

---

## HTML (HyperText Markup Language)
- **Definition**: HTML provides the structure and content of a webpage.
- **Example**:
```html
<h1>Welcome to My Page</h1>
<p>This is a paragraph of text.</p>
<a href="https://example.com">Click here to visit Example</a>
```

---

## CSS (Cascading Style Sheets)
- **Definition**: CSS controls the appearance and layout of HTML elements.
- **Example**:
```html
<style>
  h1 {
    color: blue;
    font-size: 24px;
  }
</style>
<h1>Styled Heading</h1>
```

---

## JavaScript
- **Definition**: JavaScript adds interactivity and dynamic behavior to webpages.
- **Example**:
```html
<script>
  function greet() {
    alert("Hello, World!");
  }
</script>
<button onclick="greet()">Click Me</button>
```
  

---

## Deep Dive into HTML
Since we will parse HTML, understanding its elements is crucial for web scraping.

### Common HTML Tags
- `<div>`: A container for other elements, often used for layout and grouping.
- `<span>`: Inline container for text or other elements.
- `<a>`: Defines a hyperlink (e.g., `<a href="https://example.com">Link</a>`).
- `<img>`: Embeds images (e.g., `<img src="image.jpg" alt="Description">`).
- `<table>`, `<tr>`, `<td>`: Used for tabular data.
- `<ul>`, `<ol>`, `<li>`: Defines lists.

---

### HTML Attributes
Attributes provide additional information about elements and are key for selecting specific parts of a page.
- **`id`**: Unique identifier for an element (e.g., `<div id="header">`).
- **`class`**: Groups elements for styling or selection (e.g., `<p class="text">`).
- **`href`**: URL for links (e.g., `<a href="https://example.com">Link</a>`).
- **`src`**: Source for images or scripts (e.g., `<img src="image.jpg">`).
- **`alt`**: Alternative text for images (e.g., `<img src="image.jpg" alt="Description">`).

---

### HTML Hierarchy and Nesting
HTML documents are structured as a hierarchy. Elements can be nested within one another, creating a tree-like structure.
- Example:
  ```html
  <div class="container">
    <h1>Title</h1>
    <p>This is a paragraph inside the container.</p>
  </div>
  ```
- **Parent**: `<div>` is the parent of `<h1>` and `<p>`.
- **Child**: `<h1>` and `<p>` are children of `<div>`.
- **Siblings**: `<h1>` and `<p>` are siblings.
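
These relationships can be explored directly with Beautiful Soup. The sketch below parses the snippet above and moves between parent, children, and siblings; it is one possible way to inspect the tree, not the only one.

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
  <h1>Title</h1>
  <p>This is a paragraph inside the container.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1")

# Parent: the element that directly contains <h1>
parent_name = h1.parent.name

# Children: tags directly inside the <div> (recursive=False skips deeper levels)
child_names = [child.name for child in soup.find("div").find_all(recursive=False)]

# Sibling: the next tag at the same level as <h1>
sibling_name = h1.find_next_sibling().name

print(parent_name, child_names, sibling_name)  # div ['h1', 'p'] p
```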

### HTML Selectors
Selectors are used to locate and extract elements from a webpage.

- **By ID**: Use the unique `id` of an element.
  - Example: `#header` selects an element like `<div id="header">`.
- **By Class**: Use the `class` attribute to select groups of elements.
  - Example: `.text` selects all elements with `class="text"`.
- **By Tag Name**: Select all elements of a specific type.
  - Example: `p` selects all `<p>` (paragraph) tags.
- **Combining Selectors**: Use multiple selectors for precision.
  - Example: `div.container > p.text` selects `<p>` tags with `class="text"` that are direct children of a `<div>` with `class="container"`.
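
Beautiful Soup accepts these CSS selectors through its `select` and `select_one` methods. The sketch below applies each selector from the list to a small HTML fragment written for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment covering each selector type
html = """
<div id="header">Site header</div>
<div class="container">
  <p class="text">Direct child paragraph</p>
  <section><p class="text">Nested paragraph</p></section>
  <p>Paragraph without a class</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

header_text = soup.select_one("#header").get_text()    # by ID
by_class = soup.select(".text")                        # by class
by_tag = soup.select("p")                              # by tag name
direct_children = soup.select("div.container > p.text")  # combined selector

print(header_text)           # Site header
print(len(by_class))         # 2
print(len(by_tag))           # 3
print(len(direct_children))  # 1
```

The combined selector matches only one paragraph because the nested one sits inside `<section>`, so it is not a direct child of the container.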

---

### Real-World HTML Example
```html
<div id="main">
  <h2 class="title">Article Title</h2>
  <p class="author">By John Doe</p>
  <ul class="tags">
    <li>Python</li>
    <li>Web Scraping</li>
    <li>Data Science</li>
  </ul>
</div>
```
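
As a preview of the next section, the following sketch extracts the title, author, and tags from the example above using `find` and `find_all`. The variable names are illustrative choices.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h2 class="title">Article Title</h2>
  <p class="author">By John Doe</p>
  <ul class="tags">
    <li>Python</li>
    <li>Web Scraping</li>
    <li>Data Science</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match; find_all returns every match
main = soup.find("div", id="main")
title = main.find("h2", class_="title").get_text(strip=True)
author = main.find("p", class_="author").get_text(strip=True)
tags = [li.get_text(strip=True) for li in main.find("ul", class_="tags").find_all("li")]

print(title)   # Article Title
print(author)  # By John Doe
print(tags)    # ['Python', 'Web Scraping', 'Data Science']
```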


# 3. Static Web Scraping

- **`requests`**: A library for making HTTP requests to fetch webpage content.
  - **HTTP Request**: A message that a client (e.g., your Python script) sends to a server, asking it to return specific data or perform an action. Common HTTP request methods include:
    - **GET**: Retrieve data from a server (e.g., fetch a webpage).
    - **POST**: Submit data to a server (e.g., send form data).
  - **HTTP Status Codes**: Indicate the result of the request:
    - `200`: Success. The request succeeded, and the content is available.
    - `404`: Not Found. The requested resource could not be found.
    - `500`: Server Error. The server encountered an error while handling the request.
- **`Beautiful Soup`**: A library for parsing HTML and XML documents, making it easy to navigate and extract webpage elements.
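
Status codes fall into ranges, which is handy when deciding whether to parse a response or retry. The helper below is a hypothetical sketch (`describe_status` is not part of `requests`) that uses the standard library's `HTTPStatus` to label a code; in practice you could pass it `response.status_code`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    if 200 <= code < 300:
        category = "success"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"

for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```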

### Example 1: Collect information about countries

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Send a GET request to the specified URL
url = 'https://www.scrapethissite.com/pages/simple/'
response = requests.get(url)

In [3]:
# Check status code
response.status_code

200

In [4]:
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

In [5]:
# Find all country blocks in the page
country_elements = soup.find_all('div', class_='col-md-4 country')

In [6]:
countries = []

# Iterate over each country block
for country in country_elements:
    # Extract the country name
    name = country.find('h3', class_='country-name').get_text(strip=True)
    # Extract the capital
    capital = country.find('span', class_='country-capital').get_text(strip=True)
    # Extract the population
    population = country.find('span', class_='country-population').get_text(strip=True)
    # Extract the area
    area = country.find('span', class_='country-area').get_text(strip=True)
    # Save extracted info into a dictionary
    country_item = {"name":name, "capital": capital, "population": population, "area": area}
    # Append dictionary to countries list
    countries.append(country_item)

In [7]:
# Create a Pandas DataFrame from countries list
countries_df = pd.DataFrame(countries)

In [8]:
# Display countries DataFrame
countries_df.head(3)

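|   | name                 | capital          | population | area     |
|---|----------------------|------------------|------------|----------|
| 0 | Andorra              | Andorra la Vella | 84000      | 468.0    |
| 1 | United Arab Emirates | Abu Dhabi        | 4975593    | 82880.0  |
| 2 | Afghanistan          | Kabul            | 29121286   | 647500.0 |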


In [9]:
# Optional: Save countries DataFrame as Excel
#countries_df.to_excel("countries.xlsx", index=False)
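
Because `get_text` returns strings, numeric columns such as population and area are text until they are converted. The sketch below shows one way to do this with `pd.to_numeric`, using a hand-typed copy of the three rows displayed above rather than a fresh scrape; `sample_df` is an illustrative name.

```python
import pandas as pd

# Hand-typed copy of the first three scraped rows; all values are strings
sample_rows = [
    {"name": "Andorra", "capital": "Andorra la Vella", "population": "84000", "area": "468.0"},
    {"name": "United Arab Emirates", "capital": "Abu Dhabi", "population": "4975593", "area": "82880.0"},
    {"name": "Afghanistan", "capital": "Kabul", "population": "29121286", "area": "647500.0"},
]
sample_df = pd.DataFrame(sample_rows)

# Convert text columns to numbers so they can be summed, sorted, or plotted
sample_df["population"] = pd.to_numeric(sample_df["population"])
sample_df["area"] = pd.to_numeric(sample_df["area"])

total_population = int(sample_df["population"].sum())
print(total_population)  # 34180879
```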

# 4. Dynamic Web Scraping

Some webpages are rendered dynamically with JavaScript, so their content may be missing from the HTML source that `requests` fetches. In such cases, we can use tools like **`Selenium`** to interact with and scrape these pages.

---

## Introduction to JavaScript-Rendered Pages
- Unlike static pages, dynamic pages use JavaScript to load or modify content after the initial HTML is loaded.
- Example: Data tables that load as you scroll or elements that appear after clicking a button.
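
The sketch below illustrates why a static parser comes up empty on such pages. The HTML is a simplified, hypothetical version of what a server might send before any JavaScript runs: the table exists, but its body has no rows yet.

```python
from bs4 import BeautifulSoup

# Hypothetical initial HTML: rows are meant to be added later by JavaScript
static_html = """
<table class="table">
  <thead><tr><th>Title</th></tr></thead>
  <tbody id="table-body"></tbody>
</table>
<script>/* JavaScript would request the rows and insert them into #table-body */</script>
"""
soup = BeautifulSoup(static_html, "html.parser")

# Beautiful Soup does not execute scripts, so the body stays empty
rows = soup.find("tbody").find_all("tr")
print(len(rows))  # 0
```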

---

## Using Selenium for Web Scraping
- **Selenium**: A browser automation tool with Python bindings, useful for interacting with JavaScript-rendered pages.
- **WebDriver**: A tool within Selenium that acts as a bridge between Python and the browser.

---

## Setting Up Selenium with Chrome
- Option 1: Manually Set Up Chrome WebDriver
- Option 2: Automatically Set Up Chrome WebDriver (`webdriver_manager`)

### Option 1: Manually Set Up Chrome WebDriver

To manually set up the Chrome WebDriver, follow these steps:

1. Go to the [Chrome for Testing Downloads](https://googlechromelabs.github.io/chrome-for-testing/).
2. Select the appropriate file for your operating system (e.g., Windows, macOS, Linux).
3. Download and unzip the file.
4. Note the path to the unzipped `chromedriver` executable.
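5. Update your script to pass the path to the executable to `Service`. The example below uses the Windows file name `chromedriver.exe`; on macOS and Linux the file is named `chromedriver`: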

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()

service = Service(executable_path="chromedriver.exe")
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.maximize_window()
```

### Option 2: Automatically Set Up Chrome WebDriver (Recommended)

To automatically set up the Chrome WebDriver, follow these steps:

1. Install the `webdriver_manager` library:
   ```bash
   pip install webdriver_manager
   ```
2. Import `ChromeDriverManager` from `webdriver_manager.chrome`.
3. Set the `Service` parameter using ChromeDriverManager:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

# Set up Chrome WebDriver using webdriver_manager
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)
```

### Example 2: Collect Oscar Winning Films

In [10]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

In [11]:
# Set up the WebDriver
chrome_options = Options()

service = Service(executable_path="chromedriver.exe")
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.maximize_window()

In [12]:
# Navigate to the webpage
url = "https://www.scrapethissite.com/pages/ajax-javascript/"
driver.get(url)

In [13]:
# Create an explicit wait (up to 10 seconds); it only blocks when wait.until(...) is called
wait = WebDriverWait(driver, 10)

# Find all year buttons (the options below are alternatives; each assignment overwrites the previous one)
# Option 1: Find items by id attribute
year_ids = ["2015", "2014", "2013", "2012", "2011", "2010"]
year_buttons = [driver.find_element(By.ID, year_id) for year_id in year_ids]

# Option 2: Find items by class attribute
year_buttons = driver.find_elements(By.CLASS_NAME, "year-link")

# Option 3 (advanced): Find items by XPATH or CSS Selector
year_buttons = driver.find_elements(By.XPATH, "//a[@class='year-link']")
year_buttons = driver.find_elements(By.CSS_SELECTOR, ".year-link")

In [14]:
movies = []

# Loop through each year button
for button in year_buttons:
    # Click the year button
    button.click()
    time.sleep(3)
    
    # Wait until at least one film row is present in the table
    wait.until(EC.presence_of_element_located((By.CLASS_NAME, "film-title")))
    
    # Extract movie details
    movie_elements = driver.find_element(By.TAG_NAME, "tbody").find_elements(By.TAG_NAME, "tr")
    for movie in movie_elements:
        title = movie.find_element(By.CLASS_NAME, "film-title").text
        nominations = movie.find_element(By.CLASS_NAME, "film-nominations").text
        awards = movie.find_element(By.CLASS_NAME, "film-awards").text
        best_picture = "Yes" if movie.find_element(By.CLASS_NAME, "film-best-picture").find_elements(By.TAG_NAME, "i") else "No"
        
        # Append to the movies list
        movies.append({
            "title": title,
            "nominations": nominations,
            "awards": awards,
            "best_picture": best_picture
        })
    
    # Wait a moment before clicking the next button (to prevent issues with JavaScript execution)
    time.sleep(1)

In [15]:
# Create a Pandas DataFrame from collected movies list
movies_df = pd.DataFrame(movies)

In [16]:
# Display movies DataFrame
movies_df.head(3)

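|   | title              | nominations | awards | best_picture |
|---|--------------------|-------------|--------|--------------|
| 0 | Spotlight          | 6           | 2      | Yes          |
| 1 | Mad Max: Fury Road | 10          | 6      | No           |
| 2 | The Revenant       | 12          | 3      | No           |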


In [17]:
# Optional: Save movies DataFrame as Excel
#movies_df.to_excel("movies.xlsx", index=False)

In [18]:
# Close the browser
driver.quit()

### Example 3: Collect Hockey Teams

In [19]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

In [20]:
# Set up driver
chrome_options = Options()

service = Service(executable_path="chromedriver.exe")
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.maximize_window()

In [21]:
# Navigate to the webpage
url = "https://www.scrapethissite.com/pages/forms/"
driver.get(url)

# Create an explicit wait (up to 10 seconds) to use when locating elements
wait = WebDriverWait(driver, 10)

In [22]:
# 1. Interact with the input search box and search for a team
search_box = wait.until(EC.presence_of_element_located((By.ID, "q")))
search_box.send_keys("Pittsburgh Penguins")  # Enter the team name

In [23]:
# Compound class names ("btn btn-primary") need a CSS selector rather than By.CLASS_NAME
search_button = driver.find_element(By.CSS_SELECTOR, ".btn.btn-primary")
search_button.click()

In [24]:
# 2. Handle pagination by clicking on the "Next" button
driver.get(url)
pagination = driver.find_element(By.CLASS_NAME, "pagination")
pagination_items = pagination.find_elements(By.TAG_NAME, "li")

next_button = driver.find_element(By.XPATH, "//a[@aria-label='Next']")
next_button.click()

In [25]:
# 3. Select an option from the dropdown menu ("Per Page")
dropdown = Select(wait.until(EC.presence_of_element_located((By.ID, "per_page"))))
dropdown.select_by_visible_text("50")  # Select "50 per page"

In [26]:
# 4. Scroll down and up
# Scroll down the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for a moment to observe the scroll effect
time.sleep(2)

In [27]:
# Scroll up the page
driver.execute_script("window.scrollTo(0, 0);")

# Wait again to observe the scroll effect
time.sleep(2)

In [28]:
# Scroll down by 500 pixels
driver.execute_script("window.scrollBy(0, 500);")

# Wait for observation
time.sleep(2)

In [29]:
# Scroll up by 300 pixels
driver.execute_script("window.scrollBy(0, -300);")

# Wait for observation
time.sleep(2)

In [30]:
# Close the browser
driver.quit()

# 5. Further Learning

To deepen your understanding of web scraping and related topics, here are some suggestions:

---

## 5.1. Understanding APIs
- Learn to use APIs for data collection when available, as an alternative to web scraping.
  - Examples: Twitter API, Reddit API, Spotify API

---

## 5.2. Inspecting Network Requests
- Use the **Network** tab in the browser's developer tools to monitor and analyze network activity.
  - **AJAX Requests**: Identify dynamic data-loading requests.
  - **Request/Response Headers**: Understand the information exchanged between client and server.
  - **Filtering**: Use filters like `XHR`, `JS`, or `Doc` to isolate specific requests.
  - **Preview/Response Tab**: Examine JSON or other data returned by APIs.
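
Once the Network tab reveals the request behind a page, the data can often be fetched and parsed as JSON directly. The sketch below assumes that the Oscar page from Example 2 loads its data through `ajax` and `year` query parameters and that each record has `title` and `awards` fields; both are assumptions to verify in your own browser, and the sample payload is hand-written.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.scrapethissite.com/pages/ajax-javascript/"

def build_ajax_url(year):
    """Build the request URL (query parameter names are assumed, not confirmed)."""
    return f"{BASE_URL}?{urlencode({'ajax': 'true', 'year': year})}"

def parse_films(payload):
    """Turn a JSON payload into a list of simplified film records."""
    return [{"title": film["title"], "awards": film["awards"]} for film in json.loads(payload)]

# Hand-written sample payload standing in for a real response
sample_payload = '[{"title": "Spotlight", "year": 2015, "awards": 2, "nominations": 6, "best_picture": true}]'

ajax_url = build_ajax_url(2015)
films = parse_films(sample_payload)
print(ajax_url)
print(films)  # [{'title': 'Spotlight', 'awards': 2}]
```

With a real response, `requests.get(ajax_url).text` would take the place of `sample_payload`.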

---

## 5.3. Handling Advanced Web Scraping Challenges
- **Pagination**: Automate scraping across multi-page websites using "Next" buttons or page links.
- **Infinite Scrolling**: Learn to simulate scrolling with Selenium or fetch data using AJAX requests.
- **CAPTCHA Handling**: Bypassing CAPTCHAs is ethically debatable; know the basic approaches and the alternatives, such as solving them manually.
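
A common pagination pattern is to request pages one by one until a page comes back empty. The sketch below shows that loop in isolation; `scrape_all_pages` and `fake_fetch` are hypothetical names, and the fake site stands in for a function that would fetch and parse a real page.

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Collect rows page by page until an empty page or max_pages is reached."""
    results = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_page(page_number)
        if not rows:  # an empty page signals the end
            break
        results.extend(rows)
    return results

# Stand-in for a real site: two pages of data, then nothing
fake_site = {1: ["Boston Bruins", "Buffalo Sabres"], 2: ["Calgary Flames"]}

def fake_fetch(page_number):
    return fake_site.get(page_number, [])

teams = scrape_all_pages(fake_fetch)
print(teams)  # ['Boston Bruins', 'Buffalo Sabres', 'Calgary Flames']
```

The `max_pages` cap is a safety net so that a site which never returns an empty page cannot cause an endless loop.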

---

## 5.4. Deploying Scraping Scripts
- Automate scraping tasks using schedulers:
  - **Linux**: Use `cron` jobs.
  - **Windows**: Use Task Scheduler.

---

## 5.5. Debugging and Logging
- **Debugging Tools**:
  - Use browser developer tools to troubleshoot issues in the webpage structure.
  - Leverage Python debugging tools like `pdb` for your scripts.
- **Logging**:
  - Add logs to your scripts using Python’s `logging` module to monitor scraping activity and identify errors.
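
As a minimal sketch of the logging advice above, the function below records each success and failure while continuing past errors. `scrape_many` and `fake_fetch` are hypothetical names, and the failure is simulated so that the example runs without network access.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("scraper")

def scrape_many(urls, fetch):
    """Call fetch(url) for each URL, logging successes and failures."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
            logger.info("Fetched %s", url)
        except Exception:
            # logger.exception also records the traceback
            logger.exception("Failed to fetch %s", url)
    return results

def fake_fetch(url):
    """Stand-in for a real download; fails for URLs containing 'broken'."""
    if "broken" in url:
        raise ValueError("simulated failure")
    return f"<html>{url}</html>"

pages = scrape_many(["https://example.com/ok", "https://example.com/broken"], fake_fetch)
print(list(pages))  # ['https://example.com/ok']
```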

---