---
**Author**: Malo Jan  
**Date**: 2024-01-06  
**Project**: Scraping party press releases

**Description**: This notebook scrapes links to press releases from the website of the French political party La France Insoumise. 

---

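This script uses Selenium to automate the following tasks:
1. Set up a Selenium WebDriver instance (using Firefox in this case).
2. Navigate to the "Communiqués" page of the La France Insoumise website.
3. Click the "See More" button repeatedly until it is no longer available.
4. Collect the press-release links into a pandas DataFrame.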

Requirements:
- Selenium and pandas libraries installed (`pip install selenium pandas`).
- Firefox browser and the corresponding GeckoDriver executable available in your system PATH.
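
Before launching the browser, it can help to confirm that the driver executable is actually reachable. The sketch below uses the standard-library `shutil.which` to look up an executable on the PATH. The helper name `find_driver` is hypothetical and not part of Selenium.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


# Report whether GeckoDriver can be found before starting Selenium
driver_path = find_driver()
if driver_path is None:
    print("geckodriver was not found on PATH")
else:
    print(f"geckodriver found at {driver_path}")
```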

Modules and Functions:
- webdriver: Provides the WebDriver class to interact with the browser.
- By: Used to locate elements on the page.
- WebDriverWait: Implements explicit waits for certain conditions.
- EC: Contains conditions like element presence, visibility, or clickability.
- TimeoutException: Handles timeouts when waiting for elements.
- time: Used for adding delays between actions.



In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time



# Step 1: Set up Selenium WebDriver
# Ensure GeckoDriver is installed, in your PATH, and compatible with your Firefox version.
driver = webdriver.Firefox()


The geckodriver version (0.33.0) detected in PATH at /opt/homebrew/bin/geckodriver might not be compatible with the detected firefox version (134.0); currently, geckodriver 0.35.0 is recommended for firefox 134.*, so it is advised to delete the driver in PATH and retry


In [3]:


# Step 2: Navigate to the desired webpage
# The page contains news and announcements ("Communiqués") from La France Insoumise.
driver.get("https://lafranceinsoumise.fr/category/actualites/communiques/")

# Notes:
# - Replace "webdriver.Firefox()" with another browser's driver (e.g., webdriver.Chrome()) if needed.
# - Ensure the corresponding driver executable is correctly installed and configured.


In [4]:
def scroll_down(driver, max_retries=10):
    """
    Scrolls down a webpage by repeatedly clicking the "See More" button until the button is no longer available 
    or the maximum number of retries is reached.

    This function waits for the presence and clickability of the "See More" button, brings it into view,
    and attempts to click it. If the button is not found or an error occurs during interaction, the function
    retries up to `max_retries` times.

    Args:
        driver (WebDriver): The Selenium WebDriver instance controlling the browser.
        max_retries (int, optional): Maximum number of retries to handle intermittent issues 
                                     (default is 10).

    Error handling:
        No exception is propagated to the caller. A TimeoutException (the "See More"
        button is not found within the timeout period) ends the loop; any other error
        during the button interaction counts as a failed attempt and triggers a retry.

    Behavior:
        - Waits for the presence of the button using an explicit wait.
        - Scrolls the button into view using JavaScript.
        - Attempts to click the button.
        - Retries on failure up to the specified limit.

    Example:
        # Assuming 'driver' is a valid WebDriver instance
        scroll_down(driver, max_retries=5)
    """
    retries = 0

    while retries < max_retries:
        try:
            # Locate the "See More" button and wait until it is present
            see_more_button = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, '.e-loop__load-more .elementor-button-text'))
            )
            # Scroll the button into view and ensure it's interactable
            driver.execute_script("arguments[0].scrollIntoView(true);", see_more_button)
            WebDriverWait(driver, 5).until(EC.element_to_be_clickable(see_more_button))
            
            # Click the button
            see_more_button.click()
            time.sleep(3)  # Allow time for content to load
            retries = 0  # Reset retries on success
        except TimeoutException:
            # Exit loop if button is not found within the timeout
            break
        except Exception:
            # Count any other error (e.g., an intercepted click) as a failed attempt
            retries += 1
            time.sleep(1)  # Pause before retrying
    
    if retries >= max_retries:
        # Warn that loading stopped early, so the collected links may be incomplete
        print(f"Stopped after {max_retries} consecutive failed attempts.")

# Click "See More" until all press releases are loaded
scroll_down(driver)
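
The control flow of `scroll_down` can be followed in isolation, without a browser. The sketch below is an illustration only: `click_until_gone`, `click_once`, and `ButtonGone` are hypothetical names, with `ButtonGone` standing in for Selenium's `TimeoutException` and `click_once` for the wait, scroll, and click sequence.

```python
import time


class ButtonGone(Exception):
    """Stand-in for TimeoutException: the button no longer exists."""


def click_until_gone(click_once, max_retries=10, pause=0.0):
    """Call click_once() until it raises ButtonGone or fails
    max_retries times in a row. Returns the number of successful clicks."""
    clicks = 0
    retries = 0
    while retries < max_retries:
        try:
            click_once()
            clicks += 1
            retries = 0  # a success resets the failure counter
        except ButtonGone:
            break  # nothing left to load
        except Exception:
            retries += 1  # transient error: count it and try again
            time.sleep(pause)
    return clicks


# Simulated page: one transient failure, two successful clicks,
# then the button disappears.
events = iter(["error", "ok", "ok", "gone"])


def fake_click():
    event = next(events)
    if event == "error":
        raise RuntimeError("simulated transient failure")
    if event == "gone":
        raise ButtonGone()


total_clicks = click_until_gone(fake_click, max_retries=3)
print(total_clicks)  # prints 2
```

Because the counter is reset after every successful click, `max_retries` bounds consecutive failures rather than the total number of failures over the whole run.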


In [5]:
import pandas as pd
# Collect every link on the page that matches the CSS selector
elements = driver.find_elements(By.CSS_SELECTOR, '.elementor-size-default a')

# Extract the URL (href attribute) of each link
href_list = [element.get_attribute('href') for element in elements]

# Store the URLs in a DataFrame
df = pd.DataFrame({'href': href_list})
print(df)

                                                 href
0   https://lafranceinsoumise.fr/2025/01/07/reacti...
1   https://lafranceinsoumise.fr/2024/12/31/mayott...
2   https://lafranceinsoumise.fr/2024/12/30/le-doc...
3   https://lafranceinsoumise.fr/2024/12/14/cyclon...
4   https://lafranceinsoumise.fr/2024/12/11/soutie...
5   https://lafranceinsoumise.fr/2024/12/09/autoro...
6   https://lafranceinsoumise.fr/2024/12/08/laveni...
7   https://lafranceinsoumise.fr/2024/12/06/repons...
8   https://lafranceinsoumise.fr/2024/12/06/ue-mer...
9   https://lafranceinsoumise.fr/2024/12/03/mercos...
10  https://lafranceinsoumise.fr/2024/12/03/face-a...
11  https://lafranceinsoumise.fr/2024/12/02/dernie...
12  https://lafranceinsoumise.fr/2024/11/29/factur...
13  https://lafranceinsoumise.fr/2024/11/28/mur-de...
14  https://lafranceinsoumise.fr/2024/11/27/educat...
15  https://lafranceinsoumise.fr/2024/11/26/nous-v...
16  https://lafranceinsoumise.fr/2024/11/25/le-snu...
17  https://lafranceinsoumis
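
If the selector also matches links that are not press releases, or returns the same link more than once, a clean-up step may be useful. The sketch below assumes that press-release URLs always carry a `/YYYY/MM/DD/` date segment, as in the rows printed above; it keeps only such URLs and removes duplicates. The function name `keep_press_releases` and the example slugs are hypothetical.

```python
import re

# Assumption: press-release URLs contain a /YYYY/MM/DD/ segment.
DATE_SEGMENT = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def keep_press_releases(hrefs):
    """Return unique URLs that contain a date segment, in original order."""
    seen = set()
    kept = []
    for href in hrefs:
        # Skip links without an href and links already collected
        if href is None or href in seen:
            continue
        if DATE_SEGMENT.search(href):
            seen.add(href)
            kept.append(href)
    return kept


# Hypothetical input: a duplicate, a missing href, and a category page
example = [
    "https://lafranceinsoumise.fr/2024/12/03/example-slug/",
    "https://lafranceinsoumise.fr/2024/12/03/example-slug/",
    None,
    "https://lafranceinsoumise.fr/category/actualites/communiques/",
    "https://lafranceinsoumise.fr/2024/11/29/another-example/",
]
cleaned = keep_press_releases(example)
print(cleaned)
```

In this notebook the function could be applied to the collected list, for example `keep_press_releases(href_list)`, before building the DataFrame.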

In [34]:

# Save as csv

# df.to_csv("fra_lfi_urls.csv")