---
**Author**: Malo Jan  
**Date**: 2024-12-22  
**Project**: Scraping Presidential Speeches

**Description**: This notebook scrapes links to French presidential speeches from "La vie publique". It:

- Collects the URLs of speeches from Mitterrand to Macron
- Extracts the text data of each speech

---

This notebook introduces web scraping with Python using the Selenium library. Selenium drives a real browser, so you can visit a website, click on links, send keyboard input, and extract information from the HTML source code of a webpage. Unlike the rvest package in R, which by default parses the static HTML returned by the server, Selenium can also extract content from dynamic pages that are rendered with JavaScript.



To install the Selenium library, run the command below. If you run this notebook on your local machine, you only need to install the library once. If you run it on Google Colab, you need to install the library again each time you connect to a new runtime.

In [49]:
%%capture

!pip install selenium
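
If you would rather not reinstall on every local run, one option is to first check whether the package can already be imported. The sketch below uses only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package `package_name` can be imported."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("selenium"):
    print("selenium is already installed.")
else:
    print("selenium is missing: run `pip install selenium` first.")
```

On Colab the check will usually report that the package is missing after a new runtime starts, which matches the behaviour described above.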

In [1]:
from selenium import webdriver

# Start the WebDriver, which opens a Firefox browser window controlled by Selenium

browser = webdriver.Firefox()

The geckodriver version (0.33.0) detected in PATH at /opt/homebrew/bin/geckodriver might not be compatible with the detected firefox version (134.0); currently, geckodriver 0.35.0 is recommended for firefox 134.*, so it is advised to delete the driver in PATH and retry


If this does not work, here are instructions copied from the Python course of Rubing Shen, a former Medialab PhD student:

> **If you have the error** `WebDriverException: Message: 'geckodriver' executable needs to be in PATH.`, please follow these steps:
>
> 1. Go to https://github.com/mozilla/geckodriver/releases and download the `geckodriver` file corresponding to your operating system (for example, `geckodriver-v0.30.0-win64.zip` for Windows or `geckodriver-v0.30.0-macos.tar.gz` for macOS).
> 2. Unzip the downloaded file. Move the executable file `geckodriver` to the folder `anaconda3/condabin`, inside the folder `anaconda3` where you installed Anaconda. If you don't remember where you installed Anaconda, the command `!where conda` (on Windows; `!which conda` on macOS or Linux) will show the path of the folder `anaconda3`.
> 3. Once you have moved the executable file `geckodriver` into the folder `anaconda3/condabin`, try to run the code `browser = webdriver.Firefox()` again.
> 4. If you still have the same error, move the `geckodriver` file into the folder `bin` inside the folder `anaconda3`.
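
Before moving files around, it can help to check whether `geckodriver` is already visible on your PATH. The sketch below uses the standard-library function `shutil.which`; the helper name `locate_driver` is only an illustration. Selenium 4.6 and later also ship with Selenium Manager, which tries to fetch a matching driver automatically, so the manual steps may not be needed on a recent installation.

```python
import shutil


def locate_driver(name="geckodriver"):
    """Return the full path of `name` if it is on the PATH, otherwise None."""
    return shutil.which(name)


driver_path = locate_driver()
if driver_path is None:
    print("geckodriver is not on the PATH; follow the steps above.")
else:
    print(f"geckodriver found at {driver_path}")
```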

#### Collecting the URLs of French Presidential Speeches

In [2]:
urls = [f"https://www.vie-publique.fr/discours/recherche?search_api_fulltext_discours=&sort_by=field_date_prononciation_discour&field_intervenant_title=&field_intervenant_qualite=&field_date_prononciation_discour_interval[min]=&field_date_prononciation_discour_interval[max]=&field_type_emetteur[9340]=9340&form_build_id=form-0lIEiuE4R0BPL2Z9cNPox5p-k-YkmYjjhbfBahdmtI0&form_id=views_exposed_form&page={page}" for page in range(0, 999)]

# For this example, restrict the list to the first 5 pages. Remove this line if you want to download all pages

urls = urls[:5]

urls


['https://www.vie-publique.fr/discours/recherche?search_api_fulltext_discours=&sort_by=field_date_prononciation_discour&field_intervenant_title=&field_intervenant_qualite=&field_date_prononciation_discour_interval[min]=&field_date_prononciation_discour_interval[max]=&field_type_emetteur[9340]=9340&form_build_id=form-0lIEiuE4R0BPL2Z9cNPox5p-k-YkmYjjhbfBahdmtI0&form_id=views_exposed_form&page=0',
 'https://www.vie-publique.fr/discours/recherche?search_api_fulltext_discours=&sort_by=field_date_prononciation_discour&field_intervenant_title=&field_intervenant_qualite=&field_date_prononciation_discour_interval[min]=&field_date_prononciation_discour_interval[max]=&field_type_emetteur[9340]=9340&form_build_id=form-0lIEiuE4R0BPL2Z9cNPox5p-k-YkmYjjhbfBahdmtI0&form_id=views_exposed_form&page=1',
 'https://www.vie-publique.fr/discours/recherche?search_api_fulltext_discours=&sort_by=field_date_prononciation_discour&field_intervenant_title=&field_intervenant_qualite=&field_date_prononciation_discour
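
The long search URL is easier to read and modify when it is assembled from its query parameters. The sketch below rebuilds the same URL with the standard-library function `urllib.parse.urlencode`. The names `BASE_URL`, `SEARCH_PARAMS` and `build_page_url` are chosen for this illustration, and the parameter values are simply copied from the URL used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.vie-publique.fr/discours/recherche"

# Same query parameters as in the hand-written URL, in the same order.
SEARCH_PARAMS = {
    "search_api_fulltext_discours": "",
    "sort_by": "field_date_prononciation_discour",
    "field_intervenant_title": "",
    "field_intervenant_qualite": "",
    "field_date_prononciation_discour_interval[min]": "",
    "field_date_prononciation_discour_interval[max]": "",
    "field_type_emetteur[9340]": "9340",
    "form_build_id": "form-0lIEiuE4R0BPL2Z9cNPox5p-k-YkmYjjhbfBahdmtI0",
    "form_id": "views_exposed_form",
}


def build_page_url(page):
    """Return the search URL for one results page (pages start at 0)."""
    params = {**SEARCH_PARAMS, "page": page}
    # safe="[]" keeps the square brackets as-is instead of percent-encoding them.
    return f"{BASE_URL}?{urlencode(params, safe='[]')}"


example_urls = [build_page_url(page) for page in range(5)]
print(example_urls[0])
```

Keeping the parameters in a dictionary makes it straightforward to change a single filter without editing one very long string.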

In [5]:
# Install the required packages if you haven't already with !pip install pandas selenium

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

# Create data folder if it doesn't exist

!mkdir -p data


def scrape_links(urls):
    """
    Scrapes the URLs of presidential speeches from given webpages.

    Args:
        urls (list): A list of URLs to scrape. Each URL should point to a page that contains links to speeches.

    Returns:
        list: The URLs of all speeches found on the given pages.

    This function uses Selenium WebDriver to navigate to each URL, finds all links to presidential speeches
    on the page, and writes them to the CSV file "data/president_links.csv", overwriting any previous file.
    The links are extracted using a CSS selector corresponding to the titles of the speeches.
    """

    # Initialize the WebDriver (using Firefox here)
    with webdriver.Firefox() as driver:
        
        # Initialize a list to hold all the links
        all_links = []
        
        for url in urls:
            try:
                # Navigate to the webpage
                driver.get(url)

                # Wait for the elements to load (adjust the time and condition as necessary)
                WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fr-card__title a")))

                # Find all links matching the CSS selector
                elements = driver.find_elements(By.CSS_SELECTOR, ".fr-card__title a")

                # Extract the href attribute of each link
                links = [element.get_attribute("href") for element in elements]
                
                # Append the links to the all_links list
                all_links.extend(links)
            
            except Exception as e:
                # Log any errors encountered during scraping for a specific URL
                print(f"Error scraping {url}: {e}")
        
        # Save all the links to a CSV file if there are any links found
        if all_links:
            df = pd.DataFrame({"Links": all_links})
            df.to_csv("data/president_links.csv", mode='w', header=True, index=False)
            print(f"Successfully saved {len(all_links)} links to 'president_links.csv'.")
        else:
            print("No links found to save.")

        # Return the list of links
        return all_links





In [6]:

# Run scraping

speeches_urls = scrape_links(urls)



Successfully saved 50 links to 'president_links.csv'.
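
The URL list above covers pages 0 to 998, but the notebook does not say how many result pages actually exist. One possible refinement, sketched below, is to stop at the first page that returns no links and to drop duplicate links along the way. This is only an illustration under the assumption that an empty page marks the end of the results. `collect_until_empty` and `fake_site` are hypothetical names, and `fetch_links` stands in for whatever function returns the links of one page.

```python
def collect_until_empty(fetch_links, max_pages=999):
    """
    Call `fetch_links(page)` for page = 0, 1, 2, ... and stop at the first
    page that yields no links, or after `max_pages` pages.
    Duplicate links are dropped while keeping first-seen order.
    """
    seen = {}
    for page in range(max_pages):
        links = fetch_links(page)
        if not links:
            break
        for link in links:
            seen.setdefault(link, None)
    return list(seen)


# Stand-in for a real page fetch: three fake result pages, then nothing.
fake_site = {
    0: ["https://example.org/a", "https://example.org/b"],
    1: ["https://example.org/b", "https://example.org/c"],
    2: ["https://example.org/d"],
}
print(collect_until_empty(lambda page: fake_site.get(page, [])))
# ['https://example.org/a', 'https://example.org/b', 'https://example.org/c', 'https://example.org/d']
```

With Selenium, `fetch_links` could wrap the `driver.get` and `find_elements` calls used in `scrape_links`, which would avoid waiting for a timeout on every page past the last one.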


#### Scraping Content from the Speeches

In [8]:

# Create function to scrape content from the URLs

def scrape_content(urls):
    """
    Scrapes content from the given list of URLs and saves the data into a CSV file.
    
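    Args:
        urls (list): List of URLs to scrape data from. Each URL should point to a page with presidential speech information.

    Returns:
        pandas.DataFrame: One row per speech, with the columns link, title, date, rubrique, intervenant,
        circonstance, tags and speech. Fields that are not found on a page are set to "NA".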
    """
    
    # Set up the Selenium Firefox driver
    with webdriver.Firefox() as driver:
        
        # Create an empty list to store the collected data rows
        data_rows = []
        
        for url in urls:
            # Navigate to the webpage
            driver.get(url)

            # Wait for the page to load and elements to be present
            try:
                # Wait for the presence of the title element
                WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".fr-h1, .fr-h3")))
            except Exception as e:
                print(f"Error loading page {url}: {e}")
                continue  # Skip this URL and move to the next

            # Initialize variables for the data
            title = "NA"
            date = "NA"
            rubrique = "NA"
            intervenant = "NA"
            circonstance = "NA"
            tags = "NA"
            speech = "NA"

            # Try multiple selectors for the title
            title_selectors = [".fr-h1", ".fr-h3"]
            for selector in title_selectors:
                try:
                    title = driver.find_element(By.CSS_SELECTOR, selector).text
                    break  # Break the loop if the title is found using the selector
                except NoSuchElementException:
                    continue  # Continue to the next selector if the current one fails

            # Scrape other fields
            try:
                date = driver.find_element(By.CSS_SELECTOR, ".vp-discours-details p:nth-child(1)").text
            except NoSuchElementException:
                pass

            try:
                rubrique = driver.find_element(By.CSS_SELECTOR, ".vp-page-thematic .list-secondaire").text
            except NoSuchElementException:
                pass

            try:
                intervenant = driver.find_element(By.CSS_SELECTOR, ".line-intervenant").text
            except NoSuchElementException:
                pass

            try:
                circonstance = driver.find_element(By.CSS_SELECTOR, ".field--type-string").text
            except NoSuchElementException:
                pass

            try:
                tags = driver.find_element(By.CSS_SELECTOR, ".vp-tags .list-secondaire").text
            except NoSuchElementException:
                pass

            try:
                speech = driver.find_element(By.CSS_SELECTOR, ".field--type-text-long").text
            except NoSuchElementException:
                pass

            # Append the data to the list of data rows
            row_data = {
                "link": url,
                "title": title,
                "date": date,
                "rubrique": rubrique,
                "intervenant": intervenant,
                "circonstance": circonstance,
                "tags": tags,
                "speech": speech
            }
            data_rows.append(row_data)

        # Create a DataFrame from the collected data rows
        data = pd.DataFrame(data_rows)

        # Save the DataFrame to a CSV file
        data.to_csv("presidential_speeches.csv", index=False)
        print("Data saved to 'presidential_speeches.csv'.")

        # Return the DataFrame

        return data
    

# Run the scraping function with the list of URLs
    
scraped_data = scrape_content(speeches_urls)




Data saved to 'presidential_speeches.csv'.
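
The repeated `try`/`except NoSuchElementException` blocks in `scrape_content` could be condensed into one helper. The sketch below relies on the fact that Selenium's `find_elements` (plural) returns an empty list when nothing matches, so no exception handling is needed. `text_or_default`, `FakeElement` and `FakeDriver` are hypothetical names. The two fake classes only imitate the small part of the Selenium interface used here, so that the helper can be tried without opening a browser.

```python
CSS_SELECTOR = "css selector"  # the string value of Selenium's By.CSS_SELECTOR


def text_or_default(driver, selector, default="NA"):
    """Return the text of the first element matching `selector`, else `default`."""
    # find_elements returns an empty list when nothing matches.
    matches = driver.find_elements(CSS_SELECTOR, selector)
    return matches[0].text if matches else default


class FakeElement:
    """Minimal stand-in for a Selenium WebElement."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Minimal stand-in for a Selenium WebDriver."""

    def __init__(self, page):
        self.page = page  # maps a selector to a list of element texts

    def find_elements(self, by, selector):
        return [FakeElement(text) for text in self.page.get(selector, [])]


fake_driver = FakeDriver(
    {".line-intervenant": ["Emmanuel Macron - Président de la République"]}
)
print(text_or_default(fake_driver, ".line-intervenant"))
print(text_or_default(fake_driver, ".vp-tags .list-secondaire"))  # prints NA
```

With a real driver, each field would then take one line, for example `tags = text_or_default(driver, ".vp-tags .list-secondaire")`.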


In [9]:
scraped_data



| | link | title | date | rubrique | intervenant | circonstance | tags | speech |
|---|---|---|---|---|---|---|---|---|
| 0 | https://www.vie-publique.fr/discours/296746-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 6 janvier 2025 | International | Emmanuel Macron - Président de la République | Conférence des ambassadrices et ambassadeurs 2025 | Relations internationales\nPolitique étrangère... | Monsieur le Premier ministre, \nMesdames, Mess... |
| 1 | https://www.vie-publique.fr/discours/296709-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 31 décembre 2024 | Institutions | Emmanuel Macron - Président de la République | Vœux aux Français | Institutions de l'Etat\nPolitique gouvernement... | Mes chers compatriotes,\nEnsemble cette année,... |
| 2 | https://www.vie-publique.fr/discours/296745-em... | Déclaration à la presse de M. Emmanuel Macron,... | Prononcé le 22 décembre 2024 | International | Emmanuel Macron - Président de la République | Visite officielle en Éthiopie | Afrique\nEthiopie\nFrance - Ethiopie\nRelation... | Merci beaucoup pour vos mots, Monsieur le Prem... |
| 3 | https://www.vie-publique.fr/discours/296744-em... | Déclaration à la presse de M. Emmanuel Macron,... | Prononcé le 21 décembre 2024 | International | Emmanuel Macron - Président de la République | Visite aux forces stationnées à Djibouti | Afrique\nDjibouti\nFrance - Djibouti\nArmée\nC... | Merci beaucoup, Monsieur le président, cher am... |
| 4 | https://www.vie-publique.fr/discours/296743-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 20 décembre 2024 | International | Emmanuel Macron - Président de la République | Visite aux forces stationnées à Djibouti | Défense\nArmée\nDjibouti\nMilitaire\nPolitique... | Merci, mon général.\nMessieurs les ministres,\... |
| 5 | https://www.vie-publique.fr/discours/296592-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 12 décembre 2024 | International | Emmanuel Macron - Président de la République | Déplacement en Pologne | Relations internationales\nPolitique étrangère... | Merci beaucoup, Monsieur le Premier ministre, ... |
| 6 | https://www.vie-publique.fr/discours/296573-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 7 décembre 2024 | Société | Emmanuel Macron - Président de la République | Cérémonie de réouverture de la cathédrale Notr... | Culture - Médias\nPatrimoine culturel\nEglise\... | Je me tiens devant vous, avant que ne commence... |
| 7 | https://www.vie-publique.fr/discours/296510-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 5 décembre 2024 | Institutions | Emmanuel Macron - Président de la République | | Institutions de l'Etat\nGouvernement\nMotion d... | Françaises, Français. Mes chers compatriotes,\... |
| 8 | https://www.vie-publique.fr/discours/296591-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 3 décembre 2024 | Société | Emmanuel Macron - Président de la République | One Water Summit | Environnement\nEau\nClimat\nEau potable\nTechn... | Monsieur le Prince héritier d'Arabie saoudite,... |
| 9 | https://www.vie-publique.fr/discours/296590-em... | Déclaration de M. Emmanuel Macron, président d... | Prononcé le 3 décembre 2024 | International | Emmanuel Macron - Président de la République | Clôture du Forum d'affaires France-Arabie saou... | Asie\nArabie saoudite\nFrance - Arabie saoudit... | Monsieur les ministres, \nMonsieur les ambassa... |
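
The `date` column is scraped as free text such as "Prononcé le 6 janvier 2025". If you need real dates for sorting or filtering, a small parser like the one sketched below could convert them. It is an illustration only: `parse_french_date`, `FRENCH_MONTHS` and `DATE_PATTERN` are hypothetical names, and the sketch assumes that every date follows the "day month year" pattern visible in the output above.

```python
import datetime
import re

FRENCH_MONTHS = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4, "mai": 5, "juin": 6,
    "juillet": 7, "août": 8, "septembre": 9, "octobre": 10,
    "novembre": 11, "décembre": 12,
}

# Day (optionally written "1er"), month name, four-digit year.
DATE_PATTERN = re.compile(r"(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})")


def parse_french_date(text):
    """Extract a date from text like 'Prononcé le 6 janvier 2025', or return None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = FRENCH_MONTHS.get(month_name.lower())
    if month is None:
        return None
    try:
        return datetime.date(int(year), month, int(day))
    except ValueError:
        # The text looked like a date but is not a valid calendar day.
        return None


print(parse_french_date("Prononcé le 6 janvier 2025"))    # 2025-01-06
print(parse_french_date("Prononcé le 31 décembre 2024"))  # 2024-12-31
print(parse_french_date("NA"))                            # None
```

Applied to the scraped data, this could look like `scraped_data["date"].map(parse_french_date)`.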


'https://www.vie-publique.fr/discours/289645-emmanuel-macron-31052023-france-slovaquie'