### Data Collection Methodology

This section outlines how we collect Airbnb listing data through web scraping. The process has four steps:

1. **Web Scraping Setup**:
   - We use Selenium, a browser automation library, to navigate and interact with the Airbnb website.
   - We start a Firefox WebDriver session and set the search location to France for this project.

2. **Navigating Airbnb’s Website**:
   - After setting up the WebDriver, we open the Airbnb homepage (`airbnb.fr`), dismiss the cookie pop-up by accepting only the necessary cookies, and fill in the search field with the target location.
   - Using XPath expressions and element IDs, we locate the web elements needed for navigation and data collection.

3. **Data Collection Process**:
   - A loop steps through the result pages (up to 15 in this run) and collects the links to the listings shown on each page.
   - For each listing, we store the URL so that we can later visit the listing page and extract specific details, such as:
      - Type of lodging and address
      - Guest capacity
      - Number of bedrooms and bathrooms
      - Rating and number of reviews
      - Price per night, free-cancellation terms, and host information
      - Up to 50 guest comments
   - After collecting the URLs, we keep the WebDriver session open, because the same driver is reused to visit each listing page in the next phase.

4. **Error Handling**:
   - Throughout the process, `try`/`except` blocks catch errors such as elements that fail to load, print a message, and let the scraping continue with the next step or listing.
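
The error-handling pattern in step 4 can be sketched as a small helper. This is a minimal illustration rather than the code used in the cells of this notebook: `run_step` and its arguments are hypothetical names, and plain Python callables stand in for Selenium actions so that the sketch runs without a browser.

```python
def run_step(description, action, default=None):
    """Run one scraping step; on failure, report the error and return `default`.

    `action` is any zero-argument callable, for example a lambda that clicks
    a button or reads an element's text.
    """
    try:
        return action()
    except Exception as e:  # broad on purpose: one failed step must not stop the run
        print(f"Error during {description}: {e}")
        return default


def failing_lookup():
    # Stand-in for an element that never loads
    raise TimeoutError("element did not load")


price = run_step("price extraction", lambda: "65 €")
rating = run_step("rating extraction", failing_lookup, default="Not specified")
print(price, "|", rating)
```

Returning a default value instead of re-raising keeps one missing field from discarding the whole listing, at the cost of having to filter placeholder values later.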

The following code cells implement this methodology step-by-step.


## WEB SCRAPING SETUP

In [6]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# Initialize the WebDriver to open Airbnb's homepage and handle cookies pop-up
driver = webdriver.Firefox()


driver.get("https://www.airbnb.fr/")

# Accept only the necessary cookies (the button label is matched in French because we browse airbnb.fr)
try:
    cookies_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//button[contains(text(), 'Uniquement les cookies nécessaires')]"))
    )
    cookies_button.click()
    print("Bouton 'Uniquement les cookies nécessaires' cliqué.")
except Exception as e:
    print("Erreur lors de la sélection du bouton des cookies :", e)

# Locate the destination field by its id, "bigsearch-query-location-input",
# waiting up to 10 seconds for it to appear
try:
    destination_field = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "bigsearch-query-location-input"))
    )
    destination_field.clear()  # Make sure the field is empty
    destination_field.send_keys("France")
    #destination_field.send_keys(Keys.ENTER)  # Press Enter to submit the search (requires importing Keys)
    print("Destination définie sur France.")
except Exception as e:
    print("Erreur lors de la saisie de la destination :", e)

# Submit the search by clicking the search button
# (absolute XPath: brittle, it breaks whenever the page layout changes)
try:
    search_button = driver.find_element(By.XPATH, "/html/body/div[5]/div/div/div[1]/div/div[3]/div[2]/div/div/div/header/div/div[2]/div[2]/div/div/div/form/div[2]/div/div[5]/div[2]/div[2]/button")
    search_button.click()
    print("Recherche validée.")
except Exception as e:
    print("Erreur lors de la validation de la recherche :", e)



Erreur lors de la sélection du bouton des cookies : Message: Element <button class="l1ovpqvx atm_1he2i46_1k8pnbi_10saat9 atm_yxpdqi_1pv6nv4_10saat9 atm_1a0hdzc_w1h1e8_10saat9 atm_2bu6ew_929bqk_10saat9 atm_12oyo1u_73u7pn_10saat9 atm_fiaz40_1etamxe_10saat9 bmx2gr4 atm_9j_tlke0l atm_9s_1o8liyq atm_gi_idpfg4 atm_mk_h2mmj6 atm_r3_1h6ojuz atm_rd_glywfm atm_70_5j5alw atm_tl_1gw4zv3 atm_9j_13gfvf7_1o5j5ji c1ih3c6 atm_bx_48h72j atm_cs_10d11i2 atm_5j_t09oo2 atm_kd_glywfm atm_uc_1lizyuv atm_r2_1j28jx2 atm_jb_1fkumsa atm_3f_glywfm atm_26_18sdevw atm_7l_1v2u014 atm_8w_1t7jgwy atm_uc_glywfm__1rrf6b5 atm_kd_glywfm_1w3cfyq atm_uc_aaiy6o_1w3cfyq atm_70_1b8lkes_1w3cfyq atm_3f_glywfm_e4a3ld atm_l8_idpfg4_e4a3ld atm_gi_idpfg4_e4a3ld atm_3f_glywfm_1r4qscq atm_kd_glywfm_6y7yyg atm_uc_glywfm_1w3cfyq_1rrf6b5 atm_kd_glywfm_pfnrn2_1oszvuo atm_uc_aaiy6o_pfnrn2_1oszvuo atm_70_1b8lkes_pfnrn2_1oszvuo atm_3f_glywfm_1icshfk_1oszvuo atm_l8_idpfg4_1icshfk_1oszvuo atm_gi_idpfg4_1icshfk_1oszvuo atm_3f_glywfm_b5gff8_1oszv

## LOCATING THE URL OF EACH LISTING

In [8]:
# List that stores the listing URLs
annonces_urls = []

# Number of result pages to scrape (adjust as needed)
max_pages = 15

for page in range(1, max_pages + 1):
    print(f"Scraping des annonces sur la page {page}...")

    # Collect the listing links on the current page
    try:
        listings = driver.find_elements(By.XPATH, "//a[contains(@href, '/rooms/')]")
        for listing in listings:
            url = listing.get_attribute("href")
            if url and url not in annonces_urls:  # Skip missing hrefs and duplicates
                annonces_urls.append(url)
                print(f"URL de l'annonce : {url}")
    except Exception as e:
        print("Erreur lors de l'extraction des liens d'annonces :", e)
    
    # Go to the next page by clicking its page number in the pagination bar.
    # The link text is matched exactly so that page 2 does not also match "12" or "20".
    try:
        next_page = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, f"//a[normalize-space(text())='{page + 1}']"))
        )
        next_page.click()
        time.sleep(5)  # Wait for the new page to load
        print(f"Passage à la page {page + 1}...")
    except Exception:
        print(f"Fin du scraping à la page {page}.")
        break  # No next-page link was found, so stop the loop

# The driver is deliberately left open: the next phase reuses it to visit each listing
#driver.quit()

# Print the total number of listing URLs collected
print(f"Nombre total d'annonces récupérées : {len(annonces_urls)}")

Scraping des annonces sur la page 1...
URL de l'annonce : https://www.airbnb.fr/rooms/1029555185028483066?adults=1&category_tag=Tag%3A8144&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1898243596&search_mode=regular_search&check_in=2025-01-19&check_out=2025-01-24&source_impression_id=p3_1736514540_P3pD0XkSigrgFezW&previous_page_section_name=1000&federated_search_id=e7b3aaff-2bff-4f20-92ab-e5c7af27569a
URL de l'annonce : https://www.airbnb.fr/rooms/1237350342980296107?adults=1&category_tag=Tag%3A8661&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1998864943&search_mode=regular_search&check_in=2025-02-16&check_out=2025-02-21&source_impression_id=p3_1736514540_P3TqebDYtctsabPe&previous_page_section_name=1000&federated_search_id=e7b3aaff-2bff-4f20-92ab-e5c7af27569a
URL de l'annonce : https://www.airbnb.fr/rooms/1190686584914640891?adults=1&category_tag=Tag%3A8536&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1988907194&search_mode=reg
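
The deduplication check in the loop compares full URLs, query string included. Parameters such as `check_in` or `source_impression_id` can differ between two links to the same listing, so one option is to compare listings by the numeric ID in the `/rooms/<id>` path instead. The sketch below uses only the standard library; `listing_id` and `dedupe_by_listing` are hypothetical helpers, and the query strings are shortened for readability.

```python
import re
from urllib.parse import urlsplit

# Matches paths of the form /rooms/<digits>; other path shapes are ignored
ROOM_PATH = re.compile(r"^/rooms/(\d+)")


def listing_id(url):
    """Return the numeric listing ID from a room URL, or None if absent."""
    match = ROOM_PATH.match(urlsplit(url).path)
    return match.group(1) if match else None


def dedupe_by_listing(urls):
    """Keep the first URL seen for each listing ID, preserving order."""
    seen = {}
    for url in urls:
        key = listing_id(url)
        if key is not None and key not in seen:
            seen[key] = url
    return list(seen.values())


urls = [
    "https://www.airbnb.fr/rooms/1029555185028483066?adults=1&check_in=2025-01-19",
    "https://www.airbnb.fr/rooms/1029555185028483066?adults=1&check_in=2025-01-05",
    "https://www.airbnb.fr/rooms/1237350342980296107?adults=1&check_in=2025-02-16",
]
unique = dedupe_by_listing(urls)
print(len(unique))
```

Keying on the ID also makes it easier to resume a run later, since a listing collected in an earlier session is recognized even when its tracking parameters have changed.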

In [None]:
#from selenium import webdriver
#from selenium.webdriver.common.by import By
#from selenium.webdriver.firefox.service import Service
#from selenium.webdriver.firefox.options import Options
#import time

# Example Firefox options (adjust as needed)
#options = Options()
#options.add_argument("-headless")  # Run headless if you do not want a browser window to open

# Define the service and create a Firefox WebDriver instance
#service = Service('C:/Users/natha/Downloads/geckodriver-v0.35.0-win64/geckodriver.exe')  # Set the correct path to geckodriver

# Start the WebDriver
#driver = webdriver.Firefox(service=service, options=options)

## ITERATING OVER EACH LISTING URL TO EXTRACT RELEVANT INFORMATION

### FUNCTION TO SCROLL THE REVIEWS POP-UP AND COLLECT THE COMMENTS: `scroll_commentaires_lentement`
Scrolling stops once 50 comments have been collected (the `max_comments` parameter), because a larger number of comments per listing would be difficult to handle.

In [9]:
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time

# Scroll slowly through the reviews pop-up and collect the comment texts
def scroll_commentaires_lentement(driver, scroll_pause_time=2, max_comments=50):
    try:
        # Find the scrollable element inside the reviews pop-up
        scroll_element = driver.find_element(By.XPATH, "/html/body/div[9]/div/div/section/div/div/div[2]/div/div[3]")
        all_commentaires = []

        # Scroll step by step until enough comments are collected
        while len(all_commentaires) < max_comments:
            # Find all comment elements currently loaded
            commentaires_elements = driver.find_elements(By.XPATH, '//span[contains(@class, "l1h825yc")]')

            # Extract the text of each comment, skipping duplicates
            for commentaire in commentaires_elements:
                if commentaire.text not in all_commentaires:
                    all_commentaires.append(commentaire.text)

            #print(f"Commentaires collected so far: {len(all_commentaires)}")
            if len(all_commentaires) >= max_comments:
                print(f"Reached the limit of {max_comments} commentaires.")
                break

            # Store the current scroll position
            last_scroll_position = driver.execute_script("return arguments[0].scrollTop;", scroll_element)

            # Scroll down by 500 pixels
            driver.execute_script("arguments[0].scrollTop = arguments[0].scrollTop + 500;", scroll_element)
            time.sleep(scroll_pause_time)  # Pause between scrolls to simulate human interaction

            # Check whether the scroll position has changed
            new_scroll_position = driver.execute_script("return arguments[0].scrollTop;", scroll_element)

            # An unchanged scroll position means the end has been reached
            if last_scroll_position == new_scroll_position:
                print("Reached the end of the pop-up. No more comments to load.")
                break

        print("Completed scrolling and/or end of commentaires reached.")
        return all_commentaires

    except Exception as e:
        print(f"Error during scrolling: {e}")
        return []
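
The function stops for one of two reasons: enough comments were collected, or a scroll step did not move the container. That stopping rule can be isolated from Selenium and checked on its own. The sketch below is an illustration under stated assumptions: `collect_until_stalled` and the `FakePopup` class are hypothetical, with `FakePopup` standing in for the reviews pop-up by revealing three more comments after each scroll.

```python
def collect_until_stalled(read_items, scroll_once, get_position, max_items=50):
    """Collect unique items until `max_items` is reached or scrolling stalls."""
    collected = []
    while len(collected) < max_items:
        for item in read_items():
            if item and item not in collected:  # skip empty strings and duplicates
                collected.append(item)
        if len(collected) >= max_items:
            break
        before = get_position()
        scroll_once()
        if get_position() == before:  # position unchanged: bottom reached
            break
    return collected[:max_items]


class FakePopup:
    """Stand-in for the reviews pop-up: each scroll reveals 3 more comments."""

    def __init__(self, total):
        self.comments = [f"comment {i}" for i in range(total)]
        self.position = 0

    def visible(self):
        return self.comments[: self.position + 3]

    def scroll(self):
        # The position stops growing once every comment is visible
        self.position = min(self.position + 3, max(len(self.comments) - 3, 0))


short_popup = FakePopup(total=7)
long_popup = FakePopup(total=200)
few = collect_until_stalled(short_popup.visible, short_popup.scroll, lambda: short_popup.position)
capped = collect_until_stalled(long_popup.visible, long_popup.scroll, lambda: long_popup.position)
print(len(few), len(capped))
```

Unlike `scroll_commentaires_lentement`, this sketch skips empty strings and truncates the result to `max_items`; whether either behavior is wanted in the notebook is a design choice.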


In [11]:
from selenium.webdriver.support import expected_conditions as EC

# Dictionary to store ad details, keyed by the ad URL
annonces_details = {}

# Function to extract an element with error handling
def extraire_element(xpath):
    try:
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, xpath)))
        return element.text.strip()  # Extract the text and remove extra blank spaces
    except Exception as e:
        print(f"Error in the extraction of element: {e}")
        return 'Not specified'

# Loop through each ad URL to extract details
for annonce_url in annonces_urls[201:]:  # This batch: listings from index 201 onward
    print(f"Scraping les détails de l'annonce : {annonce_url}")
    
    # Visit the ad page
    driver.get(annonce_url)
    time.sleep(5)  # Wait for the page to fully load
    
    try:
        # Translation pop-up: clicking its close button is left disabled
        #pop_up_translator_close = driver.find_element(By.XPATH, '//div[contains(@class,"c1lbtiq8")]')
        #pop_up_translator_close.click()
        # Run a script that removes the pop-up element from the page
        driver.execute_script("document.querySelector('.c1wj82si').remove();")  # Replace with the correct selector

    except Exception as e:
        print(f"There is no translation pop-up. Could not find element: {e}")

    try:     
        #Extract general information
        type_logement_adresse = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[1]/div/div/div/section/div[1]/h2')
        #voyageurs = extraire_element ('//li[contains(@class, "l7n4lsf") and contains(text(), "voyageurs")]')
        voyageurs = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[1]/div/div/div/section/div[2]/ol/li[1]')
        chambres = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[1]/div/div/div/section/div[2]/ol/li[2]')
        salles_de_bain = extraire_element ('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[1]/div/div/div/section/div[2]/ol/li[3][contains(text(), "bain")] | //*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[1]/div/div/div/section/div[2]/ol/li[4][contains(text(), "bain")]')
        note = extraire_element('//span[contains(text(),"étoile(s)")]')
        prix_par_nuit = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[2]/div/div/div[1]/div/div/div/div/div/div/div/div[1]/div[1]/div/div/span/div/span[1]')
        annulation_gratuite = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[4]/div/div[2]/section/div[2]/div[3]/div/div[2]/div[1]/h3')
        hote = extraire_element('//div[contains(@class, "t1pxe1a4")]')
        #hote = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[2]/div[2]/div/div/div/div[2]/div[1]')
        #experience_hote = extraire_element('//*[@id="site-content"]/div/div[1]/div[3]/div/div[1]/div/div[2]/div[2]/div/div/div/div[2]/div[2]/ol')
        experience_hote = extraire_element('//div[contains(@class,"s1l7gi0l")]')
        

        # extract comments and number of comments
        show_reviews_link = driver.find_element(By.XPATH, '//a[contains(@href, "/reviews") and contains(@class, "l1ovpqvx")]')
        show_reviews_link.click()
        num_commentaires = extraire_element('//div[contains(@class, "_1j6cqxi")]')
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[contains(@class, "di536pa")]')))
        print("Le pop-up des commentaires est maintenant visible.")
        all_commentaires = scroll_commentaires_lentement(driver, max_comments = 50)
        
        # Store ad details in the dictionary using the ad URL as the key
        annonces_details[annonce_url] = {
            "Type of accommodation and address": type_logement_adresse,
            "Travelers": voyageurs,
            "Rooms": chambres,
            "Bathrooms": salles_de_bain,
            "Rating": note,
            "Number of comments": num_commentaires,
            "Price per night": prix_par_nuit,
            "Free cancellation": annulation_gratuite,
            "Host": hote,
            "Host experience": experience_hote,
            "Comments": all_commentaires  # Add comments list if available
        }
        
        print(f"Adding details for URL {annonce_url}: {annonces_details[annonce_url]}")

    except Exception as e:
        print(f"Erreur lors du scraping de l'annonce : {e}")

# Display the total number of ads scraped
print(f"Nombre total d'annonces détaillées scrappées : {len(annonces_details)}")


Scraping les détails de l'annonce : https://www.airbnb.fr/rooms/792553097805413642?adults=1&category_tag=Tag%3A8536&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1758408835&search_mode=regular_search&check_in=2025-01-11&check_out=2025-01-16&source_impression_id=p3_1736514618_P3v1_kD1HqsH5MfW&previous_page_section_name=1000&federated_search_id=a74be516-74af-477b-a363-1091b78c7ba6
There is no translation pop-up. Could not find element: Message: TypeError: document.querySelector(...) is null
Stacktrace:
@https://www.airbnb.fr/rooms/792553097805413642?adults=1&category_tag=Tag%3A8536&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1758408835&search_mode=regular_search&check_in=2025-01-11&check_out=2025-01-16&source_impression_id=p3_1736514618_P3v1_kD1HqsH5MfW&previous_page_section_name=1000&federated_search_id=a74be516-74af-477b-a363-1091b78c7ba6:2:16
@https://www.airbnb.fr/rooms/792553097805413642?adults=1&category_tag=Tag%3A8536&children=0&enable_m3_pr

In [25]:
print(annonces_urls[0:10])

['https://www.airbnb.fr/rooms/1089276169280272196?adults=1&children=0&enable_m3_private_room=true&infants=0&pets=0&search_mode=regular_search&check_in=2025-01-19&check_out=2025-01-24&source_impression_id=p3_1735389882_P3BE_Ic4fYulDPOQ&previous_page_section_name=1000&federated_search_id=0e044253-86af-4db3-9f11-d00b4cc60092', 'https://www.airbnb.fr/rooms/1043482526726266969?adults=1&category_tag=Tag%3A670&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1798409252&search_mode=regular_search&check_in=2025-01-26&check_out=2025-01-31&source_impression_id=p3_1735389882_P3HIUHJ8qmiZtL5k&previous_page_section_name=1000&federated_search_id=0e044253-86af-4db3-9f11-d00b4cc60092', 'https://www.airbnb.fr/rooms/1029555185028483066?adults=1&category_tag=Tag%3A8144&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1898243596&search_mode=regular_search&check_in=2025-01-05&check_out=2025-01-10&source_impression_id=p3_1735389882_P3IFdI2u8v5Fd0pe&previous_page_section_name=1

In [6]:
annonces_details

{'https://www.airbnb.fr/rooms/1190686584914640891?adults=1&category_tag=Tag%3A8536&children=0&enable_m3_private_room=true&infants=0&pets=0&photo_id=1988907194&search_mode=regular_search&check_in=2025-01-12&check_out=2025-01-17&source_impression_id=p3_1735627598_P3Rjm_F1UKxjWQpp&previous_page_section_name=1000&federated_search_id=bec8b566-090e-45d4-9721-acbfe3d97b1d': {'Type of accommodation and address': 'Cabane perchée - Le Passage, France',
  'Travelers': '2 voyageurs',
  'Rooms': '· 1 chambre',
  'Bathrooms': '· 1 salle de bain',
  'Rating': '5,0 étoile(s) sur 5.',
  'Number of comments': '79 commentaires',
  'Price per night': '65 €',
  'Free cancellation': 'Annulation gratuite avant le 7 janv',
  'Host': 'Hôte : Jeremie',
  'Host experience': 'Superhôte · Hôte depuis 6 ans',
  'Comments': ['',
   'Ce logement fait partie des 5 % de logements préférés sur Airbnb parmi les logements éligibles, à partir des évaluations, des commentaires et de la fiabilité des annonces selon les voyag

In [12]:
import csv

# Print the total number of listings scraped
print(f"Nombre total d'annonces détaillées scrappées : {len(annonces_details)}")

# Output file for this batch; the same name is reused in the confirmation message
output_path = 'annonces_details_201_270_50commentaires.csv'

# Column names: "URL" plus the keys of each entry in `annonces_details`
fieldnames = ["URL", "Type of accommodation and address", "Travelers", "Rooms", "Bathrooms", "Rating", "Number of comments", "Price per night", "Free cancellation", "Host", "Host experience", "Comments"]

# Save the details to a CSV file
with open(output_path, mode='w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()  # Header row with the column names

    # Write one row per listing
    for url, ad_details in annonces_details.items():
        # Work on a copy so that `annonces_details` itself is not modified
        row = {"URL": url, **ad_details}

        # Drop empty comments and join the rest into one newline-separated string
        if isinstance(row['Comments'], list):
            filtered_comments = [comment for comment in row['Comments'] if comment.strip()]
            row['Comments'] = "\n".join(filtered_comments)

        writer.writerow(row)

print(f"Les détails des annonces ont été sauvegardés dans '{output_path}'")



Nombre total d'annonces détaillées scrappées : 42
Les détails des annonces ont été sauvegardés dans 'annonces_details_201_270_50commentaires.csv'
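
Because the comments are joined with newlines, a single listing's `Comments` cell spans several physical lines in the file. The `csv` module quotes such fields, so reading the data back with `csv.DictReader` should recover them intact. The sketch below checks this round trip in memory; the sample row is made up for illustration and uses only a subset of the columns.

```python
import csv
import io

fieldnames = ["URL", "Rating", "Comments"]
row = {
    "URL": "https://www.airbnb.fr/rooms/0",  # placeholder, not a real listing
    "Rating": "5,0 étoile(s) sur 5.",
    "Comments": "\n".join(["Great stay", "Very clean, quiet"]),
}

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerow(row)

# Read it back: embedded newlines and commas survive thanks to quoting
buffer.seek(0)
rows = list(csv.DictReader(buffer))
comments = rows[0]["Comments"].split("\n")
print(len(rows), comments)
```

When reading the real file from disk, opening it with `newline=''` and `encoding='utf-8'`, as in the writing cell, keeps the embedded line breaks from being altered.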


### Data Cleaning and Transformation