In [1]:
import json
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException, StaleElementReferenceException
from selenium import webdriver
import os
from bs4 import BeautifulSoup
import time
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
import requests
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import re
from datetime import datetime
#import locale
import numpy as np
from selenium.common.exceptions import TimeoutException
from concurrent.futures import ThreadPoolExecutor # part of the Python 3 standard library, no installation needed
from lxml import html #pip install lxml
from tqdm import tqdm #pip install tqdm

# Web Scraping objectives and structure

Our goal is to provide web scraping code that, in the end, combines the DataFrames produced by two distinct parts. The result is saved as a final .csv file that the remaining parts of the project can use directly (a pre-prepared .csv file is provided for anyone who wants to skip running this code, which takes 20-30 minutes).

The two main parts of this code are:
- General scraping from the main search
    - Most of the elements are scraped during this first phase. The code runs all four required searches and scrapes the information that Booking.com shows for each hotel on the search results pages. During this phase, the code also stores the URL of each hotel's individual page in the corresponding row.

- Individual scraping of each description   
    - During this second phase the code visits each hotel's individual page on Booking.com, looks for its description and adds it to the corresponding row. Once this is complete, the code deletes the column with the individual links, as they are no longer needed. This last step can easily be modified if you prefer to keep those URLs.
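
The second phase can be sketched as follows. This is a minimal, dependency-free illustration rather than the notebook's actual implementation: rows are plain dictionaries instead of a DataFrame, and `add_descriptions` and `fake_fetch` are hypothetical names, with `fake_fetch` standing in for the function that downloads a hotel page and extracts its description.

```python
from concurrent.futures import ThreadPoolExecutor


def add_descriptions(rows, fetch_description, max_workers=8):
    """Return new rows with a 'description' key added and the 'link' key dropped."""
    links = [row['link'] for row in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, so they align with rows.
        descriptions = list(pool.map(fetch_description, links))

    enriched = []
    for row, description in zip(rows, descriptions):
        new_row = {key: value for key, value in row.items() if key != 'link'}
        new_row['description'] = description
        enriched.append(new_row)
    return enriched


def fake_fetch(url):
    # Hypothetical stand-in: the real function would request and parse the page.
    return f"description of {url}"


rows = [
    {'name': 'Hotel A', 'link': 'https://example.com/a'},
    {'name': 'Hotel B', 'link': 'https://example.com/b'},
]
result = add_descriptions(rows, fake_fetch)
```

With the notebook's DataFrame, the same `pool.map` call could be applied to the `link` column before dropping it.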


As a result, the code produces an extensive dataset with information on every hotel that appears in the searches.

We focus on hotels because, having multiple rooms, they allow a more standardized price comparison than tourist apartments. In short, a comparable room in the same hotel is likely to be available in both the treatment and control weeks, whereas a given apartment is less likely to appear in both: a single reservation in the middle of one of the weeks removes that apartment from the search results.
Nevertheless, our code allows other filters to be applied, so it can also be used to search for apartments or hostels instead of hotels.

# Start of the actual code

##### Please make sure to install the necessary packages if they are not installed yet. The only ones likely to cause problems are the last three imported above, which serve mostly aesthetic purposes but are still required for the rest of the code to run smoothly.

#### IMPORTANT: change "user_path" to the path of the folder that contains this notebook. Make sure you are using Firefox (the code would need substantial tweaking to work with Chrome instead). Also make sure the uBlock Origin extension file (ublock_origin-1.55.0.xpi) and the geckodriver are in this same folder; the ones included work on Apple Silicon MacBooks.
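
Before launching the browser, the setup described above can be checked with a few lines. This is only a sketch: `missing_setup_files` is a hypothetical helper that is not part of the notebook, and the two file names are the ones the code below expects to find in the project folder.

```python
from pathlib import Path

# File names taken from the notebook code; both are expected inside user_path.
REQUIRED_FILES = ['geckodriver', 'ublock_origin-1.55.0.xpi']


def missing_setup_files(user_path, required=REQUIRED_FILES):
    """Return the required file names that are not present in user_path."""
    folder = Path(user_path)
    return [name for name in required if not (folder / name).is_file()]
```

An empty list means both files were found; otherwise the returned names tell you what still has to be copied into the folder.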

Below you can find all the available tweaks gathered in a single block of code. The possible tweaks are:
- Filters: the available options are listed in the corresponding place in the code.
- Columns of the resulting DataFrame. We advise against changing this part because it affects all the code below.
- Cities to scrape: defined here so they can be changed easily.
- Dates to scrape: defined here so they can be changed easily.
- "user_path": as mentioned above, it is defined so the pipeline can be moved quickly to a new computer.
- Paths and website: the download folder, the geko_path and the link of the website to scrape can be changed quickly (we strongly advise against changing the website, as the code is written specifically for it and is not robust to other websites).
- Output name: a final variable sets the name of the .csv file produced by the web scraping.
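
When changing the dates, it may help to check that both windows follow the 'yyyy-mm-dd' format and cover the same number of nights, so that prices remain comparable. The helper below is a hypothetical sketch, not part of the notebook; the dates used are the ones configured in the next cell.

```python
from datetime import datetime


def nights(date_initial, date_final):
    """Number of nights between two 'yyyy-mm-dd' dates; raises ValueError if invalid."""
    start = datetime.strptime(date_initial, '%Y-%m-%d')  # ValueError on a wrong format
    end = datetime.strptime(date_final, '%Y-%m-%d')
    if end <= start:
        raise ValueError(f"{date_final} is not after {date_initial}")
    return (end - start).days


treatment_nights = nights('2024-09-19', '2024-09-25')
control_nights = nights('2024-09-26', '2024-10-02')
```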

On a final note, some parts of the code contain "time.sleep()" calls whose durations should be sufficient for any decent internet connection. If the scraping code has to run on a very slow connection and fails in those parts because the waits are too short, please adjust the timings to suit your connection. They can be found easily by searching (cmd + F) for "time.sleep".
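
An alternative to fixed sleeps is to poll until a condition holds, which is the idea behind Selenium's `WebDriverWait`. The sketch below shows that idea with the standard library only; `wait_until` is a hypothetical helper and is not used by the notebook.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```

With this pattern a fast connection proceeds as soon as the page is ready, while a slow one only needs a larger `timeout`.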

In [2]:
# the filters are: 'Hotels', 'Apartments', 'Hostels', 'Very Good', 'Swimming pool'
filters = ['Hotels']

'_________________________________________________________________________________'

df_columns = ['name', 'rating', 'num_reviews', 'neighborhood', 'dist_from_center', 'price', 'other', 'is_Barna', 'is_Merce_time', 'short_description', 'link']
df = pd.DataFrame(columns = df_columns)
display(df)

'_________________________________________________________________________________'

Treatment_city = 'Barcelona'

Control_city = 'Naples Italy'

'_________________________________________________________________________________'

Treatment_initial_date = '2024-09-19'
Treatment_final_date = '2024-09-25'
Control_initial_date = '2024-09-26'
Control_final_date = '2024-10-02'

'_________________________________________________________________________________'

"""Define here the path to the project folder"""
user_path = "/Users/guillemmirabentrubinat/Library/CloudStorage/OneDrive-Personal/BSE/Intro to Text Mining and Natural Language Processing/problem_sets/ps1"

'_________________________________________________________________________________'

dfolder='./downloads'
geko_path=f'{user_path}/geckodriver'
link='https://www.booking.com'

'_________________________________________________________________________________'

name_csv = 'booking_latest_scraping' #.csv is added directly in the end, do not add it here.

Unnamed: 0,name,rating,num_reviews,neighborhood,dist_from_center,price,other,is_Barna,is_Merce_time,short_description,link


In [3]:

def ffx_preferences(dfolder, download=False):
    '''
    Sets the preferences of the firefox browser: download path.
    '''
    my_profile = f'{user_path}/myprofile'
    profile = webdriver.FirefoxProfile(my_profile)
    # set download folder:
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                            "application/msword,application/rtf, application/csv,text/csv,image/png ,image/jpeg, application/pdf, text/html,text/plain,application/octet-stream")
    
    profile.add_extension(f'{user_path}/ublock_origin-1.55.0.xpi')


    # this allows to download pdfs automatically
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options

def start_up(link, dfolder, geko_path, download=True):
    '''
    Starts Firefox through geckodriver, installs uBlock Origin and opens the link.
    '''
    os.makedirs(dfolder, exist_ok=True)

    # The custom profile from ffx_preferences(dfolder, download) is currently not used.
    options = None
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    browser.install_addon(f'{user_path}/ublock_origin-1.55.0.xpi')

    # Enter the website address here
    browser.get(link)
    time.sleep(5)  # Adjust sleep time as needed
    return browser

def check_and_click(browser, xpath, type, max_tries=15):
    '''
    Tries to click the object. If the click fails (the object is missing, stale
    or obscured), waits one second and tries again, up to max_tries attempts.
    Returns True if the click succeeded and False otherwise.
    '''
    for _ in range(max_tries):
        if check_obscures(browser, xpath, type):
            return True
        time.sleep(1)
    return False

def check_obscures(browser, xpath, type):
    '''
    Function that checks whether the object is being "obscured" by any element so
    that it is not clickable. Important: if True, the object is going to be clicked!
    '''
    try:
        if type == "xpath":
            browser.find_element('xpath',xpath).click()
        elif type == "id":
            browser.find_element('id',xpath).click()
        elif type == "css":
            browser.find_element('css selector',xpath).click()
        elif type == "class":
            browser.find_element('class name',xpath).click()
        elif type == "link":
            browser.find_element('link text',xpath).click()
    except (ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException) as e:
        print(e)
        return False
    return True

In [4]:
""" We define a function that finds the total number of result pages for a given city and dates. """
def get_num_pages(browser):
    nums_pages = browser.find_elements(By.XPATH, "//button[@class='a83ed08757 a2028338ea' and @type='button']")
    # If no pagination buttons are found, we assume there is a single page of results.
    if not nums_pages:
        return 1
    return int(nums_pages[-1].text)

""" We define the function that scrapes the data from one page of the search.
    It returns a dataframe with the data of the page.
    Specifically, it scrapes for:
        - name of the hotel
        - rating
        - number of reviews
        - neighborhood
        - distance from center of the city, although Booking doesn't easily define what 'center' means for them
        - price
        - other info such as subway access or beach proximity
        - short description, which corresponds to the basic description info of each hotel that appears on the search page, without entering the hotel web
        - link to the hotel web
        - if the hotel is in Barcelona or not
        - if the search is for the period of La Mercè or not"""
def PAGE_SCRAPER_2000mk5(browser):
    search_results_elements = browser.find_elements(By.XPATH, "//div[@class='c066246e13' and @data-testid='property-card-container']")
    #print(len(search_results_elements))
    #df_columns = ['name', 'rating', 'num_reviews', 'neighborhood', 'dist_from_center', 'price', 'other', 'description', 'link', 'descriptions', 'is_Barna', 'is_Merce_time']
    TEMP_df = pd.DataFrame(columns = df_columns)
    
    names_list = []
    ratings_list = []
    num_reviews_list = []
    neighborhoods_list = []
    dists_from_center_list = []
    prices_list = []
    others_list = []
    short_descriptions_list = []
    links_list = []
    is_Barna_list = []
    is_Merce_time_list = []
    
    for r in search_results_elements:
        try:
            TEMP_name_element = r.find_element(By.XPATH, ".//div[@class='f6431b446c a15b38c233' and @data-testid='title']")
            name = str(TEMP_name_element.text)
        except NoSuchElementException:
            name = np.nan
        
        
        try:
            TEMP_rating_element = r.find_element(By.XPATH, ".//div[@data-testid='review-score']//div[@class='a3b8729ab1 d86cee9b25']")
            rating = float(TEMP_rating_element.text)
        except NoSuchElementException:
            try:
                TEMP_rating_element = r.find_element(By.XPATH, ".//div[@data-testid='external-review-score']//div[@class='a3b8729ab1 e6208ee469 cb2cbb3ccb']")
                temp_rating = str(TEMP_rating_element.text)
                rating = float(re.search(r"\b[\d.]+\b", temp_rating).group())
            except NoSuchElementException:
                rating = np.nan
        
        
        try:
            TEMP_num_reviews_element = r.find_element(By.XPATH, ".//div[@data-testid='review-score']//div[@class='abf093bdfe f45d8e4c32 d935416c47']")
            # \d+ requires at least one digit, so an empty match can never reach int()
            num_reviews = int(re.search(r"\d+", TEMP_num_reviews_element.text.replace(',', '')).group())
        except (NoSuchElementException, AttributeError):
            try:
                TEMP_num_reviews_element = r.find_element(By.XPATH, ".//div[@data-testid='external-review-score']//div[@class='abf093bdfe f45d8e4c32 d935416c47']")
                num_reviews = int(re.search(r"\d+", TEMP_num_reviews_element.text.replace(',', '')).group())
            except (NoSuchElementException, AttributeError):
                num_reviews = np.nan
        
        
        try:
            TEMP_neighborhood = r.find_element(By.XPATH, ".//span[@class='aee5343fdb def9bc142a' and @data-testid='address']")
            neighborhood = str(re.search(r"([\w\s]+)(?=, )", TEMP_neighborhood.text).group())
        except:
            neighborhood = np.nan
        
        
        try:
            TEMP_dist_from_center = r.find_element(By.XPATH, ".//span[@class='f419a93f12']//span[@data-testid='distance']")
            dist_from_center = float(re.search(r"([\d.]*(?=[\skm m]+))", TEMP_dist_from_center.text).group())
            if re.search(r"km", TEMP_dist_from_center.text):
                dist_from_center *= 1000
                dist_from_center = int(dist_from_center)
            elif re.search(r"\bm\b", TEMP_dist_from_center.text):
                dist_from_center = int(dist_from_center)        
            #print(dist_from_center)
        except NoSuchElementException:
            dist_from_center = np.nan
        
        
        try:
            TEMP_other = r.find_elements(By.XPATH, ".//span[@class='aee5343fdb']//span[@class='f419a93f12']//span[@aria-expanded='false']")
            len_other = len(TEMP_other)
            list_other = []
            for i in range(1, len_other):
                list_other.append(str(TEMP_other[i].text))
            matches = re.findall(r"[^,\[\]']+",str(list_other))
            matches = [m.strip() for m in matches if m.strip()]
            other = ', '.join(matches)
            try:
                re.search(r"\w+", other).group()
            except:
                other = np.nan
        except NoSuchElementException:
            other = np.nan
        
        
        try:
            TEMP_price = r.find_element(By.XPATH, ".//span[@class='f6431b446c fbfd7c1165 e84eb96b1f' and @data-testid='price-and-discounted-price']")
            price = int(re.search(r"[\d]+", TEMP_price.text.replace(',', '')).group())
            #price = TEMP_price.text
        except NoSuchElementException:
            price = np.nan
        
        
        try:
            TEMP_description = r.find_element(By.XPATH, ".//div[@class='c59cd18527']")
            sh_description = str(TEMP_description.text)
            #description
        except NoSuchElementException:
            sh_description = np.nan
        
        
        try:
            TEMP_link = r.find_element(By.XPATH, ".//a[@data-testid='title-link']")
            link = str(TEMP_link.get_attribute('href'))
        except NoSuchElementException:
            link = np.nan
        
        if city == Treatment_city:
            is_Barna = 1
        else:
            is_Barna = 0
        
        if date_initial == Treatment_initial_date:
            is_Merce_time = 1
        else:
            is_Merce_time = 0
        
        #print(f"{'-' * 55}\n>>> {name} | {rating} | {num_reviews} | {neighborhood} | {dist_from_center} | {price} | {other} | {description} | {link}")
        
        
        names_list.append(name)
        ratings_list.append(rating)
        num_reviews_list.append(num_reviews)
        neighborhoods_list.append(neighborhood)
        dists_from_center_list.append(dist_from_center)
        prices_list.append(price)
        others_list.append(other)
        short_descriptions_list.append(sh_description)
        links_list.append(link)
        is_Barna_list.append(is_Barna)
        is_Merce_time_list.append(is_Merce_time)
        
    
    cols_dict = {
        'name': names_list,
        'rating': ratings_list,
        'num_reviews': num_reviews_list,
        'neighborhood': neighborhoods_list,
        'dist_from_center': dists_from_center_list,
        'price': prices_list,
        'other': others_list,
        'short_description': short_descriptions_list,
        'link': links_list,
        'is_Barna': is_Barna_list,
        'is_Merce_time': is_Merce_time_list
        }
    
    for c in df_columns:
        TEMP_df[c] = cols_dict[c]
    
    return TEMP_df

""" The following function is just a quality-of-life helper that dismisses the 'Genius' pop-ups that may appear because we decline to sign in. """
def POP_UP_REJECTOR_1000_MK3(browser, waiting_time):
    not_logged_path = "//button[@aria-label='Dismiss sign-in info.']"
    wait = WebDriverWait(browser, waiting_time)
    
    try:
        wait.until(EC.presence_of_element_located((By.XPATH, not_logged_path)))
        browser.find_element(By.XPATH, not_logged_path).click()
    except TimeoutException:
        pass

""" We define a function that ticks a set of pre-selected filters on Booking.com. In the end we only used the 'Hotels' and 'Apartments' filters.
    We ran the whole code with 'Hotels'; 'Apartments' was only used for testing purposes."""
def filters_selector(browser, filters: list):
    
    if len(filters) > 0:
        browser.find_element(By.XPATH, "//button[@data-testid='filters-group-expand-collapse']").click()
        
        filters_found = browser.find_elements(By.XPATH, "//label[@class='aca0ade214 aaf30230d9 c2931f4182 d79e71457a bd597ff2d8']")
        
        unique_filters = []
        unique_filters_text = []
        
        for e in filters_found:
            try:
                TEMP_filter_text = re.search(r"\b[a-zA-Z\s]+\b", e.text).group().replace('\n', '')
            except AttributeError:
                TEMP_filter_text = np.nan
            
            if TEMP_filter_text not in unique_filters_text:
                unique_filters.append(e)
                unique_filters_text.append(TEMP_filter_text)
        
        for e in unique_filters:
            try:
                element_text = re.search(r"\b[a-zA-Z\s]+\b", e.text).group().replace('\n', '')
            except:
                continue
                
            # Tick the checkbox only when the label is one of the supported filters and was requested.
            if str(element_text) in ('Hotels', 'Apartments', 'Hostels', 'Very Good', 'Swimming pool') and str(element_text) in filters:
                e.find_element(By.XPATH, ".//span[@class='ef785aa7f4']").click()
    
    else:
        pass

""" We define a function that parses the webpage of a single hotel with BeautifulSoup in order to get the full description from it. """
def scrape_description(url):
    try:
        response = requests.get(url)
        response.raise_for_status()  # Raises an HTTPError if the HTTP request returned an unsuccessful status code
        soup = BeautifulSoup(response.content, 'html.parser')
        # Convert BeautifulSoup object to a lxml element tree to use xpath
        tree = html.fromstring(str(soup))
        description = tree.xpath("//div[@class='bui-grid__column bui-grid__column-8 k2-hp--description']//text()")
        return ' '.join(description).strip()
    except Exception as e:
        return f"Error: {e}"

""" We also define a quality of life function that closes the browser and quits the driver once we are done with the scraping. 
    This way we can run the code, leave it running and, when it finishes, we won't find any open instances of the browser. """
def browser_closer(browser):
    browser.close()
    browser.quit()
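
The regex-based parsing used inside `PAGE_SCRAPER_2000mk5` can be tried out without a browser. The sketch below is a simplified variant, not the exact expressions used above: `parse_distance_m` and `parse_price` are hypothetical helpers, and the sample strings ("1.2 km from centre", "€ 1,234") are assumptions about how the texts look on the results page.

```python
import re


def parse_distance_m(text):
    """Convert texts such as '1.2 km from centre' or '500 m from centre' to metres."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(km|m)\b", text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == 'km':
        value *= 1000
    return int(round(value))


def parse_price(text):
    """Extract the first integer from a price text, assuming ',' separates thousands."""
    match = re.search(r"\d+", text.replace(',', ''))
    return int(match.group()) if match else None
```

Returning `None` when nothing matches lets the caller decide how to record a missing value, for example as `np.nan`.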

# Barcelona La Mercè iteration

In [5]:
""" In this first block we are defining city and dates to scrape. We don't need to tweak anything here because the code automatically sets it for us
    based on the variables we defined at the beginning of the code. """

# Set the city here:
city = Treatment_city

# Set the dates here, following the 'yyyy-mm-dd' format:
date_initial = Treatment_initial_date
date_final = Treatment_final_date

In [6]:
""" We initialize the browser. """

browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

In [7]:
""" We wait for 'Google sign-in' to appear and we close it. Needed because on 13 inch laptop screens this pop-up always obscures the 'Search' button. """

wait = WebDriverWait(browser, 10)  # Wait for a maximum of 10 seconds

try:
    # Step 1: Locate the iframe and switch to it
    iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'iframe')))
    browser.switch_to.frame(iframe)

    # Step 2: Wait for the close button to be clickable in the iframe
    close_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="close"]')))
    close_button.click()

    # Step 3: Switch back to the main document
    browser.switch_to.default_content()

except Exception as e:
    print(f"An error occurred: {e}")

In [8]:
""" We also reject cookies because the pop-up obscures some dates that we need to click on."""

browser.find_element(By.XPATH, '//*[@id="onetrust-reject-all-handler"]').click()

In [9]:
""" We select the city. """

browser.find_element(by='xpath',value='//*[@id=":re:"]').click()
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(city)

In [10]:
""" We open the dates selector. """

css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'

browser.find_element('css selector',css).click()

In [11]:
""" We find out in which month is our initial date. """

target_month = datetime.strptime(date_initial, '%Y-%m-%d').strftime('%B')
print(target_month)

September


In [12]:
""" We click on 'next month' until the month of our initial date is visible, then we click once more because the final date might fall in the following month. """

month_path = '//*[@class="e1eebb6a1e ee7ec6b631"]'
visible_months = []

i = 12
while i > 0:
    temp_visible_months = browser.find_elements(By.XPATH, month_path)
    for m in temp_visible_months:
        visible_months.append(re.search(r'^\w+' , m.text).group())
    i -= 1
    print(visible_months)
    if target_month in visible_months:
        browser.find_element(By.XPATH, "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']").click()
        break
    else:
        browser.find_element(By.XPATH, "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']").click()
        visible_months = []

#print(visible_months)

['February', 'March']
['March', 'April']
['April', 'May']
['May', 'June']
['June', 'July']
['July', 'August']
['August', 'September']
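
The loop above compares month names only, so it does not distinguish between years. As a hedged alternative, the number of clicks can be computed directly when the first visible month is known. `clicks_needed` is a hypothetical helper, and the first visible month (February 2024, as in the output above) is passed in by hand here rather than read from the page.

```python
from datetime import date, datetime


def clicks_needed(first_visible_month, target_iso):
    """Clicks on 'next month' until the target month is the first visible one."""
    target = datetime.strptime(target_iso, '%Y-%m-%d').date()
    diff = (target.year - first_visible_month.year) * 12 \
        + (target.month - first_visible_month.month)
    # A target in the past needs no forward clicks.
    return max(diff, 0)


clicks = clicks_needed(date(2024, 2, 1), '2024-09-19')
```

For the run shown above this gives 7, which matches the seven lists printed by the loop (six clicks until September becomes visible, plus the extra one).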


In [13]:
""" We check the dates that are there on the calendar and we click on the initial/final date. """

path='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

dates = browser.find_elements('xpath',path)
for date in dates:
    print(date.get_attribute("data-date"))
    if date.get_attribute("data-date") ==  date_initial:
        date.click()
    if date.get_attribute("data-date") == date_final:
        date.click()
        break

2024-09-01
2024-09-02
2024-09-03
2024-09-04
2024-09-05
2024-09-06
2024-09-07
2024-09-08
2024-09-09
2024-09-10
2024-09-11
2024-09-12
2024-09-13
2024-09-14
2024-09-15
2024-09-16
2024-09-17
2024-09-18
2024-09-19
2024-09-20
2024-09-21
2024-09-22
2024-09-23
2024-09-24
2024-09-25


In [14]:
""" We click on 'search' button. """

BUSCAR_xpath="//div[@class='e22b782521 d12ff5f5bf']//button[@class='a83ed08757 c21c56c305 a4c1805887 f671049264 d2529514af c082d89982 cceeb8986b' and @type='submit']"

check_and_click(browser,BUSCAR_xpath , type='xpath')

In [15]:
""" We execute the pop-up rejector for the 'Genius' pop-up. """

POP_UP_REJECTOR_1000_MK3(browser, 10)

In [16]:
""" We select the filters we want to use. """

time.sleep(2)
#wait.until(EC.presence_of_element_located((By.XPATH, "//label[@class='aca0ade214 aaf30230d9 c2931f4182 d79e71457a bd597ff2d8' and @for=':rv:']//span[@class='ef785aa7f4']")))
filters_selector(browser, filters)

In [17]:
""" We get the number of pages of the search. """

time.sleep(5)
#wait.until(EC.presence_of_element_located((By.XPATH, "//button[@class='a83ed08757 a2028338ea' and @type='button']")))
search_num_pages = get_num_pages(browser)
print(search_num_pages)

16


In [18]:
""" We scrape all the data from all the pages of the search we just set up. """

for page in range(search_num_pages-1):
    single_page_df = PAGE_SCRAPER_2000mk5(browser)
    df = pd.concat([df, single_page_df], ignore_index=True)
    browser.find_element(By.XPATH, "//button[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 ab98298258 deab83296e bb803d8689 a16ddf9c57' and @aria-label='Next page' and @type='button']").click()
    POP_UP_REJECTOR_1000_MK3(browser, 5)

single_page_df = PAGE_SCRAPER_2000mk5(browser)
df = pd.concat([df, single_page_df], ignore_index=True) 


browser_closer(browser)
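
The pagination logic above (scrape every page, click 'Next page' after all but the last) can be written without repeating the scraping call. The sketch below uses hypothetical callbacks in place of the browser: `scrape_page` stands in for `PAGE_SCRAPER_2000mk5(browser)` and `go_to_next_page` for the click on the 'Next page' button.

```python
def scrape_all_pages(num_pages, scrape_page, go_to_next_page):
    """Collect the rows of every page, moving forward after all pages but the last."""
    rows = []
    for page in range(num_pages):
        rows.extend(scrape_page())
        if page < num_pages - 1:
            go_to_next_page()
    return rows


# Fake three-page site used only to exercise the function.
state = {'page': 0, 'clicks': 0}
fake_pages = [['hotel 1', 'hotel 2'], ['hotel 3'], ['hotel 4', 'hotel 5']]


def fake_scrape():
    return fake_pages[state['page']]


def fake_next():
    state['page'] += 1
    state['clicks'] += 1


all_rows = scrape_all_pages(len(fake_pages), fake_scrape, fake_next)
```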



# Barcelona NOT La Mercè iteration

#### Each of the following sections is just an iteration of the same code for a different combination of City and Week.
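
The four City and Week combinations, together with the two indicator columns, can be enumerated as shown below. This is only an illustrative sketch: `build_searches` is a hypothetical helper, while the cities and dates are the ones set in the configuration cell. The notebook itself repeats the cells by hand instead of looping.

```python
from itertools import product

treatment_city, control_city = 'Barcelona', 'Naples Italy'
treatment_week = ('2024-09-19', '2024-09-25')
control_week = ('2024-09-26', '2024-10-02')


def build_searches():
    """Return one dict per search, in the same order as the notebook sections."""
    searches = []
    for city, week in product([treatment_city, control_city],
                              [treatment_week, control_week]):
        searches.append({
            'city': city,
            'date_initial': week[0],
            'date_final': week[1],
            'is_Barna': int(city == treatment_city),
            'is_Merce_time': int(week == treatment_week),
        })
    return searches


searches = build_searches()
for search in searches:
    print(search)
```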

In [19]:
# Set the city here:
city = Treatment_city

# Set the dates here, following the 'yyyy-mm-dd' format:
date_initial = Control_initial_date
date_final = Control_final_date

In [20]:
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

wait = WebDriverWait(browser, 10)  # Wait for a maximum of 10 seconds



try:
    # Step 1: Locate the iframe and switch to it
    iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'iframe')))
    browser.switch_to.frame(iframe)

    # Step 2: Wait for the close button to be clickable in the iframe
    close_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="close"]')))
    close_button.click()

    # Step 3: Switch back to the main document
    browser.switch_to.default_content()

except Exception as e:
    print(f"An error occurred: {e}")



browser.find_element(By.XPATH, '//*[@id="onetrust-reject-all-handler"]').click()



browser.find_element(by='xpath',value='//*[@id=":re:"]').click()
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(city)



css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'

browser.find_element('css selector',css).click()



target_month = datetime.strptime(date_initial, '%Y-%m-%d').strftime('%B')
print(target_month)



month_path = '//*[@class="e1eebb6a1e ee7ec6b631"]'
visible_months = []

i = 12
while i > 0:
    temp_visible_months = browser.find_elements(By.XPATH, month_path)
    for m in temp_visible_months:
        visible_months.append(re.search(r'^\w+' , m.text).group())
    i -= 1
    print(visible_months)
    if target_month in visible_months:
        browser.find_element(By.XPATH, "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']").click()
        break
    else:
        browser.find_element(By.XPATH, "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']").click()
        visible_months = []

#print(visible_months)



path='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

dates = browser.find_elements('xpath',path)
for date in dates:
    print(date.get_attribute("data-date"))
    if date.get_attribute("data-date") ==  date_initial:
        date.click()
    if date.get_attribute("data-date") == date_final:
        date.click()
        break



BUSCAR_xpath="//div[@class='e22b782521 d12ff5f5bf']//button[@class='a83ed08757 c21c56c305 a4c1805887 f671049264 d2529514af c082d89982 cceeb8986b' and @type='submit']"

check_and_click(browser,BUSCAR_xpath , type='xpath')



POP_UP_REJECTOR_1000_MK3(browser, 10)





time.sleep(2)
#wait.until(EC.presence_of_element_located((By.XPATH, "//label[@class='aca0ade214 aaf30230d9 c2931f4182 d79e71457a bd597ff2d8' and @for=':rv:']//span[@class='ef785aa7f4']")))
filters_selector(browser, filters)



time.sleep(5)
#wait.until(EC.presence_of_element_located((By.XPATH, "//button[@class='a83ed08757 a2028338ea' and @type='button']")))
search_num_pages = get_num_pages(browser)
print(search_num_pages)




for page in range(search_num_pages-1):
    single_page_df = PAGE_SCRAPER_2000mk5(browser)
    df = pd.concat([df, single_page_df], ignore_index=True)
    browser.find_element(By.XPATH, "//button[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 ab98298258 deab83296e bb803d8689 a16ddf9c57' and @aria-label='Next page' and @type='button']").click()
    POP_UP_REJECTOR_1000_MK3(browser, 5)

single_page_df = PAGE_SCRAPER_2000mk5(browser)
df = pd.concat([df, single_page_df], ignore_index=True) 


browser_closer(browser)

September
['February', 'March']
['March', 'April']
['April', 'May']
['May', 'June']
['June', 'July']
['July', 'August']
['August', 'September']
2024-09-01
2024-09-02
2024-09-03
2024-09-04
2024-09-05
2024-09-06
2024-09-07
2024-09-08
2024-09-09
2024-09-10
2024-09-11
2024-09-12
2024-09-13
2024-09-14
2024-09-15
2024-09-16
2024-09-17
2024-09-18
2024-09-19
2024-09-20
2024-09-21
2024-09-22
2024-09-23
2024-09-24
2024-09-25
2024-09-26
2024-09-27
2024-09-28
2024-09-29
2024-09-30
2024-10-01
2024-10-02
17


# Napoli La Mercè iteration

In [21]:
# Set the city here:
city = Control_city

# Set the dates here, following the 'yyyy-mm-dd' format:
date_initial = Treatment_initial_date
date_final = Treatment_final_date

In [22]:
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

wait = WebDriverWait(browser, 10)  # Wait for a maximum of 10 seconds



try:
    # Step 1: Locate the iframe and switch to it
    iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'iframe')))
    browser.switch_to.frame(iframe)

    # Step 2: Wait for the close button to be clickable in the iframe
    close_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="close"]')))
    close_button.click()

    # Step 3: Switch back to the main document
    browser.switch_to.default_content()

except Exception as e:
    print(f"An error occurred: {e}")



browser.find_element(By.XPATH, '//*[@id="onetrust-reject-all-handler"]').click()



browser.find_element(by='xpath',value='//*[@id=":re:"]').click()
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(city)



css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'

browser.find_element('css selector',css).click()



target_month = datetime.strptime(date_initial, '%Y-%m-%d').strftime('%B')
print(target_month)



month_path = '//*[@class="e1eebb6a1e ee7ec6b631"]'
next_month_xpath = "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']"

# Page the calendar forward, at most 12 times, until the target month has been seen.
for _ in range(12):
    # Keep only the first word (the month name) of each visible month label.
    visible_months = [re.search(r'^\w+', m.text).group()
                      for m in browser.find_elements(By.XPATH, month_path)]
    print(visible_months)
    # The same button is clicked whether or not the target month is visible yet.
    browser.find_element(By.XPATH, next_month_xpath).click()
    if target_month in visible_months:
        break

#print(visible_months)



path='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

dates = browser.find_elements('xpath',path)
for date in dates:
    print(date.get_attribute("data-date"))
    if date.get_attribute("data-date") ==  date_initial:
        date.click()
    if date.get_attribute("data-date") == date_final:
        date.click()
        break



BUSCAR_xpath="//div[@class='e22b782521 d12ff5f5bf']//button[@class='a83ed08757 c21c56c305 a4c1805887 f671049264 d2529514af c082d89982 cceeb8986b' and @type='submit']"

check_and_click(browser,BUSCAR_xpath , type='xpath')



POP_UP_REJECTOR_1000_MK3(browser, 10)




time.sleep(2)
#wait.until(EC.presence_of_element_located((By.XPATH, "//label[@class='aca0ade214 aaf30230d9 c2931f4182 d79e71457a bd597ff2d8' and @for=':rv:']//span[@class='ef785aa7f4']")))
filters_selector(browser, filters)



time.sleep(5)
#wait.until(EC.presence_of_element_located((By.XPATH, "//button[@class='a83ed08757 a2028338ea' and @type='button']")))
search_num_pages = get_num_pages(browser)
print(search_num_pages)



for page in range(search_num_pages-1):
    single_page_df = PAGE_SCRAPER_2000mk5(browser)
    df = pd.concat([df, single_page_df], ignore_index=True)
    browser.find_element(By.XPATH, "//button[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 ab98298258 deab83296e bb803d8689 a16ddf9c57' and @aria-label='Next page' and @type='button']").click()
    POP_UP_REJECTOR_1000_MK3(browser, 5)


single_page_df = PAGE_SCRAPER_2000mk5(browser)
df = pd.concat([df, single_page_df], ignore_index=True) 


browser_closer(browser)

September
['February', 'March']
['March', 'April']
['April', 'May']
['May', 'June']
['June', 'July']
['July', 'August']
['August', 'September']
2024-09-01
2024-09-02
2024-09-03
2024-09-04
2024-09-05
2024-09-06
2024-09-07
2024-09-08
2024-09-09
2024-09-10
2024-09-11
2024-09-12
2024-09-13
2024-09-14
2024-09-15
2024-09-16
2024-09-17
2024-09-18
2024-09-19
2024-09-20
2024-09-21
2024-09-22
2024-09-23
2024-09-24
2024-09-25
7
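
The list of month pairs printed above comes from the calendar loop, which reads the two visible month labels, keeps the first word of each, and pages forward until the target month shows up. The sketch below reproduces that logic without a browser. The function names and the `"<Month> 2024"` label text are assumptions made for illustration; the real labels come from Booking.com's date picker.

```python
import calendar
import re

def leading_word(label_text):
    """Return the first word of a label such as 'September 2024', or None."""
    match = re.search(r'^\w+', label_text)
    return match.group() if match else None

def windows_until(target_month, first_month_index, max_steps=12):
    """Simulate paging a two-month calendar forward until target_month is visible.

    first_month_index is 1-based (1 = January). Returns every window inspected.
    """
    seen = []
    for step in range(max_steps):
        left = (first_month_index + step - 1) % 12 + 1
        right = left % 12 + 1
        labels = [f"{calendar.month_name[left]} 2024",
                  f"{calendar.month_name[right]} 2024"]
        visible = [leading_word(label) for label in labels]
        seen.append(visible)
        if target_month in visible:
            break
    return seen

windows = windows_until("September", first_month_index=2)
print(len(windows))  # 7
print(windows[-1])   # ['August', 'September']
```

Returning `None` for an empty label avoids the `AttributeError` that `re.search(...).group()` raises when nothing matches, which could happen if a label has not rendered yet.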


# Napoli NOT La Mercè iteration

In [23]:
# Set the city here:
city = Control_city

# Set the dates here, using the 'yyyy-mm-dd' format:
date_initial = Control_initial_date
date_final = Control_final_date

In [24]:
browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)
browser.get(link)

wait = WebDriverWait(browser, 10)  # Wait for a maximum of 10 seconds



try:
    # Step 1: Locate the iframe and switch to it
    iframe = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'iframe')))
    browser.switch_to.frame(iframe)

    # Step 2: Wait for the close button to be clickable in the iframe
    close_button = wait.until(EC.element_to_be_clickable((By.XPATH, '//*[@id="close"]')))
    close_button.click()

    # Step 3: Switch back to the main document
    browser.switch_to.default_content()

except Exception as e:
    print(f"An error occurred: {e}")



browser.find_element(By.XPATH, '//*[@id="onetrust-reject-all-handler"]').click()



browser.find_element(by='xpath',value='//*[@id=":re:"]').click()
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(city)



css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'

browser.find_element('css selector',css).click()



target_month = datetime.strptime(date_initial, '%Y-%m-%d').strftime('%B')
print(target_month)



month_path = '//*[@class="e1eebb6a1e ee7ec6b631"]'
next_month_xpath = "//*[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358']"

# Page the calendar forward, at most 12 times, until the target month has been seen.
for _ in range(12):
    # Keep only the first word (the month name) of each visible month label.
    visible_months = [re.search(r'^\w+', m.text).group()
                      for m in browser.find_elements(By.XPATH, month_path)]
    print(visible_months)
    # The same button is clicked whether or not the target month is visible yet.
    browser.find_element(By.XPATH, next_month_xpath).click()
    if target_month in visible_months:
        break

#print(visible_months)



path='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

dates = browser.find_elements('xpath',path)
for date in dates:
    print(date.get_attribute("data-date"))
    if date.get_attribute("data-date") ==  date_initial:
        date.click()
    if date.get_attribute("data-date") == date_final:
        date.click()
        break



BUSCAR_xpath="//div[@class='e22b782521 d12ff5f5bf']//button[@class='a83ed08757 c21c56c305 a4c1805887 f671049264 d2529514af c082d89982 cceeb8986b' and @type='submit']"

check_and_click(browser,BUSCAR_xpath , type='xpath')



POP_UP_REJECTOR_1000_MK3(browser, 10)




time.sleep(2)
#wait.until(EC.presence_of_element_located((By.XPATH, "//label[@class='aca0ade214 aaf30230d9 c2931f4182 d79e71457a bd597ff2d8' and @for=':rv:']//span[@class='ef785aa7f4']")))
filters_selector(browser, filters)



time.sleep(5)
#wait.until(EC.presence_of_element_located((By.XPATH, "//button[@class='a83ed08757 a2028338ea' and @type='button']")))
search_num_pages = get_num_pages(browser)
print(search_num_pages)



for page in range(search_num_pages-1):
    single_page_df = PAGE_SCRAPER_2000mk5(browser)
    df = pd.concat([df, single_page_df], ignore_index=True)
    browser.find_element(By.XPATH, "//button[@class='a83ed08757 c21c56c305 f38b6daa18 d691166b09 ab98298258 deab83296e bb803d8689 a16ddf9c57' and @aria-label='Next page' and @type='button']").click()
    POP_UP_REJECTOR_1000_MK3(browser, 5)


single_page_df = PAGE_SCRAPER_2000mk5(browser)
df = pd.concat([df, single_page_df], ignore_index=True) 



browser_closer(browser)

September
['February', 'March']
['March', 'April']
['April', 'May']
['May', 'June']
['June', 'July']
['July', 'August']
['August', 'September']
2024-09-01
2024-09-02
2024-09-03
2024-09-04
2024-09-05
2024-09-06
2024-09-07
2024-09-08
2024-09-09
2024-09-10
2024-09-11
2024-09-12
2024-09-13
2024-09-14
2024-09-15
2024-09-16
2024-09-17
2024-09-18
2024-09-19
2024-09-20
2024-09-21
2024-09-22
2024-09-23
2024-09-24
2024-09-25
2024-09-26
2024-09-27
2024-09-28
2024-09-29
2024-09-30
2024-10-01
2024-10-02
7
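
The pagination loop in the cell above scrapes a page and clicks "Next page" `search_num_pages-1` times, then scrapes the last page separately because it has no next button. A browser-free sketch of that pattern follows. `scrape_all_pages` and the nested lists standing in for result pages are hypothetical.

```python
def scrape_all_pages(pages):
    """Collect rows from every page; 'pages' stands in for the live result pages."""
    collected = []
    current = 0
    for _ in range(len(pages) - 1):
        collected.extend(pages[current])  # scrape the page currently shown
        current += 1                      # equivalent of clicking "Next page"
    collected.extend(pages[current])      # last page: scrape, but do not click
    return collected

rows = scrape_all_pages([["hotel 1", "hotel 2"], ["hotel 3"], ["hotel 4", "hotel 5"]])
print(rows)  # ['hotel 1', 'hotel 2', 'hotel 3', 'hotel 4', 'hotel 5']
```

With DataFrames, the same idea would be to append each `single_page_df` to a list and call `pd.concat` once at the end, which avoids copying the growing `df` on every page.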


# Descriptions scraping and export to csv

In [25]:
""" We check the final df. """

display(df)
# distance is in meters
# price is in euros
# rating is on a scale from 0 to 10

Unnamed: 0,name,rating,num_reviews,neighborhood,dist_from_center,price,other,is_Barna,is_Merce_time,short_description,link
0,Room Mate Gerard,8.8,2783,Eixample,700,1730,"Subway Access, Beach Nearby",1,1,Junior Suite\nPrivate suite • 1 bedroom • 1 li...,https://www.booking.com/hotel/es/room-mate-ger...
1,Sonder Los Arcos,8.4,149,Ciutat Vella,1000,1530,"Subway Access, Beach Nearby",1,1,Queen Room with Two Queen Beds\n2 queen beds\n...,https://www.booking.com/hotel/es/sonder-los-ar...
2,Occidental Barcelona 1929,8.9,4066,Montjuïc,2300,909,Subway Access,1,1,Superior Double Room\nBeds: 1 double or 2 twin...,https://www.booking.com/hotel/es/ona-hotels-te...
3,Hotel Alimara,8.3,3892,Guinardó,5500,749,Subway Access,1,1,Double Room\n1 queen bed\nFree cancellation,https://www.booking.com/hotel/es/hotelalimara....
4,Weflating City Center,8.8,1573,Eixample,500,978,"Subway Access, Beach Nearby",1,1,Economy Double Room\n2 bunk beds\nFree cancell...,https://www.booking.com/hotel/es/weflating-cit...
...,...,...,...,...,...,...,...,...,...,...,...
1148,Hotel Martini,6.9,607.0,Capodichino,6300,432,,0,0,Double Room\n1 queen bed\nBreakfast included,https://www.booking.com/hotel/it/martini-secon...
1149,Eurostars Hotel Excelsior,8.6,3221.0,Lungomare Caracciolo,1500,3199,,0,0,Classic Double Room\n1 king bed,https://www.booking.com/hotel/it/eurostars-hot...
1150,Hotel Barbato,5.8,110.0,Capodichino,5800,459,,0,0,"Standard Triple Room\n2 beds (1 twin, 1 full)\...",https://www.booking.com/hotel/it/barbato.html?...
1151,Hotel Serena,7.9,18.0,,5500,528,,0,0,Double Room\n1 queen bed\nBreakfast included,https://www.booking.com/hotel/it/serena-napoli...


In [26]:
""" We scrape the individual long descriptions from each hotel link. """

num_threads = 8

# Create a ThreadPoolExecutor to run operations in parallel
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    # Use executor.map to apply the scrape_description function to each URL in parallel
    descriptions = list(tqdm(executor.map(scrape_description, df['link']), total = len(df)))

# Adding descriptions to the DataFrame
df['descriptions'] = descriptions

# Converting every entry of the descriptions column to a string
df['descriptions'] = df['descriptions'].apply(str)

100%|██████████| 1153/1153 [06:14<00:00,  3.07it/s]
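
Assigning the list returned by `executor.map` straight back to `df['descriptions']` relies on the results arriving in the same order as `df['link']`. `executor.map` guarantees this, even when later tasks finish first. A minimal sketch of that behaviour is below, where `fake_scrape` is a hypothetical stand-in for `scrape_description` and the short URLs are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical delays: the first URL is the slowest, so it finishes last.
DELAYS = {"url-a": 0.03, "url-b": 0.02, "url-c": 0.01}

def fake_scrape(url):
    """Stand-in for scrape_description: sleeps, then returns a fake description."""
    time.sleep(DELAYS[url])
    return f"description of {url}"

urls = ["url-a", "url-b", "url-c"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Results are yielded in input order, not in completion order.
    results = list(executor.map(fake_scrape, urls))

print(results)
# ['description of url-a', 'description of url-b', 'description of url-c']
```

One caveat: if any call raises, `executor.map` re-raises that exception while the results are being consumed, so the scraping function should handle its own errors if a single failing hotel page is not supposed to stop the whole run.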


In [27]:
""" We check again, we should see the descriptions added now. """

display(df)

Unnamed: 0,name,rating,num_reviews,neighborhood,dist_from_center,price,other,is_Barna,is_Merce_time,short_description,link,descriptions
0,Room Mate Gerard,8.8,2783,Eixample,700,1730,"Subway Access, Beach Nearby",1,1,Junior Suite\nPrivate suite • 1 bedroom • 1 li...,https://www.booking.com/hotel/es/room-mate-ger...,You're eligible for a Genius discount at Room ...
1,Sonder Los Arcos,8.4,149,Ciutat Vella,1000,1530,"Subway Access, Beach Nearby",1,1,Queen Room with Two Queen Beds\n2 queen beds\n...,https://www.booking.com/hotel/es/sonder-los-ar...,Ideally located in the Ciutat Vella district o...
2,Occidental Barcelona 1929,8.9,4066,Montjuïc,2300,909,Subway Access,1,1,Superior Double Room\nBeds: 1 double or 2 twin...,https://www.booking.com/hotel/es/ona-hotels-te...,You're eligible for a Genius discount at Occid...
3,Hotel Alimara,8.3,3892,Guinardó,5500,749,Subway Access,1,1,Double Room\n1 queen bed\nFree cancellation,https://www.booking.com/hotel/es/hotelalimara....,You're eligible for a Genius discount at Hotel...
4,Weflating City Center,8.8,1573,Eixample,500,978,"Subway Access, Beach Nearby",1,1,Economy Double Room\n2 bunk beds\nFree cancell...,https://www.booking.com/hotel/es/weflating-cit...,You're eligible for a Genius discount at Wefla...
...,...,...,...,...,...,...,...,...,...,...,...,...
1148,Hotel Martini,6.9,607.0,Capodichino,6300,432,,0,0,Double Room\n1 queen bed\nBreakfast included,https://www.booking.com/hotel/it/martini-secon...,"Set in Casavatore, 3.7 mi from Naples, Hotel M..."
1149,Eurostars Hotel Excelsior,8.6,3221.0,Lungomare Caracciolo,1500,3199,,0,0,Classic Double Room\n1 king bed,https://www.booking.com/hotel/it/eurostars-hot...,You're eligible for a Genius discount at Euros...
1150,Hotel Barbato,5.8,110.0,Capodichino,5800,459,,0,0,"Standard Triple Room\n2 beds (1 twin, 1 full)\...",https://www.booking.com/hotel/it/barbato.html?...,"Set 2.5 mi from Naples Capodichino Airport, Ho..."
1151,Hotel Serena,7.9,18.0,,5500,528,,0,0,Double Room\n1 queen bed\nBreakfast included,https://www.booking.com/hotel/it/serena-napoli...,You're eligible for a Genius discount at Hotel...


In [28]:
""" We drop the 'link' column as it is no longer needed. """

df.drop('link', axis = 1, inplace = True)

In [29]:
""" We export the DataFrame to a csv file. """

df.to_csv(f'{name_csv}.csv', index = False)
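
Since the exported file is meant to be loaded again in later parts, a round trip through a small sample frame shows what `index = False` does. This is a minimal sketch: the sample data is invented, and an in-memory buffer replaces the real file.

```python
import io
import pandas as pd

# Invented sample rows, only to illustrate the round trip.
sample = pd.DataFrame({"name": ["Hotel A", "Hotel B"], "price": [120, 95]})

# Written without the index, as in the export cell.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Written with the default index=True for comparison.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(list(reloaded.columns))             # ['name', 'price']
print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'name', 'price']
```

Without `index = False`, the row index is written as an unnamed first column and comes back as `Unnamed: 0` when the file is read.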