 ![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Introduction to Text Mining and Natural Language Processing
## Project 1 Booking 

by Natalia Beltrán, Harry Morley, Xi Cheng 

### 1. Future Event 

This mini research project investigates the effect that the SONAR Festival has on rental prices in Barcelona. The SONAR Festival is a prominent music festival that is held annually in June and attracts a large influx of visitors. The festival takes place during the week of the 12th to the 16th of June. 

### 2. Period & Second City Selection

The chosen time periods for data collection are the weeks of the 5th to 9th and 12th to 16th of June. This selection allows a comparison of the week before the SONAR Festival in Barcelona with the week of the festival itself. By contrasting these two weeks, we aim to capture any shift in rental prices that could be associated with the event. Alicante was selected as the second city for its comparable characteristics to Barcelona. Both cities share a coastal setting, vibrant cultural scenes, and historical attractions, offering a similar appeal to tourists. By examining rental prices in Alicante during the same time frame, we aim to discern whether the trends observed in Barcelona are event-specific or part of a broader pattern in cities with similar profiles. 
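
The two-week, two-city design amounts to a simple difference-in-differences comparison: the change in Barcelona's average price between the two weeks, minus the same change in Alicante. The following is a minimal sketch of that calculation, assuming the scraped prices have already been converted to numbers. The helper name `did_estimate` and the figures are made-up placeholders for illustration, not scraped results.

```python
from statistics import mean

def did_estimate(prices):
    """Difference-in-differences on mean prices.

    `prices` maps (group, period) -> list of numeric prices, where group is
    'treated' or 'control' and period is 'before' or 'during'.
    """
    change_treated = mean(prices[("treated", "during")]) - mean(prices[("treated", "before")])
    change_control = mean(prices[("control", "during")]) - mean(prices[("control", "before")])
    return change_treated - change_control

# Placeholder numbers for illustration only (NOT scraped data)
toy_prices = {
    ("treated", "before"): [100, 120, 140],  # Barcelona, 5-9 June
    ("treated", "during"): [130, 150, 170],  # Barcelona, 12-16 June
    ("control", "before"): [80, 90, 100],    # Alicante, 5-9 June
    ("control", "during"): [85, 95, 105],    # Alicante, 12-16 June
}

effect = did_estimate(toy_prices)
print(effect)
```

A positive value would suggest that prices rose more in Barcelona than in Alicante over the same period, which is the pattern one would expect if the festival pushes prices up.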

### 3. Pipeline

Our research relies on a well-structured pipeline to systematically collect data on the rental prices during significant events. This automated framework ensures efficiency and reliability in extracting information from the Booking platform. In this section we detail the key steps of the pipeline, specifically focusing on Barcelona during the weeks before and during the SONAR Festival. 

- Initialize:
    - Initialize the web scraping pipeline by launching the chosen browser (Firefox) through Selenium. 
- Open the Booking website in the browser 
- Handling Cookies: 
    - Dismiss the cookie pop-up that may appear when the page loads. 
- Selecting Travel Destination: 
    - Navigate to the desired travel section on the booking platform and input the desired city for analysis. 
- Date Selection: 
    - Open the calendar section to select the desired date range for analysis. This includes functions to interact with the calendar, allowing for the specification of both start and end dates. 
- Initiate the hotel search process: 
    - Prompts the platform to fetch relevant accommodation options. This marks the beginning of the data retrieval phase. 
- Hotel Pages: 
    - Count the total number of result pages so that the pipeline adapts to searches of varying length and collects every listing. 
- Load XPaths: 
    - Load the XPaths needed to identify and interact with specific elements of the web page, allowing accurate data extraction from the hotel listings. 
- Hotel Data Extraction: 
    - Retrieve information such as hotel names, ratings, and prices. The Booking page was broken down into different sections for each hotel listing in order to more seamlessly and efficiently extract the hotel information from all the relevant pages. 
- Description Extraction: 
    - Extract detailed descriptions of each hotel. 
- Save as CSV: 
    - Combine extracted hotel data and descriptions, and save the consolidated information in a CSV file. 

### 4. Scrape date, room price, hotel name, and hotel description

Imports 

In [None]:
import json
import os
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
import requests
import statsmodels.api as sm
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException, StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


# Headers sent with requests.get() in description_info below
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
 
# Go get geckodriver from : https://github.com/mozilla/geckodriver/releases

Browsing Functions

In [None]:
def ffx_preferences(dfolder, download=False):
    '''
    Sets the preferences of the firefox browser: download path.
    '''
    profile = webdriver.FirefoxProfile()
    # set download folder:
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                           "application/msword,application/rtf, application/csv,text/csv,image/png ,image/jpeg, application/pdf, text/html,text/plain,application/octet-stream")
    

    # this allows to download pdfs automatically
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options


def start_up(link, dfolder, geko_path, download=True):
    '''
    Starts Firefox with the preferences above and opens the given link.
    '''
    os.makedirs(dfolder, exist_ok=True)

    options = ffx_preferences(dfolder, download)
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    # Enter the website address here
    browser.get(link)
    time.sleep(5)  # Adjust sleep time as needed
    return browser


def check_and_click(browser, xpath, type):
    '''
    Function that checks whether the object is clickable and, if so, clicks on
    it. If not, waits one second and tries again, giving up after 15 attempts.
    Returns True if the click succeeded and False otherwise.
    '''
    for _ in range(15):
        if check_obscures(browser, xpath, type):
            return True
        time.sleep(1)
    return False
            

def check_obscures(browser, xpath, type):
    '''
    Function that checks whether the object is being "obscured" by any element so
    that it is not clickable. Important: if True, the object is going to be clicked!
    '''
    try:
        if type == "xpath":
            browser.find_element('xpath',xpath).click()
        elif type == "id":
            browser.find_element('id',xpath).click()
        elif type == "css":
            browser.find_element('css selector',xpath).click()
        elif type == "class":
            browser.find_element('class name',xpath).click()
        elif type == "link":
            browser.find_element('link text',xpath).click()
    except (ElementClickInterceptedException, NoSuchElementException, StaleElementReferenceException) as e:
        print(e)
        return False
    return True
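
The chain of `if`/`elif` branches in `check_obscures` can also be written as a lookup table. The sketch below is one possible variant, not the version used in this notebook: `LOCATORS` and `click_by` are hypothetical names, and the strategy strings are the same ones passed to `find_element` above. Unlike the original, it raises an error for an unknown locator type instead of silently reporting success.

```python
# Map the short names used by check_obscures to Selenium locator strategies.
LOCATORS = {
    "xpath": "xpath",
    "id": "id",
    "css": "css selector",
    "class": "class name",
    "link": "link text",
}

def click_by(browser, value, kind):
    """Click the element located by `value` using the strategy named `kind`."""
    try:
        strategy = LOCATORS[kind]
    except KeyError:
        raise ValueError(f"Unknown locator type: {kind!r}") from None
    # browser is expected to be a Selenium WebDriver (or any object with find_element)
    browser.find_element(strategy, value).click()
```

For example, `click_by(browser, x_path_cookies, 'xpath')` would perform a single click attempt; the retry logic of `check_and_click` would still need to wrap it.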

Open browser

In [None]:
# Opens the Booking.com website 

dfolder='/Users/nataliabeltran/Downloads/BSE_courses/Term_2_courses/22DM014_Introduction_to_Text_Mining_NLP/ta_session_text_mining/booking_assignment'
geko_path='/Users/nataliabeltran/Downloads/geckodriver'
link='https://www.booking.com/index.es.html'


browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)

Reject the cookies 

In [None]:
# Rejects the cookies after logging into the Booking website
x_path_cookies = '//button[@id="onetrust-reject-all-handler"]'
check_and_click(browser, x_path_cookies, 'xpath')

City Selection

In [None]:
# Opens the city to travel section
browser.find_element(by='xpath',value='//input[@id=":re:"]').click()

In [None]:
# allows us to input a city to scrape hotels for 
place = input('Where do you want to go?')
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(place)

Select dates

In [None]:
# Opens the calendar section 
css='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'
browser.find_element('css selector',css).click()

In [None]:
# Gets all the dates of the calendar
path='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'

dates = browser.find_elements('xpath',path)


for date in dates:
    print(date.get_attribute("data-date"))

In [None]:
def click_element(browser, xpath, num_clicks=1):
    """
    Clicks the arrow element that is identified by the given XPath multiple times.

    Parameters:
    - browser (WebDriver): The browser instance to interact with.
    - xpath (str): XPath of the element to click (arrow).
    - num_clicks (int): Number of times to click the element. Default is 1.
    """
    
    for _ in range(num_clicks):
        browser.find_element('xpath', xpath).click()

def select_month(browser, xpath_button, num_month):
    """
    Selects a month by clicking the arrow element identified by the given XPath a certain amount of times.

    Parameters:
    - browser (WebDriver): The browser instance to interact with.
    - xpath_button (str): XPath of the button to select the month.
    - num_month (int): Number of months from the current month to navigate to get to desired month. 
    
    """

    for _ in range(num_month):
        browser.find_element('xpath', xpath_button).click()

def select_date_range(browser, from_day, to_day):
    """
    Selects a date range on a calendar.

    Parameters:
    - browser: The browser instance to interact with.
    - from_day (str): The start day of the date range in 'DD' format.
    - to_day (str): The end day of the date range in 'DD' format.
    """

    path = '//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'
    dates = browser.find_elements('xpath', path)

    for date in dates:
        date_value = date.get_attribute("data-date")
        if date_value == f"2024-06-{from_day}":
            date.click()
        if date_value == f"2024-06-{to_day}":
            date.click()
            break

def calendar_days(browser, from_day, to_day):
    """
    Main function to select desired days of the calendar.

    Parameters:
    - browser : The browser instance to interact with.
    - from_day (str): The start day of the date range in 'DD' format. 
    - to_day (str): The end day of the date range in 'DD' format. 
    """

    num_clicks = 1
    arrow_xpath = '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button/span/span'
    click_element(browser, arrow_xpath, num_clicks)

    # change num_month depending on how far the month is from today
    num_month = 3 
    month_xpath = '/html/body/div[3]/div[2]/div/form/div[1]/div[2]/div/div[2]/div/nav/div[2]/div/div[1]/button[2]'
    select_month(browser, month_xpath, num_month)

    select_date_range(browser, from_day, to_day)

"""
Example:
calendar_days(browser, from_day='15', to_day='20') 
"""


In [None]:
# Run the main calendar days function
calendar_days(browser, '12', '16') 
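
The functions above hard-code the month offset (`num_month = 3`) and the `2024-06-` prefix of the `data-date` attribute. As a rough sketch, both values could be derived from the target dates with the standard library. `months_ahead` and `data_date` are hypothetical helpers, and the number of clicks the calendar actually needs may differ from the month offset depending on how many months the widget displays at once.

```python
from datetime import date

def months_ahead(today, target):
    """Number of month steps from today's month to the target's month."""
    return (target.year - today.year) * 12 + (target.month - today.month)

def data_date(day):
    """Format a date as YYYY-MM-DD, the form compared against data-date above."""
    return day.isoformat()

check_in = date(2024, 6, 12)
check_out = date(2024, 6, 16)

# Scraping in March 2024 for June 2024 gives an offset of three months
offset = months_ahead(date(2024, 3, 1), check_in)
print(offset, data_date(check_in), data_date(check_out))
```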

Search for hotels

In [None]:
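# XPath of the search button
my_xpath='/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'

# check_and_click retries until the button is clickable; a separate
# check_obscures call is not needed because it would click the button as well
check_and_click(browser, my_xpath, type='xpath')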

Count Pages to click through

In [None]:
def get_number_pages(browser):
    '''
    Get the number of result pages from the pagination bar. 
    '''
    a = browser.find_elements('xpath',
        '//div[@class="ab95b25344"]')
    # The last line of the pagination text is the last page number
    return int(a[0].text.split("\n")[-1])


pages = get_number_pages(browser)

print(pages)
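
`get_number_pages` takes the last line of the pagination text as the page count. A slightly more defensive version of that parsing step is sketched below, assuming the pagination text is a newline-separated list of page labels in which some entries (such as an ellipsis) are not numbers. `parse_last_page` is a hypothetical helper that works on plain text, so it can be checked without a browser.

```python
def parse_last_page(pagination_text):
    """Return the last numeric entry of newline-separated pagination text."""
    for token in reversed(pagination_text.split("\n")):
        token = token.strip()
        if token.isdigit():
            return int(token)
    # No number found: assume a single page of results
    return 1
```

With this helper, the text of the pagination element would be passed in as `parse_last_page(a[0].text)`.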


XPaths to search through the website

In [None]:
# CSS selector for the button that switches to the next page
css_pages = 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)'

# XPaths: one section per hotel listing, then paths relative to each section
sections = browser.find_elements('xpath', '//div[@class="c066246e13"]')
hotels_xpath = './/div[@class="f6431b446c a15b38c233"]'
ratings_xpath = './/div[@class="a3b8729ab1 d86cee9b25"]'
prices_xpath = './/span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]'

EXTRACT HOTEL NAMES, RATINGS, & PRICE

In [None]:
def hotel_information_scraping(browser, hotels_xpath, ratings_xpath, prices_xpath, css_pages, pages): 
    """ 
    Scrapes hotel information from a series of pages using Selenium. 

    Parameters: 
        - browser (Webdriver): The Selenium Webdriver instance. 
        - hotels_xpath (str): XPath to locate the hotel name element within each section of hotel options. 
        - ratings_xpath (str): XPath to locate the rating element within each section of hotel options. 
        - prices_xpath (str): XPath to locate the price element within each section of hotel options. 
        - css_pages (str): CSS selector to locate the element for switching pages. 
        - pages (int): The total number of pages to scrape, found using get_number_pages(). 

    Returns: 
        - pd.DataFrame: A DataFrame containing the scraped hotel information. 

    Notes: 
        - The function prints the hotel information for each section on each page. 
        - If using for a different website, the XPaths for the section and hotel URLs need to be replaced with the correct paths. 

    Example: 
        df_hotel = hotel_information_scraping(browser, hotels_xpath, ratings_xpath, prices_xpath, css_pages, pages)
    
    """

    # Lists to store the extracted information
    hotels_list = []
    ratings_list = []
    prices_list = []
    url_list = []
    

    for page in range(int(pages)):  # one iteration per results page
        # Print the page currently being scraped
        print(f'Page: {page + 1}')
        sections = browser.find_elements('xpath', '//div[@class="c066246e13"]')
        for hotel in sections:
            try:
                hotel_name = hotel.find_element('xpath', hotels_xpath).text

            except NoSuchElementException: 
                print("Element could not be found. Printing page source:", browser.page_source)
                continue # Skip to the next iteration if the hotel section was not found 

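            except StaleElementReferenceException: 
                print("StaleElementReferenceException. Retrying...")
                # Retry by waiting for the element to be present again.
                # wait.until raises TimeoutException (not NoSuchElementException)
                # when the wait expires, so it is imported here and caught below.
                from selenium.common.exceptions import TimeoutException
                wait = WebDriverWait(browser, 10)

                try: 
                    hotel_name = wait.until(EC.presence_of_element_located(('xpath', hotels_xpath))).text 
                except (TimeoutException, NoSuchElementException): 
                    print("Element still cannot be found. Skipping to next Hotel.")
                    continue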
                
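            # Rating, price, and URL are optional: fall back to "N/A" when missing
            try:
                rating = hotel.find_element('xpath', ratings_xpath).text
            except (NoSuchElementException, StaleElementReferenceException):
                rating = "N/A"

            try:
                price = hotel.find_element('xpath', prices_xpath).text
            except (NoSuchElementException, StaleElementReferenceException):
                price = "N/A"  # a price is normally present; kept as a safeguard

            try: 
                hotel_url = hotel.find_element('xpath', './/a[@href]').get_attribute('href')
            except (NoSuchElementException, StaleElementReferenceException):
                hotel_url = "N/A"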
        
            # Print the hotel information as it runs through each section
            print("Hotel Name:", hotel_name)
            print("Rating:", rating)
            print("Price", price)
            print("Hotel URL", hotel_url)
            print("\n")

            # Append the information to their respective lists
            hotels_list.append(hotel_name)
            ratings_list.append(rating)
            prices_list.append(price)
            url_list.append(hotel_url)



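        # Move to the next results page; stop early if there is no next-page button
        try:
            browser.find_element('css selector', css_pages).click()
        except (NoSuchElementException, ElementClickInterceptedException):
            break
        time.sleep(2)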


    # Create a dictionary to store all lists together
    data = { 'Hotel Name': hotels_list, 'Rating': ratings_list, 'Price': prices_list, 'Hotel URL': url_list }

    # Create a dataframe from the dictionary
    df = pd.DataFrame(data)
    return df

In [None]:
# Running hotel scrape function. Repeat for all desired cities needed to scrape. 
df = hotel_information_scraping(browser, hotels_xpath, ratings_xpath, prices_xpath, css_pages, pages)
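
The `Price` column holds the text shown on the page, so it has to be converted to numbers before any price comparison. The sketch below assumes labels of the form `€ 1.234`, that is, whole-number prices in which dots, commas, and spaces act only as thousands separators. That format is an assumption about the page, and prices with decimals would need different handling. `parse_price` is a hypothetical helper.

```python
import re

def parse_price(text):
    """Convert a price label such as '€ 1.234' to an int.

    Returns None when the label is missing or contains no digits (e.g. 'N/A').
    Assumes whole-number prices with no decimal part.
    """
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None

print(parse_price("€ 1.234"))
print(parse_price("N/A"))
```

Applied to the DataFrame, this could look like `df['Price'].map(parse_price)`.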

EXTRACT DESCRIPTIONS

In [None]:
def description_info(url):

    """
    Extracts and returns the description text from a specified URL. 
    
    Parameters: 
        - url (str): The URL of the webpage to extract the description text from. 

    Returns: 
        - str or None: The extracted description text that is found, otherwise a None. 

    Notes: 
        - requests.exceptions.RequestException is caught inside the function: a message is printed and None is returned.
    """
    
    try:
        req = requests.get(url, headers=headers, timeout=60)
        req.raise_for_status() 
    except requests.exceptions.RequestException as e:
        print(f"Can't process the {url}: {e}")
        return None

    soup = BeautifulSoup(req.text, 'html.parser')
    description_text = soup.find('p', class_='a53cbfa6de b3efd73f69')

    if description_text:
        return description_text.get_text(strip=True)
    else:
        print("Description could not be found")
        return None

# Create an empty list to store the descriptions
descriptions_list = []

# Iterate through the URLs and scrape descriptions sequentially
for url in df['Hotel URL']:
    description = description_info(url)
    descriptions_list.append(description)

# Assign the descriptions to the 'Descriptions' column in the DataFrame
df['Descriptions'] = descriptions_list
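
The loop above fetches one description at a time. Since `ThreadPoolExecutor` is already imported, the requests could also run in parallel. The following is a minimal sketch rather than the approach used in this notebook: `fetch_all` is a hypothetical helper that takes the fetching function as an argument, keeps the results in input order, and skips the `"N/A"` placeholders used for missing links. The small default number of workers is a cautious assumption to avoid overloading the site.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=4):
    """Apply `fetch` to every URL in parallel, preserving input order.

    Entries equal to 'N/A' are not fetched and yield None.
    """
    def safe_fetch(url):
        return None if url == "N/A" else fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(safe_fetch, urls))
```

With the function defined above, the call would be `df['Descriptions'] = fetch_all(df['Hotel URL'], description_info)`.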

Save dataframe as a CSV file

In [None]:
# to_csv writes the file and returns None, so the result is not assigned
df.to_csv('Alicante_festival.csv') # Change name of csv as desired for each week that is being scraped