![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Text Mining: Models and Algorithms

## Problem Set 1

### *1. Identify a (future) event that makes a lot of people come to Barcelona. Think about music festivals, local festivities etc. (2 points)*
We have selected the Sónar festival, the Barcelona International Festival of Advanced Music and Multimedia Art, whose 31st edition is held in 2024. The event takes place in Montjuïc and attracts visitors from all over the world to Barcelona.


### *2. Think of the time periods to scrape and what second city to scrape for these same time periods. Explain your choices in writing. (2 points)*

The festival runs on June 13, 14, and 15, 2024. As a comparison period we use the same three days one week earlier: June 6, 7, and 8, 2024. Keeping the number of days equal and the dates close to the event makes the two scenarios comparable. We use Valencia as the second (control) city because of its proximity and similarity to Barcelona: both cities lie on Spain's eastern Mediterranean coast and share a similar geographical and cultural setting.
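
The design described above can be written down as a small grid of scraping jobs. The sketch below is only illustrative: `build_jobs`, the constant names and the job dictionary layout are assumptions introduced here, not part of the notebook. It derives the control window by shifting the festival dates back by exactly one week, so both windows fall on the same weekdays.

```python
from datetime import date, timedelta
from itertools import product

# Festival days taken from the text above; everything else is illustrative.
FESTIVAL_START = date(2024, 6, 13)
FESTIVAL_END = date(2024, 6, 15)
CITIES = ["Barcelona", "Valencia"]


def build_jobs(start, end, cities, shift_days=7):
    """Return one job per (city, period): the event window and a control
    window shifted back by `shift_days` days."""
    shift = timedelta(days=shift_days)
    periods = {
        "event": (start, end),
        "control": (start - shift, end - shift),
    }
    return [
        {"city": city, "period": name, "start": first, "end": last}
        for city, (name, (first, last)) in product(cities, periods.items())
    ]


jobs = build_jobs(FESTIVAL_START, FESTIVAL_END, CITIES)
for job in jobs:
    print(job["city"], job["period"], job["start"].isoformat(), job["end"].isoformat())
```

Each of the four jobs corresponds to one run of the scraper, which keeps the city and period explicit when the results are later compared.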

In [5]:
import json
import pandas as pd
import numpy as np
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, ElementClickInterceptedException, StaleElementReferenceException
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options
from concurrent.futures import ThreadPoolExecutor
import requests
import os
import warnings

# Silence UserWarning and FutureWarning messages raised from pandas
warnings.filterwarnings("ignore", category=UserWarning, module="pandas")
warnings.filterwarnings("ignore", category=FutureWarning, module="pandas")


# Go get geckodriver from : https://github.com/mozilla/geckodriver/releases

#### Utils

In [6]:
def ffx_preferences(dfolder, download=False):
    '''
    Sets the preferences of the firefox browser: download path.
    '''
    profile = webdriver.FirefoxProfile()
    # set download folder:
    profile.set_preference("browser.download.dir", dfolder)
    profile.set_preference("browser.download.folderList", 2)
    profile.set_preference("browser.download.manager.showWhenStarting", False)
    profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                           "application/msword,application/rtf, application/csv,text/csv,image/png ,image/jpeg, application/pdf, text/html,text/plain,application/octet-stream")
    
    # profile.add_extension('/Users/luisignaciomenendezgarcia/Dropbox/CLASSES/class_bse_text_mining/class_scraping_bse/booking/booking/ublock_origin-1.55.0.xpi')


    # this allows to download pdfs automatically
    if download:
        profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf,application/x-pdf")
        profile.set_preference("pdfjs.disabled", True)

    options = Options()
    options.profile = profile
    return options


def start_up(link, dfolder, geko_path, download=True):
    # geko_path='/Users/luisignaciomenendezgarcia/Dropbox/CLASSES/class_bse_text_mining/class_scraping_bse/booking/geckodriver'
    # download_path='./downloads'
    os.makedirs(dfolder, exist_ok=True)

    options = ffx_preferences(dfolder, download)
    service = Service(geko_path)
    browser = webdriver.Firefox(service=service, options=options)
    # Enter the website address here
    browser.get(link)
    time.sleep(5)  # Adjust sleep time as needed
    return browser
        
def check_and_click(browser, xpath, type):
    '''
    Function that checks whether the object is clickable and, if so, clicks on
    it. If not, waits one second and tries again.
    '''
    start_time = time.time()  # Record the start time
    while True:
        try:
            element = browser.find_element(By.XPATH, xpath)
            element.click()
            return "Clicked!"  # Element found and clicked successfully
        except NoSuchElementException:
            pass  # Continue if element not found
        except Exception as e:
            print(f"An error occurred: {e}")
            return False  # Other unexpected errors

        time.sleep(1)
        elapsed_time = time.time() - start_time
        if elapsed_time >= 3:
            # print("** The element was not found in the page. **")
            return None  # Element not found after 3 seconds
        
def check_obscures(browser, xpath, type):
    '''
    Function that checks whether the object is being "obscured" by any element so
    that it is not clickable. Important: if True, the object is going to be clicked!
    '''
    try:
        if type == "xpath":
            browser.find_element('xpath', xpath).click()
        elif type == "id":
            browser.find_element('id', xpath).click()
        elif type == "css":
            browser.find_element('css selector', xpath).click()
        elif type == "class":
            browser.find_element('class name', xpath).click()
        elif type == "link":
            browser.find_element('link text', xpath).click()
    except (ElementClickInterceptedException, StaleElementReferenceException) as e:
        print(e)
        return False
    except NoSuchElementException:
        # Do nothing if NoSuchElementException occurs (suppress the error)
        pass
    return True

def element_exists(browser, path):
    try:
        browser.find_element('xpath', path)
        return True
    except NoSuchElementException:
        return False
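
`check_and_click` is one instance of a more general pattern: retry an action until it succeeds or a deadline passes. A minimal, browser-free sketch of that pattern follows. `poll_until` and its parameters are hypothetical names introduced here, and the fake action only stands in for a Selenium click.

```python
import time


def poll_until(action, retry_on, timeout=3.0, interval=1.0, sleep=time.sleep):
    """Call `action` until it returns without raising `retry_on`.

    Returns the action's result, or None if `timeout` seconds elapse first.
    Any other exception propagates to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except retry_on:
            if time.monotonic() >= deadline:
                return None
            sleep(interval)


# Stand-in for a click that only succeeds on the third attempt.
attempts = {"count": 0}


def flaky_click():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not in the page yet")
    return "Clicked!"


result = poll_until(flaky_click, retry_on=LookupError, timeout=1.0, interval=0.01)
print(result, attempts["count"])
```

With Selenium, the action would be the `find_element(...).click()` call and `retry_on` would be `NoSuchElementException`. Selenium also ships its own explicit-wait helper, `WebDriverWait`, which may be preferable in practice.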

### Scraping Class

In [7]:
class Scrape:
    '''
    Class for web scraping accommodation information from Booking.com.

    Attributes:
    - browser: Selenium WebDriver instance for controlling the web browser.
    - search_bar_xpath: XPath for the search bar on the Booking.com page.
    - search_x_path: XPath for the search button on the page.
    - date_button_css: CSS Selector for the date selection button.
    - number_of_people_xpath: XPath for the button to select the number of people.
    - search_button_xpath: XPath for the final search button.
    - x_path_prev_date: XPath for the button to navigate to the previous date.
    - x_path_next_date: XPath for the button to navigate to the next date.
    - x_path_month: XPath for displaying the current month.
    - people_path: XPath for the button to select the number of people.
    - pages: Number of pages to scrape.
    - headers: HTTP headers for making requests.
    - data: Pandas DataFrame to store scraped information.
    - place: Variable to store the destination place.

    Methods:
    - input_place(): Takes user input for the destination place and interacts with the search bar.
    - input_dates(): Takes user input for the stay dates and interacts with the date selection.
    - input_people(): Takes user input for the number of people and interacts with the selection.
    - search(): Initiates the search for accommodations.
    - get_pages(limit): Retrieves the total number of pages to scrape.
    - scrape_info(): Scrapes general information about accommodations.
    - scrape_description(url): Scrapes the description of an accommodation given its URL.
    - get_descriptions(): Scrapes descriptions for all accommodations in parallel.
    '''
    def __init__(self):
        print("Remember to close the annoying Google popup on the page")
        dfolder='./downloads'
        geko_path='./geckodriver'
        link='https://www.booking.com/index.es.html'
        self.browser =start_up(dfolder=dfolder,link=link,geko_path=geko_path)
        self.search_bar_xpath = '//div[@class="b9b84f4305"]'
        self.search_x_path= '/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'
        self.search_x_path2 = '/html/body/div[4]/div/div[2]/div/div[1]/div/form/div[1]/div[4]/button'
        self.date_button_css = 'button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'
        self.number_of_people_xpath = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/button'
        self.search_button_xpath = '/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'
        self.x_path_prev_date = '//button[@class="a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c c9804790f7"]'
        self.x_path_next_date = '//button[@class="a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358"]'
        self.x_path_month = '//h3[@class="e1eebb6a1e ee7ec6b631"]'
        x_path_cookies = '//button[@id="onetrust-accept-btn-handler"]'
        self.people_path = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/button'
        self.pages = 1
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
        self.data = pd.DataFrame(columns=['Hotels', 'Ratings', 'Price', 'Link'])
        self.place = ''
        check_and_click(self.browser, x_path_cookies, 'xpath')
    def input_place(self):
        place = input('Where do you want to go?')
        self.place = (place.lower()).capitalize()
        self.browser.find_element(by='xpath', value='//div[@class="b9b84f4305"]').click()
        search = self.browser.find_element(by='xpath', value='//*[@id=":re:"]')
        search.clear()
        search.send_keys(place)
        print(f'Place of stay: {(place.lower()).capitalize()}')
        x_path_close = '/html/body/div[4]/div/div[2]/div/div[1]/div/form/div[1]/div[1]/div/div/div[1]/div/div/div[1]/span/svg'
        check_and_click(self.browser, x_path_close, 'xpath')
    def input_dates(self):
        print("Just a second...")
        self.browser.find_element('css selector',self.date_button_css).click()
        while element_exists(self.browser, self.x_path_prev_date):
                self.browser.find_element('xpath', self.x_path_prev_date).click()
                time.sleep(1)
        start_date = (input("Input the start date of your planned stay in the form (XX mes XXXX). Use Spanish month names and check the spelling!")).lower()
        end_date = (input("Input the end date of your planned stay in the form (XX mes XXXX). Use Spanish month names and check the spelling!")).lower()
        month_and_year_start = start_date[3:]
        month_and_year_end = end_date[3:]
        month_and_year = self.browser.find_element('xpath', self.x_path_month).text
        while month_and_year != month_and_year_start:
                self.browser.find_element('xpath', self.x_path_next_date).click()
                month_and_year = self.browser.find_element('xpath', self.x_path_month).text
                time.sleep(1)
        months_dict = {'enero': '01', 'febrero': '02', 'marzo': '03', 'abril': '04', 'mayo': '05', 'junio': '06', 'julio': '07', 'agosto': '08', 'septiembre': '09', 'octubre': '10', 'noviembre': '11', 'diciembre': '12'}
        x_path_dates='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'
        dates = self.browser.find_elements('xpath',x_path_dates)
        from_day = start_date[:2]
        to_day = end_date[:2]
        month_start = month_and_year_start[:-5]
        month_end = month_and_year_end[:-5]
        year = month_and_year[-4:]
        for date in dates:
            if date.get_attribute("data-date") == f"{year}-{months_dict[month_start]}-{from_day}":
                date.click()
                break
        if month_start == month_end:
            for date in dates:
                if date.get_attribute("data-date") == f"{year}-{months_dict[month_end]}-{to_day}":
                    date.click()
                    break
        else:
            while self.browser.find_element('xpath', self.x_path_month).text != month_and_year_end:
                self.browser.find_element('xpath', self.x_path_next_date).click()
            dates = self.browser.find_elements('xpath',x_path_dates)
            for date in dates:
                if date.get_attribute("data-date") == f"{year}-{months_dict[month_end]}-{to_day}":
                    date.click()
                    break
        self.browser.find_element('css selector',self.date_button_css).click()
        print(f'Start date of the stay: {start_date}')
        print(f'End date of the stay: {end_date}')
    def input_people(self):
        self.browser.find_element('xpath', self.people_path).click()
        number_of_people = int(input('How many people in total need accommodation?'))
        css_minus = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/div/div/div/div[1]/div[2]/button[1]'
        css_plus = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/div/div/div/div[1]/div[2]/button[2]'
        if number_of_people == 1:
            self.browser.find_element('xpath', css_minus).click()
        elif number_of_people > 2:
            i = 2
            while i < number_of_people:
                self.browser.find_element('xpath', css_plus).click()
                i+=1
                time.sleep(2)
        self.browser.find_element('xpath', self.people_path).click()
        print(f'Accommodations for: {number_of_people} people')
    def search(self):
        check_and_click(self.browser, self.search_x_path, 'xpath')
        check_and_click(self.browser, self.search_x_path2, 'xpath')
        place = self.place
        print(f'Searching accommodations in {(place.lower()).capitalize()}...')
    def get_pages(self, limit=None):
        a = self.browser.find_elements('xpath', '//button[@class="a83ed08757 a2028338ea"]')
        if a:
            total_pages = int(a[-1].text)
            if limit is not None and total_pages > limit:
                self.pages = limit
            else:
                self.pages = total_pages
        else:
            self.pages = 1
    def scrape_info(self):
        print("Scraping general info...\n")
        change_page_xpath = '/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[3]/button/span/span'
        css = 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)'
        first_page_xpath='/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[2]/ol/li[1]/button'
        check_and_click(self.browser,first_page_xpath , type='xpath')
        for i in range(self.pages):
            print(f'Page: {i + 1}')
            containers = self.browser.find_elements('xpath', '//div[@class="c066246e13"]')
            for hotel in containers:
                hotel_name = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]').text
                # Missing fields are stored as NaN; other errors are allowed to surface
                try:
                    hotel_rating = hotel.find_element('xpath', './/div[@class="a3b8729ab1 d86cee9b25"]').text
                except NoSuchElementException:
                    hotel_rating = np.nan
                try:
                    hotel_price = hotel.find_element('xpath', './/span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]').text
                except NoSuchElementException:
                    hotel_price = np.nan
                try:
                    url = hotel.find_element('xpath', './/a[@href]')
                    hotel_url = url.get_attribute('href')
                except NoSuchElementException:
                    hotel_url = np.nan
                new_row = {'Hotels': hotel_name, 'Ratings': hotel_rating, 'Price':hotel_price, 'Link': hotel_url}
                self.data = pd.concat([self.data, pd.DataFrame([new_row])], ignore_index=True)
            # Go to the next results page (no click needed after the last page)
            if i < self.pages - 1:
                self.browser.find_element('css selector', css).click()
            time.sleep(2)
        print("\nDone!\n")
    def scrape_description(self,url):
        try:
            response = requests.get(url, headers=self.headers, timeout=10)
            response.raise_for_status() 
        except requests.exceptions.RequestException as e:
            print(f"Error processing {url}: {e}")
            return None

        soup = BeautifulSoup(response.text, 'html.parser')
        description_tag = soup.find('p', class_='a53cbfa6de b3efd73f69')

        if description_tag:
            return description_tag.get_text(strip=True)
        else:
            print(f"Description tag not found on the page: {url}")
            return None
    def get_descriptions(self):
        print("Scraping descriptions...")
        num_threads = 16
        with ThreadPoolExecutor(max_workers=num_threads) as executor:
            descriptions = []
            for i, description in enumerate(executor.map(self.scrape_description, self.data['Link']), start=1):
                descriptions.append(description)
                if i % 50 == 0:
                    print(f"Scraped {i} links")
        self.data['Descriptions'] = descriptions
        print(f"Scraped {len(descriptions)} links in total")
        display(self.data)
        print("\nDone!\n")
        return (self.data)


            
        
            

## Pipeline strategy

To scrape our data we developed the `Scrape` class, which contains all the functions needed to enter the search parameters and scrape the website. The two code blocks below instantiate the class and run its methods. They work as an interactive prompt: each step tells the user what format the input must have and prints back the value entered, so the search parameters can be tracked step by step.

The following section walks through the same pipeline step by step, with extensive comments, to make the logic of the code easier to follow.

### Pipeline (using the scraping class)

#### 1. Searching

In [9]:
instance1 = Scrape()
instance1.input_place()
instance1.input_dates()
instance1.input_people()
instance1.search()


#### 2. Scraping general Information and Description text

In [None]:
instance1.get_pages(limit = 2)
instance1.scrape_info()
data = instance1.get_descriptions()


### *3. Design a careful scraping pipeline that follows the advice given in class and by the TAs. (5 points) The basic points to bear in mind are:*

+ *Organize the data you need, format and structure to store it beforehand. Try to foresee how you will need to read in the data to answer your questions. If you want, you can include some few lines explaining your pipeline strategy at the*
+ *Codes should be as automated as possible. That is, you don't want to rely on human intervention to get your data.*

+ *Use only the packages we have seen in the course. Although firefox is recommended, you can also use chrome as your scraping browser.*

+ *Document your codes and make them robust and efficient.*


### Pipeline (Step by Step)

### 1. Opening the Browser

In [1036]:
# Start-up the browser
dfolder='./downloads'
geko_path='./geckodriver'
link='https://www.booking.com/index.es.html'


browser=start_up(dfolder=dfolder,link=link,geko_path=geko_path)


### 2. Accepting Cookies

In [1037]:
# Click on "Accept cookies" button
x_path_cookies = '//button[@id="onetrust-accept-btn-handler"]'
check_and_click(browser, x_path_cookies, 'xpath')


An error occurred: Message: Element <button id="onetrust-accept-btn-handler"> could not be scrolled into view
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:191:5
ElementNotInteractableError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:351:5
webdriverClickElement@chrome://remote/content/marionette/interaction.sys.mjs:166:11
interaction.clickElement@chrome://remote/content/marionette/interaction.sys.mjs:135:11
clickElement@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:204:29
receiveMessage@chrome://remote/content/marionette/actors/MarionetteCommandsChild.sys.mjs:84:31



False

### 3. Search Bar

In [4]:
# Click on the search bar
browser.find_element(by='xpath',value='//div[@class="b9b84f4305"]').click()

NameError: name 'browser' is not defined

### 4. Input the place

In [1039]:
# Input the location to search for
place = input('Where do you want to go?')
search1 = browser.find_element(by='xpath',value='//*[@id=":re:"]')
search1.send_keys(place)

### 5. Input the Dates

In [1040]:
# Find and click the element to input the date
css_date='button.ebbedaf8ac:nth-child(2) > span:nth-child(1)'
browser.find_element('css selector',css_date).click()

In [1041]:
x_path_prev_date = '//button[@class="a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c c9804790f7"]'
while element_exists(browser, x_path_prev_date):
        browser.find_element('xpath', x_path_prev_date).click()
        time.sleep(1)
x_path_month1 = '//h3[@class="e1eebb6a1e ee7ec6b631"]'

# Input the wanted date for the stay
start_date = (input("Input the start date of your planned stay in the form (XX mes XXXX). Use Spanish month names and check the spelling!")).lower()
end_date = (input("Input the end date of your planned stay in the form (XX mes XXXX). Use Spanish month names and check the spelling!")).lower()

# Read the month currently shown on screen and advance until it matches the month of the start date
month_and_year_start = start_date[3:]
month_and_year_end = end_date[3:]
month_and_year = browser.find_element('xpath', x_path_month1).text
x_path_next_date = '//button[@class="a83ed08757 c21c56c305 f38b6daa18 d691166b09 f671049264 deab83296e f4552b6561 dc72a8413c f073249358"]'
while month_and_year != month_and_year_start:
        browser.find_element('xpath', x_path_next_date).click()
        month_and_year = browser.find_element('xpath', x_path_month1).text
        time.sleep(1)
print(month_and_year)




junio 2024


### 6. Select the dates

In [1042]:
months_dict = {'enero': '01', 'febrero': '02', 'marzo': '03', 'abril': '04', 'mayo': '05', 'junio': '06', 'julio': '07', 'agosto': '08', 'septiembre': '09', 'octubre': '10', 'noviembre': '11', 'diciembre': '12'}
x_path_dates='//div[@id="calendar-searchboxdatepicker"]//table[@class="eb03f3f27f"]//tbody//td[@class="b80d5adb18"]//span[@class="cf06f772fa"]'
dates = browser.find_elements('xpath',x_path_dates)
from_day = start_date[:2]
to_day = end_date[:2]
month_start = month_and_year_start[:-5]
month_end = month_and_year_end[:-5]
year = month_and_year[-4:]

# Select the dates
for date in dates:
    if date.get_attribute("data-date") == f"{year}-{months_dict[month_start]}-{from_day}":
        date.click()
        break
if month_start == month_end:
    for date in dates:
        if date.get_attribute("data-date") == f"{year}-{months_dict[month_end]}-{to_day}":
            date.click()
            break
else:
    while browser.find_element('xpath', x_path_month1).text != month_and_year_end:
        browser.find_element('xpath', x_path_next_date).click()
    dates = browser.find_elements('xpath',x_path_dates)
    for date in dates:
        if date.get_attribute("data-date") == f"{year}-{months_dict[month_end]}-{to_day}":
            date.click()
            break
browser.find_element('css selector',css_date).click()
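
The slicing above assumes a two-digit day and reuses the start year for the end date. As a hedged alternative, the sketch below parses the same `XX mes XXXX` input into a `datetime.date` and derives the `data-date` value from it. `parse_spanish_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import date

MONTHS = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}


def parse_spanish_date(text):
    """Parse 'DD mes YYYY' (Spanish month name) into a datetime.date.

    Raises ValueError on malformed input, unknown months, or impossible dates.
    """
    parts = text.strip().lower().split()
    if len(parts) != 3:
        raise ValueError(f"expected 'DD mes YYYY', got {text!r}")
    day, month_name, year = parts
    if month_name not in MONTHS:
        raise ValueError(f"unknown month name: {month_name!r}")
    return date(int(year), MONTHS[month_name], int(day))


start = parse_spanish_date("13 junio 2024")
end = parse_spanish_date("2 Enero 2025")
print(start.isoformat())  # string to compare against the data-date attribute
print(end.isoformat())
```

Validating the input before touching the calendar widget means a misspelled month fails immediately instead of leaving the month-navigation loop clicking forever.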


### 7. Input the number of people

In [1043]:
x_path = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/button'

browser.find_element('xpath', x_path).click()

In [1044]:
# Select the number of people
number_of_people = int(input('How many people in total need accommodation?'))

css_minus = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/div/div/div/div[1]/div[2]/button[1]'
css_plus = '/html/body/div[3]/div[2]/div/form/div[1]/div[3]/div/div/div/div/div[1]/div[2]/button[2]'
if number_of_people == 1:
    browser.find_element('xpath', css_minus).click()
elif number_of_people > 2:
    i = 2
    while i < number_of_people:
        browser.find_element('xpath', css_plus).click()
        i+=1
        time.sleep(2)
    

### 8. Search

In [1045]:
# Click on the search button
search_xpath='/html/body/div[3]/div[2]/div/form/div[1]/div[4]/button/span'

check_obscures(browser,search_xpath , type='xpath')
check_and_click(browser,search_xpath , type='xpath')


### 9. Extracting Number of Pages

In [1047]:
def get_number_pages(browser):
    '''
    Get the number of pages. 
    '''
    a = browser.find_elements('xpath',
        '//button[@class="a83ed08757 a2028338ea"]')
    if a:
        return(int(a[-1].text))
    else:
        return (1)

pages = get_number_pages(browser)

print(pages)


27


### *4. Scrape date, room price, hotel name and hotel description. (5 points)*

### 10. Scraping Pipeline

In [1048]:
# Finding the button to change the page in Booking.com
change_page_xpath = '/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[3]/button/span/span'
css = 'div.b16a89683f:nth-child(3) > button:nth-child(1) > span:nth-child(1) > span:nth-child(1)'
# Creating DataFrame
data = pd.DataFrame(columns=['Hotels', 'Ratings', 'Price', 'Link'])
# Make sure to be on the first page when starting to scrape the data
first_page_xpath='/html/body/div[4]/div/div[2]/div/div[2]/div[3]/div[2]/div[2]/div[4]/div[2]/nav/nav/div/div[2]/ol/li[1]/button'
check_and_click(browser,first_page_xpath , type='xpath')
# loop to scrape the data and populate the DataFrame
for i in range(pages):
    print(f'Page: {i + 1}')
    # Dividing the page in the Container Objects, one for every hotel and extracting the wanted data from each
    containers = browser.find_elements('xpath', '//div[@class="c066246e13"]')
    for hotel in containers:
        hotel_name = hotel.find_element('xpath', './/div[@class="f6431b446c a15b38c233"]').text
        # Missing fields are stored as NaN; other errors are allowed to surface
        try:
            hotel_rating = hotel.find_element('xpath', './/div[@class="a3b8729ab1 d86cee9b25"]').text
        except NoSuchElementException:
            hotel_rating = np.nan
        try:
            hotel_price = hotel.find_element('xpath', './/span[@class="f6431b446c fbfd7c1165 e84eb96b1f"]').text
        except NoSuchElementException:
            hotel_price = np.nan
        try:
            url = hotel.find_element('xpath', './/a[@href]')
            hotel_url = url.get_attribute('href')
        except NoSuchElementException:
            hotel_url = np.nan
        new_row = {'Hotels': hotel_name, 'Ratings': hotel_rating, 'Price':hotel_price, 'Link': hotel_url}
        data = pd.concat([data, pd.DataFrame([new_row])], ignore_index=True)
    # Change page with CSS Selector (no click needed after the last page)
    if i < pages - 1:
        browser.find_element('css selector', css).click()
    time.sleep(2)
print("\nDone!\n")
display(data)

Page: 1
Page: 2
Page: 3
Page: 4
Page: 5
Page: 6
Page: 7
Page: 8
Page: 9
Page: 10
Page: 11


KeyboardInterrupt: 

### 11. Scraping Descriptions using BeautifulSoup with parallelized operations

In [None]:
import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

# Function to scrape the descriptions using Beautiful Soup
def scrape_description(url):
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status() 
        # time.sleep(0.5)
    except requests.exceptions.RequestException as e:
        print(f"Error processing {url}: {e}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    description_tag = soup.find('p', class_='a53cbfa6de b3efd73f69')

    if description_tag:
        return description_tag.get_text(strip=True)
    else:
        print(f"Description tag not found on the page: {url}")
        return None

# Set the number of concurrent threads (adjust this based on the processing power of your computer)
num_threads = 16

# Create a ThreadPoolExecutor to run operations in parallel
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    # Use executor.map to apply the scrape_description function to each URL in parallel
    descriptions = []
    for i, description in enumerate(executor.map(scrape_description, data['Link']), start=1):
        descriptions.append(description)
        # Print every 50 links to check the progress of the scraping
        if i % 50 == 0:
            print(f"Scraped {i} links")

# Assign the descriptions to the 'Descriptions' column in the DataFrame
data['Descriptions'] = descriptions

# Print count after all threads have completed
print(f"Scraped {len(descriptions)} links")
print("\nDone!\n")


Scraped 50 links
Scraped 100 links
Scraped 150 links


KeyboardInterrupt: 
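
`executor.map` yields results in the order of its inputs, which is what allows the list to be assigned straight back to a DataFrame column. A small offline sketch of that property follows. `fake_fetch` is a made-up stand-in for `scrape_description` and performs no network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Pretend to download a page; later URLs finish sooner on purpose."""
    delay = {"a": 0.03, "b": 0.02, "c": 0.01}.get(url, 0.0)
    time.sleep(delay)
    if url == "broken":
        return None  # mirrors scrape_description returning None on failure
    return f"description of {url}"


urls = ["a", "b", "broken", "c"]
with ThreadPoolExecutor(max_workers=4) as executor:
    descriptions = list(executor.map(fake_fetch, urls))

# Results line up with `urls` even though "c" finished before "a".
for url, text in zip(urls, descriptions):
    print(url, "->", text)

missing = sum(text is None for text in descriptions)
print(f"{missing} of {len(urls)} descriptions missing")
```

Counting the `None` entries after the run gives a quick check of how many pages failed or lacked the description tag.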

### Visualize the data

In [None]:
data

| Unnamed: 0 | Hotels | Ratings | Price | Link | Descriptions |
|---|---|---|---|---|---|
| 0 | Room Mate Gerard | 88 | € 546 | https://www.booking.com/hotel/es/room-mate-ger... | El Room Mate Gerard en Barcelona ofrece alojam... |
| 1 | Sonder Los Arcos | 84 | € 653 | https://www.booking.com/hotel/es/sonder-los-ar... | Sonder Los Arcos dispone de alojamiento con wi... |
| 2 | Catalonia Sagrada Familia | 82 | € 337 | https://www.booking.com/hotel/es/cataloniaarag... | El Catalonia Sagrada Familia se halla a 15 min... |
| 3 | Catalonia Diagonal Centro | 84 | € 400 | https://www.booking.com/hotel/es/catalonia-dia... | El Catalonia Diagonal Centro se encuentra en p... |
| 4 | Hostal Santa Ana | 70 | € 174 | https://www.booking.com/hotel/es/hostal-cortes... | El Hostal Santa Ana se encuentra en pleno cent... |
| ... | ... | ... | ... | ... | ... |
| 853 | Hostal Boqueria | 84 | € 422 | https://www.booking.com/hotel/es/hostal-boquer... | Este establecimiento se encuentra en las Rambl... |
| 854 | Catalonia Passeig de Gràcia 4* Sup | 90 | € 570 | https://www.booking.com/hotel/es/catalonia-pas... | El Catalonia Passeig de Gràcia 4* Sup se encue... |
| 855 | Hostal Benidorm | 82 | € 332 | https://www.booking.com/hotel/es/hostal-benido... | El Hostal Benidorm está situado en el famoso p... |
| 856 | Hostal Conde Güell | 86 | € 179 | https://www.booking.com/hotel/es/hostal-conde-... | El Hostal Conde Güell está ubicado en Barcelon... |
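
The `Price` column is stored as text such as `€ 546`, so it has to be converted to a number before prices can be compared across cities or periods. The sketch below shows one way to do this. It assumes prices are whole euros, possibly with `.` or `,` as thousands separators, which is an assumption about the site's formatting that the output above does not confirm. `parse_price` is a hypothetical helper.

```python
import math
import re


def parse_price(raw):
    """Convert a string like '€ 546' or '€ 1.234' to an int number of euros.

    Returns None for missing values (None, NaN) or text without digits.
    """
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    digits = re.sub(r"[^\d]", "", str(raw))  # drop currency sign, spaces, separators
    return int(digits) if digits else None


samples = ["€ 546", "€ 1.234", float("nan"), ""]
print([parse_price(s) for s in samples])
```

On the scraped DataFrame this could be applied with `data['Price'].map(parse_price)`. If prices with cents ever appear, the digit-stripping step would misread them and would need to be adapted.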


### 12. Create CSV Files

In [None]:
# Create a CSV file with the scraped data (Adjust the path)
data.to_csv('Data.csv', index=False)
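
Because the analysis compares two cities over two periods, each saved file should record which search produced it. One possible layout is sketched below with the standard `csv` module, so that it runs without the scraper. It adds `City`, `CheckIn` and `CheckOut` columns and encodes them in the file name. These column names, the naming scheme and the sample row are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(rows, city, check_in, check_out, folder):
    """Write scraped rows to '<city>_<check_in>_<check_out>.csv', tagging each row."""
    path = Path(folder) / f"{city.lower()}_{check_in}_{check_out}.csv"
    fieldnames = ["City", "CheckIn", "CheckOut", "Hotels", "Ratings", "Price", "Link"]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            writer.writerow({"City": city, "CheckIn": check_in, "CheckOut": check_out, **row})
    return path


# Made-up row, only to exercise the function.
rows = [{"Hotels": "Example Hotel", "Ratings": "8.0", "Price": "€ 100", "Link": "https://example.com"}]
with tempfile.TemporaryDirectory() as folder:
    path = save_rows(rows, "Barcelona", "2024-06-13", "2024-06-15", folder)
    saved_name = path.name
    with path.open(newline="", encoding="utf-8") as handle:
        loaded = list(csv.DictReader(handle))
print(saved_name, loaded[0]["City"], loaded[0]["Hotels"])
```

With pandas, the same effect could be obtained by assigning the three columns to `data` before calling `to_csv` with a file name built from the search parameters.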