# Webscraping practice

## Static websites with Selenium
1. Data model with Pydantic (refresh)
2. Get robots.txt and create an exclusion protocol for other calls [webdriver subclass and wrapper for driver.get()]
3. Get the structure for directories (refresh)
4. Scrape from catalogue
5. Follow pagination
6. Save data to CSV

## Melcom (Ghana)

### Data Model

In [1]:
from datetime import date
from pydantic import BaseModel

In [2]:
class Product(BaseModel):
    link: str
    source: str
    category: str = None
    subcategory: str = None
    subsubcategory: str = None
    name: str = None
    brand: str = None
    uid: str = None
    price: float
    regular_price: float = None
    currency: str
    in_stock: str = None
    description: str = None
    date: str = date.today().strftime("%Y-%m-%d")

class Melcom(Product):
    source: str = "Melcom"
    currency: str = "GHS"

### Get and parse robots.txt

In [3]:
from urllib.robotparser import RobotFileParser
import time
from bs4 import BeautifulSoup

In [4]:
melcom_root = "https://melcom.com/"
robots_melcom = RobotFileParser(melcom_root + "robots.txt")
robots_melcom.read()
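To see what the parsed object does without making a network call, here is a minimal sketch that parses a hypothetical robots.txt from memory. The rules and the `example.com` URLs are assumptions made for illustration only; they are not the actual Melcom rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of downloaded
example_rules = """
User-agent: *
Disallow: /review/
Disallow: /checkout/
""".splitlines()

robots_example = RobotFileParser()
robots_example.parse(example_rules)

user_agent = "Webscraping Capacity Building 1.0"
# can_fetch() returns True when the rules allow this user agent to load the URL
home_allowed = robots_example.can_fetch(user_agent, "https://example.com/")
review_allowed = robots_example.can_fetch(user_agent, "https://example.com/review/")
print(home_allowed, review_allowed)
```

The same `can_fetch()` call works on `robots_melcom` once `read()` has downloaded the real file.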

We set up the browser and load the homepage, just to check that it works.  
When we create the driver object we set an implicit wait (see [https://www.selenium.dev/documentation/webdriver/waits/](https://www.selenium.dev/documentation/webdriver/waits/)). Explicit waits are also available, but mixing the two types is not recommended. In my experience, implicit waits are easier to manage, while explicit waits can solve more complicated and specific issues. Each case may have a different optimal solution.
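The idea behind an explicit wait is to poll a condition until it holds or a timeout expires. The sketch below shows that idea in plain Python, without Selenium; `wait_until` and `element_ready` are hypothetical names used only for illustration. In Selenium itself, `WebDriverWait(driver, timeout).until(condition)` plays this role.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def wait_until(condition: Callable[[], Optional[T]], timeout: float = 10.0, poll: float = 0.5) -> T:
    """Call `condition` repeatedly until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)

# Toy condition standing in for "the element is now present": it succeeds on the third call
attempts = {"count": 0}

def element_ready() -> Optional[str]:
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None

found = wait_until(element_ready, timeout=2.0, poll=0.01)
print(found, attempts["count"])
```

An implicit wait applies the same kind of polling to every element lookup, which is why it needs to be set only once on the driver.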

In [5]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

opts = Options()

# Windows and Linux options for setting the paths, uncomment the right one for your Operating system
# And comment the other one

#chromium_path = "C:\\Users\\domin\\chrome-win\\chrome.exe"  # Windows path Chromium
chromium_path = "/snap/chromium/current/usr/lib/chromium-browser/chrome"  # Linux Path Chromium
opts.binary_location = chromium_path
opts.add_argument("user-agent=Webscraping Capacity Building 1.0")
#chromedriver_path = Service("C:\\Users\\domin\\chromedriver_win32\\chromedriver.exe")  # Windows path Chromedriver
chromedriver_path = Service(executable_path="/snap/chromium/current/usr/lib/chromium-browser/chromedriver")  # Linux Path Chromedriver
driver = webdriver.Chrome(service=chromedriver_path, options=opts)
driver.implicitly_wait(30)  # The driver will wait up to 30 seconds each time we ask to perform an action or get an element
driver.get(melcom_root)



Since we have an open browser, let's check that we have correctly set the User Agent and see which information we are sending to the website. We use one of the many websites that display this for us: [https://www.whatismybrowser.com/detect/](https://www.whatismybrowser.com/detect/)

In [6]:
# Check the User Agent:
driver.get("https://www.whatismybrowser.com/detect/what-is-my-user-agent/")

In [7]:
# Check all the headers we send
driver.get("https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending")

Note that we are only sending the User Agent and not the email, as we did with Requests. To modify the HTTP request headers further we would need a different library, for instance [Selenium Wire](https://pypi.org/project/selenium-wire/), which gives more control over the requests we make. A simpler option is to include our email inside the User Agent. Since Selenium Wire releases sometimes lag several months behind Selenium, I would suggest this second option.

In [8]:
# We close the driver and rebuild again with an updated User Agent
driver.quit()

opts.add_argument("user-agent=Webscraping Capacity Building 1.0/mail: luigi.palumbo@unitus.it")
driver = webdriver.Chrome(service=chromedriver_path, options=opts)
driver.implicitly_wait(30)  # The driver will wait up to 30 seconds each time we ask to perform an action or get an element
driver.get("https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending")

Another way to check the User Agent you are sending is to use JavaScript, and simply execute a JavaScript command in the browser:

In [10]:
driver.execute_script('return navigator.userAgent')

'Webscraping Capacity Building 1.0/mail: luigi.palumbo@unitus.it'

In [11]:
# Always remember to close the browser!
driver.quit()

### Safe Get method

This time we take a different approach to make sure our requests respect the website's robots.txt. We create a subclass of the webdriver.Chrome class and add a new method, safe_get(), which also takes the parsed robots.txt file as an argument.

In [13]:
class SafeChrome(webdriver.Chrome):
    def safe_get(self, url: str, robots: RobotFileParser):
        """Load a webpage in the current browser session
        if allowed by robots.txt for the specific website
        """
        if robots.can_fetch(self.execute_script('return navigator.userAgent'), url):
            self.get(url)

In [14]:
driver = SafeChrome(service=chromedriver_path, options=opts)
driver.implicitly_wait(30)  # The driver will wait up to 30 seconds each time we ask to perform an action or get an element
driver.safe_get(melcom_root, robots_melcom)

Let's also double-check with a forbidden URL from another website.

In [15]:
carrefour_root = "https://www.carrefour.tn/"
robots_carrefour = RobotFileParser(carrefour_root + "robots.txt")
robots_carrefour.read()

In [16]:
# The first call should work, as it is the homepage
driver.safe_get("https://www.carrefour.tn/", robots_carrefour)

In [17]:
# The second one should not, as it is in the Disallow list for all User Agents
driver.safe_get("https://www.carrefour.tn/review/", robots_carrefour)  # the browser does not go to this new page

In [18]:
# But a regular get request would go through...
driver.get("https://www.carrefour.tn/review/")  # and find a broken page, in this case
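The `safe_get()` method above returns nothing, so the caller cannot tell whether the page was loaded or skipped. One possible variant, sketched here with a fake driver so that it runs without a browser, returns a boolean and prints a message when a URL is blocked. `ReportingSafeGet`, `FakeDriver`, the explicit `user_agent` argument, and the rules are all assumptions made for this illustration.

```python
from urllib.robotparser import RobotFileParser

class ReportingSafeGet:
    """Mixin sketch: a safe_get() that reports whether the page was requested."""

    def safe_get(self, url: str, robots: RobotFileParser, user_agent: str) -> bool:
        if robots.can_fetch(user_agent, url):
            self.get(url)
            return True
        print(f"Blocked by robots.txt: {url}")
        return False

class FakeDriver(ReportingSafeGet):
    """Stand-in for a real browser: it only records the URLs it was asked to load."""

    def __init__(self) -> None:
        self.visited = []

    def get(self, url: str) -> None:
        self.visited.append(url)

# Hypothetical rules, parsed from memory
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /review/"])

fake = FakeDriver()
ok_home = fake.safe_get("https://example.com/", robots, "Example Agent 1.0")
ok_review = fake.safe_get("https://example.com/review/", robots, "Example Agent 1.0")
print(ok_home, ok_review, fake.visited)
```

With a return value, a scraping loop can skip parsing when nothing was loaded instead of re-reading the previous page.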

### Extracting content from the browser

Selenium has advanced capabilities for extracting information from a webpage by selecting specific elements. This is very important when different elements load at different times (see also the explicit waits mentioned above if this is relevant for you). However, in many instances it is preferable to extract the full HTML content of the webpage and parse it with BeautifulSoup. This lets you use a single tool and methodology for parsing content, regardless of whether you are scraping via Requests or Selenium, and ultimately makes your work easier to maintain.

In [19]:
# We go back to Melcom homepage
driver.safe_get(melcom_root, robots_melcom)

# There are two alternative methods to extract the HTML, and they do not always behave the same way.
homepage = driver.execute_script("return document.body.innerHTML;")

# Alternative, uncomment to run
#homepage = driver.page_source

In [21]:
# Parsing the HTML with BeautifulSoup
homepage = BeautifulSoup(homepage, "html.parser")

In [22]:
# Getting the categories links
categories = [
    {"name": item.get_text(), "link": item.find("a").get("href") }
    for item in homepage.find("div", {"id": "custom.topnav"}).find("ul", {"class": "mega-columns"}).find_all("li", {"class": "level2"})]

categories[:3] # We display the first three categories

[{'name': 'Television & Audio',
  'link': 'https://melcom.com/categories/electronics-appliances/television-audio.html'},
 {'name': 'Refrigerators & Freezers',
  'link': 'https://melcom.com/categories/electronics-appliances/refrigerators-freezers.html'},
 {'name': 'Washing Machines',
  'link': 'https://melcom.com/categories/electronics-appliances/washing-machines.html'}]

In [23]:
# Just browsing across a couple of categories
for cat in categories[:2]:
    driver.safe_get(cat.get("link"), robots_melcom)
    time.sleep(5)
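The fixed five-second pause is a simple politeness delay. If a website declares a `Crawl-delay` in its robots.txt, it could be used instead. The sketch below assumes hypothetical rules and a hypothetical helper named `polite_delay`; whether Melcom declares a crawl delay is not checked here.

```python
from urllib.robotparser import RobotFileParser

def polite_delay(robots: RobotFileParser, user_agent: str, default: float = 5.0) -> float:
    """Return the Crawl-delay declared for this user agent, or `default` if none is declared."""
    declared = robots.crawl_delay(user_agent)
    return float(declared) if declared is not None else default

# Hypothetical rules: one file declares a crawl delay, the other does not
robots_with_delay = RobotFileParser()
robots_with_delay.parse(["User-agent: *", "Crawl-delay: 2", "Disallow: /private/"])
robots_without_delay = RobotFileParser()
robots_without_delay.parse(["User-agent: *", "Disallow: /private/"])

delay_a = polite_delay(robots_with_delay, "Example Agent 1.0")
delay_b = polite_delay(robots_without_delay, "Example Agent 1.0")
print(delay_a, delay_b)
```

The returned value can be passed straight to `time.sleep()` between page loads.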

In [24]:
page = driver.execute_script("return document.body.innerHTML;")
# Parsing the HTML with BeautifulSoup
page = BeautifulSoup(page, "html.parser")

We grab the main data from the catalogue page

In [25]:
# Product name
page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})[0].find("a", {"class": "product-item-link"}).get_text().strip()

'AKAI FRIDGE DOUBLE DOOR DISPLAY 520L BLACK'

In [26]:
# Product link
page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})[0].find("a", {"class": "product-item-link"}).get("href")

'https://melcom.com/categories/electronics-appliances/refrigerators-freezers/akai-fridge-double-door-display-520l-black.html'

In [27]:
# Price
page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})[0].find("span", {"class": "price-wrapper"}).get("data-price-amount")

'12999'

In [28]:
# Special price for products in offer
page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})[2].find("span", {"class": "special-price"}).find("span", {"class": "price-wrapper"}).get("data-price-amount")

'6099'

In [29]:
# Regular price for products in offer
page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})[2].find("span", {"class": "old-price"}).find("span", {"class": "price-wrapper"}).get("data-price-amount")

'7899'

In [30]:
# Let's get all the products
products = [
    {
    "name": item.find("a", {"class": "product-item-link"}).get_text().strip(),
    "link": item.find("a", {"class": "product-item-link"}).get("href"),
    "price": item.find("span", {"class": "price-wrapper"}).get("data-price-amount")
    }
    for item in page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"})
]

# We display just two
products[:2]

[{'name': 'AKAI FRIDGE DOUBLE DOOR DISPLAY 520L BLACK',
  'link': 'https://melcom.com/categories/electronics-appliances/refrigerators-freezers/akai-fridge-double-door-display-520l-black.html',
  'price': '12999'},
 {'name': 'AKAI FRIDGE SINGLE DOOR DISPLAY 289L BLACK',
  'link': 'https://melcom.com/categories/electronics-appliances/refrigerators-freezers/akai-fridge-single-door-display-289l-black.html',
  'price': '8499'}]

### Pagination

We need to identify the way this website gives us the link for the new page.

In [31]:
next_page = page.find("ul", {"class": "pages-items"}).find("a",{"class": "next"}).get("href")

next_page

'https://melcom.com/categories/electronics-appliances/refrigerators-freezers.html?p=2'

In [32]:
# Let's go to the last page and check what comes out if we look for the next one

driver.safe_get("https://melcom.com/categories/electronics-appliances/refrigerators-freezers.html?p=3", robots_melcom)

page = driver.execute_script("return document.body.innerHTML;")
# Parsing the HTML with BeautifulSoup
page = BeautifulSoup(page, "html.parser")

In [33]:
page.find("ul", {"class": "pages-items"}).find("a",{"class": "next"}) is None

True

In [34]:
page.find("ul", {"class": "pages-items"}).find("a",{"class": "next"}).get("href")

AttributeError: 'NoneType' object has no attribute 'get'

In [35]:
# We also need to consider that pages without pagination do not even have a ul tag of class pages-items
driver.safe_get("https://melcom.com/categories/electronics-appliances/washing-machines.html", robots_melcom)

page = driver.execute_script("return document.body.innerHTML;")
# Parsing the HTML with BeautifulSoup
page = BeautifulSoup(page, "html.parser")

In [36]:
page.find("ul", {"class": "pages-items"}) is None

True

In [38]:
page.find("ul", {"class": "pages-items"}).find("a",{"class": "next"}) 

This is one of the cases where grabbing values directly with Selenium could be more straightforward. Alternatively, we could select the link in fewer steps (for instance, skipping the ul and searching directly for the a tag with class next).

However, I believe that the additional robustness of more granular selectors is worth the extra effort and overhead of a try-except statement. Again, this is an opinion; different choices may be completely valid.

Let's build a function that starts from a category link and gets data for all products in a category from the catalogue page, moving from page to page.

In [39]:
def scrape_category(link: str, category: str, Item: BaseModel, wd: webdriver, robots: RobotFileParser, delay: float = 1) -> list:
    """Function to scrape a category from catalogue pages with Selenium
    following pagination.
    Parameters:
        link (str): starting link for a category
        category (str): category name
        Item (BaseModel): class of the data object for the specific source
        wd (webdriver): Selenium Webdriver with User-Agent properly set and safe_get() method
        robots (RobotFileParser): Initialized robots.txt parsed object for the specific website
        delay (float): delay in seconds between calls to prevent overloading the source and allow pages to fully render

    Returns:
        list of products with all information

    """
    wd.safe_get(link, robots)
    time.sleep(delay)
    # Use the webdriver passed in as `wd`, not the global `driver`
    page = wd.execute_script("return document.body.innerHTML;")
    # Parsing the HTML with BeautifulSoup
    page = BeautifulSoup(page, "html.parser")
    results = []
    for item in page.find("ol", {"class": "container-products-switch"}).find_all("div", {"class": "product-item-details"}):
        product = {}
        # Parse product information
        product["name"] = item.find("a", {"class": "product-item-link"}).get_text().strip()
        product["link"] = item.find("a", {"class": "product-item-link"}).get("href")
        if item.find("span", {"class": "old-price"}) is not None:
            product["price"] = item.find("span", {"class": "special-price"}).find("span", {"class": "price-wrapper"}).get("data-price-amount")
            product["regular_price"] = item.find("span", {"class": "old-price"}).find("span", {"class": "price-wrapper"}).get("data-price-amount")
        else:
            product["price"] = item.find("span", {"class": "price-wrapper"}).get("data-price-amount")

        results.append(product)
    
    # Parse all products according to the data class
    results = [Item(**res, category=category) for res in results]

    # Follow pagination if exists
    try:
        next_page = page.find("ul", {"class": "pages-items"}).find("a",{"class": "next"})
        if next_page is not None:
            next_page = next_page.get("href")
            next_results = scrape_category(link=next_page, category=category, Item=Item, wd=wd, robots= robots, delay=delay)
            results.extend(next_results)
    except AttributeError as e:
        print(e)  # This should go away in production
        pass
    
    return results
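The function above follows pagination recursively. The same logic could also be written as a loop, which avoids deep recursion on categories with many pages and makes it easy to cap the number of pages visited. The sketch below is independent of Selenium: `scrape_all_pages`, `parse_page`, and the `fake_site` data are hypothetical stand-ins for the real page loading and parsing.

```python
from typing import Callable, Optional

# A parsed page: (items found on this page, link to the next page or None)
PageResult = tuple[list[dict], Optional[str]]

def scrape_all_pages(start_url: str, parse_page: Callable[[str], PageResult], max_pages: int = 100) -> list[dict]:
    """Collect items page by page until there is no next link, a link repeats, or max_pages is reached."""
    results = []
    seen = set()
    url = start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        items, url = parse_page(url)  # `url` becomes the next page link, or None on the last page
        results.extend(items)
    return results

# Fake three-page category used instead of a live website
fake_site = {
    "cat?p=1": ([{"name": "A"}, {"name": "B"}], "cat?p=2"),
    "cat?p=2": ([{"name": "C"}], "cat?p=3"),
    "cat?p=3": ([{"name": "D"}], None),
}

all_items = scrape_all_pages("cat?p=1", lambda url: fake_site[url])
print([item["name"] for item in all_items])
```

The `seen` set also protects against a website whose "next" link points back to a page already visited.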


In [40]:
# Let's scrape a couple of categories to test
melcom_data = []

for cat in categories[:2]:  # Limit to two categories for test
    melcom_data.extend(scrape_category(cat.get("link"), cat.get("name"), Melcom, driver, robots_melcom, 2))

In [41]:
# Always remember to close the browser!
driver.quit()

### Save data as CSV

In [42]:
import pandas as pd

melcom_catalog_df = pd.DataFrame([prod.dict(exclude_none=True) for prod in melcom_data])
melcom_catalog_df.to_csv("melcom_catalog_{}.csv".format(date.today().strftime("%Y-%m-%d")), index=False)


In [43]:
melcom_catalog_df.head()

| | link | source | category | name | price | currency | date | regular_price |
|---|---|---|---|---|---|---|---|---|
| 0 | https://melcom.com/categories/electronics-appl... | Melcom | Television & Audio | YAMAHA FRONT SURROUND SYSTEM BLACK YAS209 | 6499.0 | GHS | 2023-03-15 | |
| 1 | https://melcom.com/categories/electronics-appl... | Melcom | Television & Audio | YAMAHA MUSIC SYNTHESIZER MODX8 B/E | 34999.0 | GHS | 2023-03-15 | |
| 2 | https://melcom.com/categories/electronics-appl... | Melcom | Television & Audio | YAMAHA DIGITAL KEYBOARD WITH ADAPTOR PSR-F52Y | 2099.0 | GHS | 2023-03-15 | |
| 3 | https://melcom.com/categories/electronics-appl... | Melcom | Television & Audio | YAMAHA RECORDER YRS-24B ID | 99.0 | GHS | 2023-03-15 | |
| 4 | https://melcom.com/catalog/product/view/id/514... | Melcom | Television & Audio | LG LED TV 43 SAT SMT UHD 4K 43UQ7006LB | 5799.0 | GHS | 2023-03-15 | 7199.0 |


## Dynamic websites with Selenium
1. Get robots.txt and create an exclusion protocol for other calls [webdriver subclass and wrapper for driver.get()]
2. Selenium IDE
3. Integrate Selenium IDE functions with webdriver subclass
4. Data Model
5. Scraping
6. Save data to CSV


More information on the Selenium framework: [https://www.selenium.dev/](https://www.selenium.dev/)

## Tunisie Booking

### Get and parse robots.txt

In [44]:
booking_root = "https://www.tunisiebooking.com/"
robots_booking = RobotFileParser(booking_root + "robots.txt")
robots_booking.read()

### Selenium IDE

This website, like many travel websites, requires filling in a form with travel dates: [https://www.tunisiebooking.com/](https://www.tunisiebooking.com/). We use Selenium IDE to record our interaction with the website as we input the data.  

Open Chromium and start a new project in Selenium IDE. Record new test cases:
1. Book a room for 2 people in the current month in the default city
2. Book a room for 2 people in the next month in the default city
3. Book a room for 2 people in the current month in another city

From the recorded cases, you can export Python code. This will be the base for our scraping job, and we will add a specific method for each of those operations to a custom webdriver.Chrome subclass. The main modifications we make:
1. Turn some input data (city, check-in day, check-out day) into parameters we can change
2. Use our safe_get() method instead of the regular get().

In [45]:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

class BookingChrome(webdriver.Chrome):
    def safe_get(self, url: str, robots: RobotFileParser):
        """Load a webpage in the current browser session
        if allowed by robots.txt for the specific website
        """
        if robots.can_fetch(self.execute_script('return navigator.userAgent'), url):
            self.get(url)

    def search_rooms_same_month(self, robots: RobotFileParser, city: str, checkin: int, checkout: int):
        self.safe_get("https://www.tunisiebooking.com/", robots)
        self.set_window_size(960, 946)
        dropdown = self.find_element(By.ID, "ville_des")
        # The leading "." keeps the XPath search inside the dropdown instead of the whole document
        dropdown.find_element(By.XPATH, f".//option[. = '{city}']").click()
        element = self.find_element(By.ID, "ville_des")
        actions = ActionChains(self)
        actions.move_to_element(element).click_and_hold().perform()
        element = self.find_element(By.ID, "ville_des")
        actions = ActionChains(self)
        actions.move_to_element(element).perform()
        element = self.find_element(By.ID, "ville_des")
        actions = ActionChains(self)
        actions.move_to_element(element).release().perform()
        self.find_element(By.ID, "check1").click()
        self.find_element(By.LINK_TEXT, str(checkin)).click()
        self.find_element(By.LINK_TEXT, str(checkout)).click()
        self.find_element(By.ID, "boutonr").click()
        self.execute_script("window.scrollTo(0,276)")
        # Not all cities may have more results...
        try:
            self.find_element(By.CSS_SELECTOR, "#plus_res > img").click()
        except Exception:  # Typically raised when the "more results" button is not on the page
            pass

In [46]:
# Test
driver = BookingChrome(service=chromedriver_path, options=opts)
driver.implicitly_wait(0)  # No implicit wait here: element lookups succeed or fail immediately
driver.safe_get(booking_root, robots_booking)

In [55]:
driver.search_rooms_same_month(robots_booking, "Douz", 25, 27)  # Days are purely random

In [56]:
# We select the buttons to get more information for each hotel
hotels = driver.find_elements(By.XPATH, '//*[@id="tailleprix"]')

In [57]:
# We click on the first one to get an idea of the webpage structure
hotels[0].click()

In [53]:
# Check what are the options for the rate
from selenium.webdriver.support.ui import Select 

[opt.text for opt in Select(driver.find_element(By.XPATH, '//*[@id="arrangement"]')).options]
# We will need to select those in a loop to get all the info


['Demi Pension', 'Petit Dejeuner', 'Pension Complete']

In [None]:
# We could loop over the rate types to see all prices
for opt in Select(driver.find_element(By.XPATH, '//*[@id="arrangement"]')).options:
    dropdown = driver.find_element(By.XPATH, '//*[@id="arrangement"]')
    # The leading "." keeps the XPath search inside the dropdown instead of the whole document
    dropdown.find_element(By.XPATH, ".//option[. = '{}']".format(opt.text)).click()
    time.sleep(5)

In [59]:
# Get the page HTML and parse it with BeautifulSoup
page = driver.execute_script("return document.body.innerHTML;")
page = BeautifulSoup(page, "html.parser")

In [60]:
# Hotel name
page.find("div", {"class": "bloc_titre_hotels"}).find("h2").get_text().strip()

'Sahara Douz Douz'

In [61]:
# Hotel star rating
page.find("div", {"class": "bloc_titre_hotels"}).find("span", {"class": "h2styless"}).find_all("i")

[<i class="icon-star-1" style="color:#c9c7c7;font-size: 14px; padding: 0px; margin: -5px; padding-left:8px;"></i>,
 <i class="icon-star-1" style="color:#c9c7c7;font-size: 14px; padding: 0px; margin: -5px; padding-left:8px;"></i>,
 <i class="icon-star-1" style="color:#c9c7c7;font-size: 14px; padding: 0px; margin: -5px; padding-left:8px;"></i>,
 <i class="icon-star-1" style="color:#c9c7c7;font-size: 14px; padding: 0px; margin: -5px; padding-left:8px;"></i>]

In [62]:
# We just count the stars...this may need changes for half stars (if the website has them)
len(page.find("div", {"class": "bloc_titre_hotels"}).find("span", {"class": "h2styless"}).find_all("i"))

4

In [63]:
# Room information and price
page.find("div", {"id": "result_par_arrangement"}).find_all("input")

[<input checked="checked" id="choix_formule1_dp_1" name="choix_formule1" onclick="slectform('1','dp','1','1')" type="radio" value="dp@@1@@Chambre  Double Standard@@15@@DLX-G-ROOM-DBL@@652729"/>,
 <input id="chambre1" name="chambre1" type="hidden" value=""/>,
 <input id="libelle_chambre1" name="libelle_chambre1" type="hidden" value="Chambre  Double Standard"/>,
 <input id="choi_chambre1" name="choi_chambre1" type="hidden" value=""/>,
 <input id="price_dp_1_1" name="price_dp_1_1" type="hidden" value="194"/>,
 <input id="pricebar_dp_1_1" name="pricebar_dp_1_1" type="hidden" value="0"/>]

In [64]:
import re
# Room type
page.find("div", {"id": "result_par_arrangement"}).find("input", id=re.compile("^libelle_chambre")).get("value").strip() # Need to check when multiple rooms are available

'Chambre  Double Standard'

In [65]:
# Price
page.find("div", {"id": "result_par_arrangement"}).find("input", id=re.compile("^price_")).get("value").strip() # Need to check when multiple rooms are available

'194'

In [66]:
# Availability
page.find("div", {"id": "result_par_arrangement"}).find("div", {"id": "disponible"}).get_text() # Need to check when multiple rooms are listed and some is not available

' Disponible'

For exploration purposes, we manually select another city and other dates to reach a webpage where a hotel has multiple rooms, and we parse the new webpage.

In [67]:
# Manually navigate into Chromium to get the new webpage
page = driver.execute_script("return document.body.innerHTML;")
page = BeautifulSoup(page, "html.parser")

In [68]:
page.find("div", {"class": "bloc_titre_hotels"}).find("h2").get_text().strip()

'Royal Azur Thalassa Hammamet'

In [69]:
len(page.find("div", {"class": "bloc_titre_hotels"}).find("span", {"class": "h2styless"}).find_all("i"))

5

In [70]:
# Each result is inside a div with class line_result. We put them into a list and parse individually
room_types = page.find("div", {"id": "result_par_arrangement"}).find_all("div", {"class": "line_result"})

# We loop over the results and parse them
for room in room_types:
    print("Room type: {}".format(room.find("input", id=re.compile("^libelle_chambre")).get("value").strip()))
    print("Price: {}".format(room.find("input", id=re.compile("^price")).get("value")))
    print("Availability: {}".format(room.find("div", {"id": "disponible"}).get_text()))
    print("-"*20)

Room type: Chambre Double Premium
Price: 244
Availability:  Disponible
--------------------
Room type: Chambre Double  Premium Vue Mer
Price: 303
Availability:  Disponible
--------------------
Room type: Bungalow Double Vue Mer
Price: 377
Availability:  Disponible
--------------------
Room type: Suite Double Premium
Price: 392
Availability:  Disponible
--------------------
Room type: Suite Double Business
Price: 437
Availability:  Disponible
--------------------


### Data model

The data model in this case is quite different from the one we used for products. We need to consider other characteristics such as hotel star rating, city, room type, etc. The data class below is an example that can be extended according to the specific needs of CPI calculation.

In [71]:
class Room(BaseModel):
    source: str
    hotel_name: str
    hotel_stars: int = None  # The star rating is extracted as an integer count of star icons
    city: str
    room_type: str
    n_pax: int = 2
    n_days: int
    checkin: str
    checkout: str
    price: float
    rate_type: str
    currency: str
    description: str = None
    availability: str = None
    date: str = date.today().strftime("%Y-%m-%d")

class TnBooking(Room):
    source: str = "TunisieBooking"
    currency: str = "EUR"

### Scraping

We create some helper functions for parsing data, considering that we will need to select the rate type each time in each hotel.

In [72]:
def get_room_rates(page: str, rate_type: str) -> list:
    """Function to extract the room rates from a webpage
    Parameters:
        page (str): HTML page with results from a room search in a hotel
        rate_type (str): type of rate (breakfast only, all inclusive, etc.)

    Returns:
        list of dictionaries with room prices and information
    """
    page = BeautifulSoup(page, "html.parser")
    results = []

    room_list = page.find("div", {"id": "result_par_arrangement"}).find_all("div", {"class": "line_result"})
    for room in room_list:
        res = {}
        res["room_type"] = room.find("input", id=re.compile("^libelle_chambre")).get("value").strip()
        res["price"] = room.find("input", id=re.compile("^price")).get("value")
        res["availability"] = room.find("div", {"id": "disponible"}).get_text()
        res["rate_type"] = rate_type
        results.append(res)

    return results
        

In [73]:
def get_hotel_name_stars(page: str) -> tuple[str, int]:
    """Function to extract hotel name and star rating from a webpage
    Parameters:
        page (str): HTML page with results from a room search in a hotel

    Returns:
        hotel_name (str): name of the hotel
        hotel_stars (int): star rating of the hotel
    """
    page = BeautifulSoup(page, "html.parser")
    hotel_name = page.find("div", {"class": "bloc_titre_hotels"}).find("h2").get_text().strip()
    hotel_stars = len(page.find("div", {"class": "bloc_titre_hotels"}).find("span", {"class": "h2styless"}).find_all("i"))
    return hotel_name, hotel_stars

We also define a set of inputs for check-in and check-out dates, and city name. These variables can be supplied in production in several different ways:

1. Environment variables - see `os.getenv()` [https://docs.python.org/3/library/os.html](https://docs.python.org/3/library/os.html)
2. ConfigParser - see [https://docs.python.org/3/library/configparser.html](https://docs.python.org/3/library/configparser.html)
3. Command line arguments with argparse [https://docs.python.org/3/library/argparse.html](https://docs.python.org/3/library/argparse.html) or other tools
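As a rough sketch of how the first and third options can be combined, the function below reads the search settings from command line arguments and falls back to environment variables, then to fixed defaults. The function name, the `SCRAPER_*` variable names, and the argument names are hypothetical, chosen only for this example.

```python
import argparse
import os

def read_search_settings(argv=None) -> argparse.Namespace:
    """Read city and dates from command line arguments, then environment variables, then defaults."""
    parser = argparse.ArgumentParser(description="Room search settings")
    parser.add_argument("--city", default=os.getenv("SCRAPER_CITY", "Mahdia"))
    parser.add_argument("--checkin", default=os.getenv("SCRAPER_CHECKIN", "2023-03-25"))
    parser.add_argument("--checkout", default=os.getenv("SCRAPER_CHECKOUT", "2023-03-27"))
    return parser.parse_args(argv)

# Passing the argument list explicitly keeps the example independent of how the script is launched
settings = read_search_settings(["--city", "Douz"])
print(settings.city, settings.checkin, settings.checkout)
```

Calling `read_search_settings()` with no arguments reads the real command line, which is what a scheduled scraping job would normally do.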

In [74]:
checkin_date = "2023-03-25"
checkin = int(checkin_date.split("-")[-1])
checkout_date = "2023-03-27"
checkout = int(checkout_date.split("-")[-1])

n_days = checkout - checkin     # This is very rough and only suitable for training.
                                # It breaks when checkin and checkout are in different months.
                                # You should use a timedelta object in production.
                                # See https://docs.python.org/3/library/datetime.html

city = "Mahdia"
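A minimal sketch of the date-based calculation mentioned in the comments: subtracting two `date` objects gives a `timedelta`, whose `days` attribute is the number of nights, even across a month boundary. The helper name `nights_between` and the second pair of dates are made up for illustration.

```python
from datetime import date

def nights_between(checkin_date: str, checkout_date: str) -> int:
    """Number of nights between two ISO dates (YYYY-MM-DD)."""
    checkin = date.fromisoformat(checkin_date)
    checkout = date.fromisoformat(checkout_date)
    nights = (checkout - checkin).days  # date - date gives a timedelta
    if nights <= 0:
        raise ValueError("checkout must be after checkin")
    return nights

same_month = nights_between("2023-03-25", "2023-03-27")
across_months = nights_between("2023-03-30", "2023-04-02")
print(same_month, across_months)
```

Note that the search method defined earlier still takes day-of-month integers, so it remains limited to stays within a single month.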

We search for rooms in Mahdia from the 25th to the 27th of current month

In [75]:
driver.search_rooms_same_month(robots_booking, city, checkin, checkout) 

In [76]:
hotels = driver.find_elements(By.XPATH, '//*[@id="tailleprix"]')

In [77]:
booking_data = []

for h in hotels:
    h.click()
    time.sleep(2)
    room_options = [opt.text for opt in Select(driver.find_element(By.XPATH, '//*[@id="arrangement"]')).options]
    page = driver.execute_script("return document.body.innerHTML;")
    hotel_name, hotel_stars = get_hotel_name_stars(page)
    for room_opt in room_options:
        dropdown = driver.find_element(By.XPATH, '//*[@id="arrangement"]')
        # The leading "." keeps the XPath search inside the dropdown instead of the whole document
        dropdown.find_element(By.XPATH, ".//option[. = '{}']".format(room_opt)).click()
        time.sleep(2)
        page = driver.execute_script("return document.body.innerHTML;")
        res = get_room_rates(page, room_opt)
        parsed_res = [
            TnBooking(
                **r,
                hotel_name=hotel_name,
                hotel_stars=hotel_stars,
                city=city,
                n_days=n_days,
                checkin=checkin_date,
                checkout=checkout_date)
            for r in res]
        booking_data.extend(parsed_res)
    time.sleep(1)
    driver.back()
    time.sleep(2)

### Save data as CSV

In [78]:
tunisiebooking_df = pd.DataFrame([item.dict(exclude_none=True) for item in booking_data])
tunisiebooking_df.to_csv("tunisiebooking_{}.csv".format(date.today().strftime("%Y-%m-%d")), index=False)

In [79]:
tunisiebooking_df.head()

| | source | hotel_name | hotel_stars | city | room_type | n_pax | n_days | checkin | checkout | price | rate_type | currency | availability | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | TunisieBooking | Mahdia Palace Thalasso Mahdia | 5 | Mahdia | Chambre Double Standard | 2 | 2 | 2023-03-25 | 2023-03-27 | 149.0 | Demi Pension plus | EUR | Disponible | 2023-03-15 |
| 1 | TunisieBooking | Mahdia Palace Thalasso Mahdia | 5 | Mahdia | Chambre Double Vue Mer | 2 | 2 | 2023-03-25 | 2023-03-27 | 149.0 | Demi Pension plus | EUR | 3 Disponibles | 2023-03-15 |
| 2 | TunisieBooking | Mahdia Palace Thalasso Mahdia | 5 | Mahdia | Chambre Double Standard | 2 | 2 | 2023-03-25 | 2023-03-27 | 120.0 | Petit Dejeuner | EUR | Disponible | 2023-03-15 |
| 3 | TunisieBooking | Mahdia Palace Thalasso Mahdia | 5 | Mahdia | Chambre Double Vue Mer | 2 | 2 | 2023-03-25 | 2023-03-27 | 120.0 | Petit Dejeuner | EUR | 3 Disponibles | 2023-03-15 |
| 4 | TunisieBooking | Mahdia Beach & Aquapark Mahdia | 4 | Mahdia | Chambre Double Standard | 2 | 2 | 2023-03-25 | 2023-03-27 | 117.0 | Demi Pension | EUR | Disponible | 2023-03-15 |


In [None]:
# Always remember to close the browser!
driver.quit()