# Data Collection and Storage

This notebook describes the collection and storage of the data used for the project application.

Contents
--------
1. [Fetching data through API calls](#api)
    A. [Get geocoding information](#geocoding)
    B. [Get weather forecast](#weather)
2. [Web scraping](#scraping)
    A. [Reverse engineering the website requests](#scraping_reverse)
    B. [Get hotel information by scraping](#scraping_fetch)
3. [Storage in a database](#database)
4. [Storage in a data lake](#datalake)


## <a name="api"></a>API Calls

We call external APIs to get geocoding information from the name of a place and the weather forecast at given geographic coordinates. Utilities for the corresponding API calls are defined in the module `etl/api_mgmt.py`.

To use them, we first set up a `requests.Session`. Since the API servers may limit the number of allowed requests, we add a retry policy to the HTTP session. We also load a file containing the names of the locations of interest.

In [None]:
import csv
import time

import requests
from requests.adapters import HTTPAdapter, Retry

In [None]:
# setup session with retry policy in case of failure
s = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[403, 502, 503, 504])
s.mount('https://', HTTPAdapter(max_retries=retries))

# Load locations of interest
with open("./data/locations.csv", 'rt', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    next(reader, None) # remove header
    locations = [f"{row[0]}, {row[1]}" for row in reader]

### <a name="geocoding"></a>Get geocoding information

We use the [Nominatim API](https://nominatim.org/) to fetch geocoding information. The API is quite restrictive in its [usage policy](https://operations.osmfoundation.org/policies/nominatim/). Most importantly, the number of requests is limited to one per second, which forces us to throttle our request rate accordingly.

In [None]:
from etl import get_coords

coordinates = {}
for loc in locations:
    coordinates[loc] = get_coords(loc)
    time.sleep(1.1)

coordinates['Rouen, France']
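
The implementation of `get_coords` lives in `etl/api_mgmt.py` and is not reproduced here. As a rough illustration of what such a helper might do, the sketch below queries the public Nominatim search endpoint and keeps the best match. The function name `get_coords_sketch`, the `user_agent` value, and the returned keys `latitude` and `longitude` are assumptions made for this example, not the project's actual code.

```python
NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def get_coords_sketch(session, place, user_agent="example-app/0.1"):
    """Return {'latitude': ..., 'longitude': ...} for `place`, or None if not found.

    `session` is expected to behave like a `requests.Session`.
    The `user_agent` default is a placeholder: the Nominatim usage policy
    asks clients to identify their application.
    """
    response = session.get(
        NOMINATIM_URL,
        params={"q": place, "format": "jsonv2", "limit": 1},
        headers={"User-Agent": user_agent},
        timeout=10,
    )
    response.raise_for_status()
    results = response.json()
    if not results:
        return None
    best = results[0]
    # Nominatim returns coordinates as strings
    return {"latitude": float(best["lat"]), "longitude": float(best["lon"])}


# Possible usage with the session `s` defined earlier (network access required):
# get_coords_sketch(s, "Rouen, France")
```

Passing the session explicitly lets the helper reuse the retry policy configured earlier. The throttling itself stays in the calling loop.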

### <a name="weather"></a>Get weather forecast

We use the [Open-Meteo API](https://open-meteo.com/en/docs) to get weather forecast information.

In [None]:
from etl import get_weather_forecast

weather_forecast = {}
for loc, coords in coordinates.items():
    weather_forecast[loc] = get_weather_forecast(s, **coords)

weather_forecast['Rouen, France']
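
For illustration only, a helper like `get_weather_forecast` could be sketched as follows. The actual implementation is in `etl/api_mgmt.py` and may differ. The function name `get_weather_forecast_sketch`, the choice of daily variables, and the shape of the returned records are assumptions of this example.

```python
OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def get_weather_forecast_sketch(session, latitude, longitude):
    """Return a list of daily forecast records for the given coordinates.

    `session` is expected to behave like a `requests.Session`.
    The requested variables (daily max temperature and precipitation sum)
    are an arbitrary choice for this sketch.
    """
    response = session.get(
        OPEN_METEO_URL,
        params={
            "latitude": latitude,
            "longitude": longitude,
            "daily": "temperature_2m_max,precipitation_sum",
            "timezone": "auto",
        },
        timeout=10,
    )
    response.raise_for_status()
    daily = response.json()["daily"]
    # The API returns one list per variable; regroup them into one record per day
    return [
        {"date": day, "temperature_max": temp, "precipitation": rain}
        for day, temp, rain in zip(
            daily["time"], daily["temperature_2m_max"], daily["precipitation_sum"]
        )
    ]


# Possible usage with the session `s` defined earlier (network access required):
# get_weather_forecast_sketch(s, latitude=49.4404591, longitude=1.0939658)
```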

## <a name="scraping"></a>Web scraping

Hotel information at the selected locations is collected by scraping [booking.com](https://www.booking.com). This approach is more complex and less stable than API calls. It is standard practice to first study how requests and responses relate to interactions in the web browser. This will allow us to tailor automated requests for scraping.

### <a name="scraping_reverse"></a>Reverse engineering the website requests

The first step is to get the request resulting from regular user interaction. We thus go to the index page [https://www.booking.com/index.en-gb.html](https://www.booking.com/index.en-gb.html) and fill in the search bar. Here we look for a hotel in Rouen, France for 2 adults between March 1st and March 9th.

<img src="media/booking_search.png" alt="booking_search" width="1000"/>

After clicking on "Search", the request actually sent by the web browser is displayed in the address bar. This allows us to recover the parameters of the GET request.

In [None]:
req_url = 'https://www.booking.com/searchresults.en-gb.html?ss=Rouen%2C+France&efdco=1&label=gen173nr-1FCAEoggI46AdICVgEaE2IAQGYAQm4ARjIAQ_YAQHoAQH4AQKIAgGoAgS4AsWf4r0GwAIB0gIkYmMwNmI1MjktMzkyZS00N2FjLTllNWYtOWZmZGIwMWZjODhj2AIF4AIB&aid=304142&lang=en-gb&sb=1&src_elem=sb&src=index&dest_id=-1462807&dest_type=city&ac_position=0&ac_click_type=b&ac_langcode=en&ac_suggestion_list_length=5&search_selected=true&search_pageview_id=a6f166e2b250077a&ac_meta=GhBhNmYxNjZlMmIyNTAwNzdhIAAoATICZW46DVJvdWVuLCBGcmFuY2VAAEoAUAA%3D&checkin=2025-03-01&checkout=2025-03-09&group_adults=2&no_rooms=1&group_children=0'

req_url.split('?')[1].split('&')

['ss=Rouen%2C+France',
 'efdco=1',
 'label=gen173nr-1FCAEoggI46AdICVgEaE2IAQGYAQm4ARjIAQ_YAQHoAQH4AQKIAgGoAgS4AsWf4r0GwAIB0gIkYmMwNmI1MjktMzkyZS00N2FjLTllNWYtOWZmZGIwMWZjODhj2AIF4AIB',
 'aid=304142',
 'lang=en-gb',
 'sb=1',
 'src_elem=sb',
 'src=index',
 'dest_id=-1462807',
 'dest_type=city',
 'ac_position=0',
 'ac_click_type=b',
 'ac_langcode=en',
 'ac_suggestion_list_length=5',
 'search_selected=true',
 'search_pageview_id=a6f166e2b250077a',
 'ac_meta=GhBhNmYxNjZlMmIyNTAwNzdhIAAoATICZW46DVJvdWVuLCBGcmFuY2VAAEoAUAA%3D',
 'checkin=2025-03-01',
 'checkout=2025-03-09',
 'group_adults=2',
 'no_rooms=1',
 'group_children=0']

We can already get a few insights from the request URL:
- `ss` (search string) corresponds to the text written in the search bar,
- `checkin` and `checkout` correspond to the travel dates (the calendar widget in the center),
- `group_adults`, `no_rooms` and `group_children` correspond to the input of the right widget.
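
The manual `split` gives the raw, still URL-encoded pairs. As a complementary sketch, the standard library can decode them directly. The helper name `query_params` and the shortened example URL are ours, keeping only the parameters discussed in the list.

```python
from urllib.parse import parse_qs, urlsplit


def query_params(url):
    """Return the query parameters of `url` as a flat dict of decoded strings."""
    parsed = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; keep the first one
    return {key: values[0] for key, values in parsed.items()}


# Shortened version of the request URL, restricted to the parameters of interest
example_url = (
    "https://www.booking.com/searchresults.en-gb.html"
    "?ss=Rouen%2C+France&checkin=2025-03-01&checkout=2025-03-09"
    "&group_adults=2&no_rooms=1&group_children=0"
)
params = query_params(example_url)
print(params["ss"])  # Rouen, France
```

Note that `%2C` and `+` are decoded to a comma and a space, which makes the link with the text typed in the search bar explicit.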

It turns out that a valid request can be made with a different approach, by specifying only the latitude and longitude of the destination. For instance, the URL
`'https://www.booking.com/searchresults.en-gb.html?latitude=49.4404591&longitude=1.0939658'` yields a page with the hotels ranked by increasing distance to the coordinates. The parameters above can also be specified to refine the search, but we will not use them here. The picture below shows the results obtained when requesting the above URL.

<img src="media/booking_optimized_request.png" alt="booking_opt_search" width="700"/>
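
A coordinate-based URL of this form can be assembled programmatically. The sketch below is one possible way to do it. The helper name `build_search_url` is hypothetical, and the optional keyword arguments simply reuse the parameter names observed in the browser request.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.booking.com/searchresults.en-gb.html"


def build_search_url(latitude, longitude, **extra):
    """Build a search URL from coordinates, plus optional extra query parameters."""
    params = {"latitude": latitude, "longitude": longitude, **extra}
    return f"{SEARCH_URL}?{urlencode(params)}"


print(build_search_url(49.4404591, 1.0939658))
# Optional refinement with parameters seen in the browser request:
print(build_search_url(49.4404591, 1.0939658, checkin="2025-03-01", checkout="2025-03-09"))
```

Using `urlencode` rather than string concatenation takes care of escaping if values such as a search string are added later.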

### <a name="scraping_fetch"></a>Get hotel information by scraping

The above analysis helped to set up the scraping functionality, which is implemented in the module `etl/scraping_mgmt.py`. Let us briefly detail the scraping procedure:
- We reach the target website through an automated browser driven by Selenium WebDriver. We favor this approach over alternatives such as `scrapy` because our target, booking.com, implements infinite scrolling. This feature is implemented in JavaScript and is difficult to trigger with a scraping tool that cannot execute JavaScript. A browser executes JavaScript natively and is therefore better suited to our task.
- For each location, we send a request with the URL `'https://www.booking.com/searchresults.en-gb.html?latitude={latitude}&longitude={longitude}'`.
- We scrape the hotel data that we need, scrolling down if necessary.
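
To make the scrolling step more concrete, here is a minimal sketch of how infinite scrolling could be handled. It is not the code from `etl/scraping_mgmt.py`. The helper name `scroll_to_bottom` and its parameters are assumptions, and it relies only on the `execute_script` method of a Selenium WebDriver instance.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down until the page height stops growing or `max_rounds` is reached.

    `driver` is expected to be a Selenium WebDriver (only `execute_script` is used).
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load additional results
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds


# Possible usage (requires Selenium and a browser driver to be installed):
# from selenium import webdriver
# driver = webdriver.Firefox()
# driver.get("https://www.booking.com/searchresults.en-gb.html"
#            "?latitude=49.4404591&longitude=1.0939658")
# scroll_to_bottom(driver)
# driver.quit()
```

The `max_rounds` cap is a safeguard so that the loop always terminates, even if the page keeps loading content.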

## <a name="database"></a>Storage in a database

The collected data is stored in a database. The functionality to transfer and load data from the database is implemented in the module `etl/db_mgmt.py`.
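
The database engine and schema are not detailed in this notebook. Purely as an illustration of a transfer and load round trip, the sketch below uses an in-memory SQLite database as a stand-in. The table layout, the function names `store_hotels` and `load_hotels`, and the sample rows are placeholders, not the project's actual schema or data.

```python
import sqlite3


def store_hotels(conn, rows):
    """Create the (hypothetical) `hotels` table if needed and insert `rows`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hotels ("
        "id INTEGER PRIMARY KEY, location TEXT NOT NULL, "
        "name TEXT NOT NULL, rating REAL)"
    )
    conn.executemany(
        "INSERT INTO hotels (location, name, rating) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def load_hotels(conn, location):
    """Return (name, rating) pairs for `location`, best rated first."""
    cursor = conn.execute(
        "SELECT name, rating FROM hotels WHERE location = ? ORDER BY rating DESC",
        (location,),
    )
    return cursor.fetchall()


# Placeholder rows, for demonstration only
conn = sqlite3.connect(":memory:")
store_hotels(conn, [("Rouen, France", "Hotel A", 8.4), ("Rouen, France", "Hotel B", 9.1)])
print(load_hotels(conn, "Rouen, France"))
```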

## <a name="datalake"></a>Storage in a data lake

For long-term storage, the collected data is transferred to a data lake in CSV format. We use an AWS S3 bucket for that purpose. The functionality to transfer and load data from the data lake is implemented in the module `etl/s3_mgmt.py`. We format our data as multiple CSV files matching the database structure before uploading them to the S3 bucket. A copy of the files is then downloaded and stored locally.
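
As a rough sketch of this step (the actual code is in `etl/s3_mgmt.py` and may differ), each table can be written to its own CSV file and uploaded with the `upload_file` method of a boto3 S3 client. The function names, the `tables` structure, and the object keys are assumptions of this example.

```python
import csv
import os


def write_csv(path, header, rows):
    """Write `rows` to a CSV file at `path`, preceded by `header`."""
    with open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def upload_tables(s3_client, bucket, tables, local_dir):
    """Write one CSV file per table and upload it to `bucket`.

    `tables` maps a table name to a `(header, rows)` pair.
    `s3_client` is expected to behave like `boto3.client("s3")`.
    Returns the list of uploaded object keys.
    """
    keys = []
    for name, (header, rows) in tables.items():
        path = os.path.join(local_dir, f"{name}.csv")
        write_csv(path, header, rows)
        key = f"{name}.csv"
        s3_client.upload_file(path, bucket, key)  # (Filename, Bucket, Key)
        keys.append(key)
    return keys


# Possible usage (requires boto3 and configured AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# upload_tables(s3, "my-bucket", {"locations": (["name"], [["Rouen, France"]])}, "./data")
# s3.download_file("my-bucket", "locations.csv", "./data/locations_copy.csv")
```

Taking the client as an argument keeps the CSV formatting logic independent of AWS.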