**Plan your trip with Kayak**


> *Company's description*

Kayak is a travel search engine that helps users plan their next trip at the best price.

The company was founded in 2004 by S. Hafner and P. M. English. After a few rounds of fundraising, Kayak was acquired by Booking Holdings, which now holds:
  

*   Booking.com
*   Kayak
*   Priceline
*   Agoda
*   RentalCars
*   OpenTable

With over $300 million in revenue a year, Kayak operates in almost all countries and languages to help its users book travel across the globe.



> *Project*

The marketing team needs help with a new project that gives users more information about the destinations they plan to visit.

The Kayak marketing team would like to create an application that recommends where people should plan their next holiday. The application should be based on real data:

*   Weather
*   Hotels in the area

The application should then be able to recommend the best destinations and hotels based on these variables at any given time.

**Goals**

Your job will be to: 
*   Scrape data from destinations
*   Get weather data from each destination
*   Get hotels' information about each destination
*   Store all the information above in a data lake
*   Extract, transform and load cleaned data from your datalake to a data warehouse

**Scope of this project**

The marketing team wants to focus first on the best cities to travel to in France. Here are the top 35 cities according to One-Week-In.com:

    Mont Saint Michel
    St Malo
    Bayeux
    Le Havre
    Rouen
    Paris
    Amiens
    Lille
    Strasbourg
    Chateau du Haut Koenigsbourg
    Colmar
    Eguisheim
    Besancon
    Dijon
    Annecy
    Grenoble
    Lyon
    Verdon Gorge
    Bormes les Mimosas
    Cassis
    Marseille
    Aix en Provence
    Avignon
    Uzès
    Nîmes
    Aigues Mortes
    Saintes Maries de la mer
    Collioure
    Carcassonne
    Ariege
    Toulouse
    Montauban
    Biarritz
    Bayonne
    La Rochelle

**Helpers**


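Get coordinates and weather data with APIs:
*   https://nominatim.org/ to get the GPS coordinates of each city
*   https://openweathermap.org/appid to get the weather forecast for those coordinates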
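
A minimal sketch of the coordinate lookup, kept offline so it runs without network access. The helper names `nominatim_params` and `first_coordinates` are hypothetical, and the sample response is hand-written in the shape Nominatim returns (a list of hits whose `lat` and `lon` are strings); its values are illustrative, not real API output.

```python
from typing import Optional, Tuple

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def nominatim_params(city: str) -> dict:
    """Query parameters for a free-text search that returns at most one hit."""
    return {"q": city, "format": "json", "limit": 1}


def first_coordinates(results: list) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) of the first hit as floats, or None when nothing matched."""
    if not results:
        return None
    hit = results[0]
    return float(hit["lat"]), float(hit["lon"])


# Hand-written sample in Nominatim's JSON shape (illustrative values only)
sample = [{"display_name": "Example place", "lat": "48.6360", "lon": "-1.5115"}]
coords = first_coordinates(sample)
print(coords)
```

In a live run the parameters would be passed as `requests.get(NOMINATIM_URL, params=nominatim_params(city), headers=...)`. Nominatim's usage policy asks clients to send an identifying `User-Agent` header. Checking for an empty result list avoids an `IndexError` when a city name does not match.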

Save all the results in a .csv file containing the name of each city and a unique id.
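
As a rough illustration, the sketch below writes city names with a sequential id using only the standard library. The column names `id` and `city` and the helper `cities_to_csv` are assumptions, and the list is shortened for the example; a pandas `DataFrame.to_csv` call would work just as well.

```python
import csv
import io

# Shortened list for the example
cities = ["Mont Saint Michel", "St Malo", "Bayeux"]


def cities_to_csv(names, out):
    """Write one row per city with a 1-based unique id."""
    writer = csv.writer(out)
    writer.writerow(["id", "city"])
    for city_id, name in enumerate(names, start=1):
        writer.writerow([city_id, name])


# Write to an in-memory buffer; pass an open file object to write to disk instead
buffer = io.StringIO()
cities_to_csv(cities, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```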

Plot destinations on a map using plotly

Scrape Booking.com to obtain the following information:

*   Hotel name
*   URL of its Booking.com page
*   Coordinates (latitude and longitude)
*   Score given by the website's users
*   Text description of the hotel
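
One possible way to hold the fields listed above is a small record type. This is only a sketch: the `Hotel` class, the `parse_score` helper, and the example values are hypothetical, and the assumption that scores may arrive as text with a decimal comma (for example on a French-language page) should be checked against the scraped data.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Hotel:
    name: str
    url: str
    latitude: float
    longitude: float
    score: Optional[float]  # None when the hotel has no review score
    description: str


def parse_score(raw: Optional[str]) -> Optional[float]:
    """Convert scraped text such as ' 8,7 ' or '8.7' to a float; None if unusable."""
    if raw is None:
        return None
    cleaned = raw.strip().replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


# Placeholder values, not real scraped data
hotel = Hotel(
    name="Example Hotel",
    url="https://www.booking.com/hotel/fr/example.html",
    latitude=46.16,
    longitude=-1.15,
    score=parse_score(" 8,7 "),
    description="Placeholder description.",
)
print(asdict(hotel))
```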

Create a data lake using S3

ETL - create a SQL database using AWS RDS, extract the data from S3, and store the cleaned data in the new database
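
The transform-and-load step can be sketched locally. In the example below, an in-memory SQLite database stands in for AWS RDS and a CSV string stands in for a file read from S3; both substitutions, the `weather` table, and its column names are assumptions made so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV file downloaded from the data lake (illustrative values)
raw_csv = "id,city,day_temp,rain_prob\n1,Paris,24.5,0.1\n2,Lille,,0.4\n"


def load_clean_rows(csv_text, connection):
    """Drop rows with missing measurements, cast types, and insert the rest."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS weather ("
        "id INTEGER PRIMARY KEY, city TEXT NOT NULL, "
        "day_temp REAL NOT NULL, rain_prob REAL NOT NULL)"
    )
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["day_temp"] == "" or row["rain_prob"] == "":
            continue  # incomplete row: skipped during cleaning
        rows.append(
            (int(row["id"]), row["city"], float(row["day_temp"]), float(row["rain_prob"]))
        )
    connection.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_clean_rows(raw_csv, conn)
stored = conn.execute("SELECT city, day_temp FROM weather").fetchall()
print(loaded, stored)
```

Against a real RDS instance the same logic would go through a database driver for the chosen engine, with the CSV fetched from the bucket first.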

**Deliverables**

1- A .csv file in an S3 bucket containing enriched information about weather and hotels for each French city

2- A SQL Database to get the cleaned data from S3

3- Two maps with TOP 5 destinations and TOP 20 hotels in the area
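
Selecting the top destinations needs some ranking rule, which the brief leaves open. The sketch below shows one hypothetical rule: a weighted score that rewards warmth and penalises rain probability and cloud cover. The weights, the function names, and the sample forecasts are all illustrative assumptions.

```python
def weather_score(day_temp, rain_prob, cloud_cover):
    """Higher is better. The weights are arbitrary and meant to be tuned."""
    return day_temp - 10.0 * rain_prob - 0.05 * cloud_cover


def top_destinations(forecasts, n=5):
    """Return the names of the n best-scoring cities, best first."""
    ranked = sorted(
        forecasts,
        key=lambda f: weather_score(f["day_temp"], f["rain_prob"], f["cloud_cover"]),
        reverse=True,
    )
    return [f["city"] for f in ranked[:n]]


# Made-up forecasts for three placeholder cities
forecasts = [
    {"city": "A", "day_temp": 25.0, "rain_prob": 0.0, "cloud_cover": 10},
    {"city": "B", "day_temp": 28.0, "rain_prob": 0.8, "cloud_cover": 80},
    {"city": "C", "day_temp": 22.0, "rain_prob": 0.1, "cloud_cover": 20},
]
best = top_destinations(forecasts, n=2)
print(best)
```

The same pattern applies to hotels: sort by review score and keep the first 20.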

In [None]:
#Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly
import requests
#import boto3


In [None]:
list_cities = [
  "Mont Saint Michel", "St Malo", "Bayeux", "Le Havre", "Rouen",
  "Paris", "Amiens", "Lille", "Strasbourg", "Chateau du Haut Koenigsbourg",
  "Colmar", "Eguisheim", "Besancon", "Dijon", "Annecy", "Grenoble", "Lyon",
  "Gorges du Verdon", "Bormes les Mimosas", "Cassis", "Marseille", "Aix en Provence",
  "Avignon", "Uzes", "Nimes", "Aigues Mortes", "Saintes Maries de la mer",
  "Collioure", "Carcassonne", "Ariege", "Toulouse", "Montauban", "Biarritz",
  "Bayonne", "La Rochelle"
  ]


In [None]:
#### COME BACK HERE
wk = "d_0"  # OpenWeatherMap API key, passed as `appid` below

In [None]:
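# Geocode the first city only (the loop stops after one iteration)
for aa in list_cities:
  # Let requests URL-encode the city name; Nominatim's usage policy asks for
  # an identifying User-Agent (the value below is a placeholder)
  r1 = requests.get('https://nominatim.openstreetmap.org/search',
                    params={'format': 'json', 'q': aa},
                    headers={'User-Agent': 'kayak-project-notebook'})
  print(r1.json()[0]['display_name'])
  break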

Mont Saint-Michel, Avancée des Bombardes, Le Mont-Saint-Michel, Avranches, Manche, Normandie, France métropolitaine, 50170, France


In [None]:
#r1.json()['']

In [None]:
place = r1.json()[0]  # first match returned by Nominatim
dname = place['display_name']
lon = place['lon']  # coordinates are returned as strings
lat = place['lat']

In [None]:
# One Call API: keep only the daily forecast, in metric units
r2 = requests.get(
    'https://api.openweathermap.org/data/2.5/onecall',
    params={'lat': lat, 'lon': lon, 'exclude': 'current,minutely,hourly',
            'units': 'metric', 'appid': wk}).json()


# Fields used from the daily forecast:
#   daily.pop        : probability of precipitation, from 0 (0%) to 1 (100%)
#   daily.rain       : precipitation volume in mm (where available)
#   daily.wind_speed : wind speed (default and metric: metre/sec, imperial: miles/hour)
#   daily.temp       : temperature (default: Kelvin, metric: Celsius, imperial: Fahrenheit)
# Units are selected with &units={units}:
#   units=imperial   : Fahrenheit and miles/hour
#   units=metric     : Celsius and metre/sec
#   no units param   : Kelvin and metre/sec (the default)


In [None]:
#r2



```
r2['daily'][0]
```
**output**
{'clouds': 10,
 'dew_point': 4.43,
 'dt': 1659787200,
 'feels_like': {'day': 23.24, 'eve': 22.6, 'morn': 16.05, 'night': 14.79},
 'humidity': 28,
 'moon_phase': 0.28,
 'moonrise': 1659796440,
 'moonset': 1659739860,
 'pop': 0,
 'pressure': 1025,
 'sunrise': 1659761196,
 'sunset': 1659814639,
 'temp': {'day': 24.05,
  'eve': 23.32,
  'max': 26.26,
  'min': 10.82,
  'morn': 16.85,
  'night': 15.77},
 'uvi': 6.53,
 'weather': [{'description': 'clear sky',
   'icon': '01d',
   'id': 800,
   'main': 'Clear'}],
 'wind_deg': 41,
 'wind_gust': 12.26,
 'wind_speed': 8.69}
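
To compare cities, the daily entries can be reduced to a few numbers. The sketch below averages the fields visible in the output above (`temp.day`, `pop`, `clouds`, `wind_speed`) over the forecast window. The helper `summarize_daily`, the keys of its result, and the two sample entries are hypothetical.

```python
from statistics import mean


def summarize_daily(daily, days=7):
    """Aggregate the first `days` entries of a One Call `daily` list."""
    window = daily[:days]
    return {
        "avg_day_temp": mean(d["temp"]["day"] for d in window),
        "avg_rain_prob": mean(d["pop"] for d in window),
        "avg_clouds": mean(d["clouds"] for d in window),
        "max_wind_speed": max(d["wind_speed"] for d in window),
    }


# Two hand-written entries with only the fields the summary needs
sample_daily = [
    {"temp": {"day": 24.0}, "pop": 0, "clouds": 10, "wind_speed": 8.5},
    {"temp": {"day": 26.0}, "pop": 0.5, "clouds": 30, "wind_speed": 7.0},
]
summary = summarize_daily(sample_daily)
print(summary)
```

With the live response, the call would be `summarize_daily(r2['daily'])`.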

In [None]:
# Print the main indicators for each of the next 7 days
for ii in range(7):
  day = r2['daily'][ii]
  print('------ Day ', ii+1, '-------')
  print('day_temp= ', day['temp']['day'])
  print('weather_description= ', day['weather'])  # list of dicts; see 'description'
  print('cloud_cover= ', day['clouds'])
  print('rain_prop= ', day['pop'])  # probability of precipitation
  print('wind_speed= ', day['wind_speed'])
  print('')

------ Day  1 -------
day_temp=  25.35
weather_description=  [{'id': 801, 'main': 'Clouds', 'description': 'few clouds', 'icon': '02d'}]
cloud_cover=  19
rain_prop=  0
wind_speed=  8.65

------ Day  2 -------
day_temp=  27.51
weather_description=  [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}]
cloud_cover=  7
rain_prop=  0
wind_speed=  7.8

------ Day  3 -------
day_temp=  29.18
weather_description=  [{'id': 800, 'main': 'Clear', 'description': 'clear sky', 'icon': '01d'}]
cloud_cover=  0
rain_prop=  0
wind_speed=  7.46

------ Day  4 -------
day_temp=  29.75
weather_description=  [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}]
cloud_cover=  74
rain_prop=  0
wind_speed=  7.77

------ Day  5 -------
day_temp=  31.07
weather_description=  [{'id': 802, 'main': 'Clouds', 'description': 'scattered clouds', 'icon': '03d'}]
cloud_cover=  40
rain_prop=  0
wind_speed=  7.85

------ Day  6 -------
day_temp=  33.8
weather_description=  [{'

**Part 2 - Scraping Booking.com**

In [None]:
#installing the scrapy library
!pip install scrapy


In [None]:
#importing libraries
import os
import logging
import requests
import scrapy
from scrapy.crawler import CrawlerProcess

In [None]:
r = requests.get('https://www.booking.com/index.fr.html')
r

<Response [200]>

In [None]:
#!python booking.py
city = 'La Rochelle'#'Paris'

In [None]:
class BookingSpider(scrapy.Spider):
    """Search Booking.com for `city` and yield one item per listed hotel."""
    name = "booking"

    start_urls = ['https://www.booking.com/index.fr.html']

    def parse(self, response):
        # Fill the search form on the home page with the module-level `city`
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ss': city},
            callback=self.after_search
        )

    def after_search(self, response):
        # Each result card on the search results page
        booking = response.css('.sr_item')

        for data in booking:
            yield {
                'name': data.css('.sr-hotel__name::text').get(),
                'url': 'https://www.booking.com' + data.css('.hotel_name_link').attrib["href"],
                'coords': data.css('.sr_card_address_line a').attrib["data-coords"],
                'score': data.css('.bui-review-score__badge::text').get(),
                'description': data.css('.hotel_desc::text').get()
            }

        # Follow pagination; .attrib is empty when no link matches, hence KeyError
        try:
            next_page = response.css('a.paging-next').attrib["href"]
        except KeyError:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)

In [None]:
filename = "hotels_" + city.replace(" ", "-") + ".json"


process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'res/' + filename: {"format": "json"},
    }
})

process.crawl(BookingSpider)
process.start()

INFO:scrapy.utils.log:Scrapy 2.6.2 started (bot: scrapybot)
INFO:scrapy.utils.log:Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.4.0, Python 3.7.13 (default, Apr 24 2022, 01:04:09) - [GCC 7.5.0], pyOpenSSL 22.0.0 (OpenSSL 3.0.5 5 Jul 2022), cryptography 37.0.4, Platform Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
INFO:scrapy.crawler:Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)'}

ReactorNotRestartable: ignored

<scrapy.settings.Settings object at 0x7ff5b0bdba10>


In [None]:
start_urls = ['https://www.booking.com/searchresults.html?no_rooms=1&ac_langcode=en&dest_type=city&ss=Paris%2C+Ile+de+France%2C+France&checkin=2022-08-07&checkout=2022-08-14&ss_raw=Paris&search_selected=false&order=bayesian_review_score']
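
# A hedged sketch of building such a search URL with the standard library instead of by hand,
# which avoids stray spaces and manual percent-encoding. The parameter names are taken from the
# URL above; `booking_search_url` is a hypothetical helper, and whether Booking.com still honours
# these parameters is an assumption.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_URL = "https://www.booking.com/searchresults.html"


def booking_search_url(destination, checkin, checkout):
    """Build a search URL sorted by review score for the given stay dates."""
    params = {
        "ss": destination,
        "checkin": checkin,
        "checkout": checkout,
        "no_rooms": 1,
        "order": "bayesian_review_score",
    }
    # urlencode percent-encodes commas and turns spaces into '+'
    return SEARCH_URL + "?" + urlencode(params)


url = booking_search_url("Paris, Ile de France, France", "2022-08-07", "2022-08-14")
print(url)
```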


In [None]:
# Plain-text values: the form request URL-encodes them itself
formdata = {'ss': 'Paris, Ile de France, France', 'checkin': '2022-08-07',
            'checkout': '2022-08-14', 'ss_raw': 'Paris'}

In [None]:
import requests as r 

In [None]:
# Scratch XPath selectors for one result card. `response` is the Scrapy response of a
# search results page; NUM (not defined in this notebook) is the card's position.
names = response.xpath('//*[@id="main-content"]/div[2]/div/div[3]/div[1]/div[1]/div[3]/div[2]/div[2]/div/div/div/div[7]/div[%s]/div[1]/div[2]/div/div[1]/div[1]/div/div[1]/div/h3/a/div[1]/text()'%NUM)
urls = response.xpath('//*[@id="main-content"]/div[2]/div/div[3]/div[1]/div[1]/div[3]/div[2]/div[2]/div/div/div/div[7]/div[3]/div[1]/div[2]/div/div[1]/div[1]/div/div[1]/div/h3/a')
    