# Web Crawling and Web Scraping

Sometimes a website offers an API, but there is no well-written package for it in the language you prefer (e.g. only Java but no Python libraries). Even worse, there may be no public API at all, and we have to design a scraper to retrieve the relevant information we want. In such cases, we can build our own wrapper functions.

Web crawling and web scraping are two related techniques used to extract information from websites.

Web crawling, also known as web indexing or web spidering, is the process of automatically exploring and indexing web pages on the internet. Web crawlers, also called spiders, bots, or robots, navigate through websites, follow links, and index the content of the pages they encounter. Search engines like Google and Bing use web crawlers to build their indexes of web pages, which enables users to find information easily.
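
To make the "follow links" idea concrete, here is a minimal sketch of a breadth-first crawler. It uses only the standard library, and the `PAGES` dictionary is a hypothetical in-memory stand-in for a website, so nothing is fetched over the network. The names `LinkExtractor` and `crawl` are illustrative, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML.
# A real crawler would download each page over HTTP instead.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/news">News</a>',
    "/about": '<a href="/">Home</a>',
    "/news": '<a href="/news/1">Story</a> <a href="/about">About</a>',
    "/news/1": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start, pages):
    """Breadth-first crawl from `start`; return paths in visiting order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        path = queue.popleft()
        order.append(path)
        parser = LinkExtractor()
        parser.feed(pages.get(path, ""))
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/", PAGES)
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.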

Web scraping, on the other hand, is the process of extracting specific data from web pages. Web scraping involves analyzing the HTML structure of a webpage, identifying the relevant information, and extracting it into a structured format such as a CSV or JSON file. Web scraping can be used to extract product information, pricing data, news articles, and more.
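
As a small illustration of the last step, the sketch below writes a few scraped records to CSV and JSON with the standard library. The `records` list is made-up example data, and the output goes to in-memory strings rather than files so the example has no side effects.

```python
import csv
import io
import json

# Hypothetical records, shaped the way a scraper might produce them.
records = [
    {"title": "My First Heading", "paragraphs": 3},
    {"title": "Another page", "paragraphs": 1},
]

# CSV: one row per record, columns taken from the dictionary keys.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=["title", "paragraphs"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: the same records as a list of objects.
json_text = json.dumps(records, indent=2)

print(csv_text)
print(json_text)
```

To write real files, replace the `StringIO` buffer with `open("flights.csv", "w", newline="")` or similar; the file name here is only an example.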

Web crawling and web scraping can be done manually, but it's often more efficient to use specialized software tools. Python is a popular language for web crawling and web scraping, and there are many libraries available, including BeautifulSoup, Scrapy, and Selenium.

However, it's important to note that web scraping can raise legal and ethical concerns, particularly if done without permission or in violation of website terms of service. Web scraping can also put a strain on website servers, potentially causing them to crash or become unavailable. As such, it's important to use web scraping responsibly and within legal and ethical boundaries.

##### Preliminary examples

Examples from <a href="https://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic_document" target="_blank">w3schools</a>.

```html
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>
```

Save this code to your disk as `sample.html` (or any other name). We will use a great library called ___`Beautiful Soup`___ to read the contents from Python. You may also need to install `lxml`, a parser that Beautiful Soup can use for HTML and XML.

    pip install beautifulsoup4 lxml
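
A minimal sketch of reading the sample page with Beautiful Soup might look like this. The markup is inlined as a string so the example runs without the saved file, and it uses Python's built-in `"html.parser"` backend rather than `lxml`, which is an assumption made to keep the dependencies small.

```python
from bs4 import BeautifulSoup

# The same markup as sample.html, inlined so no file is needed.
html = """<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My 1st paragraph.</p>
<p>My 2nd paragraph.</p>
<p>My 3rd paragraph.</p>

</body>
</html>"""

# "html.parser" ships with Python; pass "lxml" instead if it is installed.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1").text                      # first <h1> element
paragraphs = [p.text for p in soup.find_all("p")]   # every <p> element

print(heading)
print(paragraphs)
```

To read the saved file instead, open it and pass the file object to `BeautifulSoup` in place of the `html` string.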

In [None]:
pip install lxml

In [1]:
# Imports used throughout this notebook (install the packages first if you have not)
from time import sleep
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup as Soup

In [None]:
driver = webdriver.Chrome()
to_location = 'AGP'
url = 'https://www.kayak.dk/flights/BLL-{to_location}/2024-08-03-flexible-2days/2adults?fs=cfc=1&sort=bestflight_a'.format(to_location=to_location)

In [None]:
# Reuses the driver from the previous cell; creating a second one would open another browser window
to_location = 'AGP'
from_location = 'BLL'
date_start = '2024-08-03'

url = 'https://www.kayak.dk/flights/{from_location}-{to_location}/{date_start}-flexible-2days/2adults?fs=cfc=1&sort=bestflight_a'.format(to_location=to_location, from_location=from_location, date_start=date_start)

In [None]:
driver.get(url)
# Give the page time to load, then attempt to click on the popup window
sleep(10)
driver.find_element(By.XPATH, '//*[@id="portal-container"]/div/div[2]/div/div/div[1]/div/span[2]/button').click()

In [None]:
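from random import randint
from selenium.common.exceptions import WebDriverException

def load_more():
    """Click the 'load more' button if possible, then pause for a random interval."""
    try:
        driver.find_element(By.XPATH, '//*[@id="c4l5n"]/div/div').click()
    except WebDriverException:
        return  # Button missing or not clickable: nothing more to load
    print('sleeping.....')
    sleep(randint(25, 35))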

In [None]:
flight_rows = driver.find_elements(By.XPATH, '//div[@class="nrc6-wrapper"]')
print(flight_rows)

In [None]:
lst_prices = []  # Flat list of scraped fields, five entries per flight

In [None]:
for row in flight_rows:
    row_html = row.get_attribute('outerHTML')
    row_soup = Soup(row_html, 'lxml')

    try:
        # Locate the containers that hold the fields of interest
        temp_price = row_soup.find("div", {"class": 'nrc6-price-section'})
        temp_airline = row_soup.find("div", {"class": 'nrc6-default-footer'})
        temp_duration = row_soup.find("div", {"class": 'nrc6-main'})
        #temp_stop = row_soup.find("div", {"class": 'hJSA'})

        price = temp_price.find("div", {"class": "f8F1-price-text"})
        airline = temp_airline.find("div", {"class": "J0g6-operator-text"})
        duration = temp_duration.find("div", {"class": "xdW8 xdW8-mod-full-airport"})
        stop = temp_duration.find("span", {"class": "JWEO-stops-text"})
        flight_time = temp_duration.find("div", {"class": "vmXl vmXl-mod-variant-large"})
        # Placeholder: no class name has been filled in for the stopover location yet
        where_stop = temp_duration.find("div", {"class": ""})

        # Build the whole record first, so a missing field cannot leave a partial row behind
        lst_prices.extend([price.text, airline.text, duration.text, stop.text, flight_time.text])
    except AttributeError:
        # A lookup returned None (element not found), so .find or .text failed
        print("Attribute error occurred, skipping this element.")
        continue  # Move to the next iteration of the loop

In [None]:
print(lst_prices)
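
Because the loop appends five values per flight to one flat list, the printed result is hard to read. One possible way to regroup it is sketched below with the standard library only. The sample values and the `FIELDS` names are hypothetical, chosen to mirror the order in which the loop appends (price, airline, duration, stops, time), and `to_records` is an illustrative helper, not part of any library.

```python
# Hypothetical flat list in the order the loop appends its values.
lst_prices = [
    "1.234 kr.", "Airline A", "BLL-AGP", "1 stop", "06:00 - 12:30",
    "2.345 kr.", "Airline B", "BLL-AGP", "direct", "09:15 - 13:00",
]

FIELDS = ["price", "airline", "duration", "stops", "time"]


def to_records(flat, fields):
    """Split a flat list into one dict per group of len(fields) items."""
    step = len(fields)
    if len(flat) % step != 0:
        raise ValueError("flat list length is not a multiple of the field count")
    return [
        dict(zip(fields, flat[i:i + step]))
        for i in range(0, len(flat), step)
    ]


flights = to_records(lst_prices, FIELDS)
for flight in flights:
    print(flight)
```

A list of dictionaries like `flights` can also be passed directly to `pd.DataFrame` if a table is preferred.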

In [None]:
driver.quit()

In [13]:
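# Selenium 4 takes the chromedriver path via a Service object. Passing the path as a
# bare string raises: AttributeError: 'str' object has no attribute 'capabilities'
driver = webdriver.Chrome(service=Service(r"C:\Users\Christian\Downloads\chromedriver_win32\chromedriver"))
sleep(3)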

---

In [None]:
def start_kayak(to_location, from_location, date_start):
    """Scrape one Kayak results page and return the scraped fields as a flat list.

    to_location, from_location: IATA airport codes (e.g. 'AGP', 'BLL').
    date_start: departure date in YYYY-MM-DD format (e.g. '2024-08-03').
    Relies on the global `driver` created above.
    """
    url = 'https://www.kayak.dk/flights/{from_location}-{to_location}/{date_start}-flexible-2days/2adults?fs=cfc=1&sort=bestflight_a'.format(to_location=to_location, from_location=from_location, date_start=date_start)
    driver.get(url)
    sleep(10)

    # Click on the popup window
    driver.find_element(By.XPATH, '//*[@id="portal-container"]/div/div[2]/div/div/div[1]/div/span[2]/button').click()
    print('Popup closed.....')

    # Click on the 'load more' button at the end of the page
    driver.find_element(By.XPATH, '//*[@id="c4l5n"]/div/div').click()
    print('Loading more...')

    # Start the first scrape
    print('starting first scrape.....')
    flight_rows = driver.find_elements(By.XPATH, '//div[@class="nrc6-wrapper"]')
    lst_prices = []

    # Go through the WebElements in flight_rows and extract the desired data
    for row in flight_rows:
        row_html = row.get_attribute('outerHTML')
        row_soup = Soup(row_html, 'lxml')

        try:
            # Locate the containers that hold the fields of interest
            temp_price = row_soup.find("div", {"class": 'nrc6-price-section'})
            temp_airline = row_soup.find("div", {"class": 'nrc6-default-footer'})
            temp_duration = row_soup.find("div", {"class": 'nrc6-main'})
            #temp_stop = row_soup.find("div", {"class": 'hJSA'})

            price = temp_price.find("div", {"class": "f8F1-price-text"})
            airline = temp_airline.find("div", {"class": "J0g6-operator-text"})
            duration = temp_duration.find("div", {"class": "xdW8 xdW8-mod-full-airport"})
            stop = temp_duration.find("span", {"class": "JWEO-stops-text"})
            flight_time = temp_duration.find("div", {"class": "vmXl vmXl-mod-variant-large"})
            # Placeholder: no class name has been filled in for the stopover location yet
            where_stop = temp_duration.find("div", {"class": ""})

            # Build the whole record first, so a missing field cannot leave a partial row behind
            lst_prices.extend([price.text, airline.text, duration.text, stop.text, flight_time.text])
        except AttributeError:
            # A lookup returned None (element not found), so .find or .text failed
            print("Attribute error occurred, skipping this element.")
            continue  # Move to the next iteration of the loop

    # Hand the results back to the caller instead of discarding them
    return lst_prices

In [None]:
to_location = 'AGP'
from_location = 'BLL'
date_start = '2024-08-03'

start_kayak(to_location, from_location, date_start)

With this in mind, you can scrape almost any webpage of interest. Other formats such as <a href="http://www.json.org/" target="_blank">JSON</a> and <a href="https://www.w3.org/XML/" target="_blank">XML</a> can be handled in much the same way, with a few differences.

***But keep in mind that you should act politely and with proper permission! To find out whether specific paths/contents are allowed to be scraped, you can check the site's ___`robots.txt`___. For example, <a href="https://www.google.com/robots.txt" target="_blank">here's</a> the permission information set by Google.***
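
Python's standard library can evaluate `robots.txt` rules for you. The sketch below parses a small, made-up rule set from a string, so it runs without network access. The user agent name `my-scraper` and the `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would use the site's own file.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) returns True when the rules permit the request.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/flights")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(allowed_public)
print(allowed_private)
```

For a live site, call `parser.set_url(...)` with the address of its `robots.txt` and then `parser.read()` instead of `parse`. Note that `robots.txt` only covers crawler access; the site's terms of service still apply.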

---

Note that the examples we are using here are relatively simple. There are cases where we cannot access pagination or scrolling with `requests` alone. In those cases, [Selenium](http://selenium-python.readthedocs.io/) will save our lives by ___simulating browsers___!

Some more tutorials/tools:

- https://scrapy.org/ (for building a crawler)
- https://www.dataquest.io/blog/web-scraping-tutorial-python/
- https://www.quora.com/Python-programming-language-1/How-is-BeautifulSoup-different-from-Scrapy

---

Return to [overview](../00_overview.ipynb)