# Web Scraping

- Web scraping in Python can be performed using these packages:
    - Selenium + BeautifulSoup4
        - `Selenium` opens a web browser and navigates to a specific website URL automatically.
        - `BeautifulSoup4` can be used to read HTML of a website and parse/convert them into text or 
    - Requests (optional)
        - `Requests` can be used to access websites directly from Python without requiring a web browser.
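
As a minimal sketch of the Requests-only route, the snippet below downloads a page and reads its `<title>` with BeautifulSoup. The helper names `fetch_html` and `page_title` are made up for this example. Keep in mind that Requests only returns the HTML sent by the server; it does not run JavaScript, so pages that build their content in the browser still need Selenium.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10.0):
    """Download a page with Requests and return its HTML as text."""
    response = requests.get(url, timeout=timeout)
    # raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


def page_title(html):
    """Return the text of the <title> tag, or an empty string if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else ""


# Usage (requires network access):
# html = fetch_html("https://www.scrapethissite.com/pages/simple/")
# print(page_title(html))
```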

## Setup Selenium

- Make sure to set your preferred browser as the default browser.
- Additionally, you can download the appropriate driver for your web browser.
    - Follow the link -> https://colab.research.google.com/github/FTDS-learning-materials/phase-0/blob/main/w3/P0W3D1PM%20-%20Web%20Scraping.ipynb#scrollTo=Gxb42wTJIzyj
    - Tutorial setup reference:
        - Chrome https://www.lambdatest.com/blog/python-selenium-with-chrome/
        - Firefox https://www.browserstack.com/guide/geckodriver-selenium-python
        - Edge https://learn.microsoft.com/en-us/microsoft-edge/webdriver-chromium/?tabs=python
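
The setup for the three browsers above can be sketched as one helper. This is only an illustration: `make_driver` and `SUPPORTED_BROWSERS` are names invented here, and the `--headless` flag is assumed to be accepted by all three browsers. Recent Selenium releases (4.6 and later) should also be able to locate or download a matching driver on their own, so the manual driver download may not be needed.

```python
SUPPORTED_BROWSERS = ("chrome", "firefox", "edge")


def make_driver(browser="chrome", headless=False):
    """Create a Selenium WebDriver for one of the supported browsers."""
    name = browser.lower()
    if name not in SUPPORTED_BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")

    # imported lazily so the argument check above works without Selenium installed
    from selenium import webdriver

    driver_classes = {
        "chrome": (webdriver.Chrome, webdriver.ChromeOptions),
        "firefox": (webdriver.Firefox, webdriver.FirefoxOptions),
        "edge": (webdriver.Edge, webdriver.EdgeOptions),
    }
    driver_cls, options_cls = driver_classes[name]

    options = options_cls()
    if headless:
        # run the browser without opening a visible window
        options.add_argument("--headless")

    return driver_cls(options=options)


# Usage (opens a real browser, so the browser itself must be installed):
# driver = make_driver("firefox", headless=True)
# driver.get("https://www.scrapethissite.com/pages/simple/")
# driver.quit()
```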

## Scraping code

- We can use the following code to perform scraping in Python:
    ```py
    # Import packages
    from bs4 import BeautifulSoup
    from selenium import webdriver

    # set up the Selenium webdriver
    driver = webdriver.Chrome()

    # define target website URL
    url="https://www.scrapethissite.com/pages/simple/"

    # redirect selenium to open the URL
    driver.get(url)

    # extract page content (HTML) from the website
    html = driver.page_source

    # close the browser once the HTML has been captured
    driver.quit()

    # convert/parse the HTML into BeautifulSoup object
    soup = BeautifulSoup(html, "html.parser")

    # print the HTML content of the object using the `prettify()` method
    print(soup.prettify())
    ```
- If we want to extract specific data from the HTML, we can use the following methods:
    - `.find()` -> find the first element in the HTML that matches your criteria.
        - Example:
            ```html
            <html>
                <body>
                    <h1 class='title'>Lorem Ipsum</h1>
                    <ul class='list-data'>
                        <li id='data1'>dolor</li>
                        <li id='data2'>sit</li>
                        <li id='data3'>amet</li>
                    </ul>
                </body>
            </html>
            ```
            
            - To extract the "Lorem Ipsum" text:
                ```py
                ...
                soup = BeautifulSoup(html, "html.parser")

                # find the first element with the `<h1>` tag
                title = soup.find("h1")

                # or, find the first `<h1>` element whose `class` attribute is `title`
                title = soup.find("h1", {"class": "title"})

                # extract the text content of the element
                print(title.get_text())  # Lorem Ipsum
                ```
    - `.find_all()` -> find all elements in the HTML that match your criteria.
    - Reference: https://scrapeops.io/python-web-scraping-playbook/python-beautifulsoup-find/
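
To make `.find_all()` concrete, here is a small sketch that reuses the sample HTML from the `.find()` example. It also shows what both methods return when nothing matches, which is a common source of errors.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <h1 class='title'>Lorem Ipsum</h1>
        <ul class='list-data'>
            <li id='data1'>dolor</li>
            <li id='data2'>sit</li>
            <li id='data3'>amet</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like object with every matching element
items = soup.find_all("li")
texts = [li.get_text(strip=True) for li in items]
print(texts)  # ['dolor', 'sit', 'amet']

# attributes of an element can be read like dictionary keys
ids = [li["id"] for li in items]
print(ids)  # ['data1', 'data2', 'data3']

# when nothing matches, find() returns None and find_all() returns an empty list
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one, len(missing_all))  # None 0
```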

## Data Extraction

### Use Case 1

- Target URL: https://www.scrapethissite.com/pages/simple/
- Extract data:
    - Country
    - Capital City
    - Area
    - Population
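
One possible approach, not a definitive solution: the sketch below assumes each country sits in a `<div class="country">` block with `country-name`, `country-capital`, `country-area`, and `country-population` elements inside. These class names and the sample HTML are assumptions modeled on how the page appears to be structured, so verify them with the browser's "Inspect" tool first. `parse_countries` is a helper name invented for this example.

```python
from bs4 import BeautifulSoup

# Illustrative sample only; the real HTML comes from `driver.page_source`
SAMPLE_HTML = """
<div class="col-md-4 country">
    <h3 class="country-name"><i class="flag-icon"></i> Andorra</h3>
    <div class="country-info">
        <strong>Capital:</strong> <span class="country-capital">Andorra la Vella</span><br>
        <strong>Population:</strong> <span class="country-population">84000</span><br>
        <strong>Area:</strong> <span class="country-area">468.0</span><br>
    </div>
</div>
"""


def parse_countries(html):
    """Return one dict per country block found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # assumed class names: check them against the live page
    for block in soup.find_all("div", {"class": "country"}):
        rows.append({
            "country": block.find("h3", {"class": "country-name"}).get_text(strip=True),
            "capital": block.find("span", {"class": "country-capital"}).get_text(strip=True),
            "area": block.find("span", {"class": "country-area"}).get_text(strip=True),
            "population": block.find("span", {"class": "country-population"}).get_text(strip=True),
        })
    return rows


rows = parse_countries(SAMPLE_HTML)
print(rows)

# Usage with Selenium (opens a real browser):
# driver.get("https://www.scrapethissite.com/pages/simple/")
# rows = parse_countries(driver.page_source)
```

The values are kept as text here; they can be converted with `int()` or `float()` once the format on the live page has been checked.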


### Use Case 2

- Target URL: https://www.scrapethissite.com/pages/forms/
- Extract data:
    - Team Name
    - Year
    - Win percentage (%)
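
A rough sketch for this use case, assuming every team is a `<tr class="team">` table row with `name`, `year`, and `pct` cells. The class names and the sample row are assumptions, so confirm them on the live page before relying on them. `parse_teams` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Illustrative sample only; the real HTML comes from `driver.page_source`
SAMPLE_HTML = """
<table class="table">
    <tr>
        <th>Team Name</th><th>Year</th><th>Win %</th>
    </tr>
    <tr class="team">
        <td class="name"> Boston Bruins </td>
        <td class="year"> 1990 </td>
        <td class="pct text-success"> 0.55 </td>
    </tr>
</table>
"""


def parse_teams(html):
    """Return one dict per team row found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # assumed class names: check them against the live page
    for row in soup.find_all("tr", {"class": "team"}):
        rows.append({
            "team_name": row.find("td", {"class": "name"}).get_text(strip=True),
            "year": int(row.find("td", {"class": "year"}).get_text(strip=True)),
            # matching on "pct" still works when the cell has extra classes
            "win_pct": float(row.find("td", {"class": "pct"}).get_text(strip=True)),
        })
    return rows


teams = parse_teams(SAMPLE_HTML)
print(teams)

# Usage with Selenium (opens a real browser):
# driver.get("https://www.scrapethissite.com/pages/forms/")
# teams = parse_teams(driver.page_source)
```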


### Use Case 3

- Target URL: https://www.gramedia.com/categories/buku/komik
- Extract data:
    - Title (Judul)
    - Author (Penulis)
    - Price (Harga)
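
For this site the real markup is not documented in these notes, so the sketch below uses placeholder CSS selectors. Every selector, the sample HTML, and the helper names (`text_or_none`, `parse_price`, `parse_products`) are hypothetical and must be replaced with what the browser's "Inspect" tool shows. The page likely renders its product list with JavaScript, so Selenium may need to wait until the products appear before `page_source` is read.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: replace them with the real ones from the live page
CARD_SELECTOR = "div.product-card"
TITLE_SELECTOR = ".product-title"
AUTHOR_SELECTOR = ".product-author"
PRICE_SELECTOR = ".product-price"

# Made-up sample that matches the placeholder selectors above
SAMPLE_HTML = """
<div class="product-card">
    <div class="product-title">Example Comic Vol. 1</div>
    <div class="product-author">Example Author</div>
    <div class="product-price">Rp 45.000</div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_price(text):
    """Convert a price such as 'Rp 45.000' to an integer number of rupiah."""
    digits = "".join(ch for ch in (text or "") if ch.isdigit())
    return int(digits) if digits else None


def parse_products(html):
    """Return one dict per product card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {
            "title": text_or_none(card, TITLE_SELECTOR),
            "author": text_or_none(card, AUTHOR_SELECTOR),
            "price": parse_price(text_or_none(card, PRICE_SELECTOR)),
        }
        for card in soup.select(CARD_SELECTOR)
    ]


products = parse_products(SAMPLE_HTML)
print(products)

# Usage with Selenium, waiting up to 10 seconds for the first product card:
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC
# driver.get("https://www.gramedia.com/categories/buku/komik")
# WebDriverWait(driver, 10).until(
#     EC.presence_of_element_located((By.CSS_SELECTOR, CARD_SELECTOR))
# )
# products = parse_products(driver.page_source)
```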


## References

- FTDS Colab https://colab.research.google.com/github/FTDS-learning-materials/phase-0/blob/main/w3/P0W3D1PM%20-%20Web%20Scraping.ipynb
- Codecademy https://www.codecademy.com/article/web-scrape-with-selenium-and-beautiful-soup
- FreeCodeCamp https://www.freecodecamp.org/news/better-web-scraping-in-python-with-selenium-beautiful-soup-and-pandas-d6390592e251/