## Introducing Selenium and when to use it

> `Selenium` is a library for controlling a browser (Chrome, Internet Explorer, Firefox, Safari, ...) automatically from code. Originally created for automated web testing, it is also used for web scraping because it drives a real browser that executes JavaScript. This makes it a real alternative to `BeautifulSoup` for dynamic web pages, which are increasingly common.

> On the other hand, `Selenium` comes with a major constraint: automated control of a browser requires a lot of resources, which makes it slower and less efficient than a library like `BeautifulSoup`.

> `Selenium` is therefore recommended (or even essential) for websites that rely on **JavaScript**, but is not recommended for retrieving **large volumes of data**.

### Some introductory HTML notions

It is useful to know the basic concepts of HTML to use Selenium effectively. In particular, here are some points to know:

> * **HTML elements:** an HTML document is composed of elements nested within each other. Most elements are delimited by an opening and a closing tag, such as `<p>` and `</p>` for a paragraph. Elements can have attributes that define additional properties, such as `class` or `id`.
> * **The structure of an HTML document:** an HTML document is organized into a set of elements that form a hierarchy. The root of the document is the `html` element, which contains two main parts: `head` and `body`. The `head` part contains information about the document, such as its title, and the `body` part contains the content displayed on the screen.
> * **CSS selectors:** Selenium can use CSS selectors (among other strategies such as XPath, id or name) to find elements on a web page. A CSS selector is a string that selects one or more elements based on their tag name, class or identifier. For example, the selector `div.review-card` selects all `div` elements that have the class `review-card`.

By knowing these basic HTML concepts, you will be able to understand the structure of a web page and how to select the elements you want with Selenium.
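
To make the selector example concrete, here is a minimal sketch that uses only Python's standard `html.parser` module to list the text of every `div` carrying the class `review-card`, which is what the CSS selector `div.review-card` would match. The HTML snippet and the `ClassCollector` helper are invented for illustration; with Selenium the equivalent lookup would be `driver.find_elements(By.CSS_SELECTOR, "div.review-card")`.

```python
from html.parser import HTMLParser

# Invented sample page: two review cards and one unrelated div
SAMPLE_HTML = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div class="review-card" id="r1"><p>Great service</p></div>
    <div class="banner">Advertisement</div>
    <div class="review-card highlighted" id="r2"><p>Too slow</p></div>
  </body>
</html>
"""


class ClassCollector(HTMLParser):
    """Collect the text found inside every <tag> that carries wanted_class."""

    def __init__(self, tag, wanted_class):
        super().__init__()
        self.tag = tag
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.matches = []   # text of each matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            # Track same-named tags nested inside a match
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.wanted_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data.strip()


collector = ClassCollector("div", "review-card")
collector.feed(SAMPLE_HTML)
print(collector.matches)
```

Note that the second card matches even though it has an extra class: a class selector only requires the class to be present among the element's classes.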

## 1. Discovering and getting started with Selenium

> The first step in scraping websites with `Selenium` is to install the package in your virtual environment. 

> Run the following cell to install `selenium`.

In [None]:
!pip install selenium
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    NoSuchElementException,
)

# Libraries for the last exercise (optional)
import os
import pandas as pd
import time
import matplotlib.pyplot as plt
import datetime
import argparse
from bs4 import BeautifulSoup
import requests


> A **webdriver** is essential to this process: it is what automatically opens your browser and loads the website of your choice. The setup depends on the browser you use. For this class, we will use Google Chrome. For Chrome, you must first download the webdriver at https://chromedriver.chromium.org/downloads. There are several download options depending on your version of Chrome. To find out which version you have, click the three vertical dots in the upper right corner of the browser window, open the "Help" menu, and select "About Google Chrome".
>
> Once chromedriver is downloaded, place it in the same folder as this notebook; otherwise the rest of the instructions will not work.
>
> We can now initialize our webdriver and navigate to a page of the Trustpilot site (https://fr.trustpilot.com/review/engie.fr).

> **Instruction: Set a webdriver to access the page https://fr.trustpilot.com/review/engie.fr**

In [None]:
# insert your code

In [None]:
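# Selenium 4 expects the driver path to be passed through a Service object
from selenium.webdriver.chrome.service import Service

url = "https://fr.trustpilot.com/review/engie.fr"

# "./chromedriver" assumes the binary sits next to this notebook
# (on Windows the file is named chromedriver.exe)
driver = webdriver.Chrome(service=Service("./chromedriver"))
driver.get(url)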


> Once the driver is running, the first step is to click the cookie consent button so that navigation can continue. 
> The path of the button can be found by inspecting it directly in the browser: 
>
>
> The **`find_element`** method then searches for the element using the path you located. All that remains is to click the button with the **`click`** method. Here is an example:
>
>```python
>cookie_button = driver.find_element(By.XPATH,cookie_button_path)
>cookie_button.click()
>```
>
> **Instruction: using the example provided, inspect the web page to find the path to the cookie "OK" button and click it.**

In [None]:
# insert your code

In [None]:
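# Option A: absolute XPath copied from the browser inspector.
# It is brittle: it breaks as soon as the page layout changes.
# cookie_button_path = "/html/body/div[3]/div[2]/div/div[1]/div/div[2]/div/button[2]"
# cookie_button = driver.find_element(By.XPATH, cookie_button_path)

# Option B: locate the button by its id, which is more stable
cookie_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
cookie_button.click()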

> **Instruction: start by retrieving the title of the website's first comment.**
> 
> As with the cookie button, first inspect the web page to locate the element: 
>
> Then all that remains is to retrieve the text of the element found. Here is an example: 
>
> ``title = driver.find_element(By.XPATH,title_path).text``

In [None]:
# insert your code

In [None]:
title_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/a/h2'

title = driver.find_element(By.XPATH, title_path).text

> **Instruction**
>
> **On the same basis:**
> * Retrieve the body text of the first comment.
> * Retrieve the date of the first comment.
> * Retrieve the rating of the first comment.

In [None]:
# insert your code

In [None]:
date_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/p[2]'
com_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/p[1]'
mark_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[1]/div[1]/img'

date = driver.find_element(By.XPATH, date_path).text
com = driver.find_element(By.XPATH, com_path).text
mark = driver.find_element(By.XPATH, mark_path).get_attribute("alt")
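
The `alt` attribute returns a sentence rather than a number. Below is a small sketch of how the numeric rating could be pulled out of it, assuming the rating is the first integer in the text (for example something like "Noté 4 sur 5 étoiles"). The exact wording on the live site is an assumption, and `parse_rating` is a hypothetical helper.

```python
import re


def parse_rating(alt_text):
    """Return the first integer found in alt_text, or None if there is none."""
    match = re.search(r"\d+", alt_text or "")
    return int(match.group()) if match else None


# Sample string in an assumed format, not copied from the site
print(parse_rating("Noté 4 sur 5 étoiles"))
print(parse_rating(""))
```

In the notebook this would be called as `parse_rating(mark)` once `mark` has been retrieved.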

> Now that we have extracted some basic elements from the first page of the website https://fr.trustpilot.com/review/engie.fr, we would like to extract the same information but from the second page. 
>
> First we will extract the total number of pages available, so that we can make sure the site has at least 2 pages.
>
> To find the total number of pages, inspect the pagination button that leads to the last page.
>
> **Instruction: extract the number of pages of the site https://fr.trustpilot.com/review/engie.fr and check that we have at least two pages on the site.**

In [None]:
# insert your code

In [None]:
max_page_text = driver.find_element(By.NAME, "pagination-button-last").text
# Fall back to 0 when the button has no visible text
max_pages = int(max_page_text) if max_page_text.strip() else 0
print(max_pages >= 2)  # True if the site has at least two pages

> **Instruction: by inspecting the first page of the site https://fr.trustpilot.com/review/engie.fr, identify the location of the button to go to the next page and click on it (as previously done with the cookies button).**

In [None]:
# insert your code

In [None]:
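from selenium.common.exceptions import ElementClickInterceptedException

next_page_button_name = "pagination-button-next"
next_button = driver.find_element(By.NAME, next_page_button_name)

# The click can raise ElementClickInterceptedException (for example when a
# banner overlaps the button), so we try a second time before giving up
try:
    next_button.click()
except ElementClickInterceptedException:
    try:
        next_button.click()
    except ElementClickInterceptedException:
        pass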
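
The two-attempt pattern above can be generalized into a small retry helper. This is only a sketch: `call_with_retry` and the `FlakyButton` stand-in are illustrative names, and the stand-in replaces a real Selenium element so that the example runs without a browser.

```python
import time


def call_with_retry(action, attempts=2, delay=1.0, retry_on=(Exception,)):
    """Call action(); on a listed exception, wait and try again.

    Returns True as soon as a call succeeds, False if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            action()
            return True
        except retry_on:
            # Pause before the next attempt, but not after the last one
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


# Stand-in for a button whose first click is intercepted (illustration only)
class FlakyButton:
    def __init__(self):
        self.clicks = 0

    def click(self):
        self.clicks += 1
        if self.clicks == 1:
            raise RuntimeError("click intercepted")


button = FlakyButton()
print(call_with_retry(button.click, delay=0.0, retry_on=(RuntimeError,)))
```

With Selenium, the call would look like `call_with_retry(next_button.click, retry_on=(ElementClickInterceptedException,))`.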

> **Instruction: once on page 2 of https://fr.trustpilot.com/review/engie.fr, extract the following information, as for the first page:**

> * The title of the first comment.
> * The content of the first comment.
> * The date of the first comment.

In [None]:
# insert your code

In [None]:
date_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/p[2]'
com_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/p[1]'
title_path = '//*[@id="__next"]/div/div/div/main/div/div[4]/section/div[4]/article/section/div[2]/a/h2'

date = driver.find_element(By.XPATH, date_path).text
com = driver.find_element(By.XPATH, com_path).text
title = driver.find_element(By.XPATH, title_path).text

> **Instruction: extract the content of all the comments on page 2 of the site.**

In [None]:
# insert your code

In [None]:
comments_list = driver.find_elements(
    By.XPATH,
    '//*[contains(@class, "typography_body-l__KUYFJ typography_appearance-default__AAY17 typography_color-black__5LYEn")]',
)
comments = [element.text for element in comments_list]

## 2. Exploitation and use of the extracted data

> In this second part, we will focus on a second website, "Avis Vérifiés", for the same company: https://www.avis-verifies.com/avis-clients/engie-homeservices.fr

> The code you will write opens a connection to an SQLite database, creates a "reviews" table if it doesn't already exist, then uses Selenium (as above) to open a Chrome browser and retrieve the reviews from all the available pages. 

> For each review, we will store the rating, the text and the date in the "reviews" table of the database.

### Option 1 - reviews list

In [None]:
# insert your code

In [None]:
import time

# Start the webdriver and navigate to the first page of the website
driver = webdriver.Chrome()
driver.get("https://www.avis-verifies.com/avis-clients/engie-homeservices.fr")
cookie_button = driver.find_element(By.CSS_SELECTOR, "#onetrust-accept-btn-handler")
cookie_button.click()

reviews_list = []
max_page = 10
for page in range(1, max_page):
    # Extract the review data from the current page
    reviews_elements = driver.find_elements(By.CLASS_NAME, "review")

    reviews_page = list(
        map(
            lambda x: {
                "rating": x.find_element(By.TAG_NAME, "span").text,
                "text": x.find_element(By.CLASS_NAME, "text").text,
                "date": x.find_element(By.CLASS_NAME, "details").text,
            },
            reviews_elements,
        )
    )
    reviews_list.extend(reviews_page)
    try:
        next_button = driver.find_element(
            By.XPATH, "/html/body/div[3]/div[3]/div[3]/ul/li[4]/a"
        )
        next_button.click()
        time.sleep(5)  # for the page to load
    except NoSuchElementException:
        # No "next page" button: we are on the last page
        break

# Close the webdriver
# driver.quit()

print(reviews_list)
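
The fixed `time.sleep(5)` in the loop waits five seconds even when the page is ready sooner, and may still be too short on a slow connection. A common alternative is to poll for a condition with a timeout; Selenium ships this as `WebDriverWait` in `selenium.webdriver.support.ui`. The sketch below shows the idea in plain Python so that it runs without a browser; `wait_until` and the `FakePage` object are illustrative names, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Fake page whose reviews only appear on the third check (illustration only)
class FakePage:
    def __init__(self):
        self.checks = 0

    def find_reviews(self):
        self.checks += 1
        return ["review A", "review B"] if self.checks >= 3 else []


page = FakePage()
reviews = wait_until(page.find_reviews, timeout=2.0, poll=0.01)
print(reviews)
```

With a real driver, the condition could be `lambda: driver.find_elements(By.CLASS_NAME, "review")`, which returns an empty (falsy) list until the reviews are present.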


### Option 2 - SQLite 

> **Instruction: import the necessary libraries, create the database connection and set up the webdriver**

In [None]:
# import the necessary additional library
import sqlite3

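# create the connection to the database
db = sqlite3.connect("reviews.db")
cursor = db.cursor()

# create the "reviews" table if it doesn't already exist
# (the id column and the TEXT types are assumptions; adapt them to your needs)
cursor.execute(
    "CREATE TABLE IF NOT EXISTS reviews ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, rating TEXT, comment TEXT, date TEXT)"
)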

# Set up the webdriver
driver = webdriver.Chrome()

> **Instruction: go to the website and accept the cookies by clicking on the cookie button**

In [None]:
# insert your code

In [None]:
driver.get("https://www.avis-verifies.com/avis-clients/engie-homeservices.fr")

# cookie button
cookie_button = driver.find_element(By.CSS_SELECTOR, "#onetrust-accept-btn-handler")
cookie_button.click()

> **Instruction: we would like to extract the evaluation data (review rating, review text and review date) from all pages of the website, starting from the first page and continuing to the last page.**

In [None]:
# insert your code

In [None]:
max_page = 10
for page in range(1, max_page):
    # data from current page
    elements = driver.find_elements(By.CLASS_NAME, "review")
    # data insertion into the database
    for review in elements:
        infos = {
            "rating": review.find_element(By.TAG_NAME, "span").text,
            "text": review.find_element(By.CLASS_NAME, "text").text,
            "date": review.find_element(By.CLASS_NAME, "details").text,
        }
        cursor.execute(
            "INSERT INTO reviews (rating, comment, date) VALUES (?, ?, ?)",
            (infos["rating"], infos["text"], infos["date"]),
        )
    db.commit()  # one commit per page is enough

    # move to the next page
    try:
        next_button = driver.find_element(
            By.XPATH, "/html/body/div[3]/div[3]/div[3]/ul/li[4]/a"
        )
        next_button.click()
        time.sleep(5)  # for the page to load
    except NoSuchElementException:
        # If the "next page" button is not found, we are on the last page
        break

This code retrieves the review data from the pages of the website, starting with the first page and stopping at the last page or at the `max_page` limit, whichever comes first. At each iteration of the loop, it retrieves the data from the current page, inserts it into the database, and then clicks the "next page" button to move to the next page. When the "next page" button cannot be found, the loop ends.
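
The insertion step can be tried without a browser. The sketch below assumes a `reviews` table with an auto-incremented `id` plus the `rating`, `comment` and `date` columns used in the `INSERT` statement above; the column types and the sample rows are assumptions. It works on an in-memory database and inserts a whole page of reviews with one `executemany` call and one commit.

```python
import sqlite3

# In-memory database so the example leaves no file behind
db = sqlite3.connect(":memory:")
cursor = db.cursor()

cursor.execute(
    """
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        rating TEXT,
        comment TEXT,
        date TEXT
    )
    """
)

# Made-up rows standing in for one scraped page
page_reviews = [
    {"rating": "5", "text": "Intervention rapide", "date": "01/02/2023"},
    {"rating": "1", "text": "Service à fuir", "date": "03/02/2023"},
]

# Insert the whole page in one call, then commit once
cursor.executemany(
    "INSERT INTO reviews (rating, comment, date) VALUES (?, ?, ?)",
    [(r["rating"], r["text"], r["date"]) for r in page_reviews],
)
db.commit()

row_count = cursor.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(row_count)
```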

We will now use the extracted database to detect negative comments.

> **Instruction: develop a generic approach to detect negative comments.**

In [None]:
# insert your code

In [None]:
# Select negative reviews (rating below 3).
# CAST guards against ratings that were stored as text.
Q = "SELECT rating, comment FROM reviews WHERE CAST(rating AS REAL) < 3"
cursor.execute(Q)

# Retrieve the results of the query
results = cursor.fetchall()

# Display the rating and the text of each review
for rating, comment in results:
    print(rating)
    print(comment)

# Close the browser and the database connection
driver.quit()
db.close()

This code runs a query on the SQLite database opened earlier to select the negative reviews (rating below 3) from the "reviews" table, then displays the rating and text of each review.

You can modify this query to search for specific keywords in the text of the reviews. For example, to search for reviews containing the words "à fuir" ("to be avoided"), you can write:

In [None]:
Q = "SELECT * FROM reviews WHERE comment LIKE '%à fuir%'"
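
Building on the keyword idea, several keywords can be searched at once with bound parameters instead of values pasted into the SQL string. This is a sketch under assumptions: the table layout, the sample rows and the keyword list are invented, and `find_negative_reviews` is a hypothetical helper that combines the rating threshold with the keyword match.

```python
import sqlite3


def find_negative_reviews(db, keywords, max_rating=3):
    """Return (rating, comment) rows rated below max_rating or matching a keyword."""
    clauses = ["CAST(rating AS REAL) < ?"]
    params = [max_rating]
    for word in keywords:
        clauses.append("comment LIKE ?")
        params.append("%" + word + "%")
    query = "SELECT rating, comment FROM reviews WHERE " + " OR ".join(clauses)
    return db.execute(query, params).fetchall()


# In-memory table filled with made-up reviews
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reviews (rating TEXT, comment TEXT, date TEXT)")
db.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("5", "Très bon service", "01/02/2023"),
        ("1", "Technicien jamais venu", "02/02/2023"),
        ("4", "Correct mais entreprise à fuir", "03/02/2023"),
    ],
)

negative = find_negative_reviews(db, ["à fuir", "arnaque"])
print(negative)
```

The third review is caught by its wording even though its rating is high, which is the point of combining both criteria.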

## 3. Bonus

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. Selenium, on the other hand, is a browser automation tool that is used to automate web browsers.

When used together, Selenium can be used to open a web page and interact with its contents, and then Beautiful Soup can be used to extract the desired information from the page. For example, Selenium can be used to click on a button to load more data on a page, and then Beautiful Soup can be used to extract the data that was loaded.

> Instruction
> * Scrape all the ads of apartments for rent or sale in the city of Paris.
>
> * Extract the following information for each ad: the title, location, surface and price.
>
> * Store the extracted information in a CSV file.

NB: the site may block your IP address if the code is run several times.
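
For the storage step, the standard `csv` module is enough when pandas is not needed. The sketch below appends ads to a CSV file and writes the header only when the file is new. The field names mirror the instruction (title, location, surface, price), while the file name, the `save_ads` helper and the sample ads are made up.

```python
import csv
import os
import tempfile

FIELDS = ["title", "location", "surface", "price"]


def save_ads(ads, path):
    """Append ad dictionaries to a CSV file, creating it with a header if needed."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(ads)


# Demo in a temporary directory with invented ads
demo_path = os.path.join(tempfile.mkdtemp(), "ads.csv")
save_ads(
    [{"title": "Studio", "location": "Paris 11e", "surface": 22, "price": 195000}],
    demo_path,
)
save_ads(
    [{"title": "Deux pièces", "location": "Paris 18e", "surface": 35, "price": 310000}],
    demo_path,
)

# Read the file back to check that both calls ended up under a single header
with open(demo_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))
print(len(rows), rows[0]["title"])
```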

In [None]:
def parse_arguments():
    argparser = argparse.ArgumentParser(description="Immo Parser arguments")
    argparser.add_argument(
        "--type", type=str, default="vente", help="'vente' (sale) or 'location' (rental)"
    )
    argparser.add_argument(
        "--ville", type=str, default="Paris", help="city to search in"
    )
    argparser.add_argument("--prix_max", type=int, default=200000, help="maximum price")
    argparser.add_argument(
        "--surface_min", type=int, default=10, help="minimum surface"
    )
    argparser.add_argument(
        "--path", type=str, default="immobilier.csv", help="path to the .csv file"
    )

    # parse_known_args ignores the extra arguments injected by the notebook kernel
    args, _ = argparser.parse_known_args()

    return args

In [None]:
def get_current_offers(type, ville, prix_max, surface_min):

    ville = ville.lower()
    driver = webdriver.Chrome()
    if type == "vente":
        driver.get("https://www.pap.fr/annonce/vente-immobiliere")

    elif type == "location":
        driver.get("https://www.pap.fr/annonce/locations")

    assert "PAP" in driver.title
    loc = driver.find_element(By.ID, "token-input-geo_objets_ids")
    loc.send_keys(ville)
    time.sleep(2)
    loc = driver.find_element(By.ID, "token-input-geo_objets_ids")
    loc.send_keys(Keys.RETURN)
    loc.send_keys(Keys.RETURN)
    pmax = driver.find_element(By.ID, "prix_max")
    pmax.send_keys(str(prix_max))
    smin = driver.find_element(By.ID, "surface_min")
    smin.send_keys(str(surface_min))
    smin.send_keys(Keys.RETURN)

    locations = driver.find_elements(By.CLASS_NAME, "h1")
    locations = [l.text.lower() for l in locations]

    surfaces_elements = driver.find_elements(By.CLASS_NAME, "item-tags")
    prices_elements = driver.find_elements(By.CLASS_NAME, "item-price")
    links_elements = driver.find_elements(By.CLASS_NAME, "item-title")

    surfaces, prices, links = [], [], []
    for surf in surfaces_elements:
        st = surf.text.split()
        if "m2" in st:
            surface_index = st.index("m2") - 1
            s = int(st[surface_index])
        else:
            s = None
        surfaces.append(s)
    for p in prices_elements:
        ptext = p.text.split()
        if len(ptext) > 0:
            prices.append(int(ptext[0].replace(".", "")))
        else:
            prices.append(None)
    for l in links_elements:
        links.append(l.get_attribute("href"))

    driver.quit()  # close the browser and stop the driver process
    df = pd.DataFrame([])
    df["lieu"] = locations
    df["prix"] = prices
    df["surface"] = surfaces
    df["lien"] = links
    df.dropna(axis=0, inplace=True)
    df = df[df["lieu"].str.contains(ville)]
    df["ratio"] = (df["prix"] / df["surface"]).round()
    df["date"] = len(df) * [datetime.date.today().strftime("%d/%m/%Y")]
    df["type"] = len(df) * [type]

    links = df["lien"]
    df.drop("lien", axis=1, inplace=True)

    return df, links
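
The text-to-number conversions inside `get_current_offers` are easier to check when isolated. Below is a sketch with regular expressions, assuming prices are displayed like "1.250.000 €" and surfaces like "45 m2". Both formats, and the helper names `parse_price` and `parse_surface`, are assumptions about the listing pages.

```python
import re


def parse_price(text):
    """Extract the leading integer price, ignoring '.' and space separators."""
    match = re.match(r"\s*(\d[\d.\s]*)", text or "")
    if not match:
        return None
    # Keep only the digits of the matched prefix
    return int(re.sub(r"\D", "", match.group(1)))


def parse_surface(text):
    """Return the integer that precedes 'm2' or 'm²', or None."""
    match = re.search(r"(\d+)\s*m(?:2|²)", text or "")
    return int(match.group(1)) if match else None


# Sample strings in the assumed formats
print(parse_price("1.250.000 €"))
print(parse_surface("3 pièces 2 chambres 45 m2"))
print(parse_surface("Parking"))
```

Returning `None` for unparseable text keeps the behavior of the original loops, where missing values are dropped later with `dropna`.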

In [None]:
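def get_additional_info(links):
    # For each ad, flag whether the description mentions a furnished flat
    indic_meuble = []
    for url in links:
        r1 = requests.get(url, headers={"User-Agent": "Chrome/59.0.3071.115"})
        soup1 = BeautifulSoup(r1.content, "html5lib")
        desc_tag = soup1.find("div", class_="margin-bottom-30")
        # Treat a missing description block as an empty description
        desc = desc_tag.text if desc_tag is not None else ""
        indic_meuble.append(("meubl" in desc) or ("LMNP" in desc))

    # Reuse the index of links (a pandas Series) so that the column lines up
    # with the ads DataFrame
    return pd.DataFrame({"meuble": indic_meuble}, index=links.index)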

In [None]:
def update_df_file(current_df, path):
    # Append to the existing file if there is one
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
    if os.path.exists(path):
        df = pd.concat([pd.read_csv(path), current_df], ignore_index=True)
    else:
        df = current_df
    df = df.drop_duplicates(subset=["type", "lieu", "surface", "prix"])
    df.to_csv(path, index=False)


if __name__ == "__main__":

    argparser = parse_arguments()
    type = argparser.type
    if type not in ["location", "vente"]:
        raise ValueError("type must be 'location' or 'vente'")
    ville = argparser.ville
    prix_max = argparser.prix_max
    surface_min = argparser.surface_min
    path = argparser.path
    current_df, links = get_current_offers(type, ville, prix_max, surface_min)
    add_df = get_additional_info(links)
    # axis=1 adds the "meuble" column instead of stacking rows
    current_df = pd.concat((current_df, add_df), axis=1)
    update_df_file(current_df, path)