## Introducing Selenium and its context of use

> `Selenium` is a library for controlling a browser (Chrome, Internet Explorer, Firefox, Safari, ...) automatically from a program. Originally created for automated web testing, it is also used for web scraping because it drives a real browser and therefore executes JavaScript. This strength makes it a real alternative to `BeautifulSoup` for dynamic web pages, which are increasingly common.

> On the other hand, `Selenium` comes with a major constraint: driving a full browser is resource-intensive, which makes it slower and less efficient than a library like `BeautifulSoup`.

> `Selenium` is therefore recommended (or even essential) for websites that rely on **JavaScript**, but it is not recommended for retrieving **large volumes of data**.

### Some introductory HTML notions

It is useful to know the basic concepts of HTML to use Selenium effectively. In particular, here are some points to know:

> * **HTML elements:** an HTML document is composed of elements nested within each other. Each element is defined by an opening and a closing tag, such as `<p>` and `</p>` for a paragraph. Elements can have attributes that define additional properties, such as `class` or `id`.
> * **The structure of an HTML document:** an HTML document is organized into a set of elements that form a hierarchy. The document has a root, which is the `html` element, and it contains two main parts: `head` and `body`. The `head` part contains information about the document, like its title, and the `body` part contains the content displayed on the screen.
> * **CSS selectors:** Selenium can use CSS selectors (among other strategies, such as XPath) to find elements on a web page. A CSS selector is a string that selects one or more elements based on their tag name, class or identifier. For example, the selector `div.review-card` selects all `div` elements that have the class `review-card`.

By knowing these basic HTML concepts, you will be able to understand the structure of a web page and how to select the elements you want with Selenium.
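To make these notions concrete, here is a minimal sketch that uses only Python's standard library. It assumes a small, well-formed HTML snippet invented for the illustration (`xml.etree.ElementTree` cannot parse the loosely formed HTML found on most real pages). The XPath expression `.//div[@class='review-card']` plays roughly the same role as the CSS selector `div.review-card`, as long as the `class` attribute holds exactly that value.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up page: html is the root, head and body are its two parts
html_doc = """
<html>
  <head><title>Reviews</title></head>
  <body>
    <div class="review-card" id="r1"><p>Great service</p></div>
    <div class="review-card" id="r2"><p>Too slow</p></div>
    <div class="footer"><p>Contact</p></div>
  </body>
</html>
""".strip()

root = ET.fromstring(html_doc)

print(root.tag)                       # html
print([child.tag for child in root])  # ['head', 'body']

# Select every div whose class attribute is exactly "review-card"
cards = root.findall(".//div[@class='review-card']")
for card in cards:
    # Attributes are read with .get(), nested elements with .find()
    print(card.get("id"), card.find("p").text)
# r1 Great service
# r2 Too slow
```

The third `div` is skipped because its class is `footer`, which is the filtering behaviour a selector gives you on a real page.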

## 1. Discovering and getting started with Selenium

> The first step in scraping websites with `Selenium` is to install the package in your virtual environment.

> Run the following cell to install `selenium`.

In [None]:
!pip install selenium

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting selenium
  Downloading selenium-4.8.0-py3-none-any.whl (6.3 MB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting urllib3[socks]~=1.26
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.0-py3-none-any.whl (14 kB)
Collecting outcome
  Downloading outcome-1.2.0-py2.py3-none-any.whl (9.7 kB)
Collecting async-genera

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

# Libraries for the last exercise (optional)
import os
import pandas as pd
import time
import matplotlib.pyplot as plt
import datetime
import argparse
from bs4 import BeautifulSoup
import requests

> A **webdriver** is an essential ingredient in this process: it is what automatically opens your browser to access the website of your choice. The setup differs depending on the browser you use. For this class, we will use Google Chrome, so you must first download the matching webdriver at https://chromedriver.chromium.org/downloads. Several downloads are available, depending on your version of Chrome. To find out which version of Chrome you have, click the three vertical dots in the upper-right corner of your browser window, open the Help menu, and select "About Google Chrome".
>
> Once chromedriver is downloaded, remember to place it in the same folder as this notebook; otherwise the rest of the instructions will not work.
>
> We can now initialize our webdriver to navigate to a page of the Trustpilot site (https://fr.trustpilot.com/review/engie.fr).

> **Instruction: Set a webdriver to access the page https://fr.trustpilot.com/review/engie.fr**

In [None]:
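# insert your code
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 takes the chromedriver path through a Service object;
# adjust the path below to where chromedriver is on your machine
service = Service(executable_path="C:/Users/Rehan Ibrahim/Downloads/CHROMEDRIVER/chromedriver.exe")
driver = webdriver.Chrome(service=service)

# Open the Trustpilot page
driver.get("https://fr.trustpilot.com/review/engie.fr")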

> Once the driver is running, the first step is to click the cookie button to continue browsing.
> You can find the path of the button by inspecting it directly in the browser.
>
> The **`find_element`** method then searches for the element using the located path. All that remains is to click the button with the **`click`** method. Here is an example:
>
>```python
>cookie_button = driver.find_element(By.XPATH, cookie_button_path)
>cookie_button.click()
>```
>
> **Instruction: using the example provided, inspect the web page to find the path of the cookie "OK" button and click it.**

In [None]:
# insert your code
cookie_button = driver.find_element(By.XPATH, '//*[@id="onetrust-accept-btn-handler"]')
cookie_button.click()

> **Instruction: start by retrieving the title of the website's first comment.**
> 
> As with the cookie button, first inspect the web page to locate the element.
>
> Then all that remains is to retrieve the text of the element found. Here is an example:
>
> ``title = driver.find_element(By.XPATH,title_path).text``

In [None]:
# insert your code


> **Instruction**
>
> **On the same basis:**
> * Retrieve the body text of the first comment.
> * Retrieve the date of the first comment.
> * Retrieve the rating of the first comment.

In [None]:
# insert your code


> Now that we have extracted some basic elements from the first page of the website https://fr.trustpilot.com/review/engie.fr, we would like to extract the same information but from the second page. 
>
> First, we will extract the total number of pages we can navigate, which lets us make sure that the site has at least two pages.
>
> To find the total number of pages, inspect the pagination button that leads to the last page.
>
> **Instruction: extract the number of pages of the site https://fr.trustpilot.com/review/engie.fr and check that we have at least two pages on the site.**
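One possible way to structure this step is sketched below. It assumes the "last page" button displays the page number as its text, and the helper name `get_page_count` is hypothetical. The locator is passed as a `(strategy, value)` tuple such as `(By.XPATH, "...")`, so the actual path found in the inspector stays outside the function.

```python
def get_page_count(driver, last_page_locator):
    """Return the number shown on the pagination button for the last page.

    `last_page_locator` is a (strategy, value) tuple, for example
    (By.XPATH, "<path copied from the inspector>").
    """
    label = driver.find_element(*last_page_locator).text
    # The button text is assumed to be a plain number such as "37"
    return int(label.strip())


# Possible usage in the notebook (the path is yours to find):
# n_pages = get_page_count(driver, (By.XPATH, last_page_button_path))
# print(n_pages >= 2)
```

If the button text contains anything other than digits, `int` raises a `ValueError`, which is a useful signal that the locator points at the wrong element.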

In [None]:
# insert your code


> **Instruction: by inspecting the first page of the site https://fr.trustpilot.com/review/engie.fr, identify the location of the button to go to the next page and click on it (as previously done with the cookies button).**

In [None]:
# insert your code


> **Instruction - Once on page 2 of https://fr.trustpilot.com/review/engie.fr, as for the first page, extract the following information:**

> * The title of the first comment.
> * The content of the first comment.
> * The date of the first comment.

In [None]:
# insert your code


> **Instruction: extract the content of all the comments on page 2 of the site.**
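For this step, `find_elements` (plural) is the natural tool: unlike `find_element`, it returns a list of every matching element, and an empty list when nothing matches. The sketch below assumes a single locator matches the body of each comment; the helper name `get_all_texts` is hypothetical.

```python
def get_all_texts(driver, locator):
    """Return the visible text of every element matching `locator`.

    `locator` is a (strategy, value) tuple such as (By.XPATH, "...").
    """
    return [element.text for element in driver.find_elements(*locator)]


# Possible usage in the notebook (the path is yours to find):
# comments = get_all_texts(driver, (By.XPATH, comment_body_path))
# print(len(comments))
```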

In [None]:
# insert your code


## 2. Exploitation and use of the extracted data

> In this second part, we will focus on a second website, "Avis Vérifiés", for the same company: https://www.avis-verifies.com/avis-clients/engie-homeservices.fr

> The code you will write opens a connection to a database (SQLite in Option 2), creates a "reviews" table if it doesn't already exist, then uses Selenium (as above) to open a Chrome browser and retrieve the reviews from all the available pages.

> For each review, we will store the rating, the text and the date in the "reviews" table in the database.

### Option 1 - reviews list

In [None]:
# insert your code


### Option 2 - SQLite 

> **Instruction: import the necessary libraries, create the database connection and set up the webdriver.**

In [None]:
# insert your code


> **Instruction: go to the website and accept the cookies by clicking on the cookie button.**

In [None]:
# insert your code


> **Instruction: extract the review data (review rating, review text and review date) from all pages of the website, starting from the first page and continuing to the last page.**

In [None]:
# insert your code


This code retrieves review data from all pages of the website, starting with the first page and continuing to the last page. At each iteration of the loop, it retrieves the data from the current page, inserts it into the database, and then clicks the "next page" button to move to the next page. When the "next page" button cannot be found, the loop ends.
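The loop described above could be sketched as follows. This is one possible structure, not the reference solution: the function names, the `locators` dictionary, the `max_pages` safety cap and the column names other than `review_text` are assumptions made for the illustration. Locators are passed as `(strategy, value)` tuples such as `(By.XPATH, "...")`, and the end of the loop is detected with `find_elements`, which returns an empty list when the "next page" button is absent.

```python
import sqlite3


def create_reviews_table(conn):
    """Create the "reviews" table if it does not already exist."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reviews (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            rating REAL,
            review_text TEXT,
            review_date TEXT
        )
        """
    )
    conn.commit()


def scrape_all_pages(driver, conn, locators, max_pages=500):
    """Store the reviews of every page and return the number of pages visited.

    `locators` maps "card", "rating", "text", "date" and "next" to
    (strategy, value) tuples.
    """
    create_reviews_table(conn)
    pages_visited = 0
    while pages_visited < max_pages:  # safety cap against an endless loop
        # One "card" element per review on the current page
        for card in driver.find_elements(*locators["card"]):
            row = (
                card.find_element(*locators["rating"]).text,
                card.find_element(*locators["text"]).text,
                card.find_element(*locators["date"]).text,
            )
            conn.execute(
                "INSERT INTO reviews (rating, review_text, review_date) "
                "VALUES (?, ?, ?)",
                row,
            )
        conn.commit()
        pages_visited += 1

        # An empty list means there is no "next page" button: last page reached
        next_buttons = driver.find_elements(*locators["next"])
        if not next_buttons:
            break
        next_buttons[0].click()
    return pages_visited
```

Two practical points: XPath locators evaluated from a review card should start with `.//` so that the search stays inside that card, and depending on the site a short wait after `click()` may be needed before the next page's reviews are present.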

We will now leverage the extracted database to detect negative comments.
>
> **Instruction: Develop a generic approach to detect negative comments.**

In [None]:
# insert your code


This code opens a connection to the database, runs a query to select the negative reviews (rating below 3) from the "reviews" table, then displays the rating and text of each review.

You can modify this query to search for specific keywords in the text of the reviews. For example, to search for reviews containing the words "à fuir", you can write:

In [None]:
Q = "SELECT * FROM reviews WHERE review_text LIKE '%à fuir%'"
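Both queries can be sketched with `sqlite3` and parameter placeholders, which avoids building SQL strings by hand. The function names, the `rating` column and the three sample rows are assumptions made for the illustration; only the `reviews` table and the `review_text` column come from the query above.

```python
import sqlite3


def negative_reviews(conn, threshold=3):
    """Return (rating, review_text) for reviews rated below `threshold`."""
    query = (
        "SELECT rating, review_text FROM reviews "
        "WHERE rating < ? ORDER BY rating"
    )
    return conn.execute(query, (threshold,)).fetchall()


def reviews_containing(conn, keyword):
    """Return (rating, review_text) for reviews whose text contains `keyword`."""
    query = "SELECT rating, review_text FROM reviews WHERE review_text LIKE ?"
    return conn.execute(query, (f"%{keyword}%",)).fetchall()


# Demonstration on an in-memory database with made-up reviews
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (rating REAL, review_text TEXT, review_date TEXT)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        (5, "Technicien ponctuel et efficace", "2023-01-10"),
        (1, "Service client à fuir", "2023-01-12"),
        (2, "Délai beaucoup trop long", "2023-01-15"),
    ],
)

for rating, text in negative_reviews(conn):
    print(rating, text)

print(reviews_containing(conn, "à fuir"))
```

Combining the two criteria (a low rating or a list of typical negative expressions) is one way to make the detection more generic.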

## 3. Bonus

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. Selenium, on the other hand, is a browser automation tool that is used to automate web browsers.

When used together, Selenium can be used to open a web page and interact with its contents, and then Beautiful Soup can be used to extract the desired information from the page. For example, Selenium can be used to click on a button to load more data on a page, and then Beautiful Soup can be used to extract the data that was loaded.

> **Instruction**
> * Scrape all the ads of apartments for rent or sale in the city of Paris.
>
> * Extract the following information for each ad: the title, location, surface and price.
>
> * Store the extracted information in a CSV file.

NB: the site may block your IP address if the code is run several times.
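For the last step, a minimal sketch of the CSV export with the standard `csv` module is given below. It assumes each ad has already been scraped into a dictionary with the four requested fields; the function name `save_ads_to_csv` and the file name in the usage comment are hypothetical.

```python
import csv

# The four pieces of information requested for each ad
FIELDS = ["title", "location", "surface", "price"]


def save_ads_to_csv(ads, path):
    """Write a list of ad dictionaries to `path`, one row per ad."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(ads)


# Possible usage once the ads are collected:
# save_ads_to_csv(ads, "paris_ads.csv")
```

A missing field is written as an empty cell, while an unexpected extra key raises a `ValueError`, which helps catch typos in the scraping code.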

In [None]:
# insert your code
