# Class 8 - Web Crawling

---

1. We will use the `selenium` package to imitate human browsing behavior and collect data from websites. 
 - Compared with other packages, Selenium is more versatile: it supports both static and dynamic websites.

2. To use `selenium`, you need to install a browser driver, which lets you control the browser from Python scripts. 
  - We will rely on `chromedriver_autoinstaller` to automatically download and install the chromedriver version that matches the currently installed version of Chrome. 

In [None]:
! pip install selenium
! pip install chromedriver_autoinstaller

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import chromedriver_autoinstaller
from selenium.webdriver.common.by import By

In [None]:
#install the chrome driver
chromedriver_autoinstaller.install()

In [None]:
#run an automatically controlled chrome
driver=webdriver.Chrome()

Download this file to your local folder: https://juniorworld.github.io/python-workshop/doc/html_assignment_solutions.html

In [None]:
#control the driver to open your html document for the last assignment
#driver.get(URL) accepts both online URLs and local files; a local file is most reliably given as a file:// URL built from its absolute path
driver.get('ABSOLUTE PATH to the local assignment file')
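
If the browser does not accept a bare file path, one option is to convert the path to a `file://` URL first. The sketch below uses only the standard library. The helper name `local_file_url` is made up for illustration, and the file name is taken from the download link above, assuming the file sits in the current working folder.

```python
from pathlib import Path

def local_file_url(path):
    """Turn a local file path into a file:// URL that a browser can open."""
    # resolve() makes the path absolute; as_uri() requires an absolute path
    return Path(path).resolve().as_uri()

url = local_file_url("html_assignment_solutions.html")
print(url)
```

The resulting string can then be passed to `driver.get(url)`.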

In [None]:
#get a HTML file via a hyperlink
driver.get('URL')

In [None]:
#print the title
driver.title

In [None]:
#get the url of the page
driver.current_url

In [None]:
#print the source html document
print(driver.page_source)

## Find Element(s) by ID
Web elements in Selenium = Tags in HTML<br>
We can locate elements by their IDs.
- ID is the <font color="red">UNIQUE</font> identifier of an element. 
- This is the most efficient way to locate an element
- Syntax:

>```python
driver.find_element(By.ID,'id')
```

In [None]:
#locate an element with ID = name and save it as a "name" variable
name=driver.find_element(By.ID,'name')

<div class="alert alert-block alert-info">
    **<b>Tip</b>** You can use <b>.send_keys("input_content")</b> method to add content to the input field.</div>

In [None]:
#fill in your name in the input field
name.send_keys('Billie Jean')

<div class="alert alert-block alert-info">
**<b>Tip</b>** You can use <b>.clear()</b> to clear the input field.</div>

In [None]:
#clear the input field
name.clear()

#### Exercise
1. Locate the element with ID = pw and enter the password "123456" in this field
2. Set the email address to "123456@life.hkbu.edu.hk"

In [None]:
#Write your code here



## Find Element(s) by Tag Name
We can also locate elements by their tag names.
Syntax:

>```python
driver.find_element(By.TAG_NAME,'p') #return the first element matched
driver.find_elements(By.TAG_NAME,'p') #return a list of elements matched
```

In [None]:
#get the first <a> tag
link=driver.find_element(By.TAG_NAME,'a')

<div class="alert alert-block alert-info">
**<b>Tip</b>** You can use .get_attribute("attribute_name") to retrieve the value of a certain attribute.</div>

In [None]:
#print out the hyperlink inside <a> tag


In [None]:
#get all <input> tags
inputs=driver.find_elements(By.TAG_NAME,'input')

In [None]:
#how many <input> tags are there in the HTML file?


#### Exercise
Write a for loop to print out the type of each input tag

In [None]:
#Write your code here


## Find Element(s) by Xpath

We can locate elements by their relative/absolute paths in the file, with additional hints about their tag name, attribute name, and attribute value.<br>
- An XPath is an expression that describes the path to an HTML element
    - `/` is the sign of an **absolute path**:
        - if used at the beginning (rarely): the XPath starts from the <font color="red">root</font> node
        - if used in the middle: it refers to a **direct child** element at the next level
            - e.g. the XPath of &lt;head&gt; can be written as "html/head" or "/html/head". 
            - If you write "/head", Selenium will raise a "NoSuchElementException" error because &lt;head&gt; is not the root of the file.
            - HTML documents can be very long, with very complex structures, so we rarely start an XPath with "/": spelling out the entire absolute path is tedious.
    - `//` is the sign of a **relative path**: it refers to any element that matches the pattern <font color="red">no matter where it is</font>.
        - if used at the beginning: the XPath can start **anywhere**
            - "//div" will match all &lt;div&gt; elements, the same as driver.find_elements(By.TAG_NAME,"div")
        - if used in the middle: it refers to the (direct or indirect) **descendant** elements, e.g. "//div//p" will match all &lt;p&gt; under any &lt;div&gt;
        - what does "//div/p" match?
    - **Attribute selector**: We can locate elements by specifying their attribute values
        - Syntax: `//tag_name[@attribute_name='attribute_value']`
        - Example: "//input[@type='reset']" will match all inputs whose type is "reset"
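
The difference between `/` and `//` can be tried out without opening a browser. The sketch below is an illustration only: it uses Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath and expects paths written relative to the root element (`.//p` instead of `//p`). The small HTML snippet is made up for this example.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up document (well-formed, so the XML parser can read it)
doc = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div>
      <p>direct child</p>
      <section><p>nested deeper</p></section>
    </div>
    <p>outside any div</p>
    <form>
      <input type="text" name="user"/>
      <input type="reset" value="Reset"/>
    </form>
  </body>
</html>
"""
root = ET.fromstring(doc)  # root is the <html> element

# "/" in the middle: only direct children, one level down
head = root.findall("./head")
# "//" at the beginning: match <p> anywhere in the document
all_p = [p.text for p in root.findall(".//p")]
# "//" in the middle: <p> at any depth under a <div>
descendants = [p.text for p in root.findall(".//div//p")]
# "/" in the middle: <p> directly under a <div>
children = [p.text for p in root.findall(".//div/p")]
# attribute selector: inputs whose type is "reset"
resets = root.findall(".//input[@type='reset']")

print(len(head), all_p, descendants, children, len(resets))
```

In Selenium the same patterns would be written with a leading `//` or `/` and passed to `driver.find_elements(By.XPATH, ...)`.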

In [None]:
#locate the body by its absolute path
body=driver.find_element(By.XPATH,'/html/body')

<div class="alert alert-block alert-info">
**<b>Tip</b>** You can use .text to retrieve text wrapped up in the tag.</div>

In [None]:
#print out the text of the matched element
print(body.text)

In [None]:
#locate <body> tag by its relative xpath
body=driver.find_element(By.XPATH,'//body')
print(body.text)

In [None]:
#locate the reset button
reset=driver.find_element(By.XPATH,'//input[@type="reset"]')

<div class="alert alert-block alert-info">
**<b>Tip</b>** You can use .click() method to click a link, a button, or simply anything that is clickable.</div>

In [None]:
#click the reset button
reset.click()

#### What if more than one element matches the pattern?

1. Specify the index of the element
  - XPath follows the <font color="red">1-based</font> indexing system. It starts counting items from 1.
  - e.g. I want to get the fifth input element in the document: `"(//input)[5]"`
  - Note: without the parentheses, `"//input[5]"` matches every input that is the fifth &lt;input&gt; child of its own parent. The two forms give the same element only when all inputs share one parent.

2. Use the `driver.find_elements()` method, which returns a <font color="red">list</font> of elements. Then, use list indexing to extract the target item
  - driver.find_elements(By.XPATH,"//input")[4]
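
A position predicate such as `[2]` counts among the children of each parent, while Python list indexing counts over the whole result list. This can be checked without a browser. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in (an assumption for illustration: it is not Selenium and supports only a subset of XPath, but its position predicate follows the same per-parent rule). The form is made up.

```python
import xml.etree.ElementTree as ET

doc = """
<form>
  <p><input name="a"/><input name="b"/></p>
  <p><input name="c"/><input name="d"/></p>
</form>
"""
root = ET.fromstring(doc)

# Position predicate: the second <input> inside EACH parent (1-based)
second_per_parent = [e.get("name") for e in root.findall(".//input[2]")]

# Python list indexing: counts over the whole document (0-based)
all_inputs = root.findall(".//input")
second_overall = all_inputs[1].get("name")
third_overall = all_inputs[2].get("name")

# No parent has a third <input>, so this predicate matches nothing
third_per_parent = root.findall(".//input[3]")

print(second_per_parent, second_overall, third_overall, third_per_parent)
```

In full XPath, as used by Selenium, wrapping the path in parentheses, e.g. `(//input)[3]`, applies the index to the whole document instead.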

In [None]:
#1st way: Indexing in XPATH (the parentheses make the index count over the whole document)
fifth_input=driver.find_element(By.XPATH,"(//input)[5]")

In [None]:
#2nd way: Indexing in Python List
inputs=driver.find_elements(By.XPATH,"//input")
fifth_input=inputs[4]

In [None]:
#get the type of the fifth input element
fifth_input.get_attribute("type")

#### Exercise: Click the fourth radio button

In [None]:
#Write your code here



#### Exercise: Print out the text inside the sixth label tag

In [None]:
#Write your code here



<div class="alert alert-block alert-info">
**<b>Tip</b>** You can use * (wildcard) to match any element or attribute.</div>

In [None]:
#get the first element, of any tag, whose name attribute is "attitude"
driver.find_element(By.XPATH,"//*[@name='attitude']")

In [None]:
#get the first input tag whose value of any attribute is "radio"
driver.find_element(By.XPATH,"//input[@*='radio']")

<div class="alert alert-block alert-info">
    **<b>Tip</b>** You can use <font color="red">and</font> and <font color="red">or</font> operators to apply more complex conditions.</div>

In [None]:
inputs=driver.find_elements(By.XPATH,"//input[@*='radio' or @type='range']")

In [None]:
len(inputs)

In [None]:
#what is the result of the following expression?
len(driver.find_elements(By.XPATH,"//input[@*='radio' and @type='range']"))
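
The `or` and `and` conditions can be read as ordinary boolean logic applied to each element's attributes. The sketch below mirrors the two XPath predicates in plain Python on a made-up form. It uses the standard library's `xml.etree.ElementTree` only to parse the snippet, since its own path syntax does not support `@*` or `or`. The counts on the assignment page depend on that page's content and may differ.

```python
import xml.etree.ElementTree as ET

doc = """
<form>
  <input type="radio" name="attitude" value="1"/>
  <input type="radio" name="attitude" value="2"/>
  <input type="range" name="score"/>
  <input type="text" name="radio"/>
</form>
"""
inputs = ET.fromstring(doc).findall(".//input")

def any_attr_is(elem, value):
    """Mirror of the XPath test @*='value': some attribute equals value."""
    return value in elem.attrib.values()

# //input[@*='radio' or @type='range']
either = [e for e in inputs
          if any_attr_is(e, "radio") or e.get("type") == "range"]
# //input[@*='radio' and @type='range']
both = [e for e in inputs
        if any_attr_is(e, "radio") and e.get("type") == "range"]

print(len(either), len(both))
```

Note that `@*='radio'` also catches the text input whose `name` happens to be "radio", because the wildcard looks at every attribute.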

## Imitate Browsing Behavior

Some frequently used behaviors:
1. Click: `element.click()`
2. Add content to the input field: `element.send_keys('something')`
3. Clear existing content in the input field: `element.clear()`
4. Scroll: 
    - <font color="red"><b>Scroll to bottom:</b></font> `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`
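    - Scroll to a specific location (`window.scrollTo(x, y)` takes absolute coordinates measured from the top-left corner of the page):
      - scroll to 400px below the top, `driver.execute_script("window.scrollTo(0, 400);")`
      - scroll to 400px right of the left edge, `driver.execute_script("window.scrollTo(400, 0);")`
    - Scroll by a relative amount: use `window.scrollBy(x, y)`, e.g. `driver.execute_script("window.scrollBy(0, 400);")` scrolls down by 400px from the current position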
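
Pages that load content lazily often need several scrolls before everything is visible. A minimal sketch, assuming `driver` is the Chrome driver created earlier: the helper name `scroll_to_bottom`, the pause length, and the round limit are arbitrary choices for illustration, not part of Selenium.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=5):
    """Scroll to the page bottom repeatedly until the page stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return rounds
```

Calling `scroll_to_bottom(driver)` after `driver.get(...)` would then scroll at most five times, stopping early once the page height no longer changes.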

In [None]:
#click the link and scroll to the page bottom


---
# Break
---
# Quiz

https://www.menti.com/alv4rbooi7ro

<img src="https://juniorworld.github.io/python-workshop/img/Week%208_selenium_quiz.png" width="200" align="left">

# Practice: Red Book

In [None]:
#navigate to Red Book explore page
driver.get('https://www.xiaohongshu.com/explore')

In [None]:
#get a list of notes displayed on the page
#class=note-item
notes=driver.find_elements(By.XPATH,"//*[@class='note-item']")

In [None]:
print(len(notes))

In [None]:
#have a look at the text of the first note
notes[0].text

In [None]:
print(notes[0].text)

In [None]:
#split the text by line break
#post_title, author_name, likes
notes[0].text.split('\n')

In [None]:
#write a for loop to write all this info into a csv file
#Hint1: encoding='utf-8', if you need to write Chinese characters in a text file
#Hint2: values in csv file can be separated by \t (tab)
#Hint3: You can also create a dataframe and save it to an external csv file
#------------------------------------------------------------
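
One way to follow these hints, sketched with the standard library's `csv` module. The sample texts, the file name `notes.csv`, and the helper `note_to_row` are made up for illustration. The sketch also assumes each note's text has the three lines listed above (title, author, likes), which may not hold for every note on the live page.

```python
import csv

def note_to_row(text, n_fields=3):
    """Split a note's text into [post_title, author_name, likes]."""
    parts = [p.strip() for p in text.split("\n") if p.strip()]
    # Pad or trim so that every row has the same number of columns
    return (parts + [""] * n_fields)[:n_fields]

# Stand-ins for notes[i].text
sample_texts = ["一个标题\n作者甲\n128", "Another title\nauthor_b\n1.2万"]

# newline="" is recommended by the csv module; utf-8 keeps Chinese characters intact
with open("notes.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["post_title", "author_name", "likes"])
    for text in sample_texts:
        writer.writerow(note_to_row(text))
```

With the real page, the loop would run over `notes` and pass `note.text` to `note_to_row` instead of the sample strings.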



In [None]:
#scroll down the page to load more posts
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
notes=driver.find_elements(By.XPATH,"//*[@class='note-item']")
print(len(notes))

In [None]:
notes[0].text

#### Exercise
Write a nested for loop to:
1. Write post info to a csv file
2. Load more notes by scrolling to the bottom of the page
3. Write new post info to the csv file
4. Repeat Steps 2 and 3 FIVE times
5. Close the csv file
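
A possible shape for this loop, as a sketch rather than a model answer. It assumes `driver` already shows the explore page and that notes still carry the class `note-item`. The function name `crawl_notes`, the pause length, and the `seen` set (used to skip notes already written in an earlier round) are additions for illustration.

```python
import csv
import time

def crawl_notes(driver, csv_path, rounds=5, pause=2.0):
    """Write note texts to a tab-separated file, scrolling between rounds."""
    seen = set()  # texts already written, to avoid duplicate rows
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        # one initial pass, then one more pass after each scroll
        for round_no in range(rounds + 1):
            # "xpath" is the string value of selenium's By.XPATH
            notes = driver.find_elements("xpath", "//*[@class='note-item']")
            for note in notes:
                text = note.text
                if text in seen:
                    continue
                seen.add(text)
                writer.writerow(text.split("\n"))
            if round_no < rounds:
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(pause)  # wait for new notes to load
    # leaving the with-block closes the csv file
    return len(seen)
```

Calling `crawl_notes(driver, "notes.csv")` returns the number of distinct notes written. On a live page, elements may be replaced while scrolling, so reading `note.text` can occasionally fail and may need extra handling.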

In [None]:
#Write your code here



In [None]:
#add author home page ids
