# Class 9 - Web Crawling (Continued)

### Why do you get an Element Not Found error?
1. The page is not properly loaded
    - Solution: Check the browser status, refresh the page, and locate the element again
2. The element is not visible or has been removed from the current page → Stale Element Reference Exception
    - Solution: Check the browser status, refresh the page, and locate the element again
3. Wrong XPATH or element match conditions
    - Solution: Check the code and copy the automatically generated XPATH (from the Inspect window) for reference
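
When the cause is a page that has not finished loading, trying again after a short pause is often enough. The sketch below is one possible way to do that; `retry` and its parameters are hypothetical names introduced here for illustration and are not part of Selenium.

```python
import time

def retry(action, exceptions, attempts=3, delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out.

    action:     a function without arguments, e.g. a lambda wrapping find_element
    exceptions: a tuple of exception classes that should trigger another attempt
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except exceptions as error:
            last_error = error
            # Pause before the next attempt, but not after the final one
            if attempt < attempts - 1:
                sleep(delay)
    raise last_error

# Possible use with Selenium (assumes an open driver, so it is left commented out):
# from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
# title = retry(lambda: driver.find_element(By.CLASS_NAME, 'title'),
#               (NoSuchElementException, StaleElementReferenceException))
```

Retrying does not help with cause 3: a wrong XPATH fails on every attempt, so the last exception is raised again after the final try.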

### Anti-Bot Techniques

1. Rate limit: keeps the frequency of an operation below some threshold while still allowing relatively infrequent access, e.g. 100 calls/min
    - Calls (aka hits or requests): data exchanges that you launch in order to get an HTML document or other documents from a remote server.
        - <font color="red"> driver.get('URL')</font> One line, one call
    - Mechanism: (1) Account-based; (2) IP-based; (3) Session-based
    - Consequence: Account blocked or IP blocked
    - Solutions:
        - Reverse engineer the rate limit: count the number of calls and identify the threshold
        - Account-based: Slow down and use pseudo accounts
            - Add time.sleep() before every driver.get()
        - IP-based: Use a proxy and rotate through an IP pool
            - A proxy is an intermediary between client requests and server responses, e.g. HKBU's VPN
            - Unauthenticated proxy:
            >```python
            from selenium import webdriver
            from selenium.webdriver.chrome.options import Options

            chrome_options = Options()
            PROXY = "212.237.16.60:3128"
            #add proxy in chrome_options
            chrome_options.add_argument(f'--proxy-server={PROXY}')
            #start Chrome with the proxy options
            driver = webdriver.Chrome(options=chrome_options)
            #check new IP
            driver.get("https://api.ipify.org/?format=json")
            ```
        - Session-based: Use multiple browsers and rotate user-agent

<img src="https://madooei.github.io/cs421_sp20_homepage/assets/client-server-1.png" width=400>

2. Header & Cookies: The host inspects each request's header for non-human signifiers, such as "automation control", and blocks browsers without authentic cookies (browsing history).
    - Solutions: 
        - Turn off "useAutomationExtension" and exclude the "enable-automation" switch
        >```python
        #Disable the automatic software signifier
        options = webdriver.ChromeOptions() 
        options.add_argument("--disable-blink-features=AutomationControlled") 
        options.add_experimental_option("excludeSwitches", ["enable-automation"]) 
        options.add_experimental_option("useAutomationExtension", False) 
        driver = webdriver.Chrome(options=options) 
        driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})") 
        ```
        - Add cookies
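        >```python
        #cookie_dict needs at least "name" and "value" keys; open a page on the cookie's domain first
        driver.add_cookie(cookie_dict)
        #list all cookies of the current session
        driver.get_cookies()
        ```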
        - Rotate user-agent
        >```python
        driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
        ```
3. CAPTCHA: Asks users to perform a task that is hard for bots to complete, in order to verify that they are human
    - Text recognition
    - Click
    - Simple slider
    - Puzzle slider
    - Automatic reCAPTCHA
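
Going back to the rate limit in item 1: once you have a guess for the threshold (the text uses 100 calls/min as an example), "slowing down" can be made systematic. The following is a minimal sketch, not a definitive implementation; `RateLimiter` is a hypothetical helper class written for this handout.

```python
import time
from collections import deque

class RateLimiter:
    """Allow at most max_calls within any sliding window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.calls = deque()    # timestamps of the recent calls

    def _forget_old_calls(self, now):
        # Drop timestamps that have left the sliding window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()

    def wait(self):
        """Block until one more call is allowed, then record it."""
        now = self.clock()
        self._forget_old_calls(now)
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self._forget_old_calls(now)
        self.calls.append(now)

# Possible use (assumes an open driver and a list of URLs, so it is left commented out):
# limiter = RateLimiter(max_calls=100, period=60)
# for url in url_list:
#     limiter.wait()
#     driver.get(url)
```

Calling `limiter.wait()` right before each `driver.get()` keeps the call count under the assumed threshold without a fixed pause on every request.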

<table>
<tr><td><img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*0LXnPGyW3gHt_tKZlGL4Bg.png" width=500></td><td><img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*GMcgXCSRkGW7GORpTh543Q.png" width=500></td><td><img src="https://images.squarespace-cdn.com/content/v1/5f8efd464888244a12c59aaf/3abf136f-6200-4343-84b7-eaca49d5f94b/Bot-Verification.png?format=1000w" width=500></td></tr>
    <tr><td><img src="https://www.jqueryscript.net/images/alphanumeric-captcha.jpg" width=500></td>
    <td><img src="https://www.jqueryscript.net/images/google-recaptcha-async.jpg" width=500></td>
    </tr>
</table>

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import chromedriver_autoinstaller
from selenium.webdriver.common.by import By
import pandas as pd

Download the file https://juniorworld.github.io/python-workshop/doc/RedBook_HKBU_COMM.csv

## Red Book - Collect Post Details

In [None]:
driver=webdriver.Chrome()

In [None]:
#Read post table
table=pd.read_csv('PATH TO FILE')

In [None]:
#Open first post
driver.get("URL TO THE FIRST POST")

In [None]:
#Get title
title=driver.find_element(By.CLASS_NAME,'title').text

In [None]:
#Get content list
content_list=

In [None]:
#Combine all content into a single string
content='' #Initialize an empty string
for i in content_list:
    content=

In [None]:
#Get the post time
date=driver.find_element(By.CLASS_NAME,'date').text

In [None]:
#Get the numbers of likes, favorites, and comments received
#Hint: Look up the match conditions in the Inspect window
#      to make sure each locator matches only the intended element
likes=
favs=
comments=

In [None]:
#Get img list
#Find Element By Class Name does not support class name with spaces
#You need to use XPATH to locate the element
img_list=[]
first_img=driver.find_element()

In [None]:
#Use split() function to cut out the hyperlink
first_img_url=

In [None]:
#Use a regular expression to find the URL in any string
#Characters allowed in URL: letters, digits, and a few special characters like . ? and -
import re

Regular Expression Cheat Sheet:
1. `.`: Wildcard, any character (except a line break by default)
2. `[abc]`: Character set (a or b or c); `[a-z]`: range (any character from a to z)
    - `\`: escapes the special character that follows
    - special characters that need to be escaped: `^ [ . $ { * ( \ + ) | ? < >`
3. `[^abc]`: Negated set, not (a or b or c)
4. Quantifier:
    - `*`: at least 0 times
    - `+`: at least 1 time
    - `?`: at most 1 time
    - `{n}`: exactly n times
    - `{n,}`: at least n times
    - `{n,m}`: n to m times (m ≥ n)
5. Character class:
    - `\s`: White space, `\S`: Not white space
    - `\d`: Digit, `\D`: Not digit
    - `\w`: Word character (letter, digit, or underscore), `\W`: Not word character
    - `\n`: Line break, `\t`: Tab
    - `\c` (control character), `\x` (hexadecimal digit), `\O` (octal digit), `[:alnum:]`, and `[:punct:]` come from other regex flavors and do not work as character classes in Python's `re`; use `[0-9a-zA-Z]` for digits and letters and `[0-9a-fA-F]` for hexadecimal digits
6. Location:
    - `^`: the start of the string
    - `$`: the end of the string
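
A few of the cheat-sheet items in action. The sample strings are made up for illustration; the expected results are shown in the comments.

```python
import re

text = "Post 42 got 1300 likes on 2023-03-15"

# Quantifiers: + (at least once) versus {4} (exactly four times)
print(re.findall(r"\d+", text))        # ['42', '1300', '2023', '03', '15']
print(re.findall(r"\d{4}", text))      # ['1300', '2023']

# Negated set: runs of characters that are not digits, white space, or hyphens
print(re.findall(r"[^\d\s-]+", text))  # ['Post', 'got', 'likes', 'on']

# Location: ^ anchors the start of the string, $ anchors the end
print(re.findall(r"^\w+", text))       # ['Post']
print(re.findall(r"\d+$", text))       # ['15']

# ? makes the preceding character optional
print(re.findall(r"colou?r", "color colour"))  # ['color', 'colour']

# An escaped dot matches a literal "." only, not any character
print(re.findall(r"\.jpg", "a.jpg bxjpg"))     # ['.jpg']
```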

### Replace Substrings
- `re.sub(pattern, new_string, original_string)`
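
Two small uses of `re.sub` that tend to come up when cleaning scraped text. The input strings below are invented examples, not actual Red Book output.

```python
import re

# Collapse any run of white space (spaces, tabs, line breaks) into one space
raw_title = "Red   Book\n\tpost"
print(re.sub(r"\s+", " ", raw_title))   # Red Book post

# Remove everything that is not a digit, e.g. before converting a count to int
raw_count = "1,234 likes"
print(int(re.sub(r"\D", "", raw_count)))  # 1234
```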

### Find All Substrings That Match the Pattern
- `re.findall(re_pattern,string)`
- result is a **LIST** of matched substrings!
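
As an illustration, here are two ways to pull a URL out of a longer string with `re.findall`. The `style` string is a hypothetical example of an inline style attribute; the real attribute on the page may look different, so check it in the Inspect window first.

```python
import re

# Hypothetical attribute value, for illustration only
style = 'background-image: url("https://example.com/img/abc-123.jpg?size=large");'

# Way 1: list every character that may appear in the URL
by_charset = re.findall(r"https?://[\w./?=&%-]+", style)
print(by_charset)   # ['https://example.com/img/abc-123.jpg?size=large']

# Way 2: define what comes right before and right after the URL,
# and capture everything in between with a group (.+? stops at the first match)
by_boundaries = re.findall(r'url\("(.+?)"\)', style)
print(by_boundaries)  # ['https://example.com/img/abc-123.jpg?size=large']

# findall always returns a list, so take the first item to get a string
first_url = by_boundaries[0]
```

Way 1 breaks if the URL contains a character missing from the set; Way 2 breaks if the surrounding text changes. Which one is more robust depends on the page.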

In [None]:
#Recap
#Encrypt all numbers with *
email="Dear mam, My name is Peter Pan and my password is 9384894023. I am from CHINA. Can you help me remove my account?"
re.sub(r'\d','*',email)

In [None]:
#Replace all-capitalized words with ?


In [None]:
#Extract img URL
#1: Include all characters appearing in the URL


In [None]:
#2: Define the starting and closing characters in the URL


In [None]:
#Extract the URL of the second img
#Step 1: Identify the "Next" button on the image
#Step 2: Click it
#Step 3: Extract the url
#--------------------------------------------------
next_button=


In [None]:
#Write a Loop to Collect All URLs
#Flow Chart, Change value, Exit condition



# QUIZ

https://www.menti.com/al67356nnssp

<img src="https://juniorworld.github.io/python-workshop/img/Week%209_selenium_quiz.png" width=200 align="left">

## Red Book - Collect Comment Details

In [None]:
#Get a LIST of comments visible in the current view
#Find Elements
comments=driver.find_elements(By.CLASS_NAME,"comment-item")

In [None]:
#How many comments are visible?
len(comments)

In [None]:
comments[0].text

In [None]:
#Way 1: Rough match. Split by Line Breaks
#You can use line breaks to split the text into separate pieces of info
#But this only works because Red Book does not allow line breaks in comments
#This rule does not apply to other platforms, like Weibo or Facebook
comments[0].text.split('\n')

In [None]:
#Way 2: Precise match
#Identify the child nodes by their XPATH
#RELATIVE PATH: a path without a leading "/" starts from the current node
#"." also represents the current node, e.g. './div[@class="right"]'
#ABSOLUTE PATH: a leading "/" starts from the root <html>, so don't add "/" at the beginning
comments[0].find_element(By.XPATH,'div[@class="right"]/div[@class="author-wrapper"]').text

In [None]:
#Get the comment content, comment time, comment location, likes received by the comment
comment_content=
comment_date=
comment_location=
comment_likes=

### Exercise 1
Write a loop to extract the first 10 comments and save them as a data frame

In [None]:
#Write your code here


## Scroll down in an internal container
We need to scroll down in the comment container to load more comments. <br>
However, the comment container's behavior is independent of the page's behavior at large. If you scroll to the bottom of the page, the state of the comment container will not change.<br>
So, what we need to do here is to imitate scrolling specifically on the container element.<br>
Technically, this is equivalent to scrolling the last visible comment to the top of the container. 

<div style="background-color:#B5CAA0">
<h2>Check the content loading mode</h2>
<hr>
Old Content: 1. Removed 2. Existing<br>
New Content: 1. Whole new 2. Overlap 3. Gap
</div>
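
If the loading mode turns out to be "Existing" or "Overlap", the same comment shows up in several rounds, so a collection loop has to skip what it has already seen. Below is a minimal sketch of such a loop. `collect_all`, `load_visible`, and `scroll` are hypothetical names: the two functions are passed in so the logic can be tried without a browser.

```python
def collect_all(load_visible, scroll, max_rounds=50):
    """Collect items over repeated scrolls, skipping items already seen.

    load_visible: function returning the list of currently visible items
    scroll:       function that scrolls to load more content
    Exit condition: a round that brings nothing new, or max_rounds reached.
    """
    seen = set()
    collected = []
    for _ in range(max_rounds):
        new_count = 0
        for item in load_visible():
            if item not in seen:
                seen.add(item)
                collected.append(item)
                new_count += 1
        if new_count == 0:
            break
        scroll()
    return collected

# Stand-in for a page whose new content overlaps with the old content
pages = [["a", "b", "c"], ["b", "c", "d"], ["d", "e"], ["d", "e"]]
position = {"i": 0}

def fake_load():
    return pages[min(position["i"], len(pages) - 1)]

def fake_scroll():
    position["i"] += 1

print(collect_all(fake_load, fake_scroll))  # ['a', 'b', 'c', 'd', 'e']
```

Two caveats: comparing by text merges two different comments that happen to have identical text, and a "Gap" loading mode (content skipped between rounds) cannot be repaired by this loop. With Selenium, `load_visible` could return the `.text` of each `comment-item` element and `scroll` could wrap the `move_to_element` call used in this section.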

In [None]:
#print out the last comment visible on the current page for reference
comments[-1].text

In [None]:
from selenium.webdriver.common.action_chains import ActionChains
actions = ActionChains(driver)

In [None]:
#Scroll to the last comment to load more comments
actions.move_to_element(comments[-1]).perform()

In [None]:
#check the content loading mode



### Exercise 2
Write a Loop to save all comments into a data frame.

## Weibo - People's Daily

Log into your account and switch back to the **old version**.

What's good about the old version:
1. Allows pagination by setting the page parameter ("page=2", "page=3") or the date parameter ("stat_date=202303")
2. More stable and less complex
3. No anti-bot techniques in place
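
Pagination by parameter means the page URLs can be generated up front. The sketch below only builds the strings. `page_urls` is a hypothetical helper, and whether the site accepts the parameters in exactly this form, or combined like this, is an assumption to verify in the browser.

```python
from urllib.parse import urlencode

def page_urls(base_url, pages, extra_params=None):
    """Build one URL per page number by appending query parameters."""
    urls = []
    for page in pages:
        params = dict(extra_params or {})
        params["page"] = page
        urls.append(f"{base_url}?{urlencode(params)}")
    return urls

print(page_urls("https://weibo.com/rmrb", range(1, 4)))
# ['https://weibo.com/rmrb?page=1', 'https://weibo.com/rmrb?page=2', 'https://weibo.com/rmrb?page=3']

print(page_urls("https://weibo.com/rmrb", [1], {"stat_date": "202303"}))
# ['https://weibo.com/rmrb?stat_date=202303&page=1']
```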

In [None]:
driver.get("https://weibo.com/rmrb")

In [None]:
#get the posts visible on the current page
post_list=driver.find_elements(By.XPATH,'//div[@class="WB_feed_detail clearfix"]')

In [None]:
len(post_list)

In [None]:
post_list[0].text

In [None]:
first_post=post_list[0]

In [None]:
#Get the post time
created_at=first_post.find_element(By.XPATH,'.//div[@class="WB_from S_txt2"]').text

In [None]:
#Get the following values for the first post
#pid, text, comments, shares, likes
pid=
text=
comments=
shares=
likes=

In [None]:
#write a for loop to save the post details to a csv file



In [None]:
#print out the last post for our own reference
post_list[-1].text

In [None]:
#scroll to the bottom to load more posts
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
#check the page loading mode


In [None]:
#scroll down the page five times
#collect all posts



In [None]:
#expand the collapsed post to show full text

