# Class 10 - Web Crawling (Continued)

### Element Not Found error
1. The page is not fully loaded
    - Solution: Check the browser status, refresh the page, and locate the element again
2. The element is not visible or has been removed from the current page → Stale Element Reference Exception
    - Solution: Check the browser status, refresh the page, and locate the element again
3. Wrong XPATH or element match conditions
    - Solution: Check your code, and copy the XPATH that the browser generates automatically (in the inspect window) as a reference
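
When the cause is a page that has not finished loading, retrying the lookup after a short pause is often enough. The block below is a minimal sketch of that idea, not part of Selenium: `retry` and `flaky_lookup` are hypothetical names, and the flaky function only simulates an element that appears on the third attempt.

```python
import time

def retry(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except exceptions as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # give the page time to finish loading
    raise last_error

# Simulated lookup (hypothetical): fails twice, then succeeds
calls = {"n": 0}

def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise LookupError("element not found yet")
    return "found"

result = retry(flaky_lookup, attempts=5, delay=0.01)
print(result, calls["n"])

# With Selenium, the same helper could wrap a real lookup, for example:
# from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
# text = retry(lambda: driver.find_element(By.CLASS_NAME, 'date').text,
#              exceptions=(NoSuchElementException, StaleElementReferenceException))
```

Passing a narrow tuple of exceptions keeps genuine bugs, such as a typo in the XPATH, from being silently retried.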

### Anti-Bot Techniques

1. Rate limit: cap the frequency of an operation at some threshold while still allowing relatively infrequent access, e.g. 100 calls/min
    - Calls (aka hits or requests): data exchanges that you initiate to get an HTML document or other documents from a remote server.
        - <font color="red"> driver.get('URL')</font> One line, one call
    - Mechanism: (1) Account-based; (2) IP-based; (3) Session-based
    - Consequence: Account blocked or IP blocked
    - Solutions:
        - Reverse engineer the rate limit: count the number of calls and identify the threshold
        - Account-based: Slow down and use pseudo accounts
            - Add time.sleep() before every driver.get()
        - IP-based: Use proxies and rotate through an IP pool
            - A proxy is an intermediary between client requests and server responses, e.g. HKBU's VPN
            - Unauthenticated proxy:
            >```python
            chrome_options = Options()
            PROXY = "212.237.16.60:3128"
            #add proxy in chrome_options
            chrome_options.add_argument(f'--proxy-server={PROXY}')
            driver = webdriver.Chrome(service=webdriver.ChromeService(PATH), options=chrome_options)
            #check new IP
            driver.get("https://api.ipify.org/?format=json")
               ```
        - Session-based: Use multiple browsers and rotate user-agent

<img src="https://madooei.github.io/cs421_sp20_homepage/assets/client-server-1.png" width=400>
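
The "slow down" advice for rate limits can be sketched as a small throttle that enforces a minimum gap between calls. This is an illustrative sketch only: the `Throttle` class is a hypothetical helper, the 100 calls/min figure is just the example threshold from the list above, and the fake clock exists so the demo runs instantly.

```python
import random
import time

class Throttle:
    """Enforce a minimum gap between consecutive calls."""

    def __init__(self, max_calls_per_min, jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_gap = 60.0 / max_calls_per_min  # seconds between calls
        self.jitter = jitter                     # extra random delay, in seconds
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        if self.last_call is not None:
            remaining = self.min_gap - (self.clock() - self.last_call)
            if remaining > 0:
                self.sleep(remaining + random.uniform(0, self.jitter))
        self.last_call = self.clock()

# Demo with a fake clock so nothing really sleeps
fake_now = [0.0]
naps = []

def fake_sleep(seconds):
    naps.append(seconds)
    fake_now[0] += seconds

throttle = Throttle(max_calls_per_min=100,
                    clock=lambda: fake_now[0], sleep=fake_sleep)
for _ in range(3):
    throttle.wait()  # in a crawler, driver.get(url) would follow here

print(naps)
```

In a real crawler you would create `Throttle(max_calls_per_min=...)` with the default clock and call `throttle.wait()` right before each `driver.get()`; a non-zero `jitter` makes the timing look less mechanical.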

2. Header & Cookies: The host inspects each request's header for non-human signifiers, such as an "automation control" flag, and blocks browsers without authentic cookies (browsing history).
    - Solutions: 
        - Set "useAutomationExtension" to False and exclude the "enable-automation" switch
        - Add cookies
        - Rotate user-agent

        >```python
        #Add cookies
        driver.add_cookie(cookie_dict)
        #List all cookies in the current session
        driver.get_cookies()
        #Change user-agent
        driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
        ```
3. CAPTCHA: Ask users to perform a task that is hard for bots to complete, in order to verify that they are human
    - Text recognition
    - Click
    - Simple slider
    - Puzzle slider
    - Automatic reCAPTCHA
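
The rotation idea from items 1 and 2 (rotating proxies and user-agents) can be sketched with `itertools.cycle`. Everything in the pools below is a placeholder: the proxy addresses come from a reserved documentation range and will not work, and the user-agent strings are only examples.

```python
from itertools import cycle

# Placeholder pools (assumptions for illustration, not working values)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36",
]
PROXIES = ["203.0.113.10:3128", "203.0.113.11:3128"]

ua_pool = cycle(USER_AGENTS)    # endless iterator: restarts after the last item
proxy_pool = cycle(PROXIES)

def next_identity():
    """Return the next user-agent/proxy pair from the two pools."""
    return {"user_agent": next(ua_pool), "proxy": next(proxy_pool)}

identities = [next_identity() for _ in range(4)]
for identity in identities:
    print(identity["proxy"], identity["user_agent"][:40])

# Applying one identity to a new browser, using calls shown in this notebook:
# options.add_argument(f'--proxy-server={identity["proxy"]}')
# driver.execute_cdp_cmd('Network.setUserAgentOverride',
#                        {"userAgent": identity["user_agent"]})
```

Because the two pools have different lengths, the pairs do not repeat in lockstep, which gives more distinct combinations than the size of either pool.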

<table>
<tr><td><img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*0LXnPGyW3gHt_tKZlGL4Bg.png" width=500></td><td><img src="https://miro.medium.com/v2/resize:fit:1100/format:webp/1*GMcgXCSRkGW7GORpTh543Q.png" width=500></td><td><img src="https://images.squarespace-cdn.com/content/v1/5f8efd464888244a12c59aaf/3abf136f-6200-4343-84b7-eaca49d5f94b/Bot-Verification.png?format=1000w" width=500></td></tr>
    <tr><td><img src="https://www.jqueryscript.net/images/alphanumeric-captcha.jpg" width=500></td>
    <td><img src="https://www.jqueryscript.net/images/google-recaptcha-async.jpg" width=500></td>
    </tr>
</table>

In [None]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import chromedriver_autoinstaller
from selenium.webdriver.common.by import By
import pandas as pd
import numpy as np
import time

In [None]:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
#install the chrome driver
#copy the returned path
chromedriver_autoinstaller.install()

In [None]:
#Disable the automatic control signifier
options = webdriver.ChromeOptions() 
options.add_argument("--disable-blink-features=AutomationControlled") 
options.add_experimental_option("excludeSwitches", ["enable-automation"]) 
options.add_experimental_option("useAutomationExtension", False)

In [None]:
#paste the returned path inside the quotation marks
#run an automatically controlled Chrome
driver=webdriver.Chrome(service=webdriver.ChromeService('Your Path'),options=options)

## Red Book - Collect Post Details

Download the file https://juniorworld.github.io/python-workshop/doc/RedBook_HKBU_COMM.csv

In [None]:
#Read post table
table=pd.read_csv('https://juniorworld.github.io/python-workshop/doc/RedBook_HKBU_COMM.csv')

In [None]:
table.head()

In [None]:
#Open first post


In [None]:
#Get title
#Fill in the blank
title=driver.find_element(By.ID, ).text

In [None]:
title

In [None]:
#Get post content
#Fill in the blank
content=driver.find_element().text

In [None]:
print(content)

In [None]:
#Get the post time
date=driver.find_element(By.CLASS_NAME,'date').text

In [None]:
date

In [None]:
#Get the numbers of likes, favorites, and comments received
#Hint: Check the matching conditions in the inspect window
#      to make sure they do not match the wrong elements
likes=driver.find_element(By.XPATH,"//div[@class='interact-container']//div[@class='left']/span[1]").text
favs=
comments=

In [None]:
print(likes,favs,comments)

<div class="alert alert-block alert-info">
**<b>Contains Selector</b>** You can use the contains() function to locate an element whose attribute contains a given value as a substring. Syntax: Xpath=//tagname[contains(@Attribute, 'Value')]</div>

In [None]:
#Get img list
#Find Element By Class Name does not support class name with spaces
#You need to use XPATH to locate the element
imgs=driver.find_elements(By.XPATH,'//div[contains(@class,"swiper-slide")]')

In [None]:
#Actual image number = 7
#The list contains duplicates, which are inserted so that the slider can loop seemingly endlessly
len(imgs)

In [None]:
#imgs[0] is identical to imgs[8]
imgs[0].get_attribute('style')

<div class="alert alert-block alert-info">
**<b>Regular Expression</b>** You can use re.findall() function to extract all matches from a string. The return value will be a list.</div>

In [None]:
#Use regular expression to find URL in any string
#Characters allowed in URL: letters, numbers, and a few special characters like ! . ? / : _ and -
import re
a = imgs[0].get_attribute('style')
re.findall('http',a)

In [None]:
#Raw string (r'...') keeps the backslash in \- from being treated as a string escape
re.findall(r'http[a-zA-Z0-9!.?/:_\-]+',a)[0]

In [None]:
#Write a loop to extract image links and save them as a list
img_links=[]


In [None]:
img_links

In [None]:
#Remove duplicates
#1: set() function can convert a list to a set with distinct elements
list(set(img_links))

In [None]:
#2. np.unique() function
np.unique(img_links)
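
Both `set()` and `np.unique()` drop the original order (`np.unique()` returns sorted values). If the slide order matters, `dict.fromkeys()` is one way to remove duplicates while keeping the first occurrence of each link. The sketch below uses made-up style strings shaped like the output of `get_attribute('style')`; the `example.com` URLs are placeholders.

```python
import re

# Hypothetical style strings, including the duplicated slides
styles = [
    'width: 400px; background: url("https://example.com/img/c3.jpg");',
    'width: 400px; background: url("https://example.com/img/a1.jpg");',
    'width: 400px; background: url("https://example.com/img/b2.jpg");',
    'width: 400px; background: url("https://example.com/img/c3.jpg");',
    'width: 400px; background: url("https://example.com/img/a1.jpg");',
]

URL_PATTERN = re.compile(r'http[a-zA-Z0-9!.?/:_\-]+')

links = []
for style in styles:
    matches = URL_PATTERN.findall(style)
    if matches:                 # skip slides without a URL
        links.append(matches[0])

# dict keys are unique and keep insertion order
unique_links = list(dict.fromkeys(links))
print(unique_links)
```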

### Exercise 1
Write a loop to extract the first 10 posts' details and save them as a CSV file named "COMM_Post_Details.csv"<br>
Post details should include title, content, date, likes, favs, comments, and imgs

In [None]:
#Write your code here
import csv
file=open('COMM_Post_Details.csv','w',encoding='utf-8',newline='\n')
writer=csv.writer(file)




file.close()

In [None]:
posts_table=pd.read_csv('COMM_Post_Details.csv',header=None)

In [None]:
posts_table.head()
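
As a side note on the file handling, the sketch below shows one possible pattern for writing rows with the `csv` module: a `with` block closes the file even if the loop fails halfway, and `newline=''` is the setting recommended in the `csv` documentation. The rows, the demo file name, and the choice to join image links with `;` are assumptions for illustration.

```python
import csv
import os
import tempfile

header = ["title", "content", "date", "likes", "favs", "comments", "imgs"]
# Made-up rows; in the exercise each row would come from one post page
rows = [
    ["Sample title 1", "First line\nsecond line", "2024-02-01", "12", "3", "4",
     "https://example.com/a.jpg;https://example.com/b.jpg"],
    ["Sample title 2", "Another post", "2024-02-02", "5", "0", "1",
     "https://example.com/c.jpg"],
]

path = os.path.join(tempfile.gettempdir(), "demo_post_details.csv")

# The file is closed automatically when the with block ends
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back to check that multi-line content survived
with open(path, encoding="utf-8", newline="") as f:
    loaded = list(csv.reader(f))

print(len(loaded), loaded[1][1])
```

If a header row is written like this, the file can be read with `pd.read_csv(path)` directly, without `header=None`.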

## Red Book - Collect Comment Details

In [None]:
driver.get(table['Link'].iloc[0])

In [None]:
#Get a LIST of comments visible in the current view
#Find Elements
comments=driver.find_elements(By.CLASS_NAME,"comment-item")

In [None]:
#How many comments are visible?
len(comments)

In [None]:
print(comments[0].text)

In [None]:
#Way 1: Rough match. Split by line breaks
#You can use line breaks to split the text into separate pieces of info
#But this relies on a strong assumption: Red Book does not allow line breaks inside a comment
#This assumption does not hold on other platforms, like Weibo or Facebook
commenter=comments[0].text.split('\n')[0]
comment_content=comments[0].text.split('\n')[1]
comment_time=comments[0].text.split('\n')[2]
comment_likes=comments[0].text.split('\n')[3]
comment_replies=comments[0].text.split('\n')[4]

In [None]:
print(commenter,comment_content,comment_time,comment_likes,comment_replies)

In [None]:
#Shortcut to assign a list of values to multiple variables
commenter,comment_content,comment_time,comment_likes,comment_replies=comments[0].text.split('\n')

In [None]:
#You can use _ to skip elements that you are uninterested in
commenter,comment_content,comment_time,_,_=comments[0].text.split('\n')
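
Unpacking into a fixed number of variables raises a `ValueError` as soon as a comment has more or fewer lines than expected. A small helper that pads or trims the list is one way to guard against that; the sketch below uses made-up comment text and a hypothetical `parse_comment` function.

```python
def parse_comment(text, n_fields=5):
    """Split a comment block into exactly n_fields pieces."""
    parts = text.split('\n')
    parts += [''] * (n_fields - len(parts))  # pad short comments with ''
    return parts[:n_fields]                  # trim unexpectedly long ones

# Made-up examples: a complete comment and one with missing lines
full_text = "Alice\nNice photo!\n02-03 Hong Kong\n15\n2"
short_text = "Bob\nWhere is this?\n02-04"

commenter, comment_content, comment_time, comment_likes, comment_replies = parse_comment(full_text)
print(commenter, comment_likes)
print(parse_comment(short_text))
```

Padding only helps when the missing pieces are at the end. If a line in the middle is absent, the remaining values shift into the wrong variables, which is one reason the precise match in Way 2 is safer.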

In [None]:
#Way 2: Precise match
#Identify the child nodes by their XPATH
#RELATIVE PATH: Start the path with "." which will confine the search to the elements subordinate to the current node
#Author screen name
comments[0].find_element(By.XPATH,'.//div[@class="author"]').text

In [None]:
#Author Profile Link
#Write the path
comments[0].find_element(By.XPATH,'WRITE YOUR PATH').get_attribute('href')

In [None]:
#Get the comment content, comment time, comment location, likes received by the comment
comment_content=comments[0].find_element(By.XPATH,'.//div[@class="content"]').text
comment_date=
comment_likes=

In [None]:
print(comment_content,comment_date,comment_likes)

### Exercise 2
Write a loop to extract the first 10 comments and save them as a CSV file named "COMM_1POST_10comments.csv"
<br>Required columns: commenter, comment_content, comment_date, comment_likes

In [None]:
#Write your code here
file=open("COMM_1POST_10comments.csv",'w',encoding='utf-8',newline='\n')
writer=csv.writer(file)
for comment in comments:
    
file.close()

In [None]:
comment_table=pd.read_csv('COMM_1POST_10comments.csv',header=None)
comment_table.head()

## Scroll down in an internal container
We need to scroll down in the comment container to load more comments. <br>
However, the comment container's behavior is independent of the page's behavior at large. If you scroll to the bottom of the page, the state of the comment container will not change.<br>
So, what we need to do here is imitate scrolling behavior specifically on the container element.<br>
Technically, this is equivalent to scrolling the last visible comment to the top of the container. 

<div style="background-color:#B5CAA0">
<h2>Check the content loading mode</h2>
<hr>
Old Content: 1. Existing 2. Partly Removed 3. Completely Removed<br>
New Content: 1. Seamlessly Following 2. Gap
</div>
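
The loading mode decides how results should be accumulated. If old content may be removed after scrolling, each batch of visible items has to be merged into a running collection, and the loop can stop once a scroll brings nothing new. The block below sketches that logic without a browser: `collect_all` and `FakeContainer` are hypothetical names, and the fake container simply shows a sliding window of four comments.

```python
def collect_all(get_visible, scroll_once, max_rounds=50):
    """Merge visible items across scrolls until nothing new appears."""
    seen = {}  # dict keeps insertion order and ignores repeats
    for _ in range(max_rounds):
        before = len(seen)
        for text in get_visible():
            seen.setdefault(text, None)
        if len(seen) == before:
            break          # the last scroll loaded nothing new
        scroll_once()
    return list(seen)

class FakeContainer:
    """Stand-in for the comment container (old content is partly removed)."""

    def __init__(self, total, window=4, step=3):
        self.items = [f"comment {i}" for i in range(total)]
        self.window, self.step, self.top = window, step, 0

    def visible(self):
        return self.items[self.top:self.top + self.window]

    def scroll(self):
        last_top = max(len(self.items) - self.window, 0)
        self.top = min(self.top + self.step, last_top)

fake = FakeContainer(total=10)
all_comments = collect_all(fake.visible, fake.scroll)
print(len(all_comments), all_comments[-1])

# With Selenium, the two callables could look like this (plus a short
# time.sleep() after scrolling so new comments have time to load):
# get_visible = lambda: [c.text for c in driver.find_elements(By.CLASS_NAME, "comment-item")]
# scroll_once = lambda: driver.execute_script(
#     "arguments[0].scrollIntoView(true);",
#     driver.find_elements(By.CLASS_NAME, "comment-item")[-1])
```

Using the comment text as the key means two comments with identical text are merged into one; a stable identifier from the page, if one exists, would be a safer key.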

In [None]:
#print out the last comment visible on the current page for reference
comments[-1].text

In [None]:
#Way 1: Automate Scrolling Behaviors: ScrollIntoView
#Reminder: You may need to switch to the automated browser window to see the scrolling effect
driver.execute_script("arguments[0].scrollIntoView(true);",comments[-1])

In [None]:
comments=driver.find_elements(By.CLASS_NAME,"comment-item")

In [None]:
comments[0].text

In [None]:
len(comments)

In [None]:
comments[-1].text

In [None]:
#check the content loading mode
for comment in comments:
    print(comment.text)

In [None]:
#Way 2: Automate Mouse Behaviors
from selenium.webdriver.common.action_chains import ActionChains
actions = ActionChains(driver)

<div class="alert alert-block alert-info">
**<b>Tip</b>** move_to_element() moves the mouse to the in-view center point of the element. This is otherwise known as “hovering.” Note that the element must be in the viewport, or else the command will raise an error.</div>

In [None]:
#Scroll to the last comment to load more comments
actions.move_to_element(comments[-1]).perform()

### Exercise 3
Write a Loop to extract ALL comments and save them as a data frame

In [None]:
#Write your code here



In [None]:
file=open("COMM_1POST_all_comments.csv",'w',encoding='utf-8',newline='\n')
writer=csv.writer(file)
for comment in comments:
    
file.close()

In [None]:
#header=None: the file was written without a header row
comments_table=pd.read_csv("COMM_1POST_all_comments.csv",header=None)
comments_table.head()

## Weibo - People's Daily

The scrolling method is not reliable if you want to collect many posts: it is hard to identify the exact points where the collection process starts and ends.<br>
To work around this limitation, we will restrict our search to February 2024 and then repeat the process for the remaining months. 

In [None]:
driver.get("https://weibo.com/rmrb")

In [None]:
Feb_link="https://weibo.com/rmrb?is_ori=1&is_text=1&is_pic=1&is_video=1&is_music=1&is_forward=1&start_time=1706716800&end_time=1709222400"

- The start time and end time are formatted as UNIX timestamps, which represent the number of seconds that have elapsed since 00:00:00 UTC on 1 January 1970.
- To get the UNIX timestamp for a certain time, we can utilize the `datetime` package.
- Use `datetime.datetime(Year, Month, Day, Hour, Minute)` to create a datetime object, then convert it to a time tuple with the `.timetuple()` method.
- <font color='red'>Time Zone</font>: We are in the UTC+8 time zone. We need to subtract <font color='red'>**8**</font> hours from our local time to obtain the universal time (UTC).
- The `datetime.timedelta(days=X, weeks=Y)` function allows you to create a time difference object.
- A timedelta can be added to or subtracted from a datetime.
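
As an alternative to subtracting 8 hours by hand, a timezone-aware datetime carries the UTC+8 offset itself, so the result does not depend on the time zone of the computer running the code. The sketch below is one possible way to do this: `month_window` and `month_link` are hypothetical helper names, and the query string is copied from `Feb_link`.

```python
import datetime

# Fixed UTC+8 offset (Hong Kong / Beijing time)
UTC8 = datetime.timezone(datetime.timedelta(hours=8))

BASE = ("https://weibo.com/rmrb?is_ori=1&is_text=1&is_pic=1"
        "&is_video=1&is_music=1&is_forward=1")

def month_window(year, month):
    """Return (start, end) UNIX timestamps for one calendar month in UTC+8."""
    start = datetime.datetime(year, month, 1, tzinfo=UTC8)
    if month == 12:
        end = datetime.datetime(year + 1, 1, 1, tzinfo=UTC8)
    else:
        end = datetime.datetime(year, month + 1, 1, tzinfo=UTC8)
    return int(start.timestamp()), int(end.timestamp())

def month_link(year, month):
    """Build the search link for one month."""
    start, end = month_window(year, month)
    return f"{BASE}&start_time={start}&end_time={end}"

feb_start, feb_end = month_window(2024, 2)
print(feb_start, feb_end)  # 1706716800 1709222400
print(month_link(2024, 2))
```

Looping `month_link(2024, m)` over `m` from 1 to 12 then yields one link per month, which covers the "iterate over the remaining months" step.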

In [None]:
import datetime
start_time=datetime.datetime(2024, 1, 31, 16, 00)

In [None]:
start_time

In [None]:
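#Get the UNIX Timestamp of the start time
#calendar.timegm() reads the time tuple as UTC
#(time.mktime() would read it as local time and shift the result on a UTC+8 computer)
import calendar
calendar.timegm(start_time.timetuple())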

In [None]:
duration=datetime.timedelta(days=29)

In [None]:
start_time+duration

In [None]:
#Get the UNIX Timestamp of the end time


In [None]:
driver.get(Feb_link)

<div class="alert alert-block alert-info">
**<b>Reverse Selector</b>** You can use not() function to reverse your selection.</div>

In [None]:
#get the posts visible on the current page
post_list=driver.find_elements(By.XPATH,'//div[@class="vue-recycle-scroller__item-view" and not(contains(@style, "z-index"))]')

In [None]:
len(post_list)

In [None]:
post_list[0].text

In [None]:
first_post=post_list[0]

In [None]:
#Get the post time
created_at=first_post.find_element(By.XPATH,'.//a[@class="head-info_time_6sFQg"]').get_attribute('title')

In [None]:
created_at

In [None]:
#Get the following values for the first post
#pid, text, comments, shares, likes
pid=
text=
comments=
shares=
likes=

In [None]:
#scroll to the bottom to load more posts
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
#check the page loading mode
#If we jump straight to the bottom of the page, the posts in between will be skipped.
post_list=driver.find_elements(By.XPATH,'//div[@class="vue-recycle-scroller__item-view" and not(contains(@style, "z-index"))]')
post_list[0].text

In [None]:
#We need to scroll the page step by step, by increments of 500 pixels.
#First stop (0, 500)
driver.execute_script("window.scrollTo(0, 500);")
post_list=driver.find_elements(By.XPATH,'//div[@class="vue-recycle-scroller__item-view" and not(contains(@style, "z-index"))]')
print(len(post_list))
print(post_list[0].text)
print(post_list[-1].text)

In [None]:
#Scroll to (0, 1000)
driver.execute_script("window.scrollTo(0, 1000);")
post_list=driver.find_elements(By.XPATH,'//div[@class="vue-recycle-scroller__item-view" and not(contains(@style, "z-index"))]')
print(len(post_list))
print(post_list[0].text)
print(post_list[-1].text)

In [None]:
#scroll down the page five times
#collect all posts



In [None]:
#expand the post to show full text

