In this lab, we will build web scrapers for **news.cn**.

Note: this lab is meant to be educational. We are not liable for how you use this skill. Please respect the copyright of the sites you are scraping.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn1.png)

One thing we will notice that is different from IMDB (in the previous lab) is that the data displayed on the website might not be part of the HTML source. Some content is loaded dynamically using AJAX, so in this lab we will focus on using **Selenium** for the web scraping tasks.
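
To make the point about dynamically loaded content concrete, here is a minimal sketch using only the standard library. The two HTML strings are made up for illustration (they are not taken from news.cn): one mimics the raw page source before any script has run, the other mimics the DOM after the links have been inserted.

```python
from html.parser import HTMLParser

# Hypothetical raw source: the list is empty until a script fills it in.
STATIC_SOURCE = """
<html><body>
  <ul id="list"></ul>
  <script>
    // an AJAX call would insert the article links here
  </script>
</body></html>
"""

# Hypothetical DOM after the script has run in a real browser.
RENDERED_DOM = """
<ul id="list">
  <li><a href="/a/c.html">Article A</a></li>
  <li><a href="/b/c.html">Article B</a></li>
</ul>
"""

class LinkCounter(HTMLParser):
    """Counts <a> tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self.count += 1

def count_links(html):
    counter = LinkCounter()
    counter.feed(html)
    return counter.count

print(count_links(STATIC_SOURCE))  # 0
print(count_links(RENDERED_DOM))   # 2
```

A parser that only sees the raw source finds no links, which is why a browser-driving tool such as Selenium is used here.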

In [1]:
#load necessary libraries
from selenium import webdriver
import time
import chromedriver_binary
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd


In [2]:
#disable the build check so that chromedriver still runs if the installed version of Chrome is not supported
service = Service(service_args=['--disable-build-check'])

#use a narrow window size to simulate the mobile view
options = webdriver.ChromeOptions()
options.add_argument("window-size=500,800")
driver = webdriver.Chrome(service=service, options=options)


The chromedriver version (128.0.6582.0) detected in PATH at C:\Users\22507\anaconda3\lib\site-packages\chromedriver_binary\chromedriver.exe might not be compatible with the detected chrome version (126.0.6478.127); currently, chromedriver 126.0.6478.126 is recommended for chrome 126.*, so it is advised to delete the driver in PATH and retry


#### ❓Q1.
Complete this function (`get_categories()`) so that it returns a set of tuples, each containing the URL and the text of a category. You should also include the categories in sub-menus.\
Examples of categories are shown below in <span style="color:white;background-color:#063565">white</span> or <span style="color:white;background-color:#c5d6e7">white</span>.

![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn2.png)
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn3.png)
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn4.png)

In [3]:
def get_categories():
    driver.get("https://english.news.cn/")
    results = set()

    #TODO
    #select all <a> elements inside the navigation bar
    categories = driver.find_elements(By.CSS_SELECTOR, "#nav a")

    for category in categories:
        category_url = category.get_attribute("href")
        category_text = category.get_attribute("innerText")

        if category_url != "https://english.news.cn/#":
            results.add((category_url, category_text))

    return results


In [4]:
categories = get_categories()
categories


{('http://www.news.cn/', '中文'),
 ('https://arabic.news.cn/index.htm', 'عربي'),
 ('https://english.news.cn/africa/index.htm', 'Africa'),
 ('https://english.news.cn/asiapacific/index.htm', 'Asia & Pacific'),
 ('https://english.news.cn/china/index.htm', 'China'),
 ('https://english.news.cn/culture/index.htm', 'Culture & Lifestyle'),
 ('https://english.news.cn/europe/index.htm', 'Europe'),
 ('https://english.news.cn/globalink.htm', 'GLOBALink'),
 ('https://english.news.cn/indepth/index.htm', 'In-depth'),
 ('https://english.news.cn/list/china-business.htm', 'Biz'),
 ('https://english.news.cn/list/latestnews.htm', 'Latest'),
 ('https://english.news.cn/northamerica/index.htm', 'North America'),
 ('https://english.news.cn/photo/index.htm', 'Photos'),
 ('https://english.news.cn/posters/index.htm', 'Posters'),
 ('https://english.news.cn/silkroad/index.html', 'B & R Initiative'),
 ('https://english.news.cn/special/index.htm', 'Special Reports'),
 ('https://english.news.cn/special/qmttt/index.htm'

In [5]:
#this should most likely be 29
len(categories)


29
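
The set above also includes links to other language editions (for example `http://www.news.cn/` and `https://arabic.news.cn/index.htm`). If only the English categories are wanted, one option is to filter by host name. The sketch below is an assumption about how that could be done, not part of the lab solution; `keep_host` is a hypothetical helper and `sample` is a small hand-picked subset of the output above.

```python
from urllib.parse import urlparse

def keep_host(categories, host="english.news.cn"):
    """Return only the (url, text) pairs whose URL points at `host`."""
    return {(url, text) for url, text in categories if urlparse(url).netloc == host}

sample = {
    ("http://www.news.cn/", "中文"),
    ("https://arabic.news.cn/index.htm", "عربي"),
    ("https://english.news.cn/china/index.htm", "China"),
    ("https://english.news.cn/europe/index.htm", "Europe"),
}

english_only = keep_host(sample)
print(sorted(english_only))
```

With the real result, the call would be `keep_host(categories)`.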

#### ❓Q2.
A few of the categories show a listing of articles by default, such as the **Culture & Lifestyle** category:
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn6.png)

But many categories show the news for the day by default, such as the **World** category.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn5.png)

However, if you shrink the window, the page does show the mobile view.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn7.png)

Complete this function (`get_article_links(category_url, num_pages)`), which takes `category_url` and `num_pages` as arguments and extracts the list of URLs of the articles in that category. `num_pages` controls the number of pages to load (if there is more than one page of articles). For categories without the **More** button, this function just returns the list of URLs on the current page.

In [6]:
def get_article_links(category_url, num_pages):
    
    driver.get(category_url)
    article_elems = []
    urls = []

    try:
        #collect the article links visible on the first page
        article_elems = driver.find_elements(By.CSS_SELECTOR, 'a[href$="c.html"]')
        urls = list(set([article_elem.get_attribute("href") for article_elem in article_elems]))
        
        #look for the More button; some categories do not have one
        found_more_button = False
        
        try:
            load_more_button = driver.find_element(By.CSS_SELECTOR, "#more")
            found_more_button = True
        except Exception:
            pass

        if found_more_button:
            for i in range(1, num_pages+1):
                load_more_button.click()
                
                print("Page: {}...".format(i))

                #wait until the button is no longer in its disabled state before clicking again
                WebDriverWait(driver, 10).until(
                    EC.invisibility_of_element_located((By.CSS_SELECTOR, "#more[disabled]"))
                )


        #collect the links again now that more articles may have been loaded
        article_elems = driver.find_elements(By.CSS_SELECTOR, 'a[href$="c.html"]')
        urls = list(set([article_elem.get_attribute("href") for article_elem in article_elems]))

    except Exception:
        #keep whatever links were collected before the failure
        pass


    return urls
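
Because the function builds its result with `list(set(...))`, the returned URLs are unique but do not keep the order in which the links appear on the page. If page order matters, one alternative is to deduplicate with `dict.fromkeys`, which preserves insertion order. The sketch below is a suggestion rather than the lab's solution; `unique_article_urls` is a hypothetical helper and the sample URLs are made up.

```python
def unique_article_urls(hrefs, suffix="c.html"):
    """Keep hrefs ending in `suffix`, dropping duplicates but preserving order."""
    matching = (href for href in hrefs if href and href.endswith(suffix))
    return list(dict.fromkeys(matching))

# Made-up hrefs, including a missing value, a non-article link and a duplicate.
hrefs = [
    "https://example.com/20240626/aaa/c.html",
    None,
    "https://example.com/culture/index.htm",
    "https://example.com/20240615/bbb/c.html",
    "https://example.com/20240626/aaa/c.html",
]

print(unique_article_urls(hrefs))
```

Inside `get_article_links`, the list of `href` attributes read from `article_elems` could be passed to this helper in place of the `list(set(...))` call.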


In [7]:
article_urls = get_article_links("https://english.news.cn/culture/index.htm", 2)
print(len(article_urls))
article_urls


Page: 1...
Page: 2...
62


['https://english.news.cn/20240615/ed9e20ed5ff04c97b0f3c4983601dff0/c.html',
 'https://english.news.cn/20240626/7cdfa2aed41242b895ac72beb10d5bb5/c.html',
 'https://english.news.cn/20240626/1e8ccc203b8c4c9980a700a7abc883a7/c.html',
 'https://english.news.cn/20240617/5837b472ddae4f949373a1ce4009e86a/c.html',
 'https://english.news.cn/20240621/cb46c0f4a0a04683a545a6ed667f39f1/c.html',
 'https://english.news.cn/20240617/1bb53627f65b4d2f8cac57d4272d1b56/c.html',
 'https://english.news.cn/20240705/ad7aa2d7177f4d6d9c932b785da4db43/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240621/7b06e95e546b4282b262075beebfac81/c.html',
 'https://english.news.cn/20240621/f3aac624b641411581e9811d3e530313/c.html',
 'https://english.news.cn/20240707/ceeb3e0639cd409c855becb68da8411a/c.html',
 'https://english.news.cn/20240623/3bd828b33c7d4842bc7e839a5a6b54e2/c.html',
 'https://english.news.cn/20240703/fb8a3046e7904e4781bf276f2e6568d1/c.html',

In [8]:
article_urls = get_article_links("https://english.news.cn/culture/index.htm", 1)
print(len(article_urls))
article_urls


Page: 1...
42


['https://english.news.cn/20240626/7cdfa2aed41242b895ac72beb10d5bb5/c.html',
 'https://english.news.cn/20240626/1e8ccc203b8c4c9980a700a7abc883a7/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240705/ad7aa2d7177f4d6d9c932b785da4db43/c.html',
 'https://english.news.cn/20240707/ceeb3e0639cd409c855becb68da8411a/c.html',
 'https://english.news.cn/20240623/3bd828b33c7d4842bc7e839a5a6b54e2/c.html',
 'https://english.news.cn/20240703/fb8a3046e7904e4781bf276f2e6568d1/c.html',
 'https://english.news.cn/20240704/634dc0c894b246469684491a6f9e1a95/c.html',
 'https://english.news.cn/20240625/f8d3a97348d444a5beedd7ff7b0e7b43/c.html',
 'https://english.news.cn/20240625/3b028d05c7b048b1a0f080ec2b6fa556/c.html',
 'https://english.news.cn/20240621/23aec92e3e4343158db6137418a73e7a/c.html',
 'https://english.news.cn/20240622/0a74f67d5c7741af894bdaa2258155ee/c.html',
 'https://english.news.cn/20240701/f0b19d8b40f042c6a108450a13c160f1/c.html',

In [9]:
article_urls = get_article_links("https://english.news.cn/china/index.htm", 1)
print(len(article_urls))
article_urls


Page: 1...
57


['https://english.news.cn/20240706/d34bcf374d564717bfc9bdd2b5e82de6/c.html',
 'https://english.news.cn/20240707/c245be87624c49e1ad176b8f2af54527/c.html',
 'https://english.news.cn/20240705/928e0ee5e3664f129628defa9ebbf6ea/c.html',
 'https://english.news.cn/20240708/d1c17cbe3efe4d4983d5d56410d34c1d/c.html',
 'https://english.news.cn/20240707/bacd307b6d6a465bb0f5638e893c7d77/c.html',
 'https://english.news.cn/20240707/fe864e43088b45e89e62fe56c02ae410/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240706/aa3d0b608f934b8496dc44149029d64b/c.html',
 'https://english.news.cn/20240502/1a6899c9a061451c8af8625c2a5977b8/c.html',
 'https://english.news.cn/20240708/efb44dfc9b904672856dd1b65690cc4d/c.html',
 'https://english.news.cn/20240707/ceeb3e0639cd409c855becb68da8411a/c.html',
 'https://english.news.cn/20240707/6f81a2c90e6c4775a8640d0fb25426a6/c.html',
 'https://english.news.cn/20240706/9ff56e076a9d44739ec874ea6ba75935/c.html',

In [10]:
article_urls = get_article_links("https://english.news.cn/china/index.htm", 2)
print(len(article_urls))
article_urls


Page: 1...
Page: 2...
76


['https://english.news.cn/20240706/d34bcf374d564717bfc9bdd2b5e82de6/c.html',
 'https://english.news.cn/20240707/c245be87624c49e1ad176b8f2af54527/c.html',
 'https://english.news.cn/20240705/928e0ee5e3664f129628defa9ebbf6ea/c.html',
 'https://english.news.cn/20240708/d1c17cbe3efe4d4983d5d56410d34c1d/c.html',
 'https://english.news.cn/20240707/bacd307b6d6a465bb0f5638e893c7d77/c.html',
 'https://english.news.cn/20240707/fe864e43088b45e89e62fe56c02ae410/c.html',
 'https://english.news.cn/20240705/325101865ec347d19cdd5b2482ff6dea/c.html',
 'https://english.news.cn/20240706/aa3d0b608f934b8496dc44149029d64b/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240502/1a6899c9a061451c8af8625c2a5977b8/c.html',
 'https://english.news.cn/20240708/efb44dfc9b904672856dd1b65690cc4d/c.html',
 'https://english.news.cn/20240707/ceeb3e0639cd409c855becb68da8411a/c.html',
 'https://english.news.cn/20240707/6f81a2c90e6c4775a8640d0fb25426a6/c.html',

#### ❓Q3.

Complete this function (`get_article_details(url)`) which takes in an article `url` and extract the following information:
- title of the article (shown below in <span style="color:#00a8ff">blue</span>)
- source of the article (shown below in <span style="color:#ba00ff">purple</span>)
- editor of the article (shown below in <span style="color:#8f3900">brown</span>)
- date & time of the article (shown below in <span style="color:#ff00f6">pink</span>)
- text of article (shown below in <span style="color:#ff0000">red</span>)

![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn8.png)

If a field is not present, you should leave it as an empty string.
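
One way to get the "empty string when missing" behaviour without a `try`/`except` per field is to rely on `find_elements` (plural), which returns an empty list when nothing matches. The sketch below is an illustration under that assumption, not the lab's required approach: `first_text` is a hypothetical helper, and `FakeElement` and `FakePage` are stand-ins (not Selenium classes) so the example runs without a browser. With Selenium, `driver` would be passed in place of `page`.

```python
CSS_SELECTOR = "css selector"  # the string value of selenium's By.CSS_SELECTOR

def first_text(root, selector):
    """Return the stripped text of the first match for `selector`, or "" if none."""
    matches = root.find_elements(CSS_SELECTOR, selector)
    return matches[0].text.strip() if matches else ""

# Stand-ins that mimic the small part of the Selenium interface used above.
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakePage:
    def __init__(self, elements):
        self._elements = elements  # maps selector -> list of FakeElement

    def find_elements(self, by, value):
        return self._elements.get(value, [])

page = FakePage({"h1": [FakeElement(" British PM Johnson resigns ")]})

print(repr(first_text(page, "h1")))       # 'British PM Johnson resigns'
print(repr(first_text(page, ".editor")))  # ''
```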

In [11]:
#the url might be a broken link, so every field defaults to an empty string
def get_article_details(url):
    
    driver.get(url)

    #title
    title = ""

    try:
        title = driver.find_element(By.CSS_SELECTOR, "h1")
        title = title.text
    except:
        pass

    #source   
    source = ""

    try:
        source = driver.find_element(By.CSS_SELECTOR, ".source")
        source = source.text
        source = source.split(":", 1)[1].strip()
    except:
        pass


    #editor   
    editor = ""

    try:
        editor = driver.find_element(By.CSS_SELECTOR, ".editor")
        editor = editor.text
        editor = editor.split(":", 1)[1].strip()
    except:
        pass


    #datetime   
    article_date = ""

    try:
        article_date = driver.find_element(By.CSS_SELECTOR, ".time")
        article_date = article_date.text    
    except:
        pass


    #text
    article_text = ""

    try:
        article_text = driver.find_element(By.CSS_SELECTOR, "#detailContent")
        article_text = article_text.text    
    except:
        pass


    return (url, title, source, editor, article_date, article_text)
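
The source and editor fields are obtained by taking the text after a colon. A small sketch of that step in isolation is shown below, assuming the element text has the form `Label: value`; `value_after_label` is a hypothetical helper. Using `str.partition` keeps any later colons in the value and returns an empty string when there is no colon at all.

```python
def value_after_label(text):
    """Return the part after the first colon, stripped; "" if there is no colon."""
    _, sep, value = text.partition(":")
    return value.strip() if sep else ""

print(repr(value_after_label("Source: Xinhua")))   # 'Xinhua'
print(repr(value_after_label("Editor: huaxia")))   # 'huaxia'
print(repr(value_after_label("Source: A: B")))     # 'A: B'
print(repr(value_after_label("no label here")))    # ''
```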


In [12]:
article = get_article_details("https://english.news.cn/20220707/2ea4c1dbe99144fd8ec0f821dc6cafc1/c.html")
article


('https://english.news.cn/20220707/2ea4c1dbe99144fd8ec0f821dc6cafc1/c.html',
 'British PM Johnson resigns',
 'Xinhua',
 'huaxia',
 '2022-07-07 21:06:59',
 'File photo taken on Feb. 10, 2022 shows British Prime Minister Boris Johnson at NATO headquarters in Brussels, Belgium. Boris Johnson resigned as British prime minister and the leader of the Conservative Party in a statement to the country on Thursday.\nHe said he will continue to serve as prime minister until a new Tory leader is chosen. (Xinhua/Zheng Huansong)\nLONDON, July 7 (Xinhua) -- Boris Johnson resigned as British prime minister and the leader of the Conservative Party in a statement to the country on Thursday.\nHe said he will continue to serve as prime minister until a new Tory leader is chosen.\n"The will of the parliamentary party is clear and the process for choosing a new leader should begin," he said, adding that the timetable for choosing a new leader will be announced next week.\nHe said he has appointed a Cabinet 

In [13]:
article = get_article_details("https://english.news.cn/20220706/4055513625634add999adc44f7bb68c4/c.html")
article


('https://english.news.cn/20220706/4055513625634add999adc44f7bb68c4/c.html',
 'Feature: Virtual "bird-woman" ambassador spreads Dunhuang culture',
 'Xinhua',
 'huaxia',
 '2022-07-06 20:57:59',
 'People visit the Mogao Grottoes, a UNESCO World Heritage site, in Dunhuang, northwest China\'s Gansu Province. (Xinhua/Zhang Yujie)\nLANZHOU, July 6 (Xinhua) -- Originated from a half-woman, half-bird creature on millennia-old murals, Jiayao has become the official virtual cartoon figure of Mogao Grottoes, a UNESCO World Heritage site in Dunhuang, northwest China\'s Gansu Province.\nJiayao appears as a girl wearing traditional robes with feathers sprouted from her back. The animation depicts her experience with the relics and interactions with other characters from the murals.\nAs Dunhuang culture\'s digital ambassador, Jiayao was developed by the Dunhuang Academy, an institution responsible for the conservation, management, and research of the Mogao Grottoes and other cultural relics.\nWhile o

#### ❓Q4.

Combine the above functions to automate the extraction of articles from the site by going to every category, and store the results in a pandas DataFrame. Do you notice any issues with any of the above functions? If so, go back and fix them.

In [14]:
#TODO
categories = get_categories()
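#keep only two categories for a short demo; a set has no fixed order, so this slice can differ between runs
categories = list(categories)[7:9]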
categories


[('https://kr.news.cn/index.htm', '한국어'),
 ('https://spanish.news.cn/index.htm', 'Español')]

In [15]:
articles = []

for category in categories:
    category_url, category_title = category
    print(f"## getting links from {category_title}...")
    links = get_article_links(category_url, 1)
    print(f"### obtained {len(links)} links...!")
    
    for link in links[:5]:
        article = get_article_details(link)
        articles.append(article)
        
        time.sleep(0.5)
    
    time.sleep(0.5)


## getting links from 한국어...
### obtained 28 links...!
## getting links from Español...
### obtained 25 links...!


In [16]:
articles = pd.DataFrame(articles, columns=["URL", "Title", "Source", "Editor", "Date", "Text"])
articles


| | URL | Title | Source | Editor | Date | Text |
|---|---|---|---|---|---|---|
| 0 | https://kr.news.cn/20240708/ab963fe669f74a8f9e... | 中 발개위, 후난성∙장시성 응급 복구 지원에 2억 위안 배정 | 신화망 한국어판 | 陈畅 | | [신화망 베이징 7월8일]중국 국가발전개혁위원회(발개위)는 7일 후난성(웨양시 화룽... |
| 1 | https://kr.news.cn/20240708/d2daccff75bd4c3f92... | 中 산둥, 무술 익히며 방학 보내요~ | 신화망 한국어판 | 朱雪松 | | [신화망 베이징 7월8일] 여름방학 기간 어린이들이 다양한 활동에 참여하며 즐거운 ... |
| 2 | https://kr.news.cn/20240708/d764351bdb9f40aeb2... | 생일 축하합니다~ 루이바오·후이바오, 팬들과 함께한 첫 돌잔치 | 신화망 한국어판 | 陈畅 | | [신화망 서울 7월8일] 한국 에버랜드가 7일 한국에서 태어난 쌍둥이 아기 판다 ... |
| 3 | https://kr.news.cn/20240708/0e5d4a43c5d94e9785... | 獨 자동차 업계 "EU의 中 전기차 관세 추가 부과, 유럽 자동차 업계에 도움 안 돼" | 신화망 한국어판 | 陈畅 | | 지난달 25일 일본 도쿄에서 촬영한 비야디(BYD) 전기차 'SEAL'. (사진/... |
| 4 | https://kr.news.cn/20240708/33bdb93c34714aafa3... | 中 리창 총리, 키어 스타머 英 신임 총리에 축전 | 신화망 한국어판 | 朴锦花 | | [신화망 베이징 7월8일] 리창(李强) 중국 국무원 총리가 7일 키어 스타머 영국 ... |
| 5 | https://spanish.news.cn/20240707/eca7b3b77ead4... | China conmemora 87º aniversario de guerra de r... | | | | (Xinhua/Li Xiang)\nBEIJING, 7 jul (Xinhua) --... |
| 6 | https://spanish.news.cn/20240626/bbc46cc1527b4... | Lotos que crecen a partir de semillas antiguas... | | | | NANNING, 25 junio, 2024 (Xinhua) -- Imagen de... |
| 7 | https://spanish.news.cn/20240708/45ec17fbbe3a4... | EURO 2024: Inglaterra avanza a semifinales | | | | DUSSELDORF, 7 julio, 2024 (Xinhua) -- Imagen d... |
| 8 | https://spanish.news.cn/20240708/9dda587468c74... | ENFOQUE DE CHINA: Los pagos vía móvil brindan ... | | | | BEIJING, 8 jul (Xinhua) -- Para Jamie y Lilian... |
| 9 | https://spanish.news.cn/20240708/272e9a794e8d4... | Xi dice que China está dispuesta a promover as... | | | | BEIJING, 8 jul (Xinhua) -- China está dispuest... |
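
In the output above, the Date column is empty in every row, and Source and Editor are empty for the Spanish articles, which may mean the selectors used in `get_article_details` do not match on those editions. A quick way to spot this kind of gap is to count empty strings per field before building the DataFrame. The sketch below is one possible check, not part of the lab solution; `count_empty_fields` is a hypothetical helper and the two rows are made-up placeholders.

```python
COLUMNS = ["URL", "Title", "Source", "Editor", "Date", "Text"]

def count_empty_fields(rows, columns=COLUMNS):
    """Count, per column, how many rows hold an empty string."""
    counts = {name: 0 for name in columns}
    for row in rows:
        for name, value in zip(columns, row):
            if value == "":
                counts[name] += 1
    return counts

# Made-up rows in the same shape as the tuples returned by get_article_details.
rows = [
    ("https://example.com/a/c.html", "Title A", "Xinhua", "huaxia", "2022-07-07 21:06:59", "Body A"),
    ("https://example.com/b/c.html", "Title B", "", "", "", "Body B"),
]

print(count_empty_fields(rows))
```

Running the same check on the scraped list (before it is converted to a DataFrame) shows which fields need a second look.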


In [17]:
driver.quit()
