In this lab, we will be building web scrapers for **news.cn**.

Note that this lab is meant to be educational. We are not liable for how you use this skill. Please respect the copyright of the sites you scrape.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn1.png)

One thing you will notice that differs from IMDB (in the previous lab) is that the data displayed on the website might not be part of the HTML source. Some content is loaded dynamically using AJAX, so this lab focuses on using **Selenium** for the web scraping tasks.
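
To make the point about dynamically loaded content concrete, here is a minimal sketch that uses only Python's standard library. The HTML string is a made-up stand-in for what a plain HTTP fetch might return (it is not the real news.cn markup), and `LinkCollector` is a hypothetical helper name. The static source contains an empty list container that a script would fill in later, so a parser that only sees the source finds no article links.

```python
from html.parser import HTMLParser

# Hypothetical static page source: the article list stays empty until a script fills it.
STATIC_SOURCE = """
<html>
  <body>
    <div id="nav"><a href="/world/index.htm">World</a></div>
    <ul id="list"></ul>
    <script src="/js/load_articles.js"></script>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in the static source."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


collector = LinkCollector()
collector.feed(STATIC_SOURCE)

# Only the navigation link is present; no link ends in "c.html".
article_links = [link for link in collector.links if link.endswith("c.html")]
print(collector.links)   # ['/world/index.htm']
print(article_links)     # []
```

A browser driven by Selenium runs the page's scripts first, so the same kind of query against the live DOM can see links that were added after the initial load.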

In [1]:
#load necessary libraries
from selenium import webdriver
import time
import chromedriver_binary
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import pandas as pd


In [2]:
# disable the build check in case chromedriver does not yet support the installed Chrome version
service = Service(service_args=['--disable-build-check'])

# use a narrow window so that the site renders its mobile view
options = webdriver.ChromeOptions()
options.add_argument("window-size=500,800")
driver = webdriver.Chrome(service=service, options=options)


The chromedriver version (128.0.6582.0) detected in PATH at C:\Users\22507\anaconda3\lib\site-packages\chromedriver_binary\chromedriver.exe might not be compatible with the detected chrome version (126.0.6478.127); currently, chromedriver 126.0.6478.126 is recommended for chrome 126.*, so it is advised to delete the driver in PATH and retry


#### ❓Q1.
Complete this function (`get_categories()`), which returns a set of tuples containing the URL and the text of each category. You should also include the categories in sub-menus.\
Examples of categories are shown below in <span style="color:white;background-color:#063565">white</span> or <span style="color:white;background-color:#c5d6e7">white</span>.

![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn2.png)
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn3.png)
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn4.png)

In [3]:
driver.get("https://english.news.cn/")

categories = driver.find_elements(By.CSS_SELECTOR,"#nav a")
categories

[<selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209fd01a6a", element="f.81E25999973C3CB0E31B880998593527.d.A0790F27660EAB40A61EA84C45A58A7E.e.13")>,
 <selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209fd01a6a", element="f.81E25999973C3CB0E31B880998593527.d.A0790F27660EAB40A61EA84C45A58A7E.e.14")>,
 <selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209fd01a6a", element="f.81E25999973C3CB0E31B880998593527.d.A0790F27660EAB40A61EA84C45A58A7E.e.15")>,
 <selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209fd01a6a", element="f.81E25999973C3CB0E31B880998593527.d.A0790F27660EAB40A61EA84C45A58A7E.e.16")>,
 <selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209fd01a6a", element="f.81E25999973C3CB0E31B880998593527.d.A0790F27660EAB40A61EA84C45A58A7E.e.17")>,
 <selenium.webdriver.remote.webelement.WebElement (session="39caf60a17db6e5dc065dc209

In [4]:

results = set()

for category in categories:
    category_url = category.get_attribute("herf")  # typo: should be "href"; as written this returns None for every link
    category_text = category.get_attribute("innerText")

    if category_url != "https://english.news.cn/#":
        results.add((category_url, category_text))

results

{(None, 'Africa'),
 (None, 'Asia & Pacific'),
 (None, 'B & R Initiative'),
 (None, 'Biz'),
 (None, 'Biz China Weekly'),
 (None, 'China'),
 (None, 'Culture & Lifestyle'),
 (None, 'Deutsch'),
 (None, 'Editions'),
 (None, 'Español'),
 (None, 'Europe'),
 (None, 'Français'),
 (None, 'GLOBALink'),
 (None, 'Home'),
 (None, 'In-depth'),
 (None, 'Latest'),
 (None, 'More'),
 (None, 'North America'),
 (None, 'Photos'),
 (None, 'Português'),
 (None, 'Posters'),
 (None, 'Special Reports'),
 (None, 'Sports'),
 (None, 'Video & Live'),
 (None, 'World'),
 (None, 'Xinhua Headlines'),
 (None, 'Xinhua New Media'),
 (None, 'Русский'),
 (None, 'عربي'),
 (None, '中文'),
 (None, '日本語'),
 (None, '한국어')}

In [5]:
def get_categories():
    driver.get("https://english.news.cn/")
    results = set()
    #TODO
    categories = driver.find_elements(By.CSS_SELECTOR, "#nav a")

    for category in categories:
        category_url = category.get_attribute("href")
        category_text = category.get_attribute("innerText")

        if category_url != "https://english.news.cn/#":
            results.add((category_url, category_text))

    return results
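
The filtering step in `get_categories()` can be tried out without a browser. The sketch below is a hypothetical pure-Python helper (`clean_categories` is not part of the lab) that takes (URL, text) pairs like the ones collected above, drops entries with a missing URL or a placeholder link ending in `#`, and strips whitespace from the text. The sample pairs are illustrative, not scraped data.

```python
def clean_categories(pairs):
    """Return a set of (url, text) tuples with unusable entries removed."""
    results = set()
    for url, text in pairs:
        # get_attribute() returns None when the attribute does not exist
        if url is None:
            continue
        # some menu entries link to "#" instead of a category page
        if url.endswith("#"):
            continue
        results.add((url, (text or "").strip()))
    return results


sample = [
    ("https://english.news.cn/china/index.htm", " China "),
    ("https://english.news.cn/#", "Menu header"),   # illustrative placeholder entry
    (None, "Africa"),
    ("https://english.news.cn/china/index.htm", "China"),
]
cleaned = clean_categories(sample)
print(cleaned)  # {('https://english.news.cn/china/index.htm', 'China')}
```

Checking for `None` explicitly makes a misspelled attribute name show up as an empty result instead of a set full of `(None, text)` tuples.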



In [6]:
categories = get_categories()
categories


{('http://www.news.cn/', '中文'),
 ('https://arabic.news.cn/index.htm', 'عربي'),
 ('https://english.news.cn/africa/index.htm', 'Africa'),
 ('https://english.news.cn/asiapacific/index.htm', 'Asia & Pacific'),
 ('https://english.news.cn/china/index.htm', 'China'),
 ('https://english.news.cn/culture/index.htm', 'Culture & Lifestyle'),
 ('https://english.news.cn/europe/index.htm', 'Europe'),
 ('https://english.news.cn/globalink.htm', 'GLOBALink'),
 ('https://english.news.cn/indepth/index.htm', 'In-depth'),
 ('https://english.news.cn/list/china-business.htm', 'Biz'),
 ('https://english.news.cn/list/latestnews.htm', 'Latest'),
 ('https://english.news.cn/northamerica/index.htm', 'North America'),
 ('https://english.news.cn/photo/index.htm', 'Photos'),
 ('https://english.news.cn/posters/index.htm', 'Posters'),
 ('https://english.news.cn/silkroad/index.html', 'B & R Initiative'),
 ('https://english.news.cn/special/index.htm', 'Special Reports'),
 ('https://english.news.cn/special/qmttt/index.htm'

In [7]:
#this should most likely be 29
len(categories)


29

#### ❓Q2.
A few categories, such as **Culture & Lifestyle**, show a listing of articles by default:
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn6.png)

But many categories, such as **World**, show the news for the day by default.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn5.png)

However, if you shrink the window, the site does switch to its mobile view.
![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn7.png)

Complete this function (`get_article_links(category_url, num_pages)`), which takes `category_url` and `num_pages` as arguments and extracts the list of article URLs in that category. `num_pages` controls the number of pages to load (if there is more than one page of articles). For categories without the **More** button, this function will just return the list of URLs on the current page.

In [8]:
def get_article_links(category_url, num_pages):
    driver.get(category_url)
    article_elems = []
    urls = []
    
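    #TODO
    try:
        # article pages on this site end in "c.html"
        article_elems = driver.find_elements(By.CSS_SELECTOR, 'a[href$="c.html"]')
        # NOTE: the links are collected here, before any "More" click, so
        # articles loaded by the clicks below are not included in urls.
        # The set removes duplicates but does not preserve page order.
        urls = list(set([article_elem.get_attribute("href") for article_elem in article_elems]))

        found_more_button = False

        try:
            load_more_button = driver.find_element(By.CSS_SELECTOR, "#more")
            found_more_button = True
        except Exception:
            # no "More" button in this category
            pass

        if found_more_button:
            for i in range(1, num_pages+1):
                load_more_button.click()

                print("Page: {}...".format(i))

                # wait until no disabled "#more" button is visible
                WebDriverWait(driver, 10).until(
                    EC.invisibility_of_element_located((By.CSS_SELECTOR, "#more[disabled]"))
                )

    except Exception:
        pass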

    return urls
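
One way to make sure that links loaded by the **More** button are included is to click first and collect afterwards. The sketch below illustrates that ordering with a duck-typed `driver` argument, so it can be exercised without a browser. `collect_links_after_paging`, `FakeDriver`, and `FakeElement` are hypothetical names used only for illustration; the string `"css selector"` is the value of Selenium's `By.CSS_SELECTOR`. The `wait_for_load` callback is an assumed hook where a `WebDriverWait` call like the one above would go.

```python
def collect_links_after_paging(driver, num_pages, wait_for_load=lambda: None):
    """Click "More" up to num_pages times, then collect article links once."""
    for _ in range(num_pages):
        buttons = driver.find_elements("css selector", "#more")
        if not buttons:
            break  # this category has no "More" button
        buttons[0].click()
        wait_for_load()  # with real Selenium: wait until the new items are loaded

    urls = []
    for elem in driver.find_elements("css selector", 'a[href$="c.html"]'):
        href = elem.get_attribute("href")
        # keep first-seen order while dropping duplicates
        if href and href not in urls:
            urls.append(href)
    return urls


class FakeElement:
    """Stand-in for a WebElement: either a link or the "More" button."""

    def __init__(self, href=None, on_click=None):
        self._href = href
        self._on_click = on_click

    def get_attribute(self, name):
        return self._href if name == "href" else None

    def click(self):
        if self._on_click:
            self._on_click()


class FakeDriver:
    """Stand-in for a page that shows two more links after each click."""

    def __init__(self):
        self.pages_loaded = 1

    def _load_more(self):
        self.pages_loaded += 1

    def find_elements(self, by, selector):
        if selector == "#more":
            return [FakeElement(on_click=self._load_more)]
        count = 2 * self.pages_loaded
        return [FakeElement(href=f"https://example.com/{i}/c.html") for i in range(count)]


one_page = collect_links_after_paging(FakeDriver(), 1)
two_pages = collect_links_after_paging(FakeDriver(), 2)
print(len(one_page), len(two_pages))  # 4 6
```

With the fake page, a second click yields more links than a single click, which is the behaviour `num_pages` is meant to control.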

In [9]:
article_urls = get_article_links("https://english.news.cn/culture/index.htm", 2)
print(len(article_urls))
article_urls


Page: 1...
Page: 2...
22


['https://english.news.cn/20221017/139a768d90e74c2ba3f6c45740be3531/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240704/a7271cb017144ac592c250e806dbc8fb/c.html',
 'https://english.news.cn/20240707/b3dd1ce4aba547b69078dc7f7c7be875/c.html',
 'https://english.news.cn/20240709/ea8ba02d22f446b6b8788d29f00d4d8d/c.html',
 'https://english.news.cn/20240704/2387e1b3c8be43e696182c1703178605/c.html',
 'https://english.news.cn/20240704/5e585aade034460c8915f5c0ecf9c0f8/c.html',
 'https://english.news.cn/20240704/634dc0c894b246469684491a6f9e1a95/c.html',
 'https://english.news.cn/20240703/fb8a3046e7904e4781bf276f2e6568d1/c.html',
 'https://english.news.cn/20240704/0a3a49825f45402998904057f145b0ab/c.html',
 'https://english.news.cn/20240708/68c944ead26a4de0b066b09e199e49c1/c.html',
 'https://english.news.cn/20240705/bcfebde346444504b1f88a29a5bf2029/c.html',
 'https://english.news.cn/20240709/a1a747d6d2f743c1ac87aa12b70ed295/c.html',

In [10]:
article_urls = get_article_links("https://english.news.cn/culture/index.htm", 1)
print(len(article_urls))
article_urls


Page: 1...
22


['https://english.news.cn/20221017/139a768d90e74c2ba3f6c45740be3531/c.html',
 'https://english.news.cn/20230130/3f2ba4b7cd214a209dda790ddbdcb620/c.html',
 'https://english.news.cn/20240704/a7271cb017144ac592c250e806dbc8fb/c.html',
 'https://english.news.cn/20240707/b3dd1ce4aba547b69078dc7f7c7be875/c.html',
 'https://english.news.cn/20240709/ea8ba02d22f446b6b8788d29f00d4d8d/c.html',
 'https://english.news.cn/20240704/2387e1b3c8be43e696182c1703178605/c.html',
 'https://english.news.cn/20240704/5e585aade034460c8915f5c0ecf9c0f8/c.html',
 'https://english.news.cn/20240704/634dc0c894b246469684491a6f9e1a95/c.html',
 'https://english.news.cn/20240703/fb8a3046e7904e4781bf276f2e6568d1/c.html',
 'https://english.news.cn/20240704/0a3a49825f45402998904057f145b0ab/c.html',
 'https://english.news.cn/20240708/68c944ead26a4de0b066b09e199e49c1/c.html',
 'https://english.news.cn/20240705/bcfebde346444504b1f88a29a5bf2029/c.html',
 'https://english.news.cn/20240709/a1a747d6d2f743c1ac87aa12b70ed295/c.html',

In [11]:
article_urls = get_article_links("https://english.news.cn/china/index.htm", 1)
print(len(article_urls))
article_urls


Page: 1...
46


['https://english.news.cn/20240710/85adf87cb32b49d8a6d4516c96d81dcf/c.html',
 'https://english.news.cn/20240709/431eea5144a247acb1c082ee701793e4/c.html',
 'https://english.news.cn/20240710/40f1590ad16248f7a03ce461e38fa16f/c.html',
 'https://english.news.cn/20240710/aa94b961752b4806b0793eb369c7931a/c.html',
 'https://english.news.cn/20240709/e00176b6d2ed43af84dcb2a285dbb938/c.html',
 'https://english.news.cn/20240710/6f5cfc9916dc459ab302bf019a8962db/c.html',
 'https://english.news.cn/20240710/6834419438d247268cb6da73401de4ad/c.html',
 'https://english.news.cn/20240709/3907593a44464d0cbea858b1ad03b52b/c.html',
 'https://english.news.cn/20240608/641fc31bac944fc38fba1985f0380a4d/c.html',
 'https://english.news.cn/20240530/ae4e7b4f18ab44d380c9943784d33c6f/c.html',
 'https://english.news.cn/20221017/139a768d90e74c2ba3f6c45740be3531/c.html',
 'https://english.news.cn/20240708/33432cba5c274a8bbc71727f87f0c729/c.html',
 'https://english.news.cn/20240710/1c799d0e13cd40f295b627fdc4e5000f/c.html',

In [12]:
article_urls = get_article_links("https://english.news.cn/china/index.htm", 2)
print(len(article_urls))
article_urls


Page: 1...
Page: 2...
46


['https://english.news.cn/20240710/85adf87cb32b49d8a6d4516c96d81dcf/c.html',
 'https://english.news.cn/20240709/431eea5144a247acb1c082ee701793e4/c.html',
 'https://english.news.cn/20240710/40f1590ad16248f7a03ce461e38fa16f/c.html',
 'https://english.news.cn/20240710/aa94b961752b4806b0793eb369c7931a/c.html',
 'https://english.news.cn/20240709/e00176b6d2ed43af84dcb2a285dbb938/c.html',
 'https://english.news.cn/20240710/6f5cfc9916dc459ab302bf019a8962db/c.html',
 'https://english.news.cn/20240710/6834419438d247268cb6da73401de4ad/c.html',
 'https://english.news.cn/20240709/3907593a44464d0cbea858b1ad03b52b/c.html',
 'https://english.news.cn/20240608/641fc31bac944fc38fba1985f0380a4d/c.html',
 'https://english.news.cn/20240530/ae4e7b4f18ab44d380c9943784d33c6f/c.html',
 'https://english.news.cn/20221017/139a768d90e74c2ba3f6c45740be3531/c.html',
 'https://english.news.cn/20240708/33432cba5c274a8bbc71727f87f0c729/c.html',
 'https://english.news.cn/20240710/1c799d0e13cd40f295b627fdc4e5000f/c.html',

#### ❓Q3.

Complete this function (`get_article_details(url)`), which takes an article `url` and extracts the following information:
- title of the article (shown below in <span style="color:#00a8ff">blue</span>)
- source of the article (shown below in <span style="color:#ba00ff">purple</span>)
- editor of the article (shown below in <span style="color:#8f3900">brown</span>)
- date & time of the article (shown below in <span style="color:#ff00f6">pink</span>)
- text of article (shown below in <span style="color:#ff0000">red</span>)

![](https://www.comp.nus.edu.sg/~lekhsian/sws3023/newscn8.png)

If a field is not present, you should leave it as an empty string.

In [13]:
url = "https://english.news.cn/20240707/b3dd1ce4aba547b69078dc7f7c7be875/c.html"
driver.get(url)

title = ""

try:
    title = driver.find_element(By.CSS_SELECTOR, "h1")
    title = title.text
except:
    pass

title

"World's largest indoor ice, snow theme park opens in China's Harbin"

In [14]:
source = ""

try:
    source = driver.find_element(By.CSS_SELECTOR, ".source")
    source = source.text
    source = str.strip(source.split(":")[1])
except:
    pass

source

'Xinhua'

In [15]:
editor = ""

try:
    editor = driver.find_element(By.CSS_SELECTOR, ".editor")
    editor = editor.text
except:
    pass

editor

'Editor: huaxia'

In [16]:
article_date = ""

try:
    article_date = driver.find_element(By.CSS_SELECTOR, ".time")
    article_date = article_date.text
except:
    pass

article_date

'2024-07-07 12:37:30'

In [17]:
article_text = ""

try:
    article_text = driver.find_element(By.CSS_SELECTOR, "#detailContent")
    article_text = article_text.text
except:
    pass

article_text

'Visitors have fun at an indoor ice and snow theme park in Harbin, northeast China\'s Heilongjiang Province, July 6, 2024. (Photo by Yin Zhongwei/Xinhua)\nHARBIN, July 7 (Xinhua) -- Tourists visiting Harbin, China\'s "Ice City" in its northernmost Heilongjiang Province, now have a new attraction to explore, namely an indoor ice and snow theme park that opened on Saturday.\nWith a construction area of 23,800 square meters, the park has clinched the Guinness World Record as the world\'s largest indoor ice and snow theme park.\nAccording to the park manager, the facility features nine themed areas showcasing lifelike ice sculptures illuminated by colorful lighting. The theme park maintains a constant temperature to accommodate visitors throughout the year.\nThe park will complement Harbin Ice-Snow World, a landmark outdoor ice and snow theme park covering 810,000 square meters, thereby transforming Harbin into an all-season resort destination.\nHarbin, with its cold winters, is celebrated

In [18]:
#potentially the url might be a broken link
def get_article_details(url):
    driver.get(url)

    #title
    title = ""

    try:
        title = driver.find_element(By.CSS_SELECTOR, "h1")
        title = title.text
    except:
        pass

    #source
    source = ""

    try:
        source = driver.find_element(By.CSS_SELECTOR, ".source")
        source = source.text
        source = str.strip(source.split(":")[1])
    except:
        pass


    #editor
    editor = ""

    try:
        editor = driver.find_element(By.CSS_SELECTOR, ".editor")
        editor = editor.text
        editor = str.strip(editor.split(":")[1])
    except:
        pass


    #datetime
    article_date = ""

    try:
        article_date = driver.find_element(By.CSS_SELECTOR, ".time")
        article_date = article_date.text
    except:
        pass


    #text
    article_text = ""

    try:
        article_text = driver.find_element(By.CSS_SELECTOR, "#detailContent")
        article_text = article_text.text
    except:
        pass
    
    return (url, title, source, editor, article_date, article_text)
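
The label handling and the timestamp format can also be checked without a browser. The sketch below uses two hypothetical helpers, `strip_label` and `parse_timestamp` (neither is part of the lab). It assumes field text of the form `Source: Xinhua` and timestamps of the form `2024-07-07 12:37:30`, as seen in the outputs above.

```python
from datetime import datetime


def strip_label(text):
    """Return the part after the first colon, or the stripped text if there is no colon."""
    _, sep, value = text.partition(":")
    return value.strip() if sep else text.strip()


def parse_timestamp(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' into a datetime, or return None if it does not match."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None


print(strip_label("Source: Xinhua"))           # Xinhua
print(strip_label("Editor: huaxia"))           # huaxia
print(repr(strip_label("")))                   # ''
print(parse_timestamp("2024-07-07 12:37:30"))  # 2024-07-07 12:37:30
print(parse_timestamp(""))                     # None
```

Unlike `split(":")[1]`, `partition` never raises when the colon is missing and keeps everything after the first colon, so a value that itself contains a colon is not cut short.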


In [19]:
article = get_article_details("https://english.news.cn/20220707/2ea4c1dbe99144fd8ec0f821dc6cafc1/c.html")
article


('https://english.news.cn/20220707/2ea4c1dbe99144fd8ec0f821dc6cafc1/c.html',
 'British PM Johnson resigns',
 'Xinhua',
 'huaxia',
 '2022-07-07 21:06:59',
 'File photo taken on Feb. 10, 2022 shows British Prime Minister Boris Johnson at NATO headquarters in Brussels, Belgium. Boris Johnson resigned as British prime minister and the leader of the Conservative Party in a statement to the country on Thursday.\nHe said he will continue to serve as prime minister until a new Tory leader is chosen. (Xinhua/Zheng Huansong)\nLONDON, July 7 (Xinhua) -- Boris Johnson resigned as British prime minister and the leader of the Conservative Party in a statement to the country on Thursday.\nHe said he will continue to serve as prime minister until a new Tory leader is chosen.\n"The will of the parliamentary party is clear and the process for choosing a new leader should begin," he said, adding that the timetable for choosing a new leader will be announced next week.\nHe said he has appointed a Cabinet 

In [20]:
article = get_article_details("https://english.news.cn/20220706/4055513625634add999adc44f7bb68c4/c.html")
article


('https://english.news.cn/20220706/4055513625634add999adc44f7bb68c4/c.html',
 'Feature: Virtual "bird-woman" ambassador spreads Dunhuang culture',
 'Xinhua',
 'huaxia',
 '2022-07-06 20:57:59',
 'People visit the Mogao Grottoes, a UNESCO World Heritage site, in Dunhuang, northwest China\'s Gansu Province. (Xinhua/Zhang Yujie)\nLANZHOU, July 6 (Xinhua) -- Originated from a half-woman, half-bird creature on millennia-old murals, Jiayao has become the official virtual cartoon figure of Mogao Grottoes, a UNESCO World Heritage site in Dunhuang, northwest China\'s Gansu Province.\nJiayao appears as a girl wearing traditional robes with feathers sprouted from her back. The animation depicts her experience with the relics and interactions with other characters from the murals.\nAs Dunhuang culture\'s digital ambassador, Jiayao was developed by the Dunhuang Academy, an institution responsible for the conservation, management, and research of the Mogao Grottoes and other cultural relics.\nWhile o

#### ❓Q4.

Combine the above functions to automate the extraction of articles from the site by visiting every category and storing the results in a pandas DataFrame. Do you notice any issues with any of the above functions? If so, go back and fix them.

In [21]:
#TODO
categories = get_categories()
# a set has no fixed order, so this slice keeps an arbitrary sample of four categories
categories = list(categories)[5:9]
categories



[('https://english.news.cn/photo/index.htm', 'Photos'),
 ('https://english.news.cn/sports/index.htm', 'Sports'),
 ('https://spanish.news.cn/index.htm', 'Español'),
 ('https://jp.news.cn/index.htm', '日本語')]

In [22]:
articles = []

for category in categories:
    category_url, category_title = category
    print(f"## getting links from {category_title}...")
    links = get_article_links(category_url, 1)
    print(f"### obtained {len(links)} links!")

    # only the first five links per category are fetched here
    for link in links[:5]:
        article = get_article_details(link)
        articles.append(article)

        # brief pause between page loads
        time.sleep(0.5)

    time.sleep(0.5)


## getting links from Photos...
### obtained 22 links!
## getting links from Sports...
Page: 1...
### obtained 22 links!
## getting links from Español...
### obtained 26 links!
## getting links from 日本語...
### obtained 25 links!


In [23]:
articles

[('https://english.news.cn/20240710/40f1590ad16248f7a03ce461e38fa16f/c.html',
  "Guinea-Bissau president lays wreath at Monument to People's Heroes in Beijing",
  'Xinhua',
  'huaxia',
  '2024-07-10 15:36:17',
  "President of Guinea-Bissau Umaro Sissoco Embalo lays a wreath at the Monument to the People's Heroes on the Tian'anmen Square in Beijing, capital of China, July 10, 2024. (Xinhua/Li Xiang)"),
 ('https://english.news.cn/20240710/6834419438d247268cb6da73401de4ad/c.html',
  'Report shows excellent environmental quality of Huangyan Dao area',
  'Xinhua',
  'huaxia',
  '2024-07-10 10:35:45',
  'This photo taken on July 10, 2024 shows the Chinese and English copies of a report titled "The Investigation and Assessment Report on Marine Ecology and Environment Status of Huangyan Dao" in Beijing, capital of China. China\'s Huangyan Dao area in the South China Sea enjoys excellent environmental quality and healthy coral reef ecosystem, said a report released on Wednesday. The report was 

In [24]:
articles = pd.DataFrame(articles, columns=["URL", "Title", "Source", "Editor", "Date", "Text"])
articles

| | URL | Title | Source | Editor | Date | Text |
|---|---|---|---|---|---|---|
| 0 | https://english.news.cn/20240710/40f1590ad1624... | Guinea-Bissau president lays wreath at Monumen... | Xinhua | huaxia | 2024-07-10 15:36:17 | President of Guinea-Bissau Umaro Sissoco Embal... |
| 1 | https://english.news.cn/20240710/6834419438d24... | Report shows excellent environmental quality o... | Xinhua | huaxia | 2024-07-10 10:35:45 | This photo taken on July 10, 2024 shows the Ch... |
| 2 | https://english.news.cn/20240627/ce3a226c3e1d4... | In pics: Zuhair Murad's Fall/Winter 2024-2025 ... | Xinhua | huaxia | 2024-06-27 07:57:59 | A model presents a creation of Zuhair Murad's ... |
| 3 | https://english.news.cn/20240710/06ae73562a934... | In pics: Bohemia JazzFest 2024 in Prague | Xinhua | huaxia | 2024-07-10 09:16:08 | Musicians perform during the Bohemia JazzFest ... |
| 4 | https://english.news.cn/20221017/139a768d90e74... | Contact us | Xinhuanet | huaxia | 2022-10-17 14:50:01 | Address: XINHUANET Co., Ltd, Jinyu Mansion, 12... |
| 5 | https://english.news.cn/20240709/783a351231754... | Euro 2024: Kane shows optimism ahead of semi a... | Xinhua | huaxia | 2024-07-09 19:39:45 | By Oliver Trust\nBERLIN, July 9 (Xinhua) -- Ah... |
| 6 | https://english.news.cn/20221017/139a768d90e74... | Contact us | Xinhuanet | huaxia | 2022-10-17 14:50:01 | Address: XINHUANET Co., Ltd, Jinyu Mansion, 12... |
| 7 | https://english.news.cn/20230130/3f2ba4b7cd214... | About us | Xinhua | huaxia | 2023-01-30 16:02:35 | Xinhuanet is the web portal for news and infor... |
| 8 | https://english.news.cn/20240710/8c9d4a0ba1844... | Messi: Copa America final might be his last ch... | Xinhua | huaxia | 2024-07-10 16:22:45 | NEW JERSEY, July 10 (Xinhua) -- Argentina capt... |
| 9 | https://english.news.cn/20240710/bda2d62297a74... | Paris 2024 Athletics Preview: US eyes lion's s... | Xinhua | huaxia | 2024-07-10 10:11:30 | BEIJING, July 10 (Xinhua) -- Boosted by a seri... |
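
The table above lists the "Contact us" page twice (rows 4 and 6), apparently because the same link appears on more than one category page. One possible fix, sketched below with plain tuples instead of the live scraper, is to keep only the first row for each URL. `dedupe_by_url` is a hypothetical helper, and the sample rows are shortened stand-ins for the tuples returned by `get_article_details()`.

```python
def dedupe_by_url(rows):
    """Keep the first row for each URL, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        url = row[0]  # the URL is the first field of each tuple
        if url in seen:
            continue
        seen.add(url)
        unique.append(row)
    return unique


# Illustrative rows only; real rows have six fields.
rows = [
    ("https://example.com/a/c.html", "Contact us"),
    ("https://example.com/b/c.html", "Some article"),
    ("https://example.com/a/c.html", "Contact us"),
]
unique_rows = dedupe_by_url(rows)
print(unique_rows)
# [('https://example.com/a/c.html', 'Contact us'), ('https://example.com/b/c.html', 'Some article')]
```

Once the data is in a DataFrame, `articles.drop_duplicates(subset="URL")` has a similar effect.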


In [25]:
driver.quit()