Part 1: Scraping One Page

Exercise 1 & 2:

In [1]:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.bbc.com/news')
contents = response.text
soup = BeautifulSoup(contents, 'html.parser')

# print(contents)

When viewing the whole HTML file, I searched for the title "bloodshed" to verify the download and successfully located it.

Exercise 3 & 4:

In [3]:
# Extract all headers from the page with a regular expression

import re

headers = re.compile("(<h3.+promo.+>)+(.+)(</h3>)")
headers1 = soup.find_all("h3")

for item in headers1:
    # match() returns None for <h3> tags that do not fit the pattern
    match = headers.match(str(item))
    if match:
        print(match.groups()[1])

Ukraine orders evacuation of city it recaptured
Ukraine orders evacuation of city it recaptured
US lawyer Murdaugh guilty of killing wife and son
How Alex Murdaugh hid his dark side
Cambodian opposition leader sentenced for treason
Putin accuses Ukraine of border 'terrorist act'
Tennessee bans drag shows for children
Ros Atkins on... The creeping TikTok bans
Half of world could be overweight by 2035
At least 57 confirmed dead in Greece train crash
Scotland first to ban anaesthetic over environment
Text leak puts spotlight on police and quarantine in UK 
Scotland first to ban anaesthetic over environment
Text leak puts spotlight on police and quarantine in UK 
Jazz saxophone legend Wayne Shorter dies at 89
Teenager bitten by crocodile in Australian floods
India actor banned from stock market over malpractice
Hong Kong skyscraper fire seen on city's skyline
BBC World News TV
BBC World Service Radio
Man survives 31 days in jungle by eating worms
Egypt pyramid hidden corridor seen for firs

First, the output is not limited to relevant headers. For example, "BBC World News TV" is not a headline we are interested in, and the regular expression also captures media links (videos). In general, regular expressions should be avoided for parsing HTML: page markup changes whenever a site is updated, so the patterns would require frequent maintenance.
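
To make the maintenance concern concrete, here is a minimal sketch that compares a pattern in the style of the one above with a tag-aware extractor. The two HTML snippets are made up for illustration (they are not copied from the BBC page), and `HeadingText` is a hypothetical helper built on the standard library's `html.parser` so the example runs without extra installs. With BeautifulSoup, the comparable approach would be to call `get_text()` on each `h3` element.

```python
import re
from html.parser import HTMLParser

# Pattern in the style of the notebook's regex: it depends on the word
# "promo" appearing inside the opening <h3> tag.
pattern = re.compile(r"(<h3.+promo.+>)+(.+)(</h3>)")

# Two made-up snippets with the same headline but different markup.
old_markup = '<h3 class="gs-c-promo-heading__title">Example headline</h3>'
new_markup = '<h3 class="card-title"><span>Example headline</span></h3>'


class HeadingText(HTMLParser):
    """Collect the text inside every <h3>, whatever its attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while the parser is inside an <h3>
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.depth += 1
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.headings[-1] += data


def headings_via_parser(html):
    parser = HeadingText()
    parser.feed(html)
    parser.close()
    return [h.strip() for h in parser.headings]


def headings_via_regex(html):
    match = pattern.match(html)
    return [match.groups()[1]] if match else []


for label, html in [("old", old_markup), ("new", new_markup)]:
    print(label, headings_via_regex(html), headings_via_parser(html))
```

The regex only finds the headline in the first snippet, because the second one no longer has "promo" in the tag; the parser-based version finds it in both.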

Exercise 5 & 6:

In [5]:
# Find top stories with soup.find

top = soup.find(id="news-top-stories-container")

h3 = top.find_all("h3")

for item in h3:
    # Reuse the header pattern from above; skip tags that do not match
    match = headers.match(str(item))
    if match:
        print(match.groups()[1])

Victory is inevitable if allies keep promises - Zelensky
WATCH: One year of war in Ukraine in 87 seconds
Fighting to stay Ukrainian in a frontline mining town
BBC correspondents reflect on a year of warzone reporting
Has Putin's war failed?
Two friends changed by a year of war
How Putin's fate is tied to his war in Ukraine
Why China launched a charm offensive over Ukraine
US marks war anniversary with new Russia sanctions
Brothers leave Guantanamo Bay after almost 20 years
US ex-lawyer Alex Murdaugh admits he stole millions for drugs
Swimmers 'ruined' by fat-shaming and bullying
Moldova warns of Russian 'psy-ops' as tensions rise
Rebellious Andean bear sneaks out of US zoo - twice
Kenyan man freed over Britons' murder-kidnap
Nigerian politician arrested with $500,000 in cash
Netflix cuts prices in more than 30 countries
Nigerian politician arrested with $500,000 in cash
Netflix cuts prices in more than 30 countries
US billionaire financier Thomas Lee found dead at 78
Seoul offers radia

Exercise 7:

In [6]:
# Finding summaries

summaries = re.compile("(<p.+promo.+>)+(.+)(</p>)")
summaries1 = soup.find_all("p")

for item in summaries1:
    # Only <p> tags whose opening tag contains "promo" match the pattern
    match = summaries.match(str(item))
    if match:
        print(match.groups()[1])

As the first German-made tanks arrive in Ukraine, President Zelensky urges allies to stick to their promises and deadlines.
Steve Rosenberg looks at why Vladimir Putin set sail in a storm of his own making a year ago.
The West may come away unimpressed - but convincing them was never likely the main goal for Beijing.
President Biden also announced over $2bn in military aid to both Ukraine and neighbouring Moldova.
Abdul and Mohammed Ahmed Rabbani were arrested in Pakistan in 2002. They were never charged by the US.
The former lawyer accused of murdering his wife and son had a powerful 60-pill-a-day opiate habit.
Former athletes tell of mistreatment at clubs across England, with allegations stretching back more than a decade.
Moldova's pro-EU leaders reject Russian claims that Ukraine plans to attack its breakaway territory.
The South American species escaped his habitat at the St Louis Zoo for the second time this month.
The BBC revealed last year that a senior Met officer who assisted

Exercise 7 & 8:

In [5]:
# Combine headers, summaries and sections into a list of dictionaries

combined = re.compile("(<div.+body.+>)(.+)(</h3></a>.+)(promo.summary.>)+(.+)(</p>.+)(<span.+true.>)([^<]+)(</span>.+)")

stories = []

for item in soup.find_all("div", {"class": "gs-c-promo"}):
    # Run the pattern once per promo block and reuse the result
    match = combined.match(str(item))
    if match is None:
        continue  # this block does not contain all three parts
    groups = match.groups()
    headline = groups[1]
    summary = groups[4]
    section = groups[7].replace("&amp;", "&")
    story = {'headline': headline, 'summary': summary, 'section': section}
    # The same story can appear in several promo blocks; keep the first one
    if not any(s['headline'] == headline for s in stories):
        stories.append(story)

for story in stories:
    print('Headline:', story['headline'])
    print('Summary:', story['summary'])
    print('Section:', story['section'])
    print('------------------')

# create a json file for storing the dictionaries

import json

with open('bbc.json', 'w') as f:
    json.dump(stories, f, indent=4)

Headline: Ukraine orders evacuation of city it recaptured
Summary: Families are told to leave Kupiansk, which Ukraine re-captured from Russia in September.
Section: Europe
------------------
Headline: US lawyer Murdaugh guilty of killing wife and son
Summary: "The evidence of guilt is overwhelming," the judge says after the jury's verdict is read in court.
Section: US & Canada
------------------
Headline: How Alex Murdaugh hid his dark side
Summary: Behind the courtly air of a country lawyer born to power and privilege, lay a cold-blooded killer.
Section: US & Canada
------------------
Headline: Cambodian opposition leader sentenced for treason
Summary: Kem Sokha is sentenced to 27 years house arrest which prevents him from contesting July's election.
Section: Asia
------------------
Headline: Putin accuses Ukraine of border 'terrorist act'
Summary: Kyiv denies Moscow's claim that Ukrainian saboteurs fired at civilians in a Russian village.
Section: Europe
------------------
Headline: 

Part 2: Scraping a Reliable News Dataset

In [4]:
from bs4 import BeautifulSoup
import csv
import requests
import re


group_nr = 8
letters = "ABCDEFGHIJKLMNOPRSTUVWZABCDEFGHIJKLMNOPRSTUVWZ"[group_nr%23:group_nr%23+10]
with open('article_titles_links.csv', 'w', encoding='utf-8', newline='') as f:
    output_strings = []
    printed_categories = set()
    for letter in letters:
        url = f"https://en.wikinews.org/wiki/Category:Politics_and_conflicts?from={letter}&to={letters[-1]}"
        while url:
            category_request = requests.get(url)
            contents = category_request.text
            category_soup = BeautifulSoup(contents, 'html.parser')
            category_section = category_soup.find(id="mw-pages")
            subcategories = category_section.find_all('a', title=True)
            for subcategory in subcategories:
                subcategory_name = subcategory['title']
                subcategory_name = subcategory_name.replace('"', '')
                if subcategory_name in printed_categories or subcategory_name.startswith('Q') or subcategory_name.startswith("'"):
                    continue
    
                printed_categories.add(subcategory_name)
                if subcategory_name[0] > letters[-1]:
                    
                    break
                subcategory_url = f"https://en.wikinews.org{subcategory['href']}"

                if 'pageto' not in subcategory_url and 'pagefrom' not in subcategory_url:
                    output_str = f"{subcategory_name},{subcategory_url}\n"
                    output_strings.append(output_str)         

            next_link = category_soup.find("a", string="next page")
            if next_link:
                url = f"https://en.wikinews.org{next_link['href']}"
            else:
                url = None

    output_strings.sort()
    for output_str in output_strings:
        f.write(output_str)

# Number of articles
print(len(output_strings))

2869


In [6]:
output_strings = [output_str.strip() for output_str in output_strings]

output_lists = [output_str.split(',') for output_str in output_strings]

output_tuples = [tuple(output_str.split(',')) for output_str in output_strings]

def res():
    """Return the URL part of every 'title,url' line.

    Titles can contain commas, so the URL is located with a regex
    instead of splitting the line on ','.
    """
    lst = []
    for ele in output_strings:
        match = re.search(r'https:.+', ele)
        lst.append(match.group(0))
    return lst

urls = res()

In [7]:
def get_article_content(inp):
    """Download every URL in inp and return [title, url, date, content] rows."""
    lst = []
    for url in inp:
        response = requests.get(url)
        contents = response.text
        soup = BeautifulSoup(contents, 'html.parser')
        lst.append([get_article_title(soup), url, get_article_date(soup), get_article_info(soup)])
    return lst

def get_article_title(inp):
    n = inp.find('span', class_='mw-page-title-main')
    n = str(n)
    n = re.sub(r'<span(?:.*?)>', '', n)
    n = re.sub(r'</span>', '', n)
    return n

def get_article_date(inp):
    n = inp.find('strong', class_='published')
    n = str(n)
    n = re.sub(r'<strong(?:.*?)</span>', '', n)
    n = re.sub(r'</strong>', '', n)
    return n

def get_article_info(inp):
    # Collect the text of every <p> element on the page
    n = inp.find_all('p')
    ntext = ''
    for elm in n:
        ntext += elm.text
    n = ntext
    n = re.sub(r'\n', '', n)
    n = re.sub(r'\"', '', n)
    n = re.sub(r'\[', '', n)
    n = re.sub(r'\]', '', n)
    n = re.sub(r'Share.+', '', n)
    n = re.sub(r'\xa0', '', n)
    n = re.sub(r'\w+, \w+ \d+, \d+', '', n) 
    return n

with open('articles.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'URL', 'Date', 'Content'])
    writer.writerows(get_article_content(res()))

Summary:

One challenge in this assignment was retrieving the article links in a form that could be reused later. Whenever an article URL contained ",_", the hyperlink was inactive and could not be used for scraping. I worked around this problem and created a CSV file (article_titles_links.csv) containing the article titles and their respective URLs. Other minor problems along the way required hard-coding.
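
One possible way to avoid the comma problem without hard-coding is to let the `csv` module handle the quoting. The sketch below is not the code used above; the title and URL are invented placeholders, chosen only because both contain a comma.

```python
import csv
import io

# Invented example row: both fields contain a comma.
rows = [("Example, a title with a comma",
         "https://en.wikinews.org/wiki/Example,_a_title_with_a_comma")]

# Write to an in-memory buffer; a real file opened with newline='' works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Title", "URL"])
writer.writerows(rows)

# Reading back with csv.reader undoes the quoting, so each row has exactly two fields.
buffer.seek(0)
read_back = list(csv.reader(buffer))
title, url = read_back[1]
print(title)
print(url)
```

Because the writer quotes any field that contains a comma, the reader returns the title and the URL unchanged, and no regular expression is needed to separate them.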

The number of articles is calculated to be 2869. To get the desired output, I relied on creating CSV files and, notably, on following the link to each subsequent page (next_link). The first part is perhaps too extensive considering its output, while the last part of the code defines separate functions for the article title, date, and content, plus a wrapper function that calls them and appends the results to a list to give the final output. For this part, I depended heavily on regular expressions.
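
The `next_link` idea can be shown in isolation with a small sketch. It replaces the live requests with a made-up dictionary of pages (`FAKE_PAGES`, with hypothetical URLs and titles) so the loop can be run offline; only the control flow mirrors the code above.

```python
# Made-up stand-in for the site: each "page" lists some titles and
# optionally names the page that follows it.
FAKE_PAGES = {
    "/page1": {"titles": ["Article A", "Article B"], "next": "/page2"},
    "/page2": {"titles": ["Article C"], "next": "/page3"},
    "/page3": {"titles": ["Article D"], "next": None},
}


def collect_titles(start_url):
    """Follow 'next' links until a page has none, gathering titles on the way."""
    titles = []
    url = start_url
    while url:
        page = FAKE_PAGES[url]      # in the notebook: requests.get + BeautifulSoup
        titles.extend(page["titles"])
        url = page["next"]          # None ends the loop, like a missing "next page" link
    return titles


all_titles = collect_titles("/page1")
print(all_titles)
```

Setting `url` to `None` on the last page is what stops the `while` loop, which is the same role the `else: url = None` branch plays in the scraper.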

Scraping from Wikinews does not give me any decisive indication of whether we are dealing with trusted news articles. The fact that Wikinews can be edited by anyone could be deemed 'sketchy'; however, the opposite can also be argued, since the whole public can potentially edit the content, and not just a specific entity with vested interests.