# Scraping Hacker News

We’re going to scrape the Hacker News front page, https://news.ycombinator.com/news,
using requests and Beautiful Soup. Take some time to explore the page if you haven’t
come across it already. Hacker News is a popular aggregator of news articles that “hackers”
(computer scientists, entrepreneurs, data scientists) find interesting.
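
Before fetching the live page, it can help to see the markup pattern a scraper for this site has to deal with. The following is a minimal sketch that parses a hand-written HTML fragment instead of the real page. The fragment, the `SAMPLE_HTML` name, and the example URL and user are assumptions made for illustration; only the class names (`athing`, `storylink`, `score`, `hnuser`) are taken from the scraping code in this section, and the live markup may differ.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified fragment: one story row followed by its metadata row.
# This is NOT copied from the real page; it only mirrors the class names
# that the scraping code in this section looks for.
SAMPLE_HTML = """
<table>
  <tr class="athing">
    <td><a class="storylink" href="https://example.com/story">An example story</a></td>
  </tr>
  <tr>
    <td>
      <span class="score">42 points</span>
      by <a class="hnuser" href="user?id=alice">alice</a>
      | <a href="item?id=1">7&nbsp;comments</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# The title and link live in the row marked with class "athing"...
story_row = soup.find('tr', class_='athing')
title_link = story_row.find('a', class_='storylink')

# ...while score, user and comment count live in the row right after it.
meta_row = story_row.find_next_sibling('tr')
comments_link = meta_row.find('a', string=re.compile(r'\d+\s+comments?'))

sample = {
    'link': title_link.get('href'),
    'title': title_link.get_text(strip=True),
    'score': meta_row.find('span', class_='score').get_text(strip=True),
    'user': meta_row.find('a', class_='hnuser').get_text(strip=True),
    # The parser turns &nbsp; into '\xa0', so normalize it to a plain space.
    'comments': comments_link.get_text(strip=True).replace('\xa0', ' '),
}
print(sample)
```

The point of the sketch is the two-row layout: the fields for one story are split across adjacent table rows, which is why `find_next_sibling` is needed to pair them up.
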
We’ll store the scraped information in a simple Python list of dictionary objects for
this example. The code to scrape this page looks as follows:

In [1]:
import re

from bs4 import BeautifulSoup
import requests

articles = []

url = 'https://news.ycombinator.com/news'

r = requests.get(url)
html_soup = BeautifulSoup(r.text, 'html.parser')

for item in html_soup.find_all('tr', class_='athing'):
    item_a = item.find('a', class_='storylink')
    item_link = item_a.get('href') if item_a else None
    item_title = item_a.get_text(strip=True) if item_a else None
    
    next_row = item.find_next_sibling('tr')
    item_score = next_row.find('span', class_='score')
    item_score = item_score.get_text(strip=True) if item_score else '0 points'
    item_user = next_row.find('a', class_='hnuser')
    item_user = item_user.get_text(strip=True) if item_user else 'unknown user'
    # The comments link has no class, so match it by its text
    # (e.g. '95 comments' or '1 comment'); a raw string avoids
    # invalid-escape warnings for \d and \s
    item_comments = next_row.find('a', string=re.compile(r'\d+\s+comments?'))
    item_comments = item_comments.get_text(strip=True).replace('\xa0', ' ') if item_comments else '0 comments'
    
    articles.append({
        'link': item_link,
        'title': item_title,
        'score': item_score,
        'user': item_user,
        'comments': item_comments
    })
    
for article in articles:
    print(article)
    print()

{'link': 'https://www.sandimetz.com/blog/2016/1/20/the-wrong-abstraction', 'title': 'The Wrong Abstraction (2016)', 'score': '357 points', 'user': 'LopRabbit', 'comments': '95 comments'}

{'link': 'https://www.cadc.uscourts.gov/internet/opinions.nsf/533D47AF883C8194852582CD0052B8D4/$file/17-7035.pdf', 'title': 'Public.resource.org wins appeal on right to publish the law [pdf]', 'score': '89 points', 'user': 'DannyBee', 'comments': '15 comments'}

{'link': 'http://www.hurstwic.org/history/articles/manufacturing/text/viking_woodworking_riving.htm', 'title': 'Riving, a Viking-age woodworking technique', 'score': '157 points', 'user': 'sea6ear', 'comments': '59 comments'}

{'link': 'https://quariety.com/2018/07/20/peertube-the-decentralized-youtube-succeeds-in-crowdfunding/', 'title': 'PeerTube, the “Decentralized YouTube”, succeeds in crowdfunding', 'score': '404 points', 'user': 'Roccan', 'comments': '238 comments'}

{'link': 'https://www.scottaaronson.com/papers/philos.pdf', 'title': 'W