## Importing libraries

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

## Functions

### `get_soup_from_page(url)`

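This function takes a URL, retrieves the page with the `requests` library, and parses the response body into a `BeautifulSoup` object. If the server answers with an error status, it prints the status code; if the request itself fails (for example, on a connection error or timeout), it prints the error and returns `None`.

In [3]:
def get_soup_from_page(url):
    try:
        # A timeout keeps a stalled connection from hanging the whole run.
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        # No response object exists here, so there is no status code to print.
        print('Request failed: ', exc)
        return None
    if not response.ok:
        print('Status Code: ', response.status_code)
    # The body is parsed even on an error status so callers can inspect it.
    return BeautifulSoup(response.text, 'html.parser')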
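
To make the returned object concrete, here is a minimal offline sketch of what a `BeautifulSoup` object lets you do once the page is parsed. The HTML string is a hypothetical stand-in for a response body, not real Hacker News markup; only the `.titleline a` selector is taken from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text; not real Hacker News markup.
html = """
<html><body>
  <span class="titleline"><a href="https://example.com/a">First story</a></span>
  <span class="titleline"><a href="https://example.com/b">Second story</a></span>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list of matching tags.
links = soup.select('.titleline a')
titles = [a.get_text() for a in links]   # visible text of each link
hrefs = [a.get('href') for a in links]   # value of each href attribute

print(titles)
print(hrefs)
```

Because the sketch parses a string rather than fetching a URL, it can be run without network access.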

### `safe_get_text(elements, index, default=np.nan)` and `extract_info(soup)`

`safe_get_text(elements, index, default=np.nan)`

This helper returns the first whitespace-separated token of the text of the element at `index` in a list of HTML elements (for example, `885` from `885 points`). If there is no element at that index, or the element has no text, it returns `default` (NaN unless overridden).


`extract_info(soup)`

This function selects the title links and the subtext spans from a parsed listing page, then appends one `[title, href, points, by, hours, comments]` row per story to the global `info` list. Its indexing assumes that every story contributes two `.titleline a` links and four `.subtext span` elements.


In [4]:
info = []  # global accumulator: one [title, href, points, by, hours, comments] row per story

def safe_get_text(elements, index, default=np.nan):
    """Return the first token of the text of elements[index], or a default value if not found."""
    try:
        return elements[index].get_text().split()[0]
    except (IndexError, AttributeError):
        return default

def extract_info(soup):
    post = soup.select('.titleline a')
    desc = soup.select('.subtext span')
    # The indexing below assumes two links and four spans per story;
    # a story that breaks this pattern shifts every later row on the page.
    n = len(post) // 2

    for i in range(n):
        title = post[2 * i].get_text()
        href = post[2 * i].get('href')

        # Look up the story's nested spans and links once and reuse them.
        spans = desc[4 * i].select('span')
        links = desc[4 * i].select('a')

        points = safe_get_text(spans, 0)
        hours = safe_get_text(spans, 1)     # leading number of the age text
        comments = safe_get_text(links, 3)
        by = safe_get_text(links, 0)

        info.append([title, href, points, by, hours, comments])
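
The fallback behaviour of `safe_get_text` can be checked without any HTML. The sketch below copies the helper and feeds it a hypothetical `FakeElement` class that only mimics the `get_text()` method of a parsed tag; `math.nan` replaces `np.nan` so the snippet needs only the standard library.

```python
import math

class FakeElement:
    """Hypothetical stand-in for a parsed tag; only get_text() is provided."""
    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text

def safe_get_text(elements, index, default=math.nan):
    """Same logic as the notebook's helper, with math.nan as the default."""
    try:
        return elements[index].get_text().split()[0]
    except (IndexError, AttributeError):
        return default

elements = [FakeElement('885 points'), FakeElement('18 hours ago'), FakeElement('')]

first = safe_get_text(elements, 0)       # first token of '885 points'
missing = safe_get_text(elements, 9)     # index out of range -> default
empty = safe_get_text(elements, 2)       # empty text has no first token -> default
not_a_tag = safe_get_text([None], 0)     # None has no get_text() -> default

print(first, missing, empty, not_a_tag)
```

Note that the helper returns a string such as `'885'`, not a number, so numeric columns need converting later.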

### `extract_in_n_pages(n)`

This function extracts article details from a specified number of pages (`n`) on the Hacker News website. It constructs the URL for each page and retrieves its HTML content using the `get_soup_from_page` function. If the page is successfully parsed, it calls the `extract_info` function to gather data and prints a completion message. If there’s an error in retrieving the page, it prints an error message.

In [5]:
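def extract_in_n_pages(n):
    base_url = 'https://news.ycombinator.com/'

    for i in range(n):
        # Listing pages are numbered from 1: the front page needs no query
        # string, and '?p=1' would fetch the front page a second time.
        page = '' if i == 0 else f'?p={i + 1}'

        url = base_url + page
        soup = get_soup_from_page(url)

        # soup is None when no page came back; a body of just 'Sorry.'
        # is treated as an error page.
        if soup is not None and str(soup).strip() != 'Sorry.':
            extract_info(soup)
            print(f'Page {i+1} - Completed')
        else:
            print(f'Error in page: {i+1}')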
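
The URL construction can be sketched on its own, without making any requests. The helper name `page_urls` is hypothetical, and the sketch assumes that listing pages are numbered from 1, so the second page is `?p=2`.

```python
def page_urls(n, base_url='https://news.ycombinator.com/'):
    """Hypothetical helper: list the URLs of the first n listing pages."""
    # Assumption: the front page is the bare base URL and page k (k >= 2) is '?p=k'.
    return [base_url if k == 1 else f'{base_url}?p={k}' for k in range(1, n + 1)]

urls = page_urls(3)
for url in urls:
    print(url)
```

Printing the list before scraping is a cheap way to confirm that no page is requested twice.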

## Execution

In [6]:
extract_in_n_pages(10)

Page 1 - Completed
Page 2 - Completed
Page 3 - Completed
Page 4 - Completed
Page 5 - Completed
Page 6 - Completed
Page 7 - Completed
Page 8 - Completed
Page 9 - Completed
Page 10 - Completed


In [7]:
df = pd.DataFrame(info, columns=['title','href','points','by','hours ago','comments'])

In [8]:
df.head()

|   | title | href | points | by | hours ago | comments |
|---|---|---|---|---|---|---|
| 0 | DeskPad – A virtual monitor for screen sharing | https://github.com/Stengo/DeskPad | 885 | geerlingguy | 18 | 124 |
| 1 | Nurdle Patrol | https://www.nurdlepatrol.org/app/ | 49 | amar-laksh | 5 | 3 |
| 2 | The Copenhagen Book: general guideline on impl... | https://thecopenhagenbook.com/ | 534 | sebnun | 16 | 132 |
| 3 | AAA Gaming on Asahi Linux | https://rosenzweig.io/blog/aaa-gaming-on-m1.html | 707 | 6a74 | 21 | 234 |
| 4 | Scuda – Virtual GPU over IP | https://github.com/kevmo314/scuda | 88 | kevmo314 | 8 | 16 |
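
Because `safe_get_text` returns strings (or NaN), the `points`, `hours ago`, and `comments` columns hold text rather than numbers. One possible follow-up, sketched here on hypothetical rows shaped like the ones `extract_info` appends, is to convert them with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the same shape that extract_info appends to info.
rows = [
    ['Story A', 'https://example.com/a', '885', 'alice', '18', '124'],
    ['Story B', 'https://example.com/b', '49', 'bob', '5', '3'],
    ['Story C', 'https://example.com/c', np.nan, np.nan, '2', np.nan],
]
sample = pd.DataFrame(rows, columns=['title', 'href', 'points', 'by', 'hours ago', 'comments'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising.
for col in ['points', 'hours ago', 'comments']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

total_points = sample['points'].sum()   # NaN values are skipped by default
print(sample.dtypes)
print(total_points)
```

After the conversion, sorting and aggregating behave numerically instead of comparing strings.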


In [9]:
df.to_csv('data.csv')