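We'll use the requests library and BeautifulSoup, both standard tools for scraping, plus pandas for cleaning up the results. NYPost appears to block requests sent with the default Python User-Agent, so we use the fake_useragent library to send a browser-like User-Agent header instead.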

In [73]:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import pandas as pd

Initialise a random User-Agent header for the requests, to get around NYPost's blocking of Python clients.

In [74]:
ua = UserAgent()
headers = {'User-Agent': ua.random}
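
As a hedged variant of the cell above, the header setup can be wrapped in a small helper that falls back to a fixed browser string when fake_useragent is not installed or cannot load its data. The name `make_headers` and the `FALLBACK_UA` string are assumptions made for this sketch, not part of the original notebook.

```python
# Hypothetical fallback string; any current browser User-Agent would do
FALLBACK_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

def make_headers():
    """Return request headers with a random User-Agent, or a fixed fallback."""
    try:
        from fake_useragent import UserAgent
        return {'User-Agent': UserAgent().random}
    except Exception:
        # fake_useragent is missing or could not load its browser data
        return {'User-Agent': FALLBACK_UA}

headers = make_headers()
```

Calling `make_headers()` once per request, rather than once per session, would rotate the User-Agent between pages.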

Specify a keyword (search term) and the desired number of search result pages to scrape. Use BeautifulSoup to parse the HTML, select the divs with the 'story__text' class, and extract the headline, meta information (author and date), and excerpt from each. Store each story as a dict in the results list.

In [75]:
import re

def nypost_scraper(keyword, max_pages):
    """Scrape NYPost search results for `keyword`, up to `max_pages` pages."""
    base_url = "https://nypost.com/search/"
    results = []
    for page in range(1, max_pages + 1):
        url = f"{base_url}{keyword}/page/{page}/"
        # `headers` is the User-Agent dict defined in the previous cell
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code != 200:
            print(f"Failed to fetch page {page} (status {response.status_code})")
            break
        soup = BeautifulSoup(response.content, 'html.parser')
        story_texts = soup.find_all('div', class_='story__text')
        for story in story_texts:
            # Extract title and link. Set defaults first so a story without
            # a headline cannot reuse the previous story's values.
            title, link = "No title", "No link"
            headline = story.find('h3', class_='story__headline')
            title_tag = headline.find('a') if headline else None
            if title_tag:
                title = title_tag.text.strip()
                link = title_tag.get('href', "No link")

            # Extract author and date
            meta = story.find('span', class_='meta meta--byline')
            if meta:
                # Split on "|": the first part holds author and date together,
                # the second holds the time. They are separated later in pandas.
                meta_parts = meta.text.strip().split('|')
                # Remove only a leading "By", not "By" inside a name
                author = re.sub(r'^By\s+', '', meta_parts[0].strip()) or "No author"
                date = meta_parts[1].strip() if len(meta_parts) > 1 else "No date"
            else:
                author = "No author"
                date = "No date"

            # Extract excerpt
            excerpt_tag = story.find('p', class_='story__excerpt')
            excerpt = excerpt_tag.text.strip() if excerpt_tag else "No excerpt"

            # Append information to the list
            results.append({
                'title': title,
                'link': link,
                'author': author,
                'date': date,
                'excerpt': excerpt
            })
    return results
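
The byline handling inside the scraper can also be sketched as a standalone function that returns author, date, and time in one step. This is a minimal sketch: it assumes the byline text has the shape "By <author> <tabs> <date> | <time>", which is inferred from the df.head() output further down, and `parse_byline` is a hypothetical helper name. It needs no network access, so it can be checked against sample strings.

```python
import re

def parse_byline(meta_text):
    """Split a byline string into (author, date, time)."""
    # The time, if present, follows the "|" separator
    left, _, time = meta_text.strip().partition('|')
    # Drop a leading "By" only, so names containing "By" stay intact
    left = re.sub(r'^By\s+', '', left.strip())
    # Author and date are separated by a whitespace run containing a tab
    parts = re.split(r'\s*\t\s*', left, maxsplit=1)
    author = parts[0].strip() or "No author"
    date = parts[1].strip() if len(parts) > 1 else "No date"
    return author, date, time.strip() or "No time"

# Sample string modelled on the notebook output, not fetched from the site
print(parse_byline("By Michael Goodwin \t\t\t\tFebruary 18, 2025 | 10:36pm"))
```

Parsing all three fields at scrape time would make the later rename and split in pandas unnecessary, at the cost of tying the scraper more closely to the current byline layout.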

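Call the function, then use pandas to fix the mislabelled columns: the scraper's 'author' field actually holds the author and the date together, and its 'date' field holds the time.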

In [76]:
results = nypost_scraper("congestion+pricing", max_pages=20)
df = pd.DataFrame(data=results)

In [77]:
df = df.rename(columns={'date': 'time', 'author': 'name_date'})
df.head()

Unnamed: 0,title,link,name_date,time,excerpt
0,MTA sues to keep congestion pricing in place a...,https://nypost.com/2025/02/19/us-news/mta-sues...,"Carl Campanile, Ben Kochman and Chris Nesi \t\...",1:54pm,MTA Chair and CEO Janno Lieber praised the tol...
1,NYC congestion pricing axed as Trump's DOT pul...,https://nypost.com/2025/02/19/us-news/nyc-cong...,"Jon Levine and Chris Nesi \t\t\t\tFebruary 19,...",12:02pm,"Congestion pricing, we hardly knew ye."
2,Gov. Hochul's cowardice is on full display as ...,https://nypost.com/2025/02/18/opinion/michael-...,"Michael Goodwin \t\t\t\tFebruary 18, 2025",10:36pm,"Gov. Hochul falls short on governing skills, b..."
3,"Oregon effort to shift border, join conservati...",https://nypost.com/2025/02/17/us-news/eastern-...,"Charles Creitz, Fox News \t\t\t\tFebruary 17, ...",12:15pm,"""This movement has always been about the peopl..."
4,Luxury skincare sale! One of our favorite bran...,https://nypost.com/2025/02/17/shopping/shop-th...,"Victoria Giardina \t\t\t\tFebruary 17, 2025",7:00am,"Luxury, within reach."


In [78]:
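# Split once on the whitespace run containing the tabs, so author names keep no trailing spaces
df[['author', 'date']] = df['name_date'].str.split(r'\s*\t\s*', n=1, expand=True)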
df.drop(columns=['name_date'], inplace=True)

Export as CSV.

In [79]:
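# index=False keeps the DataFrame index out of the file (it would reload as an "Unnamed: 0" column)
df.to_csv('nyp_articles.csv', index=False)
df.head()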

### TODO: 
- Figure out a way to remove ads
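
One possible starting point for this TODO, offered as a sketch rather than a tested solution: the links in the output above follow a /YYYY/MM/DD/<section>/<slug> pattern, and the skincare sale post sits under the "shopping" section. Assuming ads can be recognised by section, the results list can be filtered before it is turned into a DataFrame. The names `AD_SECTIONS`, `link_section`, and `drop_ads` are hypothetical, and the blocklist contains only the one section suggested by the sample output.

```python
from urllib.parse import urlparse

# Hypothetical blocklist; only "shopping" is suggested by the sample output
AD_SECTIONS = {"shopping"}

def link_section(link):
    """Return the section of a /YYYY/MM/DD/<section>/<slug> path, or None."""
    segments = [s for s in urlparse(link).path.split('/') if s]
    return segments[3] if len(segments) > 3 else None

def drop_ads(results, ad_sections=AD_SECTIONS):
    """Keep only results whose link is not in one of the ad sections."""
    return [r for r in results if link_section(r['link']) not in ad_sections]

# Made-up example links that only mimic the URL pattern
sample = [
    {'link': 'https://nypost.com/2025/02/17/shopping/example-sale/'},
    {'link': 'https://nypost.com/2025/02/19/us-news/example-story/'},
]
print(drop_ads(sample))
```

Results with no usable link (for example "No link") are kept, since they have no section to match. This does not address unrelated articles that the search returns from regular news sections.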