# Scrape NYT

### Sample HTML tag

```html
<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>
```

Notice that we need to extract both the headline (the `<p>` with class `indicate-hover`) and the summary (the `<p>` with class `summary-class`).

### Code
(You may have to install BeautifulSoup first.)

In [3]:
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [4]:
from bs4 import BeautifulSoup

In [5]:
html_element = """<section class="story-wrapper"><a class="css-9mylee" href="https://www.nytimes.com/2024/12/01/us/politics/biden-hunter-pardon-politics.html" data-uri="nyt://article/dffb88f6-058f-5e6f-8a61-6b4c08e420e4" aria-hidden="false"><div><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-tdd4a3"><span class="css-wt2ynm">Analysis</span></p></div><p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p></div><p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">7 min read</p></div></div></div></a></section>"""

In [6]:
soup = BeautifulSoup(html_element, 'html.parser')

In [7]:
headline1 = soup.find('section', class_='story-wrapper')
headline1.find_all('p')[1], headline1.find_all('p')[2]

(<p class="indicate-hover css-91bpc3">In Pardoning His Son, Biden Echoes Some of Trump’s Complaints</p>,
 <p class="summary-class css-1l5zmz6">President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.</p>)

In [8]:
title_and_summary_tag = headline1.find_all('p')
title = title_and_summary_tag[1].text
summary = title_and_summary_tag[2].text

title_and_summary = title + ". " + summary
title_and_summary

'In Pardoning His Son, Biden Echoes Some of Trump’s Complaints. President Biden and President-elect Trump now agree on one thing: The Biden Justice Department has been politicized.'
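Joining with `". "` assumes the headline carries no closing punctuation of its own; a headline that ends in a question mark would come out as `?.`. A minimal sketch of a join that checks for this first (the helper name `join_title_and_summary` and the sample strings are hypothetical, not part of the notebook):

```python
def join_title_and_summary(title, summary):
    """Join a headline and a summary into one string without doubling punctuation."""
    title = title.strip()
    summary = summary.strip()
    # Only add a period when the headline does not already end a sentence.
    separator = " " if title.endswith((".", "?", "!")) else ". "
    return title + separator + summary


# Made-up strings, for illustration only.
print(join_title_and_summary("Is This a Question?", "Yes."))
print(join_title_and_summary("Plain Headline", "Summary."))
```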

In [9]:
def get_text(html_element):
    """Return 'first <p>. second <p>' for a story element, or None if it has no <p> tags."""
    title_and_summary_tag = html_element.find_all('p')

    if len(title_and_summary_tag) == 0:
        return None

    # This function is not very robust :( It takes the first two <p> tags by
    # position, so a label such as "Analysis" is treated as the title.
    if len(title_and_summary_tag) < 2:
        return title_and_summary_tag[0].text

    title   = title_and_summary_tag[0].text
    summary = title_and_summary_tag[1].text

    return title + ". " + summary

In [10]:
get_text(headline1)

'Analysis. In Pardoning His Son, Biden Echoes Some of Trump’s Complaints'
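The result above pairs the "Analysis" label with the headline and drops the summary, because `get_text` takes the first two `<p>` tags by position. One possible alternative, sketched here, selects by the `indicate-hover` and `summary-class` class names that appear in the sample HTML. It assumes those names stay stable on the live page, which nothing in this notebook guarantees; `get_text_by_class` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def get_text_by_class(story):
    """Return 'headline. summary' for a story element, using class names instead of positions."""
    # Assumption: the headline <p> carries class "indicate-hover" and the
    # summary <p> carries class "summary-class", as in the sample HTML.
    title_tag = story.find('p', class_='indicate-hover')
    summary_tag = story.find('p', class_='summary-class')

    if title_tag is None:
        return None  # no headline found in this element
    if summary_tag is None:
        return title_tag.get_text(strip=True)  # headline without a summary

    return title_tag.get_text(strip=True) + ". " + summary_tag.get_text(strip=True)


# Shortened, made-up variant of the sample markup, for illustration only.
sample = """<section class="story-wrapper">
<p class="css-tdd4a3"><span>Analysis</span></p>
<p class="indicate-hover css-91bpc3">Example Headline</p>
<p class="summary-class css-1l5zmz6">Example summary sentence.</p>
<p class="css-1a0ymrn">7 min read</p>
</section>"""

story = BeautifulSoup(sample, 'html.parser').find('section', class_='story-wrapper')
print(get_text_by_class(story))
```

Selecting by class skips both the leading label and the trailing "min read" paragraph, whichever of them happens to be present.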

### Find ALL headlines

First, we download the front page.

In [11]:
import requests

In [12]:
%%time
response = requests.get('https://www.nytimes.com/')

CPU times: total: 0 ns
Wall time: 136 ms
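The call above sets no timeout and does not check the status code. In a script, a sketch along these lines would fail loudly on an error status and avoid hanging on a slow connection. The function name `fetch_html`, the 10-second timeout, and the optional `session` argument are assumptions for illustration; the `timeout` parameter and `raise_for_status()` are standard parts of `requests`.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP error statuses."""
    # Use the given session (e.g. a requests.Session) or the module-level API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # timeout in seconds (assumed value)
    response.raise_for_status()                # raises requests.HTTPError on 4xx/5xx
    return response.text


# Usage (performs a real network request):
# front_page = fetch_html('https://www.nytimes.com/')
```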


In [13]:
response

<Response [200]>

In [14]:
print(response.text[:500])

<!DOCTYPE html>
<html lang="en" class=" nytapp-vi-homepage "  xmlns:og="http://opengraphprotocol.org/schema/">
  <head>
    <script>!function(t,e){"object"==typeof exports&&"object"==typeof module?module.exports=e():"function"==typeof define&&define.amd?define([],e):"object"==typeof exports?exports.Statsig=e():t.Statsig=e()}(this,()=>(()=>{"use strict";var $Q=(e)=>Object.defineProperty(e,"__esModule",{value:!0});var $Q2=(a,b,c)=>Object.defineProperty(a,b,c);var $P=(a,b)=>Object.assign(a,b);var $


In [15]:
html = BeautifulSoup(response.text, 'html.parser')

In [16]:
html.find_all(class_="story-wrapper")[:5]

[<section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://article/130c5216-eb36-53fc-ba9b-0df51a295574" href="https://www.nytimes.com/2025/01/16/nyregion/eric-adams-trump-mar-a-lago.html"><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-ae0yjg"><span class="css-12tlih8">BREAKING</span></p></div><p class="indicate-hover css-1gg6cw2">Eric Adams Heads to Mar-a-Lago to Meet With Trump</p></div><p class="summary-class css-ofqxyv">The New York mayor, who is under federal indictment, has spoken warmly about President-elect Trump and has said he is open to receiving a pardon from him.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">2 min read</p></div></div></a></section>,
 <section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://legacycollection/425abb77-2cf6-5747-bf85-3a6594756b43" href="https://www.nytimes.com/live/2025/01/16/us/trump-news-hearings"><div class="css-xdandi"><div class="css-1a3i

### Extract headlines

In [17]:
html.find_all(class_="story-wrapper")[0]

<section class="story-wrapper"><a aria-hidden="false" class="css-9mylee" data-uri="nyt://article/130c5216-eb36-53fc-ba9b-0df51a295574" href="https://www.nytimes.com/2025/01/16/nyregion/eric-adams-trump-mar-a-lago.html"><div class="css-xdandi"><div class="css-1a3ibh4"><p class="css-ae0yjg"><span class="css-12tlih8">BREAKING</span></p></div><p class="indicate-hover css-1gg6cw2">Eric Adams Heads to Mar-a-Lago to Meet With Trump</p></div><p class="summary-class css-ofqxyv">The New York mayor, who is under federal indictment, has spoken warmly about President-elect Trump and has said he is open to receiving a pardon from him.</p><div class="css-1tic89u"><div><p class="css-1a0ymrn" data-ttr="1">2 min read</p></div></div></a></section>

In [18]:
html.find_all(class_="story-wrapper")[0].find_all('p')

[<p class="css-ae0yjg"><span class="css-12tlih8">BREAKING</span></p>,
 <p class="indicate-hover css-1gg6cw2">Eric Adams Heads to Mar-a-Lago to Meet With Trump</p>,
 <p class="summary-class css-ofqxyv">The New York mayor, who is under federal indictment, has spoken warmly about President-elect Trump and has said he is open to receiving a pardon from him.</p>,
 <p class="css-1a0ymrn" data-ttr="1">2 min read</p>]

In [19]:
for e in html.find_all(class_="story-wrapper")[:15]:
    print(get_text(e))

BREAKING. Eric Adams Heads to Mar-a-Lago to Meet With Trump
LIVE. Trump’s Picks Are Quizzed on Tax Cuts, Tariffs and Fossil Fuels
Two Watchdogs Were Rebuffed From Joining Trump’s Cost-Cutting Effort. 2 min read
State Attorneys General Ask Courts to Preserve Biden-era Gun Control Measures. 3 min read
Johnson Installs Crawford on Intelligence Panel, Pulling It Closer to Trump. Speaker Mike Johnson appointed Representative Rick Crawford, replacing a Republican who had criticized President-elect Trump and broken with him on key issues.
Trump Picks a Jet-Setting Pal of Elon Musk to Go Get Greenland. 6 min read
A First-Day Trump Order: A Federal Stockpile of Bitcoin?. 5 min read
Stephen Miller, Channeling Trump, Has Built More Power Than Ever. Stephen Miller was the architect of Donald Trump’s hard-line immigration agenda in his first term. Now he is back with fewer rivals and more influence.
Stephen Miller, Channeling Trump, Has Built More Power Than Ever. Stephen Miller was the architect o

In [20]:
headlines = [get_text(headline) for headline in html.find_all(class_="story-wrapper")]

In [21]:
headlines[:5]

['BREAKING. Eric Adams Heads to Mar-a-Lago to Meet With Trump',
 'LIVE. Trump’s Picks Are Quizzed on Tax Cuts, Tariffs and Fossil Fuels',
 'Two Watchdogs Were Rebuffed From Joining Trump’s Cost-Cutting Effort. 2 min read',
 'State Attorneys General Ask Courts to Preserve Biden-era Gun Control Measures. 3 min read',
 'Johnson Installs Crawford on Intelligence Panel, Pulling It Closer to Trump. Speaker Mike Johnson appointed Representative Rick Crawford, replacing a Republican who had criticized President-elect Trump and broken with him on key issues.']

In [22]:
len(headlines)

123
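A front page can link the same story from more than one `story-wrapper`, and `get_text` returns `None` for elements without `<p>` tags, so the count above may include repeats and empty entries. A sketch of removing both while keeping the original order; the helper name `unique_headlines` and the sample list are made up for illustration:

```python
def unique_headlines(headlines):
    """Drop None entries and duplicates, keeping first occurrences in order."""
    # dict preserves insertion order (Python 3.7+), so fromkeys() de-duplicates
    # without reshuffling the list.
    return list(dict.fromkeys(h for h in headlines if h is not None))


sample = ["Story A. Summary A", None, "Story B. Summary B", "Story A. Summary A"]
print(unique_headlines(sample))
```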

### Write headlines to file

#### Create the filename

In [23]:
import datetime

In [24]:
datetime.datetime.today()

datetime.datetime(2025, 1, 16, 16, 41, 28, 359394)

In [25]:
datetime.datetime.today().strftime('%Y-%m-%d')

'2025-01-16'

In [26]:
TODAY = datetime.datetime.today().strftime('%Y-%m-%d')

In [27]:
TODAY

'2025-01-16'

In [28]:
filename = f"headlines_nyt_{TODAY}.txt"
filename

'headlines_nyt_2025-01-16.txt'
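For the `%Y-%m-%d` pattern, `datetime.date.isoformat()` gives the same string without a format code. A small sketch that also accepts an explicit date, which makes it possible to name files for days other than today (`headline_filename` is a hypothetical helper):

```python
import datetime


def headline_filename(day=None):
    """Build 'headlines_nyt_YYYY-MM-DD.txt' for the given date (default: today)."""
    if day is None:
        day = datetime.date.today()
    return f"headlines_nyt_{day.isoformat()}.txt"


print(headline_filename(datetime.date(2025, 1, 16)))
```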

In [29]:
with open(filename, 'w', encoding='utf-8') as output_file:
    for headline in headlines:
        if headline is None: continue
        output_file.write(headline + '\n')
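
To check the file, or to load a previous day's headlines later, the lines can be read back. A sketch assuming one headline per line, as written above; `read_headlines` is a hypothetical helper, and the demonstration uses a temporary directory rather than the notebook's `filename`:

```python
import tempfile
from pathlib import Path


def read_headlines(path):
    """Return the non-empty lines of a headlines file as a list of strings."""
    text = Path(path).read_text(encoding='utf-8')
    return [line for line in text.splitlines() if line.strip()]


# Round trip on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "headlines_demo.txt"
    demo_path.write_text("First headline\nSecond headline\n", encoding='utf-8')
    loaded = read_headlines(demo_path)

print(loaded)
```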