# Scraping the New York Times Website

In [39]:
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.nytimes.com/'

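# Download the front page and parse it with Python's built-in HTML parser
data = requests.get(url)
data.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(data.text, 'html.parser')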


## Observations
- `<main id="site-content">` -> `<div>` -> divided into two further `<div>` blocks, as described below.
- The top block is a 'spotlight' section with only one main theme. Its tag is `<section>` and its `data-block-tracking-id` is *Spotlight*. Inside are a `<div>`, an `<h2>` and two `<span>`. Either span gives you the news title.
- In addition to Spotlight, there is a section whose `data-block-tracking-id` is *Briefings*. It goes `<div>`x3, then divides into three `<div>`, one for each of three articles.
- `<article>` -> `<div>`x2 -> two `<div>` for thumbnail or text
- `<a>` -> `<div>`x2 -> `<h2>`
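To make the nesting above concrete, here is a minimal sketch that walks a hand-written fragment. The fragment is an assumption: it only imitates the layout described in the observations and is not the real NYT markup, and the headline strings are placeholders. It uses `find_all`, the current spelling of `findAll`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the observed layout; NOT real NYT markup.
SAMPLE_HTML = """
<main id="site-content">
  <div>
    <div>
      <section data-block-tracking-id="Spotlight">
        <div><h2>Placeholder spotlight headline</h2></div>
      </section>
    </div>
    <div>
      <section data-block-tracking-id="Briefings">
        <div><div><div>
          <div><article><a href="#"><div><div><h2>Briefing one</h2></div></div></a></article></div>
          <div><article><a href="#"><div><div><h2>Briefing two</h2></div></div></a></article></div>
          <div><article><a href="#"><div><div><h2>Briefing three</h2></div></div></a></article></div>
        </div></div></div>
      </section>
    </div>
  </div>
</main>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
main = sample_soup.find("main", attrs={"id": "site-content"})

# Every tracked block is a <section> carrying a data-block-tracking-id attribute.
block_ids = [s["data-block-tracking-id"] for s in main.find_all("section")]

# Inside Briefings, each <article> holds exactly one <h2> headline.
briefings_section = main.find("section", attrs={"data-block-tracking-id": "Briefings"})
article_titles = [a.find("h2").text for a in briefings_section.find_all("article")]

print(block_ids)
print(article_titles)
```

Because `find` and `find_all` search all descendants, the intermediate `<div>` wrappers never need to be stepped through one by one.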

## Extract heading for spotlight

In [40]:
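# Locate <main id="site-content..."> (the regex matches any id starting with 'site-content')
main_content = soup.findAll('main', attrs={'id': re.compile('^site-content')})
# The spotlight block is the <section> tagged data-block-tracking-id="Spotlight"
spotlight = main_content[0].find('section', attrs={'data-block-tracking-id': 'Spotlight'})
# Its first <h2> holds the headline text
spotlight_heading = spotlight.find('h2').text
spotlight_heading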

'George H.W. Bush, a Restrained and Seasoned Leader in Troubled Times, Dies at 94'
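The chained lookups in this cell assume that every element exists: `main_content[0]` raises `IndexError` if no `<main>` matches, and `spotlight.find` raises `AttributeError` if `find` returned `None`. As a hedged sketch, a small helper can return an empty list instead. The name `block_headings` and the demo markup are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup


def block_headings(soup, block_id):
    """Return the text of every <h2> in the section tagged block_id, or [] if absent."""
    section = soup.find("section", attrs={"data-block-tracking-id": block_id})
    if section is None:  # layout changed or block not on the page
        return []
    return [h2.get_text(strip=True) for h2 in section.find_all("h2")]


# Hypothetical stand-in for the downloaded page.
demo = BeautifulSoup(
    '<main id="site-content"><section data-block-tracking-id="Spotlight">'
    "<h2> Placeholder headline </h2></section></main>",
    "html.parser",
)

print(block_headings(demo, "Spotlight"))  # ['Placeholder headline']
print(block_headings(demo, "Briefings"))  # []
```

The same helper would cover both the Spotlight and the Briefings blocks, since they differ only in the attribute value.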

## Extract heading for briefings

In [41]:
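# The briefings block is the <section> tagged data-block-tracking-id="Briefings"
briefings = main_content[0].find('section', attrs={'data-block-tracking-id': 'Briefings'})
# Each briefing article carries its headline in an <h2>
briefings_headings = briefings.findAll('h2')
briefings_headings = [tag.text for tag in briefings_headings]
briefings_headings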

['In the ‘At War’ Newsletter',
 'The Neediest Cases Fund',
 'The Daily Mini Crossword']

## Conclusion
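- In `find` or `findAll` (spelled `find_all` in current Beautiful Soup releases), the first argument is the tag name and the second (`attrs`) is a dictionary of attributes to match. `find` returns the first match or `None`; `findAll` returns a list of all matches.
- Attribute values can be given as a compiled regular expression (`re.compile`) instead of an exact string, in which case any value the pattern matches is accepted.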
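The difference between an exact string and a compiled pattern can be seen in a minimal sketch. The `id` values below are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup with three ids, two of which start with 'site-content'.
html = (
    '<div id="site-content">a</div>'
    '<div id="site-content-extra">b</div>'
    '<div id="other">c</div>'
)
doc = BeautifulSoup(html, "html.parser")

# A plain string must equal the attribute value exactly.
exact = doc.find_all("div", attrs={"id": "site-content"})

# A compiled regex accepts every value the pattern matches.
pattern = doc.find_all("div", attrs={"id": re.compile("^site-content")})

exact_ids = [d["id"] for d in exact]
pattern_ids = [d["id"] for d in pattern]
print(exact_ids)    # ['site-content']
print(pattern_ids)  # ['site-content', 'site-content-extra']
```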