# Scraping: https://www.nytimes.com/

Let's try to scrape the front page of the NYT. We're looking for:

* Headlines
* Bylines
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [1]:
import requests
from bs4 import BeautifulSoup

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [2]:
response = requests.get('http://www.nytimes.com')
doc = BeautifulSoup(response.text, 'html.parser')

A lot of people call the parsed page variable `soup`, but for once in my life I'm going against the popular thing: I like to call it `doc`, since it helps me remember that it's the *entire document*.

## ATTEMPT ONE: Grabbing the tags directly

Let's jump right into trying to grab the link.

Oh look, it's an... `a` tag. No special class or anything. What if we try to get all of the `a` tags on the page?

In [6]:
links = doc.find_all('a')

for link in links[:10]:
    print("This is a link")
    print(link.text)

This is a link
LEARN MORE »
This is a link
Skip to content
This is a link
Skip to navigation
This is a link
中文 (Chinese)
This is a link
Español
This is a link





This is a link
Today’s Paper
This is a link
Video
This is a link
World
This is a link
U.S.


Okay, that's terrible. Do you know how many `a` tags are going to be on that page? Many many many. Many very useless ones.

## Talking to parents

When you can't uniquely identify something, sometimes you need to go up the tree to find its **parent**, the element that contains it. We'll be looking for an element that covers the **entire story**, then we'll pick the link out of it.

Great, it looks like this:
    
    <article class="story theme-summary lede" id="topnews-100000004994965" data-story-id="100000004994965" data-rank="0" data-collection-renderstyle="LedeSum">
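
If you'd rather let BeautifulSoup do the climbing, here's a minimal sketch of going from a link up to its enclosing `article`. The HTML is a made-up snippet (the `id` and URL are placeholders, not real NYT markup), and `find_parent` is just one way to walk up the tree.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page: one story wrapped in an article tag
html = """
<article class="story theme-summary lede" id="demo-story">
  <h2 class="story-heading"><a href="https://example.com/story.html">A made-up headline</a></h2>
</article>
"""
doc = BeautifulSoup(html, "html.parser")

link = doc.find("a")

# .parent is the element directly above the link
direct_parent = link.parent.name

# .find_parent keeps climbing until it hits a matching tag
article = link.find_parent("article")

print(direct_parent)     # h2
print(article["class"])  # ['story', 'theme-summary', 'lede']
```

The direct parent is only the `h2`, so it takes a couple of hops to reach the element that covers the whole story.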

I'm going to go out on a limb and say we should look for an `article` tag, but what about the class? `story theme-summary lede` gives us three options:

* `story`
* `theme-summary`
* `lede`

`story` sounds promising, yeah?
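
In case it's not obvious why one class out of three is enough: BeautifulSoup treats `class` as a list of values, so searching for `story` matches any element that has `story` somewhere in that list. A quick sketch on made-up HTML (the `promo` class is invented for contrast):

```python
from bs4 import BeautifulSoup

# Made-up snippet: two articles carry the 'story' class, one does not
html = """
<article class="story theme-summary lede"><h2>First</h2></article>
<article class="story"><h2>Second</h2></article>
<article class="promo"><h2>Third</h2></article>
"""
doc = BeautifulSoup(html, "html.parser")

# Matches every article whose class list includes 'story'
stories = doc.find_all("article", {"class": "story"})
titles = [story.h2.text for story in stories]

print(titles)  # ['First', 'Second']
```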

In [9]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories[:10]:
    print("This is a story")
    print(story.text)

This is a story

Sessions Faces Questions From Senators in Russia Inquiry
By CHARLIE SAVAGE, EMMARIE HUETTEMAN and MICHAEL D. SHEAR 2:20 PM ET

Attorney General Jeff Sessions is testifying publicly before the Senate Intelligence Committee on Tuesday. We’re covering it live.
Rod J. Rosenstein, the deputy attorney general, said that Robert S. Mueller III, the special counsel in the Russia inquiry, would have “full independence.”

 Comments


This is a story

Should Sessions Expect Senatorial Courtesy? Not This Time 

This is a story

How the Right and Left Saw Comey’s Testimony, and More 

This is a story

Trump Is Considering Firing Mueller, Friend Says
By MICHAEL D. SHEAR and MAGGIE HABERMAN 
Some allies of President Trump have begun to attack the credibility of Mr. Mueller, the special counsel in the Russia investigation. Mr. Trump may look to terminate Mr. Mueller, said Christopher Ruddy, the Newsmax Media chief executive.

 Comments


This is a story

Uber Chief Taking Leave of Abse

Seems to work well enough! Now that we have a parent, **we can use that parent to grab the elements inside of the story.** We'll use `.find` and `.find_all` to get everything we need.

* STEP ONE: Get the story
* STEP TWO: Get the headline
* STEP THREE: Get the byline
* STEP FOUR: Get the link
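
Here's a rough sketch of those four steps on a tiny made-up page, mostly to show that searching from the parent only looks *inside* that parent. The headlines, bylines and URLs are invented, and the `byline` class is an assumption here.

```python
from bs4 import BeautifulSoup

# Two made-up stories, each with its own headline, byline and link
html = """
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/one.html">First made-up headline</a></h2>
  <p class="byline">By A. REPORTER</p>
</article>
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/two.html">Second made-up headline</a></h2>
  <p class="byline">By B. REPORTER</p>
</article>
"""
doc = BeautifulSoup(html, "html.parser")

# STEP ONE: get the story (just the first one here)
story = doc.find("article", {"class": "story"})

# STEPS TWO to FOUR: search inside the story, not the whole document
headline = story.find("h2", {"class": "story-heading"})
byline = story.find("p", {"class": "byline"})
link = story.find("a")

print(headline.text)  # First made-up headline
print(byline.text)    # By A. REPORTER
print(link["href"])   # https://example.com/one.html

# The whole document has two links, but this story only has one
links_in_doc = len(doc.find_all("a"))
links_in_story = len(story.find_all("a"))
```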

If we examine the page, it looks like headlines might be `h2` tags that have a `story-heading` class.

In [11]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories[:10]:
    print("This is a story")
    headline = story.find('h2', { 'class': 'story-heading' })
    print(headline.text)

This is a story
Sessions Faces Questions From Senators in Russia Inquiry
This is a story
Should Sessions Expect Senatorial Courtesy? Not This Time 
This is a story
How the Right and Left Saw Comey’s Testimony, and More 
This is a story
Trump Is Considering Firing Mueller, Friend Says
This is a story
Uber Chief Taking Leave of Absence After Investigation
This is a story
North Korea Frees American Student Said to Be in Coma
This is a story
3 Americans Remain Imprisoned in North Korea 1:33 PM ET
This is a story
On Tiny Norwegian Island, U.S. Keeps an Eye on Russia
This is a story
In Suicide Case, Teenage ‘Frailties’ Take Center Stage
This is a story
New Suit Says Trump Holdings Are Unconstitutional 


### An error strikes!

But sooner or later, we get an error!

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-57-9218ec61124f> in <module>()
          4     print("This is a story")
          5     headline = story.find('h2', { 'class': 'story-heading' })
    ----> 6     print(headline.text)

    AttributeError: 'NoneType' object has no attribute 'text'

Hm, a story missing a headline? Let's look at it a little closer. We could do this in a classy way, but let's just brute force it by printing out every article just before the line that errors.

In [12]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories[:10]:
    print("This is a story")
    headline = story.find('h2', { 'class': 'story-heading' })
    print(story)
    print(headline.text)

This is a story
<article class="story theme-summary lede" data-collection-renderstyle="LedeSum" data-rank="0" data-story-id="100000005160207" id="topnews-100000005160207">
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/06/13/us/politics/jeff-sessions-testimony.html">Sessions Faces Questions From Senators in Russia Inquiry</a></h2>
<p class="byline">By CHARLIE SAVAGE, EMMARIE HUETTEMAN and MICHAEL D. SHEAR <time class="timestamp" data-eastern-timestamp="2:20 PM" data-utc-timestamp="1497378055" datetime="2017-06-13">2:20 PM ET</time></p>
<p class="summary"><ul>
<li>Attorney General Jeff Sessions is testifying publicly before the Senate Intelligence Committee on Tuesday. We’re covering it live.</li>
<li>Rod J. Rosenstein, the deputy attorney general, said that Robert S. Mueller III, the special counsel in the Russia inquiry, would have “full independence.”</li></ul></p>
<p class="theme-comments">
<a class="comments-link" href="https://www.nytimes.com/2017/06/13/us/politic

The error seems to happen with this one piece here:
    
    <h1 class="story-heading"><a href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html">Legendary New York Intellectuals Are His Ex-Friends</a></h1>
    <p class="summary">Norman Podhoretz, the former editor at Commentary magazine, looks back at the fierce, argumentative parties of New York’s intelligentsia.</p>
    <p class="byline">By JOHN LELAND </p>
    <p class="theme-comments">
    <a class="comments-link" href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html?hp&amp;target=comments#commentsContainer"><i class="icon sprite-icon comments-icon"></i><span class="comment-count"> Comments</span></a>
    </p>
    </article>

Oh look, it uses an `h1` instead of an `h2`, but it's still a `story-heading`. Let's change our code to **look for a `story-heading` class regardless of tag name**.
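
To see the difference in isolation, here's a small sketch on made-up HTML: one story uses an `h2` and the other an `h1`, and only the class-only search finds both.

```python
from bs4 import BeautifulSoup

# Made-up snippet: same class, different heading tags
html = """
<article class="story"><h2 class="story-heading">Uses an h2</h2></article>
<article class="story"><h1 class="story-heading">Uses an h1</h1></article>
"""
doc = BeautifulSoup(html, "html.parser")

results = []
for story in doc.find_all("article", class_="story"):
    # Tag name AND class: misses the h1
    by_tag = story.find("h2", class_="story-heading")
    # Class only: matches whatever tag carries the class
    by_class = story.find(class_="story-heading")
    results.append((by_tag is not None, by_class.name))

print(results)  # [(True, 'h2'), (False, 'h1')]
```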

In [16]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    print("This is a story")
    headline = story.find(class_='story-heading')
    print(headline.text)

This is a story
Sessions Faces Questions From Senators in Russia Inquiry
This is a story
Should Sessions Expect Senatorial Courtesy? Not This Time 
This is a story
How the Right and Left Saw Comey’s Testimony, and More 
This is a story
Trump Is Considering Firing Mueller, Friend Says
This is a story
Uber Chief Taking Leave of Absence After Investigation
This is a story
North Korea Frees American Student Said to Be in Coma
This is a story
3 Americans Remain Imprisoned in North Korea 1:33 PM ET
This is a story
On Tiny Norwegian Island, U.S. Keeps an Eye on Russia
This is a story
In Suicide Case, Teenage ‘Frailties’ Take Center Stage
This is a story
New Suit Says Trump Holdings Are Unconstitutional 
This is a story
Montana Republican Is Sentenced in Assault on Reporter 1:27 PM ET
This is a story
Met Museum Changes Leadership Structure 10:01 AM ET
This is a story
Greece Declares Emergency After Earthquake 9:53 AM ET
This is a story
A Stagnant General Electric Will Replace Its Chief Executi

AttributeError: 'NoneType' object has no attribute 'text'

Another error! Let's print out the stories again.

In [17]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    print("This is a story")
    headline = story.find(class_='story-heading')
    print(story)
    print(headline.text)

This is a story
<article class="story theme-summary lede" data-collection-renderstyle="LedeSum" data-rank="0" data-story-id="100000005160207" id="topnews-100000005160207">
<h2 class="story-heading"><a href="https://www.nytimes.com/2017/06/13/us/politics/jeff-sessions-testimony.html">Sessions Faces Questions From Senators in Russia Inquiry</a></h2>
<p class="byline">By CHARLIE SAVAGE, EMMARIE HUETTEMAN and MICHAEL D. SHEAR <time class="timestamp" data-eastern-timestamp="2:20 PM" data-utc-timestamp="1497378055" datetime="2017-06-13">2:20 PM ET</time></p>
<p class="summary"><ul>
<li>Attorney General Jeff Sessions is testifying publicly before the Senate Intelligence Committee on Tuesday. We’re covering it live.</li>
<li>Rod J. Rosenstein, the deputy attorney general, said that Robert S. Mueller III, the special counsel in the Russia inquiry, would have “full independence.”</li></ul></p>
<p class="theme-comments">
<a class="comments-link" href="https://www.nytimes.com/2017/06/13/us/politic

AttributeError: 'NoneType' object has no attribute 'text'

It looks like it failed on this one. 

    <article class="story">
    <h3 class="kicker">
    <a href="http://wordplay.blogs.nytimes.com">Wordplay »</a>
    </h3>
    </article>

Now we have a choice to make: do we care about this? I... don't. If we want to skip through to the next element in a loop, we can use `continue`.

Let's say **hey, if you don't have a headline, we're going to skip you.**

In [18]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    print("This is a story")
    headline = story.find(class_='story-heading')
    if not headline:
        continue
    print(headline.text)

This is a story
Sessions Faces Questions From Senators in Russia Inquiry
This is a story
Should Sessions Expect Senatorial Courtesy? Not This Time 
This is a story
How the Right and Left Saw Comey’s Testimony, and More 
This is a story
Trump Is Considering Firing Mueller, Friend Says
This is a story
Uber Chief Taking Leave of Absence After Investigation
This is a story
North Korea Frees American Student Said to Be in Coma
This is a story
3 Americans Remain Imprisoned in North Korea 1:33 PM ET
This is a story
On Tiny Norwegian Island, U.S. Keeps an Eye on Russia
This is a story
In Suicide Case, Teenage ‘Frailties’ Take Center Stage
This is a story
New Suit Says Trump Holdings Are Unconstitutional 
This is a story
Montana Republican Is Sentenced in Assault on Reporter 1:27 PM ET
This is a story
Met Museum Changes Leadership Structure 10:01 AM ET
This is a story
Greece Declares Emergency After Earthquake 9:53 AM ET
This is a story
A Stagnant General Electric Will Replace Its Chief Executi

Maybe we can also get rid of the whitespace around the headlines by using `.strip()`.

In [19]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    headline = story.find(class_='story-heading')
    if not headline:
        continue
    print(headline.text.strip())

Sessions Faces Questions From Senators in Russia Inquiry
Should Sessions Expect Senatorial Courtesy? Not This Time
How the Right and Left Saw Comey’s Testimony, and More
Trump Is Considering Firing Mueller, Friend Says
Uber Chief Taking Leave of Absence After Investigation
North Korea Frees American Student Said to Be in Coma
3 Americans Remain Imprisoned in North Korea 1:33 PM ET
On Tiny Norwegian Island, U.S. Keeps an Eye on Russia
In Suicide Case, Teenage ‘Frailties’ Take Center Stage
New Suit Says Trump Holdings Are Unconstitutional
Montana Republican Is Sentenced in Assault on Reporter 1:27 PM ET
Met Museum Changes Leadership Structure 10:01 AM ET
Greece Declares Emergency After Earthquake 9:53 AM ET
A Stagnant General Electric Will Replace Its Chief Executive
Slump in Tech Stocks Leaves Some Investors Mystified
Listen to ‘The Daily’
California Today: A Title to Unite the Bay Area
Short, but Great, Books for Your Commute
For Summer, Five Hot Destinations
‘The A.C.L.U.’s Worst Nigh

### Next step: Adding more pieces

Now we need to add in the links and the bylines. We'll start with the links by pulling in any `a` tags.

In [20]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    headline = story.find(class_='story-heading')
    if not headline:
        continue
    print(headline.text.strip())
    link = story.find('a')
    print(link['href'])

Sessions Faces Questions From Senators in Russia Inquiry
https://www.nytimes.com/2017/06/13/us/politics/jeff-sessions-testimony.html
Should Sessions Expect Senatorial Courtesy? Not This Time
https://www.nytimes.com/2017/06/12/us/politics/jeff-sessions-senate-hearing-trump-russia.html
How the Right and Left Saw Comey’s Testimony, and More
https://www.nytimes.com/2017/06/12/us/politics/right-and-left-partisan-writing-you-shouldnt-miss.html
Trump Is Considering Firing Mueller, Friend Says
https://www.nytimes.com/2017/06/12/us/politics/robert-mueller-trump.html
Uber Chief Taking Leave of Absence After Investigation
https://www.nytimes.com/2017/06/13/technology/uber-travis-kalanick-holder-report.html
North Korea Frees American Student Said to Be in Coma
https://www.nytimes.com/2017/06/13/world/north-korea-otto-warmbier-rodman.html
3 Americans Remain Imprisoned in North Korea 1:33 PM ET
https://www.nytimes.com/2017/06/13/world/asia/north-korea-american-prisoner.html
On Tiny Norwegian Island,
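
One thing to watch out for: `story.find('a')` grabs the *first* link in the story, which is only the headline link if nothing else comes before it. A slightly more careful sketch is to look for the link inside the headline itself. The layout below, with a kicker link sitting above the headline, is a made-up assumption to show the difference, not something copied from the real page.

```python
from bs4 import BeautifulSoup

# Made-up story where another link appears before the headline
html = """
<article class="story">
  <h3 class="kicker"><a href="https://example.com/section">Section</a></h3>
  <h2 class="story-heading"><a href="https://example.com/article.html">Made-up headline</a></h2>
</article>
"""
doc = BeautifulSoup(html, "html.parser")
story = doc.find("article", class_="story")

# First link in the whole story: the kicker, not the article
first_link = story.find("a")

# Link inside the headline, guarding against missing pieces
headline = story.find(class_="story-heading")
headline_link = headline.find("a") if headline else None
# .get returns None instead of raising if there is no href
url = headline_link.get("href") if headline_link else None

print(first_link["href"])  # https://example.com/section
print(url)                 # https://example.com/article.html
```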

## Adding in bylines

Bylines look like this:

    <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17" data-eastern-timestamp="12:36 PM" data-utc-timestamp="1489768575">12:36 PM ET</time></p>
    
So... let's just grab the element inside of story that has the class of `byline`!

In [22]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    headline = story.find(class_='story-heading')
    if not headline:
        continue
    print(headline.text.strip())
    link = story.find('a')
    print(link['href'])
    byline = story.find(class_='byline')
    print(byline.text)

Sessions Faces Questions From Senators in Russia Inquiry
https://www.nytimes.com/2017/06/13/us/politics/jeff-sessions-testimony.html
By CHARLIE SAVAGE, EMMARIE HUETTEMAN and MICHAEL D. SHEAR 2:20 PM ET
Should Sessions Expect Senatorial Courtesy? Not This Time
https://www.nytimes.com/2017/06/12/us/politics/jeff-sessions-senate-hearing-trump-russia.html


AttributeError: 'NoneType' object has no attribute 'text'

So we get another one of those `NoneType` errors, this time because of a missing byline. Well, maybe not everything has a byline. That doesn't mean we should skip the whole story; let's just skip the byline for that one.

In [14]:
stories = doc.find_all('article', { 'class': 'story' })

for story in stories:
    headline = story.find(class_='story-heading')
    if not headline:
        continue
    print(headline.text.strip())
    link = story.find('a')
    print(link['href'])
    byline = story.find(class_='byline')
    if byline:
        print(byline.text)

Revised Trump Travel Ban Suffers Another Legal Setback
https://www.nytimes.com/2017/06/12/us/politics/trump-travel-ban-court-of-appeals.html
By ADAM LIPTAK 2:09 PM ET
Putin Opponent Is Arrested Amid Protests in Russia
https://www.nytimes.com/2017/06/12/world/europe/russia-aleksei-navalny-kremlin-protests.html
By NEIL MacFARQUHAR and ANDREW HIGGINS 1:26 PM ET
Why Cyberwar on ISIS Has Fallen Short of U.S. Hopes
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
By DAVID E. SANGER and ERIC SCHMITT 
Trump Holdings Are Unconstitutional, Pioneering Suit Says
https://www.nytimes.com/2017/06/12/us/trump-lawsuit-private-businesses.html
By SHARON LaFRANIERE 12:31 PM ET
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
By AL BAKER 
Addiction Drug Lacks Results, but It Has Powerful Friends
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
Role of Trump’s Lawyer Blurs Public and 

**Looking a lot better!** Now the only problem is that we get "By LOUIS LUCERO II 1:00 PM ET" instead of "By LOUIS LUCERO II" or, even better, just "LOUIS LUCERO II".

## So I guess you better learn regular expressions, eh?
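
As a taste of where that's headed, here's a minimal sketch of cleaning a byline with Python's built-in `re` module. The pattern assumes bylines look like the ones above: a leading "By", then the names, then an optional timestamp shaped like "1:00 PM ET". The function names `clean_byline` and `split_authors` are made up for this example.

```python
import re

# "By", the names (captured), then an optional "H:MM AM/PM ET" timestamp
BYLINE_PATTERN = re.compile(r"^By\s+(.*?)(?:\s+\d{1,2}:\d{2}\s+[AP]M\s+ET)?\s*$")


def clean_byline(text):
    """Strip the leading 'By' and any trailing timestamp from a byline."""
    text = text.strip()
    match = BYLINE_PATTERN.match(text)
    if not match:
        # Doesn't look like a byline we know how to handle: leave it alone
        return text
    return match.group(1)


def split_authors(names):
    """Split 'A, B and C' into a list of individual names."""
    return [name for name in re.split(r",\s*|\s+and\s+", names) if name]


print(clean_byline("By LOUIS LUCERO II 1:00 PM ET"))  # LOUIS LUCERO II
print(clean_byline("By AL BAKER "))                   # AL BAKER
print(split_authors("DAVID E. SANGER and ERIC SCHMITT"))
# ['DAVID E. SANGER', 'ERIC SCHMITT']
```

It's only tuned to the handful of bylines shown here, so expect to adjust the pattern once you run into stranger ones.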