# Scraping: https://www.nytimes.com/

Let's try to scrape the front page of the NYT. We're looking for:

* Headlines
* Bylines
* Article links

## Getting started

We'll start by **importing the necessary libraries**.

In [4]:
from bs4 import BeautifulSoup
import requests

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [6]:
response = requests.get("https://www.nytimes.com/")
doc = BeautifulSoup(response.text,'html.parser')
# Use the built-in 'html.parser' here; html5lib is a separate install and can parse the same page differently.
# Save the parsed page in a variable. Soma calls it doc because it's a good reminder that it's an entire document.

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.

In [48]:
doc.prettify()




**XML vs HTML:** XML lets you make up whatever tags you like, while HTML has a specific set of tags you can use. The structure, however, looks similar.
`img` and `br` tags are self-closing: they don't have closing tags.
**XHTML** and **HTML5** are flavors of HTML; the difference between them doesn't matter for this scrape.
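To see the self-closing tags in practice, here is a small sketch on a made-up HTML snippet (the snippet and the `photo.png` file name are assumptions, not taken from the NYT page). BeautifulSoup closes `br` and `img` on its own, so they end up with nothing inside them.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: <br> and <img> have no closing tags.
html = '<p>First line<br>Second line<img src="photo.png"></p>'
snippet = BeautifulSoup(html, 'html.parser')

br_tag = snippet.find('br')
img_tag = snippet.find('img')

# Self-closing tags have no children, but they can still carry attributes.
print(br_tag.contents)
print(img_tag['src'])
```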

## ATTEMPT ONE: Grabbing the tags directly

Let's jump right into trying to grab the link.

Oh, look it's an.... `a` tag. No special class or anything. What if we try to get all of the `a` tags on the page?
Go to nytimes.com and figure out what you want. Let's get the headlines: inspecting the page shows they are in `h2` tags.

In [9]:
headline_tags = doc.find_all('h2')

In [10]:
len(headline_tags)

181

In [12]:
for tag in headline_tags:
    print(tag.text)






Quick Site Sections Navigation
Site Search Navigation
Site Navigation
Site Mobile Navigation
Top News
ISIS Proves an Elusive Target for America’s Cyberweapons
Opioid Addicts Find an Ally in Blue
Addiction Drug Lacks Results, but It Has Powerful Friends 
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
Democrats Call for Sessions’s Testimony to Be Public 
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
Role of Trump’s Lawyer Blurs Public and Private Lines
Obamacare Repeal Limits Flexibility for Those in Transition
Uber Board Discusses a Leave for Embattled C.E.O.
Uber’s C.E.O. Plays With Fire (April 23, 2017)
Jeffrey Immelt to Retire as General Electric Chief 9:28 AM ET
Bill Cosby Sex Assault Trial: The Defense’s Turn 5:00 AM ET
Macron’s Party on Track to Claim Majority in Parliament 
Should Puerto Rico Be 51st State? Residents Go to Polls. 

                                    Before the Cloud, a Mine of Data                            
The New York Ti

This isn't what we want. "Quick Site Sections Navigation" is not a headline. We need something that identifies only the headlines.
An ID is unique, while a class is a category: an ID is like your Columbia ID card, and a class is like being in the class of 2017.
I don't just want `h2`s, I want `h2`s whose class is `story-heading` (you learn this from inspecting the page).
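Here is a rough sketch of the ID-versus-class idea on a tiny made-up page (the HTML and the `site-nav` id are invented for illustration). It also shows why searching by class alone and searching for `h2` plus class can give slightly different counts.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one element with an id, several sharing a class.
html = """
<h2 id="site-nav">Site Navigation</h2>
<h2 class="story-heading">First headline</h2>
<h2 class="story-heading">Second headline</h2>
<h3 class="story-heading">Third headline</h3>
"""
page = BeautifulSoup(html, 'html.parser')

# An id should appear once per page, so .find is enough.
nav = page.find(id='site-nav')

# A class is a category, so expect many matches.
all_h2 = page.find_all('h2')                                      # every h2, navigation included
story_h2 = page.find_all('h2', attrs={'class': 'story-heading'})  # only h2s with the class
any_tag = page.find_all(class_='story-heading')                   # any tag with the class

print(len(all_h2), len(story_h2), len(any_tag))
```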

Okay, that's terrible. Do you know how many `a` tags are going to be on that page? Many many many. Many very useless ones.

## Talking to parents

When you can't uniquely identify something, sometimes you need to go up the tree to find its **parent**, the elements that are above it. We'll be looking for an element that covers the **entire story**, then we'll pick the link out of it.

In [13]:
headline_tags = doc.find_all(class_='story-heading')
headline_tags = doc.find_all('h2', attrs={'class':'story-heading'})
# Two ways to find the story-heading class. The first matches any tag with that class,
# the second only h2 tags, so the results can differ slightly. The second assignment is the one we keep.

Great, it looks like this:
    
    <article class="story theme-summary lede" id="topnews-100000004994965" data-story-id="100000004994965" data-rank="0" data-collection-renderstyle="LedeSum">

I'm going to go out on a limb and say we should look for an `article` tag, but what about the class? `story theme-summary lede` gives us three options:

* `story`
* `theme-summary`
* `lede`

`story` sounds promising, yeah?
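A quick sketch of how that plays out, assuming a shortened copy of the `article` tag above plus one made-up non-story tag: BeautifulSoup treats `class` as a list of names, and searching for any one of them matches.

```python
from bs4 import BeautifulSoup

# The article tag from above (shortened), plus a made-up tag that is not a story.
html = """
<article class="story theme-summary lede" id="topnews-100000004994965">Lead story</article>
<article class="promo">Not a story</article>
"""
page = BeautifulSoup(html, 'html.parser')

first = page.find('article')
print(first['class'])  # the class attribute comes back as a list of names

# Searching for any single class name matches the multi-class tag.
stories = page.find_all('article', class_='story')
ledes = page.find_all(class_='lede')
print(len(stories), len(ledes))
```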

In [14]:
for tag in headline_tags:
    print(tag.text)

ISIS Proves an Elusive Target for America’s Cyberweapons
Opioid Addicts Find an Ally in Blue
Addiction Drug Lacks Results, but It Has Powerful Friends 
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
Democrats Call for Sessions’s Testimony to Be Public 
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
Role of Trump’s Lawyer Blurs Public and Private Lines
Obamacare Repeal Limits Flexibility for Those in Transition
Uber Board Discusses a Leave for Embattled C.E.O.
Jeffrey Immelt to Retire as General Electric Chief 9:28 AM ET
Bill Cosby Sex Assault Trial: The Defense’s Turn 5:00 AM ET
Macron’s Party on Track to Claim Majority in Parliament 
Should Puerto Rico Be 51st State? Residents Go to Polls. 

                                    Before the Cloud, a Mine of Data                            
Your Monday Briefing
California Today: Talking to a Tony Winner


      Listen to ‘The Daily’
    

How to Save on Summer Travel
Ways Your iPhone Will Change After App

In [17]:
for tag in headline_tags[:3]:
    # [:3] takes only the first three headline tags.
    # We're looping through the h2 tags, and for each one we
    # find every link inside of it (remember, links are 'a' tags).
    link = tag.find_all('a')
    # NOTE: this works because there's only one link in each headline. With several links we'd get them all back.
    print(tag.text)
    print(link)
    print("-------------")

ISIS Proves an Elusive Target for America’s Cyberweapons
[<a href="https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html">ISIS Proves an Elusive Target for America’s Cyberweapons</a>]
-------------
Opioid Addicts Find an Ally in Blue
[<a href="https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html">Opioid Addicts Find an Ally in Blue</a>]
-------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
[<a href="https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html">Addiction Drug Lacks Results, but It Has Powerful Friends</a>]
-------------
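The square brackets in that output are there because `.find_all` always returns a list. A small sketch of the difference between `.find_all` and `.find`, using one headline copied from the output above:

```python
from bs4 import BeautifulSoup

# One headline copied from the output above.
html = ('<h2 class="story-heading"><a href="https://www.nytimes.com/2017/06/12/'
        'nyregion/when-opioid-addicts-find-an-ally-in-blue.html">'
        'Opioid Addicts Find an Ally in Blue</a></h2>')
tag = BeautifulSoup(html, 'html.parser').find('h2')

links = tag.find_all('a')   # always a list, even with one match
link = tag.find('a')        # the first match, or None if there is none
missing = tag.find('img')   # nothing to find here, so this is None

url = link['href']          # attributes are read like dictionary keys
print(url)
```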


Seems to work well enough! Now that we have a parent, **we can use that parent to grab the elements inside of the story.** We'll use `.find` and `.find_all` to get everything we need.

* STEP ONE: Get the story
* STEP TWO: Get the headline
* STEP THREE: Get the byline
* STEP FOUR: Get the link

**Let's look for the summary.**
Look at the `h2` tag. There's no summary there. Where is it? It's further down, in `<p class="summary">`.

In [20]:
summary_tags = doc.find_all('p', attrs={'class':'summary'})
len(summary_tags)

33

Strange. There are only 33 summary tags, but there are far more headlines. They don't match up, and we want them to. Is there a specific grouping for each of these units? If we can find the parent element of each story, we can pull the headline and summary out of each one so they stay matched. What class makes sense for this? `story`?

In [22]:
story_tags = doc.find_all(class_='story')
len(story_tags)

154

In [None]:
story_tags[0]
#look at the first one.

### `.` means class and `#` means ID in the following notes

**Story**

* Headline: `.story-heading`
* Link: `a`
* Summary: `.summary`
* Byline: `.byline`
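The `.` and `#` shorthand is CSS selector notation, and BeautifulSoup can read it directly through `.select` and `.select_one`. This is only a sketch on a made-up story block (the HTML, the `story-1` id and the `example.com` link are assumptions); the notebook itself sticks with `.find`.

```python
from bs4 import BeautifulSoup

# Made-up story block that follows the structure in the notes above.
html = """
<article class="story" id="story-1">
  <h2 class="story-heading"><a href="https://example.com/one">A headline</a></h2>
  <p class="summary">A short summary.</p>
  <p class="byline">By SOME REPORTER</p>
</article>
"""
page = BeautifulSoup(html, 'html.parser')

# "#" selects by id, "." selects by class, a bare name selects by tag.
story = page.select_one('#story-1')
headline = story.select_one('.story-heading')
link = story.select_one('a')
summary = story.select_one('.summary')
byline = story.select_one('.byline')

print(headline.text, link['href'], summary.text, byline.text)
```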

In [25]:
for story in story_tags:
    headline = story.find(class_='story-heading')
    link = story.find('a')
    summary = story.find(class_='summary')
    byline = story.find(class_='byline')
    print("-----------------")

-----------------
-----------------
-----------------
----------

In [26]:
#let's print the headline:
for story in story_tags:
    headline = story.find(class_='story-heading')
    print(headline.text)
    link = story.find('a')
    summary = story.find(class_='summary')
    byline = story.find(class_='byline')
    print("-----------------")
# The error at the bottom is: 'NoneType' object has no attribute 'text'. We asked .find for a headline,
# but this story doesn't have one, so .find returned None, and None has no .text.

ISIS Proves an Elusive Target for America’s Cyberweapons
-----------------
Opioid Addicts Find an Ally in Blue
-----------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
-----------------
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
-----------------
Democrats Call for Sessions’s Testimony to Be Public 
-----------------
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
-----------------
Role of Trump’s Lawyer Blurs Public and Private Lines
-----------------
Obamacare Repeal Limits Flexibility for Those in Transition
-----------------
Uber Board Discusses a Leave for Embattled C.E.O.
-----------------


AttributeError: 'NoneType' object has no attribute 'text'

In [27]:
#let's try it again, skipping over the ones that don't have headlines.
for story in story_tags:
    headline = story.find(class_='story-heading')
    if headline: # if a headline exists, print it (use .strip() to trim extra whitespace)
        print(headline.text)
    link = story.find('a')
    summary = story.find(class_='summary')
    byline = story.find(class_='byline')
    print("-----------------")

ISIS Proves an Elusive Target for America’s Cyberweapons
-----------------
Opioid Addicts Find an Ally in Blue
-----------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
-----------------
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
-----------------
Democrats Call for Sessions’s Testimony to Be Public 
-----------------
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
-----------------
Role of Trump’s Lawyer Blurs Public and Private Lines
-----------------
Obamacare Repeal Limits Flexibility for Those in Transition
-----------------
Uber Board Discusses a Leave for Embattled C.E.O.
-----------------
-----------------
Jeffrey Immelt to Retire as General Electric Chief 9:28 AM ET
-----------------
Bill Cosby Sex Assault Trial: The Defense’s Turn 5:00 AM ET
-----------------
Macron’s Party on Track to Claim Majority in Parliament 
-----------------
Should Puerto Rico Be 51st State? Residents Go to Polls. 
-----------------

              

In [28]:
for story in story_tags:
    headline = story.find(class_='story-heading')
    if headline: # if a headline exists, print it (use .strip() to trim extra whitespace)
        print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find(class_='summary')
    byline = story.find(class_='byline')
    print("-----------------")

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
-----------------
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
-----------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
-----------------
Trump Era, Unlike Watergate Era, Has Rival Sets of Facts
https://www.nytimes.com/2017/06/11/business/media/comey-trump-watergate.html
-----------------
Democrats Call for Sessions’s Testimony to Be Public 
https://www.nytimes.com/2017/06/11/us/politics/jeff-sessions-russia-trump-attorney-general-senate.html
-----------------
Fired U.S. Attorney Says Trump Tried to Build Relationship With Him 
https://www.nytimes.com/2017/06/11/us/politics/preet-bharara-trump-contacts.html
-----------------
Role of Trump’s Lawyer Blurs Public and Private Lines
https://w

TypeError: 'NoneType' object is not subscriptable

In [29]:
# TypeError: 'NoneType' object is not subscriptable means .find returned None: this story has no link.
for story in story_tags:
    headline = story.find(class_='story-heading')
    if headline: # if a headline exists, print it (use .strip() to trim extra whitespace)
        print(headline.text)
    link = story.find('a')
    if link:
        print(link['href'])
        # Do the same check for the summary and byline, now that we know the fix.
    summary = story.find(class_='summary')
    if summary:
        print(summary.text)
    byline = story.find(class_='byline')
    if byline:
        print(byline.text)
    print("-----------------")


ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.
This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.
By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET
-----------------
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
Police leaders are assigning themselves a big role in reversing a complex crisis, and not through mass arrests.
By AL BAKER 5:00 AM ET
-----------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
-----------------
Trump Era, Unlike Watergate E
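Writing the same `if` check four times gets repetitive. One possible way to tidy it up is a small helper that returns `None` when nothing is found. This is only a sketch: `text_of` is a hypothetical name, not part of BeautifulSoup, and the HTML is made up.

```python
from bs4 import BeautifulSoup

def text_of(parent, class_name):
    """Return the stripped text of the first match, or None if nothing matches.

    Hypothetical helper for illustration, not part of BeautifulSoup.
    """
    found = parent.find(class_=class_name)
    return found.text.strip() if found is not None else None

# Made-up stories: the second one has no byline.
html = """
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/a"> With byline </a></h2>
  <p class="byline">By A REPORTER</p>
</article>
<article class="story">
  <h2 class="story-heading"><a href="https://example.com/b">No byline</a></h2>
</article>
"""
page = BeautifulSoup(html, 'html.parser')

results = []
for story in page.find_all(class_='story'):
    results.append((text_of(story, 'story-heading'), text_of(story, 'byline')))
print(results)
```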

In [36]:
# Printing isn't very useful on its own. Let's build a list of dictionaries so we can look things up later.

# Start with an empty list called stories; the plan is to add to it
# every time we go through the loop:
stories = []

for story in story_tags:
    #make an empty dictionary:
    current = {}
    headline = story.find(class_='story-heading')
    if headline:
        print(headline.text)
        # If you find a headline, put it in the dictionary. Save exactly what you're
        # printing: headline.text, not just headline; for the link, its href.
        current['headline'] = headline.text
    link = story.find('a')
    if link:
        print(link['href'])
        current['url'] = link['href']
    summary = story.find(class_='summary')
    if summary:
        print(summary.text)
        current['summary'] = summary.text
    byline = story.find(class_='byline')
    if byline:
        print(byline.text)
        current['byline'] = byline.text
    print("-----------------")


ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.
This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.
By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET
-----------------
Opioid Addicts Find an Ally in Blue
https://www.nytimes.com/2017/06/12/nyregion/when-opioid-addicts-find-an-ally-in-blue.html
Police leaders are assigning themselves a big role in reversing a complex crisis, and not through mass arrests.
By AL BAKER 5:00 AM ET
-----------------
Addiction Drug Lacks Results, but It Has Powerful Friends 
https://www.nytimes.com/2017/06/11/health/vivitrol-drug-opioid-addiction.html
-----------------
Trump Era, Unlike Watergate E

In [37]:
#let's make sure it's a dictionary.
print(current)

{'headline': 'Mortgage Calculator', 'url': 'https://www.nytimes.com/real-estate/mortgage-calculator'}


In [33]:
#are there stories in there? let's check
stories

[]

In [35]:
# There aren't, because the dictionaries were never added to the list. The dictionary
# is called 'current' and the list is called 'stories'. Appending here, outside the loop,
# adds only the last story, so the next cell moves the append inside the loop.
stories.append(current)

In [38]:
stories = []

for story in story_tags:
    #make an empty dictionary:
    current = {}
    headline = story.find(class_='story-heading')
    if headline:
        print(headline.text)
        current['headline'] = headline.text.strip()  # .strip() removes the extra
        # whitespace around the text (there's a LOT of it)
    link = story.find('a')
    if link:
        print(link['href'])
        current['url'] = link['href']
    summary = story.find(class_='summary')
    if summary:
        print(summary.text)
        current['summary'] = summary.text.strip()
    byline = story.find(class_='byline')
    if byline:
        print(byline.text)
        current['byline'] = byline.text.strip()
    print("-----------------")
    stories.append(current) #here's what is different
    print(current)

ISIS Proves an Elusive Target for America’s Cyberweapons
https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html
The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.
This is prompting officials to rethink how cyberwarfare techniques, first designed for fixed targets like nuclear facilities, must be refashioned.
By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET
-----------------
{'headline': 'ISIS Proves an Elusive Target for America’s Cyberweapons', 'url': 'https://www.nytimes.com/2017/06/12/world/middleeast/isis-cyber.html', 'summary': 'The effectiveness of cyberweapons hit its limits against an enemy that exploits the internet to recruit, spread propaganda and use encrypted communications, all of which can be quickly reconstituted.\nThis is prompting officials to rethink how cyberwarfare techniques, first designed for fixed target

**But this is still hard to use! It's a list of dictionaries. Let's make it into a dataframe.**
The dictionaries don't all have the same keys, so code that expects a key such as `byline` in every one can break. A dataframe gives every key its own column and fills the missing values with `NaN`, which show up as blank cells in a CSV.
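A minimal sketch of what pandas does with uneven dictionaries, using two made-up stories rather than the scraped ones:

```python
import pandas as pd

# Two made-up stories; the second has no byline or summary.
example_stories = [
    {'headline': 'First story', 'byline': 'By A REPORTER', 'summary': 'Some text.'},
    {'headline': 'Second story'},
]

example_df = pd.DataFrame(example_stories)

# Every key becomes a column; values that were missing are filled with NaN.
print(example_df)
print(example_df['byline'].isna().tolist())
```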

In [40]:
#we need pandas to do this!
import pandas as pd

In [42]:
#save it in a dataframe.
#dataframe name = pd.DataFrame(list_name)

df = pd.DataFrame(stories)
df.head()

|   | byline | headline | summary | url |
|---|---|---|---|---|
| 0 | By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET | ISIS Proves an Elusive Target for America’s Cy... | The effectiveness of cyberweapons hit its limi... | https://www.nytimes.com/2017/06/12/world/middl... |
| 1 | By AL BAKER 5:00 AM ET | Opioid Addicts Find an Ally in Blue | Police leaders are assigning themselves a big ... | https://www.nytimes.com/2017/06/12/nyregion/wh... |
| 2 |  | Addiction Drug Lacks Results, but It Has Power... |  | https://www.nytimes.com/2017/06/11/health/vivi... |
| 3 | By JIM RUTENBERG | Trump Era, Unlike Watergate Era, Has Rival Set... | Different versions of the Trump-Russia scandal... | https://www.nytimes.com/2017/06/11/business/me... |
| 4 |  | Democrats Call for Sessions’s Testimony to Be ... |  | https://www.nytimes.com/2017/06/11/us/politics... |


**Let's save this to a CSV** with the name "stories"


In [44]:
df.to_csv("stories.csv")
#save it to a CSV

In [46]:
stories_df = pd.read_csv("stories.csv")
stories_df.head()
#read it in

|   | Unnamed: 0 | byline | headline | summary | url |
|---|---|---|---|---|---|
| 0 | 0 | By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET | ISIS Proves an Elusive Target for America’s Cy... | The effectiveness of cyberweapons hit its limi... | https://www.nytimes.com/2017/06/12/world/middl... |
| 1 | 1 | By AL BAKER 5:00 AM ET | Opioid Addicts Find an Ally in Blue | Police leaders are assigning themselves a big ... | https://www.nytimes.com/2017/06/12/nyregion/wh... |
| 2 | 2 |  | Addiction Drug Lacks Results, but It Has Power... |  | https://www.nytimes.com/2017/06/11/health/vivi... |
| 3 | 3 | By JIM RUTENBERG | Trump Era, Unlike Watergate Era, Has Rival Set... | Different versions of the Trump-Russia scandal... | https://www.nytimes.com/2017/06/11/business/me... |
| 4 | 4 |  | Democrats Call for Sessions’s Testimony to Be ... |  | https://www.nytimes.com/2017/06/11/us/politics... |


In [47]:
# Ew, look at the extra index column; we don't want that. To get rid of it, leave the index out when saving:
# pass index=False when converting a df to a CSV.
df.to_csv("stories.csv", index=False)
stories_df = pd.read_csv("stories.csv")
stories_df.head()

|   | byline | headline | summary | url |
|---|---|---|---|---|
| 0 | By DAVID E. SANGER and ERIC SCHMITT 5:00 AM ET | ISIS Proves an Elusive Target for America’s Cy... | The effectiveness of cyberweapons hit its limi... | https://www.nytimes.com/2017/06/12/world/middl... |
| 1 | By AL BAKER 5:00 AM ET | Opioid Addicts Find an Ally in Blue | Police leaders are assigning themselves a big ... | https://www.nytimes.com/2017/06/12/nyregion/wh... |
| 2 |  | Addiction Drug Lacks Results, but It Has Power... |  | https://www.nytimes.com/2017/06/11/health/vivi... |
| 3 | By JIM RUTENBERG | Trump Era, Unlike Watergate Era, Has Rival Set... | Different versions of the Trump-Russia scandal... | https://www.nytimes.com/2017/06/11/business/me... |
| 4 |  | Democrats Call for Sessions’s Testimony to Be ... |  | https://www.nytimes.com/2017/06/11/us/politics... |
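To see exactly where the `Unnamed: 0` column comes from, here is a small sketch that does the same round trip on a made-up dataframe. It writes to in-memory buffers (`io.StringIO`) instead of `stories.csv`, which is a choice made only so the example leaves no files behind.

```python
import io
import pandas as pd

# Made-up dataframe standing in for the scraped stories.
small_df = pd.DataFrame([
    {'headline': 'First story', 'url': 'https://example.com/one'},
    {'headline': 'Second story', 'url': 'https://example.com/two'},
])

# Write to in-memory buffers instead of files so nothing is left on disk.
with_index = io.StringIO()
small_df.to_csv(with_index)                  # default: the index is written as a column
without_index = io.StringIO()
small_df.to_csv(without_index, index=False)  # leave the index out

with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_with)
print(cols_without)
```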


## This is a common workflow. Save it into a list of dictionaries, convert it to a DF and save it to a CSV.

### An error strikes!

But we get an error!

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-57-9218ec61124f> in <module>()
          4     print("This is a story")
          5     headline = story.find('h2', { 'class': 'story-heading' })
    ----> 6     print(headline.text)

    AttributeError: 'NoneType' object has no attribute 'text'

Hm, a story missing a headline? Let's look at it a little closer. We could do this in a classy way, but let's just brute force it by printing out every article just before the error line.

The error seems to happen with this one piece here:
    
    <h1 class="story-heading"><a href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html">Legendary New York Intellectuals Are His Ex-Friends</a></h1>
    <p class="summary">Norman Podhoretz, the former editor at Commentary magazine, looks back at the fierce, argumentative parties of New York’s intelligentsia.</p>
    <p class="byline">By JOHN LELAND </p>
    <p class="theme-comments">
    <a class="comments-link" href="https://www.nytimes.com/2017/03/17/nyregion/norman-podhoretz-still-picks-fights-and-drops-names.html?hp&amp;target=comments#commentsContainer"><i class="icon sprite-icon comments-icon"></i><span class="comment-count"> Comments</span></a>
    </p>
    </article>

Oh look, it uses an `h1` instead of an `h2`, but it's still a `story-heading`. Let's change our code to **look for a `story-heading` class regardless of tag name**.
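A sketch of that change on two made-up stories, one with an `h2` headline and one with an `h1`. Dropping the tag name from `.find` and searching by class alone picks up both.

```python
from bs4 import BeautifulSoup

# Made-up stories: one headline is an h2, the other an h1.
html = """
<article class="story"><h2 class="story-heading">An h2 headline</h2></article>
<article class="story"><h1 class="story-heading">An h1 headline</h1></article>
"""
page = BeautifulSoup(html, 'html.parser')

headlines = []
for story in page.find_all('article', class_='story'):
    by_tag = story.find('h2', {'class': 'story-heading'})  # None for the h1 story
    by_class = story.find(class_='story-heading')          # matches h1 and h2 alike
    print(by_tag is not None, by_class.text)
    headlines.append(by_class.text)
```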

Another error! Let's print out again.

It looks like it failed on this one. 

    <article class="story">
    <h3 class="kicker">
    <a href="http://wordplay.blogs.nytimes.com">Wordplay »</a>
    </h3>
    </article>

Now we have a choice to make: do we care about this? I... don't. If we want to skip through to the next element in a loop, we can use `continue`.

Let's say **hey, if you don't have a headline, we're going to skip you.**

Maybe we can also say hey, let's get rid of the whitespace on the headlines by using `.strip()`
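Both ideas together might look something like this. It's a sketch on made-up HTML; the middle article copies the kicker-only block shown above.

```python
from bs4 import BeautifulSoup

# Made-up page: the second article is the kicker-only block shown above.
html = """
<article class="story"><h2 class="story-heading">
    A headline with extra whitespace   </h2></article>
<article class="story"><h3 class="kicker">
<a href="http://wordplay.blogs.nytimes.com">Wordplay »</a></h3></article>
<article class="story"><h1 class="story-heading">Another headline</h1></article>
"""
page = BeautifulSoup(html, 'html.parser')

headlines = []
for story in page.find_all('article', class_='story'):
    headline = story.find(class_='story-heading')
    if headline is None:
        continue  # no headline: skip to the next story
    headlines.append(headline.text.strip())  # .strip() trims the surrounding whitespace
print(headlines)
```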

### Next step: Adding more pieces

Now we need to add in the links and the bylines. We'll start with the links by pulling in any `a` tags.

## Adding in bylines

Bylines look like this:

    <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17" data-eastern-timestamp="12:36 PM" data-utc-timestamp="1489768575">12:36 PM ET</time></p>
    
So... let's just grab the element inside of story that has the class of `byline`!

So we get another one of those "missing byline" errors, yeah? Well, maybe not everything has a byline. It doesn't mean we should skip the whole thing, let's just skip the byline for that one.
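One way to keep the story but skip a missing byline, sketched on made-up HTML (the first byline is the one from above with its `time` attributes shortened; the second story is invented):

```python
from bs4 import BeautifulSoup

# The first byline is copied from above (attributes shortened); the second story is made up.
html = """
<article class="story">
  <h2 class="story-heading">Story with a byline</h2>
  <p class="byline">By PETER BAKER and STEVEN ERLANGER <time class="timestamp" datetime="2017-03-17">12:36 PM ET</time></p>
</article>
<article class="story">
  <h2 class="story-heading">Story without a byline</h2>
</article>
"""
page = BeautifulSoup(html, 'html.parser')

rows = []
for story in page.find_all('article', class_='story'):
    row = {'headline': story.find(class_='story-heading').text.strip()}
    byline = story.find(class_='byline')
    if byline is not None:  # keep the story either way; add a byline only if present
        row['byline'] = byline.text.strip()
    rows.append(row)
print(rows)
```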

**Looking a lot better!** Now the only problem is that we get "By LOUIS LUCERO II 1:00 PM ET" instead of "By LOUIS LUCERO II" or, even better, "LOUIS LUCERO II".

## So I guess you better learn regular expressions, eh?
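As a taste of where that leads, here is a rough sketch of cleaning a byline with Python's `re` module. The pattern is an assumption based only on the bylines shown in this notebook (a leading "By" and an optional trailing time like "1:00 PM ET"), and `clean_byline` is a hypothetical helper name.

```python
import re

# Leading "By", then the names, then an optional timestamp such as "1:00 PM ET".
# This pattern is an assumption based on the bylines seen in this notebook.
BYLINE_PATTERN = re.compile(r'^By\s+(.*?)(?:\s+\d{1,2}:\d{2}\s+[AP]M\s+ET)?\s*$')

def clean_byline(byline):
    """Return just the author names, or the stripped input if the pattern does not match."""
    match = BYLINE_PATTERN.match(byline)
    return match.group(1) if match else byline.strip()

print(clean_byline('By LOUIS LUCERO II 1:00 PM ET'))
print(clean_byline('By JIM RUTENBERG'))
```

Bylines that don't start with "By" come back unchanged apart from the whitespace, so nothing gets thrown away by accident.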