# Scraping: http://www.bbc.co.uk/news

Let's try to scrape the front page of BBC News. We're looking for

* Headlines
* Summary
* Article link

## Getting started

We'll start by **importing the necessary libraries**.

In [2]:
import requests
from bs4 import BeautifulSoup

And then move into **downloading the page** and **importing it into BeautifulSoup**.

In [3]:
response = requests.get('http://www.bbc.co.uk/news')
doc = BeautifulSoup(response.text, 'html.parser')

A lot of people call the analyzed page variable `soup` but for once in my life I actually go against the popular thing - I like to call it `doc`, since it helps me remember that it's the *entire document*.

## ATTEMPT ONE: Grabbing the tags directly

If we open the page and try to use the little arrow-selecty-thing to pick up the headlines, **disaster strikes**. We can't touch them! Apparently it selects the ENTIRE BLOCK or something crazy like that?

But luckily we understand HTML, so we can click around in the right-hand Elements panel. We navigate to the `h3` tag, which we know is the headline based on the tag name and the content.

Hm, what if we just grab all of the `h3` tags?

In [4]:
headlines = doc.find_all('h3')

for headline in headlines:
    print(headline.text)

US student freed by North Korea 'in coma'
UK's Theresa May closing in on deal to govern
Uber chief to take leave from company
London could lose EU euro clearing role
New drug creates 'real sun-tan'
Daredevil scales tower without harness
'My son was in collapsed Kenya building'
Food poisoning hits hundreds in Iraq camp
Cristiano Ronaldo accused of tax evasion
Deliberating Cosby jury revisits evidence
Jeff Sessions faces grilling over Russia
Deliberating Cosby jury revisits evidence
Jeff Sessions faces grilling over Russia
Hungary passes strict anti-foreign NGO law
Two US prison guards killed during escape
Officer shot in head at Munich station
'I fell in love with a man with dwarfism'
BBC World News TV
BBC World Service Radio
The women refugees helping each other build a new life
Trump cabinet takes turns to praise him
Inside Michael Palin's Monty Python diaries
Gamer gear: Revolutionary or overhyped?
Antarctic mystery painting linked to Scott
Travis Kalanick's rollercoaster reign at Ub

SO EASY, right? Kind of? Mostly it worked? ...except it doesn't have the link, nor does it have the summary.

Okay, so we could also get all of the `a` tags, but there are probably a lot of garbage `a` tags - footer content and stuff. Maybe the article `a` tags have a special class? If we take a look, we see `class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold"`. This isn't just one class, it's **many classes**.

* `gs-c-promo-heading`
* `nw-o-link-split__anchor`
* `gs-o-faux-block-link__overlay-link`
* `gel-pica-bold`

This is where guesswork comes in. I think `gs-c-promo-heading` seems reasonable!
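Guessing a single class works because BeautifulSoup treats `class` as a list of values and matches a tag if *any one* of them fits. Here's a minimal sketch of that on some made-up HTML (the `footer-link` class and the `demo` variable are invented for illustration, they're not from the BBC page):

```python
from bs4 import BeautifulSoup

# Made-up HTML: one link with several classes, one unrelated footer link
html = """
<a class="gs-c-promo-heading gel-pica-bold" href="/news/story-1">A story</a>
<a class="footer-link" href="/about">About</a>
"""
demo = BeautifulSoup(html, 'html.parser')

# Searching for ONE class still matches the tag that has several classes
matches = demo.find_all('a', {'class': 'gs-c-promo-heading'})
print([a.text for a in matches])

# class_ is a keyword form of the same search
same = demo.find_all('a', class_='gs-c-promo-heading')
print(len(same))
```

The footer link gets skipped because none of its classes match.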

In [5]:
links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)

US student freed by North Korea 'in coma'
UK's Theresa May closing in on deal to govern
Uber chief to take leave from company
London could lose EU euro clearing role
New drug creates 'real sun-tan'
VideoDaredevil scales tower without harness
Video'My son was in collapsed Kenya building'
Food poisoning hits hundreds in Iraq camp
Cristiano Ronaldo accused of tax evasion
Deliberating Cosby jury revisits evidence
Jeff Sessions faces grilling over Russia
Deliberating Cosby jury revisits evidence
Jeff Sessions faces grilling over Russia
Hungary passes strict anti-foreign NGO law
Two US prison guards killed during escape
Officer shot in head at Munich station
'I fell in love with a man with dwarfism'
BBC World News TV
AudioBBC World Service Radio
The women refugees helping each other build a new life
VideoTrump cabinet takes turns to praise him
VideoInside Michael Palin's Monty Python diaries
VideoGamer gear: Revolutionary or overhyped?
Antarctic mystery painting linked to Scott
VideoDaredevi

That looks pretty good, too! It's getting the `h3` text because the `h3` is inside of the `a` tag, but it doesn't have the *actual link*, the URL. If we look at the `a` tag...

    <a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">
   
...the URL is hiding in the `href` attribute. Once we have the link, getting an attribute is easy: you just use `['href']`.

In [6]:
links = doc.find_all('a', { 'class': 'gs-c-promo-heading' })

for link in links:
    print(link.text)
    print(link['href'])

US student freed by North Korea 'in coma'
/news/world-asia-pacific-40264468
UK's Theresa May closing in on deal to govern
/news/election-2017-40255958
Uber chief to take leave from company
/news/business-40264376
London could lose EU euro clearing role
/news/business-40264755
New drug creates 'real sun-tan'
/news/health-40260029
VideoDaredevil scales tower without harness
/news/world-europe-40260134
Video'My son was in collapsed Kenya building'
/news/world-africa-40258686
Food poisoning hits hundreds in Iraq camp
/news/world-middle-east-40257385
Cristiano Ronaldo accused of tax evasion
/news/world-europe-40260517
Deliberating Cosby jury revisits evidence
/news/world-us-canada-40264915
Jeff Sessions faces grilling over Russia
/news/world-us-canada-40260670
Deliberating Cosby jury revisits evidence
/news/world-us-canada-40264915
Jeff Sessions faces grilling over Russia
/news/world-us-canada-40260670
Hungary passes strict anti-foreign NGO law
/news/world-europe-40258922
Two US prison guar
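Notice that these URLs are *relative*: they start with `/news/...` and leave off the site. If you want full URLs, one option (not something we do in the rest of this walkthrough) is `urljoin` from Python's standard library. A small sketch, using one of the links printed above:

```python
from urllib.parse import urljoin

# The page we scraped, and a relative link taken from the output above
base_url = 'http://www.bbc.co.uk/news'
relative = '/news/business-40264376'

# urljoin resolves the relative link against the page it came from
full_url = urljoin(base_url, relative)
print(full_url)
```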

Cool, eh? But now we have one final problem: **we don't have the summaries**. Well, we can just use the Inspector to pick one out...

    <p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">Trade and Nato are high on the agenda as the much-anticipated Washington talks begin.</p>

Once again, we have a selection of options. `gs-c-promo-summary` seems promising.

In [7]:
summaries = doc.find_all('p', { 'class': 'gs-c-promo-summary' })

for summary in summaries:
    print(summary.text)

The US says Otto Warmbier is on his way home; his parents say he has been in a coma for a year.
The BBC understands a deal with the DUP has been largely agreed and there are no outstanding issues.
The decision comes after an internal review of the firm's corporate culture.
The European Union reveals plans to keep the lucrative industry in the EU after Brexit happens.
The drug mimics sunlight to make the skin produce the brown form of the pigment melanin.
Alain Robert, the French Spiderman, took an unusual route to the top of a Barcelona hotel.
The BBC's Anne Soy reports from Nairobi, where a seven-storey building has collapsed.
Iraqis fleeing the battle of Mosul fall ill from food poisoning after a Ramadan meal.
The Real Madrid footballer faces a lawsuit for allegedly defrauding Spain of millions.
The jury asks to hear Mr Cosby's testimony from a 2005 civil case as they consider their verdict.
The US attorney general faces a hearing over his Russia ties. Here are five questions he can 

Great, but **now we're stuck:** we don't have a way of combining the headlines and links to the summaries, and even if we did (cough`zip`cough), we couldn't be sure that they'd match up.
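To see why `zip` can't be trusted here, imagine one story in the middle has no summary. This sketch uses made-up headlines and summaries (not real scraped data) to show how everything after the gap gets paired with the wrong thing:

```python
# Made-up data: three headlines, but 'Story B' has no summary on the page
demo_headlines = ['Story A', 'Story B', 'Story C']
demo_summaries = ['Summary of A', 'Summary of C']

# zip pairs items by position and stops at the shorter list
pairs = list(zip(demo_headlines, demo_summaries))
for headline, summary in pairs:
    print(headline, '->', summary)
```

Story B ends up with Story C's summary, and Story C silently disappears.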

What the heck do we do now?

## ATTEMPT TWO: Parent elements

When you're just grabbing one element - a link and the text inside, or a list of headlines - you are only interested in the element you're looking at. Sometimes, though, **you need to scrape multiple elements at the same time.** When this happens, you need to look at what they all have in common.

If we look at a summary, a link and a title, we might find something like the following. **It's a trainwreck, but it's what we want.**

	<div class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex" data-entityid="container-top-stories#3">
		<div class="gs-c-promo-image gs-u-display-none gs-u-display-inline-block@xs gel-1/2@xs gel-1/1@m">
			<div class="gs-o-media-island">
				<div class="gs-o-responsive-image gs-o-responsive-image--16by9"></div>
			</div>
		</div>
		<div class="gs-c-promo-body gel-1/2@xs gel-1/1@m gs-u-mt@m">
			<div>
				<a class="gs-c-promo-heading nw-o-link-split__anchor gs-o-faux-block-link__overlay-link gel-pica-bold" href="/news/world-middle-east-39302560">
				<h3 class="gs-c-promo-heading__title gel-pica-bold nw-o-link-split__text">Attack on Yemen migrant boat kills 42</h3></a>
				<p class="gs-c-promo-summary gel-long-primer gs-u-mt nw-c-promo-summary">It is unclear who was behind a helicopter attack which killed 42 refugees and injured 80.</p>
			</div>
			<ul class="gs-o-list-inline gs-o-list-inline--divided gel-brevier gs-u-mt-">
				<li><span class="gs-c-timestamp gs-o-bullet gs-o-bullet- nw-c-timestamp"><span class="gs-o-bullet__icon gel-icon"><svg viewbox="0 0 32 32">
				<polygon points="17,15.4 17,6 15,6 15,16.6 23.8,21.7 24.8,19.9"></polygon>
				<path d="M16,4c6.6,0,12,5.4,12,12c0,6.6-5.4,12-12,12S4,22.6,4,16C4,9.4,9.4,4,16,4 M16,0C7.2,0,0,7.2,0,16c0,8.8,7.2,16,16,16 s16-7.2,16-16C32,7.2,24.8,0,16,0L16,0z"></path></svg></span><time class="gs-o-bullet__text date qa-status-date relative-time" data-datetime="1h" data-seconds="1489768430" data-timestamp-inserted="true" datetime="2017-03-17T16:33:50.000Z">48 minutes ago</time></span></li>
				<li>
					<a aria-label="From Middle East" class="gs-c-section-link gs-c-section-link--truncate nw-c-section-link nw-o-link nw-o-link--no-visited-state" href="/news/world/middle_east"><span aria-hidden="true">Middle East</span></a>
				</li>
			</ul>
		</div>
	</div>

The very top part is the **parent element**; all of the other elements are inside of it. In order to scrape them all together, we need to grab each parent (each *story*) and then grab the parts inside of it (the headline, link, summary, etc).
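The key idea is that `.find` works on any element, not just the whole document, and it only searches *inside* whatever you call it on. A minimal sketch on made-up HTML (the `story` class and `demo` variable are invented for this example):

```python
from bs4 import BeautifulSoup

# Made-up HTML: two parent elements, each with its own headline and summary
html = """
<div class="story"><h3>First headline</h3><p>First summary</p></div>
<div class="story"><h3>Second headline</h3><p>Second summary</p></div>
"""
demo = BeautifulSoup(html, 'html.parser')

# Searching the whole document always gives the first h3 on the page
print(demo.find('h3').text)

# Searching inside each parent gives that parent's own pieces, kept together
rows = []
for parent in demo.find_all('div', {'class': 'story'}):
    rows.append((parent.find('h3').text, parent.find('p').text))
print(rows)
```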

The parent's class is `class="gs-c-promo nw-c-promo gs-o-faux-block-link gs-u-pb gs-u-pb+@m gs-c-promo--inline gs-c-promo--stacked@m nw-u-w-auto gs-c-promo--flex"`, which would be terrifying except that we've hit on a theme and suspect `gs-c-promo` might be what we're looking for.

In [17]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print(story.text)

US student freed by North Korea 'in coma'The US says Otto Warmbier is on his way home; his parents say he has been in a coma for a year.1han hour agoAsiaRelated contentNorth Korea: Secretive stateHow might Trump do a deal with N Korea?What's changed between US and N Korea
UK's Theresa May closing in on deal to governThe BBC understands a deal with the DUP has been largely agreed and there are no outstanding issues.2h2 hours agoElection 2017
Uber chief to take leave from companyThe decision comes after an internal review of the firm's corporate culture.16m16 minutes agoBusiness
London could lose EU euro clearing roleThe European Union reveals plans to keep the lucrative industry in the EU after Brexit happens.1han hour agoBusiness
New drug creates 'real sun-tan'The drug mimics sunlight to make the skin produce the brown form of the pigment melanin.2h2 hours agoHealth
VideoVideoDaredevil scales tower without harnessAlain Robert, the French Spiderman, took an unusual route to the top of a

So... kind of?

We apparently can't use `.text` because it's going to take *all* of the text inside: the headline *and* the summary and everything else, smushed together. What we need to do instead is

* STEP ONE: Use the doc to get the story
* STEP TWO: Use the story to get the headline
* STEP THREE: Use the story to get the link
* STEP FOUR: Use the story to get the summary
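As an aside, if you only want to eyeball what's inside a parent element, `get_text` takes a separator, which makes the smushed-together output readable. It's an inspection aid, not a replacement for the steps above. A small sketch on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up story block with a headline, a summary and a timestamp
html = '<div class="story"><h3>A headline</h3><p>A summary.</p><span>1h</span></div>'
demo = BeautifulSoup(html, 'html.parser')
demo_story = demo.find('div')

# .text glues every piece of text together with nothing in between
smushed = demo_story.text
print(smushed)

# get_text with a separator shows where one piece ends and the next begins
separated = demo_story.get_text(' | ', strip=True)
print(separated)
```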

### STEP ONE: Use the doc to get the story

In [18]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("This is a story")

This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story
This is a story


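### STEP TWO and STEP THREE: Use the story to get the headline and the link

Instead of searching the whole `doc`, we use `story.find` to grab the `h3` for the headline. Then we can do the same thing to find the link, and use `['href']` to grab the link URL.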

In [19]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])

THIS IS A STORY
US student freed by North Korea 'in coma'
/news/world-asia-pacific-40264468
THIS IS A STORY
UK's Theresa May closing in on deal to govern
/news/election-2017-40255958
THIS IS A STORY
Uber chief to take leave from company
/news/business-40264376
THIS IS A STORY
London could lose EU euro clearing role
/news/business-40264755
THIS IS A STORY
New drug creates 'real sun-tan'
/news/health-40260029
THIS IS A STORY
Daredevil scales tower without harness
/news/world-europe-40260134
THIS IS A STORY
'My son was in collapsed Kenya building'
/news/world-africa-40258686
THIS IS A STORY
Food poisoning hits hundreds in Iraq camp
/news/world-middle-east-40257385
THIS IS A STORY
Cristiano Ronaldo accused of tax evasion
/news/world-europe-40260517
THIS IS A STORY
Deliberating Cosby jury revisits evidence
/news/world-us-canada-40264915
THIS IS A STORY
Jeff Sessions faces grilling over Russia
/news/world-us-canada-40260670
THIS IS A STORY
Deliberating Cosby jury revisits evidence
/news/worl

### STEP FOUR: Use the story to get the summary

Same thing again! This time we're looking for a `p`.

In [20]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    print(summary.text)

THIS IS A STORY
US student freed by North Korea 'in coma'
/news/world-asia-pacific-40264468
The US says Otto Warmbier is on his way home; his parents say he has been in a coma for a year.
THIS IS A STORY
UK's Theresa May closing in on deal to govern
/news/election-2017-40255958
The BBC understands a deal with the DUP has been largely agreed and there are no outstanding issues.
THIS IS A STORY
Uber chief to take leave from company
/news/business-40264376
The decision comes after an internal review of the firm's corporate culture.
THIS IS A STORY
London could lose EU euro clearing role
/news/business-40264755
The European Union reveals plans to keep the lucrative industry in the EU after Brexit happens.
THIS IS A STORY
New drug creates 'real sun-tan'
/news/health-40260029
The drug mimics sunlight to make the skin produce the brown form of the pigment melanin.
THIS IS A STORY
Daredevil scales tower without harness
/news/world-europe-40260134
Alain Robert, the French Spiderman, took an unu

AttributeError: 'NoneType' object has no attribute 'text'

### Missing elements

Oh god, an error! If you weren't paying attention, the error is

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-20-e55795264040> in <module>()
          7     print(link['href'])
          8     summary = story.find('p')
    ----> 9     print(summary.text)

    AttributeError: 'NoneType' object has no attribute 'text'

Since it showed up after we added in the `summary` part, I'm going to assume this is an issue because **not every story has a summary**. When `.find` can't find anything it gives back `None`, and `None` doesn't have a `.text`. How do we get around it?

Well, just *ask if it has a summary*. If it does, you can use it. If it doesn't, ignore it. **It's just a simple `if` statement**.

In [21]:
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    print("THIS IS A STORY")
    headline = story.find('h3')
    print(headline.text)
    link = story.find('a')
    print(link['href'])
    summary = story.find('p')
    if summary:
        print(summary.text)

THIS IS A STORY
US student freed by North Korea 'in coma'
/news/world-asia-pacific-40264468
The US says Otto Warmbier is on his way home; his parents say he has been in a coma for a year.
THIS IS A STORY
UK's Theresa May closing in on deal to govern
/news/election-2017-40255958
The BBC understands a deal with the DUP has been largely agreed and there are no outstanding issues.
THIS IS A STORY
Uber chief to take leave from company
/news/business-40264376
The decision comes after an internal review of the firm's corporate culture.
THIS IS A STORY
London could lose EU euro clearing role
/news/business-40264755
The European Union reveals plans to keep the lucrative industry in the EU after Brexit happens.
THIS IS A STORY
New drug creates 'real sun-tan'
/news/health-40260029
The drug mimics sunlight to make the skin produce the brown form of the pigment melanin.
THIS IS A STORY
Daredevil scales tower without harness
/news/world-europe-40260134
Alain Robert, the French Spiderman, took an unu
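If you'd rather always end up with *something* for the summary, even an empty string, the same check can be squeezed onto one line with a conditional expression. This is just an alternative way of writing the `if`, sketched on made-up HTML (the `story` class and variable names are invented):

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second story has no <p> summary
html = """
<div class="story"><h3>Has summary</h3><p>Here it is.</p></div>
<div class="story"><h3>No summary</h3></div>
"""
demo = BeautifulSoup(html, 'html.parser')

results = []
for demo_story in demo.find_all('div', {'class': 'story'}):
    summary = demo_story.find('p')
    # find() returns None when nothing matches, and None counts as False
    summary_text = summary.text if summary else ''
    results.append(summary_text)

print(results)
```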

## Turning it into a CSV

Now that we have all of our elements, we can turn them into a CSV. There are four steps to building the CSV:

1. **Start with an empty list:** Each story we find gets added to the list
2. **Build a dictionary** for each story element
3. **Convert the list to a DataFrame**, and then
4. **Export the DataFrame to a CSV**

The dictionary-building part can be complicated, so let's look at **two different ways of doing it**.

### Method One: All at once

For this method, we'll make our `story_dict` all at once, then add it to the `stories_list`.

In [25]:
# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    headline = story.find('h3')
    link = story.find('a')
    summary = story.find('p')
    # Does our story have a summary?
    if summary:
        # Build a dict that HAS a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
            'summary': summary.text
        }
    else:
        # Build a dict that does NOT have a summary
        story_dict = {
            'headline': headline.text,
            'url': link['href'],
        }    
    # Add the dict to our list
    stories_list.append(story_dict)

print(stories_list)

# Now that we're done, convert to a CSV and save.
# If you don't use index=False, the CSV gets an extra, ugly index column!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("bbc.csv", index=False)

[{'headline': "US student freed by North Korea 'in coma'", 'url': '/news/world-asia-pacific-40264468', 'summary': 'The US says Otto Warmbier is on his way home; his parents say he has been in a coma for a year.'}, {'headline': "UK's Theresa May closing in on deal to govern", 'url': '/news/election-2017-40255958', 'summary': 'The BBC understands a deal with the DUP has been largely agreed and there are no outstanding issues.'}, {'headline': 'Uber chief to take leave from company', 'url': '/news/business-40264376', 'summary': "The decision comes after an internal review of the firm's corporate culture."}, {'headline': 'London could lose EU euro clearing role', 'url': '/news/business-40264755', 'summary': 'The European Union reveals plans to keep the lucrative industry in the EU after Brexit happens.'}, {'headline': "New drug creates 'real sun-tan'", 'url': '/news/health-40260029', 'summary': 'The drug mimics sunlight to make the skin produce the brown form of the pigment melanin.'}, {'he

### Method Two: Filling in the blanks

For this method, we'll make our `story_dict` in the beginning, then fill in any pieces that exist.

In [26]:
# Start with an empty list
stories_list = []
stories = doc.find_all('div', { 'class': 'gs-c-promo' })
for story in stories:
    # Create a dictionary without anything in it
    story_dict = {}
    headline = story.find('h3')
    if headline:
        story_dict['headline'] = headline.text
    link = story.find('a')
    if link:
        story_dict['url'] = link['href']
    summary = story.find('p')
    if summary:
        story_dict['summary'] = summary.text
    # Add the dict to our list
    stories_list.append(story_dict)
    
# Now that we're done, convert to a CSV and save
# If you don't use index=False, the CSV gets an extra, ugly index column!
import pandas as pd
df = pd.DataFrame(stories_list)
df.to_csv("bbc.csv", index=False)
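Both methods can produce dictionaries that are missing a `summary` key, which might look like a problem. It isn't: pandas fills in the gaps with `NaN`. A minimal sketch with two made-up story dictionaries (not scraped data):

```python
import pandas as pd

# Made-up stories: the second dictionary has no 'summary' key at all
demo_list = [
    {'headline': 'Has summary', 'url': '/one', 'summary': 'Here it is.'},
    {'headline': 'No summary', 'url': '/two'},
]
demo_df = pd.DataFrame(demo_list)

# The missing key becomes a missing value (NaN) in that row
print(demo_df['summary'].isna().tolist())

# With no filename, to_csv returns the CSV text instead of saving a file
csv_text = demo_df.to_csv(index=False)
print(csv_text)
```

In the CSV, that missing value is written as an empty cell.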

In [28]:
df.head()

| | headline | summary | url |
|---|---|---|---|
| 0 | US student freed by North Korea 'in coma' | The US says Otto Warmbier is on his way home; ... | /news/world-asia-pacific-40264468 |
| 1 | UK's Theresa May closing in on deal to govern | The BBC understands a deal with the DUP has be... | /news/election-2017-40255958 |
| 2 | Uber chief to take leave from company | The decision comes after an internal review of... | /news/business-40264376 |
| 3 | London could lose EU euro clearing role | The European Union reveals plans to keep the l... | /news/business-40264755 |
| 4 | New drug creates 'real sun-tan' | The drug mimics sunlight to make the skin prod... | /news/health-40260029 |
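If you don't have pandas handy, the list-of-dictionaries shape also works with Python's built-in `csv` module. This is a swapped-in alternative, not what the notebook above does. The sketch writes to an in-memory buffer and uses made-up stories; `restval=''` tells `DictWriter` what to write when a dictionary is missing a key:

```python
import csv
import io

# Made-up stories: the second one has no 'summary' key
demo_list = [
    {'headline': 'Has summary', 'url': '/one', 'summary': 'Here it is.'},
    {'headline': 'No summary', 'url': '/two'},
]

# Write into an in-memory buffer; swap in open('bbc.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['headline', 'url', 'summary'], restval='')
writer.writeheader()
writer.writerows(demo_list)

csv_text = buffer.getvalue()
print(csv_text)
```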
