# 🌐 Scraping, Part 4: Practice, practice, practice!

*Let's scrape some dataaaaaaa!*

## Let's start with NPR's "lite" homepage

It's https://text.npr.org/

Open it in your browser. View the raw HTML, and also practice popping open the element inspector.

## Q: What do you see? What would you want to extract from it?

Let's load the HTML in Python. Remember how?

In [1]:
import requests

In [2]:
html = requests.get("https://text.npr.org/").text
print(html[:300])

<!DOCTYPE html>
<html lang="en">
<head>
    <title>NPR : National Public Radio</title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <meta name="viewport" content="width=device-width">
    <link id="favicon" rel="shortcut icon" type="image/png" href="data:image/png;base6
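
In a longer-running scraper, it can help to fail loudly when a request goes wrong instead of quietly parsing an error page. Here's a minimal sketch of that idea. `fetch_html` is a hypothetical helper name (it isn't part of `requests`), and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML at `url` as a string (hypothetical helper)."""
    # Give up if the server takes longer than `timeout` seconds to respond
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://text.npr.org/")` should give you the same kind of string as `html` above.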


Now let's parse it with `BeautifulSoup`. Remember how?

In [3]:
from bs4 import BeautifulSoup

In [4]:
# Name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")

## Q: How many links on the page?

In [5]:
links = soup.select("a")
len(links)

28

## Q: How to print the text of each link?

In [6]:
for link in links:
    print(link.text.strip())
    print("---")

Go To Full Site
---
Millions of Americans are under heat advisories as high temperatures roast the U.S.
---
Heat wave safety tips from the world's first chief heat officer
---
Are Biden's immigration moves at odds? Homeland Security chief says they can coexist
---
Louisiana will require the 10 Commandments displayed in every public school classroom
---
Tropical Storm Alberto, the first of the season, drenches Texas en route to Mexico
---
A supermarket trip may soon look different, thanks to electronic shelf labels
---
Russia and North Korea vow stronger partnership against the West with new treaty
---
Appreciating our enslaved ancestors despite the relics of the Confederacy
---
Wildfires blaze across New Mexico and California, prompting evacuations
---
A Russian court has sentenced a U.S. soldier to nearly 4 years in prison
---
Willie Mays - the 'Say Hey Kid' considered baseball's best all-around player - dies at 93
---
Mavis Staples: Tiny Desk Concert
---
Taylor Swift and Post Malone 
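
One wrinkle worth knowing: when a link contains nested tags and line breaks, `.text.strip()` only trims the outer whitespace. Here's a small sketch on a made-up HTML snippet (not NPR's actual markup) comparing it with `get_text()`:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; NPR's real markup may differ
sample_html = """<a href="/story-1">
  <b>Breaking:</b>
  <span>heat wave</span>
</a>"""
sample_link = BeautifulSoup(sample_html, "html.parser").select_one("a")

# .text keeps the inner line break and indentation; .strip() only trims the ends
print(repr(sample_link.text.strip()))

# get_text() with a separator and strip=True trims each piece, then joins them
print(repr(sample_link.get_text(" ", strip=True)))
```

For simple links like the ones on this page, the two approaches should give the same result.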

## Q: What if we want *just* the links to articles?

In [7]:
article_links = soup.select("main a")
len(article_links)

20

In [8]:
for link in article_links:
    print(link.text.strip())
    print("---")

Millions of Americans are under heat advisories as high temperatures roast the U.S.
---
Heat wave safety tips from the world's first chief heat officer
---
Are Biden's immigration moves at odds? Homeland Security chief says they can coexist
---
Louisiana will require the 10 Commandments displayed in every public school classroom
---
Tropical Storm Alberto, the first of the season, drenches Texas en route to Mexico
---
A supermarket trip may soon look different, thanks to electronic shelf labels
---
Russia and North Korea vow stronger partnership against the West with new treaty
---
Appreciating our enslaved ancestors despite the relics of the Confederacy
---
Wildfires blaze across New Mexico and California, prompting evacuations
---
A Russian court has sentenced a U.S. soldier to nearly 4 years in prison
---
Willie Mays - the 'Say Hey Kid' considered baseball's best all-around player - dies at 93
---
Mavis Staples: Tiny Desk Concert
---
Taylor Swift and Post Malone top the charts again

## Q: What are other selectors you could have used, besides `main a`?

In [9]:
len(soup.select(".topic-title"))

20

In [10]:
len(soup.select(".topic-container a"))

20
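
To see why those three selectors can return the same elements, here's a sketch on a tiny, made-up page. The structure is an assumption loosely modeled on the class names above, not a copy of NPR's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one navigation link outside <main>, two article links inside
sample_html = """
<header><a href="https://www.npr.org">Go To Full Site</a></header>
<main>
  <div class="topic-container">
    <ul>
      <li><a class="topic-title" href="/story-1">First headline</a></li>
      <li><a class="topic-title" href="/story-2">Second headline</a></li>
    </ul>
  </div>
</main>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# "a" matches every link; the other selectors only match links in the article list
for selector in ["a", "main a", ".topic-title", ".topic-container a"]:
    print(selector, "->", len(sample_soup.select(selector)))
```

Which selector to prefer is a judgment call: tag-based ones like `main a` are short, while class-based ones may hold up better if the page layout shifts around.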

## Q: How to get the link to each article?

In [11]:
for link in article_links:
    print(link["href"])
    print("---")

/nx-s1-5012304
---
/nx-s1-5008872
---
/nx-s1-5009971
---
/nx-s1-5012597
---
/nx-s1-5011971
---
/nx-s1-5009271
---
/nx-s1-5011768
---
/g-s1-4807
---
/g-s1-5147
---
/g-s1-5159
---
/530056425
---
/1234569831
---
/nx-s1-5006116
---
/nx-s1-5009314
---
/nx-s1-5010219
---
/g-s1-5022
---
/g-s1-4916
---
/nx-s1-5011566
---
/g-s1-5154
---
/nx-s1-4950096
---
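
`link["href"]` raises a `KeyError` if an `<a>` tag has no `href` attribute. That didn't happen here, but if it does on another page, one option is `.get("href")`, sketched below on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up HTML: the second <a> has no href attribute
sample_html = '<a href="/story-1">Has a link</a> <a name="top">No link</a>'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# .get() returns None instead of raising KeyError when the attribute is missing
hrefs = [link.get("href") for link in sample_soup.select("a")]
print(hrefs)

# Alternatively, the a[href] selector only matches links that have an href
hrefs_only = [link["href"] for link in sample_soup.select("a[href]")]
print(hrefs_only)
```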


## Q: How to make it into a "real" link?

In [12]:
for link in article_links:
    print("https://text.npr.org" + link["href"])
    print("---")

https://text.npr.org/nx-s1-5012304
---
https://text.npr.org/nx-s1-5008872
---
https://text.npr.org/nx-s1-5009971
---
https://text.npr.org/nx-s1-5012597
---
https://text.npr.org/nx-s1-5011971
---
https://text.npr.org/nx-s1-5009271
---
https://text.npr.org/nx-s1-5011768
---
https://text.npr.org/g-s1-4807
---
https://text.npr.org/g-s1-5147
---
https://text.npr.org/g-s1-5159
---
https://text.npr.org/530056425
---
https://text.npr.org/1234569831
---
https://text.npr.org/nx-s1-5006116
---
https://text.npr.org/nx-s1-5009314
---
https://text.npr.org/nx-s1-5010219
---
https://text.npr.org/g-s1-5022
---
https://text.npr.org/g-s1-4916
---
https://text.npr.org/nx-s1-5011566
---
https://text.npr.org/g-s1-5154
---
https://text.npr.org/nx-s1-4950096
---
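
String concatenation works here because every `href` is a path starting with `/`. A more forgiving option is `urljoin` from Python's standard library, which also copes with links that are already absolute. A sketch, using one path from the output above plus one made-up absolute URL:

```python
from urllib.parse import urljoin

base_url = "https://text.npr.org/"

# A relative path, like the hrefs on the page
relative_result = urljoin(base_url, "/nx-s1-5012304")
print(relative_result)

# An already-absolute URL (hypothetical example) is left unchanged
absolute_result = urljoin(base_url, "https://www.npr.org/about")
print(absolute_result)
```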


## Exercise: `pandas` refresher

How would you make a `pandas` `DataFrame` representing each link's text and URL?

(You can forget, for now, about what paragraph the link is in.)

In [13]:
import pandas as pd

In [14]:
article_links_df = pd.DataFrame([ {
    "text": link.text.strip(),
    "url": "https://text.npr.org" + link["href"]
} for link in article_links ])

article_links_df

| | text | url |
|---|---|---|
| 0 | Millions of Americans are under heat advisorie... | https://text.npr.org/nx-s1-5012304 |
| 1 | Heat wave safety tips from the world's first c... | https://text.npr.org/nx-s1-5008872 |
| 2 | Are Biden's immigration moves at odds? Homelan... | https://text.npr.org/nx-s1-5009971 |
| 3 | Louisiana will require the 10 Commandments dis... | https://text.npr.org/nx-s1-5012597 |
| 4 | Tropical Storm Alberto, the first of the seaso... | https://text.npr.org/nx-s1-5011971 |
| 5 | A supermarket trip may soon look different, th... | https://text.npr.org/nx-s1-5009271 |
| 6 | Russia and North Korea vow stronger partnershi... | https://text.npr.org/nx-s1-5011768 |
| 7 | Appreciating our enslaved ancestors despite th... | https://text.npr.org/g-s1-4807 |
| 8 | Wildfires blaze across New Mexico and Californ... | https://text.npr.org/g-s1-5147 |
| 9 | A Russian court has sentenced a U.S. soldier t... | https://text.npr.org/g-s1-5159 |


If [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) aren't your cup of tea, here's another way we could have done that:

In [15]:
article_links_list = []

for link in article_links:
    item_data = {
        "text": link.text.strip(),
        "url": "https://text.npr.org" + link["href"]
    }
    article_links_list.append(item_data)

article_links_list

[{'text': 'Millions of Americans are under heat advisories as high temperatures roast the U.S.',
  'url': 'https://text.npr.org/nx-s1-5012304'},
 {'text': "Heat wave safety tips from the world's first chief heat officer",
  'url': 'https://text.npr.org/nx-s1-5008872'},
 {'text': "Are Biden's immigration moves at odds? Homeland Security chief says they can coexist",
  'url': 'https://text.npr.org/nx-s1-5009971'},
 {'text': 'Louisiana will require the 10 Commandments displayed in every public school classroom',
  'url': 'https://text.npr.org/nx-s1-5012597'},
 {'text': 'Tropical Storm Alberto, the first of the season, drenches Texas en route to Mexico',
  'url': 'https://text.npr.org/nx-s1-5011971'},
 {'text': 'A supermarket trip may soon look different, thanks to electronic shelf labels',
  'url': 'https://text.npr.org/nx-s1-5009271'},
 {'text': 'Russia and North Korea vow stronger partnership against the West with new treaty',
  'url': 'https://text.npr.org/nx-s1-5011768'},
 {'text': 'A

In [16]:
article_links_df = pd.DataFrame(article_links_list)
article_links_df

| | text | url |
|---|---|---|
| 0 | Millions of Americans are under heat advisorie... | https://text.npr.org/nx-s1-5012304 |
| 1 | Heat wave safety tips from the world's first c... | https://text.npr.org/nx-s1-5008872 |
| 2 | Are Biden's immigration moves at odds? Homelan... | https://text.npr.org/nx-s1-5009971 |
| 3 | Louisiana will require the 10 Commandments dis... | https://text.npr.org/nx-s1-5012597 |
| 4 | Tropical Storm Alberto, the first of the seaso... | https://text.npr.org/nx-s1-5011971 |
| 5 | A supermarket trip may soon look different, th... | https://text.npr.org/nx-s1-5009271 |
| 6 | Russia and North Korea vow stronger partnershi... | https://text.npr.org/nx-s1-5011768 |
| 7 | Appreciating our enslaved ancestors despite th... | https://text.npr.org/g-s1-4807 |
| 8 | Wildfires blaze across New Mexico and Californ... | https://text.npr.org/g-s1-5147 |
| 9 | A Russian court has sentenced a U.S. soldier t... | https://text.npr.org/g-s1-5159 |
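
Putting the pieces together, the whole workflow could be wrapped in one function. This is only a sketch: `links_to_dataframe` is a hypothetical name, and the sample HTML is made up so the example runs without hitting NPR's site.

```python
import pandas as pd
from bs4 import BeautifulSoup


def links_to_dataframe(html, base_url, selector="main a"):
    """Build a DataFrame of link text and absolute URLs (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.select(selector):
        rows.append({
            "text": link.text.strip(),
            # Same concatenation approach as above; assumes hrefs start with "/"
            "url": base_url + link["href"],
        })
    return pd.DataFrame(rows)


# Made-up HTML standing in for a downloaded page
sample_html = (
    '<main><a href="/story-1"> First headline </a>'
    '<a href="/story-2">Second headline</a></main>'
)
sample_df = links_to_dataframe(sample_html, "https://text.npr.org")
print(sample_df)
```

With the real page, you'd pass in the `html` string downloaded earlier instead of `sample_html`.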


---