# News Sites Data Collection

1. Start with NPR
2. Explore the feasibility of collecting the following:
    - Headline
    - Author
    - Date
    - Full text
    - Media
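
The fields listed above could be captured in a small record type so that every news source yields rows of the same shape. The sketch below is one possible layout; the class name `Article` and its field names are illustrative assumptions, not something defined elsewhere in this notebook.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class Article:
    """Hypothetical record for one collected article."""
    headline: str
    source: str
    author: Optional[str] = None  # not every page may expose a byline
    date: Optional[str] = None    # raw date string as shown on the page
    full_text: str = ""
    media: List[str] = field(default_factory=list)  # URLs of images, audio, etc.


# Example with placeholder values
example = Article(headline="Example headline", source="NPR")
row = asdict(example)  # plain dict, usable as one row in pd.DataFrame([...])
```

Using `asdict` keeps the record compatible with the list-of-dicts approach used by the scraper further down.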

1. Get featured article data
2. Get data for all other articles

In [110]:
import pandas as pd
import requests
from bs4 import BeautifulSoup, SoupStrainer
from tqdm import tqdm

In [111]:
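def article_text(link):
    """Return the body text of the NPR article at `link`."""
    page = requests.get(link)

    def article_strain(name, attrs):
        # Keep only elements whose id is 'storytext' or whose class is 'tags'.
        attrs = dict(attrs)
        return attrs.get('id') == 'storytext' or attrs.get('class') == 'tags'

    soup = BeautifulSoup(page.content, 'html.parser', parse_only=SoupStrainer(article_strain))
    text = ''
    for p in soup.select('div p'):
        # Keep paragraphs whose parent has no attributes or carries the 'storytext' class.
        if len(p.parent.attrs) == 0 or 'storytext' in p.parent.attrs.get('class', []):
            # Paragraphs are appended with no separator between them.
            text += p.text
    return text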
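
`article_text` appends each paragraph directly to the previous one, so the last word of one paragraph runs into the first word of the next. A minimal sketch of one way to keep paragraph boundaries, assuming the paragraph strings have already been extracted; the helper name `join_paragraphs` and the blank-line separator are illustrative choices.

```python
def join_paragraphs(paragraphs, sep="\n\n"):
    """Join paragraph strings with a separator, skipping empty ones."""
    cleaned = (p.strip() for p in paragraphs)
    return sep.join(p for p in cleaned if p)


# Placeholder paragraphs; the whitespace-only entry is dropped
parts = ["First paragraph.", "   ", "Second paragraph."]
joined = join_paragraphs(parts)
```

Inside `article_text`, this would amount to collecting the `p.text` values in a list and joining them once at the end.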

In [112]:
def npr_scraper():
    """Scrape the NPR news section and return one dict per listed article."""
    page = requests.get("https://www.npr.org/sections/news/")

    def strain(name, attrs):
        # Parse only <div class="item-info"> elements, one per article summary.
        return name == 'div' and dict(attrs).get('class') == 'item-info'

    def teaser_split(header):
        # The teaser paragraph reads "<date> • <teaser>"; split it around the bullet.
        time_teaser = header.select('p.teaser')[0].text
        split = time_teaser.index('•')
        return time_teaser[:split-1], time_teaser[split+2:]

    soup = BeautifulSoup(page.content, 'html.parser', parse_only=SoupStrainer(strain))
    headers = soup.find_all('div', class_='item-info')
    data = []
    for header in headers:
        cat = header.select('h3.slug')[0].text
        link = header.select('p.teaser a')[0]['href']
        title = header.select('.title')[0].text
        time, teaser = teaser_split(header)
        # print(title, link)
        text = article_text(link)  # one additional request per article
        data.append({'title': title,
                     'teaser': teaser,
                     'time': time,
                     'category': cat,
                     'link': link,
                     'text': text,
                     'source': 'NPR'})
    return data
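
`teaser_split` relies on `str.index`, which raises `ValueError` when a teaser contains no `•` character. The following sketch shows a more forgiving split using `str.partition`. The function name `split_teaser` and the choice to return an empty date when the bullet is missing are assumptions for illustration.

```python
def split_teaser(time_teaser, bullet="•"):
    """Split a 'DATE • TEASER' string into (date, teaser)."""
    date, found, teaser = time_teaser.partition(bullet)
    if not found:
        # No bullet present: treat the whole string as teaser text.
        return "", time_teaser.strip()
    return date.strip(), teaser.strip()


print(split_teaser("October 29, 2019 • Example teaser text."))
print(split_teaser("Example teaser without a date."))
```

Stripping whitespace on both halves also removes the dependence on exactly one space on each side of the bullet.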

In [101]:
df = pd.DataFrame(npr_scraper())

READ: House Democrats Release Draft Resolution On Impeachment Inquiry https://www.npr.org/2019/10/29/774380175/read-house-democrats-release-draft-resolution-on-impeachment-inquiry
'We Didn't See A Body': Baghdadi's Death Draws Doubts In Lands Where ISIS Ruled https://www.npr.org/2019/10/29/774129683/we-didn-t-see-a-body-baghdadi-s-death-draws-doubts-in-lands-where-isis-ruled
23 Senators Demand Investigation Into Mismanagement Of Student Loan Program https://www.npr.org/2019/10/29/774395247/23-senators-demand-investigation-into-mismanagement-of-student-loan-program
NCAA Plans To Allow College Athletes To Get Paid For Use Of Their Names, Images https://www.npr.org/2019/10/29/774439078/ncaa-starts-process-to-allow-compensation-for-college-athletes
Former Senate Aide Gets Probation For Helping Dox Republicans Over Kavanaugh Hearings https://www.npr.org/2019/10/29/774386731/former-senate-aide-gets-probation-for-helping-dox-republicans-over-kavanaugh-hea
Britain Braces For New Vote In Decemb

In [108]:
df

| | category | link | source | teaser | text | time | title |
|---|---|---|---|---|---|---|---|
| 0 | Politics | https://www.npr.org/2019/10/29/774380175/read-... | NPR | The resolution, which formalizes the impeachme... | House Democrats on Tuesday introduced a draft ... | October 29, 2019 | READ: House Democrats Release Draft Resolution... |
| 1 | Middle East | https://www.npr.org/2019/10/29/774129683/we-di... | NPR | In Iraq and Syria, the ISIS leader's death has... | In Iraq and Syria, news of ISIS leader Abu Bak... | October 29, 2019 | 'We Didn't See A Body': Baghdadi's Death Draws... |
| 2 | Politics | https://www.npr.org/2019/10/29/774395247/23-se... | NPR | The senators are calling on the nation's stop ... | Updated at 4:20 p.m. ETTwenty-three U.S. senat... | October 29, 2019 | 23 Senators Demand Investigation Into Mismanag... |
| 3 | Sports | https://www.npr.org/2019/10/29/774439078/ncaa-... | NPR | The organization says it's just starting to wo... | In a surprise move, the NCAA says it intends t... | October 29, 2019 | NCAA Plans To Allow College Athletes To Get Pa... |
| 4 | Politics | https://www.npr.org/2019/10/29/774386731/forme... | NPR | Prosecutors say Samantha Davis gave her ex-boy... | A former Democratic Senate staffer was sentenc... | October 29, 2019 | Former Senate Aide Gets Probation For Helping ... |
| 5 | Europe | https://www.npr.org/2019/10/29/774573501/brita... | NPR | Prime Minister Boris Johnson sought an early g... | Still in turmoil over if, when or how to leave... | October 29, 2019 | Britain Braces For New Vote In December That C... |
| 6 | Politics | https://www.npr.org/2019/10/29/773127809/how-t... | NPR | The CIA whistleblower complaint that sparked t... | Even when he's praising his spy chiefs, Presid... | October 29, 2019 | How The Relationship Between Trump And His Spy... |
| 7 | Shots - Health News | https://www.npr.org/sections/health-shots/2019... | NPR | October marks not only fire season in Californ... | Farm laborers in yellow safety vests walked th... | October 29, 2019 | Smoke And Power Outages Near California Wildfi... |
| 8 | Environment | https://www.npr.org/2019/10/29/774456972/in-ca... | NPR | As fires blaze across the state, California fi... | High winds are in the forecast this week for a... | October 29, 2019 | In California, Air Tanker Pilots Help Keep Wil... |
| 9 | National | https://www.npr.org/2019/10/29/774355965/firef... | NPR | The Kincade Fire has burned more than 75,000 a... | Updated at 6:10 p.m. ETFirefighters knew their... | October 29, 2019 | Firefighters Brace For 'Critical 24-Hour Windo... |
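
The `time` column holds strings such as `October 29, 2019`. If real dates are needed for sorting or filtering, they can be parsed with the standard library. A minimal sketch, assuming every value follows the `Month day, year` pattern seen in the output above and that month names are in English; `parse_npr_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_npr_date(text):
    """Parse a string like 'October 29, 2019' into a datetime.date."""
    # %B expects the full month name in the current locale (English by default)
    return datetime.strptime(text.strip(), "%B %d, %Y").date()


parsed = parse_npr_date("October 29, 2019")
print(parsed.isoformat())
```

A value that does not match the pattern raises `ValueError`, so rows with unexpected date formats would surface immediately rather than being silently misparsed.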
