# Web Scraping NBC News

In [275]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

## Obtain the list of news from the cover page

URL Definition:

In [276]:
# url definition
url = 'https://www.nbcnews.com/'

In [277]:
# Request
r1 = requests.get(url)
print(r1.status_code)

# Save the cover page content in coverpage
coverpage = r1.content

# Soup creation
soup1 = BeautifulSoup(coverpage, 'html5lib')

# News identification
coverpage_news = []
for tag in soup1.find_all('h2', class_='styles_headline__ice3t'):
    for anchor in tag.find_all('a'):
        coverpage_news.append(anchor)
len(coverpage_news)


200


40

Now we have a list in which every element is the `<a>` tag of a news article, holding its link and its headline.
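
The class name `styles_headline__ice3t` ends in what looks like a build-generated suffix, so it may change when the site is redeployed. Below is a minimal sketch of a more tolerant match, assuming the `styles_headline` prefix stays stable (an assumption, not something the page guarantees). `sample_html` is a hypothetical stand-in for the cover page so the snippet runs offline, and it uses the built-in `html.parser` instead of `html5lib`:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the cover page HTML (not real NBC markup).
sample_html = """
<h2 class="styles_headline__ice3t"><a href="https://example.com/a">First story</a></h2>
<h2 class="styles_headline__zz9xx"><a href="https://example.com/b">Second story</a></h2>
<h2 class="other"><a href="https://example.com/c">Not a headline</a></h2>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Match any CSS class that starts with "styles_headline", whatever its suffix.
headline_class = re.compile(r"^styles_headline")

# Collect the <a> tags found inside the matching <h2> tags.
headline_links = [
    anchor
    for tag in soup.find_all("h2", class_=headline_class)
    for anchor in tag.find_all("a")
]
titles = [anchor.get_text() for anchor in headline_links]
```

On the real page, the same pattern could be passed as `soup1.find_all('h2', class_=headline_class)`.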

In [278]:
coverpage_news[:10]

[<a href="https://www.nbcnews.com/politics/joe-biden/biden-bounces-back-bike-fall-delaware-beach-home-rcna34251">Biden bounces back after bike fall near Delaware beach home</a>,
 <a href="https://www.nbcnews.com/news/world/big-sanction-big-russian-bank-still-operates-freely-global-economy-hel-rcna34123">Ukraine is accusing a major Russian bank of financing the war</a>,
 <a href="https://www.nbcnews.com/news/us-news/free-school-lunches-set-end-creating-perfect-storm-high-inflation-rcna33688">Free school lunches for all set to end, creating ‘perfect storm’ amid high inflation</a>,
 <a href="https://www.nbcnews.com/news/us-news/senator-questioned-supreme-court-birth-control-ruling-led-campus-group-rcna32192">Senator who questioned Supreme Court birth control ruling led campus group that promoted it</a>,
 <a href="https://www.nbcnews.com/tech/apple/apple-store-union-vote-workers-demand-fair-pay-rcna34256">Apple's record revenues during pandemic fuel employees' historic union push</a>,
 <a 

Let's take a look at one example and see what we can use.

In [279]:
n=0
link = coverpage_news[n]['href']
title = coverpage_news[n].get_text()
article = requests.get(link)
article_content = article.content
soup_article = BeautifulSoup(article_content, 'html5lib')

In [280]:
title

'Biden bounces back after bike fall near Delaware beach home'

In [281]:
body = soup_article.find_all('p')  # every <p> tag, including the navigation menus
body

[<p class="headline-title-text h-lh js-headline-text"></p>,
 <p class="menu-section-heading">Profile</p>,
 <p class="menu-section-heading">Sections</p>,
 <p class="menu-section-heading">tv</p>,
 <p class="menu-section-heading">Featured</p>,
 <p class="menu-section-heading">More From NBC</p>,
 <p class="menu-section-heading">Follow NBC News</p>,
 <p class="">REHOBOTH BEACH, Del. — <a href="https://www.nbcnews.com/politics/joe-biden/" target="_blank">President Joe Biden</a> fell when he tried to get off his bike at the end of a ride Saturday at Cape Henlopen State Park near his <a href="https://www.nbcnews.com/politics/joe-biden/biden-first-lady-briefly-moved-delaware-beach-house-small-plane-flies-rcna31992" target="_blank">beach home in Delaware</a>, but said he wasn’t hurt.</p>,
 <p class="">“I’m good,” he told reporters after U.S. Secret Service Agents quickly helped him up. “I got my foot caught” in the toe cages.</p>,
 <p class="">Biden, 79, and first lady Jill Biden were wrapping u

Since we are only interested in the article content, we will only use the `<p>` tags with `class=""` or `class="endmark"`.

In [282]:
len(body)

13

In [283]:
x = soup_article.find_all('p', {'class':['','endmark']})
len(x)

5

In [284]:
x[0].get_text()

'REHOBOTH BEACH, Del. — President Joe Biden fell when he tried to get off his bike at the end of a ride Saturday at Cape Henlopen State Park near his beach home in Delaware, but said he wasn’t hurt.'

## Let's extract the text from the articles

First, we define the number of articles we want:

In [285]:
number_of_articles = 5

In [286]:
# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []

for n in range(number_of_articles):

    # Getting the link of the article
    link = coverpage_news[n]['href']
    list_links.append(link)

    # Getting the title
    title = coverpage_news[n].get_text()
    list_titles.append(title)

    # Reading the content (it is divided into paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    x = soup_article.find_all('p', {'class':['','endmark']})

    # Unifying the paragraphs; joining outside the loop means an article
    # with no matching <p> tags yields '' instead of the previous article's text
    list_paragraphs = [p.get_text() for p in x]
    final_article = " ".join(list_paragraphs)

    news_contents.append(final_article)
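
The loop above calls `requests.get` without a timeout or a status check, so one slow or failing article can stall or break the whole run. The sketch below shows one way to make the download step more defensive. The names `fetch_html` and `fetch_pages`, the 10-second timeout, and the 1-second pause are illustrative assumptions, not part of the original notebook:

```python
import time

import requests


def fetch_html(link, timeout=10):
    """Download one page; raises requests.RequestException on failure."""
    response = requests.get(link, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


def fetch_pages(links, fetch=fetch_html, delay_seconds=1.0):
    """Return {link: html} for every link that could be downloaded."""
    pages = {}
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            pages[link] = fetch(link)
        except requests.RequestException as exc:
            # Skip the failing article and keep going with the rest.
            print(f"Skipping {link}: {exc}")
    return pages
```

Something like `pages = fetch_pages(list_links)` could then feed each downloaded page to `BeautifulSoup`. The `fetch` parameter exists only so the function can be exercised without network access.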

Let's put them into:
* A dataset that will be the input of the models (`df_features`)
* A dataset with the title and the link of each article (`df_show_info`)

In [287]:
# df_features
df_features = pd.DataFrame(
    {'Article Content': news_contents})

# df_show_info
df_show_info = pd.DataFrame(
    {'Article Title': list_titles,
     'Article Link': list_links})

In [288]:
df_features

|   | Article Content |
|---|---|
| 0 | REHOBOTH BEACH, Del. — President Joe Biden fel... |
| 1 | Ukraine is urging the United States and the Eu... |
| 2 | A federal waiver that made school breakfasts a... |
| 3 | Tennessee Sen. Marsha Blackburn, an anti-abort... |
| 4 | Whenever Noah McCool pitches his co-workers at... |


In [289]:
df_show_info

|   | Article Title | Article Link |
|---|---|---|
| 0 | Biden bounces back after bike fall near Delawa... | https://www.nbcnews.com/politics/joe-biden/bid... |
| 1 | Ukraine is accusing a major Russian bank of fi... | https://www.nbcnews.com/news/world/big-sanctio... |
| 2 | Free school lunches for all set to end, creati... | https://www.nbcnews.com/news/us-news/free-scho... |
| 3 | Senator who questioned Supreme Court birth con... | https://www.nbcnews.com/news/us-news/senator-q... |
| 4 | Apple's record revenues during pandemic fuel e... | https://www.nbcnews.com/tech/apple/apple-store... |
