<a href="https://colab.research.google.com/github/kevinmfreire/wheres_waldo/blob/main/WS_NBC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping NBC News

We first import the libraries needed for our web scraping application.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

## Obtain a list of news articles from the cover page

URL Definition:

In [2]:
# url definition
url = 'https://www.nbcnews.com/'
article_url = "https://www.nbcnews.com/health/health-news/states-say-abortion-bans-dont-affect-ivf-providers-lawyers-worry-rcna35556"

We then use our libraries to:

* Make a request to the target web page.
* Create a `BeautifulSoup` object that represents the requested page as a nested data structure, parsed with the `html5lib` Python library.
* Find all articles on the NBC News landing page by locating every `<h2></h2>` tag with `class='styles_headline__ice3t'`, then extracting the links inside each anchor `<a></a>` tag.
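
The steps above can be tried offline on a small, hand-written HTML snippet, which helps when the live page changes its class names. This is only a minimal sketch: the snippet, the `extract_headline_links` helper, and the `example.com` URLs are made up for illustration, and it uses Python's built-in `html.parser` rather than `html5lib` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described above;
# it is not copied from the real NBC News page.
SAMPLE_HTML = """
<html><body>
  <h2 class="styles_headline__ice3t"><a href="https://example.com/a">First story</a></h2>
  <h2 class="other"><a href="https://example.com/x">Not a headline</a></h2>
  <h2 class="styles_headline__ice3t"><a href="https://example.com/b">Second story</a></h2>
</body></html>
"""


def extract_headline_links(html, headline_class="styles_headline__ice3t"):
    """Return (title, href) pairs for anchors inside matching <h2> tags."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for tag in soup.find_all("h2", class_=headline_class):
        # href=True skips anchors that carry no link at all
        for anchor in tag.find_all("a", href=True):
            pairs.append((anchor.get_text(strip=True), anchor["href"]))
    return pairs


headline_links = extract_headline_links(SAMPLE_HTML)
print(headline_links)
```

Keeping the class name as a parameter makes it easy to update if the site's generated class names change.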

In [4]:
# Request
request = requests.get(url)
print('Status code return: {}'.format(request.status_code))

# We'll save in coverpage the cover page content
coverpage = request.content

# Soup creation
soup1 = BeautifulSoup(coverpage, 'html5lib')

# News identification
coverpage_news = []
for tag in soup1.find_all('h2', class_='styles_headline__ice3t'):
  for anchor in tag.find_all('a'):
    coverpage_news.append(anchor)
print('Number of articles found: {}'.format(len(coverpage_news)))


Status code return: 200
Number of articles found: 40


Now we have a list in which every element is the anchor tag of a news article, holding its title and link.

In [5]:
coverpage_news[:10]

[<a href="https://www.nbcnews.com/news/danish-police-arrest-one-connection-shooting-copenhagen-mall-rcna36515">3 dead, 4 critically injured in Copenhagen mall shooting; police say gunman shot people at random</a>,
 <a href="https://www.nbcnews.com/news/us-news/akron-ohio-sets-downtown-curfew-cancels-fireworks-wake-jayland-walker-rcna36557">Akron, Ohio, sets downtown curfew, cancels fireworks in wake of Jayland Walker protests</a>,
 <a href="https://www.nbcnews.com/news/world/chunk-alpine-glacier-detaches-italy-killing-6-hikers-injuring-9-rcna36526">Chunk of alpine glacier detaches in Italy, killing at least 6 hikers, injuring 9</a>,
 <a href="https://www.nbcnews.com/news/world/israeli-military-gunfire-likely-killed-shireen-abu-akleh-us-concludes-rcna36550">Israeli military gunfire likely killed Palestinian American journalist, U.S. concludes</a>,
 <a href="https://www.nbcnews.com/news/world/russia-conquers-ukraines-luhansk-lysychansk-key-eastern-province-rcna36530">Russia takes control

Let's take a look at one example and see what we can use.

In [6]:
n=0
link = coverpage_news[n]['href']
title = coverpage_news[n].get_text()
article = requests.get(link)
article_content = article.content
soup_article = BeautifulSoup(article_content, 'html5lib')

In [7]:
print('Link: {}'.format(link))
print('Title of target link: {}'.format(title))

Link: https://www.nbcnews.com/news/danish-police-arrest-one-connection-shooting-copenhagen-mall-rcna36515
Title of target link: 3 dead, 4 critically injured in Copenhagen mall shooting; police say gunman shot people at random


In [8]:
print('Content for article: {}'.format(article_content))



In [9]:
body = soup_article.find_all('p')  # every <p> tag, regardless of class
body

[<p class="styles_content__a8lrE"><span><a href="https://www.nbcnews.com/news/us-news/gunfire-erupts-area-fourth-july-parade-route-highland-park-illinois-sh-rcna36565">BREAKING: Gunfire erupts in area of Fourth of July parade route in Highland Park, Illinois, sheriff says</a></span><a href="https://www.nbcnews.com/news/us-news/gunfire-erupts-area-fourth-july-parade-route-highland-park-illinois-sh-rcna36565" rel="noreferrer" target="_blank"><span class="styles_icon__LS5ud icon icon-arrow-link"></span></a></p>,
 <p class="headline-title-text h-lh js-headline-text"></p>,
 <p class="menu-section-heading">Profile</p>,
 <p class="menu-section-heading">Sections</p>,
 <p class="menu-section-heading">tv</p>,
 <p class="menu-section-heading">Featured</p>,
 <p class="menu-section-heading">More From NBC</p>,
 <p class="menu-section-heading">Follow NBC News</p>,
 <p class="">COPENHAGEN, Denmark — Three people are dead and several others are injured, including four in critical condition, after a sho

Since we are only interested in the article text, we keep only the `<p></p>` tags with `class=""` or `class="endmark"`; the remaining paragraphs are menu headings and banners.
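
The same selection can also be written as an explicit predicate, which makes the rule easier to read and adjust. The sketch below is an assumption-laden illustration: the sample markup is invented (loosely modelled on the output above), `is_body_paragraph` is a hypothetical helper, and it uses the built-in `html.parser`. It aims to keep the same kind of paragraphs as the class filter, but it has not been checked against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical article markup, loosely modelled on the output shown above.
SAMPLE_ARTICLE = """
<div>
  <p class="menu-section-heading">Sections</p>
  <p class="">First paragraph of the story.</p>
  <p class="">Second paragraph.</p>
  <p class="endmark">Last paragraph.</p>
  <p class="styles_content__a8lrE">Breaking news banner</p>
</div>
"""


def is_body_paragraph(tag):
    """True for <p> tags whose class attribute is empty or exactly 'endmark'."""
    if tag.name != "p" or not tag.has_attr("class"):
        return False
    classes = tag["class"]
    if isinstance(classes, str):  # defensive: handle a plain string value
        classes = classes.split()
    classes = [c for c in classes if c]  # drop empty entries
    return set(classes) <= {"endmark"}


soup = BeautifulSoup(SAMPLE_ARTICLE, "html.parser")
body_texts = [p.get_text() for p in soup.find_all(is_body_paragraph)]
print(body_texts)
```

Passing a function to `find_all` lets the rule grow (for example, to exclude very short paragraphs) without changing the call site.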

In [10]:
len(body)

36

In [11]:
x = soup_article.find_all('p', {'class':['','endmark']})
len(x)

25

In [12]:
x[0].get_text()

'COPENHAGEN, Denmark —\xa0Three people are dead and several others are injured, including four in critical condition, after a shooting Sunday at a mall in Copenhagen, according to police.'

## Let's extract the text from the article

First we define the number of articles we want:

In [13]:
number_of_articles = 5

In [14]:
# Empty lists for content, links and titles
news_contents = []
list_links = []
list_titles = []

for n in range(number_of_articles):

    # Getting the link of the article
    link = coverpage_news[n]['href']
    list_links.append(link)

    # Getting the title
    title = coverpage_news[n].get_text()
    list_titles.append(title)

    # Reading the content (it is divided in paragraphs)
    article = requests.get(link)
    article_content = article.content
    soup_article = BeautifulSoup(article_content, 'html5lib')
    x = soup_article.find_all('p', {'class':['','endmark']})

    # Unifying the paragraphs into a single string
    list_paragraphs = [p.get_text() for p in x]
    final_article = " ".join(list_paragraphs)

    news_contents.append(final_article)

In [15]:
def clean_text(body):
  body = body.replace('\xa0',' ')
  return body
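
The `clean_text` helper above targets only `\xa0`. A slightly broader variant, sketched here under assumptions, uses the standard library's `unicodedata.normalize` and then collapses repeated whitespace. The `normalize_text` name and the sample string are hypothetical, and NFKC normalization also rewrites other compatibility characters (for example, an ellipsis character becomes three dots), which may or may not be wanted.

```python
import unicodedata


def normalize_text(text):
    """Fold compatibility characters and collapse runs of whitespace."""
    # NFKC maps the non-breaking space U+00A0 to a regular space.
    text = unicodedata.normalize("NFKC", text)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())


# Hypothetical sample containing a non-breaking space and a double space.
sample = "COPENHAGEN, Denmark \u2014\xa0Three people  are dead"
cleaned = normalize_text(sample)
print(repr(cleaned))
```

To clean every article rather than only the first, the same function could be applied to the whole list, for instance `[normalize_text(c) for c in news_contents]`.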

Before cleaning the `\xa0` (non-breaking space) characters out of the text, we get the following:

In [16]:
news_contents[0]

'COPENHAGEN, Denmark —\xa0Three people are dead and several others are injured, including four in critical condition, after a shooting Sunday at a mall in Copenhagen, according to police. A suspect, a 22-year-old Danish man who has not yet been named, was in custody after the shooting at the Field\'s shopping center, police said. The suspect appeared to act alone and target his victims randomly, officials said Monday, with no immediate signs of links to organized terror. The suspect will appear in court later Monday and be charged with manslaughter, police said — a crime that carries a potential sentence of life in prison. "My thoughts go to the wounded, relatives and others affected after this completely meaningless and terrible act," Justice Minister Mattias Tesfaye said in a news conference Monday. "We do not yet know the motive behind it, but I can assure you that the authorities are making every effort to get to the bottom of this matter so that the person or persons responsible c

After cleaning the text:

In [17]:
clean_cont = clean_text(news_contents[0])
clean_cont

'COPENHAGEN, Denmark — Three people are dead and several others are injured, including four in critical condition, after a shooting Sunday at a mall in Copenhagen, according to police. A suspect, a 22-year-old Danish man who has not yet been named, was in custody after the shooting at the Field\'s shopping center, police said. The suspect appeared to act alone and target his victims randomly, officials said Monday, with no immediate signs of links to organized terror. The suspect will appear in court later Monday and be charged with manslaughter, police said — a crime that carries a potential sentence of life in prison. "My thoughts go to the wounded, relatives and others affected after this completely meaningless and terrible act," Justice Minister Mattias Tesfaye said in a news conference Monday. "We do not yet know the motive behind it, but I can assure you that the authorities are making every effort to get to the bottom of this matter so that the person or persons responsible can 

Let's put the results into:
* A dataset that will be the input of the models (`df_features`)
* A dataset with the title and the link (`df_show_info`)

In [18]:
# df_features
df_features = pd.DataFrame(
    {'Article Content': news_contents})

# df_show_info
df_show_info = pd.DataFrame(
    {'Article Title': list_titles,
     'Article Link': list_links})

In [19]:
df_features

| | Article Content |
|---|---|
| 0 | COPENHAGEN, Denmark — Three people are dead an... |
| 1 | Officials on Monday set a 9 p.m. curfew for do... |
| 2 | ROME — A large chunk of an Alpine glacier brok... |
| 3 | The United States has concluded that gunfire f... |
| 4 | The last Ukrainian bastion in a key eastern pr... |


In [20]:
df_show_info

| | Article Title | Article Link |
|---|---|---|
| 0 | 3 dead, 4 critically injured in Copenhagen mal... | https://www.nbcnews.com/news/danish-police-arr... |
| 1 | Akron, Ohio, sets downtown curfew, cancels fir... | https://www.nbcnews.com/news/us-news/akron-ohi... |
| 2 | Chunk of alpine glacier detaches in Italy, kil... | https://www.nbcnews.com/news/world/chunk-alpin... |
| 3 | Israeli military gunfire likely killed Palesti... | https://www.nbcnews.com/news/world/israeli-mil... |
| 4 | Russia takes control of a key eastern province... | https://www.nbcnews.com/news/world/russia-conq... |
