<a href="https://colab.research.google.com/github/nicholasgriffen/intro-web-scraping/blob/master/NewsPageAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

We'll write some Python programs to work with the first page of Google News search results for "Vail Colorado" ([this page](https://www.google.com/search?tbm=nws&q=vail+colorado)). Our programs will use several existing **Modules**: open source libraries of Python code. Below, I list our goals and the **Module** relevant to each. Click a name for more information.

Our goals: 

1) retrieve with [Requests](http://docs.python-requests.org/en/master/)
---
2) format and examine with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
---
3) transform with [re](https://docs.python.org/3/library/re.html)
---
4) analyze with [NLTK](https://www.nltk.org/)
---
5) visualize with [Pandas](https://pandas.pydata.org/)
---

# Importing Modules

In [0]:
# Comments are notes to ourselves and others;
# Python ignores them when it runs the program
# Comments begin with a "#" symbol
# in Google Colab, press Ctrl+/ (or Cmd+/) to toggle a line comment
#
# The statements below import Modules
# In other words, they define names we'll use to invoke Module code
import requests
from bs4 import BeautifulSoup
import nltk
import pandas
import re

# Retrieving the HTML Document

In [40]:
# store request url in a variable
request_url = "https://www.google.com/search?tbm=nws&q=vail+colorado"

# use the get function from the requests Module  
# store the output of the function in a variable
response = requests.get(request_url)

# extract the content from our response
# refer to http://docs.python-requests.org/en/master/api/#requests.Response
news_page = response.content

# use the built-in print function to display the news_page
print(news_page)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><noscript><meta content="0;url=/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=8cCuXK3mDquP0gLy8JHoCQ" http-equiv="refresh"><style>table,div,span,p{display:none}</style><div style="display:block">Please click <a href="/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=8cCuXK3mDquP0gLy8JHoCQ">here</a> if you are not redirected within a few seconds.</div></noscript><title>vail colorado - Google Search</title><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-
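
The output above begins with `b'` because `response.content` holds raw bytes rather than text. A minimal sketch of the difference, using a short hard-coded sample (an assumption for illustration, not the real response) so that it runs without a network request:

```python
# a short stand-in for response.content (hypothetical sample, not the real page)
sample_bytes = b'<title>vail colorado - Google Search</title>'

# bytes print with a b'' prefix; decoding them produces an ordinary str
sample_text = sample_bytes.decode('utf-8')

print(type(sample_bytes).__name__)
print(type(sample_text).__name__)
print(sample_text)
```

Requests also exposes `response.text`, which performs a decoding step like this for you. BeautifulSoup accepts either form.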

# Formatting and Examining the HTML

In [41]:
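# use the BeautifulSoup function to create an interface to the HTML doc
# name the parser explicitly; otherwise Beautiful Soup guesses one
# from whatever is installed and prints a warning
# store the interface in a variable
document_interface = BeautifulSoup(news_page, 'html.parser')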

# use the prettify method on the document_interface
# store the formatted HTML in a variable
formatted_news = document_interface.prettify()

# use the built-in print function to examine the formatted_news
print(formatted_news)

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <noscript>
   <meta content="0;url=/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=8cCuXK3mDquP0gLy8JHoCQ" http-equiv="refresh"/>
   <style>
    table,div,span,p{display:none}
   </style>
   <div style="display:block">
    Please click
    <a href="/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=8cCuXK3mDquP0gLy8JHoCQ">
     here
    </a>
    if you are not redirected within a few seconds.
   </div>
  </noscript>
  <title>
   vail colorado - Google Search
  </title>
  <style>
   #gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:ab

## Identifying Content

In [42]:
# Headlines look something like the example below
# <h3 class="r">
#   <a href="/url?q=https://www.vaildaily.com/news/its-rockslide-season-in-the-rockies/&amp;sa=U&amp;ved=0ahUKEwjTxriLosXhAhUCuVkKHaanAL8QqQIIISgAMAU&amp;usg=AOvVaw0xTTBcc4YQSad8QH01hD1O">
#    It's rockslide season in the Rockies
#   </a>
# </h3>

headlines = []
# use the document_interface to retrieve all h3 elements
h3_elements = document_interface.find_all('h3')

# for each h3 element in h3_elements
for h3 in h3_elements:
  # use the built-in print function to examine the text of each element
  print(h3.text)
  # collect the headline in a list for later use
  headlines.append(h3.text)

# use the built-in print function to examine the list of elements
# print(h3_elements)

# use the built-in print function to examine one element
# print(h3_elements[0])

# use the built-in print function to examine the href of each h3.a element
# for h3 in h3_elements:
#   print(h3.a.get('href'))

Vail, Beaver Creek announce closing day lift operations
From snow depth to river flow: How high will the Eagle run?
Vail 'Civic Area' plan has options, but no firm ideas
Vail Mountain and Beaver Creek Resort announce lift operations for ...
It's rockslide season in the Rockies
Local lawmakers leading charge to bring down drug prices
Vail Pass is now open in both directions, please drive safely
Eagle County Schools Superintendent finalists hit final stretch
Colorado Snowsports Museum in Vail to welcome icon's ski fashion ...
Lindsey Vonn's got next: Legendary ski racer has big plans
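
Each headline's `href` (see the commented-out loop above) is a Google redirect path rather than the article's own address; the article URL sits in the `q` query parameter. A minimal sketch of pulling it out with the standard library's `urllib.parse`, assuming the href has the shape shown in the example headline. The `ved` and `usg` values are shortened placeholders, and `article_url` is a hypothetical helper name:

```python
from urllib.parse import urlparse, parse_qs

# an href shaped like the example headline above, as BeautifulSoup returns it
# (&amp; already decoded to &); the ved and usg values are placeholders
href = ('/url?q=https://www.vaildaily.com/news/'
        'its-rockslide-season-in-the-rockies/&sa=U&ved=PLACEHOLDER&usg=PLACEHOLDER')

def article_url(href):
  """Return the q parameter of a redirect href, or None if it has none."""
  query = parse_qs(urlparse(href).query)
  values = query.get('q')
  return values[0] if values else None

print(article_url(href))
```

In the notebook, the same helper could be applied to `h3.a.get('href')` for each element in `h3_elements`.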


In [43]:
# Sources look something like the example below
# <div class="slp">
#   <span class="f">
#     Vail Daily News - Apr 3, 2019
#   </span>
# </div>

sources = []
# use the document_interface to retrieve all div elements with class slp
source_elements = document_interface.find_all('div', attrs = {'class': 'slp'})

# for each source element in source_elements
for source in source_elements:
  # use the built-in print function to examine the text of each element
  print(source.text)
  # collect the source in a list for later use
  sources.append(source.text)
  
# use the built-in print function to examine the list of elements
# print(source_elements)

# use the built-in print function to examine one element
# print(source_elements[0])

# use the built-in print function to examine the text of each element
# for source in source_elements:
#   print(source.text)

Vail Daily News - Apr 3, 2019
Vail Daily News - 5 days ago
Vail Daily News - 5 days ago
Vail Daily News - 6 days ago
Vail Daily News - Apr 2, 2019
Vail Daily News - 5 days ago
Vail Daily News - Mar 13, 2019
Vail Daily News - 5 days ago
Vail Daily News - Mar 26, 2019
Vail Daily News - Apr 1, 2019


# Transforming the Text of HTML Elements

In [44]:
# Source element text contains a name and date like below
# Vail Daily News - Apr 3, 2019
# we are interested in transforming the source 
# to include only the source name
source_names = []

# define a pattern corresponding to text from the "- " to the end
# note: the space before the hyphen is not part of the pattern,
# so each source_name keeps a trailing space
date_pattern = r'- .*$'

# for each source, use re.sub
# substitute '' for the date_pattern
for source in sources:
  source_name = re.sub(date_pattern, '', source)
  # use the built-in print function to see the transformed text
  print(source_name)
  # collect the source_name into a list for later use
  source_names.append(source_name)

Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
Vail Daily News 
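
The same source text also carries the date. A small sketch, assuming every source string has the "name - date" shape seen above, that keeps both parts instead of discarding the date. `split_source` is a hypothetical helper, not part of the notebook:

```python
import re

# sample strings copied from the sources output above
sample_sources = ['Vail Daily News - Apr 3, 2019', 'Vail Daily News - 5 days ago']

# capture the text before " - " as the name and the text after it as the date
source_pattern = re.compile(r'^(.*?)\s+-\s+(.*)$')

def split_source(source):
  """Return (name, date) for a source string, or (source, None) if no match."""
  match = source_pattern.match(source)
  if match is None:
    return source, None
  return match.group(1), match.group(2)

for source in sample_sources:
  print(split_source(source))
```

Because the whitespace around the hyphen is part of the pattern, the name comes back without a trailing space.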


# Analyzing Headlines
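
One simple analysis is to count which words appear most often across the headlines. A minimal sketch using only the standard library as a stand-in for NLTK (whose `FreqDist` class offers similar counting). The three sample headlines are copied from the output above, and the word pattern is an assumption about what counts as a word:

```python
import re
from collections import Counter

# sample headlines copied from the headlines output above
sample_headlines = [
  "Vail, Beaver Creek announce closing day lift operations",
  "Vail 'Civic Area' plan has options, but no firm ideas",
  "Vail Pass is now open in both directions, please drive safely",
]

word_counts = Counter()
for headline in sample_headlines:
  # lowercase, then keep runs of letters, allowing an apostrophe inside a word
  words = re.findall(r"[a-z]+(?:'[a-z]+)?", headline.lower())
  word_counts.update(words)

# the three most frequent words with their counts
print(word_counts.most_common(3))
```

In the notebook, the loop could run over the full `headlines` list instead of the sample.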

# Visualizing Sources and Headlines

In [56]:
# use the DataFrame class from the pandas Module
# to create a table of sources, headlines, and headline lengths
table = pandas.DataFrame({'Source': source_names,
                          'Headline': headlines,
                          'Headline Length': [len(headline) for headline in headlines]})

table


|   | Headline | Headline Length | Source |
|---|---|---|---|
| 0 | Vail, Beaver Creek announce closing day lift o... | 55 | Vail Daily News |
| 1 | From snow depth to river flow: How high will t... | 59 | Vail Daily News |
| 2 | Vail 'Civic Area' plan has options, but no fir... | 53 | Vail Daily News |
| 3 | Vail Mountain and Beaver Creek Resort announce... | 70 | Vail Daily News |
| 4 | It's rockslide season in the Rockies | 36 | Vail Daily News |
| 5 | Local lawmakers leading charge to bring down d... | 56 | Vail Daily News |
| 6 | Vail Pass is now open in both directions, plea... | 61 | Vail Daily News |
| 7 | Eagle County Schools Superintendent finalists ... | 63 | Vail Daily News |
| 8 | Colorado Snowsports Museum in Vail to welcome ... | 68 | Vail Daily News |
| 9 | Lindsey Vonn's got next: Legendary ski racer h... | 58 | Vail Daily News |
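
With the table in place, pandas can also summarize it. A hedged sketch, using three headlines copied from the output above in place of the scraped lists, that sorts by headline length and averages the length per source:

```python
import pandas

# three headlines copied from the output above (stand-ins for the scraped list)
sample_headlines = [
  "It's rockslide season in the Rockies",
  "Lindsey Vonn's got next: Legendary ski racer has big plans",
  "Vail 'Civic Area' plan has options, but no firm ideas",
]
sample_table = pandas.DataFrame({
  'Source': ['Vail Daily News'] * 3,
  'Headline': sample_headlines,
  'Headline Length': [len(headline) for headline in sample_headlines],
})

# longest headlines first
longest_first = sample_table.sort_values('Headline Length', ascending=False)

# mean headline length for each source
mean_length = sample_table.groupby('Source')['Headline Length'].mean()

print(longest_first)
print(mean_length)
```

With only one source in the results, the grouped mean has a single row; it becomes more informative once a search returns several outlets.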
