<a href="https://colab.research.google.com/github/nicholasgriffen/intro-web-scraping/blob/master/NewsPageAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

We'll write some Python programs that work with the first page of Google News search results for "Vail Colorado" ([this page](https://www.google.com/search?tbm=nws&q=vail+colorado)). Our programs will use several existing **Modules**, open source libraries of Python code. Below, I list our goals and the **Module** relevant to each. Click a name for more information.

Our goals: 

1) retrieve with [Requests](http://docs.python-requests.org/en/master/)
---
2) format and examine with [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
---
3) analyze with [NLTK](https://www.nltk.org/)
---
4) visualize with [Pandas](https://pandas.pydata.org/)
---

# Importing Modules

In [0]:
# Comments are notes to ourselves and others;
# Python ignores them when it runs the program
# Comments begin with a "#" symbol
# The statements below import Modules
# In other words, they define names we'll use to invoke Module code
import requests
from bs4 import BeautifulSoup
import nltk
import pandas as pd
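
The import statements above come in three forms, and each form defines a different name. Here is a minimal sketch of the difference, using standard-library modules chosen only so the example runs without installing anything (they are an assumption for illustration, not part of this notebook's toolkit):

```python
# Plain import: defines the name `math`; its functions are used as math.<name>
import math

# from ... import: defines only the name `Counter`, not `collections`
from collections import Counter

# import ... as: defines the short alias `st` for the statistics module
import statistics as st

root = math.sqrt(16)
letter_counts = Counter("vail")
average = st.mean([1, 2, 3])

print(root, letter_counts["a"], average)
```

In the same way, `import pandas as pd` lets us write `pd` wherever we would otherwise write `pandas`.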

# Retrieving the News

In [5]:
# use the get function from the requests Module  
# store the output of the function in a variable
response = requests.get("https://www.google.com/search?tbm=nws&q=vail+colorado")

# extract the content from our response
# refer to http://docs.python-requests.org/en/master/api/#requests.Response

news_page = response.content

# use the built-in print function to display the news_page
print(news_page)

b'<!doctype html><html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><noscript><meta content="0;url=/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=l76tXJOfCoLy5gKmz4L4Cw" http-equiv="refresh"><style>table,div,span,p{display:none}</style><div style="display:block">Please click <a href="/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=l76tXJOfCoLy5gKmz4L4Cw">here</a> if you are not redirected within a few seconds.</div></noscript><title>vail colorado - Google Search</title><style>#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-
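
The `b'` at the start of the output marks a `bytes` value: `response.content` holds the raw bytes the server sent. Here is a small sketch of turning bytes into text. It uses a short hard-coded sample, an assumption standing in for the real page so that no network request is needed:

```python
# A short stand-in for response.content (hypothetical sample, not the real page)
sample_content = b"<title>vail colorado - Google Search</title>"

# decode converts bytes to a str using the named character encoding
sample_text = sample_content.decode("utf-8")

# show the type of each value, then the decoded text
print(type(sample_content).__name__)
print(type(sample_text).__name__)
print(sample_text)
```

With a real response, `response.text` gives the decoded string directly, using the encoding that Requests detects.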

# Formatting and Examining the News

In [7]:
# use the BeautifulSoup constructor on the news_page
# name the parser explicitly ("html.parser" ships with Python) to avoid a warning
# use the prettify method on the BeautifulSoup instance
# store the result in a variable

formatted_news = BeautifulSoup(news_page, "html.parser").prettify()

# use the built-in print function to examine the formatted_news
print(formatted_news)

<!DOCTYPE html>
<html itemscope="" itemtype="http://schema.org/SearchResultsPage" lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
  <noscript>
   <meta content="0;url=/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=l76tXJOfCoLy5gKmz4L4Cw" http-equiv="refresh"/>
   <style>
    table,div,span,p{display:none}
   </style>
   <div style="display:block">
    Please click
    <a href="/search?q=vail+colorado&amp;tbm=nws&amp;ie=UTF-8&amp;gbv=1&amp;sei=l76tXJOfCoLy5gKmz4L4Cw">
     here
    </a>
    if you are not redirected within a few seconds.
   </div>
  </noscript>
  <title>
   vail colorado - Google Search
  </title>
  <style>
   #gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:ab
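
Beyond pretty-printing, a BeautifulSoup instance can be searched for specific tags. The minimal sketch below assumes a small hard-coded HTML snippet modeled on the output above. The snippet and the variable names are illustrative and do not come from the real page:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the news page (hypothetical sample)
sample_html = """
<html>
 <head><title>vail colorado - Google Search</title></head>
 <body>
  <a href="/search?q=vail+colorado&amp;tbm=nws">here</a>
 </body>
</html>
"""

# parse the sample with Python's built-in HTML parser
soup = BeautifulSoup(sample_html, "html.parser")

# soup.title is the first <title> tag; get_text returns the text inside it
page_title = soup.title.get_text(strip=True)

# find_all returns every matching tag; a["href"] reads the tag's href attribute
link_targets = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(link_targets)
```

The same calls would work on `BeautifulSoup(news_page, "html.parser")`, although the tags worth searching for depend on how Google structures the page at the time of the request.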