# Scraping and Summarizing News Articles

This notebook is a short demonstration of how to scrape a news article and summarize it. It accompanies the blog post found here: {link}

In [1]:
# Imports
import requests
from bs4 import BeautifulSoup
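# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0 (e.g. 3.8.3).
from gensim.summarization import summarize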

### We'll scrape and summarize the following article:
https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020

![article](images/article-choice.png)

In [3]:
# Retrieve page text
url = 'https://www.npr.org/2019/07/10/740387601/university-of-texas-austin-promises-free-tuition-for-low-income-students-in-2020'
page = requests.get(url).text
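
The request above assumes the download succeeds. As a hedged sketch (the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook), the same step can be wrapped so that a slow server or an HTTP error status fails loudly instead of silently handing an error page to the parser:

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page's HTML as text, raising on network or HTTP errors."""
    # timeout (in seconds) stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text

# Usage (needs network access, so it is left commented out here):
# page = fetch_html(url)
```

Both failure modes surface as subclasses of `requests.exceptions.RequestException`, so a caller can catch that one exception type.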

In [4]:
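# Turn page into BeautifulSoup object to access HTML tags.
# Naming the parser explicitly avoids a warning and keeps results consistent across machines.
soup = BeautifulSoup(page, 'html.parser')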

We can use Chrome's Inspect tool to find the HTML tag for any part of the page. Right-click on the page and choose "Inspect", then click the little element-selector button and pick the part of the page you are interested in. The button looks like this:

![little-button](images/little-button.png)

<br></br>

Let's find the tag which denotes the headline:

![select-headline](images/select-headline.png)

In [5]:
# Get headline
headline = soup.find('h1').get_text()
print(headline)

University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020


The main text of the article is wrapped in "p" tags, one per paragraph. This time we’ll have to find every "p" tag on the page rather than only the first match.
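
To make the difference between `find` and `find_all` concrete, here is a minimal sketch on a tiny, made-up HTML string (the markup and the `sample_` names are hypothetical and far simpler than the real NPR page):

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML used only for illustration
sample_html = """
<html><body>
  <h1>Sample Headline</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_p = sample_soup.find('p')      # only the first matching tag (None if there is no match)
all_p = sample_soup.find_all('p')    # every matching tag, in document order

first_text = first_p.get_text()
all_text = [tag.get_text().strip() for tag in all_p]
print(first_text)
print(all_text)
```

`find` is enough for the headline because a page normally has a single "h1", while the article body needs `find_all`.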

In [6]:
# Get text from all <p> tags.
p_tags = soup.find_all('p')
# Get the text from each of the "p" tags and strip surrounding whitespace.
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['Vanessa Romo',
 'Claire McInerny',
 'From',
 'The University of Texas-Austin announced Tuesday it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less a year.\n                \n                \n                    \n                    Jon Herskovitz/Reuters\n                    \n                \nhide caption',
 "Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the

In [7]:
# Keep only paragraphs that look like body text: drop any that contain a newline
# character '\n' (captions) or that don't contain a period (bylines and other fragments).
sentence_list = [sentence for sentence in p_tags_text if '\n' not in sentence]
sentence_list = [sentence for sentence in sentence_list if '.' in sentence]
sentence_list

["Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt. To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.",
 "The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.",
 '"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit students and make their degrees more affordable," Chairman of the Board of Regents Kevin Eltife said after the vote.',
 '"This will benefit students of our gr
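
The two filters are heuristics tuned to this page, so it helps to see exactly what they drop. The sketch below folds them into one hypothetical helper, `keep_body_paragraphs`, and runs it on shortened stand-ins for the strings shown in the output earlier:

```python
def keep_body_paragraphs(paragraphs):
    """Keep paragraphs that contain a period and no newline character."""
    return [p for p in paragraphs if '\n' not in p and '.' in p]

# Shortened stand-ins for the scraped strings, not the full scraped text
sample_paragraphs = [
    'Vanessa Romo',                                               # byline: no period
    'From',                                                       # fragment: no period
    'A photo caption.\n   Jon Herskovitz/Reuters\nhide caption',  # caption: contains newlines
    'The board voted unanimously on Tuesday.',                    # body text: kept
]

kept = keep_body_paragraphs(sample_paragraphs)
print(kept)
```

A body paragraph that contains no period at all (for example a one-line question) would also be dropped, so the rule may need adjusting for other sites.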

In [8]:
# Combine list items into string.
article = ' '.join(sentence_list)

In [9]:
# Summarize the article; ratio=0.3 asks for roughly 30% of the article's sentences.
summary = summarize(article, ratio=0.3)
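
gensim's `summarize` is extractive: it ranks the sentences already in the text (using a variation of TextRank) and returns the top ones unchanged. Because `gensim.summarization` is no longer available from gensim 4.0 onward, the sketch below shows the general idea with a much simpler stand-in that scores sentences by average word frequency. It is not gensim's algorithm, and the name `frequency_summary` and the naive regex sentence splitter are assumptions made for illustration.

```python
import re
from collections import Counter

def frequency_summary(text, ratio=0.3):
    """Return the top-scoring sentences (by average word frequency) in original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    if not sentences:
        return ''
    word_counts = Counter(re.findall(r'[a-z]+', text.lower()))

    def score(sentence):
        tokens = re.findall(r'[a-z]+', sentence.lower())
        return sum(word_counts[t] for t in tokens) / len(tokens) if tokens else 0.0

    # ratio is a share of the sentence count, not of the character count
    n_keep = max(1, round(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_keep])  # restore document order
    return '\n'.join(sentences[i] for i in keep)

sample_text = ('Tuition is free. Tuition is free for students. '
               'The weather was nice. Dogs bark.')
sample_summary = frequency_summary(sample_text, ratio=0.5)
print(sample_summary)
```

Because `ratio` counts sentences, a summary's share of the characters can be larger than the ratio when the selected sentences are long ones.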

In [10]:
print(f'Length of original article: {len(article)}')
print(f'Length of summary: {len(summary)} \n')
print(f'Headline: {headline} \n')
print(f'Article Summary:\n{summary}')

Length of original article: 4672
Length of summary: 1859 

Headline: University of Texas-Austin Promises Free Tuition For Low-Income Students In 2020 

Article Summary:
Four year colleges and universities have difficulty recruiting talented students from the lower end of the economic spectrum who can't afford to attend such institutions without taking on massive debt.
To remedy that — at least in part — the University of Texas-Austin announced it is offering full tuition scholarships to in-state undergraduates whose families make $65,000 or less per year.
The University of Texas System Board of Regents voted unanimously on Tuesday to establish a $160 million endowment, drawing from the state's Permanent University Fund to begin the program in the fall of 2020.
"Recognizing both the need for improved access to higher education and the high value of a UT Austin degree, we are dedicating a distribution from the Permanent University Fund to establish an endowment that will directly benefit

In [3]:
alln = "This book is for the person who wants more control over their money and wants to beat the average returns of average investors. This book will not tell you exactly what to do . . . that is because what you do to become rich and how you do it is really up to you ... yet this book will help guide you in understanding why some investors achieve much higher returns than the average investor, with less risk and less money, and in much less time.Ninety percent of investors are average investors and should continue saving, investing in mutual funds and their 401(k) and retirement funds. The information in this book is for the 10 percent who want to educate themselves to become professional investors and increase their investment returns and accelerate the growth of their financial portfolios."

In [4]:
alln


'This book is for the person who wants more control over their money and wants to beat the average returns of average investors. This book will not tell you exactly what to do . . . that is because what you do to become rich and how you do it is really up to you ... yet this book will help guide you in understanding why some investors achieve much higher returns than the average investor, with less risk and less money, and in much less time.Ninety percent of investors are average investors and should continue saving, investing in mutual funds and their 401(k) and retirement funds. The information in this book is for the 10 percent who want to educate themselves to become professional investors and increase their investment returns and accelerate the growth of their financial portfolios.'

In [7]:
allan = summarize(alln, ratio=0.5)

In [8]:
allan

'yet this book will help guide you in understanding why some investors achieve much higher returns than the average investor, with less risk and less money, and in much less time.Ninety percent of investors are average investors and should continue saving, investing in mutual funds and their 401(k) and retirement funds.\nThe information in this book is for the 10 percent who want to educate themselves to become professional investors and increase their investment returns and accelerate the growth of their financial portfolios.'
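
The summary above begins mid-sentence at "yet this book" and keeps "time.Ninety" fused, which suggests the sentence splitter is sensitive to the ellipses and the missing space in the input. One possible clean-up step, sketched here with a hypothetical helper `normalize_sentence_breaks`, is to tidy these spots before summarizing. Whether it changes the resulting summary depends on the summarizer, so treat it as an experiment rather than a fix.

```python
import re

def normalize_sentence_breaks(text):
    """Tidy spacing so a sentence splitter sees cleaner boundaries."""
    # Collapse spaced-out ellipses such as '. . .' into '...'
    text = re.sub(r'\.(?:\s+\.)+', lambda m: '.' * m.group(0).count('.'), text)
    # Add the missing space when a lowercase letter, a period and a capital run together
    text = re.sub(r'(?<=[a-z])\.(?=[A-Z])', '. ', text)
    return text

sample = 'what to do . . . that is up to you, and in much less time.Ninety percent'
cleaned = normalize_sentence_breaks(sample)
print(cleaned)
```

The second rule is deliberately narrow; it can still misfire on unusual strings such as a file name like `report.Final`, so it is worth checking the cleaned text by eye.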