# Web Scraping

## Part \#4: Getting XML from RSS Feeds

### What is an RSS feed?

RSS stands for "Really Simple Syndication." It's just a page of data conforming to the XML format that is updated frequently and can be processed in an automated way.
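
To make that concrete, here is a minimal sketch of what an RSS document looks like and how a program can read it. The feed text, titles, and example.com URLs are invented for illustration. The sketch uses Python's built-in `xml.etree.ElementTree` module (not Beautiful Soup) so that it runs without a network connection or extra packages.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS 2.0 document (not a real feed).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)      # root is the <rss> element
channel = root.find("channel")        # the single <channel> child

# The channel describes the feed itself; each <item> is one story.
print(channel.findtext("title"))
for item in channel.findall("item"):
    print(item.findtext("title"), "->", item.findtext("link"))
```

The publisher changes the `<item>` entries whenever new stories appear, while the surrounding tags stay in place.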

### Exploring Some RSS Feeds

Many organizations have RSS feeds. Some links are provided below that will allow you to find some of these feeds. Spend a few minutes exploring some prominent examples using Google Chrome. 

*Note: If the RSS feed you are looking at is a collection of HTML-style tags, you are in the right place. If not, right-click and select "View page source."* 

Okay, time to explore:

* Local News:
  * _Chicago Tribune_: http://www.chicagotribune.com/cs-rssfeeds-htmlstory.html
  * _The Daily Herald_: http://www.dailyherald.com/rss/
  * _The Chicago Sun Times_: http://www.thesuntimes.com/section/feed


* National/International News: 
  * _Reuters_: https://www.reuters.com/tools/rss
  * _USA Today_: https://www.usatoday.com/rss/
  * _The New York Times_: http://www.nytimes.com/services/xml/rss/index.html
  * _BBC News_: http://www.bbc.com/news/10628494


* Technology News: 
  * _Wired.com_: https://www.wired.com/about/rss_feeds/
  * _Ars Technica_: https://arstechnica.com/rss-feeds/
  * _CNET_: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * _ESPN_: http://www.espn.com/espn/news/story?page=rssinfo
  * _Illinois Commerce Commission_: https://www.icc.illinois.gov/rss/
  * _US Congress_: https://www.congress.gov/rss
  * _NASA_: https://www.nasa.gov/content/nasa-rss-feeds


**Task \#1:** After you explore a few of the feeds above, try to find an RSS feed for another website you are interested in. This may be a news website for a certain type of news you like to follow (video games, style/fashion, politics, etc). Then fill in the information below for the feed (or feeds) you found:

Organization(s): Pitchfork

URL(s): https://pitchfork.com/rss/

Description(s): The RSS page links to XML feeds for different parts of the Pitchfork brand, including music news, album reviews, and lists of the best albums in various genres and categories. The news feed keeps the same tag structure on every update, but the data stored between the tags changes each time.


#### An RSS Feed from the Wall Street Journal

The beauty of an RSS feed is that its content is updated regularly, but the structure of its tags always stays the same. This allows you to extract up-to-date data in an automated fashion. 

For example, here are the top stories from the technology section (WSJD) of _The Wall Street Journal_ from their RSS feed: https://feeds.a.dj.com/rss/RSSWSJD.xml

After looking at this link in Chrome, explore it using Beautiful Soup:

In [1]:
from bs4 import BeautifulSoup  
from urllib.request import urlopen

xml_page = urlopen("https://feeds.a.dj.com/rss/RSSWSJD.xml")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')

print(bs_obj.prettify())

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dj="http://dowjones.net/rss/" xmlns:wsj="http://dowjones.net/rss/">
 <channel>
  <title>
   WSJ.com: WSJD
  </title>
  <link>
   http://online.wsj.com
  </link>
  <atom:link href="http://online.wsj.com" rel="self" type="application/rss+xml"/>
  <description>
   WSJD - Technology
  </description>
  <language>
   en-us
  </language>
  <pubDate>
   Wed, 17 Apr 2019 20:10:01 -0400
  </pubDate>
  <lastBuildDate>
   Wed, 17 Apr 2019 20:10:01 -0400
  </lastBuildDate>
  <copyright>
   Dow Jones &amp; Company, Inc.
  </copyright>
  <generator>
   http://online.wsj.com
  </generator>
  <docs>
   http://cyber.law.harvard.edu/rss/rss.html
  </docs>
  <image>
   <title>
    WSJ.com: WSJD
   </title>
   <link>
    http://online.wsj.com
   </link>
   <url>
    http://online.wsj.com/img/wsj_sm_logo.gif
   </url>
  </image>
  <item>
   <title>
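
Some servers may be slow to respond or may turn away requests that do not identify themselves. As a hedged sketch, the download step can be wrapped in a small helper that sets a timeout and a `User-Agent` header. The names `build_request`, `fetch_feed`, and the `USER_AGENT` string are hypothetical and are not part of the original notebook.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier for this sketch; use whatever describes your script.
USER_AGENT = "rss-notebook-example/0.1"

def build_request(url):
    """Return a Request that identifies the script via a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})

def fetch_feed(url, timeout=10):
    """Download the raw bytes of a feed, giving up after `timeout` seconds."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by `fetch_feed("https://feeds.a.dj.com/rss/RSSWSJD.xml")` could then be passed to `BeautifulSoup(..., 'xml')` in place of the `urlopen` result used above.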

Now you can get a list of all headlines:

In [2]:
headlines = bs_obj.find_all('title')

print(headlines)

[<title>WSJ.com: WSJD</title>, <title>WSJ.com: WSJD</title>, <title>Pinterest Targets $19 a Share in IPO, Bankers Tell Investors</title>, <title>Qualcomm's Bet on 5G Pays Off</title>, <title>Uber Nears Investment Deal for Self-Driving Car Unit</title>, <title>In SEC vs. Elon Musk, a Question of When Tweets Matter</title>, <title>Netflix Subscriber Count Rises, but Growth Slows at Home</title>, <title>IBM's Shares Slide as Growth Challenges Remain</title>, <title>Microsoft's New Xbox One S Won't Play Videogame Discs</title>, <title>Nokia, the 1990s Cellphone Pioneer, Wants to Topple Huawei</title>, <title>IBM Struggles to Jump-Start Its Turnaround</title>, <title>Apple and Qualcomm Reach Patent Deal, Drop All Litigation</title>, <title>Texting Moves to the Workplace, as Do the Awkward Misfires. 'I'm Here. I Luv U.'</title>, <title>The U.S. Wants to Ban Huawei. But in Some Places, AT&amp;T Relies On It.</title>, <title>T-Mobile-Sprint Deal Runs Into Resistance From DOJ Antitrust Staff</t

If you want to strip out the `<title>` tags, use the *.getText()* method:

In [3]:
headlines = [story.getText() for story in headlines]

print(headlines)

['WSJ.com: WSJD', 'WSJ.com: WSJD', 'Pinterest Targets $19 a Share in IPO, Bankers Tell Investors', "Qualcomm's Bet on 5G Pays Off", 'Uber Nears Investment Deal for Self-Driving Car Unit', 'In SEC vs. Elon Musk, a Question of When Tweets Matter', 'Netflix Subscriber Count Rises, but Growth Slows at Home', "IBM's Shares Slide as Growth Challenges Remain", "Microsoft's New Xbox One S Won't Play Videogame Discs", 'Nokia, the 1990s Cellphone Pioneer, Wants to Topple Huawei', 'IBM Struggles to Jump-Start Its Turnaround', 'Apple and Qualcomm Reach Patent Deal, Drop All Litigation', "Texting Moves to the Workplace, as Do the Awkward Misfires. 'I'm Here. I Luv U.'", 'The U.S. Wants to Ban Huawei. But in Some Places, AT&T Relies On It.', 'T-Mobile-Sprint Deal Runs Into Resistance From DOJ Antitrust Staff', 'Delays to T-Mobile-Sprint Deal Would Dent SoftBank', 'Big News for People Who Spend Hours Staring at Maps on Planes', 'Sony Cracks Down on Sexually Explicit Content in Games', 'Families Use A

In this feed, the first two titles appear to be for the news website rather than for news stories themselves. This is an easy fix:

In [4]:
headlines = headlines[2:]

print(headlines)

['Pinterest Targets $19 a Share in IPO, Bankers Tell Investors', "Qualcomm's Bet on 5G Pays Off", 'Uber Nears Investment Deal for Self-Driving Car Unit', 'In SEC vs. Elon Musk, a Question of When Tweets Matter', 'Netflix Subscriber Count Rises, but Growth Slows at Home', "IBM's Shares Slide as Growth Challenges Remain", "Microsoft's New Xbox One S Won't Play Videogame Discs", 'Nokia, the 1990s Cellphone Pioneer, Wants to Topple Huawei', 'IBM Struggles to Jump-Start Its Turnaround', 'Apple and Qualcomm Reach Patent Deal, Drop All Litigation', "Texting Moves to the Workplace, as Do the Awkward Misfires. 'I'm Here. I Luv U.'", 'The U.S. Wants to Ban Huawei. But in Some Places, AT&T Relies On It.', 'T-Mobile-Sprint Deal Runs Into Resistance From DOJ Antitrust Staff', 'Delays to T-Mobile-Sprint Deal Would Dent SoftBank', 'Big News for People Who Spend Hours Staring at Maps on Planes', 'Sony Cracks Down on Sexually Explicit Content in Games', 'Families Use Apps to Track Relatives With Dement

You can do the same thing with links to these stories if you'd like:

In [5]:
urls = bs_obj.find_all('link')
urls = [link.getText() for link in urls]
print(urls)

['http://online.wsj.com', '', 'http://online.wsj.com', 'https://www.wsj.com/articles/pinterest-and-zoom-to-test-ipo-market-after-lyfts-stumble-11555493401?mod=rss_Technology', 'https://www.wsj.com/articles/qualcomms-bet-on-5g-pays-off-11555535027?mod=rss_Technology', 'https://www.wsj.com/articles/uber-nears-investment-deal-for-self-driving-car-unit-11555523985?mod=rss_Technology', 'https://www.wsj.com/articles/in-sec-vs-elon-musk-a-question-of-when-tweets-matter-11555493401?mod=rss_Technology', 'https://www.wsj.com/articles/netflix-subscriber-count-rises-but-growth-slows-at-home-11555446946?mod=rss_Technology', 'https://www.wsj.com/articles/ibms-shares-tumble-as-challenges-remain-in-hunt-for-growth-11555526709?mod=rss_Technology', 'https://www.wsj.com/articles/microsofts-new-xbox-one-s-wont-play-videogame-discs-11555449000?mod=rss_Technology', 'https://www.wsj.com/articles/nokia-the-1990s-cellphone-pioneer-wants-to-topple-huawei-11555438823?mod=rss_Technology', 'https://www.wsj.com/art

This time, the story links start at the fourth entry (index 3). The first three entries are channel-level links, and the second of them is an empty string.

In [6]:
urls = urls[3:]

print(urls)

['https://www.wsj.com/articles/pinterest-and-zoom-to-test-ipo-market-after-lyfts-stumble-11555493401?mod=rss_Technology', 'https://www.wsj.com/articles/qualcomms-bet-on-5g-pays-off-11555535027?mod=rss_Technology', 'https://www.wsj.com/articles/uber-nears-investment-deal-for-self-driving-car-unit-11555523985?mod=rss_Technology', 'https://www.wsj.com/articles/in-sec-vs-elon-musk-a-question-of-when-tweets-matter-11555493401?mod=rss_Technology', 'https://www.wsj.com/articles/netflix-subscriber-count-rises-but-growth-slows-at-home-11555446946?mod=rss_Technology', 'https://www.wsj.com/articles/ibms-shares-tumble-as-challenges-remain-in-hunt-for-growth-11555526709?mod=rss_Technology', 'https://www.wsj.com/articles/microsofts-new-xbox-one-s-wont-play-videogame-discs-11555449000?mod=rss_Technology', 'https://www.wsj.com/articles/nokia-the-1990s-cellphone-pioneer-wants-to-topple-huawei-11555438823?mod=rss_Technology', 'https://www.wsj.com/articles/ibms-revenue-falls-again-11555445687?mod=rss_Tec
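
The two slices above depend on how many channel-level `<title>` and `<link>` tags come before the first story. An alternative sketch is to walk each `<item>` and read its own title and link, which keeps every headline paired with its URL. The sample XML below is a hand-trimmed stand-in built from two of the stories shown above, plus one hypothetical item with no link. The sketch uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup so that it runs offline, and the function name `stories_from_rss` is an assumption.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed stand-in for the feed; the last item is hypothetical.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>WSJ.com: WSJD</title>
    <link>http://online.wsj.com</link>
    <image>
      <title>WSJ.com: WSJD</title>
      <link>http://online.wsj.com</link>
    </image>
    <item>
      <title>Qualcomm's Bet on 5G Pays Off</title>
      <link>https://www.wsj.com/articles/qualcomms-bet-on-5g-pays-off-11555535027?mod=rss_Technology</link>
    </item>
    <item>
      <title>Uber Nears Investment Deal for Self-Driving Car Unit</title>
      <link>https://www.wsj.com/articles/uber-nears-investment-deal-for-self-driving-car-unit-11555523985?mod=rss_Technology</link>
    </item>
    <item>
      <title>Hypothetical story with no link</title>
    </item>
  </channel>
</rss>"""

def stories_from_rss(xml_text):
    """Return a list of (title, link) pairs, one per complete <item>."""
    root = ET.fromstring(xml_text)
    stories = []
    for item in root.iter("item"):
        title = item.findtext("title")   # None if the tag is missing
        link = item.findtext("link")
        if title and link:               # skip items missing either part
            stories.append((title.strip(), link.strip()))
    return stories

stories = stories_from_rss(SAMPLE)
for title, link in stories:
    print(title, "|", link)
```

With Beautiful Soup, the same idea would be to call `find_all('item')` and then `find('title')` and `find('link')` on each item rather than on the whole document.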

**Task \#2:** Write a function, *random_headline(headline_list, link_list)*, that accepts a list of headlines and a list of links as input and returns an output string in the format "HEADLINE, read more at LINK."

**Note:** Be sure to test your function to make sure it works as expected. Show the results of your tests below.

**Note:** Not every headline may have a link, so your two lists may not be parallel. Try re-running the cells for the most up-to-date listings!

In [30]:
# Your code here
import random

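def random_headline(headline_list, link_list):
    # only use positions that exist in both lists
    length = min(len(headline_list), len(link_list))
    # pick a random index; randrange excludes the upper bound,
    # unlike randint(0, length), which could index past the end
    choice = random.randrange(length)
    # pick article
    headline = headline_list[choice]
    # pick link
    link = link_list[choice]
    # concatenate message
    output = headline
    output += ", read more at "
    output += link
    return output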

random_headline(headlines,urls)

'Facebook and Google Get an Unusual Crew of Allies in Europe, read more at https://www.wsj.com/articles/facebook-and-google-get-an-unusual-crew-of-allies-in-europe-11555172293?mod=rss_Technology'
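
The note for Task #2 warns that the two lists may not be parallel. One possible way to guard against lists of different lengths is to pair them with `zip()` and choose from the pairs. This is only a sketch: the name `random_story` and the sample data are made up, and `zip()` does not help if a link is missing from the middle of a list, because every later pair would then be shifted by one.

```python
import random

def random_story(headline_list, link_list):
    """Return 'HEADLINE, read more at LINK' for one randomly chosen pair."""
    # zip() stops at the shorter list, so every headline kept has a link.
    pairs = list(zip(headline_list, link_list))
    if not pairs:
        raise ValueError("need at least one headline and one link")
    headline, link = random.choice(pairs)
    return f"{headline}, read more at {link}"

# Made-up data: three headlines but only two links.
sample_headlines = ["Story A", "Story B", "Story C"]
sample_links = ["http://example.com/a", "http://example.com/b"]
print(random_story(sample_headlines, sample_links))
```

Because `random.choice` picks from the list of pairs, there is no index arithmetic that could run past the end of either list.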

### Processing Other RSS Feeds

You already perused a few RSS feeds. This time, pick one of those feeds (or a new one) and explore it by writing code. As a reminder, here are some recommended feeds: 

* Local News:
  * _Chicago Tribune_: http://www.chicagotribune.com/cs-rssfeeds-htmlstory.html
  * _The Daily Herald_: http://www.dailyherald.com/rss/
  * _The Chicago Sun Times_: http://www.thesuntimes.com/section/feed


* National/International News: 
  * _Reuters_: https://www.reuters.com/tools/rss
  * _USA Today_: https://www.usatoday.com/rss/
  * _The New York Times_: http://www.nytimes.com/services/xml/rss/index.html
  * _BBC News_: http://www.bbc.com/news/10628494


* Technology News: 
  * _Wired.com_: https://www.wired.com/about/rss_feeds/
  * _Ars Technica_: https://arstechnica.com/rss-feeds/
  * _CNET_: https://www.cnet.com/rss/
  
  
* Miscellaneous (Sports, Government, Science):  
  * _ESPN_: http://www.espn.com/espn/news/story?page=rssinfo
  * _Illinois Commerce Commission_: https://www.icc.illinois.gov/rss/
  * _US Congress_: https://www.congress.gov/rss
  * _NASA_: https://www.nasa.gov/content/nasa-rss-feeds

**Task \#3:** Pick any of the feeds above (or more than one). Experiment using Python code, and show the results of your experimentation below. 

**Note:** This question is fairly open-ended, but at a minimum you must do the following: 

* Create a "random headline"-style function like you did above. You should not expect code such as _headlines = headlines[2:]_ or _urls = urls[3:]_ to fit perfectly with your data since these were modifications that may be specific to the way _The Wall Street Journal_'s RSS feed is organized.

* You must show that you engaged with the feed(s) you picked using the Beautiful Soup module and at least one Python data structure (probably lists). You will need to analyze the XML for the feed you pick and write your code to fit with the data. 
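
One way to analyze a new feed before slicing is to count how many `<title>` and `<link>` tags appear before the first `<item>`. The sketch below does this on a made-up feed using the standard library's `xml.etree.ElementTree`. The helper name `count_before_first_item` and the sample XML are assumptions. ElementTree also names namespaced tags (such as `atom:link`) differently from Beautiful Soup, so the counts may not match what `find_all` returns on a real feed.

```python
import xml.etree.ElementTree as ET

# Made-up feed with channel-level and image-level titles and links.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example Sports</title>
    <link>http://example.com</link>
    <image>
      <title>Example Sports</title>
      <link>http://example.com</link>
      <url>http://example.com/logo.gif</url>
    </image>
    <item>
      <title>Story one</title>
      <link>http://example.com/one</link>
    </item>
    <item>
      <title>Story two</title>
      <link>http://example.com/two</link>
    </item>
  </channel>
</rss>"""

def count_before_first_item(root, tag):
    """Count elements named `tag` that precede the first <item>."""
    count = 0
    for element in root.iter():      # document order, top to bottom
        if element.tag == "item":
            break
        if element.tag == tag:
            count += 1
    return count

root = ET.fromstring(SAMPLE)
title_offset = count_before_first_item(root, "title")
link_offset = count_before_first_item(root, "link")

# Use the measured offsets instead of hard-coded slice values.
story_titles = [e.text for e in root.iter("title")][title_offset:]
story_links = [e.text for e in root.iter("link")][link_offset:]
print(title_offset, link_offset)
print(story_titles)
print(story_links)
```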

In [53]:
# Your code here
from bs4 import BeautifulSoup  
from urllib.request import urlopen

# get the XML file and convert it into a BeautifulSoup object
xml_page = urlopen("http://www.espn.com/espn/rss/news")   # Opens whatever page we are requesting
bs_obj = BeautifulSoup(xml_page, 'xml')

# get titles
titles = bs_obj.find_all('title')
titles = [story.getText() for story in titles]
titles = titles[2:]

# get links
links = bs_obj.find_all('link')
links = [link.getText() for link in links]
links = links[2:]

import random

def random_image(titles, links):
    length = len(titles)
    print(length)
    # pick a random index (randrange excludes the upper bound,
    # so the index can never run past the end of the list)
    choice = random.randrange(length)
    # pick article
    title = titles[choice]
    # pick link
    link = links[choice]
    # concatenate message
    output = title
    output += ", read the article at "
    output += link
    return(output)

random_image(titles,links)

34


'Terence Crawford and Amir Khan scout ... themselves, read the article at http://www.espn.com/boxing/story/_/id/26531888/terence-crawford-amir-khan-scout'