# Working with RSS Feeds

## Lesson Goals

* Learn about the feedparser library
* Use feedparser to parse RSS feeds
* Explore the components of parsed RSS feeds
* Convert results into data frames and conduct analysis

## Introduction

In the previous lesson, we learned how to use Python to extract structured information from web APIs. In this lesson, we are going to take a look at another source of structured web content called RSS. **RSS stands for Rich Site Summary or Really Simple Syndication**, and it is a type of web feed which **allows users and applications to access updates to online content in a standardized, computer-readable format (typically XML)**.

Python has an excellent library called feedparser that is very useful for reading and parsing RSS feeds. We are going to be using this library throughout the lesson, so let's make sure it is installed and imported. 

In [1]:
# Install feedparser (run once, then comment this line out again)
!pip install feedparser

Collecting feedparser
  Downloading https://files.pythonhosted.org/packages/91/d8/7d37fec71ff7c9dbcdd80d2b48bcdd86d6af502156fc93846fb0102cb2c4/feedparser-5.2.1.tar.bz2 (192kB)
Building wheels for collected packages: feedparser
  Building wheel for feedparser (setup.py): started
  Building wheel for feedparser (setup.py): finished with status 'done'
  Stored in directory: C:\Users\ennes\AppData\Local\pip\Cache\wheels\8c\69\b7\f52763c41c5471df57703a0ef718a32a5e81ee35dcf6d4f97f
Successfully built feedparser
Installing collected packages: feedparser
Successfully installed feedparser-5.2.1


In [2]:
import feedparser

## RSS Feed Versions and Formats

Due to the way web feeds evolved, there are various versions of RSS (0.9X, 1.0, 2.0, etc.) as well as alternate forms of feeds (Atom, CDF, etc.). We would have to worry about slight differences in formats if we were going to parse the feeds manually, but feedparser is able to handle all of them, so that is one less thing we need to worry about.
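
To make those format differences concrete, here is a minimal sketch of what manual parsing could look like using only the standard library. The two feeds are tiny hand-written examples (not real feeds), and the helper names titles_from_rss and titles_from_atom are hypothetical. Notice that each format needs its own element names and, for Atom, a namespace.

```python
import xml.etree.ElementTree as ET

# Two tiny hand-written feeds (hypothetical content) in different formats.
RSS_20 = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example RSS feed</title>
    <item><title>First RSS post</title></item>
    <item><title>Second RSS post</title></item>
  </channel>
</rss>"""

ATOM = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Atom feed</title>
  <entry><title>First Atom post</title></entry>
</feed>"""

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def titles_from_rss(xml_text):
    # RSS 2.0 stores entries as <item> elements inside <channel>, with no namespace.
    root = ET.fromstring(xml_text)
    return [item.findtext("title") for item in root.findall("./channel/item")]


def titles_from_atom(xml_text):
    # Atom stores entries as namespaced <entry> elements directly under <feed>.
    root = ET.fromstring(xml_text)
    return [
        entry.findtext("atom:title", namespaces=ATOM_NS)
        for entry in root.findall("atom:entry", ATOM_NS)
    ]


rss_titles = titles_from_rss(RSS_20)
atom_titles = titles_from_atom(ATOM)
print(rss_titles)
print(atom_titles)
```

With feedparser, both feeds would be read through the same parse call, and the titles would be available in the same place in the result.
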

## Parsing RSS Feeds

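To parse an RSS feed with feedparser, you just need to call its parse function and pass it a URL. Let's take a look at an example using the feed for the r/tech subreddit. Reddit serves this feed in Atom format (note the application/atom+xml link type in the output below), and feedparser handles it the same way.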

In [3]:
reddit = feedparser.parse('https://www.reddit.com/r/tech.rss')

If we take a look at the results, we will see a nested dictionary structure that contains a lot of information and looks something like the following.

In [5]:
print(type(reddit))
print(reddit)

<class 'feedparser.FeedParserDict'>
{'feed': {'tags': [{'term': 'tech', 'scheme': None, 'label': 'r/tech'}], 'updated': '2019-08-10T10:39:43+00:00', 'updated_parsed': time.struct_time(tm_year=2019, tm_mon=8, tm_mday=10, tm_hour=10, tm_min=39, tm_sec=43, tm_wday=5, tm_yday=222, tm_isdst=0), 'icon': 'https://www.redditstatic.com/icon.png/', 'id': 'https://www.reddit.com/r/tech.rss', 'guidislink': True, 'link': 'https://www.reddit.com/r/tech', 'links': [{'rel': 'self', 'href': 'https://www.reddit.com/r/tech.rss', 'type': 'application/atom+xml'}, {'rel': 'alternate', 'href': 'https://www.reddit.com/r/tech', 'type': 'text/html'}], 'logo': 'https://f.thumbs.redditmedia.com/kI7eGVG6kaObGTdM.png', 'subtitle': 'The goal of /r/tech is to provide a space dedicated to the intelligent discussion of innovations and changes to technology in our ever changing world. We focus on high quality news articles about technology and informative and thought provoking self posts.', 'subtitle_detail': {'type': '

This is useful because the result behaves like the nested dictionaries and lists we worked with earlier in the program, so we can use the same techniques to explore it and extract the information we need.

## Exploring the Parsed Feed

Let's take a look at the first level of dictionary keys from the results and see what each of them looks like.

In [13]:
reddit.keys()

dict_keys(['feed', 'entries', 'bozo', 'headers', 'href', 'status', 'encoding', 'version', 'namespaces'])

These are the different components of the RSS feed, and each of them is going to contain information about something more specific. For example, **feed** is going to contain information about this Reddit RSS feed, **entries** is going to contain information about the specific entries in the feed, etc.

Since the feed component is itself a dictionary nested inside the **larger dictionary**, we can view its keys to get a sense of what information it holds.

In [7]:
reddit.feed.keys()

dict_keys(['tags', 'updated', 'updated_parsed', 'icon', 'id', 'guidislink', 'link', 'links', 'logo', 'subtitle', 'subtitle_detail', 'title', 'title_detail'])

Here, we can see that we could extract the feed's tags, the time it was last updated, and its icon image, as well as its title, subtitle, and various other pieces of information. You can see what each of those looks like by accessing the corresponding key as an attribute of reddit.feed.

In [14]:
reddit.feed.tags

[{'term': 'tech', 'scheme': None, 'label': 'r/tech'}]

In [15]:
reddit.feed.icon

'https://www.redditstatic.com/icon.png/'

In [16]:
reddit.feed.title

'/r/tech: Technological innovations and changes.'

In [17]:
reddit.feed.subtitle

'The goal of /r/tech is to provide a space dedicated to the intelligent discussion of innovations and changes to technology in our ever changing world. We focus on high quality news articles about technology and informative and thought provoking self posts.'

This is great, but the most interesting part of the feed is probably going to be the entries. We can access them as follows.

In [21]:
reddit.entries

[{'authors': [{'name': '/u/OriginalHoneyBadger',
    'href': 'https://www.reddit.com/user/OriginalHoneyBadger'}],
  'author_detail': {'name': '/u/OriginalHoneyBadger',
   'href': 'https://www.reddit.com/user/OriginalHoneyBadger'},
  'href': 'https://www.reddit.com/user/OriginalHoneyBadger',
  'author': '/u/OriginalHoneyBadger',
  'tags': [{'term': 'tech', 'scheme': None, 'label': 'r/tech'}],
  'content': [{'type': 'text/html',
    'language': None,
    'base': 'https://www.reddit.com/r/tech.rss',
    'value': '<!-- SC_OFF --><div class="md"><p>Hey guys!</p> <p>Thanks to <a href="/u/thonkerton">/u/thonkerton</a>, <a href="/r/tech">/r/tech</a> has an official discord server.</p> <p>The permanent join link is posted below, feel free to join and invite your friends. </p> <p>Should you have any questions, concerns or suggestions do not hesitate to reach out!</p> <p>Cheers!</p> <h4><a href="https://discord.gg/tech">https://discord.gg/tech</a></h4> </div><!-- SC_ON --> &#32; submitted by &#32

We can see that the data structure here is a **list in which each element is a dictionary holding the information for one entry**. We can access an individual entry by indexing the list, and then look at the keys available for that entry by calling its keys() method.

In [22]:
reddit.entries[0].keys()

dict_keys(['authors', 'author_detail', 'href', 'author', 'tags', 'content', 'summary', 'id', 'guidislink', 'link', 'links', 'updated', 'updated_parsed', 'title', 'title_detail'])

If we wanted to obtain a particular piece of data for an entry, we could index that entry and then access the key for the information we wanted. For example, if we wanted to get the title of the third entry (index 2, since Python lists are zero-indexed), we would obtain it as follows.

In [23]:
reddit.entries[2].title

'Optimus Ride’s autonomous system makes self-driving vehicles a reality'

To extract the titles for all the entries, we could use a **list comprehension**.

In [25]:
titles = [entry.title for entry in reddit.entries]
titles

["/r/Tech now has it's own Discord server!",
 'You can break iPhone’s FaceID by putting glasses on unconscious people',
 'Optimus Ride’s autonomous system makes self-driving vehicles a reality',
 'Facebook might be fined billions after losing facial recognition lawsuit',
 'This Device Can Hear You Talking to Yourself',
 'Google launches ‘Live View’ AR walking directions for Google Maps',
 "Amazon is developing high-tech surveillance tools for America's police but critics raise fears of privacy abuses.",
 'Robot, heal thyself: scientists develop self-repairing machines',
 'Xiaomi teases vastly unnecessary 100-megapixel phone camera',
 'Skype, Slack, other Electron-based apps can be easily backdoored',
 'Shout out to u/votiwo',
 'Scientists Use Smartphone to Control Drug Delivery to Brain',
 'Can We Upload Our Conciseness To Machine',
 'The U.S. Army Plans To Field the Most Powerful Laser Weapon Yet',
 'Evolution of Blockchain Technology',
 'Google’s Pixel 4 is reportedly getting a 90Hz 
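
The same idea extends to pulling several fields from each entry at once. The following is a minimal sketch that uses plain dictionaries as a hypothetical stand-in for reddit.entries (the sample values and the to_record helper are made up for illustration). It relies on get() so that an entry missing a key does not raise an error, and it converts the updated_parsed time tuple into a datetime object.

```python
import time
from datetime import datetime

# Hypothetical stand-in for reddit.entries: plain dicts with the same key names.
entries = [
    {
        "title": "First post",
        "author": "/u/alice",
        "updated_parsed": time.strptime("2019-08-09 12:06:17", "%Y-%m-%d %H:%M:%S"),
    },
    {
        "title": "Second post",  # the author key is left out on purpose
        "updated_parsed": time.strptime("2019-08-10 00:38:31", "%Y-%m-%d %H:%M:%S"),
    },
]


def to_record(entry):
    # get() returns a default instead of raising when a key is absent.
    stamp = entry.get("updated_parsed")
    return {
        "title": entry.get("title", ""),
        "author": entry.get("author", "unknown"),
        # The first six fields of a struct_time are year, month, day, hour, minute, second.
        "updated": datetime(*stamp[:6]) if stamp else None,
    }


records = [to_record(entry) for entry in entries]
for record in records:
    print(record)
```

Entries returned by feedparser support the same get() calls, so to_record could be applied to reddit.entries unchanged.
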

## Analyzing Data From an RSS Feed

Thus far, feedparser has helped us obtain data from an RSS feed and structure it in a way that makes it easy to explore and extract the information we need. If we wanted to analyze the data further, we could use the Pandas library to create a data frame containing the information about the entries in the feed.

In [36]:
import pandas as pd

# Each entry dictionary becomes one row; its keys become the columns
df = pd.DataFrame(reddit.entries)
df.head(3)

Unnamed: 0,author,author_detail,authors,content,guidislink,href,id,link,links,summary,tags,title,title_detail,updated,updated_parsed
0,/u/OriginalHoneyBadger,"{'name': '/u/OriginalHoneyBadger', 'href': 'ht...","[{'name': '/u/OriginalHoneyBadger', 'href': 'h...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/OriginalHoneyBadger,https://www.reddit.com/r/t3_7dx2ew,https://www.reddit.com/r/tech/comments/7dx2ew/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"<!-- SC_OFF --><div class=""md""><p>Hey guys!</p...","[{'term': 'tech', 'scheme': None, 'label': 'r/...",/r/Tech now has it's own Discord server!,"{'type': 'text/plain', 'language': None, 'base...",2017-11-19T00:37:30+00:00,"(2017, 11, 19, 0, 37, 30, 6, 323, 0)"
1,/u/seo-client,"{'name': '/u/seo-client', 'href': 'https://www...","[{'name': '/u/seo-client', 'href': 'https://ww...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/seo-client,https://www.reddit.com/r/t3_co1gf1,https://www.reddit.com/r/tech/comments/co1gf1/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",You can break iPhone’s FaceID by putting glass...,"{'type': 'text/plain', 'language': None, 'base...",2019-08-09T12:06:17+00:00,"(2019, 8, 9, 12, 6, 17, 4, 221, 0)"
2,/u/ourlifeintoronto,"{'name': '/u/ourlifeintoronto', 'href': 'https...","[{'name': '/u/ourlifeintoronto', 'href': 'http...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/ourlifeintoronto,https://www.reddit.com/r/t3_cob3pr,https://www.reddit.com/r/tech/comments/cob3pr/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",Optimus Ride’s autonomous system makes self-dr...,"{'type': 'text/plain', 'language': None, 'base...",2019-08-10T00:38:31+00:00,"(2019, 8, 10, 0, 38, 31, 5, 222, 0)"


In [33]:
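# json_normalize flattens nested dictionaries into dotted column names;
# it has been importable from the top level of pandas since version 1.0
from pandas import json_normalize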

In [38]:
json_normalize(reddit.entries).head(5)

Unnamed: 0,author,author_detail.href,author_detail.name,authors,content,guidislink,href,id,link,links,summary,tags,title,title_detail.base,title_detail.language,title_detail.type,title_detail.value,updated,updated_parsed
0,/u/OriginalHoneyBadger,https://www.reddit.com/user/OriginalHoneyBadger,/u/OriginalHoneyBadger,"[{'name': '/u/OriginalHoneyBadger', 'href': 'h...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/OriginalHoneyBadger,https://www.reddit.com/r/t3_7dx2ew,https://www.reddit.com/r/tech/comments/7dx2ew/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"<!-- SC_OFF --><div class=""md""><p>Hey guys!</p...","[{'term': 'tech', 'scheme': None, 'label': 'r/...",/r/Tech now has it's own Discord server!,https://www.reddit.com/r/tech.rss,,text/plain,/r/Tech now has it's own Discord server!,2017-11-19T00:37:30+00:00,"(2017, 11, 19, 0, 37, 30, 6, 323, 0)"
1,/u/seo-client,https://www.reddit.com/user/seo-client,/u/seo-client,"[{'name': '/u/seo-client', 'href': 'https://ww...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/seo-client,https://www.reddit.com/r/t3_co1gf1,https://www.reddit.com/r/tech/comments/co1gf1/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",You can break iPhone’s FaceID by putting glass...,https://www.reddit.com/r/tech.rss,,text/plain,You can break iPhone’s FaceID by putting glass...,2019-08-09T12:06:17+00:00,"(2019, 8, 9, 12, 6, 17, 4, 221, 0)"
2,/u/ourlifeintoronto,https://www.reddit.com/user/ourlifeintoronto,/u/ourlifeintoronto,"[{'name': '/u/ourlifeintoronto', 'href': 'http...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/ourlifeintoronto,https://www.reddit.com/r/t3_cob3pr,https://www.reddit.com/r/tech/comments/cob3pr/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",Optimus Ride’s autonomous system makes self-dr...,https://www.reddit.com/r/tech.rss,,text/plain,Optimus Ride’s autonomous system makes self-dr...,2019-08-10T00:38:31+00:00,"(2019, 8, 10, 0, 38, 31, 5, 222, 0)"
3,/u/seo-client,https://www.reddit.com/user/seo-client,/u/seo-client,"[{'name': '/u/seo-client', 'href': 'https://ww...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/seo-client,https://www.reddit.com/r/t3_cny2lo,https://www.reddit.com/r/tech/comments/cny2lo/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",Facebook might be fined billions after losing ...,https://www.reddit.com/r/tech.rss,,text/plain,Facebook might be fined billions after losing ...,2019-08-09T05:48:00+00:00,"(2019, 8, 9, 5, 48, 0, 4, 221, 0)"
4,/u/ourlifeintoronto,https://www.reddit.com/user/ourlifeintoronto,/u/ourlifeintoronto,"[{'name': '/u/ourlifeintoronto', 'href': 'http...","[{'type': 'text/html', 'language': None, 'base...",True,https://www.reddit.com/user/ourlifeintoronto,https://www.reddit.com/r/t3_cnxjhv,https://www.reddit.com/r/tech/comments/cnxjhv/...,[{'href': 'https://www.reddit.com/r/tech/comme...,"&#32; submitted by &#32; <a href=""https://www....","[{'term': 'tech', 'scheme': None, 'label': 'r/...",This Device Can Hear You Talking to Yourself,https://www.reddit.com/r/tech.rss,,text/plain,This Device Can Hear You Talking to Yourself,2019-08-09T04:51:49+00:00,"(2019, 8, 9, 4, 51, 49, 4, 221, 0)"
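
The difference between the two tables above is that json_normalize expands nested dictionaries such as author_detail and title_detail into separate columns with dotted names, while pd.DataFrame keeps each nested dictionary in a single column. Here is a small sketch of that behavior on hypothetical stand-in data, assuming pandas 1.0 or later so that pd.json_normalize is available.

```python
import pandas as pd

# Hypothetical stand-in for two parsed entries, each with one nested dictionary.
entries = [
    {
        "author": "/u/alice",
        "title": "First post",
        "title_detail": {"type": "text/plain", "value": "First post"},
    },
    {
        "author": "/u/bob",
        "title": "Second post",
        "title_detail": {"type": "text/plain", "value": "Second post"},
    },
]

nested = pd.DataFrame(entries)      # title_detail stays a column of dicts
flat = pd.json_normalize(entries)   # title_detail becomes dotted columns

print(sorted(nested.columns))
print(sorted(flat.columns))
```

Lists (for example the tags or links columns) are left as they are, which matches what the output above shows.
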


Now that we have the information in a data frame, we can use Pandas to perform a variety of aggregations and calculations. For example, suppose we wanted to know which author has posted the most entries. We could do that by aggregating by author, counting the number of entry titles, and then sorting the results in descending order.

In [39]:
authors = df.groupby('author', as_index=False).agg({'title':'count'})
authors.columns = ['author', 'entries']
authors.sort_values('entries', ascending=False)

Unnamed: 0,author,entries
16,/u/seo-client,3
13,/u/ourlifeintoronto,3
3,/u/Kylde,2
0,/u/Anders_Nystrom,1
11,/u/electricneurons,1
19,/u/vzhou842,1
18,/u/the_spotless_mind,1
17,/u/surajpal8447,1
15,/u/sayus99,1
14,/u/punkthesystem,1
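
For a simple count like the one above, value_counts offers a shorter alternative to grouping and aggregating. The sketch below uses a small hand-made data frame with hypothetical authors in place of the one built from the feed.

```python
import pandas as pd

# Hypothetical stand-in for the data frame built from the feed entries.
df = pd.DataFrame(
    {
        "author": ["/u/alice", "/u/bob", "/u/alice", "/u/carol", "/u/alice", "/u/bob"],
        "title": ["Post A", "Post B", "Post C", "Post D", "Post E", "Post F"],
    }
)

# value_counts counts the rows per author and sorts in descending order.
counts = df["author"].value_counts()
print(counts)
```

The result is a Series indexed by author rather than a data frame, so the groupby version remains the better fit when several columns need to be aggregated at once.
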


Similarly, if we wanted to see which entries had the longest titles, we could create a new column called title_length that contains the number of characters in the title and then sort the data frame by that new column. 

In [40]:
df['title_length'] = df['title'].apply(len)
df[['title', 'author', 'title_length']].sort_values('title_length', ascending=False)

Unnamed: 0,title,author,title_length
24,Japan successfully tests flying car which hove...,/u/Kylde,149
6,Amazon is developing high-tech surveillance to...,/u/djwired,113
15,Google’s Pixel 4 is reportedly getting a 90Hz ...,/u/seo-client,89
20,"Audi says its sleek new $2,000 electric scoote...",/u/electricneurons,86
19,Researchers discover troubling new security fl...,/u/Shrill_Hillary,79
21,AT&T employees took bribes to plant malware on...,/u/aptelement,76
17,Current breakthroughs in Quantum computing and...,/u/Feanuruz,75
3,Facebook might be fined billions after losing ...,/u/seo-client,72
2,Optimus Ride’s autonomous system makes self-dr...,/u/ourlifeintoronto,70
1,You can break iPhone’s FaceID by putting glass...,/u/seo-client,70
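
The same result can be reached with pandas string methods, which avoid calling apply with a Python function. The sketch below is one possible variant on hypothetical stand-in data: str.len() computes the length of every title, and nlargest returns only the rows with the longest titles.

```python
import pandas as pd

# Hypothetical stand-in for the data frame built from the feed entries.
df = pd.DataFrame(
    {
        "title": [
            "Short",
            "A somewhat longer title",
            "The longest title in this tiny example",
        ],
        "author": ["/u/alice", "/u/bob", "/u/carol"],
    }
)

# str.len() counts the characters in each title.
df["title_length"] = df["title"].str.len()

# nlargest keeps the n rows with the largest values in the given column.
longest = df.nlargest(2, "title_length")[["title", "author", "title_length"]]
print(longest)
```
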


These are just a couple of the things you can analyze about the entries using the information we were able to obtain.

## Summary

RSS feeds are an important source of information because they provide us with structured, parseable data that we would otherwise have had to obtain via messier methods such as web scraping. In this lesson, we introduced the feedparser library and learned how to use it to parse an RSS feed. Once we had a parsed feed, we learned how we could explore the contents of that feed and view the different types of information contained within. Finally, we saw an example of how to take data parsed from an RSS feed, convert it into a Pandas data frame, and perform some basic analyses on it.