## RSS

You can find RSS feeds on many different sites. The [Library of Congress](https://www.loc.gov/rss/) offers many. Most blogs and news websites have them, for example [TechCrunch](https://techcrunch.com/rssfeeds/), the [New York Times](http://www.nytimes.com/services/xml/rss/index.html), and [NPR](https://help.npr.org/customer/portal/articles/2094175-where-can-i-find-npr-rss-feeds-). The [DC Public Library](http://www.dclibrary.org/) even gives you an RSS feed of your [catalog searches](https://catalog.dclibrary.org/client/rss/hitlist/dcpl/qu=python). iTunes delivers podcasts by [aggregating RSS feeds](http://itunespartner.apple.com/en/podcasts/faq) from content creators.

Today we are going to take a look at the [Netflix Top 100 DVDs](https://dvd.netflix.com/RSSFeeds). We will use the Python package [FeedParser](https://pypi.python.org/pypi/feedparser) to work with the RSS feed. FeedParser will allow us to deconstruct the data in the feed.

In [1]:
import feedparser
import pandas as pd

In [2]:
RSS_URL = "http://dvd.netflix.com/Top100RSS"

In [3]:
feed = feedparser.parse(RSS_URL)

In [4]:
type(feed)

feedparser.FeedParserDict

"parse" is the primary function in FeedParser. The returned object is dictionary-like and can be handled much like a dictionary. For example, we can look at the keys it contains and the types of the values stored under those keys.

In [5]:
feed.keys()

dict_keys(['feed', 'bozo', 'status', 'headers', 'href', 'namespaces', 'entries', 'encoding', 'version'])

In [6]:
type(feed.bozo)

int

In [7]:
type(feed.feed)

feedparser.FeedParserDict

We will look at some, but not all, of the data stored in the feed. For more information about the keys, see the [documentation](http://pythonhosted.org/feedparser/).

We can use the version to check which type of feed we have.

In [8]:
feed.version

'rss20'

Bozo is an interesting key to know about if you are going to parse an RSS feed in code. FeedParser sets the bozo bit when it detects that a feed is not well-formed. (FeedParser will still parse the feed if it is not well-formed.) You can use the bozo bit to drive error handling or just print a simple warning.

In [9]:
if feed.bozo == 0:
    print("Well done, you have a well-formed feed!")
else:
    print("Potential trouble ahead.")

Well done, you have a well-formed feed!


We can look at some of the feed elements through the feed attribute.

In [10]:
feed.feed.keys()

dict_keys(['subtitle_detail', 'ttl', 'cf_treatas', 'link', 'links', 'title', 'title_detail', 'language', 'subtitle'])

In [11]:
print(feed.feed.title)
print(feed.feed.link)
print(feed.feed.description)

Netflix Top 100
http://dvd.netflix.com
Top 100 Netflix movies, published every 2 weeks.


The [reference section](http://pythonhosted.org/feedparser/reference.html) of the feedparser documentation shows us all the information that can be in a feed. [Annotated Examples](http://pythonhosted.org/feedparser/annotated-examples.html) are also provided. But note the caution the documentation gives:

"Caution: Even though many of these elements are required according to the specification, real-world feeds may be missing any element. If an element is not present in the feed, it will not be present in the parsed results. You should not rely on any particular element being present."

For example, our feed is RSS 2.0. One of the elements available in this version is the published date.

In [12]:
feed.feed.published

AttributeError: object has no attribute 'published'

As the error shows, our feed does not include 'published'.

As with [standard Python dictionaries](https://docs.python.org/3.5/library/stdtypes.html#dict), we can use the "get" method to look up a key and fall back to a default value when the key is missing. This is useful when writing code that has to cope with feeds that omit some elements.

In [13]:
feed.feed.get('published', 'N/A')

'N/A'

The data we are looking for are contained in the entries. Given the feed we are working with, how many entries do you think we have?

In [14]:
len(feed.entries)

100

The items in entries are stored as a list.

In [15]:
type(feed.entries)

list

In [23]:
type(feed.entries[0])

feedparser.FeedParserDict

In [24]:
feed.entries[0].keys()

dict_keys(['links', 'summary_detail', 'link', 'title', 'guidislink', 'title_detail', 'summary', 'id'])

In [25]:
feed.entries[0].title

'Game of Thrones'

In [26]:
for i, entry in enumerate(feed.entries):
    print(i, entry.title)

0 Game of Thrones
1 Sully
2 Jason Bourne
3 The Magnificent Seven
4 The Accountant
5 Hell or High Water
6 Hacksaw Ridge
7 Star Trek Beyond
8 The Legend of Tarzan
9 Deepwater Horizon
10 Manchester by the Sea
11 Free State of Jones
12 Suicide Squad
13 Independence Day: Resurgence
14 The Girl on the Train
15 Inferno
16 Bad Moms
17 Homeland
18 X-Men: Apocalypse
19 Ghostbusters
20 Central Intelligence
21 Money Monster
22 Passengers
23 The Secret Life of Pets
24 Florence Foster Jenkins
25 Mechanic: Resurrection
26 Outlander
27 The Infiltrator
28 War Dogs
29 Jack Reacher: Never Go Back
30 Miss Peregrine's Home for Peculiar Children
31 Finding Dory
32 Now You See Me 2
33 Arrival
34 Doctor Strange
35 The Nice Guys
36 Moonlight
37 Me Before You
38 Snowden
39 The Jungle Book (2016)
40 Don't Breathe
41 The BFG
42 Ben-Hur
43 The Shallows
44 Blood Father
45 Captain Fantastic
46 Captain America: Civil War
47 Sausage Party
48 The Huntsman: Winter's War
49 Pete's Dragon
50 The Man Who Knew Infinity
51 W

Given that information, what is something we can do with this data? Why not make it a dataframe?

In [27]:
df = pd.DataFrame(feed.entries)

In [28]:
df.head()

| | guidislink | id | link | links | summary | summary_detail | title | title_detail |
|---|---|---|---|---|---|---|---|---|
| 0 | False | https://dvd.netflix.com/Movie/Game-of-Thrones/... | https://dvd.netflix.com/Movie/Game-of-Thrones/... | `[{'rel': 'alternate', 'href': 'https://dvd.net...` | `<a href="https://dvd.netflix.com/Movie/Game-of...` | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` | Game of Thrones | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` |
| 1 | False | https://dvd.netflix.com/Movie/Sully/80103102 | https://dvd.netflix.com/Movie/Sully/80103102 | `[{'rel': 'alternate', 'href': 'https://dvd.net...` | `<a href="https://dvd.netflix.com/Movie/Sully/8...` | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` | Sully | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` |
| 2 | False | https://dvd.netflix.com/Movie/Jason-Bourne/800... | https://dvd.netflix.com/Movie/Jason-Bourne/800... | `[{'rel': 'alternate', 'href': 'https://dvd.net...` | `<a href="https://dvd.netflix.com/Movie/Jason-B...` | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` | Jason Bourne | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` |
| 3 | False | https://dvd.netflix.com/Movie/The-Magnificent-... | https://dvd.netflix.com/Movie/The-Magnificent-... | `[{'rel': 'alternate', 'href': 'https://dvd.net...` | `<a href="https://dvd.netflix.com/Movie/The-Mag...` | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` | The Magnificent Seven | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` |
| 4 | False | https://dvd.netflix.com/Movie/The-Accountant/8... | https://dvd.netflix.com/Movie/The-Accountant/8... | `[{'rel': 'alternate', 'href': 'https://dvd.net...` | `<a href="https://dvd.netflix.com/Movie/The-Acc...` | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` | The Accountant | `{'base': 'http://dvd.netflix.com/Top100RSS', '...` |


Challenge: write code to create a dataframe of the top 10 movies from the Netflix Top 100 DVDs and iTunes. Check to see if each feed is well-formed. Compile the name of the feed as the source, the published date, the movie ranking in the list, the movie title, a link to the movie, and the summary. If the published date does not exist in the feed, use the current date. Save your dataframe as a CSV. Here is a link to one [possible solution](./rss_challenge.py).