***Parsing Podcast Feeds to Create the Dataset***

In [2]:
import feedparser 

# Bulleye's podcast feed url 
podcast_feed_url = "https://www.npr.org/rss/podcast.php?id=510309"

# Parse the xml feed
feed_object = feedparser.parse(podcast_feed_url)

In [3]:
# Grab the show description and list of episodes 
show_description = feed_object.feed.description
podcast_episodes = feed_object.entries 

In [4]:
# Let's look at the show's description text
print(show_description)

Bullseye from NPR is your curated guide to culture. Jesse Thorn hosts in-depth interviews with brilliant creators, culture picks from our favorite critics and irreverent original comedy. Bullseye has been featured in Time, The New York Times, GQ and McSweeney's, which called it "the kind of show people listen to in a more perfect world." (Formerly known as The Sound of Young America.)


In [6]:
# Let's look at one of the episodes to see what data we have
episode_summary = podcast_episodes[0].summary
print(episode_summary)

This week, we're revisiting our conversation with Emmy-award winning actress Edie Falco. She's best known for her roles in <em>The Sopranos, Oz</em> and <em>Nurse Jackie</em>. When she spoke to us in 2018, she had just starred in the movie <em>Outside In</em>. Edie talks to Jesse about landing her first acting gig — which she started the day after she graduated from acting school at SUNY Purchase. Plus, Edie tells us why she thinks comedy isn't for her, and what it was like to work with James Gandolfini for nearly a decade on <em>The Sopranos</em>.


***Extracting Data from Plain Text with NLP***

In [25]:
import spacy, feedparser
from collections import defaultdict 

nlp = spacy.load("en_core_web_lg")

In [26]:
def extract_people(doc):
    # Merge any entities that are split across tokens 
    for ent in doc.ents:
        ent.merge()
    
    # Get a list of all the people mentioned in the text. 
    people_names = [entity.text for entity in doc.ents if entity.label_=="PERSON"]
    
    # Filter out names that aren't both a first and last name.
    people_names = [name for name in people_names if len(name.split(" "))==2]
    
    # Converting the list to a set removes any duplicate names 
    return list(set(people_names))

In [27]:
# Parse the podcast feed 
feed_object = feedparser.parse("https://www.npr.org/rss/podcast.php?id=510309")

# Grab the show description and list of episodes 
show_description = feed_object.feed.description
podcast_episodes = feed_object.entries

In [28]:
# Grab the hosts of the show from the show description
doc = nlp(show_description)
hosts = extract_people(doc)

In [29]:
# Create dictionaries to track apperances 
appearance_count = defaultdict(int)
appearance_list = defaultdict(list)

In [30]:
# Loop through each episode in the podcast feed
for episode in podcast_episodes:
    # Grab the episode's title and description text
    episode_title = episode.title 
    episode_description = episode.summary
    
    # Get a list of people that appear in the show description 
    doc = nlp(episode_description)
    people_in_episode = extract_people(doc)

In [31]:
# Record who appeared in the episode(if they aren't a host)
for person in people_in_episode:
    if person not in hosts:
        appearance_count[person] += 1
        appearance_list[person].append(episode_title)

In [35]:
# Now let's find the Top 3 most frequent guests on this podcast:
most_frequent_guests = sorted(
    appearance_count
)[0:3]
#print(appearance_list)

In [36]:
# Print out the results 
print(f"Show hosts: {hosts}")

Show hosts: ['Jesse Thorn']


In [39]:
for person in most_frequent_guests:
    # Next, let's look up all the specific episodes that a particular person appeared on:
    print(f"{person} appeared on the following episodes:")
    for episode_title in appearance_list[person]:
        print(" - {}".format(episode_title))
        
    print()

David Wain appeared on the following episodes:
 - David Wain & Belle and Sebastian

Doug Kenney appeared on the following episodes:
 - David Wain & Belle and Sebastian

Stuart Murdoch appeared on the following episodes:
 - David Wain & Belle and Sebastian



In [40]:
print(podcast_episodes)




