## Enriching Web Syndication Feeds using Watson Natural Language Understanding

This notebook fetches a web syndication feed and enriches it with keyword, entity and category information. The notebook source is located at https://github.com/ibm-watson-data-lab/enriching-web-syndication-feeds

Setup:
 * Load [this notebook](https://raw.githubusercontent.com/ibm-watson-data-lab/enriching-web-syndication-feeds/master/enrich_syndication_feed.ipynb) into a project in [Data Science Experience](http://datascience.ibm.com/analytics) ([Instructions](https://apsportal.ibm.com/docs/content/analyze-data/creating-notebooks.html)) 
 * [Provision an instance of the Watson Natural Language Understanding service in Bluemix](https://console.bluemix.net/catalog/services/natural-language-understanding/) (use the _Lite_ plan, which is free!)
 * Take note of the service credentials.
 * Customize the syndication feed URL `feed_url` in the configuration settings cell, if desired
  ```
  # TODO: customize syndication feed URL
  feed_url = 'http://feeds.wnyc.org/radiolab'
  ```
 * Enter the credentials of your Watson Natural Language Understanding service instance in the configuration settings cell.
  ```
  # @hidden_cell
  # TODO: replace with your Watson Natural Language Understanding service credentials (https://console.bluemix.net/catalog/services/natural-language-understanding/)
  nlu_credentials = {
      "url": "https://gateway.watsonplatform.net/natural-language-understanding/api",
      "username": "my_nlu_username",
      "password": "my_nlu_password"
  }
  ```
  
 * Run all cells

In [None]:
# uncomment and run this cell if any of the libraries listed in the next cells cannot be imported; restart the kernel
#!pip install --user feedparser
#!pip install --user pixiedust
#!pip install --user --upgrade watson-developer-cloud

In [None]:
import feedparser
import numpy
import pandas as pd
import pixiedust
import pprint
import requests
import urllib
import watson_developer_cloud
import watson_developer_cloud.natural_language_understanding.features.v1 as Features
import IPython
from datetime import datetime
from time import mktime
from timeit import default_timer as timer

Customize the configuration.

In [None]:
# ---------------------------
# configuration settings
# ---------------------------
# @hidden_cell
# TODO: customize syndication feed URL
feed_url = 'http://feeds.wnyc.org/radiolab'

# TODO: replace with your Watson Natural Language Understanding service credentials (https://console.bluemix.net/catalog/services/natural-language-understanding/)
nlu_credentials = {
      "url": "https://gateway.watsonplatform.net/natural-language-understanding/api",
      "username": "",
      "password": ""
}

# Enrich feed data with information derived using Natural Language Understanding processing (True/False)
use_nlu = True

# log debug output (False/True)
debug = False

# Natural Language Understanding currently supports the following languages (expressed using ISO 639-1 code):
# "ar" (Arabic), "en" (English), "fr" (French), "de" (German), "it" (Italian), "pt" (Portuguese), "ru" (Russian), "es" (Spanish), and "sv" (Swedish). 
# Refer to https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/index.html#supported-languages for more information
# Set ignore_feed_language to False to enable processing of feeds in languages other than "en" (some NLU features might not work or return unexpected results)
ignore_feed_language = False

if use_nlu and \
    (nlu_credentials['username'] is None or len(nlu_credentials['username'].strip()) == 0 or \
     nlu_credentials['password'] is None or len(nlu_credentials['password'].strip()) == 0):
    print 'Error. Watson Natural Language Understanding credentials must be configured.'
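
The credential check at the end of the configuration cell can be factored into a small helper, which makes it easier to reuse. The following is a minimal sketch under that assumption; the name `credentials_configured` is hypothetical and does not appear in the notebook, which performs the check inline.

```python
def credentials_configured(credentials):
    """Return True if both username and password are non-blank strings.

    Hypothetical helper; it applies the same test as the configuration cell.
    """
    for key in ('username', 'password'):
        value = credentials.get(key)
        if value is None or len(value.strip()) == 0:
            return False
    return True


# placeholder values, as shipped in the configuration cell
empty_credentials = {
    'url': 'https://gateway.watsonplatform.net/natural-language-understanding/api',
    'username': '',
    'password': ''
}
# the sample values shown in the setup instructions
filled_credentials = dict(empty_credentials,
                          username='my_nlu_username',
                          password='my_nlu_password')
```

With these inputs, `credentials_configured(empty_credentials)` is `False` and `credentials_configured(filled_credentials)` is `True`.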


Load the feed.

In [None]:
# load feed
if feed_url is not None:
    feed = feedparser.parse(feed_url)
    # 'status' is missing if no HTTP response was received, so use get() to avoid a KeyError
    if feed.get('status') != 200:
        print 'Error. The feed could not be loaded from {}. The server returned status code {}'.format(feed_url, feed.get('status'))
    else:
        print 'Feed was loaded from {}. It contains {} items.'.format(feed_url, len(feed['entries']))
        language = feed.feed.get('language', 'en-us').split('-')[0]
        if language != 'en':
            print 'Warning! The feed uses language "{}".'.format(language)
            if ignore_feed_language:
                language = 'en'
                print 'Feed language will be ignored. Using default "en", which might yield unexpected results.'
            else:
                print 'One or more Watson Natural Language Understanding features might not work. Refer to https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/index.html#supported-languages for details. In the configuration settings cell set ignore_feed_language to True to use the default, which is English.'
        
else:
    print 'Error. A feed URL must be configured. Check your settings and re-run this cell.'
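
The language handling in the cell above reduces a feed language tag such as `en-us` to its primary subtag. The sketch below isolates that step and compares the result with the language codes listed in the configuration cell comments. The names `primary_language` and `NLU_LANGUAGES` are hypothetical, and the set only mirrors the comment in this notebook, so it may not reflect what the service supports today.

```python
# language codes copied from the configuration cell comment (may be outdated)
NLU_LANGUAGES = {'ar', 'en', 'fr', 'de', 'it', 'pt', 'ru', 'es', 'sv'}


def primary_language(language_tag, default_tag='en-us'):
    """Return the part of a language tag before the first hyphen.

    Falls back to default_tag if the feed does not declare a language.
    """
    return (language_tag or default_tag).split('-')[0]


feed_language = primary_language('de-de')
is_supported = feed_language in NLU_LANGUAGES
```

Here `feed_language` is `'de'` and `is_supported` is `True`; a tag such as `ja-jp` yields `'ja'`, which is not in the set.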

In [None]:
if debug:
    # print parsed feed entries
    for item in feed.entries:
            print '*' * 80
            pprint.pprint(item)

Create the `items_df` DataFrame representing the feed. Columns: publication year, publication month, publication day, episode title, episode tags, episode summary, episode URL

In [None]:
summmary_property_names = ['subtitle_detail', 'summary_detail']
# example schemes: None, 'http://www.itunes.com/'
schemes = [None, 'http://www.itunes.com/']

episodes = []
# extract information from every feed item
for item in feed.entries:
    # construct datetime object from publication date
    published = datetime.fromtimestamp(mktime(item['published_parsed']))

    # episode title and publication info
    episode = {
        'title': item.get('title', item.get('subtitle', None)),
        'published_year': published.strftime('%Y'),
        'published_month': published.strftime('%Y-%m'),
        'published_day': published.strftime('%Y-%m-%d')
    }
    
    # episode URL is either
    # (1) a web page for the specific episode or
    # (2) a direct link to the media (e.g. mp3 file)
    # (3) None
    url = item.get('link')
    if url is None:
        # no episode web page was specified; try getting the media URL
        # multiple media URLs might be specified (e.g. if media is available in different formats)
        mc = item.get('media_content', [])
        for c in mc:
            url = c.get('url')
            if url is not None:
                break
    episode['url'] = url
    
    # episode summary
    summary = episode['title']
    for pname in summmary_property_names:
        if item.get(pname) is not None:
            if len(item[pname].get('value','').strip()) > 0:
                summary = item[pname].get('value')
                break
    episode['summary'] = summary
    
    # episode tags
    if item.get('tags') is not None:
        tags = []
        for tag in item.get('tags'):
            if len(schemes) == 0 or tag['scheme'] in schemes:
                tags.append(tag['term'])
        tags.sort()
        episode['tags'] = ','.join(tags)
    else:
        episode['tags'] = None
        
    episodes.append(episode)

# create DataFrame
cols = ['published_year','published_month', 'published_day', 'title', 'tags', 'summary', 'url']
items_df = pd.DataFrame(episodes, columns = cols)

if debug:
    print 'DataFrame shape:\t{}'.format(items_df.shape)
    print 'DataFrame structure:\n{}'.format(items_df.dtypes)

IPython.display.display(items_df.head(5))
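
The summary fallback and the tag filter are the two least obvious steps in the cell above. The sketch below reproduces both on a made-up item that is shaped like a feedparser entry but is a plain dict. The helper names `pick_summary` and `filter_tags` and the sample values are assumptions for illustration only.

```python
def pick_summary(item, fallback,
                 property_names=('subtitle_detail', 'summary_detail')):
    """Return the first non-blank detail value, or the fallback (the title)."""
    for name in property_names:
        detail = item.get(name)
        if detail is not None and len(detail.get('value', '').strip()) > 0:
            return detail['value']
    return fallback


def filter_tags(item, schemes=(None, 'http://www.itunes.com/')):
    """Return accepted tag terms as a sorted CSV string, or None if untagged."""
    if item.get('tags') is None:
        return None
    terms = sorted(tag['term'] for tag in item['tags']
                   if len(schemes) == 0 or tag['scheme'] in schemes)
    return ','.join(terms)


# made-up item; a real feedparser entry carries many more properties
sample_item = {
    'title': 'Sample episode',
    'subtitle_detail': {'value': '   '},
    'summary_detail': {'value': '<p>A longer description.</p>'},
    'tags': [{'term': 'science', 'scheme': 'http://www.itunes.com/'},
             {'term': 'biology', 'scheme': None},
             {'term': 'internal', 'scheme': 'http://example.com/other'}],
}

summary = pick_summary(sample_item, sample_item['title'])
tags = filter_tags(sample_item)
```

The blank subtitle is skipped in favor of the summary detail, and the tag with an unlisted scheme is dropped, so `tags` is `'biology,science'`.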

Enrich `items_df` DataFrame with keyword, category, and entity information.

In [None]:
#
# enrich syndication feed by running Natural Language Understanding analysis for each item:
# - add categories for each item
# - add entities for each item
# - add keywords for each item
#
if use_nlu:
    
    # Apply optional entity filter; refer to https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/?python#entities for details
    # Examples: apply no filter; all entity types will be returned: [] 
    #           only identify companies, people and organizations: ['Company', 'Person', 'Organization']
    entity_filter = ['Company', 'Person', 'Organization']
    
    # Categorize content into a 5-level taxonomy. The top three categories will be returned as a CSV string. (https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/?python#categories)
    # category list: https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/categories.html
    items_df['categories'] = None
    
    # Identify people, cities, organizations, and many other types of entities (https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/?python#entities)
    # entity types and subtypes: https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/entity-types.html
    items_df['entities'] = None
    
    # Identify the important keywords (https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/?python#keywords)
    items_df['keywords'] = None   
    
    try: 
        nlu = watson_developer_cloud.NaturalLanguageUnderstandingV1(version='2017-02-27',
                                                                    username = nlu_credentials['username'],
                                                                    password = nlu_credentials['password'])
        
        def call_nlu(summary):
            ''' Send request to Watson NLU and fetch categories, entities and keywords
                Input: string to be analyzed. This string can contain HTML.
                Output: raw NLU response (Dict)
            '''
            
            if debug:
                start = timer()
            else:
                print '\b.',
            try:
                response = nlu.analyze(html = summary, \
                                       features = [Features.Categories(), Features.Entities(), Features.Keywords()], \
                                       language = language)
                if debug:
                    end = timer()
                    print 'NLU summary processing completed in {:.1f} seconds.'.format(end - start)
            except watson_developer_cloud.WatsonException as e:
                print u'\nError. Watson Natural Language Understanding processing for "{}" failed: {}'.format(summary, e)
                response = {}
            return response

        def getKeywords(response):
            ''' Extract keywords from NLU response
                Input: raw NLU response
                Output: array of keywords, formatted as a csv string: "keyword1,...,keywordN" or None
            '''
            out = []
            for keyword in response.get('keywords', []):
                # {u'relevance': 0.941254, u'text': u'pediatrician Nadine Burke'}
                out.append(keyword['text'])
            if len(out) > 0:
                return ','.join(out)
            return None

        def getEntities(response, *args):
            ''' Extract entities from NLU response
                Input: raw NLU response
                Input: Tuple containing explicitly white-listed entity types. If no white-list is provided all entity types will be captured
                Output: array of entities, formatted as a dict {'entity_type1': ['entity1',...], 'entity_type2': [...]} or None
            '''
            ignore_filter = len(args) == 0
            out = {}
            for entity in response.get('entities', []):
                # {u'count': 1, u'relevance': 0.974444, u'text': u'Nadine Burke Harris',  u'type': u'Person'}
                if ignore_filter or entity['type'] in args:
                    if out.get(entity['type']) is None:
                        out[entity['type']] = []
                    out[entity['type']].append(entity['text'])
                elif debug:
                    print 'Skipping entity {} because it is not white-listed.'.format(entity['type'])
            if len(out) > 0:
                return out
            else:
                return None

        def getCategories(response):
            ''' Extract categories from NLU response
                Input: raw NLU response
                Output: array of categories, formatted as a csv string: "category1,...,categoryN" or None
            '''
            out = []
            for category in response.get('categories', []):
                # {u'label': u'/science/biology', u'score': 0.578503}
                out.append(category['label'])
            if len(out) > 0:        
                return ','.join(out)
            return None   
    
        print 'Running Natural Language Understanding analysis for {} feed items'.format(len(items_df.index))
        nlu_start = timer()
        items_df['raw_nlu_response'] = items_df['summary'].apply(call_nlu)
        nlu_end = timer()
        print '\nNLU analysis completed in {:.1f} seconds'.format(nlu_end - nlu_start)
        
        items_df['keywords'] = items_df['raw_nlu_response'].apply(getKeywords)
        # pass the configured entity_filter instead of a hard-coded list; an empty filter captures all entity types
        items_df['entities'] = items_df['raw_nlu_response'].apply(getEntities, args = tuple(entity_filter))
        items_df['categories'] = items_df['raw_nlu_response'].apply(getCategories)
        
    except watson_developer_cloud.WatsonException as e:
        print u'Error. Watson Natural Language Understanding processing failed: {}'.format(e)
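
To see what the extraction helpers do without calling the service, the sketch below applies the same logic to a hand-written response dict. The keyword, category, and `Person` entries reuse the sample values from the comments in the cell above; the `Location` entity is made up to demonstrate the filter. The function names `keywords_csv` and `entities_by_type` are hypothetical, condensed stand-ins for `getKeywords` and `getEntities`.

```python
# hand-written stand-in for a raw NLU response; only the fields used here
sample_response = {
    'keywords': [{'relevance': 0.941254, 'text': 'pediatrician Nadine Burke'}],
    'entities': [
        {'count': 1, 'relevance': 0.974444,
         'text': 'Nadine Burke Harris', 'type': 'Person'},
        # made-up entity, present only to show the type filter
        {'count': 1, 'relevance': 0.5,
         'text': 'San Francisco', 'type': 'Location'},
    ],
    'categories': [{'label': '/science/biology', 'score': 0.578503}],
}


def keywords_csv(response):
    """Join keyword texts into a CSV string, or return None if there are none."""
    texts = [keyword['text'] for keyword in response.get('keywords', [])]
    return ','.join(texts) if texts else None


def entities_by_type(response, *allowed_types):
    """Group entity texts by type; with no allowed_types, keep every type."""
    out = {}
    for entity in response.get('entities', []):
        if not allowed_types or entity['type'] in allowed_types:
            out.setdefault(entity['type'], []).append(entity['text'])
    return out or None


filtered = entities_by_type(sample_response, 'Company', 'Person', 'Organization')
unfiltered = entities_by_type(sample_response)
```

`filtered` keeps only the `Person` entry, while `unfiltered` also contains the `Location` entry. Because keywords are joined with commas, a keyword that itself contains a comma would be split into several keywords by the later analysis cells.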
        

Inspect the enriched `items_df` DataFrame. If NLU analysis completed successfully the following new columns were added:
 - `categories` ([Details](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/categories.html))
 - `entities` ([Details](https://www.ibm.com/watson/developercloud/doc/natural-language-understanding/entity-types.html))
 - `keywords`
 - `raw_nlu_response` (NLU response details)

In [None]:
IPython.display.display(items_df.head(5))

### Prepare for tag analysis

Create two analysis DataFrames:
 - `tag_associations_df`: associations between tags and episodes. Columns: tag, episode title, publication year, publication month, publication day, episode URL
 - `tag_df`: summarized tag information for basic data exploration. Columns: tag, publication year, publication month, occurrences per month

In [None]:
# create tag dataframe
tag_associations = []
for row in items_df.itertuples():
    if row[5] is not None:
        for tag in row[5].split(','):
            tag = tag.strip()
            tag_associations.append((tag, row[4], row[1], row[2], row[3], row[7]))     

tag_associations_df = pd.DataFrame(tag_associations, columns=['tag','title','published_year', 'published_month','published_day','url'])
IPython.display.display(tag_associations_df.head(2))

# tags by year and month
tag_df = pd.DataFrame(tag_associations_df.groupby(['tag', 'published_year', 'published_month']).size().sort_values(ascending = False), columns = ['count']).reset_index()
IPython.display.display(tag_df.head(5))

if tag_df.size > 0:
    display(tag_df)
else:
    print 'No tags to display'
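
The two steps in the cell above, expanding each CSV tag string into one association per tag and then counting associations per tag and month, can be illustrated without pandas. The sketch below uses made-up rows, and `collections.Counter` stands in for `groupby(...).size()`; it is a plain-Python illustration of the idea, not the notebook's implementation.

```python
from collections import Counter

# made-up rows: (published_year, published_month, published_day, title, tags)
rows = [
    ('2017', '2017-05', '2017-05-02', 'Episode A', 'biology,science'),
    ('2017', '2017-05', '2017-05-16', 'Episode B', 'science'),
    ('2017', '2017-06', '2017-06-01', 'Episode C', None),
]

# step 1: one association per (tag, episode) pair; untagged episodes are skipped
associations = []
for year, month, day, title, tags in rows:
    if tags is not None:
        for tag in tags.split(','):
            associations.append((tag.strip(), title, year, month, day))

# step 2: count associations per (tag, year, month)
counts = Counter((tag, year, month)
                 for tag, _title, year, month, _day in associations)
most_frequent = counts.most_common(1)
```

The three rows produce three associations, and `science` is counted twice for `2017-05`. The keyword and category cells follow the same pattern with a different source column.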

### Prepare for keyword analysis

Create two DataFrames:
 - `keyword_associations_df`: associations between extracted keywords and episodes. Columns: keyword, episode title, publication year, publication month, publication day, episode URL
 - `keyword_df`: summarized monthly keyword information for basic data exploration. Columns: keyword, publication year, publication month, occurrences per month

In [None]:
if 'keywords' in items_df.columns:
    # create keyword dataframe
    keyword_associations = []
    for row in items_df.itertuples():
        if row[10] is not None:
            for keyword in row[10].split(','):
                keyword = keyword.strip()
                keyword_associations.append((keyword, row[4], row[1], row[2], row[3], row[7]))     

    keyword_associations_df = pd.DataFrame(keyword_associations, columns=['keyword','title','published_year', 'published_month','published_day','url'])
    IPython.display.display(keyword_associations_df.head(2))
    # keywords by year and month
    keyword_df = pd.DataFrame(keyword_associations_df.groupby(['keyword', 'published_year', 'published_month']).size().sort_values(ascending = False), columns = ['count']).reset_index()
    IPython.display.display(keyword_df.head(2))
    if keyword_df.size > 0:
        display(keyword_df)
    else:
        print 'No keywords to display'    
else:
    keyword_associations_df = None
    keyword_df = None
    print 'The source does not contain keyword information.'

### Prepare for category analysis

Create two DataFrames:
 - `category_associations_df`: associations between derived categories and episodes. Columns: category, episode title, publication year, publication month, publication day, episode URL
 - `category_df`: summarized monthly category information for basic data exploration. Columns: category, publication year, publication month, occurrences per month

In [None]:
if 'categories' in items_df.columns:
    # create category dataframe
    category_associations = []
    for row in items_df.itertuples():
        if row[8] is not None:
            for category in row[8].split(','):
                category = category.strip()
                category_associations.append((category, row[4], row[1], row[2], row[3], row[7]))     

    category_associations_df = pd.DataFrame(category_associations, columns=['category','title','published_year', 'published_month','published_day','url'])
    IPython.display.display(category_associations_df.head(2))
    # categories by year and month
    category_df = pd.DataFrame(category_associations_df.groupby(['category', 'published_year', 'published_month']).size().sort_values(ascending = False), columns = ['count']).reset_index()
    IPython.display.display(category_df.head(2))
    if category_df.size > 0:
        display(category_df)
    else:
        print 'No categories to display'   
else:
    category_associations_df = None
    category_df = None
    print 'The source does not contain category information.'

### Prepare for entity analysis

Extract entities (people, organizations, ...) and try to associate them with Wikipedia entries to provide context information.

Create two DataFrames:
 - `entity_associations_df`: associations between identified entities and episodes. Columns: entity type, entity value, episode title, publication year, publication month, publication day, episode URL, Wikipedia URL
 - `entity_df`: summarized monthly entity information for basic data exploration. Columns: entity type, entity value, publication year, publication month, occurrences per month

In [None]:
if 'entities' in items_df.columns:
    # validate wikipedia URL
    exact_url_only = True

    def getWikipediaURL(entity, exact_match_only = False):
        url = 'https://en.wikipedia.org/w/index.php?{}'.format(urllib.urlencode({'search': entity},'utf-8'))
        r = requests.head(url)
        print '\b.',
        if r.status_code == 302:
            return r.headers['Location']
        elif exact_match_only:
            return None
        else:
            return url    

    print 'Looking up entities on Wikipedia ' 
    # create entity dataframe and lookup entities on Wikipedia
    entity_associations = []
    for row in items_df.itertuples():
        if row[9] is not None:
            for entity_type in row[9].keys():
                for entity_value in row[9][entity_type]:
                    entity_associations.append((entity_type, entity_value, row[4], row[1], row[2], row[3], row[7], getWikipediaURL(entity_value, exact_url_only)))     
    
    entity_associations_df = pd.DataFrame(entity_associations, columns=['entity_type','entity_value', 'title','published_year', 'published_month','published_day','url', 'wikipedia_url'])
    IPython.display.display(entity_associations_df.head(2))
    # entities by year and month
    groupBy = ['entity_type', 'entity_value', 'published_year', 'published_month']
    entity_df = pd.DataFrame(entity_associations_df.groupby(groupBy).size().sort_values(ascending = False), columns = ['count']).reset_index()
    IPython.display.display(entity_df.head(200))
    if entity_df.size > 0:
        display(entity_df)
    else:
        print 'No entities to display'   
else:
    entity_associations_df = None
    entity_df = None
    print 'The source does not contain entity information.'
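
The Wikipedia lookup in the cell above has two parts: building a search URL and interpreting the response status. The sketch below separates them so that the decision logic can be followed without a network call. The names `wikipedia_search_url` and `resolve` are hypothetical, and the sketch only mirrors the notebook's assumption that a 302 redirect signals an exact article match; it does not verify how Wikipedia responds.

```python
try:
    from urllib import urlencode        # Python 2, as used by the notebook
except ImportError:
    from urllib.parse import urlencode  # Python 3


def wikipedia_search_url(entity):
    """Build the search URL that the notebook sends a HEAD request to."""
    return 'https://en.wikipedia.org/w/index.php?{}'.format(
        urlencode({'search': entity}))


def resolve(status_code, location, search_url, exact_match_only=False):
    """Pick the URL to store, given the HEAD response status and Location header."""
    if status_code == 302:
        # redirect: treated as an exact match, so keep the target URL
        return location
    if exact_match_only:
        return None
    return search_url


search_url = wikipedia_search_url('Nadine Burke Harris')
no_match = resolve(200, None, search_url, exact_match_only=True)
```

With `exact_match_only=True`, a response that is not a redirect yields `None`, which is why some rows in `entity_associations_df` can lack a `wikipedia_url`.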

Display the Wikipedia URLs of all entities that were identified in the feed.

In [None]:
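# entity_associations_df is None if no entity information is available
if entity_associations_df is not None:
    for url in entity_associations_df.get('wikipedia_url', pd.Series([])).unique():
        if url is not None:
            print url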

***