## Arxiv Concept Explorer

**[Arxiv](http://arxiv.org/)** is a site where researchers can post research papers at any time. The site can contain some of the **most recent, state-of-the-art research**, since papers may be posted prior to conference or journal submission deadlines. Arxiv also provides open access to all of its papers, spanning many disciplines.

In this notebook, we develop an application for **discovering high-level concepts and keywords** that appear in Arxiv papers, and **retrieving papers** relevant to the concepts. The application serves as a way of discovering emerging ideas, surveying the prominent themes in a field, or efficiently navigating through the wealth of content on the site.

The notebook uses **Declarative Widgets** and the **Bokeh visualization library**, demonstrating how these two tools can be used together to build powerful interactive notebook applications.

------------
### Setup

Initialize the declarative widgets extension.

In [None]:
from urth import widgets

widgets.init()

The application requires an `AlchemyAPI` API key ([available here](http://www.alchemyapi.com/api/register.html)).

Run the next 3 cells and use the widget below to enter your key:

In [None]:
%%html
<!-- Import Dependencies -->
<link rel="import" href="urth_components/urth-viz-table/urth-viz-table.html" is="urth-core-import">
<link rel="import" href="urth_components/paper-input/paper-input.html" is='urth-core-import' package='PolymerElements/paper-input'>
<link rel="import" href="urth_components/paper-button/paper-button.html" is='urth-core-import' package='PolymerElements/paper-button'>
<link rel="import" href="urth_components/paper-icon-button/paper-icon-button.html" is='urth-core-import' package='PolymerElements/paper-icon-button'>
<link rel="import" href="urth_components/paper-progress/paper-progress.html" is='urth-core-import' package='PolymerElements/paper-progress'>

<!-- Define data channels -->
<urth-core-channel name='plot' id='plotChannel'></urth-core-channel>
<urth-core-channel name='status' id='statusChannel'></urth-core-channel>
<urth-core-channel name='table' id='tableChannel'></urth-core-channel>

In [None]:
def set_key(key):
    global ALCHEMY_API_KEY
    ALCHEMY_API_KEY = key
    print("AlchemyAPI key set!")

In [None]:
%%html
Please input your AlchemyAPI key and click "Set":
<template is="dom-bind">
    <urth-core-function id="setkey" ref='set_key' arg-key="{{key}}"></urth-core-function>
    <paper-input label="AlchemyAPI Key" value="{{key}}"></paper-input>
    <paper-button raised onclick="setkey.invoke()">
      Set
    </paper-button>
</template>

Next, we'll install and import the necessary libraries. We assume that `conda` can be used to install `Bokeh`, or that `Bokeh` is already installed:

In [None]:
!conda install -y bokeh

In [None]:
!pip install feedparser

In [None]:
import urllib.parse
import urllib.request
import feedparser
import time
import json
import re
import operator
from datetime import datetime
from collections import defaultdict

from bokeh.models import ColumnDataSource, CustomJS
from bokeh.models.renderers import GlyphRenderer
from bokeh.plotting import figure, show, output_notebook

import numpy as np
import pandas as pd

from urth.widgets.widget_channels import channel

In [None]:
output_notebook()

--------------

We'll be downloading the `NUM_PAPERS` most recent Arxiv papers using the [Arxiv API](http://arxiv.org/help/api/user-manual):

In [None]:
base_url = 'http://export.arxiv.org/api/query?'

# Number of papers per Arxiv API request.
PAGE_SIZE = 100

# Maximum number of papers to download.
NUM_PAPERS = 500

The `entry_cache` dictionary caches the papers downloaded from Arxiv for each query, making repeat searches quicker:

In [None]:
entry_cache = {}

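To make the paging concrete, here is a minimal sketch of the request URLs that a single search produces. The constants are copied from the cells above, and the URL layout mirrors the notebook's `build_query` function. `page_urls` is a hypothetical helper written for illustration only, and it makes no network calls. With the default settings it yields five requests of 100 papers each, with the final page trimmed when the total is not a multiple of `PAGE_SIZE`.

```python
# Values copied from the cells above.
BASE_URL = 'http://export.arxiv.org/api/query?'
PAGE_SIZE = 100
NUM_PAPERS = 500

def page_urls(query, num_papers=NUM_PAPERS, page_size=PAGE_SIZE):
    """Hypothetical helper: list the request URLs for one search."""
    urls = []
    for start in range(0, num_papers, page_size):
        # The last page may be smaller than page_size.
        size = min(page_size, num_papers - start)
        urls.append(
            '{}search_query={}&start={}&max_results={}'
            '&sortBy=submittedDate&sortOrder=descending'.format(
                BASE_URL, query.replace(' ', '+'), start, size))
    return urls

urls = page_urls('all:bioinformatics')
print(len(urls))
print(urls[0])
```
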
-------------
The next few cells contain numerous functions for retrieving, searching, plotting, and displaying the Arxiv data.

#### Retrieving Data Functions

This cell contains functions for querying the Arxiv API, calling AlchemyAPI, and post-processing the text and results.

AlchemyAPI's `TextGetRankedConcepts` and `TextGetRankedKeywords` endpoints are used for finding concepts and keywords in the downloaded paper data:

In [None]:
def download_entries(query, max_results=10000, use_cache=True):
    """ 
    Downloads entries for the given query term from Arxiv.
    
    Entries are retrieved in PAGE_SIZE chunks. A delay of
    3 seconds is introduced between each API call per Arxiv
    API guidelines.
        
    :param query: String to query for
    :param max_results: Integer
    :param use_cache: Cache downloaded entries for queries when True
    :return: List of dictionaries, where each dictionary is an Arxiv entry.
    """
    query = query.replace(' ', '+')
    if use_cache and query in entry_cache:
        channel('status').set('status', "Using cached data.")
        channel('status').set('progress', 100)
        return entry_cache[query]
    
    entries = []
    
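    # Round up so that a final partial page is still requested.
    iters = max((max_results + PAGE_SIZE - 1) // PAGE_SIZE, 1)
    for i in range(iters):
        start = i*PAGE_SIZE
        q = build_query(query, start, min(PAGE_SIZE, max_results - start))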
        channel('status').set('status', "Retrieving papers from Arxiv. Downloading results {}-{} out of {}...".format(
                start+1, min(start+PAGE_SIZE, max_results), max_results)
             )
        channel('status').set('progress', start / max_results * 100)
        response = urllib.request.urlopen(base_url+q).read()
        feed = feedparser.parse(response)
        if len(feed.entries) == 0:
            break
        entries.extend(feed.entries)
        
        # to be nice to the Arxiv servers
        time.sleep(3)
    channel('status').set('status', "Downloaded {} entries.".format(len(entries)))
    channel('status').set('progress', 100)
    entry_cache[query] = entries
    return entries

def build_query(query, start, max_results): 
    """ Constructs the Arxiv API query. """
    return 'search_query={}&start={}&max_results={}&sortBy=submittedDate&sortOrder=descending'.format(query,
                                                         start,
                                                         max_results)

def _clean_text(txt):
    """ 
    :param txt: A String of text.
    :return: A String of cleaned text, words space-separated.
    """
    alpha_only = re.sub(r"[^a-zA-Z\s]+", "", txt.replace('\n', ' ')).lower()
    return alpha_only

def get_text(entries):
    """ 
    Retrieve the text to be used as input to Alchemy API.
    
    :param entries: List of Arxiv entries [{entry_1}, ..., {entry_n}]
    :return: A String containing the text from all entries.
    """
    entry_text = []
    for e in entries:
        entry_text.append(_clean_text(e.title))
        # Exclude due to AlchemyAPI truncation
        # entry_text.append(_clean_text(e.summary))
    all_text = ' '.join(entry_text)
    return all_text

def _alchemy_api_call(txt, url):
    """ 
    Make an Alchemy API POST request using the given text.
    
    :param txt: Text to send
    :param url: Url to POST to
    :return: Dictionary parsed from the JSON response.
    """
    base = url
    params = urllib.parse.urlencode(dict(
            apikey=ALCHEMY_API_KEY, text=txt, outputMode='json'))
    req = urllib.request.Request(base, bytes(params, 'ascii'))
    response = urllib.request.urlopen(req).read()
    results = json.loads(response.decode('utf-8'))
    return results

def concepts_api_call(txt):
    """ 
    Retrieve concepts for the given string of text using Alchemy API.
    
    :param txt: Text to send
    :return: Response dictionary whose 'concepts' entry is a list of concept dictionaries.
    """
    base = "http://gateway-a.watsonplatform.net/calls/text/TextGetRankedConcepts"
    return _alchemy_api_call(txt, base)

def keywords_api_call(txt):
    """ 
    Retrieve keywords for the given string of text using Alchemy API.
    
    :param txt: Text to send
    :return: Response dictionary whose 'keywords' entry is a list of keyword dictionaries.
    """
    base = "http://gateway-a.watsonplatform.net/calls/text/TextGetRankedKeywords"
    return _alchemy_api_call(txt, base)

def get_concepts(entries):
    """ 
    Retrieve concepts for the given Arxiv entries.
    
    :param entries: List of Arxiv entries represented as dictionaries.
    :return: Response dictionary whose 'concepts' entry is a list of concept dictionaries.
    """
    txt = get_text(entries)
    return concepts_api_call(txt)

def get_keywords(entries):
    """ 
    Retrieve keywords for the given Arxiv entries.
    
    :param entries: List of Arxiv entries represented as dictionaries.
    :return: Response dictionary whose 'keywords' entry is a list of keyword dictionaries.
    """
    txt = get_text(entries)
    return keywords_api_call(txt)

def _plot_data(results_dict, kind):
    xs = [c['text'] for c in results_dict[kind]]
    ys = [float(c['relevance']) for c in results_dict[kind]]
    return (xs, ys)   

def keywords_plot_data(keywords_dict):
    """ 
    Format data from a keywords dictionary for use in a plot.
    
    :param keywords_dict: Response dictionary containing a 'keywords' list.
    :return: ([keyword labels], [relevance scores])
    """
    return _plot_data(keywords_dict, 'keywords')

def concepts_plot_data(concepts_dict):
    """ 
    Format data from a concept dictionary for use in a plot.
    
    :param concepts_dict: Response dictionary containing a 'concepts' list.
    :return: ([concept labels], [relevance scores])
    """
    return _plot_data(concepts_dict, 'concepts')

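As a rough illustration of the data these functions pass to the plots, the sketch below converts a mocked response into label and score lists. The response shape is inferred from how `_plot_data` reads it (a list under `'concepts'` or `'keywords'`, each item carrying `'text'` and `'relevance'`); it is not copied from a real AlchemyAPI reply, and the values are made up. `to_plot_data` is a hypothetical stand-in for `_plot_data` that also tolerates a missing key.

```python
# Mocked response: the shape is an assumption inferred from _plot_data,
# and the values are invented for illustration.
mock_response = {
    'concepts': [
        {'text': 'Machine learning', 'relevance': '0.95'},
        {'text': 'Bioinformatics', 'relevance': '0.71'},
    ],
}

def to_plot_data(results, kind):
    """Split a response into parallel label and score lists."""
    items = results.get(kind, [])
    labels = [item['text'] for item in items]
    # Scores are assumed to arrive as strings, hence the float() conversion.
    scores = [float(item['relevance']) for item in items]
    return labels, scores

labels, scores = to_plot_data(mock_response, 'concepts')
print(labels)
print(scores)
```

Returning empty lists for a missing key is a design choice of this sketch; the notebook's `_plot_data` raises a `KeyError` in that case.
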
#### Searching Data Functions

This cell contains functions for finding papers that are relevant to a concept or keyword. We use [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) scoring to retrieve the relevant papers.

Note the `top_docs_for_query` function, which performs the search for relevant documents. This function will be used in the `on_plot_click` handler below, and returns results in a form that works with the `urth-viz-table`:

In [None]:
def top_docs_for_query(query, entries, tf_idf_index, n=100):
    """
    Finds the most relevant documents for a given query.
    
    A document is represented as a list of several 'columns' that
    can be consumed by a table. Thus, the function also returns the
    column schema.
    
    Only results with positive relevance are returned.
    
    :param query: String concept or keyword to find relevant docs for.
    :param entries: List of dicts, where each dict is an Arxiv entry
    :param tf_idf_index: Dictionary, Index used to compute document relevance
    :param n: Maximum number of results to return.
    """
    results = find_relevances_for_query(query, entries, tf_idf_index)
    sorted_results = sorted(results.items(), key=operator.itemgetter(1), reverse=True)
    relevances = compute_relevances(sorted_results)
    id_relevance_pairs = [(pair[0], rel) for pair, rel in zip(sorted_results, relevances)]
    data = [DocResult(eid, rel).table_data() for eid, rel in id_relevance_pairs if rel > 0.0][:n]
    return data, DocResult.columns

def _datestring(struct_date):
    return "{}/{}/{}".format(
        struct_date.tm_mon, struct_date.tm_mday, struct_date.tm_year)

def get_entry_by_id(eid, entries):
    for e in entries:
        if e.id == eid:
            return e

def find_relevances_for_query(query, entries, tf_idf_index):
    """
    Computes the relevance of each entry to the given query.
    
    A document's relevance is determined by the total tf-idf 
    score of the query words with respect to the document.
    
    :param query: String
    :param entries: List of dicts
    :param tf_idf_index: Dictionary, Index used to compute relevance
    :return: Dictionary mapping entry_id to relevance score
    """
    entry_rels = defaultdict(int)
    for i, entry in enumerate(entries):
        for word in query.split(' '):
            # .get avoids inserting missing keys into a defaultdict index.
            tf_idf_score = tf_idf_index.get((i, word.lower()), 0.0)
            entry_rels[entry.id] += tf_idf_score
    return entry_rels

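def compute_relevances(id_rel_pairs):
    """
    Scales relevance scores to [0, 1] by dividing by the maximum score.

    :param id_rel_pairs: List of (entry_id, relevance) pairs
    :return: List of scaled relevance scores, in the same order
    """
    rels = [rel for _, rel in id_rel_pairs]
    max_r = max(rels, default=0.0)
    if max_r <= 0:
        # No entry matched the query; avoid dividing by zero.
        return [0.0 for _ in rels]
    return [rel / max_r for rel in rels]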


def _create_doc(entry, exclude=()):
    """
    Creates a bag-of-words representation of an
    entry's text.
    
    :param entry: An Arxiv entry dictionary
    :param exclude: List of words to exclude in the
                    representation
    :return: Dictionary representing the entry's text
             of the form words[word] = frequency
    """
    words = defaultdict(float)
    for section in [entry.title, entry.summary]:
        text = _clean_text(section).split(' ')
        for word in text:
            if word not in exclude and word != '':
                words[word] += 1.0
    return words

def _get_vocab(docs):
    """
    :param docs: List of document dictionaries
    :return: Set of unique word tokens in docs
    """
    vocab = set()
    for doc in docs:
        for word in doc.keys():
            vocab.add(word)
    return vocab

def build_tf_idf_index(entries):
    """
    Builds an index with tf-idf scores for all
    (entry_number, word) pairs. Words are extracted from each entry
    using the _create_doc function.
    
    Entry numbers are equivalent to the entry's index in `entries`.
    
    :param entries: list of dicts, where each dict is an Arxiv entry
    :return: Dictionary index with tf-idf scores, of the form:
             index[(entry_number, word)] = float tf_idf score
    """
    channel('status').set('status', "Building tf-idf index...")
    channel('status').set('progress', 0)
    index = defaultdict(float)
    docs = [_create_doc(e) for e in entries]
    vocab = _get_vocab(docs)
    num_docs = len(docs)
    for i, doc in enumerate(docs):
        for word in vocab:
            if word in doc:
                index[(i, word)] = tf_idf(word, i, docs)
        if i % 100 == 0:
            channel('status').set('progress', (i + 1) / num_docs * 100)
    channel('status').set('progress', 100)
    channel('status').set('status', "tf-idf index built.")
    return index

def tf_idf(term, entry_index, docs):
    """
    Computes the tf-idf score for the term with respect to the
    document at position `entry_index` in `docs`.
    """
    return tf(term, docs[entry_index])*idf(term, docs)

def tf(term, document):
    """
    Number of times the term appears in the document.
    """
    return document[term]

def idf(term, documents):
    """
    Inverse document frequency of the term for
    the list of documents.
    """
    n = len(documents)
    return np.log(n * 1.0 / df(term, documents))

def df(term, documents):
    """
    Number of documents the term appears in.
    """
    n = 0
    for doc in documents:
        if term in doc:
            n += 1
    return n

class DocResult:
    """
    Represents a retrieved document.
    """
    
    # Column names for the table_data representation
    columns = ['Title', 'Author', 'Date', 'Link', 'Relevance']
    
    def __init__(self, entry_id, relevance):
        # Note: looks up the entry in the global `entries` list set by run_all.
        self.entry = get_entry_by_id(entry_id, entries)
        self.relevance = relevance
        
    def table_data(self):
        """
        Represents this DocResult in a form consumable
        by a table.
        """
        row = []
        for column in self.columns:
            if column == 'Date':
                row.append(_datestring(self.entry['published_parsed']))
            elif column == 'Relevance':
                row.append(self.relevance)
            else:
                row.append(self.entry[column.lower()])
        return row

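The scoring can be tried on a toy corpus. The sketch below follows the same recipe as the functions above (raw term count times `log(N / df)`, summed over the query words, then divided by the best score), but it is a simplified re-implementation on invented documents rather than the notebook's indexed version. The names `docs`, `tf_idf`, and `rank` are local to this example.

```python
import math
from collections import Counter

# Toy corpus; in the notebook each document is a paper's title plus summary.
docs = [
    Counter('graph neural network for protein structure'.split()),
    Counter('protein folding with deep network'.split()),
    Counter('survey of graph algorithms'.split()),
]

def tf_idf(term, doc, docs):
    """Raw term count times log(N / document frequency)."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return doc[term] * math.log(len(docs) / df)

def rank(query, docs):
    """Sum tf-idf over the query words, then scale by the best score."""
    scores = [sum(tf_idf(w, d, docs) for w in query.lower().split())
              for d in docs]
    best = max(scores)
    if best <= 0:
        # Nothing matched; report zero relevance everywhere.
        return [0.0] * len(docs)
    return [s / best for s in scores]

print(rank('graph protein', docs))
```

The first document contains both query words and so receives the top score. A word that occurs in every document contributes nothing, because its inverse document frequency is `log(1) = 0`.
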
#### Plotting Functions

Now we define the two Bokeh plots used in the notebook, and functions for updating plots that have already been rendered.

Note the use of `push_notebook` to dynamically update the Bokeh plots:

In [None]:
def _compute_bars(ys):
    bar_width = 0.2
    tops = ys
    bottoms = [0 for y in ys]
    lefts = [x-bar_width + 1 for x in np.arange(0, len(ys))]
    rights = [x+bar_width + 1 for x in np.arange(0, len(ys))]
    return (tops, bottoms, lefts, rights)

def _compute_text_positions(ys):
    x = [x - 0.2 for x in np.arange(1, len(ys)+1)]
    y = [y + 0.05 for y in ys]
    return (x, y)

def update_relevances_plot(xs, ys, p):
    """
    Update the given relevance plot with new data.
    
    xs: List of bar labels
    ys: List of relevance scores
    p: A rendered relevance plot
    """
    rend = p.select(dict(type=GlyphRenderer))[0]
    tops, bottoms, lefts, rights = _compute_bars(ys)
    x_text, y_text = _compute_text_positions(ys)
    rend.data_source.data['top'] = tops
    rend.data_source.data['bottom'] = bottoms
    rend.data_source.data['left'] = lefts
    rend.data_source.data['right'] = rights
    rend.data_source.data['keywords'] = xs
    rend.data_source.data['relevances'] = ys
    rend.data_source.data['x'] = x_text
    rend.data_source.data['y'] = y_text
    rend.data_source.data['text'] = xs
    rend.data_source.push_notebook()

def plot_relevances(data, kind, n=10, p=None, height=350):
    """
    Plots keywords/concepts vs. relevance.
    
    Updates the plot if it has already been rendered to prevent
    duplication.
    
    :param data: ([bar labels], [relevance scores])
    :param kind: Used to label the x-axis and title, e.g. 'keyword' or 'concept'
    :param n: Maximum number of bars to display
    :param p: Optional already-rendered plot.
    :param height: Optional height of the figure.
    :return: The rendered plot.
    """
    xs, ys = data

    xs = xs[:n]
    ys = ys[:n]
    
    if p is not None:
        update_relevances_plot(xs, ys, p)
        return p
    
    title = "Relevant {}s".format(kind)
    p = figure(
        width=1000, 
        height=height, 
        y_range=[0, 3], 
        title=title,
        tools="tap")
    p.yaxis.axis_label = "Relevance"
    p.xaxis.axis_label = kind
    
    tops, bottoms, lefts, rights = _compute_bars(ys)
    x_text, y_text = _compute_text_positions(ys)
    
    source = ColumnDataSource(data=dict(
            top=tops, bottom=bottoms, left=lefts, right=rights,
            keywords=xs, relevances=ys,
            x=x_text,
            y=y_text,
            text=xs
        ))
    
    p.quad(
        top='top', bottom='bottom', left='left', right='right', 
        source=source, tags=['a'], line_color='black'
    )
    p.text(x='x', y='y', 
           text='text', source=source, angle=np.pi/3
    )
    p.xgrid[0].ticker.desired_num_ticks = len(xs)
    p.xaxis.major_label_orientation = np.pi/4
    
    # Set the selected keyword using urth-core-channel when
    # an item on the plot is clicked.
    source.callback = CustomJS(code=""" 
        var selectedIndices = cb_obj.get('selected')['1d'].indices;
        var allData = cb_obj.get('data');
        var selectedKeyword = allData.keywords[selectedIndices[0]];

        document.getElementById('tableChannel').set('keyword', selectedKeyword);
        document.getElementById('plotChannel').set('selectedKeyword', selectedKeyword);
    """)
    
    show(p)
    return p

def update_frequencies_plot(dates, fqs, p):
    """
    Update the given frequencies plot with new data.
    
    dates: List of dates
    fqs: List of frequencies
    p: A rendered frequencies plot
    """
    rend = p.select(dict(type=GlyphRenderer))[0]
    rend.data_source.data['dates'] = dates
    rend.data_source.data['fqs'] = fqs
    rend.data_source.push_notebook()

def plot_date_frequencies(entries, p=None):
    """
    Plots the number of publications per day.
    
    Updates the plot if it has already been rendered to prevent
    duplication.
    
    :param entries: List of dicts, where each dict is an Arxiv entry
    :param p: Optional already-rendered plot.
    :return: The rendered plot.
    """
    dates = [d.astype(datetime).date() for d in np.array(
                [e.published for e in entries], dtype=np.datetime64)]

    freqs = defaultdict(int)

    for d in dates:
        freqs[str(d)] += 1

    fqs = [freqs[str(d)] for d in dates]

    if p is not None:
        update_frequencies_plot(dates, fqs, p)
        return p
    
    source = ColumnDataSource(data=dict(
        dates=dates,
        fqs=fqs
    ))
    
    title = "Publications per day"
    p = figure(
        width=1000, 
        height=340, 
        x_axis_type="datetime",
        title=title
    )
    p.yaxis.axis_label = "# Paper Submissions"
    p.xaxis.axis_label = "Date"
    
    # add renderers
    p.circle('dates', 'fqs', source=source)
    p.line('dates', 'fqs', source=source)

    # show the results
    show(p)
    return p

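The bar geometry in `_compute_bars` may be easier to follow in isolation. The sketch below reproduces the same arithmetic in plain Python: bars are centered on x = 1, 2, 3, ... and extend 0.2 to either side, so the value named `bar_width` in the notebook acts as a half-width. `bar_edges` is a hypothetical helper, and the scores are invented.

```python
def bar_edges(scores, half_width=0.2):
    """Return (lefts, rights) x-coordinates for bars centered at 1, 2, 3, ..."""
    centers = range(1, len(scores) + 1)
    lefts = [c - half_width for c in centers]
    rights = [c + half_width for c in centers]
    return lefts, rights

scores = [0.9, 0.6, 0.3]
lefts, rights = bar_edges(scores)
for score, left, right in zip(scores, lefts, rights):
    print('bar from x={:.1f} to x={:.1f}, height {}'.format(left, right, score))
```

Each bar's top is its relevance score and its bottom is 0, which is what the `quad` glyph receives through the `ColumnDataSource`.
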
#### Watch Handler

After clicking a keyword, the most relevant papers are retrieved
and displayed in an `urth-viz-table`. This is implemented using:

1. A JavaScript callback attached to the Bokeh plot
2. An `urth-core-channel` `watch` handler

The JavaScript callback sets a value that triggers the `watch` handler.
The `watch` handler then retrieves the top documents and `set`s a value containing
the document data, as defined below:

In [None]:
def on_plot_click(old, new):
    global entries
    global index
    keyword = new
    if keyword is not None:
        docs, columns = top_docs_for_query(keyword, entries, index)
        data = {
            "data": docs,
            "columns": columns,
            "timestamp": int(round(time.time() * 1000))
        }
        channel('table').set('data', data)
    
channel('plot').watch('selectedKeyword', on_plot_click)

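The set-then-watch flow can be imitated without a browser. The sketch below uses `FakeChannel`, a hypothetical in-memory stand-in that only mimics the `watch(key, handler)` and `set(key, value)` calls used above; it is not the declarative widgets implementation. Setting `selectedKeyword` on one channel triggers a handler with `(old, new)` arguments, which in turn publishes table data on another channel.

```python
class FakeChannel:
    """Hypothetical stand-in for a channel: stores values, notifies watchers."""

    def __init__(self):
        self.values = {}
        self.handlers = {}

    def watch(self, key, handler):
        # Register a handler to be called as handler(old, new).
        self.handlers.setdefault(key, []).append(handler)

    def set(self, key, value):
        old = self.values.get(key)
        self.values[key] = value
        for handler in self.handlers.get(key, []):
            handler(old, value)

plot = FakeChannel()
table = FakeChannel()

def on_click(old, new):
    # Mirrors on_plot_click: ignore empty selections, then publish rows.
    if new is not None:
        table.set('data', {'columns': ['Title'], 'data': [[new + ' paper']]})

plot.watch('selectedKeyword', on_click)
plot.set('selectedKeyword', 'genomics')
print(table.values['data'])
```

In the notebook the first `set` happens in the browser, inside the Bokeh `CustomJS` callback, while the handler runs in the kernel.
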
#### Top-level function

The `run_all` function defined below executes the entire process of downloading documents, finding concepts, and plotting for a given query.

We will bind to `run_all` using an `urth-core-function`, such that the function is called when a user clicks the `Run` button.

The global variables are used to enable plot and data refreshing; the user can issue multiple queries without re-running the cell. Also note the calls to `channel('table').set('showTable', ...)`, which dynamically hide the table during processing:

In [None]:
concept_plot = None
relevance_plot = None
frequency_plot = None

def run_all(query, n=NUM_PAPERS):
    global entries
    global index
    global concept_plot
    global relevance_plot
    global frequency_plot
    
    channel('table').set('showTable', False)
    channel('table').set('query', query)
    entries = download_entries('all:{}'.format(query), n, use_cache=True)
    index = build_tf_idf_index(entries)
    cd = get_concepts(entries)
    kws = get_keywords(entries)
        
    concept_plot = plot_relevances(
        concepts_plot_data(cd), 
        kind="concept", 
        p=concept_plot)
    relevance_plot = plot_relevances(
        keywords_plot_data(kws), 
        kind="keyword", 
        n=15, 
        p=relevance_plot, 
        height=400)
    frequency_plot = plot_date_frequencies(
        entries, 
        p=frequency_plot)
    
    channel('status').set('status', '')
    channel('table').set('showTable', True)

#### Using the concept explorer

Type in a topic query, e.g. `bioinformatics`, and click `Run`. The most recent `NUM_PAPERS` papers will be downloaded from Arxiv and analyzed. 

The concepts and keywords will be shown in a bar chart; click on a concept / keyword or its corresponding bar to search for related papers, which will then be displayed in a table. 

Clicking on a table row will open the paper's Arxiv entry in a new tab.

In [None]:
%%html
<h2 style="text-align:center">Arxiv Concept Explorer</h2>

<div style="text-align:center">Discover concepts and trends 
    in recent research papers related to a topic</div>

<template is='urth-core-bind' channel='table'>
    <urth-core-function id="runall" ref="run_all" arg-query="{{query}}"></urth-core-function>
    <paper-input label="Topic" value="{{query}}"></paper-input>
    <paper-button raised onclick="runall.invoke()">Run</paper-button>
</template>


<template is='urth-core-bind' channel='status'>
    <template is="dom-if" if="{{status}}">
        <div>{{status}}</div>
        <paper-progress value="{{progress}}"></paper-progress>
    </template>
</template>

<template is='urth-core-bind' channel='table'>
    
    <template is="dom-if" if="{{showTable}}">
        <template is="dom-if" if="{{keyword}}">
            <p>Recent <strong><span>{{query}}</span>
                </strong> Papers Relevant to 
                <strong>"<span>{{keyword}}</span>"</strong> on Arxiv:</p>
            <p>Click a row to open the paper in a new tab</p>
            <urth-viz-table datarows="{{ data.data }}" selection="{{sel}}" columns="{{ data.columns }}">
                <urth-viz-col></urth-viz-col>
                <urth-viz-col></urth-viz-col>
                <urth-viz-col></urth-viz-col>
                <urth-viz-col></urth-viz-col>
                <urth-viz-col></urth-viz-col>
            </urth-viz-table>
            <script>
                $('urth-viz-table').on('selection-changed', function () {
                    var sel = $('urth-viz-table')[0].selection;
                    var urlColumn = 3;
                    var url = sel[urlColumn];
                    window.open(url, '_blank');
                });
            </script>
        </template>
    </template>
</template>