This notebook generates statistics on CHI 2019 papers. Given a list of URLs, it finds the number of figures, tables, citations, and downloads for each paper, along with the character count of the body text.

Import packages and load the list of URLs.

In [1]:
import requests
import pandas as pd
import re
from pyquery import PyQuery 

papers = pd.read_csv('../data/chi_2019_urls.txt', sep='\t')
urls = papers.url

Fetch the page content for each URL. The URLs point to the HTML full-text versions of papers in the CHI 2019 proceedings.

In [2]:
tables = []
figures = []
lengths = []
remaining_urls = []

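for url in urls:
    r = requests.get(url)
    if r.status_code != requests.codes.ok:
        # Report the failure and skip this paper rather than parsing an error page
        print(r.status_code, url)
        continue
    page = r.text
    current_url = r.url  # final URL after any redirects
    
    # Some papers don't have an HTML full-text version, so check the final URL before processing
    if 'fullHtml' in current_url:
        # Count distinct labels, so repeated mentions of e.g. "Figure 1" count once
        count_figures = len(set(re.findall(r'Figure [0-9]+', page)))
        count_tables = len(set(re.findall(r'Table [0-9]+', page)))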
        
        pq = PyQuery(page)
        body = pq('section.body p').text()

        lengths.append(len(body))
        figures.append(count_figures)
        tables.append(count_tables)
        remaining_urls.append(current_url)
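
The label-counting step in the loop above can be checked offline on a small HTML fragment. The following is a minimal sketch, not part of the original notebook: `count_labels` is a hypothetical helper name and the sample markup is invented for illustration. It counts distinct "Figure N" or "Table N" labels in the same way as the loop.

```python
import re

def count_labels(page, kind):
    """Count distinct '<kind> N' labels (e.g. 'Figure 3') in an HTML string."""
    # set() collapses repeated mentions of the same label into one
    return len(set(re.findall(rf'{re.escape(kind)} [0-9]+', page)))

# Invented sample markup: "Figure 1" is mentioned twice but counted once
sample = '<p>See Figure 1 and Figure 2.</p><p>Figure 1 shows a plot. Table 1 lists data.</p>'
print(count_labels(sample, 'Figure'))
print(count_labels(sample, 'Table'))
```

Because this approach counts labels mentioned in the page text, a figure or table that is never referred to by number would presumably be missed.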

Create a list of URLs that point to the summary page for each paper. Each summary page contains the abstract and, more pertinently, the download and citation counts.

In [3]:
# Dropping the '/fullHtml' path segment turns a full-text URL into its summary-page URL
summary_urls = [url.replace('/fullHtml', '') for url in remaining_urls]
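
To illustrate the URL rewrite, here is a small sketch on a placeholder address. The host and DOI below are invented (only the `/fullHtml` segment comes from the notebook), and `to_summary_url` is a hypothetical helper name.

```python
def to_summary_url(full_html_url):
    """Derive a summary-page URL by removing the '/fullHtml' path segment."""
    # count=1 limits the replacement to the first occurrence
    return full_html_url.replace('/fullHtml', '', 1)

# Placeholder URL, for illustration only
example = 'https://example.org/doi/fullHtml/10.0000/example'
print(to_summary_url(example))
```

A URL that does not contain `/fullHtml` is returned unchanged.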

Extract citation and download counts from summary pages.

In [4]:
citations = []
downloads = []

for url in summary_urls:
    r = requests.get(url)
    if r.status_code != requests.codes.ok:
        print(r.status_code)
    page = r.text
    
    # Isolate the citation-count element, then parse the number (dropping thousands separators)
    citation_text = re.findall(r'<div class="citation">Total Citations<span class="bold">[0-9,]+</span></div>', page)[0]
    count_citations = int(re.findall(r'[0-9,]+', citation_text)[0].replace(',', ''))
    
    # Same approach for the download count
    download_text = re.findall(r'<div class="metric">Total Downloads<span class="bold">[0-9,]+</span></div>', page)[0]
    count_downloads = int(re.findall(r'[0-9,]+', download_text)[0].replace(',', ''))
    
    citations.append(count_citations)
    downloads.append(count_downloads)
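
The loop above indexes `[0]` on the regex result, which raises an `IndexError` if a page lacks the expected markup. As a hedged alternative, the sketch below wraps the extraction in a hypothetical helper, `extract_count`, that returns `None` when the element is missing. The sample HTML mirrors the patterns used above but is otherwise invented, and the helper accepts any `class` value on the `<div>`, which is an assumption rather than the notebook's behavior.

```python
import re

def extract_count(page, label):
    """Return the integer shown next to `label` in a metric <div>, or None if absent."""
    pattern = (r'<div class="[^"]*">' + re.escape(label)
               + r'<span class="bold">([0-9,]+)</span></div>')
    match = re.search(pattern, page)
    if match is None:
        return None
    # Strip thousands separators before converting, e.g. '1,234' -> 1234
    return int(match.group(1).replace(',', ''))

# Invented sample markup modeled on the patterns in the cell above
sample = ('<div class="citation">Total Citations<span class="bold">12</span></div>'
          '<div class="metric">Total Downloads<span class="bold">1,234</span></div>')
print(extract_count(sample, 'Total Citations'))
print(extract_count(sample, 'Total Downloads'))
print(extract_count('<html></html>', 'Total Citations'))
```

Appending `None` for a missing value would keep the `citations` and `downloads` lists the same length as `summary_urls`, which matters when they are later combined into one table.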

Put everything into a DataFrame and export for analysis in R.

In [5]:
# Named `stats` rather than `dict` to avoid shadowing the built-in type
stats = {'url': summary_urls, 'characters': lengths, 'figures': figures, 'tables': tables, 'citations': citations, 'downloads': downloads}
df = pd.DataFrame(stats)
df.to_csv('../data/chi_2019_stats.tsv', sep='\t')