In [6]:
import json


with open('science-parse-test/parses/2019.json') as f:
    doc = json.load(f)

In [27]:
doc_data = doc['metadata']
len(doc_data['references'])
len(doc_data['referenceMentions'])


72

94

In [28]:
doc_data.keys()
doc_data



dict_keys(['source', 'title', 'authors', 'emails', 'sections', 'references', 'referenceMentions', 'year', 'abstractText', 'creator'])

{'source': 'CRF',
 'title': 'Extending Machine Language Models toward Human-Level Language Understanding',
 'authors': ['James L. McClellanda',
  'Felix Hillb',
  'Maja Rudolphc',
  'Jason Baldridged',
  'Hinrich Schützee'],
 'emails': [],
 'sections': [{'heading': 'Extending Machine Language Models toward Human-Level Language Understanding',
   'text': 'James L. McClellanda,b,2, Felix Hillb,2, Maja Rudolphc,2, Jason Baldridged,1,2, and Hinrich Schützee,1,2\naStanford University, Stanford, CA 94305, USA; bDeepMind, London N1C 4AG, UK; cBosch Center for Artificial Intelligence, Renningen, 71272, Germany; dGoogle Research, Austin, TX 78701, USA; eLMU Munich, Munich, 80538, Germany\nThis manuscript was compiled on December 13, 2019\nLanguage is central to human intelligence. We review recent breakthroughs in machine language processing and consider what remains to be achieved. Recent approaches rely on domain general principles of learning and representation captured in artificial neural 

In [30]:
[item['heading'] for item in doc_data['sections']]

['Extending Machine Language Models toward Human-Level Language Understanding',
 'Language Understanding | Natural Language Processing | Situation',
 'Models | Machine Language Models | Brain System for Understanding',
 'Principles of Neural Computation',
 'Neural Language Modeling',
 'Scaling Up to Process Natural Text. Elman’s task—predicting',
 'The Human Integrated Understanding System (IUS)',
 'Toward an Artificial Integrated Understanding System']

In [18]:
doc_data['references'][0]

doc_data['referenceMentions'][0]

{'title': '2016) Google’s neural machine translation system: Bridging the gap between human and machine',
 'author': ['Y Wu'],
 'venue': None,
 'citeRegEx': '1',
 'shortCiteRegEx': '1',
 'year': 2016}

{'referenceID': 0,
 'context': 'More impressive still is modern machine translation (1).',
 'startOffset': 52,
 'endOffset': 55}

In [24]:
# referenceMentions have IDs, but these don't match the citation numbers used in the text:
# referenceID 0 appears to be the zero-based index of the reference cited as "(1)"
[item for item in doc_data['referenceMentions'] if item['referenceID'] == 0 ]

[{'referenceID': 0,
  'context': 'More impressive still is modern machine translation (1).',
  'startOffset': 52,
  'endOffset': 55},
 {'referenceID': 0,
  'context': 'This work demonstrates that (1) we understand and remember texts better when we can relate the statements in the text to a familiar situation; (2) information that conveys aspects of the situation can be provided by a picture accompanying the text; (3) the characteristics of the objects we remember depend on the situations in which they occurred in a text; (4) we represent in memory objects not explicitly mentioned in texts; and (5) after hearing a sentence describing spatial or conceptual relationships among objects, we retain memory for these relationships rather than the linguistic input.',
  'startOffset': 28,
  'endOffset': 31},
 {'referenceID': 0,
  'context': 'Journal of verbal learning and verbal behavior 20(1):120–136.',
  'startOffset': 49,
  'endOffset': 52},
 {'referenceID': 0,
  'context': 'Artificial Intell
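
To check what each mention actually points at, the `startOffset`/`endOffset` pair can be used to slice the marker out of `context`. The sketch below is a minimal illustration on three mentions copied from the output above (the second context is shortened). `mention_markers` and the digit-before-marker check are hypothetical helpers, not part of Science Parse, and the check is only a rough heuristic for issue numbers such as `20(1)`.

```python
from collections import defaultdict

# Three mentions copied from the output above (second context shortened).
sample_mentions = [
    {'referenceID': 0,
     'context': 'More impressive still is modern machine translation (1).',
     'startOffset': 52, 'endOffset': 55},
    {'referenceID': 0,
     'context': 'This work demonstrates that (1) we understand and remember texts better',
     'startOffset': 28, 'endOffset': 31},
    {'referenceID': 0,
     'context': 'Journal of verbal learning and verbal behavior 20(1):120–136.',
     'startOffset': 49, 'endOffset': 52},
]


def mention_markers(mentions):
    """Group (marker, after_digit) pairs by referenceID (hypothetical helper)."""
    by_ref = defaultdict(list)
    for m in mentions:
        ctx, start, end = m['context'], m['startOffset'], m['endOffset']
        marker = ctx[start:end]
        # Assumption: a marker glued to a preceding digit is a volume/issue number.
        after_digit = start > 0 and ctx[start - 1].isdigit()
        by_ref[m['referenceID']].append((marker, after_digit))
    return dict(by_ref)


markers = mention_markers(sample_mentions)
print(markers)
```

Only the third mention is flagged here; the enumeration in the second one would need a different check.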


## Bulk processing

```bash
time java -jar cli/target/scala-2.12/science-parse-cli-assembly-3.0.1.jar ../../sample_pdfs/ -o ../parses/ -f sample-pdfs.jsonl
```
(3 min for 36 papers)

In [34]:
import os, json

DIR = 'science-parse-test/parses'

articles = {}
for file in os.listdir(DIR):
    with open(os.path.join(DIR, file)) as f:
        articles[file] = json.load(f)

In [43]:
# headings

for name, article in articles.items():
    data = article['metadata']
    print(name)
    for section in (data['sections'] or []):
        print('\t' + (section['heading'] or ''))
    print()

2019.pdf.json
	Extending Machine Language Models toward Human-Level Language Understanding
	Language Understanding | Natural Language Processing | Situation
	Models | Machine Language Models | Brain System for Understanding
	Principles of Neural Computation
	Neural Language Modeling
	Scaling Up to Process Natural Text. Elman’s task—predicting
	The Human Integrated Understanding System (IUS)
	Toward an Artificial Integrated Understanding System

Multi agent cooperation and emergence of natujral langauge.pdf.json
	1 INTRODUCTION
	2 GENERAL FRAMEWORK
	3 EXPERIMENTAL SETUP
	4 LEARNING TO COMMUNICATE
	4.1 OBJECT-LEVEL REFERENCE
	5 GROUNDING AGENTS’ COMMUNICATION IN HUMAN LANGUAGE
	6 DISCUSSION
	ACKNOWLEDGMENTS

savage-rumbaugh1980.pdf.json

Exploiting Deep Semantics and Compositionality of Natural Language.pdf.json
	
	II. PRELIMINARIES AND RELATED WORK
	A. What Makes Natural Language Understanding Hard
	B. Embodied Construction Grammar and Compositionality
	C. A Brief Survey on NLU for Robo
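
Some parses come back with no sections at all (see `savage-rumbaugh1980.pdf.json` above), so a per-article count of what was extracted can help spot failed parses quickly. The following is a minimal sketch, assuming every field under `metadata` may be missing or `None`; `summarize_parses` and the toy input are hypothetical and only mimic the structure shown above.

```python
def summarize_parses(articles):
    """Count extracted sections, headings, references and mentions per article."""
    summary = {}
    for name, article in articles.items():
        data = article.get('metadata') or {}
        sections = data.get('sections') or []
        summary[name] = {
            'sections': len(sections),
            # headings can be None or empty, so count only the non-empty ones
            'headings': sum(1 for s in sections if s.get('heading')),
            'references': len(data.get('references') or []),
            'mentions': len(data.get('referenceMentions') or []),
        }
    return summary


# Toy input with the same shape as the parsed JSON (values are made up).
toy_articles = {
    'ok.pdf.json': {'metadata': {
        'sections': [{'heading': '1 INTRODUCTION', 'text': '...'},
                     {'heading': None, 'text': '...'}],
        'references': [{'title': 'A'}],
        'referenceMentions': [{'referenceID': 0}],
    }},
    'empty.pdf.json': {'metadata': {
        'sections': None, 'references': [], 'referenceMentions': None,
    }},
}

toy_summary = summarize_parses(toy_articles)
for name, counts in toy_summary.items():
    print(name, counts)
```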

In [79]:
# check how often figure captions are incorporated into main text...
import re
from collections import defaultdict

fig_counts = defaultdict(int)
for name, article in articles.items():
    data = article['metadata']
#     print(name)
    for section in (data['sections'] or []):
        text = section.get('text') or ''

        # a caption is counted only when "fig"/"Fig" starts the text or follows a period plus whitespace
        m = re.findall(r'(^|\.\s)[fF]ig(\.|ure)?[0-9\s]', text)
        if len(m) > 0:
#             print(name)
#             print(text)
            fig_counts[name] += len(m)

print(json.dumps(fig_counts, indent=2))
    

{
  "2019.pdf.json": 1,
  "Multi agent cooperation and emergence of natujral langauge.pdf.json": 4,
  "Exploiting Deep Semantics and Compositionality of Natural Language.pdf.json": 1,
  "Microblogging_during_two_natural_hazards_events_Wh.pdf.json": 1,
  "Faking Sandy: Characterizing and Identifying Fake Images.pdf.json": 1,
  "Natural Language Does Not Emerge \u2018Naturally\u2019 in Multi-Agent Dialog.pdf.json": 2,
  "ghosh1999.pdf.json": 2,
  "Reference-Aware Language Models.pdf.json": 1,
  "Emergence of linguistic conventions in multi-agent reinforcement.pdf.json": 1,
  "2000,Beer.pdf.json": 1,
  "a2-gupta.pdf.json": 4,
  "fpsyg-02-00355.pdf.json": 2,
  "On_the_continuity_of_mind_Toward_a_dynam.pdf.json": 15,
  "Capacity, Bandwidth, and Compositionality in Emergent Language Learning.pdf.json": 1,
  "dzeroski2001-Relational_Reinforcement_Learning.pdf.json": 6,
  "Symbol Emergence in Cognitive Developmental.pdf.json": 4
}
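
To make clear what the caption heuristic does and does not count, here is a small sketch that applies the same pattern to a few made-up strings. `count_caption_starts` is a hypothetical helper; it uses non-capturing groups instead of the capturing ones in the cell above, which does not change the number of matches.

```python
import re

# Same pattern as above, with non-capturing groups.
CAPTION_RE = re.compile(r'(?:^|\.\s)[fF]ig(?:\.|ure)?[0-9\s]')


def count_caption_starts(text):
    """Count 'Fig'/'Figure' occurrences that start the text or follow '. '."""
    return len(CAPTION_RE.findall(text or ''))


# Made-up examples, not taken from the parsed papers.
examples = [
    'Figure 1: Overview.',                          # starts the text
    'as shown in Fig. 2',                           # mid-sentence reference
    'The end. Fig. 3 shows x. Figure 4 shows y.',   # two sentence-initial uses
    'Figurative language',                          # not a figure at all
]
caption_counts = [count_caption_starts(t) for t in examples]
print(caption_counts)
```

Mid-sentence references such as "in Fig. 2" are deliberately ignored, so the counts above are a lower bound on figure-related text rather than a count of all figure mentions.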


Observations on 36 sample PDFs:

#### PROs
- references at the end are parsed
- one-sentence reference mention contexts are provided
- many figures and side notes

#### CONs
- additional text such as footnotes or figure captions is either incorporated into the main text or discarded
- no information about text size etc.
- long application startup (without a running server): processing a single paper takes 102 s