# Papers with Code ML papers dataset

In [1]:
import sys
sys.path.append("/home/ubuntu/github/mkardas/paper-extractor")

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from sota_extractor2.data.paper_collection import PaperCollection
from pathlib import Path

DATA_PATH = Path("/home/ubuntu/pwc/arxiv-s3/arxiv")
PICKLE_PATH = Path("/home/ubuntu/pwc/pc-pickle.pkl")
#DATA_PATH = Path("/home/ubuntu/pwc/arxiv-pwc/arxiv")
#PICKLE_PATH = Path("/home/ubuntu/pwc/pc-pickle-fast.pkl")

## Dataset
The dataset was created by parsing 75K arXiv papers related to machine learning. Because of parsing errors, texts and tables could be extracted from only 56K of them.
```
.
└── arxiv
    ├── texts
    │   └── 0709
    │       ├── 0709.1667.json
    │       ...
    │   ...
    ├── tables
    │   └── 0709
    │       ├── 0709.1667
    │       │   ├── metadata.json
    │       │   ├── table_01.csv
    │       │   ...
    │       ...
    │   ...
    └── structure-annotations.json
```

The `texts` directory contains `.json` files with each paper's content organized into sections. Each `metadata.json` lists the tables found in a given paper together with their captions, and each `table_xx.csv` holds the data of one table (nested tables are flattened). We provide a simple API to load and access the dataset. Because of the large number of papers, it is recommended to load the dataset in parallel (by default, the number of processes equals the number of CPU cores) and store it in a pickle file. Set `jobs=1` to disable multiprocessing.
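The directory layout described above can also be inspected without the API, for example to check which papers have both a text file and tables. The following is a minimal sketch using only the standard library; `index_dataset` is a hypothetical helper name, not part of `sota_extractor2`, and it relies only on the file naming shown in the tree.

```python
from pathlib import Path


def index_dataset(root):
    """Map paper id -> paths of its text file and table CSVs (either may be missing).

    Hypothetical helper: assumes the layout texts/<yymm>/<id>.json and
    tables/<yymm>/<id>/table_xx.csv shown in the tree.
    """
    root = Path(root)
    index = {}
    # "0709.1667.json" -> stem "0709.1667" (only the last suffix is dropped)
    for text_file in sorted(root.glob("texts/*/*.json")):
        entry = index.setdefault(text_file.stem, {"text": None, "tables": []})
        entry["text"] = text_file
    # The table directory is named after the paper id; use .name, not .stem,
    # because .stem would cut "0709.1667" down to "0709".
    for csv_file in sorted(root.glob("tables/*/*/table_*.csv")):
        entry = index.setdefault(csv_file.parent.name, {"text": None, "tables": []})
        entry["tables"].append(csv_file)
    return index
```

Calling `index_dataset(DATA_PATH)` would return one entry per paper id; the contents of the `.json` and `metadata.json` files are not parsed here, since their schema is not described in this notebook.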

In [4]:
%time pc = PaperCollection.from_files(DATA_PATH)

CPU times: user 3min 10s, sys: 10.4 s, total: 3min 20s
Wall time: 7min 16s


In [5]:
pc.to_pickle(PICKLE_PATH)

In [6]:
%time pc = PaperCollection.from_pickle(PICKLE_PATH)

CPU times: user 3.48 s, sys: 144 ms, total: 3.63 s
Wall time: 3.58 s
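Given the gap between the two timings, the parse-once-then-reload pattern can be wrapped in a small helper. This is only a sketch: `load_or_build` is a hypothetical name, and the loader, builder, and saver are passed in as callables so that the helper makes no assumptions about the library beyond the calls already shown in the cells above.

```python
from pathlib import Path


def load_or_build(cache_path, build, load, save):
    """Return the cached object if cache_path exists; otherwise build and cache it.

    Hypothetical helper: `build()` creates the object, `load(path)` reads it
    back, and `save(obj, path)` writes it.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return load(cache_path)
    obj = build()
    save(obj, cache_path)
    return obj


# Possible usage with the calls from the cells above:
# pc = load_or_build(
#     PICKLE_PATH,
#     build=lambda: PaperCollection.from_files(DATA_PATH),
#     load=PaperCollection.from_pickle,
#     save=lambda collection, path: collection.to_pickle(path),
# )
```

The helper does not detect a stale cache: if the files under `DATA_PATH` change, the pickle has to be deleted manually.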


`PaperCollection` is a wrapper around a `list` of papers with additional convenience functions.

In [7]:
len(pc)

56696

## Tables
Each `Paper` contains `text` and `tables` fields. Tables can be displayed with color-coded labels.

In [7]:
paper = pc.get_by_id('1607.04315')
table = paper.tables[0]
table.display()

0,1,2,3,4
Model,d,|θ|M,Train,Test
Classifier with handcrafted features [12],-,-,99.7,78.2
LSTM encoders [12],300,3.0M,83.9,80.6
Dependency Tree CNN encoders [13],300,3.5M,83.3,82.1
SPINN-PI encoders [14],300,3.7M,89.2,83.2
NSE,300,3.4M,86.2,84.6
MMA-NSE,300,6.3M,87.1,84.8
LSTM attention [15],100,242K,85.4,82.3
LSTM word-by-word attention [15],100,252K,85.3,83.5
MMA-NSE attention,300,6.5M,86.9,85.4


A table's data is stored in its `.df` attribute, a pandas `DataFrame`. Each cell holds its content (`value`), its annotated `gold_tags`, and references (`refs`) to other papers. Most references were normalized across all papers.

In [8]:
table.df.iloc[4,0]

Cell(value='SPINN-PI encoders [14]', gold_tags='model-competing', refs=['xxref-XBowmanGRGMP16'])
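To pick out cells by annotation, one can filter on `gold_tags`. The sketch below uses a stand-in `Cell` that only mirrors the three fields visible in the repr above; it is not the library's class. It also assumes that `gold_tags` is a string that may hold several whitespace-separated tags, which this notebook does not confirm. `cells_with_tag` is a hypothetical helper name.

```python
from typing import Iterable, List, NamedTuple


class Cell(NamedTuple):
    """Stand-in mirroring the fields in the repr above; not the library's class."""
    value: str
    gold_tags: str
    refs: List[str]


def cells_with_tag(rows: Iterable[Iterable[Cell]], tag: str) -> List[Cell]:
    """Collect cells whose gold_tags include `tag`.

    Assumption: gold_tags holds zero or more whitespace-separated tags.
    """
    return [cell for row in rows for cell in row if tag in cell.gold_tags.split()]


# Illustrative row: the first cell is copied from the output above,
# the second cell's empty tag is made up for the example.
rows = [
    [
        Cell("SPINN-PI encoders [14]", "model-competing", ["xxref-XBowmanGRGMP16"]),
        Cell("300", "", []),
    ],
]
competing = cells_with_tag(rows, "model-competing")
print([cell.value for cell in competing])
```

If the `DataFrame` stores such cell objects directly, its rows could be passed as `table.df.itertuples(index=False, name=None)`.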

Additionally, each table has `gold_tags` describing the content of the table.

In [8]:
table.gold_tags

NameError: name 'table' is not defined

## Text Content
Papers' content is represented using Elasticsearch document classes (so it can easily be saved to an existing Elasticsearch instance with `save()`). Each `text` contains `title`, `abstract`, and `authors`. A paper's text is split into `fragments`.

In [10]:
paper.text.abstract

'Abstract We present a memory augmented neural network for natural language understanding: Neural Semantic Encoders. NSE is equipped with a novel memory update rule and has a variable sized encoding memory that evolves over time and maintains the understanding of input sequences through read , compose and write operations. NSE can also access 1 xxanchor-x1-2f1 multiple and shared memories. In this paper, we demonstrated the effectiveness and the flexibility of NSE on five different natural language tasks: natural language inference, question answering, sentence classification, document sentiment analysis and machine translation where NSE achieved state-of-the-art performance when evaluated on publically available benchmarks. For example, our shared-memory model showed an encouraging result on neural machine translation, improving an attention-based baseline by approximately 1.0 BLEU.'

In [11]:
paper.text.print_toc()

1 xxanchor-x1-10001 Introduction

2 xxanchor-x1-20002 Related Work

3 xxanchor-x1-30003 Proposed Approach

3.1 xxanchor-x1-40003.1 Read, Compose and Write

3.2 xxanchor-x1-50003.2 Shared and Multiple Memory Accesses

4 xxanchor-x1-60004 Experiments

4.1 xxanchor-x1-70004.1 Natural Language Inference

4.2 xxanchor-x1-80004.2 Answer Sentence Selection

4.3 xxanchor-x1-90004.3 Sentence Classification

4.4 xxanchor-x1-100004.4 Document Sentiment Analysis

4.5 xxanchor-x1-110004.5 Machine Translation

5.1 xxanchor-x1-130005.1 Memory Access and Compositionality

6 xxanchor-x1-140006 Conclusion

xxanchor-x1-150006 Acknowledgments

xxanchor-x1-160006 References

A xxanchor-x1-17000A Step-by-step visualization of memory states in NSE
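The headings printed above still carry `xxanchor-...` markers from the extraction. For display purposes they can be stripped with a regular expression. This is a sketch that assumes every marker is a single whitespace-free token starting with `xxanchor-`, as in the output above; `strip_anchors` is a hypothetical helper, not part of the library.

```python
import re

# One anchor token plus the whitespace before it, e.g. " xxanchor-x1-40003.1"
ANCHOR_RE = re.compile(r"\s*xxanchor-\S+")


def strip_anchors(heading: str) -> str:
    """Remove xxanchor-... tokens from a heading or sentence."""
    return ANCHOR_RE.sub("", heading).strip()


print(strip_anchors("3.1 xxanchor-x1-40003.1 Read, Compose and Write"))
```

The same function also works on running text such as the abstract, where an anchor token appears mid-sentence.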

In [12]:
paper.text.fragments[1]

# 1 xxanchor-x1-10001 Introduction,
Recently several studies have explored ways of extending the neural networks with an external memory [ xxref-Xgraves2014neural – xxref-Xgrefenstette2015learning ]. Unlike LSTM, the short term memories and the training parameters of such a neural network are no longer coupled and can be adapted. In this paper we propose a novel class of memory augmented neural networks called Neural Semantic Encoders (NSE) for natural language understanding. NSE offers several desirable properties. NSE has a variable sized encoding memory which allows the model to access entire input sequence during the reading process; therefore efficiently delivering long-term dependencies over time. The encoding memory evolves over time and maintains the memory of the input sequence through read , compose and write operations. NSE sequentially processes the input and supports word compositionality inheriting both temporal and hierarchical nature of human language. NSE can read from

In [None]:
paper.text.print_section("Machine Translation")